Contents
  1. The Confusion Matrix
  2. Core Metrics
  3. When to Use Which
  4. ROC Curve
  5. AUC: Area Under the Curve
  6. What You Can Do Now

Classification Metrics: Precision, Recall, F1, and ROC

The confusion matrix is the foundation of every classification metric. Precision measures how often you are right when you say yes. Recall measures how often you catch what you should. F1 balances them. ROC/AUC summarises performance across all thresholds.

The Confusion Matrix

Every classification prediction falls into one of four cells:

                   Predicted Positive     Predicted Negative
Actual Positive    TP (True Positive)     FN (False Negative)
Actual Negative    FP (False Positive)    TN (True Negative)

  • TP: correctly predicted positive
  • FP: predicted positive, actually negative (false alarm)
  • FN: predicted negative, actually positive (missed detection)
  • TN: correctly predicted negative

Total positives = TP + FN. Total negatives = FP + TN. Total retrieved (predicted positive) = TP + FP.

Core Metrics

Recall (True Positive Rate): of all actual positives, how many did you catch?

$$\text{Recall} = \frac{TP}{TP + FN} = \frac{TP}{\text{Total Positives}}$$

Precision: of all the positives you predicted, how many were correct?

$$\text{Precision} = \frac{TP}{TP + FP} = \frac{TP}{\text{Total Retrieved}}$$

F1 Score: harmonic mean of precision and recall:

$$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

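The harmonic mean penalises imbalance between the two: it is pulled towards the smaller value, so a high score on one metric cannot compensate for a low score on the other. An F1 of 0.9 requires both precision and recall to be at least about 0.82, which is the value one of them must reach when the other is a perfect 1.0.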
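To make the penalty concrete, here is a small sketch that compares the arithmetic mean with F1 for a few precision and recall pairs. The helper name harmonic_f1 and the three pairs are made up for illustration; they are not measurements from a real model.

```python
def harmonic_f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative (precision, recall) pairs, chosen only to show the effect.
pairs = [(0.9, 0.9), (1.0, 0.5), (1.0, 0.1)]

for p, r in pairs:
    arithmetic = (p + r) / 2
    print(f"precision={p:.2f} recall={r:.2f} "
          f"arithmetic={arithmetic:.3f} F1={harmonic_f1(p, r):.3f}")
```

For the last pair the arithmetic mean is still 0.55, while F1 drops to roughly 0.18 because the recall of 0.1 dominates.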

Specificity: of all actual negatives, how many did you correctly reject?

$$\text{Specificity} = \frac{TN}{TN + FP}$$

False Positive Rate: the complement of specificity:

$$\text{FPR} = 1 - \text{Specificity} = \frac{FP}{TN + FP}$$
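As a worked example of the formulas above, the sketch below turns four confusion-matrix counts into all five metrics. The function name rates_from_counts and the counts themselves are hypothetical, and returning 0.0 for a zero denominator is a convention chosen here, not the only possible one.

```python
def rates_from_counts(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute the metrics defined above; a zero denominator yields 0.0."""
    def ratio(num, den):
        return num / den if den > 0 else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "specificity": ratio(tn, tn + fp),
        "fpr": ratio(fp, tn + fp),
    }


# Hypothetical counts: 1,000 examples, 60 of them actual positives.
m = rates_from_counts(tp=40, fp=10, fn=20, tn=930)
for name, value in m.items():
    print(f"{name:12s}{value:.3f}")
```

With these counts precision is 0.800 and recall is about 0.667, and specificity and FPR sum to 1, as the last formula requires.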

When to Use Which

Metric         Use when
Recall         Missing a positive is costly, e.g. disease detection, fraud, safety systems
Precision      False alarms are costly, e.g. spam filters, content moderation
F1             You need to balance both, e.g. imbalanced classes, no clear priority
Specificity    You care specifically about negative detection

ROC Curve

The ROC (Receiver Operating Characteristic) curve plots TPR (recall) vs FPR as the classification threshold varies from 0 to 1.

  • Perfect classifier: reaches the top-left corner (TPR = 1, FPR = 0) immediately
  • Random classifier: falls on the diagonal (TPR = FPR for all thresholds)
  • Worse than random: below the diagonal
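As a rough sketch of how the curve is traced, the snippet below sweeps the threshold over every distinct score and records one (FPR, TPR) point per threshold. The function name roc_points and the six labels and scores are made up for illustration, and the code assumes both classes are present in y_true.

```python
def roc_points(y_true, y_score):
    """Return (FPR, TPR) pairs, one per distinct threshold, strictest first."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # A threshold above every score predicts nothing positive: the origin.
    points = [(0.0, 0.0)]
    for t in sorted(set(y_score), reverse=True):
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Illustrative labels and scores, invented for this sketch.
y_true = [1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

for fpr, tpr in roc_points(y_true, y_score):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the threshold can only add predicted positives, so the points move up and to the right, from (0, 0) to (1, 1).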

AUC: Area Under the Curve

AUC summarises the entire ROC curve as a single number:

  • Perfect classifier: AUC = 1.0
  • Random classifier: AUC = 0.5 (the ROC curve falls on the diagonal)
  • Worse than random: AUC < 0.5 (the ROC curve lies below the diagonal)

AUC is threshold-independent, which makes it useful when you have not yet chosen an operating threshold. It also has a ranking interpretation: a classifier with AUC = 0.9 scores a randomly chosen positive above a randomly chosen negative 90% of the time.
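The ranking interpretation can be checked directly by counting pairs. The sketch below is a brute-force illustration, not how libraries compute AUC in practice: it compares every positive with every negative, so it is only suitable for small inputs. The name auc_by_ranking and the data are hypothetical, and counting ties as half a win is the usual convention assumed here.

```python
def auc_by_ranking(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Illustrative labels and scores, invented for this sketch.
y_true = [1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

print(f"AUC = {auc_by_ranking(y_true, y_score):.3f}")
```

Here 8 of the 9 positive-negative pairs are ordered correctly, giving an AUC of about 0.889; the only miss is the positive scored 0.6 against the negative scored 0.7.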

What You Can Do Now

The code below computes precision, recall, F1, and FPR from scratch using only numpy, then gets AUC from sklearn's roc_auc_score. Working from the raw counts shows exactly what each metric measures.

import numpy as np
from sklearn.metrics import roc_auc_score

# Simulate a classifier: true labels and predicted probabilities
# (fixed arrays, so no random seed is needed)
y_true  = np.array([1,1,1,1,1,1,0,0,0,0,0,0,0,0,0])
y_score = np.array([0.9,0.8,0.75,0.6,0.55,0.3,0.85,0.7,0.5,0.4,0.35,0.25,0.2,0.1,0.05])

threshold = 0.5
y_pred = (y_score >= threshold).astype(int)

# Confusion matrix components
TP = np.sum((y_pred == 1) & (y_true == 1))
FP = np.sum((y_pred == 1) & (y_true == 0))
FN = np.sum((y_pred == 0) & (y_true == 1))
TN = np.sum((y_pred == 0) & (y_true == 0))

precision = TP / (TP + FP) if (TP + FP) > 0 else 0
recall    = TP / (TP + FN) if (TP + FN) > 0 else 0
f1        = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
fpr_val   = FP / (FP + TN) if (FP + TN) > 0 else 0

print(f"Confusion matrix: TP={TP} FP={FP} FN={FN} TN={TN}")
print(f"Precision:  {precision:.3f}")
print(f"Recall:     {recall:.3f}")
print(f"F1:         {f1:.3f}")
print(f"FPR:        {fpr_val:.3f}")

auc = roc_auc_score(y_true, y_score)
print(f"AUC:        {auc:.3f}")

# Show how precision/recall shift as threshold changes
print("\nThreshold sweep:")
for t in [0.3, 0.5, 0.7, 0.9]:
    yp = (y_score >= t).astype(int)
    tp = np.sum((yp == 1) & (y_true == 1))
    fp = np.sum((yp == 1) & (y_true == 0))
    fn = np.sum((yp == 0) & (y_true == 1))
    p = tp / (tp + fp) if tp + fp > 0 else 0
    r = tp / (tp + fn) if tp + fn > 0 else 0
    print(f"  t={t}: precision={p:.2f}, recall={r:.2f}")

Change the threshold to observe the precision-recall trade-off directly: a lower threshold catches more positives (higher recall), usually at the cost of more false alarms (lower precision). To evaluate a real model, replace y_score with the positive-class column of any sklearn classifier's predict_proba output, for example predict_proba(X)[:, 1] for a feature matrix X.
