Classification Metrics: Precision, Recall, F1, and ROC

The Confusion Matrix

Every classification prediction falls into one of four cells:

	Predicted Positive	Predicted Negative
Actual Positive	TP (True Positive)	FN (False Negative)
Actual Negative	FP (False Positive)	TN (True Negative)

TP: correctly predicted positive
FP: predicted positive, actually negative (false alarm)
FN: predicted negative, actually positive (missed detection)
TN: correctly predicted negative

Total positives = $TP + FN$. Total negatives = $FP + TN$. Total retrieved (predicted positive) = $TP + FP$.

Core Metrics

Recall (True Positive Rate): of all actual positives, how many did you catch?

$$\text{Recall} = \frac{TP}{TP + FN} = \frac{TP}{\text{Total Positives}}$$

Precision: of all the positives you predicted, how many were correct?

$$\text{Precision} = \frac{TP}{TP + FP} = \frac{TP}{\text{Total Retrieved}}$$

F1 Score: harmonic mean of precision and recall:

$$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

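The harmonic mean is pulled toward the smaller of its two inputs, so a model cannot reach a high F1 by excelling at only one of them. An F1 of 0.9, for example, requires both precision and recall to be at least about 0.82, which is the value one must take when the other is 1.0.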
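To make the penalty concrete, the sketch below compares the arithmetic mean with F1 for three precision/recall pairs that share the same arithmetic mean. The helper name `f1_from` and the example pairs are illustrative assumptions, not part of any library.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (defined as 0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical (precision, recall) pairs, each with an arithmetic mean of 0.55
pairs = [(0.55, 0.55), (0.8, 0.3), (1.0, 0.1)]
for p, r in pairs:
    arithmetic = (p + r) / 2
    print(f"P={p:.2f} R={r:.2f}  arithmetic mean={arithmetic:.2f}  F1={f1_from(p, r):.3f}")
```

The arithmetic mean stays at 0.55 for every pair, while F1 drops from 0.550 to 0.436 to 0.182 as the two values move apart.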

Specificity: of all actual negatives, how many did you correctly reject?

$$\text{Specificity} = \frac{TN}{TN + FP}$$

False Positive Rate: the complement of specificity:

$$\text{FPR} = 1 - \text{Specificity} = \frac{FP}{TN + FP}$$
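As a worked example, take the hypothetical counts $TP = 5$, $FP = 3$, $FN = 1$, $TN = 6$ (chosen purely for illustration). Substituting them into the definitions above gives:

$$
\begin{aligned}
\text{Recall} &= \frac{5}{5 + 1} \approx 0.833 \\
\text{Precision} &= \frac{5}{5 + 3} = 0.625 \\
F_1 &= 2 \times \frac{0.625 \times 0.833}{0.625 + 0.833} \approx 0.714 \\
\text{Specificity} &= \frac{6}{6 + 3} \approx 0.667 \\
\text{FPR} &= \frac{3}{6 + 3} \approx 0.333
\end{aligned}
$$

Recall is high because only one of the six actual positives was missed, while precision is lower because three of the eight predicted positives were false alarms.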

When to Use Which

Metric	Use when
Recall	Missing a positive is costly, e.g. disease detection, fraud, safety systems
Precision	False alarms are costly, e.g. spam filters, content moderation
F1	You need to balance both, e.g. imbalanced classes, no clear priority
Specificity	Correctly rejecting negatives matters in its own right, i.e. you need a low false positive rate

ROC Curve

The ROC (Receiver Operating Characteristic) curve plots TPR (recall) vs FPR as the classification threshold varies from 0 to 1.

Perfect classifier: reaches the top-left corner (TPR = 1, FPR = 0) immediately
Random classifier: falls on the diagonal (TPR = FPR for all thresholds)
Worse than random: below the diagonal
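The curve can be traced by hand: treat each distinct score as a threshold, count true and false positives at that threshold, and record the resulting (FPR, TPR) point. The sketch below does this in plain Python. The function name `roc_points` and the toy labels and scores are assumptions made for illustration, and the code assumes both classes are present.

```python
def roc_points(y_true, y_score):
    """Return (FPR, TPR) pairs, one per distinct threshold, strictest first."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is predicted positive
    for t in sorted(set(y_score), reverse=True):
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

# Hypothetical labels and scores, for illustration only
y_true = [1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

for fpr, tpr in roc_points(y_true, y_score):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the threshold can only add predicted positives, so both coordinates are non-decreasing and the curve always runs from (0, 0) to (1, 1).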

AUC: Area Under the Curve

AUC summarises the entire ROC curve as a single number:

Perfect classifier: $\text{AUC} = 1.0$
Random classifier: $\text{AUC} = 0.5$ (the curve lies on the diagonal)
Worse than random: $\text{AUC} < 0.5$ (the curve lies below the diagonal; inverting the scores would give $1 - \text{AUC}$)

AUC is threshold-independent, which makes it useful when you have not yet chosen an operating threshold. It also has a ranking interpretation: a classifier with AUC = 0.9 scores a randomly chosen positive above a randomly chosen negative 90% of the time.
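The ranking interpretation can be checked directly by comparing every positive score with every negative score. The sketch below is a minimal pure-Python version. The function name `auc_by_ranking` and the toy data are assumptions for illustration, and counting ties as half a win is a common convention that is assumed here.

```python
def auc_by_ranking(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and scores, for illustration only
y_true = [1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

print(f"AUC (pairwise ranking): {auc_by_ranking(y_true, y_score):.3f}")
```

Here 8 of the 9 positive/negative pairs are ordered correctly, so the result is 0.889. The pairwise loop is quadratic in the number of samples, so it suits small examples rather than large datasets.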

What You Can Do Now

The code below computes precision, recall, F1, and FPR from scratch using only numpy, then obtains AUC from sklearn's roc_auc_score. Working through the counts by hand is useful for understanding exactly what each metric measures.

import numpy as np
from sklearn.metrics import roc_auc_score

# Simulate a classifier: true labels and predicted probabilities
# (the values are fixed, so no random seed is needed)
y_true  = np.array([1,1,1,1,1,1,0,0,0,0,0,0,0,0,0])
y_score = np.array([0.9,0.8,0.75,0.6,0.55,0.3,0.85,0.7,0.5,0.4,0.35,0.25,0.2,0.1,0.05])

threshold = 0.5
y_pred = (y_score >= threshold).astype(int)

# Confusion matrix components
TP = np.sum((y_pred == 1) & (y_true == 1))
FP = np.sum((y_pred == 1) & (y_true == 0))
FN = np.sum((y_pred == 0) & (y_true == 1))
TN = np.sum((y_pred == 0) & (y_true == 0))

precision = TP / (TP + FP) if (TP + FP) > 0 else 0
recall    = TP / (TP + FN) if (TP + FN) > 0 else 0
f1        = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
fpr_val   = FP / (FP + TN) if (FP + TN) > 0 else 0

print(f"Confusion matrix: TP={TP} FP={FP} FN={FN} TN={TN}")
print(f"Precision:  {precision:.3f}")
print(f"Recall:     {recall:.3f}")
print(f"F1:         {f1:.3f}")
print(f"FPR:        {fpr_val:.3f}")

auc = roc_auc_score(y_true, y_score)
print(f"AUC:        {auc:.3f}")

# Show how precision/recall shift as threshold changes
print("\nThreshold sweep:")
for t in [0.3, 0.5, 0.7, 0.9]:
    yp = (y_score >= t).astype(int)
    tp = np.sum((yp == 1) & (y_true == 1))
    fp = np.sum((yp == 1) & (y_true == 0))
    fn = np.sum((yp == 0) & (y_true == 1))
    p = tp / (tp + fp) if tp + fp > 0 else 0
    r = tp / (tp + fn) if tp + fn > 0 else 0
    print(f"  t={t}: precision={p:.2f}, recall={r:.2f}")

Change the threshold to observe the precision-recall trade-off directly: a lower threshold catches more positives (recall never decreases) at the cost of more false alarms, which usually, though not always, lowers precision. To evaluate a real model with these metrics, replace y_score with the positive-class column of any sklearn classifier’s predict_proba output.
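A minimal sketch of that substitution follows. The synthetic dataset, the choice of LogisticRegression, and the 0.5 threshold are all assumptions made for illustration; any sklearn classifier that exposes predict_proba could be used instead.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary dataset, for illustration only
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# predict_proba returns one column per class; column 1 is the positive class (label 1)
y_score = clf.predict_proba(X_test)[:, 1]
y_pred = (y_score >= 0.5).astype(int)  # 0.5 is an arbitrary operating threshold

print(f"Precision: {precision_score(y_test, y_pred):.3f}")
print(f"Recall:    {recall_score(y_test, y_pred):.3f}")
print(f"F1:        {f1_score(y_test, y_pred):.3f}")
print(f"AUC:       {roc_auc_score(y_test, y_score):.3f}")
```

Precision, recall, and F1 are computed from the thresholded predictions, whereas AUC takes the raw scores, since it summarises performance across all thresholds.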