Classification Metrics: Precision, Recall, F1, and ROC
The confusion matrix is the foundation of the threshold-based classification metrics covered here. Precision measures how often you are right when you say yes. Recall measures how often you catch what you should. F1 balances the two. ROC/AUC summarises performance across all thresholds.
The Confusion Matrix
Every classification prediction falls into one of four cells:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP (True Positive) | FN (False Negative) |
| Actual Negative | FP (False Positive) | TN (True Negative) |
- TP: correctly predicted positive
- FP: predicted positive, actually negative (false alarm)
- FN: predicted negative, actually positive (missed detection)
- TN: correctly predicted negative
Total positives = $TP + FN$. Total negatives = $FP + TN$. Total retrieved (predicted positive) = $TP + FP$.
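As a small illustration of the four cells, the sketch below tallies them from two label arrays with numpy. The arrays are made-up data, not results from a real model, and the variable names are chosen here for illustration.

```python
import numpy as np

# Made-up labels for illustration: 1 = positive, 0 = negative
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1])

# Each cell is a count of samples matching one (predicted, actual) combination
tp = int(np.sum((y_pred == 1) & (y_true == 1)))
fp = int(np.sum((y_pred == 1) & (y_true == 0)))
fn = int(np.sum((y_pred == 0) & (y_true == 1)))
tn = int(np.sum((y_pred == 0) & (y_true == 0)))

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"Total positives (TP + FN): {tp + fn}")
print(f"Total negatives (FP + TN): {fp + tn}")
print(f"Predicted positive (TP + FP): {tp + fp}")
```

The four counts always sum to the number of samples, which is a quick sanity check when computing them by hand.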
Core Metrics
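Recall (True Positive Rate): of all actual positives, how many did you catch?
$$\text{Recall} = \text{TPR} = \frac{TP}{TP + FN}$$
Precision: of all the positives you predicted, how many were correct?
$$\text{Precision} = \frac{TP}{TP + FP}$$
F1 Score: harmonic mean of precision and recall:
$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$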
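The harmonic mean penalises imbalance between precision and recall: it is pulled toward the smaller of the two. An F1 of 0.9 requires both precision and recall to be at least 9/11 ≈ 0.818, and that minimum is reached only when the other metric equals 1.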
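To see how the harmonic mean behaves, the sketch below compares it with the arithmetic mean for a few hypothetical precision/recall pairs that all average 0.5. The helper name `f1_from` is made up for this example.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical (precision, recall) pairs, all with an arithmetic mean of 0.5
pairs = [(0.5, 0.5), (0.7, 0.3), (0.9, 0.1), (1.0, 0.0)]
for p, r in pairs:
    arithmetic = (p + r) / 2
    print(f"P={p:.1f} R={r:.1f}  arithmetic={arithmetic:.2f}  F1={f1_from(p, r):.2f}")
```

The arithmetic mean stays at 0.5 for every pair, while F1 drops as the two values move apart and reaches 0 when either one is 0.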
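Specificity (True Negative Rate): of all actual negatives, how many did you correctly reject?
$$\text{Specificity} = \frac{TN}{TN + FP}$$
False Positive Rate: the complement of specificity:
$$\text{FPR} = \frac{FP}{FP + TN} = 1 - \text{Specificity}$$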
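The definitions in this section reduce to a handful of divisions on the four counts. The sketch below applies them to one hypothetical set of counts; the numbers and the function name `metrics_from_counts` are assumptions made for illustration.

```python
def metrics_from_counts(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute the core metrics from confusion-matrix counts.

    Each ratio falls back to 0.0 when its denominator is zero.
    """
    def ratio(num, den):
        return num / den if den > 0 else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    return {
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "specificity": specificity,
        "fpr": ratio(fp, fp + tn),
    }

# Hypothetical counts: 100 actual positives, 900 actual negatives
m = metrics_from_counts(tp=80, fp=40, fn=20, tn=860)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With these counts the classifier catches 80% of positives, but only two thirds of its positive predictions are correct, so F1 lands between the two at roughly 0.73.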
When to Use Which
| Metric | Use when |
|---|---|
| Recall | Missing a positive is costly, e.g. disease detection, fraud, safety systems |
| Precision | False alarms are costly, e.g. spam filters, content moderation |
| F1 | You need to balance both, e.g. imbalanced classes, no clear priority |
| Specificity | Correctly rejecting negatives is the priority, e.g. confirming that healthy cases are cleared |
ROC Curve
The ROC (Receiver Operating Characteristic) curve plots TPR (recall) vs FPR as the classification threshold varies from 0 to 1.
- Perfect classifier: reaches the top-left corner (TPR = 1, FPR = 0) immediately
- Random classifier: falls on the diagonal (TPR = FPR for all thresholds)
- Worse than random: below the diagonal
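One way to trace the curve by hand is to lower the threshold through every distinct score and record (FPR, TPR) at each step. The sketch below does this with numpy on made-up labels and scores; `roc_points` is a hypothetical helper, not a library function, and it assumes both classes are present.

```python
import numpy as np

def roc_points(y_true, y_score):
    """Return (fpr, tpr) arrays, one point per threshold, strictest first.

    Assumes y_true contains at least one positive (1) and one negative (0).
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)
    # Start above every score (nothing predicted positive), then lower the
    # threshold through each distinct score value in descending order.
    thresholds = np.concatenate(([np.inf], np.unique(y_score)[::-1]))
    tpr = np.array([np.sum((y_score >= t) & (y_true == 1)) / n_pos for t in thresholds])
    fpr = np.array([np.sum((y_score >= t) & (y_true == 0)) / n_neg for t in thresholds])
    return fpr, tpr

# Made-up labels and scores, for illustration only
y_true = [1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
fpr, tpr = roc_points(y_true, y_score)
for f, t in zip(fpr, tpr):
    print(f"FPR={f:.2f}  TPR={t:.2f}")
```

The curve starts at (0, 0), where nothing is predicted positive, and ends at (1, 1), where everything is. Each step moves up when the newly admitted sample is a positive and right when it is a negative.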
AUC: Area Under the Curve
AUC summarises the entire ROC curve as a single number:
- $\text{AUC} = 1.0$: perfect classifier
- $\text{AUC} = 0.5$: random classifier (ROC falls on the diagonal)
- $\text{AUC} < 0.5$: worse than random (ROC falls below the diagonal)
AUC is threshold-independent, which makes it useful when you have not yet chosen an operating threshold. It also has a ranking interpretation: a classifier with AUC = 0.9 ranks a randomly chosen positive above a randomly chosen negative 90% of the time.
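The ranking interpretation can be checked directly by comparing every positive score with every negative score. The sketch below is one way to do that, assuming ties count as half a win; `pairwise_auc` is a name invented for this example, and the data is made up. The result is compared against sklearn's `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Broadcasting builds a (n_pos, n_neg) grid of score differences
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return wins / (len(pos) * len(neg))

# Made-up labels and scores, for illustration only
y_true = [1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

print(f"Pairwise AUC: {pairwise_auc(y_true, y_score):.3f}")
print(f"sklearn AUC:  {roc_auc_score(y_true, y_score):.3f}")
```

Here 8 of the 9 positive/negative pairs are ordered correctly (only the positive at 0.6 falls below the negative at 0.7). The pairwise comparison costs O(n_pos × n_neg) memory and time, so it is meant for understanding rather than for large datasets.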
What You Can Do Now
The code below computes precision, recall, F1, and FPR from the confusion-matrix counts using only numpy, and takes AUC from sklearn's roc_auc_score. Working through the counts by hand shows exactly what each metric measures.
```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Simulate a classifier: true labels and predicted probabilities
y_true = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
y_score = np.array([0.9, 0.8, 0.75, 0.6, 0.55, 0.3,
                    0.85, 0.7, 0.5, 0.4, 0.35, 0.25, 0.2, 0.1, 0.05])

# Turn scores into hard predictions at a fixed threshold
threshold = 0.5
y_pred = (y_score >= threshold).astype(int)

# Confusion matrix components
TP = np.sum((y_pred == 1) & (y_true == 1))
FP = np.sum((y_pred == 1) & (y_true == 0))
FN = np.sum((y_pred == 0) & (y_true == 1))
TN = np.sum((y_pred == 0) & (y_true == 0))

# Metrics from the counts; each falls back to 0 when its denominator is zero
precision = TP / (TP + FP) if (TP + FP) > 0 else 0
recall = TP / (TP + FN) if (TP + FN) > 0 else 0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
fpr_val = FP / (FP + TN) if (FP + TN) > 0 else 0

print(f"Confusion matrix: TP={TP} FP={FP} FN={FN} TN={TN}")
print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"F1: {f1:.3f}")
print(f"FPR: {fpr_val:.3f}")

# AUC is computed from the raw scores, not the thresholded predictions
auc = roc_auc_score(y_true, y_score)
print(f"AUC: {auc:.3f}")

# Show how precision/recall shift as threshold changes
print("\nThreshold sweep:")
for t in [0.3, 0.5, 0.7, 0.9]:
    yp = (y_score >= t).astype(int)
    tp = np.sum((yp == 1) & (y_true == 1))
    fp = np.sum((yp == 1) & (y_true == 0))
    fn = np.sum((yp == 0) & (y_true == 1))
    p = tp / (tp + fp) if tp + fp > 0 else 0
    r = tp / (tp + fn) if tp + fn > 0 else 0
    print(f"  t={t}: precision={p:.2f}, recall={r:.2f}")
```
Change the threshold to observe the precision-recall trade-off directly: a lower threshold catches more positives (higher recall), typically at the cost of more false alarms (lower precision). To evaluate a real model, replace y_score with the positive-class column of any sklearn classifier’s predict_proba output, i.e. predict_proba(X)[:, 1] for binary labels 0 and 1.
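As a sketch of that last step, the example below trains a logistic regression on synthetic data and feeds its predicted probabilities into sklearn's metric functions. The dataset, the choice of model, and the 0.5 threshold are all assumptions made for illustration; any classifier exposing predict_proba could be substituted.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem (roughly 80% negatives, 20% positives)
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Column 1 of predict_proba is the probability of the positive class (label 1)
y_score = clf.predict_proba(X_test)[:, 1]
y_pred = (y_score >= 0.5).astype(int)

# Threshold-dependent metrics use the hard predictions
print(f"Precision: {precision_score(y_test, y_pred, zero_division=0):.3f}")
print(f"Recall:    {recall_score(y_test, y_pred, zero_division=0):.3f}")
print(f"F1:        {f1_score(y_test, y_pred, zero_division=0):.3f}")

# AUC uses the raw scores, so it does not depend on the 0.5 cut-off
auc = roc_auc_score(y_test, y_score)
print(f"AUC:       {auc:.3f}")
```

Evaluating on a held-out test split, as done here, avoids reporting metrics on the data the model was fitted to.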