13. Metrics and calibration: picking and trusting numbers
Five minutes. Why accuracy is a liar, precision is a pessimist, recall is a worrier, and calibration is the only honest friend.
Built on the ELI5 in 00-eli5.md. The confidence score (the probability the model assigns to the prediction) comes back hard. If the score reads 90% fraud but only 60% of those transactions are actually fraud, the score is broken. Everything in this file is about fixing that score.
The picture before the formulas
See. The fraud model evaluates 100 transactions. Some are fraud. Some are legit. The model says "fraud" or "legit" for each one. Four things can happen:
| | Model says "fraud" | Model says "legit" | Total |
|---|---|---|---|
| Actually fraud | TP = 40 | FN = 10 | 50 actually fraud |
| Actually legit | FP = 5 | TN = 45 | 50 actually legit |
| Total | 45 called fraud | 55 called legit | 100 transactions |
- TP — true positive. Fraud, called fraud. Correct alarm.
- FP — false positive. Legit, called fraud. False alarm. The legit transaction gets blocked.
- FN — false negative. Fraud, called legit. Missed. The fraud transaction gets approved.
- TN — true negative. Legit, called legit. Correct silence.
Every metric is a ratio carved from these four boxes. The question is which ratio matters for your fraud queue.
The four metrics, each with a job
Accuracy: the crowd-pleaser
Accuracy = (TP + TN) / (TP + FP + FN + TN). For the matrix above: (40 + 45) / 100 = 85%.
Sounds great. But imagine 95 transactions are legit, 5 are fraud. A model that says "legit" to every transaction scores 95% accuracy. The fraud queue approves every bad swipe. Accuracy lies when classes are imbalanced. This is underfitting dressed up as a good number.
Precision: "of everyone I flagged, how many were right?"
Precision = TP / (TP + FP). For the matrix above: 40 / 45 ≈ 89%.
High precision → few false alarms. The review system does not block too many good transactions. Good for spam filters (flag only when sure), ad targeting (do not waste budget), and credit approval (do not reject good borrowers).
Recall: "of everyone actually fraud, how many did I catch?"
Recall = TP / (TP + FN). For the matrix above: 40 / 50 = 80%.
High recall → few missed fraud cases. The model catches most fraud. Good for fraud detection (every missed fraud costs money), safety systems (missing a defect kills someone), and medical screening (missing one can be catastrophic).
F1: the compromise
F1 = 2 × precision × recall / (precision + recall).
F1 is the harmonic mean of precision and recall. It punishes extremes: a model with 100% precision and 1% recall scores F1 ≈ 2%. It forces both numbers to be reasonable. Use F1 when you need one number and neither precision nor recall alone tells the story.
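The four ratios can be checked in a few lines. This is a minimal sketch using the standard definitions and the counts from the confusion matrix above; the function name `classification_metrics` and the choice to return 0.0 when a denominator is empty are illustrative assumptions, not something this file prescribes.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Return accuracy, precision, recall and F1 from the four confusion-matrix counts."""
    total = tp + fp + fn + tn
    # Assumption: an empty denominator (e.g. a model that never flags) yields 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Counts from the confusion matrix above.
from_matrix = classification_metrics(tp=40, fp=5, fn=10, tn=45)

# The "say legit to every transaction" model on 95 legit / 5 fraud transactions.
always_legit = classification_metrics(tp=0, fp=0, fn=5, tn=95)

for name, metrics in [("matrix above", from_matrix), ("always legit", always_legit)]:
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

The second call makes the accuracy trap visible: accuracy is 0.95 while recall is 0.0.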
AUC-ROC: the threshold-free scorecard
The model does not just say "fraud" or "legit". The confidence score gives a probability: 0.92, 0.47, 0.03. We pick a threshold, say 0.5, and call every transaction above it "fraud".
But what if we swept the threshold from 0 to 1? At each threshold, we get a different (FPR, TPR) pair, where TPR = TP / (TP + FN) is recall and FPR = FP / (FP + TN) is the share of legit transactions that get flagged. Plot them all:
[Figure: ROC curve. y-axis: TPR (recall), 0.0 to 1.0; x-axis: FPR, 0.0 to 1.0. A good model's curve bows toward the top-left corner; a random model follows the diagonal.]
- AUC = 1.0 — perfect separation. Every fraud transaction scored higher than every legit one.
- AUC = 0.5 — random. The model is flipping coins.
- AUC = 0.85 — means "if you pick one random fraud transaction and one random legit transaction, the model ranks the fraud one higher 85% of the time."
AUC-ROC is threshold-independent. It tells you the model's ranking power before you pick a threshold. Use it when the threshold will be tuned later by business rules.
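The ranking interpretation of AUC can be computed directly by comparing every fraud score against every legit score. The sketch below does that in plain Python; the scores and labels are made up for illustration, the helper names are hypothetical, and `roc_point` treats a score equal to the threshold as "fraud" (an assumption, since the text only says "above").

```python
from itertools import product


def pairwise_auc(scores, labels):
    """AUC-ROC as the fraction of (fraud, legit) pairs where the fraud one scores higher.

    Ties count as half a win, which is the usual convention.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))


def roc_point(scores, labels, threshold):
    """Return (FPR, TPR) when every score >= threshold is called fraud."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return fp / n_neg, tp / n_pos


# Hypothetical scores; 1 = fraud, 0 = legit.
scores = [0.92, 0.80, 0.47, 0.35, 0.30, 0.03]
labels = [1, 1, 0, 1, 0, 0]

auc = pairwise_auc(scores, labels)
print(f"AUC = {auc:.3f}")
for t in (0.9, 0.5, 0.3, 0.0):
    fpr, tpr = roc_point(scores, labels, t)
    print(f"threshold={t:.1f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Here 8 of the 9 fraud/legit pairs are ordered correctly, so the AUC is 8/9. The pairwise loop is quadratic, so it is only meant for building intuition on small samples.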
AUC-PR: when positives are rare
AUC-ROC can look amazing on imbalanced data. Say the fraud rate is 1% and the model catches most fraud. FPR stays small because its denominator (all negatives) is huge, so the ROC curve bows beautifully. But precision might be 5%, meaning 95% of your "fraud" flags are false alarms.
The precision-recall curve fixes this. Plot precision (y) vs recall (x) at every threshold:
[Figure: precision-recall curve. y-axis: precision, 0.0 to 1.0; x-axis: recall, 0.0 to 1.0. A good model's curve stays high as recall grows; a bad model's curve drops fast.]
AUC-PR is harsh. A random classifier on 1% positive rate scores AUC-PR ≈ 0.01, not 0.5. Use AUC-PR when the positive class is rare and false alarms are expensive.
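A little arithmetic shows why a small FPR can still mean a flood of false alarms when fraud is rare. The numbers below are hypothetical (10,000 transactions, 1% fraud, an assumed operating point of 90% TPR and 5% FPR); they illustrate the effect and are not results from any real model.

```python
# Hypothetical traffic at a 1% fraud rate.
n_total = 10_000
n_fraud = 100                # 1% prevalence
n_legit = n_total - n_fraud  # 9,900 negatives

# Assumed operating point: looks strong on a ROC plot.
tpr = 0.90                   # catches 90% of fraud
fpr = 0.05                   # flags only 5% of legit traffic

tp = tpr * n_fraud           # frauds caught
fp = fpr * n_legit           # false alarms

precision = tp / (tp + fp)

# A classifier that flags at random is right about as often as fraud occurs,
# which is why the random AUC-PR baseline sits near the prevalence.
random_precision = n_fraud / n_total

print(f"caught={tp:.0f}  false alarms={fp:.0f}  precision={precision:.3f}")
print(f"random baseline precision={random_precision:.3f}")
```

With these assumed rates, 495 false alarms sit next to 90 caught frauds, so roughly 85% of flags are wrong even though the FPR is only 5%.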
Macro vs micro vs weighted averaging: for multi-class reports
For multi-class metrics, you must say how you averaged.
- Micro = pool all TP/FP/FN globally, then compute the metric once. Dominated by the majority class.
- Macro = compute the metric per class, then average the class metrics equally. Rare classes get equal weight.
- Weighted = compute per-class metrics, then average by class support. Majority classes still dominate, but less brutally than micro.
Use macro when rare classes matter. Use micro when overall accuracy matters. Use weighted when you want one number that still reflects class frequency.
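The three averaging modes can be compared on a tiny example. This is a sketch in plain Python with made-up labels (the class names `legit`, `fraud`, and `chargeback` are hypothetical) and a model that only ever predicts the majority class; the helper names are illustrative.

```python
from collections import Counter


def per_class_counts(y_true, y_pred, classes):
    """One-vs-rest TP/FP/FN counts for each class."""
    counts = {c: Counter() for c in classes}
    for t, p in zip(y_true, y_pred):
        if t == p:
            counts[t]["tp"] += 1
        else:
            counts[p]["fp"] += 1  # predicted p, but the true class was something else
            counts[t]["fn"] += 1  # the true class t was missed
    return counts


def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def averaged_f1(y_true, y_pred):
    classes = sorted(set(y_true) | set(y_pred))
    counts = per_class_counts(y_true, y_pred, classes)
    support = Counter(y_true)
    per_class = {
        c: f1_from_counts(counts[c]["tp"], counts[c]["fp"], counts[c]["fn"])
        for c in classes
    }
    # Micro: pool the counts, then compute F1 once.
    micro = f1_from_counts(
        sum(counts[c]["tp"] for c in classes),
        sum(counts[c]["fp"] for c in classes),
        sum(counts[c]["fn"] for c in classes),
    )
    # Macro: plain average of per-class F1.
    macro = sum(per_class.values()) / len(classes)
    # Weighted: per-class F1 averaged by class support.
    weighted = sum(per_class[c] * support[c] for c in classes) / len(y_true)
    return {"micro": micro, "macro": macro, "weighted": weighted}


# Hypothetical labels: 8 legit, 1 fraud, 1 chargeback; the model always says "legit".
y_true = ["legit"] * 8 + ["fraud", "chargeback"]
y_pred = ["legit"] * 10

result = averaged_f1(y_true, y_pred)
print({k: round(v, 3) for k, v in result.items()})
```

On this toy data micro F1 equals accuracy (0.8), while macro F1 falls to about 0.30 because the two rare classes each score zero.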
Calibration: does the confidence score tell the truth?
The model says "90% chance of fraud." You collect 100 transactions that all got 90%. If 90 of them were actually fraud, the model is calibrated. If only 60 were fraud, the score is broken — it is overconfident.
[Figure: reliability diagram. x-axis: predicted fraud probability, 0.0 to 1.0; y-axis: actual fraction fraud, 0.0 to 1.0. Perfect calibration lies on the diagonal; an overconfident model falls below it.]
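The "collect everything that scored 90% and count the fraud" check generalizes to bucketing all predictions by score. Below is a minimal sketch in plain Python; the function name `reliability_table`, the equal-width bins, and the data are all illustrative assumptions.

```python
def reliability_table(probs, labels, n_bins=5):
    """Bucket predictions by score and compare mean predicted vs observed fraud rate."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # a score of exactly 1.0 goes in the last bin
        bins[idx].append((p, y))
    rows = []
    for idx, items in enumerate(bins):
        if not items:
            continue  # skip empty buckets
        mean_pred = sum(p for p, _ in items) / len(items)
        frac_fraud = sum(y for _, y in items) / len(items)
        rows.append((idx, len(items), mean_pred, frac_fraud))
    return rows


# Hypothetical data: ten transactions scored 0.9 (six are fraud)
# and ten scored 0.1 (one is fraud).
probs = [0.9] * 10 + [0.1] * 10
labels = [1] * 6 + [0] * 4 + [1] * 1 + [0] * 9

rows = reliability_table(probs, labels)
for idx, n, mean_pred, frac_fraud in rows:
    print(f"bin {idx}: n={n:2d}  predicted={mean_pred:.2f}  observed={frac_fraud:.2f}")
```

The top bucket predicts 0.90 but observes 0.60, which is the overconfident case described above; the bottom bucket predicts 0.10 and observes 0.10, so it is calibrated.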
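Brier score: calibration in one number
Brier = (1/N) × Σ (p_i − y_i)², where p_i is the predicted fraud probability for transaction i and y_i is 1 for fraud, 0 for legit.
Lower is better. Brier = 0 means perfect. It rewards both discrimination (ranking) and calibration (probability accuracy). Unlike AUC, Brier punishes a model that ranks well but gives garbage probabilities.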
Worked example: 5 transactions
| Transaction | Predicted P(fraud) | Actual | Squared error |
|---|---|---|---|
| 1 | 0.90 | 1 (fraud) | (0.90 - 1)² = 0.01 |
| 2 | 0.70 | 0 (legit) | (0.70 - 0)² = 0.49 |
| 3 | 0.30 | 0 (legit) | (0.30 - 0)² = 0.09 |
| 4 | 0.85 | 1 (fraud) | (0.85 - 1)² = 0.0225 |
| 5 | 0.10 | 1 (fraud) | (0.10 - 1)² = 0.81 |
Brier = (0.01 + 0.49 + 0.09 + 0.0225 + 0.81) / 5 = 0.2845
Transaction 2 (legit, predicted 0.70) and transaction 5 (fraud, predicted 0.10) hurt the most. The model was confidently wrong on both.
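The table can be reproduced in a few lines, taking the Brier score as the mean of the squared errors in the last column. The comparison against a constant prediction at the sample fraud rate (3 of 5) is an added illustration, not part of the worked example.

```python
def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


# The five transactions from the table above (1 = fraud, 0 = legit).
probs = [0.90, 0.70, 0.30, 0.85, 0.10]
labels = [1, 0, 0, 1, 1]

model_brier = brier_score(probs, labels)

# Illustration only: always predicting the sample fraud rate.
base_rate = sum(labels) / len(labels)
baseline_brier = brier_score([base_rate] * len(labels), labels)

print(round(model_brier, 4))     # 0.2845
print(round(baseline_brier, 4))  # 0.24
```

On these five rows the constant guess scores slightly better than the model, which shows how heavily the two confident misses weigh on the average.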
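Log loss: punish confident mistakes hard
Log loss = −(1/N) × Σ [ y_i × ln(p_i) + (1 − y_i) × ln(1 − p_i) ], with p_i the predicted fraud probability and y_i the 0/1 label.
Log loss punishes confident wrong predictions brutally. A 0.99 fraud score on a legit transaction costs −ln(0.01) ≈ 4.61, while a 0.60 fraud score on the same legit transaction costs only −ln(0.40) ≈ 0.92.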
So use log loss when you care about calibrated probabilities, not just ranking. AUC asks "did you rank fraud above legit?" Log loss asks "did you assign sensible probabilities?"
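The size of that penalty is easy to see per transaction. This sketch assumes the standard binary log loss, −[y·ln(p) + (1 − y)·ln(1 − p)], with natural logarithms; the clipping constant `eps` is an illustrative guard against ln(0), not something this file specifies.

```python
import math


def log_loss_single(p_fraud, actual, eps=1e-15):
    """Binary log loss for one transaction (actual: 1 = fraud, 0 = legit)."""
    p = min(max(p_fraud, eps), 1 - eps)  # keep p strictly inside (0, 1)
    return -(actual * math.log(p) + (1 - actual) * math.log(1 - p))


# Both predictions are wrong: the transaction is legit.
confident_wrong = log_loss_single(0.99, actual=0)  # -ln(0.01)
mild_wrong = log_loss_single(0.60, actual=0)       # -ln(0.40)

print(f"0.99 on a legit transaction: {confident_wrong:.3f}")
print(f"0.60 on a legit transaction: {mild_wrong:.3f}")
```

The confident mistake costs about five times as much as the mild one. Under squared error the same two predictions differ by less than a factor of three (0.98 vs 0.36).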
Fixing calibration: Platt scaling and isotonic regression
- Platt scaling. Fit a logistic regression on top of the model's raw scores using a held-out calibration set. Maps raw scores → calibrated probabilities. Simple. Works well when the calibration curve is roughly sigmoid-shaped.
- Isotonic regression. Fit a non-decreasing step function. More flexible. Needs more data. Better when the miscalibration is monotonic but not sigmoid-shaped; because the fitted map never decreases, it cannot repair a score whose ordering is itself wrong.
Both are post-hoc — train the model first, calibrate second. Sklearn: CalibratedClassifierCV(method='sigmoid') for Platt, method='isotonic' for isotonic.
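A hedged sketch of both methods with scikit-learn follows. The data is synthetic (`make_classification` stands in for a fraud dataset), `GaussianNB` is chosen only as an example of a model whose probabilities are often poorly calibrated, and `cv=3` is an arbitrary choice. With an integer `cv`, `CalibratedClassifierCV` fits the model and the calibrator on different folds rather than training first and calibrating second on a single held-out set.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced stand-in for a fraud dataset (not real data).
X, y = make_classification(
    n_samples=5000, n_features=20, n_informative=5, n_redundant=10,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Uncalibrated baseline.
raw = GaussianNB().fit(X_train, y_train)

# Each fold trains the model on one part of the training data and fits the
# calibrator on the rest, so the calibrator never sees rows the model trained on.
platt = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=3).fit(X_train, y_train)
isotonic = CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=3).fit(X_train, y_train)

# Compare probability quality on the untouched test split.
for name, model in [("raw", raw), ("platt", platt), ("isotonic", isotonic)]:
    p_fraud = model.predict_proba(X_test)[:, 1]
    print(f"{name:9s} Brier = {brier_score_loss(y_test, p_fraud):.4f}")
```

Which method wins depends on the model and the amount of calibration data, so compare Brier score or log loss on held-out data rather than assuming one is better.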
Pause and recall. Without scrolling — what is the difference between precision and recall? When does AUC-ROC lie? What does a perfectly calibrated model's reliability diagram look like? If any link is fuzzy, scroll up.
Where this lives in the wild
- Google Ads click prediction. Calibration is the product. The bid price = predicted click probability × advertiser's value per click. If predicted P(click) = 0.05 but true P(click) = 0.02, Google charges 2.5x too much. Platt scaling runs on billions of predictions daily.
- PayPal and Adyen fraud review queues. Calibration matters because a "90% fraud risk" score drives manual review load and auto-block rules. If that 90% bucket is really 55%, the team either blocks too many legit payments or understaffs review.
- Stripe fraud detection. Optimizes precision at high recall — catch ≥95% of fraud (recall), then maximize precision within that constraint. AUC-PR is the offline metric. The threshold is set by the business cost of a false positive (blocking a legitimate purchase).
- Netflix recommendation. AUC-ROC to rank candidates. Calibration for the "% match" number shown to users. The match score must be honest — users learn to distrust a system that says 97% and delivers a dud.
- Weather forecasting (NOAA). Among the most calibrated prediction systems in the world. "30% chance of rain" means it rains 30% of the time when they say that. Decades of calibration research. ML teams aspire to this.
Interview Q&A
Q: When should you use precision vs recall vs F1?
A: Precision when false positives are expensive (spam filter flagging legitimate email, ad spend on wrong audience). Recall when false negatives are dangerous (cancer screening, fraud detection, safety systems). F1 when you need one number and neither dominates. Always ask: "what is the cost of a false positive vs a false negative?" That decides the metric.
Common wrong answer to avoid: "always use F1." F1 treats precision and recall as equally important. In most production systems, they are not. A cancer screen cares about recall. A spam filter cares about precision.
Q: Why can AUC-ROC be misleading on imbalanced data?
A: Because FPR = FP / (FP + TN), and when TN is huge (many negatives), even a large number of false positives produces a tiny FPR. The ROC curve looks excellent while 95% of the model's flags are false alarms. AUC-PR exposes this because precision = TP / (TP + FP), so the huge TN count does not help.
Common wrong answer to avoid: "ROC is always fine, just use a different threshold." The problem is not the threshold — it is that ROC hides how many false alarms you produce when positives are rare.
Q: What is calibration and why does it matter beyond ranking?
A: Calibration means predicted probabilities match empirical frequencies. It matters whenever downstream decisions use the probability directly — bidding (ad tech), risk pricing (insurance), or fraud-review routing. A well-ranked but miscalibrated model gives good ordering but wrong dollar amounts or wrong risk levels.
Common wrong answer to avoid: "calibration is just accuracy." No. A model can be 95% accurate and terribly miscalibrated — it predicts 0.99 for every positive instead of the correct 0.70.
Q: How do you fix a model that ranks well but is poorly calibrated?
A: Post-hoc calibration. Hold out a calibration set that is separate from both the training set and the test set. Fit Platt scaling (logistic regression on raw scores → probabilities) or isotonic regression (non-parametric monotonic mapping). Platt is simpler and works for sigmoid-shaped miscalibration. Isotonic is more flexible but needs more data. Sklearn's CalibratedClassifierCV handles both.
Common wrong answer to avoid: "just change the threshold." Threshold changes decisions, not probabilities. A badly calibrated 0.93 score is still a badly calibrated 0.93 score after thresholding.
Q: What is the difference between macro and micro F1?
A: Micro F1 pools all predictions first, so the majority class dominates. Macro F1 computes F1 per class and averages, so rare classes count equally. On imbalanced data, micro F1 can sit near accuracy while macro F1 drops hard and exposes weak performance on rare classes.
Common wrong answer to avoid: "they're basically the same." On imbalanced data they can be very different, and that difference is the whole signal.
Apply now (5 min)
Scenario. A payments company deploys a card-fraud model. Prevalence is 5% (5 in 100 transactions are fraud). The model scores every transaction 0–1.
By hand:
- Write the confusion matrix for a threshold of 0.5 if: TP = 4, FP = 10, FN = 1, TN = 85.
- Compute accuracy, precision, recall, F1.
- Compute Brier score for these 5 transactions: (0.92, fraud), (0.80, legit), (0.15, legit), (0.60, fraud), (0.05, fraud).
- The payments team says "we cannot miss more than 5% of fraud." Which metric constraint is that? (Recall ≥ 95%.) What happens to precision when you enforce it?
Then — without looking — sketch from memory:
- The confusion matrix layout (2×2, TP/FP/FN/TN).
- The ROC curve shape for a good model vs random.
- One sentence: what does "calibrated" mean?
If you can do all three in 90 seconds, you own this.
Bridge. The metrics are honest now. But the fraud queue sees 95 legit transactions for every 5 fraud ones. Accuracy celebrates doing nothing. The threshold that maximizes F1 might still miss half the fraud. Class imbalance changes everything. Read 14-class-imbalance.md next.