14. Class imbalance and thresholds — when accuracy celebrates doing nothing

Five minutes. Why a model that says "legit" to everyone scores 99.9% — and why moving one slider fixes everything.

Built on the ELI5 in 00-eli5.md. The confidence score gives a probability. But probability alone is not a decision. Someone must draw a line — "above this, call fraud." That line is the threshold. Move it, and the system's behavior flips. This file is about where to draw it when fraud is rare.


The picture before the math

See. Your payment system sees 10,000 transactions today. 10 are fraud. 9,990 are legit.

A lazy model stamps "legit" on every row. Does not inspect anything.

   Confusion matrix — the lazy model

                        Model says
                     "fraud"     "legit"
                  ┌──────────┬──────────┐
   Actually fraud │  TP = 0  │  FN = 10 │      10 fraud
                  ├──────────┼──────────┤
   Actually legit │  FP = 0  │ TN = 9990│   9,990 legit
                  └──────────┴──────────┘

   accuracy = (0 + 9990) / 10000 = 99.9%
   precision = 0 / 0 = undefined
   recall = 0 / (0 + 10) = 0%
   F1 = 0

99.9% accuracy. Zero recall. Every fraud transaction gets approved. The confidence score was never consulted. The dashboard celebrated a meaningless number.
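The arithmetic above can be checked in a few lines. This is a minimal sketch in plain Python. The helper name `metrics` is made up for illustration, and returning `None` for an undefined precision is one convention among several.

```python
def metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall, F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Precision is undefined when the model never predicts the positive class.
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    recall = tp / (tp + fn) if (tp + fn) > 0 else None
    # F1 written directly in counts, so it stays defined even when precision is not.
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom > 0 else None
    return accuracy, precision, recall, f1

# The lazy model: every row stamped "legit".
accuracy, precision, recall, f1 = metrics(tp=0, fp=0, fn=10, tn=9990)
print(accuracy, precision, recall, f1)  # 0.999 None 0.0 0.0
```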

This is underfitting at its worst — a model that learns "say the majority class" and the metric rewards it.


Why imbalance breaks default training

Most loss functions treat every sample equally. In a batch of 100 rows — 1 fraud, 99 legit — the gradient from the 1 fraud example is drowned by the 99 legit ones pulling the other way. The model learns: predicting legit always minimizes average loss. Signal from the rare class vanishes in the noise of the common class.

   gradient direction

   99 legit rows ───────────→  "predict legit"
    1 fraud row ←──                               (overwhelmed)

   net gradient ───────────→  "predict legit"

The model is not stupid. It found the easiest path — and the easiest path ignores the rare class entirely.
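To put numbers on the arrows, here is a small sketch under simplifying assumptions that are not part of the original text: a bias-only logistic model that currently outputs p = 0.5 for every row, trained with plain log loss. In that setup the gradient of each row's loss with respect to the logit is p - y.

```python
# Assumed setup: bias-only logistic model, every row currently scored p = 0.5.
labels = [1] + [0] * 99          # 1 fraud row, 99 legit rows
p = 0.5

# Per-row gradient of log loss with respect to the logit is (p - y).
grads = [p - y for y in labels]

# Positive gradient: a descent step lowers the logit ("predict legit").
pull_from_legit = sum(g for g, y in zip(grads, labels) if y == 0)
# Negative gradient: a descent step raises the logit ("predict fraud").
pull_from_fraud = sum(g for g, y in zip(grads, labels) if y == 1)
net_gradient = sum(grads) / len(labels)

print(pull_from_legit, pull_from_fraud, net_gradient)  # 49.5 -0.5 0.49
```

The net gradient is positive, so each update pushes the logit down, toward "legit" for everyone.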


Fix 1 — class weights

Multiply the positive class loss by w_pos = N_neg / N_pos. Now the one fraud example contributes as much gradient as all 99 legit ones combined.

   without weights:  loss = (1/100) · [99 · loss_neg  +  1 · loss_pos]
   with weights:     loss = (1/100) · [99 · loss_neg  + 99 · loss_pos]
                                                         ↑ scaled up

Sklearn: class_weight='balanced' in LogisticRegression, RandomForestClassifier, SVC. XGBoost: scale_pos_weight = N_neg / N_pos.
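A quick sketch of what the weight does, in plain Python. The toy setup is an assumption for illustration (a bias-only logistic model scoring every row at p = 0.5, per-row gradient p - y). The 'balanced' formula shown is scikit-learn's documented heuristic, n_samples / (n_classes * class_count), which gives the same class ratio as N_neg / N_pos.

```python
n_pos, n_neg = 1, 99
n_samples = n_pos + n_neg

# Weight from the text: scale the positive class by N_neg / N_pos.
w_pos = n_neg / n_pos                      # 99.0

# scikit-learn's 'balanced' heuristic: n_samples / (n_classes * class_count).
balanced_neg = n_samples / (2 * n_neg)     # about 0.505
balanced_pos = n_samples / (2 * n_pos)     # 50.0
ratio = balanced_pos / balanced_neg        # same 99:1 ratio as w_pos

# Assumed toy model: every row scored p = 0.5, gradient per row is (p - y).
p = 0.5
weighted_net = (n_neg * 1.0 * (p - 0) + n_pos * w_pos * (p - 1)) / n_samples

print(w_pos, round(ratio, 6), weighted_net)  # 99.0 99.0 0.0
```

With the weight applied, the single fraud row cancels the pull of the 99 legit rows, so the net gradient no longer favors "legit" by default.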

When class weights fail

Class weights fix the gradient imbalance. They do not fix the information imbalance: if you have 10 positive samples, no amount of weighting gives you 1,000 positive samples' worth of signal. The model still has few examples to learn from. Weights help, but they are not sufficient for extreme imbalance.


Fix 2 — oversampling and SMOTE

Instead of weighting, physically add more positive samples.

Random oversampling. Duplicate existing positive rows. Simple. But the model can memorize duplicates instead of learning the pattern.

SMOTE (Synthetic Minority Over-sampling Technique). For each positive sample, find its k nearest positive neighbors. Create synthetic points along the line between them.

   feature space (2D)

        ●  pos_1
       / \
      /   \
   ●──●s───●  pos_2         s = synthetic sample
      pos_3                  (interpolated between pos_3 and pos_2)

Worked example — 3 positive, 97 negative, SMOTE to 50/50

Original: 3 positive, 97 negative. SMOTE creates 94 synthetic positives (interpolating between the 3 originals and their neighbors). New training set: 97 positive, 97 negative.

   Step          Positives   Negatives   Ratio
   Original          3           97      1:32
   After SMOTE      97           97      1:1

The model now sees equal classes. Gradients are balanced. The synthetic points add diversity beyond pure duplication.
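The interpolation step can be sketched in plain Python. This is a simplified illustration, not a full SMOTE implementation: the function name `smote_sketch` and the 2D coordinates are hypothetical, and a real pipeline would more likely use a library implementation such as the one in imbalanced-learn. Note that with only 3 positives, each point has at most 2 positive neighbors, so k is 2 here.

```python
import random

def smote_sketch(positives, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between positive neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(positives))
        base = positives[i]
        # Rank the other positives by squared Euclidean distance to the base point.
        others = [p for j, p in enumerate(positives) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbor = rng.choice(others[:k])
        # Pick a random spot on the segment from base to neighbor.
        t = rng.random()
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(base, neighbor)))
    return synthetic

# Hypothetical 2D coordinates for the three original positives.
positives = [(1.0, 3.0), (3.0, 1.0), (0.0, 1.0)]
synthetic = smote_sketch(positives, n_new=94)
print(len(positives) + len(synthetic))  # 97
```

Every synthetic point lies on a segment between two real positives, which is why SMOTE cannot invent signal outside the region the originals already span.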

When SMOTE fails

  • High-dimensional sparse data (text, genomics). Interpolating in 10,000 dimensions creates meaningless points. Use class weights instead.
  • Overlapping classes. If positives and negatives sit on top of each other in feature space, SMOTE creates synthetic points inside the negative region. The decision boundary gets confused.
  • Apply only to train. Never SMOTE the validation or test set. That inflates the metric. The real world is still imbalanced.

Calibration after resampling — do not trust raw probabilities

SMOTE and oversampling shift the class prior. The model's probabilities are calibrated to the resampled distribution, not the real one.

So after resampling, always recalibrate on a held-out set with the original fraud ratio. Platt scaling or isotonic regression are the standard fixes.
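To see how far off the raw numbers can be, here is a sketch of a simpler technique than Platt scaling or isotonic regression: an analytic prior-shift correction that rescales the odds from the resampled class prior back to the real one. It assumes resampling changed only the class proportions, which holds only approximately for SMOTE, so treat it as an illustration of the size of the shift rather than a replacement for recalibrating on held-out data. The function name `correct_prior` is made up for this example.

```python
def correct_prior(p_resampled, prior_resampled, prior_true):
    """Rescale a probability from the resampled class prior to the true one.

    Assumes 0 < p_resampled < 1 and that only the class prior was changed.
    """
    odds = p_resampled / (1.0 - p_resampled)
    # Swap the resampled prior odds for the real-world prior odds.
    odds *= (prior_true / (1.0 - prior_true)) / (prior_resampled / (1.0 - prior_resampled))
    return odds / (1.0 + odds)

# A score of 0.5 from a model trained on a 50/50 set, real fraud rate 0.1%.
p = correct_prior(0.5, prior_resampled=0.5, prior_true=0.001)
print(round(p, 4))  # 0.001
```

A "coin flip" score on the balanced training set corresponds to roughly a 0.1% probability in production, which is why raw post-resampling scores should not be read as real-world probabilities.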


Fix 3 — undersampling the majority

The opposite move. Throw away majority-class samples until the classes balance.

Random undersampling. Drop random negative rows. Training set shrinks from 1000 rows (10 pos + 990 neg) to 20 (10 pos + 10 neg). Fast to train. But you discard 98% of your data, so the model has less to learn from.
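A minimal sketch of random undersampling using only the standard library. The function name `undersample` and the stand-in row ids are illustrative assumptions.

```python
import random

def undersample(rows, labels, seed=0):
    """Keep every positive row and an equally sized random sample of negatives."""
    rng = random.Random(seed)
    pos = [r for r, y in zip(rows, labels) if y == 1]
    neg = [r for r, y in zip(rows, labels) if y == 0]
    kept_neg = rng.sample(neg, len(pos))
    balanced = [(r, 1) for r in pos] + [(r, 0) for r in kept_neg]
    rng.shuffle(balanced)
    return balanced

rows = list(range(1000))            # stand-in row ids
labels = [1] * 10 + [0] * 990       # 10 positives, 990 negatives
balanced = undersample(rows, labels)
print(len(balanced), len(rows) - len(balanced))  # 20 980
```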

When undersampling wins. When you have millions of negatives and the compute budget for full training is unacceptable. Better to train on a balanced 200K subset than wait days for a full 10M-row imbalanced run. Common in ad-click prediction and fraud detection.


Fix 4 — threshold tuning

The model outputs a probability. The default threshold is 0.5. But why 0.5?

0.5 is arbitrary. On imbalanced data, the model's probabilities cluster near the base rate. If 0.1% of transactions are fraud, the model might output 0.002, 0.003, 0.001 for most rows. The few fraud cases get 0.008, 0.012, 0.015. A threshold of 0.5 calls everyone legit.

Lower the threshold. Set it to 0.005. Now the fraud cases (0.008, 0.012, 0.015) are caught. Some legit rows near 0.005–0.008 are falsely flagged. But you catch the fraud.

   model output distribution

   legit transactions:  |||||||||||||||||||  (clustered 0.000–0.004)
   fraud transactions:             |||       (clustered 0.006–0.015)

   threshold = 0.50:   catches 0 fraud, 0 false alarms    ← useless
   threshold = 0.005:  catches 3 fraud, 2 false alarms    ← useful
   threshold = 0.002:  catches 3 fraud, 8 false alarms    ← too noisy
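The same picture as code. The scores below are invented to match the sketch above; they are not output from a real model.

```python
# Hypothetical scores, chosen to match the distribution sketched above.
fraud_scores = [0.008, 0.012, 0.015]
legit_scores = [0.001] * 10 + [0.002, 0.002, 0.003, 0.003, 0.003, 0.004] + [0.006, 0.007]

def flag_counts(threshold):
    """Return (fraud caught, false alarms) when flagging scores >= threshold."""
    caught = sum(s >= threshold for s in fraud_scores)
    false_alarms = sum(s >= threshold for s in legit_scores)
    return caught, false_alarms

for t in (0.50, 0.005, 0.002):
    print(t, flag_counts(t))
# 0.5 (0, 0)
# 0.005 (3, 2)
# 0.002 (3, 8)
```

The model and its scores never change. Only the cut-off moves, and with it the trade between caught fraud and false alarms.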

How to pick the threshold

Three common methods:

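  1. Maximize F1. Sweep thresholds across the range of scores the model actually produces (on rare-event data that may be 0.001 to 0.05, so a fixed 0.01 to 0.99 grid can miss the useful region). At each, compute F1 on validation. Pick the threshold with highest F1. Works when precision and recall are equally valued.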

  2. Fix recall, maximize precision. "We must catch ≥95% of fraud." Find the highest threshold that still gives 95% recall on validation (going lower keeps the recall but usually costs precision). Report the resulting precision. This is how Stripe sets fraud thresholds.

  3. Cost-based. Assign dollar costs: false negative = $10,000 (missed fraud), false positive = $5 (extra review). At each threshold, compute total cost = FN × $10,000 + FP × $5. Pick the threshold that minimizes total cost.


Worked example — threshold sweep on 200 transactions

200 transactions. 10 fraud, 190 legit. Model outputs probabilities.

   Threshold   TP   FP   FN    TN   Precision   Recall      F1
   0.50         2    1    8   189       66.7%    20.0%   30.8%
   0.30         5    4    5   186       55.6%    50.0%   52.6%
   0.15         8   12    2   178       40.0%    80.0%   53.3%
   0.08         9   25    1   165       26.5%    90.0%   40.9%
   0.03        10   55    0   135       15.4%   100.0%   26.7%

  • At 0.50 — only 2 of 10 fraud caught. Terrible recall.
  • At 0.15 — 8 of 10 caught. F1 peaks.
  • At 0.03 — all 10 caught but 55 false alarms. Precision collapsed.

The right threshold depends on the cost of missing vs falsely flagging. There is no universally correct answer.
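As a sketch, the three selection rules can be run directly on the counts from the sweep table. The dollar costs are the ones from the cost-based method ($10,000 per missed fraud, $5 per false alarm); the helper names are illustrative.

```python
# Rows from the sweep table: threshold -> (TP, FP, FN, TN).
sweep = {
    0.50: (2, 1, 8, 189),
    0.30: (5, 4, 5, 186),
    0.15: (8, 12, 2, 178),
    0.08: (9, 25, 1, 165),
    0.03: (10, 55, 0, 135),
}

def precision(tp, fp, fn, tn):
    return tp / (tp + fp)

def recall(tp, fp, fn, tn):
    return tp / (tp + fn)

def f1(tp, fp, fn, tn):
    return 2 * tp / (2 * tp + fp + fn)

def cost(tp, fp, fn, tn, cost_fn=10_000, cost_fp=5):
    return fn * cost_fn + fp * cost_fp

# Method 1: maximize F1.
best_f1 = max(sweep, key=lambda t: f1(*sweep[t]))
# Method 2: among thresholds with recall >= 95%, take the most precise one.
meets_recall = [t for t in sweep if recall(*sweep[t]) >= 0.95]
best_recall = max(meets_recall, key=lambda t: precision(*sweep[t]))
# Method 3: minimize total dollar cost.
best_cost = min(sweep, key=lambda t: cost(*sweep[t]))

print(best_f1, best_recall, best_cost)  # 0.15 0.03 0.03
```

Same model, same table, two different answers: F1 prefers 0.15, while the recall floor and the cost rule both land on 0.03 because a missed fraud is priced so much higher than a false alarm.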


Pause and recall. Without scrolling — why does accuracy lie on imbalanced data? Name three fixes for class imbalance. What does threshold tuning actually change? If any link is fuzzy, scroll up.


Where this lives in the wild

  • Stripe fraud detection. Base fraud rate ≈ 0.1%. Default threshold catches nothing. They set threshold to achieve ≥97% recall, then optimize precision within that constraint. Threshold is re-tuned weekly as fraud patterns shift.
  • Google Safe Browsing. Flags malicious URLs. Prevalence ≈ 0.01%. Class weights + threshold tuning. False negative = user visits malware site. False positive = safe site blocked. Cost asymmetry drives threshold very low.
  • Account takeover detection. True attacks are rare, so platforms run very low thresholds and accept extra review prompts. High recall matters because one missed takeover can drain wallets, reset credentials, and trigger support escalations.
  • Ad click prediction (Meta, Google). Prevalence ≈ 1–3%. SMOTE is impractical at this scale (billions of rows). Class weights + threshold tuning. The threshold is effectively baked into the bidding formula (predicted P(click) × bid value), so calibration and threshold are intertwined.
  • Credit card default prediction. Default rate ≈ 2%. Banks use cost-based thresholds — a default costs $5,000, a rejected good customer costs $50 in lost revenue. The optimal threshold is far below 0.5.
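For the cost-based case, a standard decision-theory shortcut gives a rough sense of where the threshold lands. The sketch below assumes the model's probabilities are calibrated and that correct decisions cost nothing; neither assumption comes from the examples above, and the function name is made up.

```python
def break_even_threshold(cost_fp, cost_fn):
    """Probability above which flagging is cheaper in expectation than not flagging."""
    # Flag when p * cost_fn > (1 - p) * cost_fp, i.e. p > cost_fp / (cost_fp + cost_fn).
    return cost_fp / (cost_fp + cost_fn)

# Credit-default figures from the last bullet: $50 per rejected good customer,
# $5,000 per default.
t = break_even_threshold(cost_fp=50, cost_fn=5_000)
print(round(t, 4))  # 0.0099
```

With a 100:1 cost ratio the break-even point sits near 1%, which is what "far below 0.5" looks like in numbers. In practice the threshold would still be checked against validation data, since real scores are rarely perfectly calibrated.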

Interview Q&A

Q: Why is accuracy a bad metric for imbalanced classification?
A: Because a model that always predicts the majority class achieves accuracy equal to the majority proportion. On 99:1 imbalance, the lazy model scores 99% by doing nothing useful. Precision, recall, F1, or AUC-PR expose this lie because they measure performance on the minority class specifically.
Common wrong answer to avoid: "accuracy is always bad." Accuracy is fine when classes are balanced. It fails specifically when one class dominates.

Q: When would you use SMOTE vs class weights?
A: SMOTE when you have enough positives to interpolate meaningfully (say, 50+), low-to-moderate dimensionality, and classes that are separable in feature space. Class weights when data is high-dimensional or sparse (text, genomics), when positives are extremely few (<20), or when you need to avoid creating synthetic artifacts. In most production tabular pipelines, class weights are simpler and almost as effective.
Common wrong answer to avoid: "always use SMOTE because more data is better." SMOTE on 5 positives in 10,000 dimensions creates meaningless interpolations.

Q: How do you choose a classification threshold?
A: Start from the business constraint. If the cost of missing a positive is 1000× the cost of a false alarm, set the threshold to achieve high recall and accept low precision. If costs are symmetric, maximize F1 on the validation set. If you have explicit dollar costs for FP and FN, minimize total cost = FN × cost_fn + FP × cost_fp. Never use 0.5 as default on imbalanced data without checking.
Common wrong answer to avoid: "use 0.5 because that's the standard." 0.5 is a mathematical convenience, not a principled choice. It only makes sense when classes are balanced and costs are symmetric.

Q: Should I balance the test set?
A: Never. The test set must reflect the real-world distribution. If production sees 1% positives, the test set should too. Balance only the training set (via weights, oversampling, or undersampling). Balancing the test set gives you a metric that has nothing to do with production performance.
Common wrong answer to avoid: "yes, otherwise the model does not get a fair chance." The test set is not there to help the model. It is there to tell you the truth about production.
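A small sketch of why a balanced test set misleads. The classifier here is hypothetical (it catches 90% of positives and flags 5% of negatives); only the share of positives in the test set changes.

```python
def precision_at(prevalence, tpr, fpr):
    """Precision implied by a classifier's TPR and FPR at a given positive rate."""
    true_pos = prevalence * tpr
    false_pos = (1.0 - prevalence) * fpr
    return true_pos / (true_pos + false_pos)

# Hypothetical classifier: catches 90% of positives, flags 5% of negatives.
balanced = precision_at(0.50, tpr=0.90, fpr=0.05)
production = precision_at(0.01, tpr=0.90, fpr=0.05)
print(round(balanced, 3), round(production, 3))  # 0.947 0.154
```

The same model reports about 95% precision on a 50/50 test set and about 15% at a 1% positive rate. Only the second number describes what production will see.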


Apply now (5 min)

Scenario. You build a model to detect defective parts on a factory line. Defect rate = 0.5% (5 in 1000). The model outputs probabilities.

By hand:

  1. Write the confusion matrix for a "predict all good" baseline. Compute accuracy. Notice it is 99.5%.
  2. If a missed defect costs $50,000 (faulty part ships, recall, lawsuits) and a false alarm costs $100 (extra inspection), what is the cost ratio? (500:1)
  3. Given that ratio, should your threshold be above or below 0.5? (Far below.)
  4. If you set threshold = 0.02 and get TP = 4, FP = 30, FN = 1, TN = 965 — compute precision, recall, and total cost.
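Once you have worked step 4 by hand, a few lines can confirm the numbers. The baseline cost line is an extra comparison, assuming the "predict all good" model misses all 5 defects.

```python
tp, fp, fn, tn = 4, 30, 1, 965     # counts from step 4
cost_fn, cost_fp = 50_000, 100     # dollar costs from step 2

precision = tp / (tp + fp)
recall = tp / (tp + fn)
total_cost = fn * cost_fn + fp * cost_fp
baseline_cost = (tp + fn) * cost_fn   # "predict all good" misses every defect

print(round(precision, 3), recall, total_cost, baseline_cost)  # 0.118 0.8 53000 250000
```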

Then — without looking — sketch from memory:

  1. The lazy-model confusion matrix (all predicted legit).
  2. The threshold sweep table shape (precision drops as recall rises).
  3. One sentence: why never balance the test set.

If you can do all three in 90 seconds, you own this.


Bridge. You now have metrics that do not lie and thresholds tuned to business cost. But there are still entire model families we skipped — margin-based classifiers, distance-based learners, probabilistic models, and unsupervised methods. Read 15-svm.md next.