02. Bias and variance: naming the two diseases
Five minutes. Two pictures. One diagnostic table you will use for the rest of your career.
Built on the ELI5 in 00-eli5.md. Underfitting, overfitting, and the training-set-vs-production picture are called back here.
Two models. Both wrong. Different reasons.
Picture two failing house-price models.
Model A predicts the citywide average price for every house. One blunt rule. It is wrong on most homes, and wrong in the same systematic way every time: it underprices prime localities and overprices weak ones. Steady. Confident. Wrong.
Model B has memorized 4000 past homes. It sees a blue gate on Elm Street and says "This is exactly like house #2871 — ₹82 lakh." Training homes — perfect. New homes — all over the place. Brilliant on the past. Useless on the future.
A is underfitting. High bias. Too rigid. B is overfitting. High variance. Too eager to memorize.
Same surface result — bad performance. Opposite causes. Opposite fixes. If you cannot tell which disease, every fix is a guess.
The dart-thrower picture
Forget house prices for a second. Imagine throwing darts at a bullseye. Two ways to miss.
- LOW BIAS, LOW VAR (the goal): all hits on the bullseye.
- LOW BIAS, HIGH VAR (right plan, shaky hand): darts scattered around the bullseye.
- HIGH BIAS, LOW VAR (steady aim, wrong spot): tight cluster, off-target.
- HIGH BIAS, HIGH VAR (the worst case): scattered AND off.
Bias = consistent miss in one direction. Steady hand pulled toward the wrong spot. Variance = scattered hits. Right aim, shaky hand.
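The dart picture can be put into numbers. The sketch below is illustrative only: the aim offsets, the spread values, and the helper names `throw_darts` and `bias_and_variance` are assumptions made for this example. It measures bias as the distance from the average hit to the bullseye and variance as the scatter around that average hit.

```python
import numpy as np

def throw_darts(aim, spread, n=2000, seed=0):
    """Simulate n dart hits as 2-D points: a fixed aim point plus Gaussian scatter."""
    rng = np.random.default_rng(seed)
    return np.asarray(aim, dtype=float) + spread * rng.standard_normal((n, 2))

def bias_and_variance(hits, bullseye=(0.0, 0.0)):
    """Bias: distance from the mean hit to the bullseye.
    Variance: total scatter of the hits around their own mean."""
    centre = hits.mean(axis=0)
    bias = float(np.linalg.norm(centre - np.asarray(bullseye)))
    variance = float(hits.var(axis=0).sum())
    return bias, variance

# Hypothetical settings for the four quadrants.
quadrants = {
    "low bias, low var": throw_darts(aim=(0, 0), spread=0.2),
    "low bias, high var": throw_darts(aim=(0, 0), spread=2.0),
    "high bias, low var": throw_darts(aim=(3, -3), spread=0.2),
    "high bias, high var": throw_darts(aim=(3, -3), spread=2.0),
}
for name, hits in quadrants.items():
    bias, variance = bias_and_variance(hits)
    print(f"{name:20s} bias={bias:.2f} variance={variance:.2f}")
```

Moving the aim point changes only the bias number. Changing the spread changes only the variance number. Two separate knobs.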
That is the whole intuition. Now we name it.
The formula under the picture
The expected squared prediction error splits into three pieces:
Expected squared error = Bias² + Variance + σ²_noise
Bias² = the model's average miss because it is too rigid.
Variance = how much the model's answer wiggles across different training sets.
σ²_noise = irreducible error — also called Bayes error. This is the noise no model can remove: bad measurements, hidden variables, random shocks, plain world messiness.
So what to do with this formula? Reduce bias if the model is too simple. Reduce variance if the model is too twitchy. But do not promise zero error — σ²_noise is still sitting there.
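One way to convince yourself the three pieces really add up is to simulate them. This is a minimal sketch under stated assumptions: the true function is taken to be y = x², the noise level `NOISE_SD`, the test input `X0`, and the helper names are all invented for the example. It refits a polynomial on many fresh training sets, then compares bias² + variance + noise against the directly measured squared error at one test point.

```python
import numpy as np

rng = np.random.default_rng(0)
NOISE_SD = 0.5      # assumed sigma_noise
X0 = 0.0            # test input where the error is measured
TRUE_Y0 = X0 ** 2   # assumed true function: y = x^2

def predict_at_x0(degree, n_train=20):
    """Draw a fresh noisy training set, fit a polynomial, predict at X0."""
    x = rng.uniform(-3, 3, n_train)
    y = x ** 2 + rng.normal(0, NOISE_SD, n_train)
    return np.polyval(np.polyfit(x, y, degree), X0)

def decompose(degree, n_repeats=2000):
    """Estimate bias^2, variance, and the total expected squared error at X0."""
    preds = np.array([predict_at_x0(degree) for _ in range(n_repeats)])
    bias_sq = (preds.mean() - TRUE_Y0) ** 2
    variance = preds.var()
    # Fresh noisy targets at X0, independent of every training set.
    noisy_targets = TRUE_Y0 + rng.normal(0, NOISE_SD, n_repeats)
    total = np.mean((noisy_targets - preds) ** 2)
    return float(bias_sq), float(variance), float(total)

for degree in (1, 2):
    bias_sq, variance, total = decompose(degree)
    pieces = bias_sq + variance + NOISE_SD ** 2
    print(f"degree {degree}: bias^2={bias_sq:.3f} variance={variance:.3f} "
          f"noise={NOISE_SD ** 2:.3f} sum={pieces:.3f} measured={total:.3f}")
```

The sum and the measured error should agree up to simulation noise. The straight line (degree 1) carries almost all of its error in the bias² term. The quadratic drops bias² to near zero, but the noise term stays put.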
Reading the loss curves: the 30-second diagnostic
You will not see darts in production. You will see numbers: train loss and validation loss across epochs (or across model complexity). Three pictures cover the common cases.
Underfit (high bias). Both curves bad. Both flat.
loss
^
| train ────────────────────
| val ──────────────────── (gap small, both high)
|
+───────────────────────────→ epochs
Both lines plateau at high loss. Train and validation are both poor. The model cannot even fit what you showed it. Underfitting.
Overfit (high variance). Train great. Validation diverges.
loss
^
| val ─────╲      ╱──── (val rises again)
|           ╲    ╱
|            ╲  ╱
|             ╲╱
| train ─────────────────╲___ (train keeps falling)
|
+───────────────────────────→ epochs
Train loss keeps falling. Val loss bottoms out, then rises. The gap is the disease. Overfitting.
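The point where val loss bottoms out is the checkpoint you want to keep. A minimal sketch of that idea, assuming you have a plain list of per-epoch validation losses. The function name `best_epoch`, the `patience` parameter, and the loss values are hypothetical, not taken from any particular framework.

```python
def best_epoch(val_losses, patience=3):
    """Return (epoch_index, val_loss) of the best checkpoint seen so far.

    Scanning stops once validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_idx, best_loss, bad_epochs = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_idx, best_loss, bad_epochs = epoch, loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_idx, best_loss

# Made-up validation curve: falls, bottoms out, then rises again.
val = [0.90, 0.70, 0.55, 0.48, 0.45, 0.47, 0.52, 0.60, 0.71]
print(best_epoch(val))  # -> (4, 0.45)
```

Everything after the returned epoch is the model memorizing. Train loss would still be falling there, which is exactly why you cannot read this off the train curve.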
Balanced. Both fall together. Both plateau at a similar low value.
loss
^
| val ───╲
| ╲___
| train ──╲__ ─────── (small gap, both low)
|
+───────────────────────────→ epochs
Train and val ride down together. Small steady gap. Production is happy.
So the rule fits on a sticky note:
Both bad → bias. Train good, val bad → variance. Both good → ship.
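The sticky note translates almost word for word into code. This is a rough sketch, not a standard API: `diagnose`, `acceptable_loss`, and `max_gap` are names invented here, and both thresholds are problem-specific values you would have to choose yourself.

```python
def diagnose(train_loss, val_loss, acceptable_loss, max_gap):
    """Map a (train, val) loss pair onto the sticky-note rule.

    acceptable_loss: highest training loss you would still call "good".
    max_gap: largest val-minus-train gap you would still call "close".
    """
    if train_loss > acceptable_loss:
        # The model cannot even fit what it saw.
        return "high bias (underfit)"
    if val_loss - train_loss > max_gap:
        # Fits the training set, fails to carry over.
        return "high variance (overfit)"
    return "balanced"

print(diagnose(0.90, 0.95, acceptable_loss=0.2, max_gap=0.1))  # high bias (underfit)
print(diagnose(0.10, 0.40, acceptable_loss=0.2, max_gap=0.1))  # high variance (overfit)
print(diagnose(0.10, 0.12, acceptable_loss=0.2, max_gap=0.1))  # balanced
```

The function checks the level first and the gap second. A model with high training loss and a wide gap has both diseases, and this sketch reports only the bias.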
Why a high-bias model cannot be saved by more data
Now the load-bearing claim. Adding more data does not fix underfitting. People get this wrong constantly. So we prove it with numbers.
Setup. The true relationship is a curve — y = x². We fit a straight line y = w·x + b. The straight line cannot bend. Watch what happens as we throw more data at it.
Attempt 1: fit on 10 points
Sample 10 points along y = x² from x = -3 to 3 (with tiny noise). Fit the best straight line by least squares.
| Best line | Train RMSE | Val RMSE |
|---|---|---|
| y ≈ 0·x + 3.0 | 2.7 | 2.8 |
The line goes through the average y. Useless for any specific x. Both errors high. Both close. Classic underfit.
Attempt 2: fit on 100 points
Same true curve. 10× more samples.
| Best line | Train RMSE | Val RMSE |
|---|---|---|
| y ≈ 0·x + 3.0 | 2.7 | 2.7 |
Same line. Same error. The extra 90 points just confirmed what 10 already said: over a range symmetric about zero, the best straight line through a parabola is the horizontal line at the mean. No improvement.
Attempt 3: fit on 1000 points
Same true curve. 100× more samples than attempt 1.
| Best line | Train RMSE | Val RMSE |
|---|---|---|
| y ≈ 0·x + 3.0 | 2.7 | 2.7 |
Still the same. We could go to a million. The line still cannot bend. The error floor is structural, not statistical.
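The three attempts are easy to rerun yourself. A minimal sketch with NumPy, assuming uniformly random x in [-3, 3], a noise level of 0.1, and a large held-out validation sample. Those choices and the helper name `line_fit_rmse` are assumptions for this example. Exact figures depend on the random draw, so they will not match the tables digit for digit, especially at 10 points.

```python
import numpy as np

def line_fit_rmse(n_train, n_val=5000, noise_sd=0.1, seed=0):
    """Fit y = w*x + b by least squares to noisy samples of y = x^2.

    Returns (slope, intercept, train_rmse, val_rmse).
    """
    rng = np.random.default_rng(seed)

    def sample(n):
        x = rng.uniform(-3, 3, n)
        return x, x ** 2 + rng.normal(0, noise_sd, n)

    def rmse(x, y):
        return float(np.sqrt(np.mean((slope * x + intercept - y) ** 2)))

    x_train, y_train = sample(n_train)
    x_val, y_val = sample(n_val)
    slope, intercept = np.polyfit(x_train, y_train, 1)
    return float(slope), float(intercept), rmse(x_train, y_train), rmse(x_val, y_val)

for n in (10, 100, 1000):
    slope, intercept, train_rmse, val_rmse = line_fit_rmse(n)
    print(f"n={n:5d}  y ≈ {slope:+.2f}·x + {intercept:.2f}  "
          f"train RMSE={train_rmse:.2f}  val RMSE={val_rmse:.2f}")
```

As n grows, the slope settles near 0 and the intercept near 3. The validation RMSE does not drop below roughly 2.7, however many rows you add.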
The structural reason. A straight line has two parameters — slope and intercept. A parabola has curvature that no slope-and-intercept can express. The hypothesis class is too small. More data sharpens your estimate of the best line within that class — but the best line in that class is already bad.
So what to do? Add capacity. Switch to y = a·x² + b·x + c. Three parameters. Now the model can bend. Train on the same 10 points and the error collapses. Capacity, not data, fixed bias.
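The capacity fix can be checked on the same kind of tiny sample. A sketch under assumed settings (10 random training points, noise level 0.1, a large validation sample; the variable and function names are illustrative): fit a line and a quadratic to the same 10 points and compare validation error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ten noisy training points from the assumed true curve y = x^2.
x_train = rng.uniform(-3, 3, 10)
y_train = x_train ** 2 + rng.normal(0, 0.1, 10)

# A large validation sample from the same curve.
x_val = rng.uniform(-3, 3, 5000)
y_val = x_val ** 2 + rng.normal(0, 0.1, 5000)

def val_rmse(degree):
    """Fit a polynomial of the given degree on the 10 training points."""
    coeffs = np.polyfit(x_train, y_train, degree)
    return float(np.sqrt(np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)))

line_rmse = val_rmse(1)   # two parameters: cannot bend
quad_rmse = val_rmse(2)   # three parameters: can bend
print(f"line val RMSE={line_rmse:.2f}  quadratic val RMSE={quad_rmse:.2f}")
```

Same 10 points. One extra parameter. The quadratic's error should fall to roughly the noise level, while the line stays stuck near its structural floor.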
The mirror lesson. Variance is the opposite — more data does help, because variance comes from the model latching onto random noise in a small training set. More data drowns the noise. But adding capacity to a high-variance model makes things worse. Opposite diseases. Opposite fixes.
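The mirror lesson can be simulated too. The sketch below takes a deliberately over-flexible model, a degree-9 polynomial, and measures how much its prediction at one input wiggles across many fresh training sets. The degree, the noise level, the test input `x0`, and the helper name `prediction_variance` are all assumptions chosen for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

def prediction_variance(n_train, degree=9, x0=2.5, n_repeats=300,
                        noise_sd=1.0, seed=0):
    """Variance of the model's prediction at x0 across fresh training sets."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_repeats):
        x = rng.uniform(-3, 3, n_train)
        y = x ** 2 + rng.normal(0, noise_sd, n_train)
        # Polynomial.fit rescales x internally, which keeps degree 9 well-conditioned.
        preds.append(Polynomial.fit(x, y, degree)(x0))
    return float(np.var(preds))

for n in (15, 500):
    print(f"n={n:4d}  variance of prediction at x0: {prediction_variance(n):.3f}")
```

Same twitchy model both times. Only the amount of data changed, and the wiggle shrinks by a large factor. That is the opposite of what happened with the straight line.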
Where this lives in the wild
Bias-variance is not a textbook chart. It is the daily knob real teams turn.
- Stripe Radar — gradient-boosted trees with tuned depth. Tree depth is the bias-variance dial. Shallow trees → high bias, miss interactions like "new merchant and high amount". Deep trees → high variance, memorize the last fraud ring and miss the next. Stripe tunes depth + number of trees against a held-out fraud set so the voting panel finds the sweet spot.
- Zillow Zestimate. Early years — too rigid. Linear-ish models underfit micro-markets, large systematic errors in unusual neighborhoods. Later — overfit during stable markets, then exploded when 2021 prices shifted. Same model. Variance disease showed only when the world moved.
- Affirm credit decisioning. Heavy regularization deliberately. Underwriting models must hold up on customers who do not look like training data. They accept slightly higher bias in exchange for radically lower variance — a model that memorizes 2024 borrowers will deny good 2026 borrowers. Stable beats clever.
- DoorDash ETA prediction. Bias from missing features dominates. If the model does not know "the restaurant is small and the kitchen is backed up", no amount of data or capacity rescues the estimate. The team's wins came from adding signals (live order count at the merchant) — bias-fix by feature engineering, not regularization.
Four products. Same equation. Different language.
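The "accept a little bias, buy a lot of stability" trade can be seen in a toy version. This is not any company's actual method. It is a minimal sketch of L2 (ridge) regularization on polynomial features, with every setting (degree 9, 15 training points, noise level 1.0, the two penalty strengths, the helper names) assumed for illustration.

```python
import numpy as np

def ridge_predict(x_train, y_train, x0, degree, lam):
    """Closed-form ridge regression on polynomial features; predict at x0.

    x is divided by 3 so features lie in [-1, 1]. The penalty lam is applied
    to every coefficient, including the intercept, to keep the sketch short.
    """
    A = np.vander(x_train / 3.0, degree + 1)
    w = np.linalg.solve(A.T @ A + lam * np.eye(degree + 1), A.T @ y_train)
    return float((np.vander([x0 / 3.0], degree + 1) @ w)[0])

def bias_sq_and_variance(lam, degree=9, n_train=15, x0=2.5,
                         n_repeats=500, seed=0):
    """Estimate bias^2 and variance of the prediction at x0 for a given penalty."""
    rng = np.random.default_rng(seed)
    preds = np.empty(n_repeats)
    for i in range(n_repeats):
        x = rng.uniform(-3, 3, n_train)
        y = x ** 2 + rng.normal(0, 1.0, n_train)
        preds[i] = ridge_predict(x, y, x0, degree, lam)
    return float((preds.mean() - x0 ** 2) ** 2), float(preds.var())

for lam in (1e-6, 1.0):
    bias_sq, variance = bias_sq_and_variance(lam)
    print(f"lam={lam:g}  bias^2={bias_sq:.3f}  variance={variance:.3f}")
```

With the heavier penalty the variance should drop sharply. Bias² typically grows in exchange, though the exact numbers depend on the assumed setup. Regularization strength is one more bias-variance dial.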
Pause and recall. Without scrolling — name each disease in house-pricing words. Average-price-for-every-home model = which one? Memorize-every-house model = which one? Read the train/val rule for each. State why adding 100× data did not move the straight line. If any link is fuzzy, scroll back.
Interview Q&A
Q: How do you tell underfit from overfit just from the loss curves?
A: Compare train and validation. Both bad and close → underfit (high bias). Train good and val noticeably worse → overfit (high variance). Both good and close → balanced. The signal is the gap combined with the level, never one alone.
Common wrong answer to avoid: "high training loss means overfitting." Wrong direction. High training loss means the model cannot even fit what it saw. That is bias, not variance.
Q: Will adding more training data fix overfitting? What about underfitting?
A: More data attacks variance, not bias. Variance comes from the model latching onto noise in a small sample. More data drowns the noise, and the validation gap narrows. Bias comes from a model class too small to express the truth. More data only sharpens an already-bad estimate.
Common wrong answer to avoid: "more data always helps." It only helps the high-variance disease. Throw 1M rows at a linear model fitting a quadratic and you get the same error as with 10 rows.
Q: Will adding capacity (deeper tree, more layers) fix underfitting?
A: Yes. That is the bias fix. More parameters let the model bend through patterns the simpler one could not. But push it too far and you slide into overfitting. The cure for one disease is the cause of the other.
Common wrong answer to avoid: "always go bigger." Capacity without regularization or enough data flips you straight into high variance.
Q: Train loss is 0.1, validation loss is 0.4. What is your move?
A: Diagnosis: high variance, so the model is overfitting. Fix direction: add regularization (L2, dropout, early stopping), get more data, reduce capacity, or move to an ensemble. Do not add features or layers. Confirm the validation set looks like production before celebrating any improvement.
Common wrong answer to avoid: "The model needs more capacity because validation is worse." Wrong direction. A bigger model usually widens the gap further.
Apply now (5 min)
Take a blank sheet. Without looking, sketch from memory:
- The four-quadrant dart picture — low/high bias × low/high variance, with a one-line caption per quadrant.
- The three loss curves — underfit, overfit, balanced.
- The one-sentence rule: both bad → bias, train good val bad → variance, both good → ship.
- The 1000-point straight-line attempt — why error did not move from the 10-point case.
If you can reproduce all four in under two minutes, you own the detection vocabulary. The rest of this module is what to do once you have named the disease.
Bridge. You can now spot the disease. Next question: what shape does each model family draw, and which shape suits which problem? Read 03-model-shapes.md next.