04. Regularization — the soft leash that cures overthinking¶
Five minutes. Two shapes. Why one shape kills weights and the other only shrinks them.
Built on the ELI5 in 00-eli5.md. Overfitting is the disease. Regularization is the cure. Keep that picture close.
The picture before the formula¶
See. The model has too much rope. Weights swing to large values to chase every wiggle in training data. That is overfitting from ELI5 — a model that memorized noise in the training set.
So what to do? Tie every weight back toward zero with a tiny spring.
Picture a balance scale. Big weights tip the scale violently — one rare feature drags the whole price prediction. Small weights tip gently — many features contribute, none dominates. A spring on every weight pulls toward zero. The data pulls outward to fit the signal. Equilibrium is where signal beats spring.
weight = 0                    weight = 5
    |                         ━━━━━━●
●━━━━━●                       |
    |                         |   <-- balance tips violently
gentle tip                    one noisy feature dominates prediction
That is regularization. Not punishment. Preference. Among equally good fits, prefer the simpler explanation.
The formula, in one breath¶
Add a penalty term to the loss. The penalty grows with weight size.
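The two penalized losses this section compares can be written as follows. This is a sketch in assumed notation: $\mathcal{L}_{\text{data}}$ stands for the unpenalized training loss and $w_i$ for the individual weights. The exact scaling of $\lambda$ differs between libraries.

$$
\mathcal{L}_{\text{L1}}(w) = \mathcal{L}_{\text{data}}(w) + \lambda \sum_i |w_i|
\qquad
\mathcal{L}_{\text{L2}}(w) = \mathcal{L}_{\text{data}}(w) + \lambda \sum_i w_i^2
$$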
λ is the spring stiffness. Small λ → weak spring, weights free. Large λ → stiff spring, weights crushed toward zero.
How do you choose λ in practice?¶
Cross-validation. Sweep 1e-4, 1e-3, 1e-2, ..., 1e2, pick the λ that minimizes validation loss, and if several are nearly tied, prefer the larger λ — simpler model, same accuracy.
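The sweep-and-pick rule can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a production recipe: a single hold-out split stands in for full k-fold cross-validation, the model is a one-feature ridge fit without intercept, the toy data is made up, and the helper names (`fit_ridge_1d`, `mse`, `pick_lambda`) and the 1% tie tolerance are hypothetical.

```python
def fit_ridge_1d(xs, ys, lam):
    """Ridge slope for one feature, no intercept.

    Minimizes sum((y - w*x)^2) + lam * w^2, which has a closed form.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def mse(xs, ys, w):
    """Mean squared error of the prediction w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def pick_lambda(val_loss, rel_tol=0.01):
    """Lowest validation loss wins; among near-ties, take the largest lambda."""
    best = min(val_loss.values())
    near_ties = [lam for lam, loss in val_loss.items()
                 if loss <= best * (1.0 + rel_tol)]
    return max(near_ties)


# Hypothetical toy data: one feature, a train split and a validation split.
train_x, train_y = [1.0, 2.0, 3.0, 4.0], [1.3, 1.8, 3.4, 3.9]
val_x, val_y = [1.5, 2.5, 3.5], [1.5, 2.4, 3.6]

grid = [10.0 ** k for k in range(-4, 3)]  # 1e-4, 1e-3, ..., 1e2
val_loss = {lam: mse(val_x, val_y, fit_ridge_1d(train_x, train_y, lam))
            for lam in grid}
chosen = pick_lambda(val_loss)
print("chosen lambda:", chosen)
```

The tie-break lives entirely in `pick_lambda`: any λ whose validation loss is within the tolerance of the best counts as tied, and the largest of those is returned.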
Two formulas, almost identical. But the shape of the penalty decides everything.
L1 diamond vs L2 circle — the shape that decides¶
The penalty defines a fence around the origin in weight space. The model can live inside or on the fence. The data pulls outward; the fence pushes back.
L1 (diamond, |w1|+|w2| ≤ t)         L2 (circle, w1²+w2² ≤ t)
          w2                                  w2
          ↑                                   ↑
          ◇                               ●●●●|●●●●
        ◇ | ◇                           ●     |     ●
      ◇   |   ◇                        ●      |      ●
 ───◇─────+─────◇───→ w1             ─●───────+───────●─→ w1
      ◇   |   ◇                        ●      |      ●
        ◇ | ◇                           ●     |     ●
          ◇                               ●●●●|●●●●
          |                                   |
corners SIT ON THE AXES             smooth boundary, NO corners
loss valley first hits a            loss valley first hits a
corner → one weight = 0             smooth point → both weights ≠ 0
Now overlay the loss contours: concentric ovals around the unconstrained best fit. Picture them growing outward from that best fit until they first touch the fence.
L1, contour touches a corner: the contour first meets the diamond at the corner (w1 = 0, w2 = t). Tangent at the corner → w1 = 0 exactly.
L2, contour touches the side: the contour first meets the circle at a non-axis point. Tangent on the smooth arc → both w1, w2 nonzero.
That corner is the whole secret. A diamond has corners on the axes. The axes are the places where one weight is exactly zero. So L1 keeps deleting weights. The circle has no corners — it touches the loss at a generic angle, so all weights stay alive, just smaller.
This is L1's sparsity. Not "smaller weights" — exactly zero weights.
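The same effect shows up algebraically in a one-dimensional sketch. Assume a single weight whose unpenalized optimum is $a$, with loss $\tfrac{1}{2}(w-a)^2$ (an illustrative simplification, not the general case). The L1 penalty pulls with constant force $\lambda$ however small $w$ gets, while the L2 pull $2\lambda w$ fades as $w$ approaches zero:

$$
\frac{d}{dw}\,\lambda |w| = \lambda\,\operatorname{sign}(w) \quad (w \neq 0),
\qquad
\frac{d}{dw}\,\lambda w^2 = 2\lambda w
$$

Minimizing the penalized loss in each case gives:

$$
w^{*}_{\text{L1}} = \operatorname{sign}(a)\,\max(|a| - \lambda,\, 0),
\qquad
w^{*}_{\text{L2}} = \frac{a}{1 + 2\lambda}
$$

So any weight with $|a| \le \lambda$ lands exactly on zero under L1, whereas the L2 solution is zero only if $a$ itself is zero.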
Worked example — three features, three λ values¶
Three features. Tiny linear regression. Targets fit by hand-tuned weights. Watch what happens as we crank λ.
Suppose unregularized least squares gives w1 = 0.90, w2 = 0.40, w3 = 0.05.
The model in ELI5 has three signals. Signal 1 is sharp. Signal 2 is decent. Signal 3 is mostly junk.
Sweep — Ridge (L2)¶
| λ | w1 | w2 | w3 | Note |
|---|---|---|---|---|
| 0.0 | 0.90 | 0.40 | 0.05 | unregularized |
| 0.5 | 0.72 | 0.30 | 0.03 | all shrunk smoothly |
| 2.0 | 0.45 | 0.18 | 0.015 | all shrunk more |
| 5.0 | 0.22 | 0.08 | 0.006 | all small, all alive |
L2 shrinks every weight smoothly toward zero. The noise weight w3 gets very small but never reaches zero. The model still uses signal 3, just whispers about it.
Sweep — Lasso (L1)¶
| λ | w1 | w2 | w3 | Note |
|---|---|---|---|---|
| 0.0 | 0.90 | 0.40 | 0.05 | unregularized |
| 0.5 | 0.78 | 0.30 | 0.00 | w3 SNAPPED to 0 |
| 2.0 | 0.55 | 0.10 | 0.00 | w2 nearly gone, w3 dead |
| 5.0 | 0.20 | 0.00 | 0.00 | only w1 survives |
L1 deletes features. At λ = 0.5, signal 3 is gone — the model refuses to use it at all. Crank further, signal 2 dies too. This is the soft leash that the voting panel from ELI5 needed — a budget that says "do not over-rely on weak signals."
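The qualitative pattern in both sweeps (Ridge shrinks, Lasso snaps to zero) can be reproduced with a short sketch. It rests on a strong assumption that the text does not make: the three features are treated as orthonormal, so each weight can be solved on its own in closed form. The function names are hypothetical, and because the λ scale depends on how the loss is parametrized, the printed numbers will not match the hand-picked tables above. Only the shape of the behavior carries over.

```python
import math


def soft_threshold(a, lam):
    """L1 solution for one weight: minimizes 0.5*(w - a)**2 + lam*abs(w)."""
    return math.copysign(max(abs(a) - lam, 0.0), a)


def ridge_shrink(a, lam):
    """L2 solution for one weight: minimizes 0.5*(w - a)**2 + lam*w**2."""
    return a / (1.0 + 2.0 * lam)


# Unregularized weights taken from the lambda = 0 row of the sweep tables.
w_ols = [0.90, 0.40, 0.05]

for lam in (0.0, 0.1, 0.5):
    ridge = [round(ridge_shrink(a, lam), 3) for a in w_ols]
    lasso = [round(soft_threshold(a, lam), 3) for a in w_ols]
    print(f"lam={lam}: ridge={ridge} lasso={lasso}")
```

Under these assumptions the Lasso column loses w3 at λ = 0.1 and w2 at λ = 0.5, while every Ridge weight stays strictly positive at every λ.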
Sweep — ElasticNet (mix)¶
ElasticNet kills the noise weight (L1 part) but shrinks the rest smoothly (L2 part). ElasticNet is a strong tabular default when you want both sparsity and stability.
Why corners produce sparsity — three coordinate examples¶
The diamond |w1| + |w2| ≤ t has four corners: (t,0), (-t,0), (0,t), (0,-t). Every corner sits exactly on one axis. The other coordinate is zero.
Consider three loss-contour orientations and where they first touch the diamond:
Case A — loss valley pulled toward (+w1)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
contour grows outward from the best fit, first hits corner (t, 0)
→ w1 = t, w2 = 0 (sparse, w2 deleted)
Case B — loss valley pulled toward (+w2)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
contour first hits corner (0, t)
→ w1 = 0, w2 = t (sparse, w1 deleted)
Case C — loss valley pulled diagonally
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
contour first hits the EDGE between (t,0) and (0,t)
→ w1 ≠ 0, w2 ≠ 0 (less common: needs a near-diagonal pull where neither weight dominates)
When the loss is pulled toward one axis clearly more than the other, the contour kisses a corner. That corner is a sparse solution. Hit a corner = delete a weight. The diamond geometry makes this common. The circle has no corners, so it never forces it.
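The three cases can be checked numerically with a small sketch. It assumes circular loss contours (an isotropic quadratic loss) rather than general ovals, because then the point where the contour first touches the fence is simply the closest point of the fence to the unconstrained optimum. The helper names and the example centers are hypothetical. The diamond projection uses the standard sort-and-threshold method.

```python
import math


def project_l1_ball(v, t):
    """Closest point to v inside the diamond sum(|w_i|) <= t."""
    if sum(abs(x) for x in v) <= t:
        return list(v)
    mags = sorted((abs(x) for x in v), reverse=True)
    running, theta = 0.0, 0.0
    for k, m in enumerate(mags, start=1):
        running += m
        candidate = (running - t) / k
        if m > candidate:
            theta = candidate  # threshold if the k largest weights stay nonzero
    # Shrink every coordinate by theta; anything smaller than theta becomes 0.
    return [math.copysign(max(abs(x) - theta, 0.0), x) for x in v]


def project_l2_ball(v, r):
    """Closest point to v inside the circle sqrt(sum(w_i^2)) <= r."""
    norm = math.sqrt(sum(x * x for x in v))
    return list(v) if norm <= r else [x * r / norm for x in v]


# Unconstrained optimum (center of the contours) for each case, with t = r = 1.
cases = {
    "A: pulled toward +w1": (2.0, 0.5),
    "B: pulled toward +w2": (0.5, 2.0),
    "C: pulled diagonally": (2.0, 2.0),
}
for name, center in cases.items():
    print(name, project_l1_ball(center, 1.0), project_l2_ball(center, 1.0))
```

In cases A and B the diamond solution sits on a corner with one coordinate exactly zero, and in case C it sits on the edge with both coordinates nonzero. The circle solution keeps both coordinates nonzero in all three cases.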
Simple, no?
Two equally-good solutions — why L1 picks the sparse one¶
Suppose two weight choices fit the data equally well:
| Solution | Weights | L1 penalty | L2 penalty |
|---|---|---|---|
| A — spread | (4, 4) | 4+4 = 8 | 16+16 = 32 |
| B — sparse | (8, 0) | 8+0 = 8 | 64+0 = 64 |
L1 sees both as equal — penalty 8 either way. When the data nudges slightly toward sparse, L1 takes B. L2 strongly prefers A — 32 is half of 64. L1 invites sparsity. L2 fights it.
ElasticNet — why production uses both¶
Pure L1 has a problem. When two features are highly correlated, L1 picks one and zeros the other almost arbitrarily. Small data changes flip which one survives. Unstable.
The L1 piece still drives sparsity. The L2 piece keeps correlated features stable. Best of both. This is why ElasticNet is a strong tabular default when you want both sparsity and stability, not pure Lasso.
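A small sketch shows why the L2 piece stabilizes correlated features. It assumes the extreme case of two identical feature columns and one common way of writing the combined objective (half mean squared error plus `lam1` times the L1 penalty plus `lam2 / 2` times the squared L2 penalty). That parametrization, the toy data, and the function name are assumptions for illustration, and libraries scale these terms differently.

```python
def penalized_loss(w, xs, ys, lam1, lam2):
    """Half mean squared error + lam1 * L1 penalty + (lam2 / 2) * L2 penalty.

    Both features are copies of xs, so the prediction is (w[0] + w[1]) * x.
    """
    n = len(xs)
    fit = sum((y - (w[0] + w[1]) * x) ** 2 for x, y in zip(xs, ys)) / (2 * n)
    l1 = abs(w[0]) + abs(w[1])
    l2 = w[0] ** 2 + w[1] ** 2
    return fit + lam1 * l1 + 0.5 * lam2 * l2


# Hypothetical toy data: the target is 2 * x, and both features equal x.
xs = [-1.5, -0.5, 0.5, 1.5]
ys = [2.0 * x for x in xs]

# Three ways to split the same total weight across the duplicated feature.
candidates = [(1.9, 0.0), (0.0, 1.9), (0.95, 0.95)]

lasso = [penalized_loss(w, xs, ys, lam1=0.1, lam2=0.0) for w in candidates]
enet = [penalized_loss(w, xs, ys, lam1=0.1, lam2=0.1) for w in candidates]
print("L1 only:   ", [round(v, 6) for v in lasso])
print("ElasticNet:", [round(v, 6) for v in enet])
```

With the L1 penalty alone the three splits score the same, so nothing in the objective decides which copy survives. Adding the L2 term makes the even split strictly cheaper, which is what keeps correlated features from flipping.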
Early stopping and tree depth caps — same idea, different knob¶
Regularization is not just penalties on weights. Anything that limits flexibility is regularization.
- Early stopping in gradient boosting (XGBoost, LightGBM): stop training when validation loss stops improving. Equivalent in spirit to L2 — the weights never reach their large-magnitude extreme.
- Tree depth caps in decision trees and random forests: a tree of depth 3 has at most 8 leaves, so it cannot memorize a large training set. A tree of depth 30 can. The cap is a regularizer.
- Min samples per leaf: forces each leaf to represent ≥k training points. Prevents the tree from creating a leaf for one weird house.
- Dropout in neural networks: randomly zero activations. Forces the network to not over-rely on any single neuron. Same spring, different form.
All of these say the same thing: do not let the model memorize noise in the training set.
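The early-stopping rule from the list above can be sketched as a patience loop over a sequence of validation losses. This is a generic illustration, not any library's implementation: the function name and the loss curve are made up, and `patience` plays the role that a setting like early_stopping_rounds plays in boosting libraries.

```python
def early_stopping_round(val_losses, patience):
    """Return (best_round, stopped_at) for a sequence of validation losses.

    Training stops once `patience` consecutive rounds pass without a new best.
    """
    best_round, best_loss = 0, float("inf")
    for round_idx, loss in enumerate(val_losses):
        if loss < best_loss:
            best_round, best_loss = round_idx, loss
        elif round_idx - best_round >= patience:
            return best_round, round_idx
    # Ran out of rounds before patience was exhausted.
    return best_round, len(val_losses) - 1


# Hypothetical validation curve: improves, then drifts upward (overfitting).
curve = [1.00, 0.80, 0.70, 0.65, 0.66, 0.64, 0.67, 0.68, 0.69, 0.70]
best, stopped = early_stopping_round(curve, patience=3)
print(f"best round = {best}, stopped at round = {stopped}")
```

For this curve the best round is 5 (loss 0.64) and training halts at round 8, after three rounds without improvement. The model kept is the one from the best round, not the last one.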
Real-world products¶
- FICO-style credit scoring (Lasso). L1 selects ~15 features from 200 candidates (illustrative numbers). The other 185 weights are exactly zero. ECOA requires lenders to give applicants specific reasons for an adverse decision, and a sparse model makes those reasons easy to state and audit.
- Genomics — gene expression to disease (Lasso). 20,000 genes, ~50 likely matter. L1 finds ~50 and zeros the rest. The list of selected genes is the scientific finding.
- A/B test analysis with covariates (Ridge). Treatment-effect estimation with multicollinearity. L2 keeps every covariate's coefficient small and stable. L1 would zero some and bias the treatment estimate.
- XGBoost / LightGBM in production tabular pipelines (early stopping). Train rounds until validation loss plateaus, then stop. A setting such as early_stopping_rounds=50 is among the most commonly used regularizers in industry tabular ML.
- sklearn ElasticNet. A tabular workhorse when you want sparsity and stability with correlated features.
Pause and recall. Without scrolling: (a) sketch the L1 diamond and L2 circle in 2D weight space; (b) explain in one sentence why L1 produces sparsity; (c) name two products where L1 sparsity is required and one where L2 stability wins; (d) what does overfitting from ELI5 have to do with all this? If any link is fuzzy, scroll back.
Interview Q&A¶
Q: Why does L1 zero out weights but L2 doesn't?
A: The shape of the penalty fence. L1's fence is a diamond with corners on the axes. The loss contours typically first touch the fence at a corner, where one coordinate is exactly zero. L2's fence is a smooth circle with no corners — the loss touches at a generic angle, so all weights stay non-zero, just smaller.
Common wrong answer to avoid: "L1 has stronger penalty so it crushes weights to zero." False. L1 and L2 with the same λ can produce very different sparsity even when the L2 weights are smaller in magnitude. It is the geometry of the corners, not the strength.
Q: Should I always use L2?
A: No. Pick by goal. Use L1 when interpretability or feature selection is required (genomics, FICO, ad CTR pruning). Use L2 when stability under multicollinearity matters (A/B test analysis, finance). Use ElasticNet when you want sparsity and stability — often the best first sweep for tabular work.
Common wrong answer to avoid: "L2 is safer so it should always be the default." Too rigid. The right default depends on whether you need sparsity, stability, or both.
Q: What's the difference between regularization and feature selection?
A: Feature selection picks features before fitting. Regularization shrinks weights during fitting. L1 happens to do both — it is regularization that also produces feature selection as a side effect of zero weights. L2 does pure shrinkage, no selection.
Common wrong answer to avoid: "They are the same thing." No. You can do feature selection with no regularization, and you can do regularization with no features being removed.
Q: I have 100 features but only 1000 rows. Which regularizer?
A: Start with L1 or ElasticNet. At roughly 10 rows per feature the fit is weakly determined: the data cannot reliably pin down 100 separate weights, and sparsity concentrates the model on the features the data can actually support. Pure L2 will give you 100 small but non-zero weights, which is harder to interpret and can still spread weight across junk features. L1 picks a small subset that actually fits.
Common wrong answer to avoid: "Just use Ridge with high λ." High-λ Ridge gives stable but uninformative tiny weights; the model is essentially the mean. The structural problem is too many features, not just too much variance.
Apply now (5 min)¶
Take a sheet of paper. Draw two pictures from memory:
- The L1 diamond in 2D weight space. Mark the four corners. Label which axis each corner sits on.
- The L2 circle in 2D weight space. Label one tangent point with an oval loss contour.
Now write one sentence under each: why this shape produces sparsity / does not produce sparsity. If you cannot do this in 90 seconds without notes, scroll back to §"L1 diamond vs L2 circle" and stare at the diagram until you can.
Then — without looking — write the three-feature sweep table. λ = 0.5, 2.0, 5.0. Three weights each. Show the L2 row (smooth shrinkage) and the L1 row (snapping to zero). Even rough numbers are fine. The shape of the table is what matters.
Bridge. Regularization is the cure for overfitting. But before we can regularize anything, we need the simplest learner to regularize: the model that takes a feature list and produces a number. That is linear regression. Read 05-linear-regression.md next.