
10. Regularization — keeping the rule pile honest

Why a fat rule pile memorizes photos instead of learning cat-ness, and the four knobs that fix it.

Builds on the cat-robot in 00-eli5.md. The "rule pile" returns. So does the cat-robot. The story now: the pile is plenty big, the training photos look perfect, and yet the robot fails in the wild.


The new failure — perfect on the album, blind on the street

See. You trained the cat-robot. Loss on the training photos went to zero. You celebrate.

Then you show it a new cat. It says "not cat".

What happened? The rule pile did not learn cat-ness. It memorized the exact pixels of every training photo. Whiskers in the same place. Same lighting. Same fur angle. New photo — different pixels — robot is lost.

This is overfitting. The pile had so much capacity that it just stored the album. Generalization, gone.

The opposite failure also exists. If the pile is too small, it cannot even fit the album. Both training and validation loss stay high. Underfitting.

So we have three regimes. Look at them as curves over training time.

loss ↑
     |    underfit            good fit           overfit
     |    both high           both low,          train↓ val↑
     |    (pile too small)    close together     (memorizing)
     |
     | ╲╲
     |  ╲ ╲                                            ___╱  val rises
     |   ╲  ╲___                               _______╱
     |    ╲     ╲_____________________________╱
     |     ╲________________
     |                      ╲_________________________  train still falling
     +────────────────────────────────────────────────→ epochs

Three regimes. Diagnose by the gap between train and val.

  • Both high. Pile too small. Train longer. Make pile bigger.
  • Both low, close. Done. Ship.
  • Train low, val rising. Pile is memorizing the album. Time for regularization.
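
The three-way diagnosis above can be sketched as a small helper. This is only an illustration: the function name `diagnose` and the thresholds `high` and `gap` are assumptions, not values from the text, and a real check would look at the trend over several epochs rather than one pair of numbers.

```python
def diagnose(train_loss, val_loss, high=0.5, gap=0.1):
    """Classify a (train, val) loss pair into one of the three regimes.

    `high` and `gap` are hypothetical thresholds; tune them to your loss scale.
    """
    if train_loss >= high and val_loss >= high:
        return "underfit"   # both high: pile too small or undertrained
    if val_loss - train_loss > gap:
        return "overfit"    # train low, val far above it: memorizing
    return "good fit"       # both low and close together


print(diagnose(0.60, 0.65))  # underfit
print(diagnose(0.00, 0.84))  # overfit
print(diagnose(0.05, 0.08))  # good fit
```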

Bias-variance — one balance scale

Two ways the cat-robot can be wrong.

        bias                            variance
   (too rigid, wrong shape)        (too jumpy, fits noise)
   ──────────────────────         ──────────────────────
   underfit                       overfit
   small pile                     huge pile, small data
   one straight line on XOR       memorizing each photo

Capacity goes up — bias drops, variance climbs. Capacity goes down — variance drops, bias climbs. We want a sweet spot. Regularization lets you keep capacity high but suppress variance with side knobs. Best of both.


Proof — a tiny pile with way too many rules will memorize

Take a small classification task. 50 training points. Build an MLP with about 1000 parameters — twenty parameters per data point. Way over.
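
To make "about 1000 parameters" concrete, here is a rough parameter count for one possible MLP. The layer sizes are an assumption chosen only to land near 1000; the text does not say which architecture was used.

```python
def mlp_param_count(layer_sizes):
    """Count weights plus biases of a fully connected MLP."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Hypothetical shape: 2 inputs, two hidden layers of 30 neurons, 1 output.
sizes = [2, 30, 30, 1]
n_params = mlp_param_count(sizes)

print(n_params)       # 1051
print(n_params / 50)  # 21.02 parameters per training point
```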

Train to zero training loss. Watch validation loss.

Epoch   Train loss   Val loss       Val loss        Val loss
                     (no dropout)   (dropout 30%)   (dropout 70%)
─────   ──────────   ────────────   ─────────────   ─────────────
    5      0.42         0.45           0.51            0.68
   20      0.08         0.39           0.41            0.55
   50      0.01         0.52           0.38            0.49
  200      0.00         0.71           0.36            0.47
 1000      0.00         0.84           0.37            0.50

Three honest attempts. Same architecture. Same data. Different dropout.

  • 0% dropout. Train loss crashes to zero. Val loss dips to 0.39 at epoch 20, then climbs to 0.84. Pure memorization. Cat-robot has the album by heart, the street by nothing.
  • 30% dropout. Train loss stays moderate. Val loss settles at ~0.37 and stays. Sweet spot.
  • 70% dropout. Pile is starved of neurons. Val loss is okay but not great. Now too rigid — drifting toward underfit.

See the shape? Capacity alone is not the enemy. Unconstrained capacity is. The robot needs a force keeping the rules less specific so they cannot memorize.

That force is regularization. Four flavors below.


Dropout — random rules go missing each step

Mental picture. Every training step, flip a coin for each neuron. With probability p, set its output to zero for this step. The pile runs as if those neurons did not exist. Next step, different coins, different neurons gone.

full pile             dropout step 1         dropout step 2
●●●●●●                ●○●○●○                 ○●●●○●
●●●●●●                ○●●○○●                 ●○●○●●
●●●●●●  →             ●●○●○●     →           ●●○●●○
   |                     |                      |
output                output                  output

What does this force? Each rule must be useful on its own. It cannot rely on a buddy rule next door — that buddy might be missing this step. So features become distributed. No fragile chain.

At test time, all neurons are on. To keep the average activation the same as in training, classic dropout scales the weights by (1 - p) at test time. "Inverted dropout" instead scales the surviving activations by 1/(1 - p) during training and leaves test time untouched. Either way, train and test agree on average.
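
A minimal NumPy sketch of the inverted-dropout variant described above. The function name and the demo array are illustrative assumptions; real frameworks ship their own dropout layers.

```python
import numpy as np


def dropout(x, p, training, rng):
    """Inverted dropout: zero each activation with probability p during training
    and scale the survivors by 1/(1-p), so the expected activation is unchanged.
    At test time the input passes through untouched."""
    if not training or p == 0.0:
        return x
    keep = rng.random(x.shape) >= p   # True with probability 1 - p
    return x * keep / (1.0 - p)


rng = np.random.default_rng(0)
x = np.ones((1000, 100))

train_out = dropout(x, p=0.3, training=True, rng=rng)
test_out = dropout(x, p=0.3, training=False, rng=rng)

print(train_out.mean())  # close to 1.0: train and test agree on average
print(test_out.mean())   # 1.0
```

About 30% of the entries in `train_out` are zero and the rest are 1/0.7, which is why the mean stays near 1.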

Typical p: 0.1 to 0.5. Higher p for fully-connected layers, lower for conv/attention. Modern transformers use small dropout (0.0–0.1) on attention probs and FFN outputs.

This is the rule pile with random gaps each step. The robot stops betting on one specific neuron firing and learns to spread responsibility.


Weight decay (L2) — every rule tied to zero with a tiny spring

Big weights pick up tiny pixel patterns. Small weights smooth across noise. So we add a penalty.

The total loss becomes:

L_total = L_data  +  λ · Σ wᵢ²
            ↑              ↑
        normal loss    pull toward zero

Picture. Every weight has a tiny spring attached to zero. The data loss pushes the weight toward whatever value fits the photos. The spring pulls back. If the data signal is real and strong, weight wins. If the data signal is just one weird photo, spring wins — weight stays small.

Smaller weights = smoother decision boundary = generalizes better.

Tuning. λ is small, typically 1e-4 to 1e-2. Too big and the pile underfits, because every weight gets dragged toward zero. Too small and it has no effect. Sweep on a log scale.
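
The spring picture falls straight out of the gradient of the penalty: the derivative of λ · Σ wᵢ² with respect to wᵢ is 2λwᵢ, a pull proportional to how far the weight sits from zero. A small sketch, with a made-up data loss and data gradient standing in for a real model:

```python
import numpy as np


def l2_penalized(data_loss, data_grad, w, lam):
    """Total loss and gradient for L_total = L_data + lam * sum(w**2)."""
    total_loss = data_loss + lam * np.sum(w ** 2)
    total_grad = data_grad + 2.0 * lam * w   # the spring: pull toward zero
    return total_loss, total_grad


w = np.array([3.0, -0.5, 0.0])
# Placeholder values: pretend the data loss is 1.0 and its gradient is zero,
# so only the spring acts on the weights.
loss, grad = l2_penalized(data_loss=1.0, data_grad=np.zeros(3), w=w, lam=1e-2)

print(loss)  # 1.0925
print(grad)  # pull of 0.06, -0.01 and 0.0: bigger weights feel a bigger pull
```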

A subtlety. Plain "L2 added to loss" and AdamW's decoupled weight decay are not the same under Adam. Adam rescales each parameter's gradient by its running gradient magnitude. If the L2 term lives in the loss, the spring's pull gets rescaled too, so weights with large gradients are decayed less than intended. AdamW applies the pull toward zero directly to the weight, separately from the Adam-scaled step. Cleaner. This is why most modern LLM training recipes use AdamW.
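
The difference shows up in a single update step. The sketch below is a hand-rolled Adam step in NumPy, not any library's implementation; the argument names `l2` and `decoupled_wd` are assumptions made for the comparison. The data gradient is set to zero so that only the spring moves the weights.

```python
import numpy as np


def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
              l2=0.0, decoupled_wd=0.0):
    """One Adam update with either a coupled L2 term or decoupled weight decay."""
    g = grad + 2.0 * l2 * w                # coupled: spring goes through Adam's scaling
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)              # bias correction
    v_hat = v / (1 - b2 ** t)
    w_new = (w
             - lr * m_hat / (np.sqrt(v_hat) + eps)
             - lr * decoupled_wd * w)      # decoupled: spring applied directly (AdamW style)
    return w_new, m, v


w0 = np.array([10.0, 0.1])
zeros = np.zeros(2)

w_l2, _, _ = adam_step(w0, zeros, zeros, zeros, t=1, l2=0.01)
w_wd, _, _ = adam_step(w0, zeros, zeros, zeros, t=1, decoupled_wd=0.01)

print(w0 - w_l2)  # both weights shrink by about lr = 1e-3, regardless of size
print(w0 - w_wd)  # shrink proportional to each weight: about 1e-4 and 1e-6
```

With the coupled term, Adam's normalization flattens the spring so a weight of 10 and a weight of 0.1 are pulled by the same amount. With the decoupled term, the pull stays proportional to the weight, which is what the spring picture promises.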


Batch norm and layer norm — stop the rule pile from chasing moving targets

Different problem. Imagine each layer's input distribution is shifting around as upstream weights change during training. Layer 5 keeps adjusting to compensate. Internal covariate shift, the old name.

Fix. Normalize each layer's input to have mean 0, variance 1. Then add a learnable scale γ and shift β so the layer can re-center if it wants.

z_norm = (z - μ) / sqrt(σ² + ε)
output = γ · z_norm + β

Two flavors of where μ and σ² come from.

                           Batch norm                            Layer norm
                           ──────────                            ──────────
Compute mean/var across    the batch dimension (per feature)     the feature dimension (per sample)
Depends on batch size?     Yes                                   No
Stateful at inference?     Yes, running mean/var                 No, fully deterministic
Picture                    "average this neuron over 32          "average all neurons within
                            photos in the batch"                  one photo"
Where used                 CNNs, ResNet: every block             Transformers: pre/post attention and FFN
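
In code, the two flavors differ only in the axis the statistics are taken over. A minimal NumPy sketch, assuming a (batch, features) array and leaving out the learnable γ and β as well as batch norm's running statistics:

```python
import numpy as np


def normalize(z, axis, eps=1e-5):
    """z_norm = (z - mean) / sqrt(var + eps), with stats taken along `axis`."""
    mu = z.mean(axis=axis, keepdims=True)
    var = z.var(axis=axis, keepdims=True)
    return (z - mu) / np.sqrt(var + eps)


rng = np.random.default_rng(0)
z = rng.normal(loc=3.0, scale=2.0, size=(32, 8))  # 32 samples, 8 features

batch_normed = normalize(z, axis=0)  # batch norm: per feature, across the batch
layer_normed = normalize(z, axis=1)  # layer norm: per sample, across features

# Layer norm does not care about the batch: one sample alone gives the same result.
single = normalize(z[:1], axis=1)
print(np.allclose(single, layer_normed[:1]))  # True
```

Running the same check with `axis=0` on a single sample would collapse every value to zero, which is the batch-size-1 problem in miniature.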

Why batch norm in vision but layer norm in transformers?

  • Vision batches are big, fixed shape, stable. Batch statistics are reliable.
  • Transformer sequences vary in length. Batches at inference can be size 1 (one user, one stream). Batch norm stats wobble. Layer norm does not care about batch — it normalizes within each token's feature vector.

Either way, the role is the same — keep each layer's input distribution stable so the rule pile is not chasing a moving target. A side effect: regularization. Mild noise injected by batch statistics during training acts a bit like dropout.


Early stopping — stop when the cat-robot starts memorizing

Simplest knob. Watch validation loss every epoch. When it starts climbing while training loss keeps falling — stop. Save the weights from the best val-loss point.

val loss ↑
         | ╲
         |  ╲                          ╱   ← val rising = memorizing now
         |   ╲                      __╱
         |    ╲___               __╱  ↑ STOP HERE
         |        ╲_____________╱
         |               ↑ best val (save weights here)
         +──────────────────────────────→ epochs

In practice, "patience" — wait k epochs to confirm the rise is real, not noise.
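
The patience rule can be sketched over a list of per-epoch validation losses. The function name and the sample numbers are illustrative assumptions; in a real loop the "checkpoint" comment is where the weights would be saved.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch), both 0-indexed.

    Training stops once `patience` epochs have passed without beating the best
    validation loss. If that never happens, stop_epoch is the last epoch.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch   # checkpoint weights here
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch              # the rise is confirmed, stop
    return best_epoch, len(val_losses) - 1


val = [0.45, 0.39, 0.41, 0.40, 0.52, 0.71]
print(early_stopping_epoch(val, patience=3))  # (1, 4)
```

The small dip at epoch 3 (0.40) does not reset the counter because it never beats the best value of 0.39, so training stops at epoch 4 and the weights from epoch 1 are kept.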

Early stopping is implicit regularization — fewer training steps means weights stayed closer to their initialization, which on average means smaller weights. Same direction as weight decay, free of charge.


Beyond weight penalties

Two more heavy hitters work on the data and the targets instead of the weights.

  • Data augmentation. Flip, crop, color jitter for vision. Back-translation or synonym replacement for text. You did not collect more data, but the pile behaves as if the dataset got larger. That is implicit regularization.
  • Label smoothing. Replace hard targets like [0, 1] with [ε, 1−ε], often ε = 0.1 in transformer training. This prevents overconfident logits and keeps the output layer less brittle.

In modern training, these tricks are often more important than dropout itself.
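
Label smoothing is a one-liner on the targets. The sketch below follows the [ε, 1−ε] form used above: the true class gets 1−ε and the remaining classes share ε equally. That split is an assumption made to match the text; a common alternative spreads ε over all classes including the true one, which gives slightly different numbers.

```python
import numpy as np


def smooth_labels(labels, num_classes, eps=0.1):
    """True class gets 1 - eps; the other classes share eps equally."""
    targets = np.full((len(labels), num_classes), eps / (num_classes - 1))
    targets[np.arange(len(labels)), labels] = 1.0 - eps
    return targets


# Two samples: the first is class 1 ("cat"), the second is class 0 ("not cat").
targets = smooth_labels(np.array([1, 0]), num_classes=2)
print(targets)  # rows: [0.1, 0.9] and [0.9, 0.1]
```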


Pause and recall. Without scrolling — what are the three regimes on the train-vs-val curve? Why does dropout force features to be distributed? What does label smoothing do to logits? Why layer norm, not batch norm, in transformers?


Where this lives in the wild

  • ResNet: batch norm in every residual block. Very deep ImageNet-scale CNNs were hard to train without it. BN helps stabilize deep stacks so 100+ layers actually backprop signal end-to-end. The cat-robot's rule pile, finally tall and stable.
  • Transformers (BERT, GPT, Llama): layer norm pre/post attention and FFN. Pre-LN (norm before attention) is the modern default because it trains more stably and leans less on warmup tricks. Without normalization, activations and attention scores tend to saturate or blow up at scale.
  • GPT-style decoders — dropout on attention probs and FFN. Original GPT used 0.1 dropout. Newer models often drop to 0.0 at scale because data is so abundant the pile cannot memorize anyway. Dropout matters most when data is the bottleneck.
  • AdamW in LLM pretraining: decoupled weight decay as the regularizer. Published recipes for models such as Llama and GPT-3 report weight decay around 0.1. The spring on every weight, applied cleanly outside the Adam moment estimates.

The pattern. Most production deep learning systems use at least two of these knobs at once. Vision: BN + weight decay + augmentation. Language: LN + weight decay + (sometimes small dropout). Each closes a different leak.


Interview Q&A

Q: Why does dropout act as regularization? A: It forces each neuron to be useful even when its neighbors are randomly absent. So no rule can lean on a co-conspirator — features become distributed and redundant. Equivalently, training with dropout is like averaging over an exponential number of thinned subnetworks (Hinton's view), which is a form of ensembling. Common wrong answer to avoid: "dropout adds noise so the model is robust to noisy inputs". The noise is on internal activations, not inputs. The point is breaking neuron co-dependence, not input robustness.

Q: Why batch norm in CNNs but layer norm in transformers? A: Batch norm normalizes across the batch axis, so it depends on batch statistics being stable. Vision batches are large and fixed-shape — fine. Transformer inference often runs batch size 1 with variable sequence lengths, which makes batch stats unreliable. Layer norm normalizes within each token's feature vector, batch-independent and deterministic at inference. Common wrong answer to avoid: "layer norm is just a better batch norm". Not a strict upgrade — BN gives stronger regularization in vision because of the across-batch noise. Transformers need batch-independence more than that bonus regularization.

Q: Is weight decay the same as L2 regularization? A: For plain SGD, yes: the two are equivalent once λ is rescaled by the learning rate. For Adam-family optimizers, no. Adam scales gradients per-parameter, so the gradient of λ·||W||² gets scaled too, distorting the penalty. AdamW decouples the decay, applying it directly to the weights instead of through the Adam-scaled gradient, which restores the intended behavior. Most modern LLM training recipes use AdamW for this reason. Common wrong answer to avoid: "they are interchangeable". Only true for plain SGD. With Adam, an L2 term in the loss does not behave like true weight decay, so use AdamW if weight decay is what you want.

Q: What does early stopping really do? A: It prevents the optimizer from spending the late-training budget on memorizing peculiarities of the training set. Mathematically, fewer steps means weights stayed closer to their (small) initialization — implicit shrinkage, like weight decay. It is the cheapest regularizer because it costs nothing extra to compute; you were going to track val loss anyway.


Apply now (5 min)

Mentally train a small MLP on a tiny dataset — 100 points, 2 input features, 50 hidden neurons. Two runs.

Run A — no dropout. Imagine training loss falling smoothly to near zero by epoch 100. Validation loss drops to a minimum around epoch 30, then starts creeping back up.

Run B — dropout 0.3. Training loss falls slower, settles around 0.05. Validation loss drops slower too — but stays low, no climb.

Now without scrolling — sketch both curves on the same axes. Mark the early-stopping point on Run A. Mark where the gap opens. One sentence: what is the cat-robot doing in run A after epoch 30 that run B prevents?

If you can sketch the two curves in 60 seconds and write that sentence, you own the bias-variance picture.


Bridge. Now the pile is honest — it generalizes. The next question is how big should it be, and how much data do we need? That is scaling laws — the math rule of thumb that says roughly twenty photos per rule. Read 11-scaling-laws.md next.