09. Optimizers — smart nudging that actually moves¶
The nudge has the right direction. The optimizer decides how the nudge actually moves through the landscape.
Built on the smart nudging placeholder from 00-eli5.md. Plain SGD is the dumb nudge. Adam-family is smart nudging. This file shows what "smart" means, step by step.
The picture before the math¶
Loss surface is a landscape. The minimum is at the bottom. The nudge points downhill. Plain SGD takes one blind step in that direction. That is it. No memory. No awareness of terrain.
Now picture five hikers descending the same valley:
- SGD. Blindfolded. Reads slope under feet. Takes one fixed-size step downhill. Repeats.
- SGD + momentum. Rolling marble with memory. Remembers which way it was rolling. Builds speed in consistent directions. Bounces cancel out.
- RMSProp. Hiker with different stride lengths for different terrain. Small strides where slopes are steep. Long strides where ground is flat.
- Adam. Marble with memory and per-direction stride lengths. Momentum and RMSProp combined.
- AdamW. Adam with the weight-decay leak fixed. Default for serious training today.
This is the smart nudge from ELI5. Optimizer = how the nudge actually moves.
Why plain SGD bounces — the canyon problem¶
See. Most loss landscapes are not round bowls. They are canyons. Steep walls in some directions. Gentle floors in others. Plain SGD cannot handle this.
Worked example. Take a 2-D loss with a narrow valley:
L(x, y) = 100·x² + y²
Gradient is (200·x, 2·y). Steep in x. Flat in y. Start at (x, y) = (1, 1). Learning rate α = 0.01.
Vanilla SGD — three steps¶
step 0: (1.000, 1.000) grad = (200, 2) update = (-2.00, -0.02)
step 1: (-1.000, 0.980) grad = (-200, 1.96) update = (+2.00, -0.0196)
step 2: (1.000, 0.960) grad = (200, 1.92) update = (-2.00, -0.0192)
step 3: (-1.000, 0.941)
Look. x flips sign every step. Forever. y creeps down by 2% each step. The hiker is wasting every step bouncing wall-to-wall in the steep direction. Barely moves down the floor.
Shrink α to fix the x bounce and y becomes glacial: the x bounce decays only when α < 2/200 = 0.01, and at that size y shrinks by less than 2% per step. There is no winning trade-off. This is the rule pile failing as a noisy walk: the nudge is correct, but the way it moves is dumb.
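The hand trace above can be checked with a few lines of Python. This is a minimal sketch, assuming the loss L(x, y) = 100·x² + y² used in this file; the function names `grad` and `sgd_trace` are made up for illustration.

```python
def grad(x, y):
    """Gradient of L(x, y) = 100*x**2 + y**2."""
    return 200.0 * x, 2.0 * y


def sgd_trace(x, y, lr=0.01, steps=3):
    """Return every (x, y) position visited by plain SGD, start included."""
    path = [(x, y)]
    for _ in range(steps):
        gx, gy = grad(x, y)
        # One blind step downhill: same learning rate for both directions.
        x, y = x - lr * gx, y - lr * gy
        path.append((x, y))
    return path


if __name__ == "__main__":
    for step, (x, y) in enumerate(sgd_trace(1.0, 1.0)):
        print(f"step {step}: ({x:.3f}, {y:.3f})")
```

Running it prints the same four positions as the hand trace. Try `lr=0.009`: the x bounce now shrinks, but only by about 20% per step, and y is slower still.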
| Direction | Vanilla SGD | With momentum | Adam (per-direction) |
|---|---|---|---|
| x | ╱╲╱╲╱╲╱╲╱╲ bouncing wall-to-wall | ╲___ bounces cancel, marble glides | ╲___ dampened by √v, per-parameter |
| y | slow drift down | faster drift down | amplified by 1/√v |
SGD with momentum — marble with memory¶
Add memory of recent direction. Bounces cancel, drift accumulates.
The update keeps a running velocity:
v ← β·v + g
w ← w − α·v
v is a smoothed history of recent gradients. Picture a marble rolling. It does not just look at the current slope — it carries momentum from the last few steps.
Same problem — three steps with momentum¶
α = 0.01, β = 0.9. Velocity starts at (0, 0).
step 0: (1.000, 1.000)
g = (200, 2) v = (200, 2) update = (-2.00, -0.02)
step 1: (-1.000, 0.980)
g = (-200, 1.96) v = (-20, 3.76) update = (+0.20, -0.0376)
step 2: (-0.800, 0.942)
g = (-160, 1.884) v = (-178, 5.27) update = (+1.78, -0.0527)
step 3: (0.980, 0.890)
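See what happened. In step 1, the gradient flipped sign in x (from +200 to -200). Momentum had v = 200 already. New v = 0.9·200 + (-200) = -20. The bounce cancelled itself down to a tenth of the size. It is not gone yet: in step 2 the gradient pushes the same way again, v grows to -178, and x swings back to 0.980. But the swing now shrinks every cycle instead of repeating forever. Meanwhile in y, gradients are all positive, so v grows from 2 to 3.76 to 5.27. Drift accumulates.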
Bounces die. Steady descent lives. Simple, no?
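To watch the bounce die over more than three steps, the same trace can be run in code. A minimal sketch, assuming the velocity form used in the arithmetic above (new v = β·v + g, then w moves by −α·v); `momentum_trace` is a hypothetical helper name.

```python
def grad(x, y):
    """Gradient of L(x, y) = 100*x**2 + y**2."""
    return 200.0 * x, 2.0 * y


def momentum_trace(x, y, lr=0.01, beta=0.9, steps=3):
    """Return every (x, y) position visited by SGD with momentum."""
    vx, vy = 0.0, 0.0  # velocity starts at rest
    path = [(x, y)]
    for _ in range(steps):
        gx, gy = grad(x, y)
        # Velocity remembers past gradients: opposite signs cancel,
        # consistent signs pile up.
        vx, vy = beta * vx + gx, beta * vy + gy
        x, y = x - lr * vx, y - lr * vy
        path.append((x, y))
    return path


if __name__ == "__main__":
    for step, (x, y) in enumerate(momentum_trace(1.0, 1.0, steps=3)):
        print(f"step {step}: ({x:.3f}, {y:.3f})")
    # Largest |x| over steps 50..99: the swing has shrunk a lot by then.
    late = momentum_trace(1.0, 1.0, steps=100)[50:]
    print("max |x| late in the run:", max(abs(x) for x, _ in late))
```

The first three steps match the hand trace. The late-run check is the interesting part: plain SGD would still be at |x| = 1 there.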
RMSProp — different strides for different terrain¶
Per-parameter learning rates, scaled to recent gradient size.
Momentum fixes direction memory but does not fix the scale problem. The x direction still gets gradients of magnitude 200. The y direction gets gradients of magnitude 2. They should not use the same step size.
RMSProp tracks the second moment, a running average of squared gradients per parameter:
s ← β·s + (1 − β)·g²
w ← w − α · g / (√s + ε)
Steep direction → g² large → √s large → step is dampened. Flat direction → g² small → √s small → step is amplified. Each parameter gets its own stride length scaled to its recent gradient size.
This is the smart-nudge idea from ELI5 made concrete — use different speeds for different rules.
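A small sketch of one RMSProp step for a single parameter. The hyperparameters here (`lr=0.01`, `beta=0.9`, `eps=1e-8`) are assumptions for illustration; this file does not fix values for RMSProp.

```python
import math


def rmsprop_step(w, g, s, lr=0.01, beta=0.9, eps=1e-8):
    """One RMSProp update for a scalar parameter.

    s is the running average of squared gradients for this parameter.
    Returns the new weight and the new s.
    """
    s = beta * s + (1.0 - beta) * g * g
    w = w - lr * g / (math.sqrt(s) + eps)
    return w, s


if __name__ == "__main__":
    # Canyon gradients at (1, 1): 200 in x, 2 in y. Fresh state, s = 0.
    x, _ = rmsprop_step(1.0, 200.0, 0.0)
    y, _ = rmsprop_step(1.0, 2.0, 0.0)
    print(f"x moved by {1.0 - x:.5f}, y moved by {1.0 - y:.5f}")
```

Both coordinates move by nearly the same distance even though the raw gradients differ by 100×. The gradient size divides out, which is the whole trick.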
Adam — momentum plus per-parameter strides¶
Track both moments. Use first for direction, second for scale.
Adam combines momentum (running mean of gradients, called m) with RMSProp (running mean of squared gradients, called v):
m ← β₁·m + (1 − β₁)·g # first moment — direction memory
v ← β₂·v + (1 − β₂)·g² # second moment — scale memory
m̂ = m / (1 − β₁ᵗ) # bias correction (early-step warmup)
v̂ = v / (1 − β₂ᵗ)
w ← w − α · m̂ / (√v̂ + ε)
Defaults: α = 1e-3, β₁ = 0.9, β₂ = 0.999, ε = 1e-8. These work reasonably well on a wide range of problems out of the box. That is why Adam became the LLM-training default: sane behaviour with little tuning.
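The five update lines translate almost one-to-one into code. A minimal sketch for a single scalar parameter, using the defaults listed above; the function name `adam_step` and the 1-based step counter `t` are conventions chosen for this example.

```python
import math


def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter. t counts steps from 1."""
    m = beta1 * m + (1.0 - beta1) * g        # first moment: direction memory
    v = beta2 * v + (1.0 - beta2) * g * g    # second moment: scale memory
    m_hat = m / (1.0 - beta1 ** t)           # bias correction: m and v start
    v_hat = v / (1.0 - beta2 ** t)           # at 0, so early values are too small
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


if __name__ == "__main__":
    for g in (200.0, 2.0):
        w, m, v = adam_step(1.0, g, 0.0, 0.0, t=1)
        print(f"g = {g:5.1f}: raw m = {m:.2f}, step taken = {1.0 - w:.6f}")
```

On the very first step the raw m is only a tenth of the gradient, yet with bias correction the step comes out at about α for both gradient sizes. That is what the correction buys during warmup.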
Same problem — three steps with Adam¶
α = 0.1 (Adam tolerates bigger learning rates), defaults otherwise. Skip bias correction here for clarity.
step 0: (1.000, 1.000)
g = (200, 2)
m = (20, 0.2) v = (40, 0.004)
update ≈ -0.1·20/√40, -0.1·0.2/√0.004
≈ (-0.316, -0.316)
step 1: (0.684, 0.684)
g = (136.8, 1.368)
m = (31.68, 0.317) v = (58.7, 0.00587)
update ≈ (-0.414, -0.414)
step 2: (0.270, 0.270)
g = (54.0, 0.540)
...
Both x and y move at roughly the same speed, even though their gradient magnitudes differ by 100×. The √v normalization equalized the strides. The hiker walks straight down the canyon floor instead of bouncing.
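The trace can be reproduced on the full 2-D problem. A sketch under the same simplification as the hand calculation: bias correction is skipped, so this is not the full Adam update. `adam_trace` is a hypothetical helper name.

```python
import math


def grad(x, y):
    """Gradient of L(x, y) = 100*x**2 + y**2."""
    return 200.0 * x, 2.0 * y


def adam_trace(x, y, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=2):
    """Adam WITHOUT bias correction, to match the hand-worked trace."""
    w = [x, y]
    m = [0.0, 0.0]
    v = [0.0, 0.0]
    path = [tuple(w)]
    for _ in range(steps):
        g = grad(*w)
        for i in range(2):  # each coordinate keeps its own m and v
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i] ** 2
            w[i] -= lr * m[i] / (math.sqrt(v[i]) + eps)
        path.append(tuple(w))
    return path


if __name__ == "__main__":
    for step, (x, y) in enumerate(adam_trace(1.0, 1.0)):
        print(f"step {step}: ({x:.3f}, {y:.3f})")
```

The printed positions should agree with the hand trace up to rounding in the third decimal, and x and y stay equal at every step.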
AdamW — fix the weight decay leak¶
Decouple weight decay from gradient. The fix that made weight decay behave as intended in transformer training.
Adam had a quiet bug. Weight decay (the L2 regularizer) was added into the gradient before scaling by 1/√v. So in steep directions, weight decay got dampened along with the gradient. Decay strength varied per parameter — exactly what you do not want for a regularizer.
AdamW separates them:
# Adam (broken)
g_with_decay = g + λ·w
m, v updated from g_with_decay
w ← w − α · m̂ / (√v̂ + ε)
# AdamW (decoupled)
m, v updated from g (no decay added)
w ← w − α · m̂ / (√v̂ + ε) − α · λ · w
In AdamW, weight decay shrinks every parameter by the same fraction each step. Exactly what L2 should do.
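The leak is easy to see in numbers. The sketch below is a contrived setup, not a training run: it assumes a warmed-up optimizer whose v remembers gradients of a given size, feeds a current gradient of 0, and measures how far one step pulls w toward zero. All function names are made up for illustration.

```python
import math

BETA1, BETA2, EPS = 0.9, 0.999, 1e-8


def _moments(g, m, v, t):
    """Shared Adam bookkeeping. Returns new m, new v, and m_hat / sqrt(v_hat)."""
    m = BETA1 * m + (1 - BETA1) * g
    v = BETA2 * v + (1 - BETA2) * g * g
    m_hat = m / (1 - BETA1 ** t)
    v_hat = v / (1 - BETA2 ** t)
    return m, v, m_hat / (math.sqrt(v_hat) + EPS)


def adam_l2_step(w, g, m, v, t, lr=1e-3, wd=0.01):
    """Coupled: the decay term wd*w passes through the 1/sqrt(v) scaling."""
    m, v, direction = _moments(g + wd * w, m, v, t)
    return w - lr * direction, m, v


def adamw_step(w, g, m, v, t, lr=1e-3, wd=0.01):
    """Decoupled: the decay term bypasses the adaptive scaling."""
    m, v, direction = _moments(g, m, v, t)
    return w - lr * direction - lr * wd * w, m, v


def decay_pull(step_fn, grad_scale, w=1.0, t=10_000):
    """How far one step shrinks w when the current gradient is 0 but
    v remembers gradients of size grad_scale (assumed warmed-up state)."""
    new_w, _, _ = step_fn(w, 0.0, 0.0, grad_scale ** 2, t)
    return w - new_w


if __name__ == "__main__":
    for name, fn in (("Adam + L2", adam_l2_step), ("AdamW", adamw_step)):
        steep, flat = decay_pull(fn, 200.0), decay_pull(fn, 2.0)
        print(f"{name:9s} steep: {steep:.2e}   flat: {flat:.2e}")
```

With coupled decay the steep parameter is regularized about 100× more weakly than the flat one. With AdamW both get the same pull of α·λ·w.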
This sounds like an implementation detail. It is not. AdamW is often the difference between transformers that converge cleanly and ones that drift. Same architecture, same data: switch from Adam to AdamW and the loss curve can change shape.
Learning rate schedules¶
Warmup: start tiny, then ramp linearly for the first ~5% of steps. This protects early weights from large random gradients.
Why this matters for Adam. Early on, the second-moment estimate v is noisy, so the step size is not trustworthy yet.
Then use cosine decay: after warmup, let the learning rate follow a half-cosine down toward ~0. Smooth landing.
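Warmup plus cosine decay fits in one small function. A sketch, assuming a 5% linear warmup and decay to a floor of 0; the names `lr_at`, `peak_lr`, `warmup_frac`, and `min_lr` are illustrative, not taken from any library.

```python
import math


def lr_at(step, total_steps, peak_lr=1e-3, warmup_frac=0.05, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then half-cosine decay."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp linearly from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 25, 50, 525, 1000):
        print(f"step {step:4d}: lr = {lr_at(step, 1000):.6f}")
```

The curve starts tiny, hits the peak when warmup ends, sits at half the peak midway through the decay, and lands at the floor on the last step.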
Pause and recall. Without scrolling — what leak does AdamW fix? Why does warmup help Adam early on? What shape does cosine decay have?
Where this lives in the wild¶
Optimizer choice is one of the most consequential decisions in training. Some named uses:
- Hugging Face Transformers. Trainer defaults to AdamW with β₁=0.9, β₂=0.999. Most BERT, RoBERTa, DistilBERT, and T5 fine-tuning recipes pair it with weight_decay=0.01. Switching to plain Adam silently weakens regularization.
- Lion (from Google researchers). Lion uses only the sign of the momentum-based update, not the magnitude. Half the optimizer state of Adam (no second moment), with competitive performance reported at scale. Attractive when HBM is the bottleneck.
- Shampoo / Sophia in recent large-model training. Second-order optimizers that approximate the Hessian. Sophia in particular reported 2× speedup on GPT-style pretraining. Still niche: most labs stick with AdamW because robust beats clever at scale.
- SGD with momentum in ResNet/ImageNet vision benchmarks. SGD with momentum=0.9 and a cosine schedule remains the standard recipe for ResNet-50 on ImageNet and is hard to beat on top-1 accuracy. Adam-family optimizers tend to generalize slightly worse in this regime, so vision-conv training is one of the few places where SGD wins.
- Mixed defaults in Stable Diffusion / diffusion training. AdamW for the U-Net, with EMA (exponential moving average of weights) layered on top. The EMA weights are what get released, not the live training weights.
The pattern. AdamW is the safe default. SGD+momentum is the right default for conv-heavy vision. Lion/Sophia/Shampoo are research-frontier swaps when you have the engineering budget to debug them.
Picking, and when to deviate¶
| Setting | Default | Why |
|---|---|---|
| LLM pretraining / fine-tune | AdamW | Regularization works correctly. Stable across scales. |
| Vision (ResNet, EfficientNet on ImageNet) | SGD + momentum 0.9 | Generalizes better than Adam on conv-heavy vision. |
| RL policy training | Adam (sometimes RMSProp) | Non-stationary gradients; momentum can hurt. |
| Tiny model / fast experiment | AdamW | No tuning needed. Move on. |
| Memory-bound frontier training | Lion or 8-bit AdamW | Half the optimizer state or less. |
The 90% rule. Use AdamW with lr=1e-3 (or lr=2e-5 for Hugging Face fine-tunes), weight_decay=0.01, defaults otherwise. Deviate only with a reason.
Interview Q&A¶
Q: Why is Adam the LLM-training default?
A: It handles ill-conditioned losses (canyons) without manual learning-rate tuning. The per-parameter 1/√v scaling automatically normalizes step sizes across parameters with very different gradient magnitudes — common in transformers where embedding gradients are huge and deep-layer gradients are tiny. Defaults (β₁=0.9, β₂=0.999, lr=1e-3) work without per-task tuning.
Common wrong answer to avoid: "because it converges faster". Adam often reaches a worse final loss than well-tuned SGD+momentum on convex or near-convex problems. The real reason is robustness, not speed.
Q: When should you NOT use Adam?
A: ImageNet-style conv vision training, where SGD+momentum with cosine schedule produces better generalization. Also avoid Adam when memory is tight: it stores two extra tensors (m and v) per parameter, so its optimizer state is twice the size of the weights and double what SGD+momentum keeps. For a 7B-parameter model in fp32, that is 56 GB of optimizer state alone.
Common wrong answer to avoid: "Adam is always better than SGD". False on vision benchmarks; false at extreme scale where memory dominates.
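The 56 GB figure is plain arithmetic: parameters × state tensors × bytes per value. A tiny sketch, assuming fp32 state (4 bytes) and decimal gigabytes; `optimizer_state_gb` is a made-up helper.

```python
def optimizer_state_gb(n_params, state_tensors, bytes_per_value=4):
    """Optimizer state size in GB (1 GB = 1e9 bytes)."""
    return n_params * state_tensors * bytes_per_value / 1e9


if __name__ == "__main__":
    n = 7_000_000_000
    print("Adam (m and v):      ", optimizer_state_gb(n, 2), "GB")
    print("SGD + momentum (v):  ", optimizer_state_gb(n, 1), "GB")
    print("Plain SGD (no state):", optimizer_state_gb(n, 0), "GB")
```

This counts optimizer state only. Weights, gradients, and activations come on top.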
Q: Difference between Adam and AdamW?
A: AdamW decouples weight decay from the gradient. In Adam, weight decay was folded into the gradient before the 1/√v scaling, so steep directions got their decay dampened and regularization strength varied per parameter. AdamW applies decay as a separate fixed-fraction shrink. This is why most modern transformer training scripts use AdamW, not Adam.
Q: What do β₁ and β₂ control in Adam?
A: β₁ (default 0.9) is the decay for the first moment — how much past gradient direction to remember. Larger β₁ → more momentum, smoother trajectory. β₂ (default 0.999) is the decay for the second moment — how much past squared gradient to remember. Larger β₂ → smoother per-parameter scale estimates. β₂ is much closer to 1 because squared gradients are noisier and need more averaging.
Common wrong answer to avoid: "they are both momentum". Only β₁ is momentum-like. β₂ controls the adaptive learning-rate scaling, which is a different mechanism.
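One way to make "how much past to remember" concrete is the common rule of thumb that a running average with decay β looks back over roughly 1/(1 − β) steps. A sketch; the helper names are illustrative.

```python
def effective_window(beta):
    """Rule-of-thumb number of recent steps an EMA with decay beta averages over."""
    return 1.0 / (1.0 - beta)


def ema_weights(beta, n):
    """Weight the EMA puts on the gradient from k steps ago, for k = 0..n-1."""
    return [(1.0 - beta) * beta ** k for k in range(n)]


if __name__ == "__main__":
    for name, beta in (("beta1", 0.9), ("beta2", 0.999)):
        window = round(effective_window(beta))
        covered = sum(ema_weights(beta, window))
        print(f"{name} = {beta}: window of about {window} steps, "
              f"holding {covered:.0%} of the total weight")
```

So β₁ = 0.9 remembers direction over about 10 steps, while β₂ = 0.999 averages squared gradients over about 1000.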
Apply now (5 min)¶
Take the loss L(x, y) = 100·x² + y² from this file. Start at (1, 1).
- Compute three steps of vanilla SGD with α = 0.01 by hand. Write the (x, y) after each step.
- Compute three steps of SGD + momentum with α = 0.01, β = 0.9. Note where the x bounce starts to die.
- Compute three steps of Adam with α = 0.1 and the standard betas (skip bias correction). Note that x and y move at roughly the same speed.
Then, without looking, sketch:
- The bouncing-canyon trajectory of vanilla SGD.
- The gliding trajectory of SGD+momentum.
- The straight-down trajectory of Adam.
If you can draw the three trajectories from memory and explain in one sentence each why they look that way, you own this idea. The smart nudge is no longer a mystery — it is just memory plus per-direction strides.
Bridge. Smart nudging gets the rule pile to a low loss fast. But low training loss does not mean the robot will work on new photos. The next file shows how the rule pile memorizes instead of learning, and the tools that keep it honest: dropout, weight decay, batch norm. Read 10-regularization.md next.