06. Gradient descent — the blindfolded hiker on a foggy mountain¶
How weights are actually found. The picture, the rule, and what breaks when the step is wrong.
Built on the ELI5 in
00-eli5.md. The model's weights are not handed down — they are found, one nudge at a time. That nudging is gradient descent.
The picture before the math¶
See. Imagine a hiker on a mountain. Thick fog. Cannot see the valley. Cannot see ten metres ahead. Only one thing is felt — the slope under the feet.
So what to do? Simple. Feel which direction goes downhill the steepest. Take one step that way. Stop. Feel again. Step again. Repeat.
Eventually the hiker reaches a low point. That low point is where the loss is smallest — where the model's weights make the prediction most accurate.
That is the whole idea. The mountain is the loss landscape. The hiker's position is the current weight setting. The slope under the feet is the gradient. The size of each step is the learning rate.
loss
^
| * ← start here (high loss, big slope)
| \
| * ← step 1
| \
| * ← step 2
| \_*_ ← near bottom, slope flat, steps small
| \_*__
+----------------------> weight w
^
minimum
Each * is one iteration. Steep slope → bigger move. Flat slope → tiny move. The hiker slows down naturally as the valley flattens.
This is how every weight in classical ML and every weight in a neural network is actually trained. The model was fitted by exactly this hiker.
The update rule¶
One line. Memorize the picture, then read the line:
∂L/∂wis the gradient — the slope of loss with respect to weightw. Positive gradient means loss goes up ifwgoes up. So we go the other way.η(eta) is the learning rate — the step size. A scalar like 0.01 or 0.1.- The minus sign is the downhill flip. Gradient says "uphill is this way"; we step opposite.
Picture: gradient is the direction. Learning rate is the step size. Together they decide where the hiker lands next.
Worked numerical example — minimize f(w) = (w-3)²¶
The minimum is obviously at w = 3, where f(w) = 0. Pretend we don't know that. Let the hiker find it.
The gradient: ∂f/∂w = 2(w - 3).
Start at w = 0, learning rate η = 0.1. Run three iterations.
| Step | w | f(w) | gradient 2(w-3) | update -η·grad | new w |
|---|---|---|---|---|---|
| 0 | 0.00 | 9.00 | -6.00 | +0.60 | 0.60 |
| 1 | 0.60 | 5.76 | -4.80 | +0.48 | 1.08 |
| 2 | 1.08 | 3.69 | -3.84 | +0.384 | 1.464 |
| 3 | 1.464 | 2.36 | -3.07 | +0.307 | 1.771 |
See the pattern. Loss falls every step. w creeps toward 3. The gradient shrinks as we approach — so the steps shrink too. The hiker slows down near the valley, automatically.
Run it 30 more times in your head — w lands very close to 3. That is convergence.
"X breaks" — when the step size is wrong¶
Same problem: f(w) = (w - 3)², start at w = 0. Gradient is 2(w - 3). Run 5 iterations with three different learning rates.
Attempt 1 — η = 0.01 (too small, painfully slow)¶
| Step | w | gradient | new w |
|---|---|---|---|
| 0 | 0.000 | -6.000 | 0.060 |
| 1 | 0.060 | -5.880 | 0.119 |
| 2 | 0.119 | -5.762 | 0.176 |
| 3 | 0.176 | -5.647 | 0.233 |
| 4 | 0.233 | -5.534 | 0.288 |
| 5 | 0.288 | — | — |
After 5 steps, w = 0.29. The target is 3. The hiker is barely off the porch. This will eventually converge — but you'll need hundreds of iterations. In a real model, that is hours of compute wasted on too-cautious steps.
Attempt 2 — η = 0.1 (just right)¶
| Step | w | gradient | new w |
|---|---|---|---|
| 0 | 0.000 | -6.000 | 0.600 |
| 1 | 0.600 | -4.800 | 1.080 |
| 2 | 1.080 | -3.840 | 1.464 |
| 3 | 1.464 | -3.072 | 1.771 |
| 4 | 1.771 | -2.458 | 2.017 |
| 5 | 2.017 | — | — |
After 5 steps, w = 2.02. Two-thirds of the way. Smooth descent. Steps shrink as we near the valley. The hiker is doing fine.
Attempt 3 — η = 2.0 (too large, diverges)¶
| Step | w | gradient 2(w-3) | new w = w − 2·grad |
|---|---|---|---|
| 0 | 0.000 | -6.000 | 12.000 |
| 1 | 12.000 | +18.000 | -24.000 |
| 2 | -24.000 | -54.000 | 84.000 |
| 3 | 84.000 | +162.000 | -240.000 |
| 4 | -240.000 | -486.000 | 732.000 |
| 5 | 732.000 | — | — |
Each step overshoots the valley and lands further from 3 than where we started. The hiker is not walking — the hiker is being thrown across the valley, overshooting harder each time. Loss explodes to infinity. NaN territory in real training.
This is divergence. The classic symptom: loss starts decreasing, then suddenly shoots to inf or nan. First instinct should be: learning rate is too high.
Bouncing in narrow valleys¶
Even with a "good" learning rate, the landscape can bite. Picture a valley shaped like a long thin gutter — steep walls, gentle floor.
___ ___
| \ / |
| \ *->* / | ← bouncing across walls
| \ / \ / | (zig-zag), barely moving
| * * | toward the actual minimum
|________\/_______|
^
minimum along gentle direction
The hiker keeps shooting up one wall, bouncing back, shooting up the other. Sideways progress frantic. Forward progress along the gutter floor — glacial.
Momentum adds a memory of past gradients — the hiker remembers which direction she has been going. That memory smooths the zig-zag and helps her glide down the gutter instead of bouncing wall to wall. Adam combines momentum with per-parameter learning rates, so some coordinates step cautiously while others step boldly. Both are covered in the neural networks module. For now, just know: plain GD can stall in narrow valleys even when nothing is technically broken.
Convex vs non-convex landscapes¶
Two shapes of mountain.
Convex — one bowl. One minimum. Wherever the hiker starts, downhill always leads to the same valley.
Linear regression with squared loss. Logistic regression. Ridge. Lasso. SVM. All convex. Gradient descent is guaranteed to find the global minimum (with a sensible learning rate).
Non-convex — many valleys. Local minima. Saddle points. Plateaus.
Neural networks. Tree boosting on flexible loss. The hiker may settle in a local valley that is not the lowest one. No guarantee of global optimum.
Practical truth: for big neural nets, most local minima turn out to be nearly as good as the global one. So we don't lose sleep. But for now, in classical ML, almost everything you optimize is convex — one valley, find it, done.
Pause and recall. Without scrolling — what does the gradient mean geometrically? What does the learning rate control? Why does too-large
ηdiverge instead of just being inefficient? Where in the house-pricing model from ELI5 does the hiker live?
Where this lives in the wild¶
- scikit-learn
SGDRegressor/SGDClassifier— stochastic gradient descent on linear models. Used when the dataset is too big for the closed-form normal equation. One step per mini-batch, billions of rows possible. - XGBoost — gradient boosting. Each new tree is fitted to the negative gradient of the loss with respect to current predictions. Gradient descent, but in function-space (adding tree predictions) instead of parameter-space.
- LightGBM — same idea as XGBoost but uses histogram-binned features and leaf-wise growth. The "gradient step" becomes very cheap; trains 10× faster on tabular data.
- TensorFlow / PyTorch optimizers — every neural network in production (GPT, Llama, Stable Diffusion, recommendation models at Meta and YouTube) is trained with gradient descent variants (Adam, AdamW, SGD+momentum). The hiker scales from one weight to a trillion.
- Zillow-style house pricing models — linear and neural house-price models both rely on gradient descent when the feature list gets large. Square footage, neighbourhood, age, renovation score — the hiker nudges every weight until predicted prices line up with sold prices.
Interview Q&A¶
Q: Why doesn't a constant learning rate always work? A: Early in training, loss is steep — a big step makes fast progress. Late in training, near the minimum, the same big step overshoots back and forth. So practitioners use learning-rate schedules — start high, decay over time. Cosine decay, step decay, warmup-then-decay are common. Common wrong answer to avoid: "constant LR is fine if you tune it well." It can converge, but you waste compute either being too slow early or oscillating late. Schedules dominate in production.
Q: Is gradient descent guaranteed to find the global optimum?
A: Only on convex loss surfaces (linear/logistic regression, ridge, lasso, SVM). On non-convex surfaces (neural networks, deep tree ensembles), it finds a local minimum or saddle point. In practice, for high-dim networks, local minima are usually good enough — but the guarantee is gone.
Common wrong answer to avoid: "yes, with a small enough learning rate." Small η does not escape local minima. It just descends slowly into whichever valley you started in.
Q: What's the difference between batch, mini-batch, and stochastic gradient descent? A: Batch GD computes the gradient on the entire dataset per step — accurate, very slow, memory-heavy. Stochastic GD uses one sample per step — fast, noisy, can escape shallow minima but jitters near the valley. Mini-batch GD uses a small batch (32, 256, 1024) per step — the production default. Best of both: efficient hardware use, smooth-enough gradient. Common wrong answer to avoid: "mini-batch is just batch with less memory." The real difference is gradient noise. That noise changes optimization behavior, not just hardware footprint.
Q: Loss starts decreasing, then suddenly jumps to NaN. First thing to check?
A: Learning rate is too high — the hiker overshot a cliff and exploded. Drop η by 10×. If still NaN, check for division-by-zero or log-of-zero in the loss (numerical instability). Add gradient clipping if gradients themselves are unbounded.
Common wrong answer to avoid: "the model architecture is bad." Architecture can matter, yes, but the first and fastest check is always the step size. Exploding loss is usually a training-dynamics problem before it is a model-family problem.
Apply now (5 min)¶
Take pen and paper. Pick f(w) = (w - 5)², start w = 0, η = 0.2. Compute three iterations by hand. Show w approaching 5. Then redo with η = 1.5 — show it diverging.
Then — without looking — sketch from memory:
- The hill-walker picture (1D loss curve with descending iteration markers).
- The update rule
w_new = w_old - η · ∂L/∂w. - Three rows of the η = 0.1 worked example for
f(w) = (w - 3)².
If you reproduce all three in under 90 seconds, the hiker lives in your head now.
Bridge. Gradient descent finds weights for any differentiable loss. The next file uses it to fit a classifier — logistic regression. Same hiker, different mountain. Continue to
07-logistic-regression.md.