08. Vanishing gradients — when the nudge dies in deep piles

Why deep nets refused to train for years. The chain rule is a multiplier. Multiply enough small numbers and you get zero.

Built on the ELI5 in 00-eli5.md. The picture here is the deep rule pile with the nudge dying before it reaches the bottom. We have seen this phrase already. Now we open it up.


The whisper game — picture before formula

See. Imagine thirty people standing in a line. Person 30 hears a sentence. She whispers to person 29. He whispers to person 28. And so on, down to person 1.

Each whisperer muffles the sentence a little. Maybe 25% of the message survives each hop. By person 1?

 message strength reaching layer k (25% survives each hop):

 layer:    30   25      20      15      10       5        1
 strength: 1    ~10⁻³   ~10⁻⁶   ~10⁻⁹   ~10⁻¹²   ~10⁻¹⁵   ~10⁻¹⁸
                ↑                                         ↑
                already faint                       pure silence

That last person hears nothing. Not "a little". Nothing. The signal is gone.
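The hop arithmetic is easy to check by machine. This is a minimal sketch, assuming every hop keeps exactly 25% of what it received; `SURVIVAL`, `START`, and `strength_at` are illustrative names, not part of the lesson.

```python
# Whisper game: signal strength after repeated muffling.
# Assumption: every hop keeps exactly 25% of what it received.
SURVIVAL = 0.25
START = 30  # person 30 hears the sentence at full strength


def strength_at(person: int) -> float:
    """Fraction of the original message that reaches `person`."""
    hops = START - person
    return SURVIVAL ** hops


for person in (30, 25, 20, 15, 10, 5, 1):
    print(f"person {person:2d}: {strength_at(person):.1e}")
```

Changing `SURVIVAL` to 0.9 or 0.5 shows how sensitive the end of the line is to the per-hop factor.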

This is the nudge dying from the ELI5. Backprop is the whisper. At every layer it passes through, the gradient is multiplied by that layer's local derivative. If that derivative is small, the whisper is muffled. Stack thirty layers of muffling and the early layers receive a gradient that is effectively zero.

Zero gradient means zero update. Zero update means the early layers freeze at their random init. The deep robot becomes a shallow robot with extra dead weight.


Why the chain rule is a multiplier (one line of math)

Backprop takes the gradient at the loss and walks it back through the rule pile. At each layer, it multiplies by the local derivative of that layer's bend.

∂L/∂w_layer_1 = (∂L/∂out_30) · σ'(z_30) · w_30 · σ'(z_29) · w_29 · ... · σ'(z_2) · w_2 · σ'(z_1) · x
                └── the loss ──┘ └── 30 of these factors stacked up ──┘

So where do we look? At the size of each σ'(z) factor. If each is small, the product is microscopic. That is the whole story.
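To see the product form concretely, here is a minimal sketch on a toy scalar network, where every layer is a single weight followed by a sigmoid. The network, the function names, and the chosen weights are all assumptions made for illustration. The gradient at the top is set to 1, standing in for ∂L/∂out.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def forward(x: float, weights: list[float]) -> tuple[float, list[float]]:
    """Scalar chain a_k = sigmoid(w_k * a_(k-1)). Returns the output and every z_k."""
    a, zs = x, []
    for w in weights:
        z = w * a
        zs.append(z)
        a = sigmoid(z)
    return a, zs


def grad_first_weight(x: float, weights: list[float]) -> float:
    """d(out)/d(w_1) as the chain-rule product, walking from the top layer down."""
    _, zs = forward(x, weights)
    grad = 1.0  # stands in for dL/d(out) at the top
    for k in range(len(weights) - 1, 0, -1):  # layers L, L-1, ..., 2
        s = sigmoid(zs[k])
        grad *= s * (1.0 - s) * weights[k]    # sigma'(z_k) * w_k
    s1 = sigmoid(zs[0])
    return grad * s1 * (1.0 - s1) * x         # sigma'(z_1) * x


shallow = grad_first_weight(0.5, [0.8, -1.2, 1.5])
deep = grad_first_weight(0.5, [1.0] * 30)
print(f"3 layers:  {shallow:.3e}")
print(f"30 layers: {deep:.3e}")
```

The function does nothing clever. It multiplies one σ'(z)·w factor per layer, which is exactly why depth alone can drive the result toward zero.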


Sigmoid is the worst offender — three numerical attempts

Sigmoid: σ(x) = 1 / (1 + e⁻ˣ). Its derivative: σ'(x) = σ(x) · (1 − σ(x)).

The peak of σ' is at x = 0, where σ = 0.5. So peak derivative = 0.5 · 0.5 = 0.25. Anywhere else it is smaller.

 x     σ(x)    σ'(x)
 0     0.50    0.250   (the maximum, ever)
 1     0.73    0.197
 2     0.88    0.105
 4     0.98    0.018
 −3    0.05    0.045
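These values can be recomputed from the two formulas above. A small sketch using only the standard library; the helper names are illustrative.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_prime(x: float) -> float:
    """sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (0.0, 1.0, 2.0, 4.0, -3.0):
    print(f"x = {x:5.1f}   sigma = {sigmoid(x):.3f}   sigma' = {sigmoid_prime(x):.3f}")
```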

Now propagate a unit gradient through L sigmoid layers. Best case — every layer hits exactly x=0, every derivative is exactly 0.25.

Attempt 1 — 5-layer sigmoid net

0.25⁵ = 9.77 × 10⁻⁴

So a top gradient of 1.0 reaches layer 1 as ~0.001. With learning rate 0.01, the weight update is ~10⁻⁵. Slow, but moving. Trains, just sluggishly.

Attempt 2 — 10-layer sigmoid net

0.25¹⁰ = 9.54 × 10⁻⁷

Top gradient 1.0 reaches layer 1 as ~10⁻⁶. Learning rate 0.01 gives a weight update of ~10⁻⁸. Float32 machine epsilon is about 1.2 × 10⁻⁷, so for a weight of order 1 that update is smaller than the gap between neighboring float32 values and gets rounded away. Effectively frozen.

Attempt 3 — 20-layer sigmoid net

0.25²⁰ = 9.09 × 10⁻¹³

That is about one trillionth. And that was the best case. In the realistic case most activations are not at x=0, and the average derivative is more like 0.1. Then 0.1²⁰ = 10⁻²⁰, thirteen orders of magnitude below float32 epsilon. Layer 1 might as well not exist.
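The three attempts can be replayed in a few lines, including the question of whether a float32 weight actually moves. This is a sketch under stated assumptions: best-case derivative 0.25 at every layer, learning rate 0.01, and a weight of magnitude 1. Smaller weights have finer float32 spacing, so the exact depth at which updates vanish will differ. `to_float32` and `weight_moves` are illustrative helpers.

```python
import struct

PEAK = 0.25           # best-case sigmoid derivative
LEARNING_RATE = 0.01  # the rate used in the text


def to_float32(value: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", value))[0]


def weight_moves(depth: int, weight: float = 1.0) -> bool:
    """True if a float32 weight actually changes after one best-case update."""
    update = LEARNING_RATE * PEAK ** depth
    return to_float32(weight - update) != to_float32(weight)


for depth in (5, 10, 20):
    grad = PEAK ** depth
    print(f"depth {depth:2d}: grad {grad:.2e}, "
          f"update {LEARNING_RATE * grad:.2e}, weight moves: {weight_moves(depth)}")
```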

So the claim "10-layer sigmoid net cannot train" is not hand-waving. It is arithmetic. The early layers stop updating. The network learns only its top half. The bottom half stays at random init — which destroys the features the top half is trying to combine.

This is exactly the deep rule pile killing the nudge that the ELI5 warned about. The picture and the numbers agree.

The flip side: exploding gradients

Same chain rule. Opposite problem. If each factor is 1.5 instead of 0.25, then over 20 layers you get 1.5²⁰ ≈ 3,325. A top gradient of 1 becomes thousands by the time it reaches the bottom. One update can throw the weights across the room. Loss spikes. Activations saturate. Soon you are staring at nan.

Standard fix: gradient clipping. Compute the global norm. If it is above a cap like 1.0, scale the whole gradient down to that norm. Simple, no?

Exploding is easier than vanishing. You can clip a scream, but you cannot cleanly recover a whisper after the signal has already disappeared.
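Global-norm clipping fits in a few lines. The sketch below is a plain-Python illustration, not any framework's implementation. Gradients are modeled as a list of flat lists, and `clip_by_global_norm` and the sample numbers are made up for the example.

```python
import math


def clip_by_global_norm(grads: list[list[float]], max_norm: float = 1.0) -> list[list[float]]:
    """Scale all gradients by one shared factor so their joint L2 norm is at most max_norm."""
    global_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    if global_norm <= max_norm:
        return [list(tensor) for tensor in grads]  # already small enough: leave as is
    scale = max_norm / global_norm
    return [[g * scale for g in tensor] for tensor in grads]


grads = [[300.0, -400.0], [1200.0]]  # made-up exploding gradients
clipped = clip_by_global_norm(grads, max_norm=1.0)
print(clipped)
```

Because every entry is scaled by the same factor, the direction of the update is preserved and only its length is capped. Deep learning frameworks ship their own version of this (PyTorch has `torch.nn.utils.clip_grad_norm_`), which is what you would normally use in practice.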


ReLU saves the river — same numbers, different bend

ReLU: f(x) = max(0, x). Derivative: 1 if x > 0, else 0.

So in the active region, every factor in the chain is 1. Multiply ten ones. Multiply twenty ones. Still one. The whisper stays loud.

 sigmoid derivative              ReLU derivative
        │                              │
   0.25 │ ___                       1  │ ───────────
        │/   \                         │
        │     \___                     │
      0 ┴─────────                   0 ┴──────╱
        −4  0   4                       neg │ pos

 product over 10 layers:        product over 10 layers:
   0.25¹⁰ = 9.5e-7                 1¹⁰ = 1.0
   0.10¹⁰ = 1.0e-10                (no muffling whatsoever)

Run the same three attempts with ReLU:

 Depth       Sigmoid best case (0.25 per layer)   ReLU (active path)
 5 layers    9.77 × 10⁻⁴                          1.0
 10 layers   9.54 × 10⁻⁷                          1.0
 20 layers   9.09 × 10⁻¹³                         1.0

ReLU has its own failure — "dying ReLU", where a unit gets stuck in the negative side and the derivative is zero forever. That is a different beast. We will name it below.

GELU and SiLU smooth the elbow. Same idea: for positive inputs the derivative stays close to 1, so the active path barely muffles. BERT and the GPT family use GELU. Llama uses SiLU inside its gated feed-forward block. Either one keeps the nudge alive.
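The derivatives of both smooth bends can be written in closed form, which makes the "close to 1" claim easy to inspect. A minimal sketch, assuming the exact (erf-based) GELU rather than the tanh approximation some libraries use; the function names are illustrative.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def silu_prime(x: float) -> float:
    """Derivative of SiLU(x) = x * sigmoid(x)."""
    s = sigmoid(x)
    return s + x * s * (1.0 - s)


def gelu_prime(x: float) -> float:
    """Derivative of exact GELU(x) = x * Phi(x), with Phi the standard normal CDF."""
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf


for x in (-4.0, -1.0, 0.0, 1.0, 2.0, 4.0):
    print(f"x = {x:5.1f}   SiLU' = {silu_prime(x):7.4f}   GELU' = {gelu_prime(x):7.4f}")
```

Both derivatives are 0.5 at x = 0 and settle near 1 for larger positive x. On the negative side they fall toward 0, much like ReLU, only smoothly.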


Pause and recall. Without scrolling — what is the maximum value of the sigmoid derivative? Why is that the killer? What is the ReLU derivative on the active side? If any of these is fuzzy, scroll back.


Residual connections — a highway around the muffling stack

Even with ReLU, very deep nets (50+ layers) struggle. Why? Because the active path through ReLUs still multiplies many weight matrices W_k. If those have spectral norm < 1, the product still shrinks.

The fix is structural, not activation-based. Residual connection: each block computes out = layer(x) + x. The + x is a direct shortcut.

Picture a highway running parallel to the muffling stack:

   input ──┬─────────────────────────────┬───→ output
           │                             │
           └──→ [layer 1] → ... → [layer 30] ──┘
                  ↑                    ↑
              gradient gets stuck    here
              in deep blocks

              but the highway carries
              gradient back unchanged

When backprop walks through out = layer(x) + x, the derivative w.r.t. x is (1 + ∂layer/∂x). That 1 is the highway. Even if the layer's contribution muffles to zero, the gradient still flows back as a clean copy of itself. No layer can starve the bottom of the pile.
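The effect of that extra 1 shows up immediately in a scalar toy model. This is a sketch, assuming every block has the same small local derivative ∂layer/∂x = 0.01 and that a skip connection wraps each block; real networks multiply Jacobian matrices, not scalars. The function names are illustrative.

```python
def plain_chain(local_derivs: list[float]) -> float:
    """Gradient factor through stacked plain layers: product of d_k."""
    grad = 1.0
    for d in local_derivs:
        grad *= d
    return grad


def residual_chain(local_derivs: list[float]) -> float:
    """Gradient factor through stacked residual blocks: product of (1 + d_k)."""
    grad = 1.0
    for d in local_derivs:
        grad *= 1.0 + d
    return grad


derivs = [0.01] * 30  # assumed: 30 badly muffling layers
print(f"plain:    {plain_chain(derivs):.2e}")
print(f"residual: {residual_chain(derivs):.2e}")
```

With all local derivatives at exactly zero the residual product is still 1, which is the "clean copy" described above.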

This is what made 152-layer ResNets, and later 100-billion-parameter transformers, actually trainable.


Layer norm and gradient flow — a small clarification

Layer norm rescales activations so their distribution stays well-behaved per token. It does not directly fix vanishing gradients. What it does is keep activations in a regime where bend derivatives are not pathologically small. So it helps indirectly — it shifts the typical σ' or f'(x) factor closer to its useful range.

Pre-norm transformers (norm before the sublayer, residual added after) train more stably than post-norm transformers, specifically because the residual highway is left untouched by the norm. That is a vanishing-gradient story dressed up as an architecture choice.
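The two orderings differ only in where the norm sits relative to the skip path. A minimal sketch on a single token vector, assuming a layer norm with no learned scale or shift; `dead_sublayer` is a hypothetical stand-in for an attention or feed-forward sublayer that contributes nothing.

```python
import math
from typing import Callable

Vector = list[float]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize one token vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def pre_norm_block(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """x + sublayer(norm(x)): the skip path carries x through untouched."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def post_norm_block(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """norm(x + sublayer(x)): the skip path itself passes through the norm."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def dead_sublayer(x: Vector) -> Vector:
    """Hypothetical sublayer that outputs all zeros."""
    return [0.0] * len(x)


x = [1.0, 2.0, 4.0]
print("pre-norm: ", pre_norm_block(x, dead_sublayer))
print("post-norm:", post_norm_block(x, dead_sublayer))
```

With a dead sublayer the pre-norm block returns its input unchanged, while the post-norm block still rescales it. The same holds for the backward pass: in pre-norm the identity path bypasses the norm, in post-norm it does not.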


Where this lives in the wild

  1. ResNet-152 in ImageNet classification. 152 layers of conv blocks, each with a residual connection. Without the residuals, adding depth hurts: He et al. showed that a plain 56-layer net has higher training error than a plain 20-layer net. Residuals flip that.
  2. Transformer pre-norm vs post-norm. The original Transformer and BERT used post-norm and needed careful learning-rate warmup. GPT-2, GPT-3, Llama, and most modern frontier models use pre-norm: the residual highway is exposed, so gradient flows from output to embedding without being rescaled by a norm on the way.
  3. RWKV and Mamba (recurrent and state-space models). These architectures replace softmax attention with a recurrent state update. In that formulation, gradient flow through the sequence is governed by an eigenvalue parameterization of the state transition, not by a muffling stack of bends. Vanishing gradient was a design constraint here.
  4. Gradient clipping in GPT pretraining. The opposite cousin — exploding gradients — kills training too. OpenAI and others clip global gradient norm at ~1.0 during pretraining. The story is the same: chain rule is a multiplier; products of large numbers also blow up. Clipping caps the explosion; residuals + ReLU/GELU prevent the vanishing.

Interview Q&A

Q: Why is sigmoid the worst activation for deep nets? A: Its derivative peaks at 0.25 and shrinks fast outside x near 0. Stacking L layers multiplies L of these. By 10 layers the gradient at the bottom is at most ~10⁻⁶, and with a learning rate of 0.01 the resulting update falls below float32 epsilon. ReLU has derivative 1 on the active side, so there is no shrinkage. Common wrong answer to avoid: "sigmoid saturates". True but incomplete. Tanh also saturates, yet its derivative peaks at 1. The unique sin of sigmoid is the peak being only 0.25, so even at the best point in its domain it is already shrinking the gradient by 4×.

Q: Don't residual connections just paper over the problem? The bad layer is still bad. A: No — they reroute the gradient so the bad layer cannot starve the rest of the pile. Even if layer(x) contributes zero, the + x carries gradient backward unchanged. The early layers keep getting useful updates, so they actually learn rather than freezing at init. Common wrong answer to avoid: "residuals just add the input back". They do, but the load-bearing fact is the gradient identity term — not the forward-pass addition.

Q: Is "dying ReLU" the same as a vanishing gradient problem? A: Related but distinct. Vanishing gradient is a product-of-derivatives problem across depth. Dying ReLU is a single-unit problem — a neuron whose pre-activation went deeply negative and stays there, so its derivative is zero forever. Fixes are different too: leaky ReLU or careful init for dying ReLU; residuals and good bends for vanishing gradients.

Q: Does layer norm fix vanishing gradients? A: Indirectly. It keeps activations in a range where bend derivatives are not pathologically small — so the typical σ' factor is closer to its usable peak. The actual fix is residuals plus a good bend. Norm is a stabilizer, not a cure. Common wrong answer to avoid: "layer norm normalizes gradients". It normalizes activations. The gradient effect is downstream of that.


Apply now (5 min)

Take a unit gradient at the top. Walk it back by hand through 5 layers — one column for sigmoid (each σ' = 0.2), one column for ReLU (each f' = 1).

 Layer   Sigmoid grad reaching here   ReLU grad reaching here
 top     1.0                          1.0
 4       0.2                          1.0
 3       0.04                         1.0
 2       0.008                        1.0
 1       0.0016                       1.0
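To check the hand calculation afterwards, here is a small sketch. It assumes the same setup as the exercise: a unit gradient at the top and one constant local derivative per layer. `walk_back` is an illustrative helper.

```python
def walk_back(local_deriv: float, layers: int = 5) -> list[float]:
    """Gradient arriving at each layer, listed from the top layer down to layer 1."""
    grads = [1.0]  # unit gradient at the top
    for _ in range(layers - 1):
        grads.append(grads[-1] * local_deriv)
    return grads


sigmoid_col = walk_back(0.2)  # each sigma' = 0.2
relu_col = walk_back(1.0)     # each f' = 1
for layer, s, r in zip(range(5, 0, -1), sigmoid_col, relu_col):
    print(f"layer {layer}: sigmoid {s:.4f}   relu {r:.1f}")
```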

Now sketch from memory — without looking — two curves on the same axes:

  • x-axis: layer depth, 1 to 30.
  • y-axis: gradient magnitude (log scale).
  • One line for sigmoid (steeply collapsing toward 10⁻¹⁸ by layer 1).
  • One line for ReLU (flat near 1 across all 30 layers).

If you can reproduce this picture in 60 seconds, you own the vanishing gradient story.


Bridge. ReLU and residuals keep the nudge alive. But the nudge itself is plain — same step size for every weight, no memory of past direction. That is why we need smart nudging — momentum, per-parameter scaling, Adam. Read 09-optimizers.md next.