01. The stacking failure: why depth without rails produces garbage

Five minutes. One picture. Why stacking smart blocks can still destroy the signal.

Built on the ELI5 in 00-eli5.md. The station — one transformer block doing heavy work — becomes dangerous when you stack it without rails.

The picture before the math

One station can help. Ten stations can hurt. See. A transformer block is powerful. Power is not the same as stability. If every layer freely rewrites the signal, small mistakes compound. Now what is the problem? The packet has no safety rail. It gets hit again and again.

x0 ── station 1 ──→ x1 ── station 2 ──→ x2 ── station 3 ──→ x3
      rewrite            rewrite            rewrite

No shortcut pipe.
No quality inspector.
Only repeated transformation.

If each station stretches a bit, the stack can explode. If each station shrinks a bit, the stack can collapse. If each station flips sign while stretching, the stack can oscillate. So what to do? First understand the failure clearly. Then the fix will feel obvious.

The naive stack in one formula

Without rails, a deep stack looks like this:

x_{l+1} = f_{l+1}(x_l)

After many layers:

x_L = f_L(f_{L-1}(...f_2(f_1(x_0))))

That looks harmless. But composition is merciless. Every layer acts on the output of the previous mistake. A 10% scaling error is not one error. It becomes a chain of errors. A sign flip is not local anymore. It keeps echoing through depth. This is the key lesson. Stacking powerful functions is not the same as stacking stable functions.
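The composition above can be sketched in a few lines of Python. This is a toy under stated assumptions: each layer is a plain scalar function rather than a real transformer block, and the names `scale_by` and `run_stack` are hypothetical helpers introduced only for illustration.

```python
from typing import Callable, List

Layer = Callable[[float], float]


def scale_by(gain: float) -> Layer:
    """Return a toy layer that multiplies its input by a fixed gain."""
    return lambda x: gain * x


def run_stack(layers: List[Layer], x0: float) -> List[float]:
    """Apply the layers in order and return every value x_0 .. x_L."""
    values = [x0]
    for layer in layers:
        # Each layer only ever sees the previous layer's output.
        values.append(layer(values[-1]))
    return values


# Assumed example: 24 layers, each overshooting by 10%.
trace = run_stack([scale_by(1.1)] * 24, 1.0)
print(f"after 24 layers: {trace[-1]:.2f}x the input")  # prints 9.85x
```

A 10% overshoot per layer looks small, yet 24 such layers multiply the input by roughly 9.85. No single layer made a large mistake. The composition did.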

Failure mode 1: explode

Take the simplest case. Each layer multiplies the signal by 1.8. Start with x0 = 1.

| Layer | Computation | Value |
|---|---|---:|
| 0 | start | 1.0000 |
| 1 | 1 × 1.8 | 1.8000 |
| 2 | 1.8 × 1.8 | 3.2400 |
| 3 | 3.24 × 1.8 | 5.8320 |
| 4 | 5.832 × 1.8 | 10.4976 |
| 5 | 10.4976 × 1.8 | 18.8957 |
| 6 | 18.8957 × 1.8 | 34.0122 |

Six layers only. The signal is already 34× the start. Start from x0 = 5 and layer 6 becomes 170.0611. That is not a gentle drift. That is a blow-up. Picture it:

1
└─×1.8→ 1.8
    └─×1.8→ 3.24
        └─×1.8→ 5.832
            └─×1.8→ 10.4976
                └─×1.8→ 18.8957
                    └─×1.8→ 34.0122

Now make it a vector. Let x0 = [1, 2]. Multiply by 1.8I each layer.

| Layer | Vector | L2 norm |
|---|---|---:|
| 0 | [1.000, 2.000] | 2.236 |
| 1 | [1.800, 3.600] | 4.025 |
| 2 | [3.240, 6.480] | 7.245 |
| 3 | [5.832, 11.664] | 13.041 |
| 4 | [10.498, 20.995] | 23.474 |

See the norm. It is not adding calmly. It is running away. In a transformer, runaway norms make later computations brittle. Attention scores can get too sharp. FFN activations can become huge. The model stops behaving like a careful editor. It behaves like a loud amplifier.
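To see why "too sharp" matters, here is a minimal sketch of what a growing scale does to a softmax. The three `base_scores` are hypothetical, and the sketch assumes the scores grow linearly with the per-layer factor of 1.8. That is a simplification: in dot-product attention without normalization, scaling both queries and keys would make the scores grow faster still.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical attention scores for three tokens at layer 0.
base_scores = [1.0, 0.5, 0.0]

max_weights = []
for layer in range(5):
    gain = 1.8 ** layer  # same per-layer factor as the tables above
    weights = softmax([gain * s for s in base_scores])
    max_weights.append(max(weights))
    print(layer, [round(w, 3) for w in weights])
```

At layer 0 the top token gets about half the weight. By layer 4 it gets nearly all of it. The ordering of the scores never changed. Only the scale did.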

Failure mode 2: collapse

Now go the other way. Each layer multiplies by 0.2. Start with x0 = 10.

| Layer | Computation | Value |
|---|---|---:|
| 0 | start | 10.00000 |
| 1 | 10 × 0.2 | 2.00000 |
| 2 | 2 × 0.2 | 0.40000 |
| 3 | 0.4 × 0.2 | 0.08000 |
| 4 | 0.08 × 0.2 | 0.01600 |
| 5 | 0.016 × 0.2 | 0.00320 |
| 6 | 0.0032 × 0.2 | 0.00064 |

Now the signal is almost gone. The stack has not exploded. It has faded. That is just as bad. A deep model cannot reason with a vanishing packet. It forgets what mattered. Picture the collapse:

10
└─×0.2→ 2
    └─×0.2→ 0.4
        └─×0.2→ 0.08
            └─×0.2→ 0.016
                └─×0.2→ 0.0032
                    └─×0.2→ 0.00064

Vector version. Let x0 = [3, 4]. Its norm is 5. Again multiply by 0.2I.

| Layer | Vector | L2 norm |
|---|---|---:|
| 0 | [3.000, 4.000] | 5.000 |
| 1 | [0.600, 0.800] | 1.000 |
| 2 | [0.120, 0.160] | 0.200 |
| 3 | [0.024, 0.032] | 0.040 |
| 4 | [0.0048, 0.0064] | 0.008 |

Now what is left? Almost nothing. A collapsed hidden state cannot carry information forward. The next station is reading smoke.
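How fast is "almost nothing"? A small sketch, assuming the same toy setting where every layer multiplies the norm by a fixed factor. The function name `layers_until_below` and the tolerance of 0.001 are hypothetical choices made for illustration.

```python
def layers_until_below(norm0: float, gain: float, tol: float) -> int:
    """Count the multiply-by-gain layers needed for the norm to drop below tol."""
    if not 0.0 < abs(gain) < 1.0:
        raise ValueError("gain must shrink the signal: 0 < |gain| < 1")
    if tol <= 0.0:
        raise ValueError("tol must be positive")
    layers = 0
    norm = abs(norm0)
    while norm >= tol:
        norm *= abs(gain)  # only the magnitude matters for collapse
        layers += 1
    return layers


# The vector example above: starting norm 5, each layer scales by 0.2.
print(layers_until_below(5.0, 0.2, 1e-3))  # → 6
```

Six layers are enough to push a norm of 5 below 0.001. A model with dozens of layers has no chance of carrying that packet to the end.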

Failure mode 3: oscillate

Third case. Each layer multiplies by -1.2. Start with x0 = 5.

| Layer | Computation | Value |
|---|---|---:|
| 0 | start | 5.0000 |
| 1 | 5 × -1.2 | -6.0000 |
| 2 | -6 × -1.2 | 7.2000 |
| 3 | 7.2 × -1.2 | -8.6400 |
| 4 | -8.64 × -1.2 | 10.3680 |
| 5 | 10.368 × -1.2 | -12.4416 |
| 6 | -12.4416 × -1.2 | 14.9299 |

The magnitude grows. The sign keeps flipping. So the stack does not settle. It bounces.

5 → -6 → 7.2 → -8.64 → 10.368 → -12.4416 → 14.9299

This is not healthy diversity. This is unstable ping-pong. Vector picture. Take x0 = [2, -1]. Multiply by -1.2I each layer.

| Layer | Vector | L2 norm |
|---|---|---:|
| 0 | [2.000, -1.000] | 2.236 |
| 1 | [-2.400, 1.200] | 2.683 |
| 2 | [2.880, -1.440] | 3.220 |
| 3 | [-3.456, 1.728] | 3.864 |
| 4 | [4.147, -2.074] | 4.637 |

Same direction? No. Same size? Also no. The representation keeps over-correcting. In practice that means layer-to-layer disagreement instead of steady refinement.
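All three sets of tables follow from one closed form. A short derivation, assuming every layer multiplies by the same scalar gain $a$ (the toy setting used in this section, not a general transformer block), with $x_0 \neq 0$ and the limits taken as $L \to \infty$:

```latex
\begin{aligned}
x_L &= a^{L} x_0, \qquad \lVert x_L \rVert_2 = |a|^{L}\,\lVert x_0 \rVert_2 \\
|a| > 1 &: \ \lVert x_L \rVert_2 \to \infty
  \quad \text{(explode, e.g. } a = 1.8\text{)} \\
|a| < 1 &: \ \lVert x_L \rVert_2 \to 0
  \quad \text{(collapse, e.g. } a = 0.2\text{)} \\
a < -1 &: \ \lVert x_L \rVert_2 \to \infty \ \text{and } x_L \text{ flips sign every layer}
  \quad \text{(oscillate, e.g. } a = -1.2\text{)}
\end{aligned}
```

Only $|a| = 1$ keeps the norm fixed. Any gain slightly off that mark gets raised to the power $L$, which is why depth turns a small bias into a large failure.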

Why transformers care even more

A transformer station is not a plain multiply. It contains the social bench (attention) and the private bench (the FFN). Attention can strongly copy from one token. The FFN can strongly stretch one feature and squash another. Both are useful. Both are risky. Now stack dozens of such stations. Without rails, each station can rewrite too much. One block says, "Boost this feature." Next block says, "Kill it." Next block says, "Boost it more." Soon the hidden state is numerically messy. See the real point. Naive depth does not fail because layers are dumb. It fails because layers are strong. Strong modules need guardrails.

The core lesson to remember

Say this once clearly. One powerful block is not a proof that many powerful blocks will behave. That is the stacking failure. Depth without rails produces garbage. The garbage has three common shapes:

- explode
- collapse
- oscillate

So what to do? Two rails enter the design. First, the shortcut pipe. It lets the old packet survive. Second, the quality inspector. It keeps the scale sane. These are not cosmetic tricks. They are the reason deep transformers stay trainable.

Where this lives in the wild

- OpenAI ChatGPT / GPT-4. Deep decoder stacks need residual paths and normalization, or long chains of token updates would become unstable.
- Anthropic Claude. The model generates over many layers and long contexts, so stable depth is not optional.
- Google Gemini. Multimodal tokens still travel through a deep transformer stack, and the same stability problem appears.
- Meta Llama 3. Open model code shows residual pathways and normalization inside every block for exactly this reason.
- GitHub Copilot. Code completion depends on deep transformer passes over long token histories, so the stack must preserve signal rather than mangle it.

Interview Q&A

Q: Why does stacking more expressive blocks not automatically improve a transformer?

A: Because expressiveness and stability are different properties. A block can be very capable in isolation and still amplify, erase, or flip the signal when composed many times. Deep learning works only when the composition stays numerically controlled.

Common wrong answer to avoid: "If one block works, more blocks always work better." Depth helps only when the signal has rails.

Q: What are the three failure shapes in naive deep stacks?

A: Explode, collapse, and oscillate. Explode means norms grow layer by layer. Collapse means norms shrink toward zero. Oscillate means the representation keeps flipping direction while often also growing or shrinking.

Q: Why is oscillation not harmless if the values stay bounded for a few layers?

A: Because the model is still over-correcting. One layer pushes a feature positive, the next pushes it negative, and the stack never settles into consistent refinement.

Common wrong answer to avoid: "Oscillation is fine because the average might cancel out." Cancellation is exactly the problem when you need a reliable signal.

Q: What two design choices are introduced to fix this failure in transformers?

A: Residual connections and layer normalization. The residual path preserves the old packet. Layer normalization keeps the scale of the packet from drifting too far.

Apply now (5 min)

Take a page. Write three tiny chains:

1 → ×1.8 → ×1.8 → ×1.8
10 → ×0.2 → ×0.2 → ×0.2
5 → ×-1.2 → ×-1.2 → ×-1.2

Compute all values by hand. Then say out loud what each chain is doing. Explode. Collapse. Oscillate. Now sketch from memory:

- one pipeline with three stations and no shortcut pipe
- the three scalar chains
- one sentence: stacking powerful functions ≠ stacking stable functions

If you can draw that in under two minutes, you own the failure.
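Once the hand computation is done, a short script can check it. This is a minimal sketch for the three chains in this exercise. The helpers `chain` and `describe` are hypothetical names, and `describe` uses a deliberately simple rule (any sign flip means oscillate, otherwise compare first and last magnitude) that is only meant for these constant-gain toy chains.

```python
from typing import List


def chain(x0: float, gain: float, steps: int) -> List[float]:
    """Return x0 followed by the results of multiplying by gain `steps` times."""
    values = [x0]
    for _ in range(steps):
        values.append(values[-1] * gain)
    return values


def describe(values: List[float]) -> str:
    """Label a chain: sign flips first, then whether the magnitude grew."""
    flips = any(a * b < 0 for a, b in zip(values, values[1:]))
    if flips:
        return "oscillate"
    grows = abs(values[-1]) > abs(values[0])
    return "explode" if grows else "collapse"


# The three chains from the exercise, three multiplications each.
for x0, gain in [(1.0, 1.8), (10.0, 0.2), (5.0, -1.2)]:
    values = chain(x0, gain, 3)
    print([round(v, 4) for v in values], describe(values))
```

The printed chains should end at 5.832, 0.08, and -8.64. Do the page first, then run the check.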

Bridge. If naive depth destroys the packet, the next step is obvious: keep the old packet alive while each station adds only an edit. Read 02-residual-connections.md next.