02. Residual connections — the shortcut pipe¶
Five minutes. One picture. Why edits travel better than rewrites.
Built on the ELI5 in 00-eli5.md. The shortcut pipe (the old packet surviving around the station) is the whole point of this file.
The picture before the math¶
Imagine a highway. There is the main road. There is also a bypass lane. If road work goes wrong, traffic still moves. That is a residual connection. The station does some work. But the old packet has its own protected path.
        ┌──────── station computes f(x) ────────┐
x ──────┤                   +                   ├────→ y
        └──────────── shortcut pipe ────────────┘
The formula¶
Residual connections use one simple equation:

$$y = x + f(x)$$

Read it slowly. x is the old packet. f(x) is the block's proposed edit. y is the old packet plus the edit. Not replacement. Addition. That is why the shortcut pipe is such a good name. The packet survives through the pipe even if the station adds a bad note.
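A minimal sketch of that last point, assuming scalar packets for readability. The names `plain_block`, `residual_block`, and `bad_note` are hypothetical and chosen only for this illustration.

```python
def plain_block(x, f):
    # No shortcut: the output is whatever the station produces.
    return f(x)


def residual_block(x, f):
    # Shortcut pipe: the old packet plus the station's proposed edit.
    return x + f(x)


def bad_note(x):
    # Hypothetical station that proposes a small, unhelpful edit.
    return -0.5


x = 4.0
print(plain_block(x, bad_note))     # -0.5 (the old packet is gone)
print(residual_block(x, bad_note))  # 3.5  (the old packet is still mostly there)
```

With the same bad station, the plain block loses the input entirely, while the residual block stays close to it.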
Why edits are easier than full rewrites¶
Suppose the right answer is close to the input. That happens all the time in deep models. Most layers do not need a revolution. They need a correction. If a block must learn the full mapping, it must produce the whole output from scratch. If a block has a shortcut pipe, it only learns the difference. See the contrast.
- Full rewrite target: from 4 to 5
- Residual edit target: from 4, add +1
Which is easier? Usually the edit. Another case.
- Full rewrite target: from 4 to 4
- Residual edit target: from 4, add 0
Now the best move is almost effortless. The block can stay quiet. That is exactly what good deep stacks need.
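Both cases follow one rule. As a sketch, write g(x) for the full mapping the layer should realise (g is notation assumed here for illustration). The edit the block has to learn is then just the difference:

$$f(x) = g(x) - x$$

For the first case, f(4) = 5 - 4 = 1. For the second, f(4) = 4 - 4 = 0, which is why the block can stay quiet.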
Worked forward example 1 — positive edit from x = 4¶
Start with x = 4. Let the station compute f(x) = 1. Then:

$$y = 4 + 1 = 5$$

The station did not have to build 5 from nothing. It only said, "Add one."
Worked forward example 2 — negative edit from x = 4¶
Again start with x = 4. Now let the station compute f(x) = -2.5. Then:

$$y = 4 + (-2.5) = 1.5$$
Worked forward example 3 — zero edit from x = 4¶
Again x = 4. Now let the station compute f(x) = 0. Then:

$$y = 4 + 0 = 4$$
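The three forward examples can be checked in a few lines. This is a sketch that treats each edit as a fixed number, which is a simplification: in a real block, f(x) depends on x. The helper name `residual` is hypothetical.

```python
def residual(x, edit):
    # `edit` stands for the value f(x) that the station proposes.
    return x + edit


x = 4.0
edits = [1.0, -2.5, 0.0]
outputs = [residual(x, edit) for edit in edits]

for edit, y in zip(edits, outputs):
    print(f"f(x) = {edit:+.1f}  ->  y = {y}")
# f(x) = +1.0  ->  y = 5.0
# f(x) = -2.5  ->  y = 1.5
# f(x) = +0.0  ->  y = 4.0
```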
Three stations in a row — the packet survives¶
Now chain three stations. Start with x0 = 4.
- Station 1 proposes edit +1
- Station 2 proposes edit -0.5
- Station 3 proposes edit +2
Walk it:
| Step | Formula | Value |
|---|---|---:|
| start | x0 | 4.0 |
| after station 1 | x1 = 4 + 1 | 5.0 |
| after station 2 | x2 = 5 + (-0.5) | 4.5 |
| after station 3 | x3 = 4.5 + 2 | 6.5 |
ASCII picture:
4 ──(+1)──→ 5 ──(-0.5)──→ 4.5 ──(+2)──→ 6.5
     │             │              │
old survives  old survives   old survives
through pipe  through pipe   through pipe
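The same walk can be written as a loop. This is a sketch with fixed edits standing in for each station's f(x); `run_stations` is a hypothetical helper name.

```python
def run_stations(x0, edits):
    """Apply each edit additively and record the packet after every station."""
    trace = [x0]
    for edit in edits:
        trace.append(trace[-1] + edit)
    return trace


edits = [1.0, -0.5, 2.0]
trace = run_stations(4.0, edits)
print(trace)  # [4.0, 5.0, 4.5, 6.5]
```

Because every step is an addition, the final value equals the starting packet plus the sum of all edits: 4 + (1 - 0.5 + 2) = 6.5.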
The gradient picture — the hidden superpower¶
Forward pass is only half the story. Training also needs gradients to flow backward. For y = x + f(x), the derivative is:

$$\frac{dy}{dx} = 1 + \frac{df}{dx}$$

The 1 matters. Very much. It is the identity path. Even if the block's own derivative is weak, the shortcut pipe still contributes a clean route for gradient flow.

Now what is the problem in a plain stack? Without residuals, each layer's gradient factor is only df/dx, and backpropagation multiplies those factors layer by layer. If many of them are small, gradients vanish. If many are large, gradients explode. Residuals do not magically solve everything. But they give gradients a dependable lane.
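One way to see the dependable lane is to apply the chain rule across a whole stack. This is a sketch for the scalar case, assuming L stacked layers; the indices and the symbol L are notation introduced here, not taken from the text.

$$\frac{dx_L}{dx_0} = \prod_{i=1}^{L} \left(1 + f_i'(x_{i-1})\right) \quad \text{(residual stack, } x_i = x_{i-1} + f_i(x_{i-1})\text{)}$$

$$\frac{dx_L}{dx_0} = \prod_{i=1}^{L} f_i'(x_{i-1}) \quad \text{(plain stack, } x_i = f_i(x_{i-1})\text{)}$$

Expanding the residual product gives 1 plus terms that involve the station derivatives, so one term never passes through any station at all. In the plain product every term does, so a few small factors shrink the whole gradient.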
Worked gradient example 1 — helpful positive slope¶
Suppose df/dx = 0.2. Then:

$$\frac{dy}{dx} = 1 + 0.2 = 1.2$$

Without the shortcut pipe, the gradient factor would be only 0.2. That is much easier to lose across many layers.
Worked gradient example 2 — even with a negative local slope¶
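Suppose df/dx = -0.7. Then:

$$\frac{dy}{dx} = 1 + (-0.7) = 0.3$$

The total slope stays positive. Without the shortcut pipe, it would be -0.7 only. That is a noisier training signal.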
Worked gradient example 3 — silent station, live gradient¶
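Suppose df/dx = 0. Then:

$$\frac{dy}{dx} = 1 + 0 = 1$$

The station contributes nothing, yet the gradient still passes through the shortcut pipe at full strength.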
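The rule dy/dx = 1 + df/dx can be sanity-checked numerically. This is a sketch that assumes a linear station f(x) = slope * x, so that df/dx equals the chosen slope; `numeric_derivative` and `make_residual` are hypothetical helper names.

```python
def numeric_derivative(fn, x, h=1e-6):
    # Central finite difference: an approximation of the slope of fn at x.
    return (fn(x + h) - fn(x - h)) / (2 * h)


def make_residual(slope):
    # Residual block y = x + f(x) with the assumed station f(x) = slope * x.
    return lambda x: x + slope * x


for slope in (0.2, -0.7, 0.0):
    measured = numeric_derivative(make_residual(slope), 4.0)
    print(f"df/dx = {slope:+.1f}  ->  dy/dx ~ {measured:.3f}  (1 + df/dx = {1 + slope:.3f})")
```

The measured slopes should agree with 1.2, 0.3, and 1.0 up to floating-point error.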
Why this matters inside a transformer block¶
A transformer station has two strong sublayers. The social bench computes attention-based edits. The private bench computes FFN-based edits. Both can produce sharp changes. Residual connections keep those changes additive. The old packet in the residual stream never disappears completely. This is why people say deep transformers learn refinements. Layer by layer, the model nudges the representation. It does not repeatedly erase and redraw the whole thing. That is the engineering win.
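The two additions can be sketched structurally as follows. This is a toy under loud assumptions: each token is a single number rather than a vector, `attention_edit` and `ffn_edit` are hypothetical stand-ins rather than real attention or FFN computations, and normalization is left out entirely.

```python
def attention_edit(tokens):
    # Stand-in for the social bench: each token looks at all tokens
    # and is nudged toward their mean (toy rule, not real attention).
    mean = sum(tokens) / len(tokens)
    return [0.1 * (mean - t) for t in tokens]


def ffn_edit(tokens):
    # Stand-in for the private bench: each token is edited on its own.
    return [0.05 * t for t in tokens]


def add(tokens, edits):
    return [t + e for t, e in zip(tokens, edits)]


def block(tokens):
    tokens = add(tokens, attention_edit(tokens))  # residual around attention
    tokens = add(tokens, ffn_edit(tokens))        # residual around the FFN
    return tokens


packet = [4.0, 2.0, 6.0]
print(block(packet))
```

The point is only the shape of `block`: each sublayer reads the running packet and adds to it, so the output stays near the input instead of replacing it.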
A quick compare — rewrite versus edit¶
Suppose you want a layer to map x = 4 to 4.2. Two ways to think:
| Design | What the layer must learn |
|---|---|
| no residual | g(4) = 4.2 |
| residual | f(4) = 0.2 |
The second target is smaller. Cleaner. More local. This is why the shortcut pipe is not just a convenience. It changes the optimization problem.
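A toy experiment makes the difference concrete. It is a sketch under strong assumptions: the layer has a single learnable number `b` that starts at zero, the loss is squared error, and plain gradient descent is used. None of these choices come from the text, and `steps_to_fit` is a hypothetical helper name.

```python
x = 4.0
target = 4.2


def steps_to_fit(use_residual, lr=0.1, tol=1e-3, max_steps=1000):
    """Count gradient-descent steps until |output - target| < tol."""
    b = 0.0  # the single learnable parameter, initialised at zero
    for step in range(max_steps):
        out = (x + b) if use_residual else b
        err = out - target
        if abs(err) < tol:
            return step
        b -= lr * 2 * err  # gradient of err**2 with respect to b
    return max_steps


plain_steps = steps_to_fit(use_residual=False)
residual_steps = steps_to_fit(use_residual=True)
print(plain_steps, residual_steps)
```

From a zero start the plain layer begins 4.2 away from its target and the residual layer only 0.2 away, so in this setup the residual version needs fewer steps.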
Where this lives in the wild¶
- OpenAI ChatGPT / GPT-4. Every decoder block uses residual pathways so token representations survive deep editing.
- Anthropic Claude. Long chains of transformer blocks rely on additive updates rather than full rewrites at each layer.
- Google Gemini. The same shortcut idea stabilizes deep multimodal transformer stacks.
- Meta Llama 3. Open architectures show residual additions around attention and FFN sublayers.
- GitHub Copilot. Code tokens move through many transformer blocks, and residual edits help preserve earlier context.
Interview Q&A¶
Q: What does a residual connection compute?
A: It computes y = x + f(x). The block keeps the old input x and adds a learned edit f(x) on top of it.

Q: Why are residuals often described as “learning edits”?
A: Because the block is not asked to generate the full output from scratch. It only has to predict the difference between the old representation and the improved one.
Common wrong answer to avoid: "Residual means the model remembers everything automatically." It preserves a path, not perfect memory.

Q: Why does dy/dx = 1 + df/dx help training?
A: The 1 is an identity gradient path. Even when the block's own derivative is tiny or awkward, some gradient can still pass cleanly through the shortcut connection.

Q: Do residuals make bad layers harmless?
A: Not fully. A bad layer can still add a harmful edit. But the old packet still survives, so the damage is usually smaller than in a full rewrite design.
Common wrong answer to avoid: "Residuals solve stability completely." They help a lot, but transformers still need normalization and careful design.
Apply now (5 min)¶
Take one number. Use x = 4. Now invent three edits:
- one positive
- one negative
- one zero
Compute x + f(x) for each. Then do the same for gradients. Pick three values for df/dx and compute 1 + df/dx. Now sketch from memory:
- one bypass diagram with the shortcut pipe
- the formula y = x + f(x)
- the formula dy/dx = 1 + df/dx
- one sentence: "Residual layers learn edits, not full rewrites."
If you can explain why the packet survives through the pipe, you are ready for the next idea.
Bridge. Once the shortcut pipe preserves the old packet, you can finally name the thing flowing through all layers: the shared running vector called the residual stream. Read 03-residual-stream.md next.