02. Residual connections: the shortcut pipe

Five minutes. One picture. Why edits travel better than rewrites.

Built on the ELI5 in 00-eli5.md. The shortcut pipe — the old packet surviving around the station — is the whole point of this file.

The picture before the math

Imagine a highway. There is the main road. There is also a bypass lane. If road work goes wrong, traffic still moves. That is a residual connection. The station does some work. But the old packet has its own protected path.

        ┌──────── station computes f(x) ────────┐
x ──────┤                                      + ├────→ y
        └──────────── shortcut pipe ────────────┘

See the design. The block is not forced to rewrite everything. It only needs to add an edit. That small idea changes deep learning.

The formula

Residual connections use one simple equation:

y = x + f(x)

Read it slowly. x is the old packet. f(x) is the block's proposed edit. y is old packet plus edit. Not replacement. Addition. That is why the shortcut pipe is such a good name. The packet survives through the pipe even if the station adds a bad note.
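The equation maps directly onto code. The following is a minimal Python sketch, not a real layer: `residual` and `station` are names invented here, and the station is a stand-in function that always returns the same edit.

```python
from typing import Callable


def residual(x: float, f: Callable[[float], float]) -> float:
    """Return y = x + f(x): the old packet plus the block's edit."""
    return x + f(x)


def station(x: float) -> float:
    """Hypothetical station: always proposes the edit +1."""
    return 1.0


y = residual(4.0, station)
print(y)  # 5.0
```

Note that `residual` never inspects what `f` does. Whatever the station returns, `x` is still part of the sum.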

Why edits are easier than full rewrites

Suppose the right answer is close to the input. That happens all the time in deep models. Most layers do not need a revolution. They need a correction. If a block must learn the full mapping, it must produce the whole output from scratch. If a block has a shortcut pipe, it only learns the difference. See the contrast.

Full rewrite target: from 4 to 5
Residual edit target: from 4 add +1

Which is easier? Usually the edit. Another case.

Full rewrite target: from 4 to 4
Residual edit target: from 4 add 0

Now the best move is almost effortless. The block can stay quiet. That is exactly what good deep stacks need.

Worked forward example 1: positive edit from `x = 4`

Start with x = 4. Let the station compute f(x) = 1. Then:

y = x + f(x) = 4 + 1 = 5

Picture it:

old packet: 4
edit:      +1
result:     5

The station did not rebuild 5 from nothing. It only said, "Add one."

Worked forward example 2: negative edit from `x = 4`

Again start with x = 4. Now let the station compute f(x) = -2.5. Then:

y = 4 + (-2.5) = 1.5

Picture it:

old packet: 4
edit:      -2.5
result:     1.5

So residual does not mean only boosting. It means controlled correction. The station can increase or decrease features.

Worked forward example 3: zero edit from `x = 4`

Again x = 4. Now let the station compute f(x) = 0. Then:

y = 4 + 0 = 4

That looks trivial. It is not trivial. It means a layer can do no harm when no edit is needed. Without residuals, matching the identity map is awkward. With residuals, identity is one quiet edit away. That is a huge reason deep stacks train better.
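One way to see why identity is awkward without residuals is to compare two toy layers whose only parameter is a scalar weight `w`. This is a sketch under assumptions made up for illustration: a linear station `f(x) = w * x` and a weight that starts at exactly zero. Real layers have many weights and usually start from small random values.

```python
def plain_layer(x: float, w: float) -> float:
    """No shortcut: the layer output is the station output itself."""
    return w * x


def residual_layer(x: float, w: float) -> float:
    """Shortcut pipe: the station output w * x is added to x."""
    return x + w * x


x = 4.0
w = 0.0  # assumed "quiet" station

print(plain_layer(x, w))     # 0.0 -> the packet is lost
print(residual_layer(x, w))  # 4.0 -> the packet passes through unchanged
```

In this toy setup the plain layer must learn `w = 1` to copy its input, while the residual layer copies its input when the station does nothing at all.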

Three stations in a row: the packet survives

Now chain three stations. Start with x0 = 4.

Station 1 proposes edit +1
Station 2 proposes edit -0.5
Station 3 proposes edit +2

Walk it:

| Step | Formula | Value |
|---|---|---:|
| start | x0 | 4.0 |
| after station 1 | x1 = 4 + 1 | 5.0 |
| after station 2 | x2 = 5 + (-0.5) | 4.5 |
| after station 3 | x3 = 4.5 + 2 | 6.5 |

ASCII picture:

4 ──(+1)──→ 5 ──(-0.5)──→ 4.5 ──(+2)──→ 6.5
│            │               │
old survives old survives    old survives
through pipe through pipe    through pipe

See the rhythm. Each station is an editor. Not a dictator. The packet keeps its history.
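The three-station walk can be reproduced in a few lines. This is a sketch with the edits hard-coded as constants to match the walk above; in a real network each edit would be computed from the current value as `f(x)`. The name `run_stations` is invented for this example.

```python
from typing import List


def run_stations(x0: float, edits: List[float]) -> List[float]:
    """Apply x_next = x + edit for each station and record every value."""
    values = [x0]
    for edit in edits:
        values.append(values[-1] + edit)
    return values


trace = run_stations(4.0, [1.0, -0.5, 2.0])
print(trace)  # [4.0, 5.0, 4.5, 6.5]
```

Because every step is an addition, the final value equals the starting packet plus the sum of all edits: 4 + (1 - 0.5 + 2) = 6.5.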

The gradient picture: the hidden superpower

Forward pass is only half the story. Training also needs gradients to flow backward. For y = x + f(x), the derivative is:

dy/dx = 1 + df/dx

That 1 matters. Very much. It is the identity path. Even if the block's own derivative is weak, the shortcut pipe still contributes a clean route for gradient flow. Now what is the problem in a plain stack? Without residuals, each layer contributes only df/dx, and the chain rule multiplies those factors across layers. If many of them are small, gradients vanish. If many are large, gradients explode. Residuals do not magically solve everything. But they give gradients a dependable lane.
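The derivative formula can be checked numerically. The sketch below assumes a made-up station `f(x) = 0.5 * tanh(x)`, chosen only because its derivative is easy to write down, and compares a finite-difference estimate of dy/dx with `1 + df/dx`. All function names are illustrative.

```python
import math


def f(x: float) -> float:
    """Hypothetical station: a smooth edit chosen only for illustration."""
    return 0.5 * math.tanh(x)


def df_dx(x: float) -> float:
    """Analytic derivative of f: 0.5 * (1 - tanh(x)^2)."""
    return 0.5 * (1.0 - math.tanh(x) ** 2)


def residual_output(x: float) -> float:
    """Residual output y = x + f(x)."""
    return x + f(x)


def numeric_derivative(fn, x: float, h: float = 1e-5) -> float:
    """Central finite-difference approximation of d fn / dx."""
    return (fn(x + h) - fn(x - h)) / (2.0 * h)


x = 0.3
print(numeric_derivative(residual_output, x))  # numerical dy/dx
print(1.0 + df_dx(x))                          # formula: 1 + df/dx
```

The two printed values should agree to several decimal places, with the small gap coming from the finite-difference approximation.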

Worked gradient example 1: helpful positive slope

Suppose df/dx = 0.2. Then:

dy/dx = 1 + 0.2 = 1.2

So the gradient is slightly amplified. Not wildly. Without the residual path, the factor would be only 0.2. A factor that small is much easier to lose when many layers multiply together.

Worked gradient example 2: even with a negative local slope

Suppose df/dx = -0.7. Then:

dy/dx = 1 + (-0.7) = 0.3

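The block locally pushes back. Still, the total derivative stays positive, because the identity path outweighs the pushback. Without the shortcut pipe, the derivative would be -0.7: larger in size, but with the sign flipped by this one layer. So the residual path does not promise a bigger gradient at every layer. It promises that the identity term is always part of the sum.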

Worked gradient example 3: silent station, live gradient

Suppose df/dx = 0. Then:

dy/dx = 1 + 0 = 1

This is beautiful. The station can be silent. The gradient still passes through unchanged. So what should a layer do when it does not need to change much? Stay close to a zero edit. Depth remains usable.
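The three slopes above describe a single layer. Across depth, the chain rule multiplies the per-layer derivatives. The sketch below assumes, purely for illustration, that each of 10 layers has the same local slope; real networks have a different slope at every layer. The name `stacked_gradient` is invented here.

```python
def stacked_gradient(per_layer: float, depth: int) -> float:
    """Chain rule for identical layers: multiply the per-layer derivative depth times."""
    return per_layer ** depth


depth = 10  # assumed depth, chosen for illustration
for slope in (0.2, -0.7, 0.0):
    plain = stacked_gradient(slope, depth)               # each layer contributes df/dx
    with_shortcut = stacked_gradient(1.0 + slope, depth)  # each layer contributes 1 + df/dx
    print(f"df/dx={slope:+.1f}  plain={plain:.3e}  residual={with_shortcut:.3e}")
```

With a slope of 0.2, the plain product collapses to roughly 1e-7 while the residual product grows to roughly 6.2, which also shows that residual gradients can grow with depth. With a slope of 0, the plain product is exactly 0 and the residual product is exactly 1. With a slope of -0.7, the residual product is the smaller of the two, so the shortcut is not a guarantee of a larger gradient in every case.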

Why this matters inside a transformer block

A transformer station has two strong sublayers. The social bench computes attention-based edits. The private bench computes FFN-based edits. Both can produce sharp changes. Residual connections keep those changes additive. The old packet in the residual stream never disappears completely. This is why people say deep transformers learn refinements. Layer by layer, the model nudges the representation. It does not repeatedly erase and redraw the whole thing. That is the engineering win.
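The two-bench structure can be sketched as two residual additions in sequence. This is a deliberately simplified illustration: `toy_attention` and `toy_ffn` are hypothetical placeholders, not real attention or a real FFN (real attention mixes information across tokens, which a single vector cannot show), and normalization is left out.

```python
from typing import Callable, List

Vector = List[float]


def add(a: Vector, b: Vector) -> Vector:
    """Elementwise sum: the '+' of the shortcut pipe."""
    return [ai + bi for ai, bi in zip(a, b)]


def transformer_block(
    x: Vector,
    attention: Callable[[Vector], Vector],
    ffn: Callable[[Vector], Vector],
) -> Vector:
    """Two residual sublayers in sequence; normalization is omitted here."""
    x = add(x, attention(x))  # social bench: attention-based edit
    x = add(x, ffn(x))        # private bench: FFN-based edit
    return x


def toy_attention(x: Vector) -> Vector:
    """Placeholder edit (hypothetical): a small scaled copy of the input."""
    return [0.1 * xi for xi in x]


def toy_ffn(x: Vector) -> Vector:
    """Placeholder edit (hypothetical): a scaled ReLU."""
    return [0.5 * max(0.0, xi) for xi in x]


print(transformer_block([4.0, -2.0], toy_attention, toy_ffn))
```

If both placeholders returned zeros, the block would return its input unchanged, which is the additive behavior described above.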

A quick compare: rewrite versus edit

Suppose you want a layer to map x = 4 to 4.2. Two ways to think:

| Design | What the layer must learn |
|---|---|
| no residual | g(4) = 4.2 |
| residual | f(4) = 0.2 |

The second target is smaller. Cleaner. More local. This is why the shortcut pipe is not just a convenience. It changes the optimization problem.
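A tiny experiment makes the difference in the optimization problem concrete. The sketch below is a toy under stated assumptions, not a claim about real training: one scalar weight `w` starting at zero, a linear station `w * x`, squared error, and plain gradient descent with a made-up learning rate. The helper `steps_to_fit` is invented for this example.

```python
from typing import Callable


def steps_to_fit(
    predict: Callable[[float, float], float],
    x: float,
    target: float,
    lr: float = 0.01,   # assumed learning rate
    tol: float = 0.01,  # stop when |error| drops below this
    max_steps: int = 1000,
) -> int:
    """Count gradient-descent steps until the prediction is within tol of target."""
    w = 0.0  # assumed zero initialization
    for step in range(max_steps):
        error = predict(x, w) - target
        if abs(error) < tol:
            return step
        # d(error^2)/dw = 2 * error * d(predict)/dw, and d(predict)/dw = x in both designs
        w -= lr * 2.0 * error * x
    return max_steps


plain_steps = steps_to_fit(lambda x, w: w * x, x=4.0, target=4.2)         # g(x) = w * x
residual_steps = steps_to_fit(lambda x, w: x + w * x, x=4.0, target=4.2)  # y = x + w * x

print("plain:", plain_steps, "residual:", residual_steps)
```

In this toy both designs shrink their error at the same rate per step. The residual design finishes sooner only because it starts 0.2 away from the target instead of 4.2 away.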

Where this lives in the wild

OpenAI ChatGPT / GPT-4. Every decoder block uses residual pathways so token representations survive deep editing.
Anthropic Claude. Long chains of transformer blocks rely on additive updates rather than full rewrites at each layer.
Google Gemini. The same shortcut idea stabilizes deep multimodal transformer stacks.
Meta Llama 3. Open architectures show residual additions around attention and FFN sublayers.
GitHub Copilot. Code tokens move through many transformer blocks, and residual edits help preserve earlier context.

Interview Q&A

Q: What does a residual connection compute?
A: It computes y = x + f(x). The block keeps the old input x and adds a learned edit f(x) on top of it.

Q: Why are residuals often described as “learning edits”?
A: Because the block is not asked to generate the full output from scratch. It only has to predict the difference between the old representation and the improved one.
Common wrong answer to avoid: "Residual means the model remembers everything automatically." It preserves a path, not perfect memory.

Q: Why does dy/dx = 1 + df/dx help training?
A: The 1 is an identity gradient path. Even when the block's own derivative is tiny or awkward, some gradient can still pass cleanly through the shortcut connection.

Q: Do residuals make bad layers harmless?
A: Not fully. A bad layer can still add a harmful edit. But the old packet still survives, so the damage is usually smaller than in a full rewrite design.
Common wrong answer to avoid: "Residuals solve stability completely." They help a lot, but transformers still need normalization and careful design.

Apply now (5 min)

Take one number. Use x = 4. Now invent three edits:

one positive
one negative
one zero

Compute x + f(x) for each. Then do the same for gradients. Pick three values for df/dx and compute 1 + df/dx. Now sketch from memory:

one bypass diagram with the shortcut pipe
the formula y = x + f(x)
the formula dy/dx = 1 + df/dx
one sentence: "Residual layers learn edits, not full rewrites."

If you can explain why the packet survives through the pipe, you are ready for the next idea.

Bridge. Once the shortcut pipe preserves the old packet, you can finally name the thing flowing through all layers: the shared running vector called the residual stream. Read 03-residual-stream.md next.