03. The residual stream — the shared canvas¶
Five minutes. One picture. Why every block writes on the same page.
Built on the ELI5 in 00-eli5.md. The residual stream, the packet traveling through all stations, is the main highway this file names and studies.
The picture before the math¶
Think of one shared Google Doc. Reviewer 1 opens it and adds comments. Reviewer 2 opens the same doc and adds more comments. Reviewer 3 does the same. Nobody makes a fresh private copy and replaces the old one. That shared live document is the residual stream. In a transformer, every station reads from the same running vector. Then it writes an edit back into that same running vector. See. There is one canvas. Not a new canvas per block.
x0 ── station 1 ──→ x1 ── station 2 ──→ x2 ── station 3 ──→ x3
│ │ │
shared canvas shared canvas shared canvas
keeps growing keeps growing keeps growing
The one sentence definition¶
The residual stream is the fixed-width vector that travels through the whole transformer. Every block reads from it. Every block writes back into it. The width stays the same. Only the contents change. If the model width is d_model = 4096, then every layer sees a 4096-dimensional stream. Layer 1 sees width 4096. Layer 20 sees width 4096. Layer 80 also sees width 4096. Same road width. New traffic content.
The math picture¶
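Ignoring normalization placement for a moment, a block does this:

$$x_{\text{out}} = x_{\text{in}} + \text{edit}(x_{\text{in}})$$

And that edit often comes from two sub-edits:

$$\text{edit} = \text{edit}_{\text{attn}} + \text{edit}_{\text{FFN}}$$

So you can picture one block as:

$$x_{\text{mid}} = x_{\text{in}} + \text{Attn}(x_{\text{in}}), \qquad x_{\text{out}} = x_{\text{mid}} + \text{FFN}(x_{\text{mid}})$$

The social bench reads the stream and proposes an attention edit. The private bench reads the stream and proposes an FFN edit. Both edits go back onto the same canvas. That is the core architecture.

The same width from start to finish¶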
This part is easy to miss. The stream is not getting wider every layer. It is not a staircase of changing widths. It is one fixed-width highway. Take a tiny toy model with d_model = 4. Then every stream state has four numbers.
layer 0 [ _ _ _ _ ]
layer 1 [ _ _ _ _ ]
layer 2 [ _ _ _ _ ]
layer 3 [ _ _ _ _ ]
same width = d_model = 4
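The fixed-width idea can be sketched in a few lines of plain Python. This is a minimal illustration, not a real transformer: `toy_attention_edit` and `toy_ffn_edit` are invented stand-ins whose only job is to return a vector of the same width as the stream they read.

```python
# Toy residual block on a width-4 stream (d_model = 4).
# The two edit functions are hypothetical stand-ins, not real attention or FFN layers.

D_MODEL = 4


def add(a, b):
    """Elementwise sum of two equal-length vectors."""
    assert len(a) == len(b)
    return [x + y for x, y in zip(a, b)]


def toy_attention_edit(x):
    # Made-up edit: half of the current stream, same width as x.
    return [0.5 * v for v in x]


def toy_ffn_edit(x):
    # Made-up edit: ReLU of (v - 1) per coordinate, same width as x.
    return [max(0.0, v - 1.0) for v in x]


def block(x):
    """One station: read the stream, then write two edits back onto it."""
    x = add(x, toy_attention_edit(x))  # attention write-back
    x = add(x, toy_ffn_edit(x))        # FFN write-back
    return x


def run_stack(x0, n_layers):
    """Return every stream state from layer 0 up to layer n_layers."""
    states = [x0]
    for _ in range(n_layers):
        states.append(block(states[-1]))
    return states


states = run_stack([1.0, 0.0, 2.0, 1.0], n_layers=3)
widths = [len(s) for s in states]
print(widths)  # [4, 4, 4, 4]
```

Because every edit is added to the stream rather than concatenated onto it, the width cannot change no matter how many stations are stacked.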
Worked example — a 3-layer pass on one shared stream¶
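Let the starting stream be:

x0 = [1, 0, 2, 1]

We use three stations. Each one writes two edits. One from attention. One from the FFN.

Station 1¶
The attention edit is added to the stream first. Then the FFN edit is added to the result. Together the two edits sum to [1, 1, 0, 0], so the stream goes from [1, 0, 2, 1] to [2, 1, 2, 1].

Station 2¶
Same two-step write-back. Together the two edits sum to [-1, 0, 2, -1], so the stream goes from [2, 1, 2, 1] to [1, 1, 4, 0].

Station 3¶
Same two-step write-back. Together the two edits sum to [1, -1, -2, 2], so the stream goes from [1, 1, 4, 0] to [2, 0, 2, 2].

Put the whole journey in one table.

| Stage | Stream |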
|---|---|
| start | [1, 0, 2, 1] |
| after station 1 | [2, 1, 2, 1] |
| after station 2 | [1, 1, 4, 0] |
| after station 3 | [2, 0, 2, 2] |
See what happened. Nothing got replaced wholesale. The shared canvas accumulated edits. That is the residual stream in action.
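The table can be checked mechanically. The sketch below copies the four stream states from the table, recovers the net edit each station wrote (state after minus state before), and replays those edits onto the starting stream. Only the net edit per station is computed; how it splits between attention and FFN is not modeled here.

```python
# Stream states copied from the table above.
stages = [
    ("start", [1, 0, 2, 1]),
    ("after station 1", [2, 1, 2, 1]),
    ("after station 2", [1, 1, 4, 0]),
    ("after station 3", [2, 0, 2, 2]),
]


def add(a, b):
    """Elementwise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def subtract(a, b):
    """Elementwise difference of two equal-length vectors."""
    return [x - y for x, y in zip(a, b)]


# Net edit written by each station: state after minus state before.
net_edits = [
    subtract(stages[i + 1][1], stages[i][1])
    for i in range(len(stages) - 1)
]

# Replay: start from x0 and add each station's net edit in turn.
stream = list(stages[0][1])
for edit in net_edits:
    stream = add(stream, edit)

print(net_edits)  # [[1, 1, 0, 0], [-1, 0, 2, -1], [1, -1, -2, 2]]
print(stream)     # [2, 0, 2, 2]
```

The final stream equals the start plus the sum of all edits, which is exactly what "accumulated, not replaced" means.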
How attention writes to the stream¶
The social bench looks across tokens. It asks, "Which other tokens matter for me right now?" Then it writes back an edit. That edit is still a vector of width d_model. So attention is not a side channel. It is a writer onto the stream.
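Here is a heavily simplified sketch of that write-back, assuming a single head with no learned projections and no causal mask: each token's raw stream vector serves as its own query, key, and value. Real attention layers use learned weight matrices for all three, so treat `toy_attention_edits` as a hypothetical illustration of the shape of the computation only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def toy_attention_edits(streams):
    """For each token, build an edit that is a weighted mix of all tokens' streams.

    Assumption: queries, keys and values are the raw stream vectors
    (no learned projections, one head, no causal mask).
    """
    d_model = len(streams[0])
    scale = math.sqrt(d_model)
    edits = []
    for query in streams:
        # "Which other tokens matter for me right now?"
        weights = softmax([dot(query, key) / scale for key in streams])
        # The edit is a weighted average of value vectors, still width d_model.
        edit = [
            sum(w * value[i] for w, value in zip(weights, streams))
            for i in range(d_model)
        ]
        edits.append(edit)
    return edits


# Three tokens, each with its own width-4 stream vector (made-up numbers).
streams = [
    [1.0, 0.0, 2.0, 1.0],
    [0.0, 1.0, 0.0, 1.0],
    [2.0, 1.0, 0.0, 0.0],
]
edits = toy_attention_edits(streams)

# Write-back: each token adds its edit onto its own stream.
new_streams = [[x + e for x, e in zip(s, ed)] for s, ed in zip(streams, edits)]
print([len(s) for s in new_streams])  # [4, 4, 4]
```

Each token's edit depends on the other tokens' vectors, yet it lands in the same width-4 stream it was read from.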
How the FFN writes to the stream¶
The private bench works per token. No consultation. It reads the current stream vector for one token. Then it applies a learned transformation. Then it writes an edit back.
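A minimal sketch of the per-token behaviour, assuming a two-layer FFN with a ReLU in the middle. The weights `W_IN` and `W_OUT` are invented for illustration, and the hidden width of 3 is chosen only to keep the example small; real models usually use a hidden width several times d_model.

```python
# Hypothetical weights: W_IN maps d_model=4 to hidden=3, W_OUT maps back to 4.
W_IN = [
    [1.0, 0.0, -1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.5, 0.5, 0.5, -1.0],
]
W_OUT = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, -1.0, 0.0],
]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def toy_ffn_edit(x):
    """Read one token's stream vector and propose an edit of the same width."""
    hidden = [max(0.0, h) for h in matvec(W_IN, x)]  # ReLU
    return matvec(W_OUT, hidden)


def apply_ffn(streams):
    """Apply the FFN to each token independently and write the edit back."""
    return [[x + e for x, e in zip(s, toy_ffn_edit(s))] for s in streams]


# Same first token, different second token.
batch_a = apply_ffn([[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 0.0, 1.0]])
batch_b = apply_ffn([[1.0, 0.0, 2.0, 1.0], [5.0, 5.0, 5.0, 5.0]])

print(toy_ffn_edit([1.0, 0.0, 2.0, 1.0]))  # [0.0, 1.0, 0.5, -1.0]
print(batch_a[0] == batch_b[0])            # True
```

Changing the second token leaves the first token's result untouched, which is the "no consultation" property in code form.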
Attention mixes information across tokens. The FFN remixes information inside one token's representation. Both still use the same write-back destination. The shared canvas keeps integrating both styles of edits.

Superposition: many features in one vector¶
Now comes the subtle part. People often imagine one feature per coordinate. Real models are messier. Many features can be packed into the same residual stream, often more features than the stream has dimensions, by giving them overlapping directions instead of one coordinate each. That is called superposition. One direction in the stream may partly represent tense. Another direction may partly represent indentation in code. Another may partly represent whether a name is plural. These can overlap. So the stream is not a neat spreadsheet. It is a dense shared canvas.

Tiny toy picture:
| Stream direction | Might contribute to |
|---|---|
| direction A | subjecthood + code scope |
| direction B | tense + bracket balance |
| direction C | topic + sentiment |
| direction D | position + quoting state |
This is why the residual stream is so interesting. It is where many abstract features coexist.
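A toy sketch of the overlap, assuming three features squeezed into a 2-dimensional stream. The feature names come from the examples above; their directions (three unit vectors 120 degrees apart) are invented for illustration and are not taken from any real model.

```python
import math

# Three hypothetical feature directions in a 2-D stream, 120 degrees apart.
# More features (3) than dimensions (2), so they cannot all be orthogonal.
FEATURE_DIRECTIONS = {
    "tense": (1.0, 0.0),
    "plural": (-0.5, math.sqrt(3) / 2),
    "indentation": (-0.5, -math.sqrt(3) / 2),
}


def write_features(strengths):
    """Build a stream vector by adding strength * direction for each feature."""
    stream = [0.0, 0.0]
    for name, strength in strengths.items():
        direction = FEATURE_DIRECTIONS[name]
        stream[0] += strength * direction[0]
        stream[1] += strength * direction[1]
    return stream


def read_feature(stream, name):
    """Read a feature by projecting the stream onto its direction."""
    direction = FEATURE_DIRECTIONS[name]
    return stream[0] * direction[0] + stream[1] * direction[1]


# Write only the "tense" feature, then read all three back.
stream = write_features({"tense": 1.0})
readouts = {name: read_feature(stream, name) for name in FEATURE_DIRECTIONS}
print(readouts)
```

Only "tense" was written, yet the other two readouts come back nonzero (about -0.5 each). That interference is the price of packing more features than dimensions into one vector.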
Why mechanistic interpretability loves the residual stream¶
If you want to understand what a transformer is thinking, the residual stream is a natural place to look. Why? Because it is the shared state. Every layer reads it. Every layer writes it. Interpretability tools often inspect stream directions, stream norms, or stream projections onto output logits. People use tricks like the logit lens for exactly this reason. Anthropic and open-model researchers often ask questions like:
- which direction in the stream carries an entity name?
- where does a factual feature first appear?
- which block writes the correction that changes the next-token prediction?
So if you want a window into model internals, the shared canvas is where you often start.
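A toy readout in the spirit of the logit lens, reusing the four stream states from the worked example in this file. The three-token vocabulary and the `UNEMBED` vectors are entirely hypothetical. Real logit-lens code typically also applies the model's final normalization before unembedding; this sketch skips that step.

```python
# Stream states from the worked example (start, then after stations 1 to 3).
STREAM_STATES = [
    [1, 0, 2, 1],
    [2, 1, 2, 1],
    [1, 1, 4, 0],
    [2, 0, 2, 2],
]

# Hypothetical unembedding: one width-4 direction per vocabulary token.
UNEMBED = {
    "cat": [1.0, 0.0, 0.0, 0.5],
    "dog": [0.0, 0.0, 1.0, 0.0],
    "the": [0.0, 1.0, 0.0, 0.0],
}


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def logits(stream):
    """Project a stream state onto every token direction."""
    return {token: dot(direction, stream) for token, direction in UNEMBED.items()}


def top_token(stream):
    """Token with the highest logit for this stream state."""
    scores = logits(stream)
    return max(scores, key=scores.get)


# Which token would be predicted if decoding stopped after each station?
top_per_stage = [top_token(state) for state in STREAM_STATES]
print(top_per_stage)  # ['dog', 'cat', 'dog', 'cat']
```

Watching the top token flip from stage to stage is a crude way to see which station wrote the edit that changed the prediction.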
The Google Doc analogy, cleaned up¶
Let us return to the shared doc. Suppose three reviewers see the same sentence. Reviewer 1 adds a grammar note. Reviewer 2 adds a number-agreement note. Reviewer 3 adds a style note. By the end, the document has absorbed all three edits. Nobody threw away the old doc and started from zero. That is the transformer story. The residual stream is the common working draft. Each block is a reviewer adding comments.

Where this lives in the wild¶
- OpenAI ChatGPT / GPT-4. Final next-token predictions are read from the last residual stream state.
- Anthropic Claude. Interpretability work often tracks how specific residual stream directions carry features across layers.
- Google Gemini / Gemma. Decoder blocks repeatedly read and write a fixed-width hidden state through the whole stack.
- Meta Llama 3. Open weights let researchers inspect residual stream activations layer by layer with tools like TransformerLens.
- GitHub Copilot. Code context is progressively updated in a shared hidden state as the model prepares the next token.
Interview Q&A¶
Q: What is the residual stream in one line?
A: It is the fixed-width vector that carries the token's representation through all transformer layers, with every block reading from it and writing back into it.

Q: Why is it called a shared canvas?
A: Because attention and FFN do not maintain separate long-term documents. They both write edits onto the same running representation.
Common wrong answer to avoid: "Attention owns one hidden state and FFN owns another hidden state." They contribute to the same stream.

Q: Does the residual stream get wider as the model gets deeper?
A: No. Depth adds more rounds of editing, not more stream width. The width is d_model, and it stays fixed from the first layer to the last.

Q: Why do interpretability researchers inspect the residual stream so much?
A: Because it is the central state that connects all blocks. If a feature matters to prediction, it often becomes visible as a direction or pattern in that stream.
Common wrong answer to avoid: "The residual stream is just storage, so it is not where computation lives." It is both the storage and the meeting place for computation.
Apply now (5 min)¶
Take a tiny width-4 vector. Start with [1, 0, 2, 1]. Invent one attention edit and one FFN edit. Add them by hand. Do that for three stations. Now sketch from memory:
- x0 → x1 → x2 → x3 with the same width each time
- one note saying "attention writes here"
- one note saying "FFN writes here"
- one sentence: "The residual stream is the shared canvas."
If you can explain the Google Doc analogy without looking, you own the idea.
Bridge. A shared canvas is useful only if its scale stays sane, so the next file brings in the quality inspector that normalizes each token before heavy work. Read 04-layer-normalization.md next.