06. The transformer block — two benches, one station¶
8 minutes. One station. Two edits. One residual stream keeps moving.
Built on the ELI5 in 00-eli5.md. The station (one transformer block) is the exact place where the social bench and private bench write into the residual stream.
Mental model first¶
One transformer block is one station on the assembly line.
It does not replace the packet.
It reads the packet, writes an edit, and passes the updated packet onward.
Inside the station, there are two benches.
First the social bench.
Then the private bench.
Each bench gets its own inspector.
Each bench writes through its own shortcut add.
So the rhythm is fixed: inspect-consult-add-inspect-transform-add.
That is the whole beat of a pre-norm block.
The residual stream is the packet that enters as x and leaves as y.
Nothing mystical.
Just two learned edits written onto the same running vector.
See the full station.
residual stream x
|
+----------------------------------------------+
| |
| [quality inspector] |
| LN |
| | |
| [social bench] Attention |
| | |
+-------- add --------------------------------> u
|
+--------------------------------+
| |
| [quality inspector] |
| LN |
| | |
| [private bench] FFN |
| | |
+-------- add ------------------> y
Formula view¶
For a pre-norm block, write the two sublayers like this.
u = x + Attention(LN(x))
y = u + FFN(LN(u))
Read the two lines slowly. First inspect x, let the social bench consult other tokens, write the attention edit, and add it back. Then inspect u, send it through the private bench, write the FFN edit, and add that back too.
That is why people say a transformer block has two sublayers.
Both sublayers have the same pattern.
The only difference is the kind of computation.
Attention is cross-token consultation.
FFN is per-token transformation.
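The same pattern can be sketched in plain Python. This is a minimal sketch, not a real model: `toy_attention` and `toy_ffn` are hypothetical stand-ins (an average over tokens and a per-token scaled ReLU), and `layer_norm` assumes gain 1, bias 0, and a small epsilon. Only the shape of the computation matters here: inspect, compute an edit, add it back.

```python
import math

def layer_norm(v, eps=1e-5):
    """Normalize one token vector (assumes gain 1 and bias 0)."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def toy_attention(tokens):
    """Hypothetical stand-in: every token receives the average of all tokens."""
    width = len(tokens[0])
    avg = [sum(t[i] for t in tokens) / len(tokens) for i in range(width)]
    return [avg[:] for _ in tokens]

def toy_ffn(token):
    """Hypothetical stand-in: a per-token transform that never sees other tokens."""
    return [0.5 * max(a, 0.0) for a in token]

def add(stream, edit):
    """Shortcut add: write an edit onto the running vector."""
    return [s + e for s, e in zip(stream, edit)]

def pre_norm_block(xs, attention=toy_attention, ffn=toy_ffn):
    """xs is a list of token vectors; returns the updated list, same shape."""
    # Sublayer 1: u = x + Attention(LN(x)), computed across tokens.
    edits = attention([layer_norm(x) for x in xs])
    us = [add(x, e) for x, e in zip(xs, edits)]
    # Sublayer 2: y = u + FFN(LN(u)), computed token by token.
    return [add(u, ffn(layer_norm(u))) for u in us]

xs = [[2.0, -1.0, 3.0], [0.0, 1.0, -1.0]]
ys = pre_norm_block(xs)
print(ys)
```

Passing the benches in as arguments makes the symmetry visible: if both benches return all-zero edits, the block hands back its input unchanged, because the residual stream is only ever added to.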
A compact box view helps.
x
|
+--> LN --> Attention --+
| |
+-----------------------+--> u
u
|
+--> LN --> FFN --------+
| |
+-----------------------+--> y
Block l receives x^(l), then it returns x^(l+1).
Same width, same pattern, repeated N times. That is depth.
Worked numerical examples with ASCII diagrams¶
Use one token vector for arithmetic. Say the incoming residual stream for one token is:
x = [2, -1, 3]
Remember one fact. Attention still consulted the other tokens. We are only showing the edit returned for this one token.
Step 1: inspect before the social bench¶
Compute LayerNorm on x. For this arithmetic, take the learned gain as 1 and the bias as 0, and ignore the small epsilon.
mean = (2 + -1 + 3) / 3 = 1.333
centered = [0.667, -2.333, 1.667]
variance ≈ (0.444 + 5.444 + 2.778) / 3 = 2.889
std ≈ 1.700
LN(x) ≈ [0.392, -1.373, 0.981]
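The arithmetic above can be checked with a few lines of Python. This sketch assumes the simplified LayerNorm used in the worked example: population variance (divide by the vector length), gain 1, bias 0, and no epsilon.

```python
import math

x = [2.0, -1.0, 3.0]

mean = sum(x) / len(x)
centered = [a - mean for a in x]
# Population variance: divide by the number of components, not n - 1.
variance = sum(c * c for c in centered) / len(x)
std = math.sqrt(variance)
ln_x = [c / std for c in centered]

print("mean    ", round(mean, 3))
print("centered", [round(c, 3) for c in centered])
print("variance", round(variance, 3))
print("std     ", round(std, 3))
print("LN(x)   ", [round(a, 3) for a in ln_x])
```

The inspected vector has mean 0 and variance 1, which is what the benches rely on: they see the same scale no matter how large the running packet has grown.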
Step 2: the social bench writes an attention edit¶
Suppose the social bench looks at all tokens and decides this token needs:
attention edit = [0.6, 0.2, -0.1]
Now add it through the shortcut pipe.
u = x + attention edit = [2, -1, 3] + [0.6, 0.2, -0.1] = [2.6, -0.8, 2.9]
Notice the idea. The social bench did not replace x.
It only wrote a correction.
Step 3: inspect before the private bench¶
Now normalize u.
mean = (2.6 + -0.8 + 2.9) / 3 = 1.567
centered = [1.033, -2.367, 1.333]
variance ≈ (1.068 + 5.601 + 1.778) / 3 = 2.816
std ≈ 1.678
LN(u) ≈ [0.616, -1.410, 0.795]
Step 4: the private bench writes an FFN edit¶
Suppose the private bench produces:
FFN edit = [0.3, -0.4, 0.2]
Add again.
y = u + FFN edit = [2.6, -0.8, 2.9] + [0.3, -0.4, 0.2] = [2.9, -1.2, 3.1]
That is one full forward pass for one token through one block: inspect, consult, add, inspect, transform, add.
Full block trace in one view¶
x = [2, -1, 3]
|
+--> LN --> Attention --> [0.6, 0.2, -0.1] --+
| |
+---------------------------------------------+--> u = [2.6, -0.8, 2.9]
|
+--> LN --> FFN --> [0.3, -0.4, 0.2] --+
| |
+-----------------------------------------+--> y = [2.9, -1.2, 3.1]
Start: [2, -1, 3]
After social bench: [2.6, -0.8, 2.9]
After private bench: [2.9, -1.2, 3.1]
Same token slot.
Richer packet.
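The trace can be replayed in Python. In this sketch the two edits are the values from the trace above, typed in by hand; nothing here computes attention or the FFN. The helper `add` is a hypothetical name for the shortcut add.

```python
x = [2.0, -1.0, 3.0]
attention_edit = [0.6, 0.2, -0.1]   # from the trace above, not computed here
ffn_edit = [0.3, -0.4, 0.2]         # from the trace above, not computed here

def add(stream, edit):
    """Shortcut add; rounding only hides floating-point noise in the printout."""
    return [round(s + e, 6) for s, e in zip(stream, edit)]

u = add(x, attention_edit)   # after the social bench
y = add(u, ffn_edit)         # after the private bench

# The block's net effect is the sum of its two edits.
total_edit = [round(b - a, 6) for a, b in zip(x, y)]

print("x =", x)
print("u =", u)
print("y =", y)
print("y - x =", total_edit)
```

Because both benches only add, the output is the input plus the sum of the two edits. That is the sense in which the packet is edited rather than replaced.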
How blocks stack¶
Now zoom out. One station is useful, but a full transformer is many stations.
x^(0) --> Block 1 --> x^(1) --> Block 2 --> x^(2) --> Block 3 --> x^(3) --> ... --> Block N --> x^(N)
The same station, repeated N times. That is the assembly line.
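Stacking is a plain loop. In the sketch below, `make_block` is a hypothetical toy station that adds a small fixed edit; in a real model each block has its own learned attention and FFN weights. What carries over is the interface: every block takes a vector of some width and returns a vector of the same width.

```python
def make_block(shift):
    """Hypothetical toy station: writes a small fixed edit onto the stream."""
    def block(x):
        edit = [shift for _ in x]
        return [a + e for a, e in zip(x, edit)]
    return block

N = 4
blocks = [make_block(0.1) for _ in range(N)]

x = [2.0, -1.0, 3.0]   # x^(0)
history = [x]
for block in blocks:
    x = block(x)       # x^(l+1) = Block(x^(l))
    history.append(x)

for layer, state in enumerate(history):
    print(f"x^({layer}) =", [round(a, 3) for a in state])
```

Since the width never changes, blocks can be chained to any depth without adapters between them.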
Where this lives in the wild¶
- ChatGPT and the GPT family are stacks of these stations, repeated many times. The published GPT-2 and GPT-3 designs use this pre-norm rhythm; the original GPT placed LayerNorm after each add instead.
- GitHub Copilot relies on decoder blocks of this form so each token can consult context, then locally transform its representation.
- Meta LLaMA models implement the same two-bench station, with RMSNorm as the inspector variant.
- Mistral and Mixtral keep this block pattern while changing other details like grouped-query attention or mixture-of-experts layers.
- Google Gemma checkpoints also use repeated transformer stations, because scaling comes from stacking many stable blocks.
Interview Q&A¶
Q: What are the two sublayers inside a standard decoder transformer block?
A: Multi-head self-attention first, then the feed-forward network, with a residual add after each.
Common wrong answer to avoid: "Attention is the block." The block also has the private bench, and both benches write into the same residual stream.
Q: Why are there two LayerNorm operations in a pre-norm block?
A: Because each heavy computation gets its own inspected input before writing an edit.
Q: What exactly moves from block to block?
A: The residual stream. It is the running token representation that every station reads and updates.
Common wrong answer to avoid: "Only attention outputs move forward." The whole residual stream moves forward; attention and FFN only add edits to it.
Q: If you double the number of blocks, what is the conceptual change?
A: You are just repeating the same station more times, so the model gets more chances to refine the residual stream.
Apply now (5 min)¶
Sketch the full pre-norm station from memory. Label the two benches, the two shortcut adds, and then write the two equations from memory.
Now close the file and say the rhythm aloud: inspect, consult, add, inspect, transform, add. Last step. Sketch from memory the packet [2, -1, 3] becoming [2.6, -0.8, 2.9] and then [2.9, -1.2, 3.1].
Bridge. You have seen the full station. Next, zoom into the social bench inside it:
07-attention-in-the-block.md.