06. The transformer block — two benches, one station

8 minutes. One station. Two edits. One residual stream keeps moving.

Built on the ELI5 in 00-eli5.md. The station — one transformer block — is the exact place where the social bench and private bench write into the residual stream.


Mental model first

One transformer block is one station on the assembly line. It does not replace the packet. It reads the packet, writes an edit, and passes the updated packet onward. Inside the station, there are two benches. First the social bench. Then the private bench. Each bench gets its own inspector. Each bench writes through its own shortcut add. So the rhythm is fixed: inspect-consult-add-inspect-transform-add. That is the whole beat of a pre-norm block. The residual stream is the packet that enters as x and leaves as y. Nothing mystical. Just two learned edits written onto the same running vector. See the full station.

residual stream x
      |
      +----------------------------------------------+
      |                                              |
      |   [quality inspector]                        |
      |          LN                                  |
      |          |                                   |
      |   [social bench] Attention                   |
      |          |                                   |
      +-------- add --------------------------------> u
                                                     |
                                                     +--------------------------------+
                                                     |                                |
                                                     |   [quality inspector]          |
                                                     |          LN                    |
                                                     |          |                     |
                                                     |   [private bench] FFN          |
                                                     |          |                     |
                                                     +-------- add ------------------> y
The first bench mixes information across tokens. The second bench transforms each token locally. The shortcut pipe keeps the old packet alive at both benches, so a block is an edit machine, not a rewrite machine.
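The difference between the two benches can be checked with a small NumPy experiment. This is a minimal sketch, not a real model: the weights are random, the width is 3, and the names `toy_attention` and `toy_ffn` are hypothetical stand-ins for the social bench and the private bench. It leaves out the inspectors and the shortcut adds and looks only at what each bench computes. It changes the first token and measures how far the last token's output moves at each bench.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3  # toy width, matching the three-number packets used in this lesson

# Hypothetical random weights. A trained model would have learned these.
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
w_in = rng.normal(size=(d, 4 * d))
w_out = rng.normal(size=(4 * d, d))

def toy_attention(h):
    """Single-head causal self-attention: each token reads itself and earlier tokens."""
    n = h.shape[0]
    q, k, v = h @ w_q, h @ w_k, h @ w_v
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)           # hide later tokens
    scores = scores - scores.max(axis=-1, keepdims=True) # stabilize the softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def toy_ffn(h):
    """Two-layer ReLU network applied to every token row independently."""
    return np.maximum(h @ w_in, 0.0) @ w_out

tokens = rng.normal(size=(4, d))
changed = tokens.copy()
changed[0] += 1.0  # perturb only the first token

attn_shift = np.abs(toy_attention(changed)[-1] - toy_attention(tokens)[-1]).max()
ffn_shift = np.abs(toy_ffn(changed)[-1] - toy_ffn(tokens)[-1]).max()
print(f"last-token attention output moved by {attn_shift:.4f}")
print(f"last-token FFN output moved by {ffn_shift:.4f}")
```

The attention output for the last token moves because that token consulted the first one. The FFN output for the last token does not move, because the private bench only ever sees its own token.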

Formula view

For a pre-norm block, write the two sublayers like this.

u = x + Attention(LN(x))
y = u + FFN(LN(u))
Read the two lines slowly. First inspect x, let the social bench consult other tokens, write the attention edit, and add it back. Then inspect u, send it through the private bench, write the FFN edit, and add that back too. That is why people say a transformer block has two sublayers. Both sublayers have the same pattern.
normalized input -> computation -> residual add
The only difference is the kind of computation. Attention is cross-token consultation. FFN is per-token transformation. A compact box view helps.
x
|
+--> LN --> Attention --+
|                       |
+-----------------------+--> u
u
|
+--> LN --> FFN --------+
|                       |
+-----------------------+--> y
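The two equations translate almost line for line into code. The sketch below assumes NumPy, a LayerNorm without learned scale and shift, and benches passed in as plain functions. The names `pre_norm_block` and `silent_bench` are illustrative, not from any library.

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance (no learned scale or shift)."""
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mean) / np.sqrt(var + eps)

def pre_norm_block(x, attention, ffn):
    """One station: inspect, consult, add, inspect, transform, add."""
    u = x + attention(layer_norm(x))  # u = x + Attention(LN(x))
    y = u + ffn(layer_norm(u))        # y = u + FFN(LN(u))
    return y

def silent_bench(h):
    """Hypothetical bench that writes no edit at all."""
    return np.zeros_like(h)

x = np.array([[2.0, -1.0, 3.0],
              [0.5, 0.0, -0.5]])  # two tokens, width 3

y = pre_norm_block(x, attention=silent_bench, ffn=silent_bench)
print(np.array_equal(x, y))  # True: with zero edits the packet passes through untouched

# Swap in a non-silent private bench (np.tanh as a stand-in for an FFN).
y2 = pre_norm_block(x, attention=silent_bench, ffn=np.tanh)
print(y2.shape)  # (2, 3): same width out as in
```

The silent-bench case is the design point worth noticing. Because each bench only adds to the stream, a bench that writes nothing leaves the packet exactly as it was.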
One more formula matters. If block l receives x^(l), then it returns x^(l+1).
x^(l+1) = Block_l(x^(l))
Same width, same pattern, repeated N times. That is depth.

Worked numerical examples with ASCII diagrams

Use one token vector for arithmetic. Say the incoming residual stream for one token is:

x = [2, -1, 3]
Remember one fact. Attention still consulted the other tokens. We are only showing the edit returned for this one token.

Step 1: inspect before the social bench

Compute LayerNorm on x.

mean = (2 + -1 + 3) / 3 = 1.333
centered = [0.667, -2.333, 1.667]
variance ≈ (0.444 + 5.444 + 2.778) / 3 = 2.889
std ≈ 1.700
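LN(x) ≈ [0.392, -1.373, 0.981]
So the quality inspector hands the social bench a vector with zero mean and unit variance. (The learned scale and shift that LayerNorm usually applies afterward are left out to keep the arithmetic short.) ASCII picture:
[2, -1, 3]
     |
     v
LN -> [0.392, -1.373, 0.981]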
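The hand arithmetic for Step 1 can be replayed in a few lines. This is a minimal sketch assuming NumPy, the population variance (divide by 3, not 2), and no learned scale, shift, or epsilon term.

```python
import numpy as np

x = np.array([2.0, -1.0, 3.0])

mean = x.mean()                    # about 1.333
centered = x - mean                # about [0.667, -2.333, 1.667]
variance = (centered ** 2).mean()  # about 2.889 (population variance)
std = np.sqrt(variance)            # about 1.700
ln_x = centered / std

print("mean     =", round(float(mean), 3))
print("variance =", round(float(variance), 3))
print("LN(x)    =", np.round(ln_x, 3))
print("check    =", round(float(ln_x.mean()), 6), round(float(ln_x.std()), 6))
```

The last line is the useful sanity check: whatever the input was, the inspected vector has mean 0 and standard deviation 1, up to floating-point error.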

Step 2: the social bench writes an attention edit

Suppose the social bench looks at all tokens and decides this token needs:

attention edit = [0.6, 0.2, -0.1]
Now add it through the shortcut pipe.
u = x + attention edit
u = [2, -1, 3] + [0.6, 0.2, -0.1]
u = [2.6, -0.8, 2.9]
ASCII picture:
old packet x ----------------+
                             +--> [2.6, -0.8, 2.9]
attention edit -------------+
Notice the idea. The social bench did not replace x. It only wrote a correction.

Step 3: inspect before the private bench

Now normalize u.

mean = (2.6 + -0.8 + 2.9) / 3 = 1.567
centered = [1.033, -2.367, 1.333]
variance ≈ (1.068 + 5.601 + 1.778) / 3 = 2.816
std ≈ 1.678
LN(u) ≈ [0.616, -1.410, 0.795]
Again, the inspector gives the next bench a stable input. ASCII picture:
[2.6, -0.8, 2.9]
       |
       v
LN -> [0.616, -1.410, 0.795]

Step 4: the private bench writes an FFN edit

Suppose the private bench produces:

ffn edit = [0.3, -0.4, 0.2]
Add again.
y = u + ffn edit
y = [2.6, -0.8, 2.9] + [0.3, -0.4, 0.2]
y = [2.9, -1.2, 3.1]
ASCII picture:
intermediate u --------------+
                             +--> [2.9, -1.2, 3.1]
ffn edit --------------------+
That is one full forward pass for one token through one block: inspect, consult, add, inspect, transform, add.

Full block trace in one view

x = [2, -1, 3]
  |
  +--> LN --> Attention --> [0.6, 0.2, -0.1] --+
  |                                             |
  +---------------------------------------------+--> u = [2.6, -0.8, 2.9]
                                                 |
                                                 +--> LN --> FFN --> [0.3, -0.4, 0.2] --+
                                                 |                                         |
                                                 +-----------------------------------------+--> y = [2.9, -1.2, 3.1]
See how the residual stream changed.
Start: [2, -1, 3]
After social bench: [2.6, -0.8, 2.9]
After private bench: [2.9, -1.2, 3.1]
Same token slot. Richer packet.
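The whole trace fits in a short script. This sketch assumes NumPy and uses two stand-in benches, `social_bench` and `private_bench` (hypothetical names), that simply return the edits supposed in Steps 2 and 4. A real bench would compute its edit from the inspected input it receives.

```python
import numpy as np

def layer_norm(h):
    """LayerNorm for one token vector, without learned scale, shift, or epsilon."""
    centered = h - h.mean()
    return centered / np.sqrt((centered ** 2).mean())

attention_edit = np.array([0.6, 0.2, -0.1])  # the edit supposed in Step 2
ffn_edit = np.array([0.3, -0.4, 0.2])        # the edit supposed in Step 4

def social_bench(inspected):
    # Stand-in: ignores its input and returns the supposed attention edit.
    return attention_edit

def private_bench(inspected):
    # Stand-in: ignores its input and returns the supposed FFN edit.
    return ffn_edit

x = np.array([2.0, -1.0, 3.0])
u = x + social_bench(layer_norm(x))   # [2.6, -0.8, 2.9]
y = u + private_bench(layer_norm(u))  # [2.9, -1.2, 3.1]

print("u =", np.round(u, 1))
print("y =", np.round(y, 1))
print("total edit =", np.round(y - x, 1))  # [0.9, -0.2, 0.1]
```

The total edit is just the sum of the two bench edits. The original packet is still inside y, which is the point of the shortcut pipe.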

How blocks stack

Now zoom out. One station is useful, but a full transformer is many stations.

x^(0) --> Block 1 --> x^(1) --> Block 2 --> x^(2) --> Block 3 --> x^(3) --> ... --> Block N --> x^(N)
Every block has the same rhythm. The numbers and learned weights change, but the pattern does not. So what should you do when you feel lost? Return to the beat: inspect, consult, add, inspect, transform, add. Repeat that station N times. That is the assembly line.
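Stacking is a plain loop. The sketch below assumes NumPy and builds toy stations with `make_toy_block` (a hypothetical helper): each station gets its own random weights, the social bench is crudely replaced by "every token reads the average token", and the private bench is a small ReLU network. Only the rhythm is faithful to a real model.

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mean) / np.sqrt(var + eps)

def make_toy_block(rng, d):
    """Build one station with its own weights (illustrative, not a trained block)."""
    w_mix = rng.normal(scale=0.1, size=(d, d))
    w_in = rng.normal(scale=0.1, size=(d, 4 * d))
    w_out = rng.normal(scale=0.1, size=(4 * d, d))

    def block(x):
        # Crude stand-in for attention: each token receives the average inspected token.
        u = x + layer_norm(x).mean(axis=0, keepdims=True) @ w_mix
        # Per-token FFN edit.
        y = u + np.maximum(layer_norm(u) @ w_in, 0.0) @ w_out
        return y

    return block

def run_stack(x, blocks):
    """x^(l+1) = Block_l(x^(l)), applied once per station."""
    for block in blocks:
        x = block(x)
    return x

rng = np.random.default_rng(0)
d, n_tokens, n_blocks = 3, 4, 6
blocks = [make_toy_block(rng, d) for _ in range(n_blocks)]

x0 = rng.normal(size=(n_tokens, d))
xN = run_stack(x0, blocks)
print(x0.shape, xN.shape)  # (4, 3) (4, 3)
```

The shape never changes from x^(0) to x^(N). That is what lets the same station design be repeated as many times as you like.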

Where this lives in the wild

  • The GPT family behind ChatGPT stacks these stations many times. GPT-2 and its published successors use this pre-norm rhythm; the original GPT used post-norm, with the inspector placed after each add.
  • GitHub Copilot relies on decoder blocks of this form so each token can consult context, then locally transform its representation.
  • Meta LLaMA models implement the same two-bench station, with RMSNorm as the inspector variant.
  • Mistral and Mixtral keep this block pattern while changing other details like grouped-query attention or mixture-of-experts layers.
  • Google Gemma checkpoints also use repeated transformer stations, because scaling comes from stacking many stable blocks.

Interview Q&A

Q: What are the two sublayers inside a standard decoder transformer block?
A: Multi-head self-attention first, then the feed-forward network, with a residual add after each.
Common wrong answer to avoid: "Attention is the block." The block also has the private bench, and both benches write into the same residual stream.

Q: Why are there two LayerNorm operations in a pre-norm block?
A: Because each heavy computation gets its own inspected input before writing an edit.

Q: What exactly moves from block to block?
A: The residual stream. It is the running token representation that every station reads and updates.
Common wrong answer to avoid: "Only attention outputs move forward." The whole residual stream moves forward; attention and FFN only add edits to it.

Q: If you double the number of blocks, what is the conceptual change?
A: You are just repeating the same station more times, so the model gets more chances to refine the residual stream.

Apply now (5 min)

Sketch the full pre-norm station from memory. Label the two inspectors, the two benches, and the two shortcut adds, then write the two equations.

u = x + Attention(LN(x))
y = u + FFN(LN(u))
Now close the file and say the rhythm aloud: inspect, consult, add, inspect, transform, add. Last step. Sketch from memory the packet [2, -1, 3] becoming [2.6, -0.8, 2.9] and then [2.9, -1.2, 3.1].


Bridge. You have seen the full station. Next, zoom into the social bench inside it: 07-attention-in-the-block.md.