07. Attention inside the block — the social bench¶

8 minutes. Not attention from scratch. Attention exactly where it sits inside one station.

Built on the ELI5 in 00-eli5.md. The social bench — multi-head attention inside the station — lets parallel crews read the normalized residual stream and write one shared edit.

Mental model first¶

Do not restart from module 02. That file taught attention itself. This file asks a narrower question: how does attention live inside one transformer block? It sits on the first bench of the station, reads the normalized residual stream, creates queries, keys, and values, lets several parallel crews work on slices of the width, concatenates their edits, and uses one output projection to mix them back to model width. So attention inside the block is not the whole model. It is one edit writer inside one station. Picture it like this.

normalized residual stream H = LN(X)
        |
        +--> Q projection --> Q1 | Q2 | ... | Qh
        |
        +--> K projection --> K1 | K2 | ... | Kh
        |
        +--> V projection --> V1 | V2 | ... | Vh
head 1: softmax(Q1 K1^T / sqrt(d_k)) V1
head 2: softmax(Q2 K2^T / sqrt(d_k)) V2
...
head h: softmax(Qh Kh^T / sqrt(d_k)) Vh
concat heads --> W_O --> attention edit --> residual add

See the flow: one input stream, many parallel crews, one merged edit.

Formula view¶

Let X be the residual stream entering the block, and let H = LN(X) be its normalized version, with shape T x d_model. T is sequence length. d_model is the width of the residual stream. Inside the block, attention begins with three learned projections, each of shape d_model x d_model.

Q = H W_Q
K = H W_K
V = H W_V

For multi-head attention, split the model width evenly across h heads. This requires d_model to be divisible by h.

d_k = d_model / h

Each head gets its own slice.

Q -> (Q_1, Q_2, ..., Q_h)
K -> (K_1, K_2, ..., K_h)
V -> (V_1, V_2, ..., V_h)
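The split above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the helper name split_heads and the 3 x 4 example matrix are assumptions made for the sketch, and a real framework would typically do this with a tensor reshape.

```python
def split_heads(M, h):
    """Split each row of a T x d_model matrix into h contiguous slices of width d_k."""
    d_model = len(M[0])
    if d_model % h != 0:
        raise ValueError("d_model must be divisible by h")
    d_k = d_model // h
    # heads[i] is a T x d_k matrix holding columns [i*d_k, (i+1)*d_k) of M
    return [[row[i * d_k:(i + 1) * d_k] for row in M] for i in range(h)]

# Illustrative Q with T = 3 tokens and d_model = 4 (values are arbitrary)
Q = [[1, 0, 2, 1],
     [0, 1, 1, 2],
     [1, 1, 0, 1]]

Q_heads = split_heads(Q, h=2)
print(Q_heads[0])  # [[1, 0], [0, 1], [1, 1]]
print(Q_heads[1])  # [[2, 1], [1, 2], [0, 1]]
```

The same helper applies unchanged to K and V, since all three share the width d_model.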

Each head computes attention on its own slice.

head_i = softmax(Q_i K_i^T / sqrt(d_k)) V_i

Then concatenate all head outputs.

HeadCat = concat(head_1, head_2, ..., head_h)

Finally project back to model width.

AttnEdit = HeadCat W_O
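The formulas from Q = H W_Q through AttnEdit = HeadCat W_O can be strung together as one function. The sketch below uses plain Python lists so it runs without dependencies. The helper names (matmul, softmax, attention_head, multi_head_attention) are assumptions for illustration, there is no masking or batching, and identity matrices stand in for learned weights in the usage lines.

```python
import math

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(xs):
    """Numerically stable softmax over one row of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(Qi, Ki, Vi):
    """head_i = softmax(Q_i K_i^T / sqrt(d_k)) V_i for one head."""
    d_k = len(Qi[0])
    weights = []
    for q in Qi:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in Ki]
        weights.append(softmax(scores))
    return matmul(weights, Vi)

def multi_head_attention(H, W_Q, W_K, W_V, W_O, h):
    """Project, split into h heads, attend per head, concatenate, project back."""
    d_k = len(H[0]) // h
    Q, K, V = matmul(H, W_Q), matmul(H, W_K), matmul(H, W_V)
    heads = []
    for i in range(h):
        cols = slice(i * d_k, (i + 1) * d_k)
        heads.append(attention_head([r[cols] for r in Q],
                                    [r[cols] for r in K],
                                    [r[cols] for r in V]))
    # Row t of HeadCat is head_1[t] followed by head_2[t], and so on
    head_cat = [sum((head[t] for head in heads), []) for t in range(len(H))]
    return matmul(head_cat, W_O)

# Usage with identity weights as a stand-in for learned projections
I4 = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
H = [[1, 0, 2, 1], [0, 1, 1, 2], [1, 1, 0, 1]]
edit = multi_head_attention(H, I4, I4, I4, I4, h=2)
print(len(edit), len(edit[0]))  # 3 4
```

The output has the same shape as H, which is what lets it be added back onto the residual stream.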

Inside the pre-norm block, that edit is used like this.

U = X + MHA(LN(X))

So what is the real role of attention here? Not final prediction, not full block output, just one learned cross-token edit written onto the residual stream.
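The wiring U = X + MHA(LN(X)) can be sketched with any sublayer plugged in. This is a minimal sketch under stated assumptions: layer_norm here has no learned gain or bias, prenorm_residual is a hypothetical helper name, and a zero-edit function stands in for attention to show that the stream is added to, not replaced.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def prenorm_residual(X, sublayer):
    """Compute U = X + sublayer(LN(X)) row by row."""
    H = [layer_norm(row) for row in X]   # the sublayer only ever sees the normalized stream
    edit = sublayer(H)                   # same shape as X
    return [[x + e for x, e in zip(x_row, e_row)] for x_row, e_row in zip(X, edit)]

X = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 1.0, 2.0]]

# A sublayer that writes no edit leaves the residual stream untouched
zero_edit = lambda H: [[0.0] * len(row) for row in H]
U = prenorm_residual(X, zero_edit)
print(U == X)  # True
```

Swapping zero_edit for a real attention function changes only the edit term. The shortcut path carrying X stays the same.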

Worked numerical examples with ASCII diagrams¶

Use a tiny toy setup. Let d_model = 4. Let h = 2 heads. So each head gets:

d_k = d_model / h = 4 / 2 = 2

Use three tokens after normalization.

r1 = [1, 0, 2, 1]
r2 = [0, 1, 1, 2]
r3 = [1, 1, 0, 1]

For the toy arithmetic, let head 1 read the first two dimensions and head 2 read the last two. That is the same as setting W_Q = W_K = W_V = I, a simplified stand-in for learned projections.

Head 1¶

For token r3, the head-1 query is:

q1 = [1, 1]

Head-1 keys are:

k1(r1) = [1, 0]
k1(r2) = [0, 1]
k1(r3) = [1, 1]

Compute dot scores.

q1·k1(r1) = 1
q1·k1(r2) = 1
q1·k1(r3) = 2

Divide by sqrt(d_k) = sqrt(2) ≈ 1.414.

scores ≈ [0.71, 0.71, 1.41]

Softmax gives roughly:

weights ≈ [0.248, 0.248, 0.503]

For the toy example, let the head-1 values equal the same slices as the keys.

v1(r1) = [1, 0]
v1(r2) = [0, 1]
v1(r3) = [1, 1]

Weighted sum:

head1(r3)
≈ 0.248[1,0] + 0.248[0,1] + 0.503[1,1]
≈ [0.751, 0.751]

ASCII picture:

head 1 sees first two dims
r3 query [1,1]
   |---- talks to r1 with weight 0.248
   |---- talks to r2 with weight 0.248
   |---- talks to r3 with weight 0.503
output -> [0.751, 0.751]
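The head-1 arithmetic can be checked with a few lines of Python. This is only a verification sketch of the toy numbers above. Note that the unrounded result rounds to 0.752 in each slot. The 0.751 in the text comes from adding weights that were already rounded to three decimals.

```python
import math

q = [1, 1]                          # head-1 query for r3
keys = [[1, 0], [0, 1], [1, 1]]     # head-1 keys for r1, r2, r3
values = keys                       # toy choice: values equal the key slices
d_k = len(q)

# Scaled dot-product scores, then softmax over the three tokens
scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
exps = [math.exp(s) for s in scores]
weights = [e / sum(exps) for e in exps]

# Weighted sum of the value vectors
head1 = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_k)]

print([round(s, 2) for s in scores])    # [0.71, 0.71, 1.41]
print([round(w, 3) for w in weights])   # [0.248, 0.248, 0.503]
print([round(x, 3) for x in head1])     # [0.752, 0.752]
```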

Head 2¶

For token r3, the head-2 query is:

q2 = [0, 1]

Head-2 keys are:

k2(r1) = [2, 1]
k2(r2) = [1, 2]
k2(r3) = [0, 1]

Compute dot scores.

q2·k2(r1) = 1
q2·k2(r2) = 2
q2·k2(r3) = 1

Divide by sqrt(d_k) = sqrt(2) ≈ 1.414.

scores ≈ [0.71, 1.41, 0.71]

Softmax gives roughly:

weights ≈ [0.248, 0.503, 0.248]

Use values from the same slice again.

v2(r1) = [2, 1]
v2(r2) = [1, 2]
v2(r3) = [0, 1]

Weighted sum:

head2(r3)
≈ 0.248[2,1] + 0.503[1,2] + 0.248[0,1]
≈ [0.999, 1.503]

ASCII picture:

head 2 sees last two dims
r3 query [0,1]
   |---- talks to r1 with weight 0.248
   |---- talks to r2 with weight 0.503
   |---- talks to r3 with weight 0.248
output -> [0.999, 1.503]
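The same check for head 2, this time wrapped in a small reusable function. The name attend is an assumption for this sketch. With unrounded weights the first component comes out as 1.0. The 0.999 above is a side effect of rounding the weights before summing.

```python
import math

def attend(q, keys, values):
    """Return (weights, output) for one query under scaled dot-product attention."""
    d_k = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
    exps = [math.exp(s) for s in scores]
    weights = [e / sum(exps) for e in exps]
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
    return weights, output

q2 = [0, 1]                          # head-2 query for r3
k2 = [[2, 1], [1, 2], [0, 1]]        # head-2 keys for r1, r2, r3
weights2, head2 = attend(q2, k2, k2) # toy choice: values equal the key slices

print([round(w, 3) for w in weights2])  # [0.248, 0.503, 0.248]
print([round(x, 3) for x in head2])     # [1.0, 1.503]
```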

Concatenate heads and project back¶

Now concatenate the two head outputs.

HeadCat(r3) = [0.751, 0.751, 0.999, 1.503]

Use a simple output projection W_O = 0.5 I for the toy example. So the attention edit becomes:

AttnEdit(r3) ≈ [0.376, 0.376, 0.500, 0.752]

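Now add it back to the residual stream for token r3. In a real pre-norm block the add uses the un-normalized X, not LN(X). The toy example reuses the same r3 vector for both to keep the arithmetic short.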

r3 = [1, 1, 0, 1]
U(r3) = r3 + AttnEdit(r3)
U(r3) ≈ [1.376, 1.376, 0.500, 1.752]

Full ASCII view:

normalized r3
   |
   +--> head 1 -> [0.751, 0.751] --+
   |                                |
   +--> head 2 -> [0.999, 1.503] --+--> concat --> W_O --> [0.376, 0.376, 0.500, 0.752]
                                                                    |
original residual ---------------------------------------------------+
                                                                    |
                                                                    v
                                                        updated stream [1.376, 1.376, 0.500, 1.752]

That is attention inside the block: read normalized residual stream, split into parallel crews, compute per-head attention, concatenate, project, and add the edit.
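The whole toy walk-through for token r3 fits in one short script. This is a sketch under the same simplifying assumptions as the worked example: slices stand in for learned Q, K, V projections, W_O = 0.5 I, and the normalized r3 doubles as the residual. The function names are hypothetical.

```python
import math

tokens = [[1, 0, 2, 1],   # r1
          [0, 1, 1, 2],   # r2
          [1, 1, 0, 1]]   # r3
h, d_model = 2, 4
d_k = d_model // h

def attend(q, keys, values):
    """Scaled dot-product attention output for a single query."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    exps = [math.exp(s) for s in scores]
    weights = [e / sum(exps) for e in exps]
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]

def attention_edit(t):
    """Per-head attention for token t, concatenated, then projected with W_O = 0.5 I."""
    head_cat = []
    for i in range(h):
        lo, hi = i * d_k, (i + 1) * d_k
        slices = [tok[lo:hi] for tok in tokens]   # keys and values for head i
        head_cat += attend(tokens[t][lo:hi], slices, slices)
    return [0.5 * x for x in head_cat]

edit = attention_edit(2)                               # token r3
updated = [r + e for r, e in zip(tokens[2], edit)]     # residual add

print([round(x, 3) for x in edit])      # [0.376, 0.376, 0.5, 0.752]
print([round(x, 3) for x in updated])   # [1.376, 1.376, 0.5, 1.752]
```

Calling attention_edit(0) or attention_edit(1) gives the edits for r1 and r2, which the worked example does not spell out.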

What different heads often learn¶

Not every head learns the same job. That is the point of the parallel crews. Some heads become positional, some syntactic, some semantic, and some formatting-sensitive. They may look one token left, connect subjects to verbs, pull related topic words together, or watch indentation and punctuation. So when people say one layer has many heads, hear this: many small specialists are writing one shared edit.

Where this lives in the wild¶

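OpenAI's published GPT-family models, the lineage behind ChatGPT, use many attention heads per block so different relation types can be processed in parallel.
GitHub Copilot is built on transformer language models. Tracking code structure, indentation, and long-range symbol reuse is the kind of cross-token work attention heads are suited to, though which heads do what is not publicly documented.
Meta LLaMA models split model width across many heads, then merge them back with an output projection. Some variants use grouped-query attention, where several query heads share one key/value head.
Google's Gemma models use the same head-splitting idea, with some variants sharing key/value heads across query heads. Gemini is described as transformer-based, but fewer details are public.
Whisper by OpenAI also relies on attention inside transformer blocks, except the encoder positions represent audio frames and the decoder positions represent text tokens.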

Interview Q&A¶

Q: Inside a transformer block, where do Q, K, and V come from?
A: They are learned projections of the normalized residual stream entering the attention sublayer. Common wrong answer to avoid: "Q, K, and V come from embeddings only." They come from the current block input, which already contains edits from earlier blocks.

Q: Why do we split d_model across heads?
A: So multiple parallel crews can learn different relation patterns while keeping total model width fixed.

Q: What does the output projection do after concatenation?
A: It mixes the head outputs and maps the concatenated result back to d_model, producing one attention edit for the residual stream. Common wrong answer to avoid: "Concatenation is the final block output." Concatenation still has to go through W_O and then through the residual add.

Q: Does attention inside the block replace the residual stream?
A: No. It writes an edit that is added through the shortcut pipe.

Apply now (5 min)¶

Without looking, write these lines from memory.

Q = H W_Q
K = H W_K
V = H W_V
head_i = softmax(Q_i K_i^T / sqrt(d_k)) V_i

Now say aloud what d_k equals when d_model = 4 and h = 2. Then sketch from memory the two-head diagram. Show the split, the concat, and the output projection. Last step. Sketch from memory how token r3 became [1.376, 1.376, 0.500, 1.752] after the attention edit.

Bridge. The social bench mixes information across tokens. Next, see the private bench that transforms one token at a time: 08-ffn-in-the-block.md.