08. Output projection: concatenation stacks, W_O mixes
~8 min read. Heads can speak separately. W_O turns them into one shared layer voice.
Built on the ELI5 in 00-eli5.md. The parallel graders have produced separate notes, and now one learned combiner must synthesise them.
1) Concatenation is stacking, not mixing
- After attention, each head has an output vector.
- Suppose there are H heads.
- Suppose each head has width d_head.
- Together they occupy width D = H * d_head.
- Merging heads usually means concatenation first.
- Concatenation restores shape [B, T, D].
- That shape change can look finished.
- It is not finished.
- Concatenation places channels side by side.
- It does not blend them.
- Head 0 features remain in one block.
- Head 1 features remain in another block.
- The parallel graders have all spoken.
- Nobody has written one final verdict yet.
- W_O is that final combiner.
- The exam rule already decided which earlier tokens each head could read.
- The answer sheet already fixed the visible positions.
- The projection lens already created Q, K, and V.
- The memory shortcut may already have supplied cached K and V.
- Even after all that, head outputs still need synthesis.
- That is why concatenation alone is not enough.
- Stacking restores width.
- Mixing restores interaction.
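The stacking step can be sketched for a single token in plain Python. The three head vectors and the `concat_heads` helper are illustrative assumptions, not taken from any library; the point is only that each head keeps its own contiguous block after concatenation.

```python
# Hypothetical per-head outputs for one token: H = 3 heads, d_head = 2.
head_outputs = [
    [0.5, -1.0],   # head 0
    [2.0, 0.0],    # head 1
    [-3.0, 4.0],   # head 2
]

def concat_heads(heads):
    """Stack head vectors side by side into one vector of width H * d_head."""
    return [value for head in heads for value in head]

z = concat_heads(head_outputs)
d_head = len(head_outputs[0])

# Width is restored: D = H * d_head.
print(len(z))  # 6

# Each head still owns its own contiguous block; nothing has been blended.
for h, head in enumerate(head_outputs):
    block = z[h * d_head:(h + 1) * d_head]
    print(h, block, block == head)
```

Every comparison prints True: concatenation only moved values next to each other.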
2) What W_O actually does
- W_O is a learned matrix with shape [D, D].
- Input shape stays [B, T, D].
- Output shape also stays [B, T, D].
- Same shape does not mean same meaning.
- That sentence matters everywhere in deep learning.
- A shape-preserving linear layer can still rotate the basis completely.
- It can amplify some directions.
- It can suppress others.
- It can merge several head channels into one feature.
- It can also spread one head feature across many channels.
- So W_O is expressive even without a shape change.
- Never judge a layer only by its input and output sizes.
- Judge it by which channels can interact after it.
- That is the useful question here.
- W_O remaps the concatenated basis.
- Every output channel can depend on every input channel.
- So any head can influence any final feature.
- Without W_O, head blocks stay isolated inside this layer's basis.
- With W_O, the layer can combine evidence across heads immediately.
- One head might track syntax.
- Another head might track entities.
- Another head might track discourse state.
- W_O can blend those signals into one useful feature.
- That is the point.
- The rhythm is simple.
- Split.
- Attend in parallel.
- Concatenate.
- Mix.
- That last word is W_O.
- It is not a cosmetic final step.
- It defines the output representation of the attention sublayer itself.
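One rough way to see "every output channel can depend on every input channel" is to nudge one head's block and watch which output channels move. The sketch below assumes a made-up dense W_O and uses an identity matrix as a stand-in for "no W_O"; the helper names are hypothetical. A perturbation test like this can miss a dependency if weights happen to cancel, so treat it as an illustration rather than a proof.

```python
def project(z, W):
    """Compute y = z @ W for a row vector z and a matrix W given as a list of rows."""
    return [sum(z[i] * W[i][j] for i in range(len(z))) for j in range(len(W[0]))]

def channels_changed_by(head_index, d_head, W, z, bump=1.0):
    """Return the output channels that move when one head's block of z is perturbed."""
    z_bumped = list(z)
    for i in range(head_index * d_head, (head_index + 1) * d_head):
        z_bumped[i] += bump
    before, after = project(z, W), project(z_bumped, W)
    return [j for j in range(len(before)) if abs(after[j] - before[j]) > 1e-12]

D, d_head = 4, 2
z = [1.0, 2.0, 3.0, 4.0]  # concatenation of two hypothetical heads

# Identity stands in for "no W_O": head blocks pass through untouched.
identity = [[1.0 if i == j else 0.0 for j in range(D)] for i in range(D)]

# A made-up dense W_O in which every entry is nonzero.
dense = [
    [0.5, -1.0, 0.25, 2.0],
    [1.0, 0.5, -0.5, 1.0],
    [-2.0, 1.0, 1.5, 0.5],
    [0.25, 2.0, -1.0, -1.5],
]

print(channels_changed_by(0, d_head, identity, z))  # [0, 1]
print(channels_changed_by(0, d_head, dense, z))     # [0, 1, 2, 3]
```

With the identity, head 0 only reaches its own two channels. With the dense matrix, the same head reaches all four, even though the shape never changed.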
3) One worked numerical example
- Use two heads.
- Let d_head = 2.
- Then D = 4.
- Suppose one token receives these head outputs.
- h0 = [2, 1].
- h1 = [4, 3].
- After concatenation, z = [2, 1, 4, 3].
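- Now choose a tiny output projection, written row by row.
- W_O = [[1, 0, 1, 0], [0, 1, 1, 0], [1, 0, 0, 1], [0, 1, 0, 1]].
- Compute y = z @ W_O.
- Output channel j is the dot product of z with column j of W_O.
- The first channel reads input channels one and three, one from each head.
- So y_0 = 2*1 + 1*0 + 4*1 + 3*0 = 6.
- The second channel reads input channels two and four, again one from each head.
- So y_1 = 2*0 + 1*1 + 4*0 + 3*1 = 4.
- The third channel mixes only the first head here.
- So y_2 = 2*1 + 1*1 + 4*0 + 3*0 = 3.
- The fourth channel mixes only the second head here.
- So y_3 = 2*0 + 1*0 + 4*1 + 3*1 = 7.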
- The final mixed output is [6, 4, 3, 7].
- Compare that with raw concat [2, 1, 4, 3].
- Concatenation preserved separation.
- W_O created interaction.
- Output channel 0 now depends on both heads.
- Output channel 1 also depends on both heads.
- That is the behaviour we wanted.
- The layer can now speak in one combined basis.
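The arithmetic above can be checked in a few lines of plain Python. The matrix here is read off from the four sums in the example, with the convention (an assumption of this sketch) that W_O[i][j] is the weight from input channel i to output channel j.

```python
h0 = [2, 1]
h1 = [4, 3]
z = h0 + h1  # list concatenation: [2, 1, 4, 3]

# W_O[i][j] is the weight from input channel i to output channel j.
W_O = [
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
]

# y = z @ W_O, one dot product per output channel.
y = [sum(z[i] * W_O[i][j] for i in range(4)) for j in range(4)]
print(y)  # [6, 4, 3, 7]
```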
4) Why the next layer does not replace W_O
- A common question sounds reasonable.
- Could the next layer mix heads anyway?
- Partly, yes.
- But that misses the abstraction boundary.
- W_O closes the current attention computation.
- It decides how head-wise messages become one layer output.
- The next layer starts from that mixed basis.
- Without W_O, this attention block leaves head partitions exposed.
- With W_O, the block defines one shared representation before the residual path.
- That changes what the next layer receives.
- So W_O is not redundant.
- It is the attention sublayer's final learned mixer.
- The projection lens specialised roles before reading.
- The parallel graders then read independently.
- The exam rule constrained every head equally.
- The answer sheet fixed the same token positions for everyone.
- The memory shortcut may have reused old keys and values.
- W_O now turns those separate outcomes into one coordinated result.
- That matters before the residual add happens.
- It matters before the next MLP starts reading features.
- It matters before the next attention layer builds new queries.
- Every later block benefits from receiving one mixed basis instead of frozen head partitions.
- That is why W_O lives inside the current layer, not outside it.
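The ordering described above (concatenate, mix, then add to the residual stream) can be sketched for one token. This is a simplified illustration: layer normalisation and dropout are omitted, the residual input `x` is invented, and the head outputs and W_O reuse the numbers from the worked example.

```python
def project(z, W):
    """Compute y = z @ W for a row vector z and a matrix W given as a list of rows."""
    return [sum(z[i] * W[i][j] for i in range(len(z))) for j in range(len(W[0]))]

def attention_sublayer_output(x, head_outputs, W_O):
    """End of the sublayer for one token: concatenate, mix with W_O, then residual add."""
    z = [value for head in head_outputs for value in head]  # concatenate
    mixed = project(z, W_O)                                 # mix across heads
    return [xi + mi for xi, mi in zip(x, mixed)]            # residual add

x = [10, 20, 30, 40]          # hypothetical residual-stream input for one token
heads = [[2, 1], [4, 3]]      # head outputs from the worked example
W_O = [
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
]

out = attention_sublayer_output(x, heads, W_O)
print(out)  # [16, 24, 33, 47]
```

The mixing happens before the addition, so what enters the residual stream is already expressed in the shared basis.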
5) Practical implementation notes
- In code, W_O is usually one linear layer.
- Its input and output shapes match.
- Beginners sometimes dismiss it because of that.
- Do not dismiss it.
- Shape preservation can still mean major basis change.
- The residual connection also expects model width D.
- W_O's output has exactly that width, so it can be added to the residual stream directly.
- It also lets later MLP layers see mixed head information immediately.
- If you remove W_O, the head blocks enter the residual stream unmixed, and cross-head mixing has to wait for later layers.
- If W_O is initialised badly, the layer may mix useful signals poorly.
- When debugging, compare concat output and projected output norms.
- Large mismatches can indicate scaling problems.
- Also inspect whether different heads meaningfully influence final channels.
- That check reveals whether W_O is actually using multi-head diversity.
- In mature models, it usually does.
- Some channels become strong mixtures.
- Some channels stay more head-specific.
- The matrix can learn both behaviours simultaneously.
- That flexibility is the real benefit.
- Remember the short rule.
- Concatenation restores width.
- W_O restores coordination.
- Keep that contrast sharp in your head.
- It explains why multi-head attention ends with one more linear layer.
- It also explains why concatenation by itself feels incomplete.
- The final mixer is part of the design, not an afterthought.
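The two debugging checks mentioned above can be sketched in plain Python. The sketch relies on linearity: z @ W_O equals the sum of one term per head, where each term multiplies that head's output by its own block of rows in W_O. The helper names are hypothetical, and the numbers reuse the worked example rather than values from a real model.

```python
import math

def project(z, W):
    """Compute y = z @ W for a row vector z and a matrix W given as a list of rows."""
    return [sum(z[i] * W[i][j] for i in range(len(z))) for j in range(len(W[0]))]

def l2_norm(v):
    return math.sqrt(sum(a * a for a in v))

def per_head_contributions(head_outputs, W_O):
    """Split z @ W_O into one additive term per head, using that head's rows of W_O."""
    d_head = len(head_outputs[0])
    contributions = []
    for h, head in enumerate(head_outputs):
        rows = W_O[h * d_head:(h + 1) * d_head]
        contributions.append(project(head, rows))
    return contributions

heads = [[2.0, 1.0], [4.0, 3.0]]
W_O = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 1.0],
]

z = [value for head in heads for value in head]
y = project(z, W_O)

# Check 1: compare norms before and after the projection.
print(l2_norm(z), l2_norm(y))

# Check 2: see how much each head contributes to each output channel.
parts = per_head_contributions(heads, W_O)
for h, part in enumerate(parts):
    print(h, part)

# The per-head terms add back up to the full projected output.
summed = [sum(part[j] for part in parts) for j in range(len(y))]
print(summed == y)
```

In this toy case head 0 contributes nothing to channel 3 and head 1 contributes nothing to channel 2, which matches the "some channels stay more head-specific" observation.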
Where this lives in the wild
- GPT-2: c_proj mixes concatenated attention heads before the residual add.
- LLaMA: o_proj is the explicit output linear after multi-head attention.
- BERT: BertSelfOutput.dense remaps the attended tensor back into model space.
- Vision Transformer: self-attention ends with an output projection before the residual path.
- Whisper decoder: attention outputs pass through a final output projection inside each block.
Pause and recall
- Why is concatenation alone not enough after multi-head attention?
- In the example, which output channels used information from both heads?
- Why is W_O different from the next layer's input projection?
- What short rule summarises concat versus W_O?
Interview Q&A
Q1. What is the role of W_O in multi-head attention?
- It mixes concatenated head outputs back into one shared model representation.
- Common wrong answer to avoid: "W_O only changes the shape back to D."
Q2. Why can heads not coordinate fully without W_O?
- Because concatenation preserves separate head blocks instead of learned cross-head mixing.
- Common wrong answer to avoid: "Concatenation already averages the heads together."
Q3. How would you explain W_O with the exam hall picture?
- The parallel graders submit separate notes, and one combiner writes the unified remark.
- Common wrong answer to avoid: "W_O is another causal mask."
Apply now (5 min)
- Quick exercise.
- Take head outputs [1, 2] and [3, 4].
- Concatenate them.
- Then choose your own 4 by 4 W_O.
- Compute one mixed output channel by hand.
- Sketch from memory.
- Draw head outputs entering concat.
- Then draw W_O producing one final vector.
Bridge. Good. We now understand the full attention sublayer output. Next we zoom out and track every major tensor through a whole transformer block. → 09-transformer-block-shapes.md