
08. Output projection — concatenation stacks, W_O mixes

~8 min read. Heads can speak separately. W_O turns them into one shared layer voice.

Built on the ELI5 in 00-eli5.md. The parallel graders have produced separate notes, and now one learned combiner must synthesise them.


1) Concatenation is stacking, not mixing

  • After attention, each head has an output vector.
  • Suppose there are H heads.
  • Suppose each head has width d_head.
  • Together they occupy width D = H * d_head.
  • Merging heads usually means concatenation first.
  • Concatenation restores shape [B, T, D].
  • That shape change can look finished.
  • It is not finished.
  • Concatenation places channels side by side.
  • It does not blend them.
  • Head 0 features remain in one block.
  • Head 1 features remain in another block.
  • The parallel graders have all spoken.
  • Nobody has written one final verdict yet.
  • W_O is that final combiner.
  • The exam rule already decided which earlier tokens each head could read.
  • The answer sheet already fixed the visible positions.
  • The projection lens already created Q, K, and V.
  • The memory shortcut may already have supplied cached K and V.
  • Even after all that, head outputs still need synthesis.
    ┌────────┐   ┌────────┐
    │ head 0 │──→│ [a, b] │──┐
    ├────────┤   ├────────┤  │
    │ head 1 │──→│ [c, d] │──┼──→ concat [a,b,c,d,e,f] ──→ [ W_O ] ──→ mixed output [D]
    ├────────┤   ├────────┤  │
    │ head 2 │──→│ [e, f] │──┘
    └────────┘   └────────┘
    heads stay separate here         heads can now talk to each other
    
  • That is why concatenation alone is not enough.
  • Stacking restores width.
  • Mixing restores interaction.
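
As a minimal sketch of the stacking step, the snippet below concatenates three made-up head outputs for a single token using plain Python lists. The helper names concat_heads and head_block are illustrative and not taken from any library. The point is only that each head's values still sit in their own slice after concatenation.

```python
def concat_heads(head_outputs):
    """Place per-head vectors side by side in one flat vector (no arithmetic)."""
    return [value for head in head_outputs for value in head]


def head_block(concat, head_index, d_head):
    """Return the slice of the concatenated vector that one head occupies."""
    start = head_index * d_head
    return concat[start:start + d_head]


# Made-up outputs for one token: H = 3 heads, d_head = 2, so D = 6.
heads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
d_head = 2

z = concat_heads(heads)
print(z)  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# Every head can be read back untouched: stacking moved values, nothing mixed.
for h, original in enumerate(heads):
    assert head_block(z, h, d_head) == original

# Changing head 1 only changes head 1's block of the concatenation.
heads_changed = [[1.0, 2.0], [30.0, 40.0], [5.0, 6.0]]
z_changed = concat_heads(heads_changed)
changed_positions = [i for i, (a, b) in enumerate(zip(z, z_changed)) if a != b]
print(changed_positions)  # [2, 3]
```

Because only positions 2 and 3 move, no channel of the concatenation carries information from more than one head.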

2) What W_O actually does

  • W_O is a learned matrix with shape [D, D].
  • Input shape stays [B, T, D].
  • Output shape also stays [B, T, D].
  • Same shape does not mean same meaning.
  • That sentence matters everywhere in deep learning.
  • A shape-preserving linear layer can still rotate the basis completely.
  • It can amplify some directions.
  • It can suppress others.
  • It can merge several head channels into one feature.
  • It can also spread one head feature across many channels.
  • So W_O is expressive even without a shape change.
  • Never judge a layer only by its input and output sizes.
  • Judge it by which channels can interact after it.
  • That is the useful question here.
  • W_O remaps the concatenated basis.
  • Every output channel can depend on every input channel.
  • So any head can influence any final feature.
  • Without W_O, head blocks stay isolated inside this layer's basis.
  • With W_O, the layer can combine evidence across heads immediately.
  • One head might track syntax.
  • Another head might track entities.
  • Another head might track discourse state.
  • W_O can blend those signals into one useful feature.
  • That is the point.
  • The rhythm is simple.
  • Split.
  • Attend in parallel.
  • Concatenate.
  • Mix.
  • That last word is W_O.
  • It is not a cosmetic final step.
  • It defines the output representation of the attention sublayer itself.
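
To make "same shape does not mean same meaning" concrete, here is a small sketch that applies two different [4, 4] matrices to the same concatenated vector. Both matrices and the helper name project are assumptions chosen for illustration. The sketch treats each row of the matrix as the recipe for one output channel.

```python
def project(z, w_o):
    """Apply W_O to one token: row i of w_o produces output channel i."""
    return [sum(w * x for w, x in zip(row, z)) for row in w_o]


# Made-up concat for one token: head 0 = [2, 1], head 1 = [4, 3].
z = [2.0, 1.0, 4.0, 3.0]

# Identity: shape-preserving and meaning-preserving, head blocks stay intact.
identity = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]

# Hypothetical mixer: same [4, 4] shape, but the channels now mean something else.
mixer = [
    [0.5, 0.0, 0.5, 0.0],   # average of the two heads' first features
    [0.0, 1.0, 0.0, -1.0],  # difference of the two heads' second features
    [2.0, 0.0, 0.0, 0.0],   # amplify one direction
    [0.0, 0.0, 0.0, 0.0],   # suppress everything
]

y_identity = project(z, identity)
y_mixed = project(z, mixer)

print(y_identity)  # [2.0, 1.0, 4.0, 3.0]
print(y_mixed)     # [3.0, -2.0, 4.0, 0.0]
print(len(y_identity) == len(y_mixed) == len(z))  # True
```

Both outputs have width 4, yet only the second one contains channels that depend on more than one head.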

3) One worked numerical example

  • Use two heads.
  • Let d_head = 2.
  • Then D = 4.
  • Suppose one token receives these head outputs.
  • h0 = [2, 1].
  • h1 = [4, 3].
  • After concatenation, z = [2, 1, 4, 3].
  • Now choose a tiny output projection.
    W_O =
    [
      [1, 0, 1, 0],
      [0, 1, 0, 1],
      [1, 1, 0, 0],
      [0, 0, 1, 1]
    ]
    
  • Compute y = W_O z, treating z as a column vector, so each row of W_O produces one output channel.
  • First channel reads input columns one and three.
  • So y_0 = 2*1 + 1*0 + 4*1 + 3*0 = 6.
  • Second channel reads input columns two and four.
  • So y_1 = 2*0 + 1*1 + 4*0 + 3*1 = 4.
  • Third channel mixes only the first head here.
  • So y_2 = 2*1 + 1*1 + 4*0 + 3*0 = 3.
  • Fourth channel mixes only the second head here.
  • So y_3 = 2*0 + 1*0 + 4*1 + 3*1 = 7.
  • The final mixed output is [6, 4, 3, 7].
  • Compare that with raw concat [2, 1, 4, 3].
  • Concatenation preserved separation.
  • W_O created interaction.
  • Output channel 0 now depends on both heads.
  • Output channel 1 also depends on both heads.
  • That is the behaviour we wanted.
  • The layer can now speak in one combined basis.
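
The hand computation above reads each output channel from one row of W_O, which corresponds to y = W_O z with z as a column vector. The following sketch repeats that calculation in plain Python and also lists which output channels draw on both heads. The helper name project is an illustrative choice.

```python
def project(z, w_o):
    """Compute y = W_O z for one token: row i of w_o gives output channel i."""
    return [sum(w * x for w, x in zip(row, z)) for row in w_o]


d_head = 2
h0 = [2, 1]
h1 = [4, 3]
z = h0 + h1  # list concatenation: [2, 1, 4, 3]

w_o = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]

y = project(z, w_o)
print(y)  # [6, 4, 3, 7]

# A channel uses a head if its row has any non-zero weight in that head's block.
channels_using_both = [
    i for i, row in enumerate(w_o)
    if any(row[:d_head]) and any(row[d_head:])
]
print(channels_using_both)  # [0, 1]
```

Reading the same matrix column by column (z @ W_O with z as a row vector) would give different numbers, so it is worth fixing one convention before checking results by hand.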

4) Why the next layer does not replace W_O

  • A common question sounds reasonable.
  • Could the next layer mix heads anyway?
  • Partly, yes.
  • But that misses the abstraction boundary.
  • W_O closes the current attention computation.
  • It decides how head-wise messages become one layer output.
  • The next layer starts from that mixed basis.
  • Without W_O, this attention block leaves head partitions exposed.
  • With W_O, the block defines one shared representation before the residual path.
  • That changes what the next layer receives.
  • So W_O is not redundant.
  • It is the attention sublayer's final learned mixer.
  • The projection lens specialised roles before reading.
  • The parallel graders then read independently.
  • The exam rule constrained every head equally.
  • The answer sheet fixed the same token positions for everyone.
  • The memory shortcut may have reused old keys and values.
  • W_O now turns those separate outcomes into one coordinated result.
  • That matters before the residual add happens.
  • It matters before the next MLP starts reading features.
  • It matters before the next attention layer builds new queries.
  • Every later block benefits from receiving one mixed basis instead of frozen head partitions.
  • That is why W_O lives inside the current layer, not outside it.
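
One way to picture "exposed head partitions" is to ask which residual channels a single head can write to. The sketch below is illustrative only: the values, the dense matrix, and the helper names are assumptions, and real blocks also include normalisation and biases that are omitted here.

```python
def project(z, w_o):
    """Compute W_O z for one token: row i of w_o gives output channel i."""
    return [sum(w * x for w, x in zip(row, z)) for row in w_o]


def residual_add(x, update):
    """Element-wise residual connection for one token."""
    return [a + b for a, b in zip(x, update)]


def touched(vector):
    """Indices of channels that ended up non-zero."""
    return [i for i, value in enumerate(vector) if value != 0.0]


x = [0.0, 0.0, 0.0, 0.0]  # residual stream for one token (zeros for clarity)
z = [1.0, 1.0, 0.0, 0.0]  # only head 0 produced a signal; head 1 is silent

# Without W_O: the concat is added directly, so head 0 reaches only its own block.
no_mix = residual_add(x, z)

# With a dense, made-up W_O: head 0's signal can land on any residual channel.
w_o = [
    [0.5, 0.0, 0.1, 0.0],
    [0.0, 0.5, 0.0, 0.1],
    [0.25, 0.25, 0.3, 0.0],
    [1.0, -0.5, 0.0, 0.3],
]
mixed = residual_add(x, project(z, w_o))

print(touched(no_mix))  # [0, 1]
print(touched(mixed))   # [0, 1, 2, 3]
```

In the unmixed case every later layer would see head 0 pinned to channels 0 and 1, which is the frozen partition the text describes.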

5) Practical implementation notes

  • In code, W_O is usually one linear layer.
  • Its input and output shapes match.
  • Beginners sometimes dismiss it because of that.
  • Do not dismiss it.
  • Shape preservation can still mean major basis change.
  • The residual connection also expects model width D.
  • Concatenation already supplies that width when H * d_head = D, and W_O preserves it, so the projected output can be added to the residual stream directly.
  • It also lets later MLP layers see mixed head information immediately.
  • If you remove W_O, head blocks survive longer than intended.
  • If you initialise W_O badly, the layer may mix useful signals poorly.
  • When debugging, compare concat output and projected output norms.
  • Large mismatches can indicate scaling problems.
  • Also inspect whether different heads meaningfully influence final channels.
  • That check reveals whether W_O is actually using multi-head diversity.
  • In mature models, it usually does.
  • Some channels become strong mixtures.
  • Some channels stay more head-specific.
  • The matrix can learn both behaviours simultaneously.
  • That flexibility is the real benefit.
  • Remember the short rule.
  • Concatenation restores width.
  • W_O restores coordination.
  • Keep that contrast sharp in your head.
  • It explains why multi-head attention ends with one more linear layer.
  • It also explains why concatenation by itself feels incomplete.
  • The final mixer is part of the design, not an afterthought.
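
The two debugging checks mentioned above can be sketched as follows, reusing the small matrix from the worked example. The helper names are hypothetical, and summing absolute weights per head block is only a crude, assumed proxy for "influence"; a real investigation would more likely look at activations over many tokens.

```python
import math


def project(z, w_o):
    """Compute W_O z for one token: row i of w_o gives output channel i."""
    return [sum(w * x for w, x in zip(row, z)) for row in w_o]


def l2_norm(vector):
    return math.sqrt(sum(x * x for x in vector))


def head_influence(w_o, num_heads, d_head):
    """table[i][h] = sum of |weights| linking head h's block to output channel i."""
    return [
        [
            sum(abs(w) for w in row[h * d_head:(h + 1) * d_head])
            for h in range(num_heads)
        ]
        for row in w_o
    ]


z = [2.0, 1.0, 4.0, 3.0]  # concat output for one token
w_o = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]
y = project(z, w_o)

# Check 1: compare norms before and after the projection.
ratio = l2_norm(y) / l2_norm(z)
print(round(ratio, 3))  # 1.915

# Check 2: which heads feed which output channels.
influence = head_influence(w_o, num_heads=2, d_head=2)
print(influence)  # [[1, 1], [1, 1], [2, 0], [0, 2]]
```

Here channels 0 and 1 are mixtures, while channels 2 and 3 stay head-specific, which matches the observation that one matrix can hold both behaviours at once.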

Where this lives in the wild

  • In GPT-2, c_proj mixes concatenated attention heads before the residual add.
  • In LLaMA, o_proj is the explicit output linear after multi-head attention.
  • In BERT, BertSelfOutput.dense remaps the attended tensor back into model space.
  • Vision Transformer — self-attention ends with an output projection before the residual path.
  • Whisper decoder — attention outputs pass through a final output projection inside each block.

Pause and recall

  1. Why is concatenation alone not enough after multi-head attention?
  2. In the example, which output channels used information from both heads?
  3. Why is W_O different from the next layer's input projection?
  4. What short rule summarises concat versus W_O?

Interview Q&A

Q1. What is the role of W_O in multi-head attention?
  • It mixes concatenated head outputs back into one shared model representation.
  • Common wrong answer to avoid: "W_O only changes the shape back to D."

Q2. Why can heads not coordinate fully without W_O?
  • Because concatenation preserves separate head blocks instead of learned cross-head mixing.
  • Common wrong answer to avoid: "Concatenation already averages the heads together."

Q3. How would you explain W_O with the exam hall picture?
  • The parallel graders submit separate notes, and one combiner writes the unified remark.
  • Common wrong answer to avoid: "W_O is another causal mask."


Apply now (5 min)

  • Quick exercise.
  • Take head outputs [1, 2] and [3, 4].
  • Concatenate them.
  • Then choose your own 4 by 4 W_O.
  • Compute one mixed output channel by hand.
  • Sketch from memory.
  • Draw head outputs entering concat.
  • Then draw W_O producing one final vector.

Bridge. Good. We now understand the full attention sublayer output. Next we zoom out and track every major tensor through a whole transformer block. → 09-transformer-block-shapes.md