08. Output projection: concatenation stacks, W_O mixes

~8 min read. Heads can speak separately. W_O turns them into one shared layer voice.

Built on the ELI5 in 00-eli5.md. The parallel graders have produced separate notes, and now one learned combiner must synthesise them.

1) Concatenation is stacking, not mixing

After attention, each head has an output vector.
Suppose there are H heads.
Suppose each head has width d_head.
Together they occupy width D = H * d_head.
Merging heads usually means concatenation first.
Concatenation restores shape [B, T, D].
That shape change can look finished.
It is not finished.
Concatenation places channels side by side.
It does not blend them.
Head 0 features remain in one block.
Head 1 features remain in another block.
The parallel graders have all spoken.
Nobody has written one final verdict yet.
W_O is that final combiner.
The exam rule already decided which earlier tokens each head could read.
The answer sheet already fixed the visible positions.
The projection lens already created Q, K, and V.
The memory shortcut may already have supplied cached K and V.
Even after all that, head outputs still need synthesis.

┌────────┐   ┌────────┐
│ head 0 │──→│ [a, b] │──┐
├────────┤   ├────────┤  │
│ head 1 │──→│ [c, d] │──┼──→ concat [a,b,c,d,e,f] ──→ [ W_O ] ──→ mixed output [D]
├────────┤   ├────────┤  │
│ head 2 │──→│ [e, f] │──┘
└────────┘   └────────┘
heads stay separate here         heads can now talk to each other

That is why concatenation alone is not enough.
Stacking restores width.
Mixing restores interaction.
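
The stacking step can be sketched in plain Python. This is a minimal illustration for one token, using the three heads from the diagram with made-up numbers; the helper name `concat_heads` is hypothetical, and real frameworks usually do the same thing with a transpose and reshape.

```python
# Per-head outputs for one token. Values are illustrative only.
heads = [
    [1.0, 2.0],  # head 0 -> [a, b]
    [3.0, 4.0],  # head 1 -> [c, d]
    [5.0, 6.0],  # head 2 -> [e, f]
]


def concat_heads(head_outputs):
    """Place head outputs side by side; no values are combined."""
    flat = []
    for h in head_outputs:
        flat.extend(h)
    return flat


z = concat_heads(heads)
d_head = len(heads[0])

# Width is restored: D = H * d_head.
print(len(z))  # 6

# Each head still owns its own contiguous block of channels.
blocks = [z[i * d_head:(i + 1) * d_head] for i in range(len(heads))]
print(blocks == heads)  # True
```

Slicing the concatenated vector recovers every head output unchanged, which is exactly what "stacking, not mixing" means.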

2) What W_O actually does

W_O is a learned matrix with shape [D, D].
Input shape stays [B, T, D].
Output shape also stays [B, T, D].
Same shape does not mean same meaning.
That sentence matters everywhere in deep learning.
A shape-preserving linear layer can still rotate the basis completely.
It can amplify some directions.
It can suppress others.
It can merge several head channels into one feature.
It can also spread one head feature across many channels.
So W_O is expressive even without a shape change.
Never judge a layer only by its input and output sizes.
Judge it by which channels can interact after it.
That is the useful question here.
W_O remaps the concatenated basis.
Every output channel can depend on every input channel.
So any head can influence any final feature.
Without W_O,
head blocks stay isolated inside this layer's basis.
With W_O,
the layer can combine evidence across heads immediately.
One head might track syntax.
Another head might track entities.
Another head might track discourse state.
W_O can blend those signals into one useful feature.
That is the point.
The rhythm is simple.
Split.
Attend in parallel.
Concatenate.
Mix.
That last word is W_O.
It is not a cosmetic final step.
It defines the output representation of the attention sublayer itself.
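
One way to see "same shape, different meaning" is to change a single head and watch which output channels move. The sketch below is an assumption-laden toy: two heads, d_head = 2, hand-picked matrices, and a hypothetical `project` helper that treats each row of W_O as the weights for one output channel.

```python
def project(w_o, z):
    """Apply W_O with one row of weights per output channel."""
    return [sum(w * x for w, x in zip(row, z)) for row in w_o]


# Two heads, d_head = 2, so D = 4. All numbers are illustrative.
z_a = [2.0, 1.0, 4.0, 3.0]
z_b = [2.0, 1.0, 9.0, 3.0]  # only head 1 changed (input channel 2)

# Block-diagonal: each output channel reads one head only.
w_isolated = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, -1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, -1.0],
]

# Dense: every output channel reads every input channel.
w_dense = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, -0.5, 0.5, -0.5],
    [0.5, 0.5, -0.5, -0.5],
    [0.5, -0.5, -0.5, 0.5],
]


def changed_channels(w_o):
    """Indices of output channels that moved when head 1 changed."""
    y_a, y_b = project(w_o, z_a), project(w_o, z_b)
    return [i for i, (a, b) in enumerate(zip(y_a, y_b)) if a != b]


print(changed_channels(w_isolated))  # [2, 3]
print(changed_channels(w_dense))     # [0, 1, 2, 3]
```

Both matrices map width 4 to width 4. Only the dense one lets a change in head 1 reach every output channel.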

3) One worked numerical example

Use two heads.
Let d_head = 2.
Then D = 4.
Suppose one token receives these head outputs.
h0 = [2, 1].
h1 = [4, 3].
After concatenation,
z = [2, 1, 4, 3].

Now choose a tiny output projection.

W_O =
[
  [1, 0, 1, 0],
  [0, 1, 0, 1],
  [1, 1, 0, 0],
  [0, 0, 1, 1]
]

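Compute y = W_O z, treating z as a column vector.
Row i of W_O then holds the weights for output channel i.
(With z as a row vector, the same computation is y = z @ W_O^T.)
The first channel uses row one, which reads input channels one and three.
So y_0 = (2 * 1) + (1 * 0) + (4 * 1) + (3 * 0) = 6.
The second channel uses row two, which reads input channels two and four.
So y_1 = (2 * 0) + (1 * 1) + (4 * 0) + (3 * 1) = 4.
The third channel reads only the first head here.
So y_2 = (2 * 1) + (1 * 1) + (4 * 0) + (3 * 0) = 3.
The fourth channel reads only the second head here.
So y_3 = (2 * 0) + (1 * 0) + (4 * 1) + (3 * 1) = 7.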
The final mixed output is [6, 4, 3, 7].
Compare that with raw concat [2, 1, 4, 3].
Concatenation preserved separation.
W_O created interaction.
Output channel 0 now depends on both heads.
Output channel 1 also depends on both heads.
That is the behaviour we wanted.
The layer can now speak in one combined basis.
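
The hand computation above can be checked in a few lines of plain Python. This sketch assumes each row of W_O holds the weights for one output channel, which is the convention that reproduces the stated result [6, 4, 3, 7]; the variable names are illustrative.

```python
z = [2, 1, 4, 3]  # concat of h0 = [2, 1] and h1 = [4, 3]

W_O = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]

# Row i of W_O holds the weights for output channel i.
y = [sum(w * x for w, x in zip(row, z)) for row in W_O]
print(y)  # [6, 4, 3, 7]

# Which heads feed each output channel?
# Head 0 owns input channels 0-1, head 1 owns input channels 2-3.
d_head = 2
heads_per_channel = [
    sorted({j // d_head for j, w in enumerate(row) if w != 0})
    for row in W_O
]
print(heads_per_channel)  # [[0, 1], [0, 1], [0], [1]]
```

Channels 0 and 1 draw on both heads, while channels 2 and 3 each stay with a single head. One matrix can hold both behaviours at once.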

4) Why the next layer does not replace W_O

A common question sounds reasonable.
Could the next layer mix heads anyway?
Partly, yes.
But that misses the abstraction boundary.
W_O closes the current attention computation.
It decides how head-wise messages become one layer output.
The next layer starts from that mixed basis.
Without W_O,
this attention block leaves head partitions exposed.
With W_O,
the block defines one shared representation before the residual path.
That changes what the next layer receives.
So W_O is not redundant.
It is the attention sublayer's final learned mixer.
The projection lens specialised roles before reading.
The parallel graders then read independently.
The exam rule constrained every head equally.
The answer sheet fixed the same token positions for everyone.
The memory shortcut may have reused old keys and values.
W_O now turns those separate outcomes into one coordinated result.
That matters before the residual add happens.
It matters before the next MLP starts reading features.
It matters before the next attention layer builds new queries.
Every later block benefits from receiving one mixed basis instead of frozen head partitions.
That is why W_O lives inside the current layer, not outside it.

5) Practical implementation notes

In code, W_O is usually one linear layer.
Its input and output shapes match.
Beginners sometimes dismiss it because of that.
Do not dismiss it.
Shape preservation can still mean major basis change.
The residual connection also expects model width D.
Concatenation already supplies that width when H * d_head = D, and W_O preserves it while remixing the channels.
It also lets later MLP layers see mixed head information immediately.
If you remove W_O,
head blocks reach the residual path unmixed.
If you initialise W_O poorly,
the layer may start out mixing useful signals badly.
When debugging, compare concat output and projected output norms.
Large mismatches can indicate scaling problems.
Also inspect whether different heads meaningfully influence final channels.
That check reveals whether W_O is actually using multi-head diversity.
In mature models, it usually does.
Some channels become strong mixtures.
Some channels stay more head-specific.
The matrix can learn both behaviours simultaneously.
That flexibility is the real benefit.
Remember the short rule.
Concatenation restores width.
W_O restores coordination.
Keep that contrast sharp in your head.
It explains why multi-head attention ends with one more linear layer.
It also explains why concatenation by itself feels incomplete.
The final mixer is part of the design, not an afterthought.
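
The two debugging checks mentioned above can be sketched as follows. This is a rough illustration, not a standard diagnostic: the helpers `project`, `l2_norm`, and `head_weight_norms` are hypothetical, the matrix is the toy W_O from the worked example, and weight norms are only a proxy for influence because real influence also depends on activation scale.

```python
import math


def project(w_o, z):
    """Apply W_O with one row of weights per output channel."""
    return [sum(w * x for w, x in zip(row, z)) for row in w_o]


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def head_weight_norms(w_o, num_heads):
    """For each output channel, the weight norm it assigns to each head."""
    d_head = len(w_o[0]) // num_heads
    return [
        [l2_norm(row[h * d_head:(h + 1) * d_head]) for h in range(num_heads)]
        for row in w_o
    ]


z = [2.0, 1.0, 4.0, 3.0]
w_o = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]

# Check 1: compare the norm before and after the projection.
y = project(w_o, z)
ratio = l2_norm(y) / l2_norm(z)
print("projected / concat norm:", round(ratio, 3))

# Check 2: see how strongly each output channel weights each head.
table = head_weight_norms(w_o, num_heads=2)
for channel, norms in enumerate(table):
    print(channel, [round(n, 3) for n in norms])
```

A ratio far from what you expect hints at a scaling problem. A row of the table with a zero entry marks an output channel that ignores that head entirely.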

Where this lives in the wild

GPT-2 — c_proj mixes concatenated attention heads before the residual add.
LLaMA — o_proj is the explicit output linear after multi-head attention.
BERT — BertSelfOutput.dense remaps the attended tensor back into model space.
Vision Transformer — self-attention ends with an output projection before the residual path.
Whisper decoder — attention outputs pass through a final output projection inside each block.

Pause and recall

Why is concatenation alone not enough after multi-head attention?
In the example, which output channels used information from both heads?
Why is W_O different from the next layer's input projection?
What short rule summarises concat versus W_O?

Interview Q&A

Q1. What is the role of W_O in multi-head attention?
- It mixes concatenated head outputs back into one shared model representation.
Common wrong answer to avoid: "W_O only changes the shape back to D."

Q2. Why can heads not coordinate fully without W_O?
- Because concatenation preserves separate head blocks instead of learned cross-head mixing.
Common wrong answer to avoid: "Concatenation already averages the heads together."

Q3. How would you explain W_O with the exam hall picture?
- The parallel graders submit separate notes, and one combiner writes the unified remark.
Common wrong answer to avoid: "W_O is another causal mask."

Apply now (5 min)

Quick exercise.
Take head outputs [1, 2] and [3, 4].
Concatenate them.
Then choose your own 4 by 4 W_O.
Compute one mixed output channel by hand.
Sketch from memory.
Draw head outputs entering concat.
Then draw W_O producing one final vector.

Bridge. Good. We now understand the full attention sublayer output. Next we zoom out and track every major tensor through a whole transformer block. → 09-transformer-block-shapes.md