05. Why separate projections: routing and payload need freedom

~8 min read. One shared matrix looks elegant. It quietly blocks useful behaviour.

Built on the ELI5 in 00-eli5.md. The projection lens now becomes three learned jobs with different goals.

1) One matrix feels tidy

We already have token features in X.
So one matrix W sounds sufficient.
That gives Q = XW, K = XW, V = XW.
The notation looks clean.
The implementation looks compact.
The restriction hides inside that neatness.
Queries search for useful earlier tokens.
Keys advertise how tokens can be found.
Values carry the payload that gets copied.
Search and payload are related.
Search and payload are not identical.
A token may match by tense.
The same token may send entity identity.
A token may match by syntax.
The same token may send semantic content.
One shared map ties those roles together.
That reduces expressive freedom immediately.
The exam rule is unchanged here.
Causal masking still blocks future peeking.
The answer sheet is unchanged too.
Sequence length still sets the visible table.
The parallel graders arrive later.
They still need flexible inputs.
The memory shortcut arrives later too.
It still stores different streams for K and V.
So separation matters before heads or caching appear.
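
The tidy-looking shared design can be sketched in a few lines of NumPy. This is only an illustration: the sizes and random inputs are assumptions, not taken from any real model. It shows that one shared W makes Q, K, and V the same array, and that the raw score table Q K^T then comes out symmetric before masking, which is one visible form of the hidden restriction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4 tokens, 3 input features, 2 projected features.
X = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 2))

# The "tidy" design: one shared matrix for all three roles.
Q = X @ W
K = X @ W
V = X @ W

# Routing and payload are literally the same numbers.
same_tensor = np.allclose(Q, K) and np.allclose(K, V)

# Because Q equals K, the raw score table Q K^T is symmetric:
# token i scores token j exactly as token j scores token i
# (this is before any causal masking or softmax).
scores = Q @ K.T
symmetric = np.allclose(scores, scores.T)

print(same_tensor, symmetric)  # True True
```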

2) Matching space and content space should separate

Attention scores use Q and K only.
Values never enter score computation directly.
That fact should stay memorised.
Queries and keys decide who talks.
Values decide what gets said.
If Q, K, and V are forced to be identical,
routing and payload share one bottleneck.
Real language rarely wants that bottleneck.
A token might advertise agreement features.
That same token might send meaning features.
A token might advertise position clues.
That same token might send entity clues.
Folder labels and folder contents differ.
Search fields and document bodies differ.

Attention wants the same separation.

SHARED (one lens)                  SEPARATE (three lenses)
┌─────────────────────┐            ┌────────────────────────┐
│ X ──→ W ──→ Q       │            │ X ──┬──→ W_Q ──→ Q     │
│ X ──→ W ──→ K       │            │     ├──→ W_K ──→ K     │
│ X ──→ W ──→ V       │            │     └──→ W_V ──→ V     │
├─────────────────────┤            ├────────────────────────┤
│ route = payload ✗   │            │ route ≠ payload ✓     │
└─────────────────────┘            └────────────────────────┘

With one lens, routing and payload entangle.
With three lenses, they decouple cleanly.
One matrix can highlight matchability.
Another can highlight retrievability.
Another can preserve message content.
That freedom helps every later head.
That freedom also helps cached tensors later.
Separate projections are expressive, not ornamental.
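
A minimal single-head sketch of the three-lens picture follows. The helper names (softmax, attention), the sizes, and the random weights are assumptions made for illustration. It swaps W_V while holding W_Q and W_K fixed, to show that the attention weights (who talks) stay put while the outputs (what gets said) change.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(X, W_Q, W_K, W_V):
    # Three separate lenses over the same token features X.
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    # Scores use Q and K only; V never enters this line.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Causal mask: position i may not look at positions after i.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = softmax(scores)
    return weights, weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))  # hypothetical: 4 tokens, 3 features
W_Q, W_K = rng.normal(size=(3, 2)), rng.normal(size=(3, 2))
W_V_a, W_V_b = rng.normal(size=(3, 2)), rng.normal(size=(3, 2))

weights_a, out_a = attention(X, W_Q, W_K, W_V_a)
weights_b, out_b = attention(X, W_Q, W_K, W_V_b)

routing_unchanged = np.allclose(weights_a, weights_b)
payload_changed = not np.allclose(out_a, out_b)
print(routing_unchanged, payload_changed)  # True True
```

With a single shared W this experiment is impossible: any change to the payload map is also a change to the routing map.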

3) What three matrices actually buy you

W_Q lets a token ask the right question.
W_K lets a token publish searchable features.
W_V lets a token keep useful payload features.
Those jobs cooperate without becoming identical.
This is the whole design win.
W_Q can emphasise request features.
W_K can emphasise index features.
W_V can emphasise content features.
The model can match on tense.
The model can send entity identity.
The model can match on local syntax.
The model can send semantic facts.
It can match on discourse markers.
It can send topic information.
Shared projection blocks those clean splits.
Separate projection makes them learnable.
Inside one layer, that already helps.
Across many layers, that compounds.
Across many parallel graders, that compounds again.
Each head receives better raw material.
Each head can specialise differently.
Later, the memory shortcut caches K and V.
Cached keys stay searchable summaries.
Cached values stay payload summaries.
That division only works cleanly if the K and V representations differ.
So the separation is structural.
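
As a rough preview of the memory shortcut, here is a toy decoding loop under stated assumptions: single head, no batching, random weights, and a hypothetical helper named decode_step. It is not how any particular library implements caching. It only shows that keys and values are stored as two separate streams, while the query is used once and discarded.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 3, 2  # hypothetical sizes
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))

K_cache = np.empty((0, d_head))  # searchable summaries of past tokens
V_cache = np.empty((0, d_head))  # payload summaries of past tokens

def decode_step(x, K_cache, V_cache):
    """Process one new token vector x of shape (d_model,)."""
    q = x @ W_Q  # the query is used for this step only, never cached
    K_cache = np.vstack([K_cache, x @ W_K])
    V_cache = np.vstack([V_cache, x @ W_V])
    # Only past and current tokens are in the cache, so no mask is needed.
    scores = K_cache @ q / np.sqrt(d_head)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V_cache, K_cache, V_cache

for x in rng.normal(size=(4, d_model)):
    out, K_cache, V_cache = decode_step(x, K_cache, V_cache)

print(K_cache.shape, V_cache.shape)  # (4, 2) (4, 2)
```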

4) One worked numerical example

Use two token vectors.
Let x_A = [1, 0].
Let x_B = [1, 2].
Feature one means shared tense.
Feature two means entity signal.
Both tokens share tense.
The entity signal differs.
We want equal keys.
We want different values.
Now force one shared scalar projection.
Write W = [a, b]^T.
Then y_A = [1, 0] · [a, b]^T.
So y_A = a.
And y_B = [1, 2] · [a, b]^T.
So y_B = a + 2b.
Shared projection means k_A = y_A.
Shared projection means k_B = y_B.
To match equally by tense,
we need k_A = k_B.
Therefore a = a + 2b.
Subtract a from both sides.
Then 0 = 2b.
So b = 0.
Now inspect the values.
Shared projection also means v_A = y_A.
Shared projection also means v_B = y_B.
Therefore v_A = a.
And v_B = a + 2b.
But b already equals zero.
So v_B = a.
Therefore v_A = v_B.
The entity difference vanished.
That is the trap.
Equal keys forced equal values.
One matrix could not satisfy both goals.
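
The trap can be checked by brute force. The sketch below scans a small grid of (a, b) values, an arbitrary choice made for illustration, and counts how many shared projections give equal keys and different values at the same time. The helper name shared_projection is hypothetical.

```python
import numpy as np

x_A = np.array([1.0, 0.0])
x_B = np.array([1.0, 2.0])

def shared_projection(a, b):
    W = np.array([a, b])
    y_A, y_B = x_A @ W, x_B @ W
    # One shared map: each token's key and value are the same number.
    return (y_A, y_B), (y_A, y_B)

solutions = 0        # settings with equal keys AND different values
equal_key_cases = 0  # settings with equal keys at all
for a in np.linspace(-2, 2, 9):
    for b in np.linspace(-2, 2, 9):
        (k_A, k_B), (v_A, v_B) = shared_projection(a, b)
        keys_equal = np.isclose(k_A, k_B)
        values_differ = not np.isclose(v_A, v_B)
        equal_key_cases += int(keys_equal)
        solutions += int(keys_equal and values_differ)

print(solutions)  # 0
```

The keys only coincide on the grid rows where b = 0, and on those rows the values coincide too, which matches the algebra above.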
Now use separate projections.
Choose W_Q = [1, 0]^T.
Choose W_K = [1, 0]^T.
Choose W_V = [0, 1]^T.
Then k_A = 1.
Then k_B = 1.
Good.
The tokens match equally by tense.
Now compute values.
v_A = 0.
v_B = 2.
Good again.
The tokens still send different entity information.
Three matrices decouple routing from payload.
That is the point worth memorising.
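
The separate-projection half of the example is small enough to verify directly. The snippet uses the same vectors and the same W_K and W_V as the text; W_Q is left out because no score is computed here.

```python
import numpy as np

x_A = np.array([1.0, 0.0])
x_B = np.array([1.0, 2.0])

W_K = np.array([1.0, 0.0])  # keys read feature one (tense)
W_V = np.array([0.0, 1.0])  # values read feature two (entity)

k_A, k_B = x_A @ W_K, x_B @ W_K
v_A, v_B = x_A @ W_V, x_B @ W_V

print(k_A, k_B)  # 1.0 1.0
print(v_A, v_B)  # 0.0 2.0
```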

5) What real code and models do

Modern implementations almost always keep separate Q, K, V blocks.
Sometimes the code calls one packed linear layer.
That does not mean one shared matrix exists.
It often means three blocks are concatenated for speed.
The learned jobs remain different internally.
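
The packing trick can be sketched in NumPy. This is an illustration with made-up sizes and random weights, not the code of any specific library: three separate blocks are concatenated into one wide matrix, one matmul runs, and the result is split back into Q, K, and V.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 4  # hypothetical width
X = rng.normal(size=(3, d_model))

# Three separately learned blocks (random stand-ins here).
W_Q = rng.normal(size=(d_model, d_model))
W_K = rng.normal(size=(d_model, d_model))
W_V = rng.normal(size=(d_model, d_model))

# Pack the blocks side by side: shape (d_model, 3 * d_model).
W_packed = np.concatenate([W_Q, W_K, W_V], axis=1)

# One matmul, then split the output back into three streams.
Q, K, V = np.split(X @ W_packed, 3, axis=-1)

matches_separate = (
    np.allclose(Q, X @ W_Q)
    and np.allclose(K, X @ W_K)
    and np.allclose(V, X @ W_V)
)
print(matches_separate)  # True
```

One call in the code, three independent blocks in the weights.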
Ablations that tie all three projections usually show the same pattern.
Tying Q, K, and V tends to hurt quality.
The exact drop depends on model and task.
The direction stays stable.
That fits the theory above.
Attention needs routing freedom.
Attention also needs payload freedom.
Separate projections provide both freedoms.
The exam rule still masks future tokens.
The answer sheet still sets the score table size.
The parallel graders still split work across heads.
The memory shortcut still reuses old keys and values.
Separate projections simply feed those later mechanisms better.
So this design is not decorative.
It is foundational.

Where this lives in the wild

GPT-2 — c_attn packs separate Q, K, and V blocks inside one linear call.
BERT — BertSelfAttention uses explicit query, key, and value layers before scoring.
LLaMA — q_proj, k_proj, and v_proj stay separate, even with grouped-query variants.
T5 — relative attention still keeps routing and payload projections distinct.
Whisper — cross-attention uses separate decoder queries and encoder keys and values.

Pause and recall

Why does one shared W couple routing space and payload space?
In the worked example, why did equal keys force b = 0?
What sentence explains the jobs of queries, keys, and values?
Why does caching make separate K and V representations even more natural?

Interview Q&A

Q1. Why not use one projection matrix for Q, K, and V?
- Because routing and payload are different jobs with different useful features.
Common wrong answer to avoid: "We separate them only to add parameters."

Q2. What is the clean sentence to remember here?
- Queries and keys decide who talks to whom.
- Values decide what gets said.
Common wrong answer to avoid: "Values also create the attention scores."

Q3. How would you explain separate projections with the exam hall picture?
- One projection lens helps a token ask, another helps it be found, and another carries the answer.
Common wrong answer to avoid: "The projection lens only reduces dimensionality."

Apply now (5 min)

Quick exercise.
Invent two token vectors with one shared feature and one different feature.
Show why one shared scalar projection cannot keep equal keys and different values.
Then choose W_Q, W_K, and W_V that fix the problem.
Sketch from memory.
Draw the shared-lens picture once.
Then draw the three-lens picture once.
Write the gold sentence about routing and payload without looking.

Bridge. Good. We now have separate Q, K, and V. Next we split that width across several parallel graders. → 06-multi-head-split-merge.md