05. Why separate projections: routing and payload need freedom
~8 min read. One shared matrix looks elegant. It quietly blocks useful behaviour.
Built on the ELI5 in 00-eli5.md. The projection lens now becomes three learned jobs with different goals.
1) One matrix feels tidy
- We already have token features in X.
- So one matrix W sounds sufficient.
- That gives Q = XW, K = XW, V = XW.
- The notation looks clean.
- The implementation looks compact.
- The restriction hides inside that neatness.
- Queries search for useful earlier tokens.
- Keys advertise how tokens can be found.
- Values carry the payload that gets copied.
- Search and payload are related.
- Search and payload are not identical.
- A token may match by tense.
- The same token may send entity identity.
- A token may match by syntax.
- The same token may send semantic content.
- One shared map ties those roles together.
- That reduces expressive freedom immediately.
- The exam rule is unchanged here.
- Causal masking still blocks future peeking.
- The answer sheet is unchanged too.
- Sequence length still sets the visible table.
- The parallel graders arrive later.
- They still need flexible inputs.
- The memory shortcut arrives later too.
- It still stores different streams for K and V.
- So separation matters before heads or caching appear.
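A minimal sketch of the restriction, using made-up numbers: with one shared W, the three outputs are the same matrix, and the raw scores (before masking and softmax) come out symmetric, so token i matches token j exactly as strongly as j matches i. The helper names and every value here are hypothetical, chosen only for illustration.

```python
# Sketch: one shared projection versus three separate ones.
# All matrices below are invented for illustration.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

# Three tokens, two features each.
X = [[1.0, 0.0],
     [1.0, 2.0],
     [0.0, 1.0]]

# Shared lens: Q, K, and V all come from the same W.
W = [[1.0, 0.5],
     [0.0, 1.0]]
Q_shared = matmul(X, W)
K_shared = matmul(X, W)
V_shared = matmul(X, W)
scores_shared = matmul(Q_shared, transpose(K_shared))

# Separate lenses: each role gets its own matrix.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[1.0, 1.0], [0.0, 1.0]]
W_V = [[2.0, 0.0], [0.0, 0.5]]
Q = matmul(X, W_Q)
K = matmul(X, W_K)
V = matmul(X, W_V)
scores_separate = matmul(Q, transpose(K))

print(Q_shared == K_shared == V_shared)               # → True
print(scores_shared == transpose(scores_shared))      # → True
print(scores_separate == transpose(scores_separate))  # → False
```

The asymmetric scores in the separate case are one visible sign of the extra routing freedom.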
2) Matching space and content space should separate
- Attention scores use Q and K only.
- Values never enter score computation directly.
- That fact should stay memorised.
- Queries and keys decide who talks.
- Values decide what gets said.
- If Q, K, and V are identical, routing and payload share one bottleneck.
- Real language rarely wants that bottleneck.
- A token might advertise agreement features.
- That same token might send meaning features.
- A token might advertise position clues.
- That same token might send entity clues.
- Folder labels and folder contents differ.
- Search fields and document bodies differ.
- Attention wants the same separation.
    SHARED (one lens)          SEPARATE (three lenses)
    ┌─────────────────────┐    ┌────────────────────────┐
    │ X ──→ W ──→ Q       │    │ X ──┬──→ W_Q ──→ Q     │
    │ X ──→ W ──→ K       │    │     ├──→ W_K ──→ K     │
    │ X ──→ W ──→ V       │    │     └──→ W_V ──→ V     │
    ├─────────────────────┤    ├────────────────────────┤
    │ route = payload ✗   │    │ route ≠ payload ✓      │
    └─────────────────────┘    └────────────────────────┘
- With one lens, routing and payload entangle.
- With three lenses, they decouple cleanly.
- One matrix can highlight matchability.
- Another can highlight retrievability.
- Another can preserve message content.
- That freedom helps every later head.
- That freedom also helps cached tensors later.
- Separate projections are expressive, not ornamental.
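The claim that values never enter score computation can be checked with a small sketch. It assumes standard scaled dot-product attention with a causal mask. The function names and all numbers are hypothetical. Note that the weight function never receives V at all.

```python
import math

def causal_attention_weights(Q, K):
    """Softmax over scaled dot products, with future positions masked out."""
    d = len(Q[0])
    weights = []
    for i, q in enumerate(Q):
        # Token i may only look at positions 0..i (the causal mask).
        scores = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        m = max(scores)                       # subtract the max for stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        # Masked (future) positions get weight zero.
        weights.append([e / total for e in exps] + [0.0] * (len(K) - i - 1))
    return weights

def apply_weights(weights, V):
    """Mix value rows using the attention weights."""
    return [[sum(w * v[c] for w, v in zip(row, V)) for c in range(len(V[0]))]
            for row in weights]

Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
V_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V_b = [[5.0, -2.0], [3.0, 7.0], [0.0, 4.0]]

weights = causal_attention_weights(Q, K)   # who talks: decided by Q and K only
out_a = apply_weights(weights, V_a)        # what gets said: decided by V
out_b = apply_weights(weights, V_b)

print(weights[0])        # → [1.0, 0.0, 0.0]
print(out_a != out_b)    # → True
```

Swapping V_a for V_b changes the output but leaves the routing weights untouched.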
3) What three matrices actually buy you
- W_Q lets a token ask the right question.
- W_K lets a token publish searchable features.
- W_V lets a token keep useful payload features.
- Those jobs cooperate without becoming identical.
- This is the whole design win.
- W_Q can emphasise request features.
- W_K can emphasise index features.
- W_V can emphasise content features.
- The model can match on tense.
- The model can send entity identity.
- The model can match on local syntax.
- The model can send semantic facts.
- It can match on discourse markers.
- It can send topic information.
- Shared projection blocks those clean splits.
- Separate projection makes them learnable.
- Inside one layer, that already helps.
- Across many layers, that compounds.
- Across many parallel graders, that compounds again.
- Each head receives better raw material.
- Each head can specialise differently.
- Later, the memory shortcut caches K and V.
- Cached keys stay searchable summaries.
- Cached values stay payload summaries.
- That division only works cleanly if representations differ.
- So the separation is structural.
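As a rough sketch of the caching idea mentioned above, assume a decoder that processes one token at a time and stores each token's key and value for later steps. The matrices, the `step` helper, and the token vectors are all hypothetical. Real caches hold tensors per layer and per head.

```python
def project(x, W):
    """Row vector times matrix, with W given as a list of rows."""
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in zip(*W)]

# Hypothetical projections for illustration.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[1.0, 0.0], [0.0, 0.0]]   # keys keep feature one only (searchable summary)
W_V = [[0.0, 0.0], [0.0, 1.0]]   # values keep feature two only (payload summary)

cached_keys = []
cached_values = []

def step(x):
    """Process one new token: cache its key and value, return its query."""
    cached_keys.append(project(x, W_K))
    cached_values.append(project(x, W_V))
    return project(x, W_Q)       # the query is used for this step and not stored

for token in ([1.0, 0.0], [1.0, 2.0]):
    q = step(token)

print(cached_keys)     # → [[1.0, 0.0], [1.0, 0.0]]
print(cached_values)   # → [[0.0, 0.0], [0.0, 2.0]]
```

The two cached streams hold different information about the same tokens, which is only possible because W_K and W_V differ.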
4) One worked numerical example
- Use two token vectors.
- Let x_A = [1, 0].
- Let x_B = [1, 2].
- Feature one means shared tense.
- Feature two means entity signal.
- Both tokens share tense.
- The entity signal differs.
- We want equal keys.
- We want different values.
- Now force one shared scalar projection.
- Write W = [a, b]^T.
- Then y_A = [1, 0] · [a, b]^T.
- So y_A = a.
- And y_B = [1, 2] · [a, b]^T.
- So y_B = a + 2b.
- Shared projection means k_A = y_A.
- Shared projection means k_B = y_B.
- To match equally by tense, we need k_A = k_B.
- Therefore a = a + 2b.
- Subtract a from both sides.
- Then 0 = 2b.
- So b = 0.
- Now inspect the values.
- Shared projection also means v_A = y_A.
- Shared projection also means v_B = y_B.
- Therefore v_A = a.
- And v_B = a + 2b.
- But b already equals zero.
- So v_B = a.
- Therefore v_A = v_B.
- The entity difference vanished.
- That is the trap.
- Equal keys forced equal values.
- One matrix could not satisfy both goals.
- Now use separate projections.
- Choose W_Q = [1, 0]^T.
- Choose W_K = [1, 0]^T.
- Choose W_V = [0, 1]^T.
- Then k_A = 1.
- Then k_B = 1.
- Good.
- The tokens match equally by tense.
- Now compute values.
- v_A = 0.
- v_B = 2.
- Good again.
- The tokens still send different entity information.
- Three matrices decouple routing from payload.
- That is the point worth memorising.
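The arithmetic above can be checked with a short script. It is only a sanity check on the worked example: the grid search over small integers for a and b is an illustrative choice, not a proof, and the `dot` helper is hypothetical.

```python
def dot(x, w):
    """Dot product of two equal-length vectors."""
    return sum(xi * wi for xi, wi in zip(x, w))

x_A = [1, 0]   # shares tense, no entity signal
x_B = [1, 2]   # shares tense, entity signal present

# Shared scalar projection W = [a, b]^T: look for any (a, b) that gives
# equal keys and different values at the same time.
shared_solutions = []
for a in range(-3, 4):
    for b in range(-3, 4):
        W = [a, b]
        k_A, k_B = dot(x_A, W), dot(x_B, W)
        v_A, v_B = dot(x_A, W), dot(x_B, W)   # same W, so values equal keys
        if k_A == k_B and v_A != v_B:
            shared_solutions.append((a, b))

print(shared_solutions)   # → []

# Separate projections from the example.
W_K = [1, 0]
W_V = [0, 1]
k_A, k_B = dot(x_A, W_K), dot(x_B, W_K)
v_A, v_B = dot(x_A, W_V), dot(x_B, W_V)

print(k_A, k_B, v_A, v_B)   # → 1 1 0 2
```

The empty list reflects the trap: with one W, equal keys always mean equal values.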
5) What real code and models do
- Modern implementations almost always keep separate Q, K, and V blocks.
- Sometimes the code calls one packed linear layer.
- That does not mean one shared matrix exists.
- It often means three blocks are concatenated for speed.
- The learned jobs remain different internally.
- Ablations that tie Q, K, and V tend to point the same way.
- Tying usually hurts quality.
- The exact drop depends on model and task.
- The direction stays stable.
- That fits the theory above.
- Attention needs routing freedom.
- Attention also needs payload freedom.
- Separate projections provide both freedoms.
- The exam rule still masks future tokens.
- The answer sheet still sets the score table size.
- The parallel graders still split work across heads.
- The memory shortcut still reuses old keys and values.
- Separate projections simply feed those later mechanisms better.
- So this design is not decorative.
- It is foundational.
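The packed-layer point can be illustrated with a small sketch. It assumes the packed weight is simply the three blocks concatenated along the output axis. The shapes, names, and values are hypothetical and are not taken from any particular library.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

d = 2   # model width in this toy example

# Three separately learned blocks (made-up values).
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[1.0, 1.0], [0.0, 1.0]]
W_V = [[0.0, 2.0], [1.0, 0.0]]

# Pack: concatenate the blocks column-wise into one (d, 3d) matrix.
W_packed = [rq + rk + rv for rq, rk, rv in zip(W_Q, W_K, W_V)]

X = [[1.0, 0.0],
     [1.0, 2.0]]

# One linear call, then split the output back into three streams.
packed_out = matmul(X, W_packed)
Q = [row[0:d] for row in packed_out]
K = [row[d:2 * d] for row in packed_out]
V = [row[2 * d:3 * d] for row in packed_out]

print(Q == matmul(X, W_Q))   # → True
print(K == matmul(X, W_K))   # → True
print(V == matmul(X, W_V))   # → True
```

One call, three distinct blocks: packing changes how the work is dispatched, not what is learned.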
Where this lives in the wild
- GPT-2: c_attn packs separate Q, K, and V blocks inside one linear call.
- BERT: BertSelfAttention uses explicit query, key, and value layers before scoring.
- LLaMA: q_proj, k_proj, and v_proj stay separate, even with grouped-query variants.
- T5: relative attention still keeps routing and payload projections distinct.
- Whisper: cross-attention uses separate decoder queries and encoder keys and values.
Pause and recall
- Why does one shared W couple routing space and payload space?
- In the worked example, why did equal keys force b = 0?
- What sentence explains the jobs of queries, keys, and values?
- Why does caching make separate K and V representations even more natural?
Interview Q&A
Q1. Why not use one projection matrix for Q, K, and V?
- Because routing and payload are different jobs with different useful features.
Common wrong answer to avoid: "We separate them only to add parameters."
Q2. What is the clean sentence to remember here?
- Queries and keys decide who talks to whom.
- Values decide what gets said.
Common wrong answer to avoid: "Values also create the attention scores."
Q3. How would you explain separate projections with the exam hall picture?
- One projection lens helps a token ask, another helps it be found, and another carries the answer.
Common wrong answer to avoid: "The projection lens only reduces dimensionality."
Apply now (5 min)
- Quick exercise.
- Invent two token vectors with one shared feature and one different feature.
- Show why one shared scalar projection cannot keep equal keys and different values.
- Then choose W_Q, W_K, and W_V that fix the problem.
- Sketch from memory.
- Draw the shared-lens picture once.
- Then draw the three-lens picture once.
- Write the gold sentence about routing and payload without looking.
Bridge. Good. We now have separate Q, K, and V. Next we split that width across several parallel graders. → 06-multi-head-split-merge.md