05. Why separate projections — routing and payload need freedom

~8 min read. One shared matrix looks elegant. It quietly blocks useful behaviour.

Built on the ELI5 in 00-eli5.md. The projection lens now becomes three learned jobs with different goals.


1) One matrix feels tidy

  • We already have token features in X.
  • So one matrix W sounds sufficient.
  • That gives Q = XW, K = XW, V = XW.
  • The notation looks clean.
  • The implementation looks compact.
  • The restriction hides inside that neatness.
  • Queries search for useful earlier tokens.
  • Keys advertise how tokens can be found.
  • Values carry the payload that gets copied.
  • Search and payload are related.
  • Search and payload are not identical.
  • A token may match by tense.
  • The same token may send entity identity.
  • A token may match by syntax.
  • The same token may send semantic content.
  • One shared map ties those roles together.
  • That reduces expressive freedom immediately.
  • The exam rule is unchanged here.
  • Causal masking still blocks future peeking.
  • The answer sheet is unchanged too.
  • Sequence length still sets the visible table.
  • The parallel graders arrive later.
  • They still need flexible inputs.
  • The memory shortcut arrives later too.
  • It still stores different streams for K and V.
  • So separation matters before heads or caching appear.
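
One consequence of the shared map is easy to check by hand or in code. If Q = K = V = XW, the raw score table Q·Kᵀ is symmetric, so token i scores token j exactly as j scores i. The sketch below uses made-up toy numbers for X and W (assumptions for illustration, not values from any real model).

```python
# Toy sketch: one shared projection W for Q, K, and V.
# X and W are hypothetical numbers chosen only for illustration.

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    """Swap rows and columns."""
    return [list(col) for col in zip(*A)]

X = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 1.0],
     [3.0, 1.0, 0.0]]   # 3 tokens, 3 features each

W = [[1.0, 2.0],
     [0.0, 1.0],
     [1.0, 0.0]]        # the single shared map, 3 features -> 2

Q = matmul(X, W)
K = matmul(X, W)
V = matmul(X, W)

# Raw scores, before scaling, masking, and softmax.
scores = matmul(Q, transpose(K))

n = len(X)
is_symmetric = all(scores[i][j] == scores[j][i] for i in range(n) for j in range(n))

print(Q == K == V)    # routing and payload are literally the same numbers
print(is_symmetric)   # score(i, j) equals score(j, i) for every pair
```

Causal masking later hides half of that table, but the visible half is still built from one set of features that must serve asking, advertising, and carrying at once.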

2) Matching space and content space should separate

  • Attention scores use Q and K only.
  • Values never enter score computation directly.
  • That fact should stay memorised.
  • Queries and keys decide who talks.
  • Values decide what gets said.
  • If Q, K, and V all come from one shared map, routing and payload share one bottleneck.
  • Real language rarely wants that bottleneck.
  • A token might advertise agreement features.
  • That same token might send meaning features.
  • A token might advertise position clues.
  • That same token might send entity clues.
  • Folder labels and folder contents differ.
  • Search fields and document bodies differ.
  • Attention wants the same separation.
    SHARED (one lens)                  SEPARATE (three lenses)
    ┌─────────────────────┐            ┌────────────────────────┐
    │ X ──→ W ──→ Q       │            │ X ──┬──→ W_Q ──→ Q     │
    │ X ──→ W ──→ K       │            │     ├──→ W_K ──→ K     │
    │ X ──→ W ──→ V       │            │     └──→ W_V ──→ V     │
    ├─────────────────────┤            ├────────────────────────┤
    │ route = payload ✗   │            │ route ≠ payload ✓      │
    └─────────────────────┘            └────────────────────────┘
    
  • With one lens, routing and payload entangle.
  • With three lenses, they decouple cleanly.
  • One matrix can highlight matchability.
  • Another can highlight retrievability.
  • Another can preserve message content.
  • That freedom helps every later head.
  • That freedom also helps cached tensors later.
  • Separate projections are expressive, not ornamental.
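
The claim that values never enter the score computation can be checked directly. The sketch below is a minimal single-head causal attention in plain Python, assuming toy inputs and a hypothetical helper named `causal_attention`. It swaps W_V while keeping W_Q and W_K fixed: the attention weights stay the same, only the output changes.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(X, W_Q, W_K, W_V):
    """Single-head attention sketch: returns (weights, output)."""
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
    d_k = len(K[0])
    weights = []
    for i, q in enumerate(Q):
        # Token i may only look at positions 0..i (no future peeking).
        visible = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d_k)
                   for j in range(i + 1)]
        # Masked (future) positions get weight exactly 0.
        weights.append(softmax(visible) + [0.0] * (len(X) - i - 1))
    return weights, matmul(weights, V)

# Hypothetical toy inputs: 3 tokens, 2 features each.
X = [[1.0, 0.0], [1.0, 2.0], [0.0, 1.0]]
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[1.0, 0.0], [0.0, 1.0]]

W_V_a = [[0.0], [1.0]]   # payload = feature two
W_V_b = [[1.0], [0.0]]   # payload = feature one

weights_a, out_a = causal_attention(X, W_Q, W_K, W_V_a)
weights_b, out_b = causal_attention(X, W_Q, W_K, W_V_b)

print(weights_a == weights_b)   # who talks: unchanged by W_V
print(out_a != out_b)           # what gets said: changed by W_V
```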

3) What three matrices actually buy you

  • W_Q lets a token ask the right question.
  • W_K lets a token publish searchable features.
  • W_V lets a token keep useful payload features.
  • Those jobs cooperate without becoming identical.
  • This is the whole design win.
  • W_Q can emphasise request features.
  • W_K can emphasise index features.
  • W_V can emphasise content features.
  • The model can match on tense.
  • The model can send entity identity.
  • The model can match on local syntax.
  • The model can send semantic facts.
  • It can match on discourse markers.
  • It can send topic information.
  • Shared projection blocks those clean splits.
  • Separate projection makes them learnable.
  • Inside one layer, that already helps.
  • Across many layers, that compounds.
  • Across many parallel graders, that compounds again.
  • Each head receives better raw material.
  • Each head can specialise differently.
  • Later, the memory shortcut caches K and V.
  • Cached keys stay searchable summaries.
  • Cached values stay payload summaries.
  • That division only works cleanly if representations differ.
  • So the separation is structural.

4) One worked numerical example

  • Use two token vectors.
  • Let x_A = [1, 0].
  • Let x_B = [1, 2].
  • Feature one means shared tense.
  • Feature two means entity signal.
  • Both tokens share tense.
  • The entity signal differs.
  • We want equal keys.
  • We want different values.
  • Now force one shared scalar projection.
  • Write W = [a, b]^T.
  • Then y_A = [1, 0] · [a, b]^T.
  • So y_A = a.
  • And y_B = [1, 2] · [a, b]^T.
  • So y_B = a + 2b.
  • Shared projection means k_A = y_A.
  • Shared projection means k_B = y_B.
  • To match equally by tense, we need k_A = k_B.
  • Therefore a = a + 2b.
  • Subtract a from both sides.
  • Then 0 = 2b.
  • So b = 0.
  • Now inspect the values.
  • Shared projection also means v_A = y_A.
  • Shared projection also means v_B = y_B.
  • Therefore v_A = a.
  • And v_B = a + 2b.
  • But b already equals zero.
  • So v_B = a.
  • Therefore v_A = v_B.
  • The entity difference vanished.
  • That is the trap.
  • Equal keys forced equal values.
  • One matrix could not satisfy both goals.
  • Now use separate projections.
  • Choose W_Q = [1, 0]^T.
  • Choose W_K = [1, 0]^T.
  • Choose W_V = [0, 1]^T.
  • Then k_A = 1.
  • Then k_B = 1.
  • Good.
  • The tokens match equally by tense.
  • Now compute values.
  • v_A = 0.
  • v_B = 2.
  • Good again.
  • The tokens still send different entity information.
  • Three matrices decouple routing from payload.
  • That is the point worth memorising.
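
The arithmetic above can be replayed in a few lines. This is a small check of the same example, with a hypothetical helper `project` standing in for the scalar projection. The sample (a, b) pairs are arbitrary choices for illustration.

```python
def project(x, w):
    """Scalar projection: dot product of a token vector with a weight vector."""
    return sum(xi * wi for xi, wi in zip(x, w))

x_A = [1, 0]   # feature one: tense, feature two: entity signal
x_B = [1, 2]

# Shared projection W = [a, b]^T.
# Equal keys need a == a + 2b, which forces b == 0 ...
for a, b in [(1, 0), (3, 0), (-2, 0)]:
    W = [a, b]
    k_A, k_B = project(x_A, W), project(x_B, W)
    v_A, v_B = k_A, k_B          # shared map: values are the same numbers as keys
    assert k_A == k_B            # keys match
    assert v_A == v_B            # ... and the entity difference is gone

# ... while any b != 0 breaks the key match instead.
assert project(x_A, [1, 1]) != project(x_B, [1, 1])

# Separate projections from the text.
W_K = [1, 0]
W_V = [0, 1]
k_A, k_B = project(x_A, W_K), project(x_B, W_K)
v_A, v_B = project(x_A, W_V), project(x_B, W_V)

print(k_A, k_B)   # 1 1  -> equal keys
print(v_A, v_B)   # 0 2  -> different values
```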

5) What real code and models do

  • Modern implementations almost always keep separate Q, K, V blocks.
  • Sometimes the code calls one packed linear layer.
  • That does not mean one shared matrix exists.
  • It often means three blocks are concatenated for speed.
  • The learned jobs remain different internally.
  • Ablations that tie the projections usually point the same way.
  • Tying Q, K, and V tends to hurt quality.
  • The exact drop depends on model and task.
  • The direction is usually stable.
  • That fits the theory above.
  • Attention needs routing freedom.
  • Attention also needs payload freedom.
  • Separate projections provide both freedoms.
  • The exam rule still masks future tokens.
  • The answer sheet still sets the score table size.
  • The parallel graders still split work across heads.
  • The memory shortcut still reuses old keys and values.
  • Separate projections simply feed those later mechanisms better.
  • So this design is not decorative.
  • It is foundational.
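
The packed-layer point can be made concrete. The sketch below concatenates three hypothetical blocks W_Q, W_K, and W_V into one wide matrix, applies it once, and slices the result. It illustrates the idea only and is not the code of any particular library. All weights are toy numbers.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Three separate, different blocks (hypothetical toy weights, 2 -> 2 each).
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[2.0, 0.0], [1.0, 1.0]]
W_V = [[0.0, 3.0], [1.0, 0.0]]

# Pack: concatenate the blocks along the output (column) axis -> shape 2 x 6.
W_packed = [q + k + v for q, k, v in zip(W_Q, W_K, W_V)]

X = [[1.0, 0.0], [1.0, 2.0]]    # 2 tokens, 2 features each

# One matrix multiply instead of three.
packed = matmul(X, W_packed)

# Slice the output back into its three blocks.
d = 2
Q = [row[0:d] for row in packed]
K = [row[d:2 * d] for row in packed]
V = [row[2 * d:3 * d] for row in packed]

# Same results as three separate projections: packing is a speed trick,
# not weight sharing.
print(Q == matmul(X, W_Q), K == matmul(X, W_K), V == matmul(X, W_V))
```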

Where this lives in the wild

  • GPT-2: c_attn packs separate Q, K, and V blocks inside one linear call.
  • BERT: BertSelfAttention uses explicit query, key, and value layers before scoring.
  • LLaMA: q_proj, k_proj, and v_proj stay separate, even with grouped-query variants.
  • T5: relative attention still keeps routing and payload projections distinct.
  • Whisper: cross-attention uses separate decoder queries and encoder keys and values.

Pause and recall

  1. Why does one shared W couple routing space and payload space?
  2. In the worked example, why did equal keys force b = 0?
  3. What sentence explains the jobs of queries, keys, and values?
  4. Why does caching make separate K and V representations even more natural?

Interview Q&A

Q1. Why not use one projection matrix for Q, K, and V?

  • Because routing and payload are different jobs with different useful features.
  • Common wrong answer to avoid: "We separate them only to add parameters."

Q2. What is the clean sentence to remember here?

  • Queries and keys decide who talks to whom.
  • Values decide what gets said.
  • Common wrong answer to avoid: "Values also create the attention scores."

Q3. How would you explain separate projections with the exam hall picture?

  • One projection lens helps a token ask, another helps it be found, and another carries the answer.
  • Common wrong answer to avoid: "The projection lens only reduces dimensionality."


Apply now (5 min)

  • Quick exercise.
  • Invent two token vectors with one shared feature and one different feature.
  • Show why one shared scalar projection cannot keep equal keys and different values.
  • Then choose W_Q, W_K, and W_V that fix the problem.
  • Sketch from memory.
  • Draw the shared-lens picture once.
  • Then draw the three-lens picture once.
  • Write the gold sentence about routing and payload without looking.

Bridge. Good. We now have separate Q, K, and V. Next we split that width across several parallel graders. → 06-multi-head-split-merge.md