Q, K, V projections: one token, three jobs

~10 min read. Same input. Three different questions. That separation is the engine of attention.

Built on the ELI5 in 00-eli5.md. The projection lens — one token viewed as query, key, and value — explains how attention decides who talks, who listens, and what information moves forward.

1) One token enters once and leaves in three roles

Picture the exam hall again. A student answer can play three roles.

It can ask for help. It can advertise what it contains. It can carry useful content onward.

That is why we use a projection lens. Same token. Three jobs.

Look. Queries ask, “What am I searching for?” Keys say, “What pattern do I match well?” Values say, “If chosen, what content should I send?”

So the core sentence is this. Keys and queries decide who talks to whom. Values decide what gets said. Simple, no?

The exam rule still governs visibility after projection. The answer sheet still limits how many tokens can be compared. The parallel graders will later do this in several heads. The memory shortcut will later store past keys and values.

But first we need the three projected views.

2) Linear projection is just matrix multiplication

Now what is the math? Suppose input embeddings are stored in X. Shape first. Always shape first.

X      : [B, T, D]
W_Q    : [D, D]
W_K    : [D, D]
W_V    : [D, D]
Q, K, V: [B, T, D]

Then the formulas are simply:

Q = X @ W_Q
K = X @ W_K
V = X @ W_V

See. No mystery layer is required to understand this. It is three learned linear maps. A picture helps.

           X [B,T,D]
               │
       ┌───────┼───────┐
       │       │       │
       ▼       ▼       ▼
     W_Q     W_K     W_V
       │       │       │
       ▼       ▼       ▼
  Q [B,T,D] K [B,T,D] V [B,T,D]

One source tensor branches into three paths. That is the projection lens idea in one diagram.
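A quick shape check makes the diagram concrete. The sketch below uses NumPy with made-up sizes (B=2, T=5, D=8) and random weights. Those numbers are assumptions for illustration only, not values from this lesson.

```python
import numpy as np

B, T, D = 2, 5, 8                      # illustrative sizes (assumed)
rng = np.random.default_rng(0)

X = rng.normal(size=(B, T, D))         # input embeddings
W_Q = rng.normal(size=(D, D))          # three separate learned maps
W_K = rng.normal(size=(D, D))
W_V = rng.normal(size=(D, D))

# @ applies the same [D, D] weight to every batch item and every token.
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

print(Q.shape, K.shape, V.shape)       # (2, 5, 8) (2, 5, 8) (2, 5, 8)
```

One input tensor, three outputs of the same shape. Only the weights differ.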

If we later split into heads, each head sees a slice of the D dimension. Then multiple parallel graders can focus on different patterns.

The exam rule applies only later, once scores are formed. Yes?

3) One-token numerical walkthrough by hand

Let one token embedding be:

x = [2, 1, 0, 3]

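Use small projection matrices. Here each one has shape [4, 2], so the token shrinks from 4 numbers to 2. The [D, D] case from the previous section works the same way.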

W_Q =
[[1, 0],
 [0, 1],
 [1, 1],
 [0, 1]]

W_K =
[[ 1,  1],
 [ 1,  0],
 [ 0,  1],
 [ 1, -1]]

W_V =
[[0, 1],
 [1, 0],
 [1, 1],
 [1, 0]]

Now compute the query. Take dot products with each column of W_Q.

q_1 = 2*1 + 1*0 + 0*1 + 3*0 = 2
q_2 = 2*0 + 1*1 + 0*1 + 3*1 = 4
q   = [2, 4]

Now compute the key.

k_1 = 2*1 + 1*1 + 0*0 + 3*1  = 6
k_2 = 2*1 + 1*0 + 0*1 + 3*(-1) = -1
k   = [6, -1]

Now compute the value.

v_1 = 2*0 + 1*1 + 0*1 + 3*1 = 4
v_2 = 2*1 + 1*0 + 0*1 + 3*0 = 2
v   = [4, 2]

Good. The same token became three different vectors. Not because the token changed. Because the job changed.

That is the whole point of the projection lens.
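The hand arithmetic can be checked in a few lines. This is only a verification sketch in NumPy, using the exact x, W_Q, W_K, and W_V from the walkthrough.

```python
import numpy as np

x = np.array([2, 1, 0, 3])

W_Q = np.array([[1, 0], [0, 1], [1, 1], [0, 1]])
W_K = np.array([[1, 1], [1, 0], [0, 1], [1, -1]])
W_V = np.array([[0, 1], [1, 0], [1, 1], [1, 0]])

# x @ W takes the dot product of x with each column of W.
q = x @ W_Q
k = x @ W_K
v = x @ W_V

print(q.tolist(), k.tolist(), v.tolist())  # [2, 4] [6, -1] [4, 2]
```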

Inside one answer sheet, each token gets these three roles.

Later, the memory shortcut will cache K and V, not raw embeddings.

That detail matters in inference code.
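Here is a minimal sketch of that idea, not a production cache. The names k_cache, v_cache, and decode_step are hypothetical, and the second token is an arbitrary extra embedding chosen for illustration. Each step projects only the newest token and appends its key and value.

```python
import numpy as np

# Projection matrices from the walkthrough above.
W_Q = np.array([[1, 0], [0, 1], [1, 1], [0, 1]])
W_K = np.array([[1, 1], [1, 0], [0, 1], [1, -1]])
W_V = np.array([[0, 1], [1, 0], [1, 1], [1, 0]])

# Hypothetical cache: one projected key and value per token seen so far.
k_cache, v_cache = [], []

def decode_step(x_new):
    """Project only the newest token; reuse cached K and V for older ones."""
    q = x_new @ W_Q                  # the query is used once, never cached
    k_cache.append(x_new @ W_K)
    v_cache.append(x_new @ W_V)
    return q, np.stack(k_cache), np.stack(v_cache)

tokens = np.array([[2, 1, 0, 3],
                   [1, 0, 2, 1]])    # second row: an arbitrary extra token
for x_new in tokens:
    q, K, V = decode_step(x_new)

print(K.tolist())  # [[6, -1], [2, 2]]
print(V.tolist())  # [[4, 2], [3, 3]]
```

The stacked cache equals what we would get by projecting every token again. That is why this layer can keep K and V and skip the repeated work.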

4) Two tokens show how attention really uses Q, K, and V

Now add one more token. Suppose token one produced:

q1 = [2, 4]
k1 = [6, -1]
v1 = [4, 2]

Suppose token two uses embedding x2 = [1, 0, 2, 1].

Let us compute its projections with the same matrices.

q2 = [3, 3]
k2 = [2, 2]
v2 = [3, 3]

Now imagine query two compares itself with both keys. Raw compatibility scores are dot products.

score(q2, k1) = 3*6 + 3*(-1) = 15
score(q2, k2) = 3*2 + 3*2    = 12

So query two scores key one higher, 15 against 12.

After softmax, that means token two will pull more from value one. See how the roles separate.

q2 asks. k1 and k2 advertise. v1 and v2 supply actual content.

This is why we do not reuse one matrix for everything. If Q, K, and V were identical, the model would lose flexibility, because matching and content would be forced into the same space. The parallel graders would also become less expressive. And the exam rule would still only control visibility, not representation quality.

A tiny picture helps.

query from token 2:   q2 ---------
                                  \ compare with k1 -> score 15
                                   \ compare with k2 -> score 12
                                    \ softmax -> weights
                                     \ mix v1 and v2 -> context for token 2

Look. Keys and queries decide who talks to whom. Values decide what gets said.

Keep that sentence ready for interviews.
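The last two arrows in the picture, softmax and mixing, can be sketched in NumPy. The code uses the raw dot-product scores from this section. The scaled variant at the end, which divides by sqrt(d_k), is an assumption added for comparison and is not part of the hand computation above.

```python
import numpy as np

q2 = np.array([3.0, 3.0])
K = np.array([[6.0, -1.0],   # k1
              [2.0, 2.0]])   # k2
V = np.array([[4.0, 2.0],    # v1
              [3.0, 3.0]])   # v2

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

scores = K @ q2              # [15., 12.], matching the hand computation
weights = softmax(scores)    # how much token 2 listens to each token
context = weights @ V        # weighted mix of v1 and v2

# Assumption: the common scaled variant divides by sqrt(d_k), here sqrt(2).
scaled_weights = softmax(scores / np.sqrt(q2.size))

print(np.round(weights, 3))         # [0.953 0.047]
print(np.round(context, 3))         # [3.953 2.047]
print(np.round(scaled_weights, 3))  # [0.893 0.107]
```

Notice that a raw gap of 3 becomes a large gap after softmax. The context for token two sits very close to v1.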

5) The clean coding pattern and the common bugs

Now we turn the idea into code. A small helper keeps things readable.

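import numpy as np

def project_qkv(X, W_Q, W_K, W_V):
    """Project X [B, T, D] into queries, keys, and values with three separate weights."""
    Q = X @ W_Q  # what each token is searching for
    K = X @ W_K  # what each token advertises
    V = X @ W_V  # what each token sends if chosen
    return Q, K, V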

If X has shape [B, T, D] and each weight has shape [D, D], the outputs keep [B, T, D]. Then later we reshape for heads.

So what to do carefully? Project first. Reshape for heads after that. Do not mix those steps casually.

Now the common bugs.

1. Mixing D and d_head in shapes.
2. Forgetting that attention needs K^T during score computation.
3. Reusing one shared weight matrix for Q, K, and V by mistake.
4. Reshaping into heads before projection when the code expects full D first.
5. Assuming values also decide matching strength.
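The safe order can be sketched like this. The sizes and the helper name split_heads are assumptions for illustration. The sketch projects at full width D, then splits into heads, then forms scores with K transposed on its last two axes.

```python
import numpy as np

B, T, D, H = 2, 5, 8, 2               # illustrative sizes (assumed)
d_head = D // H

rng = np.random.default_rng(0)
X = rng.normal(size=(B, T, D))
W_Q, W_K, W_V = (rng.normal(size=(D, D)) for _ in range(3))

def split_heads(A, H):
    """[B, T, D] -> [B, H, T, d_head]; head h gets columns h*d_head onward."""
    B, T, D = A.shape
    return A.reshape(B, T, H, D // H).transpose(0, 2, 1, 3)

# 1) Project first, at full model width D.
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# 2) Only then reshape for heads.
Qh, Kh, Vh = (split_heads(A, H) for A in (Q, K, V))

# 3) Scores need K^T: swap the last two axes, not the batch or head axes.
scores = Qh @ Kh.swapaxes(-1, -2)

print(Qh.shape)      # (2, 2, 5, 4)
print(scores.shape)  # (2, 2, 5, 5)
```

Keeping D and d_head as separate named variables makes bug 1 easier to spot. The score shape [B, H, T, T] is a handy check for bug 2.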

Let us sharpen two of them.

Bug one. Reusing one matrix. That removes the whole point of role separation.

Bug two. Reshaping before projection without matching weight shapes. Then dimensions silently drift. And later scores become nonsense.
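Bug one can be made visible with a small experiment. This is a sketch with random weights and assumed sizes, not a claim about any particular model. When the same matrix serves queries and keys, the raw score matrix is symmetric, so token i scores token j exactly as token j scores token i. Separate matrices remove that constraint.

```python
import numpy as np

T, D = 4, 6                            # illustrative sizes (assumed)
rng = np.random.default_rng(0)
X = rng.normal(size=(T, D))

W_shared = rng.normal(size=(D, D))     # the bug: one matrix for both roles
W_Q = rng.normal(size=(D, D))          # the fix: separate learned maps
W_K = rng.normal(size=(D, D))

shared_scores = (X @ W_shared) @ (X @ W_shared).T
separate_scores = (X @ W_Q) @ (X @ W_K).T

print(np.allclose(shared_scores, shared_scores.T))      # True
print(np.allclose(separate_scores, separate_scores.T))  # False
```

With a shared matrix, "who asks" and "who advertises" can no longer differ. That is the lost flexibility in numbers.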

See how all five anchors fit together. The projection lens creates specialized roles.

The exam rule later restricts who may look where.

The answer sheet sets the available token span. The parallel graders repeat the process across heads.

The memory shortcut stores past K and V for faster decoding.

That is the full decoder picture.

Where this lives in the wild

GPT-2's c_attn combined QKV projection — one layer computes all three projections before attention scoring.
LLaMA's separate q_proj, k_proj, v_proj layers — explicit modules keep each role visibly separate.
BERT's self-attention query/key/value projections — encoder attention uses the same role split, without causal masking.
Vision Transformer (ViT) patch attention projections — image patch tokens still branch into query, key, and value views.
Whisper encoder self-attention at OpenAI — speech tokens also use learned Q/K/V projections before attention mixing.

Pause and recall

Why does one token need three projected views instead of one raw embedding?
What are the shapes of X, W_Q, W_K, W_V, and Q?
In one sentence, what do queries, keys, and values each do?
Why does caching usually store K and V rather than raw X?

Interview Q&A

Q1. What is the simplest explanation of Q, K, and V?
A1. Queries ask, keys advertise, and values carry the content that gets mixed.
Common wrong answer to avoid: “They are three copies of the same tensor for convenience.”

Q2. Why not use one shared projection matrix for all three?
A2. Because matching and content transport are different jobs that benefit from different learned spaces.
Common wrong answer to avoid: “Separate matrices only reduce memory use.”

Q3. What sentence should you remember for interviews?
A3. Keys and queries decide who talks to whom, and values decide what gets said.
Common wrong answer to avoid: “Values decide both matching and transport.”

Q4. What is a common implementation-order mistake?
A4. Reshaping into heads before projection when the weight matrices assume full model width.
Common wrong answer to avoid: “Projection order never matters if dimensions multiply.”

Apply now (5 min)

Quick exercise. Take a new vector x = [1, 2, 1, 0].

Invent tiny W_Q, W_K, and W_V matrices. Compute q, k, and v by hand.

Then sketch from memory the three-branch ASCII diagram. Also say aloud where the projection lens ends and attention scoring begins.

Bridge. Good. We can now compute three roles from one token. Next we ask the deeper design question: why are separate projections better than one shared representation? → 05-why-separate-projections.md