08. Scaled dot-product attention: the scorecard math

The spotlight beam gave us the intuition. Now we make the scorecard numeric and precise.

Builds on the ELI5 in 00-eli5.md. The scorecard (the relevance weights behind the spotlight beam) now becomes a concrete formula with numbers.

The picture before the math

First the plain-language roles.

- Q means: what am I looking for?
- K means: what do I advertise about myself?
- V means: what information do I contribute if chosen?

So one token asks a question with Q. Every token presents a label with K. Every token also carries payload with V. Attention compares the query against all keys. That gives a scorecard. Then the values are mixed using that scorecard. See the shape.

query token
|
v
compare with all keys --> scorecard --> weighted mix of values --> output

The earlier file called this soft lookup. Now we write the exact math.

The formula

Attention(Q, K, V) = softmax(QK^T / √d_k) V

Break it calmly.

- QK^T gives raw similarity scores between queries and keys.
- / √d_k scales those scores down.
- softmax(...) turns scores into positive weights summing to 1.
- multiplying by V gives the weighted sum of value vectors.

That is all. No magic. Just compare, scale, normalize, mix.
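The four moves (compare, scale, normalize, mix) can be sketched in plain Python. This is a minimal illustration, not how any real library implements it: the function names `softmax` and `attention` are chosen here for the sketch, inputs are plain lists of row vectors, and there is no masking, batching, or multi-head logic.

```python
import math

def softmax(scores):
    """Turn a list of real scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max avoids overflow and leaves the result unchanged
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for lists of row vectors (illustrative sketch)."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # compare: one raw score per key
        raw = [sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        # scale: divide by sqrt(d_k)
        scaled = [s / math.sqrt(d_k) for s in raw]
        # normalize: scores become weights
        weights = softmax(scaled)
        # mix: weighted sum of the value vectors, one output component at a time
        mixed = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(mixed)
    return outputs

# Two identical keys tie, so the output is the plain average of the two values.
tie = attention([[1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]], [[2.0, 0.0], [4.0, 0.0]])
print(tie)
```

The tie case is a handy sanity check: equal scores give equal weights, so the two values are simply averaged.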

Why each letter has a separate job

A common confusion is this: "Why not use one vector for everything?" Because matching and contributing are different jobs. A token may be good at being found for one reason. It may contribute a different kind of payload afterward. So:

- K says how I should be matched
- V says what I should contribute
- Q says what pattern I want

This separation makes attention flexible.

Worked example setup

Use one query vector:

q = [2, 1]

Use three keys:

k1 = [2, 1]
k2 = [1, 0]
k3 = [0, -1]

Use three values:

v1 = [10, 0]
v2 = [0, 8]
v3 = [1, 1]

We will compute the whole thing end to end. Nice and clean.

Step 1: raw dot products

Compute q · k_i.

With key 1

[2,1] · [2,1] = 2×2 + 1×1 = 5

With key 2

[2,1] · [1,0] = 2×1 + 1×0 = 2

With key 3

[2,1] · [0,-1] = 2×0 + 1×(-1) = -1

So the raw scores are:

[5, 2, -1]

Key 1 matches best. Key 3 scores negative: it points partly away from the query.
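The same three dot products, as a short Python check. The helper name `dot` is just a label chosen for this sketch.

```python
def dot(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

q = [2, 1]
keys = [[2, 1], [1, 0], [0, -1]]

raw_scores = [dot(q, k) for k in keys]
print(raw_scores)  # [5, 2, -1]
```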

Step 2: scale by `√d_k`

Here d_k = 2 because each key has 2 components. So:

√d_k = √2 ≈ 1.414

Now divide each raw score by 1.414.

5   / 1.414 ≈  3.54
2   / 1.414 ≈  1.41
-1  / 1.414 ≈ -0.71

Scaled scores:

[3.54, 1.41, -0.71]

Why scale? Hold that thought. We will see the reason after the worked example.
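The division is one line of Python. A small sketch, with the raw scores from Step 1 typed in directly:

```python
import math

raw_scores = [5, 2, -1]
d_k = 2  # each key has 2 components

scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]
print([round(s, 2) for s in scaled_scores])  # [3.54, 1.41, -0.71]
```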

Step 3: softmax the scaled scores

Softmax turns arbitrary scores into weights. It exponentiates each score, then divides by the sum of those exponentials. Approximate result:

softmax([3.54, 1.41, -0.71]) ≈ [0.88, 0.11, 0.01]

Now the scorecard is easy to read.

- key 1 gets 88%
- key 2 gets 11%
- key 3 gets 1%

All weights are positive. All weights sum to 1. That is why the output becomes a weighted average of values.
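A minimal softmax sketch that reproduces the scorecard. The function name `softmax` is chosen for this example, and subtracting the maximum is a common stability trick assumed here, not something the formula requires. The code uses the unrounded scaled scores. Feeding in the two-decimal values instead nudges the second weight down to about 0.105, which rounds to 0.10.

```python
import math

def softmax(scores):
    """Exponentiate each score, then divide by the sum of the exponentials."""
    peak = max(scores)  # shifting by the max does not change the result
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scaled_scores = [s / math.sqrt(2) for s in [5, 2, -1]]
weights = softmax(scaled_scores)
print([round(w, 2) for w in weights])  # [0.88, 0.11, 0.01]
```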

Step 4: weighted sum of values

Now mix the values with the scorecard.

output = 0.88·v1 + 0.11·v2 + 0.01·v3

Substitute the actual values.

output = 0.88·[10,0] + 0.11·[0,8] + 0.01·[1,1]

Compute each part.

0.88·[10,0] = [8.80, 0.00]
0.11·[0,8]  = [0.00, 0.88]
0.01·[1,1]  = [0.01, 0.01]

Add them.

output = [8.80,0.00] + [0.00,0.88] + [0.01,0.01]
= [8.81, 0.89]

Done. The final attended output is:

[8.81, 0.89]

See the meaning. Most of the output came from v1. A little came from v2. Almost nothing came from v3. That is soft lookup in full numeric form.
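The mixing step as a short Python sketch. The helper name `mix` is made up for this example. It uses the rounded scorecard [0.88, 0.11, 0.01] so the numbers match the hand calculation.

```python
def mix(weights, values):
    """Weighted sum of value vectors, computed one output component at a time."""
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]

values = [[10, 0], [0, 8], [1, 1]]
rounded_output = mix([0.88, 0.11, 0.01], values)
print([round(x, 2) for x in rounded_output])  # [8.81, 0.89]
```

Carrying the unrounded softmax weights through instead gives roughly [8.83, 0.86]. The small gap comes only from rounding the scorecard to two decimals.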

ASCII diagram of the whole flow

q = [2,1]
|
+--> dot with k1=[2,1]  -> 5
+--> dot with k2=[1,0]  -> 2
+--> dot with k3=[0,-1] -> -1
raw scores      : [5, 2, -1]
scaled scores   : [3.54, 1.41, -0.71]
scorecard       : [0.88, 0.11, 0.01]
values          : [10,0], [0,8], [1,1]
weighted output : [8.81, 0.89]

Same story. Now fully visible.

Why `√d_k` matters

Dot products grow with dimension. A wider key adds more terms to the sum, so raw scores tend to get larger in magnitude. Large scores make softmax very peaky. Peaky softmax means one item gets almost all the mass. Then the gradients for the other items shrink toward zero. Training becomes unstable or slow. So we divide by √d_k to keep the score scale roughly independent of the dimension. That keeps softmax in a useful range.
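A quick simulation makes the growth visible. This is a sketch under a stated assumption: query and key components are drawn independently from a standard normal distribution, which real trained vectors only approximate. The function name `dot_product_std`, the trial count, and the seed are arbitrary choices for this example.

```python
import math
import random

def dot_product_std(d, trials=2000, seed=0):
    """Empirical standard deviation of q.k for random d-dimensional q and k."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    samples = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d)]
        samples.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(samples) / trials
    variance = sum((s - mean) ** 2 for s in samples) / trials
    return math.sqrt(variance)

for d in (4, 16, 64):
    print(d, round(dot_product_std(d), 2), "vs sqrt(d) =", math.sqrt(d))
```

Under that assumption the measured spread should land close to √d for each dimension, which is exactly the factor the formula divides out.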

Worked comparison: without scaling vs with scaling

Suppose d_k = 64. And raw scores are:

[15, 12, 9]

Without scaling

Softmax of [15, 12, 9] is approximately:

[0.95, 0.05, 0.00]

Very sharp. The first key almost fully wins. The third key is basically ignored.

With scaling

Now divide by √64 = 8.

[15, 12, 9] / 8 = [1.875, 1.5, 1.125]

Softmax is approximately:

[0.46, 0.32, 0.22]

Much healthier. Still prefers the first key. But the others remain alive. Gradients can still flow. That is the whole point of scaling.
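Both cases side by side, as a small Python sketch. The `softmax` and `ranking` helpers are written out here only to keep the example self-contained.

```python
import math

def softmax(scores):
    """Exponentiate each score, then divide by the sum of the exponentials."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def ranking(weights):
    """Indices sorted from largest weight to smallest."""
    return sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)

raw = [15, 12, 9]
d_k = 64

unscaled = softmax(raw)
scaled = softmax([s / math.sqrt(d_k) for s in raw])

print([round(w, 2) for w in unscaled])  # [0.95, 0.05, 0.0]
print([round(w, 2) for w in scaled])    # [0.46, 0.32, 0.22]
print(ranking(unscaled) == ranking(scaled))  # True
```

The last line checks that the order of the keys is the same in both cases. Only the sharpness differs.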

Common reading mistakes

Mistake 1. Thinking the output picks one value. No. That would be hard attention. Here the output is a weighted mixture.
Mistake 2. Thinking softmax acts on values. No. Softmax acts on scores from Q and K. Values are mixed afterward.
Mistake 3. Thinking scaling changes the ranking. It does not. Dividing every score by the same positive number keeps the order. It changes the sharpness. That is what training cares about.

Where this lives in the wild

OpenAI GPT-style models. Every generated token uses scaled dot-product attention to score earlier context before producing the next token.
Google BERT. Bidirectional attention uses the same scorecard math, just without causal masking during encoding.
GitHub Copilot. Code completion runs on transformer language models, so when the model weighs imports, signatures, and nearby variables in its context, the relevance calculation is this same attention mechanism.
Meta Llama. Long-context decoder stacks repeatedly apply scaled dot-product attention inside every layer and head.
Vision Transformers in Google and Apple pipelines. Image patches attend to other patches using the same formula, only the tokens are patches instead of words.

Interview Q&A

Q: What does QK^T represent intuitively?
A: It is the similarity table between what each query wants and what each key advertises. Bigger dot product means better match.
Common wrong answer to avoid: "It multiplies queries with values." Values are not used until after softmax.

Q: Why use softmax after the scores?
A: Softmax converts arbitrary real-valued scores into positive weights that sum to 1, so the output becomes a stable weighted combination of values.

Q: Why divide by √d_k instead of by d_k?
A: Because dot-product variance grows roughly with the dimension, so standard deviation grows with √d_k. Dividing by √d_k keeps score magnitudes in a sensible range.
Common wrong answer to avoid: "It is only for computational speed." No. It is for numerical behaviour and gradient health.

Q: In the worked example, why did v2 affect the second output dimension even though key 1 won overall?
A: Because attention mixes values, not just winning indices. Even an 11% weight on v2 = [0,8] contributes 0.88 to the second dimension.
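The variance claim in the √d_k answer can be written out in a few lines. This is a sketch under an idealized assumption that is not guaranteed for trained models: every component q_i and k_i is independent, with mean 0 and variance 1.

```latex
\begin{aligned}
q \cdot k &= \sum_{i=1}^{d_k} q_i k_i \\
\mathbb{E}[q_i k_i] &= \mathbb{E}[q_i]\,\mathbb{E}[k_i] = 0 \\
\operatorname{Var}(q_i k_i) &= \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] - 0^2 = 1 \\
\operatorname{Var}(q \cdot k) &= \sum_{i=1}^{d_k} \operatorname{Var}(q_i k_i) = d_k \\
\operatorname{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) &= \frac{d_k}{d_k} = 1
\end{aligned}
```

So the standard deviation of a raw score is √d_k, and dividing by √d_k brings it back to 1. Dividing by d_k would overshoot and squash the scores toward zero as the dimension grows.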

Apply now (5 min)

Use this tiny setup:

q  = [1, 2]
k1 = [1, 1]
k2 = [0, 2]
v1 = [4, 0]
v2 = [0, 6]

1. Compute the two raw dot products.
2. Divide by √2.
3. Estimate the softmax weights.
4. Compute the weighted output.
5. Say which value dominates and why.

Then sketch from memory:

- Q = what I seek
- K = what I advertise
- V = what I contribute
- the four-step pipeline: dot, scale, softmax, mix

If you can reproduce [5,2,-1] -> [3.54,1.41,-0.71] -> [0.88,0.11,0.01], the scorecard math is now yours.

Bridge. One more ingredient is needed for next-token generation. The model must not look into the future. The next file adds the triangular block that hides future tokens: causal masking. Read 09-causal-masking.md next.