08. Scaled dot-product attention — the scorecard math¶
The spotlight beam has intuition. Now we make the scorecard numeric and precise.
Built on the ELI5 in
00-eli5.md. The scorecard — the relevance weights behind the spotlight beam — now becomes a concrete formula with numbers.
The picture before the math¶
First the plain-language roles. Q means: what am I looking for? K means:
what do I advertise about myself? V means: what information do I contribute
if chosen? So one token asks a question with Q. Every token presents a label
with K. Every token also carries payload with V. Attention compares the
query against all keys. That gives a scorecard. Then the values are mixed
using that scorecard. See the shape.
The formula¶
Break it calmly. -QK^T gives raw similarity scores between queries and keys.
- / √d_k scales those scores down.
- softmax(...) turns scores into positive weights summing to 1.
- multiplying by V gives the weighted sum of value vectors.
That is all. No magic. Just compare, scale, normalize, mix.
Why each letter has a separate job¶
A common confusion is this: "Why not use one vector for everything?" Because
matching and contributing are different jobs. A token may be good at being
found for one reason. It may contribute a different kind of payload afterward.
So:
- K says how I should be matched
- V says what I should contribute
- Q says what pattern I want
This separation makes attention flexible.
Worked example setup¶
Use one query vector:
Use three keys: Use three values: We will compute the whole thing end to end. Nice and clean.Step 1: raw dot products¶
Compute q · k_i.
With key 1¶
With key 2¶
With key 3¶
So the raw scores are: Key 1 matches best. Key 3 is actually anti-aligned.Step 2: scale by √d_k¶
Here d_k = 2 because each key has length 2. So:
1.414.
Scaled scores:
Why scale? Hold that thought. We will prove it after the worked example.
Step 3: softmax the scaled scores¶
Softmax turns arbitrary scores into weights. Approximate result:
Now the scorecard is easy to read. - key 1 gets 88% - key 2 gets 11% - key 3 gets 1% All weights are positive. All weights sum to 1. That is why the output becomes a weighted average of values.Step 4: weighted sum of values¶
Now mix the values with the scorecard.
Substitute the actual values. Compute each part. Add them. Done. The final attended output is: See the meaning. Most of the output came fromv1. A little came from v2.
Almost nothing came from v3. That is soft lookup in full numeric form.
ASCII diagram of the whole flow¶
q = [2,1]
|
+--> dot with k1=[2,1] -> 5
+--> dot with k2=[1,0] -> 2
+--> dot with k3=[0,-1] -> -1
raw scores : [5, 2, -1]
scaled scores : [3.54, 1.41, -0.71]
scorecard : [0.88, 0.11, 0.01]
values : [10,0], [0,8], [1,1]
weighted output : [8.81, 0.89]
Why √d_k matters¶
Dot products grow with dimension. If keys get wider, raw scores usually get
larger in magnitude. Large scores make softmax very peaky. Peaky softmax means
one item gets almost all the mass. Then gradients become weak for the others.
Training becomes unstable or slow. So we divide by √d_k to keep score scale
healthy. That keeps softmax in a useful range.
Worked comparison: without scaling vs with scaling¶
Suppose d_k = 64. And raw scores are:
Without scaling¶
Softmax of [15, 12, 9] is approximately:
With scaling¶
Now divide by √64 = 8.
Common reading mistakes¶
Mistake 1. Thinking the output picks one value. No. That would be hard
attention. Here the output is a weighted mixture. Mistake 2. Thinking softmax
acts on values. No. Softmax acts on scores from Q and K. Values are mixed
afterward. Mistake 3. Thinking scaling changes the ranking. Usually it does
not change the order. It changes the sharpness. That is what training cares
about.
Where this lives in the wild¶
- OpenAI GPT-style models. Every generated token uses scaled dot-product attention to score earlier context before producing the next token.
- Google BERT. Bidirectional attention uses the same scorecard math, just without causal masking during encoding.
- GitHub Copilot. When code completion inspects imports, signatures, and nearby variables, the relevance calculation is this exact attention mechanism.
- Meta Llama. Long-context decoder stacks repeatedly apply scaled dot-product attention inside every layer and head.
- Vision Transformers in Google and Apple pipelines. Image patches attend to other patches using the same formula, only the tokens are patches instead of words.
Interview Q&A¶
Q: What does QK^T represent intuitively? A: It is the similarity table
between what each query wants and what each key advertises. Bigger dot product
means better match. Common wrong answer to avoid: "It multiplies queries
with values." Values are not used until after softmax. Q: Why use softmax
after the scores? A: Softmax converts arbitrary real-valued scores into
positive weights that sum to 1, so the output becomes a stable weighted
combination of values. Q: Why divide by √d_k instead of by d_k? A:
Because dot-product variance grows roughly with the dimension, so standard
deviation grows with √d_k. Dividing by √d_k keeps score magnitudes in a
sensible range. Common wrong answer to avoid: "It is only for computational
speed." No. It is for numerical behaviour and gradient health. Q: In the
worked example, why did v2 affect the second output dimension even though
key 1 won overall? A: Because attention mixes values, not just winning
indices. Even an 11% weight on v2 = [0,8] contributes 0.88 to the second
dimension.
Apply now (5 min)¶
Use this tiny setup:
1. Compute the two raw dot products. 2. Divide by√2.
3. Estimate the softmax weights.
4. Compute the weighted output.
5. Say which value dominates and why.
Then sketch from memory:
- Q = what I seek
- K = what I advertise
- V = what I contribute
- the four-step pipeline: dot, scale, softmax, mix
If you can reproduce [5,2,-1] -> [3.54,1.41,-0.71] -> [0.88,0.11,0.01], the
scorecard math is now yours.
Bridge. One more ingredient is needed for next-token generation. The model must not look into the future. The next file adds the triangular block that hides future tokens: causal masking. Read
09-causal-masking.mdnext.