07. Attention as soft lookup: the spotlight beam
RNNs pass messages step by step. Attention says, why not look directly?
Built on the ELI5 in 00-eli5.md. The spotlight beam (one token checking which other tokens matter) now becomes direct lookup instead of relay passing.
The picture before the math
Picture a relay race. Runner 1 hears token 1. Runner 2 hears token 2 and whatever runner 1 managed to pass. Runner 3 hears token 3 and whatever runner 2 still remembers. By token 30, the early message is tired. Some detail has faded. Some detail got compressed. Some detail was simply lost. That is the RNN bottleneck.
Now picture a different classroom. Every student can directly ask any other student a question. No relay. No whisper chain. No waiting for middle students to carry context. That direct asking is attention. A token shines a spotlight beam. Other tokens light up by relevance. The brighter the light, the more they contribute.
Why the relay race hurts
Suppose the sentence is long. The useful clue may be 20 tokens back. An RNN has to compress that clue into a hidden state. Then preserve it through many updates. That is hard. Each new token mixes in new information. Older detail can fade. Gradients also struggle to travel through long chains. So the model knows something from earlier. But maybe not the exact thing you needed. See the picture.
RNN relay:
x1 -> h1 -> h2 -> h3 -> h4 -> ... -> h30
carries everything through one narrow pipe
attention:
token 30 --> token 1
token 30 --> token 7
token 30 --> token 18
token 30 --> token 29
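The fading can be made concrete with a toy recurrence. This is a minimal sketch, not a real RNN: the scalar state, the `decay` factor of 0.8, and the name `relay_state` are all assumptions made for illustration. It only shows how a clue from token 1 shrinks when every step rescales whatever was carried so far.

```python
def relay_state(inputs, decay=0.8):
    """Toy linear recurrence h_t = decay * h_{t-1} + x_t (assumed, not a real RNN)."""
    h = 0.0
    history = []
    for x in inputs:
        h = decay * h + x  # every update rescales whatever was carried so far
        history.append(h)
    return history

# The clue appears only at token 1; the other 29 tokens carry nothing.
inputs = [1.0] + [0.0] * 29
history = relay_state(inputs)

print(f"clue strength after token 1:  {history[0]:.4f}")
print(f"clue strength after token 30: {history[-1]:.4f}")
```

A real RNN applies learned weight matrices and nonlinearities, so the fading is less uniform than in this sketch, but the path from token 1 to token 30 still runs through every intermediate state.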
Three rescue attempts before the real fix
Rescue 1: Make the hidden state wider
Fine. Give the relay runner a bigger bag. Now more information can be carried. But it is still one bag. The same bottleneck remains.
Rescue 2: Use bidirectional RNNs
Now you read left-to-right and right-to-left. Helpful. Still, each direction is a chain. And during generation the future tokens do not exist yet, so the right-to-left pass has nothing to read.
Rescue 3: Stack deeper RNN layers
Depth helps feature extraction. Good. But the core path is still sequential. Early detail still has to survive many steps.
So these rescues improve the relay race. They do not remove the relay race. The fundamental fix is different. Let every token query every other token directly.
Attention as a dictionary lookup
Think of memory as a dictionary. A dictionary has keys and values. If I know the exact key, I retrieve the value. That is hard lookup. One key matches. One value returns.
Attention is softer. The query does not demand one exact match. It compares itself with all keys. Then it gives each key a relevance score. Then it mixes all values using those scores. So attention is a soft lookup. Many keys contribute. Strong matches contribute more. Weak matches contribute less.
Hard lookup vs soft lookup
Hard lookup
One exact key in, one value out. Very sharp. Very discrete.
Soft lookup
query asks: "what earlier token best explains me?"
compare against all keys
scores -> [0.72, 0.05, 0.08, 0.15]
return weighted mix of all values
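The contrast can be sketched in a few lines of plain Python. Everything here is illustrative: the `memory` dictionary, its 2-d vectors, and the helper name `soft_lookup` are assumptions, not part of any library. The sketch shows that hard lookup behaves like soft lookup with all the weight on a single key.

```python
# Hypothetical toy memory: tokens and 2-d vectors are made up for illustration.
memory = {
    "animal": [4.0, 1.0],
    "road": [1.0, 4.0],
}

# Hard lookup: the key must match exactly, and exactly one value comes back.
hard_result = memory["animal"]

def soft_lookup(weights, values):
    """Return the weighted mix of all value vectors."""
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

values = list(memory.values())

# One-hot weights put everything on the first value, so soft lookup
# collapses back to the hard lookup result.
one_hot_result = soft_lookup([1.0, 0.0], values)

# Softer weights let both values contribute.
blended_result = soft_lookup([0.9, 0.1], values)

print(hard_result, one_hot_result, blended_result)
```

With one-hot weights the two results coincide; with softer weights the output sits between the stored values instead of being one of them.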
The pronoun example
Take the sentence: The animal didn't cross the road because it was tired.
Focus on the token it. What does it refer to? The model shines the spotlight beam backward. Suppose the scorecard is the one above, [0.72, 0.05, 0.08, 0.15], with the 0.72 on animal.
animal gets the strongest light. So the updated representation for it mostly pulls information from animal. road gets weak attention. That makes sense. Roads do not usually get tired. cross and tired still contribute a little, because verbs and adjectives also help interpret the clause. This is not exact symbolic logic. It is weighted evidence gathering. That is why soft lookup is the right picture.
A tiny formula before the full formula
Do not jump to the matrix form yet. Keep the picture simple. For one query token:
1. Compare query with every key.
2. Turn comparison scores into weights.
3. Use weights to mix the values.
In symbols:
output = w_1·v_1 + w_2·v_2 + ... + w_n·v_n
where w_i is the weight on token i, v_i is its value vector, and the weights sum to 1.
That is enough for now. Next file will formalize it with QK^T, softmax, and scaling. This file is about intuition.
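The three steps can be sketched for a single query in plain Python. This is a minimal sketch under stated assumptions: the comparison is taken to be a dot product and the score-to-weight step is a softmax (both are formalized in the next file), there is no scaling, and the function names and toy vectors are made up for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    # Step 1: compare the query with every key (dot product, assumed here).
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Step 2: turn comparison scores into weights.
    weights = softmax(scores)
    # Step 3: use the weights to mix the values.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Toy vectors, made up for illustration.
query = [1.0, 0.0]
keys = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
values = [[4.0, 1.0], [1.0, 4.0], [2.0, 0.0]]

weights, output = attend(query, keys, values)
print("weights:", [round(w, 3) for w in weights])
print("output: ", [round(x, 3) for x in output])
```

The first key points in the same direction as the query, so it gets the largest weight and its value dominates the output, while the other two values still contribute.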
Worked mini lookup with numbers
Keep the same pronoun story. Suppose four candidate value vectors are [4,1], [1,4], [2,0], and [0,3], with [4,1] belonging to animal. Use the same scorecard, [0.72, 0.05, 0.08, 0.15]. Now compute the soft lookup output:
0.72·[4,1] + 0.05·[1,4] + 0.08·[2,0] + 0.15·[0,3]
= [2.88,0.72] + [0.05,0.20] + [0.16,0.00] + [0.00,0.45]
= [3.09, 1.37]
Most of the output came from animal. A little structure came from tired. The lookup does not return one token. It returns a blended context vector.
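The arithmetic above can be checked with a short script. It uses only the weights and value vectors from the worked example; the variable names are illustrative.

```python
weights = [0.72, 0.05, 0.08, 0.15]
values = [[4, 1], [1, 4], [2, 0], [0, 3]]

# Step 1: scale each value vector by its weight.
scaled = [[w * x for x in v] for w, v in zip(weights, values)]

# Step 2: add the scaled vectors component by component.
output = [sum(col) for col in zip(*scaled)]

print([round(x, 2) for x in output])  # [3.09, 1.37]
```

The first component is dominated by the 2.88 contributed by the first value vector, which is why the blended result stays close to it.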
ASCII picture: relay vs direct access
old relay path
-------------
word1 -> state1 -> state2 -> state3 -> ... -> stateN
if wordN needs word1,
it depends on every middle state not dropping the clue
new direct path
---------------
wordN
|\
| \____ word1
|_____ word5
\______ word12
wordN asks everyone directly
Why the spotlight beam is powerful
A token does not just read its neighbours. It can consult far tokens too. That is useful for:
- pronoun resolution
- long code dependencies
- matching an answer to a question earlier in the prompt
- linking a table row to its header
- tying a closing bracket to the right opening bracket
And because the lookup is soft, the model is not forced into one brittle choice too early. It can keep multiple hypotheses alive. Nice behaviour for language.
Where this lives in the wild
- OpenAI ChatGPT. A user asks a question, then clarifies three turns later. Attention lets the latest turn directly consult the earlier instruction.
- Google Translate. A word late in the target sentence often needs evidence from a specific source word much earlier in the source sentence.
- GitHub Copilot. While generating code, the current line can directly attend to a function signature many lines above.
- Gmail Smart Compose. A proposed next word can use clues from the full sentence, not only the last hidden state.
- Figma or Notion AI assistants. When rewriting structured content, tokens can consult headings, bullets, and prior phrases directly.
Interview Q&A
Q: What problem does attention solve compared with an RNN?
A: It removes the single hidden-state bottleneck. A token no longer depends on a long relay chain to access earlier information; it can query relevant tokens directly.
Common wrong answer to avoid: "Attention just makes training faster." Speed is a consequence. The deeper reason is direct access to relevant context.
Q: Why call attention a soft lookup?
A: Because the query compares with all keys and retrieves a weighted mixture of all values, not one exact record from a hash table.
Q: In the it was tired example, why isn't the weight on road exactly zero?
A: Because attention is probabilistic and context-dependent. The model often keeps small weight on less likely options until later layers sharpen the interpretation.
Common wrong answer to avoid: "A correct model must assign zero to every wrong token." Soft models rarely work that way; they rank and blend evidence.
Q: What do the weights intuitively mean?
A: They are the scorecard for relevance. Higher weight means, "Your value vector should contribute more to my updated meaning."
Apply now (5 min)
Take this sentence: Riya thanked Meena because she helped with the demo.
1. Circle the token she.
2. List three candidate earlier tokens it may attend to.
3. Invent a scorecard that sums to 1.
4. Say which token should get the highest weight and why.
5. Say what small weights on other tokens still represent.
Then sketch from memory:
- the relay-race diagram
- the direct-access diagram
- one hard lookup example
- one soft lookup example with weights
If you can explain why widening the hidden state is not the same as direct
access, the spotlight beam has clicked.
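To check an invented scorecard from the exercise, a small helper along these lines may be handy. It is a hypothetical sketch: `check_scorecard` is not from any library, and the placeholder tokens and weights are made up so that the exercise answer stays yours to work out.

```python
import math

def check_scorecard(scorecard):
    """Validate a {token: weight} scorecard and return the top-weighted token."""
    weights = list(scorecard.values())
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    if not math.isclose(sum(weights), 1.0, abs_tol=1e-9):
        raise ValueError(f"weights sum to {sum(weights)}, expected 1")
    return max(scorecard, key=scorecard.get)

# Placeholder scorecard with made-up tokens; swap in your own answer.
example = {"token_a": 0.6, "token_b": 0.3, "token_c": 0.1}
top = check_scorecard(example)
print("highest weight:", top)
```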
Bridge. The lookup picture is clear. Now we make the scorecard exact: dot products, scaling by √d_k, softmax, and the final weighted sum over values. Read 08-scaled-dot-product.md next.