07. Attention as soft lookup: the spotlight beam

RNNs pass messages step by step. Attention says, why not look directly?

Builds on the ELI5 in 00-eli5.md. The spotlight beam (one token checking which other tokens matter) now becomes direct lookup instead of relay passing.

The picture before the math

Picture a relay race. Runner 1 hears token 1. Runner 2 hears token 2 and whatever runner 1 managed to pass. Runner 3 hears token 3 and whatever runner 2 still remembers. By token 30, the early message is tired. Some detail has faded. Some detail got compressed. Some detail was simply lost. That is the RNN bottleneck.

Now picture a different classroom. Every student can directly ask any other student a question. No relay. No whisper chain. No waiting for middle students to carry context. That direct asking is attention.

A token shines a spotlight beam. Other tokens light up by relevance. The brighter the light, the more they contribute.

Why the relay race hurts

Suppose the sentence is long. The useful clue may be 20 tokens back. An RNN has to compress that clue into a hidden state. Then preserve it through many updates. That is hard. Each new token mixes in new information. Older detail can fade. Gradients also struggle to travel through long chains. So the model knows something from earlier. But maybe not the exact thing you needed. See the picture.

RNN relay:
x1 -> h1 -> h2 -> h3 -> h4 -> ... -> h30
carries everything through one narrow pipe

attention:
token 30 --> token 1
token 30 --> token 7
token 30 --> token 18
token 30 --> token 29

Direct access beats repeated forwarding.
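To get a feel for the fading, here is a deliberately crude sketch. It is not a real RNN. It assumes each hop of the relay keeps a fixed fraction of the clue (`RETENTION`, a made-up number), while direct access reads the clue at full strength. The function names are hypothetical.

```python
# Toy model of the relay, NOT a real RNN: each hop keeps a fixed fraction
# of whatever signal it received. RETENTION is an assumed, made-up number.
RETENTION = 0.9


def relay_signal(hops: int, retention: float = RETENTION) -> float:
    """Strength of a clue after it has been forwarded `hops` times."""
    return retention ** hops


def direct_signal() -> float:
    """With direct access the clue is read at full strength at any distance."""
    return 1.0


# Distances from token 30 back to tokens 29, 23, 12 and 1.
for distance in (1, 7, 18, 29):
    print(f"distance {distance:2d}: "
          f"relay {relay_signal(distance):.3f}, direct {direct_signal():.3f}")
```

Under this assumption, the clue from token 1 reaches token 30 at well under a tenth of its original strength. The direct path does not care about distance at all.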

Three rescue attempts before the real fix

Rescue 1: Make the hidden state wider

Fine. Give the relay runner a bigger bag. Now more information can be carried. But it is still one bag. The same bottleneck remains.

Rescue 2: Use bidirectional RNNs

Now you read left-to-right and right-to-left. Helpful. Still, each direction is a chain. And when the model generates text token by token, the future tokens do not exist yet, so the right-to-left pass has nothing to read.

Rescue 3: Stack deeper RNN layers

Depth helps feature extraction. Good. But the core path is still sequential. Early detail still has to survive many steps.

So these rescues improve the relay race. They do not remove the relay race. The fundamental fix is different. Let every token query every other token directly.

Attention as a dictionary lookup

Think of memory as a dictionary. A dictionary has keys and values. If I know the exact key, I retrieve the value. That is hard lookup.

key:   "capital of India"
value: "New Delhi"

One key matches. One value returns.

Attention is softer. The query does not demand one exact match. It compares itself with all keys. Then it gives each key a relevance score. Then it mixes all values using those scores. So attention is a soft lookup. Many keys contribute. Strong matches contribute more. Weak matches contribute less.

Hard lookup vs soft lookup

Hard lookup

query key = "user_id:42"
exact match found
return one record

Very sharp. Very discrete.

Soft lookup

query asks: "what earlier token best explains me?"
compare against all keys
scores -> [0.72, 0.05, 0.08, 0.15]
return weighted mix of all values

No single winner is required. That is why attention handles ambiguity nicely. Pronouns. Long dependencies. Multiple partial clues. All fit this picture.
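One way to see how the two relate: a hard lookup is a soft lookup whose weights are one-hot. The sketch below uses hypothetical scalar values and a hypothetical helper `soft_lookup`. The numbers are illustrative only.

```python
def soft_lookup(weights, values):
    """Return the weighted mix of `values`; weights are assumed to sum to 1."""
    return sum(w * v for w, v in zip(weights, values))


# Hypothetical scalar values stored under four keys (illustrative numbers).
values = [10.0, 20.0, 30.0, 40.0]

# One-hot weights: exactly one record comes back, like a dictionary hit.
hard = soft_lookup([1.0, 0.0, 0.0, 0.0], values)

# Spread-out weights: every record contributes in proportion to its weight.
soft = soft_lookup([0.72, 0.05, 0.08, 0.15], values)

print(hard)  # 10.0
print(soft)  # about 16.6, a blend rather than any single stored value
```

The blended result is not one of the stored values. That is the point: the answer is a mix, not a pick.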

The pronoun example

Take the sentence: "The animal didn't cross the road because it was tired." Focus on the token "it". What does it refer to? The model shines the spotlight beam across the rest of the sentence. Suppose the scorecard says:

animal : 0.72
road   : 0.05
cross  : 0.08
tired  : 0.15

Now read the meaning. "animal" gets the strongest light, so the updated representation for "it" mostly pulls information from "animal". "road" gets weak attention. That makes sense: roads do not usually get tired. "cross" and "tired" still contribute a little, because verbs and adjectives also help interpret the clause. This is not exact symbolic logic. It is weighted evidence gathering. That is why soft lookup is the right picture.
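A scorecard like this has two properties worth checking: no weight is negative, and the weights sum to 1. A small sketch, with a hypothetical helper `check_scorecard`, using the same numbers as above:

```python
import math

scorecard = {"animal": 0.72, "road": 0.05, "cross": 0.08, "tired": 0.15}


def check_scorecard(card):
    """A valid scorecard has non-negative weights that sum to 1."""
    non_negative = all(w >= 0 for w in card.values())
    sums_to_one = math.isclose(sum(card.values()), 1.0)
    return non_negative and sums_to_one


# The token with the largest weight is the one the beam lights up most.
strongest = max(scorecard, key=scorecard.get)

print(check_scorecard(scorecard))  # True
print(strongest)                   # animal
```

`math.isclose` is used instead of `==` because floating-point sums rarely land on exactly 1.0.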

A tiny formula before the full formula

Do not jump to the matrix form yet. Keep the picture simple. For one query token:

1. Compare query with every key.
2. Turn comparison scores into weights.
3. Use weights to mix the values.

In symbols:

weights = normalize(similarity(query, keys))
output  = weighted_sum(weights, values)

That is enough for now. The next file formalizes it with QK^T, softmax, and scaling. This file is about intuition.
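The two lines above can be sketched in plain Python. This is one possible reading, not the definitive form: it assumes a dot product for `similarity` and a softmax for `normalize`, and the 2-d vectors are hypothetical, chosen only to keep the example readable.

```python
import math


def similarity(query, key):
    """Dot product: one common choice of similarity (an assumption here)."""
    return sum(q * k for q, k in zip(query, key))


def normalize(scores):
    """Softmax: turns arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights, values):
    """Mix the value vectors, one dimension at a time."""
    dims = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dims)]


def attend(query, keys, values):
    """Steps 1-3 for a single query: compare, weight, mix."""
    weights = normalize([similarity(query, k) for k in keys])
    return weights, weighted_sum(weights, values)


# Hypothetical 2-d vectors (illustrative only).
query = [1.0, 0.0]
keys = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
values = [[4.0, 1.0], [1.0, 4.0], [2.0, 0.0]]

weights, output = attend(query, keys, values)
print(weights)  # the first key matches the query best, so it gets the most weight
print(output)   # a blend of all three value vectors
```

Note that every weight comes out strictly positive. Even the worst-matching key still contributes a little.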

Worked mini lookup with numbers

Keep the same pronoun story. Suppose four candidate value vectors are:

animal = [4, 1]
road   = [1, 4]
cross  = [2, 0]
tired  = [0, 3]

Use the same scorecard:

[0.72, 0.05, 0.08, 0.15]

Now compute the soft lookup output:

0.72·[4,1] + 0.05·[1,4] + 0.08·[2,0] + 0.15·[0,3]
= [2.88,0.72] + [0.05,0.20] + [0.16,0.00] + [0.00,0.45]
= [3.09, 1.37]

See the meaning. Most evidence came from animal. A little structure came from tired. The lookup does not return one token. It returns a blended context vector.
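The arithmetic above can be checked in a few lines. The vectors and weights are the ones from this section. Only the variable names are new.

```python
values = {"animal": [4, 1], "road": [1, 4], "cross": [2, 0], "tired": [0, 3]}
weights = {"animal": 0.72, "road": 0.05, "cross": 0.08, "tired": 0.15}

# Accumulate weight * value for each token, one dimension at a time.
output = [0.0, 0.0]
for token, weight in weights.items():
    for d, component in enumerate(values[token]):
        output[d] += weight * component

# Round to two decimals to hide floating-point noise.
print([round(x, 2) for x in output])  # [3.09, 1.37]
```

The result sits close to the animal vector [4, 1], pulled slightly toward the others.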

ASCII picture: relay vs direct access

old relay path
-------------
word1 -> state1 -> state2 -> state3 -> ... -> stateN
if wordN needs word1,
it depends on every middle state not dropping the clue

new direct path
---------------
wordN
|\
| \____ word1
|_____ word5
\______ word12
wordN asks everyone directly

See. The bottleneck is gone. Not reduced. Gone.

Why the spotlight beam is powerful

A token does not just read its neighbours. It can consult far tokens too. That is useful for:

- pronoun resolution
- long code dependencies
- matching an answer to a question earlier in the prompt
- linking a table row to its header
- tying a closing bracket to the right opening bracket

And because the lookup is soft, the model is not forced into one brittle choice too early. It can keep multiple hypotheses alive. Nice behaviour for language.

Where this lives in the wild

OpenAI ChatGPT. A user asks a question, then clarifies three turns later. Attention lets the latest turn directly consult the earlier instruction.
Google Translate. A word late in the target sentence often needs evidence from a specific source word much earlier in the source sentence.
GitHub Copilot. While generating code, the current line can directly attend to a function signature many lines above.
Gmail Smart Compose. A proposed next word can use clues from the full sentence, not only the last hidden state.
Figma or Notion AI assistants. When rewriting structured content, tokens can consult headings, bullets, and prior phrases directly.

Interview Q&A

Q: What problem does attention solve compared with an RNN?
A: It removes the single hidden-state bottleneck. A token no longer depends on a long relay chain to access earlier information; it can query relevant tokens directly.
Common wrong answer to avoid: "Attention just makes training faster." Speed is a consequence. The deeper reason is direct access to relevant context.

Q: Why call attention a soft lookup?
A: Because the query compares with all keys and retrieves a weighted mixture of all values, not one exact record from a hash table.

Q: In the "it was tired" example, why isn't the weight on "road" exactly zero?
A: Because the weights come from a smooth normalization (softmax, covered in the next file), which gives every token a positive weight. The model often keeps small weight on less likely options until later layers sharpen the interpretation.
Common wrong answer to avoid: "A correct model must assign zero to every wrong token." Soft models rarely work that way; they rank and blend evidence.

Q: What do the weights intuitively mean?
A: They are the scorecard for relevance. Higher weight means, "Your value vector should contribute more to my updated meaning."

Apply now (5 min)

Take this sentence: "Riya thanked Meena because she helped with the demo."

1. Circle the token "she".
2. List three candidate earlier tokens it may attend to.
3. Invent a scorecard that sums to 1.
4. Say which token should get the highest weight and why.
5. Say what small weights on other tokens still represent.

Then sketch from memory:

- the relay-race diagram
- the direct-access diagram
- one hard lookup example
- one soft lookup example with weights

If you can explain why widening the hidden state is not the same as direct access, the spotlight beam has clicked.

Bridge. The lookup picture is clear. Now we make the scorecard exact: dot products, scaling by √d_k, softmax, and the final weighted sum over values. Read 08-scaled-dot-product.md next.