00. Tokenization & attention in kid words: the office message room
Read this first. Every later file in this module calls back to this picture. Five minutes.
The setup
Imagine a large office. A paper note arrives at reception. The note is a sentence — messy ink, mixed languages, product names, emojis. No one in the office can read raw ink. So the office uses a fixed workflow.
First comes the splitter. The splitter cuts the note into standard pieces. Not single letters: that makes too many pieces. Not whole words: that is too rigid for new slang, code, and typos. Just reusable chunks. tokenization might become token + ization, two chunks the office has seen many times before.
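A minimal sketch of the splitter, assuming a tiny hand-written vocabulary and a greedy longest-match rule. Both `VOCAB` and `split_word` are hypothetical names for illustration; real tokenizers learn their vocabulary from data and use more careful rules.

```python
# Hypothetical toy vocabulary; a real tokenizer learns tens of thousands of pieces.
VOCAB = {"token", "ization", "iz", "ation", "s"}

def split_word(word, vocab=VOCAB):
    """Greedy longest-match split; unknown characters fall back to single letters."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first, then shrink it.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary piece starts here: emit one character.
            pieces.append(word[i])
            i += 1
    return pieces

print(split_word("tokenization"))  # ['token', 'ization']
print(split_word("tokens"))        # ['token', 's']
```

The single-character fallback is what keeps the splitter from ever refusing a note: a typo or a new word still comes out as pieces, just smaller ones.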
Next comes the badge board. Every chunk gets a reusable badge number. Badge 417 might mean token. Badge 982 might mean ization. The badge board is a giant cabinet: each drawer number is a token ID, and inside each drawer is a learned card (a vector). Now the office can compare chunks geometrically. Similar chunks hold cards that point in similar directions, even when their drawer numbers are far apart.
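The cabinet can be sketched as a plain dictionary from token ID to vector. The IDs 417 and 982 come from the text; ID 23, the three-dimensional cards, and the helper names are assumptions made up for this example. Closeness is measured between the cards, not between the ID numbers.

```python
import math

# Hypothetical 3-dimensional cards; real models learn hundreds of dimensions.
EMBEDDINGS = {
    417: [0.9, 0.1, 0.0],  # "token"
    982: [0.1, 0.8, 0.2],  # "ization"
    23:  [0.8, 0.2, 0.1],  # "word" (assumed ID: far from 417, but a similar card)
}

def lookup(token_ids):
    """Open one drawer per token ID and return the cards in order."""
    return [EMBEDDINGS[t] for t in token_ids]

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

vectors = lookup([417, 982])
# "token" is closer to "word" than to "ization" in this toy cabinet.
print(cosine(EMBEDDINGS[417], EMBEDDINGS[23]) > cosine(EMBEDDINGS[417], EMBEDDINGS[982]))  # True
```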
Then comes the seat number. Even if two chunks are identical, their seat numbers differ. The word bank in position 2 is not the same as bank in position 9. Without seat numbers, dog bites man and man bites dog look identical. The seat number tells where each chunk sits in line.
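A small sketch of why the seat number matters, assuming made-up two-dimensional token cards and position cards (`TOKEN_VEC` and `POSITION_VEC` are hypothetical). Without seats, the two sentences are the same pile of cards. With seats added, the piles differ.

```python
# Hypothetical 2-dimensional token cards and position cards.
TOKEN_VEC = {"dog": [1.0, 0.0], "bites": [0.0, 1.0], "man": [1.0, 1.0]}
POSITION_VEC = [[0.1, 0.0], [0.0, 0.2], [0.3, 0.3]]  # seats 1, 2, 3

def add_positions(tokens):
    """Add the seat card to each token card, element by element."""
    return [
        [t + p for t, p in zip(TOKEN_VEC[tok], POSITION_VEC[i])]
        for i, tok in enumerate(tokens)
    ]

a = ["dog", "bites", "man"]
b = ["man", "bites", "dog"]

# Without seats: the same cards, only shuffled.
print(sorted(TOKEN_VEC[t] for t in a) == sorted(TOKEN_VEC[t] for t in b))  # True
# With seats: the piles are no longer interchangeable.
print(sorted(add_positions(a)) == sorted(add_positions(b)))  # False
```

Adding the position card to the token card is one common choice and the one the diagram below assumes; later files in the module cover alternatives.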
Now the message pieces go to employees sitting at desks. Each employee gets one chunk. Then comes the spotlight beam. An employee can shine a spotlight on colleagues — "You may matter for my meaning." But the beam needs a strength measure. So we keep the scorecard. The scorecard gives each colleague a relevance score. Higher score means stronger attention. Lower means weaker.
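The scorecard can be sketched as raw relevance scores pushed through a softmax, so the weights are positive and sum to 1. This version assumes plain dot products as the raw scores and uses made-up two-dimensional vectors; real models add learned projections and a scaling factor, which a later file covers.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scorecard(query, keys):
    """Score one employee's query against every desk, then normalize."""
    raw = [sum(q * k for q, k in zip(query, key)) for key in keys]
    return softmax(raw)

# Hypothetical vectors: one query, four desks to look at.
keys = [[0.1, 0.0], [1.0, 0.9], [0.5, 0.5], [0.2, 0.3]]
weights = scorecard([1.0, 1.0], keys)

print(weights.index(max(weights)))  # 1: the second desk gets the strongest beam
```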
At the end, every employee writes a richer note. That richer note mixes its own chunk with helpful context. So bank near river becomes one thing. And bank near loan becomes another.
The whole story in one picture
raw text
|
v
[the splitter] ──── cuts into reusable chunks
|
v
token IDs
|
v
[the badge board] ──── looks up a learned vector per ID
|
+──── [the seat number] ──── adds position info
|
v
token vector + position vector
|
v
[the spotlight beam] ──── each token queries every token, itself included
|
v
[the scorecard] ──── softmax weights: who matters how much
|
v
contextual token vectors ──── meaning shaped by neighbors
A tiny worked example
Take the sentence: The server failed again
The splitter produces: The | server | failed | again
The badge board turns each piece into a vector. The seat number marks positions 1, 2, 3, 4.
Then failed shines the spotlight beam. Its scorecard:
failed --> The : 0.05 (weak — article, not informative)
failed --> server : 0.70 (strong — what failed?)
failed --> failed : 0.15 (moderate — self-reference)
failed --> again : 0.10 (moderate — repetition signal)
So the final vector for failed becomes richer. It no longer means failure in isolation. It means a server failing, and not for the first time.
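The mixing step can be sketched as a weighted sum. The weights below are the scorecard from this example; the two-dimensional `VALUES` vectors and the `mix` helper are hypothetical, chosen only so the arithmetic is easy to follow.

```python
# Scorecard for "failed", copied from the worked example above.
WEIGHTS = {"The": 0.05, "server": 0.70, "failed": 0.15, "again": 0.10}

# Hypothetical 2-dimensional vectors, one per token.
VALUES = {
    "The":    [0.0, 0.0],
    "server": [1.0, 0.0],
    "failed": [0.0, 1.0],
    "again":  [1.0, 1.0],
}

def mix(weights, values):
    """Weighted sum of the vectors: the richer note for one token."""
    dim = len(next(iter(values.values())))
    out = [0.0] * dim
    for tok, w in weights.items():
        for d in range(dim):
            out[d] += w * values[tok][d]
    return out

context = mix(WEIGHTS, VALUES)
print([round(x, 2) for x in context])  # [0.8, 0.25]
```

Most of the result comes from server, because server holds most of the weight. That is the whole trick: the scorecard decides how much of each colleague ends up in the final note.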
The placeholders you will see called back
Whenever a later file uses one of these names, picture this office:
| Placeholder | Picture |
|---|---|
| The splitter | Tokenization: cuts raw text into reusable subword pieces. |
| The badge board | Embedding table: maps each token ID to a learned vector. |
| The seat number | Positional encoding: tells the model where each token sits. |
| The spotlight beam | Attention: each token queries every token, itself included, for relevance. |
| The scorecard | Attention weights: softmax scores saying who matters how much. |
If any one piece fails, the office misreads the note. Bad splitter → awkward chunks. Weak badge board → poor meaning vectors. Missing seat number → lost word order. Dull spotlight → hidden context. Unstable scorecard → noisy attention.
What's coming
The rest of the module is the failure-fix chain:
- The tokenizer failure. Why naive word-level tokenizers collapse on real text. → 01-tokenizer-failure.md
- Character vs word level. Two extremes, both broken.
- Subword tokenization and BPE. The practical middle path.
- Embeddings. Token IDs to vectors — the badge board.
- Positional encoding. The seat number — sinusoidal clocks.
- RoPE and ALiBi. Modern fixes for position at long context.
- Attention as soft lookup. The spotlight beam replaces the RNN bottleneck.
- Scaled dot-product attention. The math behind the scorecard.
- Causal masking. Blocking the future in decoders.
- Multi-head attention. Parallel crews with different consultation habits.
- The full pipeline. Raw text to contextual vectors, end to end.
- WordPiece and Unigram. Two alternatives to BPE: different ways of building the vocabulary, same goal.
- Cross-attention. When Q comes from one sequence and K, V from another.
- Honest admission. What still feels unsolved.
Each file is one piece. Each piece exists because the previous piece broke at something specific.
Bridge. The first thing that breaks is the front door. Feed real text (code, prices, product names, mixed scripts) into a naive word tokenizer, and meaning collapses. Read 01-tokenizer-failure.md next.