02. Tokenization & Attention — Narrative Explainer

Companion note: keep 03_study_material.md open for the terse reference version.

ELI5 — The office message room

Imagine a large office. A paper note arrives at reception. The note is a sentence. The office staff must understand it together. But the note arrives as messy ink. No one can process raw ink directly. So the office uses a fixed workflow. First comes the splitter. The splitter cuts the note into standard pieces. Not too small. Not too large. Just useful chunks. Next comes the badge board. Every chunk gets a reusable badge number. Badge 417 might mean token. Badge 982 might mean ization. Now the office can store and compare chunks cleanly. Then comes the seat number. Even if two chunks are identical, their seat numbers differ. The word bank in position 2 is not position 9. Seat number tells where a chunk sits in the line. Now the message pieces go to employees sitting at desks. Each employee gets one chunk. Each employee also sees the other employees. Then comes the spotlight beam. An employee can shine the spotlight on colleagues. The beam means, "You may matter for my meaning." But the beam needs strength. So we keep the scorecard. The scorecard gives each colleague a relevance score. Higher score means stronger attention. Lower score means weaker attention. At the end, every employee writes a richer note. That richer note mixes its own chunk with helpful context. So bank near river becomes one thing. And bank near loan becomes another. Simple, no?

The five placeholders

We will reuse these names again and again.
- The splitter = tokenization.
- The badge board = mapping token IDs to vectors.
- The seat number = positional information.
- The spotlight beam = attention over other tokens.
- The scorecard = numeric attention weights.

If any one piece fails, the office misreads the note. If the splitter is bad, chunks are awkward. If the badge board is weak, chunks carry poor meaning. If the seat number is missing, order disappears. If the spotlight beam is dull, context stays hidden. If the scorecard is unstable, attention becomes noisy.

The whole story in one picture

raw text
   |
   v
[the splitter]
   |
   v
 token IDs
   |
   v
[the badge board] ---> embeddings
   |
   +----[the seat number]----+
   |                         |
   v                         v
  token vector + position vector
               |
               v
      [the spotlight beam]
               |
               v
        [the scorecard]
               |
               v
      contextual token vectors
Now take one sentence.

The server failed again

The splitter may produce:

The | server | failed | again

The badge board turns each piece into a vector. The seat number marks positions 1, 2, 3, 4. Then failed shines the spotlight beam. It may care strongly about server. It may care weakly about The. Its scorecard could look like this:
failed --> The     : 0.05
failed --> server  : 0.70
failed --> failed  : 0.15
failed --> again   : 0.10
So the final vector for failed becomes richer. It no longer means failure in isolation. It means failure of a server, repeated again. That is the heart of tokenization and attention. The chapters that follow walk every piece from the splitter to the scorecard, slowly, numerically, and visually.

Chapter 1 — Opening failure & stakes

1.1 A curiosity-gap failure

See. Suppose you build a naive word-level tokenizer. Its vocabulary has 50,000 words. That sounds large. Now feed this line:

ChatGPT-4o mini costs ₹0.15 per 1M input tokens.

A whitespace tokenizer splits it like this:
1. ChatGPT-4o
2. mini
3. costs
4. ₹0.15
5. per
6. 1M
7. input
8. tokens.

Now imagine the vocabulary contains:
- mini
- costs
- per
- input
- tokens

But it does not contain:
- ChatGPT-4o
- ₹0.15
- 1M
- tokens. (with the trailing period attached)

Already 4 of 8 pieces are bad. Maybe you say, "Lowercase it." Good attempt. chatgpt-4o is still unknown. Maybe you say, "Strip punctuation." Good attempt. ChatGPT4o is still unknown. Maybe you say, "Split on dash." Good attempt. Now you get ChatGPT and 4o. 4o is still awkward. Maybe you say, "Replace unknown words with [UNK]." Now meaning collapses. The model sees:

[UNK] mini costs [UNK] per [UNK] input [UNK]

You lost product identity. You lost price shape. You lost quantity format. You lost punctuation signal. So the curiosity gap is this. How do modern models avoid collapsing on real text? How do they handle product names, code, numbers, emojis, typos, and mixed scripts? This is not a tiny preprocessing detail. This is the front door of the model.
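The count of bad pieces is easy to check by hand, and just as easy in a few lines of Python. This is a minimal sketch, assuming a vocabulary of only the five known words from the example. The helper word_tokenize is hypothetical, not a library function.

```python
# Toy word-level tokenizer: whitespace split, fixed vocabulary, [UNK] fallback.
# VOCAB holds only the five known words from the example (illustrative, not 50,000 entries).
VOCAB = {"mini", "costs", "per", "input", "tokens"}


def word_tokenize(text, vocab):
    """Split on whitespace and replace anything outside the vocabulary with [UNK]."""
    return [piece if piece in vocab else "[UNK]" for piece in text.split()]


line = "ChatGPT-4o mini costs ₹0.15 per 1M input tokens."
pieces = line.split()
encoded = word_tokenize(line, VOCAB)

# Collect the original pieces that fell through to [UNK].
unknown = [p for p, e in zip(pieces, encoded) if e == "[UNK]"]

print(len(pieces), "pieces,", len(unknown), "unknown:", unknown)
```

Note that "tokens." with its trailing period misses the vocabulary entry "tokens". Exact string matching is what makes a fixed word list so brittle.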

1.2 Why this matters to you

Gaurav, this matters directly for a Lead AI Engineer interview. Why? Because senior roles are not only about calling an API. They are about diagnosing failure at the right layer. If your RAG chunking looks fine but token count explodes, you need tokenizer intuition. If latency jumps with longer prompts, you need attention-cost intuition. If multilingual prompts become expensive, you need subword and vocabulary intuition. If a model misses the right referent, you need self-attention intuition. A lead candidate should say more than, "Transformers use attention." You should be able to say: - why character-level input becomes too long, - why word-level vocabularies crack on the open world, - how BPE repairs the trade-off, - why embeddings are table lookups, - why order must be injected, - why self-attention is a soft lookup, - why scaling by √d_k stabilizes the scorecard, - and why multiple heads are not decorative. Interviewers listen for structure. They want to hear failure, fix, new failure, next fix. That is how real systems evolve. So let us build the chain carefully.

Chapter 2 — Tokenization

2.1 Raw text is not model-ready

A model consumes numbers. Text is not numbers. So what to do? We need a rule that converts text into repeatable pieces. Those pieces become IDs. Those IDs become vectors. But the first choice matters a lot. If pieces are too tiny, sequences grow long. If pieces are too large, vocabulary explodes. The splitter must find the middle path. Think of luggage at an airport. One giant trunk is hard to sort. Thousands of loose screws are also hard to sort. Standard boxes work better. Tokenization is that standard box choice.

2.2 Character-level tokenization: why it strains the system

Mental picture first. The splitter cuts every word into single letters. tokenization becomes: t | o | k | e | n | i | z | a | t | i | o | n Character-level looks attractive at first. No out-of-vocabulary problem. Any new word can be spelled with known characters. Any typo can still be represented. Any code string can still be represented. So why not stop here? Let us try three concrete rescues.

Attempt 1 — accept long sequences

Suppose a sentence has 100 words. Average word length is 5 characters. That is 500 letters. Add about 100 spaces and punctuation marks. You land near 600 character tokens. A decent subword tokenizer might need only 130 tokens. Now compare attention cost. Attention builds an L x L interaction grid. For 600 tokens: 600 x 600 = 360,000 score cells. For 130 tokens: 130 x 130 = 16,900 score cells. That is roughly 21x more pairwise work. See the pain. The spotlight beam must compare many more desks.

Attempt 2 — keep characters, but widen the model

Maybe you say, "Fine, I will use larger vectors." Now each tiny character gets a richer embedding. But sequence length is still huge. Attention still scales quadratically with length. A fatter badge board does not fix too many seats.

Attempt 3 — keep characters, then truncate

Maybe you say, "I will cut the prompt earlier." Now cost drops. But meaning drops too. Long code files. Long legal clauses. Long chat histories. All get cut sooner. You saved compute by deleting context. That is not a real fix.

Why the failure is fundamental

Characters preserve spelling. They do not package meaning efficiently. The model must rediscover common chunks again and again. It must learn that t,o,k,e,n often travels together. It must learn that ing behaves like a reusable ending. It must learn common names letter by letter. Possible? Yes. Efficient? Usually no. So character-level tokenization can represent anything. But it cannot do it compactly enough for modern usage. That is the core failure.

2.3 Word-level tokenization: why it breaks on the real world

So what to do? Maybe make each whole word one token. Now tokenization is one piece. That seems elegant. Sequences become shorter. The splitter looks smarter. But let us stress it. Take these words: - play - playing - played - player - playful - replay - gameplay A word-level vocabulary treats them as separate atoms. Now add real-world mess: - product names, - code identifiers, - Indian names with variants, - URLs, - hashtags, - prices, - mixed Hindi-English text, - typos, - new slang. Vocabulary demand shoots up. Let us try three repairs.

Attempt 1 — lowercase everything

Play, PLAY, and play collapse together. Good. But play, player, and replay are still different words. New words still appear daily.

Attempt 2 — stem or normalize aggressively

Now you may map playing to play. Helpful sometimes. But meaning can leak. policy and police are not near twins. US and us differ. RBI should not become mush. Aggressive normalization saves vocabulary by losing precision.

Attempt 3 — use [UNK] for unknowns

This is the classic bandage. Vocabulary stays fixed. But rare words collapse into one bucket. ChatGPT-4o, Gemma, Qwen, and Llama may all become [UNK]. That destroys identity. The office receives different visitors. The badge board gives them the same blank badge. That is disastrous for meaning.

Why the failure is fundamental

Word boundaries are too rigid for an open vocabulary world. Language keeps inventing forms. Software keeps inventing identifiers. Users keep making typos. Domains keep mixing symbols. A fixed word list cannot stay fresh enough. Word-level tokenization gives short sequences. But it cannot generalize compositionally. That is the core break.

2.4 Subword tokenization: the practical compromise

Now the practical idea appears. Do not split too finely. Do not split too coarsely. Split into reusable pieces. Common words may stay whole. Rare words may break into known parts. Examples: - playing -> play + ing - unhappiness -> un + happi + ness - ChatGPT-4o -> Chat + GPT + - + 4 + o - tokenizers -> token + ize + r + s Now we get two wins. Open-vocabulary flexibility. And much shorter sequences than characters. This is where BPE enters. BPE teaches the splitter which chunks deserve to stay together. It learns from frequency. Repeated neighbors get merged. Rare patterns stay split. Simple, no?

2.5 BPE worked example with actual merge steps

Let us do a real toy run. No hand-waving. Training corpus: - token - tokens - tokenize - tokenizer Start with character tokens. We mark end of word as _. So the corpus becomes:

t o k e n _
t o k e n s _
t o k e n i z e _
t o k e n i z e r _
Now count adjacent pairs. Initial frequent pairs include: - t o -> 4 - o k -> 4 - k e -> 4 - e n -> 4 - n _ -> 1 - n s -> 1 - n i -> 2 - i z -> 2 - z e -> 2 - e _ -> 1 - e r -> 1 - r _ -> 1 Suppose tie-break picks t o first.

Merge step 1

Merge t o -> to Corpus now:

to k e n _
to k e n s _
to k e n i z e _
to k e n i z e r _
Frequent pairs now include: - to k -> 4 - k e -> 4 - e n -> 4 - n i -> 2 - i z -> 2

Merge step 2

Merge to k -> tok Corpus now:

tok e n _
tok e n s _
tok e n i z e _
tok e n i z e r _

Merge step 3

Merge tok e -> toke Corpus now:

toke n _
toke n s _
toke n i z e _
toke n i z e r _

Merge step 4

Merge toke n -> token Corpus now:

token _
token s _
token i z e _
token i z e r _
Already we learned a meaningful reusable chunk. That is the splitter becoming smarter. Now pair counts include: - token _ -> 1 - token s -> 1 - token i -> 2 - i z -> 2 - z e -> 2 - e _ -> 1 - e r -> 1

Merge step 5

Three pairs tie at count 2: token i, i z, and z e. Suppose the tie-break picks i z. Merge i z -> iz. Corpus now:

token _
token s _
token iz e _
token iz e r _

Merge step 6

Merge iz e -> ize Corpus now:

token _
token s _
token ize _
token ize r _

Merge step 7

Merge token ize -> tokenize Corpus now:

token _
token s _
tokenize _
tokenize r _
Now we have useful chunks: - token - tokenize - s - r - _ See what happened. The splitter did not memorize every final word separately. It learned reusable building blocks from frequency. That is the beauty.
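The walkthrough can be replayed mechanically. The sketch below counts adjacent pairs and applies the seven merges in the order used above. The function names count_pairs and apply_merge are illustrative. Because ties were broken by hand in the walkthrough, the merge list is replayed here rather than re-derived by an automatic tie-break rule.

```python
from collections import Counter


def count_pairs(corpus):
    """Count adjacent symbol pairs across all words in the corpus."""
    counts = Counter()
    for word in corpus:
        for left, right in zip(word, word[1:]):
            counts[(left, right)] += 1
    return counts


def apply_merge(word, pair):
    """Fuse every adjacent occurrence of `pair` in `word` into one symbol."""
    merged, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            merged.append(word[i] + word[i + 1])
            i += 2
        else:
            merged.append(word[i])
            i += 1
    return merged


# Training corpus from the walkthrough, with "_" as the end-of-word marker.
corpus = [list(w) + ["_"] for w in ["token", "tokens", "tokenize", "tokenizer"]]
initial_counts = count_pairs(corpus)
print(initial_counts[("t", "o")], initial_counts[("n", "i")])

# The seven merges chosen in the walkthrough, in order.
MERGES = [("t", "o"), ("to", "k"), ("tok", "e"), ("toke", "n"),
          ("i", "z"), ("iz", "e"), ("token", "ize")]

for pair in MERGES:
    corpus = [apply_merge(word, pair) for word in corpus]
    print(pair, "->", [" ".join(word) for word in corpus])
```

A real trainer would pick the most frequent pair itself at every step. Its tie-break rule is an implementation detail, so two trainers can learn different merge lists from the same corpus.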

What the merge list really is

Your trained BPE model is mainly: 1. the base vocabulary, 2. the ordered merge rules, 3. the final token-to-ID mapping. That ordered merge list matters. Earlier merges can enable later merges. If training learned token before tokenize, that order shapes encoding.

Retrieval prompt: Without looking above, replay the exact path from t o k e n to token, and explain why BPE needs merge order, not only final tokens.

2.6 Encoding a new word with learned merges

Now test a new string. tokenizers Start at character level:

t o k e n i z e r s _
Apply learned merges in order. t o becomes to. to k becomes tok. tok e becomes toke. toke n becomes token. The string is now token i z e r s _. Then i z becomes iz, and iz e becomes ize, giving token ize r s _. The last rule, token ize -> tokenize, fires only if those two pieces are adjacent. Here they are, so it fires. After all merges we get:
tokenize r s _
So the unseen word tokenizers is still representable. No [UNK] disaster. No character-by-character explosion. The splitter uses pieces it already trusts. This is why subword tokenization generalizes. Not perfectly. But practically well.
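Tracing the merges mechanically is a useful check on the hand trace. The sketch below replays the seven toy merges in training order on the unseen word. The encode function is a simplified illustration, not the API of any tokenizer library. After the first six merges the pieces are token ize r s _, and the seventh merge then joins token and ize because they sit next to each other.

```python
def encode(word, merges):
    """Toy BPE encoding: replay the learned merges in training order."""
    symbols = list(word) + ["_"]  # "_" marks end of word, as in the toy corpus
    for left, right in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols


# Ordered merge list learned in the toy training run.
MERGES = [("t", "o"), ("to", "k"), ("tok", "e"), ("toke", "n"),
          ("i", "z"), ("iz", "e"), ("token", "ize")]

after_six = encode("tokenizers", MERGES[:6])
final = encode("tokenizers", MERGES)
print(after_six)
print(final)
```

Dropping or reordering a rule changes the result, which is why the merge list must be stored in order.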

2.7 Tokenization knobs you will touch in production

In real systems, tokenization is a cost lever. It is also a quality lever. Important knobs include: - vocabulary size, - normalization rules, - byte-level vs unicode-level base units, - whitespace handling, - special tokens, - multilingual coverage, - tokenizer-model mismatch. A few examples.

Vocabulary size

Small vocabulary means more splits. More splits mean longer prompts. Longer prompts mean higher attention cost. But a smaller vocabulary also means a smaller output softmax layer. Large vocabulary means fewer splits. Good. But it makes embeddings and output projection larger. So what to do? Choose a middle path for the target domain.

Byte-level tokenization

Byte-level BPE can represent any text byte. That is robust. It avoids true unknowns. But some scripts may tokenize less elegantly. Costs can rise for certain languages.

Tokenizer mismatch

You chunk documents with one tokenizer. The deployed model uses another. Now your "500-token chunk" becomes 720 at inference. Latency jumps. Truncation appears. This happens more often than people admit. That is why the splitter matters operationally.

Chapter 3 — Embeddings & position

3.1 Token IDs are addresses, not meanings

After tokenization, you get IDs. Example:

The      -> 11
server   -> 582
failed   -> 913
again    -> 207
Do not romanticize these IDs. ID 582 is not "larger meaning" than ID 11. IDs are addresses. Nothing more. The badge number itself carries no geometry. If you feed raw IDs to a neural network, ordinal nonsense leaks in. Token 913 is not semantically near 914 just because numbers are close. So what to do? Use the badge board. That is the embedding table. Each ID indexes a learned vector. Now geometry appears. Nearby vectors can express related roles. Far vectors can express different roles.

3.2 Embedding lookup as matrix indexing

Mental picture first. Imagine a giant cabinet. Each drawer number is a token ID. Open drawer 582. Inside is a d-dimensional card. That card is the embedding vector. Simple, no? Mathematically, the cabinet is a matrix. If vocabulary size is V and embedding width is d_model, then:

E has shape [V, d_model]
Suppose V = 6 and d_model = 4. A tiny embedding table might be:
ID   token     embedding row
0    <pad>     [ 0.0,  0.0,  0.0,  0.0]
1    the       [ 0.2, -0.1,  0.4,  0.0]
2    server    [ 1.1,  0.7, -0.3,  0.5]
3    failed    [ 0.9,  1.0, -0.8,  0.2]
4    again     [ 0.3,  0.2,  0.1, -0.4]
5    error     [ 1.0,  0.8, -0.6,  0.3]
Input token IDs:
[1, 2, 3, 4]
Embedding lookup means:
x_1 = E[1]
x_2 = E[2]
x_3 = E[3]
x_4 = E[4]
So the sequence becomes:
[ 0.2, -0.1,  0.4,  0.0]
[ 1.1,  0.7, -0.3,  0.5]
[ 0.9,  1.0, -0.8,  0.2]
[ 0.3,  0.2,  0.1, -0.4]
That is it. Embedding lookup is matrix indexing. Not magic. Not a mysterious semantic oracle. A learned table. Later training shapes it. The badge board is learned because the model keeps adjusting drawer contents.

One-hot picture, then lookup picture

You can also think in one-hot terms. Token server with ID 2 becomes:

[0, 0, 1, 0, 0, 0]
Then:
one_hot @ E = E[2]
So lookup and one-hot multiplication are equivalent. Lookup is just faster and cleaner.
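The equivalence is quick to confirm with the tiny table from above. This is a plain-Python sketch. The helpers lookup, one_hot, and vec_mat are illustrative names, and a real framework would use its own embedding layer instead.

```python
# Tiny embedding table from the example: V = 6 tokens, d_model = 4.
E = [
    [0.0, 0.0, 0.0, 0.0],    # 0 <pad>
    [0.2, -0.1, 0.4, 0.0],   # 1 the
    [1.1, 0.7, -0.3, 0.5],   # 2 server
    [0.9, 1.0, -0.8, 0.2],   # 3 failed
    [0.3, 0.2, 0.1, -0.4],   # 4 again
    [1.0, 0.8, -0.6, 0.3],   # 5 error
]


def lookup(ids, table):
    """Embedding lookup is row indexing."""
    return [table[i] for i in ids]


def one_hot(index, size):
    return [1.0 if j == index else 0.0 for j in range(size)]


def vec_mat(vec, mat):
    """Row vector times matrix."""
    return [sum(vec[r] * mat[r][c] for r in range(len(mat))) for c in range(len(mat[0]))]


sequence = lookup([1, 2, 3, 4], E)
via_one_hot = vec_mat(one_hot(2, len(E)), E)

print(sequence)
print(via_one_hot == E[2])
```

The one-hot route touches every row of the table. The lookup route touches one. That is the whole reason lookup is preferred.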

3.3 Bag-of-words loses order

Now let us create the next failure. Suppose we have embeddings for words. Great. Can we just average them? That gives a bag-of-words representation. Try two sentences: 1. dog bites man 2. man bites dog Bag-of-words sees the same multiset. Same words. Same counts. So average embedding becomes nearly the same. But meaning is reversed. Let us try three rescue attempts.

Attempt 1 — bigger embeddings

Make each word vector wider. Still no order. A richer badge board cannot invent seat number by itself.

Attempt 2 — average plus max pooling

Now you keep two summaries. Still no order. You still know what appeared. Not where it appeared.

Attempt 3 — add n-grams everywhere

Now include bigrams and trigrams. This helps partly. But feature space explodes. Coverage stays brittle. You are manually patching missing order.

Why the failure is fundamental

A bag ignores arrangement. Language meaning depends on arrangement. So the office needs seat numbers. Without seat numbers, the employees know who came. They do not know where anyone sat.
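The order-blindness can be shown in a few lines. The three word vectors below are made-up values chosen only for illustration, and mean_pool is a hypothetical helper.

```python
import math

# Hypothetical 3-dimensional word vectors (illustrative values, not learned).
EMB = {
    "dog":   [1.0, 0.0, 0.5],
    "bites": [0.0, 1.0, 0.25],
    "man":   [0.5, 0.25, 1.0],
}


def mean_pool(tokens, emb):
    """Average the word vectors, ignoring where each word sits."""
    dim = len(next(iter(emb.values())))
    return [sum(emb[t][d] for t in tokens) / len(tokens) for d in range(dim)]


a = mean_pool("dog bites man".split(), EMB)
b = mean_pool("man bites dog".split(), EMB)

# Compare with a tolerance, since float sums can differ in the last bit.
same = all(math.isclose(x, y) for x, y in zip(a, b))
print(a, b, same)
```

Both sentences land on the same point. No amount of widening the vectors changes that, because addition does not care about order.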

3.4 Adding position information

So what to do? Add position to each token vector. Very common recipe:

input_vector = token_embedding + position_embedding
This is the seat number joining the badge board. Picture:
token:      server  -> [1.1, 0.7, -0.3, 0.5]
position:   pos=2   -> [0.0, 0.2,  0.1, 0.4]
sum:                 [1.1, 0.9, -0.2, 0.9]
Now server at position 2 differs from server at position 9. That is enough for attention layers to notice order-sensitive patterns. Three broad styles exist:
- learned absolute position embeddings,
- formula-based encodings such as sinusoidal,
- relative methods such as RoPE or ALiBi.

Let us first understand the geometric picture.

3.5 Sinusoidal encoding: picture first

Before formula, imagine many clock hands. Each pair of embedding dimensions is one clock. Different clocks rotate at different speeds. Position 0 means all clocks start at a reference angle. Position 1 rotates each clock a little. Position 2 rotates again. Nearby positions get nearby angle patterns. Far positions get different patterns. That is the intuition. Not random numbers. Coordinated rotations. Now the formula.

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
For a tiny case, take d_model = 4. Then we have two clocks. One fast clock. One slow clock. Approximate values:
pos 0 -> [0.00,  1.00, 0.000, 1.000]
pos 1 -> [0.84,  0.54, 0.010, 1.000]
pos 2 -> [0.91, -0.42, 0.020, 1.000]
pos 3 -> [0.14, -0.99, 0.030, 1.000]
See the pattern. The first pair changes quickly. The second pair changes slowly. This gives both local and broad position signals.
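The table can be regenerated directly from the formula. This is a minimal sketch, assuming an even d_model. The function name sinusoidal_pe is illustrative.

```python
import math


def sinusoidal_pe(pos, d_model):
    """Sinusoidal position encoding for one position (d_model assumed even)."""
    pe = []
    for i in range(d_model // 2):
        # Each pair (2i, 2i+1) is one "clock" with its own rotation speed.
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe


table = [sinusoidal_pe(pos, 4) for pos in range(4)]
for pos, row in enumerate(table):
    print(pos, [round(x, 3) for x in row])
```

With d_model = 4 the second clock turns 100 times slower than the first, so its cosine stays within about 0.0005 of 1 over these four positions.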

Why this helps attention

If each token carries a rotational seat number, then the spotlight beam can compare content plus place. The word bank at position 2 is distinguishable from bank at position 20. The model can also infer relative differences from angle relations. That is the useful part.

3.6 What feels uncomfortable about sinusoidal and learned absolute positions

Now the next failure appears. Absolute positions help. But they are not the end.

Learned absolute embeddings problem

You learn a table up to max length 2048. What about token 2049? No learned row exists. Maybe you extend the table. But those new rows were not trained well. Maybe you interpolate. Sometimes okay. Often brittle.

Sinusoidal problem

Formula gives values for any position. So extrapolation seems free. But the model still trains on a finite range. It may not use far positions gracefully. Also, absolute addition can feel indirect. We want relative distance to influence attention more naturally. Let us try three fixes mentally.

Attempt 1 — just train with longer context

Helpful. But memory cost rises sharply. Training gets expensive.

Attempt 2 — increase maximum learned positions

You get more rows. But you still rely on absolute slots. Generalization can remain weak.

Attempt 3 — keep sinusoidal and hope

Sometimes okay for moderate extension. Not always enough for strong long-context behavior. So modern practice looks for more relative structure. That leads to RoPE and ALiBi.

3.7 RoPE and ALiBi: two practical fixes

RoPE mental picture

Take every 2D pair in Q and K. Rotate it by an angle based on position. Not the token embedding alone. The query and key directions themselves rotate. Then their dot product depends on relative angle. That means distance information enters attention more directly. Picture one 2D pair. Base vector:

[1, 0]
Rotate query at position 1 by 30 degrees. Rotate key at position 3 by 90 degrees. Now:
q_rot ≈ [0.866, 0.500]
k_rot ≈ [0.000, 1.000]
Dot product:
0.866*0 + 0.500*1 = 0.500
Relative angle drove the score. That is the point. RoPE lets the scorecard feel distance through rotation.
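The same numbers fall out of a small rotation routine. This sketch assumes a fixed step of 30 degrees per position to match the example. Real RoPE uses a different frequency for each 2D pair, so STEP_DEGREES and rope_score are illustrative only.

```python
import math


def rotate(vec, degrees):
    """Rotate a 2D vector counter-clockwise by the given angle."""
    theta = math.radians(degrees)
    x, y = vec
    return [x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


STEP_DEGREES = 30  # assumed rotation per position, chosen to match the example


def rope_score(q, k, q_pos, k_pos):
    """Dot product after rotating query and key by their own positions."""
    return dot(rotate(q, q_pos * STEP_DEGREES), rotate(k, k_pos * STEP_DEGREES))


base = [1.0, 0.0]
near_start = rope_score(base, base, 1, 3)    # positions 1 and 3, gap of 2
far_along = rope_score(base, base, 11, 13)   # positions 11 and 13, same gap of 2
print(round(near_start, 3), round(far_along, 3))
```

Both calls give the same score because only the gap between positions matters. That is the relative-position property in miniature.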

ALiBi mental picture

ALiBi is simpler. Keep normal attention scores. Then subtract a distance-based penalty. Farther tokens get a larger negative bias. Example. Query at position 4. Raw attention scores to positions 1,2,3,4:

[8.0, 7.0, 6.0, 5.0]
Distances are:
[3, 2, 1, 0]
Let slope = 0.5. Biases are:
[-1.5, -1.0, -0.5, 0.0]
Adjusted scores become:
[6.5, 6.0, 5.5, 5.0]
Now nearer positions receive gentler treatment. Simple, no?
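The bias arithmetic is a one-liner. This sketch assumes 1-based key positions and a single slope, as in the example. The function name alibi_adjust is illustrative.

```python
def alibi_adjust(scores, query_pos, slope):
    """Subtract slope * distance from each raw score (key positions are 1-based)."""
    return [s - slope * (query_pos - key_pos)
            for key_pos, s in enumerate(scores, start=1)]


raw = [8.0, 7.0, 6.0, 5.0]
adjusted = alibi_adjust(raw, query_pos=4, slope=0.5)
print(adjusted)
```

The gap between the farthest and nearest score shrinks from 3.0 to 1.5. The penalty grows with distance and needs no learned position table.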

When to remember which one

RoPE:
- common in modern decoder models,
- relative information enters through rotated Q and K,
- often strong for long-context extensions.

ALiBi:
- simpler bias idea,
- distance penalty is explicit,
- also good for length generalization.

Retrieval prompt: Explain to yourself why dog bites man and man bites dog look identical to bag-of-words, and then say how the seat number repairs that failure.

Chapter 4 — Attention mechanism

4.1 The RNN bottleneck

Now we reach the main stage. Before attention, sequence models often processed tokens one by one. Imagine an employee relay race. Token 1 whispers to token 2. Token 2 whispers to token 3. And so on. By the time token 30 arrives, earlier details may fade. That is the bottleneck. A single hidden state must carry too much. Take this sentence: The contract with the vendor from Pune was delayed because the payment approval was missing. When the model reaches missing, it should connect back to payment approval. In a strict chain, that signal travels through many hops. Let us try three repairs.

Attempt 1 — larger hidden state

Good idea. Store more information. But the path is still long. Important signal still travels through many steps.

Attempt 2 — deeper RNN

More compute. Sometimes better expressivity. But dependency path remains sequential and long. Training stays slow.

Attempt 3 — bidirectional context

Useful for some tasks. But sequential recurrence still limits parallelism. And bottleneck compression still exists.

Why the failure is fundamental

One narrow pipe carries the entire past. That is the problem. We want direct access. Not relay-race memory. So what to do? Let each token inspect other tokens directly. That is attention.

4.2 Attention as a soft lookup

Mental picture first. Each employee holds a question. "Who among the others should I consult?" That question forms the spotlight beam. Each colleague presents a label describing what they contain. That is the key. Each colleague also presents the information they can share. That is the value. Then the scorecard assigns weights. The token does not pick exactly one colleague. It does a soft lookup. Mostly this person. A little that person. Almost none from another. Picture:

query token: "it"

looks at:
[animal]  [road]  [cross]  [tired]
   |         |       |        |
 score     score   score    score
 0.72      0.05    0.08     0.15
Then it receives a weighted mixture. That mixture helps resolve meaning. No one relay state had to carry everything. Direct lookup. Simple, no?

4.3 Self-attention means every word queries every other word

This phrase is easy to say. It must become concrete. Take four tokens:

[The] [server] [failed] [again]
Self-attention means each token can ask about every token. So the scorecard is a full grid.
            keys
          T   server failed again
query T   ?      ?      ?     ?
server    ?      ?      ?     ?
failed    ?      ?      ?     ?
again     ?      ?      ?     ?
failed may care most about server. again may care most about failed. The may care weakly about most things. This grid is why attention is powerful. It is also why cost rises with sequence length. For L tokens, score grid size is L x L. That is the price of direct consultation.

Q, K, V in office language

  • Query = what I am looking for.
  • Key = what kind of information I advertise.
  • Value = what information I will contribute.

Same token can emit all three. That is why it is called self-attention. Every employee both asks and answers.

4.4 Scaled dot-product attention with numbers

Now formula time. But only after the picture. Attention score between a query and a key is their similarity. Dot product is the similarity measure. Then we normalize scores with softmax. Then we mix the values. Formula:

Attention(Q, K, V) = softmax(QK^T / √d_k) V
Let one query be:
q = [2, 1]
Three keys:
k1 = [ 2, 1]
k2 = [ 0, 2]
k3 = [-1, 1]
Three values:
v1 = [10, 0]
v2 = [ 0, 8]
v3 = [ 1, 1]

Step 1 — raw dot products

q·k1 = 2*2 + 1*1  = 5
q·k2 = 2*0 + 1*2  = 2
q·k3 = 2*(-1)+1*1 = -1
So raw scores are:
[5, 2, -1]

Step 2 — scale by √d_k

Here d_k = 2. So √d_k ≈ 1.414. Scaled scores:

[3.54, 1.41, -0.71]

Step 3 — softmax to get weights

Approximate softmax gives:

[0.88, 0.11, 0.01]
This is the scorecard. Most weight goes to key 1. A little goes to key 2. Almost none goes to key 3.

Step 4 — weighted sum of values

output = 0.88*v1 + 0.11*v2 + 0.01*v3
Compute coordinate-wise: First coordinate:
0.88*10 + 0.11*0 + 0.01*1 = 8.81
Second coordinate:
0.88*0 + 0.11*8 + 0.01*1 = 0.89
So output is approximately:
[8.81, 0.89]
See the meaning. The output mostly carries information from v1. Because k1 matched the query best. That is soft lookup done numerically.
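The four steps translate directly into code. This is a plain-Python sketch for a single query. The names softmax and attend are illustrative helpers, not a library API.

```python
import math


def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """Scaled dot-product attention for one query vector."""
    d_k = len(q)
    # Steps 1 and 2: raw dot products, then divide by sqrt(d_k).
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
    # Step 3: softmax turns scores into the scorecard.
    weights = softmax(scores)
    # Step 4: weighted sum of values, one coordinate at a time.
    output = [sum(w * v[j] for w, v in zip(weights, values))
              for j in range(len(values[0]))]
    return weights, output


q = [2, 1]
keys = [[2, 1], [0, 2], [-1, 1]]
values = [[10, 0], [0, 8], [1, 1]]

weights, output = attend(q, keys, values)
print([round(w, 2) for w in weights])
print([round(o, 2) for o in output])
```

At full precision the output comes to about [8.83, 0.86]. The hand calculation used weights rounded to two decimals, which is why its figures differ slightly.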

Retrieval prompt: Recompute the previous example from memory. Raw scores first. Then scaling. Then softmax shape. Then the weighted sum story.

4.5 Why divide by √d_k

This detail is small on paper. It is large in practice. Mental picture first. If each query and key has many coordinates, then even random vectors can produce large dot products. Large scores make softmax too peaky. Peaky softmax gives tiny gradients for losers. The scorecard becomes overconfident too early. Let us make it concrete. Suppose d_k = 64. Imagine raw scores:

[15, 12, 9]
Softmax of this is extremely sharp. Approximate weights:
[0.95, 0.05, 0.00]
Now divide by √64 = 8. Scaled scores:
[1.875, 1.5, 1.125]
Softmax now is milder. Approximate weights:
[0.46, 0.32, 0.22]
Much healthier. The model can still learn preference. But it does not saturate instantly.
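The two scorecards can be reproduced with a short softmax routine. This is a minimal sketch using the same raw scores and d_k = 64.

```python
import math


def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


raw = [15.0, 12.0, 9.0]
d_k = 64

sharp = softmax(raw)                                # no scaling
mild = softmax([s / math.sqrt(d_k) for s in raw])   # divide by sqrt(64) = 8

print([round(w, 2) for w in sharp])
print([round(w, 2) for w in mild])
```

The ranking of the three keys is unchanged by scaling. Only the confidence of the scorecard changes.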

The variance intuition

If query and key coordinates each have mean 0 and variance 1, then the dot product sums d_k random products. Variance grows with d_k. Typical magnitude grows like √d_k. So dividing by √d_k stabilizes scale. Simple, no?
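A quick simulation makes the growth visible. This sketch assumes independent standard normal coordinates for both vectors. The trial count and seed are arbitrary choices, and dot_std is an illustrative helper.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable


def dot_std(d_k, trials=2000):
    """Empirical standard deviation of q.k for random q and k of width d_k."""
    dots = []
    for _ in range(trials):
        q = [random.gauss(0, 1) for _ in range(d_k)]
        k = [random.gauss(0, 1) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    return statistics.pstdev(dots)


results = {d_k: dot_std(d_k) for d_k in (4, 16, 64)}
for d_k, std in results.items():
    # The last column should hover near 1 if the spread grows like sqrt(d_k).
    print(d_k, round(std, 2), round(std / d_k ** 0.5, 2))
```

The measured spread should track √d_k closely, so dividing by √d_k brings every width back to roughly unit scale.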

Three bad alternatives

Maybe you think: - clip scores, - lower learning rate, - shrink initialization. These may help symptoms. But they do not fix the core scaling law. The dimension itself inflates raw score magnitude. That is why the formula uses √d_k directly.

4.6 Causal masking in one clean picture

One more practical piece. Decoder language models should not see future tokens. Position 4 is trained to predict token 5, so position 4 must not peek at token 5 or anything after it. So we mask the future. Picture for four tokens:

          key positions
          1   2   3   4
query 1   ✓   x   x   x
query 2   ✓   ✓   x   x
query 3   ✓   ✓   ✓   x
query 4   ✓   ✓   ✓   ✓
Masked positions get a huge negative score. After softmax, their weight becomes near zero. This keeps generation honest. The employee can consult earlier desks. Not future desks.
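The mask is a small change to the softmax step. This sketch uses negative infinity for masked cells and made-up raw scores. The function name causal_softmax is illustrative.

```python
import math


def causal_softmax(score_rows):
    """Row-wise softmax where query i may only attend to keys 0..i."""
    weights = []
    for i, row in enumerate(score_rows):
        # Future keys (j > i) get -inf, so exp() maps them to exactly 0.
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        m = max(masked)  # always finite: the diagonal is never masked
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Hypothetical raw scores for four tokens (rows are queries, columns are keys).
scores = [
    [1.0, 2.0, 0.5, 0.1],
    [0.3, 1.2, 2.0, 0.4],
    [0.8, 0.1, 1.5, 2.2],
    [0.2, 0.9, 0.4, 1.1],
]

weights = causal_softmax(scores)
for row in weights:
    print([round(w, 2) for w in row])
```

The first query can only attend to itself, so its row is all weight on position 1. Each later row spreads weight over one more position.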

Chapter 5 — Multi-head attention & putting it together

5.1 Why one head is not enough

Suppose one spotlight beam serves the whole office. That single beam must track: - pronoun resolution, - subject-verb agreement, - negation, - long-range entity links, - local phrase structure, - punctuation cues. Possible? Maybe partly. Comfortable? Not really. Let us try three naive rescues.

Attempt 1 — one very wide head

Now one head has more dimensions. But it still produces one score pattern per token. One pattern must satisfy many roles simultaneously.

Attempt 2 — stack more layers without splitting heads

Useful. But each layer still mixes through a single attention pattern. Specialization is limited.

Attempt 3 — hope the feed-forward layer disentangles everything

Feed-forward helps after mixing. But if the wrong tokens were mixed together, later cleanup is harder.

The real fix

Use multiple heads. Each head gets its own Q, K, V projections. So each head can learn a different consultation habit. One head may look locally. One may look for matching entity mentions. One may track syntax. One may watch delimiters in code. This is why multi-head attention exists. Not for decoration. For factorized search patterns.

5.2 Tiny multi-head example

Let d_model = 8. Use h = 2 heads. Then each head uses d_k = d_v = 4. Input token vector x is 8-dimensional. Each head has separate projection matrices:

head 1: W_Q1, W_K1, W_V1
head 2: W_Q2, W_K2, W_V2
Same token sequence enters both heads. But after projection, each head sees a different subspace. Picture:
input x
  |\
  | \__ head 1 projections --> attention pattern A --> output o1
  |
  \____ head 2 projections --> attention pattern B --> output o2

concat(o1, o2) --> W_O --> final mixed output
Suppose token it appears in a sentence. Head 1 may produce weights:
[it -> animal] = 0.70
[it -> road]   = 0.05
[it -> tired]  = 0.15
[it -> itself] = 0.10
Head 2 may produce weights:
[it -> animal] = 0.20
[it -> road]   = 0.10
[it -> tired]  = 0.55
[it -> itself] = 0.15
See. Head 1 may chase entity identity. Head 2 may chase causal description. Their outputs are different. After concatenation and output projection, the model gets a richer combined vector. That is the motivation.

Tiny numeric feel

Suppose head outputs are:

o1 = [1, 2, 0, 1]
o2 = [0, 1, 3, 1]
Concatenate:
[o1 ; o2] = [1, 2, 0, 1, 0, 1, 3, 1]
Then apply output matrix W_O. This mixes the specialized head outputs back into d_model space.

5.3 Full pass from raw text to contextual vectors

Now let us put the entire chain together. Sentence: The tokenizer reduced cost

Step 1 — the splitter

Possible subwords:

The | token | izer | reduced | cost

Step 2 — token IDs

The      -> 11
token    -> 417
izer     -> 982
reduced  -> 233
cost     -> 901

Step 3 — the badge board

Each ID indexes the embedding table. Now we have five vectors.

Step 4 — the seat number

Add position information for slots 1 to 5. Now a token at slot 2 differs from the same token at slot 9 of a longer input.

Step 5 — create Q, K, V

For each token vector x, compute:

q = xW_Q
k = xW_K
v = xW_V
Each head uses its own matrices.

Step 6 — the spotlight beam and scorecard

For a given query token, compute similarity against all keys. Scale. Mask if needed. Softmax. Now you have attention weights.

Step 7 — weighted sum of values

The token gathers information from other tokens. Now reduced may attend to cost strongly. Now izer may attend strongly to token.
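
As a small worked sketch, suppose the scorecard for reduced over the five tokens is the row below. Both the weights and the 2-dimensional value vectors are invented for illustration.

```python
# Tokens: The | token | izer | reduced | cost
weights = [0.05, 0.10, 0.10, 0.15, 0.60]  # hypothetical scorecard row for "reduced"
values = [[1.0, 0.0],   # The
          [0.0, 1.0],   # token
          [0.0, 1.0],   # izer
          [1.0, 1.0],   # reduced
          [2.0, 0.0]]   # cost

# Output = sum over tokens of weight * value, one dimension at a time.
output = [sum(w * v[j] for w, v in zip(weights, values))
          for j in range(len(values[0]))]

print([round(x, 2) for x in output])  # [1.4, 0.35]
```

Because cost carries weight 0.60, its value vector dominates the result.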

Step 8 — concatenate heads and mix

Each head contributes a partial view. Concatenate them. Apply output projection. Now each token has a contextual vector. Contextual means: same token surface, different meaning depending on neighbors. That is the whole office workflow.

raw text
  -> split into subwords
  -> map to IDs
  -> lookup embeddings
  -> add positions
  -> project to Q,K,V per head
  -> score every token against others
  -> softmax scorecard
  -> weighted value mix
  -> combine heads
  -> richer contextual vectors

Retrieval prompt: Starting from raw text, narrate the full pipeline using the five placeholders: splitter, badge board, seat number, spotlight beam, scorecard.

5.4 What changes in production

In notebooks, attention looks clean. In production, many knobs appear.

Longer prompts cost more than linearly

When token count doubles, the number of attention score cells quadruples, because every token is scored against every token. This surprises teams who only count characters.
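
The arithmetic is easy to check. The sketch below counts score cells for one head in one layer under full (unmasked) self-attention; the function name score_cells is illustrative.

```python
def score_cells(num_tokens):
    """Cells in the attention score matrix for one head in one layer."""
    return num_tokens * num_tokens

for n in (1000, 2000, 4000):
    print(n, score_cells(n))
# 1000 1000000
# 2000 4000000
# 4000 16000000
```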

KV cache matters during generation

At inference, decoder models cache old keys and values. Why? So they do not recompute the past every step. This reduces repeated work. But cache memory then becomes a real serving cost.
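
A rough way to size that cost: the cache stores one key vector and one value vector per token, per layer, per key/value head. The sketch below applies that count. Every number in the example configuration is hypothetical and does not describe any particular model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Approximate KV cache size: 2 tensors (K and V) per layer."""
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)

# Hypothetical config: 32 layers, 32 KV heads of size 128, 16-bit values.
size = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
print(size / 1024**3, "GiB")  # 2.0 GiB
```

The size grows linearly with sequence length and with batch size, which is why long conversations and high concurrency both press on serving memory.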

Head behavior is useful, but not perfectly interpretable

People love attention maps. They can be informative. They are not a full explanation of reasoning. Do not overclaim.

Position handling affects long-context promises

RoPE scaling tricks may extend context. But quality can still degrade in the middle or far tail. Marketing context length and usable context length are not identical.

Honest admission — what still feels unsolved

Tokenization and attention are powerful. They are also uncomfortable in real life. A few honest points.

Tokenization is still a compromise

There is no universally perfect splitter. A tokenizer that works beautifully for English prose may behave poorly on code or mixed-script chat. A tokenizer that is byte-robust may waste tokens on some languages. Subword boundaries are engineering decisions. Not natural laws.

Attention is expensive

Full self-attention gives beautiful direct access. It also costs an L x L score matrix for a sequence of L tokens. That is why long-context serving remains expensive. Sparse tricks help. Chunking helps. Caching helps. None remove the core trade-off completely.

Attention weights are not perfect explanations

If a head attends strongly somewhere, that is evidence of information flow. It is not the whole causal story of the model's answer. Use attention maps carefully.

Long-context generalization is still messy

RoPE and ALiBi improve things. They do not make infinite context free. Models can still forget early details. Or over-focus on recency. Or show lost-in-the-middle behavior.

Multilingual fairness is hard

Some languages consume more tokens for the same meaning. That raises cost. It can also change effective context budget. So the splitter is not just technical. It can affect product fairness.

Chapter 6 — Recap & application

6.1 Failure-fix chain table

| # | Failure | Fix |
|---|---------|-----|
| 1 | Raw text is not numeric | Tokenize into reusable pieces |
| 2 | Character tokens make sequences too long | Move to larger reusable units |
| 3 | Word tokens break on unknown words | Use subword tokenization |
| 4 | Static whole-word vocab loses composition | Learn merges like BPE |
| 5 | Token IDs have no geometry | Use embedding lookup table |
| 6 | Bag-of-words loses order | Add positional information |
| 7 | Absolute positions can feel brittle at long context | Use relative-friendly methods like RoPE or ALiBi |
| 8 | RNN-style recurrence compresses too much into one path | Let tokens attend directly to other tokens |
| 9 | Large dot products saturate softmax | Scale by √d_k |
| 10 | One attention pattern must do many jobs | Use multiple heads |

Keep this table in your head. Interview answers become cleaner when you speak in this chain.
### 6.2 Key points to remember
- A tokenizer is a compression-and-generalization device, not only a splitter.
- Character-level input avoids unknown words, but usually pays too much sequence cost.
- Word-level input shortens sequences, but breaks on open-vocabulary reality.
- BPE works because frequent neighbors often encode reusable meaning chunks.
- Merge order matters in BPE.
- Token IDs are addresses only.
- Embeddings are learned rows in a matrix.
- Position information is not optional in parallel token processing.
- Bag-of-words knows presence, not arrangement.
- Sinusoidal encoding is easiest to picture as many clocks rotating at different speeds.
- RoPE injects relative position through rotated queries and keys.
- ALiBi injects distance preference through score bias.
- Self-attention means every token queries every token in the sequence, itself included.
- Attention output is a weighted sum of values.
- √d_k is about score scale stability, not cosmetic math.
- Multi-head attention creates specialized consultation patterns.
- More tokens cost more than extra embeddings: they also cost more pairwise attention work.
### 6.3 Interview questions
#### Q1. Why not use word-level tokenization for LLMs?
Because the real world is open-vocabulary. New words, code identifiers, product names, prices, typos, and mixed-script text appear constantly. Word-level vocabularies either explode or collapse many items into [UNK]. Subword tokenization keeps sequences manageable while preserving compositional generalization.
Common wrong answer to avoid: "Because BPE is newer and therefore better." That answer has no mechanism.
#### Q2. Explain BPE in one clean minute.
Start from base symbols, often bytes or characters. Count adjacent pairs in training text. Merge the most frequent pair. Repeat until the vocabulary reaches the desired size. The learned ordered merges create reusable subword units. Common patterns become one token. Rare patterns remain decomposable.
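
The loop above can be sketched in a few lines of plain Python. This is a toy version under stated assumptions: characters are the base symbols, the corpus is an invented word list (not the Chapter 2 corpus), and ties are broken by picking the alphabetically smallest pair, which is an arbitrary choice made here for determinism. The names learn_bpe, merge_pair, and encode are illustrative.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of pair with the joined symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_bpe(words, num_merges):
    """Return the ordered list of merges learned from a list of words."""
    corpus = Counter(tuple(w) for w in words)  # symbol sequence -> frequency
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair; ties go to the alphabetically smallest pair.
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges

def encode(word, merges):
    """Apply the learned merges to a new word, in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

words = ["token"] * 3 + ["tokens"] * 2 + ["broken"]
merges = learn_bpe(words, 4)
print(merges)  # [('e', 'n'), ('k', 'en'), ('o', 'ken'), ('t', 'oken')]
print(encode("tokenizer", merges))  # ['token', 'i', 'z', 'e', 'r']
```

The frequent pattern token collapses into one unit, while the unseen tail izer stays decomposed into characters.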
#### Q3. What exactly is an embedding layer?
A learned matrix of shape [vocab_size, d_model]. Input token IDs index rows of that matrix. So embedding lookup is matrix indexing. The output is a dense vector per token.
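
One way to see why this counts as matrix indexing: multiplying a one-hot vector by the embedding matrix returns exactly the selected row. The sketch below checks that on an invented 3 x 2 matrix.

```python
# Toy embedding matrix: vocab_size = 3, d_model = 2 (values are made up).
embedding = [[0.1, 0.2],
             [0.3, 0.4],
             [0.5, 0.6]]
token_id = 2

# Route 1: direct row indexing.
by_index = embedding[token_id]

# Route 2: one-hot row vector times the matrix.
one_hot = [1.0 if i == token_id else 0.0 for i in range(len(embedding))]
by_matmul = [sum(one_hot[i] * embedding[i][j] for i in range(len(embedding)))
             for j in range(len(embedding[0]))]

print(by_index, by_matmul)  # [0.5, 0.6] [0.5, 0.6]
```

Indexing gives the same answer without doing the multiplication, which is why embedding layers are implemented as lookups.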
#### Q4. Why do transformers need positional information?
Because self-attention compares tokens in parallel. Without position, the model knows which tokens exist. It does not know their arrangement. man bites dog and dog bites man contain the same tokens, so attention without a position signal has nothing to tell them apart. Seat number repairs that.
Common wrong answer to avoid: "Transformers somehow figure out order automatically from attention." Not without positional signal.
#### Q5. What is self-attention intuitively?
Each token asks every other token, "How relevant are you to my meaning right now?" Queries represent what a token seeks. Keys represent what each token offers. Values represent the information each token contributes. The output is a weighted sum of values.
#### Q6. Why do we divide attention scores by √d_k?
Because raw dot products grow in magnitude with dimension. If query and key components are independent with zero mean and unit variance, the dot product has standard deviation √d_k. Without scaling, softmax becomes too sharp. That hurts learning by shrinking useful gradients. Scaling stabilizes the scorecard.
Common wrong answer to avoid: "It is for normalization in the same sense as batch norm." No. It is specifically about dot-product scale.
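
A quick simulation can make the scale argument tangible. The sketch below assumes query and key components drawn independently from a standard normal distribution, which is an idealization rather than what trained models actually produce. The function name dot_std is illustrative.

```python
import math
import random
import statistics

random.seed(0)

def dot_std(d_k, trials=2000):
    """Empirical standard deviation of q.k for random d_k-dimensional q and k."""
    dots = []
    for _ in range(trials):
        q = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    return statistics.pstdev(dots)

for d_k in (4, 64):
    raw = dot_std(d_k)
    # Raw spread grows with d_k; after dividing by sqrt(d_k) it stays near 1.
    print(d_k, round(raw, 2), round(raw / math.sqrt(d_k), 2))
```

Under this assumption the raw spread is close to 2 for d_k = 4 and close to 8 for d_k = 64, while the scaled spread stays near 1 in both cases.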
#### Q7. Why multi-head attention instead of one big head?
Different relational patterns matter at the same time. A single head produces only one attention distribution per query token. Multiple heads let the model learn several consultation habits in parallel. One head may track syntax. Another may track entity reference. Another may track local neighborhoods.
### 6.4 Production experience
Now let us talk like an engineer shipping systems.
#### Failure mode 1 — token budget surprises
Symptom: A prompt that looked short in characters is expensive in tokens. Common causes:
- wrong tokenizer used for estimation,
- code blocks and JSON fragment heavily,
- multilingual text tokenizes longer than expected.
Knobs:
- estimate with the exact deployment tokenizer,
- compress boilerplate,
- chunk by tokens, not characters.
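
A minimal sketch of the last knob, chunking by tokens. The tokenize and detokenize arguments are hypothetical callables; the whitespace versions below are only stand-ins so the example runs. In a real system you would pass the deployment tokenizer's own encode and decode functions.

```python
def chunk_by_tokens(text, tokenize, detokenize, max_tokens):
    """Split text into chunks of at most max_tokens tokens each."""
    tokens = tokenize(text)
    return [detokenize(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

# Stand-in tokenizer for illustration only: split on whitespace.
def toy_tokenize(text):
    return text.split()

def toy_detokenize(tokens):
    return " ".join(tokens)

chunks = chunk_by_tokens("a b c d e f g", toy_tokenize, toy_detokenize, 3)
print(chunks)  # ['a b c', 'd e f', 'g']
```

Passing the tokenizer in as an argument keeps chunking, truncation, and cost estimation on the same splitter.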
#### Failure mode 2 — RAG chunking misalignment
Symptom: Chunks look okay during preprocessing but are truncated at inference. Cause: Chunker counted with a different splitter. Knob: Use the same tokenizer for chunking, truncation, and cost estimation.
#### Failure mode 3 — long-context latency blow-up
Symptom: The system becomes slow after adding large conversation history. Cause: Attention score work grows roughly quadratically with sequence length. Knobs:
- summarize old context,
- retrieve only relevant chunks,
- cache keys and values,
- reduce output length,
- avoid sending repeated instructions every turn.
#### Failure mode 4 — position extension hype
Symptom: A model advertises huge context but forgets crucial middle details. Cause: Usable long-context quality is weaker than nominal limit. Knobs:
- benchmark retrieval at various positions,
- test lost-in-the-middle cases,
- use reranking or recency-aware prompting,
- do not trust headline context length alone.
#### Failure mode 5 — bad special-token handling
Symptom: System prompts leak, formatting breaks, or tool calls misparse. Cause: Special separators and reserved tokens were mishandled. Knobs:
- verify exact chat template,
- inspect tokenized prompt,
- reserve delimiters carefully.
#### Failure mode 6 — headcount and dimension trade-offs
More heads are not always better. Tiny head dimension can starve each head. Too few heads can limit specialization. You tune within memory and latency budgets. That is real engineering.
#### Failure mode 7 — multilingual cost unfairness
Some users spend more context budget to express the same idea. That affects both price and quality. You should measure token inflation by language. That is a product decision, not only a model detail.
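
One way to start measuring is to count tokens for the same greeting in several languages. The sketch below uses raw UTF-8 bytes as a stand-in for a byte-level tokenizer with no learned merges, which is a worst case: real subword tokenizers narrow the gap, so the numbers should be re-measured with the deployment tokenizer. The sample strings and the name byte_tokens are illustrative.

```python
# Roughly the same greeting in three languages (illustrative samples).
samples = {
    "english": "hello",
    "hindi": "नमस्ते",
    "japanese": "こんにちは",
}

def byte_tokens(text):
    """Token count under a byte-level splitter with no merges (stand-in)."""
    return len(text.encode("utf-8"))

counts = {lang: byte_tokens(text) for lang, text in samples.items()}
baseline = counts["english"]

for lang, n in counts.items():
    # Inflation = tokens needed relative to the English sample.
    print(lang, n, round(n / baseline, 2))
```

Tracking this ratio per language shows which users pay more context budget for the same message.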
#### Cost intuition to remember
Input length affects:
- attention score matrix size,
- KV cache size,
- end-to-end latency,
- per-request memory pressure,
- throughput under concurrency.
So when you optimize prompts, you are often optimizing attention cost downstream.
### 6.5 Graded exercises
#### Easy — 5 minutes
1. Tokenize ChatGPT-4o-mini in three ways:
- character-level,
- word-level,
- plausible subword-level.
2. Write one sentence explaining why the word-level version is brittle.
3. Say which of the five placeholders handles this problem first.
#### Medium — 15 minutes
1. Recreate the BPE toy corpus from Chapter 2.
2. Compute the first four merges by hand.
3. Encode tokenizers using those learned merges.
4. Then explain why merge order matters.
Hint: Re-read the merge walk in Chapter 2.5.
#### Hard — 45 minutes
1. Create a 4-token sentence of your own.
2. Invent tiny 2D query, key, and value vectors.
3. Compute raw dot products.
4. Scale by √d_k.
5. Softmax the scores approximately.
6. Produce the weighted sum.
7. Then explain the result using the spotlight beam and scorecard language.
#### Drawing task
On paper, sketch the full pipeline: raw text -> splitter -> token IDs -> badge board -> seat number -> Q/K/V -> spotlight beam -> scorecard -> contextual vectors. If you cannot draw it cleanly, you do not yet own it cleanly.
#### Stretch task for interview practice
Answer these aloud without notes:
- Why is subword tokenization the practical middle path?
- Why is embedding lookup just matrix indexing?
- Why does attention need √d_k?
- Why can one head be insufficient?
Record yourself. Listen for vagueness. Tighten the failure-fix chain.

Next module, 03_transformer_mechanics, shows how attention, embeddings, and position encoding assemble into the full transformer block.