03. Subword tokenization and BPE — the practical middle path¶

This is the fix. Keep common chunks whole. Break rare forms into reusable parts.

Builds on the ELI5 in 00-eli5.md. The splitter, which decides the model's text pieces, becomes practical here by learning reusable subword chunks.

Mental picture¶

Think of a Lego tray. You do not keep one giant brick for every possible object. That would be word-level rigidity. You also do not keep only single dust grains. That would be character-level overload. Instead, you keep reusable medium pieces. Common shapes stay whole. Rare shapes are assembled from known parts. That is the main idea of subword tokenization.

playing            -> play + ing
ChatGPT-4o         -> Chat + GPT + - + 4 + o

See. The splitter is no longer guessing whole words blindly. It is reusing pieces that show up often.

common word        -> keep mostly whole
rare new form      -> break into familiar parts
mixed string       -> keep letters, digits, symbols as needed

Simple, no? That gives us both coverage and compression.

Formula first — what BPE does at each step¶

Byte Pair Encoding, or BPE, learns merges greedily. Start with a base vocabulary of tiny symbols. Usually that means characters or bytes. Then repeat this rule:

best_pair = argmax count(adjacent_pair)

Merge the most frequent adjacent pair into one new token. Do this again. And again. After K merges:

final_vocab = base_symbols + K learned merges

So what is learned? Not just a dictionary of whole words. A ranked merge list. Order matters. We will see why.
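The greedy rule above can be sketched in a few lines of Python. This is a minimal sketch, not a production trainer. The function names (`count_pairs`, `merge_pair`, `train_bpe`) are illustrative, every word is assumed to appear once, the base symbols are assumed to be characters, and ties are assumed to go to the pair seen first.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs across all words (each word is a list of symbols)."""
    counts = Counter()
    for symbols in words:
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += 1
    return counts

def merge_pair(symbols, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def train_bpe(corpus, num_merges):
    """Return the ordered merge list learned from a list of words."""
    words = [list(word) for word in corpus]  # assumption: characters as base symbols
    merges = []
    for _ in range(num_merges):
        counts = count_pairs(words)
        if not counts:
            break  # every word is already a single symbol
        # max() returns the first pair with the top count, so a tie goes to
        # the pair seen first while scanning the corpus left to right.
        best = max(counts, key=counts.get)
        merges.append(best)
        words = [merge_pair(symbols, best) for symbols in words]
    return merges
```

The return value is the ranked merge list, not the segmented corpus. Real trainers usually also weight each word by its frequency and pre-split text on whitespace, which this sketch leaves out.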

Training corpus for the worked example¶

Use this tiny corpus:

token
tokens
tokenize
tokenizer

Very small corpus. Very clear mechanics. We will train from characters. No magic. Just frequency and repeated merging.

Step 0 — start from characters¶

Write each word as character pieces.

t o k e n
t o k e n s
t o k e n i z e
t o k e n i z e r

Now count adjacent pairs across the corpus. Important counts are:

(t,o): 4
(o,k): 4
(k,e): 4
(e,n): 4
(n,i): 2
(i,z): 2
(z,e): 2

There is a four-way tie at frequency four. Any tie-break rule is possible. We will choose left-to-right for simplicity: among tied pairs, the leftmost one wins.
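These counts can be reproduced with a short script. A minimal sketch, assuming each corpus word is counted once and using `collections.Counter`; the variable names are illustrative.

```python
from collections import Counter

corpus = ["token", "tokens", "tokenize", "tokenizer"]

pair_counts = Counter()
for word in corpus:
    symbols = list(word)  # step 0: one symbol per character
    for pair in zip(symbols, symbols[1:]):
        pair_counts[pair] += 1

# most_common() sorts by count; equal counts keep first-seen order.
for pair, count in pair_counts.most_common():
    print(pair, count)

# Left-to-right tie-break: the first pair with the top count wins.
best_pair = max(pair_counts, key=pair_counts.get)
print(best_pair)  # ('t', 'o')
```

The script also prints the two pairs the list above leaves out, (n,s) and (e,r), each with a count of one.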

Merge step 1 — `t` + `o` -> `to`¶

Most frequent pair chosen: (t,o): 4

New corpus:

to k e n
to k e n s
to k e n i z e
to k e n i z e r

Why this makes sense: The pair appears in every training word. So the new chunk is reusable.
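Applying one merge is a simple left-to-right scan. A minimal sketch, assuming words are stored as lists of symbols; `merge_pair` is an illustrative helper name, not a library function.

```python
def merge_pair(symbols, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2  # skip both halves of the merged pair
        else:
            merged.append(symbols[i])
            i += 1
    return merged

corpus = [list(word) for word in ["token", "tokens", "tokenize", "tokenizer"]]
after_step_1 = [merge_pair(symbols, ("t", "o")) for symbols in corpus]

for symbols in after_step_1:
    print(" ".join(symbols))
# to k e n
# to k e n s
# to k e n i z e
# to k e n i z e r
```

Every later merge step reuses the same scan with a different pair.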

Merge step 2 — `to` + `k` -> `tok`¶

Three pairs now tie at four: (to,k), (k,e), and (e,n). The leftmost wins.

Chosen pair: (to,k): 4

New corpus:

tok e n
tok e n s
tok e n i z e
tok e n i z e r

See the word stem emerging. BPE is building bigger pieces only where frequency supports them.

Merge step 3 — `tok` + `e` -> `toke`¶

Chosen pair: (tok,e): 4

New corpus:

toke n
toke n s
toke n i z e
toke n i z e r

Still every training word agrees. So the merge is safe.

Merge step 4 — `toke` + `n` -> `token`¶

Chosen pair: (toke,n): 4

New corpus:

token
token s
token i z e
token i z e r

Now a very meaningful subword appears. token is useful by itself. It is also a prefix for longer words. This is the sweet spot.

Merge step 5 — `token` + `i` -> `tokeni`¶

Now counts change. (token,i) appears twice. So do (i,z) and (z,e). We keep the same tie-break rule.

Chosen pair: (token,i): 2

New corpus:

token
token s
tokeni z e
tokeni z e r

This looks a bit ugly. That is okay. Intermediate BPE pieces do not need to look elegant. They need to be useful for future merges.

Merge step 6 — `tokeni` + `z` -> `tokeniz`¶

Chosen pair: (tokeni,z): 2

New corpus:

token
token s
tokeniz e
tokeniz e r

Now the longer pattern is almost visible.

Merge step 7 — `tokeniz` + `e` -> `tokenize`¶

Chosen pair: (tokeniz,e): 2

New corpus:

token
token s
tokenize
tokenize r

Done. After seven merges, the learned merge list is:

1. t + o -> to
2. to + k -> tok
3. tok + e -> toke
4. toke + n -> token
5. token + i -> tokeni
6. tokeni + z -> tokeniz
7. tokeniz + e -> tokenize

That list is the real asset. Not just the final segmented training words.

Encode a new word — `tokenizers`¶

Now test a word the trainer never saw exactly: tokenizers. Start from characters.

t o k e n i z e r s

Apply learned merges in order.

After step 1: to k e n i z e r s
After step 2: tok e n i z e r s
After step 3: toke n i z e r s
After step 4: token i z e r s
After step 5: tokeni z e r s
After step 6: tokeniz e r s
After step 7: tokenize r s

Final encoding: tokenize | r | s

See the win. The word is new. But no piece is unknown. And the sequence is still short. That is the middle path in action.
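The same walk can be written as a small encoder. A minimal sketch, assuming the seven merges from the worked example and a character-level start; `MERGES` and `encode` are illustrative names. Production tokenizers typically add pre-tokenization and caching on top of this idea.

```python
# Learned merge list from the worked example, in training order.
MERGES = [
    ("t", "o"), ("to", "k"), ("tok", "e"), ("toke", "n"),
    ("token", "i"), ("tokeni", "z"), ("tokeniz", "e"),
]

def encode(word, merges):
    """Split `word` into characters, then apply each merge rule in order."""
    symbols = list(word)
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

print(encode("tokenizers", MERGES))  # ['tokenize', 'r', 's']
print(encode("tokens", MERGES))      # ['token', 's']
```

Letters that never took part in a merge simply stay as single characters, so nothing falls out of the vocabulary as long as the character itself is a base symbol.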

Why merge order matters¶

Suppose we had merged i + z early instead. Then token i z e r s would become:

token | iz | e | r | s

Now the rule token + i has nothing to match, because i is already locked inside iz. With the rest of the list unchanged, that blocks the path to tokenize as one chunk. So BPE is not just a bag of merges. It is an ordered recipe. Earlier merges change which later merges are even possible. Simple, no?
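The effect is easy to check by running the same rules in different orders. A minimal sketch, assuming the seven merges from the worked example; `apply_merges` is an illustrative helper, and the two altered lists are hypothetical, built only for this comparison.

```python
def apply_merges(word, merges):
    """Start from characters and apply each merge rule once, in list order."""
    symbols = list(word)
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

learned = [
    ("t", "o"), ("to", "k"), ("tok", "e"), ("toke", "n"),
    ("token", "i"), ("tokeni", "z"), ("tokeniz", "e"),
]
# Hypothetical: i + z is merged first, the rest stays the same.
iz_first = [("i", "z")] + learned
# Hypothetical: same seven rules, with the first two swapped.
swapped = [learned[1], learned[0]] + learned[2:]

print(apply_merges("tokenizers", learned))   # ['tokenize', 'r', 's']
print(apply_merges("tokenizers", iz_first))  # ['token', 'iz', 'e', 'r', 's']
print(apply_merges("tokenizers", swapped))   # ['to', 'k', 'e', 'n', 'i', 'z', 'e', 'r', 's']
```

In the swapped case the rule to + k runs before any to exists, so it never fires, and every rule that depends on tok is stranded too.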

What the merge list really is¶

A usable tokenizer usually needs three things.

1) Base vocabulary¶

The starting symbols. Characters, bytes, or some primitive alphabet.
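The choice of base symbols changes what the trainer starts from. A minimal sketch comparing a character start with a UTF-8 byte start; the sample string is only an illustration.

```python
text = "café"

# Character-level base: one symbol per Unicode character.
char_symbols = list(text)
print(char_symbols)  # ['c', 'a', 'f', 'é']

# Byte-level base: one symbol per UTF-8 byte, so 'é' becomes two symbols.
byte_symbols = list(text.encode("utf-8"))
print(byte_symbols)  # [99, 97, 102, 195, 169]
```

A byte-level alphabet has at most 256 symbols, so any string can be represented before a single merge is learned. The price is a longer starting sequence for non-ASCII text.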

2) Ordered merge rules¶

The learned sequence of pair merges. This is the compression logic.

3) Token-to-ID mapping¶

Every final token needs an integer ID. That is what the model actually consumes. So the splitter outputs tokens. Then IDs. Then the badge board can look them up.
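The three parts fit together in a few lines. A minimal sketch, assuming the corpus and merge list from the worked example. The ID assignment here (sorted base characters first, then merged tokens in merge order) is an arbitrary choice made for illustration; real tokenizers ship their own fixed mapping.

```python
corpus = ["token", "tokens", "tokenize", "tokenizer"]
merges = [
    ("t", "o"), ("to", "k"), ("tok", "e"), ("toke", "n"),
    ("token", "i"), ("tokeni", "z"), ("tokeniz", "e"),
]

# 1) Base vocabulary: every character that appears in the corpus.
base_symbols = sorted({ch for word in corpus for ch in word})

# 2) Ordered merge rules: each rule adds exactly one new token.
vocab = base_symbols + [left + right for left, right in merges]

# 3) Token-to-ID mapping: position in the vocabulary list.
token_to_id = {token: idx for idx, token in enumerate(vocab)}

print(len(base_symbols), len(vocab))  # 9 16

tokens = ["tokenize", "r", "s"]  # encoding of "tokenizers" from the walkthrough
ids = [token_to_id[token] for token in tokens]
print(ids)  # [15, 5, 6]
```

The vocabulary size matches the formula: 9 base symbols plus 7 learned merges gives 16 tokens.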

One compact picture¶

text
 |
 v
characters / bytes
 |
 v
apply merge #1
 |
 v
apply merge #2
 |
 v
...
 |
 v
final subword tokens
 |
 v
token IDs

This is why subword tokenization feels practical. It preserves reuse. It avoids a giant brittle word list. It avoids character-level sequence explosion.

Why this matters in real systems¶

Model names change. SKUs change. User slang changes. Currencies and units keep appearing in new forms. A subword tokenizer can adapt without needing a full token for every new whole word. That is huge in production. Especially for code, commerce, search, and multilingual chat.

Where this lives in the wild¶

OpenAI and Anthropic APIs: model names, tool names, numbers, and punctuation mix constantly in prompts.
GitHub Copilot: identifiers like fetchUserProfileV2 benefit from reusable subword pieces.
Google Search queries: rare names, misspellings, and mixed alphanumeric strings appear every second.
Shopify and Amazon catalogs: titles contain brands, sizes, counts, units, and product variants.
WhatsApp and Slack AI assistants: users mix slang, emojis, abbreviations, and domain-specific terms.

Interview Q&A¶

Q: Why is subword tokenization usually better than pure word-level tokenization?
A: It handles unseen words by decomposing them into known reusable pieces instead of falling back to [UNK].
Common wrong answer to avoid: "Because subwords are always linguistically perfect morphemes."

Q: What does BPE learn exactly?
A: An ordered list of frequent pair merges, plus the final token vocabulary and IDs built from those merges.

Q: Why does merge order matter in BPE?
A: Earlier merges change the segmentation, which changes which later merges can even match.
Common wrong answer to avoid: "The same set of merges gives the same result in any order."

Q: Why can BPE encode a new word like tokenizers without seeing it during training?
A: Because the new word can be assembled from learned reusable pieces such as tokenize, r, and s.

Apply now (5 min)¶

Take this tiny corpus: play, played, player, playing. Start from characters. Write down the three most frequent adjacent pairs. Choose three merges. Then encode the unseen word players using your learned order.

Sketch from memory: draw the seven-step token -> tokenize build-up without looking.

Bridge. Once the splitter emits token IDs, those IDs still have no meaning by themselves. The next step is the badge board. Read 04-embeddings.md.