01. The tokenizer failure: why naive splitters collapse on real text

Two minutes. One sentence. Watch half the meaning vanish at the door.

Builds on the ELI5 in 00-eli5.md. The splitter, the front door that cuts raw text into pieces, is the first thing to break here.

Mental picture

Imagine a security guard at a factory gate. Every truck must be broken into standard crates. If the guard only recognises old crate labels, all new cargo gets dumped into one bin. That one bin is [UNK]. Once four different things become the same unknown blob, the model loses choices early. In the example below, a product name, a price, a number scale, and a word with trailing punctuation all collapse together. That is not a small error. That is the front door misreading the note.

raw text
   |
   v
naive word splitter
   |
   +--> known word --> keep token
   |
   +--> unseen thing --> [UNK]
   |
   v
sequence with meaning already erased

So what to do? First, stop thinking tokenization is boring preprocessing. It is a modeling decision. It decides what the rest of the stack can even see.

The curiosity gap

Feed this sentence into a naive word tokenizer:

ChatGPT-4o mini costs ₹0.15 per 1M input tokens.

A plain space split gives eight pieces.

1. ChatGPT-4o
2. mini
3. costs
4. ₹0.15
5. per
6. 1M
7. input
8. tokens.

Now assume the tiny vocabulary only knows these four exact items: mini, costs, per, input.

Then the output becomes:

[UNK] | mini | costs | [UNK] | per | [UNK] | input | [UNK]

Four of eight pieces are now [UNK]. That is 50% collapse before embeddings, position, or attention even begin. Simple, no? You cannot recover distinctions the splitter already destroyed.
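The collapse can be reproduced in a few lines. This is a minimal sketch of the naive splitter described here, not any production tokenizer; the names VOCAB, UNK, and naive_tokenize are made up for illustration.

```python
# Toy vocabulary from the example: only these four exact strings are known.
VOCAB = {"mini", "costs", "per", "input"}
UNK = "[UNK]"


def naive_tokenize(text, vocab=VOCAB):
    """Split on whitespace and replace every piece not in vocab with [UNK]."""
    return [piece if piece in vocab else UNK for piece in text.split()]


sentence = "ChatGPT-4o mini costs ₹0.15 per 1M input tokens."
tokens = naive_tokenize(sentence)
print(" | ".join(tokens))
# [UNK] | mini | costs | [UNK] | per | [UNK] | input | [UNK]
```

Note that the lookup is an exact string match. Nothing in this function can tell a price from a product name once both miss the vocabulary.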

The quick formula

A useful sanity check is unknown rate.

unknown_rate = unknown_pieces / total_pieces

Here:

unknown_rate = 4 / 8 = 0.50

Coverage is the mirror view.

coverage = known_pieces / total_pieces = 4 / 8 = 0.50

If coverage is weak at the front door, downstream quality will wobble. Why? Because one [UNK] vector must now stand in for many unrelated surfaces.
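The two ratios are easy to compute on any tokenized output. The sketch below assumes the unknown placeholder is the literal string "[UNK]"; the function names are illustrative.

```python
def unknown_rate(tokens, unk="[UNK]"):
    """Fraction of pieces that were replaced by the unknown placeholder."""
    if not tokens:
        raise ValueError("cannot compute a rate for an empty token list")
    return tokens.count(unk) / len(tokens)


def coverage(tokens, unk="[UNK]"):
    """Fraction of pieces the vocabulary recognised (1 - unknown_rate)."""
    return 1.0 - unknown_rate(tokens, unk)


tokens = ["[UNK]", "mini", "costs", "[UNK]", "per", "[UNK]", "input", "[UNK]"]
print(unknown_rate(tokens))  # 0.5
print(coverage(tokens))      # 0.5
```

Because every piece is either known or unknown, the two numbers always sum to 1, so tracking one of them is enough.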

Worked example with the eight pieces

See the collision clearly.

original piece    -> tokenizer output
-------------------------------------
ChatGPT-4o        -> [UNK]
mini              -> mini
costs             -> costs
₹0.15             -> [UNK]
per               -> per
1M                -> [UNK]
input             -> input
tokens.           -> [UNK]

Now ask what each unknown actually meant. ChatGPT-4o is a model name. ₹0.15 is a price with currency. 1M is a scale. tokens. is an ordinary word plus punctuation. The naive splitter says all four are the same event. They are not. This is why retrieval, pricing, and multilingual prompts go wrong in subtle ways.
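The collision becomes concrete once pieces are mapped to integer IDs, which is what the model actually receives. The ID table below is hypothetical and chosen only for this sketch.

```python
# Hypothetical ID table: four known words plus one shared slot for [UNK].
TOKEN_TO_ID = {"[UNK]": 0, "mini": 1, "costs": 2, "per": 3, "input": 4}
UNK_ID = TOKEN_TO_ID["[UNK]"]


def encode(pieces):
    """Map each piece to its ID, falling back to the shared [UNK] ID."""
    return [TOKEN_TO_ID.get(piece, UNK_ID) for piece in pieces]


pieces = "ChatGPT-4o mini costs ₹0.15 per 1M input tokens.".split()
ids = encode(pieces)
print(ids)
# [0, 1, 2, 0, 3, 0, 4, 0]

# Every surface form that landed on the same unknown ID.
collided = [piece for piece, i in zip(pieces, ids) if i == UNK_ID]
print(collided)
# ['ChatGPT-4o', '₹0.15', '1M', 'tokens.']
```

From the ID sequence alone there is no way to tell which 0 was the price and which was the model name.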

Failed rescue attempt 1: lowercase everything

The instinct is common. "Just lowercase it." So we get:

chatgpt-4o | mini | costs | ₹0.15 | per | 1m | input | tokens.

Looks cleaner. But the vocabulary still does not know chatgpt-4o. It still does not know ₹0.15. It still does not know 1m. It still does not know tokens. with the period attached. So the unknown count is still four. Lowercase helps casing. It does not solve mixed symbols, prices, or fused forms.
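A quick check confirms that lowercasing leaves the unknown count untouched for this sentence. This is a sketch under the same four-word toy vocabulary; count_unknown is an illustrative helper.

```python
VOCAB = {"mini", "costs", "per", "input"}


def count_unknown(pieces, vocab=VOCAB):
    """Count pieces that have no exact match in the vocabulary."""
    return sum(1 for piece in pieces if piece not in vocab)


sentence = "ChatGPT-4o mini costs ₹0.15 per 1M input tokens."
before = sentence.split()
after = sentence.lower().split()

print(after)
# ['chatgpt-4o', 'mini', 'costs', '₹0.15', 'per', '1m', 'input', 'tokens.']
print(count_unknown(before), count_unknown(after))
# 4 4
```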

Failed rescue attempt 2: strip punctuation

Next instinct. "Remove punctuation." Maybe you produce:

ChatGPT4o | mini | costs | 015 | per | 1M | input | tokens

Now tokens has lost its period, so it would match if the vocabulary contained the bare word tokens. Our four-word vocabulary does not, so it is still [UNK]. And look what you paid. ₹0.15 became 015. The decimal point vanished. The currency sign vanished. The price semantics are damaged. ChatGPT-4o became ChatGPT4o, which is still an unseen product string. This rescue fixes surface mess by deleting signal. That is not a real rescue.
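The damage is easy to see in code. The sketch below assumes "remove punctuation" means dropping every character that is neither a word character nor whitespace, which is one common way to implement it. The extended vocabulary containing "tokens" is a hypothetical added only to show the best case.

```python
import re

VOCAB = {"mini", "costs", "per", "input"}


def strip_symbols(text):
    """Drop every character that is not a word character or whitespace."""
    return re.sub(r"[^\w\s]", "", text)


def count_unknown(pieces, vocab):
    """Count pieces that have no exact match in the vocabulary."""
    return sum(1 for piece in pieces if piece not in vocab)


sentence = "ChatGPT-4o mini costs ₹0.15 per 1M input tokens."
pieces = strip_symbols(sentence).split()
print(pieces)
# ['ChatGPT4o', 'mini', 'costs', '015', 'per', '1M', 'input', 'tokens']

# With the original four-word vocabulary nothing improves.
print(count_unknown(pieces, VOCAB))               # 4
# Best case: a hypothetical vocabulary that also contains "tokens".
print(count_unknown(pieces, VOCAB | {"tokens"}))  # 3
```

Even in the best case only one unknown is recovered, and the price has been turned into the string 015.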

Failed rescue attempt 3: split on dash

Third instinct. "Okay, split hyphenated words." So maybe:

ChatGPT | 4o | mini | costs | ₹0.15 | per | 1M | input | tokens.

But now you have nine pieces instead of eight. Under our four-word vocabulary, ChatGPT is unknown. 4o is unknown. ₹0.15 is still unknown. 1M is still unknown. tokens. is still unknown. That is five unknowns out of nine. You turned one bad unknown into two awkward fragments. Latency can even get worse because the sequence got longer. See the trap. Local string hacks do not give you a principled vocabulary.
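Counting the pieces before and after shows the rescue going backwards. This sketch assumes the dash split is done with a regular expression over whitespace and hyphens, and it reuses the four-word toy vocabulary.

```python
import re

VOCAB = {"mini", "costs", "per", "input"}

sentence = "ChatGPT-4o mini costs ₹0.15 per 1M input tokens."

# Split on runs of whitespace or hyphens.
pieces = re.split(r"[\s-]+", sentence)
unknown = [piece for piece in pieces if piece not in VOCAB]

print(pieces)
# ['ChatGPT', '4o', 'mini', 'costs', '₹0.15', 'per', '1M', 'input', 'tokens.']
print(len(unknown), "of", len(pieces))
# 5 of 9
```

The unknown rate moves from 4/8 to 5/9, so the sequence is both longer and slightly worse covered.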

Why this matters to a Lead AI Engineer

A lead does not need to hand-build tokenizers every week. But the intuition matters everywhere.

1) RAG chunking

Chunk size is usually budgeted in tokens, not words. If you think in words, you will overpack chunks. Then retrieval gets uneven. Then truncation appears in odd places. Mixed text like SKU-A17, ₹499, and v2.1 makes the mismatch sharper.
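The gap between word count and piece count can be felt with a rough stand-in. The regular expression below is not a real tokenizer and real token counts will differ; it only assumes that symbols and punctuation tend to become separate pieces. Both function names are illustrative.

```python
import re


def word_count(text):
    """Count whitespace-separated words, the way a word budget would."""
    return len(text.split())


def rough_piece_count(text):
    """Rough stand-in for a token count: runs of word characters,
    plus every symbol or punctuation mark as its own piece."""
    return len(re.findall(r"\w+|[^\w\s]", text))


text = "SKU-A17 ₹499 v2.1"
print(word_count(text))         # 3
print(rough_piece_count(text))  # 8
```

A chunk budgeted as "3 words" here would already cost 8 pieces under this stand-in, which is the kind of mismatch that leads to overpacked chunks. For real budgets, count with the tokenizer your model actually uses.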

2) Latency diagnosis

In a standard transformer, self-attention cost grows roughly quadratically with sequence length. Bad splitting means more tokens. More tokens mean slower prompts and higher memory use. If a prompt suddenly becomes expensive, tokenization is one of the first suspects.
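A back-of-envelope ratio helps when comparing two splits of the same prompt. The sketch assumes full self-attention, where the number of pairwise scores scales with the square of the sequence length; it ignores every other cost in the model, so treat it as intuition, not a latency prediction.

```python
def relative_attention_cost(n_before, n_after):
    """Ratio of pairwise attention scores, assuming cost scales as n squared."""
    if n_before <= 0 or n_after <= 0:
        raise ValueError("sequence lengths must be positive")
    return (n_after / n_before) ** 2


# The dash split turned 8 pieces into 9.
print(relative_attention_cost(8, 9))      # 1.265625
# Doubling the token count quadruples the pairwise work.
print(relative_attention_cost(100, 200))  # 4.0
```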

3) Multilingual cost intuition

English words alone are the easy case. Real traffic has Hindi, English, numbers, emojis, and brand names together. A weak tokenizer can explode token count on such text. That changes serving cost and batching decisions.
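One way to build intuition for the explosion is to look at UTF-8 byte counts. This is only a proxy: it matters for tokenizers that fall back to raw bytes on text they have not learned well, which is an assumption here, not something every tokenizer does.

```python
def utf8_bytes_per_char(text):
    """Average number of UTF-8 bytes per Unicode code point."""
    if not text:
        raise ValueError("text must not be empty")
    return len(text.encode("utf-8")) / len(text)


print(utf8_bytes_per_char("price"))  # 1.0
print(utf8_bytes_per_char("कीमत"))   # 3.0
```

Each Devanagari code point takes three bytes in UTF-8, so a byte-level fallback can spend roughly three times as many units on Hindi text as on ASCII English of the same length.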

4) Evaluation discipline

If a model fails on product names or prices, do not blame attention first. Check the front door. Maybe the splitter never preserved the signal cleanly.

The core lesson

The tokenizer is not just a format converter. It defines the alphabet of thought available to the model. Bad alphabets force bad collisions. Good alphabets preserve reusable structure. That is why modern tokenizers aim for subword pieces. Not whole words only. Not single characters only. A practical middle path.
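To make the middle path tangible, here is a toy greedy longest-match splitter. The subword vocabulary is hand-picked and hypothetical; real tokenizers learn their pieces from data, and their matching rules differ in detail. The point is only that reusable pieces can cover unseen strings without collapsing them to [UNK].

```python
# Hypothetical, hand-picked subword vocabulary for illustration only.
SUBWORDS = {
    "Chat", "GPT", "-", "4", "o", "₹", "0", ".", "1", "5", "M",
    "token", "s", "mini", "costs", "per", "input",
}


def greedy_subword_split(word, vocab=SUBWORDS, unk="[UNK]"):
    """Repeatedly take the longest vocabulary entry that matches at the
    current position; emit [UNK] for a character nothing covers."""
    pieces = []
    start = 0
    while start < len(word):
        for end in range(len(word), start, -1):  # longest candidate first
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            pieces.append(unk)  # no entry covers this character
            start += 1
    return pieces


for word in "ChatGPT-4o mini costs ₹0.15 per 1M input tokens.".split():
    print(word, "->", greedy_subword_split(word))
```

With this vocabulary, ChatGPT-4o becomes Chat, GPT, -, 4, o and tokens. becomes token, s, and a period. The four surfaces that collided earlier now stay distinct, at the cost of a longer sequence.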

Where this lives in the wild

OpenAI ChatGPT and API pricing pages: model names, currencies, and token units appear together in one line.
GitHub Copilot: code identifiers like getUserById_v2 punish naive word splitting immediately.
Google Ads product feeds: titles mix brands, sizes, SKUs, currencies, and punctuation.
Razorpay and Stripe support flows: receipts contain ₹499, invoice IDs, dates, and merchant names.
WhatsApp Business bots: users send Hinglish, emojis, abbreviations, and phone numbers in one message.

Interview Q&A

Q: Why is [UNK] especially harmful compared with a slightly wrong token split?
A: Because many different surfaces now map to one shared ID before the model can reason about them.
Common wrong answer to avoid: "Attention will recover the original text later."

Q: Why should a Lead AI Engineer care about tokenizer behavior in RAG?
A: Chunk budgets, truncation points, and retrieval recall are all token-sensitive, not word-sensitive.
Common wrong answer to avoid: "Chunking is just a document loader problem."

Q: Is lowercasing a real fix for tokenizer failure?
A: Only for casing variation. It does not solve prices, symbols, product names, or mixed alphanumeric strings.

Q: If four of eight pieces are unknown, is the model useless?
A: Not useless, but already handicapped. Half the sentence has been flattened into one generic placeholder.

Apply now (5 min)

1. Take three real strings from your work. Use one with a price, one with a SKU, and one with mixed Hindi-English text.
2. Split each one by spaces.
3. Mark which pieces would become [UNK] under a tiny 20-word vocabulary.
4. Write the unknown rate for each example.
5. Sketch from memory: draw the front door diagram and label where meaning disappears.

Bridge. If whole-word splitting fails on new words, maybe characters can save us. But that extreme breaks somewhere else. Next: 02-character-vs-word.md.