02. Character vs word level — two extremes, both broken

Quick test. One extreme is too tiny. The other is too rigid. Both fail.

Built on the ELI5 in 00-eli5.md. The splitter — the part choosing text pieces — can go too small or too large, and both choices hurt.


Mental picture

Imagine packing a classroom cupboard. Character-level tokenization is like storing every grain of rice separately. Nothing is out of vocabulary. But counting, moving, and comparing takes forever. Word-level tokenization is the opposite. It is like storing only sealed family-size sacks. Fast when the exact sack exists. Useless when the packet shape changes slightly. See the two extremes. One explodes sequence length. The other explodes unknowns. Neither gives a robust front door for modern text.

character level   subword level             word level
---------------   -------------             ----------
C h a t G P T     Chat | GPT | - | 4 | o    [UNK]
₹ 0 . 1 5         ₹ | 0 | . | 15            [UNK]
1 M               1 | M                     [UNK]

Simple, no? The useful answer is usually in the middle. But first see why the extremes fail fundamentally.

Formula first — attention pays for sequence length

For self-attention, the interaction grid is roughly n x n. So the main cost grows like:

attention_cells = n^2

If token count doubles, the grid becomes about four times larger. That is why tokenization is not just semantics. It is also systems engineering.

Now use the example from practice. A 100-word passage might become about 550 character tokens. The same passage might become about 130 subword tokens. Then:

550^2 = 302,500
130^2 = 16,900

That is almost 18 times more attention cells for the character view. See. The splitter choice changes runtime shape directly.
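The same arithmetic as a minimal sketch. The function name attention_cells is made up for this note. It counts only the n x n score grid and ignores heads, layers, and hidden size.

```python
def attention_cells(n_tokens: int) -> int:
    """Pairwise score cells in one n x n self-attention grid."""
    return n_tokens * n_tokens

char_cells = attention_cells(550)     # character view of the 100-word passage
subword_cells = attention_cells(130)  # subword view of the same passage
ratio = char_cells / subword_cells

print(char_cells)       # 302500
print(subword_cells)    # 16900
print(round(ratio, 1))  # 17.9
```

Doubling the input to attention_cells multiplies the result by four. That is the quadratic growth in one line.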

Character level — what it gets right

Every letter is a token. So there is no out-of-vocabulary problem in the classic sense. ChatGPT-4o is always representable. ₹0.15 is always representable. A typo is still representable. A new slang word is still representable. That feels attractive. No unknowns. No brittle vocabulary boundary. Good. But now the bill arrives.
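A character-level splitter is almost no code. This is a minimal sketch, assuming one token per Unicode code point. The function name char_tokenize and the misspelled sample are illustrative only.

```python
def char_tokenize(text: str) -> list[str]:
    """Character-level splitter: every code point becomes one token."""
    return list(text)

# Product name, price, quantity, and a typo: all representable.
for sample in ["ChatGPT-4o", "₹0.15", "1M", "tokenizashun"]:
    tokens = char_tokenize(sample)
    # Joining the tokens gives back the exact input, so nothing is lost.
    assert "".join(tokens) == sample
    print(len(tokens), tokens)
```

No lookup table is involved, so nothing can miss. The price is the token count printed at the start of each line.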

Character level — where it breaks

The sequence becomes long very quickly. Long sequences make attention expensive. Long sequences also make optimization harder. The model must combine many tiny pieces before one useful concept appears. Take the word "tokenization". At character level, it is twelve small hops. At subword level, it might be two or three meaningful chunks. At word level, maybe one token. Characters preserve surface coverage. They destroy compression.

Worked numerical example — 100 words

Assume a 100-word support note. Average word length is 4.5 characters. Add spaces and punctuation, and character token count lands near 550. A practical subword tokenizer lands near 130. Draw the two attention grids.

character grid: 550 x 550
+--------------------------------+
| 302,500 pairwise score cells    |
+--------------------------------+
subword grid: 130 x 130
+----------------------+
| 16,900 score cells   |
+----------------------+

Now compare memory and latency pressure. The character system must score far more pairs. So even though no token is unknown, the compute tax is heavy.
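To put a rough number on the memory pressure, here is a sketch under stated assumptions: about one space or punctuation mark per word, scores stored as 16-bit floats, and the grid fully stored for one head in one layer. The helper names are hypothetical.

```python
BYTES_PER_SCORE = 2  # assumption: each score stored as a 16-bit float

def estimate_char_tokens(n_words: int, avg_word_len: float) -> int:
    """Letters plus roughly one space or punctuation mark per word."""
    return round(n_words * avg_word_len + n_words)

def score_grid_bytes(n_tokens: int) -> int:
    """Bytes for one fully stored n x n score grid."""
    return n_tokens * n_tokens * BYTES_PER_SCORE

n_char = estimate_char_tokens(100, 4.5)
n_subword = 130  # figure taken from the text, not computed

print(n_char)                       # 550
print(score_grid_bytes(n_char))     # 605000
print(score_grid_bytes(n_subword))  # 33800
```

These are per-grid figures. Multiply by heads, layers, and batch size to see how quickly the character view adds up.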

Failed rescue 1 for character level — make the model wider

This is a common engineering reflex. "If the sequence is too long, make hidden size bigger." But wider layers do not shrink the token count. The n^2 grid is still there. You may improve capacity. You may also increase cost further. So this rescue misses the core issue.
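A small cost model makes the point. It is a sketch, assuming each score cell is one dot product of length hidden_size (single-head view). The function names are illustrative.

```python
def score_grid_cells(n_tokens: int) -> int:
    """Cells in the n x n grid. Hidden size does not appear here."""
    return n_tokens * n_tokens

def score_multiply_adds(n_tokens: int, hidden_size: int) -> int:
    """Multiply-adds to fill the grid: one length-hidden_size dot product per cell."""
    return n_tokens * n_tokens * hidden_size

n = 550  # character-level token count from the example
for hidden_size in (512, 1024):
    print(hidden_size, score_grid_cells(n), score_multiply_adds(n, hidden_size))
```

Going from 512 to 1024 leaves the grid at 302,500 cells and doubles the work per cell. Wider is more capacity, not fewer tokens.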

Failed rescue 2 for character level — truncate harder

Second reflex. "Keep only the first chunk of text." Now latency may improve. But you silently drop context. That hurts long documents, retrieval chunks, chat histories, and code files. You saved runtime by cutting meaning. Again, not a principled fix.
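Truncation in miniature. A sketch with a made-up support message and a made-up limit of 16 character tokens. The function name truncate is illustrative.

```python
def truncate(tokens: list[str], max_len: int) -> tuple[list[str], list[str]]:
    """Keep the first max_len tokens. Return (kept, dropped)."""
    return tokens[:max_len], tokens[max_len:]

# Character-level tokens for a hypothetical support message.
tokens = list("Refund approved. Ship replacement to the new address.")
kept, dropped = truncate(tokens, 16)

print("".join(kept))     # Refund approved.
print("".join(dropped))  # the instruction the model never sees
```

The kept part looks complete. The dropped part held the action to take. No error is raised anywhere.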

Failed rescue 3 for character level — hope training solves it

Third reflex. "The model will learn longer patterns anyway." Sometimes partly true. But the model still has to discover words from raw letters repeatedly. That is unnecessary work. The representation burden stays high. Hope is not compression.

Word level — what it gets right

Now go to the other extreme. Use a 50K word vocabulary. Common words become single tokens. Sequence length is short. Attention is cheaper. Frequent words are easy to represent. For clean newsroom English, this can look fine. That is why word-level tokenization feels appealing at first glance.

Word level — where it breaks

Modern text is not a closed dictionary. Look at these strings:

ChatGPT-4o
₹0.15
1M

A 50K word list probably does not contain them as whole items. So they become [UNK] or awkward fallback fragments. Now the vocabulary is too rigid. One new product name breaks coverage. One currency format breaks coverage. One typo breaks coverage. One mixed-script word breaks coverage. See the opposite failure. Good compression. Bad flexibility.

Worked numerical example — same sentence, word level

Take:

ChatGPT-4o mini costs ₹0.15 per 1M input tokens.

A strict word-level vocabulary of 50K general words might produce:

[UNK] | mini | costs | [UNK] | per | [UNK] | input | [UNK]

Sequence length is only eight. Nice. But meaning quality is poor. Four distinct concepts collapsed to one placeholder. Cheap attention is not useful if the pieces are wrong.
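The lookup behind that output can be sketched in a few lines. Assumptions, all hypothetical: a tiny VOCAB set stands in for the 50K list, and splitting is plain whitespace splitting, so the last word arrives as "tokens." with its period attached and misses the lookup too.

```python
UNK = "[UNK]"
# Hypothetical stand-in for a 50K general-word vocabulary.
VOCAB = {"mini", "costs", "per", "input", "tokens", "the", "a"}

def word_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Whitespace split, then exact lookup. Anything missing becomes [UNK]."""
    return [word if word in vocab else UNK for word in text.split()]

sentence = "ChatGPT-4o mini costs ₹0.15 per 1M input tokens."
word_tokens = word_tokenize(sentence, VOCAB)

print(" | ".join(word_tokens))
# [UNK] | mini | costs | [UNK] | per | [UNK] | input | [UNK]
print(len(word_tokens))  # 8
```

Exact match is the whole mechanism. One extra character on a known word is enough to fall out of the vocabulary.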

Failed rescue 1 for word level — lowercase everything

Lowercasing may normalise Apple and apple. Good. But it still does not invent entries for chatgpt-4o or ₹0.15. Coverage remains brittle.
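A quick check of what lowercasing does and does not buy. A sketch with a hypothetical four-word vocabulary. The function name lookup is illustrative.

```python
VOCAB = {"apple", "mini", "costs", "per"}  # hypothetical lowercase word list

def lookup(word: str, vocab: set[str], lowercase: bool) -> str:
    """Return the vocabulary entry for word, or [UNK] if there is none."""
    key = word.lower() if lowercase else word
    return key if key in vocab else "[UNK]"

for word in ["Apple", "ChatGPT-4o", "₹0.15"]:
    print(word, "->", lookup(word, VOCAB, False), "->", lookup(word, VOCAB, True))
```

Only Apple is rescued. The other two are [UNK] before and after, because no entry exists to normalise toward.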

Failed rescue 2 for word level — stem words

Maybe you stem costs to cost. Fine for morphology. But what is the stem of 1M? What is the stem of GPT-4o? What is the stem of ₹499? The hard cases remain hard.

Failed rescue 3 for word level — just use [UNK]

This is not even a rescue. It is surrender. [UNK] says many unrelated surfaces should share one embedding. That throws away detail the model could have used. One unknown token is a lossy bucket, not a smart abstraction.
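The lossy bucket, shown with token ids. A sketch with a made-up five-entry id table. Each id would index one embedding row, so equal ids mean one shared embedding.

```python
# Hypothetical id table; id 0 is the shared unknown bucket.
word_to_id = {"[UNK]": 0, "mini": 1, "costs": 2, "per": 3, "input": 4}
id_to_word = {i: w for w, i in word_to_id.items()}

def encode(word: str) -> int:
    """Map a word to its id, falling back to the [UNK] id."""
    return word_to_id.get(word, word_to_id["[UNK]"])

ids = [encode(w) for w in ["ChatGPT-4o", "₹0.15", "1M", "mini"]]
decoded = [id_to_word[i] for i in ids]

print(ids)      # [0, 0, 0, 1]
print(decoded)  # ['[UNK]', '[UNK]', '[UNK]', 'mini']
```

A product name, a price, and a quantity all land on id 0. Decoding cannot tell them apart again.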

Why both failures are fundamental

Character level says, "Represent everything, even if the sequence explodes." Word level says, "Compress hard, even if new forms disappear." Those are opposite mistakes. The first ignores compute efficiency. The second ignores open-vocabulary reality. Real product traffic needs both. Coverage and compression. Flexibility and efficiency. That is why modern systems prefer subword tokenization. It keeps common words whole when possible. It breaks rare forms into reusable parts when needed.

One side-by-side picture

Input: ChatGPT-4o mini costs ₹0.15 per 1M input tokens.
Character level:
C | h | a | t | G | P | T | - | 4 | o | ... many more
Pros: no OOV
Cons: long sequence, big attention grid
Word level:
[UNK] | mini | costs | [UNK] | per | [UNK] | input | [UNK]
Pros: short sequence
Cons: brittle coverage, heavy meaning loss
Subword level:
Chat | GPT | - | 4 | o | mini | costs | ₹ | 0 | . | 15 | per | 1 | M | input | tokens
Pros: practical middle path
Cons: still approximate, but far better trade-off

So what to do? Use pieces that are reusable, not extreme.
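The subword row of the picture can be reproduced with a toy splitter. This is only a sketch: greedy longest-match over a hand-written piece list, with single characters as fallback. It is not BPE and not any real tokenizer's vocabulary. PIECES and subword_tokenize are hypothetical names, and the trailing period is left off the sentence to match the picture.

```python
# Hypothetical hand-written piece list. A real vocabulary is learned from data.
PIECES = {"Chat", "GPT", "mini", "costs", "per", "input", "tokens", "15"}

def subword_tokenize(text: str, pieces: set[str]) -> list[str]:
    """Greedy longest-match inside each word, single characters as fallback."""
    tokens = []
    for word in text.split():
        i = 0
        while i < len(word):
            # Try the longest candidate first and shrink until something fits.
            for j in range(len(word), i, -1):
                if word[i:j] in pieces or j == i + 1:
                    tokens.append(word[i:j])
                    i = j
                    break
    return tokens

sentence = "ChatGPT-4o mini costs ₹0.15 per 1M input tokens"
subword_tokens = subword_tokenize(sentence, PIECES)

print(" | ".join(subword_tokens))
# Chat | GPT | - | 4 | o | mini | costs | ₹ | 0 | . | 15 | per | 1 | M | input | tokens
print(len(sentence), len(subword_tokens), len(sentence.split()))  # 47 16 8
```

Forty-seven characters, sixteen subword pieces, eight words. The middle count is the only one with neither a long sequence nor an [UNK].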

Where this lives in the wild

  • GitHub Copilot: character-level code modeling is costly, while strict word vocabularies break on new identifiers.
  • Google Search and Ads queries: users type prices, misspellings, units, and product names every minute.
  • WhatsApp moderation systems: messages mix emojis, Hinglish, numerals, and abbreviations in one line.
  • Stripe and Razorpay receipts: short strings contain currencies, invoice IDs, and merchant tokens.
  • Duolingo and Google Translate: multilingual input punishes rigid word lists and overlong character streams.

Interview Q&A

Q: Why is character-level tokenization attractive at first?
A: It has near-perfect surface coverage. New words, typos, and symbols are always representable.

Q: Why does character-level usually lose in large transformer systems?
A: Because sequence length explodes, and attention cost grows roughly with the square of that length.
Common wrong answer to avoid: "Characters are bad because they cannot represent meaning."

Q: Why is word-level tokenization fundamentally brittle?
A: Real text is open-vocabulary. New products, prices, usernames, and mixed forms appear constantly.
Common wrong answer to avoid: "Just add more words to the vocabulary until the problem disappears."

Q: If word-level is cheap and character-level is flexible, what should we seek?
A: A middle path that preserves reusable chunks while keeping sequence length manageable.

Apply now (5 min)

Take one 100-word article paragraph. Estimate its character-token count and subword-token count. Square both numbers. Write the ratio. Then find three strings a word-level vocabulary would likely miss. Sketch from memory: draw the three-column comparison and write one failure for each extreme.


Bridge. So the practical question becomes: how do we build reusable chunks automatically? Next: 03-subword-bpe.md.