BPE Tokenizer Analysis
Setup
- Custom tokenizer: `bpe.py` in this folder, trained with `vocab_size=300`
- Training corpus: three tiny in-domain sentences about `the cat` / `the rat`
- Production reference: OpenAI `o200k_base` via `tiktoken`
This is intentionally an unfair fight. That is the point. The custom tokenizer proves the algorithm works. The production tokenizer shows what good coverage looks like.
Comparison on 10 sentences
| Sentence | Custom BPE tokens | o200k_base tokens | Ratio (custom / o200k_base) |
|---|---|---|---|
| the cat sat | 2 | 3 | 0.67x |
| ChatGPT-4o mini costs ₹0.15 per 1M input tokens. | 49 | 18 | 2.72x |
| SKU-A17 ships in 2-3 days. | 25 | 11 | 2.27x |
| getUserById_v2(user_id=42) | 26 | 11 | 2.36x |
| नमस्ते दुनिया | 37 | 5 | 7.40x |
| Mixed Hindi-English price ₹499 only today! | 43 | 9 | 4.78x |
| email me at gaurav@example.com | 27 | 7 | 3.86x |
| Version v2.1.0-beta released | 27 | 9 | 3.00x |
| 😀👍 works? | 15 | 4 | 3.75x |
| Line one\nLine two\nLine three | 24 | 8 | 3.00x |
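The ratio in the last column is the custom token count divided by the `o200k_base` count. A minimal sketch that recomputes it for three rows of the table; the `rows` list and the `ratio` helper are illustrative names and are not part of `bpe.py`:

```python
# Token counts copied from the table above: (sentence, custom BPE, o200k_base).
rows = [
    ("the cat sat", 2, 3),
    ("नमस्ते दुनिया", 37, 5),
    ("email me at gaurav@example.com", 27, 7),
]


def ratio(custom_tokens: int, reference_tokens: int) -> float:
    """Token inflation: above 1.0 means the custom tokenizer emits more tokens."""
    return custom_tokens / reference_tokens


for sentence, custom, reference in rows:
    print(f"{sentence!r}: {ratio(custom, reference):.2f}x")
```

A ratio below 1.0, as in the first row, means the toy tokenizer is the more compact one for that string.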
What this shows
- In-domain text compresses well. `the cat sat` becomes 2 tokens because the toy corpus taught merges like `the cat` and `sat`.
- Everything else fragments fast. Product names, code, prices, email addresses, emojis, and mixed-script text fall back toward raw bytes.
- Multilingual text is where the gap becomes brutal. `नमस्ते दुनिया` becomes 37 tokens in the toy tokenizer versus 5 in the production tokenizer.
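The fallback toward raw bytes can be reproduced with a minimal byte-level BPE sketch. This is not the `bpe.py` from this folder: the stand-in corpus, the function names, and the choice to apply merges in learned order are all assumptions made for illustration.

```python
from collections import Counter

# Stand-in corpus (assumption): the real three training sentences are not shown.
CORPUS = "the cat sat on the mat. the rat sat on the cat. the cat saw the rat."


def merge(ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace each non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text: str, vocab_size: int) -> list[tuple[int, int]]:
    """Learn merges by repeatedly fusing the most frequent adjacent pair."""
    ids = list(text.encode("utf-8"))
    merges = []
    for new_id in range(256, vocab_size):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:  # corpus collapsed to a single token
            break
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        ids = merge(ids, best, new_id)
    return merges


def encode(text: str, merges: list[tuple[int, int]]) -> list[int]:
    """Start from raw UTF-8 bytes, then apply merges in the order they were learned."""
    ids = list(text.encode("utf-8"))
    for rank, pair in enumerate(merges):
        ids = merge(ids, pair, 256 + rank)
    return ids


def decode(ids: list[int], merges: list[tuple[int, int]]) -> str:
    """Expand every token back to its bytes, so encoding stays lossless."""
    vocab = {i: bytes([i]) for i in range(256)}
    for rank, (left, right) in enumerate(merges):
        vocab[256 + rank] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


merges = train_bpe(CORPUS, vocab_size=300)
for sample in ("the cat sat", "नमस्ते दुनिया"):
    ids = encode(sample, merges)
    assert decode(ids, merges) == sample  # lossless round trip
    print(f"{sample!r}: {len(sample.encode('utf-8'))} bytes -> {len(ids)} tokens")
```

The Hindi string is 37 bytes in UTF-8. Merges learned only from ASCII text cannot apply to any of those bytes, so the sketch emits one token per byte, which is consistent with the 37 tokens reported in the table.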
Why production tokenizers win
- They are trained on massive, diverse corpora.
- They learn merges for common words, code patterns, punctuation habits, emojis, and multilingual byte sequences.
- They are tuned for deployment workloads, not one tiny local corpus.
Token cost takeaway
If your tokenizer turns one user string into 3x-7x more tokens, you pay for it everywhere:
- longer prompts,
- slower attention,
- higher memory use,
- worse batching,
- and earlier truncation.
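A back-of-envelope sketch of how these costs scale with the inflation ratio. It assumes standard full self-attention, where every token is compared with every other token, and a fixed context window; the `inflation_impact` name and the 8192-token window are illustrative, not measurements from this analysis.

```python
def inflation_impact(ratio: float, context_window: int = 8192) -> dict[str, float]:
    """Rough effect of emitting `ratio` times more tokens for the same text."""
    return {
        # Per-token costs (prompt length, cache memory) grow linearly.
        "linear_cost_multiplier": ratio,
        # Full self-attention scores every token pair, so it grows quadratically.
        "attention_cost_multiplier": ratio ** 2,
        # The window holds proportionally less text before truncation,
        # expressed here in reference-tokenizer tokens.
        "effective_context_tokens": context_window / ratio,
    }


for r in (3.0, 7.0):
    print(r, inflation_impact(r))
```

Under these assumptions, a 3x inflation roughly means 9x the pairwise attention work and a third of the usable context for the same text.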
Tokenizer mismatch risk
This is the practical lesson for AI engineering work:
- If you chunk documents with one tokenizer and serve with another, chunk budgets drift: a chunk sized to fit under the first tokenizer can overflow under the second.
- If you estimate prompt cost with word counts instead of deployment-token counts, your latency math lies.
- If you evaluate only on in-domain English text, you miss failure modes on code, SKUs, prices, Hindi, and emoji-heavy chat.
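The first point can be illustrated with two deliberately crude stand-in tokenizers. Whitespace splitting plays the chunking tokenizer and raw UTF-8 bytes play the serving tokenizer; both are assumptions for the sake of the sketch, as are the `chunk_by_budget` helper and the budget of 4.

```python
from typing import Callable


def whitespace_count(text: str) -> int:
    """Stand-in for the tokenizer used at chunking time."""
    return len(text.split())


def byte_count(text: str) -> int:
    """Stand-in for the tokenizer used at serving time (worst-case byte fallback)."""
    return len(text.encode("utf-8"))


def chunk_by_budget(words: list[str], count: Callable[[str], int], budget: int) -> list[str]:
    """Greedily pack words into chunks that stay within `budget` under `count`."""
    chunks, current = [], []
    for word in words:
        candidate = " ".join(current + [word])
        if current and count(candidate) > budget:
            chunks.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "Mixed Hindi-English price ₹499 only today! email me at gaurav@example.com"
budget = 4
chunks = chunk_by_budget(text.split(), whitespace_count, budget)

for chunk in chunks:
    # Every chunk fits the budget at chunking time and blows past it at serving time.
    print(f"{chunk!r}: chunk-time={whitespace_count(chunk)}, serve-time={byte_count(chunk)}")
```

Counting with the deployment tokenizer at chunking time removes the drift, at the cost of coupling the chunking step to one tokenizer.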
Bottom line
The custom tokenizer is correct and lossless. But it is not production-grade because the training corpus is too small and too narrow. That gap is exactly why tokenizer choice affects cost, recall, and model behavior upstream of attention.