BPE Tokenizer Analysis

Setup

  • Custom tokenizer: bpe.py in this folder, trained with vocab_size=300
  • Training corpus: three tiny in-domain sentences about the cat / the rat
  • Production reference: OpenAI o200k_base via tiktoken

This is intentionally an unfair fight. That is the point. The custom tokenizer proves the algorithm works. The production tokenizer shows what good coverage looks like.

Comparison on 10 sentences

| Sentence | Custom BPE tokens | o200k_base tokens | Ratio (custom / o200k_base) |
|---|---|---|---|
| the cat sat | 2 | 3 | 0.67x |
| ChatGPT-4o mini costs ₹0.15 per 1M input tokens. | 49 | 18 | 2.72x |
| SKU-A17 ships in 2-3 days. | 25 | 11 | 2.27x |
| getUserById_v2(user_id=42) | 26 | 11 | 2.36x |
| नमस्ते दुनिया | 37 | 5 | 7.40x |
| Mixed Hindi-English price ₹499 only today! | 43 | 9 | 4.78x |
| email me at gaurav@example.com | 27 | 7 | 3.86x |
| Version v2.1.0-beta released | 27 | 9 | 3.00x |
| 😀👍 works? | 15 | 4 | 3.75x |
| Line one\nLine two\nLine three | 24 | 8 | 3.00x |
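
The Ratio column is the custom token count divided by the o200k_base count for the same sentence. A minimal sketch that recomputes it from the counts in the table; the names `COUNTS` and `ratio` are illustrative and are not taken from bpe.py:

```python
# (custom_tokens, o200k_base_tokens) pairs copied from the table, in row order.
COUNTS = [(2, 3), (49, 18), (25, 11), (26, 11), (37, 5),
          (43, 9), (27, 7), (27, 9), (15, 4), (24, 8)]


def ratio(custom: int, reference: int) -> float:
    """Custom tokens spent per reference token (above 1.0 means the custom tokenizer is worse)."""
    return custom / reference


per_row = [round(ratio(c, r), 2) for c, r in COUNTS]
total_custom = sum(c for c, _ in COUNTS)
total_reference = sum(r for _, r in COUNTS)
overall = total_custom / total_reference

print(per_row)
print(f"overall: {total_custom} vs {total_reference} tokens ({overall:.2f}x)")
```

Summed over all ten sentences, the table's counts come to 275 custom tokens against 85 reference tokens, or about 3.24x. The single in-domain sentence is the only row where the ratio drops below 1.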

What this shows

  1. In-domain text compresses well. "the cat sat" becomes 2 tokens because the toy corpus taught merges like "the cat" and "sat".
  2. Everything else fragments fast. Product names, code, prices, email addresses, emojis, and mixed-script text fall back toward raw bytes.
  3. Multilingual text is where the gap becomes brutal. नमस्ते दुनिया becomes 37 tokens in the toy tokenizer versus 5 in the production tokenizer.
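
Both effects can be reproduced with a few lines of byte-level BPE. The sketch below is not the code in bpe.py, and `TOY_CORPUS` is a hypothetical stand-in, since the three actual training sentences are not listed here. It assumes the usual byte-level setup: ids 0 to 255 are raw bytes, and each merge adds one new id until `vocab_size` is reached.

```python
from collections import Counter


def merge(seq, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` in `seq` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_bpe(corpus, vocab_size):
    """Learn merges greedily: always merge the currently most frequent adjacent pair."""
    seqs = [list(s.encode("utf-8")) for s in corpus]
    merges = []  # merges[k] is the pair that produced token id 256 + k
    while 256 + len(merges) < vocab_size:
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:  # every sentence is already a single token
            break
        best = max(pairs, key=pairs.get)
        new_id = 256 + len(merges)
        seqs = [merge(seq, best, new_id) for seq in seqs]
        merges.append(best)
    return merges


def encode(text, merges):
    """Start from UTF-8 bytes and apply the learned merges in training order."""
    seq = list(text.encode("utf-8"))
    for new_id, pair in enumerate(merges, start=256):
        seq = merge(seq, pair, new_id)
    return seq


def decode(ids, merges):
    """Expand every id back to its bytes; this is what makes the scheme lossless."""
    vocab = {i: bytes([i]) for i in range(256)}
    for new_id, (a, b) in enumerate(merges, start=256):
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


# Hypothetical stand-in corpus, NOT the real training sentences.
TOY_CORPUS = [
    "the cat sat on the mat",
    "the rat sat on the cat",
    "the cat and the rat sat",
]
merges = train_bpe(TOY_CORPUS, vocab_size=300)

hindi = "नमस्ते दुनिया"
print(len(encode("the cat sat", merges)))  # far fewer than its 11 bytes
print(len(encode(hindi, merges)))          # 37: no merge applies, one token per byte
print(decode(encode(hindi, merges), merges) == hindi)  # True
```

The Hindi string is 37 bytes in UTF-8 (12 three-byte code points plus one space), which matches the 37 tokens in the table. That is what full byte fallback looks like: an ASCII-only corpus teaches no merge that touches those bytes.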

Why production tokenizers win

  • They are trained on massive, diverse corpora.
  • They learn merges for common words, code patterns, punctuation habits, emojis, and multilingual byte sequences.
  • They are tuned for deployment workloads, not one tiny local corpus.

Token cost takeaway

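If your tokenizer turns one user string into roughly 2x-7x as many tokens as a production tokenizer would, which is the range in the comparison above for everything outside the toy corpus, you pay for it everywhere: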

  • longer prompts,
  • slower attention,
  • higher memory use,
  • worse batching,
  • and earlier truncation.
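
A back-of-the-envelope sketch of how that inflation compounds. The function name, the 2,000-token prompt, and the 4,096-token window are made-up illustration values, and the quadratic term assumes standard full self-attention; other parts of a model scale closer to linearly with token count.

```python
def inflation_impact(reference_tokens: int, ratio: float, context_window: int) -> dict:
    """Estimate what a token-inflation ratio does to one prompt.

    reference_tokens: prompt length under the efficient tokenizer
    ratio:            inflated tokens per reference token (e.g. 3.0)
    context_window:   maximum tokens the model accepts
    """
    inflated = round(reference_tokens * ratio)
    return {
        "tokens": inflated,
        # Assumption: attention cost grows with the square of sequence length.
        "attention_cost_multiplier": ratio ** 2,
        "fits_in_window": inflated <= context_window,
        # How much reference-sized text still fits before truncation kicks in.
        "reference_tokens_that_fit": int(context_window / ratio),
    }


impact = inflation_impact(reference_tokens=2000, ratio=3.0, context_window=4096)
print(impact)
```

With these illustrative numbers, a prompt that used a bit under half the window now overflows it, and the attention term alone grows ninefold.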

Tokenizer mismatch risk

This is the practical lesson for AI engineering work:

  • If you chunk documents with one tokenizer and serve with another, chunk budgets drift.
  • If you estimate prompt size with word counts instead of deployment-token counts, your cost and latency math lies.
  • If you evaluate only on in-domain English text, you miss failure modes on code, SKUs, prices, Hindi, and emoji-heavy chat.
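
The first two risks can be demonstrated without any real tokenizer. In the sketch below, `count_words` and `count_utf8_bytes` are deliberately crude stand-ins for "the counter used when chunking" and "the counter the deployment actually uses"; all names are hypothetical. In a real pipeline the serving-side counter would be something like `len(tiktoken.get_encoding("o200k_base").encode(text))`.

```python
from typing import Callable, List, Tuple

TokenCounter = Callable[[str], int]


def count_words(text: str) -> int:
    """Stand-in for the chunking-time estimate: whitespace-separated words."""
    return len(text.split())


def count_utf8_bytes(text: str) -> int:
    """Stand-in for a serving tokenizer that falls back to one token per byte."""
    return len(text.encode("utf-8"))


def chunk_by_budget(sentences: List[str], budget: int, count: TokenCounter) -> List[str]:
    """Greedily pack sentences into chunks whose summed count stays within budget.

    A single sentence larger than the budget still becomes its own chunk.
    """
    chunks, current, used = [], [], 0
    for sentence in sentences:
        n = count(sentence)
        if current and used + n > budget:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(sentence)
        used += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def over_budget(chunks: List[str], budget: int, count: TokenCounter) -> List[Tuple[str, int]]:
    """Return the chunks that exceed the budget under a different counter."""
    return [(c, count(c)) for c in chunks if count(c) > budget]


docs = [
    "SKU-A17 ships in 2-3 days.",
    "email me at gaurav@example.com",
    "Version v2.1.0-beta released",
    "नमस्ते दुनिया",
]
BUDGET = 12
chunks = chunk_by_budget(docs, BUDGET, count_words)
drifted = over_budget(chunks, BUDGET, count_utf8_bytes)

for text, n in drifted:
    print(f"{n:>3} units (budget {BUDGET}): {text!r}")
```

Every chunk respects the budget under the counter that built it, and every chunk breaks it under the other one. Re-measuring chunks with the deployment tokenizer before serving is a cheap check for this kind of drift.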

Bottom line

The custom tokenizer is correct and lossless. But it is not production-grade because the training corpus is too small and too narrow. That gap is exactly why tokenizer choice affects cost, recall, and model behavior upstream of attention.