Skip to content

Exercise 07 — BPE Tokenizer From Scratch

Timebox: 60-90 minutes

Goal

Implement Byte-Pair Encoding training and encoding/decoding without a tokenizer library. Common live-coding question for AI Eng loops.

Work in

  • bpe.py
  • train_tokenizer.py
  • test_bpe.py
  • analysis.md

Tasks

  1. Train: starting from a corpus, iteratively merge the most frequent adjacent symbol pair until you hit a vocab budget.
  2. Encode: turn a string into token IDs using the learned merge rules.
  3. Decode: turn token IDs back into a string.
  4. Round-trip a small corpus and assert decode(encode(x)) == x.
  5. Add a --vocab-size flag to your training routine.

Done when

  • A small corpus trains in under a few seconds
  • Encode and decode round-trip cleanly
  • You can explain the algorithm at a whiteboard without notes

Implementation notes

  • This solution uses byte-level BPE, so there is no [UNK] fallback path.
  • Base vocabulary starts at 256 single-byte tokens, so --vocab-size must be at least 256.
  • Merge order is stored explicitly and replayed during encode, which keeps encode and decode consistent.

Run

python3 train_tokenizer.py --vocab-size 300 --sample "the cat sat"
python3 -m unittest discover -s . -p "test_*.py" -v

The training CLI prints both:

  • learned vocabulary entries beyond the base 256 byte tokens
  • learned merge order

Sample encoding

Using the built-in three-line corpus:

sample: 'the cat sat'
sample ids: [277, 256]
decoded: 'the cat sat'

Stretch

  • Byte-level fallback (no <unk> token ever)
  • Compare your vocab against tiktoken for the same corpus and explain the differences