Exercise 07: BPE Tokenizer From Scratch
Timebox: 60-90 minutes
Goal
Implement Byte-Pair Encoding (BPE) training, encoding, and decoding without a tokenizer library. This is a common live-coding question in AI engineering interview loops.
Work in
- bpe.py
- train_tokenizer.py
- test_bpe.py
- analysis.md
Tasks
- Train: starting from a corpus, iteratively merge the most frequent adjacent symbol pair until you hit a vocab budget.
- Encode: turn a string into token IDs using the learned merge rules.
- Decode: turn token IDs back into a string.
- Round-trip a small corpus and assert decode(encode(x)) == x.
- Add a --vocab-size flag to your training routine.
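The training task above can be sketched roughly as follows. This is a minimal illustration, not the reference solution: the names `train_bpe` and `merge_pair` are hypothetical, the corpus is treated as one unbroken byte sequence (no whitespace pre-tokenization), and ties between equally frequent pairs are broken by picking the numerically smaller pair so the result is deterministic.

```python
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(corpus, vocab_size):
    """Learn byte-level BPE merges until the vocabulary reaches `vocab_size`."""
    if vocab_size < 256:
        raise ValueError("vocab_size must be at least 256")

    ids = list(corpus.encode("utf-8"))           # start from raw UTF-8 bytes
    vocab = {i: bytes([i]) for i in range(256)}  # base vocabulary: one token per byte
    merges = []                                  # ordered list of ((left, right), new_id)

    while len(vocab) < vocab_size:
        pair_counts = Counter(zip(ids, ids[1:]))
        if not pair_counts:
            break  # fewer than two symbols left, nothing to merge
        # Assumption: highest count wins, ties go to the smaller pair.
        pair = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        new_id = len(vocab)
        vocab[new_id] = vocab[pair[0]] + vocab[pair[1]]
        merges.append((pair, new_id))
        ids = merge_pair(ids, pair, new_id)

    return vocab, merges


vocab, merges = train_bpe("aaabdaaabac", vocab_size=258)
print(merges)  # two learned merges, in the order they were learned
```

Recounting all pairs on every iteration is quadratic in the worst case, which is usually acceptable for a small corpus; an incremental pair index would be the natural next optimization.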
Done when
- A small corpus trains in under a few seconds
- Encode and decode round-trip cleanly
- You can explain the algorithm at a whiteboard without notes
Implementation notes
- This solution uses byte-level BPE, so there is no [UNK] fallback path.
- Base vocabulary starts at 256 single-byte tokens, so --vocab-size must be at least 256.
- Merge order is stored explicitly and replayed during encode, which keeps encode and decode consistent.
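To make the last note concrete, here is a small sketch of replaying an ordered merge list during encoding and inverting it during decoding. It assumes merges are stored as `((left, right), new_id)` tuples; that layout, the helper `build_vocab`, and the hand-written merge table are illustrative choices, not necessarily what bpe.py does.

```python
def build_vocab(merges):
    """Rebuild the id -> bytes table from the 256 base bytes plus ordered merges."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges:
        vocab[new_id] = vocab[left] + vocab[right]
    return vocab


def encode(text, merges):
    """Apply each merge rule in the order it was learned."""
    ids = list(text.encode("utf-8"))
    for (left, right), new_id in merges:
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and ids[i] == left and ids[i + 1] == right:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids


def decode(ids, vocab):
    """Concatenate each token's bytes, then decode the result as UTF-8."""
    return b"".join(vocab[i] for i in ids).decode("utf-8")


# Hypothetical merge table: "t"+"h" -> 256 ("th"), then "th"+"e" -> 257 ("the").
merges = [((116, 104), 256), ((256, 101), 257)]
vocab = build_vocab(merges)

ids = encode("the cat", merges)
print(ids)                 # [257, 32, 99, 97, 116]
print(decode(ids, vocab))  # the cat
```

Because every input byte already has a base token, any UTF-8 string can be encoded even if no merge applies to it, which is why no unknown-token path is needed.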
Run
python3 train_tokenizer.py --vocab-size 300 --sample "the cat sat"
python3 -m unittest discover -s . -p "test_*.py" -v
The training CLI prints both:
- learned vocabulary entries beyond the base 256 byte tokens
- learned merge order
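One possible way to wire up the two flags used in the commands above is an `argparse` parser that rejects vocabulary sizes below 256 at parse time. This is a sketch only: the default of 300, the help strings, the interpretation of --sample as a string to encode after training, and the names `vocab_size_arg` and `build_parser` are assumptions, not taken from train_tokenizer.py.

```python
import argparse


def vocab_size_arg(value):
    """Parse --vocab-size and enforce the 256-token byte-level floor."""
    size = int(value)  # a ValueError here is reported by argparse as a usage error
    if size < 256:
        raise argparse.ArgumentTypeError("--vocab-size must be at least 256")
    return size


def build_parser():
    parser = argparse.ArgumentParser(description="Train a byte-level BPE tokenizer.")
    parser.add_argument(
        "--vocab-size",
        type=vocab_size_arg,
        default=300,  # assumed default, chosen only for illustration
        help="target vocabulary size, including the 256 base byte tokens",
    )
    parser.add_argument(
        "--sample",
        default=None,
        help="optional string to encode after training (assumed meaning)",
    )
    return parser


args = build_parser().parse_args(["--vocab-size", "300", "--sample", "the cat sat"])
print(args.vocab_size, args.sample)  # 300 the cat sat
```

Validating in the `type` callable means an invalid size fails with a normal usage error before any training work starts.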
Sample encoding
Using the built-in three-line corpus:
Stretch
- Byte-level fallback (no <unk> token ever)
- Compare your vocab against tiktoken for the same corpus and explain the differences