Skip to content

AI Engineering Playbook

Exercise 07 — BPE Tokenizer From Scratch

Exercise 07 — BPE Tokenizer From Scratch¶

Timebox: 60-90 minutes

Goal¶

Implement Byte-Pair Encoding training and encoding/decoding without a tokenizer library. Common live-coding question for AI Eng loops.

Work in¶

bpe.py
train_tokenizer.py
test_bpe.py
analysis.md

Tasks¶

Train: starting from a corpus, iteratively merge the most frequent adjacent symbol pair until you hit a vocab budget.
Encode: turn a string into token IDs using the learned merge rules.
Decode: turn token IDs back into a string.
Round-trip a small corpus and assert decode(encode(x)) == x.
Add a --vocab-size flag to your training routine.

Done when¶

A small corpus trains in under a few seconds
Encode and decode round-trip cleanly
You can explain the algorithm at a whiteboard without notes

Implementation notes¶

This solution uses byte-level BPE, so there is no [UNK] fallback path.
Base vocabulary starts at 256 single-byte tokens, so --vocab-size must be at least 256.
Merge order is stored explicitly and replayed during encode, which keeps encode and decode consistent.

Run¶

python3 train_tokenizer.py --vocab-size 300 --sample "the cat sat"
python3 -m unittest discover -s . -p "test_*.py" -v

The training CLI prints both:

learned vocabulary entries beyond the base 256 byte tokens
learned merge order

Sample encoding¶

Using the built-in three-line corpus:

sample: 'the cat sat'
sample ids: [277, 256]
decoded: 'the cat sat'

Stretch¶

Byte-level fallback (no <unk> token ever)
Compare your vocab against tiktoken for the same corpus and explain the differences