05. Assignment 2 — BPE Tokenizer from Scratch¶

Required reading before you code: read 02_explainer.md §2.4-§2.6 for the BPE merge logic, §2.7 for production trade-offs, and §6.5 for the exercise framing.

Build a Byte Pair Encoding tokenizer from scratch. Do not use tiktoken, sentencepiece, or other tokenizer libraries.

Goal¶

Implement a BPE tokenizer that: - learns merge rules from a training corpus, - encodes arbitrary text into token IDs, - decodes token IDs back to text, - and handles unseen strings through reusable pieces.

Requirements¶

train(corpus, vocab_size) — learn merge rules from text.
encode(text) -> list[int] — tokenize text using learned merges.
decode(ids) -> str — reconstruct text from token IDs.
vocab — inspect the learned vocabulary and merge order.

Spec¶

Start with byte-level base tokens.
Iteratively merge the most frequent adjacent pair until vocab_size is reached.
Store merge rules in order.
Preserve enough metadata to encode and decode consistently.
Test on at least 3 different text samples.

Deliverables¶

bpe.py — tokenizer implementation in pure Python.
train_tokenizer.py — script that trains on a small corpus.
test_bpe.py — tests for empty string, Unicode, and roundtrip behavior.
README.md — explanation of how your tokenizer works, plus sample encodings.
analysis.md — comparison against a production tokenizer on 10 sentences.

Hints¶

First make the training loop correct, then make it faster. Re-read the toy merge walk in 02_explainer.md §2.5.
Keep merge order explicit. The explainer shows why final tokens alone are not enough. See 02_explainer.md §2.5.
When encoding new text, apply learned merges in the stored order. See 02_explainer.md §2.6.
If you get stuck on unknown-looking strings, revisit why subword tokenization beats [UNK]. See 02_explainer.md §2.3-§2.4.
When writing your analysis, comment on token cost and tokenizer mismatch risk. See 02_explainer.md §2.7 and §6.4.

Success criteria¶

Encode -> decode roundtrip is lossless.
Learned vocabulary produces meaningful merges on in-domain text.
Unseen text stays representable without collapsing to a single unknown token.
Analysis explains where your simple BPE differs from production tokenizers and why.

Why this matters¶

This hands_on_lab forces you to understand the splitter, not just use it. If you can build BPE, you will reason better about token budgets, chunking, model inputs, and downstream cost.