05. Assignment 2 — BPE Tokenizer from Scratch¶
Required reading before you code: read 02_explainer.md §2.4-§2.6 for the BPE merge logic, §2.7 for production trade-offs, and §6.5 for the exercise framing.
Build a Byte Pair Encoding tokenizer from scratch.
Do not use tiktoken, sentencepiece, or other tokenizer libraries.
Goal¶
Implement a BPE tokenizer that: - learns merge rules from a training corpus, - encodes arbitrary text into token IDs, - decodes token IDs back to text, - and handles unseen strings through reusable pieces.
Requirements¶
train(corpus, vocab_size)— learn merge rules from text.encode(text) -> list[int]— tokenize text using learned merges.decode(ids) -> str— reconstruct text from token IDs.vocab— inspect the learned vocabulary and merge order.
Spec¶
- Start with byte-level base tokens.
- Iteratively merge the most frequent adjacent pair until
vocab_sizeis reached. - Store merge rules in order.
- Preserve enough metadata to encode and decode consistently.
- Test on at least 3 different text samples.
Deliverables¶
bpe.py— tokenizer implementation in pure Python.train_tokenizer.py— script that trains on a small corpus.test_bpe.py— tests for empty string, Unicode, and roundtrip behavior.README.md— explanation of how your tokenizer works, plus sample encodings.analysis.md— comparison against a production tokenizer on 10 sentences.
Hints¶
- First make the training loop correct, then make it faster. Re-read the toy merge walk in 02_explainer.md §2.5.
- Keep merge order explicit. The explainer shows why final tokens alone are not enough. See 02_explainer.md §2.5.
- When encoding new text, apply learned merges in the stored order. See 02_explainer.md §2.6.
- If you get stuck on unknown-looking strings, revisit why subword tokenization beats
[UNK]. See 02_explainer.md §2.3-§2.4. - When writing your analysis, comment on token cost and tokenizer mismatch risk. See 02_explainer.md §2.7 and §6.4.
Success criteria¶
- Encode -> decode roundtrip is lossless.
- Learned vocabulary produces meaningful merges on in-domain text.
- Unseen text stays representable without collapsing to a single unknown token.
- Analysis explains where your simple BPE differs from production tokenizers and why.
Why this matters¶
This hands_on_lab forces you to understand the splitter, not just use it. If you can build BPE, you will reason better about token budgets, chunking, model inputs, and downstream cost.