Skip to content

05. Assignment 2 — BPE Tokenizer from Scratch

Required reading before you code: read 02_explainer.md §2.4-§2.6 for the BPE merge logic, §2.7 for production trade-offs, and §6.5 for the exercise framing.

Build a Byte Pair Encoding tokenizer from scratch. Do not use tiktoken, sentencepiece, or other tokenizer libraries.

Goal

Implement a BPE tokenizer that: - learns merge rules from a training corpus, - encodes arbitrary text into token IDs, - decodes token IDs back to text, - and handles unseen strings through reusable pieces.

Requirements

  1. train(corpus, vocab_size) — learn merge rules from text.
  2. encode(text) -> list[int] — tokenize text using learned merges.
  3. decode(ids) -> str — reconstruct text from token IDs.
  4. vocab — inspect the learned vocabulary and merge order.

Spec

  • Start with byte-level base tokens.
  • Iteratively merge the most frequent adjacent pair until vocab_size is reached.
  • Store merge rules in order.
  • Preserve enough metadata to encode and decode consistently.
  • Test on at least 3 different text samples.

Deliverables

  1. bpe.py — tokenizer implementation in pure Python.
  2. train_tokenizer.py — script that trains on a small corpus.
  3. test_bpe.py — tests for empty string, Unicode, and roundtrip behavior.
  4. README.md — explanation of how your tokenizer works, plus sample encodings.
  5. analysis.md — comparison against a production tokenizer on 10 sentences.

Hints

  • First make the training loop correct, then make it faster. Re-read the toy merge walk in 02_explainer.md §2.5.
  • Keep merge order explicit. The explainer shows why final tokens alone are not enough. See 02_explainer.md §2.5.
  • When encoding new text, apply learned merges in the stored order. See 02_explainer.md §2.6.
  • If you get stuck on unknown-looking strings, revisit why subword tokenization beats [UNK]. See 02_explainer.md §2.3-§2.4.
  • When writing your analysis, comment on token cost and tokenizer mismatch risk. See 02_explainer.md §2.7 and §6.4.

Success criteria

  • Encode -> decode roundtrip is lossless.
  • Learned vocabulary produces meaningful merges on in-domain text.
  • Unseen text stays representable without collapsing to a single unknown token.
  • Analysis explains where your simple BPE differs from production tokenizers and why.

Why this matters

This hands_on_lab forces you to understand the splitter, not just use it. If you can build BPE, you will reason better about token budgets, chunking, model inputs, and downstream cost.