12. WordPiece and Unigram — same destination, different training logic

BPE grows pieces upward. These two splitters either score smarter or prune downward.

Built on the ELI5 in 00-eli5.md. The splitter — which chooses the model's text pieces — now has two more training habits besides BPE.


The picture before the math

See. All three methods want the same thing. Short token sequences. Good coverage. Few [UNK] disasters. But they train the splitter differently.

BPE says: start small, then keep gluing the most common neighbours.
WordPiece says: start small too, but do not trust raw counts alone. Pick the merge that improves corpus likelihood the most.
Unigram says: start with a huge shelf of possible pieces. Then throw away the pieces the corpus can live without.

So the destination is shared. The road is not. Think of three shopkeepers making a parts catalog.

BPE        : keep adding the most common joined part
WordPiece  : keep adding the most useful joined part
Unigram    : start huge, then remove the least useful part
Simple, no?

One compact comparison

Keep this table in your head.

method      | direction   | selection criterion                       | notable models
------------+-------------+-------------------------------------------+-------------------------
BPE         | bottom-up   | most frequent adjacent pair               | GPT-style tokenizers
WordPiece   | bottom-up   | biggest likelihood gain from a merge      | BERT
Unigram     | top-down    | remove token that hurts likelihood least  | T5, mT5, ALBERT SentencePiece setups
Bottom-up means: we begin with tiny pieces and build upward. Top-down means: we begin with too many pieces and prune downward. The splitter goal is identical. The training logic differs.
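
To make the BPE row concrete, here is a minimal sketch of one bottom-up training step: count adjacent pairs, pick the most frequent, glue it everywhere. The toy corpus, its counts, and the helper names are assumptions made up for illustration. Real trainers repeat this step thousands of times and handle ties and word boundaries more carefully.

```python
from collections import Counter

def most_frequent_pair(words):
    """Return the adjacent symbol pair with the highest corpus count."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs.items(), key=lambda item: item[1])

def merge_pair(words, pair):
    """Glue every occurrence of `pair` into one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical toy corpus: word (split into characters) -> frequency.
corpus = {tuple("play"): 5, tuple("played"): 3, tuple("day"): 1}

pair, count = most_frequent_pair(corpus)
merged = merge_pair(corpus, pair)
print(pair, count)   # the winning pair and how often it occurred
print(merged)        # the corpus after one merge
```

Here a+y wins because it appears in all three words. WordPiece would run the same loop but swap the judge inside most_frequent_pair.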

Why BPE was already good

BPE fixed the main disaster of word-level splitting. Rare words no longer collapse into [UNK]. playing can become play + ing. microservices-v2 can become reusable chunks too. So what is left to improve? Mainly this: raw frequency is useful, but frequency is not the whole story. Some pairs are frequent because both parts are boringly common. Some pairs are less frequent, but much more informative as one chunk. That is where WordPiece enters.

WordPiece — same ladder, smarter judge

Picture a school principal checking pairs for promotion. BPE asks: which pair stood together most often? WordPiece asks: which pair deserves to stay together as one badge most strongly? So WordPiece still grows bottom-up. But the judge is different. A common intuition is this score:

pair usefulness ~ pair_count / (left_count x right_count)
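
Do not worship the exact formula here. Keep the picture. The denominator punishes pairs whose parts are common on their own, so a pair wins only when it appears together far more often than its parts' individual counts would suggest. That is close to saying: this merge gives good likelihood gain.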

Tiny numeric contrast — frequency vs usefulness

Suppose the splitter is considering two candidate merges.

candidate   pair_count   left_count   right_count   usefulness
a+b         20           100          100           20 / 10000 = 0.002
x+y         8            10           10            8 / 100   = 0.080

BPE looks only at pair_count. So BPE picks a+b (20 beats 8).

WordPiece notices something subtler. a and b are common everywhere, so meeting 20 times out of 100 appearances each is not very special. But x and y almost belong together whenever they appear: 8 of their 10 appearances. So WordPiece prefers x+y (0.080 beats 0.002). See the difference?

BPE reward: "you occur a lot."
WordPiece reward: "you belong together strongly."

That is why WordPiece often produces cleaner continuation pieces.
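
The same contrast as a few lines of Python, using the numbers from the table. The score function is the intuition formula from this chapter, not a claim about any particular library's internals, and the dictionary layout is an assumption for illustration.

```python
# Candidate merges from the table: pair count plus the counts of each part.
candidates = {
    "a+b": {"pair": 20, "left": 100, "right": 100},
    "x+y": {"pair": 8, "left": 10, "right": 10},
}

def bpe_score(c):
    """BPE judge: raw pair frequency only."""
    return c["pair"]

def wordpiece_score(c):
    """WordPiece-style judge: pair count normalised by the parts' counts."""
    return c["pair"] / (c["left"] * c["right"])

bpe_pick = max(candidates, key=lambda name: bpe_score(candidates[name]))
wordpiece_pick = max(candidates, key=lambda name: wordpiece_score(candidates[name]))

print("BPE picks:", bpe_pick)
print("WordPiece picks:", wordpiece_pick)
```

Same data, same loop, different judge. That one swapped function is the whole difference between the two bottom-up methods.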

What WordPiece outputs look like

BERT-style WordPiece often marks continuation chunks with ##. Example:

playing -> play | ##ing
unhappily -> un | ##happi | ##ly
The ## marker means: this piece continues the current word rather than starting a new one. So the splitter is learning both the pieces and where they sit inside a word.
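
How does a trained WordPiece vocabulary turn a word into pieces? BERT-style encoders are usually described as greedy longest-match-first: take the longest vocabulary piece that fits at the current position, then continue with a ## prefix. Below is a minimal sketch of that idea. The tiny vocabulary and the function name are assumptions for illustration, not the real BERT vocabulary or API.

```python
def wordpiece_encode(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece-style pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # not word-initial: continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits here, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary.
vocab = {"play", "##ing", "##s", "##ed"}

print(wordpiece_encode("playing", vocab))
print(wordpiece_encode("plays", vocab))
print(wordpiece_encode("xyz", vocab))
```

Notice that "ing" at the start of a word and "##ing" inside a word are different vocabulary entries. That is the word-boundary habit mentioned above.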

Unigram flips the direction

Now flip the whole training story. Unigram does not say, "let me build bigger pieces one merge at a time." It says, "let me start with a fat vocabulary, then prune the weak pieces." Picture a shelf that is too full.

initial shelf:
play, playing, playi, ing, lay, ay, p, l, a, y, ...
goal:
keep only the pieces that help overall likelihood
drop the rest
So Unigram is top-down. That is the clean contrast with BPE.
BPE      : tiny -> medium -> bigger
Unigram  : huge -> smaller -> cleaner
Same splitter goal. Opposite search direction.

How Unigram chooses what to remove

Each token on the shelf has a probability. A word can often be segmented in multiple ways. Unigram asks: if I remove this token, how badly does corpus likelihood fall? If the harm is tiny, remove it. If the harm is large, keep it. So the pruning rule is:

drop the token whose removal hurts least

This is why Unigram feels calm. It does not commit to one greedy merge path early. It keeps many candidate pieces alive, then deletes the weak ones later. In practice, dynamic programming (a Viterbi-style search) finds the best segmentation under the current token probabilities, so the likelihood check does not need to enumerate every possible split.
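
A rough sketch of the pruning question, "how badly does the corpus suffer if this token goes?" It is simplified on purpose. Costs are read as negative log-probabilities. Harm is measured on the best segmentation only. Nothing is re-estimated between removals. Real Unigram trainers re-estimate probabilities and typically prune many tokens per round. The shelf, the one-word corpus, and the function names are assumptions for illustration.

```python
import math

def best_cost(word, costs):
    """Cost of the cheapest segmentation of `word`; inf if no segmentation exists."""
    best = [0.0] + [math.inf] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in costs:
                best[end] = min(best[end], best[start] + costs[piece])
    return best[len(word)]

def removal_harm(corpus, costs):
    """For each multi-character token, how much total corpus cost rises without it."""
    base = sum(freq * best_cost(word, costs) for word, freq in corpus.items())
    harm = {}
    for token in costs:
        if len(token) == 1:
            continue  # keep single characters so words stay encodable
        reduced = {t: c for t, c in costs.items() if t != token}
        total = sum(freq * best_cost(word, reduced) for word, freq in corpus.items())
        harm[token] = total - base
    return harm

# Hypothetical shelf (token -> cost) and a one-word corpus (word -> frequency).
shelf = {"token": 1.1, "izer": 1.3, "s": 0.7, "tokenize": 2.6,
         "r": 2.4, "i": 2.0, "zer": 2.2}
corpus = {"tokenizers": 1}

harm = removal_harm(corpus, shelf)
for token, value in sorted(harm.items(), key=lambda item: item[1]):
    print(f"{token:10s} harm {value:.2f}")
```

Tokens with near-zero harm are never on the best path, so the corpus can live without them. Tokens with large harm carry real weight and stay on the shelf.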

Worked example — same word, different split

Take the word:

tokenizers
Now suppose we trained two different splitters.

BPE-style outcome

Assume the learned merge list already created tokenize, but never created izer. Then encoding is deterministic:

t o k e n i z e r s
-> to
-> tok
-> toke
-> token
-> tokeni
-> tokeniz
-> tokenize
final: tokenize | r | s
That is classic bottom-up behaviour. The earlier merge path decides the later split.
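
The ladder above can be replayed in code. This is a minimal sketch of BPE encoding with an ordered merge list: at every step, apply the earliest-learned merge that is still possible. The merge list is the hypothetical one implied by the ladder, not one taken from a real tokenizer. Production encoders add caching and byte handling on top.

```python
def bpe_encode(word, merges):
    """Apply learned merges in priority order until none applies."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Rank every adjacent pair; unknown pairs get rank infinity.
        ranked = [
            (ranks.get((left, right), float("inf")), i)
            for i, (left, right) in enumerate(zip(symbols, symbols[1:]))
        ]
        rank, i = min(ranked)
        if rank == float("inf"):
            break  # no learned merge applies any more
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

# Hypothetical merge list, earliest-learned first. There is no merge that builds "izer".
merges = [
    ("t", "o"), ("to", "k"), ("tok", "e"), ("toke", "n"),
    ("token", "i"), ("tokeni", "z"), ("tokeniz", "e"),
]

print(bpe_encode("tokenizers", merges))
```

No probabilities are consulted at encode time. The merge order alone decides the split, which is why the result is deterministic.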

Unigram-style outcome

Now imagine the Unigram shelf contains these candidate pieces.

token      cost 1.1
izer       cost 1.3
s          cost 0.7
tokenize   cost 2.6
r          cost 2.4
i          cost 2.0
zer        cost 2.2
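
Read each cost as a negative log-probability: a likelier token has a smaller cost, and adding costs is the same as multiplying probabilities. So smaller total cost is better. Possible segmentations: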
A: tokenize | r | s
cost = 2.6 + 2.4 + 0.7 = 5.7
B: token | izer | s
cost = 1.1 + 1.3 + 0.7 = 3.1
C: token | i | zer | s
cost = 1.1 + 2.0 + 2.2 + 0.7 = 6.0

Unigram chooses B, the cheapest of the three.

tokenizers -> token | izer | s

So the same raw word gets different pieces. BPE says: follow my learned merge ladder. Unigram says: among today's candidate pieces, pick the segmentation with best total likelihood. This is the key difference. BPE learns an ordered merge recipe. Unigram learns a probabilistic token inventory.
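
Nobody lists segmentations A, B, C by hand in practice. A small dynamic program finds the cheapest split directly. Here is a minimal Viterbi-style sketch over the shelf from this example. The function name is an assumption for illustration. Real libraries work in log-probabilities and add unknown-character handling.

```python
import math

def viterbi_segment(word, costs):
    """Return (pieces, total_cost) for the cheapest segmentation of `word`."""
    n = len(word)
    best = [0.0] + [math.inf] * n   # best[i] = cheapest cost to cover word[:i]
    back = [0] * (n + 1)            # back[i] = where the last piece of that path starts
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in costs and best[start] + costs[piece] < best[end]:
                best[end] = best[start] + costs[piece]
                back[end] = start
    if math.isinf(best[n]):
        return None, math.inf       # the shelf cannot spell this word
    pieces, end = [], n
    while end > 0:                  # walk the back-pointers from the end
        start = back[end]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1], best[n]

# The candidate shelf from the worked example (token -> cost).
shelf = {"token": 1.1, "izer": 1.3, "s": 0.7, "tokenize": 2.6,
         "r": 2.4, "i": 2.0, "zer": 2.2}

pieces, cost = viterbi_segment("tokenizers", shelf)
print(pieces, round(cost, 2))
```

The search considers tokenize | r | s as well, but it loses on total cost. Change the costs on the shelf and the chosen split can change, with no merge list involved.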

Why this top-down view is useful

A greedy bottom-up merge can lock in awkward paths early. Sometimes that is fine. Sometimes it is not. Unigram gives the splitter another option. Keep many pieces alive. Then let likelihood decide which combinations survive. This can work nicely for multilingual text, mixed scripts, and cases where one rigid merge ladder is too limiting.

SentencePiece — the practical toolkit wrapper

Now one production detail. Many people say "Unigram tokenizer" when they actually mean "SentencePiece Unigram." SentencePiece is the toolkit. Inside it, you can train BPE or Unigram. So SentencePiece is not one algorithm only. It is the practical wrapper around both training styles. Useful properties:

  • it can train directly from raw text
  • it can keep whitespace information with the ▁ marker
  • it can support raw byte handling or byte fallback for odd characters
  • it serializes the tokenizer artifact cleanly for serving

Picture it like this:

raw text
   |
   v
SentencePiece toolkit
   |
   +--> train BPE model
   |
   +--> train Unigram model
   |
   +--> handle raw text / whitespace / bytes safely
So when someone says, "this model uses SentencePiece," ask the next question: BPE or Unigram? Simple, no?
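
One of those properties is easy to see in miniature: whitespace is kept as an ordinary symbol, so text can be rebuilt exactly from its pieces. The sketch below imitates that idea in plain Python. It does not call the SentencePiece library, and the segmentation shown is a made-up example, not the output of a trained model.

```python
MARKER = "\u2581"  # the "▁" character used to stand in for a space

def escape_whitespace(text):
    """Replace spaces with the visible marker before segmentation."""
    return text.replace(" ", MARKER)

def detokenize(pieces):
    """Rebuild the original text: join the pieces, then turn markers back into spaces."""
    return "".join(pieces).replace(MARKER, " ")

escaped = escape_whitespace("Hello World.")

# A hypothetical segmentation of the escaped text.
pieces = ["Hello", MARKER + "Wor", "ld", "."]

round_trip = detokenize(pieces)
print(escaped)
print(round_trip)
```

Because the space lives inside a piece, no language-specific rule is needed to decide where spaces go back. That is a big part of why the toolkit works directly on raw text.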

When this matters in practice

Here is the production trap. A model is trained with one splitter. Serving uses another splitter. The model still runs. But meaning quietly degrades. Why? Because the badge board rows no longer match the expected token pieces. Example:

model expects:
playing -> play | ##ing -> IDs [2107, 1379]
wrong service splitter emits:
playing -> pl | ay | ing -> IDs [441, 812, 1379]

The same raw word reached different badge numbers. So the badge board opens different drawers. Every downstream attention pattern shifts. This is a real bug. Not a cosmetic difference.

Sometimes the failure is obvious. Special tokens mismatch. The model crashes or truncates badly. Sometimes the failure is subtle. No crash. Just worse answers, strange retrieval, or silent quality loss.

So what to do?

  • ship the tokenizer artifact with the model
  • version them together
  • test exact string-to-ID outputs in CI
  • never assume "all subword tokenizers are close enough"

Close enough is not enough here.
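
The "test exact string-to-ID outputs in CI" advice can be as small as a golden-file check. Below is a minimal sketch. The golden table reuses the illustrative IDs from the example above. The two encode functions are stand-ins for whatever tokenizer the service actually loads, so every name here is an assumption.

```python
def find_mismatches(encode, golden):
    """Compare a tokenizer's IDs against golden IDs saved at training time."""
    mismatches = []
    for text, expected_ids in golden.items():
        actual_ids = encode(text)
        if actual_ids != expected_ids:
            mismatches.append((text, expected_ids, actual_ids))
    return mismatches

# Golden pairs recorded alongside the model weights (illustrative IDs).
GOLDEN = {"playing": [2107, 1379]}

def matching_encode(text):
    """Stand-in for the tokenizer the model was trained with."""
    return {"playing": [2107, 1379]}.get(text, [])

def wrong_encode(text):
    """Stand-in for a "similar" tokenizer loaded by mistake."""
    return {"playing": [441, 812, 1379]}.get(text, [])

print(find_mismatches(matching_encode, GOLDEN))  # nothing to report
print(find_mismatches(wrong_encode, GOLDEN))     # the CI job should fail on this
```

In a real pipeline the golden table would cover special tokens, mixed scripts, and whitespace edge cases, and the build would fail on any non-empty mismatch list.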

Quick mental summary

Keep this three-line memory hook.

BPE       = build upward by frequency
WordPiece = build upward by likelihood gain
Unigram   = prune downward by likelihood harm
If that stays in your head, the chapter is already useful.

Where this lives in the wild

  • Google BERT uses WordPiece so search and ranking text split the way BERT weights expect.
  • Google T5 and mT5 commonly use SentencePiece Unigram to handle raw multilingual text cleanly.
  • Meta Llama serving stacks keep the tokenizer artifact bundled with weights because mismatch corrupts IDs immediately.
  • OpenAI API client libraries need exact tokenizer alignment for token counting, truncation, and cost budgeting.
  • Hugging Face inference endpoints must load the checkpoint-matched tokenizer files, not a "similar" subword tokenizer.

Interview Q&A

Q: What is the core difference between BPE and WordPiece?
A: Both grow tokens bottom-up, but BPE picks merges by frequency while WordPiece prefers merges that improve likelihood more.
Common wrong answer to avoid: "WordPiece is just BPE with different symbols." The training judge changes, not only the notation.

Q: Why is Unigram called top-down?
A: Because it starts with a large candidate vocabulary and prunes tokens whose removal hurts likelihood the least.
Common wrong answer to avoid: "Because it reads text from right to left." Top-down refers to vocabulary search direction, not token order.

Q: Can the same word be split differently by BPE and Unigram?
A: Yes. BPE follows its learned merge order, while Unigram chooses the lowest-cost segmentation under its token probabilities.

Q: Why is model-tokenizer mismatch a production bug?
A: Because different splits produce different token IDs, so the badge board looks up the wrong rows.
Common wrong answer to avoid: "Only decoding changes a little." No. The whole model input changes.

Apply now (5 min)

Take the two candidate WordPiece merges below.

a+b : pair_count 18, left_count 90, right_count 90
x+y : pair_count 7,  left_count 8,  right_count 9

First, say which merge BPE picks. Then, say which merge WordPiece prefers.

Now use this Unigram shelf:
play   cost 1.0
ing    cost 0.8
playi  cost 2.4
ng     cost 2.0

Choose the better split for playing.

Sketch from memory:

  • the BPE vs WordPiece vs Unigram comparison table
  • one bottom-up ladder
  • one top-down pruning shelf
  • the production bug where the wrong splitter breaks badge IDs


Bridge. The splitter is now fully understood: three algorithms, same goal. But so far every token only talks to itself. What happens when a token needs information from somewhere other than itself? That is cross-attention. Read 13-cross-attention.md next.