AI Engineering Playbook

03. Tokenization & Attention — Study Material

For a deeper treatment, see 02_explainer.md.

Tokenization

Text must be split into repeatable units before modeling.
Character-level tokenization handles any string but creates long sequences.
Word-level tokenization shortens sequences but fails on out-of-vocabulary words such as typos, names, and new terms.
Subword tokenization is the practical compromise.
Common words may stay whole.
Rare words break into reusable parts.

See explainer §2.1-§2.4.
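
The three granularities above can be compared side by side. This is a minimal sketch, assuming whitespace splitting for words and a tiny hand-picked subword vocabulary; `VOCAB`, the function names, and the greedy longest-match rule are illustrative assumptions, not the explainer's tokenizer.

```python
# Hypothetical subword vocabulary, chosen by hand for illustration only.
VOCAB = {"token", "ization", "un", "break", "able"}

def char_tokenize(text):
    # Handles any string, but yields one token per character.
    return list(text)

def word_tokenize(text, known_words):
    # Closed vocabulary: every unseen word collapses to <unk>.
    return [w if w in known_words else "<unk>" for w in text.split()]

def subword_tokenize(word, vocab=VOCAB):
    # Greedy longest match; falls back to single characters so it never fails.
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

print(len(char_tokenize("tokenization")))                    # 12
print(word_tokenize("tokenization is unbreakable", {"is"}))  # ['<unk>', 'is', '<unk>']
print(subword_tokenize("tokenization"))                      # ['token', 'ization']
print(subword_tokenize("unbreakable"))                       # ['un', 'break', 'able']
```

The character version never fails but is long, the word version is short but loses unseen words, and the subword version keeps both words representable with a handful of reusable pieces.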

BPE in one page

Start from base symbols.
Count adjacent pairs.
Merge the most frequent pair.
Repeat until the target vocabulary size is reached.
Store the merge order, not only the final tokens: encoding replays the merges in the order they were learned.
Unseen words stay representable through learned pieces.

Toy path from the explainer:
- t o -> to
- to k -> tok
- tok e -> toke
- toke n -> token

See explainer §2.5-§2.6.
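
The training loop and the toy path can be sketched in a few lines. This is a simplified illustration, assuming character base symbols, a three-word toy corpus, and first-seen tie-breaking; `learn_bpe`, `merge_pair`, and `encode` are hypothetical names, and production tokenizers add byte-level fallback and pre-tokenization that are omitted here.

```python
from collections import Counter

def merge_pair(symbols, pair):
    # Replace every adjacent occurrence of `pair` with the merged symbol.
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def learn_bpe(words, num_merges):
    corpus = [list(w) for w in words]        # start from base symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in corpus:               # count adjacent pairs
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)     # most frequent; ties go to the first seen
        merges.append(best)                  # store merge order
        corpus = [merge_pair(s, best) for s in corpus]
    return merges

def encode(word, merges):
    symbols = list(word)
    for pair in merges:                      # replay merges in learned order
        symbols = merge_pair(symbols, pair)
    return symbols

merges = learn_bpe(["token", "token", "to"], num_merges=4)
print(merges)  # [('t', 'o'), ('to', 'k'), ('tok', 'e'), ('toke', 'n')]
print(encode("tokens", merges))  # ['token', 's']
```

The unseen word "tokens" is still representable: the learned merges rebuild "token" and the leftover "s" stays a base symbol.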

Vocabulary trade-offs

Choice	Upside	Cost
Small vocab	Smaller embedding/output tables	More token splits, so longer sequences
Medium vocab	Moderate table size	Moderate sequence length
Large vocab	Fewer splits, so shorter sequences	Larger embedding/output tables

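Watch for multilingual inflation (text in languages under-represented in the tokenizer's training data splits into more tokens) and tokenizer-model mismatch (IDs produced by one tokenizer are meaningless to a model trained with another).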

See explainer §2.7.
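
The table-size side of the trade-off is simple arithmetic. The sketch below assumes one embedding table of shape [vocab_size, d_model] plus an output projection of the same size, optionally tied; the vocabulary sizes and `d_model=4096` are hypothetical values picked for illustration.

```python
def table_params(vocab_size, d_model, tied=True):
    """Parameters in the input embedding plus the output projection."""
    per_table = vocab_size * d_model
    # With tied weights the output layer reuses the embedding table.
    return per_table if tied else 2 * per_table

for vocab_size in (8_000, 32_000, 128_000):  # hypothetical sizes
    untied = table_params(vocab_size, d_model=4096, tied=False)
    print(f"{vocab_size:>7} tokens -> {untied:,} parameters (untied)")
```

Table size grows linearly with vocabulary size, which is the cost paid for fewer splits per word.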

Embeddings

Token IDs are addresses.
Embedding lookup is matrix indexing.
If E has shape [vocab_size, d_model], then token i maps to E[i].
One-hot multiplication and lookup are equivalent views.

See explainer §3.1-§3.2.
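
The equivalence between lookup and one-hot multiplication can be checked directly. This is a small sketch with a made-up table `E` of shape [4, 3]; the values and function names are illustrative only.

```python
# Made-up embedding table: vocab_size = 4, d_model = 3.
E = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

def lookup(E, i):
    # Token ID used as an address: plain row indexing.
    return E[i]

def one_hot_matmul(E, i):
    # Same result via a [1, vocab_size] one-hot row times E.
    one_hot = [1.0 if r == i else 0.0 for r in range(len(E))]
    d_model = len(E[0])
    return [sum(one_hot[r] * E[r][c] for r in range(len(E))) for c in range(d_model)]

print(lookup(E, 2))          # [0.7, 0.8, 0.9]
print(one_hot_matmul(E, 2))  # [0.7, 0.8, 0.9]
```

Frameworks implement the lookup form because it avoids multiplying by a row that is almost entirely zeros.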

Position information

Without position information, attention sees a bag of tokens: order is lost.
Add positional information so the same token differs by location.
Common recipe: input = token_embedding + position_embedding.
Sinusoidal encoding can be pictured as many clocks rotating at different speeds.
Learned absolute positions are simple but capped at the maximum sequence length used in training.

See explainer §3.3-§3.6.
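
The "clocks at different speeds" picture corresponds to sine and cosine pairs whose frequency drops with the dimension index. This is a minimal sketch of the standard sinusoidal formulation, assuming a base of 10000; the function names are illustrative.

```python
import math

def sinusoidal_position(pos, d_model, base=10000.0):
    # Dimension pair (i, i+1) is one "clock"; larger i means a slower clock.
    vec = []
    for i in range(0, d_model, 2):
        angle = pos / base ** (i / d_model)
        vec.append(math.sin(angle))
        if i + 1 < d_model:
            vec.append(math.cos(angle))
    return vec

def add_position(token_embedding, pos):
    # Common recipe: input = token_embedding + position_embedding.
    pe = sinusoidal_position(pos, len(token_embedding))
    return [t + p for t, p in zip(token_embedding, pe)]

print(sinusoidal_position(0, 4))               # [0.0, 1.0, 0.0, 1.0]
same_token = [1.0, 1.0, 1.0, 1.0]
print(add_position(same_token, 1) != add_position(same_token, 2))  # True
```

The same token embedding produces a different input vector at each position, which is what lets attention distinguish order.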

RoPE and ALiBi

RoPE

Rotate pairs of query and key dimensions by position-dependent angles.
The dot product between a rotated query and a rotated key then depends only on their relative offset.
Common in modern decoder models.
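
The relative-position property can be verified numerically. This is a sketch under common assumptions: adjacent dimensions are paired, the frequency base is 10000, and `rope` is a hypothetical helper name; real implementations differ in how they pair dimensions.

```python
import math

def rope(x, pos, base=10000.0):
    # Rotate each pair (x[i], x[i+1]) by an angle proportional to the position.
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# Same offset (2) at different absolute positions gives the same score.
near = dot(rope(q, 5), rope(k, 3))
far = dot(rope(q, 12), rope(k, 10))
print(math.isclose(near, far, abs_tol=1e-9))  # True
```

Shifting both positions by the same amount leaves the score unchanged, so the attention score carries relative rather than absolute position.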

ALiBi

Add a distance-based bias to attention scores.
The penalty grows linearly with distance, so nearer tokens are penalized less.
Useful for long-context generalization.
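
The bias itself is a small matrix added to the scores before softmax. The sketch below assumes a causal decoder (future keys masked with negative infinity) and the geometric per-head slope schedule described for power-of-two head counts in the ALiBi paper; the function names are illustrative.

```python
def alibi_slopes(num_heads):
    # Geometric sequence 2^(-8/n), 2^(-16/n), ...; assumes num_heads is a power of two.
    return [2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_bias(seq_len, slope):
    # bias[i][j] = -slope * (i - j) for keys at or before query i.
    # Future keys (j > i) get -inf, which is the causal mask, not part of ALiBi itself.
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]

slopes = alibi_slopes(8)
print(slopes[0], slopes[-1])   # 0.5 0.00390625
bias = alibi_bias(4, slopes[0])
print(bias[3][:3])             # [-1.5, -1.0, -0.5]
```

Because the bias depends only on distance, no position embedding is added to the inputs, and the same rule applies at any sequence length.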

See explainer §3.7.

Attention

Self-attention lets each token query every token in the sequence, itself included.
Query = what I seek.
Key = what I advertise.
Value = what I share.
Formula: Attention(Q, K, V) = softmax(QK^T / √d_k)V
Scaling by √d_k keeps score variance near 1 as d_k grows, so the softmax does not saturate.
Decoder models add a causal mask to hide future tokens.

See explainer §4.2-§4.6.
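
The formula can be written out with plain lists. This is a minimal, unbatched sketch of scaled dot-product attention with an optional causal mask; it assumes Q, K, and V are already projected, and the function names are illustrative.

```python
import math

def softmax(row):
    m = max(row)                              # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]     # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, causal=False):
    d_k = len(K[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        # Scores: query i against every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Hide future tokens: position i may only see positions j <= i.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        w = softmax(scores)
        weights.append(w)
        # Output: weighted sum of value vectors.
        outputs.append([sum(w[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))])
    return outputs, weights

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, w = attention(X, X, X, causal=True)
print(w[0])    # [1.0, 0.0, 0.0]
print(out[0])  # [1.0, 0.0]
```

With the causal mask, the first token can attend only to itself, so its output is its own value vector.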

Multi-head attention

One head produces one attention distribution per token.
Multiple heads let the model learn several consultation habits in parallel.
Heads are concatenated and projected back to model width.

See explainer §5.1-§5.3.
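
The split, attend, concatenate, and project steps can be sketched as follows. This is an unbatched, unmasked illustration; the weight matrices `W_q`, `W_k`, `W_v`, `W_o` and all function names are assumptions for the example, and identity weights are used only to keep the demo easy to check.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def attend(Q, K, V):
    # Single-head scaled dot-product attention (no mask).
    d_k = len(K[0])
    out = []
    for q in Q:
        w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K])
        out.append([sum(w[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))])
    return out

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    d_head = len(Q[0]) // num_heads
    heads = []
    for h in range(num_heads):
        cols = slice(h * d_head, (h + 1) * d_head)   # each head sees its own slice
        heads.append(attend([r[cols] for r in Q], [r[cols] for r in K], [r[cols] for r in V]))
    # Concatenate head outputs per token, then project back to model width.
    concat = [sum((head[t] for head in heads), []) for t in range(len(X))]
    return matmul(concat, W_o)

I4 = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
X = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0]]
out = multi_head_attention(X, I4, I4, I4, I4, num_heads=2)
print(len(out), len(out[0]))  # 3 4
```

Each head computes its own score pattern on a d_model / num_heads slice, and the output keeps the model width.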

Production reminders

Count tokens with the exact deployment tokenizer.
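Prompt length drives attention cost: in standard self-attention, the score matrix grows with the square of the sequence length.
The advertised context length and the usable context length are not the same.
Attention maps are suggestive, not definitive, explanations.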

See explainer §5.4 and §6.4.
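
The effect of prompt length on attention cost can be estimated with back-of-the-envelope arithmetic. This sketch counts entries in the full score matrix and ignores kernel-level optimizations; the function name and the head and layer counts are hypothetical.

```python
def attention_score_entries(seq_len, num_heads=1, num_layers=1):
    # One seq_len x seq_len score matrix per head per layer.
    return seq_len * seq_len * num_heads * num_layers

short = attention_score_entries(2_000, num_heads=32, num_layers=32)
long = attention_score_entries(4_000, num_heads=32, num_layers=32)
print(long // short)  # 4
```

Doubling the prompt length quadruples the number of score entries, which is why token counts should come from the exact deployment tokenizer rather than a word-count estimate.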