03. Tokenization & Attention: Study Material
For a deeper treatment, see 02_explainer.md.
Tokenization
- Text must be split into repeatable units before modeling.
- Character-level tokenization handles any string but creates long sequences.
- Word-level tokenization shortens sequences but cannot represent words outside its fixed vocabulary.
- Subword tokenization is the practical compromise.
- Common words may stay whole.
- Rare words break into reusable parts.
See explainer §2.1-§2.4.
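To make the trade-off concrete, here is a minimal sketch that tokenizes one string three ways. The word vocabulary, the subword piece inventory, and the `greedy_subwords` helper are all hypothetical, and the greedy longest-match rule is a stand-in chosen for brevity; it is not BPE.

```python
# Toy comparison of tokenization granularities (illustrative only).
text = "tokenization unlocks tokenizers"

# Character-level: every string is representable, but sequences are long.
char_tokens = list(text)

# Word-level: short sequences, but a fixed vocabulary misses unseen words.
word_vocab = {"tokenization", "unlocks"}  # hypothetical training vocabulary
word_tokens = [w if w in word_vocab else "<unk>" for w in text.split()]

# Subword-level: greedy longest-match against a hypothetical piece inventory.
subword_vocab = {"token", "ization", "izers", "un", "locks"}

def greedy_subwords(word, vocab):
    pieces, start = [], 0
    while start < len(word):
        # Try the longest remaining substring first; fall back to one character.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab or end == start + 1:
                pieces.append(word[start:end])
                start = end
                break
    return pieces

subword_tokens = [p for w in text.split() for p in greedy_subwords(w, subword_vocab)]

print(len(char_tokens))                     # 31 tokens
print(len(word_tokens), word_tokens)        # 3 tokens, last one is "<unk>"
print(len(subword_tokens), subword_tokens)  # 6 tokens, nothing unknown
```

The unseen word "tokenizers" collapses to `<unk>` at word level but survives as `token` + `izers` at subword level, which is the compromise the bullets describe.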
BPE in one page
- Start from base symbols.
- Count adjacent pairs.
- Merge the most frequent pair.
- Repeat until the target vocabulary size is reached.
- Store the merge order, not only the final tokens: encoding new text replays the merges in the order they were learned.
- Unseen words stay representable through learned pieces.
Toy path from the explainer:
- t o -> to
- to k -> tok
- tok e -> toke
- toke n -> token
See explainer §2.5-§2.6.
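The loop above can be sketched in a few lines. This is a toy trainer under stated assumptions: the four-word corpus is made up, `train_bpe`, `apply_merge`, and `encode` are hypothetical names, and ties between equally frequent pairs go to whichever pair was counted first. The number of merges stands in for the target vocabulary size (base symbols plus one new token per merge).

```python
from collections import Counter

def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(words, num_merges):
    """Learn an ordered list of merges from a list of words."""
    # Each word starts as a tuple of single-character base symbols.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties go to the pair counted first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = Counter()
        for symbols, freq in corpus.items():
            merged[apply_merge(symbols, best)] += freq
        corpus = merged
    return merges

def encode(word, merges):
    """Tokenize a word by replaying the merges in learned order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)

merges = train_bpe(["token", "token", "to", "tokens"], num_merges=4)
print(merges)
print(encode("tokenize", merges))
```

On this corpus the learned merges follow the toy path (`t o`, `to k`, `tok e`, `toke n`), and the unseen word "tokenize" encodes as `token`, `i`, `z`, `e`: still representable, just with more pieces.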
Vocabulary trade-offs
| Choice | Upside | Cost |
|---|---|---|
| Small vocab | Smaller embedding/output tables | More token splits |
| Medium vocab | Moderate number of splits | Moderate table size |
| Large vocab | Fewer splits | Larger embedding/output layers |
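Watch for multilingual token inflation (text in languages underrepresented in the tokenizer's training data splits into more tokens) and tokenizer-model mismatch (a model must be paired with the tokenizer it was trained with).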
See explainer §2.7.
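The table-size cost in the last column is simple arithmetic. A rough sketch, assuming a hypothetical `d_model` of 4096 and two illustrative vocabulary sizes; `table_params` is a made-up helper, and the `tied` flag models the case where the input embedding and output projection share one matrix.

```python
def table_params(vocab_size, d_model, tied=False):
    """Parameters in the input embedding plus the output projection."""
    one_table = vocab_size * d_model
    return one_table if tied else 2 * one_table

d_model = 4096  # hypothetical model width
for vocab_size in (32_000, 128_000):
    print(vocab_size, table_params(vocab_size, d_model))
# 32000  -> 262,144,000 parameters (untied)
# 128000 -> 1,048,576,000 parameters (untied)
```

Quadrupling the vocabulary quadruples these tables, while the rest of the network is unchanged.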
Embeddings
- Token IDs are addresses.
- Embedding lookup is matrix indexing.
- If `E` has shape `[vocab_size, d_model]`, then token `i` maps to `E[i]`.
- One-hot multiplication and lookup are equivalent views.
See explainer §3.1-§3.2.
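The equivalence of the two views can be checked directly. A small sketch with a made-up 5 x 3 table; `lookup` and `one_hot_matmul` are illustrative names, not part of any library.

```python
vocab_size, d_model = 5, 3
# Hypothetical embedding table E with shape [vocab_size, d_model].
E = [[float(10 * row + col) for col in range(d_model)] for row in range(vocab_size)]

def lookup(E, i):
    # Indexing view: the token ID is just a row address.
    return E[i]

def one_hot_matmul(E, i):
    # Algebra view: (1 x vocab_size) one-hot row times (vocab_size x d_model) table.
    one_hot = [1.0 if row == i else 0.0 for row in range(len(E))]
    return [
        sum(one_hot[row] * E[row][col] for row in range(len(E)))
        for col in range(len(E[0]))
    ]

token_id = 3
print(lookup(E, token_id))          # [30.0, 31.0, 32.0]
print(one_hot_matmul(E, token_id))  # [30.0, 31.0, 32.0]
```

Both give the same vector; the lookup simply skips the multiplications by zero.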
Position information
- Bag-of-words loses order.
- Add positional information so the same token differs by location.
- Common recipe: `input = token_embedding + position_embedding`.
- Sinusoidal encoding can be pictured as many clocks rotating at different speeds.
- Learned absolute positions are simple but cannot represent positions beyond the maximum length used in training.
See explainer §3.3-§3.6.
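The "clocks" picture and the additive recipe can be sketched as follows. This assumes the common interleaved sine/cosine layout with base 10000; the token embedding values are made up, and `sinusoidal_position` is an illustrative name.

```python
import math

def sinusoidal_position(pos, d_model, base=10000.0):
    """Return the d_model-dimensional sinusoidal encoding for one position."""
    vec = []
    for k in range(0, d_model, 2):
        # Each (sin, cos) pair is one "clock"; later pairs rotate more slowly.
        angle = pos / (base ** (k / d_model))
        vec.extend([math.sin(angle), math.cos(angle)])
    return vec[:d_model]

token_embedding = [0.5, -0.5, 0.25, 0.0]  # hypothetical embedding of one token

def model_input(token_embedding, pos):
    # input = token_embedding + position_embedding
    position = sinusoidal_position(pos, len(token_embedding))
    return [t + p for t, p in zip(token_embedding, position)]

x_at_0 = model_input(token_embedding, 0)
x_at_5 = model_input(token_embedding, 5)
print(x_at_0)
print(x_at_5)
```

The same token produces different input vectors at positions 0 and 5, which is what lets later layers tell the two occurrences apart.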
RoPE and ALiBi
RoPE
- Rotate query and key pairs by position-dependent angles.
- Relative position affects the dot product naturally.
- Common in modern decoder models.
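A short numerical check of the relative-position property, as a sketch: `rope` rotates consecutive pairs of a vector using the same frequency schedule as sinusoidal encodings (base 10000 is an assumption), and the query and key values are made up.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive (even, odd) pairs of vec by position-dependent angles."""
    d = len(vec)
    out = []
    for k in range(0, d, 2):
        theta = pos / (base ** (k / d))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[k], vec[k + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

query = [1.0, 2.0, 3.0, 4.0]
key = [0.5, -1.0, 2.0, 0.0]

# Two different absolute placements with the same offset (4 positions apart).
score_near_start = dot(rope(query, 7), rope(key, 3))
score_far_out = dot(rope(query, 104), rope(key, 100))
print(score_near_start, score_far_out)
```

The two scores agree up to floating-point error, because rotating both vectors leaves only the angle difference in the dot product.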
ALiBi
- Add a distance-based bias to attention scores.
- The penalty grows linearly with distance, so nearer tokens get a gentler penalty; each head uses its own slope.
- Useful for long-context generalization.
See explainer §3.7.
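The ALiBi bias is easy to write out by hand. A minimal sketch: `alibi_bias` builds the matrix added to one head's attention scores, with the usual causal mask folded in as `-inf` for convenience, and `alibi_slopes` uses the geometric schedule from the ALiBi paper, assuming a power-of-two number of heads. Both function names are illustrative.

```python
def alibi_bias(seq_len, slope):
    """Bias for one head: 0 on the diagonal, more negative with distance."""
    bias = []
    for i in range(seq_len):        # query position
        row = []
        for j in range(seq_len):    # key position
            # Past keys are penalized in proportion to distance; future keys are masked.
            row.append(slope * (j - i) if j <= i else float("-inf"))
        bias.append(row)
    return bias

def alibi_slopes(num_heads):
    # Geometric sequence of per-head slopes (assumes num_heads is a power of two).
    return [2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)]

slopes = alibi_slopes(8)
for row in alibi_bias(4, slopes[0]):
    print(row)
```

With slope 0.5, the last query row is `[-1.5, -1.0, -0.5, 0.0]`: the key three positions back is penalized most, the current position not at all.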
Attention
- Self-attention lets each token query every token in the sequence, itself included.
- Query = what I seek.
- Key = what I advertise.
- Value = what I share.
- Formula: `Attention(Q, K, V) = softmax(QK^T / √d_k) V`.
- Scaling by `√d_k` stabilizes score magnitude.
- Decoder models add a causal mask to hide future tokens.
See explainer §4.2-§4.6.
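The formula and the causal mask fit in a short pure-Python sketch. The matrices are lists of rows with made-up values, and `attention` and `softmax` are illustrative helpers rather than any library's API.

```python
import math

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, causal=False):
    """softmax(QK^T / sqrt(d_k)) V, optionally with a causal mask."""
    d_k = len(K[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Hide future positions: token i may only attend to j <= i.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(w[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))]
        )
    return outputs, weights

# Three tokens, d_k = 2 (hypothetical values).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

outputs, weights = attention(Q, K, V, causal=True)
for w in weights:
    print([round(x, 3) for x in w])
```

Under the causal mask the first token can only attend to itself, so its weight row is `[1.0, 0.0, 0.0]` and its output equals `V[0]`.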
Multi-head attention
- One head gives one score pattern per token.
- Multiple heads let the model learn several consultation habits in parallel.
- Heads are concatenated and projected back to model width.
See explainer §5.1-§5.3.
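Splitting, attending per head, concatenating, and projecting can be sketched as below. This is a simplified, unmasked version under stated assumptions: the projection matrices `Wq`, `Wk`, `Wv`, `Wo` are identity matrices purely so the example runs without trained weights, and all helper names are hypothetical.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attend(Q, K, V):
    d_k = len(K[0])
    scores = matmul(Q, [list(col) for col in zip(*K)])  # Q @ K^T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    d_model = len(X[0])
    d_head = d_model // num_heads
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    head_outputs = []
    for h in range(num_heads):
        # Each head works on its own slice of the feature dimension.
        cols = slice(h * d_head, (h + 1) * d_head)
        head_outputs.append(
            attend([r[cols] for r in Q], [r[cols] for r in K], [r[cols] for r in V])
        )
    # Concatenate heads along the feature axis, then project back to d_model.
    concat = [sum((head[t] for head in head_outputs), []) for t in range(len(X))]
    return matmul(concat, Wo)

def identity(n):
    return [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]

X = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0]]
I4 = identity(4)
Y = multi_head_attention(X, I4, I4, I4, I4, num_heads=2)
print(len(Y), len(Y[0]))  # 3 tokens, width 4
```

Each head computes its own weight pattern over a 2-dimensional slice, and the output keeps the model width of 4.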
Production reminders
- Count tokens with the exact deployment tokenizer.
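- Prompt length drives attention cost: with standard full attention, the number of attention scores grows quadratically with sequence length.
- Advertised context length and usable context length are not the same.
- Attention maps are suggestive, not definitive, explanations of model behavior.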
See explainer §5.4 and §6.4.
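A back-of-the-envelope sketch of why prompt length matters, assuming standard full attention and a hypothetical model with 32 layers of 32 heads; `attention_score_entries` is a made-up helper that counts entries in all `QK^T` score matrices for one forward pass.

```python
def attention_score_entries(seq_len, num_heads, num_layers):
    """Entries in all QK^T score matrices for one full-attention forward pass."""
    return seq_len * seq_len * num_heads * num_layers

for n in (1_000, 2_000, 4_000):
    print(n, attention_score_entries(n, num_heads=32, num_layers=32))
```

Doubling the prompt length quadruples the count, which is why token budgets should be measured with the deployment tokenizer rather than estimated from character counts.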