03. Tokenization & Attention — Study Material

For deep understanding see 02_explainer.md.

Tokenization

  • Text must be split into repeatable units before modeling.
  • Character-level tokenization handles any string but creates long sequences.
  • Word-level tokenization shortens sequences but breaks on open-vocabulary text.
  • Subword tokenization is the practical compromise.
  • Common words may stay whole.
  • Rare words break into reusable parts.
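
A minimal sketch of the first two options, to make the trade-off concrete. The sentence and the closed vocabulary are made up for illustration; `<unk>` stands in for whatever unknown-word marker a word-level tokenizer would use.

```python
text = "the tokenizer handles untokenizable strings"

# Character level: every string is representable, but the sequence is long.
char_tokens = list(text)

# Word level with a closed (hypothetical) vocabulary: short sequences,
# but any word outside the vocabulary collapses to <unk>.
vocab = {"the", "tokenizer", "handles", "strings"}
word_tokens = [w if w in vocab else "<unk>" for w in text.split()]

print(len(char_tokens), "character tokens")
print(len(word_tokens), "word tokens:", word_tokens)
```

Subword tokenization aims for the middle: "untokenizable" would split into a few known pieces instead of becoming `<unk>`.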

See explainer §2.1-§2.4.

BPE in one page

  • Start from base symbols.
  • Count adjacent pairs.
  • Merge the most frequent pair.
  • Repeat until the target vocabulary size is reached.
  • Store merge order, not only final tokens.
  • Unseen words stay representable through learned pieces.

Toy path from the explainer:

  • t o -> to
  • to k -> tok
  • tok e -> toke
  • toke n -> token
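
The loop above can be sketched in a few lines of plain Python. This is a toy version under stated assumptions: the corpus is made up and chosen so the merges follow the same path as the toy example, ties are broken by first occurrence, and the function names (`train_bpe`, `apply_merges`) are illustrative rather than any library's API.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def train_bpe(words, num_merges):
    """Learn an ordered list of merges from a list of words."""
    corpus = Counter(tuple(w) for w in words)  # start from base symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():   # count adjacent pairs
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)       # most frequent pair, first wins ties
        merges.append(best)                    # store the merge order
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[tuple(merge_pair(list(symbols), best))] += freq
        corpus = new_corpus
    return merges

def apply_merges(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

merges = train_bpe(["token", "token", "to", "tokens"], num_merges=4)
print(merges)
print(apply_merges("tokenize", merges))  # unseen word, still representable
```

Replaying merges in training order is why the merge list must be stored: the same final vocabulary applied in a different order can split a word differently.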

See explainer §2.5-§2.6.

Vocabulary trade-offs

Choice        | Upside                           | Cost
Small vocab   | Smaller embedding/output tables  | More token splits
Medium vocab  | Balanced                         | Balanced
Large vocab   | Fewer splits                     | Larger embedding/output tables
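
The table-size side of the trade-off is plain arithmetic. A rough sketch, assuming an input embedding table of shape [vocab_size, d_model] and an output projection of the same size (or a single shared table when weights are tied); the vocabulary sizes and d_model below are illustrative, not taken from any particular model.

```python
def embedding_params(vocab_size, d_model, tied=False):
    """Parameters in the input embedding table plus the output projection.

    With tied weights the two share one table. Biases are ignored.
    """
    table = vocab_size * d_model
    return table if tied else 2 * table

for vocab_size in (8_000, 32_000, 128_000):
    print(vocab_size, embedding_params(vocab_size, d_model=1024))
```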

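Watch for multilingual inflation (text in under-represented languages often splits into many more tokens) and tokenizer-model mismatch (a model only works with the tokenizer it was trained with).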

See explainer §2.7.

Embeddings

  • Token IDs are addresses.
  • Embedding lookup is matrix indexing.
  • If E has shape [vocab_size, d_model], then token i maps to E[i].
  • One-hot multiplication and lookup are equivalent views.
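
The equivalence of the two views can be checked directly. A small sketch with a made-up 4 x 3 table (vocab_size 4, d_model 3), using plain lists so it runs without dependencies.

```python
# Hypothetical embedding table E with shape [vocab_size=4, d_model=3].
E = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

def lookup(E, i):
    """Embedding lookup: token i maps to row E[i]."""
    return E[i]

def one_hot_matmul(E, i):
    """Same result via a one-hot row vector multiplied by E."""
    one_hot = [1.0 if j == i else 0.0 for j in range(len(E))]
    d_model = len(E[0])
    return [sum(one_hot[j] * E[j][c] for j in range(len(E))) for c in range(d_model)]

print(lookup(E, 2))
print(one_hot_matmul(E, 2))
```

Frameworks use the indexing form because it skips the multiplications by zero.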

See explainer §3.1-§3.2.

Position information

  • Bag-of-words loses order.
  • Add positional information so the same token differs by location.
  • Common recipe: input = token_embedding + position_embedding.
  • Sinusoidal encoding can be pictured as many clocks rotating at different speeds.
  • Learned absolute positions are simple but capped at the maximum sequence length used in training.
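
A sketch of the sinusoidal recipe, assuming the usual formulation with base 10000 and sin/cos interleaved on even/odd dimensions. Each sin/cos pair is one "clock"; higher dimensions rotate more slowly. The helper names are illustrative.

```python
import math

def sinusoidal_position(pos, d_model, base=10000.0):
    """Sinusoidal encoding for one position (sin on even dims, cos on odd dims)."""
    vec = []
    for i in range(0, d_model, 2):
        angle = pos / base ** (i / d_model)  # slower rotation as i grows
        vec.append(math.sin(angle))
        vec.append(math.cos(angle))
    return vec[:d_model]

def add_position(token_embedding, pos):
    """input = token_embedding + position_embedding."""
    pe = sinusoidal_position(pos, len(token_embedding))
    return [t + p for t, p in zip(token_embedding, pe)]

token = [0.5, -0.2, 0.1, 0.9]  # made-up embedding for one token
print(add_position(token, 0))
print(add_position(token, 5))  # same token, different position, different input
```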

See explainer §3.3-§3.6.

RoPE and ALiBi

RoPE

  • Rotate query and key pairs by position-dependent angles.
  • The query-key dot product then depends on the relative offset between positions, not on their absolute values.
  • Common in modern decoder models.
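
A sketch of the rotation and of the relative-position property. Assumptions: an even vector length, adjacent dimensions paired as (0,1), (2,3), and so on (some implementations pair the two halves instead), and base 10000. The vectors are made up.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each (even, odd) pair of vec by an angle proportional to pos."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos / base ** (i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]

# Same offset (3) at two different absolute positions.
near_start = dot(rope(q, 5), rope(k, 2))
far_along = dot(rope(q, 105), rope(k, 102))
print(near_start, far_along)  # equal up to floating-point error
```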

ALiBi

  • Add a distance-based bias to attention scores.
  • The penalty grows linearly with distance, so nearer tokens are penalized less.
  • Useful for long-context generalization.
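
A sketch of the bias for one head. Assumptions: the bias is `-slope * distance` for keys at or before the query, future keys are set to negative infinity (folding in a causal mask), and per-head slopes follow the geometric recipe commonly used for power-of-two head counts. Treat the slope formula as an assumption to verify against the model you use.

```python
def alibi_slopes(num_heads):
    """Geometric per-head slopes, e.g. 1/2, 1/4, ..., 1/256 for 8 heads."""
    start = 2 ** (-8 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]

def alibi_bias(seq_len, slope):
    """bias[i][j] is added to the score of query i attending to key j."""
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]

slopes = alibi_slopes(8)
for row in alibi_bias(4, slopes[0]):
    print(row)
```

The bias is added to the attention scores before the softmax, so distant keys need a higher raw score to receive the same weight.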

See explainer §3.7.

Attention

  • Self-attention lets each token query every token in the sequence, itself included.
  • Query = what I seek.
  • Key = what I advertise.
  • Value = what I share.
  • Formula: Attention(Q, K, V) = softmax(QK^T / √d_k)V
  • Dividing by √d_k keeps score magnitudes from growing with d_k, so the softmax does not saturate.
  • Decoder models add a causal mask to hide future tokens.
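
The formula and the causal mask can be sketched with plain lists. This is a dependency-free illustration, not an efficient implementation; a real model would use a tensor library and learned projections for Q, K, and V. The input matrix is made up.

```python
import math

def softmax(scores):
    """Numerically stable softmax over one row of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, causal=False):
    """softmax(QK^T / sqrt(d_k)) V, one query row at a time."""
    d_k = len(K[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Hide future tokens: positions j > i get a score of -inf.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(w[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))]
        )
    return outputs, weights

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three tokens, d_k = 2
out, w = attention(X, X, X, causal=True)
for row in w:
    print([round(p, 3) for p in row])
```

With the causal mask, the first token can only attend to itself, so its weight row is all on position 0.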

See explainer §4.2-§4.6.

Multi-head attention

  • One head gives each token a single attention pattern.
  • Multiple heads let the model learn several consultation habits in parallel.
  • Heads are concatenated and projected back to model width.
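
A sketch of the split, attend, concatenate, project sequence. Assumptions: d_model is divisible by the number of heads, heads are formed by slicing the projected Q, K, V along the feature dimension, and no causal mask is applied for brevity. The weight matrices `W_q`, `W_k`, `W_v`, `W_o` are placeholders for learned parameters.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def head_attention(Q, K, V):
    """Scaled dot-product attention for a single head."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def split_heads(X, num_heads):
    """Slice [seq, d_model] into num_heads matrices of shape [seq, d_head]."""
    d_head = len(X[0]) // num_heads
    return [[row[h * d_head:(h + 1) * d_head] for row in X] for h in range(num_heads)]

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    heads = [
        head_attention(q, k, v)
        for q, k, v in zip(
            split_heads(Q, num_heads), split_heads(K, num_heads), split_heads(V, num_heads)
        )
    ]
    # Concatenate head outputs per token, then project back to model width.
    concat = [sum((head[t] for head in heads), []) for t in range(len(X))]
    return matmul(concat, W_o)
```

Each head computes its own weight matrix over its slice of the features, which is what lets different heads attend to different tokens in parallel.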

See explainer §5.1-§5.3.

Production reminders

  • Count tokens with the exact deployment tokenizer.
  • Prompt length drives attention cost: with full self-attention, the score matrix grows with the square of the sequence length.
  • Long advertised context and usable context are not identical.
  • Attention maps are suggestive, not definitive, explanations.
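
A back-of-the-envelope sketch of how prompt length affects attention cost, assuming plain full self-attention where every token scores every other token. The function name and the sequence lengths are illustrative; optimized kernels and caching change the constants.

```python
def attention_score_entries(seq_len, num_heads=1, num_layers=1):
    """Number of entries in the QK^T score matrices for full self-attention."""
    return seq_len * seq_len * num_heads * num_layers

short = attention_score_entries(1_000)
long = attention_score_entries(2_000)
print(long / short)  # 4.0: doubling the prompt quadruples the score matrix
```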

See explainer §5.4 and §6.4.