01. Module 02 — Weekly Plan

Key concepts

  • Raw text must be converted into repeatable numeric units.
  • Character-level tokenization is flexible but sequence-heavy.
  • Word-level tokenization is compact but brittle on open-vocabulary text.
  • Subword tokenization is the practical compromise.
  • BPE learns reusable merges from frequent adjacent pairs.
  • Token IDs are addresses, not meanings.
  • Embeddings are matrix rows selected by token ID.
  • Bag-of-words loses order information.
  • Positional information repairs the missing order signal.
  • Sinusoidal encoding is easiest to picture as rotating clocks.
  • RoPE makes relative position matter inside attention scores.
  • ALiBi adds a distance bias to the scorecard.
  • Self-attention means each token queries every token in the sequence, itself included.
  • Dividing scores by √d_k keeps softmax from becoming too sharp.
  • Multi-head attention allows specialized consultation patterns.
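
The lookup, bag-of-words, and positional points above can be seen in a few lines. The sketch below is illustrative only: the three-word vocabulary, the 4-dimensional embedding values, and the helper names (`embed`, `bag`, `sinusoidal`, `add_positions`) are assumptions made up for this example, not taken from the explainer.

```python
import math

# Hypothetical toy vocabulary: token -> token ID (an address, nothing more).
vocab = {"dog": 0, "bites": 1, "man": 2}

# Hypothetical embedding table: row i holds the vector for token ID i.
embedding_table = [
    [1.0, 0.0, 0.5, 0.25],   # ID 0: "dog"
    [0.0, 1.0, 0.25, 0.5],   # ID 1: "bites"
    [0.5, 0.25, 1.0, 0.0],   # ID 2: "man"
]
D_MODEL = 4

def embed(tokens):
    """Embedding lookup: each token ID selects one row of the table."""
    return [embedding_table[vocab[t]] for t in tokens]

def bag(vectors):
    """Bag-of-words style pooling: sum the vectors, discarding order."""
    return [sum(column) for column in zip(*vectors)]

def sinusoidal(pos, d_model=D_MODEL):
    """Sinusoidal position vector: sin/cos pairs at decreasing frequencies."""
    enc = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        enc.extend([math.sin(angle), math.cos(angle)])
    return enc

def add_positions(vectors):
    """Add the position vector for index p to the p-th token vector."""
    return [
        [x + p for x, p in zip(vec, sinusoidal(pos))]
        for pos, vec in enumerate(vectors)
    ]

a = embed(["dog", "bites", "man"])
b = embed(["man", "bites", "dog"])

same_bag = bag(a) == bag(b)            # order is invisible after summing
dog_first = add_positions(a)[0]        # "dog" at position 0
dog_last = add_positions(b)[2]         # "dog" at position 2
position_aware = dog_first != dog_last
```

Summing the rows gives the same vector for both sentences, which is the bag-of-words failure. Once position vectors are added, "dog" at position 0 and "dog" at position 2 no longer share a representation, so the order signal is available again.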

🧠 Mental models

  • BPE / subword tokenization: "Teach the model reusable syllable LEGO blocks instead of whole words or single letters."
  • Embeddings: "A token ID is a locker number; the embedding vector is the contents stored in that locker."
  • Positional encoding: "Attach clock hands to each token so the model can tell first, second, and third."
  • RoPE: "Rotate queries and keys like matched gears so relative distance changes their alignment."
  • Self-attention: "Each token asks the room, 'who matters to me right now?'"
  • Multi-head attention: "Several specialists read the same sentence with different lenses."
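
The "matched gears" picture for RoPE can be checked numerically. This is a minimal sketch using a single 2-D pair and one rotation frequency; the function names and the choice `theta=1.0` are assumptions for illustration, whereas real implementations typically rotate many pairs at different frequencies.

```python
import math

def rotate(vec, pos, theta=1.0):
    """Rotate a 2-D vector by the angle pos * theta (one frequency pair)."""
    x, y = vec
    angle = pos * theta
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)

def score(q, k, q_pos, k_pos):
    """Dot product of a rotated query and a rotated key."""
    rq = rotate(q, q_pos)
    rk = rotate(k, k_pos)
    return rq[0] * rk[0] + rq[1] * rk[1]

q = (1.0, 2.0)    # arbitrary query vector
k = (0.5, -1.0)   # arbitrary key vector

same_offset_a = score(q, k, q_pos=3, k_pos=1)    # offset 2
same_offset_b = score(q, k, q_pos=10, k_pos=8)   # offset 2, shifted by 7
other_offset = score(q, k, q_pos=3, k_pos=2)     # offset 1
```

The two scores with offset 2 agree up to floating-point error even though the absolute positions differ, while the offset-1 score is different. Only relative distance changes the alignment.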

⚠️ Common traps

  • Assuming nearby token IDs have nearby meanings; IDs are just indices into an embedding table.
  • Ignoring that smaller token units lengthen sequences, and that attention cost grows quadratically with sequence length.
  • Treating embeddings like bag-of-words features and forgetting that order must be reintroduced explicitly.
  • Omitting the √d_k scaling, so raw dot products grow with d_k and softmax saturates onto a single token.
  • Explaining RoPE as simple position addition when it actually rotates Q/K representations.
  • Assuming one attention head can capture every useful pattern equally well.
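
The √d_k trap is easy to reproduce. The sketch below assumes hand-picked vectors with d_k = 64 (not values from the explainer) and compares softmax over raw dot products with softmax over scaled ones.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64
query = [1.0] * d_k
# Three hypothetical keys that match the query to different degrees.
keys = [[1.0] * d_k, [0.75] * d_k, [0.5] * d_k]

# Raw dot products grow with d_k: here 64, 48, 32.
raw = [sum(q * k for q, k in zip(query, key)) for key in keys]
# Dividing by sqrt(d_k) = 8 brings them back to 8, 6, 4.
scaled = [s / math.sqrt(d_k) for s in raw]

unscaled_weights = softmax(raw)     # nearly one-hot on the first key
scaled_weights = softmax(scaled)    # still peaked, but the others get weight
```

Without scaling, almost all of the weight lands on the first key and the other two are effectively ignored. With scaling, the ranking is unchanged but the second and third keys keep a visible share.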

🔗 Prerequisites & connections

  • Builds on: Module 01 neural-network basics, matrix multiplication, softmax, and learned vector representations.
  • Feeds into: transformer block design, causal masking, Q/K/V implementation, and KV-cache reasoning in Modules 03-04.

💬 Interview phrasing

  • "Why is subword tokenization the practical sweet spot for LLMs?"
  • "If token IDs are just integers, where does meaning actually enter the model?"
  • "Why do we still need positional information after embedding lookup?"
  • "What does RoPE change inside attention compared with sinusoidal encodings or ALiBi?"
  • "Why divide attention scores by √d_k, and why are multiple heads useful?"

⏱️ Difficulty markers

  • 🟢 embeddings
  • 🟢 subword tokenization
  • 🟡 positional encodings
  • 🟡 self-attention
  • 🔴 RoPE / ALiBi intuition
  • 🔴 multi-head attention

Self-check questions

  1. Why does "ChatGPT-4o" create trouble for naive word tokenization? (explainer §1.1)
  2. Why does character-level tokenization hurt attention cost so quickly? (explainer §2.2)
  3. Why is [UNK] a poor answer to open-vocabulary text? (explainer §2.3)
  4. Walk the BPE merge path from "t o k e n" to "token". (explainer §2.5)
  5. Why are token IDs only addresses? (explainer §3.1)
  6. Explain embedding lookup as matrix indexing. (explainer §3.2)
  7. Why do "dog bites man" and "man bites dog" break bag-of-words? (explainer §3.3)
  8. Give the geometric intuition for sinusoidal encoding. (explainer §3.5)
  9. What changes when RoPE rotates queries and keys? (explainer §3.7)
  10. Explain self-attention as a soft lookup. (explainer §4.2)
  11. Why divide by √d_k in attention? (explainer §4.5)
  12. Why is one attention head often not enough? (explainer §5.1)
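
For question 4, a small script can confirm a hand-worked merge path. This is a minimal sketch of a BPE training loop, assuming a made-up three-word corpus with invented frequencies; the merge order in explainer §2.5 may well differ, because it depends on the corpus and on how ties are broken. The names `count_pairs`, `merge_pair`, and `train_bpe` are hypothetical helpers.

```python
# Hypothetical toy corpus: each word is a tuple of symbols mapped to its count.
corpus = {
    ("t", "o", "k", "e", "n"): 5,
    ("t", "o", "k", "e", "n", "s"): 3,
    ("t", "a", "k", "e", "n"): 2,
}

def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = {}
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] = counts.get(pair, 0) + freq
    return counts

def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in a word with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(words, num_merges):
    """Repeatedly merge the most frequent adjacent pair."""
    merges = []
    for _ in range(num_merges):
        counts = count_pairs(words)
        if not counts:
            break
        # Ties go to the pair seen first; real tokenizers may break ties differently.
        best = max(counts, key=counts.get)
        words = {merge_pair(s, best): f for s, f in words.items()}
        merges.append(best)
    return merges, words

merges, segmented = train_bpe(corpus, num_merges=4)
for step, (left, right) in enumerate(merges, start=1):
    print(f"merge {step}: {left} + {right} -> {left + right}")
```

With this particular corpus the four merges are k+e, ke+n, t+o, and to+ken, after which "token" is a single symbol, "tokens" is token + s, and "taken" stays split as t, a, ken.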

Health check

  • Can you explain the full pipeline without looking at notes?
  • Can you work one BPE example by hand?
  • Can you compute one tiny attention example numerically?
  • Can you say what breaks before you say what fixes it?
  • If not, slow down and re-read the fuzzy chapter in 02_explainer.md.
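
For the "tiny attention example" item, one way to check a hand calculation is a few lines of plain Python. The sketch below assumes a two-token sequence with hand-picked Q, K, and V (no learned projections), so the numbers are illustrative only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Single-head scaled dot-product attention on lists of vectors."""
    d_k = len(K[0])
    all_weights, outputs = [], []
    for q in Q:
        # One score per key: how well does this query match it?
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Soft lookup: a weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        all_weights.append(weights)
        outputs.append(out)
    return all_weights, outputs

# Hypothetical inputs: two tokens, d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]

weights, outputs = attention(Q, K, V)
for w, o in zip(weights, outputs):
    print([round(x, 3) for x in w], [round(x, 3) for x in o])
```

Each token puts roughly two thirds of its weight on its own key and one third on the other, so each output is a blend of both value vectors rather than a hard selection.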