01. Module 02 — Weekly Plan
Key concepts
- Raw text must be converted into repeatable numeric units.
- Character-level tokenization is flexible but sequence-heavy.
- Word-level tokenization is compact but brittle on open-vocabulary text.
- Subword tokenization is the practical compromise.
- BPE learns reusable merges from frequent adjacent pairs.
- Token IDs are addresses, not meanings.
- Embeddings are matrix rows selected by token ID.
- Bag-of-words loses order information.
- Positional information repairs the missing order signal.
- Sinusoidal encoding is easiest to picture as rotating clocks.
- RoPE makes relative position matter inside attention scores.
- ALiBi adds a distance bias to the scorecard.
- Self-attention means each token queries every token in the sequence, itself included.
- Dividing attention scores by √d_k keeps softmax from becoming too sharp.
- Multi-head attention allows specialized consultation patterns.
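
The BPE bullet above can be made concrete with a small sketch. The corpus counts are invented for illustration, and the function names (most_frequent_pair, merge_pair, learn_bpe) are hypothetical rather than taken from any library. Production tokenizers add details such as byte-level fallback and word-boundary markers that are left out here.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus_counts, num_merges):
    """Start from characters and greedily merge the most frequent pair."""
    words = {tuple(word): freq for word, freq in corpus_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words

# Hypothetical toy corpus: word -> count (assumed values, not real data).
toy_corpus = {"token": 5, "tokens": 3, "taken": 2}
merges, segmented = learn_bpe(toy_corpus, num_merges=4)
print(merges)
print(segmented)
```

With this toy corpus, four merges are enough to fuse "t o k e n" into a single symbol, while the rarer word "taken" stays split into smaller reusable pieces.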
🧠 Mental models
- BPE / subword tokenization: "Teach the model reusable syllable LEGO blocks instead of whole words or single letters."
- Embeddings: "A token ID is a locker number; the embedding vector is the contents stored in that locker."
- Positional encoding: "Attach clock hands to each token so the model can tell first, second, and third."
- RoPE: "Rotate queries and keys like matched gears so relative distance changes their alignment."
- Self-attention: "Each token asks the room, 'who matters to me right now?'"
- Multi-head attention: "Several specialists read the same sentence with different lenses."
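
The locker and clock-hand pictures above can be sketched in a few lines. The vocabulary, the embedding values, and the helper names (sinusoidal_position, encode) are all assumptions made up for this example; d_model is assumed to be even, and the base of 10000 follows the commonly used sinusoidal formulation.

```python
import math

# Hypothetical toy vocabulary: token -> ID (the "locker number").
vocab = {"dog": 0, "bites": 1, "man": 2}
d_model = 4

# Hypothetical embedding table: row i is the "locker contents" for token ID i.
embedding_table = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
]

def sinusoidal_position(pos, d_model, base=10000.0):
    """Each (sin, cos) pair is one clock hand turning at its own speed."""
    vec = []
    for i in range(0, d_model, 2):
        angle = pos / (base ** (i / d_model))
        vec.append(math.sin(angle))
        vec.append(math.cos(angle))
    return vec

def encode(tokens):
    """Look up each embedding row by ID, then add the position vector."""
    rows = []
    for pos, tok in enumerate(tokens):
        emb = embedding_table[vocab[tok]]      # plain row indexing
        pe = sinusoidal_position(pos, d_model)
        rows.append([e + p for e, p in zip(emb, pe)])
    return rows

forward = encode(["dog", "bites", "man"])
reverse = encode(["man", "bites", "dog"])
print(forward != reverse)  # the same tokens in a different order encode differently
```

The embedding lookup itself is order-blind; only the added position vector lets the two sentences produce different inputs.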
⚠️ Common traps
- Assuming nearby token IDs have nearby meanings; IDs are just indices into an embedding table.
- Ignoring how smaller token units increase sequence length and make attention quadratically more expensive.
- Treating embeddings like bag-of-words features and forgetting that order must be reintroduced explicitly.
- Omitting the √d_k scaling, so attention scores grow large enough that softmax saturates.
- Explaining RoPE as simple position addition when it actually rotates Q/K representations.
- Assuming one attention head can capture every useful pattern equally well.
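
The √d_k trap above is easy to see numerically. The sketch below assumes a query and eight keys drawn from a unit Gaussian with d_k = 256; those sizes and the random seed are arbitrary choices for illustration, not values from the explainer.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

random.seed(0)   # assumed seed, only to make the run repeatable
d_k = 256        # assumed key dimension
query = [random.gauss(0, 1) for _ in range(d_k)]
keys = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(8)]

raw_scores = [dot(query, k) for k in keys]
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]

raw_peak = max(softmax(raw_scores))
scaled_peak = max(softmax(scaled_scores))
print("largest weight without scaling:", raw_peak)
print("largest weight with scaling:   ", scaled_peak)
```

Because unscaled dot products have a spread that grows with √d_k, the unscaled softmax tends to put nearly all its weight on one key, while the scaled version keeps a softer distribution.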
🔗 Prerequisites & connections
- Builds on: Module 01 neural-network basics, matrix multiplication, softmax, and learned vector representations.
- Feeds into: transformer block design, causal masking, Q/K/V implementation, and KV-cache reasoning in Modules 03-04.
💬 Interview phrasing
- "Why is subword tokenization the practical sweet spot for LLMs?"
- "If token IDs are just integers, where does meaning actually enter the model?"
- "Why do we still need positional information after embedding lookup?"
- "What does RoPE change inside attention compared with sinusoidal encodings or ALiBi?"
- "Why divide attention scores by √d_k, and why are multiple heads useful?"
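
When answering the RoPE and ALiBi question, a tiny sketch can help anchor the intuition. It uses a single 2-D frequency pair with an assumed angle step theta = 0.5 and an assumed ALiBi slope of 0.25; the vectors and the function names (rotate, rope_score, alibi_bias) are made up for illustration.

```python
import math

def rotate(vec, pos, theta=0.5):
    """Rotate a 2-D vector by pos * theta radians (one RoPE frequency pair)."""
    x, y = vec
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    return (x * c - y * s, x * s + y * c)

def rope_score(q, k, q_pos, k_pos):
    """Dot product of the rotated query and key."""
    rq = rotate(q, q_pos)
    rk = rotate(k, k_pos)
    return rq[0] * rk[0] + rq[1] * rk[1]

def alibi_bias(q_pos, k_pos, slope=0.25):
    """ALiBi-style penalty added to the score: grows linearly with distance."""
    return -slope * abs(q_pos - k_pos)

q = (1.0, 0.5)    # hypothetical query vector
k = (0.3, -0.8)   # hypothetical key vector

near = rope_score(q, k, q_pos=5, k_pos=3)         # offset of 2
shifted = rope_score(q, k, q_pos=105, k_pos=103)  # same offset, far later
farther = rope_score(q, k, q_pos=5, k_pos=1)      # offset of 4

print(near, shifted, farther)
print(alibi_bias(5, 3), alibi_bias(5, 1))
```

The first two scores agree up to floating-point error because only the offset between positions survives the dot product, which is the "matched gears" picture. ALiBi leaves Q and K untouched and instead subtracts a distance-proportional amount from the score.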
⏱️ Difficulty markers
- 🟢 embeddings
- 🟢 subword tokenization
- 🟡 positional encodings
- 🟡 self-attention
- 🔴 RoPE / ALiBi intuition
- 🔴 multi-head attention
Self-check questions
- Why does "ChatGPT-4o" create trouble for naive word tokenization? (explainer §1.1)
- Why does character-level tokenization hurt attention cost so quickly? (explainer §2.2)
- Why is [UNK] a poor answer to open-vocabulary text? (explainer §2.3)
- Walk the BPE merge path from "t o k e n" to "token". (explainer §2.5)
- Why are token IDs only addresses? (explainer §3.1)
- Explain embedding lookup as matrix indexing. (explainer §3.2)
- Why do "dog bites man" and "man bites dog" break bag-of-words? (explainer §3.3)
- Give the geometric intuition for sinusoidal encoding. (explainer §3.5)
- What changes when RoPE rotates queries and keys? (explainer §3.7)
- Explain self-attention as a soft lookup. (explainer §4.2)
- Why divide by √d_k in attention? (explainer §4.5)
- Why is one attention head often not enough? (explainer §5.1)
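
Two of the questions above can be sanity-checked in a few lines: how token granularity drives the number of attention pairs, and why bag-of-words cannot tell the two "dog bites man" sentences apart. This is a rough sketch that assumes whitespace splitting for words and one token per character; real tokenizers behave differently.

```python
from collections import Counter

text = "dog bites man"

word_tokens = text.split()   # word-level: split on whitespace
char_tokens = list(text)     # character-level: one token per character

# Full self-attention compares every token with every token: n * n pairs.
word_pairs = len(word_tokens) ** 2
char_pairs = len(char_tokens) ** 2
print(len(word_tokens), "word tokens ->", word_pairs, "attention pairs")
print(len(char_tokens), "char tokens ->", char_pairs, "attention pairs")

# Bag-of-words keeps counts only, so word order disappears.
bag_a = Counter("dog bites man".split())
bag_b = Counter("man bites dog".split())
print("same bag of words:", bag_a == bag_b)
print("same sequence:    ", "dog bites man".split() == "man bites dog".split())
```

The sequence grows roughly fourfold when moving from words to characters here, but the pair count grows by far more than that, which is the quadratic cost the character-level question is pointing at.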
Health check
- Can you explain the full pipeline without looking at notes?
- Can you work one BPE example by hand?
- Can you compute one tiny attention example numerically?
- Can you say what breaks before you say what fixes it?
- If not, slow down and re-read the fuzzy chapter in 02_explainer.md.
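
For the numeric attention check, a sketch like the following may help verify a hand calculation. It assumes identity projections (Q, K, and V are all the raw token vectors), which real models do not use, and the three 2-D vectors are made up for the example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for small lists of vectors."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # One score per key: dot product divided by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Output is the weight-averaged mix of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return outputs, weights

# Hypothetical token vectors; Q = K = V is an assumption for simplicity.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(X, X, X)

for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row is one token's attention distribution over all three tokens. For the first token, the keys [1, 0] and [1, 1] receive equal weight because both have the same dot product with the query [1, 0].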