01. Module 02 — Weekly Plan

Key concepts

BPE / subword tokenization: "Teach the model reusable syllable LEGO blocks instead of whole words or single letters."
Embeddings: "A token ID is a locker number; the embedding vector is the contents stored in that locker."
Positional encoding: "Attach clock hands to each token so the model can tell first, second, and third."
RoPE: "Rotate queries and keys like matched gears so relative distance changes their alignment."
Self-attention: "Each token asks the room, 'who matters to me right now?'"
Multi-head attention: "Several specialists read the same sentence with different lenses."
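The locker analogy for embeddings can be made concrete with a small sketch. The table values, the vocabulary size, and the helper names (`lookup`, `one_hot_times_table`) are illustrative assumptions, not taken from the course material. The sketch shows that indexing a row of the table gives the same vector as multiplying a one-hot selector by the table, which is why the integer ID itself carries no meaning.

```python
# Hypothetical 4-token vocabulary with 3-dimensional embeddings.
# In a real model these values are learned; here they are made up.
embedding_table = [
    [0.1, -0.2, 0.3],   # token ID 0
    [0.9, 0.8, -0.7],   # token ID 1
    [-0.5, 0.4, 0.0],   # token ID 2
    [0.2, 0.2, 0.2],    # token ID 3
]


def lookup(token_id, table):
    """Open the locker: return the row stored at index token_id."""
    return table[token_id]


def one_hot(index, size):
    """Vector of zeros with a single 1.0 at position index."""
    return [1.0 if j == index else 0.0 for j in range(size)]


def one_hot_times_table(token_id, table):
    """Same lookup written as a (1 x V) one-hot row times the (V x D) table."""
    selector = one_hot(token_id, len(table))
    dim = len(table[0])
    return [
        sum(selector[r] * table[r][c] for r in range(len(table)))
        for c in range(dim)
    ]


token_ids = [2, 0, 2]
vectors = [lookup(t, embedding_table) for t in token_ids]
print(vectors)
```

The repeated ID 2 returns the same vector both times, and nothing in the lookup knows which position a token came from, so order has to be supplied separately.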

Assuming nearby token IDs have nearby meanings; IDs are just indices into an embedding table.
Ignoring that smaller token units lengthen the sequence, and that attention cost grows quadratically with sequence length.
Treating embeddings like bag-of-words features and forgetting that order must be reintroduced explicitly.
Omitting the √d_k scaling, which lets attention scores grow so large that softmax saturates.
Explaining RoPE as simple position addition when it actually rotates Q/K representations.
Assuming one attention head can capture every useful pattern equally well.
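The √d_k mistake in the list above is easy to see numerically. The following is a minimal sketch, assuming random unit-variance query and key vectors, a hypothetical d_k of 256, and eight keys; the function names are illustrative. It compares the largest softmax weight with and without the scaling.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_weights(query, keys, scale=True):
    """Weights of one query over all keys, optionally divided by sqrt(d_k)."""
    d_k = len(query)
    scores = [dot(query, key) for key in keys]
    if scale:
        scores = [s / math.sqrt(d_k) for s in scores]
    return softmax(scores)


rng = random.Random(0)  # fixed seed so the run is repeatable
d_k = 256               # assumed head dimension
query = [rng.gauss(0, 1) for _ in range(d_k)]
keys = [[rng.gauss(0, 1) for _ in range(d_k)] for _ in range(8)]

unscaled = attention_weights(query, keys, scale=False)
scaled = attention_weights(query, keys, scale=True)
print("largest weight without scaling:", max(unscaled))
print("largest weight with scaling:   ", max(scaled))
```

With unit-variance components, the raw dot products have a standard deviation of roughly √d_k, so the unscaled softmax tends to put almost all weight on one key. Dividing by √d_k brings the scores back to roughly unit scale and spreads the weights out.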

Builds on: Module 01 neural-network basics, matrix multiplication, softmax, and learned vector representations.
Feeds into: transformer block design, causal masking, Q/K/V implementation, and KV-cache reasoning in Modules 03-04.

"Why is subword tokenization the practical sweet spot for LLMs?"
"If token IDs are just integers, where does meaning actually enter the model?"
"Why do we still need positional information after embedding lookup?"
"What does RoPE change inside attention compared with sinusoidal encodings or ALiBi?"
"Why divide attention scores by √d_k, and why are multiple heads useful?"
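For the RoPE question, a small sketch may help. It assumes the interleaved pairing convention, where dimensions (0, 1), (2, 3), and so on are rotated together, and a base of 10000; some implementations pair dimensions differently. The vectors and the `rope` helper are made up for illustration. The point it shows is that the score between a rotated query and a rotated key depends only on the distance between their positions.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate each consecutive pair of dimensions by a position-dependent angle."""
    d = len(vec)
    assert d % 2 == 0, "RoPE needs an even number of dimensions"
    out = []
    for i in range(0, d, 2):
        # Pair i // 2 uses frequency base^(-i / d): fast for early pairs, slow for late ones.
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical query and key vectors (4 dimensions = 2 rotated pairs).
q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Same relative distance (3), different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 13), rope(k, 10))
print(score_near, score_far)
```

The two printed scores agree up to floating-point error, because rotating q by position m and k by position n leaves only the angle difference m − n in their dot product. Nothing is added to the embeddings themselves; only Q and K are rotated inside attention.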

Why does "ChatGPT-4o" create trouble for naive word tokenization? (explainer §1.1)
Why does character-level tokenization hurt attention cost so quickly? (explainer §2.2)
Why is [UNK] a poor answer to open-vocabulary text? (explainer §2.3)
Walk the BPE merge path from "t o k e n" to "token". (explainer §2.5)
Why are token IDs only addresses? (explainer §3.1)
Explain embedding lookup as matrix indexing. (explainer §3.2)
Why do "dog bites man" and "man bites dog" break bag-of-words? (explainer §3.3)
Give the geometric intuition for sinusoidal encoding. (explainer §3.5)
What changes when RoPE rotates queries and keys? (explainer §3.7)
Explain self-attention as a soft lookup. (explainer §4.2)
Why divide by √d_k in attention? (explainer §4.5)
Why is one attention head often not enough? (explainer §5.1)
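To practice the BPE merge-path card, a toy trainer can be sketched in a few lines. The two-word corpus and its counts are hypothetical, and ties are broken by first occurrence, which is an assumption; real tokenizers may break ties differently, and the explainer's merge order may not match this one.

```python
def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of pair with the concatenated symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_merges(word_counts, num_merges):
    """Greedy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    words = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = {}
        for symbols, count in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] = pair_counts.get(pair, 0) + count
        if not pair_counts:
            break  # every word is already a single symbol
        # max() keeps the first pair seen among equally frequent ones.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        words = {tuple(merge_pair(list(s), best)): c for s, c in words.items()}
    return merges


def merge_path(word, merges):
    """Apply learned merges in order and record each intermediate segmentation."""
    symbols = list(word)
    path = [symbols]
    for pair in merges:
        updated = merge_pair(symbols, pair)
        if updated != symbols:
            path.append(updated)
        symbols = updated
    return path


# Hypothetical corpus: word -> frequency.
corpus = {"token": 3, "taken": 2}
merges = learn_merges(corpus, num_merges=4)
token_path = merge_path("token", merges)
taken_path = merge_path("taken", merges)

for step in token_path:
    print(" ".join(step))
```

On this corpus the frequent word collapses into a single symbol, while the rarer word stays split into pieces that reuse a merge learned from both words. That reuse is the behavior the LEGO-block analogy describes.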