01. Week 3: Transformer Architecture
Key concepts to master
- Residual stream as the main data highway across layers
- Residual connection as an edit path, not a full rewrite path
- Why deep plain stacks cause exploding or vanishing gradients
- LayerNorm as per-token feature normalization
- Why activations drift without normalization
- Pre-norm vs post-norm, and why modern LLMs prefer pre-norm
- Attention as cross-token mixing inside the block
- FFN as per-token nonlinear transformation inside the block
- Multi-head attention as parallel crews with different relation patterns
- Encoder-decoder vs decoder-only information flow
- Causal mask as lower-triangular visibility control
- KV cache as an inference-only latency optimization
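
Several of the concepts above (residual stream, per-token LayerNorm, cross-token attention, per-token FFN, causal mask) fit into one pre-norm block. The following is a minimal NumPy sketch under simplifying assumptions: a single attention head, ReLU in the FFN, no learned LayerNorm gain or bias, and random weights. The names `pre_norm_block` and `params` are illustrative, not taken from any particular library.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's feature vector (last axis) to zero mean, unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def causal_self_attention(x, Wq, Wk, Wv, Wo):
    # Cross-token mixing; x has shape (seq_len, d_model). Single head for brevity.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Lower-triangular visibility: position i may only attend to positions <= i.
    mask = np.tril(np.ones(scores.shape, dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return (weights @ v) @ Wo

def ffn(x, W1, W2):
    # Per-token transformation: expand, apply a nonlinearity, compress.
    return np.maximum(x @ W1, 0.0) @ W2

def pre_norm_block(x, p):
    # Pre-norm: normalize the sublayer input, then add the result back
    # into the residual stream as an edit.
    x = x + causal_self_attention(layer_norm(x), p["Wq"], p["Wk"], p["Wv"], p["Wo"])
    x = x + ffn(layer_norm(x), p["W1"], p["W2"])
    return x

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 32, 5  # toy sizes chosen for illustration
params = {
    "Wq": rng.normal(scale=0.1, size=(d_model, d_model)),
    "Wk": rng.normal(scale=0.1, size=(d_model, d_model)),
    "Wv": rng.normal(scale=0.1, size=(d_model, d_model)),
    "Wo": rng.normal(scale=0.1, size=(d_model, d_model)),
    "W1": rng.normal(scale=0.1, size=(d_model, d_ff)),
    "W2": rng.normal(scale=0.1, size=(d_ff, d_model)),
}
x = rng.normal(size=(seq_len, d_model))
y = pre_norm_block(x, params)
print(y.shape)
```

The block's output has the same shape as its input, which is what lets the residual stream pass through many stacked blocks unchanged in width.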
🧠 Mental models
- Residual stream: "A shared highway where each block merges a small edit back into the main traffic."
- LayerNorm: "A per-token recenter-and-rescale pit stop before the next computation."
- Pre-norm blocks: "Stabilize the input before each sublayer, like checking the car before every lap."
- Attention block: "A routing board that mixes information across tokens."
- FFN block: "A tiny private workshop each token enters alone for feature expansion and compression."
- KV cache: "A notebook of past keys and values so generation does not reread the whole prefix every step."
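
The KV cache "notebook" can be sketched in a few lines. This is an illustrative single-head NumPy example, not a production implementation; `KVCache` and `full_causal_attention` are hypothetical names. It checks that decoding one token at a time against cached keys and values gives the same outputs as full causal attention over the whole sequence.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def full_causal_attention(x, Wq, Wk, Wv):
    # Reference path: project every token and apply a lower-triangular mask.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    mask = np.tril(np.ones(scores.shape, dtype=bool))
    return softmax(np.where(mask, scores, -np.inf)) @ v

class KVCache:
    """Stores keys and values of past tokens (hypothetical helper)."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, x_t, Wq, Wk, Wv):
        # Project only the newest token; past keys/values are read from the cache.
        q = x_t @ Wq
        self.keys.append(x_t @ Wk)
        self.values.append(x_t @ Wv)
        K = np.stack(self.keys)    # (t, d)
        V = np.stack(self.values)  # (t, d)
        # No explicit mask is needed: the cache holds only past and current tokens.
        weights = softmax(q @ K.T / np.sqrt(q.shape[-1]))
        return weights @ V

rng = np.random.default_rng(0)
seq_len, d = 6, 4  # toy sizes chosen for illustration
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
x = rng.normal(size=(seq_len, d))

cache = KVCache()
cached_out = np.stack([cache.step(x[t], Wq, Wk, Wv) for t in range(seq_len)])
full_out = full_causal_attention(x, Wq, Wk, Wv)
print(np.allclose(cached_out, full_out))
```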
⚠️ Common traps
- Describing residual connections as simple copying instead of additive edit paths that preserve gradient flow.
- Confusing LayerNorm with BatchNorm; LayerNorm normalizes features within each token, not across the batch.
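- Forgetting why post-norm becomes harder to optimize in deep stacks: the normalization sits on the residual path itself, so gradients no longer have a clean identity route back to early layers.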
- Mixing up attention's cross-token communication role with FFN's per-token transformation role.
- Applying causal masking incorrectly in decoder-only models and leaking future information.
- Assuming KV cache helps training when it is mainly an autoregressive inference optimization.
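
The LayerNorm vs BatchNorm trap comes down to which axes the statistics are taken over. A small NumPy sketch, assuming a `(batch, seq, features)` layout and a training-mode BatchNorm without running averages or learned parameters (`batch_norm_train` is an illustrative helper):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Statistics over the feature axis, computed separately for every token.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_norm_train(x, eps=1e-5):
    # Statistics over the batch and sequence axes, computed separately per feature.
    mean = x.mean(axis=(0, 1), keepdims=True)
    var = x.var(axis=(0, 1), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 6, 16))  # (batch, seq, features)

ln = layer_norm(x)
bn = batch_norm_train(x)

# LayerNorm output for one example does not depend on the rest of the batch.
ln_alone = layer_norm(x[:1])
# BatchNorm output for the same example changes when the batch changes.
bn_alone = batch_norm_train(x[:1])

print(np.allclose(ln[:1], ln_alone))
print(np.allclose(bn[:1], bn_alone))
```

The first check passes and the second does not, which is the practical reason LayerNorm suits variable-length sequences and single-example inference.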
🔗 Prerequisites & connections
- Builds on: Module 02 attention math and positional schemes, plus Module 01 optimization and stability intuition.
- Feeds into: coded causal attention, head reshaping, cache implementation, and later serving/training tradeoffs in Modules 04-05.
💬 Interview phrasing
- "What exactly lives in the residual stream as you move up the transformer stack?"
- "Why do modern LLMs prefer pre-norm transformer blocks?"
- "Attention vs FFN — if you removed one, what capability would break first?"
- "How does a decoder-only transformer differ mechanically from an encoder-decoder transformer?"
- "Why does KV cache reduce generation latency but not help training in the same way?"
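
For the KV cache question, a rough count helps. The sketch below counts how many per-token key/value projections one layer performs while generating tokens, with and without a cache. It is a simplified cost model (it ignores the attention dot products and memory traffic), and `kv_token_projections` is a hypothetical helper.

```python
def kv_token_projections(prompt_len, new_tokens, use_cache):
    """Count per-layer K/V projections needed to generate `new_tokens` tokens."""
    total = 0
    for step in range(new_tokens):
        prefix_len = prompt_len + step  # tokens visible when predicting this step
        if use_cache and step > 0:
            # Only the most recently generated token needs new keys/values.
            total += 1
        else:
            # No cache (or the initial prefill): project the whole visible prefix.
            total += prefix_len
    return total

without_cache = kv_token_projections(prompt_len=100, new_tokens=100, use_cache=False)
with_cache = kv_token_projections(prompt_len=100, new_tokens=100, use_cache=True)
print(without_cache)  # 14950
print(with_cache)     # 199
```

Training with teacher forcing processes the whole sequence in one parallel pass, so each token's keys and values are already computed only once per forward pass and there is nothing to reuse across steps.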
⏱️ Difficulty markers
- 🟢 causal mask
- 🟢 encoder-decoder vs decoder-only
- 🟡 residual stream
- 🟡 attention + FFN block roles
- 🔴 LayerNorm / pre-norm stability
- 🔴 KV cache
Self-check questions
For longer answers, go directly to the explainer sections referenced in parentheses.
- Why does a plain deep stack become numerically unstable? (§1.1, §2.2)
- Why is a residual connection easier to optimize than a full rewrite? (§2.1-§2.3)
- What exactly is the residual stream? (§2.4)
- What does LayerNorm normalize over? (§3.1, §3.3)
- Why is pre-norm preferred in modern LLMs? (§3.4)
- Attention vs FFN — what does each one do? (§4.1-§4.4)
- Encoder-decoder vs decoder-only — what changes mechanically? (§5.1, §5.3)
- Why must decoder attention be masked? (§5.2)
- Why does KV cache help inference but not training? (§5.4)
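
The first two questions can be explored numerically. The sketch below backpropagates a unit-norm gradient through 50 layers, once for a plain stack and once for a residual stack. It assumes purely linear layers with small random weights (no nonlinearity, no normalization), so it is a toy illustration of the trend rather than a model of a real transformer.

```python
import numpy as np

rng = np.random.default_rng(0)
depth, d = 50, 64
scale = 0.1  # assumed "small edit" weight scale, chosen for illustration
weights = [rng.normal(scale=scale / np.sqrt(d), size=(d, d)) for _ in range(depth)]

def backprop_norm(weights, residual):
    d = weights[0].shape[0]
    g = np.ones(d) / np.sqrt(d)  # unit-norm gradient arriving at the top layer
    for W in reversed(weights):
        # Plain layer    y = W x      gives dL/dx = W^T g
        # Residual layer y = x + W x  gives dL/dx = g + W^T g
        g = W.T @ g + (g if residual else 0.0)
    return np.linalg.norm(g)

plain_norm = backprop_norm(weights, residual=False)
residual_norm = backprop_norm(weights, residual=True)
print(plain_norm)     # shrinks roughly geometrically with depth
print(residual_norm)  # stays on the order of 1
```

In the plain stack the gradient is a product of 50 small matrices and vanishes; with larger weights the same product would explode instead. The residual path adds an identity term at every layer, which keeps a direct route for the gradient.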
Health check
- [ ] Read all 6 explainer chapters once
- [ ] Can draw the pre-norm block without notes
- [ ] Can explain the residual stream in plain English
- [ ] Can apply a causal mask by hand on a toy matrix
- [ ] Can explain KV cache latency benefit with numbers
- [ ] Assignment completed and written up clearly
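
To check a hand-computed causal mask, a toy 3x3 score matrix is enough. The values below are made up for illustration; every row has the same raw scores, so the effect of the mask is easy to see.

```python
import numpy as np

scores = np.array([[1.0, 2.0, 3.0],
                   [1.0, 2.0, 3.0],
                   [1.0, 2.0, 3.0]])

# Lower-triangular visibility: row i (query) may see columns 0..i (keys).
mask = np.tril(np.ones((3, 3), dtype=bool))
masked = np.where(mask, scores, -np.inf)

# Row-wise softmax; masked positions get exactly zero weight.
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)

print(np.round(weights, 4))
# Row 0 attends only to itself: [1, 0, 0]
# Row 1 splits weight over the first two tokens: about [0.2689, 0.7311, 0]
# Row 2 sees all three tokens: about [0.09, 0.2447, 0.6652]
```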