
01. Week 3 — Transformer Architecture

Key concepts to master

  • Residual stream as the main data highway across layers
  • Residual connection as an edit path, not a full rewrite path
  • Why deep plain stacks cause exploding or vanishing gradients
  • LayerNorm as per-token feature normalization
  • Why activations drift without normalization
  • Pre-norm vs post-norm, and why modern LLMs prefer pre-norm
  • Attention as cross-token mixing inside the block
  • FFN as per-token nonlinear transformation inside the block
  • Multi-head attention as parallel crews with different relation patterns
  • Encoder-decoder vs decoder-only information flow
  • Causal mask as lower-triangular visibility control
  • KV cache as an inference-only latency optimization
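
To make the block structure concrete, here is a minimal NumPy sketch of one pre-norm decoder block, tying together the residual stream, LayerNorm, causal attention, and the FFN. It is an illustration under assumptions, not a reference implementation: single head, no learnable LayerNorm gain or bias, ReLU as the FFN nonlinearity, and hypothetical names (`pre_norm_block`, `init_params`) and toy sizes.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's feature vector (last axis) independently.
    # Learnable gain and bias are omitted to keep the sketch short.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def causal_self_attention(x, w_q, w_k, w_v, w_o):
    # Single-head attention; x has shape (seq_len, d_model).
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    seq_len = x.shape[0]
    # Lower-triangular visibility: position i sees positions 0..i only.
    visible = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    scores = np.where(visible, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return (weights @ v) @ w_o

def ffn(x, w_in, w_out):
    # Per-token expand, nonlinearity (ReLU assumed here), then compress.
    return np.maximum(x @ w_in, 0.0) @ w_out

def pre_norm_block(x, params):
    # Each sublayer reads a normalized copy of the stream and adds its
    # edit back; the stream itself is never overwritten.
    x = x + causal_self_attention(layer_norm(x), *params["attn"])
    x = x + ffn(layer_norm(x), *params["ffn"])
    return x

def init_params(d_model=8, d_ff=32, seed=0):
    # Hypothetical toy initialization, small random weights.
    rng = np.random.default_rng(seed)
    attn = [rng.normal(0, 0.1, (d_model, d_model)) for _ in range(4)]
    ffn_w = [rng.normal(0, 0.1, (d_model, d_ff)),
             rng.normal(0, 0.1, (d_ff, d_model))]
    return {"attn": attn, "ffn": ffn_w}

params = init_params()
x = np.random.default_rng(1).normal(size=(5, 8))  # 5 tokens, d_model = 8
y = pre_norm_block(x, params)
print(y.shape)  # (5, 8)
```

Note the two `x = x + ...` lines: attention mixes across tokens, the FFN works on each token alone, and both only add to the stream. Moving `layer_norm` to wrap the sum instead (`layer_norm(x + sublayer(x))`) would give the post-norm variant.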

🧠 Mental models

  • Residual stream: "A shared highway where each block merges a small edit back into the main traffic."
  • LayerNorm: "A per-token recenter-and-rescale pit stop before the next computation."
  • Pre-norm blocks: "Stabilize the input before each sublayer, like checking the car before every lap."
  • Attention block: "A routing board that mixes information across tokens."
  • FFN block: "A tiny private workshop each token enters alone for feature expansion and compression."
  • KV cache: "A notebook of past keys and values so generation does not reread the whole prefix every step."
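
The "notebook" picture for the KV cache can be checked with a small NumPy sketch. It assumes a single attention head and a single layer, with random toy weights; `KVCache` and `attend_full` are hypothetical names for this illustration. The point is only that projecting the new token and appending to stored keys and values gives the same output as re-projecting the whole prefix.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attend_full(x, w_q, w_k, w_v):
    # No cache: re-project keys and values for the whole prefix,
    # then return the attention output for the last token only.
    q = x[-1] @ w_q
    k, v = x @ w_k, x @ w_v
    return softmax(k @ q / np.sqrt(q.shape[0])) @ v

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x_new, w_q, w_k, w_v):
        # Project only the new token, append it to the notebook,
        # then attend over everything stored so far.
        self.keys.append(x_new @ w_k)
        self.values.append(x_new @ w_v)
        q = x_new @ w_q
        k, v = np.stack(self.keys), np.stack(self.values)
        return softmax(k @ q / np.sqrt(q.shape[0])) @ v

rng = np.random.default_rng(0)
d = 4
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
tokens = rng.normal(size=(6, d))  # stand-in for 6 token embeddings

cache = KVCache()
for t in range(len(tokens)):
    cached = cache.step(tokens[t], w_q, w_k, w_v)
    full = attend_full(tokens[: t + 1], w_q, w_k, w_v)
    # Same result, but the cached path did one projection per step.
    assert np.allclose(cached, full)
```

No explicit mask appears here because the newest token is always the last position, so everything in the cache is already in its past.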

⚠️ Common traps

  • Describing residual connections as simple copying instead of additive edit paths that preserve gradient flow.
  • Confusing LayerNorm with BatchNorm; LayerNorm normalizes features within each token, not across the batch.
  • Forgetting why post-norm becomes harder to optimize in deep autoregressive stacks.
  • Mixing up attention's cross-token communication role with FFN's per-token transformation role.
  • Applying causal masking incorrectly in decoder-only models and leaking future information.
  • Assuming KV cache helps training when it is mainly an autoregressive inference optimization.
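
The LayerNorm vs BatchNorm trap comes down to which axes the statistics are taken over. The sketch below is a simplified comparison, assuming activations of shape (batch, seq_len, d_model), training-mode batch statistics, and no learnable scale or shift; `batch_norm_train` is a hypothetical helper, not a library function.

```python
import numpy as np

# Toy activations with shape (batch, seq_len, d_model).
x = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(2, 3, 4))

def layer_norm(x, eps=1e-5):
    # Statistics over the feature axis only: one mean/variance per token.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_norm_train(x, eps=1e-5):
    # Statistics over batch and sequence axes: one mean/variance per feature.
    mean = x.mean(axis=(0, 1), keepdims=True)
    var = x.var(axis=(0, 1), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

ln = layer_norm(x)
bn = batch_norm_train(x)

# LayerNorm centers every token on its own.
print(np.allclose(ln.mean(axis=-1), 0.0))
# LayerNorm on one example alone matches its slice of the batched result.
print(np.allclose(layer_norm(x[:1]), ln[:1]))
# BatchNorm on one example alone does not: its output depends on the batch.
print(np.allclose(batch_norm_train(x[:1]), bn[:1]))
```

The second and third checks show the practical consequence: a token's LayerNorm output is independent of whatever else is in the batch, which is what makes it convenient for variable-length sequences and single-sequence generation.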

🔗 Prerequisites & connections

  • Builds on: Module 02 attention math and positional schemes, plus Module 01 optimization and stability intuition.
  • Feeds into: coded causal attention, head reshaping, cache implementation, and later serving/training tradeoffs in Modules 04-05.

💬 Interview phrasing

  • "What exactly lives in the residual stream as you move up the transformer stack?"
  • "Why do modern LLMs prefer pre-norm transformer blocks?"
  • "Attention vs FFN — if you removed one, what capability would break first?"
  • "How does a decoder-only transformer differ mechanically from an encoder-decoder transformer?"
  • "Why does KV cache reduce generation latency but not help training in the same way?"

⏱️ Difficulty markers

  • 🟢 causal mask
  • 🟢 encoder-decoder vs decoder-only
  • 🟡 residual stream
  • 🟡 attention + FFN block roles
  • 🔴 LayerNorm / pre-norm stability
  • 🔴 KV cache

Self-check questions

For longer answers, go straight to the explainer sections cited after each question.

  1. Why does a plain deep stack become numerically unstable? (§1.1, §2.2)
  2. Why is a residual connection easier to optimize than a full rewrite? (§2.1-§2.3)
  3. What exactly is the residual stream? (§2.4)
  4. What does LayerNorm normalize over? (§3.1, §3.3)
  5. Why is pre-norm preferred in modern LLMs? (§3.4)
  6. Attention vs FFN — what does each one do? (§4.1-§4.4)
  7. Encoder-decoder vs decoder-only — what changes mechanically? (§5.1, §5.3)
  8. Why must decoder attention be masked? (§5.2)
  9. Why does KV cache help inference but not training? (§5.4)
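
For question 9, a rough count helps. The sketch below tallies how many per-token key/value projections one layer performs while generating, with and without a cache. It is a simplified cost model under stated assumptions: it counts projections only, ignores the attention over the cached prefix (which still grows with prefix length), and says nothing about memory bandwidth. The function names and the 100/100 sizes are made up for illustration.

```python
def kv_projections_without_cache(prompt_len, new_tokens):
    # Step i re-projects the whole prefix: the prompt plus the i tokens
    # generated so far.
    return sum(prompt_len + i for i in range(new_tokens))

def kv_projections_with_cache(prompt_len, new_tokens):
    # The prompt is projected once (prefill); every later step projects
    # only the single newest token.
    return prompt_len + (new_tokens - 1)

prompt_len, new_tokens = 100, 100
print(kv_projections_without_cache(prompt_len, new_tokens))  # 14950
print(kv_projections_with_cache(prompt_len, new_tokens))     # 199
```

Training does not see this saving because, with teacher forcing, all positions of a sequence are processed in one parallel pass under a causal mask; there are no repeated per-step passes whose keys and values could be reused.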

Health check

  • [ ] Read all 6 explainer chapters once
  • [ ] Can draw the pre-norm block without notes
  • [ ] Can explain the residual stream in plain English
  • [ ] Can apply a causal mask by hand on a toy matrix
  • [ ] Can explain KV cache latency benefit with numbers
  • [ ] Assignment completed and written up clearly
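
To check a hand-worked causal mask, a small pure-Python sketch like the one below can help. It assumes a 3x3 toy score matrix where `scores[i][j]` is the raw score of query i against key j; `causal_softmax` is a hypothetical helper. Dropping the future positions before the softmax is equivalent to setting their scores to negative infinity.

```python
import math

def causal_softmax(scores):
    out = []
    for i, row in enumerate(scores):
        # Keep only positions j <= i (the lower triangle).
        visible = row[: i + 1]
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Future positions get exactly zero weight.
        out.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return out

scores = [
    [0.0, 5.0, 5.0],  # large scores on future tokens must be ignored
    [0.0, 0.0, 5.0],
    [0.0, 0.0, 0.0],
]
weights = causal_softmax(scores)
for row in weights:
    print([round(w, 3) for w in row])
# [1.0, 0.0, 0.0]
# [0.5, 0.5, 0.0]
# [0.333, 0.333, 0.333]
```

The first row is the useful sanity check: token 0 can only attend to itself, so its weight is 1.0 no matter how large the scores for later tokens are.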