01. Week 3 — Transformer Architecture

Key concepts to master

Residual stream: "A shared highway where each block merges a small edit back into the main traffic."
LayerNorm: "A per-token recenter-and-rescale pit stop before the next computation."
Pre-norm blocks: "Stabilize the input before each sublayer, like checking the car before every lap."
Attention block: "A routing board that mixes information across tokens."
FFN block: "A tiny private workshop each token enters alone for feature expansion and compression."
KV cache: "A notebook of past keys and values so generation does not reread the whole prefix every step."
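
The pieces above can be wired together in a few lines. What follows is a minimal NumPy sketch of one pre-norm decoder block, not a reference implementation. It assumes a single attention head, no batch dimension, a ReLU FFN, and a LayerNorm without learnable gain or bias. The helper init_params and all parameter names are hypothetical, chosen only for illustration.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Recenter and rescale each token's feature vector (last axis) on its own.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def causal_self_attention(x, w_q, w_k, w_v, w_o):
    # x: (seq_len, d_model). This is the only place tokens exchange information.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Causal mask: position i may attend only to positions j <= i.
    seq_len = x.shape[0]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return (weights @ v) @ w_o

def ffn(x, w_in, w_out):
    # Applied to every token independently: expand, nonlinearity, compress.
    return np.maximum(x @ w_in, 0.0) @ w_out

def pre_norm_block(x, p):
    # Pre-norm: normalize the sublayer input, then ADD the result to the stream.
    x = x + causal_self_attention(layer_norm(x), p["w_q"], p["w_k"], p["w_v"], p["w_o"])
    x = x + ffn(layer_norm(x), p["w_in"], p["w_out"])
    return x

def init_params(d_model, d_ff, rng):
    # Hypothetical helper: small random weights, for illustration only.
    def w(rows, cols):
        return rng.normal(scale=rows ** -0.5, size=(rows, cols))
    return {
        "w_q": w(d_model, d_model), "w_k": w(d_model, d_model),
        "w_v": w(d_model, d_model), "w_o": w(d_model, d_model),
        "w_in": w(d_model, d_ff), "w_out": w(d_ff, d_model),
    }

rng = np.random.default_rng(0)
params = init_params(d_model=8, d_ff=32, rng=rng)
x = rng.normal(size=(5, 8))      # 5 tokens sitting in the residual stream
y = pre_norm_block(x, params)
print(y.shape)                   # same shape as x: the stream is edited, not replaced
```

Note that the residual stream x is never normalized in place here. Only the copies fed into the sublayers are, which is what distinguishes this layout from post-norm.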

Common mistakes

Describing residual connections as simple copying instead of additive edit paths that preserve gradient flow.
Confusing LayerNorm with BatchNorm; LayerNorm normalizes features within each token, not across the batch.
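Forgetting why post-norm becomes harder to optimize in deep autoregressive stacks: the LayerNorm sits on the residual path itself, so gradients no longer have a clean identity route through the layers.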
Mixing up attention's cross-token communication role with FFN's per-token transformation role.
Applying causal masking incorrectly in decoder-only models and leaking future information.
Assuming KV cache helps training when it is mainly an autoregressive inference optimization.
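
The LayerNorm versus BatchNorm distinction is easy to check numerically. The sketch below assumes activations of shape (batch, seq_len, d_model). Its batch_norm_train function is a simplified training-mode BatchNorm (per-feature statistics over batch and sequence positions, no running averages, no learnable parameters), written only for this comparison.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # One mean/variance per token, computed over the feature axis.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_norm_train(x, eps=1e-5):
    # Simplified training-mode BatchNorm: one mean/variance per feature,
    # computed over the batch and sequence axes.
    mean = x.mean(axis=(0, 1), keepdims=True)
    var = x.var(axis=(0, 1), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6, 8))   # (batch, seq_len, d_model)

# Perturb a DIFFERENT sequence in the batch and watch sequence 0.
x_other = x.copy()
x_other[1] += 10.0

ln_unchanged = np.allclose(layer_norm(x)[0], layer_norm(x_other)[0])
bn_unchanged = np.allclose(batch_norm_train(x)[0], batch_norm_train(x_other)[0])
print("LayerNorm output for sequence 0 unchanged:", ln_unchanged)
print("BatchNorm output for sequence 0 unchanged:", bn_unchanged)
```

LayerNorm's output for a token depends only on that token's own features, so it is unaffected by what else is in the batch. The BatchNorm variant's output shifts because its statistics are shared across examples.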

Builds on: Module 02 attention math and positional schemes, plus Module 01 optimization and stability intuition.
Feeds into: coded causal attention, head reshaping, cache implementation, and later serving/training tradeoffs in Modules 04-05.

Practice questions

"What exactly lives in the residual stream as you move up the transformer stack?"
"Why do modern LLMs prefer pre-norm transformer blocks?"
"Attention vs FFN — if you removed one, what capability would break first?"
"How does a decoder-only transformer differ mechanically from an encoder-decoder transformer?"
"Why does KV cache reduce generation latency but not help training in the same way?"
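
For the KV cache question, it can help to see that cached decoding produces the same attention output as recomputing over the whole prefix. The following is a minimal single-head sketch under simplifying assumptions: the KVCache class and function names are hypothetical, and growing the cache with np.vstack is chosen for clarity (real implementations typically preallocate).

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attend_full(x, w_q, w_k, w_v):
    # No cache: re-project keys and values for the WHOLE prefix,
    # then return the attention output for the last position only.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q[-1] @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

class KVCache:
    # Hypothetical container: the "notebook" of past keys and values.
    def __init__(self, d_head):
        self.k = np.empty((0, d_head))
        self.v = np.empty((0, d_head))

    def append(self, k_new, v_new):
        self.k = np.vstack([self.k, k_new])
        self.v = np.vstack([self.v, v_new])

def attend_cached(x_new, cache, w_q, w_k, w_v):
    # With cache: project ONLY the newest token, append its key and value,
    # and attend over everything stored so far.
    q = x_new @ w_q
    cache.append(x_new @ w_k, x_new @ w_v)
    scores = q @ cache.k.T / np.sqrt(cache.k.shape[-1])
    return softmax(scores) @ cache.v

rng = np.random.default_rng(0)
d_model, d_head, steps = 8, 4, 6
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
tokens = rng.normal(size=(steps, d_model))

cache = KVCache(d_head)
for t in range(steps):
    cached = attend_cached(tokens[t], cache, w_q, w_k, w_v)
    full = attend_full(tokens[: t + 1], w_q, w_k, w_v)
    assert np.allclose(cached, full)
print("cached and full recomputation agree at every step")
```

The saving comes from generation being sequential: each step adds one token, and the older keys and values do not change. In training, all positions of a sequence are processed in one parallel forward pass, so keys and values are already computed once per pass and there is nothing to reuse across steps.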

For the longer answers, go directly to the explainer chapter sections referenced after each question below.

Why does a plain deep stack become numerically unstable? (§1.1, §2.2)
Why is a residual connection easier to optimize than a full rewrite? (§2.1-§2.3)
What exactly is the residual stream? (§2.4)
What does LayerNorm normalize over? (§3.1, §3.3)
Why is pre-norm preferred in modern LLMs? (§3.4)
Attention vs FFN — what does each one do? (§4.1-§4.4)
Encoder-decoder vs decoder-only — what changes mechanically? (§5.1, §5.3)
Why must decoder attention be masked? (§5.2)
Why does KV cache help inference but not training? (§5.4)
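
As a companion to the first two questions, here is a toy experiment, not taken from the explainer, that contrasts a plain stack with a residual stack. It assumes random tanh layers whose weight scale (0.5 / sqrt(d_model)) is deliberately chosen so that each plain layer shrinks the signal. The function name run_stack and all sizes are illustrative.

```python
import numpy as np

def run_stack(x, weights, residual):
    """Return the activation norm after each layer of a toy stack."""
    norms = []
    for w in weights:
        update = np.tanh(x @ w)
        # Plain stack: the layer output REPLACES x (a full rewrite).
        # Residual stack: the layer output is ADDED to x (a small edit).
        x = x + update if residual else update
        norms.append(float(np.linalg.norm(x)))
    return norms

rng = np.random.default_rng(0)
d_model, depth = 64, 30
weights = [
    rng.normal(scale=0.5 / np.sqrt(d_model), size=(d_model, d_model))
    for _ in range(depth)
]
x0 = rng.normal(size=d_model)

plain = run_stack(x0, weights, residual=False)
resid = run_stack(x0, weights, residual=True)

print(f"input norm:            {np.linalg.norm(x0):.3e}")
print(f"plain stack, last:     {plain[-1]:.3e}")
print(f"residual stack, last:  {resid[-1]:.3e}")
```

In the plain stack the per-layer gains multiply, so the signal magnitude changes geometrically with depth (here it collapses toward zero; a larger weight scale would make it blow up instead). In the residual stack the identity path carries the input through unchanged and each layer only adds to it. The residual norm does drift upward in this toy setup, which is one intuition for normalizing sublayer inputs.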