03. Week 3 — Transformer Architecture
Start with 02_explainer.md. This file is the quick-reference sheet. See the explainer references after each section for the full failure-fix story.
Section 1 — Pre-norm block at a glance
Modern LLM blocks are usually pre-norm: x = x + Attn(LN(x)), then x = x + FFN(LN(x)).
Two sublayers.
Two residual additions.
Same residual stream width d_model throughout.
See explainer §4.2.
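The two-sublayer wiring described in this section can be sketched in NumPy. This is a minimal illustration, not a reference implementation: `toy_attn` and `toy_ffn` are hypothetical stand-ins for the real attention and FFN sublayers, and the LayerNorm here omits the learned scale and shift.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token (last axis) to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pre_norm_block(x, attn, ffn):
    # Sublayer 1: normalize, run attention, add back into the residual stream.
    x = x + attn(layer_norm(x))
    # Sublayer 2: normalize, run the FFN, add back again.
    x = x + ffn(layer_norm(x))
    return x

# Hypothetical stand-ins for the real sublayers, only to exercise the wiring.
rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
W_attn = rng.normal(size=(d_model, d_model)) * 0.1
W_ffn = rng.normal(size=(d_model, d_model)) * 0.1
toy_attn = lambda h: h @ W_attn
toy_ffn = lambda h: np.tanh(h @ W_ffn)

x = rng.normal(size=(seq_len, d_model))
y = pre_norm_block(x, toy_attn, toy_ffn)
print(y.shape)  # (4, 8): same width d_model in and out
```

Because both sublayers read from and add to a tensor of the same shape, blocks of this form can be stacked without any reshaping between them.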
Section 2 — Residual connection and residual stream
Residual rule: x_out = x + f(x), where f is the sublayer (attention or FFN).
Read it as carry old state + learned edit.
Why it matters:
- preserves a clean identity path
- stabilizes deep optimization
- lets every block write into the same residual stream
Residual stream = the running token representation passed across blocks.
See explainer §2.1-§2.5.
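One way to see the "carry old state + learned edit" reading is to track every edit written into the stream. The sketch below assumes hypothetical sublayers (small fixed linear maps) purely for illustration; `run_stack` is not a name from the explainer.

```python
import numpy as np

def run_stack(x, sublayers):
    # Each sublayer reads the stream and adds its edit; nothing is overwritten.
    edits = []
    for f in sublayers:
        delta = f(x)       # learned edit
        edits.append(delta)
        x = x + delta      # carry old state + learned edit
    return x, edits

rng = np.random.default_rng(0)
x0 = rng.normal(size=(3, 4))  # 3 tokens, d_model = 4

# Hypothetical sublayers: small linear maps standing in for attention/FFN.
weights = [rng.normal(size=(4, 4)) * 0.1 for _ in range(5)]
sublayers = [lambda h, W=W: h @ W for W in weights]

x_out, edits = run_stack(x0, sublayers)

# The final stream is the original input plus the sum of every edit.
print(np.allclose(x_out, x0 + sum(edits)))  # True
```

If every sublayer returned zeros, the stack would reduce to the identity, which is the "clean identity path" the bullets refer to.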
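Section 3 — Layer normalization
LayerNorm operates per token across features: y = gamma * (x - mean) / sqrt(var + eps) + beta, where mean and var are computed over the d_model features of a single token, and gamma and beta are learned per-feature parameters.
Why it matters:
- controls activation scale drift
- recenters token features
- gives later sublayers more predictable input statistics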
See explainer §3.1-§3.3.
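A short NumPy sketch of per-token normalization follows. It assumes the standard formulation with a learned scale `gamma` and shift `beta` (set to ones and zeros here), and a `(batch, seq, d_model)` layout chosen only for the example.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Statistics are taken over the last axis only: one mean/variance per token.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
d_model = 16
# Deliberately off-center, large-scale activations: (batch, seq, d_model).
x = rng.normal(loc=3.0, scale=5.0, size=(2, 4, d_model))
gamma, beta = np.ones(d_model), np.zeros(d_model)

y = layer_norm(x, gamma, beta)

# Every token ends up recentered and rescaled.
print(np.allclose(y.mean(axis=-1), 0.0, atol=1e-6))  # True
print(np.allclose(y.std(axis=-1), 1.0, atol=1e-3))   # True

# Normalizing one token on its own gives the same result: no batch dependence.
print(np.allclose(layer_norm(x[0, 1], gamma, beta), y[0, 1]))  # True
```

The last check is the practical difference from batch normalization: the output for a token does not change with batch size or with the other sequences in the batch.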
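Section 4 — Pre-norm vs post-norm
Pre-norm: x = x + f(LN(x)). Only the sublayer input is normalized; the residual path is left untouched.
Post-norm: x = LN(x + f(x)). The sum is normalized, so LayerNorm sits on the residual path.
Modern LLMs usually prefer pre-norm because deep stacks train more stably: the identity path is not rescaled inside each block.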
See explainer §3.4.
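The difference between the two placements shows up most clearly when the sublayer contributes nothing. The sketch below uses a hypothetical `zero_edit` sublayer and a LayerNorm without learned parameters; it is an illustration of the wiring, not a training experiment.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pre_norm_step(x, f):
    # Normalize only what the sublayer sees; the residual path is untouched.
    return x + f(layer_norm(x))

def post_norm_step(x, f):
    # Normalize after the addition; the residual path passes through LayerNorm.
    return layer_norm(x + f(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8)) * 10.0        # large-scale stream
zero_edit = lambda h: np.zeros_like(h)    # hypothetical sublayer that writes nothing

# Pre-norm with a zero edit is an exact identity.
print(np.allclose(pre_norm_step(x, zero_edit), x))   # True
# Post-norm still rescales the stream, even with a zero edit.
print(np.allclose(post_norm_step(x, zero_edit), x))  # False
```

In a deep post-norm stack, the signal from the input passes through one LayerNorm per sublayer, whereas in pre-norm it reaches the output through additions only.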
Section 5 — Attention inside the block
Attention mixes information across positions.
Q, K, V come from learned projections.
Multi-head attention runs several heads in parallel, then projects back to d_model.
See explainer §4.3.
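The project, split into heads, attend, concatenate, and project back sequence can be sketched as below. The scaling by `sqrt(d_head)` is the standard scaled dot-product form and is an assumption here, since this sheet does not spell it out; the weight names are illustrative, and masking is left out.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads):
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split(h):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return h.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    # Q, K, V come from learned projections of the same input.
    q, k, v = split(x @ W_q), split(x @ W_k), split(x @ W_v)

    # Each head mixes information across positions independently.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                   # (n_heads, seq, d_head)

    # Concatenate heads and project back to d_model.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 8, 2
x = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

out, weights = multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads)
print(out.shape)      # (5, 8)
print(weights.shape)  # (2, 5, 5)
```

Each row of `weights` sums to 1, so every output position is a weighted average of value vectors from the positions it attends to.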
Section 6 — FFN inside the block
FFN applies the same per-token MLP to each position independently.
Typical shape: d_model -> 4*d_model -> d_model.
Attention mixes across tokens. FFN transforms within one token.
See explainer §4.4.
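A minimal sketch of the d_model -> 4*d_model -> d_model shape, with a check that positions do not interact. ReLU is assumed as the activation for simplicity; real models often use GELU or a gated variant.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # Expand to the hidden width, apply the nonlinearity, project back.
    hidden = np.maximum(0.0, x @ W1 + b1)  # ReLU (assumed activation)
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
d_ff = 4 * d_model

W1 = rng.normal(size=(d_model, d_ff)) * 0.1
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.1
b2 = np.zeros(d_model)

x = rng.normal(size=(seq_len, d_model))
y = ffn(x, W1, b1, W2, b2)
print(y.shape)  # (4, 8)

# Running one token alone gives the same row: the FFN never looks at neighbors.
print(np.allclose(ffn(x[2:3], W1, b1, W2, b2), y[2:3]))  # True
```

The same check would fail for attention, where changing any visible token changes the output at other positions.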
Section 7 — Encoder vs decoder
Encoder block: full self-attention. Every token can attend to every token.
Decoder block: causal self-attention. A token can attend only to earlier positions and itself.
Encoder-decoder model: decoder also includes cross-attention to encoder outputs.
See explainer §5.1 and §5.3.
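The only structural difference between self-attention and cross-attention is where the keys and values come from. The single-head sketch below assumes unmasked scaled dot-product attention and illustrative weight names; it shows shapes only, not a full encoder-decoder model.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q_in, kv_in, W_q, W_k, W_v):
    # Queries come from q_in; keys and values come from kv_in.
    q = q_in @ W_q
    k = kv_in @ W_k
    v = kv_in @ W_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d_model = 8
enc_out = rng.normal(size=(6, d_model))    # 6 source tokens from the encoder
dec_state = rng.normal(size=(3, d_model))  # 3 target tokens in the decoder
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))

# Encoder self-attention: every source token attends to every source token.
self_out = attention(enc_out, enc_out, W_q, W_k, W_v)
# Cross-attention: decoder tokens query the encoder outputs.
cross_out = attention(dec_state, enc_out, W_q, W_k, W_v)

print(self_out.shape)   # (6, 8)
print(cross_out.shape)  # (3, 8)
```

The output length always follows the query side, which is why cross-attention returns one vector per decoder position regardless of source length.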
Section 8 — Causal mask
Causal mask is lower-triangular visibility control.
Blocked future logits get -inf before softmax.
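Example for sequence length 4 (rows are query positions, columns are key positions, 1 = visible, 0 = blocked):
1 0 0 0
1 1 0 0
1 1 1 0
1 1 1 1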
Mask bug = future leakage during training.
See explainer §5.2.
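The lower-triangular mask and the -inf trick can be checked directly in NumPy. The scores here are random placeholders rather than real attention logits.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

seq_len = 4
# True where a query (row) may see a key (column): itself and earlier positions.
visible = np.tril(np.ones((seq_len, seq_len), dtype=bool))
print(visible.astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]

rng = np.random.default_rng(0)
scores = rng.normal(size=(seq_len, seq_len))  # placeholder logits

# Blocked logits become -inf, so softmax assigns them exactly zero weight.
masked = np.where(visible, scores, -np.inf)
weights = softmax(masked, axis=-1)

print(np.allclose(np.triu(weights, k=1), 0.0))  # True: no weight on the future
print(np.allclose(weights.sum(axis=-1), 1.0))   # True: rows still sum to 1
```

Asserting that the strict upper triangle of the attention weights is zero is a cheap test for the future-leakage bug named above.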
Section 9 — KV cache
During autoregressive inference, cache past keys and values. Then compute fresh Q, K, V only for the new token, append the new K and V to the cache, and attend over the whole cache.
Why it matters:
- lower latency
- less repeated work
- critical for long generations
Cache helps inference, not standard parallel training.
See explainer §5.4-§5.5.
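A single-head sketch of cached decoding, compared against full causal attention over the whole sequence. The dict-based `cache` and the function names are hypothetical choices for this example; production implementations typically preallocate cache tensors per layer and head.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(x, W_q, W_k, W_v):
    # Training-style: all positions at once, with a causal mask.
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    visible = np.tril(np.ones(scores.shape, dtype=bool))
    return softmax(np.where(visible, scores, -np.inf)) @ v

def decode_step(x_new, cache, W_q, W_k, W_v):
    # Inference-style: project only the new token, then extend the cache.
    q, k, v = x_new @ W_q, x_new @ W_k, x_new @ W_v
    cache["k"] = np.concatenate([cache["k"], k], axis=0)
    cache["v"] = np.concatenate([cache["v"], v], axis=0)
    # The cache holds only past and current tokens, so no mask is needed.
    scores = q @ cache["k"].T / np.sqrt(q.shape[-1])
    return softmax(scores) @ cache["v"]

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
x = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))

cache = {"k": np.zeros((0, d_model)), "v": np.zeros((0, d_model))}
stepwise = np.concatenate(
    [decode_step(x[t:t + 1], cache, W_q, W_k, W_v) for t in range(seq_len)],
    axis=0,
)

# Token-by-token decoding with the cache matches full causal attention.
print(np.allclose(stepwise, causal_attention(x, W_q, W_k, W_v)))  # True
print(cache["k"].shape)  # (5, 8)
```

Without the cache, step t would recompute K and V for all t earlier tokens, which is the repeated work the bullets refer to. During training every position is processed in one masked pass, so there is nothing to cache.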
Self-check
- What is the residual stream? (See explainer §2.4)
- Why does LayerNorm operate per token, not per batch? (§3.1)
- Why is pre-norm usually easier to train deeply? (§3.4)
- Attention vs FFN — what does each one change? (§4.1)
- Why must decoder attention be masked? (§5.2)
- Why does KV cache matter operationally? (§5.4)