04. Week 3: Daily Recall
Spaced practice. Answer aloud. If stuck, jump to the explainer section in parentheses.
Monday (after explainer chapter 1 and chapter 2)
- What exactly is the opening failure of the unstable stack? (§1.1)
- Give three numerical behaviors of a bad deep stack: explode, collapse, oscillate. (§1.1)
- Why is "just stack more layers" incomplete advice? (§2.2)
- Residual connection in one formula. Then explain it in plain English. (§2.1, §2.3)
- Why is learning an edit easier than learning a full rewrite? (§2.3)
- What is the residual stream? (§2.4)
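Self-check for the residual questions, to run only after answering aloud. This is a toy sketch, not the explainer's code: `toy_layer` is a hypothetical stand-in for a learned layer, chosen to have a tiny output so the contrast is easy to see. It compares a plain stack, where every layer rewrites the signal, with a residual stack, where every layer adds an edit on top of the input it received.

```python
def toy_layer(x, scale=0.01):
    """Hypothetical stand-in for a learned layer f: a tiny linear map."""
    return [scale * v for v in x]


def plain_stack(x, depth):
    """x -> f(f(...f(x))): every layer fully rewrites the signal."""
    for _ in range(depth):
        x = toy_layer(x)
    return x


def residual_stack(x, depth):
    """x -> x + f(x) at every layer: each layer only adds an edit."""
    for _ in range(depth):
        x = [a + b for a, b in zip(x, toy_layer(x))]
    return x


x0 = [1.0, -2.0]
print(plain_stack(x0, 20))     # magnitudes shrink toward zero
print(residual_stack(x0, 20))  # same order of magnitude as the input
```

With the same weak layer, the plain stack loses the input after a few layers, while the residual stack carries it through on the identity path. A real layer is nonlinear and learned, so treat this only as intuition for the formula.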
Tuesday (after explainer chapter 3)
- LayerNorm normalizes over what exact dimension? (§3.1)
- Why is LayerNorm about numerical hygiene, not semantics? (§3.1)
- Give three activation-drift patterns that hurt optimization. (§3.2)
- Compute LayerNorm for `[2, 4, 6]` up to rough decimals. (§3.3)
- Pre-norm vs post-norm: draw both in words. (§3.4)
- Why do modern LLMs usually choose pre-norm? (§3.4)
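Self-check for the LayerNorm arithmetic, to run only after working `[2, 4, 6]` by hand. This is a minimal sketch that assumes the usual definition: subtract the mean over the feature dimension, divide by the square root of the population variance plus a small `eps`, then apply a scale and shift. The parameter names `gamma`, `beta`, and `eps` and their default values are illustrative assumptions, not taken from the explainer.

```python
import math


def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one token's feature vector to mean 0 and variance ~1."""
    mean = sum(x) / len(x)
    # Population variance (divide by n, not n - 1).
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in x]


print([round(v, 3) for v in layer_norm([2.0, 4.0, 6.0])])
# [-1.225, 0.0, 1.225]
```

The mean is 4 and the variance is 8/3, so the outer values land near ±1.22. Only the numerical scale of the vector changes; the ordering of the features does not.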
Wednesday (after explainer chapter 4)
- A transformer block has two benches. What are they? (§4.1)
- Attention mixes across what axis? FFN mixes across what axis? (§4.1, §4.4)
- Draw the modern pre-norm block from memory. (§4.2)
- Why is FFN not redundant even after attention? (§4.1, §4.4)
- In one head, what do attention weights actually do to the value vectors? (§4.3)
- What role do the parallel crews play inside multi-head attention? (§4.3)
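Self-check for the single-head question, to run only after answering aloud. This is a hedged sketch of standard scaled dot-product attention for one query position; the helper names `softmax` and `attend` are illustrative, and the learned projections that produce the query, keys, and values are left out. It shows that the attention weights are non-negative, sum to 1, and are used to take a weighted average of the value vectors.

```python
import math


def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """One head, one query position: weighted average of value vectors."""
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    weights = softmax(scores)
    width = len(values[0])
    output = [
        sum(w * value[j] for w, value in zip(weights, values))
        for j in range(width)
    ]
    return weights, output


# Two identical keys get equal weight, so the output is the mean of the values.
weights, output = attend([1.0, 0.0], [[1.0, 0.0], [1.0, 0.0]], [[2.0, 0.0], [0.0, 4.0]])
print(weights, output)
```

Multi-head attention runs several such heads in parallel on different projections and concatenates their outputs; that part is not shown here.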
Thursday (after explainer chapter 5)
- Encoder-only vs decoder-only — what visibility changes? (§5.1)
- In cross-attention, where do Q, K, and V come from? (§5.3)
- Why is a causal mask lower-triangular? (§5.2)
- What exact bug happens if a decoder can attend rightward during training? (§5.2)
- KV cache stores what tensors? (§5.4)
- Why does KV cache help inference but not standard training? (§5.4)
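Self-check for the causal mask question, to run only after answering aloud. This is a minimal sketch, assuming the convention that row `i` is the querying position and column `j` is the position being attended to; `causal_mask` is an illustrative helper name. Allowing only `j <= i` is what makes the allowed region lower-triangular.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


for row in causal_mask(4):
    print(" ".join("1" if allowed else "0" for allowed in row))
# 1 0 0 0
# 1 1 0 0
# 1 1 1 0
# 1 1 1 1
```

Every zero above the diagonal is a future token. If those entries were allowed during training, the decoder could read the token it is supposed to predict.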
Friday (cumulative + drawing)
- Sketch the full failure-fix table from explainer §6.1 with at least eight rows.
- Draw three diagrams: unstable plain stack, pre-norm block, causal mask. (§1.1, §4.2, §5.2)
- Explain the shortcut pipe and quality inspector without using jargon first. (§2, §3)
- Give a production debugging checklist for unstable transformer training. (§6.4)
- Explain decoder-only, encoder-only, and encoder-decoder in one clean comparison. (§5.1, §5.3)
- Say the exact bridge to the next module from memory. (last line of `02_explainer.md`)