04. Week 3 — Daily Recall

Spaced practice. Answer aloud. If stuck, jump to the explainer section in parentheses.

Monday (after explainer chapters 1 and 2)

  1. What exactly is the opening failure of the unstable stack? (§1.1)
  2. Describe each of the three numerical behaviors of a bad deep stack: explode, collapse, oscillate. (§1.1)
  3. Why is "just stack more layers" incomplete advice? (§2.2)
  4. Residual connection in one formula. Then explain it in plain English. (§2.1, §2.3)
  5. Why is learning an edit easier than learning a full rewrite? (§2.3)
  6. What is the residual stream? (§2.4)

Tuesday (after explainer chapter 3)

  1. LayerNorm normalizes over what exact dimension? (§3.1)
  2. Why is LayerNorm about numerical hygiene, not semantics? (§3.1)
  3. Give three activation-drift patterns that hurt optimization. (§3.2)
  4. Compute LayerNorm for [2, 4, 6] to about two decimal places. (§3.3)
  5. Pre-norm vs post-norm — draw both in words. (§3.4)
  6. Why do modern LLMs usually choose pre-norm? (§3.4)
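
Self-check for question 4, to use only after working it by hand. This is a minimal sketch, not the explainer's own code: it assumes the learned scale and shift are left at 1 and 0, uses the biased (divide by n) variance, and picks `eps=1e-5` as an illustrative value. The function name `layer_norm` is hypothetical.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to roughly zero mean and unit variance.

    Assumptions: no learned scale/shift (gamma = 1, beta = 0),
    and eps is an illustrative stabilizer value.
    """
    mean = sum(x) / len(x)
    # Biased variance: divide by n, not n - 1.
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

normalized = layer_norm([2.0, 4.0, 6.0])
print([round(v, 2) for v in normalized])  # [-1.22, 0.0, 1.22]
```

By hand: the mean is 4, the variance is 8/3 ≈ 2.67, the standard deviation is ≈ 1.63, so the outputs are about ±2/1.63 ≈ ±1.22 around 0.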

Wednesday (after explainer chapter 4)

  1. A transformer block has two benches. What are they? (§4.1)
  2. Attention mixes across what axis? FFN mixes across what axis? (§4.1, §4.4)
  3. Draw the modern pre-norm block from memory. (§4.2)
  4. Why is FFN not redundant even after attention? (§4.1, §4.4)
  5. In one head, what do attention weights actually do to the value vectors? (§4.3)
  6. What role do the parallel crews play inside multi-head attention? (§4.3)
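
Self-check for question 3, to use after drawing the block from memory. The sketch below shows only the wiring, with the post-norm layout alongside for contrast. All names (`add`, `pre_norm_block`, `post_norm_block`) and the toy stand-in sublayers are hypothetical. A real block would use actual attention, an actual FFN, and LayerNorm modules with learned parameters.

```python
def add(a, b):
    """Elementwise sum of two sequences of token vectors (the residual add)."""
    return [[u + v for u, v in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def pre_norm_block(x, norm1, norm2, attention, ffn):
    # Normalize first, run the sublayer, then add onto the untouched stream.
    x = add(x, attention(norm1(x)))
    x = add(x, ffn(norm2(x)))
    return x

def post_norm_block(x, norm1, norm2, attention, ffn):
    # Add first, then normalize the stream itself.
    x = norm1(add(x, attention(x)))
    x = norm2(add(x, ffn(x)))
    return x

# Toy stand-ins (assumptions for illustration only).
identity = lambda x: x
zeros = lambda x: [[0.0 for _ in row] for row in x]

stream = [[1.0, 2.0], [3.0, 4.0]]
# With sublayers that contribute nothing, the pre-norm block passes the
# residual stream through unchanged.
print(pre_norm_block(stream, identity, identity, zeros, zeros) == stream)  # True
```

The point of the wiring: in the pre-norm layout the normalizer sits inside each branch, so the residual stream itself is never rescaled on its way through the block.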

Thursday (after explainer chapter 5)

  1. Encoder-only vs decoder-only — what visibility changes? (§5.1)
  2. In cross-attention, where do Q, K, and V come from? (§5.3)
  3. Why is a causal mask lower-triangular? (§5.2)
  4. What exact bug happens if a decoder can attend rightward during training? (§5.2)
  5. KV cache stores what tensors? (§5.4)
  6. Why does KV cache help inference but not standard training? (§5.4)
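
Self-check for question 3, to use after answering aloud. A minimal sketch of a causal mask, assuming rows index the query position and columns index the key position. The function name `causal_mask` is hypothetical.

```python
def causal_mask(n):
    """mask[i][j] is True when query position i may attend to key position j.

    Position i may see itself and everything to its left (j <= i),
    which fills exactly the lower triangle including the diagonal.
    """
    return [[j <= i for j in range(n)] for i in range(n)]

for row in causal_mask(4):
    print(" ".join("1" if allowed else "0" for allowed in row))
# 1 0 0 0
# 1 1 0 0
# 1 1 1 0
# 1 1 1 1
```

Every entry above the diagonal is a "rightward" look at a future token, so those are the entries that get blocked.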

Friday (cumulative + drawing)

  1. Sketch the full failure-fix table from explainer §6.1 with at least eight rows.
  2. Draw three diagrams: unstable plain stack, pre-norm block, causal mask. (§1.1, §4.2, §5.2)
  3. Explain the shortcut pipe and the quality inspector in plain language first, before using any jargon. (§2, §3)
  4. Give a production debugging checklist for unstable transformer training. (§6.4)
  5. Explain decoder-only, encoder-only, and encoder-decoder in one clean comparison. (§5.1, §5.3)
  6. Say the exact bridge to the next module from memory. (last line of 02_explainer.md)