04. Week 5 — Daily Recall
Spaced practice. Answer from memory. If stuck, jump to the explainer section in parentheses.
Monday (after ELI5 + chapters 1-2)
- Use the employee-training analogy to explain pretraining, SFT, RLHF/DPO, and domain fine-tuning. Name all five placeholders. (ELI5)
- Why can a trillion-token model still fail to summarize an email properly? (§1.1)
- Broad knowledge vs usable behavior — state the difference in two sentences. (§1.1-§1.2)
- What does the curriculum control during pretraining? Give three examples of data-mix choices. (§2.1)
- Why does dedup matter, beyond just saving tokens? (§2.1)
- Why does next-token prediction produce broad knowledge at all? (§2.2)
- Draw the simple training flow: tokens → forward → loss → backward → update. (§2.3)
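For checking your drawing of the training flow after you have attempted it from memory, here is a minimal sketch of the five stages on a toy next-token model. It assumes a bigram logit table in place of a real network, plain gradient descent in place of a real optimizer, and a hand-made three-token sequence; the names `train_step` and `weights` are illustrative and do not come from the module.

```python
import math

def softmax(logits):
    """Turn a row of logits into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_step(weights, tokens, lr=0.5):
    """One pass of tokens -> forward -> loss -> backward -> update.

    weights[i][j] is the logit for "token j follows token i" (a toy stand-in
    for a real model's parameters).
    """
    vocab = len(weights)
    grads = [[0.0] * vocab for _ in range(vocab)]
    pairs = list(zip(tokens[:-1], tokens[1:]))  # (previous token, next token)
    loss = 0.0
    for prev, nxt in pairs:
        probs = softmax(weights[prev])           # forward
        loss += -math.log(probs[nxt])            # loss: cross-entropy on the next token
        for j in range(vocab):                   # backward: d(loss)/d(logit) = p - one_hot
            grads[prev][j] += probs[j] - (1.0 if j == nxt else 0.0)
    n = len(pairs)
    for i in range(vocab):                       # update: plain gradient descent
        for j in range(vocab):
            weights[i][j] -= lr * grads[i][j] / n
    return loss / n

tokens = [0, 1, 2, 0, 1, 2, 0, 1, 2]             # toy "corpus": 0 -> 1 -> 2 -> 0 ...
weights = [[0.0] * 3 for _ in range(3)]
losses = [train_step(weights, tokens) for _ in range(20)]
```

The first loss equals ln 3 because the untrained table predicts all three tokens uniformly; the loss then falls as the table learns the repeating pattern.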
Tuesday (after chapter 3)
- What stays the same objective-wise between pretraining and SFT? What changes? (§3.1)
- Give one example of a behavior SFT teaches well, and one thing SFT does not perfectly solve. (§3.1)
- Why does SFT data quality often beat quantity? Explain using the shadowing analogy. (§3.2)
- List four signs of a high-quality SFT example. (§3.2)
- Why is a chat template not just wrapper fluff? (§3.3)
- Name three template bugs that can hurt serving quality. (§3.3)
- What is catastrophic forgetting? Give two causes and two fixes. (§3.4)
Wednesday (after chapter 3 review + hands_on_lab setup)
- Training vs inference — what extra things exist only in training? (§2.3)
- Compute the weight memory for a 7B-parameter model in bf16. Then say why full training still needs much more than 14 GB. (§2.3)
- Domain fine-tuning — when does it make sense, and when might RAG be better? (§3.5)
- Draw the broad-skill vs domain-skill bars for careful SFT versus reckless SFT. (§3.4)
- Write the global batch formula from memory, then compute one example. (§5.3)
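To check the two arithmetic questions above once you have tried them, the sketch below works through both. The training-state estimate assumes a common mixed-precision Adam layout (bf16 weights and gradients plus fp32 master weights and two fp32 optimizer moments, 16 bytes per parameter) and ignores activations; the module may use a different breakdown. The global batch formula is the commonly used form, and §5.3 may state it with other symbols. The example values 4, 8, and 16 are made up.

```python
def weight_memory_gb(n_params, bytes_per_param=2):
    """Memory for the weights alone; bf16 stores each parameter in 2 bytes."""
    return n_params * bytes_per_param / 1e9

def training_state_gb(n_params):
    """Rough per-parameter training state, excluding activations.

    Assumption (not from the module): mixed-precision Adam with
    2 B bf16 weights + 2 B bf16 grads + 4 B fp32 master copy
    + 4 B Adam first moment + 4 B Adam second moment = 16 B per parameter.
    """
    bytes_per_param = 2 + 2 + 4 + 4 + 4
    return n_params * bytes_per_param / 1e9

def global_batch(micro_batch, grad_accum_steps, data_parallel_ranks):
    """Sequences contributing to one optimizer step across all devices."""
    return micro_batch * grad_accum_steps * data_parallel_ranks

weights_gb = weight_memory_gb(7e9)            # 7e9 params * 2 bytes
state_gb = training_state_gb(7e9)             # 7e9 params * 16 bytes
sequences_per_step = global_batch(4, 8, 16)   # hypothetical example values
```

Under these assumptions the weights take 14 GB while the training state is about 112 GB before any activations are stored, which is the gap the question asks you to explain.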
Thursday (after chapter 4)
- What does the reward model learn from? Write the chosen/rejected format. (§4.1)
- Why does RLHF exist if we already did SFT? (§4.1-§4.2)
- PPO in plain English — what is the loop? (§4.2)
- Why is reward maximization alone dangerous? Name the specific failure. (§4.3)
- KL divergence in RLHF — what does it practically protect? (§4.3)
- DPO vs PPO-style RLHF — what extra machinery disappears? (§4.4)
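For checking the preference-data and DPO answers after an attempt from memory, here is a hedged sketch of the standard pairwise formulas. The record keys `prompt`, `chosen`, and `rejected` follow a common convention and may differ from the exact format in §4.1; the function names, the example strings, and `beta=0.1` are illustrative assumptions.

```python
import math

# One preference record: the same prompt with a preferred and a dispreferred reply.
# The text here is invented purely for illustration.
example = {
    "prompt": "Summarize this email in one sentence.",
    "chosen": "The meeting moved to 3 pm on Thursday.",
    "rejected": "Emails are a form of electronic communication.",
}

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def reward_model_loss(r_chosen, r_rejected):
    """Pairwise (Bradley-Terry style) loss: push the chosen score above the rejected one."""
    return -log_sigmoid(r_chosen - r_rejected)

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one pair, from sequence log-probabilities.

    Each shift measures how far the policy has moved from the frozen
    reference model on that response.
    """
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_shift - rejected_shift))

untrained = dpo_loss(-5.0, -7.0, -5.0, -7.0)   # policy identical to reference
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)    # policy favors chosen more than reference does
```

Note that `dpo_loss` needs only log-probabilities from the policy and a frozen reference model: there is no separately trained reward model and no sampling loop in the computation, and the reference terms play the role of keeping the policy anchored.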
Friday (after chapter 5)
- Why do we use warmup before cosine decay? (§5.1)
- bf16 vs fp16 — same bytes, so what is the real difference? (§5.2)
- Gradient accumulation — what does it simulate, and what trade-off does it introduce? (§5.3)
- Two meanings of checkpointing — explain both. (§5.4)
- Why is reward increase alone not enough to decide when to stop RLHF training? (§5.5)
- Module 06 bridge: why does quantization only make sense after you understand parameter count × bytes? (§5.6)
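To check your picture of warmup followed by cosine decay, the sketch below computes the learning rate at a given step. It assumes linear warmup and a nonzero floor; `lr_at` and all the default values (`max_lr`, `min_lr`, `warmup_steps`, `total_steps`) are hypothetical choices for illustration, not settings from §5.1.

```python
import math

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=100, total_steps=1000):
    """Learning rate at a 0-indexed step: linear warmup, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Warmup: ramp up linearly so early, noisy gradients take small steps.
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    # Cosine factor goes smoothly from 1 (start of decay) to 0 (end of training).
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine

schedule = [lr_at(s) for s in range(1000)]
```

With these assumed values the rate starts at one hundredth of `max_lr`, reaches `max_lr` at the end of warmup, and approaches `min_lr` at the final step.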
Weekend (pre-hands_on_lab + revision)
- From memory, list all 12 rows of the failure-fix table. (§6.1)
- Draw the full lifecycle pipeline from raw data to domain specialization. (ELI5 + §6.1)
- Give a Lead-level answer to: “Why is pretraining alone insufficient for assistant behavior?” (§6.3)
- Give a Lead-level answer to: “Why is KL divergence necessary in RLHF?” (§6.3)
- Name three production bugs from this module and one diagnostic check for each. (§6.4)
- Without looking, state the exact bridge to the next module in your own words. (final bridge)