04. Week 5 — Daily Recall¶

Spaced practice. Answer from memory. If stuck, jump to the explainer section in parentheses.

Monday (after ELI5 + chapters 1-2)¶

Use the employee-training analogy to explain pretraining, SFT, RLHF/DPO, and domain fine-tuning. Name all five placeholders. (ELI5)
Why can a trillion-token model still fail to summarize an email properly? (§1.1)
Broad knowledge vs usable behavior — state the difference in two sentences. (§1.1-§1.2)
What does the curriculum control during pretraining? Give three examples of data-mix choices. (§2.1)
Why does dedup matter, beyond just saving tokens? (§2.1)
Why does next-token prediction produce broad knowledge at all? (§2.2)
Draw the simple training flow: tokens → forward → loss → backward → update. (§2.3)

What stays the same objective-wise between pretraining and SFT? What changes? (§3.1)
Give one example of a behavior SFT teaches well, and one thing SFT does not perfectly solve. (§3.1)
Why does SFT data quality often beat quantity? Explain using the shadowing analogy. (§3.2)
List four signs of a high-quality SFT example. (§3.2)
Why is a chat template not just wrapper fluff? (§3.3)
Name three template bugs that can hurt serving quality. (§3.3)
What is catastrophic forgetting? Give two causes and two fixes. (§3.4)

Training vs inference — what extra things exist only in training? (§2.3)
Compute memory for a 7B model in bf16. Then say why full training still needs much more than 14 GB. (§2.3)
Domain fine-tuning — when does it make sense, and when might RAG be better? (§3.5)
Draw the broad-skill vs domain-skill bars for careful SFT versus reckless SFT. (§3.4)
Global batch formula from memory. Then compute one example. (§5.3)

What does the reward model learn from? Write the chosen/rejected format. (§4.1)
Why does RLHF exist if we already did SFT? (§4.1-§4.2)
PPO in plain English — what is the loop? (§4.2)
Why is reward maximization alone dangerous? Name the specific failure. (§4.3)
KL divergence in RLHF — what does it practically protect? (§4.3)
DPO vs PPO-style RLHF — what extra machinery disappears? (§4.4)

Why do we use warmup before cosine decay? (§5.1)
bf16 vs fp16 — same bytes, so what is the real difference? (§5.2)
Gradient accumulation — what does it simulate, and what trade-off does it introduce? (§5.3)
Two meanings of checkpointing — explain both. (§5.4)
Why is reward increase alone not enough to decide when to stop RLHF training? (§5.5)
Module 06 bridge: why does quantization only make sense after you understand parameter count × bytes? (§5.6)

From memory, list all 12 rows of the failure-fix table. (§6.1)
Draw the full lifecycle pipeline from raw data to domain specialization. (ELI5 + §6.1)
Give a Lead-level answer to: “Why is pretraining alone insufficient for assistant behavior?” (§6.3)
Give a Lead-level answer to: “Why is KL divergence necessary in RLHF?” (§6.3)
Name three production bugs from this module and one diagnostic check for each. (§6.4)
Without looking, state the exact bridge to the next module in your own words. (final bridge)