01. Week 5 — LLM Training Lifecycle
Key concepts to master
- The opening failure: pretraining alone does not create an assistant.
- Data curation: dedup, filtering, mixing, and why the curriculum matters.
- Next-token prediction: what it optimizes and why broad knowledge emerges.
- Why models are large: weight memory ≈ parameter count × bytes per parameter.
- Training vs inference: gradients and optimizer states only exist in training.
- Data parallelism vs tensor parallelism vs pipeline parallelism.
- SFT: instruction tuning as job-shaped next-token prediction.
- Why SFT data quality often beats raw quantity.
- Chat templates: role markers are part of learned behavior.
- Catastrophic forgetting: how narrow tuning can erase broad skill.
- Domain fine-tuning: when specialization helps and when it harms.
- Reward modeling: chosen vs rejected comparisons.
- PPO intuition: optimize reward, but do not jump too far.
- KL divergence: the seatbelt against reward hacking.
- DPO: direct preference optimization without a separate reward model.
- Warmup + cosine decay.
- bf16 vs fp16.
- Gradient accumulation and checkpointing.
- When to stop pretraining, SFT, and preference tuning.
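The "warmup + cosine decay" item can be made concrete with a small schedule function. This is a minimal sketch, not a reference implementation: the function name `lr_at_step` and every default value (peak rate, floor, step counts) are illustrative assumptions, not settings taken from this course.

```python
import math


def lr_at_step(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=100, total_steps=1000):
    """Learning rate with linear warmup followed by cosine decay.

    All default values are illustrative assumptions, not recommendations.
    """
    if step < warmup_steps:
        # Linear ramp from 0 up to max_lr over the warmup window.
        return max_lr * step / warmup_steps
    if step >= total_steps:
        # After the schedule ends, hold the floor value.
        return min_lr
    # Fraction of the decay phase completed, in [0, 1).
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    # Cosine factor goes smoothly from 1 (start of decay) to 0 (end of decay).
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine


for s in (0, 50, 100, 550, 1000):
    print(s, lr_at_step(s))
```

The warmup keeps early updates small while optimizer statistics are still noisy; the cosine tail lowers the rate gradually instead of in abrupt steps.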
🧠 Mental models
- Pretraining: "Massive next-token compression that teaches the model the internet's habits."
- SFT: "A supervised rehearsal where the model practices the job you actually want."
- RLHF / PPO: "Reward-driven steering with a leash so the policy improves without sprinting off distribution."
- DPO: "Preference learning by comparing two answers directly instead of training a separate judge first."
- Parallelism strategies: "Data parallel clones workers; tensor parallel splits one layer; pipeline parallel splits the assembly line."
- Training vs inference: "Training carries the full backpack — activations, gradients, optimizer state — while inference travels light."
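The "full backpack" model can be turned into rough arithmetic. The sketch below assumes one common mixed-precision Adam layout (16 bytes of persistent state per parameter); that layout, the function names, and the 7B example size are assumptions for illustration, and activation memory is deliberately left out because it depends on batch size, sequence length, and checkpointing.

```python
def weight_memory_gb(n_params, bytes_per_param):
    """Memory for the weights alone: parameter count x bytes per parameter."""
    return n_params * bytes_per_param / 1e9


def training_state_memory_gb(n_params):
    """Persistent training state under an ASSUMED mixed-precision Adam layout.

    Per parameter: 2 (bf16 weights) + 2 (bf16 gradients)
                 + 4 (fp32 master weights) + 4 + 4 (two fp32 Adam moments).
    Activations are NOT included here.
    """
    bytes_per_param = 2 + 2 + 4 + 4 + 4  # = 16
    return n_params * bytes_per_param / 1e9


n = 7_000_000_000  # hypothetical 7B-parameter model
print(weight_memory_gb(n, 2))       # 14.0  (bf16 weights only, inference-style)
print(training_state_memory_gb(n))  # 112.0 (before any activations)
```

Under these assumptions the same model needs roughly 8x more memory for training state than for bf16 inference weights, before saved activations are counted.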
⚠️ Common traps
- Treating raw token count as enough while ignoring deduplication, filtering, contamination, and mixture quality.
- Assuming next-token pretraining alone will produce reliable instruction-following behavior.
- Fine-tuning so narrowly that catastrophic forgetting wipes out broad pretrained capability.
- Forgetting chat templates or role markers and concluding the tuned model itself regressed.
- Reward hacking during RLHF: the reward-model score rises while human preference drops and KL divergence from the reference model grows.
- Underestimating training memory by counting only the weights, when optimizer states and saved activations often dominate parameter storage.
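The reward-hacking trap is easier to see with numbers. The sketch below shows one common way a KL penalty is folded into the reward: subtract `beta` times a sample-based KL estimate (the summed log-probability gap between policy and reference on the generated tokens). The function name, the `beta` value, and all log-probabilities are made up for illustration.

```python
def kl_shaped_reward(reward, policy_logprobs, ref_logprobs, beta=0.2):
    """Return (shaped_reward, kl_estimate) for one sampled response.

    kl_estimate is a per-sample estimate: sum over generated tokens of
    log pi_policy(token) - log pi_ref(token). beta is an assumed coefficient.
    """
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward - beta * kl_estimate, kl_estimate


# Response A: modest reward, policy stays close to the reference model.
shaped_a, kl_a = kl_shaped_reward(1.0, [-1.0, -2.0], [-1.1, -2.1])

# Response B: high reward-model score, but the policy has drifted far from
# the reference (tokens the reference considers very unlikely).
shaped_b, kl_b = kl_shaped_reward(3.0, [-0.1, -0.1], [-8.0, -9.0])

print(shaped_a, kl_a)
print(shaped_b, kl_b)
```

With these made-up numbers, response B wins on raw reward but loses after the penalty, which is the "seatbelt" behavior: large drift from the reference has to pay for itself.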
🔗 Prerequisites & connections
- Builds on: Modules 03-04 transformer computation, Module 01 optimization basics, and Module 00 evaluation/calibration habits.
- Feeds into: quantization, LoRA/PEFT, deployment tradeoffs, and adaptation strategy choices in Module 06.
💬 Interview phrasing
- "Why doesn't pretraining alone give you a useful chat assistant?"
- "What is the objective difference between pretraining, SFT, RLHF, and DPO?"
- "Data, tensor, and pipeline parallelism — what exactly is split in each?"
- "Why is training memory so much larger than inference memory for the same model?"
- "How do you tell reward improvement from genuine assistant improvement in RLHF?"
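For the objective-difference question, DPO is compact enough to write out. The sketch below computes the standard DPO loss for a single preference pair from four sequence log-probabilities; the function name, argument names, and the `beta` default are illustrative assumptions, and a real implementation would operate on batched tensors.

```python
import math


def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Inputs are summed log-probabilities of each full response under the
    policy being trained and under the frozen reference model.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_lp - ref_chosen_lp
    rejected_margin = policy_rejected_lp - ref_rejected_lp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) written as log(1 + exp(-x)).
    return math.log1p(math.exp(-logits))


# Policy identical to the reference: loss is log(2), no preference learned yet.
print(dpo_loss(-5.0, -6.0, -5.0, -6.0))

# Policy has shifted toward the chosen answer and away from the rejected one.
print(dpo_loss(-4.0, -7.0, -5.0, -6.0))
```

No reward model and no sampling loop appear here: the preference pair and a frozen reference model are enough, which is the machinery DPO removes relative to PPO-style RLHF.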
⏱️ Difficulty markers
- 🟢 next-token prediction
- 🟢 SFT
- 🟡 training vs inference memory
- 🟡 parallelism strategies
- 🔴 RLHF / PPO / KL divergence
- 🔴 DPO
- 🔴 catastrophic forgetting
Self-check questions
For the fuller Q&A bank, see 02_explainer.md §6.3.
- Why does a trillion-token pretrained model still fail at email summarization? (§1.1)
- Why does the curriculum matter as much as raw scale? (§2.1)
- Why does next-token prediction create broad knowledge at all? (§2.2)
- Why is training memory much larger than inference memory? (§2.3)
- Data parallelism vs tensor parallelism vs pipeline parallelism — what gets split in each? (§2.4)
- What part of the training objective stays the same between pretraining and SFT? (§3.1)
- Why does SFT data quality beat quantity so often? (§3.2)
- Why can a chat-template mismatch make a tuned model feel broken? (§3.3)
- What is catastrophic forgetting, and how do you detect it? (§3.4)
- What does RLHF add beyond SFT? (§4.1-4.2)
- Why is KL divergence necessary in RLHF? (§4.3)
- PPO-style RLHF vs DPO — what machinery disappears? (§4.4)
- Why is bf16 usually safer than fp16 for training? (§5.2)
- What does gradient accumulation simulate? (§5.3)
- Why is reward increase alone not enough to decide when to stop? (§5.5)
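For the gradient accumulation question, a toy check makes the answer concrete: summing scaled micro-batch gradients reproduces the gradient of one larger batch. The sketch uses a hypothetical one-parameter model `y_hat = w * x` with mean squared error, and assumes equal-sized micro-batches that divide the batch evenly; all names and data are illustrative.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean squared error w.r.t. w for the model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Accumulate gradients over micro-batches before a single update.

    Assumes len(xs) is an exact multiple of micro_batch_size.
    """
    n_micro = len(xs) // micro_batch_size
    total = 0.0
    for i in range(n_micro):
        lo = i * micro_batch_size
        hi = lo + micro_batch_size
        # Divide by n_micro so the accumulated sum equals the full-batch mean.
        total += mse_grad(w, xs[lo:hi], ys[lo:hi]) / n_micro
    return total


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]
w = 0.5

full = mse_grad(w, xs, ys)
accum = accumulated_grad(w, xs, ys, micro_batch_size=2)
print(full, accum)
```

The two values agree up to floating-point rounding, so accumulation trades extra forward/backward passes for a larger effective batch without holding the whole batch in memory at once.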
Health check
- [ ] Read all 6 explainer sections, including ELI5 and recap.
- [ ] Can explain the lifecycle using the employee-training analogy.
- [ ] Can compute model memory from parameter count × bytes.
- [ ] Can explain training vs inference without hand-waving.
- [ ] Can distinguish SFT, RLHF, and DPO cleanly.
- [ ] GPT-2 hands_on_lab completed with before/after perplexity.
- [ ] One failure → fix note written using explainer §6.1 vocabulary.
- [ ] Daily recall questions answerable from memory.
- [ ] Ready to bridge into 06_adaptation_compression.