01. Week 5: LLM Training Lifecycle

Key concepts to master

The opening failure: pretraining alone does not create an assistant.
Data curation: dedup, filtering, mixing, and why the curriculum matters.
Next-token prediction: what it optimizes and why broad knowledge emerges.
Why models are large: parameter count × bytes per parameter.
Training vs inference: gradients and optimizer states exist only in training.
Data parallelism vs tensor parallelism vs pipeline parallelism.
SFT: instruction tuning as job-shaped next-token prediction.
Why SFT data quality often beats raw quantity.
Chat templates: role markers are part of learned behavior.
Catastrophic forgetting: how narrow tuning can erase broad skill.
Domain fine-tuning: when specialization helps and when it harms.
Reward modeling: chosen vs rejected comparisons.
PPO intuition: optimize reward, but do not jump too far.
KL divergence: the seatbelt against reward hacking.
DPO: direct preference optimization without a separate reward model.
Warmup + cosine decay.
bf16 vs fp16.
Gradient accumulation and checkpointing.
When to stop pretraining, SFT, and preference tuning.
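
The "warmup + cosine decay" item in the list above can be made concrete with a small schedule function. This is a minimal sketch, not any particular library's scheduler: the name `lr_at_step`, its parameters, and the example values (1,000 total steps, 100 warmup steps, peak rate 3e-4) are assumptions chosen for illustration.

```python
import math

def lr_at_step(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Linear warmup from 0 to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Warmup phase: ramp linearly so early, noisy gradients take small steps.
        return peak_lr * step / warmup_steps
    # Decay phase: progress runs from 0.0 (end of warmup) to 1.0 (last step).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine

# Illustrative values only (assumed, not taken from the course material).
for s in (0, 50, 100, 550, 1000):
    print(s, lr_at_step(s, total_steps=1000, warmup_steps=100, peak_lr=3e-4))
```

The learning rate peaks exactly when warmup ends and reaches `min_lr` at the final step; in between it follows half a cosine wave.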

Pretraining: "Massive next-token compression that teaches the model the internet's habits."
SFT: "A supervised rehearsal where the model practices the job you actually want."
RLHF / PPO: "Reward-driven steering with a leash so the policy improves without sprinting off distribution."
DPO: "Preference learning by comparing two answers directly instead of training a separate judge first."
Parallelism strategies: "Data parallel clones workers; tensor parallel splits one layer; pipeline parallel splits the assembly line."
Training vs inference: "Training carries the full backpack — activations, gradients, optimizer state — while inference travels light."
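
To make the DPO one-liner above concrete, here is a minimal sketch of the per-example DPO loss on a single chosen/rejected pair. It assumes each input is a sequence log-probability (the sum of token log-probs for the whole answer) under either the policy being trained or a frozen reference model; the function names and the `beta` value are illustrative assumptions, not a specific library's API.

```python
import math

def _softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin difference)."""
    # How much more (or less) likely each answer is under the policy than the reference.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) equals softplus(-z).
    return _softplus(-logits)

# Policy identical to the reference: the loss is log(2), i.e. no preference learned yet.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy raises the chosen answer and lowers the rejected one: the loss drops.
print(dpo_loss(-9.0, -13.0, -10.0, -12.0))
```

No separate reward model appears anywhere in the computation: the comparison between the two answers, measured relative to the reference model, is the training signal.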

Treating raw token count as enough while ignoring deduplication, filtering, contamination, and mixture quality.
Assuming next-token pretraining alone will produce reliable instruction-following behavior.
Fine-tuning so narrowly that catastrophic forgetting wipes out broad pretrained capability.
Forgetting chat templates or role markers and concluding the tuned model itself regressed.
Reward hacking during RLHF: the reward rises while human preference and KL stability get worse.
Underestimating training memory by budgeting only for parameters, when optimizer states and saved activations often dominate parameter storage.
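
The memory pitfall above is easy to check with back-of-the-envelope arithmetic. The sketch below assumes one common mixed-precision Adam accounting (bf16 weights and gradients, plus fp32 master weights and two fp32 Adam moments) and a hypothetical 7B-parameter model; real setups vary, and saved activations come on top of these numbers.

```python
# Bytes per parameter under one assumed mixed-precision Adam setup.
INFERENCE_BYTES = {"bf16 weights": 2}
TRAINING_BYTES = {
    "bf16 weights": 2,
    "bf16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam first moment": 4,
    "fp32 Adam second moment": 4,
}

def memory_gb(n_params, bytes_per_param):
    """Memory in GB (10**9 bytes): parameter count times total bytes per parameter."""
    return n_params * sum(bytes_per_param.values()) / 1e9

n_params = 7_000_000_000  # hypothetical model size, for illustration only
inference_gb = memory_gb(n_params, INFERENCE_BYTES)
training_gb = memory_gb(n_params, TRAINING_BYTES)
print(f"inference weights: {inference_gb:.0f} GB")
print(f"training state (before activations): {training_gb:.0f} GB")
```

Under these assumptions the training state is 8 times the inference weights before a single activation is stored, which is why budgeting from parameter count alone goes wrong.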

Builds on: Modules 03-04 transformer computation, Module 01 optimization basics, and Module 00 evaluation/calibration habits.
Feeds into: quantization, LoRA/PEFT, deployment tradeoffs, and adaptation strategy choices in Module 06.

"Why doesn't pretraining alone give you a useful chat assistant?"
"What is the objective difference between pretraining, SFT, RLHF, and DPO?"
"Data, tensor, and pipeline parallelism — what exactly is split in each?"
"Why is training memory so much larger than inference memory for the same model?"
"How do you tell reward improvement from genuine assistant improvement in RLHF?"

For the fuller Q&A bank, see 02_explainer.md §6.3.
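
For the RLHF question about separating reward gains from genuine improvement, it can help to see how a KL penalty enters the reward. The sketch below assumes a common formulation in which the reward model's score is reduced by `beta` times a per-sample KL estimate, computed as the summed difference between policy and reference token log-probs on the sampled answer. The function name, `beta`, and all numbers are illustrative assumptions.

```python
def kl_shaped_reward(rm_score, policy_token_logps, ref_token_logps, beta=0.1):
    """Return (shaped_reward, kl_estimate) for one sampled answer."""
    # Per-sample KL estimate: how far the policy has drifted from the reference
    # on the tokens it actually generated.
    kl_estimate = sum(p - r for p, r in zip(policy_token_logps, ref_token_logps))
    return rm_score - beta * kl_estimate, kl_estimate

# Same reward-model score, different drift from the reference model.
near, kl_near = kl_shaped_reward(2.0, [-1.0, -1.1], [-1.1, -1.2])
far, kl_far = kl_shaped_reward(2.0, [-0.1, -0.2], [-1.1, -1.2])
print(f"small drift: KL={kl_near:.2f}, shaped reward={near:.2f}")
print(f"large drift: KL={kl_far:.2f}, shaped reward={far:.2f}")
```

Tracking the raw reward-model score and the KL estimate side by side is one way to notice a run where the score climbs only because the policy is drifting away from the reference.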

Why does a trillion-token pretrained model still fail at email summarization? (§1.1)
Why does the curriculum matter as much as raw scale? (§2.1)
Why does next-token prediction create broad knowledge at all? (§2.2)
Why is training memory much larger than inference memory? (§2.3)
Data parallelism vs tensor parallelism vs pipeline parallelism — what gets split in each? (§2.4)
What stays the same objective-wise between pretraining and SFT? (§3.1)
Why does SFT data quality beat quantity so often? (§3.2)
Why can a chat-template mismatch make a tuned model feel broken? (§3.3)
What is catastrophic forgetting, and how do you detect it? (§3.4)
What does RLHF add beyond SFT? (§4.1-4.2)
Why is KL divergence necessary in RLHF? (§4.3)
PPO-style RLHF vs DPO — what machinery disappears? (§4.4)
Why is bf16 usually safer than fp16 for training? (§5.2)
What does gradient accumulation simulate? (§5.3)
Why is reward increase alone not enough to decide when to stop? (§5.5)
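
As a companion to the gradient accumulation question above, the following sketch checks numerically that accumulating scaled micro-batch gradients reproduces the full-batch gradient. It uses a hypothetical one-parameter linear model with mean-squared-error loss and assumes the batch divides evenly into micro-batches; all names and data are made up for illustration.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)**2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def accumulated_grad(w, xs, ys, micro_batch_size):
    """Sum micro-batch gradients, each scaled by 1 / number of micro-batches."""
    num_micro = len(xs) // micro_batch_size  # assumes even division
    total = 0.0
    for i in range(num_micro):
        lo = i * micro_batch_size
        hi = lo + micro_batch_size
        # Only one micro-batch needs to be processed at a time.
        total += mse_grad(w, xs[lo:hi], ys[lo:hi]) / num_micro
    return total

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2 * x + 1 for x in xs]
w = 0.5
print(mse_grad(w, xs, ys))             # full batch of 8
print(accumulated_grad(w, xs, ys, 2))  # 4 micro-batches of 2
```

The two printed values agree up to floating-point rounding: accumulation trades extra forward and backward passes for the memory footprint of a smaller batch.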