03. Week 5: LLM Training Lifecycle
For deep understanding, see 02_explainer.md (narrative with diagrams, lifecycle story, retrieval prompts, interview Q&A). This file is the quick-reference glossary: formulas, definitions, lookup tables, and memory math.
Section 1: Lifecycle map
- Pretraining = broad next-token learning on massive corpora.
- SFT = instruction-tuning on demonstration pairs.
- RLHF = preference optimization using a reward model, usually with KL control.
- DPO = direct preference optimization from chosen/rejected pairs.
- Domain fine-tuning = specialization for a narrower task or industry.
See explainer §ELI5, §1.1.
Section 2: Data curation in pretraining
Common steps:
- exact dedup
- near-dup dedup
- quality filtering
- language ID
- PII / safety filtering
- source weighting / mixing
- tokenization and packing
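As a minimal sketch of the first step in the list above (exact dedup), the snippet below hashes a normalized copy of each document and keeps only the first occurrence. The normalization rule (lowercasing and whitespace collapsing) and the function names are illustrative assumptions, not a description of any specific pipeline.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def exact_dedup(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each normalized document (hypothetical helper)."""
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


corpus = ["Hello  world", "hello world", "Something else"]
print(exact_dedup(corpus))  # ['Hello  world', 'Something else']
```

Near-dup dedup needs a fuzzier signature (for example shingling plus MinHash), which this sketch does not attempt.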
| bad corpus issue | why it hurts |
|---|---|
| duplicates | memorization, inflated effective weight |
| spam / SEO junk | weakens knowledge quality |
| skewed source mix | over-specializes capability distribution |
| PII leakage | privacy and safety risk |
Curriculum = the effective mix of sources presented during training.
See explainer §2.1.
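Section 3: GPT-style pretraining objective
Causal LM objective:
$$\max_{\theta} \sum_{t=1}^{T} \log p_{\theta}(x_t \mid x_{<t})$$
Equivalent loss:
$$\mathcal{L}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log p_{\theta}(x_t \mid x_{<t})$$
This is cross-entropy over the vocabulary at each token position.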
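The per-position cross-entropy can be sketched in plain Python. This is a toy illustration under stated assumptions: `logits_per_position[t]` is a hypothetical list of vocabulary logits produced after reading tokens `0..t`, and it is scored against token `t + 1`. Real training code does the same thing on batched tensors.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Convert logits to probabilities, subtracting the max for stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def causal_lm_loss(logits_per_position: list[list[float]], token_ids: list[int]) -> float:
    """Mean cross-entropy where the logits at position t predict token t + 1."""
    losses = []
    for t in range(len(token_ids) - 1):
        probs = softmax(logits_per_position[t])
        target = token_ids[t + 1]
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)


# With uniform logits over a 3-token vocabulary the loss equals ln(3).
uniform = [[0.0, 0.0, 0.0]] * 3
print(causal_lm_loss(uniform, [0, 1, 2]), math.log(3))
```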
Why it works:
- compresses syntax
- compresses factual associations
- compresses style patterns
- compresses code and reasoning traces
See explainer §2.2.
Section 4: Memory math and model size
First-order rule:
$$\text{weight memory (bytes)} \approx \text{number of parameters} \times \text{bytes per weight}$$
| precision | bytes per weight |
|---|---|
| fp32 | 4 |
| bf16 | 2 |
| fp16 | 2 |
| int8 | 1 |
| int4 | 0.5 |
Examples:
| model | precision | weight memory |
|---|---|---|
| 7B | bf16/fp16 | 14 GB |
| 13B | bf16/fp16 | 26 GB |
| 70B | bf16/fp16 | 140 GB |
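The table rows can be reproduced with a few lines of arithmetic. The sketch below assumes decimal gigabytes (1 GB = 1e9 bytes), which is what the figures above imply, and counts weights only.

```python
# Bytes per weight, taken from the precision table above.
BYTES_PER_WEIGHT = {"fp32": 4.0, "bf16": 2.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Weight-only memory in decimal GB; excludes activations, KV cache, etc."""
    return num_params * BYTES_PER_WEIGHT[precision] / 1e9


for params in (7e9, 13e9, 70e9):
    print(f"{params / 1e9:.0f}B bf16: {weight_memory_gb(params, 'bf16'):.0f} GB")
```

The same function gives a quick estimate for quantized serving, for example `weight_memory_gb(70e9, "int4")` for a 4-bit 70B model.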
See explainer §2.3, §5.2.
Section 5: Training vs inference
| stage | forward | backward | gradients | optimizer state | KV cache |
|---|---|---|---|---|---|
| inference | yes | no | no | no | yes |
| training | yes | yes | yes | yes | not the main bottleneck |
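Training memory roughly includes:
$$M_{\text{train}} \approx \text{weights} + \text{gradients} + \text{optimizer state} + \text{activations}$$
Inference memory roughly includes:
$$M_{\text{infer}} \approx \text{weights} + \text{KV cache} + \text{transient activations}$$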
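A rough sketch of both budgets follows. The training estimate assumes one common setup that this document does not specify: mixed-precision Adam with 16-bit weights and gradients plus fp32 master weights and two fp32 moment buffers, about 16 bytes per parameter before activations. The KV cache estimate assumes standard attention that stores one key and one value vector per layer, per KV head, per token. The model shape in the example is hypothetical.

```python
def training_state_gb(num_params: float) -> float:
    """Weights + gradients + Adam state in decimal GB, excluding activations.

    Assumption: 2 (16-bit weights) + 2 (16-bit grads) + 4 (fp32 master weights)
    + 4 (Adam first moment) + 4 (Adam second moment) = 16 bytes per parameter.
    """
    bytes_per_param = 2 + 2 + 4 + 4 + 4
    return num_params * bytes_per_param / 1e9


def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                seq_len: int, batch_size: int, bytes_per_value: int = 2) -> float:
    """KV cache size in decimal GB; the leading 2 covers keys and values."""
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return values * bytes_per_value / 1e9


print(training_state_gb(7e9))               # 112.0 GB before activations
print(kv_cache_gb(32, 32, 128, 4096, 1))    # hypothetical 32-layer model shape
```

Under these assumptions a 7B model needs roughly eight times its 16-bit weight memory just for training state, which is why training memory dominates inference memory.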
See explainer §2.3.
Section 6: Parallelism at scale
| method | split what? | core idea |
|---|---|---|
| data parallel | batches | replicate full model; average gradients |
| tensor parallel | tensors inside layers | shard large matrix operations |
| pipeline parallel | groups of layers | assembly-line execution over microbatches |
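Dense transformer FLOPs rule of thumb:
$$\text{training FLOPs} \approx 6 \times N \times D$$
where $N$ is the parameter count and $D$ is the number of training tokens.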
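A minimal sketch of the common 6·N·D approximation (N parameters, D training tokens), extended to a wall-clock estimate. The cluster size, per-GPU peak throughput, and utilization in the example are hypothetical values chosen only to show the arithmetic.

```python
def training_flops(num_params: float, num_tokens: float) -> float:
    """Approximate total training FLOPs for a dense transformer: 6 * N * D."""
    return 6.0 * num_params * num_tokens


def training_days(total_flops: float, num_gpus: int,
                  peak_flops_per_gpu: float, utilization: float) -> float:
    """Wall-clock days at a given sustained fraction of peak throughput."""
    sustained = num_gpus * peak_flops_per_gpu * utilization
    return total_flops / sustained / 86_400


flops = training_flops(7e9, 1e12)  # 7B parameters, 1T tokens
print(f"{flops:.2e} FLOPs")
# Hypothetical cluster: 1024 GPUs, 3e14 peak FLOP/s each, 40% utilization.
print(f"{training_days(flops, 1024, 3e14, 0.4):.1f} days")
```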
See explainer §2.4.
Section 7: SFT basics
The SFT objective is still next-token prediction. What changes is the data distribution: the model now sees instruction-response examples.
Typical SFT data row:

    {
      "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize this email."},
        {"role": "assistant", "content": "- point 1\n- point 2"}
      ]
    }

What SFT teaches well:
- role following
- output format
- tone
- common refusals
- concise completion behavior
See explainer §3.1.
Section 8: SFT quality, templates, and forgetting
Quality > quantity
High-quality SFT data usually means:
- correct content
- consistent format
- representative prompts
- clear annotation guidelines
- minimal duplication
Template parity
The serving template should match the training template, because role markers are part of the learned distribution.
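One way to keep the two in sync is to render training rows and serving prompts through the same function. The sketch below is illustrative only: the role markers (`<|system|>`, `<|user|>`, `<|assistant|>`, `<|end|>`) are hypothetical placeholders, not the template of any particular model.

```python
# Hypothetical role markers; a real model defines its own template.
ROLE_MARKERS = {"system": "<|system|>", "user": "<|user|>", "assistant": "<|assistant|>"}
END = "<|end|>"


def render_chat(messages: list[dict], add_generation_prompt: bool = False) -> str:
    """Render messages into one string; used for both training and serving."""
    parts = [f"{ROLE_MARKERS[m['role']]}\n{m['content']}{END}\n" for m in messages]
    if add_generation_prompt:
        # At serving time, open the assistant turn for the model to complete.
        parts.append(f"{ROLE_MARKERS['assistant']}\n")
    return "".join(parts)


row = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Summarize this email."},
    {"role": "assistant", "content": "- point 1\n- point 2"},
]
train_text = render_chat(row)
serve_prompt = render_chat(row[:-1], add_generation_prompt=True)
# Parity check: the serving prompt is an exact prefix of the training text.
print(train_text.startswith(serve_prompt))  # True
```

The prefix check at the end is a cheap regression test: if it fails, the model is being prompted with a format it never saw during SFT.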
Catastrophic forgetting
Symptoms:
- narrow domain improves
- general capability regresses
- style becomes repetitive
Mitigations:
- smaller LR
- fewer epochs
- mix general data
- adapter tuning / freezing
- broader eval set
See explainer §3.2-§3.4.
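Section 9: RLHF and DPO
Reward model
The reward model learns from chosen vs. rejected response pairs. It is often trained with a pairwise preference loss:
$$\mathcal{L}_{\text{RM}} = -\log \sigma\big(r_{\phi}(x, y_c) - r_{\phi}(x, y_r)\big)$$
where $y_c$ is the chosen response, $y_r$ is the rejected response, and $\sigma$ is the logistic sigmoid.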
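A common choice for the pairwise preference loss is the Bradley-Terry form, -log σ(r_chosen - r_rejected). The sketch below computes it for a single pair of scalar rewards; the function name is illustrative, and real training averages this over a batch of reward-model outputs.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(r_chosen - r_rejected)), computed stably."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


print(pairwise_preference_loss(2.0, 0.0))  # small loss: ranking is correct
print(pairwise_preference_loss(0.0, 2.0))  # larger loss: ranking is inverted
```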
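PPO intuition
Optimize:
$$\max_{\pi} \; \mathbb{E}_{x,\, y \sim \pi}\big[r_{\phi}(x, y)\big] - \beta \, \mathrm{KL}\big(\pi \,\|\, \pi_{\text{ref}}\big)$$
Why KL? It limits drift from the reference policy and curbs reward hacking.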
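The KL term is often applied as a penalty on the reward itself. The sketch below shows that shaping for one sampled response, assuming per-token log-probabilities under the policy and the reference model are already available. The function name and the value of `beta` are illustrative.

```python
def kl_penalized_reward(reward: float, policy_logprobs: list[float],
                        ref_logprobs: list[float], beta: float = 0.1) -> float:
    """Reward minus beta times a single-sample KL(policy || reference) estimate.

    Summing per-token log-prob differences on a response sampled from the
    policy gives an estimate of the sequence-level KL for that sample.
    """
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward - beta * kl_estimate


# The policy assigns higher probability than the reference to its own sample,
# so part of the raw reward is taxed away.
print(kl_penalized_reward(1.0, [-1.0, -0.5], [-1.5, -1.0]))
```

A larger `beta` keeps the policy closer to the reference at the cost of slower reward improvement.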
DPO intuition
Increase the model's relative preference for chosen over rejected answers, compared with a reference policy. No separate reward model required.
| topic | RLHF/PPO | DPO |
|---|---|---|
| reward model | yes | no |
| online rollouts | common | usually not needed (offline preference pairs) |
| stability | trickier | often simpler |
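The DPO intuition can be made concrete with the per-pair loss from the DPO paper: -log σ(β · [(log π(y_c) - log π_ref(y_c)) - (log π(y_r) - log π_ref(y_r))]). The sketch below assumes the sequence log-probabilities of the chosen and rejected answers are already computed under both models. Names and the `beta` default are illustrative.

```python
import math


def neg_log_sigmoid(x: float) -> float:
    """Numerically stable -log(sigmoid(x))."""
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    return neg_log_sigmoid(beta * (chosen_logratio - rejected_logratio))


# Policy identical to the reference: loss is ln(2), no preference learned yet.
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
# Policy raises the chosen answer and lowers the rejected one: loss drops.
print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

No reward model appears anywhere in the computation, which is the machinery DPO removes relative to PPO-based RLHF.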
See explainer §4.1-§4.4.
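Section 10: Practical training knobs
Learning-rate schedule
Default mental model: linear warmup to a peak learning rate, then gradual decay (often cosine or linear) toward a small final value.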
Why warmup? Early steps are unstable.
Why decay? Later steps need finer movement.
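One widely used shape is linear warmup followed by cosine decay. The function below is a sketch of that shape under assumed hyperparameter names (`peak_lr`, `warmup_steps`, `total_steps`, `min_lr`). It is one reasonable default, not the only schedule in use.

```python
import math


def lr_at_step(step: int, peak_lr: float, warmup_steps: int,
               total_steps: int, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Ramp from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


for s in (0, 50, 99, 100, 550, 1000):
    print(s, lr_at_step(s, peak_lr=1e-3, warmup_steps=100, total_steps=1000))
```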
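Batch formula
$$\text{effective batch size} = \text{micro-batch per device} \times \text{gradient accumulation steps} \times \text{data-parallel replicas}$$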
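The usual bookkeeping multiplies the per-device micro-batch by the number of gradient accumulation steps and the number of data-parallel replicas. The sketch below uses illustrative parameter names and hypothetical values.

```python
def effective_batch_size(micro_batch_per_device: int, grad_accum_steps: int,
                         data_parallel_size: int) -> int:
    """Sequences contributing to one optimizer step."""
    return micro_batch_per_device * grad_accum_steps * data_parallel_size


def tokens_per_step(effective_batch: int, seq_len: int) -> int:
    """Tokens consumed per optimizer step for fixed-length packed sequences."""
    return effective_batch * seq_len


batch = effective_batch_size(micro_batch_per_device=4, grad_accum_steps=8,
                             data_parallel_size=16)
print(batch, tokens_per_step(batch, seq_len=2048))  # 512 1048576
```

Gradient accumulation lets a memory-limited device simulate a larger batch: gradients from several micro-batches are summed before a single optimizer update.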
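Checkpointing
Two meanings:
- training checkpoint = save model, optimizer, and schedule state so a run can resume
- gradient checkpointing = discard most activations in the forward pass and recompute them during backward, trading extra compute for lower memory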
See explainer §5.1, §5.3, §5.4.
Section 11: bf16 vs fp16
| format | bytes | main training implication |
|---|---|---|
| fp16 | 2 | smaller range; may need loss scaling |
| bf16 | 2 | wider range; usually more stable for training |
bf16 is often preferred for transformer training on modern accelerators.
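The range difference follows directly from the bit layouts: fp16 has 5 exponent bits and 10 mantissa bits, while bf16 has 8 exponent bits and 7 mantissa bits. The sketch below derives the largest finite value and the spacing just above 1.0 from those bit counts using the standard IEEE-style formulas. The helper names are illustrative.

```python
def max_finite(exponent_bits: int, mantissa_bits: int) -> float:
    """Largest finite value of an IEEE-style binary float format."""
    bias = 2 ** (exponent_bits - 1) - 1
    # The all-ones exponent field is reserved for inf/NaN.
    max_exponent = (2 ** exponent_bits - 2) - bias
    return (2.0 - 2.0 ** -mantissa_bits) * 2.0 ** max_exponent


def spacing_at_one(mantissa_bits: int) -> float:
    """Gap between 1.0 and the next representable value."""
    return 2.0 ** -mantissa_bits


print("fp16 max:", max_finite(5, 10), "step at 1.0:", spacing_at_one(10))
print("bf16 max:", max_finite(8, 7), "step at 1.0:", spacing_at_one(7))
```

fp16 tops out at 65504, so large gradients or activations can overflow unless the loss is scaled. bf16 reaches roughly the same range as fp32 at the cost of coarser precision.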
See explainer §5.2.
Section 12: Stop criteria
| stage | main signal | danger signal |
|---|---|---|
| pretraining | held-out loss and downstream evals | spending huge compute for tiny gains |
| SFT | task eval + general eval together | catastrophic forgetting |
| RLHF / DPO | human win rate, reward, KL, spot checks | reward hacking |
Do not stop or continue from train loss alone.
See explainer §5.5.
Reading list
- 02_explainer.md: primary.
- GPT-3 methodology section.
- InstructGPT.
- DPO overview or paper.
- Practical RLHF blog from Hugging Face or Chip Huyen.
Reference material
Videos
- Karpathy or similar long-form talk on LLM pretraining pipelines.
- John Schulman or equivalent RLHF talk for PPO intuition.
Blogs
- Hugging Face RLHF explainer.
- A practical DPO walkthrough.
Self-check
For the full Q&A bank, see 02_explainer.md §6.3.
- Why is pretraining alone insufficient for assistant behavior? (§1.1)
- What is the curriculum in pretraining? (§2.1)
- Why does next-token prediction create broad knowledge? (§2.2)
- Why is training memory larger than inference memory? (§2.3)
- What gets split in data, tensor, and pipeline parallelism? (§2.4)
- What stays the same objective-wise from pretraining to SFT? (§3.1)
- Why can high-quality SFT data beat much larger noisy data? (§3.2)
- Why are chat templates part of model behavior? (§3.3)
- What does KL protect against in RLHF? (§4.3)
- What machinery disappears in DPO? (§4.4)
- Why is bf16 usually preferred for training? (§5.2)
- What does gradient accumulation simulate? (§5.3)