03. Week 5 — LLM Training Lifecycle

For deeper understanding, see 02_explainer.md: narrative with diagrams, the lifecycle story, retrieval prompts, and interview Q&A. This file is the quick-reference glossary: formulas, definitions, lookup tables, and memory math.

Section 1 — Lifecycle map

pretraining
   ↓
SFT
   ↓
RLHF or DPO
   ↓
domain fine-tuning (optional)

Pretraining = broad next-token learning on massive corpora.
SFT = instruction-tuning on demonstration pairs.
RLHF = preference optimization using a reward model, usually with KL control.
DPO = direct preference optimization from chosen/rejected pairs.
Domain fine-tuning = specialization for a narrower task or industry.

See explainer §ELI5, §1.1.

Section 2 — Data curation in pretraining

Common steps:

exact dedup
near-dup dedup
quality filtering
language ID
PII / safety filtering
source weighting / mixing
tokenization and packing
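
The first two steps can be sketched as follows. This is a minimal illustration, not a production pipeline: it assumes case and whitespace normalization before hashing for exact dedup, and word-shingle Jaccard similarity for near-dup scoring. All function names are hypothetical, and large corpora typically use approximate methods such as MinHash with LSH rather than pairwise comparison.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def exact_dedup(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each normalized document."""
    seen: set[str] = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Set of overlapping n-word windows (assumed shingle size: 3 words)."""
    words = normalize(text).split()
    return {tuple(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def jaccard(a: str, b: str) -> float:
    """Near-dup score in [0, 1]; 1.0 means identical shingle sets."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)
```

A pair of documents would be treated as near-duplicates when `jaccard` exceeds a chosen threshold; the threshold is a tuning decision and is not specified here.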

bad corpus issue	why it hurts
duplicates	memorization, inflated effective weight
spam / SEO junk	dilutes the corpus with low-information text
skewed source mix	skews capabilities toward over-represented sources
PII leakage	privacy and safety risk

Curriculum = the effective mix of sources presented during training.

See explainer §2.1.

Section 3 — GPT-style pretraining objective

Causal LM objective:

maximize  Σ_t log P(x_t | x_<t)

Equivalent loss:

L = -(1/T) Σ_t log P(x_t | x_<t)

This is the cross-entropy over the vocabulary at each token position, averaged over the T positions in the sequence.
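
A minimal sketch of that loss in plain Python, assuming the model has already produced one list of logits per position. The names `softmax` and `next_token_loss` are illustrative; real training code would use a framework's fused cross-entropy.

```python
import math


def softmax(logits):
    """Convert logits to probabilities, subtracting the max for stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_loss(logits_per_position, target_ids):
    """Mean of -log p(correct next token) over all positions."""
    losses = []
    for logits, target in zip(logits_per_position, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)
```

With uniform logits over a vocabulary of size V, the loss is log(V), which is a handy sanity check for an untrained model.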

Why it works:

compresses syntax
compresses factual associations
compresses style patterns
compresses code and reasoning traces

See explainer §2.2.

Section 4 — Memory math and model size

First-order rule:

memory ≈ parameter count × bytes per parameter

precision	bytes per weight
fp32	4
bf16	2
fp16	2
int8	1
int4	0.5

Examples:

model	precision	weight memory
7B	bf16/fp16	14 GB
13B	bf16/fp16	26 GB
70B	bf16/fp16	140 GB
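
The table rows can be reproduced with a few lines of arithmetic. This sketch assumes 1 GB = 10^9 bytes, which is the convention the table above uses; the function name is illustrative.

```python
# Bytes per weight, copied from the precision table above.
BYTES_PER_WEIGHT = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gb(n_params: float, precision: str) -> float:
    """Weight-only memory in GB (10^9 bytes); excludes activations and caches."""
    return n_params * BYTES_PER_WEIGHT[precision] / 1e9


for size in (7e9, 13e9, 70e9):
    print(f"{size / 1e9:.0f}B bf16: {weight_memory_gb(size, 'bf16'):.0f} GB")
```

The same function shows why quantization matters for serving: a 70B model drops from 140 GB in bf16 to 35 GB in int4, weights only.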

See explainer §2.3, §5.2.

Section 5 — Training vs inference

stage	forward	backward	gradients	optimizer state	KV cache
inference	yes	no	no	no	yes
training	yes	yes	yes	yes	not used (activations dominate instead)

Training memory roughly includes:

weights + activations + gradients + optimizer states

Inference memory roughly includes:

weights + KV cache
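
A rough estimate of both sums, under stated assumptions. For training, the sketch assumes mixed-precision Adam: 2 bytes for bf16 weights, 2 for bf16 gradients, and 12 for fp32 optimizer state (master weights plus two Adam moments), so about 16 bytes per parameter before activations. For inference, it assumes a standard attention KV cache storing one key and one value vector per layer, per KV head, per token. The model shape used below is hypothetical.

```python
def training_state_gb(n_params, weight_bytes=2, grad_bytes=2, optimizer_bytes=12):
    """Weights + gradients + optimizer state in GB; activations NOT included."""
    return n_params * (weight_bytes + grad_bytes + optimizer_bytes) / 1e9


def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, batch_size,
                bytes_per_value=2):
    """KV cache in GB: the factor 2 covers keys and values."""
    values = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size
    return values * bytes_per_value / 1e9


# Hypothetical 7B-class shape: 32 layers, 32 KV heads, head_dim 128.
print(training_state_gb(7e9))               # training state, no activations
print(kv_cache_gb(32, 32, 128, 4096, 1))    # one 4096-token sequence in bf16
```

Under these assumptions a 7B model needs roughly 112 GB of training state versus 14 GB of bf16 weights for inference, which is the gap the table above describes.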

See explainer §2.3.

Section 6 — Parallelism at scale

method	split what?	core idea
data parallel	batches	replicate full model; average gradients
tensor parallel	tensors inside layers	shard large matrix operations
pipeline parallel	groups of layers	assembly-line execution over microbatches

Dense transformer FLOPs rule of thumb:

training FLOPs ≈ 6 × parameters × tokens
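
The rule of thumb turns into a quick compute budget. In this sketch the per-GPU peak throughput and the utilization fraction are placeholder inputs chosen by the caller, not measured values for any specific hardware.

```python
def training_flops(n_params, n_tokens):
    """Dense-transformer estimate: about 6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens


def gpu_days(total_flops, peak_flops_per_gpu, utilization):
    """Single-GPU days at the given sustained fraction of peak throughput."""
    seconds = total_flops / (peak_flops_per_gpu * utilization)
    return seconds / 86_400


flops = training_flops(7e9, 1e12)   # 7B parameters, 1T tokens
print(f"{flops:.2e} FLOPs")
# Assumed inputs: 1e15 FLOP/s peak per GPU, 40% utilization.
print(f"{gpu_days(flops, 1e15, 0.4):.0f} GPU-days")
```

Dividing the GPU-days by the number of data-parallel replicas gives a first-order wall-clock estimate, ignoring communication overhead.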

See explainer §2.4.

Section 7 — SFT basics

The SFT objective is still next-token prediction; only the data distribution changes. The model now trains on instruction-response examples.

Typical SFT data row:

{
  "messages": [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Summarize this email."},
    {"role": "assistant", "content": "- point 1\n- point 2"}
  ]
}
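
A sketch of how such a row might be flattened into training text. The role markers `<|role|>` and `<|end|>` are a made-up template for illustration, not any specific model's format. The sketch also records which character spans belong to the assistant, since many SFT pipelines compute loss only on assistant tokens; real pipelines do this masking at the token level.

```python
def render_chat(messages):
    """Render messages with a hypothetical template.

    Returns (text, assistant_spans), where each span is a (start, end)
    character range that the loss would cover.
    """
    parts = []
    spans = []
    cursor = 0
    for msg in messages:
        header = f"<|{msg['role']}|>\n"
        body = msg["content"] + "<|end|>\n"
        if msg["role"] == "assistant":
            start = cursor + len(header)
            spans.append((start, start + len(body)))
        parts.append(header + body)
        cursor += len(header) + len(body)
    return "".join(parts), spans


row = {
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize this email."},
        {"role": "assistant", "content": "- point 1\n- point 2"},
    ]
}
text, spans = render_chat(row["messages"])
start, end = spans[0]
print(text[start:end])  # only this part contributes to the loss in this sketch
```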

What SFT teaches well:

role following
output format
tone
common refusals
concise completion behavior

See explainer §3.1.

Section 8 — SFT quality, templates, and forgetting

Quality > quantity

High-quality SFT data usually means:

correct content
consistent format
representative prompts
clear annotation guidelines
minimal duplication

Template parity

The serving template should match the training template, because role markers are part of the learned distribution.

Catastrophic forgetting

Symptoms:

narrow domain improves
general capability regresses
style becomes repetitive

Mitigations:

smaller LR
fewer epochs
mix general data
adapter tuning / freezing
broader eval set

See explainer §3.2-§3.4.

Section 9 — RLHF and DPO

Reward model

The reward model learns a scalar score r for each response from chosen-vs-rejected pairs. It is often trained with a pairwise (Bradley-Terry) preference loss:

P(chosen preferred) = σ(r_chosen - r_rejected)
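
A minimal sketch of this preference model and the matching loss, -log σ(r_chosen - r_rejected), on scalar scores. The function names are illustrative; in practice the scores come from a reward model's output head and the loss is averaged over a batch.

```python
import math


def preference_probability(r_chosen, r_rejected):
    """P(chosen preferred) = sigmoid(r_chosen - r_rejected)."""
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))


def pairwise_loss(r_chosen, r_rejected):
    """-log sigmoid(r_chosen - r_rejected), written as a stable softplus."""
    x = -(r_chosen - r_rejected)
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

The loss shrinks as the chosen score pulls ahead of the rejected score, and equals log 2 when the two scores are tied.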

PPO intuition

Maximize:

reward - β × KL(new policy || reference policy)

Why KL? To limit drift from the reference policy and to reduce reward hacking.
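
A toy illustration of the trade-off, assuming a single prompt with three candidate responses so that policies are small discrete distributions. The reward values and policies are invented for the example; PPO itself estimates these quantities from sampled rollouts rather than computing them exactly.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same outcomes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def penalized_objective(rewards, new_policy, ref_policy, beta):
    """Expected reward under new_policy minus beta * KL(new || ref)."""
    expected_reward = sum(p * r for p, r in zip(new_policy, rewards))
    return expected_reward - beta * kl_divergence(new_policy, ref_policy)


rewards = [1.0, 0.0, 0.2]        # hypothetical reward-model scores
ref = [0.4, 0.3, 0.3]            # reference (SFT) policy
collapsed = [0.98, 0.01, 0.01]   # puts nearly all mass on the top-scoring response
moderate = [0.6, 0.2, 0.2]       # shifts toward it but stays near the reference

for beta in (0.0, 1.0):
    print(beta,
          penalized_objective(rewards, collapsed, ref, beta),
          penalized_objective(rewards, moderate, ref, beta))
```

With β = 0 the collapsed policy scores higher; with β = 1 the moderate policy does, which is the sense in which the KL term limits drift.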

DPO intuition

Increase the model's relative preference for chosen over rejected answers, compared with a reference policy. No separate reward model required.
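
A sketch of the per-pair DPO loss as given in the DPO paper, assuming the four inputs are summed log-probabilities of each full response under the policy and the frozen reference. The default β of 0.1 is an illustrative choice, not a recommendation from this document.

```python
import math


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_logratio = logp_chosen - ref_logp_chosen
    rejected_logratio = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # Stable form of -log(sigmoid(margin)), i.e. softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy equals the reference, the margin is zero and the loss is log 2. The loss falls as the policy raises the chosen response's log-probability relative to the rejected one, measured against the reference.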

topic	RLHF/PPO	DPO
reward model	yes	no
online rollouts	common	usually not needed (trains offline on fixed pairs)
stability	trickier to tune	often more stable

See explainer §4.1-§4.4.

Section 10 — Practical training knobs

Learning-rate schedule

Default mental model:

warmup → peak LR → cosine decay

Why warmup? Early steps are unstable, so a small LR keeps large initial gradients from derailing training.

Why decay? Later steps need smaller, finer updates.
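
The schedule can be sketched as a single function of the step index. This version assumes linear warmup and a cosine decay down to `min_lr`; the parameter names are illustrative and the exact warmup shape varies between codebases.

```python
import math


def lr_at(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


# Example: peak 3e-4, 100 warmup steps, 1000 total steps.
for s in (0, 99, 550, 1000):
    print(s, lr_at(s, 3e-4, 100, 1000))
```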

Batch formula

global batch = microbatch × data-parallel workers × accumulation steps
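
A small sketch of the formula, plus a check of why accumulation works: averaging the mean gradients of equal-sized microbatches gives the same result as one large-batch mean. Gradients are plain floats here for illustration, and the function names are hypothetical.

```python
def global_batch(microbatch, dp_workers, accumulation_steps):
    """Sequences contributing to one optimizer step."""
    return microbatch * dp_workers * accumulation_steps


def accumulated_gradient(per_example_grads, microbatch):
    """Average of per-microbatch mean gradients (equal-sized microbatches)."""
    chunks = [per_example_grads[i:i + microbatch]
              for i in range(0, len(per_example_grads), microbatch)]
    micro_means = [sum(c) / len(c) for c in chunks]
    return sum(micro_means) / len(micro_means)


print(global_batch(microbatch=4, dp_workers=8, accumulation_steps=16))

grads = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(accumulated_gradient(grads, microbatch=2), sum(grads) / len(grads))
```

Accumulation trades wall-clock time for memory: the effective batch grows without increasing the per-device microbatch.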

Checkpointing

Two meanings:

training checkpoint = save progress for resuming
gradient checkpointing = recompute activations to save memory

See explainer §5.1, §5.3, §5.4.

Section 11 — bf16 vs fp16

format	bytes	main training implication
fp16	2	smaller range; may need loss scaling
bf16	2	wider range; usually more stable for training

bf16 is often preferred for transformer training on modern accelerators.
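
The range difference can be demonstrated with the standard library alone. In this sketch fp16 is a true IEEE half-precision round trip via `struct`, while bf16 is emulated by truncating an fp32 value to its top 16 bits. That is an approximation: real hardware usually rounds rather than truncates. The loss-scale factor of 1024 is an arbitrary illustrative value.

```python
import struct


def to_fp16(x: float) -> float:
    """Round-trip through IEEE 754 half precision; overflow maps to inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")


def to_bf16(x: float) -> float:
    """Emulate bf16 by keeping the top 16 bits of the fp32 encoding."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


big, tiny = 70000.0, 1e-8
print(to_fp16(big), to_bf16(big))     # fp16 overflows; bf16 keeps a coarse value
print(to_fp16(tiny), to_bf16(tiny))   # fp16 underflows to zero; bf16 does not
print(to_fp16(tiny * 1024))           # scaling first keeps the value nonzero in fp16
```

The last line is the idea behind loss scaling: multiply the loss so small gradients survive fp16, then divide the gradients back before the optimizer step.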

See explainer §5.2.

Section 12 — Stop criteria

stage	main signal	danger signal
pretraining	held-out loss and downstream evals	spending huge compute for tiny gains
SFT	task eval + general eval together	catastrophic forgetting
RLHF / DPO	human win rate, reward, KL, spot checks	reward hacking

Do not decide whether to stop or continue from train loss alone.

See explainer §5.5.

Reading list

02_explainer.md — primary.
GPT-3 methodology section.
InstructGPT.
DPO overview or paper.
Practical RLHF blog from Hugging Face or Chip Huyen.

Reference material

Videos

Karpathy or similar long-form talk on LLM pretraining pipelines.
John Schulman or equivalent RLHF talk for PPO intuition.

Blogs

Hugging Face RLHF explainer.
A practical DPO walkthrough.

Self-check

For the full Q&A bank, see 02_explainer.md §6.3.

Why is pretraining alone insufficient for assistant behavior? (§1.1)
What is the curriculum in pretraining? (§2.1)
Why does next-token prediction create broad knowledge? (§2.2)
Why is training memory larger than inference memory? (§2.3)
What gets split in data, tensor, and pipeline parallelism? (§2.4)
What stays the same objective-wise from pretraining to SFT? (§3.1)
Why can high-quality SFT data beat much larger noisy data? (§3.2)
Why are chat templates part of model behavior? (§3.3)
What does KL protect against in RLHF? (§4.3)
What machinery disappears in DPO? (§4.4)
Why is bf16 usually preferred for training? (§5.2)
What does gradient accumulation simulate? (§5.3)