03. Week 5 — LLM Training Lifecycle

For deep understanding see 02_explainer.md — narrative with diagrams, lifecycle story, retrieval prompts, interview Q&A. This file is the quick-reference glossary: formulas, definitions, lookup tables, and memory math.

Section 1 — Lifecycle map

pretraining → SFT → RLHF or DPO → domain fine-tuning (optional)

  • Pretraining = broad next-token learning on massive corpora.
  • SFT = instruction-tuning on demonstration pairs.
  • RLHF = preference optimization using a reward model, usually with KL control.
  • DPO = direct preference optimization from chosen/rejected pairs.
  • Domain fine-tuning = specialization for a narrower task or industry.

See explainer §ELI5, §1.1.

Section 2 — Data curation in pretraining

Common steps:

  • exact dedup
  • near-dup dedup
  • quality filtering
  • language ID
  • PII / safety filtering
  • source weighting / mixing
  • tokenization and packing

| bad corpus issue | why it hurts |
|---|---|
| duplicates | memorization, inflated effective weight |
| spam / SEO junk | weakens knowledge quality |
| skewed source mix | over-specializes capability distribution |
| PII leakage | privacy and safety risk |

Curriculum = the effective mix of sources presented during training.
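
Two of the steps above, exact dedup and source mixing, can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the normalization rule (lowercase, collapse whitespace) and the function names `exact_dedup` and `mixing_probs` are assumptions made for this example.

```python
import hashlib

def normalize(text):
    # Assumed normalization: lowercase and collapse runs of whitespace,
    # so trivially different copies hash to the same digest.
    return " ".join(text.lower().split())

def exact_dedup(docs):
    """Keep the first occurrence of each normalized document."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

def mixing_probs(source_weights):
    """Turn raw per-source weights into sampling probabilities (the 'curriculum')."""
    total = sum(source_weights.values())
    return {name: weight / total for name, weight in source_weights.items()}

docs = ["Hello  world", "hello world", "Something else"]
print(exact_dedup(docs))                    # second copy is dropped
print(mixing_probs({"web": 3, "code": 1}))  # web sampled 3x as often as code
```

Near-dup dedup needs a similarity method (for example MinHash) rather than exact hashing; that is outside this sketch.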

See explainer §2.1.

Section 3 — GPT-style pretraining objective

Causal LM objective:

maximize  Σ_t log P(x_t | x_<t)

Equivalent loss (per token):

L = -log p(correct next token)

This is the cross-entropy over the vocabulary at each token position, typically averaged over all positions in the batch.
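
As a rough sketch of the loss above, the per-position cross-entropy can be computed directly from raw logits. Pure Python is used here for clarity; real training code would use a tensor library. The function names are illustrative only.

```python
import math

def next_token_loss(logits, target_index):
    """Cross-entropy at one position: -log softmax(logits)[target_index]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]

def sequence_loss(logits_per_position, targets):
    """Mean cross-entropy over all token positions."""
    losses = [next_token_loss(l, t) for l, t in zip(logits_per_position, targets)]
    return sum(losses) / len(losses)

# A model that is uniform over a 4-token vocabulary pays log(4) per token.
print(next_token_loss([0.0, 0.0, 0.0, 0.0], 2))
# Putting more logit mass on the correct token lowers the loss.
print(next_token_loss([0.0, 0.0, 5.0, 0.0], 2))
```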

Why it works:

  • compresses syntax
  • compresses factual associations
  • compresses style patterns
  • compresses code and reasoning traces

See explainer §2.2.

Section 4 — Memory math and model size

First-order rule:

memory ≈ parameter count × bytes per parameter

| precision | bytes per weight |
|---|---|
| fp32 | 4 |
| bf16 | 2 |
| fp16 | 2 |
| int8 | 1 |
| int4 | 0.5 |

Examples:

| model | precision | weight memory |
|---|---|---|
| 7B | bf16/fp16 | 14 GB |
| 13B | bf16/fp16 | 26 GB |
| 70B | bf16/fp16 | 140 GB |
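
The first-order rule is simple enough to check in code. The sketch below assumes 1 GB = 10^9 bytes, which is the convention the examples above follow; the helper name `weight_memory_gb` is made up for this example.

```python
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(num_params, precision):
    """Weight-only memory in GB (10^9 bytes); ignores activations and KV cache."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

for name, params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    print(name, weight_memory_gb(params, "bf16"), "GB in bf16,",
          weight_memory_gb(params, "int4"), "GB in int4")
```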

See explainer §2.3, §5.2.

Section 5 — Training vs inference

| stage | forward | backward | gradients | optimizer state | KV cache |
|---|---|---|---|---|---|
| inference | yes | no | no | no | yes |
| training | yes | yes | yes | yes | not the main bottleneck |

Training memory roughly includes:

weights + activations + gradients + optimizer states

Inference memory roughly includes:

weights + KV cache
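
A rough sketch of both budgets follows. The training estimate assumes a common mixed-precision Adam accounting (2-byte weights, 2-byte gradients, and 12 bytes of fp32 optimizer state per parameter: a master copy plus two moment estimates) and leaves activations out. The KV-cache formula assumes one key and one value tensor per layer. The model shape in the usage lines is hypothetical.

```python
def training_state_gb(num_params, weight_bytes=2, grad_bytes=2, optimizer_bytes=12):
    """Weights + gradients + optimizer state in GB; activations are NOT included."""
    return num_params * (weight_bytes + grad_bytes + optimizer_bytes) / 1e9

def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_value=2):
    """KV cache in GB. The factor 2 covers the key and the value tensors."""
    values = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size
    return values * bytes_per_value / 1e9

# 7B parameters under the assumptions above: 16 bytes per parameter.
print(training_state_gb(7e9))
# Hypothetical shape: 32 layers, 32 KV heads, head_dim 128, 4096-token context.
print(kv_cache_gb(32, 32, 128, 4096, batch_size=1))
```

Under these assumptions, the training state alone is about 8x the bf16 weight memory, which is why training needs far more memory than inference even before activations are counted.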

See explainer §2.3.

Section 6 — Parallelism at scale

| method | split what? | core idea |
|---|---|---|
| data parallel | batches | replicate full model; average gradients |
| tensor parallel | tensors inside layers | shard large matrix operations |
| pipeline parallel | groups of layers | assembly-line execution over microbatches |

Dense transformer FLOPs rule of thumb:

training FLOPs ≈ 6 × parameters × tokens
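
The rule of thumb can be turned into a back-of-envelope wall-clock estimate. Everything about the hardware below (GPU count, peak throughput, utilization) is a placeholder assumption to be replaced with real numbers; only the 6 × parameters × tokens rule comes from the text above.

```python
def training_flops(num_params, num_tokens):
    """Dense transformer estimate: about 6 FLOPs per parameter per token."""
    return 6 * num_params * num_tokens

def training_days(total_flops, num_gpus, peak_flops_per_gpu, utilization):
    """Wall-clock days, given an assumed sustained fraction of peak throughput."""
    seconds = total_flops / (num_gpus * peak_flops_per_gpu * utilization)
    return seconds / 86400

flops = training_flops(7e9, 1e12)  # 7B parameters, 1T tokens
print(f"{flops:.2e} FLOPs")
# Hypothetical cluster: 256 accelerators, 3e14 peak FLOP/s each, 40% utilization.
print(f"{training_days(flops, 256, 3e14, 0.4):.1f} days")
```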

See explainer §2.4.

Section 7 — SFT basics

The SFT objective is still next-token prediction. What changes is the data distribution: the model now trains on instruction-response examples instead of raw corpus text.

Typical SFT data row:

{
  "messages": [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Summarize this email."},
    {"role": "assistant", "content": "- point 1\n- point 2"}
  ]
}

What SFT teaches well:

  • role following
  • output format
  • tone
  • common refusals
  • concise completion behavior

See explainer §3.1.

Section 8 — SFT quality, templates, and forgetting

Quality > quantity

High-quality SFT data usually means:

  • correct content
  • consistent format
  • representative prompts
  • clear annotation guidelines
  • minimal duplication

Template parity

The serving template should match the training template exactly. Role markers are part of the learned distribution, so a mismatch at inference time puts the model on inputs it never saw during SFT.
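
One way to get template parity is to route training and serving through the same rendering function. The sketch below uses made-up role markers (`<|system|>`, `<|user|>`, `<|assistant|>`, `<|end|>`); they are hypothetical and do not correspond to any particular model's template.

```python
def render_chat(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts with hypothetical role markers."""
    text = "".join(f"<|{m['role']}|>\n{m['content']}<|end|>\n" for m in messages)
    if add_generation_prompt:
        # At serving time, open the assistant turn so the model continues from here.
        text += "<|assistant|>\n"
    return text

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Summarize this email."},
    {"role": "assistant", "content": "- point 1\n- point 2"},
]

training_text = render_chat(messages)
serving_prompt = render_chat(messages[:-1], add_generation_prompt=True)

# Parity check: the serving prompt is exactly a prefix of the training text.
print(training_text.startswith(serving_prompt))
```

If the check fails after a template change, serving no longer matches what the model was trained on.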

Catastrophic forgetting

Symptoms:

  • narrow domain improves
  • general capability regresses
  • style becomes repetitive

Mitigations:

  • smaller LR
  • fewer epochs
  • mix general data
  • adapter tuning / freezing
  • broader eval set

See explainer §3.2-§3.4.

Section 9 — RLHF and DPO

Reward model

The reward model learns a scalar score r for each response from chosen vs. rejected pairs. It is often trained with a pairwise (Bradley-Terry style) preference loss:

P(chosen preferred) = σ(r_chosen - r_rejected)
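
A minimal sketch of this pairwise objective: the loss is the negative log of the preference probability above. The scores are plain floats here; in practice they would come from a reward model's scalar head.

```python
import math

def preference_prob(r_chosen, r_rejected):
    """P(chosen preferred) = sigmoid(r_chosen - r_rejected)."""
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))

def pairwise_loss(r_chosen, r_rejected):
    """-log sigmoid(r_chosen - r_rejected), written in a numerically stable form."""
    d = r_chosen - r_rejected
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))

print(preference_prob(1.0, 1.0))   # equal scores: no preference either way
print(pairwise_loss(2.0, 0.0))     # chosen ranked higher: small loss
print(pairwise_loss(0.0, 2.0))     # chosen ranked lower: large loss
```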

PPO intuition

Optimize:

reward - β × KL(new policy || reference policy)

Why KL? To limit drift and reward hacking.
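
The trade-off can be seen on a toy example where each policy is a small categorical distribution over three candidate responses. This is only an illustration of the objective's shape, not PPO itself; the distributions, rewards, and β value are invented.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two categorical distributions over the same outcomes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_regularized_objective(rewards, p_new, p_ref, beta):
    """Expected reward under the new policy minus beta * KL(new || reference)."""
    expected_reward = sum(pi * r for pi, r in zip(p_new, rewards))
    return expected_reward - beta * kl_divergence(p_new, p_ref)

rewards = [1.0, 0.2, 0.0]      # invented reward-model scores
p_ref = [0.4, 0.3, 0.3]        # reference policy
p_mild = [0.6, 0.25, 0.15]     # moderate shift toward the high-reward response
p_greedy = [0.999, 0.0005, 0.0005]  # near-collapse onto one response

for name, p in [("mild", p_mild), ("greedy", p_greedy)]:
    print(name, kl_regularized_objective(rewards, p, p_ref, beta=1.0))
```

With a large enough β, the collapsed policy scores worse than the mild one even though its raw expected reward is higher, which is the drift-limiting effect described above.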

DPO intuition

Increase the model's relative preference for chosen over rejected answers, compared with a reference policy. No separate reward model required.
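
A sketch of the standard DPO loss for a single preference pair follows. The inputs are assumed to be summed token log-probabilities of each full response under the policy and under the frozen reference; the β default is an arbitrary example value.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * [(policy - ref) on chosen - (policy - ref) on rejected])."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # Stable form of -log(sigmoid(z)) = log(1 + exp(-z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Policy identical to the reference: margin is 0, loss is log(2).
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
# Policy has raised the chosen answer and lowered the rejected one: loss drops.
print(dpo_loss(-4.0, -9.0, -5.0, -7.0))
```

The implicit reward is the policy-vs-reference log-ratio, which is why no separate reward model appears in the code.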

| topic | RLHF/PPO | DPO |
|---|---|---|
| reward model | yes | no |
| online rollouts | common | usually simpler/offline |
| stability | trickier | often simpler |

See explainer §4.1-§4.4.

Section 10 — Practical training knobs

Learning-rate schedule

Default mental model:

warmup → peak LR → cosine decay

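Why warmup? Early steps are unstable: gradients are large and optimizer statistics are still noisy, so a full-size LR can cause divergence.

Why decay? Later steps need finer movement: smaller updates let the model settle rather than oscillate.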
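
A minimal sketch of the warmup → peak → cosine-decay schedule as a function of the step index. Linear warmup and the `min_lr` floor are assumptions of this example; the hyperparameter values in the usage lines are placeholders.

```python
import math

def lr_at_step(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)  # hold at min_lr past the planned end
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

for step in [0, 50, 99, 100, 550, 1000]:
    print(step, lr_at_step(step, peak_lr=3e-4, warmup_steps=100, total_steps=1000))
```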

Batch formula

global batch = microbatch × data-parallel workers × accumulation steps
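
The formula is often used in reverse: given a target global batch and a microbatch that fits in memory, solve for the accumulation steps. A small sketch with illustrative function names:

```python
def global_batch_size(microbatch, dp_workers, accumulation_steps):
    """Sequences contributing to one optimizer step."""
    return microbatch * dp_workers * accumulation_steps

def accumulation_steps_needed(target_global_batch, microbatch, dp_workers):
    """Accumulation steps required to reach the target global batch exactly."""
    per_pass = microbatch * dp_workers
    if target_global_batch % per_pass != 0:
        raise ValueError("target global batch must be a multiple of microbatch * workers")
    return target_global_batch // per_pass

print(global_batch_size(microbatch=4, dp_workers=8, accumulation_steps=16))
print(accumulation_steps_needed(target_global_batch=512, microbatch=4, dp_workers=8))
```

Gradient accumulation therefore simulates a larger batch on fixed hardware: gradients from several microbatches are summed before a single optimizer step.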

Checkpointing

Two meanings:

  • training checkpoint = save progress for resuming
  • gradient checkpointing = recompute activations to save memory

See explainer §5.1, §5.3, §5.4.

Section 11 — bf16 vs fp16

| format | bytes | main training implication |
|---|---|---|
| fp16 | 2 | smaller range; may need loss scaling |
| bf16 | 2 | wider range; usually more stable for training |

bf16 is often preferred for transformer training on modern accelerators.
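
The range difference follows from how each format splits its 16 bits: fp16 uses 5 exponent bits and 10 mantissa bits, while bf16 uses 8 exponent bits and 7 mantissa bits. The sketch below derives the largest finite value and the spacing just above 1.0 from those bit counts, assuming IEEE-style encoding with the top exponent code reserved for inf/NaN.

```python
def max_finite(exponent_bits, mantissa_bits):
    """Largest finite value of an IEEE-style binary float format."""
    bias = 2 ** (exponent_bits - 1) - 1
    max_exponent = (2 ** exponent_bits - 2) - bias  # top code is inf/NaN
    return (2 - 2.0 ** -mantissa_bits) * 2.0 ** max_exponent

def spacing_near_one(mantissa_bits):
    """Gap between 1.0 and the next representable value."""
    return 2.0 ** -mantissa_bits

for name, e, m in [("fp16", 5, 10), ("bf16", 8, 7), ("fp32", 8, 23)]:
    print(name, "max:", max_finite(e, m), "spacing near 1.0:", spacing_near_one(m))
```

fp16 tops out at 65504, so large activations or gradients can overflow, hence loss scaling. bf16 shares fp32's exponent range at the cost of coarser precision.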

See explainer §5.2.

Section 12 — Stop criteria

| stage | main signal | danger signal |
|---|---|---|
| pretraining | held-out loss and downstream evals | spending huge compute for tiny gains |
| SFT | task eval + general eval together | catastrophic forgetting |
| RLHF / DPO | human win rate, reward, KL, spot checks | reward hacking |

Do not stop or continue from train loss alone.
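
One simple way to act on a held-out signal is a patience rule: stop when the held-out metric has not improved over the last few evaluations. This is a generic early-stopping sketch, not a rule taken from this guide; `patience` and `min_delta` are illustrative knobs, and in practice it would sit alongside the other signals in the table.

```python
def should_stop(heldout_losses, patience=2, min_delta=0.0):
    """True if the last `patience` evals did not beat the earlier best by at least min_delta."""
    if len(heldout_losses) <= patience:
        return False  # not enough history yet
    best_before = min(heldout_losses[:-patience])
    recent_best = min(heldout_losses[-patience:])
    return recent_best > best_before - min_delta

print(should_stop([1.00, 0.90, 0.91, 0.92]))  # held-out loss has stalled
print(should_stop([1.00, 0.90, 0.80]))        # still improving
```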

See explainer §5.5.

Reading list

  1. 02_explainer.md — primary.
  2. GPT-3 methodology section.
  3. InstructGPT.
  4. DPO overview or paper.
  5. Practical RLHF blog from Hugging Face or Chip Huyen.

Reference material

Videos

  • Karpathy or similar long-form talk on LLM pretraining pipelines.
  • John Schulman or equivalent RLHF talk for PPO intuition.

Blogs

  • Hugging Face RLHF explainer.
  • A practical DPO walkthrough.

Self-check

For the full Q&A bank, see 02_explainer.md §6.3.

  1. Why is pretraining alone insufficient for assistant behavior? (§1.1)
  2. What is the curriculum in pretraining? (§2.1)
  3. Why does next-token prediction create broad knowledge? (§2.2)
  4. Why is training memory larger than inference memory? (§2.3)
  5. What gets split in data, tensor, and pipeline parallelism? (§2.4)
  6. What stays the same objective-wise from pretraining to SFT? (§3.1)
  7. Why can high-quality SFT data beat much larger noisy data? (§3.2)
  8. Why are chat templates part of model behavior? (§3.3)
  9. What does KL protect against in RLHF? (§4.3)
  10. What machinery disappears in DPO? (§4.4)
  11. Why is bf16 usually preferred for training? (§5.2)
  12. What does gradient accumulation simulate? (§5.3)