02. LLM Training Lifecycle — Narrative Explainer¶

Companion to 03_study_material.md. The study material gives the formulas. This gives the why — and the picture in your head.

Table of contents¶

ELI5 — the whole thing in kid words
Chapter 1: It knows words, but not the job
1.1 The opening failure — trillion-token genius, useless assistant
1.2 Why this matters to you
Chapter 2: Pretraining at scale
2.1 The curriculum — what goes into the company wiki
2.2 The wiki reader — next-token prediction
2.3 Why models are huge — weights, bytes, gradients, training vs inference
2.4 The factory floor — compute and parallelism
Chapter 3: SFT — teaching usable behavior
3.1 The shadowing — what instruction tuning actually changes
3.2 Why data quality beats raw quantity
3.3 Chat templates — role formatting is part of the behavior
3.4 Catastrophic forgetting and domain specialization
Chapter 4: RLHF & DPO — performance reviews after shadowing
4.1 The promotion criteria — reward modeling
4.2 PPO simplified — improving without jumping off a cliff
4.3 KL divergence is the seatbelt
4.4 DPO — the shorter path
Chapter 5: Practical training decisions
5.1 Warmup then cosine decay
5.2 bf16 vs fp16
5.3 Gradient accumulation
5.4 Checkpointing
5.5 When to stop
5.6 Foundation-gap audit for module 06
5.7 Honest admission
Chapter 6: Recap & application

ELI5 — the whole thing in kid words¶

Imagine a company hires a new employee. On day one, you do not put them on customer emails. You make them read the entire company wiki. Policies, past reports, product docs, support chats — everything.

That is pretraining. Call the model the wiki reader.

The employee now knows many facts. But still behaves strangely. They know what an invoice is. But if you say "Reply to this angry customer politely," they answer like a textbook.

So we make them sit beside a senior colleague. Watch real tasks: how to answer, how to structure replies, where to refuse.

That is SFT — call it the shadowing.

Even after shadowing, two replies may both look acceptable. One is warmer; one sounds robotic. Humans compare outputs and reward the preferred style. That is RLHF or DPO — call it the performance review.

Now one more thing. The employee should not read random documents equally. If half the wiki is spam, the employee becomes strange. The right document mix is the curriculum — it matters from day one.

And when we score good vs bad behavior during the performance review, we need a judge. That judge is the promotion criteria — formally, the reward model.

So the full story:

Read the company wiki (pretraining)
Shadow a senior colleague (SFT)
Get performance reviews (RLHF/DPO)
Specialize in a department (domain fine-tuning)

If the wiki is bad, knowledge becomes shaky. If the shadowing is sloppy, behavior becomes sloppy. If the performance review is crooked, the employee optimizes the wrong thing. If specialization is too narrow, the employee forgets the rest of the company.

Every chapter maps to something here.

Chapter 1: It knows words, but not the job¶

1.1 The opening failure — trillion-token genius, useless assistant¶

You pretrain a model on 1 trillion tokens — web pages, books, code, forums. You feel confident.

Then you prompt it:

Summarize this email in 3 bullet points:

Hi team, the vendor pushed the database migration to Friday.
Please pause the rollout until QA signs off.
Also let finance know the invoice will slip by one week.

And the model replies:

Email is a method of exchanging digital messages.
Electronic mail became widely used in business communication.
Database migration is the process of moving data between systems.

Grammatical. Fluent. Useless.

Why? Because pretraining taught the model to continue text, not to obey instructions. The model saw internet text shaped like articles, so it produced article-like continuation. That is rational from its training objective.

Three things are missing:

1. Task framing: the model does not know where the instruction ends and the source text begins.
2. Format demonstration: nobody showed it bullet summaries, refusal style, or response boundaries.
3. Preference signal: nobody said that concise task completion beats generic exposition here.

Broad knowledge is not the same as usable behavior. The fix is not more raw pretraining — it is the shadowing (SFT).

1.2 Why this matters to you¶

As a Lead AI Engineer, this is a capability diagnosis issue. A product manager says: "The base model is brilliant, but the assistant is unusable." You must diagnose the stage.

Is the failure:

- lack of knowledge?
- lack of instruction following?
- lack of preference alignment?
- wrong chat format?
- bad reward signal?

These are different failures with different fixes and different costs. If you misdiagnose: you might say "let's pretrain longer" when the real need was SFT. Or "let's do RLHF" when the real failure was garbage SFT data. Or "let's domain-finetune" when the real bug was a broken chat template.

Understanding the lifecycle lets you decide:

- whether to buy a stronger base model or tune a weaker one
- whether to collect more preference data or better instruction data
- whether to spend on GPUs or on annotation quality
- whether to use PPO, DPO, or stop at SFT

Chapter 2: Pretraining at scale¶

2.1 The curriculum — what goes into the company wiki¶

Before the wiki reader learns anything, we must decide what the wiki contains. Do not imagine pretraining as "throw the internet into GPUs." Raw web text contains duplicates, SEO spam, boilerplate, toxic content, PII, and outdated junk. A model trained on that is not wise — it is expensive and weird.

The data pipeline looks like this:

raw crawls
   │
   ├─→ language ID + boilerplate removal
   ├─→ exact dedup (hash-based)
   ├─→ near-dup dedup (MinHash / shingling)
   ├─→ quality filters (readability, spam, safety)
   ├─→ PII filtering
   └─→ source weighting → tokenize → pack sequences
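A minimal sketch of the two dedup stages, assuming documents are plain strings. The function names, the 3-word shingle size, and the 0.5 threshold are illustrative choices, not values from any real pipeline. It computes exact Jaccard similarity on shingles; MinHash is the trick large pipelines use to approximate that same similarity without comparing every pair.

```python
import hashlib


def exact_dedup(docs):
    """Drop documents that are identical after lowercasing and whitespace cleanup."""
    seen, kept = set(), []
    for doc in docs:
        normalized = " ".join(doc.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def shingles(text, k=3):
    """Set of overlapping k-word windows (k=3 is an illustrative choice)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def jaccard(a, b):
    """Shared shingles divided by all shingles seen in either document."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


def near_dedup(docs, threshold=0.5):
    """Keep a document only if it is not too similar to one already kept."""
    kept = []
    for doc in docs:
        if all(jaccard(doc, other) < threshold for other in kept):
            kept.append(doc)
    return kept


sample = [
    "The vendor moved the migration to Friday.",
    "the vendor  moved the migration to friday.",         # exact dup after cleanup
    "The vendor moved the migration to Friday evening.",  # near dup
    "Finance should expect the invoice next week.",
]
print(near_dedup(exact_dedup(sample)))
```

The pairwise loop is quadratic, which is fine for a demo and hopeless for a web crawl. That gap is the reason hashing tricks exist.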

After cleaning, you also control the mix:

source	cleaned size	sampling weight	effective share (size × weight, normalized)
web	350B	1.0	≈45%
code	180B	1.6	≈37%
books	120B	0.8	≈12%
support logs	30B	1.2	≈5%

More code improves coding ability. Too much low-quality dialogue can hurt general writing. This mix is the curriculum and it is one of the biggest frontier-lab levers.
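To see how sampling weights reshape the mix, here is a small sketch. It assumes "effective share" means cleaned size times sampling weight, normalized to sum to one. That definition is an assumption; a pipeline that defines the share differently will produce different numbers.

```python
def effective_shares(sources):
    """sources maps name -> (cleaned size in billions, sampling weight)."""
    weighted = {name: size * weight for name, (size, weight) in sources.items()}
    total = sum(weighted.values())
    return {name: value / total for name, value in weighted.items()}


mix = {
    "web": (350, 1.0),
    "code": (180, 1.6),
    "books": (120, 0.8),
    "support logs": (30, 1.2),
}

for name, share in effective_shares(mix).items():
    print(f"{name:>12}: {share:.1%}")
```

Raising one weight lowers every other share. The curriculum is a zero-sum budget.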

2.2 The wiki reader — next-token prediction¶

How does the model actually learn? By predicting the next token — that is the core pretraining objective.

The capital of France is [ ? ]

→ Paris: 0.70, Lyon: 0.10, London: 0.08, cheese: 0.02
loss = -log(0.70) ≈ 0.357
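The same arithmetic as a small sketch, using the natural log. The helper names are illustrative. A real trainer computes this from logits over the whole vocabulary and averages over every position in the batch.

```python
import math


def next_token_loss(probs, target):
    """Cross-entropy at one position: -log of the probability given to the true token."""
    return -math.log(probs[target])


def sequence_loss(true_token_probs):
    """Average loss, given the probability assigned to each true token in a sequence."""
    return sum(-math.log(p) for p in true_token_probs) / len(true_token_probs)


probs = {"Paris": 0.70, "Lyon": 0.10, "London": 0.08, "cheese": 0.02}

print(round(next_token_loss(probs, "Paris"), 3))   # confident and right: small loss
print(round(next_token_loss(probs, "cheese"), 3))  # if the truth had been "cheese": large loss
```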

Training loop in spirit:

1. feed tokens
2. predict next
3. compute loss
4. backprop
5. update weights
6. repeat for months

This sounds too trivial. But to predict well, the model must infer syntax, facts, code patterns, discourse structure, and reasoning traces. Every interesting sentence requires domain knowledge to complete well. That is how the wiki reader becomes broadly competent.

Loss curve:

loss
 ^
 |\
 | \
 |  \___
 |       \__
 +-----------> training steps

Steep early drop, expensive gains later. If the curve flattens and cost is enormous, it may be time to move to SFT — not keep pretraining.

2.3 Why models are huge — weights, bytes, gradients, training vs inference¶

Model weights are the trained numbers — billions of knobs in attention projections, feed-forward layers, and embeddings.

Memory at first order: parameter count × bytes per parameter

model	precision	bytes/weight	weight memory
7B	fp32	4	28 GB
7B	bf16/fp16	2	14 GB
70B	bf16/fp16	2	140 GB

That is just weights. Add gradients and optimizer states for training:

inference memory = weights + KV cache
training memory  = weights + activations + gradients + optimizer states

A 7B model trained with Adam, keeping weights and gradients in bf16 and both Adam moments in fp32:

weights: 14 GB
gradients: 14 GB
Adam first moment (fp32): 28 GB
Adam second moment (fp32): 28 GB
activations: extra, sequence-length dependent

Already ~84 GB before activations. That is why a 7B model fits on one 24 GB card for inference but not for full training.
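The breakdown above as a back-of-envelope sketch. It assumes 1 GB = 1e9 bytes, ignores activations and KV cache, and the function names are illustrative. Some mixed-precision setups also keep an fp32 master copy of the weights; that is not counted here.

```python
GB = 1e9  # decimal gigabytes, matching the tables in this section


def weight_memory_gb(params, bytes_per_weight):
    """First-order weight memory: parameter count times bytes per parameter."""
    return params * bytes_per_weight / GB


def training_memory_gb(params, weight_bytes=2, grad_bytes=2, adam_state_bytes=4):
    """Weights + gradients + two Adam moments, before activations."""
    weights = weight_memory_gb(params, weight_bytes)
    grads = weight_memory_gb(params, grad_bytes)
    adam = 2 * weight_memory_gb(params, adam_state_bytes)  # first and second moment
    return {
        "weights": weights,
        "gradients": grads,
        "adam_moments": adam,
        "total_before_activations": weights + grads + adam,
    }


for name, value in training_memory_gb(7e9).items():
    print(f"{name:>26}: {value:.0f} GB")
```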

Gradients are the learning signal:

tokens → forward pass → logits → loss
                                  │
                             backward pass
                                  │
                             gradients → weight update

No gradients, no learning. Training cares deeply about gradient quality. Inference does not compute gradients at all.

2.4 The factory floor — compute and parallelism¶

A rough rule: training FLOPs ≈ 6 × parameters × tokens

A 70B model on 1T tokens: 6 × 70e9 × 1e12 ≈ 4.2e23 FLOPs.

Absurd on one GPU, so we parallelize.
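A quick sketch of how absurd. The 6 × parameters × tokens rule is from the text; the sustained throughput of 4e14 FLOP/s per GPU is a hypothetical round number chosen for illustration, not a measured figure for any specific hardware.

```python
def training_flops(params, tokens):
    """Rough rule from the text: about 6 FLOPs per parameter per token."""
    return 6 * params * tokens


def training_days(flops, num_gpus, flops_per_gpu_per_s):
    """Wall-clock days, assuming perfect scaling (real runs lose some to communication)."""
    seconds = flops / (num_gpus * flops_per_gpu_per_s)
    return seconds / 86_400


ASSUMED_THROUGHPUT = 4e14  # hypothetical sustained FLOP/s per GPU

flops = training_flops(70e9, 1e12)
print(f"total: {flops:.1e} FLOPs")
print(f"1 GPU:     {training_days(flops, 1, ASSUMED_THROUGHPUT):,.0f} days")
print(f"1024 GPUs: {training_days(flops, 1024, ASSUMED_THROUGHPUT):,.1f} days")
```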

method	what is split?	good for	main cost
data parallel	data batches	model fits per GPU	gradient synchronization
tensor parallel	matrices inside one layer	very large layers	heavy communication per layer
pipeline parallel	groups of layers	deep stacks, many GPUs	pipeline bubbles

Data parallelism: each GPU holds the full model and a different batch, then gradients are all-reduced.
Tensor parallelism: the Wq/Wk/Wv matrices inside one layer are split into shards across GPUs.
Pipeline parallelism: GPU1 handles early layers, GPU2 middle, GPU3 late, assembly-line style.

Frontier runs combine all three, sometimes with optimizer sharding too. One knob is not enough.

Retrieval check — pause here. Without scrolling: why does dedup matter for training? What does next-token loss actually optimize? Why is training memory much larger than inference memory? If any answer is fuzzy, scroll back.

Chapter 3: SFT — teaching usable behavior¶

3.1 The shadowing — what instruction tuning actually changes¶

We create instruction-response pairs and fine-tune on them:

User: Summarize this email in 3 bullets.
Assistant: - Vendor moved migration to Friday...

User: Explain recursion to a 12-year-old.
Assistant: Imagine two mirrors facing each other...

Important subtlety: the loss is still next-token prediction. We did not invent a new objective — we changed the dataset. Now the next tokens are assistant-style tokens, task-following tokens, well-formatted tokens.

SFT is next-token prediction on a narrower, job-shaped distribution.

SFT teaches well: output formatting, turn-taking, role boundaries, concise task completion, refusal patterns. SFT does not fully solve: preference nuance between two valid answers, or annotator biases in demonstrations.

3.2 Why data quality beats raw quantity¶

Dataset A: 5M examples. Mixed quality, inconsistent format, some hallucinated answers, duplicated prompts.
Dataset B: 100K examples. Curated tasks, correct answers, consistent tone, template-clean.

Usually B teaches better behavior. Because behavior is copied, not abstracted away.

High-quality SFT examples have: clear user intent, correct factual content, consistent tone, no template corruption.

Synthetic data pitfalls: repetitive phrasing, over-politeness, false certainty, long-winded answers to simple tasks. The shadowing is only as good as the colleague being copied.

Practical heuristic: doubling messy SFT data often loses to spending the same effort on cleaner annotation guidelines. Pretraining rewards scale. SFT rewards curation.

3.3 Chat templates — role formatting is part of the behavior¶

A chat model is trained with role markers, for example:

<|system|>
You are a concise assistant.
<|user|>
Summarize this email.
<|assistant|>
- Point 1
- Point 2

The template is not wrapper fluff. It is part of the learned behavior. If training used <|user|>...<|assistant|> but serving uses User:...Assistant:, the model misses its cues.

Common template bugs that break production:

- missing assistant-start token
- system prompt placed in user role
- mixing templates across datasets
- duplicated BOS token
- forgetting whitespace/newline conventions

Small text bug. Big behavior change. When a fine-tuned model suddenly feels dumb after deployment, check template parity first.
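A minimal sketch of a template renderer and a parity check, using the example role markers from this section. The function names and the two checks are illustrative. A real model defines its own template, and the safest path is to render prompts with the same code that produced the training data.

```python
ROLE_MARKERS = {
    "system": "<|system|>",
    "user": "<|user|>",
    "assistant": "<|assistant|>",
}


def render_chat(messages, add_generation_prompt=True):
    """Render messages as marker line, then content line, one pair per turn."""
    lines = []
    for message in messages:
        lines.append(ROLE_MARKERS[message["role"]])
        lines.append(message["content"])
    if add_generation_prompt:
        # The assistant-start marker is the cue that tells the model to answer.
        lines.append(ROLE_MARKERS["assistant"])
    return "\n".join(lines) + "\n"


def check_serving_prompt(prompt):
    """Return a list of template-parity problems found in a serving prompt."""
    problems = []
    if not prompt.rstrip().endswith(ROLE_MARKERS["assistant"]):
        problems.append("missing assistant-start marker")
    if ROLE_MARKERS["user"] not in prompt:
        problems.append("missing user marker")
    return problems


good = render_chat([
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Summarize this email."},
])
bad = "User: Summarize this email.\nAssistant:"

print(check_serving_prompt(good))
print(check_serving_prompt(bad))
```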

3.4 Catastrophic forgetting and domain specialization¶

If SFT is too aggressive, the model forgets broad behavior from pretraining.

before narrow SFT   → broad capability: ██████████  domain skill: ███
after careful SFT   → broad capability: ████████    domain skill: ███████
after reckless SFT  → broad capability: ███         domain skill: █████████

You want the middle case.

Causes: learning rate too high, too many epochs on small data, no retained mix of general instruction data.

Symptoms: domain benchmark improves, general benchmark drops, style becomes uniform.

Fixes: smaller LR, fewer epochs, mix general + domain data, use adapters, track general holdout eval.

Domain fine-tuning (department specialization) makes sense when your domain has specialized vocabulary, repeatable task structure, and enough representative examples. It does not make sense when domain knowledge changes daily and can be retrieved (use RAG), or when the dataset is tiny and noisy.

Safe sequence: pretraining → general SFT → preference alignment → domain tuning with small LR.

Chapter 4: RLHF & DPO — performance reviews after shadowing¶

4.1 The promotion criteria — reward modeling¶

After shadowing, the model gives acceptable answers — but acceptable ≠ preferred.

User: How do I apologize for shipping the wrong invoice?
A: "I regret to inform you that an invoice discrepancy occurred."
B: "Sorry about the incorrect invoice. We have corrected it and will send the updated one today."

Both grammatical. B is better — direct, helpful, actionable.

How do we teach that preference? Humans compare outputs and choose the better answer. We train a model to predict those choices. That predictor is the promotion criteria — formally the reward model.

Preference data format:

prompt	chosen response	rejected response
explain X	helpful answer	vague answer
refuse harm	firm safe refusal	unsafe compliance

The reward model learns: reward(chosen) > reward(rejected) via pairwise loss. A crooked promotion rubric creates crooked behavior.
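One common way to write that pairwise loss is the Bradley-Terry form, -log sigmoid(reward_chosen - reward_rejected). The text only says "pairwise loss", so this particular form is an assumption, and the function name is illustrative. The rewards here are plain floats standing in for reward-model outputs.

```python
import math


def pairwise_reward_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(margin)), computed in a numerically stable way."""
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


print(pairwise_reward_loss(2.0, 0.0))  # chosen scored higher: small loss
print(pairwise_reward_loss(0.0, 0.0))  # judge cannot tell them apart
print(pairwise_reward_loss(0.0, 2.0))  # judge prefers the rejected answer: large loss
```

Only the difference between the two scores matters. The absolute reward values carry no meaning on their own.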

4.2 PPO simplified — improving without jumping off a cliff¶

With a shadowed model (current policy) and the promotion criteria, we:

prompt → policy response → reward score → policy update

But naïve reward maximization is unstable. If one batch says a strange answer gets high reward, the policy may overreact.

PPO clips updates: "Improve, yes. But do not leap wildly."

Conceptual objective:

maximize  reward  −  β × KL(new policy ∥ reference policy)

Two forces:

- reward says: chase preferred behavior
- KL says: stay near the shadowed SFT model

Like a performance review that wants improvement, not personality replacement.
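Both ideas in a scalar sketch. The first function applies the KL penalty to one sampled response, estimating KL by the log-probability gap between policy and reference, which is a common single-sample estimate and an assumption here. The second is the standard PPO clipped surrogate. The values of beta and eps are illustrative, and real implementations work on per-token tensors.

```python
def kl_penalized_reward(reward, logp_policy, logp_ref, beta=0.1):
    """reward - beta * KL estimate, where logp_* are log-probs of the sampled response."""
    kl_estimate = logp_policy - logp_ref
    return reward - beta * kl_estimate


def clipped_objective(ratio, advantage, eps=0.2):
    """PPO clipped surrogate: take the more pessimistic of raw and clipped terms.

    ratio is new_policy_prob / old_policy_prob for the sampled action.
    """
    clipped_ratio = max(1 - eps, min(1 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# The policy likes this response more than the reference does, so it pays a penalty.
print(kl_penalized_reward(reward=1.0, logp_policy=-10.0, logp_ref=-12.0))

# A good action whose probability jumped 50%: credit is capped at 1 + eps.
print(clipped_objective(ratio=1.5, advantage=1.0))
```

The clip removes the incentive to push the ratio past the cap in one step. That is the "do not leap wildly" part.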

4.3 KL divergence is the seatbelt¶

Why a seatbelt? Because reward models are gameable. The policy optimizes a learned judge, not a human directly. Whenever you optimize a proxy hard, Goodhart's law appears.

Example failure without KL:

reward model over-rewards "safe language"
→ policy discovers: "I cannot help with that safely. I cannot help with that safely..."
→ reward model gives high score. Humans hate it.

KL says: "Do not drift too far from the reference SFT model while chasing reward."

KL weight	likely outcome
too small	reward hacking, style drift, bizarre answers
too large	almost no learning beyond SFT
reasonable	improved preference while preserving competence

4.4 DPO — the shorter path¶

PPO-based RLHF requires: policy model, reference model, reward model, rollout generation, PPO machinery.

DPO asks: can we use preference data more directly?

For each prompt we already know chosen and rejected. DPO directly increases the model's log-ratio for the chosen answer vs rejected, relative to a reference:

increase [ log π(chosen|x) − log π_ref(chosen|x) ]
       − [ log π(rejected|x) − log π_ref(rejected|x) ]

No explicit reward model needed — the reward is implicit in the preferences.
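The quantity above, turned into a loss for one preference pair. This sketch uses the usual DPO form, -log sigmoid(beta × margin), with beta = 0.1 as an illustrative value. Inputs are summed log-probabilities of each full response; the function name is hypothetical.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log(sigmoid(beta * margin)), where margin is the bracketed difference above."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # Numerically stable -log(sigmoid(z)).
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Policy identical to the reference: the margin is zero.
print(dpo_loss(-20.0, -25.0, -20.0, -25.0))

# Policy has moved probability toward the chosen answer: the loss drops.
print(dpo_loss(-15.0, -30.0, -20.0, -25.0))
```

Notice the reference terms. The loss rewards moving relative to the reference, which is where the implicit KL-style anchoring comes from.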

aspect	PPO-style RLHF	DPO
separate reward model	yes	no
rollouts needed	yes	no (usually trains offline on fixed pairs)
training stability	trickier	usually simpler
conceptual path	reward then optimize	optimize preferences directly

DPO does not remove the need for good preference data. Noisy chosen/rejected labels → noisy taste. The promotion criteria lives inside the data — just more implicitly.

Retrieval check. Without scrolling: what does the reward model learn from? Why is PPO not plain reward maximization? What failure does KL specifically guard against? If fuzzy, scroll back.

Chapter 5: Practical training decisions¶

5.1 Warmup then cosine decay¶

Training success depends heavily on learning-rate schedule.

Warmup — start small, ramp up
Main phase — near peak LR
Cosine decay — smoothly reduce for fine settling

learning rate
 ^
 |        /\
 |       /  \
 |      /    \___
 |_____/          \_____
 +---------------------> steps
    warmup     cosine decay

Early gradients are noisy; Adam moments unstable. Hitting peak LR at step one can diverge. Later, smaller LR helps the model settle into a better loss basin.

schedule mistake	symptom
no warmup	loss spikes early, NaNs, unstable attention
LR too high	catastrophic forgetting in SFT, divergence
LR too low	flat curve, almost no adaptation
no decay	late overfitting or plateau
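A minimal sketch of the schedule, assuming linear warmup followed by a single cosine decay to a floor. The function name, parameter names, and example values are illustrative. Frameworks ship their own schedulers, and the exact warmup shape varies between them.

```python
import math


def lr_at(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Learning rate for a 0-indexed step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Ramp from a small value up to peak_lr over the warmup window.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    # Cosine goes from 1 to -1 as progress goes from 0 to 1.
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


for step in (0, 50, 99, 100, 550, 1000):
    print(step, f"{lr_at(step, 1000, 100, 3e-4, min_lr=3e-5):.2e}")
```

The cosine is flat near its start, so the early part of the decay behaves like the "main phase near peak LR" without needing a separate stage.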

5.2 bf16 vs fp16¶

Both use 2 bytes per value. The bit allocation differs:

format	exponent bits	mantissa bits	key property
fp16	5	10	more precision, small range
bf16	8	7	fp32-like range, stable

During training, activations and gradients can be very small or very large. fp16's narrow range → underflow, overflow, NaNs, loss spikes. bf16's wider exponent range protects gradient quality — usually no loss scaling needed.
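The range gap follows directly from the bit counts in the table. This sketch derives the largest finite value and smallest normal value from the exponent and mantissa widths, assuming the usual IEEE-style layout where the top exponent code is reserved for inf and NaN. Function names are illustrative.

```python
def max_finite(exponent_bits, mantissa_bits):
    """Largest finite value for an IEEE-style format with the given bit widths."""
    bias = 2 ** (exponent_bits - 1) - 1
    max_exponent = bias  # the all-ones exponent code is reserved for inf/NaN
    return (2 - 2 ** -mantissa_bits) * 2.0 ** max_exponent


def min_normal(exponent_bits):
    """Smallest positive normal value; below this, precision starts to degrade."""
    bias = 2 ** (exponent_bits - 1) - 1
    return 2.0 ** (1 - bias)


for name, e_bits, m_bits in (("fp16", 5, 10), ("bf16", 8, 7)):
    print(
        f"{name}: max={max_finite(e_bits, m_bits):.3e}  "
        f"min_normal={min_normal(e_bits):.3e}  "
        f"relative step={2 ** -m_bits:.1e}"
    )
```

fp16 takes finer steps over a tiny range. bf16 takes coarser steps over roughly the same range as fp32. Training usually cares more about the range.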

The memory formula is the same for both: 13B × 2 bytes ≈ 26 GB.

Next module: the same weights compressed to int8 (1 byte) or int4 (0.5 byte) for deployment.

5.3 Gradient accumulation¶

When target effective batch is large but GPU memory is small:

microbatch 1 → grad g1 ┐
microbatch 2 → grad g2 ├─→ average → one optimizer step
microbatch 3 → grad g3 ┘

Formula: global batch = microbatch size × data-parallel GPUs × accumulation steps

Example: 4 × 8 × 16 = 512 sequences effective batch, while each GPU holds only 4 at a time.

Memory down, wall-clock time per step up. If you change effective batch significantly, revisit LR — they are coupled.
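A toy sketch of both claims: the batch formula, and the fact that averaging microbatch gradients reproduces the full-batch gradient. It uses a one-parameter model with mean squared error and assumes equal-sized microbatches. The names and data are made up for illustration.

```python
def global_batch(microbatch, data_parallel_gpus, accumulation_steps):
    """Sequences contributing to one optimizer step."""
    return microbatch * data_parallel_gpus * accumulation_steps


def grad_mse(w, batch):
    """Gradient of mean((w * x - y) ** 2) with respect to the scalar weight w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, batch, microbatch_size):
    """Average the gradients of equal-sized microbatches, one at a time."""
    chunks = [batch[i:i + microbatch_size] for i in range(0, len(batch), microbatch_size)]
    return sum(grad_mse(w, chunk) for chunk in chunks) / len(chunks)


data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2), (5.0, 9.8), (6.0, 12.1)]

print(global_batch(4, 8, 16))
print(grad_mse(0.5, data))
print(accumulated_grad(0.5, data, microbatch_size=2))
```

Same gradient, up to floating-point rounding, but only two examples are in memory at any moment. The price is three forward-backward passes instead of one.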

5.4 Checkpointing¶

Two distinct meanings:

A. Training checkpoints for resuming — save weights + optimizer state + scheduler state + step number. Without these, a week-long run that dies at step 180,000 restarts from zero.

B. Gradient checkpointing for memory — during forward pass, save fewer activations; recompute missing ones during backward.

normal:       forward saves many activations → backward reuses them
checkpointed: forward saves fewer           → backward recomputes missing ones

Compute-for-memory trade. Very common in large-model training, long contexts, and SFT/DPO setups.

5.5 When to stop¶

Pretraining: watch held-out next-token loss and downstream eval movement. If the curve flattens and cost is enormous, stop.

SFT: watch task-specific eval + general capability regression. If train loss falls but general eval drops, that is catastrophic forgetting — stop and diagnose.

RLHF/DPO: watch reward/preference win rate + KL drift + hallucination rate + verbosity + human spot checks.

metric
 ^            val quality
 |           /\
 |          /  \___
 | train ___/
 +-------------------> steps
               ↑ stop near peak validation / win rate

Do not worship train loss alone. The job is to maximize useful behavior under cost constraints.
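One simple way to operationalize "stop near peak validation" is patience-based early stopping over saved checkpoints. This is a sketch under assumptions: a single higher-is-better validation score per checkpoint, and a hypothetical patience of two evaluations. In practice you would watch several metrics together, as the text says.

```python
def early_stop_step(val_scores, patience=2):
    """Return (best_index, stop_index) for a higher-is-better validation metric.

    Stops once `patience` evaluations in a row fail to beat the best score.
    """
    best_idx, best = 0, val_scores[0]
    for i, score in enumerate(val_scores):
        if score > best:
            best_idx, best = i, score
        elif i - best_idx >= patience:
            return best_idx, i
    return best_idx, len(val_scores) - 1


win_rates = [0.52, 0.58, 0.63, 0.66, 0.65, 0.64, 0.60]
best, stopped = early_stop_step(win_rates, patience=2)
print(f"ship checkpoint {best}, training stopped at evaluation {stopped}")
```

This only works if the best checkpoint was actually saved. Stopping rules and checkpointing are one decision, not two.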

5.6 Foundation-gap audit for module 06¶

The next module is 06_adaptation_compression. What it assumes you now understand:

assumption for module 06	covered in this module
weights are stored numbers with precision	§2.3 (fp32, bf16/fp16 bytes) and §5.2 (int8/int4 preview)
training ≠ inference workload	§2.3 — forward vs forward+backward
memory ≈ parameter count × bytes	§2.3 — the first-order formula
gradients drive learning	§2.3 + all training stages
big models need memory tricks	§2.4, §5.3, §5.4

Quantization is just controlled compression of stored numbers. With these foundations, it feels grounded — not mystical.

5.7 Honest admission¶

Three honest truths.

First: many crucial training choices are still empirical recipes. Data mix, LR range, KL weight, epoch count — these are tuned, not cleanly derived from theory.

Second: reward models are not truth machines. They are proxies. Shallow proxy → policy becomes shallow in a high-scoring way.

Third: benchmarks only partially reflect product quality. A model can climb a preference metric and still annoy users in production.

Be honest in interviews. Be honest with yourself while shipping. The lifecycle is powerful. It is not neat.

Final retrieval check. Without scrolling: why is bf16 preferred for training? What does gradient accumulation simulate? Why is reward increase alone not enough to decide stopping? If fuzzy, scroll back.

Chapter 6: Recap & application¶

6.1 The failure-fix chain¶

#	Failure	Fix
1	Wiki reader knows language but not job execution	The shadowing (SFT)
2	Raw internet contains duplicates, junk, skewed sources	The curriculum (dedup, filtering, mixing)
3	Next-token training is compute-hungry at scale	Parallelism — data, tensor, pipeline
4	Model weights exceed naïve memory budgets	Mixed precision, sharding, accumulation, checkpointing
5	Shadowing on messy examples teaches messy behavior	High-quality SFT data, clear annotation rules
6	Wrong role formatting confuses the model at serving	Match the training chat template exactly
7	Narrow SFT overwrites broad skills	Small LR, data mixing, adapters, early stopping
8	SFT alone cannot express subtle preference ranking	The performance review (RLHF or DPO)
9	Reward models can be gamed	KL seatbelt + human spot checks
10	PPO adds operational complexity	DPO uses preference pairs more directly
11	Small hardware cannot fit the desired effective batch	Gradient accumulation
12	Training forever wastes money or degrades behavior	Eval-based stopping with checkpoints

Every stage exists because the previous one was not enough. That is the clean mental chain.

6.2 Interview questions — with wrong-answer traps¶

Q: Why is pretraining on 1T tokens not enough to make a useful assistant?
A: Pretraining optimizes next-token continuation, not task-following behavior. The model learns fluency and facts, not instruction formatting or preferred response style. SFT teaches the assistant job shape.
Wrong trap: "Because pretraining does not include enough data." Quantity is not the core issue; objective and distribution mismatch are.

Q: What is the difference between data, tensor, and pipeline parallelism?
A: Data parallelism replicates the model and splits batches. Tensor parallelism splits matrix operations inside one layer. Pipeline parallelism splits layer groups like an assembly line.
Wrong trap: "They all just split training across GPUs." Too shallow: what is being split matters.

Q: Why does SFT data quality matter more than quantity?
A: SFT examples directly teach behavior imitation. Inconsistent or wrong examples get copied. A smaller curated dataset often teaches cleaner behavior than a huge noisy one.

Q: What does RLHF add beyond SFT?
A: SFT teaches imitation of good demonstrations. RLHF adds ranking pressure between plausible answers and pushes toward human-preferred behavior, especially when several answers are acceptable but one is clearer or safer.
Wrong trap: "RLHF teaches safety." Safety may be one target, but the broader role is preference alignment under a chosen rubric.

Q: Why is the KL term important in RLHF?
A: It keeps the new policy close to the SFT reference while reward is being optimized. Without it, the model exploits reward-model blind spots and drifts into bizarre high-scoring behavior.
Wrong trap: "KL just regularizes the model." True but incomplete; the important point is what failure it guards against: reward hacking.

Q: For training, bf16 or fp16, and why?
A: Usually bf16. Both use 2 bytes, but bf16 has a wider exponent range so activations and gradients overflow less. fp16 can work but often needs loss scaling.
Wrong trap: "bf16 is smaller." Same 16-bit storage; the key difference is numerical range.

Q: What is catastrophic forgetting?
A: The model improves on a narrow domain but loses general capability. Caused by high LR, too many epochs on small data, no general-data mixing. Diagnose by tracking both domain and general evals together.

6.3 Production experience — what this looks like when you ship¶

Model sounds encyclopedic instead of helpful. Usually SFT gap. Check whether task-format demonstrations exist in the fine-tuning set.
Serving quality dropped after deployment. Check chat-template parity first. One missing assistant marker can look like "the model got worse."
Domain benchmark up, general usage complaints up. Likely catastrophic forgetting. Keep a frozen general-eval panel, compare checkpoint-to-checkpoint.
Reward score rises, user satisfaction drops. Classic proxy hacking. Check KL drift, refusal rate, verbosity, and human side-by-side reviews.
Training diverges early in fp16. Try bf16 or use loss scaling and lower LR.
OOM during long-context SFT. Turn on gradient checkpointing, reduce microbatch size, or lower sequence length.
Checkpoint resume gives different behavior. Ensure you saved optimizer and scheduler state, not just weights.
Preference data looks fine, outputs still awkward. Check whether the SFT base was strong enough — weak shadowing makes the performance review less effective.
A single metric says training is improving. Never trust one metric alone. Keep task evals, general evals, and human spot checks together.

6.4 Exercises¶

Easy (5 minutes) Take one prompt and format it two ways: (A) raw text only, (B) exact chat template with role markers. Run both through the same chat-tuned model. Write one sentence explaining why template parity matters.

Medium (15 minutes) By hand:

1. Compute weight memory for a 7B model in fp32, bf16, and int8.
2. Compute effective global batch: microbatch=2, GPUs=8, accumulation=32.
3. Write one paragraph on why training memory exceeds inference memory.

Answers should include: 7B × bytes, 2 × 8 × 32 = 512, and weights vs gradients vs optimizer states.

Hard (45 minutes) Build a tiny instruction-tuning dataset of 30 JSONL examples across three tasks: summarize an email, draft a polite reply, extract action items. Write a Python script that converts each example into the exact chat template for your target model. The script should:

1. Read raw fields
2. Insert system, user, and assistant markers
3. Print one fully formatted training example
4. Print one intentionally broken example with a missing assistant marker
5. Explain in comments what serving bug that broken example simulates

Drawing retrieval task Without looking, draw the failure-fix table from §6.1 — all 12 rows. Use the ELI5 names: wiki reader, curriculum, shadowing, performance review, promotion criteria. If you can redraw the chain, you understand the lifecycle.

Next module — 06_adaptation_compression — shows how to make these giant trained models actually deployable. Smaller, faster, cheaper — without losing what they learned.