00. LLM Training Lifecycle — The Five-Year-Old Version¶

Module 04 built the attention machine. This module turns that machine into a trained, useful, inspectable assistant.

The strange part is this: nobody directly programs the assistant to be helpful. We mostly choose what it reads, which tokens get punished, which examples count as good behavior, which comparisons win, and which checkpoints survive evaluation. A trained LLM is therefore less like a hand-written rulebook and more like a city shaped by roads, prices, inspections, and habits.

Imagine hiring a brilliant new analyst for a support team. On day one, you do not put them on live customer chats. First they read the company wiki: manuals, old tickets, product docs, policy notes, bug reports, and code snippets. That stage creates the wiki reader. The reader knows patterns, phrases, and facts, but still treats every request as another page to continue.

Then you choose what goes into the wiki. If the reading pile has duplicate pages, stale policies, spam, or too much of one department's writing, the analyst absorbs those biases. That pile is the curriculum. Training is not just "more reading"; it is deciding which experiences deserve to shape the employee.

After reading, the analyst shadows a senior teammate. They see exact requests and exact replies: summarize in bullets, return JSON, refuse unsafe work, ask a clarifying question, stop after the answer. This is the shadow shift. The employee is not gaining all new knowledge; they are learning the product contract.

But two answers can both be correct. One is short and useful. One is long and annoying. One refuses too much. One takes the riskier path. Humans compare drafts and create the preference desk. The desk tells the analyst which acceptable answer should win.

Behind the scenes, the training manager has a second problem: the employee is huge. Their notes, scratch work, review forms, and memory pile may not fit in one room. So the manager uses the GPU kitchen: split work across stations, checkpoint bulky intermediate work, keep logs, and avoid burning the whole budget on a run that is already failing.

The tools matter too. PyTorch is the workbench where you can build every joint by hand. Hugging Face is the labeled supply shelf: tokenizers, model classes, datasets, registries, launchers, trainers. The shelf is fast until the pre-cut panel does not fit. Then you need to understand the workbench underneath.

The lifecycle is the sequence of pressures:

raw text pressure
      │
      ▼
curriculum choice ──→ next-token learning ──→ base model
      │                                      │
      ▼                                      ▼
tooling + GPU kitchen                 instruction failure
      │                                      │
      ▼                                      ▼
shadow shift ──→ preference desk ──→ evaluated product model

The trick is to stop seeing training as one magic run. It is a chain of contracts. Data contracts decide what world the model compresses. Loss contracts decide what behavior is rewarded locally. Tool contracts decide what code actually runs. Preference contracts decide which useful-looking behavior survives.

That is the core first principle of the module: every training mechanism is a response to a mismatch. The model predicts text, but the product needs a contract. The corpus is huge, but the useful signal is rare. The loop is simple, but the tensors do not fit. The answer is plausible, but not the one users prefer. The benchmark improves, but the release gets worse.

Once you see the mismatches, the lifecycle becomes easier to remember:

mismatch discovered ──→ pressure named ──→ mechanism chosen ──→ new pressure measured

The same chain explains why a small targeted change can beat a huge generic one. If the analyst already knows what a rollback is but keeps writing essays, more wiki pages are not the first fix. The missing stage is the shadow shift: show exact requests and exact answers until the right behavior becomes the obvious continuation.

It also explains why tooling is part of the curriculum, not an appendix. If the manager records examples in one form but the live desk sends them in another, the analyst sees a different job at deployment time. That is what a chat template bug feels like. It is not "just formatting"; it changes where the answer begins.

Now stretch the analogy to hardware. A tiny employee can work alone at one desk. A giant employee needs a kitchen: one station holds notes, one station holds drafts, one station holds review sheets, and a runner carries pieces between them. If the kitchen plan is wrong, the employee may be brilliant and still unable to finish the shift. That is the GPU kitchen.

This module is therefore not a survey of every training trick. It is a pressure chain:

why the base model fails the product contract
why data mix becomes behavior
how the next-token loop actually changes weights
why memory and parallelism dominate large runs
where PyTorch and Hugging Face fit
why SFT teaches assistant behavior
why templates and curation decide reliability
why preferences need drift control
how eval gates decide whether a checkpoint is real progress

One recurring mistake is to confuse stage names with explanations. "We used SFT" does not explain why the answer improved. Did the examples teach role boundaries? Did masking target only assistant tokens? Did the data contain the hard cases? Did evals isolate format from factual preservation? The stage name is a label. The mechanism is the explanation.

Another recurring mistake is to treat later stages as cleanup for earlier negligence. Sometimes they are. Usually that is expensive. A bad curriculum creates priors that the shadow shift must fight. A shallow preference desk rewards cheap style that later product evals must catch. A sloppy supply shelf configuration can make a good checkpoint look broken.

So read this module like a detective story, not a glossary. In each chapter, ask: what failure fooled the team, what constraint made the obvious repair insufficient, and what new cost did the successful mechanism create? If a concept does not answer those questions, you have memorized a label instead of owning the tool.

Keep the analyst story in mind, but do not force it beyond usefulness. The analogy is there to preserve the pressure sequence: read broadly, learn the job, compare drafts, fit the kitchen, measure the release. The real mechanisms are tensors, losses, datasets, process groups, templates, human labels, and evals.

By the end, a reader should be able to look at a training failure and ask the right first question:

Is this a knowledge problem?
Is this a behavior-copying problem?
Is this a protocol/template problem?
Is this a preference-ranking problem?
Is this a memory/runtime problem?
Is this an eval blindness problem?

That diagnostic split is the point of the module. It keeps teams from using the most expensive knob just because it is the most famous one.

The placeholders you will see called back¶

Placeholder	Meaning
the wiki reader	the pretrained base model that learned broad text patterns before becoming an assistant
the curriculum	the data mixture, filters, dedup rules, tokenizer choices, and sampling ratios that shape pretraining
the GPU kitchen	the hardware/runtime system: memory, optimizer state, parallelism, checkpointing, and launch tooling
the workbench	PyTorch-level primitives: tensors, autograd, modules, training loops, and distributed wrappers
the supply shelf	Hugging Face ecosystem pieces: Transformers, Datasets, Tokenizers, Hub, Trainer, Accelerate
the shadow shift	supervised fine-tuning that copies demonstrated assistant behavior
the preference desk	reward models, PPO, DPO, evals, and human comparisons that choose among plausible answers

Top resources¶

Let's reproduce GPT-2 (124M) — end-to-end pretraining mechanics with visible intermediate decisions.
The Ultra-Scale Playbook — practical memory, parallelism, and cluster-training tradeoffs.
The Illustrated GPT-2 — visual grounding for how a pretrained decoder behaves.
InstructGPT — the classic pipeline from SFT to reward modeling to PPO.
Direct Preference Optimization — a simpler preference-training path and its assumptions.
Chinchilla — why compute, data, and model size must be balanced instead of scaled blindly.
Hugging Face Course — practical Transformers, Tokenizers, Datasets, and Hub fluency.
PyTorch tutorials — tensors, autograd, modules, AMP, and distributed basics.

What's coming¶

01-base-model-product-contract.md — why a fluent base model still fails the product job.
02-curriculum-data-mix.md — how data mix becomes model behavior.
03-next-token-training-loop.md — how next-token loss, autograd, and a real loop actually train.
04-memory-and-parallelism.md — why training breaks on memory before it breaks on math.
05-pytorch-hf-tooling.md — how PyTorch and Hugging Face encode training decisions.
06-sft-behavior-copying.md — why SFT changes manners without changing the basic loss.
07-chat-protocol-and-data-quality.md — why templates and curation beat vague "more examples."
08-preferences-reward-ppo-dpo.md — how preference training improves behavior without letting it drift.
09-lifecycle-decisions-evals.md — how to choose stop points, eval gates, and deployment handoffs.
10-honest-admission.md — what still works more empirically than theoretically.

Memory map¶

Concept	Prerequisite	Pressure family	Recurs later as	Layer touched
Base model contract	next-token prediction	product behavior	SFT target shape, eval gates	model → API → product
Data mix	corpus sampling	data quality	source-sliced evals, preference drift	data → tokenizer → behavior
Training loop	shifted tokens + loss	optimization	SFT masking, DPO updates	tensor → gradient → checkpoint
Memory plan	parameters + activations	bounded compute	sharding, checkpointing, launch config	GPU → runtime → operator
Tooling boundary	PyTorch loop	abstraction leakage	template bugs, checkpoint reproducibility	library → config → run
SFT examples	base model behavior	role adaptation	chat protocol, preference pairs	data row → loss mask → assistant
Chat protocol	SFT row structure	protocol consistency	train/eval/serve mismatch	template → token ids → inference
Preference training	SFT candidate answers	taste control	KL drift, reward hacking	label → objective → behavior
Eval gate	product contract	release risk	rollback, ablation, honest uncertainty	metric → decision → deployment

Bridge. The wiki reader can sound smart while violating the user's contract. The first failure is not ignorance; it is the gap between language continuation and product behavior. → 01-base-model-product-contract.md