00. LLM Training Lifecycle — The Five-Year-Old Version¶
Module 04 built the attention machine. This module turns that machine into a trained, useful, inspectable assistant.
The strange part is this: nobody directly programs the assistant to be helpful. We mostly choose what it reads, which tokens get punished, which examples count as good behavior, which comparisons win, and which checkpoints survive evaluation. A trained LLM is therefore less like a hand-written rulebook and more like a city shaped by roads, prices, inspections, and habits.
Imagine hiring a brilliant new analyst for a support team. On day one, you do not put them on live customer chats. First they read the company wiki: manuals, old tickets, product docs, policy notes, bug reports, and code snippets. That stage creates the wiki reader. The reader knows patterns, phrases, and facts, but still treats every request as another page to continue.
Then you choose what goes into the wiki. If the reading pile has duplicate pages, stale policies, spam, or too much of one department's writing, the analyst absorbs those biases. That pile is the curriculum. Training is not just "more reading"; it is deciding which experiences deserve to shape the employee.
After reading, the analyst shadows a senior teammate. They see exact requests and exact replies: summarize in bullets, return JSON, refuse unsafe work, ask a clarifying question, stop after the answer. This is the shadow shift. The employee is not gaining all new knowledge; they are learning the product contract.
But two answers can both be correct. One is short and useful. One is long and annoying. One refuses too much. One takes the riskier path. Humans compare drafts and create the preference desk. The desk tells the analyst which acceptable answer should win.
Behind the scenes, the training manager has a second problem: the employee is huge. Their notes, scratch work, review forms, and memory pile may not fit in one room. So the manager uses the GPU kitchen: split work across stations, checkpoint bulky intermediate work, keep logs, and avoid burning the whole budget on a run that is already failing.
The tools matter too. PyTorch is the workbench where you can build every joint by hand. Hugging Face is the labeled supply shelf: tokenizers, model classes, datasets, registries, launchers, trainers. The shelf is fast until the pre-cut panel does not fit. Then you need to understand the workbench underneath.
The lifecycle is the sequence of pressures:
raw text pressure
│
▼
curriculum choice ──→ next-token learning ──→ base model
│ │
▼ ▼
tooling + GPU kitchen instruction failure
│ │
▼ ▼
shadow shift ──→ preference desk ──→ evaluated product model
The trick is to stop seeing training as one magic run. It is a chain of contracts. Data contracts decide what world the model compresses. Loss contracts decide what behavior is rewarded locally. Tool contracts decide what code actually runs. Preference contracts decide which useful-looking behavior survives.
That is the core first principle of the module: every training mechanism is a response to a mismatch. The model predicts text, but the product needs a contract. The corpus is huge, but the useful signal is rare. The loop is simple, but the tensors do not fit. The answer is plausible, but not the one users prefer. The benchmark improves, but the release gets worse.
Once you see the mismatches, the lifecycle becomes easier to remember:
The same chain explains why a small targeted change can beat a huge generic one. If the analyst already knows what a rollback is but keeps writing essays, more wiki pages are not the first fix. The missing stage is the shadow shift: show exact requests and exact answers until the right behavior becomes the obvious continuation.
It also explains why tooling is part of the curriculum, not an appendix. If the manager records examples in one form but the live desk sends them in another, the analyst sees a different job at deployment time. That is what a chat template bug feels like. It is not "just formatting"; it changes where the answer begins.
Now stretch the analogy to hardware. A tiny employee can work alone at one desk. A giant employee needs a kitchen: one station holds notes, one station holds drafts, one station holds review sheets, and a runner carries pieces between them. If the kitchen plan is wrong, the employee may be brilliant and still unable to finish the shift. That is the GPU kitchen.
This module is therefore not a survey of every training trick. It is a pressure chain:
- why the base model fails the product contract
- why data mix becomes behavior
- how the next-token loop actually changes weights
- why memory and parallelism dominate large runs
- where PyTorch and Hugging Face fit
- why SFT teaches assistant behavior
- why templates and curation decide reliability
- why preferences need drift control
- how eval gates decide whether a checkpoint is real progress
One recurring mistake is to confuse stage names with explanations. "We used SFT" does not explain why the answer improved. Did the examples teach role boundaries? Did masking target only assistant tokens? Did the data contain the hard cases? Did evals isolate format from factual preservation? The stage name is a label. The mechanism is the explanation.
Another recurring mistake is to treat later stages as cleanup for earlier negligence. Sometimes they are. Usually that is expensive. A bad curriculum creates priors that the shadow shift must fight. A shallow preference desk rewards cheap style that later product evals must catch. A sloppy supply shelf configuration can make a good checkpoint look broken.
So read this module like a detective story, not a glossary. In each chapter, ask: what failure fooled the team, what constraint made the obvious repair insufficient, and what new cost did the successful mechanism create? If a concept does not answer those questions, you have memorized a label instead of owning the tool.
Keep the analyst story in mind, but do not force it beyond usefulness. The analogy is there to preserve the pressure sequence: read broadly, learn the job, compare drafts, fit the kitchen, measure the release. The real mechanisms are tensors, losses, datasets, process groups, templates, human labels, and evals.
By the end, a reader should be able to look at a training failure and ask the right first question:
Is this a knowledge problem?
Is this a behavior-copying problem?
Is this a protocol/template problem?
Is this a preference-ranking problem?
Is this a memory/runtime problem?
Is this an eval blindness problem?
That diagnostic split is the point of the module. It keeps teams from using the most expensive knob just because it is the most famous one.
The placeholders you will see called back¶
| Placeholder | Meaning |
|---|---|
| the wiki reader | the pretrained base model that learned broad text patterns before becoming an assistant |
| the curriculum | the data mixture, filters, dedup rules, tokenizer choices, and sampling ratios that shape pretraining |
| the GPU kitchen | the hardware/runtime system: memory, optimizer state, parallelism, checkpointing, and launch tooling |
| the workbench | PyTorch-level primitives: tensors, autograd, modules, training loops, and distributed wrappers |
| the supply shelf | Hugging Face ecosystem pieces: Transformers, Datasets, Tokenizers, Hub, Trainer, Accelerate |
| the shadow shift | supervised fine-tuning that copies demonstrated assistant behavior |
| the preference desk | reward models, PPO, DPO, evals, and human comparisons that choose among plausible answers |
Top resources¶
- Let's reproduce GPT-2 (124M) — end-to-end pretraining mechanics with visible intermediate decisions.
- The Ultra-Scale Playbook — practical memory, parallelism, and cluster-training tradeoffs.
- The Illustrated GPT-2 — visual grounding for how a pretrained decoder behaves.
- InstructGPT — the classic pipeline from SFT to reward modeling to PPO.
- Direct Preference Optimization — a simpler preference-training path and its assumptions.
- Chinchilla — why compute, data, and model size must be balanced instead of scaled blindly.
- Hugging Face Course — practical Transformers, Tokenizers, Datasets, and Hub fluency.
- PyTorch tutorials — tensors, autograd, modules, AMP, and distributed basics.
What's coming¶
- 01-base-model-product-contract.md — why a fluent base model still fails the product job.
- 02-curriculum-data-mix.md — how data mix becomes model behavior.
- 03-next-token-training-loop.md — how next-token loss, autograd, and a real loop actually train.
- 04-memory-and-parallelism.md — why training breaks on memory before it breaks on math.
- 05-pytorch-hf-tooling.md — how PyTorch and Hugging Face encode training decisions.
- 06-sft-behavior-copying.md — why SFT changes manners without changing the basic loss.
- 07-chat-protocol-and-data-quality.md — why templates and curation beat vague "more examples."
- 08-preferences-reward-ppo-dpo.md — how preference training improves behavior without letting it drift.
- 09-lifecycle-decisions-evals.md — how to choose stop points, eval gates, and deployment handoffs.
- 10-honest-admission.md — what still works more empirically than theoretically.
Memory map¶
| Concept | Prerequisite | Pressure family | Recurs later as | Layer touched |
|---|---|---|---|---|
| Base model contract | next-token prediction | product behavior | SFT target shape, eval gates | model → API → product |
| Data mix | corpus sampling | data quality | source-sliced evals, preference drift | data → tokenizer → behavior |
| Training loop | shifted tokens + loss | optimization | SFT masking, DPO updates | tensor → gradient → checkpoint |
| Memory plan | parameters + activations | bounded compute | sharding, checkpointing, launch config | GPU → runtime → operator |
| Tooling boundary | PyTorch loop | abstraction leakage | template bugs, checkpoint reproducibility | library → config → run |
| SFT examples | base model behavior | role adaptation | chat protocol, preference pairs | data row → loss mask → assistant |
| Chat protocol | SFT row structure | protocol consistency | train/eval/serve mismatch | template → token ids → inference |
| Preference training | SFT candidate answers | taste control | KL drift, reward hacking | label → objective → behavior |
| Eval gate | product contract | release risk | rollback, ablation, honest uncertainty | metric → decision → deployment |
Bridge. The wiki reader can sound smart while violating the user's contract. The first failure is not ignorance; it is the gap between language continuation and product behavior. → 01-base-model-product-contract.md