Skip to content

00. LLM Training Lifecycle — The Five-Year-Old Version

Module 04 built the attention machine. This module turns that machine into a trained, useful, inspectable assistant.

The strange part is this: nobody directly programs the assistant to be helpful. We mostly choose what it reads, which tokens get punished, which examples count as good behavior, which comparisons win, and which checkpoints survive evaluation. A trained LLM is therefore less like a hand-written rulebook and more like a city shaped by roads, prices, inspections, and habits.

Imagine hiring a brilliant new analyst for a support team. On day one, you do not put them on live customer chats. First they read the company wiki: manuals, old tickets, product docs, policy notes, bug reports, and code snippets. That stage creates the wiki reader. The reader knows patterns, phrases, and facts, but still treats every request as another page to continue.

Then you choose what goes into the wiki. If the reading pile has duplicate pages, stale policies, spam, or too much of one department's writing, the analyst absorbs those biases. That pile is the curriculum. Training is not just "more reading"; it is deciding which experiences deserve to shape the employee.

After reading, the analyst shadows a senior teammate. They see exact requests and exact replies: summarize in bullets, return JSON, refuse unsafe work, ask a clarifying question, stop after the answer. This is the shadow shift. The employee is not gaining all new knowledge; they are learning the product contract.

But two answers can both be correct. One is short and useful. One is long and annoying. One refuses too much. One takes the riskier path. Humans compare drafts and create the preference desk. The desk tells the analyst which acceptable answer should win.

Behind the scenes, the training manager has a second problem: the employee is huge. Their notes, scratch work, review forms, and memory pile may not fit in one room. So the manager uses the GPU kitchen: split work across stations, checkpoint bulky intermediate work, keep logs, and avoid burning the whole budget on a run that is already failing.

The tools matter too. PyTorch is the workbench where you can build every joint by hand. Hugging Face is the labeled supply shelf: tokenizers, model classes, datasets, registries, launchers, trainers. The shelf is fast until the pre-cut panel does not fit. Then you need to understand the workbench underneath.

The lifecycle is the sequence of pressures:

raw text pressure
curriculum choice ──→ next-token learning ──→ base model
      │                                      │
      ▼                                      ▼
tooling + GPU kitchen                 instruction failure
      │                                      │
      ▼                                      ▼
shadow shift ──→ preference desk ──→ evaluated product model

The trick is to stop seeing training as one magic run. It is a chain of contracts. Data contracts decide what world the model compresses. Loss contracts decide what behavior is rewarded locally. Tool contracts decide what code actually runs. Preference contracts decide which useful-looking behavior survives.

That is the core first principle of the module: every training mechanism is a response to a mismatch. The model predicts text, but the product needs a contract. The corpus is huge, but the useful signal is rare. The loop is simple, but the tensors do not fit. The answer is plausible, but not the one users prefer. The benchmark improves, but the release gets worse.

Once you see the mismatches, the lifecycle becomes easier to remember:

mismatch discovered ──→ pressure named ──→ mechanism chosen ──→ new pressure measured

The same chain explains why a small targeted change can beat a huge generic one. If the analyst already knows what a rollback is but keeps writing essays, more wiki pages are not the first fix. The missing stage is the shadow shift: show exact requests and exact answers until the right behavior becomes the obvious continuation.

It also explains why tooling is part of the curriculum, not an appendix. If the manager records examples in one form but the live desk sends them in another, the analyst sees a different job at deployment time. That is what a chat template bug feels like. It is not "just formatting"; it changes where the answer begins.

Now stretch the analogy to hardware. A tiny employee can work alone at one desk. A giant employee needs a kitchen: one station holds notes, one station holds drafts, one station holds review sheets, and a runner carries pieces between them. If the kitchen plan is wrong, the employee may be brilliant and still unable to finish the shift. That is the GPU kitchen.

This module is therefore not a survey of every training trick. It is a pressure chain:

  • why the base model fails the product contract
  • why data mix becomes behavior
  • how the next-token loop actually changes weights
  • why memory and parallelism dominate large runs
  • where PyTorch and Hugging Face fit
  • why SFT teaches assistant behavior
  • why templates and curation decide reliability
  • why preferences need drift control
  • how eval gates decide whether a checkpoint is real progress

One recurring mistake is to confuse stage names with explanations. "We used SFT" does not explain why the answer improved. Did the examples teach role boundaries? Did masking target only assistant tokens? Did the data contain the hard cases? Did evals isolate format from factual preservation? The stage name is a label. The mechanism is the explanation.

Another recurring mistake is to treat later stages as cleanup for earlier negligence. Sometimes they are. Usually that is expensive. A bad curriculum creates priors that the shadow shift must fight. A shallow preference desk rewards cheap style that later product evals must catch. A sloppy supply shelf configuration can make a good checkpoint look broken.

So read this module like a detective story, not a glossary. In each chapter, ask: what failure fooled the team, what constraint made the obvious repair insufficient, and what new cost did the successful mechanism create? If a concept does not answer those questions, you have memorized a label instead of owning the tool.

Keep the analyst story in mind, but do not force it beyond usefulness. The analogy is there to preserve the pressure sequence: read broadly, learn the job, compare drafts, fit the kitchen, measure the release. The real mechanisms are tensors, losses, datasets, process groups, templates, human labels, and evals.

By the end, a reader should be able to look at a training failure and ask the right first question:

Is this a knowledge problem?
Is this a behavior-copying problem?
Is this a protocol/template problem?
Is this a preference-ranking problem?
Is this a memory/runtime problem?
Is this an eval blindness problem?

That diagnostic split is the point of the module. It keeps teams from using the most expensive knob just because it is the most famous one.

The placeholders you will see called back

Placeholder Meaning
the wiki reader the pretrained base model that learned broad text patterns before becoming an assistant
the curriculum the data mixture, filters, dedup rules, tokenizer choices, and sampling ratios that shape pretraining
the GPU kitchen the hardware/runtime system: memory, optimizer state, parallelism, checkpointing, and launch tooling
the workbench PyTorch-level primitives: tensors, autograd, modules, training loops, and distributed wrappers
the supply shelf Hugging Face ecosystem pieces: Transformers, Datasets, Tokenizers, Hub, Trainer, Accelerate
the shadow shift supervised fine-tuning that copies demonstrated assistant behavior
the preference desk reward models, PPO, DPO, evals, and human comparisons that choose among plausible answers

Top resources

What's coming

  1. 01-base-model-product-contract.md — why a fluent base model still fails the product job.
  2. 02-curriculum-data-mix.md — how data mix becomes model behavior.
  3. 03-next-token-training-loop.md — how next-token loss, autograd, and a real loop actually train.
  4. 04-memory-and-parallelism.md — why training breaks on memory before it breaks on math.
  5. 05-pytorch-hf-tooling.md — how PyTorch and Hugging Face encode training decisions.
  6. 06-sft-behavior-copying.md — why SFT changes manners without changing the basic loss.
  7. 07-chat-protocol-and-data-quality.md — why templates and curation beat vague "more examples."
  8. 08-preferences-reward-ppo-dpo.md — how preference training improves behavior without letting it drift.
  9. 09-lifecycle-decisions-evals.md — how to choose stop points, eval gates, and deployment handoffs.
  10. 10-honest-admission.md — what still works more empirically than theoretically.

Memory map

Concept Prerequisite Pressure family Recurs later as Layer touched
Base model contract next-token prediction product behavior SFT target shape, eval gates model → API → product
Data mix corpus sampling data quality source-sliced evals, preference drift data → tokenizer → behavior
Training loop shifted tokens + loss optimization SFT masking, DPO updates tensor → gradient → checkpoint
Memory plan parameters + activations bounded compute sharding, checkpointing, launch config GPU → runtime → operator
Tooling boundary PyTorch loop abstraction leakage template bugs, checkpoint reproducibility library → config → run
SFT examples base model behavior role adaptation chat protocol, preference pairs data row → loss mask → assistant
Chat protocol SFT row structure protocol consistency train/eval/serve mismatch template → token ids → inference
Preference training SFT candidate answers taste control KL drift, reward hacking label → objective → behavior
Eval gate product contract release risk rollback, ablation, honest uncertainty metric → decision → deployment

Bridge. The wiki reader can sound smart while violating the user's contract. The first failure is not ignorance; it is the gap between language continuation and product behavior. → 01-base-model-product-contract.md