09. Lifecycle Decisions and Evals — choosing the next knob without fooling yourself¶

What each stage changes and what still needs a gate¶

By now we have several levers. The curriculum changes priors, the training loop changes weights, the GPU kitchen constrains what fits, SFT teaches behavior, chat protocol preserves role boundaries, and preferences choose between good-looking answers.

The new problem is decision-making. Every lever can improve one metric while damaging product behavior, latency, cost, safety, or rollback ability. A better checkpoint is not automatically a shippable checkpoint.

This chapter teaches stage gates: write the hypothesis, non-averaged eval slices, cost budget, and rollback artifact before starting the run.

What this file solves¶

A checkpoint can improve a benchmark while hurting product behavior, cost, latency, or safety. This file shows how to write the hypothesis, non-averaged gates, cost budget, and rollback artifact before starting the run.

Why training stages need gates¶

Every lifecycle stage trades one pressure for another. A gate forces the team to name the intended improvement, the slice that must not regress, the budget, and the rollback path before the run creates sunk cost.

When aggregate scores hide release risk¶

The naive approach is to ship the checkpoint with the better average score. If the average hides latency, refusal, format, safety, or source-slice regressions, the release can get worse while the benchmark looks better.

When a better benchmark is a worse release¶

A checkpoint improves a public benchmark by two points.
It also gets slower and refuses harmless support requests.
The number improved; the release got worse.

Rule: every stage needs a gate and rollback¶

Each training stage needs a hypothesis, an eval gate, a cost budget, and a rollback path.

Why gates beat vibes. A metric is useful only if it helps you choose the next action: continue, stop, ship, roll back, or change the plan.

1) Hook — the run improved the benchmark and hurt the product¶

The team celebrates a two-point public benchmark gain. Support users complain the model is slower, more verbose, and less willing to answer benign operational questions. The run improved something. It did not improve the product.

The hook is that "better model" is not a scalar. A checkpoint can move northeast on a leaderboard and southwest for your users because the product lives on slices, constraints, and costs.

2) Mental model — stage gates, not a pipeline poster¶

data hypothesis ─→ train run ─→ eval gate ─→ ship / stop / revise
       ▲                            │
       └──────────── incident learnings ◀──── user telemetry

Every stage should say what pressure it relieves and what new pressure it creates.

one checkpoint
      │
      ├─ quality slice improved?
      ├─ safety boundary preserved?
      ├─ latency/cost still viable?
      ├─ rollback artifact complete?
      └─ next failure now more important?

3) Running example — incident bot release¶

Hypothesis: SFT on curated incident rows improves action preservation without increasing over-refusal.

Gate:

exact 3-bullet format pass >= 95%
critical-action preservation >= 92%
harmless refusal rate <= baseline + 1%
p95 latency no worse than baseline, measured after conversion to the deployment format
general-support regression suite drops no more than 1% against baseline

If one gate fails, do not average it away.
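The gate above can be sketched as a list of independent checks, each of which can veto the release by itself. This is a minimal illustration, not a reference harness: the metric names, the baseline and candidate numbers, and the reading of "+1%" and "-1%" as absolute percentage points are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass(frozen=True)
class Gate:
    name: str
    # (candidate metrics, baseline metrics) -> True when the gate passes
    passes: Callable[[Metrics, Metrics], bool]


# Thresholds mirror the incident bot gate; metric keys are hypothetical.
GATES: List[Gate] = [
    Gate("format_pass", lambda c, b: c["format_pass_pct"] >= 95.0),
    Gate("action_preservation", lambda c, b: c["action_preservation_pct"] >= 92.0),
    Gate("harmless_refusal",
         lambda c, b: c["harmless_refusal_pct"] <= b["harmless_refusal_pct"] + 1.0),
    Gate("p95_latency", lambda c, b: c["p95_latency_ms"] <= b["p95_latency_ms"]),
    Gate("general_support",
         lambda c, b: c["general_support_pct"] >= b["general_support_pct"] - 1.0),
]


def failed_gates(candidate: Metrics, baseline: Metrics) -> List[str]:
    """Return every failing gate by name; there is no averaging step."""
    return [g.name for g in GATES if not g.passes(candidate, baseline)]


# Hypothetical numbers: four gates pass, harmless refusals regress.
baseline = {"format_pass_pct": 80.0, "action_preservation_pct": 85.0,
            "harmless_refusal_pct": 3.0, "p95_latency_ms": 900.0,
            "general_support_pct": 90.0}
candidate = {"format_pass_pct": 97.0, "action_preservation_pct": 94.0,
             "harmless_refusal_pct": 9.0, "p95_latency_ms": 900.0,
             "general_support_pct": 89.5}

print(failed_gates(candidate, baseline))  # ['harmless_refusal']
```

The function returns a list of names rather than a score on purpose: a non-empty list means "do not ship", however strong the other four results are.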

4) Choosing evals that change decisions¶

Loss/perplexity is good for training mechanics, but blind to product usefulness.
Task unit tests catch exact contracts, but miss phrasing diversity.
Pairwise human evals measure preference quality, but cost more and vary by reviewer.
Adversarial evals expose boundary failures, but can overfit to a red-team set.
Online telemetry sees real users, but carries confounding and delayed harm.

No single eval owns truth.

5) Stopping is an engineering decision¶

Stop when marginal gain is smaller than cost/risk, not when the curve looks emotionally satisfying. A run that still improves loss can already hurt refusal calibration or latency.
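One way to make that rule concrete is to compare the cost of the last training interval with the product-eval gain it bought, and to stop immediately if any gate regressed. The sketch below assumes cost is tracked in GPU hours and that the team has agreed on a maximum acceptable cost per point; the function name, parameters, and numbers are hypothetical.

```python
def should_stop(gain_points: float,
                cost_gpu_hours: float,
                max_gpu_hours_per_point: float,
                gate_regressed: bool) -> bool:
    """Decide whether to stop after the most recent training interval.

    gain_points: improvement on the product-critical eval, not training loss.
    gate_regressed: True if any non-averaged gate got worse in this interval.
    """
    if gate_regressed:
        return True   # a falling loss does not excuse a failed gate
    if gain_points <= 0:
        return True   # spending without any product gain
    return cost_gpu_hours / gain_points > max_gpu_hours_per_point


# Loss may still be falling, but the last 400 GPU hours bought only 0.2 points
# against an agreed budget of 1000 GPU hours per point.
print(should_stop(gain_points=0.2, cost_gpu_hours=400,
                  max_gpu_hours_per_point=1000, gate_regressed=False))  # True
```

The budget is an input, not something the function derives, because "acceptable cost per quality point" is a product decision that should be written down before the run.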

6) Aggregate scores hide boundary regressions¶

flowchart TD
  A[Overall score +2] --> B[Team ships]
  A --> C[Harmless-refusal rate +8%]
  C --> D[Support users blocked]
  B --> E[Rollback after complaints]

The aggregate did not lie; it compressed away the decision boundary.
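A small worked example shows how this compression happens. The slice names and pass counts below are invented for illustration: two large slices improve, one small slice collapses, and the pooled score still rises by about two points.

```python
from typing import Dict, Tuple

# slice name -> (passes before, passes after, total examples); hypothetical data
SLICES: Dict[str, Tuple[int, int, int]] = {
    "general_support": (640, 672, 800),
    "incident_actions": (128, 129, 150),
    "harmless_ops_requests": (45, 35, 50),
}


def aggregate_delta_points(slices: Dict[str, Tuple[int, int, int]]) -> float:
    """Change in the pooled pass rate, in percentage points."""
    total = sum(n for _, _, n in slices.values())
    before = sum(b for b, _, _ in slices.values())
    after = sum(a for _, a, _ in slices.values())
    return 100 * (after - before) / total


def regressed_slices(slices: Dict[str, Tuple[int, int, int]],
                     tolerance_points: float = 1.0) -> Dict[str, float]:
    """Slices whose own pass rate dropped by more than the tolerance."""
    out = {}
    for name, (before, after, n) in slices.items():
        delta = 100 * (after - before) / n
        if delta < -tolerance_points:
            out[name] = delta
    return out


print(aggregate_delta_points(SLICES))  # 2.3
print(regressed_slices(SLICES))        # {'harmless_ops_requests': -20.0}
```

The small slice carries only 50 of 1000 examples, so a 20-point drop there costs the pooled score one point, which the larger slices more than cover.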

7) What gates prevent and cost¶

Longer pretraining might add +0.5 on a broad eval, but carries a large GPU cost.
More SFT rows might add +3 to format pass rate, but raises forgetting risk.
Stronger preference optimization might add +5 to human win rate, but raises reward-hacking risk.
A larger model can improve quality, but adds latency and serving memory.
An adapter fine-tune is cheap adaptation, but adds merge and serving complexity.

8) Signals that a checkpoint is shippable or risky¶

Healthy: each run has a written hypothesis and a small set of non-averaged gates.
First degrading metric: one slice regresses while aggregate improves.
Misleading beginner metric: leaderboard score alone.
Expert graph: quality/cost/latency/safety Pareto frontier by checkpoint.
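The Pareto view can be computed directly: a checkpoint stays on the frontier unless some other checkpoint is at least as good on every axis and strictly better on one. The sketch below assumes four axes with the directions noted in the comments; the checkpoint names and numbers are hypothetical.

```python
from typing import List, NamedTuple


class Checkpoint(NamedTuple):
    name: str
    quality: float     # higher is better
    safety: float      # higher is better
    latency_ms: float  # lower is better
    cost: float        # lower is better


def dominates(a: Checkpoint, b: Checkpoint) -> bool:
    """True if a is no worse than b on every axis and better on at least one."""
    no_worse = (a.quality >= b.quality and a.safety >= b.safety
                and a.latency_ms <= b.latency_ms and a.cost <= b.cost)
    better = (a.quality > b.quality or a.safety > b.safety
              or a.latency_ms < b.latency_ms or a.cost < b.cost)
    return no_worse and better


def pareto_frontier(checkpoints: List[Checkpoint]) -> List[Checkpoint]:
    """Keep only checkpoints that no other checkpoint dominates."""
    return [c for c in checkpoints
            if not any(dominates(other, c) for other in checkpoints)]


candidates = [
    Checkpoint("ckpt-a", quality=80, safety=95, latency_ms=400, cost=1.0),
    Checkpoint("ckpt-b", quality=82, safety=90, latency_ms=450, cost=1.0),
    Checkpoint("ckpt-c", quality=79, safety=94, latency_ms=420, cost=1.1),
]
print([c.name for c in pareto_frontier(candidates)])  # ['ckpt-a', 'ckpt-b']
```

The frontier narrows the field but does not pick a winner: choosing between ckpt-a and ckpt-b is still a product decision about how much safety and latency to trade for quality.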

9) Where lifecycle gates help and where metrics get ritualized¶

Stage-gated lifecycle thinking applies to training, fine-tuning, and vendor model selection. It becomes pathological when gates become bureaucratic rituals disconnected from user harm. It hits a limit when product requirements are not measurable enough to evaluate.

10) Wrong model: training is done when loss stops improving¶

Wrong model: "Training is complete when the loss stops improving."

Replacement: training is complete when the model satisfies product gates under cost and risk constraints better than available alternatives.

11) Other ways lifecycle decisions fool teams¶

public benchmark overfitting
eval contamination
aggregate score hides slice failures
latency ignored until serving
checkpoint cannot be reproduced
rollback path missing
safety gate measured only on obvious prompts
cost per quality point exceeds product value
adapter merge changes behavior unexpectedly

12) The same release-gate pattern in incidents and serving¶

This echoes production incident response: metrics matter only when tied to user impact and rollback decisions. It also prepares the next module: quantization and fine-tuning are lifecycle continuations driven by hardware and adaptation constraints.

13) Quick test: can this run change a decision?¶

What hypothesis does this run test?
What metric cannot be averaged away?
What checkpoint, tokenizer, dataset, and code revision would you roll back to?
What cost per quality point is acceptable?
Which failure belongs to model training versus serving or product policy?

Where lifecycle gates appear in model releases¶

Model cards — document intended use, evals, and limitations.
Eval harnesses — standardize regression tests across checkpoints.
Canary releases — compare behavior before full rollout.
A/B tests — measure real user preference with guardrails.
Red-team suites — stress policy and adversarial boundaries.
FinOps dashboards — connect quality gains to training/serving cost.
Registry systems — preserve checkpoint lineage.
Incident postmortems — feed new failures into data and eval design.
Shadow deployments — compare behavior without exposing all users.
Golden datasets — protect high-value slices from aggregate dilution.
Model registries — bind checkpoint, tokenizer, data, code, and eval result.
Release scorecards — force tradeoffs into explicit ship/no-ship decisions.
Regression bisection — identifies which stage introduced behavior drift.
Latency load tests — catch serving costs created by training choices.
Human review queues — inspect cases where automated evals disagree.
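Regression bisection from the list above can be sketched as a binary search over an ordered sequence of checkpoints. This version assumes the regression is monotone (once a gate starts failing, every later checkpoint also fails), which real lineages do not always satisfy; the stage names and the `gate_passes` callback are hypothetical.

```python
from typing import Callable, Optional, Sequence


def first_failing(checkpoints: Sequence[str],
                  gate_passes: Callable[[str], bool]) -> Optional[str]:
    """Return the earliest checkpoint that fails the gate, or None if the
    latest checkpoint still passes. Assumes failures are monotone."""
    if not checkpoints or gate_passes(checkpoints[-1]):
        return None
    lo, hi = 0, len(checkpoints) - 1  # invariant: checkpoints[hi] fails
    while lo < hi:
        mid = (lo + hi) // 2
        if gate_passes(checkpoints[mid]):
            lo = mid + 1   # regression was introduced after mid
        else:
            hi = mid       # mid already fails; look earlier
    return checkpoints[lo]


stages = ["base", "continued-pretrain", "sft", "preference-opt", "merged-adapter"]


def passes(name: str) -> bool:
    # Stand-in for a real eval run: pretend the first three stages pass.
    return stages.index(name) < 3


print(first_failing(stages, passes))  # preference-opt
```

Each call to `gate_passes` stands for a full eval run, so halving the search space matters when there are many intermediate checkpoints.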

What you should remember¶

This chapter explained why training is a sequence of decisions, not a pipeline poster. The important idea is that every stage needs a hypothesis, eval gate, cost budget, and rollback path before the run creates sunk cost.

You learned to choose the next knob by failure type and to judge checkpoints with non-averaged gates instead of one attractive score. That addresses the opening failure, where a benchmark improvement turned out to be a worse release because latency and refusals regressed; the same can happen with safety, format, or product slices.

Carry this diagnostic forward: when a checkpoint improves a headline metric, ask what decision that metric supports and what slice could be hiding damage. If the run cannot change a clear action, the eval is probably ritual instead of engineering.

Remember:

A better benchmark is not automatically a better release.
Every run needs a hypothesis before it needs a dashboard.
Some metrics must be gates, not averages.
Cost, latency, safety, and rollback are part of model quality.
If an eval cannot change a decision, it is probably ritual.
Ship only when the checkpoint, tokenizer, data, code, evals, and rollback path are tied together.

Check your understanding of eval-gated training¶

Why can a better aggregate score still be a worse release?
What belongs in a stage gate for the incident bot?
Why is rollback part of training lifecycle design?
How does this chapter bridge into quantization and fine-tuning?
What makes an eval actionable instead of merely interesting?
Why should some metrics be non-averaged gates?

Interview Q&A¶

Q. What makes an eval gate stronger than a benchmark number?
A. It ties a specific product hypothesis to non-averaged pass/fail thresholds, slices, costs, and rollback decisions.
Common wrong answer to avoid: "A bigger benchmark suite is always enough."

Q. When should you stop training even if loss improves?
A. When product-critical evals, safety, latency, cost, or regression gates no longer justify the marginal gain.
Common wrong answer to avoid: "Only stop at convergence."

Q. Why include serving constraints before the next module?
A. Training decisions affect model size, dtype, adapters, latency, and deployability; quantization/fine-tuning continue the same lifecycle under hardware pressure.
Common wrong answer to avoid: "Serving starts after training and is unrelated."

Q. What is a non-averaged gate?
A. A threshold that must pass independently because averaging it with other wins would hide unacceptable product or safety failure.
Common wrong answer to avoid: "All metrics should roll into one score."

Q. Why write the hypothesis before the run?
A. It prevents post-hoc storytelling and makes the result interpretable: the run either tested the intended pressure or revealed the hypothesis was wrong.
Common wrong answer to avoid: "We can decide what it meant after seeing the charts."

Q. What artifacts make rollback real?
A. The exact checkpoint, tokenizer, adapter, generation config, serving image, data revision, code revision, and eval record needed to restore behavior.
Common wrong answer to avoid: "Rollback means keep the old weights somewhere."
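The artifact list in the last answer can be turned into a manifest that is checked for completeness before release. This is a sketch under assumptions: the field names simply mirror that list, the values are opaque identifiers chosen by the team, and a model served without an adapter would need that field marked as not applicable rather than left empty.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass(frozen=True)
class RollbackManifest:
    """Identifiers needed to restore the previous behavior (hypothetical schema)."""
    checkpoint: Optional[str] = None
    tokenizer: Optional[str] = None
    adapter: Optional[str] = None
    generation_config: Optional[str] = None
    serving_image: Optional[str] = None
    data_revision: Optional[str] = None
    code_revision: Optional[str] = None
    eval_record: Optional[str] = None


def missing_artifacts(manifest: RollbackManifest) -> List[str]:
    """Names of fields that are unset or empty; an empty list means complete."""
    return [f.name for f in fields(manifest) if not getattr(manifest, f.name)]


# "Keep the old weights somewhere" fills in exactly one field.
weights_only = RollbackManifest(checkpoint="ckpt-previous")
print(missing_artifacts(weights_only))
```

Run against `weights_only`, the check lists the other seven fields as missing, which is the gap between storing weights and being able to restore behavior.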

Apply now (10 min)¶

Model the exercise: write a release gate for one assistant feature.
Your turn: add one slice metric that cannot be averaged away.
Reproduce from memory: draw train-run-eval-gate-ship/stop/revise.

Bridge. The clean lifecycle diagram still hides uncertainty. Some recipes work, but the field cannot fully predict data mixtures, preference proxies, emergent regressions, or user feel. → 10-honest-admission.md