06. Quality gates for ML — speed with a sober bouncer

~15 min read. Fast pipelines matter only when bad models stop at the door.

Built on the ELI5 in 00-eli5.md. The quality gate — the checkpoint before goods leave the factory — decides whether a candidate model deserves promotion.

Why one good metric can still hide a bad model

See.

Teams love one headline number because it feels clean. Real products are messier.

A candidate may improve AUC and still damage a critical customer segment. A ranking model may gain overall CTR while hurting new users badly.

A classifier may improve accuracy while calibration becomes useless for decision thresholds. That is why the quality gate must compare more than one metric.

It must look at overall quality, important slices, and operational guardrails together.
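Here is a small arithmetic sketch of how an average can hide a slice. The segment names, traffic shares, and accuracy numbers are all invented for illustration; only the weighting logic matters.

```python
# Hypothetical traffic shares and per-segment accuracy. All numbers are made up.
segments = {
    "returning_users": {"share": 0.90, "champion": 0.90, "candidate": 0.93},
    "new_users": {"share": 0.10, "champion": 0.85, "candidate": 0.70},
}


def overall(model: str) -> float:
    """Traffic-weighted average accuracy across all segments."""
    return sum(s["share"] * s[model] for s in segments.values())


champion_overall = overall("champion")    # 0.9 * 0.90 + 0.1 * 0.85
candidate_overall = overall("candidate")  # 0.9 * 0.93 + 0.1 * 0.70

print(f"overall: {champion_overall:.3f} -> {candidate_overall:.3f}")
for name, s in segments.items():
    print(f"{name}: {s['champion']:.2f} -> {s['candidate']:.2f}")
```

The headline number goes up because the large segment improved a little. The small segment lost 15 points, and the average swallowed it.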

Look at the flow first.

candidate model
      │
      ▼
┌───────────────┐
│ compare       │
│ headline      │
└──────┬────────┘
       ▼
┌───────────────┐
│ check slices  │
└──────┬────────┘
       ▼
┌───────────────┐
│ guardrails    │
│ latency safety│
│ cost calib.   │
└──────┬────────┘
       ▼
 approve / hold / reject

Simple, no?

The champion is the current production model. The candidate is the new model asking for entry.

The question is not, "Is the candidate good in isolation?" The question is, "Is it safe to replace the champion?"

So what to do? Always frame gate logic as candidate versus champion plus fixed business limits.

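That keeps comparisons honest when the environment is noisy: if both models are scored on the same evaluation data, a shift in that data moves both numbers instead of flattering one.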
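A minimal sketch of that framing for one higher-is-better metric. The function name `passes_metric` and the parameters `floor` and `max_drop` are hypothetical, chosen only for this example.

```python
def passes_metric(candidate: float, champion: float,
                  floor: float, max_drop: float) -> bool:
    """Higher-is-better metric.

    Pass only if the candidate clears the fixed business floor AND does not
    regress from the champion by more than max_drop.
    """
    above_floor = candidate >= floor
    within_regression = (champion - candidate) <= max_drop
    return above_floor and within_regression


# Clears the floor, but regresses too far from the champion.
print(passes_metric(candidate=0.80, champion=0.86, floor=0.75, max_drop=0.01))
# Matches the champion, but both sit under the business floor.
print(passes_metric(candidate=0.70, champion=0.70, floor=0.75, max_drop=0.01))
# Clears the floor and improves on the champion.
print(passes_metric(candidate=0.88, champion=0.86, floor=0.75, max_drop=0.01))
```

Only the third call returns True. The first case is what a fixed threshold alone would miss, and the second is what a champion comparison alone would miss.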

Build the gate like a sequence, not a vibe

The gate should behave like a small decision engine, not like a meeting.

First, the candidate enters with metadata, training context, and evaluation results.

Second, compare headline metrics against the champion and against minimum thresholds. Third, check slice metrics for groups that actually matter to the product.

Fourth, check guardrails like latency, calibration, safety incidents, and serving cost. Finally, return one clear outcome: approve, hold, or reject.

Here is the sequence in a slightly richer picture.

candidate enters
      │
      ▼
┌──────────────────┐
│ compare headline │
│ metrics          │
└────────┬─────────┘
         ▼
┌──────────────────┐
│ compare slices   │
│ region, segment, │
│ language, price  │
└────────┬─────────┘
         ▼
┌──────────────────┐
│ check guardrails │
│ latency safety   │
│ cost calibration │
└────────┬─────────┘
         ▼
┌──────────────────┐
│ approve / hold / │
│ reject           │
└──────────────────┘

Yes?

This design matters because failure modes are different.

A weak headline metric means broad quality damage. A weak slice metric means local damage hiding inside averages.

A broken latency or cost guardrail means the model may be accurate but operationally unshippable. A broken safety check means you should stop arguing and block promotion.

That is the job of the quality gate. It turns arguments into policy.
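The sequence can be sketched as a tiny decision engine. This is one possible shape, not a reference design. The stage functions, the dictionary keys, and every limit (2-point slice drop, 130 ms, 25% cost increase) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StageResult:
    stage: str
    outcome: str  # "pass", "hold", or "fail"
    reason: str


# A stage takes (candidate, champion) metric dicts and returns a StageResult.
Stage = Callable[[dict, dict], StageResult]


def check_headline(cand: dict, champ: dict) -> StageResult:
    ok = cand["auc"] >= champ["auc"]
    return StageResult("headline", "pass" if ok else "fail",
                       f"auc {champ['auc']} -> {cand['auc']}")


def check_slices(cand: dict, champ: dict) -> StageResult:
    # Recall drop per slice; a positive value means the candidate got worse.
    drops = {name: champ["slice_recall"][name] - cand["slice_recall"][name]
             for name in champ["slice_recall"]}
    worst = max(drops, key=drops.get)
    ok = drops[worst] <= 0.02  # hypothetical limit: 2 percentage points
    return StageResult("slices", "pass" if ok else "fail",
                       f"worst slice {worst}: drop {drops[worst]:.2f}")


def check_guardrails(cand: dict, champ: dict) -> StageResult:
    if cand["p95_ms"] > 130:  # hypothetical hard latency budget
        return StageResult("guardrails", "fail", f"p95 {cand['p95_ms']} ms")
    if cand["monthly_cost"] > 1.25 * champ["monthly_cost"]:  # hypothetical review trigger
        return StageResult("guardrails", "hold", "cost increase needs human review")
    return StageResult("guardrails", "pass", "within limits")


def run_gate(cand: dict, champ: dict, stages: list) -> tuple:
    """Run every stage in order. Any fail rejects; otherwise any hold holds."""
    results = [stage(cand, champ) for stage in stages]
    outcomes = {r.outcome for r in results}
    if "fail" in outcomes:
        return "reject", results
    if "hold" in outcomes:
        return "hold", results
    return "approve", results


# Invented numbers for illustration.
champion = {"auc": 0.84, "slice_recall": {"new_users": 0.80, "returning": 0.85},
            "p95_ms": 110, "monthly_cost": 18_000}
candidate = {"auc": 0.85, "slice_recall": {"new_users": 0.80, "returning": 0.86},
             "p95_ms": 120, "monthly_cost": 26_000}

decision, results = run_gate(candidate, champion,
                             [check_headline, check_slices, check_guardrails])
print(decision)
for r in results:
    print(f"  {r.stage}: {r.outcome} ({r.reason})")
```

The engine runs every stage even after a failure. That is a design choice: the full list of results is more useful in review than the first failure alone.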

What the gate should check before promotion

Start with headline metrics. These are the summary numbers leadership asks about first.

Examples include accuracy, F1, recall, NDCG, CTR lift, or task success rate. But do not stop there.

Slice metrics ask, "Which users or situations are getting worse?" Useful slices may include country, device class, payment tier, language, or high-risk transaction size.
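A rough sketch of per-slice recall, assuming evaluation rows arrive as `(slice_name, y_true, y_pred)` tuples with 0/1 labels. The row format, the function name `recall_by_slice`, and the sample data are assumptions for this example.

```python
from collections import defaultdict


def recall_by_slice(rows):
    """Return {slice_name: recall} for every slice with at least one positive."""
    true_positives = defaultdict(int)
    positives = defaultdict(int)
    for slice_name, y_true, y_pred in rows:
        if y_true == 1:
            positives[slice_name] += 1
            true_positives[slice_name] += int(y_pred == 1)
    return {name: true_positives[name] / positives[name] for name in positives}


# Made-up evaluation rows: (slice, true label, predicted label).
rows = [
    ("mobile", 1, 1), ("mobile", 1, 1), ("mobile", 1, 1), ("mobile", 1, 0),
    ("mobile", 0, 0),
    ("desktop", 1, 1), ("desktop", 1, 1), ("desktop", 0, 1),
]

per_slice = recall_by_slice(rows)
print(per_slice)
```

Slices with no positive examples are left out of the result instead of reporting a misleading zero. In a real gate, a slice that is too small to measure is itself a reason to hold.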

Regression thresholds make the decision explicit. For example, recall may not drop more than 1 percentage point.

Calibration matters when the score itself drives a downstream action. A score of 0.8 should mean the same thing from one release to the next: about 80% of the cases scored 0.8 turn out positive.
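One common way to put a number on this is binned expected calibration error (ECE). The text does not say which calibration error it has in mind, so treat ECE as an assumption here. The sketch below is plain Python with equal-width bins.

```python
def expected_calibration_error(scores, labels, n_bins=10):
    """Binned ECE for binary labels.

    For each equal-width score bin, compare the mean predicted score with the
    observed positive rate, then average the gaps weighted by bin size.
    """
    bins = [[] for _ in range(n_bins)]
    for score, label in zip(scores, labels):
        index = min(int(score * n_bins), n_bins - 1)  # score 1.0 goes in the last bin
        bins[index].append((score, label))

    total = len(scores)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_score = sum(s for s, _ in members) / len(members)
        positive_rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_score - positive_rate)
    return ece


# Five cases scored 0.8: four positives means the score is well calibrated.
well_calibrated = expected_calibration_error([0.8] * 5, [1, 1, 1, 1, 0])
# Same scores, one positive: the score overstates the risk by 0.6.
overconfident = expected_calibration_error([0.8] * 5, [1, 0, 0, 0, 0])
print(f"{well_calibrated:.2f} {overconfident:.2f}")
```

Both examples have identical scores, so any ranking metric would treat them the same. Only the calibration check sees the difference.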

Latency matters because a good model that times out is still a bad product experience. Safety matters because the model may create harmful, biased, or policy-breaking outputs.
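For the latency check, the mean is the wrong summary. A short sketch of a nearest-rank P95, using made-up samples and a 130 ms budget chosen only for illustration:

```python
def p95(latencies_ms):
    """Nearest-rank 95th percentile: the smallest sample such that at least
    95% of all samples are at or below it."""
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math, 1-based
    return ordered[rank - 1]


# Made-up request latencies: mostly fast, with a slow tail.
samples_ms = [90] * 18 + [160, 400]
budget_ms = 130  # hypothetical budget

mean_ms = sum(samples_ms) / len(samples_ms)
p95_ms = p95(samples_ms)
print(f"mean={mean_ms:.0f} ms, p95={p95_ms} ms, within budget: {p95_ms <= budget_ms}")
```

The mean sits comfortably under the budget while the P95 does not. Users in the tail are the ones who see the timeout.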

Cost matters because some upgrades are too expensive for the value they add.

Look at one compact gate checklist.

headline metrics   ──→ did the main task improve enough?
slice metrics      ──→ did any critical segment collapse?
regression limits  ──→ did anything fall past allowed thresholds?
calibration        ──→ can downstream users trust the score?
latency            ──→ can production serve this on time?
safety             ──→ did harmful behavior increase?
cost               ──→ is the gain worth the spend?

Simple, no?

Notice how each line protects a different kind of failure.

One metric cannot cover all of that. AUC alone definitely cannot.

A tiny example of champion versus candidate gating

Suppose the current champion for loan approval ranking has these results.

AUC is 0.84. Approval decision calibration error is 0.03. P95 latency is 110 ms.

For first-time applicants, default-risk recall is 0.79. Monthly serving cost is $18,000.

Now the candidate arrives. AUC improves to 0.86, which looks like celebration time.

But first-time applicant recall falls to 0.68. Calibration error worsens to 0.09.

P95 latency climbs to 165 ms. Monthly serving cost jumps to $29,000.

Look at the table.

metric                              champion   candidate   gate
AUC                                 0.84       0.86        pass
first-time applicant recall         0.79       0.68        fail
calibration error                   0.03       0.09        fail
p95 latency                         110 ms     165 ms      fail
monthly serving cost                $18k       $29k        hold

See the trap.

If the team watched only AUC, the bad candidate would look better. If the quality gate checks slices and guardrails, the candidate is blocked.

So what to do? Write explicit promotion rules before the excitement starts.

For example, require that no critical slice drops by more than 2 percentage points, that calibration error stays below 0.05, and that P95 latency stays under 130 ms.

That way the decision stays boring and repeatable.
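The loan example can be replayed in a few lines. The metric values and the three limits come from the text above. The AUC rule (candidate must not be lower) and the cost rule (hold when cost rises more than 25%) are assumptions, since the text gives no number for either.

```python
champion = {"auc": 0.84, "first_time_recall": 0.79, "calibration_error": 0.03,
            "p95_latency_ms": 110, "monthly_cost_usd": 18_000}
candidate = {"auc": 0.86, "first_time_recall": 0.68, "calibration_error": 0.09,
             "p95_latency_ms": 165, "monthly_cost_usd": 29_000}


def evaluate(champ: dict, cand: dict) -> dict:
    """Return a per-metric verdict: "pass", "hold", or "fail"."""
    verdicts = {}
    # Assumption: headline passes when AUC does not get worse.
    verdicts["auc"] = "pass" if cand["auc"] >= champ["auc"] else "fail"
    # Rule from the text: no critical slice drop beyond 2 percentage points.
    drop = champ["first_time_recall"] - cand["first_time_recall"]
    verdicts["first_time_recall"] = "fail" if drop > 0.02 else "pass"
    # Rule from the text: calibration error below 0.05.
    verdicts["calibration_error"] = (
        "pass" if cand["calibration_error"] < 0.05 else "fail")
    # Rule from the text: P95 latency under 130 ms.
    verdicts["p95_latency_ms"] = "pass" if cand["p95_latency_ms"] < 130 else "fail"
    # Assumption: a cost rise above 25% goes to human review instead of failing.
    ratio = cand["monthly_cost_usd"] / champ["monthly_cost_usd"]
    verdicts["monthly_cost_usd"] = "hold" if ratio > 1.25 else "pass"
    return verdicts


def decide(verdicts: dict) -> str:
    """Any fail rejects; otherwise any hold holds; otherwise approve."""
    values = set(verdicts.values())
    if "fail" in values:
        return "reject"
    if "hold" in values:
        return "hold"
    return "approve"


verdicts = evaluate(champion, candidate)
outcome = decide(verdicts)
for metric, verdict in verdicts.items():
    print(f"{metric}: {verdict}")
print("decision:", outcome)
```

The verdicts match the gate column in the table, and the three hard failures make the final decision a reject regardless of the AUC gain.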

Approve, hold, and reject mean different things

Approve means the candidate passed required checks and may move forward to the registry or rollout path.

Hold means the evidence is incomplete or suspicious, or the tradeoff is costly enough to need human review. Reject means the candidate failed a hard rule and should not be promoted.

Teams often misuse hold as a polite reject. That creates confusion later.

Use hold for ambiguity, not as a way to avoid saying no. Use reject when a clear rule was violated.

Use approve only when the evidence is actually good. A healthy gate also records why the decision happened.

Which metrics passed? Which slices failed? Which guardrail caused the stop? That trace matters during incident review and model audits.
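A sketch of what such a trace could look like as a structured record. The field names, the `GateDecision` class, and the model identifiers are hypothetical. The point is only that the record answers the three questions above without anyone having to remember the meeting.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class GateDecision:
    candidate_id: str
    champion_id: str
    outcome: str      # "approve", "hold", or "reject"
    passed: list      # checks that passed
    held: list        # checks that need human review
    failed: list      # checks that caused the stop
    decided_at: str   # ISO 8601 timestamp in UTC


def record_decision(candidate_id, champion_id, outcome, verdicts):
    """Group per-check verdicts into a record that can be logged and audited."""
    def names(wanted):
        return sorted(k for k, v in verdicts.items() if v == wanted)

    return GateDecision(
        candidate_id=candidate_id,
        champion_id=champion_id,
        outcome=outcome,
        passed=names("pass"),
        held=names("hold"),
        failed=names("fail"),
        decided_at=datetime.now(timezone.utc).isoformat(),
    )


# Hypothetical model ids and verdicts for illustration.
record = record_decision(
    "loan-ranker-candidate", "loan-ranker-champion", "reject",
    {"auc": "pass", "first_time_recall": "fail", "calibration_error": "fail",
     "p95_latency_ms": "fail", "monthly_cost_usd": "hold"},
)
line = json.dumps(asdict(record))  # one JSON line per decision
print(line)
```

Writing one JSON line per decision keeps the trace easy to append to a log and easy to search during an incident review.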

Look.

The gate is not there to slow teams down. It is there to protect speed from becoming recklessness.

A weak gate gives false confidence. A strong gate gives reliable velocity.

That is the whole point.

Where this lives in the wild

Google Ads bidding — ML software engineer: compares new auction models against champions using revenue, fairness slices, and latency limits.
Stripe Radar — risk ML engineer: blocks candidate fraud models that improve overall metrics while hurting high-value merchant protection.
LinkedIn feed ranking — relevance engineer: checks slice performance for new members, geographies, and device classes before promotion.
Airbnb search ranking — marketplace scientist: evaluates booking lift together with host-region slices and serving cost guardrails.
Uber ETA prediction — applied scientist: promotes only candidates that improve error while staying within latency budgets for rider requests.

Pause and recall

Why is candidate-versus-champion framing stronger than a single absolute metric?
What kinds of slice metrics usually belong in a quality gate?
Why can a model with better AUC still be a bad promotion?
When should the gate approve, hold, or reject?

Interview Q&A

Q: Why is a single headline metric a weak promotion criterion for ML models?
A: One headline number can hide failures in critical user segments, calibration, latency, safety, or cost. Promotion decisions need a portfolio of checks because production risk is multi-dimensional.
Common wrong answer to avoid: "Because metrics are subjective." Many are objective; the issue is incomplete coverage.

Q: Why compare a candidate to the champion instead of only to a fixed threshold?
A: Fixed thresholds catch absolute failure, while champion comparison detects regressions relative to what users already experience. Using both protects against silent backsliding and stale minimum bars.
Common wrong answer to avoid: "Because champions are always optimal." They are just the current baseline, not perfection.

Q: Why does calibration belong in a quality gate for some systems?
A: When downstream actions depend on score confidence, poor calibration makes thresholds misleading even if ranking quality looks fine. That can break approvals, alerts, and prioritization logic.
Common wrong answer to avoid: "Calibration only matters for research dashboards." It often matters in live decisions.

Q: When should a team use hold instead of reject?
A: Use hold when evidence is incomplete or a costly tradeoff needs review, not when a hard rule already failed. Reject is for clear policy violations.
Common wrong answer to avoid: "Hold means the same as reject, but sounds nicer." That language drift destroys decision clarity.

Apply now (5 min)

Exercise. Pick one model you know and write a five-line gate policy. Include one headline metric, one critical slice, one latency rule, one safety rule, and one cost rule.

Then write what would trigger approve, hold, and reject.

Sketch from memory. Draw the quality gate flow from candidate entry to compare, slice checks, guardrails, and final decision.

Label one example of a hidden failure that a single metric would miss.

Bridge. A strong gate checks model quality, but quality still collapses when training features and serving features disagree. Next we study feature stores, which reduce that train-serve skew. → 07-feature-stores.md