06. Evaluation and A/B testing

⏱️ Estimated time: 21 min | Level: advanced

ELI5 callback: In our chain, the kitchen trains, the prep station prepares, the recipe book stores, the serving counter serves, and the quality inspector checks. Same restaurant chain, different platform layer.

Evaluation asks whether a change is useful, not merely different

A new model matters only if it improves the right outcome. That sentence sounds small. It saves whole quarters. Offline gains are clues, not verdicts. See. The quality inspector decides with evidence, not excitement. The kitchen may produce many clever candidates. The prep station must keep feature definitions stable during comparison. The recipe book should tell you exactly which versions are competing. The serving counter must route traffic safely between them. So what to do? Define the decision rule before the test starts. Pick a primary metric, guardrails, a duration, and a rollback threshold. Make sure product and engineering agree on the same scorecard. Simple, no? Clear criteria reduce political debates later. Evaluation is a release discipline, not a final chart.

candidate vs baseline
      │
      ├── offline check
      ├── online exposure
      ├── metric compare
      └── decision rule
             ↓

Decide success criteria before seeing results.
Keep baseline and challenger clearly identified.
Include guardrails for latency, errors, and fairness.
Tie experiment design to business decision timing.
Good evaluation protects the team's focus as much as it protects production.
Now watch. Offline metrics still matter, but only in context.
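
The decision rule described above can be written down as data before any traffic flows. The following is a minimal sketch, not a standard API: the `DecisionRule` fields, the `decide` function, the metric names, and every threshold value are assumptions chosen for illustration. It also assumes that a breached guardrail triggers rollback, which is one possible policy among several.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class DecisionRule:
    """Hypothetical scorecard, frozen before launch so it cannot drift mid-test."""
    primary_metric: str
    min_lift: float               # smallest relative lift worth shipping
    guardrails: Dict[str, float]  # metric -> max tolerated relative increase
    rollback_drop: float          # relative drop in primary metric that forces rollback
    min_days: int                 # no ship/hold verdict before this many days


def decide(rule, days_run, baseline, candidate):
    """Return 'rollback', 'keep_running', 'ship', or 'hold'."""
    def rel_change(name):
        return (candidate[name] - baseline[name]) / baseline[name]

    lift = rel_change(rule.primary_metric)
    if lift <= -rule.rollback_drop:
        return "rollback"
    # Assumed policy: any guardrail breach rolls back, even on day one.
    if any(rel_change(m) > limit for m, limit in rule.guardrails.items()):
        return "rollback"
    if days_run < rule.min_days:
        return "keep_running"
    return "ship" if lift >= rule.min_lift else "hold"


# Illustrative numbers only.
rule = DecisionRule(
    primary_metric="conversion",
    min_lift=0.02,
    guardrails={"p95_latency_ms": 0.10, "error_rate": 0.05},
    rollback_drop=0.03,
    min_days=14,
)
baseline = {"conversion": 0.050, "p95_latency_ms": 120.0, "error_rate": 0.010}
candidate = {"conversion": 0.052, "p95_latency_ms": 125.0, "error_rate": 0.010}
print(decide(rule, 14, baseline, candidate))  # prints "ship"
```

Because the rule is an immutable object agreed on up front, product and engineering argue about its values once, before the results arrive.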

Offline evaluation is the first filter

Use offline tests to reject obviously weak or risky models. Compare against current production, not against memory. Check precision, recall, calibration, ranking metrics, or task-specific loss. Then slice by geography, segment, language, and device. See. Hidden regressions often sit inside a small cohort. Confidence intervals matter even offline when datasets are limited. Error analysis matters more than pretty aggregate charts. Read bad examples one by one. So what to do? Create an offline checklist before promotion. Include quality, robustness, data freshness, and a cost estimate. Also test threshold sensitivity when decisions depend on cutoffs. Small threshold moves can change operations dramatically. Now watch. Offline readiness simply earns the right to test online.

holdout set
   │
   ├── global metrics
   ├── slice metrics
   ├── error review
   └── readiness verdict
            ↓

Use offline evaluation to eliminate weak candidates early.
Keep cohort and threshold analysis mandatory.
Review examples that cause operational pain.
Estimate serving cost before online exposure.
Offline readiness is necessary but not sufficient.
Simple, no? Paper wins do not pay production bills.
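
To make the slice check concrete, here is a small sketch in plain Python. The row layout (`slice`, `label`, `baseline_pred`, `candidate_pred`), the helper names, and the 0.05 tolerance are all assumptions for illustration. The toy data is built so that the candidate wins on global recall while one cohort quietly regresses. With real data, slices this small would also need confidence intervals before anyone acts on them.

```python
from collections import defaultdict


def overall_recall(rows, pred_key):
    positives = [r for r in rows if r["label"] == 1]
    return sum(r[pred_key] for r in positives) / len(positives)


def recall_by_slice(rows, pred_key):
    """Recall per cohort; rows carry 0/1 labels and 0/1 predictions."""
    hits, positives = defaultdict(int), defaultdict(int)
    for r in rows:
        if r["label"] == 1:
            positives[r["slice"]] += 1
            hits[r["slice"]] += r[pred_key]
    return {s: hits[s] / positives[s] for s in positives}


def regressed_slices(rows, tolerance=0.05):
    """Slices where the candidate's recall falls more than `tolerance` below baseline."""
    base = recall_by_slice(rows, "baseline_pred")
    cand = recall_by_slice(rows, "candidate_pred")
    return sorted(s for s in base if base[s] - cand[s] > tolerance)


def make_rows(slice_name, n_pos, base_hits, cand_hits):
    """Toy generator: n_pos positive examples, of which each model catches the first k."""
    return [{"slice": slice_name, "label": 1,
             "baseline_pred": int(i < base_hits),
             "candidate_pred": int(i < cand_hits)} for i in range(n_pos)]


rows = make_rows("desktop", 6, 3, 6) + make_rows("mobile", 4, 4, 2)
print(overall_recall(rows, "baseline_pred"))   # 0.7
print(overall_recall(rows, "candidate_pred"))  # 0.8
print(regressed_slices(rows))                  # ['mobile']
```

The aggregate says "promote". The mobile slice says "read the errors first".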

Shadow mode and canary rollout reduce blast radius

Shadow mode runs the new model without affecting user-visible decisions. This reveals latency, feature, and logging issues safely. It also helps compare predictions against the live baseline. Canary rollout then sends a small share of real traffic to the candidate. See. These are not the same thing. Shadow mode tests operational readiness without business risk. Canary tests business effect with limited risk. So what to do? Start with shadow when runtime uncertainty is high. Move to canary once contracts and performance look sane. Watch traffic allocation, exposure bias, and rollback triggers closely. Also protect dependent services from doubled load in shadow mode. Mirrored traffic can surprise budgets. Now watch. Safe rollout is infrastructure plus statistical hygiene.

live request
    │
    ├── baseline decision → user
    └── shadow candidate → logs only
            │
            └── compare + inspect
                    ↓

Use shadow mode to validate runtime behavior first.
Use canaries for real business impact under controlled exposure.
Define rollback triggers before traffic flows.
Account for duplicate load during shadow evaluation.
Safety steps are cheaper than cleanup steps.
See. Smaller blast radius means calmer learning.
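
The shadow pattern can be sketched in a few lines. This is an illustration under stated assumptions, not a serving framework: `serve_with_shadow`, `disagreement_rate`, the in-memory list used as a log, and the two toy models are all hypothetical. The candidate is called synchronously here for simplicity. A real deployment would usually mirror the request asynchronously so the shadow call cannot add user-facing latency.

```python
import time


def serve_with_shadow(request, baseline_model, candidate_model, shadow_log):
    """Return the baseline decision; record the candidate's output for later review."""
    decision = baseline_model(request)
    started = time.perf_counter()
    try:
        shadow, error = candidate_model(request), None
    except Exception as exc:  # a shadow failure must never reach the user
        shadow, error = None, repr(exc)
    shadow_log.append({
        "request": request,
        "baseline": decision,
        "candidate": shadow,
        "candidate_latency_s": time.perf_counter() - started,
        "error": error,
    })
    return decision


def disagreement_rate(shadow_log):
    """Share of successfully scored requests where the two models disagree."""
    scored = [e for e in shadow_log if e["error"] is None]
    if not scored:
        return 0.0
    return sum(e["baseline"] != e["candidate"] for e in scored) / len(scored)


# Toy models: flag a transaction amount as risky above a cutoff.
def baseline(amount):
    return amount >= 50


def candidate(amount):
    if amount < 0:
        raise ValueError("negative amount")  # a contract bug shadow mode should surface
    return amount >= 40


log = []
decisions = [serve_with_shadow(a, baseline, candidate, log) for a in (10, 45, 60, -5)]
print(decisions)  # [False, False, True, False], always the baseline's answer
print(sum(e["error"] is not None for e in log), "shadow error(s) logged")
```

The user only ever sees the baseline's decisions. The log is where the candidate's latency, errors, and disagreements get inspected.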

A/B tests and interleaving need statistical discipline

A/B tests split traffic and compare outcomes between variants. Interleaving mixes ranked results from two systems in one session. Interleaving is useful for search and ranking because each session compares both rankers directly, so differences show with less traffic. Both approaches need careful assignment rules. See. Sample contamination can invalidate beautiful dashboards. Keep users stable in one bucket when carryover matters. Avoid switching variants every request without a reason. Decide the minimum detectable effect before launch. So what to do? Compute sample size and expected duration upfront. Watch novelty effects and day-of-week effects. Use sequential testing rules carefully if you peek early. Stopping because a chart looks good is not science. Now watch. Significance without practical value is still a weak win.

users
  ├── bucket A → baseline
  ├── bucket B → candidate
  ├── log outcomes
  ├── test significance
  └── evaluate lift
         ↓

Stabilize assignment when user memory or behavior matters.
Estimate duration and effect size before launch.
Avoid peeking habits that inflate false positives.
Compare practical lift, not p-values alone.
Statistical discipline protects product trust.
Simple, no? Numbers need manners too.
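
Two of the steps above, stable assignment and upfront sizing, fit in a short sketch. Everything here is an assumption for illustration: the function names, the hash-based bucketing scheme, and the standard normal-approximation formula for comparing two proportions. The duration estimate further assumes that every daily user is a new, unique user, which overstates how fast the sample fills when visitors return.

```python
import hashlib
import math
from statistics import NormalDist


def assign_bucket(user_id, experiment, candidate_share=0.5):
    """Deterministic assignment: the same user always lands in the same bucket."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    position = int.from_bytes(digest[:8], "big") / 2**64  # scaled into [0, 1)
    return "candidate" if position < candidate_share else "baseline"


def sample_size_per_arm(baseline_rate, mde_abs, alpha=0.05, power=0.8):
    """Users per arm to detect an absolute lift of mde_abs (two-sided z-test)."""
    p1, p2 = baseline_rate, baseline_rate + mde_abs
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / mde_abs ** 2)


def duration_days(n_per_arm, daily_users, arms=2):
    """Days to fill all arms, rounded up to whole weeks to cover day-of-week effects."""
    days = math.ceil(n_per_arm * arms / daily_users)
    return math.ceil(days / 7) * 7


n = sample_size_per_arm(baseline_rate=0.10, mde_abs=0.01)
print(n, "users per arm")
print(duration_days(n, daily_users=5000), "days")
print(assign_bucket("user-42", "ranker-v2"))
```

Salting the hash with the experiment name keeps bucket membership independent across experiments. Note how the sample size scales: halving the minimum detectable effect roughly quadruples the users needed, which is why the effect size has to be decided before launch rather than discovered afterwards.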

Decision frameworks must combine metrics, cost, and ethics

Sometimes a candidate improves one metric and hurts another. Maybe conversion rises but complaints rise too. Maybe relevance improves but latency doubles. See. Promotion is a portfolio decision, not a single-column sort. So what to do? Use a release scorecard with weighted priorities. State hard no-go thresholds for safety or fairness harms. State soft tradeoff ranges for cost and latency. Bring domain experts into the review when impact is sensitive. Also document what you still do not know. Short tests cannot reveal all long-term feedback loops. That is fine when you say it clearly. Now watch. Mature teams launch with humility and guardrails together. Then they keep measuring after rollout rather than declaring victory.

quality lift
    │
    ├── business gain
    ├── latency cost
    ├── fairness / risk
    └── final release call
              ↓

Use scorecards when metrics pull in different directions.
Hard guardrails should stop unsafe wins immediately.
Keep unknowns documented beside decisions.
Post-launch monitoring remains part of evaluation.
Good release judgment mixes stats with domain context.
See. Wisdom begins where one metric stops ruling.
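
A release scorecard with hard and soft limits might look like the sketch below. The `release_call` function, the metric names, the weights, and the limits are hypothetical. The sketch assumes every delta has already been normalized to a comparable scale (for example, relative change) and signed so that positive means better.

```python
def release_call(deltas, hard_limits, weights):
    """Combine metric deltas into a go / hold / no-go call.

    deltas:      metric -> change vs baseline, signed so positive is better
    hard_limits: metric -> worst acceptable delta (crossing it is a no-go)
    weights:     metric -> importance in the soft, weighted score
    """
    violations = sorted(m for m, floor in hard_limits.items() if deltas[m] < floor)
    if violations:
        # Hard guardrails win: no weighted score can buy back a safety breach.
        return {"call": "no-go", "violations": violations, "score": None}
    score = sum(weights[m] * deltas[m] for m in weights)
    return {"call": "go" if score > 0 else "hold", "violations": [], "score": score}


# Illustrative numbers only.
deltas = {"conversion": 0.03, "latency": -0.04, "fairness_gap": -0.01, "complaints": -0.02}
hard_limits = {"fairness_gap": -0.02, "complaints": -0.05}
weights = {"conversion": 0.6, "latency": 0.2, "complaints": 0.2}

print(release_call(deltas, hard_limits, weights)["call"])                               # go
print(release_call({**deltas, "fairness_gap": -0.05}, hard_limits, weights)["call"])    # no-go
print(release_call({**deltas, "latency": -0.50}, hard_limits, weights)["call"])         # hold
```

The hard limits are checked before the weighted score is even computed, so an unsafe win stops immediately. The score handles the soft tradeoffs. What the scorecard cannot encode, such as long-term feedback loops, still belongs in the written list of unknowns.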

Where this lives in the wild

A ranking team uses interleaving because subtle ordering changes appear faster than in plain A/B tests.
A fraud system shadows new models first to validate logging and latency before business exposure.
A recommendation team defines rollback on both conversion drop and complaint rate rise.
A lending workflow team requires domain review because short experiments miss long-term fairness effects.
A search platform computes minimum detectable effect before every major online experiment.

Pause and recall

Why must success criteria be defined before a test begins?
What extra value does shadow mode provide beyond offline metrics?
When is interleaving more useful than a plain A/B test?
Why can statistical significance still lead to a bad release decision?

Interview Q&A

Q: How do you evaluate a new model before full rollout? A: Use offline validation first, then shadow or canary exposure, then controlled online testing with predefined success and rollback rules. Common wrong answer to avoid: Train it, see a better metric, and ship to everyone.

Q: What is the difference between shadow mode and canary? A: Shadow mode mirrors traffic without affecting decisions, while canary sends a small percentage of real decisions through the candidate. Common wrong answer to avoid: They are basically the same thing with different names.

Q: Why is p-value alone not enough? A: Because a statistically significant change may be too small to matter or may hurt cost, latency, or fairness guardrails. Common wrong answer to avoid: Once the p-value is under the threshold, the model is automatically better.

Q: How do you avoid bad A/B test conclusions? A: Stabilize assignment, size the experiment properly, avoid uncontrolled peeking, and interpret lift alongside domain context. Common wrong answer to avoid: Refresh the dashboard often and trust your intuition.

Apply now (5 min)

Choose one model change you want to test. Write the primary metric, two guardrails, traffic plan, and rollback threshold. Then decide whether shadow, canary, A/B, or interleaving fits best. Finally, write one unknown that the test still cannot answer. That habit keeps evaluation honest.

Bridge. New model validated. But what if it degrades over time? → 07