
10. A/B testing — when the offline winner loses the live argument

~18 min read. Offline evals tell you the new prompt is better. Live traffic tells you who is right. This chapter is about the gap between those two answers, and how to run the comparison so the live answer is the one you trust.

Builds on 09-drift-detection.md. Drift detection tells us when quality slips on its own; the shift change — A/B testing — is how we make a deliberate change without slipping ourselves. Same instrument family, opposite causality: drift is involuntary degradation; an A/B is voluntary perturbation.


What earlier chapters fixed and what still goes wrong on launch day

By now the refund chatbot has a rubric, a golden set, a calibrated judge, a drift detector, and dashboards that update every fifteen minutes. The team can prove the current prompt scores 62% policy-correct on the live distribution and that it has held steady for three weeks. Drift detection is quiet. The shift table is clean.

Then a teammate proposes a new system prompt. Offline, on the same 500-row golden set used since chapter 03, the candidate scores 67%. Same judge, same rubric, same data. A +5pp offline lift is exactly the kind of signal the module has been training the team to take seriously. The temptation is to ship.

What the previous chapters did not teach is the gap between "this answer is better when measured against a frozen rubric" and "this answer is better when a real customer is reading it on a phone at 11pm." Offline evals freeze the question, the input, and the grader. Live traffic refuses to freeze any of those. Users behave differently when an answer feels terse. Sessions get shorter. CSAT moves. The eval score and the customer relationship can disagree, and when they do, the eval score is not the one the CFO will care about.

What this file solves

This chapter shows how to compare a candidate prompt, model, retriever, or agent against the current one under live traffic, with enough statistical care that the result is decision-grade. It walks the refund chatbot's new system prompt through a 7-day 50/50 A/B with 8,400 chats, shows how a +6pp lift in policy-correct collides with a -4pt drop in CSAT, and teaches the move that matters most: when the two numbers disagree, which one is the lesson. The first concrete artifact you will learn to build is a sample-size table that tells you, before the experiment starts, whether the experiment can answer the question at all.

Why offline evals are necessary but never sufficient

The offline eval is a controlled lab. The prompt is fixed, the inputs are fixed, the judge is fixed, and the outcome being measured is whatever the rubric says. Of those four, only the prompt stays fixed in production. The inputs drift: last week's refund questions are not next week's. The judge is the same model, but the user is a new judge, with new criteria, none of them written down. And the outcome, the thing you actually care about, is not "did the rubric pass?" It is "did the user get what they wanted, want to come back, recommend us, not file a chargeback?" None of those four behaviors is the same thing as a rubric score.

The refund bot's old prompt is verbose, friendly, sometimes over-explains. The new prompt is crisper — "Refund of ₹1,250 will reach your account in 5–7 business days. Anything else?" — and on the rubric it earns full marks more often because every required clause appears and nothing extraneous distracts the judge. Offline lift: +5pp. But the rubric never asked "does this feel like a person trying to help you?" That dimension exists only in the user's head, and the only way to measure it is to let users vote with the things they do next: rate the chat, escalate, churn, come back tomorrow.

So the real cause is not that the offline eval was rigged. It is that the eval measures answer shape and the user measures answer feel. So how do we measure the second one without shipping the change to everyone first?

The naive repair and where it visibly breaks

A smart-but-impatient version of this story goes: "the offline result was +5pp, let's ship to 100% and watch the dashboard." On day one CSAT drops two points. The team waits. Day two it drops two more. On day three someone notices that retention on the "refund denied" slice has fallen 9%, but by then 100% of users have already had a worse week and the rollback brings its own anger — "why did the bot get rude and then go back to normal?" The team learned something, at a price the business did not budget for.

The slightly-less-naive version is: "let's just run it for one day, eyeball the numbers, decide tomorrow." That fails for a different reason. One day on 8,000 chats might look like the new prompt is winning by 2pp. But the standard error on a one-day comparison at that traffic is roughly ±1.7pp; a 2pp lift is not distinguishable from luck. Peeking at the dashboard and shipping on day-one momentum is how you ship variance.

Not a "need a bigger model" problem. Not a "need a better prompt" problem. A measurement-under-uncertainty problem. The question is not "is the new prompt good?" — offline already answered that. The question is "is the new prompt better, on the metric that actually matters to the business, by an amount large enough to be sure given how much traffic we have?" That is what A/B testing answers, when it is run honestly.

When the tiny example already shows the disagreement

Strip the experiment to its bones. One hour of traffic. 200 chats on the old prompt, 200 on the new.

HOUR-1 SNAPSHOT
  Arm A (control, old prompt)
    chats: 200    policy-correct: 124 (62%)   CSAT avg: 4.3 / 5
  Arm B (treatment, new prompt)
    chats: 200    policy-correct: 138 (69%)   CSAT avg: 4.1 / 5

  Eval rubric says:    B is better by +7pp
  Customer rating says: B is worse by -0.2

The disagreement is already there in 400 chats. The rubric and the rating are not measuring the same thing. The rubric notices that B includes the required policy clause more often. The rating notices that B did not say "I understand this is frustrating" before naming the policy. Both are real signals. They are pointed at different parts of the same answer. The chapter's whole arc is in this snapshot — the rest is statistical care so you do not act on the snapshot alone.

Mini-FAQ. "Isn't the rubric just wrong then?" Sometimes. More often the rubric is right about what it measures and silent about what it does not. Adding a tone dimension would help the rubric notice. But user behavior is the ground truth — when the rubric and the user disagree, the rubric is the model and the user is the data.

The rule: A/B testing decides between two systems on the metrics users feel, not the metrics rubrics measure

State it plainly. An A/B test is the experiment that holds everything constant except the change, runs it on users you cannot see, and lets the metrics that matter to the business choose between the two versions. Everything else in this chapter — sample size, peeking, Goodhart, canary vs shadow vs full A/B — is a way to make sure the experiment is honest enough that the user vote is the one we trust.

Teacher voice. Offline evals shape candidates. A/B tests pick winners. A team that uses offline evals for picking is a team that ships variance. A team that uses A/B tests for shaping is a team that wastes months. Use each tool where its question lives.


1) Picture before numbers — the four shapes of the shift change

The phrase "A/B test" hides four different experiments that ship under one name. They differ in how much risk a user is exposed to and how much information you get.

   FOUR SHAPES OF THE SHIFT CHANGE
   ───────────────────────────────

   1) SHADOW                       2) CANARY (1-5%)
   ┌──────────┐                    ┌──────────┐
   │  user    │                    │  user    │
   └────┬─────┘                    └────┬─────┘
        │                               │ 95-99%
        ▼                               ▼
   ┌──────────┐  copy           ┌──────────────┐    1-5%
   │ control  │──────┐          │   control    │◀────┐
   │  (live)  │      │          │   (live)     │     │
   └──────────┘      ▼          └──────────────┘     │
              ┌──────────┐                    ┌──────────┐
              │treatment │                    │treatment │
              │(no users)│                    │  (live)  │
              └──────────┘                    └──────────┘
   reads only, never replies     small slice gets real reply

   3) RANDOMIZED A/B (50/50)       4) HOLDOUT (treatment is default)
   ┌──────────┐                    ┌──────────┐
   │  user    │ random             │  user    │ 95% treatment
   └────┬─────┘                    └────┬─────┘  5% control
        ├────50%──→ control             ├──→ treatment (new normal)
        └────50%──→ treatment           └──→ control (kept aside)
   power-maximizing, equal risk    measures long-term lift after rollout

Shadow runs the candidate alongside the live model, sees the same input, but never sends its output to a user. It catches latency, cost, and policy-violation signals before any user is touched. It cannot measure user behavior — there is no user. Canary exposes a small slice (1-5%) and watches guardrails before widening; the risk is bounded by the slice size. A full randomized A/B at 50/50 gives the cleanest comparison and the fastest statistical answer, but at the cost of exposing half your users to the candidate. Holdout flips the question — after the treatment has been rolled out, keep a small slice on the old version permanently to measure long-run lift, retention, and the "does the new prompt actually help thirty days later?" question that a short A/B cannot reach.

For the refund chatbot, with 8,400 chats a week and a prompt change that is reversible in one config push, a 50/50 randomized A/B is the right choice. The candidate is not state-mutating (no money moves on the prompt change alone), the risk per chat is bounded by "customer reads a terser-than-usual reply", and the team wants the answer in a week, not a month. Canary would be the right call if the candidate were a new tool that could issue refunds; shadow would be right if the candidate were a new retrieval index where you wanted to see top-k overlap before exposing anyone.

2) Picking the metrics that matter — and the trap of picking the wrong one

A test is only as honest as its primary metric. The refund bot's metrics ladder, in priority order:

  • Primary outcome metric. Self-serve resolution rate — share of chats where the user got an answer and did not escalate to a human within 24 hours. This is the metric the business actually buys the chatbot for.
  • Quality guardrail. Policy-correct rate — the rubric score from chapter 07. If this falls, the chatbot is making policy mistakes regardless of how snappy it feels.
  • User-feel guardrail. Post-chat CSAT — five-point rating, opted-in. Noisy but the only direct user vote.
  • Operational guardrails. Median latency, p95 latency, cost per chat, refusal rate. If any of these moves badly, the win is not a win.
  • Long-tail guardrail. 7-day return-to-product rate — proxy for "did the user come back?" Lags the experiment but worth tracking.

Five rungs, one of them primary. The reason for the asymmetry is the Goodhart trap. The moment a team has five equal metrics, the experiment becomes an argument about which one to weight. The moment a team has one primary and the rest as guardrails, the experiment becomes a decision rule: the primary moves, the guardrails do not get worse, ship. If a guardrail moves badly, the win is contested and the team has to decide whether the lift is worth the cost.
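The decision rule above is small enough to write down before the experiment starts. A minimal sketch in Python, assuming each guardrail is summarized as a signed delta (negative means worse) and that the team has pre-committed a tolerance per guardrail. The `Guardrail` class, the `decide` function, the tolerances, and the verdict strings are all hypothetical illustrations, not part of any tooling the chapter describes.

```python
from dataclasses import dataclass


@dataclass
class Guardrail:
    name: str
    delta: float       # treatment minus control, oriented so negative = worse
    tolerance: float   # hypothetical: largest drop the team agreed to accept


def decide(primary_lift: float, primary_significant: bool,
           guardrails: list[Guardrail]) -> str:
    """Return 'ship', 'contested: ...', or 'no-ship' from pre-declared inputs."""
    if not (primary_significant and primary_lift > 0):
        return "no-ship"
    breached = [g.name for g in guardrails if g.delta < -g.tolerance]
    return "contested: " + ", ".join(breached) if breached else "ship"


# Illustrative values only: a significant primary lift with one guardrail breach.
verdict = decide(
    primary_lift=0.03,
    primary_significant=True,
    guardrails=[Guardrail("csat", -0.30, 0.10), Guardrail("latency", 0.0, 0.05)],
)
print(verdict)  # contested: csat
```

The point of the shape is that nothing in `decide` weighs metrics against each other. A breached guardrail does not compute a trade-off; it hands the decision back to people.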

The classic LLM Goodhart story: a team optimizes click-through-rate on a recommendation chatbot. CTR rises 18%. Three months later, session length has collapsed because the bot learned that punchy, dopamine-grabbing suggestions get clicked but do not get engaged with. The proxy (CTR) detached from the user metric (long-session value), and the team was rewarded by their dashboard for a behavior that was hollowing out the product. The shift change is supposed to protect against this, not enable it. The protection is in the primary metric choice, not in the experimental machinery.

Mini-FAQ. "Why not optimize for CSAT directly?" Because CSAT response rate is 8–15% and biased toward extreme experiences. Use it as a guardrail, not as the primary. The primary should be a behavior every user produces — resolution, conversion, escalation — not a vote only the strongly-opinionated cast.

3) The worked example — refund chatbot, 7-day 50/50

The candidate prompt promised +4 to +6pp offline. The team scoped a one-week 50/50 randomized A/B. Eligibility: any chat that hit the refund-policy intent classifier. Stratification: by tier (free vs paid) so the random assignment did not accidentally over-represent paid in one arm. Assignment: hashed user-id mod 2, so the same user sees the same arm if they return mid-week.
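"Hashed user-id mod 2" can be sketched in a few lines. This is an illustration under assumptions, not the team's actual splitter: the SHA-256 choice, the `assign_arm` name, and the experiment-name salt are all hypothetical, and the sketch does not implement the tier stratification described above.

```python
import hashlib


def assign_arm(user_id: str, experiment: str = "refund-prompt-v2") -> str:
    """Deterministically map a user to 'A' or 'B' (hashed user-id mod 2)."""
    # Salting with the experiment name (an assumption, not from the chapter)
    # keeps this split independent of other experiments on the same users.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 2
    return "A" if bucket == 0 else "B"


arms = [assign_arm(f"user-{i}") for i in range(10_000)]
print(assign_arm("user-42") == assign_arm("user-42"))  # True: same user, same arm
print(arms.count("A") / len(arms))                     # share of users in arm A
```

Because the arm is a pure function of the user id, a returning user lands in the same arm without any stored state. A cryptographic hash is used rather than Python's built-in `hash`, which is randomized per process for strings.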

REFUND CHATBOT A/B — 7 DAYS, 8,400 ELIGIBLE CHATS
────────────────────────────────────────────────
                          Arm A (control)   Arm B (treatment)
chats                          4,210             4,190
policy-correct                 2,610 (62.0%)     2,849 (68.0%)
self-serve resolution rate     54.1%             56.9%
post-chat CSAT (n responded)   4.31 (n=412)      3.91 (n=398)
median latency                 1.8s              1.7s
escalation rate                14.2%             11.6%
7-day return rate              38.4%             36.1%

Primary lift:        +2.8pp self-serve resolution
Quality guardrail:   +6.0pp policy-correct (offline predicted +5pp — close)
CSAT guardrail:      -0.40 points on the 5-point scale (p < 0.01)
Return-rate guardrail: -2.3pp (n too small to be sure)

The two-proportion z-test on policy-correct: pooled rate = 0.65, SE ≈ sqrt(0.65×0.35×(1/4210 + 1/4190)) ≈ 0.0104, z = 0.06 / 0.0104 ≈ 5.77, p < 0.001. The +6pp policy lift is real. The two-proportion test on self-serve resolution: z ≈ 2.6, p ≈ 0.009. Also real. The CSAT drop: t-test on means, t ≈ -2.9, p ≈ 0.004. Also real.
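The policy-correct arithmetic can be checked with a short script. A minimal sketch using only the standard library; the function name is illustrative, and the counts come from the results table and the hour-1 snapshot.

```python
import math


def two_proportion_z(x_a: int, n_a: int, x_b: int, n_b: int) -> tuple[float, float]:
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a, p_b = x_a / n_a, x_b / n_b
    pooled = (x_a + x_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


z, p = two_proportion_z(2610, 4210, 2849, 4190)   # 7-day policy-correct
print(f"7-day:  z = {z:.2f}, p = {p:.1e}")
z1, p1 = two_proportion_z(124, 200, 138, 200)     # hour-1 snapshot
print(f"hour-1: z = {z1:.2f}, p = {p1:.2f}")
```

The 7-day counts give z a little under 5.8, matching the hand calculation up to rounding. The hour-1 snapshot shows a larger raw gap (+7pp) but z is only about 1.5 and p about 0.14, which is why the snapshot alone is not decision-grade.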

Each tested difference passes its own significance test; only the return-rate drop is too noisy to call. The disagreement is what is interesting. The new prompt is more often policy-correct and more often resolves without escalation, yet users rate the chat lower and return slightly less. Three of those numbers say "ship". Two say "don't". Which is the lesson?

The terse prompt is winning the current chat and losing the next chat. Resolution and policy-correct are single-session metrics; the bot answered well and the user did not need a human this time. CSAT and return rate are relationship metrics; the user did not feel cared for. For a refund product where every refund is a moment of customer goodwill or its absence, the relationship metrics are not optional. The decision: ship a mixed version that keeps the new prompt's clause coverage and adds back one empathy sentence. Run a second A/B on the mix. Do not ship the candidate as-is.

The threaded lesson: the +5pp offline lift did not lie. It just did not measure what mattered most. A team that took the offline win to production without an A/B would have shipped a CSAT regression and not learned why for a quarter.

4) Why a 50/50 A/B instead of a canary here

Plausible alternative: 5% canary for two days, then ramp. Why not?

Under this workload (8,400 chats/week, a reversible prompt change, no money movement on the prompt itself, and a primary metric, resolution, where a 3pp lift is the business-meaningful threshold) the canary gives less information per day at higher decision risk. At 5% traffic, the treatment arm is 60 chats/day; the standard error on a 60-chat sample is ~6pp, so a real 3pp lift is invisible for the entire first week. A 50/50 split runs both arms at full size, surfaces the CSAT disagreement on day one, and still exposes fewer users than a 100% rollout would, with both arms covered by the rubric and CSAT alarms.
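The canary arithmetic is one line of math. A minimal sketch, assuming chats are independent and using the control arm's 54.1% resolution rate as the baseline; the helper name is illustrative. Note that this is the standard error of a single arm's rate, and the standard error of the difference between two arms is larger still.

```python
import math


def proportion_se(p: float, n: int) -> float:
    """Standard error of an observed rate p on n independent chats."""
    return math.sqrt(p * (1 - p) / n)


baseline = 0.541  # control self-serve resolution rate from the results table
for label, n in [("canary, 1 day", 60), ("canary, 7 days", 420), ("50/50 arm, 7 days", 4200)]:
    print(f"{label:>18}: n = {n:>5}, SE = {100 * proportion_se(baseline, n):.1f}pp")
```

Even after a full week, the canary arm's standard error is around 2.4pp, the same order as the 3pp lift the team cares about. The 50/50 arm ends the week under 1pp.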

The canary makes sense when the candidate could hurt a user badly (a new tool with refund authority, a model with unverified safety profile, a UI change that breaks accessibility). For a system-prompt swap with an obvious rollback, the canary is too cautious; the 50/50 is the right rigor level. Shadow makes sense when you want to see the candidate's behavior without exposing any user — useful for a new retriever where you want to compare top-k overlap before the new docs ever show up in a reply.

                       Shadow                                   Canary 1-5%                           50/50 A/B                            Holdout
What you learn         latency, cost, output shape              early harm signal                     full lift + guardrails               long-run lift after rollout
What you cannot learn  user behavior                            small-effect lifts                    long-term retention                  short-term flipped comparison
Risk per user          zero                                     bounded by slice                      full                                 tiny (held-out arm)
Use when               new component, no user feel matters yet  safety/policy-mutating, irreversible  reversible change, want fast answer  already shipped, want to keep measuring
Refund-bot fit         retriever swap                           new refund-issuing tool               system prompt swap (this chapter)    post-ship monitoring of the winning prompt

5) Sample-size math — can the experiment even answer the question?

Before running the experiment, you must answer: with this much traffic and this minimum effect we care about, can we tell the difference from noise? This is the question power analysis answers. Skip it and you may run an experiment that mathematically cannot reach a decision, no matter how the truth lies.

The inputs:

  • p₁ — baseline conversion rate of the control arm. For the refund bot, policy-correct baseline is 62%.
  • MDE — minimum detectable effect, the smallest lift the team cares about. Say 2pp; below that, the business does not care.
  • α — false-positive rate we tolerate. Standard: 0.05.
  • Power (1 - β) — probability of detecting a true effect of size MDE. Standard: 0.80.

For a two-proportion test, sample size per arm is approximately:

n_per_arm ≈ (z_α/2 + z_β)² × ( p₁(1-p₁) + p₂(1-p₂) ) / (p₂ - p₁)²
        = (1.96 + 0.84)² × ( 0.62×0.38 + 0.64×0.36 ) / (0.02)²
        = 7.84 × ( 0.2356 + 0.2304 ) / 0.0004
        = 7.84 × 0.4660 / 0.0004
        ≈ 9,134 per arm

So roughly 9,100 chats per arm to detect a 2pp lift at a 62% baseline with α=0.05 and 80% power. The refund bot gets 8,400 chats/week total, 4,200 per arm. To detect a 2pp effect on policy-correct, the team needs more than two weeks. The +6pp effect they actually observed is much larger; at MDE=6pp the required n drops to roughly 1,000 per arm, easily reached in two days. This is why the actual A/B was decisive in seven days despite the math saying "2pp would need two weeks": the true effect was big.
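The sample-size formula is worth keeping as a function so it gets run before every experiment. A minimal sketch that uses the same rounded z-values as the derivation (1.96 and 0.84); the function name and the loop over candidate MDEs are illustrative.

```python
import math


def n_per_arm(p1: float, mde: float, z_alpha: float = 1.96, z_beta: float = 0.84) -> int:
    """Approximate per-arm sample size for a two-proportion test."""
    p2 = p1 + mde
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / mde ** 2)


weekly_per_arm = 4200  # 8,400 chats/week split 50/50
for mde in (0.02, 0.03, 0.06):
    n = n_per_arm(0.62, mde)
    print(f"MDE {mde:.0%}: n per arm = {n:,}, weeks needed = {n / weekly_per_arm:.1f}")
```

At a 2pp MDE this reproduces the 9,134 per arm from the derivation. Required n scales with 1/MDE², so halving the effect you want to detect roughly quadruples the traffic you need.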

Run this calculation before the experiment, not after. If the math says you cannot reach the MDE in the time you have, the experiment cannot answer the question. Three options: lengthen the experiment, accept a larger MDE (and write down that smaller effects are invisible to you), or restructure the comparison (paired design, switchback, interleaving) to extract more signal per user.

Teacher voice. A team that runs an underpowered experiment and ships the winner is a team that ships noise half the time. The most expensive A/B is the one that ran for the time you had instead of the time the math needed.

6) Statistical pitfalls — peeking, multiple comparisons, non-IID users, interference

Four pitfalls, each capable of inverting an honest experiment.

Peeking. Checking the dashboard every hour and stopping when p < 0.05 inflates the false-positive rate from 5% to about 30% over a week. The reason: a p-value assumes one look at the data; many looks give noise many chances to cross the threshold. Fix: pre-commit to a sample size or use sequential-test methods (mSPRT, always-valid p-values) explicitly. The refund-bot team committed to seven days, ignored the day-three temptation to call it, and only opened the dashboard on day seven.
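The inflation is easy to see in simulation. The sketch below runs A/A tests, where both arms share the same true rate, so every "significant" result is a false positive. It compares a peeker who stops at the first crossing with a team that looks once at the end. The look count, chats per look, and seed are arbitrary assumptions; with only ten looks the inflation is smaller than the hourly-peeking figure quoted above, and the exact rates depend on those settings.

```python
import math
import random


def is_significant(x_a: int, x_b: int, n: int, z_crit: float = 1.96) -> bool:
    """Pooled two-proportion z-test with equal arm sizes n."""
    pooled = (x_a + x_b) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    return se > 0 and abs(x_b - x_a) / n / se > z_crit


def simulate(n_experiments: int = 500, looks: int = 10, chats_per_look: int = 100,
             rate: float = 0.62, seed: int = 0) -> tuple[float, float]:
    """A/A tests: both arms share the same true rate, so every 'win' is false."""
    rng = random.Random(seed)
    stopped_early = final_only = 0
    for _experiment in range(n_experiments):
        x_a = x_b = n = 0
        crossed = False
        for _look in range(looks):
            x_a += sum(rng.random() < rate for _ in range(chats_per_look))
            x_b += sum(rng.random() < rate for _ in range(chats_per_look))
            n += chats_per_look
            crossed = crossed or is_significant(x_a, x_b, n)
        stopped_early += crossed                    # peeker stops at first crossing
        final_only += is_significant(x_a, x_b, n)   # disciplined single look
    return stopped_early / n_experiments, final_only / n_experiments


peeking_rate, single_look_rate = simulate()
print(f"false positives with peeking: {peeking_rate:.1%}")
print(f"false positives, one look:    {single_look_rate:.1%}")
```

The peeking rate can never be lower than the single-look rate, because the final look is one of the looks. Every extra look only adds chances to cross.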

Multiple comparisons. Five independent metrics, each tested at α=0.05, give a 23% chance of a false positive on at least one if all five are nulls. Fix: pre-declare the primary metric. Guardrails get directional checks ("did it move badly?") not significance celebrations. Use Bonferroni or Benjamini-Hochberg corrections when you genuinely need many comparisons.
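Both the 23% figure and the two corrections fit in a few lines. A minimal sketch, assuming the tests are independent; the function names are illustrative and the five p-values in the example are hypothetical.

```python
def familywise_error(alpha: float, k: int) -> float:
    """P(at least one false positive) across k independent true-null tests."""
    return 1 - (1 - alpha) ** k


def benjamini_hochberg(p_values: list[float], q: float = 0.05) -> list[bool]:
    """Return, per input p-value, whether BH rejects it at false-discovery rate q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank r (1-based) whose sorted p-value sits under the line r*q/m.
    cutoff = max((r for r, i in enumerate(order, start=1)
                  if p_values[i] <= r * q / m), default=0)
    rejected = [False] * m
    for r, i in enumerate(order, start=1):
        rejected[i] = r <= cutoff
    return rejected


print(f"{familywise_error(0.05, 5):.1%}")  # 22.6%, the ~23% quoted above
bonferroni_threshold = 0.05 / 5            # per-test threshold of 0.01

hypothetical_p = [0.001, 0.009, 0.004, 0.20, 0.03]
print([p <= bonferroni_threshold for p in hypothetical_p])
print(benjamini_hochberg(hypothetical_p))
```

On these made-up p-values Bonferroni keeps three results and Benjamini-Hochberg keeps four. Bonferroni controls the chance of any false positive, while Benjamini-Hochberg controls the expected share of false positives among the results you keep, which is why it is less strict.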

Non-IID users. Users return; if a user is hashed to arm A on Monday and re-hashed to arm B on Tuesday, the comparison is contaminated. Worse, if the bot's reply on Monday changes what the user asks on Tuesday, the arms become coupled. Fix: hash on user-id, not session-id; lock assignment for the experiment window; track per-user rather than per-chat metrics for anything relationship-related.

Interference effects. One agent's behavior affects another's eval — a refund bot that escalates more dumps more load on the human agent queue, which makes human responses slower, which makes overall CSAT drop in both arms. Bot-to-bot interference, bot-to-human interference, marketplace-side interference (a new search ranking that reorders results affects every search). Fix: cluster-randomize (assign whole accounts, whole geographies, whole agent shifts) when interference is plausible. For the refund bot, interference was small because human agents were under-loaded; for a recommendation A/B at Booking.com, cluster-randomization would have been mandatory.

7) Operational signals — what tells you the A/B is healthy or rotting

A healthy A/B has a quiet dashboard for the first half. Both arms move together with the natural daily rhythm. The lift line wobbles around its eventual value with shrinking error bars. Sample-ratio mismatch — A getting 50.4% of traffic instead of 50.0% — is statistically tested every morning and stays non-significant.

The first signal of a sick experiment is sample-ratio mismatch. If the splitter is supposed to assign 50/50 and the assignment table shows 51.6/48.4, something is wrong with the assignment, not the model. Common causes: arm-specific error rates (treatment crashes more often and falls back to control), bot-detection logic that filters arms differently, eligibility checks that depend on a stateful field the treatment writes differently. SRM is the kill-switch signal: if it fires, stop the experiment and find the bug; whatever the metrics say is downstream of a broken assignment.
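The morning SRM check is a chi-square goodness-of-fit test on the arm counts. A minimal standard-library sketch; the function name and the 0.01 alarm threshold are assumptions (teams pick their own, often stricter), and the second pair of counts is simply a 51.6/48.4 split applied to 8,400 chats for illustration.

```python
import math


def srm_p_value(n_a: int, n_b: int, expected_share_a: float = 0.5) -> float:
    """Chi-square goodness-of-fit (1 degree of freedom) on the arm counts."""
    total = n_a + n_b
    exp_a = total * expected_share_a
    exp_b = total - exp_a
    chi2 = (n_a - exp_a) ** 2 / exp_a + (n_b - exp_b) ** 2 / exp_b
    # With 1 degree of freedom the chi-square tail equals a two-sided normal tail.
    return math.erfc(math.sqrt(chi2 / 2))


SRM_ALPHA = 0.01  # hypothetical alarm threshold; not specified in the chapter

for n_a, n_b in [(4210, 4190), (4334, 4066)]:  # observed split; a 51.6/48.4 split
    p = srm_p_value(n_a, n_b)
    print(f"{n_a}/{n_b}: p = {p:.4f}, SRM alarm = {p < SRM_ALPHA}")
```

The experiment's actual 4,210/4,190 split is comfortably consistent with 50/50. The 51.6/48.4 split is not, even though it looks close to even at a glance, which is why the check has to be a test and not an eyeball.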

The misleading metric a beginner watches: the primary-metric line by itself, day by day. It is noisy on a daily basis. The graph an experienced operator opens first: the difference (B - A) with its confidence interval, plotted over the experiment window. The CI narrows as n grows; the eye learns to read whether the band has crossed zero and stayed there.
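The difference graph is just this interval recomputed as n grows. A minimal sketch using a Wald interval on (B - A) for self-serve resolution. To show the narrowing it assumes, hypothetically, that both arms sit at their final rates on every day and accumulate 600 chats per arm per day; real daily rates would wobble.

```python
import math


def diff_ci(p_a: float, n_a: int, p_b: float, n_b: int,
            z: float = 1.96) -> tuple[float, float, float]:
    """Wald interval for (B - A); returns (diff, lower, upper)."""
    diff = p_b - p_a
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return diff, diff - z * se, diff + z * se


per_arm_per_day = 600  # 8,400 chats/week split 50/50 over 7 days
for day in (1, 3, 7):
    n = per_arm_per_day * day
    diff, lo, hi = diff_ci(0.541, n, 0.569, n)
    print(f"day {day}: B - A = {diff:+.1%}, 95% CI [{lo:+.1%}, {hi:+.1%}]")
```

Under these assumptions the day-1 band straddles zero and the day-7 band sits above it. The point estimate never changed; only the width of the band did.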

The deepest signal, the one a team learns over many A/Bs: the disagreement between metrics. When the primary moves one way and a guardrail moves the other way, that is the lesson, not the noise. The refund-bot team learned more from the +6pp / -4pt disagreement than from any individual significant lift in the previous quarter.

8) Boundary of applicability — where A/B testing stops working

A/B testing is not always the right tool. It fails or becomes wasteful in three regimes.

Low traffic. A B2B product with 50 conversations a week cannot run a meaningful A/B on a 2pp effect — the math says years. Use offline eval as the primary decision tool, plus a one-week shadow or careful pilot with a single customer.

Long-horizon outcomes. If the metric you care about is "did the user re-subscribe in three months", the experiment must run three months, and during those three months the product cannot iterate on the same surface. Use a holdout: ship the candidate, keep a small held-out arm on the old version permanently, measure the long-run difference even while the rest of the product moves on.

Irreversible or compounding effects. If the candidate writes to a database in a way the control does not, the two arms diverge and cannot be compared cleanly. If the candidate sends emails that the control does not, the recipients are not independent samples. For these cases, design the experiment so the divergence is contained — log-only writes, simulated emails, dual-write with a kill switch.

The pathology to avoid is treating the A/B framework as a substitute for thinking. A team that A/Bs every change without offline eval first will burn user trust running underpowered tests of weak candidates. The order matters: offline filter the obviously bad, A/B the plausibly good, holdout the shipped winner.

Teacher voice. The shift change is a measurement device. It does not generate good candidates; it picks between them. A team with a weak offline pipeline runs many bad A/Bs. A team with a strong offline pipeline runs few decisive ones.

9) The wrong mental model — "offline lift transfers to online lift"

The seductive belief: "the offline rubric scored +6pp, so the online primary metric will move +6pp too." It will not. The offline rubric measures answer shape under fixed input. The online primary measures user behavior under variable input. The transfer ratio between the two is not 1.0, not stable, and not always positive.

Three reasons the offline-to-online lift can be smaller, zero, or negative.

First, the rubric and the user metric measure different things. The rubric says "required clauses present"; the user metric says "user didn't escalate". Required clauses being present does not automatically prevent escalation — sometimes the way the clauses are delivered drives the user to type "talk to a human" faster.

Second, user behavior shifts in response to the new system. A terser bot gets terser questions; questions that used to be three turns become one. The denominator of "chats per resolution" changes, and rate-style metrics move for reasons that have nothing to do with quality.

Third, the long tail of inputs the offline set never saw is where the new prompt may shine or fail. The 500-row golden set captured common cases. The candidate's behavior on the weird-2% might be much better or much worse, and the live A/B is the first place that distribution gets sampled honestly.

Replace the wrong model with the right one: the offline eval ranks candidates; the A/B measures effects. They answer different questions. A +6pp offline result means "this candidate is worth testing live", not "this candidate will move the live metric by +6pp". The refund-bot team learned this in one experiment.

10) Six more failure shapes A/B tests produce

  • Survivorship in retention metrics. Only users who returned can be measured on 7-day return rate. If the treatment changes who returns, the comparison on returners is biased.
  • Novelty effect. Users behave differently in the first 48 hours of a UI change because it's new, not because it's better. The "win" fades by week three.
  • Primacy effect. The opposite — power users hate the change in week one and adapt by week three. The "loss" fades.
  • Winner's curse. The candidate that won big in a small A/B usually has a smaller true effect than the observed lift; ship and watch the lift shrink in the holdout.
  • Twyman's law triggered. Any result that looks too good (+20pp on a mature metric) is probably an instrumentation bug. Healthy lifts are 1–5pp.
  • Network/spillover. Bot in arm B routes more chats to human agents in both arms; both arms' CSAT drops; the experiment looks like nothing happened.

11) Cross-topic reinforcement — same shape, different chapter

  • Same invariant as judge calibration. 08-judge-calibration.md taught that the rubric must be measured for agreement, not assumed honest. A/B testing pushes the same invariant one layer up: the primary metric must be measured for honesty against the user vote, not assumed.
  • Same failure geometry as drift detection. 09-drift-detection.md catches involuntary change in live distribution; A/B testing catches voluntary change. Both depend on stratified sampling, both fail when assignment is non-random, both confuse beginners on the same statistic: sample-ratio mismatch.
  • Echo of shipping-on-vibes. 01-shipping-on-vibes.md said "a quality claim covers only the sample that generated it". A/B testing operationalises that rule for changes: the comparison only covers the population the experiment sampled, the metric the experiment defined, and the time window the experiment ran.
  • Forward pressure into logging-tracing. Without per-chat variant assignment in every trace, post-hoc analysis becomes impossible. The next chapter is what makes the after-the-fact deep-dive possible at all.

A self-test before you push the experiment button

  • Have you written the primary metric and MDE down before looking at any data?
  • Is the assignment hashed on user-id (not session-id, not request-id)?
  • Have you computed n_per_arm and confirmed your traffic will reach it inside the window?
  • Have you listed the guardrails and pre-committed to which ones can veto a ship?
  • Have you decided when you will look at the data, and committed not to peek before then?

Four yeses out of five is fine. Three or fewer means the experiment will produce a number but not a decision.

Where this lives in the wild

A/B testing for LLM and ML features looks structurally similar across companies; what differs is what they measure and how aggressively they protect against pitfalls.

  • Statsig — feature flagging plus experimentation as one product; ships sequential testing and SRM checks built into the dashboard so teams cannot accidentally peek.
  • LaunchDarkly — feature flag platform where the assignment is shared between the experiment and the rollout, so the same hashing decides who sees what.
  • GrowthBook — open-source A/B platform with Bayesian and frequentist analyses side by side, used by teams that want to argue with their stats team in public.
  • Eppo — experimentation product that pushes CUPED variance reduction and explicit primary-metric declaration into the standard workflow.
  • Optimizely — the original A/B testing product; the lesson it taught the industry is that simple lift charts are not enough — you need stratification and guardrails.
  • Anthropic Claude release canaries — every new model version goes through a graduated rollout where guardrails (safety eval scores, refusal rate, jailbreak rate) gate widening the slice.
  • OpenAI gradual rollouts — staged release pattern for new GPT versions with explicit holdouts for measuring long-run model behavior changes.
  • GitHub Copilot's experimentation framework — internal A/B platform that gates suggestion-quality changes on acceptance rate and code-survival rate (does the user keep the suggestion after 5 minutes, not just accept it).
  • Netflix interleaving experiments — instead of A/B at the user level, interleave A and B results on the same page and measure click position; massive statistical power on a small user base.
  • Booking.com — runs >1,000 concurrent experiments; their playbook on Sample Ratio Mismatch is the public reference text on why SRM is the kill-switch signal.
  • Airbnb's ERF (Experimentation Reporting Framework) — solved network interference for marketplace experiments by cluster-randomizing on geography.
  • Spotify — experimentation platform with built-in metric trees so guardrails are computed in the same pipeline as primary metrics, no off-dashboard surprises.
  • Meta's Deltoid / FBLearner — internal experimentation infrastructure where every model change ships behind a flag and the flag is the A/B.
  • Microsoft's ExP platform — published the "Trustworthy Online Controlled Experiments" book (Kohavi et al.); the canonical Twyman's-law and SRM examples come from here.
  • Linear / Vercel AI SDK feature flags — smaller-team pattern, used to A/B prompt changes on internal tools before customer exposure.
  • Intercom Fin — canary-tests new support responders so live ticket quality and escalation rates stay controlled, the same pattern as the refund-bot example here.
  • Perplexity — A/B tests answer-style changes while watching citation rate and follow-up rate, not just satisfaction.
  • Duolingo Max — A/B tests tutor-prompt variants on lesson completion and next-day return, the long-horizon metric a one-week A/B cannot reach alone.
  • Cursor — A/B tests suggestion-style changes on accept-and-keep rate, the same idea as Copilot's survival metric.
  • Shopify Sidekick — runs candidate-prompt A/Bs gated on merchant-task completion rate, with holdout slices kept for long-run measurement.

The lesson across the list is consistent: every team that runs LLM features in production builds, buys, or rents this layer. The companies that ship reliable AI changes are the ones whose experiment dashboards open with the difference graph, not the level graph.

Pause and recall

  1. Why is +6pp offline not the same number as +6pp online, even with the same prompt?
  2. Name the four shapes of A/B testing and one scenario each is the right fit for.
  3. State the sample-size formula for a two-proportion test and the four inputs it needs.
  4. Why is sample-ratio mismatch the kill-switch signal of an experiment, not a side metric?
  5. In the refund-bot result, why was +6pp policy-correct, -4pt CSAT the lesson rather than a contradiction?
  6. Give two reasons peeking inflates the false-positive rate beyond α.
  7. What is the Goodhart trap in A/B testing, and which line in the metrics ladder protects against it?
  8. When is a holdout the right design, and when is a 50/50 A/B better?

Interview Q&A

Q1. Offline eval says +5pp on the golden set. CEO wants to ship to 100% Monday. What do you propose?

A. A 7-day 50/50 A/B with a pre-declared primary metric (self-serve resolution) and two guardrails (CSAT, return rate). The offline +5pp is necessary evidence the candidate is worth testing live; it is not sufficient evidence the live primary will move that much. The cost of a week of A/B is small; the downside protection is large — the offline rubric does not measure user feel, and a CSAT regression that would surface in the A/B costs less to catch on a randomized slice than on 100% of users. Common wrong answer to avoid: "Ship at 100% and watch the dashboard for two days." That is shipping variance — daily numbers at this traffic have a ±1.7pp SE; a 2pp move is indistinguishable from noise.

Q2. You have 8,000 chats/week, baseline 62%, want to detect a 2pp lift. Can you run the A/B in one week?

A. No. Two-proportion power analysis at α=0.05 and power=0.80 needs roughly 9,400 per arm at p₁=0.62 and MDE=2pp. With 8,000 chats/week split 50/50 you get 4,000 per arm per week — under half the required n. You need ~2.5 weeks, or you must either accept a larger MDE (write down that smaller real effects are invisible) or use variance-reduction methods like CUPED to shrink the required n. Running it for one week anyway means the "not significant" result is a measurement failure, not a finding. Common wrong answer to avoid: "Run it for a week and see." That is exactly the case where you reach a conclusion that the data could not have justified.
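
The power analysis in this answer can be sketched in a few lines of standard-library Python. This is a minimal sketch, assuming a two-sided two-proportion z-test with the common pooled/unpooled variance formula; the function name `n_per_arm` and the 4,000-chats-per-arm weekly figure are illustrative. With these inputs it lands in the low 9,000s per arm, the same ballpark as the roughly 9,400 quoted above; the exact figure shifts with the formula variant (variance pooling, continuity correction).

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_arm(p1: float, mde: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-arm sample size for a two-sided two-proportion z-test (hypothetical helper)."""
    p2 = p1 + mde                      # rate we hope the candidate reaches
    p_bar = (p1 + p2) / 2              # pooled rate under the null hypothesis
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / mde ** 2)


required = n_per_arm(p1=0.62, mde=0.02)
weeks_needed = required / 4000         # 8,000 chats/week split 50/50
print(required, round(weeks_needed, 1))
```

Doubling the MDE to 4pp cuts the required n to roughly a quarter, which is why "accept a larger MDE" is a real option and not a cosmetic one.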

Q3. Day 3 of a 7-day A/B, primary metric is up 4pp with p=0.04. Do you ship early?

A. No. The pre-committed end of the experiment is day 7; stopping at day 3 because p crossed 0.05 is peeking, and over a 7-day window the cumulative false-positive rate is closer to 0.20. The right answer is to keep running until day 7 unless a guardrail fires — guardrails get directional stop-rules ("if CSAT drops more than 5pp, stop") precisely because catching harm early is asymmetrically valuable, but catching a win early is not. Common wrong answer to avoid: "p < 0.05 means we can stop." That is the canonical peeking error.
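
A small A/A simulation makes the peeking inflation concrete. This is a sketch under stated assumptions: both arms share the same true rate (so every "significant" result is a false positive), 200 chats per arm per day, a pooled two-proportion z-test checked once per day, and hypothetical function names. The traffic numbers are illustrative, not the refund bot's.

```python
import random
from math import sqrt
from statistics import NormalDist


def z_stat(succ_a: int, succ_b: int, n: int) -> float:
    """Pooled two-proportion z statistic for two arms of equal size n."""
    p_pool = (succ_a + succ_b) / (2 * n)
    se = sqrt(2 * p_pool * (1 - p_pool) / n)
    return 0.0 if se == 0 else (succ_b - succ_a) / n / se


def simulate(days=7, per_arm_per_day=200, p=0.62, alpha=0.05, sims=1000, seed=0):
    """Return (false-positive rate with daily peeking, rate with one look on the last day)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    any_day_hits = last_day_hits = 0
    for _ in range(sims):
        succ_a = succ_b = n = 0
        crossed_any_day = False
        for _day in range(days):
            # Both arms draw from the same rate: there is no real effect to find.
            succ_a += sum(rng.random() < p for _ in range(per_arm_per_day))
            succ_b += sum(rng.random() < p for _ in range(per_arm_per_day))
            n += per_arm_per_day
            significant = abs(z_stat(succ_a, succ_b, n)) > z_crit
            crossed_any_day = crossed_any_day or significant
        any_day_hits += crossed_any_day
        last_day_hits += significant   # value from the final day's look only
    return any_day_hits / sims, last_day_hits / sims


peeking_rate, fixed_rate = simulate()
print(f"stop at first p<0.05: {peeking_rate:.3f}   single look on day 7: {fixed_rate:.3f}")
```

The single-look rate should sit near the nominal 0.05, while the stop-at-first-crossing rate comes out several times higher, even though nothing about the two arms differs.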

Q4. Primary up +3pp p<0.001, CSAT down -0.4 points p<0.01. Which is the lesson?

A. The disagreement is the lesson. The primary measures single-session resolution; CSAT measures user feel. They are different constructs and they can move in opposite directions. The decision rule depends on the business — for a refund product where every interaction is goodwill or its absence, a CSAT regression is not a tradeoff to absorb but a signal that the candidate is winning the chat and losing the customer. The right move is a second experiment on a mixed candidate that keeps the new prompt's policy structure and restores the empathy cue the old prompt had. Common wrong answer to avoid: "Primary is significant, ship." That treats the metrics ladder as a vote count rather than a structured judgment.

Q5. Cumulative — you have drift detection running (chapter 9), then run an A/B. The drift detector fires on day 2 of the experiment. Is this an A/B problem, a drift problem, or both?

A. Both, and likely the same root cause. A drift firing during an A/B usually means traffic composition changed (marketing campaign, news cycle, day-of-week effect) and that change is being measured into the A/B result. Two checks: (a) verify SRM: if traffic assignment is still 50/50, the drift is real and is hitting both arms equally, in which case the A/B lift is still meaningful but the absolute levels are misleading; (b) check whether the drift is treatment-induced: if arm B systematically attracts a different mix of users (e.g., terser replies cause more retries), the drift is caused by the experiment, and the comparison is contaminated. The diagnostic question is "would this drift have fired without the A/B?" Run the drift detector on control alone vs. the pre-experiment baseline. Common wrong answer to avoid: "Drift is unrelated to the A/B." Drift during an experiment is almost always relevant to it, either as a confounder you need to control for or as a treatment effect you need to attribute.

Q6. Why is 50/50 A/B usually better than 5% canary for a reversible prompt change?

A. Statistical power. At 5% traffic, the treatment arm gets 5% of the chats, so the standard error on its metrics is ~4–5x larger than the control arm's (√(0.95/0.05) ≈ 4.4). The variance of the arm-to-arm difference is about 5x what a 50/50 split gives on the same traffic, so a 3pp effect that 50/50 would catch in a week takes the 5% canary roughly five weeks, and about ten if it is compared against a matched 5% control slice instead of the full 95%. The canary is the right call when the candidate could hurt a user badly: a new tool with write permissions, an untested safety profile, irreversible side effects. For a system-prompt swap with one-config rollback, the 50/50 is the right rigor level; the canary trades information for a safety margin that isn't needed. Common wrong answer to avoid: "Canary is always safer." Canary is safer per-user but more dangerous per-decision because it ships on less evidence.

Q7. Your A/B platform shows arm A got 51.6% of traffic, arm B got 48.4%. p<0.01 on the SRM test. What do you do?

A. Stop interpreting metrics until you fix it. SRM means the splitter is not doing what the experiment thinks it is doing, and any metric you compute on the two arms is conditioned on a broken traffic assignment. Common causes: arm-specific crash rates (treatment falls back to control on errors and gets re-counted as A), bot/spider filtering that interacts with arm logic, eligibility checks that depend on a field the treatment writes differently. The metrics dashboard is downstream of assignment; you cannot reason about lift until SRM is gone. Common wrong answer to avoid: "Small imbalance, ignore it." 1.6pp on 8,000 chats is not small; it is evidence the experiment has a bug.
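
An SRM check is a one-degree-of-freedom chi-square goodness-of-fit test on the arm counts, which needs nothing beyond the standard library. The sketch below assumes an intended 50/50 split and plugs in the 51.6% / 48.4% split at the 8,000 chats mentioned in the answer (4,128 vs. 3,872); the function name `srm_p_value` is hypothetical, and the alerting threshold is something the team would pre-declare.

```python
from math import erfc, sqrt


def srm_p_value(count_a: int, count_b: int, expected_share_a: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value for a two-arm split (hypothetical helper)."""
    total = count_a + count_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = (count_a - expected_a) ** 2 / expected_a + (count_b - expected_b) ** 2 / expected_b
    # With 1 degree of freedom, P(chi-square > x) equals erfc(sqrt(x / 2)).
    return erfc(sqrt(chi2 / 2))


srm_p = srm_p_value(4128, 3872)   # 51.6% vs 48.4% of 8,000 chats
print(f"SRM p-value: {srm_p:.4f}")
```

Running the check on every dashboard refresh, before any lift is displayed, keeps a broken splitter from being read as a result.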

Q8. Cumulative — judge says treatment is +6pp on the rubric, user CSAT says -4pp. Earlier you calibrated this judge against humans (chapter 8) and got 87% agreement. What does the disagreement now teach you?

A. The judge is honest about what the rubric measures. The rubric does not have a tone-or-empathy dimension. The 87% agreement was on the rubric's dimensions; it never claimed the rubric was complete. The disagreement is not a judge failure or a calibration failure — it is a rubric-coverage failure. The fix lives upstream of the A/B: extend the rubric to include a tone-anchored dimension, re-calibrate, then re-evaluate the candidate. The A/B result itself still stands: the user vote says the candidate is worse on the relationship metric, and that is the ground truth for shipping. Common wrong answer to avoid: "The judge is wrong." The judge is doing its job; the rubric is the thing that needs updating. Diagnosing this as a judge problem sends the team chasing the wrong layer.

Apply now (10 min)

Step 1 — model the exercise. Take the refund-bot A/B from this chapter. Here is the decision table I would put in front of the launch review:

| Metric | Arm A | Arm B | Δ | p | Verdict |
|---|---|---|---|---|---|
| Self-serve resolution (primary) | 54.1% | 56.9% | +2.8pp | 0.009 | win |
| Policy-correct (quality guardrail) | 62.0% | 68.0% | +6.0pp | <0.001 | win |
| CSAT (user-feel guardrail) | 4.31 | 3.91 | -0.40 | 0.004 | regression |
| Median latency | 1.8s | 1.7s | -0.1s | n.s. | flat |
| 7-day return rate | 38.4% | 36.1% | -2.3pp | 0.06 | watch |
| Ship decision | | | | | iterate: fix tone, re-test |

The decision is not "ship because primary won". It is "iterate because the disagreement matters more than the lift". That is the move the chapter teaches.
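
The p-values for the proportion rows in a table like this typically come from a two-proportion z-test. The sketch below is one way to reproduce them, assuming the 8,400 chats were split into exactly 4,200 per arm and every chat counts toward both metrics; the helper name is hypothetical, and real denominators (for example, users instead of chats for return rate) may differ. Under those assumptions the primary row comes out near 0.01 and the policy-correct row far below 0.001.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(rate_a: float, rate_b: float, n_a: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two observed proportions."""
    p_pool = (rate_a * n_a + rate_b * n_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


N_PER_ARM = 4200  # assumption: 8,400 chats split evenly between the arms

primary_p = two_proportion_p_value(0.541, 0.569, N_PER_ARM, N_PER_ARM)
quality_p = two_proportion_p_value(0.620, 0.680, N_PER_ARM, N_PER_ARM)
print(f"self-serve resolution p = {primary_p:.4f}")
print(f"policy-correct p        = {quality_p:.2e}")
```

CSAT is a mean on a rating scale, not a proportion, so its row would need a test on means (for example, a Welch t-test) and is left out of this sketch.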

Step 2 — your turn. Take one feature in your product. Write the metric ladder: primary, quality guardrail, user-feel guardrail, operational guardrail. Compute n_per_arm for your MDE, baseline, α=0.05, power=0.80. Decide which of the four shift-change shapes (shadow, canary, 50/50, holdout) fits the change. Write down the stop-rule for each guardrail before starting.

Step 3 — reproduce from memory. Without scrolling up, draw the four-shapes diagram (shadow, canary, 50/50, holdout) and label one production scenario each is right for. Then write the two-proportion sample-size formula and compute n for p₁=0.55, MDE=3pp, α=0.05, power=0.80. If you can do both cold, you carry the shift change.

What you should remember

This chapter explained why an offline rubric win does not transfer mechanically to a live primary-metric win, and how to run the comparison so the live answer is the one the business trusts. The refund chatbot's new prompt scored +5pp offline and produced +6pp on policy-correct and +2.8pp on self-serve resolution in a 7-day 50/50 A/B — and simultaneously dropped CSAT by 4 points. The disagreement is the lesson: the rubric measures answer shape, the user measures answer feel, and the shift change is the only place those two estimates meet.

You learned the four shapes of an A/B (shadow, canary, 50/50, holdout) and when each is the right fit. You learned the sample-size math that decides whether your experiment can even answer the question — for the refund bot, n ≈ 9,400 per arm to detect a 2pp lift at 62% baseline with α=0.05 and power=0.80. You learned why peeking, multiple comparisons, non-IID users, and interference are the four pitfalls that can invert an honest experiment, and how each is defended against. And you learned the Goodhart trap — that one primary metric plus directional guardrails is the structure that protects an A/B from being argued into the wrong answer.

Carry this diagnostic forward: when offline and online disagree, trust the user vote on the user-feel guardrail. When primary and guardrail disagree, the disagreement is the experiment's finding, not its bug. Before pushing the experiment-start button, write down primary, MDE, n_per_arm, and the day you will look at the data. If you cannot answer all four, the experiment is not ready. Vibes belong on questions about possibility; offline evals belong on questions about candidate ranking; A/B tests belong on questions about live effects on user behavior. Use each where its question lives.

Remember:

  • Offline evals rank candidates. A/B tests measure live effects. The two answer different questions and the lift between them is not a fixed ratio.
  • A 50/50 A/B is the right rigor for reversible changes; shadow for new components with no user-feel question yet; canary for irreversible or safety-mutating changes; holdout for long-run measurement after shipping.
  • Sample-size math comes before the experiment, not after. An underpowered A/B produces a number but not a decision.
  • One primary metric plus directional guardrails. Many equal metrics turn the experiment into an argument.
  • The disagreement between primary and guardrail is the lesson, not a tie to break. When they disagree, the candidate is winning one game and losing another — name the games.
  • Sample-ratio mismatch is the kill-switch signal. If traffic assignment is broken, every downstream metric is meaningless until it's fixed.

Bridge. A clean A/B tells you which version is winning on the metrics you chose. It cannot tell you why a specific failing chat failed, or which step inside the agent — retriever, planner, tool call, judge — broke. That diagnostic question requires looking inside one conversation at full resolution: every prompt, every tool call, every intermediate output, every latency hop. The next chapter is the instrument that makes that inspection possible. We can compare versions; now we need to dissect failures.

11-logging-tracing.md