05. Shadow and A/B testing — the trial bake before the real customers

~17 min read. A prompt change that passed offline eval can still break production. The fix is to bake the new recipe alongside the old one, in the real kitchen, with real orders — and only let real customers taste it after we have watched it bake a few hundred times.

Builds on 04-review-gates.md. Review gates catch shape and policy mistakes before merge. They cannot catch behavior. The trial bake catches behavior.

1) Hook — three ways to ship a prompt change

A bakery has perfected v17 of its sourdough recipe. Sales are steady. Reviews are kind. One morning the head baker tweaks the recipe — a touch less salt, a touch more time in the oven — and calls it v18. The question is not whether v18 is better on paper. The question is whether v18 is better for tomorrow's customers, in tomorrow's weather, baked in tomorrow's ovens, eaten by tomorrow's mouths.

Three ways to find out. Each one has a cost. Each one has a blind spot.

Pattern one — offline eval only. Bake fifty loaves of v18 in the test kitchen. Score them against a rubric. If the score beats v17, ship v18 to everyone tomorrow.

This is the cheapest path. It is also the riskiest. The test kitchen is not the real kitchen. The rubric is not the customer's tongue. Many prompt changes that win on rubric lose on csat. The eval set covers what the team thought to cover. Production has a long tail the team did not think of.

Pattern two — shadow mode. For every real customer order tomorrow, the kitchen bakes both v17 and v18. The customer eats v17. The v18 loaf goes into a tray. At the end of the day, the head baker walks the tray and scores v18 against v17, pair by pair, on real orders.

The customer never tastes v18. The risk is contained. The cost is roughly doubled bake-time on every shadowed order. The signal is real-traffic, real-distribution, real-stratification. This is the trial bake — the new recipe runs alongside the old one, in the real kitchen, with real orders, but the new output is discarded for the customer.

Pattern three — split A/B. Five percent of customers get v18. Ninety-five percent get v17. The kitchen watches csat, complaint rate, and rebuy rate side by side. If v18 holds for a week, ramp to twenty-five percent. Then fifty. Then one hundred.

The customer does taste v18 in this pattern — but only a small slice of customers, and only after shadow mode said v18 was at least as good as v17 on real-traffic samples. The cost is one bad week for five percent of customers, in the worst case, plus the rollback when the gate trips.

Look. None of these patterns replaces the other two. A mature team uses all three, in order. Offline first, then shadow, then split. Each step buys confidence the previous step could not.

Picture the new recipe sitting on the counter of the recipe book. The SHA is fresh, the diff is small, the review is signed off. Before this recipe replaces the live one, the bakery does what every careful kitchen does — it bakes the new recipe a hundred times in parallel with the old one, on the same flour, in the same ovens, under the same morning rush. The trays come out side by side. The head baker walks the line, tasting pairs. Same croissant on the left from v17. Same croissant on the right from v18. Better, same, or worse? Note it. Move on.

Only after a hundred trial bakes confirm "v18 is at least as good as v17" does the bakery start serving v18 — and even then, to one customer in twenty, not to all customers at once. That is the trial bake plus the careful ramp. Together they form the bridge between the recipe being approved and the recipe being lived in.

3) The anatomy — three patterns, three cost profiles

┌────────────────────────────────────────────────────────────────────┐
│ HOW SAFE IS THIS PROMPT CHANGE?                                    │
├────────────────────────────────────────────────────────────────────┤
│                                                                    │
│  1) OFFLINE EVAL                                                   │
│     run prompt against fixed eval set                              │
│     cost: 1x compute on N examples                                 │
│     blind spot: distribution drift, long-tail queries              │
│     verdict by: rubric score / pairwise judge                      │
│                                                                    │
│  2) SHADOW MODE  (the trial bake)                                  │
│     route real traffic to BOTH old and new                         │
│     customer sees: old output only                                 │
│     cost: ~2x compute on shadowed requests                         │
│     blind spot: business metrics (csat) — no user feedback yet     │
│     verdict by: pairwise win-rate on real samples                  │
│                                                                    │
│  3) SPLIT A/B                                                      │
│     deterministic bucketing on user/session ID                     │
│     customer sees: new or old, by bucket                           │
│     cost: real users on new prompt — bounded blast radius          │
│     blind spot: small samples, tail effects                        │
│     verdict by: business metric deltas with significance test      │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘

The progression is not "pick one." It is "use them in order, gate at each step." Offline weeds out the obviously broken. Shadow weeds out the subtly broken. A/B weeds out the broken-in-ways-rubrics-cannot-see.

4) Shadow mode — how it actually runs

Shadow mode sounds complicated. It is not. The serving layer routes the same input to two prompt versions. Both calls happen. Both outputs come back. The customer receives only the live one. The shadow one is logged with the same trace ID, the same input, and the SHA of the shadow prompt.

USER REQUEST
     │
     ▼
┌────────────────┐
│ serving layer  │──► live prompt (v17, SHA a8c3f9…) ──► customer
└────────────────┘
     │
     └──► shadow prompt (v18, SHA b1d7e4…) ──► log only (never the customer)
                                                  │
                                                  ▼
                                         shadow comparison store
                                         (input, old_out, new_out)
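
The routing in the diagram can be sketched in a few lines. This is a minimal, synchronous sketch under stated assumptions: `serve`, `ShadowRecord`, and the in-memory `store` list are hypothetical names, the two prompts are modeled as plain callables, and a real serving layer would likely run the shadow call off the request path so it adds no customer latency.

```python
import uuid
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ShadowRecord:
    # One row of the shadow comparison store (hypothetical schema).
    trace_id: str
    user_input: str
    old_out: str
    new_out: str
    shadow_sha: str


def serve(
    user_input: str,
    live_prompt: Callable[[str], str],
    shadow_prompt: Callable[[str], str],
    shadow_sha: str,
    store: List[ShadowRecord],
) -> str:
    """Return the live output; log the shadow output and never return it."""
    trace_id = uuid.uuid4().hex
    old_out = live_prompt(user_input)
    try:
        new_out = shadow_prompt(user_input)
        store.append(ShadowRecord(trace_id, user_input, old_out, new_out, shadow_sha))
    except Exception:
        # A shadow failure must not affect the customer-facing response.
        pass
    return old_out
```

The only value that leaves `serve` is `old_out`. The shadow output exists solely as a row in the store, keyed by the same trace ID and input.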

Two cost knobs and one design decision.

Knob one — sample rate. Shadow every request (highest signal, ~2x cost on the shadow path). Or shadow ten percent of requests (one-tenth the cost, still real-distribution). Or shadow stratified — pick a budget that covers the slice mix you care about (high-value customers, long queries, refund flows, free-tier users). Stratified is the answer for mature teams. Random uniform is the answer for small teams that just want to get started.
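
One way to implement the stratified option is a per-slice sample rate with a deterministic hash, so a given trace always gets the same decision. The slice names and rates below are hypothetical placeholders, not recommendations; `should_shadow` is an illustrative helper.

```python
import hashlib

# Hypothetical per-slice shadow rates: rare or high-risk slices are oversampled.
SLICE_RATES = {
    "refund_flow": 1.0,
    "high_value": 0.5,
    "long_query": 0.25,
}
DEFAULT_RATE = 0.10  # uniform rate for every slice not listed above


def should_shadow(trace_id: str, slice_name: str) -> bool:
    """Deterministically decide whether to shadow this request."""
    rate = SLICE_RATES.get(slice_name, DEFAULT_RATE)
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big")  # uniform integer in [0, 2**64)
    return value < rate * 2**64
```

A small team can start with an empty `SLICE_RATES` and a single default rate, which reduces this to uniform random sampling.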

Knob two — comparison. Pairwise win-rate is the workhorse. For each shadowed request, an LLM judge or a human rater reads (input, old_out, new_out) and says better, same, worse. Aggregate over a few hundred to a few thousand pairs. The team learns: v18 wins on 38%, ties on 49%, loses on 13%. A net positive of 25 points. Ship is plausible. Or: v18 wins on 22%, ties on 31%, loses on 47%. A net negative of 25 points. Do not ship. Send back to drafting.
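
The aggregation step is simple arithmetic. A minimal sketch, assuming each judged pair has already been reduced to one of the labels "better", "same", or "worse" from the new prompt's point of view (`summarize` is a hypothetical helper name):

```python
from collections import Counter
from typing import Dict, Iterable


def summarize(verdicts: Iterable[str]) -> Dict[str, float]:
    """Turn per-pair verdicts into win, tie, loss, and net win-rate."""
    counts = Counter(verdicts)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("no verdicts to summarize")
    wins, ties, losses = counts["better"], counts["same"], counts["worse"]
    return {
        "win": wins / n,
        "tie": ties / n,
        "loss": losses / n,
        "net": (wins - losses) / n,  # positive means the new prompt is ahead
    }
```

Fed 38 "better", 49 "same", and 13 "worse" verdicts, the net comes out at 0.25, the 25-point net positive from the first example above.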

Design decision — output discard, always. The shadow output never reaches the customer. Not "usually doesn't." Never. Tools the shadow call would have triggered must not fire. If your prompt produces a side-effecting tool call — refund, email, deploy — the shadow path must short-circuit those. The shadow is a read of what the new recipe would have done. Not a write.
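
A sketch of the short-circuit, assuming tools are dispatched through a single runner object. `ToolRunner`, the `MUTATING_TOOLS` set, and the tool names are hypothetical; the point is only that a shadow call records the plan and returns a synthetic result instead of executing.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical names of tools that change state in the outside world.
MUTATING_TOOLS = {"issue_refund", "send_email", "deploy"}


class ToolRunner:
    def __init__(self, tools: Dict[str, Callable[..., Any]], shadow: bool = False):
        self.tools = tools
        self.shadow = shadow
        # Mutating calls the shadow prompt wanted to make, kept for scoring.
        self.planned: List[Tuple[str, Dict[str, Any]]] = []

    def call(self, name: str, **kwargs: Any) -> Any:
        if self.shadow and name in MUTATING_TOOLS:
            # Record the intent, do not execute, return a synthetic success.
            self.planned.append((name, kwargs))
            return {"status": "ok", "synthetic": True}
        return self.tools[name](**kwargs)
```

Read-only tools still run in shadow mode in this sketch, so the shadow output stays realistic. Whether that is acceptable depends on the cost and rate limits of those tools.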

5) Split A/B — buckets, ramps, hold times

After shadow says "v18 is at least as good as v17," the team ramps. The standard ladder is five steps.

RAMP LADDER                  HOLD TIME      EVAL GATE
───────────                  ─────────      ─────────
1%   ──── canary ────         24-48h        no incidents, latency steady
5%   ──── early ────          5-7 days      csat ≥ baseline, complaint rate ≤ baseline
25%  ──── significant ──      5-7 days      win-rate stable, no tail effects
50%  ──── majority ──         3-5 days      full business-metric parity
100% ──── full ──             —             v17 archived, not deleted

Bucketing is deterministic — a hash of user ID or session ID into a bucket. Same user always sees the same variant within the ramp window. This matters because users complain in pairs ("yesterday it greeted me, today it didn't") and the team needs to know which variant the user actually saw.
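
Deterministic bucketing can be sketched with a stable hash. The function names and the experiment salt are assumptions for illustration. Python's built-in `hash()` is avoided because string hashes are randomized per process, which would move users between variants on every restart.

```python
import hashlib


def bucket(user_id: str, experiment: str, n_buckets: int = 100) -> int:
    """Map a stable user ID to a bucket in [0, n_buckets)."""
    # Salting with the experiment name keeps bucket assignments
    # independent across experiments (an assumption of this sketch).
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


def variant(user_id: str, experiment: str, new_share_pct: int) -> str:
    """Users in the lowest new_share_pct buckets see the new prompt."""
    return "new" if bucket(user_id, experiment) < new_share_pct else "old"
```

Because the new variant always takes the lowest buckets, raising the share from 5 to 25 only adds users to the new prompt. Nobody who already saw it gets flipped back mid-ramp.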

Hold times exist because problems do not show up uniformly. A regression in refund handling does not surface in the first hour — refunds happen on a weekly cadence. A degradation in tone does not surface in the first complaint — users habituate, then churn silently four weeks later. The hold time at each rung is the team buying time for slow signals to arrive.

The eval gate between rungs is not just offline eval. It is offline eval plus live business metrics — csat scores from the variant bucket, complaint rate per thousand sessions, NPS, rebuy rate, escalation rate. The gate is structured as "v18 must be at least as good as v17 on these N metrics, with confidence X." If a metric trips, the ramp pauses or the rollback fires.
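
The ladder and the gate can be written down as data plus one decision function. This is a simplified sketch: `Metric`, `tripped`, and `next_step` are hypothetical names, and a fixed tolerance per metric stands in for the "with confidence X" part, which a real gate would replace with a significance test.

```python
from dataclasses import dataclass
from typing import List, Tuple

LADDER = [1, 5, 25, 50, 100]  # percent of users on the new prompt


@dataclass
class Metric:
    name: str
    baseline: float         # value observed in the old-prompt bucket
    variant: float          # value observed in the new-prompt bucket
    higher_is_better: bool
    tolerance: float        # allowed shortfall before the gate trips


def tripped(metrics: List[Metric]) -> List[str]:
    """Return the names of metrics where the variant is worse than allowed."""
    bad = []
    for m in metrics:
        if m.higher_is_better:
            shortfall = m.baseline - m.variant
        else:
            shortfall = m.variant - m.baseline
        if shortfall > m.tolerance:
            bad.append(m.name)
    return bad


def next_step(current_pct: int, metrics: List[Metric]) -> Tuple[str, int]:
    """Decide what happens once the hold time at current_pct has elapsed."""
    if tripped(metrics):
        return ("pause_or_rollback", current_pct)
    if current_pct == LADDER[-1]:
        return ("done", current_pct)
    return ("advance", LADDER[LADDER.index(current_pct) + 1])
```

The function deliberately returns a decision rather than acting on it. Pausing versus rolling back stays a human call, informed by which metric tripped.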

6) Statistical significance — the back-of-envelope a lead is expected to know

The most common A/B sin is calling a winner too early. To detect a 2% effect (read here as a 2-percentage-point absolute shift) on a binary metric (csat-good vs csat-bad) at 95% confidence and 80% power, the rough sample size needed is around 4,000 to 10,000 trials per variant, depending on the base rate. For a 5% effect, the number drops to roughly 600 to 1,500 per variant. For a 1% effect, it climbs past 20,000.

EFFECT TO DETECT     SAMPLES PER VARIANT (rough)
─────────────────    ─────────────────────────────
10%                  ~150
5%                   ~600 - 1,500
2%                   ~4,000 - 10,000
1%                   ~20,000 - 40,000
0.5%                 ~80,000+

The numbers move with base rate and metric noise. The order of magnitude is steady. The implication for prompt ops is sharp: most prompt A/Bs do not run long enough to detect anything subtle. If your traffic is 1,000 sessions a day and your variant share is 5%, you collect 50 samples a day in the new bucket. Three weeks to a thousand. Roughly three to seven months to reach the 4,000 to 10,000 needed for a 2% effect. Plan the ramp around the traffic.
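
The back-of-envelope can be reproduced with the standard normal-approximation formula for comparing two proportions. This is a sketch under stated assumptions: the effect is treated as an absolute shift in the rate, the test is two-sided, and both function names are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def samples_per_variant(base_rate: float, abs_effect: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-variant sample size for a two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    p1, p2 = base_rate, base_rate + abs_effect
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / abs_effect ** 2)


def days_to_collect(n: int, daily_sessions: int, variant_share: float) -> int:
    """Calendar days until the variant bucket has n samples."""
    return ceil(n / (daily_sessions * variant_share))


n = samples_per_variant(base_rate=0.5, abs_effect=0.02)
print(n, days_to_collect(n, daily_sessions=1000, variant_share=0.05))
```

With a 0.5 base rate the result lands near the top of the 4,000 to 10,000 range quoted for a 2% effect. At 50 samples a day, that is months of calendar time, not weeks.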

This is why shadow mode is so valuable. A shadow sample is paired (same input, two outputs), and pairing cancels the request-to-request variance that an unpaired test has to average out. Pairwise comparisons therefore typically need far fewer samples than unpaired business-metric tests, often by an order of magnitude. A few hundred shadow pairs is a strong signal. A few hundred A/B sessions is barely a signal.
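
One simple way to check whether a paired win/loss split is more than noise is an exact sign test: drop the ties and ask how likely the split would be if wins and losses were equally probable. The helper name is hypothetical, and the sign test is this sketch's choice of method rather than one the text prescribes.

```python
from math import comb


def sign_test_p(wins: int, losses: int) -> float:
    """Two-sided exact sign test p-value, ignoring ties."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1))
    return min(1.0, 2 * tail / 2 ** n)
```

For a hypothetical 300 pairs split 114 wins, 147 ties, 39 losses (the 38/49/13 proportions used earlier), the p-value is far below 0.001. A lopsided paired split becomes decisive at a few hundred samples.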

7) Worked example — the customer-support greeter, v17 to v18

Same recipe from chapter zero. v17 of customer_support_greeter says "Always greet the user by name." v18 removes that line — the engineer thought the greeting felt too formal. The team treats this as a routine prompt change. Here is what shipping it carefully looks like.

Day 0 — offline eval. The eval suite runs against v18. 200 examples, four rubric dimensions (helpfulness, tone, accuracy, brevity). v18 scores 4.1/5 average vs v17's 4.0. Pass.

Days 1-3 — shadow mode at 100%. Every production request is shadowed to v18. The customer sees v17. The shadow store accumulates 18,000 pairs over three days. An LLM judge scores them pairwise. Result: v18 wins 14%, ties 71%, loses 15%. Net roughly zero (minus one point). Surprising, because the offline rubric said v18 was better. The pairwise judge on real traffic says it is a wash.

The team digs in. The 15% of losses cluster on requests where the customer signed off with their first name in the previous message. v17 picks up the name. v18 does not. The team is now seeing a real-distribution effect the offline eval missed.

Decision point. Two options. Iterate v18 to fix the name pickup, or ship anyway. The team picks ship-anyway, because the win-rate is a wash and the team wants to validate that an offline rubric win does not guarantee a shadow win. But they ramp cautiously.

Days 4-10 — A/B at 5%. Bucket users by hash mod 20. One bucket sees v18. The team watches csat, complaint volume, and escalation rate. After seven days, csat in the v18 bucket is 4.2 vs 4.3 in v17 — within noise on 1,200 sessions. Complaint volume is +18% in the v18 bucket — outside noise. Customers are noticing the lack of personalization.

Decision point. Roll back, or fix forward. The team rolls back v18 to 0% and opens v19, which adds back name pickup but in a less formal greeting style. v19 enters the same pipeline. Offline first, shadow next, A/B after that.

Days 11-18 — v19 through the pipeline. v19 offline passes. v19 shadow wins 31% / ties 58% / loses 11% — a real net positive. v19 A/B at 5% holds csat and reduces complaint volume by 4%. v19 ramps to 25% on day 19, 50% on day 24, 100% on day 28.

What did the trial bake save the team from? A silent regression that the offline eval cleared. A two-week customer-support fire if v18 had shipped at 100% on day 4. Three to five thousand complaint tickets, conservatively. One angry email from a VP. The cost of running the pipeline was a couple of extra weeks and ~2x compute on shadowed requests for three days. Cheap insurance.

Mid-content recall

Why does the shadow output never reach the customer, even when the shadow prompt clearly produces a good answer?
Why does pairwise comparison need fewer samples than an unpaired business-metric A/B?
What is the standard ramp ladder, in five rungs?

8) Failure modes — where teams ramp into the rocks

FAILURE MODE                              FIX
────────────                              ───
ramp straight to 50% to "save time"   →   hold at 5% and 25% — tail effects need time
no eval gate between rungs            →   gate every rung on offline + business metrics
shadow without short-circuit on       →   shadow path MUST disable side-effecting tools
  side-effects (tool calls fire 2x)
random sampling on stratified         →   stratify by query class, customer tier, locale
  traffic (rare slices undersampled)
bucketing on request ID, not user ID  →   bucket on a stable user ID; otherwise one user sees both variants
declaring a winner at 50 sessions     →   compute sample size before the ramp starts
no rollback playbook                  →   document the rollback before the ramp begins
shadow output stored with PII forever →   redact and TTL the shadow store
LLM judge is the same model that      →   ensemble of judges, or one different family
  generated v18 (self-bias)

The pattern across most of these is "skipping the gate to ship faster." Every gate exists because someone, somewhere, lost a week of csat to the corner case the gate would have caught.

9) Where this lives in the wild

The shadow-and-A/B pattern is older than LLMs — search and recommendation teams have shipped this way for fifteen years. The LLM-specific surfaces wrap the same plumbing.

LaunchDarkly — feature flags as the bucketing layer; prompt SHA gated per flag value, ramp via percentage rollout.
Statsig — experimentation platform with A/B + holdouts + automated significance testing.
Split.io — feature flag and A/B platform used widely for prompt-variant gating.
Vercel Edge Config / flags — deterministic flag evaluation at the edge for prompt-variant routing.
Optimizely — A/B platform with stratified bucketing and lift calculations.
Eppo — experimentation platform with CUPED variance reduction, used for low-traffic LLM tests.
Langfuse — shadow trace tagging via variant metadata; experiment views compare prompt versions side by side.
LangSmith — pairwise eval API, run two prompt versions over a dataset and judge.
Braintrust — diff view across prompt versions with pairwise scoring built in.
Helicone — request-level tagging of prompt version and experiment ID for downstream analysis.
PromptLayer — A/B test view across stored prompt templates.
Vellum — built-in experiment console with rollout controls and metric dashboards.
Pezzo — variant routing rules tied to user attributes for staged rollouts.
Phoenix (Arize) — production trace store that compares output distributions across prompt versions.
Galileo — production observability with built-in A/B and regression comparison views.
Datadog LLM Observability — variant tag on every span, dashboards for metric comparison.
OpenLLMetry — OpenTelemetry-based traces with prompt-version attributes for A/B segmentation.
OpenAI Evals — paired eval harness for pre-shadow offline scoring.
Promptfoo — local pairwise eval CLI, often the offline gate before shadow.
DeepEval — pytest-style assertions for offline pre-shadow checks.
GitHub Actions — runs the offline eval gate on every prompt PR; CircleCI does the same.
Notion AI — internal staged rollouts of prompt changes, gated by csat in beta cohorts.
GitHub Copilot — known to A/B prompt variants across editor sessions, gated by acceptance rate.
Cursor / Windsurf — staged rollouts of system prompt changes via remote config.
Linear's AI features — prompt variants tested across workspaces before full rollout.
Intercom Fin — staged prompt rollouts by customer-tier bucketing.

If a product is shipping LLM features at scale and is not running at least shadow comparison, the team is either pre-revenue or under-resourced. Once revenue depends on the prompt, the trial bake stops being optional.

Pause and recall

What are the three patterns for testing a prompt change, in order of increasing real-customer exposure?
Why is stratified sampling preferred over random sampling for shadow mode?
What is the typical sample size to detect a 2% effect at 95% confidence?
What must the shadow path not do, even when the new prompt clearly wants to?
Why is deterministic bucketing by user ID better than by request ID?
What is the role of the hold time between ramp rungs?
Name three classes of metric that should gate the ramp besides offline eval score.

Interview Q&A

Q1. How do you A/B test a prompt change safely? A. Three stages, gated. First, offline eval against a fixed set — fast, cheap, broad coverage. Second, shadow mode on real traffic — run both prompts, customer sees old, pairwise compare outputs to get a win-rate on the real distribution. Third, split A/B with deterministic user bucketing and a ramp ladder (1% → 5% → 25% → 50% → 100%) with hold times and eval gates between rungs. The eval gate at each rung is offline score plus business metrics like csat and complaint rate. Trap: "We run an A/B straight from offline." Skipping shadow is how you ship offline-clean regressions onto real users.

Q2. What is shadow mode and why does it cost ~2x? A. Every (sampled) production request is routed to both the current prompt and the candidate prompt. The customer receives only the current prompt's output. The candidate's output is logged with the same trace ID and input for offline pairwise scoring. The cost is roughly doubled on the shadow path because both LLM calls happen — though shadow can be sampled (10%, stratified) to control cost. Trap: Forgetting that the shadow path must short-circuit side-effecting tool calls. If shadow refunds fire, you have just doubled your refund volume.

Q3. How do you decide the ramp percentage and hold time? A. Ramp percentage depends on blast radius — how many users a regression would hurt before detection. Hold time depends on the slowest business signal — if csat takes a week to surface, hold for at least a week. Standard ladder is 1% / 5% / 25% / 50% / 100% with 1-2 days at canary, 5-7 days at 5% and 25%, 3-5 days at 50%. Lower-traffic systems hold longer to accumulate signal. Trap: Ramping straight to 50% to "save time." Tail effects (refund-flow regressions, churn-driven complaints) need calendar time, not session count.

Q4. How big does an A/B sample need to be to detect a small effect? A. Rough numbers — to detect a 2% effect on a binary metric at 95% confidence and 80% power, around 4,000 to 10,000 samples per variant. A 5% effect needs ~600 to 1,500. A 1% effect needs 20,000+. The numbers move with base rate and metric noise; the order of magnitude is stable. Plan the ramp around traffic — if your daily sessions are small, the A/B may not reach significance for months. Trap: Declaring a winner at 100 sessions. With anything subtler than a 20% effect, that is noise.

Q5. How do you bucket users for a prompt A/B? A. Deterministic hash of stable user or session ID into N buckets. Stable means the same user sees the same variant for the duration of the test, which keeps the user's experience consistent and lets you attribute complaints correctly. Avoid bucketing by request ID — the same user sees both variants and the test is invalidated. Trap: Bucketing by IP, which churns across mobile networks and corrupts attribution.

Q6. What is the LLM-as-judge bias problem in pairwise comparison? A. The judge model has preferences. If your judge is the same family that generated the candidate, the judge is biased toward outputs that look like its own. Position bias is also real — many judges prefer the first or second option presented. Mitigations: ensemble of judges from different families, swap order on every comparison, calibrate against a small human-labeled set, and weight by judge-human agreement. Trap: Using a single same-family judge and treating its win-rate as ground truth.

Q7. How do you handle a prompt change that triggers state-mutating tools, in shadow mode? A. The shadow path must short-circuit all side-effecting tools — refunds, emails, deploys, writes. The shadow is a plan that the team scores; it is not allowed to take action. The simplest implementation is a shadow=true flag passed into the tool runner that intercepts mutating calls and returns a synthetic success. The judge then scores the planned tool calls as part of the output. Trap: Letting shadow fire tools "for realism" — you have just doubled state changes.

Q8. When should you skip shadow and go straight to a 1% A/B? A. Rarely. Acceptable cases: the prompt change is a trivial wording fix that offline eval has confirmed has zero behavioral diff (see chapter 06 on drift detection); the system has no shadow infrastructure and the team has explicitly accepted the higher blast radius; or the change is reversible in <60 seconds via a feature flag and the team has tested the rollback. In every other case, shadow first. Trap: "Our offline eval is comprehensive enough." It is not. Production has a long tail your eval set does not.
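
The order-swap mitigation from Q6 can be sketched as a thin wrapper around any judge. The judge is modeled here as a hypothetical callable that returns "first", "second", or "tie". Treating a disagreement between the two orderings as a tie is one reasonable convention, not the only one.

```python
from typing import Callable

# Hypothetical judge signature: (input, first_output, second_output) -> verdict
Judge = Callable[[str, str, str], str]


def debiased_verdict(judge: Judge, user_input: str, old_out: str, new_out: str) -> str:
    """Judge each pair in both orders; count only consistent preferences."""
    forward = judge(user_input, old_out, new_out)   # new output shown second
    backward = judge(user_input, new_out, old_out)  # new output shown first
    if forward == "second" and backward == "first":
        return "better"   # new preferred in both orders
    if forward == "first" and backward == "second":
        return "worse"    # old preferred in both orders
    return "same"         # tie, or a preference that flips with position
```

A judge that always picks whichever option it sees first produces "same" on every pair under this wrapper, so pure position bias cannot inflate the win-rate.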

Apply now (5 min)

Step 1 — model first. For a customer-support prompt change at a system with 5,000 daily sessions, sketch the rollout. Day 0: offline eval. Days 1-3: shadow at 100%, pairwise judge. Days 4-10: A/B at 5%. Days 11-15: 25%. Days 16-20: 50%. Day 21+: 100%. Hold times at each rung. Eval gate at each rung — offline score, pairwise win-rate, csat, complaint rate.

Step 2 — your turn. Pick one prompt in your system. List the shadow infrastructure you would need — variant routing, paired log store, pairwise judge, win-rate dashboard. List the gates — offline rubric, shadow win-rate, A/B business metric, rollback trigger. Write down the rollback command you would execute if the 25% rung trips.

Step 3 — sketch from memory. Redraw the three-stage diagram from section 3 — offline, shadow, A/B — with costs and blind spots beside each. Then redraw the ramp ladder from section 5 with hold times and eval gates.

Bridge. Shadow tells you that v18 differs from v17 on real traffic. What it does not tell you is how it differs, or whether the team should care about the difference. A wording change can shift output length by 70% without changing win-rate by a single point. The next page is how to detect, quantify, and reason about that hidden shift.

→ 06-prompt-drift-detection.md