10. A/B testing prompts: measure behavior, not vibes
~14 min read. A prompt is good only if it wins on the task that matters.
Built on the ELI5 in 00-eli5.md. The Revision ledger — the record of prompt variants — becomes useful only when variants are compared with real metrics.
Why offline opinions are not enough
Look. A prompt can sound better to the team, yet perform worse for users. It may be shorter, warmer, or more elegant. Still, it can lower resolution rate, raise hallucinations, or hurt parser success. That is why prompt decisions need experiments.
Picture first.
         traffic split
┌──────────────┬──────────────┐
│  variant A   │  variant B   │
│  old prompt  │  new prompt  │
└──────┬───────┴──────┬───────┘
       ▼              ▼
 user outcomes   user outcomes
       │              │
       └────── compare metrics ──────→ ship or rollback
Simple, no? An A/B test asks one practical question. For similar traffic, did prompt B improve the outcome we care about? Not, "Did senior people like reading it?" Not, "Did one demo go well?" Real usage decides.
Now what is the problem? A prompt change often affects many things at once. Accuracy may rise. Latency may rise too. Refusal rate may fall. Risk may rise. So what to do? Choose one primary metric, a few secondary metrics, and the guardrail metrics before the test begins. Otherwise people cherry-pick the result they like.
What to measure in a prompt experiment
The right metric depends on the product. Support bots may track containment rate, customer satisfaction, resolution accuracy, and escalation rate. Classification systems may track exact match, parse success, and manual-review rate. Coding assistants may track acceptance rate, undo rate, and task completion.
prompt experiment scorecard
┌──────────────────────────────┐
│ primary metric │
│ secondary metrics │
│ safety guardrails │
│ latency / cost guardrails │
└──────────────────────────────┘
The primary metric should reflect core product value. One primary metric is usually enough. Secondary metrics help explain tradeoffs. Guardrails stop harmful wins. Example. A support prompt might improve containment, but if policy hallucinations double, that is not a real win.
Be careful with proxies. Longer answers may look thoughtful. But longer is not the goal. User success is the goal. The Reply form may also need its own guardrail. If parse rate drops, backend failures may erase any quality gain.
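The scorecard can be written down as a small decision rule. The following is a minimal sketch, not a standard implementation. The `Guardrail` class, the `decide` function, the metric names, and every limit are hypothetical and chosen only for illustration. It also ignores statistical significance, which is covered in the next section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrail:
    name: str
    limit: float                  # worst acceptable value for variant B
    higher_is_worse: bool = True  # False for metrics like parse success

    def violated(self, value: float) -> bool:
        return value > self.limit if self.higher_is_worse else value < self.limit


def decide(primary_a, primary_b, guardrails, observed_b, min_lift=0.0):
    """Return (decision, broken_guardrails) for variant B versus variant A."""
    broken = [g.name for g in guardrails if g.violated(observed_b[g.name])]
    if broken:
        return "rollback", broken
    if primary_b - primary_a > min_lift:
        return "ship", []
    return "keep A", []


# Hypothetical limits, fixed before the test starts.
GUARDRAILS = [
    Guardrail("policy_hallucination_rate", limit=0.03),
    Guardrail("median_latency_s", limit=3.0),
    Guardrail("parse_success_rate", limit=0.95, higher_is_worse=False),
]

# Containment improved, but hallucinations went past the limit.
decision, broken = decide(
    primary_a=0.61,
    primary_b=0.66,
    guardrails=GUARDRAILS,
    observed_b={
        "policy_hallucination_rate": 0.06,
        "median_latency_s": 2.4,
        "parse_success_rate": 0.97,
    },
)
print(decision, broken)  # rollback ['policy_hallucination_rate']
```

The guardrails are checked before the primary metric on purpose. A harmful win never reaches the "ship" branch, no matter how large the lift is.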
Randomization and significance for prompts
See. A/B testing is not only splitting traffic. You need comparable traffic. You need enough samples. You need stable assignment, so the same conversation always sees the same variant. And you need to ask whether the observed lift could just be noise. That is where significance comes in.
Picture first again.
  small noisy sample          larger stable sample
┌──────────────────────┐   ┌────────────────────────┐
│ A wins on 12 chats   │   │ A vs B on 12,000 chats │
│ maybe luck           │   │ clearer signal         │
└──────────────────────┘   └────────────────────────┘
You do not need to become a statistician overnight. But you do need discipline. Predefine the metric. Estimate sample size. Run until the planned threshold, not until your favorite version looks lucky. That last mistake is common and costly.
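"Estimate sample size" can be made concrete with the usual normal-approximation formula for comparing two proportions. This is a rough sketch under stated assumptions: the metric is a success rate, the test is two-sided, and the baseline rate, target rate, `alpha`, and `power` values are illustrative choices, not numbers from any real experiment. The function name `sample_size_per_arm` is hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_arm(p_base, p_new, alpha=0.05, power=0.8):
    """Approximate conversations needed per variant to detect p_base -> p_new."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired power
    p_bar = (p_base + p_new) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p_base * (1 - p_base) + p_new * (1 - p_new))
    ) ** 2
    return ceil(numerator / (p_new - p_base) ** 2)


# Illustrative targets: a 5-point lift versus a 1-point lift from a 50% baseline.
big_lift = sample_size_per_arm(0.50, 0.55)
small_lift = sample_size_per_arm(0.50, 0.51)
print(big_lift, small_lift)
```

The 5-point lift needs on the order of 1,500 conversations per arm. The 1-point lift needs many times more, because the required sample grows with the inverse square of the lift. That is why 12 chats cannot settle anything.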
For many teams, a practical rule works. If the prompt touches high-risk behavior, start offline, then canary, then small online split, then broader rollout. If the change only affects tone, you may move faster. Still, randomize fairly. Do not give variant B all the easy traffic.
Worked example: support prompt A vs B
Suppose version A is the current billing-support prompt. Version B adds one negative example and a stricter JSON output contract. The product team wants to know if B is better. They choose these metrics.
Primary metric: correct resolution rate
Secondary metrics: containment rate, CSAT, parser success
Guardrails: unsupported promise rate, refusal rate, latency
Traffic split: 50% to A. 50% to B. Stable by conversation ID. Run for one week.
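One common way to get a split that is "stable by conversation ID" is to hash the ID into a bucket. The sketch below assumes this approach; the source does not say how the team implements it. The function name `assign_variant` and the experiment label `billing-prompt-ab` are hypothetical.

```python
import hashlib


def assign_variant(conversation_id: str,
                   experiment: str = "billing-prompt-ab",
                   share_b: float = 0.5) -> str:
    """Deterministically map a conversation to 'A' or 'B'."""
    # Salting with the experiment name keeps buckets independent across tests.
    key = f"{experiment}:{conversation_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "B" if bucket < share_b else "A"


# The same conversation always lands in the same variant.
print(assign_variant("conv-1001"), assign_variant("conv-1001"))
```

Because the assignment depends only on the ID and the experiment name, it needs no lookup table, and a retry or a second turn in the same conversation cannot switch prompts midway.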
Possible results.
Variant A
- Correct resolution rate: 78.4%
- Parser success: 93.1%
- Unsupported promise rate: 4.8%
- Median latency: 2.2s
Variant B
- Correct resolution rate: 82.1%
- Parser success: 98.6%
- Unsupported promise rate: 1.7%
- Median latency: 2.5s
Now what is the decision? If the sample is large enough, B likely wins. Correct resolution rose by 3.7 points. Parser success rose by 5.5 points. Unsupported promises dropped from 4.8% to 1.7%. Median latency rose by 0.3s. If that stays inside the latency guardrail set before the test, it is a good trade.
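"If the sample is large enough" can be checked with a two-proportion z-test on the primary metric. The example does not state how many conversations each variant received, so the counts below are assumptions: 5,000 per arm for the large case and 200 per arm for the small case, with successes chosen to match the reported rates as closely as whole numbers allow. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference between two success rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# ASSUMED sample sizes; only the rates come from the worked example.
z_large, p_large = two_proportion_z_test(3920, 5000, 4105, 5000)  # 78.4% vs 82.1%
z_small, p_small = two_proportion_z_test(157, 200, 164, 200)      # 78.5% vs 82.0%
print(f"5,000 per arm: z={z_large:.2f}, p={p_large:.2g}")
print(f"  200 per arm: z={z_small:.2f}, p={p_small:.2g}")
```

Under these assumed counts, the same lift is clearly significant at 5,000 conversations per arm and indistinguishable from noise at 200. The rates alone do not decide the test. The sample size does.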
See how the Revision ledger and experiment connect. The ledger says what changed. The A/B test says whether the change mattered. Without both, you get stories, not evidence.
Common experiment mistakes
Mistake one. Changing prompt and model at the same time. Then you cannot attribute the effect. Mistake two. Changing temperature too. Same problem. Mistake three. Watching the dashboard every hour, then stopping early when your favorite prompt leads. That inflates false wins. Mistake four. Using only one vanity metric.
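Mistake three can be shown with a small simulation. The sketch below runs A/A tests, where both variants share the same true success rate, so every "significant" result is a false win. It compares checking the dashboard at every look against checking once at the planned end. All parameters (number of runs, looks, batch size, success rate, seed) are arbitrary assumptions for illustration.

```python
import random
from math import sqrt


def z_stat(s_a, s_b, n):
    """z statistic for two arms with n samples each."""
    pooled = (s_a + s_b) / (2 * n)
    if pooled in (0.0, 1.0):
        return 0.0
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    return (s_b / n - s_a / n) / se


def simulate(runs=400, looks=20, per_look=100, rate=0.8, seed=7):
    """Return (false-win rate with peeking, false-win rate at the planned end)."""
    rng = random.Random(seed)
    peeking_hits = fixed_hits = 0
    for _ in range(runs):
        s_a = s_b = n = 0
        crossed_at_any_look = False
        for _ in range(looks):
            s_a += sum(rng.random() < rate for _ in range(per_look))
            s_b += sum(rng.random() < rate for _ in range(per_look))
            n += per_look
            if abs(z_stat(s_a, s_b, n)) > 1.96:
                crossed_at_any_look = True  # a peeker would have stopped here
        peeking_hits += crossed_at_any_look
        fixed_hits += abs(z_stat(s_a, s_b, n)) > 1.96
    return peeking_hits / runs, fixed_hits / runs


peeking_rate, fixed_rate = simulate()
print(f"peeking: {peeking_rate:.1%}  planned end only: {fixed_rate:.1%}")
```

The planned-end rate should sit near the nominal 5%. The peeking rate comes out several times higher, because each extra look is another chance for noise to cross the threshold.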
So what to do? Control one major change at a time when possible. Log the inference config. Respect the planned duration. Inspect segment-level effects too. A prompt may help short chats and hurt long chats. Overall averages can hide that.
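Segment-level inspection needs only a grouped success rate. The sketch below uses made-up records of the form `(variant, segment, success)`. The counts are invented so that B helps short chats and hurts long chats while the overall averages tie. The helper names are hypothetical.

```python
from collections import defaultdict


def success_rates(records, by_segment=True):
    """Success rate per (variant, segment), or per variant if by_segment=False."""
    counts = defaultdict(lambda: [0, 0])  # key -> [successes, total]
    for variant, segment, success in records:
        key = (variant, segment) if by_segment else variant
        counts[key][0] += int(success)
        counts[key][1] += 1
    return {key: wins / total for key, (wins, total) in counts.items()}


def make(variant, segment, wins, total):
    """Build illustrative records with the given number of successes."""
    return [(variant, segment, i < wins) for i in range(total)]


# Invented data: B is better on short chats and worse on long ones.
records = (
    make("A", "short", 80, 100) + make("B", "short", 90, 100)
    + make("A", "long", 70, 100) + make("B", "long", 60, 100)
)

overall = success_rates(records, by_segment=False)
segments = success_rates(records)
print(overall)   # {'A': 0.75, 'B': 0.75}
print(segments)
```

The overall view shows a tie. The segment view shows a 10-point gain on short chats and a 10-point loss on long chats. Decide the segments before the test, just like the metrics, so the breakdown does not become another place to cherry-pick.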
Where this lives in the wild
- Intercom Fin — support AI teams compare prompt variants on containment, resolution quality, and escalation outcomes instead of relying on anecdotal transcripts.
- GitHub Copilot — experimentation teams measure suggestion acceptance, follow-up edits, and task completion because pretty completions are not enough.
- Perplexity and search-answer products — answer-style variants can be tested on click-through, citation trust, and session success while guarding against factual regressions.
- Enterprise AI copilots on Azure OpenAI or Bedrock — operations teams often canary new prompt versions on limited tenant traffic before wider rollout.
- LangSmith-style prompt platforms — prompt owners link traces, dataset evals, and online metrics so version decisions stay evidence-backed.
Pause and recall
- Why can a prompt that sounds nicer still be a worse product prompt?
- What is the difference between a primary metric and a guardrail metric?
- Why must prompt experiments control model and sampling settings?
- Why is early stopping on lucky dashboards dangerous?
Interview Q&A
Q: Why is an A/B test stronger than team preference when choosing between prompts? A: Because it measures real user outcomes on comparable traffic. Team preference is subjective and often weakly correlated with production success.
Common wrong answer to avoid: "Because users always know best immediately." User data is valuable, but experiments still require careful metric design and guardrails.
Q: Why should prompt experiments define guardrail metrics before launch? A: A prompt can improve the main outcome while harming safety, cost, or latency. Guardrails prevent those hidden regressions from being ignored.
Common wrong answer to avoid: "Guardrails are optional if the primary metric improves a lot." Large gains do not excuse harmful regressions automatically.
Q: Why is it hard to attribute wins when prompt and model change together? A: Because the treatment is no longer isolated. You cannot tell whether improvement came from the prompt, the model, or their interaction.
Common wrong answer to avoid: "If the result is better, attribution does not matter." Attribution matters for learning, rollback, and future iteration.
Q: Why do prompt experiments need enough sample size and fixed duration? A: Because prompt effects can be small and traffic is noisy. Without adequate samples and discipline, random variation can look like a real win.
Common wrong answer to avoid: "A few good conversations are enough to decide." They are useful examples, not reliable evidence.
Apply now (5 min)
Exercise. Pick one AI workflow you know. Write one primary metric, two secondary metrics, and two guardrails for a prompt experiment. Then state what must stay fixed besides the prompt text. Include the model, the decoding settings, and the traffic assignment.
Sketch from memory. Draw the split funnel. Put variant A on the left, variant B on the right, and metrics in the middle. Write, "No cherry-picking." under the diagram.
Bridge. A/B testing tells us which prompt wins. But when a prompt loses, we still need to know why. So next we move from experimentation to diagnosis. → 11-prompt-debugging.md