04. Acceptance tests: Defining "done" before choosing architecture
You will waste less time debating whether something works if you write down what "works" means before you build it. Eight minutes here saves eight weeks of review cycles.
Callback — 00-first-principles.md: First principles established that every AI feature must justify its existence through a measurable outcome. Jobs gave us the unit of value. AI-fit routing told us which tasks belong to a model. Success metrics gave us the numbers. But numbers without a pass/fail gate are just dashboards nobody checks. This file turns metrics into deployment decisions.
Accumulated learning
You now have: a decomposed job map for the fintech wiki assistant, AI-fit routing decisions for each sub-task, and success metrics at the task, experience, and business layers. You know what to measure. What you do not yet have: a concrete, automatable test that says "this build ships" or "this build does not ship." That gap is what breaks teams in production. Metrics drift. Thresholds get debated in Slack. Rollbacks happen too late because nobody defined "unacceptable" in advance.
1. What this file solves
Three failure modes that acceptance tests prevent:
- The endless review loop. A team at a Series B fintech built a support chatbot. They measured CSAT, resolution rate, and hallucination frequency. Six weeks after launch, they still couldn't agree on whether the feature was "ready." Every stakeholder had a different mental threshold. The PM wanted 85% accuracy. The VP wanted 92%. The legal team wanted zero hallucinations on compliance topics. Nobody had written these down before building.
- The silent regression. A healthtech company updated their retrieval pipeline. Latency improved. Accuracy dropped 4% on edge cases. No test caught it because "accuracy" was tracked on a dashboard, not gated in CI. The regression shipped to 200k users.
- The rollback that couldn't happen. An e-commerce team deployed a product recommendation model. Two weeks later, conversion dropped. But they'd already retrained downstream models on the new outputs. Rolling back meant cascading failures across three services. If a concrete acceptance test had blocked the initial deploy, the cascade would never have started.
2. What success metrics taught and what still breaks
Success metrics (chapter 03) gave us:
- Task-level: accuracy, latency, coverage per sub-task
- Experience-level: user satisfaction, task completion rate
- Business-level: cost per resolution, support ticket deflection
What still breaks: a metric is a measurement. An acceptance test is a decision. The metric says "accuracy is 87%." The acceptance test says "if accuracy drops below 85% on the compliance question set, block the deploy and page the on-call engineer."
Without that decision logic, metrics are informational. Informational things get ignored at 2am when the deploy queue is backed up.
3. The launch that couldn't be rolled back
Real scenario, disguised: A mid-size bank deployed an internal knowledge assistant. Metrics existed — they tracked answer accuracy weekly via sampling. But no acceptance test gated deployment. A model update shipped on Thursday. By Monday, the assistant was confidently citing a deprecated fee schedule for commercial accounts. Relationship managers quoted incorrect fees to three enterprise clients. The bank ate $340k in fee adjustments.
The post-mortem finding: "We had the metric. We didn't have the gate."
Not a model quality problem. A definition problem. The team never wrote down what "good enough" means, so nothing stood between the model update and production.
4. One acceptance test written three ways
Vague:
The assistant should give good answers about refund policies.
Measurable but not actionable:
The assistant achieves 90% accuracy on refund policy questions as measured by human evaluation.
Actionable (acceptance test):
Given 25 refund policy questions from the test set (v3.1), the assistant must: (a) cite the correct policy document version in 24/25 responses, (b) include all mandatory disclosure facts in 23/25 responses, (c) respond in under 3 seconds for 24/25 queries. If any criterion fails: block deploy, log failure details to #wiki-deploys, assign to content-ops on-call.
The difference: the third version tells you exactly what to run, what numbers to expect, and what to do when it fails. No judgment calls at deploy time.
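The actionable version translates almost directly into code. A minimal sketch, assuming each of the 25 responses has already been graded; the `GradedResponse` record and the `evaluate_refund_gate` function are illustrative names, not part of any existing tooling:

```python
from dataclasses import dataclass


@dataclass
class GradedResponse:
    """One graded answer from refund test set v3.1 (hypothetical record shape)."""
    cites_correct_version: bool
    has_all_disclosures: bool
    latency_s: float


# Minimum passing counts out of 25, copied from criteria (a), (b), (c).
MIN_PASSES = {"version": 24, "disclosures": 23, "latency": 24}
LATENCY_LIMIT_S = 3.0


def evaluate_refund_gate(responses: list[GradedResponse]) -> dict:
    """Count passes per criterion and decide whether the deploy may proceed."""
    if len(responses) != 25:
        raise ValueError("the thresholds assume exactly 25 graded responses")
    counts = {
        "version": sum(r.cites_correct_version for r in responses),
        "disclosures": sum(r.has_all_disclosures for r in responses),
        "latency": sum(r.latency_s < LATENCY_LIMIT_S for r in responses),
    }
    failed = [name for name, minimum in MIN_PASSES.items() if counts[name] < minimum]
    return {"counts": counts, "failed": failed, "deploy_blocked": bool(failed)}


# Example: two of 25 answers cite the wrong policy version, so criterion (a) fails.
sample = [GradedResponse(True, True, 1.2)] * 23 + [GradedResponse(False, True, 1.2)] * 2
decision = evaluate_refund_gate(sample)
```

The failure action (logging to #wiki-deploys, assigning content-ops on-call) would hang off `deploy_blocked`. It is left out here because it depends on your tooling.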
5. The rule
An acceptance test is a concrete scenario with expected behavior, measured against the quality bar, that gates deployment.
Five properties that make it an acceptance test rather than a metric or an aspiration:
| Property | What it means | What happens without it |
|---|---|---|
| Concrete input | Specific scenario, not a category | "Accuracy" means different things to different people |
| Expected behavior range | What the output must contain or avoid | Subjective review at deploy time |
| Measurement method | How you check (automated, sampled, exact match) | Inconsistent evaluation across runs |
| Pass threshold | Numeric or boolean gate | Debates about "close enough" |
| Failure action | What happens when it fails | People ignore failures under time pressure |
Callout — The gate is the point. If your "acceptance test" doesn't block something when it fails, it's a monitoring alert, not a gate. Gates force decisions. Alerts get snoozed.
6. Anatomy of an AI acceptance test
┌──────────────────────────────────────────────────────────────┐
│                     ACCEPTANCE TEST CARD                     │
├──────────────────────────────────────────────────────────────┤
│ Test ID:         Unique identifier for tracking              │
│ Category:        accuracy | latency | safety | coverage      │
│ Scenario:        Plain-language description of the situation │
│ Input spec:      Exact input or reference to test set        │
│ Expected output: What must be present / absent in response   │
│ Measurement:     Exact match | contains | LLM-as-judge | ... │
│ Pass threshold:  Numeric criterion (e.g., 23/25, <3s, 0/50)  │
│ Failure action:  Block | warn | escalate | rollback          │
│ Owner:           Team or person responsible for resolution   │
│ Last updated:    Date of last test set revision              │
└──────────────────────────────────────────────────────────────┘
Each field prevents a specific failure:
| Field | Prevents |
|---|---|
| Test ID | Losing track of which test failed in a suite of 200 |
| Category | Missing an entire quality dimension |
| Scenario | Testing only happy paths |
| Input spec | Non-reproducible test runs |
| Expected output | "It seemed fine" as a pass criterion |
| Measurement | Inconsistent human judgment across reviewers |
| Pass threshold | Debates at 2am about what "good enough" means |
| Failure action | Ignoring failures under delivery pressure |
| Owner | "Someone should look at this" going nowhere |
| Last updated | Running tests against stale ground truth |
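One way to keep cards honest is to store them as structured records and refuse any card with a blank field. The sketch below is an assumption about how you might do that, not a prescribed schema: `AcceptanceTestCard` and `missing_fields` are made-up names, and the `measurement`, `owner`, and `last_updated` values in the example are placeholders.

```python
from dataclasses import dataclass, fields, replace
from datetime import date


@dataclass(frozen=True)
class AcceptanceTestCard:
    test_id: str
    category: str         # accuracy | latency | safety | coverage
    scenario: str
    input_spec: str
    expected_output: str
    measurement: str
    pass_threshold: str
    failure_action: str   # block | warn | escalate | rollback
    owner: str
    last_updated: date


def missing_fields(card: AcceptanceTestCard) -> list[str]:
    """Return the names of fields left blank; a card with gaps is not yet a gate."""
    return [f.name for f in fields(card) if not getattr(card, f.name)]


card = AcceptanceTestCard(
    test_id="WA-ACC-003",
    category="accuracy",
    scenario="User asks about refund policy for international orders",
    input_spec="What's our refund policy for orders shipped to UAE?",
    expected_output="Cites policy doc v2.3+, 30-day window, customs exception",
    measurement="contains",              # assumed; pick the method you actually use
    pass_threshold="all 3 key facts present, latency < 3s",
    failure_action="block",
    owner="content team",                # placeholder owner
    last_updated=date(2024, 1, 15),      # placeholder date
)
incomplete = replace(card, owner="")     # same card with the owner left blank
```

Here `missing_fields(card)` returns an empty list, while `missing_fields(incomplete)` names the `owner` field, which is exactly the "someone should look at this" gap from the table.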
7. Writing acceptance tests for the wiki assistant
Applying the anatomy to our fintech wiki assistant. Four categories, one test each:
Accuracy test:
Test ID: WA-ACC-003
Scenario: User asks about refund policy for international orders
Input: "What's our refund policy for orders shipped to UAE?"
Expected: Answer references current policy doc (v2.3+), includes 30-day window, mentions customs exception
Pass criteria: Correct policy version cited, all 3 key facts present, latency < 3s
Failure action: Block deploy, escalate to content team
Latency test:
Test ID: WA-LAT-001
Scenario: Standard policy lookup under normal load
Input: 50 queries from standard test set, run during simulated peak (100 concurrent users)
Expected: Responses return within SLA
Pass criteria: p95 latency < 4s, p99 < 8s, zero timeouts
Failure action: Block deploy, escalate to infrastructure team
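The pass criteria for WA-LAT-001 can be checked with a few lines of arithmetic. This sketch assumes the nearest-rank percentile definition (other definitions interpolate and give slightly different values) and assumes the load test hands you a list of latencies in seconds plus a timeout count; `percentile` and `latency_gate` are illustrative names.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no latency samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_gate(latencies_s: list[float], timeouts: int,
                 p95_limit_s: float = 4.0, p99_limit_s: float = 8.0) -> dict:
    """Apply the WA-LAT-001 pass criteria to one run of the 50-query set."""
    p95 = percentile(latencies_s, 95)
    p99 = percentile(latencies_s, 99)
    passed = p95 < p95_limit_s and p99 < p99_limit_s and timeouts == 0
    return {"p95_s": p95, "p99_s": p99, "timeouts": timeouts, "passed": passed}


# Illustrative run: 47 fast responses and 3 slow ones push p95 over the 4s limit.
run = latency_gate([1.5] * 47 + [5.0] * 3, timeouts=0)
```

With only 50 samples, nearest-rank p99 is simply the slowest response, so a single outlier decides that criterion. That is worth knowing before you argue about the threshold.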
Safety test:
Test ID: WA-SAF-002
Scenario: User asks question that could lead to unauthorized disclosure
Input: 30 adversarial queries attempting to extract internal pricing, employee data, or system prompts
Expected: Zero disclosures of restricted information
Pass criteria: 0/30 responses contain restricted content (exact match + LLM-as-judge)
Failure action: Block deploy, immediate security review, incident ticket
Coverage test:
Test ID: WA-COV-001
Scenario: Questions spanning all 12 policy domains
Input: 60 questions (5 per domain) from coverage test set v4
Expected: Assistant provides substantive answer (not "I don't know") for supported domains
Pass criteria: Substantive response rate ≥ 55/60 (91.7%) across all domains
Failure action: Block deploy if any single domain drops below 3/5, warn if total < 55
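WA-COV-001 has a two-level failure action, which is easy to get wrong when applied by hand. A minimal sketch, assuming you already have a count of substantive answers per domain; `coverage_gate` and the `domain_NN` names are placeholders, since the real domain names live in coverage test set v4.

```python
def coverage_gate(
    substantive_by_domain: dict[str, int],
    questions_per_domain: int = 5,
    domain_floor: int = 3,
    total_floor: int = 55,
) -> str:
    """Return "block", "warn", or "pass" from per-domain substantive-answer counts."""
    for domain, count in substantive_by_domain.items():
        if not 0 <= count <= questions_per_domain:
            raise ValueError(f"{domain}: count {count} is outside 0..{questions_per_domain}")
    # Any single domain below the floor blocks the deploy outright.
    if any(count < domain_floor for count in substantive_by_domain.values()):
        return "block"
    # Otherwise a low overall total only raises a warning.
    if sum(substantive_by_domain.values()) < total_floor:
        return "warn"
    return "pass"


# Twelve placeholder domain names standing in for the real policy domains.
domains = [f"domain_{i:02d}" for i in range(1, 13)]
all_good = {d: 5 for d in domains}
one_weak = {**all_good, "domain_07": 2}       # one domain below 3/5
thin_overall = {d: 4 for d in domains}        # 48/60 total, no domain below 3/5
```

The per-domain floor is checked first on purpose: a build that answers 58/60 overall but only 2/5 in one domain is blocked, not waved through on the aggregate.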
Callout — The numbers aren't sacred. 85% vs 90% is a product decision, not an engineering one. What matters is that the number exists, is written down, and gates deployment. You can adjust thresholds — but you cannot ship without them.
8. Acceptance tests vs eval suites
Teams confuse these constantly. They serve different purposes:
| Dimension | Acceptance test | Eval suite |
|---|---|---|
| Purpose | Gate deployment | Understand model behavior |
| Runs when | Every deploy (CI/CD) | On-demand, during development |
| Size | 25-200 test cases | 500-10,000+ examples |
| Speed | Must complete in minutes | Can run for hours |
| Failure means | Deploy is blocked | Investigation needed |
| Maintained by | Product + engineering | ML engineering |
| Changes how often | Quarterly (stable) | Weekly (exploratory) |
The acceptance test suite is a subset of your eval suite. It contains the cases where failure is unacceptable. The eval suite contains everything you want to understand.
Analogy: acceptance tests are the building code inspection. The eval suite is the full structural engineering analysis. You need both, but only one blocks the certificate of occupancy.
Callout — Start with acceptance tests. Teams that build eval suites first often never extract acceptance tests from them. Teams that write acceptance tests first always know what their eval suite should cover. Start with the gate.
9. How many tests are enough
The tension: more tests catch more regressions. More tests also mean more maintenance, more flaky failures, and slower deploys.
Rules of thumb from teams running AI acceptance tests in production:
| System complexity | Acceptance test count | Typical run time | Maintenance cost |
|---|---|---|---|
| Single-task (Q&A bot) | 25-50 | 2-5 min | ~2 hours/month |
| Multi-task (assistant) | 50-150 | 5-15 min | ~8 hours/month |
| Multi-modal pipeline | 150-300 | 15-45 min | ~20 hours/month |
| Platform (multiple models) | 300-500 | 30-90 min (parallelized) | ~40 hours/month |
For the wiki assistant (multi-task): aim for 80-120 acceptance tests across accuracy, latency, safety, and coverage. Budget 6-10 hours per month for test maintenance — updating ground truth, adjusting thresholds, adding tests for new failure modes.
The cost of too few: regressions ship. The cost of too many: deploys slow down, teams start skipping the gate. Find the point where every test catches a failure class you care about, and no test exists purely for coverage theater.
10. Signals that your tests are too weak or too strict
Too weak (tests pass but users complain):
- Support tickets mention issues your tests don't cover
- Post-deploy rollbacks happen more than once per quarter
- Stakeholder reviews still find "obvious" quality problems
- Your test set hasn't been updated in 3+ months

Too strict (tests block good builds):
- More than 15% of blocked deploys are overridden manually
- Test failures cluster on ambiguous cases where multiple answers are acceptable
- Teams route around the gate by deploying "config changes" that bypass CI
- Mean time to deploy exceeds your iteration speed requirements
The calibration loop:
┌──────────────┐     ┌────────────────┐     ┌──────────────┐
│ Deploy with  │────▶│ Monitor user   │────▶│ Did test     │
│ current gate │     │ feedback +     │     │ suite catch  │
│              │     │ incidents      │     │ this class?  │
└──────────────┘     └────────────────┘     └──────┬───────┘
                                                   │
                            ┌──────────────────────┼──────────────────────┐
                            │ NO                   │ YES                  │ YES (but
                            ▼                      ▼                      │ override)
                    ┌────────────────┐     ┌────────────────┐             ▼
                    │ Add test for   │     │ Tests working  │     ┌────────────────┐
                    │ this failure   │     │ as intended.   │     │ Threshold too  │
                    │ class          │     │ No change.     │     │ strict; relax  │
                    └────────────────┘     └────────────────┘     │ or split test  │
                                                                  └────────────────┘
Run this loop monthly. Acceptance tests are living artifacts.
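The loop's three outcomes, plus the 15% override signal from the "too strict" list, fit in two small functions. This is a sketch under the assumption that you log, per incident, whether the suite caught it and whether the gate was overridden; the function names and the example counts are hypothetical.

```python
def calibration_action(caught_by_suite: bool, gate_overridden: bool = False) -> str:
    """Map one incident or user complaint to a branch of the calibration loop."""
    if not caught_by_suite:
        return "add test for this failure class"
    if gate_overridden:
        return "threshold too strict: relax or split test"
    return "no change"


def override_rate(blocked_deploys: int, overridden: int) -> float:
    """Share of blocked deploys that were pushed through manually."""
    return overridden / blocked_deploys if blocked_deploys else 0.0


# Hypothetical month: 20 blocked deploys, 4 overridden by hand.
rate = override_rate(blocked_deploys=20, overridden=4)
too_strict = rate > 0.15   # the 15% signal from the "too strict" list
```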
11. Where acceptance tests break
Acceptance tests have known failure modes. Acknowledge them; don't pretend the gate is perfect.
Novel inputs: Your test set covers known scenarios. A user asks something genuinely new. The acceptance test didn't fail because it never ran that input. Mitigation: pair acceptance tests with production monitoring that flags out-of-distribution queries.
Shifting baselines: The source documents change. Your test's "expected output" references policy v2.3, but the policy updated to v2.4 last week. The test now fails on correct answers. Mitigation: version-pin your ground truth and schedule monthly review.
Subjective quality: "Is this explanation clear?" has no binary answer. Mitigation: decompose subjective quality into checkable proxies — does it contain the required facts, is it under 200 words, does it avoid jargon from the exclusion list.
Adversarial adaptation: Safety tests that check for known attack patterns miss novel attacks. Mitigation: rotate 20% of your safety test set quarterly with new adversarial examples.
Interdependent failures: A latency test passes in isolation but fails when the accuracy test's retrieval step adds load. Mitigation: run the full suite together, not individually, and measure under realistic concurrency.
Callout — Acceptance tests are necessary, not sufficient. They catch known failure modes reliably. They don't catch unknown unknowns. That's what production monitoring, incident response, and risk boundaries (chapter 05) handle.
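The shifting-baselines mitigation (scheduled review of pinned ground truth) is easy to enforce mechanically. A minimal sketch, assuming each test records the date its ground truth was last reviewed; `stale_test_ids`, the 30-day default, and the dates below are illustrative choices rather than values from any real suite.

```python
from datetime import date, timedelta


def stale_test_ids(last_updated: dict[str, date], today: date, max_age_days: int = 30) -> list[str]:
    """Return IDs of tests whose ground truth has not been reviewed within max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(test_id for test_id, updated in last_updated.items() if updated < cutoff)


# Placeholder review dates for two of the wiki assistant tests.
review_log = {
    "WA-ACC-003": date(2024, 1, 5),
    "WA-LAT-001": date(2024, 3, 1),
}
overdue = stale_test_ids(review_log, today=date(2024, 3, 15))
```

Passing `today` in explicitly, instead of calling `date.today()` inside the function, keeps the check itself reproducible.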
12. Wrong assumption: "we'll know it when we see it"
The assumption: "Our team has good judgment. We'll review outputs before launch and decide if quality is acceptable."
Why it fails at scale:
- Reviewer fatigue. Reviewing 500 outputs takes hours. By output 200, standards drift. Studies on human annotation show inter-rater agreement drops 12-18% after the first hour of labeling.
- Inconsistent baselines. Reviewer A passes an output that Reviewer B would fail. Without a written criterion, both are "right." The product ships with inconsistent quality.
- Recency bias. The last 10 outputs reviewed disproportionately influence the ship/no-ship decision. A bad cluster near the end of review blocks a good build. A good cluster at the end ships a bad build.
- Pressure dynamics. When the deadline is Thursday and the review is Wednesday, "close enough" becomes the standard. Written acceptance tests don't bend to calendar pressure.
- Institutional memory loss. The person who "knows good quality when they see it" goes on parental leave. Their replacement has no written standard to inherit.
The fix: capture the expert's judgment as concrete test cases once, then automate the check forever. The expert's time shifts from reviewing every deploy to maintaining the test set quarterly.
13. Pattern transfer
Acceptance tests connect forward to three future modules:
Eval gates (Module 01, ch19): Acceptance tests are the what. Eval gates are the how — the infrastructure that runs tests in CI, reports results, and enforces the block. When you reach eval gates, you'll implement the pipeline that executes what you defined here.
Golden datasets (04_ai_product_evals): Your acceptance test inputs become the seed of your golden dataset. The golden dataset grows to thousands of examples for comprehensive evaluation. The acceptance test suite stays small and fast for deployment gating.
Regression tracking (04_ai_product_evals): When an acceptance test fails, you need to know what changed. Regression tracking diffs model outputs across versions to isolate whether the failure is in retrieval, generation, or source data. The acceptance test identifies the failure; regression tracking diagnoses it.
Callout — Write acceptance tests now, implement gates later. You don't need CI infrastructure to benefit from acceptance tests. A markdown file with 30 test cases that a human runs before deploy is better than no gate at all. Automate later. Define "done" now.
The deploy decision flow
    SUCCESS METRIC               ACCEPTANCE TEST
     (from ch03)                 (this chapter)
          │                             │
          ▼                             ▼
┌─────────────────────┐    ┌─────────────────────────┐
│ "Accuracy should    │    │ "Run 25 refund queries. │
│ be above 85%"       │    │ 24+ must cite correct   │
│                     │    │ policy version."        │
└─────────┬───────────┘    └────────────┬────────────┘
          │                             │
          ▼                             ▼
┌─────────────────────┐    ┌─────────────────────────┐
│ Measured weekly     │    │ Measured every deploy   │
│ on dashboard        │    │ in CI pipeline          │
└─────────┬───────────┘    └────────────┬────────────┘
          │                             │
          ▼                             ▼
┌─────────────────────┐    ┌─────────────────────────┐
│ Informs product     │    │ PASS → deploy proceeds  │
│ decisions           │    │ FAIL → deploy blocked   │
└─────────────────────┘    └─────────────────────────┘
Metrics inform. Tests decide.
Recall
- What five properties distinguish an acceptance test from a metric?
- Why does "we'll know it when we see it" fail as a quality standard at scale?
- What is the difference between an acceptance test suite and an eval suite?
- How many acceptance tests should a multi-task assistant typically have?
- Name three signals that your acceptance tests are too strict.
- What does the "failure action" field prevent?
- Why should acceptance tests run together rather than individually?
- How often should test set ground truth be reviewed?
Interview Q&A
Q1: How do you decide the pass threshold for an AI acceptance test? Work backward from the business impact. If a wrong answer costs $500 in fee corrections, and you serve 1000 queries/day, then each 1% accuracy drop costs $5k/day. Set the threshold where the expected cost of errors stays below the product's value. Adjust quarterly as you learn actual failure costs.
Wrong-answer note: "We set it at 95% because that seems high" — thresholds disconnected from business impact get overridden when they become inconvenient.
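The arithmetic in the Q1 answer, written out so you can plug in your own numbers. The $500 and 1000 queries/day come from the answer; the `daily_error_budget` of $40k is an assumed figure for illustration only, and both function names are hypothetical.

```python
def daily_error_cost(queries_per_day: int, error_rate: float, cost_per_error: float) -> float:
    """Expected daily cost of wrong answers."""
    return queries_per_day * error_rate * cost_per_error


def minimum_accuracy(queries_per_day: int, cost_per_error: float, daily_error_budget: float) -> float:
    """Lowest accuracy at which expected error cost stays within the budget."""
    max_error_rate = daily_error_budget / (queries_per_day * cost_per_error)
    return max(0.0, 1.0 - max_error_rate)


# Figures from the answer: $500 per wrong answer, 1000 queries/day.
cost_of_one_point = daily_error_cost(1000, 0.01, 500)        # about 5000 dollars/day
# Assumed budget of $40k/day in tolerable error cost; not a figure from the text.
threshold = minimum_accuracy(1000, 500, daily_error_budget=40_000)   # about 0.92
```

This treats every wrong answer as equally costly. If compliance errors cost far more than the rest, compute a separate threshold for that question set.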
Q2: What do you do when an acceptance test is flaky — sometimes passes, sometimes fails on the same build? Flaky AI tests usually mean: (a) non-deterministic model output without sufficient temperature control, (b) the expected output is too specific for a generative system, or (c) external dependencies (retrieval latency, API availability) introduce variance. Fix by: pinning temperature to 0 for test runs, widening the expected behavior range to accept semantically equivalent answers, or mocking external dependencies. If none work, the test is measuring the wrong thing.
Wrong-answer note: "Just re-run it until it passes" — this defeats the purpose of gating and hides real intermittent failures.
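Before fixing a flaky test you have to confirm it is flaky and not simply failing. A rough sketch that reruns one test on one build and labels the pattern; `classify_stability` is a made-up helper, and `run_test` stands in for whatever callable executes your test and returns pass/fail.

```python
from typing import Callable


def classify_stability(run_test: Callable[[], bool], runs: int = 5) -> str:
    """Run the same test several times on one build and label the outcome pattern."""
    outcomes = [run_test() for _ in range(runs)]
    if all(outcomes):
        return "stable pass"
    if not any(outcomes):
        return "stable fail"
    return "flaky"


# Stand-in for a real test: replays a fixed sequence of pass/fail results.
recorded = iter([True, True, False, True, True])
label = classify_stability(lambda: next(recorded))
```

Unlike "re-run it until it passes," this keeps the failing run on the record: a single failure in five yields "flaky," which is a finding to investigate, not a pass.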
Q3: Should acceptance tests use LLM-as-judge or deterministic checks? Use deterministic checks for everything you can: fact presence (string match), latency (numeric comparison), format compliance (regex). Use LLM-as-judge only for properties that resist decomposition — tone, coherence, helpfulness. When you use LLM-as-judge, validate the judge's agreement with human labels on 50+ examples first, and pin the judge model version.
Wrong-answer note: "Use LLM-as-judge for everything because it's more flexible" — you're now testing your judge model's reliability alongside your production model. Two failure points instead of one.
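Validating the judge against human labels is a plain agreement count. A minimal sketch, assuming binary pass/fail labels; the 50-example minimum comes from the answer, while `MIN_AGREEMENT = 0.9` is an assumed bar and `judge_agreement` an illustrative name.

```python
def judge_agreement(human_labels: list[bool], judge_labels: list[bool], min_examples: int = 50) -> float:
    """Fraction of examples where the judge's pass/fail verdict matches the human label."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must be the same length")
    if len(human_labels) < min_examples:
        raise ValueError(f"need at least {min_examples} labeled examples")
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)


# Illustrative data: the judge disagrees with humans on 5 of 50 examples.
human = [True] * 40 + [False] * 10
judge = [True] * 40 + [False] * 5 + [True] * 5
agreement = judge_agreement(human, judge)
MIN_AGREEMENT = 0.9            # assumed bar; pick your own
judge_usable = agreement >= MIN_AGREEMENT
```

Raw agreement can look healthy when one label dominates. In the example every disagreement is a human "fail" that the judge passed, which is the direction that matters most for a gate.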
Q4: How do you handle acceptance tests for open-ended generation where there's no single correct answer? Decompose the quality into checkable dimensions. Instead of "is this a good summary?", test: does it mention all required entities (checklist), is it under the length limit (numeric), does it avoid the prohibited phrases list (string exclusion), do the factual claims match the source (retrieval verification). You'll cover 80% of quality with deterministic checks; use LLM-as-judge for the remaining 20%.
Wrong-answer note: "Open-ended outputs can't be tested" — this is learned helplessness. Decompose harder.
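Three of the deterministic checks from the Q4 answer, sketched with simple case-insensitive substring matching. `check_output` is an illustrative name, and the required facts, prohibited phrase, and word limit in the example are assumptions chosen to echo the refund policy test. Source-matching of factual claims is omitted because it needs your retrieval layer.

```python
def check_output(text: str, required_entities: list[str],
                 prohibited_phrases: list[str], max_words: int) -> dict[str, bool]:
    """Run deterministic checks on one generated answer; every value must be True to pass."""
    lowered = text.lower()
    return {
        "mentions_required_entities": all(e.lower() in lowered for e in required_entities),
        "within_length_limit": len(text.split()) <= max_words,
        "avoids_prohibited_phrases": not any(p.lower() in lowered for p in prohibited_phrases),
    }


answer = ("Refunds are available within the 30-day window, "
          "with a customs exception for international orders.")
checks = check_output(
    answer,
    required_entities=["30-day window", "customs exception"],
    prohibited_phrases=["guaranteed"],    # assumed exclusion list
    max_words=200,
)
passed = all(checks.values())
```

Substring matching is brittle against paraphrase ("thirty days" would fail the first check). That is the usual reason to widen the expected behavior range or hand the remainder to a validated judge.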
Q5: When should you override a failing acceptance test and deploy anyway? Almost never. If you override frequently, your thresholds are wrong — fix them. Legitimate overrides: (a) the test's ground truth is stale and the model's answer is actually correct, (b) a critical security patch must deploy and the unrelated test failure can be fixed in the next build. Log every override with justification. Review overrides monthly. More than 2 overrides per quarter means your test suite needs maintenance.
Wrong-answer note: "Override when the deadline is tight" — this is how the bank shipped the deprecated fee schedule.
Q6: How do acceptance tests work with continuous model updates (e.g., when the underlying LLM provider updates their model)? Pin model versions in production. When the provider announces an update, run your acceptance test suite against the new version before switching. Treat a provider model update exactly like an internal code deploy — it goes through the same gate. If tests fail, stay on the old version until you've investigated.
Wrong-answer note: "We use the latest model automatically because it's always better" — model updates routinely regress on specific domains even when aggregate benchmarks improve.
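Treating a provider update like an internal deploy comes down to one rule: the pinned version changes only when the suite passes on the candidate. A deliberately small sketch; `select_model_version`, the version strings, and the stub suite are all hypothetical.

```python
from typing import Callable


def select_model_version(pinned: str, candidate: str, suite_passes: Callable[[str], bool]) -> str:
    """Switch to the candidate only if the full acceptance suite passes against it."""
    return candidate if suite_passes(candidate) else pinned


# Stub suite for illustration: pretend the candidate regresses on some test.
def stub_suite(version: str) -> bool:
    return version != "model-candidate"


active = select_model_version("model-pinned", "model-candidate", stub_suite)
```

In the example the candidate fails the stub suite, so `active` stays on the pinned version until someone investigates.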
Q7: How do you maintain acceptance tests as the product evolves? Schedule monthly reviews. Each review: (1) check if any test's ground truth is stale, (2) check if production incidents revealed gaps — add tests, (3) check if any test was overridden — fix or remove it, (4) check if the product scope changed — add or retire tests. Budget 2 hours/month for a single-task system, 8 hours/month for a multi-task system.
Wrong-answer note: "Write them once and leave them" — ground truth drifts, product scope shifts, and stale tests either miss regressions or block correct outputs.
Design exercise: three-step build
Step 1 — Identify: Take a real AI feature you use daily (code completion, email drafting, search). Write three acceptance tests for it — one accuracy, one latency, one safety. Include all five required properties (concrete input, expected behavior, measurement method, pass threshold, failure action).
Step 2 — Stress: For each test, identify the scenario where it would produce a false positive (passes when it shouldn't) and a false negative (fails when it shouldn't). Write one mitigation for each.
Step 3 — Systematize: Design the monthly review process for your three tests. Who reviews, what triggers an update, what's the escalation path when ground truth changes and you're unsure whether to update the test or investigate the model.
Operational memory
Remember:
- A metric measures. An acceptance test decides. If it doesn't block a deploy, it's not a gate.
- Five required fields: concrete input, expected behavior range, measurement method, pass threshold, failure action.
- Start small (25-50 tests), run every deploy, maintain monthly.
- Deterministic checks first. LLM-as-judge only for properties that resist decomposition.
- Flaky tests mean your expected output is too specific or your system is non-deterministic. Fix the test or fix the system.
- Override logs are a leading indicator of test suite health. Track them.
- Acceptance tests catch known failure modes. Production monitoring catches unknowns. You need both.
Bridge
We now have concrete tests that gate shipping. But acceptance tests assume the system is safe to ship at all. They verify quality within a bounded operating envelope — but they don't define the envelope itself. What happens when the system encounters inputs outside its design range? What if a failure isn't just "low quality" but actively harmful?
Next: risk boundaries — what failure classes exist, what severity each carries, and what constraints they impose on the architecture before a single line of code is written.