04. Test and doc generation — 90% coverage that asserts nothing¶
~18 min read. Ask an assistant to "write tests for this module" and it produces a green suite that lifts coverage from 40% to 88% in an afternoon. The dashboard celebrates. Then a real bug ships through a test that called the function, got the wrong answer, and asserted nothing about it. This file shows why coverage measures the wrong thing for generated tests, how a test can pass while protecting nothing, how mutation testing reveals the hollow ones, and how the same trap appears in generated docs that read fluently while quietly lying.
Built on 00-first-principles.md. The forces here are the source of truth, the guardrail metric, the grounding gap, and the vanity metric. File 03 made the review gate trustworthy by blocking on deterministic findings — and the strongest deterministic findings come from tests. This file asks whether AI-generated tests are the gate they appear to be.
What we know so far and what still breaks¶
File 01 found that net leverage is output minus rework, and rework hides where verification was too cheap. File 02 kept a human spec as the source of truth so generated artifacts conform to intent. File 03 made the AI reviewer trustworthy by blocking only on deterministic findings — and noted that the strongest deterministic signal a reviewer can lean on is a test that fails when the code is wrong. Every gate so far has quietly assumed the test suite is real.
That assumption is exactly what AI test generation breaks. When you ask a model to write tests, it writes tests that pass, because passing is what tests do and the model has seen millions of green ones. A test that passes against the current code tells you the test agrees with whatever the code currently does — including its bugs. Coverage goes up because the lines got executed. Whether anything was asserted about their behavior is a separate question coverage cannot answer.
This chapter answers three things: why a generated test can pass while protecting nothing, how to measure whether a test would actually catch a regression (mutation testing, not coverage), and why generated docs fail the same way — fluent, plausible, and ungrounded in the code they describe.
What this file solves¶
A team asks an assistant to backfill tests on a legacy module, coverage jumps from 41% to 88%, and the PR sails through because "more tests, higher coverage, obviously good." Two sprints later a refactor introduces a real regression — and every one of the generated tests stays green, because they asserted that the function runs, not what it returns. This file gives you the concrete move: stop trusting coverage as the test guardrail, measure assertion strength with mutation testing (does the suite fail when you deliberately break the code?), pin tests to the behavior a human owns rather than to current output, and apply the same grounding discipline to generated docs.
Why generated tests pass even when they protect nothing¶
Watch Meridian backfill tests on its legacy billing module — 41% coverage, nobody wants to touch it. A developer, Sam, asks the assistant: "Write unit tests for calculate_invoice." Forty seconds later there are fourteen tests, all green, coverage at 88%. Look at one of them:
def test_calculate_invoice_runs():
    result = calculate_invoice(order_id=42, tax_rate=0.18)
    assert result is not None  # passes for ANY non-null return
The function ran. The line is covered. The test is green. And it asserts almost nothing: result is not None passes whether the invoice is $100, $0, or -$9999. If calculate_invoice starts returning the wrong total tomorrow, this test stays green. It is a test in shape only.
Why does the model write this? Because it was asked to generate tests, it generated something that passes, and the cheapest way to make a test pass is to assert something trivially true. It does not know the correct invoice for order 42 — that depends on business rules, tax logic, and data the model can't see. So it asserts the thing it can guarantee will pass: that the call returned something. Many of the fourteen tests are variations on this. Coverage is real; protection is theater.
So the real problem is not "the generated tests are wrong." They pass, they compile, they raise coverage. The problem is that a test's value is in its assertions, and the model optimizes for passing, which pushes assertions toward the trivially true. Coverage measures which lines ran; it says nothing about whether the test would fail if those lines started misbehaving. The two are not the same thing, and generated tests widen the gap.
So how do we tell a test that protects behavior from a test that merely executes a line?
The naive fix: chase a higher coverage number¶
Meridian's first instinct is the industry reflex: set a coverage gate. "All PRs must maintain 85% line coverage." Generated tests make this trivially easy to hit, so coverage climbs across every module. The dashboard turns green org-wide.
The break shows up at the next real regression. A refactor to calculate_invoice flips a discount sign — customers get charged a negative discount, i.e., a surcharge. The suite runs. Every test passes. Coverage is still 88%. The bug ships, finance notices a revenue anomaly, and three engineers spend a day discovering that the module with the highest coverage had the least protection — because its tests were generated to pass, not to catch.
| Module | Line coverage | Caught the discount-sign bug? |
|---|---|---|
| billing (gen) | 88% | NO (asserts is-not-None) |
| auth (hand-written) | 72% | YES (asserts exact claims) |
Higher coverage, less protection. The coverage gate measured execution and rewarded the suite that executed the most lines while checking the fewest.
So the real cause is not low coverage; it is that coverage is a proxy for protection, and AI test generation breaks the proxy by making it cheap to execute lines without asserting behavior. Optimizing the proxy (coverage) actively degrades the target (regression-catching) when the optimizer is a model rewarded for green. This is Goodhart's law on the test suite: the measure became a target and stopped being a good measure.
So how do we measure the thing we actually want — would this suite fail if the code broke — instead of the proxy?
When a passing test catches nothing¶
Here is the smallest version of the whole problem, on one function.
def discount(total, rate):
    return total * (1 - rate)

# Generated test: green, "covers" the function, protects nothing.
def test_discount_hollow():
    assert discount(100, 0.2) is not None  # passes for any return

# What it should assert: pins the behavior a human owns.
# (Distinct names so the second definition does not shadow the first.)
def test_discount_pinned():
    assert discount(100, 0.2) == 80.0   # fails if the formula breaks
    assert discount(100, 0.0) == 100.0  # boundary
    assert discount(100, 1.0) == 0.0    # boundary
The first test executes every line of discount — 100% coverage of the function. Break the formula to total * (1 + rate) and it stays green. The second test fails the instant the formula is wrong. Same coverage, opposite protection. The difference is entirely in the assertion, which is exactly the part coverage cannot see and the model is biased to weaken.
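One way to see the difference is to run both tests against the original function and against the sign-flipped version. This is a minimal sketch: the helper names (discount_mutant, hollow_test, pinned_test, passes) are illustrative and not part of Meridian's codebase.

```python
def discount(total, rate):
    return total * (1 - rate)

def discount_mutant(total, rate):
    # The broken formula from the text: a surcharge instead of a discount.
    return total * (1 + rate)

def hollow_test(fn):
    assert fn(100, 0.2) is not None

def pinned_test(fn):
    assert fn(100, 0.2) == 80.0
    assert fn(100, 0.0) == 100.0
    assert fn(100, 1.0) == 0.0

def passes(test, fn):
    """Return True if `test` raises no AssertionError when run against `fn`."""
    try:
        test(fn)
        return True
    except AssertionError:
        return False

for name, fn in [("original", discount), ("mutant", discount_mutant)]:
    print(name, "| hollow passes:", passes(hollow_test, fn),
          "| pinned passes:", passes(pinned_test, fn))
```

The hollow test passes against both versions; the pinned test passes only against the original. Both tests execute exactly the same lines.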
Rule: a test is only as strong as the change it would fail on¶
The load-bearing truth of this chapter: a test protects behavior only if there exists a wrong version of the code that would make it fail. A test that passes for every plausible mutation of the function — right or wrong — protects nothing, no matter how much coverage it adds. The question to ask of any test is not "does it pass?" (all good tests do) but "what broken code would it catch?" If the answer is "nothing," it is a coverage decoration, not a test.
Why coverage breaks as the guardrail. The primitive is the oracle: a test needs a correct expected value to compare against, and that value comes from human-owned behavior, not from the code. Coverage measures only that a line executed during the test; it never inspects the assertion, so a line covered by assert True and a line covered by assert result == exact_value score identically. The constraint that breaks the proxy is that the model supplies the oracle from the current code's output, not from intent, so the assertion encodes "whatever the code does now", bugs included. The fix is to measure assertion strength directly (mutation testing) and to source expected values from a human spec, not from the code.
1) Mutation testing — how to measure whether a test would actually fail¶
The mechanism that turns "does it pass?" into "would it catch a bug?" is mutation testing. Of the metrics discussed here, it is the one a generated suite cannot inflate just by adding green tests, because a test that asserts nothing kills no mutants.
The idea is direct: deliberately break the code in small ways (mutations) and check whether the test suite notices. A mutant is a one-edit change to the source — flip > to >=, swap + for -, replace a return value with a constant, delete a line. For each mutant you run the suite. If a test fails, the mutant is killed — the suite caught the bug. If all tests still pass, the mutant survived — the suite is blind to that bug. The mutation score is the fraction of mutants killed.
Original: return total * (1 - rate)
Mutant 1: return total * (1 + rate) ← suite passes? SURVIVED (blind) / fails? KILLED
Mutant 2: return total * (1 - rate) + 1 ← survived means no exact-value assertion
Mutant 3: return total ← survived means rate is never checked
Mutation score = killed / total mutants
Run mutation testing on Meridian's billing module and the truth surfaces immediately: 88% line coverage, 31% mutation score. More than two-thirds (69%) of the deliberately broken versions of the code pass the suite untouched. The hand-written auth module: 72% coverage, 84% mutation score. Now the dashboard tells the truth: auth is far better protected despite lower coverage, because its assertions are real.
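The kill/survive loop can be sketched in a few lines with the standard-library ast module. This is a toy, not how mutmut, Cosmic Ray, PIT, or Stryker are implemented: it knows three operator swaps, mutates one binary operation at a time, and the names SWAPS, SwapOne, mutants, and mutation_score are made up for the illustration.

```python
import ast

SOURCE = """
def discount(total, rate):
    return total * (1 - rate)
"""

# Operator swaps to try. Real tools ship many more mutation operators.
SWAPS = {ast.Sub: ast.Add, ast.Add: ast.Sub, ast.Mult: ast.Div}

class SwapOne(ast.NodeTransformer):
    """Swap the operator of the N-th binary operation and leave the rest alone."""
    def __init__(self, target):
        self.target = target
        self.seen = -1

    def visit_BinOp(self, node):
        self.generic_visit(node)  # visit nested operations first
        self.seen += 1
        if self.seen == self.target and type(node.op) in SWAPS:
            node.op = SWAPS[type(node.op)]()
        return node

def mutants(source, name):
    """Yield one mutated copy of function `name` per binary operation."""
    count = sum(isinstance(n, ast.BinOp) for n in ast.walk(ast.parse(source)))
    for i in range(count):
        tree = ast.fix_missing_locations(SwapOne(i).visit(ast.parse(source)))
        namespace = {}
        exec(compile(tree, "<mutant>", "exec"), namespace)
        yield namespace[name]

def mutation_score(suite, source=SOURCE, name="discount"):
    """Fraction of mutants for which `suite` raises (the mutant is killed)."""
    killed = total = 0
    for fn in mutants(source, name):
        total += 1
        try:
            suite(fn)
        except Exception:  # a failed assertion or a crash both count as a kill
            killed += 1
    return killed / total

def hollow_suite(fn):
    assert fn(100, 0.2) is not None

def pinned_suite(fn):
    assert fn(100, 0.2) == 80.0
    assert fn(100, 0.0) == 100.0
    assert fn(100, 1.0) == 0.0

print("hollow:", mutation_score(hollow_suite))
print("pinned:", mutation_score(pinned_suite))
```

For this source the sketch produces two mutants, total * (1 + rate) and total / (1 - rate). The hollow suite kills neither; the pinned suite kills both.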
Teacher voice. Here is the move, na — coverage asks "did the test touch this line?", mutation testing asks "would the test notice if this line were wrong?" Only the second question is what you actually want from a test. Generated tests inflate the first and barely move the second, so the gap between the two numbers — high coverage, low mutation score — is the precise signature of hollow tests. Watch the gap, not the coverage.
For Meridian, the mutation score becomes the real test guardrail: a PR can raise coverage all it wants, but if it doesn't raise the mutation score, it added decoration, not protection.
2) The coverage-vs-protection mental model — picture before the gate¶
This is the core mental model of the chapter. Keep it as the canonical ASCII image: coverage and protection are different axes, and generated tests live in the dangerous corner.
                  ASSERTION STRENGTH (mutation score)
                 weak                  strong
          ┌──────────────────┬───────────────────────┐
high      │ HOLLOW SUITE     │ GOLD SUITE            │
line      │ 88% cov / 31%    │ 85% cov / 84% mut     │
coverage  │ mutation         │                       │
          │ ← generated      │ ← what you want:      │
          │   tests land     │   covered AND         │
          │   HERE by        │   checked             │
          │   default        │                       │
          ├──────────────────┼───────────────────────┤
low       │ HONEST GAP       │ THIN BUT REAL         │
coverage  │ low cov / weak   │ low cov / strong      │
          │ — at least it    │ — covers little but   │
          │   doesn't lie    │   protects what it    │
          │                  │   touches             │
          └──────────────────┴───────────────────────┘
The trap: the HOLLOW SUITE (top-left) scores BEST on a coverage gate
and protects LEAST. A coverage gate rewards exactly the wrong corner.
The whole danger is the top-left quadrant: high coverage, weak assertions — the suite that looks best on the dashboard and catches the least. Generated tests gravitate there because the optimizer (pass + cover) points straight at it. The goal is the top-right (covered and checked), and the only metric that distinguishes top-left from top-right is the horizontal axis — mutation score — which coverage cannot see.
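The four quadrants reduce to two comparisons. The sketch below is a hypothetical helper; the diagram names no cutoffs, so the 80% and 70% thresholds are an assumption borrowed from the gate used later in this file.

```python
def quadrant(line_coverage, mutation_score, cov_threshold=0.80, mut_threshold=0.70):
    """Place a module in the coverage-vs-protection grid.

    Both inputs are fractions between 0 and 1. The thresholds are assumptions.
    """
    high_coverage = line_coverage >= cov_threshold
    strong_assertions = mutation_score >= mut_threshold
    if high_coverage and strong_assertions:
        return "GOLD SUITE"
    if high_coverage:
        return "HOLLOW SUITE"
    if strong_assertions:
        return "THIN BUT REAL"
    return "HONEST GAP"

print(quadrant(0.88, 0.31))  # billing after the generated backfill
```

With these cutoffs, billing at 88% coverage and 31% mutation score lands in the hollow corner, which is the one a coverage-only gate cannot tell apart from gold.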
3) Meridian's test backfill — the running example, with numbers¶
Meridian decides to use AI to backfill tests on the billing module the right way. Watch the two approaches and what the guardrail does.
Attempt A — "write tests for this module," coverage gate¶
Prompt: "Write unit tests for the billing module."
Result:
14 tests, all green
Line coverage: 41% → 88% ✓ passes the 85% gate
Mutation score: ~31% (not measured — no mutation gate yet)
Discount-sign regression two sprints later: NOT CAUGHT
Attempt B — spec-anchored tests with a mutation gate¶
Step 1: Human writes the oracle — the behavior the tests must pin (the source of truth):
- discount(100, 0.2) == 80.0
- negative rate must raise ValueError
- invoice total = sum(line items) + tax, tax = subtotal * tax_rate
- zero-item order returns 0.0, not None
Step 2: Prompt: "Write tests that assert THESE behaviors, including boundaries
and error cases. Use exact expected values, not is-not-None."
Step 3: Gate (CI):
✓ line coverage ≥ 80%
✓ mutation score ≥ 70% ← the real guardrail
✗ FAIL if any new test's only assertion is is-not-None / truthiness
Result:
18 tests, all green
Line coverage: 88%
Mutation score: 31% → 79% ← protection is now real
Discount-sign regression: CAUGHT at the next run (exact-value assert fails)
The model did the verbose work in both cases. The difference is that in B, a human owns the oracle — the expected values come from billing rules a person can defend, not from whatever the code currently returns — and the gate measures mutation score, not coverage. The assistant's labor is cheap; the oracle is the expensive, human-owned source of truth, exactly as the spec was in file 02.
Teacher voice. Notice where the human judgment goes, na. Not into writing 18 tests by hand; that's the model's cheap labor. It goes into the four lines of oracle: what should calculate_invoice return. That's the part the model cannot know and the part that makes every generated assertion meaningful. Same division of labor as the spec in file 02: human owns intent, AI fills in the verbose translation, a deterministic gate checks conformance.
4) Why mutation testing, not coverage, branch coverage, or "just more tests"¶
The plausible alternatives are line coverage, branch coverage, and "generate even more tests." Why mutation testing under Meridian's workload of AI-generated suites?
Line and branch coverage measure execution: which lines or branches ran during the suite. They are useful as a floor — code never executed is certainly untested — but they are blind to assertions, which is precisely the dimension AI generation degrades. Branch coverage is a little better than line coverage (it forces both sides of a condition to run) but still passes for assert True on both branches. "Generate more tests" makes the problem worse: more generated tests means more weak assertions and a higher coverage number with no more protection — you are optimizing the broken proxy harder.
Mutation testing measures protection directly: it asks the only question that matters — would the suite fail if the code were wrong — by actually making the code wrong. Its cost is real: running the full suite once per mutant is expensive (a module with 200 mutants runs the suite 200 times), which is why teams run it on changed files in CI and full-suite mutation nightly. Under a workload where the failure mode is specifically high coverage, weak assertions, mutation testing is the only metric that detects it, so its cost is justified exactly where generated tests dominate. For modules with hand-written tests and stable mutation scores, coverage-as-floor plus periodic mutation runs is enough.
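The cost argument is simple multiplication. In the sketch below the 200 mutants come from the text; the 30-second suite runtime and the 12 mutants in a changed file are made-up figures chosen only to show the scale difference.

```python
def mutation_run_seconds(mutants, suite_seconds):
    """Naive cost model: the whole suite runs once per mutant."""
    return mutants * suite_seconds

SUITE_SECONDS = 30  # assumption, not a Meridian measurement

full_module = mutation_run_seconds(200, SUITE_SECONDS)   # whole module, nightly
changed_only = mutation_run_seconds(12, SUITE_SECONDS)   # changed file, per PR

print(f"nightly: {full_module / 60:.0f} min, per PR: {changed_only / 60:.0f} min")
```

Under these assumed numbers the full run takes 100 minutes and the changed-files run takes 6, which is why the split between per-PR and nightly runs is the usual compromise.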
5) The property that changes the design: who supplies the test oracle¶
If you change one thing about how you generate tests, change this: the design variable is where the expected value comes from. A test compares an actual result to an expected one; the expected one is the oracle. If the model derives the oracle from the current code's output, the test can only ever assert "the code does what it does" — it pins bugs in place and will never catch a regression that the buggy code already had. If a human supplies the oracle from intent, the test asserts "the code does what it should" and catches divergence.
Oracle source = current code output:
test asserts "code == code" → tautology, catches nothing it doesn't already do
→ green forever, even on a pre-existing bug, even after a regression that matches
Oracle source = human-owned behavior (spec, requirement, business rule):
test asserts "code == intended" → catches any divergence from intent
→ this is the only kind of test that protects
This is why "write tests for this code" is the dangerous prompt: it makes the code its own oracle. "Write tests that assert these behaviors" supplies an independent oracle. The same model produces a protective suite or a hollow one depending entirely on where the expected values come from. Characterization testing (snapshotting current behavior) has a legitimate use, locking down legacy behavior before a refactor, but it must be labeled as such, because it explicitly pins current behavior including bugs, and is not regression protection against the spec.
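The two oracle sources can be put side by side. The sketch assumes a discount function extended to reject negative rates (as the oracle in Attempt B demands); ORACLE, observe, snapshot_oracle, and check_against_oracle are illustrative names.

```python
def discount(total, rate):
    # Hypothetical implementation that satisfies the human-owned oracle.
    if rate < 0:
        raise ValueError("rate must be non-negative")
    return total * (1 - rate)

def buggy_discount(total, rate):
    # Sign flipped, and no validation.
    return total * (1 + rate)

# Human-owned oracle: (args, expected). An exception class means "must raise".
ORACLE = [
    ((100, 0.2), 80.0),
    ((100, 0.0), 100.0),
    ((100, 1.0), 0.0),
    ((100, -0.1), ValueError),
]

def observe(fn, args):
    """Return fn's result, or the exception class if it raises."""
    try:
        return fn(*args)
    except Exception as exc:
        return type(exc)

def snapshot_oracle(fn, inputs):
    """Code-derived oracle: records whatever fn does today (characterization only)."""
    return [(args, observe(fn, args)) for args in inputs]

def check_against_oracle(fn, oracle):
    """Return the oracle cases that fn gets wrong."""
    return [(args, expected) for args, expected in oracle
            if observe(fn, args) != expected]

inputs = [args for args, _ in ORACLE]
print("snapshot oracle failures:",
      check_against_oracle(buggy_discount, snapshot_oracle(buggy_discount, inputs)))
print("human oracle failures:", check_against_oracle(buggy_discount, ORACLE))
```

Checked against its own snapshot, the buggy function has zero failures by construction. Checked against the human-owned table, it fails three of the four cases.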
6) One failure walked through: the snapshot that froze a bug¶
Trace the canonical generated-test failure end to end.
1. The billing module already has a latent bug: tax is computed on the post-discount
total, but the spec says tax is on the pre-discount subtotal. Nobody noticed.
2. Sam asks the assistant: "Write tests for calculate_invoice."
3. The model runs the function, sees it returns $94.40 for a sample order, and writes:
assert calculate_invoice(order_42) == 94.40
It made the CURRENT (buggy) output the oracle.
4. Tests green, coverage 88%, PR merged. The bug is now PROTECTED by a test.
5. Months later a developer reads the spec, fixes the tax-base bug. The correct
answer is now $96.20. The generated test FAILS.
6. The developer, trusting "tests are the source of truth," assumes their fix is
wrong and reverts it. The bug is now permanent, defended by a test that
enshrined it.
Where did the system fail? Not at generation — the test runs and asserts an exact value, which looks like a good test. It failed at the oracle: the model snapshotted the buggy code as the definition of correct, so the test now actively defends the bug against the fix. This is worse than no test, because a missing test merely fails to catch a bug while a wrong-oracle test prevents the bug from being fixed. The grounding gap again: the test was fluent and exact but grounded in the code instead of the spec.
The fix is the file-02 move: the spec (tax on pre-discount subtotal) is the source of truth, the test asserts the spec, and when code and spec disagree, the code is wrong — never the spec silently winning by being enshrined in a generated test.
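The trace can be reproduced with concrete numbers. The text gives only the two totals, so the order contents here are an assumption: a $90 subtotal, a $10 discount, and an 18% tax rate happen to yield $94.40 under the buggy rule and $96.20 under the spec. Decimal is used so the money comparisons are exact.

```python
from decimal import Decimal

TAX_RATE = Decimal("0.18")

def invoice_buggy(subtotal, discount):
    # Latent bug: tax is computed on the post-discount total.
    discounted = subtotal - discount
    return discounted + discounted * TAX_RATE

def invoice_fixed(subtotal, discount):
    # Spec: tax is computed on the pre-discount subtotal.
    return (subtotal - discount) + subtotal * TAX_RATE

def snapshot_test(fn):
    # Expected value copied from the code's current output.
    assert fn(Decimal("90"), Decimal("10")) == Decimal("94.40")

def spec_test(fn):
    # Expected value worked out by hand from the spec: 80 + 90 * 0.18 = 96.20
    assert fn(Decimal("90"), Decimal("10")) == Decimal("96.20")

def passes(test, fn):
    try:
        test(fn)
        return True
    except AssertionError:
        return False

for name, fn in [("buggy", invoice_buggy), ("fixed", invoice_fixed)]:
    print(name, "| snapshot test:", passes(snapshot_test, fn),
          "| spec test:", passes(spec_test, fn))
```

The snapshot test is green on the buggy code and red on the fix, which is the signal that pushed the developer to revert. The spec test is red on the buggy code and green on the fix.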
7) Generated docs fail the same way — fluent and ungrounded¶
Documentation generation is the same failure with a different artifact. Ask a model to "document this module" and it produces clean, confident prose. The prose is grounded in the code's names and structure, not its behavior, so it is fluent exactly where it is most likely to be subtly wrong.
Generated docstring:
"""Sends a notification with up to 3 retries and exponential backoff."""
Actual code (from the file-02 drift): 5 retries, FIXED backoff.
→ The doc reads plausibly, matches the function name, and is WRONG.
→ A reader trusts it, debugs against "3 retries exponential," wastes an hour.
The danger of generated docs is precisely their fluency: a wrong doc that looks uncertain gets verified; a wrong doc that reads like confident reference gets trusted. The same grounding discipline applies — docs should be generated from the source of truth (the spec, the signature, the actual config) and checked against it, not free-narrated from the code's vibe. Doc-from-code that asserts behavior the code doesn't have is a documentation bug that costs reader-hours, and it compounds: onboarding engineers learn the wrong model of the system from confident, wrong docs.
The useful, low-risk doc generation is the kind grounded in structure that can't drift from behavior: API reference from type signatures, changelog from commit diffs, "what changed in this PR" summaries. The risky kind is behavioral narration — "this service guarantees X" — where the model has no oracle for X.
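A behavioral claim in a docstring can sometimes be checked mechanically against the configuration it describes. The sketch below is deliberately crude: MAX_RETRIES, BACKOFF, send_notification, and doc_drift are hypothetical names, and the pattern matching only catches claims phrased the way this one docstring phrases them.

```python
import re

MAX_RETRIES = 5    # hypothetical config values standing in for the actual code
BACKOFF = "fixed"

def send_notification():
    """Sends a notification with up to 3 retries and exponential backoff."""

def doc_drift(func, max_retries, backoff):
    """Return mismatches between a docstring's retry claims and the config."""
    doc = (func.__doc__ or "").lower()
    problems = []
    match = re.search(r"(\d+)\s+retries", doc)
    if match and int(match.group(1)) != max_retries:
        problems.append(f"doc says {match.group(1)} retries, config says {max_retries}")
    for kind in ("exponential", "fixed", "linear"):
        if kind in doc and kind != backoff:
            problems.append(f"doc says {kind} backoff, config says {backoff}")
    return problems

for problem in doc_drift(send_notification, MAX_RETRIES, BACKOFF):
    print(problem)
```

A check like this is brittle by design. Its value is that a confident behavioral sentence gets compared to something a human owns instead of being trusted on fluency.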
8) Cost movement — what generated tests and docs buy and bill¶
| What changes | Direction | Concrete (Meridian) | Who absorbs it |
|---|---|---|---|
| Time to backfill a test suite | cheaper | 2 days → 2 hours | the author |
| Line coverage | rises fast | 41% → 88% | the dashboard (vanity) |
| Mutation score | flat unless oracle is human-owned | 31% (A) vs 79% (B) | regression protection |
| False confidence in the suite | rises with hollow tests | team trusts a suite that protects 31% | future debuggers |
| Mutation-testing compute | new cost | suite runs N× per N mutants | CI + the budget |
| Oracle-writing effort | new cost | minutes per behavior, human-owned | the author, up front |
| Onboarding accuracy | risk | confident wrong docs teach wrong models | new hires |
The pressure relieved is test-writing labor and doc-writing labor. The pressure created is false confidence (absorbed by whoever later trusts the hollow suite or the confident-wrong doc) and mutation-testing compute (absorbed by CI). The trade is strongly positive when a human owns the oracle and the gate is mutation score; it is negative when the gate is coverage, because then you have paid compute to manufacture confidence in protection you don't have.
Mini-FAQ. "Isn't 31% mutation score still better than no tests at all?" Sometimes worse. No tests fail to catch a bug; a hollow suite makes the team believe the bug would be caught, so they review less and refactor boldly on top of unprotected code — and a wrong-oracle test actively defends a bug against its fix. False confidence is a liability, not a neutral. Measure mutation score so the confidence matches the protection.
9) Signals — healthy, first to degrade, misleading, expert's graph¶
Healthy: mutation score rising with coverage (protection tracks execution); new tests assert exact values and boundaries; generated docs verified against the source of truth; the coverage/mutation gap small and stable.
First metric to degrade: the gap between line coverage and mutation score. When coverage climbs while mutation score stays flat — the signature of a hollow backfill — protection is decoupling from coverage. It moves before any regression ships, so it is the leading indicator that the suite is becoming decoration.
The misleading metric everyone watches: line coverage and "number of tests generated." Pure vanity metrics, the file-01 family. They rise by construction whenever you generate tests and are uncorrelated with regression protection once assertions go weak. A coverage gate optimized by a model actively rewards hollow tests.
The graph an expert opens first: mutation score and line coverage plotted on the same axis per module. Healthy modules track together; the hollow ones show coverage high and mutation low — the diverging lines are the precise visual of the top-left quadrant. Segment by "tests written by AI" to see whether generated suites cluster in the hollow corner.
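The leading indicator can be computed from two snapshots of a module. In this sketch the values are percentage points; the pre-backfill mutation score of 30 is invented for illustration (the text does not report one), and the two thresholds are assumptions.

```python
def hollow_backfill(history, min_cov_gain=10, max_mut_gain=2):
    """Flag a module whose coverage climbed while its mutation score stayed flat.

    `history` is a list of (line_coverage, mutation_score) pairs in percentage
    points, oldest first. Thresholds are illustrative.
    """
    (cov_before, mut_before), (cov_after, mut_after) = history[0], history[-1]
    return (cov_after - cov_before >= min_cov_gain
            and mut_after - mut_before <= max_mut_gain)

print(hollow_backfill([(41, 30), (88, 31)]))  # coverage +47, mutation +1
print(hollow_backfill([(41, 30), (88, 79)]))  # coverage +47, mutation +49
```

The first history is flagged and the second is not, even though both end at the same coverage.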
10) Wrong assumption: "high coverage means well-tested"¶
The seductive belief is that coverage measures test quality — 88% covered means 88% protected. It was never quite true, and AI test generation makes it dangerously false. Coverage measures execution, not assertion; the model can execute every line while asserting nothing, decoupling the proxy from the target completely.
Replace the wrong belief with: coverage measures what your tests touch; mutation score measures what they protect. A suite at 88% coverage and 31% mutation score has touched almost everything and would catch fewer than a third of the small bugs injected into it. The number that tells you whether a regression would be caught is the one coverage can't show, so stop gating on coverage alone the moment a model is writing the tests. This inversion, the highest-coverage module being the least protected, is the chapter's memory hook.
11) Other failure shapes to recognize¶
- Tautological assertions. assert result == result, assert x or not x, assert isinstance(r, dict): green forever, catches nothing.
- Snapshot-as-truth. Generated tests assert current (possibly buggy) output as the expected value, freezing bugs and blocking their fix.
- Mock-everything tests. The test mocks every dependency, so it asserts the mocks behave — never the real integration; passes while the real wiring is broken.
- Happy-path-only. Generated tests cover the success case and skip the error and boundary cases where bugs actually live; coverage looks fine because the happy path is most of the lines.
- Flaky generated tests. Tests that assert on timing, ordering, or unmocked time/randomness; they pass locally, fail in CI, train override-and-retry, and erode the trust gate from file 03.
- Coverage-padding tests. Tests written purely to hit an uncovered line with no meaningful assertion, gaming the coverage gate directly.
- Doc drift. Generated docs describe behavior the code no longer has (3 retries vs 5), confidently and plausibly, teaching wrong mental models to readers and new hires.
- Test-to-fit. When a generated test fails, the developer edits the test to match the code instead of investigating which is wrong — silently making the code its own oracle.
12) Pattern transfer — where this pressure recurs¶
- The oracle problem is the same as the source of truth in file 02: correctness is defined by a human-owned artifact (spec, expected value), and generated output must conform to it, never silently become it. A test with a code-derived oracle is the same drift as a service whose prompt was the spec.
- Coverage as a broken proxy is Goodhart's law, the same shape as vanity metrics in file 01 (acceptance rate) and comments-per-PR in file 03: a measure that becomes a target stops measuring the thing, especially when a model optimizes it.
- The grounding gap — fluent output detached from truth — is the same root cause as RAG hallucination (module 08), AI-review's blindness to intent (file 03), and ungrounded incident summaries (file 05). Generated docs are the doc-shaped version: plausible, confident, wrong.
- Mutation testing as the real guardrail is the same move as the eval gate in file 06: when the obvious metric (coverage / acceptance) is gameable, you build a harder, less-gameable metric (mutation score / held-out eval) and gate on that.
13) Design test — five questions before trusting a generated test suite¶
- What broken version of the code would this test fail on? If "nothing," it's decoration.
- Where did the expected value come from — a human-owned behavior, or the code's current output (which freezes bugs)?
- Am I gating on mutation score, or only on coverage (which a model can inflate without protecting)?
- Do the generated tests cover error and boundary cases, or only the happy path?
- For generated docs: is the behavioral claim grounded in the spec/signature, or free-narrated from the code's names?
Where this appears in production¶
- Stryker (JS/.NET/Scala) / PIT (Java) / mutmut, Cosmic Ray (Python) — mutation-testing frameworks that compute the mutation score; the real test-quality guardrail this chapter is built on.
- GitHub Copilot test generation — /tests and the coding agent generate suites; their value depends entirely on whether a human supplies the oracle.
- Diffblue Cover — automated Java unit-test generation that explicitly characterizes current behavior; useful for legacy lock-down if labeled as snapshot, dangerous if mistaken for spec-protection.
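- Meta's TestGen-LLM / ACH — LLM test generation behind automated filters. TestGen-LLM keeps a generated test only if it builds, passes reliably, and increases coverage; Meta's later mutation-guided ACH work generates tests specifically to kill mutants the existing suite misses, an industrial application of mutation-score-as-guardrail.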
- EvoSuite / Randoop — search-based and random test generation that snapshot behavior; same oracle caveat as AI generation.
- Codecov / Coveralls — coverage dashboards; useful as a floor, dangerous as the sole gate once a model writes the tests.
- SonarQube quality gate — can gate on coverage and increasingly on test reliability, the deterministic floor from file 03.
- Hypothesis / fast-check / QuickCheck — property-based testing; asserts invariants ("output is always sorted") rather than exact values, a strong complement that catches the edge cases generated example-tests skip.
- Cursor / Claude Code test workflows — agents that write tests and run them; the discipline is to feed the agent the oracle and check mutation score, not to accept green.
- Mintlify / Swimm / Docusaurus + AI — doc generation; safest when grounded in signatures and diffs, riskiest for behavioral narration.
- Read the Docs / Sphinx autodoc — structure-grounded API docs: signatures are read from the code's actual interface and so can't drift, though the docstring prose autodoc pulls in still can.
- Stripe / Google internal — test-quality programs that track mutation-style "would this catch a regression" signals rather than coverage alone.
Pause and recall¶
- Why does an AI-generated test pass even when it protects nothing?
- What does coverage measure, and what does it not measure that you actually care about?
- Define a mutant, a killed mutant, and the mutation score.
- Why is a wrong-oracle test (snapshotting a buggy output) worse than no test at all?
- What single design variable decides whether a generated test protects or merely executes?
- Which metric degrades first when a backfill goes hollow, and why before any regression ships?
- Why does "generate more tests" make the coverage-gate problem worse, not better?
- How do generated docs fail the same way as generated tests, and which doc kinds are low-risk?
Interview Q&A¶
Q1. A teammate raised coverage from 41% to 88% with AI-generated tests in an afternoon. Leadership wants to celebrate. What do you check?
A. The mutation score, and the gap between coverage and it. Generated tests inflate coverage while assertions stay weak, so 88% coverage can sit on a 31% mutation score, touching everything and protecting little. Run mutation testing on the module; if the score didn't move, the suite is decoration and the next regression will sail through.
Common wrong answer to avoid: "Great, higher coverage means better-tested." Coverage measures execution, not assertion; a model optimizing for green inflates coverage without adding protection.
Q2. Why can a generated test that asserts an exact value still be worse than no test?
A. If the model derived that exact value from the current (buggy) code output, the test enshrines the bug as the oracle. When someone later fixes the bug, the test fails, they assume their fix is wrong and revert it; the test now defends the bug against its fix. A missing test merely fails to catch; a wrong-oracle test actively prevents the fix.
Common wrong answer to avoid: "An exact-value assertion is always a good test." It's only good if the expected value comes from intent, not from snapshotting the code's current behavior.
Q3. What is mutation testing and why is it the right guardrail for AI-generated suites?
A. It deliberately introduces small bugs (mutants) into the code and checks whether the suite fails — killed mutants mean the suite caught the bug, survivors mean it's blind. The mutation score is killed/total. It's the right guardrail because it measures protection directly and can't be gamed by adding green tests, which is exactly how generated suites inflate coverage.
Common wrong answer to avoid: "Branch coverage is enough." Branch coverage still passes for assert True on both branches; only mutation testing inspects whether assertions would catch a wrong result.
Q4. Why is "write tests for this code" a dangerous prompt? A. It makes the code its own oracle — the model runs the code and asserts whatever it currently returns, so the tests pin current behavior including bugs and can never catch a regression that matches. The safe prompt is "write tests that assert these behaviors," supplying a human-owned oracle from the spec or requirement. Common wrong answer to avoid: "It's fine, the model writes good tests." The model writes passing tests; passing against current code means agreeing with current bugs.
Q5. Generated docs say a service does 3 retries with exponential backoff. Should a new hire trust them?
A. Not until the claim is verified against the source of truth. Generated docs are grounded in names and structure, not behavior, so they read confidently while being subtly wrong (the code may do 5 retries with fixed backoff, the drift described in file 02). Behavioral doc claims need grounding in the spec or actual config; structure-grounded docs (API reference from signatures) are safer.
Common wrong answer to avoid: "Docs are docs, trust them." Fluent confidence is exactly what makes a wrong generated doc dangerous: it gets believed instead of checked.
Q6. Tests pass, coverage is 88%, but a regression still shipped. Is this a file-01 verification problem, a file-03 review problem, or a file-04 test problem? (cumulative)
A. A file-04 hollow-test problem if the suite has high coverage and a low mutation score: the tests executed the regressed line but asserted nothing about it. Confirm by running mutation testing on the changed module. It's distinct from file-01 (rework on untested inner-loop code) and file-03 (a real finding dismissed); here the gate existed but protected nothing.
Common wrong answer to avoid: "Coverage was high so the tests must be fine; it's a model bug." High coverage with a shipped regression is the signature of weak assertions. Check the mutation score, not the model.
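The killed/total definition from Q3 can be made concrete with a hand-rolled sketch. Everything here is illustrative: the `discount` function, the three hand-written mutants, and both suites are assumptions, not Meridian's real code. Dedicated mutation tools generate mutants automatically, while this sketch writes them by hand to show the mechanism.

```python
from typing import Callable, List

Impl = Callable[[float, float], float]

# Hypothetical function under test: apply a fractional discount to a subtotal.
def discount(subtotal: float, rate: float) -> float:
    return subtotal - subtotal * rate

# Hand-written mutants: each one is a small, deliberate bug in discount().
def mutant_sign_flip(subtotal: float, rate: float) -> float:
    return subtotal + subtotal * rate   # '-' became '+'

def mutant_drop_discount(subtotal: float, rate: float) -> float:
    return subtotal                     # discount silently ignored

def mutant_wrong_operand(subtotal: float, rate: float) -> float:
    return subtotal - rate              # forgot to multiply by subtotal

MUTANTS: List[Impl] = [mutant_sign_flip, mutant_drop_discount, mutant_wrong_operand]

def hollow_suite(impl: Impl) -> None:
    # Executes the function (so coverage counts it) but asserts only non-None.
    result = impl(100.0, 0.25)
    assert result is not None

def oracle_suite(impl: Impl) -> None:
    # Expected values come from intent: 25% off 100 is 75; 0% off changes nothing.
    assert impl(100.0, 0.25) == 75.0
    assert impl(50.0, 0.0) == 50.0

def mutation_score(suite: Callable[[Impl], None], mutants: List[Impl]) -> float:
    """Fraction of mutants the suite kills (fails on)."""
    killed = 0
    for mutant in mutants:
        try:
            suite(mutant)
        except AssertionError:
            killed += 1                 # suite failed on the bug: mutant killed
    return killed / len(mutants)

if __name__ == "__main__":
    print("hollow:", mutation_score(hollow_suite, MUTANTS))
    print("oracle:", mutation_score(oracle_suite, MUTANTS))
```

Both suites pass against the real `discount` and exercise the same lines, so a coverage report cannot tell them apart. Only the count of killed mutants separates them.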
Design/debug exercise (10 min)¶
Step 1 — Modeled example. Here is the oracle-and-gate sketch for Meridian's calculate_invoice:
Oracle (human-owned, the source of truth):
total = sum(line_items) + tax; tax = subtotal_pre_discount * tax_rate
discount applies to subtotal, not to tax
zero-item order → 0.0; negative rate → ValueError
Gate (CI):
line coverage ≥ 80% AND mutation score ≥ 70%
reject tests whose only assertion is is-not-None / truthiness
Forbidden: editing a failing test to match the code without checking the oracle.
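The gate above could be sketched as two small checks. This is a minimal illustration under stated assumptions: `gate_passes` and `hollow_tests` are hypothetical helper names, the trivial-assertion check uses Python's standard `ast` module and recognizes only the simplest patterns (a constant, a bare name, or `x is not None`), and the `calculate_invoice` calls inside the sample use an assumed signature. The sample is parsed, never executed.

```python
import ast
from typing import List

def gate_passes(line_coverage: float, mutation_score: float) -> bool:
    """Thresholds from the gate sketch: coverage >= 80% AND mutation score >= 70%."""
    return line_coverage >= 80.0 and mutation_score >= 70.0

def _is_trivial(test: ast.expr) -> bool:
    # `assert True` or `assert result`: a constant or a bare-name truthiness check.
    if isinstance(test, (ast.Constant, ast.Name)):
        return True
    # `assert result is not None`
    return (
        isinstance(test, ast.Compare)
        and len(test.ops) == 1
        and isinstance(test.ops[0], ast.IsNot)
        and isinstance(test.comparators[0], ast.Constant)
        and test.comparators[0].value is None
    )

def hollow_tests(source: str) -> List[str]:
    """Names of test functions that contain assertions, all of them trivial."""
    hollow = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef) and node.name.startswith("test_"):
            asserts = [n for n in ast.walk(node) if isinstance(n, ast.Assert)]
            # Tests with no assert statements at all are left for human review.
            if asserts and all(_is_trivial(a.test) for a in asserts):
                hollow.append(node.name)
    return hollow

# Hypothetical generated tests; the call signature is an assumption.
SAMPLE = '''
def test_invoice_runs():
    result = calculate_invoice([10.0, 5.0], 0.1)
    assert result is not None

def test_invoice_truthy():
    result = calculate_invoice([10.0], 0.1)
    assert result

def test_zero_items():
    assert calculate_invoice([], 0.1) == 0.0
'''

if __name__ == "__main__":
    print(hollow_tests(SAMPLE))
    print(gate_passes(88.0, 31.0), gate_passes(88.0, 79.0))
```

A syntactic check like this catches only the cheapest hollow tests. The mutation-score threshold remains the real guardrail, because an assertion can look specific and still check a value snapshotted from buggy code.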
Step 2 — Your turn. Take a function from your codebase with generated or weak tests. Write its oracle in 3–5 lines (the behaviors a human owns), then list two mutants (small wrong edits) and predict whether the current suite would kill them. Continue Meridian if you have none: write two mutants of discount and say which of Attempt A's tests, if any, would kill each.
Step 3 — Reproduce from memory. Redraw the coverage-vs-mutation 2×2, mark which corner generated tests land in and which the coverage gate rewards. Then connect it to file 02: why is "where the oracle comes from" the same source-of-truth invariant as "where the spec comes from"?
Operational memory¶
This chapter explained why an AI-generated test suite can show high coverage while protecting almost nothing: the model is rewarded for passing, the cheapest pass is a trivial assertion, and coverage measures only that lines ran — never whether a wrong result would be caught. The important idea is that a test protects behavior only if some broken version of the code would make it fail, and mutation score measures that directly while coverage cannot — not that "generated tests are bad."
You learned to make a human own the oracle (the expected behavior, from the spec, not from the code's current output), to gate on mutation score instead of coverage, and to reject trivial assertions — turning a 31%-protection suite into a 79%-protection one without writing the verbose tests by hand. That solves the opening failure because the discount-sign regression now fails an exact-value assertion at the next run instead of sailing through a green dashboard. The same grounding discipline applies to generated docs: ground behavioral claims in the source of truth, or they teach confident, wrong mental models.
Carry this diagnostic forward: when "tests pass but a bug shipped," ask whether the suite has high coverage and low mutation score — the signature of hollow tests — and check where the failing test's oracle came from. If a generated test fails after a fix, suspect a snapshotted bug before reverting the fix.
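The "suspect a snapshotted bug before reverting the fix" rule is easier to hold onto with a small demonstration. The sketch below is hypothetical: `discount_buggy`, `discount_fixed`, and both tests are invented for illustration, using a sign error like the discount-sign regression mentioned above.

```python
from typing import Callable

Impl = Callable[[float, float], float]

def discount_buggy(subtotal: float, rate: float) -> float:
    # Bug: the discount is added instead of subtracted.
    return subtotal + subtotal * rate

def discount_fixed(subtotal: float, rate: float) -> float:
    return subtotal - subtotal * rate

def snapshot_test(impl: Impl) -> None:
    # Expected value copied from the buggy code's output: 100 + 25 = 125.
    assert impl(100.0, 0.25) == 125.0

def intent_test(impl: Impl) -> None:
    # Expected value taken from the requirement: 25% off 100 is 75.
    assert impl(100.0, 0.25) == 75.0

def passes(test: Callable[[Impl], None], impl: Impl) -> bool:
    try:
        test(impl)
        return True
    except AssertionError:
        return False

if __name__ == "__main__":
    for test in (snapshot_test, intent_test):
        for impl in (discount_buggy, discount_fixed):
            print(test.__name__, impl.__name__, passes(test, impl))
```

The snapshot test is green on the bug and red on the fix, which is the pattern that tempts someone to revert a correct change. The intent test behaves the other way around, so a red result points at the code rather than at the fix.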
Remember:
- A test is only as strong as the broken code it would fail on; ask "what bug would this catch?" not "does it pass?"
- Coverage measures what tests touch; mutation score measures what they protect — gate on the second once a model writes tests.
- The oracle must come from human-owned intent; a code-derived oracle freezes bugs and can defend them against fixes.
- The highest-coverage module can be the least-protected — the diverging coverage/mutation gap is the hollow-suite signature.
- Generated docs fail like generated tests — fluent and ungrounded; ground behavioral claims or they teach wrong models.
- False confidence is a liability, not neutral; a hollow suite makes the team review less on unprotected code.
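The doc-grounding point in the list above can also be checked mechanically when the behavior lives in config. A minimal sketch, assuming a hypothetical `RetryPolicy` object stands in for the real source of truth and that the doc states its claim in a fixed phrase; both are illustrative assumptions, and the 3-versus-5 retry numbers come from the Q5 scenario.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class RetryPolicy:
    # Hypothetical config object standing in for the real source of truth.
    max_retries: int
    backoff: str  # e.g. "fixed" or "exponential"

def doc_claim(doc_text: str) -> RetryPolicy:
    """Extract a claim phrased like '3 retries with exponential backoff'."""
    match = re.search(r"(\d+) retries with (\w+) backoff", doc_text)
    if match is None:
        raise ValueError("no retry claim found in doc text")
    return RetryPolicy(max_retries=int(match.group(1)), backoff=match.group(2))

def doc_is_grounded(doc_text: str, actual: RetryPolicy) -> bool:
    """True only if the behavioral claim in the doc matches the config."""
    return doc_claim(doc_text) == actual

def render_doc(policy: RetryPolicy) -> str:
    # Grounded alternative: generate the sentence from the config itself.
    return f"The service does {policy.max_retries} retries with {policy.backoff} backoff."

ACTUAL = RetryPolicy(max_retries=5, backoff="fixed")
GENERATED_DOC = "The service does 3 retries with exponential backoff."

if __name__ == "__main__":
    print(doc_is_grounded(GENERATED_DOC, ACTUAL))
    print(doc_is_grounded(render_doc(ACTUAL), ACTUAL))
```

Rendering the sentence from config removes the drift at the source; the check is the fallback for prose a model wrote freely.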
Bridge. We grounded tests and docs in a human-owned source of truth so generated artifacts protect real behavior. But all four chapters so far live before deploy: in the editor, the PR, the CI gate. The moment code is in production, the source of truth changes from a spec to live telemetry: logs, traces, metrics, alerts. The next file moves AI into operations, where a copilot summarizes an incident or suggests a runbook step, and where the grounding gap becomes a 3 a.m. liability, because an ungrounded summary doesn't just waste reader-hours; it sends on-call chasing a root cause that was never there. → 05-ops-and-incident-copilots.md