01. Coding assistants in the loop: velocity feels up, so why is rework rising?
~18 min read. A developer with Copilot or Claude Code finishes a task in half the time and feels twice as productive. The pull-request graph agrees. Six weeks later, the revert graph agrees too. This file shows where assistants genuinely earn their keep, where they quietly move the bottleneck onto review, and how to tell the difference before your org celebrates the wrong number.
Built on 00-first-principles.md. The forces here are the leverage-rework tradeoff, the review tax, the vanity metric, and the amplifier rule. This file grounds them in the inner loop — the place a developer and an assistant spend all day — and sets Meridian's baseline.
What we know so far and what still breaks
The overview made one claim with teeth: AI does not fix an engineering org, it amplifies it. The 2025 DORA data showed throughput finally moving up with AI adoption, while stability kept slipping. The METR trial showed experienced developers on big repos getting slower while feeling faster. We named the trap — measuring the tool instead of the outcome — but we have not yet looked at where the leverage and the cost actually come from.
They come from the inner loop: the minute-to-minute cycle of typing code, getting a completion, accepting or rejecting it, running it, fixing it. This is where 90% of an assistant's interactions happen and where the felt productivity is most intense. It is also where the rework is born — invisibly, because rework does not show up in the same week as the velocity.
This chapter answers three things: which inner-loop tasks assistants genuinely accelerate, why "velocity up" and "rework up" happen together rather than contradicting each other, and the one metric pair that lets Meridian see both at once.
What this file solves
A team turns on Copilot and within a week pull requests are merging faster and developers report loving it. Leadership reads this as a win and plans to expand. This file shows why "PRs merge faster" and "more changes get reverted within two weeks" are not a contradiction but the same event seen at two different times — and gives you the concrete move: pair every velocity metric with a rework metric, measured per-developer and per-task-type, so the leverage shows up net of the cost it creates.
Where assistants actually earn their keep
Walk into Meridian's codebase on a normal Tuesday. A backend engineer, Priya, is doing four kinds of work in one afternoon. Watch where the assistant helps and where it hurts.
First, she writes a paginated list endpoint that looks like fifteen others in the service. The assistant completes the handler, the serializer, and the query in three accepted suggestions. This is the sweet spot: high-pattern, low-novelty, locally-verifiable code. The pattern is everywhere in the repo, so the model has seen it; the output is small enough to eyeball; and a unit test confirms it in seconds. Priya saves twenty minutes and the code is correct.
Second, she writes a regex to parse a vendor's oddly-formatted timestamps. The assistant suggests one instantly. It looks right. It passes her two example strings. In production it silently drops timezone offsets on 4% of inputs. This is the plausible-but-wrong zone: output that is locally fluent and globally incorrect, where the cost of verifying correctness is higher than the cost of writing it herself would have been.
Third, she refactors a 600-line module that touches auth, caching, and a feature flag. The assistant keeps suggesting edits that are individually reasonable and collectively incoherent — it cannot hold the cross-cutting invariants in its head, so it "fixes" the cache by breaking the flag. This is the high-context, cross-cutting zone where the assistant's lack of whole-system understanding turns into a stream of confident, locally-valid, globally-wrong edits.
Fourth, she investigates why a test is flaky. The assistant is genuinely useful again — not writing code, but summarizing the failing stack trace, recalling what a library function does, and suggesting three hypotheses. This is the explain-and-recall zone: the assistant as a fast, lossy reference, where being wrong is cheap because Priya verifies against reality immediately.
Teacher voice. Notice the pattern: the assistant is excellent exactly where verification is cheap and the answer is local, and dangerous exactly where verification is expensive and the answer is global. The skill is not "use AI" or "don't use AI." It is knowing which of the four zones you are standing in. A senior engineer routes work to the assistant by verification cost, not by how hard the task feels.
The naive read: PRs merge faster, so we are faster
Meridian's first instinct after rollout is the obvious one. Pull this dashboard:
| Metric | Weeks 1-4 (pre-Copilot) | Weeks 5-8 (post-Copilot) |
|---|---|---|
| PRs merged / dev / week | 4.1 | 4.9 (+20%) |
| Median PR cycle time | 31h | 24h (−23%) |
| Self-reported productivity | n/a | +41% |
Every number is green. The naive conclusion: 20% more throughput, ship it to all 200 engineers.
The break shows up in week 10, when someone finally plots the metric nobody was watching:
Change-fail rate (PRs reverted or hotfixed within 14d): 9% → 14%
Lines reworked within 2 weeks of merge: 5.7% → 9.1%
The PRs are merging faster, and more of them are coming back. The velocity was real and the rework was also real; they just landed in different weeks, so a four-week dashboard saw only the first.
The real problem is not "Copilot makes bad code." It is that generation got cheaper while verification did not, and we measured the cheap half. The throughput number captured the generation speedup and was blind to the verification debt that the same speedup created.
So how do we make the hidden half visible at the same time as the obvious half?
When a 20% speedup hides a 5% loss
Here is the smallest version of the whole problem, on one task type.
Task type: "write CRUD endpoint" — 100 instances, measured
Without assistant: avg 45 min to write, 2% reworked later
With assistant: avg 28 min to write, 7% reworked later
Naive view: 45 → 28 min = 38% faster. Win.
Net view: rework costs ~90 min each to find + fix.
Without: 0.02 × 90 = 1.8 min amortized rework / task
With: 0.07 × 90 = 6.3 min amortized rework / task
Net time with assistant: 28 + 6.3 = 34.3 min
Net time without: 45 + 1.8 = 46.8 min
Real speedup: 27%, not 38%.
On CRUD endpoints, the assistant still wins after rework — 27% is great. The reason it wins is that CRUD is in the cheap-to-verify zone. Now run the same arithmetic on the timestamp-regex zone, where rework jumps to 30% and each defect ships to production:
With assistant: 15 min to write, 30% reworked, 300 min each to find in prod + fix
Net: 15 + 0.30 × 300 = 105 min
Without: 40 min to write, 3% reworked
Net: 40 + 0.03 × 300 = 49 min
The assistant made this task type 2× slower, net.
Same tool, same developer, opposite outcome — decided entirely by the verification cost of the task type. This is why a single org-wide "is AI helping?" number is meaningless. The answer is a distribution across task types, and the mean hides both the wins and the losses.
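The arithmetic in both examples can be reproduced with a few lines of Python. This is a minimal sketch: the function names `net_minutes` and `net_speedup` are illustrative, and the inputs are the figures quoted in the two examples above.

```python
def net_minutes(write_min: float, rework_rate: float, rework_cost_min: float) -> float:
    """Expected minutes per task: time to write plus amortized rework."""
    return write_min + rework_rate * rework_cost_min


def net_speedup(without: float, with_assistant: float) -> float:
    """Fraction of time saved; a negative value means the assistant is a net loss."""
    return 1 - with_assistant / without


# CRUD endpoint (cheap to verify): rework costs ~90 min per defect.
crud_without = net_minutes(45, 0.02, 90)
crud_with = net_minutes(28, 0.07, 90)

# Vendor timestamp regex (expensive to verify): ~300 min per defect found in prod.
regex_without = net_minutes(40, 0.03, 300)
regex_with = net_minutes(15, 0.30, 300)

print(f"CRUD:  {net_speedup(crud_without, crud_with):+.0%}")
print(f"regex: {net_speedup(regex_without, regex_with):+.0%}")
```

Running the same two functions per task type, rather than once for the whole org, is what turns the single "is AI helping?" number into the distribution the text describes.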
Rule: net leverage is output minus rework, measured where the rework lands
The load-bearing truth of this chapter: an accepted suggestion is not a win until it survives the period in which it would have been reworked. Acceptance is a loan against future verification. Sometimes the loan is free (CRUD). Sometimes the interest is brutal (the cross-cutting refactor). You cannot know which from the acceptance event alone — only from the rework that follows it.
Why the naive metric breaks. The primitive is simple: total cost = generation cost + verification-and-rework cost. The assistant slashes the first term and is silent about the second. The constraint that breaks the naive approach is time: rework lands weeks after generation, in a different metric, attributed to a different cause. So any measurement window shorter than the rework horizon will show pure gain. The fix is not a better model — it is a longer, paired measurement.
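One way to keep a short window from showing pure gain is to compute change-fail rate only over PRs whose full rework horizon has already elapsed. The sketch below assumes a hypothetical `MergedPR` record with a merge date and an optional revert/hotfix date; the 14-day horizon matches the window used for Meridian's change-fail rate.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

REWORK_HORIZON = timedelta(days=14)  # the 14-day revert/hotfix window


@dataclass
class MergedPR:  # hypothetical record shape, for illustration only
    merged_on: date
    reverted_on: Optional[date] = None  # date of revert or hotfix, if any


def change_fail_rate(prs: list[MergedPR], as_of: date) -> Optional[float]:
    """Change-fail rate over PRs that have had the full horizon to fail.

    PRs merged fewer than 14 days before `as_of` are excluded: they have not
    had time to be reverted yet, so counting them biases the rate toward zero.
    """
    matured = [p for p in prs if p.merged_on + REWORK_HORIZON <= as_of]
    if not matured:
        return None  # nothing measurable yet
    failed = sum(
        1
        for p in matured
        if p.reverted_on is not None
        and p.reverted_on - p.merged_on <= REWORK_HORIZON
    )
    return failed / len(matured)


prs = [
    MergedPR(date(2025, 3, 3)),
    MergedPR(date(2025, 3, 4), reverted_on=date(2025, 3, 12)),
    MergedPR(date(2025, 3, 20)),  # too recent to judge on 2025-03-25
]
rate = change_fail_rate(prs, as_of=date(2025, 3, 25))
```

Here `rate` is computed over the two matured PRs only. The PR merged on March 20 is left out until its horizon closes, so it cannot make the rate look better than it is.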
1) The review tax: where the bottleneck actually moves
The inner loop produces a diff. Before that diff is real, a human has to review it. This is the constraint everyone forgets at design time.
Suppose the assistant doubles a developer's code output. The team's review throughput did not double: the same number of senior engineers read for the same number of hours per day. So one of three things happens, and you do not get to choose which without designing for it:
Generation rate ↑ 2×, review capacity flat
│
├─→ (A) Reviews get shallower → bugs slip through → rework ↑ (DORA stability drop)
├─→ (B) Review queue grows → cycle time ↑ → velocity gain evaporates
└─→ (C) Authors review less of their own work → assertion-free code merges → tech debt ↑
Meridian hit (A). Reviewers, facing 30% more diff per PR, started skimming. The defects that slipped through became the week-10 rework spike. The throughput gain did not vanish — it converted into a stability loss. This is the DORA finding playing out in one team: AI raised throughput and lowered stability through the review channel, because review capacity was the unmoved constraint and the extra output had to go somewhere.
Mini-FAQ. "Can't we just use AI to review the AI's code too?" Partly, and the next file is about exactly that. But AI review has its own catch rate and its own false-positive tax, so it relieves the human review constraint without removing it. You move the bottleneck; you do not delete it. Every capacity you add downstream of generation buys time until generation outpaces it again.
The diagnosis sentence to carry: the bottleneck is never the thing AI made faster; it is the next station that AI did not speed up. If coding gets 2× faster, look at review, CI, QA, and deploy: one of them is now the wall.
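Path (B) can be illustrated with a very small fluid model of the review queue. Everything numeric here is hypothetical (a team whose reviewers can handle 40 PRs a week); the point is only the shape: once arrivals exceed a flat capacity, the backlog grows every week.

```python
def review_backlog(arrivals_per_week: float, capacity_per_week: float, weeks: int) -> list[float]:
    """PRs still waiting for review at the end of each week."""
    backlog = 0.0
    history = []
    for _ in range(weeks):
        # Whatever reviewers cannot absorb this week carries over to the next.
        backlog = max(0.0, backlog + arrivals_per_week - capacity_per_week)
        history.append(backlog)
    return history


# Hypothetical numbers: review capacity is flat at 40 PRs/week.
before = review_backlog(arrivals_per_week=36, capacity_per_week=40, weeks=4)
after = review_backlog(arrivals_per_week=72, capacity_per_week=40, weeks=4)

print(before)  # [0.0, 0.0, 0.0, 0.0]
print(after)   # [32.0, 64.0, 96.0, 128.0]
```

In practice the queue rarely grows this cleanly, because reviewers respond by skimming, which is path (A). The model is still useful as a prompt: if generation doubles, which downstream number is supposed to absorb it?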
2) The four-zone mental model: picture before metrics
This is the core mental model of the chapter. Keep it as the canonical image: a 2×2 of how novel the task is against how expensive verification is.
| | Verification cost: cheap | Verification cost: expensive |
|---|---|---|
| Low novelty (seen pattern) | GREEN ZONE: CRUD, glue, boilerplate, config. → Accept fast; big net win. | YELLOW ZONE: subtle logic, parsing, edge cases, regex. → AI drafts, human verifies hard; net ~0. |
| High novelty (unseen, one-off) | BLUE ZONE: explain, recall API, summarize trace. → AI as fast reference; being wrong is cheap. | RED ZONE: cross-cutting refactor, new architecture, security logic. → AI net NEGATIVE: confident wrong answers, global breakage. |
The assistant's value is highest in green and blue (left column, cheap verification) and turns negative in red (the bottom-right cell, where verification is expensive and novelty is also high). The whole skill of using assistants well is routing: knowing which quadrant the current task lives in and adjusting how much you trust the output. Meridian's mistake was treating all four zones as one "AI helps coding" bucket.
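The routing decision in the grid is small enough to write down as a lookup. This is a sketch only: the `Zone` enum and `classify` function are illustrative names, and the two boolean inputs are judgment calls a developer makes per task, not something a tool can measure for you.

```python
from enum import Enum


class Zone(Enum):
    GREEN = "accept fast"
    BLUE = "use as a fast reference"
    YELLOW = "AI drafts, human verifies hard"
    RED = "raise scrutiny or write it yourself"


def classify(high_novelty: bool, expensive_verification: bool) -> Zone:
    """Map the two axes of the grid to a zone and its default handling."""
    if expensive_verification:
        return Zone.RED if high_novelty else Zone.YELLOW
    return Zone.BLUE if high_novelty else Zone.GREEN


# Priya's four tasks, placed by the two axes.
print(classify(high_novelty=False, expensive_verification=False))  # paginated endpoint
print(classify(high_novelty=False, expensive_verification=True))   # vendor timestamp regex
print(classify(high_novelty=True, expensive_verification=True))    # 600-line refactor
print(classify(high_novelty=True, expensive_verification=False))   # flaky-test triage
```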
3) Meridian's baseline: the running example, instrumented
Before applying anything, you measure the before-state. This is the honest baseline, and it is the artifact every later chapter measures against. Meridian's platform team captured eight weeks of pre-rollout data across all 200 engineers:
MERIDIAN BASELINE (8 weeks, pre-AI, 200 engineers)
| Category | Metric | Baseline |
|---|---|---|
| Throughput | Deploys / day | 12 |
| Throughput | PRs merged / dev / week | 4.1 |
| Throughput | Median PR cycle time | 31h |
| Stability | Change-fail rate (revert/hotfix 14d) | 9% |
| Stability | Lines reworked within 2 weeks | 5.7% |
| Stability | P1/P2 incidents / month | 6 |
| Flow & satisfaction (SPACE-style survey) | "I spend my time on valuable work" | 62% agree |
| Flow & satisfaction (SPACE-style survey) | Median uninterrupted focus block | 38 min |
| Cost | Fully-loaded eng cost / month | ~$3.0M |
Note what is and is not here. There is no "lines of code" and no "Copilot acceptance rate" — those come later and are explicitly flagged as vanity. The baseline is built from outcomes (deploys, failures, incidents) and experience (focus, valuable-work). When the rollout happens, these are the numbers that must move, paired, before anyone declares a win.
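A paired check against the baseline might look like the sketch below. The `Snapshot` type, the `paired_verdict` function, and the 5% tolerance are all assumptions made for illustration; the numbers are Meridian's baseline and the post-rollout figures quoted earlier in this file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:  # hypothetical container for the paired metrics
    prs_per_dev_week: float
    change_fail_rate: float  # fraction of PRs reverted or hotfixed within 14 days
    rework_rate: float       # fraction of lines reworked within 2 weeks


def relative_change(before: float, after: float) -> float:
    return (after - before) / before


def paired_verdict(baseline: Snapshot, current: Snapshot, tolerance: float = 0.05) -> str:
    """Declare a win only when throughput rises and stability does not degrade.

    `tolerance` is an assumed noise band (5% relative change), not a standard.
    """
    throughput_up = relative_change(baseline.prs_per_dev_week, current.prs_per_dev_week) > tolerance
    stability_held = (
        relative_change(baseline.change_fail_rate, current.change_fail_rate) <= tolerance
        and relative_change(baseline.rework_rate, current.rework_rate) <= tolerance
    )
    if throughput_up and stability_held:
        return "net leverage"
    if throughput_up:
        return "throughput up, stability down: investigate review capacity"
    return "no throughput gain"


baseline = Snapshot(prs_per_dev_week=4.1, change_fail_rate=0.09, rework_rate=0.057)
post_rollout = Snapshot(prs_per_dev_week=4.9, change_fail_rate=0.14, rework_rate=0.091)
print(paired_verdict(baseline, post_rollout))
```

The design choice worth noting is that the function cannot return a win from the throughput number alone; the stability fields are required inputs.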
Teacher voice. The single most common rollout failure is starting the baseline after turning on the tool. Once everyone has the assistant, you have nothing to compare to, and every claim becomes a story. Measure for at least four weeks before you flip the switch. The baseline is cheap to collect and impossible to reconstruct later.
4) Why an assistant, not a smarter snippet library or a better linter
The plausible alternative to a generative assistant is the thing teams used before: rich snippet libraries, code generators, scaffolding CLIs, and aggressive linters/formatters. Why did the generative assistant displace them for inner-loop work?
A snippet library is deterministic and safe but rigid — it only knows the patterns you pre-wrote, and it cannot adapt a pattern to the local variable names, types, and surrounding context. The generative assistant's advantage is exactly context-adaptive completion: it reads the open file, the imports, the function signature, and produces something shaped to this spot, not a generic template you must then edit. For green-zone work, that adaptation is the entire value — it removes the edit-after-paste step that made snippets only mildly useful.
But under a different workload the comparison flips. For a security-critical crypto routine (red zone), the deterministic generator that emits a single audited, vetted implementation is safer than an assistant that produces a fresh, plausible, unaudited variant each time. The assistant's adaptiveness becomes a liability where you wanted invariance. So the choice is workload-dependent: adaptive generation wins where local fit matters and verification is cheap; deterministic generation wins where correctness is non-negotiable and you want the same vetted output every time.
This is the amplifier rule in tool form. The assistant amplifies your context — including your bad context. Point it at a file full of insecure patterns and it will fluently extend them.
5) The property that changes everything: verification cost per task
If you change one thing about how you think about assistants, change this: the design variable is not model quality, it is verification cost per task type. Two task types with identical assistant accuracy can have opposite net value purely because one is cheap to check and one is not.
Task A: "rename a variable across 5 files"
Assistant accuracy: 95%. Verification: trivial (tests + grep). Net: big win.
Task B: "implement OAuth token refresh with rotation"
Assistant accuracy: 95%. Verification: expensive (security review, edge cases,
replay attacks). The 5% wrong is the 5% that costs a breach.
Net: dangerous even at 95%.
Same accuracy, opposite decision. The reason is that in Task B the cost of the 5% failure is catastrophic and hard to detect, so high accuracy is not enough — you need either near-perfect accuracy or cheap verification, and OAuth gives you neither. This is why blanket "the assistant is 92% accurate" claims are useless for decisions. Accuracy interacts with verification cost and failure cost; only the product decides.
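One possible way to write down how accuracy interacts with verification cost and failure cost is an expected-value sketch. The formula and every number below are assumptions chosen for illustration (none are measured); only the 95% accuracy comes from the two tasks above.

```python
def expected_net_minutes(
    time_saved: float,
    accuracy: float,
    verify_cost: float,
    failure_cost: float,
    catch_rate: float,
) -> float:
    """Expected minutes gained per task; negative means a net loss.

    verify_cost  : minutes spent checking each suggestion
    failure_cost : minutes lost when a wrong suggestion escapes verification
    catch_rate   : fraction of wrong suggestions that verification catches
    """
    escaped = (1 - accuracy) * (1 - catch_rate)
    return time_saved - verify_cost - escaped * failure_cost


# Task A: rename a variable across 5 files (all values hypothetical).
task_a = expected_net_minutes(
    time_saved=10, accuracy=0.95, verify_cost=1, failure_cost=30, catch_rate=0.99
)

# Task B: OAuth token refresh with rotation (all values hypothetical;
# failure_cost stands in for the cost of a security incident).
task_b = expected_net_minutes(
    time_saved=60, accuracy=0.95, verify_cost=120, failure_cost=50_000, catch_rate=0.80
)

print(f"Task A: {task_a:+.1f} min, Task B: {task_b:+.1f} min")
```

With identical accuracy the sign flips, because Task B pays more to verify and far more for each failure that gets through.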
6) One failure walked through: the silent timestamp bug
Return to Priya's timestamp regex. Trace it end to end, because this is the canonical yellow-zone failure and you will see it a hundred times.
1. Priya asks for a regex to parse vendor timestamps like "2026-03-14T09:30:00+05:30".
2. Assistant returns: \d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2} (drops the offset)
3. It matches her two test strings (both happened to be +00:00). Tests green.
4. PR review: reviewer sees a regex, it has tests, it looks standard. Approve.
5. Merge. Deploy. Works for 96% of traffic (UTC vendors).
6. Two weeks later: an Indian vendor's +05:30 timestamps parse as UTC.
Reconciliation job books transactions 5.5 hours late (09:30+05:30 is 04:00 UTC, not 09:30 UTC). Finance notices a
month-end mismatch. Three engineers spend a day tracing it.
Where did the system fail? Not at generation — the regex was a reasonable guess. It failed at verification: the test data did not cover the offset case, the reviewer pattern-matched on "regex + tests = fine," and nothing in the loop forced the question "what inputs does this not handle?" The assistant did not make Priya dumber; it made the easy-looking-wrong answer fast enough to outrun the team's verification habits. The amplifier rule again: weak input-coverage testing, amplified.
The fix is not "don't use AI for regex." It is to recognize the yellow zone and raise verification to match: property-based tests, fuzzed inputs, an explicit "list the cases this does not handle" prompt. The next chapters build exactly these gates.
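The failure and one form of the fix can be shown with the standard library alone. This is a sketch: `parse_naive` wraps the regex from step 2 (the assume-UTC behavior is inferred from step 6), `parse_with_offset` uses `datetime.fromisoformat` as a reference, and `fuzz_offsets` is a hand-rolled stand-in for a property-based test, not the gate later chapters build.

```python
import random
import re
from datetime import datetime, timedelta, timezone

# The suggestion from step 2: matches date and time, ignores the offset.
NAIVE = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")


def parse_naive(ts: str) -> datetime:
    """Buggy parser: keeps the wall-clock time and silently assumes UTC."""
    m = NAIVE.match(ts)
    if m is None:
        raise ValueError(f"unparseable timestamp: {ts!r}")
    naive = datetime.strptime(m.group(0), "%Y-%m-%dT%H:%M:%S")
    return naive.replace(tzinfo=timezone.utc)


def parse_with_offset(ts: str) -> datetime:
    """Reference parser: fromisoformat understands +HH:MM offsets."""
    return datetime.fromisoformat(ts).astimezone(timezone.utc)


def fuzz_offsets(parser, trials: int = 200, seed: int = 0) -> list[str]:
    """Return generated timestamps on which `parser` disagrees with the reference."""
    rng = random.Random(seed)
    failures = []
    for _ in range(trials):
        # Offsets in 15-minute steps between -12:00 and +14:00.
        offset = timedelta(minutes=15 * rng.randint(-48, 56))
        local = datetime(
            2026, 3, 14, rng.randint(0, 23), rng.randint(0, 59), 0,
            tzinfo=timezone(offset),
        )
        text = local.isoformat()
        if parser(text) != parse_with_offset(text):
            failures.append(text)
    return failures


sample = "2026-03-14T09:30:00+05:30"
print(parse_naive(sample) - parse_with_offset(sample))  # size of the silent shift
print(len(fuzz_offsets(parse_naive)), "disagreements found by fuzzing")
```

Two hand-picked `+00:00` strings cannot expose the bug; varying the offset across generated inputs does, which is the verification step the original loop was missing.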
7) The cost movement: what gets cheaper, what gets more expensive, who pays
Every optimization moves cost rather than removing it. Here is the ledger for inner-loop assistants, with Meridian's numbers.
| What changes | Direction | Concrete (Meridian) | Who absorbs it |
|---|---|---|---|
| Time to first draft | cheaper | 45→28 min on green-zone tasks | the author (wins time) |
| Lines of diff per PR | larger | +30% median diff size | the reviewer (more to read) |
| Review depth per line | shallower | reviewers skim under load | stability (defects slip) |
| Rework within 2 weeks | more expensive | 5.7%→9.1% | the team, weeks later |
| P1/P2 incidents | rises if review capacity stays capped | 6→7.5 / month in heavy-use teams | on-call + customers |
| Tool licensing | new cost | ~$19–39/dev/mo | the budget |
The pressure relieved is generation latency. The pressure created is verification load, and it is absorbed by review, CI, QA, and on-call — the stations downstream of coding. The whole rest of the module is about expanding those downstream stations (AI review, generated tests, eval gates) so the relieved pressure does not just pile up against the next wall.
Teacher voice. Read that table as a conservation law. The work did not disappear when the assistant wrote the code faster; it moved downstream and changed shape from "writing" to "checking." An org that adds generation capacity without adding verification capacity has not gotten faster — it has gotten a longer queue at review and a taller incident graph.
8) Signals: healthy, first to degrade, misleading, and the expert's graph
Healthy: throughput up and change-fail rate flat or down; rework within two weeks stable; reviewer cycle time not climbing; focus blocks holding.
First metric to degrade: lines reworked within two weeks. It moves before incidents do, because rework is the early form of the defect that later becomes an incident. Watch it weekly, per team.
The misleading metric everyone watches: acceptance rate (% of suggestions accepted) and lines generated. These are the vanity metrics of this domain. They go up by construction whenever the tool is on, they tell you nothing about whether the accepted code survived, and a high acceptance rate is as consistent with "great suggestions" as with "developers rubber-stamping plausible junk." Track them for adoption-health only; never let them stand in for outcome.
The graph an expert opens first: change-fail rate and rework rate plotted on the same time axis as PR throughput, segmented by each team's AI-usage intensity. If the high-usage teams show the throughput-up/stability-down scissors, you have found the review-tax conversion. If high-usage teams hold both, AI is genuine net leverage there and you can expand.
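The segmentation behind that graph can be sketched as a small grouping step. The `TeamDelta` record, the 0.5 usage threshold, and the four example teams are hypothetical; real data would come from whatever delivery analytics the org already collects.

```python
from collections import defaultdict
from statistics import mean
from typing import NamedTuple


class TeamDelta(NamedTuple):  # hypothetical per-team row
    team: str
    ai_usage: float           # share of PRs with assistant involvement, 0..1
    throughput_delta: float   # relative change in PRs merged vs. baseline
    change_fail_delta: float  # change in change-fail rate, percentage points


def segment_by_usage(rows: list[TeamDelta], threshold: float = 0.5) -> dict[str, dict[str, float]]:
    """Average the paired deltas separately for high- and low-usage teams."""
    buckets: dict[str, list[TeamDelta]] = defaultdict(list)
    for row in rows:
        buckets["high" if row.ai_usage >= threshold else "low"].append(row)
    return {
        name: {
            "throughput_delta": mean(r.throughput_delta for r in group),
            "change_fail_delta": mean(r.change_fail_delta for r in group),
        }
        for name, group in buckets.items()
    }


def shows_scissors(segment: dict[str, float]) -> bool:
    """Throughput up while change-fail rate is also up."""
    return segment["throughput_delta"] > 0 and segment["change_fail_delta"] > 0


# Made-up example rows, for illustration only.
rows = [
    TeamDelta("team-a", 0.8, 0.25, 5.0),
    TeamDelta("team-b", 0.7, 0.15, 3.0),
    TeamDelta("team-c", 0.2, 0.05, 0.0),
    TeamDelta("team-d", 0.1, 0.03, -1.0),
]
segments = segment_by_usage(rows)
print({name: shows_scissors(seg) for name, seg in segments.items()})
```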
9) Boundary of applicability: where assistants are strong, where pathological
Strong fit: large codebases with dense internal patterns (the model has examples to imitate), strong test suites and CI (cheap verification), small-batch PR culture (the amplifier has good practices to amplify), and tasks clustered in the green/blue zones. Here assistants are close to free leverage.
Pathological: novel architecture work, security-critical logic, large cross-cutting refactors, and any codebase with weak tests and big-batch PRs. Here the assistant amplifies weak verification into shipped defects. The METR result — experienced devs slower on their own large, idiosyncratic repos — is this boundary: deep expert context plus low pattern-match value plus expensive verification is the worst case for current assistants.
Scale/workload that breaks naive intuition: the intuition "more powerful model → more value" fails in the red zone. A stronger model produces more convincing wrong answers in high-novelty, expensive-verification work, which can increase net cost by making the wrong answer harder to catch. Capability and safety diverge exactly where you'd most want them aligned.
10) Wrong assumption: "the better the model, the more it helps everywhere"
The seductive belief is that assistant value scales monotonically with model capability — bigger model, more help, full stop. It is false, and the falsity is the chapter's memory hook.
In the green and blue zones, better models do help more, because the bottleneck is generation and verification is cheap. In the red zone, a better model can hurt: it generates a more fluent, more authoritative wrong answer in exactly the situation where verification is expensive and the human is most likely to defer to confident output. The variable that decides value is not capability alone — it is capability relative to verification cost. Replace the wrong belief with: a model helps in proportion to how cheaply you can check it, not how smart it is.
11) Other failure shapes to recognize
- Anchoring on the first suggestion. The assistant's first completion biases the developer toward its approach even when a better one existed; the human stops thinking and starts editing.
- Copy-paste cloning. Instead of extracting a shared function, developers accept the same generated block in five places — GitClear measured clone frequency rising sharply in AI-heavy code, and clones carry 15–50% more defects.
- Comment-driven hallucination. A developer writes a comment describing intended behavior; the assistant generates code that matches the comment's words but not the system's reality.
- Skill atrophy on juniors. Junior developers who accept completions without understanding them stop building the mental models that make them seniors; the org's future review capacity erodes silently.
- Prompt-shaped code. Code structured to be easy for the assistant to extend, not easy for humans to maintain — long flat functions the model can complete, instead of small composed ones.
- Stale-context completions. The assistant completes against an outdated import or a renamed API because its window did not include the recent change, producing code that compiles but calls the wrong thing.
- Confidence laundering. A junior pastes an assistant explanation into a PR description as if it were their own reasoning, and a reviewer trusts it as human judgment.
12) Pattern transfer: where this same pressure recurs
- The leverage-rework tradeoff is the same shape as cache invalidation: the cache (acceptance) makes reads fast, but stale entries (rework) cost more than the reads saved unless you measure hit and miss cost together.
- The review tax is a bottleneck shift from queueing systems: speeding up one stage (generation) just relocates the queue to the slowest unmoved stage (review). Module 03 on agent observability frames the same move: instrument the station, not the speedup.
- The grounding gap introduced in 00 is the same root cause as RAG hallucination in module 08: fluent output detached from ground truth. Here the ground truth is your codebase and tests instead of retrieved documents.
- The amplifier rule echoes the DORA finding across this whole module and recurs in 06 (measurement) and 07 (security): AI multiplies existing practice, so weak practice is amplified into incident.
13) Design test: five yes/no questions before trusting an assistant on a task
- Is this task in the green or blue zone (cheap verification, or wrong-is-cheap)? If yellow, raise verification to match; if red, raise scrutiny or write it yourself.
- Do I have a test that would fail if the suggestion is subtly wrong — including edge cases the suggestion might skip?
- Will the reviewer of this diff have the context to catch a plausible-but-wrong line, given how much bigger AI made the diff?
- Am I measuring rework on this task type, or only acceptance?
- If this exact suggestion is wrong, what is the blast radius — a typo, or a financial reconciliation error?
Where this appears in production
- GitHub Copilot — inline completion in the editor; the canonical green-zone assistant, strongest on boilerplate and pattern-dense code.
- GitHub Copilot coding agent — GA since 2025; takes an issue and drafts a full PR asynchronously, pushing the review tax to the fore since output arrives without a human in the typing loop.
- Cursor — editor-native agent with multi-file edits; its "apply" flow makes acceptance frictionless, which raises both leverage and the risk of rubber-stamping.
- Claude Code — terminal agent that reads files, runs commands, and reacts to output; lives in the blue zone (explain, recall, iterate) and the green zone, with explicit diffs to review.
- Anthropic internal usage — engineers report Claude Code for test scaffolding, refactors, and trace triage, with humans owning the merge decision.
- Sourcegraph Cody — completion plus codebase-aware chat; the codebase context is what moves yellow-zone tasks toward verifiable.
- JetBrains AI Assistant — IDE completions and explanations across the JetBrains family, same green/blue split.
- Amazon Q Developer (formerly CodeWhisperer) — completions with a built-in reference tracker that flags suggestions resembling known open-source, a direct response to the license-contamination risk in file 07.
- Tabnine — completion with on-prem/self-hosted models for orgs that cannot send code to a vendor, trading some capability for data-boundary control.
- Replit Agent — generates and runs apps end to end; heavy green-zone value for greenfield, heavy red-zone risk when users ship unreviewed.
- Google (Gemini Code Assist) — completions and chat in IDEs and Cloud, with enterprise data-governance controls.
- Meta — internal AI tooling for large-monorepo completion, where dense internal patterns make the green zone large.
- Stripe — internal assistant usage with strong test gates, an example of the "good practices amplified" side of the amplifier rule.
- Shopify — public stance encouraging AI tooling while keeping review and test discipline as the guardrail.
- GitClear — the analytics vendor whose 2025 study quantified the clone/churn rise, the data behind the copy-paste failure mode.
- DX (getdx.com) — developer-experience analytics used by orgs to instrument exactly the throughput/rework scissors this chapter warns about.
Pause and recall
- Name the four zones in the verification-cost / novelty model and which one makes assistants net-negative.
- Why do "PRs merge faster" and "more reverts within two weeks" both happen — and why does a four-week dashboard miss it?
- Write the net-leverage formula in words.
- Where does the bottleneck move when generation doubles but review capacity is flat?
- Why is acceptance rate a vanity metric?
- Which single metric degrades first when the review tax bites, and why before incidents?
- Why can a better model make the red zone worse?
- What is the one thing Meridian had to do before turning on Copilot, and why is it irrecoverable later?
Interview Q&A
Q1. Your org turned on Copilot and PR throughput rose 20%. Leadership wants to expand. What do you check first? A. The paired guardrail: change-fail rate and rework-within-two-weeks, segmented by AI-usage intensity, on the same time axis as throughput. If high-usage teams show throughput up and stability down, the gain is converting into rework via the review tax and expansion will scale the problem. If they hold both, expand. Common wrong answer to avoid: "Expand — 20% throughput is a clear win." Throughput alone is half the ledger; the rework lands later and in a different metric.
Q2. Why does the same assistant make CRUD endpoints faster but OAuth token-refresh slower, net? A. Verification cost. CRUD is cheap to verify (tests, eyeball) so even at 90% accuracy the net is positive. OAuth is expensive to verify and the 10% failure is a security defect that ships and costs a breach, so the same accuracy is net-negative. Value is accuracy relative to verification and failure cost, not accuracy alone. Common wrong answer to avoid: "OAuth is just harder, so the model is worse at it." The model can be equally accurate; the decision flips on verification and failure cost, not raw difficulty.
Q3. A junior dev's PRs are clean and fast since adopting an assistant. Is that good? A. Maybe, maybe not — check whether they can explain the code and whether their rework rate is low. Clean-looking diffs from confident completions can hide that the dev is editing output they don't understand, which erodes the review capacity the org will need from them as a senior. Pair the velocity signal with comprehension and rework. Common wrong answer to avoid: "Yes, fast clean PRs are the goal." Speed without understanding builds future review debt and is invisible until the dev is asked to debug their own code.
Q4. Why is "Copilot acceptance rate 31%" a poor input to an expansion decision? A. It is a vanity metric: it goes up whenever the tool is on, it conflates great suggestions with rubber-stamping, and it says nothing about whether accepted code survived. It measures the tool, not the outcome. Decisions need outcome metrics (deploys, change-fail, rework) and experience metrics, not engagement. Common wrong answer to avoid: "31% means the model is good." Acceptance measures developer behavior under the tool, not code survival or delivery impact.
Q5. After enabling assistants, coding got faster but PR cycle time got worse. Diagnose. A. The bottleneck moved to review. Generation doubled, review capacity stayed flat, the queue grew, and cycle time, which includes review wait, rose. The fix is to add review capacity (AI pre-review, smaller PRs, more reviewers), not more generation. This is a bottleneck shift: you sped up a station that was not the constraint. Common wrong answer to avoid: "The assistant must be producing worse code that takes longer to review." It may be fine code; the issue is throughput mismatch between stations, not per-line quality.
Q6. Is this a model problem or a process problem when rework rises after AI rollout? (cumulative — connects to 03 and 06) A. Almost always process: the model raised generation throughput against unchanged verification capacity, and weak review/test practice (the amplifier rule) let defects through. The fix lives in review gates (file 03), test strength (file 04), and honest measurement (file 06), not in a better model. A better model can even worsen the red zone. Common wrong answer to avoid: "Upgrade to a smarter model." Capability gains don't fix a verification-capacity bottleneck and can make confident wrong answers harder to catch.
Design/debug exercise (10 min)
Step 1 — Modeled example. Here is the green-vs-yellow analysis for two Meridian task types:
Task: "add a field to an existing API response"
Zone: GREEN (pattern-dense, tests catch it, wrong is cheap)
Decision: trust assistant, accept fast, rely on existing tests.
Task: "parse a new partner's CSV with ad-hoc escaping rules"
Zone: YELLOW (subtle, expensive to verify, wrong ships silently)
Decision: AI drafts; add property-based + fuzzed tests; explicitly
prompt for "inputs this does NOT handle"; senior reviews.
Step 2 — Your turn. Take three tasks from your own current sprint. For each, place it in the 2×2 zone grid, state the verification cost, and decide: trust-and-accept, draft-and-verify-hard, or write-it-yourself. Then predict which one will generate rework if you treat all three as green.
Step 3 — Reproduce from memory. Without looking, redraw the four-zone (verification-cost × novelty) grid, label which quadrant makes assistants net-negative, and write the net-leverage formula. Then connect it to one earlier idea: which 00-overview pressure (amplifier rule, review tax, vanity metric) explains why Meridian's four-week dashboard lied?
Operational memory
This chapter explained why an AI rollout can make velocity feel up while quietly raising rework: generation got cheaper, verification did not, and the two effects land in different weeks, so a short dashboard sees only the gain. The important idea is that net leverage equals output minus rework, and the rework is born downstream of the speedup — not that "AI writes bad code."
You learned to route work by verification cost using the four-zone grid, to expect the bottleneck to move to review when generation speeds up, and to set Meridian's honest baseline of outcome and experience metrics before flipping the switch. That solves the opening failure because it makes the hidden half — rework, stability, the review tax — visible on the same axis as the obvious half, so an expansion decision rests on net leverage instead of a vanity number.
Carry this diagnostic forward: when an AI rollout "is working," ask which station is now the wall and what is the rework rate on the task types we trusted. If you see throughput up and stability down in your heaviest-usage teams, inspect review capacity and verification before crediting or blaming the model.
Remember:
- An accepted suggestion is a loan against future verification; it's only a win if it survives the rework horizon.
- Net leverage = generation saved − rework cost; measure both, segmented by task type and team usage.
- The bottleneck moves to the next station you didn't speed up — usually review.
- Acceptance rate and lines generated are vanity; rework-within-two-weeks degrades first and predicts incidents.
- Better model ≠ more help everywhere; a stronger model can worsen the red zone by making wrong answers more convincing.
- Set the baseline before rollout — it's cheap now and impossible to reconstruct later.
Bridge. We sped up the inner loop and learned to measure its rework. But a developer free-typing prompts at an assistant has no anchor — the assistant invents structure, and "what was supposed to be built" drifts from "what got built." The next file moves up a level: generating scaffolds, migrations, and infrastructure from a spec, so the human-owned specification stays the source of truth and the assistant generates toward it instead of redefining it. → 02-spec-to-code-and-scaffolding.md