08. Boundary and tradeoff review — where the evidence is contested and the hype outruns it

Two rigorous 2025 studies looked at AI coding and reached opposite-sounding conclusions: DORA found AI adoption now correlates with higher delivery throughput across thousands of organizations, and METR found experienced developers on their own large repos were 19% slower with the same generation of tools, while believing they were 20% faster. Both are good studies. Both are true. This file sits in that contradiction: where the evidence on AI-for-SDLC is genuinely contested, where the hype outruns what's measured, what works empirically without a clean theory, and what to revisit as the tools and studies keep moving every quarter.

Built on 00-first-principles.md. Every force in the module converges here: the leverage-rework tradeoff, the vanity metric, the guardrail metric, the honest baseline, the grounding gap, the amplifier rule, and the blast radius. The earlier files each resolved one pressure with one gate. This file admits the resolutions are cleaner on paper than in a real org, and teaches you to reason where the evidence itself is in tension.


What we know so far and what still breaks

The module taught a clean arc. The inner loop produces leverage and hides rework (01). A human spec stays the source of truth so scaffolds don't drift (02). The review gate blocks on deterministic findings (03). Tests are gated on mutation score, not coverage (04). Ops copilots cite telemetry instead of asserting causes (05). Delivery metrics with paired guardrails tell leverage from theater (06). Deterministic gates and a data boundary contain the irreversible legal and security blast radius (07). Each chapter ended with a gate that resolves its pressure.

What still breaks is that the evidence underneath the arc is contested in ways no single gate settles. Whether AI makes developers faster depends on who, on what code, measured how — and the best studies disagree. The tools change faster than the studies can evaluate them, so every measured result is about a model generation that's already obsolete. And practitioners routinely do things the textbook (including this module) would caution against, and ship fine. A senior engineer has to hold the clean rules and the messy evidence at once.

This chapter answers three things: where the evidence on AI-for-SDLC is genuinely contested (and how to reason when good studies disagree), where the hype outruns what's measured (and how to tell), and what works empirically without a clean theory — the practices that violate the advice and succeed anyway.

What this file solves

You've learned a set of gates that each resolve a pressure cleanly. Then leadership shows you two studies that contradict each other, a vendor claims a 55% productivity gain, and a senior engineer on your team ships fast by doing the opposite of what file 02 advised — and it works. This file gives you the concrete move: learn to read a contested AI-productivity claim by asking who, what, measured how, and what generation of tool; recognize the gap between a vendor's headline number and a measured outcome; and treat the module's gates as defaults you apply with judgment, not laws — knowing which ones bend in practice and why.

Why the clean story is cleaner than the evidence

Set the two studies side by side, because the contradiction is the whole chapter.

DORA 2025 (thousands of orgs, survey + delivery metrics):
  AI adoption now correlates with HIGHER delivery throughput.
  (And still LOWER stability — the amplifier, file 06.)
  → "AI helps teams ship more."

METR 2025 (RCT, 16 expert OSS devs, 246 tasks, their own large repos):
  Developers were 19% SLOWER with early-2025 AI tools.
  They predicted a 24% speedup; even afterward, they believed they were 20% faster.
  → "AI slows experienced developers on familiar code."

Same year. Same tool generation. Opposite headlines. Both rigorous.

A junior reader picks the study that confirms their prior and dismisses the other. A senior reader asks what's different about them, and the difference dissolves the contradiction. DORA measures broad delivery throughput across many orgs, skill levels, and codebases, including unfamiliar code, where AI's boilerplate-and-lookup leverage is large. METR measures expert developers on repos they know intimately, where the marginal value of generation is small and the cost of reviewing and correcting AI output is high, so the net is negative. They don't contradict; they measure different points on the same surface.

So the real lesson is not "the evidence is a mess, ignore it." It is that AI's effect on productivity is not a single number — it depends heavily on developer expertise, code familiarity, task type, and tool generation — so two rigorous studies measuring different points produce opposite headlines without either being wrong. The contradiction is real only if you expect one number. There isn't one.

So how do we reason about a contested claim instead of just picking the study that flatters our prior?

The naive read: pick the study that confirms what you already believe

The reflex when studies disagree is to treat it as a debate with a winner. The AI optimist cites DORA and waves away METR ("only 16 developers, niche OSS work"). The AI skeptic cites METR and waves away DORA ("self-reported survey, correlation not causation"). Both are doing the same thing: using the methodology critique selectively, against the study they dislike and not the one they like.

The break is that this produces a confident conclusion that's wrong in the cases that matter. The optimist rolls AI out to a team of senior engineers on a mature, deeply-understood monolith — the METR population exactly — and is baffled when throughput doesn't move and seniors complain the tool slows them down. The skeptic blocks AI for a team doing greenfield CRUD and integration glue — the DORA-leverage population exactly — and forfeits a real speedup. Each picked the study that matched their prior and applied it to the population where the other study was right.

Belief picked the study      Applied to the population where...      Result
  optimist → DORA              ...METR was right (experts, familiar)   no gain, friction
  skeptic  → METR              ...DORA was right (broad, unfamiliar)   forfeited leverage

So the real cause is not "one study is flawed." It is that the question "does AI help?" is under-specified — it has no answer without specifying who, on what code, doing what task, with which tool generation — so any single study is a point estimate misapplied when you generalize it past its population. The fix isn't to find the "right" study; it's to locate your situation on the surface and read the study that measured near it.

So how do we read a contested claim to extract what it actually supports, instead of what we want it to?

When the same tool helps one developer and slows another

Here is the smallest version of the whole problem, on two developers in the same org.

Priya — new to the codebase, building a standard REST endpoint:
  AI generates the boilerplate, the validation, the test scaffold.
  She'd have spent an hour looking up conventions. Net: clearly faster.
  → the DORA-leverage case.

Sam — 5 years on this module, doing a subtle refactor of its core logic:
  AI suggests plausible code he must read carefully and mostly correct;
  the review-and-fix cost exceeds the typing it saved. Net: slower.
  → the METR-slowdown case.

Same tool, same day, same org. Opposite net effect. The variable is
expertise × code familiarity × task type, not the tool.

The tool didn't change between Priya and Sam. The net effect flipped because they sit at different points on the surface — Priya where generation leverage is high and review cost is low, Sam where generation leverage is low and review cost is high. A policy that says "AI helps" or "AI hurts" is wrong for one of them. The honest policy says "AI helps here and not there," and names the where.

Rule: "does AI help?" has no answer without specifying the conditions

The load-bearing truth of this chapter: AI's effect on software productivity is conditional, not constant — it depends on developer expertise, code familiarity, task type, foundation strength, and tool generation — so any claim, study, or vendor number is only valid for the conditions it measured, and generalizing it past them is the error. The amplifier rule is one face of this (AI multiplies the foundation already present); the expertise-and-familiarity surface is another. There is no context-free productivity number, and the moment someone reports one, they've dropped the conditions that make it meaningful.

Why a single number can't exist here. The primitive is that productivity is a surface, not a scalar: effect = f(expertise, familiarity, task, foundation, tool generation). The variables do not all push the same way: more expertise on familiar code can mean less benefit (Sam), while a stronger foundation means more benefit (the amplifier). The constraint that breaks the naive "AI helps/hurts" claim is that any study or vendor stat fixes most of those variables and varies one, producing a point estimate valid only at that point. Reporting it as "AI's productivity effect" drops the conditions. That is the same vanity-metric move as file 06, now applied to evidence: a number detached from the conditions that ground it. The fix is to always carry the conditions, and to locate your own situation on the surface before applying any study.


1) Reading a contested claim — the who/what/how/when filter

The mechanism for reasoning under contested evidence is a four-question filter you run on any AI-productivity claim before believing or dismissing it.

WHO:   what developers? (juniors / experts / mixed; familiar or new to the code)
WHAT:  what work? (greenfield boilerplate / subtle refactor / debugging / glue)
HOW:   measured how? (RCT on delivery time / survey self-report / vanity metric /
                       delivery outcomes with a baseline and control)
WHEN:  which tool generation? (the field moves quarterly; a 2024 result may not
                               hold for a 2026 model)

Run it on the two studies. METR — WHO: expert OSS devs on their own repos; WHAT: real issues on mature code; HOW: RCT on wall-clock completion time (the gold standard for causation); WHEN: early-2025 tools. DORA — WHO: thousands of mixed orgs; WHAT: general delivery; HOW: survey + delivery correlation (broad, but correlational); WHEN: 2025. Now the contradiction is legible: METR is a causal result on the hardest population for AI (experts, familiar code), DORA is a correlational result across the broadest population. Neither is wrong; they answer different questions, and the filter tells you which one applies to your team.
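The filter can be sketched as a small record type, purely as an illustration. The `Claim` class and `missing_conditions` function are hypothetical names introduced here, not part of any tool the module describes; the METR field values are copied from the paragraph above, and the bare vendor headline is deliberately left without conditions.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class Claim:
    """An AI-productivity claim plus the four conditions the filter asks for."""
    headline: str
    who: Optional[str] = None   # which developers, and how familiar with the code
    what: Optional[str] = None  # what kind of work
    how: Optional[str] = None   # how the effect was measured
    when: Optional[str] = None  # which tool generation


def missing_conditions(claim: Claim) -> list[str]:
    """Return the filter questions the claim leaves unanswered."""
    return [
        f.name.upper()
        for f in fields(claim)
        if f.name != "headline" and getattr(claim, f.name) is None
    ]


metr = Claim(
    headline="experienced developers were 19% slower",
    who="expert OSS devs on their own repos",
    what="real issues on mature code",
    how="RCT on wall-clock completion time",
    when="early-2025 tools",
)
bare = Claim(headline="developers are 55% faster")

print(missing_conditions(metr))  # []
print(missing_conditions(bare))  # ['WHO', 'WHAT', 'HOW', 'WHEN']
```

An empty list only means the claim names its conditions; whether those conditions match your team is still a judgment the filter leaves to you.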

Teacher voice. Notice the filter is the same discipline as the whole module, na — file 06 taught "measured at the tool or at the outcome?", and HOW is exactly that question applied to a study. A vendor's "55% faster" almost always fails HOW (it's a self-report or a vanity metric) and WHEN (it's last generation). A claim survives the filter only when it names its conditions; the moment it claims a context-free number, it's the evidence version of a vanity metric.

For Meridian, the filter is what turns "two studies disagree" into a rollout policy: AI to the Priya population (new-to-code, standard work) where DORA's leverage is real, lighter-touch for the Sam population (experts on familiar core logic) where METR's slowdown bites, and re-evaluate every tool generation because WHEN keeps moving.

2) The productivity-surface mental model — picture before the policy

This is the core mental model of the chapter. Keep it as the canonical ASCII image: AI's net effect is a surface over expertise and code familiarity, and the two studies measure opposite corners of it.

                     CODE FAMILIARITY
              new to the code        deeply familiar
            ┌──────────────────────┬───────────────────────┐
   junior   │  BIG WIN             │  MODERATE WIN         │
   dev      │  boilerplate, lookup │  AI fills gaps, but   │
            │  AI does what they'd │  they know less to    │
            │  google → fast       │  correct → still net+ │
            │  ◀ DORA's strong     │                       │
            │    leverage corner   │                       │
            ├──────────────────────┼───────────────────────┤
   senior   │  WIN                 │  SLOWDOWN             │
   dev      │  unfamiliar area,    │  knows it cold; AI    │
            │  AI accelerates      │  output costs more to │
            │  orientation         │  review than it saves │
            │                      │  ◀ METR's slowdown    │
            │                      │    corner (−19%)      │
            └──────────────────────┴───────────────────────┘

  There is no single "AI productivity" number — only a position on this surface.
  Vendor headlines quote the top-left corner as if it were the whole surface.

The whole danger of the hype is quoting the top-left corner — junior developer, unfamiliar code, big win — as the universal number, then applying it to the bottom-right, where a senior engineer on familiar code is actually slowed. The surface also has a third axis the diagram can't show — foundation strength (the amplifier rule) — which scales the whole surface up or down: a strong-foundation org sees bigger wins and smaller slowdowns everywhere. Meridian's honest policy reads its own position on the surface per team, instead of quoting a corner as a constant.
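As a rough sketch, the two-axis picture can be written down as a lookup table. The labels are copied from the diagram and carry no magnitudes; the enum names, `expected_effect`, and `applies_to` are hypothetical, and the third axis (foundation strength) is left out because the text gives it no discrete values.

```python
from enum import Enum


class Seniority(Enum):
    JUNIOR = "junior"
    SENIOR = "senior"


class Familiarity(Enum):
    NEW = "new to the code"
    FAMILIAR = "deeply familiar"


# Qualitative labels taken from the diagram; they are directions, not numbers.
SURFACE = {
    (Seniority.JUNIOR, Familiarity.NEW): "big win",
    (Seniority.JUNIOR, Familiarity.FAMILIAR): "moderate win",
    (Seniority.SENIOR, Familiarity.NEW): "win",
    (Seniority.SENIOR, Familiarity.FAMILIAR): "slowdown",
}


def expected_effect(seniority: Seniority, familiarity: Familiarity) -> str:
    """Look up the qualitative effect for one position on the surface."""
    return SURFACE[(seniority, familiarity)]


def applies_to(study_corner, workload_corner) -> bool:
    """A result transfers only if it was measured at the workload's corner."""
    return study_corner == workload_corner


vendor_corner = (Seniority.JUNIOR, Familiarity.NEW)   # the top-left headline
sam_corner = (Seniority.SENIOR, Familiarity.FAMILIAR)  # Sam, 5 years on the module

print(expected_effect(*sam_corner))           # slowdown
print(applies_to(vendor_corner, sam_corner))  # False
```

The strict equality in `applies_to` is a simplification: real workloads sit between corners, so treat a mismatch as a prompt to measure, not as a verdict.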

3) Meridian faces the contested evidence — the running example

Meridian's leadership, holding the file-06 measured result (throughput up vs control, but change-fail and rework tripped the guardrails), now has to decide rollout policy with the two studies on the table. Watch the two approaches.

Attempt A — pick a study, set one org-wide policy

The optimist VP cites DORA: "AI raises throughput, roll it out hard everywhere,
                             mandate high acceptance rates."
Result: the senior platform team (Sam's population) sees no throughput gain and
        rising rework — exactly METR's corner. They resent the mandate. The
        greenfield team's real gain gets averaged away by the platform team's
        slowdown, so the org-wide number looks flat and nobody can explain why.

Attempt B — locate each team on the surface, tier the policy

Apply the who/what/how/when filter per team, plus the file-06 measured baseline:

  Greenfield / integration team (new-ish code, standard work):
    → DORA-leverage corner. AI on, light gates, expect real throughput gain.
  Platform / core-logic team (experts, deeply familiar code):
    → METR-slowdown corner. AI available, NOT mandated; let them choose where it
      helps (orientation, tests, docs) and skip it on core refactors.
  All teams: keep the file-06 guardrails (change-fail, rework) and re-measure each
    tool generation, because WHEN keeps moving.

Result: greenfield throughput rises (and holds, with the gates), platform team
        keeps their pace and adopts AI selectively, the org-wide number stops being
        a meaningless average because it's reported per-surface-position.
        The change-fail guardrail (still tripped) points at the test gate (file 04),
        not at "AI is bad."

The studies didn't change between A and B. Meridian stopped treating "does AI help?" as one question with one answer and located each team on the surface, applying the leverage where it's real and not mandating it where the evidence says it slows people down. The org-wide average — the thing Attempt A optimized — was always a meaningless blend of opposite corners; reporting per-position is what makes the decision sound.

Teacher voice. See the discipline, na — this is the file-06 control cohort lesson again. An org-wide "AI productivity" number is an average over the whole surface, and averaging the big-win corner with the slowdown corner gives you a flat number that tells you nothing. You have to segment by position on the surface, the same way you segmented the AI cohort from the control. The contested evidence isn't a reason to give up measuring; it's a reason to measure per condition.
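The averaging problem is easy to see with arithmetic. The sketch below uses made-up headcounts and effect sizes, chosen only so that the blend comes out flat; none of these figures are Meridian's or any study's, and `Segment` and `org_wide_average` are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    name: str
    engineers: int
    effect_pct: float  # throughput change vs control, in percent


def org_wide_average(segments) -> float:
    """Headcount-weighted mean effect across all segments."""
    total = sum(s.engineers for s in segments)
    return sum(s.engineers * s.effect_pct for s in segments) / total


# Hypothetical numbers, picked to illustrate the blending and nothing else.
segments = [
    Segment("greenfield / integration", engineers=30, effect_pct=24.0),
    Segment("platform / core logic", engineers=40, effect_pct=-18.0),
]

print(f"org-wide: {org_wide_average(segments):+.1f}%")  # org-wide: +0.0%
for s in segments:
    print(f"  {s.name}: {s.effect_pct:+.1f}%")
```

The single org-wide line reads as "no effect" while both segments moved substantially in opposite directions, which is why the per-segment rows are the ones worth reporting.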

4) Why a conditional policy, not a single mandate or a single ban

The plausible alternatives are a single org-wide mandate (AI everywhere, the optimist) and a single org-wide ban or heavy restriction (the skeptic). Why a conditional, surface-aware policy under Meridian's mixed workload?

A single mandate applies the top-left-corner result to the bottom-right population and gets the METR slowdown for the senior teams, plus the resentment of being measured on a tool that demonstrably slows them — and it averages away the real greenfield gain into a flat org number. A single ban applies the bottom-right result to the top-left population and forfeits the large, real leverage on greenfield and unfamiliar-code work, and (per file 07) gets routed around into ungoverned shadow usage. Both are the same error: treating a conditional surface as a constant, then setting one policy for the whole surface.

A conditional policy reads each team's position (expertise, familiarity, task type, foundation) and applies AI where the surface says it helps, lightly where it doesn't, with the file-06 guardrails everywhere and a re-evaluation cadence because the tool generation keeps moving. Under a workload of mixed teams (the reality of any 200-engineer org), only the conditional policy matches the evidence; a single mandate or ban is wrong for whichever corner of the surface it ignores. The cost is that "it depends" is harder to communicate to leadership than one number, but the honest answer to "does AI help?" is "it depends, and here's the map."

5) The property that changes the design: how fast the ground moves under you

If you change one thing about how you reason here, change this: the design variable is the half-life of the evidence. A result about a 2024 model generation may not hold for a 2026 one, because the tools change faster than studies can evaluate them — the METR result is explicitly about early-2025 tools, and its authors later revised their experiment design as tools improved. A finding in this field has a short half-life, so a conclusion is a snapshot, not a constant.

Slow-moving (long half-life — trust the result longer):
  DORA's amplifier rule (foundation decides whether change volume → throughput
    or instability) — structural, not tool-specific.
  Goodhart on vanity metrics; the leverage-rework tradeoff; the grounding gap.
  → these are about systems and incentives, not model capability.

Fast-moving (short half-life — re-check every generation):
  "Experienced devs are 19% slower" — about a specific tool generation.
  Vendor capability claims; acceptance rates; which model is best at code.
  → re-evaluate as the tools change; today's slowdown may be tomorrow's win.

The durable findings are the structural ones — the amplifier rule, Goodhart, the tradeoffs, the grounding gap — because they're about systems and incentives, which don't change when the model does. The fragile findings are the capability ones — speed, slowdown, which tool is best — because they're about a model generation that's already being replaced. Meridian trusts the structural lessons for years and re-checks the capability claims every quarter; conflating the two (treating a capability snapshot as a structural law, or dismissing a structural law because a capability number moved) is the error.
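One way to make the two half-lives operational is a re-check rule keyed on the kind of finding. This is a minimal sketch under stated assumptions: the 90-day interval stands in for "every quarter" from the text, the two-year interval is an arbitrary placeholder for "years", and `Finding`, `Kind`, and `needs_recheck` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum


class Kind(Enum):
    STRUCTURAL = "structural"  # systems and incentives
    CAPABILITY = "capability"  # tied to a tool generation


# Assumed review intervals; tune these to your own cadence.
REVIEW_INTERVAL = {
    Kind.CAPABILITY: timedelta(days=90),
    Kind.STRUCTURAL: timedelta(days=2 * 365),
}


@dataclass(frozen=True)
class Finding:
    summary: str
    kind: Kind
    last_checked: date


def needs_recheck(finding: Finding, today: date) -> bool:
    """True once the finding is older than the interval for its kind."""
    return today - finding.last_checked > REVIEW_INTERVAL[finding.kind]


today = date(2026, 1, 15)  # fixed date so the example is reproducible
slowdown = Finding("experts 19% slower", Kind.CAPABILITY, date(2025, 7, 1))
amplifier = Finding("AI amplifies the foundation", Kind.STRUCTURAL, date(2025, 7, 1))

print(needs_recheck(slowdown, today))   # True
print(needs_recheck(amplifier, today))  # False
```

Both findings were last checked on the same day; only the capability one is flagged, which is the whole point of keeping the two kinds separate.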

6) One failure walked through: the vendor number that didn't survive the filter

Trace the canonical hype failure end to end.

1. A vendor publishes "developers are 55% faster with our AI tool." Leadership wants
   to adopt org-wide and project a 55% capacity gain into the roadmap.
2. Run the filter. HOW: the 55% is a controlled exercise — developers writing a
   from-scratch HTTP server, measured on task-completion time, not delivery. WHO:
   recruited developers, not the team's seniors on the real codebase. WHAT: a
   greenfield, well-specified, boilerplate-heavy task — the top-left corner. WHEN:
   last model generation.
3. The number is real *for that condition* and meaningless as a capacity projection:
   Meridian's actual work is mostly maintenance on a familiar codebase by experienced
   engineers — the bottom-right corner, where the measured effect is near zero or
   negative.
4. Leadership, projecting 55% into the roadmap, commits to deadlines assuming a
   capacity gain that won't materialize on this work. The team misses them.
5. The post-mortem blames the team or the tool. The real fault: a top-left-corner
   number applied to a bottom-right-corner workload, with the conditions dropped.

Where did the system fail? Not at the vendor study — 55% may be honest for a greenfield HTTP server. It failed at generalization: the number was lifted off its conditions and projected onto a workload at the opposite corner of the surface. The hype isn't usually a lie; it's a real number from the favorable corner, quoted as if it were the whole surface. The filter catches it at step 2 — HOW and WHAT immediately show it's a top-left measurement — before it becomes a roadmap commitment.

The fix is the rule: every productivity number carries its conditions, and you locate your own workload on the surface before applying anyone's number to it.
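The roadmap failure in steps 3 and 4 comes down to a short calculation. The sketch below assumes a baseline of 100 units of work per quarter, which is a hypothetical figure, and follows the walkthrough in treating "55% faster" as a 55% capacity gain; `projected_capacity` and `shortfall` are illustrative helpers.

```python
def projected_capacity(baseline: float, effect_pct: float) -> float:
    """Capacity after applying a percentage effect to the baseline."""
    return baseline * (1 + effect_pct / 100)


def shortfall(baseline: float, assumed_pct: float, actual_pct: float) -> float:
    """Work committed on the assumed effect that the actual effect cannot deliver."""
    return projected_capacity(baseline, assumed_pct) - projected_capacity(baseline, actual_pct)


baseline = 100.0  # hypothetical units of work per quarter

planned = projected_capacity(baseline, 55.0)  # the roadmap's assumption
gap_flat = shortfall(baseline, 55.0, 0.0)     # effect near zero on this workload
gap_slow = shortfall(baseline, 55.0, -19.0)   # effect negative on this workload

print(f"planned capacity: {planned:.1f}")
print(f"shortfall if the effect is flat: {gap_flat:.1f}")
print(f"shortfall if the effect is -19%: {gap_slow:.1f}")
```

Even in the flat case, roughly a third of the committed work has no capacity behind it, and nothing about the team or the tool changed to cause that.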

7) Cost movement — what reasoning under contested evidence buys and bills

What changes                                Direction     Concrete (Meridian)                                Who absorbs it
Decision quality under conflicting studies  rises         "pick a study" → locate on surface                 leadership
Over-projection from hype                   falls         55% not projected onto maintenance work            the roadmap
Communication difficulty                    rises         "it depends" is harder than one number             whoever briefs leadership
Wasted mandates / bans                      avoided       no org-wide mandate on the slowdown corner         senior teams
Re-evaluation cadence                       new, ongoing  re-check capability claims each generation         platform team
Confidence calibration                      improves      structural lessons trusted, capability re-checked  the whole org

The pressure relieved is mis-generalization: applying a study or vendor number past the conditions it measured. The pressure created is communication difficulty (the honest answer is "it depends," absorbed by whoever has to brief leadership) and a re-evaluation cadence (capability claims re-checked each tool generation, absorbed by the platform team). The trade is strongly positive because a single over-projection (the 55% roadmap commitment) costs more than a quarter's worth of re-evaluation work, and because a conditional policy captures the real leverage where it exists instead of mandating or banning across the whole surface.

Mini-FAQ. "If the evidence keeps changing, why measure at all — won't the result be obsolete next quarter?" Because the structural findings don't expire — the amplifier rule, Goodhart, the leverage-rework tradeoff, and the grounding gap hold across tool generations, and your own baseline-and-control measurement (file 06) tells you what's true for your org on this generation. What expires is the capability snapshot, which is exactly why you re-measure rather than trust a stale study. Measuring is how you stay current as the ground moves; not measuring is how you end up acting on a 2024 number in 2026.

8) Signals — healthy, first to degrade, misleading, expert's graph

Healthy: rollout policy segmented by position on the surface (team expertise × code familiarity × task type), guardrails (change-fail, rework) holding per segment, capability claims re-evaluated each tool generation, and structural lessons (amplifier, Goodhart) treated as durable. The org reports per-condition, not a single context-free number.

First metric to degrade: the spread between segments collapsing into a single org-wide average. When leadership starts quoting "our developers are X% more productive" as one number, the conditions have been dropped and the average is blending opposite corners of the surface — the same vanity-metric move as file 06, now at the evidence layer. It degrades before any bad decision, because the single number is what enables the over-projection.

The misleading metric everyone watches: vendor capability headlines and self-reported speedups. The METR perception gap is the sharpest warning — self-report said +20% while measurement said −19%, a 39-point error in the most convincing signal. Any number that fails the HOW filter (self-report, vanity metric, no baseline) or the WHEN filter (stale generation) is misleading regardless of how authoritative it sounds.
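The 39-point figure is just the distance between the believed and measured effects, and the more important property is the sign flip. A tiny sketch, with both values expressed as percent speedup so that "19% slower" becomes -19; the function names are hypothetical.

```python
def perception_gap(believed_pct: float, measured_pct: float) -> float:
    """Distance in percentage points between felt and measured speedup."""
    return believed_pct - measured_pct


def directionally_wrong(believed_pct: float, measured_pct: float) -> bool:
    """True when belief and measurement have opposite signs."""
    return believed_pct * measured_pct < 0


print(perception_gap(20, -19))       # 39
print(directionally_wrong(20, -19))  # True
print(directionally_wrong(20, 5))    # False
```

A self-report that is merely too optimistic (the last line) is an error of size; the METR case is an error of direction, which is why it disqualifies self-report as a stand-in for measured delivery.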

The graph an expert opens first: the org's own measured effect (file 06: DORA outcomes vs control) segmented by team position on the surface, with the tool generation labeled. Healthy looks like real gains in the leverage corner, flat-to-slight in the slowdown corner, and guardrails holding — not a single org-wide line. The danger signal is a single flat org-wide number that hides a real gain and a real slowdown averaging to nothing.

9) Boundary of applicability — where the module's gates hold, where they bend

Strong fit: the module's gates (human spec, deterministic review, mutation-score testing, grounded copilots, paired metrics, blast-radius governance) hold robustly as defaults across most teams and tool generations, because they're structural — they're about incentives and systems, not model capability, so they have a long half-life.

Where the textbook bends in practice: experienced practitioners on small, low-blast-radius work routinely skip the ceremony and ship fine. A senior engineer prototyping a throwaway script lets the AI run with minimal review and no spec (violating file 02) because the blast radius is near zero and the rework is cheap — the file-07 logic says oversight scales with blast radius, so low blast radius earns low ceremony. Teams ship AI-generated code with coverage gates and no mutation testing (violating file 04) and survive, because their change volume is low enough that the hollow-test risk hasn't bitten yet. These aren't refutations of the rules; they're the rules applied with judgment — the blast radius is genuinely low, so the gate is genuinely optional. The error is generalizing the shortcut to high-blast-radius work.
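The "gate bends only where blast radius is low" judgment can be sketched as a guard. This is an illustration under assumptions, not a policy the module defines: the two-level `BlastRadius` enum is a simplification, and the set of skippable gates contains only the two the paragraph above shows being skipped (the human spec and mutation-score testing).

```python
from enum import Enum


class BlastRadius(Enum):
    LOW = "low"    # throwaway script, cheap rework
    HIGH = "high"  # anything where a wrong change is costly


# Only the two gates the text shows practitioners skipping; an assumption
# for illustration, not an exhaustive or official list.
SKIPPABLE_WHEN_LOW = frozenset({"human spec", "mutation-score testing"})


def may_skip(gate: str, blast_radius: BlastRadius) -> bool:
    """A gate is optional only for low-blast-radius work, and only if listed."""
    return blast_radius is BlastRadius.LOW and gate in SKIPPABLE_WHEN_LOW


print(may_skip("human spec", BlastRadius.LOW))            # True
print(may_skip("human spec", BlastRadius.HIGH))           # False
print(may_skip("deterministic review", BlastRadius.LOW))  # False
```

The second line is the shortcut-generalization error made explicit: the same shortcut that is fine on a throwaway script is refused once the blast radius is high.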

Scale/workload that breaks naive intuition: the intuition "more AI is more productive" breaks at the expert-on-familiar-code corner, and "the evidence will settle soon" breaks because the tools move faster than the studies. The deepest counterintuitive result in the module, the METR perception gap, is the boundary on trusting your own sense of speed: for an expert on a familiar codebase, the developer's felt productivity is not just imprecise but directionally wrong. Measure; never trust the feeling, especially a strong one.

10) Wrong assumption: "the evidence will settle and give us the answer"

The seductive belief is that this is early days, the studies are noisy, and soon a definitive answer will emerge — AI helps by X%, settled. That answer will never come, because the question is conditional and the tools keep moving: there is no single X, and whatever X you measure is about a generation that's already being replaced.

Replace the wrong belief with: there is no context-free answer coming — AI's effect is a moving surface over expertise, familiarity, task, foundation, and tool generation, so the durable skill is reading conditions and re-measuring, not waiting for a verdict. The structural lessons (amplifier, Goodhart, the tradeoffs) are as settled as they'll get and you can trust them now; the capability numbers will keep moving and you re-check them. The "waiting for the verdict" posture is the chapter's memory hook for what not to do: it leaves you acting on stale numbers while the ground moves, when the right move is to measure your own org continuously and reason per-condition.

11) Other failure shapes to recognize

  • Confirmation-study picking. Citing the study that matches your prior and applying the methodology critique only to the one you dislike.
  • Corner-as-constant. Quoting a favorable-corner result (junior, greenfield, boilerplate) as the universal productivity number.
  • Stale-generation claim. Trusting a capability result about a tool generation that's already been replaced (failing the WHEN filter).
  • Self-report substitution. Believing "developers feel faster" over measured delivery — the METR inversion.
  • Average-over-the-surface. Reporting one org-wide productivity number that blends the big-win and slowdown corners into a meaningless mean.
  • Structural/capability conflation. Treating a capability snapshot as a durable law, or dismissing a durable structural law because a capability number moved.
  • Shortcut generalization. Taking a low-blast-radius practitioner shortcut (skip the spec, skip mutation testing) and applying it to high-blast-radius work.
  • Verdict-waiting. Deferring measurement because "the evidence isn't settled," and acting on stale numbers in the meantime.
  • Hype-cycle whiplash. Swinging from "AI changes everything" to "AI is useless" as headlines flip, instead of holding the conditional surface steady.

12) Pattern transfer — where this pressure recurs

  • The conditional-surface idea is the amplifier rule (file 06) generalized: AI's effect isn't constant, it's a function of conditions (foundation, expertise, familiarity), so any single number drops the conditions that make it meaningful — the same shape as a vanity metric, now applied to evidence and studies.
  • The who/what/how/when filter is the file-06 "tool-side or outcome?" question expanded into reading research: HOW asks where the study measured (RCT on delivery vs self-report vs vanity), exactly the distance-from-outcome axis applied to evidence.
  • The grounding gap recurs at the meta level: a productivity claim detached from its measured conditions is the same fluent-but-ungrounded failure as an uncited incident summary (file 05) — confident, plausible, and unsupported the moment you ask for the conditions behind it.
  • The blast-radius judgment (file 07) is what justifies the practitioner shortcuts: oversight scales with what a wrong action breaks, so low-blast-radius work earns low ceremony — the gates bend exactly where the blast radius is genuinely small, which is judgment, not violation.

13) Design test — five questions before believing an AI-productivity claim

  1. WHO, WHAT, HOW, WHEN — does the claim name its conditions, or quote a context-free number?
  2. Which corner of the surface (expertise × familiarity × task) did it measure, and does my workload sit at the same corner?
  3. Is it a structural finding (long half-life, trust it) or a capability snapshot (short half-life, re-check this generation)?
  4. Is the number measured at the outcome (RCT on delivery, with a control) or self-reported / vanity (fails HOW)?
  5. Am I about to generalize a favorable-corner or low-blast-radius result onto a workload where the opposite corner applies?

Where this appears in production

  • DORA 2024 / 2025 reports — the broad-population correlational evidence; throughput now up, stability still at risk, AI as amplifier — the structural, long-half-life findings.
  • METR early-2025 RCT — the causal evidence that experts on familiar code were 19% slower, with a +39-point perception gap; the bottom-right corner of the surface, and the warning against self-report.
  • METR experiment-design revision — the same team updating its methodology as tools improved, the clearest signal that capability findings have a short half-life.
  • GitClear AI code quality research — evidence of rising code churn and clone rates with AI assistance; the maintainability cost the throughput headlines omit.
  • GitHub / vendor "X% faster" studies — controlled greenfield exercises (e.g., writing a server from scratch); real for the top-left corner, misleading when projected onto maintenance work — the hype failure walked through above.
  • Stanford / academic AI-productivity studies — mixed results segmented by task and developer level; evidence for the surface, not a single number.
  • Atlassian / DX developer-experience surveys — self-reported productivity and friction; useful for the SPACE satisfaction dimension, vulnerable to the perception gap if read as delivery.
  • DX Core 4 benchmarks — outcome metrics segmented across 300+ orgs; the per-condition reporting this chapter argues for.
  • Practitioner blogs (Simon Willison, Pragmatic Engineer, Thoughtworks) — credible accounts of where the tools help and where the ceremony bends in real work; the empirical-without-clean-theory layer.
  • Internal eng-productivity teams (Google, Microsoft, Stripe) — orgs running their own baseline-and-control measurement because no external study answers "does it help us, now."
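The +39-point perception gap cited for the METR RCT in the list above is plain arithmetic on the two figures this chapter quotes: developers believed they were 20% faster and were measured 19% slower. A minimal sketch, with `perception_gap` as a hypothetical helper name:

```python
def perception_gap(believed_pct: float, measured_pct: float) -> float:
    """Gap in percentage points between believed and measured speedup.

    Positive inputs mean faster, negative inputs mean slower.
    """
    return believed_pct - measured_pct


# Figures this chapter quotes for the METR RCT:
# believed 20% faster, measured 19% slower (a speedup of -19).
gap = perception_gap(believed_pct=20, measured_pct=-19)
print(f"perception gap: {gap:+} points")
```

The same subtraction applies to any survey you run internally: a self-reported number only becomes evidence once it is set against a measured one.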

Pause and recall

  1. Why do DORA and METR reach opposite-sounding conclusions without either being wrong?
  2. What are the four filter questions, and what does each catch?
  3. Why does "does AI help?" have no answer without specifying conditions?
  4. Sketch the productivity surface — which corner do vendor headlines quote, and which does METR measure?
  5. What's the difference between a structural finding and a capability snapshot, and how should you treat each?
  6. Why is a single org-wide productivity number misleading even when it's measured honestly?
  7. Where do the module's gates legitimately bend in practice, and what justifies the bend?
  8. Why will "the evidence settling into one answer" never happen?

Interview Q&A

Q1. Two rigorous 2025 studies say AI raises throughput (DORA) and slows experienced developers 19% (METR). Which is right?

A. Both: they measure different points on a surface. DORA is a broad correlational result across mixed orgs and skill levels and unfamiliar code, where AI's boilerplate-and-lookup leverage is large. METR is a causal RCT on expert developers on their own mature repos, where generation value is small and review-and-correct cost is high, so the net is negative. The question "does AI help?" has no single answer; you locate your workload on the surface and read the study that measured near it.

Common wrong answer to avoid: "METR's RCT is more rigorous, so AI slows developers." METR is rigorous for its population (experts, familiar code); generalizing it to greenfield or junior work is the same error as quoting a vendor's favorable corner as universal.

Q2. A vendor reports developers are 55% faster. How do you use that number?

A. Run the who/what/how/when filter first. It's almost always a greenfield, well-specified, boilerplate-heavy task (WHAT), measured on completion time in a controlled exercise (HOW), on recruited developers (WHO), on a past tool generation (WHEN): the top-left corner of the surface. It's real for that condition and meaningless projected onto a maintenance workload by experienced engineers. Never project a favorable-corner number onto a different corner as a capacity gain.

Common wrong answer to avoid: "55% faster means we can cut the roadmap by half." That lifts the number off its conditions and applies it to the opposite corner; the gain won't materialize on familiar-code maintenance work.

Q3. Your senior platform team says AI slows them down; your greenfield team loves it. Whose experience is the truth?

A. Both, because they sit at opposite corners of the surface: experts on deeply familiar core logic (METR's slowdown corner) versus newer engineers on standard greenfield work (DORA's leverage corner). The right policy isn't to pick one; it's to apply AI where the surface says it helps (greenfield, light gates) and not mandate it where it slows people (expert core refactors), with the guardrails everywhere. An org-wide mandate or ban is wrong for one of the two teams.

Common wrong answer to avoid: "The seniors are just resistant to change." Their measured experience matches the strongest causal study; the slowdown is real at their corner, not resistance.

Q4. A senior ships an AI-generated script with no spec and minimal review, violating file 02 and file 03. Are they wrong?

A. Not necessarily: the gates scale with blast radius (file 07), and a throwaway script is near-zero blast radius with cheap rework, so low ceremony is the rule applied with judgment, not a violation. The error would be generalizing that shortcut to high-blast-radius work (auth, IaC, a customer-facing service), where the spec and review gates earn their cost. Judge the bend by the blast radius, not by whether it matches the textbook.

Common wrong answer to avoid: "Rules are rules, they violated the process." The module's own logic makes oversight conditional on blast radius; a low-blast-radius shortcut is the rule working, not breaking.
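The idea that ceremony scales with blast radius can be written down as a lookup. This is a sketch under stated assumptions: the three tiers, the `MEDIUM` tier in particular, and the `required_gates` helper are hypothetical; the module itself only contrasts a throwaway script with high-blast-radius work such as auth, IaC, or a customer-facing service.

```python
from enum import IntEnum


class BlastRadius(IntEnum):
    LOW = 0     # throwaway script: cheap rework, nothing downstream breaks
    MEDIUM = 1  # hypothetical middle tier, not defined in this module
    HIGH = 2    # auth, IaC, a customer-facing service


def required_gates(radius: BlastRadius) -> list[str]:
    """Map blast radius to ceremony. The tier boundaries are assumptions."""
    if radius == BlastRadius.LOW:
        return ["author's own check"]
    gates = ["review gate (file 03)"]
    if radius == BlastRadius.HIGH:
        # High-blast-radius work also needs the human spec up front.
        gates.insert(0, "human spec (file 02)")
    return gates


for radius in BlastRadius:
    print(radius.name, "->", required_gates(radius))
```

Writing the mapping down makes the shortcut auditable: the senior's script passes because it was classified `LOW`, not because the gates were ignored.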

Q5. How long should we trust a study that says AI helps (or hurts) by some percentage?

A. Depends on whether it's structural or capability. Structural findings (the amplifier rule, Goodhart on vanity metrics, the leverage-rework tradeoff, the grounding gap) have a long half-life because they're about systems and incentives, not model capability; trust them across generations. Capability findings (speed, slowdown, which tool is best) have a short half-life because the tools change quarterly; re-check them each generation against your own baseline-and-control measurement.

Common wrong answer to avoid: "A rigorous study's result holds indefinitely." Capability results are about a tool generation that's already being replaced; the METR team itself revised its design as tools improved.

Q6. Our org-wide productivity number is flat after the rollout. Is AI not working, or is this a file-06 measurement problem or a file-08 evidence problem? (cumulative)

A. Likely a file-08 problem surfacing through file-06 mechanics: a single org-wide number averages the big-win corner and the slowdown corner into a meaningless flat mean. Segment by position on the surface (file 08) using the file-06 baseline-and-control method, and you'll likely find a real greenfield gain and a real expert-on-familiar slowdown canceling out. The fix is to report per-condition, not to conclude "AI doesn't work" from a blended average.

Common wrong answer to avoid: "Flat number means AI is useless, roll it back." A flat average can hide a strong gain and a real slowdown; you have to segment the surface before concluding anything.
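The cancellation described in Q6 is easy to reproduce with a few lines of arithmetic. The sketch below uses hypothetical headcounts and percentages, picked only so that the two corners cancel exactly; weighting by headcount is also an assumption, and a real analysis would weight by whatever delivery unit the file-06 baseline uses.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    name: str
    engineers: int
    change_pct: float  # delivery change vs control; positive = faster


def blended_change(segments: list[Segment]) -> float:
    """Headcount-weighted mean, i.e. the single org-wide number."""
    total = sum(s.engineers for s in segments)
    return sum(s.engineers * s.change_pct for s in segments) / total


def per_condition(segments: list[Segment]) -> dict[str, float]:
    """The per-condition report: one number per corner of the surface."""
    return {s.name: s.change_pct for s in segments}


# Hypothetical figures chosen so the two corners cancel exactly.
org = [
    Segment("greenfield team", engineers=40, change_pct=15.0),
    Segment("platform team (experts, familiar code)", engineers=60, change_pct=-10.0),
]

print("org-wide:", blended_change(org))
print("per-condition:", per_condition(org))
```

With these inputs the org-wide figure is 0.0 even though neither segment is flat, which is why the blended number alone supports no conclusion.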

Design/debug exercise (10 min)

Step 1 — Modeled example. Here is Meridian's framework for reasoning about a contested AI-productivity claim:

FILTER any claim:  WHO (devs, familiarity) / WHAT (task type) /
                   HOW (RCT-delivery / survey / vanity, baseline+control?) /
                   WHEN (tool generation)
LOCATE your workload on the surface: expertise × familiarity × task; note foundation strength.
CLASSIFY the finding: structural (long half-life, trust) vs capability (re-check).
POLICY: apply AI where the surface says it helps; don't mandate where it slows;
        keep file-06 guardrails; re-evaluate each tool generation.
Forbidden: quoting a corner as a constant; one org-wide number; verdict-waiting.
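The LOCATE and POLICY steps of the framework above can be made concrete for Meridian's two teams. This is a rough sketch, not Meridian's actual tooling: the `Workload` fields, the corner labels, and the "unmeasured middle" fallback are all assumptions added for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    team: str
    expert: bool         # deep expertise in this codebase
    familiar_code: bool  # mature code the team already knows well
    greenfield: bool     # new, standard, boilerplate-heavy work


def locate(w: Workload) -> str:
    """Place a workload on the surface (labels are illustrative)."""
    if w.expert and w.familiar_code and not w.greenfield:
        return "slowdown corner"   # where METR measured
    if w.greenfield and not w.familiar_code:
        return "leverage corner"   # where vendor headlines are measured
    return "unmeasured middle"     # assumption: no study sits close enough


def policy(w: Workload) -> str:
    """One-line policy per team, following the POLICY step."""
    corner = locate(w)
    if corner == "leverage corner":
        rule = "apply AI with light gates"
    elif corner == "slowdown corner":
        rule = "AI optional, never mandated"
    else:
        rule = "pilot with baseline and control before deciding"
    return f"{w.team}: {rule}; keep file-06 guardrails; re-evaluate each tool generation"


platform = Workload("platform", expert=True, familiar_code=True, greenfield=False)
greenfield = Workload("greenfield", expert=False, familiar_code=False, greenfield=True)

for w in (platform, greenfield):
    print(policy(w))
```

The fallback branch matters as much as the two corners: a workload that no study measured gets a pilot, not a borrowed number.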

Step 2 — Your turn. Take a real AI-productivity claim you've seen (a study, a vendor stat, a teammate's "it made me 2x faster") and run the filter on it: name WHO/WHAT/HOW/WHEN, locate which corner of the surface it measured, classify it as structural or capability, and state whether it applies to your workload. If you have none, continue with Meridian: place the platform team and the greenfield team on the surface and write the one-line policy for each.

Step 3 — Reproduce from memory. Redraw the productivity surface (expertise × familiarity). Mark DORA's corner, METR's corner, and the corner vendor headlines quote, then add the third axis (foundation) that scales it. Finally, connect it to file 06: why is "report per-condition, not one org-wide number" the same discipline as "pair every speed metric with a guardrail and compare against a control"?

Operational memory

This chapter explained why the clean, gate-per-pressure story of the module sits on contested evidence: two rigorous 2025 studies reached opposite-sounding conclusions because AI's effect on productivity is a surface over developer expertise, code familiarity, task type, foundation strength, and tool generation — not a single number. The important idea is that "does AI help?" has no context-free answer, so any study or vendor stat is valid only for the conditions it measured, and the durable skill is reading conditions and re-measuring rather than waiting for a verdict — not that "the evidence is a mess, so ignore it."

You learned to run the who/what/how/when filter on any claim, to locate your own workload on the productivity surface before applying anyone's number, to separate durable structural findings (the amplifier rule, Goodhart, the tradeoffs, the grounding gap — trust them across generations) from short-half-life capability snapshots (re-check each tool generation), and to set a conditional rollout policy that applies AI where the surface says it helps instead of one org-wide mandate or ban. That resolves the opening contradiction because DORA and METR stop disagreeing the moment you see them as measuring opposite corners — and Meridian's flat org-wide average resolves into a real greenfield gain and a real expert slowdown that were canceling each other out.

Carry this diagnostic forward: when someone reports an AI-productivity number, ask which corner of the surface it measured and whether your workload sits there, and whether it's a structural law you can trust or a capability snapshot to re-check. If you catch yourself waiting for the evidence to settle into one answer, stop — measure your own org per-condition instead, because the surface keeps moving and the single answer is never coming.

Remember:

  • AI's productivity effect is a conditional surface (expertise × familiarity × task × foundation × tool generation), not a single number.
  • Two rigorous studies disagree only if you expect one number; DORA and METR measure opposite corners of the same surface.
  • Run who/what/how/when on every claim; a context-free number has dropped the conditions that make it meaningful.
  • Trust structural findings (amplifier, Goodhart, the tradeoffs, the grounding gap) across generations; re-check capability snapshots each generation.
  • A single org-wide productivity number averages the big-win and slowdown corners into a meaningless mean — report per-condition.
  • The evidence will never settle into one answer; measure your own org continuously instead of waiting for a verdict.

Bridge. This module turned the lens around: across twenty-two modules you learned to build AI systems, and across these nine files you learned to use GenAI to build software faster — with a measurement loop honest enough to tell leverage from illusion and gates that contain the blast radius. You can now do both: design an agentic system and use AI tooling to ship it faster without trading away quality, security, or the ability to know if it helped. The capstone ties the two together — you'll build a real agentic AI system end to end, applying the building skills from the earlier modules and the leverage-and-guardrail discipline from this one, and measure honestly whether the AI tooling actually made you faster at building it. → ../21_capstone_agentic_ai_system/00-eli5.md