02. Spec-to-code and scaffolding — keep the human spec as the source of truth¶

~18 min read. An assistant will happily generate a whole service, a database migration, or a Terraform stack from a one-line prompt. The output compiles. It also encodes a dozen decisions nobody made on purpose. This file shows how to use AI for the boring 80% — scaffolds, migrations, boilerplate, IaC — while keeping a human-owned spec as the thing that defines correct, so the generated code conforms to intent instead of silently becoming the intent.

Built on 00-first-principles.md. The forces here are the source of truth, ambiguity vs drift, the blast radius, and the amplifier rule. This file moves up from the inner loop (file 01) to generating larger artifacts, where a wrong default ships at scale.

What we know so far and what still breaks¶

File 01 measured the inner loop and landed one rule: net leverage is output minus rework, and the rework lands downstream where verification was too cheap. We routed individual completions by verification cost. That works line by line. It does not protect you when the assistant generates a whole artifact at once — a 400-line service skeleton, a schema migration, a Terraform module — because now the unit of acceptance is too big to eyeball and the decisions are buried inside generated structure.

The new pressure is drift. When you prompt "build me a payments service," the assistant fills every gap you left unspecified with a plausible default: a retry policy, a timeout, an index, a naming convention, an IAM policy scope. You never decided those. The generated artifact becomes a de-facto specification that no human authored. Months later someone asks "why is the timeout 30 seconds?" and the honest answer is "the model picked it."

This chapter teaches the move that prevents drift: keep a small, human-owned spec as the source of truth, generate toward it, and verify the generated artifact against it — so the spec defines correct and the AI only fills in the mechanical translation.

What this file solves¶

A team prompts an assistant to scaffold a new service and ships it. Three sprints later they discover the generated code made silent choices — an unbounded retry, a permissive S3 bucket policy, a migration with no down-step — that no design review ever saw, because there was no spec to review against; the prompt was the spec and it evaporated after generation. This file gives you the concrete move: write a short structured spec the assistant generates from, treat the generated artifact as a candidate that must conform to the spec, and put the conformance check — not the prompt — under review.

Why a spec has to exist before you generate¶

Watch Meridian's team start a new service. A developer, Arun, needs a "notifications service." He opens the assistant and types: "Create a FastAPI service that sends emails and SMS, with a queue, retries, and a Postgres table for delivery logs." Forty seconds later he has 600 lines across nine files. It runs. He is thrilled.

Look at what just happened to the decisions. The prompt named maybe eight requirements. The generated service made fifty-two more on its own, among them: the retry count (3), the backoff (fixed 5s, not exponential), the queue (in-memory, lost on restart), the Postgres connection pool size (default 5), whether SMS failures block email (they do, serially), the log table's indexes (none on created_at), the error response shape, the timezone handling, the secret-loading mechanism (env vars, no rotation). Arun decided eight things. The model decided fifty-two. And there is no document anywhere recording which were intentional.

So the real problem is not "the generated code is wrong." Most of it is fine. The problem is that the prompt under-specifies, the model over-specifies to fill the gap, and the gap-filling decisions enter the codebase with no author and no review. There is no artifact that says "here is what we meant," so there is nothing to check the generation against and nothing to catch the bad default.

So how do we let the assistant do the mechanical 80% without letting it quietly own the design decisions?

The naive fix: just write a longer prompt¶

The obvious repair is to stuff everything into the prompt. Arun's second attempt is a 40-line prompt specifying retries, backoff, pool sizes, indexes. The output is better. But three problems appear immediately.

First, the prompt is throwaway. It lives in his chat history, not the repo. The next person who modifies the service never sees it. The decisions are recorded in the act of generation and then lost — exactly the drift problem, just pushed one step later.

Second, prompts are not reviewable as design. A teammate cannot comment "use exponential backoff here" on a chat message that already produced code. There is no diff, no approval, no record.

Third, prose prompts are ambiguous and the model resolves ambiguity silently. "Reasonable retries" means 3-with-fixed-backoff to one generation and 5-with-jitter to the next. You cannot diff two prose prompts to see what changed.

So the real cause is not prompt length; it is that the prompt is the wrong kind of artifact to be a source of truth — it is ephemeral, unreviewable, ambiguous, and not version-controlled. The fix is to move the decisions out of the prompt into a durable, reviewable, checkable spec, and let the prompt merely point at it.

When the spec is one YAML file¶

Here is the smallest version of the right shape. Instead of a prose prompt, Arun writes a short spec committed to the repo:

# notifications-service.spec.yaml  (the source of truth)
service: notifications
channels: [email, sms]
queue: { type: redis, durable: true }     # not in-memory
retries: { max: 5, backoff: exponential, jitter: true }
sms_failure: non_blocking                  # SMS failure must not block email
delivery_log:
  table: delivery_log
  indexes: [created_at, recipient_id]
secrets: { source: vault, rotation: 90d }
timeouts: { send_ms: 5000 }

That is eleven lines a human can review in a design discussion. The assistant generates the 600 lines of code from it. Now there is something to diff (changes to the spec), something to review (the spec, in a PR), and something to check the code against (does the generated retry logic match max: 5, exponential, jitter?). The decisions have an author and a home.
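
The third of those, checking the code against the spec, can be sketched in a few lines of Python. This is a minimal illustration, not Meridian's actual gate. The spec is written as a plain dict mirroring part of the YAML above (a real gate would parse the file), and `first_generation` is a hypothetical summary of what the generated service does, which some extraction step would have to produce. The names `flatten` and `conformance_violations` are made up for this sketch.

```python
# Hypothetical sketch: compare what the generated service does against the spec.
SPEC = {
    "queue": {"type": "redis", "durable": True},
    "retries": {"max": 5, "backoff": "exponential", "jitter": True},
    "sms_failure": "non_blocking",
    "timeouts": {"send_ms": 5000},
}


def flatten(tree, prefix=""):
    """Turn nested dicts into {"retries.max": 5, ...} so values compare one by one."""
    flat = {}
    for key, value in tree.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, path + "."))
        else:
            flat[path] = value
    return flat


def conformance_violations(spec, observed):
    """Return one message per spec decision the generated artifact does not honor."""
    actual = flatten(observed)
    violations = []
    for path, expected in flatten(spec).items():
        if path not in actual:
            violations.append(f"{path}: spec requires {expected!r}, artifact does not set it")
        elif actual[path] != expected:
            violations.append(f"{path}: spec requires {expected!r}, artifact has {actual[path]!r}")
    return violations


# What Arun's first, prompt-only generation did (values from the walkthrough above).
first_generation = {
    "queue": {"type": "in_memory", "durable": False},
    "retries": {"max": 3, "backoff": "fixed"},
    "sms_failure": "blocking",
}

for line in conformance_violations(SPEC, first_generation):
    print(line)
```

Every message names a decision and its owner (the spec), which is the point: the review conversation is about `retries.max`, not about line 212 of a generated file.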

Rule: AI generates the artifact; a human-owned spec defines correct¶

The load-bearing invariant: the specification is human-authored and version-controlled; the implementation is AI-generated and disposable. You can regenerate the code from the spec at any time. You can never regenerate the spec from the code, because the code does not record which of its decisions were intentional. So the spec is the durable asset and the code is downstream of it.

This inverts the usual instinct. Teams treat the generated code as precious (it took the assistant effort) and the prompt as throwaway. The right frame is the opposite: the spec is precious and the code is cheap to regenerate.

Why this rule exists. The primitive is provenance: every decision in shipped software should have an owner who can be asked "why." The constraint that breaks naive prompting is that generation erases provenance — the output cannot tell you which choices were deliberate. A spec restores provenance by recording intent separately from implementation. Without it, the codebase accumulates unowned decisions that no one can defend in an incident review.

1) The four artifact classes — and which one AI should write¶

Not all generated artifacts carry the same risk, because they have different blast radii. The mental model is to sort generation targets by what a wrong default costs.

                        BLAST RADIUS OF A WRONG DEFAULT
              small ◀──────────────────────────────────▶ catastrophic

  BOILERPLATE         SCAFFOLD            MIGRATION          INFRASTRUCTURE
  (DTOs, serializers, (service skeleton,  (schema change,    (Terraform, IAM,
   getters, mappers)   handlers, wiring)   data backfill)     network, secrets)
        │                   │                   │                   │
   wrong = compile      wrong = silent      wrong = DATA LOSS    wrong = SECURITY
   error or trivial     design debt that    or hours of         HOLE or outage
   fix; cheap           lives for months    locked tables;      across the fleet;
                                            HARD to reverse     blast radius = org

   AI: trust freely     AI: generate from   AI: generate, but   AI: generate DRAFT,
                        spec, review spec    review every line   mandatory human
                                            + dry-run            + policy gate

The further right, the smaller the share of the artifact AI should own unsupervised. Boilerplate is pure green-zone (file 01): a wrong DTO field fails to compile. Infrastructure is the opposite: a wrong IAM policy generated by AI can expose a bucket to the internet, and the blast radius is the whole organization. The skill is matching the human-oversight intensity to the artifact's blast radius, exactly as file 01 matched scrutiny to verification cost — same shape, bigger units.
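
One way to make that matching mechanical is a lookup from artifact class to the checks that must be complete before merge. A minimal sketch: the check names (`spec_review`, `dry_run`, and so on) are hypothetical labels for the oversight levels in the diagram, not identifiers from any real tool.

```python
from enum import IntEnum


class ArtifactClass(IntEnum):
    """Ordered by blast radius, smallest first, as in the diagram above."""
    BOILERPLATE = 1
    SCAFFOLD = 2
    MIGRATION = 3
    INFRASTRUCTURE = 4


# Hypothetical check names standing in for the oversight row of the diagram.
REQUIRED_CHECKS = {
    ArtifactClass.BOILERPLATE: set(),                       # trust freely
    ArtifactClass.SCAFFOLD: {"spec_review"},                # review the spec
    ArtifactClass.MIGRATION: {"line_by_line_review", "dry_run"},
    ArtifactClass.INFRASTRUCTURE: {"human_approval", "policy_gate"},
}


def missing_checks(artifact_class, completed):
    """Return the required checks that have not been completed yet, sorted."""
    return sorted(REQUIRED_CHECKS[artifact_class] - set(completed))


print(missing_checks(ArtifactClass.MIGRATION, ["dry_run"]))
print(missing_checks(ArtifactClass.BOILERPLATE, []))
```

An empty result means the artifact has had the oversight its class calls for; anything else is a list of what still blocks the merge.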

2) The spec-generate-conform loop — picture before mechanics¶

This is the core mental model of the chapter, the canonical image of "spec as source of truth."

        ┌──────────────────────────────────────────────────┐
        │             HUMAN-OWNED SPEC                      │
        │   (YAML / OpenAPI / schema / IaC plan, in repo,   │
        │    reviewed in a PR — THIS is the source of truth)│
        └───────────────┬──────────────────────────────────┘
                        │ generate
                        ▼
        ┌──────────────────────────────────────────────────┐
        │           AI-GENERATED CANDIDATE                  │
        │   (service code / migration / terraform)          │
        │   — disposable, regenerable, NOT the truth        │
        └───────────────┬──────────────────────────────────┘
                        │ check
                        ▼
        ┌──────────────────────────────────────────────────┐
        │            CONFORMANCE GATE                       │
        │   Does the candidate match the spec?              │
        │   (contract tests, schema diff, terraform plan,   │
        │    policy-as-code) — THIS is what review approves │
        └───────────────┬───────────────┬──────────────────┘
                        │ pass          │ fail
                        ▼               ▼
                    merge          fix spec OR regenerate
                                   (NEVER hand-patch the code
                                    and let it diverge from spec)

The arrow that matters most is the bottom-right one, the fail branch: when the candidate fails conformance, you fix the spec or regenerate. You do not hand-edit the generated code into passing, because that re-introduces unowned decisions and breaks the "code is regenerable from spec" invariant. The moment you start hand-patching generated code, the spec stops being the source of truth and drift resumes.
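
Hand-patching is easier to forbid when it is detectable. One possible mechanism, sketched here under assumptions: the generator prefixes each artifact with a header line carrying a digest of the spec it was generated from and a digest of its own body. The header format and the names `stamp` and `check_stamp` are invented for illustration.

```python
import hashlib


def digest(text: str) -> str:
    """Short SHA-256 digest, enough to detect change in a sketch."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def stamp(spec_text: str, generated_body: str) -> str:
    """Prefix generated code with the digests of its spec and of itself."""
    header = f"# generated-from-spec={digest(spec_text)} body={digest(generated_body)}\n"
    return header + generated_body


def check_stamp(spec_text: str, artifact: str) -> str:
    """Classify an artifact as 'ok', 'stale' (spec changed) or 'hand-patched'."""
    header, _, body = artifact.partition("\n")
    fields = dict(
        part.split("=", 1) for part in header.lstrip("# ").split() if "=" in part
    )
    if fields.get("body") != digest(body):
        return "hand-patched"   # body no longer matches what the generator wrote
    if fields.get("generated-from-spec") != digest(spec_text):
        return "stale"          # spec moved on; regenerate
    return "ok"


spec = "retries: {max: 5}\n"
artifact = stamp(spec, "MAX_RETRIES = 5\n")
print(check_stamp(spec, artifact))
print(check_stamp("retries: {max: 7}\n", artifact))
print(check_stamp(spec, artifact + "MAX_RETRIES = 3  # quick fix\n"))
```

The two failure results map onto the two allowed responses in the diagram: "stale" means regenerate, and "hand-patched" means either move the change into the spec or discard it.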

3) Meridian's migration — the running example, with numbers¶

Meridian needs to add a tier column to a 40-million-row customers table and backfill it. This is the migration class — high blast radius, hard to reverse. Watch the two approaches.

Attempt A — prompt-to-migration, no spec¶

Prompt: "Write a migration to add a tier column to customers,
         default 'standard', and backfill existing rows."

Generated:
  ALTER TABLE customers ADD COLUMN tier VARCHAR DEFAULT 'standard';
  UPDATE customers SET tier = 'standard' WHERE tier IS NULL;

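On Postgres with 40M rows, the damage depends on the version and on the default. Before PostgreSQL 11, or on any version when the default is volatile, that ALTER ... DEFAULT rewrites the whole table under an ACCESS EXCLUSIVE lock, and the table is unavailable for ~6 minutes mid-day. (PostgreSQL 11 and later add a constant default without a rewrite, but nothing in the prompt or the generated SQL says which version this will run on.) A blanket UPDATE that touches 40M rows in one transaction holds row locks for its whole duration, bloats WAL, and risks replication lag. There is no down migration. The generated code is syntactically perfect and operationally catastrophic, and nothing in the prompt-to-code flow surfaced the lock behavior. This is the migration-class failure: a wrong default is not a compile error, it is a six-minute outage.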

Attempt B — spec-to-migration with a conformance gate¶

# migration.spec.yaml  (source of truth, reviewed by a DBA)
change: add_column
table: customers
column: { name: tier, type: text, nullable: true }   # nullable first, no rewrite
backfill: { strategy: batched, batch_size: 10000, throttle_ms: 50 }
default: { value: standard, apply: after_backfill }   # set default last
constraints: { not_null: after_backfill }
down: drop_column tier

Conformance gate (runs in CI before merge):
  ✓ ALTER adds nullable column (no table rewrite)  [checked: no DEFAULT in ADD COLUMN]
  ✓ backfill is batched in 10k chunks with throttle [checked: loop present]
  ✓ down migration exists                           [checked: drop_column tier]
  ✗ FAIL if generated SQL contains "ADD COLUMN ... DEFAULT" on a >1M-row table

The DBA reviews eight lines of spec, not 80 lines of SQL. The generated SQL is checked against the spec by a lint rule that knows the lock behavior. The bad default is caught at the gate, not in production. Cost: the spec took Arun ten minutes to write; it saved a six-minute outage and a post-mortem.
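
Two of the gate's checks are simple enough to sketch directly. This is a minimal illustration, assuming the generated migration arrives as two SQL strings (up and down) and that the row count is supplied from outside, for example from table statistics. `lint_migration` is a made-up name, not an existing tool's API, and the sketch does not try to verify batching, which needs more than a regex.

```python
import re

# Matches ADD COLUMN ... DEFAULT inside a single statement (never across a ';').
ADD_COLUMN_WITH_DEFAULT = re.compile(r"ADD\s+COLUMN\b[^;]*\bDEFAULT\b", re.IGNORECASE)


def lint_migration(up_sql, down_sql, table_rows, big_table_rows=1_000_000):
    """Return the list of gate failures for one generated migration."""
    failures = []
    if table_rows > big_table_rows and ADD_COLUMN_WITH_DEFAULT.search(up_sql):
        failures.append("ADD COLUMN ... DEFAULT on a large table")
    if not down_sql.strip():
        failures.append("no down migration")
    return failures


attempt_a = (
    "ALTER TABLE customers ADD COLUMN tier VARCHAR DEFAULT 'standard';\n"
    "UPDATE customers SET tier = 'standard' WHERE tier IS NULL;"
)
attempt_b = (
    "ALTER TABLE customers ADD COLUMN tier text;\n"
    "ALTER TABLE customers ALTER COLUMN tier SET DEFAULT 'standard';"
)

print(lint_migration(attempt_a, "", table_rows=40_000_000))
print(lint_migration(attempt_b, "ALTER TABLE customers DROP COLUMN tier;", table_rows=40_000_000))
```

The rule is deliberately conservative: it rejects the pattern on any large table rather than reasoning about database version or default volatility, which keeps the gate dumb and predictable.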

Teacher voice. See the division of labor, na. The DBA's expensive judgment goes into eight lines of spec, where it is reusable and reviewable. The assistant's cheap labor generates the verbose SQL. The gate, a dumb lint rule encoding "never ALTER...DEFAULT on a big table," catches the one failure mode the model doesn't reason about. Three layers, each doing what it is good at. That is the whole pattern.
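
For a feel of what "batched, throttled backfill" compiles down to, here is a small runnable sketch. It uses SQLite from the Python standard library purely as a stand-in so the example runs anywhere; a production Postgres backfill would use a different driver and might batch by key range instead. The function name `backfill_tier` and the 25-row table are illustrative.

```python
import sqlite3
import time


def backfill_tier(conn, batch_size=10_000, throttle_ms=50, value="standard"):
    """Fill NULL tiers in short transactions; return the number of batches run."""
    batches = 0
    while True:
        with conn:  # one short transaction per batch, committed on exit
            cur = conn.execute(
                "UPDATE customers SET tier = ? WHERE id IN "
                "(SELECT id FROM customers WHERE tier IS NULL LIMIT ?)",
                (value, batch_size),
            )
        if cur.rowcount == 0:  # nothing left to fill
            return batches
        batches += 1
        time.sleep(throttle_ms / 1000)  # give other writers room between batches


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, tier TEXT)")
conn.executemany("INSERT INTO customers (id) VALUES (?)", [(i,) for i in range(25)])
conn.commit()

print(backfill_tier(conn, batch_size=10, throttle_ms=0))
```

Each batch commits on its own, so no single transaction ever holds locks on more than `batch_size` rows, and an interrupted run can simply be restarted.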

4) Why spec-to-code, not codegen templates or model fine-tuning¶

The plausible alternatives to AI spec-to-code are deterministic code generators (Yeoman, Rails generators, OpenAPI codegen, cookiecutter) and fine-tuning a model on your house style. Why pick spec-driven AI generation over these under Meridian's workload?

Deterministic generators are safe and repeatable but rigid: they emit exactly the patterns someone pre-wrote as templates, and adapting them to a new framework, a new convention, or a one-off shape means writing a new template — expensive, and the long tail never gets covered. AI generation's advantage is coverage of the long tail: it can generate a service shape nobody templated, adapt to the local conventions in the spec, and handle the 30% of services that don't fit any template. The cost is determinism — two generations from the same spec may differ — which is exactly why the conformance gate exists: it re-imposes determinism on the output (does it match the spec?) without requiring a template for every shape.

Fine-tuning loses on a different axis. It bakes your conventions into model weights, but conventions change faster than you want to retrain, and a fine-tuned model still produces unowned decisions for anything outside its training. The spec is a cheaper, more transparent, instantly-editable place to encode conventions than model weights. Under a workload of "many service shapes, evolving conventions, high blast radius on a few artifact classes," spec-to-code with a gate dominates: AI's flexibility for the long tail, the spec's reviewability for provenance, the gate's determinism for safety.

5) The property that changes the design: reversibility of the artifact¶

The single variable that should change how much you trust generation is how reversible the artifact is.

Boilerplate:    reversible in seconds (delete, regenerate)        → trust freely
Scaffold:       reversible in hours (refactor)                    → review the spec
Migration:      partially reversible / data loss risk             → review every line + dry-run
Infrastructure: a wrong apply can be live + global before caught  → draft only, mandatory gate

A wrong DTO is a backspace. A wrong terraform apply can open a security group to 0.0.0.0/0 across the production fleet before anyone reads the diff. The same model, the same accuracy, but the consequence of a failure differs by orders of magnitude. Reversibility (not difficulty, not how impressive the generation looks) is what sets the oversight level. This is the file-01 verification-cost idea applied to artifacts that act on the world: an IaC artifact's "verification" includes the cost of the failure being live in production while you verify.

6) One failure walked through: the over-permissive IAM policy¶

Trace the highest-blast-radius failure, because it is the one that ends up in a security post-mortem.

1. Dev prompts: "Terraform for a Lambda that reads from the orders S3 bucket
                 and writes to the reports bucket."
2. Assistant generates an IAM policy. To "make it work," it grants:
       Action: ["s3:*"], Resource: ["*"]
   — every S3 action on every bucket in the account. It works, so the dev moves on.
3. PR review: the reviewer sees 90 lines of Terraform, skims, sees it "looks like IAM,"
   approves. (The review tax from file 01: bigger generated diff, shallower review.)
4. terraform apply. The Lambda now has account-wide S3 access.
5. Six weeks later the Lambda is compromised via a dependency CVE.
   Blast radius = every bucket in the account, not the two it needed.

The model did not malfunction — s3:* on * is the most likely way to make a vague S3 request "work," so a model optimizing for plausible-working code reaches for it. The failure is structural: there was no spec saying "least privilege: only s3:GetObject on orders, s3:PutObject on reports," and no policy-as-code gate to reject wildcards. The amplifier rule: an org without least-privilege discipline gets that absence amplified into account-wide grants at generation speed.

The fix is the same three layers: a spec that states the exact permissions, generation toward it, and a policy gate (tfsec, checkov, OPA) that fails the build on Action: s3:* or Resource: *. The gate is dumb and deterministic and catches the one thing the model reaches for by default.
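
To show how little logic such a gate needs, here is a standard-library sketch that scans an IAM policy document (as a parsed JSON dict) for wildcards. It is not the API of tfsec, checkov, or OPA, which work on Terraform and have their own rule formats. The function name and the bucket ARNs are illustrative assumptions built from the bucket names in the walkthrough.

```python
def as_list(value):
    """IAM allows a single string or a list for Action and Resource."""
    return [value] if isinstance(value, str) else list(value)


def wildcard_findings(policy: dict) -> list[str]:
    """Return one finding per wildcard action or resource in Allow statements."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may appear unwrapped
        statements = [statements]
    findings = []
    for index, statement in enumerate(statements):
        if statement.get("Effect") != "Allow":
            continue
        for action in as_list(statement.get("Action", [])):
            if action == "*" or action.endswith(":*"):
                findings.append(f"statement {index}: wildcard action {action!r}")
        for resource in as_list(statement.get("Resource", [])):
            if resource == "*":
                findings.append(f"statement {index}: wildcard resource {resource!r}")
    return findings


generated = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": ["s3:*"], "Resource": ["*"]}],
}
least_privilege = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:GetObject", "Resource": "arn:aws:s3:::orders/*"},
        {"Effect": "Allow", "Action": "s3:PutObject", "Resource": "arn:aws:s3:::reports/*"},
    ],
}

print(wildcard_findings(generated))
print(wildcard_findings(least_privilege))
```

A CI step would fail the build whenever the returned list is non-empty. Note that the object-level `/*` suffix in the least-privilege ARNs is not flagged; only a bare `*` resource is.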

7) Cost movement — what spec-driven generation buys and bills¶

What changes	Direction	Concrete (Meridian)	Who absorbs it
Time to first working artifact	cheaper	service scaffold: 2 days → 2 hours	the author
Unowned design decisions	far fewer	~52 → ~0 (all in the spec)	the codebase (less debt)
Up-front spec-writing effort	new cost	+20–40 min per artifact	the author, before generating
Migration/IaC incident risk	lower	caught at gate, not prod	ops, on-call
Conformance-gate maintenance	new cost	lint rules, policy-as-code to maintain	platform team
Regeneration cost	near-zero	code is disposable, regen from spec	nobody

The pressure relieved is time-to-artifact and provenance loss. The pressure created is spec-writing discipline and gate maintenance — a platform-team cost. The org trades "fast but unowned" for "slightly-less-fast but owned and checkable." For boilerplate that trade isn't worth it (just generate). For migrations and IaC it is overwhelmingly worth it, because the avoided cost is an outage or a breach.

Mini-FAQ. "Isn't writing a spec just writing the code in another form, extra work for nothing?" No, because the spec is an order of magnitude smaller than the code and lives at the level of decisions, not syntax. Eight lines of migration spec replace 80 lines of SQL and, more importantly, make the lock-behavior decision reviewable by a DBA who would never read all 80 lines. The spec compresses the parts a human must judge and delegates the verbose parts to the model.

8) Signals — healthy, first to degrade, misleading, expert's graph¶

Healthy: generated artifacts conform to specs on first gate run; specs are reviewed in PRs; generated code is regenerated (not hand-patched) when the spec changes; migration and IaC changes pass policy gates without manual override.

First metric to degrade: the rate of manual overrides on the conformance gate. When developers start adding # noqa / tfsec:ignore / hand-editing generated code to bypass the gate, the spec is quietly ceasing to be the source of truth. This degrades before any incident, and it is the leading indicator that drift is returning.

The misleading metric: "number of services scaffolded by AI" or "lines of IaC generated." Pure vanity, file-01 family — it counts generation, not whether the generated artifacts conform to reviewed specs or whether anyone owns their decisions.

The graph an expert opens first: gate-override rate and spec-vs-code divergence over time, per team. Rising overrides mean the gate is being routed around; rising divergence (generated code hand-edited away from its spec) means provenance is leaking back out. Both predict the return of the exact drift this chapter prevents.
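
A first approximation of the override-rate series can come from counting suppression markers in the repo. A rough sketch, assuming file contents are already loaded into a dict keyed by path; the marker list covers only the two suppressions named above, and `override_rate` is a hypothetical helper, not part of any scanner.

```python
import re

# Suppression markers named in this chapter; extend for whatever gates a repo uses.
OVERRIDE_MARKER = re.compile(r"#\s*noqa\b|tfsec:ignore")


def override_rate(files: dict[str, str]) -> float:
    """Fraction of files containing at least one gate-suppression marker."""
    if not files:
        return 0.0
    flagged = sum(1 for text in files.values() if OVERRIDE_MARKER.search(text))
    return flagged / len(files)


snapshot = {
    "iam.tf": 'resource "aws_iam_policy" "p" {} #tfsec:ignore:aws-iam-no-policy-wildcards',
    "main.tf": 'resource "aws_s3_bucket" "reports" {}',
    "service.py": "MAX_RETRIES = 5",
    "worker.py": "import queue",
}

print(override_rate(snapshot))
```

Computed per team at each merge and plotted over time, this gives a crude version of the graph described above. The trend matters more than the absolute number.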

9) Boundary of applicability — where spec-to-code shines and where it's overhead¶

Strong fit: high-blast-radius, repeatable artifacts — migrations, IaC, API contracts, service scaffolds — especially in regulated or large orgs where provenance and review are required. Here the spec pays for itself the first time it catches a bad default. Also strong where the same artifact shape recurs (every service needs a scaffold), so the spec format amortizes.

Pathological (overhead): truly one-off throwaway scripts, spikes, and prototypes where there is no future maintainer and no blast radius. Forcing a spec-conform loop on a 20-line debugging script is bureaucracy — just generate it. The cost of the spec exceeds the cost of the artifact being wrong.

Scale/workload that breaks naive intuition: the intuition "specs slow us down" inverts with blast radius and team size. On a solo prototype, specs are pure overhead. On a 200-engineer org's payment migration, the spec is the fastest path because it replaces a multi-hour design review and a potential outage with a twelve-line reviewable artifact. The bigger the blast radius and the more maintainers, the more the spec accelerates rather than slows.

10) Wrong assumption: "the generated code is the deliverable"¶

The seductive belief is that the AI's output — the 600 lines, the Terraform, the migration — is the valuable thing the team produced, and the prompt was just the means. It is backwards. The generated code is disposable; the spec is the deliverable. You can regenerate the code from the spec in seconds; you cannot regenerate the spec from the code at all, because the code has lost the record of which decisions were intentional.

Teams that believe the code is precious hand-patch it (breaking conformance), guard it from regeneration, and let the spec rot. Teams that believe the spec is precious keep the spec reviewed and current and treat the code as a build output of it. Replace the wrong belief with: the spec is the source code; the code is the binary.

11) Other failure shapes to recognize¶

Spec rot. The spec exists but nobody updates it; developers hand-edit code; within a quarter the spec describes a service that no longer exists. Gate-override rate catches this early.
Over-specification. A spec so detailed it is the code, line for line, losing the compression benefit and re-introducing the review burden it was meant to remove. Specs should encode decisions, not syntax.
Generated-default lock-in. A bad default (fixed backoff, no index) ships once and then gets copied into every new service because the assistant takes the existing repo as context and imitates it. The amplifier rule turns one bad default into a house style.
Migration without a down-step. AI rarely generates rollback paths unless the spec demands them; a forward-only migration that fails halfway leaves the schema in an unrecoverable state.
IaC drift. Generated Terraform applied once, then changed by hand in the console; the next generation overwrites the manual change or the state diverges from reality.
Silent serial coupling. Generated service code wires steps serially (SMS blocks email) because that is the simplest plausible structure, encoding a latency/availability decision nobody made.
Boilerplate masquerading as design. A scaffold's "reasonable defaults" become load-bearing — error formats, pagination shapes, auth wiring — and changing them later breaks clients who depended on the accidental contract.

12) Pattern transfer — where this pressure recurs¶

The source of truth is the same invariant as schema-driven generation in module 17: a typed contract (schema/spec) defines correct, and generated artifacts conform to it rather than redefining it. Same shape, different artifact.
Drift between spec and code is the same failure geometry as the train/serve skew that haunts ML systems and the spec/served-prompt drift in module 13: two artifacts that should agree diverge because only one is the authority and the other is silently edited.
The conformance gate is the same mechanism as the eval gate in file 06 and the tool-contract validation in module 19: a deterministic check that re-imposes correctness on probabilistic output before it ships.
Blast radius by artifact class is the agent blast-radius idea from module 01's leash chapters, applied to generated infrastructure: oversight intensity scales with the worst thing a wrong action can do.

13) Design test — five questions before generating an artifact¶

Is there a human-owned spec this generation conforms to, or is the prompt the only record of intent?
What is this artifact's blast radius — backspace, refactor, data loss, or org-wide breach — and is my oversight scaled to it?
Is there a deterministic gate that fails the build on the model's most-likely bad default (ALTER...DEFAULT, s3:*, no down-migration)?
When the spec changes, will we regenerate, or will someone hand-patch the code and start drift?
Could a reviewer who has never seen the prompt understand and approve the decisions from the spec alone?

Where this appears in production¶

OpenAPI / Swagger codegen — the original spec-to-code: a human-owned API spec generates clients/servers that must conform; AI now drafts the spec and the handlers from it.
GitHub Copilot coding agent — takes an issue (a loose spec) and generates a PR; the discipline is to make the issue a real spec and gate the PR, not trust the prose.
Terraform + tfsec / checkov / OPA — generated IaC checked by policy-as-code that fails builds on wildcard IAM and public buckets; the exact gate from the IAM failure walkthrough.
AWS CloudFormation Guard / Azure Policy — declarative policy gates that reject non-conforming generated infrastructure before apply.
Prisma / Atlas / Liquibase migrations — schema-as-spec tools where the desired schema is the source of truth and migrations are generated to reach it, with lock-safe strategies.
gh-ost / pt-online-schema-change — the lock-safe online schema-change layer for MySQL, the execution-level counterpart of what the batched-backfill spec asks for on large tables.
Stripe — public engineering writing on API-spec-driven development; the spec is the contract clients depend on, generated SDKs conform to it.
dbt — analytics-as-spec: models defined declaratively, SQL generated and tested for conformance, a data-world version of the same loop.
Backstage software templates (Spotify) — scaffolding from golden-path templates with baked-in conventions; AI extends this for the long tail templates don't cover.
Pulumi — IaC in general-purpose languages where AI-generated infra is gated by policy packs before deploy.
Cookiecutter / Yeoman / Rails & Django generators — the deterministic-template alternative; useful where a template exists, complemented by AI for shapes that have none.
Snyk IaC / Wiz — security scanners that act as the conformance gate for generated infrastructure, catching over-permissive defaults at PR time.
Atlassian / Confluence as spec home — orgs that keep the human spec in a reviewed doc and generate code from it, keeping provenance outside the chat.
Meta / Google internal codegen — large-monorepo scaffolding where generated artifacts must pass house lint and policy gates before merge.

Pause and recall¶

Why is a prompt a poor source of truth, in three words: ephemeral, ____, ____?
State the chapter's core invariant about spec vs code.
Name the four artifact classes in order of blast radius.
On a 40M-row table, why is ALTER ... ADD COLUMN ... DEFAULT dangerous, and what does the spec do instead?
When the conformance gate fails, what are the two allowed responses — and the one forbidden one?
Which metric degrades first when the spec is silently ceasing to be the source of truth?
Why does spec-to-code beat both deterministic templates and fine-tuning under "many shapes, evolving conventions"?
Why does the "specs slow us down" intuition invert as blast radius and team size grow?

Interview Q&A¶

Q1. A dev scaffolds a new service from a one-line prompt and ships it. What is the risk even if the code works?
A. The prompt under-specified, so the model made dozens of unowned decisions (retries, timeouts, indexes, secret handling) that entered the codebase with no author and no review. There is no spec to check against, so a bad default ships silently and no one can defend it later. The fix is a human-owned spec the code conforms to.
Common wrong answer to avoid: "No risk if it works and passes tests." Working code can encode unreviewed design decisions and bad defaults that tests don't cover; the risk is provenance loss, not compilation.

Q2. Why not just write a very detailed prompt instead of a separate spec file?
A. A prompt is ephemeral (lives in chat, not the repo), unreviewable (no PR, no line comments), ambiguous (prose resolves differently each generation), and unversioned. A spec is durable, reviewable, deterministic to diff, and version-controlled. Moving decisions into a spec gives them an author and a home; a prompt loses them after generation.
Common wrong answer to avoid: "A detailed prompt is the same thing." It captures the decisions once and then erases them; it cannot be reviewed or diffed as design.

Q3. An AI generates a migration that adds a column with a default on a 40M-row table. It passes tests. Ship it?
A. No. On Postgres before version 11, or with a volatile default on any version, that rewrites the whole table under an exclusive lock: a multi-minute mid-day outage that tests don't catch. The spec should require a nullable add, batched throttled backfill, default applied last, and a down-step, with a CI gate that fails on ADD COLUMN ... DEFAULT for large tables. Syntactic correctness is not operational safety.
Common wrong answer to avoid: "Tests pass, so it's safe." Tests run on small data and don't model production lock behavior; the failure is operational, not logical.

Q4. Your conformance gate's manual-override rate is climbing. What does it mean and what do you do?
A. The spec is ceasing to be the source of truth: developers are bypassing the gate with ignores or hand-edits, so drift is returning. It's the leading indicator before any incident. Investigate why the gate is being routed around (false positives? too strict? slow?), fix the gate, and re-establish that code is regenerated from spec rather than hand-patched.
Common wrong answer to avoid: "Overrides are fine, devs know their context." Systemic override growth means the gate and spec are being abandoned, which silently restores the exact drift you built them to prevent.

Q5. When is forcing a spec-conform loop the wrong call?
A. On one-off throwaway artifacts with no future maintainer and no blast radius: a debugging script, a spike, a prototype. The spec's cost exceeds the cost of the artifact being wrong, so it's pure bureaucracy. Spec discipline scales with blast radius and number of maintainers; reserve it for migrations, IaC, contracts, and shared scaffolds.
Common wrong answer to avoid: "Always write a spec." Universal spec discipline burns time on disposable artifacts; the spec is leverage only where provenance and blast radius matter.

Q6. Is a recurring bad default (no index, fixed backoff) across new services a model problem or a process problem? (cumulative, connects to 01)
A. Process, via the amplifier rule. The assistant takes the existing repo as context, so one unowned bad default gets copied into every new generation, becoming a house style. The fix is upstream: encode the right default in a reviewed spec/template and gate against the wrong one. A better model still copies whatever pattern dominates the codebase.
Common wrong answer to avoid: "The model keeps making the same mistake, use a smarter model." The model is faithfully amplifying your repo's existing default; the fix is the spec and gate, not the model.

Design/debug exercise (10 min)¶

Step 1 — Modeled example. Here is the spec/gate pairing for Meridian's notifications service, showing what the spec owns and what the gate checks:

Spec owns (human-reviewed):  queue durability, retry policy, sms non-blocking,
                             log indexes, secret source + rotation, timeout.
Gate checks (deterministic): retry config matches spec; no in-memory queue;
                             indexes present on declared columns; no secrets in
                             env without rotation; send timeout set.
Forbidden:                   hand-editing generated code to pass the gate.
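The gate half of this pairing could look roughly like the sketch below. It assumes the spec and the generated candidate's configuration are both available as plain dictionaries; every key name (`retry`, `queue.backend`, `indexes`, `secrets`, `send_timeout_seconds`) and every value in `SPEC` is hypothetical, chosen only to mirror the five checks listed above.

```python
# Hypothetical spec shape and values, for illustration only.
SPEC = {
    "retry": {"max_attempts": 5, "backoff": "exponential"},
    "indexed_columns": {"notification_log": ["user_id", "created_at"]},
}


def gate(spec: dict, candidate: dict) -> list[str]:
    """Return conformance failures; an empty list means the candidate passes."""
    failures = []

    # Retry config must match the spec exactly.
    if candidate.get("retry") != spec["retry"]:
        failures.append("retry config does not match spec")

    # An in-memory queue violates the durability decision.
    if candidate.get("queue", {}).get("backend") == "in-memory":
        failures.append("in-memory queue is not durable")

    # Every column the spec declares must have an index.
    for table, columns in spec["indexed_columns"].items():
        present = set(candidate.get("indexes", {}).get(table, []))
        missing = [c for c in columns if c not in present]
        if missing:
            failures.append(f"missing index on {table}: {', '.join(missing)}")

    # Secrets read from env need a rotation policy.
    for name, secret in candidate.get("secrets", {}).items():
        if secret.get("source") == "env" and not secret.get("rotation"):
            failures.append(f"secret {name} read from env without rotation")

    # A send timeout must be set to a non-zero value.
    if not candidate.get("send_timeout_seconds"):
        failures.append("send timeout is not set")

    return failures
```

Each check is a plain comparison with no judgment involved, so a failure points at either the spec or the generation, never at a line to hand-patch.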

Step 2 — Your turn. Take a real service or migration from your work. Write a 10–15 line spec capturing the decisions a reviewer would care about (blast-radius decisions first). Then write three deterministic gate checks that would fail the build on the model's most-likely bad default for that artifact. Continue Meridian if you don't have one: write the IaC spec for the orders→reports Lambda and the policy-gate rule that rejects s3:*.

Step 3 — Reproduce from memory. Redraw the spec → candidate → conformance-gate loop, including the bottom-left arrow (on gate failure, fix spec or regenerate, never hand-patch). Then connect it to file 01: which artifact class corresponds to the red zone, and why does the review tax make the IAM failure more likely?

Operational memory¶

This chapter explained why generating whole artifacts — scaffolds, migrations, IaC — from a prompt lets the model make dozens of unowned design decisions that ship without review, and why a bad default in a migration or IaC artifact is an outage or a breach, not a backspace. The important idea is that the spec is the source of truth and the generated code is a disposable, regenerable candidate — not that "generated code is bad."

You learned the spec → generate → conform loop: put the blast-radius decisions in a short human-owned spec, generate toward it, and gate the candidate with deterministic checks that fail on the model's most-likely bad default. That solves the opening failure because the decisions now have an author and a reviewable home, and the gate catches the ALTER...DEFAULT and the s3:* before production does. Scale oversight to the artifact's blast radius and reversibility, exactly as file 01 scaled scrutiny to verification cost.
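For the s3:* half of that gate, a deterministic check can be very small. The sketch below assumes the generated IaC has already been rendered to an IAM policy document in JSON; the function name `wildcard_actions` is hypothetical, and the check only looks at `Action` in `Allow` statements, so it is not a full policy analyzer.

```python
import json


def wildcard_actions(policy_json: str) -> list[str]:
    """Return Allow-statement actions that contain a wildcard, e.g. 's3:*'."""
    policy = json.loads(policy_json)
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):   # Statement may be a single object
        statements = [statements]

    found = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):   # Action may be a string or a list
            actions = [actions]
        found.extend(a for a in actions if "*" in a)
    return found


if __name__ == "__main__":
    # Illustrative policy; the bucket name is made up.
    generated = json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "s3:*",
            "Resource": "arn:aws:s3:::example-reports/*",
        }],
    })
    offenders = wildcard_actions(generated)
    if offenders:
        raise SystemExit(f"policy gate failed: wildcard actions {offenders}")
```

Rejecting any wildcard in an action is stricter than some teams need (`s3:Get*` would also fail); loosening it is a spec decision, which is the point.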

Carry this diagnostic forward: when a generated artifact surprises you in production, ask where the decision was recorded and what gate should have caught it. If you see the gate-override rate climbing, inspect whether the spec is still the source of truth before trusting any "AI scaffolded N services" number.

Remember:

The spec is the source code; the generated implementation is the binary — regenerable, disposable.
Scale oversight to blast radius: trust boilerplate, review scaffolds via their spec, line-review migrations, draft-only for IaC behind a policy gate.
On gate failure, fix the spec or regenerate — never hand-patch the code into passing, or drift returns.
The model reaches for the most-plausible-working default (ALTER...DEFAULT, s3:*); a dumb deterministic gate catches exactly that.
Gate-override rate is the leading indicator that the spec is quietly losing source-of-truth status.
Specs speed you up rather than slow you down as blast radius and maintainer count grow.

Bridge. We made the spec the source of truth and gated generated artifacts against it. But that gate is deterministic — it catches the bad defaults we anticipated. It cannot judge whether an unanticipated change is good code: does it have a subtle logic bug, a missed edge case, a security smell? For that we need a reviewer with judgment, and the obvious move is to point AI at the diff. The next file asks what AI reviewers actually catch, what they miss, and how false-positive fatigue can erode trust in the gate faster than bugs erode the codebase. → 03-ai-code-review-and-quality-gates.md