12. The architect's checklist — your 20-item punch list before ship¶

~14 min read. The synthesis. Eleven chapters walked. Now the list you carry into every architecture review.

Built on the first-principles overview in 00-first-principles.md. The five primitives (Loop, Toolbelt, State, Leash, Lifecycle) and every surface we designed around them now collapse into a single punch list you can hand to any team proposing an agent.

The picture before the items¶

Imagine the workshop foreman from chapter zero. The apprentice is hired. The job sheet is written. The foreman now walks the floor with a clipboard. Four columns on that clipboard. Design-time. Build-time. Launch-time. Run-time.

Look at the layout.

┌────────────────────────────────────────────────────────────────┐
│                  THE ARCHITECT'S CLIPBOARD                     │
├──────────────┬──────────────┬──────────────┬──────────────────┤
│ DESIGN-TIME  │ BUILD-TIME   │ LAUNCH-TIME  │ RUN-TIME         │
│ (whiteboard) │ (code)       │ (rollout)    │ (production)     │
├──────────────┼──────────────┼──────────────┼──────────────────┤
│ leash        │ idempotency  │ yardstick    │ alarm signals    │
│ topology     │ approval     │ observability│ kill drill       │
│ blast radius │ checkpoint   │ kill switch  │ versioning       │
│ memory       │ tenancy      │ rollout plan │ regression lock  │
│ budgets      │ retries      │ runbook      │ cost dashboard   │
└──────────────┴──────────────┴──────────────┴──────────────────┘

Twenty items. Five per column. Each item maps back to one chapter you already read. The clipboard is not extra knowledge. It is the same knowledge, organised as a gate.

See. An architect's job is not to know more. It is to remember the right thing at the right time. So that is what we build now.

Column 1 — Design-time (before any code)¶

These are whiteboard decisions. No code yet. If you get these wrong, no amount of clean implementation saves you.

1. The leash is justified. Single-call, ReAct, or multi-agent: picked deliberately, not by default. Each step up the leash ladder buys you autonomy and costs you predictability. Write one paragraph defending the choice.
2. Topology mapped to failure mode. ReAct fails by infinite loops. Orchestrator-worker fails by silent worker drift. Pipeline fails by mid-stage stall. Match topology to which failure you can absorb.
3. Tools categorised by blast radius. Every tool in the toolbelt marked: read-only, reversible-write, irreversible-write, external-side-effect. The blast radius table is on the design doc, not in someone's head.
4. Memory model picked. Stateless? Session? Long-term? Picked before code, not retrofitted. Long-term memory without a forgetting policy is a privacy bug waiting to happen.
5. Budgets set with numbers. The budget is not "reasonable". It is max_tokens=8000, max_steps=10, max_wall_clock_s=30, max_cost_usd=0.05. Numbers on the page.

If any of these five are vague, you are not in design phase yet. You are in wishing phase.
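To make "numbers on the page" concrete, here is a minimal sketch of a budget object that the loop charges on every step. The caps are the ones from the budgets item; the class and method names (`Budget`, `charge`, `BudgetExceeded`) are illustrative assumptions, not part of any framework.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when a hard cap is crossed, so the agent fails loudly."""


@dataclass
class Budget:
    # Caps taken from the budgets item; the field names are illustrative.
    max_tokens: int = 8000
    max_steps: int = 10
    max_wall_clock_s: float = 30.0
    max_cost_usd: float = 0.05
    # Running totals, updated by charge().
    tokens: int = 0
    steps: int = 0
    cost_usd: float = 0.0
    started_at: float = field(default_factory=time.monotonic)

    def charge(self, tokens: int, cost_usd: float) -> None:
        """Record one loop step, then raise if any cap is exceeded."""
        self.steps += 1
        self.tokens += tokens
        self.cost_usd += cost_usd
        elapsed_s = time.monotonic() - self.started_at
        usage = [
            ("max_steps", self.steps, self.max_steps),
            ("max_tokens", self.tokens, self.max_tokens),
            ("max_wall_clock_s", elapsed_s, self.max_wall_clock_s),
            ("max_cost_usd", self.cost_usd, self.max_cost_usd),
        ]
        for name, used, cap in usage:
            if used > cap:
                raise BudgetExceeded(f"{name} exceeded: {used} > {cap}")
```

The loop would call `budget.charge(...)` after each model or tool call and catch `BudgetExceeded` at the top level, where the give-up rule decides what the user sees.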

Column 2 — Build-time (writing the code)¶

Code-time decisions. The ones that turn the design into a system that won't corrupt data.

6. Idempotency keys on every write tool. The same request sent twice must take effect exactly once. Otherwise a retry is a bug.
7. Approval gates wired where blast radius demands. The understudy sits before every irreversible write above a threshold. Not after. Not "we'll review logs".
8. State checkpointing tested. Kill the process mid-trajectory in staging. Does it resume cleanly? If not, the resume claim is a lie.
9. Multi-tenancy isolation verified. Two users, parallel sessions, no context bleed. Tested with adversarial prompts that try to leak.
10. Retry and backoff policies explicit. When a tool fails: how many retries, what backoff, what surfaces to the user? Never assume a framework's defaults suit production.
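The idempotency and retry items work as a pair, and a small sketch shows why. Everything here is an assumption for illustration: the in-memory `IdempotentWriter` stands in for a write tool that would persist its keys in a real system, and `call_with_retry` is a hand-rolled helper, not a library API.

```python
import time
from typing import Callable, Dict


class TransientError(Exception):
    """A failure worth retrying, such as a timeout or a lost response."""


class IdempotentWriter:
    """Applies each write at most once per idempotency key (in memory only)."""

    def __init__(self) -> None:
        self._results: Dict[str, str] = {}
        self.applied = 0  # how many times the side effect really ran

    def write(self, key: str, payload: str) -> str:
        if key in self._results:  # replay: return the stored result
            return self._results[key]
        self.applied += 1
        self._results[key] = f"applied:{payload}"
        return self._results[key]


def call_with_retry(
    fn: Callable[[], str],
    max_retries: int = 3,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Retry on TransientError with exponential backoff, then re-raise."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_retries:
                raise  # surface the failure to the caller, never swallow it
            sleep(base_delay_s * (2 ** attempt))
    raise AssertionError("unreachable")
```

If the first call applies the write but the response is lost, the retry reuses the same key and gets the stored result back. The side effect runs once. Without the key, the same retry would be a second write.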

Build-time bugs are silent until production. Each item above is a test you can run in CI, not a vibe check.
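One way the approval gate can become a CI-testable check is to put the blast-radius table in code and route every tool call through it. This is a sketch under assumptions: the tool names are hypothetical, the threshold from the approval item is left out for brevity, and `approve` stands in for whatever mechanism reaches the understudy.

```python
from enum import Enum
from typing import Callable


class BlastRadius(Enum):
    READ_ONLY = "read-only"
    REVERSIBLE_WRITE = "reversible-write"
    IRREVERSIBLE_WRITE = "irreversible-write"
    EXTERNAL_SIDE_EFFECT = "external-side-effect"


# Hypothetical toolbelt; every tool must appear here before it can run.
TOOL_BLAST_RADIUS = {
    "search_docs": BlastRadius.READ_ONLY,
    "reset_password": BlastRadius.REVERSIBLE_WRITE,
    "merge_tickets": BlastRadius.IRREVERSIBLE_WRITE,
    "send_email": BlastRadius.EXTERNAL_SIDE_EFFECT,
}


def run_tool(
    name: str,
    action: Callable[[], str],
    approve: Callable[[str], bool],
) -> str:
    """Run a tool, asking for approval BEFORE any irreversible write."""
    radius = TOOL_BLAST_RADIUS.get(name)
    if radius is None:
        raise KeyError(f"tool {name!r} has no blast-radius entry")
    if radius is BlastRadius.IRREVERSIBLE_WRITE and not approve(name):
        return "blocked: approval denied"
    return action()
```

The gate sits in front of `action()`, so a denied approval means the write never happens. An uncategorised tool fails fast instead of running unclassified.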

Column 3 — Launch-time (the rollout)¶

Now the system exists. Shipping it is its own discipline.

11. Yardstick eval passed. The yardstick, a frozen golden set with a published pass threshold, is green before launch, full stop.
12. Observability spans emit cleanly. Every step writes a structured trace: input, output, tool name, tokens, latency, cost. Verified end-to-end in staging.
13. Kill switch tested in staging. The kill switch is exercised, not just present. On-call clicks it. Agent stops within 5 seconds. Logged.
14. Rollout plan written. Shadow → 1% canary → 10% → 50% → 100%. Each stage has a hold time and a rollback trigger.
15. Runbook published. When the alarm fires at 3 AM, what does on-call do? Written down. With page links. Not tribal knowledge.
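A structured span per step might look like the sketch below. The fields follow the observability item (input, output, tool name, tokens, latency, cost); the `StepSpan` and `traced_step` names, the `trace_id` and `step` fields, and the list used as a sink are assumptions. A real system would more likely send spans to a tracing backend, and the token and cost figures would come from the model response.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Any, Callable, Dict, List


@dataclass
class StepSpan:
    trace_id: str
    step: int
    tool_name: str
    input: Dict[str, Any]
    output: str
    tokens: int
    latency_ms: float
    cost_usd: float


def traced_step(
    trace_id: str,
    step: int,
    tool_name: str,
    tool: Callable[..., Any],
    args: Dict[str, Any],
    tokens: int,
    cost_usd: float,
    sink: List[str],
) -> Any:
    """Run one tool call and append its span to the sink as a JSON line."""
    start = time.perf_counter()
    output = tool(**args)
    latency_ms = (time.perf_counter() - start) * 1000
    span = StepSpan(trace_id, step, tool_name, args, str(output),
                    tokens, latency_ms, cost_usd)
    sink.append(json.dumps(asdict(span)))
    return output
```

Because every span carries the trace id and the step index, a trajectory can be rebuilt later by filtering on one id and sorting by step.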

If launch-time skips even one of these, you have a demo, not a product.
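The rollout plan can also be written as data, so that "hold time" and "rollback trigger" are values someone had to type. In this sketch the stage sequence comes from the rollout item; the 24-hour holds and the 2% error-rate trigger are placeholder assumptions, and `next_action` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    name: str
    traffic_pct: int
    hold_hours: int        # assumed hold time before promotion
    max_error_rate: float  # assumed rollback trigger


# Stage sequence from the checklist; the numbers are placeholders.
ROLLOUT: List[Stage] = [
    Stage("shadow", 0, 24, 0.02),
    Stage("1% canary", 1, 24, 0.02),
    Stage("10%", 10, 24, 0.02),
    Stage("50%", 50, 24, 0.02),
    Stage("100%", 100, 0, 0.02),
]


def next_action(stage_index: int, hours_held: float, error_rate: float) -> str:
    """Decide whether to roll back, keep holding, or promote."""
    stage = ROLLOUT[stage_index]
    if error_rate > stage.max_error_rate:
        return "rollback"
    if hours_held < stage.hold_hours:
        return "hold"
    if stage_index == len(ROLLOUT) - 1:
        return "done"
    return f"promote to {ROLLOUT[stage_index + 1].name}"
```

The rollback check runs first on purpose. A stage that has served its hold time but is over the error threshold still rolls back.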

Column 4 — Run-time (after it's live)¶

Production discipline. Often skipped. Usually the reason agents quietly degrade over weeks.

16. Alarm panel signals defined. Tool error rate, eval drift, p95 latency, cost-per-call, refusal rate. Each signal has a threshold and a pager rule. The alarm panel is wired to PagerDuty, not Slack-only.
17. Kill drill executed with on-call. Once per quarter, pretend it's an outage. Time how long it takes to flip the kill switch. Under 5 minutes or it's not real.
18. Versioning protocol documented. Prompt version, tool schema version, model version: pinned together. Rolling any one alone is a contract break.
19. Regression eval locked. The frozen eval set runs nightly. A pass-rate drop of more than 2% pages someone. Drift detection catches the slow rot.
20. Cost dashboard live. Tokens per request, USD per day, per-tenant breakdown. Without this, your CFO finds out before you do.
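The regression lock reduces to one comparison. This sketch assumes the 2% threshold means two percentage points of pass rate against a stored baseline; the source does not say whether the drop is absolute or relative, so treat that reading, and the function names, as assumptions.

```python
from typing import Sequence


def pass_rate(results: Sequence[bool]) -> float:
    """Fraction of eval cases that passed in one run."""
    if not results:
        raise ValueError("eval run produced no results")
    return sum(results) / len(results)


def should_page(
    baseline_rate: float,
    nightly_results: Sequence[bool],
    max_drop: float = 0.02,  # assumed to be absolute percentage points
) -> bool:
    """True when the nightly pass rate fell more than max_drop below baseline."""
    drop = baseline_rate - pass_rate(nightly_results)
    return drop > max_drop
```

An empty run raises instead of returning a rate. A nightly job that silently evaluates zero cases is itself a failure worth paging on.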

Twenty items. Now you walk into a review and you have a sheet.
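For the versioning item, one way to pin the three versions together is to treat the triple as the unit of release. The sketch below is an assumption about how a team might encode that rule: a release is deployable only if that exact combination has been evaluated, and the version strings are made up.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Release:
    """Prompt, tool schema and model versions, pinned as one unit."""
    prompt_version: str
    tool_schema_version: str
    model_version: str


def changed_parts(old: Release, new: Release) -> List[str]:
    """Name the fields that differ between two releases."""
    names = ("prompt_version", "tool_schema_version", "model_version")
    return [n for n in names if getattr(old, n) != getattr(new, n)]


def can_deploy(new: Release, evaluated: Set[Release]) -> bool:
    """Allow only exact triples that have already passed the eval."""
    return new in evaluated
```

Bumping only the prompt produces a new triple. Until that triple has its own eval run, `can_deploy` says no.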

The smell test — five questions for any new agent proposal¶

You are sitting in the architecture review. Someone proposes an agent. You have 5 questions. Just five.

"What is the worst single action this agent can take, and is that action reversible?" Tests blast-radius thinking. A vague answer means no one mapped the toolbelt.
"How does on-call stop this in one click at 3 AM?" Tests the kill switch. If the answer is "we'd push a config change", that is not a kill switch.
"What eval gates production?" Tests the yardstick. If the answer is "manual review", there is no gate.
"What does the agent do when its budget runs out?" Tests the budget and the give-up rule. Silent failure is the wrong answer.
"How do you tell tomorrow that this agent got worse since today?" Tests observability and regression eval. Most teams cannot answer this.

If any answer is hand-wavy, the agent is not ready. Send it back. Politely.
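Question 2 is easier to judge with a picture of what a real kill switch looks like in the loop. A minimal sketch, assuming an in-process flag: in production the flag would live in a shared config or feature-flag service, and `KillSwitch` and `run_agent` are illustrative names.

```python
import threading
from typing import Callable, List


class KillSwitch:
    """A flag on-call can flip; the loop checks it before every step."""

    def __init__(self) -> None:
        self._stopped = threading.Event()

    def flip(self) -> None:
        self._stopped.set()

    def is_flipped(self) -> bool:
        return self._stopped.is_set()


def run_agent(steps: List[Callable[[], str]], switch: KillSwitch) -> List[str]:
    """Run steps in order, stopping loudly as soon as the switch is flipped."""
    completed: List[str] = []
    for step in steps:
        if switch.is_flipped():
            completed.append("stopped: kill switch flipped")
            break
        completed.append(step())
    return completed
```

The check runs before each step, not once at startup. That is what lets a flip during a long trajectory take effect at the next step boundary, and the final entry records that the stop happened.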

Worked example — "AI agent that auto-resolves Jira tickets"¶

A team brings this proposal to your review:

"We want an agent that reads incoming Jira tickets, classifies them, and auto-resolves the easy ones — password resets, doc lookups, duplicate-ticket merging. Should save the support team 30% of their time."

You take out the clipboard. Let us walk it.

Design-time, items 1-5. Leash: they propose ReAct with 6 tools. Topology: clear, orchestrator-worker not needed. Blast radius: here is problem one. "Merging duplicate tickets" is an irreversible write, and they haven't flagged it. Flag #1. Memory: session-only, fine. Budgets: barely specified. They say "we'll cap at $0.10/ticket". One number exists. Acceptable for now.

Build-time, items 6-10. Idempotency: ticket merge has no idempotency key. Re-running the agent on the same ticket will merge it again into something different. Flag #2. Approval gates: they want full autonomy. For password reset (reversible by user) that is fine. For ticket merge (destroys ticket history) that needs the understudy. Tenancy: single-tenant Jira instance, fine.

Launch-time, items 11-15. Yardstick: they have 50 labelled tickets. That is a unit test, not an eval. Need 500+ for the yardstick to mean anything. Soft gap. Observability: missing. They plan to "check Jira's audit log". That is not a trace. No spans on writes. Flag #3. Kill switch: feature flag in their config service, untested. An untested kill switch blocks ship until it is drilled. Flag #4. Runbook: does not exist. Soft gap.

Run-time, items 16-20. Alarm panel: not defined. Kill drill: never done. Versioning: not discussed. Regression eval: nightly run not planned.

Four hard flags. Two soft gaps. You send them back with a checklist of exactly these items. Not vibes. The clipboard.

That is the review you wanted to have. Now you can.

The "what blocks ship" decision tree¶

Not every flag blocks launch. When a flag fires, use this tree.

flag identified
   │
   ├── irreversible action with no approval gate? ───→ BLOCK SHIP
   │
   ├── no kill switch / untested kill switch? ───→ BLOCK SHIP
   │
   ├── no eval gate at all? ───→ BLOCK SHIP
   │
   ├── no observability spans on writes? ───→ BLOCK SHIP
   │
   ├── thin eval set (< 100 cases)? ───→ SHIP TO 1% ONLY
   │
   ├── budget caps soft? ───→ SHIP WITH HARD CAP IN GATEWAY
   │
   └── runbook missing? ───→ SHIP TO SHADOW ONLY

Hard blockers — anything that lets the agent corrupt data, run forever, or hide its own failure. Soft gaps — anything you can compensate for with a smaller blast radius (1% canary, hard cap, shadow mode).

See. The tree is opinionated. Make it your team's tree. Write it down. Pin it in the review template.
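Writing the tree down can be as literal as a lookup table. In this sketch the outcomes are copied from the tree above; the flag identifiers and the `decide` function are hypothetical, and the rule that any hard blocker overrides every soft outcome is an assumption consistent with the hard/soft split.

```python
from typing import Iterable, List

# Outcomes copied from the tree; the flag identifiers are made up.
TREE = {
    "irreversible_without_approval": "BLOCK SHIP",
    "kill_switch_missing_or_untested": "BLOCK SHIP",
    "no_eval_gate": "BLOCK SHIP",
    "no_write_observability": "BLOCK SHIP",
    "thin_eval_set": "SHIP TO 1% ONLY",
    "soft_budget_caps": "SHIP WITH HARD CAP IN GATEWAY",
    "runbook_missing": "SHIP TO SHADOW ONLY",
}


def decide(flags: Iterable[str]) -> List[str]:
    """Map raised flags to launch conditions; any hard blocker wins outright."""
    raised = set(flags)
    unknown = raised - set(TREE)
    if unknown:
        raise ValueError(f"flags not in the tree: {sorted(unknown)}")
    outcomes = [TREE[f] for f in TREE if f in raised]  # keep tree order
    if "BLOCK SHIP" in outcomes:
        return ["BLOCK SHIP"]
    return outcomes or ["SHIP"]
```

An unknown flag raises instead of being ignored. A flag the tree has no answer for is a prompt to extend the tree, not a free pass.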

Where this lives in the wild¶

Anthropic deployment review — every model launch passes a usage policy review, red-team review, and a "what could go wrong" exercise that maps directly to blast-radius thinking before any API access goes public.
OpenAI red-team gate — frontier model deployments pass a preparedness framework checklist covering misuse risks, eval thresholds, and rollback criteria before staged rollout. Skipping the gate blocks ship.
Meta GenAI deployment review — internal AI features pass a responsible AI review board that requires impact assessment, eval baselines, and kill-switch demos before any user-facing rollout.
Microsoft Responsible AI Standard — the "Responsible AI Impact Assessment" template is the literal checklist every AI product team fills before launch, covering fitness, fairness, transparency, and human oversight gates.
Databricks AI Gateway — its review process for production agents requires registered evals, cost guardrails per workspace, and a single-flip disable per agent before promotion from staging to production.

Every one of these is a clipboard. Same idea, different organisation.

Pause and recall¶

Which four items in the checklist are hard blockers — meaning launch cannot proceed even one percent without them?
The understudy sits before which class of tools, not after — and why does the order matter?
What is the difference between having a kill switch and having a drilled kill switch?
Why does a thin eval set (50 examples) demote you to 1% canary instead of full launch?

Interview Q&A¶

Q: You are presented with a proposed agent that does customer-refund processing autonomously. Walk me through the three things you would flag in the design review. A: One, the refund write is irreversible — money out is hard to claw back, so the understudy must approve above a threshold (say $50), not just review logs. Two, the kill switch must be a single config flip that any on-call can hit; "redeploy with feature flag off" is not a kill switch. Three, the yardstick must include adversarial cases — customers prompting the agent to refund non-existent orders — not just happy-path tickets. If any of these three are missing, the proposal goes back.

Common wrong answer to avoid: focusing on prompt quality and tool descriptions. Those matter, but they are not what gets the team paged at 3 AM. Irreversible writes without an approval gate are what cause incidents.

Q: A team says "we have observability — we log every LLM call to Datadog". Why is that not enough for an agent in production? A: Logging the LLM call is one sixth of the trace. You also need: the tool name and arguments per call, the tool result, the iteration index, the budget consumed, and the final state of the trajectory. Without a structured span per step you cannot reconstruct why an agent made a bad sequence of decisions, only that it made calls. The alarm panel needs trajectory-level signals (loop count, refusal rate, tool-error rate), not raw LLM call logs.

Common wrong answer to avoid: "Datadog is fine, you can search the logs". You cannot search what you did not emit. If you do not emit the iteration index and tool result as a span field, no query in the world will give you a trajectory view.

Q: An architect proposes a multi-agent system — orchestrator with five specialised workers. They show beautiful diagrams. What is the single question that exposes whether they have done the design work? A: "What happens when one worker silently returns wrong data?" Multi-agent topologies fail this way more than they fail loudly. The orchestrator trusts the worker. The worker hallucinates a plausible result. The error compounds downstream. If the answer is "the orchestrator will catch it" without a concrete validation step, the design is incomplete. The blast radius of one wrong worker has not been mapped.

Common wrong answer to avoid: "We use retries and timeouts". Retries help with crashes, not with confidently wrong outputs. Silent drift in workers is the orchestrator-worker topology's signature failure mode and it needs explicit cross-validation, not retries.

Q: When does a soft gap become a hard blocker, and what is the principle? A: A soft gap is one you can compensate for by shrinking the blast radius at launch. Thin eval set? Ship to 1% canary, the small sample limits damage. No runbook yet? Ship to shadow mode where the agent's actions are not applied. A hard blocker is a gap where shrinking the blast radius does not help. No kill switch means even 1% can run amok. No idempotency on writes means even a 1% canary can corrupt data on retry. The principle: if reducing the rollout percentage does not reduce the harm, it is a hard blocker.

Common wrong answer to avoid: "Hard blockers are anything related to safety, soft gaps are anything related to performance". Wrong axis. A performance issue at 100x cost is a blocker too. The right axis is whether the rollout dial can compensate for the gap.

Apply now (5 min)¶

Pick one agent — yours or one from your team — that is running in production or staging right now. Open a doc. Write the 20-item checklist as a table. For each row, mark green / yellow / red based on the current state. Be honest, not aspirational.

When you finish, count the reds. More than three reds and you are not in production — you are in an incident waiting room.
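If a doc table feels too easy to fudge, the same exercise fits in a few lines of code. The item names below come from the clipboard diagram; the `score` function and the status strings are assumptions for illustration.

```python
from collections import Counter
from typing import Dict

# Item names taken from the four-column clipboard.
CLIPBOARD = {
    "design": ["leash", "topology", "blast radius", "memory", "budgets"],
    "build": ["idempotency", "approval", "checkpoint", "tenancy", "retries"],
    "launch": ["yardstick", "observability", "kill switch", "rollout plan", "runbook"],
    "run": ["alarm signals", "kill drill", "versioning", "regression lock", "cost dashboard"],
}


def score(status: Dict[str, str]) -> Dict[str, object]:
    """Count green / yellow / red rows and apply the more-than-three-reds rule."""
    items = [item for column in CLIPBOARD.values() for item in column]
    missing = [item for item in items if item not in status]
    if missing:
        raise ValueError(f"unscored items: {missing}")
    counts = Counter(status[item] for item in items)
    return {
        "green": counts["green"],
        "yellow": counts["yellow"],
        "red": counts["red"],
        "incident_waiting_room": counts["red"] > 3,
    }
```

The function refuses to score a partial sheet. An unmarked row is usually a red that nobody wanted to write down.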

Sketch from memory: Draw the four-column clipboard (design, build, launch, run) with at least three items per column. Then circle the four items that are hard blockers — agent cannot ship without them.

Operational memory¶

This chapter synthesised the module's twenty architectural decisions into a four-column clipboard — design-time, build-time, launch-time, run-time — that the architect carries into every new agent review. The important idea is that the architect's job is not to know more; it is to remember the right thing at the right time, and a written clipboard moves the conversation from impressionistic to mechanical. Hand-wavy answers on any row are the design's confession that the team has not actually decided.

You learned the twenty items grouped by lifecycle phase, the five-question smell test (worst single action, kill switch path, eval gate, budget exhaustion behaviour, drift detection) that exposes whether a new agent proposal is ready for review, and the what-blocks-ship decision tree that separates hard blockers (no kill switch, no eval gate, irreversible action without approval, no observability on writes) from soft gaps (thin eval set → 1% canary, soft budget caps → hard gateway caps, missing runbook → shadow only). Applied to the Jira auto-resolve proposal in the worked example, the clipboard surfaces four hard flags before the design progresses to build.

Carry this diagnostic forward: every agent review uses the clipboard. Twenty rows, green or yellow or red. More than three reds and you are not in production — you are in an incident waiting room.

Remember:

Twenty items, four phases — design, build, launch, run. Each row maps back to one chapter's discipline.
Hard blockers: irreversible action without approval gate, no/untested kill switch, no eval gate, no observability on writes.
Soft gaps compensate by shrinking blast radius (1% canary, shadow mode, hard gateway caps); hard blockers cannot be compensated.
The five-question smell test runs in 30 seconds and exposes whether the team has done the design work.
Numbers on the page replace vibes; vague answers on any row are the place to push back.

Bridge. The clipboard is powerful. But it catches what we know to look for. Some failures are not on any checklist yet. The field itself has gaps — places where even careful architects ship with confidence and still get burned. Next we name those honestly. → 13-honest-admission.md