12. Architect checklist
Twenty items. Design, build, launch, operate. If you can answer all of them with a contract artefact, the tool is defensible. If you cannot, the gaps are the work.
This is the checklist a tech lead uses in design review, again at launch review, and again at the first incident postmortem. It compresses the eleven preceding chapters into questions you can ask in twenty minutes.
The items are sequenced as they appear in a tool's lifecycle: first the design decisions that constrain everything else, then the implementation surfaces, then the launch gates, then the operational habits that keep the contract honest over time.
Design (1–6)
The first six items must be answered before any code is written. They constrain the implementation.
1. Identity. Are the tool's name, version, owner, and purpose written down in the contract registry, not in the wrapper file? (Chapter 02.)
2. Class. Is the class one of read, write-idempotent, write-non-idempotent, or irreversible, with a one-paragraph rationale? Is the rationale defensible against a reviewer challenge — particularly the question "what undoes this if it goes wrong?" (Chapter 03.)
3. Side effects. Are all side effects documented, including downstream fan-out (events emitted, notifications sent, ledger entries triggered, third-party calls)? Is reversibility explicit? (Chapter 02.)
4. Scope. Is the required scope specified at the verb-noun level (e.g., payments:refund:write, not payments:*)? Is tenant binding required? Are there target constraints (allowlists, value caps, time windows)? (Chapter 06.)
5. Error enum. Does the contract enumerate all foreseeable failure modes, walked against the error taxonomy — InputError, NotFoundError, PermissionError, PreconditionError, ConflictError, RateLimitError, TransientError, UpstreamError, PolicyError, InternalError — with each error's retriable flag, human_hint, and model_action defined? (Chapter 05.)
6. Validation surfaces. Are preconditions listed (state, policy, invariant)? Are postconditions listed (shape, invariant, state-after)? Is dry-run mode implemented for irreversible or high-value tools? (Chapter 07.)
If any of items 1–6 cannot be answered with the contract artefact in hand, the design phase is not done. Going to implementation without these is how the chapter-opening incidents happen.
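One way to make this gate mechanical is to treat the registry entry as a typed record and report which of items 1–6 it cannot yet answer. The sketch below is illustrative only: `ContractEntry`, its field names, and `design_gaps` are hypothetical and are not a registry format defined in this guide.

```python
from dataclasses import dataclass, field
from typing import Optional

# The four classes named in item 2.
TOOL_CLASSES = {"read", "write-idempotent", "write-non-idempotent", "irreversible"}


@dataclass
class ContractEntry:
    """Hypothetical registry entry covering design items 1-6."""
    name: str = ""
    version: str = ""
    owner: str = ""
    purpose: str = ""
    tool_class: str = ""
    class_rationale: str = ""
    # None means "not documented"; an empty list means "documented as none".
    side_effects: Optional[list[str]] = None
    reversible: Optional[bool] = None
    required_scope: str = ""
    error_codes: list[str] = field(default_factory=list)
    preconditions: Optional[list[str]] = None
    postconditions: Optional[list[str]] = None


def design_gaps(entry: ContractEntry) -> list[int]:
    """Return the design item numbers (1-6) the entry cannot yet answer."""
    gaps = []
    if not all([entry.name, entry.version, entry.owner, entry.purpose]):
        gaps.append(1)
    if entry.tool_class not in TOOL_CLASSES or not entry.class_rationale:
        gaps.append(2)
    if entry.side_effects is None or entry.reversible is None:
        gaps.append(3)
    # A wildcard scope such as payments:* is not verb-noun level.
    if not entry.required_scope or "*" in entry.required_scope:
        gaps.append(4)
    if not entry.error_codes:
        gaps.append(5)
    if entry.preconditions is None or entry.postconditions is None:
        gaps.append(6)
    return gaps


# A draft with identity filled in but a wildcard scope and nothing else decided.
draft = ContractEntry(
    name="refund_payment", version="1.0.0", owner="payments-platform",
    purpose="Refund a captured payment", tool_class="irreversible",
    required_scope="payments:*",
)
print(design_gaps(draft))  # [2, 3, 4, 5, 6]
```

A non-empty result means the design phase is not done. The check only proves that each field was filled in; whether the rationale survives a reviewer challenge is still a human judgement.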
Build (7–13)
The next seven items are about the contract layer implementation.
7. Schema validation. Are tool arguments validated against a strict schema (additionalProperties: false) on input? Is the return shape validated against the contract on response? (Chapters 02, 07.)
8. Idempotency. For write-non-idempotent and irreversible tools, is the idempotency key required, generated at the workflow step boundary, and dedup-checked in a tier-zero store with atomicity between side effect and record? (Chapter 04.)
9. Error translation. Is the downstream system's native error space mapped to the contract's error enum at the boundary, including a UPSTREAM_UNCLASSIFIED catch-all wired to a monitor? (Chapter 05.)
10. Credential resolution. Is the credential minted per call with a short TTL, bound to the tenant, and narrowed to the required scope? Are credentials never cached across calls and never exposed to the model? (Chapter 06.)
11. Audit emission. Does the contract layer emit one structured audit record per call, covering request/response payloads, scope, idempotency state, preconditions, postconditions, and costs, with redactions applied at write? (Chapter 11.)
12. Tracing. Does the contract layer participate in W3C trace propagation so the call's span links to its parent in the agent's trace? (Chapter 11.)
13. Tests in CI. Do schema well-formedness, version-compatibility, pact, and behavioural tests run on every contract PR? Are pacts also scheduled against staging downstream? (Chapter 10.)
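Items 8 and 9 meet at the same boundary, so they can be sketched together. The following is a minimal illustration under loud assumptions: the error codes other than UPSTREAM_UNCLASSIFIED, the `TRANSLATION` table, the `metrics` counter, and `call_tool` are all hypothetical, and the in-memory dictionary stands in for a tier-zero store without providing the atomicity between side effect and record that item 8 requires.

```python
from collections import Counter
from typing import Callable


class ContractError(Exception):
    """Error expressed in the contract's enum rather than the downstream's."""

    def __init__(self, code: str, retriable: bool):
        super().__init__(code)
        self.code = code
        self.retriable = retriable


# Hypothetical mapping from downstream exception types to (code, retriable).
TRANSLATION = {
    TimeoutError: ("TRANSIENT_TIMEOUT", True),
    PermissionError: ("PERMISSION_DENIED", False),
    KeyError: ("NOT_FOUND", False),
}

metrics = Counter()  # stand-in for the metric a monitor would alert on
_results: dict[str, object] = {}  # in-memory stand-in for the dedup store


def translate(exc: Exception) -> ContractError:
    """Map a downstream exception to the contract enum (item 9)."""
    for exc_type, (code, retriable) in TRANSLATION.items():
        if isinstance(exc, exc_type):
            return ContractError(code, retriable)
    # Catch-all: count it so that drift shows up on a dashboard.
    metrics["UPSTREAM_UNCLASSIFIED"] += 1
    return ContractError("UPSTREAM_UNCLASSIFIED", False)


def call_tool(idempotency_key: str, side_effect: Callable[[], object]) -> object:
    """Run side_effect at most once per key; replay the stored result on retry (item 8)."""
    if not idempotency_key:
        raise ContractError("INPUT_MISSING_IDEMPOTENCY_KEY", False)
    if idempotency_key in _results:
        metrics["DEDUP_HIT"] += 1
        return _results[idempotency_key]
    try:
        result = side_effect()
    except ContractError:
        raise
    except Exception as exc:
        raise translate(exc) from exc
    # A real store must record the result atomically with the side effect;
    # this sketch does not, so a crash between the two lines can duplicate work.
    _results[idempotency_key] = result
    return result
```

A retry that reuses the key generated at the workflow step boundary gets the stored result back instead of a second side effect, and any downstream exception the table does not recognise surfaces as UPSTREAM_UNCLASSIFIED while incrementing the counter that item 15's dashboard would read.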
Launch (14–17)
Four items that must be true before the tool is enabled in production.
14. Dual-run posture. If this version supersedes a prior major version, is the dual-run window documented, the deprecation comms sent, and v1 traffic monitored? If this is the first version, is the version pinned to a meaningful number (not 0.0.1 for life)? (Chapter 08.)
15. Live monitors wired. Are the on-call dashboards live: error rate per code, UPSTREAM_UNCLASSIFIED rate, postcondition violation rate, latency p95/p99, cost, scope-denial rate, idempotency dedup-hit rate? (Chapters 09, 11.)
16. Runbook. Is there a runbook for this tool, covering the top expected alerts, common drift scenarios, manual reversal paths if any, and the escalation tree? Module 02_ai_infrastructure/06_ai_runbooks_oncall covers the form; the contract layer owns the contents. (Chapter 11.)
17. Sign-off from the underlying-system team. Has the team owning the downstream system reviewed the contract? Do they have a copy of the registry entry? Have they agreed on the notification protocol for breaking changes? (Chapter 09.)
A tool that launches without item 17 is the chapter-9-opening incident waiting to happen.
Operate (18–20)
The three habits that keep the contract honest across the years it runs.
18. Drift review cadence. Once a month (or on every significant downstream release), is someone reviewing: UPSTREAM_UNCLASSIFIED rates, postcondition violations, schema diffs against published downstream specs, and audit anomalies? Is the translator extended and tests added when drift is found? (Chapter 09.)
19. Quarterly god-key audit. Once a quarter, is the credential inventory walked: which credentials does the platform hold, what permissions does each carry, is any credential broader than its consumers need? Are findings tracked and retired in order of blast radius? (Chapter 06.)
20. Postmortem feedback. When an incident happens involving this tool, does the postmortem update the contract — extending the error enum, tightening the schema, adding a precondition, sharpening the audit — and not just patch the immediate bug? (All chapters.)
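The "schema diffs against published downstream specs" part of item 18 can be as simple as comparing two field maps. This is a rough sketch that assumes both the pinned contract and the published spec have already been flattened to a field-name to type-name dictionary; `schema_drift` and the example fields are hypothetical.

```python
def schema_drift(expected: dict[str, str], observed: dict[str, str]) -> dict[str, list[str]]:
    """Compare field-name -> type-name maps and report what changed."""
    return {
        # Fields the contract depends on that the downstream no longer publishes.
        "removed": sorted(expected.keys() - observed.keys()),
        # Fields the downstream publishes that the contract does not know about.
        "added": sorted(observed.keys() - expected.keys()),
        # Fields present on both sides whose type differs.
        "retyped": sorted(
            name for name in expected.keys() & observed.keys()
            if expected[name] != observed[name]
        ),
    }


pinned = {"id": "string", "amount": "integer", "currency": "string"}
published = {"id": "string", "amount": "string", "status": "string"}
drift = schema_drift(pinned, published)
# drift == {"removed": ["currency"], "added": ["status"], "retyped": ["amount"]}
```

Anything under "removed" or "retyped" is a candidate for extending the translator and adding a test, as item 18 asks; "added" fields are usually informational.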
Using the checklist in a design review
A twenty-minute review walks the twenty items in order. For each:
- Green — answer with the artefact (contract entry, test name, monitor link)
- Yellow — answer with a plan and a date
- Red — no answer; this is a gap that blocks the design
The reviewer's job is to be specific about reds and yellows. "The error enum looks thin" is not specific; "the enum has no PolicyError or PreconditionError codes, and the downstream is known to return both" is.
A tool that has 18+ green at design review and 20 green at launch review is unlikely to contribute to the next quarter's incident list. A tool with several reds at design review is a tool that has not yet been designed.
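The traffic-light scheme above can be recorded per item and checked at each review. The sketch below is one possible encoding, not a prescribed tool: `Status` and `review_outcome` are hypothetical names, and the rule that a design review passes when there are no reds is an assumption drawn from "a gap that blocks the design".

```python
from enum import Enum


class Status(Enum):
    GREEN = "green"    # answered with an artefact
    YELLOW = "yellow"  # answered with a plan and a date
    RED = "red"        # no answer


def review_outcome(statuses: dict[int, Status], stage: str) -> tuple[bool, list[int]]:
    """Return (passes, blocking_items) for a 'design' or 'launch' review."""
    missing = [i for i in range(1, 21) if i not in statuses]
    if missing:
        raise ValueError(f"unreviewed items: {missing}")
    if stage == "design":
        # Assumption: reds block the design; yellows carry a plan and a date.
        blocking = [i for i, s in sorted(statuses.items()) if s is Status.RED]
    elif stage == "launch":
        # The launch bar in the text is 20 green.
        blocking = [i for i, s in sorted(statuses.items()) if s is not Status.GREEN]
    else:
        raise ValueError(f"unknown stage: {stage}")
    return (not blocking, blocking)
```

The list of blocking items is what the reviewer should turn into specific findings; the boolean alone says nothing about why an item is red.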
Using the checklist at a postmortem
After an incident, walk the checklist with the question "which item, if it had been green, would have prevented or shortened this incident?"
Common postmortem-to-checklist mappings:
- "We didn't know the downstream changed" → item 18 (drift review), item 13 (scheduled pacts)
- "The agent retried and produced duplicates" → item 8 (idempotency)
- "We surfaced a stack trace to a customer" → item 5 (error enum), item 9 (translator)
- "A successful call exfiltrated cross-tenant data" → item 4 (scope, tenant binding), item 10 (credential resolution)
- "We could not reconstruct what happened" → item 11 (audit), item 12 (tracing)
- "The compliance ask took three days" → item 11 (audit completeness and queryability)
- "The downstream removed a field we depended on without telling us" → item 17 (sign-off), item 9 (translator), item 13 (scheduled pacts)
The output of a postmortem should be a delta on the checklist for that tool. If the tool was 18/20 green going in and comes out with a clear remediation item against a specific checklist item, the postmortem has produced a contract change.
When the checklist is overkill
Two cases where this checklist is more than the tool requires:
- Internal-only experiments. A tool exposed to a single engineer's agent for a one-off task, never seen by users, never touching shared state, can be built without items 8, 14, 15, 16, 17. Items 1, 4, 5, 11 still apply because even experiments accumulate.
- Pure-read tools against well-typed sources. A read tool against a stable internal API can skip item 8 and item 14 (no dual-run is needed if this is truly the first version), and treat item 6 as light (read postconditions are cheap). Audit (item 11), scope (item 4), and tests (item 13) still apply.
The exceptions are explicit. A tool that quietly drops half the checklist without justification is the next incident; a tool that explicitly waives items with rationale ("internal-only, single-tenant, item 14 N/A") is being managed.
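An explicit waiver is easy to represent and to lint. The sketch below assumes waivers are stored next to the contract entry; `Waiver` and `invalid_waivers` are hypothetical names, and the only rule enforced is the one stated above, that a waived item carries a rationale.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Waiver:
    item: int        # checklist item number, 1-20
    rationale: str   # e.g. "internal-only, single-tenant, item 14 N/A"


def invalid_waivers(waivers: list[Waiver]) -> list[str]:
    """Return a human-readable problem for each waiver that is not acceptable."""
    problems = []
    for w in waivers:
        if not 1 <= w.item <= 20:
            problems.append(f"item {w.item}: not a checklist item")
        elif not w.rationale.strip():
            problems.append(f"item {w.item}: waived without rationale")
    return problems
```

An empty result means every waiver is at least explicit; whether the rationale is good enough is still a review decision.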
What this checklist does not cover
The checklist is the contract layer's discipline. It does not replace:
- Model-side concerns: prompt design, tool description quality, model evaluation of when to call the tool. Those are module 01 (agentic system design), module 13 (prompt lifecycle), module 04_ai_product_evals.
- Agent-level safety: blast-radius governance, approval gates at the agent level, kill switches. Module 01 chapter 07.
- Workflow durability: checkpointing, replay, compensation across multi-step workflows. Module 02.
- Underlying system correctness: the downstream's own bugs are out of scope; the contract layer treats them as drift to be detected, not solved.
A green checklist gives you a defensible contract layer. The rest of the agent stack still has to do its job.
Interview Q&A
Q1. You inherit a tool that has been in production for a year. Where do you start the audit?
Walk the twenty items in order. Identity (item 1) and class (item 2) come first: they are often the most stale. Side effects (item 3) and scope (item 4) come next: they are usually the most over-broad. Error enum (item 5) and validation (item 6) are usually under-built. Then items 7–13 in the contract layer. Items 14–17 are launch-time and may not apply; items 18–20 are ongoing and tell you whether the team has been operating the contract or just running it. The result is a triaged list of reds and yellows that becomes the work.
Wrong-answer notes: starting with "read the code" misses that the artefacts (contract, monitors, tests) should be the audit subjects, not the implementation.
Q2. Which three items are the most commonly missing in early-stage agent platforms?
Item 4 (scope: almost always god-keys), item 8 (idempotency: usually implemented at the wrong layer), and item 11 (audit: usually conflated with application logs). These three are the highest-value early fixes.
Wrong-answer notes: "all of them" is true and unhelpful.
Q3. The team argues that the checklist is too heavy for a small startup. What is your response?
The checklist is the discipline of running an API platform against a non-deterministic client. The cost of not doing it is paid eventually, usually in incidents that are several times the cost of the discipline. That said, the checklist has explicit "overkill" cases and supports waivers with rationale. A startup can launch with reds on items 14, 15, 16, 17 (operational items) for an experimental tool, but should not have reds on items 1–6 (design items) under any circumstance. The compromise is "design items always; operational items where they matter."
Wrong-answer notes: caving to "we'll do it later" produces the inherited mess that Q1's audit has to clean up.
Q4. What item on this checklist do you think is most under-appreciated?
Item 18 (drift review cadence). Drift is silent; the cost of an undetected drift compounds across every call it affects; and unlike the implementation items, drift review requires time every month, which is the resource teams most often skip. A platform that has items 1–17 all green but skips 18 will accumulate drift incidents until 18 becomes mandatory by force.
Wrong-answer notes: any specific item is defensible; what distinguishes the answer is the reasoning about silent compounding cost.
Bridge. The checklist is the engineer's defence. The last chapter is the honest opposite — what tool integration contracts still cannot solve, where the discipline is young, and the limits a thoughtful lead should be transparent about with their team and stakeholders. → 13-honest-admission.md