
12. Architect checklist

~7 min read. The previous eleven chapters built and operated the apparatus. This chapter condenses the module into a twenty-item checklist a lead engineer runs through to validate an AI feature's on-call apparatus before launch, before quarterly review, or after a major change.

Continues from 11-oncall-health-and-burnout.md. This chapter is the synthesis: a checklist scoped to the apparatus's six surfaces.

The checklist has three modes: design review (before the apparatus design is signed off), pre-launch review (before the feature ships), and quarterly audit (on shipped features). Each item is yes/no with a fallback action.


Alert plane (items 1–4)

1. Are all five paging conditions wired (quality, prompt-version, provider-drift, cost-spike, safety violation)? Each with severity, payload schema, runbook link, owning team. If no: wire the missing condition before launch; do not ship with a failure family unmonitored.

2. Are quality alerts sliced by intent, tenant tier, and relevant cohort? Aggregate alerts miss slice regressions. If no: slice the eval signal at the source; aggregate-only is a structural blind spot.

3. Are alerts anchored to deploys? Prompt and model deploys tighten thresholds for the duration of a deploy window, and alerts fired in that window carry the deploy ID in the payload. If no: wire the deploy event stream into the alert engine.

4. Are alert precision and recall measured against postmortems? Track both per alert, quarter over quarter. If no: set up the quarterly review; without measurement, alerts drift.
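
Item 4 can be sketched as a small calculation. The function below is a minimal illustration, not a prescribed implementation: the name `alert_precision_recall`, the input shapes, and the incident IDs are all hypothetical. It assumes each firing has already been matched to a postmortem incident (or marked as a false positive) during review.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AlertStats:
    precision: Optional[float]  # None when the alert never fired
    recall: Optional[float]     # None when no incident applied to this alert


def alert_precision_recall(fired_incident_ids, relevant_incident_ids):
    """Score one alert over a review window (hypothetical helper).

    fired_incident_ids: one entry per firing; the postmortem incident ID
        the firing was matched to, or None for a false positive.
    relevant_incident_ids: incidents from postmortems that this alert
        should have caught.
    """
    relevant = set(relevant_incident_ids)
    true_positive_firings = [i for i in fired_incident_ids if i is not None]
    caught = set(true_positive_firings) & relevant

    precision = (
        len(true_positive_firings) / len(fired_incident_ids)
        if fired_incident_ids else None
    )
    recall = len(caught) / len(relevant) if relevant else None
    return AlertStats(precision=precision, recall=recall)
```

Running this per alert each quarter and storing the results gives the trend the item asks for; an alert whose recall is `None` for several quarters is a candidate for retirement review.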


Rotation plane (items 5–7)

5. Does the AI surface have a named primary, backup, lead, and SME contacts? The roster is published, versioned, accessible from the alert payload. If no: the page may land on an undefined hop.

6. Is the training gate enforced before rotation entry? Two drill participations, a runbook walk-through, a tooling check. If no: untrained engineers enter the rotation; the apparatus's effective capacity is lower than its headcount.

7. Are handoffs both live and written? Live for comprehension; written for persistence. If no: context is lost at handoff; the next incident pays the cost.
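
The training gate in item 6 is easy to enforce mechanically. A minimal sketch, assuming training records are available as structured data; the `TrainingRecord` type and both function names are hypothetical, and only the three requirements themselves come from the item.

```python
from dataclasses import dataclass

REQUIRED_DRILLS = 2  # "two drill participations" from item 6


@dataclass
class TrainingRecord:
    engineer: str
    drill_participations: int
    runbook_walkthrough_done: bool
    tooling_check_done: bool


def missing_requirements(record):
    """List the gate requirements this engineer has not yet met."""
    missing = []
    if record.drill_participations < REQUIRED_DRILLS:
        missing.append(f"drills: {record.drill_participations}/{REQUIRED_DRILLS}")
    if not record.runbook_walkthrough_done:
        missing.append("runbook walk-through")
    if not record.tooling_check_done:
        missing.append("tooling check")
    return missing


def may_enter_rotation(record):
    """True only when every gate requirement is met."""
    return not missing_requirements(record)
```

Returning the list of missing requirements, rather than a bare boolean, tells the engineer and the rotation lead what is left to do.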


Runbook plane (items 8–10)

8. Does each paging condition have a corresponding runbook card? Five sections: identification, first ten minutes, diagnosis, mitigation, escalation. If no: the on-call has the page but no procedure.

9. Are the steps executable, not descriptive? Each step states the action, the expected output, and what to do if the output differs. If no: under stress, descriptive steps degenerate into improvisation.

10. Is the freshness gate enforced by tooling? Last-validated date; staleness warning in the on-call's tooling. If no: runbooks rot silently.
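
The freshness gate in item 10 can be sketched as a check over last-validated dates. This is an illustration under assumptions: the 90-day threshold is a placeholder (the chapter does not specify one), and `stale_runbooks` and the card names are hypothetical.

```python
from datetime import date, timedelta

# Assumed threshold for illustration; the chapter does not give a number.
MAX_AGE = timedelta(days=90)


def stale_runbooks(last_validated, today, max_age=MAX_AGE):
    """Return the names of runbook cards that should show a staleness warning.

    last_validated: dict of card name -> date of last validation,
        or None if the card has never been validated.
    """
    stale = []
    for name, validated_on in sorted(last_validated.items()):
        if validated_on is None or today - validated_on > max_age:
            stale.append(name)
    return stale


cards = {
    "quality": date(2024, 1, 10),
    "cost-spike": date(2024, 5, 1),
    "safety": None,
}
print(stale_runbooks(cards, today=date(2024, 6, 1)))  # ['quality', 'safety']
```

A card with no validation date is treated as stale, so a newly written runbook cannot bypass the gate by omitting the field.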


Escalation plane (items 11–12)

11. Is the escalation graph published with named hops and SLOs? Channel, SLO, authority, handover-context schema for each hop. If no: "escalate" is a guess, not a contract.

12. Is the provider escalation template pre-written? With affected model, magnitude, traffic share, business impact placeholders. If no: under stress, the on-call writes vague tickets that get vague responses.
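
A pre-written provider template (item 12) can also refuse to render when a placeholder is left empty, which is one way to prevent vague tickets. A minimal sketch using the standard library's `string.Template`; the field names mirror the four placeholders in the item, while the ticket wording and `render_provider_ticket` are hypothetical.

```python
from string import Template

# Hypothetical ticket wording; only the four placeholders come from item 12.
PROVIDER_TICKET = Template(
    "Affected model: $affected_model\n"
    "Magnitude: $magnitude\n"
    "Traffic share affected: $traffic_share\n"
    "Business impact: $business_impact\n"
)

REQUIRED_FIELDS = ("affected_model", "magnitude", "traffic_share", "business_impact")


def render_provider_ticket(**fields):
    """Render the ticket, rejecting missing or blank fields."""
    empty = [
        name for name in REQUIRED_FIELDS
        if not str(fields.get(name, "")).strip()
    ]
    if empty:
        raise ValueError(f"missing ticket fields: {', '.join(empty)}")
    return PROVIDER_TICKET.substitute(fields)
```

Failing loudly on a blank field moves the cost of a vague ticket to the moment of writing, when the on-call can still look up the number.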


Postmortem plane (items 13–14)

13. Does the template include eval delta and apparatus update as mandatory fields? Other required fields: cause (AI taxonomy), blast radius, follow-ups. If no: the apparatus does not learn from incidents.

14. Is the follow-up closure rate tracked and reported? 80% within 30 days is the typical target. If no: postmortems become a wishlist.
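
The closure-rate target in item 14 can be computed as follows. This is a sketch under one stated assumption: a follow-up only counts toward the rate once its 30-day window has elapsed, so recently opened items do not drag the number down. The function name and input shape are hypothetical.

```python
from datetime import date, timedelta

TARGET = 0.80  # "80% within 30 days" from item 14


def closure_rate(follow_ups, as_of, window=timedelta(days=30)):
    """Fraction of due follow-ups that were closed within the window.

    follow_ups: list of (opened_on, closed_on) date pairs;
        closed_on is None while the follow-up is still open.
    Returns None when no follow-up has reached the end of its window yet.
    """
    due = [(opened, closed) for opened, closed in follow_ups
           if opened + window <= as_of]
    if not due:
        return None
    closed_in_time = sum(
        1 for opened, closed in due
        if closed is not None and closed - opened <= window
    )
    return closed_in_time / len(due)
```

Comparing the result against `TARGET` each reporting period turns the item into a yes/no answer; a `None` result means there is nothing to report yet, not that the target was met.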


Drill plane (items 15–17)

15. Does the drill calendar run on cadence? Tabletops monthly per surface, dry-runs quarterly, live drills semi-annually. If no: the apparatus rusts between incidents.

16. Does the scenario library cover the failure families and the recent postmortems? One scenario per family, per runbook, per postmortem, plus quarterly adversarial. If no: drills are speculative and coverage has gaps.

17. Is the readiness score computed and trended? Six dimensions: alert recall, alert precision, rotation reach, runbook fitness, escalation latency, apparatus updates. If no: drill health is unmeasured; degradation is silent.
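
One way to compute and trend the readiness score in item 17 is sketched below. The six dimension names come from the item; everything else is an assumption made for illustration: each dimension is taken to be already normalised to the range 0 to 1, the overall score is an unweighted mean, and the function names are hypothetical.

```python
DIMENSIONS = (
    "alert_recall",
    "alert_precision",
    "rotation_reach",
    "runbook_fitness",
    "escalation_latency",
    "apparatus_updates",
)


def readiness_score(scores):
    """Return (overall, weakest_dimension) for one drill.

    scores: dict mapping each dimension to a value in [0, 1].
    The unweighted mean is an assumption, not a prescribed formula.
    """
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {', '.join(missing)}")
    for d in DIMENSIONS:
        if not 0.0 <= scores[d] <= 1.0:
            raise ValueError(f"{d} must be between 0 and 1")
    overall = sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
    weakest = min(DIMENSIONS, key=lambda d: scores[d])
    return overall, weakest


def trend(overall_history):
    """Drill-to-drill change in the overall score, oldest first."""
    return [later - earlier
            for earlier, later in zip(overall_history, overall_history[1:])]
```

Reporting the weakest dimension alongside the mean keeps a single collapsing dimension from hiding behind five healthy ones.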


Health plane (items 18–20)

18. Are the load metrics tracked per rotation and per engineer? Pages per shift, off-hours rate, sustained shift, ack latency, recovery time, satisfaction. If no: burnout surfaces as attrition, not as a metric.

19. Is the alert false-positive rate below 20% per alert? False positives drive alert fatigue and apparatus erosion. If no: schedule a tuning sprint; do not let the FP rate persist.

20. Are rotation imbalances visible and addressable? Per-engineer page count, off-hours rate, satisfaction published quarterly. If no: the apparatus's fairness is unmeasured; senior engineers carry the load until they leave.
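
Items 19 and 20 both reduce to simple threshold checks. The sketch below is illustrative: the 20% false-positive limit comes from item 19, but the imbalance rule (flag anyone above 1.5 times the median page count) is an assumed heuristic, and both function names are hypothetical.

```python
from statistics import median

FP_RATE_LIMIT = 0.20  # item 19: false-positive rate should stay below 20%


def noisy_alerts(firings):
    """Alerts whose false-positive rate is at or above the limit.

    firings: dict of alert name -> (false_positives, total_firings).
    Alerts that never fired are skipped.
    """
    return sorted(
        name for name, (false_positives, total) in firings.items()
        if total > 0 and false_positives / total >= FP_RATE_LIMIT
    )


def overloaded_engineers(pages_by_engineer, ratio=1.5):
    """Engineers paged more than `ratio` times the rotation median.

    The 1.5x ratio is an assumption for illustration only.
    """
    if not pages_by_engineer:
        return []
    typical = median(pages_by_engineer.values())
    return sorted(
        name for name, pages in pages_by_engineer.items()
        if pages > ratio * typical
    )
```

The median is used rather than the mean so that one heavily paged engineer does not raise the baseline they are being compared against.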


How to score

The checklist is not pass/fail. A reasonable target:

  • All twenty items are answered consciously (yes; no with mitigation; not applicable with reasoning).
  • Items 1, 5, 8, 11, 13, 18 are non-negotiable for any AI feature with non-trivial user impact.
  • Items can be deferred with documented owners and dates; deferrals are reviewed quarterly.

A surface that scores well on the first launch will still drift. The quarterly audit catches the drift.
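
The scoring rules above can be expressed as a review function. This is a sketch, not a prescribed tool: the `ChecklistItem` type, the answer vocabulary, and `review_findings` are hypothetical, and treating a non-negotiable item as blocking unless it is answered "yes" is one interpretation of the rule.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

NON_NEGOTIABLE = frozenset({1, 5, 8, 11, 13, 18})
ANSWERS = {"yes", "no", "n/a", "deferred"}  # assumed vocabulary


@dataclass
class ChecklistItem:
    number: int
    answer: str                 # one of ANSWERS
    note: str = ""              # mitigation ("no") or reasoning ("n/a")
    owner: str = ""             # required for deferrals
    due: Optional[date] = None  # required for deferrals


def review_findings(items):
    """Return a list of problems; an empty list means the review is clean."""
    findings = []
    by_number = {item.number: item for item in items}
    for n in range(1, 21):
        item = by_number.get(n)
        if item is None or item.answer not in ANSWERS:
            findings.append(f"item {n}: not answered")
            continue
        if item.answer in {"no", "n/a"} and not item.note.strip():
            findings.append(f"item {n}: '{item.answer}' needs a mitigation or reasoning")
        if item.answer == "deferred" and not (item.owner and item.due):
            findings.append(f"item {n}: deferral needs an owner and a date")
        if n in NON_NEGOTIABLE and item.answer != "yes":
            findings.append(f"item {n}: non-negotiable, expected 'yes'")
    return findings
```

A well-documented deferral of item 8 still produces a finding here, which matches the point that a count of completed items understates the risk of a deferred non-negotiable.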


When to extend the checklist

The checklist is a starting point. Domain-specific extensions:

  • Healthcare: regulator notification step in escalation, clinical-evidence trace in postmortem.
  • Finance: audit-grade postmortem retention, regulatory escalation hop.
  • Legal: privileged-context handling in incident channels.
  • Children's products: safety-violation escalation to platform trust-and-safety.

The base twenty are universal; the extensions are domain-specific. Both belong on the same review.


What the checklist does not cover

  • The technical design of the AI feature itself (model, prompt, retrieval) — that is the AI engineering modules.
  • The business decision to ship the feature.
  • The product surface UX — that is the human-AI product experience module.

The checklist validates the on-call apparatus; it is one gate, not the only one.


Operational signals

Healthy. The checklist is run at three checkpoints (design, pre-launch, quarterly). Scores improve over time. Deferrals are tracked and remediated.

First degrading metric. Deferrals not closing. Items that the team meant to address are being indefinitely deferred; the apparatus has gaps the team has stopped tracking.

Misleading metric. Number of items "complete." A surface with 18 items complete but two critical items deferred (alert wiring, rotation backup) has more risk than the count suggests. The non-negotiable items carry disproportionate weight.

Expert graph. The matrix of features × items, with cell colour (green/yellow/red) over time. Trending the matrix surfaces both team improvements and team regressions.
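
Trending the features × items matrix amounts to comparing two snapshots. A minimal sketch, assuming each quarter's matrix is stored as a nested dict of colours; the `regressions` function and the feature names are hypothetical.

```python
RANK = {"green": 2, "yellow": 1, "red": 0}


def regressions(previous, current):
    """Cells whose colour got worse between two quarterly snapshots.

    previous, current: dict of feature -> {item_number: colour}.
    Returns (feature, item_number, old_colour, new_colour) tuples.
    Cells with no previous value are skipped, not treated as regressions.
    """
    worse = []
    for feature, items in sorted(current.items()):
        for item, colour in sorted(items.items()):
            old = previous.get(feature, {}).get(item)
            if old is not None and RANK[colour] < RANK[old]:
                worse.append((feature, item, old, colour))
    return worse


last_quarter = {"search": {1: "green", 8: "green"}, "chat": {1: "yellow"}}
this_quarter = {"search": {1: "green", 8: "yellow"}, "chat": {1: "green"}}
print(regressions(last_quarter, this_quarter))
# [('search', 8, 'green', 'yellow')]
```

The same comparison with the inequality reversed lists improvements, which is the other half of what the matrix is meant to surface.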


Boundary of applicability

Strong fit. Lead engineers running pre-launch, post-incident, or quarterly apparatus reviews. The checklist makes review thorough and consistent.

Pathology. Treating the checklist as a launch-blocker only. The quarterly audit is where the checklist catches drift; skipping it means the apparatus degrades silently after launch.

Scale limit. Very large platforms have many features and many checklists; the meta-problem is consistency of scoring. The pattern is a platform-team standard with feature-team self-scoring plus periodic central audit.


Failure-prone assumption

The seductive wrong belief: the checklist is exhaustive. It is not. It is a starting point. Every team will discover items specific to their domain — regulatory steps, cross-team handoffs, particular failure modes — that warrant additional checklist items. The correct belief: the checklist is a baseline that the team extends.


Where this appears in production

  • A fintech runs the checklist at design, pre-launch, and quarterly; apparatus drift is caught early.
  • A telecom AI has the checklist integrated with their production readiness review (PRR).
  • A consumer chatbot uses the checklist for launches but skips quarterly audits; apparatus drifts.
  • A healthtech AI has healthcare-extended checklist items; regulatory steps are validated.
  • A coding assistant treats items 1, 5, 8, 11, 13, 18 as non-negotiable; features without them do not ship.
  • A retail AI has the checklist as a self-scored artefact; central platform audits quarterly.
  • A logistics AI has the checklist score on the leadership dashboard; aggregate health is visible.
  • A government AI has the checklist as a regulatory artefact; the apparatus discipline produces the artefact as a byproduct.
  • A B2B SaaS has the checklist published in the engineering wiki; everyone can see the bar.
  • A travel platform had checklist items deferred indefinitely; the next incident exposed the gap.
  • A payments AI runs the checklist as part of every quarterly business review.
  • A legal AI has legal-extended checklist items; counsel sign-off on the apparatus for regulated features.
  • A staffing AI runs the checklist at quarterly engineering reviews; trends are visible.
  • A search-rerank service has the checklist as the gate for promoting features from alpha to GA.
  • A document AI treats the checklist as a living document; team-discovered items are added.
  • A media AI has the checklist's quarterly audit as a leadership commitment.
  • An ad-tech AI treats item 4 (alert precision/recall) as a quarterly platform team activity.
  • A real-estate AI has the checklist published with examples of typical green/yellow/red answers.
  • A medical AI uses the checklist to demonstrate regulator compliance; it is both internal discipline and external evidence.
  • A small SaaS uses a minimal version of the checklist; non-negotiable items still apply.

Recall / checkpoint

  1. Name the six surfaces the checklist covers.
  2. List the six non-negotiable items.
  3. What is the three-mode usage (design, pre-launch, quarterly)?
  4. What domain-specific extensions are common?
  5. What signals a degrading checklist process?
  6. Why is item 4 (alert precision/recall) load-bearing?
  7. Why is the checklist "a baseline that the team extends"?

Interview Q&A

Q1. The team passes their feature's model evals but the apparatus checklist surfaces three gaps. What do you push back on? Push back on shipping with the gaps open. The shape is the same as in the human-AI UX checklist: model evals measure model quality; the apparatus checklist measures operational quality. A model with perfect evals shipped without alert wiring, escalation paths, and postmortem templates will produce incidents the apparatus cannot handle. The fixes are sprint-scale; the cost of skipping them is months of incidents and apparatus debt. Common wrong answer to avoid: "the model passes, we ship". Apparatus debt is not visible in eval scores.

Q2. Which items are non-negotiable, and why? Items 1 (alerts wired), 5 (rotation named), 8 (runbooks exist), 11 (escalation graph), 13 (postmortem template), 18 (load metrics tracked). These six are the minimum viable apparatus: alerts to surface incidents, rotation to receive pages, runbooks to act, escalation to reach specialists, postmortems to learn, load tracking to sustain. Skipping any of them means the apparatus has a structural hole. Common wrong answer to avoid: "all items are equal". Non-negotiable items carry disproportionate weight.

Q3. How does the checklist interact with quarterly audit cadence? Each audit re-runs the checklist on each feature and compares scores to the previous quarter. Regressions are investigated for root cause (deferred items, drift, organisational change). Improvements are celebrated and standardised. The audit produces a backlog of apparatus engineering work: items to close, drifts to remediate, extensions to add. Without the cadence, the checklist is launch-only and the apparatus degrades after launch. Common wrong answer to avoid: "audit is annual". Annual cadence is too long; quarterly catches drift while it is still reversible.

Q4. The team has 18 of 20 items green but item 8 (runbooks) is deferred. How serious is this? Serious. Item 8 is non-negotiable; the apparatus has alerts but no procedure. The team is one page away from a real incident handled by improvisation. Closing the deferral should be the team's next sprint, not their next quarter. The 18-of-20 score masks the structural gap. Common wrong answer to avoid: "18 of 20 is good". Non-negotiable items carry the weight; 18 of 20 with a non-negotiable deferred is closer to broken than to 90%.

Q5. The platform team wants to mandate a uniform checklist across all feature teams. What is the right balance? Mandate the base twenty as the platform standard; encourage extensions per domain. Feature teams self-score; the platform team audits quarterly with sample-based deep dives. The uniformity is in the standard; the customisation is in the domain extensions. Avoid both extremes: total uniformity (misses domain context) and total customisation (no platform-wide standard). Common wrong answer to avoid: "let each team define their own". Without a platform standard, cross-team comparison is impossible and apparatus quality varies.

Q6. How do you use the checklist after a major incident? Treat the postmortem as input to the checklist. Which item, had it scored higher, would have prevented or reduced the incident? Update the apparatus accordingly. The incident is data; the checklist is the framework that translates data into apparatus updates. Common wrong answer to avoid: "checklists are for launch". They are the standing framework for apparatus quality.


Design / debug exercise (10 minutes)

Modelled example. Pick a current AI feature. Walk through the checklist and score each item green/yellow/red. Identify the non-negotiable items and assess them strictly.

Your turn. For your team's feature, fill in the twenty items honestly. Estimate the apparatus engineering work to close the red and yellow items. Schedule it.

Reproduce from memory. Name the six surfaces and the non-negotiable items per surface. The signal of internalisation is that the apparatus structure lands quickly without rereading.


Operational memory

This chapter explained the architect checklist: twenty items across six surfaces, non-negotiable items identified, three-mode usage (design, pre-launch, quarterly). The important idea is that the checklist is the standing framework for apparatus quality; it catches what individual reviews miss and trends apparatus health over time.

You learned to score each item, treat non-negotiable items strictly, extend with domain-specific items, and use the checklist as the bridge between incidents and apparatus updates. That solves the opening failure because apparatus quality now has a measurable, comparable, trendable artefact.

Carry this diagnostic forward: when a team says "our apparatus is good," ask for the checklist score. The score is the truth; the assertion is the appearance.

Remember:

  • Twenty items across six surfaces.
  • Six non-negotiable items: alerts wired, rotation named, runbooks exist, escalation graph, postmortem template, load tracking.
  • Three-mode usage: design, pre-launch, quarterly.
  • Domain-specific extensions are the team's contribution.
  • Quarterly audit catches drift while it is reversible.

Bridge. The checklist captures what good apparatus requires. The honest admission captures what apparatus cannot fix — the limits, the tradeoffs, the cases where the apparatus is not enough. The next chapter is that admission. → 13-honest-admission.md