12. Architect checklist¶
Measurement closes the operating loop. This chapter condenses the module into twenty items the lead engineer or designer can run through on any AI surface — design review, launch review, or quarterly audit — to catch the obvious failures before they ship.
A platform engineer at a Delhi healthtech company runs the checklist on a pre-launch AI feature. The model is good; the prompts are tuned; the evals pass. The checklist surfaces six gaps: no streaming UX, no uncertainty signal on borderline outputs, citations rendered as raw URLs, no correction affordance, handoff path buried two clicks deep, no accessibility test on assistive tech. None of these would have failed the model evals. All of them would have damaged adoption. The team fixes the gaps over the next sprint; launch goes cleanly.
This chapter is the checklist itself.
How to use the checklist¶
The checklist has three modes:
- Design review — run before mocks are signed off. Catches structural gaps when fixing is cheap.
- Pre-launch review — run before the feature ships. Catches implementation gaps.
- Quarterly audit — run on shipped features. Catches drift as the product evolves.
Each item is a yes/no question with a fallback action if the answer is no. The aim is not to pass every item — products differ — but to make the choices conscious.
Trust and friction (items 1–3)¶
1. Has the team articulated the trust asymmetry for this AI surface? Trust is built slowly and lost fast. The team should have a sentence or two on what trust costs the user, what loss looks like, and where the surface is most fragile. If no: run the trust-and-friction diagnosis from chapter 01 before further design.
2. Is there a clear primary user goal for the surface? "Users can ask anything" is not a goal; it is an absence of goals. The surface should have a primary task and one or two secondary tasks. If no: re-scope the surface around the primary goal.
3. Are friction sources tracked separately from engagement? Engagement metrics often mask friction. The team should know where users hesitate, retry, or abandon — separately from how much they click. If no: add abandonment-after-failure and time-to-correct-answer to the dashboard.
Latency UX (items 4–5)¶
4. Is the response streamed, with appropriate indicators? For text responses, streaming should be the default. Indicators should signal "thinking," "generating," and "done" without ambiguity. Long operations should report progress, not just an indeterminate spinner. If no: design streaming and indicator states before launch; a non-streamed default is a regression for every screen-reader and slow-network user.
5. Is the cancel path visible during generation? The user must be able to cancel a slow or wrong-direction generation. The cancel affordance is on the surface, not hidden. If no: add cancel; consider offering a re-queue or an alternative path once the user cancels.
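The indicator states in item 4 and the cancel path in item 5 can be sketched as a small state machine. This is a minimal illustration under stated assumptions, not a prescribed implementation: the `Phase` names, the `ResponseIndicator` class, and the transition table are hypothetical and would map onto whatever UI framework the surface uses.

```python
from enum import Enum


class Phase(Enum):
    THINKING = "thinking"      # request sent, no tokens yet
    GENERATING = "generating"  # tokens are streaming in
    DONE = "done"
    CANCELLED = "cancelled"


# Hypothetical transition table: every in-flight phase can reach CANCELLED.
ALLOWED = {
    Phase.THINKING: {Phase.GENERATING, Phase.CANCELLED},
    Phase.GENERATING: {Phase.DONE, Phase.CANCELLED},
    Phase.DONE: set(),
    Phase.CANCELLED: set(),
}


class ResponseIndicator:
    """Tracks which indicator the surface should show for one response."""

    def __init__(self) -> None:
        self.phase = Phase.THINKING
        self.text = ""

    @property
    def can_cancel(self) -> bool:
        # The cancel control stays visible for as long as work is in flight.
        return Phase.CANCELLED in ALLOWED[self.phase]

    def _move(self, target: Phase) -> None:
        if target not in ALLOWED[self.phase]:
            raise ValueError(f"illegal transition: {self.phase.value} -> {target.value}")
        self.phase = target

    def on_token(self, token: str) -> None:
        if self.phase is Phase.THINKING:
            self._move(Phase.GENERATING)
        if self.phase is not Phase.GENERATING:
            return  # tokens that arrive after cancel or completion are dropped
        self.text += token

    def on_complete(self) -> None:
        # A stream that ends with no tokens is left to the error path (item 10).
        if self.phase is Phase.GENERATING:
            self._move(Phase.DONE)

    def cancel(self) -> bool:
        if not self.can_cancel:
            return False
        self._move(Phase.CANCELLED)
        return True
```

Deriving `can_cancel` from the transition table, rather than storing it as a separate flag, keeps the cancel control from drifting out of sync with the indicator.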
Uncertainty surfacing (items 6–7)¶
6. Does the AI show its uncertainty when that uncertainty changes the user's action? If low confidence means "verify before acting," the user sees the signal on the surface. If confidence does not change the action, the signal can stay in the support layer. If no: identify the action-changing thresholds; surface the signal at those thresholds.
7. Does the AI say "I don't know" when it should? A refusal to guess is more trustworthy than a confident wrong answer. The product surface respects refusals rather than treating them as failures. If no: tune the model and prompt to enable graceful refusal; update the UX to render refusals respectfully.
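Items 6 and 7 together amount to a routing decision on confidence. The sketch below assumes the system exposes a calibrated confidence score in [0, 1], which the chapter does not specify; `Thresholds`, `render_decision`, and the returned labels are hypothetical names, and the threshold values would have to come from the team's own analysis of where the user's action changes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    refuse_below: float  # below this, the honest output is "I don't know"
    verify_below: float  # below this, the user should verify before acting


def render_decision(confidence: float, t: Thresholds) -> str:
    """Decide where the uncertainty signal lives for one response."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    if confidence < t.refuse_below:
        return "refuse"  # rendered respectfully, not as a failure
    if confidence < t.verify_below:
        return "answer_with_surface_signal"  # uncertainty changes the action
    return "answer_with_support_layer_signal"  # signal stays one click away
```

For example, with `Thresholds(refuse_below=0.3, verify_below=0.7)`, a response scored at 0.5 is answered with the warning on the surface.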
Explainability (items 8–9)¶
8. Are sources cited where the user expects them? For factual claims, citations are inline or one click away. The user can verify any non-trivial assertion. If no: add citation markers; ensure citations are real (not hallucinated) and clickable.
9. Is the reasoning available for users who want to inspect it? The reasoning trace is one click away, structured for skim-reading. Not on the surface; not hidden in depth. If no: expose a "Show reasoning" affordance in the support layer.
Error and recovery (items 10–11)¶
10. Does each error category have a specific UX? Timeout, refusal, empty response, wrong answer, and system error each have a distinct message and recovery path. No "Something went wrong" defaults. If no: design the per-category UX before launch.
11. Is the correction affordance on the surface, not buried? The user can signal "wrong" within two clicks from any AI response. The signal is captured into the eval pipeline. If no: add the affordance; verify the capture flows to telemetry, not just the application database.
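The per-category requirement in item 10 lends itself to a lookup table plus an audit that fails when a category is missing or falls back to generic copy. This is a sketch only: the message and recovery strings are placeholder copy, and `ErrorUX`, `ERROR_UX`, and `audit_error_ux` are hypothetical names.

```python
from enum import Enum
from typing import NamedTuple


class ErrorCategory(Enum):
    TIMEOUT = "timeout"
    REFUSAL = "refusal"
    EMPTY_RESPONSE = "empty_response"
    WRONG_ANSWER = "wrong_answer"
    SYSTEM_ERROR = "system_error"


class ErrorUX(NamedTuple):
    message: str
    recovery: str


# Placeholder copy for illustration; real wording belongs to the design team.
ERROR_UX = {
    ErrorCategory.TIMEOUT: ErrorUX("This is taking longer than expected.", "Retry, or keep waiting"),
    ErrorCategory.REFUSAL: ErrorUX("I can't help with that request.", "Rephrase, or talk to a person"),
    ErrorCategory.EMPTY_RESPONSE: ErrorUX("I couldn't produce an answer for that.", "Add detail and try again"),
    ErrorCategory.WRONG_ANSWER: ErrorUX("Thanks for flagging this answer as wrong.", "Submit a correction"),
    ErrorCategory.SYSTEM_ERROR: ErrorUX("The service is unavailable right now.", "Try again later, or talk to a person"),
}

GENERIC = "something went wrong"


def audit_error_ux(table: dict[ErrorCategory, ErrorUX]) -> list[str]:
    """Return a list of problems; an empty list means item 10 passes."""
    problems = []
    for category in ErrorCategory:
        ux = table.get(category)
        if ux is None:
            problems.append(f"{category.value}: no UX defined")
        elif GENERIC in ux.message.lower():
            problems.append(f"{category.value}: generic message")
    messages = [ux.message for ux in table.values()]
    if len(set(messages)) != len(messages):
        problems.append("duplicate messages across categories")
    return problems
```

Running an audit like this in CI is one way to keep a new error category from shipping without its own message and recovery path.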
Progressive disclosure (item 12)¶
12. Does the surface stay readable while depth remains accessible? The headline answer is at full readability. Reasoning and citations are one click away. Audit-grade depth is in a drawer. Anything that changes the user's decision to proceed is on the surface, not hidden. If no: re-design the layers; collapse what is over-exposed, expose what is over-hidden.
Handoff (items 13–14)¶
13. Is the path to a human visible from every AI surface? One click away, clearly labelled. No hidden escalation buttons. If no: surface the path; treat hidden escalation as a launch blocker for any user-facing AI.
14. Does the handoff transfer state to the human agent? The agent receives transcript, intent, context, tools called, reason, and suggested next step in a format that can be read in 15 seconds. If no: build the state transfer; stateless handoff is worse than no AI.
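The state transfer in item 14 can be pictured as a single structured packet with a short rendering for the agent. The field names follow the list in item 14; everything else here is an assumption for illustration, including the `HandoffPacket` type, the choice of which fields are required, and the three-turn excerpt.

```python
from dataclasses import dataclass


@dataclass
class HandoffPacket:
    transcript: list[str]
    intent: str
    context: dict[str, str]
    tools_called: list[str]
    reason: str
    suggested_next_step: str


# Assumption: context and tools_called may legitimately be empty.
REQUIRED = ("transcript", "intent", "reason", "suggested_next_step")


def missing_fields(packet: HandoffPacket) -> list[str]:
    """Names of required fields that are empty; empty list means ready to send."""
    return [name for name in REQUIRED if not getattr(packet, name)]


def agent_summary(packet: HandoffPacket, last_turns: int = 3) -> str:
    """Skimmable summary: the decision-relevant fields first, transcript last."""
    lines = [
        f"Intent: {packet.intent}",
        f"Why escalated: {packet.reason}",
        f"Suggested next step: {packet.suggested_next_step}",
        f"Tools called: {', '.join(packet.tools_called) or 'none'}",
        "Recent turns:",
        *[f"  {turn}" for turn in packet.transcript[-last_turns:]],
    ]
    return "\n".join(lines)
```

Putting intent, reason, and next step ahead of the transcript is one way to respect the 15-second reading budget; the full transcript can still travel with the packet.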
Correction and repair (items 15–16)¶
15. Is pushback cheap and the signal captured? The user can push back in one or two clicks with optional detail. The captured signal includes the full payload — original, corrected, reason, context, version. If no: lower the friction; verify the capture is structured, not just logged.
16. Does the AI handle pushback honestly in-conversation? The AI acknowledges, asks a clarifying question if ambiguous, and adjusts. The AI does not capitulate blindly when it has a verifiable source. If no: tune the prompts to handle correction explicitly; add an escalation path for disputed corrections.
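Item 15 asks for a structured capture rather than a log line. A minimal sketch of that payload follows, using the five fields the item names. The `CorrectionSignal` type, the JSON event shape, and the rule that only `original` and `version` are mandatory are assumptions made here to keep pushback cheap for the user.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CorrectionSignal:
    original: str   # what the AI said
    corrected: str  # what the user says it should have been; may be empty
    reason: str     # optional detail; may be empty
    context: dict   # e.g. conversation or document identifiers
    version: str    # model and prompt version that produced `original`


def to_telemetry_event(signal: CorrectionSignal) -> str:
    """Serialise the signal as a structured event for the eval pipeline."""
    if not signal.original or not signal.version:
        raise ValueError("original and version are required to replay the case")
    return json.dumps({"type": "correction", **asdict(signal)}, sort_keys=True)
```

Without `version`, a captured correction cannot be tied back to the model and prompt that produced the error, which is why this sketch refuses to emit the event without it.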
Onboarding and mental models (items 17–18)¶
17. Does onboarding teach what the AI does, what it does not, and how to correct it? The first session should land an accurate mental model. Three things the AI does well; one or two things it does not; a failure-correction rehearsal; the human path made visible. If no: rebuild onboarding around the four questions: intent, shape, boundary, recovery.
18. Are exclusions honest and visible? The AI's boundaries are stated up front, not discovered by failure. A "what can I ask?" hint stays accessible during the session, not just at onboarding. If no: surface the exclusions; design honesty into the surface.
Accessibility and inclusivity (item 19)¶
19. Has the surface been tested with assistive tech, slow networks, and non-default languages? Real users under real conditions, not just automated scanners. Screen-reader chunking, keyboard-only flow, large-text and high-contrast rendering, slow-network behaviour, localised AI responses, fallback paths for users who cannot use chat. If no: schedule the testing before launch; treat regressions as blocking.
Measurement (item 20)¶
20. Is the dashboard balanced across outcome, trust, repair, calibration, adoption, and safety? Engagement-only dashboards mask friction. The team has a north-star metric that cannot be gamed without improving the user's outcome, plus diagnostic metrics from each family and hard safety floors that cannot regress. If no: rebuild the dashboard; add the missing families.
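The balance test in item 20 can be checked mechanically once every metric is tagged with its family. The six family names come from the item; the `missing_families` helper and the idea of a metric-to-family mapping are assumptions for illustration.

```python
FAMILIES = ("outcome", "trust", "repair", "calibration", "adoption", "safety")


def missing_families(metrics: dict[str, str]) -> list[str]:
    """`metrics` maps a metric name to its family; returns families with no metric."""
    present = set(metrics.values())
    unknown = present - set(FAMILIES)
    if unknown:
        raise ValueError(f"unknown families: {sorted(unknown)}")
    return [family for family in FAMILIES if family not in present]
```

A dashboard that tracks only engagement-style metrics tagged `adoption` would come back with the other five families missing, which is the engagement-only pattern the item warns about.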
How to score¶
The checklist is not a pass/fail exercise. A reasonable target is:
- All twenty items are answered consciously (yes, no with mitigation, or not applicable with reasoning).
- Items 4, 10, 11, 13, 14, 19, 20 are non-negotiable for any user-facing AI in regulated or high-stakes domains.
- The remaining items can be deferred with documented rationale, but the rationale is reviewed quarterly.
A surface that scores well on the first launch will still drift. The quarterly audit catches the drift.
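The scoring rules above can be sketched as a review function that flags unanswered items, answers without a written note, and non-negotiable items that are not met. The item numbers come from the list above. The `Answer` values, the `ItemResult` type, and the reading of "non-negotiable" as "must be answered yes" are interpretations made for this sketch.

```python
from dataclasses import dataclass
from enum import Enum


class Answer(Enum):
    YES = "yes"
    NO_WITH_MITIGATION = "no_with_mitigation"
    NOT_APPLICABLE = "not_applicable"
    DEFERRED = "deferred"


NON_NEGOTIABLE = frozenset({4, 10, 11, 13, 14, 19, 20})


@dataclass
class ItemResult:
    answer: Answer
    note: str = ""  # mitigation, reasoning, or deferral rationale


def review(results: dict[int, ItemResult], high_stakes: bool) -> list[str]:
    """Return problems to resolve; an empty list means the target is met."""
    problems = []
    for item in range(1, 21):
        result = results.get(item)
        if result is None:
            problems.append(f"item {item}: not answered")
            continue
        if result.answer is not Answer.YES and not result.note.strip():
            problems.append(f"item {item}: '{result.answer.value}' needs a written note")
        if high_stakes and item in NON_NEGOTIABLE and result.answer is not Answer.YES:
            problems.append(f"item {item}: non-negotiable for high-stakes surfaces")
    return problems
```

The `note` field is what the quarterly review reads: a deferral without one is treated as an unconscious choice.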
When to extend the checklist¶
The checklist is a starting point. Specific domains will add:
- Healthcare: clinical-evidence rendering, dosing safety, regulatory disclosure.
- Finance: model-version disclosure, audit logging, regulator-required citations.
- Legal: counsel-disclaimer, jurisdiction handling, privileged-context protection.
- Children's products: age-gating, parental controls, content moderation.
The base twenty are universal. The extensions are domain-specific. Both belong on the same review.
What the checklist does not cover¶
- Visual design — colour, typography, layout aesthetics. Standard UX discipline applies.
- Backend AI architecture — model choice, prompt engineering, retrieval, evals. Covered in other modules.
- Pricing and packaging.
- Marketing and positioning.
The checklist is the UX-quality gate. It is not the only gate.
Interview Q&A¶
Q1. The team passes model evals but the checklist surfaces six gaps. What do you push back on? The framing that the model passing means the feature is ready. Model evals measure model quality; the checklist measures product quality. A model with perfect evals shipped with no streaming, no uncertainty signal, no correction affordance, and a buried escalation path will fail in production. The fixes are sprint-scale; the cost of skipping them is months of lost adoption. Wrong-answer note: "the model is the product" misses that the surface is the product the user touches.
Q2. Which items are non-negotiable for a regulated AI surface, and why? Streaming and indicators (4), per-category error UX (10), correction affordance with signal capture (11), visible human path (13), stateful handoff (14), accessibility (19), balanced dashboard (20). These are not preferences; they are minimum competence. A regulated surface that lacks any of them will fail an audit, fall short of accessibility law, or fail an operational incident review. Wrong-answer note: "regulators only care about safety metrics" misses how broad the audit surface is.
Q3. The team wants to defer items 1, 2, and 17. How do you respond? Cautiously. Items 1 and 2 are foundational: if the team has not articulated the trust asymmetry or the primary user goal, the rest of the design is built on assumption. Item 17 is the onboarding gap; deferring it means users start with an inaccurate mental model, and the second-month adoption curve becomes the metric the team spends the next quarter chasing. The mitigation is to write down the deferred items with explicit owners and dates, and to treat them as launch blockers for v1.1, not as nice-to-haves. Wrong-answer note: "we'll address them post-launch" is the standard path to never addressing them.
Q4. How does the checklist interact with model and prompt iteration cycles? Each iteration touches one or more checklist items. A prompt change that affects refusal behaviour touches items 7 (refusal) and 14 (handoff). A model change that affects latency touches items 4 (streaming) and 5 (cancel). The discipline is to re-run the affected items after each change, not just the model evals. Wrong-answer note: "the checklist is for launch only" misses that AI products drift continuously.
Q5. Walk through running the checklist as a quarterly audit. Pick the AI surfaces in scope. For each, score the twenty items honestly — not as the team wishes but as the surface actually stands. Compare against the previous quarter's score; investigate regressions. Cross-reference with the dashboard from item 20 — metric regressions and checklist regressions should usually correlate; when they do not, one of them is wrong. Produce a list of fixes prioritised by user impact, not by ease of fix. Wrong-answer note: "score everything green so the leadership review goes smoothly" produces silent drift that surfaces as a customer incident.
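The audit comparison in Q5 reduces to two set operations: which items regressed since last quarter, and where the checklist and the dashboard disagree. The sketch below assumes each item is recorded as a simple pass or fail per quarter and that the team maintains some mapping from dashboard metrics to checklist items. Both are assumptions, and `regressions` and `mismatches` are hypothetical helper names.

```python
def regressions(previous: dict[int, bool], current: dict[int, bool]) -> list[int]:
    """Items that passed last quarter and do not pass now."""
    return sorted(
        item for item, passed in previous.items()
        if passed and not current.get(item, False)
    )


def mismatches(checklist_regressed: list[int], metric_regressed: set[int]) -> dict[str, list[int]]:
    """Items where only one of the two signals regressed; each needs investigation."""
    checklist = set(checklist_regressed)
    return {
        "checklist_only": sorted(checklist - metric_regressed),
        "metrics_only": sorted(metric_regressed - checklist),
    }
```

Items that appear under either key are the cases where, in the words of the answer above, one of the two signals is wrong.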
What to do differently after reading this¶
- Run the checklist at design review, pre-launch, and quarterly.
- Treat items 4, 10, 11, 13, 14, 19, 20 as non-negotiable for regulated or high-stakes surfaces.
- Document deferrals with owners and dates; revisit them.
- Pair the checklist with the dashboard; regressions should usually correlate.
- Extend the checklist for domain-specific surfaces — healthcare, finance, legal, children's products.
- Use the checklist to catch the gaps that model evals cannot see.
Bridge. The checklist captures what good UX requires. The honest admission captures what UX cannot fix — the limits of the discipline, the failures that surface design cannot prevent, and the cases where the right answer is to not ship at all. The next chapter is that admission. → 13-honest-admission.md