
12. Architect checklist

Twenty items. Classify, bind, scope, redact, retain, audit, detect, erase, segregate, respond. If you can answer all of them with an artefact, the posture is defensible. If you cannot, the gaps are the work.


This is the checklist a tech lead uses in design review, in security review, in compliance audit, and at the first incident postmortem. Each item maps to a chapter.


Classify and bind (1–6)

1. Classification. Is every field in every store tagged with a tier (public, internal, sensitive, regulated)? Is the tier metadata used by the access mediator? (Chapter 02.)

2. Purpose registry. Does the platform have a registered set of purposes, each with owner, allowed tiers, scope template, audit policy? (Chapter 03.)

3. Per-call purpose declaration. Does every read and write declare a registered purpose? Are calls without a purpose refused? (Chapter 03.)

4. Per-call scope resolution. Does every call resolve to a per-call scope derived from active context (tenant, user, resource), with the data layer enforcing? (Chapter 04.)

5. PII detection and redaction. Are the four moves (detect, redact, hash, minimise) applied at every storage surface (audit, logs, samples, traces) at write time? (Chapter 05.)

6. Eval set hygiene. Does the eval set use only synthetic personal data? Is a CI check enforcing? (Chapter 05.)
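Items 1 to 3 can be sketched together as a small access mediator. This is a minimal illustration, not the platform's actual implementation: the registry contents, the field names, and the `authorize` function are hypothetical, and a real purpose record would also carry the scope template and audit policy that item 2 lists.

```python
from dataclasses import dataclass

# The four tiers named in item 1.
TIERS = ("public", "internal", "sensitive", "regulated")


@dataclass(frozen=True)
class Purpose:
    name: str
    owner: str
    allowed_tiers: frozenset


class AccessRefused(Exception):
    """Raised when a call has no registered purpose or exceeds its tiers."""


# Hypothetical purpose registry (item 2).
PURPOSE_REGISTRY = {
    "support_ticket_triage": Purpose(
        name="support_ticket_triage",
        owner="support-platform",
        allowed_tiers=frozenset({"public", "internal"}),
    ),
}

# Hypothetical field classification (item 1).
FIELD_TIERS = {
    "ticket.subject": "internal",
    "customer.email": "sensitive",
}


def authorize(purpose_name, fields):
    """Refuse calls with no registered purpose, unclassified fields,
    or fields whose tier the purpose is not allowed to touch (item 3)."""
    purpose = PURPOSE_REGISTRY.get(purpose_name)
    if purpose is None:
        raise AccessRefused(f"missing or unregistered purpose: {purpose_name!r}")
    for field in fields:
        tier = FIELD_TIERS.get(field)
        if tier is None:
            raise AccessRefused(f"unclassified field: {field}")
        if tier not in purpose.allowed_tiers:
            raise AccessRefused(
                f"purpose {purpose.name!r} may not access {tier} field {field}"
            )
    return purpose
```

In this sketch an unclassified field is refused rather than defaulted to a tier, so a gap in classification surfaces as a failed call instead of a silent read.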


Retain and audit (7–11)

7. Retention matrix. Does every data category have a documented retention window, driven by regulation, contract, and need (in that order of binding force)? (Chapter 06.)

8. Automatic deletion. Is past-window data deleted by automated jobs, with a dashboard showing expected vs actual? (Chapter 06.)

9. Backups in scope. Are backups within the retention discipline — either short-rotation, selective-skip-on-restore, or recycled? (Chapter 06.)

10. Per-call access audit. Does every access emit a structured audit record with actor, tenant, jurisdiction, purpose, scope, operation, fields, tier summary, and outcome? (Chapter 07.)

11. Append-only, tamper-evident audit. Is the audit append-only? For regulated data, is it tamper-evident (hash chaining or signed)? (Chapter 07.)
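Hash chaining, one of the two tamper-evidence options in item 11, can be sketched as follows. This is an illustration under stated assumptions: the in-memory list stands in for an append-only store, the function names are hypothetical, and the record fields are taken from item 10.

```python
import hashlib
import json

# Starting value for the chain; any fixed, documented constant works.
GENESIS = "0" * 64


def _digest(prev_hash, record):
    """Hash the previous entry's hash together with a canonical JSON record."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_record(log, record):
    """Append a record, linking it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {
        "record": record,
        "prev_hash": prev_hash,
        "hash": _digest(prev_hash, record),
    }
    log.append(entry)
    return entry


def verify_chain(log):
    """Return True only if every entry links to and matches its predecessor."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True


log = []
append_record(log, {
    "actor": "agent-7", "tenant": "acme", "jurisdiction": "EU",
    "purpose": "support_ticket_triage", "scope": "ticket:123",
    "operation": "read", "fields": ["ticket.subject"],
    "tier_summary": {"internal": 1}, "outcome": "allowed",
})
```

Editing or removing an entry in the middle of the log makes `verify_chain` fail. A chain on its own does not reveal truncation of the most recent entries, so the latest hash would also need to be recorded somewhere the log writer cannot modify.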


Detect and erase (12–15)

12. Live leak detection. Are per-actor volume, refusal-rate, and scope-violation signals monitored with per-actor baselines? Do alarms produce triage within an SLA? (Chapter 08.)

13. Offline review. Is there a quarterly review of regulated-tier access patterns, top accessors, new purposes, scope-failure aggregates? (Chapter 08.)

14. Right-to-be-forgotten workflow. Is the data-location map maintained? Does the workflow reach live data, audit (with pseudonymisation as appropriate), backups (with documented constraint), embeddings? Is every erasure verified? (Chapter 09.)

15. Subject notification. Are subjects notified per regulatory window, with specific content about what was deleted and what was retained with legal basis? (Chapter 09.)
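The per-actor baselines in item 12 can be illustrated with a simple volume check. This is a sketch, assuming daily read counts per actor and a z-score threshold; the function names, the threshold of 3.0, and the minimum sample size are hypothetical choices rather than values from this module.

```python
from statistics import mean, stdev


def is_anomalous(history, current, threshold=3.0, min_samples=5):
    """Compare today's count against this actor's own history."""
    if len(history) < min_samples:
        # Too little history for a baseline; new actors need a separate policy.
        return False
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any increase is a deviation.
        return current > mu
    return (current - mu) / sigma > threshold


def flag_actors(baselines, todays_counts, threshold=3.0):
    """Return the actors whose volume today exceeds their own baseline."""
    return sorted(
        actor
        for actor, count in todays_counts.items()
        if is_anomalous(baselines.get(actor, []), count, threshold)
    )


baselines = {
    "agent-a": [100, 110, 95, 105, 102],
    "agent-b": [5000, 5200, 4900, 5100, 5050],
}
today = {"agent-a": 900, "agent-b": 5150}
flagged = flag_actors(baselines, today)
```

Here 900 reads is far above agent-a's usual volume yet well below agent-b's normal day, which is why a single global threshold would either miss agent-a or alarm constantly on agent-b.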


Segregate and respond (16–20)

16. Cross-tenant isolation. Are agent code paths incapable of cross-tenant queries? Is tenant scope enforced at the storage layer? (Chapter 10.)

17. Regional segregation. Are per-tenant stores region-local per jurisdiction? Is the global control plane free of user data? (Chapter 10.)

18. Consent for cross-region. Is cross-region processing explicit, dated, contract-bound? (Chapter 10.)

19. Incident response playbook. Is the playbook documented with notification templates, regulatory contacts, escalation tree? Has it been rehearsed? (Chapter 11.)

20. Postmortem discipline. Are postmortems blameless and complete (timeline, root cause, contributing factors, blast radius, response, action items), with every action item tracked to closure? (Chapter 11.)
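For item 16, one way to make agent code paths incapable of cross-tenant queries is a store wrapper that always adds the tenant filter itself. The sketch below uses SQLite and a hypothetical `documents` table; the class name and column allow-list are assumptions for illustration. It is an application-layer guard, and database-level row security is the stronger storage-layer control where the engine supports it.

```python
import sqlite3

# Columns callers may filter on. tenant_id is deliberately absent,
# so a caller can neither omit nor override the tenant filter.
ALLOWED_FILTER_COLUMNS = {"id", "title"}


class TenantScopedStore:
    """Wraps a connection so every query is bound to one tenant."""

    def __init__(self, conn, tenant_id):
        if not tenant_id:
            raise ValueError("tenant_id is required")
        self._conn = conn
        self._tenant_id = tenant_id

    def select_documents(self, **filters):
        unknown = set(filters) - ALLOWED_FILTER_COLUMNS
        if unknown:
            raise ValueError(f"filter not allowed: {sorted(unknown)}")
        # The tenant clause is always first and always parameterised.
        clauses = ["tenant_id = ?"]
        params = [self._tenant_id]
        for column, value in sorted(filters.items()):
            clauses.append(f"{column} = ?")
            params.append(value)
        sql = (
            "SELECT id, title FROM documents WHERE "
            + " AND ".join(clauses)
            + " ORDER BY id"
        )
        return self._conn.execute(sql, params).fetchall()
```

Agent code would receive a `TenantScopedStore` built from the active request context and never the raw connection, so there is no call it can make that leaves out the tenant clause.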


How to use the checklist

In a design review for a new tenant or a new feature: walk the items. Reds are the work; yellows are tracked; greens are achievements.

In a security audit: walk the items with artefacts in hand. A green item that cannot produce an artefact is a yellow.

In a postmortem: walk the items. Which one, if green, would have prevented or shortened this incident?

A platform with most items green has a defensible governance posture. A platform with several reds is a platform with known gaps; the discipline is to know the gaps and close them on a timeline.
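The rule that a green item without an artefact counts as yellow can be encoded directly in whatever tracks the review. A minimal sketch, with hypothetical type names and artefact paths:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    RED = "red"
    YELLOW = "yellow"
    GREEN = "green"


@dataclass
class ItemReview:
    number: int
    name: str
    claimed: Status
    artefact: Optional[str] = None  # e.g. a link to a dashboard or document


def effective_status(item):
    """A green claim with no artefact is downgraded to yellow."""
    if item.claimed is Status.GREEN and not item.artefact:
        return Status.YELLOW
    return item.claimed


def summarise(reviews):
    """Group item numbers by their effective status."""
    summary = {status: [] for status in Status}
    for item in reviews:
        summary[effective_status(item)].append(item.number)
    return summary


reviews = [
    ItemReview(1, "Classification", Status.GREEN, "docs/field-tiers.md"),
    ItemReview(10, "Per-call access audit", Status.GREEN),
    ItemReview(14, "Right-to-be-forgotten workflow", Status.RED),
]
summary = summarise(reviews)
```

In this example item 10 is claimed green but lands in the yellow list because nothing backs the claim.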


Common postmortem-to-checklist mappings

  • "Agent surfaced wrong customer's data" → items 3 (purpose), 4 (scope)
  • "Email addresses found in audit log" → item 5 (PII redaction)
  • "Failed to delete data per regulator request" → items 14 (RTBF workflow), 9 (backups)
  • "Cross-tenant data appeared in another tenant's response" → items 4 (scope), 16 (cross-tenant isolation)
  • "Provider retained data that should not have been kept" → item 18 (cross-region consent), provider contracts
  • "Investigation took weeks to find what data was accessed" → items 10 (audit), 11 (tamper-evident)
  • "Leak detection missed a low-and-slow exfiltration" → item 13 (offline review)
  • "Notification was sent late" → item 19 (playbook, rehearsal)
  • "Postmortem listed the engineer; root cause not identified" → item 20 (blameless postmortem discipline)

When the checklist is overkill

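Two cases sit at the edges: one where parts of the checklist do not apply, and one where the checklist is only the starting point.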

Internal-only platform with no personal data. Items 5 (PII), 9 (backups for personal data), 14-15 (RTBF), 18 (cross-region consent) may be N/A. Document the N/A explicitly with the reason; some items become applicable as the platform evolves.

Tightly regulated sector (healthcare, finance, defense). All items apply, often with stricter thresholds than the defaults in this module. The checklist is the starting baseline; the regulator's specific rules add more.

In both cases, the discipline is to know what does and does not apply, and to keep the list current as the platform evolves.


Interview Q&A

Q1. You inherit an agent platform. Walk the first three items you address.

Item 10 (per-call access audit): without it, every other claim is unverifiable; the audit is the substrate. Item 1 (classification): without tier labels, every other discipline is applied uniformly when it should not be. Item 4 (per-call scope): without it, the chapter-1 incident pattern recurs. Allow three to four weeks to get all three to yellow or better; the rest of the checklist builds on these.

Wrong-answer notes: starting with leak detection (item 12) before the audit exists is detection without a substrate.

Q2. The team says "we have OAuth scopes and an audit log; we're done." How do you push back?

OAuth scopes and audit logs are the foundation, not the discipline. The discipline adds: purpose binding (items 2-3), per-call scope narrowing (item 4), PII redaction in the audit itself (item 5), tier-driven retention (items 7-8), leak detection on access patterns (item 12), erasure workflow (items 14-15), regional segregation (item 17), incident response (items 19-20). Each is the answer to a specific failure that OAuth plus audit alone does not catch. The push-back is concrete: "for failure X (cross-purpose access), OAuth does not catch it; purpose binding does." Five specific examples make the case.

Wrong-answer notes: dismissing OAuth and audit altogether is the opposite mistake.

Q3. What item on this checklist do you think is most under-appreciated?

Item 16 (cross-tenant isolation at the storage layer). Teams build application-level filters but skip the storage-level enforcement, so the first code path that bypasses the filter becomes a cross-tenant leak. The defence-in-depth costs little to implement (row-level security or a query builder that cannot omit the filter) and prevents the highest-blast incident class on a multi-tenant platform.

Wrong-answer notes: any specific item is defensible; what distinguishes a strong answer is the reasoning about blast radius and prevention cost.

Q4. The team is small and cannot land all twenty items in the first quarter. How do you sequence?

  • Quarter one: items 1, 4, 10, 12, 19 (classification, scope, audit, leak detection, incident playbook). These are the operational floor.
  • Quarter two: items 2-3 (purpose), 5 (PII), 7-8 (retention), 16 (cross-tenant). These tighten the posture.
  • Quarter three: items 14-15 (RTBF), 17-18 (cross-region), 6 (eval hygiene), 11 (tamper-evidence). These are specialised.
  • Quarter four: items 9 (backups), 13 (offline review), 20 (postmortem culture). This is continuous improvement.

The sequencing is not perfect; the discipline is to commit to a sequence and execute.

Wrong-answer notes: "all at once" is unrealistic; "wait for completion before any are useful" misses that each item provides value as it lands.


Bridge. The checklist is the engineer's defence. The last chapter is the honest opposite — what governance still cannot prevent, where the discipline is young, and the limits a thoughtful lead is transparent about. → 13-honest-admission.md