13. Honest admission — Architecture decisions we still cannot defend at design time

~17 min read. We have walked eleven chapters covering five primitives. For some of them, no one in the field actually knows the right setting at design time. Senior interviews probe exactly these gaps.

Built on the first-principles overview in 00-first-principles.md. The Loop, the Toolbelt, the State, the Leash, the Lifecycle — every primitive in this module has defensible settings in low-stakes domains and wide-open question marks in high-stakes ones. This file names the question marks plainly.


1) First picture: the punch list has holes

See. The architect's checklist from chapter 12 has twenty items across four columns. Most have textbook answers. Six do not.

architect checklist                 status
┌───────────────────────────────┐   ┌─────────────────────┐
│ schema, descriptions, retries │ → │ solved              │
│ stopping rules, kill switch   │ → │ solved              │
│ observability, eval gates     │ → │ mostly solved       │
├───────────────────────────────┤   ├─────────────────────┤
│ right autonomy level          │ → │ open                │
│ long-horizon drift            │ → │ open                │
│ eval method for architecture  │ → │ open                │
│ defensible topology choice    │ → │ open                │
│ blast radius across chains    │ → │ open                │
│ multi-tenant personalisation  │ → │ open                │
└───────────────────────────────┘   └─────────────────────┘

The textbook parts you can ship. The open parts you must operate. This is the first honest admission. Architecture is a partial science. The rest is craft, and the craft is what the debugging-agents-in-production module is built around.


2) The cost-quality frontier — nobody knows the right leash for high stakes

Look. The leash is the first knob in this module. For chat tasks, the right length is roughly known. Three to ten iterations. Cheap model. Confidence-based stop. Default to single call where possible.

For high-stakes domains — money, health, security, legal — nobody knows.

Two production teams will pick wildly different leashes for the same task.

task: "approve a $400 customer refund"

Team A → tight leash, two tools, mandatory human gate above $50, p95 = 4s
Team B → ReAct loop, eight tools, soft human gate above $500, p95 = 45s

Both teams have evidence for their choice. Team A points to incident logs. Team B points to deflection rates. Neither has a formal answer for "what is the maximum acceptable autonomy for this dollar value and customer tier."

The frontier is real. The map is not. This is why senior interview answers about leash length should always be conditional — "for tier-1 customers above $X, I would require an understudy gate; below, I would not." Anybody who gives one universal number is bluffing.
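
A conditional leash is easier to argue about once it is written as a policy rather than a number. The sketch below is a minimal illustration, not a recommendation: the tier names, the `Leash` type, and the iteration budgets are hypothetical, and the $50 and $500 thresholds are borrowed from Team A and Team B above purely as placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Leash:
    max_iterations: int
    human_gate: bool


# Hypothetical thresholds per customer tier (placeholders, not tuned values).
GATE_ABOVE_USD = {"tier1": 50.0, "standard": 500.0}


def leash_for(tier: str, amount_usd: float) -> Leash:
    """Return the leash for one request, keyed by tier and dollar value."""
    # Unknown tiers fall back to the strictest (lowest) threshold.
    threshold = GATE_ABOVE_USD.get(tier, min(GATE_ABOVE_USD.values()))
    needs_human = amount_usd > threshold
    # Assumption: gated requests get a shorter loop, since the agent only
    # prepares the case for a human reviewer.
    return Leash(max_iterations=3 if needs_human else 8, human_gate=needs_human)


print(leash_for("tier1", 400.0))
print(leash_for("standard", 400.0))
```

The same $400 refund gets a human gate for one tier and not for the other. The thresholds themselves are exactly the part nobody can derive at design time; the policy shape only makes them explicit and reviewable.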


3) Open-ended autonomy still drifts past the horizon

See. Long-horizon agents — fifty steps, a hundred, a thousand — still degrade in ways the field has no clean fix for.

iteration 1   focused on goal
iteration 10  still focused, mostly correct
iteration 30  scope creeps, fixes side issues, generates fluent paragraphs
iteration 80  off-task, confidently producing nothing useful

The problem: each individual step looks fine. The check passes. The tool returns success. The model writes a tidy rationale. In aggregate, the trajectory drifts. Cognition's Devin, OpenAI Deep Research, Cursor's long agent mode: every long-horizon system in production today has this failure shape, and none has solved it.

Mitigations exist. Aggressive scratchpad summarisation. Goal-restating prompts every K steps. Critic agents that re-anchor. Hard step caps. None is a fix. They reduce drift; they do not eliminate it.
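
Two of these mitigations, goal restating every K steps and a hard step cap, fit in a few lines. This is a sketch under stated assumptions: `step` stands in for one model-plus-tool iteration, the `"DONE"` sentinel is invented for the example, and the default values of `max_steps` and `restate_every` are arbitrary.

```python
from typing import Callable, List


def run_with_guards(
    goal: str,
    step: Callable[[List[str]], str],
    max_steps: int = 50,
    restate_every: int = 10,
) -> List[str]:
    """Run step() until it returns "DONE" or the hard cap is reached."""
    context: List[str] = [f"GOAL: {goal}"]
    for i in range(1, max_steps + 1):
        # Re-anchor: put the original goal back in front of the model.
        if i % restate_every == 0:
            context.append(f"REMINDER, original goal: {goal}")
        result = step(context)
        context.append(result)
        if result == "DONE":
            break
    return context


# A stub step that never finishes, to show the cap doing its job.
trace = run_with_guards("fix the failing test", lambda ctx: "working", max_steps=25)
print(len(trace))
```

The cap guarantees termination; the reminder only nudges. Neither detects that the trajectory has drifted, which is why the text above calls them mitigations rather than a fix.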

What honest senior engineers say in interviews: "At horizon length above roughly N, we ship with kill switches and tight observability, not with confidence." That is the truth.


4) Eval methodology for architecture choices is immature

Look. The yardstick — the eval that gates launch — is well-defined for individual capabilities. Tool calling accuracy. Routing accuracy. Schema adherence. Hallucination rate on a fixed prompt set.

For architecture choices, the yardstick is much weaker.

How do you eval "ReAct vs Plan-and-Execute for our domain"? Run both on a static set, count successes? Static sets do not capture the failure modes that matter — partial tool outages, stale data, ambiguous user intent, multi-tenant contention. Replay traces? Replay erases the very feedback that makes a loop different from a single call.

Three practical methods exist today.

A/B in production    →  real signal, slow, expensive, ethical concerns on errors
shadow mode          →  no user impact, but can't observe gated tools firing
synthetic harness    →  cheap, fast, almost always over-optimistic

All three are partial. The field has no consensus on which to use when. A team that tells you they "evaluated" a topology has, at best, used one of these three methods with its known gaps.

What we tell interviewees: name the method, name the gap, name what you would watch in production after shipping. That is the senior answer.
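
The shadow-mode gap is easy to see in code. In the sketch below, the tool names and the `make_shadow_executor` helper are hypothetical: read-only tools run for real, while gated tools are logged as intents and never executed.

```python
from typing import Any, Callable, Dict, List

# Hypothetical set of tools with side effects that shadow mode must not fire.
GATED_TOOLS = {"issue_refund"}


def make_shadow_executor(
    tools: Dict[str, Callable[..., Any]],
    intent_log: List[Dict[str, Any]],
) -> Callable[..., Any]:
    """Wrap a tool registry so gated tools are recorded instead of executed."""

    def execute(name: str, **kwargs: Any) -> Any:
        if name in GATED_TOOLS:
            # Record what the candidate agent wanted to do, then stop.
            intent_log.append({"tool": name, "args": kwargs})
            return {"status": "shadowed"}
        return tools[name](**kwargs)

    return execute


refunds_sent: List[tuple] = []
tools = {
    "lookup_order": lambda order_id: {"order_id": order_id, "total": 400},
    "issue_refund": lambda order_id, amount: refunds_sent.append((order_id, amount)),
}
log: List[Dict[str, Any]] = []
execute = make_shadow_executor(tools, log)

execute("lookup_order", order_id="A1")
execute("issue_refund", order_id="A1", amount=400)
print(refunds_sent, log)
```

The intent is captured, but everything that would have happened after a real refund (the customer's reply, the ledger state, the next agent step) is never observed. That is the gap to name when presenting shadow results.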


5) The "what topology" question rarely has a defensible answer at design time

Look. Here is a real fork in the road.

Task: "summarise a long contract and flag risky clauses."

option 1: single Claude call with the whole contract in context
option 2: ReAct loop with chunk-retrieval and re-read tools
option 3: orchestrator-worker — planner agent, then 5 specialist clause-checkers
option 4: pipeline — preprocess, classify, summarise, score, format

All four can ship. All four pass a basic eval. The dollar costs span 10×. The latencies span 20×. The failure modes are completely different.

At design time, before shipping, the architect has no formal method to pick among them. The literature gives heuristics. Experience gives intuition. Neither gives a defensible answer.

In practice, teams pick the topology that matches their own strengths: Python engineers reach for LangGraph, ML engineers reach for orchestrator-worker, infra engineers reach for pipelines. The topology decision is sociotechnical, not just technical.

This is uncomfortable to admit in interviews. The honest senior answer is: "For this task class, I would prototype options 1 and 3 in parallel for one week, run them through our eval, and pick. The architecture call is made empirically, not deductively." Anybody who picks topology in one sentence is overstating their certainty.
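
"Prototype in parallel, run through our eval, and pick" can be sketched as a small comparison harness. Everything here is illustrative: `RunResult`, `compare`, and the two stub prototypes are hypothetical, and the cost and latency figures are made up to echo the 10× and 20× spreads mentioned above.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List


@dataclass(frozen=True)
class RunResult:
    passed: bool
    cost_usd: float
    latency_s: float


def compare(
    candidates: Dict[str, Callable[[str], RunResult]],
    cases: List[str],
) -> Dict[str, Dict[str, float]]:
    """Run every candidate topology on every case and summarise per candidate."""
    report: Dict[str, Dict[str, float]] = {}
    for name, run in candidates.items():
        results = [run(case) for case in cases]
        report[name] = {
            "pass_rate": mean([1.0 if r.passed else 0.0 for r in results]),
            "mean_cost_usd": mean([r.cost_usd for r in results]),
            "mean_latency_s": mean([r.latency_s for r in results]),
        }
    return report


# Stub prototypes with made-up behaviour, standing in for real agents.
def single_call(contract: str) -> RunResult:
    return RunResult(passed=len(contract) < 20, cost_usd=0.02, latency_s=3.0)


def orchestrator_worker(contract: str) -> RunResult:
    return RunResult(passed=True, cost_usd=0.20, latency_s=60.0)


cases = ["short contract", "a very long contract with many clauses"]
report = compare(
    {"single_call": single_call, "orchestrator_worker": orchestrator_worker}, cases
)
for name, row in report.items():
    print(name, row)
```

The harness gives three numbers per option, not a decision. Weighing pass rate against cost and latency is still a judgment call, and a static case list inherits the gaps discussed in section 4.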


6) Multi-tenancy vs personalisation is an unresolved tension

See. The blast radius in a multi-tenant agent is the question of how far one user's context can leak into another's response. Chapter 12 said "isolate per request." That is necessary. It is not sufficient.

The tension is real.

strict isolation      →  no cross-user learning, no personalisation, generic answers
shared memory layer   →  personalisation, but cross-tenant leak risks

How much shared structure is safe? Embeddings of similar past tickets? Cached tool results? Distilled patterns from user A used to bias responses for user B? The field has no formal way to bound the leakage. Differential privacy gives one half-answer. Per-tenant fine-tunes give another. Neither is widely adopted in agent stacks.

What teams ship: hard isolation by default, opt-in shared layers with manual review. That is conservative. It is also conceding that the personalisation problem has no defensible bound at design time. We will only know if our shared layer leaked sensitive context once a customer complains.
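
The "hard isolation by default, opt-in shared layer with manual review" pattern might look like the following. This is a minimal in-memory sketch; the `TenantMemory` class and its method names are hypothetical, and the `reviewed` flag stands in for whatever manual review process a team actually runs.

```python
from typing import Dict, List, Set


class TenantMemory:
    """Per-tenant notes, plus a shared layer that tenants must opt in to."""

    def __init__(self) -> None:
        self._private: Dict[str, List[str]] = {}
        self._shared: List[str] = []
        self._opted_in: Set[str] = set()

    def opt_in(self, tenant: str) -> None:
        self._opted_in.add(tenant)

    def remember(
        self, tenant: str, note: str, share: bool = False, reviewed: bool = False
    ) -> None:
        self._private.setdefault(tenant, []).append(note)
        # A note reaches the shared layer only if the tenant opted in
        # AND a reviewer approved it.
        if share and reviewed and tenant in self._opted_in:
            self._shared.append(note)

    def recall(self, tenant: str) -> List[str]:
        notes = list(self._private.get(tenant, []))
        if tenant in self._opted_in:
            notes.extend(n for n in self._shared if n not in notes)
        return notes


memory = TenantMemory()
memory.opt_in("acme")
memory.opt_in("globex")
memory.remember("acme", "refund flow: ask for order id first", share=True, reviewed=True)
memory.remember("acme", "contact: jane@acme.example")  # stays private
print(memory.recall("globex"))
print(memory.recall("initech"))
```

The code enforces who can read the shared layer. It says nothing about whether a reviewed note still carries sensitive context, which is the part that has no design-time bound.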

This is the operations side of the unknown. The debugging-agents-in-production module lives exactly in this space: how to catch the leak after design has done what it can.


7) Bounding blast radius across compounding tool calls is unsolved

Look. Chapter 05 mapped each tool to a blast radius. That is per-tool, and tractable.

Per-trajectory blast radius is not trivial. An agent that calls read_email then summarise then send_email may have just exfiltrated private content. Each tool was approved. The composition was not.

read_internal_doc  +  draft_blog_post  +  publish_to_blog
   safe alone           safe alone         safe alone
   together: leaked confidential plan

There is no formal calculus for "the blast radius of an arbitrary chain of approved tools." Capability-based security gets close in the abstract. In practice, agent stacks ship without compositional guarantees.

Mitigations exist. Sensitivity tags on tool outputs that flow forward. Sandboxed execution. Per-trajectory egress filters. Approval gates on any tool that emits to an external surface. None is universal. All of them are operated, not designed.
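
Sensitivity tags that flow forward, combined with an approval gate on the external surface, can be sketched with the three tools from the diagram above. The `Tagged` wrapper, the `EgressBlocked` exception, and the `approved` parameter are hypothetical names for illustration; a real stack would need the tag to survive model rewrites, which this sketch does not address.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tagged:
    text: str
    sensitive: bool = False


class EgressBlocked(Exception):
    """Raised when tagged content reaches an external surface unapproved."""


def read_internal_doc() -> Tagged:
    # Internal sources tag their output at the point of origin.
    return Tagged("Q3 launch plan", sensitive=True)


def draft_blog_post(source: Tagged) -> Tagged:
    # The tag propagates: anything derived from sensitive input stays sensitive.
    return Tagged(f"Draft based on: {source.text}", sensitive=source.sensitive)


def publish_to_blog(post: Tagged, approved: bool = False) -> str:
    # Egress gate: the only tool that emits externally checks the tag.
    if post.sensitive and not approved:
        raise EgressBlocked("sensitive content requires human approval")
    return "published"


draft = draft_blog_post(read_internal_doc())
try:
    publish_to_blog(draft)
except EgressBlocked as err:
    print("blocked:", err)
```

Each tool is still approved individually. The tag is what carries information about the chain to the point where it matters. The weak spot is propagation: any step that drops or forgets the tag silently reopens the hole.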

Honest senior interview answer: "I can bound per-tool. I cannot bound per-trajectory. I compensate with egress controls and post-hoc auditing, and I assume I will catch the worst chains in observability, not in design."


8) Worked example: compound uncertainty in one design review

Suppose you sit in an architecture review for a refund agent. Four open questions get raised: right leash, right topology, right approval gate threshold, right kill switch trigger. Of the four, only the kill switch has a defensible design-time answer.

known unknowns                  unknown unknowns
right leash                     novel failure modes from interaction
right topology                  policy drift on tool descriptions
right approval threshold        cross-tenant leakage we have not seen yet
                                drift after a model upgrade

A mature design review names the known unknowns explicitly, marks them with the metric that will catch each one in production, and plans the rollback. That is the bridge between this module and the debugging-agents-in-production module — design plus operate, not design alone.
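
One way to keep a design review honest is to make the known unknowns a data structure that fails loudly when a metric or rollback is missing. The sketch below assumes a hypothetical `KnownUnknown` record; the metric and rollback strings are invented examples for the refund agent, not recommendations.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class KnownUnknown:
    question: str
    production_metric: str  # what will catch it after shipping
    rollback: str           # what we do when the metric fires


def not_ready(register: List[KnownUnknown]) -> List[str]:
    """Return the open questions that lack a metric or a rollback plan."""
    return [k.question for k in register if not (k.production_metric and k.rollback)]


register = [
    KnownUnknown(
        "right leash",
        "refund reversals per 1k approvals",
        "force the human gate on all refunds",
    ),
    KnownUnknown(
        "right topology",
        "p95 latency and task success rate",
        "route traffic back to the previous path",
    ),
    KnownUnknown("right approval threshold", "", ""),
]

print(not_ready(register))  # ['right approval threshold']
```

The register cannot cover the right-hand column of the table. Unknown unknowns have no row to fill in, which is why the kill switch stays in the design regardless.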


9) What changes when you upgrade a model

See. You shipped the agent. Six weeks later your provider releases a new model. You upgrade for the cost savings.

Every assumption in your architecture was conditional on the old model. The new model routes tools differently, plans longer chains, formats outputs slightly differently, and hallucinates on different topics. Your old eval passes; new failure modes appear in week two.

The field has no clean method for "architecture diff under model upgrade." Versioning (chapter 18) is necessary but not sufficient. Architecture and model are coupled in ways evals cannot fully decouple. Mitigation: shadow the new model for a week before cutover, watch distribution shifts, plan for surprises.
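
"Watch distribution shifts" needs at least one concrete number. A simple candidate is the total variation distance between the tool-call frequencies of the old and new model on the same mirrored traffic. The function name and the sample tool names below are hypothetical, and the metric is one choice among many, not an established standard for this problem.

```python
from collections import Counter
from typing import Iterable


def tool_call_shift(old_calls: Iterable[str], new_calls: Iterable[str]) -> float:
    """Total variation distance between two tool-call distributions (0.0 to 1.0)."""
    old, new = Counter(old_calls), Counter(new_calls)
    n_old, n_new = sum(old.values()), sum(new.values())
    if n_old == 0 or n_new == 0:
        raise ValueError("both traces must contain at least one tool call")
    tools = set(old) | set(new)
    # Counter returns 0 for tools that one model never called.
    return 0.5 * sum(abs(old[t] / n_old - new[t] / n_new) for t in tools)


old_trace = ["search_orders"] * 8 + ["issue_refund"] * 2
new_trace = ["search_orders"] * 5 + ["issue_refund"] * 5
print(round(tool_call_shift(old_trace, new_trace), 3))
```

A value near 0 means the new model reaches for tools in roughly the same proportions; a value near 1 means it behaves like a different agent. This only catches shifts in which tools are called, so changes in arguments, chain length, or output formatting need their own checks.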


10) What mature teams admit plainly

Look. The strongest senior engineers I have heard talk about agent architecture are the most modest. They say:

  • "We picked the leash empirically. We do not have a formal justification."
  • "Our topology was a coin flip. We commit to observe the failure mode and revisit in a quarter."
  • "Our long-horizon evals miss drift past iteration N. We compensate with hard caps and kill switches."
  • "We cannot bound trajectory-level blast radius. We bound egress instead."
  • "On every model upgrade, we expect to discover one new failure shape and patch in week two."

That is not weakness. That is the truth of where the field is in 2026. Architecture is partial. Operations close the gap.

Every open question here becomes an observability question in the debugging-agents-in-production module. Drift becomes a metric. Topology choice becomes a rollback plan. Blast radius becomes an egress alert. Read both modules together; design alone is half the job.

One more honest rule helps. If the workflow matters a lot, narrow the scope, increase observability, and keep the understudy escape hatch wide. Then revisit the design every quarter as production teaches you what you could not know at design time.

That brings us to the bridge. One agent's limits become two agents' coordination questions, and standard protocols across tools become essential as multi-agent systems grow.


Where this lives in the wild

  • Anthropic Claude Code and Computer Use — Anthropic explicitly documents that long-horizon Computer Use is alpha; the team treats horizon-length as an open architecture question.
  • Cognition Devin — production traces show drift past roughly 100 iterations; the team mitigates with planner checkpoints, not architectural fixes.
  • OpenAI Deep Research / Operator — public failure analyses from OpenAI's evals team note "evaluation methodology for long-horizon agents is not mature."
  • Cursor and GitHub Copilot agent mode — both teams have written publicly that topology choice (single-loop vs orchestrator) was decided empirically per workflow class, not derived.
  • Salesforce Agentforce and ServiceNow Now Assist — enterprise agent vendors openly publish that multi-tenant isolation is "defense in depth" — no single architectural guarantee.
  • Klarna's LangGraph-based support stack — the engineering blog explicitly states that leash length was tuned via shadow + canary, not at design time.

Pause and recall

  1. Which six rows of the architect's checklist have no defensible design-time answer?
  2. Why does long-horizon autonomy still drift even when each individual step succeeds?
  3. What are the three eval methods for architecture choices, and what is the gap in each?
  4. Why is per-tool blast radius solvable but per-trajectory blast radius not?
  5. What changes silently when you upgrade the underlying model after shipping an agent?
  6. What is the relationship between the open questions in this module and the operational discipline in the debugging-agents-in-production module?

Interview Q&A

Q: A senior architect asks you to pick the autonomy level for a $400 refund agent. What is your answer?
A: Conditional. For tier-1 customers above $X, I would require an understudy gate. Below that, I would run a tight ReAct loop with a hard cost cap and a kill switch. The right answer is not a single number; it is a policy keyed by customer tier and dollar value, validated empirically in shadow before going live.
Common wrong answer to avoid: "I would set max iterations to 5 and confidence threshold to 0.7." Universal numbers without conditioning on tier and dollar value are bluffing.

Q: How would you evaluate "ReAct vs orchestrator-worker" for our document-analysis agent before shipping?
A: Three methods, all partial. Synthetic harness is fast but over-optimistic. Shadow mode preserves real traffic but cannot observe gated tools firing. A/B in production gives real signal but is slow and ethically constrained on errors. I would run the synthetic harness for ranking, shadow for distribution sanity, then a small A/B for the final pick. I would explicitly name the gaps in my evaluation when presenting.
Common wrong answer to avoid: "Run them both on our eval set and pick the winner." Static eval sets do not capture the failure modes that actually decide topology: partial tool outages, stale state, multi-tenant contention.

Q: Why can't we bound blast radius across a chain of approved tools?
A: Because composition matters. Each tool may be safe alone: read an internal doc, draft a blog post, publish to the blog. The combination exfiltrates confidential content. There is no formal calculus for "approved tool A + approved tool B + approved tool C is safe." Production stacks mitigate with sensitivity tagging, egress filters, and post-hoc auditing, but no architectural guarantee exists today.
Common wrong answer to avoid: "Per-tool authority is enough." It is necessary, not sufficient. The chain creates new effects the per-tool guarantee does not cover.

Q: You upgrade the underlying model. What can silently break in your agent architecture?
A: Tool routing accuracy drifts because descriptions land differently. The new model plans longer chains, so iteration budgets are wrong. Output formatting changes by tiny amounts, breaking downstream parsers. Hallucination patterns shift, so old eval sets pass while new failure modes emerge in production. The mitigation is shadow with traffic mirroring before cutover, plus a deliberate week of close observation after. Even that does not catch everything.
Common wrong answer to avoid: "Re-run the eval set; if green, ship." The eval set was tuned against old failure modes. Green-on-old does not predict green-on-new.

Q: What is the relationship between architectural decisions and operations for agents?
A: Six rows of the architect's checklist have no defensible design-time answer: leash length for high stakes, long-horizon drift, eval methodology for architecture choices, topology choice, trajectory-level blast radius, and multi-tenant personalisation. For each, the design phase commits to a setting; operations measures the failure mode and adjusts. The debugging-agents-in-production module is the operations half. Design alone is half the job.
Common wrong answer to avoid: "Good architecture means you do not need to debug in production." Even Anthropic, OpenAI, and Cognition operate their long-horizon agents partly empirically; the field has no design-time-only methodology for these unknowns.


Apply now (5 min)

  1. Take one agent you know — Cursor, Devin, Claude Code, a vendor agent in your stack, or an internal prototype. List the six rows of the architect's checklist that have no defensible design-time answer. For each, write one sentence on how the team operating that agent is closing the gap after design.
  2. Sketch from memory: draw the architect's checklist with the solved rows on the left and the open rows on the right, and mark the bridge to operations (debugging agents in production).

Operational memory

This chapter named the six rows of the architect's checklist that have no defensible design-time answer in 2026 — right leash for high-stakes domains, long-horizon drift past N iterations, eval methodology for architecture choices, defensible topology selection, blast radius across compounding tool chains, multi-tenant personalisation tension. The important idea is that architecture is partial; operations close the gap, and the senior interview answer for each open question is conditional, empirical, and observability-driven rather than universal.

You learned why long-horizon trajectories drift even when every individual step succeeds, why per-tool blast radius is solvable while per-trajectory is not, why model upgrades silently break architectural assumptions in week two, and what mature teams admit plainly ("we picked the leash empirically", "our topology was a coin flip", "we compensate with kill switches and observability"). That solves the design-review honesty problem because the open questions are named with the metrics that catch each one in production, not papered over with confident heuristics.

Carry this diagnostic forward: when a senior engineer gives a one-sentence universal answer to a leash, topology, or threshold question, push for the conditioning. "For tier X above value Y under workload Z" is the honest shape; numbers without conditions are bluffing.

Remember:

  • Six rows of the checklist have no defensible design-time answer; they are operated, not designed.
  • Long-horizon drift is real even when each individual step passes; mitigations reduce it, none eliminate it.
  • Per-tool blast radius is bounded; per-trajectory blast radius across compounding tools is not — egress filters and post-hoc audit close the gap.
  • Model upgrades break assumptions silently; shadow before cutover, expect one new failure shape per upgrade.
  • Architecture is partial; operations is the other half. Read the debugging-agents-in-production module alongside this one.

Bridge. We have named what the architecture phase cannot answer. The operations phase closes some of that gap — and the structural problems of coordinating across many tools and many agents open the next door. Standard protocols between tools and agents are where the next module begins. → ../16_multi_agent_coordination/00-first-principles.md

For a deeper comparison of agent frameworks (LangGraph, CrewAI, OpenAI Agents SDK), see details_deep_dive/frameworks-own-or-rent.md.