Multi-Agent Orchestration: Interview Questions
The "you have N agents working together — how do you not let them turn into a swarm of confused colleagues" round. Different from agents-design.md (single-agent loop design) and agents-debugging-production.md (debugging methodology). This file is about the coordination architecture — orchestrator-worker, hierarchical, swarm, pipeline patterns; supervisor vs peer-to-peer; communication protocols; conflict resolution; when multi-agent beats monolithic; and the specific failure modes you only see with multiple agents in the loop.
The senior tell: most candidates over-reach toward multi-agent because it's fashionable. The mature answer in 2026 is "start monolithic; go multi-agent only when you have a clear coordination boundary, a specialization gain, or a scope-isolation reason — and even then prefer the simplest orchestration pattern that works".
When multi-agent is worth it
Q: "What is the difference between single-agent and multi-agent systems?"
Tags: mid · very-common · conceptual · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)
Answer outline:
- Single-agent: one LLM-driven loop, one set of tools, one context, one termination condition. The agent itself handles all reasoning, planning, and tool use. Simpler to build, simpler to debug, predictable.
- Multi-agent: multiple LLM-driven loops collaborating, with separate prompts, possibly separate models, and separate tool sets, communicating via structured messages or shared state. Used when the task naturally decomposes, or when isolation/specialization matters.
- The trade-off is coordination overhead. Multi-agent introduces handoff protocols, conflict resolution, message-passing failure modes, harder debugging, and higher cost per task (multiple LLM calls for what one might have done).
- Common multi-agent wins: specialized roles (research vs writer vs reviewer), trust boundaries (untrusted-input agent sandboxed from privileged-action agent), parallelism (multiple sub-tasks in flight), human-fit decomposition (matches an organizational workflow).
- Common multi-agent losses: small tasks where the coordination overhead exceeds the specialization win, tasks where free-form prose communication between agents creates noise, tasks where one strong model could do the whole thing.
- 2026 default: build monolithic first, decompose only when a specific multi-agent gain is identified.
- Numbers to drop: "multi-agent overhead: 2-5× cost and latency vs monolithic for the same task", "real specialization win typically 10-30% quality lift when decomposition matches the task structure"
Common follow-ups: - "When does multi-agent hurt?" - "Give me an example where multi-agent clearly wins."
Traps: - Going multi-agent for prestige. Senior interviewers will probe whether you can articulate the specific win. - Underestimating coordination overhead.
Related cross-cutting: Architecture choices
Related module: learning/01_ai_engineering/16_multi_agent_coordination/, learning/01_ai_engineering/01_agentic_system_design/
Q: "When is a monolithic agent better than a multi-agent system?"
Tags: senior · very-common · scenario · source: agents-design references; standard senior pushback probe; 2026 AI loops
Answer outline:
- Default to monolithic when:
  - Task fits one context window: single agent can hold the whole task; no decomposition gain.
  - Tools naturally cohere: all tools serve one type of operation; no specialization boundary.
  - Latency-sensitive: each agent handoff adds an LLM call. Multi-agent latency is 2-5× monolithic.
  - Debug-ability matters: one trace beats N traces with cross-agent communication links to follow.
  - Cost-sensitive at scale: extra LLM calls add up at high QPS.
  - Team velocity: building a single agent is faster; iterating is cheaper.
- Go multi-agent when:
  - Specialization gain is measurable: a writer model with one prompt + a reviewer model with another prompt outperforms one model trying to do both.
  - Trust isolation: an agent reading untrusted content (email, scraped pages) must be separated from one with privileged tool access. See lethal trifecta in safety-guardrails.md.
  - Parallelism: independent sub-tasks can run in parallel for latency win.
  - Task-level scope match: tasks naturally map to human team roles (research, draft, review).
- Senior tell: candidate names a specific boundary — "the writer's prompt is different from the reviewer's prompt; that's why they're separate". Without a boundary, it's the same agent with extra steps.
- Numbers to drop: "monolithic latency: 1× baseline. Two-agent handoff: 2-3×. N-agent orchestration: linear in agent count plus coordination overhead"
Common follow-ups: - "What's a specific boundary you'd split on?" - "What does coordination overhead look like in practice?"
Traps: - Claiming multi-agent is more "scalable". It usually isn't — same model, more calls.
Related cross-cutting: Architecture choices
Related module: learning/01_ai_engineering/16_multi_agent_coordination/, learning/01_ai_engineering/01_agentic_system_design/
Orchestration patterns
Q: "Walk me through the main multi-agent orchestration patterns."
Tags: senior · very-common · conceptual · source: Multi-Agent Orchestration 2026 guides (Gurusup, Build5Nines, MindStudio, Codebridge); standard senior architecture probe
Answer outline:
- Five canonical patterns:
  - Orchestrator-Worker (also called planner-worker, supervisor-worker): one orchestrator agent decides the plan, delegates sub-tasks to specialized worker agents, assembles results. Centralized control, easy to reason about, easy to debug. The 2026 default.
  - Hierarchical: orchestrator-worker extended to multiple levels. Top-level supervisor delegates to mid-level supervisors, each of which manages worker pools. Useful at scale or when the task has natural tiered structure.
  - Pipeline / Sequential: agents arranged in a fixed sequence, each transforming the output of the previous. Deterministic, predictable, low coordination overhead. Best for well-defined transformation chains (extract → enrich → validate → format).
  - Swarm: decentralized, emergent. Agents communicate peer-to-peer, no central coordinator. Inspired by biological swarms. Flexible but harder to debug and bound.
  - Mesh: full peer-to-peer communication. Every agent can talk to every other. Powerful but coordination overhead grows quadratically; rarely the right choice in production.
- Decision: start with orchestrator-worker. Add hierarchy only if the agent count or task complexity demands it. Pipeline for transformations. Swarm/mesh almost never in 2026 production.
- The hidden choice is whether the orchestrator is deterministic code or another agent. Deterministic orchestrator = workflow engine (Temporal, Airflow, LangGraph state machine). Agent orchestrator = planner LLM that decides who runs next. Deterministic wins for production reliability; agent for flexibility.
- Numbers to drop: "2026 default: orchestrator-worker with a deterministic orchestrator", "swarm/mesh: research-y; rare in production", "hierarchical: 2-3 levels typical, beyond that diminishing returns"
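As an illustration of the recommended default (orchestrator-worker with a deterministic orchestrator), here is a minimal sketch. The worker names (`research_worker`, `pricing_worker`) and the fixed `plan` are hypothetical stand-ins; in a real system each worker would wrap its own LLM loop.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SubTask:
    worker: str   # which worker handles this sub-task
    payload: str  # input passed to that worker


# Hypothetical workers. In a real system each would wrap its own LLM loop.
def research_worker(payload: str) -> str:
    return f"facts about {payload}"


def pricing_worker(payload: str) -> str:
    return f"price range for {payload}"


WORKERS: dict[str, Callable[[str], str]] = {
    "research": research_worker,
    "pricing": pricing_worker,
}


def plan(task: str) -> list[SubTask]:
    """Deterministic plan: plain code decides who runs, not a planner LLM."""
    return [SubTask("research", task), SubTask("pricing", task)]


def orchestrate(task: str) -> dict[str, str]:
    """Delegate each sub-task to its worker and assemble results by worker name."""
    results: dict[str, str] = {}
    for sub_task in plan(task):
        results[sub_task.worker] = WORKERS[sub_task.worker](sub_task.payload)
    return results
```

Swapping `plan` for a function that asks a planner LLM which workers to run would turn this into the agent-orchestrated variant; everything else stays the same, which is one reason the deterministic version is a convenient starting point.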
Common follow-ups: - "Walk me through orchestrator-worker." - "When would you pick pipeline over orchestrator-worker?" - "What's wrong with swarm?"
Traps: - Mixing patterns without intention. "Hierarchical with some peer-to-peer" usually means you haven't decided.
Related cross-cutting: Architecture choices
Related module: learning/01_ai_engineering/16_multi_agent_coordination/, learning/01_ai_engineering/02_durable_agent_workflows/
Q: "Compare and contrast a supervisor-based multi-agent system with a peer-to-peer collaborative system. When would you choose one over the other?"
Tags: senior · very-common · scenario · source: AEM Institute 25 Advanced Agentic AI Questions 2026
Answer outline:
- Supervisor-based (orchestrator-worker): a designated supervisor agent (or workflow engine) plans, delegates, supervises, aggregates. Workers don't talk to each other; they talk only to the supervisor. Tree-shaped communication.
- Peer-to-peer (mesh/swarm): agents communicate directly with each other. Graph-shaped communication. No central authority.
- Supervisor wins when:
  - Bounded coordination: the task is decomposable into clear sub-tasks. Supervisor knows the plan.
  - Easier debugging: traces are tree-shaped; you can follow what happened by reading the supervisor's decisions.
  - Clear failure handling: supervisor decides what to do when a worker fails (retry, route to another worker, escalate).
  - Cost control: supervisor enforces budget at the workflow level.
- Peer-to-peer wins when:
  - Emergent collaboration: the task isn't well-decomposed in advance; agents need to figure it out as they go.
  - Resilience to single-point failure: no supervisor to crash. (Though this is rare in practice; most production multi-agent systems have a coordinator.)
  - Specific research patterns: simulation, multi-agent reinforcement learning, where peer-to-peer is the point.
- 2026 production almost always uses supervisor-based. Peer-to-peer adds debugging cost that exceeds its flexibility benefit in most product use cases.
- Senior tell: candidate explicitly notes that "peer-to-peer" in product contexts usually means "we haven't thought through the coordination architecture", and pushes toward supervisor unless there's a clear reason.
- Numbers to drop: "supervisor-based: 80%+ of production multi-agent in 2026", "peer-to-peer: mostly research / specific frameworks like CAMEL, AutoGen multi-agent groupchat"
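The tree-shaped vs graph-shaped distinction can be made concrete by counting communication channels. The sketch below assumes undirected channels and, for the supervisor case, that one of the agents is the supervisor; the function names are illustrative.

```python
def supervisor_channels(agent_count: int) -> int:
    """One supervisor plus (agent_count - 1) workers: each worker links only to the supervisor."""
    return max(agent_count - 1, 0)


def mesh_channels(agent_count: int) -> int:
    """Every agent can talk to every other: n * (n - 1) / 2 undirected channels."""
    return agent_count * (agent_count - 1) // 2


for n in (3, 5, 10):
    print(f"{n} agents: supervisor={supervisor_channels(n)}, mesh={mesh_channels(n)}")
```

The supervisor topology grows linearly with agent count while the mesh grows quadratically, which is the structural reason traces and failure handling stay tractable under a supervisor.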
Common follow-ups: - "How does the supervisor itself handle failure?" - "Have you used peer-to-peer in production?"
Traps: - Romanticizing peer-to-peer. Usually adds noise without benefit.
Related cross-cutting: Architecture choices
Related module: learning/01_ai_engineering/16_multi_agent_coordination/
Q: "Explain hierarchical orchestration with a concrete example."
Tags: staff · common · design · source: Multi-Agent Orchestration 2026 guides; standard staff-tier probe
Answer outline:
- Hierarchical = orchestrator-worker extended to multiple levels. Top supervisor → mid-level supervisors → worker pools.
- Concrete example: enterprise customer-support agent platform.
  - Top supervisor: receives the user query, classifies the broad domain (account, billing, technical, compliance), routes to the relevant mid-level supervisor.
  - Mid-level supervisor (e.g., "technical"): classifies sub-domain (network issues, login issues, app issues), spins up the right specialized worker.
  - Worker (e.g., "network troubleshooter"): runs the actual diagnostic loop with network-specific tools.
- Why hierarchy: avoids overloading any single supervisor with all routing logic. Each level has a narrow specialization → tighter prompts, smaller tool sets, easier to reason about.
- Trade-off: latency. Each LLM-driven supervisor level adds a routing call, so three supervisor levels = 3 LLM calls before the actual worker runs. For latency-sensitive apps, flatten to 1-2 levels.
- When the supervisor is deterministic (a router classifier + workflow engine, not an LLM), hierarchy is cheap. When supervisors are LLM-driven, every level multiplies cost and latency.
- Bound the recursion: max depth = 3-4 typical. Beyond that, you're over-engineering.
- Numbers to drop: "three supervisor levels: 3 routing calls + 1 worker call = 4 LLM calls per task", "max hierarchy depth in production: 2-4", "deterministic-supervisor routing: ~10-50ms vs LLM-supervisor 500-2000ms"
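A rough sketch of the customer-support hierarchy with deterministic supervisors follows. The keyword tables are hypothetical stand-ins for trained router classifiers, and the worker names are invented for illustration. `llm_calls_per_task` reproduces the call-count arithmetic, counting the worker as a single call as the text does.

```python
# Hypothetical keyword tables standing in for trained router classifiers.
TOP_ROUTER = {"invoice": "billing", "refund": "billing", "wifi": "technical", "login": "technical"}
MID_ROUTERS = {
    "billing": {"invoice": "invoice_worker", "refund": "refund_worker"},
    "technical": {"wifi": "network_troubleshooter", "login": "login_troubleshooter"},
}


def classify(text: str, table: dict[str, str], default: str) -> str:
    """Return the label of the first keyword found in the text, else the default."""
    lowered = text.lower()
    for keyword, label in table.items():
        if keyword in lowered:
            return label
    return default


def route(query: str) -> list[str]:
    """Walk the hierarchy top-down and return the path taken, which doubles as a trace."""
    domain = classify(query, TOP_ROUTER, "unknown")
    if domain not in MID_ROUTERS:
        return [domain, "human_escalation"]
    worker = classify(query, MID_ROUTERS[domain], "generalist_worker")
    return [domain, worker]


def llm_calls_per_task(supervisor_levels: int, llm_supervisors: bool) -> int:
    """Routing calls (one per LLM-driven supervisor level) plus one call for the worker."""
    return (supervisor_levels if llm_supervisors else 0) + 1
```

With deterministic routers the hierarchy costs only the worker's call; with LLM supervisors every added level adds one more call to the critical path.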
Common follow-ups: - "What's the failure mode when one level of the hierarchy is wrong?" - "Why not just one big supervisor with all the logic?"
Traps: - Hierarchy for hierarchy's sake. If a single supervisor can do it, do that.
Related cross-cutting: Architecture choices
Related module: learning/01_ai_engineering/16_multi_agent_coordination/
Q: "Compare planner-worker with pipeline patterns."
Tags: senior · common · conceptual · source: Multi-Agent Orchestration patterns 2026; standard senior architecture probe
Answer outline:
- Planner-worker (dynamic): planner LLM decides per-task which workers to invoke, in what order, with what inputs. Flexible: the same orchestrator handles many task shapes.
- Pipeline (static): fixed sequence of agents. Input → Agent A → Agent B → Agent C → Output. Each stage has a fixed role.
- Planner-worker wins when:
  - The task structure varies. The planner adapts each run.
  - You want one orchestrator handling many task types.
  - The work isn't predictable in advance.
- Pipeline wins when:
  - The task structure is fixed. Extract → enrich → validate → format. No need for a planner.
  - Latency / cost matters. No planning call needed; goes straight to execution.
  - Determinism matters. Pipeline always does the same steps in the same order.
- For most data-processing tasks (document extraction, transformation pipelines, ETL with LLM steps), pipeline is the right answer. For open-ended user requests (chat agents, research assistants), planner-worker.
- The hybrid: a planner-worker that often degenerates into a fixed pipeline. If you observe the planner picking the same workers in the same order for 80%+ of tasks, that's a signal to harden into a pipeline.
- Numbers to drop: "pipeline saves the planner call: 30-50% latency win vs planner-worker on simple tasks", "planner-worker pays off when task variance is real: 5+ distinct task shapes routinely"
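The "harden into a pipeline" signal can be checked from logged planner decisions. This is a minimal sketch that assumes each logged plan is simply the ordered list of worker names the planner chose; the function names and the log shape are illustrative.

```python
from collections import Counter


def dominant_plan_share(plans: list[list[str]]) -> tuple[tuple[str, ...], float]:
    """Return the most common worker sequence and the fraction of runs that used it."""
    counts = Counter(tuple(plan) for plan in plans)
    sequence, hits = counts.most_common(1)[0]
    return sequence, hits / len(plans)


def should_harden_into_pipeline(plans: list[list[str]], threshold: float = 0.8) -> bool:
    """True when one sequence accounts for at least `threshold` of the observed plans."""
    if not plans:
        return False
    _, share = dominant_plan_share(plans)
    return share >= threshold
```

Running this over a week of planner traces gives a concrete answer to the follow-up about when to flatten planner-worker into a pipeline.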
Common follow-ups: - "How do you decide when to flatten planner-worker into a pipeline?" - "What if the pipeline's stage 3 is sometimes skipped?"
Traps: - Using planner-worker when the task is structurally fixed. Adds cost and indeterminism.
Related cross-cutting: Architecture choices, Cost & latency
Related module: learning/01_ai_engineering/16_multi_agent_coordination/, learning/01_ai_engineering/02_durable_agent_workflows/
Agent communication
Q: "How do agents communicate in a multi-agent system?"
Tags: senior · very-common · conceptual · source: Multi-Agent Orchestration guides 2026; standard senior probe
Answer outline:
- Three communication models:
  - Structured messages: agents exchange typed JSON / Pydantic objects with clear schemas. The orchestrator routes messages. This is the production-grade default.
  - Shared scratchpad / blackboard: all agents read/write to a common state store. Each agent's contribution is structured. Avoids point-to-point messaging at the cost of contention.
  - Free-form prose: agents write paragraphs to each other. Closest to "conversation"; very flexible but very brittle. The model on the receiving end has to parse free-form intent; errors compound.
- Strong recommendation: use structured messages with explicit schemas. Free-form prose communication is the single most common reason multi-agent systems get unreliable.
- Schema design: each handoff has a defined message type with required fields. Receiver validates schema; rejects malformed; possibly retry-with-correction.
- Shared state pattern (LangGraph-style): the orchestrator owns a typed state object; each agent reads and updates specific fields. State transitions are explicit. Easier to debug than point-to-point messages because all changes go through one place.
- Long-running context: agents shouldn't carry forward each other's entire context window. Summarize at handoff; pass only the relevant summary + structured handoff data.
- Numbers to drop: "structured messages: 3-5× fewer downstream failures than free-form prose", "shared state pattern: easier to reason about for 3-10 agents; gets messy beyond"
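To make "structured messages with explicit schemas" concrete, here is a small sketch using only the standard library (a Pydantic model would serve the same purpose). The message types and their required fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class AgentMessage:
    sender: str
    recipient: str
    message_type: str
    payload: dict[str, Any]


# Hypothetical message types and their required payload fields.
REQUIRED_FIELDS: dict[str, set[str]] = {
    "research_result": {"topic", "facts"},
    "review_verdict": {"approved", "comments"},
}


def validate(message: AgentMessage) -> list[str]:
    """Return a list of problems; an empty list means the message is acceptable."""
    required = REQUIRED_FIELDS.get(message.message_type)
    if required is None:
        return [f"unknown message_type: {message.message_type}"]
    missing = sorted(required - set(message.payload))
    return [f"missing field: {name}" for name in missing]


def deliver(message: AgentMessage, inboxes: dict[str, list[AgentMessage]]) -> bool:
    """Orchestrator-side routing: only validated messages reach the recipient's inbox."""
    if validate(message):
        return False
    inboxes.setdefault(message.recipient, []).append(message)
    return True
```

The point of routing through `deliver` is that a malformed message is rejected at the boundary instead of being interpreted, possibly wrongly, by the receiving agent.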
Common follow-ups: - "Walk me through a specific message schema." - "How do you handle agent A producing output that agent B can't parse?" - "Shared state vs message-passing — which?"
Traps: - Free-form prose communication. Standard amateur multi-agent mistake. - Carrying full conversation context forward across agents. Cost and confusion explode.
Related cross-cutting: Architecture choices
Related module: learning/01_ai_engineering/16_multi_agent_coordination/
Q: "How do you design a handoff between two agents?"
Tags: senior · common · design · source: standard senior multi-agent design probe; 2026 AI engineer loops
Answer outline:
- A handoff is a typed transition: agent A's output becomes agent B's input. Treat it like an API contract.
- Components:
  - Handoff schema: explicit JSON/Pydantic type. Required fields, optional fields, validation rules.
  - Trigger condition: when does the orchestrator decide handoff happens? Output of agent A passes validation + matches a routing rule.
  - Context to pass: not everything. Pass the summary + relevant facts + the next-step instructions. Drop intermediate reasoning that B doesn't need.
  - Failure handling: if B rejects the handoff (malformed input, can't continue), what does the orchestrator do? Retry A with feedback, escalate to human, terminate.
- Example: research agent → writer agent handoff. Schema: {topic, gathered_facts[], sources[], outline_suggestion}. Writer receives this, doesn't see research-agent's full search history.
- Anti-pattern: handoff via "agent A's last message in the conversation history". B has to parse free-form text, infer intent, often gets it wrong.
- Frameworks: LangGraph state-machine transitions are an explicit handoff model. AutoGen's nested-group-chat is more free-form. Pick based on how strict your handoffs need to be.
- Numbers to drop: "handoff schema: 5-15 fields typical", "handoff failure rate target: <1% with retry-with-correction"
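The research → writer handoff described above could be sketched as follows. The field names come from the example schema; the validation rules, the `produce` callback signature, and the retry count are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ResearchHandoff:
    topic: str
    gathered_facts: list[str]
    sources: list[str]
    outline_suggestion: str = ""


def parse_handoff(raw: dict) -> ResearchHandoff:
    """Validate the research agent's raw output; the error text doubles as feedback."""
    topic = raw.get("topic")
    if not isinstance(topic, str) or not topic.strip():
        raise ValueError("topic must be a non-empty string")
    for key in ("gathered_facts", "sources"):
        value = raw.get(key)
        if not isinstance(value, list) or not value or not all(isinstance(v, str) for v in value):
            raise ValueError(f"{key} must be a non-empty list of strings")
    # Only schema fields are copied; anything else (e.g. search history) is dropped.
    return ResearchHandoff(
        topic=topic,
        gathered_facts=list(raw["gathered_facts"]),
        sources=list(raw["sources"]),
        outline_suggestion=str(raw.get("outline_suggestion", "")),
    )


def hand_off(produce: Callable[[Optional[str]], dict], max_attempts: int = 2) -> ResearchHandoff:
    """Call the producer, validate, and retry with the validation error as feedback."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        try:
            return parse_handoff(produce(feedback))
        except ValueError as exc:
            feedback = str(exc)
    raise RuntimeError(f"handoff rejected after {max_attempts} attempts: {feedback}")
```

Here `produce` stands in for "run agent A, optionally with correction feedback". When retries are exhausted, the `RuntimeError` is where the orchestrator would escalate to a human or terminate.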
Common follow-ups: - "What if agent A produces output that's almost-but-not-quite valid?" - "How do you version handoff schemas?"
Traps: - Passing the full conversation history. The next agent doesn't need it. - No schema validation at handoff.
Related cross-cutting: Architecture choices
Related module: learning/01_ai_engineering/16_multi_agent_coordination/, learning/01_ai_engineering/01_agentic_system_design/
Conflict resolution
Q: "How do you handle conflicting outputs or goals in a multi-agent system?"
Tags: senior · very-common · scenario · source: AEM Institute 25 Advanced Agentic AI Questions 2026
Answer outline:
- Two kinds of conflict:
  - Output conflict: agents produce inconsistent answers. E.g., research agent says product launched 2023; another says 2024.
  - Goal conflict: agents have competing objectives. E.g., performance-optimizer agent wants to remove safety check; safety agent wants to keep it.
- Resolution mechanisms:
  - Priority hierarchy (declared by orchestrator): "if writer disagrees with reviewer, reviewer wins". Source priority is a configuration, not an agent decision.
  - Cross-verification: ask a third agent (or call a verification tool) when two agents disagree.
  - Recency-weighted: if the conflict is about facts, more recent / authoritative source wins.
  - Confidence-weighted: each agent emits a confidence score with its output; higher-confidence wins, ties go to escalation.
  - Escalation: orchestrator surfaces the conflict to a human or to a stronger model.
  - Surface-to-user: when the system can't resolve confidently, return both views with attribution rather than picking silently.
- For goal conflicts: usually a sign the agent boundaries are wrong. Performance vs safety shouldn't be different agents arguing; they should be a single agent with constraints, or the orchestrator enforces the safety constraint as policy.
- The senior insight: conflicts that should be resolvable by the system shouldn't be agent-level disagreements; they're orchestrator-level policy. Reserve conflicts for genuinely uncertain cases.
- Numbers to drop: "log every conflict event for pattern analysis", "frequent conflicts often indicate stale data or wrong source-priority config; fix the system, not the agent"
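One way the orchestrator might chain these mechanisms is shown below: configured priority first, then confidence when the gap is clear, then escalation with both views attributed. The ordering of the three steps, the `PRIORITY` list, and the `min_margin` threshold are assumptions, not a prescribed policy.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Claim:
    agent: str
    value: str
    confidence: float  # self-reported, 0.0 to 1.0


@dataclass(frozen=True)
class Resolution:
    outcome: str           # "agreed", "resolved", or "escalated"
    value: Optional[str]   # None when escalated
    reason: str


# Hypothetical priority config: earlier entries win over later ones.
PRIORITY = ["reviewer", "writer"]


def resolve(a: Claim, b: Claim, min_margin: float = 0.2) -> Resolution:
    if a.value == b.value:
        return Resolution("agreed", a.value, "no conflict")
    # 1. Configured priority, if both agents are ranked.
    if a.agent in PRIORITY and b.agent in PRIORITY:
        winner = min((a, b), key=lambda c: PRIORITY.index(c.agent))
        return Resolution("resolved", winner.value, f"priority: {winner.agent}")
    # 2. Confidence, only when the gap is clear.
    if abs(a.confidence - b.confidence) >= min_margin:
        winner = max((a, b), key=lambda c: c.confidence)
        return Resolution("resolved", winner.value, f"confidence: {winner.agent}")
    # 3. Otherwise escalate, keeping both views with attribution.
    return Resolution("escalated", None, f"{a.agent}={a.value!r} vs {b.agent}={b.value!r}")
```

Note that the decision lives in orchestrator code and configuration; no agent is asked to arbitrate its own disagreement.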
Common follow-ups: - "What if all agents are wrong?" - "How does this scale with N agents?"
Traps: - Letting agents "vote" without a tie-breaker. The orchestrator must own the decision.
Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/16_multi_agent_coordination/, learning/01_ai_engineering/03_agent_observability_debugging/
Q: "How do agents avoid stepping on each other when sharing state?"
Tags: senior · common · design · source: standard senior concurrency probe in multi-agent contexts; 2026 AI engineer loops
Answer outline:
- Two strategies:
  - Single-writer pattern: each state field has exactly one agent allowed to write it. Others read. Eliminates conflicts at the field level.
  - Orchestrator-mediated: agents propose updates; orchestrator validates and applies. Centralized arbitration.
- Concurrency control (when agents run in parallel):
  - Optimistic locking: each state read includes a version; write requires matching version, retry on mismatch.
  - Mutexes / locks: pessimistic, simpler, but limits parallelism. Use sparingly.
  - Append-only logs: agents append events to a log; the orchestrator (or a reducer) combines them. Natural for accumulation patterns.
- For shared scratchpad / blackboard patterns: structure the blackboard with explicit sections per agent. "Agent A's findings" section is writable only by A.
- Avoid: free-form shared state where any agent can write any field. Concurrent updates create silent corruption.
- Frameworks: LangGraph's state transitions enforce single-writer per node (typically). Custom orchestrators need to enforce this manually.
- Numbers to drop: "single-writer pattern: zero update conflicts by construction", "optimistic locking: retry 1-3 times before failing the transaction"
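Optimistic locking over a shared state object can be sketched in a few lines. This is an in-process illustration; `SharedState` and `update_with_retry` are hypothetical names, and a production system would typically rely on the version or compare-and-set support of its state store.

```python
import threading
from typing import Any, Callable


class VersionConflict(Exception):
    """Raised when a write is based on a stale read."""


class SharedState:
    def __init__(self) -> None:
        self._data: dict[str, Any] = {}
        self._version = 0
        self._lock = threading.Lock()  # guards the check-and-write; held only briefly

    def read(self) -> tuple[dict[str, Any], int]:
        """Return a snapshot copy of the data together with its version."""
        with self._lock:
            return dict(self._data), self._version

    def write(self, updates: dict[str, Any], expected_version: int) -> int:
        """Apply updates only if nobody has written since `expected_version` was read."""
        with self._lock:
            if expected_version != self._version:
                raise VersionConflict(f"expected v{expected_version}, state is v{self._version}")
            self._data.update(updates)
            self._version += 1
            return self._version


def update_with_retry(
    state: SharedState,
    compute: Callable[[dict[str, Any]], dict[str, Any]],
    max_retries: int = 3,
) -> int:
    """Read, compute updates from the snapshot, write; re-read and retry on conflict."""
    for _ in range(max_retries + 1):
        snapshot, version = state.read()
        try:
            return state.write(compute(snapshot), version)
        except VersionConflict:
            continue
    raise VersionConflict(f"gave up after {max_retries} retries")
```

Agents never hold a lock while they think; the lock covers only the version check and the write, so slow LLM calls do not serialize the other agents.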
Common follow-ups: - "What if two agents legitimately need to update the same field?" - "How does this work in async / streaming agents?"
Traps: - Free-for-all shared state. Concurrency bugs you can't reproduce.
Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/16_multi_agent_coordination/
Failure modes & reliability
Q: "What are the failure modes unique to multi-agent systems?"
Tags: senior · very-common · conceptual · source: standard senior multi-agent reliability probe; 2026 AI engineer loops
Answer outline:
- Beyond single-agent failures, multi-agent has:
  - Cascading failures: agent A's slightly-wrong output becomes agent B's input; B amplifies the error; by the time you reach agent C, it's badly wrong. Especially common in pipelines.
  - Communication failures: handoff messages malformed or misinterpreted; receiving agent does something wrong.
  - Orphaned sub-tasks: orchestrator spawns workers, one crashes silently; the orchestrator waits forever or returns incomplete results.
  - Cyclic handoffs: agent A delegates to B, B delegates back to A, loop. Particularly easy in peer-to-peer; less common with a strict supervisor.
  - Goal drift: each agent slightly reinterprets the task; cumulative effect is wandering away from the user's intent.
  - Cost explosions: N agents × M iterations × per-step LLM calls = quickly into the dollars-per-task range.
  - Debugging difficulty: traces span multiple agents with cross-links; finding the source of a wrong output requires walking the full graph.
- Mitigations:
  - Schema validation at every handoff.
  - Cross-agent budget enforcement (orchestrator-level).
  - Max-iteration caps per agent and total task cap.
  - Cycle detection (track which agents handed off to which; alarm on cycles).
  - Faithfulness checks at each stage: does this agent's output preserve the user's intent?
  - Trace-driven debugging with cross-agent span links.
- Numbers to drop: "multi-agent failure rate often 1.5-3× single-agent on the same task", "trace complexity scales with agent count × iteration depth", "cost: 2-5× single-agent for the same end result"
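Cycle detection on handoffs can be as simple as tracking the chain of agents that have held the task. The sketch below treats any revisit as a cycle, which is a simplifying assumption; a real system might allow a bounded number of revisits for legitimate review loops.

```python
class HandoffCycle(Exception):
    """Raised when a handoff would return control to an agent already in the chain."""


class HandoffTracker:
    def __init__(self, first_agent: str) -> None:
        self.chain: list[str] = [first_agent]

    def record(self, source: str, target: str) -> None:
        """Record source -> target, or raise if target already appears in the chain."""
        if source != self.chain[-1]:
            raise ValueError(f"{source} does not currently hold the task")
        if target in self.chain:
            cycle = self.chain[self.chain.index(target):] + [target]
            raise HandoffCycle(" -> ".join(cycle))
        self.chain.append(target)
```

The exception message carries the offending path, so the alarm already contains what a debugger needs to see.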
Common follow-ups: - "Walk me through a cascading failure you've debugged." - "How do you bound the cost of an N-agent task?"
Traps: - Treating multi-agent failure as just "more single-agent failures". The cascade and coordination dimensions are distinct.
Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/16_multi_agent_coordination/, learning/01_ai_engineering/03_agent_observability_debugging/
Q: "How do you bound the cost of a multi-agent task?"
Tags: senior · common · design · source: standard senior cost-control probe; 2026 AI engineer loops
Answer outline:
- Layered budgets:
  - Task-level budget: total $ across all agents for one user task. Hard cap at the orchestrator.
  - Per-agent budget: each worker has a max $; trip → orchestrator decides to retry / route to another / terminate.
  - Per-step budget: each tool call / LLM call has an estimated cost; orchestrator rejects calls that would blow the budget.
- Inputs to budget enforcement:
  - Predicted cost from token count × model rate.
  - Tool cost (some tools are themselves expensive: API calls, image generation).
  - Buffer for retries (assume 1-2 retries on average).
- Visibility to agents: include "you have $X of $Y remaining" in the agent's system prompt. Soft guidance; hard enforcement still at the orchestrator.
- Per-tenant daily cap: aggregate across tasks. One runaway task shouldn't consume a tenant's monthly budget.
- Telemetry: dashboard per-tenant, per-task-type cost. Alarm on cost growth.
- When the budget hits: graceful termination with structured partial result, not a hard crash.
- Numbers to drop: "task-level cap: $1-50 typical depending on use case", "tenant daily cap as backstop", "buffer 20-30% over expected cost"
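A minimal sketch of orchestrator-level enforcement with task-level and per-agent caps follows. The per-million-token rates passed to `estimate_llm_cost_usd` are caller-supplied placeholders, not real model prices, and the 25% default buffer is just one point inside the 20-30% range mentioned above.

```python
from dataclasses import dataclass, field


def estimate_llm_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_rate_per_mtok: float,
    output_rate_per_mtok: float,
    retry_buffer: float = 0.25,
) -> float:
    """Predicted cost from token counts and per-million-token rates, plus a retry buffer."""
    base = (input_tokens * input_rate_per_mtok + output_tokens * output_rate_per_mtok) / 1_000_000
    return base * (1 + retry_buffer)


@dataclass
class TaskBudget:
    task_cap_usd: float
    per_agent_cap_usd: float
    spent_total: float = 0.0
    spent_by_agent: dict[str, float] = field(default_factory=dict)

    def authorize(self, agent: str, estimated_cost_usd: float) -> bool:
        """Per-step check: reject a call that would exceed either the task or the agent cap."""
        within_task = self.spent_total + estimated_cost_usd <= self.task_cap_usd
        within_agent = self.spent_by_agent.get(agent, 0.0) + estimated_cost_usd <= self.per_agent_cap_usd
        return within_task and within_agent

    def record(self, agent: str, actual_cost_usd: float) -> None:
        """Record what a completed call actually cost."""
        self.spent_total += actual_cost_usd
        self.spent_by_agent[agent] = self.spent_by_agent.get(agent, 0.0) + actual_cost_usd

    def remaining_usd(self) -> float:
        """Value to surface in the agent's prompt as soft guidance."""
        return max(self.task_cap_usd - self.spent_total, 0.0)
```

When `authorize` returns `False`, the orchestrator is the one that decides between retrying elsewhere and terminating gracefully with a partial result; the agent only ever sees `remaining_usd()` as a hint.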
Common follow-ups: - "What happens when a worker is mid-LLM-call and budget hits?" - "How do you predict cost before running?"
Traps: - Soft budget (prompt-only). Need orchestrator-level enforcement.
Related cross-cutting: Cost & latency, Production patterns
Related module: learning/01_ai_engineering/16_multi_agent_coordination/, learning/01_ai_engineering/03_agent_observability_debugging/
Q: "An agent in your multi-agent system silently fails. How do you detect it?"
Tags: senior · common · debugging · source: standard senior reliability probe; 2026 AI engineer loops
Answer outline:
- Silent failure: agent didn't crash but produced wrong / empty / hallucinated output that downstream agents accept.
- Detection layers:
  - Output validation at handoff: every output passes schema + sanity checks before becoming next agent's input. Schema rejects malformed; sanity rejects empty / suspicious patterns.
  - Faithfulness check: a verifier (LLM judge or rule-based) confirms the agent's output is consistent with its input.
  - Confidence threshold: agent emits a confidence score; below threshold → flag for review or retry.
  - Heartbeat / progress signal: long-running agents emit periodic progress; absence indicates stuck/dead.
- End-to-end check: at the final output, a separate critic validates against the user's original intent. Catches the cumulative-drift case where every individual handoff passed but the result is wrong.
- Telemetry: per-agent silent-failure rate (when the validator fails). Trends over time. Alarm on regressions.
- For specific patterns:
  - Empty output: simple length / content check.
  - Hallucinated tool use: validate tool calls against schema.
  - Off-topic output: embedding similarity to expected topic.
  - Goal drift: check against the user's original query, not just the immediate input.
- Numbers to drop: "schema validation catches structural failures: ~80%", "LLM-judge faithfulness catches semantic drift: ~70-90% with calibration", "end-to-end critic on 1-10% of traffic for offline review"
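Two of the cheaper detection layers, sanity checks at handoff and heartbeat staleness, might look like this. The thresholds (`min_chars`, `timeout_s`) and the shape of the heartbeat map are illustrative assumptions; faithfulness and drift checks need a verifier model and are not shown.

```python
def output_issues(
    text: str,
    tool_calls: list[str],
    known_tools: set[str],
    min_chars: int = 20,
) -> list[str]:
    """Cheap sanity checks run on an agent's output before it is handed on."""
    issues = []
    if len(text.strip()) < min_chars:
        issues.append("output empty or too short")
    for name in tool_calls:
        if name not in known_tools:
            issues.append(f"unknown tool call: {name}")
    return issues


def stalled_agents(
    last_heartbeat: dict[str, float],
    now: float,
    timeout_s: float = 30.0,
) -> list[str]:
    """Agents whose most recent heartbeat is older than the timeout."""
    return sorted(agent for agent, seen in last_heartbeat.items() if now - seen > timeout_s)
```

Counting how often `output_issues` returns a non-empty list, per agent, gives the silent-failure rate that the telemetry bullet asks for.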
Common follow-ups: - "How do you tell drift apart from valid creative variance?" - "What if the validator itself hallucinates?"
Traps: - No handoff validation. Errors cascade silently.
Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/16_multi_agent_coordination/, learning/01_ai_engineering/03_agent_observability_debugging/
Specialization & roles
Q: "How do you decide what each agent's role should be?"
Tags: senior · common · design · source: standard senior multi-agent design probe; 2026 AI engineer loops
Answer outline:
- Role boundaries should match either cognitive specialization (different prompts produce better outputs for different sub-tasks) or system isolation (trust, scope, tool-access reasons).
- Cognitive specialization: writer, reviewer, researcher, planner, fact-checker. Each has a distinct prompt that does that thing well. The boundary justifies itself if the specialized prompts beat a generalist prompt on eval.
- System isolation:
  - Untrusted-content reader vs privileged-action executor (trust boundary).
  - Different tool sets per agent (scope boundary).
  - Different cost tiers per agent (smaller model for simple steps).
- Anti-pattern roles to avoid:
  - Personas that don't change behavior: "I'm Alex, the friendly assistant" vs "I'm Sam, the technical expert". If their prompts are the same except the name, they're the same agent.
  - Roles that just slice the prompt: one agent for "intro" and one for "body" of a single response. Just one agent with a better prompt.
  - Roles created for vibes: "let's have a critic agent" without measuring whether the critic's feedback actually improves output.
- Validate each role with an A/B: does having this separate agent measurably improve quality vs collapsing into the parent? If not, collapse.
- Numbers to drop: "valid role: prompt is materially different + measurable quality lift", "rule of thumb: ≤5 distinct agent roles for most production systems"
Common follow-ups: - "Show me a real role boundary you've found valuable." - "How do you measure whether a role is justified?"
Traps: - Creating agents because the diagram looks better with more boxes.
Related cross-cutting: Architecture choices
Related module: learning/01_ai_engineering/16_multi_agent_coordination/, learning/01_ai_engineering/01_agentic_system_design/
Q: "How do you assign tools to agents in a multi-agent system?"
Tags: senior · common · design · source: standard senior architecture probe; 2026 AI engineer loops
Answer outline:
- Principle: least privilege. Each agent gets only the tools it needs for its role.
- Examples:
  - Research agent: read-only tools (web search, vector store query, doc fetch).
  - Writer agent: zero tools; receives gathered facts, produces text.
  - Editor agent: text-manipulation tools, possibly a fact-check tool.
  - Executor agent: privileged tools (send email, create ticket) only if explicitly authorized.
- Why this matters for safety: trust isolation. An agent reading untrusted content (web pages, emails) shouldn't have the email-send tool. Otherwise indirect prompt injection in the content can chain into a privileged action.
- Configuration: tool allow-list per agent role. Orchestrator enforces; the agent's tool description only includes the allowed tools.
- For tool retrieval (large tool catalogs): per-agent retrieval scope. Researcher only retrieves research tools; writer only retrieves writing tools.
- Audit: log every tool call with agent ID + tenant ID + tool ID. Detect anomalies (an agent calling a tool it shouldn't have access to).
- Numbers to drop: "least-privilege agent: 3-10 tools typical", "tool catalog with retrieval: 50-500+ tools possible, with per-agent retrieval scope"
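An orchestrator-enforced allow-list with audit logging could be sketched like this. Role names and tool names mirror the examples above but are otherwise hypothetical, and `tools` is assumed to be a plain mapping from tool name to callable.

```python
from typing import Any, Callable

# Hypothetical roles and tool names, following the examples above.
TOOL_ALLOWLIST: dict[str, frozenset[str]] = {
    "research": frozenset({"web_search", "vector_query", "doc_fetch"}),
    "writer": frozenset(),
    "executor": frozenset({"send_email", "create_ticket"}),
}


class ToolNotAllowed(PermissionError):
    """Raised when an agent requests a tool outside its role's allow-list."""


def call_tool(
    agent_role: str,
    tenant_id: str,
    tool_name: str,
    tools: dict[str, Callable[..., Any]],
    audit_log: list[dict[str, Any]],
    **kwargs: Any,
) -> Any:
    """Check the allow-list, write an audit record either way, then dispatch."""
    allowed = tool_name in TOOL_ALLOWLIST.get(agent_role, frozenset())
    audit_log.append(
        {"agent": agent_role, "tenant": tenant_id, "tool": tool_name, "allowed": allowed}
    )
    if not allowed:
        raise ToolNotAllowed(f"{agent_role} may not call {tool_name}")
    return tools[tool_name](**kwargs)
```

Denied attempts are logged too, which is what makes the anomaly detection in the audit bullet possible: a research agent asking for `send_email` is itself a signal worth alerting on.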
Common follow-ups: - "Walk me through tool isolation for the lethal trifecta." - "What if two agents legitimately need the same tool?"
Traps: - Giving all agents all tools "for flexibility". The flexibility costs safety and tool-selection quality.
Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/16_multi_agent_coordination/, learning/03_ai_security_safety/00_safety_guardrail_design/
Scenarios
Q: "Design a multi-agent system for content moderation at scale."
Tags: senior · common · design · source: standard senior multi-agent design probe; 2026 AI engineer loops
Answer outline:
- Single-agent moderation doesn't decompose well — there's no clear specialization boundary for "look at this content and decide if it violates policy". So why multi-agent here? Because severity tiers + cross-modal content + appeals create real role boundaries.
- Architecture:
    - Triage agent / classifier (small fast model): runs on every piece of content. Returns severity tier (critical, high, medium, low, benign) and category. ~50-150ms.
    - Cross-modal analyzers: per-modality (text, image, audio after ASR). Each is specialized for its modality. Outputs feed into the triage agent.
    - Deep-analysis agent (larger model, slower): invoked only on borderline cases (medium severity). Reads policy text + content + context; emits structured verdict.
    - Appeals agent: reviews escalated decisions with the full audit trail; recommends overturn or uphold.
- Coordination: deterministic orchestrator (workflow engine), not an agent. Each step has clear input/output schema; no LLM is deciding "who runs next".
- Trade-off vs single-agent: more cost per piece of content, but better latency (most content takes the fast path; only borderline triggers expensive analysis) and better explainability (per-step traces).
- See safety-guardrails.md and the system-design Content moderation question for the broader picture; here the multi-agent framing focuses on the role boundaries.
- Numbers to drop: "fast triage: 95%+ of content; deep analysis on 1-5%; appeals on <0.5%", "per-stage cost: triage <$0.0001, deep $0.01-0.10"
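The deterministic coordination described in the outline can be sketched as ordinary branching over a typed triage result. This is a minimal sketch under stated assumptions: `stub_triage` and `stub_deep_analysis` are hypothetical keyword stand-ins for the real models, the fast-path rule (critical/high removed, low/benign allowed) is an assumption rather than something the outline specifies, and the cross-modal analyzers and appeals stage are omitted.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Severity(Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"
    BENIGN = "benign"


@dataclass(frozen=True)
class TriageResult:
    severity: Severity
    category: str


@dataclass(frozen=True)
class Verdict:
    action: str              # "remove" or "allow"
    decided_by: str          # stage that produced the verdict
    trace: tuple[str, ...]   # per-step trace for explainability


def stub_triage(content: str) -> TriageResult:
    # Hypothetical stand-in for the small fast classifier.
    if "attack" in content:
        return TriageResult(Severity.CRITICAL, "violence")
    if "maybe" in content:
        return TriageResult(Severity.MEDIUM, "spam")
    return TriageResult(Severity.BENIGN, "none")


def stub_deep_analysis(content: str, triage: TriageResult) -> str:
    # Hypothetical stand-in for the larger model reading policy + content.
    return "remove" if "spam" in content else "allow"


def moderate(
    content: str,
    triage: Callable[[str], TriageResult],
    deep_analysis: Callable[[str, TriageResult], str],
) -> Verdict:
    """Plain code, not an LLM, decides which stage runs next."""
    result = triage(content)
    trace = [f"triage:{result.severity.value}:{result.category}"]
    if result.severity is Severity.MEDIUM:
        # Only borderline content pays for the expensive stage.
        action = deep_analysis(content, result)
        trace.append(f"deep:{action}")
        return Verdict(action, "deep_analysis", tuple(trace))
    # Assumed fast-path rule: critical/high removed, low/benign allowed.
    severe = result.severity in (Severity.CRITICAL, Severity.HIGH)
    return Verdict("remove" if severe else "allow", "triage", tuple(trace))
```

Because the branch on `Severity.MEDIUM` is the only routing decision, the share of traffic reaching the deep-analysis stage can be measured and tested directly.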
Common follow-ups: - "Why isn't this just one big classifier?" - "How do you handle the deep-analysis agent disagreeing with triage?"
Traps: - Going multi-agent where a tiered classifier would do.
Related cross-cutting: Architecture choices
Related module: learning/01_ai_engineering/16_multi_agent_coordination/, learning/03_ai_security_safety/00_safety_guardrail_design/
Q: "Walk me through a customer-support multi-agent system."¶
Tags: senior · common · design · source: standard senior multi-agent design probe; 2026 AI engineer loops
Answer outline:
- Architecture: orchestrator-worker with a deterministic router.
    - Router (classifier, not LLM): receives user message, classifies intent (FAQ, account, billing, technical, complex).
    - FAQ agent: fast path, RAG over support docs, return answer with citations. Most traffic.
    - Account agent: handles account-related queries; has account-info tools.
    - Billing agent: handles billing queries; has billing-system tools (with careful authorization).
    - Technical agent: handles app/product issues; can run diagnostic tools.
    - Escalation agent / handoff: surfaces complex cases to a human; preserves trace and summary.
- Shared layer:
    - Conversation memory (per-user; see memory-systems.md when written).
    - Output guardrails (PII leak, brand voice, refusal patterns).
    - Cost / budget enforcement per task.
- Why this maps well: each agent has clearly different tools and authorization scope. Billing agent shouldn't have FAQ tools; FAQ shouldn't have billing access. Trust boundary plus tool specialization justifies separate agents.
- Communication: structured handoff. Router → agent: {intent, user_query, conversation_summary}. Agent → escalation: {transcript, identified_issue, suggested_resolution}.
- Eval: per-agent quality, per-handoff success rate, end-to-end CSAT.
- Numbers to drop: "router: ~$0.001/call, ~50ms", "per-agent specialized prompts: 5-15% quality lift vs generalist", "escalation rate target: <10%"
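The structured handoffs named in the outline can be sketched as typed records plus a deterministic router. The field names come from the outline; everything else is an assumption for illustration: the keyword table stands in for a trained intent classifier, and treating a query that matches several intents as `complex` is a hypothetical rule, not something the outline prescribes.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum


class Intent(Enum):
    FAQ = "faq"
    ACCOUNT = "account"
    BILLING = "billing"
    TECHNICAL = "technical"
    COMPLEX = "complex"


@dataclass(frozen=True)
class RouterHandoff:
    """Router -> agent payload."""
    intent: Intent
    user_query: str
    conversation_summary: str


@dataclass(frozen=True)
class EscalationHandoff:
    """Agent -> escalation payload."""
    transcript: list[str]
    identified_issue: str
    suggested_resolution: str


# Hypothetical keyword rules standing in for a trained classifier.
_KEYWORDS: dict[Intent, tuple[str, ...]] = {
    Intent.BILLING: ("invoice", "refund", "charge"),
    Intent.ACCOUNT: ("password", "login"),
    Intent.TECHNICAL: ("crash", "error", "bug"),
}


def route(user_query: str, conversation_summary: str = "") -> RouterHandoff:
    text = user_query.lower()
    matches = [i for i, words in _KEYWORDS.items() if any(w in text for w in words)]
    if not matches:
        intent = Intent.FAQ          # default fast path
    elif len(matches) == 1:
        intent = matches[0]
    else:
        intent = Intent.COMPLEX      # assumed rule: ambiguous queries escalate
    return RouterHandoff(intent, user_query, conversation_summary)


def to_wire(handoff: RouterHandoff) -> str:
    # Serialize to JSON so the receiving agent gets fields, not free-form prose.
    payload = asdict(handoff)
    payload["intent"] = handoff.intent.value
    return json.dumps(payload, sort_keys=True)
```

With a fixed schema, per-handoff success rate becomes measurable: a handoff either validates against the expected fields or it does not.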
Common follow-ups: - "What if the router misclassifies?" - "How do you avoid the agent count blowing up as you add domains?"
Traps: - Free-form prose handoffs between agents.
Related cross-cutting: Architecture choices, Production patterns
Related module: learning/01_ai_engineering/16_multi_agent_coordination/, learning/01_ai_engineering/01_agentic_system_design/
Q: "How do you scale a multi-agent system to handle thousands of concurrent tasks?"¶
Tags: staff · common · design · source: standard staff-tier scale probe; 2026 AI engineer loops
Answer outline:
- The orchestrator becomes the bottleneck if it's stateful or single-instance.
- Architecture for scale:
    - Stateless orchestrator workers: any worker can pick up any task. State lives in a durable store (Postgres, Temporal's own store, or Redis with persistence configured).
    - Task queue: incoming tasks land in a queue (SQS, Kafka, RabbitMQ); workers pull and process.
    - Per-task isolation: one task's state never leaks to another. Tenant/task ID in every state key.
    - Agent workers as a pool: agent calls go through a shared LLM client pool; provider keys / rate limits shared across the org via the AI gateway (see ai-system-design.md).
    - Tool workers: tools that are slow (DB queries, external APIs) run on dedicated pools; agents await results via the orchestrator.
- Durability: each agent's checkpoint persisted; crashed worker → another worker resumes. Use Temporal or equivalent workflow engine for the orchestrator state.
- Backpressure: when overloaded, queue grows; orchestrator rate-limits new task ingest. Prefer "queue and serve later" over "drop tasks".
- Observability: dashboards per agent role (queue depth, processing time, error rate); per-tenant (cost, task count); aggregate (E2E latency, success rate).
- Numbers to drop: "thousands of concurrent tasks: orchestrator as stateless pool with persistent state in Postgres/Temporal", "agent worker pool autoscales on queue depth"
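The stateless-worker and checkpoint points in the outline can be sketched with the standard library alone. This is a toy under loud assumptions: `StateStore` is an in-memory stand-in for a durable store, `queue.Queue` stands in for SQS/Kafka/RabbitMQ, the step names are hypothetical, and a crash is simulated by putting the task back on the queue (a real broker would redeliver it).

```python
import queue
from dataclasses import dataclass
from typing import Optional

STEPS = ("plan", "research", "write")  # hypothetical agent steps


@dataclass(frozen=True)
class Task:
    tenant_id: str
    task_id: str


class StateStore:
    """In-memory stand-in for a durable store; holds the next step per task."""

    def __init__(self) -> None:
        self._data: dict[str, int] = {}

    @staticmethod
    def key(task: Task) -> str:
        # Tenant and task ID in every key, so state cannot leak across tasks.
        return f"tenant:{task.tenant_id}:task:{task.task_id}:checkpoint"

    def load(self, task: Task) -> int:
        return self._data.get(self.key(task), 0)

    def save(self, task: Task, next_step: int) -> None:
        self._data[self.key(task)] = next_step


def run_worker(
    tasks: "queue.Queue[Task]",
    store: StateStore,
    crash_after: Optional[int] = None,
) -> list[str]:
    """Stateless worker: all it knows about a task comes from the store."""
    task = tasks.get_nowait()
    start = store.load(task)
    executed: list[str] = []
    for i in range(start, len(STEPS)):
        if crash_after is not None and i - start >= crash_after:
            tasks.put(task)  # simulated redelivery after a worker crash
            return executed
        executed.append(STEPS[i])   # stand-in for running the agent step
        store.save(task, i + 1)     # checkpoint after each completed step
    return executed
```

Any worker that later pulls the redelivered task resumes from the saved checkpoint instead of repeating finished steps, which is the property that makes the pool safe to autoscale on queue depth.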
Common follow-ups: - "How do you handle a tenant overwhelming the queue?" - "What's the failure mode of the workflow engine itself?"
Traps: - Stateful orchestrator. Single point of failure + bottleneck.
Related cross-cutting: Architecture choices, Production patterns
Related module: learning/01_ai_engineering/16_multi_agent_coordination/, learning/02_ai_infrastructure/04_ml_platform_operations/
Frameworks¶
Q: "Compare LangGraph, AutoGen, CrewAI, and Temporal for multi-agent orchestration."¶
Tags: senior · common · conceptual · source: Multi-Agent Orchestration 2026 guides; standard senior tooling probe
Answer outline:
- LangGraph (LangChain): state-machine-based, explicit nodes and edges, typed state transitions. Best for structured workflows where you want clear handoffs and debugging. Native LangSmith integration. The 2026 default for many LangChain shops.
- AutoGen (Microsoft Research): conversational multi-agent; agents talk via group chat. More research-flexible; supports both structured and free-form. Better for emergent multi-agent behavior.
- CrewAI: role-and-task abstraction (you define a crew of agents with roles, then assign tasks). Higher-level than LangGraph; less code for common patterns. Adoption growing for simple multi-agent flows.
- Temporal: not LLM-specific. General workflow engine with durable state, retries, signals, timers. Use Temporal as the orchestrator and call LLMs / agents as activities. Best when reliability and durability matter more than LLM-specific ergonomics.
- 2026 default: LangGraph if you want LLM-native + clear state model; Temporal if reliability is critical; AutoGen if you're doing research / exploration; CrewAI for quick wins on simple flows.
- The senior insight: framework choice matters less than architecture choice. A clean state machine in any framework beats a tangled "chat between agents" in the fanciest framework.
- Numbers to drop: "LangGraph adoption growing in 2026 for production multi-agent", "Temporal for reliability-critical agent workflows", "AutoGen more research / experimental", "CrewAI for low-code multi-agent"
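To make the "clean state machine in any framework" point concrete, here is a framework-free sketch of explicit nodes, edges, and typed state. It deliberately does not use LangGraph's, AutoGen's, CrewAI's, or Temporal's API; the node names, the `State` fields, `MAX_REVISIONS`, and the stub reviewer that approves the second draft are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable

END = "END"
MAX_REVISIONS = 3  # hypothetical cap so the loop always terminates


@dataclass
class State:
    topic: str
    draft: str = ""
    approved: bool = False
    revisions: int = 0
    history: list[str] = field(default_factory=list)


def write(state: State) -> State:
    # Stub writer; a real node would call a model here.
    state.revisions += 1
    state.draft = f"{state.topic} (draft {state.revisions})"
    return state


def review(state: State) -> State:
    # Stub reviewer: approves from the second draft onward.
    state.approved = state.revisions >= 2
    return state


NODES: dict[str, Callable[[State], State]] = {"write": write, "review": review}


def next_node(current: str, state: State) -> str:
    """Edges are plain code, so control flow is readable and testable."""
    if current == "write":
        return "review"
    if current == "review":
        done = state.approved or state.revisions >= MAX_REVISIONS
        return END if done else "write"
    raise ValueError(f"unknown node {current!r}")


def run(state: State, start: str = "write") -> State:
    current = start
    while current != END:
        state.history.append(current)      # per-step trace
        state = NODES[current](state)
        current = next_node(current, state)
    return state
```

The same shape (named nodes, a typed state object, and transitions decided by code) carries over to whichever framework is chosen, which is why the architecture decision can be made first.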
Common follow-ups: - "When would you build your own orchestrator instead?" - "How do you handle versioning of an agent workflow?"
Traps: - Picking the framework before deciding the architecture.
Related cross-cutting: Architecture choices
Related module: learning/01_ai_engineering/16_multi_agent_coordination/, learning/01_ai_engineering/02_durable_agent_workflows/