Agent Design — Interview Questions¶
Agentic AI is now its own interview round in 2026. The defining question of the year: "What's the difference between an agent and a simple LLM chain?". This file covers design decisions for single-agent systems — multi-agent is its own file.
Architecture (ReAct, plan-and-execute)¶
Q: "What's the difference between an agent and a simple LLM chain?"¶
Tags: screen · very-common · conceptual · source: adilshamim8 Medium "Every AI Engineer Interview Question 2026" + Anthropic "Building Effective Agents"
Answer outline: - Chain: fixed DAG of LLM calls, deterministic control flow, you decide step order at design time. - Agent: LLM decides next step at runtime, including which tool to call and when to stop. Anthropic's framing: "workflows" vs "agents" — workflows are predefined paths, agents are model-driven loops. - Agent loop has three parts: reason, act (tool call), observe (tool result fed back as message). - Cost shape: chains have bounded token cost, agents can run for arbitrary iterations until a stopping rule fires. - Use agents only when "it's difficult or impossible to predict the required number of steps" (Anthropic). Otherwise a workflow is cheaper and more reliable. - Numbers to drop: "agent loops typically iterate 3-15 times per task; a chain runs in 1-3 LLM calls; agents cost 5-20x more tokens per task at the same accuracy."
Common follow-ups: - "Give me a task where an agent is clearly the wrong choice." - "When would Anthropic say to start with a chain and only graduate to an agent?"
Traps: - Treating "agentic" as a marketing label rather than a control-flow decision. - Reaching for an agent when a 3-node LangGraph workflow would do the job for 1/10th the cost. - Forgetting that every agent loop iteration replays the full context, so cost grows linearly with iteration count.
Related cross-cutting: Architecture choices
Related module: learning/01_ai_engineering/01_agentic_system_design/
Q: "Walk through a production-ready agent architecture."¶
Tags: senior · very-common · design · source: adilshamim8 Medium 2026; atul4u "Complete Agentic AI System Design Interview Guide 2026"
Answer outline: - Five layers: orchestrator (state machine), LLM client (with caching), tool interface layer, memory (short + long), policy/guardrails engine, observability (traces + metrics). - Orchestrator owns control flow — it enforces max iterations, total token budget, retries, timeouts. The LLM owns reasoning only. - Tool layer wraps each tool with schema validation (Pydantic/JSON Schema), rate limiting, idempotency keys, structured error returns. - Every loop iteration emits a span: prompt, response, tool call, tool result, latency, tokens, cost. OpenTelemetry + LangSmith or Arize for traces. - Stopping conditions are first-class: max_steps, max_tokens, max_wall_clock, final_answer_predicate, low-confidence escalation. - Numbers to drop: "p50 agent task in production at my last role: 4 tool calls, 18k tokens, 12s wall clock; p99: 14 tool calls, 90k tokens, 75s — we capped at 20 steps and $0.50/task."
Common follow-ups: - "Where exactly does the policy engine sit — before or after the LLM call?" - "How do you handle a tool call that takes 90 seconds?" - "Draw the failure-mode diagram for one iteration."
Traps: - Putting retry logic inside the LLM prompt instead of the orchestrator. - No total-token cap, only a per-iteration step cap (an agent can burn $50 in 10 steps with bloated context). - Treating tools as plain function calls with no telemetry, so when something breaks you have no trace.
Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/01_agentic_system_design/
Q: "What is ReAct? When does it fail?"¶
Tags: mid · very-common · conceptual · source: Yao et al. "ReAct: Synergizing Reasoning and Acting" (2022); SurePrompts "ReAct Prompting Guide 2026"; aemonline.net top-25 agentic AI 2026
Answer outline: - ReAct = Reason + Act loop. LLM emits a thought, then an action (tool call), receives an observation, and repeats. The thought is what makes it "ReAct" instead of just function-calling. - Strengths: handles unpredictable environments well — every observation can change the next thought. Good for search, navigation, debugging. - Failure mode 1: long horizons — context grows linearly with iterations, so 30+ step tasks blow context window and the model loses track. - Failure mode 2: greedy local optimization — ReAct doesn't backtrack, so once it commits to a bad sub-path it doubles down. - Failure mode 3: prompt sensitivity — the "Thought:" channel can leak into the user-visible output if the parser is sloppy. - Numbers to drop: "ReAct on HotpotQA hits ~35% EM in the original paper; modern tool-augmented variants reach ~60%+. ReAct dies around 15-20 steps in our internal eval because of context bloat."
Common follow-ups: - "How do you keep ReAct from blowing the context window on long tasks?" - "ReAct vs plan-and-execute on a multi-file refactor — which wins?"
Traps: - Confusing ReAct with plain chain-of-thought (ReAct has tools and observations, CoT does not). - Letting the "Thought:" string become a load-bearing parser anchor — models drift in formatting. - Assuming ReAct is the answer for everything because it's the default in LangChain.
Related cross-cutting: Architecture choices
Related module: learning/01_ai_engineering/01_agentic_system_design/
Q: "Compare ReAct, Plan-and-Solve, and Tree-of-Thoughts with a real-world trade-off."¶
Tags: senior · common · conceptual · source: aemonline.net top-50 agentic AI 2026 (Q16)
Answer outline: - ReAct: interleaved reason+act, reactive, no upfront plan. Best when environment is unpredictable. - Plan-and-Solve / Plan-and-Execute: produce full plan first, then execute step by step, optionally re-plan on failure. Predictable, gives a natural human-review gate before execution starts. - Tree-of-Thoughts: explore multiple reasoning branches, evaluate each, backtrack. Expensive — 5-20x token cost of ReAct. - Real-world trade-off: code-fixing on a known repo → plan-and-execute (cheaper, reviewable plan). Bug triage in an unknown system → ReAct (you don't know what you'll find). Research with many candidate hypotheses → ToT, capped at 3 branches. - Plan-and-execute "hurts when the task is not decomposable because you do not know the environment yet" (SurePrompts 2026). - Numbers to drop: "On a refactor benchmark, plan-and-execute cut iterations from 9 → 4 vs ReAct, saving 55% tokens; on bug triage in unknown repos, plan-and-execute failed 40% of the time because the initial plan referenced wrong file paths."
Common follow-ups: - "When does Tree-of-Thoughts beat a simple ReAct loop with reflection?" - "How do you decide the branching factor for ToT?"
Traps: - Using plan-and-execute on tasks where you can't enumerate steps without exploration. - Pricing Tree-of-Thoughts as if it were one-shot — every branch is its own ReAct loop. - Forgetting that plan-and-execute still needs a re-plan trigger when a step fails.
Related cross-cutting: Architecture choices
Related module: learning/01_ai_engineering/01_agentic_system_design/
Q: "What logic belongs in the orchestrator vs the LLM?"¶
Tags: senior · common · design · source: adilshamim8 Medium 2026; atul4u system design guide 2026 (Q8)
Answer outline:
- LLM: choose the next action, generate arguments, summarize observations, draft user-facing prose. Anything that needs semantic judgment.
- Orchestrator (code): loop control, step count, token accounting, retry policy, timeout enforcement, tool dispatch, schema validation, idempotency keys, audit logging, approval gates.
- Rule of thumb: if it can be expressed as if/while/for, it doesn't belong in the prompt. Prompts are terrible loops.
- Determinism matters here: the orchestrator is the only place you can guarantee invariants (max 20 steps, max $0.50, never call DELETE without approval).
- The LLM should not know its own budget cap — you don't want it negotiating with itself. The orchestrator enforces silently.
- Numbers to drop: "Moving retry/backoff out of the prompt and into orchestrator code cut hallucinated tool calls by ~30% and saved 22% tokens because the LLM no longer rationalizes 'I'll try once more'."
Common follow-ups: - "Where does the system prompt's role policy enforcement actually run?" - "Can the LLM ever modify the budget?"
Traps: - Pushing security checks into prompts ("don't delete files unless..."). LLMs route around prompts. - Letting the LLM decide when to stop — it will stop too early or too late. - Putting tool-discovery logic into prompts at runtime when a registry lookup in code is simpler.
Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/01_agentic_system_design/
Q: "How do you design a safe and debuggable agent loop?"¶
Tags: senior · common · design · source: adilshamim8 Medium 2026; atul4u system design guide 2026 (Q9)
Answer outline:
- Make every state transition recorded — a single append-only AgentEvent stream with thought, tool_call, tool_result, latency, tokens, model_version.
- Replayable state: the agent state at step N must be reconstructible from events 1..N-1 plus the seed. No hidden state in instance variables.
- Hard caps on the loop: max_steps, max_tokens, max_seconds, max_cost_usd. Any one trip halts.
- Side-effect quarantine: tools that mutate the world are marked mutating=True; orchestrator routes them through approval or dry-run modes during dev.
- Structured tool errors: every tool returns {ok, value | error_code, message, retryable, hint} so the LLM can reason about failures without parsing strings.
- Numbers to drop: "After we added replay-from-event-log, MTTR on agent bugs dropped from ~3 hours to ~25 minutes because we could reproduce production failures locally."
Common follow-ups: - "How do you make a non-deterministic LLM replayable?" - "Where does the event log get garbage collected?"
Traps: - Logging only the final answer — you can't debug a 12-step loop from one row. - Letting tools throw raw exceptions back to the LLM as stack traces (the LLM tries to fix Python, not the task). - Using wall-clock time as the only termination signal in batch contexts.
Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/01_agentic_system_design/
Q: "When should you use a workflow instead of an agent?"¶
Tags: mid · very-common · conceptual · source: Anthropic "Building Effective Agents" (anthropic.com/research/building-effective-agents)
Answer outline: - Anthropic's canonical advice: "Start with simple prompts, optimize them with comprehensive evaluation, and add multi-step agentic systems only when simpler solutions fall short." - Workflows win when: steps are fixed, latency budget is tight, you need predictable cost, you can write evals for each step. - Agents win when: you cannot predict step count, environment is unpredictable, branching factor is high. - The five named Anthropic workflow patterns — prompt chaining, routing, parallelization, orchestrator-workers, evaluator-optimizer — cover ~80% of "agentic" use cases without a real agent loop. - Cost shape: workflows are bounded (you can quote a per-task cost); agents have heavy right tail. - Numbers to drop: "We replaced a 'customer query agent' with a 3-step prompt chain (classify → retrieve → answer) and cut p50 latency from 8s → 1.4s and cost from $0.08 → $0.011 per query at equal quality."
Common follow-ups: - "Walk through Anthropic's orchestrator-workers pattern — when does it beat a real agent?" - "How would you decide in an interview to recommend a workflow vs agent?"
Traps: - Reaching for an agent because the problem sounds open-ended when it actually decomposes into 3 steps. - Confusing "uses an LLM with tools" with "must be an agent loop". - Building a generic agent before you've shipped any chain in production.
Related cross-cutting: Cost & latency
Related module: learning/01_ai_engineering/01_agentic_system_design/
Q: "How would you design an agent to handle a task with an extremely long time horizon, like 'research the entire history of a company and write a 50-page report'?"¶
Tags: staff · occasional · scenario · source: aemonline.net 25-advanced agentic AI 2026 (Q1)
Answer outline: - Phase-decompose the task: outline → per-section research → per-section drafting → cross-section consistency pass → final assembly. Each phase is its own agent run with its own budget. - Hierarchical state machine: a top-level planner emits phase tickets, sub-agents complete tickets, results land in a durable store (Postgres or object storage), not the parent's context. - Checkpoint after every phase — the system must be killable and resumable. Critical when wall time is hours. - Compaction between phases: don't carry raw research notes into the drafting phase, carry a structured summary (key entities, dates, sources). - Stopping rule per phase: max iterations + "section_complete" predicate validated by an evaluator LLM. - Numbers to drop: "Anthropic's deep-research style agents run 5-30 minutes wall clock and consume 200k-2M tokens; we chunk at 30-min boundaries and persist intermediate state in S3 with content-hash keys."
Common follow-ups: - "Where does the agent store intermediate research without polluting context?" - "How do you ensure section N is consistent with section M written 4 hours earlier?"
Traps: - One monolithic ReAct loop trying to do everything in one context window. - No checkpointing — a 4-hour run that crashes at step 28 is unrecoverable. - Using episodic memory as if it were durable storage; it isn't, and it grows context per step.
Related cross-cutting: Architecture choices
Related module: learning/01_ai_engineering/01_agentic_system_design/
Q: "How do agents decompose high-level goals into executable steps?"¶
Tags: mid · common · conceptual · source: adilshamim8 Medium 2026
Answer outline:
- Three common approaches: LLM-as-planner (one shot, returns JSON plan), recursive decomposition (split until atomic), and emergent (ReAct just figures it out step by step).
- Atomic step = "callable by one tool in one turn with no further sub-goal". This is the planner's contract.
- Quality check on the plan: an evaluator LLM scores each step for is_atomic, is_actionable, references_valid_tools. Reject and re-plan if any fails.
- Output a graph, not a list — explicit dependencies enable parallel execution and partial replanning.
- Track plan-vs-execution drift: if the agent deviates from the plan more than N times, re-plan. Anthropic-style "evaluator-optimizer" loop.
- Numbers to drop: "Plan validation by a cheap model (Haiku-class) costs ~$0.001 per plan and catches ~25% of bad plans before any tool is called."
Common follow-ups: - "Show me the JSON schema you'd use for a plan." - "What's plan drift and how do you detect it?"
Traps: - Asking the LLM for a plan and then ignoring it inside a ReAct loop. - Plans that are too coarse ("research company X") — they leave the executor doing planning anyway. - No re-plan trigger, so when step 2 fails the agent fights step 3 with stale assumptions.
Related cross-cutting: Architecture choices
Related module: learning/01_ai_engineering/01_agentic_system_design/
Q: "Describe how you would architect an AI agent system, including the agent loop, tool interfaces, memory design, orchestration technologies, and safety considerations."¶
Tags: staff · common · design · source: adilshamim8 Medium "100 real interviews" 2026
Answer outline:
- Loop: typed state object (goal, messages, step, budget_used, pending_tool_calls), single step() function, deterministic event log.
- Tool interface: Pydantic schemas, namespaced names (gh_issues_search, gh_pulls_create), structured errors, idempotency keys, per-tool rate limits.
- Memory: short-term = message buffer with summarization at 70% of context; long-term = vector store keyed by agent_id + topic + ts; episodic = event log replayed for self-reflection.
- Orchestration: LangGraph or Temporal for durable execution, Redis for ephemeral state, Postgres for checkpoints.
- Safety: policy engine (OPA or in-process rules), pre-execution check on every mutating tool, audit log, rate limits per actor.
- Numbers to drop: "Temporal-backed agents survived our last region outage with zero lost tasks; in-memory LangGraph lost 800 in-flight runs. We standardized on Temporal for any agent with side effects."
Common follow-ups: - "Why Temporal over a plain Redis queue?" - "How does your policy engine talk to the orchestrator?"
Traps: - Conflating short-term context with long-term memory. - Skipping the durability layer because "it's just a prototype" — it never is. - Hardcoding tool names into prompts instead of generating from a registry.
Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/01_agentic_system_design/
Q: "Describe a scenario where a monolithic agent is a better choice than a multi-agent system."¶
Tags: mid · occasional · conceptual · source: aemonline.net 25-advanced agentic AI 2026 (Q3)
Answer outline: - Tight context coupling: when every step needs the same intermediate state, splitting it across agents forces re-serialization on every handoff. - Latency-sensitive tasks: agent-to-agent communication adds 1-3 LLM round-trips of overhead per handoff. - Simple linear flows: if the task is "research → summarize → format", a single ReAct agent with three tools beats three agents. - Small token budgets: multi-agent inflates total tokens 2-4x because each agent re-reads parts of context. - Strong eval signal: if you can write one eval for the whole task, you don't need orchestration boundaries. - Numbers to drop: "We replaced a 3-agent 'planner→researcher→writer' setup with one ReAct agent; latency dropped from 22s → 9s, cost dropped 60%, eval score identical."
Common follow-ups: - "When would you actually need agents to be separate processes?" - "Where's the multi-agent overhead coming from?"
Traps: - Splitting agents along organizational lines instead of task boundaries. - Assuming multi-agent is automatically more capable — it's usually just more expensive. - Ignoring handoff cost in latency budget.
Related cross-cutting: Cost & latency
Related module: learning/01_ai_engineering/01_agentic_system_design/
Tool schemas & descriptions¶
Q: "Why are tool descriptions more important than tool names?"¶
Tags: mid · very-common · conceptual · source: Anthropic "Writing Effective Tools for AI Agents" (anthropic.com/engineering/writing-tools-for-agents)
Answer outline:
- The LLM picks tools by reading the full schema, not just the name. Description carries the disambiguation signal.
- Anthropic: "small refinements to tool descriptions can yield dramatic improvements" — Claude Sonnet 3.5 reached SOTA on SWE-bench Verified after description tuning alone.
- Good descriptions answer: when to call, when NOT to call, what other tools to prefer for nearby cases, input format requirements, example use.
- Names handle coarse routing (gh_issues_search vs linear_issues_search); descriptions handle fine-grained "use this for X, not Y".
- Treat descriptions as you'd treat a new-hire onboarding doc for that tool. Anthropic's phrasing: "how you would describe your tool to a new hire on your team".
- Numbers to drop: "After we rewrote 12 tool descriptions to include 'use this when' + 'do NOT use this when' bullets, wrong-tool rate dropped from 14% → 3% on our eval, with no model change."
Common follow-ups: - "What's in a great tool description besides parameter types?" - "How do you evaluate description quality?"
Traps: - One-line descriptions like "Search GitHub issues." — useless for disambiguation. - Renaming a tool when the real fix is rewriting its description. - Letting two tools share overlapping descriptions; the model coin-flips between them.
Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/01_agentic_system_design/
Q: "How do you design tool schemas that reduce hallucinated actions?"¶
Tags: senior · very-common · design · source: adilshamim8 Medium 2026; atul4u 2026 (Q20)
Answer outline:
- Pydantic / JSON Schema with strict types — enums for categorical fields, regex for IDs, ranges for numerics. Reject before the tool runs.
- Specific parameter names: user_id, not user. repository_full_name, not repo. Anthropic explicitly recommends this.
- Required vs optional discipline — every required field should be required; everything else should default. Otherwise the model invents reasonable-sounding values.
- Constrained generation: use the provider's structured-output mode (OpenAI tool_choice, Anthropic tool use) so the schema is enforced server-side.
- Failure mode messages should restate the expected shape: "Expected user_id (string, UUID v4), got user_id='alice'". The model self-corrects on the next turn.
- Numbers to drop: "Strict Pydantic + structured outputs cut parameter hallucinations from 8% → 0.4% in our payments agent eval (n=5k tasks)."
Common follow-ups: - "How does this play with chain-of-thought before the tool call?" - "Show me an enum vs free-string trade-off you've shipped."
Traps:
- Loose Dict[str, Any] parameters — the model fills them with garbage.
- Allowing the same field name with different meanings across tools.
- Returning unstructured error strings; the model can't tell whether to retry.
Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/01_agentic_system_design/
Q: "How does an agent choose which tool to use when multiple tools seem relevant?"¶
Tags: mid · very-common · conceptual · source: aemonline.net 25-advanced agentic AI 2026 (Q8); adilshamim8 Medium 2026
Answer outline:
- The provider's tool-selection sampler reads name + description + schema of every registered tool and picks one in a single forward pass.
- Signal sources: name namespacing (asana_search vs jira_search), description "use when/use not when" bullets, parameter shape (matching what the model already has).
- Disambiguation tactic: include explicit comparisons in description — "Use this for code search; for prose search use docs_search."
- Routing precondition: if you have >20 tools, do a description-based vector search first to shortlist 5-7 before showing them to the main model. Saves tokens and reduces confusion.
- Hard guardrail: enums in the system prompt for which tools are valid in which agent role.
- Numbers to drop: "Anthropic showed namespacing-by-service plus namespacing-by-resource (asana_projects_search vs asana_users_search) materially improved tool selection in tool-use benchmarks; we measured a 6-point F1 gain after adopting it."
Common follow-ups: - "What's tool retrieval and when do you need it?" - "How do you A/B test tool descriptions?"
Traps:
- Loading 50 tools into the context unconditionally.
- Names that don't disambiguate (search, get, update for three services).
- Assuming the model will figure it out without a "do not use this for" clue.
Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/01_agentic_system_design/
Q: "What is a 'tool retrieval' mechanism, and when is it necessary?"¶
Tags: senior · common · conceptual · source: aemonline.net 25-advanced agentic AI 2026 (Q9)
Answer outline:
- Tool retrieval = semantic search over the registry of tools using the user's query, returning top-K to attach to the prompt.
- Necessary above ~30 tools — beyond that, tool descriptions alone eat the context budget and the model's selection accuracy drops.
- Implementation: embed each tool's name + description + example into a vector DB. At runtime, embed the user goal, retrieve top-7, attach those schemas.
- Refresh on every loop turn or only on plan boundaries — the trade-off is one being more flexible and the other being more predictable.
- Combine with metadata filters: scope ("read-only only", "current user has permission") before semantic ranking.
- Numbers to drop: "Our internal agent has 180 tools; without retrieval, p95 latency was 9s and tool-pick accuracy was 71%. With top-7 retrieval, p95 dropped to 4s and accuracy went to 88%."
Common follow-ups: - "Do you retrieve once or per turn?" - "How do you keep the retrieved tool set stable across turns?"
Traps: - Retrieving fresh every turn and confusing the model with shifting tool sets. - Embedding only the name — descriptions carry most of the signal. - Skipping a permission filter; the model "discovers" tools it can't actually call.
Related cross-cutting: Retrieval
Related module: learning/01_ai_engineering/01_agentic_system_design/
Q: "Design tool schemas for a financial advisor agent."¶
Tags: senior · occasional · design · source: synthesized from atul4u 2026 Q19-22 + Anthropic tool-writing guide
Answer outline:
- Scope first: read-only tools (account_get_positions, market_quote_get, portfolio_risk_score) and mutating tools (portfolio_rebalance_propose, trade_order_submit).
- All mutating tools take an idempotency_key (UUID) and an approval_token (only valid if HITL approved).
- Strict enums for everything: order_type ∈ {market, limit, stop}, time_in_force ∈ {day, gtc, ioc}. No free strings.
- dry_run: bool = True default on every mutating tool — orchestrator flips to false only after approval.
- Structured errors: {ok: false, code: "INSUFFICIENT_FUNDS", retryable: false, hint: "Reduce quantity or use margin"}.
- Audit attributes on every call: user_id, session_id, agent_version, reason. Stored separately from tool output.
- Numbers to drop: "We targeted 100% audit coverage on mutating calls and got there by making audit_context a required Pydantic field — calls without it 400 at the tool boundary."
Common follow-ups: - "How do you stop the agent from inventing a CUSIP?" - "What gets logged for SOX compliance?"
Traps:
- Free-form instruction: str arguments — guarantees prompt injection.
- One mega-tool trade(action: str, ...) instead of typed siblings.
- Mutating tools without idempotency — every retry doubles the position.
Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/01_agentic_system_design/
Q: "How do you measure the quality of tool descriptions for an agent?"¶
Tags: senior · occasional · design · source: aemonline.net top-50 agentic AI 2026 (Q10); Anthropic "Writing Effective Tools"
Answer outline:
- Build a held-out eval set of (user_goal, correct_tool, correct_args). Measure: tool-pick accuracy, arg-validity rate, end-to-end task success.
- Interleaved thinking traces: Anthropic explicitly recommends inspecting why the model chose tool X over Y to find description gaps.
- A/B descriptions: keep schemas constant, vary descriptions, measure delta on held-out eval. Treat tool docs like prompts — version them.
- Confusion matrix between similar tools: if docs_search and code_search get swapped >5% of the time, descriptions need explicit "use this for X, not Y".
- Automated: a meta-eval LLM scores descriptions for clarity, presence of examples, presence of negative cases.
- Numbers to drop: "Our description-quality scorecard hits 4 axes (when-to-use, when-not, examples, error guide); raising the average from 2.1/4 → 3.6/4 lifted task success from 64% → 79%."
Common follow-ups: - "How often do you re-run tool description evals?" - "How do you keep examples in descriptions from going stale?"
Traps: - "Eyeballing" descriptions instead of measuring against an eval. - A/B testing model and descriptions simultaneously (can't attribute delta). - Treating descriptions as documentation rather than load-bearing input.
Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/01_agentic_system_design/
Q: "What is 'lexical ambiguity' in tool names, and how can you prevent it?"¶
Tags: mid · occasional · conceptual · source: aemonline.net top-50 agentic AI 2026 (Q12)
Answer outline:
- Lexical ambiguity: tool names that share words but mean different things, e.g. search (CRM customers) vs search (Confluence pages).
- Fix with namespacing-by-service + namespacing-by-resource: salesforce_customer_search, confluence_page_search. Anthropic's exact recommendation.
- Avoid English synonyms across tools — pick one verb per action class (get, list, create, update, delete) and stick to it.
- Don't reuse a verb for both query and mutation (update_record should never read-only).
- Validate with a name-collision linter at registration time — fails CI if two tools share a stem.
- Numbers to drop: "After renaming 9 tools to <service>_<resource>_<verb> format, tool selection F1 went from 0.81 → 0.91; the lift held across GPT-4, Claude, and Gemini-Flash."
Common follow-ups: - "How do you handle this when tools come from external MCP servers you don't control?" - "Prefix vs suffix namespacing — which wins?"
Traps:
- Tools called do_thing, process, handle — meaningless to the model.
- Loading two tools with identical names from different servers without a prefix.
- Trusting the description to disambiguate two tools that share a name (it sometimes doesn't).
Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/01_agentic_system_design/
Q: "How do you design a robust function-calling interface that can handle malformed tool responses?"¶
Tags: senior · common · coding · source: aemonline.net top-50 agentic AI 2026 (Q6)
Answer outline:
- Validate every tool response against a response schema before feeding it back. Treat tools as untrusted.
- Three categories of malformed: missing fields, wrong types, surprising content (e.g., HTML when JSON expected). Each gets a different recovery path.
- Wrap raw response in a structured envelope: {ok, parsed, raw_excerpt, warnings}. The LLM sees parsed and only raw_excerpt (truncated) if validation fails.
- On parse failure, return a hint instead of dumping the raw payload: "Tool returned non-JSON; this usually means rate-limit or maintenance — try again with smaller limit."
- Idempotency: if validation fails after retry N, mark as terminal failure and stop. Don't loop on the same broken tool.
- Numbers to drop: "Adding response-schema validation cut downstream hallucinations from tool output by ~40% and halved the retry rate."
Common follow-ups: - "What's the right retry count for a tool returning 502?" - "How do you stop a 'fix the tool output' loop?"
Traps: - Treating tool stdout as model input verbatim. - Re-running a tool that returned a stable wrong format multiple times. - Dumping a 10MB HTML page into the LLM's next prompt because parsing failed.
Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/01_agentic_system_design/
Q: "What is tool hallucination and how do you prevent it?"¶
Tags: mid · very-common · conceptual · source: aemonline.net 50-questions 2026 (Q36); aemonline.net 25-advanced Q21
Answer outline:
- Two flavors: hallucinated tool (calls a tool that doesn't exist) and hallucinated args (right tool, made-up parameter values).
- Prevent tool hallucination: provide tools only via the API's native tool-use channel, not in prose. Models hallucinate prose tools more than API-declared ones.
- Prevent arg hallucination: strict schema + structured outputs + specific names. user_id: str (UUID v4) is harder to hallucinate than user: str.
- Catch what slips through: pre-execution validation in the orchestrator, structured error back to the model. Anthropic calls this "poka-yoke your tools".
- Telemetry: count schema-validation failures per tool per day. Sudden spike = description regression or model regression.
- Numbers to drop: "Anthropic's tool-writing guide reports description tuning alone reached SOTA on SWE-bench Verified; we saw arg-hallucination drop from 8% → 0.4% after adding strict schemas + structured outputs."
Common follow-ups: - "How do you tell schema-validation failure from network failure?" - "Does temperature affect tool hallucination?"
Traps:
- Putting tool catalog in the system prompt as prose and hoping the model uses real names.
- Permissive Any types in schemas.
- No telemetry on hallucination rate, so regressions ship unnoticed.
Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/01_agentic_system_design/
Q: "How do you handle tool failures, retries, and idempotency?"¶
Tags: senior · very-common · design · source: adilshamim8 Medium 2026; atul4u 2026 (Q22); aemonline.net 50-questions (Q42)
Answer outline:
- Classify failures: transient (5xx, rate limit, network) → retry with backoff; permanent (4xx, invalid auth, business rule) → don't retry, surface to LLM as structured error.
- Retry policy lives in the orchestrator, not the prompt: max 3 attempts, exponential backoff with jitter, total time budget per tool.
- Idempotency: every mutating tool takes an idempotency_key (UUID per logical action). Server dedupes — retries cannot create duplicates.
- Circuit breaker: if a tool fails >50% over the last 10 calls in 60s, trip the breaker and route around it for the next 5 minutes.
- Surface failures to the LLM as {ok: false, code, retryable: false, hint} — don't dump stack traces.
- Numbers to drop: "After adding idempotency keys + circuit breakers, we eliminated 100% of duplicate-charge incidents (previously 3-4/month) and dropped p99 task latency 22% because retries no longer queued behind dying tools."
Common follow-ups: - "What's the right backoff curve for an OpenAI rate-limit error?" - "What if the tool is non-idempotent and you can't fix it?"
Traps: - Retrying 4xx errors (it's not going to work the next time either). - Retrying mutating tools without idempotency keys. - Letting the LLM "decide" whether to retry — it doesn't have circuit-breaker state.
Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/01_agentic_system_design/
Q: "How do you sandbox tool execution safely?"¶
Tags: senior · common · design · source: adilshamim8 Medium 2026; atul4u 2026 (Q21)
Answer outline:
- Containment levels: read-only tools run in-process; mutating tools run via a separate service with its own auth; code-execution tools run in firecracker/gVisor with no network and a fresh tmpdir.
- Capability-based auth: each tool call carries a scoped token that names allowed resources, not the user's full creds.
- Resource caps per call: CPU seconds, RSS memory, network egress, files written. Kill on overage.
- Network egress allowlist per tool — web_search can hit Google; python_exec cannot reach the internet.
- Output sanitization: scrub secrets/PII from tool output before it lands in LLM context (regex + classifier).
- Numbers to drop: "Moving python_exec to gVisor + 256MB RSS + 10s CPU cut runaway-process incidents to zero; previously we had one per 50k tasks at 8GB RAM each."
Common follow-ups: - "How do you isolate code-exec tools from each other across users?" - "Where does the sandbox sit in the trace?"
Traps: - Letting code-exec tools share a filesystem with other tenants. - No egress allowlist, so the agent exfiltrates secrets via DNS. - Forgetting to scrub secrets from tool output that goes back to the LLM.
Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/01_agentic_system_design/
Stopping rules & budgets¶
Q: "How do you prevent an agent from over-reasoning or over-planning?"¶
Tags: senior · very-common · design · source: adilshamim8 Medium 2026; atul4u 2026 (Q5)
Answer outline:
- Hard cap on steps (max_iterations = 15-20 is typical for most agents). Trip → return best-effort answer with terminated=max_steps.
- Hard cap on total tokens (e.g., 100k per task) and total wall clock (e.g., 60s for sync tasks).
- Per-tool budget: tools that are cheap can be called 10x, expensive ones (LLM-judge, code-exec) capped at 2-3x.
- Progress check: every 5 iterations, an evaluator LLM asks "has progress been made vs 5 steps ago?" If not, halt or escalate.
- Plan once, execute many — plan-and-execute trims overthinking by separating decisions from execution.
- Numbers to drop: "We cap at 20 steps and \(0.50/task; before caps, 4% of tasks consumed >\)5 and 0.1% consumed >$50. After caps, p99 cost is bounded at $0.50 and quality dropped only 1.2 pts on our eval."
Common follow-ups: - "What signals indicate over-planning specifically vs. over-reasoning?" - "How do you A/B test budget caps?"
Traps: - No global cap, only per-tool caps — agent can rack up 50 LLM-only "thinking" turns. - Hard cap with no informative fallback (returns nothing instead of best-effort). - Letting the LLM see its own budget — it negotiates.
Related cross-cutting: Cost & latency
Related module: learning/01_ai_engineering/01_agentic_system_design/
Q: "How do you implement termination conditions in long-running agents?"¶
Tags: senior · very-common · design · source: adilshamim8 Medium 2026; atul4u 2026 (Q10)
Answer outline:
- Multiple independent stop conditions, OR'd: step >= max_steps, tokens >= max_tokens, wall_clock >= max_seconds, cost >= max_cost, final_answer_emitted, human_halt_signal.
- "Final answer" predicate: the agent must call a final_answer tool to terminate naturally. No free-form "I'm done" — the LLM can lie about being done.
- Stuck detector: if last 3 tool calls are identical (same name + same args), halt; obvious loop.
- No-progress detector: if 5 consecutive iterations produce no new tool calls (only "thinking"), halt.
- Externalized kill switch: a Redis key the orchestrator polls. Ops can stop any agent without redeploy.
- Numbers to drop: "Adding the 'identical-tool-call x3 → halt' rule caught 0.8% of runs that would otherwise have spun until max_steps, saving an average of 14k tokens each."
Common follow-ups: - "How does the agent know it's done without lying?" - "How do you handle agents that need 4 hours legitimately?"
Traps: - One stop condition only (e.g., step count) — agent burns the budget elsewhere. - LLM self-reported termination with no structural enforcement. - No kill switch — incident response means a deploy.
Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/01_agentic_system_design/
Q: "How do agents decide a task is 'done'?"¶
Tags: mid · common · conceptual · source: adilshamim8 Medium 2026
Answer outline:
- Three signals: explicit (model calls final_answer tool), implicit (no more tool calls, output is final prose), or external (orchestrator detects success criteria met).
- Best practice: require the explicit final_answer(answer: str, confidence: float, supporting_evidence: List[str]) tool. Forces structured completion.
- Have an evaluator LLM (or rule-based check) score the answer against the goal — if score < threshold, re-prompt with feedback.
- For deterministic tasks (SQL query, code fix), use task-specific verifiers (run the SQL, run the tests).
- Cap by budget regardless — "done by budget" is a legitimate termination state, labeled as such.
- Numbers to drop: "Switching from 'no more tool calls' to explicit final_answer reduced false completions (agent said done while the task was unfinished) from 11% → 2.3% on our eval."
Common follow-ups: - "What's a verifier and where does it live?" - "How do you handle 'partial done' for long tasks?"
Traps: - Trusting the model's word that it's done. - No verifier on tasks where one is trivial to write (e.g., compile the code). - Treating "max steps reached" the same as "answered" in metrics.
Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/01_agentic_system_design/
Q: "How do you detect and stop infinite planning loops?"¶
Tags: senior · common · debugging · source: adilshamim8 Medium 2026; aemonline.net top-50 (Q24)
Answer outline:
- Loop signature: track (tool_name, normalized_args_hash) for last K calls. Three identical entries → halt.
- Plan churn detector: if the agent re-plans 3+ times in 5 iterations, escalate. The plan isn't converging.
- Cosine-similarity check on consecutive "thoughts" — if similarity > 0.92 across 3 turns, the model is reasoning in circles.
- Token-velocity check: if last 5 turns produced no tool calls, only "thinking", halt with no_progress reason.
- External watchdog: separate process sees stuck thread → kills it. Don't rely on the agent to notice it's stuck.
- Numbers to drop: "Our loop detector (identical tool call x3) fires on ~0.6% of production runs and saves ~12k tokens each. Plan-churn detector fires on ~0.2% and prevents avg $1.20/run waste."
Common follow-ups: - "Show me the args-hash function — what counts as 'identical'?" - "How do you distinguish a legitimate retry from a loop?"
Traps: - Exact-string matching on tool args (misses near-identical calls). - Letting the LLM self-detect — by definition, it doesn't see the loop. - Hard-killing with no diagnostic dump.
Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/01_agentic_system_design/
Q: "How do you control cost explosions from tool calls?"¶
Tags: senior · very-common · design · source: adilshamim8 Medium 2026
Answer outline:
- Per-task hard budget in USD, computed live from token usage + tool call costs. Trip → halt with budget_exceeded.
- Per-tool cost annotation in the registry: {name, usd_per_call, tokens_per_call}. Orchestrator pre-checks before allowing the call.
- Tier expensive tools behind cheap-model gates: a Haiku-class model decides whether the Sonnet/GPT-5-class tool is worth invoking.
- Cache at the tool boundary: same input → same output for read-only tools, TTL keyed by data freshness needs.
- Soft and hard limits per tenant: 80% → alert, 100% → halt. Standard pattern from Stripe usage-based billing.
- Numbers to drop: "Per-task \(0.50 cap eliminated all incidents of >\)5/task runs (was 4% before). LLM-judge caching brought eval cost down 70% with 0.4 pt accuracy loss."
Common follow-ups: - "How granular is your cost accounting — per token or per call?" - "What's a 'budget aware' agent vs a 'budget gated' agent?"
Traps: - Cost limits per request only, not per session — sessions stack up. - Caching mutating-tool results. - Letting the model see the remaining budget (it games it).
Related cross-cutting: Cost & latency
Related module: learning/01_ai_engineering/01_agentic_system_design/
Q: "Your AI agent burns too many tokens per task. How do you reduce token consumption?"¶
Tags: mid · very-common · scenario · source: MindStudio 2026 "AI Agent Token Budget Management"; igmguru top-40 agentic AI 2026
Answer outline: - Audit first: log per-turn tokens, find the hot spots. Usually 70-90% of tokens come from re-sending tool results unchanged. - Compact tool outputs: tools should return only the fields the agent uses, summarize long results with a cheap model before re-injection. - Sliding-window or summary buffer for older messages — keep last N raw, summarize the rest. - Cache the system prompt + tools (Anthropic prompt caching, OpenAI cached_input). 90% discount on cached tokens. - Prefer cheap models for routing and final-answer formatting; reserve flagship models for hard reasoning steps. - Numbers to drop: "Prompt caching alone cut our agent's token cost 62% on cached portions; tool-output compaction cut overall context size 45%. Combined: $0.034 → $0.011 per task."
Common follow-ups: - "What's the right cache TTL for the tool definitions block?" - "Sliding window vs summarization — when does each break?"
Traps: - Caching prompts that change per request (defeats cache). - Summarizing tool output and losing the IDs the agent needs. - Optimizing tokens without measuring quality — easy to regress accuracy 5%.
Related cross-cutting: Cost & latency
Related module: learning/01_ai_engineering/01_agentic_system_design/
Q: "How do you rate-limit and budget an agent's tool usage?"¶
Tags: senior · common · design · source: atalupadhyay "50 most asked agentic AI" 2026 (Q22)
Answer outline:
- Three layers: per-tool RPS (protect the tool), per-agent RPS (protect downstream), per-tenant RPS (fairness).
- Implementation: Redis token bucket keyed by (tenant_id, tool_name). 429 surfaces as structured error to the agent — "retry in 2.1s, suggest different tool".
- Budget enforcement uses both call count and cumulative cost. Some tools (web search) are cheap-per-call but expensive in aggregate.
- Burst capacity: short bursts ok (10x for 5s), then drain. Prevents legitimate concurrent calls from getting throttled.
- Per-tool weight: a code_exec call counts as 100 units, a cache_get as 1. Same bucket, different rates.
- Numbers to drop: "Per-tenant 100 RPS soft cap + 500 RPS hard cap stopped one customer's runaway agent from saturating our 10k RPS database in production."
Common follow-ups: - "What's the difference between rate limiting and budgeting in this context?" - "How does the agent recover from a 429?"
Traps: - Only client-side rate limiting (one buggy client takes everyone down). - Returning 429 as an unstructured string the agent can't parse. - No per-tenant cap, so noisy neighbors starve others.
Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/01_agentic_system_design/
Approval gates¶
Q: "Define agent autonomy boundaries — what can it do without human approval?"¶
Tags: senior · very-common · design · source: adilshamim8 Medium 2026; atul4u 2026 (Q3)
Answer outline:
- Default-deny on side effects. Start with read-only autonomy and explicitly add categories.
- Three tiers: (1) read-only — fully autonomous, (2) reversible writes (draft email, scratch table) — autonomous with audit, (3) irreversible — always human-gated.
- Per-tool mutation_class annotation: read | reversible | irreversible | financial | external_send. Policy engine maps class to required approval level.
- Blast radius cap: even autonomous writes cap rows/files/dollars per task. update_records can change ≤ 50 rows without approval.
- Confidence threshold: if model's self-reported confidence < 0.7 on an action, escalate regardless of class.
- Numbers to drop: "We classify ~30% of our 60 tools as irreversible; those route through HITL with target 2-minute approval SLA and ~95% approve rate."
Common follow-ups: - "How do you measure confidence reliably from an LLM?" - "What's the policy when approval times out?"
Traps: - All-or-nothing autonomy (either full HITL or no gates). - No blast-radius cap on reversible writes — "reversible" 10k rows is still bad. - Confidence based on LLM self-report only.
Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/01_agentic_system_design/
Q: "An agent is about to perform an action with irreversible consequences, like deleting files. How should the system be designed?"¶
Tags: senior · very-common · scenario · source: aemonline.net 25-advanced agentic AI 2026 (Q15)
Answer outline:
- Intercept at the tool boundary, not in the prompt. The file_delete tool's wrapper checks for approval_token; without it, returns "approval_required" instead of executing.
- Surface to a human reviewer: structured message with file paths, sizes, last-modified, agent's reasoning. UI shows approve/edit/reject/respond (LangGraph's four-option pattern).
- Persistent state during pause: LangGraph interrupt() + checkpointer keeps the agent's full state durable, resumes from exact step.
- Soft-delete first: where possible, move to trash with TTL instead of hard delete. Reduces blast radius of approved-but-wrong actions.
- Audit log: who approved, when, with what justification, what was deleted. Tamper-evident store.
- Numbers to drop: "Our SLA: irreversible-action approvals route to on-call in <30s, p95 approve-or-reject in 90s, target 99% completion within 5-minute timeout."
Common follow-ups: - "What happens if no human approves within the timeout?" - "How do you stop reviewers from rubber-stamping?"
Traps: - Approval gate in the system prompt only — prompt injection bypasses it. - Hard delete with no soft-delete option. - No timeout policy — gates stuck indefinitely.
Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/01_agentic_system_design/
Q: "Design an agent for end-to-end customer onboarding. Where does it call humans?"¶
Tags: senior · common · design · source: aemonline.net top-50 2026 (Q47-variant)
Answer outline: - Phases: identity verification → KYC document review → account provisioning → first-product setup → welcome handoff. - Human checkpoints: (1) KYC review when confidence < 0.85 or jurisdiction is high-risk; (2) any provisioning above a dollar threshold; (3) discrepancy between submitted docs and external KYC service. - Auto-paths: extracting fields from ID docs, scheduling welcome calls, sending templated emails — autonomous with audit log. - Failure handoff: 3 retries on KYC service → human takes over the case, not the agent restarting. - Per-step SLA: identity (5s), KYC review (15 min if human-gated), provisioning (30s). Total target: 10 min unattended, 25 min with one review. - Numbers to drop: "Onboarding agent at a fintech I worked with handles ~80% of cases fully autonomously, 15% with one human checkpoint, 5% full-takeover. SLA: 12 min median, 45 min p95."
Common follow-ups: - "How does the human inherit the agent's state when they take over?" - "What's the rollback if onboarding fails at step 4?"
Traps: - One single approval at the end ("review this whole 12-step run"). - No takeover path — agent retries forever on a doc it cannot parse. - Trusting LLM-extracted KYC fields without a deterministic schema check.
Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/01_agentic_system_design/
Q: "How do you prevent human-in-the-loop from becoming a bottleneck?"¶
Tags: senior · common · design · source: bestaiweb / Mastra "Human-in-the-Loop" 2026; Strata "Practicing HITL" 2026
Answer outline: - Calibrate confidence thresholds: only genuinely uncertain actions (calibrated below 0.85 say) route to review, not every action. - Tiered SLAs: 15s for low-risk, 2 min for PII, 15 min for financial. Match latency to risk; don't make every gate same-SLA. - Async approval: agent parks the action and continues with parallel sub-tasks instead of blocking. - Batch similar approvals: 20 "send email to customer" actions arrive in one queue, reviewed in a batched UI. - Watch automation bias: reviewers who see 1000 similar agent proposals per day rubber-stamp. Track approval-to-reject ratio per reviewer; if it goes >99%, rotate or sample-audit. - Numbers to drop: "Tiered SLAs cut median approval latency from 4m → 35s while halving total reviewer hours. We sample-audit 5% of auto-approved low-risk actions to catch reviewer drift."
Common follow-ups: - "How do you measure reviewer fatigue?" - "What's a calibrated confidence threshold?"
Traps: - Every action gates → reviewers burn out, throughput collapses. - Confidence not calibrated → threshold is meaningless. - No async path — agent sits idle waiting.
Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/01_agentic_system_design/
Q: "Can an agent have 'doubt,' and should it ask for human confirmation?"¶
Tags: mid · occasional · conceptual · source: aemonline.net top-50 agentic AI 2026 (Q46)
Answer outline:
- LLMs have noisy self-reported confidence; treat it as one signal, not the only one.
- Better proxies: log-prob entropy on the tool-call token, disagreement among N samples (self-consistency), evaluator-model agreement.
- Define "doubt triggers": multiple plausible tools score similarly, args have no schema match, user's intent ambiguous, missing required info.
- Action on doubt: a clarify tool that emits a question to the user, OR an escalate tool that pages a human. Not just "guess and hope".
- Don't over-clarify — every clarification is friction. Cap at 1-2 per task or use a clarifier-budget.
- Numbers to drop: "Our 'doubt detector' (entropy + N-sample disagreement) fires on ~7% of turns; turning those into clarifications dropped wrong-action rate from 12% → 4% with only a small NPS hit (-1.3 points)."
Common follow-ups: - "Show me your confidence calibration plot." - "What's the right ratio of clarifications to autonomous answers?"
Traps: - Trusting raw LLM "I'm 90% sure" tokens. - Asking clarifying questions for every uncertainty (users hate it). - No clarifier-budget so the agent stalls.
Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/01_agentic_system_design/
Q: "How do you implement human approval in LangGraph?"¶
Tags: mid · common · coding · source: LangChain docs "Human-in-the-Loop" (docs.langchain.com); LangChain blog "interrupt"
Answer outline:
- Use interrupt() in a graph node where approval is needed. State is persisted by the checkpointer; the call returns to the caller.
- The host app surfaces the interrupted state to a human and resumes with graph.invoke(Command(resume=...)) carrying the decision.
- Four standard outcomes: approve, edit (modify before run), reject (with feedback), respond (used by "ask user" style tools).
- Requires a checkpointer (Postgres, SQLite, Redis) — interrupt() only works with durable state. In-memory checkpointer is for tests.
- Make interrupts dynamic, not static — conditional on action.value > threshold, not on every action. Avoids ceremonial gating.
- Numbers to drop: "LangGraph's interrupt() with Postgres checkpointer survives node restarts; we measured 100% resume rate across 50k production interrupts over the last quarter."
Common follow-ups: - "What if the checkpointer is unavailable when interrupt fires?" - "How do you let approvers edit the agent's proposed action?"
Traps:
- interrupt() without a checkpointer → state lost.
- Static interrupt() on every tool call (rubber-stamp problem).
- Not handling the respond case for clarifications, only approve/reject.
Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/01_agentic_system_design/
Q: "How do you implement guardrails for safe and predictable agent behavior?"¶
Tags: senior · common · design · source: aemonline.net top-50 agentic AI 2026 (Q46-variant)
Answer outline: - Pre-execution guardrails: input filters (PII detection, prompt injection classifier, jailbreak detection), policy engine that decides allow/deny/escalate. - In-loop guardrails: tool selection allowlist by role, parameter validation, budget checks, blast-radius caps. - Post-execution guardrails: output filters (PII redaction, toxicity check), response evaluator scoring against goal. - Policy as code: OPA/Rego or a domain-specific rule engine, version-controlled, tested. Don't hide policy in prompt strings. - Layered defense: even if one guardrail fails, another catches. Anthropic's framing is "constitutional AI" baked into reasoning + structural checks at boundaries. - Numbers to drop: "Post-execution toxicity filter catches ~0.04% of agent outputs; pre-execution prompt-injection filter catches ~0.3% of inputs. Both layers together = 0 toxic outputs over 500k requests last month."
Common follow-ups: - "Where does the policy engine sit in your topology?" - "How do you A/B test a new guardrail without blocking real traffic?"
Traps: - Single point of guardrail (only input or only output). - Guardrails in the prompt only. - No mechanism to update guardrails without redeploy.
Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/01_agentic_system_design/
MCP & tool protocols¶
Q: "What is MCP and why does it matter?"¶
Tags: screen · very-common · conceptual · source: modelcontextprotocol.io; Piyali Das "10 MCP Interview Questions" 2026; aemonline.net top-50 (Q34)
Answer outline: - MCP = Model Context Protocol — open standard from Anthropic (Nov 2024) for connecting LLMs to tools, resources, and prompts via a client-server interface. - Three concepts: MCP client (your agent), MCP server (exposes tools/resources), transport (stdio for local, HTTP/SSE for remote, streamable HTTP for production). - Why it matters: standardizes tool integration so you stop writing one custom adapter per service. One MCP server can serve any MCP-compliant client. - Not a replacement for RAG: MCP handles tool execution and resource fetching; RAG is a retrieval pattern that can use MCP as transport. - 2026 roadmap focus: transport scalability, agent-to-agent communication, governance maturation. - Numbers to drop: "After standardizing on MCP, we cut net-new tool integration time from ~4 engineering days to ~0.5 day per tool, and 60+ open-source MCP servers covered our top integrations out of the box."
Common follow-ups: - "What's the difference between MCP and a regular REST API?" - "When would you NOT use MCP?"
Traps: - Treating MCP as just a buzzword without naming client/server/transport. - Building a custom protocol in 2026 when MCP would serve. - Confusing MCP servers with agent frameworks like LangGraph.
Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/01_agentic_system_design/
Q: "How does an LLM discover MCP tools?"¶
Tags: mid · common · conceptual · source: Piyali Das "10 MCP Interview Questions" 2026 (Q6); modelcontextprotocol.io spec
Answer outline:
- The MCP server exposes tools/list — returns each tool's name, description, JSON Schema for input and output.
- Client calls tools/list on connection (or refreshes on tools/list_changed notifications).
- Discovered tools are merged into the agent's tool registry — alongside native function-calling tools.
- Selection: same mechanism as any other tool — model reads the descriptions and picks. Same namespacing rules apply (gh_mcp_issues_search, not just search).
- Permission filtering at the client: even if a server exposes 50 tools, the client can surface only the ones the current user is authorized to call.
- Numbers to drop: "Our agent connects to 4 MCP servers exposing 28 tools combined; we filter to ~12 per agent role via the client's allowlist, keeping the context overhead under 4k tokens."
Common follow-ups:
- "What happens if tools/list_changed fires mid-task?"
- "How do you stop a malicious MCP server from injecting tools?"
Traps:
- Loading every MCP server's full tool list into context unfiltered.
- Trusting tool descriptions from external servers without review.
- Not handling tools/list_changed — stale registry leads to silent failures.
Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/01_agentic_system_design/
Q: "What are the security risks of MCP and how do you mitigate them?"¶
Tags: senior · common · design · source: Coalition for Secure AI "MCP Security" 2026; Security Boulevard "How MCP servers handle authentication"; Checkmarx "MCP Security"
Answer outline: - Static tokens dominate (53% of production servers per Coalition for Secure AI 2026 scan of 518 servers; 41% have no auth at all). Long-lived tokens in multi-agent chains have geometrically wider blast radius than in conventional apps. - Use short-lived OAuth tokens + DPoP (RFC 9449) to bind tokens to the requesting client. Prevents replay if intercepted. - Authorization in three layers: (1) authenticate the client, (2) authorize the connection, (3) per-action authorization. Most production servers stop at layer 2; layer 3 is the gap. - End-to-end traceability: every action must trace back to the initiating user and every intermediate server. Propagate identity context across the chain. - Server allowlist: don't auto-connect to any MCP server the user mentions. Curated registry per tenant. - Numbers to drop: "After moving to short-lived tokens + per-action authorization, we caught a malicious server attempting privilege escalation 14 times in one week — none succeeded vs. the previous static-token regime."
Common follow-ups: - "What's prompt injection's role in MCP exploits?" - "How do you isolate one MCP server from another?"
Traps: - Static API keys with no rotation. - Trusting tool output as safe input to other tools (cross-server injection). - No identity propagation — action looks like it was done by the agent service, not the user.
Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/01_agentic_system_design/
Q: "Difference between MCP and APIs?"¶
Tags: screen · common · conceptual · source: Piyali Das "10 MCP Interview Questions" 2026 (Q4)
Answer outline: - APIs (REST/gRPC): general-purpose, designed for human developers, manual integration per endpoint. - MCP: AI-native, designed for LLM consumption, includes tool discovery, structured descriptions, resource lists, prompt templates as first-class primitives. - MCP can wrap an existing REST API — most production MCP servers are thin adapters over an internal API plus an LLM-friendly description layer. - MCP standardizes the integration shape, so any MCP-compliant client (Claude Desktop, Cursor, Cody, your agent) gets the tool for free. - APIs answer "how do I call this endpoint?"; MCP answers "what tools are available, when should the model use them, and what's the expected output shape?". - Numbers to drop: "Our internal CRM API exposes 80 endpoints; the MCP wrapper exposes 12 carefully-curated tools with rich descriptions. Tool-pick accuracy went from 64% (raw OpenAPI generation) to 89% with the curated MCP surface."
Common follow-ups: - "Why not just point the LLM at an OpenAPI spec?" - "Should every internal API have an MCP server?"
Traps: - Auto-generating MCP tools 1:1 from OpenAPI — surface too noisy, descriptions too thin. - Treating MCP as a replacement for APIs (it's a layer on top). - Exposing every endpoint, not just the safe and useful subset.
Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/01_agentic_system_design/
Q: "When would you choose MCP over a custom tool integration?"¶
Tags: mid · common · design · source: Piyali Das "10 MCP Interview Questions" 2026 (Q10); Red Hat "Building effective AI agents with MCP"
Answer outline: - Choose MCP when: multiple clients will use the same tool (Claude Desktop + your agent + Cursor); tool surface evolves often (descriptions change without client redeploys); third-party integrations available off-the-shelf. - Stick with native function-calling when: latency is critical (MCP transport adds ~5-30ms); the tool is one-off; you need fine-grained control over the schema and selection logic. - For internal tools, even 1 client today, MCP can future-proof — but don't adopt it prematurely if the team isn't ready to operate an MCP server. - Off-the-shelf MCP servers in 2026 cover GitHub, Linear, Slack, Notion, Postgres, Sentry, dozens more. Don't reinvent. - Production checklist: auth (OAuth + short-lived tokens), audit, rate limits, per-action authz, observable transport. - Numbers to drop: "We adopted MCP for 4 external SaaS integrations (GitHub, Linear, Slack, Sentry), kept native function-calling for 18 internal tools. Net effect: 60% less glue code, 0 latency regression."
Common follow-ups: - "What's the operational overhead of running an MCP server?" - "How do you migrate from custom tools to MCP without breaking the agent?"
Traps: - All-in on MCP without a fallback for tools that need <100ms latency. - Running a leaky MCP server with static creds (now a much bigger surface). - Mixing MCP and native tools without a unified registry abstraction.
Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/01_agentic_system_design/
Q: "How does MCP handle long-running tool calls?"¶
Tags: senior · occasional · conceptual · source: MCP 2026 Roadmap (blog.modelcontextprotocol.io/posts/2026-mcp-roadmap); The New Stack "MCP biggest growing pains" 2026
Answer outline:
- Current spec: clients can start asynchronous Tasks — invoke once, retrieve result later. Lets the agent kick off long jobs without blocking.
- Open gaps the 2026 roadmap is closing: retry semantics for failed jobs, result retention windows, observability of in-flight tasks across clients.
- Practical pattern today: MCP tool returns a task_id immediately, agent polls a tasks/get tool, or subscribes to completion notifications.
- Trade-offs: synchronous calls are simpler but block the loop; async is correct but adds plumbing the agent must reason about.
- Timeout policy belongs to the client — server publishes "estimated_duration", client decides whether to wait or convert to async.
- Numbers to drop: "Our deep-research tool runs 30-180s. Wrapped in async MCP, the agent polls every 5s and continues other work meanwhile; we cut median task wall-time 40% by parallelizing the wait."
Common follow-ups: - "What's the failure mode when a job crashes mid-flight?" - "How long does an MCP server keep completed results?"
Traps: - Synchronously blocking the agent on a 60s tool call. - No retention policy on async results, lost on server restart. - Polling without backoff (DDoSes your own MCP server).
Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/01_agentic_system_design/
Failure handling¶
Q: "Your agent hits a tool that returns 500. What happens next?"¶
Tags: mid · very-common · scenario · source: aemonline.net 25-advanced 2026 (Q10); aemonline.net top-50 (Q9-variant)
Answer outline:
- Orchestrator classifies 5xx as transient → retry with exponential backoff + jitter, max 3 attempts, total budget 10s.
- On final failure, return a structured error to the LLM: {ok: false, code: "TOOL_500", retryable: false_after_retries, hint: "Try a different tool or fall back"}.
- LLM's response options: try an alternative tool, ask the user, or end with a partial answer + reason. Never let it loop on the same failed call.
- Circuit breaker: if the tool has failed >50% over the last 10 calls, the breaker trips and the orchestrator routes around it for 5 minutes.
- Track in telemetry: tool_id, retry_count, final_outcome, agent_recovery_action. Alert if recovery rate < 90%.
- Numbers to drop: "Our retry policy (3 attempts, exp backoff 1s/2s/4s + 20% jitter) handles ~95% of transient 5xx; circuit breaker on top reduces cascading failures during partial outages by ~70%."
Common follow-ups: - "What if there's no alternative tool?" - "How do you stop the LLM from retrying after the orchestrator gave up?"
Traps: - Letting the LLM see "Tool failed, you might want to retry" as a free-form string (it will retry forever). - No circuit breaker → cascading failures. - Retrying writes without idempotency.
Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/01_agentic_system_design/
Q: "How do you handle errors when an API call fails (e.g., rate limit, authentication error, invalid parameters)?"¶
Tags: mid · very-common · scenario · source: aemonline.net 25-advanced agentic AI 2026 (Q10)
Answer outline:
- Different error classes need different responses: 429 → backoff + retry (parse Retry-After header); 401/403 → refresh credentials, then retry once, then fail; 400 → don't retry, surface to LLM as "invalid_args" with the validator's hint.
- 401 retry uses a token-refresh flow in the orchestrator, not in the LLM prompt.
- 400 carries the exact validation message back to the LLM so it can correct args on the next turn.
- 5xx → retry with backoff as transient.
- Pattern: error envelope {ok: false, http_status, code, message, retryable, hint} standardized across all tools.
- Numbers to drop: "Standardizing the error envelope dropped time-to-recovery from agent retries by 35% — the LLM now corrects on the next turn instead of probing."
Common follow-ups: - "What's the right retry-after when the API doesn't send one?" - "How do you avoid leaking the new token to the LLM?"
Traps: - Retrying 4xx (won't succeed). - Putting auth tokens in tool responses the LLM sees. - Same retry curve for every error class.
Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/01_agentic_system_design/
Q: "How would you design a tool-use function robust to the LLM hallucinating parameter names or values?"¶
Tags: senior · common · design · source: aemonline.net 25-advanced agentic AI 2026 (Q21)
Answer outline:
- Strict Pydantic / JSON Schema with descriptive error messages: when validation fails, return "Expected user_id (UUID v4), got username='alice'. Use user_lookup to convert username→UUID first."
- Type coercion where safe (int vs string for IDs) and rejection elsewhere — don't auto-fix ambiguous types.
- Each parameter description includes 1-2 example values: user_id: "550e8400-e29b-41d4-a716-446655440000". Anchors the model.
- Provider-native structured-output mode (OpenAI tool_choice strict, Anthropic tools) enforces schema at the sampler level.
- Tracker: count validation_failures_per_tool_per_day. Spike = description regression or model regression.
- Numbers to drop: "Strict Pydantic + structured outputs + hint-rich errors cut arg-validation failures from 8% → 0.4% on our payments agent eval; we left 0.4% as 'genuinely ambiguous user intent', not hallucinations."
Common follow-ups: - "What if the model insists on the wrong value across multiple retries?" - "Why not just allow any string and parse later?"
Traps:
- Dict[str, Any] parameters.
- Validation errors dumped as raw Pydantic tracebacks.
- No telemetry on hallucination rate.
Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/01_agentic_system_design/
Q: "Your agent is stuck in an infinite loop calling the same tool. How do you detect and stop it?"¶
Tags: senior · common · debugging · source: aemonline.net top-50 agentic AI 2026 (Q24)
Answer outline:
- Detection in the orchestrator: hash (tool_name, normalized_args) over the last K calls. If the same hash appears 3 times in 5 calls, halt with loop_detected.
- Normalize args before hashing: lowercase, sort keys, strip whitespace. Otherwise near-identical args evade detection.
- Plan-stage detector: if the agent re-plans 3 times in a window, escalate — the plan isn't converging.
- On halt, dump the last 5 messages, the failing tool call hash, and the agent state. Don't silently terminate.
- After halting, route to either user clarification (if the loop is ambiguity-driven) or human takeover (if it's a tool-availability issue).
- Numbers to drop: "Loop detection fires on ~0.6% of runs; pre-detector, those runs averaged 18 steps to budget cap and ~$1.40 wasted. Post-detector, halt at step 5, dump-and-escalate within 3s."
Common follow-ypes: - "What's the right value of K?" - "How do you keep the loop detector from false-positiving on legitimate retries?"
Traps: - Exact-string args matching (misses near-duplicates). - No diagnostic dump on halt. - Loop detector tied to identical literal args; misses semantic loops.
Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/01_agentic_system_design/
Q: "What are the most dangerous failure modes of agentic AI?"¶
Tags: senior · common · conceptual · source: adilshamim8 Medium 2026; atul4u 2026 (Q38); lockedinai (Q34)
Answer outline: - Tool misuse: agent calls a destructive tool with wrong args (delete the wrong table). Mitigation: schema strictness, blast-radius caps, HITL for mutating tools. - Infinite reasoning loops: agent burns budget without progress. Mitigation: step cap, loop detector, no-progress detector. - Prompt injection via tool outputs: a malicious doc the agent reads instructs it to exfiltrate data. Mitigation: input/output filters, structured channels, action allowlist. - Hallucinated tool calls: agent invents a tool or argument. Mitigation: strict schemas, structured outputs. - Scope creep: agent attempts actions beyond its authority (e.g., reads cross-tenant data). Mitigation: capability-based auth, per-action authz. - Numbers to drop: "Across post-mortems on 47 agent incidents at one fintech, root causes: scope creep 28%, tool misuse 21%, injection 17%, infinite loops 13%, hallucination 11%, other 10%."
Common follow-ups: - "Which of these is hardest to detect in production?" - "How would you prioritize fixing them?"
Traps: - Listing risks without naming concrete mitigations. - Treating all failure modes as equally likely (scope creep dominates in practice). - No incident retrospective process for agents.
Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/01_agentic_system_design/
Q: "How do you prevent agents from executing irreversible actions accidentally?"¶
Tags: senior · common · design · source: atalupadhyay "50 most asked agentic AI" 2026 (Q30)
Answer outline:
- Tag every tool with a mutation_class: read | reversible | irreversible | external_send | financial. Policy engine maps class to gating policy.
- Irreversible tools require a one-time-use approval_token in their wrapper; absence → 403 at the tool boundary, not the prompt.
- Soft-delete first wherever the underlying system supports it (S3 versioning, soft-delete columns, message-deleted flag). Lets you reverse "irreversible" actions in practice.
- Dry-run mode default: dry_run=True; the agent's normal output is a proposed action. Approval flips to dry_run=False and executes.
- Pre-execution simulation: for things like SQL UPDATE, run an EXPLAIN-style preview to show row count and a sample. Surfaces "this would change 12M rows" before approval.
- Numbers to drop: "Dry-run-default + soft-delete cut 'agent did the wrong thing irreversibly' incidents from 2-3/month to 0 in 6 months across our top 4 production agents."
Common follow-ups: - "What about external sends like emails?" - "How do you preview a tool call without running it?"
Traps: - "Approval check" only in the prompt. - Hard delete by default, retrofitting soft delete later. - No dry-run mode on mutating tools.
Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/01_agentic_system_design/
Q: "What is idempotency and why is it required for agents?"¶
Tags: mid · common · conceptual · source: atalupadhyay "50 most asked agentic AI" 2026 (Q42)
Answer outline:
- Idempotent operation: calling it N times with the same input has the same effect as calling it once. Agent retries (which happen often) cannot create duplicates.
- Implementation: every mutating tool accepts an idempotency_key: str (UUID); server-side dedupe within a TTL (24h is common).
- The agent generates one key per logical action and reuses it across retries. Orchestrator stores it with the action plan.
- Pair with two-phase commit for multi-step actions: write intent_id first, then execute, confirm or rollback.
- Without idempotency: agent retries on transient error → user charged twice; agent loops on payment_create → 10 duplicate charges in 30s.
- Numbers to drop: "Idempotency keys eliminated 100% of duplicate-charge incidents (3-4/month previously) at one fintech; the engineering cost was ~2 weeks per service."
Common follow-ups: - "What's the right TTL for an idempotency key?" - "How does the agent know which actions need keys?"
Traps:
- Idempotent for create only, forgetting update/delete.
- Short TTL (5 min) that expires while a long-running task retries.
- Letting the LLM generate idempotency keys (it sometimes reuses them across distinct actions).
Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/01_agentic_system_design/
Q: "How do you handle a tool call that takes 90 seconds?"¶
Tags: senior · occasional · scenario · source: synthesized from MCP 2026 roadmap + Temporal/LangGraph long-running patterns
Answer outline:
- Default agent loops are synchronous and 90s blocks everything — switch to async pattern: tool returns task_id, agent gets a tool_status_get(task_id) poller.
- Two choices for the agent meanwhile: do nothing (just wait, with timeouts) or run parallel sub-tasks. Choice depends on whether downstream steps depend on the long tool's output.
- Durable execution layer (Temporal, LangGraph + Postgres checkpointer) keeps the agent state alive across the long call.
- Set explicit user expectations: surface "this will take ~90s, you'll get a notification" — don't pretend it's instant.
- Webhook callback pattern for very long tools (minutes): tool calls back the orchestrator, which resumes the agent's checkpoint.
- Numbers to drop: "Async tools with Temporal-backed durable agents survived a 45-min outage of one of our long-running providers with zero lost tasks; previously, ~120 in-flight runs would have died."
Common follow-ups: - "How does the agent know when to give up vs keep polling?" - "What's the latency cost of moving from sync to async?"
Traps: - Synchronously blocking the agent's whole loop on a 90s call. - No durable state → agent dies on deploy, loses the in-flight task. - Polling without exponential backoff (hammers the tool).
Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/01_agentic_system_design/