Skip to content

Agent Design — Interview Questions

Agentic AI is now its own interview round in 2026. The defining question of the year: "What's the difference between an agent and a simple LLM chain?". This file covers design decisions for single-agent systems — multi-agent is its own file.

Architecture (ReAct, plan-and-execute)

Q: "What's the difference between an agent and a simple LLM chain?"

Tags: screen · very-common · conceptual · source: adilshamim8 Medium "Every AI Engineer Interview Question 2026" + Anthropic "Building Effective Agents"

Answer outline: - Chain: fixed DAG of LLM calls, deterministic control flow, you decide step order at design time. - Agent: LLM decides next step at runtime, including which tool to call and when to stop. Anthropic's framing: "workflows" vs "agents" — workflows are predefined paths, agents are model-driven loops. - Agent loop has three parts: reason, act (tool call), observe (tool result fed back as message). - Cost shape: chains have bounded token cost, agents can run for arbitrary iterations until a stopping rule fires. - Use agents only when "it's difficult or impossible to predict the required number of steps" (Anthropic). Otherwise a workflow is cheaper and more reliable. - Numbers to drop: "agent loops typically iterate 3-15 times per task; a chain runs in 1-3 LLM calls; agents cost 5-20x more tokens per task at the same accuracy."

Common follow-ups: - "Give me a task where an agent is clearly the wrong choice." - "When would Anthropic say to start with a chain and only graduate to an agent?"

Traps: - Treating "agentic" as a marketing label rather than a control-flow decision. - Reaching for an agent when a 3-node LangGraph workflow would do the job for 1/10th the cost. - Forgetting that every agent loop iteration replays the full context, so cost grows linearly with iteration count.

Related cross-cutting: Architecture choices Related module: learning/01_ai_engineering/01_agentic_system_design/


Q: "Walk through a production-ready agent architecture."

Tags: senior · very-common · design · source: adilshamim8 Medium 2026; atul4u "Complete Agentic AI System Design Interview Guide 2026"

Answer outline: - Five layers: orchestrator (state machine), LLM client (with caching), tool interface layer, memory (short + long), policy/guardrails engine, observability (traces + metrics). - Orchestrator owns control flow — it enforces max iterations, total token budget, retries, timeouts. The LLM owns reasoning only. - Tool layer wraps each tool with schema validation (Pydantic/JSON Schema), rate limiting, idempotency keys, structured error returns. - Every loop iteration emits a span: prompt, response, tool call, tool result, latency, tokens, cost. OpenTelemetry + LangSmith or Arize for traces. - Stopping conditions are first-class: max_steps, max_tokens, max_wall_clock, final_answer_predicate, low-confidence escalation. - Numbers to drop: "p50 agent task in production at my last role: 4 tool calls, 18k tokens, 12s wall clock; p99: 14 tool calls, 90k tokens, 75s — we capped at 20 steps and $0.50/task."

Common follow-ups: - "Where exactly does the policy engine sit — before or after the LLM call?" - "How do you handle a tool call that takes 90 seconds?" - "Draw the failure-mode diagram for one iteration."

Traps: - Putting retry logic inside the LLM prompt instead of the orchestrator. - No total-token cap, only a per-iteration step cap (an agent can burn $50 in 10 steps with bloated context). - Treating tools as plain function calls with no telemetry, so when something breaks you have no trace.

Related cross-cutting: Production patterns Related module: learning/01_ai_engineering/01_agentic_system_design/


Q: "What is ReAct? When does it fail?"

Tags: mid · very-common · conceptual · source: Yao et al. "ReAct: Synergizing Reasoning and Acting" (2022); SurePrompts "ReAct Prompting Guide 2026"; aemonline.net top-25 agentic AI 2026

Answer outline: - ReAct = Reason + Act loop. LLM emits a thought, then an action (tool call), receives an observation, and repeats. The thought is what makes it "ReAct" instead of just function-calling. - Strengths: handles unpredictable environments well — every observation can change the next thought. Good for search, navigation, debugging. - Failure mode 1: long horizons — context grows linearly with iterations, so 30+ step tasks blow context window and the model loses track. - Failure mode 2: greedy local optimization — ReAct doesn't backtrack, so once it commits to a bad sub-path it doubles down. - Failure mode 3: prompt sensitivity — the "Thought:" channel can leak into the user-visible output if the parser is sloppy. - Numbers to drop: "ReAct on HotpotQA hits ~35% EM in the original paper; modern tool-augmented variants reach ~60%+. ReAct dies around 15-20 steps in our internal eval because of context bloat."

Common follow-ups: - "How do you keep ReAct from blowing the context window on long tasks?" - "ReAct vs plan-and-execute on a multi-file refactor — which wins?"

Traps: - Confusing ReAct with plain chain-of-thought (ReAct has tools and observations, CoT does not). - Letting the "Thought:" string become a load-bearing parser anchor — models drift in formatting. - Assuming ReAct is the answer for everything because it's the default in LangChain.

Related cross-cutting: Architecture choices Related module: learning/01_ai_engineering/01_agentic_system_design/


Q: "Compare ReAct, Plan-and-Solve, and Tree-of-Thoughts with a real-world trade-off."

Tags: senior · common · conceptual · source: aemonline.net top-50 agentic AI 2026 (Q16)

Answer outline: - ReAct: interleaved reason+act, reactive, no upfront plan. Best when environment is unpredictable. - Plan-and-Solve / Plan-and-Execute: produce full plan first, then execute step by step, optionally re-plan on failure. Predictable, gives a natural human-review gate before execution starts. - Tree-of-Thoughts: explore multiple reasoning branches, evaluate each, backtrack. Expensive — 5-20x token cost of ReAct. - Real-world trade-off: code-fixing on a known repo → plan-and-execute (cheaper, reviewable plan). Bug triage in an unknown system → ReAct (you don't know what you'll find). Research with many candidate hypotheses → ToT, capped at 3 branches. - Plan-and-execute "hurts when the task is not decomposable because you do not know the environment yet" (SurePrompts 2026). - Numbers to drop: "On a refactor benchmark, plan-and-execute cut iterations from 9 → 4 vs ReAct, saving 55% tokens; on bug triage in unknown repos, plan-and-execute failed 40% of the time because the initial plan referenced wrong file paths."

Common follow-ups: - "When does Tree-of-Thoughts beat a simple ReAct loop with reflection?" - "How do you decide the branching factor for ToT?"

Traps: - Using plan-and-execute on tasks where you can't enumerate steps without exploration. - Pricing Tree-of-Thoughts as if it were one-shot — every branch is its own ReAct loop. - Forgetting that plan-and-execute still needs a re-plan trigger when a step fails.

Related cross-cutting: Architecture choices Related module: learning/01_ai_engineering/01_agentic_system_design/


Q: "What logic belongs in the orchestrator vs the LLM?"

Tags: senior · common · design · source: adilshamim8 Medium 2026; atul4u system design guide 2026 (Q8)

Answer outline: - LLM: choose the next action, generate arguments, summarize observations, draft user-facing prose. Anything that needs semantic judgment. - Orchestrator (code): loop control, step count, token accounting, retry policy, timeout enforcement, tool dispatch, schema validation, idempotency keys, audit logging, approval gates. - Rule of thumb: if it can be expressed as if/while/for, it doesn't belong in the prompt. Prompts are terrible loops. - Determinism matters here: the orchestrator is the only place you can guarantee invariants (max 20 steps, max $0.50, never call DELETE without approval). - The LLM should not know its own budget cap — you don't want it negotiating with itself. The orchestrator enforces silently. - Numbers to drop: "Moving retry/backoff out of the prompt and into orchestrator code cut hallucinated tool calls by ~30% and saved 22% tokens because the LLM no longer rationalizes 'I'll try once more'."

Common follow-ups: - "Where does the system prompt's role policy enforcement actually run?" - "Can the LLM ever modify the budget?"

Traps: - Pushing security checks into prompts ("don't delete files unless..."). LLMs route around prompts. - Letting the LLM decide when to stop — it will stop too early or too late. - Putting tool-discovery logic into prompts at runtime when a registry lookup in code is simpler.

Related cross-cutting: Production patterns Related module: learning/01_ai_engineering/01_agentic_system_design/


Q: "How do you design a safe and debuggable agent loop?"

Tags: senior · common · design · source: adilshamim8 Medium 2026; atul4u system design guide 2026 (Q9)

Answer outline: - Make every state transition recorded — a single append-only AgentEvent stream with thought, tool_call, tool_result, latency, tokens, model_version. - Replayable state: the agent state at step N must be reconstructible from events 1..N-1 plus the seed. No hidden state in instance variables. - Hard caps on the loop: max_steps, max_tokens, max_seconds, max_cost_usd. Any one trip halts. - Side-effect quarantine: tools that mutate the world are marked mutating=True; orchestrator routes them through approval or dry-run modes during dev. - Structured tool errors: every tool returns {ok, value | error_code, message, retryable, hint} so the LLM can reason about failures without parsing strings. - Numbers to drop: "After we added replay-from-event-log, MTTR on agent bugs dropped from ~3 hours to ~25 minutes because we could reproduce production failures locally."

Common follow-ups: - "How do you make a non-deterministic LLM replayable?" - "Where does the event log get garbage collected?"

Traps: - Logging only the final answer — you can't debug a 12-step loop from one row. - Letting tools throw raw exceptions back to the LLM as stack traces (the LLM tries to fix Python, not the task). - Using wall-clock time as the only termination signal in batch contexts.

Related cross-cutting: Production patterns Related module: learning/01_ai_engineering/01_agentic_system_design/


Q: "When should you use a workflow instead of an agent?"

Tags: mid · very-common · conceptual · source: Anthropic "Building Effective Agents" (anthropic.com/research/building-effective-agents)

Answer outline: - Anthropic's canonical advice: "Start with simple prompts, optimize them with comprehensive evaluation, and add multi-step agentic systems only when simpler solutions fall short." - Workflows win when: steps are fixed, latency budget is tight, you need predictable cost, you can write evals for each step. - Agents win when: you cannot predict step count, environment is unpredictable, branching factor is high. - The five named Anthropic workflow patterns — prompt chaining, routing, parallelization, orchestrator-workers, evaluator-optimizer — cover ~80% of "agentic" use cases without a real agent loop. - Cost shape: workflows are bounded (you can quote a per-task cost); agents have heavy right tail. - Numbers to drop: "We replaced a 'customer query agent' with a 3-step prompt chain (classify → retrieve → answer) and cut p50 latency from 8s → 1.4s and cost from $0.08 → $0.011 per query at equal quality."

Common follow-ups: - "Walk through Anthropic's orchestrator-workers pattern — when does it beat a real agent?" - "How would you decide in an interview to recommend a workflow vs agent?"

Traps: - Reaching for an agent because the problem sounds open-ended when it actually decomposes into 3 steps. - Confusing "uses an LLM with tools" with "must be an agent loop". - Building a generic agent before you've shipped any chain in production.

Related cross-cutting: Cost & latency Related module: learning/01_ai_engineering/01_agentic_system_design/


Q: "How would you design an agent to handle a task with an extremely long time horizon, like 'research the entire history of a company and write a 50-page report'?"

Tags: staff · occasional · scenario · source: aemonline.net 25-advanced agentic AI 2026 (Q1)

Answer outline: - Phase-decompose the task: outline → per-section research → per-section drafting → cross-section consistency pass → final assembly. Each phase is its own agent run with its own budget. - Hierarchical state machine: a top-level planner emits phase tickets, sub-agents complete tickets, results land in a durable store (Postgres or object storage), not the parent's context. - Checkpoint after every phase — the system must be killable and resumable. Critical when wall time is hours. - Compaction between phases: don't carry raw research notes into the drafting phase, carry a structured summary (key entities, dates, sources). - Stopping rule per phase: max iterations + "section_complete" predicate validated by an evaluator LLM. - Numbers to drop: "Anthropic's deep-research style agents run 5-30 minutes wall clock and consume 200k-2M tokens; we chunk at 30-min boundaries and persist intermediate state in S3 with content-hash keys."

Common follow-ups: - "Where does the agent store intermediate research without polluting context?" - "How do you ensure section N is consistent with section M written 4 hours earlier?"

Traps: - One monolithic ReAct loop trying to do everything in one context window. - No checkpointing — a 4-hour run that crashes at step 28 is unrecoverable. - Using episodic memory as if it were durable storage; it isn't, and it grows context per step.

Related cross-cutting: Architecture choices Related module: learning/01_ai_engineering/01_agentic_system_design/


Q: "How do agents decompose high-level goals into executable steps?"

Tags: mid · common · conceptual · source: adilshamim8 Medium 2026

Answer outline: - Three common approaches: LLM-as-planner (one shot, returns JSON plan), recursive decomposition (split until atomic), and emergent (ReAct just figures it out step by step). - Atomic step = "callable by one tool in one turn with no further sub-goal". This is the planner's contract. - Quality check on the plan: an evaluator LLM scores each step for is_atomic, is_actionable, references_valid_tools. Reject and re-plan if any fails. - Output a graph, not a list — explicit dependencies enable parallel execution and partial replanning. - Track plan-vs-execution drift: if the agent deviates from the plan more than N times, re-plan. Anthropic-style "evaluator-optimizer" loop. - Numbers to drop: "Plan validation by a cheap model (Haiku-class) costs ~$0.001 per plan and catches ~25% of bad plans before any tool is called."

Common follow-ups: - "Show me the JSON schema you'd use for a plan." - "What's plan drift and how do you detect it?"

Traps: - Asking the LLM for a plan and then ignoring it inside a ReAct loop. - Plans that are too coarse ("research company X") — they leave the executor doing planning anyway. - No re-plan trigger, so when step 2 fails the agent fights step 3 with stale assumptions.

Related cross-cutting: Architecture choices Related module: learning/01_ai_engineering/01_agentic_system_design/


Q: "Describe how you would architect an AI agent system, including the agent loop, tool interfaces, memory design, orchestration technologies, and safety considerations."

Tags: staff · common · design · source: adilshamim8 Medium "100 real interviews" 2026

Answer outline: - Loop: typed state object (goal, messages, step, budget_used, pending_tool_calls), single step() function, deterministic event log. - Tool interface: Pydantic schemas, namespaced names (gh_issues_search, gh_pulls_create), structured errors, idempotency keys, per-tool rate limits. - Memory: short-term = message buffer with summarization at 70% of context; long-term = vector store keyed by agent_id + topic + ts; episodic = event log replayed for self-reflection. - Orchestration: LangGraph or Temporal for durable execution, Redis for ephemeral state, Postgres for checkpoints. - Safety: policy engine (OPA or in-process rules), pre-execution check on every mutating tool, audit log, rate limits per actor. - Numbers to drop: "Temporal-backed agents survived our last region outage with zero lost tasks; in-memory LangGraph lost 800 in-flight runs. We standardized on Temporal for any agent with side effects."

Common follow-ups: - "Why Temporal over a plain Redis queue?" - "How does your policy engine talk to the orchestrator?"

Traps: - Conflating short-term context with long-term memory. - Skipping the durability layer because "it's just a prototype" — it never is. - Hardcoding tool names into prompts instead of generating from a registry.

Related cross-cutting: Production patterns Related module: learning/01_ai_engineering/01_agentic_system_design/


Q: "Describe a scenario where a monolithic agent is a better choice than a multi-agent system."

Tags: mid · occasional · conceptual · source: aemonline.net 25-advanced agentic AI 2026 (Q3)

Answer outline: - Tight context coupling: when every step needs the same intermediate state, splitting it across agents forces re-serialization on every handoff. - Latency-sensitive tasks: agent-to-agent communication adds 1-3 LLM round-trips of overhead per handoff. - Simple linear flows: if the task is "research → summarize → format", a single ReAct agent with three tools beats three agents. - Small token budgets: multi-agent inflates total tokens 2-4x because each agent re-reads parts of context. - Strong eval signal: if you can write one eval for the whole task, you don't need orchestration boundaries. - Numbers to drop: "We replaced a 3-agent 'planner→researcher→writer' setup with one ReAct agent; latency dropped from 22s → 9s, cost dropped 60%, eval score identical."

Common follow-ups: - "When would you actually need agents to be separate processes?" - "Where's the multi-agent overhead coming from?"

Traps: - Splitting agents along organizational lines instead of task boundaries. - Assuming multi-agent is automatically more capable — it's usually just more expensive. - Ignoring handoff cost in latency budget.

Related cross-cutting: Cost & latency Related module: learning/01_ai_engineering/01_agentic_system_design/


Tool schemas & descriptions

Q: "Why are tool descriptions more important than tool names?"

Tags: mid · very-common · conceptual · source: Anthropic "Writing Effective Tools for AI Agents" (anthropic.com/engineering/writing-tools-for-agents)

Answer outline: - The LLM picks tools by reading the full schema, not just the name. Description carries the disambiguation signal. - Anthropic: "small refinements to tool descriptions can yield dramatic improvements" — Claude Sonnet 3.5 reached SOTA on SWE-bench Verified after description tuning alone. - Good descriptions answer: when to call, when NOT to call, what other tools to prefer for nearby cases, input format requirements, example use. - Names handle coarse routing (gh_issues_search vs linear_issues_search); descriptions handle fine-grained "use this for X, not Y". - Treat descriptions as you'd treat a new-hire onboarding doc for that tool. Anthropic's phrasing: "how you would describe your tool to a new hire on your team". - Numbers to drop: "After we rewrote 12 tool descriptions to include 'use this when' + 'do NOT use this when' bullets, wrong-tool rate dropped from 14% → 3% on our eval, with no model change."

Common follow-ups: - "What's in a great tool description besides parameter types?" - "How do you evaluate description quality?"

Traps: - One-line descriptions like "Search GitHub issues." — useless for disambiguation. - Renaming a tool when the real fix is rewriting its description. - Letting two tools share overlapping descriptions; the model coin-flips between them.

Related cross-cutting: Production patterns Related module: learning/01_ai_engineering/01_agentic_system_design/


Q: "How do you design tool schemas that reduce hallucinated actions?"

Tags: senior · very-common · design · source: adilshamim8 Medium 2026; atul4u 2026 (Q20)

Answer outline: - Pydantic / JSON Schema with strict types — enums for categorical fields, regex for IDs, ranges for numerics. Reject before the tool runs. - Specific parameter names: user_id, not user. repository_full_name, not repo. Anthropic explicitly recommends this. - Required vs optional discipline — every required field should be required; everything else should default. Otherwise the model invents reasonable-sounding values. - Constrained generation: use the provider's structured-output mode (OpenAI tool_choice, Anthropic tool use) so the schema is enforced server-side. - Failure mode messages should restate the expected shape: "Expected user_id (string, UUID v4), got user_id='alice'". The model self-corrects on the next turn. - Numbers to drop: "Strict Pydantic + structured outputs cut parameter hallucinations from 8% → 0.4% in our payments agent eval (n=5k tasks)."

Common follow-ups: - "How does this play with chain-of-thought before the tool call?" - "Show me an enum vs free-string trade-off you've shipped."

Traps: - Loose Dict[str, Any] parameters — the model fills them with garbage. - Allowing the same field name with different meanings across tools. - Returning unstructured error strings; the model can't tell whether to retry.

Related cross-cutting: Production patterns Related module: learning/01_ai_engineering/01_agentic_system_design/


Q: "How does an agent choose which tool to use when multiple tools seem relevant?"

Tags: mid · very-common · conceptual · source: aemonline.net 25-advanced agentic AI 2026 (Q8); adilshamim8 Medium 2026

Answer outline: - The provider's tool-selection sampler reads name + description + schema of every registered tool and picks one in a single forward pass. - Signal sources: name namespacing (asana_search vs jira_search), description "use when/use not when" bullets, parameter shape (matching what the model already has). - Disambiguation tactic: include explicit comparisons in description — "Use this for code search; for prose search use docs_search." - Routing precondition: if you have >20 tools, do a description-based vector search first to shortlist 5-7 before showing them to the main model. Saves tokens and reduces confusion. - Hard guardrail: enums in the system prompt for which tools are valid in which agent role. - Numbers to drop: "Anthropic showed namespacing-by-service plus namespacing-by-resource (asana_projects_search vs asana_users_search) materially improved tool selection in tool-use benchmarks; we measured a 6-point F1 gain after adopting it."

Common follow-ups: - "What's tool retrieval and when do you need it?" - "How do you A/B test tool descriptions?"

Traps: - Loading 50 tools into the context unconditionally. - Names that don't disambiguate (search, get, update for three services). - Assuming the model will figure it out without a "do not use this for" clue.

Related cross-cutting: Production patterns Related module: learning/01_ai_engineering/01_agentic_system_design/


Q: "What is a 'tool retrieval' mechanism, and when is it necessary?"

Tags: senior · common · conceptual · source: aemonline.net 25-advanced agentic AI 2026 (Q9)

Answer outline: - Tool retrieval = semantic search over the registry of tools using the user's query, returning top-K to attach to the prompt. - Necessary above ~30 tools — beyond that, tool descriptions alone eat the context budget and the model's selection accuracy drops. - Implementation: embed each tool's name + description + example into a vector DB. At runtime, embed the user goal, retrieve top-7, attach those schemas. - Refresh on every loop turn or only on plan boundaries — the trade-off is one being more flexible and the other being more predictable. - Combine with metadata filters: scope ("read-only only", "current user has permission") before semantic ranking. - Numbers to drop: "Our internal agent has 180 tools; without retrieval, p95 latency was 9s and tool-pick accuracy was 71%. With top-7 retrieval, p95 dropped to 4s and accuracy went to 88%."

Common follow-ups: - "Do you retrieve once or per turn?" - "How do you keep the retrieved tool set stable across turns?"

Traps: - Retrieving fresh every turn and confusing the model with shifting tool sets. - Embedding only the name — descriptions carry most of the signal. - Skipping a permission filter; the model "discovers" tools it can't actually call.

Related cross-cutting: Retrieval Related module: learning/01_ai_engineering/01_agentic_system_design/


Q: "Design tool schemas for a financial advisor agent."

Tags: senior · occasional · design · source: synthesized from atul4u 2026 Q19-22 + Anthropic tool-writing guide

Answer outline: - Scope first: read-only tools (account_get_positions, market_quote_get, portfolio_risk_score) and mutating tools (portfolio_rebalance_propose, trade_order_submit). - All mutating tools take an idempotency_key (UUID) and an approval_token (only valid if HITL approved). - Strict enums for everything: order_type ∈ {market, limit, stop}, time_in_force ∈ {day, gtc, ioc}. No free strings. - dry_run: bool = True default on every mutating tool — orchestrator flips to false only after approval. - Structured errors: {ok: false, code: "INSUFFICIENT_FUNDS", retryable: false, hint: "Reduce quantity or use margin"}. - Audit attributes on every call: user_id, session_id, agent_version, reason. Stored separately from tool output. - Numbers to drop: "We targeted 100% audit coverage on mutating calls and got there by making audit_context a required Pydantic field — calls without it 400 at the tool boundary."

Common follow-ups: - "How do you stop the agent from inventing a CUSIP?" - "What gets logged for SOX compliance?"

Traps: - Free-form instruction: str arguments — guarantees prompt injection. - One mega-tool trade(action: str, ...) instead of typed siblings. - Mutating tools without idempotency — every retry doubles the position.

Related cross-cutting: Production patterns Related module: learning/01_ai_engineering/01_agentic_system_design/


Q: "How do you measure the quality of tool descriptions for an agent?"

Tags: senior · occasional · design · source: aemonline.net top-50 agentic AI 2026 (Q10); Anthropic "Writing Effective Tools"

Answer outline: - Build a held-out eval set of (user_goal, correct_tool, correct_args). Measure: tool-pick accuracy, arg-validity rate, end-to-end task success. - Interleaved thinking traces: Anthropic explicitly recommends inspecting why the model chose tool X over Y to find description gaps. - A/B descriptions: keep schemas constant, vary descriptions, measure delta on held-out eval. Treat tool docs like prompts — version them. - Confusion matrix between similar tools: if docs_search and code_search get swapped >5% of the time, descriptions need explicit "use this for X, not Y". - Automated: a meta-eval LLM scores descriptions for clarity, presence of examples, presence of negative cases. - Numbers to drop: "Our description-quality scorecard hits 4 axes (when-to-use, when-not, examples, error guide); raising the average from 2.1/4 → 3.6/4 lifted task success from 64% → 79%."

Common follow-ups: - "How often do you re-run tool description evals?" - "How do you keep examples in descriptions from going stale?"

Traps: - "Eyeballing" descriptions instead of measuring against an eval. - A/B testing model and descriptions simultaneously (can't attribute delta). - Treating descriptions as documentation rather than load-bearing input.

Related cross-cutting: Production patterns Related module: learning/01_ai_engineering/01_agentic_system_design/


Q: "What is 'lexical ambiguity' in tool names, and how can you prevent it?"

Tags: mid · occasional · conceptual · source: aemonline.net top-50 agentic AI 2026 (Q12)

Answer outline: - Lexical ambiguity: tool names that share words but mean different things, e.g. search (CRM customers) vs search (Confluence pages). - Fix with namespacing-by-service + namespacing-by-resource: salesforce_customer_search, confluence_page_search. Anthropic's exact recommendation. - Avoid English synonyms across tools — pick one verb per action class (get, list, create, update, delete) and stick to it. - Don't reuse a verb for both query and mutation (update_record should never read-only). - Validate with a name-collision linter at registration time — fails CI if two tools share a stem. - Numbers to drop: "After renaming 9 tools to <service>_<resource>_<verb> format, tool selection F1 went from 0.81 → 0.91; the lift held across GPT-4, Claude, and Gemini-Flash."

Common follow-ups: - "How do you handle this when tools come from external MCP servers you don't control?" - "Prefix vs suffix namespacing — which wins?"

Traps: - Tools called do_thing, process, handle — meaningless to the model. - Loading two tools with identical names from different servers without a prefix. - Trusting the description to disambiguate two tools that share a name (it sometimes doesn't).

Related cross-cutting: Production patterns Related module: learning/01_ai_engineering/01_agentic_system_design/


Q: "How do you design a robust function-calling interface that can handle malformed tool responses?"

Tags: senior · common · coding · source: aemonline.net top-50 agentic AI 2026 (Q6)

Answer outline: - Validate every tool response against a response schema before feeding it back. Treat tools as untrusted. - Three categories of malformed: missing fields, wrong types, surprising content (e.g., HTML when JSON expected). Each gets a different recovery path. - Wrap raw response in a structured envelope: {ok, parsed, raw_excerpt, warnings}. The LLM sees parsed and only raw_excerpt (truncated) if validation fails. - On parse failure, return a hint instead of dumping the raw payload: "Tool returned non-JSON; this usually means rate-limit or maintenance — try again with smaller limit." - Idempotency: if validation fails after retry N, mark as terminal failure and stop. Don't loop on the same broken tool. - Numbers to drop: "Adding response-schema validation cut downstream hallucinations from tool output by ~40% and halved the retry rate."

Common follow-ups: - "What's the right retry count for a tool returning 502?" - "How do you stop a 'fix the tool output' loop?"

Traps: - Treating tool stdout as model input verbatim. - Re-running a tool that returned a stable wrong format multiple times. - Dumping a 10MB HTML page into the LLM's next prompt because parsing failed.

Related cross-cutting: Production patterns Related module: learning/01_ai_engineering/01_agentic_system_design/


Q: "What is tool hallucination and how do you prevent it?"

Tags: mid · very-common · conceptual · source: aemonline.net 50-questions 2026 (Q36); aemonline.net 25-advanced Q21

Answer outline: - Two flavors: hallucinated tool (calls a tool that doesn't exist) and hallucinated args (right tool, made-up parameter values). - Prevent tool hallucination: provide tools only via the API's native tool-use channel, not in prose. Models hallucinate prose tools more than API-declared ones. - Prevent arg hallucination: strict schema + structured outputs + specific names. user_id: str (UUID v4) is harder to hallucinate than user: str. - Catch what slips through: pre-execution validation in the orchestrator, structured error back to the model. Anthropic calls this "poka-yoke your tools". - Telemetry: count schema-validation failures per tool per day. Sudden spike = description regression or model regression. - Numbers to drop: "Anthropic's tool-writing guide reports description tuning alone reached SOTA on SWE-bench Verified; we saw arg-hallucination drop from 8% → 0.4% after adding strict schemas + structured outputs."

Common follow-ups: - "How do you tell schema-validation failure from network failure?" - "Does temperature affect tool hallucination?"

Traps: - Putting tool catalog in the system prompt as prose and hoping the model uses real names. - Permissive Any types in schemas. - No telemetry on hallucination rate, so regressions ship unnoticed.

Related cross-cutting: Production patterns Related module: learning/01_ai_engineering/01_agentic_system_design/


Q: "How do you handle tool failures, retries, and idempotency?"

Tags: senior · very-common · design · source: adilshamim8 Medium 2026; atul4u 2026 (Q22); aemonline.net 50-questions (Q42)

Answer outline: - Classify failures: transient (5xx, rate limit, network) → retry with backoff; permanent (4xx, invalid auth, business rule) → don't retry, surface to LLM as structured error. - Retry policy lives in the orchestrator, not the prompt: max 3 attempts, exponential backoff with jitter, total time budget per tool. - Idempotency: every mutating tool takes an idempotency_key (UUID per logical action). Server dedupes — retries cannot create duplicates. - Circuit breaker: if a tool fails >50% over the last 10 calls in 60s, trip the breaker and route around it for the next 5 minutes. - Surface failures to the LLM as {ok: false, code, retryable: false, hint} — don't dump stack traces. - Numbers to drop: "After adding idempotency keys + circuit breakers, we eliminated 100% of duplicate-charge incidents (previously 3-4/month) and dropped p99 task latency 22% because retries no longer queued behind dying tools."

Common follow-ups: - "What's the right backoff curve for an OpenAI rate-limit error?" - "What if the tool is non-idempotent and you can't fix it?"

Traps: - Retrying 4xx errors (it's not going to work the next time either). - Retrying mutating tools without idempotency keys. - Letting the LLM "decide" whether to retry — it doesn't have circuit-breaker state.

Related cross-cutting: Production patterns Related module: learning/01_ai_engineering/01_agentic_system_design/


Q: "How do you sandbox tool execution safely?"

Tags: senior · common · design · source: adilshamim8 Medium 2026; atul4u 2026 (Q21)

Answer outline: - Containment levels: read-only tools run in-process; mutating tools run via a separate service with its own auth; code-execution tools run in firecracker/gVisor with no network and a fresh tmpdir. - Capability-based auth: each tool call carries a scoped token that names allowed resources, not the user's full creds. - Resource caps per call: CPU seconds, RSS memory, network egress, files written. Kill on overage. - Network egress allowlist per tool — web_search can hit Google; python_exec cannot reach the internet. - Output sanitization: scrub secrets/PII from tool output before it lands in LLM context (regex + classifier). - Numbers to drop: "Moving python_exec to gVisor + 256MB RSS + 10s CPU cut runaway-process incidents to zero; previously we had one per 50k tasks at 8GB RAM each."

Common follow-ups: - "How do you isolate code-exec tools from each other across users?" - "Where does the sandbox sit in the trace?"

Traps: - Letting code-exec tools share a filesystem with other tenants. - No egress allowlist, so the agent exfiltrates secrets via DNS. - Forgetting to scrub secrets from tool output that goes back to the LLM.

Related cross-cutting: Production patterns Related module: learning/01_ai_engineering/01_agentic_system_design/


Stopping rules & budgets

Q: "How do you prevent an agent from over-reasoning or over-planning?"

Tags: senior · very-common · design · source: adilshamim8 Medium 2026; atul4u 2026 (Q5)

Answer outline: - Hard cap on steps (max_iterations = 15-20 is typical for most agents). Trip → return best-effort answer with terminated=max_steps. - Hard cap on total tokens (e.g., 100k per task) and total wall clock (e.g., 60s for sync tasks). - Per-tool budget: tools that are cheap can be called 10x, expensive ones (LLM-judge, code-exec) capped at 2-3x. - Progress check: every 5 iterations, an evaluator LLM asks "has progress been made vs 5 steps ago?" If not, halt or escalate. - Plan once, execute many — plan-and-execute trims overthinking by separating decisions from execution. - Numbers to drop: "We cap at 20 steps and \(0.50/task; before caps, 4% of tasks consumed >\)5 and 0.1% consumed >$50. After caps, p99 cost is bounded at $0.50 and quality dropped only 1.2 pts on our eval."

Common follow-ups: - "What signals indicate over-planning specifically vs. over-reasoning?" - "How do you A/B test budget caps?"

Traps: - No global cap, only per-tool caps — agent can rack up 50 LLM-only "thinking" turns. - Hard cap with no informative fallback (returns nothing instead of best-effort). - Letting the LLM see its own budget — it negotiates.

Related cross-cutting: Cost & latency Related module: learning/01_ai_engineering/01_agentic_system_design/


Q: "How do you implement termination conditions in long-running agents?"

Tags: senior · very-common · design · source: adilshamim8 Medium 2026; atul4u 2026 (Q10)

Answer outline: - Multiple independent stop conditions, OR'd: step >= max_steps, tokens >= max_tokens, wall_clock >= max_seconds, cost >= max_cost, final_answer_emitted, human_halt_signal. - "Final answer" predicate: the agent must call a final_answer tool to terminate naturally. No free-form "I'm done" — the LLM can lie about being done. - Stuck detector: if last 3 tool calls are identical (same name + same args), halt; obvious loop. - No-progress detector: if 5 consecutive iterations produce no new tool calls (only "thinking"), halt. - Externalized kill switch: a Redis key the orchestrator polls. Ops can stop any agent without redeploy. - Numbers to drop: "Adding the 'identical-tool-call x3 → halt' rule caught 0.8% of runs that would otherwise have spun until max_steps, saving an average of 14k tokens each."

Common follow-ups: - "How does the agent know it's done without lying?" - "How do you handle agents that need 4 hours legitimately?"

Traps: - One stop condition only (e.g., step count) — agent burns the budget elsewhere. - LLM self-reported termination with no structural enforcement. - No kill switch — incident response means a deploy.

Related cross-cutting: Production patterns Related module: learning/01_ai_engineering/01_agentic_system_design/


Q: "How do agents decide a task is 'done'?"

Tags: mid · common · conceptual · source: adilshamim8 Medium 2026

Answer outline: - Three signals: explicit (model calls final_answer tool), implicit (no more tool calls, output is final prose), or external (orchestrator detects success criteria met). - Best practice: require the explicit final_answer(answer: str, confidence: float, supporting_evidence: List[str]) tool. Forces structured completion. - Have an evaluator LLM (or rule-based check) score the answer against the goal — if score < threshold, re-prompt with feedback. - For deterministic tasks (SQL query, code fix), use task-specific verifiers (run the SQL, run the tests). - Cap by budget regardless — "done by budget" is a legitimate termination state, labeled as such. - Numbers to drop: "Switching from 'no more tool calls' to explicit final_answer reduced false completions (agent said done while the task was unfinished) from 11% → 2.3% on our eval."

Common follow-ups: - "What's a verifier and where does it live?" - "How do you handle 'partial done' for long tasks?"

Traps: - Trusting the model's word that it's done. - No verifier on tasks where one is trivial to write (e.g., compile the code). - Treating "max steps reached" the same as "answered" in metrics.

Related cross-cutting: Production patterns Related module: learning/01_ai_engineering/01_agentic_system_design/


Q: "How do you detect and stop infinite planning loops?"

Tags: senior · common · debugging · source: adilshamim8 Medium 2026; aemonline.net top-50 (Q24)

Answer outline: - Loop signature: track (tool_name, normalized_args_hash) for last K calls. Three identical entries → halt. - Plan churn detector: if the agent re-plans 3+ times in 5 iterations, escalate. The plan isn't converging. - Cosine-similarity check on consecutive "thoughts" — if similarity > 0.92 across 3 turns, the model is reasoning in circles. - Token-velocity check: if last 5 turns produced no tool calls, only "thinking", halt with no_progress reason. - External watchdog: separate process sees stuck thread → kills it. Don't rely on the agent to notice it's stuck. - Numbers to drop: "Our loop detector (identical tool call x3) fires on ~0.6% of production runs and saves ~12k tokens each. Plan-churn detector fires on ~0.2% and prevents avg $1.20/run waste."

Common follow-ups: - "Show me the args-hash function — what counts as 'identical'?" - "How do you distinguish a legitimate retry from a loop?"

Traps: - Exact-string matching on tool args (misses near-identical calls). - Letting the LLM self-detect — by definition, it doesn't see the loop. - Hard-killing with no diagnostic dump.

Related cross-cutting: Production patterns Related module: learning/01_ai_engineering/01_agentic_system_design/


Q: "How do you control cost explosions from tool calls?"

Tags: senior · very-common · design · source: adilshamim8 Medium 2026

Answer outline: - Per-task hard budget in USD, computed live from token usage + tool call costs. Trip → halt with budget_exceeded. - Per-tool cost annotation in the registry: {name, usd_per_call, tokens_per_call}. Orchestrator pre-checks before allowing the call. - Tier expensive tools behind cheap-model gates: a Haiku-class model decides whether the Sonnet/GPT-5-class tool is worth invoking. - Cache at the tool boundary: same input → same output for read-only tools, TTL keyed by data freshness needs. - Soft and hard limits per tenant: 80% → alert, 100% → halt. Standard pattern from Stripe usage-based billing. - Numbers to drop: "Per-task \(0.50 cap eliminated all incidents of >\)5/task runs (was 4% before). LLM-judge caching brought eval cost down 70% with 0.4 pt accuracy loss."

Common follow-ups: - "How granular is your cost accounting — per token or per call?" - "What's a 'budget aware' agent vs a 'budget gated' agent?"

Traps: - Cost limits per request only, not per session — sessions stack up. - Caching mutating-tool results. - Letting the model see the remaining budget (it games it).

Related cross-cutting: Cost & latency Related module: learning/01_ai_engineering/01_agentic_system_design/


Q: "Your AI agent burns too many tokens per task. How do you reduce token consumption?"

Tags: mid · very-common · scenario · source: MindStudio 2026 "AI Agent Token Budget Management"; igmguru top-40 agentic AI 2026

Answer outline: - Audit first: log per-turn tokens, find the hot spots. Usually 70-90% of tokens come from re-sending tool results unchanged. - Compact tool outputs: tools should return only the fields the agent uses, summarize long results with a cheap model before re-injection. - Sliding-window or summary buffer for older messages — keep last N raw, summarize the rest. - Cache the system prompt + tools (Anthropic prompt caching, OpenAI cached_input). 90% discount on cached tokens. - Prefer cheap models for routing and final-answer formatting; reserve flagship models for hard reasoning steps. - Numbers to drop: "Prompt caching alone cut our agent's token cost 62% on cached portions; tool-output compaction cut overall context size 45%. Combined: $0.034 → $0.011 per task."

Common follow-ups: - "What's the right cache TTL for the tool definitions block?" - "Sliding window vs summarization — when does each break?"

Traps: - Caching prompts that change per request (defeats cache). - Summarizing tool output and losing the IDs the agent needs. - Optimizing tokens without measuring quality — easy to regress accuracy 5%.

Related cross-cutting: Cost & latency Related module: learning/01_ai_engineering/01_agentic_system_design/


Q: "How do you rate-limit and budget an agent's tool usage?"

Tags: senior · common · design · source: atalupadhyay "50 most asked agentic AI" 2026 (Q22)

Answer outline: - Three layers: per-tool RPS (protect the tool), per-agent RPS (protect downstream), per-tenant RPS (fairness). - Implementation: Redis token bucket keyed by (tenant_id, tool_name). 429 surfaces as structured error to the agent — "retry in 2.1s, suggest different tool". - Budget enforcement uses both call count and cumulative cost. Some tools (web search) are cheap-per-call but expensive in aggregate. - Burst capacity: short bursts ok (10x for 5s), then drain. Prevents legitimate concurrent calls from getting throttled. - Per-tool weight: a code_exec call counts as 100 units, a cache_get as 1. Same bucket, different rates. - Numbers to drop: "Per-tenant 100 RPS soft cap + 500 RPS hard cap stopped one customer's runaway agent from saturating our 10k RPS database in production."

Common follow-ups: - "What's the difference between rate limiting and budgeting in this context?" - "How does the agent recover from a 429?"

Traps: - Only client-side rate limiting (one buggy client takes everyone down). - Returning 429 as an unstructured string the agent can't parse. - No per-tenant cap, so noisy neighbors starve others.

Related cross-cutting: Production patterns Related module: learning/01_ai_engineering/01_agentic_system_design/


Approval gates

Q: "Define agent autonomy boundaries — what can it do without human approval?"

Tags: senior · very-common · design · source: adilshamim8 Medium 2026; atul4u 2026 (Q3)

Answer outline: - Default-deny on side effects. Start with read-only autonomy and explicitly add categories. - Three tiers: (1) read-only — fully autonomous, (2) reversible writes (draft email, scratch table) — autonomous with audit, (3) irreversible — always human-gated. - Per-tool mutation_class annotation: read | reversible | irreversible | financial | external_send. Policy engine maps class to required approval level. - Blast radius cap: even autonomous writes cap rows/files/dollars per task. update_records can change ≤ 50 rows without approval. - Confidence threshold: if model's self-reported confidence < 0.7 on an action, escalate regardless of class. - Numbers to drop: "We classify ~30% of our 60 tools as irreversible; those route through HITL with target 2-minute approval SLA and ~95% approve rate."

Common follow-ups: - "How do you measure confidence reliably from an LLM?" - "What's the policy when approval times out?"

Traps: - All-or-nothing autonomy (either full HITL or no gates). - No blast-radius cap on reversible writes — "reversible" 10k rows is still bad. - Confidence based on LLM self-report only.

Related cross-cutting: Production patterns Related module: learning/01_ai_engineering/01_agentic_system_design/


Q: "An agent is about to perform an action with irreversible consequences, like deleting files. How should the system be designed?"

Tags: senior · very-common · scenario · source: aemonline.net 25-advanced agentic AI 2026 (Q15)

Answer outline: - Intercept at the tool boundary, not in the prompt. The file_delete tool's wrapper checks for approval_token; without it, returns "approval_required" instead of executing. - Surface to a human reviewer: structured message with file paths, sizes, last-modified, agent's reasoning. UI shows approve/edit/reject/respond (LangGraph's four-option pattern). - Persistent state during pause: LangGraph interrupt() + checkpointer keeps the agent's full state durable, resumes from exact step. - Soft-delete first: where possible, move to trash with TTL instead of hard delete. Reduces blast radius of approved-but-wrong actions. - Audit log: who approved, when, with what justification, what was deleted. Tamper-evident store. - Numbers to drop: "Our SLA: irreversible-action approvals route to on-call in <30s, p95 approve-or-reject in 90s, target 99% completion within 5-minute timeout."

Common follow-ups: - "What happens if no human approves within the timeout?" - "How do you stop reviewers from rubber-stamping?"

Traps: - Approval gate in the system prompt only — prompt injection bypasses it. - Hard delete with no soft-delete option. - No timeout policy — gates stuck indefinitely.

Related cross-cutting: Production patterns Related module: learning/01_ai_engineering/01_agentic_system_design/


Q: "Design an agent for end-to-end customer onboarding. Where does it call humans?"

Tags: senior · common · design · source: aemonline.net top-50 2026 (Q47-variant)

Answer outline: - Phases: identity verification → KYC document review → account provisioning → first-product setup → welcome handoff. - Human checkpoints: (1) KYC review when confidence < 0.85 or jurisdiction is high-risk; (2) any provisioning above a dollar threshold; (3) discrepancy between submitted docs and external KYC service. - Auto-paths: extracting fields from ID docs, scheduling welcome calls, sending templated emails — autonomous with audit log. - Failure handoff: 3 retries on KYC service → human takes over the case, not the agent restarting. - Per-step SLA: identity (5s), KYC review (15 min if human-gated), provisioning (30s). Total target: 10 min unattended, 25 min with one review. - Numbers to drop: "Onboarding agent at a fintech I worked with handles ~80% of cases fully autonomously, 15% with one human checkpoint, 5% full-takeover. SLA: 12 min median, 45 min p95."

Common follow-ups: - "How does the human inherit the agent's state when they take over?" - "What's the rollback if onboarding fails at step 4?"

Traps: - One single approval at the end ("review this whole 12-step run"). - No takeover path — agent retries forever on a doc it cannot parse. - Trusting LLM-extracted KYC fields without a deterministic schema check.

Related cross-cutting: Production patterns Related module: learning/01_ai_engineering/01_agentic_system_design/


Q: "How do you prevent human-in-the-loop from becoming a bottleneck?"

Tags: senior · common · design · source: bestaiweb / Mastra "Human-in-the-Loop" 2026; Strata "Practicing HITL" 2026

Answer outline: - Calibrate confidence thresholds: only genuinely uncertain actions (calibrated below 0.85 say) route to review, not every action. - Tiered SLAs: 15s for low-risk, 2 min for PII, 15 min for financial. Match latency to risk; don't make every gate same-SLA. - Async approval: agent parks the action and continues with parallel sub-tasks instead of blocking. - Batch similar approvals: 20 "send email to customer" actions arrive in one queue, reviewed in a batched UI. - Watch automation bias: reviewers who see 1000 similar agent proposals per day rubber-stamp. Track approval-to-reject ratio per reviewer; if it goes >99%, rotate or sample-audit. - Numbers to drop: "Tiered SLAs cut median approval latency from 4m → 35s while halving total reviewer hours. We sample-audit 5% of auto-approved low-risk actions to catch reviewer drift."

Common follow-ups: - "How do you measure reviewer fatigue?" - "What's a calibrated confidence threshold?"

Traps: - Every action gates → reviewers burn out, throughput collapses. - Confidence not calibrated → threshold is meaningless. - No async path — agent sits idle waiting.

Related cross-cutting: Production patterns Related module: learning/01_ai_engineering/01_agentic_system_design/


Q: "Can an agent have 'doubt,' and should it ask for human confirmation?"

Tags: mid · occasional · conceptual · source: aemonline.net top-50 agentic AI 2026 (Q46)

Answer outline: - LLMs have noisy self-reported confidence; treat it as one signal, not the only one. - Better proxies: log-prob entropy on the tool-call token, disagreement among N samples (self-consistency), evaluator-model agreement. - Define "doubt triggers": multiple plausible tools score similarly, args have no schema match, user's intent ambiguous, missing required info. - Action on doubt: a clarify tool that emits a question to the user, OR an escalate tool that pages a human. Not just "guess and hope". - Don't over-clarify — every clarification is friction. Cap at 1-2 per task or use a clarifier-budget. - Numbers to drop: "Our 'doubt detector' (entropy + N-sample disagreement) fires on ~7% of turns; turning those into clarifications dropped wrong-action rate from 12% → 4% with only a small NPS hit (-1.3 points)."

Common follow-ups: - "Show me your confidence calibration plot." - "What's the right ratio of clarifications to autonomous answers?"

Traps: - Trusting raw LLM "I'm 90% sure" tokens. - Asking clarifying questions for every uncertainty (users hate it). - No clarifier-budget so the agent stalls.

Related cross-cutting: Production patterns Related module: learning/01_ai_engineering/01_agentic_system_design/


Q: "How do you implement human approval in LangGraph?"

Tags: mid · common · coding · source: LangChain docs "Human-in-the-Loop" (docs.langchain.com); LangChain blog "interrupt"

Answer outline: - Use interrupt() in a graph node where approval is needed. State is persisted by the checkpointer; the call returns to the caller. - The host app surfaces the interrupted state to a human and resumes with graph.invoke(Command(resume=...)) carrying the decision. - Four standard outcomes: approve, edit (modify before run), reject (with feedback), respond (used by "ask user" style tools). - Requires a checkpointer (Postgres, SQLite, Redis) — interrupt() only works with durable state. In-memory checkpointer is for tests. - Make interrupts dynamic, not static — conditional on action.value > threshold, not on every action. Avoids ceremonial gating. - Numbers to drop: "LangGraph's interrupt() with Postgres checkpointer survives node restarts; we measured 100% resume rate across 50k production interrupts over the last quarter."

Common follow-ups: - "What if the checkpointer is unavailable when interrupt fires?" - "How do you let approvers edit the agent's proposed action?"

Traps: - interrupt() without a checkpointer → state lost. - Static interrupt() on every tool call (rubber-stamp problem). - Not handling the respond case for clarifications, only approve/reject.

Related cross-cutting: Production patterns Related module: learning/01_ai_engineering/01_agentic_system_design/


Q: "How do you implement guardrails for safe and predictable agent behavior?"

Tags: senior · common · design · source: aemonline.net top-50 agentic AI 2026 (Q46-variant)

Answer outline: - Pre-execution guardrails: input filters (PII detection, prompt injection classifier, jailbreak detection), policy engine that decides allow/deny/escalate. - In-loop guardrails: tool selection allowlist by role, parameter validation, budget checks, blast-radius caps. - Post-execution guardrails: output filters (PII redaction, toxicity check), response evaluator scoring against goal. - Policy as code: OPA/Rego or a domain-specific rule engine, version-controlled, tested. Don't hide policy in prompt strings. - Layered defense: even if one guardrail fails, another catches. Anthropic's framing is "constitutional AI" baked into reasoning + structural checks at boundaries. - Numbers to drop: "Post-execution toxicity filter catches ~0.04% of agent outputs; pre-execution prompt-injection filter catches ~0.3% of inputs. Both layers together = 0 toxic outputs over 500k requests last month."

Common follow-ups: - "Where does the policy engine sit in your topology?" - "How do you A/B test a new guardrail without blocking real traffic?"

Traps: - Single point of guardrail (only input or only output). - Guardrails in the prompt only. - No mechanism to update guardrails without redeploy.

Related cross-cutting: Production patterns Related module: learning/01_ai_engineering/01_agentic_system_design/


MCP & tool protocols

Q: "What is MCP and why does it matter?"

Tags: screen · very-common · conceptual · source: modelcontextprotocol.io; Piyali Das "10 MCP Interview Questions" 2026; aemonline.net top-50 (Q34)

Answer outline: - MCP = Model Context Protocol — open standard from Anthropic (Nov 2024) for connecting LLMs to tools, resources, and prompts via a client-server interface. - Three concepts: MCP client (your agent), MCP server (exposes tools/resources), transport (stdio for local, HTTP/SSE for remote, streamable HTTP for production). - Why it matters: standardizes tool integration so you stop writing one custom adapter per service. One MCP server can serve any MCP-compliant client. - Not a replacement for RAG: MCP handles tool execution and resource fetching; RAG is a retrieval pattern that can use MCP as transport. - 2026 roadmap focus: transport scalability, agent-to-agent communication, governance maturation. - Numbers to drop: "After standardizing on MCP, we cut net-new tool integration time from ~4 engineering days to ~0.5 day per tool, and 60+ open-source MCP servers covered our top integrations out of the box."

Common follow-ups: - "What's the difference between MCP and a regular REST API?" - "When would you NOT use MCP?"

Traps: - Treating MCP as just a buzzword without naming client/server/transport. - Building a custom protocol in 2026 when MCP would serve. - Confusing MCP servers with agent frameworks like LangGraph.

Related cross-cutting: Production patterns Related module: learning/01_ai_engineering/01_agentic_system_design/


Q: "How does an LLM discover MCP tools?"

Tags: mid · common · conceptual · source: Piyali Das "10 MCP Interview Questions" 2026 (Q6); modelcontextprotocol.io spec

Answer outline: - The MCP server exposes tools/list — returns each tool's name, description, JSON Schema for input and output. - Client calls tools/list on connection (or refreshes on tools/list_changed notifications). - Discovered tools are merged into the agent's tool registry — alongside native function-calling tools. - Selection: same mechanism as any other tool — model reads the descriptions and picks. Same namespacing rules apply (gh_mcp_issues_search, not just search). - Permission filtering at the client: even if a server exposes 50 tools, the client can surface only the ones the current user is authorized to call. - Numbers to drop: "Our agent connects to 4 MCP servers exposing 28 tools combined; we filter to ~12 per agent role via the client's allowlist, keeping the context overhead under 4k tokens."

Common follow-ups: - "What happens if tools/list_changed fires mid-task?" - "How do you stop a malicious MCP server from injecting tools?"

Traps: - Loading every MCP server's full tool list into context unfiltered. - Trusting tool descriptions from external servers without review. - Not handling tools/list_changed — stale registry leads to silent failures.

Related cross-cutting: Production patterns Related module: learning/01_ai_engineering/01_agentic_system_design/


Q: "What are the security risks of MCP and how do you mitigate them?"

Tags: senior · common · design · source: Coalition for Secure AI "MCP Security" 2026; Security Boulevard "How MCP servers handle authentication"; Checkmarx "MCP Security"

Answer outline: - Static tokens dominate (53% of production servers per Coalition for Secure AI 2026 scan of 518 servers; 41% have no auth at all). Long-lived tokens in multi-agent chains have geometrically wider blast radius than in conventional apps. - Use short-lived OAuth tokens + DPoP (RFC 9449) to bind tokens to the requesting client. Prevents replay if intercepted. - Authorization in three layers: (1) authenticate the client, (2) authorize the connection, (3) per-action authorization. Most production servers stop at layer 2; layer 3 is the gap. - End-to-end traceability: every action must trace back to the initiating user and every intermediate server. Propagate identity context across the chain. - Server allowlist: don't auto-connect to any MCP server the user mentions. Curated registry per tenant. - Numbers to drop: "After moving to short-lived tokens + per-action authorization, we caught a malicious server attempting privilege escalation 14 times in one week — none succeeded vs. the previous static-token regime."

Common follow-ups: - "What's prompt injection's role in MCP exploits?" - "How do you isolate one MCP server from another?"

Traps: - Static API keys with no rotation. - Trusting tool output as safe input to other tools (cross-server injection). - No identity propagation — action looks like it was done by the agent service, not the user.

Related cross-cutting: Production patterns Related module: learning/01_ai_engineering/01_agentic_system_design/


Q: "Difference between MCP and APIs?"

Tags: screen · common · conceptual · source: Piyali Das "10 MCP Interview Questions" 2026 (Q4)

Answer outline: - APIs (REST/gRPC): general-purpose, designed for human developers, manual integration per endpoint. - MCP: AI-native, designed for LLM consumption, includes tool discovery, structured descriptions, resource lists, prompt templates as first-class primitives. - MCP can wrap an existing REST API — most production MCP servers are thin adapters over an internal API plus an LLM-friendly description layer. - MCP standardizes the integration shape, so any MCP-compliant client (Claude Desktop, Cursor, Cody, your agent) gets the tool for free. - APIs answer "how do I call this endpoint?"; MCP answers "what tools are available, when should the model use them, and what's the expected output shape?". - Numbers to drop: "Our internal CRM API exposes 80 endpoints; the MCP wrapper exposes 12 carefully-curated tools with rich descriptions. Tool-pick accuracy went from 64% (raw OpenAPI generation) to 89% with the curated MCP surface."

Common follow-ups: - "Why not just point the LLM at an OpenAPI spec?" - "Should every internal API have an MCP server?"

Traps: - Auto-generating MCP tools 1:1 from OpenAPI — surface too noisy, descriptions too thin. - Treating MCP as a replacement for APIs (it's a layer on top). - Exposing every endpoint, not just the safe and useful subset.

Related cross-cutting: Production patterns Related module: learning/01_ai_engineering/01_agentic_system_design/


Q: "When would you choose MCP over a custom tool integration?"

Tags: mid · common · design · source: Piyali Das "10 MCP Interview Questions" 2026 (Q10); Red Hat "Building effective AI agents with MCP"

Answer outline: - Choose MCP when: multiple clients will use the same tool (Claude Desktop + your agent + Cursor); tool surface evolves often (descriptions change without client redeploys); third-party integrations available off-the-shelf. - Stick with native function-calling when: latency is critical (MCP transport adds ~5-30ms); the tool is one-off; you need fine-grained control over the schema and selection logic. - For internal tools, even 1 client today, MCP can future-proof — but don't adopt it prematurely if the team isn't ready to operate an MCP server. - Off-the-shelf MCP servers in 2026 cover GitHub, Linear, Slack, Notion, Postgres, Sentry, dozens more. Don't reinvent. - Production checklist: auth (OAuth + short-lived tokens), audit, rate limits, per-action authz, observable transport. - Numbers to drop: "We adopted MCP for 4 external SaaS integrations (GitHub, Linear, Slack, Sentry), kept native function-calling for 18 internal tools. Net effect: 60% less glue code, 0 latency regression."

Common follow-ups: - "What's the operational overhead of running an MCP server?" - "How do you migrate from custom tools to MCP without breaking the agent?"

Traps: - All-in on MCP without a fallback for tools that need <100ms latency. - Running a leaky MCP server with static creds (now a much bigger surface). - Mixing MCP and native tools without a unified registry abstraction.

Related cross-cutting: Production patterns Related module: learning/01_ai_engineering/01_agentic_system_design/


Q: "How does MCP handle long-running tool calls?"

Tags: senior · occasional · conceptual · source: MCP 2026 Roadmap (blog.modelcontextprotocol.io/posts/2026-mcp-roadmap); The New Stack "MCP biggest growing pains" 2026

Answer outline: - Current spec: clients can start asynchronous Tasks — invoke once, retrieve result later. Lets the agent kick off long jobs without blocking. - Open gaps the 2026 roadmap is closing: retry semantics for failed jobs, result retention windows, observability of in-flight tasks across clients. - Practical pattern today: MCP tool returns a task_id immediately, agent polls a tasks/get tool, or subscribes to completion notifications. - Trade-offs: synchronous calls are simpler but block the loop; async is correct but adds plumbing the agent must reason about. - Timeout policy belongs to the client — server publishes "estimated_duration", client decides whether to wait or convert to async. - Numbers to drop: "Our deep-research tool runs 30-180s. Wrapped in async MCP, the agent polls every 5s and continues other work meanwhile; we cut median task wall-time 40% by parallelizing the wait."

Common follow-ups: - "What's the failure mode when a job crashes mid-flight?" - "How long does an MCP server keep completed results?"

Traps: - Synchronously blocking the agent on a 60s tool call. - No retention policy on async results, lost on server restart. - Polling without backoff (DDoSes your own MCP server).

Related cross-cutting: Production patterns Related module: learning/01_ai_engineering/01_agentic_system_design/


Failure handling

Q: "Your agent hits a tool that returns 500. What happens next?"

Tags: mid · very-common · scenario · source: aemonline.net 25-advanced 2026 (Q10); aemonline.net top-50 (Q9-variant)

Answer outline: - Orchestrator classifies 5xx as transient → retry with exponential backoff + jitter, max 3 attempts, total budget 10s. - On final failure, return a structured error to the LLM: {ok: false, code: "TOOL_500", retryable: false_after_retries, hint: "Try a different tool or fall back"}. - LLM's response options: try an alternative tool, ask the user, or end with a partial answer + reason. Never let it loop on the same failed call. - Circuit breaker: if the tool has failed >50% over the last 10 calls, the breaker trips and the orchestrator routes around it for 5 minutes. - Track in telemetry: tool_id, retry_count, final_outcome, agent_recovery_action. Alert if recovery rate < 90%. - Numbers to drop: "Our retry policy (3 attempts, exp backoff 1s/2s/4s + 20% jitter) handles ~95% of transient 5xx; circuit breaker on top reduces cascading failures during partial outages by ~70%."

Common follow-ups: - "What if there's no alternative tool?" - "How do you stop the LLM from retrying after the orchestrator gave up?"

Traps: - Letting the LLM see "Tool failed, you might want to retry" as a free-form string (it will retry forever). - No circuit breaker → cascading failures. - Retrying writes without idempotency.

Related cross-cutting: Production patterns Related module: learning/01_ai_engineering/01_agentic_system_design/


Q: "How do you handle errors when an API call fails (e.g., rate limit, authentication error, invalid parameters)?"

Tags: mid · very-common · scenario · source: aemonline.net 25-advanced agentic AI 2026 (Q10)

Answer outline: - Different error classes need different responses: 429 → backoff + retry (parse Retry-After header); 401/403 → refresh credentials, then retry once, then fail; 400 → don't retry, surface to LLM as "invalid_args" with the validator's hint. - 401 retry uses a token-refresh flow in the orchestrator, not in the LLM prompt. - 400 carries the exact validation message back to the LLM so it can correct args on the next turn. - 5xx → retry with backoff as transient. - Pattern: error envelope {ok: false, http_status, code, message, retryable, hint} standardized across all tools. - Numbers to drop: "Standardizing the error envelope dropped time-to-recovery from agent retries by 35% — the LLM now corrects on the next turn instead of probing."

Common follow-ups: - "What's the right retry-after when the API doesn't send one?" - "How do you avoid leaking the new token to the LLM?"

Traps: - Retrying 4xx (won't succeed). - Putting auth tokens in tool responses the LLM sees. - Same retry curve for every error class.

Related cross-cutting: Production patterns Related module: learning/01_ai_engineering/01_agentic_system_design/


Q: "How would you design a tool-use function robust to the LLM hallucinating parameter names or values?"

Tags: senior · common · design · source: aemonline.net 25-advanced agentic AI 2026 (Q21)

Answer outline: - Strict Pydantic / JSON Schema with descriptive error messages: when validation fails, return "Expected user_id (UUID v4), got username='alice'. Use user_lookup to convert username→UUID first." - Type coercion where safe (int vs string for IDs) and rejection elsewhere — don't auto-fix ambiguous types. - Each parameter description includes 1-2 example values: user_id: "550e8400-e29b-41d4-a716-446655440000". Anchors the model. - Provider-native structured-output mode (OpenAI tool_choice strict, Anthropic tools) enforces schema at the sampler level. - Tracker: count validation_failures_per_tool_per_day. Spike = description regression or model regression. - Numbers to drop: "Strict Pydantic + structured outputs + hint-rich errors cut arg-validation failures from 8% → 0.4% on our payments agent eval; we left 0.4% as 'genuinely ambiguous user intent', not hallucinations."

Common follow-ups: - "What if the model insists on the wrong value across multiple retries?" - "Why not just allow any string and parse later?"

Traps: - Dict[str, Any] parameters. - Validation errors dumped as raw Pydantic tracebacks. - No telemetry on hallucination rate.

Related cross-cutting: Production patterns Related module: learning/01_ai_engineering/01_agentic_system_design/


Q: "Your agent is stuck in an infinite loop calling the same tool. How do you detect and stop it?"

Tags: senior · common · debugging · source: aemonline.net top-50 agentic AI 2026 (Q24)

Answer outline: - Detection in the orchestrator: hash (tool_name, normalized_args) over the last K calls. If the same hash appears 3 times in 5 calls, halt with loop_detected. - Normalize args before hashing: lowercase, sort keys, strip whitespace. Otherwise near-identical args evade detection. - Plan-stage detector: if the agent re-plans 3 times in a window, escalate — the plan isn't converging. - On halt, dump the last 5 messages, the failing tool call hash, and the agent state. Don't silently terminate. - After halting, route to either user clarification (if the loop is ambiguity-driven) or human takeover (if it's a tool-availability issue). - Numbers to drop: "Loop detection fires on ~0.6% of runs; pre-detector, those runs averaged 18 steps to budget cap and ~$1.40 wasted. Post-detector, halt at step 5, dump-and-escalate within 3s."

Common follow-ypes: - "What's the right value of K?" - "How do you keep the loop detector from false-positiving on legitimate retries?"

Traps: - Exact-string args matching (misses near-duplicates). - No diagnostic dump on halt. - Loop detector tied to identical literal args; misses semantic loops.

Related cross-cutting: Production patterns Related module: learning/01_ai_engineering/01_agentic_system_design/


Q: "What are the most dangerous failure modes of agentic AI?"

Tags: senior · common · conceptual · source: adilshamim8 Medium 2026; atul4u 2026 (Q38); lockedinai (Q34)

Answer outline: - Tool misuse: agent calls a destructive tool with wrong args (delete the wrong table). Mitigation: schema strictness, blast-radius caps, HITL for mutating tools. - Infinite reasoning loops: agent burns budget without progress. Mitigation: step cap, loop detector, no-progress detector. - Prompt injection via tool outputs: a malicious doc the agent reads instructs it to exfiltrate data. Mitigation: input/output filters, structured channels, action allowlist. - Hallucinated tool calls: agent invents a tool or argument. Mitigation: strict schemas, structured outputs. - Scope creep: agent attempts actions beyond its authority (e.g., reads cross-tenant data). Mitigation: capability-based auth, per-action authz. - Numbers to drop: "Across post-mortems on 47 agent incidents at one fintech, root causes: scope creep 28%, tool misuse 21%, injection 17%, infinite loops 13%, hallucination 11%, other 10%."

Common follow-ups: - "Which of these is hardest to detect in production?" - "How would you prioritize fixing them?"

Traps: - Listing risks without naming concrete mitigations. - Treating all failure modes as equally likely (scope creep dominates in practice). - No incident retrospective process for agents.

Related cross-cutting: Production patterns Related module: learning/01_ai_engineering/01_agentic_system_design/


Q: "How do you prevent agents from executing irreversible actions accidentally?"

Tags: senior · common · design · source: atalupadhyay "50 most asked agentic AI" 2026 (Q30)

Answer outline: - Tag every tool with a mutation_class: read | reversible | irreversible | external_send | financial. Policy engine maps class to gating policy. - Irreversible tools require a one-time-use approval_token in their wrapper; absence → 403 at the tool boundary, not the prompt. - Soft-delete first wherever the underlying system supports it (S3 versioning, soft-delete columns, message-deleted flag). Lets you reverse "irreversible" actions in practice. - Dry-run mode default: dry_run=True; the agent's normal output is a proposed action. Approval flips to dry_run=False and executes. - Pre-execution simulation: for things like SQL UPDATE, run an EXPLAIN-style preview to show row count and a sample. Surfaces "this would change 12M rows" before approval. - Numbers to drop: "Dry-run-default + soft-delete cut 'agent did the wrong thing irreversibly' incidents from 2-3/month to 0 in 6 months across our top 4 production agents."

Common follow-ups: - "What about external sends like emails?" - "How do you preview a tool call without running it?"

Traps: - "Approval check" only in the prompt. - Hard delete by default, retrofitting soft delete later. - No dry-run mode on mutating tools.

Related cross-cutting: Production patterns Related module: learning/01_ai_engineering/01_agentic_system_design/


Q: "What is idempotency and why is it required for agents?"

Tags: mid · common · conceptual · source: atalupadhyay "50 most asked agentic AI" 2026 (Q42)

Answer outline: - Idempotent operation: calling it N times with the same input has the same effect as calling it once. Agent retries (which happen often) cannot create duplicates. - Implementation: every mutating tool accepts an idempotency_key: str (UUID); server-side dedupe within a TTL (24h is common). - The agent generates one key per logical action and reuses it across retries. Orchestrator stores it with the action plan. - Pair with two-phase commit for multi-step actions: write intent_id first, then execute, confirm or rollback. - Without idempotency: agent retries on transient error → user charged twice; agent loops on payment_create → 10 duplicate charges in 30s. - Numbers to drop: "Idempotency keys eliminated 100% of duplicate-charge incidents (3-4/month previously) at one fintech; the engineering cost was ~2 weeks per service."

Common follow-ups: - "What's the right TTL for an idempotency key?" - "How does the agent know which actions need keys?"

Traps: - Idempotent for create only, forgetting update/delete. - Short TTL (5 min) that expires while a long-running task retries. - Letting the LLM generate idempotency keys (it sometimes reuses them across distinct actions).

Related cross-cutting: Production patterns Related module: learning/01_ai_engineering/01_agentic_system_design/


Q: "How do you handle a tool call that takes 90 seconds?"

Tags: senior · occasional · scenario · source: synthesized from MCP 2026 roadmap + Temporal/LangGraph long-running patterns

Answer outline: - Default agent loops are synchronous and 90s blocks everything — switch to async pattern: tool returns task_id, agent gets a tool_status_get(task_id) poller. - Two choices for the agent meanwhile: do nothing (just wait, with timeouts) or run parallel sub-tasks. Choice depends on whether downstream steps depend on the long tool's output. - Durable execution layer (Temporal, LangGraph + Postgres checkpointer) keeps the agent state alive across the long call. - Set explicit user expectations: surface "this will take ~90s, you'll get a notification" — don't pretend it's instant. - Webhook callback pattern for very long tools (minutes): tool calls back the orchestrator, which resumes the agent's checkpoint. - Numbers to drop: "Async tools with Temporal-backed durable agents survived a 45-min outage of one of our long-running providers with zero lost tasks; previously, ~120 in-flight runs would have died."

Common follow-ups: - "How does the agent know when to give up vs keep polling?" - "What's the latency cost of moving from sync to async?"

Traps: - Synchronously blocking the agent's whole loop on a 90s call. - No durable state → agent dies on deploy, loses the in-flight task. - Polling without exponential backoff (hammers the tool).

Related cross-cutting: Production patterns Related module: learning/01_ai_engineering/01_agentic_system_design/