Agent Debugging in Production: Interview Questions

The "you wake up at 3am, your agent fleet is doing something weird" round. Different from agents-design.md (which covers loop design, stopping rules, tool schemas) and from ../conceptual/observability-tracing.md (which covers the trace platform). This file is about the debugging methodology when agents misbehave in production — replay, partial-failure recovery, reflection-as-debug, conflict resolution between tools, reliability metrics, and the playbook for the most common 2026 agent incidents.

The senior tell is naming the signal, the replay path, and the regression test — "I'd pull the trace, replay it deterministically with seed=X, find the divergence step, add it to the eval set" beats "I'd look at logs".

Trace-driven debugging

Q: "What is an 'agent trace' and why is it more important for debugging than an LLM's text output?"

Tags: mid · very-common · conceptual · source: AEM Institute 25 Advanced Agentic AI Interview Questions 2026; standard senior debugging probe

Answer outline:

- An agent trace is a structured step-by-step record of one agent run: chain-of-thought, internal state before and after each action, exact tool calls with arguments and raw outputs, retrieval results, reasoning for the next step choice, termination cause.
- Why it beats the text output: agent failures are usually in the process, not in the final answer. The text output says "I couldn't help with that"; the trace shows the agent picked a wrong tool, got a malformed response, looped 8 times, and gave up.
- Structure: nested spans (root = user request; children = each loop iteration; grandchildren = planning, tool calls, observations, guardrail checks). Each span has attributes (tokens, latency, cost, model version, prompt version) and events (cache hit, retry, fallback).
- The senior debugging tell: "I'd open the trace waterfall, look for the divergence point, inspect the actual content of the failing step", not "I'd look at the response and try to guess what happened".
- Storage requirement: traces consume real storage. Sample wisely: 100% of error traces, 1-10% of normal, 100% of paid-tier customers. See observability-tracing.md for the sampling-and-cost angle.
- Numbers to drop: "typical agent run: 10-50 spans", "trace storage: 1-10 KB per span", "MTTR with traces: 30-90 min vs hours-to-days without"
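The nested-span structure described above can be sketched as a small tree. This is a minimal illustration, not any tracing platform's data model: the `Span` class, its field names, and the `search_orders` tool are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Iterator


@dataclass
class Span:
    """One node in a trace tree (hypothetical shape, for illustration only)."""
    name: str
    attributes: dict[str, Any] = field(default_factory=dict)
    events: list[str] = field(default_factory=list)
    children: list["Span"] = field(default_factory=list)

    def walk(self, depth: int = 0) -> Iterator[tuple[int, "Span"]]:
        """Yield (depth, span) pairs in depth-first order."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


# Root = user request, children = loop iterations, grandchildren = steps.
failing_call = {"tool": "search_orders", "args": {"order_id": "A17"},
                "raw_output": "{malformed"}
root = Span("user_request", {"prompt_version": "v12"}, children=[
    Span("iteration_1", children=[
        Span("plan", {"tokens": 412, "latency_ms": 900}),
        Span("tool_call", dict(failing_call), events=["retry"]),
    ]),
    Span("iteration_2", children=[
        Span("tool_call", dict(failing_call), events=["retry"]),
    ]),
])

# Content, not just metadata: the repeated malformed output is visible here.
retried = [span.name for _, span in root.walk() if "retry" in span.events]
```

Because the tool arguments and raw output live on the span itself, the repeated malformed response is visible directly in the tree, which is the point of capturing content rather than only metadata.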

Common follow-ups:

- "What goes in a span attribute?"
- "How long do you keep traces?"
- "Walk me through debugging a specific trace."

Traps:

- Calling logs "traces". Flat logs miss the tree structure.
- Capturing trace metadata but not content. You need the actual prompt/output/tool args to debug.

Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/03_agent_observability_debugging/

Q: "Walk me through debugging a failing agent run."

Tags: senior · very-common · scenario · source: standard senior agent-debug probe; reported in 2026 AI engineer loops

Answer outline:

- Step 1: find the trace. Use the trace ID from the user complaint, or filter by tenant + timestamp + error tag.
- Step 2: open the span waterfall. Scan for the failure pattern:
    - Long span → which step blew the time budget (often a runaway tool call or a long-context retrieval).
    - Repeating pattern → identical spans across iterations → agent stuck.
    - Missing expected span → tool not called, retrieval skipped, guardrail bypassed.
    - Error span → check the captured exception, input args, output.
- Step 3: inspect span attributes at the failing step. The actual content (prompt, model output, tool args, retrieval results) is where the bug is.
- Step 4: diff against a known-good trace of the same intent. What's different? Often it's a tool result that looks fine but is subtly malformed, or a retrieved chunk that triggered a hallucination.
- Step 5: hypothesize and verify in a sandbox. Replay the exact prompt + tool state in a dev environment with seed=X; confirm the failure deterministically.
- Step 6: fix and add a regression test. The failing trace's inputs become a permanent eval-set example, so future regressions get caught in CI.
- The senior signal: the candidate moves from observation → hypothesis → replay → fix → regression test. Skipping any step is amateur.
- Numbers to drop: "MTTR with this workflow: 30-90 min for a typical agent bug", "regression suite grows by 5-20 examples per week from production triage"
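Step 4 (diffing against a known-good trace) is mechanical enough to sketch. The example below assumes each trace has been flattened to a list of step dicts with hypothetical keys `kind`, `tool`, `args`, and `output`; a real trace export will have its own schema.

```python
from itertools import zip_longest

COMPARED_KEYS = ("kind", "tool", "args", "output")  # assumed step schema


def first_divergence(good, bad, keys=COMPARED_KEYS):
    """Return (index, good_step, bad_step) for the first differing step, else None."""
    for index, (g, b) in enumerate(zip_longest(good, bad)):
        if g is None or b is None:
            # One run has a step the other lacks (missing or extra span).
            return index, g, b
        if any(g.get(k) != b.get(k) for k in keys):
            return index, g, b
    return None


good = [
    {"kind": "plan", "output": "look up order"},
    {"kind": "tool", "tool": "get_order", "args": {"id": "A17"},
     "output": {"status": "shipped"}},
    {"kind": "answer", "output": "Your order has shipped."},
]
bad = [
    {"kind": "plan", "output": "look up order"},
    {"kind": "tool", "tool": "get_order", "args": {"id": "A17"},
     "output": {"status": None}},
    {"kind": "answer", "output": "I couldn't help with that."},
]

divergence = first_divergence(good, bad)
```

Here the divergence is at the tool step, one step upstream of the visible failure in the final answer. Exact equality is a deliberate simplification; free-text model outputs usually need a looser comparison.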

Common follow-ups:

- "What if the trace is missing critical attributes?"
- "How do you replay a trace that depends on external state (DB, API)?"
- "When can you not reproduce a failure deterministically?"

Traps:

- Reading only the final error span. The interesting failure is usually 3-5 steps upstream.
- Skipping the regression step. Bugs come back.

Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/03_agent_observability_debugging/

Q: "How do you replay an agent run deterministically?"

Tags: senior · common · conceptual · source: standard senior reliability probe; reported in 2026 AI engineer loops

Answer outline:

- Three sources of non-determinism in an agent run: (1) the LLM sampling, (2) external tool state, (3) wall-clock time.
- For (1): pass a seed to the LLM API where one is supported, plus temperature=0 (or a fixed temperature with the seed). Seed support varies by provider: OpenAI and Google expose a seed parameter, while Anthropic's Messages API has not offered one at the time of writing, so check the current API reference. Even with a seed, output is mostly deterministic but not bit-perfect.
- For (2): record-and-replay. During the original run, log every tool's inputs and outputs in the trace. During replay, intercept tool calls and return the recorded outputs instead of calling the real tool. The agent loop runs against a recorded snapshot of the world.
- For (3): freeze time / inject the original timestamps via dependency injection. Don't let the agent see "now" as the current wall-clock; feed it the trace's recorded time.
- Implementation: many frameworks have a "trace replay" mode. LangSmith, LangGraph, custom orchestrators with structured tool-call interception. Build it if you don't have it; it's the single most valuable debug capability.
- Catches:
    - Provider model upgrades silently break replays. Pin the model version exactly.
    - Tool outputs may contain timestamps embedded in the data; treat those as opaque bytes during replay.
    - LLM seed support is best-effort across providers; expect some non-determinism even at T=0.
- The senior tell: the candidate names "tool record-and-replay" as a separate dimension from "LLM seed". Without record-replay you can't reproduce most agent failures.
- Numbers to drop: "seed + T=0 deterministic on ~95% of cases where the provider supports a seed", "tool record-and-replay closes the rest", "model version pinning is mandatory for replay"
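Points (2) and (3) above can be sketched with two small injectable objects. This is a minimal illustration under stated assumptions: `ReplayTools`, `FrozenClock`, the recorded-step shape, and the `get_order` tool are hypothetical and not the replay API of any named framework.

```python
from datetime import datetime


class ReplayTools:
    """Serves recorded tool outputs in order instead of calling real tools."""

    def __init__(self, recorded_steps):
        self._steps = list(recorded_steps)
        self._cursor = 0

    def call(self, name, args):
        if self._cursor >= len(self._steps):
            raise RuntimeError(f"replay diverged: unexpected extra call to {name}")
        step = self._steps[self._cursor]
        if step["tool"] != name or step["args"] != args:
            # The agent asked for something the original run never asked for.
            raise RuntimeError(
                f"replay diverged at tool call {self._cursor}: "
                f"recorded {step['tool']}({step['args']}), got {name}({args})"
            )
        self._cursor += 1
        return step["output"]


class FrozenClock:
    """Returns the trace's recorded time instead of the wall clock."""

    def __init__(self, recorded_iso):
        self._now = datetime.fromisoformat(recorded_iso)

    def now(self):
        return self._now


recorded = [{"tool": "get_order", "args": {"id": "A17"},
             "output": {"status": "shipped"}}]
tools = ReplayTools(recorded)
clock = FrozenClock("2026-01-15T03:00:00+00:00")

result = tools.call("get_order", {"id": "A17"})  # served from the recording
```

Raising on a mismatched call is a design choice: a replay that silently calls the real tool stops being a replay, and the divergence index is itself a useful debugging signal.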

Common follow-ups:

- "What if the tool has side effects on retry?"
- "How do you handle clock-dependent logic?"
- "Why isn't seed alone enough?"

Traps:

- Trusting seed alone. Tool state matters more than LLM determinism in agent debugging.

Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/03_agent_observability_debugging/

Common failure modes & playbooks

Q: "Your AI agent gets conflicting answers from different tools. How does it reconcile them?"

Tags: senior · common · debugging · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)

Answer outline:

- This happens often: web search says X, internal DB says Y, calculator says Z. The agent must resolve without hallucinating a third option.
- Strategies:
    - Source priority hierarchy: declare in the system prompt or orchestrator config that internal DB > authoritative APIs > web search > LLM parametric knowledge. Conflicts resolve top-down.
    - Recency-weighted: timestamps on each source; newer wins for time-sensitive data.
    - Cross-verification: when two tools disagree, call a third tool (e.g., a different verifier) and majority-vote.
    - Surface the conflict to the user: if it can't be resolved confidently, the agent says "I got conflicting information: DB says X, web says Y, please clarify."
    - Cite-and-attribute: the answer includes source + value pairs, letting the user see the conflict themselves.
- Anti-pattern: silent picking. The agent picks one source without saying so, the user trusts the answer, and the source it picked turns out to be the wrong one.
- In production: log every conflict event. Pattern-cluster them; many conflicts indicate a stale data source or a wrong source-priority config.
- Eval: build a test set with intentional conflicts. Score the agent on (a) detection rate, (b) correct resolution rate, (c) user-visibility rate.
- Numbers to drop: "conflict rate in tool-using agents: 5-15% of multi-tool tasks", "test set: 50-200 intentional-conflict examples for eval"
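A declared resolution policy can live in the orchestrator rather than in the model's head. The sketch below combines the source priority hierarchy with cite-and-attribute; the source names, the `Reading` type, and the return shape are assumptions made for illustration.

```python
from dataclasses import dataclass

# Highest priority first, mirroring the hierarchy described above.
SOURCE_PRIORITY = ["internal_db", "authoritative_api", "web_search", "llm_parametric"]


@dataclass
class Reading:
    source: str
    value: str


def resolve(readings):
    """Pick the highest-priority source and flag any disagreement explicitly."""
    ranked = sorted(readings, key=lambda r: SOURCE_PRIORITY.index(r.source))
    distinct_values = {r.value for r in ranked}
    return {
        "answer": ranked[0].value,
        "chosen_source": ranked[0].source,
        "conflict": len(distinct_values) > 1,  # log this; never pick silently
        "evidence": [(r.source, r.value) for r in ranked],
    }


decision = resolve([
    Reading("web_search", "plan costs $12/month"),
    Reading("internal_db", "plan costs $15/month"),
])
```

The `conflict` flag and `evidence` list are what make the conflict loggable and visible to the user, so the eval can score detection and user-visibility separately from correct resolution.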

Common follow-ups:

- "What if all sources are wrong?"
- "Walk me through a specific conflict you'd test."

Traps:

- "The model decides": no, the system should have a declared resolution policy.
- Silent picking. Bad UX.

Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/03_agent_observability_debugging/, learning/01_ai_engineering/01_agentic_system_design/

Q: "Your AI agent hallucinates tool capabilities and passes wrong inputs. How do you fix it?"

Tags: senior · common · debugging · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)

Answer outline:

- The model invents parameters the tool doesn't have, passes wrong types, or uses tools that don't exist. Classic agent failure mode.
- Mitigations stacked at multiple layers:
    - Strict tool schemas: JSON-schema-typed tool definitions, enforced at the orchestrator. Reject malformed calls before they reach the tool.
    - Function-calling / structured-output APIs: use the provider's structured tool-call API rather than parsing free-form text. Anthropic tool use, OpenAI function calling, Gemini function calling. These dramatically reduce schema hallucination.
    - Clear tool descriptions: tool descriptions are not docstrings; they're prompts. Be explicit about parameter types, examples, valid ranges.
    - Enums + examples in the schema: instead of free-form strings, use enums where possible. "currency: enum[USD, EUR, GBP]" beats "currency: string".
    - Retry-with-correction: when a tool call fails schema validation, return the validation error to the model with "your call was rejected because X; please correct" and let it retry. 2-3 retries max.
    - Tool-call telemetry: log every rejected call. Pattern-cluster to find common hallucinations; tighten the schema or description.
- For "tool that doesn't exist": tool retrieval (only show the model the tools relevant to the current step) reduces hallucination of phantom tools.
- Eval: a "tool-call accuracy" metric, the fraction of calls where the schema is valid on the first try. Track per-tool; tools below 95% need description/schema work.
- Numbers to drop: "first-try schema-valid rate target: 95%+", "retry-with-correction recovers 70-90% of failures", "tool retrieval cuts phantom-tool hallucination ~10×"
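Strict validation and retry-with-correction fit together in a few lines. This sketch uses a hand-rolled validator so it stays dependency-free; a production orchestrator would more likely validate against real JSON Schema. The `SCHEMA` format, `stub_model`, and the function names are hypothetical.

```python
# Hypothetical mini-schema: a Python type, or a set of allowed enum values.
SCHEMA = {"amount": float, "currency": {"USD", "EUR", "GBP"}}


def validate(args):
    """Return a list of human-readable validation errors (empty if valid)."""
    errors = [f"unknown parameter '{key}'" for key in sorted(args.keys() - SCHEMA.keys())]
    for key, rule in SCHEMA.items():
        if key not in args:
            errors.append(f"missing parameter '{key}'")
        elif isinstance(rule, set):
            if args[key] not in rule:
                errors.append(f"'{key}' must be one of {sorted(rule)}")
        elif not isinstance(args[key], rule):
            errors.append(f"'{key}' must be {rule.__name__}")
    return errors


def call_with_correction(propose, max_attempts=3):
    """Ask `propose` for tool args, feeding validation errors back on failure."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        args = propose(feedback)
        errors = validate(args)
        if not errors:
            return args, attempt
        feedback = ("your call was rejected because " + "; ".join(errors)
                    + "; please correct")
    raise ValueError(f"still invalid after {max_attempts} attempts: {errors}")


def stub_model(feedback):
    """Stand-in for an LLM: wrong on the first try, corrected after feedback."""
    if feedback is None:
        return {"amount": "12", "currency": "usd"}
    return {"amount": 12.0, "currency": "USD"}


args, attempts = call_with_correction(stub_model)
```

Logging `attempts` per tool gives the first-try schema-valid rate directly: a call counts as first-try valid only when `attempts == 1`.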

Common follow-ups:

- "How does function-calling help vs free-form?"
- "What if the same tool keeps failing for the same reason?"

Traps:

- Treating tool hallucination as the model's fault. The orchestrator design and tool schema matter more.
- No retry-with-correction. The model can usually fix its own mistakes if told why.

Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/01_agentic_system_design/, learning/01_ai_engineering/03_agent_observability_debugging/

Q: "Your AI agent has many tools, but keeps picking the wrong one. How do you improve tool selection?"

Tags: senior · common · debugging · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)

Answer outline:

- The model's tool-selection accuracy depends on (a) the descriptions, (b) how many tools are visible, (c) overlap / ambiguity between tools.
- Levers:
    - Tool retrieval: don't show 100 tools at once. Use a small retrieval step (intent classifier + tool embedding similarity) to surface the 5-10 relevant tools per turn. Cuts hallucination and improves selection accuracy.
    - Disambiguate names: rename overlapping tools (get_customer_info and lookup_customer_record confuse the model; pick one or distinguish via clear scope).
    - Better descriptions: lead with when to use, not what it does. "Use this when the user asks about order status; do NOT use for shipping inquiries" beats "queries the order DB".
    - Examples in descriptions: 1-3 short examples of valid and invalid use cases.
    - Constrain hierarchically: in multi-step tasks, the orchestrator narrows the tool set per step. The planning step sees one set; the execution step a different one.
    - Eval-driven iteration: build a test set of (user query, correct tool) pairs. Measure baseline; tweak descriptions; re-measure.
- For 50+ tools, tool retrieval is mandatory. Showing all 50 in every prompt bloats the context window, costs more, and confuses the model.
- Senior tell: the candidate names measurable tool-selection accuracy and an iteration loop.
- Numbers to drop: "tool selection accuracy: 95%+ achievable with 5-10 tools, drops sharply past 20-30", "tool retrieval cuts wrong-selection rate 2-5× at scale"

Common follow-ups:

- "Walk me through a tool retrieval implementation."
- "How do you write good tool descriptions?"

Traps:

- Showing all tools all the time. Death by context noise.
- Treating tool selection as a model problem when it's mostly a system problem.

Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/01_agentic_system_design/, learning/01_ai_engineering/03_agent_observability_debugging/

Q: "How do you handle agent failures and implement error recovery?"

Tags: mid · very-common · design · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)

Answer outline:

- Catalog the failure classes first (each needs different recovery):
    - Transient tool errors (timeout, 5xx): retry with exponential backoff, circuit-break if persistent.
    - Schema validation errors: retry-with-correction (return the error to the LLM, let it fix).
    - Tool-not-available: route to fallback (different tool with similar capability) or graceful refusal.
    - Model errors (provider 5xx, rate limit): retry with backoff; multi-provider fallback if persistent.
    - Loop / no-progress: max-steps guard kicks in, structured "I tried but didn't complete" output to user.
    - Guardrail block: log, return safe refusal, mark for review.
    - Cost budget exhausted: terminate, return partial result if any, alert.
    - Hallucination / wrong answer: doesn't auto-recover; it is caught later by user feedback or eval, and fixed at the prompt / training level.
- Recovery pattern: each tool / call has an explicit retry policy, max budget, and fallback path. Failures bubble up to the orchestrator, which decides retry vs terminate.
- Idempotency: tools that mutate state must be idempotent under retry (or use idempotency keys). See agents-design.md for the idempotency discussion.
- Partial results: when the agent can't complete fully, return what it did accomplish with a note about what's missing. Better UX than blank failure.
- Observability: every recovery event tagged in the trace. Dashboards by failure class.
- Numbers to drop: "transient retry: 3 attempts, exp backoff 1s-30s", "max-steps default: 10-20 for product agents, 50+ for research agents", "graceful-degradation rate target: <2% hard failures"
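The "categorize before you retry" rule can be made concrete: only the transient class is retried with backoff, and everything else propagates to the orchestrator. The exception names and helper functions below are hypothetical; jitter and circuit breaking are left out to keep the sketch short.

```python
import time


class TransientToolError(Exception):
    """Timeouts and 5xx responses: safe to retry (assumes the tool is idempotent)."""


class SchemaValidationError(Exception):
    """Not retryable here; belongs to the retry-with-correction path instead."""


def backoff_delays(attempts=3, base=1.0, cap=30.0):
    """Exponential delays in seconds, capped: 1, 2, 4, ... up to `cap`."""
    return [min(cap, base * 2 ** i) for i in range(attempts)]


def call_with_retry(fn, attempts=3, sleep=time.sleep):
    """Retry only transient failures; any other exception propagates at once."""
    delays = backoff_delays(attempts)
    for i in range(attempts):
        try:
            return fn()
        except TransientToolError:
            if i == attempts - 1:
                raise  # budget spent: let the orchestrator pick a fallback
            sleep(delays[i])


# Usage with a stub tool that times out twice, then succeeds.
slept = []
state = {"calls": 0}


def flaky_tool():
    state["calls"] += 1
    if state["calls"] < 3:
        raise TransientToolError("timeout")
    return "ok"


outcome = call_with_retry(flaky_tool, sleep=slept.append)  # sleep injected for the demo
```

Injecting `sleep` keeps the demo instant and makes the policy testable, which is the same dependency-injection idea used for freezing the clock during replay.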

Common follow-ups:

- "What's your fallback for a provider outage?"
- "How do you handle partial-task completion?"

Traps:

- Catch-all retry without categorization. Some errors aren't retryable.
- No partial-result UX. Hard fail without any output is bad.

Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/03_agent_observability_debugging/, learning/01_ai_engineering/04_resilient_agent_systems/

Reflection, self-correction

Q: "What is agent reflection, and how does it improve agent performance?"

Tags: senior · common · conceptual · source: Amit Shekhar AI engineering questions repo (GitHub, 2026); standard senior agent-quality probe 2026

Answer outline:

- Reflection: the agent reviews its own output (or trajectory so far) and decides whether to revise, retry, or proceed. A separate LLM call (or a designated reflection step) acts as critic.
- Patterns:
    - Output reflection: agent generates answer; reflection step asks "is this complete, correct, well-cited?"; if not, agent revises.
    - Trajectory reflection (mid-task): after K steps, the agent reviews its path so far, decides whether to continue or replan.
    - Self-consistency reflection: sample N answers, the agent picks the most consistent one (related to majority-vote self-consistency).
    - Critic / actor split: a dedicated critic model evaluates the actor's outputs. Can use a stronger or specifically-tuned critic.
- Wins: catches obvious errors before they ship, improves answer quality on hard tasks, especially coding and reasoning.
- Costs: extra LLM calls (typically 1-3× the base cost), extra latency. Reflection should be skipped on easy tasks and reserved for high-stakes outputs.
- Failure mode: over-reflection. The agent reflects, finds nits, revises, reflects again, loops. Cap reflection iterations (1-2 is typical); track changed-anything-meaningful rate.
- Eval: A/B reflection vs no-reflection on a held-out task set; measure quality lift and cost / latency overhead. Sometimes reflection doesn't help (or hurts), so verify empirically.
- Numbers to drop: "reflection cost: 1-3× base call; lift: 5-20% on hard tasks, near 0 on easy", "max reflection iterations: 1-2 with diminishing returns"
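The iteration cap is the part of reflection that is easiest to get wrong, so here is a minimal sketch of an output-reflection loop with a hard cap. The `critic` and `revise` callables stand in for LLM calls and are hypothetical.

```python
def reflect_and_revise(draft, critic, revise, max_iterations=2):
    """Run critic -> revise until the critic is satisfied or the cap is hit."""
    iterations = 0
    while iterations < max_iterations:
        problems = critic(draft)
        if not problems:
            break  # critic is satisfied; stop paying for more calls
        draft = revise(draft, problems)
        iterations += 1
    return draft, iterations


def nitpicking_critic(draft):
    """Stand-in critic that always finds something (the over-reflection case)."""
    return ["could be tighter"]


def append_revision(draft, problems):
    """Stand-in reviser; a real one would be another LLM call."""
    return draft + " (revised)"


final, iterations = reflect_and_revise("answer", nitpicking_critic, append_revision)
```

With a critic that never stops finding nits, the loop still terminates after `max_iterations` revisions. Returning `iterations` alongside the draft makes it straightforward to track how often the cap is hit.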

Common follow-ups:

- "When does reflection hurt?"
- "Critic vs actor: same model or different?"
- "How do you avoid reflection loops?"

Traps:

- Always-on reflection. Wastes money on easy tasks.
- No cap on reflection iterations. Will loop.

Related cross-cutting: Cost & latency, Architecture choices
Related module: learning/01_ai_engineering/01_agentic_system_design/, learning/01_ai_engineering/15_reasoning_routing_verification/

Q: "How do you use self-critique to debug an agent's wrong answer in production?"

Tags: senior · occasional · scenario · source: standard senior agent-debug probe; 2026 AI engineer loops

Answer outline:

- For a specific failing trace, you don't need reflection in the production agent; you need a debug-time critique to understand the failure.
- Process:
    - Pull the failing trace's full content (prompt, retrieval, tool outputs, final answer).
    - Run a separate critic LLM call: "given the prompt and the answer, identify problems. Categories: hallucination, missing context, wrong tool, incorrect reasoning, format mismatch."
    - Cluster across many failing traces: which category dominates?
- Fix by category:
    - Hallucination → improve grounding, add citation-required output, tighten the prompt.
    - Missing context → improve retrieval (recall@K, reranking).
    - Wrong tool → fix tool descriptions / retrieval.
    - Incorrect reasoning → consider stronger model, add chain-of-thought, add self-consistency.
    - Format mismatch → structured-output schema enforcement.
- For ongoing monitoring, run this critique on a sampled 1-10% of production traces; surface the trend.
- Note: the critique LLM is itself fallible. Validate on a labeled set; calibrate against human review.
- Numbers to drop: "sample 1-10% of production for offline critique", "calibrate critic against humans on 100-500 examples", "common failure category dominates ~60-80% of failures in narrow products"

Common follow-ups:

- "What if the critic itself is wrong?"
- "How is this different from reflection in the agent?"

Traps:

- Treating critique as ground truth.

Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/03_agent_observability_debugging/, learning/04_ai_product_evals/00_ai_evals_release_gates/

Reliability metrics & SLOs

Q: "What metrics would you track for agent reliability in production?"

Tags: senior · common · design · source: standard senior production-metrics probe; 2026 AI engineer loops

Answer outline:

- Three tiers:
    - System health: TTFT, end-to-end latency, error rate, queue depth. Standard.
    - Agent behavior: tool-call success rate, max-steps termination rate ("stuck rate"), retry rate, fallback-model invocation rate, average steps per task, average tools per task.
    - Quality: task-completion rate (did the agent finish what the user asked?), correctness on sampled traces (LLM-judge), refusal rate, escalation rate (handoff to human or another agent).
- Per-version + per-tool + per-tenant slicing. Aggregates hide regressions.
- SLO examples:
    - p95 end-to-end latency < 30s for product agents (often higher for research).
    - Stuck rate < 5% of runs.
    - Tool-call schema-valid rate ≥ 95%.
    - Task-completion rate ≥ 80% on the eval set.
    - Hallucinated tool rate < 1%.
- Alarms on rate trends, not point events. "Stuck rate > 8% for 2 hours" is actionable; "this one run got stuck" is not.
- Closed loop: low-quality traces feed into the eval set. Regression test catches future occurrences.
- Numbers to drop: "stuck rate SLO: <5%", "tool schema-valid rate: 95%+", "task-completion rate: 80%+ on eval set"
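The behavioral metrics and SLO checks above reduce to simple ratios over a window of run records. A minimal sketch, assuming each run is summarized as a dict with the hypothetical fields shown; a real pipeline would compute these per version, tool, and tenant from trace data.

```python
def reliability_metrics(runs):
    """Compute behavioral rates over a window of run summaries."""
    total_runs = len(runs)
    total_calls = sum(r["tool_calls"] for r in runs)
    return {
        "stuck_rate": sum(r["termination"] == "max_steps" for r in runs) / total_runs,
        "schema_valid_rate": sum(r["schema_valid_calls"] for r in runs) / total_calls,
        "completion_rate": sum(r["completed"] for r in runs) / total_runs,
    }


# ("max", x): breach when value >= x.  ("min", x): breach when value < x.
SLOS = {
    "stuck_rate": ("max", 0.05),
    "schema_valid_rate": ("min", 0.95),
    "completion_rate": ("min", 0.80),
}


def breaches(metrics):
    """Return the names of the SLOs that the given metrics violate."""
    violated = []
    for name, (kind, limit) in SLOS.items():
        value = metrics[name]
        if (kind == "max" and value >= limit) or (kind == "min" and value < limit):
            violated.append(name)
    return violated


window = [
    {"termination": "done", "tool_calls": 4, "schema_valid_calls": 4, "completed": True},
    {"termination": "done", "tool_calls": 3, "schema_valid_calls": 3, "completed": True},
    {"termination": "max_steps", "tool_calls": 10, "schema_valid_calls": 8, "completed": False},
    {"termination": "done", "tool_calls": 3, "schema_valid_calls": 3, "completed": True},
]

metrics = reliability_metrics(window)
violated = breaches(metrics)
```

A four-run window is only for illustration. To alarm on trends rather than point events, the same function would run over a rolling time window and fire only when a breach persists.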

Common follow-ups:

- "What's the difference between stuck rate and error rate?"
- "How do you measure task completion?"
- "Which metric is most important?"

Traps:

- Latency-only dashboards. The interesting failures are behavioral.

Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/03_agent_observability_debugging/, learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "Your AI agent takes too long to complete a task. How do you speed it up?"

Tags: mid · common · debugging · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)

Answer outline:

- Profile first. Open a representative trace; identify which span dominates. Common culprits:
    - Too many sequential tool calls: parallelize independent calls. Many tool calls in an agent's plan don't depend on each other, so fan them out.
    - Long context per step: prompt caching, retrieval-side compression, smaller-model planner.
    - Reflection loop overhead: cap reflection iterations or skip on easy tasks.
    - Slow individual tools: per-tool timeout, cache common tool results, optimize the tool itself.
    - Long-context prefill: see cost-latency-optimization.md for prefill tactics.
- Streaming: even if total time is long, start streaming partial output to the user.
- Architectural:
    - Split the agent into a small fast planner + a larger executor used only when needed. Most steps use the small model.
    - Pre-fetch the likely next tool call's data in parallel with the LLM's planning step.
    - Cache tool results across runs for repeat queries.
- Eval: A/B latency vs quality. Sometimes speeding up costs quality; measure both and decide explicitly.
- Numbers to drop: "parallelizing 3-5 independent tool calls: 2-4× speedup", "fast-planner + slow-executor: 30-60% latency cut on multi-step tasks", "prompt caching: 20-40% TTFT cut for stable prefix"
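Fanning out independent tool calls is the most common fix, and `asyncio.gather` is one standard way to do it in Python. In this sketch `fake_tool` is a hypothetical stand-in that only sleeps; real tools would be async HTTP or database calls.

```python
import asyncio
import time


async def fake_tool(name, seconds):
    """Stand-in for an I/O-bound tool call."""
    await asyncio.sleep(seconds)
    return f"{name}:ok"


async def run_sequential(calls):
    """One call after another: total latency is the sum of the calls."""
    return [await fake_tool(name, seconds) for name, seconds in calls]


async def run_parallel(calls):
    """Independent calls fanned out: total latency is about the slowest call."""
    return await asyncio.gather(*(fake_tool(name, seconds) for name, seconds in calls))


calls = [("crm", 0.05), ("orders", 0.05), ("billing", 0.05)]

start = time.perf_counter()
results = asyncio.run(run_parallel(calls))
parallel_seconds = time.perf_counter() - start  # near 0.05s rather than 0.15s
```

`gather` returns results in the order the awaitables were passed, so the orchestrator can still map each result back to its tool call. This only applies to calls with no data dependency between them; dependent calls still have to run in sequence.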

Common follow-ups:

- "How do you decide which tools can run in parallel?"
- "What if speeding up requires a worse model?"

Traps:

- Reaching for "faster model" without profiling. Often the bottleneck is sequential tool calls.

Related cross-cutting: Cost & latency, Production patterns
Related module: learning/01_ai_engineering/03_agent_observability_debugging/, learning/02_ai_infrastructure/05_agent_performance_economics/

Q: "Your AI agent keeps exceeding its budget per task. How do you enforce budget limits?"

Tags: senior · common · debugging · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)

Answer outline:

- Budget enforcement at three layers:
    - Per-task hard cap: track running token + tool cost in the orchestrator; abort the agent if it crosses the cap. Return a structured "I exceeded my budget" output.
    - Per-tool budget: each tool call has a max cost; expensive tools count against the agent's budget more.
    - Per-tenant / per-user daily cap: aggregate across tasks; tenant-level rate limit.
- Make the budget visible to the agent: include "you have used $X of your $Y budget" in the system prompt so the model can self-prioritize. Doesn't replace hard enforcement but reduces violations.
- For repeat-offender failure modes:
    - Profile which tools/calls cost the most. Often one tool dominates; cache it or replace with a cheaper version.
    - Add prompt-side guidance: "prefer cheap tool X over expensive tool Y when both work".
    - Limit retries / reflection iterations.
    - Switch to a smaller planner model.
- Hard enforcement is non-negotiable. The orchestrator decides; the agent doesn't get to vote.
- Numbers to drop: "per-task budget: $0.10-$10 typical depending on complexity", "tenant daily cap as backstop", "abort message: structured 'budget exceeded' with partial results"
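A per-task hard cap at the orchestrator can be sketched as a guard that is charged before each step runs. This assumes step costs can be estimated up front, which is a simplification; `BudgetGuard`, `run_task`, and the step names are hypothetical.

```python
class BudgetExceeded(Exception):
    """Raised by the guard when a step would push spend past the cap."""


class BudgetGuard:
    def __init__(self, cap_usd):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd, label):
        """Refuse the step before it runs if it would cross the cap."""
        if self.spent_usd + cost_usd > self.cap_usd:
            raise BudgetExceeded(
                f"{label} would bring spend to ${self.spent_usd + cost_usd:.2f} "
                f"(cap ${self.cap_usd:.2f})"
            )
        self.spent_usd += cost_usd

    def status_line(self):
        """Text that can be shown to the model so it can self-prioritize."""
        return f"you have used ${self.spent_usd:.2f} of your ${self.cap_usd:.2f} budget"


def run_task(steps, cap_usd):
    """Orchestrator loop: the guard decides, the agent doesn't get a vote."""
    guard = BudgetGuard(cap_usd)
    completed = []
    for label, estimated_cost in steps:
        try:
            guard.charge(estimated_cost, label)
        except BudgetExceeded as exc:
            return {"status": "budget_exceeded", "partial": completed, "detail": str(exc)}
        completed.append(label)  # a real loop would execute the step here
    return {"status": "ok", "partial": completed, "detail": guard.status_line()}


outcome = run_task([("search", 0.25), ("summarize", 0.25), ("rerank", 0.25)], cap_usd=0.60)
```

Checking before the step runs, rather than aborting mid-tool, is one way to avoid leaving inconsistent state when the cap is hit. The structured return value carries the partial results.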

Common follow-ups:

- "Walk me through enforcing the cap at the orchestrator level."
- "What happens to in-flight tool calls when budget hits?"
- "Should the agent know its budget?"

Traps:

- Soft budget limits ("nudges in the prompt"). Need hard enforcement.
- No graceful termination. Hard-aborting mid-tool can leave inconsistent state.

Related cross-cutting: Cost & latency, Production patterns
Related module: learning/01_ai_engineering/03_agent_observability_debugging/, learning/01_ai_engineering/01_agentic_system_design/

Long-horizon and stateful debugging

Q: "How would you debug an agent on a task with extremely long time horizon (research a company, write a 50-page report)?"

Tags: staff · common · scenario · source: AEM Institute 25 Advanced Agentic AI Questions 2026; reported in 2026 senior loops

Answer outline:

- Long-horizon tasks fail differently from short-horizon ones. The failure modes:
    - Forgetting: the agent loses track of its plan as context fills with intermediate outputs.
    - Drift: a subtle wrong turn mid-trajectory compounds into a wrong final output.
    - Stuck in a sub-problem: the agent loops indefinitely on one section of the task.
    - Cost overrun: 50-page reports can easily burn $50-500 of LLM cost without controls.
- Architecture for debug-ability:
    - Plan-and-execute split: a planner produces a structured outline first; sub-agents execute each section. Each sub-agent is a short-horizon task, which is easier to debug.
    - Persistent state: outline + progress checkpoint stored in a durable store. The agent can resume after crashes; debug can inspect any checkpoint.
    - Stage-level traces: trace each sub-task as its own root span; cross-link with the parent plan. Easier than one giant trace.
    - Replay at the stage level: re-run a single sub-task with adjusted inputs rather than the whole 50-page task.
- For active debugging:
    - Compare plan vs reality: did the agent follow its own plan? Often the failure is silent deviation from the outline.
    - Audit each sub-output: spot-check a few sub-tasks against the user intent.
    - Cost-trace: which sub-task burned 80% of the budget? Often one runaway sub-task dominates.
- Hard guards: per-sub-task budget, per-sub-task max-steps, total wall-clock cap, total cost cap. Trip → graceful termination with structured partial output.
- Numbers to drop: "long-horizon plan-and-execute: 5-20 sub-tasks typical for a 50-page report", "checkpoint frequency: every sub-task", "per-sub-task budget: $0.50-$5 typical, total cap $20-200"

Common follow-ups:

- "How does this differ from short-horizon debugging?"
- "What if the plan itself is wrong?"
- "How do you handle restart from a checkpoint?"

Traps:

- Treating long-horizon as one big agent. It's not; it's a tree of short-horizon tasks.

Related cross-cutting: Architecture choices, Cost & latency
Related module: learning/01_ai_engineering/03_agent_observability_debugging/, learning/01_ai_engineering/01_agentic_system_design/

Q: "Your agent worked yesterday and is broken today. How do you diagnose?"

Tags: senior · common · debugging · source: standard senior incident-debug probe; reported in 2026 AI engineer loops

Answer outline:

- Classic regression. The set of suspects:
    - Model swap: the provider silently rolled out a model update; behavior changed. Check provider release notes; if you don't pin model versions, that's the bug.
    - Prompt change: someone edited the system prompt. Check git history on prompt files.
    - Tool change: a downstream tool's API changed; the agent's calls now fail or return different data. Check tool API logs.
    - Retrieval / RAG drift: corpus updated, embedding model swapped, vector store rebuilt. Compare retrieval results before/after.
    - Guardrail change: a new guardrail kicks in and blocks output that used to pass.
    - Provider rate limit / outage: not strictly a yesterday-vs-today change, but worth ruling out.
- Process:
    - Pull a "yesterday working" trace and a "today broken" trace. Diff them step by step.
    - Run the working prompt against today's model; if it now fails, the model changed. If it works, something else (tool, retrieval, prompt) changed.
    - Bisect git history if the change isn't obvious.
- Mitigation: pin everything: model version, prompt version, tool schemas, retrieval corpus version. Without pinning, "yesterday" is a moving target.
- Add the broken case to the regression suite. Run on every commit.
- Numbers to drop: "model version pinning is non-negotiable", "regression test on every commit catches 80%+ of these", "MTTR: minutes-to-hours when artifacts are versioned, hours-to-days otherwise"
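When every artifact is pinned, "what changed since yesterday" becomes a dictionary diff. The sketch below assumes each deploy records a manifest of pinned versions; the manifest keys and version strings are made up for illustration.

```python
def changed_artifacts(yesterday, today):
    """Return {artifact: (old, new)} for every pinned artifact that differs."""
    keys = sorted(yesterday.keys() | today.keys())
    return {
        key: (yesterday.get(key), today.get(key))
        for key in keys
        if yesterday.get(key) != today.get(key)
    }


# Hypothetical manifests recorded at deploy time.
yesterday = {
    "model": "example-model-2026-01-10",
    "prompt": "support_v12",
    "tool_schema": "orders_v3",
    "corpus": "kb_2026_01_14",
}
today = {**yesterday, "model": "example-model-2026-01-15", "corpus": "kb_2026_01_15"}

suspects = changed_artifacts(yesterday, today)
```

Two suspects remain here (model and corpus), so the next step is the bisect from the process above: hold one at yesterday's version, replay the broken case, and see which change reproduces the failure.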

Common follow-ups:

- "What if the provider doesn't let you pin a specific version?"
- "How do you handle silent model upgrades?"

Traps:

- Trying to fix forward without identifying what changed.
- No version pinning in the first place.

Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/03_agent_observability_debugging/, learning/02_ai_infrastructure/04_ml_platform_operations/

Q: "Your agent returns inconsistent results in production but worked correctly in staging. How do you debug it?"

Tags: senior · common · debugging · source: 2026 agent-deployment loop (FDE-style)

Answer outline:

- This is an environment-parity problem, not a logic bug; name that first. "Works in staging, flaky in prod" means something differs between the environments. Enumerate the differences before touching the agent:
    - Data: staging used clean/seeded data; prod has long-tail real inputs, a larger corpus, messier documents.
    - Scale/concurrency: prod runs parallel requests → race conditions on shared state, prompt-cache contention, rate-limit throttling staging never hit.
    - Config drift: different model version, temperature, prompt version, tool endpoints, or feature flags between envs.
    - Non-determinism: temperature > 0 with no seed → inherent variance a small staging sample never surfaced.
    - External deps: prod tools/APIs are the real (sometimes degraded) ones; staging used mocks.
- Method: capture full traces in prod for the same input with different outputs, and diff them. Replay a prod trace in staging with the exact prod inputs to reproduce. Pin model + temperature and re-run to separate true non-determinism from environment differences.
- Common root cause: "inconsistent" often means temperature variance the demo never showed, or shared mutable state / a cache-key collision under concurrency.
- Fix: pin and version everything across envs; temperature 0 (or a seed) on deterministic paths; make staging mirror prod's data distribution and concurrency; add the failing prod case to the regression suite.
- Numbers to drop: "a staging sample of 20 can hide 5% variance; prod at 1M req/day surfaces it 50k times", "pin model + prompt + temperature across envs", "replay prod traces in staging to reproduce"
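The sampling arithmetic behind the first number is worth having at your fingertips. Assuming failures are independent with a fixed per-request rate (a simplification), the chance that a sample shows zero failures is (1 - rate)^n; the function names below are illustrative.

```python
def p_sample_misses(failure_rate, sample_size):
    """Probability that `sample_size` independent runs show zero failures."""
    return (1 - failure_rate) ** sample_size


def expected_daily_failures(failure_rate, requests_per_day):
    """Expected number of failing requests per day at the same rate."""
    return failure_rate * requests_per_day


miss_probability = p_sample_misses(0.05, 20)        # about 0.36
daily_failures = expected_daily_failures(0.05, 1_000_000)  # about 50,000
```

So a 20-run staging check misses a 5% failure mode roughly one time in three, while production at 1M requests per day hits it on the order of 50,000 times. Under the same assumption, getting the miss probability under 5% takes about 59 runs, since 0.95^59 is just below 0.05.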

Common follow-ups:

- "How would you reproduce a non-deterministic prod failure?" (capture + replay the exact trace; pin temperature/seed)
- "What's different about prod concurrency that staging missed?"

Traps:

- Assuming the agent logic is wrong when the environments differ.
- No trace capture in prod → nothing to diff.
- Forgetting temperature / non-determinism as the simplest explanation.

Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/03_agent_observability_debugging/, learning/01_ai_engineering/05_ai_incident_operations/

Reproducibility & regression suites¶

Q: "How do you build a regression suite for agents?"¶

Tags: senior · common · design · source: standard senior eval / regression probe; 2026 AI engineer loops

Answer outline: - Agents are non-deterministic, so "passing tests" means something different than in classical software: a pass is a score or pass rate clearing a threshold, not an exact match. - Suite composition: - Trace replays: failing production traces preserved as eval inputs, each paired with the outcome the agent should have produced. On replay the agent should produce an answer comparable to that reference. - Synthetic scenarios: hand-crafted edge cases per intent (long context, conflicting tools, ambiguous query, malformed tool output, max-steps stress). - Tool-mocked tests: deterministic tool stubs so the agent's behavior on a known input is reproducible. - Quality metrics: each test has a pass criterion, e.g. LLM judge score above threshold, specific tool called, no max-steps termination, correct output format. - Run cadence: - Fast subset on every PR (5-10 min): smoke tests, basic intents. - Full suite nightly or on release (hours): all scenarios, all eval metrics. - Stochastic suite weekly: sample N traces from production, replay against current build, compare. - Promote production failures into the suite. The suite grows ~5-20 examples / week in a healthy team. - Eval-on-eval: periodically audit the eval set itself. Are the pass criteria still right? Has the world changed? - Numbers to drop: "fast suite: 50-200 tests, <10 min", "full suite: 500-2000 tests, hours", "promote 5-20 production failures/week"
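A promoted production failure with explicit pass criteria and mocked tools might look like the sketch below. Everything here is an assumption for illustration: `RunResult`, `RegressionCase`, `stub_agent`, and the `inventory.check` tool name are hypothetical, and `stub_agent` stands in for whatever actually drives your agent.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RunResult:
    """What one agent run produced (hypothetical shape)."""
    output: str
    tools_called: list[str]
    stopped_by: str  # "final_answer" or "max_steps"


@dataclass
class RegressionCase:
    """A promoted production failure with explicit pass criteria."""
    name: str
    user_input: str
    must_call: str
    checks: list[Callable[[RunResult], bool]] = field(default_factory=list)

    def passed(self, result: RunResult) -> bool:
        return (
            self.must_call in result.tools_called
            and result.stopped_by != "max_steps"
            and all(check(result) for check in self.checks)
        )


def stub_agent(user_input: str, tools: dict[str, Callable[[str], str]]) -> RunResult:
    """Stand-in for a real agent: calls one mocked tool, then answers."""
    observation = tools["inventory.check"](user_input)
    return RunResult(
        output=f"Stock status: {observation}",
        tools_called=["inventory.check"],
        stopped_by="final_answer",
    )


# Deterministic tool stub so the run is reproducible.
mock_tools = {"inventory.check": lambda _query: "3 units"}

case = RegressionCase(
    name="prod-incident-inventory-loop",
    user_input="Is the blue widget in stock?",
    must_call="inventory.check",
    checks=[lambda r: "3 units" in r.output],
)
result = stub_agent(case.user_input, mock_tools)
print(case.name, "PASS" if case.passed(result) else "FAIL")
```

Keeping the criteria on the case (required tool, no max-steps stop, output checks) rather than in the runner makes it cheap to add one new case per promoted failure.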

Common follow-ups: - "How do you handle non-determinism in the tests?" - "What's the difference between a unit test and an eval test for an agent?"

Traps: - Treating agent tests like deterministic unit tests. They're statistical. - Static suite. The world drifts; the suite must too.

Related cross-cutting: Production patterns Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/, learning/01_ai_engineering/03_agent_observability_debugging/

Q: "How do you handle non-determinism when writing agent tests?"¶

Tags: senior · common · design · source: standard senior testing probe; 2026 AI engineer loops

Answer outline: - Three approaches, often combined: - Sample-and-grade: run the test N times, score each run, and fail if the quality score < threshold or the pass rate < target. Tolerates stochasticity, and the measured pass rate gets more trustworthy as N grows. - Property-based: assert structural properties (output is valid JSON, response cites a source, no nonexistent tool called) rather than exact text. Properties are robust to stochastic variation. - Deterministic mode: pin seed + temperature=0 + mocked tools. Tests run close to reproducibly, but you only check one sample, and some hosted model APIs document seeds as best-effort, so pin the model version too and expect occasional drift. Mostly for smoke testing the orchestration logic, not the agent's quality. - Avoid exact-text-match assertions. They're brittle and produce false failures on equivalent rewordings. - For multi-sample, use 3-10 samples per test as a default, and more for high-stakes tests where you need higher confidence. - LLM-judge for grading: structured rubric, judge calibrated against human raters on a held-out set. - Numbers to drop: "3-10 samples per stochastic test", "pass criterion: score ≥ X and pass rate ≥ Y", "LLM judge calibration: ≥85% agreement with human on held-out"
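Sample-and-grade and property-based checks combine naturally: run the agent N times and grade each run on a structural property instead of exact text. The sketch below assumes a hypothetical output contract (a JSON object with a non-empty `sources` list) and uses a seeded fake agent, `make_flaky_agent`, in place of a real model call.

```python
import json
import random
from typing import Callable


def is_valid_answer(raw: str) -> bool:
    """Property check: output is a JSON object that cites at least one source."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and bool(data.get("sources"))


def sample_and_grade(
    run: Callable[[], str], n: int, min_pass_rate: float
) -> tuple[bool, float]:
    """Run the agent n times; pass if enough runs satisfy the property."""
    passes = sum(is_valid_answer(run()) for _ in range(n))
    rate = passes / n
    return rate >= min_pass_rate, rate


def make_flaky_agent(seed: int, failure_rate: float) -> Callable[[], str]:
    """Fake agent (assumption): returns a malformed reply at the given rate."""
    rng = random.Random(seed)

    def run() -> str:
        if rng.random() < failure_rate:
            return "Sorry, something went wrong"
        return json.dumps({"answer": "42", "sources": ["doc-7"]})

    return run


ok, rate = sample_and_grade(make_flaky_agent(seed=7, failure_rate=0.2), n=10, min_pass_rate=0.8)
print(f"passed={ok} pass_rate={rate:.0%}")
```

With only 10 samples the measured rate is noisy, which is one reason to raise N for high-stakes tests rather than tighten the threshold.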

Common follow-ups: - "How do you decide N samples?" - "When does property-based win over sample-and-grade?"

Traps: - Exact-match assertions on LLM output. They flake constantly.

Related cross-cutting: Production patterns Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/, learning/01_ai_engineering/03_agent_observability_debugging/

Partial-failure & graceful degradation¶

Q: "Your agent's downstream tool is down. How does the agent gracefully degrade?"¶

Tags: senior · common · scenario · source: standard senior reliability probe; 2026 AI engineer loops

Answer outline: - Categorize first: hard down (5xx, timeout) vs degraded (slow, partial responses). - Mitigations: - Circuit breaker: track failure rate; trip → return cached / fallback for a cooldown window. Don't keep hammering a down service. - Fallback tool: if tool A is down, try tool B with similar capability. Configured per-tool in the agent's tool graph. - Cached value: if the tool's data doesn't change often (product info, FAQ), return the last cached value with a "this may be stale" note. - Skip-with-degraded-answer: agent proceeds without the tool's output, notes the gap in its answer ("I couldn't check live inventory; here's what I know"). - Hard-fail-with-context: if the tool's data is essential, return "I can't help right now because X is down" — better than a fabricated answer. - The orchestrator handles this, not the LLM. The model shouldn't be deciding "tool A is down, try tool B" — that's an infra decision. - Telemetry: tool-side health checks feed circuit-breaker state. The agent calls the tool through a wrapper that knows the state. - For the user: never expose stack traces. "We're having trouble checking inventory right now" is the right message. - Numbers to drop: "circuit-breaker: trip after 5 consecutive failures, half-open after 30s", "fallback hit rate target: <5% of total traffic", "cache TTL for tool fallback: per-tool, often 5-60 min"
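The circuit-breaker-plus-fallback wrapper described above can be sketched in a few lines. This is a simplified, single-threaded illustration under stated assumptions: the thresholds mirror the "5 consecutive failures, half-open after 30s" figures, the class and function names are hypothetical, and a production breaker would also limit how many probes run while half-open.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Trips after consecutive failures; allows a probe after a cooldown."""

    def __init__(self, max_failures: int = 5, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock  # injectable so tests need not sleep
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        """Closed: allow. Open: block until the cooldown elapses (half-open)."""
        if self.opened_at is None:
            return True
        return self.clock() - self.opened_at >= self.cooldown_s

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
            self.opened_at = None  # a successful call closes the breaker
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip, or re-trip after a failed probe


def call_with_fallback(breaker: CircuitBreaker,
                       primary: Callable[[], str],
                       fallback: Callable[[], str]) -> str:
    """Orchestrator-side wrapper: the LLM never sees the retry/fallback logic."""
    if not breaker.allow():
        return fallback()
    try:
        result = primary()
    except Exception:
        breaker.record(ok=False)
        return fallback()
    breaker.record(ok=True)
    return result


now = [0.0]  # fake clock for the demo
breaker = CircuitBreaker(max_failures=5, cooldown_s=30.0, clock=lambda: now[0])


def down() -> str:
    raise TimeoutError("inventory service timed out")


def stale() -> str:
    return "cached inventory (may be stale)"


for _ in range(5):
    call_with_fallback(breaker, down, stale)
print(breaker.allow())  # False: breaker is open, stop hammering the tool
now[0] = 31.0
print(breaker.allow())  # True: cooldown elapsed, one probe may go through
```

The wrapper lives in the orchestrator, so the model only ever sees either the tool's real output or the fallback value with its staleness note.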

Common follow-ups: - "How do you decide which tools have fallbacks?" - "What if both primary and fallback are down?" - "How do you communicate the degradation to the user?"

Traps: - Letting the LLM decide retry / fallback. The orchestrator's job. - No circuit breaker. Retry storms compound the outage.

Related cross-cutting: Production patterns Related module: learning/01_ai_engineering/03_agent_observability_debugging/, learning/01_ai_engineering/04_resilient_agent_systems/

Q: "How do you handle an agent that did half the work and crashed?"¶

Tags: senior · common · scenario · source: standard senior reliability probe; 2026 AI engineer loops

Answer outline: - Two questions: (1) can the agent resume from where it crashed? (2) what about side effects from the half-completed work? - For (1): durable state. The agent's checkpoint (plan, current step, intermediate outputs) is written to persistent storage at each major step boundary. On restart, the orchestrator loads the checkpoint and resumes. - Frameworks: Temporal, LangGraph (with checkpoint backends), Pregel-style state machines, custom + Redis/Postgres. Pick a checkpoint frequency that balances overhead vs recovery granularity. - For (2): idempotency. Every side-effect tool call should be safe to retry. Use idempotency keys (UUIDs generated once, reused on retry); for non-idempotent operations, check before mutating (read-then-write with optimistic concurrency). - Completed-step tracking and compensation: for non-idempotent side effects that succeeded before the crash (sent an email, charged a card), the resumption logic needs to know and not re-do them. The orchestrator tracks "step N succeeded, step N+1 was in progress". If the task is abandoned instead of resumed, a compensating action undoes what already went through (e.g. refund the charge). - Without durable state: every crash means starting over. Acceptable for short-horizon tasks (<30s); painful for long-horizon tasks, where losing 80% of the progress is unacceptable. - Numbers to drop: "checkpoint per step boundary", "idempotency keys for every mutating tool call", "Temporal-style workflow engines: checkpoint built-in"
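Checkpointing and idempotency keys work together: the checkpoint says which steps finished, and the key makes a step safe to repeat when the crash landed between the side effect and the checkpoint write. The sketch below is a toy under stated assumptions: a plain dict stands in for Redis/Postgres, and `FakePayments`, `run_workflow`, and the step names are hypothetical.

```python
import uuid
from typing import Callable, Optional


class FakePayments:
    """Stand-in for a side-effecting tool that dedupes on idempotency key."""

    def __init__(self) -> None:
        self.charges: dict[str, int] = {}

    def charge(self, key: str, amount: int) -> None:
        self.charges.setdefault(key, amount)  # retry with the same key is a no-op


def run_workflow(checkpoint: dict,
                 steps: list[tuple[str, Callable[[str], None]]],
                 crash_after: Optional[str] = None) -> None:
    """Run steps in order, recording progress at each step boundary."""
    done = checkpoint.setdefault("done", [])
    keys = checkpoint.setdefault("keys", {})
    for name, action in steps:
        if name in done:
            continue  # finished before the crash: do not redo
        # Key is generated once and stored before the call, then reused on retry.
        key = keys.setdefault(name, str(uuid.uuid4()))
        action(key)
        if crash_after == name:
            raise RuntimeError(f"crashed after {name!r} ran, before checkpointing")
        done.append(name)


payments = FakePayments()
emails: list[str] = []
steps = [
    ("charge_card", lambda key: payments.charge(key, 1999)),
    ("send_receipt", lambda key: emails.append(key)),
]

checkpoint: dict = {}  # in a real system this lives in durable storage
try:
    run_workflow(checkpoint, steps, crash_after="charge_card")
except RuntimeError:
    pass  # the process died here

run_workflow(checkpoint, steps)  # restart: load checkpoint and resume
print(len(payments.charges), len(emails), checkpoint["done"])
# prints: 1 1 ['charge_card', 'send_receipt']
```

The card is charged once even though `charge_card` ran twice, because the second attempt reused the stored key and the tool treated it as a duplicate.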

Common follow-ups: - "What if the crash happened mid-tool-call?" - "How do you handle restart of a non-idempotent tool?" - "What's the storage cost of frequent checkpointing?"

Traps: - No durable state. Crashes lose all progress. - Tools assumed idempotent without verification. Double-charges, duplicate emails, etc.

Related cross-cutting: Production patterns Related module: learning/01_ai_engineering/03_agent_observability_debugging/, learning/01_ai_engineering/04_resilient_agent_systems/

Incident response specifics¶

Q: "Your agent platform's stuck-rate jumped from 3% to 15% in the last hour. Walk me through the response."¶

Tags: senior · very-common · scenario · source: standard senior incident-response probe; 2026 AI engineer loops

Answer outline: - Step 1: confirm. Pull the dashboard; is it real? A quick filter by tenant, model, and version often shows the spike is localized. - Step 2: localize. Is it specific to (a) a tenant, (b) a model version, (c) a prompt version, (d) a tool, (e) a region? Filtered dashboards make this fast. - Step 3: pull representative stuck traces. What's the common pattern? Same tool failing? Same retrieval coming back empty? Loop on the same plan? - Step 4: stop the bleeding. - If a deploy correlates → roll back. Should be <2 min. - If a tool is failing → trip the circuit breaker, fall back to alternatives. - If a tenant is anomalous → rate-limit that tenant. - If model behavior changed → check provider status; consider routing to a backup provider. - Step 5: fix forward. Once stable, investigate root cause. Add a regression test. - Communication: internal status channel, plus the customer-facing status page if the impact is user-visible. - Postmortem: blameless, focused on (a) detection time, (b) what would have caught it sooner, (c) structural fix vs band-aid. - Numbers to drop: "MTTR target for stuck-rate spike: <30 min", "detection lag target: <5 min from incident start", "rollback as default for any deploy-correlated incident"
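The localize step amounts to slicing the stuck rate by each candidate dimension and looking for the slice that stands out. A minimal sketch, assuming run records are dicts with a boolean `stuck` field plus dimension fields such as `tenant` and `prompt_version` (all hypothetical names and made-up data):

```python
from collections import defaultdict


def stuck_rate_by(runs: list[dict], dimension: str) -> dict[str, float]:
    """Stuck rate per distinct value of one dimension."""
    totals: dict[str, int] = defaultdict(int)
    stuck: dict[str, int] = defaultdict(int)
    for run in runs:
        value = run[dimension]
        totals[value] += 1
        stuck[value] += int(run["stuck"])
    return {value: stuck[value] / totals[value] for value in totals}


def most_skewed(runs: list[dict], dimensions: list[str]) -> tuple[str, str, float]:
    """Return (dimension, value, rate) for the slice with the highest stuck rate."""
    best = ("", "", -1.0)
    for dim in dimensions:
        for value, rate in stuck_rate_by(runs, dim).items():
            if rate > best[2]:
                best = (dim, value, rate)
    return best


runs = [
    {"tenant": "a", "prompt_version": "v1", "stuck": False},
    {"tenant": "a", "prompt_version": "v2", "stuck": True},
    {"tenant": "b", "prompt_version": "v1", "stuck": False},
    {"tenant": "b", "prompt_version": "v2", "stuck": True},
    {"tenant": "a", "prompt_version": "v1", "stuck": False},
    {"tenant": "b", "prompt_version": "v1", "stuck": False},
]

print(most_skewed(runs, ["tenant", "prompt_version"]))
# prints: ('prompt_version', 'v2', 1.0)
```

Both tenants sit at the same rate while one prompt version accounts for every stuck run, which is the pattern that justifies a rollback. On real traffic, ignore slices with only a handful of runs, since tiny samples produce extreme rates by chance.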

Common follow-ups: - "What if rollback isn't possible?" - "How do you avoid rollback ping-pong?" - "Walk me through a postmortem from one of your incidents."

Traps: - Trying to fix forward when rollback is faster. - No communication. Stakeholders find out from customer complaints.

Related cross-cutting: Production patterns Related module: learning/01_ai_engineering/03_agent_observability_debugging/, learning/01_ai_engineering/05_ai_incident_operations/

Q: "What does a blameless postmortem for an agent incident look like?"¶

Tags: senior · occasional · conceptual · source: standard senior postmortem probe; 2026 AI engineer loops

Answer outline: - Structure (Google SRE-style adapted for agent specifics): - Summary: 1-paragraph what happened, impact, duration. - Timeline: minute-by-minute log — detection, escalation, action, resolution. - Root cause analysis: what specifically failed, and what allowed it to fail (the system reason — missing test, missing alert, missing pin). - Detection: how long from problem to alert? What signal caught it? What signal should have caught it sooner? - Action items: assigned, dated, prioritized. The action items are the postmortem's deliverable. - Agent-specific sections to include: - Trace examples: 2-3 representative failing traces for the record. - Eval-set additions: which production failures got promoted to the eval suite. - Pin / version impact: which artifacts (model, prompt, tool schema) were involved. - Blameless framing: focus on the system, not the individual. "The deploy pipeline allowed a prompt change to ship without an eval gate" beats "X pushed a bad prompt". - Distribution: postmortem published org-wide, learnings shared. - Numbers to drop: "postmortem due: within 1 week of incident", "action items completion: tracked in the engineering tracker", "postmortem reading: required for the team"

Common follow-ups: - "How do you decide when a postmortem is required?" - "What's the right level of detail?"

Traps: - Blaming individuals. Kills the culture, doesn't prevent recurrence. - No action items. Postmortem becomes ritual without learning.

Related cross-cutting: Production patterns Related module: learning/01_ai_engineering/05_ai_incident_operations/