Incident Response — Interview Questions

The "something is on fire — what do you do, and what did you learn" round. Distinct from mlops-deployment.md (rollout patterns: canary, shadow, blue-green) and agents-debugging-production.md (trace-driven debugging of a single agent). This file is the operational interview: SEV triage, on-call runbook discipline, partial-failure playbooks, postmortems, blameless culture, and the specific failure modes that AI systems add on top of normal infra.

The senior tell is treating an AI incident like any other production incident — same triage discipline, same postmortem rigour — and knowing the AI-specific failure modes that normal SRE playbooks miss (hallucination spike, eval regression, prompt-injection campaign, model provider outage, KV-cache pressure).

Triage & severity

Q: "Walk me through how you triage an incident in an LLM-powered product."

Tags: senior · very-common · scenario · source: standard senior on-call probe; 2026 AI engineer loops

Answer outline:
- Step 1: confirm and scope. Is it real? Per-tenant or global? Per-route or whole product? Per-model variant or all? Pull dashboards before doing anything else.
- Step 2: set severity. SEV-1: customer-facing outage or data exposure. SEV-2: major degradation (latency, quality) for a meaningful share of traffic. SEV-3: small slice affected, workaround exists. SEV-4: cosmetic.
- Step 3: page the right people. SEV-1/2 → page on-call, declare an incident, open a channel, assign Incident Commander + Comms + Investigator roles.
- Step 4: stabilize first, root-cause second. Mitigations: roll back the most recent deploy, shed lowest-priority traffic, fail over to a fallback model, switch provider, raise concurrency limits, disable a feature flag.
- Step 5: communicate. Status page update within 10-15 min of declaration. Internal updates every 30 min. External updates per the SLA.
- Step 6: resolve. Verify mitigation holds. Downgrade severity. Schedule postmortem within 5 business days.
- AI-specific signals to check early: hallucination rate, eval pass rate, refusal rate, provider error rate, retrieval recall, tool-call success rate, p95/p99 TTFT and end-to-end latency.
- Numbers to drop: "SEV-1 page-to-acknowledge target: 5 min", "status page update: 10-15 min", "stabilize before root-cause"

Common follow-ups:
- "How do you decide between rollback and forward-fix?"
- "What's a SEV-2 example you've seen?"

Traps:
- Diagnosing before mitigating. Customers are bleeding.
- Treating an AI incident as only a normal web incident and missing the model- and eval-specific signals.

Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/04_resilient_agent_systems/, learning/01_ai_engineering/03_agent_observability_debugging/

Q: "How do you decide severity for an AI quality regression?"

Tags: senior · common · scenario · source: 2026 AI on-call loops

Answer outline:
- Quality regressions are tricky because there's no hard error. The product still "works", just worse. Two axes drive severity:
  - Magnitude: how big is the drop on the eval suite? 1% drift vs 15% collapse changes the response.
  - Surface: how visible is it to users? Hallucinations on financial answers ≠ a slight tone shift in summaries.
- Default rubric:
  - >10% drop on a tier-1 eval metric (faithfulness, factuality, refusal correctness) → SEV-1.
  - 3-10% drop on tier-1 OR >10% on tier-2 (tone, format, length) → SEV-2.
  - <3% drop, no user complaints → SEV-3, track and root-cause without paging.
- Cross-check with user-facing signals: support tickets, thumbs-down rate, escalation rate, abandonment rate. A small eval drop with a big user-signal change → upgrade severity.
- Always log the eval delta in the incident channel. It gives everyone the same baseline.
- Numbers to drop: ">10% drop on tier-1 eval = SEV-1", "thumbs-down spike >2× baseline = upgrade severity", "user-complaint correlation with eval delta is the truth"

Common follow-ups:
- "What's a tier-1 vs tier-2 eval?"
- "What if eval looks fine but users are complaining?"

Traps:
- Ignoring quality because the system is "up". Users see quality, not uptime.

Related cross-cutting: Evaluation & quality, Production patterns
Related module: learning/01_ai_engineering/22_evals_production/
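
The default rubric in the answer outline above can be written down as a small function, which is handy for wiring into an alert router. This is a minimal sketch, not a standard: `EvalDelta` and `severity_for_quality_regression` are hypothetical names, the thresholds are the ones quoted in the rubric, and two behaviours are assumptions the rubric does not spell out (a tier-2 drop of 10% or less maps to SEV-3, and a thumbs-down spike upgrades by exactly one level).

```python
from dataclasses import dataclass


@dataclass
class EvalDelta:
    tier: int        # 1 = faithfulness/factuality/refusal correctness, 2 = tone/format/length
    drop_pct: float  # drop versus baseline, in percent (positive means worse)


def severity_for_quality_regression(deltas, thumbs_down_ratio=1.0):
    """Return a SEV level (1-4) for a list of eval deltas.

    thumbs_down_ratio is the current thumbs-down rate divided by its baseline.
    """
    sev = 4  # assumption: no measurable eval drop starts at the lowest severity
    for d in deltas:
        if d.drop_pct <= 0:
            continue  # metric did not regress
        if d.tier == 1 and d.drop_pct > 10:
            sev = min(sev, 1)
        elif (d.tier == 1 and d.drop_pct >= 3) or (d.tier == 2 and d.drop_pct > 10):
            sev = min(sev, 2)
        else:
            sev = min(sev, 3)  # small drop: track without paging
    # A user-signal spike above 2x baseline upgrades severity by one level (assumption).
    if thumbs_down_ratio > 2.0 and sev > 1:
        sev -= 1
    return sev


if __name__ == "__main__":
    print(severity_for_quality_regression([EvalDelta(tier=1, drop_pct=5.0)]))
```

Keeping the user-signal upgrade separate from the eval thresholds mirrors the cross-check step: the eval delta sets the floor and the thumbs-down ratio can only make it more severe.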

Playbooks

Q: "Your model provider (OpenAI / Anthropic / Google) has a partial outage. What do you do?"

Tags: senior · very-common · scenario · source: standard senior reliability probe; 2026 AI engineer loops

Answer outline:
- Step 1: confirm via the provider's status page and your own error rate. Provider outages are often regional or model-specific.
- Step 2: fail over. Three common patterns:
  - Cross-provider fallback: route to a second provider with comparable capability (Claude → GPT-4o-class fallback, or vice versa). Requires pre-built abstraction layer; quality may differ. Eval suite tells you the gap.
  - Same-provider, different region/model: try a different region endpoint; downgrade to a smaller model of the same family.
  - Self-hosted fallback: route to your own vLLM cluster for a degraded but functional experience.
- Step 3: shed load if fallback capacity is limited. Disable lowest-priority features (suggestions, autocompletes); preserve core flows (chat, search).
- Step 4: communicate. Status page: "degraded mode, using fallback model". Don't pretend nothing's wrong.
- Step 5: restore. When provider recovers, drain fallback gradually; watch eval and error metrics. A flip back is also a deploy event.
- Pre-incident work matters more than incident response: provider abstraction, cross-provider eval pass, fallback capacity provisioned and warm, kill-switch tested quarterly.
- Numbers to drop: "fallback engaged in <2 min via feature flag", "cross-provider quality delta: typically 5-15% on faithfulness; track per-task", "fallback eval pass rate: validated quarterly via game day"

Common follow-ups:
- "How do you keep the fallback path warm?"
- "What if both providers are down?"

Traps:
- No abstraction layer → tightly coupled to one provider → silent outage during partial degradation.
- Untested fallback. First execution under fire fails.

Related cross-cutting: Production patterns, Architecture choices
Related module: learning/01_ai_engineering/04_resilient_agent_systems/, learning/02_ai_infrastructure/04_ml_platform_operations/
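
The "pre-built abstraction layer" in Step 2 can be as small as an ordered list of callables. The sketch below is illustrative only: `complete_with_fallback` and `AllProvidersFailed` are hypothetical names, and each provider is assumed to be wrapped in a plain function that takes a prompt and either returns text or raises. No real SDK call is shown.

```python
from typing import Callable, Sequence, Tuple


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain has failed."""


def complete_with_fallback(
    prompt: str,
    providers: Sequence[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try providers in priority order; return (provider_name, completion)."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # in practice, catch the SDK's specific error types
            errors.append(f"{name}: {exc!r}")
    raise AllProvidersFailed("; ".join(errors))


if __name__ == "__main__":
    def primary(prompt: str) -> str:
        raise TimeoutError("simulated provider outage")

    def secondary(prompt: str) -> str:
        return f"[fallback] answer to: {prompt}"

    used, text = complete_with_fallback("ping", [("primary", primary), ("secondary", secondary)])
    print(used, text)
```

Returning the provider name alongside the text lets the caller tag traces and metrics with which path served the request, which is what makes the gradual drain in Step 5 observable.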

Q: "Hallucination rate spiked overnight. Walk me through diagnosis."

Tags: senior · very-common · debugging · source: standard senior debugging probe; 2026 AI engineer loops

Answer outline:
- Step 1: confirm the signal. Which eval? Online judge or offline run? Is it sampling noise or a real shift?
- Step 2: change inventory in the last 24h:
  - Model version change (provider silently bumped a checkpoint).
  - Prompt change (template edit, system message tweak).
  - Retrieval change (index re-built, embedding model swapped, chunking tweaked).
  - Data change (new ingestion source, schema migration).
  - Infra change (router rebalanced, region failover, cache flushed).
- Step 3: bisect. Roll back the most likely change first; observe. If hallucination rate recovers, you have the culprit. If not, roll back the next most likely change.
- Step 4: content analysis. Sample 20-50 hallucinating traces. Pattern? All same topic? All long-context? All retrieval-empty? The pattern usually points at the cause.
- Step 5: quick fixes while root-causing: lower temperature, lower top-p, tighten system prompt, increase retrieval recall, disable a recently added tool.
- AI-specific reminder: provider model updates are silent in many SDKs. Pin model versions in production (gpt-4o-2024-08-06, not gpt-4o). If you're on a floating tag, this is your first suspect.
- Numbers to drop: "trace sample: 20-50 hallucinations is enough to pattern-match", "rollback first, root-cause second", "pinned model versions are non-negotiable in production"

Common follow-ups:
- "What if no recent change explains it?"
- "How do you measure hallucination rate online?"

Traps:
- Tweaking the prompt before checking if the provider changed the model.

Related cross-cutting: Evaluation & quality
Related module: learning/01_ai_engineering/22_evals_production/, learning/01_ai_engineering/03_agent_observability_debugging/
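
Step 1 asks whether the spike is sampling noise or a real shift. One common way to answer that is a two-proportion z-test on the judge's flagged counts before and after. The sketch below assumes independent samples and a hypothetical function name, `hallucination_shift_z`; the 1.96 cutoff is the usual two-sided 95% threshold, not something the outline prescribes.

```python
import math


def hallucination_shift_z(base_flagged, base_total, cur_flagged, cur_total):
    """z-score for the change in hallucination rate between two samples.

    Positive values mean the current rate is higher than the baseline rate.
    """
    p_base = base_flagged / base_total
    p_cur = cur_flagged / cur_total
    # Pooled rate under the null hypothesis that nothing changed.
    pooled = (base_flagged + cur_flagged) / (base_total + cur_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_total + 1 / cur_total))
    if se == 0:
        return 0.0  # both samples are all-flagged or all-clean
    return (p_cur - p_base) / se


if __name__ == "__main__":
    # Baseline: 40 flagged of 2000 judged responses; overnight: 90 of 2000.
    z = hallucination_shift_z(40, 2000, 90, 2000)
    print("real shift" if abs(z) > 1.96 else "within noise")
```

If the judge only samples a few hundred responses per night, the same rate change may not clear the threshold, which is an argument for widening the sample before starting a rollback.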

Q: "Cost spiked 3× overnight. What happened?"

Tags: senior · common · debugging · source: 2026 AI on-call cost-incident loops

Answer outline:
- The usual suspects, in order of frequency:
  - Runaway agent loops: an agent that doesn't terminate properly, retries forever, burns tokens. Check max-step / max-token caps; check loop-detection logs. Often triggered by a brittle tool returning malformed responses.
  - Prompt bloat: someone added a 5k-token system prompt or pulled extra retrieval context. Diff the prompt template.
  - Retrieval over-fetch: top-K bumped from 5 to 50, or chunk size doubled. Each request now drags up to 10× the context.
  - Cache miss: prompt-cache hit rate collapsed because the prompt template changed slightly (whitespace, ordering). Check cache-hit metric.
  - Traffic spike: legit traffic 3× usual. Confirm via QPS dashboard.
  - Abuse / scraping: a single tenant pulling 100× their normal volume. Check per-tenant cost.
- Mitigations while investigating: per-tenant rate limit, hard token cap per request, kill switch on the suspected agent path.
- Numbers to drop: "cost-per-request alert: >2× baseline sustained for 1h = page", "per-tenant cap: catches abuse before bill hits", "cache hit rate target: 60-80% on chat / 80%+ on RAG"

Common follow-ups:
- "How do you set up cost alerting?"
- "Story of a real cost incident you saw."

Traps:
- No per-tenant cost visibility. Can't see who's burning what.
- No hard caps. One bug becomes a five-figure invoice.

Related cross-cutting: Cost & latency, Production patterns
Related module: learning/02_ai_infrastructure/05_agent_performance_economics/
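
The ">2× baseline" alert and the per-tenant check can share one pass over the last hour of request records. This is a sketch under stated assumptions: `cost_anomalies` is a hypothetical helper, the input is a list of `(tenant_id, cost_usd)` pairs already aggregated per request, and tenants without a recorded baseline are skipped rather than flagged.

```python
from collections import defaultdict


def cost_anomalies(requests, baseline_cost_per_request, tenant_baselines, factor=2.0):
    """Return (should_page, hot_tenants) for one hour of request costs.

    requests: iterable of (tenant_id, cost_usd) pairs.
    tenant_baselines: expected hourly spend per tenant, in USD.
    """
    total = 0.0
    count = 0
    per_tenant = defaultdict(float)
    for tenant, cost in requests:
        total += cost
        count += 1
        per_tenant[tenant] += cost

    # Global signal: average cost per request versus baseline.
    should_page = count > 0 and (total / count) > factor * baseline_cost_per_request

    # Per-tenant signal: who is spending more than `factor` times their usual hour.
    hot_tenants = sorted(
        tenant
        for tenant, spend in per_tenant.items()
        if tenant in tenant_baselines and spend > factor * tenant_baselines[tenant]
    )
    return should_page, hot_tenants


if __name__ == "__main__":
    hour = [("acme", 0.01)] * 10 + [("globex", 0.50)] * 10
    print(cost_anomalies(hour, 0.02, {"acme": 0.20, "globex": 0.30}))
```

Splitting the two signals matters for diagnosis: a global page with no hot tenant points at prompt bloat or a cache regression, while a single hot tenant points at abuse or a runaway agent.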

Q: "Users report a prompt-injection attack succeeded — what do you do?"

Tags: senior · common · scenario · source: 2026 AI security incident loops; aligned with safety-guardrails.md

Answer outline:
- Step 1: contain. Disable the affected path or feature flag the agent/tool that exfiltrated data. Don't wait to confirm scope before pulling the plug if exposure is possible.
- Step 2: assess blast radius. What data did the agent have access to? What tools could it call? Did it call them? Pull traces.
- Step 3: preserve evidence. Snapshot the offending inputs, traces, tool calls, model outputs. Legal + security may need them.
- Step 4: notify. Security team, legal (if PII / regulated data), affected customers (if disclosure required), exec stakeholders.
- Step 5: fix. Layered:
  - Input layer: detect injection patterns (delimiter abuse, "ignore prior", role-shift). Add to allow/deny lists.
  - Tool layer: tighten tool authorization. Tools should validate authority, not trust the model.
  - Output layer: scan for exfiltration patterns; require confirmation for sensitive operations.
  - Architectural: separate planner (talks to user) from executor (untrusted user content never enters the executor's context).
- Step 6: postmortem with security review. Often the root cause is architectural, not prompt-engineering.
- Related: safety-guardrails.md covers the defensive design in detail.
- Numbers to drop: "contain before confirm if exposure is possible", "tool-layer auth is the load-bearing defense; prompt-layer is best-effort", "post-injection postmortem: 80% architectural fix, 20% prompt fix"

Common follow-ups:
- "What's a tool-layer defense look like?"
- "Why isn't prompt-layer enough?"

Traps:
- Treating it as a prompt problem. It's an authorization problem.
- Hoarding the incident in eng. Security and legal need to be brought in early.

Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/24_safety_guardrails/
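
For the follow-up "What's a tool-layer defense look like?", the core idea is that authorization is checked against the end user's permissions in ordinary code, outside the model. The sketch below is one possible shape, assuming a scope-based permission model; `Principal`, `TOOL_SCOPES`, `NEEDS_CONFIRMATION`, and the tool names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Principal:
    user_id: str
    scopes: frozenset  # permissions granted to the human user, not to the model


class ToolDenied(PermissionError):
    """Raised when a tool call is not authorized for this principal."""


# Hypothetical registry: tool name -> scope the end user must hold.
TOOL_SCOPES = {"search_docs": "docs:read", "send_email": "email:send"}
# Hypothetical set of tools that also need explicit user confirmation.
NEEDS_CONFIRMATION = {"send_email"}


def authorize_tool_call(principal, tool_name, confirmed=False):
    """Raise ToolDenied unless the user may run this tool.

    The model's output only proposes the call; this check decides.
    """
    required = TOOL_SCOPES.get(tool_name)
    if required is None:
        raise ToolDenied(f"unknown tool: {tool_name}")
    if required not in principal.scopes:
        raise ToolDenied(f"{principal.user_id} lacks scope {required}")
    if tool_name in NEEDS_CONFIRMATION and not confirmed:
        raise ToolDenied(f"{tool_name} requires explicit user confirmation")


if __name__ == "__main__":
    reader = Principal("u-123", frozenset({"docs:read"}))
    authorize_tool_call(reader, "search_docs")  # allowed, returns None
    try:
        authorize_tool_call(reader, "send_email")
    except ToolDenied as exc:
        print("blocked:", exc)
```

Because the check never reads the prompt or the model's reasoning, an injected instruction can change what the model asks for but not what the user is allowed to do.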

Q: "Your retrieval index returns empty for 30% of queries. Diagnose."

Tags: senior · common · debugging · source: 2026 RAG incident loops

Answer outline:
- Step 1: is the index serving at all? Hit the vector DB health endpoint. Spot-check a known-good query.
- Step 2: has the index changed? Recent re-build, partial re-index, schema migration, embedding model swap. Diff metadata against last known-good.
- Step 3: embedding mismatch is the classic. If the query embedding model and the index embedding model don't match (different version, different normalization), recall collapses. Confirm both sides use the exact same model + version + preprocessing.
- Step 4: query distribution shift. Is traffic suddenly hitting topics the index doesn't cover? Inspect failing-query content.
- Step 5: filter or auth bug. A metadata filter or tenant scoping change can exclude most candidates silently.
- Step 6: score threshold change. If you cut off below a similarity score, a small embedding shift can push everything below threshold.
- Mitigation: roll back the last index change; widen filters; lower threshold temporarily; route to a degraded answer ("I don't have info on that").
- Numbers to drop: "embedding-model mismatch: recall drops 50-80%", "threshold sensitivity: small shifts can swing recall 20%+", "diff the index metadata first"

Common follow-ups:
- "How do you detect this in production?"
- "Walk me through your reindexing pipeline."

Traps:
- Blaming the LLM. The failure is upstream, in the empty retrieval.

Related cross-cutting: Architecture choices, Evaluation & quality
Related module: learning/01_ai_engineering/14_retrieval_ranking/, learning/01_ai_engineering/15_vector_databases/
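
The "diff the index metadata first" step is cheap to automate if both the index build and the query path record which embedding configuration they used. A minimal sketch follows, assuming that metadata is available as plain dicts; the key names (`model`, `version`, `dimensions`, `normalize`) and the model name in the example are illustrative, not a vector-database API.

```python
def embedding_config_mismatches(
    index_meta,
    query_meta,
    keys=("model", "version", "dimensions", "normalize"),
):
    """Return {key: (index_value, query_value)} for every key that differs."""
    return {
        key: (index_meta.get(key), query_meta.get(key))
        for key in keys
        if index_meta.get(key) != query_meta.get(key)
    }


if __name__ == "__main__":
    index_side = {"model": "embed-model", "version": "2", "dimensions": 1024, "normalize": True}
    query_side = {"model": "embed-model", "version": "3", "dimensions": 1024, "normalize": True}
    # A non-empty result means the two sides disagree and recall is suspect.
    print(embedding_config_mismatches(index_side, query_side))
```

Running this check at service start-up, and failing closed on a mismatch, turns the classic Step 3 failure into a deploy-time error instead of an overnight recall collapse.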

Q: "Latency p99 doubled. Mitigate first, root-cause second."

Tags: senior · common · debugging · source: standard senior latency-incident probe

Answer outline:
- Mitigate:
  - Scale up replicas (autoscaling lag is real on GPU instances).
  - Shed lowest-priority traffic (rate-limit free tier, defer batch jobs).
  - Disable expensive optional features (high-K retrieval, multi-step reasoning, judge calls).
  - Roll back recent deploys correlated in time.
  - Switch to a faster fallback model for non-critical paths.
- Root-cause:
  - Per-stage tracing. Where does the time go? Retrieval, ranking, LLM call, tool calls, post-processing?
  - One stage's p99 dominates the rest. Focus there.
  - Long-context outliers (one request with 100k tokens stalling the pipeline) → chunked prefill, route long-context separately.
  - Provider-side issue (their TTFT crept up) → check provider dashboards.
  - Cache regression → check cache hit rate.
  - Resource saturation (KV cache, GPU memory, network) → check infra metrics.
- See inference-serving.md Q on TTFT spikes for the serving-engine-specific drilldown.
- Numbers to drop: "p99 is owned by one stage usually", "long-context outliers can blow p99 with low frequency", "mitigation before root-cause"

Common follow-ups:
- "How do you find the dominant stage?"
- "What's a story where it was the provider's fault?"

Traps:
- Optimizing the wrong stage. Always profile first.

Related cross-cutting: Cost & latency, Production patterns
Related module: learning/02_ai_infrastructure/05_agent_performance_economics/, learning/02_ai_infrastructure/02_inference_serving_systems/
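
For the follow-up "How do you find the dominant stage?", a per-stage p99 over recent traces is usually enough to point at the culprit. The sketch below assumes each trace has already been reduced to a dict of stage name to duration in milliseconds; `percentile` uses the nearest-rank method, and both function names are hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct is an integer 1-100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def dominant_stage(traces, pct=99):
    """Return (stage_name, per_stage_percentiles) for the slowest stage.

    traces: list of dicts mapping stage name -> duration in ms.
    """
    by_stage = {}
    for trace in traces:
        for stage, duration_ms in trace.items():
            by_stage.setdefault(stage, []).append(duration_ms)
    per_stage = {stage: percentile(durations, pct) for stage, durations in by_stage.items()}
    worst = max(per_stage, key=per_stage.get)
    return worst, per_stage


if __name__ == "__main__":
    # 98 ordinary requests plus 2 long-context outliers that stall the LLM call.
    traces = [{"retrieval": 50, "llm": 800}] * 98 + [{"retrieval": 60, "llm": 9000}] * 2
    print(dominant_stage(traces))
```

The example data illustrates the point about long-context outliers: two slow requests out of a hundred are enough to own the p99 while leaving the median untouched.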

Partial failure

Q: "How do you design for partial failure in an AI system?"

Tags: senior · very-common · design · source: standard senior reliability probe

Answer outline:
- AI systems have more dependencies than typical apps: model provider, vector DB, embedding service, tools (search, code exec, browser), judge, observability. Any of these can fail without taking the whole product down, if you design for it.
- Principles:
  - Graceful degradation: each subsystem has a "minimum viable" mode. Retrieval down → answer from model knowledge with explicit "I don't have the latest docs" disclaimer. Judge down → log without judging. Tool down → tell the user that capability is unavailable instead of looping.
  - Bulkheads: isolate dependency failures. A tool timeout shouldn't burn the whole agent's token budget; a slow tenant shouldn't starve others.
  - Timeouts and retries with backoff: every external call has a timeout. Retries idempotent and bounded. Circuit breakers when failure rate spikes.
  - Fallbacks: provider B for provider A, smaller model for larger, cached answer for fresh.
  - Idempotency: client-side request IDs so retries don't double-charge or double-write.
- Test the failure modes deliberately. Chaos drills: kill the embedding service, watch the system degrade as designed.
- Numbers to drop: "tool timeout: 5-30s typical, never unbounded", "circuit breaker: open after 5 failures in 30s, half-open after 60s", "chaos drill cadence: monthly"

Common follow-ups:
- "Walk me through a real graceful-degradation scenario."
- "How do you test these paths?"

Traps:
- All-or-nothing design. One dependency outage = product outage.

Related cross-cutting: Architecture choices, Production patterns
Related module: learning/01_ai_engineering/04_resilient_agent_systems/
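
The circuit-breaker numbers quoted above (open after 5 failures in 30s, half-open after 60s) map onto a small state machine. This is a simplified sketch rather than a production breaker: the `CircuitBreaker` class is hypothetical, it is not thread-safe, and in the half-open state it lets every caller through instead of a single trial request.

```python
import time
from collections import deque


class CircuitBreaker:
    """Opens after `max_failures` within `window_s`; half-opens after `cooldown_s`."""

    def __init__(self, max_failures=5, window_s=30.0, cooldown_s=60.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.clock = clock          # injectable so tests can control time
        self.failures = deque()     # timestamps of recent failures
        self.opened_at = None       # None means the circuit is closed

    def allow(self):
        """True if a call may proceed (closed, or half-open after the cooldown)."""
        if self.opened_at is None:
            return True
        return self.clock() - self.opened_at >= self.cooldown_s

    def record_success(self):
        """A successful call closes the circuit and clears the failure history."""
        self.failures.clear()
        self.opened_at = None

    def record_failure(self):
        now = self.clock()
        if self.opened_at is not None:
            self.opened_at = now    # failed trial while half-open: restart the cooldown
            return
        self.failures.append(now)
        # Drop failures that fell out of the sliding window.
        while self.failures and now - self.failures[0] > self.window_s:
            self.failures.popleft()
        if len(self.failures) >= self.max_failures:
            self.opened_at = now


if __name__ == "__main__":
    breaker = CircuitBreaker()
    for _ in range(5):
        breaker.record_failure()
    print("calls allowed:", breaker.allow())
```

Callers wrap each external call as `if breaker.allow(): ...`, then report the outcome; when `allow()` is false they go straight to the degraded mode described under graceful degradation.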

Q: "An agent is stuck in a tool-call loop in production. How do you stop it without killing legitimate long-running agents?"

Tags: senior · common · scenario · source: 2026 agent on-call loops; related to agents-debugging-production.md

Answer outline:
- Detection: per-agent metrics on tool-call count, step count, token spend, wall time. Threshold alarms (e.g., >50 steps, >100k tokens, >5 min wall) flag candidates.
- Distinguish loop from long-running:
  - Loop: same tool with similar args called repeatedly; minimal state progress.
  - Long-running: diverse tool calls, monotonic progress on a tracked goal.
- Auto-mitigation:
  - Per-agent hard caps (max steps, max tokens, max wall time) terminate runaway. Generous enough not to hit legit long-running; specific to agent type.
  - Loop detector: if the last N actions repeat with similar args, abort with an explicit error.
- Manual mitigation in incident: kill switch on the agent type that's looping. Provide users a "your task didn't complete" message.
- Postmortem: was it a tool returning malformed responses confusing the model? A prompt change that broke termination? A new edge case? Patch all three: the tool, the prompt, the cap.
- See agents-debugging-production.md for the broader debugging methodology.
- Numbers to drop: "loop detector: N=3-5 repeats triggers abort", "agent caps: max-steps 20-50, max-tokens 50-200k", "fix all three: tool, prompt, cap"

Common follow-ups:
- "How do you tune the cap not to hit legit cases?"
- "What does a loop look like in a trace?"

Traps:
- One global cap. Different agent types need different caps.

Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/03_agent_observability_debugging/
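
The loop detector described above can start as an exact-match check on the last N actions. The sketch below makes that simplification explicit: "similar args" is approximated as identical tool name plus identical JSON-serialised arguments, so near-duplicate calls (for example a changing timestamp) would need a fuzzier fingerprint. `is_looping` is a hypothetical name and the arguments are assumed to be JSON-serialisable.

```python
import json


def is_looping(actions, n=3):
    """True if the last `n` actions are the same tool called with the same args.

    actions: list of (tool_name, args_dict) tuples, most recent last.
    """
    if len(actions) < n:
        return False

    def fingerprint(action):
        tool, args = action
        # sort_keys makes the fingerprint independent of dict key order.
        return tool, json.dumps(args, sort_keys=True)

    recent = [fingerprint(action) for action in actions[-n:]]
    return all(fp == recent[0] for fp in recent)


if __name__ == "__main__":
    history = [
        ("search", {"q": "refund policy"}),
        ("fetch_page", {"url": "https://example.com/a"}),
        ("fetch_page", {"url": "https://example.com/a"}),
        ("fetch_page", {"url": "https://example.com/a"}),
    ]
    print(is_looping(history, n=3))
```

A long-running agent making diverse calls never trips this check, which is why it can sit alongside the hard caps: the caps bound the worst case and the detector catches the common case early.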

Postmortems

Q: "Walk me through how you run a postmortem."

Tags: senior · very-common · scenario · source: standard senior reliability culture probe

Answer outline:
- Scheduling: within 5 business days of resolution. Wait too long and memory fades; too soon and people are still recovering.
- Facilitator: someone not involved in the incident. Keeps it blameless and structured.
- Structure:
  - Timeline: minute-by-minute, from first signal to resolution. Sourced from chat logs, alarms, deploy events, traces.
  - Impact: users affected, requests failed, revenue/SLO breach, data exposure.
  - Detection: how did we find out? Was alerting adequate? What was the delay?
  - Mitigation: what did we do? What worked, what didn't? Where did we fumble?
  - Root cause(s): contributing factors, not just "one cause". Use 5-whys but accept that complex systems have multiple.
  - Action items: specific, assigned, with deadlines. Distinguish "stop the bleeding" (this incident class) from "deeper" (broader hardening).
- Blameless tone: focus on the system, not individuals. People made the best decisions they could with the info they had.
- AI-specific additions:
  - Did our evals catch this? If not, what eval would have? Add it.
  - Did the trace contain enough info to debug? If not, what telemetry is missing?
  - Was the failure mode in our chaos / game-day playbook? If not, add it.
- Publish broadly. Track action-item completion; revisit at next postmortem.
- Numbers to drop: "postmortem within 5 business days", "action items: specific + assigned + dated", "eval / trace augmentation is the AI-specific deliverable"

Common follow-ups:
- "How do you keep them blameless when someone clearly made a mistake?"
- "What's a postmortem outcome you're proud of?"

Traps:
- Naming names. Destroys safety, doesn't improve the system.
- Postmortem theatre: action items that don't get done.

Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/04_resilient_agent_systems/

Q: "Tell me about an incident you led. What did you learn?"

Tags: senior · very-common · scenario · source: behavioral round, universal

Answer outline:
- Pick a real incident with stakes: outage, cost spike, quality regression, security event. Ideally one where mitigation worked but the root cause was non-obvious.
- Structure (STAR-ish):
  - Context: product, scale, what was at stake.
  - Signal: how it surfaced, what you saw first.
  - Actions: triage decisions, mitigation choice, communication.
  - Resolution: time to mitigate, time to fully resolve, impact.
  - Learning: what you changed about the system, the runbook, the team's habits.
- AI-specific texture: mention eval regression detection, trace forensics, provider behavior, agent loop, RAG retrieval failure, whichever applies. Demonstrates AI-flavored reliability fluency.
- The senior tell: own the mistake without flinching. "I should have rolled back faster. I spent 20 minutes investigating before pulling the trigger. Now my default is rollback-first if a deploy is the suspect." Vulnerability + lesson is what interviewers want.
- Numbers to drop: actual numbers (minutes, requests, dollars). Round if you can't share exact figures, but ground the story.

Common follow-ups:
- "What would you do differently?"
- "How did you communicate to leadership?"

Traps:
- Hero narrative. Interviewers detect this and discount the story.
- Vagueness: no numbers, no specifics. Sounds rehearsed.

Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/04_resilient_agent_systems/

Q: "How do you make sure postmortem action items actually get done?"

Tags: staff · occasional · scenario · source: staff-level reliability culture probe

Answer outline:
- Track in a real system (Jira / Linear / equivalent), not in the postmortem doc itself.
- Each item: owner (one person, not "the team"), deadline, priority, link back to the postmortem.
- Review cadence: weekly during eng leadership stand-up; named callout for stale items.
- Don't let the next incident steal capacity from prior incidents' actions. Allocate explicit time in sprint planning.
- Sunset action items that are no longer relevant. Write a note explaining why; don't let them rot.
- Metric: % of action items completed within their deadline. Trend it; report it.
- Numbers to drop: "review cadence: weekly", "completion target: 80%+ on-time", "sunset stale items with reasoning"

Common follow-ups:
- "What if leadership keeps deprioritizing them?"
- "What's a sign your action-item process is broken?"

Traps:
- Action items in the postmortem doc, nowhere else. Invisible.

Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/04_resilient_agent_systems/
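
The on-time completion metric is easy to compute from a tracker export, and writing it down removes ambiguity about the denominator. The sketch below is one possible definition, assuming each item is a dict with a `deadline` and an optional `completed_on` date: only items whose deadline has already passed are counted, and sunset items are assumed to have been filtered out beforehand. `on_time_completion_rate` is a hypothetical name.

```python
from datetime import date


def on_time_completion_rate(items, today):
    """Fraction of due action items that were completed by their deadline.

    Returns None when no item has come due yet.
    """
    due = [item for item in items if item["deadline"] <= today]
    if not due:
        return None
    on_time = sum(
        1
        for item in due
        if item["completed_on"] is not None and item["completed_on"] <= item["deadline"]
    )
    return on_time / len(due)


if __name__ == "__main__":
    items = [
        {"deadline": date(2026, 1, 10), "completed_on": date(2026, 1, 9)},
        {"deadline": date(2026, 1, 10), "completed_on": date(2026, 1, 12)},  # late
        {"deadline": date(2026, 1, 15), "completed_on": None},               # overdue
        {"deadline": date(2026, 3, 1), "completed_on": None},                # not due yet
    ]
    rate = on_time_completion_rate(items, today=date(2026, 2, 1))
    print(f"on-time rate: {rate:.0%}")
```

Counting late and still-open items against the rate is deliberate: it keeps the number honest against the "80%+ on-time" target instead of rewarding items that were closed eventually.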

Pre-incident discipline

Q: "What do you do before an incident to be ready for one?"

Tags: senior · common · design · source: 2026 SRE / AI reliability loops

Answer outline:
- Runbooks: per known failure class. Provider outage, hallucination spike, cost spike, retrieval degradation, agent loop, prompt injection. Each runbook: detection signal, immediate mitigation, escalation, root-cause checklist.
- On-call rotation: clear primary/secondary; paging tested; handoff ritual.
- Alerting: alarms tied to user-impacting signals (TTFT p95, error rate, eval pass rate, cost-per-request), not just CPU/memory. Each alarm has a runbook link.
- Game days / chaos drills: deliberately fail the embedding service, the tool, the provider, in staging or controlled prod. Quarterly minimum.
- Eval suite: tier-1 evals run on every deploy; tier-2 nightly. Regression > X% blocks deploy.
- Telemetry: traces, structured logs, per-request cost, per-tenant breakdowns. If you can't see it, you can't debug it.
- Pre-warmed fallbacks: cross-provider abstraction, fallback capacity, kill switches, all tested.
- Communication templates: status page snippets, customer comms drafts, internal update format. Don't draft these in the middle of a fire.
- Numbers to drop: "runbook coverage: top 10 failure classes minimum", "game day cadence: quarterly", "every alarm links to a runbook"

Common follow-ups:
- "What's a chaos drill you've run?"
- "How do you keep runbooks fresh?"

Traps:
- Runbooks that exist but are stale. Worse than no runbook.

Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/04_resilient_agent_systems/, learning/01_ai_engineering/26_observability_telemetry/

Q: "What does an AI-specific runbook look like that a generic SRE runbook wouldn't have?"

Tags: staff · common · conceptual · source: 2026 AI reliability loops

Answer outline:
- AI-specific signals to check at the top of every runbook:
  - Eval pass rate (tier-1) over the last hour.
  - Hallucination rate (online judge).
  - Refusal rate (an unexpected refusal spike often means a guardrail misfire).
  - Per-tenant cost-per-request.
  - Retrieval recall (online proxy).
  - Tool-call success rate.
  - Provider error rate, broken out by model + region.
- AI-specific mitigations:
  - Pin/unpin model version.
  - Toggle prompt-cache (sometimes the cache is the bug).
  - Switch to fallback provider / smaller model.
  - Lower retrieval top-K or disable a re-ranker.
  - Disable a recently added tool.
  - Drop temperature, lower top-p.
- AI-specific data to preserve for postmortem: traces (the agent's full step sequence), inputs (offending prompts), retrieved chunks, model outputs, eval-judge verdicts.
- Numbers to drop: "AI runbook ≈ generic runbook + 7-10 AI-specific signals + 5-8 AI-specific mitigations", "preserve traces by default; they're the forensic evidence"

Common follow-ups:
- "What's a story where a non-AI runbook missed the AI cause?"
- "How do you keep these synced with the eval suite?"

Traps:
- Copying generic SRE runbooks verbatim. They miss the AI failure modes.

Related cross-cutting: Production patterns, Evaluation & quality
Related module: learning/01_ai_engineering/04_resilient_agent_systems/, learning/01_ai_engineering/22_evals_production/