Observability & Tracing — Interview Questions

The senior tell is treating every LLM call as a span in a trace, not a request in a log. Traditional logs flatten the structure that matters most — which tool fired before which retrieval, which retry triggered which guardrail block, why the agent looped. In 2026 the dominant frame is OpenTelemetry's GenAI semantic conventions (a published standard) with a backend of your choice — Langfuse, LangSmith, Arize, Honeycomb. The candidate who names "OTel as the portability layer, backend as the choice" is the one who's wired this in production.


Foundations

Q: "What is LLM observability?"

Tags: screen · very-common · conceptual · source: Amit Shekhar AI engineering questions repo (GitHub, 2026); standard senior LLM loop opener

Answer outline:
- Observability is the ability to answer "why is this request behaving this way" from outside the system, using emitted signals. For LLM apps the dimensions are richer than for typical web services: model behavior is stochastic, costs scale per token, and failures are distribution-level (quality regressions, drift) rather than binary up/down.
- The three pillars carry over from classical observability (traces, metrics, logs) but with LLM-specific content. Traces capture the span tree of an agent run (planning → retrieval → tool calls → LLM completion → guardrails). Metrics include tokens, cost, TTFT, TPOT, refusal rate, and cache hit rate. Logs include the full prompt, response, retrieved chunks, tool args, model version, prompt version, and user/tenant ID, with PII redaction.
- Tooling stack in 2026: OpenTelemetry as the instrumentation layer (GenAI Semantic Conventions are published), with Langfuse / LangSmith / Arize / Datadog / Honeycomb as backends. Most teams use OTel plus a backend so they can swap backends without re-instrumenting.
- The differentiator from classical observability: you store content (prompts, completions, retrieved chunks). This is the only way to debug "why did the model say that?", but it creates a PII / retention problem you must address explicitly.
- Numbers to drop: "OTel GenAI Semantic Conventions are the 2026 standard", "trace retention: 7-30 days hot, 90-180 days cold", "1-10% sampled traces get LLM-judge eval"

Common follow-ups: - "What's different from web-app observability?" - "How do you store prompts without leaking PII?" - "What's the role of OpenTelemetry?"

Traps: - Treating LLM observability as logs + metrics. Without traces you can't debug agent loops. - Storing raw prompts/completions without PII redaction or retention policy.

Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/03_agent_observability_debugging/, learning/04_ai_product_evals/00_ai_evals_release_gates/


Q: "What's the difference between logging, metrics, and tracing for LLM apps?"

Tags: mid · common · conceptual · source: standard senior observability probe; reported in 2026 LLM platform loops

Answer outline:
- Logs: free-form text events. Cheap, easy to emit, hard to query at scale. Good for narrative ("user X hit refusal at 14:32") but useless for trends.
- Metrics: numeric time-series. Aggregatable, alarmable, low retention cost. Latency p99, refusal rate, $/call, cache hit rate. Good for dashboards and SLOs.
- Traces: structured span trees. Each span represents one operation (LLM call, retrieval, tool invocation, guardrail check), with parent/child relationships, timing, attributes (model, tokens, cost), and events (cache hit, retry, fallback). This is the only way to debug an agent: you need to see "the retrieval returned 5 chunks, the LLM picked chunk 3, the tool call failed, the agent retried, eventually called fallback".
- For LLM apps, traces matter most. Logs supplement; metrics aggregate; traces explain.
- Implementation: OpenTelemetry spans, with GenAI Semantic Conventions defining standard attribute names (gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, etc.) so any OTel backend can render them.
- Numbers to drop: "trace = span tree; one trace per user-facing request", "span attributes per LLM call: ~15-25 standard fields under OTel GenAI", "metrics from spans: aggregate trace data into rolled-up time-series"

Common follow-ups: - "Why are traces more important than logs for LLMs?" - "Can you derive metrics from traces?"

Traps: - Equating "logs" with "observability". For agent debugging you need the span tree. - Storing prompts as log lines without structure. You can't query "all calls with refusal=true and tenant=X" out of free text.

Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/03_agent_observability_debugging/
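
The "metrics from spans" idea in the outline above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any backend's implementation: the Span class, the rollup function, the model name "example-model", and the app.refusal attribute are hypothetical; only the gen_ai.* attribute names come from the outline.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One operation in a trace; children form the span tree."""
    name: str
    duration_ms: float
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def walk(span):
    """Yield the span and every descendant, depth-first."""
    yield span
    for child in span.children:
        yield from walk(child)


def rollup(traces):
    """Aggregate LLM spans from many traces into flat metrics."""
    llm_calls = input_tokens = output_tokens = refusals = 0
    for root in traces:
        for span in walk(root):
            if "gen_ai.request.model" not in span.attributes:
                continue  # not an LLM span (retrieval, tool, guardrail, ...)
            llm_calls += 1
            input_tokens += span.attributes.get("gen_ai.usage.input_tokens", 0)
            output_tokens += span.attributes.get("gen_ai.usage.output_tokens", 0)
            if span.attributes.get("app.refusal", False):  # hypothetical attribute
                refusals += 1
    return {
        "llm_calls": llm_calls,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "refusal_rate": refusals / llm_calls if llm_calls else 0.0,
    }


trace = Span("chat_request", 1200.0, children=[
    Span("retrieval", 80.0, {"retrieval.top_k": 5}),
    Span("llm_call", 900.0, {
        "gen_ai.request.model": "example-model",
        "gen_ai.usage.input_tokens": 1500,
        "gen_ai.usage.output_tokens": 200,
    }),
    Span("llm_call", 150.0, {
        "gen_ai.request.model": "example-model",
        "gen_ai.usage.input_tokens": 300,
        "gen_ai.usage.output_tokens": 20,
        "app.refusal": True,
    }),
])
metrics = rollup([trace])
```

The reverse direction does not work: once data is reduced to counters, the parent/child structure that explains a single failure is gone, which is why traces are the primary signal and metrics the derived one.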


Q: "Why does OpenTelemetry matter for LLM apps?"

Tags: senior · common · conceptual · source: Langfuse OTel integration docs 2026; standard senior platform-design probe

Answer outline:
- OpenTelemetry is the open standard for emitting traces/metrics/logs. The GenAI Semantic Conventions (published 2024-2026) define standard attribute names for LLM operations: gen_ai.system (which provider; newer releases of the conventions rename this to gen_ai.provider.name, so check the version your backend expects), gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.request.temperature, etc.
- Why it matters: you instrument once with OTel, then send traces to any compatible backend: Langfuse, Logfire, Arize, Honeycomb, Datadog, Grafana Tempo, Jaeger. Switch backend without touching application code.
- The alternative is a per-vendor SDK (LangSmith ties you to its ingestion path, Arize to its agent, etc.). Swapping vendors means re-instrumenting. OTel breaks that lock-in.
- 2026 trend: frameworks and instrumentation libraries adopt OTel GenAI conventions natively: Pydantic AI, smolagents, and Strands Agents on the framework side; OpenLLMetry and OpenInference as instrumentation libraries. LangChain emits OTel; LangSmith ingests OTel.
- Practical advice: choose your backend for features (UI, eval workflows, integrations), but instrument on OTel. You get vendor optionality essentially for free.
- Numbers to drop: "OTel GenAI Semantic Conventions: standardized attribute names since 2024", "swap backend: zero application-code changes if you instrumented on OTel"

Common follow-ups: - "What's wrong with vendor-SDK instrumentation?" - "Which frameworks natively emit OTel?"

Traps: - Calling OTel "just a wire format". It's the semantic contract — attribute names matter as much as the protocol. - Re-instrumenting your app twice (vendor SDK + OTel) instead of picking OTel up front.

Related cross-cutting: Architecture choices, Production patterns
Related module: learning/01_ai_engineering/03_agent_observability_debugging/


Implementation

Q: "How do you implement logging and tracing for LLM applications?"

Tags: mid · very-common · design · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)

Answer outline:
- Layer it: instrumentation SDK → exporter → backend.
- Instrumentation:
  - Wrap LLM SDK calls so every invocation is a span: input prompt, model, parameters, output, tokens, cost, latency.
  - Wrap retrievers (vector store, BM25): query, top-k results, scores, latency.
  - Wrap tools: name, args, result, latency, success/failure.
  - Wrap guardrails: input/output classifier verdicts, action taken.
  - Use the OpenTelemetry SDK with GenAI Semantic Conventions. Most modern frameworks (LangChain, LlamaIndex) can be auto-instrumented in a line or two through libraries such as OpenInference.
- Exporter: OTLP over gRPC/HTTP to the backend. Use the OpenTelemetry Collector as a buffering/sampling layer so the app emits cheaply.
- Backend choice: Langfuse (open-source, MIT-licensed core, biggest open community), LangSmith (best for LangChain shops), Arize (best for eval workflows), Datadog/Honeycomb (best for unified app+LLM observability).
- Sampling: 100% of failed traces; 100% of high-priority traffic (paid tier); 1-10% of bulk traffic. Always 100% during incidents.
- PII: redact in one central place in the telemetry pipeline (a span processor in the SDK or a processor in the Collector), not ad hoc in application code. That gives a single point of policy, so no application code path can accidentally leak.
- Numbers to drop: "trace sample rate: 100% errors, 1-10% normal traffic", "OTel Collector adds ~10ms per emit at high volume", "log retention with PII: 7-30 days hot"

Common follow-ups: - "How do you trace an agent loop?" - "What about cost — where does that go?" - "How do you handle PII in traces?"

Traps: - Logging full prompts in plain text without redaction. - 100% sampling on everything — at high volume the cost is real.

Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/03_agent_observability_debugging/
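
The "wrap every call so it becomes a span" step can be sketched without any tracing library, to show what the wrapper has to capture. Everything here is an assumption made for illustration: the span context manager, the FINISHED_SPANS list standing in for an exporter queue, and the fake_llm client are hypothetical. A real setup would use the OpenTelemetry SDK's tracer instead of hand-rolled dicts.

```python
import time
import uuid
from contextlib import contextmanager

FINISHED_SPANS = []  # stand-in for an exporter queue (hypothetical)


@contextmanager
def span(name, trace_id, parent_id=None, attributes=None):
    """Record one operation as a span dict, including latency and error status."""
    record = {
        "trace_id": trace_id,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent_id,
        "name": name,
        "attributes": dict(attributes or {}),
        "status": "ok",
    }
    start = time.perf_counter()
    try:
        yield record
    except Exception as exc:
        # Failed operations still produce a span, marked as an error.
        record["status"] = "error"
        record["attributes"]["error.type"] = type(exc).__name__
        raise
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        FINISHED_SPANS.append(record)


def fake_llm(prompt):
    """Hypothetical model client: returns text plus token usage."""
    return {"text": "ok", "input_tokens": len(prompt.split()), "output_tokens": 1}


def traced_completion(prompt, trace_id, parent_id=None):
    """Wrap the model call so every invocation emits a span."""
    attrs = {"gen_ai.request.model": "example-model"}
    with span("llm_call", trace_id, parent_id, attrs) as current:
        result = fake_llm(prompt)
        current["attributes"]["gen_ai.usage.input_tokens"] = result["input_tokens"]
        current["attributes"]["gen_ai.usage.output_tokens"] = result["output_tokens"]
        return result["text"]
```

The same wrapper shape applies to retrievers, tools, and guardrails; only the attribute set changes. Recording the span in a finally block is the important design choice, because it guarantees a span exists for calls that raise.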


Q: "How would you trace an agent loop?"

Tags: senior · common · design · source: Langfuse Agent Observability docs; standard senior agent-debugging probe 2026

Answer outline:
- Model the agent as a tree of spans. Root span = the user request. Children = each iteration of the planning loop. Each iteration has sub-spans for planning, tool selection, tool call, observation, guardrail check.
- Span attributes:
  - Planning span: model, prompt version, output (the chosen next action), tokens, cost.
  - Tool span: tool name, args, raw result, latency, success.
  - Retrieval span: query, top-k IDs, scores, source store.
  - Guardrail span: classifier outputs, action (pass/block/sanitize).
- Span events: cache hit/miss, retry attempt, fallback model invoked, refusal.
- Linkage: every span carries a trace ID + parent span ID. A backend renders the tree as a waterfall view; you can see at a glance which step took long, which retried, where the loop terminated.
- Capture why the loop ended: termination reason as an attribute (max_steps reached, success criterion met, hard timeout, guardrail block). Critical for debugging "why did the agent stop here?"
- For long-running agents: emit interim spans rather than batching at the end; otherwise a crash loses all visibility.
- Numbers to drop: "10-50 spans per agent run typical", "max-depth alarm at 20+ iterations", "trace storage: ~1-10 KB per span"

Common follow-ups: - "How do you find runaway loops in trace data?" - "What if a single trace spans hours?" - "How do you join spans across services?"

Traps: - Logging-only debugging for agents. You can't reconstruct the loop from flat logs. - Forgetting termination-reason. Without it you can't tell success from stuck.

Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/03_agent_observability_debugging/, learning/01_ai_engineering/01_agentic_system_design/
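
The span-tree shape described above (root, one child per iteration, sub-spans per step, termination reason on the root) can be sketched as a toy loop. The names new_span and run_agent, the plan_step callback, and the dict layout are all hypothetical; the point is only the parent/child linkage and the termination_reason attribute.

```python
import itertools

_ids = itertools.count(1)  # simple span-ID generator for the sketch


def new_span(trace_id, parent_id, name, **attrs):
    """Create a span dict linked to its parent by ID."""
    return {"trace_id": trace_id, "span_id": next(_ids),
            "parent_id": parent_id, "name": name, "attributes": attrs}


def run_agent(plan_step, max_steps=10):
    """Run a toy loop; plan_step(i) returns a tool name, or None when done."""
    spans = []
    root = new_span("trace-1", None, "agent_run")
    spans.append(root)  # spans are appended as they happen, not batched at the end
    reason = "max_steps"
    for i in range(max_steps):
        iteration = new_span("trace-1", root["span_id"], "iteration", index=i)
        spans.append(iteration)
        action = plan_step(i)
        spans.append(new_span("trace-1", iteration["span_id"], "planning", action=action))
        if action is None:
            reason = "success"
            break
        spans.append(new_span("trace-1", iteration["span_id"], "tool_call", tool=action))
    # Without this attribute a finished run and a stuck run look the same.
    root["attributes"]["termination_reason"] = reason
    return spans


done = run_agent(lambda i: "search" if i < 2 else None)
stuck = run_agent(lambda i: "search", max_steps=4)
```

Here done ends with termination_reason "success" after three iterations, while stuck ends with "max_steps"; counting the latter across runs gives a stuck-rate signal.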


Q: "How do you monitor and profile LLM inference in production (TTFT, inter-token latency, GPU utilization)?"

Tags: senior · common · design · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)

Answer outline:
- Three signal classes:
  - User-facing latency: TTFT (time to first token), TPOT/ITL (time per output token, i.e. inter-token latency), E2E. Track p50/p95/p99 by model, tenant, prompt size bucket.
  - System-side throughput: tokens/sec/replica, queue depth, in-flight requests, requests/sec.
  - GPU: utilization, memory, KV cache occupancy, batch size, eviction rate.
- For self-hosted (vLLM, TGI, TensorRT-LLM), enable the Prometheus metrics endpoint: vLLM exposes vllm:e2e_request_latency_seconds, vllm:time_to_first_token_seconds, vllm:gpu_cache_usage_perc, etc. (exact metric names vary by vLLM version). Scrape into Prometheus, dashboard in Grafana.
- For API-provider tier: track from the client side (TTFT measured by your SDK), not just E2E. Provider-side metrics are opaque.
- Sampling: aggregate latency metrics over 100% of requests, with no sampling. Spans can be sampled, but rolled-up metrics need full coverage to catch p99 tails.
- Profiling: when a specific endpoint is slow, use NVIDIA Nsight Systems or the PyTorch profiler on a non-prod replica to find the kernel-level bottleneck. Production usually doesn't run a profiler.
- Numbers to drop: "TTFT p95 target: <500ms for chat", "TPOT p95 target: <50ms for streaming", "GPU memory utilization target: 80-95% (above is OOM risk)", "vLLM emits ~25 metrics natively"

Common follow-ups: - "How do you alarm on p99 without false positives?" - "What's the difference between user-perceived latency and server latency?"

Traps: - Tracking only p50. Tail latency is where products break at scale. - Reading GPU util as "GPU busy". LLM decode can be memory-bandwidth-bound with low SM utilization.

Related cross-cutting: Cost & latency, Production patterns
Related module: learning/02_ai_infrastructure/02_inference_serving_systems/, learning/01_ai_engineering/03_agent_observability_debugging/
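
For the client-side measurement mentioned above, TTFT and TPOT fall out of the arrival timestamps of streamed tokens. A minimal sketch, assuming you record one timestamp per token; the function names are hypothetical, and the percentile uses the simple nearest-rank definition rather than whatever interpolation a given metrics backend applies.

```python
import math


def stream_latency(request_ts, token_ts):
    """Return (ttft, tpot, e2e) in seconds from a request start time
    and the arrival time of each streamed token."""
    if not token_ts:
        raise ValueError("no tokens received")
    ttft = token_ts[0] - request_ts
    e2e = token_ts[-1] - request_ts
    # TPOT averages the gaps after the first token; undefined for one token.
    gaps = len(token_ts) - 1
    tpot = (token_ts[-1] - token_ts[0]) / gaps if gaps else 0.0
    return ttft, tpot, e2e


def percentile(values, p):
    """Nearest-rank percentile for integer p in 1..100."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Request sent at t=0.0; tokens arrive at 0.5s, 0.75s, 1.0s, 1.25s.
ttft, tpot, e2e = stream_latency(0.0, [0.5, 0.75, 1.0, 1.25])
```

In this example TTFT is 0.5 s, TPOT is 0.25 s, and E2E is 1.25 s. Feeding the per-request values into percentile per model, tenant, and prompt-size bucket gives the p50/p95/p99 breakdown.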


Backend platforms

Q: "How do LangSmith, Langfuse, and Arize compare for LLM observability?"

Tags: senior · common · conceptual · source: DigitalApplied / Kanerika 2026 platform comparisons; standard senior tooling probe

Answer outline:
- All three solve overlapping problems differently. Pick based on stack and workflow, not features-on-paper.
- LangSmith: built by LangChain. Best for LangChain/LangGraph shops: it auto-instruments the framework, has an integrated agent IDE, and makes A/B prompt experiments easy. Downside: instrumentation is deeply coupled to LangChain abstractions; if you decide to leave LangChain, you re-instrument. Also closed-source.
- Langfuse: open-source (MIT-licensed core), native OTel support, framework-agnostic. Massive 2026 adoption (millions of SDK installs/month). Best for teams who want vendor optionality, self-hosting for data residency, or an open-source stack.
- Arize: enterprise-focused, strong eval and drift-detection workflows, especially good for RAG quality monitoring. Phoenix is the open-source companion. Best for teams whose primary observability need is quality / drift, not just traces.
- Pragmatic combo: Langfuse for tracing (open, OTel-native), Arize Phoenix for RAG eval, W&B for experiment management. Or LangSmith for everything if you're committed to LangChain.
- The 2026 default I'd recommend: OTel instrumentation + Langfuse (open) as the primary backend. Switch if specific needs justify it.
- Numbers to drop: "Langfuse: 2000+ paying customers, 26M+ SDK installs/month, used by 19 of Fortune 50", "LangSmith: deepest LangChain integration", "Arize Phoenix: open-source eval-focused"

Common follow-ups: - "When would you self-host vs SaaS?" - "What's the downside of LangSmith's coupling?"

Traps: - Picking the tool that's deepest in your current framework without considering the lock-in cost. - Treating these as identical. They optimize for different workflows.

Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/03_agent_observability_debugging/


Q: "Self-host or SaaS for LLM observability?"

Tags: senior · common · scenario · source: standard senior tooling probe; reported in 2026 platform loops

Answer outline:
- Default to SaaS. The ops cost of running an observability stack (storage, retention, scaling) eats the savings unless you're at high volume.
- Self-host when:
  - Data residency: traces contain prompts/completions; some regulations require these stay in-region or on-prem.
  - PII sensitivity: even with redaction, some legal/security postures forbid sending prompts to a third party.
  - Cost at scale: SaaS observability is priced per-event or per-span; at 100M+ spans/month, self-hosted Langfuse + ClickHouse or Tempo + S3 can be substantially cheaper.
  - Custom integrations: deep integrations with internal systems that SaaS won't build.
- Self-host options: Langfuse (Docker, K8s, ClickHouse-backed), Arize Phoenix (lightweight, mostly for dev/eval), Jaeger/Tempo+Grafana (OTel-native, classic), OpenLLMetry.
- Hybrid: SaaS for normal traffic, with a self-hosted data-residency tier for regulated tenants. Or SaaS frontend with self-hosted storage (some vendors support this).
- The senior insight: this is an org decision, not a tech decision. Map to your security/compliance posture first, optimize cost second.
- Numbers to drop: "SaaS observability: $0.0001-0.001 per span typical", "self-hosted at 100M+ spans/month often 2-5× cheaper but with ops overhead"

Common follow-ups: - "What's the ops cost of self-host?" - "What about hybrid?"

Traps: - Defaulting to self-host because "data must stay in-house" without checking if vendor zero-retention modes satisfy the policy.

Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/03_agent_observability_debugging/
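
The quoted ranges are easier to reason about as monthly dollars. The arithmetic below only multiplies out the figures from the outline above (per-span price range, 100M spans/month, "2-5× cheaper"). It is a back-of-envelope sketch, not a pricing model, and it excludes the ops headcount that self-hosting adds.

```python
from decimal import Decimal

SPANS_PER_MONTH = 100_000_000

# Per-span SaaS price range quoted in the outline (USD).
SAAS_PRICE_LOW = Decimal("0.0001")
SAAS_PRICE_HIGH = Decimal("0.001")


def monthly_cost(spans, price_per_span):
    """Monthly bill in USD for a span volume and a per-span price."""
    return spans * price_per_span


saas_low = monthly_cost(SPANS_PER_MONTH, SAAS_PRICE_LOW)    # 10,000 USD
saas_high = monthly_cost(SPANS_PER_MONTH, SAAS_PRICE_HIGH)  # 100,000 USD

# "2-5x cheaper" applied to the two ends of the SaaS range gives the
# widest plausible band for self-hosted infrastructure cost.
self_host_floor = saas_low / 5     # 2,000 USD
self_host_ceiling = saas_high / 2  # 50,000 USD
```

At this volume the gap is roughly one order of magnitude wide on both sides, which is why the decision usually turns on compliance posture and available ops capacity rather than on the per-span price alone.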


Trace-driven debugging

Q: "Your AI pipeline has zero visibility into which step is failing. How do you add observability?"

Tags: mid · very-common · debugging · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)

Answer outline:
- Triage first: are you debugging a current incident or building observability for future incidents? Different urgency.
- For an active incident: add the minimum to diagnose. Print full prompts/completions/tool args at the suspect stage. Identify the failure. Then back-fill proper instrumentation.
- For future incidents: instrument once, properly.
  - Step 1: pick an OTel-native framework adapter (OpenInference for LangChain/LlamaIndex/etc.) so every LLM call, retrieval, and tool call gets auto-spans.
  - Step 2: pick a backend (Langfuse / LangSmith / Arize). Start with the SaaS free tier; switch later if needed.
  - Step 3: emit. Verify spans land. Validate the trace tree renders.
  - Step 4: set sampling. 100% errors, 1-10% normal traffic.
  - Step 5: add metrics. Roll up spans into TTFT/cost/refusal-rate dashboards.
  - Step 6: add alerting. SLO-based, rate-based, not point-based.
- Common observability gaps that cause this question:
  - Tool calls not instrumented: only the LLM is traced.
  - Retrieval results not captured: you can't tell if the bug is bad chunks or a bad answer.
  - No prompt-version attribute: you can't tell which prompt produced the failure.
- Numbers to drop: "first useful trace: 1-2 days from zero to ingest", "useful dashboards in week 1", "MTTR drops 30-70% once traces land for agent debugging"

Common follow-ups: - "What's the bare-minimum trace for a useful debugging signal?" - "How do you justify the cost of observability?"

Traps: - Adding only error logs. You need full prompts and tool args to debug LLM failures. - Skipping the prompt-version attribute. Critical for "which version regressed?"

Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/03_agent_observability_debugging/


Q: "Walk me through debugging a failing agent run using traces."

Tags: senior · very-common · scenario · source: standard senior agent-debug probe; reported in 2026 AI engineer loops

Answer outline:
- Step 1: find the trace. Use the user-supplied request ID or filter by tenant + timestamp + error tag.
- Step 2: open the span waterfall. Look for the failure pattern:
  - Long span → which step blew the latency budget (often a runaway tool call or a long-context retrieval).
  - Loop pattern → identical spans repeating → agent stuck in a thought loop.
  - Missing expected span → tool not called, retrieval skipped, guardrail bypassed.
  - Error span → check the captured exception, input args, output.
- Step 3: inspect span attributes at the failing step. Read the full prompt sent, the model output, the tool args, the retrieval results. The actual content is where the bug is.
- Step 4: diff against a known-good trace of the same user intent. What's different? Often it's a tool result that looks fine but is subtly malformed, or a retrieved chunk that triggered a hallucination.
- Step 5: hypothesize and verify in a sandbox. Replay the exact prompt/tools state in a dev environment, confirm the failure, then iterate.
- Step 6: write a regression case. Capture the failing trace's inputs as a permanent eval-set example, so future regressions are caught in CI.
- Numbers to drop: "MTTR for trace-equipped agent bugs: 30-90 min. Without traces: hours-to-days, often unresolved", "regression suite grows by 5-20 examples per week from production triage"

Common follow-ups: - "What if the trace is missing critical attributes?" - "How do you handle long-running agent traces?"

Traps: - Reading only the final error. The interesting failure is usually 3-5 steps upstream. - Not capturing the failing case as a regression test. The bug returns.

Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/03_agent_observability_debugging/, learning/04_ai_product_evals/00_ai_evals_release_gates/


Q: "Your agent keeps looping. How do you find why from traces?"

Tags: senior · common · debugging · source: standard senior debugging probe; reported in 2026 agent-platform loops

Answer outline:
- Loops show up as identical or near-identical sub-trees repeating in the trace. Sort the trace's spans by depth; look for clusters of the same operation.
- Common loop causes:
  - Tool returning the same result each iteration: the agent sees the same observation, plans the same action, and repeats. Fix: check tool idempotency / state; gate on the agent recognizing repeated observations.
  - Stopping condition never matches: the agent's "are we done?" check is too strict. Fix: relax the criterion; add a max-steps guard.
  - Planning prompt has no early-exit instruction: the model keeps "thinking" because nothing told it to stop. Fix: an explicit prompt-level stopping rule.
  - Long-context degradation: as conversation history grows, the model loses early-turn context and re-plans from scratch. Fix: summarize history, trim, or pin a key-context block.
  - Reward-hacking under self-reflection: the agent self-critiques its plan, "improves" it identically, and repeats. Fix: cap reflection iterations.
- Hard guardrails always: max steps per agent run (10-50), max wall-clock per run, max tokens per run. Trip → terminate with a structured "I tried but didn't complete" output.
- In traces: a max-steps termination shows up as a guardrail-block span at the end. Track the frequency of max-steps terminations; that's your "stuck rate" KPI.
- Numbers to drop: "max-steps default: 10-20 for product agents, 50+ for research", "stuck-rate alarm: >5% of runs hitting max-steps", "loop detection: 3 identical consecutive tool calls → kill"

Common follow-ups: - "Why is max-steps not enough?" - "How do you detect 'identical' tool calls?"

Traps: - Only adding max-steps without instrumenting the reason. You stop the loop but don't fix it.

Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/03_agent_observability_debugging/, learning/01_ai_engineering/01_agentic_system_design/
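
The "3 identical consecutive tool calls → kill" rule needs a definition of "identical". One reasonable sketch, assuming tool args are JSON-serializable: fingerprint each call by hashing a canonical JSON form so that key order does not matter. The function names and the threshold default are illustrative, not taken from any framework.

```python
import hashlib
import json


def call_fingerprint(tool_name, args):
    """Hash a canonical JSON form of the call so key order is irrelevant."""
    canonical = json.dumps({"tool": tool_name, "args": args},
                           sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def detect_loop(tool_calls, threshold=3):
    """Return the index where `threshold` identical consecutive calls
    complete, or None if no such run exists."""
    run_length = 0
    previous = None
    for index, (name, args) in enumerate(tool_calls):
        fingerprint = call_fingerprint(name, args)
        run_length = run_length + 1 if fingerprint == previous else 1
        previous = fingerprint
        if run_length >= threshold:
            return index
    return None


calls = [
    ("search", {"q": "refund policy"}),
    ("search", {"q": "refund window", "limit": 5}),
    ("search", {"limit": 5, "q": "refund window"}),  # same call, keys reordered
    ("search", {"q": "refund window", "limit": 5}),
]
loop_at = detect_loop(calls)
```

With these inputs loop_at is 3, the index of the third repeated call. Exact matching misses near-duplicates (for example, a query rephrased by one word), so this catches only the crudest loops; max-steps remains the backstop.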


Q: "You see one trace where the LLM refused unexpectedly. How do you investigate?"

Tags: senior · common · debugging · source: standard senior trace-debug probe; reported in 2026 AI engineer loops

Answer outline:
- Step 1: read the actual prompt + completion in the trace. Was the refusal model-side ("I can't help with that") or guardrail-side (blocked by an output filter)?
- Step 2: if guardrail-side, which classifier blocked? Score, threshold, action. Was the input genuinely on the policy edge or a false positive?
- Step 3: if model-side, what triggered the refusal? Common causes: the input contains a phrase the model interprets as harmful (a false positive from the model's safety training), retrieved context included a "do not respond" instruction (indirect prompt injection), or conversation history contains a poisoned earlier turn.
- Step 4: reproduce in a sandbox. Same model version, same prompt, see if the refusal repeats. Sometimes it's stochastic and resampling fixes it.
- Step 5: fix by class:
  - One-off false positive → retry-with-different-prompt fallback in the application.
  - Repeatable false positive → add to refusal-tuning data or adjust guardrail threshold.
  - Indirect injection signal → add quarantine of retrieved content.
- Step 6: track refusal-rate as a continuous metric, alarm on spikes. Over-refusal is a UX killer.
- Numbers to drop: "refusal-rate SLO: <5% on benign traffic", "false-positive audit on 1% sampled refusals weekly", "model-side refusal at T=0 is more deterministic than T>0"

Common follow-ups: - "How do you distinguish a real refusal from a hallucinated 'I can't'?" - "What's the difference between over-refusal and under-refusal?"

Traps: - Treating one refusal as anecdote. Aggregate first; one trace is a starting point, not a fix.

Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/03_agent_observability_debugging/, learning/03_ai_security_safety/00_safety_guardrail_design/


PII & sensitive data in traces

Q: "How do you handle PII in observability traces?"

Tags: senior · common · design · source: standard senior privacy-aware observability probe; reported in 2026 regulated-industry loops

Answer outline:
- The fundamental tension: traces contain the most PII of any logs (full prompts, retrieved docs, tool args), and they're the most valuable for debugging. You cannot just "not log them".
- Strategy: redact at the SDK / instrumentation layer, before the data leaves the application process.
  - PII detector (regex + NER) on prompt / completion / tool args.
  - Replace with placeholders ([EMAIL_1], [NAME_2]); store the placeholder→value map in an encrypted, scoped session store (NOT in the trace).
  - Rehydrate only at user-facing render, never in the trace store.
- Configurable per-tenant: paid-tier or regulated tenants may opt for "no prompts stored at all"; others accept redacted prompts.
- Retention: redacted traces 30-90 days hot; full traces (if any) 7 days max with strong access controls.
- Access control: trace data is access-logged. Engineers see redacted by default; raw access requires elevated permission and an audit log entry.
- Provider-side: API providers also log requests; check provider retention settings (Anthropic / OpenAI offer zero-retention enterprise tiers).
- Numbers to drop: "redaction at SDK layer adds ~50-100ms per emit", "audit log on trace access for compliance", "encrypted placeholder map scoped to session, TTL <24h"

Common follow-ups: - "What if redaction breaks the trace's usefulness?" - "How do you handle PII in retrieved documents that appear in traces?" - "What about the provider's logs?"

Traps: - Redacting ad hoc in application code. Any code path that skips the helper leaks; enforce redaction once in the instrumentation layer instead. - Forgetting that retrieved content (RAG chunks) also enters the trace and may contain PII.

Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/03_agent_observability_debugging/, learning/03_ai_security_safety/00_safety_guardrail_design/
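
The placeholder scheme above can be sketched with regex detection only. This is a deliberately small illustration: the two patterns are rough assumptions that will miss many real formats, NER-based detection (names, addresses) is omitted, and the mapping is a plain dict standing in for the encrypted session store. The redact and rehydrate names are hypothetical.

```python
import re

# Rough illustrative patterns; production detectors need far more coverage.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def redact(text, mapping):
    """Replace PII matches with stable placeholders.

    `mapping` (placeholder -> original value) is updated in place and must
    be stored outside the trace."""
    def substitute(label):
        def _replace(match):
            value = match.group(0)
            for placeholder, original in mapping.items():
                if original == value:
                    return placeholder  # same value, same placeholder
            index = sum(1 for key in mapping if key.startswith(f"[{label}_")) + 1
            placeholder = f"[{label}_{index}]"
            mapping[placeholder] = value
            return placeholder
        return _replace

    for label, pattern in PATTERNS.items():
        text = pattern.sub(substitute(label), text)
    return text


def rehydrate(text, mapping):
    """Restore original values; only for user-facing render, never for storage."""
    for placeholder, value in mapping.items():
        text = text.replace(placeholder, value)
    return text


session_map = {}
original = "Mail alice@example.com or bob@example.com, then alice@example.com again"
redacted = redact(original, session_map)
```

Here redacted reads "Mail [EMAIL_1] or [EMAIL_2], then [EMAIL_1] again". Reusing the same placeholder for a repeated value keeps the trace debuggable: you can still see that two mentions refer to the same entity without seeing the entity.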


Connecting traces to evals

Q: "How do you connect production traces back to your eval suite?"

Tags: senior · very-common · design · source: standard senior eval/observability bridge probe; reported across 2026 AI engineer loops

Answer outline:
- This is the closed loop the senior interviewer is probing for. Without it, evals become stale and disconnected from real failure modes.
- Loop:
  - Step 1: sample traces (typically 1-10% of production) for offline eval.
  - Step 2: grade with LLM-judge (faithfulness, correctness, helpfulness) plus human review on a smaller slice.
  - Step 3: cluster low-scoring traces to find common failure patterns.
  - Step 4: promote representative failures into the golden eval set. Each becomes a permanent regression test.
  - Step 5: retrain / fix against the expanded eval set.
- Tooling: Langfuse / LangSmith / Arize all support exporting filtered trace sets to eval datasets. Most have built-in LLM-judge eval workflows.
- Cadence: weekly triage of bottom-decile traces is the typical senior-team rhythm. Monthly golden-set growth review.
- Track as a KPI: "% of traces below quality threshold". It should trend down as the eval set grows and you fix issues.
- Senior tell: the candidate names the promotion pipeline (production failure → labeled example → CI regression case) as the differentiator from "we have evals".
- Numbers to drop: "sample 1-10% of production for LLM-judge eval", "promote 5-20 new failure cases per week", "calibrate LLM judge to ≥85% agreement with human reviewers before trusting"

Common follow-ups: - "Where does the LLM-judge eval run — synchronously or async?" - "How do you keep the golden set from overfitting to your judge?" - "What's your trace → eval cadence?"

Traps: - Treating evals and traces as separate systems. The whole point is the loop. - LLM-judge without human calibration. Your "quality scores" are then meaningless.

Related cross-cutting: Production patterns
Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/, learning/01_ai_engineering/03_agent_observability_debugging/
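
Two pieces of this loop are mechanical enough to sketch: checking judge/human agreement before trusting the judge, and promoting low-scoring traces into regression cases. The record fields (trace_id, input, output, judge_score), the 0.5 score cutoff, and both function names are assumptions for illustration; the 85% bar is the figure from the outline. Raw agreement is used here for simplicity, though a chance-corrected statistic would be stricter.

```python
def agreement_rate(judge_labels, human_labels):
    """Fraction of examples where the LLM judge matches the human label."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def promote_failures(traces, score_cutoff=0.5):
    """Turn low-scoring traces into regression cases for the golden set."""
    return [
        {
            "input": trace["input"],
            "bad_output": trace["output"],
            "source_trace_id": trace["trace_id"],
        }
        for trace in traces
        if trace["judge_score"] < score_cutoff
    ]


judge = ["pass", "fail", "pass", "pass"]
human = ["pass", "fail", "fail", "pass"]
rate = agreement_rate(judge, human)  # 3 of 4 labels match
judge_is_trusted = rate >= 0.85

sampled = [
    {"trace_id": "t1", "input": "reset my password", "output": "Done.", "judge_score": 0.9},
    {"trace_id": "t2", "input": "cancel my plan", "output": "I cannot.", "judge_score": 0.2},
]
regression_cases = promote_failures(sampled)
```

In this toy data the judge agrees with humans on 75% of labels, below the 85% bar, so its scores should not yet gate anything. Keeping source_trace_id on each promoted case preserves the link back to the production trace.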


Q: "How do you find rare failure modes hidden in millions of traces?"

Tags: staff · common · scenario · source: standard staff-level trace-analysis probe; reported in 2026 AI platform loops

Answer outline:
- Rare failures by definition don't show on aggregate dashboards. You need a different surface:
  - LLM-judge on a sampled slice: filter the sample to the bottom-decile by quality score. The judge's "I'm not sure about this answer" cases concentrate the failures.
  - Outlier detection on numeric attributes: top-1% latency, top-1% token-count, top-1% cost, top-1% conversation depth. The tails are where weird things live.
  - Embedding-based clustering: embed every trace's user prompt; cluster; review the small clusters. Rare failure modes often cluster tightly.
  - User-feedback signal: thumbs-down, escalations, retries. Each is a candidate failure; sample and triage.
  - Specific-pattern scan: scan for known bad patterns (refusal on benign input, tool error, max-steps termination, output schema mismatch).
- Don't try to manually review millions of traces. The interviewer wants to hear about triage tooling: automated filters that narrow the candidate set down to hundreds or thousands of traces for human review.
- Senior tells: the candidate names a cluster-and-sample approach rather than "sample randomly", and names a feedback signal (user-side) as the highest-priority surface.
- Numbers to drop: "embedding cluster review: 10-50 clusters from 1M traces", "human-review budget: 100-500 traces/week per engineer", "feedback-signal coverage: ~1-5% of traffic"

Common follow-ups: - "How do you decide which cluster to review first?" - "What's an example failure mode you found this way?"

Traps: - Random sampling for rare-mode discovery. You'll miss the rare ones by definition.

Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/03_agent_observability_debugging/, learning/04_ai_product_evals/00_ai_evals_release_gates/
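
Three of the surfaces above (numeric tails, user feedback, known-pattern scan) reduce to simple filters that can be combined into one candidate set. The sketch below assumes each trace is a flat dict with hypothetical field names (latency_ms, total_tokens, cost_usd, user_feedback, termination_reason). Embedding clustering and judge scoring are left out because they need external models.

```python
def top_fraction(traces, key, fraction=0.01):
    """Return the traces in the top `fraction` by a numeric field (at least one)."""
    ordered = sorted(traces, key=lambda t: t[key], reverse=True)
    count = max(1, int(len(ordered) * fraction))
    return ordered[:count]


def candidate_set(traces):
    """Map trace_id -> set of reasons the trace deserves human review."""
    picked = {}
    # Surface 1: tails of numeric attributes.
    for key in ("latency_ms", "total_tokens", "cost_usd"):
        for trace in top_fraction(traces, key):
            picked.setdefault(trace["trace_id"], set()).add(f"top_{key}")
    for trace in traces:
        # Surface 2: explicit user-feedback signal.
        if trace.get("user_feedback") == "thumbs_down":
            picked.setdefault(trace["trace_id"], set()).add("user_feedback")
        # Surface 3: known bad pattern (agent hit its step limit).
        if trace.get("termination_reason") == "max_steps":
            picked.setdefault(trace["trace_id"], set()).add("max_steps")
    return picked


# 200 synthetic traces: latency grows with the ID, everything else is flat.
traces = [
    {"trace_id": i, "latency_ms": i, "total_tokens": 100, "cost_usd": 0.01}
    for i in range(200)
]
traces[5]["user_feedback"] = "thumbs_down"
traces[7]["termination_reason"] = "max_steps"
candidates = candidate_set(traces)
```

Recording the reasons per trace, rather than a bare list of IDs, lets reviewers prioritize traces flagged by several surfaces at once.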


Cost & sampling

Q: "How do you sample traces without losing visibility into bugs?"

Tags: senior · common · design · source: standard senior observability-cost probe; reported in 2026 platform loops

Answer outline:
- The 2026 default: 100% sample for errors / refusals / high-priority traffic, lower sample for everything else.
- Implementation: head-sampling (decide when the trace starts, in the SDK, from attributes known up front) vs tail-sampling (decide at the OTel Collector after seeing the full trace).
- Tail-sampling is better for LLM apps because the "interesting" trace attribute (errors, long latency, high cost, low quality score) is only known after the trace completes. The OTel Collector buffers spans briefly, decides per-trace, then exports.
- Tail-sampling policies:
  - Always-keep: errors, traces with refusal, traces over latency threshold, traces tagged for paid tenants.
  - Probabilistic: 1-10% of remaining traces.
  - Rate-limited: cap on per-tenant trace volume so one chatty tenant doesn't flood storage.
- Storage tiers: hot (queryable, 7-30 days) for active debugging; cold (S3 / GCS, 90-180 days) for compliance and back-analysis.
- For metrics: never sample. Aggregate at 100% so p99 is honest. Spans can be sampled, but metrics derived from spans must be computed before the sampling step (or emitted separately); otherwise they inherit the sampling bias.
- Numbers to drop: "tail-sample window: 30-60s buffer in OTel Collector", "errors: 100%, normal: 1-10%, high-cost-tenant: 100%", "hot retention: 7-30 days; cold: 90-180 days"

Common follow-ups: - "What's tail-sampling vs head-sampling?" - "How do you avoid losing context when a sampled trace's parent is dropped?"

Traps: - Head-sampling only. You'll miss most of the interesting failures. - Dropping span data without keeping the rolled-up metrics at 100%.

Related cross-cutting: Cost & latency, Production patterns Related module: learning/01_ai_engineering/03_agent_observability_debugging/


Q: "Your observability bill is bigger than your LLM bill. What do you do?"

Tags: senior · occasional · scenario · source: standard senior cost-incident probe; reported in 2026 platform loops

Answer outline: - This is more common than people admit. SaaS observability priced per-event or per-GB can outpace LLM costs at scale. - Diagnose first. Pull the bill by source: which spans dominate? Often it's auto-instrumentation emitting spans you don't actually use (HTTP middleware spans, database query spans, framework-internal spans). - Levers: - Prune span volume: turn off auto-instrumented spans you don't use. Many frameworks emit 5-10× more spans than you need. - Tail-sample harder: drop normal-traffic traces to 1% if your error/quality coverage stays intact. Keep all errors and slow traces. - Strip unused attributes: each span attribute costs bytes. Drop ones you never query. - Move to OTel Collector + cheaper backend: from a per-event SaaS to a self-hosted Langfuse + ClickHouse stack or a Tempo/Grafana stack. Significant savings at high volume. - Cold-tier old data: hot retention 14 days, cold-tier S3 for 90-180 days. Move 90% of storage cost off the hot tier. - Always: keep 100% errors, 100% high-priority / paid-tier traffic, 100% metrics. Sample only the bulk normal traffic. - Numbers to drop: "SaaS observability at high volume: $0.0005-0.005/span; self-host typically 3-10× cheaper at 100M+ spans/month", "expect to cut span volume 50-80% without losing debugging signal"
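The "diagnose first, then prune" step is simple arithmetic, sketched below. The span sources, monthly counts, and the $0.001/span price are made-up illustration values (the price sits inside the range quoted above), and `rank_sources` / `projected_monthly_cost` are hypothetical helper names.

```python
def rank_sources(monthly_counts):
    """Return (source, span_count, share_of_total) tuples, largest source first."""
    total = sum(monthly_counts.values())
    ranked = sorted(monthly_counts.items(), key=lambda kv: kv[1], reverse=True)
    return [(source, count, count / total) for source, count in ranked]


def projected_monthly_cost(monthly_counts, keep_rates, price_per_span):
    """Estimate the bill after pruning or sampling.

    keep_rates maps a source to the fraction of its spans still exported;
    sources that are not listed are assumed to be kept at 100%.
    """
    return sum(
        count * keep_rates.get(source, 1.0) * price_per_span
        for source, count in monthly_counts.items()
    )


# Hypothetical monthly span counts, grouped by the instrumentation that emitted them.
monthly_counts = {
    "http_middleware": 40_000_000,
    "db_query": 30_000_000,
    "llm_call": 20_000_000,
    "tool_call": 10_000_000,
}

before = projected_monthly_cost(monthly_counts, {}, price_per_span=0.001)
# Lever: switch off the two auto-instrumented sources nobody queries.
after = projected_monthly_cost(
    monthly_counts, {"http_middleware": 0.0, "db_query": 0.0}, price_per_span=0.001
)
print(rank_sources(monthly_counts)[0][0], round(before), round(after))
```

With these invented numbers the bill drops from about $100,000 to about $30,000 a month, a 70% cut in span volume that leaves every LLM and tool span untouched. The real split has to come from your own billing export.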

Common follow-ups: - "What's the ops cost of self-hosting?" - "How do you tell which spans are useless?"

Traps: - Reducing sample rate without checking that you keep 100% errors. Costs go down; debug-ability disappears.

Related cross-cutting: Cost & latency, Production patterns Related module: learning/01_ai_engineering/03_agent_observability_debugging/


Scenario / design

Q: "Design observability for a multi-tenant agent platform."

Tags: staff · common · design · source: standard staff-level platform-design probe; reported in 2026 AI infra loops

Answer outline: - Multi-tenancy adds dimensions: per-tenant isolation, per-tenant cost attribution, per-tenant quality SLO, regulated-tenant data residency. - Span model: every span tagged with tenant.id, request.id, agent.run_id, model.version, prompt.version. Backend must support filtering and aggregation by these. - Tenant tiers (sampling, retention, isolation): - Standard tenants: shared backend, redacted prompts, sample 1-10%, 14-30 day retention. - High-priority / paid tenants: 100% sampling, longer retention, dedicated dashboards. - Regulated tenants (HIPAA, GDPR data-residency): isolated storage in-region; possibly self-hosted observability per tenant. - Cost attribution: per-tenant rollup of LLM cost, observability cost, GPU cost. Show it in an internal dashboard; use it for billing or unit-economics analysis. - Quality SLOs per-tenant: some tenants accept more refusals; some need faster TTFT; eval thresholds configurable per-tenant. - Access control: engineer access to traces requires tenant scoping. No global "see all traces" role for support; access needs an explicit per-tenant grant. - Cross-cutting metrics: aggregate across tenants for platform health (overall p99 latency, refusal rate, cost trends), and break the same metrics down per tenant during incidents. - Numbers to drop: "trace volume per tenant ranges 100-1M/day", "per-tenant dashboard for top 50 customers; aggregate for the rest", "trace access audit logged"
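Two of the points above, mandatory tenant tagging at ingest and tenant-scoped access with an audit trail, can be sketched in a few lines. This is an in-memory toy under stated assumptions: `TraceStore`, its `grants` map, and the audit record shape are hypothetical, and a real backend would enforce the same rules in its query layer.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Attribute names taken from the span model in the answer outline.
REQUIRED_SPAN_ATTRS = (
    "tenant.id", "request.id", "agent.run_id", "model.version", "prompt.version",
)


@dataclass
class TraceStore:
    """Hypothetical in-memory store that enforces tagging and tenant-scoped reads."""
    spans: list = field(default_factory=list)
    grants: dict = field(default_factory=dict)      # engineer -> set of tenant IDs
    audit_log: list = field(default_factory=list)

    def ingest(self, attrs: dict) -> None:
        # Reject untagged spans at write time; retroactive tagging is the painful path.
        missing = [name for name in REQUIRED_SPAN_ATTRS if not attrs.get(name)]
        if missing:
            raise ValueError(f"span missing required attributes: {missing}")
        self.spans.append(attrs)

    def query(self, engineer: str, tenant_id: str) -> list:
        allowed = tenant_id in self.grants.get(engineer, set())
        # Log every attempt, including denied ones, so auditors can see both.
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "engineer": engineer,
            "tenant.id": tenant_id,
            "allowed": allowed,
        })
        if not allowed:
            raise PermissionError(f"{engineer} has no grant for tenant {tenant_id}")
        return [s for s in self.spans if s["tenant.id"] == tenant_id]
```

There is deliberately no "query all tenants" method: cross-tenant views would be served from pre-aggregated metrics rather than raw traces.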

Common follow-ups: - "How do you handle a tenant requesting their full trace history?" - "What if a tenant's compliance audit needs evidence?" - "Cross-tenant aggregate vs per-tenant — what gets which?"

Traps: - Single global trace store with no tenant tag. Once it's a billion spans, retroactive tenant tagging is painful. - No access-log on trace access. Compliance auditors will ask.

Related cross-cutting: Production patterns, Architecture choices Related module: learning/01_ai_engineering/03_agent_observability_debugging/, learning/02_ai_infrastructure/04_ml_platform_operations/


Q: "What's the difference between LLM observability and traditional APM (application performance monitoring)?"

Tags: mid · common · conceptual · source: standard senior platform-design probe; reported in 2026 LLM platform loops

Answer outline: - Overlap is large; the difference is in what signals matter. - APM (Datadog, New Relic, Dynatrace) focuses on: latency, error rate, throughput, dependency tracing, infrastructure metrics, exception capture. Designed for deterministic services. - LLM observability adds: - Content capture: prompt, completion, retrieved chunks, tool args. Without these you can't debug "why did the model say that?" - Cost-per-call: tokens-in, tokens-out, $/call as first-class metrics. - Quality signals: LLM-judge faithfulness, refusal rate, hallucination flags — not just up/down. - Agent-loop structure: planning spans, tool spans, retrieval spans, deeply nested. - Model + prompt versioning: every trace tagged with which version produced it. - Hybrid: many teams use Datadog APM for infra and Langfuse/LangSmith for LLM-specific traces. Some APM vendors are adding LLM features (Datadog LLM Observability, New Relic AI Monitoring). - The pragmatic answer: don't replace your APM with LLM observability — they're complementary. Run both, link by trace ID, send LLM-specific spans to the LLM-observability tool. - Numbers to drop: "LLM observability span attributes: 20-30 standard fields. APM spans: typically 5-15.", "many 2026 teams run dual-stack: APM + LLM-observability linked by trace ID"
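The "run both, link by trace ID" pattern can be sketched without any vendor SDK. In the example below the `gen_ai.*` keys are believed to match the OpenTelemetry GenAI semantic conventions, but verify them against the current spec; the `app.*` keys, the `example-model` price table, and the helper names are hypothetical.

```python
import secrets

# Hypothetical per-1K-token prices; real prices depend on provider and model.
PRICE_PER_1K = {"example-model": {"input": 0.003, "output": 0.015}}


def new_trace_id() -> str:
    """Generate a 16-byte trace ID as 32 hex characters (the W3C Trace Context shape)."""
    return secrets.token_hex(16)


def call_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    price = PRICE_PER_1K[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1000


def llm_span(trace_id, model, input_tokens, output_tokens, prompt_version):
    """Build the LLM-specific span: cost and versioning as first-class fields."""
    return {
        "trace_id": trace_id,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "app.prompt.version": prompt_version,  # custom attribute, not a standard name
        "app.cost.usd": call_cost_usd(model, input_tokens, output_tokens),
    }


def join_by_trace_id(apm_spans, llm_spans):
    """Group spans from both tools under the trace ID they share."""
    joined = {}
    for kind, spans in (("apm", apm_spans), ("llm", llm_spans)):
        for span in spans:
            joined.setdefault(span["trace_id"], {"apm": [], "llm": []})[kind].append(span)
    return joined


trace_id = new_trace_id()
apm = [{"trace_id": trace_id, "name": "POST /chat", "duration_ms": 1840}]
llm = [llm_span(trace_id, "example-model", 1000, 500, "v7")]
linked = join_by_trace_id(apm, llm)
```

The APM span answers "was the endpoint slow?" and the LLM span answers "which model and prompt version ran, and what did it cost?". Because both carry the same trace ID, an engineer can pivot from one tool to the other during an incident.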

Common follow-ups: - "Why can't classical APM just handle this?" - "What's the integration pattern between them?"

Traps: - Treating LLM observability as APM replacement. They serve different debugging needs.

Related cross-cutting: Production patterns Related module: learning/01_ai_engineering/03_agent_observability_debugging/