Evaluation in Production — Interview Questions¶

The differentiator at senior loops. Most candidates can talk RAG; few can talk eval rigorously. The defining question of 2026 eval rounds: "Is there an actual eval framework here, or is it vibes-based?"

Offline evals & golden sets¶

Q: "How do you evaluate a chatbot?"¶

Tags: screen · very-common · conceptual · source: Adil Shamim — 100+ AI engineer interviews, 2026

Answer outline: - Reject the question as posed — "evaluate a chatbot" has at least three layers: task-level (did it answer correctly), conversation-level (did the multi-turn flow stay coherent), and product-level (did the user achieve their goal). - Layer 1 — task: golden set of 200-500 representative turns with reference answers or rubrics; score with LLM-as-judge plus deterministic checks (format, refusal triggers, PII leaks). - Layer 2 — conversation: rollout simulations with a user-simulator LLM, score on goal completion, turn count, recovery from misunderstanding. - Layer 3 — product: online metrics — CSAT, deflection rate, escalation rate, retry rate, thumbs-down rate. - Never trust a single number. Senior tell: candidate names both a binary "did it work" metric and an open-ended quality metric. - Numbers to drop: "golden set of 200-500 turns", "1-10% online sampling for LLM-judge scoring", "target >85% judge-human agreement before trusting it"

Common follow-ups: - "What about multi-turn — how do you score a 12-turn conversation?" - "Your CSAT is flat but task accuracy went up — what's happening?"

Traps: - Listing BLEU/ROUGE for a chatbot — these are translation/summarization metrics, basically useless for conversational quality. - Conflating "I have evals" with "I have a number that goes up". Senior interviewers probe for whether you actually look at traces. - Only offline evals, no online signal — production reality always differs from your golden set.

Related cross-cutting: Production patterns Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "What's a golden dataset? How do you keep it fresh?"¶

Tags: mid · very-common · conceptual · source: Maxim AI golden dataset guide; Arize AI golden dataset post, 2026

Answer outline: - A golden dataset is a static, version-controlled set of 200-500 inputs with expected behaviors (reference answers, must-include facts, must-not-include strings, or pass/fail rubrics) — the deterministic gate every release must clear. - Sources: hand-curated by domain experts (~30 to start, grow until no new failure modes emerge), bug reports from production, adversarial red-team prompts, edge cases discovered in shadow traffic. - Freshness loop: triage production failures weekly, promote new failure classes into the golden set, version it like code (golden-v3.json with changelog). - "Rot" is the killer — Maxim's HR-chatbot example: 99% offline pass rate, then an equity-plan announcement creates a whole new query class missing from the golden set. - Pair the deterministic golden set with random production sampling — golden set catches known regressions, sampling surfaces unknown failures. - Numbers to drop: "200-500 examples", "weekly triage adding 5-20 new cases", "version every change — golden-v3.json semver"

Common follow-ups: - "How big should it be?" - "What stops it from getting stale?" - "Random sampling vs golden set — which is better?"

Traps: - Treating the golden set as immutable — it should grow with every novel failure. - Letting domain experts hand-curate forever; need a pipeline to promote real user traffic in. - Mixing training data with eval data — golden set MUST be held out, including from any few-shot prompt examples.

Related cross-cutting: Production patterns Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "How do you build a golden dataset for evaluation?"¶

Tags: mid · common · scenario · source: Adil Shamim — 100+ AI engineer interviews, 2026

Answer outline: - Step 1 — taxonomy first. Enumerate user intents, edge cases, known failure modes. Without taxonomy you cannot tell whether your set has coverage. - Step 2 — seed with ~30 examples per intent, hand-labeled by a domain expert with detailed critiques (not just pass/fail). - Step 3 — synthetic augmentation. Use an LLM to paraphrase, translate, add typos, lengthen, shorten — but every synthetic example needs human spot-check before it counts as golden. - Step 4 — adversarial mining. Take 100 random production queries, find the worst 10 by current-system metrics, promote them. - Step 5 — stratify by intent + difficulty so you can report metrics per slice, not just a single average that hides regressions. - Numbers to drop: "30 examples per intent to start", "stratified — 60% happy path / 30% edge / 10% adversarial", "spot-check 100% of synthetic before promotion"

Common follow-ups: - "How do you use it for regression testing?" - "How do you avoid overfitting to your golden set?"

Traps: - All happy-path examples — your CI will be green while production burns. - One global average score — you need per-slice metrics to catch regressions on minority intents. - Forgetting that the golden set leaks into prompt engineering iterations, which silently overfits prompts to the eval.

Related cross-cutting: Architecture choices Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "Difference between offline and online evals."¶

Tags: screen · very-common · conceptual · source: standard senior AI loop opener; Devinterview / MyEngineeringPath 2026

Answer outline: - Offline = run the new system on a fixed dataset before shipping. Fast, reproducible, deterministic gate in CI. Catches regressions on known inputs. - Online = measure real production traffic after shipping. Captures distribution shift, real user behavior, real failure modes. Slow to iterate, statistical noise, requires sampling. - They answer different questions. Offline: "did I break what I had?" Online: "do users actually do better?" - Healthy systems run both: offline as a release gate, online as a continuous monitor with per-feature dashboards. - The classic trap — only-offline teams ship a model that ace's the golden set then degrades on real traffic because their golden set was 6 months stale. - Numbers to drop: "offline runs in 5-15 min on 200-500 examples in CI", "online sampling 1-10% of production traffic for LLM-judge scoring", "two-week ramp from canary to 100%"

Common follow-ups: - "Which one do you trust more?" - "Walk me through your offline-to-online handoff."

Traps: - Saying "offline is for accuracy, online is for latency" — both measure quality dimensions, the difference is distribution, not metric type. - Pretending offline is sufficient — every senior interviewer has been burned by an offline-green release that bombed in prod.

Related cross-cutting: Production patterns Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "How would you evaluate and monitor a model in production, not just offline?"¶

Tags: senior · common · design · source: Adil Shamim — 100+ AI engineer interviews, 2026

Answer outline: - Three-tier monitoring: (1) cheap heuristics on 100% of traffic — latency, token cost, refusal rate, format-validity rate; (2) LLM-judge on a sampled slice — 1-10% sampling; (3) human review on a smaller slice — bottom-decile by confidence or flagged by judge. - Define guardrail metrics that must not regress (refusal rate, PII leak rate, latency P95) versus optimization metrics that should improve (CSAT, task-completion). - Alarms on rates, not values — "faithfulness <0.85 for 30 minutes" is actionable, "one bad response" is not. - Feed every flagged trace back into the golden-set candidate pool — closes the loop. - Numbers to drop: "100% cheap heuristics", "1-10% LLM-judge sample", "0.1-1% human review on flagged"

Common follow-ups: - "How do you avoid alert fatigue?" - "What's your kill-switch criterion?"

Traps: - LLM-judging 100% of traffic — at 10M queries/day and $0.001/judge call, that's $10K/day in just eval. - Single-metric dashboards — you need a small set of guardrails plus a small set of optimization metrics, not 50 charts.

Related cross-cutting: Cost & latency Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "Walk me through an eval framework you built — what were the failure modes?"¶

Tags: senior · very-common · scenario · source: standard senior loop question; Hamel Husain "Your AI Product Needs Evals"

Answer outline: - Structure: (1) error taxonomy from looking at ~100 real traces, (2) golden set built around the taxonomy, (3) LLM-judge aligned to a domain expert, (4) CI gate on PR, (5) online sampling with the same judge. - Real failure modes worth naming: judge drifted when we upgraded GPT-4 versions (judge-validation broke silently); golden set overfit because prompt engineering was tuned against it; one engineer kept editing golden labels to make CI green; per-slice metrics were averaged and hid a 20-point regression on the rarest intent. - The Hamel point: "Remove ALL friction from looking at data" — most failure modes are caught by humans reading 50 traces, not by a metric. - Honeycomb-style: ~3 iterations to hit >90% judge-human agreement; if you can't get there, your rubric is too vague. - Numbers to drop: "100 real traces to build taxonomy", "3 iterations to >90% agreement", "judge re-validated every 30 days or on model upgrade"

Common follow-ups: - "What did you cut that didn't work?" - "How did you onboard a new engineer to this framework?"

Traps: - Listing metrics without naming a failure mode — senior interviewers want the story, not the architecture diagram. - Saying "we used a 1-5 Likert scale" — Hamel's warning: that's basically vibes-based.

Related cross-cutting: Production patterns Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "Your eval metric went up but users complain more. What's wrong?"¶

Tags: senior · common · debugging · source: standard senior loop; debugging scenario

Answer outline: - Goodhart's Law applied to LLMs — your metric became the target, the underlying property drifted. - Diagnose by checking: (1) golden set composition vs production distribution — has user traffic shifted to a class you don't cover? (2) judge drift — did the judge model upgrade silently change scoring? (3) metric definition — is "answer relevance" capturing helpfulness or just topic match? (4) per-slice — did you regress on a critical minority slice while the average improved? - Specific symptom mapping: complaints about verbosity → verbosity bias in judge inflating scores; complaints about confident-wrong → faithfulness metric only checked claim presence, not correctness; complaints about tone → no tone metric in eval at all. - Fix: sample 50 complaint traces, run them through your eval, find where the metric disagreed with the user. That gap is your taxonomy gap. - Numbers to drop: "sample 50 complaint traces", "per-slice not just global average", "judge re-validation if model version changed"

Common follow-ups: - "Could the eval be right and the users wrong?" (rarely — usually no.) - "How would you change the eval based on this?"

Traps: - Blaming the users. - Adding more metrics instead of fixing the broken one — you don't dilute a bad signal by adding more, you fix it.

Related cross-cutting: Production patterns Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "Is there an actual eval framework here, or is it vibes-based?"¶

Tags: staff · very-common · conceptual · source: Adil Shamim — 100+ AI engineer interviews, 2026; defining 2026 question

Answer outline: - Vibes-based = engineer ships a prompt change because "looks better in 5 examples I tried in the playground." No regression suite, no gate, no record of decision. - Framework-based = (1) error taxonomy, (2) versioned golden set, (3) judge with validated human agreement, (4) CI gate that blocks merge, (5) online sampling, (6) feedback loop from production failures to golden set. - Tell whether you have a framework: can you answer "what's our faithfulness number this week vs last week, broken down by intent, with a confidence interval"? If no, vibes. - The interview signal — candidate who admits past projects were vibes-based and explains how they migrated. Senior engineers have always inherited vibes-based systems; they know the path out. - Numbers to drop: "5 component checklist", ">90% judge-human agreement", "per-slice not just global average"

Common follow-ups: - "What's the minimum framework you'd accept for a feature with 1000 users?" - "How do you stop teams from regressing into vibes-based?"

Traps: - Pretending you've never shipped on vibes — every honest engineer has, and interviewers know it. - Listing tools (Braintrust, LangSmith, Phoenix) without describing process — tools don't make a framework, process does.

Related cross-cutting: Production patterns Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "What are your testing strategies for non-deterministic outputs?"¶

Tags: mid · common · conceptual · source: Adil Shamim — 100+ AI engineer interviews, 2026

Answer outline: - Three test classes: (1) deterministic checks on structure — JSON parse, schema valid, required fields present, no forbidden strings; (2) semantic checks via embedding similarity or LLM-judge — answer relevant to question, faithful to context; (3) property-based — for any input matching predicate X, output must satisfy Y. - Set temperature=0 for deterministic regression tests where possible — accepts a small quality cost for reproducibility. - For genuinely stochastic outputs, run N=5-10 samples and assert distribution properties (e.g., 90% of samples must be faithful), not single-sample equality. - Snapshot tests with semantic diff — store last-known-good output, fail if new output's embedding similarity to old < 0.85, and a human review the diff. - Numbers to drop: "temperature=0 for regression", "N=5-10 samples for distribution properties", "snapshot similarity threshold 0.85"

Common follow-ups: - "What about flaky tests — how do you stop those?" - "Embedding similarity has its own failure modes — what?"

Traps: - String-equality testing — fails 100% of the time on the second run. - Single-sample testing for stochastic systems — gives you noise as a signal.

Related cross-cutting: Production patterns Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "Design an LLM evaluation pipeline that runs on every code change in CI."¶

Tags: senior · common · design · source: lockedin.ai LLM engineer interview Q51, 2026

Answer outline: - Components: (1) golden set of 100-500 cases stratified by intent, (2) deterministic checks (schema, banned strings, latency budget) run first — fail fast, (3) LLM-judge scoring on outputs that pass step 2, (4) per-slice score aggregation, (5) compare to baseline (main branch), (6) gate on regression > threshold. - Budget the CI run to 5-15 min: parallelize across N workers, sample large golden sets if needed, use cheaper judge in CI and expensive judge in nightly. - Two-tier gating: hard gate (refusal rate, schema validity, latency P95) blocks merge; soft gate (quality metrics) requires reviewer ack if regressed. - Cache eval results by (prompt-hash, model-version, golden-input-hash) — same change shouldn't rerun the same evals. - Numbers to drop: "100-500 cases", "5-15 min CI budget", "hard gate + soft gate", "cache by (prompt, model, input) hash"

Common follow-ups: - "What's your gate — % regression on what metric?" - "How do you stop a sneaky regression on a rare intent slice?"

Traps: - Running expensive eval on every commit — bankrupts you within a month. - One global pass/fail — gives you no signal on which dimension regressed.

Related cross-cutting: Production patterns Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "How do you avoid overfitting to your eval set?"¶

Tags: senior · common · conceptual · source: standard senior loop probe; ML hygiene 2026

Answer outline: - Three-way split — same hygiene as classical ML. Dev set for iteration (you look at it), validation set for CI gate (judges run on it, but you don't tune to it), holdout set used only for major releases (touched at most monthly). - Track number of times each example has influenced a decision — high influence count = overfit risk. - Production sampling is the antidote — your live traffic is the largest, freshest, untouched holdout that exists. - Refresh the iteration-set every quarter — promote from production, retire stale examples. - Numbers to drop: "3-way split: dev/CI/holdout", "holdout touched at most monthly", "production sampling = continuous fresh holdout"

Common follow-ups: - "How do you know you've overfit?" - "What's your refresh cadence?"

Traps: - Same engineer hand-tuning prompts against the same set used for the gate — guaranteed overfit. - Using the golden set as few-shot examples in production prompts — total leakage.

Related cross-cutting: Architecture choices Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

LLM-as-judge¶

Q: "Pairwise vs pointwise LLM-judge — when each?"¶

Tags: mid · very-common · conceptual · source: Eugene Yan "LLM-evaluators" post; standard senior loop

Answer outline: - Pointwise (direct scoring): one output, one score. Use when the criterion is objective and absolute — faithfulness ("is this claim in the context?"), policy violation ("does this contain PII?"), schema validity. - Pairwise: two outputs A vs B, pick winner. Use when the criterion is subjective and relative — "is this answer better?" — humans are also more reliable comparing than rating absolutely. - Pairwise gives more stable results and smaller variance — Eugene's data: pairwise correlated better with humans than pointwise on subjective tasks. - Pairwise cost is roughly 2x (need two generations), and you need to control for position bias by running both orderings. - Reference-based is a third mode — compare against gold answer. Use when you have a reference; degrades to fuzzy matching otherwise. - Numbers to drop: "pairwise 2x cost", "must run both A-then-B and B-then-A to control position bias", "MT-Bench pairwise gpt-4 agreement 85% vs human-human 81%"

Common follow-ups: - "How do you turn pairwise into a leaderboard?" - "When does pairwise break?"

Traps: - Using pointwise on inherently relative criteria like "helpfulness" — gives noisy 1-5 scores. - Using pairwise on auditable criteria like faithfulness — you can't ship "A is more faithful than B" to a compliance team. - Not controlling for position bias.

Related cross-cutting: Architecture choices Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "How do you validate your LLM-as-judge?"¶

Tags: senior · very-common · scenario · source: Hamel Husain LLM-judge post; Eugene Yan LLM-evaluators

Answer outline: - Treat the judge like a model you're shipping — needs its own labeled eval set. - Process (Hamel): pick principal domain expert, hand-label ~30 examples pass/fail with critiques, iterate the judge prompt until you hit >90% agreement with the expert. - Quantify with Cohen's κ (chance-adjusted agreement) not raw % — raw % inflates when class imbalance is severe. Target κ > 0.6 for usable, >0.8 for reliable. - Re-validate when: (a) underlying generator model changes, (b) judge model upgrades, (c) input distribution shifts, (d) every 30 days regardless. - Track per-class agreement — a judge can have 90% global agreement and 40% on the minority class you care about most. - Numbers to drop: "Honeycomb: >90% agreement in 3 iterations", "Cohen's κ > 0.6 usable / > 0.8 reliable", "MT-Bench: gpt-4 hit 85% agreement, human-human was 81%"

Common follow-ups: - "What does Cohen's κ correct for?" - "Your judge agrees with one annotator but not another — what now?"

Traps: - Using raw agreement % without κ — misleads on imbalanced classes. - Validating once at launch and never again — judge drift is real, especially across model upgrades. - Validating against the same engineers who wrote the judge prompt — circular.

Related cross-cutting: Architecture choices Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "What biases affect LLM-as-judge and how do you mitigate them?"¶

Tags: senior · very-common · conceptual · source: Eugene Yan; Justice or Prejudice paper, NeurIPS 2024; TianPan 2026

Answer outline: - Position bias — judge favors first or second option. Concrete: GPT-3.5 was biased 50% of the time, Claude-v1 70%. Mitigate: run both orderings, average; or use the Bradley-Terry model on pairwise outcomes. - Verbosity / length bias — longer responses win even when worse. Both Claude-v1 and GPT-3.5 preferred the longer response >90% of the time in one study. Mitigate: length-normalize (AlpacaEval 2.0 length-controlled win rate), or judge with explicit instruction to ignore length. - Self-enhancement bias — model prefers its own outputs. GPT-4 favored itself by 10pp win rate, Claude-v1 by 25pp. Mitigate: don't use the same model as generator and judge; or use a panel of judges from different families. - Format / Markdown bias — fancier formatting wins. Mitigate: strip formatting before judging, or instruct judge to ignore format. - Numbers to drop: "GPT-3.5 position-biased 50% of the time", "judges preferred longer response >90% of the time", "GPT-4 self-preference 10pp / Claude-v1 25pp"

Common follow-ups: - "What's the cheapest fix for position bias?" - "How would you detect self-preference quantitatively?"

Traps: - Claiming "we use GPT-4 so we don't have these problems" — GPT-4 has all of them, sometimes worse. - Position-bias check that just runs one ordering and shrugs.

Related cross-cutting: Architecture choices Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "Why binary pass/fail and not a 1-5 Likert scale for LLM-judge?"¶

Tags: senior · common · conceptual · source: Hamel Husain LLM-judge post, 2024-2026

Answer outline: - Hamel's hard line: "if your evaluations consist of a bunch of metrics that LLMs score on a 1-5 scale, you're doing it wrong." - Why: 1-5 scales hide disagreement. A "3" can mean "good with minor issues" to one labeler and "mediocre" to another. Binary forces you to define the failure cutoff. - Inter-rater reliability collapses on Likert — you'll see κ around 0.3-0.4 on 5-point scales for the same task that hits κ > 0.7 binary. - Production-actionable: binary lets you alarm on "% failures rose from 2% to 5% this week." Likert lets you alarm on "average dropped from 4.1 to 4.0" — which means nothing. - For dimensions that legitimately need gradation (e.g., partial correctness), split into multiple binary checks: "all facts correct?" AND "no extra facts?" AND "addresses question?" - Numbers to drop: "5-point κ typically 0.3-0.4 vs binary 0.7+", "actionable threshold: % fails moved from 2% to 5%"

Common follow-ups: - "What about a star rating from users? That's a Likert." - "How would you decompose 'response quality' into binaries?"

Traps: - Defending Likert because "it gives more granularity" — granularity ≠ signal. - 5-point scale that collapses to binary in practice (everyone scores 4 or 5).

Related cross-cutting: Production patterns Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "Your LLM judge correlates well on benchmark X but disagrees with users in production. What do you do?"¶

Tags: senior · occasional · debugging · source: standard senior debugging scenario

Answer outline: - Diagnose the distribution gap first. Sample 50 production traces, hand-label, compare against judge. The numerical gap tells you which class of inputs the judge fails on. - Eugene Yan's finding — fine-tuned evaluators "performed worse than random guessing" on out-of-domain fairness eval. Judges are domain-fragile, much more than people assume. - Specific fixes by symptom: (a) judge calibrated to short answers but prod has long answers → add length-stratified validation; (b) judge trained on English, prod is multilingual → re-validate per locale; (c) judge predates a domain change → refresh validation set. - Long-term: maintain a rolling "judge regression set" — 100 production traces with human labels, refreshed monthly, judge must hit >80% on it or the judge is rolled back. - Numbers to drop: "rolling judge regression set of 100 examples refreshed monthly", "judge must hit >80% global agreement on it", "Eugene Yan: fine-tuned evaluators worse than random on OOD"

Common follow-ups: - "Would you switch judge model?" - "When do you stop trusting LLM-judge and revert to humans?"

Traps: - Trusting benchmark numbers over your own production-grounded validation. - Re-tuning the judge against complaint-traces specifically — that overfits to the complaint distribution.

Related cross-cutting: Production patterns Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "When would you NOT use LLM-as-judge?"¶

Tags: senior · common · conceptual · source: standard senior probe; cost-aware eval theme

Answer outline: - Don't use it when a cheap deterministic check works: schema validity, regex for PII, banned-strings list. Heuristics catch obvious failures at 1000x the speed and 1000x cheaper. - Don't use it for high-stakes auditing without human in the loop — legal, medical, financial outputs. Judge is a first-pass triage, not a regulator. - Don't use it for self-comparison (judge from same family as generator) without panel — self-preference bias contaminates the signal. - Don't use it on very subtle creative quality (poetry, brand voice) — humans still beat models on the long tail of taste. - Use it where it shines: open-ended QA quality, summarization faithfulness, tone consistency, instruction-following — places where deterministic rubrics don't scale but humans agree on the criteria. - Numbers to drop: "regex/heuristic ~1000x cheaper than LLM-judge", "human review for top 0.1-1% high-stakes traces"

Common follow-ups: - "What's your hierarchy — heuristic, judge, human?" - "Where do you draw the line on judge-only auditing?"

Traps: - Defaulting to LLM-judge for everything when cheaper checks work. - Believing judge-only auditing is sufficient for compliance-grade systems.

Related cross-cutting: Cost & latency Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "Judge says outputs are great, users disagree. What's your debug path?"¶

Tags: senior · occasional · debugging · source: standard senior debugging probe

Answer outline: - Step 1 — pull 50 user-flagged traces. Run them through your judge. If judge agrees with users, problem was sampling (judge didn't see them). If judge disagrees with users, judge is the problem. - Step 2 — taxonomize the disagreements. Are they all verbose-but-wrong? Confident-tone? Slightly-off-topic but technically faithful? That clusters the bias. - Step 3 — check the rubric. Does your judge prompt actually capture what users care about? Often the prompt says "is the answer correct" but users care about "does this resolve my problem." - Step 4 — re-validate the judge on a fresh ~30 examples. If κ drops below 0.6, judge needs rework. - Step 5 — quick patch: add a deterministic check for the specific failure (verbosity threshold, off-topic embedding distance) while you rebuild the rubric. - Numbers to drop: "50 user-flagged traces to triage", "fresh 30-example judge re-validation", "Cohen's κ drop below 0.6 = rework"

Common follow-ups: - "What if the judge IS aligned but users have unrealistic expectations?" - "How would you ship a fix without breaking other dimensions?"

Traps: - Believing the judge over the users — almost always wrong. - Adding more dimensions to the judge — usually you have one broken dimension, not too few dimensions.

Related cross-cutting: Production patterns Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "What's Cohen's kappa and why do you use it for judge validation?"¶

Tags: senior · occasional · conceptual · source: Eugene Yan LLM-evaluators; standard senior probe

Answer outline: - Cohen's κ measures agreement between two raters, corrected for chance agreement. Range -1 to 1; 0 = chance, 1 = perfect. - Raw % agreement lies under class imbalance. If 95% of answers are "good" and judge always says "good", raw agreement = 95% but κ = 0 (judge has no signal). - Rule of thumb: κ < 0.2 poor, 0.2-0.4 fair, 0.4-0.6 moderate, 0.6-0.8 substantial, > 0.8 almost perfect. For LLM-judge, aim for > 0.6 minimum, > 0.8 to fully trust. - Eugene Yan's numbers: TriviaQA gpt-4 vs human κ = 0.84 (human-human = 0.97). Summarization Spearman ρ = 0.27-0.46 — concerning low. - For more than two raters use Fleiss' κ; for ordinal scales use Kendall's τ or Spearman's ρ. - Numbers to drop: "κ > 0.6 usable, > 0.8 reliable", "TriviaQA gpt-4 κ = 0.84 vs human-human 0.97", "Summarization Spearman ρ only 0.27-0.46"

Common follow-ups: - "What if you have 3+ annotators?" - "Why is κ so much lower than raw % agreement?"

Traps: - Reporting raw % agreement and calling the judge "validated." - Using Cohen's κ on ordinal data instead of weighted κ.

Related cross-cutting: Architecture choices Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

RAGAS & RAG-specific metrics¶

Q: "What is RAGAS? Which metrics matter for your RAG system?"¶

Tags: screen · very-common · conceptual · source: Adil Shamim — 100+ AI engineer interviews, 2026

Answer outline: - RAGAS = open-source framework for evaluating RAG systems with LLM-as-judge metrics, no human reference needed for many of them. - Core metrics: faithfulness (claims in answer supported by context), answer relevance (answer addresses the question), context precision (retrieved chunks are relevant, ranked correctly), context recall (retrieved chunks cover the needed info). - Which matter depends on failure mode. Hallucination problem → faithfulness. Wrong-but-confident retrieval → context precision. Missing-info retrieval → context recall. Off-topic answer → answer relevance. - Senior tell: don't list all four mechanically. Pick 2-3 that map to your product's worst failures and justify cutting the others. - RAGAS uses GPT-4-class judges internally — costs add up, sample don't full-set in prod. - Numbers to drop: "4 core metrics scored 0-1", "faithfulness = supported_claims / total_claims", "RAGAS judge ≈ $0.001-0.01/example depending on length and judge"

Common follow-ups: - "What's the cost of running RAGAS on 1M queries?" - "How does context recall work without a reference?"

Traps: - Treating RAGAS as plug-and-play without validating its judges on your domain — same judge-validation discipline applies. - Reporting one aggregate "RAGAS score" — there isn't one; report per-metric.

Related cross-cutting: Retrieval Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "How is faithfulness computed in RAGAS?"¶

Tags: mid · common · conceptual · source: RAGAS docs faithfulness page, 2026

Answer outline: - Three steps: (1) decompose the response into atomic claims via LLM, (2) for each claim, LLM checks whether it's supported by the retrieved context, (3) score = supported_claims / total_claims. - Score range 0-1. Example: "Einstein was born in Germany on March 20, 1879" decomposed to {claim_1: born in Germany, claim_2: born March 20, 1879}. Context says March 14. Claim 1 supported, claim 2 contradicted. Score = 1/2 = 0.5. - Failure modes: implicit claims (judge misses them), complex inferences (judge rejects valid logical leaps), claims that span multiple chunks. - Alternative implementation — HHEM-2.1-Open classifier (free T5 model from Vectara). Much faster than LLM-judge, less flexible. - Numbers to drop: "two LLM calls per example (decompose + verify)", "score = supported_claims / total_claims, range 0-1", "HHEM-2.1-Open classifier alternative for cost"

Common follow-ups: - "What's a typical production faithfulness score, and when should you ship?" - "What breaks this metric?"

Traps: - Confusing faithfulness ("grounded in context") with factual correctness ("true in the real world"). RAG can be faithfully wrong. - Counting compound claims as one — inflates scores.

Related cross-cutting: Retrieval Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "Context precision vs context recall — explain both with an example."¶

Tags: mid · very-common · conceptual · source: RAGAS docs; standard RAG eval probe 2026

Answer outline: - Context precision = "of what I retrieved, how much was actually relevant and ranked correctly?" — measures retrieval quality and ranking. - Context recall = "of what I needed, how much did I retrieve?" — measures retrieval coverage. - Example: query "What's the capital of France?" Retrieved 5 chunks: [Paris-fact, Eiffel-Tower, Paris-population, Football-team, Random]. Need only chunk 1 to answer. Precision is high (Paris-fact in top position). Recall is 1.0 (the needed info is there). - Now flip: retrieved 5 chunks [Random, Football, Paris-population, Eiffel-Tower, Paris-fact]. Recall still 1.0, but precision drops sharply because the relevant chunk is at the bottom. - Now drop the Paris-fact chunk entirely: precision could still be ok depending on other chunks, but recall = 0 — you literally can't answer. - Numbers to drop: "precision penalizes irrelevant chunks ranked high", "recall = 0 means model can never answer correctly", "RAGAS context_precision uses precision@k weighted by relevance"

Common follow-ups: - "Which is more important?" - "How would you raise recall without tanking precision?"

Traps: - Saying "precision = % relevant" without mentioning ranking — RAGAS weights by position. - Confusing context recall (retrieval) with answer correctness (generation).

Related cross-cutting: Retrieval Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "How do you measure context recall without a gold reference?"¶

Tags: senior · occasional · conceptual · source: RAGAS docs; standard RAG eval probe

Answer outline: - Vanilla context recall requires a reference answer — RAGAS decomposes the reference into claims, then checks if each reference-claim is supported by the retrieved context. No reference, no recall. - Workaround 1 — proxy via answer faithfulness: if the generator's answer is faithful AND correct (judged by human or stronger judge), then by inference recall was sufficient. Doesn't give a continuous score. - Workaround 2 — synthetic references. Use a stronger model with full corpus access to generate reference answers, then run standard context recall. Risk: leaks of the corpus into eval. - Workaround 3 — needle-in-haystack injection. Plant known facts in the corpus, query for them, measure retrieval hit rate. - For production, pair faithfulness + answer relevance + low refusal rate as a proxy for recall when references unavailable. - Numbers to drop: "needle-in-haystack injection: 50-100 planted facts", "stronger-model synthetic references are an audit risk"

Common follow-ups: - "What's wrong with synthetic references?" - "How often would you refresh the planted facts?"

Traps: - Claiming "RAGAS context recall is reference-free" — it isn't, RAGAS's context_recall metric needs ground_truth. - Reusing the corpus to generate references (you'll measure retrieval against itself).

Related cross-cutting: Retrieval Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "Your RAG faithfulness is 0.95, users still complain about wrong answers. Diagnose."¶

Tags: senior · common · debugging · source: standard senior RAG debugging scenario, 2026

Answer outline: - Faithfulness = grounded in context, NOT factually correct. If your retrieved context is wrong, the answer can be 100% faithful and 100% wrong. - Diagnose: (1) check context precision — are top chunks actually relevant? (2) check the corpus — outdated docs? duplicate or contradictory chunks? (3) sample 50 complaint traces, judge them on both faithfulness AND truth — gap is your problem class. - Common cause: doc base contains stale info (old policy doc) ranked high; new info exists but is lower-ranked or in a different format the retriever misses. - Common cause #2: ambiguous queries get resolved against the wrong chunk; user asks "the discount" referring to context A, system retrieves chunk about discount B. - Fix path: corpus hygiene (dedupe, freshness), retrieval reranker (cross-encoder), and add a "context-correctness" metric distinct from faithfulness. - Numbers to drop: "0.95 faithfulness can coexist with 30% factual errors if corpus is stale", "sample 50 complaint traces to taxonomize"

Common follow-ups: - "What's the difference between faithfulness and factual correctness?" - "Would a reranker help here?"

Traps: - Treating high faithfulness as proof of correctness. - Adding more retrieval without fixing the underlying corpus.

Related cross-cutting: Retrieval Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "How do you evaluate the quality of a RAG system beyond accuracy?"¶

Tags: mid · very-common · conceptual · source: lockedin.ai LLM engineer interview Q18, 2026

Answer outline: - Accuracy is wrong because RAG has two failure modes — retrieval and generation — and they need to be diagnosed separately. - Retrieval-side: context precision (ranking quality), context recall (coverage), hit rate (gold doc in top-k), MRR (rank of gold doc). - Generation-side: faithfulness (grounded in retrieved context), answer relevance (addresses the question). - System-side: end-to-end latency, P95/P99 because retrieval adds variance; refusal rate; user satisfaction signal. - Per-slice — break out metrics by intent, by source corpus, by user segment; aggregate hides where the system is breaking. - Numbers to drop: "hit@k for retrieval gold doc in top-k", "MRR for ranking quality", "report metrics per intent slice, not just global"

Common follow-ups: - "Which retrieval metric do you ship with?" - "What's the typical correlation between retrieval and end-to-end quality?"

Traps: - Single-number "RAG score" — no such thing. - Optimizing retrieval and generation jointly without measuring them separately.

Related cross-cutting: Retrieval Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "Build an agentic RAG system evaluated using RAGAS metrics."¶

Tags: senior · occasional · design · source: Adil Shamim — 100+ AI engineer interviews, 2026

Answer outline: - Architecture: (1) router that picks tool (retrieve-docs, retrieve-tables, web-search, no-retrieve); (2) retriever with hybrid BM25+dense + reranker; (3) generator with citations; (4) optional self-critique loop. - Per-step eval: route accuracy (correct tool picked), retrieval metrics on each tool's results, faithfulness on the final generation, end-to-end answer relevance. - RAGAS metrics: faithfulness, answer_relevancy, context_precision, context_recall computed per-trace; aggregate per-route to see which tool/path is failing. - Agentic-specific failures to instrument: route loops, tool-call failures, context overflow when multiple retrievals concatenate, citation hallucinations. - Online: trace each step in Langfuse/Phoenix, sample 5% for full RAGAS scoring, alert on per-route faithfulness drops. - Numbers to drop: "RAGAS scored on 5% online sample", "per-route metric breakdown not global", "hybrid BM25+dense + cross-encoder rerank"

Common follow-ups: - "How do you eval the router itself?" - "What if two retrieval tools return overlapping content?"

Traps: - One end-to-end metric for an agentic system — you can't debug it. - Forgetting tool-call success rate — agents fail on tool errors as much as on quality.

Related cross-cutting: Architecture choices Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "Beyond RAGAS — what eval frameworks are worth knowing in 2026?"¶

Tags: mid · occasional · conceptual · source: standard senior probe; 2026 tooling landscape

Answer outline: - RAGAS — RAG-specific, LLM-judge based, easy to start. Free, open-source. - DeepEval — broader; supports custom metrics, hallucination, bias, toxicity. Pytest-style integration. - TruLens — observability-first, feedback functions over traces. - Braintrust / LangSmith / Phoenix (Arize) — managed eval platforms with golden-set management, judge orchestration, online sampling. - Promptfoo — CLI/CI-first, golden-set diffs across prompts. - HELM / MMLU / MT-Bench / Chatbot Arena — research benchmarks, useful for model selection, useless for your application. - Picking: research benchmarks for model choice; app-specific tools (RAGAS, DeepEval, Braintrust) for your golden set + judges; observability layer (LangSmith, Phoenix, Langfuse) for online traces. - Numbers to drop: "Chatbot Arena human-LLM agreement 83-87%", "MMLU 1% sample retains rank-order validity per affordable-eval research"

Common follow-ups: - "Which would you pick for a startup with 0 eval today?" - "When do you need a managed platform vs OSS?"

Traps: - Adopting all of them — you'll spend more time wiring tools than evaluating. - Confusing research benchmarks (MMLU) with product eval (your golden set).

Related cross-cutting: Architecture choices Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "Explain evaluation metrics: perplexity, ROUGE, BLEU. What are the pitfalls of n-gram-based metrics?"¶

Tags: screen · common · conceptual · source: Adil Shamim — 100+ AI engineer interviews; DataCamp LLM eval

Answer outline: - Perplexity: model's exponentiated cross-entropy on a held-out set; lower = better at predicting the next token. Useful for language modeling benchmarks; near-useless for downstream task quality. - BLEU: n-gram precision overlap with references; designed for machine translation. Range 0-1. - ROUGE: n-gram recall (and F-variant) overlap with references; designed for summarization. - Pitfalls — both ignore semantics. "The cat sat on the mat" vs "A feline rested on a rug" — near-zero BLEU/ROUGE but same meaning. They reward surface form, not understanding. - They require references, which are bottlenecks. They penalize valid paraphrases. They're insensitive to factual errors. - Modern eval: replace with embedding-based (BERTScore), LLM-judge (RAGAS, custom), or task-specific exact-match. - Numbers to drop: "BLEU correlates ~0.4-0.6 with humans on translation", "ROUGE for summarization correlates Spearman ρ 0.27-0.46 with humans per Eugene Yan", "BERTScore correlates better but still imperfect"

Common follow-ups: - "When is BLEU still appropriate?" - "What replaces them?"

Traps: - Reporting BLEU on conversational quality — meaningless. - Conflating perplexity with accuracy.

Related cross-cutting: Architecture choices Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "How do you measure hallucination rate in production?"¶

Tags: senior · very-common · scenario · source: Adil Shamim — 100+ AI engineer interviews, 2026

Answer outline: - Define hallucination precisely first. For RAG: unfaithful claim (not supported by retrieved context). For open-domain: factually false claim. Different definitions, different measurement. - RAG hallucination — sample 1-10% of production traces, run RAGAS faithfulness or HHEM classifier, track % of responses with at least one unsupported claim. Or count claim-level rate. - Open-domain hallucination — needs a fact-checker (claim extraction + retrieval against trusted source + verification). Much more expensive; usually run as nightly batch not per-request. - User-signal proxies: thumbs-down rate, "this is wrong" feedback button, follow-up correction queries. Cheap to log, noisy as signal. - Stratify by query class — hallucination rates differ wildly across intents; aggregate average hides the worst slice. - Numbers to drop: "1-10% sample for faithfulness scoring", "production hallucination rates often 5-15% on faithfulness depending on domain", "HHEM classifier ~10ms vs LLM-judge ~1-3s"

Common follow-ups: - "What's an acceptable hallucination rate?" (depends — 0% for legal, 5-10% tolerable for casual chat) - "How do you reduce it?" (better retrieval, prompt for citations, fine-tune for refusal)

Traps: - Reporting a single global hallucination rate without slicing. - Confusing "faithful" with "factually correct." - Letting users define hallucination ("I disagree" ≠ "hallucination").

Related cross-cutting: Retrieval Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

Online evals (A/B, shadow, canary)¶

Q: "How would you test a new model before full deployment? Describe A/B testing, canary, interleaved, and shadow testing strategies."¶

Tags: senior · very-common · design · source: Adil Shamim — 100+ AI engineer interviews, 2026

Answer outline: - Shadow: duplicate every request to both old and candidate models, candidate output is logged not shown. Zero user risk. Use first for any meaningful change. Compare via LLM-judge or human review on offline. - Canary: route a small slice (1-5%) of real traffic to the candidate, monitor metrics, expand gradually (5% → 25% → 50% → 100%) with auto-rollback gates. - A/B: split traffic randomly between control and treatment for a defined period, measure both quality and business metrics, decide based on statistical significance. - Interleaved: in a single response, interleave outputs from both models and ask the user to pick — high-signal, used in search ranking, rare in LLM apps but powerful for ranking changes. - Order in practice: offline gate → shadow → canary → A/B at scale → full rollout. Each tier catches a different failure class. - Numbers to drop: "shadow = 0% user-visible", "canary 1-5% → 25 → 50 → 100", "A/B typical 2-week window for statistical power", "auto-rollback at >X% regression on guardrail"

Common follow-ups: - "What metric triggers auto-rollback?" - "When would you skip shadow and go straight to canary?"

Traps: - Confusing shadow with canary — shadow means no user sees the candidate. - A/B at 50/50 too long — you're harming half your users if treatment is worse.

Related cross-cutting: Production patterns Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "Walk me through a shadow deployment for an LLM change."¶

Tags: senior · common · scenario · source: TianPan 2026 shadow/canary post; standard senior probe

Answer outline: - Step 1 — Mirror traffic: every production request is sent to both prod model and candidate. User sees prod response only. Candidate response goes to a logging store. - Step 2 — Sample for eval: 100% logged, ~10% LLM-judged in real-time (or batch nightly to control cost). - Step 3 — Comparison metrics: per-request diff (token count, cost, latency); per-request judge score (which is better); aggregate win rate. - Step 4 — Gating: candidate must win >55% in pairwise judge AND not regress on any guardrail (refusal rate, schema validity, P95 latency). - Step 5 — Run for at least 24-72 hours to cover daily traffic patterns; longer for weekly seasonality. - Step 6 — Decision: promote to canary, rework, or kill. - Numbers to drop: "100% logged + 10% judged", "win rate threshold typically >55%", "shadow duration 24-72 hours minimum"

Common follow-ups: - "What if candidate is faster but worse quality?" - "How do you keep shadow cost manageable?"

Traps: - Shadow only for 1 hour — misses daily / weekly patterns. - Judging 100% of shadow traffic at LLM-judge rates — burns budget.

Related cross-cutting: Cost & latency Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "Design a canary rollout for a new prompt that materially changes outputs."¶

Tags: senior · common · design · source: AppScale AI-native CI/CD 2026; standard senior loop

Answer outline: - Pre-canary gate — offline eval must pass (>X% on golden set, no regression on guardrails). Shadow eval must pass (>55% win rate vs prod). - Canary stages — 1% (1 hour) → 5% (2 hours) → 25% (4 hours) → 50% (24 hours) → 100%. Each stage has auto-rollback criteria. - Auto-rollback triggers — guardrail breach (refusal rate doubled, P95 latency +20%, error rate +0.5pp); quality regression (judge win-rate <50% over rolling 100 samples); user-signal regression (thumbs-down rate +50%). - Per-cohort safety — exclude high-risk user segments from canary (enterprise SLAs, regulated regions). - Kill switch — feature flag for instant 0% routing; mandatory not optional. - Numbers to drop: "1% → 5% → 25% → 50% → 100% staged", "auto-rollback at refusal-rate-doubled", "kill switch mandatory feature flag"

Common follow-ups: - "What happens if a regression appears only at 50%?" - "How do you communicate canary failures to product?"

Traps: - Linear ramp without auto-rollback — defeats the purpose of staging. - Canary that ignores user-segment risk — losing one enterprise customer ≠ losing one consumer user.

Related cross-cutting: Production patterns Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "How would you implement A/B testing for different prompt variations?"¶

Tags: mid · very-common · scenario · source: Adil Shamim — 100+ AI engineer interviews, 2026

Answer outline: - Decide the metric first. Primary metric must be a business outcome (CSAT, task completion, conversion), not a proxy (latency, token count). Define guardrail metrics that must not regress. - Random assignment at the user level (not request) — same user always sees the same variant during the test, otherwise UX is incoherent. - Compute required sample size up-front given effect size, baseline rate, and α=0.05, β=0.2. Rule-of-thumb: ~10K-100K user-sessions to detect a 1-2% relative improvement on conversion-like metrics. - Run for a full business cycle (1-2 weeks minimum) to absorb weekly seasonality. - Statistical test: t-test or Mann-Whitney for continuous; chi-square or proportion test for rates. Pre-register the test, don't peek. - Numbers to drop: "10K-100K sessions per arm for 1-2% effect", "α=0.05, β=0.2, two-tailed", "minimum 1-2 weeks per arm"

Common follow-ups: - "What if your metric moves but a guardrail also moves?" - "How do you handle multi-variant testing?"

Traps: - Peeking and stopping early — inflates false-positive rate. - Assignment at request-level — user sees inconsistent prompts. - Running until you see significance — guaranteed false discovery.

Related cross-cutting: Production patterns Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "When would you use shadow over canary, and vice versa?"¶

Tags: senior · common · conceptual · source: TianPan 2026; standard senior loop

Answer outline: - Shadow when: change is risky (model swap, major prompt rewrite, new tool); you cannot tolerate any user impact; you have offline+pairwise judge that gives sufficient signal. - Canary when: change has passed shadow; you need real user behavior metrics (CSAT, follow-up rate) that shadow can't capture; latency or cost can only be measured on real load. - Shadow's limitation — users never see candidate, so you don't measure user reaction. Good for quality, blind on UX impact. - Canary's risk — real users see candidate, so a bad release degrades real experiences; needs auto-rollback. - Production reality: do both, in sequence. Shadow first to filter obvious regressions, canary second to confirm user-side metrics. - Numbers to drop: "shadow = 0% user-visible", "canary = 1-5% real users initially", "shadow 24-72hr → canary stages"

Common follow-ups: - "Can you skip shadow if change is small?" - "How does this differ for back-end-only changes like embedding model swap?"

Traps: - Using shadow alone for changes that need real-user signal (e.g., tone changes). - Going straight to canary for high-risk changes without offline+shadow gates.

Related cross-cutting: Production patterns Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "A/B test shows treatment significant at p=0.04. Ship?"¶

Tags: senior · occasional · debugging · source: standard senior stats probe

Answer outline: - Not automatically. Check: (1) was the test pre-registered with this metric, this sample size, this duration? (2) did you peek? (3) are guardrails clean? (4) is the effect size practically significant or just statistically significant? - p=0.04 is barely under 0.05 — high prior on false discovery if you ran multiple tests (multiple comparisons inflate false-positive rate). - Effect size matters more than p — a 0.1% CSAT lift with p=0.04 is often noise; a 5% lift with p=0.04 is real and ship-worthy. - Confidence interval analysis — if the 95% CI for the lift includes near-zero values, even significant p doesn't mean meaningful effect. - Replication mindset — for borderline results, re-run on independent traffic or wait another cycle. - Numbers to drop: "p=0.04 with 5 simultaneous tests → Bonferroni-corrected p=0.20", "effect size practical threshold typically 1-5% on business metrics"

Common follow-ups: - "How would you do multiple-comparison correction?" - "What if treatment is significantly better on quality but worse on latency?"

Traps: - Treating p<0.05 as automatic ship signal. - Ignoring multiple comparisons when running many tests. - Reporting only "treatment won" without effect size or CI.

Related cross-cutting: Production patterns Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "Your candidate model wins on automated metrics in shadow but the A/B canary tanks. Why?"¶

Tags: senior · occasional · debugging · source: standard senior debugging scenario

Answer outline: - Likely cause #1 — distribution gap. Shadow runs on real traffic but judge scores in isolation; user behavior in canary depends on multi-turn context, prior expectations, UI integration that shadow doesn't capture. - Likely cause #2 — judge bias. Verbose new model wins shadow on a verbosity-biased judge; users hate the verbosity in production. - Likely cause #3 — guardrail latency. Candidate is slower; users abandon. Shadow ignores latency-driven abandonment because users never wait. - Likely cause #4 — second-order effects. Candidate gives better immediate answers but lower follow-up rate because it pre-empts user clarifying questions — net product impact negative. - Diagnose: pull canary losses, compare against shadow score on same prompts. The metric-disagreement traces are the diagnosis. - Numbers to drop: "verbose responses can win shadow with judge verbosity bias >90%", "P95 latency +1s often crashes conversion by 5-10%", "second-order metrics — task completion not just turn quality"

Common follow-ups: - "Would you trust shadow ever again?" - "How do you debias the shadow judge?"

Traps: - Blaming the A/B test. - Re-running shadow longer instead of fixing the underlying metric gap.

Related cross-cutting: Production patterns Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "How do you size an A/B test for an LLM feature?"¶

Tags: senior · common · conceptual · source: standard senior stats probe

Answer outline: - Inputs: baseline metric rate (e.g., 60% task-completion), minimum detectable effect (e.g., 2pp absolute), α=0.05, β=0.2 (power=0.8). - For proportion metric: n per arm ≈ 16 × p(1-p) / MDE² (rule of thumb). 60% baseline + 2pp MDE → ~9.6K per arm. - For continuous metric: depends on variance; use σ²/MDE² × constant. - Multiply by 2-3x for variance from user-level cluster effects (multiple sessions per user reduce effective n). - Daily traffic budget: if you can serve 5K users/day to test arm, 9.6K needs 2 days minimum but realistically 1-2 weeks for seasonality. - Numbers to drop: "n ≈ 16·p(1-p)/MDE² for proportions", "60% baseline + 2pp MDE = 9.6K per arm", "2-3x for cluster effects", "1-2 weeks minimum for weekly seasonality"

Common follow-ups: - "What if your sample doesn't reach the size?" - "When do you do sequential testing instead?"

Traps: - Picking sample size by gut. - Forgetting cluster effects when same user has multiple sessions.

Related cross-cutting: Production patterns Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

Drift detection¶

Q: "A production chatbot's accuracy dropped from 95% to 80% over six weeks. How do you diagnose the root cause before retraining?"¶

Tags: senior · common · debugging · source: 2026 eval/monitoring loop

Answer outline: - "Retrain" is the trap answer — retraining a degradation you haven't diagnosed often does nothing (the model may not be the cause) and burns time and money. Diagnose first. - A six-week gradual slope (not an overnight step) points to drift, not a single bad deploy. Walk the suspects: - Input drift: are users asking new things? Cluster recent queries vs baseline — new intents, new languages, longer queries the system was never good at. - Data/corpus drift (if RAG): knowledge base went stale or grew, retrieval quality fell. Track retrieval hit rate over time, not just final accuracy. - Silent model update: the provider rolled a new version over those weeks. Re-run a frozen eval set against pinned vs current. - Eval drift: is the "accuracy" number itself trustworthy? If it's an LLM judge, the judge may have drifted — re-validate against a frozen human-labeled set. - Accumulated change: prompt edits, new guardrails, routing tweaks that piled up. - Method: segment the drop. Which intents / topics / user-segments fell? A uniform drop ≠ a concentrated one. Pull failing traces from week 1 vs week 6 and diff them. - Only after attribution pick the fix — refresh corpus, pin the model, fix the judge, add new intents to few-shot/training. Retraining is one option, not the default. - The senior tell: a frozen eval set isolates model-vs-world — if the pinned model still scores 95% on it, the model didn't change; it's input or corpus. - Numbers to drop: "gradual 6-week slope = drift; step change = a deploy", "segment by intent before acting", "frozen eval set isolates model vs data", "validate the judge against frozen human labels"

Common follow-ups: - "How do you tell input drift from model drift?" (the frozen eval set test above) - "What would have caught this in week 1?" (continuous online eval + segmented dashboards + drift alerts)

Traps: - Retraining before diagnosing. - Watching aggregate accuracy only — a 15-pt drop concentrated in one intent hides in the average for weeks. - Trusting an unvalidated judge metric as ground truth.

Related cross-cutting: Production patterns Related module: learning/04_ai_product_evals/02_telemetry_feedback_loops/, learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "How do you detect eval drift?"¶

Tags: senior · common · conceptual · source: VentureBeat monitoring LLM behavior 2026

Answer outline: - Three drift types to monitor: (1) input distribution drift — user prompts changing, (2) output distribution drift — your model responding differently to same inputs, (3) eval drift — your judge or metrics themselves shifting. - Input drift: track prompt-embedding distribution, alert on KL divergence > threshold; track new-intent rate (clusters of OOD queries); track length and language distribution. - Output drift: track length distribution, refusal rate, format-validity rate, tone-classifier distribution; alert on >2σ deviation from baseline. - Judge drift: maintain a frozen "judge regression set" of 100-200 examples with locked human labels; re-run weekly; if judge agreement drops below 0.6 κ, judge is drifting. - Numbers to drop: "embedding KL divergence alert threshold", "judge regression set 100-200 examples", "κ drop below 0.6 = drift action", "weekly judge re-validation cadence"

Common follow-ups: - "Which type of drift is most common in production?" - "What's your runbook when drift fires?"

Traps: - Only monitoring output, missing input drift. - Not monitoring the judge itself — when judges drift, everything else looks fine until users complain.

Related cross-cutting: Production patterns Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "Provider silently updated their LLM. How do you detect and respond?"¶

Tags: senior · very-common · scenario · source: standard 2026 production scenario; GPT-3.5/4 silent updates

Answer outline: - Detection: (1) run golden set daily as a heartbeat, alarm on score delta; (2) output-distribution monitors — length, format, refusal patterns; (3) explicit version pinning where the provider supports it (e.g., dated snapshots) — but those expire and you need detection anyway. - Response runbook: (1) freeze rollouts, (2) run shadow against the previous pinned version if available, (3) re-validate the judge (judges drift on the same upgrade), (4) sample 100 production traces and human-review, (5) decide: stay on new version with adjusted prompt, pin to older snapshot, switch provider. - Long-term: maintain provider abstraction; multi-provider redundancy for critical paths; daily heartbeat in CI. - Cost reality: silent updates are the norm. Engineering hour ratio: 80% builds, 20% wrestling with provider drift on a stable feature. - Numbers to drop: "daily golden-set heartbeat", "100 production traces for human triage", "pin to dated snapshot where supported (validity ~6 months)"

Common follow-ups: - "What's the case for switching to a self-hosted open model?" - "How do you abstract the provider in code?"

Traps: - Assuming pinned snapshots are forever — they expire. - No daily heartbeat — drift is invisible until users complain.

Related cross-cutting: Production patterns Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "How do you monitor for prompt-injection in production?"¶

Tags: senior · occasional · scenario · source: VentureBeat refusal patterns 2026

Answer outline: - Note — overlap with safety-guardrails module; here focus on the eval angle. - Treat as a drift signal: track refusal patterns, system-prompt-leak signatures in outputs, anomalous tool-call sequences. - Output detectors: regex for system-prompt fragments leaking in responses; classifier for "ignore previous instructions" style outputs; tool-call audit for unauthorized actions. - Trace-level anomaly: requests with unusual length, unusual encoding (base64, unicode tricks), excessive tool-call retries. - Add red-team set to your golden suite — 50-100 known injection patterns that must always be refused; failure = block release. - Numbers to drop: "50-100 red-team patterns in golden set", "system-prompt-leak regex coverage", "refusal-rate alarm > 2x baseline"

Common follow-ups: - "What's your response when injection is detected live?" - "How do you distinguish legitimate edge prompts from injections?"

Traps: - Treating injection as one-time defense rather than ongoing eval discipline. - No red-team set in golden — you'll regress on injection defense without noticing.

Related cross-cutting: Production patterns Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "Refusal rate jumped from 2% to 8% overnight. Diagnose."¶

Tags: senior · occasional · debugging · source: standard 2026 production debugging

Answer outline: - Most likely cause — model upgrade. Provider silently shipped a new version with tighter safety, refusing edge cases that used to pass. - Step 1 — confirm provider didn't update. Check API headers, status pages, dated snapshots. - Step 2 — pull 50 newly-refused traces, classify: legitimate refusals (good), over-cautious refusals (model issue), corner-case prompts users started sending more (input drift). - Step 3 — check input drift — did some new user cohort or campaign push a new query class? - Step 4 — re-run your judge / red-team set: did unsafe prompts also start passing? If yes, the model is just shifted, not necessarily safer. - Fix path: prompt tuning to reduce false refusal; pin to older snapshot if available; route to different model for the false-positive intent class. - Numbers to drop: "2% → 8% = 4x jump, action threshold", "50 refusals manually triaged", "provider snapshot pin valid ~6mo"

Common follow-ups: - "What if refusals are correct but users complain?" - "How do you stop this from happening silently next time?"

Traps: - Assuming refusal increase = good (safer). Often it's over-cautious and tanks UX. - No daily heartbeat to detect it earlier.

Related cross-cutting: Production patterns Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "How would you set alerting thresholds on eval metrics?"¶

Tags: senior · occasional · design · source: standard senior probe

Answer outline: - Two-tier alerting: paging (something is broken right now) vs ticketing (something is drifting, investigate this week). - Page on: guardrail breach (refusal rate doubled, error rate +0.5pp), latency P95 above SLA, kill-switch criteria. - Ticket on: quality metric drift >X% over 7-day rolling window vs prior 28-day baseline; per-slice metric anomaly even if global is fine. - Use rolling windows to avoid spike alerts on noise: a 30-min spike in refusal rate is usually noise; 12 hours sustained is a real issue. - Stratify alerts per intent / per cohort. Global alarms miss minority-class regressions until they're severe. - Numbers to drop: "page on 2x guardrail breach in 12-hour window", "ticket on 7-day rolling drift >5% vs 28-day baseline", "per-intent alarms, not just global"

Common follow-ups: - "How do you avoid alert fatigue?" - "What's your on-call runbook structure?"

Traps: - Static thresholds that don't account for seasonality. - Paging on noise (single-request errors). - No ticketing layer — only paging means slow drift is invisible.

Related cross-cutting: Production patterns Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

Cost-aware eval¶

Q: "Running RAGAS on every production query is too expensive. What's your strategy?"¶

Tags: senior · common · scenario · source: HuggingFace 2026 "eval costs bottleneck"; standard senior probe

Answer outline: - Tiered evaluation cascade: (1) cheap heuristics on 100% — schema, regex, banned-strings, length-bounds; (2) embedding-based checks on 100% — query/answer similarity to detect off-topic; (3) cheap classifier (HHEM, BERT) on 10-50% — fast hallucination signal; (4) LLM-judge on 1-5% — RAGAS faithfulness, answer relevance; (5) human review on 0.1% — flagged traces. - Sampling strategies: uniform random for unbiased estimate; stratified by intent to ensure minority-class coverage; over-sample low-confidence / flagged traces. - Choose judge size to the task — distilled small judges (Llama-3-8B fine-tuned) hit 80%+ of GPT-4-judge quality at ~5% the cost. - Schedule expensive evals: nightly full RAGAS on 5K-sample, hourly heuristics on 100%. - Numbers to drop: "100% heuristics, 1-10% LLM-judge, 0.1% human", "distilled judge ~5% cost of GPT-4-judge with ~80% quality", "HuggingFace 2026: 'eval is becoming the compute bottleneck'"

Common follow-ups: - "What's the cost per 1M requests?" - "When would you skip the LLM-judge tier entirely?"

Traps: - LLM-judge on 100% — bankrupts the line. - Sampling without stratification — minority intents invisible.

Related cross-cutting: Cost & latency Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "How do you decide between GPT-4-judge and a smaller fine-tuned judge?"¶

Tags: senior · occasional · conceptual · source: cost-aware eval research 2026

Answer outline: - Validation-driven, not assumption-driven. Hand-label 100-200 examples, score both judges, compare Cohen's κ against human labels. - GPT-4-judge typical κ on production task: 0.7-0.85. Small fine-tuned judge: 0.55-0.75 if well-trained. - Cost differential: GPT-4 judge $0.005-0.02/example; small fine-tuned judge $0.0001/example self-hosted (50-200x cheaper). - Use small judge when: high volume, narrow task (e.g., faithfulness only), validated to >0.6 κ on your data. - Use GPT-4 judge when: low volume / sampled traffic, multi-dimensional rubric, evolving task definitions where retuning a small judge is expensive. - Hybrid pattern: small judge on 100% of sampled traces, GPT-4-judge on the 10% small judge marks as ambiguous. - Numbers to drop: "GPT-4 judge κ 0.7-0.85, small fine-tuned 0.55-0.75", "cost differential 50-200x", "hybrid: small on 100%, GPT-4 on 10% ambiguous"

Common follow-ups: - "How would you fine-tune a small judge?" - "What if your task definition changes?"

Traps: - Picking based on intuition without validation. - Fine-tuning a small judge once and never refreshing — drifts as task evolves.

Related cross-cutting: Cost & latency Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "How do you sample production traffic for eval to balance cost and statistical power?"¶

Tags: senior · occasional · conceptual · source: cost-aware eval; standard senior stats probe

Answer outline: - For headline rate metrics (e.g., faithfulness failure rate), sample size for ±1pp precision at 95% CI on a 5% rate: ~1800 examples; for ±0.5pp: ~7300. - Per-slice power matters more than global — sample stratified so each intent slice has ≥200-500 traces evaluated per cycle. - Choose cadence to match volume: 1M requests/day × 1% sample = 10K traces evaluated daily; plenty for global, may need stratification boost for rare intents. - Over-sample tails: requests flagged by cheap heuristics or low-confidence retrieval get over-sampled in the judge tier. - Research finding (HuggingFace): 1% sampling preserves rank-order validity on MMLU-like benchmarks — gives you cover for aggressive cost cuts on stable global metrics. - Numbers to drop: "1800 examples for ±1pp precision on 5% rate", "200-500 per slice for slice-level signal", "1% sample preserves rank-order per MMLU research"

Common follow-ups: - "How would you decide sample size for a new metric?" - "What if your traffic is bursty / non-stationary?"

Traps: - Uniform sampling that under-samples minority intents. - One global sample-size for all metrics regardless of base rate.

Related cross-cutting: Cost & latency Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "Your eval budget is $5K/month. How do you allocate?"¶

Tags: senior · occasional · scenario · source: cost-aware eval scenario 2026

Answer outline: - Tier 1 — CI/PR evals on golden set: ~50 runs/week × 500 examples × $0.005 = $50/week ≈ $200/month. Cheap, high leverage. - Tier 2 — online sampled LLM-judge: 1% of 10M monthly requests = 100K judge calls × $0.01 = $1000/month. - Tier 3 — nightly full RAGAS on rotating slice: 30K examples/night × 30 days × $0.005 = ~$4500/month — overshoots. Cut to weekly or smaller sample. - Tier 4 — human eval — couple of contractors, ~$1500/month for 200-500 traces. - Net split: ~10% CI, ~40% online judge, ~30% batch RAGAS, ~20% human. Tune based on which catches the most real issues. - Numbers to drop: "$0.005-0.01 per LLM-judge call", "10% CI / 40% online / 30% batch / 20% human", "human review 200-500 traces/mo at $1500"

Common follow-ups: - "Where would you cut first if you had to halve the budget?" - "What if a critical regression slips because you sampled too low?"

Traps: - All budget on one tier. - No human-in-the-loop allocation at all.

Related cross-cutting: Cost & latency Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "How do you evaluate an agentic system end-to-end vs per-step?"¶

Tags: senior · common · design · source: standard 2026 agentic eval probe; Evaluation-Driven Development arXiv

Answer outline: - End-to-end: task-completion rate, time-to-completion, total cost, user satisfaction. Single number is uninformative; need stratification by task type. - Per-step: tool-call accuracy (right tool?), tool-call success rate (no error?), step quality (sub-goal achieved?), trajectory length (efficient?). - Trace-level: planning quality (did the plan make sense?), recovery (did it handle a failed step?), termination (did it stop when done?). - Failure attribution: when end-to-end fails, which step caused it? Need per-step metrics to attribute, otherwise you're flying blind. - Use both — end-to-end is the outcome, per-step is the diagnosis. - Numbers to drop: "task-completion rate as headline", "per-step tool-call accuracy / success rate", "trajectory length budget per task type"

Common follow-ups: - "How do you collect ground truth for per-step quality?" - "What if the agent finds a creative correct path you didn't anticipate?"

Traps: - End-to-end only — gives outcome but no debug. - Per-step only — local optima, no whole-task signal.

Related cross-cutting: Architecture choices Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "How do you handle eval for a high-cardinality output space (e.g., code generation)?"¶

Tags: senior · occasional · design · source: standard senior probe; code-gen eval 2026

Answer outline: - Functional correctness over textual match. For code: unit tests, type-check, runtime execution, sandboxed exec — the code either works or doesn't. - HumanEval / MBPP / SWE-bench style — gold tests check behavioral equivalence, not string equality. - Layer in: static analysis (lint, security scan), maintainability (cyclomatic complexity), style adherence — these are deterministic. - LLM-judge for properties that aren't testable: code clarity, comment quality, idiomatic patterns. Validate the judge against engineer labels. - Per-problem stratification: easy / medium / hard like HumanEval+ — aggregate pass@k as headline, breakdown for debugging. - Numbers to drop: "pass@1, pass@10 as headline metrics", "SWE-bench: end-to-end repo-level task completion", "judge for non-testable properties needs engineer-aligned κ > 0.6"

Common follow-ups: - "What about creative code that solves the problem differently?" - "How would you eval for security vulnerabilities in generated code?"

Traps: - String matching against reference solution — penalizes valid alternative solutions. - No execution at all — you're guessing at correctness.

Related cross-cutting: Architecture choices Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "What's a 'system eval' regression test that catches issues an LLM benchmark would miss?"¶

Tags: senior · occasional · conceptual · source: standard staff systems probe

Answer outline: - Benchmarks test the model; system evals test your wiring. Examples of regressions only a system eval catches: - Retrieval misroute — model is fine, but a new prompt template confuses the retriever's query rewriter. - Citation hallucination — model generates citations to docs that don't exist in your corpus. - Tool-call schema break — model's tool-call format changed slightly, your parser rejects 5% of valid responses. - Latency-quality trade — model returns same quality at higher latency, no benchmark notices. - Multi-turn state corruption — single-turn eval passes, but the system loses context across turns. - Numbers to drop: "5 system-only regression classes", "citation hallucination check: 'does cited doc exist?'", "multi-turn state retention test"

Common follow-ups: - "How would you build a test for tool-call schema break?" - "Which of these are easiest to automate?"

Traps: - Believing model-level benchmarks tell you about system health. - No multi-turn state tests at all.

Related cross-cutting: Architecture choices Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "Your golden-set eval pass rate is 99% but a user-reported bug is reproducible. Why isn't it in the golden set?"¶

Tags: senior · common · debugging · source: standard 2026 production scenario

Answer outline: - Coverage gap — the golden set was built for known intents; the bug is in an unmodeled intent or edge case. - Triage: (1) reproduce the bug, (2) add it to golden set as a failing case, (3) instrument cause — was it a retrieval miss, generation issue, or system bug? - Process fix: feedback loop from production failures to golden set must be weekly or faster. Hamel: "you can never stop looking at data." - Stratify golden set by intent so coverage gaps surface as low example counts per slice — easier to spot. - Numbers to drop: "weekly triage cadence: 5-20 new examples added", "every bug should yield ≥1 golden-set entry"

Common follow-ups: - "How would you stop this from recurring?" - "What's the cost of expanding the golden set forever?"

Traps: - Calling the golden set "complete." - Not closing the loop from production bugs into eval.

Related cross-cutting: Production patterns Related module: learning/04_ai_product_evals/00_ai_evals_release_gates/