Cross-Cutting Trade-Off Questions

The most-asked senior-loop genre: "when X vs Y?". This file collects them in one place so topic files can deep-link in. Anchors use level-2 sub-topic headings.

Architecture choices

Q: "When would you fine-tune vs use RAG vs prompt engineering?"

Tags: mid · very-common · scenario · source: adilshamim8 — Every AI Engineer Interview Question 2026; lockedinai — AI Engineer Interview Questions Q3/Q17

Answer outline: - Decision rule: prompt-engineer first, add RAG when failures are missing/stale facts, fine-tune when failures are behavior (format, tone, classification accuracy). - RAG wins when knowledge changes faster than your retraining cadence (pricing, policies, product specs that update weekly/monthly) and when you need citations. - Fine-tune wins when you need a consistent JSON schema, brand voice, or a domain classifier that prompts can't reliably enforce, and you have 1k+ labeled examples. - Prompt-engineering is the right floor: zero infra, fastest iteration, but plateaus once the task needs >5k tokens of system prompt or strict schema compliance. - Production default in 2026 is hybrid: a LoRA adapter for format/voice + RAG for facts. The "vs" framing is mostly an interview artifact. - Numbers to drop: "RAG iteration loop: hours. SFT loop: days. RLHF/DPO loop: weeks. Pick the shortest loop that hits the SLA."

Common follow-ups: - "Your RAG answers are factually correct but tonally wrong — what do you do?" - "When does prompt engineering stop scaling?"

Traps: - Saying "fine-tuning teaches the model facts" — it teaches behavior; facts in weights go stale and can't be cited. - Recommending RAG for a tiny, static FAQ where a 2k-token system prompt would do. - Forgetting that fine-tuning a hosted model locks you to that provider's serving stack.

Q: "Single agent vs multi-agent — when is multi-agent actually worth it?"

Tags: senior · very-common · design · source: aemonline — 25 Advanced Agentic AI Interview Questions 2026; Analytics Vidhya — 30 Agentic AI Interview Questions

Answer outline: - Default to single agent with a tool belt. Multi-agent is overhead until you have a real reason: parallelism, fault isolation, or specialized prompts that don't fit one context window. - Choose multi-agent when sub-tasks are independent (parallel research over 20 URLs), require different system prompts (coder + reviewer + tester), or need different models (Haiku for triage, Opus for synthesis). - Pick supervisor-based when one agent owns the plan and delegates; pick peer-to-peer when agents negotiate. Supervisor is safer for production; peer-to-peer is harder to debug and prone to chatter loops. - Costs of multi-agent: message-passing tokens (often 2–5x a monolithic prompt), longer end-to-end latency, harder eval, harder rollback. - Decision rule: if you can't draw a graph where every edge has a clear handoff condition, you don't need multi-agent; you need a better prompt. - Numbers to drop: "Anthropic's write-up of its multi-agent research system reported ~15x the token usage of a plain chat interaction (single agents: ~4x); the win has to clear that bar."

Common follow-ups: - "How would you test a multi-agent system end-to-end?" - "How do you stop two agents from talking past each other?"

Traps: - Spinning up CrewAI for a task that's really a 3-step prompt chain. - Letting agents share unbounded scratchpad memory — context explodes. - No supervisor and no termination condition: the loop runs until the budget guard kills it.

Q: "Self-host vs API — when does each win?"

Tags: senior · very-common · design · source: MyEngineeringPath — LLM Interview Questions 2026; TokenMix — Self-Host LLM vs API 2026

Answer outline: - API wins below ~$20k/month of inference spend, when you need frontier capability (Opus/GPT-5-class), and when you don't have GPU ops talent. - Self-host wins when (a) data residency forbids egress, (b) volume is 100M+ tokens/day, (c) you need custom kernels or fine-tuned weights you control, or (d) you need predictable latency without rate limits. - Hidden costs of self-hosting: GPU underutilization (batch effects), engineer-months for vLLM/TensorRT-LLM tuning, on-call rotation, model upgrade cadence. - Hidden benefits of API: instant model upgrades, prompt caching credit, batch API discounts (often 50%), built-in safety classifiers. - Decision rule: self-host when control or compliance is non-negotiable; API otherwise. Don't self-host to "save money" unless you've modeled GPU $/M-tokens at your real utilization. - Numbers to drop: "Break-even is ~$20k/mo API spend; H100 list ~$2/hr on-demand, but you only get the math to work at >40% utilization."

Common follow-ups: - "Your CFO says cut API spend by 50% — walk me through the options." - "What's your model-upgrade strategy on self-host?"

Traps: - Ignoring utilization in the cost model — a $30k/mo H100 cluster at 15% utilization is more expensive than the API it replaces. - Forgetting that fine-tuned open models still need a serving stack (vLLM, TGI in maintenance, SGLang, TensorRT-LLM).
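The utilization point in the decision rule is easy to check with arithmetic. The sketch below is illustrative only: the 2,500 tokens/s peak throughput is a hypothetical figure, not one from the sources above, and the $2/hr rate is the H100 number quoted in the answer.

```python
def self_host_usd_per_mtok(gpu_usd_per_hour: float,
                           peak_tokens_per_sec: float,
                           utilization: float) -> float:
    """Effective $ per 1M generated tokens for one GPU at a given utilization."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = peak_tokens_per_sec * 3600 * utilization
    return gpu_usd_per_hour / tokens_per_hour * 1_000_000


# Hypothetical inputs: $2/hr GPU, 2,500 tok/s peak throughput.
for util in (0.15, 0.40, 0.80):
    cost = self_host_usd_per_mtok(2.0, 2500, util)
    print(f"{util:.0%} utilization: ${cost:.2f} per 1M tokens")
```

Compare the result against your blended API price per 1M tokens. Engineering time and on-call cost are not modeled here and push the break-even higher.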

Q: "What are the trade-offs between open-source and proprietary LLMs?"

Tags: mid · very-common · conceptual · source: lockedinai — AI Engineer Interview Questions Q37

Answer outline: - Proprietary (Claude, GPT, Gemini): top-of-leaderboard quality, fastest model upgrades, batch/prompt-caching discounts, but vendor lock-in and no weight ownership. - Open-weight (Llama 3.x, Qwen 2.5, DeepSeek, Mistral): full control, on-prem deploy, fine-tune the actual weights, but you own GPU ops and you lag frontier by 6–12 months on hard tasks. - Hybrid is normal: frontier API for the orchestrator/judge, open-weight 7B–20B for the high-volume worker. - Compliance angle: HIPAA/GDPR/SOC-2 stories are easier with proprietary providers that have BAAs; open-weight gives you isolation but you have to build the rest. - Numbers to drop: "Well-distilled 7B–20B open models handle 80–90% of single-turn chat queries previously sent to 70B+ frontier models (2025 distillation literature)."

Common follow-ups: - "Your latency budget is 300ms p95 — does that change the answer?" - "How do you decide which queries get routed to which tier?"

Traps: - Calling open-weight models "open source" when only weights are released, not training data or code. - Assuming open-weight = free; serving cost dominates, not weight cost.

Q: "Prompt chaining vs single-shot vs agent loop — pick one for a complex task."

Tags: senior · common · design · source: Analytics Vidhya — Agentic AI Interview Questions; aicompetence.org "Prompt Chaining Vs Agentic AI"

Answer outline: - Single-shot when the task fits in one well-engineered prompt under ~8k tokens of context and has a single deterministic output. Cheapest, fastest, most debuggable. - Prompt chain (fixed DAG) when the task decomposes into 2–6 stable steps and you want determinism: extract → validate → format → output. Each node is independently evaluable. - Agent loop when the next step genuinely depends on prior observations (search results, tool errors, user clarifications). Pay the cost only when branching is real. - Decision rule: if you can draw the flowchart before runtime, use a chain. If the flowchart changes per input, use an agent. - Hybrid: deterministic chain wrapping a bounded agent loop is the production sweet spot — most teams call this "workflows with agentic nodes" (LangGraph's model). - Numbers to drop: "Single-shot: 1 LLM call. Chain: N calls, latency adds linearly. Agent loop: 3–15 calls typical, p95 latency 2–4x a fixed chain."

Common follow-ups: - "How do you bound the loop?" - "How do you eval a chain end-to-end vs each step?"

Traps: - Reaching for agent loops when a chain would do — adds cost, latency, and unpredictability. - No iteration cap or token budget → runaway loops.
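For the "how do you bound the loop?" follow-up, one common shape is an iteration cap plus a token budget around the agent step. This is a minimal sketch, not any framework's API: `step` is a hypothetical callable standing in for one LLM-plus-tool round, and `StepResult` is an illustrative type.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class StepResult:
    tokens_used: int
    final_answer: Optional[str] = None  # set when the agent decides it is done


def run_bounded(step: Callable[[int], StepResult],
                max_iterations: int = 10,
                token_budget: int = 20_000) -> Tuple[Optional[str], str]:
    """Run `step` until it returns an answer or a guard trips.

    Returns (answer, stop_reason).
    """
    spent = 0
    for i in range(max_iterations):
        result = step(i)
        spent += result.tokens_used
        if result.final_answer is not None:
            return result.final_answer, "done"
        if spent >= token_budget:
            return None, "token_budget_exceeded"
    return None, "max_iterations_reached"
```

Returning the stop reason alongside the answer makes runaway loops visible in logs and evals instead of looking like ordinary failures.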

Q: "Stateless vs stateful agents — trade-offs and use cases?"

Tags: mid · common · conceptual · source: adilshamim8 — Every AI Engineer Interview Question 2026

Answer outline: - Stateless: every request carries its own context. Horizontally scales like any HTTP service, easy to load-balance, no session affinity needed. Use for one-shot tools. - Stateful: agent maintains memory across turns (Redis, Postgres, vector store). Required for chat, long-running plans, user personalization. - Real choice is where state lives: in the LLM context window (cheap, ephemeral), in a session store (sticky session), or in a database keyed by user (persistent). - Default: stateless service + externalized memory. Don't make the process stateful; make the data stateful. - Numbers to drop: "A 32k-token rolling window costs ~$0.05–$0.10/turn at Sonnet pricing; cheap. A pgvector-backed semantic memory at 1M vectors: ~$50/mo on a small Postgres."

Common follow-ups: - "How do you trim conversation history?" - "User starts a new session — what carries over?"

Traps: - Stuffing the entire conversation into context every turn → linear cost growth. - Storing PII in long-term memory without TTL or right-to-delete hooks.
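One way to answer "how do you trim conversation history?" is a rolling window that always keeps the system message and then the newest turns that fit. In this sketch the whitespace word count is a crude stand-in for a real tokenizer, and the message dict shape (`role`, `content`) is an assumption.

```python
def trim_history(messages, max_tokens,
                 count_tokens=lambda m: len(m["content"].split())):
    """Keep system messages plus the most recent turns that fit in max_tokens."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(count_tokens(m) for m in system)
    kept = []
    for m in reversed(turns):      # walk newest to oldest
        cost = count_tokens(m)
        if cost > budget:
            break
        kept.append(m)
        budget -= cost
    return system + kept[::-1]     # restore chronological order
```

Older turns that fall out of the window can be summarized or written to the external memory store rather than dropped.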

Q: "What's the difference between an agent and a simple LLM chain?"

Tags: screen · very-common · conceptual · source: adilshamim8 — Every AI Engineer Interview Question 2026

Answer outline: - Chain: control flow is in your code. You decide step order. LLM is called like a function. - Agent: control flow is in the LLM. The model decides which tool to call next based on observations. - Chains are deterministic; agents are stochastic. Chains debug like a pipeline; agents debug like a distributed system. - Pick chain when the workflow is known; pick agent when it isn't. - Numbers to drop: "Chains: predictable token cost per request (±10%). Agents: tail can be 5–10x the median when the loop expands."

Common follow-ups: - "When did your team graduate from a chain to an agent?" - "How do you know an agent has gone wrong?"

Traps: - Calling any sequence with multiple LLM calls "an agent." - Confusing tool use with agency — tool use can be in a chain too.

Retrieval

Q: "Sparse (BM25) vs dense vs hybrid retrieval — pick one."

Tags: mid · very-common · scenario · source: adilshamim8 — Every AI Engineer Interview Question 2026; KalyanKS — RAG Interview Questions Hub Q50/Q52; DataCamp — RAG Interview Questions

Answer outline: - Sparse (BM25): exact-keyword match, no embedding model needed, near-zero infra, perfect for codes, SKUs, error messages, legal citations. - Dense (bi-encoder + ANN): semantic match, handles paraphrase, but misses rare tokens and proper nouns the embedding hasn't seen. - Hybrid: run both, fuse with Reciprocal Rank Fusion (RRF, k=60) or weighted sum. Standard production default in 2026. - Decision rule: hybrid unless you've measured that one method alone meets recall@10. For 95% of enterprise corpora, hybrid wins by 5–15 points NDCG@10. - Implementation: Elasticsearch/OpenSearch + a vector DB, or a single store that does both (Qdrant, Weaviate, OpenSearch). - Numbers to drop: "BM25 alone: ~50–60% recall@10 on technical docs. Dense alone: ~60–70%. Hybrid: 75–85%. Reranking on top adds another 5–10 points."

Common follow-ups: - "Your query is a UUID — what happens with dense-only?" - "How do you tune the fusion weight?"

Traps: - Skipping BM25 because "embeddings are smarter" — they whiff on proper nouns and codes. - Using cosine similarity threshold as a cutoff instead of top-k + rerank.
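Reciprocal Rank Fusion, as named in the answer, fits in a few lines. The sketch below assumes each retriever returns an ordered list of document IDs; the IDs are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of doc IDs: score(d) = sum over lists of 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score descending; break ties by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_hits = ["sku-123", "doc-a", "doc-b"]    # hypothetical sparse results
dense_hits = ["doc-a", "doc-c", "sku-123"]   # hypothetical dense results
print(rrf_fuse([bm25_hits, dense_hits]))
```

RRF uses ranks only, so it avoids normalizing BM25 scores against cosine similarities. A weighted-sum fusion would need that normalization plus a tuned weight.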

Q: "Cross-encoder rerank vs bigger LLM at generation — which spend wins?"

Tags: senior · common · design · source: adilshamim8 — Every AI Engineer Interview Question 2026; ZeroEntropy — Reranker Guide 2026

Answer outline: - Reranker (Cohere Rerank, BGE-reranker, Jina) at $0.50–$2 per 1k queries adds ~50ms; bigger generator (Sonnet → Opus, GPT-4o → o1) adds $5–$30 per 1k queries and 2–5x latency. - Decision rule: spend the reranker dollar first. Bad context defeats a smart generator; good context lets a small generator shine. - Reranker reduces token waste at generation: top-50 from ANN → top-5 reranked → fewer tokens, lower cost, less distraction. - Stack: ANN (top-50) → cross-encoder rerank (top-5) → generator. Adding a stronger generator on top of bad retrieval is the most expensive way to hide a retrieval bug. - Numbers to drop: "Documented case: feeding 75 candidates of 500 tokens to GPT-4o at 10 qps = ~$162k/day. Rerank → top-20 = ~$44k/day, 72% cost reduction, ~95% accuracy retained. Reranking cuts hallucinations ~35% vs raw embedding similarity (Databricks benchmark)."

Common follow-ups: - "Where would you put the reranker — same service or sidecar?" - "When would a reranker hurt?"

Traps: - Pumping more context at a frontier model instead of fixing recall@10. - Skipping rerank "because dense retrieval is good enough" without measuring NDCG.
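The daily-cost figure in the documented case can be reproduced with simple arithmetic. In the sketch below the $5 per 1M input tokens price is an assumption: the source does not state it, and it is chosen because it reproduces the quoted pre-rerank number.

```python
SECONDS_PER_DAY = 86_400


def daily_input_cost_usd(candidates: int, tokens_per_candidate: int,
                         qps: float, usd_per_mtok: float) -> float:
    """Generator input spend per day for retrieved context only."""
    tokens_per_day = candidates * tokens_per_candidate * qps * SECONDS_PER_DAY
    return tokens_per_day / 1_000_000 * usd_per_mtok


# usd_per_mtok=5.0 is an assumed input price, not a figure from the source.
before = daily_input_cost_usd(75, 500, 10, 5.0)
after = daily_input_cost_usd(20, 500, 10, 5.0)
print(f"before: ${before:,.0f}/day  after: ${after:,.0f}/day  "
      f"saved: {1 - after / before:.0%}")
```

Under this assumption the post-rerank figure comes out slightly below the quoted ~$44k/day. The gap may be reranker fees, which this sketch does not model.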

Q: "Compare reasoning vs non-reasoning LLMs for RAG systems."

Tags: senior · common · conceptual · source: KalyanKS — RAG Interview Questions Hub Q22

Answer outline: - Reasoning models (o3, DeepSeek-R1, Claude Opus extended thinking) help when retrieval returns conflicting or partially-relevant docs — they can reconcile. - Non-reasoning models (Sonnet, Haiku, GPT-4o-mini) are 3–10x cheaper and 2–5x faster; correct for clean, high-relevance retrievals. - Decision rule: spend the reasoning budget on synthesis only when retrieval is messy or the question requires multi-hop. Otherwise, fix retrieval and use a non-reasoning model. - Hybrid: cheap model first, escalate to reasoning model on low-confidence judgments (router pattern). - Numbers to drop: "Reasoning models often add 5–30 seconds latency and 5–20x cost per call vs the non-reasoning peer."

Common follow-ups: - "How do you build a cost-aware router?" - "When does extended thinking not help?"

Traps: - Defaulting to a reasoning model for trivial extractions. - Ignoring that reasoning tokens count against your context budget.
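The router pattern from the answer reduces to "try cheap, escalate on low confidence". In this sketch both models are hypothetical callables returning an answer and a confidence score. How that confidence is produced (a judge model, logprobs, a retrieval-quality signal) is left open and is an assumption of the example.

```python
from typing import Callable, Tuple

# A model is any callable returning (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


def route(query: str, cheap: Model, reasoning: Model,
          min_confidence: float = 0.8) -> Tuple[str, str]:
    """Try the cheap model first; escalate when its confidence is low."""
    answer, confidence = cheap(query)
    if confidence >= min_confidence:
        return answer, "cheap"
    answer, _ = reasoning(query)
    return answer, "reasoning"
```

Logging the second element of the return value gives the escalation rate, which is the number to watch when tuning `min_confidence` against cost.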

Q: "Is RAG still relevant in the era of long-context LLMs?"

Tags: senior · very-common · conceptual · source: KalyanKS — RAG Interview Questions Hub Q2

Answer outline: - Yes. Long context ≠ unlimited context; quality degrades in the middle ("lost-in-the-middle" effect persists at 200k+ tokens), and cost scales with input tokens regardless of cache. - RAG remains the answer for: corpus larger than the context window, knowledge that changes faster than your prompt, citation requirements, multi-tenant isolation. - Long-context shines when the question genuinely needs all of one document (a 100-page contract, a single codebase) and you can afford the input tokens. - Hybrid: retrieve the right documents into a long context window — "RAG into long context" beats both pure RAG (more grounding signal) and pure long-context (cheaper, more focused). - Numbers to drop: "Putnam-style long-context bench shows ~10–25% accuracy drop on facts buried in middle 50% of a 128k-token prompt. Prompt-caching helps cost, not attention."

Common follow-ups: - "Anthropic launched 1M-token context — does that change your answer?" - "Where would you skip RAG entirely?"

Traps: - "Long context killed RAG" — it didn't; cost and grounding still matter. - Ignoring prompt-cache pricing that makes large repeated prefixes cheap but doesn't fix attention.

Q: "What are the trade-offs between chunking documents into larger versus smaller chunks?"

Tags: mid · very-common · scenario · source: DataCamp — RAG Interview Questions

Answer outline: - Small chunks (128–256 tokens): high precision, low recall per chunk, more chunks per query, more embedding cost, can miss multi-paragraph reasoning. - Large chunks (1024–2048 tokens): high recall, lower precision, risk of "lost-in-the-middle" inside the chunk, more generator tokens per result. - Decision rule: chunk size ≈ the granularity of a typical answer-supporting passage. For QA over docs, 400–512 tokens with 50–100 overlap is the workhorse default. - Use parent-child / hierarchical chunking when you want small chunks for retrieval but large chunks for generation context. - Numbers to drop: "Chroma 2024 study: RecursiveCharacterTextSplitter at 400–512 tokens delivered 85–90% recall without semantic-chunking overhead. Semantic chunking adds 5–9% recall but costs $0.05–$0.20 per 1M tokens to compute."

Common follow-ups: - "How do you pick the overlap?" - "When would you skip chunking entirely?"

Traps: - "Bigger chunks = more context = better" — not when noise drowns the signal. - Forgetting to re-embed when chunk size changes; embeddings are tied to chunk boundaries.
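To make chunk size and overlap concrete, here is a sliding-window chunker over a pre-tokenized list. It is plain fixed-size windowing, not the recursive splitter mentioned in the answer, and it assumes tokenization happens elsewhere.

```python
from typing import List


def chunk_with_overlap(tokens: List[str], size: int = 512,
                       overlap: int = 50) -> List[List[str]]:
    """Split a token list into windows of `size` tokens sharing `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):   # this window already reaches the end
            break
    return chunks
```

Changing `size` or `overlap` changes every chunk boundary, which is why the corpus has to be re-embedded afterwards.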

Q: "Fixed vs semantic vs recursive chunking — which do you ship by default?"

Tags: mid · common · scenario · source: Firecrawl — Best Chunking Strategies 2026; Level Up Coding — Chunking Insights from 80+ Interviews

Answer outline: - Recursive (LangChain RecursiveCharacterTextSplitter) is the production default: respects paragraph/sentence boundaries, fast, no extra model calls. - Fixed-size: simplest, useful for code or tabular data where structure is uniform; bad for prose. - Semantic chunking: uses embedding similarity to cut at topic shifts. Best recall, but costs an extra embedding pass and is hard to debug. - Decision rule: ship recursive at 400–512 tokens / 50-token overlap on day one; upgrade to semantic only if eval shows recall ceiling. - Hierarchical (parent/child) beats all three when answers need wider context than retrieval needs. - Numbers to drop: "Recursive 512/50: ~85% recall, $0 extra compute. Semantic: ~92% recall, +$0.10–$0.30 per 1M tokens ingested."

Common follow-ups: - "How do you chunk a PDF with tables?" - "How would you evaluate that the chunker is the bottleneck?"

Traps: - Semantic chunking before measuring whether recursive is actually the bottleneck. - Re-ingesting daily without checkpointing the chunker version — silent eval drift.

Q: "What is CAG, and how does it differ from traditional RAG? When would you prefer CAG over RAG in production?"

Tags: senior · occasional · conceptual · source: DataCamp — RAG Interview Questions

Answer outline: - CAG (Cache-Augmented Generation): pre-compute KV-cache for a fixed corpus and reuse it across queries — corpus lives in attention, not in retrieval. - Works when corpus is small and static (a product manual, a single legal contract) and fits in context with prompt caching. - RAG remains correct when corpus is large, dynamic, multi-tenant, or needs per-query filtering (ACLs). - Decision rule: CAG when corpus < context window and changes slower than your cache TTL; RAG otherwise. - Numbers to drop: "Anthropic prompt-caching: 90% discount on cached input tokens, 5-minute default TTL. Makes CAG cheap for stable corpora up to ~150k tokens."

Common follow-ups: - "Could you combine CAG and RAG?" - "How do you invalidate the cache on document updates?"

Traps: - Treating CAG as a general replacement for RAG; it isn't. - Ignoring per-user / per-tenant ACL — CAG mixes everyone's data into one cache.

Q: "Pinecone vs pgvector vs Qdrant — pick one for production."

Tags: senior · common · design · source: KnowSync — Vector DB 2026; 4xxi — Vector Database Comparison 2026

Answer outline: - pgvector: pick if you already run Postgres, dataset <10M vectors, want vectors next to operational data. ACID, joins, mature ops. - Qdrant: pick if you need filtered search at scale, sub-5ms p50, Rust performance, or want to self-host predictable cost. - Pinecone: pick if zero ops is the requirement and you're fine with less recall tuning. Serverless tier good for prototyping. - Decision rule: start with pgvector unless your scale or filter complexity disqualifies it. Move to Qdrant for self-host scale; move to Pinecone for managed and "no DBA wanted." - Numbers to drop: "pgvector with HNSW matches dedicated vector DBs up to ~1M vectors; Qdrant p50 < 5ms at high recall; Pinecone Serverless slower than dedicated but ops-free."

Common follow-ups: - "Your dataset just hit 50M vectors — what changes?" - "How do you migrate from Pinecone to self-hosted?"

Traps: - Spinning up a dedicated vector DB for 100k vectors when Postgres would do. - Ignoring filter cardinality — high-cardinality filters destroy ANN performance.

Q: "Compare scalar and binary quantization for embeddings in RAG retrieval."

Tags: senior · occasional · conceptual · source: KalyanKS — RAG Interview Questions Hub Q60

Answer outline: - Scalar (int8) quantization: 4x memory reduction, ~1% recall loss, drop-in for most embedders. Standard production default. - Binary quantization: 32x memory reduction, ~5–10% recall loss, requires a rescoring step (top-k binary → re-rank with float). - Decision rule: scalar first; binary only when memory or RAM cost dominates and you can afford the rescore. - Combine with matryoshka embeddings for further dim-reduction without re-embedding. - Numbers to drop: "1B vectors at 1536 dim float32 = 6 TB; int8 = 1.5 TB; binary + rescore = 192 GB. Recall delta usually <1% for scalar, 3–8% for binary with rescore."

Common follow-ups: - "How does rescoring work in binary?" - "When would you not quantize?"

Traps: - Binary quantization with no rescore stage — recall craters. - Quantizing too early in iteration; you can't tune what you can't measure pre-quantization.
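The memory figures in the answer follow directly from vectors × dimensions × bits. This sketch counts raw vector storage only and ignores index-graph and payload overhead.

```python
def index_size_bytes(num_vectors: int, dim: int, bits_per_dim: int) -> int:
    """Raw vector storage only; ignores index graph and payload overhead."""
    return num_vectors * dim * bits_per_dim // 8


N, DIM = 1_000_000_000, 1536
for name, bits in (("float32", 32), ("int8", 8), ("binary", 1)):
    print(f"{name:8s} {index_size_bytes(N, DIM, bits) / 1e12:.3f} TB")
```

The binary figure covers only the in-memory codes. The rescore stage still needs higher-precision vectors stored somewhere, typically on disk, and that is not counted here.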

Q: "Compare HyPE and HyDE techniques in RAG."

Tags: senior · occasional · conceptual · source: KalyanKS — RAG Interview Questions Hub Q29

Answer outline: - HyDE (Hypothetical Document Embeddings): at query time, LLM generates a fake answer, embed that, retrieve nearest real docs. Bridges query–passage vocabulary gap. - HyPE (Hypothetical Prompt Embeddings): at index time, LLM generates hypothetical questions for each chunk, embed those, retrieve by query-to-question similarity. Bridges the same gap at the other end. - HyDE adds latency per query (extra LLM call); HyPE adds one-time index cost but query-time is free. - Decision rule: HyPE if corpus is stable and you can pre-compute; HyDE if corpus changes faster than re-index cadence. - Numbers to drop: "HyDE adds ~500–1500ms per query; HyPE adds 0ms at query, but indexing cost scales with corpus × questions per chunk (often $0.50–$2 per 1M tokens ingested)."

Common follow-ups: - "Can you combine the two?" - "When does HyDE backfire?"

Traps: - Running HyDE in a latency-sensitive path without budgeting the extra call. - HyPE with too many synthetic questions per chunk — index bloat, no recall gain.

Q: "What is a vector database, and how does it differ from a traditional database?"

Tags: screen · very-common · conceptual · source: lockedinai — AI Engineer Interview Questions Q23

Answer outline: - Vector DB: optimized for approximate nearest-neighbor search over high-dim float vectors using HNSW, IVF, or DiskANN indexes. - Traditional DB: optimized for exact lookups, joins, transactions, range queries on scalar/structured data. - Vector DBs trade exactness for speed: ANN is approximate, sub-linear time, recall@k usually 95–99%. - Modern Postgres (pgvector) blurs the line — same DB does both. - Decision rule: dedicated vector DB when scale, filtered ANN performance, or feature surface (sparse + dense, multi-vector, payload filters) outpaces what your OLTP DB offers. - Numbers to drop: "HNSW at default params: ~95% recall@10 with 100x speedup vs brute force. IVF-PQ scales to billions of vectors but recall drops to 80–90% without tuning."

Common follow-ups: - "Why not just use Postgres for vectors?" - "What's the index for billions of vectors?"

Traps: - Calling cosine similarity over a SELECT loop "a vector database." - Forgetting that ANN ≠ KNN; recall is a tunable knob, not a guarantee.
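Because ANN is approximate, recall@k is measured against an exact brute-force search over the same vectors. The sketch below uses dot-product scoring and toy two-dimensional vectors, both chosen purely for illustration.

```python
from typing import Dict, List, Sequence


def exact_top_k(query: Sequence[float],
                vectors: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Brute-force nearest neighbours by dot product (the ground truth)."""
    def score(item):
        return sum(q * v for q, v in zip(query, item[1]))
    ranked = sorted(vectors.items(), key=score, reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


def recall_at_k(approx_ids: List[str], exact_ids: List[str], k: int = 10) -> float:
    """Fraction of the true top-k that the approximate index returned."""
    truth = set(exact_ids[:k])
    if not truth:
        return 0.0
    return len(truth & set(approx_ids[:k])) / len(truth)
```

Running this over a sample of real queries while sweeping index parameters (for HNSW, the search-time candidate list size) shows the recall/latency curve you are actually buying.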

Cost & latency

Q: "Latency optimization: smaller model vs speculative decoding vs caching — pick one."

Tags: senior · common · design · source: MyEngineeringPath — LLM Inference Optimization 2026; adilshamim8 — Every AI Engineer Interview Question 2026

Answer outline: - Caching first: prompt cache (Anthropic/OpenAI) and semantic cache cost nothing per saved call. Always the cheapest win. - Speculative decoding next: 2–3x speedup with mathematically identical output (zero quality loss). Free win if your serving stack supports it (vLLM, TensorRT-LLM). - Smaller model last: quality trade-off, requires eval, requires routing logic for fallback. - Decision rule: cache → spec-decode → smaller (distilled) model, and the smaller model only if cache + spec-decode don't hit the SLA. - Numbers to drop: "Prompt-cache hit: ~85% input-token discount. Speculative decoding: 2–3x faster decoding, same output. Distilled 7B replacing 70B: 5–10x cheaper, but ~2–10% quality drop depending on task."

Common follow-ups: - "Your p95 latency is 4s, target 1s — walk through the levers." - "Where does spec-decode break down?"

Traps: - Quantizing or distilling before exhausting caching — premature optimization. - Treating spec-decode as quality-tradable; it isn't (lossless).
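The "lossless" claim is easiest to see in the greedy case. The toy below treats both models as hypothetical next-token functions, and shows the target's token winning at the first disagreement. Production engines verify all drafted positions in one batched forward pass and use rejection sampling to preserve the sampling distribution; neither is modeled here.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]   # greedy next-token function


def greedy_decode(target: NextToken, prompt: List[int], n_new: int) -> List[int]:
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    seq = list(prompt)
    goal = len(prompt) + n_new
    while len(seq) < goal:
        # 1) Draft model cheaply proposes up to k tokens.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft(seq + proposal))
        # 2) Target checks each position (one batched pass in a real engine).
        for i, tok in enumerate(proposal):
            expected = target(seq + proposal[:i])
            if tok != expected:
                # First mismatch: keep accepted prefix plus the target's token.
                seq.extend(proposal[:i] + [expected])
                break
        else:
            seq.extend(proposal)   # every drafted token was accepted
    return seq
```

The speedup depends on the acceptance rate. A draft model that disagrees often makes the target do the same work plus the drafting overhead, which is one place spec-decode breaks down.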

Q: "Distillation vs quantization — which one for cost cuts?"

Tags: senior · common · scenario · source: MyEngineeringPath — LLM Inference Optimization 2026; Analytics Vidhya Agentic AI Q23

Answer outline: - Quantization: shrinks the same model. INT8 = ~1% quality drop, 2x memory cut. INT4 (AWQ/GPTQ) = 4x memory cut, ~2–5% quality drop on most tasks. - Distillation: smaller model trained to mimic a larger one. Bigger structural cut, bigger quality cost, requires training pipeline. - Decision rule: quantize first — almost free, works on any served model. Distill when quantization isn't enough and you have eval data. - Recommended pipeline order from NVIDIA Model Optimizer: prune → distill → quantize (P-K-D-Q). - Numbers to drop: "INT8: <1% quality drop, half the memory. INT4: 4x memory cut, run a 70B on a single H100. Distillation: 7B can recover 80–90% of 70B quality on chat/reasoning tasks (2025 distillation literature)."

Common follow-ups: - "Why not just do both?" - "When does INT4 hurt badly?"

Traps: - Quantizing a model that's already been LoRA-fine-tuned without re-checking eval. - Distilling from a fuzzy teacher — student inherits the noise.

Q: "Exact-match cache vs semantic cache vs prefix cache — pick one for a chatbot."

Tags: senior · common · design · source: Hub.Stabilarity — Semantic Prompt Caching; Redis — Prompt vs Semantic Caching

Answer outline: - Exact-match: hash the full prompt, return cached response on hit. Safest, lowest false-positive risk, but hit rate is tiny in real chat (12% typical). - Semantic: embed the user query, return cached response if similarity > threshold. 50–68% cost savings but risks wrong-answer-for-similar-query if threshold too loose. - Prefix cache (provider KV-cache): reuses computed KV for the shared system-prompt prefix. No false-positive risk, lower discount (~85% on cached input tokens), but only helps prefix. - Decision rule: use all three layers — provider prefix-cache always on, exact-match for hot queries, semantic with conservative threshold (0.95–0.97) + verification for high-volume reads. - Numbers to drop: "Exact match: 12% cost savings typical. Semantic at 0.95 threshold: 38–61%. Tiered async semantic: 68%. Anthropic prompt cache: 90% discount on cached input."

Common follow-ups: - "What threshold do you start at and how do you tune it?" - "How do you handle PII in a semantic cache?"

Traps: - Semantic cache with no verification → answering the wrong question confidently. - Shared cache across users without checking ACL — privacy leak.
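The exact-match and semantic layers can share one lookup path. This is a minimal in-memory sketch: `embed` is a hypothetical callable you supply, the linear scan stands in for an ANN index, and the verification step and per-user ACL check called out in the traps are deliberately left out.

```python
import hashlib
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TieredCache:
    def __init__(self, embed, threshold=0.95):
        self.embed = embed        # hypothetical: callable str -> list[float]
        self.threshold = threshold
        self.exact = {}           # sha256(prompt) -> response
        self.semantic = []        # list of (embedding, response)

    @staticmethod
    def _key(prompt):
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get(self, prompt):
        """Return (response, layer), where layer is 'exact', 'semantic' or 'miss'."""
        key = self._key(prompt)
        if key in self.exact:
            return self.exact[key], "exact"
        query_vec = self.embed(prompt)
        best = max(self.semantic,
                   key=lambda entry: cosine(query_vec, entry[0]),
                   default=None)
        if best is not None and cosine(query_vec, best[0]) >= self.threshold:
            return best[1], "semantic"
        return None, "miss"

    def put(self, prompt, response):
        self.exact[self._key(prompt)] = response
        self.semantic.append((self.embed(prompt), response))
```

The provider prefix cache sits outside this code: it is enabled on the API request itself and applies even when both layers miss.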

Q: "Batch vs streaming inference — when does each win?"

Tags: mid · common · design · source: Anyscale — LLM Batch Inference Basics; Databricks — LLM Inference Performance Engineering

Answer outline: - Streaming (token-by-token SSE/WebSocket): wins for chat UX — TTFT under 500ms feels instant even if total generation takes 8s. - Batch (collect requests, process together): wins for throughput and cost on async workloads — labeling, classification, embeddings, nightly summarization. - Same GPU; you can't have minimum latency AND maximum throughput at once. - Provider batch APIs (OpenAI, Anthropic): ~50% discount for 24-hour SLA — pure win for offline workloads. - Decision rule: streaming for human-facing, batch for system-to-system. - Numbers to drop: "Continuous batching (vLLM/SGLang): 5–10x throughput vs naive serial inference at same latency target. Batch API discount: 50% off list. TTFT target for good UX: <500ms."

Common follow-ups: - "What about background tasks that the user is waiting on?" - "Continuous batching vs static batching?"

Traps: - Streaming when the consumer is another service — wastes the streaming machinery, complicates retries. - Batching with batch size = max — kills tail latency for everyone.

Q: "vLLM vs TGI vs TensorRT-LLM vs SGLang — pick a serving stack."

Tags: staff · common · design · source: Yotta Labs — Best LLM Inference Engines 2026; Spheron — H100 Benchmarks 2026

Answer outline: - vLLM: fastest start, broadest model support, the safe default. PagedAttention + continuous batching are now table stakes. - SGLang: ~29% throughput win over vLLM when requests share prefixes (RAG, agents, chat with stable system prompt) via RadixAttention. - TensorRT-LLM: best peak throughput and latency, but ~28-minute model compile and tight coupling to NVIDIA. Worth it only at sustained high QPS. - TGI: in maintenance mode as of 2026 — HuggingFace's own README points users to vLLM/SGLang/llama.cpp. - Decision rule: prototype on vLLM; move to SGLang if you have heavy prefix reuse; graduate to TensorRT-LLM only when you've capped vLLM's throughput. - Numbers to drop: "TensorRT-LLM: 8–13% faster than vLLM at moderate-to-high concurrency. SGLang: ~29% throughput win on prefix-heavy workloads. TGI: officially maintenance-only in 2026."

Common follow-ups: - "Why is RadixAttention specifically good for agents?" - "How long does a TensorRT-LLM model upgrade take?"

Traps: - Picking TensorRT-LLM because "it's fastest" without modeling the rebuild-on-every-upgrade cost. - Still defaulting to TGI in greenfield work in 2026.

Q: "Should you optimize for latency or throughput? (for a personal assistant with one request)"

Tags: mid · common · scenario · source: adilshamim8 — Every AI Engineer Interview Question 2026

Answer outline: - One user, one request → latency. Throughput is a fleet-level metric; it doesn't help a single in-flight call. - Levers: smaller model, speculative decoding, KV-cache reuse, smaller context (don't dump 32k tokens when 4k suffices), stream tokens to mask total time. - Don't enable continuous batching with high max-batch — it's a throughput knob that hurts single-stream latency. - Numbers to drop: "Single-stream H100 token rate: 70–120 tok/s on a 70B model with spec-decode. TTFT target for chat UX: <500ms; total response for short answer: <3s."

Common follow-ups: - "Now 1k concurrent users — what changes?" - "Where does data parallelism fit?"

Traps: - "Use data parallelism" — it scales throughput, not single-request latency. - Adding a reranker in the hot path without budgeting its latency.

Q: "Cost vs quality: when is a small open-source model 'good enough' vs a GPT-4-class model?"

Tags: senior · very-common · scenario · source: adilshamim8 — Every AI Engineer Interview Question 2026

Answer outline: - Small open model (7B–20B, e.g., Llama 3.1 8B, Qwen 2.5 14B, Phi-4) is good enough for: classification, extraction, summarization of short docs, single-turn FAQ over a clean RAG corpus. - Frontier model needed for: multi-step reasoning, long-document synthesis, code generation across files, judgment calls where errors cost real money. - Build the eval first: a 200–500 example task-specific test set with human labels. Route by measured win rate, not vibes. - Hybrid router: small model produces, judge model validates, escalate on low confidence. - Numbers to drop: "Distilled 7B–20B models solve 80–90% of single-turn chat/reasoning queries previously routed to 70B+ frontier models. Cost ratio: ~10–30x cheaper per million tokens."

Common follow-ups: - "How do you build that eval?" - "What metric tells you when to escalate?"

Traps: - Anecdote-driven choices; "GPT-4 felt better" isn't a metric. - Picking the small model on aggregate accuracy but missing tail failures on edge cases.
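The hybrid router from the answer outline can be sketched as below. The three callables and the 0.7 threshold are hypothetical placeholders; in practice the threshold would come from the task-specific eval set, not from a default.

```python
from typing import Callable, Tuple


def route(
    query: str,
    small_model: Callable[[str], str],
    judge: Callable[[str, str], float],
    frontier_model: Callable[[str], str],
    threshold: float = 0.7,
) -> Tuple[str, str]:
    """Return (answer, tier). Escalate when the judge's confidence is below threshold."""
    draft = small_model(query)
    if judge(query, draft) >= threshold:
        return draft, "small"
    return frontier_model(query), "frontier"


# Stub models so the sketch runs without any API access.
small = lambda q: "short answer"
frontier = lambda q: "careful answer"
toy_judge = lambda q, a: 0.9 if len(q) < 40 else 0.3  # pretend long queries are hard

easy = route("What are your opening hours?", small, toy_judge, frontier)
hard = route("Compare these three contracts clause by clause and flag conflicts.",
             small, toy_judge, frontier)
print(easy[1], hard[1])  # small frontier
```

Logging the tier for every request makes the escalation rate measurable, which is what the cost estimate depends on.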

Q: "Pre-training vs SFT vs RLHF/DPO — when does each get touched at a company?"

Tags: senior · common · conceptual · source: adilshamim8 — Every AI Engineer Interview Question 2026

Answer outline: - Pre-training: almost no one. Labs (Anthropic, OpenAI, Meta, DeepSeek) do this. Domain pre-training is a $1M+ bet, usually unjustified outside finance/biotech. - SFT (supervised fine-tune): the workhorse — teach format, voice, classification. Needs labeled outputs. Most companies stop here. - DPO/ORPO/KTO (preference optimization): needs pairs of chosen/rejected outputs. Used to align tone, refusal behavior, safety. Replaced RLHF for most teams due to stability. - RLHF: needs a reward model and PPO/GRPO loop. Heavy. Frontier labs and a few well-funded teams. - Decision rule by data shape: labeled outputs → SFT; preference pairs → DPO; verifiable rewards (unit tests, math correctness) → RFT/GRPO; nothing labeled → RAG/prompt. - Numbers to drop: "SFT on 1k–10k examples with LoRA: 1–10 GPU-hours. DPO on 5k preference pairs: similar order. Full RLHF: weeks of engineering + GPUs."

Common follow-ups: - "Why has DPO largely replaced RLHF?" - "Your model is verbose — SFT or DPO?"

Traps: - Reaching for RLHF when DPO would do. - Doing SFT before you have a real eval set.

Fine-tuning vs alternatives

Q: "LoRA vs full fine-tune vs DPO — pick one for a small instruction-tuning dataset."

Tags: senior · very-common · scenario · source: Kumar Gauraw — Fine-tuning LLM LoRA DPO Guide 2026; Effloow — LoRA QLoRA Guide 2026

Answer outline: - For a small instruction dataset (a few hundred to ~10k labeled outputs), LoRA SFT is the answer. Updates 0.1–1% of params, recovers 95–99% of full FT quality on most tasks. - Full fine-tune only when you have 100k+ examples AND need cross-task generalization AND have the GPU budget. - DPO only if your data is preference pairs (chosen vs rejected), not labeled outputs. Different data shape, different objective. - QLoRA when you can't fit the base model on your GPU: 4-bit base + LoRA adapters. The QLoRA paper reports fine-tuning a 65B model on a single 48 GB GPU and a 33B model on a 24 GB consumer card. - Decision rule: instruction data → LoRA SFT. Preference data → DPO. Massive data + need to shift internal representations → full FT. The 2026 recipe is "thin LoRA + RAG." - Numbers to drop: "LoRA on Llama 3.1 8B with 5k examples: 1–4 GPU-hours on a single H100. Quality within 1–3 points of full FT. QLoRA cuts base-weight VRAM ~75% vs 16-bit LoRA."

Common follow-ups: - "What rank do you start LoRA at?" - "How do you eval the fine-tuned model?"

Traps: - Full fine-tune on 1k examples → catastrophic forgetting. - DPO on labeled outputs (no preference pairs) — wrong objective.
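The "small fraction of params" claim follows from the shape of the LoRA update: a rank-r adapter on a d_out × d_in weight adds two thin matrices with r·(d_out + d_in) entries in total. A quick check, assuming an illustrative 4096-wide square projection and rank 16:

```python
def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight: B (d_out x r) plus A (r x d_in)."""
    return rank * (d_out + d_in)


d = 4096  # assumed hidden size, for illustration only
full = d * d
added = lora_params(d, d, rank=16)
print(added, f"{added / full:.2%}")  # 131072 0.78%
```

Across a whole model the fraction is usually lower than this per-matrix figure, since adapters are typically attached to only some of the weight matrices.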

Q: "QLoRA vs LoRA — when would you choose one over the other?"

Tags: senior · common · conceptual · source: adilshamim8 — Every AI Engineer Interview Question 2026

Answer outline: - QLoRA = base model in 4-bit + LoRA adapters in fp16/bf16. Use when VRAM is the bottleneck. - LoRA on a 16-bit base: faster training, slightly higher fidelity, but needs more VRAM. - Decision rule: QLoRA if you can't fit the base model and adapters in your GPU memory at chosen batch size; otherwise LoRA. - Inference: after training, you can dequantize and merge LoRA into the base for production serving (or keep adapters separate for hot-swap). - Numbers to drop: "The QLoRA paper fine-tunes a 65B model on a single 48 GB GPU; 16-bit LoRA on a 70B base needs ~140 GB for the weights alone, so multiple GPUs. Quality delta: typically <1 point on benchmarks."

Common follow-ups: - "Are there quality risks from 4-bit base?" - "How do you serve a QLoRA-trained model?"

Traps: - Using QLoRA when you have plenty of VRAM and don't need the savings — slower training, marginal accuracy hit. - Forgetting that the base 4-bit quantization at training time should match inference quantization to avoid drift.
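Whether the base model fits is mostly a bytes-per-parameter question. The estimate below counts weights only, so it deliberately ignores adapters, activations, optimizer state, and quantization constants; the 8B model size is an illustrative assumption.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n = 8e9  # assumed 8B-parameter base model
bf16 = weight_memory_gb(n, 16)  # LoRA on a 16-bit base
nf4 = weight_memory_gb(n, 4)    # QLoRA-style 4-bit base
print(bf16, nf4, f"{1 - nf4 / bf16:.0%}")  # 16.0 4.0 75%
```

The same function applied to a larger base shows quickly whether a given card is even in range before batch size and sequence length are considered.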

Q: "What is the difference between RLHF and DPO? When would you prefer one over the other?"

Tags: senior · common · conceptual · source: adilshamim8 — Every AI Engineer Interview Question 2026

Answer outline: - RLHF: train a reward model on preference pairs, then run PPO/GRPO against it. Two-stage, sensitive to reward-model quality, KL-control matters. - DPO: skip the explicit reward model; directly optimize the policy on preference pairs via a contrastive loss. One stage, much more stable. - Choose DPO for almost all alignment tasks where you have preference pairs; it's the 2026 default for everyone outside frontier labs. - Choose RLHF-style online RL when you need an explicit reward model you can reuse (e.g., for online RL, exploration), or when verifiable rewards exist (math, code), in which case the checker can stand in for the learned reward model. - Numbers to drop: "DPO training is typically 2–5x cheaper than full RLHF, with comparable or better win rates in published results. DPO converges in hours; full RLHF in days-to-weeks."

Common follow-ups: - "What's ORPO/KTO/IPO?" - "When does DPO go off the rails?"

Traps: - "RLHF is always better because frontier labs use it" — they have reasons (online exploration, multi-objective rewards) that don't apply to most teams. - Running DPO without a reference model checkpoint → unbounded drift.
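The "contrastive loss" in the DPO bullet, and the reason a reference checkpoint is needed, are easiest to see in the per-pair loss itself. A minimal sketch in plain Python, assuming the inputs are summed sequence log-probabilities under the policy and the frozen reference; the function name and `beta=0.1` are illustrative choices:

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair: -log(sigmoid(beta * margin))."""
    # Margin: how much more the policy prefers chosen over rejected than the reference does.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # log(1 + exp(-z)) equals -log(sigmoid(z)); fall back to -z when exp(-z) would be huge.
    return math.log1p(math.exp(-z)) if z > -30 else -z


# Policy identical to the reference: margin is 0, so the loss is log(2).
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))  # 0.6931
```

The reference log-probs enter only through the margin, so without them the objective has nothing anchoring the policy to its starting point.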

Q: "Fine-tuning vs distillation — when to use each?"

Tags: senior · common · conceptual · source: Analytics Vidhya — Agentic AI Interview Questions Q23

Answer outline: - Fine-tuning: change a model's behavior on a domain or task. Goal is new capability or new style on the same model. - Distillation: train a smaller student model to mimic a larger teacher. Goal is cost reduction at fixed (or near-fixed) quality. - Decision rule: fine-tune for capability gain on the model you already serve; distill when serving cost dominates and you have a working teacher. - Often combined: fine-tune the teacher first (so it's right), then distill into a smaller student for serving. - Numbers to drop: "Distillation typical: 70B → 7B with 80–90% quality retention on chat/reasoning. Cost cut: 5–10x at serving. Fine-tune cost: 1–10 GPU-hours with LoRA; distillation: 10–1000 GPU-hours depending on student size."

Common follow-ups: - "Why not just use the smaller base model?" - "How do you choose the distillation dataset?"

Traps: - Distilling from a teacher you haven't validated — student inherits teacher bugs. - Treating distillation as free; it's a real training job.
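One common way to make a student "mimic" a teacher is to match temperature-softened output distributions. The sketch below shows that classic logit-matching loss for a single token position; it is an assumption about the variant, since LLM distillation is also often done by plain SFT on teacher-generated text.

```python
import math


def softmax(logits, temperature=1.0):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distill_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


teacher = [4.0, 1.0, 0.5]
print(distill_loss(teacher, teacher))          # 0.0 when the student matches the teacher
print(distill_loss([1.0, 1.0, 1.0], teacher))  # positive when it does not
```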

Production patterns

Q: "Hallucination mitigation: constrained decoding vs retrieval grounding vs eval gate — which spend first?"

Tags: senior · very-common · design · source: Mitigating Hallucination in LLMs survey (2510.24476); Maxim — LLM Hallucination Detection

Answer outline: - Retrieval grounding first: most production hallucinations are "the model didn't know," and RAG with citations addresses that at the source. - Constrained decoding (JSON-schema, grammars, finite-state) for format hallucinations — schema-level errors disappear, but factual errors don't. - Eval gate (LLM-as-judge, classifier, factuality check on output) is the safety net: cheap to add, catches what other layers miss, but reactive. - Decision rule: ground → constrain → gate. Each layer covers a different failure mode; they're complements, not alternatives. - For high-stakes outputs (medical, legal, financial), require attribution-to-source check, not just self-judge. - Numbers to drop: "Retrieval grounding reduces hallucination rates 30–60% on factual QA in published RAG studies. LLM-as-judge agreement with humans: ~80–85% on standard tasks, lower on specialized domains."

Common follow-ups: - "Your judge is hallucinating too — what now?" - "How do you evaluate the eval gate?"

Traps: - Skipping retrieval grounding and trying to "judge the model into truth." - Eval gate that's the same model as the generator — self-enhancement bias.

Q: "LLM-as-judge vs human eval vs automated metrics — pick your eval stack."

Tags: senior · very-common · design · source: Confident AI — LLM-as-Judge Complete Guide; Maxim — LLM-as-Judge vs Human-in-the-Loop

Answer outline: - Automated metrics (BLEU, ROUGE, exact match) for tasks with reference answers: cheap and fast, but blind to paraphrase. - LLM-as-judge for everything else at scale: 80–85% human agreement on general tasks, 500–5000x cheaper than human review. - Human eval as the calibration anchor: a small N (50–200 examples) of human labels to compute the judge's agreement, plus spot checks on regression-prone segments. - Decision rule: judge for breadth, humans for ground truth. Always have humans rate the judge. - Watch for judge biases: position bias (favors first option), verbosity bias, self-enhancement (don't judge with the generator). - Numbers to drop: "LLM judge cost: $0.01–$0.10 per eval. Human eval: $1–$10 per item. Judge-vs-human agreement: 80–85% general, can drop to 60% on specialized domains."

Common follow-ups: - "How do you calibrate the judge?" - "Your judge ratings are drifting — what changed?"

Traps: - Using the same model family for generator and judge. - No human-labeled calibration set → judge is unverifiable.
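"Have humans rate the judge" reduces to comparing two label lists on the calibration set. The sketch below computes raw agreement plus Cohen's kappa, which discounts agreement expected by chance; the ten labels are made-up illustration data.

```python
from collections import Counter


def agreement_and_kappa(human, judge):
    """Raw agreement rate and Cohen's kappa between two equal-length label lists."""
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(h_counts[c] * j_counts[c] for c in h_counts) / (n * n)
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    return observed, kappa


human = ["pass"] * 6 + ["fail"] * 4
judge = ["pass"] * 5 + ["fail"] * 4 + ["pass"]
observed, kappa = agreement_and_kappa(human, judge)
print(observed, round(kappa, 3))  # 0.8 0.583
```

An 80% agreement rate can correspond to a much lower kappa when one label dominates, which is why reporting both is safer.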

Q: "How would you test a new model before full deployment? (A/B vs canary vs interleaved vs shadow)"

Tags: senior · common · design · source: adilshamim8 — Every AI Engineer Interview Question 2026

Answer outline: - Shadow first: send same traffic to old and new, log only, no user impact. Catches structural bugs and cost surprises before any user sees the new model. - Canary next: 1–5% of real traffic to new model with auto-rollback on quality/latency/error budget breach. - A/B: balanced split with explicit success metric and statistical power calculation. Use for measurable quality lift. - Interleaved: especially for ranking/retrieval — alternate results from both systems, score by user interaction. Higher statistical power per sample. - Decision rule: shadow → canary → A/B for general LLM swaps. Interleaved when the task is search/ranking. - Numbers to drop: "A/B test power: typically need 1k–10k requests per arm to detect a 2% quality delta at p<0.05. Canary auto-rollback budget: 1% error rate or 20% latency regression are common triggers."

Common follow-ups: - "What metrics do you actually compare?" - "Cost regressed but quality won — what do you do?"

Traps: - Skipping shadow and going straight to A/B → blast-radius surprises. - Underpowered A/B with 100 requests calling a 0.5% lift "significant."
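The power calculation mentioned for A/B tests can be approximated with the standard two-proportion sample-size formula. A sketch using only the standard library; the 80% baseline pass rate, 2-point lift, and 80% power are assumptions chosen for illustration:

```python
import math
from statistics import NormalDist


def samples_per_arm(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate requests per arm to detect p_control vs p_treatment (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p_control - p_treatment) ** 2)


n = samples_per_arm(0.80, 0.82)
print(n)  # roughly 6,000 requests per arm
```

The required sample grows with the inverse square of the effect size, so halving the detectable lift roughly quadruples the traffic needed.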

Q: "Two models have identical accuracy but different confidence levels. Which do you choose?"

Tags: senior · occasional · scenario · source: adilshamim8 — Every AI Engineer Interview Question 2026

Answer outline: - Better-calibrated confidence wins almost always: calibrated probabilities enable routing, abstention, escalation. - Specifically prefer the model whose reliability diagram tracks the diagonal (answers given at 80% confidence are right about 80% of the time), not merely one where higher confidence means "more often right." - Lower-confidence-but-calibrated > higher-confidence-but-overconfident: you can act on the former. - If both calibrated, pick the more conservative one if downstream cost of wrong answers is high. - Numbers to drop: "Calibration metric: Expected Calibration Error (ECE) typically 1–5% for well-tuned classifiers; >10% means logits can't be trusted as probabilities."

Common follow-ups: - "How would you measure calibration?" - "How does temperature affect this?"

Traps: - Treating softmax outputs as probabilities without temperature scaling or Platt/isotonic calibration.
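Expected Calibration Error is a weighted average of the gap between stated confidence and observed accuracy, computed per confidence bin. A minimal equal-width-bin sketch; the six predictions are made-up illustration data:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: sum over bins of (bin weight) * |avg confidence - accuracy|."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(ok for _, ok in items) / len(items)
        ece += len(items) / n * abs(avg_conf - accuracy)
    return ece


confs = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6]
right = [1, 1, 1, 0, 1, 0]
ece = expected_calibration_error(confs, right)
print(round(ece, 3))  # 0.133
```

Here the model says 90% but is right 75% of the time in the top bin, which is the overconfidence pattern the answer warns about.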

Q: "Real-time vs batch processing for data updates — when is one preferred?"

Tags: mid · common · scenario · source: adilshamim8 — Every AI Engineer Interview Question 2026

Answer outline: - Real-time (streaming via Kafka, change-data-capture, webhook): for sub-minute freshness, e.g., fraud, alerts, personalization that needs the last event. - Batch (nightly/hourly jobs): for index rebuilds, summarization, analytics, anything where 1–24 hour lag is fine. - Hybrid is the production norm: batch for the bulk corpus, streaming for the hot tail (new docs, edits, deletes). - Decision rule by SLA: if user-visible staleness budget is <5 minutes, streaming. Otherwise batch is simpler, cheaper, more reliable. - Numbers to drop: "Batch embedding cost: ~$0.01–$0.10 per 1M tokens. Streaming embed for incremental updates: more expensive per token, but bounded by event rate."

Common follow-ups: - "How do you handle deletes in a vector index?" - "What's the failure mode if your stream lags by an hour?"

Traps: - Streaming everything when nothing requires it — operational complexity, no user value. - Batch-only when documents update faster than the job runs.

Q: "LangGraph vs CrewAI vs AutoGen — pick a framework for a multi-agent product."

Tags: senior · common · design · source: lockedinai — AI Engineer Interview Questions Q46; Pooya Golchian — CrewAI vs LangGraph vs AutoGen 2026

Answer outline: - LangGraph: explicit state and control flow as a graph. Best observability via LangSmith, native human-in-the-loop, wins on cyclical/feedback-loop tasks. Steeper learning curve (~2 weeks). - CrewAI: role-based, declarative — ship a working demo in 2–3 engineer-days. Best for linear A→B→C task pipelines and non-engineer-modifiable configs. - AutoGen: conversation-first design, but effectively in maintenance mode in 2026 as Microsoft pivots to Agent Framework. Don't start new projects here. - Decision rule: cyclical/stateful/observable → LangGraph. Quick linear crews and prototypes → CrewAI. New projects → not AutoGen. - Custom (no framework): pick when you only need 2–3 tools and want zero dependencies. Many staff engineers build their own loop in 200 lines. - Numbers to drop: "Median CrewAI project to production: 11 days. LangGraph: ~62% success on complex tasks vs CrewAI 54% in published comparisons. AutoGen: ~5–7 days to demo, but in maintenance mode."

Common follow-ups: - "When would you build your own instead of using a framework?" - "How do you trace a multi-agent failure end-to-end?"

Traps: - Picking AutoGen for new work in 2026. - Adopting LangGraph for a 3-step linear chain — overkill.

Q: "MCP vs custom tools / direct function calling — when to standardize?"

Tags: senior · common · design · source: Obot AI — MCP vs Function Calling; Zilliz — Function Calling vs MCP vs A2A

Answer outline: - Function calling: a model capability. Tool schemas live inside your app; every request sends them all. Fast iteration, single-app scope. - MCP (Model Context Protocol): a wire protocol. Tools live behind MCP servers; clients discover and call them. Cross-app reuse, dynamic discovery. - Decision rule: start with direct function-calling for a single agent. Graduate to MCP when the same tool (e.g., "search internal docs") needs to serve multiple clients (Claude Desktop, your app, Cursor, automations). - Cost: MCP adds a network hop. It saves tokens only when the client exposes a subset of the discovered tools to the model; a client that forwards the whole catalog pays the same schema cost as direct function calling. - Numbers to drop: "Function-calling overhead grows with tool count, since each tool schema is sent on every request. Loading only the tools a task needs from a 30-tool catalog often saves 5–15k tokens per call."

Common follow-ups: - "Security model differences?" - "Same agent, multiple deploys — how does MCP help?"

Traps: - Adopting MCP for a single-app prototype — extra moving parts for no reuse benefit. - MCP servers without auth/rate-limit — agent becomes an open RPC gateway.

Q: "Structured outputs vs function calling — which does an agent use for the action?"

Tags: senior · common · conceptual · source: MachineLearningMastery — Structured Outputs vs Function Calling

Answer outline: - Structured outputs: model is forced by schema/grammar to produce conformant JSON. Near 100% schema fidelity, lower latency, but only one "shape" per call. - Function calling: model chooses among many tools, may call zero or many, decides arguments. Drives the control flow. - Decision rule: when there's exactly one thing to format → structured outputs. When the model must choose what to do next → function calling. - Combine: function calling for tool selection, structured outputs for the arguments of each tool. - Numbers to drop: "Strict structured outputs (OpenAI strict mode, Anthropic schema) recover ~100% schema validity vs ~95% with prompt-only JSON. Function-calling tool-selection accuracy: 85–95% with current frontier models."

Common follow-ups: - "What do you do when the model picks the wrong tool?" - "How do you stream structured output safely?"

Traps: - Treating "function calling" as if the model always calls a function — it can refuse, call multiple, or loop. - Skipping schema validation post-generation because "structured output guarantees it" — providers still ship occasional violations.
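Post-generation validation, which the last trap warns against skipping, can be a few lines. A standard-library sketch that checks required keys and types; the `SCHEMA` fields are hypothetical tool arguments, and a real system would more likely use a full JSON Schema validator:

```python
import json


def parse_tool_args(raw: str, required: dict) -> dict:
    """Parse model output and check required keys and types; raise ValueError on any violation."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key, expected_type in required.items():
        if key not in data:
            raise ValueError(f"missing field: {key}")
        if not isinstance(data[key], expected_type):
            raise ValueError(f"wrong type for {key}")
    return data


SCHEMA = {"city": str, "days": int}  # hypothetical arguments for a weather tool

print(parse_tool_args('{"city": "Oslo", "days": 3}', SCHEMA))
try:
    parse_tool_args('{"city": "Oslo", "days": "three"}', SCHEMA)
except ValueError as err:
    print("rejected:", err)  # rejected: wrong type for days
```

Raising on violation gives the caller a single place to retry, repair, or fall back.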

Q: "What is the difference between encoder-only, decoder-only, and encoder-decoder Transformers? When would you use each?"

Tags: mid · very-common · conceptual · source: adilshamim8 — Every AI Engineer Interview Question 2026

Answer outline: - Encoder-only (BERT, RoBERTa, ModernBERT): bidirectional attention, output is a representation. Use for classification, NER, embeddings, retrieval rerankers. - Decoder-only (GPT, Llama, Claude, DeepSeek): causal attention, output is text. Use for generation, chat, instruction-following, code. - Encoder-decoder (T5, BART, mT5): encoder for input, decoder for output. Use for translation, summarization, structured seq2seq. - 2026 reality: decoder-only dominates even for non-generation tasks (extraction, classification) because instruction-tuned LLMs are good enough and one architecture simplifies stacks. - Decision rule: rerankers and embedders still pay off with encoder-only (5–50x cheaper than asking an LLM); everything else is decoder-only by default. - Numbers to drop: "BGE-reranker-base (encoder-only cross-encoder): ~10ms per pair on CPU. An LLM reranker via API: 100–500ms. 50x latency difference matters in hot paths."

Common follow-ups: - "Why are decoder-only models dominant even for non-generation tasks?" - "When is encoder-decoder still the right pick?"

Traps: - Using a chat LLM for a binary classification task at 100x the cost of a fine-tuned BERT. - Picking T5 in 2026 when a fine-tuned 8B decoder + LoRA would do everything cheaper.

Q: "Context engineering vs RAG — are they the same thing?"

Tags: senior · common · conceptual · source: Atlan — Context Engineering vs RAG 2026; Roadie — RAG vs Context Engineering

Answer outline: - RAG is a technique: fetch documents at query time, stuff into context. - Context engineering is the discipline: design every slot in the context window — system prompt, tool descriptions, retrieved docs, conversation history, memory, scratch — across sources, permissions, freshness. - RAG is one primitive inside context engineering. An agent that pulls only from a vector store has wired one slot of six. - Decision rule: every production system is doing context engineering whether they call it that or not. RAG is a tactic within it. - Watch for: source conflict resolution, ACL enforcement per slot, stale-vs-fresh arbitration, token budget per slot. - Numbers to drop: "Lost-in-the-middle benchmarks: 10–25% accuracy drop on facts in middle 50% of 128k-token prompts — context engineering is about which tokens, not just how many."

Common follow-ups: - "How do you decide what goes in each slot?" - "Two retrieved docs conflict — how does context engineering resolve that?"

Traps: - Treating "context engineering" as a buzzword for the same RAG you already have. - Filling the context window because it's big — Goodhart on token count.

Q: "ReAct vs Plan-and-Execute vs Tree-of-Thoughts — pick a planning strategy."

Tags: senior · common · design · source: aemonline — 25 Advanced Agentic AI Interview Questions 2026; Analytics Vidhya — Agentic AI Interview Questions

Answer outline: - ReAct (reason + act loop): the workhorse. Short tasks, tight loops, tool-use feedback drives next step. Good when the goal is clear and the path is reactive. - Plan-and-Execute (or Plan-and-Solve): plan upfront, execute steps, replan on failure. Good when steps are knowable but the order matters and you want auditability. - Tree-of-Thoughts: branch and search across reasoning paths. Strong on creative or combinatorial problems (puzzles, planning under uncertainty), but expensive — many LLM calls. - Decision rule: ReAct by default. Move to Plan-and-Execute when you need audit logs and partial-result reuse. Reach for ToT only when single-path reasoning measurably fails and you have token budget. - Numbers to drop: "ReAct: typically 3–10 LLM calls per task. Plan-and-Execute: 1 plan + N execution calls. ToT: 10–100+ calls depending on width and depth — easily 10–50x ReAct cost."

Common follow-ups: - "When does ReAct loop forever?" - "How do you choose ToT branching width?"

Traps: - ToT for tasks where ReAct converges — paying 10x for no quality lift. - Plan-and-Execute with no replan-on-failure path — first error halts the run.
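A ReAct loop is small enough to sketch in full, including the hard step limit that keeps it from looping forever. Everything here is hypothetical scaffolding: `model` stands in for an LLM call and returns either a tool action or a final answer, and the scripted model and lookup tool exist only so the example runs offline.

```python
from typing import Callable, Dict, List, Optional, Tuple


def react_loop(
    task: str,
    model: Callable[[List[str]], Tuple[str, str]],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> Optional[str]:
    """Alternate model actions and tool observations until 'finish' or the step budget runs out."""
    transcript = [f"Task: {task}"]
    for _ in range(max_steps):
        action, arg = model(transcript)  # ("tool_name", input) or ("finish", answer)
        if action == "finish":
            return arg
        observation = tools[action](arg) if action in tools else f"unknown tool: {action}"
        transcript.append(f"Action: {action}[{arg}]")
        transcript.append(f"Observation: {observation}")
    return None  # budget exhausted; the caller decides how to fail


def scripted_model(transcript):
    """Stand-in for an LLM: look something up once, then answer with the observation."""
    if not any(line.startswith("Observation:") for line in transcript):
        return "lookup", "capital of France"
    return "finish", transcript[-1].split(": ", 1)[1]


tools = {"lookup": lambda key: {"capital of France": "Paris"}.get(key, "not found")}
print(react_loop("What is the capital of France?", scripted_model, tools))  # Paris
```

Returning `None` on budget exhaustion, instead of the last observation, makes a runaway loop visible to the caller.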

Q: "Supervisor-based vs peer-to-peer multi-agent systems — when to choose one?"

Tags: senior · common · design · source: aemonline — 25 Advanced Agentic AI Interview Questions 2026

Answer outline: - Supervisor (hierarchical): one manager LLM owns the plan and dispatches sub-tasks to workers. Clear control flow, easier debugging, deterministic termination. - Peer-to-peer (collaborative): agents talk to each other directly, negotiate, hand off. Higher capability ceiling on emergent tasks, but harder to bound and trace. - Decision rule: ship supervisor first for any production system. Peer-to-peer when you've outgrown supervisor on real, measured tasks, and even then, instrument the agent-to-agent channel like an external API. - Token cost: peer-to-peer chat patterns can explode; supervisor patterns have predictable fan-out. - Numbers to drop: "Supervisor pattern token cost: ~1.5–3x a single-agent baseline. Peer-to-peer patterns can hit 5–15x as agents converse. For scale, Anthropic reported that its multi-agent research system (an orchestrator-worker design) used ~15x the tokens of a chat interaction, so wide fan-out is expensive under either topology."

Common follow-ups: - "How do you terminate a peer-to-peer system?" - "Supervisor is the bottleneck — what changes?"

Traps: - Letting peer agents free-form chat without a hard turn limit. - Supervisor with no model of subordinate cost → calls the most expensive worker for every task.

Q: "LLM context window vs external vector database for memory — what are the trade-offs?"

Tags: senior · common · conceptual · source: aemonline — 25 Advanced Agentic AI Interview Questions 2026

Answer outline: - Context window as memory: cheap to set up, no infra. But every turn pays for every token, and quality degrades in the middle of long contexts. - External vector DB: query-time retrieval keeps the prompt focused. Supports millions of memories. Needs eviction, summarization, and ACL. - Hybrid (the production answer): keep last K turns verbatim in window; summarize older history into compressed notes; offload episodic memories into a vector store. - Decision rule: window for working memory (current task), DB for long-term memory (across sessions and users). - Numbers to drop: "32k rolling window for chat: $0.05–$0.10 per turn at Sonnet pricing. pgvector memory for 1M memories: ~$50/mo on a small Postgres."

Common follow-ups: - "How do you summarize without losing important facts?" - "User says 'forget X' — what does that mean across both layers?"

Traps: - Dumping the entire chat history into the prompt every turn — linear cost growth. - Vector-store memory without eviction → infinite cost growth.
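The window half of the hybrid pattern (last K turns verbatim, older history compressed) can be sketched as follows. The `summarize` callable is a hypothetical stand-in for an LLM summarization call, and the vector-store half is left out.

```python
from typing import Callable, List


def build_context(turns: List[str], summarize: Callable[[List[str]], str],
                  keep_last: int = 4) -> List[str]:
    """Keep the last `keep_last` turns verbatim; compress everything older into one note."""
    if keep_last < 1:
        raise ValueError("keep_last must be at least 1")
    older, recent = turns[:-keep_last], turns[-keep_last:]
    context = []
    if older:
        context.append("Summary of earlier conversation: " + summarize(older))
    context.extend(recent)
    return context


# Stub summarizer so the sketch runs without a model.
stub_summarize = lambda ts: f"{len(ts)} earlier turns condensed"

turns = [f"turn {i}" for i in range(10)]
context = build_context(turns, stub_summarize)
print(context[0])   # Summary of earlier conversation: 6 earlier turns condensed
print(context[1:])  # ['turn 6', 'turn 7', 'turn 8', 'turn 9']
```

Prompt size is then bounded by `keep_last` plus one summary note, regardless of conversation length.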

Q: "Input guardrails vs output guardrails — which spend first?"

Tags: senior · common · design · source: Datadog — LLM Guardrails Best Practices; Openlayer — AI Guardrails 2026

Answer outline: - Input guardrails (prompt-injection detection, PII redaction, topic filters): cheap, run before model call, save tokens and prevent obvious harms. - Output guardrails (toxicity, PII leak, hallucination/grounding check, schema validation): catch what slipped through generation, but you've already paid for the call. - Decision rule: do both, but if budget forces a choice, input first — failing fast is cheaper. Output guardrails are the safety net for what input can't catch (model behavior). - Stack: input filter → model → output filter → log + human-flag on borderline. - Numbers to drop: "Input prompt-injection classifiers (e.g., Lakera, NeMo Guardrails, Llama-Guard 3): ~20–100ms per call. Output toxicity classifiers: similar overhead. Skipping input filter on prompt-injection: estimated 5–15% of jailbreaks succeed without it on naive systems."

Common follow-ups: - "How do you measure guardrail false-positive rate?" - "Llama-Guard vs custom — which?"

Traps: - Guardrails as separate prompts to the same model — no real second opinion, prone to the same failures. - Stacking 5 guardrails with no eval — false-positive rate kills the product UX.

Q: "API-driven LLM usage vs chat-interface usage — what's fundamentally different?"

Tags: screen · occasional · conceptual · source: Analytics Vidhya — Agentic AI Interview Questions Q8

Answer outline: - API: stateless by default — you manage conversation history, set system prompt, control temperature/top-p/max-tokens, request structured output, parallelize, batch. - Chat interface: stateful inside a session — provider manages history, system prompt is fixed, parameters hidden, no parallelism. - API is the only way to build a product. Chat UI is a consumer surface. - Numbers to drop: "Production API workloads commonly run 10–1000 concurrent requests per service instance; chat UIs are bounded to one user thread."

Common follow-ups: - "What state do you have to manage yourself via API?" - "Why do API answers sometimes differ from the same prompt in the chat UI?"

Traps: - Assuming the chat UI's "feel" matches what the API will give you with default params — the UI bakes in opinionated system prompts and settings.

Q: "Temperature 0 vs higher temperatures — when to use each?"

Tags: screen · very-common · conceptual · source: lockedinai — AI Engineer Interview Questions Q11

Answer outline: - Temperature 0 (or near-0): greedy (argmax) decoding, deterministic-ish. Use for extraction, classification, structured output, tool selection, code generation where there's a "right" answer. - Higher temperature (0.5–1.0): broader sampling. Use for brainstorming, creative writing, paraphrase generation, synthetic data. - Temperature 0 is not fully deterministic across providers: batching, floating-point non-associativity in GPU kernels, and routing across replicas can still change outputs. - Decision rule: 0 for anything you'd grade right/wrong; >0 only when diversity is the goal. - Pair with: top-p (nucleus) cap to bound the tail when you do go higher. - Numbers to drop: "Temperature 0 + seed (where supported, e.g., OpenAI seed param) gets reproducibility within ~98%, not 100%, due to backend non-determinism."

Common follow-ups: - "Your tool selection is flaky at temperature 0 — why?" - "When is temperature 0.7 safer than 1.0?"

Traps: - Treating temperature 0 as deterministic — it isn't, fully. - High temperature for code generation — random syntax errors.
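Temperature divides the logits before the softmax, so low values sharpen the distribution and high values flatten it. A small sketch that treats temperature 0 as the argmax limit; the three logits are arbitrary illustration values:

```python
import math


def sampling_distribution(logits, temperature):
    """Token probabilities after temperature scaling; temperature 0 is treated as greedy argmax."""
    if temperature == 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
print(sampling_distribution(logits, 0))                           # [1.0, 0.0, 0.0]
print([round(p, 3) for p in sampling_distribution(logits, 1.0)])  # [0.665, 0.245, 0.09]
print([round(p, 3) for p in sampling_distribution(logits, 2.0)])  # flatter than at 1.0
```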

Q: "Compare general re-rankers and instruction-following re-rankers in RAG."

Tags: senior · occasional · conceptual · source: KalyanKS — RAG Interview Questions Hub Q64

Answer outline: - General rerankers (BGE-reranker, Cohere Rerank, Jina): score (query, doc) pairs on semantic relevance. Fixed objective, very fast, no prompt. - Instruction-following rerankers (rerank-as-LLM with an explicit instruction prompt): can re-rank by custom criteria — recency, source authority, user intent. - Decision rule: general rerankers for default relevance; instruction-following when ranking needs to obey business logic ("prefer official docs," "recent first," "answers in the user's role"). - Cost gap is real: instruction-following often goes through an LLM and is 10–50x slower. - Numbers to drop: "BGE-reranker-base: ~5–10ms per pair. LLM-as-reranker: 200–500ms per query for top-20. Cross-encoder typical NDCG@10 lift: +5–15 points over ANN-only."

Common follow-ups: - "How would you train an instruction-following reranker?" - "Where does instruction-following hurt latency too much?"

Traps: - Using LLM-as-reranker in a latency-critical path without budgeting. - Custom-prompt rerankers without eval — easy to make ranking worse.

Q: "Prompt caching vs semantic caching vs no cache for agents — what's the cost picture?"

Tags: senior · common · design · source: Redis — Prompt vs Semantic Caching; TrueFoundry — Semantic Caching

Answer outline: - Prompt cache (provider KV-cache, e.g., Anthropic prompt caching): reuses computed KV for shared prefixes (system prompt, tool definitions, RAG-corpus context). ~85–90% input-token discount. Safe, no false positives. - Semantic cache: embeds the user query, returns cached response if close enough. Big savings (50–68%) but introduces false-positive risk; needs threshold tuning and verification. - For agents: prompt cache the system prompt and tool catalog (stable across turns); use semantic cache only on read-heavy intent classes where wrong answers are cheap. - Decision rule: always-on prompt cache, opt-in semantic cache only with a verification step. - Numbers to drop: "Anthropic prompt cache: 90% discount on cached input, 5-min default TTL. Semantic at 0.95 threshold: ~38–61% savings. Exact match: ~12% in production."

Common follow-ups: - "What's the threshold and how did you pick it?" - "Cache invalidation on tool catalog change?"

Traps: - Semantic cache with no verification or fallback: a false-positive hit returns the wrong cached answer and the agent repeats it. - Forgetting the TTL on prompt cache and paying full price after 5 minutes of inactivity.
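The semantic-cache threshold is a cosine-similarity cutoff on query embeddings. The toy sketch below uses hand-written 2-d vectors in place of a real embedding model and a linear scan in place of a vector index; both are simplifying assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class SemanticCache:
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response)

    def put(self, embedding, response):
        self.entries.append((embedding, response))

    def get(self, embedding):
        """Return the closest cached response if it clears the threshold, else None (a miss)."""
        best_score, best_response = 0.0, None
        for stored, response in self.entries:
            score = cosine(embedding, stored)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None


cache = SemanticCache(threshold=0.95)
cache.put([1.0, 0.0], "Our refund window is 30 days.")
print(cache.get([0.99, 0.05]))  # near-duplicate query: returns the cached response
print(cache.get([0.5, 0.5]))    # None: similarity ~0.71 is below the threshold
```

A miss should fall through to the model; a hit on a risky intent class is where the verification step belongs.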

Q: "Predictive/Discriminative AI vs Generative AI — when does each fit a problem?"

Tags: screen · very-common · conceptual · source: llmgenai — LLMInterviewQuestions; lockedinai — AI Engineer Interview Questions

Answer outline: - Discriminative: model P(y|x), classify or score. Logistic regression, XGBoost, BERT classifiers. Best for fraud, churn, tabular prediction, intent classification. - Generative: model P(x) or P(x|y), produce new content. LLMs, diffusion. Best for text generation, summarization, code, image synthesis. - Decision rule: if the output is a label or a score, go discriminative: cheaper, more interpretable, easier to calibrate. If output is unbounded text/image, go generative. - Hybrid: a generative LLM can act as a discriminator (classifier) but at 10–100x the cost of a fine-tuned BERT. - Numbers to drop: "Fine-tuned ModernBERT on classification: <10ms per item, ~$0.0001 per 1k inferences. LLM-as-classifier via API: 100–500ms, ~$0.001–$0.01 per 1k inferences."

Common follow-ups: - "Why might you use a generative LLM for classification anyway?" - "What's calibration for each?"

Traps: - Reaching for an LLM when XGBoost on a tabular dataset would crush it. - Generative model as classifier without calibration — confidence is meaningless.

Q: "Cross-encoder vs bi-encoder — when do you use which in retrieval?"

Tags: mid · very-common · conceptual · source: adilshamim8 — Every AI Engineer Interview Question 2026

Answer outline: - Bi-encoder: separate encoders for query and doc, similarity is dot/cosine. Index-friendly (sub-linear lookup via ANN, roughly logarithmic with HNSW). Use for first-stage retrieval at scale. - Cross-encoder: joint encoder over (query, doc) pair, scores their relationship. Way more accurate, but it costs one forward pass per candidate (O(N) per query) and can't be pre-indexed. - Standard stack: bi-encoder retrieves top-K (50–200) cheaply, cross-encoder reranks down to top-5/10. - Decision rule: bi-encoder for retrieval, cross-encoder for rerank. Never use cross-encoder as primary retriever unless your corpus fits in a few hundred docs. - Numbers to drop: "Bi-encoder: encode 1M docs once, query in <10ms via HNSW. Cross-encoder: ~5–10ms per pair on CPU, less per pair when batched on GPU. Fine for top-50 rerank, fatal for top-1M scoring."

Common follow-ups: - "ColBERT — where does it sit?" - "When would you skip the cross-encoder rerank?"

Traps: - Using a cross-encoder to score every doc in the corpus: one forward pass per document on every query, so latency grows linearly with corpus size. - Skipping rerank because "bi-encoder is fine"; measure NDCG before deciding.
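The retrieve-then-rerank stack is two sorts with different scoring functions. In the toy sketch below, a brute-force dot product stands in for the ANN index, and a word-overlap function stands in for a real cross-encoder; document vectors and texts are made up.

```python
def retrieve_then_rerank(query_vec, query_text, docs, cross_score, k_retrieve=50, k_final=5):
    """Stage 1: cheap vector similarity over all docs. Stage 2: expensive scorer on survivors only."""
    def dot(doc):
        return sum(q * x for q, x in zip(query_vec, doc["vec"]))

    candidates = sorted(docs, key=dot, reverse=True)[:k_retrieve]
    reranked = sorted(candidates, key=lambda d: cross_score(query_text, d["text"]), reverse=True)
    return reranked[:k_final]


calls = []  # records how many pairs the expensive scorer actually sees


def toy_cross_score(query, text):
    calls.append(text)
    return len(set(query.lower().split()) & set(text.lower().split()))


docs = [
    {"text": "password policy for admins", "vec": [0.8, 0.3]},
    {"text": "reset your password from the login page", "vec": [0.9, 0.1]},
    {"text": "holiday schedule", "vec": [0.1, 0.9]},
]
top = retrieve_then_rerank([1.0, 0.0], "how do I reset my password", docs,
                           toy_cross_score, k_retrieve=2, k_final=1)
print(top[0]["text"], len(calls))  # reset your password from the login page 2
```

The scorer runs on two pairs, not three: `k_retrieve` is the knob that caps cross-encoder cost per query.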