Applied AI Engineer — Interview Focus Areas
Grounded in 2026 data: 100+ real interview reports, 1,000+ AI engineer JDs, and current salary surveys (sources at the end). The role sits between data science and software engineering — more production-oriented than the former, more AI-specific than the latter.
Market signal — what employers list (share of 2026 JDs):
Compensation (US, 2026)
Mid-level $130K–$175K · Senior $195K–$290K
This guide is organized by focus area, not by curriculum module. Each section maps to the reorganized AI tracks under 00_ai_foundation/ through 05_ai_specializations/. Tier 1 areas appear in nearly every loop, Tier 2 areas appear in most senior loops, and Tier 3 areas are role-specific. Tier 4 areas can be deprioritized when cramming.
Tier 1 — Appears in nearly every loop
1. RAG systems (end-to-end)
The single highest-yield topic. "Design a RAG system for customer support" is the most-reported question across companies. You will be asked to design one, debug one, evaluate one, or all three.
Mental model. RAG = retrieve documents, stuff them into context, generate. Every component is a failure mode.
What you must be able to do:
- Walk through the full pipeline: ingest → chunk → embed → index → retrieve → rerank → augment → generate → evaluate → trace.
- Defend chunking strategy (fixed, semantic, hierarchical) and explain when each fails.
- Choose between sparse (BM25), dense (embeddings), and hybrid retrieval — and explain hybrid scoring (RRF, weighted).
- Pick a vector DB: pgvector (small, transactional), Pinecone/Weaviate (managed), Qdrant/Milvus (self-host scale). Defend the trade-off.
- Handle conflicting sources, recency bias, and citation/attribution.
- Detect hallucinations: faithfulness scoring, citation overlap, LLM-as-judge, RAGAS metrics (faithfulness, answer relevance, context precision, context recall).
- Reason about cost-per-query and latency-per-query at 1M queries/day.
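Hybrid scoring is easier to defend with a concrete formula in hand. Below is a minimal sketch of Reciprocal Rank Fusion (RRF), assuming each retriever returns an ordered list of document IDs. The IDs and the two result lists are made up for illustration, and `k=60` is the commonly used default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list using RRF.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical sparse (BM25) results
dense_hits = ["doc_c", "doc_a", "doc_d"]  # hypothetical dense (embedding) results
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

RRF uses only ranks, so it sidesteps the problem that BM25 scores and cosine similarities live on different scales. Weighted fusion needs score normalization first.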
Common questions (verbatim from interview reports):
- "Design a RAG system that handles conflicting information across sources."
- "How do you detect and mitigate hallucinations in production?"
- "What's the difference between a retriever returning the wrong document and a generator ignoring the right one — how do you debug?"
- "Your chunk size is 512 but legal documents have 50-page contracts. What breaks?"
Pitfalls to avoid in interview:
- Treating RAG as one box. Decompose it.
- Forgetting evaluation. Always say "and then we measure faithfulness with…"
- Claiming dense retrieval is always better. It isn't — sparse wins on rare entities and exact-match.
- Ignoring re-ranking. Cross-encoders are cheap compared to LLM generation.
Modules: 01_ai_engineering/08_rag_system_design/, 01_ai_engineering/09_advanced_rag_patterns/, 02_ai_infrastructure/03_vector_retrieval_infrastructure/, 01_ai_engineering/07_search_relevance_ranking/, 01_ai_engineering/10_knowledge_graph_retrieval/ (if Graph RAG role).
2. Evaluation & production trade-offs
The differentiator at senior loops. Most candidates can talk RAG; few can talk eval rigorously. A direct quote from a 2026 interview: "Is there an actual eval framework here, or is it vibes-based?"
Mental model. Evals = unit tests for non-deterministic systems. If you cannot measure it, you cannot improve it.
What you must be able to do:
- Distinguish offline evals (golden datasets, regression) from online (A/B, canary, shadow).
- Build a golden dataset: where do labels come from, how do you keep it fresh, how do you avoid leakage.
- LLM-as-judge: pairwise vs pointwise, how to validate the judge itself.
- Metrics zoo: ROUGE/BLEU (limited: they compare surface n-grams), perplexity (needs access to token probabilities, so it only applies when the model exposes them), task-specific (faithfulness, answer relevance, instruction-following).
- Drift detection: input drift, output drift, judge drift.
- Cost-aware quality: Pareto frontiers of cost vs quality.
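One way to validate the judge itself is to measure its agreement with human labels on a sample, corrected for chance. The sketch below computes Cohen's kappa in plain Python. The `human` and `judge` label lists are invented for illustration, and the pass/fail labeling scheme is an assumption.

```python
def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    # Agreement expected if both raters labeled independently at their own base rates.
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels on 8 sampled responses.
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
judge = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "fail"]
kappa = cohens_kappa(human, judge)
print(f"raw agreement: 0.75, kappa: {kappa:.2f}")
```

Raw agreement here is 75%, but kappa is noticeably lower because both raters say "pass" most of the time. That gap is the point to make in an interview: agreement rate alone overstates judge quality on imbalanced labels.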
Common questions:
- "How do you evaluate a chatbot?"
- "Your eval metric went up but users complain more. What's wrong?"
- "Walk me through an eval framework you built — what were the failure modes?"
Pitfalls:
- Reaching for BLEU. It does not measure what you care about.
- Forgetting human evaluation sampling alongside LLM-judge.
- Confusing model evals with system evals (RAG eval ≠ LLM eval).
Modules: 04_ai_product_evals/00_ai_evals_release_gates/, 04_ai_product_evals/01_dataset_golden_set_operations/, 04_ai_product_evals/02_telemetry_feedback_loops/, plus production-ops modules in Tier 2.
3. Agents & tool calling
Agentic AI is now its own interview round. The defining question of 2026: "What's the difference between an agent and a simple LLM chain?"
Mental model. Agent = LLM + tools + loop + memory + termination condition. Chains are DAGs; agents are graphs with cycles.
What you must be able to do:
- Define agent autonomy boundaries — what can it do without human approval.
- Design tool schemas (JSON schema, function signatures, parameter validation).
- Sandbox tool execution — what isolates the agent from prod systems.
- Manage memory: short-term (working context), long-term (episodic, semantic), summarization vs eviction.
- Prevent over-reasoning loops (max iterations, budget, escalation).
- Handle tool failures, retries, partial results.
- Single-agent vs multi-agent: when does multi-agent justify the complexity.
- Frameworks: LangGraph (stateful, production), CrewAI/AutoGen (multi-agent), MCP (tool protocol).
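The "LLM + tools + loop + termination condition" model fits in a few lines. The sketch below is framework-free and makes several assumptions: `decide` stands in for the LLM call and returns a dict, the action format (`type`, `tool`, `args`, `tokens`) is invented for this example, and `lookup_order` is a hypothetical tool. It is not the API of LangGraph or any other framework.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AgentResult:
    status: str                 # "done", "max_steps", or "budget_exceeded"
    answer: Optional[str]
    steps: int
    history: list = field(default_factory=list)


def run_agent(decide, tools, task, max_steps=5, token_budget=2000):
    """Run a bounded agent loop: decide, act, observe, repeat."""
    history = []
    tokens_used = 0
    for step in range(1, max_steps + 1):
        action = decide(task, history)          # stand-in for the LLM call
        tokens_used += action.get("tokens", 0)
        if tokens_used > token_budget:          # budget cap
            return AgentResult("budget_exceeded", None, step, history)
        if action["type"] == "final":           # termination condition
            return AgentResult("done", action["answer"], step, history)
        try:
            tool = tools.get(action["tool"])
            if tool is None:
                raise ValueError(f"unknown tool: {action['tool']}")
            observation = tool(**action.get("args", {}))
        except Exception as exc:
            # Feed the failure back as an observation instead of crashing the loop.
            observation = f"tool_error: {exc}"
        history.append((action["tool"], observation))
    return AgentResult("max_steps", None, max_steps, history)  # iteration cap


def scripted_decide(task, history):
    """Hypothetical policy: look up the order once, then answer."""
    if not history:
        return {"type": "tool", "tool": "lookup_order",
                "args": {"order_id": "42"}, "tokens": 100}
    return {"type": "final", "answer": f"Order status: {history[-1][1]}", "tokens": 50}


tools = {"lookup_order": lambda order_id: "shipped"}
result = run_agent(scripted_decide, tools, "Where is order 42?")
print(result.status, result.answer, result.steps)
```

A real loop would add retries per tool, an escalation path to a human when a cap is hit, and a span per tool call.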
Common questions:
- "How do you prevent an agent from over-reasoning or over-planning?"
- "Design an agent for end-to-end customer onboarding. Where does it call humans?"
- "Your agent hits a tool that returns 500. What happens next?"
Pitfalls:
- "Agentic" as buzzword. If your agent doesn't have a loop and a stop condition, it's a chain.
- Forgetting tool-call observability. Every tool call is a span.
- Letting agents run unbounded — interviewers love asking about budget caps.
Modules: 01_ai_engineering/01_agentic_system_design/, 01_ai_engineering/02_durable_agent_workflows/, 01_ai_engineering/11_long_term_memory_state/, 01_ai_engineering/16_multi_agent_coordination/.
4. Prompt engineering (system-level)
Not "tips and tricks." The system-level discipline of designing, versioning, and testing prompts.
Mental model. Prompts are code. Version them. Test them. Roll them back.
What you must be able to do:
- Anatomy of a system prompt: role, instructions, constraints, format, examples, tools.
- Few-shot placement: where in context, ordering effects, when to skip.
- Chain-of-thought: when it helps, when it costs without lift.
- Structured output: JSON schema, function calling, constrained decoding.
- Temperature, top-p, top-k — when each matters.
- Prompt chaining vs single-shot.
- A/B testing prompts: shadow traffic, paired comparisons.
- Versioning: how do you roll out a prompt change without breaking production.
- The big trade-off triangle: prompt engineering → RAG → fine-tuning. Know when each wins.
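"Version them, roll them back" can be made concrete with a small registry. This is a sketch under stated assumptions: `PromptRegistry` is a hypothetical in-memory class, version IDs are content hashes, and a production setup would persist versions and gate promotion on eval results.

```python
import hashlib


class PromptRegistry:
    """In-memory prompt store with content-hashed versions and rollback."""

    def __init__(self):
        self._versions = {}  # name -> list of (version_id, template)
        self._active = {}    # name -> index of the active version

    def publish(self, name, template):
        """Add a new version and make it active. Returns its version ID."""
        version_id = hashlib.sha256(template.encode("utf-8")).hexdigest()[:8]
        versions = self._versions.setdefault(name, [])
        versions.append((version_id, template))
        self._active[name] = len(versions) - 1
        return version_id

    def rollback(self, name):
        """Reactivate the previous version. Returns its version ID."""
        if self._active.get(name, 0) == 0:
            raise ValueError(f"no earlier version of {name!r} to roll back to")
        self._active[name] -= 1
        return self._versions[name][self._active[name]][0]

    def render(self, name, **variables):
        """Return (version_id, rendered_prompt) so traces can log the version."""
        version_id, template = self._versions[name][self._active[name]]
        return version_id, template.format(**variables)


registry = PromptRegistry()
v1 = registry.publish("support", "Answer briefly: {question}")
v2 = registry.publish("support", "Answer briefly and cite sources: {question}")
registry.rollback("support")  # v2 regressed in evals, so go back to v1
version, prompt = registry.render("support", question="How do I reset my password?")
print(version == v1, prompt)
```

Returning the version ID with every render is the useful part: it lets eval scores and incidents be attributed to a specific prompt version.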
Common questions (top-5 globally):
- "When would you fine-tune vs use prompt engineering vs RAG?"
- "Your prompt works 90% of the time. The 10% is critical. What do you do?"
- "How do you version prompts across environments?"
Pitfalls:
- Treating prompts as static strings. They drift with model upgrades.
- Forgetting that fine-tuning vs RAG is not either/or. They compose.
Modules: 00_ai_foundation/07_prompting_fundamentals/, 01_ai_engineering/13_prompt_lifecycle_operations/.
5. Cost & latency optimization at scale
Top-5 question globally: "Your app gets 1M queries/day — how do you optimize cost?"
Mental model. Three knobs: fewer tokens, cheaper model, fewer calls.
What you must be able to do:
- Reduce tokens: prompt compression, context trimming, summary memory, caching system prompts.
- Tier models: route easy queries to cheap models, escalate hard ones (router models, confidence-based escalation).
- Cache: exact-match cache, semantic cache, KV-cache awareness (prefix caching on providers).
- Batch when latency allows. Stream when latency matters.
- Quantize/distill for self-hosted.
- Speculative decoding for hot paths.
- Talk in $ and ms, not "fast" and "cheap." Numbers matter.
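Talking in dollars means doing the token arithmetic out loud. A back-of-the-envelope sketch follows. The token counts, per-million-token prices, and cache hit rate are all hypothetical placeholders, not any provider's actual pricing, and it assumes cache hits cost nothing.

```python
def cost_per_query_usd(input_tokens, output_tokens, usd_per_m_input, usd_per_m_output):
    """Cost of one LLM call given per-million-token prices."""
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000


def daily_cost_usd(queries_per_day, per_query_usd, cache_hit_rate=0.0):
    """Daily spend, assuming cache hits are free (a simplification)."""
    return queries_per_day * (1.0 - cache_hit_rate) * per_query_usd


# Hypothetical workload: 2,000 input + 300 output tokens per query,
# priced at $3 / $15 per million input / output tokens.
per_query = cost_per_query_usd(2_000, 300, usd_per_m_input=3.0, usd_per_m_output=15.0)
baseline = daily_cost_usd(1_000_000, per_query)
with_cache = daily_cost_usd(1_000_000, per_query, cache_hit_rate=0.25)
print(f"${per_query:.4f}/query, ${baseline:,.0f}/day, ${with_cache:,.0f}/day with 25% cache hits")
```

With these made-up numbers the baseline is about $10,500/day, and a 25% cache hit rate saves roughly $2,600/day. The same function shows why trimming input tokens often matters more than it first seems: input dominates the token count even when output is priced higher.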
Common questions:
- "Your app gets 1M queries/day. Walk me through the cost stack."
- "P95 latency is 3 seconds. Customers complain. What do you measure first?"
- "When would you self-host vs use an API?"
Pitfalls:
- Reaching for fine-tuning to cut cost. Often a worse ROI than caching + router.
- Quoting cost without context window math.
- Forgetting that the embedding model has cost too.
Modules: 02_ai_infrastructure/05_agent_performance_economics/, 02_ai_infrastructure/02_inference_serving_systems/, 01_ai_engineering/03_agent_observability_debugging/.
6. LLM fundamentals
Conceptual fluency, not coding-deep. You will not be asked to implement multi-head attention from memory in most loops (some research-leaning loops do; see "Coding round patterns" below).
Mental model. Transformer = attention and feed-forward layers stacked with residual connections. Attention = each token looks at the other tokens in context (only earlier ones in a decoder-only LLM), weighted by relevance.
What you must be able to do:
- Explain attention in one paragraph without notes.
- Explain why context windows are a quadratic-cost problem.
- Tokenization: BPE vs WordPiece vs SentencePiece. Why 1 word ≈ 1.3 tokens in English.
- KV cache: what it stores, why it speeds up generation, why it grows with context.
- Pre-training vs SFT vs RLHF/DPO — at a system level, not loss-function level.
- Scaling laws: parameters vs tokens (Chinchilla-style intuition).
- Context window vs effective context (lost-in-the-middle).
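"Why it grows with context" is easiest to show with arithmetic. The sketch below estimates KV cache size for standard multi-head or grouped-query attention. The model shape (32 layers, head dimension 128, fp16 values) is a hypothetical configuration chosen for round numbers.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Approximate KV cache size.

    The leading 2 accounts for storing one key and one value vector
    per token, per KV head, per layer.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


GIB = 1024 ** 3

# Hypothetical model: 32 layers, 32 KV heads, head_dim 128, fp16 (2 bytes per value).
full_mha = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
# Same model with grouped-query attention sharing 8 KV heads.
gqa = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)

print(f"MHA: {full_mha / GIB:.2f} GiB, GQA: {gqa / GIB:.2f} GiB per 4096-token sequence")
```

The cache grows linearly with sequence length and batch size, which is why long contexts and high concurrency compete for the same GPU memory.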
Common questions:
- "How do LLMs work?"
- "What is tokenization and how does it affect LLM performance?"
- "Why does an LLM hallucinate?"
Pitfalls:
- Going too deep too fast. Interviewer signals depth; match it.
- Forgetting tokenization → cost link. They are the same conversation.
Modules: 00_ai_foundation/02_tokens_embeddings_context/, 00_ai_foundation/03_transformer_mechanics/.
Tier 2 — Most senior loops
7. Fine-tuning & adaptation
Usually probed in a "when would you" framing, not "implement RLHF" framing.
What you must be able to do:
- PEFT family: LoRA, QLoRA, prefix tuning. When each fits.
- Instruction tuning vs preference tuning (DPO vs RLHF).
- Quantization for inference: number formats (int8, int4, FP8) and post-training methods (GPTQ, AWQ).
- Distillation: teacher-student, when it pays off.
- Decision tree: prompt → RAG → fine-tune. Defend the order.
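A quick parameter count helps when defending LoRA. LoRA freezes a weight matrix of shape d_out × d_in and trains two low-rank factors, B (d_out × r) and A (r × d_in). The sketch below compares trainable parameters for a single matrix. The 4096 × 4096 shape and rank 8 are illustrative assumptions.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (full fine-tune params, LoRA adapter params) for one weight matrix."""
    full = d_in * d_out
    adapter = rank * (d_in + d_out)  # A is rank x d_in, B is d_out x rank
    return full, adapter


full, adapter = lora_param_counts(d_in=4096, d_out=4096, rank=8)
print(f"full: {full:,}  LoRA: {adapter:,}  ratio: {adapter / full:.2%}")
```

For this shape the adapter trains under half a percent of the parameters in the matrix. The adapter count scales with r × (d_in + d_out) rather than d_in × d_out, which is where the savings come from.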
Modules: 00_ai_foundation/06_adaptation_compression/, 00_ai_foundation/05_llm_training_pipeline/.
8. AI system design
Now a dedicated round at most senior loops. Reported questions:
- "Design ChatGPT" / "Scale a chat feature to 1M daily users."
- "Document Q&A with hallucination prevention at 10M+ documents."
- "Hospital voice assistant — noise, latency, privacy."
- "AI-powered legal assistant."
- "Image generation pipeline."
What you must be able to do:
- Read the prompt for constraints: scale, latency budget, privacy class, modality.
- Decompose: ingestion, retrieval, generation, eval, safety, observability, cost.
- Defend choices with numbers (latency budget per stage, tokens per request, $ per query).
- Show fallback paths: model down, retrieval empty, tool failure, rate limit hit.
- Acknowledge non-AI infra: rate limiters, auth, logging, multi-tenant isolation.
Modules: ../06_system_designing/ (sibling track) — this is its own discipline.
9. Production MLOps & infrastructure
What you must be able to do:
- Deployment patterns: blue-green, canary, shadow traffic.
- Drift monitoring: input distribution, output distribution, judge score drift.
- Traffic spike handling without overloading the model provider (queueing, rate limit budgets, graceful degradation).
- Distributed inference: tensor parallelism vs pipeline parallelism (high level).
- Self-hosted serving engines: vLLM, TGI, TensorRT-LLM, SGLang — pick one and know it.
Modules: 02_ai_infrastructure/04_ml_platform_operations/, 02_ai_infrastructure/02_inference_serving_systems/, 01_ai_engineering/04_resilient_agent_systems/, 02_ai_infrastructure/06_ai_runbooks_oncall/.
10. Safety & guardrails
Universal at safety-conscious companies (Anthropic, OpenAI, regulated industries).
What you must be able to do:
- Prompt injection: direct, indirect (poisoned docs), jailbreaks. Mitigations: input filtering, output filtering, sandboxing, constrained decoding.
- PII detection and redaction.
- Output moderation: classifiers, regex, LLM-judge.
- Red-teaming: how would you attack your own system.
- Constitutional AI / RLHF-style refusal training (at a system level).
- Audit logging and review workflows.
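For PII redaction, a regex pass is the usual first layer before any model-based detector. The sketch below covers only email addresses and one US-style phone format. Both patterns are deliberately simple assumptions and will miss many real-world variants.

```python
import re

# Simplified patterns for illustration; real PII detection needs broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def redact_pii(text):
    """Replace emails and US-style phone numbers with placeholder tags."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


redacted = redact_pii("Contact jane.doe@example.com or 555-123-4567.")
print(redacted)  # Contact [EMAIL] or [PHONE].
```

Redacting before text reaches the prompt, the logs, and the trace store keeps the three from drifting apart.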
Modules: 03_ai_security_safety/00_safety_guardrail_design/, 03_ai_security_safety/01_prompt_injection_security/, 03_ai_security_safety/03_data_access_governance/.
11. Python & async engineering
Most coding rounds test engineering fundamentals, not ML.
What you must be able to do:
- asyncio patterns: gather, wait_for, semaphore-based concurrency.
- Retries with backoff and jitter.
- Timeouts at every IO boundary.
- The GIL: what it does, what it doesn't, when threading helps.
- FastAPI: dependency injection, streaming responses, background tasks.
Modules: 02_ai_infrastructure/00_ai_backend_api_engineering/.
12. Observability & tracing
Almost always probed when monitoring comes up.
What you must be able to do:
- Span model for LLM apps: each call, each tool, each retrieval is a span.
- OpenTelemetry, LangSmith, Helicone, Phoenix — pick one.
- What to trace: prompt, response, tokens, latency, cost, eval scores, user ID.
- Use traces to debug an agent that "just looped forever."
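The span model can be illustrated without any tracing library. The sketch below is a hand-rolled, single-threaded `Tracer` written for this example. It is not the OpenTelemetry API, and the attribute names and model name are hypothetical.

```python
import time
import uuid
from contextlib import contextmanager


class Tracer:
    """Toy tracer: records nested spans with parent links and durations."""

    def __init__(self):
        self.spans = []    # finished spans, in completion order
        self._stack = []   # currently open spans (not thread- or task-safe)

    @contextmanager
    def span(self, name, **attributes):
        record = {
            "id": uuid.uuid4().hex[:8],
            "parent_id": self._stack[-1]["id"] if self._stack else None,
            "name": name,
            "attributes": dict(attributes),
        }
        self._stack.append(record)
        start = time.perf_counter()
        try:
            yield record
        finally:
            record["duration_ms"] = (time.perf_counter() - start) * 1000
            self._stack.pop()
            self.spans.append(record)


tracer = Tracer()
with tracer.span("agent_run", user_id="u_123"):
    with tracer.span("retrieval", top_k=5):
        pass  # vector search would go here
    with tracer.span("llm_call", model="hypothetical-model") as llm_span:
        # Attach token counts once the (imagined) response arrives.
        llm_span["attributes"].update(input_tokens=1200, output_tokens=150)

for s in tracer.spans:
    print(s["name"], s["parent_id"], s["attributes"])
```

For the agent that "just looped forever", the trace would show many sibling spans under one `agent_run` parent, and the repeated tool name and arguments usually point at the cause.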
Modules: 01_ai_engineering/03_agent_observability_debugging/, 04_ai_product_evals/02_telemetry_feedback_loops/.
Tier 3 — Role-specific
| Module | When it matters |
|---|---|
| 01_ai_engineering/15_reasoning_routing_verification/ | Roles using o-series / R1-style reasoning |
| 05_ai_specializations/01_multimodal_vision_systems/ | Multimodal product roles |
| 05_ai_specializations/02_diffusion_media_generation/ | Image generation roles |
| 05_ai_specializations/00_realtime_voice_agents/ | Voice agent / realtime roles |
| 01_ai_engineering/11_long_term_memory_state/ (deep) | Agent-heavy roles |
| 01_ai_engineering/10_knowledge_graph_retrieval/ | Enterprise knowledge platforms |
| 01_ai_engineering/17_schema_driven_generation/ | Code assistant / structured extraction roles |
| 01_ai_engineering/06_evidence_data_pipelines/ | Roles with heavy upstream data work |
Tier 4 — Deprioritize for interview cramming
These matter for the job but are not heavily probed in interviews:
- 00_ai_foundation/00_ml_prerequisites_refresher/: down to ~20-30% of interview time.
- 00_ai_foundation/01_neural_network_primitives/: conceptual baseline only.
- 00_ai_foundation/04_autoregressive_generation/: understand but rarely live-coded outside research loops.
- 03_ai_security_safety/02_ai_ethics_risk_fairness/: important post-hire, lightly probed in screens.
- 01_ai_engineering/20_engineering_leadership_judgment/: meta; apply through capstone, not standalone study.
Coding round patterns
ML / AI coding (research-leaning loops):
- Implement multi-head attention from scratch (PyTorch, no library).
- Implement LoRA adapter and merge.
- Implement beam search or top-p sampling.
- Implement autoregressive generation loop.
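As a warm-up for the sampling drill, here is a sketch of top-p (nucleus) sampling in pure Python. It assumes the model has already produced a list of logits for the next token. The example logits are made up, and a real implementation would operate on tensors rather than lists.

```python
import math
import random


def top_p_filter(logits, top_p=0.9):
    """Return {token_index: renormalized probability} for the nucleus.

    The nucleus is the smallest set of highest-probability tokens whose
    cumulative probability reaches top_p.
    """
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]

    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    mass = sum(probs[i] for i in nucleus)
    return {i: probs[i] / mass for i in nucleus}


def sample_top_p(logits, top_p=0.9, rng=random):
    """Sample one token index from the nucleus distribution."""
    nucleus = top_p_filter(logits, top_p)
    tokens = list(nucleus)
    weights = [nucleus[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


logits = [2.0, 1.0, 0.1, -1.0]            # hypothetical next-token logits
kept = top_p_filter(logits, top_p=0.8)    # tokens 0 and 1 cover more than 80% of the mass
token = sample_top_p(logits, top_p=0.8)
print(sorted(kept), token)
```

The token that crosses the threshold is included, so the nucleus always holds at least one token even when `top_p` is very small.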
Classic algorithm (most loops):
- LRU cache (very common).
- Trie + DFS.
- Binary tree serialization.
- Union find.
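Since the LRU cache comes up so often, one compact version is worth having ready. The sketch below uses `collections.OrderedDict` from the standard library. Some interviewers will ask for the hash map plus doubly linked list version instead, so treat this as the short form rather than the only acceptable answer.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used key."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._data = OrderedDict()  # oldest entry first, newest last

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict the least recently used entry


cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")       # "a" is now the most recently used
cache.put("c", 3)    # evicts "b"
print(cache.get("a"), cache.get("b"), cache.get("c"))  # 1 None 3
```

Both `get` and `put` run in O(1) average time, which is the property interviewers usually probe.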
Company-specific patterns:
- OpenAI: KV stores, versioned databases, credit/quota management.
- Anthropic: 4-level progressive build — start with SET/GET, end at timestamped TTL with snapshots.
Practical / take-home patterns:
- JSON extraction + LLM summarization.
- Web crawler with rate limiting.
- Medical/legal document NLP.
Take-home assignments (real, from 2026 reports)
| Assignment | What it tests |
|---|---|
| Customer support RAG chatbot (100+ concurrent, <2s latency) | RAG + production + latency |
| Document Q&A with citations | Faithfulness, retrieval, structured output |
| Blood test PDF analysis with online retrieval | Multi-step, tool use, structured output |
| Multi-agent content generation (5+ agents) | Agent orchestration |
| Marksheet extraction API with confidence scoring | OCR/structured extraction + calibration |
| Real-time call transcription with insights | Streaming + voice |
Red flags to push back on:
- 72-hour "Round 1" demands.
- Unpaid scope that resembles paid consulting work.
Capstone narrative (the deep-dive round)
The 25–45 minute "walk me through an end-to-end project you owned" round is universal. Have one project rehearsed cold.
Structure that works:
- Problem — one sentence. Who hurts, how much.
- Constraints — latency, cost, privacy, accuracy bar.
- Architecture — pipeline diagram in your head, three layers max.
- Why this, not that — name two alternatives you rejected, and the reason.
- Metrics — concrete numbers. "P95 dropped from 4.2s to 1.1s. Cost per query went from $0.018 to $0.004."
- Eval framework — what's in the golden set, how often it runs, what failed.
- What broke — one production incident, root cause, fix.
- What you'd change — show taste.
Senior signal: numbers, alternatives rejected, real eval framework, a real incident.
Junior signal: "we used LangChain and it worked."
Behavioral patterns (2026-specific)
Frequent prompts:
- "How do you stay updated with fast-changing AI tech?"
- "Walk me through an ethical concern in an ML project."
- "Tell me about a cost or latency reduction you drove."
- "How do you handle ambiguous, evolving requirements?"
- "Tell me about a model provider decision you defended."
Recommended prep sequence (cramming order)
This is the interview order, not the curriculum order.
- Foundations refresh: 00_ai_foundation/02 → 00_ai_foundation/03
- The big four: 00_ai_foundation/07 + 01_ai_engineering/13 → 01_ai_engineering/08 → 01_ai_engineering/09 → 02_ai_infrastructure/03
- Agents: 01_ai_engineering/00 → 01_ai_engineering/01 → 01_ai_engineering/02
- Eval & production: 04_ai_product_evals/00 → 04_ai_product_evals/01 → 04_ai_product_evals/02 → 01_ai_engineering/03 → 03_ai_security_safety/00 → 02_ai_infrastructure/05
- System design: parallel 06_system_designing/ track
- Fine-tune layer: 00_ai_foundation/06 → 00_ai_foundation/05
- Capstone story: 01_ai_engineering/21 (one project, 25-min cold narration with metrics)
- Coding warmup: 02_ai_infrastructure/00 + ML coding drills (MHA, LoRA, beam search)
Sources
- Every AI Engineer Interview Question You Need to Know in 2026 — Adil Shamim, 100+ real interviews
- What Is an AI Engineer? 2026 Role, Skills and Responsibilities — based on 1,000+ JDs
- How to Hire RAG Engineers in 2026: Salary, Skills & Interview Guide — KORE1
- Top 30 RAG Interview Questions — DataCamp 2026
- Top 50 AI Engineer Interview Questions — LockedinAI 2026
- 25 Advanced Agentic AI Interview Questions — AEM Institute, Feb 2026
- 30 Agentic AI Interview Questions — Analytics Vidhya, Feb 2026
- LLM Interview Questions repo — llmgenai/LLMInterviewQuestions
- RAG Interview Questions repo — KalyanKS-NLP
- AI Engineer Job Description Template — KORE1 2026