
Applied AI Engineer — Interview Focus Areas

Grounded in 2026 data: 100+ real interview reports, 1,000+ AI engineer JDs, and current salary surveys (sources at the end). The role sits between data science and software engineering — more production-oriented than the former, more AI-specific than the latter.

Market signal — what employers list (share of 2026 JDs):

82.5% Python
35.9% RAG
29.1% Prompt engineering
25.4% General LLM knowledge

Compensation (US, 2026)

Mid-level $130K–$175K · Senior $195K–$290K

This guide is organized by focus area, not by curriculum module. Each section maps to the reorganized AI tracks under 00_ai_foundation/ through 05_ai_specializations/. Tier 1 areas appear in nearly every loop. Tier 2 appears in most senior loops. Tier 3 is role-specific.


Tier 1 — Appears in nearly every loop

1. RAG systems (end-to-end)

The single highest-yield topic. "Design a RAG system for customer support" is the most-reported question across companies. You will be asked to design one, debug one, evaluate one, or all three.

Mental model. RAG = retrieve documents, stuff them into context, generate. Every component is a failure mode.

What you must be able to do:

  • Walk through the full pipeline: ingest → chunk → embed → index → retrieve → rerank → augment → generate → evaluate → trace.
  • Defend chunking strategy (fixed, semantic, hierarchical) and explain when each fails.
  • Choose between sparse (BM25), dense (embeddings), and hybrid retrieval — and explain hybrid scoring (RRF, weighted).
  • Pick a vector DB: pgvector (small, transactional), Pinecone/Weaviate (managed), Qdrant/Milvus (self-host scale). Defend the trade-off.
  • Handle conflicting sources, recency bias, and citation/attribution.
  • Detect hallucinations: faithfulness scoring, citation overlap, LLM-as-judge, RAGAS metrics (faithfulness, answer relevance, context precision, context recall).
  • Reason about cost-per-query and latency-per-query at 1M queries/day.
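
Hybrid scoring with RRF (from the retrieval bullet above) can be sketched in a few lines. This is a minimal illustration, not a production ranker: the document IDs are made up, and k=60 is a commonly used default rather than a tuned value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one ranked list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

bm25_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical sparse (BM25) results
dense_hits = ["doc_c", "doc_a", "doc_d"]  # hypothetical dense (embedding) results
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

RRF uses only ranks, so it sidesteps the problem that BM25 scores and cosine similarities live on different scales. Weighted fusion needs score normalization first.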

Common questions (verbatim from interview reports):

  • "Design a RAG system that handles conflicting information across sources."
  • "How do you detect and mitigate hallucinations in production?"
  • "What's the difference between a retriever returning the wrong document and a generator ignoring the right one — how do you debug?"
  • "Your chunk size is 512 but legal documents have 50-page contracts. What breaks?"
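
For the chunk-size question, it helps to have fixed-size chunking with overlap in your fingers. The sketch below assumes the text is already tokenized and uses a hypothetical 600 tokens per page to size the contract; both numbers are illustrative.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the document
    return chunks

# A 50-page contract at an assumed 600 tokens per page.
contract = [f"tok{i}" for i in range(50 * 600)]
chunks = chunk_tokens(contract)
print(len(chunks), len(chunks[-1]))
```

Under these assumptions the contract becomes dozens of chunks, so a definition and the clause that depends on it can easily land in different chunks. That is the opening to argue for hierarchical or structure-aware chunking.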

Pitfalls to avoid in interview:

  • Treating RAG as one box. Decompose it.
  • Forgetting evaluation. Always say "and then we measure faithfulness with…"
  • Claiming dense retrieval is always better. It isn't — sparse wins on rare entities and exact-match.
  • Ignoring re-ranking. Cross-encoders are cheap compared to LLM generation.

Modules: 01_ai_engineering/08_rag_system_design/, 01_ai_engineering/09_advanced_rag_patterns/, 02_ai_infrastructure/03_vector_retrieval_infrastructure/, 01_ai_engineering/07_search_relevance_ranking/, 01_ai_engineering/10_knowledge_graph_retrieval/ (if Graph RAG role).


2. Evaluation & production trade-offs

The differentiator at senior loops. Most candidates can talk RAG; few can talk eval rigorously. A direct quote from a 2026 interview: "Is there an actual eval framework here, or is it vibes-based?"

Mental model. Evals = unit tests for non-deterministic systems. If you cannot measure it, you cannot improve it.

What you must be able to do:

  • Distinguish offline evals (golden datasets, regression) from online (A/B, canary, shadow).
  • Build a golden dataset: where do labels come from, how do you keep it fresh, how do you avoid leakage.
  • LLM-as-judge: pairwise vs pointwise, how to validate the judge itself.
  • Metrics zoo: ROUGE/BLEU (limited: they only measure surface n-gram overlap), perplexity (needs token probabilities from the model), task-specific (faithfulness, answer relevance, instruction-following).
  • Drift detection: input drift, output drift, judge drift.
  • Cost-aware quality: pareto frontiers of cost vs quality.
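
The cost vs quality Pareto frontier is easy to compute once each configuration has an eval score and a cost. A minimal sketch, with made-up configurations and numbers:

```python
def pareto_frontier(configs):
    """Return the configs that no other config beats on both cost and quality."""
    frontier = []
    for c in configs:
        dominated = any(
            o["cost"] <= c["cost"] and o["quality"] >= c["quality"]
            and (o["cost"] < c["cost"] or o["quality"] > c["quality"])
            for o in configs
        )
        if not dominated:
            frontier.append(c)
    return sorted(frontier, key=lambda c: c["cost"])

# Hypothetical results: cost in USD per 1K queries, quality as a 0-1 golden-set score.
configs = [
    {"name": "small", "cost": 0.40, "quality": 0.78},
    {"name": "small+rerank", "cost": 0.55, "quality": 0.86},
    {"name": "medium", "cost": 1.20, "quality": 0.85},
    {"name": "large", "cost": 4.00, "quality": 0.91},
]
frontier_names = [c["name"] for c in pareto_frontier(configs)]
print(frontier_names)
```

In this invented data, "medium" drops out because "small+rerank" is both cheaper and better. Only frontier configurations are worth debating.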

Common questions:

  • "How do you evaluate a chatbot?"
  • "Your eval metric went up but users complain more. What's wrong?"
  • "Walk me through an eval framework you built — what were the failure modes?"

Pitfalls:

  • Reaching for BLEU. It does not measure what you care about.
  • Forgetting human evaluation sampling alongside LLM-judge.
  • Confusing model evals with system evals (RAG eval ≠ LLM eval).
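
One way to validate an LLM judge against sampled human labels is chance-corrected agreement. The sketch below computes Cohen's kappa in plain Python; the label lists are invented for illustration, and kappa is one reasonable choice among several agreement statistics.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both raters labeled at random with their own base rates.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

human = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]  # hypothetical
judge = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "pass"]  # hypothetical
kappa = cohens_kappa(human, judge)
print(round(kappa, 3))
```

Raw agreement here is 75%, but kappa is noticeably lower because both raters say "pass" most of the time. That gap is the argument for not reporting raw agreement alone.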

Modules: 04_ai_product_evals/00_ai_evals_release_gates/, 04_ai_product_evals/01_dataset_golden_set_operations/, 04_ai_product_evals/02_telemetry_feedback_loops/, plus production-ops modules in Tier 2.


3. Agents & tool calling

Agentic AI is now its own interview round. The defining question of 2026: "What's the difference between an agent and a simple LLM chain?"

Mental model. Agent = LLM + tools + loop + memory + termination condition. Chains are DAGs; agents are graphs with cycles.

What you must be able to do:

  • Define agent autonomy boundaries — what can it do without human approval.
  • Design tool schemas (JSON schema, function signatures, parameter validation).
  • Sandbox tool execution — what isolates the agent from prod systems.
  • Manage memory: short-term (working context), long-term (episodic, semantic), summarization vs eviction.
  • Prevent over-reasoning loops (max iterations, budget, escalation).
  • Handle tool failures, retries, partial results.
  • Single-agent vs multi-agent: when does multi-agent justify the complexity.
  • Frameworks: LangGraph (stateful, production), CrewAI/AutoGen (multi-agent), MCP (tool protocol).
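
The "LLM + tools + loop + termination condition" model fits in one function. This is a framework-free sketch: `decide` stands in for an LLM call, and the tool, the token counts, and the return shapes are all hypothetical.

```python
def run_agent(decide, tools, task, max_steps=5, token_budget=2000):
    """Minimal agent loop: decide, act, observe, with hard stop conditions.

    `decide(task, history)` returns ("final", answer, tokens_used)
    or ("tool", (name, args), tokens_used).
    """
    history, spent = [], 0
    for step in range(max_steps):
        kind, payload, tokens = decide(task, history)
        spent += tokens
        if spent > token_budget:
            return {"status": "escalated", "reason": "token budget exceeded", "steps": step + 1}
        if kind == "final":
            return {"status": "done", "answer": payload, "steps": step + 1}
        name, args = payload
        try:
            observation = tools[name](**args)
        except Exception as exc:  # feed tool failures back instead of crashing the loop
            observation = f"tool error: {exc}"
        history.append((name, args, observation))
    return {"status": "escalated", "reason": "max steps reached", "steps": max_steps}

def scripted_decide(task, history):
    """Stand-in policy: call one tool, then answer from its observation."""
    if not history:
        return ("tool", ("lookup_order", {"order_id": "A1"}), 300)
    return ("final", f"Order status: {history[-1][2]}", 200)

tools = {"lookup_order": lambda order_id: "shipped"}
result = run_agent(scripted_decide, tools, "Where is order A1?")

# A policy that never finishes is cut off by the step cap.
looping = run_agent(lambda t, h: ("tool", ("lookup_order", {"order_id": "A1"}), 10), tools, "loop")
print(result, looping)
```

Remove the `for` loop and the stop conditions and what remains is a chain, which is the distinction interviewers are probing.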

Common questions:

  • "How do you prevent an agent from over-reasoning or over-planning?"
  • "Design an agent for end-to-end customer onboarding. Where does it call humans?"
  • "Your agent hits a tool that returns 500. What happens next?"

Pitfalls:

  • "Agentic" as buzzword. If your agent doesn't have a loop and a stop condition, it's a chain.
  • Forgetting tool-call observability. Every tool call is a span.
  • Letting agents run unbounded — interviewers love asking about budget caps.

Modules: 01_ai_engineering/01_agentic_system_design/, 01_ai_engineering/02_durable_agent_workflows/, 01_ai_engineering/11_long_term_memory_state/, 01_ai_engineering/16_multi_agent_coordination/.


4. Prompt engineering (system-level)

Not "tips and tricks." The system-level discipline of designing, versioning, and testing prompts.

Mental model. Prompts are code. Version them. Test them. Roll them back.

What you must be able to do:

  • Anatomy of a system prompt: role, instructions, constraints, format, examples, tools.
  • Few-shot placement: where in context, ordering effects, when to skip.
  • Chain-of-thought: when it helps, when it costs without lift.
  • Structured output: JSON schema, function calling, constrained decoding.
  • Temperature, top-p, top-k — when each matters.
  • Prompt chaining vs single-shot.
  • A/B testing prompts: shadow traffic, paired comparisons.
  • Versioning: how do you roll out a prompt change without breaking production.
  • The big trade-off triangle: prompt engineering → RAG → fine-tune. Know when each wins.
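
"Prompts are code" implies content-addressed versions, per-environment promotion, and rollback. The sketch below is one hypothetical way to model that in memory; the class and method names are invented, and a real setup would likely back this with a repository or database.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class PromptRegistry:
    versions: dict = field(default_factory=dict)  # version id -> template text
    active: dict = field(default_factory=dict)    # environment -> promotion history

    def register(self, template):
        """Store a template under a version id derived from its content."""
        version = hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]
        self.versions[version] = template
        return version

    def promote(self, env, version):
        if version not in self.versions:
            raise KeyError(f"unknown prompt version: {version}")
        self.active.setdefault(env, []).append(version)

    def rollback(self, env):
        """Drop the latest promotion and return the version that is live again."""
        history = self.active.get(env, [])
        if len(history) < 2:
            raise RuntimeError("nothing to roll back to")
        history.pop()
        return history[-1]

    def render(self, env, **variables):
        return self.versions[self.active[env][-1]].format(**variables)

registry = PromptRegistry()
v1 = registry.register("Answer briefly: {question}")
v2 = registry.register("Answer briefly and cite sources: {question}")
registry.promote("prod", v1)
registry.promote("prod", v2)
registry.rollback("prod")  # v2 misbehaved, so prod returns to v1
rendered = registry.render("prod", question="What is RAG?")
print(rendered)
```

Hashing the template means the same text always gets the same version id, so traces and eval results can be joined on it.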

Common questions (top-5 globally):

  • "When would you fine-tune vs use prompt engineering vs RAG?"
  • "Your prompt works 90% of the time. The 10% is critical. What do you do?"
  • "How do you version prompts across environments?"

Pitfalls:

  • Treating prompts as static strings. They drift with model upgrades.
  • Forgetting that fine-tuning vs RAG is not either/or. They compose.

Modules: 00_ai_foundation/07_prompting_fundamentals/, 01_ai_engineering/13_prompt_lifecycle_operations/.


5. Cost & latency optimization at scale

Top-5 question globally: "Your app gets 1M queries/day — how do you optimize cost?"

Mental model. Three knobs: fewer tokens, cheaper model, fewer calls.

What you must be able to do:

  • Reduce tokens: prompt compression, context trimming, summary memory, caching system prompts.
  • Tier models: route easy queries to cheap models, escalate hard ones (router models, confidence-based escalation).
  • Cache: exact-match cache, semantic cache, KV-cache awareness (prefix caching on providers).
  • Batch when latency allows. Stream when latency matters.
  • Quantize/distill for self-hosted.
  • Speculative decoding for hot paths.
  • Talk in $ and ms, not "fast" and "cheap." Numbers matter.
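
Talking in dollars means being able to do the arithmetic live. The sketch below uses placeholder prices and token counts (not any provider's actual rates) and treats cache hits as free, which is a simplification.

```python
def daily_cost_usd(queries_per_day, input_tokens, output_tokens,
                   input_price_per_m, output_price_per_m, cache_hit_rate=0.0):
    """Daily LLM spend, with prices quoted in USD per million tokens."""
    per_query = (input_tokens * input_price_per_m
                 + output_tokens * output_price_per_m) / 1_000_000
    billable_queries = queries_per_day * (1.0 - cache_hit_rate)
    return billable_queries * per_query

# Assumed workload: 1M queries/day, 2,000 input and 300 output tokens per query,
# at hypothetical prices of $3 and $15 per million input and output tokens.
baseline = daily_cost_usd(1_000_000, 2_000, 300, 3.00, 15.00)
with_cache = daily_cost_usd(1_000_000, 2_000, 300, 3.00, 15.00, cache_hit_rate=0.30)
trimmed = daily_cost_usd(1_000_000, 1_200, 300, 3.00, 15.00)
print(baseline, with_cache, trimmed)
```

With these made-up inputs the baseline is $10,500/day, a 30% cache hit rate brings it to $7,350, and trimming context to 1,200 input tokens brings it to $8,100. The same function lets you compare the three knobs on one scale.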

Common questions:

  • "Your app gets 1M queries/day. Walk me through the cost stack."
  • "P95 latency is 3 seconds. Customers complain. What do you measure first?"
  • "When would you self-host vs use an API?"

Pitfalls:

  • Reaching for fine-tuning to cut cost. Often a worse ROI than caching + router.
  • Quoting cost without context window math.
  • Forgetting that the embedding model has cost too.

Modules: 02_ai_infrastructure/05_agent_performance_economics/, 02_ai_infrastructure/02_inference_serving_systems/, 01_ai_engineering/03_agent_observability_debugging/.


6. LLM fundamentals

Conceptual fluency, not coding-deep. You will not be asked to implement multi-head attention from memory in most loops (some research-leaning loops do; see Coding round patterns).

Mental model. Transformer = attention layers stacked with residual connections. Attention = each token looks at every other token weighted by relevance.
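
The attention half of that mental model, as a pure-Python sketch of single-head scaled dot-product attention. It deliberately omits learned projections, multiple heads, and causal masking, so it illustrates the weighting idea rather than a full transformer layer.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Each output row is a relevance-weighted average of the value rows."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token vectors
out = attention(x, x, x)                  # self-attention: Q, K, V all come from x
print(out)
```

The nested loop over queries and keys is the quadratic cost in sequence length that the context-window discussion refers to.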

What you must be able to do:

  • Explain attention in one paragraph without notes.
  • Explain why context windows are a quadratic-cost problem.
  • Tokenization: BPE vs WordPiece vs Unigram (SentencePiece is a library that implements BPE and Unigram). Why 1 word ≈ 1.3 tokens in English.
  • KV cache: what it stores, why it speeds up generation, why it grows with context.
  • Pre-training vs SFT vs RLHF/DPO — at a system level, not loss-function level.
  • Scaling laws: parameters vs tokens (Chinchilla-style intuition).
  • Context window vs effective context (lost-in-the-middle).
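
"Why it grows with context" is a one-line formula: keys and values are stored per layer, per KV head, per token. The model shape below is hypothetical and chosen only to make the arithmetic round.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_value=2):
    """KV cache size: 2 tensors (K and V) per layer, per head, per token."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value

# Assumed shape: 32 layers, 8 KV heads, head_dim 128, fp16 (2 bytes per value).
per_token = kv_cache_bytes(32, 8, 128, seq_len=1)
at_8k = kv_cache_bytes(32, 8, 128, seq_len=8192)
print(per_token / 1024, "KiB per token;", at_8k / 2**30, "GiB at 8,192 tokens")
```

For this assumed shape the cache costs 128 KiB per token and 1 GiB for a single 8,192-token sequence, which is why long contexts and large batches compete for the same GPU memory.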

Common questions:

  • "How do LLMs work?"
  • "What is tokenization and how does it affect LLM performance?"
  • "Why does an LLM hallucinate?"

Pitfalls:

  • Going too deep too fast. Interviewer signals depth; match it.
  • Forgetting tokenization → cost link. They are the same conversation.

Modules: 00_ai_foundation/02_tokens_embeddings_context/, 00_ai_foundation/03_transformer_mechanics/.


Tier 2 — Most senior loops

7. Fine-tuning & adaptation

Usually probed in a "when would you" framing, not "implement RLHF" framing.

What you must be able to do:

  • PEFT family: LoRA, QLoRA, prefix tuning. When each fits.
  • Instruction tuning vs preference tuning (DPO vs RLHF).
  • Quantization for inference: int8, int4, FP8, GPTQ, AWQ.
  • Distillation: teacher-student, when it pays off.
  • Decision tree: prompt → RAG → fine-tune. Defend the order.
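
The case for LoRA is mostly a parameter count. A sketch assuming a single 4096×4096 weight matrix and rank 8, both illustrative: LoRA trains two low-rank factors instead of the full matrix.

```python
def lora_param_counts(d_in, d_out, rank):
    """Trainable parameters: full fine-tuning vs a rank-r LoRA adapter."""
    full = d_in * d_out            # every entry of the weight matrix W
    lora = rank * (d_in + d_out)   # A is (rank x d_in), B is (d_out x rank)
    return full, lora

full, lora = lora_param_counts(4096, 4096, rank=8)
print(full, lora, lora / full)
```

Under these assumptions the adapter trains 65,536 parameters against 16,777,216 for the full matrix, roughly 0.4%. QLoRA keeps the same adapter idea but stores the frozen base weights quantized.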

Modules: 00_ai_foundation/06_adaptation_compression/, 00_ai_foundation/05_llm_training_pipeline/.


8. AI system design

Now a dedicated round at most senior loops. Reported questions:

  • "Design ChatGPT" / "Scale a chat feature to 1M daily users."
  • "Document Q&A with hallucination prevention at 10M+ documents."
  • "Hospital voice assistant — noise, latency, privacy."
  • "AI-powered legal assistant."
  • "Image generation pipeline."

What you must be able to do:

  • Read the prompt for constraints: scale, latency budget, privacy class, modality.
  • Decompose: ingestion, retrieval, generation, eval, safety, observability, cost.
  • Defend choices with numbers (latency budget per stage, tokens per request, $ per query).
  • Show fallback paths: model down, retrieval empty, tool failure, rate limit hit.
  • Acknowledge non-AI infra: rate limiters, auth, logging, multi-tenant isolation.

Modules: ../06_system_designing/ (sibling track) — this is its own discipline.


9. Production MLOps & infrastructure

What you must be able to do:

  • Deployment patterns: blue-green, canary, shadow traffic.
  • Drift monitoring: input distribution, output distribution, judge score drift.
  • Traffic spike handling without overloading the model provider (queueing, rate limit budgets, graceful degradation).
  • Distributed inference: tensor parallelism vs pipeline parallelism (high level).
  • Self-hosted serving engines: vLLM, TGI, TensorRT-LLM, SGLang — pick one and know it.

Modules: 02_ai_infrastructure/04_ml_platform_operations/, 02_ai_infrastructure/02_inference_serving_systems/, 01_ai_engineering/04_resilient_agent_systems/, 02_ai_infrastructure/06_ai_runbooks_oncall/.


10. Safety & guardrails

Universal at safety-conscious companies (Anthropic, OpenAI, regulated industries).

What you must be able to do:

  • Prompt injection: direct, indirect (poisoned docs), jailbreaks. Mitigations: input filtering, output filtering, sandboxing, constrained decoding.
  • PII detection and redaction.
  • Output moderation: classifiers, regex, LLM-judge.
  • Red-teaming: how would you attack your own system.
  • Constitutional AI / RLHF-style refusal training (at a system level).
  • Audit logging and review workflows.
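
PII redaction is often a layered design, and regex is the cheapest layer. The patterns below are simplified, US-centric illustrations and would miss many real formats; they are a sketch of the shape, not a complete detector.

```python
import re

# Simplified, illustrative patterns. Real systems pair these with an NER model.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact(text):
    """Replace each match with a typed placeholder and count hits per type."""
    counts = {}
    for label, pattern in PII_PATTERNS.items():
        text, n = pattern.subn(f"[{label}]", text)
        counts[label] = n
    return text, counts

sample = "Contact jane.doe@example.com or 555-867-5309. SSN 123-45-6789."
redacted, counts = redact(sample)
print(redacted, counts)
```

Typed placeholders keep the text useful to the model downstream, and the counts give audit logging something to record without storing the PII itself.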

Modules: 03_ai_security_safety/00_safety_guardrail_design/, 03_ai_security_safety/01_prompt_injection_security/, 03_ai_security_safety/03_data_access_governance/.


11. Python & async engineering

Most coding rounds test engineering fundamentals, not ML.

What you must be able to do:

  • asyncio patterns: gather, wait_for, semaphore-based concurrency.
  • Retries with backoff and jitter.
  • Timeouts at every IO boundary.
  • The GIL: what it does, what it doesn't, when threading helps.
  • FastAPI: dependency injection, streaming responses, background tasks.
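
The first three bullets compose into one pattern: bound concurrency with a semaphore, put a timeout on every call, and retry with exponential backoff plus jitter. A minimal sketch using only the standard library; the flaky call is a stand-in for a real provider request.

```python
import asyncio
import random

async def call_with_retries(make_call, *, attempts=4, timeout=2.0,
                            base_delay=0.1, max_delay=2.0):
    """Await make_call() with a per-attempt timeout and jittered backoff."""
    for attempt in range(attempts):
        try:
            return await asyncio.wait_for(make_call(), timeout=timeout)
        except (asyncio.TimeoutError, ConnectionError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            await asyncio.sleep(random.uniform(0, delay))  # full jitter

async def fetch_all(calls, *, max_concurrency=5, **retry_kwargs):
    """Run all calls concurrently, never more than max_concurrency at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(make_call):
        async with semaphore:
            return await call_with_retries(make_call, **retry_kwargs)

    return await asyncio.gather(*(bounded(c) for c in calls))

def make_flaky(name, failures):
    """Hypothetical call that fails `failures` times, then succeeds."""
    state = {"left": failures}

    async def call():
        if state["left"] > 0:
            state["left"] -= 1
            raise ConnectionError(f"{name} unavailable")
        return f"{name}: ok"

    return call

results = asyncio.run(fetch_all(
    [make_flaky("a", 0), make_flaky("b", 2)], max_concurrency=2, base_delay=0.01))
print(results)
```

Only errors that are plausibly transient are retried here. Retrying on every exception is a common mistake worth calling out in the interview.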

Modules: 02_ai_infrastructure/00_ai_backend_api_engineering/.


12. Observability & tracing

Almost always probed when monitoring comes up.

What you must be able to do:

  • Span model for LLM apps: each call, each tool, each retrieval is a span.
  • OpenTelemetry, LangSmith, Helicone, Phoenix — pick one.
  • What to trace: prompt, response, tokens, latency, cost, eval scores, user ID.
  • Use traces to debug an agent that "just looped forever."
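
The span model is small enough to sketch without a tracing library. This toy tracer is not OpenTelemetry's API; the names and attributes are invented, and the shared stack makes it unsuitable for concurrent requests as written.

```python
import time
import uuid
from contextlib import contextmanager

class Tracer:
    """Toy tracer: nested spans with parent links, attributes, and latency."""

    def __init__(self):
        self.spans = []   # finished spans, in completion order
        self._stack = []  # currently open spans, innermost last

    @contextmanager
    def span(self, name, **attributes):
        record = {
            "id": uuid.uuid4().hex[:8],
            "parent": self._stack[-1]["id"] if self._stack else None,
            "name": name,
            "attributes": dict(attributes),
        }
        self._stack.append(record)
        start = time.perf_counter()
        try:
            yield record
        finally:
            record["latency_ms"] = (time.perf_counter() - start) * 1000
            self._stack.pop()
            self.spans.append(record)

tracer = Tracer()
with tracer.span("request", user_id="u_123") as root:
    with tracer.span("retrieval", top_k=5):
        pass  # the retriever call would go here
    with tracer.span("llm_call", model="hypothetical-model") as llm:
        llm["attributes"].update(prompt_tokens=812, completion_tokens=96)

names = [s["name"] for s in tracer.spans]
print(names)
```

An agent that "just looped forever" shows up in this structure as one request span with a long run of near-identical tool-call children.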

Modules: 01_ai_engineering/03_agent_observability_debugging/, 04_ai_product_evals/02_telemetry_feedback_loops/.


Tier 3 — Role-specific

Module | When it matters
01_ai_engineering/15_reasoning_routing_verification/ | Roles using o-series / R1-style reasoning
05_ai_specializations/01_multimodal_vision_systems/ | Multimodal product roles
05_ai_specializations/02_diffusion_media_generation/ | Image generation roles
05_ai_specializations/00_realtime_voice_agents/ | Voice agent / realtime roles
01_ai_engineering/11_long_term_memory_state/ (deep) | Agent-heavy roles
01_ai_engineering/10_knowledge_graph_retrieval/ | Enterprise knowledge platforms
01_ai_engineering/17_schema_driven_generation/ | Code assistant / structured extraction roles
01_ai_engineering/06_evidence_data_pipelines/ | Roles with heavy upstream data work

Tier 4 — Deprioritize for interview cramming

These matter for the job but are not heavily probed in interviews:

  • 00_ai_foundation/00_ml_prerequisites_refresher/ — down to ~20-30% of interview time.
  • 00_ai_foundation/01_neural_network_primitives/ — conceptual baseline only.
  • 00_ai_foundation/04_autoregressive_generation/ — understand but rarely live-coded outside research loops.
  • 03_ai_security_safety/02_ai_ethics_risk_fairness/ — important post-hire, lightly probed in screens.
  • 01_ai_engineering/20_engineering_leadership_judgment/ — meta; apply through capstone, not standalone study.

Coding round patterns

ML / AI coding (research-leaning loops):

  • Implement multi-head attention from scratch (PyTorch, no library).
  • Implement LoRA adapter and merge.
  • Implement beam search or top-p sampling.
  • Implement autoregressive generation loop.
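
Of the drills above, top-p sampling is the quickest to practice without a framework. A pure-Python sketch over raw logits; the toy distribution is invented, and temperature is assumed to be strictly positive.

```python
import math
import random

def top_p_filter(logits, temperature=1.0, top_p=0.9):
    """Return (token_index, probability) pairs in the nucleus, renormalized."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    nucleus, cumulative = [], 0.0
    for p, i in probs:
        nucleus.append((i, p))
        cumulative += p
        if cumulative >= top_p:
            break  # smallest prefix whose mass reaches top_p
    return [(i, p / cumulative) for i, p in nucleus]

def sample_top_p(logits, temperature=1.0, top_p=0.9, rng=random):
    nucleus = top_p_filter(logits, temperature, top_p)
    indices = [i for i, _ in nucleus]
    weights = [p for _, p in nucleus]
    return rng.choices(indices, weights=weights, k=1)[0]

# Toy vocabulary of four tokens with probabilities 0.5, 0.3, 0.15, 0.05.
logits = [math.log(0.5), math.log(0.3), math.log(0.15), math.log(0.05)]
nucleus = top_p_filter(logits, top_p=0.9)
token = sample_top_p(logits, top_p=0.9, rng=random.Random(0))
print(nucleus, token)
```

With top_p=0.9 the least likely token is excluded and the remaining three are renormalized. Lower temperature sharpens the distribution before the cutoff, so the nucleus shrinks.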

Classic algorithm (most loops):

  • LRU cache (very common).
  • Trie + DFS.
  • Binary tree serialization.
  • Union find.
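
Since the LRU cache is flagged as very common, here is a compact version built on `collections.OrderedDict`. An interviewer may instead ask for a hand-rolled doubly linked list plus dict, so treat this as the baseline to explain, not the only acceptable answer.

```python
from collections import OrderedDict

class LRUCache:
    """Least-recently-used cache with O(1) get and put."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # reading counts as a use
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" is now the most recently used
cache.put("c", 3)  # evicts "b"
print(cache.get("a"), cache.get("b"), cache.get("c"))
```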

Company-specific patterns:

  • OpenAI: KV stores, versioned databases, credit/quota management.
  • Anthropic: 4-level progressive build — start with SET/GET, end at timestamped TTL with snapshots.

Practical / take-home patterns:

  • JSON extraction + LLM summarization.
  • Web crawler with rate limiting.
  • Medical/legal document NLP.

Take-home assignments (real, from 2026 reports)

Assignment | What it tests
Customer support RAG chatbot (100+ concurrent, <2s latency) | RAG + production + latency
Document Q&A with citations | Faithfulness, retrieval, structured output
Blood test PDF analysis with online retrieval | Multi-step, tool use, structured output
Multi-agent content generation (5+ agents) | Agent orchestration
Marksheet extraction API with confidence scoring | OCR/structured extraction + calibration
Real-time call transcription with insights | Streaming + voice

Red flags to push back on:

  • 72-hour "Round 1" demands.
  • Unpaid scope that resembles paid consulting work.

Capstone narrative (the deep-dive round)

The 25–45 minute "walk me through an end-to-end project you owned" round is universal. Have one project rehearsed cold.

Structure that works:

  1. Problem — one sentence. Who hurts, how much.
  2. Constraints — latency, cost, privacy, accuracy bar.
  3. Architecture — pipeline diagram in your head, three layers max.
  4. Why this, not that — name two alternatives you rejected, and the reason.
  5. Metrics — concrete numbers. "P95 dropped from 4.2s to 1.1s. Cost per query went from $0.018 to $0.004."
  6. Eval framework — what's in the golden set, how often it runs, what failed.
  7. What broke — one production incident, root cause, fix.
  8. What you'd change — show taste.

Senior signal: numbers, alternatives rejected, real eval framework, a real incident.

Junior signal: "we used LangChain and it worked."


Behavioral patterns (2026-specific)

Frequent prompts:

  • "How do you stay updated with fast-changing AI tech?"
  • "Walk me through an ethical concern in an ML project."
  • "Tell me about a cost or latency reduction you drove."
  • "How do you handle ambiguous, evolving requirements?"
  • "Tell me about a model provider decision you defended."

Not the curriculum order — the interview order.

  1. Foundations refresh: 00_ai_foundation/02, 00_ai_foundation/03
  2. The big four: 00_ai_foundation/07 + 01_ai_engineering/13, 01_ai_engineering/08, 01_ai_engineering/09, 02_ai_infrastructure/03
  3. Agents: 01_ai_engineering/00, 01_ai_engineering/01, 01_ai_engineering/02
  4. Eval & production: 04_ai_product_evals/00, 04_ai_product_evals/01, 04_ai_product_evals/02, 01_ai_engineering/03, 03_ai_security_safety/00, 02_ai_infrastructure/05
  5. System design: parallel 06_system_designing/ track
  6. Fine-tune layer: 00_ai_foundation/06, 00_ai_foundation/05
  7. Capstone story: 01_ai_engineering/21 — one project, 25-min cold narration with metrics
  8. Coding warmup: 02_ai_infrastructure/00 + ML coding drills (MHA, LoRA, beam search)

Sources