Applied AI Engineer: Interview Focus Areas

Grounded in 2026 data: 100+ real interview reports, 1,000+ AI engineer JDs, and current salary surveys (sources at the end). The role sits between data science and software engineering — more production-oriented than the former, more AI-specific than the latter.

Market signal — what employers list (share of 2026 JDs):

82.5% Python

35.9% RAG

29.1% Prompt engineering

25.4% General LLM knowledge

Compensation (US, 2026)

Mid-level $130K–$175K · Senior $195K–$290K

This guide is organized by focus area, not by curriculum module. Each section maps to the reorganized AI tracks under 00_ai_foundation/ through 05_ai_specializations/. Tier 1 areas appear in nearly every loop. Tier 2 appears in most senior loops. Tier 3 is role-specific.

Tier 1: Appears in nearly every loop

1. RAG systems (end-to-end)

The single highest-yield topic. "Design a RAG system for customer support" is the most-reported question across companies. You will be asked to design one, debug one, evaluate one, or all three.

Mental model. RAG = retrieve documents, stuff them into context, generate. Every component is a failure mode.

What you must be able to do:

Walk through the full pipeline: ingest → chunk → embed → index → retrieve → rerank → augment → generate → evaluate → trace.
Defend chunking strategy (fixed, semantic, hierarchical) and explain when each fails.
Choose between sparse (BM25), dense (embeddings), and hybrid retrieval — and explain hybrid scoring (RRF, weighted).
Pick a vector DB: pgvector (small, transactional), Pinecone/Weaviate (managed), Qdrant/Milvus (self-host scale). Defend the trade-off.
Handle conflicting sources, recency bias, and citation/attribution.
Detect hallucinations: faithfulness scoring, citation overlap, LLM-as-judge, RAGAS metrics (faithfulness, answer relevance, context precision, context recall).
Reason about cost-per-query and latency-per-query at 1M queries/day.
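
The hybrid scoring mentioned above can be sketched with reciprocal rank fusion (RRF). This is a minimal illustration, not a production retriever: the document IDs are made up, and `k=60` is a commonly used default rather than a tuned value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

bm25_hits = ["doc_a", "doc_b", "doc_c"]    # hypothetical sparse (BM25) results
dense_hits = ["doc_c", "doc_a", "doc_d"]   # hypothetical dense (embedding) results
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

RRF uses only ranks, so it sidesteps the problem that BM25 scores and cosine similarities live on different scales. A weighted-score variant needs score normalization first.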

Common questions (verbatim from interview reports):

"Design a RAG system that handles conflicting information across sources."
"How do you detect and mitigate hallucinations in production?"
"What's the difference between a retriever returning the wrong document and a generator ignoring the right one — how do you debug?"
"Your chunk size is 512 but legal documents have 50-page contracts. What breaks?"

Pitfalls to avoid in interview:

Treating RAG as one box. Decompose it.
Forgetting evaluation. Always say "and then we measure faithfulness with…"
Claiming dense retrieval is always better. It isn't — sparse wins on rare entities and exact-match.
Ignoring re-ranking. Cross-encoders are cheap compared to LLM generation.

Modules: 01_ai_engineering/08_rag_system_design/, 01_ai_engineering/09_advanced_rag_patterns/, 02_ai_infrastructure/03_vector_retrieval_infrastructure/, 01_ai_engineering/07_search_relevance_ranking/, 01_ai_engineering/10_knowledge_graph_retrieval/ (if Graph RAG role).

2. Evaluation & production trade-offs

The differentiator at senior loops. Most candidates can talk RAG; few can talk eval rigorously. A direct quote from a 2026 interview: "Is there an actual eval framework here, or is it vibes-based?"

Mental model. Evals = unit tests for non-deterministic systems. If you cannot measure it, you cannot improve it.

What you must be able to do:

Distinguish offline evals (golden datasets, regression) from online (A/B, canary, shadow).
Build a golden dataset: where do labels come from, how do you keep it fresh, how do you avoid leakage.
LLM-as-judge: pairwise vs pointwise, how to validate the judge itself.
Metrics zoo: ROUGE/BLEU (limited: surface n-gram overlap only), perplexity (needs token probabilities from the model, which not every hosted API exposes), task-specific (faithfulness, answer relevance, instruction-following).
Drift detection: input drift, output drift, judge drift.
Cost-aware quality: pareto frontiers of cost vs quality.
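
One way to make the cost-vs-quality Pareto frontier concrete is to filter out every configuration that is beaten on both axes. The sketch below assumes each configuration has already been scored offline. The names, costs, and quality scores are invented for illustration.

```python
def pareto_frontier(configs):
    """Return configs not dominated on (lower cost, higher quality).

    configs: list of (name, cost_per_query_usd, quality_score) tuples.
    """
    frontier = []
    best_quality = float("-inf")
    # Walk from cheapest to most expensive; at equal cost, best quality first.
    for name, cost, quality in sorted(configs, key=lambda c: (c[1], -c[2])):
        # A config survives only if it beats every cheaper config on quality.
        if quality > best_quality:
            frontier.append((name, cost, quality))
            best_quality = quality
    return frontier

candidates = [                      # illustrative numbers only
    ("small", 0.001, 0.71),
    ("small+rerank", 0.002, 0.80),
    ("medium", 0.004, 0.78),        # costs more than small+rerank, scores lower
    ("large", 0.012, 0.86),
]
frontier = pareto_frontier(candidates)
print([name for name, _, _ in frontier])
```

Anything off the frontier can be dropped from the discussion. The remaining choice is which point on the frontier the product can afford.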

Common questions:

"How do you evaluate a chatbot?"
"Your eval metric went up but users complain more. What's wrong?"
"Walk me through an eval framework you built — what were the failure modes?"

Pitfalls:

Reaching for BLEU. It does not measure what you care about.
Forgetting human evaluation sampling alongside LLM-judge.
Confusing model evals with system evals (RAG eval ≠ LLM eval).

Modules: 04_ai_product_evals/00_ai_evals_release_gates/, 04_ai_product_evals/01_dataset_golden_set_operations/, 04_ai_product_evals/02_telemetry_feedback_loops/, plus production-ops modules in Tier 2.

3. Agents & tool calling

Agentic AI is now its own interview round. The defining question of 2026: "What's the difference between an agent and a simple LLM chain?"

Mental model. Agent = LLM + tools + loop + memory + termination condition. Chains are DAGs; agents are graphs with cycles.

What you must be able to do:

Define agent autonomy boundaries — what can it do without human approval.
Design tool schemas (JSON schema, function signatures, parameter validation).
Sandbox tool execution — what isolates the agent from prod systems.
Manage memory: short-term (working context), long-term (episodic, semantic), summarization vs eviction.
Prevent over-reasoning loops (max iterations, budget, escalation).
Handle tool failures, retries, partial results.
Single-agent vs multi-agent: when does multi-agent justify the complexity.
Frameworks and protocols: LangGraph (stateful, production), CrewAI/AutoGen (multi-agent), MCP (tool protocol).
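
The "LLM + tools + loop + termination condition" model can be sketched without any framework. In the example below the `policy` callable stands in for the LLM call, and its return shape (`final`, `tool`, `args`, `tokens`) is an assumption made for this illustration, not any library's API.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AgentResult:
    status: str                      # "done", "max_steps", or "over_budget"
    answer: Optional[str] = None
    steps: list = field(default_factory=list)

def run_agent(policy, tools, task, max_steps=5, token_budget=1000):
    """Loop: ask the policy for an action, run the tool, feed the result back.

    policy(task, history) returns {"final": str} or {"tool": name, "args": dict},
    plus an optional "tokens" count spent on that call.
    """
    history, spent = [], 0
    for _ in range(max_steps):                  # hard cap on iterations
        action = policy(task, history)
        spent += action.get("tokens", 0)
        if spent > token_budget:                # budget cap
            return AgentResult("over_budget", steps=history)
        if "final" in action:                   # termination condition
            return AgentResult("done", action["final"], history)
        try:
            observation = tools[action["tool"]](**action["args"])
        except Exception as exc:                # tool failure becomes an observation
            observation = f"tool_error: {exc}"
        history.append((action["tool"], observation))
    return AgentResult("max_steps", steps=history)

def stub_policy(task, history):                 # hypothetical stand-in for an LLM
    if not history:
        return {"tool": "lookup", "args": {"key": task}, "tokens": 50}
    return {"final": f"answer based on {history[-1][1]}", "tokens": 30}

def looping_policy(task, history):              # never decides to stop
    return {"tool": "lookup", "args": {"key": task}, "tokens": 10}

tools = {"lookup": lambda key: f"record for {key}"}
ok = run_agent(stub_policy, tools, "order-42")
stuck = run_agent(looping_policy, tools, "order-42", max_steps=3)
print(ok.status, stuck.status)
```

Returning a status instead of raising lets the caller decide whether `max_steps` or `over_budget` should escalate to a human.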

Common questions:

"How do you prevent an agent from over-reasoning or over-planning?"
"Design an agent for end-to-end customer onboarding. Where does it call humans?"
"Your agent hits a tool that returns 500. What happens next?"

Pitfalls:

"Agentic" as buzzword. If your agent doesn't have a loop and a stop condition, it's a chain.
Forgetting tool-call observability. Every tool call is a span.
Letting agents run unbounded — interviewers love asking about budget caps.

Modules: 01_ai_engineering/01_agentic_system_design/, 01_ai_engineering/02_durable_agent_workflows/, 01_ai_engineering/11_long_term_memory_state/, 01_ai_engineering/16_multi_agent_coordination/.

4. Prompt engineering (system-level)

Not "tips and tricks." The system-level discipline of designing, versioning, and testing prompts.

Mental model. Prompts are code. Version them. Test them. Roll them back.

What you must be able to do:

Anatomy of a system prompt: role, instructions, constraints, format, examples, tools.
Few-shot placement: where in context, ordering effects, when to skip.
Chain-of-thought: when it helps, when it costs without lift.
Structured output: JSON schema, function calling, constrained decoding.
Temperature, top-p, top-k — when each matters.
Prompt chaining vs single-shot.
A/B testing prompts: shadow traffic, paired comparisons.
Versioning: how do you roll out a prompt change without breaking production.
The big trade-off triangle: prompt engineering → RAG → fine-tuning. Know when each wins.
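
"Prompts are code" implies that every prompt has an identifiable version and a rollback path. The registry below is a hypothetical in-memory sketch of that idea. The class name, method names, and hash-based version IDs are all assumptions for illustration, and a real system would persist versions and tie them to eval results.

```python
import hashlib

class PromptRegistry:
    """In-memory registry: each prompt name maps to an ordered list of versions."""

    def __init__(self):
        self._versions = {}   # name -> list of (version_id, template)
        self._active = {}     # name -> index of the active version

    def register(self, name, template):
        # A content hash as the version ID means identical text gets an identical ID.
        version_id = hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]
        self._versions.setdefault(name, []).append((version_id, template))
        self._active[name] = len(self._versions[name]) - 1
        return version_id

    def rollback(self, name):
        if self._active[name] == 0:
            raise ValueError("no earlier version to roll back to")
        self._active[name] -= 1
        return self._versions[name][self._active[name]][0]

    def render(self, name, **variables):
        version_id, template = self._versions[name][self._active[name]]
        # Return the version ID with the text so it can be logged on every call.
        return version_id, template.format(**variables)

registry = PromptRegistry()
v1 = registry.register("support", "You are a support agent. Question: {question}")
v2 = registry.register("support", "You are a concise support agent. Question: {question}")
rolled_back_to = registry.rollback("support")
active_id, text = registry.render("support", question="Where is my order?")
print(active_id, text)
```

Logging the version ID next to each response is what makes a regression traceable to a specific prompt change.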

Common questions (top-5 globally):

"When would you fine-tune vs use prompt engineering vs RAG?"
"Your prompt works 90% of the time. The 10% is critical. What do you do?"
"How do you version prompts across environments?"

Pitfalls:

Treating prompts as static strings. They drift with model upgrades.
Forgetting that fine-tuning vs RAG is not either/or. They compose.

Modules: 00_ai_foundation/07_prompting_fundamentals/, 01_ai_engineering/13_prompt_lifecycle_operations/.

5. Cost & latency optimization at scale

Top-5 question globally: "Your app gets 1M queries/day — how do you optimize cost?"

Mental model. Three knobs: fewer tokens, cheaper model, fewer calls.

What you must be able to do:

Reduce tokens: prompt compression, context trimming, summary memory, caching system prompts.
Tier models: route easy queries to cheap models, escalate hard ones (router models, confidence-based escalation).
Cache: exact-match cache, semantic cache, KV-cache awareness (prefix caching on providers).
Batch when latency allows. Stream when latency matters.
Quantize/distill for self-hosted.
Speculative decoding for hot paths.
Talk in $ and ms, not "fast" and "cheap." Numbers matter.
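
Talking in dollars starts with simple arithmetic. The sketch below computes daily spend from token counts and per-million-token prices. Every number in it is an illustrative assumption, not a real provider price, and it assumes a cache hit costs nothing.

```python
def daily_cost_usd(queries_per_day, input_tokens, output_tokens,
                   usd_per_m_input, usd_per_m_output, cache_hit_rate=0.0):
    """Daily LLM spend, assuming cached responses incur no model cost."""
    billable_queries = queries_per_day * (1.0 - cache_hit_rate)
    per_query = (input_tokens * usd_per_m_input
                 + output_tokens * usd_per_m_output) / 1_000_000
    return billable_queries * per_query

# Assumed workload: 1M queries/day, 2,000 input and 300 output tokens per query,
# priced at $3 and $15 per million tokens (made-up prices).
baseline = daily_cost_usd(1_000_000, 2_000, 300, 3.00, 15.00)
with_cache = daily_cost_usd(1_000_000, 2_000, 300, 3.00, 15.00, cache_hit_rate=0.30)
print(f"baseline ${baseline:,.0f}/day, with 30% cache hits ${with_cache:,.0f}/day")
```

Under these assumptions the per-query cost is $0.0105, so the baseline is about $10,500 per day and a 30% cache hit rate brings it to about $7,350. The same function makes it easy to compare trimming input tokens against routing to a cheaper model.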

Common questions:

"Your app gets 1M queries/day. Walk me through the cost stack."
"P95 latency is 3 seconds. Customers complain. What do you measure first?"
"When would you self-host vs use an API?"

Pitfalls:

Reaching for fine-tuning to cut cost. Often a worse ROI than caching + router.
Quoting cost without context window math.
Forgetting that the embedding model has cost too.

Modules: 02_ai_infrastructure/05_agent_performance_economics/, 02_ai_infrastructure/02_inference_serving_systems/, 01_ai_engineering/03_agent_observability_debugging/.

6. LLM fundamentals

Conceptual fluency, not coding-deep. You will not be asked to implement multi-head attention from memory in most loops (some research-leaning loops do — see Tier 2 coding).

Mental model. Transformer = attention and feed-forward layers stacked with residual connections. Attention = each token looks at the other tokens in context (only earlier ones in a decoder-only LLM), weighted by relevance.

What you must be able to do:

Explain attention in one paragraph without notes.
Explain why context windows are a quadratic-cost problem.
Tokenization: BPE vs WordPiece vs Unigram (SentencePiece is a library that implements BPE and Unigram, not a separate algorithm). Why 1 word ≈ 1.3 tokens in English.
KV cache: what it stores, why it speeds up generation, why it grows with context.
Pre-training vs SFT vs RLHF/DPO — at a system level, not loss-function level.
Scaling laws: parameters vs tokens (Chinchilla-style intuition).
Context window vs effective context (lost-in-the-middle).
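
Why the KV cache grows with context is easiest to see as arithmetic: the cache holds one key vector and one value vector per layer, per KV head, per token. The model shape below is a hypothetical one chosen for round numbers, not a specific released model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """KV cache size: keys plus values for every layer, KV head, and token."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Assumed shape: 32 layers, 8 KV heads, head_dim 128, fp16 (2 bytes per value).
per_token = kv_cache_bytes(32, 8, 128, seq_len=1)
at_8k = kv_cache_bytes(32, 8, 128, seq_len=8192)
print(per_token // 1024, "KiB per token;", at_8k // 2**30, "GiB at 8,192 tokens")
```

With these assumed dimensions each token adds 128 KiB, so a single 8,192-token sequence holds 1 GiB of cache. Memory scales linearly with both context length and batch size, which is why long contexts limit serving concurrency.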

Common questions:

"How do LLMs work?"
"What is tokenization and how does it affect LLM performance?"
"Why does an LLM hallucinate?"

Pitfalls:

Going too deep too fast. Interviewer signals depth; match it.
Forgetting tokenization → cost link. They are the same conversation.

Modules: 00_ai_foundation/02_tokens_embeddings_context/, 00_ai_foundation/03_transformer_mechanics/.

Tier 2: Most senior loops

7. Fine-tuning & adaptation

Usually probed in a "when would you" framing, not "implement RLHF" framing.

What you must be able to do:

PEFT family: LoRA, QLoRA, prefix tuning. When each fits.
Instruction tuning vs preference tuning (DPO vs RLHF).
Quantization for inference: number formats (int8, int4, FP8) and post-training methods (GPTQ, AWQ).
Distillation: teacher-student, when it pays off.
Decision tree: prompt → RAG → fine-tune. Defend the order.
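
A quick parameter count shows why LoRA is cheap to train. LoRA freezes a weight matrix of shape d_out x d_in and learns a low-rank update B @ A, where B is d_out x r and A is r x d_in. The hidden size and rank below are assumed values for illustration.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Trainable parameters in one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

d = 4096                                  # hypothetical hidden size
full = d * d                              # parameters in the frozen weight matrix
lora = lora_trainable_params(d, d, rank=8)
ratio = lora / full
print(f"full: {full:,}  lora: {lora:,}  ratio: {ratio:.4%}")
```

For this one matrix the adapter trains 65,536 parameters against 16,777,216 frozen ones, about 0.39%. QLoRA keeps the same adapter arithmetic but stores the frozen base weights in 4-bit form.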

Modules: 00_ai_foundation/06_adaptation_compression/, 00_ai_foundation/05_llm_training_pipeline/.

8. AI system design

Now a dedicated round at most senior loops. Reported questions:

"Design ChatGPT" / "Scale a chat feature to 1M daily users."
"Document Q&A with hallucination prevention at 10M+ documents."
"Hospital voice assistant — noise, latency, privacy."
"AI-powered legal assistant."
"Image generation pipeline."

What you must be able to do:

Read the prompt for constraints: scale, latency budget, privacy class, modality.
Decompose: ingestion, retrieval, generation, eval, safety, observability, cost.
Defend choices with numbers (latency budget per stage, tokens per request, $ per query).
Show fallback paths: model down, retrieval empty, tool failure, rate limit hit.
Acknowledge non-AI infra: rate limiters, auth, logging, multi-tenant isolation.

Modules: ../06_system_designing/ (sibling track) — this is its own discipline.

9. Production MLOps & infrastructure

What you must be able to do:

Deployment patterns: blue-green, canary, shadow traffic.
Drift monitoring: input distribution, output distribution, judge score drift.
Traffic spike handling without overloading the model provider (queueing, rate limit budgets, graceful degradation).
Distributed inference: tensor parallelism vs pipeline parallelism (high level).
Self-hosted serving engines: vLLM, TGI, TensorRT-LLM, SGLang — pick one and know it.
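
A rate limit budget in front of the model provider is often implemented as a token bucket. The sketch below is one minimal version, with the class name and the injectable `clock` parameter chosen for this example so that the behaviour can be tested without sleeping.

```python
import time

class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)     # start full
        self.clock = clock
        self.last = clock()

    def try_acquire(self, cost=1.0):
        now = self.clock()
        # Refill for the elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False                      # caller should queue, degrade, or reject

fake_now = [0.0]                          # controllable clock for the demo
bucket = TokenBucket(rate_per_sec=2, capacity=2, clock=lambda: fake_now[0])
burst = [bucket.try_acquire() for _ in range(3)]   # third call exceeds the burst
fake_now[0] = 0.5                                  # half a second later: one token back
after_wait = bucket.try_acquire()
print(burst, after_wait)
```

What happens on `False` is the graceful-degradation decision: enqueue the request, serve a cached or smaller-model answer, or return an explicit retry-later response.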

Modules: 02_ai_infrastructure/04_ml_platform_operations/, 02_ai_infrastructure/02_inference_serving_systems/, 01_ai_engineering/04_resilient_agent_systems/, 02_ai_infrastructure/06_ai_runbooks_oncall/.

10. Safety & guardrails

Universal at safety-conscious companies (Anthropic, OpenAI, regulated industries).

What you must be able to do:

Prompt injection: direct, indirect (poisoned docs), jailbreaks. Mitigations: input filtering, output filtering, sandboxing, constrained decoding.
PII detection and redaction.
Output moderation: classifiers, regex, LLM-judge.
Red-teaming: how would you attack your own system.
Constitutional AI / RLHF-style refusal training (at a system level).
Audit logging and review workflows.
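
As a small illustration of PII redaction, the sketch below masks email addresses and US-style phone numbers with regular expressions. The patterns are deliberately simple assumptions. A production system would usually combine locale-aware rules with a trained entity recognizer.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Simplified pattern: three digits, three digits, four digits with -, ., or space.
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def redact_pii(text):
    """Replace matched emails and phone numbers with placeholder tags."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text

redacted = redact_pii("Reach me at jane.doe@example.com or 555-123-4567.")
print(redacted)
```

Redaction of this kind belongs on both sides of the model: before the prompt is logged or sent to a provider, and again on the output before it is stored.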

Modules: 03_ai_security_safety/00_safety_guardrail_design/, 03_ai_security_safety/01_prompt_injection_security/, 03_ai_security_safety/03_data_access_governance/.

11. Python & async engineering

Most coding rounds test engineering fundamentals, not ML.

What you must be able to do:

asyncio patterns: gather, wait_for, semaphore-based concurrency.
Retries with backoff and jitter.
Timeouts at every IO boundary.
The GIL: what it does, what it doesn't, when threading helps.
FastAPI: dependency injection, streaming responses, background tasks.
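
The asyncio items above combine naturally into one pattern: bounded concurrency with a semaphore, a timeout on every attempt, and retries with exponential backoff and jitter. The sketch below uses only the standard library. `make_flaky_job` is a hypothetical stand-in for a real network call.

```python
import asyncio
import random

async def call_with_retries(fn, *, attempts=3, timeout_s=1.0, base_delay_s=0.05):
    """Await fn() with a per-attempt timeout; back off with full jitter between tries."""
    for attempt in range(attempts):
        try:
            return await asyncio.wait_for(fn(), timeout=timeout_s)
        except (asyncio.TimeoutError, ConnectionError):
            if attempt == attempts - 1:
                raise                     # out of attempts: surface the error
            # Full jitter: sleep a random amount up to the exponential cap.
            await asyncio.sleep(random.uniform(0, base_delay_s * 2 ** attempt))

async def fetch_all(jobs, max_concurrency=5):
    """Run all jobs concurrently, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(job):
        async with semaphore:
            return await call_with_retries(job)

    # gather returns results in the same order as the input jobs.
    return await asyncio.gather(*(guarded(job) for job in jobs))

def make_flaky_job(value, failures):
    """Hypothetical job that raises ConnectionError `failures` times, then succeeds."""
    state = {"left": failures}

    async def job():
        if state["left"] > 0:
            state["left"] -= 1
            raise ConnectionError("transient failure")
        return value

    return job

jobs = [make_flaky_job(i, failures=i % 2) for i in range(4)]
results = asyncio.run(fetch_all(jobs, max_concurrency=2))
print(results)
```

Only errors that are plausibly transient are retried here. Retrying on every exception tends to hide bugs and multiply load during an outage.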

Modules: 02_ai_infrastructure/00_ai_backend_api_engineering/.

12. Observability & tracing

Almost always probed when monitoring comes up.

What you must be able to do:

Span model for LLM apps: each call, each tool, each retrieval is a span.
OpenTelemetry, LangSmith, Helicone, Phoenix — pick one.
What to trace: prompt, response, tokens, latency, cost, eval scores, user ID.
Use traces to debug an agent that "just looped forever."
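
The span model can be illustrated with a few lines of standard-library Python. This is not the OpenTelemetry API or any vendor SDK. The `span` helper, the `SPANS` list, and the attribute names are assumptions made for this sketch.

```python
import time
import uuid
from contextlib import contextmanager

SPANS = []   # stand-in for an exporter; a real setup would ship these to a backend

@contextmanager
def span(name, parent_id=None, **attributes):
    """Record one unit of work with its timing, status, and attributes."""
    record = {"id": uuid.uuid4().hex[:8], "parent_id": parent_id,
              "name": name, "attributes": dict(attributes), "status": "ok"}
    start = time.perf_counter()
    try:
        yield record
    except Exception as exc:
        record["status"] = f"error: {type(exc).__name__}"
        raise
    finally:
        record["latency_ms"] = (time.perf_counter() - start) * 1000
        SPANS.append(record)              # spans are appended when they finish

with span("handle_request", user_id="u-123") as root:
    with span("retrieve", parent_id=root["id"], top_k=5) as s:
        s["attributes"]["num_docs"] = 5
    with span("llm_call", parent_id=root["id"], model="hypothetical-model") as s:
        s["attributes"].update(prompt_tokens=1200, completion_tokens=180)

for record in SPANS:
    print(record["name"], record["parent_id"], round(record["latency_ms"], 3))
```

With parent IDs in place, an agent that "just looped forever" shows up as one root span with an unusually long chain of repeated child spans.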

Modules: 01_ai_engineering/03_agent_observability_debugging/, 04_ai_product_evals/02_telemetry_feedback_loops/.

Tier 3: Role-specific

Module	When it matters
`01_ai_engineering/15_reasoning_routing_verification/`	Roles using o-series / R1-style reasoning
`05_ai_specializations/01_multimodal_vision_systems/`	Multimodal product roles
`05_ai_specializations/02_diffusion_media_generation/`	Image generation roles
`05_ai_specializations/00_realtime_voice_agents/`	Voice agent / realtime roles
`01_ai_engineering/11_long_term_memory_state/` (deep)	Agent-heavy roles
`01_ai_engineering/10_knowledge_graph_retrieval/`	Enterprise knowledge platforms
`01_ai_engineering/17_schema_driven_generation/`	Code assistant / structured extraction roles
`01_ai_engineering/06_evidence_data_pipelines/`	Roles with heavy upstream data work

Tier 4: Deprioritize for interview cramming

These matter for the job but are not heavily probed in interviews:

00_ai_foundation/00_ml_prerequisites_refresher/ — down to ~20-30% of interview time.
00_ai_foundation/01_neural_network_primitives/ — conceptual baseline only.
00_ai_foundation/04_autoregressive_generation/ — understand but rarely live-coded outside research loops.
03_ai_security_safety/02_ai_ethics_risk_fairness/ — important post-hire, lightly probed in screens.
01_ai_engineering/20_engineering_leadership_judgment/ — meta; apply through capstone, not standalone study.

Coding round patterns

ML / AI coding (research-leaning loops):

Implement multi-head attention from scratch (PyTorch tensor ops only, no prebuilt attention module).
Implement LoRA adapter and merge.
Implement beam search or top-p sampling.
Implement autoregressive generation loop.
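
As a warm-up for the sampling drill, here is a dependency-free sketch of top-p (nucleus) sampling over an already-normalized probability list. In an interview this would normally operate on logits from a model. The plain-list interface is an assumption to keep the example self-contained.

```python
import random

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of highest-probability tokens whose mass reaches p.

    Returns {token_index: renormalized_probability}.
    """
    ranked = sorted(enumerate(probs), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for index, prob in ranked:
        kept.append((index, prob))
        cumulative += prob
        if cumulative >= p:               # nucleus is complete
            break
    total = sum(prob for _, prob in kept)
    return {index: prob / total for index, prob in kept}

def sample_top_p(probs, p=0.9, rng=random):
    """Draw one token index from the renormalized nucleus."""
    nucleus = top_p_filter(probs, p)
    indices = list(nucleus)
    return rng.choices(indices, weights=[nucleus[i] for i in indices], k=1)[0]

nucleus = top_p_filter([0.5, 0.3, 0.15, 0.05], p=0.75)
token = sample_top_p([0.5, 0.3, 0.15, 0.05], p=0.75, rng=random.Random(0))
print(nucleus, token)
```

With `p=0.75` only the two most likely tokens survive, and their probabilities are rescaled to sum to 1 before sampling.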

Classic algorithm (most loops):

LRU cache (very common).
Trie + DFS.
Binary tree serialization.
Union find.
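
Since the LRU cache comes up so often, a compact reference version is worth having in hand. The sketch below uses `collections.OrderedDict`. Some interviewers instead expect a hand-built hash map plus doubly linked list, so treat this as one acceptable shape rather than the only answer.

```python
from collections import OrderedDict

class LRUCache:
    """Fixed-capacity cache that evicts the least recently used key."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()       # oldest entry first, newest last

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)      # a read counts as a use
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)   # evict the least recently used

cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")            # "a" is now the most recently used
cache.put("c", 3)         # evicts "b"
print(cache.get("a"), cache.get("b"), cache.get("c"))
```

Both `get` and `put` run in O(1) average time, which is the property interviewers usually ask candidates to justify.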

Company-specific patterns:

OpenAI: KV stores, versioned databases, credit/quota management.
Anthropic: 4-level progressive build — start with SET/GET, end at timestamped TTL with snapshots.

Practical / take-home patterns:

JSON extraction + LLM summarization.
Web crawler with rate limiting.
Medical/legal document NLP.

Take-home assignments (real, from 2026 reports)

Assignment	What it tests
Customer support RAG chatbot (100+ concurrent, <2s latency)	RAG + production + latency
Document Q&A with citations	Faithfulness, retrieval, structured output
Blood test PDF analysis with online retrieval	Multi-step, tool use, structured output
Multi-agent content generation (5+ agents)	Agent orchestration
Marksheet extraction API with confidence scoring	OCR/structured extraction + calibration
Real-time call transcription with insights	Streaming + voice

Red flags to push back on:

72-hour "Round 1" demands.
Unpaid scope that resembles paid consulting work.

Capstone narrative (the deep-dive round)

The 25–45 minute "walk me through an end-to-end project you owned" round is universal. Have one project rehearsed cold.

Structure that works:

Problem — one sentence. Who hurts, how much.
Constraints — latency, cost, privacy, accuracy bar.
Architecture — pipeline diagram in your head, three layers max.
Why this, not that — name two alternatives you rejected, and the reason.
Metrics — concrete numbers. "P95 dropped from 4.2s to 1.1s. Cost per query went from $0.018 to $0.004."
Eval framework — what's in the golden set, how often it runs, what failed.
What broke — one production incident, root cause, fix.
What you'd change — show taste.

Senior signal: numbers, alternatives rejected, real eval framework, a real incident.

Junior signal: "we used LangChain and it worked."

Behavioral patterns (2026-specific)

Frequent prompts:

"How do you stay updated with fast-changing AI tech?"
"Walk me through an ethical concern in an ML project."
"Tell me about a cost or latency reduction you drove."
"How do you handle ambiguous, evolving requirements?"
"Tell me about a model provider decision you defended."

Recommended prep sequence (cramming order)

Not the curriculum order — the interview order.

Foundations refresh: 00_ai_foundation/02 → 00_ai_foundation/03
The big four: 00_ai_foundation/07 + 01_ai_engineering/13 → 01_ai_engineering/08 → 01_ai_engineering/09 → 02_ai_infrastructure/03
Agents: 01_ai_engineering/00 → 01_ai_engineering/01 → 01_ai_engineering/02
Eval & production: 04_ai_product_evals/00 → 04_ai_product_evals/01 → 04_ai_product_evals/02 → 01_ai_engineering/03 → 03_ai_security_safety/00 → 02_ai_infrastructure/05
System design: parallel 06_system_designing/ track
Fine-tune layer: 00_ai_foundation/06 → 00_ai_foundation/05
Capstone story: 01_ai_engineering/21 — one project, 25-min cold narration with metrics
Coding warmup: 02_ai_infrastructure/00 + ML coding drills (MHA, LoRA, beam search)

Sources