AI System Design: Interview Questions
The 35-60 minute round where everything from rag-fundamentals, agents-design, cost-latency-optimization, safety-guardrails, evals-production, mlops-deployment, and observability-tracing gets pulled into one conversation. The senior tell is reducing scope before designing — "design ChatGPT" in 45 minutes is impossible; you must scope to "the chat conversation loop with retrieval and safety" and ignore training infrastructure. The other senior tell is naming trade-offs explicitly (latency vs cost, quality vs safety, frontier vs self-hosted) and gating each layer on a measurable eval.
This file deliberately covers fewer questions in more depth — each is a full whiteboard prompt.
The framework
Q: "How do you approach a 'design X' question for an AI system?"
Tags: mid · very-common · conceptual · source: IGotAnOffer / SystemDesignHandbook / MyEngineeringPath 2026 GenAI system design guides; standard interview opener
Answer outline:
- Use a six-step skeleton, time-budgeted across the 45-minute round:
    - Clarify (5-8 min): who's the user, what's the success metric, what's the scale (QPS, users, corpus size), what are the latency and cost SLOs, what's in scope and what's not. Reduce scope aggressively.
    - High-level architecture (5-8 min): data flow from user → ingestion → retrieval/tools → LLM → guardrails → response. One labeled box per component, arrows for the request path.
    - Component deep-dive (15-20 min): pick the 2-3 most interesting components (usually retrieval, LLM serving, evals) and design them concretely.
    - Data and context strategy (3-5 min): where does context come from, how is it stored, how is freshness maintained, how is PII handled.
    - Trade-offs (5-8 min): name 3-4 explicit decisions, such as frontier vs fine-tune, RAG vs long-context, sync vs async, self-host vs API. State which way you chose and why.
    - Operate and evolve (3-5 min): observability, evals, rollout, monitoring, incident response, cost guardrails.
- The senior signal: the candidate budgets time, says "I'm going to spend 5 minutes here, then move", names a specific metric for every decision (recall@10, p95 TTFT, $/call, refusal rate), and treats the eval loop as a first-class component, not an afterthought.
- The fastest way to fail: diving into "I'll use vLLM and Pinecone and LangChain" without scoping the problem.
- Numbers to drop: "interview duration 35-60 min", "spend 5-8 min on clarification, 15-20 min on component deep-dive", "name 3-4 specific trade-offs at the end"
Common follow-ups: - "What's the most common mistake candidates make?" - "How do you handle the interviewer interrupting?"
Traps: - Designing top-down without asking about scale. Different orders of magnitude need different designs. - Skipping evals. The capstone signal is "how do I know this is working in prod".
Related cross-cutting: Architecture choices, Production patterns
Related module: all of learning/01_ai_engineering/
Conversational / chatbot designs
Q: "Design an AI-powered customer support chatbot."
Tags: senior · very-common · design · source: Amit Shekhar AI engineering questions repo (GitHub, 2026); FinalRound AI; standard senior LLM design round
Answer outline:
- Clarify first. Internal-agent assist or external user-facing? Domains (one product line vs the whole company)? Multilingual? Channels (web, voice, email)? Volume (1k vs 1M req/day)? SLOs (p95 TTFT, refusal rate, escalation rate)?
- Assume external user-facing, English, web + email, 100k req/day, 1M+ historical tickets as the knowledge base.
- Architecture (request path):
    - Ingress + auth: tenant identification, rate limit, PII detector at boundary.
    - Intent router: small classifier (1-3B model or fine-tuned BERT) labels intent (FAQ, account issue, billing, complex). FAQ → fast path; complex → full RAG path.
    - RAG layer: hybrid retrieval (BM25 + dense) over the knowledge base, top-50 → cross-encoder rerank → top-5 chunks. See retrieval-and-ranking.md.
    - LLM generation: Haiku/gpt-4o-mini for fast path, Sonnet/gpt-4o for complex. Cite-or-refuse instruction; structured output.
    - Output guardrails: PII leak check, hallucination grounding check (claims verified against citations), policy/toxicity filter.
    - Streaming: TTFT-optimized streaming to the user; TTS for voice channel.
    - Escalation gate: if confidence low, intent is "account dispute", or guardrails flag → handoff to human with transcript.
- Data plane:
    - Ingestion pipeline: pulls support docs, FAQs, ticket archives. Chunking (~500-1000 tokens with overlap), embedding, vector store update. Incremental re-embed on policy/doc change.
    - Conversation memory: per-session short-term (recent turns), per-user long-term (summarized history) for returning customers. PII-redacted, encrypted at rest.
- Eval and operate:
    - Golden eval set: 500-1000 (question, gold answer, source-doc-id) tuples. Weekly LLM-judge on sampled production traffic. Per-intent slicing.
    - CSAT, deflection rate, escalation rate as business metrics. Refusal rate < 5%, faithfulness > 0.9.
    - Canary rollout for prompt/model changes; rollback in <2 min.
- Trade-offs: router complexity vs always-using-frontier (router saves 60-70% cost), conversation memory depth vs PII risk (short-summary not full transcript), strict citation vs natural responses (strict wins for trust).
- Numbers to drop: "100k req/day at $0.005/call = $500/day frontier-only; with router + cache 70-85% cut", "p95 TTFT target: <500ms for chat", "escalation rate target: <10%; deflection rate target: 40-70%"
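The escalation gate in the request path can be sketched as a pure function. This is a minimal sketch rather than a production policy: the `TurnSignals` fields, the `0.6` confidence floor, and the intent label strings are assumptions made for illustration. Only the three triggers (low confidence, the "account dispute" intent, a guardrail flag) come from the outline.

```python
from dataclasses import dataclass

# Hypothetical values: tune the floor against your own eval set.
CONFIDENCE_FLOOR = 0.6
ALWAYS_ESCALATE_INTENTS = {"account_dispute"}


@dataclass
class TurnSignals:
    intent: str              # label produced by the intent router
    confidence: float        # answer confidence in [0, 1]
    guardrail_flagged: bool  # True if any output guardrail fired


def should_escalate(signals: TurnSignals) -> tuple[bool, str]:
    """Return (escalate, reason) so the handoff can log why it happened."""
    if signals.guardrail_flagged:
        return True, "guardrail_flag"
    if signals.intent in ALWAYS_ESCALATE_INTENTS:
        return True, "sensitive_intent"
    if signals.confidence < CONFIDENCE_FLOOR:
        return True, "low_confidence"
    return False, "answer_directly"
```

Returning the reason alongside the decision makes it cheap to slice escalation rate by cause, which is what the per-intent eval needs.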
Common follow-ups: - "How do you handle a customer asking for a refund?" (irreversible action → confirmation + tool gate) - "What's your fallback if the LLM provider is down?" (multi-provider routing) - "How do you measure deflection vs assistance?"
Traps: - Skipping the escalation path. Production chatbots must hand off gracefully. - Forgetting tenant scoping for multi-tenant cases.
Related cross-cutting: Architecture choices, Cost & latency, Production patterns
Related module: learning/01_ai_engineering/01_agentic_system_design/, learning/01_ai_engineering/08_rag_system_design/, learning/03_ai_security_safety/00_safety_guardrail_design/
Q: "Design a conversational AI system with memory across sessions."
Tags: senior · common · design · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)
Answer outline:
- Clarify: how persistent (days vs years)? What's stored (facts about user, preferences, prior tasks, transcripts)? Privacy (per-user export/delete, GDPR/DPDPA)?
- Three memory layers:
    - Short-term (in-session): the last N turns in raw form, in the conversation context. ~10-20 turns typical before summarization kicks in.
    - Episodic (per-session summary): each session ends with an LLM-generated summary capturing key facts, decisions, unresolved items. Stored per-user.
    - Semantic / long-term: extracted facts about the user, such as preferences, past purchases, recurring needs. Stored as key-value or in a small per-user knowledge graph. Updated when new facts conflict with old.
- Retrieval: at session start, pull (a) the most recent episodic summary, (b) semantic facts relevant to the current query (via embedding similarity over the user's fact store). Inject as system-prompt context.
- Update: at session end (or periodically mid-session), run an extractor LLM call: "given this transcript, what new facts about the user emerged?" Append/update the fact store.
- Privacy:
    - User-scoped storage with encryption at rest.
    - PII redaction before any cross-user model training.
    - Explicit user controls: export, delete, opt-out of memory.
    - Retention policy by data class (30 days for transcripts, indefinite for explicit user preferences with opt-out).
- Architecture: user-scoped DB (Postgres + pgvector or a per-tenant Redis), encrypted; an extractor worker on the async path; an injector layer in the prompt assembly stage.
- Eval: continuity tests (does the model recall user X's preference from last session?), false-memory tests (does it confidently recall facts that aren't in memory?), privacy tests (does it leak one user's data to another?).
- Trade-off: more memory = more personalization but more PII / drift / cost. Sane defaults: episodic + small fact set; full transcript long-term only with explicit opt-in.
- Numbers to drop: "session-end summarization: 1 LLM call, 200-500 tokens output", "fact store size cap: 50-200 facts per user", "retrieval latency: <100ms to inject memory into prompt"
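The "new facts update old" rule and the per-user fact cap can be sketched with a small in-memory store. This is an illustrative sketch only: `FactStore`, its `session_no` ordering, and the evict-oldest policy are assumptions, and a real system would persist to the user-scoped database described in the outline.

```python
from dataclasses import dataclass, field


@dataclass
class FactStore:
    """Per-user semantic memory: one current value per key, newest wins."""
    max_facts: int = 200  # upper end of the 50-200 cap quoted above
    facts: dict = field(default_factory=dict)  # key -> (value, session_no)

    def upsert(self, key: str, value: str, session_no: int) -> None:
        current = self.facts.get(key)
        # Conflict resolution: a newer session overwrites an older value.
        if current is None or session_no >= current[1]:
            self.facts[key] = (value, session_no)
        # Enforce the cap by evicting the least recently updated fact.
        while len(self.facts) > self.max_facts:
            oldest = min(self.facts, key=lambda k: self.facts[k][1])
            del self.facts[oldest]

    def forget_user(self) -> None:
        """Delete everything, for the user-initiated delete control."""
        self.facts.clear()
```

Keying facts and keeping a single current value per key is one simple way to avoid contradictory memories; a knowledge-graph store would need its own conflict rule.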
Common follow-ups: - "How do you handle conflicting facts across sessions?" - "What if the user wants to delete everything?" - "How do you avoid hallucinated memories?"
Traps: - Storing full transcripts indefinitely. PII liability and drift. - No conflict resolution. New facts must update old.
Related cross-cutting: Architecture choices, Production patterns
Related module: learning/01_ai_engineering/11_long_term_memory_state/, learning/03_ai_security_safety/00_safety_guardrail_design/
RAG-heavy designs
Q: "Design a document Q&A system for enterprise use (10M+ documents)."
Tags: senior · very-common · design · source: Amit Shekhar AI engineering questions repo (GitHub, 2026); standard senior RAG design round
Answer outline:
- Clarify: doc types (PDF, Office, web pages, scanned)? Update cadence (daily, weekly, real-time)? Multi-tenant? Multilingual? Citation required? Compliance (HIPAA, FedRAMP)?
- Assume mixed PDF + Office + wiki, weekly updates, multi-tenant (separate corpora per customer), citation required, no PHI.
- Ingestion pipeline:
    - Parse: format-aware parsing (LlamaParse, Unstructured.io, Marker) preserves tables, headings, figures. Each doc becomes structured JSON.
    - Chunk: structure-aware (by section/heading), 500-1000 tokens with overlap. Special handling for tables (preserve as markdown + summary).
    - Embed: text-embedding-3-large (3072 dims natively; the sizing below assumes it is shortened to 1536) or self-hosted BGE-large. Asymmetric model (separate query/doc encoding).
    - Index: vector DB (Qdrant / Milvus / Vespa) with HNSW + metadata index (tenant ID, doc ID, section, date). Sharded by tenant or by hash.
    - Parallel BM25 index on the same chunks (Elasticsearch / Tantivy).
- Query path:
    - Tenant-scoped retrieval (filter by tenant ID).
    - Hybrid retrieval (BM25 + dense, RRF fusion) → top-50.
    - Cross-encoder rerank → top-5.
    - LLM generation with citation-required output schema.
    - Grounding verifier: every claim must cite a chunk ID present in the retrieved context.
- Scale numbers: 10M docs × ~10 chunks/doc = 100M chunks. At 1536-dim FP32 (4 bytes per dimension), raw vectors = 100M × 1536 × 4 bytes ≈ 614 GB, so quote ~600 GB. Use IVF-PQ for 5-10× compression, or HNSW sharded across nodes.
- Update path: detect changed/new docs, re-chunk + re-embed only those; vector DB supports upsert. Soft-delete old chunks.
- Per-tenant isolation: hard partitioning, separate encryption keys per tenant for at-rest data. Trace logs scoped per tenant.
- Eval: per-tenant golden sets (50-200 queries each), recall@10 ≥ 0.9, faithfulness ≥ 0.9. Production sampling weekly.
- Trade-offs: storage cost (full HNSW vs IVF-PQ), update freshness (incremental vs nightly batch), citation strictness vs UX.
- Numbers to drop: "100M chunks × 1536d FP32 = ~600 GB raw; IVF-PQ compresses to ~60-120 GB", "hybrid retrieval p95: 100-200ms", "rerank: 100-200ms; total p95 TTFT: <2s"
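The RRF fusion step in the query path can be sketched in a few lines. This is a minimal sketch: `rrf_fuse` and its parameters are illustrative names, and `k = 60` is the commonly used constant rather than a value stated in this document. Each input list is assumed to hold chunk IDs ordered best first.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60, top_n: int = 50) -> list[str]:
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for stable output.
    ordered = sorted(scores, key=lambda c: (-scores[c], c))
    return ordered[:top_n]


bm25_hits = ["c1", "c2", "c3"]
dense_hits = ["c3", "c1", "c4"]
print(rrf_fuse([bm25_hits, dense_hits]))  # ['c1', 'c3', 'c2', 'c4']
```

RRF only needs ranks, not scores, which is why it is a convenient way to merge BM25 and dense results whose raw scores are not comparable.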
Common follow-ups: - "How do you handle a tenant uploading malicious documents?" - "What if a doc is updated mid-query?" - "Walk me through the re-embed pipeline when the embedding model upgrades."
Traps: - Skipping per-tenant filtering. Cross-tenant leakage is a fireable offense. - Naive chunking on PDFs. Tables and figures need special handling.
Related cross-cutting: Retrieval, Architecture choices, Production patterns
Related module: learning/01_ai_engineering/07_search_relevance_ranking/, learning/02_ai_infrastructure/03_vector_retrieval_infrastructure/, learning/01_ai_engineering/08_rag_system_design/
Q: "Design an AI-powered search engine for an e-commerce platform."
Tags: senior · very-common · design · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)
Answer outline:
- Clarify: product catalog size (10k vs 100M)? Query types (keyword, natural language, image, voice)? Personalization (logged-in vs anonymous)? Latency SLO (search is typically <300ms p95)?
- Assume 10M products, text + image queries, mixed logged-in/anonymous, p95 end-to-end latency <300ms.
- Hybrid architecture:
    - Sparse leg: BM25 over title + description + structured attributes (brand, category, tags). Exact match for SKUs, brand names.
    - Dense leg: product-text embeddings (Cohere / BGE / e-commerce-tuned). Captures semantic intent ("comfortable running shoes for flat feet").
    - Image embedding leg (for image search): CLIP-style joint text-image embedding. Same vector space; query with an image, retrieve products.
    - Two-tower model (for personalized ranking): user-tower (user features, history) + item-tower (product features). Precomputed item embeddings; user embedding computed at query time; ANN search.
- Query path: hybrid retrieve top-200 → personalized rerank with two-tower → cross-encoder rerank (or LLM-as-reranker for high-stakes queries) → top-20 to UI.
- Special handling:
    - Query understanding: small LLM extracts attributes ("under $50", "color: red") and converts to structured filters. Reduces retrieval space.
    - Out-of-vocabulary brands: the BM25 leg handles these. Embedding-only retrieval would miss them.
    - Cold-start products: until embeddings + behavior data accumulate, use a popularity prior.
    - Personalization decay: weight recent behavior more heavily.
- Catalog updates: products added/updated/discontinued continuously. Incremental indexing pipeline; vector DB upsert; cache invalidation.
- Eval: NDCG@10 on labeled (query, relevant-product) pairs; CTR / conversion rate / revenue as business metrics in A/B.
- Trade-offs: pure dense vs hybrid (hybrid wins for SKU/brand queries), reranker cost vs quality, personalization vs cold-start fairness.
- Numbers to drop: "10M products × 1024d = ~40GB raw; HNSW + replicas for hot serving", "search p95 target: <300ms", "two-tower precompute: nightly batch on catalog; user-tower at query time"
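The NDCG@10 metric named in the eval bullet can be computed as follows. This is a sketch under stated assumptions: it uses linear gain with the standard log2 discount, and it derives the ideal ordering from the labels of the returned results only, which is a simplification when relevant products were never retrieved.

```python
import math


def dcg_at_k(relevances: list[float], k: int) -> float:
    """Discounted cumulative gain with the log2(rank + 1) discount."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances: list[float], k: int = 10) -> float:
    """relevances[i] is the graded label of the result shown at rank i + 1."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

A perfectly ordered list scores 1.0, and pushing a relevant product down the page lowers the score, which is the property the offline ranking eval relies on.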
Common follow-ups: - "How do you handle long-tail products?" - "What's the difference between this and a generic RAG system?" - "How would you A/B a new ranking model?"
Traps: - Pure LLM for ranking. Too slow; specialized rankers win. - Skipping query understanding. Filters do half the work.
Related cross-cutting: Retrieval, Cost & latency, Architecture choices
Related module: learning/01_ai_engineering/07_search_relevance_ranking/, learning/05_ai_specializations/01_multimodal_vision_systems/
Q: "Design a multi-modal search system (text, image, video)."
Tags: senior · common · design · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)
Answer outline:
- Clarify: search across all modalities together, or mode-specific? Index size? Latency tolerance?
- Joint-embedding approach: CLIP-style models (CLIP, SigLIP, EVA-CLIP) embed text and images into a shared vector space. Query in either modality retrieves both.
- Video: index keyframes (1 per N seconds) + transcripts (Whisper-style ASR). Each keyframe gets a CLIP embedding; each transcript chunk a text embedding. Joint or separate index depending on use case.
- Query path: detect modality of query (text vs image upload vs voice → ASR → text), embed in the shared space, hybrid retrieve (sparse for text keywords + dense joint). Rerank with a cross-modal reranker if available.
- For long videos: hierarchical search. Find candidate videos, then find the right segment within. Two-stage avoids exploding the per-segment index.
- Indexing pipeline: video → keyframe extraction → embedding; audio → ASR → text embedding. Process incrementally; backfill historical content.
- Eval: cross-modal retrieval benchmarks (a text query retrieving the right image), end-to-end relevance metrics.
- Trade-offs: joint embedding (one space, simpler) vs separate-and-fuse (specialized, better per-modality quality). Most 2026 systems start joint, specialize later.
- Numbers to drop: "CLIP-style embeddings: 512-1024d typical", "video keyframe extraction: 1 per 2-5 seconds", "ASR (Whisper-large): real-time on GPU"
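The two-stage search for long videos can be sketched over toy vectors. This is an illustrative sketch only: `hierarchical_search`, the dictionary-shaped indexes, and the brute-force cosine scan are assumptions standing in for a real ANN index, and the embeddings are assumed to already live in the shared text-image space.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hierarchical_search(query, video_index, segment_index, top_videos=2):
    """Stage 1 ranks whole videos; stage 2 scans segments of the survivors only.

    video_index:   {video_id: embedding}
    segment_index: {video_id: [(start_sec, embedding), ...]}
    """
    candidates = sorted(video_index,
                        key=lambda v: cosine(query, video_index[v]),
                        reverse=True)[:top_videos]
    best = max(
        ((vid, start, cosine(query, emb))
         for vid in candidates
         for start, emb in segment_index[vid]),
        key=lambda hit: hit[2],
    )
    return best[0], best[1]  # (video_id, segment start in seconds)
```

The `top_videos` cut is the recall risk in this design: a video whose overall embedding is a weak match never has its segments examined, so the first stage should be generous.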
Common follow-ups: - "What if text queries dominate and image queries are rare?" - "How does CLIP compare to specialized image embedders?"
Traps: - Indexing every frame of a video. Massive over-indexing; choose keyframes.
Related cross-cutting: Retrieval, Architecture choices
Related module: learning/05_ai_specializations/01_multimodal_vision_systems/, learning/01_ai_engineering/07_search_relevance_ranking/
Voice / real-time
Q: "Design an AI voice assistant architecture."
Tags: senior · common · design · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)
Answer outline:
- Clarify: full duplex (interruption support) or half duplex (push-to-talk)? Latency SLO? Language coverage? On-device or cloud? Personalization?
- The dominant constraint: end-to-end latency budget of 700-1000ms from end-of-user-speech to start-of-agent-speech (above ~1.2s feels broken).
- Pipeline (with budget allocation):
    - VAD / endpointing (150ms): voice activity detection identifies utterance boundaries.
    - ASR (250ms): streaming Whisper or a specialized real-time ASR (Deepgram, AssemblyAI). Partial results flow as user speaks.
    - LLM (300-400ms): small fast model (Haiku 4.5, gpt-4o-mini), prompt-cached system prompt, streaming output. Often with speculative decoding.
    - TTS (150-200ms): streaming TTS (ElevenLabs, OpenAI TTS, Cartesia) that starts speaking on the first chunk of LLM output.
- The trick is overlap, not sequential addition. Start ASR while VAD is still finishing; start LLM on partial ASR; start TTS on first LLM token.
- Tool calls: an LLM tool call can blow the latency budget. Use backchannel speech ("let me check that") while the tool runs. Cache common tool results.
- Interruption: if the user starts speaking while the agent is, immediately stop TTS, drop in-progress LLM output, restart from new input.
- Memory: per-call conversation memory in-process; persistent across sessions in a user store.
- Quality: a side eval pipeline runs ASR + LLM grading on sampled calls. Track word-error-rate, intent-recognition rate, task-completion rate.
- Trade-offs: model size (smaller = faster but lower quality), on-device (privacy + zero network latency, but limited capability) vs cloud (capability, but RTT cost), full-duplex (UX win, infra complexity) vs half-duplex (simple, less natural).
- Numbers to drop: "E2E budget: 700-1000ms", "ASR p95: 200-300ms streaming", "LLM TTFT: <400ms for voice", "TTS first-chunk: 150-200ms"
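Adding up the stage budgets shows why overlap matters. The sketch below only does arithmetic on the numbers quoted in the outline; the dictionary keys and function names are illustrative, and network round trips and tool calls are deliberately left out.

```python
# Stage budgets in milliseconds as (low, high), taken from the outline above.
STAGE_BUDGET_MS = {
    "vad_endpointing": (150, 150),
    "asr": (250, 250),
    "llm": (300, 400),
    "tts_first_chunk": (150, 200),
}


def sequential_total(budget: dict) -> tuple[int, int]:
    """Worst case: every stage waits for the previous one to finish."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


def overlap_needed(budget: dict, target_ms: int) -> int:
    """Milliseconds that must be hidden by overlap to hit target_ms."""
    return max(0, sequential_total(budget)[1] - target_ms)


print(sequential_total(STAGE_BUDGET_MS))     # (850, 1000)
print(overlap_needed(STAGE_BUDGET_MS, 700))  # 300
```

Run strictly in sequence, the stages already consume 850-1000ms, the top of the 700-1000ms budget, before any network hop. Hitting the low end of the budget means hiding roughly 300ms through overlapping ASR, LLM, and TTS.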
Common follow-ups: - "What's the failure mode if the tool call is slow?" - "How do you handle background noise?" - "How does this differ from a text chatbot?"
Traps: - Sequential pipeline. The win is overlap. - Skipping the interruption design. Production voice agents must handle barge-in.
Related cross-cutting: Cost & latency, Architecture choices
Related module: learning/05_ai_specializations/00_realtime_voice_agents/, learning/02_ai_infrastructure/05_agent_performance_economics/
Q: "Design a real-time AI transcription system for thousands of concurrent audio streams."
Tags: staff · common · design · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)
Answer outline:
- Clarify: live vs batch? Word-level timestamps required? Diarization (who-said-what)? Languages? Latency SLO?
- Assume live streaming, word timestamps, diarization, English+top-10 languages, <500ms partial-result latency.
- Per-stream pipeline:
    - WebSocket/RTP ingest: audio chunks (e.g., 100-500ms each) arrive continuously.
    - VAD + chunking: split into utterance-bounded chunks.
    - ASR worker: streaming Whisper-large (or a specialized streaming ASR) on GPU. Emits partial + final transcripts.
    - Diarization: pyannote-style speaker embedding per chunk; cluster into speakers within a session.
    - Post-processing: punctuation restoration, capitalization, number/entity normalization.
    - Stream out: emit partial + final tokens via WebSocket to the client; persist final transcript.
- Scale: at 1000 concurrent streams and ~2 sec average ASR latency per chunk on H100, you need ~5-15 GPUs depending on ASR model size and concurrency packing. That range implies each GPU sustains roughly 70-200 real-time streams, so validate packing density with a load test before quoting it.
- Batching: pack multiple streams onto one GPU (batched ASR inference). Streaming ASR engines (Whisper-streaming, Sonix-style) support concurrent streams natively.
- Sharding: each stream sticky to a worker (avoid context loss across workers). Worker pool autoscaled on concurrency.
- Storage: transcripts streamed to durable storage (S3/GCS) as they finalize. Searchable index built async.
- Eval: WER (word error rate) sampled across streams; diarization accuracy (DER) on labeled samples.
- Trade-offs: model size vs latency (Whisper-large-v3 better, slower; -medium faster, lower quality), partial-result aggressiveness (more frequent = better UX, more compute), language coverage (multi-lingual model vs language-specific).
- Numbers to drop: "Whisper-large-v3 streaming: ~1× real-time on H100 for a single stream; aggregate throughput rises with packed batches", "1000 concurrent streams: 5-15 H100s typical, assuming ~70-200 packed streams per GPU", "WER target: <10% for clean English, higher for accents/noise"
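The GPU count follows directly from how many real-time streams one GPU can sustain, which is the number to pin down in the interview. The helper below is a back-of-envelope sketch: `gpus_needed` and its `headroom` parameter are illustrative, and the per-GPU packing figures are whatever your load test measures, not values this sketch can supply.

```python
import math


def gpus_needed(concurrent_streams: int, streams_per_gpu: int,
                headroom: float = 0.0) -> int:
    """GPUs required when each GPU sustains streams_per_gpu real-time streams.

    headroom is spare capacity as a fraction (0.5 = 50% extra for spikes).
    """
    if streams_per_gpu <= 0:
        raise ValueError("streams_per_gpu must be positive")
    return math.ceil(concurrent_streams * (1 + headroom) / streams_per_gpu)


print(gpus_needed(1000, 200))  # 5
print(gpus_needed(1000, 67))   # 15
```

Read backwards, the 5-15 GPU range for 1000 streams corresponds to roughly 67-200 packed streams per GPU; if a measured engine packs far fewer, the fleet grows proportionally.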
Common follow-ups: - "What if a stream goes silent for 2 minutes?" - "How do you handle a worker crash mid-stream?" - "What's the cost per minute of audio?"
Traps: - One-worker-per-stream without packing. Wastes GPU. - No reconnection handling.
Related cross-cutting: Cost & latency, Architecture choices
Related module: learning/05_ai_specializations/00_realtime_voice_agents/, learning/02_ai_infrastructure/02_inference_serving_systems/
Code / structured data systems
Q: "Design a code generation and review system."
Tags: senior · very-common · design · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)
Answer outline:
- Clarify: in-IDE assist (Copilot-style) or PR review or both? Languages? Repo size? Privacy (does code leave the customer's network)?
- Assume PR review on a 10M-LOC monorepo, in-cloud, multiple languages, ~1000 PRs/day.
- Components:
    - Context retrieval: for each changed file, retrieve related files (imports, callers, callees). Code-specific embeddings (CodeBERT, voyage-code, OpenAI text-embedding-3-large on code). Hybrid with structural retrieval (AST-based, e.g. find all callers of a function).
    - Static analysis layer: linters, type checkers, security scanners (Semgrep, Snyk) run alongside. LLM doesn't need to find typos; static tools do that.
    - LLM review: GPT-4o / Claude Sonnet / Gemini 2.5-grade models. Prompt: "given the diff and context, identify bugs, security issues, design issues, missing tests". Structured output with severity + file/line.
    - Confidence calibration: only surface high-confidence findings. Penalize noisy comments heavily, because devs ignore tools that yell.
    - Code-execution sandbox (for test generation / verification): sandbox runs proposed tests in an ephemeral container to verify they pass.
- For in-IDE assist: lower latency, smaller model, FIM (fill-in-the-middle) format, repository-aware context.
- Eval: hand-labeled (diff, bug) dataset; precision@5 (of the top-5 surfaced issues, how many are real bugs?); recall on known issues; dev-feedback signal (thumbs, ignored-comment rate).
- Privacy: code is sensitive. Per-tenant isolation, zero retention with API providers, on-prem option for high-sec customers, audit log on every code-fetch.
- Trade-offs: frontier model quality vs cost per PR (~$0.50-$5 typical), comment volume vs noise (start strict and relax), context size (more context = better review, more cost).
- Numbers to drop: "1000 PRs/day × $2/review = $2k/day", "comment precision target: ≥80%", "FIM autocomplete TTFT: <200ms in-IDE"
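The confidence gate and the precision@5 metric can be sketched together. This is a minimal sketch: the `Finding` shape, the `0.8` cut-off, and matching findings to labeled bugs by `(file, line)` are assumptions for illustration, not a description of any particular review tool.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    file: str
    line: int
    severity: str      # e.g. "bug", "security", "design"
    confidence: float  # model-reported or calibrated score in [0, 1]


def surface(findings: list[Finding], min_confidence: float = 0.8,
            max_comments: int = 5) -> list[Finding]:
    """Keep only high-confidence findings, most confident first, capped."""
    kept = [f for f in findings if f.confidence >= min_confidence]
    kept.sort(key=lambda f: f.confidence, reverse=True)
    return kept[:max_comments]


def precision_at_k(surfaced: list[Finding], real_bugs: set[tuple[str, int]],
                   k: int = 5) -> float:
    """Fraction of the top-k surfaced findings that match a labeled bug."""
    top = surfaced[:k]
    if not top:
        return 0.0
    hits = sum((f.file, f.line) in real_bugs for f in top)
    return hits / len(top)
```

Starting with a high `min_confidence` and a small `max_comments` matches the "start strict and relax" advice: precision is measured first, and the gate is loosened only while it stays above target.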
Common follow-ups: - "How do you prevent the model from suggesting insecure code?" - "What if the PR is 5000 lines?" - "How do you tune for low false-positive rate?"
Traps: - Sending raw repos to a third party. Privacy fail. - High comment volume. Devs disable the tool.
Related cross-cutting: Architecture choices, Production patterns
Related module: learning/01_ai_engineering/17_schema_driven_generation/, learning/01_ai_engineering/12_model_vendor_strategy/
Q: "Design an AI-powered data extraction pipeline from unstructured documents."
Tags: senior · common · design · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)
Answer outline:
- Clarify: doc volume, format mix (PDF / scanned / Office), target schema (fixed or evolving), accuracy SLO, downstream consumer?
- Assume 100k docs/day, mostly PDFs with some scanned, fixed schema (invoices), 99% precision required, downstream is accounting.
- Pipeline:
    - Ingest + classify: classify doc type. Quickly skip non-invoices.
    - Parse: structure-aware PDF parsing (LlamaParse / Unstructured / Marker) for layout, plus OCR (Tesseract / AWS Textract / Azure Document Intelligence) for scanned.
    - LLM extract: structured-output prompt with the target schema. JSON schema enforced via function-calling / strict mode. Few-shot examples for unusual formats.
    - Validation: schema check, business rules (sum of line items = total, dates in valid range, vendor in allow-list), confidence scoring.
    - Human-in-the-loop queue: anything below confidence threshold or violating rules → human review. Reviewer's labels feed back to fine-tune the model and improve the few-shot examples.
    - Downstream emit: validated records to the accounting system.
- Eval: precision/recall per field on a labeled set. Continuous golden-set growth from reviewer feedback.
- Scale: 100k docs/day at ~2 sec/doc = 200k seconds compute/day ≈ 2.3 GPU-days, so ~3 GPUs at 24h utilization for LLM extraction (or commodity API at ~$0.02-0.10/doc).
- Reliability: idempotent extraction (same doc → same record), retry-with-backoff on transient failures, dead-letter queue for unrecoverable failures.
- Trade-offs: frontier model accuracy vs cost (frontier wins for 99% precision target; fine-tuned 7B can hit ~95-97%), HITL rate vs throughput (higher confidence threshold = more HITL but higher final accuracy).
- Numbers to drop: "100k docs/day at $0.02-0.10/doc API = $2k-10k/day", "fine-tuned 7B: ~95-97% per-field precision; frontier: 98-99%", "HITL queue rate: 5-20% typical to hit 99% end-to-end precision"
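The validation stage and the human-review routing can be sketched as below. This is a sketch under stated assumptions: the record layout, the allow-list contents, the earliest accepted date, and the `0.9` threshold are all hypothetical, and amounts are parsed as `Decimal` so that the line-item sum compares exactly.

```python
from datetime import date
from decimal import Decimal

# Hypothetical allow-list and date window, for illustration only.
VENDOR_ALLOW_LIST = {"Acme Corp", "Globex"}
EARLIEST_INVOICE_DATE = date(2020, 1, 1)


def validate_invoice(record: dict, today: date) -> list[str]:
    """Return the list of violated business rules; empty means it passes."""
    errors = []
    line_total = sum(Decimal(item["amount"]) for item in record["line_items"])
    if line_total != Decimal(record["total"]):
        errors.append("line_items_do_not_sum_to_total")
    if not (EARLIEST_INVOICE_DATE <= record["invoice_date"] <= today):
        errors.append("date_out_of_range")
    if record["vendor"] not in VENDOR_ALLOW_LIST:
        errors.append("vendor_not_allowed")
    return errors


def route(record: dict, confidence: float, today: date,
          threshold: float = 0.9) -> str:
    """Send to human review on any rule violation or low confidence."""
    if validate_invoice(record, today) or confidence < threshold:
        return "human_review"
    return "emit_downstream"
```

Raising `threshold` sends more documents to reviewers and raises end-to-end precision, which is the HITL rate vs throughput dial from the trade-offs.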
Common follow-ups: - "What if the invoice format is unusual?" - "How do you handle multi-page invoices?" - "Can you do this without an LLM?" (table-extraction approaches; LLM often wins on noisy data)
Traps: - No HITL. 99% precision without humans is rarely achievable. - Flat retry on all failures. Some failures need human triage, not retry.
Related cross-cutting: Architecture choices, Production patterns
Related module: learning/01_ai_engineering/06_evidence_data_pipelines/, learning/01_ai_engineering/17_schema_driven_generation/
Q: "Design an AI-powered document processing pipeline for financial institutions."
Tags: staff · occasional · design · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)
Answer outline:
- Same engine as generic document extraction, but with a regulated overlay.
- Additional constraints:
    - Data residency (region-locked storage and processing).
    - Audit trail (every extraction logged with input hash, model version, output, human-review decision).
    - Reproducibility (rerun an extraction months later and get the same answer, which means pinning model + prompt versions).
    - PII / NPI handling (SSN, account numbers, customer names: redact in logs, encrypt at rest, scoped access).
    - Regulatory reporting (SOX, FINRA, depending on jurisdiction).
- Architecture differs from generic in:
    - On-prem or VPC-only deployment, no public cloud LLM API unless contractually compliant (Bedrock / Azure OpenAI in regulated configurations).
    - Air-gapped human review tooling.
    - Tamper-evident audit logs (append-only, hash-chained).
    - Quarterly model + prompt review by compliance team; gated promotion.
- Don't claim full automation. Regulated extraction always has humans in the loop on edge cases. Design for that, not against it.
- Trade-offs: cloud-managed regulated services (Bedrock / Azure OpenAI in compliant tiers) vs full on-prem. Cloud is faster to ship; on-prem is the right answer for the strictest jurisdictions.
- Numbers to drop: "audit log retention: 7+ years for SOX, longer for some", "human-review rate: 10-30% typical for regulated financial extraction", "regulator-facing reproducibility: pin (model_id, prompt_version, training_data_hash) per record"
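A hash-chained, append-only audit log can be sketched with the standard library. This is a minimal in-memory sketch: `append_entry`, `verify_chain`, and the payload fields are illustrative names, and a real deployment would write to append-only storage and anchor the chain externally.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64


def append_entry(log: list[dict], payload: dict) -> dict:
    """Append a record whose hash covers the payload and the previous hash."""
    prev_hash = log[-1]["entry_hash"] if log else GENESIS_HASH
    body = json.dumps(payload, sort_keys=True)
    entry_hash = hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()
    entry = {"payload": payload, "prev_hash": prev_hash, "entry_hash": entry_hash}
    log.append(entry)
    return entry


def verify_chain(log: list[dict]) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS_HASH
    for entry in log:
        body = json.dumps(entry["payload"], sort_keys=True)
        expected = hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()
        if entry["prev_hash"] != prev_hash or entry["entry_hash"] != expected:
            return False
        prev_hash = entry["entry_hash"]
    return True
```

Because each hash depends on the one before it, altering an old extraction record invalidates every later entry, which is what makes tampering evident during an audit.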
Common follow-ups: - "What about model drift over years for audit reproducibility?" - "How do you handle a regulator subpoena for a specific decision?"
Traps: - Treating this like generic extraction. The regulated overlay matters more than the LLM choice.
Related cross-cutting: Production patterns, Architecture choices
Related module: learning/02_ai_infrastructure/04_ml_platform_operations/, learning/03_ai_security_safety/00_safety_guardrail_design/
Multi-agent / workflow systems
Q: "Design a multi-agent workflow system where agents collaborate on complex tasks."
Tags: senior · very-common · design · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)
Answer outline:
- Clarify: pre-defined workflow (orchestrated) or open-ended (planner-led)? Trust boundary between agents? Stateless agents or persistent memory?
- Two patterns:
    - Orchestrator + workers: a single orchestrator decides which worker agent runs next; workers are specialized (research, code, writing). Predictable, debuggable.
    - Planner-led: a planner agent decomposes the task into subtasks; spawns worker agents per subtask; aggregates. More flexible, harder to bound.
- Default to orchestrator + workers. Open-ended planner-led only when the task genuinely needs it (research, novel problems).
- Architecture (orchestrator pattern):
    - Workflow definition: a DAG (or state machine) describing the steps. Stored as code, versioned.
    - Orchestrator: a deterministic engine (Temporal, Airflow, custom) drives the DAG. Each node calls an agent.
    - Agent worker: each node is an agent with its own system prompt, tool set, and stopping condition. Each agent emits spans for trace visibility.
    - Communication: structured messages between agents (JSON, not free text). Avoid agents writing in prose to each other.
    - State store: shared work-in-progress visible to relevant agents; isolation otherwise.
    - Guardrails: per-agent input/output guardrails; cross-agent safety checks at hand-offs.
- Hard limits: max-steps per agent, max wall-clock per workflow, max cost budget. Trip → terminate with structured failure.
- Eval: per-agent eval suite, plus end-to-end workflow eval. Trace-driven debugging when workflows fail.
- Trade-offs: orchestrated (predictable, less flexible) vs planner-led (flexible, hard to bound), shared memory (powerful, leak risk) vs isolated (safe, slower hand-offs).
- Numbers to drop: "workflow max-steps: 20-100 typical; per-agent max-steps: 5-15", "structured-message hand-off vs prose: 3-5× fewer downstream failures"
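The hard limits and the structured failure they produce can be sketched as a small driver loop. This is an illustrative sketch, not a workflow engine: `Limits`, `run_workflow`, and the convention that each step callable returns its cost in USD are assumptions, and the wall-clock and cost caps shown are hypothetical values.

```python
import time
from dataclasses import dataclass


@dataclass
class Limits:
    max_steps: int = 50          # within the 20-100 range quoted above
    max_seconds: float = 300.0   # hypothetical wall-clock cap
    max_cost_usd: float = 5.0    # hypothetical cost budget


def run_workflow(steps: list, limits: Limits, clock=time.monotonic) -> dict:
    """Run step callables in order; each returns its cost in USD.

    Stops with a structured failure as soon as any limit trips.
    """
    started = clock()
    spent = 0.0
    for index, step in enumerate(steps):
        if index >= limits.max_steps:
            return {"status": "failed", "reason": "max_steps", "steps_run": index}
        if clock() - started > limits.max_seconds:
            return {"status": "failed", "reason": "max_seconds", "steps_run": index}
        spent += step()
        if spent > limits.max_cost_usd:
            return {"status": "failed", "reason": "max_cost", "steps_run": index + 1}
    return {"status": "ok", "reason": None, "steps_run": len(steps)}
```

The wall-clock check here only runs between steps, so a single hung agent call would still need its own timeout; a durable engine such as Temporal provides that per-activity.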
Common follow-ups: - "What if one agent infinite-loops?" - "How do agents share long-running context?" - "When is multi-agent worse than single agent?"
Traps: - Free-form prose communication between agents. Hard to debug, error-prone. - No global termination guard. One stuck agent stalls the workflow.
Related cross-cutting: Architecture choices
Related module: learning/01_ai_engineering/16_multi_agent_coordination/, learning/01_ai_engineering/01_agentic_system_design/, learning/01_ai_engineering/02_durable_agent_workflows/
Q: "Design an AI gateway/proxy for managing LLM access across an organization."
Tags: staff · common · design · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)
Answer outline:
- The gateway sits between every internal LLM call and the upstream providers. Single point of policy, observability, cost control.
- Core responsibilities:
    - Auth + tenant attribution: every call carries app ID + team ID; gateway validates and tags.
    - Rate limiting + budgets: per-team / per-app TPM/RPM limits; monthly $ budgets with alerts.
    - Provider abstraction: unified API across Anthropic / OpenAI / Google / Bedrock / self-hosted. Apps target one interface.
    - Routing: per-app provider preferences, smart fallback (if primary fails or rate-limited).
    - Prompt + response logging: redacted, retention-governed, queryable.
    - PII detection: at the boundary, before prompts leave the org.
    - Guardrails: shared safety classifiers, organization-wide policies.
    - Caching: shared prompt cache across the org.
    - Cost dashboard: per-team, per-app, per-day spend.
- Implementation: a stateless proxy service (FastAPI / Go) with a fast cache layer (Redis), routing logic, async logging to a data lake.
- Adoption: app teams target the gateway URL instead of provider URLs directly. CI checks ensure no direct provider calls in app code.
- Eval: gateway-side dashboards for app-level cost / latency / refusal-rate / hallucination-rate (offline LLM judge on sampled traces).
- Security: API keys held only in the gateway; no app team has direct provider credentials. Reduces blast radius of credential leaks.
- Trade-offs: gateway latency (10-30ms added) vs centralized control (worth it for any org with 5+ AI products), vendor independence (you can swap providers org-wide via gateway config).
- Numbers to drop: "gateway p95 overhead: 10-30ms", "centralized cache hit rate: 20-50% across org", "cost visibility per-team within 1-day lag"
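The fallback half of the routing responsibility can be sketched as an ordered retry across providers. This is a minimal sketch: `ProviderError`, `call_with_fallback`, and the `(name, callable)` pairs are hypothetical stand-ins for real provider clients, and no actual SDK calls are shown.

```python
class ProviderError(Exception):
    """Raised by a provider client on failure or rate limiting."""


def call_with_fallback(prompt: str, providers: list) -> tuple[str, str]:
    """Try providers in preference order; return (provider_name, response).

    providers is a list of (name, callable) pairs. The callables are
    hypothetical stand-ins for real provider clients.
    """
    errors = []
    for name, client in providers:
        try:
            return name, client(prompt)
        except ProviderError as exc:
            errors.append(f"{name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))
```

Returning the provider name with the response lets the gateway attribute cost and latency to whichever upstream actually served the call.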
Common follow-ups: - "What if a team needs a feature your gateway doesn't support?" - "How do you handle gateway-side outages?" - "How does adoption happen organically?"
Traps: - Building the gateway as a "must use" mandate without a clear value prop. App teams will route around. - No HA design. A gateway outage takes down every AI feature.
Related cross-cutting: Architecture choices, Cost & latency, Production patterns
Related module: learning/01_ai_engineering/12_model_vendor_strategy/, learning/02_ai_infrastructure/04_ml_platform_operations/
Q: "Design a multi-tenant AI chatbot platform where each business gets a custom chatbot."¶
Tags: senior · common · design · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)
Answer outline: - Clarify: tenant count (10s, 100s, 10000s)? Customization depth (prompts only, or fine-tunes, or fully custom)? Per-tenant isolation level? - Assume 1000s of tenants, prompt + RAG corpus customization, no per-tenant fine-tunes (too expensive at scale), strong logical isolation. - Architecture: - Shared inference backplane: one fleet of LLM workers (self-hosted vLLM + frontier API as fallback). Stateless; processes any tenant's request. - Per-tenant config store: prompt template, RAG corpus ID, tool allow-list, guardrail policy, brand voice config. Loaded per request. - Per-tenant RAG corpus: each tenant's vector DB partition. Hard tenant-ID filter in every retrieval. Storage isolated per tenant for compliance. - Per-tenant cost attribution: every call tagged with tenant ID; cost rolled up daily for billing. - Per-tenant rate limits / budgets: a noisy tenant doesn't degrade others. - Per-tenant observability: traces scoped, tenant dashboards. - Customization API: tenants supply (a) prompt template via UI, (b) source documents (uploaded, ingested into their corpus), (c) tool integrations (via config). Self-serve. - Isolation: - Logical (tenant ID filter everywhere) is the baseline. - Physical (per-tenant DBs, per-tenant compute) for regulated customers willing to pay more. - LLM choice: shared cheap-tier (Haiku / Sonnet on a shared fleet) with shared API key on the gateway side, attributed per-tenant on billing. - Eval: per-tenant golden sets for the largest customers; aggregate for the rest. Quality monitoring per-tenant. - Trade-offs: shared infra (cheap, slight noisy-neighbor risk) vs dedicated infra per tenant (expensive, full isolation); fine-tune per tenant (best quality, $$$) vs shared model with per-tenant prompts (cheap, slightly weaker per-tenant). - Numbers to drop: "1000s of tenants on shared fleet typical", "per-tenant cost: $10-1000/month depending on usage", "isolation: logical for SMB tenants, optional physical tier for enterprise"
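One way to make the "hard tenant-ID filter in every retrieval" rule hard to forget is to put it in the only retrieval entry point. The sketch below assumes an in-memory list of chunks and brute-force dot-product scoring; `Chunk` and `TenantScopedRetriever` are hypothetical names, and a real vector DB would apply the same filter as a metadata or partition constraint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    tenant_id: str
    text: str
    embedding: tuple


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class TenantScopedRetriever:
    """The only search path; there is deliberately no unscoped search method."""

    def __init__(self, chunks):
        self._chunks = list(chunks)

    def search(self, tenant_id: str, query_embedding, k: int = 3):
        if not tenant_id:
            raise ValueError("tenant_id is required for every retrieval")
        # Filter first, then score: other tenants' chunks are never ranked.
        candidates = [c for c in self._chunks if c.tenant_id == tenant_id]
        candidates.sort(key=lambda c: dot(c.embedding, query_embedding), reverse=True)
        return candidates[:k]
```

Filtering before scoring means a bug in ranking cannot surface another tenant's content, which is the property the leakage trap below is about.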
Common follow-ups: - "How do you handle a tenant that wants on-prem?" - "What if a tenant's corpus is bigger than the shared infra can hold?" - "How do you A/B test prompts across tenants?"
Traps: - Skipping per-tenant rate limits. One spiky tenant kills the platform. - Omitting the tenant-ID filter from any retrieval path. Even one unfiltered query is a cross-tenant data leak.
Related cross-cutting: Architecture choices, Production patterns
Related module: learning/02_ai_infrastructure/04_ml_platform_operations/, learning/01_ai_engineering/12_model_vendor_strategy/
High-throughput / batch designs¶
Q: "Design an AI resume screening system that handles 100K applications per week."¶
Tags: senior · common · design · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)
Answer outline: - Clarify: pure ranking or yes/no? Multilingual? Bias / fairness controls? Volume per recruiter? Latency tolerance (resume screening is usually batch, not real-time)? - Assume ranking against a job description, English+top-10 langs, strict fairness controls, 100k/week = ~14k/day = batch-able. - Pipeline: - Parse: PDF/Office resume parsing (Unstructured / Marker / LlamaParse) to structured text + extracted fields (name, contact, education, experience, skills). - PII redaction for screening: optionally strip name, age, address before LLM sees content (mitigate demographic bias). - Match LLM call: structured prompt: "given the job description and this resume, score 1-10 on relevance, list strengths, list gaps". Output schema enforced. - Calibration: re-score across many candidates, normalize to ensure consistent ranking across batches. - Recruiter UI: ranked list, with the LLM's reasoning visible per candidate. Recruiter can flag bad rankings, which feeds back. - Batch processing: jobs come in, candidate batch processed via Batch API (50% off, 24h SLA) — fits recruiter workflow which is daily, not real-time. - Fairness: - Audit per protected attribute (estimated from name/zip/college signals removed) to detect group-wise score differences. - Mandatory human review for any auto-rejection in some jurisdictions (EU AI Act for high-risk categories). - Explainability: per-candidate reasoning logged. - Periodic fairness eval; bias-detected → revisit prompt and training data. - Eval: hand-labeled "this candidate is a good fit / not" by senior recruiters on a held-out set; precision@10 and recall@50 are usable metrics; fairness across slices. - Regulatory: EU AI Act treats hiring AI as high-risk. NYC has bias-audit requirements. The system must support audit trail + explainability + human-in-loop. 
- Trade-offs: full LLM call per candidate ($0.01-0.05 each) vs cheap classifier + LLM only for borderline ($0.001 each, ~10× cheaper on average), strict redaction (mitigates bias but loses signal) vs no redaction. - Numbers to drop: "100k/week × $0.02/call = $2k/week LLM cost", "batch API: $1k/week at the same throughput", "bias audit: at least monthly, more often during ramp"
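The cost numbers above can be checked with a few lines of arithmetic. The function name and parameters are illustrative; the 5% borderline fraction used for the cascade is an assumption chosen for the example, not a figure from the outline.

```python
def weekly_llm_cost(n_resumes, llm_cost_usd, classifier_cost_usd=0.0,
                    llm_fraction=1.0, batch_discount=0.0):
    """Weekly spend: a cheap classifier on every resume plus an LLM call on a fraction."""
    classifier_total = n_resumes * classifier_cost_usd
    llm_total = n_resumes * llm_fraction * llm_cost_usd * (1.0 - batch_discount)
    return classifier_total + llm_total


# Full LLM call per candidate at $0.02.
full = weekly_llm_cost(100_000, 0.02)
# Same volume through a batch API at 50% off.
batch = weekly_llm_cost(100_000, 0.02, batch_discount=0.5)
# Cascade: $0.001 classifier on everything, LLM on an assumed 5% borderline slice.
cascade = weekly_llm_cost(100_000, 0.02, classifier_cost_usd=0.001, llm_fraction=0.05)
```

Under these assumptions the cascade lands at roughly a tenth of the full-LLM cost, which is where the "10× cheaper" claim comes from; the ratio moves directly with the borderline fraction.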
Common follow-ups: - "How do you handle a candidate complaining they were auto-rejected?" - "What about EU AI Act compliance?" - "What if a hiring manager tries to game the system?"
Traps: - Auto-rejection without human review. Legal exposure. - No fairness audit. Will eventually surface in headlines.
Related cross-cutting: Architecture choices, Production patterns
Related module: learning/03_ai_security_safety/00_safety_guardrail_design/, learning/03_ai_security_safety/02_ai_ethics_risk_fairness/
Q: "Design an AI meeting summarizer system for thousands of meetings daily."¶
Tags: senior · common · design · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)
Answer outline: - Clarify: live or post-meeting? Length distribution? Speakers diarized? Action items extracted? Privacy? - Assume post-meeting batch, meetings typically 15-90 min long, diarized, action items + summary, enterprise tenant model. - Pipeline: - Ingest: meeting audio + (if available) calendar metadata, attendee list. - Transcribe: streaming or batch Whisper-large-v3. Word-level timestamps. - Diarize: pyannote-style speaker segmentation; cluster across utterances; resolve speakers via attendee list (heuristic match or fingerprinting). - Chunk for long meetings: 90-min meetings exceed many context windows. Split into segments by topic shift (LLM-based) or by fixed time windows with overlap. - Summarize: per-segment summary, then meta-summary combining all segments. Structured output schema (key topics, decisions, action items with owner + due-date). - Action item extraction: dedicated extractor pass; cross-link to the transcript span where the commitment was made. - Cost discipline: each meeting costs $0.10-1.00 depending on length and model. Multiply by daily volume. - Privacy: meeting content is sensitive. Per-tenant encryption, retention policy, explicit user opt-in for recording. Speakers can request deletion. - Eval: meeting-summary quality eval (human-graded summaries vs LLM-generated), action-item recall (% of human-identified actions surfaced), faithfulness (no invented attendees or commitments). - Trade-offs: bigger models for better summary quality vs cost per meeting; full transcript stored (better recall, more privacy risk) vs summary-only (cheaper, less explorable). - Numbers to drop: "Whisper-large transcribe: ~real-time on H100; can run async cheaper", "summary cost: $0.10-1/meeting at frontier model", "action-item recall target: 80%+ on human-labeled set"
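The "fixed time windows with overlap" option from the chunking step can be sketched as follows. The segment tuple shape `(start_s, end_s, speaker, text)` and the 10-minute window with 1-minute overlap are assumptions for illustration; each returned window would feed one per-segment summary call before the meta-summary.

```python
def window_transcript(segments, window_s=600.0, overlap_s=60.0):
    """Group diarized segments into overlapping fixed-length time windows.

    segments: list of (start_s, end_s, speaker, text) tuples.
    Returns a list of (window_start_s, window_end_s, segments_in_window).
    """
    if overlap_s >= window_s:
        raise ValueError("overlap must be smaller than the window")
    if not segments:
        return []
    step = window_s - overlap_s
    meeting_end = max(seg[1] for seg in segments)
    windows = []
    t = 0.0
    while True:
        # A segment belongs to the window if it overlaps [t, t + window_s).
        in_window = [s for s in segments if s[0] < t + window_s and s[1] > t]
        if in_window:
            windows.append((t, t + window_s, in_window))
        if t + window_s >= meeting_end:
            break  # this window already reaches the end of the meeting
        t += step
    return windows
```

Segments that straddle a boundary appear in both neighbouring windows, so a commitment made across the cut is still visible to at least one per-segment summary.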
Common follow-ups: - "What if attendees overlap-speak?" - "How do you handle confidential meetings?" - "Live vs post — when do you do live?"
Traps: - Loading the full transcript into one LLM call. Will overflow the context window or cost too much. - No span-link from action item back to transcript. Reviewers can't verify.
Related cross-cutting: Architecture choices, Cost & latency
Related module: learning/05_ai_specializations/00_realtime_voice_agents/, learning/02_ai_infrastructure/05_agent_performance_economics/
Q: "Design a content moderation system using AI."¶
Tags: senior · common · design · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)
Answer outline:
- See safety-guardrails.md for the content-moderation deep-dive. System-design-round emphasis is on pipeline architecture, throughput, and human review tooling.
- Architecture:
- Tiered classification: fast filter (regex / hash match for CSAM, blocked keywords) → primary classifier (small classifier or LLM-Guard) → LLM judge for borderline → human review queue.
- Severity tiers: critical (hard block + escalate), high (block + log), medium (sanitize/warn), low (log only).
- Multi-modal: separate classifiers per modality, combined by policy. Text + image + (audio via ASR + text moderation).
- Throughput: stateless workers, autoscaled. Latency tolerance depends on use case — pre-publication moderation is sync (block bad content), post-publication is async (find and remove).
- Human review tooling: priority queue, reviewer assignment, inter-reviewer agreement tracking, appeals workflow.
- Policy versioning: policy is code, A/B tested, rolled out via canary.
- Scale: at, say, 10M user posts/day, the fast filter handles ~30-50% cheaply, primary classifier handles the rest at ~50-150ms each, LLM judge invoked on the borderline 10-20%.
- Audit: every decision logged with policy version, classifier scores, action, reviewer ID if human-touched.
- Trade-offs: latency vs deep classification (synchronous for pre-publication, async post for everything else), false-positive rate (over-blocking kills UX) vs false-negative (under-blocking is a brand/safety problem).
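- Numbers to drop: "10M posts/day ≈ 116 posts/s; at 50-150ms per classification that is ~6-18 busy workers at average load, so 50-150 dedicated CPU/GPU workers once peak traffic and redundancy headroom are included", "human-review queue: 1-5% of posts typical", "FPR target: <2% on benign content"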
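A back-of-envelope capacity check for the throughput numbers, using arrival rate × service time (Little's law). The function and its parameters are hypothetical; `peak_factor` and `target_utilization` are assumed knobs showing how an average-load figure turns into a provisioned fleet size.

```python
import math


def workers_needed(posts_per_day, service_time_s, fraction_reaching_tier=1.0,
                   peak_factor=1.0, target_utilization=1.0):
    """Workers required if each worker classifies one post at a time."""
    arrivals_per_s = posts_per_day * fraction_reaching_tier / 86_400
    busy_workers = arrivals_per_s * service_time_s * peak_factor
    return math.ceil(busy_workers / target_utilization)


# Average load, every post hitting the primary classifier.
avg_fast = workers_needed(10_000_000, 0.050)
avg_slow = workers_needed(10_000_000, 0.150)
# Assumed 3x daily peak and 50% target utilization for headroom.
peak_slow = workers_needed(10_000_000, 0.150, peak_factor=3.0, target_utilization=0.5)
```

The average-load result is a floor, not a fleet size; batching on GPUs lowers it, and peak traffic, redundancy across zones, and the fast-filter hit rate move it in either direction.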
Common follow-ups: - "How do you handle appeals?" - "What about adversarial content trying to evade filters?" - "How do you onboard a new policy category?"
Traps: - Single-classifier design. - No appeals path.
Related cross-cutting: Production patterns, Architecture choices
Related module: learning/03_ai_security_safety/00_safety_guardrail_design/
Domain-specific designs¶
Q: "Design an AI-powered legal document review system."¶
Tags: senior · common · design · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)
Answer outline: - Clarify: contract review (clause-by-clause) or legal research (find precedent)? Lawyer-supervised or self-serve? Jurisdiction? Confidentiality? - Assume contract review, lawyer-supervised, US jurisdiction, attorney-client privilege applies. - Architecture: - Parse: structure-aware contract parsing — clauses, parties, defined terms, effective dates, governing law. Preserve numbering and cross-references. - Clause classifier: each clause categorized (indemnity, termination, IP, payment, etc.) using a fine-tuned model or LLM with structured output. - Rule-based checks: for each clause type, known issues (e.g., "indemnity is one-sided", "termination notice period < 30 days"). Deterministic, citable. - LLM analysis: nuanced issues — ambiguous language, unusual terms vs market standard, missing protections. Output structured with severity + suggested redline. - Cross-reference resolution: defined terms used consistently? Cross-clauses logically consistent? Cite-with-span output. - Playbook integration: each firm has standard positions ("we never accept arbitration with venue X"). The LLM is grounded against the playbook. - Lawyer UI: side-by-side document + suggested edits + reasoning. Lawyer accepts/rejects/edits. - Constraints: - Privilege preservation: never send privileged content outside the lawyer's environment. Self-hosted or VPC-locked deployment, depending on firm policy. - No legal advice: the system surfaces issues; the lawyer interprets and acts. The product UX makes this explicit. - Auditability: every suggestion logged with model version, prompt version, source playbook entry. - Eval: hand-labeled (clause, issue) pairs; recall on issue detection; lawyer-feedback rate. - Trade-offs: API providers (cheap, fast to ship, privilege concerns) vs self-hosted (privilege-safe, ops cost), strict adherence to playbook (consistent, less flexible) vs LLM judgment (flexible, less consistent). 
- Numbers to drop: "contract review: $5-50/contract at frontier model", "issue-detection recall target: 85%+ on labeled clause categories", "lawyer time saved: 50-70% on routine contracts"
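The rule-based layer is the easiest part to make concrete. Below is a sketch of the "termination notice period < 30 days" check with cite-with-span output. `Clause`, `Finding`, the rule ID, and the regex are illustrative assumptions; real contracts phrase notice periods in many more ways than this pattern covers.

```python
import re
from dataclasses import dataclass


@dataclass
class Clause:
    number: str
    category: str  # assigned by the clause classifier upstream
    text: str


@dataclass
class Finding:
    clause_number: str
    rule_id: str
    message: str
    span: tuple  # (start, end) character offsets into clause.text


# Matches phrases such as "15 days' written notice" or "60 days prior written notice".
NOTICE_RE = re.compile(
    r"(\d+)\s+days?['’]?\s+(?:prior\s+)?(?:written\s+)?notice", re.IGNORECASE
)


def check_termination_notice(clause: Clause, min_days: int = 30):
    """Deterministic check: flag termination notice periods shorter than min_days."""
    if clause.category != "termination":
        return []
    findings = []
    for match in NOTICE_RE.finditer(clause.text):
        days = int(match.group(1))
        if days < min_days:
            findings.append(Finding(
                clause.number,
                "termination.notice_period",
                f"notice period of {days} days is below the {min_days}-day minimum",
                match.span(),
            ))
    return findings
```

Because the finding carries the clause number and character span, the lawyer UI can highlight the exact text, and the audit log can record which rule fired without involving the LLM.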
Common follow-ups: - "What about cross-jurisdictional contracts?" - "How do you handle a contract type the model wasn't trained on?" - "What's the liability model if the AI misses a critical issue?"
Traps: - Trying to replace the lawyer. The product surfaces issues; lawyer decides. - Skipping privilege-preserving deployment.
Related cross-cutting: Architecture choices, Production patterns
Related module: learning/01_ai_engineering/08_rag_system_design/, learning/03_ai_security_safety/00_safety_guardrail_design/
Q: "Design a fraud detection system powered by LLMs."¶
Tags: staff · occasional · design · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)
Answer outline: - LLMs alone are wrong for fraud detection — they're slow, expensive, and hard to audit at the scale of transaction streams. The right answer is a hybrid: classical ML for the bulk + LLM for nuanced cases. - Architecture: - Layer 1 — rule engine: fast deterministic rules (velocity checks, blocklists, IP/geo mismatch). Catches the obvious. - Layer 2 — ML classifier: gradient-boosted trees or a deep net on engineered features (tx amount, merchant category, user history, device fingerprint, geo). Real-time, milliseconds. - Layer 3 — LLM analyst (for high-value or borderline tx): given the transaction + user history + recent context, output a structured fraud assessment with reasoning. Slower, more expensive, more nuanced. Used on the 1-5% of transactions that the ML layer flags as ambiguous. - Human analyst queue: for highest-confidence-fraud and high-value-borderline cases, route to a human analyst with the LLM's reasoning pre-filled. - Why LLMs help: explainability (the LLM writes "this transaction is consistent with prior travel patterns" — useful for analyst), unusual-context reasoning (new fraud pattern not yet in training data), cross-modal cases (image-of-receipt + transaction). - Why classical ML still dominates: latency (single-digit ms), cost (millions of tx/day at $0.0001 each), audit (deterministic, regulator-friendly). - Adversarial considerations: fraudsters adapt fast. Continuous retraining, online learning, red-team-style test of new fraud patterns. - Trade-offs: LLM in the critical path (better detection, higher cost / latency) vs LLM only on flagged cases (cheaper, may miss novel patterns), explainability vs raw accuracy. - Numbers to drop: "1M tx/day × $0.0001 ML = $100/day; LLM on 1% = $1/tx × 10k = $10k/day if naive; route only borderline to control cost", "ML latency: 1-5ms; LLM: 1-3s — incompatible for inline use"
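The layering above reduces to a small routing function. This is a sketch under assumed thresholds: the `low`/`high` score cut-offs, the high-value amount, and the route names are hypothetical and would be tuned against labeled fraud data.

```python
def route_transaction(amount_usd, rule_hits, ml_score,
                      low=0.2, high=0.9, high_value_usd=5_000.0):
    """Decide which layer handles a transaction.

    rule_hits: names of deterministic rules that fired (Layer 1).
    ml_score: fraud probability from the classifier (Layer 2), in [0, 1].
    """
    if rule_hits:
        return "block"  # obvious cases never reach the model layers
    if ml_score >= high:
        return "human_review"  # highest-confidence fraud goes to an analyst
    borderline = low <= ml_score < high
    if borderline and amount_usd >= high_value_usd:
        return "llm_then_human"  # LLM reasoning pre-filled for the analyst
    if borderline or amount_usd >= high_value_usd:
        return "llm_review"  # Layer 3: high-value or ambiguous
    return "approve"
```

Only the last three branches cost LLM or human time, so the share of traffic falling in the borderline band is the main cost lever.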
Common follow-ups: - "What if a new fraud pattern emerges that the ML missed?" - "How would you A/B a new LLM model in this stack?" - "How does regulatory explainability constrain you?"
Traps: - Putting LLM in the critical path of every transaction. Cost and latency explode.
Related cross-cutting: Architecture choices, Cost & latency
Related module: learning/00_ai_foundation/00_ml_prerequisites_refresher/, learning/02_ai_infrastructure/04_ml_platform_operations/
Q: "Design a medical diagnosis assistant using AI."¶
Tags: staff · occasional · design · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)
Answer outline: - The senior answer leads with scope-pushback, not architecture. A "medical diagnosis assistant" is a regulated medical device in most jurisdictions. Establish scope first. - Constrain scope: this is a clinical decision support tool that surfaces differentials and evidence, not an autonomous diagnoser. The system informs a licensed clinician; the clinician decides. - Architecture (after scoping): - Patient context aggregator: structured EHR pull (labs, history, meds) + unstructured (notes, imaging reports). - Differential generator: LLM grounded in (a) the patient context and (b) authoritative clinical sources (UpToDate, NICE, peer-reviewed via curated RAG). Output: ranked differentials, each with supporting evidence + citations. - Imaging path (if applicable): specialized medical imaging models (radiology / pathology). LLM only integrates their structured output. - Decision support UI: differentials presented with rationale; clinician can dig into citations. No "the answer is X" framing. - Crisis detection: red-flag symptoms (chest pain, neuro deficit) escalate immediately, regardless of differential ranking. - Eval: clinician-validated golden cases; sensitivity/specificity per disease class; against published clinical benchmarks; ongoing post-market surveillance. - Regulatory: FDA SaMD pathway (or equivalent — MHRA, CE, India CDSCO). Likely class II or III depending on use case. Plan validation studies; ongoing post-market monitoring is mandatory. - Liability: clearly framed as decision support; clinician retains responsibility. Internal legal review on the product UX wording. - Trade-offs: deep integration with EHR (better context, higher coupling) vs lightweight standalone (faster ship, less utility), broad disease coverage (more useful, harder to validate) vs narrow specialist (easier path). 
- Numbers to drop: "FDA clearance: months-to-years depending on class", "clinician-validated golden cases: 500-2000 to start", "post-market surveillance: ongoing"
Common follow-ups: - "What about HIPAA?" - "How do you handle a clinician overruling the AI when AI was right?" - "Where does the liability sit?"
Traps: - Skipping the scope-pushback. This is the senior signal. - Treating medical as a regular product.
Related cross-cutting: Architecture choices, Production patterns
Related module: learning/03_ai_security_safety/00_safety_guardrail_design/, learning/01_ai_engineering/08_rag_system_design/
Platform / specialized infra¶
Q: "Design a real-time AI recommendation system."¶
Tags: senior · common · design · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)
Answer outline: - Clarify: recommendations of what (products, content, jobs, friends)? Cold start strategy? Personalization signals? Latency SLO? - Assume content-feed (videos/articles) at p95 <100ms inference latency. - Two-tower architecture is the modern default: - User tower: takes user features (demographics, history, recent actions, session context). Trained to output a user embedding. - Item tower: takes item features (content text/embedding, author, tags, recency). Outputs item embeddings, precomputed offline for the entire catalog. - At serve time: compute the user embedding (5-20ms), ANN-search item index for top-K nearest items (<10ms on millions of items), rerank with a heavier model on top-100 → top-20. - LLMs come in for: - Cold-start items: generate rich textual embeddings from content (title, body) so new items have meaningful tower-vectors before they accumulate behavior signals. - Re-ranking nuanced cases (large LLM scoring top-50 candidates) — only if latency budget allows. - Explainability ("why this video?") — small LLM generates an explanation from features. - Feature store: real-time features (last-10-clicks, session length) in Redis-style store; batch features (long-term preferences) in offline store; both joined at query time. - Eval: offline metrics (recall, NDCG on labeled relevant items), online A/B (click-through rate, engagement time, retention). - Personalization safeguards: explore/exploit (epsilon-greedy or Thompson sampling for diversity), filter bubble mitigation (inject some out-of-comfort-zone items), fairness across creator groups. - Trade-offs: two-tower (fast, scales) vs DLRM-style fully feature-crossed model (richer, slower), LLM-as-reranker (better quality, latency cost) vs gradient-boosted reranker (fast, slightly worse). - Numbers to drop: "two-tower p95: <50ms end-to-end at millions of items", "item embedding precompute: nightly batch, sometimes hourly for freshness", "online A/B per change: 1-2 weeks minimum"
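The serve-time path (retrieve, rerank, explore) can be sketched in plain Python. Brute-force dot products stand in for the ANN index here, `rerank_score` is a placeholder for the heavier reranking model, and the epsilon-greedy swap is one simple explore mechanism among those listed; all names are illustrative.

```python
import heapq
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def retrieve_top_k(user_vec, item_vecs, k):
    """Brute-force nearest items by dot product (stand-in for an ANN index)."""
    return heapq.nlargest(k, item_vecs, key=lambda item_id: dot(user_vec, item_vecs[item_id]))


def recommend(user_vec, item_vecs, rerank_score,
              k_retrieve=100, k_final=20, epsilon=0.0, rng=random):
    candidates = retrieve_top_k(user_vec, item_vecs, k_retrieve)
    # Heavier model scores only the retrieved candidates, not the whole catalog.
    ranked = sorted(candidates, key=rerank_score, reverse=True)[:k_final]
    # Epsilon-greedy exploration: occasionally give the last slot to an unranked item.
    unseen = [item_id for item_id in item_vecs if item_id not in ranked]
    if ranked and unseen and rng.random() < epsilon:
        ranked[-1] = rng.choice(unseen)
    return ranked
```

Item vectors are precomputed offline in the outline, so the only per-request model work is the user embedding and the reranker over `k_retrieve` candidates.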
Common follow-ups: - "How does LLM-generated content embedding compare to learned-from-behavior?" - "What about user cold-start?" - "Why two-tower over a unified model?"
Traps: - LLM in critical path of every recommendation. Latency disaster. - No explore mechanism. Filter bubble.
Related cross-cutting: Cost & latency, Architecture choices
Related module: learning/01_ai_engineering/07_search_relevance_ranking/, learning/00_ai_foundation/00_ml_prerequisites_refresher/
Q: "Design an AI-powered email assistant."¶
Tags: senior · common · design · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)
Answer outline:
- Clarify: scope — read/triage, compose, search, summarize, or all? Single-user or org-wide? Privacy?
- Assume all of those for individual users, with per-user privacy (no cross-user training).
- Components:
- Mailbox sync: IMAP/Gmail/Outlook APIs ingest mail incrementally. Per-user encrypted storage.
- Index: per-user vector + BM25 index of emails. Embedding model embeds subject + body + thread context.
- Triage classifier: small classifier categorizes incoming email (urgent, important, FYI, newsletter, spam). Fine-tuned on user-feedback signals (which emails the user opens vs ignores).
- Search/Q&A: hybrid RAG over the user's mailbox. "Find emails from Jane about budget last quarter" → metadata filter + retrieval.
- Compose assist: draft replies. Context: the email thread, the user's previous writing style (small fine-tune or few-shot from user's past sends), action items.
- Summarize: thread summarization for long chains.
- The lethal-trifecta concern: emails contain untrusted content (1), the assistant has tools to read/send mail (2), and any URL or attachment is an exfiltration channel (3). Indirect prompt injection is a major risk — see safety-guardrails.md.
- Mitigations: quarantine retrieved email content from agentic tool calls (the agent sees a summary, not the raw email when deciding actions); user confirmation for any outbound action (send, delete, forward); deny-list for known exfiltration patterns.
- Privacy: per-user data, no cross-user mining without explicit opt-in. Encryption at rest, scoped access, audit log on any model access to mailbox content.
- Eval: triage accuracy per user, draft-acceptance rate, search NDCG@5 on labeled queries.
- Trade-offs: cloud LLM (capability, privacy concerns) vs on-device (privacy, capability limits), aggressive triage (saves time, risk of hiding important mail) vs gentle (preserves attention).
- Numbers to drop: "per-user mailbox index: 10k-1M emails", "triage classifier: 10-50ms per email", "compose draft acceptance target: 30-50% (rest edited)"
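The "user confirmation for any outbound action" mitigation can be enforced at the tool-execution boundary rather than in the prompt. A minimal sketch, assuming a hypothetical `ToolCall` shape, a `tools` registry of callables, and a `confirm` callback supplied by the UI:

```python
from dataclasses import dataclass, field

# Actions named in the mitigation list above.
OUTBOUND_ACTIONS = {"send", "delete", "forward"}


@dataclass
class ToolCall:
    action: str
    args: dict = field(default_factory=dict)


def execute(call: ToolCall, tools: dict, confirm) -> dict:
    """Run a tool call, requiring explicit user approval for outbound actions."""
    if call.action not in tools:
        return {"status": "unknown_action", "action": call.action}
    if call.action in OUTBOUND_ACTIONS and not confirm(call):
        # The model proposed the action; the user declined, so nothing runs.
        return {"status": "rejected_by_user", "action": call.action}
    return {"status": "done", "result": tools[call.action](**call.args)}
```

Because the gate sits outside the model, an injected instruction inside an email can at most propose a send; it cannot cause one without the user approving the exact call.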
Common follow-ups: - "How do you handle attachments?" - "What if the user gets a phishing email — does the AI fall for it?" - "How does the AI learn the user's style?"
Traps: - Auto-sending replies without confirmation. UX disaster. - Skipping the lethal-trifecta analysis. Email is a prime injection vector.
Related cross-cutting: Architecture choices, Production patterns
Related module: learning/03_ai_security_safety/00_safety_guardrail_design/, learning/01_ai_engineering/01_agentic_system_design/
Q: "Design an AI dynamic pricing engine."¶
Tags: staff · occasional · design · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)
Answer outline: - LLMs are not the right tool for the core pricing loop — needs deterministic, fast, audit-friendly decisions. Classical ML (XGBoost / deep nets) is the workhorse. - Where LLMs add value: explaining price decisions to operators (compliance, trust), generating pricing rules from natural-language input, surfacing anomalies ("competitor X just dropped prices on category Y"). - Architecture (core pricing engine): - Feature store: demand signal, competitor prices, inventory, time-of-day, customer segment. - Pricing model: gradient-boosted trees or DNN predicting price elasticity → optimal price per item. - Constraint layer: hard floors/ceilings (legal, brand), strategic rules (don't price above MSRP, don't price loss-leaders below cost). - Audit log: every price change with timestamp, inputs, model version, recommended-vs-final price. - LLM components (separate from critical path): - Rule authoring: business user types "increase prices 5% in the morning on weekdays for SKU group X"; LLM converts to structured rule + safety check. - Anomaly explainer: when a price changes a lot, LLM generates a 1-sentence summary of why for ops review. - Competitor intelligence: LLM monitors competitor pricing pages, extracts price changes. - Regulatory: dynamic pricing is regulated in some jurisdictions (price gouging during emergencies, discriminatory pricing under anti-discrimination law). The engine must support these constraints. - Eval: revenue lift in A/B vs static-price baseline; conversion rate; customer complaints rate; fairness audit. - Trade-offs: aggressive personalization (revenue uplift, fairness risk) vs uniform pricing (safe, less revenue), LLM-driven rules (faster authoring, less interpretable) vs hand-coded rules (slow, fully auditable). - Numbers to drop: "price update cadence: every few minutes to hourly typical; some real-time", "ML inference: <10ms per item", "LLM in rule-author and audit loops only — not per-price"
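The constraint layer and audit log are deterministic and easy to sketch. The function below clamps the model's recommended price to the hard floor/ceiling and the two strategic rules named above (never above MSRP, never below cost); the field names in the audit record are hypothetical.

```python
def apply_constraints(recommended, *, cost, msrp, floor, ceiling,
                      model_version="unknown", timestamp=None):
    """Clamp a model-recommended price and return an audit record."""
    lower = max(floor, cost)    # hard floor, and never below cost
    upper = min(ceiling, msrp)  # hard ceiling, and never above MSRP
    if lower > upper:
        raise ValueError("constraints are contradictory: no valid price exists")
    final = min(max(recommended, lower), upper)
    return {
        "timestamp": timestamp,
        "model_version": model_version,
        "recommended_price": recommended,
        "final_price": final,
        "clamped": final != recommended,
    }
```

Logging both the recommended and final price is what lets a later audit separate model behaviour from policy overrides.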
Common follow-ups: - "How would you prevent collusion (algorithmic price-fixing with competitors)?" - "What about regulator audits?"
Traps: - Putting LLM in the critical pricing loop. Slow, opaque, non-deterministic.
Related cross-cutting: Architecture choices
Related module: learning/00_ai_foundation/00_ml_prerequisites_refresher/, learning/02_ai_infrastructure/04_ml_platform_operations/
Q: "Design an AI notification system that prioritizes instead of broadcasting."¶
Tags: senior · occasional · design · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)
Answer outline: - Problem: a product produces N events worth notifying about; broadcasting all is noisy and gets the app uninstalled. Need per-user prioritization. - Components: - Event ingestion: every potentially-notifiable event lands in a queue with metadata (type, payload, user, source). - Per-user prioritizer: small ranking model (or LLM call for nuanced cases) scoring each event for this user's priority. Inputs: user preferences, past interaction patterns, time-of-day, urgency tier. - Batching and budget: per-user daily notification budget (e.g., max 5/day for casual users). Lower-priority events held; only the top-K within budget delivered. - Channel routing: push for urgent, email for digest, in-app for ambient. LLM can summarize a digest of medium-priority events. - Feedback loop: thumbs / open / dismiss feeds back into the prioritizer. - Cold-start: until you have signal, lean toward conservative (under-notify rather than over-notify). - LLM use cases: summarize digests, generate natural-language notification copy, classify novel event types. - Anti-pattern: LLM in the critical path of every notification. At even modest scale this is expensive and slow. - Eval: open-rate, dismiss-rate, opt-out-rate as the headline metrics. Track per-user notification budget adherence. - Trade-offs: more notifications (more engagement, more uninstalls) vs fewer (less engagement, more retention), per-user model (better quality, more cost) vs global model with per-user features. - Numbers to drop: "notification budget: 1-10/day per user typical", "small-model ranking: <10ms per event", "LLM in summary/copy only — not per-event"
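The batching-and-budget step is a top-K selection against what is left of the user's daily budget. A minimal sketch, assuming each event is a dict with a `score` already produced by the prioritizer (the dict shape is hypothetical):

```python
def select_notifications(events, daily_budget, already_sent_today):
    """Split scored events into (deliver_now, held) for one user."""
    remaining = max(0, daily_budget - already_sent_today)
    ranked = sorted(events, key=lambda e: e["score"], reverse=True)
    # Only the top-K within the remaining budget go out; the rest are held,
    # for example for a digest.
    return ranked[:remaining], ranked[remaining:]
```

Held events are natural input for the LLM-generated digest, which keeps the LLM out of the per-event path.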
Common follow-ups: - "How do you balance system-critical alerts (always send) vs marketing (send sparingly)?" - "How do you avoid an algorithmic feedback loop where you only notify users about things they already engage with?"
Traps: - LLM per notification. - No budget — turns into spam.
Related cross-cutting: Cost & latency, Architecture choices
Related module: learning/01_ai_engineering/12_model_vendor_strategy/
Q: "Design an AI-powered anomaly detection system for cloud infrastructure."¶
Tags: staff · occasional · design · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)
Answer outline: - LLM positioning: anomaly detection on metrics/logs is classical ML's job (statistical thresholding, isolation forests, deep autoencoders on time series). LLMs help with interpretation and root-cause hypothesis generation once anomalies are flagged. - Architecture: - Metric stream: time-series database (Prometheus / InfluxDB / TimescaleDB). - Anomaly detector: per-metric statistical or learned models. Output: anomaly score + flagged time window. - Log stream: log aggregator (Loki / ELK). - Correlator: when an anomaly fires, fetch correlated logs, recent deploys, infra changes, related metric anomalies. Build a context bundle. - LLM root-cause hypothesis: given the context bundle, generate a structured hypothesis: most-likely cause, supporting signals, suggested next steps. Cite specific log lines / change events. - Alerting: alarms route to on-call with the LLM hypothesis as the description. - LLM also helps with: alert deduplication (cluster similar alerts), runbook generation (LLM produces a draft from past incident histories), postmortem drafting. - Anti-pattern: LLM on every data point. Cost and latency. - Continuous learning: each incident's actual root cause logged; LLM hypothesis evaluated post-hoc; high-accuracy hypothesis patterns reinforced. - Eval: alert precision (fraction of fired alerts that were real incidents), hypothesis-accuracy (was the LLM's top cause right?), MTTR improvement. - Trade-offs: aggressive sensitivity (more alerts, more noise) vs conservative (miss subtle issues), LLM cost per alert vs throwing more humans at it. - Numbers to drop: "anomaly threshold: 3-5 sigma typical, tuned per metric", "LLM per alert: $0.05-0.50 — affordable at typical alert volumes", "hypothesis-accuracy target: 60-80% top-1 with good runbook context"
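The statistical-thresholding detector behind the "3-5 sigma" number can be sketched as a trailing-window z-score. The window length, the `min_std` floor, and the function name are assumptions; production detectors usually also handle seasonality and trend, which this does not.

```python
import statistics


def zscore_anomalies(values, window=30, threshold=3.0, min_std=1e-9):
    """Flag points whose z-score against the trailing window exceeds the threshold.

    Returns a list of (index, z_score) pairs.
    """
    flagged = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mean = statistics.fmean(baseline)
        # Floor the deviation so a perfectly flat baseline does not divide by zero.
        std = max(statistics.pstdev(baseline), min_std)
        z = (values[i] - mean) / std
        if abs(z) >= threshold:
            flagged.append((i, z))
    return flagged
```

Each flagged index defines the time window the correlator would use to pull logs, deploys, and related metric anomalies into the context bundle for the LLM.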
Common follow-ups: - "What if the metric is brand-new and you have no baseline?" - "How does the LLM not hallucinate root causes?"
Traps: - LLM-as-anomaly-detector on raw metric streams. Wrong tool.
Related cross-cutting: Architecture choices
Related module: learning/01_ai_engineering/05_ai_incident_operations/, learning/01_ai_engineering/03_agent_observability_debugging/
Q: "Design an AI-powered live streaming content moderation system."¶
Tags: staff · occasional · design · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)
Answer outline: - Hybrid sync moderation on live streams: detection latency must be seconds, not minutes, to be useful. - Pipeline (per stream): - Video frame sampling: keyframe every 1-3 seconds. Classify with NSFW / violence / CSAM models. CSAM hashes against PhotoDNA-style databases. - Audio ASR: streaming Whisper-small or Deepgram. Transcribe → text moderation classifier. - Chat moderation: text classifier on every chat message (already standard). - Combined verdict per few-second window: any modality flags high-severity → immediate action (mute / blackout / kick). Medium → human review queue. - Action latency: must be sub-5-seconds for "live" feel; sub-15 for tolerable. - Scale: at 10k concurrent streams, this is GPU-heavy. Tier streams by audience size — premium / high-audience streams get continuous moderation, low-audience streams get periodic + reactive moderation. - Human review tier: severity-prioritized queue with playback-with-context UI. Reviewer audits, confirms/overturns, escalates. - Appeals: streamers can dispute strikes; appeal goes to a separate review pool. - Eval: precision/recall per content class on labeled streams; reviewer-AI agreement rate; time-to-action distribution. - Trade-offs: more frequent sampling (better detection, more compute) vs less frequent (cheaper, slower detection), aggressive auto-action (safer, more false positives → angry streamers) vs human-in-loop (better accuracy, slower). - Numbers to drop: "1-3s sampling cadence", "10k concurrent streams: 20-100+ GPUs typical depending on model size", "auto-action latency: <5s for clear violations; HITL: <30s for borderline"
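The per-window combined verdict is a worst-severity-wins rule across modalities. A sketch, assuming each modality classifier has already reduced its output to one of four severity labels (the labels and return values are hypothetical names):

```python
SEVERITY_ORDER = {"none": 0, "low": 1, "medium": 2, "high": 3}


def window_verdict(modality_flags):
    """Combine per-modality severities for one few-second window.

    modality_flags: e.g. {"video": "none", "audio": "medium", "chat": "low"}.
    """
    worst = max(modality_flags.values(), key=SEVERITY_ORDER.__getitem__, default="none")
    if worst == "high":
        return "auto_action"   # mute / blackout / kick immediately
    if worst == "medium":
        return "human_review"  # goes to the severity-prioritized queue
    return "allow"
```

Keeping the combination rule this simple makes the time-to-action budget depend only on the slowest classifier, not on any cross-modal model.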
Common follow-ups: - "How do you handle adversarial content (a streamer testing what they can get away with)?" - "What about audio-only moderation for radio-style streams?"
Traps: - Sampling too coarsely. Misses fast violations. - Auto-action without appeals. Streamer backlash.
Related cross-cutting: Architecture choices, Cost & latency
Related module: learning/03_ai_security_safety/00_safety_guardrail_design/, learning/05_ai_specializations/01_multimodal_vision_systems/