MLOps & Deployment: Interview Questions

The senior signal here is the eval gate plus the rollback path. Anyone can describe canary; few can describe how the canary actually decides to promote, which metric trips a rollback, and which prompt version was running when something broke. In 2026 the differentiator is treating prompts, fine-tunes, and base-model pins as versioned artifacts with the same release discipline as code. If you talk only about model accuracy and skip the prompt, eval, and cost rollback paths, you come across as junior.

Deployment strategies

Q: "What's the difference between canary and blue-green deployment?"

Tags: mid · very-common · conceptual · source: DataCamp Top 30 MLOps Interview Questions 2026; standard MLOps loop opener

Answer outline:
- Both are zero-downtime deployment strategies. They differ in how traffic moves between old and new.
- Blue-green: two identical environments. Blue serves 100% of traffic; green runs the new version, idle. Once green is verified (smoke tests, manual checks), flip the load balancer: 100% to green instantly. Old blue stays warm for fast rollback.
- Canary: progressively shift traffic. The new version starts at 1% of traffic; you watch metrics for N minutes; promote to 5%, 25%, 100% as confidence grows. Slower than blue-green but catches issues with a much smaller blast radius.
- For LLM systems, canary is usually right: the failure modes (quality regressions, cost spikes, latency drift) only show up under the real traffic distribution. Blue-green's instant 100% flip means a quality regression hits everyone at once.
- Blue-green wins for: stateless rollouts where the only risk is infra (config errors, broken binary), and for hot paths where you cannot tolerate even a few minutes of mixed-version state.
- Canary wins for: model swaps, prompt changes, fine-tune promotions, anywhere the failure is a distribution-level shift rather than a binary up/down.
- Numbers to drop: "canary cadence: 1% → 5% → 25% → 100% over hours-to-days", "blue-green flip: seconds to minutes", "blast radius in canary at 1%: a regression affects ~1% of traffic before you see it"
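The promote-or-rollback decision in the canary bullet can be sketched as a small function. This is a minimal illustration, not a reference implementation: `GateMetrics`, `decide`, and every threshold below are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass

RAMP = [1, 5, 25, 100]  # percent of traffic, matching the cadence in the outline


@dataclass
class GateMetrics:
    error_rate: float      # fraction of failed requests in the slice
    p99_latency_ms: float
    quality_score: float   # e.g. mean LLM-judge score in [0, 1]


def decide(current_pct: int, canary: GateMetrics, baseline: GateMetrics) -> tuple[str, int]:
    """Return (action, next_pct); action is 'rollback', 'promote', or 'done'."""
    # Illustrative thresholds (assumptions): +1 point error rate,
    # +20% p99 latency, -0.02 quality score relative to the baseline slice.
    regressed = (
        canary.error_rate > baseline.error_rate + 0.01
        or canary.p99_latency_ms > baseline.p99_latency_ms * 1.2
        or canary.quality_score < baseline.quality_score - 0.02
    )
    if regressed:
        return ("rollback", 0)
    idx = RAMP.index(current_pct)
    if idx == len(RAMP) - 1:
        return ("done", 100)
    return ("promote", RAMP[idx + 1])
```

Note that the gate includes a quality metric, not only latency and errors; a canary that checks only infra health behaves like a slow blue-green flip.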

Common follow-ups: - "Which would you use for a model swap?" - "What metric decides canary promotion?" - "How do you handle mid-canary rollback?"

Traps: - Confusing canary with A/B testing. Canary is a rollout strategy (the new version either reaches 100% or gets rolled back); A/B is an experiment (compare and pick a winner). - Skipping the gate-metric discussion. Canary without an explicit promote/rollback metric is just "we let people use it and hoped".

Related cross-cutting: Production patterns Related module: learning/02_ai_infrastructure/04_ml_platform_operations/

Q: "What is shadow deployment and when do you use it?"

Tags: mid · very-common · conceptual · source: AWS Prescriptive Guidance ML Operations; standard MLOps deployment-strategy question 2026

Answer outline:
- Shadow deployment: run the new model in parallel with the production model on the same real traffic. The production model's output is what the user sees; the shadow model's output is logged for offline comparison. Zero user-facing risk.
- Use for: pre-promotion validation of any non-trivial change (model swap, fine-tune promotion, major prompt rewrite). Run shadow for 1-7 days, compare quality, cost, and latency between old and new on real traffic, then promote via canary.
- The senior signal: explicitly call out that shadow catches distribution-shift bugs that offline evals miss. Your golden set is curated; production traffic is wild.
- What you compare: agreement rate (do the two models give similar answers?), quality (LLM judge or human review on sampled disagreements), cost delta, latency delta, refusal-rate delta, guardrail-block-rate delta.
- Cost: shadowing 100% of traffic doubles inference cost during the shadow window. Sample down (10-30% of traffic) to keep this manageable.
- Numbers to drop: "shadow for 1-7 days before canary", "10-30% traffic sampling typical", "agreement rate threshold to promote: 90-97%"
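The sampling and agreement-rate ideas above can be sketched as follows. The function names are hypothetical, and exact match after normalization is only a crude stand-in for the LLM-judge or human comparison the outline describes.

```python
import hashlib


def in_shadow_sample(request_id: str, sample_pct: int = 20) -> bool:
    """Deterministically select roughly sample_pct% of requests for shadow inference."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < sample_pct


def _normalize(text: str) -> str:
    return " ".join(text.lower().split())


def agreement_rate(pairs: list[tuple[str, str]]) -> float:
    """Fraction of (prod_output, shadow_output) pairs that match after normalization."""
    if not pairs:
        return 0.0
    same = sum(1 for prod, shadow in pairs if _normalize(prod) == _normalize(shadow))
    return same / len(pairs)
```

Hashing the request ID rather than calling a random generator keeps the sample reproducible, so the same request can be re-examined later when a disagreement is reviewed.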

Common follow-ups: - "Do you shadow on 100% of traffic?" - "What does the LLM judge score?" - "How does this combine with canary?"

Traps: - Treating shadow as a substitute for canary. Shadow validates with zero user impact; canary validates with real user impact at small scale. You typically want both. - Forgetting the cost of shadow inference.

Related cross-cutting: Production patterns Related module: learning/02_ai_infrastructure/04_ml_platform_operations/, learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "Walk me through your deployment pipeline for a model swap."

Tags: senior · very-common · design · source: standard senior MLOps probe; reported across 2026 AI engineer loops

Answer outline:
- Stage 0: eval gate. Run the new model against the held-out eval suite: target-task accuracy, off-task / forgetting, safety / refusal, cost, latency. Must beat or tie current production on every guardrail metric. If any metric fails, reject before any deployment activity.
- Stage 1: shadow. Mirror 10-30% of production traffic to the new model in shadow (output not shown to the user, just logged). Run 1-7 days. Compare agreement rate with current production using an LLM judge; flag disagreements for human review.
- Stage 2: canary. Promote to 1% real traffic. Watch dashboards for 30-60 min: error rate, p99 latency, refusal rate, cost-per-call, downstream metrics (CSAT, completion rate).
- Stage 3: ramp. 1% → 5% → 25% → 100%. Each step waits for a window (1h-1d) and an explicit promote signal.
- Stage 4: post-deploy. Keep the old model warm for fast rollback for ~1 week. Tag the artifact with its full lineage: (base_model_version, prompt_version, training_data_hash, eval_pass_id).
- Rollback path: any guardrail metric trips → automated rollback to the previous version, alert the team. Manual rollback button always available.
- Senior tells: candidate names a specific quality metric (not just latency/errors) as the promote gate, names the rollback trigger, mentions keeping the old version warm.
- Numbers to drop: "shadow 1-7 days", "canary 1% → 5% → 25% → 100% over hours-to-days", "old version warm for 1 week post-promotion"
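The Stage 0 rule ("beat or tie production on every guardrail metric") is easy to get wrong when some metrics are better high and others better low. A minimal sketch, with hypothetical metric names and an optional relative `tolerance` that is an assumption of this example:

```python
# Direction of each guardrail metric: True means higher is better.
# The metric names are illustrative, not a fixed schema.
HIGHER_IS_BETTER = {
    "task_accuracy": True,
    "off_task_accuracy": True,
    "safety_pass_rate": True,
    "cost_per_call_usd": False,
    "p99_latency_ms": False,
}


def eval_gate(candidate: dict[str, float], production: dict[str, float],
              tolerance: float = 0.0) -> list[str]:
    """Return the metrics on which the candidate is worse than production.

    An empty list means the gate passes. `tolerance` is a relative slack,
    e.g. 0.01 allows the candidate to be up to 1% worse on each metric.
    """
    failures = []
    for name, higher_better in HIGHER_IS_BETTER.items():
        cand, prod = candidate[name], production[name]
        slack = abs(prod) * tolerance
        worse = cand < prod - slack if higher_better else cand > prod + slack
        if worse:
            failures.append(name)
    return failures
```

Returning the list of failing metrics, rather than a bare boolean, gives the rejection message something concrete to report.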

Common follow-ups: - "What's the failure mode of skipping shadow?" - "How long does the full rollout take?" - "Who has the rollback button?"

Traps: - Listing stages without naming the gate-metric for each. - Forgetting the rollback path. Senior interviewers always probe this.

Related cross-cutting: Production patterns Related module: learning/02_ai_infrastructure/04_ml_platform_operations/, learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "How do you handle model updates and migrations without downtime?"

Tags: mid · common · design · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)

Answer outline:
- Three patterns, picked by failure tolerance:
  - Blue-green for stateless model swaps where the new model is functionally equivalent (a base-model patch version, a quantization change with verified eval parity). Both versions ready, flip the load balancer.
  - Canary for any model swap where behavior shifts (new base model, new fine-tune, prompt change). Gradually shift traffic, monitor, promote.
  - Shadow + canary for high-stakes changes where you want extra confidence before any user sees the new model.
- Stateful migrations (vector index regenerated, embedding model swapped, schema change in conversation memory): rebuild the new state in parallel, dual-write or back-fill, switch reads atomically, decommission old.
- Always: keep the old version warm for fast rollback. Always: tag artifacts with full lineage so rollback is unambiguous.
- For vector store migrations specifically: the embedding model swap is brutal. You must re-embed the entire corpus before switching reads, otherwise retrieval breaks. Plan for the parallel-store window in your storage budget.
- Numbers to drop: "blue-green flip: seconds. Canary: hours-to-days. Vector re-embed: hours-to-days depending on corpus size at typical embedding throughput."

Common follow-ups: - "What's the downside of blue-green for a model swap?" - "How do you handle a vector store migration during traffic?"

Traps: - Treating model swap and infra swap as the same problem. The eval gate matters for model swap; infra swap is closer to traditional deploy.

Related cross-cutting: Production patterns Related module: learning/02_ai_infrastructure/04_ml_platform_operations/, learning/02_ai_infrastructure/03_vector_retrieval_infrastructure/

Versioning & lineage

Q: "What is model versioning, and how do you handle model rollbacks?"

Tags: mid · very-common · design · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)

Answer outline:
- A model's identity isn't just the weights. The deployable artifact is the tuple (base_model_version, fine-tune_weights, prompt_version, tool_schema_version, retrieval_index_version, eval_pass_id). Versioning means each of these is pinned, hashed, and immutable.
- Storage: weights in an artifact registry (S3/GCS with versioned objects + a model registry like MLflow, W&B, Vertex AI). Prompts in git, content-addressable. Indices versioned with the embedding model and chunking parameters.
- Each deploy maps to a single version-tuple. The runtime knows which tuple is serving which traffic slice (canary vs main).
- Rollback: select the previous version-tuple, route traffic to its warm instance (zero re-spinup time) or re-deploy from artifacts (minutes). Test the rollback path in non-prod before you need it; most failed rollbacks fail because the team never exercised the path.
- Automated rollback triggers: any guardrail metric exceeds its threshold (error rate, p99 latency, refusal rate, cost per call). Manual rollback always available with one button.
- Numbers to drop: "rollback target: <2 min from decision to old version serving", "version-tuple: 5-7 components versioned per release", "keep previous version warm: 1 week post-promotion"
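One way to make the version-tuple concrete is an immutable record with a content-derived ID. This is a sketch under assumptions: the class name, field spellings, and the 12-character truncated SHA-256 are illustrative choices, not a registry's actual format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)  # frozen: a release tuple cannot be mutated after creation
class ReleaseTuple:
    base_model_version: str
    fine_tune_weights: str        # e.g. a checksum of the adapter weights
    prompt_version: str
    tool_schema_version: str
    retrieval_index_version: str
    eval_pass_id: str

    def release_id(self) -> str:
        """Short deterministic ID derived from every component of the tuple."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]
```

Because the ID changes whenever any component changes, a log line carrying `release_id()` is enough to identify which prompt, index, and weights served a request, and a rollback target is a single value rather than six.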

Common follow-ups: - "What if the rollback target is itself buggy?" - "How do you version prompts?" - "What gets versioned for a RAG application?"

Traps: - Versioning only the model weights. Prompts and indices break rollbacks just as often. - Untested rollback path. The first time you exercise rollback shouldn't be during an incident.

Related cross-cutting: Production patterns Related module: learning/02_ai_infrastructure/04_ml_platform_operations/, learning/01_ai_engineering/13_prompt_lifecycle_operations/

Q: "How do you version and manage prompts in production?"

Tags: mid · very-common · design · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)

Answer outline:
- Prompts are code. Version them like code: in git, with PR review, with CI evals before merge.
- Storage: prompts in .md or .jinja files in the repo. Build-time templating where appropriate (system + tool definitions + few-shot composed at build). Reject "inline prompt string in application code": it gets edited without review or evals.
- Each prompt has a version identifier (file hash or semver tag). The runtime logs which prompt version handled each request. On incident, you know exactly which prompt was running.
- CI gate: every prompt change runs the eval suite on the new prompt. Quality regression > threshold blocks merge. Cost change > threshold is flagged for review.
- Deploy with canary, like a model swap. Prompt changes can regress quality dramatically; treat them with the same rollout discipline.
- Feature flag the prompt version so you can swap without a redeploy in emergencies. Per-tenant overrides for testing on specific cohorts.
- Numbers to drop: "every prompt change runs 200-500 example eval suite", "regression block threshold: -2% accuracy or +10% cost", "canary ramp same as models: 1 → 5 → 25 → 100%"

Common follow-ups: - "What about prompts generated at runtime?" - "How do you A/B prompt variants?" - "What's the failure mode of inline prompts?"

Traps: - Treating prompts as configuration that doesn't need rollout discipline. They behave like code. - No CI eval gate. Prompts ship that regress silently.

Related cross-cutting: Production patterns Related module: learning/01_ai_engineering/13_prompt_lifecycle_operations/, learning/02_ai_infrastructure/04_ml_platform_operations/

Q: "What is CI/CD for AI applications, and how does it differ from traditional CI/CD?"

Tags: mid · very-common · conceptual · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)

Answer outline:
- Same skeleton (commit → build → test → deploy) but with AI-specific gates and artifacts.
- The build stage: pull base model + fine-tune adapter + prompt template + retrieval index; assemble the artifact. Pin each component with a hash.
- The test stage adds three AI-specific gates traditional CI doesn't have:
  - Eval gate: run the held-out eval suite. Block on quality regression beyond threshold.
  - Safety gate: run the red-team probe subset. Block on new jailbreak/injection successes.
  - Cost gate: estimate cost-per-call on a representative input set. Block on >X% cost increase.
- Deploy stage: shadow → canary → ramp, with rollback automation. Versioning across the whole tuple, not just the binary.
- Different from traditional CI: AI systems have no deterministic unit-test equivalent for many failures. Quality is statistical; your gate is "regression < threshold on a held-out suite", not "all tests pass".
- Different from traditional CD: you cannot push every change instantly. The CI quality eval takes minutes; a full safety eval takes hours; the full red-team runs weekly. Faster ≠ better.
- Numbers to drop: "CI eval: 200-500 examples, <10 min", "CI safety: 100-300 probes, <30 min", "Full red-team weekly off-CI"
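"Quality is statistical" has a concrete consequence for threshold choice: a pass rate measured on a few hundred examples carries sampling noise of its own. The sketch below uses the standard binomial standard error; the function names are hypothetical, and it assumes the two runs are independent samples (a paired comparison on the same examples would be tighter).

```python
import math


def accuracy_standard_error(accuracy: float, n_examples: int) -> float:
    """Binomial standard error of a pass rate measured on n_examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_examples)


def noise_floor(accuracy: float, n_examples: int, z: float = 1.96) -> float:
    """Approximate accuracy gap between two independent runs that sampling
    noise alone can produce at roughly 95% confidence (z = 1.96)."""
    se_diff = math.sqrt(2.0) * accuracy_standard_error(accuracy, n_examples)
    return z * se_diff
```

At 0.80 accuracy on 400 examples the standard error is 0.02 and the noise floor is about 0.055, so a 2-point drop on a suite of that size is not distinguishable from noise by itself. Common responses are a larger suite, paired scoring on identical examples, or repeated runs before blocking.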

Common follow-ups: - "What's the equivalent of unit tests for LLM apps?" - "How fast can you push a prompt change?" - "How do you handle eval flakiness?"

Traps: - Treating LLM tests as deterministic. They're sample-based. - No cost gate. Cost regressions slip through if you only watch quality.

Related cross-cutting: Production patterns Related module: learning/02_ai_infrastructure/04_ml_platform_operations/, learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "How would you implement a CI/CD pipeline for machine learning models?"

Tags: mid · common · design · source: DataCamp Top 30 MLOps Interview Questions 2026

Answer outline:
- Pipeline stages:
  - Source: code + data + config in git (data via DVC/LakeFS pointers; large weights via model registry references).
  - Build: train (or fetch fine-tune), assemble artifact, generate model card, store in registry.
  - Test: unit tests on tokenizer / utility code; data-validation tests (schema, distribution, freshness); eval tests on held-out (regression vs current prod); safety/red-team probe subset; cost projection.
  - Stage: push to staging environment; smoke tests; LLM-judge sample of staged outputs vs prod outputs.
  - Deploy: shadow → canary → ramp → full. Automated rollback on guardrail breach.
  - Monitor: post-deploy dashboards, drift detection, cost tracking, incident-response hooks.
- Tooling: GitHub Actions / GitLab CI / Jenkins for orchestration; MLflow / W&B / Vertex AI Models for registry; Argo / Kubeflow for workflows; LangSmith / Arize / Weights & Biases for production observability.
- The senior tell: candidate names what blocks deployment, not just what runs. "Eval suite must pass" is the gate; "we run evals" is not.
- Numbers to drop: "CI run time: 5-15 min for fast gates, hours for full safety + drift", "canary stages timed to allow human review windows"

Common follow-ups: - "How do you handle a flaky eval?" - "What's the difference between staging and shadow?"

Traps: - Designing CI/CD without explicit gate metrics. CI/CD is a blocking mechanism, not just an automation.

Related cross-cutting: Production patterns Related module: learning/02_ai_infrastructure/04_ml_platform_operations/

Q: "What is the role of feature flags in AI deployments?"

Tags: mid · common · conceptual · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)

Answer outline:
- Feature flags decouple deploy from release. The new code/model ships to all instances but is only enabled for a subset of users/tenants via config.
- AI-specific uses: swap prompt versions, swap model versions, enable a new tool, toggle a guardrail, route a cohort to a fine-tuned vs base model. All without a redeploy.
- Combine with canary: the feature flag drives the cohort split; the platform tracks metrics per cohort.
- Per-tenant overrides: for B2B customers asking for the new feature early or staying on the old version, you flag per-tenant.
- Emergency kill-switch: every risky feature ships with a kill flag. Incident response toggles instantly without a code change.
- Trade-off: too many flags = combinatorial debugging hell. Flag, ramp, then remove the flag once stable.
- Numbers to drop: "kill-switch toggle time: seconds to apply globally", "remove flag within 1-2 sprints of full ramp", "per-tenant overrides: typically cover <5% of total traffic"
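The cohort split, per-tenant override, and kill-switch can be combined in one small resolver. A minimal sketch: `PromptFlag` and its fields are hypothetical, and the precedence order (kill-switch first, then overrides, then percentage rollout) is a design assumption rather than a standard.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class PromptFlag:
    name: str
    default_version: str
    new_version: str
    rollout_pct: int = 0          # share of tenants routed to new_version
    killed: bool = False          # emergency kill-switch
    tenant_overrides: dict[str, str] = field(default_factory=dict)

    def version_for(self, tenant_id: str) -> str:
        # The kill-switch wins over everything, including overrides.
        if self.killed:
            return self.default_version
        if tenant_id in self.tenant_overrides:
            return self.tenant_overrides[tenant_id]
        # Stable bucketing: a tenant stays in the same cohort across requests.
        key = f"{self.name}:{tenant_id}".encode("utf-8")
        bucket = int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % 100
        return self.new_version if bucket < self.rollout_pct else self.default_version
```

Salting the hash with the flag name keeps cohorts independent across flags, so the same tenants are not always the first to receive every change.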

Common follow-ups: - "When do you remove an old feature flag?" - "How do feature flags interact with caching?"

Traps: - Permanent feature flags. They turn into tech debt and obscure the actual code path. - Forgetting that feature flags affect prompt caching — flag-keyed prompts may have lower cache hit rates.

Related cross-cutting: Production patterns Related module: learning/02_ai_infrastructure/04_ml_platform_operations/

Traffic & scaling

Q: "Your application is hitting LLM provider rate limits during peak hours. How do you handle it?"

Tags: mid · very-common · debugging · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)

Answer outline:
- Diagnose first. Which limit? TPM (tokens per minute), RPM (requests per minute), or concurrent-request? Each fix is different.
- Immediate mitigations:
  - Exponential backoff with jitter on 429s. Most SDKs do this; verify it's actually enabled.
  - Token-bucket on the client to smooth bursts below the limit instead of hammering.
  - Per-tenant rate limiting so one customer's spike doesn't starve the rest.
  - Queue-and-defer: low-priority calls (batch jobs, eval runs) move off the hot path during peak hours.
- Structural fixes:
  - Multiple provider keys / accounts: shard your traffic across keys. Most providers grant additional capacity on request.
  - Multi-provider fallback: primary on Anthropic, fallback on OpenAI/Bedrock. Cost-equivalent models behind a router.
  - Self-hosted slice: route the cheap-tier traffic to your own vLLM cluster. Removes pressure on the API budget.
  - Provider provisioned-throughput tier: pay for guaranteed TPM (Bedrock, Vertex). Predictable but expensive.
- Process: monitor a leading-indicator metric (tokens consumed per minute as a fraction of limit). Alarm at 70% utilization, not at 100%.
- Numbers to drop: "exponential backoff: base 1s, max 30s, jitter ±20%", "alarm at 70% TPM utilization", "multi-key sharding: 2-5 keys typical for high-volume teams"
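The two client-side mitigations (token bucket and backoff with jitter) fit in a few lines. This is a sketch, not a production limiter: it is single-threaded, the class and function names are hypothetical, and the injectable `clock` exists only to make the behavior testable.

```python
import random
import time


class TokenBucket:
    """Client-side limiter: refill at rate_per_sec, allow bursts up to capacity."""

    def __init__(self, rate_per_sec: float, capacity: float, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.last = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0,
                  jitter: float = 0.2) -> float:
    """Exponential backoff (base 1s, capped at 30s) with +/-20% jitter."""
    delay = min(cap, base * (2 ** attempt))
    return delay * random.uniform(1.0 - jitter, 1.0 + jitter)
```

For a TPM limit, `cost` can be the estimated token count of the request instead of 1, so a single long-context call consumes its real share of the budget.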

Common follow-ups: - "What if the provider is just down?" - "How would you implement the token bucket?"

Traps: - Just adding retries. Immediate retries without backoff and jitter make a rate limit worse (retry storm). - No leading-indicator monitoring. You only notice when traffic is already getting dropped.

Related cross-cutting: Cost & latency, Production patterns Related module: learning/02_ai_infrastructure/04_ml_platform_operations/, learning/01_ai_engineering/04_resilient_agent_systems/

Q: "A traffic spike brings down your AI system. How do you handle peak traffic?"

Tags: senior · common · debugging · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)

Answer outline:
- Immediate: stop the bleeding. Shed load with backpressure: return 503 / "service degraded" to a fraction of traffic so the rest stays functional. Rate-limit the noisy tenant if one customer is the spike.
- Identify the choke point. Provider rate limits? GPU capacity? Downstream tool service? Logging backend? Each has a different fix.
- Common LLM-specific choke points:
  - Provider TPM exhausted → spread across keys, shed to a cheaper model.
  - Self-hosted GPU capacity → autoscale up (slow if it's a cold GPU spin-up), shed to API providers as overflow.
  - Long-context requests crowding the KV cache → gate max input length, route long-context to a separate pool.
- Structural prevention: autoscaling with predictive (not just reactive) scale-up, that is, scaling ahead of the spike based on day-of-week/hour patterns. Provisioned throughput from the provider as a base capacity, with API overflow.
- Post-incident: capacity planning. What was peak QPS? What's the headroom? Where would you have liked an alarm to fire earlier?
- Numbers to drop: "load-shed at 80% capacity, not 100%", "autoscale lead-time: minutes for K8s, 10-30 min for GPU instances", "headroom target: 30-50% above expected peak"
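Shedding "a fraction of traffic" works best when the fraction is chosen by priority. The sketch below starts shedding at 80% utilization, as in the numbers above; the three tiers and the points at which each tier is shed are assumptions made for illustration.

```python
def should_shed(priority: int, utilization: float, shed_start: float = 0.80) -> bool:
    """Decide whether to reject a request with a 503 under load.

    priority: 0 = interactive, 1 = standard, 2 = batch/eval (hypothetical tiers).
    utilization: current load as a fraction of capacity, in [0, 1].
    """
    if utilization < shed_start:
        return False
    # How far we are between the shed threshold and full saturation.
    progress = min(1.0, (utilization - shed_start) / (1.0 - shed_start))
    if priority >= 2:
        return True               # batch work is shed as soon as shedding starts
    if priority == 1:
        return progress >= 0.5    # standard traffic is shed from the midpoint
    return progress >= 1.0        # interactive traffic only at saturation
```

With the defaults, batch traffic is rejected from 80% utilization, standard traffic from 90%, and interactive traffic only at 100%, so the highest-value requests are the last to fail.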

Common follow-ups: - "How do you decide who gets shed?" - "What does 'predictive autoscaling' look like?" - "When does this kind of incident happen?"

Traps: - Reaching only for autoscaling. GPU autoscale is slow; you need load-shedding and tiered degradation for the minutes it takes new capacity to come online. - No priority tiering. All traffic is equal under load → all traffic fails together.

Related cross-cutting: Production patterns, Cost & latency Related module: learning/02_ai_infrastructure/04_ml_platform_operations/, learning/01_ai_engineering/04_resilient_agent_systems/

Q: "Your AI system handles 100 requests/sec but crashes at 5000. How do you scale for concurrent requests?"

Tags: senior · common · design · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)

Answer outline:
- Profile first. At 5k QPS what's saturated? CPU? GPU? Memory? KV cache? Downstream tool latency? Database? Each has a different solution.
- LLM-specific bottlenecks at scale:
  - GPU compute / memory bandwidth: the vLLM batch is full (the concurrent-sequence limit, e.g. `max_num_seqs`, is reached); add replicas behind a load balancer.
  - KV cache memory: long-context requests are evicting each other. Cap max input length per tenant; route long-context separately.
  - Tokenizer / preprocessing CPU: parallelize at the application layer; consider a separate tokenizer pool.
  - Provider rate limits: multi-key sharding, multi-provider fallback.
  - Downstream tool calls: tool services scale independently; sometimes they're the actual bottleneck.
- Horizontal scaling pattern: K8s/Ray Serve/SageMaker with autoscaling on a GPU-aware metric (queue depth, tokens-in-flight, not just CPU). Replicas behind a smart load balancer that's aware of in-flight tokens, not naive round-robin.
- Stateless design: every request must be servable by any replica. Conversation memory in an external store (Redis, Postgres), not in-process.
- Async / queue tier: any non-interactive work (eval grading, batch summaries) goes through a queue + worker pool, not the synchronous LLM path.
- Caching layer in front to absorb repeat traffic.
- Numbers to drop: "vLLM single H100: 200-2000 tokens/sec depending on model and concurrency; aggregate concurrency 50-200 simultaneous sequences per GPU", "load balancer choice: token-aware over round-robin for LLM"
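The difference between token-aware routing and round-robin shows up as soon as request sizes vary. A minimal sketch, assuming the caller can estimate a request's token count up front; `TokenAwareBalancer` is a hypothetical name and the class is not thread-safe.

```python
class TokenAwareBalancer:
    """Route each request to the replica with the fewest estimated tokens in flight."""

    def __init__(self, replicas: list[str]):
        self.in_flight = {replica: 0 for replica in replicas}

    def acquire(self, estimated_tokens: int) -> str:
        # Pick the least-loaded replica by tokens, not by request count.
        replica = min(self.in_flight, key=self.in_flight.get)
        self.in_flight[replica] += estimated_tokens
        return replica

    def release(self, replica: str, estimated_tokens: int) -> None:
        # Call when the request finishes (or fails) to free its share.
        self.in_flight[replica] = max(0, self.in_flight[replica] - estimated_tokens)
```

With two replicas, an 8000-token request followed by two 200-token requests sends both small requests to the second replica, whereas round-robin would queue one of them behind the long request.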

Common follow-ups: - "Why is round-robin bad for LLM?" - "Why are LLM workloads memory-bound?" - "Where does conversation memory live?"

Traps: - Scaling CPU/memory naively without GPU-aware metrics. - Forgetting that GPU autoscale is slow (10-30 min cold start) — over-provision headroom.

Related cross-cutting: Production patterns, Cost & latency Related module: learning/02_ai_infrastructure/04_ml_platform_operations/, learning/02_ai_infrastructure/02_inference_serving_systems/

Q: "How do you implement auto-scaling for AI workloads?"

Tags: senior · common · design · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)

Answer outline:
- AI workloads need workload-specific signals, not CPU/RAM:
  - Queue depth: number of pending requests in front of LLM workers. Most direct signal.
  - Tokens-in-flight: total tokens currently being processed across replicas. Better proxy for GPU saturation than request count.
  - GPU utilization: useful but lags reality; can be misleadingly low even when memory-bound.
  - P99 TTFT: latency-based trigger; scale up when user-perceived latency degrades.
- Combine signals. Single-signal scaling oscillates. Stable scaling = composite metric (e.g., weighted sum of queue depth and TTFT) with hysteresis.
- Scale-up speed: GPU instances are slow to spin up (10-30 min from cold). Pre-warm a small reserve pool. Scale down conservatively (e.g., wait for 15-30 min of low load before reducing).
- Cost vs latency trade-off: keep some over-provisioning (30-50% headroom) so a spike doesn't have to wait for cold-start.
- Predictive layer: for predictable workloads (business-hour peaks, weekly cycles), schedule scale-ups before the spike rather than reacting to it.
- Tools: KEDA for K8s with custom metrics, Ray Serve autoscaler, vLLM replicas scaled by an external autoscaler (vLLM itself serves a single engine), cloud-native (SageMaker / Vertex / Bedrock provisioned-throughput-with-burst).
- Numbers to drop: "scale-up trigger: queue depth > 10 sustained 60s, or p99 TTFT > 800ms", "scale-down delay: 15-30 min cooldown", "headroom: 30-50%"
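Hysteresis plus a scale-down cooldown can be sketched as a small state machine. The scale-up thresholds follow the numbers above; the lower "idle" thresholds, the one-replica step size, and the assumption that the caller passes metrics already aggregated over a 60s window are choices made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Autoscaler:
    replicas: int
    min_replicas: int = 1
    max_replicas: int = 16
    cooldown_s: float = 900.0          # 15 min of low load before scaling down
    low_since: Optional[float] = None  # when the current idle period started

    def step(self, now: float, queue_depth: float, p99_ttft_ms: float) -> int:
        overloaded = queue_depth > 10 or p99_ttft_ms > 800
        # Idle thresholds sit well below the overload ones: that gap is the hysteresis.
        idle = queue_depth < 2 and p99_ttft_ms < 400
        if overloaded:
            self.low_since = None
            self.replicas = min(self.max_replicas, self.replicas + 1)
        elif idle:
            if self.low_since is None:
                self.low_since = now
            elif now - self.low_since >= self.cooldown_s:
                self.replicas = max(self.min_replicas, self.replicas - 1)
                self.low_since = now   # restart the cooldown after each step down
        else:
            self.low_since = None      # in-band load resets the idle timer
        return self.replicas
```

Scale-up is immediate while scale-down requires a full uninterrupted idle window, which is the asymmetry that avoids cold-starting again when a spike returns after a short lull.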

Common follow-ups: - "What metric would you scale on for a chat product specifically?" - "How do you avoid scale-down during a temporary lull?"

Traps: - CPU-based autoscaling for GPU workloads. Wrong signal. - Aggressive scale-down. The spike comes back and you cold-start.

Related cross-cutting: Production patterns, Cost & latency Related module: learning/02_ai_infrastructure/04_ml_platform_operations/, learning/02_ai_infrastructure/02_inference_serving_systems/

Q: "How do you ensure high availability and fault tolerance for ML models in production?"

Tags: senior · common · design · source: DataCamp Top 30 MLOps Interview Questions 2026

Answer outline:
- Layers of redundancy:
  - Replica redundancy: N-of-M replica setup behind a load balancer; losing one replica doesn't drop traffic.
  - Zone / region redundancy: replicas across availability zones; for serious uptime, across regions with traffic steering.
  - Provider redundancy: multi-provider fallback (Anthropic + OpenAI + self-hosted) for the API tier. One provider outage degrades quality but doesn't take you down.
- Circuit breakers: per-dependency circuits; when a downstream is failing, fast-fail and degrade rather than tying up replicas.
- Graceful degradation: define what "degraded mode" looks like. Some examples: serve cached responses, fall back to a cheaper model, return a static apology, queue and reply later. Always better than total failure.
- Health checks: deep health (does the model produce a valid response on a synthetic prompt?), not just shallow (process alive). The load balancer removes failing replicas.
- Backups for stateful components: vector indices (snapshot + restore plan), conversation memory (replicated store), prompt store (git is the source of truth).
- Chaos testing: periodically kill replicas, simulate provider outages, drop dependencies. Find broken assumptions before production does.
- Numbers to drop: "target SLA: 99.9% for most LLM products, 99.95% for high-stakes", "circuit-breaker threshold: 5 consecutive failures, half-open after 30s", "multi-provider fallback as Tier 1 disaster recovery"
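The circuit-breaker numbers above (open after 5 consecutive failures, half-open after 30s) map onto a compact class. This is a simplified sketch with hypothetical names: in the half-open state it lets every request through until an outcome is recorded, whereas a stricter breaker would admit a single trial request.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0        # consecutive failures since the last success
        self.opened_at = None    # time the circuit opened, or None if closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True          # closed: traffic flows normally
        # Open: fast-fail until the cool-down has passed, then half-open.
        return self.clock() - self.opened_at >= self.reset_after_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None    # a success closes the circuit

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()   # open, or re-open after a failed trial
```

A caller checks `allow()` before contacting the dependency and switches to the degraded path (cache, cheaper model, static response) when it returns False.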

Common follow-ups: - "What's your degraded-mode design?" - "How often do you chaos-test?"

Traps: - Shallow health checks. They miss model-side failures. - No multi-provider fallback. A single-provider stack is one outage away from total downtime.

Related cross-cutting: Production patterns Related module: learning/02_ai_infrastructure/04_ml_platform_operations/, learning/01_ai_engineering/04_resilient_agent_systems/

Monitoring & drift

Q: "How do you monitor LLM applications in production?"

Tags: mid · very-common · design · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)

Answer outline:
- Three layers, all required:
  - Health metrics (every request, 100% sampled): TTFT, TPOT, end-to-end latency, error rate, cost per call, refusal rate, tokens in/out. Standard SLO targets per metric.
  - Quality metrics (sampled 1-10%): LLM-as-judge faithfulness/groundedness/correctness, agreement rate vs reference, human review on a smaller sample. Quality is statistical and expensive to score, so it is measured on a sample rather than on every call.
  - Behavioral metrics: tool-call success rate, conversation depth distribution, retry rate, escalation rate, user satisfaction (CSAT, thumbs).
- Per-version + per-tenant breakdowns: aggregate dashboards hide regressions on specific cohorts. Always slice.
- Alarms: rate-based, not point-based. "Refusal rate >5% for 30 min" is actionable; "one bad response" is not.
- Tooling: LangSmith, Arize, Langfuse, Honeycomb, Datadog. Trace every request with a span tree (LLM call, tool call, retrieval, guardrail check) so you can debug specific traces.
- Numbers to drop: "sample rate for quality: 1-10% of traffic, depending on cost budget", "alarm window: 15-30 min of sustained breach", "trace retention: 7-30 days hot, longer in cold storage"
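A rate-based alarm such as "refusal rate >5% for 30 min" reduces to tracking when the current breach started. A minimal sketch with a hypothetical class name; it assumes the caller feeds it a rate already computed per reporting interval.

```python
class SustainedBreachAlarm:
    """Fire only when a rate stays above threshold for a full window."""

    def __init__(self, threshold: float, window_s: float):
        self.threshold = threshold
        self.window_s = window_s
        self.breach_started = None   # timestamp of the first breaching observation

    def observe(self, now: float, rate: float) -> bool:
        if rate <= self.threshold:
            self.breach_started = None   # any healthy reading resets the window
            return False
        if self.breach_started is None:
            self.breach_started = now
        return now - self.breach_started >= self.window_s
```

Running one instance per (model version, prompt version, tenant) slice keeps a regression in a single cohort from being averaged away by the aggregate.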

Common follow-ups: - "What's the SLO for an LLM product?" - "How do you set the quality threshold?" - "How do you find the rare-but-bad failure modes?"

Traps: - Monitoring only health (latency / errors). Quality regressions hide behind green health dashboards. - No per-version breakdown. A regression in one slice gets averaged away.

Related cross-cutting: Production patterns Related module: learning/02_ai_infrastructure/04_ml_platform_operations/, learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "What is model or concept drift, and how do you detect it for LLMs?"

Tags: mid · common · conceptual · source: DataCamp Top 30 MLOps Interview Questions 2026

Answer outline:
- Drift = the model's performance degrades over time because the world changed even though the model didn't.
- Two flavors:
  - Data drift: input distribution shifts. Users start asking questions about a topic the model wasn't optimized for; new slang, new product names.
  - Concept drift: the correct answer changes. Pricing updated, policy updated, regulations changed, but the model still answers the old way (or RAG retrieves stale docs).
- For LLMs both manifest as quality regressions on real traffic even though offline evals stay green. Your golden set didn't drift; production did.
- Detection:
  - Input-side: embed sampled inputs over time, compare the distribution to a baseline (KL/MMD/topic-cluster shift). Sudden changes warrant attention.
  - Output-side: LLM-judge sampled outputs over time; track the quality score. Trend down → drift.
  - Business metrics: CSAT, deflection rate, escalation rate. Trends down → something is regressing even if you can't pinpoint it.
- Response: triage drift signals, promote new failure cases into the golden set, retrain or update the prompt / retrieval. Without continuous golden-set updates, your offline evals become less and less representative.
- Numbers to drop: "weekly drift check: 500-2000 sampled inputs, compare to last week", "quality-trend alarm: >5% drop in LLM-judge score over 2 weeks", "golden-set freshness: promote 5-20 new cases per week"
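The input-side check can be illustrated with KL divergence over topic-cluster histograms. The sketch assumes each sampled input has already been assigned a topic label (for example by clustering embeddings); the function names, the add-one smoothing, and any alert threshold are assumptions of this example.

```python
import math
from collections import Counter


def topic_distribution(labels: list[str], vocabulary: list[str],
                       smoothing: float = 1.0) -> list[float]:
    """Smoothed share of each topic; smoothing avoids zero probabilities."""
    counts = Counter(labels)
    total = len(labels) + smoothing * len(vocabulary)
    return [(counts[topic] + smoothing) / total for topic in vocabulary]


def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) in nats; both inputs must be strictly positive and sum to 1."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

A weekly job would compute `kl_divergence(this_week, baseline)` on the sampled inputs and flag the week for triage when the value jumps relative to its own history, since the absolute scale depends on the number of topics and the sample size.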

Common follow-ups: - "How is LLM drift different from classical ML drift?" - "What about embedding drift in RAG?"

Traps: - Static golden set. Without continuous updates, you don't detect drift; you only detect bugs in the eval. - Treating drift as a one-shot retrain trigger. It's a continuous loop.

Related cross-cutting: Production patterns Related module: learning/02_ai_infrastructure/04_ml_platform_operations/, learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "Why is monitoring important in MLOps, and what metrics should you track?"

Tags: screen · very-common · conceptual · source: DataCamp Top 30 MLOps Interview Questions 2026

Answer outline:
- Monitoring catches the failures that offline evals miss: distribution shift, drift, infra issues, provider problems, abuse patterns, cost runaway.
- Track, by layer:
  - System health: latency (TTFT, TPOT, E2E), error rate, throughput, GPU utilization, queue depth.
  - Cost: $/request, $/tenant, monthly spend trend, cache hit rate.
  - Quality: LLM-judge faithfulness/correctness on sampled traffic, agreement rate vs reference, regression vs prior version.
  - Safety: refusal rate, guardrail block rate, jailbreak success rate (red-team sampled), PII-detector hit rate.
  - Behavioral: tool-call success rate, conversation depth distribution, retry/escalation rates.
  - Business: CSAT, deflection rate, completion rate, churn signals.
- All metrics sliced by model version, prompt version, tenant, intent class. Aggregates hide regressions.
- Alarms on rate / trend, not on single events. SLO-based alerting against explicit targets.
- Senior tell: candidate ties metrics to actions, such as "if X is true, page on-call" or "if Y trends down for 2 weeks, run drift triage". Metrics without actions are wallpaper.
- Numbers to drop: "SLO refusal rate < 5%", "quality regression > 3% triggers eval-team triage", "cost growth > 20% WoW alarms"

Common follow-ups: - "Which metric matters most?" - "How do you tie metrics to action?"

Traps: - Listing metrics without actions. - Forgetting business metrics. Health is necessary, not sufficient.

Related cross-cutting: Production patterns Related module: learning/02_ai_infrastructure/04_ml_platform_operations/, learning/04_ai_product_evals/00_ai_evals_release_gates/

Testing & validation¶

Q: "What testing should be done before deploying an ML model into production?"¶

Tags: mid · common · design · source: DataCamp Top 30 MLOps Interview Questions 2026

Answer outline: - Eight tests, all gating: - Unit tests on data-preprocessing, tokenization, post-processing utilities. - Eval suite on held-out target-task examples. Pass threshold: ≥ current production performance. - Off-task / forgetting eval: catches regression on capabilities outside the target task. - Safety / red-team probe subset: catches new jailbreaks, injection holes, harmful-content elicitation. - Cost projection: estimate $/call on a representative input set; gate on >X% increase. - Latency profile: TTFT and TPOT on a benchmark input set; gate on regression. - Schema / format compliance: structured outputs must match schema; gate on % invalid responses. - Smoke test in staging: end-to-end real call against the deployed artifact. - Plus shadow deployment (1-7 days) to catch distribution-shift bugs the offline tests miss. - Each test is automated; CI blocks promotion on failure. Manual overrides require explicit sign-off + a documented reason. - Numbers to drop: "eval suite: 200-500 examples", "off-task: 200 examples", "safety: 100-300 probes", "shadow window: 1-7 days"
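As an illustration of a gate with an explicit pass/fail rule, here is a sketch of the schema / format compliance check. The required fields and the 1% threshold are assumptions for the example, not values from any particular system.

```python
import json

# Hypothetical output schema: field name -> required Python type.
REQUIRED_FIELDS = {"answer": str, "citations": list}


def is_valid(raw: str) -> bool:
    """A response is valid if it parses as a JSON object with the required fields."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict):
        return False
    return all(isinstance(obj.get(k), t) for k, t in REQUIRED_FIELDS.items())


def invalid_rate(outputs):
    """Fraction of responses that fail validation."""
    return sum(not is_valid(o) for o in outputs) / len(outputs)


def schema_gate(outputs, max_invalid_rate=0.01):
    """CI-style gate: pass only if the invalid rate is within the threshold."""
    return invalid_rate(outputs) <= max_invalid_rate


sample = [
    '{"answer": "ok", "citations": []}',
    "not json",
    '{"answer": 3, "citations": []}',
    '{"answer": "missing citations"}',
]
```

Here `invalid_rate(sample)` is 0.75, so `schema_gate(sample)` fails and promotion would be blocked.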

Common follow-ups: - "Which test is most likely to catch a regression?" - "What gets manually overridden vs always-gating?"

Traps: - Listing tests without thresholds. "We run evals" doesn't gate anything without an explicit pass/fail rule.

Related cross-cutting: Production patterns Related module: learning/02_ai_infrastructure/04_ml_platform_operations/, learning/04_ai_product_evals/00_ai_evals_release_gates/

Q: "How do you approach automating model retraining in an MLOps pipeline?"¶

Tags: senior · common · design · source: DataCamp Top 30 MLOps Interview Questions 2026

Answer outline: - Trigger types: - Scheduled: retrain weekly / monthly on rolling data window. Simple, predictable, often enough. - Performance-triggered: drift detector or quality-trend monitor fires → retrain. More responsive but harder to debug. - Event-triggered: base-model upgrade, prompt change, new training data batch ready, regulation change. Specific events that demand re-tune. - For LLMs in 2026 the common cadence is event-triggered (base upgrade) + scheduled (quarterly refresh on rolling production data). Fully automated continuous retraining is rare, because the eval+safety gate needs human attention. - Pipeline: data prep (dedupe, PII redact, quality filter) → train (LoRA on a small instance) → eval → safety → cost projection → if pass, register new candidate → shadow → canary → promote. - Guardrails: by default, never auto-promote without an explicit human sign-off on safety + an over-refusal eval. Fully automatic promotion is defensible only for narrow, well-eval'd tasks where the risk of bad data is low. - Data lineage: every retrain logs (data version, base model version, training config, eval pass) so the artifact is reproducible. - Numbers to drop: "scheduled cadence: weekly to quarterly", "human gate before any production-affecting auto-promote", "rolling data window: 30-90 days typical"
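The trigger types, the lineage record, and the human gate can be sketched in a few lines. All names here are hypothetical, and the 90-day maximum age is an assumed stand-in for a quarterly schedule.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Lineage:
    """What each retrain logs so the artifact can be reproduced."""
    data_version: str
    base_model_version: str
    training_config: str
    eval_passed: bool


def retrain_reasons(today, last_trained, drift_alarm, base_model_changed,
                    max_age=timedelta(days=90)):
    """Collect every trigger that currently applies; an empty list means skip."""
    reasons = []
    if today - last_trained >= max_age:
        reasons.append("scheduled")
    if drift_alarm:
        reasons.append("performance")
    if base_model_changed:
        reasons.append("event")
    return reasons


def may_promote(lineage, human_signoff):
    """Default policy: automated gates AND an explicit human sign-off."""
    return lineage.eval_passed and human_signoff


reasons = retrain_reasons(date(2026, 4, 1), date(2026, 1, 1),
                          drift_alarm=False, base_model_changed=True)
candidate = Lineage("data-2026-03", "base-v3", "lora-r16.yaml", eval_passed=True)
```

Returning the list of reasons, rather than a bare boolean, keeps the "why did this retrain run" question answerable from the logs.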

Common follow-ups: - "What's the failure mode of fully automated retraining?" - "How do you handle bad training data slipping in?" - "When do you skip retraining?"

Traps: - Fully automated retraining without human gate. Bad data + auto-promote = catastrophic regression. - No data-quality gate. Retraining on noisy data makes the model worse.

Related cross-cutting: Production patterns Related module: learning/02_ai_infrastructure/04_ml_platform_operations/, learning/00_ai_foundation/06_adaptation_compression/

Infrastructure & packaging¶

Q: "How do you select GPUs for LLM inference?"¶

Tags: senior · common · design · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)

Answer outline: - Three factors: model size (determines VRAM needed), throughput target (determines memory bandwidth), and cost (determines economic viability). - Memory: model weights + KV cache + activations. For a 7B FP16 model: ~14 GB weights + ~10-30 GB KV cache at production batch sizes. A 70B FP16: ~140 GB weights, needs multi-GPU. - Bandwidth: LLM decode is memory-bandwidth-bound. H200 (4.8 TB/s) beats H100 (3.35 TB/s) beats A100 (2 TB/s) almost linearly on decode throughput. - Selection: - 7B-13B serving: single A100 80 GB or H100. Pick A100 if cost-sensitive, H100 if latency-sensitive. - 30B-70B: H100/H200 with tensor parallelism (2-4 GPUs), or H200 single-card if it fits. - 70B+ at high QPS: H100/H200 multi-GPU, or B200 if available. - Edge / cost-extreme: quantized models on consumer GPUs (RTX 4090, L4) for narrow workloads. - Self-hosted vs API decision is upstream of this — most teams shouldn't self-host until volume justifies the ops cost. - Numbers to drop: "A100 80GB: $1-2/hr cloud. H100: $3-4/hr. H200: $5-7/hr. B200: premium.", "memory bandwidth ratio = throughput ratio on decode"
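The memory and bandwidth arithmetic can be checked with a short sketch. The model shape (32 layers, 32 KV heads, head dimension 128, FP16) is an assumed 7B-class configuration, and the batch of 16 sequences at 2048 tokens is illustrative. The decode ceiling is a rough roofline estimate that ignores KV-cache reads and kernel overhead.

```python
def weight_bytes(n_params, bytes_per_param=2):
    """Bytes needed for the weights (FP16 = 2 bytes per parameter)."""
    return n_params * bytes_per_param


def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """One key vector and one value vector per layer and KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def decode_ceiling_tokens_per_s(bandwidth_bytes_per_s, weights):
    """Single-sequence upper bound: every decoded token reads all weights once."""
    return bandwidth_bytes_per_s / weights


weights = weight_bytes(7_000_000_000)             # 14 GB for a 7B FP16 model
per_token = kv_bytes_per_token(32, 32, 128)       # 524,288 bytes per token
kv_cache_gb = per_token * 16 * 2048 / 1e9         # 16 sequences x 2048 tokens

h100_ceiling = decode_ceiling_tokens_per_s(3.35e12, weights)
h200_ceiling = decode_ceiling_tokens_per_s(4.8e12, weights)
```

Under these assumptions the KV cache is about 17 GB on top of 14 GB of weights, and the H200-to-H100 ceiling ratio equals the bandwidth ratio (about 1.43), which is where the "bandwidth ratio is throughput ratio" rule of thumb comes from.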

Common follow-ups: - "When does B200 win over H200?" - "What about consumer GPUs?" - "How do you decide single-GPU vs multi-GPU?"

Traps: - Picking the newest GPU automatically. The cost-per-request matters more than raw throughput. - Forgetting KV cache memory. Weights fit but KV cache OOMs under load.

Related cross-cutting: Cost & latency Related module: learning/02_ai_infrastructure/02_inference_serving_systems/, learning/02_ai_infrastructure/04_ml_platform_operations/

Q: "What is the role of load balancing in AI serving infrastructure?"¶

Tags: mid · common · conceptual · source: Amit Shekhar AI engineering questions repo (GitHub, 2026)

Answer outline: - Distributes incoming requests across replicas; standard role. - LLM-specific concerns: - Token-aware routing: naive round-robin treats a 50k-token request and a 500-token request as equal load, so replicas end up unevenly loaded. Token-aware load balancers (vLLM has one; some service meshes do) route based on in-flight tokens or queue depth per replica. - Sticky sessions for stateful flows (conversations with KV cache reuse across turns): some setups pin a conversation to a replica to maximize KV cache hits on the prefix. - Health-aware routing: deep health checks (real model invocation), automatic eviction of unhealthy replicas. - Priority lanes: route paid-tier traffic to dedicated replicas while free-tier shares a pool. Avoids one tenant's spike degrading another's. - For multi-region / multi-provider: traffic steering at the edge (Cloudflare, AWS Global Accelerator) routes by latency, region, or failover state. - Numbers to drop: "token-aware routing: 10-30% throughput uplift vs round-robin on variable-length workloads", "sticky session: cache-hit rate improvement of 20-50% for multi-turn"
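A toy router makes the contrast with round-robin concrete. This is a sketch, not how any particular serving stack implements it: the class, replica names, and session handling are hypothetical. It routes each request to the replica with the fewest in-flight tokens and optionally pins a session to its first replica.

```python
class TokenAwareRouter:
    def __init__(self, replicas):
        self.in_flight = {r: 0 for r in replicas}   # tokens currently being served
        self.pinned = {}                            # session_id -> replica

    def route(self, tokens, session_id=None):
        """Prefer the session's pinned replica, else the least-loaded one."""
        replica = self.pinned.get(session_id) if session_id is not None else None
        if replica is None:
            replica = min(self.in_flight, key=self.in_flight.get)
            if session_id is not None:
                self.pinned[session_id] = replica
        self.in_flight[replica] += tokens
        return replica

    def complete(self, replica, tokens):
        """Release the tokens once the request finishes."""
        self.in_flight[replica] -= tokens


router = TokenAwareRouter(["replica-a", "replica-b"])
placements = [router.route(t) for t in (50_000, 500, 500, 500)]
```

After the 50k-token request lands on `replica-a`, the three short requests all go to `replica-b`; round-robin would have queued one of them behind the long request.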

Common follow-ups: - "Why is round-robin suboptimal for LLM?" - "How do you handle sticky sessions during scale-down?"

Traps: - Treating LLM workloads as homogeneous HTTP requests. They're not — request cost varies by 100×.

Related cross-cutting: Production patterns, Cost & latency Related module: learning/02_ai_infrastructure/04_ml_platform_operations/, learning/02_ai_infrastructure/02_inference_serving_systems/

Q: "What are the ways of packaging ML models?"¶

Tags: mid · common · conceptual · source: DataCamp Top 30 MLOps Interview Questions 2026

Answer outline: - For LLM systems, "packaging" is the artifact-tuple: (weights or weight pointer, runtime/serving engine, prompt template, tool schemas, config). All immutable, all versioned. - Common patterns: - Docker image with the serving engine (vLLM, TGI, TensorRT-LLM) + weights baked in or fetched at startup. Heaviest but most reproducible. - Weights in object store + image fetches at startup: smaller images, slower cold-start. Good for many-model setups. - Model registry references: MLflow / W&B / Vertex Models stores the canonical artifact; deploy systems pull by version tag. - Adapter-only packaging: ship only the LoRA adapter; the base model lives separately on the serving host. Great for multi-tenant adapter farms. - For SaaS API-tier (no self-hosting): packaging is mainly the prompt + config + tool schema bundle, version-tagged, with the API provider's model ID pinned. - Standards: ONNX for cross-framework portability (less relevant for LLM serving where vLLM/TGI are dominant), Hugging Face safetensors for weight format, GGUF for llama.cpp/Ollama ecosystem. - Numbers to drop: "Docker image size for vLLM + 7B model: 15-25 GB", "adapter-only artifact: 50-200 MB"
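The artifact-tuple idea can be sketched as a frozen record with a content-derived version ID. Field names, the example values, and the 12-character hash prefix are all assumptions for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class ArtifactTuple:
    weights_ref: str       # pointer to weights, not the weights themselves
    serving_engine: str
    prompt_template: str
    tool_schemas: str      # serialized JSON
    config: str            # serialized JSON

    def version_id(self):
        """Deterministic ID: any change to any field yields a new version."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


# Hypothetical release; every value here is a placeholder.
release = ArtifactTuple(
    weights_ref="registry://example/base-7b@v3",
    serving_engine="vllm",
    prompt_template="You are a support assistant. {question}",
    tool_schemas="[]",
    config='{"temperature": 0.2}',
)
prompt_change = replace(release, prompt_template="You are a concise assistant. {question}")
```

Because the ID covers the whole tuple, a prompt-only change produces a new deployable version, which is what makes "which prompt version was running" answerable during an incident.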

Common follow-ups: - "Where do the secrets / API keys live in the artifact?" - "How does adapter packaging help multi-tenant?"

Traps: - Baking secrets into images. - Treating the prompt as separate from the model artifact. The deployable unit is the tuple.

Related cross-cutting: Production patterns Related module: learning/02_ai_infrastructure/04_ml_platform_operations/, learning/02_ai_infrastructure/02_inference_serving_systems/

Q: "How do you deploy ML models on the cloud?"¶

Tags: mid · common · design · source: DataCamp Top 30 MLOps Interview Questions 2026

Answer outline: - For LLMs in 2026, three paths: - Managed API: provider-hosted (Anthropic, OpenAI, Google Vertex, AWS Bedrock). You ship prompts + tools; provider runs the model. Simplest, fastest to ship. - Managed serving service: Bedrock provisioned-throughput, Vertex Online Predictions, Azure ML Online Endpoints, SageMaker Endpoints. You own the model artifact; cloud runs the serving infra. Middle ground. - Self-hosted: vLLM / TGI / TensorRT-LLM on your own GPU cluster (EKS / GKE / AKS / Ray Serve). Most control, most ops burden. Justified at high volume or strict data residency. - Whichever path you pick, the pipeline is the same: artifact in registry → CI gates pass → deploy (blue-green or canary) → monitoring + rollback wired up → ramp to full. - For multi-region: replicate the artifact, deploy in each region, traffic steering at the edge. - IaC: Terraform / Pulumi for infra, Helm / Kustomize for K8s deploys. Drift-free; everything in git. - Numbers to drop: "managed API: hours to ship. Managed serving: days. Self-hosted: weeks-to-months for production-grade."
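The "deploy, monitor, ramp to full" tail of the pipeline can be sketched as a staged ramp with a rollback path. The stage percentages and the health callback are illustrative; in practice the health check would read the monitoring metrics for the candidate at that traffic share.

```python
def ramp(is_healthy, stages=(1, 5, 25, 100)):
    """Walk through traffic stages; roll back to 0% on the first unhealthy stage.

    is_healthy(pct) is a caller-supplied check (hypothetical here) that
    inspects metrics while the candidate serves pct% of traffic.
    """
    history = []
    for pct in stages:
        history.append(pct)
        if not is_healthy(pct):
            history.append(0)  # rollback: candidate serves no traffic
            return "rolled_back", history
    return "promoted", history


# Candidate that looks fine at low traffic but regresses at 25%.
outcome, history = ramp(lambda pct: pct < 25)
```

Here the ramp stops at 25% and returns to 0%, so the regression never reaches full traffic.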

Common follow-ups: - "When does self-hosted make sense?" - "What's the trade-off between Bedrock provisioned throughput and on-demand?"

Traps: - Defaulting to self-hosted because "we want control". The ops cost is real. - No IaC. Hand-deployed cloud resources rot fast.

Related cross-cutting: Cost & latency, Architecture choices Related module: learning/02_ai_infrastructure/04_ml_platform_operations/, learning/01_ai_engineering/12_model_vendor_strategy/

Scenario / incident response¶

Q: "Your model deploy broke production. Walk me through the incident response."¶

Tags: senior · very-common · scenario · source: standard senior incident-response probe; reported across 2026 AI engineer loops

Answer outline: - Step 0 — confirm the signal. What broke? Latency spike, quality drop, error storm, cost explosion? Reproduce on a clean session if possible. - Step 1 — stop the bleeding. Rollback to the previous version-tuple. Should be a one-button operation with ~2 min wall-clock. Engage incident channel; assign IC, comms, scribe roles. - Step 2 — confirm rollback worked. Watch the dashboards: metrics returning to baseline? Mostly yes within minutes; full settle within ~30 min. - Step 3 — communicate. Internal status page first, customer-facing status second (if user-visible impact), executive summary by end-of-incident. - Step 4 — do not hot-fix forward unless rollback is impossible. The temptation is to "just fix it" — that creates new untested artifacts mid-incident. Rollback first; investigate second. - Step 5 — investigate. What did the eval suite miss? What signal would have caught it in shadow / canary? Pull traces for failed requests; cluster by failure mode. - Step 6 — postmortem. Blameless. What signal caught it? How long from deploy to detection to rollback? What's the structural fix to prevent recurrence? - Step 7 — add to regression suite. The specific failure becomes a permanent test case. The eval suite must catch this class of bug going forward. - Numbers to drop: "rollback decision target: <5 min from page", "rollback execution: <2 min", "postmortem due: within 1 week of incident"
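For the postmortem question "how long from deploy to detection to rollback?", a small sketch that derives the intervals from four timestamps and checks them against the targets above (decision under 5 minutes from page, execution under 2 minutes). The function names and the example timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def incident_timings(deployed, paged, rollback_started, rollback_done):
    """Split the incident into detection, decision, and execution intervals."""
    return {
        "detect": paged - deployed,
        "decide": rollback_started - paged,
        "execute": rollback_done - rollback_started,
    }


def missed_targets(timings,
                   decide_target=timedelta(minutes=5),
                   execute_target=timedelta(minutes=2)):
    """Names of the intervals that did not meet their target."""
    missed = []
    if timings["decide"] >= decide_target:
        missed.append("decide")
    if timings["execute"] >= execute_target:
        missed.append("execute")
    return missed


timings = incident_timings(
    deployed=datetime(2026, 3, 2, 14, 0),
    paged=datetime(2026, 3, 2, 14, 18),
    rollback_started=datetime(2026, 3, 2, 14, 25),
    rollback_done=datetime(2026, 3, 2, 14, 26, 30),
)
```

In this example detection took 18 minutes, the rollback decision took 7 minutes (target missed), and execution took 90 seconds (target met), which points the structural fix at decision authority rather than tooling.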

Common follow-ups: - "What if rollback also breaks?" - "Who has rollback authority?" - "Have you ever skipped rollback and hot-fixed forward — when is that OK?"

Traps: - Hot-fixing forward instead of rolling back. New artifacts during an incident usually multiply the damage. - Blaming individuals in postmortem. Blame the system; fix the system.

Related cross-cutting: Production patterns Related module: learning/02_ai_infrastructure/04_ml_platform_operations/, learning/01_ai_engineering/05_ai_incident_operations/

Q: "Walk me through your most painful production incident."¶

Tags: senior · very-common · scenario · source: standard behavioral probe in AI engineer loops 2026

Answer outline: - Tell a real story. The interviewer is checking: did you actually run a system in prod? Do you have the scars? - Pick an incident where you can be honest about what went wrong and what you'd do differently. Cost incidents, quality regressions, prompt-injection mishaps, drift-driven failures are all common 2026 examples. - Structure: situation (what broke and how it manifested) → detection (how long it took, what alarmed) → response (what you did, in order) → resolution (what fixed it) → lessons (what you'd change). - Senior tells: the candidate is honest about the delay (most incidents are not detected for 10-30 min), the false starts (wrong hypothesis before the right one), and the structural change afterward (not "we'll be more careful" but "we added a specific gate / metric / runbook"). - Avoid: the squeaky-clean story where you detected immediately, hypothesized correctly, fixed in 5 minutes. Interviewers know that's a fantasy. - Numbers to drop: real ones from your incident: duration, customer impact, MTTR.

Common follow-ups: - "What would you do differently?" - "How did you tell customers?" - "How big a deal was this internally?"

Traps: - A perfect story. Sounds rehearsed. - Blaming a teammate. Even if true, doesn't reflect well.

Related cross-cutting: Production patterns Related module: learning/01_ai_engineering/05_ai_incident_operations/, learning/02_ai_infrastructure/04_ml_platform_operations/