
# 02. MLOps & Production — Narrative Explainer

> Module 17 · Companion files: 01_weekly_plan.md · 03_study_material.md · 04_daily_recall.md · 05_hands_on_lab.md · 06_revision.md

## Table of Contents

  1. ELI5 — The Factory Floor Analogy
  2. Chapter 1 — The Opening Failure
  3. Chapter 2 — Model Lifecycle Management
  4. Chapter 3 — CI/CD for ML
  5. Chapter 4 — Serving Infrastructure
  6. Chapter 5 — Monitoring and Maintenance
  7. Retrieval Prompts
  8. Honest Admission
  9. Chapter 6 — Recap, Interview Frame, and Bridge
  10. Foundation-Gap Audit
  11. What Comes Next

## ELI5 — The Factory Floor Analogy

Imagine a brilliant R&D lab. They built a product that works once. Everyone claps. The demo looks excellent. The notebook output looks sharp. The boss says, “Ship it next week.” Now pause. Training is the R&D lab. MLOps is the factory. The factory has a harder job. It must produce the same quality daily. It must survive noisy inputs, broken machines, and impatient customers.

In this story, remember five named helpers. They will stay with you throughout the module.

- the assembly line = the CI/CD pipeline.
- the quality gate = the automated eval before promotion.
- the warehouse = the model registry.
- the production monitor = observability and alerts.
- the upgrade without downtime = blue-green or canary deployment.

Now see the full picture.

R&D lab
training run
artifacts + metrics
the warehouse
the assembly line
the quality gate
production service
the production monitor
feedback, rollback, retraining

Suppose the R&D team invents a beautiful mixer grinder. In the lab, one engineer uses it gently. Voltage is stable. Ingredients are perfect. Nobody interrupts the test. Naturally, the machine looks wonderful. Now the factory must manufacture one lakh units. Suddenly new questions appear.

- Can we reproduce the exact build?
- Can we trace which motor went inside which batch?
- Can we stop defective units automatically?
- Can we replace version 2 without shutting factories?
- Can we notice defects before angry customers call?

That is exactly the MLOps problem. A model that worked once is not enough. A production system must be repeatable, measurable, recoverable, and economical. See, many juniors confuse model quality with system quality. The model may be strong. The system may still be weak. A weak system loses money quietly. One more picture.

Training success ≠ Production success

Training asks:
  "Can this model learn?"

Production asks:
  "Can this system deliver safely,
   repeatedly,
   cheaply,
   and with proof?"
If you remember only one thing, remember this. MLOps is not decoration around ML. MLOps is how ML becomes trustworthy business software.


## Chapter 1 — The Opening Failure

### 1.1 The notebook triumph

Your model works in a notebook. Accuracy looks good. A few screenshots look even better. The team deploys it happily. For two weeks, everyone relaxes. Then next month, performance degrades. Nobody noticed for three weeks. When someone finally noticed, nobody could reproduce the training run. The original data snapshot is gone. The feature code changed twice. The hyperparameters live inside one forgotten notebook cell. This is the opening failure. It is ordinary. It is painful. It is expensive. It is why MLOps exists.

### 1.2 What exactly broke

Usually, not one thing. Usually, five things broke together.

- Data distribution shifted.
- Monitoring was shallow.
- Training lineage was incomplete.
- Deployment had no guardrail.
- Incident response was improvised.

A notebook hides these risks. A notebook is personal. Production is organizational. The failure appears when work crosses teams.

### 1.3 The silent three-week gap

The most dangerous failure is silent degradation. No crash. No red screen. No obvious pager. Just slightly worse predictions. Then worse business outcomes. Recommendation rates drop. Fraud misses increase. Support deflection falls. Sales scoring becomes noisy. Users lose trust before dashboards move. This is why latency-only monitoring is insufficient. A system can be fast and wrong. That is still a production incident.

### 1.4 Why nobody could reproduce the run

Reproduction fails when lineage is broken. Lineage means the full chain of evidence. You need to know:

- which code commit trained the model,
- which dataset version fed the run,
- which features were materialized,
- which hyperparameters were used,
- which artifacts were produced,
- which eval set approved promotion.

Without that chain, debugging becomes storytelling. Two engineers remember different histories. Both sound plausible. Neither can prove the path.

### 1.5 Prototype versus product

Let me say this plainly. A prototype answers, “Could this work?” A product answers, “Can this keep working under pressure?” Prototype success is scientific. Product success is operational. The second is harder. The second decides careers.

### 1.6 The senior lens

Senior engineers do not stop at model accuracy. They ask operational questions immediately.

- What is the rollback path?
- Where is the source of truth?
- How do we know drift started?
- Who is paged at 2 a.m.?
- What cost curve appears at 10x traffic?
- How quickly can we retrain safely?

That habit is the difference. It is not cynicism. It is maturity.

### 1.7 The failure chain in one diagram

notebook win
→ manual deploy
→ input distribution shifts
→ quality slips quietly
→ no alert fires
→ business metric degrades
→ panic retraining starts
→ original run cannot be reproduced
→ slow, expensive recovery

### 1.8 The stakes

MLOps is what separates prototype from product. That sentence is the heart of this module. Without MLOps, ML remains a demo culture. With MLOps, ML becomes an operating discipline. You are not just shipping predictions. You are shipping promises. Promises about reliability. Promises about reversibility. Promises about cost. Promises about accountability. If those promises break, trust breaks first. And trust is slower to retrain than any model.


## Chapter 2 — Model Lifecycle Management

### 2.1 Start with the run, not the model

Many teams obsess over the final model file. That is too late. The real unit of work is the run. A run captures how a model came into existence. Think like an auditor. If tomorrow somebody asks, “Why this model?” you need evidence, not vibes. A good run record stores:

- code commit hash,
- dataset version,
- feature definitions,
- hyperparameters,
- environment details,
- metrics,
- artifacts,
- owner,
- approval decision.

### 2.2 Experiment tracking

Experiment tracking tools solve memory loss. They answer, “What did we try?” They also answer, “What actually worked?” The classic tools are MLflow and Weights & Biases. You may also see SageMaker Experiments, Vertex Experiments, or Neptune. But the core idea stays identical.

run
 ├─ params
 ├─ metrics
 ├─ artifacts
 ├─ code version
 └─ notes / tags
Do not log only the winning run. That is a common mistake. Log the failed runs too. Production wisdom often hides inside failures.

### 2.3 What to log every single time

At minimum, log these fields. Treat them as non-negotiable.

| Category | Must log | Why it matters |
|---|---|---|
| Code | commit SHA, branch, training script path | Reproducibility starts here |
| Data | dataset snapshot ID, filters, split logic | Drift and leakage debugging |
| Features | feature spec version, transformations | Prevents train-serve skew |
| Config | hyperparameters, seeds, hardware | Explains output variance |
| Metrics | train/val/test metrics, slice metrics | Approval evidence |
| Artifacts | model file, tokenizer, plots, confusion matrix | Deployment handoff |
| Environment | package versions, container image | Rebuild accuracy |
| Governance | owner, review status, risk notes | Accountability |

See the rhythm here. Every line exists because some team once suffered without it.
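
As a minimal sketch of the table above, the fields can be captured in one immutable record per run. The class name `RunRecord`, the helper `missing_fields`, and every value in the example are hypothetical illustrations; this is not the schema of MLflow, W&B, or any other tracker.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical run record; fields mirror the table, not a real tracker API."""
    commit_sha: str
    dataset_snapshot_id: str
    feature_spec_version: str
    hyperparameters: dict
    seed: int
    container_image: str
    metrics: dict
    artifact_uris: tuple
    owner: str
    review_status: str = "pending"


# Text fields that must never be empty if the run is to be reproducible.
REQUIRED_TEXT_FIELDS = (
    "commit_sha",
    "dataset_snapshot_id",
    "feature_spec_version",
    "container_image",
    "owner",
)


def missing_fields(record):
    """Return the names of required text fields that were left empty."""
    data = asdict(record)
    return [name for name in REQUIRED_TEXT_FIELDS if not data[name]]


# All values below are made up for illustration.
example_run = RunRecord(
    commit_sha="3f2a9c1",
    dataset_snapshot_id="churn-2024-05-01",
    feature_spec_version="v7",
    hyperparameters={"learning_rate": 0.05, "max_depth": 6},
    seed=42,
    container_image="registry.example.com/train:1.4.2",
    metrics={"val_auc": 0.871, "val_auc_new_users": 0.802},
    artifact_uris=("s3://ml-artifacts/example/model.bin",),
    owner="ml-team@example.com",
)

print(json.dumps(asdict(example_run), indent=2, sort_keys=True))
print("missing:", missing_fields(example_run))
```

Refusing to register a run while `missing_fields` is non-empty is one cheap way to make the "non-negotiable" rule enforceable rather than aspirational.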

### 2.4 Tool comparison — experiment tracking

| Tool | Best for | Strengths | Tradeoffs |
|---|---|---|---|
| MLflow | Open, flexible teams | Simple model registry, broad ecosystem, self-hostable | UI feels utilitarian |
| W&B | Research-heavy teams | Excellent dashboards, sweep support, collaboration | SaaS cost, vendor dependency |
| SageMaker Experiments | AWS shops | IAM and managed integration | AWS gravity everywhere |
| Vertex AI Experiments | GCP shops | Managed, connected to Vertex pipelines | GCP-centric workflow |
| Neptune | Metric-heavy experimentation | Strong metadata organization | Another platform to adopt |

If your team is small and practical, MLflow is enough. If your team lives in research dashboards, W&B feels pleasant. Do
not over-romanticize tool choice. Operational habits matter more than the logo.
### 2.5 Model registry — the warehouse
Now let us meet the warehouse. A registry is the place where approved models live. Not every run deserves promotion.
The registry separates experiments from deployable assets.
A mature registry stores:
- model name,
- version,
- stage,
- approval metadata,
- linked run,
- linked dataset,
- artifact pointers,
- deprecation notes.
Typical stages look like this:
None / Draft → Staging → Production → Archived
Do not treat the registry like a file folder. It is a control point. Promotion into production should require evidence.
That evidence often comes from the quality gate.
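
To make "control point, not file folder" concrete, here is a small in-memory sketch of stage transitions where promotion into Production is refused without evidence. The names `ModelVersion`, `promote`, and `ALLOWED_TRANSITIONS` are hypothetical; real registries expose their own APIs and stage names.

```python
# Allowed stage moves, following Draft -> Staging -> Production -> Archived.
ALLOWED_TRANSITIONS = {
    "Draft": {"Staging", "Archived"},
    "Staging": {"Production", "Archived"},
    "Production": {"Archived"},
    "Archived": set(),
}


class ModelVersion:
    """Hypothetical registry entry linking a version to the run that produced it."""

    def __init__(self, name, version, run_id):
        self.name = name
        self.version = version
        self.run_id = run_id
        self.stage = "Draft"
        self.evidence = None


def promote(model_version, target_stage, evidence=None):
    """Move a version to target_stage, enforcing order and evidence rules."""
    if target_stage not in ALLOWED_TRANSITIONS[model_version.stage]:
        raise ValueError(
            f"cannot move from {model_version.stage} to {target_stage}"
        )
    if target_stage == "Production" and not evidence:
        raise PermissionError("production promotion requires eval evidence")
    model_version.stage = target_stage
    if evidence:
        model_version.evidence = evidence
    return model_version


candidate = ModelVersion("churn-model", 3, run_id="run-001")
promote(candidate, "Staging")
promote(candidate, "Production", evidence={"eval_report": "report-17"})
print(candidate.name, candidate.version, candidate.stage)
```

The design choice worth noticing is that the evidence check lives inside `promote`, so nobody can skip it by calling a different code path.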
### 2.6 What lives in a good model card
A model version should travel with documentation. Not a novel. Just the truth.
A useful model card contains:
- task definition,
- intended users,
- training data summary,
- evaluation datasets,
- slice-level behavior,
- known failure modes,
- latency profile,
- cost profile,
- safety or policy notes.
That small discipline prevents heroic memory dependence. People leave teams. Model cards stay behind.
### 2.7 Versioning is bigger than Git alone
Many beginners say, “We already use Git.” Very good. Git versions code. It does not fully version data, artifacts, or
deployed states.
In ML, you must version four things together.
code + data + features + model artifact
If only one changes, behavior can change. That is why lineage graphs matter.
### 2.8 Reproducibility is a stack, not a wish
Reproducibility has layers. Missing one layer can ruin the whole attempt.
Layer 1: code reproducibility. Same commit. Same scripts.
Layer 2: data reproducibility. Same snapshot. Same filters. Same split logic.
Layer 3: environment reproducibility. Same package versions. Same CUDA stack. Same container image.
Layer 4: execution reproducibility. Same seed. Same hardware class. Same distributed setup.
Layer 5: evaluation reproducibility. Same benchmark set. Same thresholding rules. Same business acceptance criteria.
See, teams often stop at layer 1. Then they wonder why results drift mysteriously.
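
One way to notice a missing layer early is to hash the identifying facts of a run into a single fingerprint and compare fingerprints before comparing results. The sketch below uses only the standard library; the function names and the example values are assumptions for illustration, and it covers only some of the layers (code, data, packages, seed).

```python
import hashlib
import json
import platform
import random


def run_fingerprint(commit_sha, dataset_snapshot_id, seed, packages):
    """Hash the facts that identify a run; equal inputs give an equal digest."""
    payload = {
        "commit_sha": commit_sha,
        "dataset_snapshot_id": dataset_snapshot_id,
        "seed": seed,
        "python": platform.python_version(),
        "packages": dict(sorted(packages.items())),
    }
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def seeded_sample(seed, n):
    """Draw n pseudo-random numbers from an isolated, seeded generator."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


# Hypothetical values, for illustration only.
packages = {"numpy": "1.26.4", "scikit-learn": "1.4.2"}
first = run_fingerprint("3f2a9c1", "churn-2024-05-01", 42, packages)
second = run_fingerprint("3f2a9c1", "churn-2024-05-01", 42, packages)
other_seed = run_fingerprint("3f2a9c1", "churn-2024-05-01", 43, packages)

print(first == second)       # same layers, same fingerprint
print(first == other_seed)   # one layer changed, fingerprint changes
print(seeded_sample(42, 3) == seeded_sample(42, 3))
```

A matching fingerprint does not guarantee identical results (hardware class and distributed setup are not captured here), but a mismatching one tells you immediately which comparison is unfair.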
### 2.9 Artifact storage
Artifacts are the physical outputs of runs. They include:
- model binaries,
- tokenizers,
- embeddings,
- plots,
- calibration files,
- checkpoints,
- logs,
- validation reports.
Object stores usually hold these artifacts. S3, GCS, and Azure Blob are common choices. The tracker stores metadata. The
blob store holds the heavy files.
This split is sensible. Metadata wants quick queries. Artifacts want durable cheap storage.
### 2.10 Artifact storage rules that save pain
Use deterministic paths. Use immutable version folders. Never replace artifacts silently. Hash important files. Tag
retention classes clearly. Encrypt sensitive artifacts.
One practical pattern:
s3://ml-artifacts/
  project=
  task=
  run_id=
  artifact_type=
  version=
Boring naming conventions prevent dramatic outages. Please do not underestimate boring conventions.
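
The rules above can be sketched in a few lines: a deterministic path builder, a content hash, and a store that refuses silent overwrites. `artifact_prefix`, `WriteOnceStore`, and the example values are hypothetical; the in-memory store only stands in for a real object store and does not call any cloud SDK.

```python
import hashlib


def artifact_prefix(project, task, run_id, artifact_type, version,
                    bucket="ml-artifacts"):
    """Build a deterministic key prefix following the pattern shown above."""
    parts = [
        f"project={project}",
        f"task={task}",
        f"run_id={run_id}",
        f"artifact_type={artifact_type}",
        f"version={version}",
    ]
    return f"s3://{bucket}/" + "/".join(parts) + "/"


def sha256_hex(data):
    """Return the SHA-256 hex digest of a bytes payload."""
    return hashlib.sha256(data).hexdigest()


class WriteOnceStore:
    """In-memory stand-in for an object store that forbids overwrites."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        if key in self._objects:
            raise FileExistsError(f"refusing to overwrite {key}")
        digest = sha256_hex(data)
        self._objects[key] = (data, digest)
        return digest

    def verify(self, key, expected_digest):
        return self._objects[key][1] == expected_digest


store = WriteOnceStore()
key = artifact_prefix("churn", "binary-clf", "run-001", "model", "3") + "model.bin"
digest = store.put(key, b"fake model bytes")
print(key)
print(store.verify(key, digest))
```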
### 2.11 Lineage diagram
data snapshot ─────┐
feature spec ──────┼──→ training run ───→ model artifact ───→ registry version
code commit ───────┤            │                    │                │
container image ───┘            │                    │                │
                                 ├── metrics report ─┘                │
                                 └── eval evidence ───────────────────┘
### 2.12 The minimum lifecycle discipline
If your team is still early, start small. Do not wait for a grand platform team.
Minimum viable lifecycle management means:
- every run tracked,
- every deployable model registered,
- every production model linked to evidence,
- every artifact stored durably,
- every rollback version discoverable in minutes.
That alone moves you from chaos to competence.
---
## Chapter 3 — CI/CD for ML
### 3.1 Why software CI/CD is not enough
Normal software CI/CD assumes code changes drive behavior. ML systems violate that assumption. Data changes behavior
too. Features change behavior too. Model weights definitely change behavior.
So the pipeline cannot only ask, “Did the tests pass?” It must also ask, “Did the model remain good enough?” That second
question is the quality gate.
### 3.2 Meet the assembly line
This is the assembly line from our analogy. A mature ML pipeline does not move one file. It moves evidence.
source change
  ├─ code commit
  ├─ data refresh
  └─ feature update
training pipeline
evaluation suite
the quality gate
registry promotion
deployment rollout
A good assembly line reduces heroics. Humans decide policy. Machines perform repetition. That is the correct split.
### 3.3 Training pipelines
Training pipelines make retraining boring. That is praise. Boring is good here.
A strong training pipeline handles:
- data extraction,
- validation,
- feature generation,
- split creation,
- training,
- evaluation,
- packaging,
- registration.
The point is not glamour. The point is repeatability. If retraining depends on one patient engineer, the system is
fragile.
### 3.4 Pipeline orchestration tools

| Tool | Best for | Strengths | Tradeoffs |
|---|---|---|---|
| GitHub Actions | Small teams, lightweight CI | Familiar, simple, good for glue | Weak for heavy DAG orchestration |
| Airflow | Batch-oriented orchestration | Mature scheduling, retries, DAG visibility | ML-specific metadata is manual |
| Kubeflow Pipelines | Kubernetes-heavy ML teams | Container-native, ML-oriented steps | Operational overhead |
| TFX | TensorFlow-centric ecosystems | Strong validation and lineage concepts | Opinionated, TF-flavored |
| SageMaker Pipelines | AWS-managed shops | Managed infra, IAM integration | AWS lock-in |
| Vertex Pipelines | GCP-managed shops | Managed orchestration, pipeline tracking | GCP lock-in |

Early teams often succeed with GitHub Actions plus Python scripts. At scale, orchestration needs become sharper. But
again, do not over-engineer on day one.
### 3.5 Automated evaluation gates
The pipeline should stop bad models automatically. That stopping point is the quality gate.
A quality gate usually checks:
- headline metrics,
- slice metrics,
- regression thresholds,
- calibration changes,
- latency regressions,
- safety policy tests,
- cost constraints.
Here is a simple mental model.
candidate model enters
compare against champion
check offline metrics
check protected slices
check latency / cost guardrails
approve, hold, or reject
Do not use only one metric. That is how teams ship silent regressions. AUC may improve while a critical segment
collapses.
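
The mental model above can be sketched as one function that compares a challenger against the champion on a headline metric, on every slice, and on latency. The thresholds, the dictionary layout, and the function name `quality_gate` are assumptions for illustration; the sketch returns only approve or reject and leaves the "hold" outcome to a human reviewer.

```python
def quality_gate(champion, challenger, min_gain=0.0,
                 max_slice_drop=0.01, max_latency_ratio=1.10):
    """Return (decision, reasons) for a challenger measured against the champion."""
    reasons = []

    # Headline metric must not be worse than champion plus the required gain.
    if challenger["auc"] < champion["auc"] + min_gain:
        reasons.append("headline metric did not improve")

    # No protected slice may drop by more than max_slice_drop.
    for name, champion_score in champion["slices"].items():
        challenger_score = challenger["slices"].get(name)
        if challenger_score is None:
            reasons.append(f"slice missing: {name}")
        elif champion_score - challenger_score > max_slice_drop:
            reasons.append(f"slice regression: {name}")

    # Latency guardrail relative to the champion.
    if challenger["p95_latency_ms"] > champion["p95_latency_ms"] * max_latency_ratio:
        reasons.append("latency regression")

    decision = "approve" if not reasons else "reject"
    return decision, reasons


# Made-up numbers: overall AUC improves while one slice collapses.
champion = {"auc": 0.86, "slices": {"new_users": 0.80, "returning": 0.88},
            "p95_latency_ms": 100}
challenger = {"auc": 0.88, "slices": {"new_users": 0.70, "returning": 0.90},
              "p95_latency_ms": 105}

print(quality_gate(champion, challenger))
```

With these example numbers the challenger wins on the headline metric and is still rejected, which is exactly the silent regression a single-metric gate would have shipped.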
### 3.6 Champion versus challenger
This language matters. The current production model is the champion. The new candidate is the challenger.
Your pipeline should ask:
- Is the challenger better overall?
- Is it worse on any critical slice?
- Is it cheaper or more expensive?
- Is it more stable across time windows?
This framing prevents lazy comparison. You are not comparing a model to your hopes. You are comparing it to the
currently trusted baseline.
### 3.7 Automated retraining
Now we reach a seductive topic. Automated retraining sounds impressive. Sometimes it is wise. Sometimes it is reckless.
Good triggers for retraining:
- scheduled refresh with stable labels,
- clear data accrual cadence,
- monitored feature pipelines,
- robust eval sets,
- human review for promotion.
Bad triggers for retraining:
- “traffic feels different,”
- unlabeled drift without eval plan,
- missing rollback version,
- broken feature lineage,
- no post-train acceptance criteria.
See, automation multiplies both discipline and chaos. If the process is weak, automation accelerates the weakness.
### 3.8 Feature stores
Feature stores try to solve consistency. The main promise is simple. The same feature logic should serve training and
inference.
That matters most in classical ML systems. Credit scoring, churn, fraud, personalization, pricing. These systems depend
on tabular features with freshness constraints.
A feature store typically offers:
- offline feature retrieval,
- online low-latency serving,
- feature definitions,
- freshness metadata,
- point-in-time joins.
Feast is a common open-source choice. Managed clouds offer their own versions too.
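
Point-in-time joins are the least intuitive item on that list, so here is a minimal sketch of the idea in plain Python. It is not Feast's API; `point_in_time_value` and the sample history are hypothetical. For a training label observed at time `as_of`, the lookup returns the latest feature value recorded at or before that time and never a later one, which is what prevents leakage from the future.

```python
from bisect import bisect_right


def point_in_time_value(history, as_of):
    """Return the latest value with timestamp <= as_of, or None if none exists.

    history is a list of (timestamp, value) pairs sorted by timestamp.
    """
    timestamps = [ts for ts, _ in history]
    index = bisect_right(timestamps, as_of)
    if index == 0:
        return None
    return history[index - 1][1]


# Hypothetical feature history: (event time, 7-day purchase count).
purchase_count_history = [(1, 10.0), (5, 20.0), (9, 30.0)]

# A label observed at time 6 may only see the value written at time 5.
print(point_in_time_value(purchase_count_history, 6))
# A label observed before any feature was written sees nothing.
print(point_in_time_value(purchase_count_history, 0))
```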
### 3.9 When feature stores help and when they do not
Feature stores help when many models share operational features. They help when point-in-time correctness matters. They
help when train-serve skew burned you already.
They help less when your application is pure prompt engineering. They help less when features are trivial and few. They
help less when the team cannot maintain another platform surface.
Use the pattern where the pain exists. Not because conference talks made it sound mandatory.
### 3.10 Data versioning with DVC
Git is excellent for code. Large datasets need another mechanism. That is where DVC becomes useful.
DVC gives you:
- dataset version references in Git,
- remote artifact backing in object storage,
- reproducible pipeline stages,
- cache reuse.
Think of DVC as “Git-like pointers for large data and ML pipelines.” It does not magically solve governance. But it
gives structure where folders previously lied.
### 3.11 CI/CD for ML in one diagram
Git commit / data snapshot / feature change
        pipeline kickoff
       validate data quality
          build features
            train model
      run offline evaluations
        the quality gate
          ├─ fail → notify
          └─ pass → register
             deploy safely
### 3.12 Common mistakes in ML pipelines
- treating retraining as a cron job only,
- skipping slice-level metrics,
- deploying straight from notebooks,
- ignoring data validation,
- registering models without evidence,
- mixing training-only code with serving code,
- forgetting rollback automation.
Every one of these mistakes looks harmless early. Then scale arrives. Then they become very expensive habits.
---
## Chapter 4 — Serving Infrastructure
### 4.1 Serving is where latency becomes political
Training teams often think serving is “just expose an endpoint.” No, my friend. Serving is where compute, product, and
finance start negotiating.
Users feel latency immediately. Finance feels GPU bills monthly. SRE feels incidents at night. Product feels abandonment
during spikes. Serving is where all four voices meet.
### 4.2 The serving stack picture
client request
API gateway / auth
request router
model server
   ├─ batching
   ├─ caching
   ├─ scheduler
   └─ GPU workers
response + metrics + traces
That middle box is everything. The model alone does not save you. The scheduler, cache, and router decide real
experience.
### 4.3 Model serving choices
For modern LLM inference, three names appear constantly. You must know them cold.

| Stack | Best for | Strengths | Tradeoffs |
|---|---|---|---|
| vLLM | Open-weight LLM serving | Continuous batching, PagedAttention, OpenAI-compatible APIs | LLM-centric, not universal |
| TGI | Hugging Face-centered serving | Familiar ecosystem, reasonable performance, mature | Usually trails vLLM in raw throughput |
| Triton Inference Server | Multi-framework inference | Flexible, enterprise-friendly, many backends | Higher ops complexity |
| KServe / Seldon | Kubernetes model platforms | Standard deployment patterns | Adds platform layers |
| SageMaker / Vertex endpoints | Managed inference | Less ops burden | Higher cost, less low-level control |

If the interview asks, start with vLLM for open weights. Mention TGI as the ecosystem cousin. Mention Triton when
multi-model or framework diversity matters.
### 4.4 Autoscaling
Autoscaling sounds simple. Add pods when traffic rises. But ML serving complicates everything.
What should trigger scaling?
- request QPS,
- queue length,
- tokens per second,
- GPU utilization,
- p95 latency,
- memory pressure.
For LLMs, QPS alone is weak. One long generation can dominate resources. Token-level work matters more than raw request
count.
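
A rough sketch of token-aware scaling: compute the replica count each signal would demand and take the larger one, clamped to a safe range. The function name, the per-replica capacity figures, and the limits are assumptions for illustration, not defaults of any autoscaler.

```python
import math


def desired_replicas(tokens_per_second, tokens_per_second_per_replica,
                     queue_length, max_queue_per_replica,
                     min_replicas=1, max_replicas=20):
    """Pick a replica count from token throughput and queue depth, then clamp."""
    by_tokens = math.ceil(tokens_per_second / tokens_per_second_per_replica)
    by_queue = math.ceil(queue_length / max_queue_per_replica)
    wanted = max(by_tokens, by_queue)
    return max(min_replicas, min(max_replicas, wanted))


# Hypothetical load: few requests in the queue, but long generations.
print(desired_replicas(tokens_per_second=4500,
                       tokens_per_second_per_replica=1000,
                       queue_length=12,
                       max_queue_per_replica=8))
```

In this example the queue alone would ask for two replicas while token throughput asks for five, which is the situation the paragraph above warns about.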
### 4.5 Batching
Batching is the oldest trick in ML serving. Still essential. Still misunderstood.
Three useful patterns exist.

| Batch type | What happens | Best use |
|---|---|---|
| Static batching | Wait for fixed-size batch | Offline jobs |
| Dynamic batching | Wait briefly, form batch from arrivals | Low-to-medium real-time traffic |
| Continuous batching | Add requests into active decode loop | High-throughput LLM serving |

Continuous batching is why vLLM matters. It keeps GPUs busy during generation. Without it, expensive hardware idles
shamefully.
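
Dynamic batching is the easiest of the three to simulate offline. The sketch below groups request arrival times into batches using two knobs, a maximum batch size and a maximum wait. It is a simplification under stated assumptions: a real server closes a batch with a timer, whereas this simulation closes it when the next arrival shows the wait window has passed. It does not model continuous batching, and the function name and numbers are hypothetical.

```python
def dynamic_batches(arrival_times_ms, max_batch_size, max_wait_ms):
    """Group sorted arrival times into batches by size and wait limits."""
    batches = []
    current = []
    for arrival in arrival_times_ms:
        batch_is_full = len(current) >= max_batch_size
        waited_too_long = bool(current) and arrival - current[0] > max_wait_ms
        if current and (batch_is_full or waited_too_long):
            batches.append(current)
            current = []
        current.append(arrival)
    if current:
        batches.append(current)
    return batches


# Hypothetical arrivals in milliseconds: a trickle, then a burst.
arrivals = [0, 2, 3, 20, 21, 22, 23, 24]
print(dynamic_batches(arrivals, max_batch_size=4, max_wait_ms=5))
```

Raising `max_wait_ms` trades latency for larger batches; raising `max_batch_size` only helps when traffic is bursty enough to fill it.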
### 4.6 Caching
Caching is the cheapest performance engineer on your team. Use it well.
Useful caches include:
- feature cache,
- embedding cache,
- prompt prefix cache,
- retrieval cache,
- response cache for deterministic tasks.
But be careful. Caching can also preserve stale errors. So cache invalidation rules matter deeply. That old computer
science joke remains fully alive here.
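
A minimal sketch of a response cache that takes invalidation seriously: the key includes the model version, so a new deployment never serves answers produced by the old one, and entries expire after a time-to-live. `VersionedTTLCache` is a hypothetical name, and the clock is passed in explicitly to keep the example deterministic.

```python
import hashlib


class VersionedTTLCache:
    """Cache keyed by (model_version, prompt) with time-based expiry."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self._store = {}

    @staticmethod
    def _key(model_version, prompt):
        raw = f"{model_version}\x00{prompt}".encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def put(self, model_version, prompt, value, now):
        self._store[self._key(model_version, prompt)] = (value, now)

    def get(self, model_version, prompt, now):
        key = self._key(model_version, prompt)
        entry = self._store.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if now - stored_at > self.ttl_seconds:
            del self._store[key]  # expired entries are dropped on read
            return None
        return value


cache = VersionedTTLCache(ttl_seconds=60)
cache.put("model-v1", "What is our refund policy?", "cached answer", now=0)

print(cache.get("model-v1", "What is our refund policy?", now=10))   # hit
print(cache.get("model-v2", "What is our refund policy?", now=10))   # new version, miss
print(cache.get("model-v1", "What is our refund policy?", now=100))  # expired, miss
```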
### 4.7 GPU orchestration
Serving one model on one machine is easy. Serving many models across GPUs is real operations.
You must care about:
- GPU type selection,
- memory fragmentation,
- tenancy policy,
- placement,
- warm starts,
- bin packing,
- failure recovery.
Kubernetes plus device plugins is common. Ray Serve appears in some teams. Managed endpoints hide part of this pain.
They also hide useful control knobs. Tradeoffs everywhere.
### 4.8 Cost optimization begins with architecture
Teams jump to quantization first. Sometimes that is correct. Often the earlier wins are simpler.
Start with these levers:
1. Route easy tasks to cheaper models.
2. Reduce prompt length aggressively.
3. Cache repeated prefixes or retrieval outputs.
4. Tune output token limits.
5. Batch smarter.
6. Right-size GPU class.
7. Separate latency-critical and batch workloads.
Only after these, evaluate heavier changes. Such as quantization, distillation, or model swaps.
### 4.9 Latency optimization basics
Module 18 will assume you know these basics already. So listen carefully.
Latency is not one number. Break it into stages.
network in
  + queue wait
  + prefill
  + decode
  + post-processing
  + network out
If you do not break it apart, you cannot improve it intelligently. Always ask where the milliseconds live.
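
As a small worked example of "where the milliseconds live", the sketch below sums the stages and reports which one dominates. The stage timings are invented for illustration; in practice they would come from traces.

```python
def latency_breakdown(stages_ms):
    """Return total latency, the slowest stage, and each stage's share."""
    total = sum(stages_ms.values())
    slowest = max(stages_ms, key=stages_ms.get)
    shares = {name: ms / total for name, ms in stages_ms.items()}
    return total, slowest, shares


# Hypothetical per-stage timings for one request, in milliseconds.
stages_ms = {
    "network_in": 5,
    "queue_wait": 40,
    "prefill": 120,
    "decode": 600,
    "post_processing": 10,
    "network_out": 5,
}

total, slowest, shares = latency_breakdown(stages_ms)
print(f"total: {total} ms, slowest stage: {slowest}")
for name, share in shares.items():
    print(f"{name:>16}: {share:.1%}")
```

With these numbers, shaving the network hops would barely register, while anything that shortens decode (for example a lower output token limit) moves the total noticeably.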
### 4.10 Blue-green and canary — the upgrade without downtime
Now remember the fifth helper: the upgrade without downtime. This means you can deploy new versions safely, without
taking the whole service offline.

| Strategy | What it does | Best use |
|---|---|---|
| Blue-green | Run old and new stacks side by side | Fast full cutover and rollback |
| Canary | Send small traffic slice to new version | Early risk detection on real users |
| Shadow | Duplicate traffic without user-visible response | Measure latency and behavior safely |
| Percentage rollout | Gradually increase traffic share | Controlled scaling of confidence |

Blue-green gives crisp rollback. Canary gives live evidence. Shadow gives safe observation. Use the right tool for the
failure you fear.
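
One common way to implement a canary or percentage rollout is deterministic hash bucketing, sketched below. It is an assumption about implementation, not something a particular router requires: each user hashes to a stable bucket from 0 to 99, so the same user always sees the same variant, and raising the percentage only adds users to the canary. The function name `route` and the user IDs are hypothetical.

```python
import hashlib


def route(user_id, canary_percent):
    """Send a stable canary_percent share of users to the canary."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "canary" if bucket < canary_percent else "stable"


users = [f"user-{i}" for i in range(10_000)]
canary_share = sum(route(u, 10) == "canary" for u in users) / len(users)
print(f"observed canary share at 10%: {canary_share:.3f}")

# Anyone in the 5% canary is still in the canary when it widens to 20%.
still_in = all(route(u, 20) == "canary" for u in users if route(u, 5) == "canary")
print("rollout is monotonic:", still_in)
```

Rollback in this scheme is a configuration change: set `canary_percent` back to 0 and every user returns to the stable version.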
### 4.11 Serving diagram with control points
         deployment control plane
          ├─ registry lookup
          ├─ rollout policy
          └─ autoscaling policy
user → gateway → router → model server → GPU
                     │         │            │
                     │         ├─ cache     ├─ metrics
                     │         ├─ batcher   └─ traces
                     │         └─ scheduler
                logs + alerts
### 4.12 Cost table — ballpark infra economics
These numbers move by region and vendor. Treat them as ballpark, not scripture.

| Resource | Rough cost | Where it fits |
|---|---|---|
| L4 GPU | ~$0.7-$1.2 per hour | Smaller inference, embedding jobs |
| A10G GPU | ~$1-$1.8 per hour | Mid-tier inference |
| A100 80GB | ~$3-$5 per hour | Heavy serving, training, bigger contexts |
| H100 | ~$8-$12+ per hour | Frontier-scale inference |
| S3 / GCS storage | ~$20-$30 per TB-month | Artifact and dataset storage |
| Prometheus + Grafana infra | low hundreds per month | Monitoring stack |

A senior answer always includes utilization. Hardware price alone means little. Idle GPUs are luxury furniture.
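
To see why utilization belongs in the answer, here is a back-of-the-envelope sketch that converts an hourly GPU price into cost per million generated tokens. The hourly price is taken from the ballpark range in the table; the peak throughput figure is purely hypothetical and depends heavily on model, batch size, and sequence lengths.

```python
def cost_per_million_tokens(gpu_hourly_usd, peak_tokens_per_second, utilization):
    """Effective USD per one million tokens at a given average utilization."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = peak_tokens_per_second * utilization * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


# Assumed: $4 per hour, 2000 tokens per second at full load (hypothetical).
for utilization in (0.25, 0.50, 0.75):
    cost = cost_per_million_tokens(4.0, 2000, utilization)
    print(f"utilization {utilization:.0%}: ${cost:.2f} per 1M tokens")
```

Under these assumptions, moving from 25% to 75% utilization cuts the cost per token to a third without touching the hardware price.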
---
## Chapter 5 — Monitoring and Maintenance
### 5.1 The production monitor
Now meet the production monitor. If the warehouse stores trust, the monitor protects trust. It tells you whether
reality still matches your assumptions.
Do not monitor only servers. Monitor the model behavior too. An ML system can be healthy operationally and unhealthy
scientifically.
### 5.2 Four families of signals
A good monitoring stack watches four families together.
1. System signals — CPU, GPU, memory, queue depth, latency.
2. Data signals — feature distributions, missingness, schema changes.
3. Model signals — prediction scores, calibration, drift indicators.
4. Business signals — conversion, fraud catch rate, support deflection, revenue.
When these disagree, that disagreement is informative. Fast server plus bad business metric means quality issue. Stable
quality plus exploding latency means infrastructure issue.
### 5.3 Data drift detection
Data drift means input distributions changed. The model may still be identical. But the world feeding it shifted.
Examples:
- new user segment arrived,
- sensor firmware changed,
- upstream feature pipeline broke,
- prompt style changed after a product redesign.
Useful detection methods include:
- summary statistics drift,
- PSI,
- KL divergence,
- embedding-space shift,
- missing-value surge,
- schema validation.
The trick is not only detecting change. The trick is judging whether the change matters. Not every drift event deserves
retraining. Some deserve investigation first.
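
PSI (population stability index) from the list above can be computed in a few lines. For bins $i$ with baseline fraction $e_i$ and current fraction $a_i$, it is $\sum_i (a_i - e_i)\ln(a_i / e_i)$. The sketch below uses equal-width bins taken from the baseline range and a small epsilon to avoid dividing by zero; both choices, and the sample data, are assumptions for illustration.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population stability index of `actual` against the `expected` baseline."""
    low, high = min(expected), max(expected)
    width = (high - low) / bins or 1.0  # guard against a constant baseline

    def fractions(values):
        counts = [0] * bins
        for value in values:
            index = int((value - low) / width)
            index = min(max(index, 0), bins - 1)  # clamp out-of-range values
            counts[index] += 1
        return [max(count / len(values), eps) for count in counts]

    expected_frac = fractions(expected)
    actual_frac = fractions(actual)
    return sum(
        (a - e) * math.log(a / e) for e, a in zip(expected_frac, actual_frac)
    )


# Synthetic data: a uniform baseline and a window shifted to the upper half.
baseline = [i / 100 for i in range(100)]
shifted = [0.5 + i / 200 for i in range(100)]

print(f"baseline vs itself: {psi(baseline, baseline):.4f}")
print(f"baseline vs shifted: {psi(baseline, shifted):.4f}")
```

A commonly quoted rule of thumb treats values above roughly 0.25 as a significant shift, but that threshold is a convention, not a law. As the text says, a high score is a reason to investigate, not an automatic retraining trigger.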
### 5.4 Model drift
Model drift means predictive behavior degraded over time. Sometimes the world changed. Sometimes labels changed.
Sometimes the model was always fragile. Sometimes a vendor silently updated the underlying model.
Here is the practical difference.
- Data drift asks, “Did inputs change?”
- Model drift asks, “Did performance change?”
The two often travel together. But not always. That distinction matters during incidents.
### 5.5 Drift table

| Drift type | What changed | Typical signal | First response |
|---|---|---|---|
| Data drift | Input distribution | PSI, missingness, schema shift | Inspect upstream changes |
| Concept drift | Mapping from input to truth | Business metric drop, label lag analysis | Re-evaluate assumptions |
| Model drift | Output quality over time | Online eval or delayed labels | Compare against champion |
| Vendor drift | External model behavior changed | Same prompt, new output pattern | Pin version, run fallback |

Do not casually merge these labels. Precise naming produces faster response.
### 5.6 Online monitoring for LLM systems
LLM systems add fresh headaches. Ground truth is often delayed or ambiguous. So you use proxies.
Common online signals:
- refusal rate,
- hallucination reports,
- groundedness score,
- judge-model rating,
- user thumbs up or down,
- fallback rate,
- retrieval hit quality,
- token cost per session.
None is perfect alone. Together they become a practical early-warning system.
### 5.7 A/B testing in production
A/B testing sounds standard. For ML systems, it becomes subtle. For LLM systems, it becomes extra subtle.
Why?
- quality labels are noisy,
- long-tail failures matter more than means,
- users may cross between variants,
- model outputs are non-deterministic,
- business effects may lag.
So do not run blind percentage tests. Define guardrails first. Latency. Error rate. Cost. Safety. Then define success
metrics.
### 5.8 Shadow, canary, rollout
These are not synonyms. Please use them accurately.
- Shadow duplicates traffic safely.
- Canary exposes a small real audience.
- Percentage rollout expands exposure gradually.
- Blue-green swaps whole environments.
In incidents, language discipline saves time. If someone says “canary” but means “shadow,” the whole room imagines the
wrong risk surface.
### 5.9 Rollback strategies
Rollback is not just “deploy old code.” You may need to roll back:
- model weights,
- prompt template,
- routing rule,
- feature transformation,
- retrieval index,
- traffic policy.
That is why the registry and pipeline history matter. Rollback is only fast when artifacts are already organized. You
cannot improvise order during fire.
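
One way to keep that order ready is to treat every release as a single manifest that pins all six components together, so rollback means "return to the previous manifest" rather than "find the old files". The class and field names below are hypothetical and mirror the list above; this is a sketch, not the interface of any deployment tool.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ReleaseManifest:
    """Pins every independently changeable component of one release."""
    model_weights: str
    prompt_template: str
    routing_rule: str
    feature_transformation: str
    retrieval_index: str
    traffic_policy: str


def changed_components(old, new):
    """List which components differ between two manifests."""
    return [f.name for f in fields(old)
            if getattr(old, f.name) != getattr(new, f.name)]


class ReleaseHistory:
    """Append-only history; rollback returns to the previous manifest."""

    def __init__(self):
        self._history = []

    def deploy(self, manifest):
        self._history.append(manifest)

    def current(self):
        return self._history[-1]

    def rollback(self):
        if len(self._history) < 2:
            raise RuntimeError("no earlier release to roll back to")
        self._history.pop()
        return self._history[-1]


# Made-up version labels for illustration.
v1 = ReleaseManifest("weights-3", "prompt-7", "route-2", "feat-5", "index-11", "policy-1")
v2 = ReleaseManifest("weights-3", "prompt-8", "route-2", "feat-5", "index-12", "policy-1")

history = ReleaseHistory()
history.deploy(v1)
history.deploy(v2)
print("changed in v2:", changed_components(v1, v2))
print("after rollback:", history.rollback())
```

During an incident, the output of `changed_components` is also a short list of suspects: if only the prompt and the index changed, the weights are unlikely to be the cause.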
### 5.10 Incident response for AI systems
AI incidents need a runbook. Not a brave Slack thread. A runbook.
A clean runbook includes:
1. trigger conditions,
2. severity rubric,
3. owner on call,
4. triage checklist,
5. rollback steps,
6. communication template,
7. recovery verification,
8. postmortem questions.
For LLM incidents, add these extra checks:
- provider outage or throttling,
- prompt regression,
- retrieval freshness,
- safety policy false positives,
- model version drift,
- cost spike due to token explosion.
### 5.11 Monitoring architecture diagram
requests
service metrics ──┐
data checks ──────┼──→ dashboards ─→ alerts ─→ on-call
model quality ────┤         │
business KPIs ────┘         └→ weekly review and retraining decisions
### 5.12 Observability stack comparison

| Layer | Common tools | What you learn |
|---|---|---|
| Metrics | Prometheus, CloudWatch, Datadog | Latency, throughput, errors, saturation |
| Logs | ELK, Loki, Cloud Logging | Per-request evidence |
| Traces | OpenTelemetry, Jaeger, Honeycomb | Stage-by-stage latency |
| ML quality | Evidently, Arize, WhyLabs, custom dashboards | Drift and performance trends |
| Annotation / feedback | Label Studio, human review queues | Reality correction |

A perfect tool does not exist. A clear operating model matters more. Who reads the dashboard? Who owns the alert? Who
approves rollback? These are the real questions.
### 5.13 Maintenance rhythm
High-performing teams create rhythm. Not just alerts.
Daily:
- check error, latency, and cost anomalies.
Weekly:
- review drift, slice metrics, failed cases.
Monthly:
- re-evaluate thresholds, SLOs, and runbooks.
Quarterly:
- test disaster rollback and stale-assumption risks.
Maintenance is not glamour work. It is compounding work. The teams that skip it look fine until they suddenly do not.
---
## Retrieval Prompts
Use these when you want to pull the module back from memory. Do not ask for definitions only. Ask for operational
comparisons.
1. “Explain MLOps using the factory analogy, then map the assembly line, quality gate, warehouse, and production monitor to real tools.”
2. “Walk through a failed production model incident where drift went unnoticed for weeks, and show how experiment tracking plus a model registry would have shortened recovery.”
3. “Compare vLLM, TGI, and Triton for a startup serving open-weight models with strict latency targets and a small ops team.”
4. “Give me an interview answer for automated retraining: when it is safe, when it is reckless, and what evidence must exist before promotion.”
5. “Design a rollback plan for an AI system where the model, prompt, retrieval index, and routing rules can all change independently.”
---
## Honest Admission
MLOps tooling is fragmented. That is the truth. Different teams stitch together different stacks. Very few stacks feel
elegant end to end.
Most teams over-engineer or under-engineer. Over-engineering means platform castles before real usage exists.
Under-engineering means notebooks, manual deploys, and pure hope. Both fail differently.
Also, many vendors market “one-click MLOps.” Please be skeptical. The hard part is not the dashboard. The hard part is
operational discipline across data, models, infra, and teams.
So the adult answer is balanced. Use enough tooling to create lineage, safety, and speed. Do not build a moon mission
for a bicycle. But do not drive a bus with bicycle brakes either.
---
## Chapter 6 — Recap, Interview Frame, and Bridge
### 6.1 Failure-fix table

| Failure | What it looks like | Fix |
|---|---|---|
| Model cannot be reproduced | Nobody knows exact data or config | Track runs, data version, environment, artifacts |
| Silent quality degradation | Business metric slips slowly | Add quality monitoring, slice reviews, alerts |
| Bad model reaches prod | Offline result looked okay once | Enforce automated eval gate |
| Train-serve skew | Offline metrics great, online poor | Use shared feature definitions or feature store |
| Rollback takes hours | Team hunts for old files | Use registry stages and immutable artifacts |
| GPU bill explodes | Throughput poor, utilization low | Batch better, cache, route cheaper models |
| Canary causes surprise outage | Latency spikes under real traffic | Shadow first, canary second, monitor guardrails |
| Vendor model changes silently | Same prompt, new behavior | Pin versions, run regression tests |
| Data refresh breaks predictions | Schema drift or null surge | Add data validation in pipeline |
| Incident response is chaotic | Slack panic, no clear owner | Maintain runbook and severity rules |

Read that table twice. This is the module in compressed form.
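To make the "Bad model reaches prod" row concrete, here is a minimal sketch of the quality gate as plain Python. It is an illustration, not a prescribed implementation. The names `GateRule`, `evaluate_gate`, and `RULES`, the metric names, and every threshold are hypothetical. It also assumes all metrics are higher-is-better.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateRule:
    """One check: a metric must reach a floor and must not regress too far."""
    metric: str
    min_value: float        # absolute floor the candidate must reach
    max_regression: float   # allowed drop relative to the production baseline


def evaluate_gate(candidate, baseline, rules):
    """Return (passed, reasons). Metrics are assumed to be higher-is-better."""
    reasons = []
    for rule in rules:
        if rule.metric not in candidate:
            reasons.append(f"{rule.metric}: missing from candidate metrics")
            continue
        value = candidate[rule.metric]
        if value < rule.min_value:
            reasons.append(
                f"{rule.metric}: {value:.3f} below floor {rule.min_value:.3f}"
            )
        base = baseline.get(rule.metric)
        if base is not None and base - value > rule.max_regression:
            reasons.append(
                f"{rule.metric}: regressed {base - value:.3f} vs baseline"
            )
    return (not reasons, reasons)


# Hypothetical metric names and thresholds, chosen only for illustration.
RULES = [
    GateRule(metric="accuracy", min_value=0.90, max_regression=0.01),
    GateRule(metric="recall_minority_slice", min_value=0.80, max_regression=0.02),
]
```

In a pipeline, the assembly line would call `evaluate_gate` with the candidate run's metrics and the current production metrics, then block promotion and log `reasons` whenever `passed` is false. A missing metric counts as a failure on purpose: a gate that silently skips checks is no gate.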
### 6.2 Key points to remember
- The run is the primary unit of evidence.
- The registry is a control plane, not storage only.
- The quality gate separates experimentation from promotion.
- Retraining without evaluation is automation of risk.
- Serving performance depends on scheduling, not only weights.
- Monitoring must include system, data, model, and business layers.
- Rollback must cover more than code.
- Cost discipline is architecture plus operations, not discounts alone.
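The point "Rollback must cover more than code" is easier to hold when every behaviour-changing piece is pinned in one record. The sketch below is one possible shape, assuming four independently versioned parts (model, prompt, retrieval index, routing rules). `Release` and `ReleaseHistory` are hypothetical names, not part of any registry product.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Release:
    """Everything that can change behaviour, pinned together as one unit."""
    model_version: str
    prompt_version: str
    index_version: str
    routing_version: str


class ReleaseHistory:
    """Append-only history so rollback is a lookup, not a file hunt."""

    def __init__(self, initial):
        self._history = [initial]

    @property
    def current(self):
        return self._history[-1]

    def promote(self, **changes):
        """Record a new release that differs from the current one in `changes`."""
        new_release = replace(self.current, **changes)
        self._history.append(new_release)
        return new_release

    def rollback_to(self, target):
        """Re-promote a previously recorded release; history is never rewritten."""
        if target not in self._history:
            raise ValueError("target was never a recorded release")
        self._history.append(target)
        return target
```

Because a rollback is appended rather than erased, the history still shows what was live during an incident. A real setup would persist this record next to the registry instead of keeping it in memory.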
### 6.3 Important interview questions
Practice these aloud. Do not memorize robotic answers. Build decision trees.
1. “Your model degraded after deployment and nobody noticed for three weeks. What monitoring design would you add?”
2. “Design a model registry and promotion workflow for a ten-person ML team.”
3. “When is automated retraining a good idea, and when is it dangerous?”
4. “Compare vLLM, TGI, and Triton for a production LLM service.”
5. “What would your rollback strategy look like for a retrieval-augmented AI assistant?”
6. “How would you cut a $100K monthly inference bill without harming user experience?”
7. “What signals distinguish data drift from model drift in production?”
A strong answer names tradeoffs. A weak answer names only tools.
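For question 7, it helps to have one concrete data-drift signal in hand. A commonly used one is the Population Stability Index, which compares the binned distribution of a feature in a reference sample against a live sample. The sketch below is a minimal standard-library version. The function name `psi`, the equal-width binning, and the `eps` floor are choices made for this illustration. Model drift, by contrast, usually shows up as falling quality on labelled data even when input distributions look stable.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a reference sample and a live sample."""
    lo, hi = min(expected), max(expected)
    if hi == lo:
        raise ValueError("reference sample has no spread to bin")
    width = (hi - lo) / bins

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range live values into the first or last bin.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        total = len(values)
        # Floor each share at eps so empty bins do not produce log(0).
        return [max(c / total, eps) for c in counts]

    p_ref = proportions(expected)
    p_live = proportions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(p_ref, p_live))
```

Identical samples give a PSI of zero, and the value grows as the live distribution moves away from the reference. Alert thresholds are a team convention and should be calibrated per feature, not copied blindly.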
### 6.4 Production experience — tools and ballpark costs
If you want credible interview texture, speak concretely. Here is a useful example frame.
Example stack A — small pragmatic team
- MLflow for runs and registry.
- DVC for dataset snapshots.
- GitHub Actions for pipeline glue.
- S3 for artifacts.
- Prometheus + Grafana for observability.
- vLLM on one A10G or L4 for open-weight inference.
Ballpark cost shape:
- object storage: tens of dollars monthly early on,
- observability infra: low hundreds monthly,
- single mid-tier GPU endpoint: low hundreds to low thousands monthly,
- biggest hidden cost: engineer time.
Example stack B — managed cloud-heavy team
- SageMaker or Vertex pipelines.
- Managed endpoints.
- Managed experiment tracking.
- Cloud monitoring and alerting.
Ballpark cost shape:
- faster time to first production,
- higher per-unit infra cost,
- lower platform maintenance burden,
- easier IAM and audit integration.
Example stack C — high-throughput LLM team
- vLLM or TGI for serving.
- Kubernetes or managed GPU nodes.
- OpenTelemetry traces.
- Arize or WhyLabs for quality monitoring.
- Canary plus shadow rollout policies.
Ballpark cost shape:
- GPU cost dominates,
- utilization determines economic health,
- caching and routing often save more than model swaps.
Use ranges. State assumptions. Never pretend cloud cost is fixed independent of traffic mix.
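The "use ranges, state assumptions" advice can be practised with a few lines of arithmetic. The sketch below is only that. The function names are hypothetical, and any hourly rate or throughput you pass in is your own assumption, not a quoted price.

```python
def monthly_gpu_cost(hourly_rate, gpu_count, hours_per_day=24.0, days=30):
    """Cost of keeping `gpu_count` GPUs provisioned for the given hours."""
    return hourly_rate * gpu_count * hours_per_day * days


def cost_per_million_tokens(hourly_rate, tokens_per_second, utilization):
    """Effective cost per 1M generated tokens at a given average utilization."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return hourly_rate / tokens_per_hour * 1_000_000


def cost_range(low_rate, high_rate, gpu_count):
    """Report a (low, high) monthly range instead of a single number."""
    return (
        monthly_gpu_cost(low_rate, gpu_count),
        monthly_gpu_cost(high_rate, gpu_count),
    )
```

With a placeholder rate of 1.0 per hour, one always-on GPU comes to 720 per month. Note how `cost_per_million_tokens` doubles when utilization halves. That is the arithmetic behind "utilization determines economic health".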
### 6.5 Exercises
1. Draw your own lifecycle diagram from memory.
Include data, run tracking, registry, deploy, and monitoring.
2. Write a one-page promotion policy.
Specify what the quality gate must verify.
3. Design a rollback plan with four possible rollback targets.
Code alone is not enough.
4. Compare MLflow and W&B for your likely next team.
Choose one and justify the tradeoff honestly.
5. Take one product you know.
List the exact signals you would monitor weekly.
6. Estimate serving cost for one small open-weight model.
Write down GPU class, expected utilization, and fallback path.
### 6.6 Mini architecture recap
data / code / feature change → tracked run → metrics + artifacts → the warehouse → the assembly line → the quality gate → safe rollout strategy → the production monitor → incident response / retraining
## Foundation-Gap Audit
Module 18 will quietly assume four things are already firm. Do not carry confusion forward.
#### A. Serving infrastructure basics
You should already know:
- what a model server does,
- why routers, schedulers, and caches matter,
- why vLLM, TGI, and Triton differ,
- why blue-green and canary exist.
If this is weak, re-read Chapter 4. Especially §4.3 to §4.10.
#### B. Latency optimization
You should already know:
- latency is stage-wise,
- batching changes both throughput and latency,
- caching can remove repeated work,
- queue time can dominate user experience.
If this is weak, re-read §4.5 to §4.9. Module 18 will make every millisecond more painful.
#### C. Monitoring patterns
You should already know:
- what signals belong in dashboards,
- how drift differs from outages,
- why business metrics matter beside system metrics,
- how alerts should map to owners.
If this is weak, re-read Chapter 5. Especially §5.2 to §5.10.
#### D. Deployment strategies
You should already know:
- shadow versus canary,
- percentage rollout logic,
- rollback targets,
- why safe release patterns matter for AI systems.
If this is weak, re-read §4.10 and §5.8 to §5.10. Voice systems punish sloppy rollout discipline immediately.
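If "percentage rollout logic" still feels abstract, here is a minimal sketch of one common approach: hash each user into a stable bucket and compare the bucket against the canary percentage. The function names and the `salt` value are hypothetical. Real routers usually live in a gateway or service mesh, not in application code.

```python
import hashlib


def rollout_bucket(user_id, salt="canary-v2"):
    """Map a user to a stable bucket in [0, 100) using a hash, not randomness."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % 100


def choose_variant(user_id, canary_percent, salt="canary-v2"):
    """Send roughly `canary_percent` of users to the canary, the rest to stable."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    if rollout_bucket(user_id, salt) < canary_percent:
        return "canary"
    return "stable"
```

Two properties matter here. A given user always lands in the same variant for a given salt, so their experience does not flip between requests. And raising the percentage only adds users to the canary; nobody already on it gets moved back. Setting the percentage to 0 acts as an immediate rollback switch.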
---
## What Comes Next
Next module — 00_realtime_voice_agents — applies all these production skills to the hardest latency challenge: voice AI and
real-time streaming, where every millisecond of delay breaks the user experience.
That is why this module sits here. First learn how to run models like products. Then learn how to run them under brutal
realtime constraints. That sequence is intentional.