# 02. MLOps & Production — Narrative Explainer

> Module 17 · Companion files: 01_weekly_plan.md · 03_study_material.md · 04_daily_recall.md · 05_hands_on_lab.md · 06_revision.md

## Table of Contents
- ELI5 — The Factory Floor Analogy
- Chapter 1 — The Opening Failure
- Chapter 2 — Model Lifecycle Management
- Chapter 3 — CI/CD for ML
- Chapter 4 — Serving Infrastructure
- Chapter 5 — Monitoring and Maintenance
- Retrieval Prompts
- Honest Admission
- Chapter 6 — Recap, Interview Frame, and Bridge
- Foundation-Gap Audit
- What Comes Next

## ELI5 — The Factory Floor Analogy

Imagine a brilliant R&D lab. They built a product that works once. Everyone claps. The demo looks excellent. The notebook output looks sharp. The boss says, “Ship it next week.” Now pause. Training is the R&D lab. MLOps is the factory. The factory has a harder job. It must produce the same quality daily. It must survive noisy inputs, broken machines, and impatient customers.

In this story, remember five named helpers. They will stay with you throughout the module.

- the assembly line = the CI/CD pipeline.
- the quality gate = the automated eval before promotion.
- the warehouse = the model registry.
- the production monitor = observability and alerts.
- the upgrade without downtime = blue-green or canary deployment.

Now see the full picture.

R&D lab
↓
training run
↓
artifacts + metrics
↓
the warehouse
↓
the assembly line
↓
the quality gate
↓
production service
↓
the production monitor
↓
feedback, rollback, retraining

Training success ≠ Production success.

- Training asks: "Can this model learn?"
- Production asks: "Can this system deliver safely, repeatedly, cheaply, and with proof?"

## Chapter 1 — The Opening Failure

### 1.1 The notebook triumph

Your model works in a notebook. Accuracy looks good. A few screenshots look even better. The team deploys it happily. For two weeks, everyone relaxes. Then, over the next month, performance degrades. Nobody notices for three weeks. When someone finally notices, nobody can reproduce the training run. The original data snapshot is gone. The feature code changed twice. The hyperparameters live inside one forgotten notebook cell. This is the opening failure. It is ordinary. It is painful. It is expensive. It is why MLOps exists.

### 1.2 What exactly broke

Usually, not one thing. Usually, five things broke together.

- Data distribution shifted.
- Monitoring was shallow.
- Training lineage was incomplete.
- Deployment had no guardrail.
- Incident response was improvised.

A notebook hides these risks. A notebook is personal. Production is organizational. The failure appears when work crosses teams.

### 1.3 The silent three-week gap

The most dangerous failure is silent degradation. No crash. No red screen. No obvious pager. Just slightly worse predictions. Then worse business outcomes. Recommendation rates drop. Fraud misses increase. Support deflection falls. Sales scoring becomes noisy. Users lose trust before dashboards move. This is why latency-only monitoring is insufficient. A system can be fast and wrong. That is still a production incident.

### 1.4 Why nobody could reproduce the run

Reproduction fails when lineage is broken. Lineage means the full chain of evidence. You need to know:

- which code commit trained the model,
- which dataset version fed the run,
- which features were materialized,
- which hyperparameters were used,
- which artifacts were produced,
- which eval set approved promotion.

Without that chain, debugging becomes storytelling. Two engineers remember different histories. Both sound plausible. Neither can prove the path.

### 1.5 Prototype versus product

Let me say this plainly. A prototype answers, “Could this work?” A product answers, “Can this keep working under pressure?” Prototype success is scientific. Product success is operational. The second is harder. The second decides careers.

### 1.6 The senior lens

Senior engineers do not stop at model accuracy. They ask operational questions immediately.

- What is the rollback path?
- Where is the source of truth?
- How do we know drift started?
- Who is paged at 2 a.m.?
- What cost curve appears at 10x traffic?
- How quickly can we retrain safely?

That habit is the difference. It is not cynicism. It is maturity.

### 1.7 The failure chain in one diagram

notebook win
↓
manual deploy
↓
input distribution shifts
↓
quality slips quietly
↓
no alert fires
↓
business metric degrades
↓
panic retraining starts
↓
original run cannot be reproduced
↓
slow, expensive recovery

### 1.8 The stakes

MLOps is what separates prototype from product. That sentence is the heart of this module. Without MLOps, ML remains a demo culture. With MLOps, ML becomes an operating discipline. You are not just shipping predictions. You are shipping promises. Promises about reliability. Promises about reversibility. Promises about cost. Promises about accountability. If those promises break, trust breaks first. And trust is slower to retrain than any model.

## Chapter 2 — Model Lifecycle Management

### 2.1 Start with the run, not the model

Many teams obsess over the final model file. That is too late. The real unit of work is the run. A run captures how a model came into existence. Think like an auditor. If tomorrow somebody asks, “Why this model?” you need evidence, not vibes.

A good run record stores:

- code commit hash,
- dataset version,
- feature definitions,
- hyperparameters,
- environment details,
- metrics,
- artifacts,
- owner,
- approval decision.

### 2.2 Experiment tracking

Experiment tracking tools solve memory loss. They answer, “What did we try?” They also answer, “What actually worked?” The classic tools are MLflow and Weights & Biases. You may also see SageMaker Experiments, Vertex Experiments, or Neptune. But the core idea stays identical.

Do not log only the winning run. That is a common mistake. Log the failed runs too. Production wisdom often hides inside failures.

### 2.3 What to log every single time

At minimum, log these fields. Treat them as non-negotiable.

| Category | Must log | Why it matters |
|---|---|---|
| Code | commit SHA, branch, training script path | Reproducibility starts here |
| Data | dataset snapshot ID, filters, split logic | Drift and leakage debugging |
| Features | feature spec version, transformations | Prevents train-serve skew |
| Config | hyperparameters, seeds, hardware | Explains output variance |
| Metrics | train/val/test metrics, slice metrics | Approval evidence |
| Artifacts | model file, tokenizer, plots, confusion matrix | Deployment handoff |
| Environment | package versions, container image | Rebuild accuracy |
| Governance | owner, review status, risk notes | Accountability |

See the rhythm here. Every line exists because some team once suffered without it.
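
As a minimal sketch of the logging checklist, the record below groups the must-log categories and refuses to serialize a run that is missing one. The names `RunRecord`, `REQUIRED_CATEGORIES`, and `capture_environment` are hypothetical, and the sample values are made up. A real tracker such as MLflow or W&B would store the same facts as params, tags, and artifacts instead of one dictionary.

```python
import json
import platform
import sys
from dataclasses import asdict, dataclass, field

# Categories follow the checklist; the keys inside each category are
# illustrative assumptions, not a fixed schema.
REQUIRED_CATEGORIES = (
    "code", "data", "features", "config",
    "metrics", "artifacts", "environment", "governance",
)


def capture_environment() -> dict:
    """Record interpreter and platform details so the run can be rebuilt."""
    return {"python": sys.version.split()[0], "platform": platform.platform()}


@dataclass
class RunRecord:
    run_id: str
    status: str  # "succeeded" or "failed"; failed runs are logged too
    fields: dict = field(default_factory=dict)

    def missing_categories(self) -> list:
        """Return required categories that are absent or empty."""
        return [c for c in REQUIRED_CATEGORIES if not self.fields.get(c)]

    def to_json(self) -> str:
        """Serialize the run, refusing to do so if evidence is incomplete."""
        missing = self.missing_categories()
        if missing:
            raise ValueError(f"run {self.run_id} is missing: {missing}")
        return json.dumps(asdict(self), sort_keys=True)


record = RunRecord(
    run_id="run-001",
    status="failed",
    fields={
        "code": {"commit": "abc1234", "branch": "main"},
        "data": {"snapshot_id": "2024-05-01", "split": "time-based"},
        "features": {"spec_version": "v3"},
        "config": {"lr": 0.01, "seed": 42},
        "metrics": {"val_auc": 0.71},
        "artifacts": {"model": "model.pkl"},
        "environment": capture_environment(),
        # "governance" is left out on purpose to show the check firing.
    },
)
print(record.missing_categories())  # ['governance']
```

The design choice worth noticing is that completeness is checked at write time. A run that cannot explain itself never reaches the tracker looking healthy.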

### 2.4 Tool comparison — experiment tracking

| Tool | Best for | Strengths | Tradeoffs |
|---|---|---|---|
| MLflow | Open, flexible teams | Simple model registry, broad ecosystem, self-hostable | UI feels utilitarian |
| W&B | Research-heavy teams | Excellent dashboards, sweep support, collaboration | SaaS cost, vendor dependency |
| SageMaker Experiments | AWS shops | IAM and managed integration | AWS gravity everywhere |
| Vertex AI Experiments | GCP shops | Managed, connected to Vertex pipelines | GCP-centric workflow |
| Neptune | Metric-heavy experimentation | Strong metadata organization | Another platform to adopt |

If your team is small and practical, MLflow is enough. If your team lives in research dashboards, W&B feels pleasant. Do not over-romanticize tool choice. Operational habits matter more than the logo.

### 2.5 Model registry — the warehouse

Now let us meet the warehouse. A registry is the place where approved models live. Not every run deserves promotion. The registry separates experiments from deployable assets.

A mature registry stores:

- model name,
- version,
- stage,
- approval metadata,
- linked run,
- linked dataset,
- artifact pointers,
- deprecation notes.

Typical stages look like this:

Do not treat the registry like a file folder. It is a control point. Promotion into production should require evidence. That evidence often comes from the quality gate.
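
To make "control point" concrete, here is a small sketch of a registry that only allows certain stage transitions and blocks production promotion without gate evidence. The stage names (`None`, `Staging`, `Production`, `Archived`) are an assumption borrowed from the labels MLflow's registry has used, and `ModelVersion`, `promote`, and the `gate_passed` key are hypothetical.

```python
from dataclasses import dataclass, field

# Assumed stage graph; adapt the names and edges to your own policy.
ALLOWED_TRANSITIONS = {
    "None": {"Staging", "Archived"},
    "Staging": {"Production", "Archived"},
    "Production": {"Archived"},
    "Archived": set(),
}


class PromotionError(Exception):
    """Raised when a stage change violates registry policy."""


@dataclass
class ModelVersion:
    name: str
    version: int
    run_id: str  # link back to the run that produced this version
    stage: str = "None"
    evidence: dict = field(default_factory=dict)


def promote(mv: ModelVersion, target: str) -> ModelVersion:
    """Move a model version to a new stage if policy allows it."""
    if target not in ALLOWED_TRANSITIONS[mv.stage]:
        raise PromotionError(f"{mv.stage} -> {target} is not allowed")
    if target == "Production" and not mv.evidence.get("gate_passed"):
        raise PromotionError("production promotion requires a passed quality gate")
    mv.stage = target
    return mv


candidate = ModelVersion(name="churn", version=7, run_id="run-001")
promote(candidate, "Staging")
candidate.evidence["gate_passed"] = True  # normally written by the pipeline
promote(candidate, "Production")
print(candidate.stage)  # Production
```

A file folder would accept any file at any time. The difference here is that the registry says no.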

### 2.6 What lives in a good model card

A model version should travel with documentation. Not a novel. Just the truth.

A useful model card contains:

- task definition,
- intended users,
- training data summary,
- evaluation datasets,
- slice-level behavior,
- known failure modes,
- latency profile,
- cost profile,
- safety or policy notes.

That small discipline prevents heroic memory dependence. People leave teams. Model cards stay behind.

### 2.7 Versioning is bigger than Git alone

Many beginners say, “We already use Git.” Very good. Git versions code. It does not fully version data, artifacts, or deployed states.

In ML, you must version four things together.

If only one changes, behavior can change. That is why lineage graphs matter.

### 2.8 Reproducibility is a stack, not a wish

Reproducibility has layers. Missing one layer can ruin the whole attempt.

Layer 1: code reproducibility. Same commit. Same scripts.

Layer 2: data reproducibility. Same snapshot. Same filters. Same split logic.

Layer 3: environment reproducibility. Same package versions. Same CUDA stack. Same container image.

Layer 4: execution reproducibility. Same seed. Same hardware class. Same distributed setup.

Layer 5: evaluation reproducibility. Same benchmark set. Same thresholding rules. Same business acceptance criteria.

See, teams often stop at layer 1. Then they wonder why results drift mysteriously.

### 2.9 Artifact storage

Artifacts are the physical outputs of runs. They include:

- model binaries,
- tokenizers,
- embeddings,
- plots,
- calibration files,
- checkpoints,
- logs,
- validation reports.

Object stores usually hold these artifacts. S3, GCS, and Azure Blob are common choices. The tracker stores metadata. The blob store holds the heavy files.

This split is sensible. Metadata wants quick queries. Artifacts want durable cheap storage.

### 2.10 Artifact storage rules that save pain

Use deterministic paths. Use immutable version folders. Never replace artifacts silently. Hash important files. Tag retention classes clearly. Encrypt sensitive artifacts.

One practical pattern:

Boring naming conventions prevent dramatic outages. Please do not underestimate boring conventions.
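
One way to sketch deterministic paths, immutable versions, and hashing together is shown below. The path layout `models/<name>/v<version>/<file>` and the names `artifact_key` and `ImmutableStore` are assumptions for illustration only. The in-memory dictionary stands in for a real object store such as S3 or GCS.

```python
import hashlib
from pathlib import PurePosixPath


def artifact_key(model: str, version: int, filename: str) -> str:
    """Build the same object-store key for the same inputs, every time."""
    return str(PurePosixPath("models") / model / f"v{version:04d}" / filename)


def sha256_hex(payload: bytes) -> str:
    """Content hash used to detect silent replacement."""
    return hashlib.sha256(payload).hexdigest()


class ImmutableStore:
    """In-memory stand-in for an object store that refuses silent overwrites."""

    def __init__(self):
        self._objects = {}  # key -> (digest, payload)

    def put(self, key: str, payload: bytes) -> str:
        digest = sha256_hex(payload)
        existing = self._objects.get(key)
        if existing is not None and existing[0] != digest:
            raise FileExistsError(f"{key} already exists with different content")
        self._objects[key] = (digest, payload)
        return digest


store = ImmutableStore()
key = artifact_key("churn", 7, "model.bin")
digest = store.put(key, b"weights-v7")
print(key)  # models/churn/v0007/model.bin
```

Writing the same bytes twice is harmless. Writing different bytes to the same version folder raises, which is exactly the boring behavior you want.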

### 2.11 Lineage diagram

### 2.12 The minimum lifecycle discipline

If your team is still early, start small. Do not wait for a grand platform team.

Minimum viable lifecycle management means:

- every run tracked,
- every deployable model registered,
- every production model linked to evidence,
- every artifact stored durably,
- every rollback version discoverable in minutes.

That alone moves you from chaos to competence.

---

## Chapter 3 — CI/CD for ML

### 3.1 Why software CI/CD is not enough

Normal software CI/CD assumes code changes drive behavior. ML systems violate that assumption. Data changes behavior too. Features change behavior too. Model weights definitely change behavior.

So the pipeline cannot only ask, “Did the tests pass?” It must also ask, “Did the model remain good enough?” That second question is the quality gate.

### 3.2 Meet the assembly line

This is the assembly line from our analogy. A mature ML pipeline does not move one file. It moves evidence.

A good assembly line reduces heroics. Humans decide policy. Machines perform repetition. That is the correct split.

### 3.3 Training pipelines

Training pipelines make retraining boring. That is praise. Boring is good here.

A strong training pipeline handles:

- data extraction,
- validation,
- feature generation,
- split creation,
- training,
- evaluation,
- packaging,
- registration.

The point is not glamour. The point is repeatability. If retraining depends on one patient engineer, the system is fragile.

### 3.4 Pipeline orchestration tools

| Tool | Best for | Strengths | Tradeoffs |
| --- | --- | --- | --- |
| GitHub Actions | Small teams, lightweight CI | Familiar, simple, good for glue | Weak for heavy DAG orchestration |
| Airflow | Batch-oriented orchestration | Mature scheduling, retries, DAG visibility | ML-specific metadata is manual |
| Kubeflow Pipelines | Kubernetes-heavy ML teams | Container-native, ML-oriented steps | Operational overhead |
| TFX | TensorFlow-centric ecosystems | Strong validation and lineage concepts | Opinionated, TF-flavored |
| SageMaker Pipelines | AWS-managed shops | Managed infra, IAM integration | AWS lock-in |
| Vertex Pipelines | GCP-managed shops | Managed orchestration, pipeline tracking | GCP lock-in |

Early teams often succeed with GitHub Actions plus Python scripts. At scale, orchestration needs become sharper. But again, do not over-engineer on day one.

### 3.5 Automated evaluation gates

The pipeline should stop bad models automatically. That stopping point is the quality gate.

A quality gate usually checks:

- headline metrics,
- slice metrics,
- regression thresholds,
- calibration changes,
- latency regressions,
- safety policy tests,
- cost constraints.

Here is a simple mental model.

Do not use only one metric. That is how teams ship silent regressions. AUC may improve while a critical segment collapses.

### 3.6 Champion versus challenger

This language matters. The current production model is the champion. The new candidate is the challenger.

Your pipeline should ask:

- Is the challenger better overall?
- Is it worse on any critical slice?
- Is it cheaper or more expensive?
- Is it more stable across time windows?

This framing prevents lazy comparison. You are not comparing a model to your hopes. You are comparing it to the currently trusted baseline.

### 3.7 Automated retraining

Now we reach a seductive topic. Automated retraining sounds impressive. Sometimes it is wise. Sometimes it is reckless.

Good triggers for retraining:

- scheduled refresh with stable labels,
- clear data accrual cadence,
- monitored feature pipelines,
- robust eval sets,
- human review for promotion.

Bad triggers for retraining:

- “traffic feels different,”
- unlabeled drift without eval plan,
- missing rollback version,
- broken feature lineage,
- no post-train acceptance criteria.

See, automation multiplies both discipline and chaos. If the process is weak, automation accelerates the weakness.

### 3.8 Feature stores

Feature stores try to solve consistency. The main promise is simple. The same feature logic should serve training and inference.

That matters most in classical ML systems. Credit scoring, churn, fraud, personalization, pricing. These systems depend on tabular features with freshness constraints.

A feature store typically offers:

- offline feature retrieval,
- online low-latency serving,
- feature definitions,
- freshness metadata,
- point-in-time joins.

Feast is a common open-source choice. Managed clouds offer their own versions too.
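
Point-in-time joins are the least obvious item in that list, so here is a minimal sketch in plain Python. For each labeled event it attaches the latest feature value known at or before the event time, so training never sees the future. The function name and the tuple layouts are assumptions; feature stores such as Feast implement the same idea at scale.

```python
from bisect import bisect_right


def point_in_time_join(label_events, feature_history):
    """Attach to each label the latest feature value known at or before its timestamp.

    label_events: list of (entity_id, event_ts, label)
    feature_history: dict of entity_id -> list of (feature_ts, value),
                     sorted by feature_ts ascending
    """
    rows = []
    for entity_id, event_ts, label in label_events:
        history = feature_history.get(entity_id, [])
        timestamps = [ts for ts, _ in history]
        # Index of the first feature row strictly after the event time.
        idx = bisect_right(timestamps, event_ts)
        value = history[idx - 1][1] if idx > 0 else None
        rows.append((entity_id, event_ts, value, label))
    return rows


feature_history = {"u1": [(10, 3), (20, 5), (30, 9)]}
label_events = [("u1", 25, 1), ("u1", 5, 0), ("u1", 30, 1)]

for row in point_in_time_join(label_events, feature_history):
    print(row)
# ('u1', 25, 5, 1)     -> value from ts=20, not the later ts=30
# ('u1', 5, None, 0)   -> no feature existed yet
# ('u1', 30, 9, 1)     -> value written exactly at the event time counts
```

A naive join on entity alone would hand the event at time 25 the value written at time 30. That is leakage, and it is one source of train-serve skew.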

### 3.9 When feature stores help and when they do not

Feature stores help when many models share operational features. They help when point-in-time correctness matters. They help when train-serve skew burned you already.

They help less when your application is pure prompt engineering. They help less when features are trivial and few. They help less when the team cannot maintain another platform surface.

Use the pattern where the pain exists. Not because conference talks made it sound mandatory.

### 3.10 Data versioning with DVC

Git is excellent for code. Large datasets need another mechanism. That is where DVC becomes useful.

DVC gives you:

- dataset version references in Git,
- remote artifact backing in object storage,
- reproducible pipeline stages,
- cache reuse.

Think of DVC as “Git-like pointers for large data and ML pipelines.” It does not magically solve governance. But it gives structure where folders previously lied.

### 3.11 CI/CD for ML in one diagram

### 3.12 Common mistakes in ML pipelines

- treating retraining as a cron job only,
- skipping slice-level metrics,
- deploying straight from notebooks,
- ignoring data validation,
- registering models without evidence,
- mixing training-only code with serving code,
- forgetting rollback automation.

Every one of these mistakes looks harmless early. Then scale arrives. Then they become very expensive habits.

---

## Chapter 4 — Serving Infrastructure

### 4.1 Serving is where latency becomes political

Training teams often think serving is “just expose an endpoint.” No, my friend. Serving is where compute, product, and finance start negotiating.

Users feel latency immediately. Finance feels GPU bills monthly. SRE feels incidents at night. Product feels abandonment during spikes. Serving is where all four voices meet.

### 4.2 The serving stack picture

That middle box is everything. The model alone does not save you. The scheduler, cache, and router decide real experience.

### 4.3 Model serving choices

For modern LLM inference, three names appear constantly. You must know them cold.

| Stack | Best for | Strengths | Tradeoffs |
| --- | --- | --- | --- |
| vLLM | Open-weight LLM serving | Continuous batching, PagedAttention, OpenAI-compatible APIs | LLM-centric, not universal |
| TGI | Hugging Face-centered serving | Familiar ecosystem, reasonable performance, mature | Usually trails vLLM in raw throughput |
| Triton Inference Server | Multi-framework inference | Flexible, enterprise-friendly, many backends | Higher ops complexity |
| KServe / Seldon | Kubernetes model platforms | Standard deployment patterns | Adds platform layers |
| SageMaker / Vertex endpoints | Managed inference | Less ops burden | Higher cost, less low-level control |

If the interview asks, start with vLLM for open weights. Mention TGI as the ecosystem cousin. Mention Triton when multi-model or framework diversity matters.

### 4.4 Autoscaling

Autoscaling sounds simple. Add pods when traffic rises. But ML serving complicates everything.

What should trigger scaling?

- request QPS,
- queue length,
- tokens per second,
- GPU utilization,
- p95 latency,
- memory pressure.

For LLMs, QPS alone is weak. One long generation can dominate resources. Token-level work matters more than raw request count.
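
As a rough sketch of token-aware scaling, the function below sizes the fleet from token throughput and queue depth instead of request count. Every parameter name and default here is an assumption for illustration, including the 70% target utilization. A real autoscaler would also smooth the signal and add cooldowns.

```python
import math


def desired_replicas(tokens_per_sec_demand: float,
                     tokens_per_sec_per_replica: float,
                     queue_depth: int,
                     max_queue_per_replica: int,
                     min_replicas: int = 1,
                     max_replicas: int = 16,
                     target_utilization: float = 0.7) -> int:
    """Pick a replica count that satisfies both token demand and queue depth."""
    # Capacity each replica should carry while leaving headroom for bursts.
    usable_capacity = tokens_per_sec_per_replica * target_utilization
    by_tokens = math.ceil(tokens_per_sec_demand / usable_capacity)
    by_queue = math.ceil(queue_depth / max_queue_per_replica)
    wanted = max(by_tokens, by_queue)
    return max(min_replicas, min(max_replicas, wanted))


# Token demand dominates: 9000 tok/s against 1750 usable tok/s per replica.
print(desired_replicas(9000, 2500, queue_depth=40, max_queue_per_replica=16))   # 6
# Queue depth dominates even though token demand is modest.
print(desired_replicas(1000, 2500, queue_depth=100, max_queue_per_replica=16))  # 7
```

Taking the maximum of the two signals means a few very long generations cannot hide behind a low request rate.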

### 4.5 Batching

Batching is the oldest trick in ML serving. Still essential. Still misunderstood.

Three useful patterns exist.

| Batch type | What happens | Best use |
| --- | --- | --- |
| Static batching | Wait for fixed-size batch | Offline jobs |
| Dynamic batching | Wait briefly, form batch from arrivals | Low-to-medium real-time traffic |
| Continuous batching | Add requests into active decode loop | High-throughput LLM serving |

Continuous batching is why vLLM matters. It keeps GPUs busy during generation. Without it, expensive hardware idles shamefully.
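
The middle row, dynamic batching, is easy to simulate. The sketch below groups request arrival times into batches that close when full or when the oldest request has waited too long. It is a simulation over timestamps, not a serving loop, and it does not model continuous batching, which needs scheduler support inside the decode loop. The function name and parameters are hypothetical.

```python
def dynamic_batches(arrivals, max_batch_size, max_wait_ms):
    """Group sorted arrival times (in ms) into batches.

    A batch closes when it reaches max_batch_size, or when a new request
    arrives after the oldest request in the batch has waited max_wait_ms.
    """
    batches, current = [], []
    for t in arrivals:
        # Oldest request has waited too long: ship what we have.
        if current and t - current[0] > max_wait_ms:
            batches.append(current)
            current = []
        current.append(t)
        # Batch is full: ship immediately.
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


arrivals = [0, 2, 3, 15, 16, 17, 18, 40]
print(dynamic_batches(arrivals, max_batch_size=4, max_wait_ms=10))
# [[0, 2, 3], [15, 16, 17, 18], [40]]
```

The two knobs pull in opposite directions. A larger batch size helps throughput, a shorter wait helps latency, and the right setting depends on your traffic shape.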

### 4.6 Caching

Caching is the cheapest performance engineer on your team. Use it well.

Useful caches include:

- feature cache,
- embedding cache,
- prompt prefix cache,
- retrieval cache,
- response cache for deterministic tasks.

But be careful. Caching can also preserve stale errors. So cache invalidation rules matter deeply. That old computer science joke remains fully alive here.
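
Here is a minimal response-cache sketch with two invalidation rules built in: entries expire after a TTL, and the model version is part of the key, so a new deployment never serves answers produced by the old one. The class name `ResponseCache` and the injectable `clock` are assumptions made to keep the example testable.

```python
import hashlib
import time


class ResponseCache:
    """TTL cache keyed by model version plus a hash of the prompt."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (stored_at, value)

    @staticmethod
    def make_key(model_version: str, prompt: str) -> str:
        digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        return f"{model_version}:{digest}"

    def get(self, model_version: str, prompt: str):
        key = self.make_key(model_version, prompt)
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > self._ttl:
            del self._entries[key]  # expired: drop it so stale answers die
            return None
        return value

    def put(self, model_version: str, prompt: str, value) -> None:
        key = self.make_key(model_version, prompt)
        self._entries[key] = (self._clock(), value)


now = [0.0]  # fake clock so the example is deterministic
cache = ResponseCache(ttl_seconds=60, clock=lambda: now[0])
cache.put("v1", "What is PSI?", "Population Stability Index")
print(cache.get("v1", "What is PSI?"))  # Population Stability Index
print(cache.get("v2", "What is PSI?"))  # None, new model version misses
now[0] = 61.0
print(cache.get("v1", "What is PSI?"))  # None, entry expired
```

Only cache responses this way for deterministic tasks. For sampled generations, a cached answer silently removes the variety users may expect.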

### 4.7 GPU orchestration

Serving one model on one machine is easy. Serving many models across GPUs is real operations.

You must care about:

- GPU type selection,
- memory fragmentation,
- tenancy policy,
- placement,
- warm starts,
- bin packing,
- failure recovery.

Kubernetes plus device plugins is common. Ray Serve appears in some teams. Managed endpoints hide part of this pain. They also hide useful control knobs. Tradeoffs everywhere.

### 4.8 Cost optimization begins with architecture

Teams jump to quantization first. Sometimes that is correct. Often the earlier wins are simpler.

Start with these levers:

1. Route easy tasks to cheaper models.
2. Reduce prompt length aggressively.
3. Cache repeated prefixes or retrieval outputs.
4. Tune output token limits.
5. Batch smarter.
6. Right-size GPU class.
7. Separate latency-critical and batch workloads.

Only after these, evaluate heavier changes. Such as quantization, distillation, or model swaps.

### 4.9 Latency optimization basics

Module 18 will assume you know these basics already. So listen carefully.

Latency is not one number. Break it into stages.

If you do not break it apart, you cannot improve it intelligently. Always ask where the milliseconds live.
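
A small timer is enough to start asking where the milliseconds live. The sketch below records per-stage durations with a context manager. The stage names (`retrieval`, `generation`) are placeholders chosen for illustration, not the module's official breakdown, and `StageTimer` is a hypothetical helper. In production you would more likely emit these as trace spans.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Accumulate wall-clock time per named stage of a request."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.durations_ms = {}

    @contextmanager
    def stage(self, name: str):
        start = self._clock()
        try:
            yield
        finally:
            # Record the time even if the stage raised.
            elapsed_ms = (self._clock() - start) * 1000.0
            self.durations_ms[name] = self.durations_ms.get(name, 0.0) + elapsed_ms

    def slowest(self) -> str:
        """Name of the stage that consumed the most time."""
        return max(self.durations_ms, key=self.durations_ms.get)


timer = StageTimer()
with timer.stage("retrieval"):
    sum(range(10_000))    # stand-in for a retrieval call
with timer.stage("generation"):
    sum(range(200_000))   # stand-in for model decode
print(timer.durations_ms)
print("slowest stage:", timer.slowest())
```

Once each stage has its own number, the optimization conversation changes from "the endpoint is slow" to "this stage is slow".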

### 4.10 Blue-green and canary — the upgrade without downtime

Now remember the fifth helper: the upgrade without downtime. This means you can deploy new versions safely, without taking the whole service offline.

| Strategy | What it does | Best use |
| --- | --- | --- |
| Blue-green | Run old and new stacks side by side | Fast full cutover and rollback |
| Canary | Send small traffic slice to new version | Early risk detection on real users |
| Shadow | Duplicate traffic without user-visible response | Measure latency and behavior safely |
| Percentage rollout | Gradually increase traffic share | Controlled scaling of confidence |

Blue-green gives crisp rollback. Canary gives live evidence. Shadow gives safe observation. Use the right tool for the failure you fear.
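
Canary and percentage rollout both need a stable way to decide which users see the new version. One common approach, sketched here under assumed names, hashes the user ID into a bucket so the same user always lands on the same side. The `salt` value and the function name `route` are hypothetical.

```python
import hashlib


def route(user_id: str, canary_percent: float, salt: str = "rollout-1") -> str:
    """Deterministically assign a user to 'canary' or 'stable'.

    canary_percent is in the range 0..100. The salt lets a new rollout
    reshuffle users instead of always exposing the same people first.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # 0..9999, i.e. 0.01% resolution
    return "canary" if bucket < canary_percent * 100 else "stable"


users = [f"user-{i}" for i in range(10_000)]
share = sum(route(u, 5) == "canary" for u in users) / len(users)
print(f"canary share at 5%: {share:.3f}")

# Raising the percentage only adds users; nobody flips back to stable.
still_in = all(route(u, 20) == "canary" for u in users if route(u, 5) == "canary")
print("5% cohort stays in 20% cohort:", still_in)  # True
```

Because the bucket is fixed per user, widening from 5% to 20% keeps the original cohort inside the canary. That keeps user experience consistent and makes metrics easier to read.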

### 4.11 Serving diagram with control points

### 4.12 Cost table — ballpark infra economics

These numbers move by region and vendor. Treat them as ballpark, not scripture.

| Resource | Rough cost | Where it fits |
| --- | --- | --- |
| L4 GPU | ~$0.7 to $1.2 per hour | Smaller inference, embedding jobs |
| A10G GPU | ~$1 to $1.8 per hour | Mid-tier inference |
| A100 80GB | ~$3 to $5 per hour | Heavy serving, training, bigger contexts |
| H100 | ~$8 to $12+ per hour | Frontier-scale inference |
| S3 / GCS storage | ~$20 to $30 per TB-month | Artifact and dataset storage |
| Prometheus + Grafana infra | low hundreds per month | Monitoring stack |

A senior answer always includes utilization. Hardware price alone means little. Idle GPUs are luxury furniture.
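
The utilization point is simple arithmetic, sketched below. The hourly price sits inside the A100 ballpark from the table, while the peak throughput of 2,000 tokens per second is purely an assumed number for illustration. The function name is hypothetical.

```python
def cost_per_million_tokens(gpu_hourly_usd: float,
                            peak_tokens_per_sec: float,
                            utilization: float) -> float:
    """Effective cost of generating one million tokens on one GPU."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = peak_tokens_per_sec * utilization * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


busy = cost_per_million_tokens(4.0, 2000, utilization=0.8)
idle = cost_per_million_tokens(4.0, 2000, utilization=0.2)
print(f"80% utilized: ${busy:.2f} per 1M tokens")  # $0.69
print(f"20% utilized: ${idle:.2f} per 1M tokens")  # $2.78
```

Same GPU, same price, four times the unit cost. This is why batching, routing, and right-sizing usually come before buying faster hardware.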

---

## Chapter 5 — Monitoring and Maintenance

### 5.1 The production monitor

Now meet the production monitor. If the warehouse stores trust, the monitor protects trust. It tells you whether reality still matches your assumptions.

Do not monitor only servers. Monitor the model behavior too. An ML system can be healthy operationally and unhealthy scientifically.

### 5.2 Four families of signals

A good monitoring stack watches four families together.

1. System signals: CPU, GPU, memory, queue depth, latency.
2. Data signals: feature distributions, missingness, schema changes.
3. Model signals: prediction scores, calibration, drift indicators.
4. Business signals: conversion, fraud catch rate, support deflection, revenue.

When these disagree, that disagreement is informative. Fast server plus bad business metric means quality issue. Stable quality plus exploding latency means infrastructure issue.

### 5.3 Data drift detection

Data drift means input distributions changed. The model may still be identical. But the world feeding it shifted.

Examples:

- new user segment arrived,
- sensor firmware changed,
- upstream feature pipeline broke,
- prompt style changed after a product redesign.

Useful detection methods include:

- summary statistics drift,
- PSI,
- KL divergence,
- embedding-space shift,
- missing-value surge,
- schema validation.

The trick is not only detecting change. The trick is judging whether the change matters. Not every drift event deserves retraining. Some deserve investigation first.
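
PSI (Population Stability Index) is the most common of these in tabular systems, so a short sketch helps. It bins a reference sample and a live sample with the same edges, then sums `(actual - expected) * ln(actual / expected)` over the bins. The bin edges, the `eps` floor for empty bins, and the sample data are assumptions for illustration. A commonly quoted rule of thumb treats values above roughly 0.25 as a significant shift, but tune that for your own features.

```python
import math
from bisect import bisect_right


def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index between two samples over shared bin edges.

    edges: sorted cut points; len(edges) + 1 bins are created.
    eps:   floor for empty bins so the logarithm stays defined.
    """
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")

    def proportions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[bisect_right(edges, v)] += 1
        return [max(c / len(values), eps) for c in counts]

    exp_p = proportions(expected)
    act_p = proportions(actual)
    return sum((a - e) * math.log(a / e) for a, e in zip(act_p, exp_p))


reference = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
shifted = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
edges = [3.5, 6.5]

print(psi(reference, reference, edges))  # 0.0
print(psi(reference, shifted, edges) > 0.25)  # True
```

PSI tells you that the distribution moved. It does not tell you whether model quality suffered, which is why the investigation step still matters.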

### 5.4 Model drift

Model drift means predictive behavior degraded over time. Sometimes the world changed. Sometimes labels changed. Sometimes the model was always fragile. Sometimes a vendor silently updated the underlying model.

Here is the practical difference.

- Data drift asks, “Did inputs change?”
- Model drift asks, “Did performance change?”

The two often travel together. But not always. That distinction matters during incidents.

### 5.5 Drift table

| Drift type | What changed | Typical signal | First response |
| --- | --- | --- | --- |
| Data drift | Input distribution | PSI, missingness, schema shift | Inspect upstream changes |
| Concept drift | Mapping from input to truth | Business metric drop, label lag analysis | Re-evaluate assumptions |
| Model drift | Output quality over time | Online eval or delayed labels | Compare against champion |
| Vendor drift | External model behavior changed | Same prompt, new output pattern | Pin version, run fallback |

Do not casually merge these labels. Precise naming produces faster response.

### 5.6 Online monitoring for LLM systems

LLM systems add fresh headaches. Ground truth is often delayed or ambiguous. So you use proxies.

Common online signals:

- refusal rate,
- hallucination reports,
- groundedness score,
- judge-model rating,
- user thumbs up or down,
- fallback rate,
- retrieval hit quality,
- token cost per session.

None is perfect alone. Together they become a practical early-warning system.

### 5.7 A/B testing in production

A/B testing sounds standard. For ML systems, it becomes subtle. For LLM systems, it becomes extra subtle.

Why?

- quality labels are noisy,
- long-tail failures matter more than means,
- users may cross between variants,
- model outputs are non-deterministic,
- business effects may lag.

So do not run blind percentage tests. Define guardrails first. Latency. Error rate. Cost. Safety. Then define success metrics.

### 5.8 Shadow, canary, rollout

These are not synonyms. Please use them accurately.

- Shadow duplicates traffic safely.
- Canary exposes a small real audience.
- Percentage rollout expands exposure gradually.
- Blue-green swaps whole environments.

In incidents, language discipline saves time. If someone says “canary” but means “shadow,” the whole room imagines the wrong risk surface.

### 5.9 Rollback strategies

Rollback is not just “deploy old code.” You may need to roll back:

- model weights,
- prompt template,
- routing rule,
- feature transformation,
- retrieval index,
- traffic policy.

That is why the registry and pipeline history matter. Rollback is only fast when artifacts are already organized. You cannot improvise order during fire.
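
One way to keep those pieces organized is to deploy them as a single versioned manifest, so rollback means "return to the previous manifest" instead of hunting for six separate files. The sketch below covers four of the components; `ReleaseManifest`, `ReleaseHistory`, and all the version strings are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ReleaseManifest:
    """Everything that defines serving behavior, pinned together."""
    release_id: str
    model_version: str
    prompt_template: str
    retrieval_index: str
    routing_rule: str


def changed_components(old: ReleaseManifest, new: ReleaseManifest) -> list:
    """List which components differ between two releases."""
    return [f.name for f in fields(old)
            if f.name != "release_id" and getattr(old, f.name) != getattr(new, f.name)]


class ReleaseHistory:
    """Ordered record of deployed manifests; the last one is live."""

    def __init__(self):
        self._history = []

    def deploy(self, manifest: ReleaseManifest) -> ReleaseManifest:
        self._history.append(manifest)
        return manifest

    @property
    def current(self) -> ReleaseManifest:
        return self._history[-1]

    def rollback(self) -> ReleaseManifest:
        if len(self._history) < 2:
            raise RuntimeError("no earlier release to roll back to")
        self._history.pop()
        return self.current


history = ReleaseHistory()
r1 = history.deploy(ReleaseManifest("r1", "m-v7", "p-v3", "idx-2024-05", "route-v1"))
r2 = history.deploy(ReleaseManifest("r2", "m-v7", "p-v4", "idx-2024-05", "route-v1"))

print(changed_components(r1, r2))   # ['prompt_template']
print(history.rollback().release_id)  # r1
```

The diff is as useful as the rollback. During an incident, knowing that only the prompt changed narrows the search immediately.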

### 5.10 Incident response for AI systems

AI incidents need a runbook. Not a brave Slack thread. A runbook.

A clean runbook includes:

1. trigger conditions,
2. severity rubric,
3. owner on call,
4. triage checklist,
5. rollback steps,
6. communication template,
7. recovery verification,
8. postmortem questions.

For LLM incidents, add these extra checks:

- provider outage or throttling,
- prompt regression,
- retrieval freshness,
- safety policy false positives,
- model version drift,
- cost spike due to token explosion.

### 5.11 Monitoring architecture diagram

### 5.12 Observability stack comparison

| Layer | Common tools | What you learn |
| --- | --- | --- |
| Metrics | Prometheus, CloudWatch, Datadog | Latency, throughput, errors, saturation |
| Logs | ELK, Loki, Cloud Logging | Per-request evidence |
| Traces | OpenTelemetry, Jaeger, Honeycomb | Stage-by-stage latency |
| ML quality | Evidently, Arize, WhyLabs, custom dashboards | Drift and performance trends |
| Annotation / feedback | Label Studio, human review queues | Reality correction |

A perfect tool does not exist. A clear operating model matters more. Who reads the dashboard? Who owns the alert? Who approves rollback? These are the real questions.

### 5.13 Maintenance rhythm

High-performing teams create rhythm. Not just alerts.

Daily:

- check error, latency, and cost anomalies.

Weekly:

- review drift, slice metrics, failed cases.

Monthly:

- re-evaluate thresholds, SLOs, and runbooks.

Quarterly:

- test disaster rollback and stale-assumption risks.

Maintenance is not glamour work. It is compounding work. The teams that skip it look fine until they suddenly do not.

---

## Retrieval Prompts

Use these when you want to pull the module back from memory. Do not ask for definitions only. Ask for operational comparisons.

1. “Explain MLOps using the factory analogy, then map the assembly line, quality gate, warehouse, and production monitor to real tools.”
2. “Walk through a failed production model incident where drift went unnoticed for weeks, and show how experiment tracking plus a model registry would have shortened recovery.”
3. “Compare vLLM, TGI, and Triton for a startup serving open-weight models with strict latency targets and a small ops team.”
4. “Give me an interview answer for automated retraining: when it is safe, when it is reckless, and what evidence must exist before promotion.”
5. “Design a rollback plan for an AI system where the model, prompt, retrieval index, and routing rules can all change independently.”

---

## Honest Admission

MLOps tooling is fragmented. That is the truth. Different teams stitch together different stacks. Very few stacks feel elegant end to end.

Most teams over-engineer or under-engineer. Over-engineering means platform castles before real usage exists. Under-engineering means notebooks, manual deploys, and pure hope. Both fail differently.

Also, many vendors market “one-click MLOps.” Please be skeptical. The hard part is not the dashboard. The hard part is operational discipline across data, models, infra, and teams.

So the adult answer is balanced. Use enough tooling to create lineage, safety, and speed. Do not build a moon mission for a bicycle. But do not drive a bus with bicycle brakes either.

---
## Chapter 6 — Recap, Interview Frame, and Bridge

### 6.1 Failure-fix table

| Failure | What it looks like | Fix |
| --- | --- | --- |
| Model cannot be reproduced | Nobody knows exact data or config | Track runs, data version, environment, artifacts |
| Silent quality degradation | Business metric slips slowly | Add quality monitoring, slice reviews, alerts |
| Bad model reaches prod | Offline result looked okay once | Enforce automated eval gate |
| Train-serve skew | Offline metrics great, online poor | Use shared feature definitions or feature store |
| Rollback takes hours | Team hunts for old files | Use registry stages and immutable artifacts |
| GPU bill explodes | Throughput poor, utilization low | Batch better, cache, route cheaper models |
| Canary causes surprise outage | Latency spikes under real traffic | Shadow first, canary second, monitor guardrails |
| Vendor model changes silently | Same prompt, new behavior | Pin versions, run regression tests |
| Data refresh breaks predictions | Schema drift or null surge | Add data validation in pipeline |
| Incident response is chaotic | Slack panic, no clear owner | Maintain runbook and severity rules |

Read that table twice. This is the module in compressed form.
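
To make one row concrete, here is a minimal sketch of the "add data validation in pipeline" fix for schema drift and null surges. It uses only the standard library. `ColumnRule`, `validate_batch`, and the thresholds are hypothetical names and values chosen for illustration; a real pipeline would likely lean on a dedicated validation library instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnRule:
    """Expectation for one column (hypothetical structure for this sketch)."""
    name: str
    dtype: type
    max_null_fraction: float


def validate_batch(rows: list[dict], rules: list[ColumnRule]) -> list[str]:
    """Return human-readable violations; an empty list means the batch passes."""
    if not rows:
        return ["batch is empty"]
    problems: list[str] = []
    for rule in rules:
        # Schema drift: the column is absent from at least one row.
        missing = sum(1 for row in rows if rule.name not in row)
        if missing:
            problems.append(f"{rule.name}: missing in {missing} rows")
            continue
        values = [row[rule.name] for row in rows]
        # Null surge: more None values than the rule allows.
        null_fraction = sum(v is None for v in values) / len(values)
        if null_fraction > rule.max_null_fraction:
            problems.append(
                f"{rule.name}: null fraction {null_fraction:.2f} "
                f"exceeds {rule.max_null_fraction:.2f}"
            )
        # Type drift: a non-null value has an unexpected type.
        bad_types = sum(
            1 for v in values if v is not None and not isinstance(v, rule.dtype)
        )
        if bad_types:
            problems.append(f"{rule.name}: {bad_types} values are not {rule.dtype.__name__}")
    return problems
```

A pipeline step could call `validate_batch` on each data refresh and stop before training or scoring whenever the returned list is non-empty.
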
### 6.2 Key points to remember

- The run is the primary unit of evidence.
- The registry is a control plane, not storage only.
- The quality gate separates experimentation from promotion.
- Retraining without evaluation is automation of risk.
- Serving performance depends on scheduling, not only weights.
- Monitoring must include system, data, model, and business layers.
- Rollback must cover more than code.
- Cost discipline is architecture plus operations, not discounts alone.
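
The quality gate point can be sketched in a few lines. This is an illustrative assumption, not a prescribed policy: `passes_quality_gate`, the metric names, and the `max_regression` tolerance are all hypothetical, and every metric is assumed to be "higher is better".

```python
def passes_quality_gate(
    candidate: dict[str, float],
    production: dict[str, float],
    min_absolute: dict[str, float],
    max_regression: float = 0.01,
) -> tuple[bool, list[str]]:
    """Decide whether a candidate may be promoted.

    Assumes higher metric values are better. A candidate fails if any gated
    metric is missing, falls below its absolute floor, or regresses against
    the current production model by more than max_regression.
    """
    reasons: list[str] = []
    for metric, floor in min_absolute.items():
        value = candidate.get(metric)
        if value is None:
            reasons.append(f"{metric}: missing from candidate evaluation")
            continue
        if value < floor:
            reasons.append(f"{metric}: {value:.3f} is below the floor {floor:.3f}")
        baseline = production.get(metric)
        if baseline is not None and value < baseline - max_regression:
            reasons.append(f"{metric}: {value:.3f} regresses from production {baseline:.3f}")
    return (not reasons, reasons)


if __name__ == "__main__":
    ok, why = passes_quality_gate(
        candidate={"accuracy": 0.91, "slice_recall": 0.80},
        production={"accuracy": 0.90, "slice_recall": 0.85},
        min_absolute={"accuracy": 0.85, "slice_recall": 0.75},
    )
    print(ok, why)
```

Note that the example candidate improves overall accuracy yet still fails, because one slice regressed. That is the kind of evidence a gate exists to surface before promotion.
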
### 6.3 Important interview questions

Practice these aloud. Do not memorize robotic answers. Build decision trees.

1. “Your model degraded after deployment and nobody noticed for three weeks. What monitoring design would you add?”
2. “Design a model registry and promotion workflow for a ten-person ML team.”
3. “When is automated retraining a good idea, and when is it dangerous?”
4. “Compare vLLM, TGI, and Triton for a production LLM service.”
5. “What would your rollback strategy look like for a retrieval-augmented AI assistant?”
6. “How would you cut a $100K monthly inference bill without harming user experience?”
7. “What signals distinguish data drift from model drift in production?”

A strong answer names tradeoffs. A weak answer names only tools.
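
For question 7, it can help to show one input-side signal in code. The sketch below computes the population stability index (PSI) over pre-binned feature counts, which is one common way to quantify a shift in inputs without needing labels. Treat it as an assumption about what such a check might look like: the function name and the `eps` smoothing are illustrative, and a degradation in the model itself would still need labels or proxy outcomes to confirm.

```python
import math


def population_stability_index(
    baseline_counts: list[int],
    current_counts: list[int],
    eps: float = 1e-6,
) -> float:
    """PSI between two histograms that share the same bin edges.

    PSI = sum over bins of (q - p) * ln(q / p), where p and q are the
    baseline and current bin proportions. eps guards against empty bins.
    """
    if len(baseline_counts) != len(current_counts):
        raise ValueError("histograms must have the same number of bins")
    baseline_total = sum(baseline_counts)
    current_total = sum(current_counts)
    if baseline_total == 0 or current_total == 0:
        raise ValueError("histograms must not be empty")
    psi = 0.0
    for b, c in zip(baseline_counts, current_counts):
        p = max(b / baseline_total, eps)
        q = max(c / current_total, eps)
        psi += (q - p) * math.log(q / p)
    return psi


if __name__ == "__main__":
    training_week = [50, 30, 20]   # hypothetical bin counts at training time
    this_week = [20, 30, 50]       # hypothetical bin counts in production
    print(round(population_stability_index(training_week, this_week), 3))
```

A rising PSI with stable labelled accuracy points toward input drift that the model still tolerates; stable inputs with falling accuracy point toward a changed relationship between inputs and outcomes.
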
### 6.4 Production experience — tools and ballpark costs

If you want credible interview texture, speak concretely. Here is a useful example frame.

Example stack A — small pragmatic team

- MLflow for runs and registry.
- DVC for dataset snapshots.
- GitHub Actions for pipeline glue.
- S3 for artifacts.
- Prometheus + Grafana for observability.
- vLLM on one A10G or L4 for open-weight inference.

Ballpark cost shape:

- object storage: tens of dollars monthly early on,
- observability infra: low hundreds monthly,
- single mid-tier GPU endpoint: low hundreds to low thousands monthly,
- biggest hidden cost: engineer time.

Example stack B — managed cloud-heavy team

- SageMaker or Vertex pipelines.
- Managed endpoints.
- Managed experiment tracking.
- Cloud monitoring and alerting.

Ballpark cost shape:

- faster time to first production,
- higher per-unit infra cost,
- lower platform maintenance burden,
- easier IAM and audit integration.

Example stack C — high-throughput LLM team

- vLLM or TGI for serving.
- Kubernetes or managed GPU nodes.
- OpenTelemetry traces.
- Arize or WhyLabs for quality monitoring.
- Canary plus shadow rollout policies.

Ballpark cost shape:

- GPU cost dominates,
- utilization determines economic health,
- caching and routing often save more than model swaps.

Use ranges. State assumptions. Never pretend cloud cost is fixed independent of traffic mix.
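
The "use ranges, state assumptions" advice can be practised with a few lines of arithmetic. This is a rough sketch only: the hourly prices, throughput, and utilization figures in the example are placeholders invented for illustration, not quotes from any provider, and the function names are hypothetical.

```python
HOURS_PER_MONTH = 730  # 8760 hours per year / 12 months


def monthly_gpu_cost_range(
    hourly_low: float,
    hourly_high: float,
    gpu_count: int = 1,
    hours_on_fraction: float = 1.0,
) -> tuple[float, float]:
    """Return (low, high) monthly cost for an endpoint that is on part or all of the time."""
    if hourly_low > hourly_high:
        raise ValueError("hourly_low must not exceed hourly_high")
    hours = HOURS_PER_MONTH * hours_on_fraction * gpu_count
    return (hourly_low * hours, hourly_high * hours)


def cost_per_million_tokens(
    monthly_cost: float,
    peak_tokens_per_second: float,
    utilization: float,
) -> float:
    """Spread a monthly bill over the tokens actually served at a given utilization."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_month = peak_tokens_per_second * utilization * HOURS_PER_MONTH * 3600
    return monthly_cost / tokens_per_month * 1_000_000


if __name__ == "__main__":
    # Placeholder assumptions: one GPU, always on, priced between $0.50 and $1.50 per hour.
    low, high = monthly_gpu_cost_range(0.50, 1.50)
    print(f"monthly range: ${low:.0f} to ${high:.0f}")
    for utilization in (0.2, 0.6):
        unit = cost_per_million_tokens(high, peak_tokens_per_second=1000, utilization=utilization)
        print(f"utilization {utilization:.0%}: ${unit:.2f} per million tokens")
```

The same bill divided over more served tokens gives a lower unit cost, which is why utilization, not the sticker price alone, decides whether the endpoint is economically healthy.
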
### 6.5 Exercises

1. Draw your own lifecycle diagram from memory.
   Include data, run tracking, registry, deploy, and monitoring.
2. Write a one-page promotion policy.
   Specify what the quality gate must verify.
3. Design a rollback plan with four possible rollback targets.
   Code alone is not enough.
4. Compare MLflow and W&B for your likely next team.
   Choose one and justify the tradeoff honestly.
5. Take one product you know.
   List the exact signals you would monitor weekly.
6. Estimate serving cost for one small open-weight model.
   Write down GPU class, expected utilization, and fallback path.
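
As a starting point for exercise 3, one possible shape is a release manifest that pins each independently changing part, so a rollback can target one part without touching the rest. The four fields follow the rollback prompt earlier in this module (model, prompt, retrieval index, routing rules); `ReleaseManifest`, `rollback`, and the version strings are hypothetical names for this sketch.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class ReleaseManifest:
    """One immutable record of everything that defines a release."""
    model_version: str
    prompt_version: str
    retrieval_index_version: str
    routing_rules_version: str


def rollback(
    current: ReleaseManifest,
    last_good: ReleaseManifest,
    targets: list[str],
) -> ReleaseManifest:
    """Return a new manifest where only the named targets revert to last_good."""
    allowed = {f.name for f in fields(ReleaseManifest)}
    unknown = set(targets) - allowed
    if unknown:
        raise ValueError(f"unknown rollback targets: {sorted(unknown)}")
    reverted = {name: getattr(last_good, name) for name in targets}
    return replace(current, **reverted)


if __name__ == "__main__":
    last_good = ReleaseManifest("model-7", "prompt-12", "index-2024-05", "routes-3")
    current = ReleaseManifest("model-8", "prompt-13", "index-2024-06", "routes-3")
    # Suppose only the new prompt misbehaves: revert it and keep the new model and index.
    print(rollback(current, last_good, ["prompt_version"]))
```

Because the manifest is frozen, every rollback produces a new record rather than mutating the old one, which keeps an audit trail of what was live and when.
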
### 6.6 Mini architecture recap

## Foundation-Gap Audit

Module 18 will quietly assume four things are already firm. Do not carry confusion forward.

#### A. Serving infrastructure basics

You should already know:

- what a model server does,
- why routers, schedulers, and caches matter,
- why vLLM, TGI, and Triton differ,
- why blue-green and canary exist.

If this is weak, re-read Chapter 4. Especially §4.3 to §4.10.

#### B. Latency optimization

You should already know:

- latency is stage-wise,
- batching changes both throughput and latency,
- caching can remove repeated work,
- queue time can dominate user experience.

If this is weak, re-read §4.5 to §4.9. Module 18 will make every millisecond more painful.
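
A quick self-check for the "latency is stage-wise" idea is to break one request into stages and see which one dominates. The stage names and millisecond values below are made-up assumptions for illustration; `latency_breakdown` is a hypothetical helper, not part of any serving framework.

```python
def latency_breakdown(stages_ms: dict[str, float]) -> list[tuple[str, float, float]]:
    """Return (stage, milliseconds, share_of_total) sorted from slowest to fastest."""
    total = sum(stages_ms.values())
    if total <= 0:
        raise ValueError("total latency must be positive")
    rows = [(name, ms, ms / total) for name, ms in stages_ms.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical measurements for a single request, in milliseconds.
    request = {"queue": 120.0, "prefill": 40.0, "decode": 30.0, "network": 10.0}
    for name, ms, share in latency_breakdown(request):
        print(f"{name:<8} {ms:6.1f} ms  {share:.0%}")
```

In this invented example the queue accounts for most of the wait, so tuning the model itself would barely move what the user experiences. That is the practical meaning of "queue time can dominate".
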
#### C. Monitoring patterns

You should already know:

- what signals belong in dashboards,
- how drift differs from outages,
- why business metrics matter beside system metrics,
- how alerts should map to owners.

If this is weak, re-read Chapter 5. Especially §5.2 to §5.10.

#### D. Deployment strategies

You should already know:

- shadow versus canary,
- percentage rollout logic,
- rollback targets,
- why safe release patterns matter for AI systems.

If this is weak, re-read §4.10 and §5.8 to §5.10. Voice systems punish sloppy rollout discipline immediately.
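
If "percentage rollout logic" feels abstract, one common approach is deterministic hashing: each user lands in a stable bucket, and the canary takes every bucket below a threshold. The sketch below assumes that approach; `routes_to_canary`, the salt value, and the bucket count are illustrative choices, not a standard API.

```python
import hashlib

BUCKETS = 10_000  # bucket resolution: 0.01% per bucket


def routes_to_canary(user_id: str, canary_percent: float, salt: str = "rollout-v1") -> bool:
    """Return True if this user should be served by the canary.

    The same user always maps to the same bucket for a given salt, so raising
    canary_percent only adds users to the canary and never reshuffles them.
    """
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % BUCKETS
    return bucket < canary_percent / 100 * BUCKETS


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    for percent in (1, 5, 25):
        share = sum(routes_to_canary(u, percent) for u in users) / len(users)
        print(f"target {percent}% -> observed {share:.1%}")
```

Setting `canary_percent` to 0 acts as an instant rollback of the traffic split, and changing the salt reshuffles which users see the canary in the next rollout.
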
---

## What Comes Next

Next module — 00_realtime_voice_agents — applies all these production skills to the hardest latency challenge: voice AI and real-time streaming, where every millisecond of delay breaks the user experience.

That is why this module sits here. First learn how to run models like products. Then learn how to run them under brutal realtime constraints. That sequence is intentional.