01. Week 17: MLOps & Production

Key concepts to master

Experiment tracking and run lineage.
Model registry stages and approval evidence.
Reproducibility across code, data, environment, and evaluation.
Artifact storage and immutable versioning.
Training pipelines and eval gates.
Automated retraining: safe use and failure modes.
Feature stores and train-serve skew.
Data versioning with DVC.
Serving stacks: vLLM, TGI, Triton, managed endpoints.
Autoscaling, batching, caching, and GPU scheduling.
Drift types: data, concept, model, vendor.
Rollout strategies: shadow, canary, blue-green, percentage rollout.
Incident response and rollback targets.
Cost optimization through routing, caching, batching, and right-sizing.
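
The canary and percentage-rollout items above can be pictured with a small routing sketch. It assumes each request carries a stable key, such as a user ID, and hashes that key into a fixed bucket so the same caller always sees the same model. The names `rollout_bucket`, `choose_model`, and the salt value are hypothetical and do not come from any particular serving framework.

```python
import hashlib

BUCKETS = 10_000  # resolution of the rollout: 0.01% per bucket


def rollout_bucket(request_key: str, salt: str = "rollout-v1") -> int:
    """Map a stable key (for example a user ID) to a bucket in [0, BUCKETS)."""
    digest = hashlib.sha256(f"{salt}:{request_key}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then fold into buckets.
    return int.from_bytes(digest[:8], "big") % BUCKETS


def choose_model(request_key: str, canary_fraction: float) -> str:
    """Return 'canary' for roughly canary_fraction of keys, otherwise 'stable'."""
    if not 0.0 <= canary_fraction <= 1.0:
        raise ValueError("canary_fraction must be within [0, 1]")
    threshold = round(canary_fraction * BUCKETS)
    return "canary" if rollout_bucket(request_key) < threshold else "stable"


# Hypothetical usage: the assignment is deterministic per key.
first = choose_model("user-42", canary_fraction=0.05)
second = choose_model("user-42", canary_fraction=0.05)
assert first == second
```

Because the bucket depends only on the key and the salt, raising the fraction from 5% to 20% keeps every existing canary user on the canary, and setting the fraction to 0 acts as an immediate rollback for this one routing layer.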

🧠 Mental models

Experiment tracking: "a lab notebook with the exact ingredients and timestamps"
Model registry: "passport control for promoting models between environments"
Feature store: "a shared pantry so training and serving eat the same ingredients"
Drift monitoring: "a smoke alarm for a changing world or changing data"
Rollouts: "test pilots before you hand the whole fleet to a new model"
Serving stack: "a kitchen balancing queues, burners, and prep stations under load"

⚠️ Common traps

Recording a model version without the code commit, prompt/config, data snapshot, and eval context needed to reproduce it.
Promoting models without approval evidence, rollback targets, or clear ownership.
Confusing data drift, concept drift, and vendor/model behavior drift during incidents.
Ignoring train-serve skew until online metrics collapse after launch.
Autoscaling on raw QPS when sequence length and token throughput actually drive GPU pressure.
Automating retraining without human gates for label quality, regression checks, or business review.
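
The first trap in this list is easier to avoid when the missing lineage is checked mechanically. The sketch below is a minimal illustration, not a tracking tool's API: `RunRecord`, its field names, and `missing_lineage` are hypothetical, and the fields simply mirror the items named in that trap.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical record of what a model version needs to be reproducible."""
    model_version: str
    code_commit: Optional[str] = None    # e.g. a Git commit SHA
    config_hash: Optional[str] = None    # hash of the prompt/config used
    data_snapshot: Optional[str] = None  # identifier of the data version
    eval_report: Optional[str] = None    # pointer to the evaluation context


def missing_lineage(record: RunRecord) -> list[str]:
    """Return the names of fields that are empty or unset, in definition order."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


# Hypothetical usage: a version recorded with only a commit is not reproducible.
partial = RunRecord(model_version="v7", code_commit="abc123")
gaps = missing_lineage(partial)
# gaps == ['config_hash', 'data_snapshot', 'eval_report']
```

A promotion step could refuse any record where `missing_lineage` returns a non-empty list, which turns the trap into a failed check instead of a surprise months later.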

🔗 Prerequisites & connections

Builds on: Module 16 engineering discipline around reproducibility, decision records, testing layers, and versioned change management.

Feeds into: Module 18 voice and realtime systems, where serving, monitoring, rollback, and latency discipline must operate under much tighter SLAs.

💬 Interview phrasing

What has to be captured so you can reproduce an ML run six months later?
Why is a model registry more than a folder full of artifacts?
How would you detect train-serve skew or drift before users notice?
When would you choose shadow deployment, canary rollout, or blue-green release?
In an AI incident, what can you actually roll back?
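
For the skew and drift question, one concrete answer is to compare a feature's training distribution against what serving actually sees. The sketch below uses the Population Stability Index (PSI) as one possible statistic; it is an illustrative choice, not the method prescribed by this module. The function name `psi`, the equal-width binning, and the `eps` floor are assumptions, and both inputs must be non-empty.

```python
import math
from typing import Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a reference and a live sample.

    Bins are equal-width over the range of `expected`; values outside that
    range are clamped into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant features

    def proportions(values: Sequence[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor each share at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical usage: training-time feature values vs. shifted serving values.
train_values = [i / 100 for i in range(100)]
serve_values = [v + 0.5 for v in train_values]
same_score = psi(train_values, train_values)      # identical samples give 0.0
shifted_score = psi(train_values, serve_values)   # the shift gives a larger score
```

A commonly cited rule of thumb treats PSI above roughly 0.25 as a significant shift, but the threshold should be treated as an assumption to calibrate per feature. PSI on inputs speaks to data drift and train-serve skew only; concept drift still needs labels or outcome metrics.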

⏱️ Difficulty markers

🟢 experiment tracking basics
🟢 model registry stages
🟡 artifact and data versioning
🟡 feature stores and train-serve skew
🔴 serving-stack capacity tuning
🔴 drift taxonomy and incident response
🔴 safe automated retraining

Self-check questions

Why did the opening failure remain invisible for weeks? See 02_explainer.md §1.3-§1.4.
What must every tracked run contain? See 02_explainer.md §2.2-§2.3.
Why is a model registry more than a folder? See 02_explainer.md §2.5-§2.6.
What makes a reproducible ML system different from plain Git history? See 02_explainer.md §2.7-§2.10.
What exactly is the quality gate? See 02_explainer.md §3.5-§3.6.
When is automated retraining wise, and when is it reckless? See 02_explainer.md §3.7.
When would you choose vLLM over TGI or Triton? See 02_explainer.md §4.3.
Why is token-level work often better than QPS for autoscaling? See 02_explainer.md §4.4-§4.5.
What is the difference between data drift and model drift? See 02_explainer.md §5.3-§5.5.
What exactly can you roll back in an AI system? See 02_explainer.md §5.8-§5.10.
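
The autoscaling question above can be made concrete with a little arithmetic. The sketch below compares a replica count derived from QPS with one derived from token throughput for two traffic windows that have identical request rates. All capacity figures are invented for illustration, `Request` and both functions are hypothetical names, and counting prompt and generated tokens equally is a simplification, since prefill and decode stress a GPU differently.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    prompt_tokens: int
    max_new_tokens: int


def replicas_for_qps(window: list[Request], window_s: float,
                     qps_per_replica: float) -> int:
    """Replica count if the scaler only looks at requests per second."""
    qps = len(window) / window_s
    return max(1, math.ceil(qps / qps_per_replica))


def replicas_for_tokens(window: list[Request], window_s: float,
                        tokens_per_s_per_replica: float) -> int:
    """Replica count if the scaler looks at total token work per second."""
    tokens = sum(r.prompt_tokens + r.max_new_tokens for r in window)
    return max(1, math.ceil(tokens / window_s / tokens_per_s_per_replica))


# Two 10-second windows with the same QPS but very different sequence lengths.
short_chat = [Request(prompt_tokens=50, max_new_tokens=50)] * 100
long_docs = [Request(prompt_tokens=4000, max_new_tokens=1000)] * 100

qps_short = replicas_for_qps(short_chat, 10.0, qps_per_replica=5.0)          # 2
qps_long = replicas_for_qps(long_docs, 10.0, qps_per_replica=5.0)            # 2
tok_short = replicas_for_tokens(short_chat, 10.0, tokens_per_s_per_replica=2000.0)  # 1
tok_long = replicas_for_tokens(long_docs, 10.0, tokens_per_s_per_replica=2000.0)    # 25
```

The QPS signal asks for the same two replicas in both windows, while the token signal separates a light chat workload from a long-document workload that needs far more capacity.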

Health check

By the end of Week 17, you should be able to say all of this honestly:

- [ ] I can explain the factory analogy without notes.
- [ ] I can describe a run-tracking + registry workflow clearly.
- [ ] I can compare vLLM, TGI, and Triton in interview language.
- [ ] I can define drift, rollback, and incident response precisely.
- [ ] I have completed the hands-on lab in 05_hands_on_lab.md.
- [ ] I feel ready for the latency-heavy world of ../00_realtime_voice_agents/.