01. Week 17 — MLOps & Production

Key concepts to master

  • Experiment tracking and run lineage.
  • Model registry stages and approval evidence.
  • Reproducibility across code, data, environment, and evaluation.
  • Artifact storage and immutable versioning.
  • Training pipelines and eval gates.
  • Automated retraining: safe use and failure modes.
  • Feature stores and train-serve skew.
  • Data versioning with DVC.
  • Serving stacks: vLLM, TGI, Triton, managed endpoints.
  • Autoscaling, batching, caching, and GPU scheduling.
  • Drift types: data, concept, model, vendor.
  • Rollout strategies: shadow, canary, blue-green, percentage rollout.
  • Incident response and rollback targets.
  • Cost optimization through routing, caching, batching, and right-sizing.
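As a rough illustration of the first few concepts (run lineage, reproducibility across code, data, environment, and evaluation, and immutable versioning), the sketch below shows one way a run record could be checked for completeness and given a content-derived identifier. `RunRecord`, `missing_fields`, and `fingerprint` are hypothetical names invented for this example, not the API of any tracking tool, and the field values are made up.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical minimal lineage record for one training run."""
    code_commit: str      # e.g. a Git commit SHA
    config: dict          # hyperparameters or prompt/config
    data_snapshot: str    # e.g. a dataset or DVC version identifier
    environment: str      # e.g. a container image digest or lockfile hash
    eval_suite: str       # which evaluation set/version produced the metrics
    metrics: dict


# Fields that must be non-empty before the run counts as reproducible.
REQUIRED = ("code_commit", "data_snapshot", "environment", "eval_suite")


def missing_fields(record: RunRecord) -> list:
    """Return the names of required lineage fields that are empty."""
    return [name for name in REQUIRED if not getattr(record, name)]


def fingerprint(record: RunRecord) -> str:
    """Derive a stable ID from the record's content (sorted-key JSON, SHA-256)."""
    payload = json.dumps(asdict(record), sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Illustrative values only.
complete = RunRecord("abc123", {"lr": 0.001}, "data-v3", "image-digest-1", "eval-v2", {"acc": 0.9})
incomplete = RunRecord("abc123", {"lr": 0.001}, "", "image-digest-1", "eval-v2", {})

gaps_complete = missing_fields(complete)
gaps_incomplete = missing_fields(incomplete)
run_id = fingerprint(complete)
```

Because the identifier is computed from the content, two records with the same ingredients share an ID and any change to an ingredient yields a new one, which is the property "immutable versioning" is after.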

🧠 Mental models

  • Experiment tracking: "a lab notebook with the exact ingredients and timestamps"
  • Model registry: "passport control for promoting models between environments"
  • Feature store: "a shared pantry so training and serving eat the same ingredients"
  • Drift monitoring: "a smoke alarm for a changing world or changing data"
  • Rollouts: "test pilots before you hand the whole fleet to a new model"
  • Serving stack: "a kitchen balancing queues, burners, and prep stations under load"

⚠️ Common traps

  • Recording a model version without the code commit, prompt/config, data snapshot, and eval context needed to reproduce it.
  • Promoting models without approval evidence, rollback targets, or clear ownership.
  • Confusing data drift, concept drift, and vendor/model behavior drift during incidents.
  • Ignoring train-serve skew until online metrics collapse after launch.
  • Autoscaling on raw QPS when sequence length and token throughput actually drive GPU pressure.
  • Automating retraining without human gates for label quality, regression checks, or business review.
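The QPS trap above can be made concrete with a little arithmetic. The sketch below compares a replica estimate based on request rate with one based on token rate. The function names and every capacity number are assumptions chosen for illustration, and counting prompt and output tokens as equally expensive is a deliberate simplification, since prefill and decode usually cost different amounts.

```python
import math


def replicas_for_qps(qps: float, qps_per_replica: float) -> int:
    """Replica estimate that looks only at request rate."""
    return max(1, math.ceil(qps / qps_per_replica))


def replicas_for_tokens(
    qps: float,
    avg_prompt_tokens: float,
    avg_output_tokens: float,
    tokens_per_sec_per_replica: float,
) -> int:
    """Replica estimate from token rate (requests/sec times tokens per request)."""
    token_rate = qps * (avg_prompt_tokens + avg_output_tokens)
    return max(1, math.ceil(token_rate / tokens_per_sec_per_replica))


# Two workloads with the same request rate (illustrative numbers).
qps = 20
by_qps = replicas_for_qps(qps, qps_per_replica=10)

# Short chat turns: 200 prompt tokens + 100 output tokens per request.
short_chat = replicas_for_tokens(qps, 200, 100, tokens_per_sec_per_replica=3000)

# Long-document summarization: 3000 prompt tokens + 500 output tokens per request.
long_docs = replicas_for_tokens(qps, 3000, 500, tokens_per_sec_per_replica=3000)

print(by_qps, short_chat, long_docs)  # 2 2 24
```

Under these assumed numbers the QPS-based estimate happens to match the short-chat workload but undersizes the long-document workload by an order of magnitude, even though the request rate never changed.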

🔗 Prerequisites & connections

Builds on: Module 16 engineering discipline around reproducibility, decision records, testing layers, and versioned change management.

Feeds into: Module 18 voice and realtime systems, where serving, monitoring, rollback, and latency discipline must operate under much tighter SLAs.

💬 Interview phrasing

  • What has to be captured so you can reproduce an ML run six months later?
  • Why is a model registry more than a folder full of artifacts?
  • How would you detect train-serve skew or drift before users notice?
  • When would you choose shadow deployment, canary rollout, or blue-green release?
  • In an AI incident, what can you actually roll back?
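For the drift and skew question, one widely used signal is the Population Stability Index (PSI), which compares the binned distribution of a feature at training time with its distribution in serving traffic. The sketch below is a minimal pure-Python version, assuming non-empty numeric inputs and equal-width bins derived from the training sample. It is one possible check, not the method prescribed by this course, and the resulting value depends on the bin count and the `eps` floor.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a baseline sample and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values to edge bins
            counts[idx] += 1
        total = len(values)
        # Floor each share at eps so the logarithm below is always defined.
        return [max(c / total, eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Synthetic example: a training-time feature and a shifted serving-time copy.
train_values = [i / 100 for i in range(1000)]
serve_same = list(train_values)
serve_shifted = [v + 3 for v in train_values]

psi_same = psi(train_values, serve_same)
psi_shifted = psi(train_values, serve_shifted)
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but those thresholds are conventions to be tuned per feature. Running the same comparison between offline training features and logged online features is one way to surface train-serve skew before online metrics move.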

⏱️ Difficulty markers

  • 🟢 experiment tracking basics
  • 🟢 model registry stages
  • 🟡 artifact and data versioning
  • 🟡 feature stores and train-serve skew
  • 🔴 serving-stack capacity tuning
  • 🔴 drift taxonomy and incident response
  • 🔴 safe automated retraining

Self-check questions

  1. Why did the opening failure remain invisible for weeks? See 02_explainer.md §1.3-§1.4.
  2. What must every tracked run contain? See 02_explainer.md §2.2-§2.3.
  3. Why is a model registry more than a folder? See 02_explainer.md §2.5-§2.6.
  4. What makes a reproducible ML system different from plain Git history? See 02_explainer.md §2.7-§2.10.
  5. What exactly is the quality gate? See 02_explainer.md §3.5-§3.6.
  6. When is automated retraining wise, and when is it reckless? See 02_explainer.md §3.7.
  7. When would you choose vLLM over TGI or Triton? See 02_explainer.md §4.3.
  8. Why is token-level work often better than QPS for autoscaling? See 02_explainer.md §4.4-§4.5.
  9. What is the difference between data drift and model drift? See 02_explainer.md §5.3-§5.5.
  10. What exactly can you roll back in an AI system? See 02_explainer.md §5.8-§5.10.

Health check

By the end of Week 17, you should be able to say all of this honestly:

- [ ] I can explain the factory analogy without notes.
- [ ] I can describe a run-tracking + registry workflow clearly.
- [ ] I can compare vLLM, TGI, and Triton in interview language.
- [ ] I can define drift, rollback, and incident response precisely.
- [ ] I have completed the hands-on lab in 05_hands_on_lab.md.
- [ ] I feel ready for the latency-heavy world of ../00_realtime_voice_agents/.