07. Monitoring and drift

⏱️ Estimated time: 23 min | Level: advanced

ELI5 callback: In our chain, the kitchen trains, the prep station prepares, the recipe book stores, the serving counter serves, and the quality inspector checks. Same restaurant chain, different platform layer. See.

Monitoring asks whether reality still matches training assumptions

A deployed model starts aging the moment traffic hits it. Users change, products change, and data pipelines change. Monitoring exists to catch that gap early. See. The quality inspector is a full workflow, not one dashboard. The kitchen may need retraining when assumptions break. The prep station may be producing stale or shifted features. The recipe book tells you which model and data version are live. The serving counter exposes latency, errors, and traffic mix changes. So what to do? Monitor model health across multiple layers. Watch infrastructure, data quality, feature distributions, and business outcomes together. One green chart does not prove a healthy system. Simple, no? ML monitoring is multi-layer truth checking. Start by defining what healthy means in operational language.
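One way to put "healthy" into operational language is to write each signal down with its layer and an explicit bound. The sketch below is illustrative only: the `HealthSignal` type, the signal names, and every threshold are assumptions made for the example, not values from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthSignal:
    name: str
    layer: str          # "infra", "data_quality", "features", or "business"
    value: float        # latest observed value
    limit: float        # bound that defines "healthy" (hypothetical numbers)
    higher_is_bad: bool = True

    def healthy(self) -> bool:
        if self.higher_is_bad:
            return self.value <= self.limit
        return self.value >= self.limit


def unhealthy_layers(signals):
    """Return the sorted list of layers with at least one breached signal."""
    return sorted({s.layer for s in signals if not s.healthy()})


signals = [
    HealthSignal("p99_latency_ms", "infra", value=180.0, limit=250.0),
    HealthSignal("feature_null_rate", "data_quality", value=0.08, limit=0.02),
    HealthSignal("psi_amount", "features", value=0.05, limit=0.20),
    HealthSignal("conversion_rate", "business", value=0.031, limit=0.025,
                 higher_is_bad=False),
]

print(unhealthy_layers(signals))  # ['data_quality']
```

Here latency, drift, and conversion all look green, yet the data-quality layer is breached. That is the "one green chart" problem in miniature.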

traffic → predictions → outcomes
   │         │            │
   ├─ infra  ├─ quality   ├─ business
   │         │            │
   └──────── alerts / actions ───────┘
                 │
                 v

Monitor the system, not only the score.
Define health signals before the first incident.
Keep operational and model views in one story.
Production reality changes even when code does not.
Dashboards should support action, not decoration.
Now watch. Drift comes in different flavors.

Data drift and feature drift are early warning signals

Data drift means input distributions change from the training baseline. Feature drift can happen even if raw sources look similar. Upstream teams may change logging, defaults, or units quietly. See. A tiny schema change can become a model incident. Population Stability Index is a common operational signal. KL divergence or Wasserstein distance can also help compare distributions. Choose methods that teams can interpret and act on. So what to do? Track drift per important feature and per segment. Global averages can hide high-risk pockets. Also monitor missingness and freshness directly. A stale feature is often worse than a shifted feature. Set alert thresholds conservatively at first, then tune. Now watch. Drift alerts must map to diagnosis steps.
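To make the Population Stability Index concrete, here is a minimal sketch using only the standard library. It assumes equal-width bins cut from the baseline range and a small floor `eps` for empty bins; both are choices made for this example, and quantile bins are an equally common alternative. The function names are hypothetical.

```python
import math
from bisect import bisect_right


def bin_proportions(values, edges, eps=1e-6):
    """Share of values per bin. Values below the first edge or above the
    last edge land in the open-ended first and last bins."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    total = len(values)
    # The eps floor avoids log(0) and division by zero for empty bins.
    return [max(c / total, eps) for c in counts]


def psi(baseline, current, n_bins=10):
    """PSI = sum over bins of (actual - expected) * ln(actual / expected)."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(1, n_bins)]  # interior cut points
    expected = bin_proportions(baseline, edges)
    actual = bin_proportions(current, edges)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


# Deterministic toy data: a uniform baseline and a copy shifted upward.
baseline = [i / 100 for i in range(1000)]
shifted = [v + 3 for v in baseline]

print(psi(baseline, baseline))  # 0.0
print(psi(baseline, shifted))
```

Identical distributions give exactly zero, and every bin's term is non-negative, so PSI only grows as the shapes diverge. Cutoffs such as 0.1 and 0.25 are often quoted as rules of thumb, but treat them as starting points to tune, not as guarantees.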

training baseline
      │
      ├── compare current histograms
      ├── PSI / KL / missingness
      ├── segment by cohort
      └── alert if threshold crossed
                 ↓

Measure drift where business impact is highest.
Track freshness and missingness beside distribution distance.
Segment-level drift often matters more than global drift.
Thresholds should drive diagnosis, not noise.
Data drift is often the first visible crack.
Simple, no? Inputs deserve their own observability.
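The points about segments, missingness, and freshness can be sketched in a few lines. The row layout, the `market` and `income` fields, and the two-hour freshness budget below are all invented for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def missing_rate_by_segment(rows, feature, segment_key):
    """Fraction of rows per segment where `feature` is None or absent."""
    total = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        seg = row[segment_key]
        total[seg] += 1
        if row.get(feature) is None:
            missing[seg] += 1
    return {seg: missing[seg] / total[seg] for seg in total}


def is_stale(last_updated, now, max_age):
    """True when the feature's last update is older than the allowed age."""
    return now - last_updated > max_age


rows = [
    {"market": "A", "income": 52000},
    {"market": "A", "income": 61000},
    {"market": "A", "income": 48000},
    {"market": "A", "income": 75000},
    {"market": "B", "income": None},
    {"market": "B", "income": 39000},
]

rates = missing_rate_by_segment(rows, "income", "market")
global_rate = sum(r["income"] is None for r in rows) / len(rows)
print(rates)                   # {'A': 0.0, 'B': 0.5}
print(round(global_rate, 3))   # 0.167

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_updated = now - timedelta(hours=6)
print(is_stale(last_updated, now, max_age=timedelta(hours=2)))  # True
```

The global missing rate looks moderate, while market B is missing half its values. That is the high-risk pocket a global average hides.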

Concept drift appears when the world changes meaningfully

Concept drift means the relationship between inputs and outcomes has shifted. The same features no longer predict as before. This is harder to detect than simple data drift because labels may arrive late. See. You can have stable inputs and still have broken predictions. Outcome monitoring therefore matters deeply. Track calibration, error rate, ranking lift, or business conversions over time. Use delayed labels when necessary, but keep the lag visible. So what to do? Pair fast proxy signals with slower outcome truth. For example, complaint rate may arrive before confirmed fraud labels. Human review queues can also provide early warning. Review cohorts where the model now often disagrees with experts. Now watch. Concept drift often requires model or policy changes, not just data fixes. That is why ownership across teams matters so much.

same input pattern
      │
      ├── old outcome link
      ├── new outcome link
      ├── delayed labels arrive
      └── performance drops
              ↓

Separate input drift from meaning drift in your diagnosis.
Track delayed labels with clear freshness markers.
Use proxy signals carefully but intentionally.
Human review can be a monitoring asset.
Concept drift often changes business rules too.
See. Stable traffic does not guarantee stable truth.
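A small sketch shows how stable inputs can still hide broken predictions. It tracks a calibration gap per time window and reports label coverage beside it, so the label lag stays visible. The record layout and window names are assumptions for the example.

```python
from collections import defaultdict


def calibration_by_window(records):
    """records: iterable of (window, predicted_prob, label_or_None).

    Returns, per window, the share of predictions that already have a
    label and the gap between mean predicted probability and the
    observed positive rate among labeled records.
    """
    preds = defaultdict(list)
    labels = defaultdict(list)
    totals = defaultdict(int)
    for window, prob, label in records:
        totals[window] += 1
        if label is not None:          # delayed labels are simply skipped
            preds[window].append(prob)
            labels[window].append(label)
    report = {}
    for window, total in totals.items():
        n = len(labels[window])
        gap = None
        if n:
            gap = round(sum(preds[window]) / n - sum(labels[window]) / n, 3)
        report[window] = {"label_coverage": n / total, "calibration_gap": gap}
    return report


# Same predicted scores in both weeks (stable inputs), different outcomes.
records = [
    ("week1", 0.2, 0), ("week1", 0.2, 0), ("week1", 0.8, 1), ("week1", 0.8, 1),
    ("week2", 0.2, 1), ("week2", 0.2, 1), ("week2", 0.8, 1), ("week2", 0.8, None),
]

report = calibration_by_window(records)
print(report["week1"])
print(report["week2"])
```

In week2 the scores look exactly like week1, yet the gap is strongly negative: the model now under-predicts. The coverage of 0.75 is a reminder that one label has not arrived yet, so the number may still move.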

Alerts need playbooks and retraining triggers

An alert without an action path is just background anxiety. Monitoring should know what comes next. Some events require rollback, some require investigation, some trigger retraining. Use severity levels tied to user or business impact. See. Automated retraining is powerful and also dangerous. Do not retrain blindly on corrupted or drifting labels. So what to do? Gate retraining with data quality and evaluation checks. A trigger may start a candidate pipeline, not immediate deployment. Also keep human approval for high-risk domains. Runbooks should name owner, data sources, rollback option, and escalation path. Make these steps easy to find during incidents. Now watch. Monitoring becomes learning only when actions are closed properly. Post-incident notes should improve thresholds and playbooks.
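The gating idea can be sketched as a single decision function: a trigger produces a candidate, and the candidate only moves forward if each gate passes. The field names, the AUC comparison, and the thresholds are hypothetical; real gates depend on the domain.

```python
from dataclasses import dataclass


@dataclass
class RetrainContext:
    label_null_rate: float     # share of training labels that are missing
    candidate_auc: float       # candidate score on a held-out set
    live_auc: float            # live model score on the same set
    high_risk_domain: bool
    human_approved: bool = False


def retrain_decision(ctx, max_label_null_rate=0.05, min_auc_gain=0.0):
    """Walk the gates in order and return the first blocking reason."""
    if ctx.label_null_rate > max_label_null_rate:
        return "block: label quality check failed"
    if ctx.candidate_auc - ctx.live_auc < min_auc_gain:
        return "block: candidate does not beat live model"
    if ctx.high_risk_domain and not ctx.human_approved:
        return "hold: awaiting human approval"
    return "promote candidate"


print(retrain_decision(RetrainContext(0.20, 0.85, 0.81, False)))
# block: label quality check failed
print(retrain_decision(RetrainContext(0.01, 0.80, 0.81, False)))
# block: candidate does not beat live model
print(retrain_decision(RetrainContext(0.01, 0.85, 0.81, True)))
# hold: awaiting human approval
print(retrain_decision(RetrainContext(0.01, 0.85, 0.81, True, human_approved=True)))
# promote candidate
```

The data quality gate comes first on purpose: a candidate trained on corrupted labels can score well against equally corrupted evaluation data, so the evaluation gate alone would not catch it.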

alert fires
   │
   ├── investigate source issue
   ├── rollback if severe
   ├── launch retrain candidate
   └── update playbook
           ↓

Map alert classes to clear next actions.
Retraining should be validated, not automatic magic.
Keep runbooks short, explicit, and discoverable.
Escalation paths must reflect business impact.
Learning loops depend on disciplined closure.
Simple, no? Alerts should tell you what to do next.
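One lightweight way to keep alerts tied to actions is a playbook table plus a check that every entry names the runbook fields listed above. The alert classes, team names, and data source names below are made up for the sketch.

```python
REQUIRED_FIELDS = ("owner", "data_sources", "rollback", "escalation")

PLAYBOOKS = {
    "feature_stale": {
        "severity": "high",
        "action": "investigate source issue",
        "owner": "data-platform",
        "data_sources": ["feature_store.freshness"],
        "rollback": "serve last known-good features",
        "escalation": "on-call lead",
    },
    "outcome_drop": {
        "severity": "critical",
        "action": "rollback model",
        "owner": "ml-team",
        "data_sources": ["outcomes.daily"],
        "rollback": "previous registry version",
        "escalation": "",  # deliberately left empty to show the check
    },
}


def missing_runbook_fields(playbooks):
    """Map each alert class to the required fields it leaves empty."""
    gaps = {}
    for alert_class, playbook in playbooks.items():
        missing = [f for f in REQUIRED_FIELDS if not playbook.get(f)]
        if missing:
            gaps[alert_class] = missing
    return gaps


def next_action(alert_class, playbooks):
    """Look up the next step; unknown alert classes escalate by default."""
    playbook = playbooks.get(alert_class)
    if playbook is None:
        return "escalate: no playbook for this alert class"
    return playbook["action"]


print(next_action("feature_stale", PLAYBOOKS))  # investigate source issue
print(missing_runbook_fields(PLAYBOOKS))        # {'outcome_drop': ['escalation']}
```

Running the completeness check in CI, or before an alert rule is enabled, is one way to catch an alert with no action path before it fires in production.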

Mature platforms monitor cost and trust together

A model can remain accurate while becoming too expensive. A model can stay cheap while quietly harming user trust. Monitoring must cover both. See. Cost per prediction belongs beside quality per prediction. Track traffic growth, cache hit rate, token usage, and labeling spend. Track complaint rates, override rates, and manual review escalation. These signals often reveal hidden decay earlier than accuracy charts. So what to do? Build one executive summary and one engineering detail view. Leaders need trends. Operators need root-cause signals. Also review false positives and false negatives regularly with domain experts. Trust erodes through examples, not only aggregates. Now watch. The best monitoring systems create useful conversations across teams. That is how platforms improve without panic cycles.

quality
  ├── outcome metrics
cost
  ├── infra + labeling spend
trust
  └── complaints / overrides
         ↓

Put quality, cost, and trust on the same review rhythm.
Build views for both leadership and operators.
Examples and slices should complement aggregates.
Monitoring should reduce surprises, not create more noise.
Mature teams review platform health as a business asset.
See. Trust is measurable if you bother to name signals.
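Naming the signals can be as simple as one review row per period that puts cost and trust side by side. The inputs and the numbers below are invented for the sketch; the point is only the shape of the summary.

```python
def review_row(period, predictions, infra_cost, labeling_cost,
               overrides, complaints):
    """Summarize cost and trust signals for one review period."""
    return {
        "period": period,
        "cost_per_1k_predictions": round(
            1000 * (infra_cost + labeling_cost) / predictions, 2),
        "override_rate": round(overrides / predictions, 4),
        "complaint_rate": round(complaints / predictions, 4),
    }


q1 = review_row("Q1", predictions=200_000, infra_cost=400.0,
                labeling_cost=100.0, overrides=1_000, complaints=100)
q2 = review_row("Q2", predictions=250_000, infra_cost=900.0,
                labeling_cost=100.0, overrides=5_000, complaints=500)

for row in (q1, q2):
    print(row)
```

In this toy data, traffic grew by a quarter while cost per thousand predictions rose from 2.5 to 4.0 and the override rate quadrupled. An accuracy chart alone would show neither trend.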

Where this lives in the wild

A fraud platform tracks PSI on key financial features and complaint rate as an early concept-drift proxy.
A recommendation team monitors stale-feature rates because freshness failures looked like model regressions.
A support assistant team tracks token cost per answer beside answer-quality review scores.
A lending workflow team requires human approval before any retraining candidate reaches production.
A search platform uses cohort-level drift dashboards because one market can diverge before global averages move.

Pause and recall

Why is ML monitoring broader than standard service monitoring?
How is concept drift different from data drift?
Why can automated retraining be risky without gates?
What does it mean to monitor trust alongside quality and cost?

Interview Q&A

Q: What should you monitor in an ML system? A: Monitor infrastructure health, feature freshness, input drift, prediction quality, business outcomes, cost, and trust signals together. Common wrong answer to avoid: Just monitor latency and error rate like any other API.

Q: How do you detect data drift? A: Compare live feature distributions against training baselines using measures like PSI or KL divergence, and check missingness and freshness alongside them. Common wrong answer to avoid: If traffic volume stays stable, drift is probably not happening.

Q: What is concept drift? A: It is a change in how inputs relate to outcomes, so the model loses predictive meaning even if input distributions look similar. Common wrong answer to avoid: It is another name for bad latency.

Q: When should retraining trigger automatically? A: Only after validated alerts and quality gates indicate the candidate path is worth launching; high-risk domains still need human approval. Common wrong answer to avoid: Whenever any dashboard turns yellow.

Apply now (5 min)

Choose one production model and write five monitoring signals for it. Include one infrastructure metric, one feature freshness metric, one drift metric, one outcome metric, and one trust metric. Then attach an owner and action for each alert. If any alert has no action, fix that first. That is real monitoring design.

Bridge. Full platform covered. What don't we fully understand? → 08