10. Monitoring & Drift: when the factory alarms actually matter

~14 min read. Live AI systems fail quietly before they fail loudly.

Built on the ELI5 in 00-eli5.md. The production monitor — the wall of gauges and sirens — now watches your model after launch.

1) First picture: one control room, four signal families

See.

                live traffic
                    │
                    ▼
┌──────────────────────────────────────────┐
│           the production monitor         │
├──────────────┬─────────────┬─────────────┤
│ system       │ data        │ model       │
│ CPU GPU p95  │ nulls PSI   │ scores drift│
├──────────────┼─────────────┼─────────────┤
│ business     │ alerts      │ owner       │
│ revenue      │ severity    │ runbook     │
└──────────────┴─────────────┴─────────────┘

One dashboard above the factory floor. That is the starting picture.

If you watch only one gauge, you get fooled. Latency may look fine while conversion falls. Revenue may look fine while GPU queues are growing.

So what to do? Watch four signal families together.

First, system signals. These tell you whether serving infrastructure is struggling. Watch CPU, GPU memory, queue depth, p95 latency, error rate. A model can be correct and still unusable.

Second, data signals. These tell you whether incoming inputs changed shape. Watch missingness, category frequencies, text length, language mix. Watch embedding norms if inputs become semantically different.

Third, model signals. These tell you whether outputs changed in risky ways. Watch prediction distribution, confidence, calibration, refusal rate. For ranking, watch score spread and reorder instability.

Fourth, business signals. These tell you whether the product still creates value. Watch conversion, revenue, deflection, fraud prevented, retention. A model with perfect latency can still harm the business.
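The "watch them together" idea can be sketched in a few lines. This is only an illustration: the `Snapshot` fields, the single metric per family, and the 20% relative tolerance are all assumptions, not a recommended setup.

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    p95_latency_ms: float   # system signal
    null_rate: float        # data signal
    mean_confidence: float  # model signal
    conversion_rate: float  # business signal

def breached_families(current, baseline, tolerance=0.20):
    """Return signal families whose metric moved more than `tolerance`
    (relative change) away from the baseline snapshot."""
    pairs = {
        "system": (current.p95_latency_ms, baseline.p95_latency_ms),
        "data": (current.null_rate, baseline.null_rate),
        "model": (current.mean_confidence, baseline.mean_confidence),
        "business": (current.conversion_rate, baseline.conversion_rate),
    }
    breached = []
    for family, (now, ref) in pairs.items():
        if ref and abs(now - ref) / abs(ref) > tolerance:
            breached.append(family)
    return breached

# Hypothetical numbers: latency looks fine while conversion falls.
baseline = Snapshot(p95_latency_ms=120, null_rate=0.02,
                    mean_confidence=0.80, conversion_rate=0.050)
current = Snapshot(p95_latency_ms=125, null_rate=0.02,
                   mean_confidence=0.79, conversion_rate=0.035)
print(breached_families(current, baseline))
```

With these made-up numbers only the business family trips. A latency-only dashboard would have stayed green.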

Look. The assembly line ships code and models safely. The quality gate checks promotion before release. The production monitor checks reality after release. Simple, no?

2) Drift is not one thing

Teams say "drift" as if it is one bucket. That creates bad decisions. Different drifts need different actions.

Think like a factory supervisor. Bad raw material is one problem. A worn machine is another problem. A changed customer specification is another problem. Same alarm word, different fix.

Data drift

Input distribution changed. Maybe average cart value moved. Maybe support chats now contain new slang. Maybe images got darker after a camera change.

The model may still be acceptable. But your assumptions are weaker now. That is data drift.

Model drift

Prediction behavior changed. Maybe scores became overconfident. Maybe output class balance collapsed. Maybe answer length suddenly doubled.

Inputs may look normal. Outputs no longer behave like before. That is model drift.

Concept drift

The mapping from input to truth changed. The same signal now means a different outcome. Fraud patterns changed. Customer intent behind the same query changed. Policy rules changed what counts as a positive label.

This one hurts most. The input shape may look stable. Truth underneath moved.

Vendor drift

An external dependency changed under you. Maybe the OCR vendor degraded. Maybe the embedding endpoint changed behavior. Maybe the hosted LLM silently moved to a new checkpoint.

The production monitor should separate vendor drift clearly. Otherwise your team blames the wrong layer. Yes?
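One way to separate vendor drift is to tag each request with its dependency path and compare slices. Here is a minimal sketch. The field names (`provider`, `model_version`, `flagged`) and the 5-point threshold are hypothetical; use whatever your request logs actually carry.

```python
from collections import defaultdict

def rate_by_path(requests, keys=("provider", "model_version")):
    """Share of flagged requests for each dependency path."""
    totals, flagged = defaultdict(int), defaultdict(int)
    for req in requests:
        path = tuple(req[k] for k in keys)
        totals[path] += 1
        if req["flagged"]:
            flagged[path] += 1
    return {path: flagged[path] / totals[path] for path in totals}

def shifted_paths(baseline_requests, live_requests, min_delta=0.05):
    """Paths whose flagged rate moved more than `min_delta` vs baseline."""
    base = rate_by_path(baseline_requests)
    live = rate_by_path(live_requests)
    return sorted(p for p in live
                  if p in base and abs(live[p] - base[p]) > min_delta)

def make_requests(provider, n, n_flagged):
    """Build toy request logs: the first `n_flagged` requests are flagged."""
    return [{"provider": provider, "model_version": "v1",
             "flagged": i < n_flagged} for i in range(n)]

# Toy logs: only the hypothetical provider "ocr_b" degrades.
baseline_logs = make_requests("ocr_a", 100, 5) + make_requests("ocr_b", 100, 5)
live_logs = make_requests("ocr_a", 100, 5) + make_requests("ocr_b", 100, 30)
print(shifted_paths(baseline_logs, live_logs))
```

If only one path shows up, investigate that layer first. If every path moves together, the cause is probably not the vendor.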

3) Detection means comparison, not astrology

You need a reference window. That may be training data, staging data, or last stable week. Without a baseline, every chart becomes a guessing game.

The quality gate uses offline samples before promotion. The production monitor uses live samples after promotion. Both compare current behavior to something trusted. They answer different questions.

PSI for bucketed shifts

Population Stability Index is simple and useful. Suppose feature income_band had baseline shares:

Low: 50%
Medium: 30%
High: 20%

Now live traffic shows:

Low: 30%
Medium: 40%
High: 30%

PSI per bucket is (actual - expected) × ln(actual / expected). Now compute it step by step.

Low: (0.30 - 0.50) × ln(0.30 / 0.50) ≈ 0.102
Medium: (0.40 - 0.30) × ln(0.40 / 0.30) ≈ 0.029
High: (0.30 - 0.20) × ln(0.30 / 0.20) ≈ 0.041

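Total PSI ≈ 0.17. A common rule of thumb reads below 0.1 as stable, 0.1 to 0.25 as a moderate shift, and above 0.25 as a major shift. So this is a noticeable shift. It does not automatically mean panic. See the difference.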
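The same arithmetic as a small function, so you can check the numbers yourself. This is a sketch: the name `psi` and the `eps` floor for empty buckets are choices made here, not a standard API. Without rounding each term first, the sum comes out near 0.171. Adding the three rounded terms gives 0.172.

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population Stability Index over bucket shares given as fractions.

    `eps` floors empty buckets so ln() never sees zero (an assumption;
    other smoothing choices are possible)."""
    total = 0.0
    for bucket, e in expected.items():
        e = max(e, eps)
        a = max(actual.get(bucket, 0.0), eps)
        total += (a - e) * math.log(a / e)
    return total

baseline = {"Low": 0.50, "Medium": 0.30, "High": 0.20}
live = {"Low": 0.30, "Medium": 0.40, "High": 0.30}

print(round(psi(baseline, live), 3))  # 0.171
```

Every term is non-negative because (actual - expected) and ln(actual / expected) always share a sign. So buckets can never cancel each other out.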

KL divergence for shape change

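KL divergence measures how far one probability distribution sits from another, weighting each bucket by its probability mass. Use it when shifts in specific buckets matter a lot. It is asymmetric, so write down the direction: KL(live ‖ baseline) and KL(baseline ‖ live) give different numbers. A small KL on critical safety classes can still matter.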
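A short sketch of the asymmetry, reusing the income_band shares from the PSI example. The helper name `kl_divergence` and the `eps` floor are assumptions made for this illustration.

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) in nats for two aligned lists of bucket probabilities.

    Buckets where p is zero contribute nothing; `eps` floors q so the
    ratio stays finite (an assumption, not the only smoothing option)."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            total += pi * math.log(pi / max(qi, eps))
    return total

baseline = [0.50, 0.30, 0.20]  # Low, Medium, High
live = [0.30, 0.40, 0.30]

forward = kl_divergence(live, baseline)  # live measured against baseline
reverse = kl_divergence(baseline, live)  # baseline measured against live
print(forward, reverse, forward + reverse)
```

The two directions differ. Their sum equals the PSI of the same two distributions, because PSI is the symmetric form of KL.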

Embedding shift for messy text and images

Text and image inputs resist neat buckets. So what to do? Project each input into embedding space. Track centroid movement, cluster share, or nearest-neighbor mismatch.

If support traffic moves from billing language to cancellation language, embedding shift catches it before labels fully arrive. That is very useful for the production monitor.
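Centroid movement is the simplest of these to sketch. Below, toy 2-D vectors stand in for real embeddings. In practice they would come from whatever embedding model you already run, which is assumed here and not shown. The function names are illustrative.

```python
import math

def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def centroid_shift(reference_vectors, live_vectors):
    """Cosine distance between the reference and live centroids."""
    return cosine_distance(centroid(reference_vectors), centroid(live_vectors))

# Toy stand-ins: axis 0 ~ "billing" language, axis 1 ~ "cancellation" language.
reference = [[1.0, 0.1], [0.9, 0.2]]
live = [[0.2, 1.0], [0.1, 0.9]]

shift = centroid_shift(reference, live)
print(round(shift, 2))
```

A centroid is a blunt summary. Two clusters swapping share can leave it nearly still, which is why cluster share and nearest-neighbor checks are worth tracking alongside it.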

Detection must be sliced

Do not trust only global averages. Drift often hides inside one segment. Slice by geography, language, platform, customer tier, and provider. A calm global chart can hide a burning segment.
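Here is a sketch of a calm global chart hiding a burning segment. The rows, the `region` slice key, and the helper names are all made up for illustration. One small region shifts hard while the large one stays put.

```python
import math
from collections import Counter, defaultdict

def shares(values, buckets):
    """Fraction of values falling in each bucket."""
    counts = Counter(values)
    return {b: counts[b] / len(values) for b in buckets}

def psi(expected, actual, eps=1e-6):
    """Population Stability Index over bucket shares (eps floors empties)."""
    total = 0.0
    for bucket, e in expected.items():
        e = max(e, eps)
        a = max(actual.get(bucket, 0.0), eps)
        total += (a - e) * math.log(a / e)
    return total

def psi_by_slice(ref_rows, live_rows, slice_key, feature, buckets):
    """PSI of `feature` computed separately inside each slice."""
    ref_by, live_by = defaultdict(list), defaultdict(list)
    for row in ref_rows:
        ref_by[row[slice_key]].append(row[feature])
    for row in live_rows:
        live_by[row[slice_key]].append(row[feature])
    return {s: psi(shares(ref_by[s], buckets), shares(live_by[s], buckets))
            for s in ref_by if s in live_by}

def rows(region, low, high):
    """Toy rows: `low` Low-band and `high` High-band events for a region."""
    return ([{"region": region, "income_band": "Low"}] * low
            + [{"region": region, "income_band": "High"}] * high)

buckets = ["Low", "High"]
ref = rows("US", 450, 450) + rows("IN", 50, 50)
live = rows("US", 450, 450) + rows("IN", 10, 90)

global_psi = psi(shares([r["income_band"] for r in ref], buckets),
                 shares([r["income_band"] for r in live], buckets))
by_slice = psi_by_slice(ref, live, "region", "income_band", buckets)
print(round(global_psi, 3), {s: round(v, 3) for s, v in by_slice.items()})
```

The global PSI stays under 0.01. The small region alone lands far above 0.25.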

4) Not every drift needs retraining

This is the main judgment call. Teams often get this wrong. An alert is not a retrain button.

First ask three questions. Is the alert real or just noise? Is user harm happening now? Is the cause temporary, operational, or structural?

A holiday campaign can create strong data drift. Retraining immediately may overfit a short spike. A broken parser can create massive model drift. Retraining on broken data only bakes in corruption. A vendor endpoint change may need rollback first.

Look. This is where the upgrade without downtime matters. You may route traffic back to a safer version. You may pull an older model from the warehouse. You may tighten the quality gate before the next release.

Only some cases need retraining. Other cases need investigation, rollback, or communication. That is mature MLOps.

A practical response ladder

Level 1: observe and annotate.
Level 2: investigate slices and upstream changes.
Level 3: mitigate with routing, rollback, or thresholds.
Level 4: retrain or relabel if the world truly changed.

The production monitor should support this ladder. Not just ring bells. Simple, no?
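The three questions and the ladder can be wired together as a tiny triage function. This is one possible encoding, not a policy: the `Level` enum, the `cause` strings, and the exact mapping are assumptions you should adapt to your own team.

```python
from enum import IntEnum

class Level(IntEnum):
    OBSERVE = 1      # observe and annotate
    INVESTIGATE = 2  # investigate slices and upstream changes
    MITIGATE = 3     # routing, rollback, or thresholds
    RETRAIN = 4      # retrain or relabel

def triage(is_real, user_harm_now, cause):
    """Map the three triage questions to a starting rung on the ladder.

    `cause` is one of "unknown", "temporary", "operational", "structural"
    (hypothetical labels chosen for this sketch)."""
    if not is_real:
        return Level.OBSERVE
    if user_harm_now:
        return Level.MITIGATE      # stop the harm first, whatever the cause
    if cause == "structural":
        return Level.RETRAIN       # the world truly changed
    if cause == "operational":
        return Level.MITIGATE      # broken parser or vendor change: fix or roll back
    if cause == "temporary":
        return Level.OBSERVE       # e.g. a holiday spike: annotate, do not retrain
    return Level.INVESTIGATE       # real, but cause not yet known

print(triage(is_real=True, user_harm_now=False, cause="temporary").name)
```

Note that retraining is reachable only when the alert is real, nobody is being harmed right now, and the cause is structural. Everything else lands lower on the ladder.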

5) What a useful alert should actually say

A bad alert says, "Drift high." That creates panic and confusion.

A useful alert says four things.

what changed,
where it changed,
who owns first response,
what action is suggested.

If possible, add blast radius too. How much traffic is affected? Which customers are affected? How much business risk is visible already?

See. The production monitor is not only a dashboard. It is a decision-support system. If it cannot guide first action, it is incomplete.
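The four parts of a useful alert, plus optional blast radius, fit in a small record. This is a sketch only: the `DriftAlert` class, its field names, and every value in the example are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DriftAlert:
    what_changed: str
    where: str
    first_responder: str
    suggested_action: str
    traffic_affected_pct: Optional[float] = None  # blast radius, if known

    def render(self) -> str:
        """Format the alert so the first reader knows what to do next."""
        lines = [
            f"WHAT:   {self.what_changed}",
            f"WHERE:  {self.where}",
            f"OWNER:  {self.first_responder}",
            f"ACTION: {self.suggested_action}",
        ]
        if self.traffic_affected_pct is not None:
            lines.append(f"BLAST:  {self.traffic_affected_pct:.0f}% of traffic")
        return "\n".join(lines)

# Hypothetical alert; none of these values come from a real system.
alert = DriftAlert(
    what_changed="PSI 0.31 on income_band vs last stable week",
    where="region=IN, platform=android",
    first_responder="ranking on-call",
    suggested_action="check upstream schema change before any retrain",
    traffic_affected_pct=10,
)
print(alert.render())
```

Making the four fields required means a bare "Drift high" alert cannot even be constructed.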

Where this lives in the wild

Uber Eats ranking team — an ML engineer watches conversion, latency, and score drift after a recommender refresh.
PhonePe risk platform — a fraud data scientist tracks concept drift when transaction behavior changes across festival seasons.
LinkedIn feed ranking — a machine learning engineer monitors calibration, click yield, and member session quality together.
Google Cloud Vision operations — an applied scientist watches image brightness shift and vendor-level OCR quality changes.
Intercom Fin support AI — an inference engineer tracks refusal rate, retrieval freshness, and ticket deflection together.

Pause and recall

Why are system, data, model, and business signals all needed together?
How is concept drift different from plain data drift?
When would embedding shift help more than PSI?
Why should a drift alert usually trigger investigation before retraining?

Interview Q&A

Q: Why is data drift not enough to prove model quality dropped? A: Inputs can move while the relationship between inputs and labels stays stable. Some shifts are harmless. Quality drops only when the new input mix breaks model assumptions or business outcomes.

Common wrong answer to avoid: "Any data drift means retrain immediately." Change detection and action selection are different jobs.

Q: How do you separate vendor drift from model drift in production? A: Tag every request with provider, model version, embedding version, and routing path. Then compare slices. If only one dependency path shifts, investigate that layer first.

Common wrong answer to avoid: "Just compare overall accuracy." Aggregate accuracy hides dependency-specific failures and delayed labels.

Q: Why monitor business metrics if model metrics already look healthy? A: Model metrics are proxies. Users and money are the destination. A calibrated model can still hurt conversion if latency, UX, or thresholds changed.

Common wrong answer to avoid: "Business teams own that, not ML." Production ownership extends all the way to user value. It does not stop at model metrics.

Q: When is PSI a weak primary detector? A: PSI is weak for rich text, embeddings, and complex multimodal signals. It also misses semantic movement hidden inside stable buckets.

Common wrong answer to avoid: "PSI works for every feature." It is useful, not universal.

Apply now (5 min)

Pick one live AI feature you know. List one system, one data, one model, and one business metric. Then mark which metric the production monitor should alert on first.

Now sketch from memory:

the four-signal dashboard,
the four drift types,
and the response ladder from observe to retrain.

Say aloud where the quality gate ends, and where the production monitor begins.

Bridge. The production monitor can raise the siren, but a siren alone does nothing. Next we need a runbook so the team knows who acts, how fast, and what to roll back. → 11-incident-response.md