09. Drift detection — when 78% quietly becomes 64% and nobody pages

~18 min read. Production rarely fails with a trumpet. It slips a little, then a little more, until a chart that used to read 78 reads 64 and nobody can point at the week it changed.

Builds on the ELI5 in 00-eli5.md. The kitchen log — logging and tracing alongside the rolling rubric — is what tells you which week the chef stopped tasting the soup. The inspection is now a habit, not an event, and the spot check repeats on a schedule.


What 08 locked down and what production still tries to hide

By chapter 08 the team has done real work. The judge is calibrated against human graders to roughly 0.78 agreement. The rubric is locked, with anchors that two new reviewers grade the same way. The launch eval came in at 78% pass rate on a representative sample, and the slice table cleared on enterprise (62%), free tier (74%), and EU (71%). Leadership signed off. The refund chatbot has been live for three weeks.

The question now is not whether the system was good on launch day. It was. The question is how a team learns, before users teach them, that the same system on the same prompts under the same rubric has quietly become a worse system. Nothing in the launch eval guarantees the answer is still right four weeks later. The judge is calibrated, the rubric is locked, and the launch eval read 78%; none of that tells you when 78% has quietly become 64%.

What this file solves

This chapter walks the same refund chatbot from week 1 to week 3 of production, watches its aggregate pass rate slip only from 78% to 76% while one slice silently collapses from 62% to 41%, and shows why an aggregate-only drift dashboard is the production version of shipping on vibes. By the end you can name the five drift sources, pick a detection window that matches your traffic volume, and explain to a PM why "the dashboard is green" is not the same as "the product is healthy".

Why a launch eval is not a permanent eval

A launch eval is a photograph. Production is a film. The launch sample was drawn from the traffic distribution as it stood on launch day, scored against the rubric, and used to certify the system. Every assumption baked into that certificate is time-bounded. The user mix that produced those 100 prompts will move. The third-party APIs the agent calls will silently change their behaviour. The prompt sitting in the repo will be edited four times in a fortnight by three different engineers. The vendor will update the underlying model without telling anyone, and a static eval harness will not notice. The retrieval index that returns "current refund policy" will fall out of date the day the company launches a new product line.

There are five drift sources, and they breathe at different rates.

  • Input drift: the mix of user queries changes, because the product launched in a new geo, or a marketing campaign brought a different audience, or a competitor went down and you absorbed their traffic shape.
  • Tool drift: a third-party API the agent calls quietly changes its response schema, error rate, or latency, and the agent's success rate on that tool path falls.
  • Prompt drift: somebody edited the system prompt to fix a bug last Tuesday, somebody else nudged the few-shot examples on Friday, and by next month the prompt looks nothing like what the launch eval certified.
  • Model drift: the vendor pushes an update to the underlying model, sometimes silently, and the same prompt produces different completions.
  • Data drift: the retrieval store is stale; the documents the agent grounds against no longer match the world the user is asking about.

Teacher voice. Five sources. One symptom. The aggregate quality number moves and nobody can say which source moved first. The discipline of drift detection is keeping the sources separable so the fix lands in the right place.

The naive repair, the visible break, the diagnosis

The first instinct a careful team reaches for is "we'll re-run the launch eval every month". That feels rigorous. It is half-right and half-dangerous. Re-running the launch eval catches the case where the launch eval prompts themselves now fail. It misses every case where the traffic distribution has moved away from the launch eval, which is the case that actually happens. A frozen launch eval becomes a known-good test the model keeps passing while the live user experience rots.

The second instinct is "we'll watch the aggregate pass rate on a rolling window of live traffic". That feels honest. It is also wrong, in a sharper way. An aggregate over heterogeneous traffic can stay steady while a single slice collapses, because the slice that collapsed is also the slice with low volume, and its weight in the aggregate is small. The chart looks fine; the high-stakes customers are on fire.

Not a re-run problem. Not a single-chart problem. A continuous sampling and slicing problem. The natural question becomes: what signal would tell us, within hours of it starting, that one of the five drift sources has begun to move — and which one?

When the same chatbot is fine and not fine at the same time

Here is the refund chatbot's production trace, week by week, on the same model and same locked prompt that passed the launch eval at 78%. Same code, same weights, same rubric. Different production weeks.

WEEK 1 — first week post-launch
  sampled chats:       150
  aggregate pass:      78%
  slice — enterprise:  62%   (n=33)
  slice — free tier:   80%   (n=72)
  slice — EU:          71%   (n=21)
  retrieval hit rate:  91%
  input embedding PSI vs baseline: 0.04   (stable)

WEEK 2
  sampled chats:       150
  aggregate pass:      77%   (-1, within noise)
  slice — enterprise:  54%   (-8, n=35)
  slice — free tier:   82%   (n=69)
  slice — EU:          70%   (n=22)
  retrieval hit rate:  86%   (-5)
  input embedding PSI vs baseline: 0.12   (drifting)

WEEK 3
  sampled chats:       150
  aggregate pass:      76%   (-2 cumulative — looks fine)
  slice — enterprise:  41%   (-21 cumulative — on fire)
  slice — free tier:   83%   (n=68)
  slice — EU:          72%   (n=24)
  retrieval hit rate:  74%   (-17)
  input embedding PSI vs baseline: 0.28   (significantly drifted)

  *PagerDuty has not fired. The aggregate dashboard is green.*

Read the aggregate row alone and, three weeks after ship date, the system looks like it is performing exactly as the launch eval predicted. The 76 is two points off the 78 the launch certified. Two points is inside the week-to-week noise floor of a 150-sample window with binary outcomes. A team running an aggregate alert at "page if pass rate drops below 70%" sees nothing.
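The noise-floor claim can be checked with the binomial standard error. This is a minimal sketch, assuming each sampled chat is an independent pass/fail draw and using the normal approximation for the interval; the function name is illustrative and not part of the chapter's tooling.

```python
import math

def binomial_standard_error(pass_rate: float, n: int) -> float:
    """Standard error of a pass rate estimated from n binary outcomes."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n)

# Launch-level pass rate on one 150-chat window.
se = binomial_standard_error(0.78, 150)
half_width_95 = 1.96 * se  # normal-approximation 95% interval

print(f"standard error:  {se * 100:.1f} points")
print(f"95% half-width:  {half_width_95 * 100:.1f} points")
```

Under these assumptions the standard error is about 3.4 points and the 95% interval is roughly ±6.6 points, so a two-point move on the aggregate cannot be told apart from sampling noise.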

Read the slice row and the story is different. Enterprise has dropped 21 points in three weeks. The retrieval hit rate, which is a structural signal not directly visible in the rubric, has fallen 17 points. The input embedding distance from the launch baseline has crossed 0.2 — a standard rule-of-thumb threshold beyond which the live traffic is no longer the same distribution the eval certified. Three independent signals are screaming. The aggregate is silent.

Mini-FAQ. "Why does the aggregate stay steady when one slice collapses?" Because the slice that collapsed is a low-volume slice. Enterprise was 22% of traffic by volume but 60% of refund dollars by impact. A 21-point drop on 22% of traffic moves the aggregate by roughly 4.6 points. Free tier improved by 3 points on 48% of traffic, which contributes +1.4. EU stayed flat. Those two contributions net to about −3.2; the traffic not broken out in the table makes up the rest, leaving the observed aggregate change at about −2. The math is simple; the dashboard is honest; the customer experience is on fire.
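The slice arithmetic in the FAQ can be reproduced directly. This sketch uses only the traffic shares and point changes quoted above; slices not broken out in the trace are left out, so the two terms are not expected to sum to the observed aggregate change on their own. The function name is illustrative.

```python
def aggregate_contribution(traffic_share: float, point_change: float) -> float:
    """Points a slice adds to (or removes from) the traffic-weighted aggregate."""
    return traffic_share * point_change

enterprise = aggregate_contribution(0.22, -21)  # 22% of traffic, down 21 points
free_tier = aggregate_contribution(0.48, +3)    # 48% of traffic, up 3 points

print(f"enterprise: {enterprise:+.1f} points")
print(f"free tier:  {free_tier:+.1f} points")
print(f"combined:   {enterprise + free_tier:+.1f} points")
```

The point of the calculation is the weighting: a 21-point collapse shrinks to under 5 points once it is multiplied by the slice's share of traffic.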

The load-bearing rule

State the rule plainly: drift detection that watches only the aggregate is the production-time version of shipping on vibes. The launch eval rule from chapter 01 said a quality claim covers only the population the sample measured. The drift-time corollary is sharper: the population is moving, and the claim from a month ago no longer covers the population today.

Every mechanism in this chapter — rolling-window pass rate, slice alerts, embedding-distribution distance, judge-versus-truth divergence — is a way of keeping that drift visible per slice and per source. They are not redundant. Each one catches a different drift family that the others miss.

Teacher voice. Treat the rolling eval like a smoke detector wired room by room. One detector in the hallway will not tell you the kitchen is burning if the hallway air is clean. Wire each slice. Wire each drift source. Then connect them to one screen.

How the mechanism actually works — four signals, four windows

The detection layer is built from four signals running at four cadences against the locked rubric and the calibrated judge from chapter 08.

Rolling-window pass rate, per slice. The same rubric that gated launch runs every window on a fresh sample of live conversations, broken down by the slices that mattered at launch. The window length is set by traffic volume: hourly for high-volume products, daily for medium, weekly for low. The signal is the delta from the launch baseline, not the absolute number. A 5-point drop on any slice that holds for two consecutive windows is the page-the-on-call threshold.
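The paging rule above can be sketched as a small predicate. The 5-point threshold and the two-window confirmation come from the text; the function name, its signature, and the choice to measure every window against the launch baseline are assumptions made for illustration.

```python
from typing import Sequence

def should_page(baseline: float, window_rates: Sequence[float],
                drop_threshold: float = 5.0, confirm_windows: int = 2) -> bool:
    """True when the most recent `confirm_windows` windows all sit at least
    `drop_threshold` points below the launch baseline."""
    if len(window_rates) < confirm_windows:
        return False
    recent = window_rates[-confirm_windows:]
    return all(baseline - rate >= drop_threshold for rate in recent)

# Pass rates (in points) from the three-week trace.
aggregate = [78.0, 77.0, 76.0]
enterprise = [62.0, 54.0, 41.0]

print(should_page(78.0, aggregate))   # aggregate never drops 5 points
print(should_page(62.0, enterprise))  # enterprise holds below threshold twice
```

Run against the trace, the same rule stays quiet on the aggregate and fires on the enterprise slice, which is the whole argument for evaluating it per slice.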

Embedding-distribution distance. Each live prompt is embedded. The embedding distribution is compared to the baseline distribution captured at launch using either KL divergence or PSI (Population Stability Index). PSI under 0.1 is stable, 0.1–0.2 is drifting, above 0.2 is significantly drifted. This signal catches input drift before pass rate moves — the input mix shifts first, the quality drop follows a few days later when the model meets queries the launch eval never saw.
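PSI itself is a short formula once both distributions are expressed over the same bins. The sketch below assumes the embeddings have already been reduced to bin counts (for example cluster assignments or quantiles of a one-dimensional projection); the chapter does not specify that step, so treat it as an assumption. The band labels follow the thresholds quoted above, and the example counts are made up for illustration.

```python
import math
from typing import Sequence

def psi(baseline_counts: Sequence[int], live_counts: Sequence[int],
        eps: float = 1e-6) -> float:
    """Population Stability Index between two histograms over the same bins."""
    if len(baseline_counts) != len(live_counts):
        raise ValueError("both histograms must use the same bins")
    base_total = sum(baseline_counts)
    live_total = sum(live_counts)
    score = 0.0
    for base, live in zip(baseline_counts, live_counts):
        p = max(base / base_total, eps)  # clip to avoid log(0) on empty bins
        q = max(live / live_total, eps)
        score += (q - p) * math.log(q / p)
    return score

def psi_band(value: float) -> str:
    """Map a PSI value onto the rule-of-thumb bands used in this chapter."""
    if value < 0.1:
        return "stable"
    if value <= 0.2:
        return "drifting"
    return "significantly drifted"

launch_bins = [250, 250, 250, 250]   # hypothetical baseline histogram
shifted_bins = [400, 300, 200, 100]  # hypothetical live histogram
print(psi_band(psi(launch_bins, launch_bins)))
print(psi_band(psi(launch_bins, shifted_bins)))
```

The binning choice matters more than the formula: too few bins hides real shifts, too many makes small samples look drifted, so the bin scheme should be fixed at launch alongside the baseline.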

Slice-level alerts on the same rolling pass rate. Same data as signal one, but alerts fire on the slice number, not the aggregate. This is the single most leveraged change a drift dashboard can make. A slice-alert dashboard for the refund chatbot would have flagged enterprise in week 2, when it crossed −8 points, and paged the on-call as soon as the next window confirmed the drop, not in week 5 when a customer complaint reached the CEO.

Judge-versus-truth divergence. A small fraction of the live sample is dual-scored — by the judge from chapter 08 and by a human. The agreement rate is tracked over time. If the judge agreement with human truth drifts (because the judge model itself drifted, or because the failure modes shifted into areas the judge cannot grade), every downstream signal degrades. This is the signal that catches the most expensive failure: the dashboard staying green because the judge stopped recognising the failures it used to catch.
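Tracking this signal needs little more than an agreement rate per dual-graded batch. The sketch below assumes raw pass/fail agreement between judge and human; if chapter 08 calibrated on a different statistic, that one should be used instead. The 0.05 tolerance is a hypothetical value, not a number from this chapter.

```python
from typing import Sequence

def agreement_rate(judge: Sequence[bool], human: Sequence[bool]) -> float:
    """Fraction of dual-graded chats where judge and human gave the same verdict."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(j == h for j, h in zip(judge, human))
    return matches / len(judge)

def judge_decayed(baseline_agreement: float, recent_agreement: float,
                  tolerance: float = 0.05) -> bool:
    """Flag the judge when agreement falls more than `tolerance` below baseline."""
    return baseline_agreement - recent_agreement > tolerance

# 50 dual-graded chats where the judge disagrees with the human on 11.
human_labels = [True] * 50
judge_labels = [True] * 39 + [False] * 11
print(agreement_rate(judge_labels, human_labels))
print(judge_decayed(0.78, 0.77))
print(judge_decayed(0.78, 0.70))
```

With only 50 chats per batch the rate is itself noisy, so the comparison is more trustworthy on a trend across several batches than on any single one.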

                     PRODUCTION TRAFFIC
              ┌─────────────┼─────────────┐
              │             │             │
              ▼             ▼             ▼
         live chats    live embeddings   small dual-graded sample
              │             │             │
              ▼             ▼             ▼
      ┌───────────────┐  ┌──────┐  ┌──────────────────┐
      │ judge scores  │  │ PSI  │  │ judge vs human   │
      │ rolling window│  │ vs   │  │ agreement curve  │
      │ per slice     │  │ base │  │                  │
      └──────┬────────┘  └───┬──┘  └────────┬─────────┘
             │               │              │
             ▼               ▼              ▼
       slice-level      input-drift     judge-decay
         alerts           alerts          alerts
             │               │              │
             └──────────┬────┴──────────────┘
              one drift dashboard
              (slice × signal × source)

The four signals do not collapse into one number. They feed one screen, where a human operator can see which signal moved first and infer which of the five drift sources is responsible.

For the refund chatbot — what each signal said and when

Run the trace from earlier through the four-signal layer.

Week 1: all four signals near baseline. Pass rate per slice within 2 points of launch. PSI 0.04. Judge-versus-human agreement 0.78. Healthy.

Week 2: the embedding-distance signal moves first. PSI crosses 0.1 and lands at 0.12. The aggregate pass rate is essentially unchanged. The enterprise slice has slipped 8 points, past the 5-point threshold, but this is only the first window below it, so the two-window-confirm rule is one window away from firing. The structural signal, retrieval hit rate, has fallen 5 points. At this point a slice-alert dashboard shows enterprise as a pending alert, and it pages as soon as the next window confirms.

Week 3: PSI now 0.28. Enterprise slice down 21 cumulative points. Retrieval hit rate down 17. Judge agreement still 0.77, so the judge itself is not the issue. The drift source is now identifiable from the signal shape: input drift is leading, retrieval is failing on the new input mix, model and prompt are intact. That points the on-call directly at the retrieval store.
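The reasoning from signal shape to drift source can be written down as a first-pass triage rule. This is a rough sketch, not a complete runbook: the 5-point slice drop and the PSI cutoff of 0.2 come from the chapter, while the retrieval and judge cutoffs and the order of the checks are illustrative assumptions.

```python
def first_suspect(slice_drop: float, psi: float,
                  retrieval_drop: float, judge_agreement_drop: float) -> str:
    """Rough first guess at which drift source to investigate.

    slice_drop and retrieval_drop are in points below baseline;
    judge_agreement_drop is on the 0-1 agreement scale.
    """
    if judge_agreement_drop > 0.05:   # assumed tolerance
        return "judge decay: recalibrate before trusting pass rates"
    if psi > 0.2 and retrieval_drop >= 5:   # retrieval cutoff is assumed
        return "data drift: check retrieval index freshness"
    if psi > 0.2:
        return "input drift: inspect the new query mix"
    if slice_drop >= 5:
        return "prompt, model, or tool drift: inputs stable, outputs changed"
    return "no drift source indicated"

# Week 3 of the trace: enterprise -21, PSI 0.28, retrieval -17, judge 0.78 -> 0.77.
print(first_suspect(slice_drop=21, psi=0.28, retrieval_drop=17,
                    judge_agreement_drop=0.01))
```

The judge check comes first in this sketch because every other signal depends on the judge being trustworthy; a team could reasonably order the checks differently.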

The investigation, taking maybe an hour, finds the cause. A new product line launched 9 days ago. Enterprise customers — who follow product announcements — now ask about it. The retrieval store was indexed before the launch and contains zero documents on the new product. The model is grounding on adjacent-but-wrong policy text and producing fluent, confident, wrong answers about a product line whose actual policy is different.

The fix is a freshness fix, not a model fix. Re-index against the current policy corpus on a 24-hour cadence. Re-run the rolling eval the next day: enterprise slice climbs from 41% to 64%. Aggregate moves from 76% to 80%. PSI stays at 0.28, because the input distribution genuinely shifted and the new distribution is the new normal — the baseline is what now needs updating.

Mini-FAQ. "Why not just re-baseline whenever PSI drifts?" Because re-baselining hides the drift, it does not fix it. Re-baseline only after the system has been repaired against the new distribution. Re-baselining first is the production-time equivalent of moving the goalposts.

Alternative comparison — which signal earns its place under which pressure

The four signals are not interchangeable. Each one shines under a different drift source and fails on a different one.

Rolling pass rate per slice

Catches: prompt drift, model drift, tool drift, anything that changes outputs without changing inputs.

Misses: input drift in its earliest phase, before quality has actually fallen. The pass rate moves after the system meets the new inputs.

Cost: judge calls on N sampled conversations per window. At a 0.5¢ judge call and 150 chats/day per slice, roughly $0.75/day/slice.

Use when: you have a calibrated judge and a stable rubric. This is the workhorse signal.

Embedding-distribution distance (KL or PSI)

Catches: input drift, often days before quality moves. Useful as a leading indicator.

Misses: anything that does not change the input distribution. A silent model update with stable inputs leaves PSI untouched.

Cost: embed every live prompt (or a sample) once. At $0.02/1K tokens for a small embedding model, roughly $5/day for 10K chats, which assumes about 25 tokens per embedded prompt. Compute cost on the comparison is negligible.

Use when: traffic is heterogeneous enough that input mix shifts are plausible — consumer products, search-style traffic, anything seasonal.

Slice-level alerts on rolling pass rate

Catches: localised regressions that the aggregate hides. This is the single most leveraged change.

Misses: nothing the aggregate would catch; this is strictly additive.

Cost: same judge calls, more dashboard infrastructure, more alert routing.

Use when: always, the moment you have more than one slice that matters differently.

Judge-versus-truth divergence

Catches: judge decay, evaluator failure, the slow rot of the evaluation system itself.

Misses: anything happening to the system the judge is not graded on. If your dual-graded sample is 50 chats/week and the failure is on a slice that gets 5 chats/week, this signal cannot see it.

Cost: human grading on a small fraction of live traffic. At $2.00/chat human grading and 50 chats/week, roughly $100/week per judge being monitored.

Use when: the judge is doing high-stakes work (gating launches, paging on-call). The signal earns its cost the first time the judge silently starts grading the wrong thing.

A cost comparison the room actually argues about

Numbers are illustrative, qualified by stack. They land within a factor of 2 across most production setups in late 2025.

Detection layer                  | Daily compute/judge cost      | Catches                               | Misses                                | Decision lag
---------------------------------|-------------------------------|---------------------------------------|---------------------------------------|---------------
Aggregate-only pass rate         | $0.75                         | Big regressions on common slices      | Slice collapses, input drift          | 2–4 weeks
Slice-level pass rate (5 slices) | $3.75                         | Slice collapses across all 5          | Pure input drift before quality falls | 3–7 days
Slice + embedding PSI            | $8.75                         | Slice collapses + leading input drift | Judge decay, silent model swap        | 1–3 days
Full four-signal stack           | $25 (with human dual-grading) | All five drift sources                | Pathologies below sample resolution   | Hours to 1 day

The cost movement is real but small relative to the cost of a single missed enterprise-slice collapse. The pressure relieved by adding signals is detection lag. The new pressure created is alert fatigue, absorbed by the on-call rotation, which is why slice alerts must come with two-window-confirm and clear runbooks per drift source.
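The first three cost figures in the table follow from the per-unit prices quoted earlier in this section. The sketch below recomputes them; the constant and function names are illustrative, and the fourth row is left out because it depends on how much human grading is bought.

```python
JUDGE_CALL_USD = 0.005          # 0.5 cents per judge call, as quoted above
CHATS_PER_DAY_PER_SLICE = 150   # sampled chats scored per slice per day
EMBEDDING_USD_PER_DAY = 5.00    # quoted embedding cost for 10K chats/day

def judge_cost_per_day(slices: int) -> float:
    """Daily judge spend when every slice gets its own scored sample."""
    return slices * CHATS_PER_DAY_PER_SLICE * JUDGE_CALL_USD

aggregate_only = judge_cost_per_day(1)
five_slices = judge_cost_per_day(5)
slices_plus_psi = five_slices + EMBEDDING_USD_PER_DAY

print(f"aggregate-only:      ${aggregate_only:.2f}/day")
print(f"slice-level (5):     ${five_slices:.2f}/day")
print(f"slice + embedding:   ${slices_plus_psi:.2f}/day")
```

Swapping in a team's own judge price and sample size is usually enough to show that the spread between the cheapest and the richest layer is tens of dollars a day, not thousands.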

Teacher voice. Cheap dashboards page on the aggregate. Expensive dashboards page on the slice. The difference between them is the customer complaint that did or did not reach the CEO.

Operational signals — what tells you drift is happening

A healthy drift dashboard is one where the four signals tell a consistent story when they move at all. Pass rate dipped slightly, PSI is stable, judge agreement is stable: the rubric caught a real but small quality wobble, and the on-call investigates at moderate priority. That clean correspondence between signals is the calm state.

The first signal that drift has started in earnest is almost always a slice falling out of line with the aggregate. Enterprise drops 5 points while the aggregate moves 1; free tier rises 3 while the aggregate is flat. The shape of the divergence between slice and aggregate is the most reliable early warning a drift system produces.

The second signal, often within days of the first, is embedding PSI crossing 0.1 and trending upward. PSI is a leading indicator for input drift. Watch it cross 0.1; alert on 0.2; investigate before the rubric numbers catch up.

The third signal, which experienced operators check first when paged, is the slice × source matrix: a one-screen grid with slices on rows, the four signals on columns, deltas in cells. The grid tells the operator within ten seconds which source moved on which slice, which is the diagnostic step before any fix.

The misleading metric beginners watch is the aggregate trend line — one number, smooth-looking, lagging by weeks, hiding everything that matters. The expert metric is the slice-level rate-of-change over the rolling window, with two-window-confirm to suppress noise.

A late warning that the judge itself has rotted: the judge-versus-human agreement curve drifts downward over months, even as the rubric remains locked. When this happens, every downstream alert is degraded, and a re-calibration sprint (chapter 08) is overdue.

Boundary of applicability — when daily is enough and when drift detection is futile

Drift detection earns its complexity at certain workloads and is honest overhead at others.

Hourly windows are mandatory when the product handles >10K chats/hour, the user base is global so traffic shape shifts inside a single day, and a 6-hour outage on a slice is a customer-visible incident. Consumer chatbots at scale, search products, e-commerce assistants on a Black Friday: all of them need hourly.

Daily windows are enough when traffic is in the 1K–10K/day range, slices are stable, and the business cost of catching a regression 24 hours later is bounded. Most B2B SaaS assistants, internal copilots, and mid-volume support bots live here.

Weekly windows are enough when volume is low enough that daily samples are too small for a stable pass-rate estimate — typically under 500 chats/day in the slice you care about. Below that resolution, slice-level pass-rate noise drowns the signal. The compensating move is to lean harder on embedding PSI and tool-success signals, which do not need a binary outcome per chat.
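One way to see why low volume forces longer windows is to ask how many graded chats a window needs before its pass rate is stable. This sketch uses the normal approximation and assumes the goal is to pin the rate to within ±5 points, matching the 5-point alert threshold; that target, the function names, and the example volumes are assumptions for illustration.

```python
import math

def min_sample_size(pass_rate: float, half_width: float, z: float = 1.96) -> int:
    """Smallest n whose normal-approximation interval is within +/- half_width."""
    return math.ceil(pass_rate * (1.0 - pass_rate) * (z / half_width) ** 2)

def days_per_window(needed: int, graded_chats_per_day: int) -> int:
    """Whole days of grading required to collect `needed` chats."""
    return math.ceil(needed / graded_chats_per_day)

needed = min_sample_size(0.78, 0.05)
print(f"graded chats needed per window: {needed}")
print(f"at 150 graded chats/day: {days_per_window(needed, 150)} day(s)")
print(f"at 40 graded chats/day:  {days_per_window(needed, 40)} day(s)")
```

The requirement applies per slice, not per product, which is why a slice with thin traffic can need a weekly window even when the product as a whole could support a daily one.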

Drift detection is futile in two cases. Genuinely low-volume products (a niche internal tool that handles 20 chats a week) cannot support a statistical detection layer. The right discipline there is per-failure review and qualitative trace inspection, not a dashboard. And greenfield products in their first month, where the launch baseline itself is unstable, cannot drift-detect against a baseline that is still moving. Wait for the baseline to settle (typically 3–4 weeks) before turning on drift alerts.

Common wrong mental model — "if the aggregate eval stays green, we're fine"

The seductive belief is that the aggregate pass rate, which gated launch, is also the right number to monitor in production. The belief is wrong for three reasons that stack.

First, aggregates average over slices with unequal stakes. The refund chatbot's enterprise slice was 22% of traffic and 60% of refund dollars. The 21-point collapse on enterprise moves the traffic-weighted aggregate by 4.6 points (21 × 0.22) but a dollar-weighted pass rate by about 12.6 points (21 × 0.60). The aggregate is mathematically honest about traffic; it is mathematically silent about money.

Second, aggregates are slow. By the time the aggregate drops far enough to fire an alert, several weeks of customer-visible damage have already shipped. The slice signal moves at the rate of the slice; the aggregate signal moves at the rate of the whole.

Third, aggregates do not name the drift source. A 3-point aggregate drop could be input drift, prompt drift, model drift, tool drift, or data drift. The aggregate does not say. The slice × source matrix does.

Replace the wrong model with the right one: the aggregate is for the launch certificate; the slice × source grid is for production health. Not "if the dashboard is green, we're fine". Not a single-number dashboard. A grid where one cell can turn red while every other cell stays green, and the page fires on the cell, not the average.

Teacher voice. This is the kitchen log in its production form. Not just "this dish was returned", but which station, which ingredient, which shift. The aggregate is the dining-room satisfaction score. The grid is the kitchen log.

Six failure shapes drift detection keeps re-discovering

  • Aggregate-only blindness. The chart is green, the customers are unhappy, the call comes from leadership. Cause: no slice alerts.
  • Frozen-baseline rot. The baseline was set at launch and never refreshed after a legitimate distribution shift. Every signal is now "drifted" and the team ignores them all.
  • Judge silent decay. The rubric is locked, the judge model is the same, but the failure modes have moved into areas the judge was never calibrated for. Pass rate stays green; CSAT drops.
  • Tool drift hidden in success counts. A vendor API quietly increased its 95th-percentile latency from 800ms to 4s. Tool success is unchanged; tool timeouts are climbing; the agent silently falls back to a worse path.
  • Prompt drift by a thousand edits. Three engineers each made a small, well-intentioned prompt edit. None tripped a single-edit eval. The cumulative drift over four weeks is large.
  • Data-freshness collapse. Retrieval against a stale index produces fluent but wrong answers. The refund chatbot's week-3 incident is exactly this shape.

Each one is a specific failure of the slice × source habit. Each one disappears when the grid exists.

Cross-topic references — where this pressure shows up again

  • Same pressure, earlier chapter. Chapter 01 said "a quality claim covers only the sample that generated it". Drift detection is that rule under operational pressure: the sample has moved, so the claim no longer covers the population, and the dashboard's job is to notice.
  • Shared invariant, chapter 08. Judge-versus-human divergence in this chapter is the same calibration check from chapter 08, but now repeated continuously instead of once. The calibration sprint is not a launch milestone; it is a recurring obligation.
  • Failure geometry repeats in module 25. Multi-agent debugging in 03_agent_observability_debugging faces the same which-source-moved problem at a larger blast radius. Drift sources × agents × tools is a 3D version of this chapter's grid.
  • Adjacent pressure, next chapter. Drift detection tells you something moved. Chapter 10's A/B testing tells you whether your proposed fix actually fixes it — same statistical machinery, different hypothesis.

A fast self-test before you trust your drift dashboard

  • Can your dashboard show a slice red while the aggregate stays green? If no, the dashboard cannot catch the failure that happens most often.
  • Does your alert rule require two consecutive windows of confirmation? If no, you have either an alert-fatigued on-call or a missing runbook.
  • Is the embedding-PSI signal wired? If no, you will see drift only after quality has fallen, never before.
  • Is the judge-versus-human dual-graded sample running on a recurring schedule? If no, the calibration certificate from chapter 08 is decaying invisibly.
  • When the page fires, does the runbook tell the on-call which of the five drift sources to investigate first based on which signals moved? If no, the page is a starting gun, not a diagnosis.

Five yeses means the kitchen log is reading like a real log. One or more nos means a customer is teaching you what your dashboard refused to.

Where this lives in the wild

Drift is enough of a real problem that an entire layer of the eval-tooling market sells drift detection as its primary product.

  • Arize Phoenix — drift is the headline product; embedding-distribution monitors plus rubric-rolling-window dashboards are the canonical configuration enterprise teams deploy for LLM apps.
  • WhyLabs — original ML-drift specialist; their data-distribution profiling (whylogs) was the reference implementation for PSI/KL drift on tabular and now LLM features.
  • Evidently AI — open-source library where the team learned the slice × source pattern; their column-by-column drift reports are the open-source benchmark.
  • NannyML — focused specifically on post-deployment performance estimation when ground truth is delayed, which is the canonical refund-chatbot situation.
  • Fiddler AI — enterprise observability with drift detection as the gating control for regulated LLM deployments (banking, healthcare).
  • Galileo — markets drift detection bundled with hallucination scoring; their pitch is the slice-alert workflow this chapter recommends.
  • LangSmith monitoring — rolling-window pass-rate dashboards with slice tags; pairs with their tracing layer so a drift alert lands one click from the failing trace.
  • Helicone observability — captures embedding distributions on every request; drift charts are a first-class object alongside cost and latency.
  • Comet Opik — drift dashboards designed to be the inner loop alongside the eval-driven-development workflow from chapter 13.
  • Datadog LLM observability — drift signals folded into the same dashboard ops teams already watch for service health, which is how drift becomes operational at large enterprises.
  • New Relic AI monitoring — same shape as Datadog, with the slice × source grid as a built-in widget.
  • Anthropic model-card drift studies — Anthropic publicly tracks behavioural-eval deltas across Claude releases so customers can detect model drift caused by vendor updates.
  • OpenAI evals platform — versioned evals so a team can re-run the launch certificate against a new model and detect vendor-side model drift before traffic does.
  • Vectara HHEM monitoring — hallucination-rate drift specifically; a leading indicator that retrieval freshness has collapsed even when rubric pass rate has not yet moved.
  • Perplexity citation-drift alerts — the team monitors per-source citation accuracy week over week and treats a citation-drift spike as a P1, because the product's contract is citation integrity.
  • Intercom Fin — slice-by-customer-tier pass-rate monitoring; the architecture this chapter describes is what they ship to enterprise customers as the SLA dashboard.
  • GitHub Copilot Chat — tool-success and acceptance-rate drift segmented by repo size, language, and editor; tool drift is detected here weeks before users notice.
  • Cursor — accept-rate-by-language drift dashboards; a one-language collapse fires the page even when the aggregate is steady.
  • Glean — combined nDCG drift and CTR drift; the Goodhart check from chapter 01 lives here continuously.
  • Notion AI — golden-set re-run on a weekly cadence to catch model drift; live-traffic slice monitoring catches input drift.
  • Stripe Radar — drift on input feature distributions has been the canonical fraud-model monitoring pattern for a decade; the LLM version is younger, the discipline is older.
  • Air Canada chatbot, post-2024 — the case study every enterprise legal team cites; a drift-detection layer would have caught the policy-violation slice before it produced legal liability.

The pattern is consistent. Teams that detect drift early ship a slice × source grid. Teams that learn drift from their customers ship one number.

Recall — can you reconstruct the chapter cold?

  1. Name the five drift sources and one symptom each one produces.
  2. Why does the refund chatbot's aggregate stay near 76% while enterprise collapses from 62% to 41%?
  3. What does PSI > 0.2 mean for the live input distribution, and what action does it recommend?
  4. Which of the four signals is a leading indicator, and what does it lead?
  5. State the chapter's load-bearing rule about aggregates and drift.
  6. When are hourly drift windows mandatory and when is weekly enough?
  7. What is the canonical wrong mental model, and what replaces it?
  8. Which signal catches the silent decay of the judge itself, and how often must it run?

Interview Q&A

Q1. Your aggregate pass rate has been steady at 76% for three weeks, but support tickets from enterprise customers are climbing. What is your first hypothesis and your first check?

A. Aggregate-hides-slice. Pull the slice-level pass rate for the enterprise segment over the same three weeks. If enterprise has dropped while free tier or another large slice has compensated upward, the aggregate is mathematically honest and operationally blind. Second check: PSI on input embeddings for the enterprise slice specifically. If PSI has crossed 0.2, the input distribution for that segment has shifted, and the next investigation is whether retrieval or prompt still covers the new mix. Common wrong answer to avoid: "The aggregate is steady so the system is fine — the tickets are an anomaly."

Q2. Why is the embedding-distance signal a leading indicator while the rolling pass rate is a lagging one?

A. The input distribution moves first — users start asking different questions. The system meets the new inputs and produces lower-quality answers a few days later, because the model and the retrieval store were tuned for the old distribution. PSI registers the shift as soon as the inputs change; pass rate registers the consequence after the rubric has accumulated enough failed conversations to move the rate. Watching only pass rate means you learn about input drift several days after it started, which is several days of customer-visible damage. Common wrong answer to avoid: "PSI is a vanity metric — only the eval score matters."

Q3. The on-call gets paged: enterprise slice dropped 7 points, PSI is at 0.22, retrieval hit rate is down 14 points, judge agreement is unchanged. Is this a chapter 08 calibration bug, a chapter 09 drift bug, or a chapter 11 logging bug?

A. Chapter 09 drift bug, specifically data drift caused by input drift. Judge agreement unchanged rules out chapter 08. Logging is providing the signals, so chapter 11 is healthy. PSI 0.22 plus retrieval hit rate dropping in lockstep with the slice pass rate points at input drift exposing a stale retrieval store — the same shape as the week-3 incident in the chapter. First action: check the retrieval index freshness and whether any new product, policy, or feature launched in the past two weeks that enterprise customers would ask about. Common wrong answer to avoid: "Recalibrate the judge" — judge agreement is stable, the judge is not the problem.

Q4. A PM asks why you can't just run the launch eval every month and call that drift detection. What's your answer?

A. Re-running the launch eval catches the case where the launch eval prompts themselves now fail — useful but narrow. It misses every case where the traffic distribution has moved. The launch eval was certified against last month's traffic; this month's traffic is different. A frozen launch eval becomes a known-good test the model keeps passing while the live experience degrades. Drift detection must run against a rolling sample of live traffic, sliced by segment, not against a fossilised launch set. Common wrong answer to avoid: "Monthly re-runs are sufficient and cheaper."

Q5. Your team handles 200 chats per week on a niche internal tool. What does drift detection look like for you?

A. Honest answer: a slice-level pass-rate dashboard is futile at 200 chats/week, because per-slice samples are too small for stable rate estimates. The right discipline is qualitative — weekly review of every failure trace, plus embedding PSI on the full week's prompts (PSI doesn't need binary outcomes), plus a quarterly re-run of the locked golden set. Pretending you have a statistical drift system at this volume produces alert noise and false confidence. Common wrong answer to avoid: "Set up the same daily dashboard as the high-volume product — discipline transfers."

Q6. PSI on your input embeddings has been at 0.28 for a month. Pass rate has recovered to launch level after a retrieval refresh. Should you re-baseline?

A. Yes, now you re-baseline. The rule is re-baseline after the system has been repaired against the new distribution, never before. Re-baselining before the fix hides the drift. Re-baselining after the fix sets the new normal as the new baseline so future drift alerts are meaningful. If you skip the re-baseline, the PSI will sit at 0.28 forever and the operator will learn to ignore it — exactly the alert-fatigue failure to avoid. Common wrong answer to avoid: "Never re-baseline — the launch baseline is sacred."

Q7. Cumulative — your dashboard shows pass rate 78%, PSI 0.05, judge-human agreement drifting from 0.78 down to 0.66 over six weeks. CSAT is steady. What is happening and what do you do?

A. The judge's calibration is decaying: the chapter 08 certificate is going stale even though pass rate and CSAT look fine. The judge grades the same way it always did, but the failure modes have moved into areas it was never calibrated for, so its agreement with human truth is dropping. Pass rate stays green only because the judge no longer flags the failures that exist. CSAT is a lagging signal that will eventually drop too. Action: run a calibration sprint against a fresh human-graded sample, re-anchor the judge, and consider expanding the rubric anchors to cover the new failure shapes the judge is missing. Common wrong answer to avoid: "Pass rate is green and CSAT is steady, ignore the judge drift."

Q8. Why must slice alerts use a two-window confirmation rule before paging?

A. Because per-window pass-rate noise at typical sample sizes (50–150 chats/slice/day) is large enough to produce false alarms on a single window — a 5-point swing on n=100 is well within binomial noise. Two-window-confirm filters single-window noise without adding meaningful latency: at daily cadence, it's a 24-hour confirm. The pressure relieved is alert fatigue; the pressure created is a small detection lag, which the embedding-PSI signal already covers as a separate leading indicator. Common wrong answer to avoid: "Page on every window — speed matters most."
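The noise argument and the confirm rule can be sketched in a few lines. The function names and the 5-point threshold in the usage note are hypothetical; the standard error is the usual binomial formula sqrt(p(1 − p)/n).

```python
import math
from typing import Sequence


def pass_rate_standard_error(p: float, n: int) -> float:
    """Binomial standard error of a pass rate p (as a fraction) on n chats."""
    if n <= 0:
        raise ValueError("n must be positive")
    return math.sqrt(p * (1.0 - p) / n)


def confirmed_drop(
    window_rates: Sequence[float],
    baseline: float,
    threshold_points: float,
) -> bool:
    """Two-window confirm: True only when the two most recent windows both
    sit at least `threshold_points` below `baseline`.

    Rates and baseline are in percentage points (e.g. 62.0 for 62%).
    """
    if len(window_rates) < 2:
        return False
    return all(baseline - rate >= threshold_points for rate in window_rates[-2:])
```

At the enterprise launch rate of 62% and n=100, `pass_rate_standard_error(0.62, 100)` comes out a little under 0.05, so a 5-point swing is about one standard error. With a hypothetical 5-point threshold, `confirmed_drop([62, 55, 63], 62, 5)` stays quiet because the dip did not repeat, while `confirmed_drop([62, 55, 54], 62, 5)` pages.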

Apply now (10 min)

Step 1 — model the exercise. For the refund chatbot at end of week 3, this is the slice × source grid an on-call would read:

| Slice | Pass rate Δ | PSI vs baseline | Retrieval hit Δ | Judge vs human | Diagnosis |
| --- | --- | --- | --- | --- | --- |
| All chats | −2 | 0.28 | −17 | 0.77 | Aggregate hides slice |
| Enterprise (22%) | −21 | 0.41 | −22 | 0.77 | Input drift + data freshness |
| Free tier (48%) | +3 | 0.09 | −4 | 0.78 | Stable |
| EU (14%) | +1 | 0.11 | −6 | 0.77 | Stable, watch PSI |
| Multi-turn (16%) | −5 | 0.18 | −12 | 0.76 | Adjacent to enterprise — same root cause |

The grid points the on-call at retrieval freshness for enterprise queries within minutes. The aggregate row, read alone, says ship more dashboards.
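One way to read the grid mechanically is to flag each cell against a per-signal threshold. In this sketch the row values are the week-3 figures from the grid and the PSI line of 0.2 comes from the chapter. The other three thresholds, the `GridRow` type, and `flagged_signals` are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import List

PSI_ALERT = 0.2                # from the chapter's self-check answers
PASS_DROP_ALERT = -5.0         # hypothetical: points of pass-rate change
RETRIEVAL_DROP_ALERT = -10.0   # hypothetical: points of retrieval-hit change
JUDGE_FLOOR = 0.70             # hypothetical: minimum judge-human agreement


@dataclass(frozen=True)
class GridRow:
    slice_name: str
    pass_rate_delta: float
    psi: float
    retrieval_hit_delta: float
    judge_agreement: float


def flagged_signals(row: GridRow) -> List[str]:
    """Return the names of the signals in this row that cross their threshold."""
    flags = []
    if row.pass_rate_delta <= PASS_DROP_ALERT:
        flags.append("pass_rate")
    if row.psi > PSI_ALERT:
        flags.append("psi")
    if row.retrieval_hit_delta <= RETRIEVAL_DROP_ALERT:
        flags.append("retrieval")
    if row.judge_agreement < JUDGE_FLOOR:
        flags.append("judge")
    return flags


GRID = [
    GridRow("All chats", -2, 0.28, -17, 0.77),
    GridRow("Enterprise", -21, 0.41, -22, 0.77),
    GridRow("Free tier", 3, 0.09, -4, 0.78),
    GridRow("EU", 1, 0.11, -6, 0.77),
    GridRow("Multi-turn", -5, 0.18, -12, 0.76),
]

if __name__ == "__main__":
    for row in GRID:
        print(f"{row.slice_name:12s} {flagged_signals(row)}")
```

Under these assumed thresholds only the enterprise row lights up on pass rate, PSI, and retrieval together, and the judge column stays clear on every row. That matches the diagnosis column: retrieval freshness on enterprise queries, not the judge.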

Step 2 — your turn. Take one production AI feature you own or know well. Sketch the slice × source grid: pick 3–5 slices, pick the four signals (rolling pass rate Δ, PSI, one structural signal like retrieval or tool success, judge-human agreement). For each row, predict which signal would move first under each of the five drift sources. The exercise is finding the cells where you cannot predict — those are your blind spots.

Step 3 — reproduce from memory. Without scrolling, draw the four-signal architecture diagram from the mechanism section. Label which signal is leading, which is structural, which is the eval-system-health check. Then write one sentence connecting it to the chapter 01 rule about samples and population. If you can do this cold, the chapter has landed.

What you should remember

This chapter explained why a launch eval certificate decays the moment the launch is over, and why a single aggregate pass-rate chart in production is the operational equivalent of shipping on vibes. The refund chatbot held its aggregate at 76% for three weeks while enterprise collapsed from 62% to 41% and retrieval freshness silently rotted against a new product line. The mechanism that catches this is the kitchen log raised to production cadence: a grid of slices crossed with four signals — rolling pass rate, embedding-distribution distance, structural health like retrieval or tool success, and judge-versus-human agreement — read together, alerted on per cell, with a two-window-confirm rule.

You learned the five drift sources — input, tool, prompt, model, data — and that each one moves a different signal first. You learned to set the window cadence by traffic volume: hourly for high-volume, daily for medium, weekly for low, qualitative-only below 500 chats/day per slice. You learned the cost movement: roughly $0.75/day for an aggregate-only dashboard, $25/day for the full four-signal stack including human dual-grading. The new pressure created by every added signal is alert fatigue, absorbed by the on-call rotation, paid down by clear runbooks per drift source.

Carry this diagnostic forward: when the aggregate looks fine and customers are unhappy, your first move is to pull the slice × source grid. Aggregates are for launch certificates; grids are for production health. The kitchen log in its production form is not "the dish was returned" but "which station, which ingredient, which shift, and how does that compare to last week?" If you cannot answer that grid question in ten seconds when paged, the dashboard is decoration, not detection.

Remember:

  • Five drift sources — input, tool, prompt, model, data. Each one moves a different signal first.
  • Aggregate-only dashboards are silent on the failure shape that happens most: a low-volume high-stakes slice collapsing while a high-volume slice compensates.
  • PSI is a leading indicator. Pass rate is a lagging one. Wire both.
  • Re-baseline only after repair, never as a way to hide drift.
  • Judge-versus-human agreement must run continuously, not just at launch — chapter 08's calibration certificate decays.
  • The failure that takes longest to notice is the most expensive: the eval system itself no longer grading the right thing.

Bridge. Drift detection tells you that quality has moved and which source moved it. It does not tell you whether your proposed fix actually fixes the problem without breaking something else — that is a controlled-comparison question, not a monitoring question. The same statistical machinery that powers slice-level alerts powers the next chapter's safe rollouts and head-to-head comparisons, but the hypothesis flips: not "has the system changed?" but "is version B genuinely better than version A on the same traffic?"

10-ab-testing.md