13. Drift detection — the cold case¶

~15 min read. Every trace looks fine. But week over week, quality is sliding. This is the bug that hides in the average.

Built on the ELI5 in 00-eli5.md. The cold case — a failure where no single case file looks wrong, yet crime statistics trend the wrong way — needs a different style than chapters 1-12. Aggregate first, single trace later.

The picture before the math¶

Every chapter so far started with one complaint and one trace. Drift is the opposite — no complaint, no single trace, just a slow slide on the case board.

escalation rate (% of sessions handed to human)

week  1     2     3     4     5     6
 14% ─┤                              ●
 12% ─┤                  ●─────●─────│
 10% ─┤      ●─────●     │     │     │
  8% ─●──────│     │     │     │     │
      └─────────────────────────────┘
      baseline                    today

No single session looks broken. Yet the curve climbs. That is the cold case.

Baseline 8%. Today 14%. Six points up in six weeks. The alarm bell should ring before the user emails you. The move is to open the weekly cold-case review and look at crime statistics instead of individual case files.

The six flavours of drift¶

Drift is not one thing. It is six. Each needs a different detection method.

input drift               ──→ what users ask changed
output distribution drift ──→ what agent decides changed
embedding drift           ──→ RAG hits subtly shifted
eval-score decay          ──→ stable eval suite scores lower
cost drift                ──→ tokens per session creeping up
capability drift          ──→ once-stable task type now fails

Each is a cold case in its own way. Each needs different evidence tags on spans.

1. Input drift — users changed¶

Users started asking different questions. New product launched. Topic mix shifted. Agent has not changed. The world did.

Detection. PSI (Population Stability Index) on intent classification.

intent           baseline %   current %   contrib
order_status         45           38       0.012
refund               20           18       0.002
shipping             15           12       0.007
new_product_X         0           14       0.234  (baseline floored at 2%)
other                20           18       0.002

contrib = (current - baseline) × ln(current / baseline), with shares as fractions.

PSI ≈ 0.26

Thresholds: < 0.10 no shift. 0.10-0.25 moderate. ≥ 0.25 significant. A 14% slice is a brand-new intent. That is the alarm bell.
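The PSI arithmetic is easy to script. Below is a minimal sketch, not a reference implementation: the function names `psi` and `psi_level` are made up for this example, and the floor value and the choice not to renormalise after flooring are assumptions. The result is sensitive to the floor: on the shares in the table above, a 2% floor gives roughly 0.26 and a 1% floor roughly 0.37.

```python
import math

def psi(baseline, current, floor=0.02):
    """Population Stability Index over category shares (fractions, not percent).

    Shares below `floor` are raised to `floor` so the log stays defined.
    Assumption: shares are not renormalised after flooring.
    """
    contributions = {}
    for category in sorted(set(baseline) | set(current)):
        b = max(baseline.get(category, 0.0), floor)
        c = max(current.get(category, 0.0), floor)
        contributions[category] = (c - b) * math.log(c / b)
    return sum(contributions.values()), contributions

def psi_level(value):
    """Map a PSI value onto the chapter's three action levels."""
    if value < 0.10:
        return "no shift"
    if value < 0.25:
        return "moderate"
    return "significant"

baseline = {"order_status": 0.45, "refund": 0.20, "shipping": 0.15,
            "new_product_X": 0.00, "other": 0.20}
current = {"order_status": 0.38, "refund": 0.18, "shipping": 0.12,
           "new_product_X": 0.14, "other": 0.18}

value, parts = psi(baseline, current)
print(f"PSI = {value:.2f} ({psi_level(value)})")   # PSI = 0.26 (significant)
```

The per-category `parts` dict is worth logging alongside the total. It shows which bin drives the score, and that bin is the bucket to sample traces from.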

Action. Add a playbook for the new intent. Sample 20 traces from the new bucket. See 00_ai_evals_release_gates/09-drift-detection.md for slice-eval method depth.

2. Output distribution drift — proportions shifted¶

Each output looks fine. The mix changed. Refund rate 5% → 11%. Escalation rate 8% → 14%.

Detection. Two-proportion z-test.

baseline:  12,000 sessions,   960 escalations  →  8.0%
current:   12,000 sessions, 1,680 escalations  → 14.0%

p̂ = 2640/24000 = 0.11
z = (0.14 - 0.08) / sqrt(p̂(1-p̂)(2/n)) ≈ 14.8
p-value ≈ 0  →  real shift, not noise
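The same calculation as a short sketch in standard-library Python. The function name `two_proportion_z` is invented for this example, and the pooled standard error assumes the two windows are independent samples.

```python
import math

def two_proportion_z(hits_a, n_a, hits_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Baseline window first, current window second.
z, p = two_proportion_z(960, 12_000, 1_680, 12_000)
print(f"z = {z:.2f}")   # z = 14.85
print(f"p = {p:.3g}")   # effectively zero
```

With 12,000 sessions per window, even a one-point shift would clear significance. Pair the z-score with the absolute change, so a statistically real but tiny shift does not page anyone.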

Action. Slice by intent, tenant, channel. Find the bleeding slice. Sample 30 traces and run the lineup from chapter 06 — the suspects are still the same five layers.

3. Embedding drift — RAG hits subtly different¶

Index unchanged. Embedding model unchanged, at least on paper. Yet retrieval shifted. Queries changed shape. Or a tokenizer was upgraded quietly. Or the embeddings endpoint silently rolled forward (it happens).

Detection. Overlap@k on a frozen eval query set.

fixed query: "how do I cancel my subscription"
baseline top-5:  [d_042, d_017, d_088, d_201, d_055]
current  top-5:  [d_042, d_088, d_017, d_311, d_055]
overlap@5 = 4/5 = 0.80   (d_201 out, d_311 in)

Track overlap@k weekly on a frozen 200-query set. A drop from 0.95 to 0.80 is the alarm bell.
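A minimal sketch of the overlap@k check, assuming each run is stored as a mapping from query text to a ranked list of document ids. That storage shape and both function names are assumptions for this example.

```python
def overlap_at_k(baseline_ids, current_ids, k=5):
    """Share of the baseline top-k still present in the current top-k (order ignored)."""
    return len(set(baseline_ids[:k]) & set(current_ids[:k])) / k

def mean_overlap(baseline_runs, current_runs, k=5):
    """Average overlap@k over a frozen query set; both dicts map query -> ranked doc ids."""
    scores = [overlap_at_k(baseline_runs[q], current_runs[q], k) for q in baseline_runs]
    return sum(scores) / len(scores)

baseline = {"how do I cancel my subscription": ["d_042", "d_017", "d_088", "d_201", "d_055"]}
current = {"how do I cancel my subscription": ["d_042", "d_088", "d_017", "d_311", "d_055"]}
print(mean_overlap(baseline, current))   # 0.8
```

This version ignores rank order, so the swap of d_017 and d_088 costs nothing. If position matters for your prompt assembly, a rank-aware measure would be a stricter choice.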

Action. Diff embed.model.version. Re-embed a sample with the prior tokenizer. Tag every retrieval span with embed.model.version as an evidence tag.

4. Eval-score decay — frozen suite scores lower¶

A frozen eval suite is your reference clock. Same 500 examples. Same judge. Same rubric. Today 87.2. Five weeks ago 89.4.

Detection. Trend test on weekly eval runs.

week    1     2     3     4     5     6
score  89.4  89.1  88.8  88.3  87.7  87.2

Six weekly runs, each lower than the one before. Mann-Kendall p < 0.01. Not noise.
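The trend test fits in a few lines. This is a minimal Mann-Kendall sketch with two simplifying choices made here: a normal approximation with continuity correction, and no correction for tied scores. At six points the approximation is coarse. An exact permutation count gives a smaller p-value, so the sketch errs on the cautious side.

```python
import math

def mann_kendall(values):
    """Return (S, z, two-sided p) for a monotonic-trend test. Assumes no tied values."""
    n = len(values)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = values[j] - values[i]
            s += (diff > 0) - (diff < 0)          # sign of each later-minus-earlier pair
    var_s = n * (n - 1) * (2 * n + 5) / 18        # variance of S under "no trend"
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return s, z, p_value

scores = [89.4, 89.1, 88.8, 88.3, 87.7, 87.2]
s, z, p = mann_kendall(scores)
print(f"S = {s}, z = {z:.2f}, p < 0.01: {p < 0.01}")
```

S is -15 here, the most negative value six points allow, because every later score is below every earlier one.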

Action. Check model version (chapter 11). Check prompt-assembly hash. Check judge model — judges drift too. See 00_ai_evals_release_gates/09-drift-detection.md for suite hygiene.

5. Cost drift — tokens creeping up¶

Quality fine. Latency fine. Bill climbing. Mean tokens per session 4,200 → 6,800.

Detection. Median + p95 trend, or KL on token-count histograms.

metric           week 1     week 6
median tokens     3,800     5,400   (+42%)
p95 tokens        7,200    11,300   (+57%)

p95 climbing faster than median → some sessions blew up. Usually loop runaway (ch 09) or context bleed (ch 07). Tag every LLM span with tokens.in and tokens.out as evidence tags.
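A sketch of the median-plus-p95 comparison, assuming you can pull per-session token totals for two windows. The helper names, the nearest-rank percentile convention, and the 20% default threshold (borrowed from the weekly checklist) are choices made for this example.

```python
import statistics

def p95(values):
    """95th percentile by the nearest-rank method (one of several conventions)."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100   # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]

def pct_change(before, after):
    return (after - before) / before

def cost_drift(baseline_tokens, current_tokens, threshold=0.20):
    """Compare two windows of per-session token counts."""
    median_delta = pct_change(statistics.median(baseline_tokens),
                              statistics.median(current_tokens))
    p95_delta = pct_change(p95(baseline_tokens), p95(current_tokens))
    return {
        "median_delta": median_delta,
        "p95_delta": p95_delta,
        "fires": max(median_delta, p95_delta) > threshold,
        "tail_growing_faster": p95_delta > median_delta,
    }

# The summary numbers from the table, checked directly.
print(f"median {pct_change(3_800, 5_400):+.0%}, p95 {pct_change(7_200, 11_300):+.0%}")
# median +42%, p95 +57%
```

When `tail_growing_faster` is true, slice before doing anything else. A fat tail usually lives in one tenant or one intent.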

Action. Slice by tenant, intent, model. Find the fat tail. Chapter 14 takes this further.

6. Capability drift — once-stable task now fails¶

Agent used to handle "refund older than 30 days" at 95% accuracy. This week: 78%. Same task. Same prompt.

Detection. Per-task accuracy on a tagged eval set.

task type          week 1   week 6   delta
order_status        96%      95%     -1
refund_old          95%      78%    -17   ← bleeding
shipping_track      92%      91%     -1
escalation_decide   88%      87%     -1

One slice is bleeding. Others are stable. The cold case gives a confession through statistics.
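A small sketch of the per-task comparison. It assumes each eval result carries a task-type tag and a pass/fail flag. That record shape, the function names, and the 3-point default (taken from the weekly checklist) are assumptions for this example.

```python
from collections import defaultdict

def accuracy_by_task(results):
    """results: iterable of (task_type, passed) pairs from one eval run."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for task_type, ok in results:
        total[task_type] += 1
        passed[task_type] += int(ok)
    return {task: 100.0 * passed[task] / total[task] for task in total}

def bleeding_tasks(baseline_acc, current_acc, max_drop_pts=3.0):
    """Tasks whose accuracy fell by more than max_drop_pts percentage points."""
    flagged = {}
    for task, before in baseline_acc.items():
        if task not in current_acc:
            continue  # task missing from the current run; handle separately
        delta = current_acc[task] - before
        if delta < -max_drop_pts:
            flagged[task] = delta
    return flagged

week1 = {"order_status": 96.0, "refund_old": 95.0,
         "shipping_track": 92.0, "escalation_decide": 88.0}
week6 = {"order_status": 95.0, "refund_old": 78.0,
         "shipping_track": 91.0, "escalation_decide": 87.0}
print(bleeding_tasks(week1, week6))   # {'refund_old': -17.0}
```

On small slices a few flipped examples can move accuracy by several points, so keep each tagged slice large enough for a 3-point drop to mean something.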

Action. Pull 50 failing traces from refund_old. Run the lineup. With the task and prompt unchanged, the bug is most likely in the model or memory layer.

The weekly cold-case review¶

The case board is for daily incidents. The cold-case review is once a week. Quiet. Methodical.

weekly cold-case review checklist
┌────────────────────────────────────────────────────┐
│ 1. PSI on input intent distribution      > 0.25?   │
│ 2. Output mix (refund/escalation rates)  Δ > 2σ?   │
│ 3. RAG overlap@5 on frozen query set     < 0.90?   │
│ 4. Eval suite score on frozen 500        Δ < -1?   │
│ 5. Median + p95 tokens per session       Δ > 20%?  │
│ 6. Per-task accuracy on eval set         Δ < -3?   │
└────────────────────────────────────────────────────┘

One engineer. One hour. Every Monday. Most weeks, nothing fires. Some weeks, you catch a cold case four weeks before users do.
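The checklist translates directly into a table of thresholds. A minimal sketch follows. The metric key names are hypothetical, and reading "Δ > 2σ" as |z| > 2 is an interpretation made here. The PSI and z values in the sample week come from the worked example in this chapter; the other four values are invented for illustration.

```python
# (check name, metric key, predicate that returns True when the check fires)
CHECKS = [
    ("input drift",      "psi",              lambda v: v > 0.25),
    ("output drift",     "output_z",         lambda v: abs(v) > 2.0),
    ("embedding drift",  "overlap_at_5",     lambda v: v < 0.90),
    ("eval-score decay", "eval_delta_pts",   lambda v: v < -1.0),
    ("cost drift",       "token_delta",      lambda v: v > 0.20),
    ("capability drift", "worst_task_delta", lambda v: v < -3.0),
]

def weekly_review(metrics):
    """Return (fired, missing): checks that fired and checks with no metric supplied."""
    fired, missing = [], []
    for name, key, fires in CHECKS:
        if key not in metrics:
            missing.append(name)
        elif fires(metrics[key]):
            fired.append(name)
    return fired, missing

week = {"psi": 0.26, "output_z": 14.8, "overlap_at_5": 0.95,
        "eval_delta_pts": -0.4, "token_delta": 0.08, "worst_task_delta": -1.0}
fired, missing = weekly_review(week)
print("fired:", fired)       # fired: ['input drift', 'output drift']
print("missing:", missing)   # missing: []
```

Reporting missing metrics separately is deliberate. A check that silently did not run looks the same as a check that passed.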

Worked example — the escalation climb¶

A customer-service agent. Six weeks of data.

Symptom. Escalation rate climbed 8% → 14%. No single complaint slip. No single trace flagged.

Step 1. Weekly review fires output drift (z = 14.8). Real.

Step 2. Slice escalations.

intent class         baseline   current    (escalation rate)
order_status           7%         8%
refund                 9%        10%
shipping               8%         9%
product_question       8%        42%   ← bleeding
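The slicing step can be sketched as a group-by. The session record shape (`intent` and `escalated` fields) and both function names are hypothetical; adapt them to whatever your trace store exports.

```python
from collections import defaultdict

def escalation_rate_by_slice(sessions, key="intent"):
    """Escalation rate per slice; sessions are dicts with `key` and a boolean 'escalated'."""
    totals = defaultdict(int)
    escalated = defaultdict(int)
    for session in sessions:
        totals[session[key]] += 1
        escalated[session[key]] += int(session["escalated"])
    return {name: escalated[name] / totals[name] for name in totals}

def bleeding_slice(baseline_rates, current_rates):
    """Slice present in both windows with the largest rise in rate."""
    shared = baseline_rates.keys() & current_rates.keys()
    worst = max(shared, key=lambda name: current_rates[name] - baseline_rates[name])
    return worst, current_rates[worst] - baseline_rates[worst]

baseline = {"order_status": 0.07, "refund": 0.09, "shipping": 0.08, "product_question": 0.08}
current = {"order_status": 0.08, "refund": 0.10, "shipping": 0.09, "product_question": 0.42}
name, rise = bleeding_slice(baseline, current)
print(name, f"{rise:+.0%}")   # product_question +34%
```

Passing `key="tenant"` or `key="channel"` reuses the same function for the other slices the chapter suggests.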

Step 3. PSI on intent distribution = 0.26. A new product launched. New intent class. Agent has no playbook.

Step 4. Pull 30 traces from the new bucket. Run the lineup. Prompt has no instructions for the new product. Tool returns "unknown SKU". Agent escalates.

Confession. Input drift exposed a missing playbook. Not a layer bug. A coverage gap.

Lock. Add new product to the playbook. Add eval examples for new-product questions. Add policy: any intent class > 5% traffic must have an eval slice. The lock is permanent.

The cold case is solved. No single trace was wrong — the whole shape was.

Drift-detection patterns in shipped LLM products¶

Arize Phoenix — drift dashboards compute PSI and KL divergence on embeddings and outputs in real time; alerts wire to PagerDuty.
WhyLabs (whylogs) — LLM drift monitoring profiles token counts, response lengths, and prompt embeddings to flag distribution shifts weekly.
Evidently AI — open-source library generates drift reports across categorical, numerical, and text columns; widely used for offline weekly review of LLM input/output mix.
NannyML — specialises in detecting silent model failure without ground truth, estimating performance via DLE and CBPE — fits eval-score decay even when labels arrive late.
Fiddler AI — enterprise observability with LLM-specific monitors for safety, drift, and hallucination rate; integrates SHAP-style explanations for shifted predictions.
Galileo (Luna evaluators) — production monitoring tracks groundedness, completeness, and PII drift per slice; alerts on rolling-window deviations.
LangSmith monitoring — built-in datasets enable a frozen eval suite to run on a schedule; per-slice score deltas surface as first-class charts in the dashboard.
Helicone observability — request-level analytics with custom property segmentation makes input drift visible as a shifting tenant or feature mix.
Comet Opik drift — open-source LLM eval platform tracks score trends across model and prompt versions; integrates with frozen "regression sets" similar to the chapter's pattern.
Anthropic model-card drift studies — published evals on Claude families compare scores across versions on identical task sets, treating eval-score decay as a first-class release gate.
Vectara drift checks — RAG-as-a-service product surfaces retrieval-quality drift via HHEM hallucination scoring on a moving window of queries.
Datadog drift detection — LLM observability product adds custom metrics for token counts, completion length, and per-tenant cost; PSI-style alerts wire into existing on-call.
Anthropic's continuous-eval pipelines — frozen eval suites run nightly against production model versions; a 1-point drop blocks the next release.
Notion AI — tracks quality regression by sampling completions across feature surfaces; per-surface acceptance rate is the leading drift signal.
GitHub Copilot — completion-acceptance rate (% suggestions kept by the user) is the headline drift signal; a 2pp drop triggers investigation of model and prompt changes.
Cursor — tracks tab-acceptance rate and edit-revert rate weekly; sustained dips against a release baseline trigger model and prompt bisection.
Replit Ghostwriter / Agent — monitors task-completion success across project types; per-language slices catch capability drift (e.g., Rust drops while Python holds).
Perplexity — citation-accuracy drift is monitored per topical slice; a drop on a single slice (say, finance) triggers reviewer audit before users notice.
OpenAI Evals + production sampling — internal pipelines compare frozen-suite scores across snapshots of the same model name (e.g., gpt-4o-2024-XX-YY) to catch silent rollouts.

Recall — the six drifts and the weekly review¶

Why does a daily case-by-case review miss a cold case?
PSI returns 0.18. What action level does that signal, and what do you do next?
Embedding drift can happen without changing the index. Name two causes.
When p95 tokens climbs faster than median, what is the likely class of bug?

Interview Q&A¶

Q: A new feature launched and quality dropped. Is this a regression or input drift? A: It can be either, so separate them before acting. Input drift means the traffic shape changed, and PSI on the intent distribution rises. Slice by intent. If old intents score the same and only new intents fail, it is a coverage gap (input drift, no playbook), not a regression in the agent's capability. If old intents also score lower, you have a regression as well.

Common wrong answer to avoid: "Roll back the model" — the model did not change. The traffic did. Rollback fixes nothing and may break slices that still work.

Q: Why is PSI preferred over raw category counts for input drift detection? A: Raw counts move with traffic volume. PSI normalizes to proportions and weights each bin by how much its share changed. It gives one number you can threshold (0.10, 0.25). It is symmetric, so swapping baseline and current gives the same value, unlike one-directional KL. It is not bounded, and like KL it is undefined when a bin has zero share on either side, which is why a new category gets a floor.

Common wrong answer to avoid: "PSI is the same as KL divergence". PSI is the symmetrized form, the sum of KL taken in both directions over the same bins; raw KL is asymmetric. Both explode on zero-probability bins unless you floor them. PSI is the operational form.
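The relationship between the two can be written out. With b_i the baseline share and c_i the current share of bin i (symbols introduced here for illustration), expanding each PSI term gives:

```latex
\mathrm{PSI}(b, c)
  = \sum_i (c_i - b_i)\,\ln\frac{c_i}{b_i}
  = \sum_i c_i \ln\frac{c_i}{b_i} + \sum_i b_i \ln\frac{b_i}{c_i}
  = D_{\mathrm{KL}}(c \,\|\, b) + D_{\mathrm{KL}}(b \,\|\, c)
```

Swapping b and c leaves the sum unchanged, which is the symmetry. Every term still contains ln(c_i / b_i), so a zero share on either side needs a floor before the sum is defined.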

Q: Your eval suite scored 89 a month ago and 87 today. Model version is unchanged. What three things do you check first? A: One, the judge model version — LLM judges drift independently. Two, the prompt-assembly hash — silent template changes mutate scores. Three, the eval data pipeline — sometimes the eval set gets re-shuffled or a sample changes. Verify each is byte-identical to the prior run before touching the agent.

Common wrong answer to avoid: "The model regressed silently" — possible but rare on pinned versions. Always rule out judge drift and pipeline drift first; they are cheaper to confirm.

Q: Cost per session is up 30% with no quality change. Where do you look? A: Slice tokens-in and tokens-out by tenant, intent, and model. Compare p95 vs median — if p95 alone exploded, suspect loop runaway on a small slice. If median climbed, suspect prompt bloat — someone added a system-prompt chunk, or retrieval is returning longer passages. Trace-level evidence tags on token counts make this a five-minute query, not a five-day one.

Common wrong answer to avoid: "Just lower max_tokens" — that caps the symptom and may truncate legitimate answers. Find the slice that bloated and fix the cause.

Apply now (5 min)¶

Step 1 — model the exercise. Walk the escalation-climb case. Baseline escalation rate was 8%, today's is 14%. No complaint slip, no single bad case file — just a slow slide. The weekly review fires output-drift on a z-test (z = 14.8, p ≈ 0). Slicing by intent shows product_question jumped from 8% to 42% while other intents held flat. A PSI of 0.26 on the intent distribution confirms a new product launched and the agent has no playbook for it. Thirty traces from the new bucket show the prompt has no instructions, the tool returns "unknown SKU", and the agent escalates by policy. The confession is a coverage gap, not a layer bug. The lock is a permanent policy — any intent class above 5% traffic must have an eval slice — plus a refreshed playbook and eval examples.

Step 2 — your turn. Pick a production AI workflow you know. Write down the six drift types and choose one detection metric per type with a concrete threshold. Then compute PSI by hand on this small case — baseline A=50%, B=30%, C=20%; current A=40%, B=30%, C=20%, D=10% (new). Use a 1% floor for the zero bin and compare to the 0.25 threshold. Decide whether you would page on-call.

Step 3 — reproduce from memory. Without scrolling, draw the six-week escalation-rate climb against a flat baseline and label which week you would catch it with a weekly cold-case review. Then list the six drift flavours from memory and one detection metric for each. If you can do this cold, the cold case discipline is yours.

What you should remember¶

This chapter taught the bug class that hides from the lineup — failures where no single case file looks wrong, but the crime statistics trend the wrong way. A cold case does not arrive as a complaint slip; it arrives as a slow slide on the case board that nobody emails about. Day-to-day debugging cannot see it because day-to-day debugging samples one trace at a time. Weekly review samples the distribution.

The six drift flavours are six different shapes of the same problem. Input drift means users changed; detection is PSI on intent distribution. Output distribution drift means the agent's mix of decisions changed; detection is a two-proportion z-test on rates that matter, like refund or escalation. Embedding drift means retrieval shifted even when the index and model look unchanged; detection is overlap@k on a frozen query set. Eval-score decay means the frozen suite scores lower against a pinned model; detection is a trend test on weekly runs. Cost drift means tokens per session climb; detection is a median plus p95 trend. Capability drift means a once-stable task type now fails; detection is per-task accuracy on a tagged eval set.

Each drift needs its own alarm bell and its own evidence tag on spans. The weekly cold-case review is one engineer for one hour every Monday, walking the six checks. Most weeks nothing fires. Some weeks you catch a regression four weeks before users do, which is the entire point.

Carry one diagnostic forward — when a drift fires, do not start with the lineup. Slice the crime statistics first to localise the bleeding slice. Then sample 30 traces from that slice and run the lineup on a normal case file drawn from the new distribution. The lock is structural — frozen eval sets, slice tables, PSI thresholds, and a policy that any traffic class above a threshold gets its own eval slice.

Remember:

A cold case leaves no single bad case file. The bug is in the shape of the distribution, not in any one trace.
The alarm bell for drift rings on aggregates — PSI, z-tests, overlap@k, trend tests — not on individual complaint slips.
Each drift flavour needs its own evidence tag on spans, or the cold-case review cannot slice cleanly.
The first suspect for input drift is a coverage gap, not a layer bug. Roll-backs of unchanged models fix nothing.
The lock for cold cases is a policy — frozen eval suite, weekly checklist, PSI thresholds, mandatory slice for any intent above 5% traffic.
When a drift fires, slice first, lineup second. The bleeding slice tells you which 30 traces to walk.

Bridge. Drift in quality is now visible. But agents drift in two other dimensions — speed and cost. A slow agent or an expensive agent is still a bug, even when answers are correct. → 14-latency-cost-regressions.md