06. Prompt drift detection — did the small wording fix change behavior?

~17 min read. A one-line prompt edit can keep your eval green and still ship a behavior change your users feel in the first hour. Drift detection is the discipline of asking, every time a recipe is edited, "did this actually change what comes out of the kitchen?" — and answering with numbers, not vibes.

Builds on 05-shadow-and-ab-testing.md. Shadow tells you v18 differs from v17 on real traffic. This page is how you measure that difference along axes the eval rubric does not see.

1) Hook — the harmless edit that wasn't

A team ships a one-line prompt edit. Before:

Be concise.

After:

Keep responses under 3 sentences.

The author argued, correctly on paper, that "under 3 sentences" is a clearer instruction than "concise." The offline eval suite passed. The pairwise judge called it close to even. The change merged on a Tuesday afternoon. Friday morning, the on-call channel woke up.

The complaints clustered. "The bot feels rushed." "It used to explain. Now it just says no." "Why is it so curt?" The product manager pulled metrics. CSAT hadn't moved. NPS hadn't moved. But session count had dropped 4% and complaint volume had jumped 11%.

The team dug. The offline eval was fine: every rubric dimension held. The pairwise judge was fine: 38% win, 36% tie, 26% loss, net positive. But average output length had collapsed from 80 tokens to 24 tokens. Roughly 70% of the typical response had been stripped. The "harmless wording fix" had shifted the distribution of outputs the bot produced. The rubric did not measure length. The judge did not weight length. Users did.

This is drift. The output shape changed. The output distribution changed. The downstream effect — session count, complaint rate — changed. None of it showed up in the pre-merge gate. All of it showed up in the bakery log, after the fact, in production.

The drift question is the question this page exists for. Given a prompt edit, did the behavior change in ways that matter — and if so, in which ways?

2) The metaphor — the recipe edit that doesn't taste different but smells different

Picture the bakery again. The head baker tweaks the recipe — "knead for five minutes" becomes "knead until smooth." Same croissant, same flour, same oven. The croissants come out. A taste test says identical. A scoring panel says identical. The customer walks in, picks up a croissant, and says "smells different."

Smell was not on the taste-test rubric. The recipe edit changed something the rubric did not measure. The rubric was complete for what the team thought to score. It was not complete for what the customer experienced. That gap is drift.

Drift detection is the kitchen's habit of measuring everything that could change when a recipe is edited — not just the things the panel scores. Length of bread. Color of crust. Salt content. Smell. JSON validity, if the kitchen happens to be a JSON kitchen. Anything that could shift and that a customer could notice. Measure all of it before and after every edit. Notice what moved. Decide whether the shift is acceptable.

3) The anatomy — three kinds of drift

┌────────────────────────────────────────────────────────────────────┐
│ DRIFT TAXONOMY                                                     │
├────────────────────────────────────────────────────────────────────┤
│                                                                    │
│  1) OUTPUT-SHAPE DRIFT                                             │
│     • length distribution shifted                                  │
│     • JSON validity rate changed                                   │
│     • schema match rate changed                                    │
│     • format (markdown / plain / list) changed                     │
│     measure: deterministic — no judge required                     │
│                                                                    │
│  2) OUTPUT-DISTRIBUTION DRIFT                                      │
│     • the words the model picks have shifted                       │
│     • semantic content has shifted (embeddings)                    │
│     • style / tone has shifted                                     │
│     measure: embeddings + LLM-as-judge pairwise                    │
│                                                                    │
│  3) DOWNSTREAM-EFFECT DRIFT                                        │
│     • tool calls fire differently                                  │
│     • retrieval calls hit different corpora                        │
│     • escalation rate / handoff rate changes                       │
│     • business metrics (CSAT, conversion) move                     │
│     measure: production traces + business dashboards               │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘

The "be concise" → "under 3 sentences" change was shape drift that the team caught downstream after it landed. Had the team measured shape drift pre-merge — a 70% drop in average length on the eval set — they would have either rejected the change or shipped it knowingly.

The three layers are not the same kind of measurement. Shape drift is deterministic — count tokens, parse JSON, run the schema validator. Distribution drift needs embeddings or a judge — semantic comparison. Downstream drift is the bakery log itself — observe what production did, with traces tagged by SHA.

4) Output-shape diffing — the cheapest, deepest signal

Run v17 and v18 over the same 200-example eval set. For each output, compute:

token count
character count
JSON validity (if structured output)
schema match rate (if you have a schema)
markdown structure (number of headers, lists, code blocks)
presence of refusals ("I cannot")
presence of disclaimers ("As an AI...")

Compare the distributions. Not just the means — the full distributions.

LENGTH DISTRIBUTION                v17                v18
────────────────────                ───                ───
p10                                 25 tok             8 tok
p50                                 80 tok             24 tok    ← collapsed
p90                                 200 tok            60 tok
mean                                95 tok             31 tok
JSON validity                       99.1%              99.4%
schema match                        96%                95%
refusal rate                        2%                 9%        ← jumped

Two signals jump out — length collapsed at every percentile, refusal rate jumped from 2% to 9%. The team now knows what changed before any user sees v18. The team can decide: ship knowingly, iterate, or reject.
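
The per-output measurements and the percentile table above can be produced with a few lines of standard-library Python. This is a minimal sketch under stated assumptions: whitespace splitting stands in for a real tokenizer, the phrase lists are hypothetical, and the names `shape_features`, `percentile`, and `shape_summary` are illustrative rather than part of any library.

```python
import json
import statistics

# Hypothetical phrase lists; tune these for your own product.
REFUSAL_PHRASES = ("i cannot", "i can't", "i'm sorry")
DISCLAIMER_PHRASES = ("as an ai",)


def shape_features(output: str) -> dict:
    """Deterministic per-output shape features; no judge involved."""
    lowered = output.lower()
    try:
        json.loads(output)
        valid_json = True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        valid_json = False
    return {
        # Assumption: whitespace split is a rough stand-in for a tokenizer.
        "tokens": len(output.split()),
        "chars": len(output),
        "valid_json": valid_json,
        "refusal": any(p in lowered for p in REFUSAL_PHRASES),
        "disclaimer": any(p in lowered for p in DISCLAIMER_PHRASES),
        "headers": sum(1 for line in output.splitlines() if line.startswith("#")),
    }


def percentile(values, q: int):
    """Nearest-rank percentile for an integer q in 1..100 on a non-empty list."""
    ordered = sorted(values)
    rank = max(1, -(-q * len(ordered) // 100))  # integer ceiling division
    return ordered[rank - 1]


def shape_summary(outputs) -> dict:
    """Summarize one prompt version's outputs as percentiles and rates."""
    feats = [shape_features(o) for o in outputs]
    lengths = [f["tokens"] for f in feats]
    n = len(feats)
    return {
        "p10": percentile(lengths, 10),
        "p50": percentile(lengths, 50),
        "p90": percentile(lengths, 90),
        "mean": statistics.mean(lengths),
        "json_valid_rate": sum(f["valid_json"] for f in feats) / n,
        "refusal_rate": sum(f["refusal"] for f in feats) / n,
    }
```

Running `shape_summary` once per prompt version over the same eval inputs yields two dictionaries that can be printed side by side, in the same layout as the table.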

Shape diffing is the cheapest drift test you can run. No judge needed. No human. No production exposure. A pytest suite over N=200 examples with a few percentile asserts will catch most of the harmful changes before they ever reach shadow mode.
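
A percentile assert of the kind described here might look like the sketch below. The 25% tolerance is an assumed default, the helper names are hypothetical, and the hard-coded numbers stand in for values computed by running both prompt versions over the eval set.

```python
def within_band(old: float, new: float, tolerance: float) -> bool:
    """True when new is within +/- tolerance (a fraction) of old."""
    if old == 0:
        return new == 0
    return abs(new - old) / old <= tolerance


def length_percentile_failures(old_pcts: dict, new_pcts: dict,
                               tolerance: float = 0.25) -> list:
    """Return the percentile names whose value moved outside the band."""
    return [
        name for name in old_pcts
        if not within_band(old_pcts[name], new_pcts[name], tolerance)
    ]


def test_length_percentiles_hold():
    # Placeholder values; a real suite computes these from model outputs.
    old = {"p10": 25, "p50": 80, "p90": 200}
    new = {"p10": 24, "p50": 78, "p90": 190}
    assert length_percentile_failures(old, new) == []
```

pytest collects `test_length_percentiles_hold` by its name prefix. Feeding the v18 column from the table above into `length_percentile_failures` reports all three percentiles as failures.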

5) Embedding-based output drift — for distribution change

Shape diffing catches what shifted in form. It does not catch what shifted in content. A prompt change can keep length, JSON validity, and structure identical and still pivot what the model is saying. To catch that, embed the outputs.

For each eval example, compute the embedding of v17's output and v18's output. Cosine similarity between the pair is the per-example signal. The distribution of cosine similarities across the eval set is the drift signal.

COSINE SIMILARITY DISTRIBUTION
──────────────────────────────
1.00 ──┤█                       ← identical
0.95 ──┤████████                ← near-identical
0.90 ──┤█████████████           ← minor rewording
0.85 ──┤███████                 ← noticeable rewording
0.80 ──┤████                    ← different content
0.75 ──┤██                      ← substantially different
0.70 ──┤█                       ← reformulated
< .70──┤█                       ← unrelated answer    ← worry

A drift report summarizes this distribution. v17 → v18 shifted 8% of outputs below 0.85 similarity. The team eyeballs the bottom 5% to see whether the divergent answers are improvements, regressions, or shrugs. Embedding drift is cheap, deterministic, and catches semantic shifts the rubric does not.

The tail matters more than the mean. A change can leave 95% of outputs near-identical and substantially reshape the remaining 5% — and that 5% is often the part users notice.
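
The pairwise cosine computation and the tail statistic can be sketched as follows. The bag-of-words `toy_embed` is only a stand-in so the example runs without dependencies; a real pipeline would pass its own embedding function, and similarity thresholds calibrated for one embedding model do not transfer to another. All function names are illustrative.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Bag-of-words stand-in for a real embedding model (assumption)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pair_similarities(old_outputs, new_outputs, embed=toy_embed) -> list:
    """One similarity per eval example: old output vs. new output."""
    return [cosine(embed(o), embed(n)) for o, n in zip(old_outputs, new_outputs)]


def tail_fraction(similarities, threshold: float = 0.80) -> float:
    """Share of pairs whose similarity falls below the threshold."""
    return sum(s < threshold for s in similarities) / len(similarities)
```

Reporting `tail_fraction` next to the mean similarity makes the point of this section visible in the report: the mean can look healthy while the tail fraction grows.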

6) LLM-as-judge drift — for the cases shape and embeddings miss

For each pair (v17_out, v18_out), an LLM judge answers a narrow question: "Is the new output materially different from the old, in a way a user would notice?" The answer is yes/no, not better/worse. This is drift detection, not quality eval.

The judge prompt is small and the bias surface is manageable. Position bias can be controlled by swapping order. Self-preference can be controlled by picking a judge from a different family than the candidate. Calibration against a small human-labeled set tells you the judge's false-positive and false-negative rates.
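
Calibration against a human-labeled set reduces to a small confusion-matrix calculation. The sketch below assumes both label lists are booleans where `True` means "materially different"; the function name and return keys are hypothetical.

```python
def judge_error_rates(human_labels, judge_labels) -> dict:
    """Compare judge verdicts with human labels (True = materially different)."""
    pairs = list(zip(human_labels, judge_labels))
    false_pos = sum(1 for h, j in pairs if j and not h)
    false_neg = sum(1 for h, j in pairs if h and not j)
    negatives = sum(1 for h, _ in pairs if not h)
    positives = sum(1 for h, _ in pairs if h)
    return {
        # Judge said "different" on pairs humans called the same.
        "false_positive_rate": false_pos / negatives if negatives else 0.0,
        # Judge said "same" on pairs humans called different.
        "false_negative_rate": false_neg / positives if positives else 0.0,
        "agreement": sum(1 for h, j in pairs if h == j) / len(pairs),
    }
```

A high false-negative rate is the more dangerous failure for a drift gate, since it means real behavior changes pass unflagged.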

A typical drift report includes the judge's per-pair verdict, aggregated. "v18 is materially different from v17 on 32% of pairs." The team then samples and reads those 32%, classifies them by reason — "more refusals," "shorter answers," "more cautious tone" — and decides whether to ship.
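
The aggregation step can be sketched with a pluggable judge. Here `judge_fn` is a hypothetical callable that takes two outputs and returns `True` when it considers them materially different; in practice it would wrap an LLM call. Presentation order is randomized per pair, which is safe because the drift question is symmetric.

```python
import random
from collections import Counter


def drift_verdict(old_out: str, new_out: str, judge_fn, rng) -> bool:
    """Ask the judge once, with the presentation order randomized."""
    if rng.random() < 0.5:
        first, second = old_out, new_out
    else:
        first, second = new_out, old_out
    return bool(judge_fn(first, second))


def drift_rate(pairs, judge_fn, seed: int = 0):
    """Return (share of pairs flagged, list of flagged pairs)."""
    rng = random.Random(seed)
    verdicts = [drift_verdict(o, n, judge_fn, rng) for o, n in pairs]
    flagged = [pair for pair, v in zip(pairs, verdicts) if v]
    return sum(verdicts) / len(pairs), flagged


def reason_counts(reasons) -> Counter:
    """Tally the reason labels a reviewer assigned to the flagged pairs."""
    return Counter(reasons)
```

The flagged list is what the team reads; `reason_counts` turns their hand-assigned labels into the "more refusals" and "shorter answers" breakdown for the report.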

7) The two traps — harmless drift and invisible drift

Trap one — harmless drift. Drift detection says "v18 is materially different on 12% of pairs." The team reads the 12%. The differences are stylistic — slightly more casual tone, same task success, same content. Should the team reject the change?

There is no universal answer. Some teams reject any material drift on the grounds that consistency is itself a product quality. Other teams accept stylistic drift as long as task-success metrics hold. The decision is a product decision, not an engineering one. What matters is that the team sees the drift and decides knowingly, rather than discovering it from a user complaint three weeks later.

Trap two — invisible drift. The eval set has 200 examples. Production has a long tail of queries the eval set does not cover. A prompt change can keep the eval set near-identical and shift the long tail dramatically. The team ships. Users on the long tail experience the regression. The team does not see it in pre-merge testing.

The mitigation is to keep the eval set fresh — sample from production, especially from low-confidence or escalated queries, and add to the eval set weekly. The eval set is a living document. A stale eval set is invisible drift waiting to happen.
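
A weekly refresh job along these lines could look like the sketch below. The trace fields `query`, `escalated`, and `confidence` are assumptions about what a production trace record contains, and the 0.5 confidence floor is an arbitrary placeholder.

```python
import random


def refresh_eval_set(eval_set, traces, k, confidence_floor=0.5, seed=0):
    """Add up to k new production queries, keeping only risky, unseen ones."""
    known = {ex["query"].strip().lower() for ex in eval_set}
    seen = set()
    candidates = []
    for trace in traces:
        key = trace["query"].strip().lower()
        # Assumed trace schema: optional 'escalated' flag and 'confidence' score.
        risky = (trace.get("escalated", False)
                 or trace.get("confidence", 1.0) < confidence_floor)
        if risky and key not in known and key not in seen:
            seen.add(key)
            candidates.append({"query": trace["query"], "source": "production"})
    rng = random.Random(seed)
    picked = rng.sample(candidates, min(k, len(candidates)))
    return eval_set + picked
```

Deduplicating against the existing set keeps the weekly additions focused on query patterns the eval set has not seen before.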

8) Worked example — "Be concise" to "Keep responses under 3 sentences"

Re-running the hook scenario through a drift-detection pipeline shows exactly where the gate should have caught it.

Eval set: 200 customer-support queries. Mix of returns, refunds, account questions, product questions, complaints.

Stage 1 — offline rubric eval. v17 average: 4.0/5. v18 average: 4.0/5. Pass.

Stage 2 — shape diff.

                              v17          v18         delta
mean output length            81 tok       24 tok      −70%
p10 length                    28 tok       9 tok       −68%
p90 length                    198 tok      61 tok      −69%
JSON validity                 99.2%        99.4%       +0.2 pp
refusal phrase rate           1.8%         8.4%        +6.6 pp
disclaimer rate               0.5%         0.4%        −0.1 pp
headers in output             14%          2%          −12 pp

Two flags. Length collapsed by ~70% at every percentile. Refusal rate jumped from 1.8% to 8.4%. The shape diff alone would have failed a sensible CI gate.
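
The deltas in the Stage 2 table mix two kinds of arithmetic: the length rows are relative changes, while the rate rows are differences in percentage points. A short check of the numbers, using only values from the table (the function and variable names are illustrative):

```python
def relative_change_pct(old: float, new: float) -> float:
    """Relative change, in percent of the old value."""
    return (new - old) / old * 100


def point_change(old_rate_pct: float, new_rate_pct: float) -> float:
    """Difference between two rates, in percentage points."""
    return new_rate_pct - old_rate_pct


# (v17, v18) token lengths from the Stage 2 table.
lengths = {"mean": (81, 24), "p10": (28, 9), "p90": (198, 61)}
length_deltas = {
    name: round(relative_change_pct(old, new))
    for name, (old, new) in lengths.items()
}

refusal_delta_pp = round(point_change(1.8, 8.4), 1)
refusal_relative_pct = round(relative_change_pct(1.8, 8.4))

print(length_deltas)         # {'mean': -70, 'p10': -68, 'p90': -69}
print(refusal_delta_pp)      # 6.6
print(refusal_relative_pct)  # 367
```

The last line is why the distinction matters: a 6.6-point rise in the refusal rate is a relative increase of roughly 367%, so a threshold must state which of the two it means.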

Stage 3 — embedding drift. Cosine distribution: 6% of pairs below 0.80 similarity. The team sampled 30 of those pairs and found a clear pattern — for complex queries, v18 was answering only the first sub-question and dropping the rest. Length had not "compressed" — it had truncated. The model interpreted "under 3 sentences" as "stop after 3 sentences regardless of completeness."
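
Pulling a review sample from the low-similarity tail, as the team did here, is a few lines of code. A sketch, assuming the similarities were already computed per pair; the function name and defaults are hypothetical, with the 0.80 threshold and sample size of 30 taken from this stage.

```python
import random


def sample_tail_pairs(pairs, similarities, threshold=0.80, k=30, seed=0):
    """Return up to k (similarity, old, new) triples below threshold, lowest first."""
    tail = [
        (sim, old, new)
        for sim, (old, new) in zip(similarities, pairs)
        if sim < threshold
    ]
    rng = random.Random(seed)
    picked = rng.sample(tail, min(k, len(tail)))
    # Most-divergent pairs first, so reviewers read the worst cases early.
    return sorted(picked, key=lambda triple: triple[0])
```

Fixing the seed makes the review sample reproducible, so two reviewers looking at the same drift run read the same pairs.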

Stage 4 — LLM judge drift. 38% of pairs flagged as materially different. The sampled reasons matched the embedding analysis — truncated answers, missed sub-questions, more refusals on ambiguous queries.

Decision. Reject the merge. Re-draft. v19 reads "Aim for under 3 sentences, but answer every part of the question." The drift on v19 is small (8% materially different, length down 30% not 70%, refusal rate unchanged). v19 enters the shadow stage. The "harmless wording fix" is no longer harmless — it is a known behavior change, scoped and shipped knowingly.

The cost of the drift suite — about 6 minutes of compute, $1.20 of LLM judge spend. The cost it avoided — a week of complaint volume on the v18 regression. The math is not subtle.

Mid-content recall

Name the three kinds of drift and the cheapest signal that detects each.
Why does the tail of the cosine-similarity distribution matter more than the mean?
What is invisible drift, and what is the standard mitigation?

9) Failure modes — where drift detection silently fails

FAILURE MODE                              FIX
────────────                              ───
"small text change ⇒ small effect"    →   measure every time; assumption is false
stale eval set (3 months old)         →   weekly production sampling into eval set
shape diff on means only              →   compare full distributions (p10, p50, p90)
embedding model is the candidate's    →   use a different embedding family for fairness
  family (self-bias)
LLM judge in "better/worse" mode for  →   judge in "materially different" mode for drift
  drift detection (not its job)
no human spot-check of the bottom 5%  →   always sample and read the most-divergent pairs
  of similarity
shipping when shape drift is large    →   if shape drift > threshold, the gate fails and
  because rubric still passes             shipping needs an explicit reviewer override
treating drift as automatic           →   drift means "look at this," not "block this"
  regression
no SHA-pair history of drift runs     →   store every drift report keyed on (old SHA,
                                          new SHA) for audit and trend analysis

The deepest mistake on this list is the first one. Engineers who write prompts tend to assume small text changes have small effects, because that is usually how code diffs behave. Prompts are not code. A model's output is not a smooth function of the prompt text: a two-word change can shift its interpretation of the entire task. Treat every prompt edit as potentially behavior-changing until measurement says otherwise.
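
The last row of the table, storing every drift report keyed on the SHA pair, needs very little machinery. A minimal sketch using one JSON file per pair; the directory layout, file naming, and function names are assumptions, and a database keyed on the same pair would serve equally well.

```python
import json
from pathlib import Path


def report_path(root, old_sha: str, new_sha: str) -> Path:
    """One file per (old SHA, new SHA) pair."""
    return Path(root) / f"{old_sha}__{new_sha}.json"


def save_drift_report(root, old_sha: str, new_sha: str, report: dict) -> Path:
    """Write the report as JSON and return the path it was written to."""
    path = report_path(root, old_sha, new_sha)
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = {"old_sha": old_sha, "new_sha": new_sha, "report": report}
    path.write_text(json.dumps(payload, indent=2, sort_keys=True))
    return path


def load_drift_report(root, old_sha: str, new_sha: str) -> dict:
    """Read back the stored payload for a SHA pair."""
    return json.loads(report_path(root, old_sha, new_sha).read_text())
```

Keeping the SHAs inside the payload as well as in the file name means a report stays self-describing if it is copied elsewhere for audit.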

10) Drift detection in CI — what the gate looks like

The standard pattern is a GitHub Actions or CircleCI workflow that fires on every prompt PR. The job:

Resolves the prior SHA (the SHA on main for this prompt name).
Resolves the new SHA (the SHA of the proposed change).
Runs both prompts over a stratified eval set of N=100-500.
Computes shape diff — length percentiles, JSON validity, refusal rate.
Computes embedding drift — cosine distribution.
Optionally runs LLM judge drift on a 30-pair sample.
Posts a report comment on the PR with deltas and tail samples.
Fails the gate if any delta exceeds policy thresholds.
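
The stratified eval set in the third step can be drawn with a proportional sampler. This sketch assumes each example carries a `category` field (for instance returns, refunds, complaints); the function name and the at-least-one-per-category rule are illustrative choices.

```python
import random
from collections import defaultdict


def stratified_sample(examples, n, key="category", seed=0):
    """Sample about n examples, keeping each category's share of the pool."""
    groups = defaultdict(list)
    for ex in examples:
        groups[ex[key]].append(ex)
    rng = random.Random(seed)
    total = len(examples)
    sample = []
    for name in sorted(groups):
        members = groups[name]
        # At least one example per category so small strata are never dropped.
        quota = max(1, round(n * len(members) / total))
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample
```

Because of rounding and the one-per-category floor, the result can differ from `n` by a few examples, which is usually acceptable for a drift run.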

Policy thresholds are team-set. A common starting point: length percentiles must stay within ±25% of prior, JSON validity must not drop, refusal rate must not jump by more than 2 percentage points, embedding-drift tail (below 0.80) must stay under 10%. Anything outside these requires explicit reviewer override on the PR.
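
The starting-point thresholds above translate directly into a gate function. A sketch, assuming each version's summary is a dict with the hypothetical keys shown and that the old percentiles are non-zero; the example values come from the section 4 table, with an assumed embedding tail of 6%.

```python
def drift_gate(old, new, embedding_tail, length_band=0.25,
               refusal_jump_pp=2.0, tail_ceiling=0.10):
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []
    for name in ("p10", "p50", "p90"):
        # Assumes old[name] is non-zero.
        change = (new[name] - old[name]) / old[name]
        if abs(change) > length_band:
            failures.append(f"{name} length moved {change:+.0%}")
    if new["json_valid_pct"] < old["json_valid_pct"]:
        failures.append("JSON validity dropped")
    jump = new["refusal_pct"] - old["refusal_pct"]
    if jump > refusal_jump_pp:
        failures.append(f"refusal rate up {jump:.1f} pp")
    if embedding_tail >= tail_ceiling:
        failures.append(f"{embedding_tail:.0%} of pairs below the similarity floor")
    return failures


v17 = {"p10": 25, "p50": 80, "p90": 200, "json_valid_pct": 99.1, "refusal_pct": 2.0}
v18 = {"p10": 8, "p50": 24, "p90": 60, "json_valid_pct": 99.4, "refusal_pct": 9.0}

for message in drift_gate(v17, v18, embedding_tail=0.06):
    print(message)
```

Returning messages rather than a bare boolean lets the CI job paste the failures into the PR comment, which is where a reviewer override would be recorded.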

This gate is not the eval gate. The eval gate from chapter 08 measures quality. The drift gate measures change. A prompt edit can be quality-neutral and still high-drift — that is the case the drift gate exists for.

Where this lives in the wild

Drift is a relatively new vocabulary in prompt ops, borrowed from ML monitoring. The surfaces that handle it most directly:

Langfuse — version diff view with output-length and output-similarity comparisons across prompt SHAs.
LangSmith — pairwise comparison runs across prompt versions, with similarity and length deltas surfaced.
Braintrust — built-in pairwise diff with shape and semantic comparison; PR comment integration.
Helicone — request distribution dashboards by prompt version; length and refusal-rate trend visualization.
PromptLayer — output similarity heatmaps across prompt template versions.
Vellum — diff view comparing outputs of two prompt versions on a dataset, with semantic scoring.
Pezzo — version comparison with deterministic shape metrics.
Phoenix (Arize) — distribution drift detection imported from ML observability; cosine and KL divergence on output embeddings.
Galileo — production drift dashboards with auto-flagging of distribution shifts.
Datadog LLM Observability — span-level metrics with version tags; alerts on length and validity shifts.
OpenLLMetry — OpenTelemetry spans carry prompt version; downstream metric stores compute deltas.
Promptfoo — local pairwise eval with output diffing for CI gates.
DeepEval — assertions on output shape that fail CI when thresholds breach.
OpenAI Evals — paired eval mode for delta scoring across prompt versions.
TruLens — feedback functions that can be repurposed as drift detectors across versions.
RAGAS — drift metrics for retrieval-augmented prompts where retrieval distribution shifts.
Patronus AI — automated regression detection across prompt versions.
Inspect AI — eval framework from the UK AI Security Institute (formerly the AI Safety Institute), supports paired comparison runs.
GitHub Actions — common surface for the drift CI workflow.
CircleCI — alternative CI surface running the same drift gate.
Statsig Sidecar — A/B platform with drift dashboards for prompt experiments.
Notion AI — internal drift dashboards before prompt rollouts.
GitHub Copilot — reportedly tracks acceptance-rate drift on prompt edits as the primary signal.
Sourcegraph Cody — output-shape gates on prompt edits in their experimentation system.
Linear AI — drift comparison on prompt PRs as a merge requirement.

The common pattern is "before-and-after, on the same input set, with metrics that go beyond the rubric." Teams that adopt it stop being surprised by silent regressions.

Pause and recall

What are the three kinds of drift, and which is the cheapest to measure?
Why is shape diff a stronger signal than rubric eval for "did behavior change"?
What is the difference between an LLM judge in quality mode and in drift mode?
Why must the embedding model used for drift not be from the same family as the candidate LLM?
What is the standard mitigation for invisible drift?
Why do percentile comparisons matter more than mean comparisons?
What goes into a CI drift gate, and what kind of policy thresholds gate it?

Interview Q&A

Q1. How do you know a prompt change actually changed behavior? A. Three measurements over the same eval set, before and after. Shape diff — length percentiles, JSON validity, refusal rate, format. Embedding drift — cosine distribution of paired outputs; the tail below 0.80 matters more than the mean. LLM judge in materially-different mode for the cases shape and embeddings miss. Together they catch most behavior change pre-merge. Trap: "If the eval rubric scores held, behavior didn't change." The rubric measures what the team thought to score. Drift measures what shifted.

Q2. A teammate says "it's a one-line wording fix, no need for eval." How do you respond? A. Show them a known case: "be concise" to "under 3 sentences" cut length by 70% and raised the refusal rate by more than 6 points. The size of a text edit does not predict the size of the behavior change. The minimum gate for any prompt edit is shape diff, which needs no judge and no human and catches most surprises. Skipping it has shipped real regressions at real companies. Trap: Caving on the gate because the change "looks small." Every regression shipped by a prompt edit looked small at review time.

Q3. What is the difference between drift detection and eval? A. Eval measures quality: is v18 good enough on the rubric? Drift measures change: did v18 behave differently from v17? A change can pass eval and have high drift (the "rushed bot" case). A change can fail eval and have low drift (regression on every dimension, but the regression is small). They answer different questions and gate on different thresholds. Trap: Conflating them. A team that only runs eval will miss the rushed-bot case; a team that only runs drift will ship pure regressions when quality slips through changes too small to trip the drift thresholds.

Q4. How do you handle "harmless drift" — the case where v18 is materially different but not worse? A. Decide knowingly. Read the sample of materially-different pairs. If the drift is stylistic and task success holds, the product team chooses whether to accept the shift. Some product surfaces (legal, medical, fiduciary) reject any drift because consistency itself is a product quality. Others (consumer chat, creative tools) accept drift if quality holds. Either policy is defensible — but the choice has to be a choice, not a surprise. Trap: Treating drift as automatic rejection. Drift means "look at this," not "block this."

Q5. How do you detect drift in a long-tail query the eval set does not cover? A. You don't, before merge. You catch it in production via tagged traces (chapter 07) and a complaint feedback loop into the eval set. Mitigation is to keep the eval set fresh — weekly sampling from production, especially from low-confidence outputs and escalated sessions. The eval set is a living document; an eval set that does not grow grows blind. Trap: "We have 500 examples in our eval set, we're covered." 500 examples is a starting point. Production has tens of thousands of distinct query patterns.

Q6. Why does percentile drift matter more than mean drift? A. Means hide tails. A prompt change can leave the median output identical and reshape only the p10 and p90, and the tails are usually where users notice. Length p90 collapsing from 200 to 60 tokens means complex queries are getting truncated; a stable median will not show it at all, and the mean blends it with every other output, so the reported shift is smaller than what the affected users experience. Always report p10 / p50 / p90 distributions, not just means. Trap: Reporting mean deltas and missing the tail collapse.

Q7. How do you guard against the LLM judge being biased toward outputs from its own model family? A. Different judge family: if you ship Claude prompts, judge with GPT-class or Gemini-class models, not Claude. Ensemble: three judges, majority vote. Position swap: randomize old-vs-new order on every comparison. Calibration: score 50 pairs with humans, compute judge agreement, weight subsequent runs by that agreement rate. Trap: Using the same model family as both generator and judge, then trusting the win-rate.

Q8. How does a drift CI gate fit into a prompt PR workflow? A. On PR open, the gate runs both SHAs over the eval set, computes shape diff and embedding drift, optionally runs judge drift, and posts a report comment on the PR. The gate fails if deltas exceed policy thresholds (length percentile ±25%, JSON validity drop, refusal rate +2pp, embedding tail >10%). Failure requires a reviewer override with a written rationale. The drift gate runs in parallel with the eval-quality gate; together they form the merge gate. Trap: Putting drift behind eval. Drift can pass while eval fails, and vice versa. Run both, gate on both.

Apply now (5 min)

Step 1 — model first. Pick a recent prompt change in your system. Run v_prev and v_new over the same 50 examples. Compute length p10/p50/p90 for each. Compute JSON validity rates. Compute refusal-phrase rates (count "I cannot," "I'm sorry," "as an AI"). Print a delta report. Look for any percentile that moved more than 25%.

Step 2 — your turn. Pick one prompt under your control. Write the policy thresholds you would set on a drift gate — length percentile band, validity-rate floor, refusal-rate ceiling, embedding-tail ceiling. Justify each threshold in one sentence. These become your team's gate policy.

Step 3 — sketch from memory. Redraw the three-kinds-of-drift box from section 3. Then redraw the cosine-similarity distribution sketch from section 5 with the "worry" annotation in the tail.

Bridge. Drift tells you behavior changed. The next question is — when a production trace looks bad, which prompt version produced it? The bakery log must answer that question in seconds. Traces tagged with the SHA, the SHA traceable to the recipe, the recipe rolled back if needed. That is the next page.

→ 07-prompt-observability.md