09. Provider drift and deprecation¶
Providers retire models on their own schedules. Behaviour shifts inside a model name. New versions ship in regions you do not run. The gateway is the layer where these external changes are absorbed without breaking callers, on a calendar you can defend.
A platform lead at a Pune insurance company opens the inbox on a Monday morning to a provider email: claude-3-haiku-20240307 will be retired in 60 days. The platform has three production features bound to that model. Without a gateway, each feature would have its own migration project, its own evals to re-baseline, its own deploy schedule. With the gateway, the lead opens the routing policy, pins the affected alias to claude-haiku-4-5 as a canary, schedules an eval run, sets the deprecation calendar, and posts a single rollout plan. Two weeks later the canary is at 100%, the old model is in the fallback chain at weight zero for safety, and the retirement date passes without an incident.
This chapter is the discipline that makes that two-week migration the normal case rather than the heroic one. Pinning. The deprecation calendar. Drift monitors that catch silent behaviour shifts. Dual-running through routing weights. The same engineering that module 19 chapter 08 taught for tool contracts, applied to the model layer.
What "drift" means for model providers¶
Six shapes, each with its own signal.
| Drift | What changes | How it shows up |
|---|---|---|
| Model retirement | A model is end-of-life'd; calls fail after the date | Provider email, calendar, eventually 4xx on calls |
| New model version | A new model is released; the old continues | New ID in the catalogue; opt-in |
| Silent behaviour shift in a named version | Same name; the model's outputs change | Eval scores move; user reports; latency or token-count shifts |
| API shape change | Request or response shape changes | Pact tests fail; transform layer breaks |
| Region availability change | A model becomes available or unavailable in a region | Routing's privacy-zone candidate set changes |
| Rate-limit policy change | The provider's limits shift up or down | 429 rate changes from baseline |
Three of these are announced; three are silent. The announced ones (retirement, new version, region) come through the provider's changelog or email. The silent ones (behaviour shift, API shape, rate-limit change) have to be detected by the gateway.
Pinning concrete model versions¶
The single most important defence against drift is to pin the concrete model version at the gateway level. The alias-to-version mapping is owned by the gateway team; product code never names a model directly.
    aliases:
      smart-reasoner:
        candidates:
          - id: "anthropic:claude-sonnet-4-6:ap-south-1"
            version_pin: "claude-sonnet-4-6"   # explicit, not "latest"
            weight: 100
Three rules:
- Never use a provider's "latest" alias. Many providers offer something like `claude-3-5-sonnet-latest`; using it cedes promotion control to the provider's schedule. The gateway must own that decision.
- The pin includes the provider, the version, and the region. A pin without a region can silently shift if the provider rolls out a new region.
- Pin changes go through the canary process. Promoting a new version is a routing-weight change, gated by evals.
A platform that pins concretely turns drift from "every provider release affects production" to "the team chooses when to evaluate and promote new versions."
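The first two rules can be enforced mechanically when the routing policy is loaded. The following is a minimal sketch, not the gateway's actual loader: it assumes candidate IDs follow the `provider:model:region` shape shown in the example above, and that a floating alias can be recognised by a `latest` token in the model name. `Pin` and `parse_pin` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pin:
    provider: str
    model: str
    region: str


def parse_pin(candidate_id: str) -> Pin:
    """Parse 'provider:model:region' and reject floating or incomplete pins."""
    parts = candidate_id.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"pin must be provider:model:region, got {candidate_id!r}")
    provider, model, region = parts
    # Assumption: floating aliases carry a 'latest' token in the model name.
    if "latest" in model.lower().split("-"):
        raise ValueError(f"floating alias not allowed: {model!r}")
    return Pin(provider, model, region)


pin = parse_pin("anthropic:claude-sonnet-4-6:ap-south-1")
```

Running a check like this in CI on every routing-policy change means a missing region or a "latest" alias fails the build rather than reaching production.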
The deprecation calendar¶
Each provider's retirement schedule is tracked centrally — usually a YAML file or a small service that aggregates each provider's calendar.
    deprecations:
      - provider: anthropic
        model: claude-3-haiku-20240307
        retirement_date: 2026-07-15
        successor: claude-haiku-4-5
        migration_owner: platform-team
        affected_aliases: [fast-summariser, simple-classifier]
        status: in-progress
        notes: |
          Canary at 25%; eval scores within tolerance; promote to 100% by 2026-07-01.
      - provider: openai
        model: gpt-4-turbo-2024-04-09
        retirement_date: 2026-09-01
        successor: gpt-4o
        migration_owner: platform-team
        affected_aliases: [secondary-reasoner]
        status: planned
The calendar feeds two things:
- Dashboards. "How many days until the next retirement that affects us?" should be one query.
- Alerts. A retirement date approaching with status not `complete` pages the team.
A reasonable cadence: review the calendar weekly; schedule migrations at least 30 days before retirement; complete migrations 7 days before retirement, leaving margin for rollback.
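Both consumers of the calendar reduce to date arithmetic over its entries. The sketch below assumes the entries have already been loaded into dicts with `retirement_date` as a `datetime.date`, and it assumes a 30-day paging window borrowed from the cadence above; the source does not fix that threshold. The function names are illustrative.

```python
from datetime import date
from typing import Optional


def days_until(retirement_date: date, today: date) -> int:
    return (retirement_date - today).days


def days_to_next_retirement(entries: list, today: date) -> Optional[int]:
    """Dashboard query: days until the nearest retirement not yet migrated."""
    pending = [
        days_until(e["retirement_date"], today)
        for e in entries
        if e["status"] != "complete"
    ]
    return min(pending) if pending else None


def should_page(entry: dict, today: date, window_days: int = 30) -> bool:
    """Alert rule: an unfinished migration inside the window pages the team."""
    if entry["status"] == "complete":
        return False
    return days_until(entry["retirement_date"], today) <= window_days


entries = [
    {"model": "claude-3-haiku-20240307", "retirement_date": date(2026, 7, 15), "status": "in-progress"},
    {"model": "gpt-4-turbo-2024-04-09", "retirement_date": date(2026, 9, 1), "status": "planned"},
]
today = date(2026, 6, 20)
paged = [e["model"] for e in entries if should_page(e, today)]
```

With the two example entries and a "today" of 2026-06-20, only the Anthropic entry falls inside the window.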
The migration playbook¶
A concrete migration when a provider announces a retirement, distilled to seven steps:
1. Identify affected aliases. Search the routing policy for candidates pinned to the retiring model. Cross-reference the audit log to confirm production usage.
2. Pick the successor. Provider-recommended or platform-chosen, based on capability match, region availability, cost, and eval performance.
3. Add the successor as a canary candidate. Weight 0 initially; verify the gateway can reach it via synthetic traffic.
4. Run the eval suite. Module 04_ai_product_evals and module 13 (prompt lifecycle) own the eval definition. The gateway is the executor: synthetic calls against the new model, scores compared to the current baseline. Any regression that exceeds tolerance blocks the migration.
5. Canary rollout. Adjust the weight: 5% → 25% → 50% → 100%. Each step watches: eval scores in production traffic, error rate, latency, cost per call, customer-impact reports. A regression halts; rollback is dropping the canary weight to 0.
6. Promote and announce. When the successor is at 100% and stable for the agreed window (often 1–2 weeks), the old version drops to weight 0 in the fallback chain (still callable in extremis). Internal communication informs stakeholders.
7. Decommission. After the retirement date passes (or sooner if confidence is high), the old version is removed from the routing policy entirely. The deprecation calendar entry is marked complete.
The playbook is the same shape as module 19 chapter 08's dual-run window: two versions coexist on a schedule, traffic shifts gradually, the old is retired with comms and audit confirmation.
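Steps 5 and 6 behave like a small state machine: advance one weight step while the canary is healthy, drop to 0 on a regression. A minimal sketch follows, assuming a 5% relative tolerance, non-zero baselines, and the metric names shown; all of these are illustrative choices, not values from the playbook.

```python
CANARY_STEPS = [5, 25, 50, 100]

# Metrics where a higher value is better; every other metric is lower-is-better.
HIGHER_IS_BETTER = {"eval_score"}


def is_healthy(baseline: dict, observed: dict, tolerance: float = 0.05) -> bool:
    """True when no metric has regressed by more than the relative tolerance."""
    for name, base in baseline.items():
        change = (observed[name] - base) / base  # assumes base != 0
        regression = -change if name in HIGHER_IS_BETTER else change
        if regression > tolerance:
            return False
    return True


def next_canary_weight(current: int, healthy: bool) -> int:
    """Advance to the next step, or roll back to 0 on a regression."""
    if not healthy:
        return 0
    for step in CANARY_STEPS:
        if step > current:
            return step
    return current  # already at 100


baseline = {"eval_score": 0.90, "error_rate": 0.010, "p95_latency_ms": 1200.0}
observed = {"eval_score": 0.89, "error_rate": 0.010, "p95_latency_ms": 1230.0}
weight = next_canary_weight(25, is_healthy(baseline, observed))
```

Keeping rollback as "weight goes to 0" rather than "candidate is removed" means the successor stays reachable for synthetic traffic while the regression is investigated.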
Catching silent behaviour shifts¶
The hard case. A provider ships an update to a model under the same name. The model's outputs change — sometimes subtly, sometimes materially. There is no announcement.
The signals:
Eval score drift. Module 04_ai_product_evals runs an eval suite on a schedule (daily, weekly). A drop in scores against a stable golden set is the leading indicator. The drop may correlate with a provider deploy window — checking the provider's status page or changelog is a confirmation step.
Token-count distribution. The average output token count for a fixed prompt template shifts. The model is either becoming chattier or more terse; either way, cost shifts and downstream consumers (token-budget callers) may see surprises.
Latency baseline. p50 and p95 latency for a fixed prompt template shifts. The provider may have re-routed to a different inference cluster.
Refusal rate. The rate at which the model refuses calls (returns "I cannot help with that" or similar) shifts. The provider has changed safety policy or instruction-following.
Tool-call accuracy. For tool-using calls, the rate at which the model produces invalid tool calls shifts. Schema-adherence has changed.
All five are monitored per pinned model. A shift beyond baseline tolerance fires an alarm; the team investigates whether the shift is from the model or from input distribution.
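One simple way to express "a shift beyond baseline tolerance" is a z-test of a recent window's mean against the baseline samples for a fixed prompt template. This is a sketch of one possible detector, not the monitor the platform necessarily uses; the threshold of 3 standard errors is an assumption, and the same function could be fed output token counts, latencies, refusal rates, or eval scores.

```python
from statistics import mean, stdev


def drifted(baseline: list, recent: list, z_threshold: float = 3.0) -> bool:
    """Flag drift when the recent mean sits far from the baseline mean.

    Distance is measured in standard errors of the recent window's mean,
    using the baseline's sample standard deviation.
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return mean(recent) != mu
    standard_error = sigma / len(recent) ** 0.5
    return abs(mean(recent) - mu) / standard_error > z_threshold


# Output token counts for one fixed prompt template (made-up numbers).
baseline_tokens = [100, 102, 98, 101, 99, 100, 103, 97]
steady = drifted(baseline_tokens, [101, 99, 100, 102])
chattier = drifted(baseline_tokens, [120, 118, 125, 121])
```

An alarm from a detector like this only says that something moved; separating a model shift from an input-distribution shift still needs the fixed prompt template or golden set to hold the inputs constant.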
API shape changes¶
Even pinned model versions sometimes receive API-shape changes (a new field in the response, a renamed parameter). The gateway's transform layer (chapter 02) is the boundary that absorbs them.
Defences:
- Pact tests (module 19 chapter 10 templates the pattern): exercise the gateway's transform against each provider on a schedule. A change in response shape fails the pact and pages.
- `UPSTREAM_UNCLASSIFIED` rate: calls that fail the transform map to this error code; its rate is monitored.
- Provider changelog watch: a daily job that diffs the provider's published API spec, if available, against the version the gateway integrates.
The fix is to extend the transform; in the worst case, treat the API change as a major contract bump (module 19 chapter 08) and dual-run the old and new transforms.
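The boundary behaviour can be illustrated with a transform that tolerates additive changes but maps a missing or retyped field to `UPSTREAM_UNCLASSIFIED`. This is a sketch under assumptions: the field names in `REQUIRED_FIELDS` are a hypothetical internal schema, not any provider's real response shape.

```python
class UpstreamUnclassified(Exception):
    """Raised when an upstream response does not match the expected shape."""
    code = "UPSTREAM_UNCLASSIFIED"


# Hypothetical fields the gateway's internal response depends on.
REQUIRED_FIELDS = {"id": str, "model": str, "output_text": str}


def transform(raw: dict) -> dict:
    """Project a provider response onto the gateway's internal shape."""
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(raw.get(field), expected_type):
            raise UpstreamUnclassified(f"unexpected shape at field {field!r}")
    # Extra upstream fields are dropped, so additive changes do not reach callers.
    return {field: raw[field] for field in REQUIRED_FIELDS}


ok = transform({"id": "r1", "model": "claude-sonnet-4-6", "output_text": "hi", "new_field": 1})
```

A pact test in this style would replay recorded provider responses through `transform` on a schedule and fail when the exception is raised.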
Rate-limit policy changes¶
Providers sometimes change their rate limits — tightening for a tier, raising for a new tier, or changing the burst behaviour. The signal is the per-provider 429 rate departing from baseline.
The gateway's quota plane (chapter 05) caps internally below the provider's limit; a tightening on the provider's side is detected when the gateway sees 429s despite being under its internal cap. The response is to tighten the gateway's cap, investigate the change, and update the calendar.
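The detection rule (429s while under the internal cap) and the first response (tighten the cap) can be sketched as one function. The baseline 429 rate and the 80% headroom are assumed values for illustration; the chapter does not prescribe them, and `tightened_cap` is a hypothetical name.

```python
def tightened_cap(
    internal_cap_rpm: int,
    observed_rpm: int,
    rate_429: float,
    baseline_429: float = 0.001,
    headroom_pct: int = 80,
) -> int:
    """Return a new internal cap if the provider appears to have tightened.

    Seeing 429s above baseline while traffic is under the internal cap
    suggests the provider's limit now sits below the observed request rate.
    """
    under_cap = observed_rpm <= internal_cap_rpm
    if under_cap and rate_429 > baseline_429:
        return max(1, observed_rpm * headroom_pct // 100)
    return internal_cap_rpm


new_cap = tightened_cap(internal_cap_rpm=1000, observed_rpm=800, rate_429=0.05)
```

The returned value is only a holding position; the on-call still confirms the provider's new limit and sets the cap deliberately.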
What pinning cannot solve¶
- A provider going hard down. Pinning doesn't help; fallback chains do.
- A model becoming "smarter" in ways that change downstream behaviour. The model is now better at tool selection, say, and the agent's prompt was tuned to compensate for the old model's quirks. Pinning only defers the version change to a moment you choose; promotion still requires re-tuning.
- A region disappearing. If the pinned version is region-pinned and the region is removed, the candidate is gone and routing must find an alternate.
Pinning is a stability mechanism, not a guarantee.
How drift and deprecation interact with the other surfaces¶
- Routing (chapter 03) — pinning is a routing-policy field; promotion is a weight change.
- Fallback (chapter 04) — during dual-run, the previous version is in the fallback chain.
- Cache (chapter 08) — promotion to a new model invalidates the cache (the model version is in the key).
- Cost (chapter 07) — new model versions have new prices; the price book is updated before the canary.
- Audit (chapter 11) — every call records the concrete model version, so drift investigations can scope by model.
How to recognise broken drift handling in the wild¶
- Product code names a concrete model (a `"claude-..."` string in a non-gateway file)
- The provider's "latest" alias is used in production
- There is no deprecation calendar; retirements surprise the team
- Eval scores are not monitored against the production model on a schedule
- A behaviour shift gets reported by users before metrics catch it
- The migration to a new model takes a quarter and is a project rather than a routine
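The first two symptoms can be caught by a lint over non-gateway source files. The sketch below is illustrative only: the `claude-` and `gpt-` prefixes are assumed examples of model-name prefixes, and a real check would draw its prefix list from the gateway's model catalogue.

```python
import re

# Assumed prefixes for illustration; extend from the gateway's catalogue.
MODEL_LITERAL = re.compile(r"""["']((?:claude|gpt)-[A-Za-z0-9.\-]+)["']""")


def find_model_literals(source: str) -> list:
    """Return quoted strings in source text that look like concrete model names."""
    return MODEL_LITERAL.findall(source)


def is_floating(model_name: str) -> bool:
    """True when the name carries a 'latest' token."""
    return "latest" in model_name.lower().split("-")


sample = (
    'client.call(model="claude-haiku-4-5")\n'
    'name = "gpt-4o"\n'
    'alias = "fast-summariser"\n'
)
hits = find_model_literals(sample)
```

Alias names such as `fast-summariser` pass untouched, which is the point: product code should only ever mention aliases.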
Interview Q&A¶
Q1. Why never use a provider's "latest" alias? Because "latest" cedes promotion control to the provider's schedule. The gateway needs to evaluate, canary, and promote on its own schedule with its own gates. Using "latest" turns every provider release into an unplanned production rollout. The discipline is to pin concretely and own promotion. Wrong-answer notes: "for convenience" misses the cost; "latest" is exactly the production hazard.
Q2. Walk through a migration when a provider announces a 60-day retirement of a model behind your alias. Identify affected aliases from the routing policy and audit. Pick the successor and add it as weight 0 candidate. Run evals; gate promotion on results. Canary at 5%, 25%, 50%, 100% with monitoring at each step (evals in prod traffic, error rate, latency, cost). At 100%, drop the old version to weight 0 in the fallback chain. After a stability window (1–2 weeks), remove the old version. Complete the deprecation calendar entry. Migration completes well before the retirement date with a rollback path at each step. Wrong-answer notes: ad-hoc steps without canary or eval gates produce regressions.
Q3. The eval suite shows scores dropping on a pinned model. How do you investigate? Confirm in the audit log that the concrete model version has not changed (the pin is honoured). Check input distribution: is the eval set still representative of production? Check the provider's status page and changelog for a recent deploy or notice. Run the same eval against the previous model version, if still available: if it still scores at its old baseline while the pinned model has dropped, the harness is sound and the provider has shifted behaviour within the name. Coordinate with the provider's support; consider reverting via the fallback chain if the shift is material. The investigation is the same as drift detection in module 19 chapter 09, applied to model behaviour. Wrong-answer notes: "we'd retrain the prompt" jumps to a fix without diagnosis.
Q4. The provider tightens a rate limit; you start seeing 429s. What does the gateway do, and what should you do next? The fallback executor advances on 429 (chapter 04); calls continue degraded. The on-call investigates: tighten the gateway's internal cap below the new effective limit; check the provider's notice or changelog; raise with the provider's support if no notice exists. The 429 rate itself is the alarm. After the new limit is internalised, the gateway is back to never seeing 429s in normal operation. Wrong-answer notes: "raise our cap" is the wrong direction — the gateway's cap is below the provider's; the provider lowered theirs.
What to do differently after reading this¶
- Pin every alias to concrete model versions; never use provider "latest" aliases.
- Stand up the deprecation calendar. Review weekly; alert on approaching dates.
- Build the canary rollout flow with eval gates; rehearse it on the next migration.
- Wire eval-score, token-count, latency, refusal-rate, and tool-call-accuracy drift monitors per pinned model.
- Pact-test the transform layer against each provider on a schedule.
- Document the migration playbook so any platform engineer can run it.
Bridge. Drift is one external change. Privacy and regional residency are another — and they have legal teeth. The next chapter builds multi-region routing and privacy zones: data residency, regional failover, on-prem fallbacks, and the audit posture that proves the platform honoured its claims. → 10-multi-region-and-privacy.md