08. Provider and cost runbooks¶

~9 min read. Quality runbooks deal with the model's outputs. Provider and cost runbooks deal with the model provider's behaviour and the bill. Both have distinct shapes; both need their own scoped cards.

Continues from 07-degraded-quality-runbooks.md. This chapter develops the runbook family for provider drift, provider outage, cost spike, and quota exhaustion. Recurring concepts in bold: fallback chain, model version pin, tenant quota tightening, kill path, provider escalation template.

The third and fourth runbook families: provider behaviour change and cost runaway. Both intersect with the model gateway (01_model_gateway_provider_ops); the runbook plane operationalises what the gateway makes possible.

The provider drift runbook¶

The gateway's drift detector fires. The runbook:

1. Acknowledge page (5 min). Open incident channel; post payload.
2. Identify the drift shape:
   - Refusal rate change: provider tightened policy.
   - Response shape change: provider changed default formatting / tool-call shape.
   - Latency change: provider performance shift.
   - Error class change: provider returning new error types.
3. Confirm scope: which model, which workload class, what traffic share.
4. Decide mitigation:
   - If a prior model version is still served, pin to it via gateway alias update.
   - If a fallback provider exists, route the affected workload to it.
   - If neither, enable the gateway's cache fall-through for repeat queries
     while the team investigates.
5. Execute the chosen mitigation:
       gateway-cli alias set <alias> --model <prior-version>
   or  gateway-cli route <workload> --provider <fallback>
6. Verify the drift signal recovers post-mitigation.
7. Open provider escalation using the template (see chapter 06):
   - Affected model, magnitude, traffic share, business impact.
   - Send to the provider's enterprise channel.
8. Postmortem actions: model version pinning policy, fallback validation,
   eval-set update if the drift exposed a coverage gap.

The runbook contains within minutes; the provider conversation runs on its own timeline. The apparatus's value is independence from the provider's response time.

The provider outage runbook¶

The provider's API is failing — 5xx errors, complete unavailability, partial regional failures. The runbook:

1. Acknowledge page; check the provider's status page (one click from runbook).
2. Confirm at the gateway: which provider, which region, error rate climb.
3. Activate the fallback chain (chapter 04 of the gateway module):
       gateway-cli fallback engage --primary <provider> --reason outage
   This routes the affected workload to the next provider in the chain.
4. If no fallback is configured for this workload, the workload degrades
   to refusal-at-gateway. Notify product surface owners; consider feature
   flag if user impact is severe.
5. Coordinate with provider via enterprise channel; share business impact
   and request ETA.
6. While in fallback: monitor secondary provider quota; tighten if approaching.
7. When primary recovers: route back gradually (canary 5% → 25% → 100%)
   to confirm stability.
8. Postmortem: fallback validation, provider SLA review if applicable.

The outage runbook depends on the fallback chain being pre-configured and validated. If the team configured fallbacks but never drilled them, the first real fallback engagement may surface unconfigured workloads.

The cost spike runbook¶

The cost anomaly alert fires. The runbook:

1. Acknowledge page (5 min for P1, 15 min for P2).
2. Identify the tenant or feature with the spike:
       gateway-cli cost --by tenant --since "1h ago" --sort desc
   The top entry is usually the cause.
3. Confirm the spike is anomalous:
   - Per-call cost: are individual calls more expensive than usual?
     Possible cause: long context, output, expensive model.
   - Call volume: is the count above baseline?
     Possible cause: runaway loop, batch job misconfiguration.
4. Mitigation per cause:
   - Per-call cost anomaly:
        gateway-cli budget tighten --tenant <id> --rate 0.5x
   - Volume anomaly (most common):
        feature-flag set <feature>.<tenant> enabled=false
     or gateway-cli quota tighten --tenant <id> --rate 0.1x
5. Verify the cost rate returns to baseline within 5-10 minutes.
6. Identify the root cause: open the agent's recent activity, look for
   loops, retry storms, batch misconfigurations.
7. Coordinate with the tenant if user-facing impact: the feature flag
   off is the apparatus's call; reactivating is a product decision.
8. Postmortem: cost anomaly threshold review, agent retry policy review,
   tenant quota policy review.

The cost spike runbook differs from quality and provider runbooks in one way: the user impact is often invisible (the calls succeed; the bill climbs). The apparatus must accept that a feature flag off is acceptable for cost containment even when no user complaint has arrived.

The quota exhaustion runbook¶

The gateway's per-tenant or per-provider quota is exhausted. The runbook:

1. Acknowledge page. Confirm which quota is exhausted:
       gateway-cli quota status --all
2. Identify the consumer:
   - Per-tenant quota: which tenant; legitimate burst or misconfiguration?
   - Per-provider quota: which provider; expected or unexpected?
3. Decide mitigation:
   - Legitimate burst (e.g., a promotion driving traffic): increase quota
     temporarily; alert capacity planning.
   - Misconfiguration: throttle the offending caller; route to fallback if
     provider-level.
4. Execute:
       gateway-cli quota raise --tenant <id> --until <time>
   or  gateway-cli throttle --tenant <id> --rate <slower>
5. Verify the alert clears.
6. Postmortem: quota planning review, anomaly detection on call-rate climbs
   before they exhaust quota.

Worked example — the runaway agent¶

The Pune SaaS team's cost anomaly alert fires at 22:14 on a Friday. The on-call follows the cost spike runbook:

22:14 — page; channel opened.
22:15 — cost dashboard shows tenant acme-corp at 8× baseline; the spike started 90 minutes ago.
22:18 — investigation: call volume is the spike (not per-call cost). Looking at the agent's recent activity, retries are looping.
22:20 — feature flag flipped: agent.acme-corp.enabled=false.
22:24 — cost rate drops to near zero for the tenant; alert clears.
22:30 — investigation continues: the tenant updated their integration; a tool call now fails consistently; the agent retries indefinitely.
22:45 — coordinated with the tenant's contact; integration fixed by 23:30.
Saturday morning — postmortem ticket opened: agent retry policy missing a max-retries cap; cost anomaly threshold was at 3σ (caught at the right time); follow-up: retry policy fix, cost-anomaly threshold validation in drills.

Time to contain: 10 minutes. Estimated cost saved by containment: $3,200 (extrapolating the spike rate over the night).

Operational signals¶

Healthy. Provider and cost runbooks are exercised in drills quarterly. Fallback chains are validated at least every 6 months. Mean time to contain for cost spikes is under 15 minutes. Provider escalation templates are pre-written and tested.

First degrading metric. Fallback chain configuration drift — workloads added without fallbacks; the next outage tests this.

Misleading metric. Number of provider incidents. A platform with sensitive drift detection catches more than one with blind detection; high count is health, not pathology. The metric to watch is mean time to mitigation.

Expert graph. Per-runbook drill participation, mean time to contain per family, fallback chain coverage matrix (workload × provider), cost-anomaly threshold precision.

Boundary of applicability¶

Strong fit. Multi-provider, multi-tenant platforms with a model gateway. The full runbook family is justified.

Pathology. A single-provider platform has no fallback chain; the provider outage runbook degenerates to "wait." The pattern is to plan for the option even if the current implementation is single-provider; the apparatus design surfaces the single-provider risk explicitly.

Scale limit. Very large platforms have many providers, regions, and workload classes; the fallback chain becomes a matrix. The runbooks reference the matrix rather than enumerate cases.

Failure-prone assumption¶

The seductive wrong belief: cost is a finance problem, not an apparatus problem. It is both. Cost runaway has user-impact through unintended throttling, billing surprises, and budget breaches. The apparatus must contain cost in minutes, not days; finance reconciles later. The correct belief: cost is an operational signal with its own paging conditions and runbooks.

Where this appears in production¶

A fintech has a fallback chain validated quarterly; a real provider outage routes seamlessly to the secondary.
A telecom AI caught a refusal-posture change overnight; pinned the prior model version within 18 minutes.
A consumer chatbot has no cost anomaly runbook; a runaway loop overnight cost $14,000.
A healthtech AI has provider escalation templates pre-written; provider response time on the first real incident was 90 minutes.
A coding assistant has drilled fallback engagement; the team is confident in the runbook.
A retail AI has a cost-spike runbook with feature-flag kill paths per tenant; mean time to contain is 8 minutes.
A logistics AI has a single provider; the apparatus design surfaced this gap; secondary provider added.
A government AI has quota planning as a quarterly review; quota exhaustion alerts have never paged.
A B2B SaaS has cost-anomaly thresholds tuned per tenant tier; enterprise tenants get 2σ, smaller get 3σ.
A travel platform lost 4 hours during a provider outage because the fallback was configured but never validated.
A payments AI has cost-spike runbooks tested monthly; the apparatus's signal-to-action is well-rehearsed.
A legal AI had a provider tone change degrade brand voice; ran the drift runbook; pinned prior version; eval-set expanded.
A staffing AI has the cost-spike runbook coordinated with finance on-call; large kills are confirmed before execution.
A search-rerank service has the fallback chain routed through workload class; embedding workloads have different fallbacks than chat workloads.
A document AI has the cost-spike runbook integrated with the tenant CSM channel; tenant is notified when their feature is throttled.
A media AI validated their fallback chain after the apparatus design review; one workload had no configured fallback.
An ad-tech AI has cost containment SLOs in their incident postmortems; trends are reviewed quarterly.
A real-estate AI had a quota exhaustion incident traced to a misconfigured batch job; postmortem produced quota planning policy.
A medical AI has the cost-spike runbook tested in drills with real spike patterns from past incidents.
A small SaaS has only the provider outage runbook; cost runaway is "we'll see the bill"; first cost incident cost $20,000.

Recall / checkpoint¶

Name the four runbooks in this chapter's family.
What is the provider drift runbook's first decision point?
Why is the fallback chain a load-bearing dependency for the provider outage runbook?
Distinguish per-call cost anomaly from volume anomaly; what mitigation differs?
How does the cost spike runbook differ from quality runbooks in terms of user impact?
What is the postmortem's typical follow-up after a quota exhaustion?
Why is "cost is a finance problem" a failure-prone assumption?

Interview Q&A¶

Q1. The provider's API starts returning a different tool-call shape. The gateway's drift detector fires. Walk through the runbook. Identify the drift shape (response shape change). Confirm scope at the gateway (which model, what traffic share). Mitigation: pin the prior model version via gateway alias update — the prior version's behaviour was understood and tested. Verify drift signal recovers. Open provider escalation: send the deviation details, traffic share, business impact. While the provider responds asynchronously, the apparatus is contained. Postmortem: model version pinning policy, eval-set check for the new shape. Common wrong answer to avoid: "wait for the provider to revert" — the apparatus's value is independence from the provider's timeline.

Q2. The cost anomaly alert fires; the on-call is unsure whether to feature-flag-off a tenant. Walk through the framing. The apparatus's discipline is to act on cost anomalies as if they were quality anomalies — contain first, investigate after. A feature-flag off is reversible in minutes; an uncontained spike is not reversible in dollars. The on-call's authority covers feature-flag off; the runbook says so explicitly. The tenant is notified afterwards; if the spike was legitimate, the flag is reversed quickly. The asymmetry favours containment. Common wrong answer to avoid: "wait until the tenant complains" — the cost cost continues to climb; the apparatus has failed its job.

Q3. The team has a fallback chain configured but has never validated it. What is the risk? The configuration may be wrong. The fallback provider may have its own quota that is exhausted in the same incident. The model alias on the fallback may not match. The workload class may not have a configured fallback. The first real outage is the first validation; the discovery cost is high (extended incident time, possibly the apparatus's failure). The remediation is to drill the fallback engagement quarterly; the cost is bounded; the value is high. Common wrong answer to avoid: "we'll validate when we need it" — needing it is the worst time to find out it does not work.

Q4. How does the cost spike runbook intersect with the model gateway's quota plane? The cost-spike runbook uses the gateway's quota and throttling commands as kill paths. The gateway implements per-tenant quotas; the runbook tightens them when an anomaly fires. The cost telemetry that triggers the runbook also comes from the gateway. The two modules — gateway and runbook — share the data plane and the control plane; the apparatus is the operational layer that uses both. Common wrong answer to avoid: "they're separate concerns" — they're tightly coupled; the runbook depends on the gateway's surface.

Q5. The provider's status page shows green; the gateway's drift detector still fires. Who is right? Both can be right. Status pages report outages, not behaviour changes. The provider may have rolled out a behaviour change without marking it as an incident (because, from their perspective, the API is up and serving). The gateway's drift detector watches behaviour; it sees changes the status page does not report. The runbook trusts the drift detector and acts on it; provider escalation includes the detector's evidence. Common wrong answer to avoid: "trust the status page" — the status page covers a narrower failure surface than the apparatus needs.

Q6. What is the value of pre-written provider escalation templates? Mean time to provider response. A solid escalation message — with affected model, magnitude, traffic share, business impact — gets faster, more accurate responses than a vague "the model is broken." The template makes this consistent regardless of which on-call sends it. Under stress, on-calls do not write coherent provider tickets; the template carries the load. Common wrong answer to avoid: "the on-call will figure out what to say" — under stress, structured templates beat improvised messages every time.

Design / debug exercise (10 minutes)¶

Modelled example. Walk through the worked example (the runaway agent). Verify the cost spike runbook steps, the mitigation timing, and the postmortem follow-ups.

Your turn. Take your team's provider configuration. For each model alias, fill in: fallback provider, fallback model, quota policy, escalation contact. Identify any cell that is not configured; that is your next provider apparatus work.

Reproduce from memory. Write the cost spike runbook's 8 steps from memory. The signal of internalisation is the steps land in under three minutes with the right kill paths and the right authority boundaries.

Operational memory¶

This chapter explained the provider drift, provider outage, cost spike, and quota exhaustion runbooks. The important idea is that provider and cost incidents have user impact even when they look like infrastructure problems — drift degrades user output, outages take features down, cost runaway forces unplanned throttling.

You learned to identify the shape (drift vs. outage vs. cost vs. quota), execute the matching mitigation with explicit kill paths, and coordinate with the provider through pre-written templates. That solves the opening failure because the gateway makes mitigation possible; the runbook makes mitigation routine.

Carry this diagnostic forward: when a cost dashboard tile catches your eye, ask whether it would have produced a page at 22:00 on a Friday. The answer determines whether the apparatus is real.

Remember:

Four runbooks: drift, outage, cost spike, quota exhaustion.
Containment is independent of the provider's timeline.
Cost is an operational signal with paging conditions, not a finance-only concern.
Fallback chains must be drilled, not just configured.
Provider escalation templates carry the on-call's load under stress.

Bridge. The runbooks fire and resolve incidents. The postmortem plane is where the apparatus learns. The next chapter is the discipline of postmortem capture — what to record, what to follow up on, and how to make incidents change the system. → 09-postmortem-capture.md