00. Model gateway and provider operations — First-principles overview
Module 19 taught you to operate the contracts between an agent and every system that is not a model provider. This module is the matching discipline for the providers themselves — Anthropic, OpenAI, Bedrock, Vertex, Azure OpenAI, and the local models you may run alongside them.
A platform engineer at a Pune insurance company gets paged at 19:40 IST on a Wednesday because the underwriting agent has stopped responding. The first guess is the agent platform. Logs say no: the agent platform is healthy. Logs also say it has been sending requests to Anthropic for four minutes and receiving nothing but 529 Overloaded. The team's wiring uses anthropic.Client() directly inside the agent runtime. There is no retry queue, no fallback provider, no cache, no per-tenant rate limit. When the provider throttles, every tenant of the platform fails at once. Twenty minutes later Anthropic recovers. A second agent platform, owned by a different team in the same company, turns out to have the same outage on record: same root cause, different timestamp. By 22:00 leadership asks why the same incident happened twice in a single quarter. The honest answer is that nothing sat between the agent and the provider, and nothing in the platform's design forced that boundary to exist.
That boundary is the model gateway. It is the single layer through which all model calls flow, so that routing, fallback, quota, cache, observability, and cost can be enforced once instead of being re-implemented per agent. Every chapter of this module is one surface of that gateway. The opening incident is what happens when the boundary is missing; the rest of the module is what is on either side of it once the boundary exists.
What a model gateway is, in one sentence
A model gateway is the production boundary between every caller in the company and every model provider, owned as a platform service, designed so that provider failure, drift, cost, and capacity are absorbed once at the boundary rather than being re-fought in every product.
Read the sentence left to right.
- Production boundary — not a library inside each product, not a wrapper around each SDK. A network service or sidecar that calls go through.
- Every caller, every provider — the gateway is the single bottleneck the company tolerates between its agents and the outside world of models.
- Owned as a platform service — a team owns it, with its own on-call, SLOs, runbook, and roadmap.
- Absorbed once at the boundary — provider rate limits become gateway-level quotas; provider outages become gateway-managed fallbacks; provider drift becomes gateway monitors. Products do not re-implement these.
If a company has more than one product calling models, and does not have a gateway, the question is not whether the chapter-opening incident will happen but how many duplicate copies of it the company is preparing to ship.
The six gateway surfaces
This module divides every production gateway into six surfaces. Memorise them once. The rest of the module is consequences.
| Surface | One-liner | Pressure it answers |
|---|---|---|
| The routing plane | Pick the provider/model for this call based on workload, latency budget, cost ceiling, privacy zone | mismatch: not every call wants the same model |
| The fallback chain | Define what happens when the primary fails — degraded model, alternate provider, cached result, refusal | availability: providers fail in ways your callers cannot absorb |
| The quota plane | Token-bucket per-tenant, per-feature, per-model, against the provider's own limits | fairness: shared providers create shared throttling |
| The credential plane | Per-provider keys, per-region keys, rotation, scope to the gateway only | security: a leaked key is a company-wide bill and risk |
| The observability plane | Per-call audit, per-provider latency and error rates, per-tenant cost attribution | accountability: who used what, how well it worked, what it cost |
| The transform plane | Translate between unified schemas and each provider's native shapes; pin model versions | drift: providers change their APIs and their models on their own schedule |
Caching (exact, semantic) lives inside the transform plane in this module. Some platforms call it a seventh surface; we treat it as a transform-side concern because the cache shapes responses, not routing decisions.
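To make the routing and fallback rows concrete, here is a minimal sketch of a fallback chain that ends in an explicit refusal. The provider and model names, the `CHAINS` table, the `ProviderError` type, and the `complete` function are all hypothetical, and the cache is a plain dictionary standing in for whatever store a real gateway would use.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


class ProviderError(Exception):
    """Stand-in for a provider failure such as throttling or overload."""


@dataclass(frozen=True)
class Target:
    provider: str
    model: str


@dataclass
class Result:
    text: Optional[str]
    served_by: str  # "provider/model", "cache", or "refusal"


# Hypothetical chains keyed by (workload, privacy zone), primary first.
CHAINS: Dict[Tuple[str, str], List[Target]] = {
    ("interactive", "any-region"): [
        Target("provider-a", "large"),   # primary
        Target("provider-b", "large"),   # alternate provider
        Target("provider-a", "small"),   # degraded model
    ],
    ("interactive", "in-region-only"): [Target("local", "small")],
}

# The caller supplies the function that actually talks to a provider.
CallFn = Callable[[Target, str], str]


def complete(workload: str, zone: str, prompt: str,
             call: CallFn, cache: Dict[str, str]) -> Result:
    """Try each target in order, then the cache, then refuse explicitly."""
    for target in CHAINS.get((workload, zone), []):
        try:
            text = call(target, prompt)
        except ProviderError:
            continue  # move on to the next target in the chain
        cache[prompt] = text  # remember the answer for later outages
        return Result(text, f"{target.provider}/{target.model}")
    if prompt in cache:
        return Result(cache[prompt], "cache")
    return Result(None, "refusal")
```

The `served_by` field is one way to tell the caller which step of the chain answered, so a degraded or cached response is never mistaken for a primary one.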
The recurring vocabulary
These terms appear in every chapter.
| Name | Surface | What it is |
|---|---|---|
| the route key | Routing | the tuple (workload_class, latency_budget, privacy_zone, cost_ceiling) that selects a provider/model |
| the fallback chain | Fallback | the ordered list of providers/models tried on failure, with refusal as the terminal step |
| the workload class | Routing | a named tier — interactive, batch, background, embeddings, tool_call — with its own SLOs and routing |
| the privacy zone | Routing | a tenant- or feature-level constraint on which providers/regions are eligible |
| the per-tenant bucket | Quota | the rate limit a tenant gets, independent of the provider's own quota |
| the gateway key | Credential | the provider credential the gateway holds; callers never see it |
| the unified request | Transform | the gateway's internal request shape, decoupled from any provider |
| the model alias | Transform | a stable name (fast-summariser) the caller uses, mapped to a concrete provider/model version |
| the price book | Observability | the per-token, per-image, per-call price table used for cost attribution |
| the audit record | Observability | the per-call record: who, on whose behalf, which model, what cost, what outcome |
| the deprecation calendar | Transform | the schedule of provider model retirements the gateway tracks |
| the canary slice | Routing | the fraction of traffic a new model version sees before being promoted |
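Several of these terms fit together in a few lines. The sketch below assumes a hypothetical alias table, placeholder version strings, and made-up prices. It shows one way a model alias could resolve to a pinned version with a deterministic canary slice, and how the price book could feed the cost field of an audit record.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class AliasEntry:
    stable: str             # pinned provider/model version
    canary: str             # candidate version under evaluation
    canary_fraction: float  # share of traffic for the candidate, 0.0 to 1.0


# Hypothetical alias table; the version strings are placeholders.
ALIASES = {
    "fast-summariser": AliasEntry("provider-a/small-v1",
                                  "provider-a/small-v2", 0.05),
}

# Hypothetical price book in USD per million tokens: (input, output).
PRICE_BOOK = {
    "provider-a/small-v1": (1.00, 4.00),
    "provider-a/small-v2": (0.80, 3.00),
}


def resolve(alias: str, request_id: str) -> str:
    """Map an alias to a concrete model; the same request id always
    lands on the same side of the canary split."""
    entry = ALIASES[alias]
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return entry.canary if bucket < entry.canary_fraction else entry.stable


def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Price one call from the price book."""
    in_price, out_price = PRICE_BOOK[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


@dataclass(frozen=True)
class AuditRecord:
    caller: str    # who made the call
    tenant: str    # on whose behalf
    alias: str     # the name the caller used
    model: str     # the concrete version that served it
    cost_usd: float
    outcome: str


def record_call(caller: str, tenant: str, alias: str, request_id: str,
                input_tokens: int, output_tokens: int,
                outcome: str) -> AuditRecord:
    """Resolve the alias, price the call, and build the audit record."""
    model = resolve(alias, request_id)
    cost = cost_usd(model, input_tokens, output_tokens)
    return AuditRecord(caller, tenant, alias, model, cost, outcome)
```

Hashing the request id, rather than drawing a random number, keeps retries of the same request on the same model version, which makes canary comparisons easier to reason about.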
The journey: own the boundary, then operate it
This module has two acts, followed by a synthesis.
Act 1 — Build the gateway (files 01–07). The case for the gateway, its anatomy, routing, fallback, quotas, credentials, cost attribution. By file 07 the gateway exists as a defensible production service.
Act 2 — Operate the gateway (files 08–11). Caching, provider drift, multi-region/privacy, per-provider observability. The gateway does not become more powerful; it becomes resilient to time and scale.
Synthesis (files 12–13). Architect checklist and honest admission.
Memory map
| # | File | Surface | Pressure answered | What it adds |
|---|---|---|---|---|
| 01 | why-direct-provider-calls-break | — | the cost of having no gateway | the case that forces the boundary to exist |
| 02 | gateway-anatomy | All | what the gateway actually does | the six surfaces as a service architecture |
| 03 | routing-policies | Routing | one model is not enough | workload, latency, cost, privacy routing |
| 04 | fallback-chains | Fallback | providers fail in production | degraded models, alternate providers, refusal |
| 05 | rate-limit-and-quota | Quota | shared providers create shared throttling | per-tenant buckets vs provider buckets |
| 06 | key-and-credential-management | Credential | a leaked key is a company-wide bill | per-provider keys, rotation, scope |
| 07 | cost-attribution-and-budgets | Observability | bills land monthly, decisions land continuously | price book, attribution, budget enforcement |
| — milestone: gateway is defensible — | ||||
| 08 | prompt-and-response-caching | Transform | tokens are expensive and many calls repeat | exact and semantic caching, invalidation |
| 09 | provider-drift-and-deprecation | Transform | providers retire models and shift behaviour | version pinning, deprecation calendar, dual-run |
| 10 | multi-region-and-privacy | Routing | data has residency, providers have regions | region routing, privacy zones, on-prem fallback |
| 11 | observability-per-provider | Observability | aggregate metrics hide provider-specific failures | per-provider dashboards, error taxonomy |
| — milestone: gateway is operable — | ||||
| 12 | architect-checklist | Synthesis | completeness | 20-item design/build/launch/operate |
| 13 | honest-admission | Boundaries | humility | what gateway design cannot solve |
Three traversal paths use this map. Prerequisite path — read top to bottom. Failure path — when a provider incident wakes you, find which surface absorbed it (or didn't). Synthesis path — pick two rows and ask how they compose (e.g., Routing + Quota = how do you choose a provider when a tenant is near their limit?).
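As an illustration of the synthesis path, the Routing + Quota question can be sketched with a per-tenant token bucket whose fill level influences the route. This is one possible policy, not the module's prescribed answer: the `low_water` threshold and the `"primary"` and `"economy"` alias names are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TokenBucket:
    capacity: float        # most tokens the tenant can bank
    refill_per_sec: float  # sustained rate the tenant is entitled to
    tokens: float          # current balance
    updated_at: float      # time of the last refill, in seconds

    def refill(self, now: float) -> None:
        """Credit tokens for the time elapsed, capped at capacity."""
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_sec)
        self.updated_at = now

    def try_take(self, amount: float, now: float) -> bool:
        """Spend tokens if the balance allows it; report success."""
        self.refill(now)
        if amount > self.tokens:
            return False
        self.tokens -= amount
        return True


def choose_alias(bucket: TokenBucket, estimated_tokens: float, now: float,
                 low_water: float = 0.2) -> Optional[str]:
    """Return a model alias, or None when the tenant must be throttled.

    Hypothetical policy: once the balance drops below the low-water
    mark, later calls are steered to a cheaper alias.
    """
    if not bucket.try_take(estimated_tokens, now):
        return None
    if bucket.tokens < low_water * bucket.capacity:
        return "economy"
    return "primary"
```

Passing `now` in explicitly keeps the sketch deterministic; a real service would more likely read a monotonic clock inside the bucket.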
How this module relates to its neighbours
- 12_model_vendor_strategy — that module is the strategic choice of which vendors to bet on. This module is the production discipline of operating those vendors once chosen. Strategy decides the menu; this module runs the kitchen.
- 19_tool_integration_contracts — same boundary discipline, different boundary. Module 19 sits between agent and tools; this module sits between agent and model providers.
- 05_agent_performance_economics — token budgets, caching, batching, model routing, latency SLOs. The gateway is where many of those decisions are enforced.
- 13_prompt_lifecycle_operations — prompts are config; the gateway is the place they are versioned alongside model selections.
- 05_ai_incident_operations — provider outages are AI incidents. The gateway is the system that turns them from product outages into operator-managed degradations.
Top resources
- Anthropic — API rate limits — https://docs.anthropic.com/en/api/rate-limits
- Anthropic — model deprecations — https://docs.anthropic.com/en/docs/about-claude/model-deprecations
- OpenAI — production best practices — https://platform.openai.com/docs/guides/production-best-practices
- AWS Bedrock — model invocation — https://docs.aws.amazon.com/bedrock/latest/userguide/model-invocation.html
- Vertex AI — model garden — https://cloud.google.com/vertex-ai/docs/start/explore-models
- LiteLLM — open-source gateway — https://docs.litellm.ai/
- Cloudflare AI Gateway — https://developers.cloudflare.com/ai-gateway/
- Portkey — https://docs.portkey.ai/
What's coming
- 01-why-direct-provider-calls-break.md — The case for the boundary. What goes wrong without a gateway and why every product team rebuilds the same wheels badly.
- 02-gateway-anatomy.md — The six surfaces, as a service architecture.
- 03-routing-policies.md — Workload classes, latency budgets, cost ceilings, privacy zones. The route key in detail.
- 04-fallback-chains.md — Provider fails. What does the gateway return? Degraded model, alternate provider, cached result, refusal — and how the caller is told.
- 05-rate-limit-and-quota.md — Provider limits, gateway quotas, per-tenant fairness, the math of bursty agents.
- 06-key-and-credential-management.md — Per-provider keys, rotation, scope, and the discipline that keeps them out of every product binary.
- 07-cost-attribution-and-budgets.md — Price book, attribution per tenant/feature/agent, budget enforcement.
- 08-prompt-and-response-caching.md — Exact and semantic caching, hit-rate economics, invalidation, poisoning concerns.
- 09-provider-drift-and-deprecation.md — Models retire. Behaviour shifts. The gateway is the place you absorb both without breaking callers.
- 10-multi-region-and-privacy.md — Regional routing, data residency, on-prem fallbacks, privacy zones.
- 11-observability-per-provider.md — Per-provider dashboards, error taxonomy, latency baselines, the alarm panel.
- 12-architect-checklist.md — Twenty items: design, build, launch, operate.
- 13-honest-admission.md — Where gateway design still has no defensible answer.
Bridge. Before we design the gateway, we have to feel why direct provider calls are not enough. The first chapter walks the failure modes that show up when each product team is its own boundary — and then closes the case that those failures are not bugs but design absences. → 01-why-direct-provider-calls-break.md