00. Model gateway and provider operations — First-principles overview

Module 19 taught you to operate the contracts between an agent and every system that is not a model provider. This module is the matching discipline for the providers themselves — Anthropic, OpenAI, Bedrock, Vertex, Azure OpenAI, and the local models you may run alongside them.

A platform engineer at a Pune insurance company gets paged at 19:40 IST on a Wednesday because the underwriting agent has stopped responding. The first guess is the agent platform. Logs say no: the agent platform is healthy. Logs also say it has been sending requests to Anthropic for four minutes and receiving nothing but 529 Overloaded. The team's wiring uses anthropic.Client() directly inside the agent runtime. There is no retry queue, no fallback provider, no cache, no per-tenant rate limit. When the provider throttles or is overloaded, every tenant of the platform fails at once. Twenty minutes later Anthropic recovers; another agent platform owned by a different team in the same company has the same outage, with the same root cause and a different timestamp. By 22:00 leadership asks why the same incident happened twice in a single quarter. The honest answer is that nothing sat between the agent and the provider, and nothing in the platform's design forced that boundary to exist.

That boundary is the model gateway. It is the single layer through which all model calls flow, so that routing, fallback, quota, cache, observability, and cost can be enforced once instead of being re-implemented per agent. Every chapter of this module is one surface of that gateway. The opening incident is what happens when the boundary is missing; the rest of the module covers what sits on either side of the boundary once it exists.

What a model gateway is, in one sentence

A model gateway is the production boundary between every caller in the company and every model provider, owned as a platform service, designed so that provider failure, drift, cost, and capacity are absorbed once at the boundary rather than being re-fought in every product.

Read the sentence left to right.

Production boundary — not a library inside each product, not a wrapper around each SDK. A network service or sidecar that calls go through.
Every caller, every provider — the gateway is the single bottleneck the company tolerates between its agents and the outside world of models.
Owned as a platform service — a team owns it, with its own on-call, SLOs, runbook, and roadmap.
Absorbed once at the boundary — provider rate limits become gateway-level quotas; provider outages become gateway-managed fallbacks; provider drift becomes gateway monitors. Products do not re-implement these.

If a company has more than one product calling models, and does not have a gateway, the question is not whether the chapter-opening incident will happen but how many duplicate copies of it the company is preparing to ship.

The six gateway surfaces

This module divides a production gateway into six surfaces. Memorise them once; the rest of the module is consequences.

Surface	One-liner	Pressure it answers
The routing plane	Pick the provider/model for this call based on workload, latency budget, cost ceiling, privacy zone	mismatch: not every call wants the same model
The fallback chain	Define what happens when the primary fails — degraded model, alternate provider, cached result, refusal	availability: providers fail in ways your callers cannot absorb
The quota plane	Token-bucket per-tenant, per-feature, per-model, against the provider's own limits	fairness: shared providers create shared throttling
The credential plane	Per-provider keys, per-region keys, rotation, scope to the gateway only	security: a leaked key is a company-wide bill and risk
The observability plane	Per-call audit, per-provider latency and error rates, per-tenant cost attribution	accountability: who used what, how well it worked, what it cost
The transform plane	Translate between unified schemas and each provider's native shapes; pin model versions	drift: providers change their APIs and their models on their own schedule
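
The fallback chain row can be made concrete with a small sketch. The Python below is illustrative only: `ProviderError`, `GatewayResult`, `call_with_fallback`, and the two stub providers are hypothetical names, not part of any provider SDK, and each step is assumed to be a plain callable that either returns text or raises. It shows one property of the row: the chain is ordered, and refusal is an explicit final outcome rather than an unhandled exception.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


class ProviderError(Exception):
    """Hypothetical error a provider adapter raises on overload or timeout."""


@dataclass
class GatewayResult:
    text: Optional[str]  # None when the chain ended in refusal
    served_by: str       # name of the step that answered, or "refusal"
    degraded: bool       # True when anything other than the primary answered


# A step is (name, callable). The callable takes a prompt and returns text,
# or raises ProviderError. Both are assumptions made for this sketch.
Step = Tuple[str, Callable[[str], str]]


def call_with_fallback(prompt: str, chain: Sequence[Step]) -> GatewayResult:
    """Try each step in order; return an explicit refusal if every step fails."""
    for index, (name, call) in enumerate(chain):
        try:
            text = call(prompt)
        except ProviderError:
            continue  # a real gateway would also record the failure here
        return GatewayResult(text=text, served_by=name, degraded=index > 0)
    return GatewayResult(text=None, served_by="refusal", degraded=True)


# Stub providers standing in for real adapters.
def overloaded_primary(prompt: str) -> str:
    raise ProviderError("529 Overloaded")


def healthy_alternate(prompt: str) -> str:
    return "alternate says: " + prompt


result = call_with_fallback(
    "summarise claim",
    [("primary", overloaded_primary), ("alternate", healthy_alternate)],
)
all_down = call_with_fallback("summarise claim", [("primary", overloaded_primary)])
```

Returning `degraded` and `served_by` alongside the text is one way to tell the caller what happened. Whether that signal belongs in the response body, a header, or only the audit record is a design choice this sketch does not settle.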

Caching (exact, semantic) lives inside the transform plane in this module. Some platforms call it a seventh surface; we treat it as a transform-side concern because the cache shapes responses, not routing decisions.
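
For the exact-match half of caching, a minimal sketch of a cache key follows. It assumes the cache sits after the transform plane has resolved the request to a concrete model version. The function name `exact_cache_key` and the shape of its arguments are hypothetical. Semantic caching needs an embedding lookup and is not shown.

```python
import hashlib
import json


def exact_cache_key(model_version: str, messages: list, params: dict) -> str:
    """Build a deterministic key for exact-match response caching.

    The concrete model version (not the caller-facing alias) is part of the
    key, so repointing an alias cannot serve answers from an older model.
    """
    payload = {"model": model_version, "messages": messages, "params": params}
    # sort_keys plus fixed separators yields identical JSON for equal inputs,
    # regardless of the order in which dict keys were inserted.
    canonical = json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Whether a tenant identifier or privacy zone should also be part of the key is a policy decision that this sketch leaves out.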

The recurring vocabulary

These terms appear in every chapter.

Name	Surface	What it is
the route key	Routing	the tuple `(workload_class, latency_budget, privacy_zone, cost_ceiling)` that selects a provider/model
the fallback chain	Fallback	the ordered list of providers/models tried on failure, with refusal as the terminal step
the workload class	Routing	a named tier — `interactive`, `batch`, `background`, `embeddings`, `tool_call` — with its own SLOs and routing
the privacy zone	Routing	a tenant- or feature-level constraint on which providers/regions are eligible
the per-tenant bucket	Quota	the rate limit a tenant gets, independent of the provider's own quota
the gateway key	Credential	the provider credential the gateway holds; callers never see it
the unified request	Transform	the gateway's internal request shape, decoupled from any provider
the model alias	Transform	a stable name (`fast-summariser`) the caller uses, mapped to a concrete provider/model version
the price book	Observability	the per-token, per-image, per-call price table used for cost attribution
the audit record	Observability	the per-call record: who, on whose behalf, which model, what cost, what outcome
the deprecation calendar	Transform	the schedule of provider model retirements the gateway tracks
the canary slice	Routing	the fraction of traffic a new model version sees before being promoted
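
Several of these terms fit together in a few lines of code. The sketch below is an assumption-laden illustration, not the module's implementation: the provider names, model versions, prices, and the routing rule are all invented for the example. It shows a route key selecting a model alias, the alias resolving to a pinned provider/model pair, and the price book turning token counts into a cost.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class RouteKey:
    workload_class: str       # e.g. "interactive", "batch"
    latency_budget_ms: int
    privacy_zone: str         # hypothetical zone name
    cost_ceiling_usd: Decimal


# Hypothetical alias table: stable caller-facing name -> pinned provider/model.
MODEL_ALIASES = {
    "fast-summariser": ("provider-a", "small-model-v1"),
    "deep-reasoner": ("provider-b", "large-model-v3"),
}

# Hypothetical price book: USD per 1,000 tokens as (input, output).
# These numbers are made up and are not any provider's real prices.
PRICE_BOOK = {
    ("provider-a", "small-model-v1"): (Decimal("0.001"), Decimal("0.002")),
    ("provider-b", "large-model-v3"): (Decimal("0.010"), Decimal("0.030")),
}


def select_alias(key: RouteKey) -> str:
    """Toy routing rule: tight interactive budgets get the fast alias.

    A real policy would also consult privacy_zone and cost_ceiling_usd;
    this sketch ignores them to stay short.
    """
    if key.workload_class == "interactive" and key.latency_budget_ms <= 2000:
        return "fast-summariser"
    return "deep-reasoner"


def call_cost(alias: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Resolve the alias to its pinned model, then price the call."""
    target = MODEL_ALIASES[alias]
    in_price, out_price = PRICE_BOOK[target]
    return (in_price * input_tokens + out_price * output_tokens) / 1000


key = RouteKey("interactive", 1500, "in-region-only", Decimal("0.05"))
alias = select_alias(key)
cost = call_cost(alias, input_tokens=1000, output_tokens=500)
```

Because callers only ever name the alias, changing the right-hand side of `MODEL_ALIASES` is the single place where a model version is promoted or retired in this sketch.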

The journey: own the boundary, then operate it

This module has two acts.

Act 1 — Build the gateway (files 01–07). The case for the gateway, its anatomy, routing, fallback, quotas, credentials, cost attribution. By file 07 the gateway exists as a defensible production service.

Act 2 — Operate the gateway (files 08–11). Caching, provider drift, multi-region/privacy, per-provider observability. The gateway does not become more powerful; it becomes resilient to time and scale.

Synthesis (files 12–13). Architect checklist and honest admission.

Memory map

#	File	Surface	Pressure answered	What it adds
01	why-direct-provider-calls-break	—	the cost of having no gateway	the case that forces the boundary to exist
02	gateway-anatomy	All	what the gateway actually does	the six surfaces as a service architecture
03	routing-policies	Routing	one model is not enough	workload, latency, cost, privacy routing
04	fallback-chains	Fallback	providers fail in production	degraded models, alternate providers, refusal
05	rate-limit-and-quota	Quota	shared providers create shared throttling	per-tenant buckets vs provider buckets
06	key-and-credential-management	Credential	a leaked key is a company-wide bill	per-provider keys, rotation, scope
07	cost-attribution-and-budgets	Observability	bills land monthly, decisions land continuously	price book, attribution, budget enforcement
	— milestone: gateway is defensible —
08	prompt-and-response-caching	Transform	tokens are expensive and many calls repeat	exact and semantic caching, invalidation
09	provider-drift-and-deprecation	Transform	providers retire models and shift behaviour	version pinning, deprecation calendar, dual-run
10	multi-region-and-privacy	Routing	data has residency, providers have regions	region routing, privacy zones, on-prem fallback
11	observability-per-provider	Observability	aggregate metrics hide provider-specific failures	per-provider dashboards, error taxonomy
	— milestone: gateway is operable —
12	architect-checklist	Synthesis	completeness	20-item design/build/launch/operate
13	honest-admission	Boundaries	humility	what gateway design cannot solve

Three traversal paths use this map. Prerequisite path — read top to bottom. Failure path — when a provider incident wakes you, find which surface absorbed it (or didn't). Synthesis path — pick two rows and ask how they compose (e.g., Routing + Quota = how do you choose a provider when a tenant is near their limit?).
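
As a worked instance of the synthesis path, here is one possible answer to the Routing + Quota question, sketched under stated assumptions. The per-tenant bucket is a standard token bucket. The 20% headroom threshold, the alias names, and the `choose_alias` policy are hypothetical choices made for illustration. Time is passed in explicitly so the behaviour is easy to check.

```python
import time
from typing import Optional


class TokenBucket:
    """Per-tenant token bucket: refills at a steady rate up to a fixed capacity."""

    def __init__(self, capacity: float, refill_per_second: float,
                 now: Optional[float] = None):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.level = capacity  # start full
        self.updated = time.monotonic() if now is None else now

    def refill(self, now: Optional[float] = None) -> None:
        """Add the tokens earned since the last update, capped at capacity."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        self.level = min(self.capacity, self.level + elapsed * self.refill_per_second)
        self.updated = now

    def try_consume(self, amount: float, now: Optional[float] = None) -> bool:
        """Take `amount` if available; report whether the call may proceed."""
        self.refill(now)
        if amount > self.level:
            return False
        self.level -= amount
        return True


def choose_alias(bucket: TokenBucket, estimated_tokens: float,
                 now: Optional[float] = None) -> str:
    """Hypothetical policy: reject when over quota, downgrade when nearly empty."""
    bucket.refill(now)
    if estimated_tokens > bucket.level:
        return "reject"
    headroom = bucket.level / bucket.capacity
    return "cheap-model" if headroom < 0.2 else "preferred-model"


bucket = TokenBucket(capacity=1000, refill_per_second=10, now=0.0)
first = choose_alias(bucket, 100, now=0.0)     # full bucket
bucket.try_consume(900, now=0.0)               # leaves 100 of 1000
second = choose_alias(bucket, 50, now=0.0)     # 10% headroom
third = choose_alias(bucket, 500, now=0.0)     # more than remains
```

Downgrading before rejecting is only one composition. A gateway could equally queue the call or charge it to a shared overflow bucket, and the module's later chapters are where those trade-offs are discussed.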

How this module relates to its neighbours

12_model_vendor_strategy — that module is the strategic choice of which vendors to bet on. This module is the production discipline of operating those vendors once chosen. Strategy decides the menu; this module runs the kitchen.
19_tool_integration_contracts — same boundary discipline, different boundary. Module 19 sits between agent and tools; this module sits between agent and model providers.
05_agent_performance_economics — token budgets, caching, batching, model routing, latency SLOs. The gateway is where many of those decisions are enforced.
13_prompt_lifecycle_operations — prompts are config; the gateway is the place they are versioned alongside model selections.
05_ai_incident_operations — provider outages are AI incidents. The gateway is the system that turns them from product outages into operator-managed degradations.

Top resources

Anthropic — API rate limits — https://docs.anthropic.com/en/api/rate-limits
Anthropic — model deprecations — https://docs.anthropic.com/en/docs/about-claude/model-deprecations
OpenAI — production best practices — https://platform.openai.com/docs/guides/production-best-practices
AWS Bedrock — model invocation — https://docs.aws.amazon.com/bedrock/latest/userguide/model-invocation.html
Vertex AI — model garden — https://cloud.google.com/vertex-ai/docs/start/explore-models
LiteLLM — open-source gateway — https://docs.litellm.ai/
Cloudflare AI Gateway — https://developers.cloudflare.com/ai-gateway/
Portkey — https://docs.portkey.ai/

What's coming

01-why-direct-provider-calls-break.md — The case for the boundary. What goes wrong without a gateway and why every product team rebuilds the same wheels badly.
02-gateway-anatomy.md — The six surfaces, as a service architecture.
03-routing-policies.md — Workload classes, latency budgets, cost ceilings, privacy zones. The route key in detail.
04-fallback-chains.md — Provider fails. What does the gateway return? Degraded model, alternate provider, cached result, refusal — and how the caller is told.
05-rate-limit-and-quota.md — Provider limits, gateway quotas, per-tenant fairness, the math of bursty agents.
06-key-and-credential-management.md — Per-provider keys, rotation, scope, and the discipline that keeps them out of every product binary.
07-cost-attribution-and-budgets.md — Price book, attribution per tenant/feature/agent, budget enforcement.
08-prompt-and-response-caching.md — Exact and semantic caching, hit-rate economics, invalidation, poisoning concerns.
09-provider-drift-and-deprecation.md — Models retire. Behaviour shifts. The gateway is the place you absorb both without breaking callers.
10-multi-region-and-privacy.md — Regional routing, data residency, on-prem fallbacks, privacy zones.
11-observability-per-provider.md — Per-provider dashboards, error taxonomy, latency baselines, the alarm panel.
12-architect-checklist.md — Twenty items: design, build, launch, operate.
13-honest-admission.md — Where gateway design still has no defensible answer.

Bridge. Before we design the gateway, we have to feel why direct provider calls are not enough. The first chapter walks the failure modes that show up when each product team is its own boundary — and then closes the case that those failures are not bugs but design absences. → 01-why-direct-provider-calls-break.md