01. Why direct provider calls break

Until the team feels the absence of a gateway in production, every chapter that follows reads as over-engineering. This chapter is the absence.


A product team at a Delhi NCR fintech launches its first model-backed feature: an explanation tool that turns a customer's transaction list into a plain-language summary. The team is small. The wiring is simple: anthropic.Client() is instantiated in the FastAPI app, the API key sits in an environment variable, the call is one line. The feature ships. It is well-received. Three months later the company has four other teams each shipping their own model-backed feature using the same pattern. Each team has its own API key, its own retry logic, its own error mapping, its own cost dashboard built in spreadsheets. Each is now subject to a different set of preventable production incidents. This chapter is the catalogue of those incidents.


The six failure modes of direct provider calls

Each of these has happened in production at companies that did not have a gateway. Each is a problem the gateway exists to solve.

1. The throttled provider takes the entire surface down

The opening scene of chapter 00. The provider throttles. Every product calling that provider fails at once because no one absorbs the throttle. A gateway can buffer (a request queue with bounded latency tolerance), shed load fairly (per-tenant queues so noisy neighbours do not starve quiet ones), or fall back to an alternate provider. None of these are possible from inside a single product without significant duplication.
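As a minimal sketch of the "shed load fairly" idea, the per-tenant queues might look like the following. The class name `FairQueue` and the `max_per_tenant` bound are illustrative assumptions, not part of any particular gateway; a real implementation would also need latency deadlines and concurrency control.

```python
from collections import OrderedDict, deque


class FairQueue:
    """Round-robin across per-tenant queues so a noisy tenant cannot starve a quiet one."""

    def __init__(self, max_per_tenant: int = 100):
        self.max_per_tenant = max_per_tenant  # hypothetical bound per tenant
        self._queues = OrderedDict()  # tenant -> deque of pending requests

    def enqueue(self, tenant: str, request: str) -> bool:
        queue = self._queues.get(tenant, deque())
        if len(queue) >= self.max_per_tenant:
            return False  # shed load for this tenant only
        queue.append(request)
        self._queues[tenant] = queue
        return True

    def dequeue(self):
        """Return (tenant, request) for the tenant at the front of the rotation, or None."""
        if not self._queues:
            return None
        tenant, queue = next(iter(self._queues.items()))
        request = queue.popleft()
        del self._queues[tenant]
        if queue:
            self._queues[tenant] = queue  # move the tenant to the back of the rotation
        return tenant, request
```

With three requests queued by one tenant and one by another, the second tenant is served after the first tenant's first request rather than after all three.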

2. Provider outage = product outage, with no degraded path

When the primary provider is hard-down, every direct caller is hard-down. The product has no notion of "we can return a worse answer instead of no answer." A gateway with a fallback chain returns a degraded but valid response (smaller model, alternate provider, cached value, refusal with explanation). The decision to degrade or refuse is operational, not product-coded; it can be tuned per workload during the incident.
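A fallback chain of this kind can be sketched in a few lines. Everything named here (`ProviderError`, `call_with_fallback`, the handler callables, the refusal text) is a hypothetical stand-in; the handlers would wrap real provider clients in practice.

```python
from typing import Callable, Sequence, Tuple


class ProviderError(Exception):
    """Raised by a handler when its provider is down or throttling."""


REFUSAL_TEXT = "The assistant is temporarily unavailable. Please try again later."


def call_with_fallback(
    prompt: str,
    chain: Sequence[Tuple[str, Callable[[str], str]]],
) -> dict:
    """Try each (name, handler) in order; return the first success, else a refusal."""
    errors = []
    for position, (name, handler) in enumerate(chain):
        try:
            text = handler(prompt)
        except ProviderError as exc:
            errors.append(f"{name}: {exc}")
            continue
        # Anything after the first entry counts as a degraded answer.
        return {"text": text, "served_by": name, "degraded": position > 0, "errors": errors}
    return {"text": REFUSAL_TEXT, "served_by": "refusal", "degraded": True, "errors": errors}
```

Because the chain is plain data passed to the function, operators could reorder or shorten it per workload during an incident without a product deploy.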

3. Cost is invisible until the invoice arrives

Each product team has its own dashboard, built from each provider's billing export, lagging by hours or days. Per-tenant cost attribution is impossible because the provider sees one API key. Per-feature attribution is approximate. A finance ask like "what did Tenant X spend on AI last month?" cannot be answered with confidence. A gateway sees every call; per-tenant, per-feature, per-agent cost attribution becomes a column in the audit log instead of a quarter-long project.
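To illustrate "a column in the audit log", here is a small sketch of cost attribution over audit records. The record fields and the use of integer micro-dollars are assumptions made for the example, not a prescribed schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditRecord:
    # Hypothetical audit columns; a real log would carry many more.
    tenant: str
    feature: str
    agent: str
    cost_micro_usd: int  # integer micro-dollars avoid float rounding in sums


def cost_by(records, dimension: str) -> dict:
    """Sum cost along one audit dimension: 'tenant', 'feature', or 'agent'."""
    totals = defaultdict(int)
    for record in records:
        totals[getattr(record, dimension)] += record.cost_micro_usd
    return dict(totals)


records = [
    AuditRecord("tenant-x", "summary", "explainer", 1200),
    AuditRecord("tenant-x", "search", "retriever", 300),
    AuditRecord("tenant-y", "summary", "explainer", 800),
]
print(cost_by(records, "tenant"))   # {'tenant-x': 1500, 'tenant-y': 800}
print(cost_by(records, "feature"))  # {'summary': 2000, 'search': 300}
```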

4. Provider model retirement is each team's surprise

A provider sunsets claude-3-haiku-20240307. The notice goes out via email and changelog. Each product team owning a direct integration has to find out, find the right people, plan a migration, test it, ship it. Some teams miss the notice; the model stops working on the cutover date; a midnight incident is opened. A gateway tracks the deprecation calendar centrally, pins model versions per route, and runs the migration on the gateway side with traffic-shifting tools the products do not need to know about.
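A rough sketch of what "pins model versions per route" and "traffic-shifting" could mean in code follows. The route names, the second model name, and the retirement date are placeholders invented for illustration; the date is not taken from any provider notice.

```python
import zlib
from datetime import date, timedelta

# Hypothetical route table: products name a route, never a model version.
ROUTES = {"summarise": "claude-3-haiku-20240307", "classify": "example-model-v2"}

# Placeholder deprecation calendar. The date below is invented for the example.
RETIREMENTS = {"claude-3-haiku-20240307": date(2030, 1, 1)}


def routes_at_risk(routes, retirements, today, warn_days=90):
    """Return routes whose pinned model retires within warn_days of today."""
    horizon = today + timedelta(days=warn_days)
    return sorted(
        route
        for route, model in routes.items()
        if model in retirements and retirements[model] <= horizon
    )


def pick_model(old: str, new: str, percent_new: int, request_id: str) -> str:
    """Deterministically send percent_new percent of requests to the new model."""
    bucket = zlib.crc32(request_id.encode("utf-8")) % 100
    return new if bucket < percent_new else old
```

Hashing the request ID keeps the split stable across retries, so a given request always lands on the same side of the migration while the percentage is raised.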

5. Each team carries the same API keys

Direct integrations need provider credentials. They end up in product environment variables, deployed binaries, occasionally in code. A leaked key is a company-wide bill, and on some providers a security risk too. Rotation requires changing every product that holds the key. A gateway holds one set of keys, scoped to itself; products receive a gateway credential, which is narrower and rotated through the gateway's own credential plane.
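One way to picture the gateway's credential plane is the sketch below, in which products hold gateway tokens and never see the provider key. The class and method names are hypothetical, and a production system would add expiry, scopes, and persistent storage.

```python
import hashlib
import secrets


class CredentialPlane:
    """Issues gateway tokens to products; only token hashes are stored."""

    def __init__(self):
        self._product_by_hash = {}

    @staticmethod
    def _digest(token: str) -> str:
        return hashlib.sha256(token.encode("utf-8")).hexdigest()

    def issue(self, product: str) -> str:
        token = secrets.token_urlsafe(32)
        self._product_by_hash[self._digest(token)] = product
        return token

    def rotate(self, product: str) -> str:
        """Revoke every token held by one product and issue a fresh one."""
        self._product_by_hash = {
            digest: owner
            for digest, owner in self._product_by_hash.items()
            if owner != product
        }
        return self.issue(product)

    def authenticate(self, token: str):
        """Return the owning product, or None if the token is unknown or revoked."""
        return self._product_by_hash.get(self._digest(token))
```

Rotating one product's token leaves every other product untouched, which is the contrast with a shared provider key.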

6. Drift catches every product separately

A provider changes a model's behaviour without changing its name. Output formatting shifts. Refusal rates change. Latency p95 moves. Each direct-integrating product feels the drift independently and runs its own investigation. A gateway with per-provider observability surfaces the drift as a platform-level signal that one team can investigate once and one set of dashboards can prove.
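As a simple sketch of a platform-level drift signal, the function below compares a current metrics window against a baseline. The metric names and the 20% tolerance are assumptions chosen for the example; real drift detection would use proper statistics over many windows.

```python
def drift_alerts(baseline: dict, current: dict, tolerance: float = 0.2) -> list:
    """Return (metric, relative_change) for metrics that moved more than tolerance."""
    alerts = []
    for metric, base in baseline.items():
        now = current.get(metric)
        if now is None or base == 0:
            continue  # nothing to compare against
        change = (now - base) / base
        if abs(change) > tolerance:
            alerts.append((metric, round(change, 3)))
    return alerts


# Hypothetical per-provider metrics aggregated across all products.
baseline = {"refusal_rate": 0.02, "p95_latency_ms": 1000.0}
current = {"refusal_rate": 0.05, "p95_latency_ms": 1100.0}
print(drift_alerts(baseline, current))
```

Here the refusal rate has more than doubled and is flagged, while the 10% latency shift stays inside the tolerance.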


Common counter-arguments, and what is wrong with them

Teams that resist the gateway usually say one of these things. They are answered here once.

"We're small, we don't need it yet"

The cost of retrofitting a gateway onto products that have grown up calling providers directly is consistently several quarters of work — finding all the call sites, building the gateway, migrating each call site without behaviour regressions, decommissioning the direct paths. The cost of building the gateway when you have two products and three call sites is small. The savings come from the next three products that never built a direct integration. The "small" case is not the cheap case; it is the only cheap case.

A pragmatic compromise: the first integration can be a thin gateway with just routing, audit, and one provider. The point is that products call the gateway from day one, not that the gateway has every surface from day one.
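A thin gateway in this sense can be very small. The sketch below assumes a hypothetical `provider` callable taking `(model, prompt)`; it deliberately does not call any real SDK, and the audit fields are illustrative.

```python
import time
from typing import Callable, Dict, List


class ThinGateway:
    """Routing + audit + one provider. Callers never touch the provider SDK."""

    def __init__(self, provider: Callable[[str, str], str], routes: Dict[str, str]):
        self._provider = provider  # hypothetical wrapper around the one provider
        self._routes = routes      # route name -> pinned model version
        self.audit_log: List[dict] = []

    def complete(self, route: str, tenant: str, prompt: str) -> str:
        model = self._routes[route]  # unknown routes raise KeyError
        started = time.monotonic()
        status = "error"
        try:
            text = self._provider(model, prompt)
            status = "ok"
            return text
        finally:
            # Every call is audited, whether it succeeded or failed.
            self.audit_log.append({
                "route": route,
                "tenant": tenant,
                "model": model,
                "status": status,
                "latency_s": time.monotonic() - started,
            })
```

Later surfaces (fallback, quotas, caching) would slot in behind `complete` without changing what products call.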

"The provider's SDK already retries / caches / etc."

It does some of this, narrowly. The provider SDK does not:

  • Fall back to a different provider when the primary is down
  • Apply your per-tenant quota across all your products
  • Attribute cost to your tenants, which the provider does not know about
  • Track your own deprecation calendar and pin versions
  • Apply a unified audit that satisfies your compliance team
  • Run a semantic cache across products

The SDK provides per-call resilience. The gateway provides cross-product, cross-provider, cross-tenant policy. They solve different problems.

"The gateway adds latency"

A well-designed gateway adds 1–5 ms of overhead in the request path; provider calls themselves are 200 ms to several seconds. The added latency is negligible against the cost it saves on the failure path (where a non-gatewayed call simply errors after timeout instead of falling back in 100 ms). The latency argument is usually advanced before the gateway is built and not after.
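The arithmetic behind "negligible" is easy to check using the figures quoted above; the helper name is just for illustration.

```python
def overhead_percent(gateway_ms: float, provider_ms: float) -> float:
    """Gateway overhead as a percentage of the provider call's own latency."""
    return 100 * gateway_ms / provider_ms


# Worst case from the text: 5 ms of overhead on the fastest (200 ms) provider call.
print(overhead_percent(5, 200))   # 2.5
# A 1 ms overhead on a 2-second call.
print(overhead_percent(1, 2000))  # 0.05
```

Even in the least favourable pairing of the quoted ranges, the gateway accounts for 2.5% of the call.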

The exception is when the gateway is implemented as a cross-region hop the request did not previously make. Then the latency matters. The fix is regional gateway deployment (chapter 10), not abandoning the gateway.

"We can't share rate limits across products; they have different priorities"

The gateway models this directly. Per-product, per-tenant, per-feature buckets are exactly what chapter 05 builds. The single shared API key the provider sees is not the same as a single shared quota inside your platform; the gateway is what lets the company allocate the provider's quota fairly across its own products.
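As an illustration of layered buckets (not the design chapter 05 builds, which is not shown here), a request might be admitted only when every bucket on its path has capacity. The names, rates, and explicit `now` clock are assumptions made to keep the sketch deterministic.

```python
class TokenBucket:
    """Classic token bucket with an explicit clock so behaviour is testable."""

    def __init__(self, rate_per_sec: float, capacity: float, now: float = 0.0):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.updated = now

    def refill(self, now: float) -> None:
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now


def admit(buckets, now: float, cost: float = 1.0) -> bool:
    """Admit only if every bucket (product, tenant, feature) can pay the cost."""
    for bucket in buckets:
        bucket.refill(now)
    if all(bucket.tokens >= cost for bucket in buckets):
        for bucket in buckets:
            bucket.tokens -= cost
        return True
    return False  # no bucket is charged for a rejected request


product = TokenBucket(rate_per_sec=1.0, capacity=2.0)
tenant = TokenBucket(rate_per_sec=1.0, capacity=1.0)
print(admit([product, tenant], now=0.0))  # True
print(admit([product, tenant], now=0.0))  # False: the tenant bucket is empty
print(admit([product, tenant], now=1.0))  # True: both buckets have refilled
```

Checking all buckets before charging any of them means a tenant that hits its own limit does not consume the product's shared allowance.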

"We don't want a single point of failure"

A gateway whose own failure brings down all model traffic is a real concern. The mitigations are well-understood: regional deployment, redundant instances, a fail-open mode for read-only routing decisions, and the ability to bypass the gateway in emergency through an audited break-glass path. These are platform-engineering disciplines, not unique to AI; the same considerations apply to API gateways, load balancers, and identity services that companies routinely deploy.

A gateway that has been engineered for resilience is not a single point of failure; it is a single point of policy. Those are different.


What the gateway does for the company that direct calls cannot

A single-sentence frame for each of the six failure modes:

| Failure mode | Gateway's role |
|---|---|
| Throttled provider takes the surface down | Per-tenant fairness; buffering; fallback to alternate provider |
| Provider outage = product outage | Fallback chain returns degraded but valid response |
| Cost invisible until invoice | Per-call audit with cost attribution along tenant/feature/agent |
| Model retirement = team-by-team surprise | Central deprecation calendar; version-pinned routes; traffic-shifted migration |
| Keys spread across products | Gateway holds provider keys; products hold gateway keys |
| Drift surprises every product separately | Central per-provider observability; one investigation, one signal |

These six together justify the gateway. None of them, on its own, is dramatic enough to force the decision; the cumulative weight is what does it. Teams that delay the gateway pay each cost separately and never get the leverage of paying them once.


The other side — what a gateway is not a fix for

To stay calibrated, name what the gateway does not solve.

  • Prompt quality. The gateway does not write your prompts.
  • Eval and release gates. The gateway is the place model versions are pinned, but the decision to promote a model comes from the eval system (module 04_ai_product_evals).
  • Tool integrations. Module 19 is the boundary for tools; this module is the boundary for models.
  • Agent loop and tool composition. Module 01 architects the agent; this module supplies the model calls it makes.
  • Model behaviour itself. If a model is bad at your task, no gateway design saves you. The fix is routing (use a different model) or fine-tuning, not gateway plumbing.

How to recognise the gateway-missing pattern in the wild

Walk into a new engineering org. The following symptoms suggest the gateway is missing or partial.

  • Each product team owns its own provider API key
  • Cost reports come from the provider's billing export, not from internal audit
  • "What did Tenant X spend?" is hard to answer
  • Provider outages produce parallel incidents in multiple products
  • Model version strings (claude-3-5-sonnet-20241022) appear in product code
  • The model migration plan is a per-team email thread
  • There is no shared dashboard of model error rates across the company
  • Adding a new provider means each team integrates the new SDK
  • Cache hit rates are unknown or product-specific
  • Retry/fallback logic is duplicated across products with slight variations

Three or more of these suggests the gateway is half-built. Most of them means there is no gateway.


Interview Q&A

Q1. A startup with one product and one model provider wants to skip the gateway "until it's needed." What is your advice?

Build a thin gateway as the first integration. It can be a small service with routing, audit, and one provider; everything else (multi-provider fallback, semantic cache, per-tenant quotas) is added when the second product or the first incident demands it. The discipline you gain on day one is that callers never see the provider's SDK directly; that discipline is the load-bearing piece. Retrofitting it onto multiple products is multi-quarter work; starting with it is days of work.

Wrong-answer notes: agreeing to "skip until needed" sets the team up for the chapter-opening incident, then a multi-quarter migration.

Q2. The provider's SDK already retries on 429 and 503. Why is a gateway-level fallback needed?

SDK retries are within the same provider. They do not help when the provider is down hard or throttling persistently. The gateway's fallback chain switches providers (or models, or to a cached response, or to a refusal with explanation) when the primary cannot recover within the call's latency budget. The SDK and gateway operate at different scales: SDK is per-call resilience, gateway is cross-provider availability.

Wrong-answer notes: "the SDK is enough" is the answer that produces hours-long outages when the provider has a multi-hour incident.

Q3. A platform engineer says "the gateway is one more service to operate, so it's a tax." How do you frame the cost calculation?

The gateway is one service operated by one team. The alternative is the same service implemented partially, badly, and incompatibly in every product, by every team, with every team owning the operational burden. The cost of one shared, well-engineered platform service is reliably lower than the cost of N partial implementations whose combined surface area exceeds the gateway's. Add the cost of incidents avoided (chapter-opening scenario, model-retirement scrambles, cost surprises) and the math favours the gateway clearly.

Wrong-answer notes: "platform tax" framing without the comparison is rhetorical, not analytical.

Q4. The compliance team needs to prove that customer data did not leave India for any agent call last quarter. With direct provider integrations, how do you answer? With a gateway?

With direct integrations: gather audit logs from each product team, hope they captured the provider region, normalise the formats, write a one-off query. Confidence is partial; some calls likely went through paths that were not logged at the product level. With a gateway: query the gateway's audit log for region != IN; the answer comes back in minutes with full coverage. The privacy zone is enforced by the gateway, so the answer is a property of the routing policy, not a survey of products.

Wrong-answer notes: "we'd build a dashboard" misses that the gateway is the dashboard's substrate.
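The gateway-side answer to Q4 can be sketched as a single query. The table name, columns, and sample rows below are hypothetical, and SQLite stands in for whatever store actually holds the audit log.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE audit_log (call_id TEXT, tenant TEXT, region TEXT, called_at TEXT)"
)
# Invented sample rows, purely for illustration.
conn.executemany(
    "INSERT INTO audit_log VALUES (?, ?, ?, ?)",
    [
        ("c1", "tenant-a", "IN", "2024-01-15T10:00:00"),
        ("c2", "tenant-b", "US", "2024-02-03T09:30:00"),
        ("c3", "tenant-a", "IN", "2024-03-20T18:45:00"),
        ("c4", "tenant-b", "SG", "2024-04-02T08:00:00"),  # outside the quarter
    ],
)


def calls_outside(conn, region: str, start: str, end: str):
    """Return audited calls in [start, end) that were served outside `region`."""
    return conn.execute(
        "SELECT call_id, tenant, region FROM audit_log "
        "WHERE region != ? AND called_at >= ? AND called_at < ? "
        "ORDER BY called_at",
        (region, start, end),
    ).fetchall()


print(calls_outside(conn, "IN", "2024-01-01", "2024-04-01"))
# [('c2', 'tenant-b', 'US')]
```

An empty result is the compliance proof; a non-empty one is the list of calls to investigate.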


What to do differently after reading this

  • If the company has more than one product calling models, scope a gateway. Start thin. Routing + audit + one provider is enough for day one.
  • If the company has one product, build the gateway as part of that product's first integration so the boundary exists from the start.
  • For every gateway-missing symptom in the list above, decide which gateway surface fixes it and add to the roadmap.
  • When a team argues "later," compute the retrofit cost honestly. It is always larger than the build-now cost.

Bridge. The case is made. The next chapter walks the actual anatomy of a gateway: the six surfaces as a deployable service, the request flow through them, and the operational properties that make it tier-zero. → 02-gateway-anatomy.md