02. Gateway anatomy¶

The case for the gateway is made. This chapter is the architecture: what the service actually is, how a request flows through its six surfaces, and what makes it a tier-zero dependency rather than a convenience library.

A staff engineer at a Hyderabad analytics platform is asked to draw the gateway on a whiteboard. The first sketch is six boxes in a row (routing, fallback, quota, credential, observability, transform) connected by arrows. The drawing is correct in spirit and misleading in shape. The six surfaces are not a pipeline of one-shot stages; they are concerns that every call touches in a specific order, some more than once, with explicit interaction points where one surface's decisions feed another's. This chapter describes the redrawn picture and what each piece does per call.

The request flow, one call at a time¶

A typical call from a caller (an agent, a product service, a batch job) through the gateway and back:

Caller                                                        Gateway
  |
  |-- 1. Request: model_alias, prompt, parameters,
  |    gateway_credential, tenant_id, workload_class,
  |    latency_budget_ms, idempotency_key (optional)
  |---------------------------------------------------------->|
  |                                                           |
  |                            +- (a) Authenticate caller, resolve scope
  |                            |
  |                            +- (b) Apply quota plane:
  |                            |        per-tenant bucket, per-feature bucket,
  |                            |        global bucket against provider limits
  |                            |
  |                            +- (c) Resolve route:
  |                            |        model_alias + workload_class +
  |                            |        privacy_zone + cost_ceiling +
  |                            |        latency_budget => (provider, model_version)
  |                            |
  |                            +- (d) Check cache (exact + semantic if eligible)
  |                            |        Return cached result if hit.
  |                            |
  |                            +- (e) Transform: unified request -> provider-native
  |                            |
  |                            +- (f) Issue call with gateway-held key
  |                            |        - on transient error: retry within budget
  |                            |        - on hard failure: walk fallback chain
  |                            |
  |                            +- (g) Transform: provider response -> unified
  |                            |
  |                            +- (h) Cache write (if eligible)
  |                            |
  |                            +- (i) Emit audit record (cost, latency, route, outcome)
  |                            |
  |                            +- (j) Return to caller
  |<----------------------------------------------------------|
  | 2. Response: model_used, latency_ms, cost_usd,
  |    result, audit_id

The ten steps (a)–(j) touch all six surfaces. The order matters:

Authn / quota first, so refused calls never hit the provider.
Routing next, because routing decides which provider's quota the call counts against. In real implementations, a route-aware per-provider quota check often follows the route decision.
Cache before transform, so a cache hit skips both the transform and the provider call.
Transform around the provider call, so the cache key and the audit record can use the unified shape.
Audit last, after the outcome is known, with cost computed from the price book.

A reader new to gateways often imagines the gateway as "wrap the SDK call." It is more: every step (a)–(j) is a load-bearing concern, and most of them are independent of which provider the call eventually hits.
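The ordering above can be sketched as a single function. This is a minimal illustration, not the gateway's actual code: `handle_call`, the step labels, the hard-coded route, and the dictionary shapes are all assumptions made for the example, and quota and routing are reduced to placeholders.

```python
from typing import Callable

def handle_call(request: dict, cache: dict,
                call_provider: Callable[[dict], dict], audit_log: list) -> dict:
    """Run one call through steps (a)-(j); every collaborator is a stand-in."""
    steps = ["a:authenticate"]
    if not request.get("gateway_credential"):
        # Refused calls are audited but never reach quota, cache, or provider.
        audit_log.append({"steps": list(steps), "outcome": "refused"})
        return {"error": "unauthenticated"}

    steps.append("b:quota")        # bucket checks would run here
    steps.append("c:route")
    provider, model_version = "provider-a", "model-a-1"   # hypothetical route

    steps.append("d:cache_read")
    key = (request["model_alias"], request["prompt"])     # keyed on the unified shape
    if key in cache:
        result, cache_status = cache[key], "exact_hit"    # skips (e)-(h)
    else:
        steps.append("e:transform_out")
        native = {"model": model_version, "input": request["prompt"]}
        steps.append("f:provider_call")
        raw = call_provider(native)
        steps.append("g:transform_in")
        result = {"content": raw["text"]}
        steps.append("h:cache_write")
        cache[key] = result
        cache_status = "miss"

    steps.append("i:audit")        # last, once the outcome is known
    audit_log.append({"steps": list(steps), "outcome": "ok",
                      "cache_status": cache_status, "provider": provider})
    return {"result": result, "cache_status": cache_status}   # (j)
```

Calling it twice with the same request shows the short-circuit: the second call is served from the cache and its audit record contains no provider-call step.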

The six surfaces, as a service¶

The gateway is one service (often one binary, sometimes a small cluster). Internally it has these subsystems:

Routing plane¶

Inputs: model_alias, workload_class, latency_budget_ms, privacy_zone, cost_ceiling_usd.

Output: an ordered list of (provider, concrete_model_version, region) triples. The first is the primary; the rest is the fallback chain.

The routing plane is the policy surface — chapter 03 builds it in detail. It does not call providers; it picks targets.
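As a rough sketch of "picks targets, does not call providers", the resolver below filters a preference-ordered candidate list by latency budget, cost ceiling, and privacy zone. The route table, the `Target` fields, and the provider names are hypothetical, and `workload_class` is left out for brevity; chapter 03's real policies are richer than this.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Target:
    provider: str
    model_version: str
    region: str
    typical_latency_ms: int   # assumed to come from observed metrics
    est_cost_usd: float       # assumed per-call estimate
    in_region: bool

# Hypothetical route table; list order is the preference order.
ROUTE_TABLE = {
    "fast-summariser": [
        Target("provider-a", "model-a-1", "ap-south-1", 700, 0.002, True),
        Target("provider-b", "model-b-2", "us-east-1", 500, 0.001, False),
        Target("provider-c", "model-c-3", "ap-south-1", 1500, 0.004, True),
    ],
}

def resolve_route(model_alias: str, latency_budget_ms: int,
                  privacy_zone: str, cost_ceiling_usd: float) -> list:
    """Return ordered (provider, model_version, region) triples; first is primary."""
    eligible = [
        t for t in ROUTE_TABLE.get(model_alias, [])
        if t.typical_latency_ms <= latency_budget_ms
        and t.est_cost_usd <= cost_ceiling_usd
        and (privacy_zone != "in-region-only" or t.in_region)
    ]
    return [(t.provider, t.model_version, t.region) for t in eligible]
```

An empty list is a legitimate output: it means no target satisfies the constraints, and the caller should be refused rather than routed somewhere that violates policy.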

Fallback executor¶

Walks the ordered list produced by the routing plane until one succeeds or the list is exhausted. On a failure that the contract layer cannot retry, advances to the next target. Within each target, the executor applies a small retry budget (e.g., one or two retries on transient errors).

The fallback executor is the resilience surface — chapter 04 builds the chain semantics.
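The walk described above might look like the following. The two exception classes and the `call` callback are assumptions for illustration; a real executor would also track the remaining latency budget, which is omitted here.

```python
class TransientError(Exception):
    """Hypothetical: a failure worth retrying on the same target (e.g. a timeout)."""

class HardFailure(Exception):
    """Hypothetical: a failure that retrying the same target will not fix."""

def execute_with_fallback(targets, call, max_retries=1):
    """Try each target in order; return (result, attempts) from the first success."""
    attempts = []
    for target in targets:
        for _ in range(1 + max_retries):      # one try plus the retry budget
            attempts.append(target)
            try:
                return call(target), attempts
            except TransientError:
                continue                      # retry this target while budget remains
            except HardFailure:
                break                         # advance to the next target immediately
    raise RuntimeError(f"fallback chain exhausted after {len(attempts)} attempts")
```

Returning the attempt list alongside the result lets the audit record show which targets were tried, not only which one answered.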

Quota plane¶

Two layers of buckets:

Per-tenant / per-feature: fairness inside the platform; no single caller can monopolise capacity.
Per-provider: respects the provider's own rate limits, sized below them so that in normal operation the gateway throttles itself before the provider returns a 429.

Bucket state lives in a distributed store (Redis is common). Buckets refill via token-bucket or leaky-bucket policies.

Chapter 05 builds the math.
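Ahead of that chapter, here is a small in-memory sketch of a token bucket and a layered admission check. It assumes a single process and an injected clock; a production gateway would keep this state in the distributed store and make the check-and-decrement atomic. The class and function names are illustrative only.

```python
class TokenBucket:
    """In-memory token bucket; `now` is passed in so behaviour is deterministic."""

    def __init__(self, capacity: float, refill_per_sec: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.updated_at = now

    def refill(self, now: float) -> None:
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated_at = now

def admit(buckets, now: float) -> bool:
    """Admit a call only if every layer has a token; take from all layers or none."""
    for bucket in buckets:
        bucket.refill(now)
    if all(bucket.tokens >= 1.0 for bucket in buckets):
        for bucket in buckets:
            bucket.tokens -= 1.0
        return True
    return False
```

Checking every layer before decrementing any of them avoids charging the per-provider bucket for a call that the per-tenant bucket then refuses.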

Credential plane¶

Holds per-provider keys. Issues callers their gateway credential — a separate identity scoped to the gateway only. Performs rotation. Logs every key use. Never exposes provider keys to callers.

Chapter 06 is the discipline.

Transform plane¶

The unified request shape on the inside; the provider-native shape on the outside. Translates between them:

Maps model_alias to concrete model_version via the route resolution
Translates parameters whose names or units differ between providers (max_tokens vs max_new_tokens, etc.)
Normalises the response shape (content, tool calls, stop reasons, usage)
Handles streaming differences when streaming is requested

Caching lives here too — exact and semantic cache keys are computed on the unified shape, so cache hits are provider-agnostic. Chapter 08 details this.
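A minimal sketch of both ideas: renaming unified parameters to a provider's native names, and computing an exact-match cache key from the unified shape. The provider names and the mapping table are hypothetical (only the `max_tokens` / `max_new_tokens` contrast comes from the text above), and the key recipe is one reasonable choice rather than the one chapter 08 prescribes.

```python
import hashlib
import json

# Hypothetical per-provider renames; unmapped parameters keep their unified name.
PARAM_NAMES = {
    "provider-a": {"max_output_tokens": "max_tokens", "stop_sequences": "stop"},
    "provider-b": {"max_output_tokens": "max_new_tokens", "stop_sequences": "stop_strings"},
}

def to_native(provider: str, parameters: dict) -> dict:
    """Translate unified parameter names into the provider-native ones."""
    mapping = PARAM_NAMES[provider]
    return {mapping.get(name, name): value for name, value in parameters.items()}

def exact_cache_key(model_alias: str, messages: list, parameters: dict) -> str:
    """Hash the unified request, so the key does not depend on the chosen provider."""
    canonical = json.dumps(
        {"alias": model_alias, "messages": messages, "parameters": parameters},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because the key is built before any provider-specific translation, the same request produces the same key whichever target the route resolves to.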

Observability plane¶

Per-call audit record: chapter 11 of module 19 provides the template for per-call structured audit, and this module's chapter 11 specialises it for model calls. Per-provider metrics (latency, error rate, throttle rate). Per-tenant cost dashboards. Drift signals (chapter 09).
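To make the audit record concrete, the sketch below computes cost from a price book and assembles one record. The price book entries are invented placeholder prices in USD per million tokens, not any provider's real rates, and the record fields are an assumed subset of what chapter 11 defines.

```python
# Hypothetical price book: USD per million tokens, keyed by (provider, model_version).
PRICE_BOOK = {
    ("provider-a", "model-a-1"): {"input_per_mtok": 1.00, "output_per_mtok": 3.00},
}

def cost_usd(provider: str, model_version: str,
             input_tokens: int, output_tokens: int) -> float:
    """Price a call from token usage and the gateway's own price book."""
    price = PRICE_BOOK[(provider, model_version)]
    total = (input_tokens * price["input_per_mtok"]
             + output_tokens * price["output_per_mtok"])
    return round(total / 1_000_000, 6)

def audit_record(audit_id: str, request: dict, target: tuple,
                 usage: dict, latency_ms: int, outcome: str) -> dict:
    """Assemble the per-call record once the outcome is known."""
    provider, model_version, region = target
    return {
        "audit_id": audit_id,
        "tenant_id": request["tenant_id"],
        "feature_id": request["feature_id"],
        "route": {"provider": provider, "model_version": model_version, "region": region},
        "latency_ms": latency_ms,
        "cost_usd": cost_usd(provider, model_version,
                             usage["input_tokens"], usage["output_tokens"]),
        "outcome": outcome,
    }
```

Pricing from an internal price book rather than from the provider invoice is what lets cost be attributed per tenant and per feature at call time.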

The unified request shape¶

The internal request shape is what the gateway's own code reads. It is decoupled from any provider's API. A reasonable starting shape:

request:
  model_alias: "fast-summariser"           # caller-facing name
  workload_class: "interactive"             # routing input
  latency_budget_ms: 2000                   # routing + retry input
  privacy_zone: "in-region-only"            # routing input
  cost_ceiling_usd: 0.05                    # routing input
  tenant_id: "acme-corp"                    # quota + audit input
  feature_id: "summary-card"                # quota + cost-attribution input
  caller_identity: "agent.support.v3"       # audit input

  prompt:
    messages:
      - role: "system"
        content: "You summarise customer transactions."
      - role: "user"
        content: "..."

  parameters:
    max_output_tokens: 500
    temperature: 0.2
    stop_sequences: []
    tools: []                               # if applicable
    stream: false

  cache_policy:
    eligible: true
    semantic: false
    ttl_seconds: 3600

The gateway accepts this shape. Internally it resolves the route, transforms to the provider's shape, makes the call, transforms the response back, and returns:

response:
  audit_id: "aud_01..."
  model_used:
    provider: "anthropic"
    model_version: "claude-sonnet-4-6"
    region: "ap-south-1"
  latency_ms: 712
  cost_usd: 0.0021
  cache_status: "miss"                      # "miss" | "exact_hit" | "semantic_hit"
  result:
    content: "..."
    stop_reason: "end_turn"
    usage:
      input_tokens: 1200
      output_tokens: 312

The unified shape is the single biggest design decision in the gateway. It must be flexible enough to absorb provider differences without becoming an undifferentiated bag of fields. A practical guide: include only fields where the gateway has a policy to apply (route, quota, cache, audit). Fields the gateway does not interpret should be passed through transparently — the gateway does not need to model every provider feature, only the ones it routes or caches against.
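One way to hold that line in code is to type only the fields the gateway interprets and collect everything else into an explicit passthrough. The sketch below covers just the `parameters` block; the `Parameters` class, its defaults, and the `top_k` field in the test are assumptions for illustration.

```python
from dataclasses import dataclass, field, fields

@dataclass
class Parameters:
    # Fields the gateway interprets (for budgeting, caching, streaming).
    max_output_tokens: int = 500
    temperature: float = 0.2
    stream: bool = False
    # Everything else is carried through untouched for the transform plane.
    passthrough: dict = field(default_factory=dict)

def parse_parameters(raw: dict) -> Parameters:
    """Split caller-supplied parameters into interpreted fields and passthrough."""
    known = {f.name for f in fields(Parameters)} - {"passthrough"}
    interpreted = {k: v for k, v in raw.items() if k in known}
    extra = {k: v for k, v in raw.items() if k not in known}
    return Parameters(**interpreted, passthrough=extra)
```

Adding a new interpreted field then becomes a deliberate schema change, while provider-specific knobs flow through without widening the schema.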

Authentication and authorisation¶

Callers authenticate to the gateway, not to providers. Two patterns are common.

Service-to-service tokens. Each calling service has a credential (mTLS cert, signed JWT, platform-issued token). The gateway validates it, looks up the service's permissions (which model aliases, which workload classes, which tenants), and decides whether to proceed.

Per-tenant or per-user tokens. For user-facing products where the gateway is called from a server that acts on behalf of an end-user, the token can carry the user identity. The gateway uses it for cost attribution and writes a token-derived identity into the audit record.

Either way, callers never see provider keys. The gateway-issued credential is the only model-access credential outside the gateway's own vault.
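The permission lookup in the service-to-service pattern can be sketched as below. It assumes the credential itself (mTLS certificate, JWT signature) has already been validated and has yielded a caller identity; the `Scope` structure and the `SCOPES` table are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Scope:
    model_aliases: frozenset
    workload_classes: frozenset
    tenants: frozenset

# Hypothetical scope table, keyed by the identity taken from a validated credential.
SCOPES = {
    "agent.support.v3": Scope(
        model_aliases=frozenset({"fast-summariser"}),
        workload_classes=frozenset({"interactive"}),
        tenants=frozenset({"acme-corp"}),
    ),
}

def authorise(caller_identity: str, model_alias: str,
              workload_class: str, tenant_id: str):
    """Return (allowed, reason) for an already-authenticated caller."""
    scope = SCOPES.get(caller_identity)
    if scope is None:
        return False, "unknown caller"
    if model_alias not in scope.model_aliases:
        return False, "model alias not permitted"
    if workload_class not in scope.workload_classes:
        return False, "workload class not permitted"
    if tenant_id not in scope.tenants:
        return False, "tenant not permitted"
    return True, "ok"
```

Returning a reason alongside the decision gives the audit record something specific to log when a call is refused.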

What "tier zero" means here¶

The gateway sits between the company's products and every model call. If it is down, every product that calls models is degraded, usually to "no model calls at all." That is the same operational position as the load balancer, the identity service, or the database front-end. The discipline:

Multi-AZ deployment at minimum; multi-region if the call profile justifies.
Health checks at multiple levels: process up, dependency reachable, sample provider call succeeds.
Graceful degradation of the gateway itself: if Redis-backed quota is down, the gateway can choose to fail open (let calls through without quota enforcement) or fail closed (refuse) per policy.
Break-glass bypass. A rarely-used, audited path that lets a caller talk to a provider directly during a gateway-wide outage. Documented in the runbook; usage is alarmed.
Its own SLO. Availability (e.g., 99.95%), p95 latency overhead (e.g., < 5 ms), error budget tracked.
Its own on-call. A team owns it; pages on its own incidents; runs postmortems on the gateway's own bugs.

Treating the gateway as a tier-zero service is the difference between a gateway that absorbs incidents (chapter 00's intent) and a gateway that causes them.
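Two items from the list lend themselves to a short worked sketch: the error budget implied by an availability SLO, and the fail-open versus fail-closed choice when the quota store is unreachable. Function names and the policy strings are assumptions; a 99.95% target over a 30-day window works out to roughly 21.6 minutes of allowed downtime.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability SLO permits over the window."""
    return (1.0 - slo) * window_days * 24 * 60

def quota_allows(check_bucket, on_store_outage: str) -> bool:
    """Apply the quota check; if the store is unreachable, follow the stated policy.

    `check_bucket` is a hypothetical callable that returns True/False and raises
    ConnectionError when the backing store is down.
    """
    try:
        return check_bucket()
    except ConnectionError:
        # "fail_open" lets the call through unmetered; anything else refuses it.
        return on_store_outage == "fail_open"
```

The policy is an explicit argument rather than a buried default so that the choice is visible in configuration and in review.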

Build vs buy¶

There are credible open-source and commercial gateways: LiteLLM, Portkey, Cloudflare AI Gateway, Bedrock's invocation layer (for AWS-only stacks), and Vertex AI in a similar role on GCP. The decision factors:

Factor	Build favours	Buy favours
Number of providers and models	Few	Many (the vendor abstracts more)
Policy depth needed (privacy zones, complex routing)	Deep	Shallow-to-medium
Compliance and audit requirements	Tight (regulated industry)	Standard SaaS posture
Cost attribution needs	Custom dimensions	Standard ones
Team capacity to own a service	Available	Limited
Performance overhead tolerance	Tight	Loose
Vendor lock-in tolerance	Low	Higher

Most large platforms eventually end up with a thin internal gateway wrapping an open-source or vendor solution: the vendor handles the messy provider-side details (SDKs, retries, version drift), while the thin internal layer handles policy, audit, and cost attribution in the company's own dimensions.

A reasonable trajectory: start with an off-the-shelf gateway behind a thin internal wrapper; as policy needs deepen, the internal wrapper grows; if it grows past the vendor's value, fork or replace. The discipline of the wrapper is the load-bearing piece; it lets you swap the backing implementation without breaking callers.
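The wrapper discipline can be sketched as an internal class that owns policy and audit and delegates the call to a swappable backend. `Backend`, `InternalGateway`, and the request fields are illustrative names; no real vendor SDK is shown, and any concrete backend would be an adapter written against that vendor's actual API.

```python
from typing import Protocol

class Backend(Protocol):
    """Anything that can serve a unified request: a vendor gateway adapter or in-house code."""
    def complete(self, request: dict) -> dict: ...

class InternalGateway:
    """Thin wrapper: callers depend on this class, never on the backend directly."""

    def __init__(self, backend: Backend, audit_log: list):
        self.backend = backend
        self.audit_log = audit_log

    def complete(self, request: dict) -> dict:
        # Policy the company owns: attribution fields are mandatory.
        for required in ("tenant_id", "feature_id"):
            if required not in request:
                raise ValueError(f"missing {required}")
        response = self.backend.complete(request)
        # Audit in the company's own dimensions, whatever the backend is.
        self.audit_log.append({
            "tenant_id": request["tenant_id"],
            "feature_id": request["feature_id"],
            "model_used": response.get("model_used"),
        })
        return response
```

Swapping the backing implementation is then a constructor argument change; callers and the audit schema stay as they are.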

How the gateway interacts with its neighbours¶

Module 12 (model vendor strategy) decides which providers and models you support. The gateway is where the strategy is enforced.
Module 13 (prompt lifecycle) versions prompts and model selections. The gateway is the production reader of those versions per route.
Module 19 (tool integration contracts) runs the boundary for tools; the gateway runs the boundary for models. Identical discipline, different upstream.
Module 04_ai_product_evals runs evals before promotion. The gateway is where promotion is implemented — flipping a route from one model version to another.
Module 05_agent_performance_economics owns the cost/latency story. The gateway is where prices, latencies, and caches are measured.

How to recognise a missing or partial gateway in the wild¶

Provider SDKs imported in product code
Per-product API keys
No central audit of model calls
Cost reports built from provider invoices, not internal records
Each team owns its own retry/fallback logic, with subtle differences
Model version strings hard-coded in product code
"How would a different provider work here?" requires a multi-week migration
The compliance team cannot answer routing/region questions

A platform with three or more of these is missing the gateway. A platform with all of these is the chapter-opening company.

Interview Q&A¶

Q1. Walk through what happens inside the gateway for one model call. Authenticate caller, apply quota (per-tenant, per-feature), resolve route based on alias + workload class + latency budget + privacy zone + cost ceiling, check cache (exact + semantic if eligible), transform to provider shape, call provider with gateway-held key (with small retry budget), on hard failure walk the fallback chain, transform response back, write cache if eligible, emit audit record with cost, return to caller. Six surfaces touched in order. Every step is independently testable. Wrong-answer notes: describing it as "call the SDK with retries" misses the policy surfaces (routing, quota, fallback, cache, audit) entirely.

Q2. What is the unified request shape and why does it exist? The internal request shape the gateway accepts, decoupled from any provider's native API. It exists so policy (routing, quota, cache, audit) can be applied without provider-specific code, so swapping providers is a transform-layer change not a policy change, and so the cache key is provider-agnostic (a cached response from one provider can be served when the route now points elsewhere, if the contract considers it equivalent). The hardest design decision is what to include: only fields the gateway interprets, plus a transparent passthrough for provider-specific fields. Wrong-answer notes: "to be vendor-agnostic" is the high-level reason; the specific value is policy/cache uniformity.

Q3. Why is the cache check before the transform but the cache write after? Because cache reads can short-circuit the provider call entirely — no transform cost, no provider cost, no latency. Cache hits skip everything downstream. Cache writes happen after the response is transformed back to the unified shape so the cached value is provider-agnostic; a future call that routes to a different provider can still serve from cache (when cache semantics permit). Wrong-answer notes: "for consistency" is vague; the specific value is short-circuiting cost.

Q4. The gateway is "one more service to operate." How is its operational discipline different from a regular microservice? Tier zero. Every model call in the company goes through it. Multi-AZ minimum; multi-region if traffic justifies; health checks at process, dependency, and synthetic-call layers; failure-mode policy (fail-open vs fail-closed on quota plane outage); break-glass bypass with audit; its own SLO and on-call rotation. The discipline is the same as the company's identity service or load balancer — both are tier-zero policy substrates. The gateway is not "another service"; it is closer to platform infrastructure. Wrong-answer notes: "we monitor it like everything else" is the answer that produces a gateway-induced outage.

What to do differently after reading this¶

Draw the request flow as ten boxed steps, in order. Reviewers can ask which boxes are missing in any draft design.
Pick the unified request shape early. Resist accumulating fields the gateway does not interpret.
Define the gateway's own SLOs separately from the providers'. Decouple "the gateway is up" from "the providers are up."
Decide build-vs-buy explicitly. If buying, build the thin internal wrapper from day one so policy is yours.

Bridge. The anatomy is in place. The next surface to design is the most visible: the routing plane. The route key is how the gateway decides which model serves which call, under which constraint. The next chapter builds the routing policies — workload class, latency budget, privacy zone, cost ceiling — and what the gateway does when they conflict. → 03-routing-policies.md