13. Honest admission

Twelve chapters of discipline, and none of them solves the problem entirely. This chapter is a calibrated list of what gateway design cannot fix, where community practice is still young, and which limits a thoughtful lead should be transparent about.

The model gateway is a load-bearing piece of platform engineering. It absorbs provider variability, enforces policy once, and gives the company a defensible boundary against the most volatile dependency in modern software stacks. None of that makes it a complete answer.

1. The gateway does not improve model quality

A well-routed call to a model that is bad at the task still produces a bad answer. The gateway is the right place to swap models, but the decision of which model is good enough belongs to evals (04_ai_product_evals) and the prompt-and-tuning layer (module 13). A platform that hopes a gateway will fix a quality problem will be disappointed.

2. Semantic caching is a sharp tool

Chapter 08 covered the safeguards (per-user partitioning, threshold validation, eval-gated rollouts). Even with the safeguards, semantic caching can produce subtly wrong answers in long-tail cases. Many production platforms run exact-match only and accept the lower hit rate rather than carry the risk. A lead should state the choice explicitly, not default to semantic because "it sounds better."
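For comparison, the exact-match option can be sketched in a few lines. This is a minimal illustration and not the cache design from chapter 08: the key fields (`tenant_id`, `user_id`, `model_alias`) and the in-memory store are assumptions chosen for the example. The point is that any difference in the request, including the user, yields a different key, so a hit can never return another user's answer or an answer to a merely similar question.

```python
import hashlib
import json
from typing import Optional


def exact_cache_key(tenant_id: str, user_id: str, model_alias: str,
                    messages: list, params: dict) -> str:
    """Hash the full request; field names here are hypothetical."""
    payload = json.dumps(
        {
            "tenant": tenant_id,
            "user": user_id,        # per-user partition: users never share entries
            "alias": model_alias,
            "messages": messages,
            "params": params,
        },
        sort_keys=True,             # stable ordering so equal requests hash equally
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ExactMatchCache:
    """In-memory stand-in for whatever store the platform actually uses."""

    def __init__(self) -> None:
        self._store: dict = {}

    def get(self, key: str) -> Optional[str]:
        return self._store.get(key)

    def put(self, key: str, response: str) -> None:
        self._store[key] = response
```

The cost of this choice is the lower hit rate the text mentions: a rephrased question is a miss by construction.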

3. Per-call cost estimation is upper-bound by design

Budget enforcement requires estimating cost before the response is known (chapter 07). The estimate uses max_output_tokens, which over-estimates real usage. Tenants near budget see refusals earlier than the "real" usage would dictate. The alternative — a learned estimator — adds complexity and risks under-estimation that slips calls past hard caps. Either is defensible; neither is perfect.
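The over-estimate is easy to see with numbers. The sketch below assumes a simple per-1,000-token price table and an admit-if-it-fits rule. The `Price` type, the function names, and the prices are invented for illustration and are not taken from chapter 07.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    # Hypothetical prices in currency units per 1,000 tokens.
    input_per_1k: float
    output_per_1k: float


def cost(input_tokens: int, output_tokens: int, price: Price) -> float:
    """Cost of a call once both token counts are known."""
    return (input_tokens * price.input_per_1k
            + output_tokens * price.output_per_1k) / 1000


def estimate_upper_bound(input_tokens: int, max_output_tokens: int,
                         price: Price) -> float:
    """Pre-call estimate: assume the model uses every allowed output token."""
    return cost(input_tokens, max_output_tokens, price)


def admit(spent: float, budget: float, estimate: float) -> bool:
    """Hard cap: admit only if the estimate fits in the remaining budget."""
    return spent + estimate <= budget
```

With a budget of 100, 95 already spent, 2,000 input tokens, and `max_output_tokens` of 4,000 at the made-up prices 0.5 and 1.5, the upper-bound estimate does not fit and the call is refused, even though a response of 500 output tokens would have fit comfortably. That is the early refusal the text describes.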

4. Drift detection lags real provider changes

Chapter 09's drift signals (eval scores, token shifts, refusal rates) catch behaviour changes hours after they begin, not at the moment of change. The provider's own deploy notice, if one exists, is usually the only "real-time" signal. A platform that wants zero-lag detection of provider behaviour shifts is asking for something providers do not offer.
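Part of the lag is structural: any windowed signal needs enough post-change calls before it moves. The sketch below shows that with a rolling refusal-rate monitor. The class, its parameters, and the fixed-threshold rule are assumptions for illustration, not the detector from chapter 09.

```python
from collections import deque


class RefusalRateMonitor:
    """Alert when the refusal rate over the last `window` calls exceeds
    baseline + tolerance. All thresholds are hypothetical."""

    def __init__(self, window: int, baseline: float, tolerance: float) -> None:
        self._calls = deque(maxlen=window)
        self._baseline = baseline
        self._tolerance = tolerance

    def observe(self, refused: bool) -> bool:
        """Record one call; return True if the alert condition holds."""
        self._calls.append(refused)
        if len(self._calls) < self._calls.maxlen:
            return False  # not enough data yet to trust the rate
        rate = sum(self._calls) / len(self._calls)
        return rate > self._baseline + self._tolerance
```

Even if a provider change made every subsequent call a refusal, the alert would fire only after several calls had displaced older ones in the window. A wider window or a lower-traffic alias stretches that delay, and tightening the tolerance trades lag for false alarms.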

5. Single point of policy is single point of incident

The gateway need not be a single point of failure when it is engineered for resilience (multi-AZ, regional deployment, fail-open or fail-closed modes). It is still a single point of policy: a misconfiguration takes effect platform-wide. The platform's change-management discipline (review, staging, canary rollout) has to be applied to gateway changes the way it is applied to production code. Gateways without that discipline produce platform-wide outages from configuration changes.

6. Multi-provider abstraction has limits

The unified request shape (chapter 02) covers what the gateway interprets. Provider-specific features that the gateway does not interpret — fine-tuning APIs, vendor-specific tool-call semantics, certain streaming behaviours — leak through, either as passthrough fields or as alias-specific paths. A platform that wants every model to look identical in every detail is asking for an abstraction the providers themselves do not support.
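One way to picture the leak is a unified request that carries an explicit passthrough bag. The sketch below is an assumption about shape, not the chapter 02 schema: `UnifiedRequest`, the `passthrough` field, and the provider payload key `max_tokens` are all hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class UnifiedRequest:
    alias: str
    messages: list
    max_output_tokens: int
    # Provider-specific fields the gateway forwards without interpreting,
    # keyed by provider name.
    passthrough: dict = field(default_factory=dict)


def to_provider_payload(req: UnifiedRequest, provider: str) -> dict:
    """Translate the interpreted fields, then attach that provider's extras."""
    payload: dict = {
        "messages": req.messages,
        "max_tokens": req.max_output_tokens,  # hypothetical provider field name
    }
    extras: dict = req.passthrough.get(provider, {})
    clash = set(extras) & set(payload)
    if clash:
        # Passthrough must not override fields the gateway enforces policy on.
        raise ValueError(f"passthrough overrides interpreted fields: {sorted(clash)}")
    payload.update(extras)
    return payload
```

The collision check is the design choice worth keeping: passthrough fields are opaque to the gateway, so they should never be able to overwrite a field such as the output-token limit that budgets and quotas depend on.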

7. Cost attribution depends on caller honesty

Per-tenant, per-feature, per-agent attribution (chapter 07) is only as good as the dimensions the caller passes. A misconfigured caller can attribute calls to the wrong feature; a malicious caller could do so deliberately. Mitigation: enforce caller identity from the credential (not from the request body), and audit attribution dimensions for plausibility (a caller suddenly attributing to a tenant outside its scope is an alarm). Beyond that, the dimensions are trusted at the boundary.
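The mitigation can be sketched as follows, assuming the credential carries a caller identity and a tenant scope. The `Credential` type and the dimension names are illustrative and not the gateway's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Credential:
    # Hypothetical credential contents, resolved by the gateway at auth time.
    caller_id: str
    allowed_tenants: frozenset


def resolve_attribution(cred: Credential, claimed: dict) -> dict:
    """Build attribution dimensions for one call.

    Caller identity always comes from the credential. The tenant is taken
    from the request but checked against the credential's scope. The feature
    dimension is trusted as passed, which is the residual gap.
    """
    tenant = claimed.get("tenant", "")
    if tenant not in cred.allowed_tenants:
        # A caller attributing outside its scope is an alarm, not a warning.
        raise PermissionError(f"{cred.caller_id} may not attribute to {tenant!r}")
    return {
        "caller": cred.caller_id,  # any "caller" in the request body is ignored
        "tenant": tenant,
        "feature": claimed.get("feature", "unknown"),
    }
```

Note what the sketch cannot do: nothing in the credential says whether the `feature` value is correct, so a call mislabelled within the caller's own scope still passes.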

8. Privacy-zone enforcement depends on configuration correctness

The gateway can enforce only what it is configured to enforce. A tenant whose privacy zone is set to "any" when it should be "in-region-only" will route everywhere. The integrity check is the configuration audit — a quarterly review that confirms every tenant's zone matches their contract — and the standing zero-rows query that proves enforcement matches configuration. Neither catches the case where the configuration itself is wrong from day one.
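A zero-rows query of this kind might look like the sketch below, written against SQLite so it runs anywhere. The table and column names (`tenant_zone`, `audit_log`, `home_region`, `provider_region`) are assumptions made up for the example, and a real audit store will differ.

```python
import sqlite3

# Hypothetical schema: one row per tenant's configured zone, one row per call.
SCHEMA_SQL = """
CREATE TABLE tenant_zone (tenant_id TEXT PRIMARY KEY, zone TEXT, home_region TEXT);
CREATE TABLE audit_log (call_id TEXT, tenant_id TEXT, provider_region TEXT);
"""

# Calls routed outside the home region for tenants configured in-region-only.
# A healthy platform returns zero rows.
VIOLATIONS_SQL = """
SELECT a.call_id, a.tenant_id, a.provider_region
FROM audit_log AS a
JOIN tenant_zone AS z ON z.tenant_id = a.tenant_id
WHERE z.zone = 'in-region-only'
  AND a.provider_region <> z.home_region
ORDER BY a.call_id
"""


def find_zone_violations(conn: sqlite3.Connection) -> list:
    """Return every audited call that contradicts its tenant's configured zone."""
    return conn.execute(VIOLATIONS_SQL).fetchall()
```

The query compares routing against configuration, so a tenant wrongly set to "any" produces no rows however far its traffic travels. That blind spot is the one the paragraph above ends on, and only a review of the contract catches it.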

9. Eval-on-production-traffic is sampled

Chapter 11 noted that production eval scores are computed on a sample (1–5%) of traffic. A 1% sample leaves 99% of calls unscored, so small regressions can hide in the unsampled traffic. Larger samples increase eval cost. The trade-off is platform-specific. A lead should know the sample rate and the size of regression it can detect.
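A rough way to put a number on "the size of regression it can detect" is the standard two-sample approximation for proportions. The sketch assumes the eval produces a pass/fail score per call, that two equal-sized windows are compared with a one-sided test, and that the baseline pass rate is a fair stand-in for the variance in both windows. The function name and default thresholds are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def min_detectable_drop(calls_per_window: int, sample_rate: float,
                        baseline_pass_rate: float,
                        alpha: float = 0.05, power: float = 0.8) -> float:
    """Approximate smallest drop in pass rate detectable between two windows.

    Uses delta = (z_alpha + z_power) * sqrt(2 * p * (1 - p) / n),
    where n is the number of sampled calls per window.
    """
    n = calls_per_window * sample_rate
    if n <= 0:
        raise ValueError("sample must contain at least one call")
    z = NormalDist().inv_cdf(1 - alpha) + NormalDist().inv_cdf(power)
    p = baseline_pass_rate
    return z * sqrt(2 * p * (1 - p) / n)
```

Because the detectable drop scales with one over the square root of the sample size, moving from a 1% to a 5% sample shrinks it by a factor of about 2.2, not 5. That square-root return is why larger samples get expensive quickly.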

10. The gateway is platform code, with platform bugs

Every chapter described a subsystem. Each subsystem is software, with its own bug surface. A bug in the quota plane refuses calls that should be allowed; a bug in the routing scorer routes everything to a degraded candidate; a bug in the credential resolver issues over-broad credentials. The gateway is more important than the providers it wraps, because every product's reliability now depends on the gateway's. Treating the gateway as tier-zero — with its own tests, SLOs, postmortems, and code review discipline — is non-negotiable.

11. Build-vs-buy is a moving target

Chapter 02 noted the build-vs-buy trade-off. Open-source and vendor gateways (LiteLLM, Portkey, Cloudflare AI Gateway, others) have improved rapidly. A platform that built its own gateway 18 months ago may now look at the buy option differently. The discipline of the thin internal wrapper (so the backing implementation can be swapped) is what keeps the option open. Without it, the build choice locks in.
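The thin wrapper can be as small as one interface that product code depends on and two interchangeable backends behind it. The sketch below uses placeholder backends that return strings. It does not call any real vendor SDK, and every class name is hypothetical.

```python
from typing import Protocol


class CompletionBackend(Protocol):
    """The only surface product code is allowed to depend on."""

    def complete(self, alias: str, prompt: str) -> str: ...


class InHouseBackend:
    """Placeholder for a gateway the platform built itself."""

    def complete(self, alias: str, prompt: str) -> str:
        return f"in-house:{alias}:{prompt}"


class VendorBackend:
    """Placeholder adapter; a real one would translate to a vendor's API."""

    def complete(self, alias: str, prompt: str) -> str:
        return f"vendor:{alias}:{prompt}"


class ModelClient:
    """What callers import. Swapping the backend changes no caller code."""

    def __init__(self, backend: CompletionBackend) -> None:
        self._backend = backend

    def complete(self, alias: str, prompt: str) -> str:
        return self._backend.complete(alias, prompt)
```

The wrapper only keeps the option open if it stays thin. Once callers reach past `ModelClient` for a backend-specific feature, the swap stops being a configuration change.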

12. Cost remains the loudest signal, but not the most important

The most visible failure mode of an AI platform is cost runaway. The cost dashboards in this module are designed to catch it. But the most consequential failure modes are silent quality degradations and small privacy breaches — both of which the gateway can detect but neither of which produces an obvious bill spike. A lead should resist the pull toward "we monitor cost so we are fine" and ensure quality and compliance get equal attention.

What this module does not teach

Listed once, explicitly:

How to design prompts and evaluate models (modules 13, 04_ai_product_evals)
How to architect agents that consume gateway calls (module 01)
How to operate tool contracts between agent and downstream systems (module 19)
How to run AI incidents (module 05)
How to red-team against prompt injection (module 03_ai_security_safety/01)
How to manage on-prem inference clusters specifically (specialised infra)
The economics of training and fine-tuning (foundation modules)

This module assumes those neighbours exist or are being built. Without them, the gateway is necessary but not sufficient.

How to use this module after reading it

A realistic path:

Audit the platform against the chapter-12 checklist. Identify the top three reds.
If item 1 is red, stand up a minimal gateway. Routing + audit + one provider + one tenant + one alias. Iterate.
If items 9 (credentials) and 10 (cost attribution) are red, fix them next. They are the highest-leverage early fixes.
Establish the per-provider dashboard (item 13). Without it, incidents take an hour to diagnose instead of three minutes.
Add the drift review cadence (item 18). Weekly.
Come back to this chapter every six months. Some gaps will have moved; new ones will appear.

Closing

The model gateway is the production boundary between your company and the most volatile dependency in modern software stacks. The discipline this module taught — six surfaces, deployed as a tier-zero service, with audit and observability as its substrate — gives you a defensible posture against provider failure, drift, cost runaway, and compliance breach.

It does not give you a complete platform. It gives you a boundary the rest of the platform can stand on.

That is what production-grade AI infrastructure looks like.

Bridge. This module's discipline is the production boundary between callers and model providers. The next module, 06_ai_runbooks_oncall, is the operational discipline of being the team that owns this boundary at 03:00 when the provider has an incident. → ../06_ai_runbooks_oncall/00-eli5.md