03. Routing policies

The gateway's first decision per call is which model handles it. Routing is the policy layer that turns a caller's named alias into a concrete (provider, model, region). This chapter builds the route key and the policies the gateway evaluates against it.


A platform engineer at a Bengaluru learning company audits the gateway after six months of live traffic. The model selection happens via a single config table: feature_id → model_version. The auditor finds that the homepage summary feature, the embedding pipeline for nightly indexing, the inline chat assistant, and the after-hours batch report all hit the same claude-sonnet-4-6. The interactive chat needs sub-2-second latency. The nightly batch is allowed 10 minutes per item. The inline chat is privacy-zone-restricted to India. The batch report can pay for the smartest model in the catalogue but the chat cannot. None of these differences is encoded anywhere; routing is a string lookup. The result, predictably, is that the batch jobs sometimes blow through the same provider's quota that the chat depends on; latency-sensitive chat occasionally pays for the smartest model when a cheaper one would do; and the privacy zone is enforced only because the team manually verified once.

This chapter lays out the discipline that prevents findings like those. Routing is more than a string lookup; it is a multi-dimensional policy evaluated per call.


The route key

The route key is the tuple of inputs the routing plane evaluates. Every gateway has the same five dimensions, even if some are constant for many platforms.

| Dimension | What it is | Examples |
|---|---|---|
| model_alias | Caller-facing name for a class of capability | fast-summariser, smart-reasoner, code-assistant, embeddings-v3 |
| workload_class | The shape of the workload's SLOs | interactive, batch, background, streaming |
| latency_budget_ms | Maximum total latency tolerated, including fallback | 2000, 10000, 60000 |
| privacy_zone | Where the data is allowed to be processed | any, in-region-only, on-prem-only, eu-only |
| cost_ceiling_usd | Maximum cost the caller is willing to pay per call | 0.005, 0.05, 1.0 |

The dimensions are not orthogonal; some combinations select trivially, others produce a small candidate set the gateway must rank. The whole exercise is to turn the tuple into one chosen (provider, model, region) and a fallback list.
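
As an illustration, the route key can be carried as a small immutable record. The sketch below is one possible shape, not the gateway's actual type: the class name `RouteKey` and the validation rules are assumptions, and the `Literal` annotations are hints for a type checker rather than runtime checks.

```python
from dataclasses import dataclass
from typing import Literal

WorkloadClass = Literal["interactive", "batch", "background", "streaming"]
PrivacyZone = Literal["any", "in-region-only", "on-prem-only", "eu-only"]


@dataclass(frozen=True)
class RouteKey:
    """The five inputs the routing plane evaluates for one call (hypothetical type)."""

    model_alias: str
    workload_class: WorkloadClass
    latency_budget_ms: int
    privacy_zone: PrivacyZone
    cost_ceiling_usd: float

    def __post_init__(self) -> None:
        # Assumed sanity checks: a non-positive budget or ceiling can never be met.
        if self.latency_budget_ms <= 0:
            raise ValueError("latency_budget_ms must be positive")
        if self.cost_ceiling_usd <= 0:
            raise ValueError("cost_ceiling_usd must be positive")


# The interactive chat from the opening audit, expressed as a route key.
chat_key = RouteKey("smart-reasoner", "interactive", 2000, "in-region-only", 0.05)
```

Freezing the record makes it hashable, which is convenient if the key is later used for logging or memoising routing decisions.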

Why each dimension exists

model_alias decouples callers from concrete model versions. Callers ask for fast-summariser, not claude-haiku-4-5-20251001. The mapping from alias to concrete version lives in the gateway and changes as models evolve. This is the load-bearing abstraction for provider migration — chapter 09 builds the mechanics.

workload_class carries the SLO shape. Interactive workloads need responses within a second or two and graceful fallback. Batch workloads tolerate retries and slower providers. Streaming workloads need providers that support streaming at all. Background workloads can be deferred during incidents. Without workload_class, every call looks the same and the gateway cannot apply different policies.

latency_budget_ms is the maximum total time. It bounds how long the fallback chain can run; if the budget is 2000 ms and the primary took 1900 ms before failing, there is no time to fall back. The budget is per call, not per provider attempt.
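
One way to keep the budget per call rather than per attempt is to fix a single deadline when the call arrives and have every provider attempt consult it. The following is a minimal sketch under that assumption; `CallBudget`, `can_attempt`, and the `min_attempt_ms` threshold are hypothetical names, and the clock is injectable only so the example is deterministic.

```python
import time


class CallBudget:
    """Tracks one deadline shared by every provider attempt of a single call."""

    def __init__(self, latency_budget_ms: int, now=time.monotonic):
        self._now = now
        self._deadline = now() + latency_budget_ms / 1000.0

    def remaining_ms(self) -> float:
        # Never negative: an exhausted budget reports zero.
        return max(0.0, (self._deadline - self._now()) * 1000.0)

    def can_attempt(self, min_attempt_ms: float) -> bool:
        # A fallback is only worth starting if it has a realistic chance to finish.
        return self.remaining_ms() >= min_attempt_ms


# The example from the text: a 2000 ms budget, primary fails after 1900 ms.
fake_clock = [0.0]
budget = CallBudget(2000, now=lambda: fake_clock[0])
fake_clock[0] = 1.9                      # primary took 1900 ms before failing
no_time_left = not budget.can_attempt(500)  # about 100 ms remain
```

With roughly 100 ms left, a fallback that needs an assumed 500 ms minimum is not started, which matches the "no time to fall back" case above.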

privacy_zone is the residency or processing constraint. in-region-only means the call must be served by a provider with an endpoint in the caller's region. on-prem-only means a local model. This is most often a tenant or feature property; the gateway looks it up from the caller's context rather than expecting the caller to pass it.

cost_ceiling_usd is the maximum the caller will pay for one call. The gateway uses the price book to estimate the call's cost on each candidate model; candidates above the ceiling are filtered out before the call is made. Without this, callers either over-pay silently or under-deliver because they were assigned a model too cheap for the workload.
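
A sketch of the pre-call estimate, assuming the price book stores per-million-token prices and the gateway knows the input size and the output cap. The model names and prices below are placeholders invented for the example, not real list prices, and `estimate_cost_usd` and `under_ceiling` are hypothetical helpers.

```python
# Hypothetical price book: USD per million tokens. Placeholder numbers only.
PRICE_BOOK = {
    "model-small": {"input_per_mtok": 1.00, "output_per_mtok": 5.00},
    "model-large": {"input_per_mtok": 15.00, "output_per_mtok": 75.00},
}


def estimate_cost_usd(model: str, input_tokens: int, max_output_tokens: int) -> float:
    """Worst-case cost: full input plus the maximum output the call may produce."""
    price = PRICE_BOOK[model]
    total = (input_tokens * price["input_per_mtok"]
             + max_output_tokens * price["output_per_mtok"])
    return total / 1_000_000


def under_ceiling(models, input_tokens, max_output_tokens, cost_ceiling_usd):
    """Drop every candidate whose estimate exceeds the caller's ceiling."""
    return [m for m in models
            if estimate_cost_usd(m, input_tokens, max_output_tokens) <= cost_ceiling_usd]


# 2000 input tokens, up to 500 output tokens, ceiling of 0.005 USD.
survivors = under_ceiling(["model-small", "model-large"], 2000, 500, 0.005)
```

Under these placeholder prices the small model is estimated at 0.0045 USD and the large one at 0.0675 USD, so only the small model survives a 0.005 ceiling.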


The aliases — the load-bearing abstraction

Aliases are how callers stay stable while models churn underneath. A reasonable starting set for a general platform:

| Alias | Intent | Typical mapping (Claude 4.x family) |
|---|---|---|
| fast-summariser | Cheap, fast, simple text in / simple text out | claude-haiku-4-5 |
| smart-reasoner | Default for complex reasoning | claude-sonnet-4-6 |
| top-reasoner | Use when you need the strongest reasoning available | claude-opus-4-7 |
| tool-using-agent | Default model behind an agent loop with tools | claude-sonnet-4-6 |
| code-assistant | Code generation / refactoring | claude-sonnet-4-6 |
| embeddings-v3 | Embedding generation | provider-specific |
| vision-extractor | Multimodal extraction tasks | claude-sonnet-4-6 |

Three rules:

  • Aliases are intent-named, not model-named. fast-summariser survives a model swap; haiku-summariser does not.
  • The mapping is owned by the gateway team, in coordination with eval owners. A change is a release, with evals (module 04_ai_product_evals) gating the promotion.
  • Aliases are versioned only when their semantics change. Adding a model variant that does the same thing better is a mapping change, not a new alias. Adding a new capability (e.g., tool use to a previously non-tool-using alias) might be a new alias.

The hardest discipline: resist over-proliferation. A platform with thirty aliases is one where the alias-to-model mapping is doing the work of routing policy. The gateway should have ten to twenty stable aliases, with routing dimensions doing the per-call differentiation.


How the routing decision is made

The gateway evaluates the route key against a routing policy. The policy is data, not code — usually a YAML or JSON config the gateway loads from the registry at startup and on signal.

Sketch of a policy file:

aliases:
  fast-summariser:
    candidates:
      - id: "anthropic:claude-haiku-4-5:ap-south-1"
        weight: 80
        capabilities: { streaming: true, tools: false, max_input_tokens: 200000 }
      - id: "anthropic:claude-haiku-4-5:us-east-1"
        weight: 20
        capabilities: { streaming: true, tools: false, max_input_tokens: 200000 }
      - id: "openai:gpt-4o-mini:us"
        weight: 0   # standby for fallback only
        capabilities: { streaming: true, tools: true, max_input_tokens: 128000 }

  smart-reasoner:
    candidates:
      - id: "anthropic:claude-sonnet-4-6:ap-south-1"
        weight: 100
      - id: "openai:gpt-4o:eu-west-1"
        weight: 0  # standby

policies:
  workload_class:
    interactive:
      latency_budget_ceiling_ms: 5000
      max_retries: 1
    batch:
      latency_budget_ceiling_ms: 60000
      max_retries: 3
    background:
      latency_budget_ceiling_ms: 600000
      max_retries: 5

  privacy_zone:
    in-region-only:
      allowed_regions: [ap-south-1]
    eu-only:
      allowed_regions: [eu-west-1, eu-central-1]
    on-prem-only:
      allowed_providers: [local-ollama, internal-vllm]

  cost_ceiling:
    enforce: true
    price_book_path: /etc/gateway/price_book.yaml

Evaluation per call:

1. Look up alias -> candidate set.
2. Filter candidates by privacy_zone (drop any whose region/provider is disallowed).
3. Filter candidates by cost_ceiling (estimate cost from price book; drop over-ceiling).
4. Filter candidates by capability (must support streaming if requested, tools if requested).
5. Sort remaining candidates by weight (or by a smarter scorer; see below).
6. The top one is the primary; the rest is the fallback chain.
7. Apply workload_class constraints: cap latency_budget_ms at the class ceiling.

The output is (primary, [fallbacks]) and an effective latency_budget_ms.
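
The seven steps can be sketched as a single function. This is an illustrative reading of the steps, not the gateway's implementation: the policy is inlined as Python data mirroring the sketch above, `est_cost_usd` stands in for the price-book estimate, and leaving the budget uncapped for classes without a configured ceiling (such as streaming) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    id: str                # "provider:model:region", as in the policy sketch
    weight: int
    streaming: bool = True
    tools: bool = False

    @property
    def provider(self) -> str:
        return self.id.split(":")[0]

    @property
    def region(self) -> str:
        return self.id.split(":")[2]


ALIASES = {
    "fast-summariser": [
        Candidate("anthropic:claude-haiku-4-5:ap-south-1", 80),
        Candidate("anthropic:claude-haiku-4-5:us-east-1", 20),
        Candidate("openai:gpt-4o-mini:us", 0, tools=True),  # standby
    ],
}
ZONES = {
    "any": {},
    "in-region-only": {"allowed_regions": ["ap-south-1"]},
    "eu-only": {"allowed_regions": ["eu-west-1", "eu-central-1"]},
    "on-prem-only": {"allowed_providers": ["local-ollama", "internal-vllm"]},
}
CLASS_CEILING_MS = {"interactive": 5000, "batch": 60000, "background": 600000}


def zone_allows(c: Candidate, zone: str) -> bool:
    rule = ZONES[zone]
    if "allowed_regions" in rule and c.region not in rule["allowed_regions"]:
        return False
    if "allowed_providers" in rule and c.provider not in rule["allowed_providers"]:
        return False
    return True


def route(alias, workload_class, latency_budget_ms, privacy_zone, cost_ceiling_usd,
          est_cost_usd, need_streaming=False, need_tools=False):
    cands = ALIASES[alias]                                                # step 1
    cands = [c for c in cands if zone_allows(c, privacy_zone)]           # step 2
    cands = [c for c in cands if est_cost_usd[c.id] <= cost_ceiling_usd]  # step 3
    cands = [c for c in cands                                             # step 4
             if (c.streaming or not need_streaming) and (c.tools or not need_tools)]
    cands = sorted(cands, key=lambda c: c.weight, reverse=True)          # step 5
    if not cands:
        raise LookupError("NO_ROUTE_AVAILABLE")
    primary, fallbacks = cands[0], cands[1:]                              # step 6
    # Step 7. Assumption: a class with no configured ceiling leaves the budget as-is.
    ceiling = CLASS_CEILING_MS.get(workload_class, latency_budget_ms)
    return primary.id, [c.id for c in fallbacks], min(latency_budget_ms, ceiling)


# Hypothetical per-call estimates, as a price-book lookup might produce them.
COSTS = {c.id: 0.002 for c in ALIASES["fast-summariser"]}
decision = route("fast-summariser", "interactive", 8000, "any", 0.01, COSTS)
```

In this run the caller asked for 8000 ms, so the interactive ceiling caps the effective budget at 5000 ms. Calling the same function with `privacy_zone="eu-only"` raises `NO_ROUTE_AVAILABLE`, because the sketched policy lists no EU candidate for this alias.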

A smarter scorer

Static weights are fine for stable workloads. At scale, the routing plane often grows a scorer that takes recent observability into account:

  • Live latency p95 per candidate — penalise candidates currently slow
  • Live error rate per candidate — penalise candidates currently failing
  • Live quota headroom — penalise candidates near their per-provider limit
  • Cost vs ceiling margin — prefer candidates with more headroom

The scorer is a function from (static_weight, live_signals) → effective_score. Routing picks the highest score; everything below is the fallback chain. This is sometimes called "smart routing" and it pays off most when one of several near-equivalent candidates is in trouble.

Caveats: the scorer adds complexity and a new failure mode (a stale signal can route everything away from a healthy candidate). Start with static weights; add the scorer when traffic justifies and observability is reliable.
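
To make the shape of such a scorer concrete, here is a minimal sketch that multiplies the static weight by one factor per live signal. Every coefficient and threshold in it is an arbitrary assumption chosen for illustration (for example, treating a 10% error rate as fully disqualifying), and `LiveSignals` and `effective_score` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class LiveSignals:
    p95_latency_ms: float
    error_rate: float       # 0.0 to 1.0
    quota_headroom: float   # 0.0 (exhausted) to 1.0 (idle)
    cost_margin: float      # (ceiling - estimate) / ceiling, 0.0 to 1.0


def effective_score(static_weight: float, s: LiveSignals, latency_budget_ms: float) -> float:
    """Scale the static weight down as each live signal degrades."""
    # Slow candidates: the factor reaches 0 when p95 consumes the whole budget.
    latency_factor = max(0.0, 1.0 - s.p95_latency_ms / latency_budget_ms)
    # Failing candidates: assumed linear penalty, zero at a 10% error rate.
    health_factor = 1.0 - min(1.0, s.error_rate * 10)
    # Quota: no penalty above 50% headroom, linear penalty below it.
    quota_factor = min(1.0, s.quota_headroom * 2)
    # Cost: a mild preference for candidates with more margin under the ceiling.
    cost_factor = 0.5 + 0.5 * s.cost_margin
    return static_weight * latency_factor * health_factor * quota_factor * cost_factor


healthy = effective_score(20, LiveSignals(500, 0.0, 0.8, 0.5), latency_budget_ms=2000)
troubled = effective_score(80, LiveSignals(1800, 0.05, 0.2, 0.5), latency_budget_ms=2000)
```

Here the candidate with the lower static weight outranks the nominally preferred one because the latter is slow, erroring, and close to its quota. A multiplicative form means any single bad signal can sink a candidate, which is also why stale signals are dangerous.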


Privacy zones

Privacy zones are the constraint where the gateway most often saves the compliance team weeks of work. The zone is usually derived from the caller, not passed by the caller — the caller's tenant has a configured residency policy, and the gateway looks it up.

tenants:
  acme-corp:
    privacy_zone: in-region-only
    region: ap-south-1
  globex-eu:
    privacy_zone: eu-only
    allowed_regions: [eu-west-1, eu-central-1]
  contoso-onprem:
    privacy_zone: on-prem-only
    allowed_providers: [local-vllm-cluster]

The routing plane filters candidates against the tenant's zone. A call from globex-eu cannot end up on a US region even if that region's model would be cheaper or faster; the candidates are removed before scoring. The compliance team's question — "did any call from globex-eu leave the EU?" — is answered by querying the audit log for tenant_id = globex-eu and region not in EU and expecting zero results.

The gateway enforces the zone; it does not recommend it. A misconfigured zone is a security event, not a performance hiccup.
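
The compliance query described above could look like the following sketch. The `audit_log` table and its columns are assumptions made for the example (the chapter does not specify the audit schema), and an in-memory SQLite database stands in for whatever store holds the log.

```python
import sqlite3

EU_REGIONS = ("eu-west-1", "eu-central-1")

# Hypothetical audit schema: one row per served call.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE audit_log (call_id TEXT, tenant_id TEXT, region TEXT)")
conn.executemany(
    "INSERT INTO audit_log VALUES (?, ?, ?)",
    [
        ("c1", "globex-eu", "eu-west-1"),
        ("c2", "globex-eu", "eu-central-1"),
        ("c3", "acme-corp", "ap-south-1"),
    ],
)


def zone_violations(conn, tenant_id, allowed_regions):
    """Return every call for the tenant that was served outside its allowed regions."""
    placeholders = ",".join("?" for _ in allowed_regions)
    sql = (
        "SELECT call_id, region FROM audit_log "
        f"WHERE tenant_id = ? AND region NOT IN ({placeholders})"
    )
    return conn.execute(sql, (tenant_id, *allowed_regions)).fetchall()


# Expectation for a correctly enforced zone: an empty result.
violations = zone_violations(conn, "globex-eu", EU_REGIONS)
```

Running this check on a schedule, and alerting on any non-empty result, turns the residency promise into something that is continuously verified.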


Conflict resolution

The dimensions sometimes conflict. The gateway resolves them by a documented precedence.

A reasonable precedence:

  1. Privacy zone — never violated. If no candidate satisfies the zone, the call refuses.
  2. Capability — must support what the call needs (tools, streaming, vision).
  3. Cost ceiling — refuse if no candidate is under the ceiling.
  4. Latency budget — applies to provider selection and fallback budget.
  5. Workload class: provides defaults for latency and retries. A caller's latency_budget_ms can replace the default, including raising it, but only up to the class ceiling.
  6. Weight / scorer — picks among satisfying candidates.

Refusals produce a structured error with code: NO_ROUTE_AVAILABLE, human_hint (suggesting which constraint failed), and model_action: "broaden the constraint or escalate".
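
A sketch of how the hard constraints might be applied in precedence order so that the refusal can name the constraint that emptied the candidate set. The three field names come from the text; the candidate dictionaries, the function names, and the wording of `human_hint` are assumptions.

```python
def apply_in_precedence(candidates, checks):
    """Apply (name, predicate) checks in order; refuse at the first that leaves nothing."""
    remaining = list(candidates)
    for name, keep in checks:
        remaining = [c for c in remaining if keep(c)]
        if not remaining:
            return [], {
                "code": "NO_ROUTE_AVAILABLE",
                "human_hint": f"no candidate satisfies constraint: {name}",
                "model_action": "broaden the constraint or escalate",
            }
    return remaining, None


def resolve(candidates, allowed_regions, need_tools, cost_ceiling_usd):
    # Order follows the precedence list: privacy zone, capability, cost ceiling.
    checks = [
        ("privacy_zone", lambda c: c["region"] in allowed_regions),
        ("capability", lambda c: c["tools"] or not need_tools),
        ("cost_ceiling", lambda c: c["est_cost_usd"] <= cost_ceiling_usd),
    ]
    return apply_in_precedence(candidates, checks)


# A tenant restricted to ap-south-1 whose only candidate is hosted in the US.
us_only = [{"id": "provider-x:model-y:us-east-1", "region": "us-east-1",
            "tools": True, "est_cost_usd": 0.01}]
survivors, refusal = resolve(us_only, ["ap-south-1"], need_tools=False, cost_ceiling_usd=0.05)
```

Because the checks run in precedence order, a call that fails both the zone and the cost ceiling is reported as a privacy-zone refusal, the higher-priority constraint.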


Canary and gradual rollouts

When the alias-to-model mapping changes (a new version, a new provider, a new region), the gateway supports a canary slice: a small fraction of traffic on the new candidate, the rest on the existing one.

aliases:
  fast-summariser:
    candidates:
      - id: "anthropic:claude-haiku-4-5:ap-south-1"
        weight: 90
      - id: "anthropic:claude-haiku-4-7:ap-south-1"   # new model
        weight: 10                                     # 10% canary

The rollout is operated by adjusting weights: 10%, 25%, 50%, 100%. Audit logs and per-candidate metrics validate that the new model meets evals before each step. A regression is rolled back by dropping the canary's weight to zero.

Promotion is gated by evals; the gateway is the executor, not the decider.
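
The text does not say how a weight becomes a traffic share. One common approach, assumed here, is to hash a stable per-request key into a bucket and walk the cumulative weights, so that the same key always lands on the same candidate. The function name `pick_by_weight` and the use of a request id as the key are illustrative choices.

```python
import hashlib


def pick_by_weight(candidates, routing_key: str) -> str:
    """candidates: list of (candidate_id, weight). Weight 0 never receives live traffic."""
    live = [(cid, w) for cid, w in candidates if w > 0]
    total = sum(w for _, w in live)
    if total == 0:
        raise ValueError("no candidate has a positive weight")
    # Hash the key to a stable, roughly uniform bucket in [0, total).
    digest = hashlib.sha256(routing_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % total
    upto = 0
    for cid, w in live:
        upto += w
        if bucket < upto:
            return cid
    raise AssertionError("unreachable: bucket is always below the total weight")


CANARY = [
    ("anthropic:claude-haiku-4-5:ap-south-1", 90),
    ("anthropic:claude-haiku-4-7:ap-south-1", 10),  # 10% canary
]
chosen = pick_by_weight(CANARY, "request-12345")
```

Stepping the rollout means editing the two weights; setting the canary's weight to 0 removes it from live selection immediately, which is the rollback described above.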


How routing interacts with the other surfaces

  • Fallback (chapter 04) — the routing plane produces the fallback chain; the fallback executor walks it.
  • Quota (chapter 05) — after routing picks a candidate, the quota plane checks whether the call fits the provider's bucket; if not, routing re-evaluates excluding that candidate.
  • Cost (chapter 07) — the cost ceiling is a routing filter; cost attribution after the call is observability.
  • Cache (chapter 08) — cache check happens before routing decides anything heavyweight; a cache hit short-circuits routing.
  • Drift (chapter 09) — alias mapping changes are versioned and rolled out through routing weights.
  • Privacy (chapter 10) — privacy zone is a routing filter; the routing plane is where the constraint is enforced.
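
The ordering of the cache and quota interactions can be sketched as follows. This is a simplification under stated assumptions: the cache is a plain mapping, `quota_ok` is a hypothetical callable supplied by the quota plane, and "re-evaluating while excluding a candidate" is modelled as moving to the next entry of an already ranked chain, which is equivalent only when ranking uses static weights.

```python
def dispatch(cache_key, chain, cache, quota_ok):
    """Decide where a call goes.

    chain: candidate ids ranked by routing, primary first.
    quota_ok: callable(candidate_id) -> bool, owned by the quota plane.
    Returns (source, candidate_id, skipped_for_quota).
    """
    # Cache first: a hit short-circuits routing entirely.
    if cache_key in cache:
        return ("cache", None, [])
    skipped = []
    for candidate in chain:
        if quota_ok(candidate):
            return ("provider", candidate, skipped)
        # Over quota: exclude this candidate and consider the next one.
        skipped.append(candidate)
    return ("refused", None, skipped)


chain = ["provider-a:model:region-1", "provider-b:model:region-1"]
exhausted = {"provider-a:model:region-1"}
result = dispatch("prompt-hash-1", chain, cache={}, quota_ok=lambda c: c not in exhausted)
```

Recording which candidates were skipped for quota keeps the decision explainable in the audit log.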

How to recognise broken routing in the wild

  • The mapping from "what calls this feature makes" to "what model handles them" is implicit, scattered, or hard-coded
  • One model handles all workloads regardless of latency or cost shape
  • Privacy/residency claims are made but not enforced by the routing layer
  • The "fast model" and "smart model" choice is in product code, not in a policy file
  • Promoting a new model version requires editing each product
  • Canary rollouts of new model versions are done by deploying product code

Interview Q&A

Q1. Why use intent-named aliases (fast-summariser) instead of model-named ones (haiku-summariser)? Because the alias is the stable contract; the model is the implementation. Swapping the model under the alias should not require touching any caller. Intent-named aliases survive a vendor change, a model retirement, or a capability upgrade. Model-named aliases bake the implementation into every caller and force a coordinated migration on every model change, which recreates the chapter-opening audit's problem inside product code. Wrong-answer notes: "for clarity" is a weak answer; the specific value is decoupling.

Q2. The platform has a tenant whose privacy_zone = in-region-only and whose region is India. The only candidate for smart-reasoner is currently in the US (provider has not yet enabled India). What should the gateway do? Refuse the call with NO_ROUTE_AVAILABLE. Privacy zone is non-negotiable. The error surfaces to the product, which informs the tenant the feature is unavailable for them until in-region capacity exists. The platform team treats this as a backlog item for either the provider (expand to India) or the alias mapping (use a different provider that does serve India). What the gateway must not do is "make an exception this once" — exceptions become precedents and the audit log lies. Wrong-answer notes: "fall back to US" is the breach.

Q3. Walk through a routing decision for an interactive call to fast-summariser from a tenant in EU with a 1500 ms latency budget and a 0.001 USD cost ceiling. Look up the fast-summariser candidates. Filter by privacy_zone=eu-only: keep only EU candidates. Filter by cost_ceiling=0.001: estimate each candidate's per-call cost from the price book at worst-case input size, and keep only candidates whose estimate is at or under the ceiling. Filter by capability: the call needs no tools, and needs streaming only if requested. If no candidates remain after any filter, refuse with NO_ROUTE_AVAILABLE. Otherwise sort by score; the top candidate is the primary and the rest form the fallback chain. Cap latency_budget_ms at the workload_class ceiling (interactive: 5000); 1500 is already under it, so no cap applies. Wrong-answer notes: running steps out of order (cost before privacy) wastes work and risks breaching the zone before refusal.

Q4. The platform wants to introduce a new model claude-haiku-4-7 behind the fast-summariser alias. How does the rollout look? Add the new model as a candidate with weight: 0 so it does not receive live traffic; verify the gateway can call it via synthetic traffic. Run the eval suite (module 04_ai_product_evals) against it. If passing, set weight to 5, a 5% canary. Watch evals, latency, error rate, and cost per call for 24–48 hours. Promote in steps: 25, 50, 100. At 100% the previous model can be removed from the candidate list, but typically remains as a fallback (weight: 0, available for chain) for some weeks. The rollout is operated by adjusting weights; product code is untouched. Wrong-answer notes: "ship the new model and roll back if it breaks" loses the per-step eval and the canary's protection.


What to do differently after reading this

  • Define a small set of intent-named aliases. Document each alias's intent and current mapping.
  • Move privacy zones from "we promise X" to "the gateway refuses routes that violate X." Audit the policy and the enforcement.
  • Add workload classes as explicit inputs. Each call site declares one; the gateway applies the class's defaults.
  • Wire price-book-based cost ceilings into routing. Refuse calls that exceed.
  • Build the canary rollout flow before the next model migration, not during.

Bridge. Routing decides what to try first. Fallback decides what to try when the first choice fails. Providers go down; some models throttle; transient errors compound. The next chapter builds the fallback chain: what counts as a fallback, what order to walk them in, and what the caller is told when the chain runs out. → 04-fallback-chains.md