05. Rate limits and quotas

Routing decides which provider serves the call. The quota plane decides whether the call gets to make the attempt at all. Providers have finite throughput; the gateway is what distributes it fairly across tenants, features, and agents.

An infrastructure engineer at a Pune analytics platform investigates why an indexing job that ran nightly without trouble for months has started failing every other run. The job processes invoices through an extraction model; it bursts hundreds of calls per minute. The gateway's audit shows that the job's bursts are now provoking the provider to throttle the shared API key the gateway uses, which knocks the interactive chat agent — using the same provider, same key — into the fallback chain. Two unrelated workloads are competing for one provider's quota with no internal fairness layer. The fix is not to ask the provider for more capacity; it is to give the gateway its own quota plane that gives the chat agent a guaranteed share while letting the indexing job consume whatever is left.

This chapter builds the quota plane. The math is simple; the discipline is in the layering.

Two layers of buckets

The gateway operates two buckets per call:

Per-tenant / per-feature / per-agent bucket — fairness inside the platform. No caller can monopolise.
Per-provider bucket — respects the provider's own limits. Sized below the provider's actual limit so the gateway never sees a 429 unless the provider's limit moved.

Both must pass for the call to proceed. The order matters: check the platform-internal bucket first, then the provider bucket. Both checks are cheap, but a call refused at the platform-internal layer never reaches the shared provider bucket, which saves cycles on refused calls.

The platform-internal bucket prevents tenants from drowning each other; the provider bucket prevents the gateway from blowing past the provider's threshold and triggering throttles that affect all callers.
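The two-layer check can be sketched as follows. This is a minimal illustration, not the gateway's actual code: `Bucket` and `admit` are hypothetical names, the bucket is a plain counter with no refill, and refunding the tenant token when the provider layer refuses is an assumed design choice.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    """Simplified bucket: a token counter with no refill (illustration only)."""
    tokens: int

    def try_take(self, cost: int = 1) -> bool:
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

    def give_back(self, cost: int = 1) -> None:
        self.tokens += cost


def admit(tenant_bucket: Bucket, provider_bucket: Bucket) -> str:
    """Return 'ok' or the name of the layer that refused the call."""
    # Layer 1: platform-internal fairness. A refusal here never touches
    # the provider bucket.
    if not tenant_bucket.try_take():
        return "refused:tenant"
    # Layer 2: provider headroom. On refusal, refund the tenant token so the
    # tenant is not charged for a call that was never issued (assumption).
    if not provider_bucket.try_take():
        tenant_bucket.give_back()
        return "refused:provider"
    return "ok"


tenant = Bucket(tokens=1)
provider = Bucket(tokens=5)
results = [admit(tenant, provider), admit(tenant, provider)]
```

The second call is refused at the tenant layer, so the provider bucket is only charged once.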

Token-bucket vs leaky-bucket

Two algorithms are common; both are simple; pick one and use it consistently.

Token bucket. A bucket holds up to N tokens. Tokens refill at rate R per second. Each call consumes one token (or more, weighted by request size). If the bucket has tokens, the call proceeds. If empty, the call waits or is refused.

Allows bursts up to bucket size, then steady at rate R.
Good for "mostly steady, occasional spike" workloads.
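A minimal in-memory sketch of the token bucket described above. The class name `TokenBucket` and its parameters are hypothetical, and the caller passes the clock reading explicitly so the behaviour is easy to follow. A production bucket would live in a shared store rather than in process memory.

```python
class TokenBucket:
    """Token bucket with lazy refill; the caller supplies the clock reading."""

    def __init__(self, capacity: float, refill_per_sec: float, now: float = 0.0):
        self.capacity = capacity            # N: maximum burst size
        self.refill_per_sec = refill_per_sec  # R: steady-state rate
        self.tokens = capacity              # start full
        self.last = now

    def _refill(self, now: float) -> None:
        # Add the tokens earned since the last update, never above capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = max(self.last, now)

    def try_take(self, now: float, cost: float = 1.0) -> bool:
        """Consume `cost` tokens if available; otherwise refuse immediately."""
        self._refill(now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Burst of 3, refilling at 1 token per second.
bucket = TokenBucket(capacity=3, refill_per_sec=1.0)
burst = [bucket.try_take(now=0.0) for _ in range(4)]   # fourth call is refused
later = bucket.try_take(now=1.0)                        # one token has refilled
```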

Leaky bucket. A queue accepts calls at any rate; calls drain from the queue at rate R per second. If the queue is full, the call is refused.

Smooths bursts entirely; the consumer never sees a spike.
Adds latency to bursty workloads.
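For comparison, a leaky bucket can be sketched as a bounded queue with a fixed drain rate. `LeakyBucket`, `offer`, and `drain` are hypothetical names, and time is passed in explicitly. A real implementation would drain from a timer or worker loop rather than from explicit `drain` calls.

```python
from collections import deque


class LeakyBucket:
    """Bounded queue that releases calls no faster than drain_per_sec."""

    def __init__(self, queue_size: int, drain_per_sec: float):
        self.queue_size = queue_size
        self.drain_interval = 1.0 / drain_per_sec
        self.queue = deque()
        self.next_release = 0.0   # earliest time the next call may leave

    def offer(self, call_id: str, now: float) -> bool:
        """Enqueue a call, or refuse it when the queue is full."""
        if len(self.queue) >= self.queue_size:
            return False
        if not self.queue:
            # An idle queue must not bank up unused drain slots.
            self.next_release = max(self.next_release, now)
        self.queue.append(call_id)
        return True

    def drain(self, now: float) -> list:
        """Release every queued call whose drain slot has arrived by `now`."""
        released = []
        while self.queue and self.next_release <= now:
            released.append(self.queue.popleft())
            self.next_release += self.drain_interval
        return released


lb = LeakyBucket(queue_size=2, drain_per_sec=2.0)
accepted = [lb.offer(c, now=0.0) for c in ("a", "b", "c")]  # "c" finds the queue full
first = lb.drain(now=0.0)    # one call leaves immediately
second = lb.drain(now=0.5)   # the next leaves half a second later
```

The burst of three arrives at once, but the consumer sees at most two calls per second, which is the smoothing (and the added latency) described above.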

For model gateways, token bucket is the more common choice because callers expect immediate responses or immediate refusals, not queue waits. Token bucket also matches the model providers' own algorithms more naturally.

What units to bucket by

Provider rate limits are usually expressed in a combination of:

Requests per minute (RPM) — call count, regardless of size
Tokens per minute (TPM) — input + output tokens
Concurrent requests — calls in-flight at the same time

A serious gateway buckets on at least the first two. Concurrent-request limits are usually enforced by counting in-flight calls per candidate.
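A sketch of checking all three units for one candidate. The names (`CandidateLimits`, `try_start`, `finish`) are hypothetical, refill is omitted, and the token estimate is an assumption: output tokens are only known after the call, so this sketch charges an estimate up front and reconciles afterwards.

```python
from dataclasses import dataclass


@dataclass
class CandidateLimits:
    rpm_left: int          # requests remaining in the current minute
    tpm_left: int          # tokens remaining in the current minute
    concurrent_cap: int    # maximum calls in flight
    in_flight: int = 0


def try_start(limits: CandidateLimits, estimated_tokens: int) -> str:
    """Admit the call only if concurrency, RPM, and TPM all have room."""
    if limits.in_flight >= limits.concurrent_cap:
        return "refused:concurrency"
    if limits.rpm_left < 1:
        return "refused:rpm"
    if limits.tpm_left < estimated_tokens:
        return "refused:tpm"
    limits.rpm_left -= 1
    limits.tpm_left -= estimated_tokens
    limits.in_flight += 1
    return "ok"


def finish(limits: CandidateLimits, estimated_tokens: int, actual_tokens: int) -> None:
    """Release the concurrency slot and correct the token charge."""
    limits.in_flight -= 1
    limits.tpm_left -= actual_tokens - estimated_tokens


limits = CandidateLimits(rpm_left=2, tpm_left=1000, concurrent_cap=1)
first = try_start(limits, estimated_tokens=600)
second = try_start(limits, estimated_tokens=100)   # blocked: one call already in flight
finish(limits, estimated_tokens=600, actual_tokens=700)
third = try_start(limits, estimated_tokens=400)    # blocked: only 300 tokens remain
```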

Per-tenant buckets are usually configured per minute:

tenants:
  acme-corp:
    quotas:
      smart-reasoner:
        rpm: 600
        tpm: 800000
        burst: 60                 # bucket size for token-bucket
      fast-summariser:
        rpm: 6000
        tpm: 4000000
        burst: 300

  globex-eu:
    quotas:
      smart-reasoner:
        rpm: 100
        tpm: 150000
        burst: 20

A tenant's quota for an alias is shared by all of that tenant's callers: their combined usage counts against it. Within a tenant, sub-buckets per feature or per agent provide further isolation:

tenants:
  acme-corp:
    quotas:
      smart-reasoner:
        rpm: 600
        sub_buckets:
          feature.chat: { rpm: 300 }
          feature.indexing: { rpm: 200 }
          feature.analytics: { rpm: 100 }

The sub-buckets sum to less than or equal to the parent. The parent is the hard cap; the sub-buckets are guaranteed minima with the unused portion shared.
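The constraint that sub-buckets never exceed the parent can be checked when the configuration is loaded. A minimal sketch, assuming the YAML above has already been parsed into plain dictionaries; `validate_sub_buckets` is a hypothetical helper, not part of any particular gateway.

```python
def validate_sub_buckets(quota: dict) -> list:
    """Return a list of error strings; empty means the quota is consistent."""
    errors = []
    parent_rpm = quota["rpm"]
    subs = quota.get("sub_buckets", {})
    total = sum(sub["rpm"] for sub in subs.values())
    if total > parent_rpm:
        errors.append(
            f"sub-bucket rpm total {total} exceeds parent rpm {parent_rpm}"
        )
    return errors


# The smart-reasoner quota from the example above: 300 + 200 + 100 = 600.
acme_smart_reasoner = {
    "rpm": 600,
    "sub_buckets": {
        "feature.chat": {"rpm": 300},
        "feature.indexing": {"rpm": 200},
        "feature.analytics": {"rpm": 100},
    },
}
ok_errors = validate_sub_buckets(acme_smart_reasoner)

# A hypothetical misconfiguration: guaranteed minima add up to 700.
overcommitted = {
    "rpm": 600,
    "sub_buckets": {
        "feature.chat": {"rpm": 500},
        "feature.indexing": {"rpm": 200},
    },
}
bad_errors = validate_sub_buckets(overcommitted)
```

Rejecting an over-committed configuration at load time keeps the guaranteed minima honest: a guarantee that sums to more than the hard cap cannot be met.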

Where the bucket state lives

Token buckets are simple in principle and quietly subtle in distributed systems. The bucket counter must be globally consistent across gateway instances (a tenant's calls hitting different gateway pods cannot each see "the bucket has tokens" simultaneously when the bucket is empty).

Two common implementations:

Centralised store (Redis). Each call atomically decrements the bucket via a Lua script. Refill happens lazily on read or by a tick. It is fast (sub-millisecond) and simple, with one failure mode to plan for: if Redis goes down, the quota plane goes down, and policy must decide whether to fail open or fail closed.

Sharded local + sync. Each gateway pod holds a local bucket fraction. A background process re-balances. Lower latency but eventual-consistency holes; under sudden bursts a tenant can briefly exceed quota.

For most platforms, Redis-backed token buckets are the right starting choice. The Redis dependency is operated as tier-zero (chapter 02's tier-zero discipline applies here).
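The fail-open versus fail-closed decision can be made explicit in code. The sketch below is store-agnostic and uses hypothetical names throughout: `take_token` stands in for whatever wraps the atomic decrement against the central store, and `QuotaStoreUnavailable` is an assumed exception that wrapper would raise when the store cannot be reached.

```python
class QuotaStoreUnavailable(Exception):
    """Raised by the (hypothetical) store client when the central store is down."""


def check_quota(take_token, fail_open: bool) -> dict:
    """Ask the store for a token; apply the outage policy if it is unreachable."""
    try:
        allowed = take_token()
    except QuotaStoreUnavailable:
        # Outage: the policy, not the store, decides. Mark the decision as
        # degraded so it can be flagged in audit.
        return {"allowed": fail_open, "degraded": True}
    return {"allowed": allowed, "degraded": False}


def healthy_store() -> bool:
    return True   # stand-in for a successful atomic decrement


def broken_store() -> bool:
    raise QuotaStoreUnavailable("store unreachable")


normal = check_quota(healthy_store, fail_open=False)
closed = check_quota(broken_store, fail_open=False)
opened = check_quota(broken_store, fail_open=True)
```

Failing closed protects the provider limit at the cost of availability. Failing open does the reverse. Either way the choice is visible in one place rather than implied by how an exception happens to propagate.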

What happens when the bucket is empty

Three responses, picked by policy per alias and tenant:

Response	Meaning
Refuse	Return `RATE_LIMIT_EXCEEDED` with `retry_after_ms` from the bucket's refill curve
Queue	Hold the call until tokens are available, up to a maximum wait
Borrow	Use tokens from a shared pool above the tenant's bucket; flagged in audit

Most platforms refuse. Queueing is appropriate for batch workloads where latency tolerance is high. Borrowing is rare and dangerous — it makes per-tenant quotas advisory rather than enforced, which defeats the fairness goal.

The refusal carries retry_after_ms. The caller (or its retry wrapper) can wait that long and try again. The model platform's own retry layer can use the same value as its delay before the next attempt.
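For a token bucket, the retry hint follows from the refill rate: it is the time needed to earn back the missing tokens. A minimal sketch with hypothetical function names; the error code and field name are the ones used in this chapter.

```python
import math


def retry_after_ms(tokens: float, cost: float, refill_per_sec: float) -> int:
    """Milliseconds until the bucket holds at least `cost` tokens."""
    deficit = cost - tokens
    if deficit <= 0:
        return 0
    # Round up so the caller never retries a moment too early.
    return math.ceil(deficit * 1000 / refill_per_sec)


def refusal(tokens: float, cost: float, refill_per_sec: float) -> dict:
    """Build the structured refusal returned to the caller."""
    return {
        "error": "RATE_LIMIT_EXCEEDED",
        "retry_after_ms": retry_after_ms(tokens, cost, refill_per_sec),
    }


# A 600 RPM bucket refills at 10 tokens per second.
empty_bucket = refusal(tokens=0, cost=1, refill_per_sec=10)
half_token = retry_after_ms(tokens=0.5, cost=1, refill_per_sec=10)
```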

Provider-side bucket

The per-provider bucket is what stops the gateway from causing throttles. Sized below the provider's published limit, with margin:

providers:
  anthropic:
    accounts:
      production:
        rpm_cap: 4000               # provider's limit is 5000; we cap at 4000
        tpm_cap: 8000000            # provider's limit is 10M
        concurrent_cap: 80          # provider's limit is 100
  openai:
    accounts:
      production:
        rpm_cap: 2400
        tpm_cap: 4800000
        concurrent_cap: 40

The margin matters. Providers sometimes throttle at lower-than-advertised limits, or in bursts shorter than your bucket measures. Capping at 80% of published limits is a reasonable default; tighten or loosen based on observed throttle rates.
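The 80% default is simple arithmetic. The sketch below derives the caps from the published limits quoted in the comments of the Anthropic example; `capped` is a hypothetical helper, and integer percentages are used to avoid floating-point rounding.

```python
def capped(published_limit: int, margin_pct: int = 80) -> int:
    """Gateway-side cap as a percentage of the provider's published limit."""
    return published_limit * margin_pct // 100


# Published limits as quoted in the config comments above.
published = {"rpm": 5000, "tpm": 10_000_000, "concurrent": 100}
caps = {name: capped(limit) for name, limit in published.items()}

# Tightening the margin after observing throttles is a one-parameter change.
tighter_rpm = capped(published["rpm"], margin_pct=70)
```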

The provider bucket is also the place where multiple accounts (if you have them) are pooled. A platform with two Anthropic accounts (e.g., one per region) has two provider buckets, each capped separately.

The math of fairness

The most useful fairness property in a model gateway: a busy tenant cannot starve a quiet one. Token buckets per tenant give this automatically, as long as each tenant's bucket is independent.

A second property worth designing for: a tenant's bursty workload should be able to consume slack from quiet workloads. This argues for a hierarchy — per-tenant buckets summing to a global cap, with the global cap pooled across tenants but each tenant getting its committed minimum.

A simple working policy:

Each tenant has a committed RPM/TPM that is always honoured.
A shared overflow pool is available to any tenant whose committed bucket is empty.
The overflow pool is bounded so a single tenant cannot consume all of it.
Audit records whether a call came from committed or overflow.
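The four rules above can be sketched in a few lines. This is an illustration under simplifying assumptions: the names (`OverflowPool`, `admit_with_overflow`) are hypothetical, buckets are plain counters with no refill, and the audit log is an in-memory list.

```python
from dataclasses import dataclass, field


@dataclass
class OverflowPool:
    tokens: int                  # shared capacity above the committed buckets
    per_tenant_cap: int          # bound on what one tenant may take
    taken: dict = field(default_factory=dict)


def admit_with_overflow(tenant: str, committed: dict, pool: OverflowPool,
                        audit: list) -> bool:
    # Rule 1: the committed bucket is always honoured first.
    if committed.get(tenant, 0) > 0:
        committed[tenant] -= 1
        audit.append((tenant, "committed"))
        return True
    # Rules 2 and 3: fall back to the shared pool, bounded per tenant.
    used = pool.taken.get(tenant, 0)
    if pool.tokens > 0 and used < pool.per_tenant_cap:
        pool.tokens -= 1
        pool.taken[tenant] = used + 1
        audit.append((tenant, "overflow"))
        return True
    # Rule 4: refusals are audited too.
    audit.append((tenant, "refused"))
    return False


committed = {"acme-corp": 1, "globex-eu": 1}
pool = OverflowPool(tokens=3, per_tenant_cap=2)
audit = []
acme = [admit_with_overflow("acme-corp", committed, pool, audit) for _ in range(4)]
globex = admit_with_overflow("globex-eu", committed, pool, audit)
```

The busy tenant gets its committed call plus two overflow calls, is then refused even though the pool still holds a token, and the quiet tenant's committed call is unaffected.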

This is the same shape as cloud-provider quota models (compute reservations + on-demand). The gateway is implementing the same pattern for model capacity.

Cost vs throughput

The quota plane interacts with the cost plane (chapter 07): a tenant can be RPM-rich and cost-poor, or vice versa. The gateway enforces both: the call passes only if the tenant has both throughput quota and remaining budget.

A high-RPM tenant who has exhausted their monthly budget receives BUDGET_EXCEEDED rather than RATE_LIMIT_EXCEEDED, even when the rate bucket would also refuse: the budget refusal takes precedence.
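The precedence between the two refusals fits in one small function. `refusal_code` is a hypothetical helper; the two error codes are the ones named in this chapter.

```python
from typing import Optional


def refusal_code(has_throughput: bool, has_budget: bool) -> Optional[str]:
    """Return the refusal code for a call, or None if the call may proceed."""
    # Budget is checked first, so it wins when both planes would refuse.
    if not has_budget:
        return "BUDGET_EXCEEDED"
    if not has_throughput:
        return "RATE_LIMIT_EXCEEDED"
    return None


both_exhausted = refusal_code(has_throughput=False, has_budget=False)
rate_only = refusal_code(has_throughput=False, has_budget=True)
allowed = refusal_code(has_throughput=True, has_budget=True)
```

Reporting the budget refusal first tells the caller that waiting for the bucket to refill will not help.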

How quotas interact with the other surfaces

Routing (chapter 03) — the routing scorer can deprioritise candidates near their provider-bucket cap.
Fallback (chapter 04) — a candidate at quota gets skipped in the fallback walk; the gateway does not retry the same candidate to incur 429s.
Cost (chapter 07) — quota enforcement and budget enforcement are separate refusals.
Cache (chapter 08) — cache hits do not consume the per-tenant bucket (no provider call) but may consume a small "cache RPM" if the platform tracks it; provider bucket is not touched.
Observability (chapter 11) — bucket fill levels are first-class metrics on dashboards.

How to recognise broken quotas in the wild

One noisy tenant slows the whole platform
The provider returns 429 visibly to product callers
Tenants discover their quota by hitting it, not by being told
The team configures quotas in code, not in policy
Bucket state lives only in gateway-pod memory (lost on restart)
Per-feature quotas exist on paper but not in the bucket math

Interview Q&A

Q1. Why have a per-provider bucket if the provider already enforces its own rate limit? Because the provider enforces by returning 429 after the call, which costs latency and pollutes audit. By capping at, say, 80% of the provider's published limit on the gateway side, the gateway refuses internally before issuing the call, with a clean structured error and a known retry_after_ms. The gateway never sees a 429 in normal operation; 429s become a drift signal that the provider tightened or burst caps changed. Wrong-answer notes: "we trust the provider" misses the latency and audit hygiene.

Q2. A tenant exhausts their per-tenant bucket. The shared overflow pool has capacity. Should the gateway serve the call? Depends on the policy. If overflow is enabled and bounded such that no single tenant can take all of it, yes. If overflow is disabled, no — refuse with RATE_LIMIT_EXCEEDED and retry_after_ms based on the tenant's bucket refill curve. Either policy is defensible; the choice should be explicit per tenant tier (premium tenants get overflow; free tier does not). The audit log distinguishes committed vs overflow consumption. Wrong-answer notes: "always serve from overflow" makes per-tenant quotas advisory.

Q3. How does the bucket survive a gateway pod restart? The bucket state is in a shared store (typically Redis) accessed via an atomic decrement script. Pod restart loses no state; the next call reads the bucket from Redis. If the gateway is sharded with local buckets, restart requires the bucket fraction to be redistributed — which is a strong argument for centralised state. Wrong-answer notes: "we keep it in memory" silently breaks the fairness property on every deploy.

Q4. The provider returns 429 unexpectedly even though the gateway's provider bucket has headroom. What do you do? Investigate: the provider's effective limit may have dropped (e.g., shared with another account, or a regional limit you did not know about), or your bucket math undercounted (e.g., not accounting for input-token weight). Tighten the gateway's per-provider cap conservatively while investigating; raise the question with the provider's support; verify whether other accounts share the quota. The 429 itself is the alarm; the goal is to get back to a state where the gateway never sees one. Wrong-answer notes: "treat it as transient" misses that the 429 indicates the gateway's bucket model is out of sync.

What to do differently after reading this

Implement per-tenant, per-alias buckets in a centralised store.
Cap each provider bucket below the provider's published limit; tighten until 429s from the provider are essentially zero.
Surface bucket fill rates on dashboards. A tenant approaching their cap is a useful early signal.
Document the refuse-vs-queue-vs-borrow policy per alias and per tenant tier.
Audit log the bucket consumption (committed vs overflow) on every call.

Bridge. Quotas constrain how often callers can make calls. Credentials constrain who the gateway is to the providers when it does. A leaked provider key is a company-wide bill; a misused gateway credential is a per-product incident. The next chapter is the credential discipline. → 06-key-and-credential-management.md