08. Rate limits and quota — the ceiling that decides your peak hour¶
~14 min read. Every supplier puts a ceiling above your kitchen. You can choose to discover that ceiling at 3pm on a Friday when traffic spikes, or you can plan for it. Mature teams plan.
Builds on 07-dual-sourcing-fallback-chains.md. The second supplier keeps you alive during outages. Rate limits decide whether you survive your own peak hour.
1) Hook — the Friday afternoon spike¶
A B2B SaaS team has been on Anthropic's Tier 3 for six months. Their daily average is 80 requests per second across all customers. The peak hour, around Tuesday 11am US Pacific, runs at 180 RPS. The team has Tier 3 limits: 2000 requests/minute (~33 RPS sustained), 400K tokens/minute on Sonnet.
The math has been failing quietly all along. They are above sustained RPS most of the workday. They have been getting away with it because tokens are spread across requests and bursts are smoothed by the client SDK's queuing.
Then, on the last Friday of the quarter, a customer launches a marketing campaign. Traffic jumps to 320 RPS for 18 minutes. The 429s start at minute three. By minute seven, the kitchen is serving stale 503s. The on-call engineer pages the head of engineering. Customer success starts answering complaints.
The fix at 4pm Friday is not to "add capacity": Anthropic does not turn up your tier in 20 minutes. The fix is to fail over to the second supplier at 60% load and to throttle non-critical traffic. Both should have been wired up two months ago.
This whole chapter is the math and the playbook that would have prevented the incident.
2) The metaphor — a delivery dock¶
Imagine the supplier has a single loading dock at the back of their warehouse. Your kitchen sends order trucks to that dock. The dock can accept one truck every nine seconds. If you send two trucks at once, one waits. If you send fifteen at once, the dock guard waves all but six away with a 429 sign that says "try again in a minute".
Three numbers describe the dock:
- Trucks per minute — how many orders the dock can sign for in a sliding minute window. This is RPM (requests-per-minute).
- Tonnage per minute — total cargo across all trucks in that window. This is TPM (tokens-per-minute).
- Trucks in the door at once — how many trucks can be unloading simultaneously. This is concurrency (open streaming connections).
Most teams plan for RPM. Senior teams plan for all three, plus a fourth — burst tolerance, which is how forgiving the dock is when you arrive in clumps.
The matching habit here is sizing your traffic to the dock you actually have, not the dock you wish you had.
3) The anatomy — three (or four) numbers, never one¶
┌──────────────────────────────────────────────────────────┐
│ RATE LIMIT DIMENSIONS │
├──────────────────────────────────────────────────────────┤
│ 1. RPM — requests per minute (sliding window) │
│ 2. TPM — tokens per minute (input + output combined) │
│ 3. CONC — simultaneous open connections │
│ 4. BURST — short-window allowance above sustained rate │
└──────────────────────────────────────────────────────────┘
Hit any one and you get a 429. Hit two and the supplier may apply a longer cooldown. Hit all three and you might trip an automated abuse heuristic — at which point a human at the supplier needs to clear your account.
Different suppliers expose the dimensions differently. Anthropic publishes RPM and TPM per model and per tier; concurrency is implicit. OpenAI publishes RPM, TPM, and additional limits for image and audio. Gemini exposes RPM, TPM, and a per-project quota. Bedrock and Azure OpenAI expose region-specific limits that you provision separately from the underlying model's account-level limits.
Burst tolerance is the least documented dimension. Anthropic and OpenAI both tolerate a few seconds of 2-3x sustained rate before throwing 429s, but neither formally guarantees it.
4) Realistic 2026 numbers¶
ANTHROPIC (Sonnet 4.6)
─────────────────────
Tier 1 (free / first $5) : 50 RPM, 40K TPM
Tier 2 ($40 spent) : 1000 RPM, 200K TPM
Tier 3 ($200 spent) : 2000 RPM, 400K TPM
Tier 4 ($400 spent) : 4000 RPM, 800K TPM
Enterprise (negotiated) : custom, often 10K+ RPM, 5M+ TPM
OPENAI (GPT-4o)
───────────────
Tier 1 ($5 lifetime) : 500 RPM, 30K TPM
Tier 2 ($50 in 7d) : 5000 RPM, 450K TPM
Tier 3 ($100 in 7d) : 5000 RPM, 800K TPM
Tier 4 ($250 in 14d) : 10000 RPM, 2M TPM
Tier 5 ($1000 in 30d) : 10000 RPM, 30M TPM
Enterprise : custom
GEMINI (2.5 Flash, paid tier)
─────────────────────────────
Standard : 2000 RPM, 4M TPM
Provisioned : fixed throughput, no 429s
These numbers shift quarterly. Treat them as rough order-of-magnitude. The shape is what matters — there are tiers, you ascend by sustained spend, and enterprise tiers are negotiated.
5) Worked example — the headroom math¶
Your kitchen needs to handle a peak hour at 200 RPS sustained. That is 12000 RPM. Your average request size is 3000 input tokens and 600 output tokens, for 3600 tokens per request.
At OpenAI Tier 4 (10K RPM, 2M TPM), you fail on both dimensions: peak demand is 12000 × 3600 = 43.2M TPM, about 21.6x the TPM ceiling, and 12K RPM is 20% above the RPM ceiling. You will hit 429s constantly.
At OpenAI Tier 5 (10K RPM, 30M TPM), TPM is closer but still short by about 13M, and RPM is still 20% short. You have three options: compress your prompts (cut input by half via caching, drop redundant few-shot examples), split across multiple OpenAI orgs, or fail over to the second supplier for roughly 30% of traffic.
The headroom rule of thumb is 2x peak. If your peak is 12K RPM, you want a 24K RPM ceiling. The reason is that traffic is not flat across the minute: a 60-second sliding window over a real workload sees intra-minute bursts at 1.5-1.8x the minute average. Add a safety margin, and 2x is the planning target.
HEADROOM TABLE
──────────────
ceiling sized to    → outcome
sustained avg       → constant 429s at peak
peak hour avg       → bursts cause 429s
peak hour × 1.5     → marginal, ok if rare
peak hour × 2       → safe planning target
peak hour × 3       → comfortable, wasted spend
The 2x rule applies to the dimension you are likely to hit. For chat workloads, that is usually TPM. For high-frequency low-token workloads (classification at scale), that is usually RPM.
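The worked example above can be checked with a short Python sketch. The names `Tier` and `headroom_report` are illustrative, not part of any supplier SDK, and the tier ceilings are the rough figures from section 4. It reports which dimension binds, how far demand is over the ceiling, and what share of traffic would have to go elsewhere just to fit under 1x.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    rpm: int  # requests-per-minute ceiling
    tpm: int  # tokens-per-minute ceiling


def headroom_report(peak_rps: float, tokens_per_request: int, tier: Tier,
                    target: float = 2.0) -> dict:
    """Compare peak demand against one tier's RPM and TPM ceilings."""
    need_rpm = peak_rps * 60
    need_tpm = need_rpm * tokens_per_request
    rpm_ratio = need_rpm / tier.rpm  # above 1.0 means demand exceeds the ceiling
    tpm_ratio = need_tpm / tier.tpm
    worst = max(rpm_ratio, tpm_ratio)
    return {
        "need_rpm": need_rpm,
        "need_tpm": need_tpm,
        "rpm_ratio": rpm_ratio,
        "tpm_ratio": tpm_ratio,
        "binding": "TPM" if tpm_ratio >= rpm_ratio else "RPM",
        # Share of traffic that must be shed or rerouted to fit under 1x.
        "overflow_share": max(0.0, 1 - 1 / worst),
        # True only when the ceiling is at least `target` times peak demand.
        "meets_target": worst * target <= 1.0,
    }


# Rough tier figures from section 4; treat them as order-of-magnitude only.
tier4 = Tier("OpenAI Tier 4", rpm=10_000, tpm=2_000_000)
tier5 = Tier("OpenAI Tier 5", rpm=10_000, tpm=30_000_000)

for tier in (tier4, tier5):
    report = headroom_report(peak_rps=200, tokens_per_request=3600, tier=tier)
    print(tier.name, report["binding"],
          f"TPM x{report['tpm_ratio']:.2f}",
          f"RPM x{report['rpm_ratio']:.2f}",
          f"overflow {report['overflow_share']:.0%}")
```

Running the same function against your own peak and tier is a quick way to see whether you are planning against the dimension that actually binds.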
Mid-content recall¶
- Which of the three dimensions (RPM, TPM, concurrency) is most often the binding constraint for a chat workload? Why?
- Why does the "2x peak" headroom rule exist?
- What dimension does a streaming endpoint stress that a batch endpoint does not?
6) Quota negotiation — what to actually ask for¶
The path from default-tier to enterprise-tier capacity at any major supplier involves a written request. The mechanics are similar across vendors. You file a quota increase form, you state your use case, you state your projected volume, and you wait.
What to include in the request:
Use case : "B2B SaaS support agent for 800 enterprise customers"
Current spend : "$32K/month, growing 18% MoM"
Projected ceiling : "5x current within 6 months"
Specific limits : "Sonnet TPM from 400K to 2M, RPM from 2K to 8K"
Architecture : "Streaming, p99 latency 4s, prompt caching enabled"
Compliance : "SOC2 Type II, no data retention beyond 30d"
Geography : "us-east-1 primary, eu-west-1 secondary"
Sales engineering at the supplier reads this. They look for two signals: that you understand your workload (specific numbers, not "as much as you can give"), and that you are a real business (compliance, growth, geography).
Realistic timelines: small bumps within tier (e.g. doubling TPM) come back in 24-72 hours. Tier promotions take a week. Enterprise contracts with custom rate floors take 30-90 days and involve a procurement cycle.
A practical tactic — many teams negotiate headroom contracts. You commit to a monthly minimum spend; the supplier commits to a TPM ceiling above what your spend would normally entitle you to. The supplier protects against under-utilization, you protect against being throttled mid-quarter.
7) 429 handling — backoff, jitter, and the retry-after header¶
┌─────────────────────┐
request ──────────▶│ supplier │
└──────────┬──────────┘
│ 429 (rate limited)
│ retry-after: 7
│
▼
┌───────────────────────────────┐
│ client │
│ wait 7s + jitter(0..2s) │
│ retry once │
│ if still 429 → backoff to 14s │
│ after 3 retries → failover │
└───────────────────────────────┘
Three rules that experienced teams enforce:
- Always read retry-after. Anthropic and OpenAI both return a retry-after header on 429s. Honor it. Retrying earlier extends the cooldown, sometimes by minutes.
- Add jitter. If a hundred clients all retry at exactly t+7s, you get a thundering herd that triggers another wave of 429s. Adding 0-2 seconds of random jitter spreads the herd.
- Cap retries at 2-3. Beyond that, the load is real and retrying is futile. Failover to the second supplier is the right move.
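The three rules can be sketched in Python as a single wrapper. This is a minimal illustration under stated assumptions, not a drop-in client: `send` and `failover` are hypothetical callables you would supply, the `(status, headers, body)` response shape is invented for the sketch, and only the delta-seconds form of retry-after is parsed. The wait never goes below the header value and doubles on repeated 429s, matching the diagram above.

```python
import random
import time
from typing import Callable, Mapping, Tuple

# Hypothetical response shape for this sketch: (status_code, headers, body).
Response = Tuple[int, Mapping[str, str], str]


def retry_after_seconds(headers: Mapping[str, str], default_s: float = 1.0) -> float:
    """Read retry-after as seconds; use default_s if absent or not numeric."""
    lowered = {k.lower(): v for k, v in headers.items()}
    try:
        return max(0.0, float(lowered["retry-after"]))
    except (KeyError, ValueError):
        return default_s


def call_with_retry(
    send: Callable[[], Response],
    failover: Callable[[], Response],
    max_retries: int = 3,
    max_jitter_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    wait_s = 0.0
    for attempt in range(max_retries + 1):
        response = send()
        if response[0] != 429:
            return response
        if attempt == max_retries:
            break  # retries exhausted: the load is real
        # Never wait less than the header says; double the wait on repeats.
        wait_s = max(retry_after_seconds(response[1]), wait_s * 2)
        # Jitter spreads out clients that were all told the same retry time.
        sleep(wait_s + random.uniform(0.0, max_jitter_s))
    return failover()


# Demo with a fake supplier that rate-limits twice, then succeeds.
# The sleep function is replaced by a recorder so nothing actually waits.
replies = iter([
    (429, {"Retry-After": "7"}, ""),
    (429, {"Retry-After": "7"}, ""),
    (200, {}, "ok"),
])
waits = []
result = call_with_retry(
    send=lambda: next(replies),
    failover=lambda: (200, {}, "from secondary"),
    sleep=waits.append,
)
print(result[2], [round(w, 1) for w in waits])
```

Injecting `sleep` keeps the policy testable without real delays; in production the default `time.sleep` (or an async equivalent) would be used.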
Token-aware client-side rate limiting is the most under-used technique. Use tiktoken (or the relevant provider tokenizer) to estimate the request's token cost before sending. Maintain a sliding-minute counter. If the next request would push you over 90% of TPM, queue it locally instead of sending and getting a 429. The supplier sees fewer 429s; your latency tail is calmer.
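A minimal sketch of that sliding-minute counter follows. `TokenBudget` and `estimate_tokens` are hypothetical names, and the four-characters-per-token estimate is a crude stand-in so the example runs without dependencies; in practice you would swap in tiktoken or the relevant provider tokenizer. The sketch is single-process and not thread-safe.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class TokenBudget:
    """Sliding 60-second token counter: decide whether to send or queue locally."""

    def __init__(self, tpm_limit: int, threshold: float = 0.9,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.budget = tpm_limit * threshold  # stay under 90% of TPM by default
        self.clock = clock
        self.window: Deque[Tuple[float, int]] = deque()  # (timestamp, tokens)
        self.used = 0

    def _expire(self, now: float) -> None:
        # Drop entries that have aged out of the 60-second window.
        while self.window and now - self.window[0][0] >= 60.0:
            _, tokens = self.window.popleft()
            self.used -= tokens

    def try_acquire(self, estimated_tokens: int) -> bool:
        """Return True and record the spend if the request fits, else False."""
        now = self.clock()
        self._expire(now)
        if self.used + estimated_tokens > self.budget:
            return False  # caller should queue locally instead of sending
        self.window.append((now, estimated_tokens))
        self.used += estimated_tokens
        return True


def estimate_tokens(prompt: str, max_output_tokens: int) -> int:
    # Assumption: roughly 4 characters per token. Replace with a real tokenizer.
    return len(prompt) // 4 + max_output_tokens


budget = TokenBudget(tpm_limit=400_000)
cost = estimate_tokens("Summarize this ticket for the on-call engineer.", 600)
if budget.try_acquire(cost):
    print("send now, estimated tokens:", cost)
else:
    print("queue locally")
```

Counting the expected output tokens up front is deliberately pessimistic: it is cheaper to hold a request for a second than to send it and eat a 429.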
8) Failure modes — where teams trip¶
| Signal | Likely cause | Fix |
|---|---|---|
| 429s correlated with a single noisy customer | One tenant burning your shared quota | Per-tenant rate limit upstream |
| 429s only on the largest requests | TPM limit, not RPM | Compress prompts or split into smaller calls |
| 429s spike at exact minute boundaries | Sliding-window arithmetic vs fixed buckets | Add jitter; verify supplier's window type |
| 429s only in one region | Region-specific quota on Bedrock/Azure | Provision in the second region too |
| Latency tail grows but no 429s | Hitting concurrency ceiling, queuing implicit | Reduce streaming connection lifetimes |
| 429s on weekends but never weekdays | Shared quota; another team's batch job runs weekends | Isolate quotas per project/team |
| 429s after a model upgrade | New model on lower per-model TPM allotment | Request bump for the new model specifically |
| 429s on retry but not initial | Retrying without honoring retry-after | Read and respect the header |
Six of these eight are policy mistakes, not capacity mistakes. The teams that scale calmly are the teams that fix the policy.
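The fix in the first row, a per-tenant rate limit upstream, is commonly built as one token bucket per tenant. The sketch below assumes a single-process, in-memory gateway; `TenantBuckets` and the 25% `max_share` default are hypothetical choices for illustration, and a real deployment would more likely keep this state in a shared store or in an edge product.

```python
import time
from typing import Callable, Dict, Tuple


class TenantBuckets:
    """One token bucket per tenant, each capped at a share of the total RPM."""

    def __init__(self, total_rpm: int, max_share: float = 0.25,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = total_rpm * max_share      # per-tenant burst size
        self.refill_per_s = self.capacity / 60.0   # per-tenant sustained rate
        self.clock = clock
        self.state: Dict[str, Tuple[float, float]] = {}  # tenant -> (tokens, last_ts)

    def allow(self, tenant: str) -> bool:
        now = self.clock()
        # A tenant seen for the first time starts with a full bucket.
        tokens, last = self.state.get(tenant, (self.capacity, now))
        tokens = min(self.capacity, tokens + (now - last) * self.refill_per_s)
        allowed = tokens >= 1.0
        self.state[tenant] = (tokens - 1.0 if allowed else tokens, now)
        return allowed


# With a 2000 RPM quota and a 25% cap, no tenant can take more than 500 RPM.
limiter = TenantBuckets(total_rpm=2000)
for tenant in ("acme", "acme", "globex"):
    print(tenant, "allowed" if limiter.allow(tenant) else "throttled")
```

A rejected call here should turn into a tenant-facing 429 from your own service, so one customer's retry loop never reaches the supplier's shared quota.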
9) Provisioned throughput — the fixed-cost alternative¶
Some suppliers sell capacity as a fixed-rate contract rather than per-token. AWS Bedrock sells Provisioned Throughput (purchased in model units); Azure OpenAI sells Provisioned Throughput Units (PTUs) under "Provisioned" deployments; OpenAI has scale tier commitments for very large customers. This chapter uses "PTU" as shorthand for any of these.
The math: a PTU buys you guaranteed RPS and TPM, charged hourly regardless of usage. The break-even with pay-as-you-go is typically around 40-60% utilization of the PTU's capacity. Below that, on-demand is cheaper. Above that, PTU saves money and removes the 429 risk for traffic that stays within the provisioned capacity.
PROVISIONED VS ON-DEMAND
────────────────────────
PTU buys: 1000 TPS for $2.50/hour ($1800/month)
Equivalent on-demand at 100% utilization:
1000 TPS × 60 × 60 × 24 × 30 = 2.6B tokens/month
At Sonnet pricing ($3 in / $15 out per million tokens, assume 80/20 split):
→ $6.2K in + $7.8K out = $14.0K/month
Break-even utilization: 1800 / 14000 = 13%
(The PTU price here is illustrative; a 13% break-even sits far below the typical 40-60% range, so use the method, not the number.)
For chat workloads that have predictable peak hours, a PTU sized to peak with on-demand spillover for tails is often the lowest-risk choice. The trade-off is that PTUs are committed monthly or quarterly; you cannot turn them off Friday afternoon.
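The break-even calculation is easy to redo with your own quotes. A small sketch, using the same illustrative inputs as the block above; the function names are hypothetical and the prices are per million tokens.

```python
def on_demand_monthly_cost(tokens_per_s: float, input_share: float,
                           in_price_per_m: float, out_price_per_m: float,
                           hours: int = 24 * 30) -> float:
    """Monthly pay-as-you-go cost of running at tokens_per_s nonstop."""
    total_tokens = tokens_per_s * 3600 * hours
    input_tokens = total_tokens * input_share
    output_tokens = total_tokens - input_tokens
    return (input_tokens * in_price_per_m + output_tokens * out_price_per_m) / 1_000_000


def break_even_utilization(ptu_hourly: float, full_on_demand_monthly: float,
                           hours: int = 24 * 30) -> float:
    """Fraction of provisioned capacity you must use before the PTU is cheaper."""
    return (ptu_hourly * hours) / full_on_demand_monthly


# Illustrative inputs from the example above, not real quotes.
full_cost = on_demand_monthly_cost(tokens_per_s=1000, input_share=0.8,
                                   in_price_per_m=3.0, out_price_per_m=15.0)
utilization = break_even_utilization(ptu_hourly=2.50, full_on_demand_monthly=full_cost)
print(f"on-demand at 100%: ${full_cost:,.0f}/month, break-even at {utilization:.1%}")
```

The result is sensitive to the input/output split, because output tokens cost five times as much in this example; rerun it with the split you actually measure.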
10) The noisy-neighbor problem on shared endpoints¶
Bedrock's on-demand endpoints, Azure OpenAI's Standard deployments, and Vertex AI's shared endpoints all pool your traffic with other customers in the same region. Your latency depends partly on what your neighbors are doing.
Symptoms of noisy-neighbor pain:
- p99 latency drifts upward through the day, peaks around 2-4pm Pacific
- Same model in another region runs faster
- No 429s, but timeouts climb
The fix is provisioned capacity in your noisy region or moving the workload to a quieter region. There is no client-side workaround for noisy neighbors on a shared endpoint.
Where this lives in the wild¶
- Anthropic API tiers — RPM/TPM per model, climb by sustained spend.
- OpenAI API tiers — five tiers, ascend by lifetime + recent spend windows.
- Google Gemini API — free/paid tiers; PTUs available via Vertex AI.
- AWS Bedrock — on-demand + Provisioned Throughput Units (PTUs).
- Azure OpenAI Service — Standard, Global Standard, Provisioned, Datazone deployments.
- Vertex AI — provisioned throughput via Model Garden.
- Mistral La Plateforme — RPS-only limits, generous defaults on paid plans.
- DeepSeek API — explicit RPS limit, no automatic tier climbing.
- Together AI — per-account RPS limits, request-bump form for enterprise.
- Fireworks AI — per-model concurrency limits; on-demand and dedicated.
- Anyscale — RPS and concurrency limits per deployment.
- Replicate — per-model cold-start and concurrency limits.
- Modal — concurrency limits per function, no provider RPS limit.
- OctoAI — RPS and concurrency configurable.
- Groq — RPS and TPM with dedicated capacity at enterprise.
- Cerebras — RPS limits, dedicated inference contracts.
- SambaNova — dedicated inference, capacity-based pricing.
- OpenRouter — aggregates upstream limits; rate-limits on free models.
- LiteLLM — client library; honors retry-after, supports custom rate-limit middleware.
- Vercel AI SDK — surfaces 429s as catchable errors with retry hints.
- tiktoken / Anthropic tokenizer — for client-side token counting.
- Helicone — observability for rate-limit events.
- Langfuse — traces include retry-after and 429 spans.
- LangSmith — token usage panels for quota planning.
- Datadog LLM Observability — alerts on 429 rate.
- Cloudflare Rate Limiting — upstream per-tenant rate limiting.
- AWS API Gateway — usage plans for per-customer throttling.
- Stripe API rate-limiting model — frequently studied as the cleanest reference design.
Pause and recall¶
- Why is the "2x peak hour" headroom rule the planning target?
- What two signals does sales engineering look for in a written quota-increase request?
- When should a team prefer Provisioned Throughput Units over on-demand?
- Why does retrying before retry-after extend the cooldown?
- What is the noisy-neighbor problem and where does it surface?
- Which dimension (RPM, TPM, concurrency) usually binds for a high-volume classification workload?
- How does per-tenant rate limiting protect a shared quota from one noisy customer?
Interview Q&A¶
Q1. Your service is getting 429s at peak. Walk me through the diagnosis.
A. Five steps. (1) Identify which dimension (RPM, TPM, or concurrency) is being hit by reading the supplier's response headers and your client metrics. (2) Check whether the limit is account-level, project-level, or region-level. (3) Look for a noisy tenant or noisy time window. (4) Decide between short-term fixes (compress prompts, throttle upstream, failover to the second supplier) and long-term fixes (tier upgrade, PTU, multi-org split). (5) Add per-tenant rate limiting to prevent the same incident from a single customer.
Trap: "I'd just add more retries." Retries on a saturated supplier extend cooldowns and worsen the outage.
Q2. How do you size capacity for an unknown future peak?
A. Plan to 2x your historical peak hour for the dimension that binds. If you do not know which dimension binds, profile both RPM and TPM separately. For workloads with predictable peaks (B2B SaaS), use provisioned capacity sized to peak with on-demand spillover. For unpredictable workloads (consumer), keep on-demand with the second supplier wired up.
Trap: Sizing to average. Real traffic is bursty; average misses peaks by 3-5x.
Q3. When does Provisioned Throughput beat pay-as-you-go?
A. When your utilization of the provisioned capacity exceeds the break-even, typically 40-60%. PTUs trade flexibility for predictability. Use them when peaks are forecastable and 429s would be a business incident.
Trap: "PTUs are always cheaper at scale." Not true. Below 40% utilization, on-demand is cheaper.
Q4. How do you negotiate a quota increase quickly?
A. File the increase request with specifics: use case, current spend, projected volume, specific limits requested, compliance posture, geography. Vague requests sit in a queue. Specific requests get routed to a sales engineer. Maintain a relationship with the supplier's sales team for tier promotions and enterprise contracts.
Trap: "Give me as much as you can." Suppliers will not throw capacity at vague requests.
Q5. How do you protect a shared supplier quota from one noisy tenant?
A. Per-tenant rate limiting upstream of the supplier call: a token bucket per tenant, with limits set so no single tenant can consume more than a configured share of your total quota. Cloudflare or AWS API Gateway can do this in front; LiteLLM and custom middleware can do it in-app.
Trap: Letting tenants share quota without limits. One customer's bug becomes your incident.
Q6. What is the right retry policy for 429s?
A. Read retry-after, add 0-2 seconds of jitter, retry at most 2-3 times. Beyond that, fail over to a secondary supplier or return an honest error to the caller. Token-aware client-side rate limiting (estimate tokens before sending) prevents most 429s before they happen.
Trap: Aggressive exponential backoff that ignores retry-after. The supplier knows when capacity will be free; trust the header.
Q7. Bedrock and Azure OpenAI split quota by region. How does that change planning?
A. Provision and warm separate quota in each region where you serve traffic. Latency-tier routing should be region-aware. Failover plans should treat regions as independent suppliers, because a regional outage is the most common Bedrock/Azure failure mode.
Trap: Treating Bedrock as a single endpoint. It is not. It is per-region, with per-region capacity decisions.
Apply now (5 min)¶
Step 1 — profile your peak. Pull the last 30 days of your service's request logs. Compute requests-per-minute and tokens-per-minute at 1-minute granularity. Identify the peak minute. Compare it to your supplier's published limits for your tier.
Step 2 — check the dimensions. Are you closer to the RPM ceiling or the TPM ceiling? Project forward 6 months at your growth rate. Which dimension will bind first?
Step 3 — write the request. Draft the quota-increase form you would submit today for the projected ceiling. Be specific. Include use case, projected volume, compliance posture, and the specific RPM/TPM numbers you need.
If the form looks vague, your headroom plan is vague.
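Steps 1 and 2 can be sketched in a few lines of Python. The record shape, a `(timestamp, total_tokens)` pair per request, is an assumption about your logs, and the function names are illustrative. The sketch uses fixed calendar-minute buckets, which will understate the peak a sliding window would see, so treat the result as a lower bound.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, Tuple


def per_minute_load(
    records: Iterable[Tuple[datetime, int]],
) -> Tuple[Dict[datetime, int], Dict[datetime, int]]:
    """Bucket (timestamp, total_tokens) request records into RPM and TPM."""
    rpm: Dict[datetime, int] = defaultdict(int)
    tpm: Dict[datetime, int] = defaultdict(int)
    for ts, tokens in records:
        minute = ts.replace(second=0, microsecond=0)
        rpm[minute] += 1
        tpm[minute] += tokens
    return dict(rpm), dict(tpm)


def peak_utilization(per_minute: Dict[datetime, int],
                     limit: int) -> Tuple[datetime, int, float]:
    """Return the peak minute, its value, and that value as a share of the limit."""
    minute, value = max(per_minute.items(), key=lambda kv: kv[1])
    return minute, value, value / limit


def project(value: float, monthly_growth: float, months: int = 6) -> float:
    """Compound a peak value forward at a fixed month-over-month growth rate."""
    return value * (1 + monthly_growth) ** months


# Tiny made-up log sample; replace with 30 days of real request records.
logs = [
    (datetime(2026, 1, 6, 11, 0, 5), 3600),
    (datetime(2026, 1, 6, 11, 0, 40), 4200),
    (datetime(2026, 1, 6, 11, 1, 2), 1500),
]
rpm, tpm = per_minute_load(logs)
minute, peak_tpm, share = peak_utilization(tpm, limit=400_000)
print(minute, peak_tpm, f"{share:.2%} of TPM limit")
print("projected peak TPM in 6 months:", round(project(peak_tpm, 0.18)))
```

Run `peak_utilization` on both `rpm` and `tpm`; whichever share is larger today, and after projection, is the dimension to name in the request.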
Bridge. Rate limits are the supplier's polite ceiling. Outages are when the ceiling collapses entirely. The next chapter is the harder question — what happens when the supplier goes away, and how you plan for the day the news headline is about them, not you. → 09-vendor-risk-and-outages.md