11. Rate Limiting and Throttling — the wait staff pace the dining room deliberately

~14 min read. Speed feels nice until the kitchen receives more plates than capacity.

Built on the ELI5 in 00-eli5.md. The wait staff — middleware — decide how many order slip requests reach the kitchen safely each minute.


1) Why limits exist before any algorithm discussion

A restaurant with unlimited seating sounds generous. Then thirty large groups enter together. The kitchen slows, orders pile up, and everyone gets angry.

APIs behave the same way. Without limits, one noisy client can consume shared capacity. Then polite clients suffer latency, timeouts, and retries.

So rate limiting answers one basic question. Who may send how much traffic, over what interval, using which identity key?

Common identity keys are simple.

  • per user account
  • per API token
  • per IP address
  • per organization or tenant
  • global service cap

Each key solves a different fairness problem. Per-user limits isolate one customer. Per-IP limits help with anonymous traffic. Global limits protect the whole service during spikes.

See the policy layers.

client ──→ **wait staff** ──→ service
           ├── per-user cap
           ├── per-IP cap
           └── global cap

Worked example. Suppose one search API serves 10,000 requests per minute safely. If a single scraper sends 6,000 of them each minute, every other client fights over the remaining 4,000. A per-user cap prevents that monopoly.
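
A minimal sketch of layered caps inside one counting window. The class name, the cap values, and the example addresses are assumptions made for illustration, not part of any real gateway.

```python
from collections import defaultdict

# Hypothetical per-minute caps, chosen only for this sketch.
CAPS = {"user": 1_000, "ip": 2_000, "global": 10_000}


class LayeredCounter:
    """Counts requests in a single window against per-user, per-IP, and global caps."""

    def __init__(self, caps):
        self.caps = caps
        self.counts = defaultdict(int)

    def allow(self, user_id, ip):
        keys = [("user", user_id), ("ip", ip), ("global", "*")]
        # Reject if any layer is already at its cap.
        for layer, ident in keys:
            if self.counts[(layer, ident)] >= self.caps[layer]:
                return False
        # Count the request only after every layer has agreed.
        for key in keys:
            self.counts[key] += 1
        return True


limiter = LayeredCounter(CAPS)
# The scraper tries 6,000 requests; the per-user cap stops it at 1,000.
scraper_passed = sum(limiter.allow("scraper", "203.0.113.7") for _ in range(6_000))
# A polite client still finds global capacity available.
polite_ok = limiter.allow("polite", "198.51.100.2")
```

Rejected requests are not counted here, so a blocked scraper cannot eat into the global cap. That is a design choice; some systems count rejected attempts too.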

Throttling is related, but slightly different. Rate limiting often rejects extra requests. Throttling may delay, queue, or slow them deliberately. Both control pace. The wait staff choose the safe flow.

2) Fixed window is simple, but boundary spikes are real

Fixed window means one counter per time block. Example: 100 requests per minute. At the next minute, count resets to zero.

That simplicity makes implementation cheap. It also makes explanations easy for customers. "You get 100 calls each minute." Very clear.

But boundaries create bursts. A client can send 100 requests at 12:00:59 and 100 more at 12:01:00. The service sees 200 requests within about two seconds, double the stated per-minute limit.

minute A:  ├────────────────────┤ 100 used at the end
minute B:                       ├────────────────────┤ 100 used at the start
result: short burst crosses the boundary

Worked example. Policy: 60 requests per minute. A client sends:

  • 60 requests between 10:00:50 and 10:00:59
  • 60 requests between 10:01:00 and 10:01:10

The policy technically allows all 120 requests. Still, downstream capacity may feel a sudden wave. That is the fixed-window weakness.
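
The boundary burst can be reproduced with a small counter. This is a sketch, assuming time is passed in as seconds since 10:00:00 and that old windows are never cleaned up. The class name and the client key are illustrative.

```python
class FixedWindowLimiter:
    """One counter per (client, time block). The counter starts fresh in each block."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        # (client key, window index) -> requests accepted.
        # This sketch never evicts old windows; a real store would expire them.
        self.counts = {}

    def allow(self, key, now):
        window_index = int(now // self.window_seconds)
        bucket = (key, window_index)
        used = self.counts.get(bucket, 0)
        if used >= self.limit:
            return False
        self.counts[bucket] = used + 1
        return True


limiter = FixedWindowLimiter(limit=60, window_seconds=60)
# 60 requests late in the first minute, then 60 early in the second minute.
late = [50 + i * 0.15 for i in range(60)]
early = [60 + i * 0.15 for i in range(60)]
allowed = sum(limiter.allow("client-a", t) for t in late + early)
# allowed is 120: every request passes, although all arrive within about 20 seconds.
```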

Use fixed window when:

  • limits are moderate,
  • fairness needs are relaxed,
  • storage must stay very cheap,
  • customer communication matters more than burst precision.

Do not use it blindly for hot public APIs where attackers exploit boundaries. Simple is good. Naive is different.

3) Sliding window smooths traffic more honestly

Sliding window checks the recent rolling interval. At any second, it looks back over the last interval. That removes the sharp reset effect.

Two common approaches exist.

  • store request timestamps exactly
  • store smaller sub-window counters and approximate

Both aim for better fairness than fixed windows. The service cares about recent behavior, not just clock boundaries.
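
The second approach can be sketched with a handful of small counters. This is an illustration under assumptions: one client only, time passed in as seconds, and a class name invented here. It counts whole sub-windows, so the interval it covers is slightly shorter than the true rolling window. That is the approximation.

```python
class SubWindowCounter:
    """Approximates a rolling window by summing a few small fixed counters."""

    def __init__(self, limit, window_seconds, sub_windows):
        self.limit = limit
        self.sub_windows = sub_windows
        self.sub_seconds = window_seconds / sub_windows
        self.counts = {}  # sub-window index -> requests accepted

    def allow(self, now):
        current = int(now // self.sub_seconds)
        oldest = current - self.sub_windows + 1
        # Forget sub-windows that have fallen out of the rolling interval.
        for index in [i for i in self.counts if i < oldest]:
            del self.counts[index]
        if sum(self.counts.values()) >= self.limit:
            return False
        self.counts[current] = self.counts.get(current, 0) + 1
        return True


counter = SubWindowCounter(limit=100, window_seconds=60, sub_windows=6)
burst = [counter.allow(59.0) for _ in range(100)]  # all accepted
at_boundary = counter.allow(60.0)   # False: the burst at 59s is still remembered
much_later = counter.allow(110.0)   # True: that sub-window has been dropped
```

Storage stays at a few integers per client instead of one timestamp per request. The cost is precision: here the burst is forgotten at 110 seconds, while an exact log would hold it until 119.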

now ▼
|---- last 60 seconds ----|
count only this moving slice

Worked example. Policy: 5 requests per 10 seconds. Client sends at: 5, 6, 7, 8, 9 seconds. All five pass. Now another request arrives at 10 seconds.

With fixed window, it may pass immediately if the clock rolled. With sliding window, the server still sees the requests from 5, 6, 7, 8, and 9 in the last 10 seconds. That is already five, so the new request is blocked until the oldest one ages out at 15 seconds.

That is fairer. The wait staff remember recent crowding, not just the wall clock.
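
A minimal sketch of the exact-timestamp approach, assuming time is passed in as seconds and the policy is 5 requests per 10 seconds. The class name and the timestamps in the usage lines are chosen for this illustration.

```python
from collections import deque


class SlidingWindowLog:
    """Keeps the timestamps of accepted requests and counts only the recent ones."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.logs = {}  # client key -> deque of accepted request timestamps

    def allow(self, key, now):
        log = self.logs.setdefault(key, deque())
        # Drop timestamps that have aged out of the rolling window.
        while log and log[0] <= now - self.window_seconds:
            log.popleft()
        if len(log) >= self.limit:
            return False
        log.append(now)
        return True


limiter = SlidingWindowLog(limit=5, window_seconds=10)
first_five = [limiter.allow("client-a", t) for t in (5, 6, 7, 8, 9)]  # all True
at_ten = limiter.allow("client-a", 10)      # False: five requests in the last 10 seconds
at_fifteen = limiter.allow("client-a", 15)  # True: the request from 5s has aged out
```

Memory grows with the limit, because every accepted request leaves a timestamp behind. That is the bookkeeping cost of the exact approach.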

Trade-offs are practical. Sliding windows need more bookkeeping. Distributed counters need care. Clocks across nodes need consistency assumptions. Still, for shared public APIs, this extra effort often pays off.

4) Token bucket is practical because bursts are sometimes healthy

Token bucket imagines a bucket that fills steadily. Each request spends one token. If tokens remain, request passes. If not, request waits or gets rejected.

Two numbers define the policy.

  • bucket size = burst capacity
  • refill rate = long-run steady rate

That is why teams love it. You can allow short bursts without allowing endless flooding.

bucket size = 10 tokens
refill      = 2 tokens/second
request     = costs 1 token

Worked example. Start with 10 tokens. A client sends 10 requests instantly. All pass. Bucket becomes zero.

One second later, 2 tokens have refilled. Only 2 more requests can pass immediately. The rest must wait or fail with 429.
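
The same numbers can be run through a small sketch. It assumes time is passed in as seconds and that the bucket refills lazily, only when a request arrives. The class name is illustrative.

```python
class TokenBucket:
    """Bucket size sets the burst; refill rate sets the long-run pace."""

    def __init__(self, capacity, refill_per_second, now=0.0):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)  # start full
        self.last_refill = now

    def allow(self, now, cost=1):
        # Add the tokens earned since the last call, never exceeding capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(capacity=10, refill_per_second=2)
burst = [bucket.allow(now=0.0) for _ in range(12)]  # first 10 pass, last 2 are refused
later = [bucket.allow(now=1.0) for _ in range(3)]   # 2 tokens refilled: two pass, one refused
```

Refilling on demand avoids a background timer. The bucket only needs two stored values per client: the token count and the time of the last refill.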

See the flow.

tokens full ──→ allow burst
bucket empty ──→ slow down to refill rate

This matches many real products. A dashboard opening may need a short burst. A human user clicking around is spiky. Token bucket tolerates that naturally.

But be careful with distributed state. If three nodes each think the bucket is full, you accidentally triple the burst. So shared storage, partitioned keys, or a gateway layer matters.

Per-user token buckets are common. Per-IP token buckets help anonymous endpoints. Global token buckets protect the whole fleet when a sudden event hits everyone together.

5) 429 responses, headers, and API tiers complete the contract

A limit is not finished until clients know what happened and what to do next. Returning 429 Too Many Requests is the base signal. Useful headers make recovery smoother.

Common headers include:

  • Retry-After
  • X-RateLimit-Limit
  • X-RateLimit-Remaining
  • X-RateLimit-Reset

See the response idea.

HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1714561200

Now the client can back off intelligently. Without these hints, clients often retry blindly. That creates the next overload wave.
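
Both sides of that exchange can be sketched as two small functions. The function names are hypothetical. The sketch assumes Retry-After carries whole seconds and that X-RateLimit-Reset carries a Unix timestamp, as in the response shown. The X-RateLimit-* names are a common convention, and providers differ on their exact meaning.

```python
def build_429_headers(limit, reset_epoch, now_epoch):
    """Server side: headers to send with a 429 response."""
    retry_after = max(0, reset_epoch - now_epoch)
    return {
        "Retry-After": str(retry_after),
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": "0",
        "X-RateLimit-Reset": str(reset_epoch),
    }


def seconds_to_wait(headers, now_epoch, default=1):
    """Client side: how long to pause before retrying."""
    retry_after = headers.get("Retry-After", "")
    if retry_after.isdigit():
        return int(retry_after)
    # Retry-After may also be an HTTP date; this sketch does not parse that form
    # and falls back to the reset timestamp instead.
    reset = headers.get("X-RateLimit-Reset", "")
    if reset.isdigit():
        return max(0, int(reset) - now_epoch)
    return default


headers = build_429_headers(limit=100, reset_epoch=1714561200, now_epoch=1714561170)
wait = seconds_to_wait(headers, now_epoch=1714561170)  # 30 seconds
```

A client would normally add some random jitter on top of this wait, so that many blocked clients do not all retry in the same instant.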

API tiers add business shape. Free users may get 60 requests per minute. Pro users may get 600. Enterprise tenants may get negotiated ceilings. Same API, different promises.

Worked example. Suppose tier limits are:

  • free: 60 per minute
  • pro: 600 per minute
  • enterprise: 6,000 per minute

If one enterprise tenant's traffic also includes abuse from a single IP address, you may still layer a per-IP cap on top of the tier limit. Business tier and abuse protection are different concerns. Do both when needed.
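
A small sketch of the two checks side by side. The tier numbers come from the example above. The per-IP cap of 300 and the function name are assumptions made for illustration.

```python
# Requests per minute by tier, from the worked example.
TIER_LIMITS = {"free": 60, "pro": 600, "enterprise": 6_000}

# Hypothetical abuse cap, applied to every tier alike.
PER_IP_CAP = 300


def decide(tier, tenant_count, ip_count):
    """Return a decision given this minute's counts for the tenant and the caller's IP."""
    if ip_count >= PER_IP_CAP:
        return "reject: per-IP cap"
    if tenant_count >= TIER_LIMITS[tier]:
        return "reject: tier cap"
    return "allow"


# An enterprise tenant well under its ceiling, but one IP is hammering the API.
noisy_ip = decide("enterprise", tenant_count=1_000, ip_count=300)
# A free user who has spent the whole minute's budget.
free_spent = decide("free", tenant_count=60, ip_count=5)
```

Returning the reason along with the decision makes the 429 easier to explain to the customer, since the two caps call for different fixes.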

The menu should document all of this. Say which identity key applies. Say whether excess traffic is delayed or rejected. Say which headers clients can trust. That turns limits into a usable contract, not a surprise punishment.


Where this lives in the wild

  • GitHub API platform engineer sets per-token quotas and secondary abuse limits so automation stays useful without crushing shared backend capacity.
  • Cloudflare edge engineer uses token-bucket style controls to stop abusive traffic before it reaches customer origins.
  • OpenAI platform engineer balances requests-per-minute and tokens-per-minute limits across pricing tiers and model capacity.
  • Zomato gateway engineer protects restaurant search and checkout endpoints from traffic spikes during lunch and dinner peaks.
  • AWS API Gateway product engineer exposes rate and burst controls so teams can shield their services without custom logic everywhere.

Pause and recall

  1. Why can fixed windows allow an unfair burst near boundaries?
  2. Which two numbers define a token bucket policy?
  3. When is per-IP limiting useful compared with per-user limiting?
  4. Why should a 429 response include retry guidance headers?

Interview Q&A

Q: Why choose token bucket for a human-facing public API? A: It allows short natural bursts while still enforcing a steady long-run pace. Common wrong answer to avoid: "Because token bucket is the newest algorithm" — the reason is burst tolerance with understandable control.

Q: Why is sliding window fairer than fixed window? A: It measures recent traffic continuously instead of resetting sharply at clock boundaries. Common wrong answer to avoid: "Because it always stores every timestamp" — many implementations approximate; fairness is the main point.

Q: Why keep per-user and global limits together? A: One protects against individual noisy clients, while the other protects the entire fleet during broad spikes. Common wrong answer to avoid: "Because more limits are always safer" — unnecessary layers confuse customers unless each solves a distinct problem.

Q: Why are headers important when returning 429? A: Headers tell clients when to retry and how much budget remains. Common wrong answer to avoid: "Because monitoring tools need them" — tools benefit too, but client recovery is the primary reason.


Apply now (5 min)

Exercise: Design rate limits for a login endpoint, a search endpoint, and a webhook ingest endpoint. Pick one identity key and one algorithm for each. Write the 429 behavior briefly.

Sketch from memory: Draw wait staff before the kitchen. Add per-user, per-IP, and global counters. Then label where headers and 429 leave the system.


Bridge. Humans are not the only API clients now. Next we design endpoints that AI agents can use safely. → 12-api-for-ai-agents.md