12. Rate Limiting and Backpressure — Close the lanes before the city jams

~13 min read. Fast traffic is good until the toll booth forgets the city has limits.

Built on the ELI5 in 00-eli5.md. The toll booth — the single entry point that inspects, routes, and limits traffic — decides when the city must say "enough for now."


1) Overload is also a design bug

See. Most outages are not only about failed machines. Sometimes every component is technically alive. Still the system collapses. Why? Because demand exceeds safe capacity.

One service can process only so many requests, connections, messages, or bytes per second. Past that point, latency climbs. Queues grow. Timeouts start. Retries multiply. Then overload spreads backward through the roads.

That is why the toll booth matters. It should not blindly wave every car inside. It must enforce limits before the city chokes.

clients ──→ ┌────────────┐ ──→ API tier ──→ queue ──→ workers ──→ warehouse
            │ toll booth │
            └─────┬──────┘
                  ├── allow
                  ├── delay
                  └── reject

Rate limiting protects shared systems from noisy neighbors. Backpressure tells upstream components to slow down. Load shedding drops work that the system cannot safely absorb.

These are not rude behaviors. They are survival behaviors. Simple, no?

2) Three common rate-limiting algorithms

Fixed window

You count requests in a fixed bucket of time. Example: 100 requests per minute. At 12:00:00 the counter resets. At 12:00:59 it may already be 100.

Good: easy to implement, cheap to store, easy to explain.

Bad: boundary spikes. A client can send 100 requests at 12:00:59 and 100 more at 12:01:00. Effective burst = 200 requests in about 2 seconds, double the stated per-minute limit.
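A minimal sketch of the fixed-window counter and its boundary spike. The class name `FixedWindowLimiter` and the explicit `now` argument (seconds since an arbitrary start, passed in so the example is deterministic) are illustrative assumptions, not a specific library's API.

```python
class FixedWindowLimiter:
    """Allow at most `limit` requests inside each fixed window of time."""

    def __init__(self, limit: int, window_seconds: int) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self.current_window = None  # index of the window being counted
        self.count = 0

    def allow(self, now: float) -> bool:
        window = int(now // self.window_seconds)
        if window != self.current_window:
            # A new window started: the counter resets to zero.
            self.current_window = window
            self.count = 0
        if self.count >= self.limit:
            return False
        self.count += 1
        return True


limiter = FixedWindowLimiter(limit=100, window_seconds=60)
# 100 requests in the last second of one window, 100 in the first second of the next.
late = sum(limiter.allow(59.0 + i / 100) for i in range(100))
early = sum(limiter.allow(60.0 + i / 100) for i in range(100))
burst_admitted = late + early  # 200: every request passes
```

Both batches stay under 100 within their own window, so the limiter never objects even though the two batches arrive back to back.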

Sliding window

Now you count over the last rolling interval. Example: at any moment, only the last 60 seconds matter. That smooths boundary tricks.

Good: fairer than fixed window. Better reflection of real recent traffic.

Bad: more bookkeeping. You store timestamps or sub-window counters. Cost is still fine for many APIs, but it is not free.
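One way to do the bookkeeping is a timestamp log, sketched below. `SlidingWindowLog` is a hypothetical name, and `now` is passed in explicitly as an assumption to keep the example deterministic. A production limiter might prefer sub-window counters to save memory.

```python
from collections import deque


class SlidingWindowLog:
    """Allow at most `limit` requests in any rolling `window_seconds` interval."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self.timestamps = deque()  # admitted request times, oldest first

    def allow(self, now: float) -> bool:
        cutoff = now - self.window_seconds
        # Forget requests that have fallen out of the rolling window.
        while self.timestamps and self.timestamps[0] <= cutoff:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.limit:
            return False
        self.timestamps.append(now)
        return True


log = SlidingWindowLog(limit=100, window_seconds=60)
late = sum(log.allow(59.0 + i / 100) for i in range(100))   # 100 admitted
early = sum(log.allow(60.0 + i / 100) for i in range(100))  # 0 admitted
```

The same traffic that doubled the fixed-window limit is cut off here, because the 100 requests from 59 seconds are still inside the last 60 seconds.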

Token bucket

This one is very practical. Imagine the toll booth keeps a bucket of tokens. Each request spends one token. Tokens refill over time.

Example policy: bucket size = 10, refill rate = 5 tokens per second.

Good: allows controlled bursts, smooths traffic, easy to reason about.

Bad: you still need shared state or partitioned counters, and you must choose good refill numbers.
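A minimal single-process sketch of the example policy (bucket size 10, refill 5 tokens per second). The class name `TokenBucket` and the explicit `now` argument are assumptions for illustration. A real deployment would also need the shared or partitioned state mentioned above.

```python
class TokenBucket:
    """Each request spends tokens; tokens refill continuously up to `capacity`."""

    def __init__(self, capacity: float, refill_per_second: float, now: float = 0.0) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity  # the bucket starts full
        self.last = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Add the tokens earned since the last call, never exceeding capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last = now
        if self.tokens < cost:
            return False
        self.tokens -= cost
        return True


bucket = TokenBucket(capacity=10, refill_per_second=5)
burst = [bucket.allow(now=0.0) for _ in range(12)]            # 10 pass, 2 blocked
after_one_second = [bucket.allow(now=1.0) for _ in range(6)]  # 5 pass, 1 blocked
```

The burst of 10 is the bucket size at work. The 5 admitted one second later is the refill rate at work.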

See the comparison.

fixed window   ── simple, bursty at boundaries
sliding window ── smoother, more bookkeeping
token bucket   ── burst-friendly, steady refill

3) Worked example: token bucket plus backpressure

Worked example now. Policy: bucket size = 8 tokens, refill rate = 2 tokens per second, and each request costs 1 token.

Start at time 0.0 seconds. Bucket = 8 tokens.

Requests arrive at: 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 seconds

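Step 1: Requests 1 through 8 consume the 8 starting tokens. To keep the arithmetic simple, this walkthrough does not count the tokens that refill during the burst itself (0.7 second × 2 tokens/second = 1.4 tokens, which a continuously refilling bucket would still hold). With that simplification, after request 8 at 0.7 seconds, bucket = 0.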

Step 2: Between 0.7 and 0.8 seconds, refill added = 0.1 second × 2 tokens/second = 0.2 token. Available token count = 0.2. Request 9 needs 1 full token. So request 9 is rejected or delayed.

Step 3: Between 0.8 and 0.9 seconds, another 0.2 token arrives. Now total = 0.4 token. Request 10 also cannot pass yet.

Step 4: When will the next full token exist? Need 1.0 token. Current = 0.4 token at 0.9 seconds. Missing = 0.6 token. At 2 tokens per second, time needed = 0.6 / 2 = 0.3 second.

So next full token appears at: 0.9 + 0.3 = 1.2 seconds

That is the point. The toll booth allowed a short burst of 8. Then it forced the client back to the steady rate of 2 per second.
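A short sketch that replays the same arrival timeline against a bucket that refills continuously. The function name `replay` and its return shape are illustrative assumptions. One difference from the hand walkthrough is worth stating loudly: the walkthrough treats the bucket as empty after request 8, while a continuously refilled bucket also earns 0.7 × 2 = 1.4 tokens during the burst, so this sketch admits request 9 and blocks only request 10.

```python
def replay(arrivals, capacity, rate):
    """Return (time, admitted, wait_seconds) for each arrival.

    Assumes continuous refill at `rate` tokens/second, a bucket that starts
    full, and a cost of 1 token per request. Blocked requests spend nothing.
    """
    tokens = float(capacity)
    last = arrivals[0]
    results = []
    for t in arrivals:
        tokens = min(capacity, tokens + (t - last) * rate)
        last = t
        if tokens >= 1.0:
            tokens -= 1.0
            results.append((t, True, 0.0))
        else:
            # Same formula as Step 4: missing tokens divided by refill rate.
            results.append((t, False, (1.0 - tokens) / rate))
    return results


arrivals = [i / 10 for i in range(10)]  # 0.0, 0.1, ..., 0.9 seconds
results = replay(arrivals, capacity=8, rate=2)
admitted = [ok for _, ok, _ in results]  # nine True values, then False
wait_for_last = results[-1][2]           # about 0.1 second
```

Either way the lesson holds: the burst is bounded by the bucket size, and after that the client is held to the refill rate.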

Now add backpressure inside the system. Suppose workers consume jobs at 500 jobs per second. Producers are pushing 800 jobs per second into a queue.

Net growth per second = 800 - 500 = 300 jobs

If queue size limit = 2,000 jobs, time to fill from empty = 2,000 / 300 ≈ 6.67 seconds

So in under 7 seconds, the queue is full. Then what? You have only four honest choices: slow producers, reject new jobs, drop low-priority jobs, or add more real capacity.

Pretending the queue is infinite is not design. It is denial.
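The fill-time arithmetic above, as a small helper. The function name `seconds_until_full` and the `backlog` parameter are assumptions added for illustration.

```python
def seconds_until_full(produce_rate, consume_rate, capacity, backlog=0):
    """Seconds until a bounded queue fills, or None if it never fills.

    Rates are in jobs per second; `backlog` is the number of jobs already queued.
    """
    growth = produce_rate - consume_rate
    if growth <= 0:
        return None  # consumers keep up, so the queue does not grow
    return (capacity - backlog) / growth


fill_time = seconds_until_full(produce_rate=800, consume_rate=500, capacity=2000)
# fill_time is about 6.67 seconds, matching the calculation above
balanced = seconds_until_full(produce_rate=500, consume_rate=500, capacity=2000)
# balanced is None: at equal rates the queue never fills
```

Doubling the queue limit only doubles the time to fill. It does not change the fact that producers outrun workers by 300 jobs per second.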

4) Backpressure propagation and load shedding

Backpressure means the pain moves upstream on purpose. A full queue tells the API tier, "Do not accept more." A saturated DB connection pool tells workers, "Pause or fail fast." A busy model server tells the gateway, "Return 429 or use a cheaper fallback."

Without backpressure, every layer keeps accepting work. Buffers grow. Memory grows. Latency grows. Finally the system fails in a confusing way.

With backpressure, the refusal is early and visible. That is healthier.

workers full ▼
queue full   ▼
API full     ▼
toll booth returns 429
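A minimal sketch of early refusal using the standard library's bounded `queue.Queue`. The handler name `handle_request`, the tiny `maxsize`, and the choice of 202 for accepted work are assumptions for illustration.

```python
import queue

# Bounded buffer between the API tier and the workers (tiny on purpose).
jobs = queue.Queue(maxsize=3)


def handle_request(payload):
    """Hypothetical API handler: accept if the queue has room, else refuse early."""
    try:
        jobs.put_nowait(payload)  # raises queue.Full instead of blocking
    except queue.Full:
        return 429  # Too Many Requests: the full queue pushes back upstream
    return 202  # Accepted for asynchronous processing


statuses = [handle_request(n) for n in range(5)]  # [202, 202, 202, 429, 429]
jobs.get_nowait()                                 # a worker drains one job
status_after_drain = handle_request("retry")      # 202
```

The queue never holds more than three jobs, so memory stays bounded and the caller learns about saturation immediately.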

Load shedding is the next step. You choose what to drop first. Thumbnail generation may wait. Analytics events may be sampled. Search suggestions may disappear. Checkout creation must not be dropped. See the ordering. Protect the core journey.
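The drop ordering can be sketched as a priority check. The `Priority` tiers, the utilization thresholds, and the mapping of features to tiers below are all assumptions chosen for illustration. Real thresholds come from load testing.

```python
from enum import IntEnum


class Priority(IntEnum):
    CRITICAL = 0    # e.g. checkout creation: never shed
    NORMAL = 1      # e.g. search suggestions
    DEFERRABLE = 2  # e.g. thumbnail generation, analytics events


# Utilization (0.0 to 1.0) at which each tier starts being dropped. Assumed values.
SHED_THRESHOLDS = {Priority.NORMAL: 0.95, Priority.DEFERRABLE: 0.80}


def should_shed(priority: Priority, utilization: float) -> bool:
    """Shed the least important work first as utilization climbs."""
    if priority == Priority.CRITICAL:
        return False
    return utilization >= SHED_THRESHOLDS[priority]


at_85_percent = {p.name: should_shed(p, 0.85) for p in Priority}
# {'CRITICAL': False, 'NORMAL': False, 'DEFERRABLE': True}
```

At 85% utilization only deferrable work is dropped. Critical work passes at any utilization in this sketch.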

A common HLD pattern is layered protection. The edge toll booth enforces tenant or user quotas. The service enforces concurrency caps. The queue enforces bounded buffering. The worker enforces time budgets. Together they stop overload from becoming chaos.
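One of those layers, the service-level concurrency cap, can be sketched with a non-blocking semaphore. `ConcurrencyCap` and `try_run` are hypothetical names. The nested call at the end is only a deterministic way to show a second request arriving while the first is still in flight.

```python
import threading


class ConcurrencyCap:
    """Admit at most `max_in_flight` calls at once; refuse the rest immediately."""

    def __init__(self, max_in_flight: int) -> None:
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def try_run(self, work):
        """Return (True, result) if `work()` ran, or (False, None) if the cap was hit."""
        if not self._slots.acquire(blocking=False):
            return False, None  # fail fast instead of queueing behind busy calls
        try:
            return True, work()
        finally:
            self._slots.release()


cap = ConcurrencyCap(max_in_flight=1)
# The inner call arrives while the outer call still holds the only slot.
outcome = cap.try_run(lambda: cap.try_run(lambda: "inner"))  # (True, (False, None))
after = cap.try_run(lambda: "ok")                            # (True, 'ok')
```

Failing fast here is what lets the layer above translate saturation into a 429 rather than a slow timeout.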

Now what is the interview line? "Rate limiting controls entry. Backpressure controls flow inside. Load shedding protects the core path when demand still exceeds capacity." That is crisp and correct.

One more practical point. 429 is not enough by itself. Clients should know when to retry, or whether not to retry at all. So return limits, reset hints, or Retry-After where appropriate. Otherwise client retry storms create the next overload.
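A hedged sketch of both sides of that contract. `too_many_requests` and `client_delay` are hypothetical helpers, the backoff constants are assumed values, and only the standard `Retry-After` header (in whole seconds) is used.

```python
import math
import random


def too_many_requests(tokens: float, refill_per_second: float):
    """Server side: build a 429 with a retry hint derived from a token bucket."""
    wait = max(0.0, (1.0 - tokens) / refill_per_second)
    # Round up so the client never retries before a full token exists.
    return 429, {"Retry-After": str(math.ceil(wait))}


def client_delay(headers: dict, attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Client side: honor Retry-After if present, else back off with full jitter."""
    if "Retry-After" in headers:
        return float(headers["Retry-After"])
    # Randomized delay spreads retries out so clients do not return in lockstep.
    return random.uniform(0.0, min(cap, base * 2 ** attempt))


status, headers = too_many_requests(tokens=0.4, refill_per_second=2)
# status is 429 and headers is {'Retry-After': '1'}: 0.3 second rounded up
hinted = client_delay(headers, attempt=0)  # 1.0
unhinted = client_delay({}, attempt=3)     # random value between 0.0 and 4.0
```

The jitter matters as much as the hint: clients that all sleep the same amount come back together and recreate the spike.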


Where this lives in the wild

  • GitHub REST and GraphQL APIs — per-user and per-token quotas stop one integration from monopolizing repository and search backends.
  • Cloudflare edge gateways — token-bucket style controls and bot rules block abusive traffic before it reaches origin servers.
  • OpenAI API — requests-per-minute and tokens-per-minute limits protect shared model capacity from noisy tenants.
  • LinkedIn Kafka pipelines — consumer lag, bounded queues, and paused partitions surface backpressure when downstream processors fall behind.
  • Stripe public APIs — layered gateway limits and careful client retry guidance reduce payment-side retry storms under stress.

Pause and recall

  1. Why is overload a system-design problem even when every box is technically alive?
  2. What burst problem does fixed window create at time boundaries?
  3. In the token bucket example, why were requests 9 and 10 blocked?
  4. If producers add 800 jobs per second and workers remove 500, how fast does the queue grow?

Interview Q&A

Q: Why use token bucket and not fixed window for a bursty public API?
A: Token bucket allows short bursts while still enforcing a long-run rate. Fixed window is simpler, but it creates unfair boundary spikes.
Common wrong answer to avoid: "Token bucket is always more accurate." It is better for burst smoothing, but the real choice depends on fairness, cost, and implementation needs.

Q: Why push back with 429 instead of just queueing every request?
A: Infinite queueing converts overload into latency, memory pressure, and timeout cascades. Early refusal keeps the rest of the system healthy.
Common wrong answer to avoid: "Because rejecting is cheaper than processing." The deeper reason is preserving stability and protecting critical capacity.

Q: Why is backpressure different from rate limiting?
A: Rate limiting governs admission at the edge. Backpressure communicates saturation from inside the pipeline so upstream layers slow down or stop.
Common wrong answer to avoid: "They are the same thing at different layers." Related, yes, but one is policy and the other is a feedback signal.

Q: Why do load shedding and graceful degradation belong together?
A: Both decide what work disappears first when capacity is tight. One drops requests or tasks; the other keeps a smaller but still useful product experience alive.
Common wrong answer to avoid: "Load shedding means turning the whole system off." No, it means sacrificing low-priority work to protect core flows.


Apply now (5 min)

Exercise: Design rate limits for a login API, a search API, and a webhook ingest API. Pick one algorithm for each and state what happens when the limit is exceeded.

Sketch from memory: Draw a toll booth at the edge, a bounded queue, and a worker pool. Then mark where the 429 happens, where the backpressure signal travels, and which low-priority work gets dropped first.


Bridge. Individual protection mechanisms are clear now. Next we combine them into full interview-ready system patterns for feeds, chat, search, and more. → 13-common-hld-archetypes.md