Skip to content

02. Design Rate Limiter

⏱️ Estimated time: 20 min | Level: advanced

ELI5 callback: On this stage, you are pitching traffic signals to city council. Draw the blueprint clearly, then use the choreography to justify fairness, burst control, and cost.

Step 1: Requirements & Constraints

See. First trap is solving the wrong question. Ask crisp questions, then freeze scope.

Functional requirements - Enforce limits per user, API key, IP, or route, because scope must stay explicit. - Support burst allowance for short traffic spikes, because scope must stay explicit. - Return clear throttle responses and remaining quota headers, because scope must stay explicit. - Allow different limits for free, paid, and internal clients, because scope must stay explicit. - Work across many app servers without double counting badly, because scope must stay explicit.

Non-functional requirements - Decision latency should stay within a few milliseconds, because that constraint changes architecture. - Limiter should fail safely and not become a single bottleneck, because that constraint changes architecture. - Memory use must stay bounded despite many active keys, because that constraint changes architecture. - Accuracy should be good enough for fairness under distribution, because that constraint changes architecture. - Operators need visibility into drops, abuse, and hot tenants, because that constraint changes architecture.

Constraints and assumptions - Assume 200,000 incoming API requests per second on average, so your estimate stays grounded. - Assume peak traffic can touch 1 million requests per second, so your estimate stays grounded. - Assume 10 million active principals per day across regions, so your estimate stays grounded. - Assume most limits are minute-level or second-level policies, so your estimate stays grounded.

What to explicitly de-scope - Bot classification by machine learning can come later, because interview time is limited. - Billing reconciliation is separate from online throttling, because interview time is limited. - Per-customer scripting for custom policies is out for now, because interview time is limited. - Global legal quotas per country are a later extension, because interview time is limited.

On the stage, say what is in and out. That makes the choreography visible and saves time.

Step 2: Scale Estimation

Now watch. Use round numbers, not thesis-level math. One minute of math can remove ten minutes of confusion.

Assumptions - 1 million peak requests per second means local decisions matter, so the back-of-envelope math stays honest. - If each request triggers one Redis write, cost rises fast, so the back-of-envelope math stays honest. - Suppose only 5 percent of keys are hot in one minute, so the back-of-envelope math stays honest. - Assume each active counter record is roughly 100 bytes, so the back-of-envelope math stays honest.

Quick math - Hot counters = 500,000 keys × 100 B ≈ 50 MB, which directly changes component choices. - A sliding log of every request would explode memory quickly, which directly changes component choices. - A token bucket needs only a few numbers per key, which directly changes component choices. - Cross-region round trips can exceed the whole latency budget, which directly changes component choices. - Therefore, local or regional enforcement is the default answer, which directly changes component choices.

Capacity implications - Keep the algorithm state compact and easy to update atomically, so the design stays proportional. - Prefer regional stores over one global throttle database, so the design stays proportional. - Use local warm caches for static quota configuration, so the design stays proportional. - Separate limit configuration reads from hit-counter writes, so the design stays proportional.

Latency budget - Gateway plus limiter should ideally finish within 5 ms, because user feel matters early. - Cross-AZ Redis access is acceptable, cross-region is not, because user feel matters early. - Background sync of usage is fine for dashboards, because user feel matters early. - The reject path should be faster than the allow path, because user feel matters early.

These numbers shape the first blueprint. Simple, no? Design follows load.

Step 3: High-Level Design

See. Keep the top-level flow boring and understandable. The interviewer rewards a clean blueprint before clever tricks.

┌──────────┐   ┌────────────┐   ┌──────────────┐
│ clients  │──→│ API gateway│──→│ limiter svc  │
└──────────┘   └────────────┘   └──────┬───────┘
                           ┌───────────┼───────────┐
                           │           │           │
                     ┌─────▼────┐ ┌────▼─────┐ ┌────▼──────┐
                     │ local    │ │ Redis    │ │ config DB │
                     │ cache    │ │ counters │ │ policies  │
                     └──────────┘ └──────────┘ └────┬──────┘
                                               ┌─────▼─────┐
                                               │ metrics   │
                                               │ pipeline  │
                                               └───────────┘

Main flow - Request reaches the gateway with principal identifiers attached, so the read and write path stays clear. - Gateway checks local config cache for the relevant policy, so the read and write path stays clear. - Limiter service updates bucket state in Redis atomically, so the read and write path stays clear. - Allowed requests continue to the backend immediately, so the read and write path stays clear. - Rejected requests return 429 with retry hints and headers, so the read and write path stays clear. - Metrics stream records allow and deny events asynchronously, so the read and write path stays clear.

Data model sketch - Counter key can look like tenant:user:route:window, so keys and queries stay obvious. - Token bucket needs last_refill_ts and current_tokens, so keys and queries stay obvious. - Policy records store rate, burst, scope, and priority, so keys and queries stay obvious. - Metrics events carry principal, route, region, and outcome, so keys and queries stay obvious.

What to say aloud - Start by stating the scope of limiting exactly, so the interviewer hears your structure. - Use reasoning aloud to explain why token bucket fits bursts well, so the interviewer hears your structure. - Mention that policy reads and counter writes have different access patterns, so the interviewer hears your structure. - Call out regional enforcement if global accuracy is not required, so the interviewer hears your structure.

Step 4: Deep Dive

So what to do? Pick two hotspots and go deeper. Do not deep dive everywhere.

Component 1: Algorithm choice: token bucket vs windows

Goal - Choose an algorithm matching burstiness, fairness, and memory limits, so the deep dive has a target. - Keep updates simple enough for atomic distributed execution, so the deep dive has a target.

Design notes - Fixed window is easiest, but boundary spikes are unfair, because details must still map to scale. - Sliding window log is accurate, but memory-heavy at scale, because details must still map to scale. - Sliding window counter smooths boundaries with moderate complexity, because details must still map to scale. - Token bucket gives intuitive burst handling with tiny state, because details must still map to scale.

Component 2: Distributed enforcement with Redis

Goal - Maintain near-consistent decisions across many stateless gateways, so the deep dive has a target. - Prevent Redis from becoming the bottleneck during global spikes, so the deep dive has a target.

Design notes - Use Lua scripts or server-side transactions for atomic updates, because details must still map to scale. - Shard counters by principal hash to spread load evenly, because details must still map to scale. - Use short local deny caches for obviously exhausted keys, because details must still map to scale. - Fallback to safer coarse limits if Redis is degraded, because details must still map to scale.

Use reasoning aloud to compare one easy option and one scalable option. Add an honest gap if exact thresholds are unknown.

Interviewer follow-ups to prepare - What happens when one customer floods a single route? - How do you handle shadow mode before enforcing new policies? - Would you rate limit by user, by IP, or by token? - How do you avoid cross-region over-throttling?

Why not the simpler alternative? - Putting counters in app memory is fast, but inconsistent across nodes, so tradeoffs stay visible. - A single central service is simple, but adds latency and risk, so tradeoffs stay visible. - Sliding logs sound precise, but burn memory on hot keys, so tradeoffs stay visible. - Global strict accuracy is expensive for most API products, so tradeoffs stay visible.

Step 5: Tradeoffs & Failure Modes

Now watch. Senior answers end with tradeoffs and breakage paths. That is where judgment shows up.

Tradeoffs - Token bucket is burst-friendly, but less intuitive than fixed windows, so the interviewer hears the cost clearly. - Regional limits are fast, but allow some global drift, so the interviewer hears the cost clearly. - Redis centralization is practical, but still operationally critical, so the interviewer hears the cost clearly. - Local deny caches reduce load, but may over-throttle briefly, so the interviewer hears the cost clearly. - Rich per-route policies help fairness, but complicate config management, so the interviewer hears the cost clearly.

Failure modes - Redis shard outage can make limit decisions unavailable, because real systems always break somewhere. - Clock skew can distort refill calculations across nodes, because real systems always break somewhere. - Bad policy rollout can throttle good traffic instantly, because real systems always break somewhere. - Hot-key tenants can overload one shard without smoothing, because real systems always break somewhere. - Metrics lag can hide abuse until after the event, because real systems always break somewhere.

Recovery levers - Fail closed for sensitive routes and fail open for low-risk ones, so failure discussion ends with action. - Rate-limit config rollouts behind feature flags, so failure discussion ends with action. - Use NTP and monotonic timers for refill math, so failure discussion ends with action. - Move abusive tenants to dedicated shards or stricter buckets, so failure discussion ends with action.

Close with an honest gap on one metric you would validate live. That sounds calm, not weak.

Interview Q&A

Q1. Why is token bucket a popular default? A: Because it preserves a simple average rate while allowing short bursts. That matches many real API patterns better than a hard fixed window. Common wrong answer to avoid: Token bucket is always best regardless of product behavior.

Q2. Why not keep counters only in gateway memory? A: Because clients hit many gateway instances. Memory-only counters create unfairness unless you accept very rough enforcement. Common wrong answer to avoid: Gateway-local memory is accurate enough in a distributed system.

Q3. Is Redis mandatory here? A: No. Any fast atomic key-value store can work. Redis is common because increments, expiries, and scripts are convenient. Common wrong answer to avoid: Redis magically removes all consistency problems.

Q4. When would sliding window counter beat token bucket? A: When you need smoother fairness near window boundaries and can accept a little more logic in the update path. Common wrong answer to avoid: You should always build the most accurate algorithm first.

Apply now (5 min) — practice exercise

Take five minutes. Do this without notes.

Practice checklist - Pick one limiter scope and justify it first, so your rehearsal stays focused. - Estimate whether memory or network becomes the first bottleneck, so your rehearsal stays focused. - Draw the online decision path in under thirty seconds, so your rehearsal stays focused. - Compare token bucket with one window-based option, so your rehearsal stays focused. - Say what you would do if Redis becomes unavailable, so your rehearsal stays focused.

Self-check - Did you separate policy storage from counter storage? - Did you mention a failure mode with hot keys? - Did you explain burst handling clearly? - Did you say whether you fail open or fail closed?

Say this opening - Open with the principal and route dimensions, so your first minute sounds controlled. - Then compare algorithms before choosing one, so your first minute sounds controlled. - Finish with degraded-mode behavior, so your first minute sounds controlled.

Run the choreography once in short form, then once with details. Stay aware of the stage and pause for questions.

Bridge. Rate limiting mastered. Now a system that reaches millions — notifications. → 03