Skip to content

08. Design Distributed Cache

⏱️ Estimated time: 20 min | Level: advanced

ELI5 callback: You are on the stage, pitching city council with a blueprint. Follow the choreography, show reasoning aloud, and admit every honest gap.

Step 1: Requirements & Constraints

Start wide. Then narrow. See. Scope first, technology later.

Functional requirements

  • Serve hot reads in single-digit milliseconds for repeated keys.
  • Support get, set, delete, invalidate, and TTL updates.
  • Handle cache-aside, write-through, and write-behind patterns.
  • Redistribute keys safely when nodes are added or removed.
  • Reduce cache stampede during sudden expiry of hot items.
  • Expose hit ratio, eviction, memory, and replication metrics.

Non-functional requirements

  • Target p99 read latency below 10 ms inside one region.
  • Stay available during single-node and single-rack failures.
  • Accept bounded staleness for many read-heavy workloads.
  • Use memory efficiently because RAM is the expensive part.
  • Scale horizontally without full rehashing of all keys.
  • Recover fast after failover without flooding the database.

Clarifying questions to ask

  • What is the read-to-write ratio for the hottest workload?
  • What average object size and maximum object size should we assume?
  • Can clients tolerate stale reads for a few seconds?
  • Do we need cross-region traffic, or is one region enough?
  • Who owns invalidation signals when source-of-truth data changes?
  • Which keys are business-critical and need stronger durability?

What to say on the whiteboard

  • State the user action, core data, and critical latency target.
  • Split must-have features from nice-to-have features immediately.
  • Name one honest gap before locking assumptions. Simple, no?
  • Ask what failure hurts most: money, freshness, or user trust.
  • Confirm whether single-region launch is acceptable for round one.
  • Summarise the scope before you move to numbers. Now watch.

Step 2: Scale Estimation

Do rough math. Clean math beats fancy math. So what to do? Pick clear assumptions and keep them verbal.

Assumptions

  • Assume 20 million daily active users.
  • Assume peak traffic is 1.2 million reads per second.
  • Assume writes are 120 thousand per second.
  • Assume average cached value size is 2 KB.
  • Assume average TTL is 30 minutes.
  • Assume top 5 percent of keys generate 60 percent of reads.

Back-of-envelope math

  • Working set per hour is 1.2M * 3600 * 2 KB for repeated reads.
  • After reuse, assume 600 million distinct hot objects per hour.
  • Raw memory for values is roughly 1.2 TB.
  • Add 30 percent for keys, metadata, and allocator overhead.
  • Provision about 1.6 TB effective RAM across shards.
  • With replication factor two, reserve about 3.2 TB total RAM.
  • If one shard handles 80 GB effective data, we need 20 primary shards.
  • Add 25 percent spare capacity for rebalancing and failover.

Interview cue

  • Say the biggest number first, then derive storage and bandwidth.
  • Round aggressively. Nobody wants calculator theatre on the board.
  • Mention peak-to-average ratio and why it changes capacity planning.
  • Keep one reserve factor for retries, bursts, and replays.
  • Remember the stage is interactive, so sanity-check assumptions aloud.
  • End with the two numbers that drive architecture choice.

Step 3: High-Level Design

Now place the big boxes. Your blueprint should fit in one glance.

+---------+    +--------------+    +------------------+
| Client  |--->| Cache Router |--->| Hash Ring Lookup |
+---------+    +--------------+    +------------------+
      |                 |                    |
      |                 v                    v
      |          +--------------+    +--------------+
      |          | Hot Key Lock |    | Shard Map     |
      |          +--------------+    +--------------+
      |                 |
      v                 v
+---------+    +----------------+    +----------------+
| DB/API  |<-->| Cache Shards   |<-->| Replica Shards |
+---------+    +----------------+    +----------------+

Main components

  • Cache router computes placement using consistent hashing.
  • Shard map service stores node membership and weights.
  • Primary cache shards hold values, TTL metadata, and eviction state.
  • Replica shards provide faster failover for hot partitions.
  • Hot key lock service coalesces misses and controls stampede.
  • Write pipeline decides cache-aside, write-through, or write-behind mode.
  • Metrics stream captures hit ratio, memory pressure, and tail latency.
  • Database remains the source of truth for durable state.

Request path

  • Client sends key lookup to the cache router.
  • Router checks current ring version and finds the target shard.
  • Shard returns value immediately on hit.
  • On miss, router acquires a short hot-key lock.
  • One request fetches from the database while peers wait briefly.
  • Loaded value is written with TTL and optional jitter.
  • If write-through is enabled, updates hit database and cache together.
  • If write-behind is enabled, updates queue for async flush.

Design narration

  • Start with ingress, then routing, then state, then async work.
  • Separate control plane decisions from data plane traffic early.
  • Show where metadata lives and where heavy payloads travel.
  • Mark caches, queues, and databases with their exact job.
  • Point out one synchronous dependency you may later relax.
  • Pause and let the interviewer choose the next zoom-in area.

Step 4: Deep Dive

Pick two parts that actually matter. Depth without structure becomes noise. See.

Component A — Consistent hashing and shard movement

  • Use virtual nodes so load spreads evenly across machines.
  • Store many small tokens per shard, not one giant token.
  • When adding a node, move only keys in the affected token ranges.
  • Track ring version so routers can switch atomically.
  • Keep shard weights configurable for mixed hardware sizes.
  • Throttle rebalance traffic to protect foreground latency.
  • Warm a new shard before sending full production traffic.
  • Measure key skew because real workloads are never perfectly uniform.
  • Pin ultra-hot keys if consistent hashing still leaves one node overloaded.
  • Keep a fast rollback path if a membership update goes bad.

Component B — Eviction, stampede control, and write policy

  • Choose LRU for simplicity or LFU for repeated hot objects.
  • Combine TTL expiry with active eviction under memory pressure.
  • Add random TTL jitter so many keys do not expire together.
  • Use request coalescing to stop a hundred misses becoming a thousand.
  • Serve slightly stale data during lock hold when freshness allows it.
  • Write-through keeps cache warm but adds write latency.
  • Write-behind lowers write latency but risks data loss on crash.
  • Cache-aside keeps application control but can create cold misses after updates.
  • Choose policy per dataset, not one global rule for everything.
  • Alert on eviction storms because they often precede database overload.

Deep-dive cue

  • Keep reasoning aloud clean while you zoom in.
  • Explain data model, hot path, and one ugly edge case.
  • Tie each deep dive back to a requirement you already named.
  • If numbers change the design, say that directly.
  • If one choice is uncertain, park it as research, not panic.
  • Return to the overall system before you get lost in detail.

Step 5: Tradeoffs & Failure Modes

Now show judgment. Interviewers hire the tradeoff thinker, not the diagram artist.

Key tradeoffs

  • LRU is easy to reason about, but LFU protects stable hot keys better.
  • Write-through improves freshness, but every write pays cache and database cost.
  • Write-behind cuts user latency, but crash recovery and replay become harder.
  • Replication improves availability, but doubles memory spending immediately.
  • Long TTL reduces origin load, but stale data risk increases.
  • Short TTL improves freshness, but miss rate and stampede risk increase.
  • Client-side hashing is faster, but central routing eases reconfiguration.
  • Per-key locks save the database, but lock services can themselves turn hot.

Failure modes to discuss

  • A primary shard crash can create a miss storm on replicas or the database.
  • A bad ring update can send traffic to empty nodes and spike error rates.
  • Mass expiry at the same second can trigger a classic stampede.
  • Write-behind queue lag can grow until cache and database diverge too far.
  • Memory fragmentation can reduce usable capacity long before the node is full.
  • Hot keys can overload one shard even when average utilisation looks fine.
  • Replica lag can surface stale reads after failover.
  • Over-aggressive invalidation can destroy hit ratio and waste RAM.

Close the answer strongly

  • Say what breaks first under sudden load and how you contain it.
  • Compare the chosen design against one simpler alternative.
  • Mention operational metrics, not only code-level correctness.
  • Admit where future scale may require redesign. Honest and sharp.
  • Offer a phased rollout plan if the company is early-stage.
  • Finish with latency, reliability, and cost in one sentence.

Interview Q&A

Q1. Why not use a single giant Redis cluster and stop there?

A. Single-cluster thinking ignores skew, rebalance pain, and regional blast radius. A. Start simple, yes, but still explain the path to sharding and failover. Common wrong answer to avoid: Redis is fast, so one cluster is always enough.

Q2. How would you prevent cache stampede for one celebrity key?

A. Use request coalescing, TTL jitter, and optionally stale-while-revalidate. A. If the key is extreme, isolate it or precompute it outside the normal path. Common wrong answer to avoid: Increase database capacity and let misses happen.

Q3. When would write-behind be acceptable?

A. Use it for low-risk derived data where brief loss is tolerable. A. Do not use it for balances, orders, or anything legally auditable. Common wrong answer to avoid: Use write-behind everywhere because it is faster.

Q4. What metric would tell you the cache is hurting the system?

A. Watch origin QPS, miss rate, tail latency, and eviction storms together. A. One metric alone hides the real story. See the coupling. Common wrong answer to avoid: Only cache hit ratio matters.

Apply now (5 min)

  • Run the full choreography with a two-minute timer per step.
  • Sketch this design again, but assume cross-region reads are mandatory.
  • Choose one dataset where cache-aside is better than write-through.
  • Choose one dataset where write-through is mandatory.
  • Explain how you would warm a new shard without user-visible latency.
  • State three alerts that should fire before the database melts.
  • Say one simplification you would take for an early-stage startup.

Bridge. Cache designed. Now something with real money — payments. → 09