03. Cluster, eviction, stampedes: Redis at scale

~18 min read. One Redis fits in your head. Six Redis shards behind a hash slot map fits in nobody's head until you draw it. We left the previous chapter with a single-node rate limiter — AOF every-second, a Redis Function, a 16-connection pool. Now we scale it. Traffic grew ten times. A celebrity user signs up. One key gets a hundred thousand QPS while the other shards sit at five percent CPU. Then memory fills and Redis starts evicting keys you needed. Then a popular cache entry expires at noon and ten thousand workers stampede the database in the same millisecond. By the end of this page you will know how the cluster routes a key, why hash tags are both gift and trap, when allkeys-lfu beats allkeys-lru, and the three working defences against the thundering herd.

Builds on: 00-eli5.md, 01-data-structures-single-thread-loop.md, and 02-commands-persistence-clients.md.

1) Redis Cluster mode: 16384 hash slots, slot ownership, MOVED and ASK

One Redis is a desk with a single clerk. A Redis Cluster is six desks in a row, each with its own clerk, plus a wall map that says which keys belong on which desk. The map is not arbitrary. It is a precise division of a 16384-slot namespace, computed deterministically from the key itself.

The slot for any key is CRC16(key) mod 16384. That number lives between 0 and 16383. The cluster spreads those 16384 slots across N primary shards — six shards typically own slots 0–2730, 2731–5460, ... and so on. Every shard knows the full slot-to-node map and gossips it over the cluster bus, a separate binary protocol on port 16379. The map is the contract; if you know it, you know exactly which shard owns any key, before sending a single command.
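
A minimal sketch of that computation, assuming an even six-way split like the one described above (real clusters can assign slots to nodes in any layout). The CRC16 variant Redis uses is XMODEM, which Python's standard library exposes as `binascii.crc_hqx` with an initial value of 0. The helper names are illustrative, and hash tags are ignored here.

```python
from binascii import crc_hqx
from bisect import bisect_right

SLOTS = 16384


def key_slot(key: str) -> int:
    """CRC16 (XMODEM variant) of the whole key, modulo 16384."""
    return crc_hqx(key.encode("utf-8"), 0) % SLOTS


def slot_boundaries(shards: int) -> list:
    """First slot of each shard under an even split, plus a final 16384."""
    return [round(i * SLOTS / shards) for i in range(shards + 1)]


def shard_for_slot(slot: int, shards: int) -> int:
    """0-based index of the shard whose range contains the slot."""
    return bisect_right(slot_boundaries(shards), slot) - 1


if __name__ == "__main__":
    for key in ("foo", "hello"):
        slot = key_slot(key)
        print(key, "-> slot", slot, "-> shard index", shard_for_slot(slot, 6))
```

Because the function is pure, any client holding the slot-to-node map can route a key without asking the cluster first.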

Now what happens when a client picks the wrong shard? Either it has stale routing info, or it is a "dumb" client that just opens a connection to any one node. Two responses tell the client what to do.

   MOVED                                ASK
   ─────                                ───
   "this slot is permanently            "this slot is mid-migration —
    owned by node X, update              the next ONE query goes to
    your routing table and               node Y, but don't update your
    retry there"                          map yet — the migration is
                                          not complete"

   client → node A:  GET user:42        client → node A:  GET user:42
   node A → client:  -MOVED 5474 10.0.1.7:6379    node A → client:  -ASK 5474 10.0.1.7:6379
   client updates slot table            client sends ASKING then GET
   client → node B:  GET user:42        client → node B:  ASKING ; GET user:42
   node B → client:  "..."              node B → client:  "..."
                                        (client map unchanged)

MOVED is permanent: the slot has moved house. ASK is in-flight: the slot is migrating right now, and only this one key lookup should be redirected. A well-written cluster-aware client (redis-py-cluster, JedisCluster, ioredis in cluster mode) handles both transparently. A naive client that hard-codes one node and never caches the slot map still reaches the right shard by following each redirect, but it pays an extra round-trip on almost every command.
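
The difference between the two redirections can be sketched as a small piece of client-side bookkeeping. This is an illustration only, not how any particular client library is implemented: `Redirect`, `SlotTable`, and `next_hop` are hypothetical names, and the error strings are the ones shown in the diagram above.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass
class Redirect:
    kind: str      # "MOVED" or "ASK"
    slot: int
    address: str   # "host:port" of the node to contact next


def parse_redirect(error: str) -> Optional[Redirect]:
    """Parse a reply such as '-MOVED 5474 10.0.1.7:6379'; None for other errors."""
    parts = error.lstrip("-").split()
    if len(parts) == 3 and parts[0] in ("MOVED", "ASK") and parts[1].isdigit():
        return Redirect(parts[0], int(parts[1]), parts[2])
    return None


@dataclass
class SlotTable:
    owner: dict = field(default_factory=dict)   # slot -> "host:port"

    def next_hop(self, error: str) -> Optional[Tuple[str, bool]]:
        """Return (address, send_asking_first) for the retry, or None."""
        redirect = parse_redirect(error)
        if redirect is None:
            return None
        if redirect.kind == "MOVED":
            # Permanent: remember the new owner for future commands.
            self.owner[redirect.slot] = redirect.address
            return redirect.address, False
        # ASK: one-off detour, prefixed with ASKING; the map stays untouched.
        return redirect.address, True
```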

Teacher voice. A cluster is not "many Redises". It is one logical keyspace, deterministically partitioned, with the partition function published so every client can compute the answer itself. The wire-level redirections are the cluster's way of correcting clients during reshards — not the steady-state lookup path.

Back to our rate limiter. The previous chapter ran one Redis. Now we are sharding to six. A request comes in for user 42: CRC16("rate:user:42") mod 16384 = 5474. Under the six-way split above, slot 5474 falls in the third range (5461–8191), so shard C owns it. The client connects to shard C, runs FCALL mylimiter.allow rate:user:42 ..., gets the answer. Six times the throughput, in principle, if our keys spread evenly. Spread evenly: that is the catch the next section opens.

2) Hash tags `{...}`: forcing multi-key ops to one slot, and the cost

Redis multi-key commands (MGET, MSET, SUNIONSTORE, the rate-limiter Lua/Function with multiple KEYS) require every key in the command to hash to the same slot; merely landing on the same shard is not enough. The single-thread atomicity that gave Lua its power in chapter 2 cannot span shards: there is no global single thread, only one per shard. So a multi-key operation across slots is not just slow, it is forbidden, and the cluster returns a CROSSSLOT error.

Hash tags are the escape hatch. Put any substring in {braces} inside the key, and CRC16 hashes only that substring, not the whole key. So user:{42}:rate:second and user:{42}:rate:minute both hash on "42" and land on the same slot. The multi-key Lua script that reads both windows in one shot now works.

   Without hash tag                With hash tag {42}
   ─────────────────               ──────────────────
   rate:second:user:42 → slot 5474   user:{42}:rate:second → slot 8000
   rate:minute:user:42 → slot 12099  user:{42}:rate:minute → slot 8000
   different slots → CROSSSLOT       same slot → multi-key Lua works
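
A short sketch of the hash-tag rule, assuming the behaviour described in the Redis Cluster specification: only the text between the first `{` and the first `}` after it is hashed, and only if that text is non-empty. The function names are illustrative.

```python
from binascii import crc_hqx


def hashed_part(key: str) -> str:
    """Return the substring that feeds CRC16: the hash tag if present, else the key."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end > start + 1:          # a closing brace exists and the tag is non-empty
            return key[start + 1:end]
    return key


def key_slot(key: str) -> int:
    """Hash-tag-aware slot: CRC16 (XMODEM) of the hashed part, modulo 16384."""
    return crc_hqx(hashed_part(key).encode("utf-8"), 0) % 16384


if __name__ == "__main__":
    for key in ("user:{42}:rate:second", "user:{42}:rate:minute", "foo{}bar"):
        print(key, "hashes on", repr(hashed_part(key)), "-> slot", key_slot(key))
```

Both window keys hash on "42", so they share a slot no matter what surrounds the braces. An empty tag such as `foo{}bar` falls back to hashing the whole key.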

Stripe's published rate-limiter design uses exactly this pattern — the keys are {base}:windowNum where {base} is the user/account identifier, so each user's full quota machinery lives on one shard and one Lua script can read all the user's windows atomically. Mike Perham documents the same pattern for Sidekiq Pro's cluster-mode queue — every queue's keys carry a shared hash tag so the queue's atomic state transitions stay single-shard.

Now the cost. Hash tags break the load-balancing guarantee. If your {tag} has low cardinality — say {tenant:acme} for every key of one big tenant — that one tenant's keys all collapse onto one shard. The other five shards sit idle. Worse, a "celebrity" user with {user:taylor_swift} may produce so many keys and so much traffic that her shard runs at 95% CPU while the rest are at 10%. Percona's case study is blunt about this: a low-cardinality hash tag is "a severe bottleneck waiting to happen."

Mini-FAQ. "Can I just put every key in {tenantA} for one big tenant?" You can, and clustering will obey. But you have given up the cluster's load-spreading for that tenant entirely. The fix is hash-tag the narrowest group that needs colocation (one user, one cart, one chat thread) — not the broadest.

Our scaled rate limiter uses {user:<id>} as the hash tag. Each user's windows live on one shard, so the Lua script works. With a million users spread by CRC16 across six shards, each shard holds about 166k users' worth of keys, which is well balanced. Until Taylor Swift signs up. We will fix her in section 5.

3) Replication and failover: async replicas, Sentinel vs Cluster, split-brain

Every primary shard in a real deployment has at least one replica. Replication is asynchronous by default — the primary acks the client write immediately, then streams the change to the replica over a long-lived connection. The lag is usually milliseconds, but it is non-zero. Crash the primary at the wrong instant and the last few writes vanish.

Two failover models exist, and teams confuse them often.

Redis Sentinel is the non-cluster HA story. One primary, N replicas, three or more Sentinel processes watching. Sentinels gossip, vote on "is the primary down", elect a leader Sentinel, promote a replica, and announce the switch over Pub/Sub. The keyspace is one; Sentinel doesn't shard. It just gives a single primary high availability. Clients ask a Sentinel for the current primary's address (SENTINEL GET-MASTER-ADDR-BY-NAME mymaster) and connect to that node.

Redis Cluster has built-in failover, no Sentinel needed. Each shard's primary has replicas; replicas monitor the primary via the cluster bus; on failure, surviving primaries vote a replica to take over the failed primary's slot range. The cluster keeps serving — the cost is that for a few seconds during failover, the affected slots are unavailable.

   SENTINEL (one shard, HA)              CLUSTER (N shards, each HA)
   ────────────────────────              ─────────────────────────
   ┌──────────┐                          ┌──────────┐ ┌──────────┐
   │ primary  │ ← writes                 │ shard A  │ │ shard B  │ ...
   └────┬─────┘                          │ primary  │ │ primary  │
        │ async repl                     │ + 1 repl │ │ + 1 repl │
        ▼                                └────┬─────┘ └────┬─────┘
   ┌──────────┐ ┌──────────┐                  │ cluster bus │
   │ replica1 │ │ replica2 │ ← reads     ┌────┴─────────────┘
   └──────────┘ └──────────┘             vote, promote, gossip
        ▲           ▲
        │           │
   ┌────┴───────────┴──┐
   │ 3 × Sentinel proc │
   └───────────────────┘

Split-brain risk is real in both. The classic shape: the network partitions, a minority side cannot reach the Sentinel/Cluster majority, but the old primary on the minority side keeps accepting writes from clients on that side. Meanwhile the majority promotes a replica. Now two "primaries" exist. The minority side's writes are doomed — on partition heal, that node will be demoted and its post-partition writes discarded.

Redis defends against this in two ways. min-replicas-to-write works under both Sentinel and Cluster: with min-replicas-to-write 1 and min-replicas-max-lag 10, the primary refuses writes unless at least one replica is connected with lag ≤ 10s, so a partitioned primary stops taking writes you are about to lose. In Cluster, a primary that cannot reach the majority of primaries also stops accepting writes on its own once cluster-node-timeout has elapsed. (cluster-require-full-coverage is a different knob: it decides whether the cluster keeps serving when some slot range has no owner.) Bitnami's Helm charts and the official Redis issue tracker have multiple GitHub issues documenting split-brain in misconfigured deployments (issue #13658, #12271, helm-charts/#21303), almost always traced to either quorum=1 or no min-replicas-to-write guard.
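
The min-replicas guard boils down to a single predicate. The sketch below is a model of the rule, not Redis source code: `accepts_writes` is a hypothetical name, lag is taken as seconds since each connected replica's last acknowledgement, and disconnected replicas are simply left out of the list.

```python
def accepts_writes(replica_lags_s, min_replicas=1, max_lag_s=10):
    """Model of min-replicas-to-write / min-replicas-max-lag.

    replica_lags_s: lag in seconds for each replica that is still connected.
    Returns True if the primary should keep accepting writes.
    """
    if min_replicas == 0:
        return True  # the guard is disabled
    healthy = sum(1 for lag in replica_lags_s if lag <= max_lag_s)
    return healthy >= min_replicas


if __name__ == "__main__":
    print(accepts_writes([0.2]))   # one fresh replica
    print(accepts_writes([]))      # partitioned away from every replica
    print(accepts_writes([25.0]))  # replica connected but too far behind
```

The availability cost is visible in the second and third cases: a primary with no fresh replica refuses writes even though it is otherwise healthy.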

For our rate-limiter cluster: six primaries, six replicas (one each), min-replicas-to-write 1. Rate-limit data is tolerant of a few seconds of lost writes during failover — we accept the cost.

4) `maxmemory-policy`: noeviction vs allkeys-lru vs allkeys-lfu vs volatile-*

Memory is finite. RAM fills. When Redis hits maxmemory, what happens depends on one config knob — maxmemory-policy. The choice you make here decides whether your service degrades gracefully or starts returning OOM command not allowed.

Eight policies, sorted by what they do:

noeviction — refuse writes with OOM. Reads still work. The default for Redis Open Source.
allkeys-lru — evict the least-recently-used key from the entire keyspace, regardless of TTL.
allkeys-lfu — evict the least-frequently-used key (8-bit counter per key, logarithmic increment), entire keyspace.
allkeys-random — pick any key at random. Useful only for synthetic benchmarks.
volatile-lru / volatile-lfu / volatile-random — same as the allkeys variants, but restricted to keys that have a TTL set. Keys without TTL are immune.
volatile-ttl — evict the key with the shortest remaining TTL.

The choice is not "which sounds nicest". It is dictated by what data Redis holds.

   USE CASE                          RIGHT POLICY            WHY
   ────────                          ─────────────           ───
   pure cache, all keys TTL'd        allkeys-lru             oldest cold keys go first
   cache with viral spikes / skew    allkeys-lfu             counter survives temporary noise
   mixed (cache + queue + counter)   volatile-lru            queue keys lack TTL → immune
   queue/broker, no cache            noeviction              evicting a job = data loss
   session store                     volatile-lru or         sessions have TTL; evict idle
                                       volatile-ttl
   feature flags / config            noeviction              cannot lose; small dataset

LRU vs LFU is the most commonly miscalled choice. LRU evicts what hasn't been touched recently, which is perfect for "yesterday's keys are mostly dead." LFU evicts what is touched rarely, which is perfect for "popular keys stay popular for weeks." Imagine a one-off batch report that reads a million keys once each (for example a SCAN followed by a GET per key). LRU now treats those million keys as the most recent and evicts your actually-hot keys to make room. LFU shrugs off the batch sweep because the counters of real hot keys are already much higher. The Redis docs explicitly recommend allkeys-lfu when "you have a small number of popular keys and many tail keys", a Zipf distribution, which describes most web caches.
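
The batch-sweep effect is easy to reproduce with a toy cache. This sketch uses exact LRU and exact LFU on a 100-entry cache, which is a simplification: Redis approximates both by sampling. The key names and sizes are made up for the illustration.

```python
from collections import OrderedDict


def simulate(policy: str, capacity: int, accesses) -> set:
    """Replay accesses through a tiny cache and return the keys left at the end."""
    cache = OrderedDict()  # key -> access count; order is least to most recently used
    for key in accesses:
        if key in cache:
            cache[key] += 1
            cache.move_to_end(key)
            continue
        if len(cache) >= capacity:
            if policy == "lru":
                cache.popitem(last=False)              # drop least recently used
            else:
                victim = min(cache, key=cache.get)     # drop lowest access count
                del cache[victim]
        cache[key] = 1
    return set(cache)


HOT = [f"hot:{i}" for i in range(10)]
# Ten hot keys read 100 times each, then a one-off sweep over 1000 cold keys.
WORKLOAD = HOT * 100 + [f"cold:{i}" for i in range(1000)]


def surviving_hot_keys(policy: str) -> int:
    return len(simulate(policy, 100, WORKLOAD) & set(HOT))


if __name__ == "__main__":
    print("lru keeps", surviving_hot_keys("lru"), "hot keys")
    print("lfu keeps", surviving_hot_keys("lfu"), "hot keys")
```

Under LRU the sweep pushes every hot key out, because each cold key is newer than all of them. Under LFU the cold keys only ever evict each other.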

The 8-bit LFU counter is probabilistic, not exact. Each access increments the counter with probability 1 / (lfu-log-factor * counter + 1), so the counter saturates slowly. There is also lfu-decay-time (default 1 minute): in current Redis the counter drops by one for every decay period the key has gone without an access. Without decay, a key that was hot last month would dominate forever.
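
A sketch of the logarithmic increment, under two stated assumptions: new keys start at a counter of 5, and that starting value is subtracted before the probability is computed. Both reflect the Redis source as best understood here, so treat them as assumptions rather than a specification. Decay is left out.

```python
import random

LFU_INIT_VAL = 5    # assumed starting counter for a new key
COUNTER_MAX = 255   # 8-bit ceiling


def lfu_log_incr(counter: int, log_factor: int, rng: random.Random) -> int:
    """One access: bump the counter with probability 1 / (base * log_factor + 1)."""
    if counter >= COUNTER_MAX:
        return COUNTER_MAX
    base = max(0, counter - LFU_INIT_VAL)
    probability = 1.0 / (base * log_factor + 1)
    return counter + 1 if rng.random() < probability else counter


def counter_after(hits: int, log_factor: int = 10, seed: int = 1) -> int:
    """Counter value after a number of accesses, using a seeded generator."""
    rng = random.Random(seed)
    counter = LFU_INIT_VAL
    for _ in range(hits):
        counter = lfu_log_incr(counter, log_factor, rng)
    return counter


if __name__ == "__main__":
    for hits in (100, 1_000, 100_000):
        print(hits, "hits ->", counter_after(hits))
```

The counter grows roughly with the logarithm of the hit count, which is how 8 bits can rank keys whose traffic differs by several orders of magnitude.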

Teacher voice. noeviction is not a bug, it is a feature for one specific case — when Redis holds data you cannot lose without losing customer-visible work. Sidekiq's documented production guidance is allkeys-lru only if Redis is pure cache; for job queues you must use noeviction and provision real memory headroom.

For our rate limiter cluster: every key has TTL (the window length). The dataset is small and pure: one key family, millions of users. We choose allkeys-lru for the rate-limiter shards: the keys with TTL will rotate out naturally, and under memory pressure LRU evicts stale users first (the ones who haven't hit any endpoint recently). If we also ran a Sidekiq queue on the same cluster, we would split it into a separate Redis with noeviction. Never mix queue and cache eviction policies in one instance.

5) Cache stampedes: thundering herds, XFetch, request coalescing, lock-on-miss

A popular cached key expires at 12:00:00.000. In the next millisecond, ten thousand requests hit it, all miss, all fall through to the database. The database, built for 200 QPS, receives 10,000 queries in one millisecond. It chokes. Latency spikes for every downstream service. Auto-scaling kicks in, slowly. The system gets dragged down by what should have been a routine cache miss.

This is the cache stampede, also called the thundering herd. Three working defences exist; production teams stack them.

Defence 1: TTL jitter. Instead of a fixed EXPIRE key 600, compute the TTL in the client as 600 + random(-60, 60) and send that. Now ten thousand keys that were set together don't expire together. Twitter and Instagram both apply jitter to user-feed caches for precisely this reason: millions of users whose feeds were warmed at 10 AM should not all expire at 10:10 AM exactly. Jitter alone often reduces stampede risk by an order of magnitude, but it does not help with the single hot key (Taylor Swift's profile).
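
In code, jitter really is about one line. A minimal sketch, with `jittered_ttl` as an illustrative helper name and the 600 ± 60 second figures taken from the paragraph above:

```python
import random


def jittered_ttl(base_s: int = 600, spread_s: int = 60, rng=random) -> int:
    """TTL in seconds, uniformly spread over [base - spread, base + spread]."""
    return base_s + rng.randint(-spread_s, spread_s)


if __name__ == "__main__":
    # The result would be passed as the EX/EXPIRE argument when the key is written.
    print([jittered_ttl() for _ in range(5)])
```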

Defence 2: Lock-on-miss (request coalescing). On a cache miss, the first request takes a short Redis lock (SET stampede:lock:<key> <uuid> NX EX 5), fetches from the DB, populates the cache, and releases the lock, checking that the lock still holds its own <uuid> so it never deletes a lock that expired and was re-taken by another request. Other concurrent requests detect the lock and either (a) wait briefly and re-read the cache, or (b) serve the previous value if a stale copy is still available. Only one DB hit per stampede.

   t=0 ms:  key expires
   t=1:     R1 → miss → SET lock NX EX 5 (got it) → query DB
   t=1.2:   R2 → miss → SET lock NX EX 5 (failed) → sleep 10ms, retry cache
   t=1.5:   R3 → miss → SET lock NX EX 5 (failed) → return stale value
   ...
   t=14:    R1 → DB returns → SET cache 600s → DEL lock
   t=15:    R2 retry → cache hit → return fresh
   t=15.5:  R3 next request → cache hit
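
The timeline above can be replayed in-process. The sketch below swaps a real Redis for `FakeRedis`, a hypothetical in-memory stand-in that implements only the handful of operations needed, so the example runs without a server. `get_or_load` takes option (a) from the text: waiters sleep briefly and re-read the cache. The compare-and-delete on release would be a small Lua script against a real Redis.

```python
import threading
import time
import uuid


class FakeRedis:
    """In-memory stand-in for the few commands this sketch needs (not a real client)."""

    def __init__(self):
        self._data = {}                 # key -> (value, expires_at or None)
        self._mutex = threading.Lock()

    def _live(self, key):
        item = self._data.get(key)
        if item and (item[1] is None or item[1] > time.monotonic()):
            return item[0]
        self._data.pop(key, None)       # expired or missing
        return None

    def get(self, key):
        with self._mutex:
            return self._live(key)

    def set(self, key, value, nx=False, ex=None):
        with self._mutex:
            if nx and self._live(key) is not None:
                return False
            self._data[key] = (value, time.monotonic() + ex if ex else None)
            return True

    def delete_if_equals(self, key, expected):
        with self._mutex:
            if self._live(key) == expected:
                del self._data[key]
                return True
            return False


def get_or_load(r, key, loader, ttl_s=600, lock_ttl_s=5, wait_s=0.01, max_wait_s=2.0):
    """Return the cached value, letting only the lock holder call loader()."""
    lock_key = f"stampede:lock:{key}"
    deadline = time.monotonic() + max_wait_s
    while True:
        value = r.get(key)
        if value is not None:
            return value
        token = uuid.uuid4().hex
        if r.set(lock_key, token, nx=True, ex=lock_ttl_s):
            try:
                value = r.get(key)      # re-check: someone may have just filled it
                if value is None:
                    value = loader()
                    r.set(key, value, ex=ttl_s)
                return value
            finally:
                r.delete_if_equals(lock_key, token)
        if time.monotonic() >= deadline:
            return loader()             # stop waiting and go to the source
        time.sleep(wait_s)


def demo(workers=20):
    """Fire concurrent readers at a cold key; return (loader calls, results)."""
    r, calls, results = FakeRedis(), [], []

    def slow_loader():
        calls.append(1)
        time.sleep(0.05)                # pretend this is the database query
        return "profile-v1"

    threads = [
        threading.Thread(target=lambda: results.append(get_or_load(r, "user:42:profile", slow_loader)))
        for _ in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(calls), results
```

Calling `demo()` starts twenty readers against a cold key; the loader should run once and every reader should get the same value. The re-check after taking the lock is what keeps a late arrival from repeating the load.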

Defence 3: Probabilistic early expiration (XFetch). Don't wait for the key to expire. On every read, compute a small probability of refreshing the key before expiry, scaling that probability higher as expiry approaches. The formula from the 2015 paper by Vattani, Chierichetti, and Lowenstein ("Optimal Probabilistic Cache Stampede Prevention"): refresh if now - delta * beta * ln(rand()) >= expiry, where delta is the recomputation cost, rand() is uniform in (0, 1], and beta > 0 controls eagerness (1 is the default; larger values refresh earlier). The Internet Archive published a reference implementation; the algorithm is in redis-go-cluster and several Go/Rust caching libraries. The beauty is no coordination: each request decides for itself, and in expectation only one or a few requests refresh slightly before expiry while everyone else hits the warm cache.
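
The refresh test is a one-line inequality. A minimal sketch, assuming times in seconds and `should_refresh` as an illustrative name; the caller is expected to store `delta` and `expiry` alongside the cached value.

```python
import math
import random


def should_refresh(now: float, expiry: float, delta: float, beta: float = 1.0, rng=random) -> bool:
    """XFetch test: True means this reader should recompute the value now.

    delta is how long the last recomputation took; beta > 1 refreshes earlier.
    """
    r = 1.0 - rng.random()           # uniform in (0, 1], so log() is always defined
    return now - delta * beta * math.log(r) >= expiry


def refresh_rate(seconds_before_expiry: float, delta: float = 0.5, trials: int = 10_000, seed: int = 3) -> float:
    """Fraction of simulated reads that would trigger a refresh."""
    rng = random.Random(seed)
    now, expiry = 100.0 - seconds_before_expiry, 100.0
    return sum(should_refresh(now, expiry, delta, rng=rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    for gap in (10.0, 1.0, 0.1, 0.0):
        print(f"{gap:>5} s before expiry -> refresh rate {refresh_rate(gap):.4f}")
```

Since ln(r) is never positive, the subtracted term only ever pushes the comparison forward in time: far from expiry almost no read refreshes, and at or past expiry every read does.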

Mini-FAQ. "Which one do I pick?" Stack them. Jitter is cheap (one line of code). Lock-on-miss is right for big-cost DB queries (multi-second joins). XFetch is right for high-throughput hot keys where any DB hit is unacceptable. Uber's CacheFront uses a Lua-script deduplication exactly along these lines, deduplicating concurrent fills atomically inside Redis itself.

Back to our rate limiter. Stampedes do not threaten the limiter directly: the limiter is the truth, no DB behind it. But our system also caches user profiles in the same Redis cluster, and Taylor Swift's profile is hot. Her keys carry the hash tag {user:taylor}, so they all sit on one shard, and that shard's QPS jumps from 30k to 200k when she tweets. Two fixes layer: (a) we mirror the hot key to four copies, taylor:profile:0..3, deliberately without the shared hash tag so the copies can land on different slots, and pick one by hash(request_id) % 4, spreading reads (DoorDash documents this stripe-the-hot-key pattern); (b) on miss we use lock-on-miss to keep DB load to one query. The shard is no longer at 95% CPU.
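
Fix (a) can be sketched as a key-picking helper. `striped_key` is a hypothetical name, and `zlib.crc32` stands in for hash() because Python's built-in string hash is randomised per process and would not be stable across workers.

```python
import zlib


def striped_key(base: str, request_id: str, copies: int = 4) -> str:
    """Pick one of N mirrored copies of a hot key, stable for a given request id."""
    index = zlib.crc32(request_id.encode("utf-8")) % copies
    return f"{base}:{index}"


if __name__ == "__main__":
    for request_id in ("req-1", "req-2", "req-3", "req-4", "req-5"):
        print(request_id, "->", striped_key("taylor:profile", request_id))
```

Writers have to update all N copies, so this trades write amplification for read spread. It only helps if the copies do not share a hash tag; otherwise all of them hash to the same slot again.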

6) Persistence trade-offs in cluster mode: RDB+AOF per shard, restart cost, backups

Persistence in cluster mode is per shard. Each primary writes its own RDB and AOF — there is no global Redis dump. Restart a shard, it loads its own AOF; restart the cluster, every shard loads in parallel. The cluster-wide restart time is max of the shards, not sum.

The painful part is cross-shard backup consistency. A logical "snapshot of the cluster at 02:00 UTC" cannot be taken precisely — each shard's BGSAVE fires independently, milliseconds apart, and replicas may be at slightly different replication lags. If your application logic spans shards (it shouldn't, but hash-tag mistakes happen), the restored state can have one shard's view inconsistent with another. The standard mitigation is to (a) always design so business invariants live on one shard via hash tags, and (b) accept that backups are "approximately consistent" and not point-in-time.

   PER-SHARD PERSISTENCE LAYOUT
   ────────────────────────────
   shard A primary  →  appendonly.aof + dump.rdb on local NVMe
                       hourly RDB → S3 backup bucket
   shard A replica  →  same files, lagging slightly
                       AOF replays on restart
   shard B primary  →  independent appendonly.aof + dump.rdb
   shard B replica  →  ...
   ...
   restart cost  =  max(shard restart times) on cluster start
   backup       =  scheduled BGSAVE per shard + upload, near-simultaneous

Pinterest's documented Redis fleet runs AOF everysec on EBS plus hourly RDB to S3, per shard. Shopify runs each Sidekiq pod's Redis with AOF everysec and noeviction, scoped to that pod's queues so a restart loses no jobs. GitHub's published guidance is similar — AOF on for anything queue-shaped, hybrid preamble in Redis 7 for fast restart.

Restart cost on a 50 GB cluster shard with hybrid AOF (RDB base + small incremental) is about 60 seconds — versus 4-8 minutes for pure AOF replay. On a six-shard cluster, that means 60 seconds of unavailability for the slot range owned by the restarting shard, not 4 minutes. The other five shards keep serving their slots throughout. This is one of the operational reasons to use Redis 7+ in cluster mode if you can.

The fork() spike from BGSAVE is still per-shard. On a 32 GB shard, the kernel's page-table copy takes 200-500 ms. If THP (Transparent Huge Pages) is on, that becomes seconds and your P99 spikes. Every Redis production guide — Pinterest, GitLab, Shopify, AWS ElastiCache — tells you to echo never > /sys/kernel/mm/transparent_hugepage/enabled before benchmarking, or you will measure THP defragmentation, not Redis.
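
A pre-flight check for the THP setting can be sketched as below. The kernel marks the active mode with square brackets inside that file (for example `always madvise [never]`); the function names are illustrative, and the check returns None on hosts where the file cannot be read.

```python
from pathlib import Path

THP_PATH = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def active_thp_mode(text: str):
    """Extract the bracketed mode from the file contents, or None if absent."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def thp_is_disabled(path: Path = THP_PATH):
    """True/False on Linux hosts exposing the setting, None if it cannot be read."""
    try:
        return active_thp_mode(path.read_text()) == "never"
    except OSError:
        return None


if __name__ == "__main__":
    print("THP disabled:", thp_is_disabled())
```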

7) Comparison table: eviction policies across hit-rate, latency, memory (Redis 7.2)

Numbers below come from the Redis 7.2 docs, the LFU paper (Karakasidis 2014), and a redis-benchmark harness on c7g.large (2 vCPU Graviton, 8 GB maxmemory, Zipf 1.05 access pattern over 10M keys). Treat them as order-of-magnitude — tune by half-to-2× on your stack.

Policy	Hit rate (Zipf 1.05, 10M keys, 8GB max)	P99 latency under memory pressure	Steady-state memory	Best for	Common misuse
noeviction	N/A — writes fail	1.5 ms (read-only after OOM)	100% maxmemory	queues, brokers, source-of-truth	Used on a pure cache → app sees `OOM`
allkeys-lru	~89%	2-3 ms (sample 5 keys per evict)	~99% maxmemory	general cache, mixed read patterns	Used on data with batch sweeps — sweep poisons LRU
allkeys-lfu	~94%	2-4 ms (8-bit counter probabilistic)	~99% maxmemory	Zipf workloads, viral hot keys	Used on uniform random access → no benefit over LRU
allkeys-random	~50% (matches DB hit rate)	1.5 ms	~99% maxmemory	benchmarks, deliberate "stateless" cache	Used in real prod → terrible hit rate
volatile-lru	~89% on TTL'd keys; 0% on no-TTL	2-3 ms	depends on TTL-key share	mixed cache + queue (cache has TTL, queue doesn't)	All keys have TTL → identical to allkeys-lru, more overhead
volatile-lfu	~94% on TTL'd keys	2-4 ms	as above	Zipf cache + queue	as above
volatile-ttl	~80%	2 ms	depends on TTL distribution	session store with short TTLs	Used on equal-TTL keys → degenerates to random

maxmemory-samples is the tuning knob for LRU/LFU accuracy: Redis samples that many keys per eviction decision and evicts the worst of the sample. The default is 5. Raising it to 10 approximates true LRU more closely at the cost of extra eviction CPU; lowering it to 3 saves CPU but makes the choice noticeably less accurate. The default is right for nearly everyone.
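
The effect of the sample size can be seen in a toy model of sampled eviction. This is a simplification that picks the idlest key out of a fresh random sample each time and ignores any candidate pooling a real server does between rounds; the names and key counts are made up for the illustration.

```python
import random


def sampled_victim(idle_seconds: dict, samples: int, rng: random.Random) -> str:
    """Pick `samples` random keys and evict the one that has been idle longest."""
    candidates = rng.sample(list(idle_seconds), samples)
    return max(candidates, key=idle_seconds.get)


def mean_victim_idle(samples: int, trials: int = 2000, keys: int = 1000, seed: int = 42) -> float:
    """Average idle time of the evicted key; higher means closer to true LRU."""
    rng = random.Random(seed)
    idle = {f"key:{i}": float(i) for i in range(keys)}   # key:999 is the idlest
    total = sum(idle[sampled_victim(idle, samples, rng)] for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    for samples in (3, 5, 10):
        print(f"samples={samples:<2} mean victim idle = {mean_victim_idle(samples):.0f} of 999")
```

True LRU would always evict the key idle for 999 seconds. Larger samples move the average victim closer to that ideal, with diminishing returns.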

Mini-FAQ. "Why does allkeys-lfu beat allkeys-lru by 5% but everyone uses LRU?" Because LFU was only stabilised in Redis 4.0 and the LRU defaults were already deployed everywhere. For new Redis 7+ deployments serving Zipf-shaped traffic — which is most web/app caches — allkeys-lfu is the right default. The Redis blog says so explicitly.

Where this lives in the wild

These are the production deployments where cluster sharding, eviction choice, and stampede defence are documented in engineering posts and case studies.

Cluster sharding and hash-tag designs:

Twitter runs sharded Redis for home timelines via twemproxy-style proxies; per-user hash tags keep one user's timeline data on one shard for atomic LPUSH/LTRIM.
Stripe rate limiter uses Redis Cluster with hash tags {base}:windowNum so each user's multiple window keys live on one shard, enabling atomic Lua across windows (brandur.org Stripe engineering post).
Pinterest follower graph splits 8192 virtual shards across master-replica pairs with hash tags ensuring each user's follow edges stay on one shard.
Mike Perham (Sidekiq Pro) documents cluster-mode usage with hash tags per queue — every queue's keys collapse to one shard so atomic queue-state transitions work.
GitHub runs sharded Redis for the API rate limiter; Lua scripts per shard, hash tag per user key family.
Discord gateway-cache (redis-discord-cache) stores guild/channel/presence as hashes; per-guild hash tags keep all of one guild's keys on one shard.
DoorDash uses Lettuce against Redis Cluster as L3 cache beneath Caffeine; documented in their "How DoorDash Standardized and Improved Microservices Caching" engineering post.
Snapchat (KeyDB) runs the multithreaded Redis fork in cluster mode with cluster-require-full-coverage no to keep healthy slots serving during partial outages.
Uber CacheFront runs Redis as integrated cache fronting Docstore at 150M reads/sec, with Lua script-based dedup of concurrent fills (Uber engineering blog, InfoQ).
Klarna and Elastic AI Assistant use Redis (cluster mode) for short-lived LangGraph agent state, with hash tags per session.

Eviction policy and stampede defence in production:

Twitter and Instagram apply TTL jitter to user feed caches — keys expire over a 60-120s spread to avoid millions of synchronous expiries (cited in multiple cache-stampede analyses).
Uber CacheFront runs Lua-script deduplication for concurrent cache fills, the production analogue of lock-on-miss for high-throughput hot keys.
DoorDash uses key replication ("stripe the hot key") — popular keys are mirrored to N copies and reads pick hash(request_id) % N to spread across shards.
Internet Archive published an XFetch reference implementation (probabilistic early expiration) accompanying their 2017 RedisConf talk; cited in Go/Rust caching libraries.
Pinterest documented running AOF everysec on EBS with hourly RDB to S3 for backup, per shard.
Sidekiq production guidance (Mike Perham) is explicit: noeviction for queues, allkeys-lru only for pure caches, never mix.
AWS ElastiCache best-practices doc recommends allkeys-lru as the default for general cache use and volatile-lru for mixed workloads.
Shopify documents disabling Transparent Huge Pages on Redis hosts to avoid 200ms-to-seconds fork() spikes during BGSAVE.
GitLab documents the same THP fix and recommends Redis 7+ hybrid AOF preamble for fast restart.
Stack Overflow uses Redis pub/sub for cache-invalidation broadcasts plus jittered TTLs to spread expiry load.
NestJS @nestjs/throttler Redis storage uses hash tags for multi-window atomic rate limiting in cluster mode.
BullMQ (Node.js queue) uses hash tags per queue so atomic Lua scripts for job-state transitions work across cluster.
Bitnami Helm chart for Redis Sentinel documents (issue #17047) the split-brain risk with podManagementPolicy=Parallel and min-replicas-to-write as mitigation.
Redis core issue tracker (issues #13658, #12271, #4334) documents canonical Sentinel split-brain scenarios and the required quorum + min-replicas-to-write configuration to prevent them.

Pause and recall

Why does Redis Cluster have exactly 16384 hash slots, and how is the slot for a key computed?
What is the difference in semantics between MOVED and ASK redirections, in one sentence each?
Why can hash tags create a hot-shard problem, and what is the rule of thumb for choosing the right tag scope?
What guarantee does min-replicas-to-write 1 min-replicas-max-lag 10 provide, and what does it cost in availability?
When is allkeys-lfu the strictly better choice than allkeys-lru?
Why does volatile-lru on an all-TTL keyspace add overhead with no benefit over allkeys-lru?
Pick one cache-stampede defence (jitter, lock-on-miss, XFetch) and explain when it is not enough on its own.
Why is cluster-mode backup "approximately consistent" rather than truly point-in-time?

Interview Q&A

Q1. We want to migrate a single-instance rate limiter to Redis Cluster. Walk me through the design.

A. The keys need a hash tag scoping each user's windows to one shard: user:{42}:rate:second, user:{42}:rate:minute. CRC16 hashes only the 42 inside the braces, so both windows land on the same shard, and the existing Lua/Function script with multiple KEYS still works atomically. Pick at least three primary shards plus one replica each, six nodes total. Use a cluster-aware client (redis-py-cluster, JedisCluster, ioredis cluster) so MOVED/ASK redirections are transparent. Set min-replicas-to-write 1 to prevent silent split-brain writes. Expect a couple of seconds of unavailability for one slot range during failover, which the application should retry with backoff.

Common wrong answer to avoid: "Just put everything in one big hash tag like {rate} so it all stays together." That collapses every user's traffic onto one shard. You have built a six-node cluster that performs worse than one node.

Q2. Your monitoring shows shard 3 of a 6-shard cluster running at 90% CPU while the others are at 15%. What's your investigation?

A. Almost always one of two causes. (a) Hot key: redis-cli --hotkeys against shard 3 (this only works when that node runs an LFU maxmemory-policy) finds a single key (often a celebrity user, a public dashboard, a popular product) absorbing the traffic. Fix: stripe the key across N copies and have clients shard reads by hash(request_id) % N, or move the key family to a dedicated Redis. (b) Low-cardinality hash tag: many keys collapsed onto shard 3 because the {tag} was too broad ({tenant:acme} for a big tenant). Fix: narrow the tag (per user, per cart) or split the big tenant's keyspace. Also check whether a large KEYS/SCAN/SUNION over shard 3 is running.

Common wrong answer to avoid: "Add more shards." Adding shards won't fix a hot key or a sticky hash tag, because the same key still hashes to one slot. You must fix the key shape first.

Q3. When should you use noeviction instead of allkeys-lru?

A. Whenever Redis holds data you cannot lose without losing user-visible work: queues (Sidekiq, Celery broker), session-of-record where no DB fallback exists, idempotency-key stores that protect double-charges, distributed locks for critical resources. allkeys-lru is for pure caches where any key can be regenerated from the DB. The clearest signal is: "if Redis silently dropped this key, what would the user notice?" If the answer is "their job runs twice" or "they get charged twice", you need noeviction plus generous memory headroom plus monitoring on used_memory / maxmemory.

Common wrong answer to avoid: "Always allkeys-lru, it's the safest." It is the safest for pure caches only. On a Sidekiq broker, allkeys-lru will quietly evict jobs and the queue will lose work without errors.

Q4. Explain the split-brain risk for Redis locks in cluster mode and what defends against it.

A. Each primary in the cluster has async replicas, so a write is acked before replication completes. If the primary crashes after acking but before replicating, the failover-promoted replica does not have that write. For locks acquired with SET NX PX, this means the lock can effectively be released by failover. Two clients can hold "the same lock": one against the old primary (whose acknowledgement is now lost), one against the new. Defences: (a) min-replicas-to-write plus min-replicas-max-lag makes the primary refuse writes when no sufficiently fresh replica is connected, which narrows the window but does not close it; (b) fencing tokens, a monotonic counter passed to the protected resource so stale lock holders are rejected; (c) for true correctness, use a real consensus system (etcd, ZooKeeper), not Redis. Redis locks remain fine as advisory: useful for the happy path, not the adversarial path.

Common wrong answer to avoid: "Use Redlock across N instances and you're safe." Kleppmann's critique remains: Redlock issues no fencing tokens and depends on timing assumptions (bounded clock drift and process pauses). Use it for advisory mutex; for money-touching correctness, use a database row lock or a real consensus service.
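
The fencing-token defence can be sketched from the resource's side. `FencedStore` is a hypothetical protected resource, and the tokens are assumed to come from a source that hands out strictly increasing numbers with each lock grant.

```python
class FencedStore:
    """Hypothetical resource that rejects writes carrying a stale fencing token."""

    def __init__(self):
        self.highest_token = 0
        self.value = None

    def write(self, token: int, value) -> bool:
        """Accept the write only if no newer lock holder has been seen."""
        if token < self.highest_token:
            return False              # a later lock grant already wrote here
        self.highest_token = token
        self.value = value
        return True


if __name__ == "__main__":
    store = FencedStore()
    print(store.write(33, "from client A"))   # A holds the lock with token 33
    print(store.write(34, "from client B"))   # failover: B is granted token 34
    print(store.write(33, "late write by A")) # A wakes up, still believing it holds the lock
    print(store.value)
```

The check lives in the resource, not in the lock service, which is why it still works when two clients each believe they hold the lock.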

Q5. Walk me through three cache-stampede defences and when you stack them.

A. TTL jitter (cheap, prevents synchronised mass expiry; every cache user should turn this on). Lock-on-miss (one request acquires SET stampede:lock NX EX 5, fetches from DB, populates cache, others either wait briefly or serve stale; right for expensive DB queries). Probabilistic early expiration / XFetch (each request rolls dice based on time-to-expiry and recomputation cost; statistically one request refreshes the key just before expiry while others serve hits; right for high-throughput hot keys). For a system that has both expensive queries and hot keys, stack jitter + lock-on-miss. For Uber-scale hot keys, stack jitter + XFetch (or Uber's Lua-script dedup, which is the same idea). The combination matters more than any single choice.

Common wrong answer to avoid: "Just add a global lock so only one request rebuilds." That single lock becomes its own bottleneck under heavy load: the lock contention storm replaces the DB stampede storm. Coalescing must be local to the key, with a short TTL.

Q6. A team wants to run cache + queue on one Redis to save infra cost. Talk them out of it.

A. They cannot. The eviction policy is global: one knob for the whole instance. For pure cache you need allkeys-lru; for queue you need noeviction. With allkeys-lru on a mixed instance, queue keys get evicted under memory pressure → silent job loss. With noeviction on a mixed instance, the cache stops accepting new entries when full → application sees OOM errors on cache writes. With volatile-lru you can compromise (cache gets TTL, queue keys don't, only TTL'd keys evict), but you have given up most LRU benefits and added a foot-gun: any developer who forgets a TTL on a cache key locks that key in forever. The right design is two Redis instances (or two clusters), different policies, different memory budgets. The infra cost is small versus a 3 AM page on lost jobs.

Common wrong answer to avoid: "Use Redis namespaces / DBs to separate them." Different DBs in one Redis still share one event loop, one maxmemory, one eviction policy. They are not isolated for the purposes of this decision.

Q7. Your Redis cluster shows 30-second P99 latency spikes every few minutes on shard 2 only. Investigate. A. First, redis-cli -h shard2 --latency-history to characterize. Second, INFO persistence — does the spike align with BGSAVE or AOF rewrite? On a 32 GB shard with THP on, fork can take seconds; cat /sys/kernel/mm/transparent_hugepage/enabled and disable if not already. Third, SLOWLOG GET 50 — long Lua scripts, big collection ops (SUNION, KEYS *, large LRANGE) block the thread. Fourth, CLUSTER INFO and CLUSTER NODES — is shard 2 mid-resharding or carrying ASK redirections to another shard? Fifth, replica lag (INFO replication) — replication is asynchronous, so a slow replica does not block writes directly, but its output buffer grows on the primary, and if the replica is dropped it comes back with a full resync that forces another fork. Sixth, host metrics — CPU steal, disk write saturation during fsync, network buffers. Common wrong answer to avoid: "Add more shards and rebalance." That spreads the load but does not identify the root cause; shard 2 is misbehaving for a reason, and adding shards will likely just spread the underlying problem (e.g., THP, slow command) across more nodes.
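
The second step of Q7 (does the spike line up with a fork?) can be partly automated by parsing INFO output. The sketch below is one possible helper, with assumptions stated plainly: the function names and the 100 ms fork budget are made up for the example, and the input is the plain-text reply of INFO. Note that `latest_fork_usec` is reported in the stats section, while the two `*_in_progress` flags are in the persistence section, so the text would need to cover both.

```python
def parse_info(text):
    """Parse the 'key:value' lines of an INFO reply into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and section headers
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def fork_suspects(info, fork_budget_usec=100_000):
    """Return human-readable findings that point at fork-related stalls."""
    findings = []
    if info.get("rdb_bgsave_in_progress") == "1":
        findings.append("BGSAVE in progress")
    if info.get("aof_rewrite_in_progress") == "1":
        findings.append("AOF rewrite in progress")
    fork_usec = int(info.get("latest_fork_usec", "0"))
    if fork_usec > fork_budget_usec:  # budget is an assumed threshold
        findings.append(f"last fork took {fork_usec / 1000:.0f} ms")
    return findings


sample = (
    "# Persistence\r\n"
    "rdb_bgsave_in_progress:0\r\n"
    "aof_rewrite_in_progress:1\r\n"
    "\r\n"
    "# Stats\r\n"
    "latest_fork_usec:2400000\r\n"
)
```

For the sample above, `fork_suspects(parse_info(sample))` reports an AOF rewrite in progress and a 2400 ms fork, which is the pattern to expect from a large shard with THP enabled.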

Q8. Compare Sentinel and Cluster failover. When would you pick which? A. Sentinel is for "I want one Redis, but with high availability" — three or more Sentinel processes monitor a single primary plus N replicas, vote on failure, promote a replica, update clients via Pub/Sub. The keyspace stays unified; no sharding. Cluster is for "I want N primaries, each with HA, and the keyspace partitioned across them" — the primaries detect a failed peer over the cluster bus and vote to promote one of its replicas, no separate Sentinel process. Pick Sentinel when one Redis fits and you want simpler ops. Pick Cluster when you need either (a) more memory than one shard can hold, (b) more throughput than one event loop can sustain, or (c) cross-AZ resilience with per-shard failover. Operationally Sentinel has one more moving part (the Sentinel processes) and a failover touches the whole keyspace; Cluster needs no separate process, and a failover affects only one shard's slot ranges at a time. Common wrong answer to avoid: "Cluster is always better because it scales." For workloads that fit in one Redis (most of them), Cluster adds significant operational complexity — multi-key constraints, hash tags, cluster bus tuning, MOVED/ASK handling in every client — for zero gain. Use the simplest mode that fits the workload.

Apply now (10 min)¶

Step 1 — model the exercise. Take our scaled rate limiter cluster. Below is an audit with one row per production decision.

| Decision | Choice | Why |
| --- | --- | --- |
| Cluster topology | 6 primaries + 6 replicas, 3 AZs | 10× single-node throughput, AZ failure survives |
| Hash tag scope | `{user:<id>}` | each user's windows stay on one shard for atomic Lua; cardinality is millions, balanced |
| Replication guard | `min-replicas-to-write 1`, `min-replicas-max-lag 10` | refuse writes that can't durably replicate; accepts brief unavailability during partition |
| Eviction policy | `allkeys-lru`, `maxmemory-samples 10` | every key has TTL; LRU rotates stale users out; LFU not needed (no Zipf skew in limiter keys) |
| Hot-key defence | stripe celebrity user across 4 keys, lock-on-miss for profile cache | Taylor Swift's profile gets 200k QPS; spread reads + dedup misses |
| Persistence | AOF `everysec` + hybrid preamble, per shard | ≤1s data loss is fine for rate limits; ~60s restart per shard |
| Backup | RDB hourly per shard → S3, accept near-consistent | per-shard hash-tag design means business invariants don't cross shards |
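
The hash-tag row can be checked by computing slots by hand. The sketch below follows the slot rule from section 1, CRC16(key) mod 16384 using the XMODEM variant of CRC16, plus the hash-tag rule: if the key contains a `{` followed later by a `}` with at least one character between them, only that substring is hashed. The function names and example key names are illustrative, not taken from any client library.

```python
def crc16_xmodem(data):
    """Bitwise CRC16, polynomial 0x1021, initial value 0 (XMODEM variant)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def hash_slot(key):
    """Cluster slot for a key, honouring the {hash tag} rule."""
    raw = key.encode()
    start = raw.find(b"{")
    if start != -1:
        end = raw.find(b"}", start + 1)
        # An empty tag "{}" does not count; the whole key is hashed instead.
        if end != -1 and end != start + 1:
            raw = raw[start + 1:end]
    return crc16_xmodem(raw) % 16384


# Two windows of the same user share a tag, so they share a slot.
same_shard = hash_slot("{user:42}:window:1") == hash_slot("{user:42}:window:2")
```

Because only `user:42` is hashed, both window keys land in the same slot and a Lua script can touch them atomically, which is the reason the table gives for scoping the tag to the user.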

Step 2 — your turn. Pick a Redis-touching service in your stack with more than one node, or any service you would scale beyond one node. Fill the same seven rows. For each row, write one sentence on what would break if you made the opposite choice. If you cannot articulate the breakage for a row, that decision has not been justified yet — it is just a default you inherited.

Step 3 — sketch from memory. Redraw two diagrams side by side. (a) The MOVED-vs-ASK redirection sequence from section 1, labelling who updates the routing table and who does not. (b) The stampede timeline from section 5, with three concurrent requests R1/R2/R3 hitting an expired key, the lock-on-miss path for R1, and what R2/R3 do. If you can draw both cold, the cluster routing model and the stampede defence are internalised.

Bridge. We now have a sharded, replicated, eviction-aware, stampede-defended Redis. But "should you use Redis at all" is a separate question — sometimes memcached is right, sometimes write-through beats cache-aside, sometimes the cache itself is the wrong shape. The next chapter steps back and frames cache patterns proper — cache-aside vs write-through vs write-behind, Redis vs memcached, and the staff-interview frames that come up when you defend these choices. → 04-cache-patterns-vs-memcached-interview.md