02. Commands, persistence, clients: Redis day to day

~17 min read. We left the previous chapter with a working rate limiter — sorted set, four commands, one Lua script, atomic by virtue of the single thread. Now we put that limiter into production. We pick the right verbs for the everyday work. We argue about persistence — RDB or AOF or both. We learn when pipelining beats a transaction, when a Lua script beats a pipeline, and how a real client library — redis-py, Lettuce, ioredis — turns your call into framed RESP bytes the event loop can dispatch without choking. By the end of the page you will have the wiring to deploy that limiter, survive a restart, and not lose money.

Builds on: 00-eli5.md and 01-data-structures-single-thread-loop.md.

1) SET, GET, INCR, EXPIRE: the core verbs and their atomicity

These four commands carry most of production. The shape is small, the guarantees are big, and almost every other pattern in Redis is built by composing them.

SET key value writes a string in O(1). GET key reads it in O(1). INCR key atomically advances an integer-valued string by one and returns the new value. EXPIRE key seconds (or PEXPIRE for milliseconds) attaches a TTL: the key is deleted automatically once the deadline passes (this is expiration, which is separate from eviction under memory pressure). Each of these commands runs to completion before the next one starts, because of the single-threaded loop you saw in chapter 1. That is the atomicity guarantee: per command, not per business operation.

The distinction matters. Consider the obvious rate-limiter shape: INCR rate:user:42, then EXPIRE rate:user:42 60. Two commands, two round-trips, one race. If the client crashes or loses its connection between the INCR and the EXPIRE, you have a key with no TTL. It will live forever. Memory leaks one user at a time. The fix is to create the key and its expiry in a single atomic command: SET key value EX seconds NX sets the value with a TTL only if the key does not exist yet, and SET key value EX 60 XX updates only when the key already exists. For a counter, issue SET rate:user:42 0 EX 60 NX before the INCR; INCR leaves an existing TTL untouched, so the counter starts life with its deadline attached. Modern Redis exposes the right flags so the two-step trap disappears.

   SHAPE                          ATOMIC?       TYPICAL USE
   ─────                          ───────       ───────────
   INCR + EXPIRE (two commands)   NO            buggy fixed-window limiter
   SET key val EX 60 NX           YES           idempotent claim with TTL
   INCR with EXPIRE preset        YES (after    counter that lives at most 60s
     on first SET)                  first set)
   GETEX (Redis 6.2+)             YES           read-and-refresh-TTL in one shot
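
The "preset the TTL, then INCR" row can be sketched as follows. This is a minimal sketch, assuming r is a redis-py client; the function name hit and the key layout are illustrative and not part of the chapter. It queues both commands in one MULTI/EXEC transaction (covered in section 4) so they run back to back on the server.

```python
def hit(r, key, window_seconds=60):
    """Count one request in a fixed window and return the new count.

    `r` is assumed to be a redis-py client (redis.Redis). The key name
    and the window length are illustrative.
    """
    # transaction=True wraps the queued commands in MULTI/EXEC.
    pipe = r.pipeline(transaction=True)
    # Create the counter together with its TTL, only if it is missing.
    pipe.set(key, 0, ex=window_seconds, nx=True)
    # INCR keeps whatever TTL the key already has.
    pipe.incr(key)
    _created, count = pipe.execute()
    return count
```

A caller would compare the result with its limit, for example rejecting the request when hit(r, "rate:user:42") exceeds 100.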

Teacher voice. Redis gives you atomicity per command, free. Production correctness needs atomicity per intent. The whole game of pipelining, MULTI/EXEC, Lua, and Functions is just different ways to widen "per command" to "per intent" without sacrificing speed.

Back to our threaded example. The rate-limiter Lua script from chapter 1 used ZREMRANGEBYSCORE, ZCARD, ZADD, PEXPIRE: four commands wrapped in EVAL. The single-thread loop made the wrap atomic. Now we ask: what if the process dies before AOF flushes? Section 3 answers that.

2) SETNX, SET ... NX, distributed locks, the Redlock controversy

SETNX key value sets a key only if it does not exist, returning 1 on success and 0 on failure. The modern form is SET key value NX EX 30: the same idea with an expiry baked in, so a crashed lock holder does not freeze the system forever. This single command, with its NX and EX options, is the canonical recipe for a single-instance distributed lock.

   acquire:  SET lock:invoice:881 holder_uuid NX EX 30
             → "OK"     → I hold the lock
             → (nil)    → someone else holds it

   release:  EVAL "if redis.call('GET', KEYS[1]) == ARGV[1] then
                     return redis.call('DEL', KEYS[1])
                   else return 0 end" 1 lock:invoice:881 holder_uuid

The Lua release is non-negotiable. If you do a naive DEL lock:invoice:881, your lock might already have expired and a new owner taken it — and you just deleted their lock. The check-and-delete must be atomic, which means Lua, which means the single thread.
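
The acquire and release steps above translate to Python roughly like this. It is a sketch, assuming r is a redis-py client; the helper names acquire and release are hypothetical, and the Lua body is the same check-and-delete shown above.

```python
import uuid

RELEASE_SCRIPT = """
if redis.call('GET', KEYS[1]) == ARGV[1] then
    return redis.call('DEL', KEYS[1])
else
    return 0
end
"""


def acquire(r, name, ttl_seconds=30):
    """Try to take the lock; return the holder token, or None if it is held."""
    token = uuid.uuid4().hex
    # SET name token NX EX ttl: redis-py returns True on success, None otherwise.
    if r.set(name, token, nx=True, ex=ttl_seconds):
        return token
    return None


def release(r, name, token):
    """Delete the lock only if we still own it; True when a key was deleted."""
    return r.eval(RELEASE_SCRIPT, 1, name, token) == 1
```

The random token is what makes the release safe: a holder whose lease already expired presents a token that no longer matches, so the script leaves the new owner's key alone.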

This pattern is widespread in production. Slack uses SET NX PX to claim each incoming event ID so retried webhooks are silently dropped. Stripe and many payment APIs use Redis-backed idempotency keys with SET NX so a retried POST /charges produces the same charge ID. Bot platforms use per-thread leases so only one worker handles a Slack thread at a time.

Now the controversy. Redlock is the multi-node extension — acquire a majority of locks across N independent Redis instances with timing checks. The argument is that single-instance locks are unsafe if Redis fails over, because the new primary may not have replicated the lock. Antirez (Redis's creator) published the Redlock algorithm. Martin Kleppmann published a famous critique arguing Redlock is unsafe for correctness-critical uses because (a) clock jumps on any node break the timing model, and (b) Redlock generates no fencing tokens — no monotonically-increasing number you can pass to the protected resource so it can reject stale lock-holders. Antirez replied that the post-acquire timing check defends against the clock jump, and that fencing tokens are a separable concern that any lock service needs.

The practical synthesis most teams land on: Redis locks are fine as advisory — best-effort coordination, mutual exclusion in the happy path, not safety in adversarial conditions. If you genuinely need a lock for correctness on money, use a database transaction with SELECT FOR UPDATE or a real fencing token from ZooKeeper/etcd. Use Redis to make the common case fast.
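
To make the fencing-token idea concrete, here is a toy model in plain Python with no Redis involved. The class name FencedStore and the token numbers are made up for illustration; in a real system the tokens would come from the lock service and the check would live inside the protected resource.

```python
class FencedStore:
    """A protected resource that remembers the highest token it has accepted."""

    def __init__(self):
        self.highest_token = 0
        self.value = None

    def write(self, token, value):
        # Reject any writer whose token is older than one already accepted.
        if token < self.highest_token:
            return False
        self.highest_token = token
        self.value = value
        return True


store = FencedStore()
# Worker A was granted token 33 and then stalled (long pause, partition).
# Its lease expired and worker B was granted token 34.
accepted_b = store.write(34, "from B")
# A wakes up still believing it holds the lock; the store refuses the write.
accepted_a = store.write(33, "from A")
```

The lock alone cannot stop worker A from writing late. The monotonic token lets the resource itself reject the stale holder.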

Mini-FAQ. "Is SET NX PX enough for idempotent payment APIs?" By itself, no. The idempotency record and the business operation must commit together, usually via the payment DB's transaction with the Redis key as a coordination hint. Stripe's published guidance treats the Redis key as a fast claim, with the source of truth in the ledger.

For our rate limiter, this is not directly relevant — the rate-limit key is the truth, and a stale rate-limit decision is at worst a missed reject, not a double charge. But the moment you want one worker per user processing a request, you reach for SET NX PX.

3) RDB vs AOF: snapshot vs log, fsync policies, AOF rewrite

Memory is fast but volatile. The day Redis crashes (and it will — OOM kills happen, kernels panic, Spot instances terminate), what survives is what you wrote to disk. Redis ships two persistence models, both useful, and Redis 7 makes them play nicely together.

RDB: point-in-time snapshot. Periodically, a child process is fork()ed and dumps the entire keyspace to a .rdb file. Compact format. Fast restart: Redis reads the file and loads the serialized keyspace straight into memory, with no commands to replay. The cost is a window of data loss: if your last snapshot was 5 minutes ago and Redis dies now, those 5 minutes are gone. The other cost is that fork() on a multi-gigabyte instance copies the page table; Linux's copy-on-write keeps actual RAM use low, but the syscall itself can spike latency for a few hundred milliseconds on big instances.

AOF — append-only log. Every write command (well, the post-resolution form of it) is appended to a log file. On restart, Redis replays the log to reconstruct state. The fsync policy decides durability versus throughput.

   appendfsync   GUARANTEE                       THROUGHPUT IMPACT
   ───────────   ─────────                       ─────────────────
   always        write fsynced after EVERY cmd   ~10-30% of no-fsync
   everysec      fsync once per second           ~5% slowdown, ≤1s loss
   no            OS decides (~30s buffers)       no overhead, large loss

appendfsync everysec is the default fsync policy and the right answer for nearly everyone. Note that AOF itself is off by default, so you must also set appendonly yes. It is also what Sidekiq's documented production guidance recommends, and what Pinterest runs on its sharded Redis fleet (every-second AOF on EBS, hourly RDB to S3 for backup).

The AOF gets large. Each INCR counter is one line in the log. After a million increments, the log is a million lines, even though the final state is one integer. AOF rewrite solves this — Redis periodically rewrites the AOF as a compacted version that produces the same final state. Trigger is auto-aof-rewrite-percentage 100 (rewrite when AOF doubles since last rewrite) plus auto-aof-rewrite-min-size 64mb.
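
The two trigger settings combine as in the small model below. It is a simplified sketch of the rule, not Redis source code; the function name should_rewrite is illustrative, sizes are in bytes, and base_size stands for the AOF size right after the previous rewrite.

```python
def should_rewrite(current_size, base_size, percentage=100,
                   min_size=64 * 1024 * 1024):
    """Return True when an automatic AOF rewrite would be triggered."""
    if percentage == 0:
        # A percentage of 0 disables automatic rewrites.
        return False
    if current_size <= min_size:
        # Small logs are never rewritten automatically.
        return False
    base = base_size if base_size > 0 else 1
    # Growth since the last rewrite, as a whole percentage.
    growth = current_size * 100 // base - 100
    return growth >= percentage
```

With the defaults above, a log that was 100 MB after its last rewrite is rewritten again once it reaches 200 MB, while a 60 MB log is left alone no matter how fast it grew.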

Redis 7 multi-part AOF. Pre-7, the rewrite child wrote a new AOF to a temporary file while the parent accumulated incoming writes in an in-memory rewrite buffer; at the end the parent appended that buffer to the new file. That merge was a brief freeze: milliseconds usually, sometimes seconds for big instances. Redis 7.0 split the AOF into a base file (RDB-formatted snapshot at last rewrite time) plus incremental files (commands since then), tracked by a manifest. The rewrite child writes the new base; the parent keeps appending to a fresh incremental. When the child is done, Redis swaps the manifest atomically. No merge step, no buffer doubling, no freeze.

Hybrid persistence. aof-use-rdb-preamble yes (the default since Redis 5) writes the rewritten base in RDB format, with plain AOF commands for every change since. On restart, Redis loads the RDB base fast, then replays the small incremental tail. Best of both: RDB's restart speed, AOF's durability bound.

For our rate-limiter deployment: AOF everysec is enough. A second of lost rate-limit data is fine; we are not refunding money based on it. If the same Redis instance is also the broker for Sidekiq jobs, AOF becomes mandatory, because losing jobs is losing customer-visible work.

4) Pipelining vs MULTI/EXEC: round-trips versus atomicity

A Redis command takes roughly 50 microseconds of server work. A TCP round-trip between AZs in the same region is 0.5-2 milliseconds. The math is brutal — for small commands, you spend 95% of wall time waiting for the network. Pipelining fixes this.

Pipelining is a client-side concept. The client sends N commands back-to-back without waiting for each reply, then reads N replies in order. The server processes them one at a time as normal (still single-threaded), but the network cost amortizes across the batch. Typical speedup: 5-10× for 100-command batches, 20-30× for 1000-command batches.

   WITHOUT PIPELINING                WITH PIPELINING (batch of 5)
   ──────────────────                ─────────────────────────────
   client       server                client          server
     │  GET k1     │                    │ GET k1 ──┐    │
     │ ─────────►  │                    │ GET k2   │    │
     │  reply k1   │                    │ GET k3   ├──► │  (process
     │ ◄─────────  │                    │ GET k4   │    │   k1..k5
     │  GET k2     │                    │ GET k5 ──┘    │   in order)
     │ ─────────►  │                    │               │
     │  ...        │                    │ ◄── reply k1  │
                                        │ ◄── reply k2  │
   5 RTTs                                │ ◄── reply k5  │
                                              1 RTT
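
The right-hand lane of the diagram looks like this in client code. A minimal sketch, assuming r is a redis-py client; get_many is a hypothetical helper name.

```python
def get_many(r, keys):
    """Fetch many independent keys in one network round-trip.

    `r` is assumed to be a redis-py client. transaction=False means plain
    pipelining: no MULTI/EXEC, so no atomicity across the batch.
    """
    pipe = r.pipeline(transaction=False)
    for key in keys:
        pipe.get(key)          # queued client-side, nothing is sent yet
    values = pipe.execute()    # one write, then the replies are read in order
    return dict(zip(keys, values))
```

For plain GETs a single MGET would do the same job; the pipeline form matters because it also works for a mix of different commands.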

MULTI/EXEC is a server-side concept. The client sends MULTI, then queues commands (each is replied with QUEUED), then EXEC. The server holds the queue and runs all the commands atomically at EXEC time. No other client's command interleaves. The trade is that you cannot branch on intermediate values — MULTI/EXEC is "execute these commands as a batch", not "read this then conditionally write that."

So:

Pipelining = "I have a hundred independent reads/writes; eliminate the round-trips." No atomicity across the batch. Partial failure on disconnect is possible.
MULTI/EXEC = "These five writes must commit together; no client may see the half-state." Atomic. But no read-then-write logic.
Lua / Functions = "Read, branch, write — atomically." Atomic and conditional.
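
For contrast with pipelining, here is the MULTI/EXEC case as a sketch, again assuming r is a redis-py client. The transfer helper and the idea of integer balance keys are illustrative only.

```python
def transfer(r, src, dst, amount):
    """Move `amount` from one integer counter to another atomically.

    `r` is assumed to be a redis-py client. The decision to transfer has
    already been made; nothing here branches on the replies.
    """
    pipe = r.pipeline(transaction=True)   # MULTI ... EXEC
    pipe.decrby(src, amount)
    pipe.incrby(dst, amount)
    new_src, new_dst = pipe.execute()     # both replies arrive after EXEC
    return new_src, new_dst
```

Note what the code cannot do: it cannot look at the source balance first and skip the transfer when funds are short. That read-then-decide step is what pushes you to Lua.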

Twitter's timeline service heavily uses pipelining through twemproxy, which automatically batches commands across client connections to shared upstream Redis instances. The proxy lets dozens of web servers feel like they have one big Redis — under the hood it is fan-out plus pipeline-on-the-wire.

For our rate limiter, pipelining is not the right tool by itself — the four commands depend on each other (read ZCARD, then conditionally ZADD). We need the read-and-branch, which means Lua. But if our service does many independent rate-limit checks (say, a single request consumes quota from three different buckets), we can pipeline three EVAL calls in one round-trip.

5) Lua scripts and Redis Functions: when EVAL beats pipelining

Lua scripts run on the server, atomically on the command thread, with access to all of Redis. EVAL "..." numkeys key1 ... arg1 ... ships the script body on every call. EVALSHA sha1 calls a previously SCRIPT LOAD-ed script by hash: same body, no network repetition. Either way, the script runs in one indivisible block. From the first redis.call to the last, no other client's command interleaves.

This is exactly what our rate limiter needs. The four-command shape from chapter 1 — eviction, count, decide, add, expire — is one logical operation. Pipelining would send all four but does not give us the branch. MULTI/EXEC would batch them but cannot conditionally ZADD based on ZCARD. Lua is the only fit.

   PATTERN              ATOMIC?  CONDITIONAL?  ROUND-TRIPS    NOTES
   ───────              ───────  ────────────  ───────────    ─────
   Naive 4-command          NO       YES           4         race risk
   Pipelined 4-command      NO       NO            1         can't branch
   MULTI/EXEC 4-command    YES       NO            1         can't branch
   Lua script              YES       YES           1         the answer
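
The last row of the table can be sketched as follows. The Lua body is a reconstruction from the four commands named in this chapter, so the chapter-1 script may differ in detail; r is assumed to be a redis-py client, and make_limiter is a hypothetical helper name.

```python
import uuid

SLIDING_WINDOW = """
local key    = KEYS[1]
local now    = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local limit  = tonumber(ARGV[3])
redis.call('ZREMRANGEBYSCORE', key, 0, now - window)
if redis.call('ZCARD', key) >= limit then
    return 0
end
redis.call('ZADD', key, now, ARGV[4])
redis.call('PEXPIRE', key, window)
return 1
"""


def make_limiter(r):
    """Return allow(key, now_ms, window_ms, limit) backed by one Lua call."""
    # register_script calls the script via EVALSHA and loads it if missing.
    script = r.register_script(SLIDING_WINDOW)

    def allow(key, now_ms, window_ms, limit):
        # A unique member keeps two hits in the same millisecond distinct.
        member = uuid.uuid4().hex
        reply = script(keys=[key], args=[now_ms, window_ms, limit, member])
        return reply == 1

    return allow
```

The branch on ZCARD happens inside the script, on the server, which is exactly the step neither a pipeline nor MULTI/EXEC can express.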

Redis Functions (7.0+) are the modern replacement for ephemeral EVAL. Functions are named, versioned, persisted in RDB and AOF, replicated to replicas automatically. EVAL scripts vanish on restart unless your client re-SCRIPT LOADs them; Functions are first-class data. The library you upload contains many functions; they call each other; they can be hot-swapped with FUNCTION LOAD REPLACE. For new Redis 7+ deployments, Functions are the recommended path.

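For our rate limiter, we deploy the four-command logic as a Redis Function: a library named mylimiter that registers a function such as mylimiter_allow, taking the user's key plus now, window, and limit. Function names may contain only letters, digits, and underscores, and FCALL addresses a function by its own name rather than as library.function. The client just calls FCALL mylimiter_allow 1 rate:user:42 .... The function survives restart and rolling-replace, and the client never has to know the script SHA.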
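
A possible shape for that deployment, sketched with redis-py's function_load and fcall. Because function names are limited to letters, digits, and underscores, the sketch registers mylimiter_allow inside a library named mylimiter. The Lua body is an assumption reconstructed from the four commands named in this chapter, and the extra member argument is an illustrative addition.

```python
LIMITER_LIBRARY = """#!lua name=mylimiter
local function allow(keys, args)
    local key    = keys[1]
    local now    = tonumber(args[1])
    local window = tonumber(args[2])
    local limit  = tonumber(args[3])
    redis.call('ZREMRANGEBYSCORE', key, 0, now - window)
    if redis.call('ZCARD', key) >= limit then
        return 0
    end
    redis.call('ZADD', key, now, args[4])
    redis.call('PEXPIRE', key, window)
    return 1
end
redis.register_function('mylimiter_allow', allow)
"""


def deploy(r):
    """Upload (or hot-swap) the library; run once at deploy time."""
    return r.function_load(LIMITER_LIBRARY, replace=True)


def allow(r, key, now_ms, window_ms, limit, member):
    """One FCALL round-trip; True when the request is within the limit."""
    reply = r.fcall("mylimiter_allow", 1, key, now_ms, window_ms, limit, member)
    return reply == 1
```

The request path only ever calls allow; deploy belongs in the release pipeline, which is what keeps script SHAs out of application code.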

Teacher voice. EVAL is convenient. Functions are infrastructure. The difference matters the day you do a Redis upgrade and find half your services broke because their cached script SHA is gone from the new instance.

The dark side: a Lua script that runs eight milliseconds blocks every other client for eight milliseconds. The single thread is unforgiving. Treat Lua as "small, fast, deterministic" — read a few keys, decide, write a few keys. Never iterate a million-entry collection. Never call out to anything non-Redis. The script is in the hot path; act like it.

6) Client patterns: pools, retries, jittered backoff, pub/sub

The client library bridges your application to RESP frames on the wire. Three clients dominate production: redis-py (Python), Lettuce (Java, async/Netty-based, the default for Spring Data Redis), ioredis (Node.js, the de-facto choice in the JS ecosystem). The conventions differ, but the patterns converge.

Connection pools. Redis serves all commands on one thread, so one persistent connection per process is, in principle, enough. In practice you want a small pool — typically 10-50 connections — for three reasons. (a) Blocking commands like BRPOP park their connection until a message arrives; without a pool, the next caller waits. (b) Multi-step transactions (MULTI/EXEC or WATCH-based optimistic locking) need a dedicated connection for their lifetime. (c) HTTP-thread frameworks let many threads issue commands concurrently; one connection serializes them at the network layer.

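Lettuce is unusual here: its StatefulRedisConnection is thread-safe by design, multiplexing commands from many threads onto one channel. The official guidance is not to pool unless you specifically use blocking commands or transactions. redis-py uses a connection pool by default. ioredis has no built-in pool: each Redis object owns one connection and queues commands onto it, so you create extra instances for blocking commands or subscribers.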

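   redis-py (6.0+):
     import redis
     from redis.backoff import ExponentialBackoff
     from redis.retry import Retry
     pool = redis.ConnectionPool(
         host='r.local', port=6379, max_connections=50,
         socket_timeout=2, socket_connect_timeout=2,
         retry=Retry(ExponentialBackoff(cap=10, base=0.1), retries=3),
         # redis-py's own exception classes, not the Python builtins
         retry_on_error=[redis.ConnectionError, redis.TimeoutError],
     )
     r = redis.Redis(connection_pool=pool)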

   Lettuce (Spring Boot YAML):
     spring.redis.lettuce.pool.max-active: 16
     spring.redis.lettuce.pool.max-idle: 8
     spring.redis.lettuce.pool.min-idle: 2
     spring.redis.timeout: 2s

   ioredis (Node):
     new Redis({
       host: 'r.local', port: 6379,
       maxRetriesPerRequest: 3,
       retryStrategy: (n) => Math.min(n * 200 + Math.random()*200, 2000),
       enableOfflineQueue: true,
     })

Retries with jittered backoff. redis-py 6.0+ retries failed commands three times by default, with exponential-plus-jitter delays. The jitter — random noise on each delay — exists to break the thundering-herd correlation when, say, 200 web pods all reconnect to the same Redis after a 1-second blip. Without jitter, they all retry at exactly T+100ms, T+200ms, T+400ms; the spikes pile on. With jitter, they spread out, and Redis catches up. AWS ElastiCache's published client guidance is explicit: always use exponential backoff with full jitter, cap somewhere around 10 seconds.
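
The difference between the two schedules is a few lines of arithmetic. The sketch below is plain Python and independent of any client library; the function names are illustrative, and base and cap mirror the 100 ms and 10 s figures used above.

```python
import random


def no_jitter_delay(attempt, base=0.1, cap=10.0):
    """Capped exponential step in seconds for a 0-based attempt number."""
    return min(cap, base * (2 ** attempt))


def full_jitter_delay(attempt, base=0.1, cap=10.0, rng=random):
    """Full jitter: pick uniformly between 0 and the capped exponential step."""
    return rng.uniform(0, no_jitter_delay(attempt, base, cap))
```

Without jitter every pod computes the same 0.1 s, 0.2 s, 0.4 s sequence. With full jitter each pod draws its own point inside that growing window, so the retries arrive spread out instead of in waves.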

Pub/Sub vs keyspace notifications. Both are fire-and-forget — no acknowledgement, no replay. Pub/Sub is "I publish, all current subscribers get it; future subscribers miss it." Keyspace notifications are a special Pub/Sub channel where Redis publishes events about its own data ("key X was SET", "key Y expired"). Useful for cache invalidation (Stack Overflow's L1/L2 design uses this), feature-flag broadcasts, and reactive workflows. Dangerous for anything requiring delivery — if a subscriber is slow or disconnects, the event is gone. For durable event delivery, use Streams with consumer groups instead.

For our rate-limiter service, the client setup is small but specific. One connection pool with 16 connections, 2-second socket timeout, 3 retries with jittered exponential backoff, no offline queue (we want a fast fail back to the caller, not silent buffering). The four-command Function is loaded once at deploy time. Each request is one FCALL round-trip — about 0.4 ms across an AZ-local Redis on a c7g.large.

7) Comparison table: RDB vs AOF vs no persistence (Redis 7.2)

Numbers from official Redis docs, published case studies, and redis-benchmark on Redis 7.2 / c7g.large (2 vCPU Graviton, 50 GB keyspace). They are the right order of magnitude; expect anything from half to double on your own stack.

Mode	Max data loss	Restart time (50 GB)	Write amplification	fsync syscalls/sec	When to use
No persistence	Everything since boot	~100 ms (empty)	0	0	pure cache where cold restart is fine
RDB only (every 5 min)	Up to 5 min	~60-90 s (load RDB)	~1× per snapshot, fork() spike	0	analytics, cache with snapshot-for-warm-restart
AOF `everysec` only	≤ 1 s	4-8 min (replay full log)	1× per write, until rewrite	1/sec	Sidekiq queue, session store, most teams
AOF `always` only	0 (one command)	4-8 min	1× per write	one per write (~10-30k/sec ceiling)	money-touching state, hardest durability
AOF `everysec` + hybrid preamble (RDB base)	≤ 1 s	~60 s base + small tail replay	as above	1/sec	Redis 7 default; recommended for most
RDB hourly + AOF `everysec` (no preamble)	≤ 1 s	4-8 min (full AOF replay)	as above	1/sec	older Redis or legacy ops

A few callouts the table flattens. The fork-spike during RDB snapshot or AOF rewrite is the silent latency killer on big instances — 32 GB Redis on Linux 5.x typically pauses for 200-500 ms while the kernel copies page tables. Disable transparent huge pages (echo never > /sys/kernel/mm/transparent_hugepage/enabled) before benchmarking, or you will measure THP defragmentation instead of Redis. Pinterest, GitLab and Shopify all document this in their Redis operational guides.
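
Before trusting a benchmark, it is worth checking what the host is actually set to. A small sketch in Python, assuming a Linux host; the helper names are illustrative. The kernel lists every mode in that sysfs file and puts brackets around the active one.

```python
def thp_mode(sysfs_text):
    """Return the active THP mode from the sysfs file's contents.

    The kernel prints all modes and brackets the active one,
    for example "always madvise [never]".
    """
    for word in sysfs_text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    raise ValueError("no bracketed mode found in %r" % sysfs_text)


def read_thp_mode(path="/sys/kernel/mm/transparent_hugepage/enabled"):
    """Read the live setting on a Linux host."""
    with open(path) as fh:
        return thp_mode(fh.read())
```

A pre-benchmark check would call read_thp_mode() and refuse to continue unless it returns "never".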

Mini-FAQ. "Why does AOF replay take so long on restart?" Each command in the log is parsed and executed against an initially-empty state. Hybrid preamble cuts this dramatically — the RDB base loads as a binary blob, only the incremental tail is replayed.

Where this lives in the wild

These patterns show up everywhere production Redis runs. Two natural categories: shops that built custom infrastructure on Redis primitives, and shops that wire off-the-shelf clients into standard frameworks.

Companies running custom Redis infrastructure with these primitives:

Twitter uses twemproxy to pipeline and shard commands across thousands of Redis instances; home timelines fan out with LPUSH/LTRIM and the proxy batches commands on the wire — 30B daily Redis updates per the Tanzu case study.
GitHub runs a sharded, replicated Redis rate limiter using Lua scripts for atomicity; the engineering blog documents how Lua removed the read-then-write race in their early naive INCR+EXPIRE limiter.
Stripe uses Redis-backed idempotency keys with SET NX plus capacity-reservation rate limiters that count in-flight requests per type, with Lua coordinating the atomic decisions.
Pinterest runs sharded Redis with AOF everysec on EBS and hourly RDB to S3 for billions of follow edges; the dual-persistence shape is documented in their engineering blog.
Snapchat (KeyDB fork) runs multithreaded Redis for User Service caching in GKE, cutting P99 from 49-133 ms to 1.5-2.1 ms per the Google Cloud case study.
Sidekiq deployments at Shopify, GitHub, Stripe run Redis with AOF everysec for job queue durability — the Sidekiq production guide is explicit that Redis must be persistent for job reliability.
Uber CacheFront runs Redis as an integrated cache fronting Docstore, sustaining 150M reads/sec with cache hit rates above 99.9% after extending TTLs up to 24 hours.
DoorDash runs Redis as the L3 cache beneath request-local maps and Caffeine, using Lettuce as the client and runtime knobs to flip layers per service.
Stack Overflow uses Redis Pub/Sub for L1 cache invalidation across web servers, with StackExchange.Redis as the client; pub/sub events tell each server's in-process L1 to drop stale entries.
Slack bot platforms use SET NX PX to claim each incoming event ID for idempotent webhook handling, and per-thread leases to coordinate worker exclusion.
Klarna's support stack and Elastic AI Assistant keep short-lived LangGraph agent state (pending tool calls, memory snippets) in Redis hashes with TTL, using ioredis and redis-py clients respectively.
Discord caches guild/channel/presence objects as hashes via gateway-cache layers (RainCache, redis-discord-cache) — the engineering case study calls out 10-50× inbound query reduction for popular channels.

Standard-framework wirings using redis-py / Lettuce / ioredis:

Django apps with django-redis use redis-py connection pools with exponential-backoff retries to cache sessions, view fragments, and rate-limit counters.
Spring Boot apps with Spring Data Redis use Lettuce by default, with a single shared multiplexed connection and a small pool for blocking ops; the AWS Lettuce-on-ElastiCache best-practices doc codifies the configuration.
Celery (Python ecosystem) uses Redis lists as broker via redis-py, with BRPOP-blocking workers and a second Redis DB as the result backend.
Bull / BullMQ (Node.js queue libraries) sit on ioredis and use Lua scripts heavily for atomic move-job-between-state operations; the BullMQ source has dozens of Lua scripts.
NestJS rate-limiter (@nestjs/throttler with Redis storage) uses ioredis and a Lua-based fixed/sliding-window limiter very similar to the chapter-1 design.
Express rate-limit-redis uses ioredis with a Lua sliding-window-log script — the most-used Node.js rate-limit middleware in production.
Sidekiq Pro/Enterprise uses redis-rb against a dedicated persistent Redis with AOF everysec and hybrid preamble; the production guide is explicit.
Spring Session with Redis uses Lettuce's multiplexed connection plus Redis hashes for session storage, with keyspace-notification-based invalidation when sessions expire.
Symfony / Laravel queue workers use phpredis or Predis with BRPOP against persistent Redis, mirroring the Sidekiq pattern in PHP.
redis-py 6.0+ default retry policy ships exponential backoff with full jitter and 3 retries — every Python app that uses the library inherits the AWS-style backoff curve without configuration.
ioredis default retryStrategy uses Math.min(times * 50, 2000) ms backoff; production deployments commonly override with full-jitter for thundering-herd avoidance.
AWS ElastiCache published client guidance for Redis OSS / Valkey explicitly recommends exponential backoff with full jitter for cluster discovery; the doc is the canonical reference.

Pause and recall

Why is SET key value EX 60 NX strictly safer than INCR followed by EXPIRE?
What is the difference between pipelining and MULTI/EXEC in one sentence each?
Which one of pipelining, MULTI/EXEC, or Lua lets you read a value and conditionally write based on it?
What does Redis 7's multi-part AOF eliminate that pre-7 AOF rewrite had to do?
In the Redlock controversy, what are Kleppmann's two main objections and what does Antirez argue back?
Why does jittered backoff matter for a fleet of 200 web pods reconnecting after a network blip?
What is the trade-off when you switch appendfsync from everysec to always?
Why does Lettuce default to one multiplexed connection while redis-py defaults to a pool?

Interview Q&A

Q1. Walk me through implementing an idempotent payment-charge endpoint using Redis. A. The client sends a unique Idempotency-Key header. Server runs SET idem:<key> "processing:<request_id>" NX EX 60. If the result is OK, this is the first request — start the DB transaction, do the charge, store the response under idem:<key> with a longer TTL (24 h typically), commit. If the result is nil, fetch the existing value — if it is "processing", return 409 or wait briefly; if it is a stored response, return that response verbatim. The Redis key is a fast claim; the source of truth lives in the payment DB transaction that records the charge keyed by idempotency key as well. Common wrong answer to avoid: "Just use SETNX and return success." That misses the stored-response replay path — a retry must receive the same response, not a fresh attempt or a 409.
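
The flow in the answer above can be sketched like this. It assumes r is a redis-py client created with decode_responses=True, and do_charge is a hypothetical callable that performs the charge inside the payment DB transaction and returns a JSON-serialisable response. For brevity the claim value is the bare string "processing" without the request id.

```python
import json

PROCESSING = "processing"


def charge_once(r, idem_key, do_charge, claim_ttl=60, result_ttl=24 * 3600):
    """Return (http_status, body) for a charge request with an Idempotency-Key."""
    key = "idem:" + idem_key
    # First writer wins the claim; every other caller gets None back.
    if r.set(key, PROCESSING, nx=True, ex=claim_ttl):
        response = do_charge()
        # Replace the claim with the stored response for later replays.
        r.set(key, json.dumps(response), ex=result_ttl)
        return 200, response
    existing = r.get(key)
    if existing is None or existing == PROCESSING:
        # Still in flight, or the claim just expired: ask the client to retry.
        return 409, {"error": "request in progress"}
    # A finished request: replay the stored response verbatim.
    return 200, json.loads(existing)
```

If do_charge raises, the claim simply expires after claim_ttl and a later retry can proceed; the database transaction remains the source of truth for whether the charge happened.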

Q2. Your team wants to use Redlock for a job-scheduling lock so only one worker runs a daily cron. Should they? A. For a daily cron where worst case is "the cron runs twice or zero times once a quarter", Redlock or even single-instance SET NX PX is fine — advisory locking is the appropriate stance. For a lock where double-execution corrupts state (charging a customer twice, sending a duplicate refund), Redis locks alone are not safe. You need fencing tokens — a monotonic counter the protected resource can use to reject stale lock holders — and Redis does not provide them. Use a proper consensus system (etcd, ZooKeeper) or a database row lock with SELECT FOR UPDATE. Common wrong answer to avoid: "Redlock is unsafe, never use Redis for locks." Single-instance SET NX PX is widely deployed for advisory mutual exclusion and is the right tool when the cost of a rare race is low.

Q3. When do you reach for a Lua script instead of pipelining or MULTI/EXEC? A. When the operation needs to read a value and conditionally write based on it, all atomically. Pipelining batches commands but gives no atomicity. MULTI/EXEC is atomic but cannot branch on intermediate values — you queue commands blind. Lua runs server-side, atomically on the command thread, with full read-branch-write power. Rate limiters with sliding-window-log, atomic "increment and return new value if above threshold else decrement back" patterns, queue-state transitions in BullMQ — all of these need Lua. Common wrong answer to avoid: "Always use Lua, it is atomic and fast." Lua blocks every other client for its entire run; long-running scripts cause global latency spikes. Use Lua only for short, deterministic logic.

Q4. Your Redis is configured with AOF everysec and you lose 30 minutes of writes after a crash. What investigation do you do? A. First, check whether AOF was actually enabled on the running instance: CONFIG GET appendonly should be yes. If not, someone disabled it. Second, check the appendfsync setting: if it was no, the OS buffered up to ~30 s before flushing and a power loss could cost that. Third, inspect the disk: if the underlying volume's write cache is volatile and not battery-backed (or nobarrier was set on the filesystem), fsync may return success without the bytes hitting durable storage. Fourth, check whether the AOF was truncated at crash time: with aof-load-truncated yes (the default) Redis loads what it can and drops the damaged tail with only a log message, while with no it refuses to start until the file is repaired with redis-check-aof. Fifth, did the host actually crash, or was it a graceful kill where the Redis process buffered writes in its own queue? Common wrong answer to avoid: "AOF everysec means you lose at most one second, so the persistence layer is innocent." That guarantee assumes fsync is honest, the kernel flushed the buffer, and the disk reported truth; any of those can fail.

Q5. Explain pipelining versus MULTI/EXEC versus Lua, with a concrete example of when each wins. A. Pipelining wins when you have N independent commands and want to eliminate N-1 round-trips — fetching 100 cached user records by ID is a perfect fit. MULTI/EXEC wins when you have a small batch that must commit atomically without read-then-decide logic — for example, debit from one user balance and credit to another (both are unconditional given that you have already decided to do them). Lua wins when you need read-branch-write — the sliding-window rate limiter is the canonical example, where you must read the count and conditionally add. Performance-wise, all three are one round-trip; the differences are atomicity scope and conditional logic. Common wrong answer to avoid: "Pipelining is faster than MULTI/EXEC because it has no atomicity overhead." Both are one round-trip; the server-side per-command cost is identical; the difference is what guarantees the batch provides, not throughput.

Q6. Why does Lettuce default to a single connection while redis-py uses a connection pool? A. Lettuce is built on Netty with non-blocking I/O: one channel can multiplex concurrent commands from many threads, because it pipelines them on the wire automatically and matches replies to requests by order. redis-py's synchronous client sends one command per connection at a time, so concurrent callers need concurrent connections. ioredis sits closer to Lettuce: Node's event loop writes commands onto the single connection each Redis object owns and matches replies in order, with no built-in pool. Every client still needs dedicated connections for blocking commands (BRPOP) and for MULTI/WATCH transactions. The right pool size in a pooled client is roughly your concurrent in-flight operations (blocking ones included) plus a small headroom, usually 10-30. Common wrong answer to avoid: "More connections always mean more throughput." Beyond that count, extra connections add no throughput (Redis still serves one command at a time) but cost a file descriptor and a few KB each.

Q7. Redis 7 introduces multi-part AOF. What problem did it solve from earlier versions? A. Pre-Redis 7, AOF rewrite worked by forking a child that wrote the new compact AOF to a temporary file, while the parent continued collecting new writes in an in-memory rewrite buffer. At the end, the parent appended the buffer to the new file and renamed it atomically: a brief but real freeze window, plus the extra memory the buffer held for the rewrite duration. Redis 7 splits AOF into a base file (RDB-formatted snapshot at last rewrite) and incremental files (commands since), tracked by a manifest. The child writes the new base; the parent appends to a fresh incremental. The swap is just a manifest update, no merge, no rewrite buffer. With aof-use-rdb-preamble yes (default), restart loads the binary RDB base fast then replays only the small tail. Common wrong answer to avoid: "It made AOF faster overall." Steady-state AOF write throughput is unchanged; the savings are at rewrite time and on restart.

Q8. A production Redis is showing P99 latency spikes every few minutes. Walk me through the investigation. A. First, redis-cli --latency and --latency-history to characterize the spikes. Second, check whether the spike interval correlates with RDB snapshots (save config) or AOF rewrites — INFO persistence shows last fork time and rewrite times. Fork on large instances takes 200-500 ms of page-table copying; if THP is on, it can be seconds. Disable transparent huge pages and re-measure. Third, SLOWLOG GET 50 for individual slow commands — long Lua scripts, KEYS *, big LRANGE, SUNION over huge sets. Fourth, LATENCY HISTORY event for built-in latency-monitor events. Fifth, check if a noisy neighbor on a shared host is causing CPU steal. Sixth, network — is a single TCP connection saturated, or is the client doing too-large pipelined batches that fill the output buffer and cause backpressure? Common wrong answer to avoid: "Add more replicas or upgrade the instance type." Both might be needed eventually, but you must identify the actual cause first — most P99 spikes on Redis are persistence-fork events, slow commands, or THP, none of which are solved by bigger hardware.

Apply now (10 min)

Step 1: model the exercise. Take our rate-limiter service. Below is a six-row deployment audit, one row per production decision.

Decision	Choice	Why
Persistence mode	AOF `everysec` + hybrid preamble	≤ 1 s loss is fine for rate-limit state; fast restart matters
Atomicity primitive	Redis Function (`FCALL`)	read-then-conditional-write needs Lua; Function survives restart
Client library	`redis-py` with `ConnectionPool(max=16)`	matches our Python service; pool size = concurrent in-flight + slack
Retry policy	`ExponentialBackoff(cap=2, base=0.1)` × 3, with jitter	break thundering herd; fail fast back to caller
Distributed lock for worker exclusion?	No	rate limiter is the truth; no shared mutation needed
Pub/Sub for invalidation?	No	TTL drops keys; no cross-server cache invalidation here

Step 2 — your turn. Pick a Redis-touching service in your own stack — a cache, a queue, a session store, a counter. Fill the same six rows. For each row, write one sentence on what would break if you made the opposite choice. If you cannot articulate the breakage, you have not justified the choice yet.

Step 3 — sketch from memory. Redraw the pipelining-vs-no-pipelining diagram from section 4, then add a third lane for Lua showing one round-trip with server-side branch logic. Label each lane with whether it provides atomicity and whether it allows conditional logic. If you can do this cold, the three primitives are internalised.

Bridge. One node, one event loop, one fsync policy — we have day-to-day Redis under control. But one node is the failure boundary. The next chapter opens the cluster, the eviction policies that decide which keys die when memory fills, and the cache-stampede patterns that bring production down at the worst possible moment. → 03-cluster-eviction-cache-stampedes.md