07. Idempotency and Retries — The order slip that should not cook twice

~15 min read. Networks lie often, so your API must stay calm.

Built on the ELI5 in 00-eli5.md. The order slip — the written instruction sent again after confusion — now teaches us how to retry safely without duplicating the meal.

First understand why retries create duplicate danger

Distributed systems fail in annoying half-visible ways. A client may send a request successfully, then lose the response. The server may finish work, but the network may drop the receipt. From the client view, success and timeout can look identical. So the client retries.

That retry is often correct. But retries become dangerous when operations have side effects. Creating a payment, ticket, or shipment twice is expensive. The core problem is uncertainty after partial success. See the basic failure path:

┌──────────┐   create payment   ┌──────────────┐
│  Client  │ ─────────────────→ │ API service  │
└──────────┘                    └──────┬───────┘
                                       │ charges card
                                       ▼
                                payment stored
                                       │
                                       └──X response lost

The client times out. Then the client sends the same call again. If the backend treats it like a fresh request, duplicate work happens. That is why idempotency matters.

An idempotent operation produces the same final state after repeats. Running it once or five times leaves the system in the same state. Now be careful. “Same final state” is the real test. “Same response bytes” is not the full test. A repeated call may return updated metadata, as long as it does not repeat the side effects.
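To make the failure path concrete, here is a minimal Python sketch of a backend with no duplicate protection. The class `NaivePaymentService` and the helper `call_with_lost_first_response` are hypothetical names invented for this illustration, and the "network" is simulated by simply discarding the first response.

```python
class NaivePaymentService:
    """Hypothetical backend that treats every call as a fresh request."""

    def __init__(self):
        self.charges = []  # one entry per card charge actually performed

    def create_payment(self, amount):
        self.charges.append(amount)  # the side effect happens here
        return {"status": "created", "charge_no": len(self.charges)}


def call_with_lost_first_response(service, amount, max_attempts=2):
    """Client loop: the first response is dropped, so the client retries."""
    for attempt in range(1, max_attempts + 1):
        response = service.create_payment(amount)  # server finishes the work
        if attempt == 1:
            continue  # simulate the lost response: the client only sees a timeout
        return response
    raise TimeoutError("no response received")


service = NaivePaymentService()
result = call_with_lost_first_response(service, 500)

print(result)                # {'status': 'created', 'charge_no': 2}
print(len(service.charges))  # 2
```

The client believes it made one payment, yet the card was charged twice. Nothing in the request told the server that the second call was a repeat of the first.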

Natural idempotency and artificial idempotency are different tools

Some operations are naturally idempotent. Setting a user’s preferred language to en repeatedly changes nothing extra. Deleting a known resource can also be idempotent conceptually. Once deleted, more deletes should not create more deletion. These are natural patterns. The resource state itself prevents duplicate impact. A simple table helps:

┌──────────────────────────┬─────────────────────────────┐
│ Usually naturally safe   │ Usually duplicate-prone     │
├──────────────────────────┼─────────────────────────────┤
│ PUT set profile picture  │ POST create payment         │
│ PUT mark email verified  │ POST place order            │
│ DELETE remove session    │ POST issue coupon           │
└──────────────────────────┴─────────────────────────────┘

Now come to artificial idempotency. Sometimes the business action is inherently create-like. A payment capture or new order should happen once. Repeating the same order slip should not create extra dishes. So the client sends an idempotency key. The backend stores that key with the original result. If the same key appears again, the backend returns the stored answer. That is artificial idempotency. We are manufacturing safety around a duplicate-prone action.

Worked example. Customer presses “Pay now” twice after a spinner freezes. Both clicks carry idempotency key pay_9f1c. The backend processes the first request and stores the result. The second request finds the existing record. No second charge happens. That is exactly what you want.
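The contrast can be shown in a few lines of Python. This is a single-threaded sketch only: the function names, the in-memory dictionaries, and the starting language "hi" are assumptions made for the example, and only the key pay_9f1c comes from the worked example above.

```python
profile = {"language": "hi"}
charges = []
results_by_key = {}  # idempotency key -> stored result


def set_language(lang):
    """Naturally idempotent: repeating the call leaves the same final state."""
    profile["language"] = lang


def create_payment(amount):
    """Duplicate-prone: every call performs a new charge."""
    charges.append(amount)
    return {"charge_no": len(charges), "amount": amount}


def create_payment_once(key, amount):
    """Artificial idempotency: replay the stored result for a known key."""
    if key not in results_by_key:
        results_by_key[key] = create_payment(amount)
    return results_by_key[key]


for _ in range(3):
    set_language("en")                        # three repeats, one final state

first = create_payment_once("pay_9f1c", 500)
second = create_payment_once("pay_9f1c", 500)  # the second "Pay now" click

print(profile)          # {'language': 'en'}
print(len(charges))     # 1
print(first == second)  # True
```

The check-then-store in `create_payment_once` is not safe under concurrent requests and does not validate the payload. It only illustrates the replay idea.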

Idempotency keys need careful storage rules

An idempotency key is not a magic header alone. It needs a server-side record with disciplined semantics. Usually the server stores at least three things. First, the key itself. Second, a fingerprint of the request payload. Third, the resulting status and response body.

Why fingerprint the request? Because the same key must not hide different intent. If key abc123 first meant amount=500, then a later request with amount=900 under the same key should fail. Otherwise clients could reuse keys accidentally and corrupt meaning. A practical flow looks like this:

┌──────────┐    key=abc123    ┌─────────────────────────┐
│  Client  │ ───────────────→ │ API service             │
└──────────┘                  └──────────┬──────────────┘
                                         │ lookup key
                           ┌─────────────┴─────────────┐
                           │ found same fingerprint?   │
                           └───────┬─────────┬─────────┘
                                   │yes      │no
                                   ▼         ▼
                          return stored   reject mismatch
                          result safely   as misuse

Now think about lifecycle. Keys should not live forever. Most systems keep them for a bounded TTL. Maybe 24 hours for payments. Maybe shorter for low-value operations. The window should match realistic retry behavior.

Now think about concurrency too. Two identical requests can arrive almost together. The idempotency store must handle races safely. One request should win the right to process. The other should wait, poll, or reuse the finished result.

Worked example. Suppose two retries land within 15 milliseconds. Both read “key missing” from a weak store. Both charge the card. That is a broken design. Use atomic insert, unique constraints, or transactional locking. Exactly here, boring database guarantees save real money.
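One way to sketch these storage rules is with an in-memory SQLite table whose primary key acts as the unique constraint. Everything here is an assumption for illustration: the table name, the column layout, the SHA-256 fingerprint over sorted JSON, and the 422 and 409 status codes chosen for misuse and for a request still in progress. A production store would also need durable transactions around the side effect.

```python
import hashlib
import json
import sqlite3
import time

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE idempotency_keys ("
    " key TEXT PRIMARY KEY,"        # unique constraint: only one insert can win
    " fingerprint TEXT NOT NULL,"
    " response TEXT,"               # NULL while the first request is still running
    " expires_at REAL NOT NULL)"
)
charges = []  # stands in for the real card charge


def fingerprint(payload):
    """Stable hash of the request body (keys sorted so order does not matter)."""
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()


def create_payment(key, payload, ttl_seconds=24 * 3600):
    now = time.time()
    fp = fingerprint(payload)
    # Bounded lifetime: drop records whose TTL has passed.
    db.execute("DELETE FROM idempotency_keys WHERE expires_at <= ?", (now,))
    try:
        # Atomic reservation: the PRIMARY KEY rejects a second insert of the key.
        db.execute(
            "INSERT INTO idempotency_keys VALUES (?, ?, NULL, ?)",
            (key, fp, now + ttl_seconds),
        )
    except sqlite3.IntegrityError:
        stored_fp, stored_response = db.execute(
            "SELECT fingerprint, response FROM idempotency_keys WHERE key = ?",
            (key,),
        ).fetchone()
        if stored_fp != fp:
            return {"status": 422, "error": "key reused with a different payload"}
        if stored_response is None:
            return {"status": 409, "error": "original request still in progress"}
        return json.loads(stored_response)  # replay the stored answer

    charges.append(payload["amount"])       # the side effect, performed once
    response = {"status": 201, "charge_no": len(charges)}
    db.execute(
        "UPDATE idempotency_keys SET response = ? WHERE key = ?",
        (json.dumps(response), key),
    )
    return response


first = create_payment("abc123", {"amount": 500})
retry = create_payment("abc123", {"amount": 500})
misuse = create_payment("abc123", {"amount": 900})
print(first == retry, misuse["status"], len(charges))  # True 422 1
```

The design choice worth noticing is that the insert happens before the charge. A request that cannot reserve the key never reaches the side effect, which is what closes the "both read key missing" race described above.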

Retries need exponential backoff and jitter

Retries are useful only when they reduce pain. Blind rapid retries usually amplify pain instead. If a service is already overloaded, instant retries worsen the storm. So use exponential backoff. After each failure, wait longer before retrying. A simple schedule might be 100 milliseconds, 200, 400, 800. That spreads load over time.

Now add jitter. Jitter means randomizing the wait slightly. Without jitter, thousands of clients retry together like synchronized drums. That creates retry spikes. With jitter, those retries spread out more naturally. See the pattern:

attempt 1 → immediate call
attempt 2 → wait about 100 ms ± random jitter
attempt 3 → wait about 200 ms ± random jitter
attempt 4 → wait about 400 ms ± random jitter

This matters a lot during regional incidents. A thousand clients failing together should not hammer recovery together.

Now pair retries with status-code judgment. Retrying 500, 502, 503, and some timeouts can make sense. Retrying 400 or 401 usually does not. Retrying 409 depends on the domain meaning. A smart client respects semantic signals.

Worked latency example. Suppose the first attempt times out after 2 seconds. If three retries follow with no delay between them, the struggling backend receives a burst of back-to-back requests, possibly while it is still working on the original. If backoff spaces them out with waits of 0.1, 0.2, and 0.4 seconds, pressure reduces meaningfully. The user still waits, but the system recovers more gracefully.
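The schedule and the status-code judgment above can be sketched as a small client helper. This is an illustration under assumptions: the function names, the 25% jitter width, and the injected `send` and `sleep` callables are choices made for the example, not a prescribed client design.

```python
import random
import time

RETRYABLE_STATUSES = {500, 502, 503}  # from the text; 400 and 401 are not retried


def backoff_delays(base=0.1, attempts=4, jitter=0.25, rng=random):
    """Yield the wait before attempts 2..N: base * 2**n, nudged by +/- jitter."""
    for n in range(attempts - 1):
        nominal = base * (2 ** n)  # 0.1, 0.2, 0.4 seconds
        yield nominal * (1 + rng.uniform(-jitter, jitter))


def call_with_retries(send, attempts=4, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out."""
    status = send()  # attempt 1: immediate call
    for delay in backoff_delays(attempts=attempts):
        if status not in RETRYABLE_STATUSES:
            break
        sleep(delay)
        status = send()
    return status


# Usage: a fake server that fails twice, then succeeds. Waits are recorded
# instead of slept so the example runs instantly.
responses = iter([503, 503, 200])
waits = []
status = call_with_retries(lambda: next(responses), sleep=waits.append)
print(status, len(waits))  # 200 2
```

Because each client draws its own random offset, a thousand clients that fail at the same instant no longer retry at the same instant. A request that returns 400 exits the loop on the first check and is never retried.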

Exactly-once at API level means disciplined illusion, not magic delivery

People often ask for exactly-once delivery. Strictly speaking, networks rarely give that end-to-end guarantee cheaply. What APIs usually offer is exactly-once effect at the boundary. The same business action should happen once. Duplicates may arrive, but they should not duplicate side effects. That is the useful promise. It combines retries, idempotency keys, atomic writes, and clear responses.

The wait staff analogy fits well here. If the waiter resubmits the same order slip, the kitchen should notice. One meal should arrive. Not two meals and one apology. A practical design sequence looks like this:

1. Client creates a stable idempotency key before the first attempt.
2. Server reserves that key atomically before side effects start.
3. Business action commits once.
4. Server stores the final outcome against the key.
5. Retries reuse the stored outcome until the key expires.

Now be honest about boundaries. Exactly-once at your API layer does not mean every downstream system is perfect. If your API calls a non-idempotent legacy partner, risk returns. Then you need compensating logic, dedupe ledgers, or reconciliation jobs. So always ask, “Exactly once between which boundaries?” That is the interview-quality question.

A final worked example. Suppose a payment API receives 10,000 create-charge calls daily. One percent experience network ambiguity. That means 100 calls may be retried. Without idempotency, even a small duplicate rate hurts quickly. If 12 of those retries double-charge, support pain is immediate. With strong key handling, those 100 ambiguous calls can still produce 100 correct charges. That is operational peace.
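The five-step sequence can be sketched with an in-memory stand-in that lets duplicate requests race each other. The class `ExactlyOnceEffect`, the key "order-slip-42", and the thread-based duplicates are all hypothetical. Key expiry and recovery from a failed action are left out, and a real service would keep the records in a durable store rather than a dictionary.

```python
import threading


class ExactlyOnceEffect:
    """In-memory stand-in for steps 2 to 5 of the sequence."""

    def __init__(self):
        self._lock = threading.Lock()
        self._records = {}  # key -> {"done": Event, "outcome": result or None}
        self.effects = 0    # how many times the business action really ran

    def handle(self, key, action):
        with self._lock:  # step 2: reserve the key atomically
            record = self._records.get(key)
            owner = record is None
            if owner:
                record = {"done": threading.Event(), "outcome": None}
                self._records[key] = record
        if owner:
            try:
                record["outcome"] = action()  # step 3: action commits once
                with self._lock:
                    self.effects += 1
            finally:
                record["done"].set()          # step 4: outcome is now stored
        else:
            record["done"].wait()             # duplicates wait for the winner
        return record["outcome"]              # step 5: retries reuse the outcome


service = ExactlyOnceEffect()
key = "order-slip-42"  # step 1: the client picks the key once, before any attempt
results = []


def worker():
    results.append(service.handle(key, lambda: "charge_1"))


threads = [threading.Thread(target=worker) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(service.effects, set(results))  # 1 {'charge_1'}
```

Five submissions of the same order slip arrive, one meal is cooked, and every caller receives the same answer. The guarantee holds only inside this boundary: if `action` called a non-idempotent partner that itself timed out, the uncertainty described above would return.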

Where this lives in the wild

Stripe payments API engineer — stores idempotency keys for create-charge style requests so clients can retry safely after timeouts.
Razorpay backend engineer — protects payment capture APIs from duplicate effects caused by mobile retries and flaky networks.
Swiggy checkout engineer — ensures order creation remains single-effect even when customers tap the payment button repeatedly.
Shopify commerce platform engineer — uses retry-safe API contracts so merchants do not create duplicate orders during transient failures.
AWS SDK engineer — designs retry behavior with exponential backoff and jitter so shared outages do not become retry storms.

Pause and recall

Why can a lost response create duplicate business actions later?
Which operations are naturally idempotent, and which need artificial protection?
Why must an idempotency key be tied to a request fingerprint?
What extra problem does jitter solve beyond exponential backoff alone?

Interview Q&A

Q: What is the difference between natural and artificial idempotency?
A: Natural idempotency comes from the operation’s state semantics, like setting a field repeatedly to the same value. Artificial idempotency adds a stored key so duplicate-prone creates behave safely under retries.
Common wrong answer to avoid: “PUT is always idempotent and POST never is” — method names hint intent, but actual state behavior still matters.

Q: Why must idempotency keys be stored with request fingerprints?
A: Because the same key should represent the same intent only. If the payload changes under one key, the server must reject it instead of silently replaying the wrong result.
Common wrong answer to avoid: “The key alone is enough” — without payload validation, accidental key reuse can hide serious business mistakes.

Q: Why use exponential backoff with jitter for retries?
A: Backoff reduces repeated pressure on struggling services, and jitter prevents many clients from retrying in lockstep. Together they improve recovery behavior under shared failure.
Common wrong answer to avoid: “Jitter is optional decoration” — synchronized retries can create major spikes even when backoff exists.

Q: What does exactly-once delivery usually mean at an API boundary?
A: It means exactly-once effect for the business action exposed by that API, not magical packet-level uniqueness across every network hop. Duplicates may arrive, but side effects should not duplicate.
Common wrong answer to avoid: “Exactly once means duplicates never happen anywhere” — in practice, duplicates can appear, and the system must absorb them safely.

Apply now (5 min)

Imagine a payment request times out after the bank was already charged.

1. Write the duplicate risk in one sentence.
2. Design a minimal idempotency record with four fields.
3. Decide a TTL for that key and justify it.
4. Sketch a retry schedule for four attempts using backoff and jitter.
5. Mark which HTTP failures you would retry automatically.
6. Mark one client error you would never retry.
7. Write one sentence explaining “exactly-once effect” in plain words.

Sketch from memory: draw a client, an API, an idempotency store, and two repeated order slip submissions that still produce one final effect. Do not peek back.

Bridge. Duplicate-safe requests are good. Large result sets still need clean, navigable shape. → 08-pagination-filtering-sorting.md