05. Retries, DLQ, and Idempotency: Fail safely without duplicate side effects

~17 min read. Asynchronous systems stay calm only when failure handling is designed, not hoped for.

Built on the ELI5 in 00-eli5.md. The board rules (delivery promises and retry discipline) now become the difference between recovery and chaos.

Retries need classification before enthusiasm

Not every failure deserves another immediate attempt. Transient failures often improve with time or spacing. Permanent failures need correction, not repetition. Validation errors, bad payloads, and missing fields are usually permanent. Network blips and short dependency overloads are often transient. Good systems classify a failure before retrying it; otherwise retries become a self-inflicted outage. These are the board rules for safe asynchronous recovery.

Diagram:

Failure ▼
├→ transient? ─→ retry later
├→ permanent? ─→ DLQ or drop with reason
└→ unknown?   ─→ inspect and cap attempts

Retry only when another attempt has a realistic chance.
Attach failure reason and attempt count to each message.
Cap retries so poison messages do not loop forever.

Worked example:
1. Payment webhook returns HTTP 503 from a dependency.
2. That is likely transient, so retry later.
3. Another webhook returns schema validation failure.
4. That is likely permanent, so stop retrying.
5. Classification saves both money and sleep.
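The classify-then-decide flow can be sketched in a few lines. This is a minimal illustration, not a standard: the status-code sets, the `classify` and `next_action` names, and the default cap of five attempts are all assumptions for the example and should be tuned per dependency.

```python
from enum import Enum


class FailureKind(Enum):
    TRANSIENT = "transient"
    PERMANENT = "permanent"
    UNKNOWN = "unknown"


# Assumed groupings for this sketch; adjust them for your dependencies.
TRANSIENT_STATUSES = {408, 429, 502, 503, 504}
PERMANENT_STATUSES = {400, 401, 403, 404, 422}


def classify(status_code=None, validation_error=False):
    """Map a failure onto the three branches: transient, permanent, unknown."""
    if validation_error or status_code in PERMANENT_STATUSES:
        return FailureKind.PERMANENT
    if status_code in TRANSIENT_STATUSES:
        return FailureKind.TRANSIENT
    return FailureKind.UNKNOWN


def next_action(kind, attempt, max_attempts=5):
    """Decide what to do with a message that has failed `attempt` times."""
    if kind is FailureKind.PERMANENT:
        return "dlq"  # repetition cannot fix bad data
    if attempt >= max_attempts:
        return "dlq"  # cap reached, even for transient or unknown failures
    return "retry"
```

With these assumptions, the 503 webhook from the worked example yields `"retry"`, while the schema validation failure goes straight to `"dlq"`.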

Retries are medicine, not snacks; dosage matters.

Exponential backoff plus jitter reduces stampedes

Fixed-delay retries hit a failing system in synchronized waves. Exponential backoff spaces attempts farther apart over time. Jitter adds randomness so workers do not stampede together. This protects the dependency and your own queues. Without jitter, thousands of clients may retry simultaneously. With jitter, retry traffic spreads over a safer window. Choose sane caps on delay and attempt count so user experience stays bounded. Log retry schedules for later debugging.

Diagram:

Attempt 1 → 1s
Attempt 2 → 2s ± jitter
Attempt 3 → 4s ± jitter
Attempt 4 → 8s ± jitter
Attempt 5 → 16s ± jitter
Later attempts → capped at 30s

A common pattern is base * 2^n with random spread.
Add maximum delay and maximum attempts explicitly.
Keep business SLAs in mind while tuning backoff.

Worked example:
1. Inventory service is overloaded for two minutes.
2. Five thousand workers start retrying a stock update.
3. Fixed five-second delay causes repeated spikes.
4. Exponential backoff with jitter smooths the load.
5. Recovery happens without another collapse.
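A base * 2^n schedule with a random spread and an explicit cap might look like the sketch below. The function names, the 25% jitter ratio, and the 1s base / 30s cap defaults are assumptions chosen to mirror the diagram, not recommended values.

```python
import random


def backoff_delay(attempt, base=1.0, cap=30.0, jitter_ratio=0.25, rng=random):
    """Return seconds to wait before retry number `attempt` (1-based).

    The nominal delay doubles each attempt (base * 2^(attempt - 1)),
    is capped at `cap`, then shifted by a random amount of up to
    +/- jitter_ratio of the nominal delay.
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    nominal = min(cap, base * 2 ** (attempt - 1))
    spread = nominal * jitter_ratio
    jittered = nominal + rng.uniform(-spread, spread)
    # Never wait a negative time and never exceed the cap.
    return min(cap, max(0.0, jittered))


def retry_schedule(max_attempts=5, **kwargs):
    """Build the full list of delays for a bounded number of attempts."""
    return [backoff_delay(n, **kwargs) for n in range(1, max_attempts + 1)]
```

Passing a seeded `random.Random` instance as `rng` makes the schedule reproducible in tests, while production workers keep the default so each one draws different delays.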

Backoff protects the system; jitter protects it from itself.

DLQ is a parking area, not a dustbin

A dead-letter queue stores messages that have exhausted the retry policy. It isolates poison messages from healthy traffic. That keeps workers productive during bad data incidents. But a DLQ is useful only with triage and replay practice. You need reason codes, timestamps, and original payload context. Someone must own dashboards and replay runbooks. Blindly draining the DLQ back into the main flow can repeat the failure. Review the root cause before replaying.

Diagram:

Main Queue ─→ Worker
                ├→ success  ─→ done
                └→ fail x N ─→ DLQ
DLQ ─→ inspect ─→ fix ─→ replay

DLQ volume is a product signal, not just an ops signal.
Track top error categories, not only raw counts.
Replays should be controlled, observable, and idempotent.

Worked example:
1. A mobile app sends malformed address payloads all morning.
2. Delivery creation fails after three attempts.
3. Messages move to DLQ with validation reason attached.
4. Team fixes serializer bug and patches bad records.
5. Then they replay safely in batches.
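The main-queue-to-DLQ flow can be sketched with in-memory deques standing in for real broker queues. Everything here is hypothetical scaffolding: the message dictionary fields, the `PermanentError` class, and the `process` and `replay` helpers are illustrative names, and the cap of three attempts simply matches the worked example.

```python
import time
from collections import deque

MAX_ATTEMPTS = 3  # matches "fails after three attempts" in the worked example


class PermanentError(Exception):
    """Raised by a handler when retrying cannot help (e.g. validation)."""


def process(message, handler, main_queue, dlq, max_attempts=MAX_ATTEMPTS):
    """Run the handler once; on failure, requeue or park in the DLQ."""
    try:
        handler(message["payload"])
        return "done"
    except Exception as exc:
        message["attempts"] = message.get("attempts", 0) + 1
        message["last_error"] = f"{type(exc).__name__}: {exc}"
        permanent = isinstance(exc, PermanentError)
        if permanent or message["attempts"] >= max_attempts:
            # Keep reason, timestamp, and the original payload for triage.
            message["reason"] = (
                "non_retryable" if permanent else "max_attempts_exceeded"
            )
            message["dead_lettered_at"] = time.time()
            dlq.append(message)
            return "dlq"
        main_queue.append(message)  # a real system would delay this retry
        return "retry"


def replay(dlq, main_queue, batch_size=10):
    """Move up to `batch_size` messages back once the root cause is fixed."""
    moved = 0
    while dlq and moved < batch_size:
        message = dlq.popleft()
        message["attempts"] = 0  # fresh retry budget for the replay
        main_queue.append(message)
        moved += 1
    return moved
```

Replaying in small batches keeps the blast radius low if the fix turns out to be incomplete, and it only stays safe when the consumer itself is idempotent.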

DLQ is a learning lane for bad messages, not a graveyard.

Idempotent consumers make duplicates boring

At-least-once delivery means duplicate processing can happen. A crash between the side effect and the acknowledgment is a common cause: the broker never sees the ack, so it redelivers. Idempotency means repeated processing reaches the same final state. Use business keys or dedup IDs to detect repeats. Store processed message IDs where the write is durable. Prefer upsert, compare-and-set, or unique constraints when possible. External APIs also need idempotency keys if they support them. This is how retries stop being scary.

Diagram:

Message id=abc123 ─→ Consumer
                    ├→ seen before? yes ─→ skip side effect
                    └→ seen before? no  ─→ apply and record id

Dedup keys should match real business uniqueness.
Inbox tables and unique indexes are practical tools.
Idempotency must cover writes and outward side effects.

Worked example:
1. Email worker sends WelcomeEmail for user 55.
2. Crash happens before acknowledgment returns.
3. Broker redelivers the same message later.
4. Consumer sees message_id already recorded.
5. It skips sending a second email.
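The "seen before?" check can be sketched with an inbox table and a unique constraint, here using Python's built-in sqlite3 module. The `inbox` table name and the `make_inbox` and `handle_once` helpers are assumptions for illustration; a production consumer would use its own durable store.

```python
import sqlite3


def make_inbox(conn):
    """Create the dedup table; the primary key enforces uniqueness."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS inbox (message_id TEXT PRIMARY KEY)"
    )


def handle_once(conn, message_id, side_effect):
    """Apply `side_effect` only the first time `message_id` is seen.

    Returns True if the side effect ran, False for a duplicate.
    """
    with conn:  # commits on success, rolls back if an exception escapes
        try:
            conn.execute(
                "INSERT INTO inbox (message_id) VALUES (?)", (message_id,)
            )
        except sqlite3.IntegrityError:
            return False  # already recorded: skip the side effect
        side_effect()
    return True
```

If the side effect raises, the inbox insert is rolled back, so a redelivery gets another chance. Note the limit of this sketch: the transaction only covers the database. An outward action such as sending an email can still happen just before a crash that prevents the commit, which is why external calls need their own idempotency key where the provider supports one.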

If duplicates are expected, make them harmless by design.

Where this lives in the wild

These patterns appear anywhere message delivery meets real failures and real side effects.

A Razorpay backend engineer uses idempotency keys for payment callbacks. Retries are allowed, but duplicate ledger writes are not.
An Amazon fulfillment engineer tunes retry counts and DLQs for worker queues. Poison messages must not freeze healthy order traffic.
A Swiggy delivery platform engineer classifies transient partner API failures carefully. Backoff and jitter reduce retry storms during peak dinner hours.
A Google Cloud platform engineer inspects Pub/Sub DLQs with reason labels. Replay workflows run only after the root cause is fixed.
A Netflix service engineer designs consumers to tolerate duplicate event delivery. Operational safety beats pretending the network is perfect.

Pause and recall

Before you leave, check whether failure handling now feels mechanical instead of magical.

Why should permanent failures usually bypass repeated retries?
What specific problem does jitter solve during retry storms?
Why is a DLQ useless without triage ownership and a replay process?
How does idempotency convert duplicate delivery into safe behavior?

Say the answer aloud before reading ahead tomorrow.

Interview Q&A

Strong answers sound operational, practical, and a little suspicious of happy paths.

Q: What is exponential backoff with jitter?
A: It increases retry delay after each failure and randomizes timing to avoid synchronized retry spikes.
Common wrong answer to avoid: "It just means wait longer every time; randomness does not matter."

Q: When should a message go to DLQ?
A: After bounded retries fail or a non-retryable error is identified.
Common wrong answer to avoid: "Keep retrying forever because eventual success is always possible."

Q: What is an idempotent consumer?
A: A consumer that can process the same message repeatedly without changing the correct final outcome.
Common wrong answer to avoid: "A consumer that never receives duplicates."

Q: Why are dedup keys important?
A: They let the consumer recognize repeated delivery and suppress duplicate side effects safely.
Common wrong answer to avoid: "Broker ordering alone removes duplicate risk."

Keep answers crisp, then add trade-offs only when asked.

Apply now (5 min)

Take one consumer from your notes, such as email or webhook delivery. List three failure types and mark each retryable or non-retryable. Write a backoff schedule with jitter and a retry cap. Define the condition that moves the message to DLQ. Then state the dedup key that makes the consumer idempotent.

Sketch from memory:
- one retry timeline with increasing delays,
- one main queue flowing into DLQ,
- one consumer check for seen message_id before side effects.

Bridge. Retries keep notices alive. But what if the board itself IS the source of truth? → 06