03. Retries, monitoring, prod gotchas — the operational surface¶

~12 min read. Retries that storm an upstream. Queues that grow until the broker dies. Dead letters that nobody reads. Worker memory leaks. The Celery production catalogue — what breaks, why, and how to size.

Builds on: 02-tasks-queues-day-to-day.md.

The previous chapters showed how to write Celery. This one is what production teaches.

1) Retry storms — the most common production incident¶

A downstream API has a hiccup. Tasks fail. Auto-retry with no backoff fires immediately. Within seconds, the downstream is hit harder than it was before — not because of new traffic, but because of accumulating retries.

T+0:   100 tasks/sec → upstream 5xx → 100 retries/sec
T+1:   100 new + 100 retries = 200 tasks/sec → 200 retries
T+2:   100 new + 200 retries = 300 tasks/sec → 300 retries
... amplification compounds

The fix is exponential backoff plus jitter:

@shared_task(
    autoretry_for=(ConnectionError, Timeout),
    retry_backoff=True,           # exponential: 1s, 2s, 4s, 8s, 16s...
    retry_backoff_max=600,        # cap at 10 minutes
    retry_jitter=True,            # randomize so retries don't synchronise
    max_retries=10,
)
def call_external_api(payload):
    ...

With backoff + jitter, the retry rate drops over time, giving the downstream room to recover. Without backoff, the retry storm prevents recovery — every recovery attempt is met with the same flood that took it down.

For critical downstream defence, also add circuit breaker logic at the application level (pybreaker, circuitbreaker libraries) — when downstream error rate is high, fail fast for some window before retrying.

2) Dead-letter queues — what to do with permanent failures¶

After max_retries, a task gives up. By default in Celery, it just disappears. For tasks that matter, route them to a dead-letter queue (DLQ):

@shared_task(
    bind=True,
    autoretry_for=(Exception,),
    retry_backoff=True,
    max_retries=5,
)
def critical_task(self, *args, **kwargs):
    try:
        do_work(*args, **kwargs)
    except Exception as exc:
        if self.request.retries >= self.max_retries:
            # Final failure — route to DLQ
            failed_task.apply_async(
                args=[self.request.task, args, kwargs, str(exc)],
                queue='dead_letter',
            )
        raise

The pattern: a separate worker pool consumes the dead_letter queue. Each item is logged, alerted, and held for human triage. Some teams write a small UI that lists DLQ items with retry/discard/replay actions.

RabbitMQ has native DLQ support via dead-letter exchanges; SQS has DLQ as a first-class feature. With Redis brokers, DLQ is a convention (a separate queue) rather than a feature.

The cost of no DLQ: failed tasks vanish silently. Users don't get emails; payments don't reconcile; reports don't generate. The DLQ makes failure visible.

3) Queue depth and worker scaling¶

Queue depth is the leading indicator. Rising depth means workers can't keep up. Patterns:

Constant depth, near zero. Healthy. Workers keep pace.

Oscillating depth. Bursty workload. Acceptable if oscillation amplitude is bounded.

Steadily climbing depth. Workers under-provisioned. Add workers or speed up tasks.

Sudden spike, slow drain. A backlog from upstream incident. Workers will catch up over time; alert if drain takes longer than SLO.

Sudden spike, no drain. Workers stopped processing. Investigate workers (crashed, deadlocked, network partition).

For autoscaling:

Kubernetes HPA on queue depth (via Prometheus + KEDA).
AWS ASG on CloudWatch metrics if using SQS.
Manual scaling in cron based on time-of-day patterns.

The trade-off: aggressive autoscaling adds workers fast; lag in scale-down can be expensive. Set sane minimums and maximums.

4) Worker memory leaks and recycling¶

Long-running Celery workers can accumulate memory over time — DB query caches, library leaks, fragmentation. Symptom: a worker that started at 200 MB is at 2 GB after a day.

Standard defence: recycle workers periodically.

app.conf.worker_max_tasks_per_child = 1000      # restart after 1000 tasks
app.conf.worker_max_memory_per_child = 200_000  # restart at 200 MB (KB units)

worker_max_tasks_per_child is the more common setting. Every 1000 tasks, the child process exits and a fresh one starts. Cost: occasional 1-2 second startup; benefit: bounded memory.

worker_max_memory_per_child triggers a restart when memory crosses the threshold. Useful when leaks are bursty (some tasks leak a lot, others none).

5) The big-task problem¶

Some tasks legitimately take 30+ minutes — a database export, a big report. Two problems:

Visibility timeout / broker re-delivery. Some brokers re-deliver if the task doesn't ack within a window. SQS defaults to 30 seconds visibility timeout; configure visibility_timeout to be longer than the longest task. RabbitMQ's consumer_timeout plays a similar role.

Lost progress on crash. A 30-minute task that crashes at minute 28 loses all work. Make long tasks checkpoint:

@shared_task(bind=True, acks_late=True)
def export_data(self, job_id):
    job = ExportJob.objects.get(pk=job_id)
    for batch_start in range(job.last_completed_batch + 1, total_batches):
        process_batch(batch_start)
        job.last_completed_batch = batch_start
        job.save()       # checkpoint after each batch

On retry, the task resumes from the last checkpoint. Combined with idempotency, this is the standard pattern for long-running tasks.

For very long tasks (hours), Celery is the wrong tool. Use a workflow engine (Airflow, Temporal, Prefect) that has first-class checkpointing, retry, and observability.

6) Monitoring — what to watch¶

Per-queue: depth, ingress rate, egress rate, average wait time.

Per-task: count, success rate, retry rate, p50/p95/p99 latency.

Per-worker: active task count, prefetch utilisation, memory, CPU, restart count.

System: broker connection count, broker memory, broker disk (for persistent queues).

Tools:

Flower. Web UI for Celery; live view of workers, tasks, queues. Useful for development and ad-hoc debugging; not production-grade for long-term metrics.
celery-prometheus-exporter. Exposes Celery metrics for Prometheus scraping.
OpenTelemetry instrumentation. Spans for each task; integrate with traces.
Sentry's Celery integration. Captures exceptions with task context.
Custom dashboards. Grafana over Prometheus is the standard production setup.

The metric that catches the most production issues: per-task failure rate. A task whose failure rate climbs from 0.1% to 5% is a regression waiting to escalate.

7) Broker failure — what happens when Redis dies¶

With Redis broker. Workers can't fetch tasks; the web can't enqueue. Tasks already prefetched by workers may complete; new ones queue up in the application or fail (depending on the producer's error handling).

When Redis comes back: depends on persistence. If Redis was configured for AOF or RDB persistence, queued tasks survive. If not, they're lost. Default Redis with no persistence loses everything on restart.

Defence:

Configure Redis persistence: appendonly yes (AOF), or RDB snapshots, or both.
Use Redis Sentinel or Redis Cluster for HA — automatic failover to a replica.
Set producer-side retry on connection error so a transient broker hiccup doesn't lose the enqueue.

app.conf.broker_transport_options = {
    'visibility_timeout': 43200,
    'retry_on_timeout': True,
    'socket_keepalive': True,
}

For workloads where loss is unacceptable, RabbitMQ or SQS provide stronger durability out of the box.

8) The "Celery is slow" diagnosis tree¶

Common complaint: "Celery is slow." Walk through layers:

Per-task latency. Profile the task. Is it slow because of database, API call, CPU work?
Queue wait time. Tasks waiting in queue before pickup. Add workers or split queues.
Worker prefetch. High prefetch + slow tasks = head-of-line blocking. Set prefetch=1.
Broker latency. Redis usually <1ms; SQS usually 50-200ms. Network distance to broker matters.
Result backend write. If using results, each completion writes to the backend. Disable if unused.
Acks_late + transaction holding. If task starts a Postgres transaction and Celery's prefetch holds the task, transaction lifetime grows.
Serialization overhead. Pickle is slower than JSON; JSON is slower than msgpack. Profile if it's hot.

Each layer has a measurement. Don't guess; instrument.

9) Alternatives — when to look beyond Celery¶

Celery is the default but not the only choice:

RQ. Redis-based, simpler than Celery. Good for small projects. No chord, fewer features.
Dramatiq. Cleaner API than Celery, Redis or RabbitMQ broker. Good middle ground.
Arq. Async-native, for asyncio workloads. Less mature ecosystem.
Huey. Even simpler than RQ. For very small projects.

When to migrate from Celery:

The team's workload has shifted entirely to asyncio — Arq fits.
Celery's complexity is the constraint and the workload is simple — RQ or Dramatiq.
The workflow needs first-class DAG modeling — Airflow, Temporal, Prefect.

For most Python web teams, Celery remains the right default. Migration costs are real; benefits should be specific.

10) The threaded example — a production-grade payment task¶

A fintech runs a charge_payment task. It must:

Charge through Stripe.
Update the database to mark paid.
Send a receipt email.
Tolerate retries without double-charging.

@shared_task(
    bind=True,
    name='payments.charge_payment',
    autoretry_for=(stripe.error.RateLimitError, stripe.error.APIConnectionError),
    retry_backoff=True,
    retry_backoff_max=600,
    retry_jitter=True,
    max_retries=10,
    acks_late=True,
    soft_time_limit=30,
    time_limit=60,
)
def charge_payment(self, payment_id):
    payment = Payment.objects.select_for_update().get(pk=payment_id)
    if payment.status == 'charged':
        return  # idempotent: already done

    try:
        intent = stripe.PaymentIntent.create(
            amount=int(payment.amount * 100),
            currency='inr',
            idempotency_key=str(payment.id),
        )
        payment.stripe_intent_id = intent.id
        payment.status = 'charged'
        payment.charged_at = timezone.now()
        payment.save()
    except stripe.error.CardError as e:
        # client error — don't retry
        payment.status = 'failed'
        payment.failure_reason = str(e)
        payment.save()
        return

    # Schedule receipt after the database commits
    transaction.on_commit(lambda: send_receipt.delay(payment.id))

What this task does right:

Idempotency at three layers: database check (payment.status == 'charged'), Stripe idempotency_key, on-commit hook for the receipt.
Distinguishes retryable (rate limit, connection) from non-retryable (card declined).
Exponential backoff with jitter and a 10-minute cap.
acks_late=True so worker crash mid-task results in re-delivery, not loss.
select_for_update to prevent two workers from racing on the same payment.
Soft + hard time limits so a stuck Stripe call gets cleaned up.

This is what a production payments task looks like. Each detail is there because the team has bled.

Operational signals¶

Healthy. Per-task failure rate < 1%; queue depth oscillating; worker memory bounded by recycling; DLQ near-empty.

First degrading metric. DLQ accumulating. Failures are happening; max retries exhausted; nobody is reviewing.

Misleading metric. Total task throughput — masks per-task failure spikes.

Expert graph. Per-task failure rate × retry count × DLQ ingress — three together reveal the problem before users do.

Where this appears in production¶

Instagram — heavy use of recycling (worker_max_tasks_per_child) to bound memory.
Stripe (internal Python services) — explicit idempotency_key on every retry.
Pinterest — chord patterns with checkpointing for image pipelines.
A Mumbai fintech — on_commit hook for every side effect that follows a database write.
A Bengaluru SaaS — DLQ with custom UI for triage; >90% of DLQ items are replayed after a transient downstream fix.
A Goa-based travel SaaS — moved from Redis to RabbitMQ when DLQ requirement became regulatory.
A Pune AI platform — circuit breaker on every external API call from Celery.
A Delhi e-commerce — Flower for ad-hoc debugging; Grafana + Prometheus for production observability.

Recall / checkpoint¶

What is a retry storm and how does exponential backoff + jitter fix it?
What is a dead-letter queue and what belongs in it?
What signals a misbehaving worker pool (depth pattern, memory pattern, throughput)?
Why is worker_max_tasks_per_child a standard production setting?
How does checkpointing change the design of a long-running task?
What is transaction.on_commit and when is it the right pattern?
When does Celery stop being the right tool, and what comes next?

Interview Q&A¶

Q1. The team's downstream API was degraded for 15 minutes. After it recovered, the API kept failing. Walk through what happened and the fix. Retry storm. During the 15-minute degradation, tasks accumulated retries. When the API recovered, the backlog hit it at amplified rate; the API failed again. This loop continues until backoff catches up. Fix: exponential backoff with jitter; circuit breaker that fails fast when error rate is high; max-retries cap that routes to DLQ rather than retrying forever. After the fix, recovery is graceful — backoff lets the API recover; circuit breaker prevents amplification. Common wrong answer to avoid: "the API needs to be faster" — the structural fix is at the retry policy, not the downstream.

Q2. A team's DLQ has 50K items and nobody is looking at it. Walk through the response. DLQ is a symptom dashboard for tasks that failed permanently. 50K items means thousands of users haven't received what they should. The first move: triage by task name and error class. Many DLQ items have a common cause (config bug, downstream API change); fix the cause and replay. Going forward: DLQ size as a paging condition; per-task DLQ ingress rate as a Grafana alert; small UI for engineers to triage. Treat the DLQ as a backlog with the same rigour as a product backlog. Common wrong answer to avoid: "discard the DLQ" — discarding without investigation drops real failures.

Q3. Worker memory grows to 2 GB after running for a day. What is the fix? Set worker_max_tasks_per_child = 1000 (or similar). Every 1000 tasks the child exits and a fresh one starts; memory bounded by recycle interval. This handles leaks in libraries, cached querysets, fragmentation. Also set worker_max_memory_per_child as a backstop. Cost: occasional task latency spike during child restart (~1-2 seconds); benefit: bounded memory permanently. Common wrong answer to avoid: "find the leak" — necessary if leak is from your code; for library leaks, recycling is the structural fix.

Q4. A long-running task (45 minutes) keeps failing at minute 40. The whole thing restarts. Walk through the fix. Checkpoint. Refactor the task to process work in batches and save progress between batches; the database has a last_completed_batch field. On retry, the task resumes from the last batch. Combined with idempotency (the work in each batch is safe to redo), retries become recovery, not redo. If the task is regularly 45+ minutes, consider whether Celery is still the right tool — for workflows this long, Airflow or Temporal have better observability and first-class checkpointing. Common wrong answer to avoid: "make the task complete in 30 minutes" — sometimes the work just is long.

Q5. The team uses Redis broker with no persistence. After a Redis restart, queued tasks are gone. Walk through the response. Two layers of fix. Short-term: enable Redis AOF persistence (appendonly yes); configure Redis Sentinel for failover. Medium-term: consider whether the workload tolerates the loss profile of Redis. For workloads where loss is unacceptable (payments, regulated workflows), migrate to RabbitMQ (durable queues, publisher_confirms) or SQS (managed durability). The operational cost rises; the loss surface drops. Common wrong answer to avoid: "Redis is always sufficient" — depends on the loss tolerance of the workload.

Q6. The team's Celery dashboard shows healthy aggregates, but users report missing emails. Walk through diagnosis. Aggregate metrics hide per-task and per-tenant problems. Slice the metrics. Per-task failure rate may be elevated for send_email. Per-tenant: a specific tenant's emails may be in the DLQ. Per-error: a specific exception class may dominate. The complaint surface (users) and the metric surface (aggregate) are different views; reconcile by adding the slices that connect them. Common wrong answer to avoid: "our dashboard is fine" — the dashboard is missing the slice that would show the bug.

Operational memory¶

This chapter explained Celery's production surface: retry storms and backoff defences, dead-letter queues, queue-depth diagnostics, worker recycling, long-task patterns, broker failure modes, and the alternatives when Celery is wrong. The important idea is that Celery is convenient but not safe-by-default; production safety is a set of conscious choices (backoff, DLQ, idempotency, recycling, observability).

You learned to defend against retry storms, route permanent failures to DLQ, scale workers on queue depth, recycle worker memory, checkpoint long tasks, and choose alternatives when the workload outgrows Celery. That solves the production surface.

Carry this diagnostic forward: when Celery is suspected in a production incident, ask which operational surface is at fault — retry policy, DLQ, scaling, recycling, broker durability, or alternative-tool fit. Each has a structural fix.

Remember:

Exponential backoff + jitter prevents retry storms.
DLQ makes permanent failure visible.
Worker recycling bounds memory.
Idempotency + checkpointing makes long tasks safe.
Broker durability is a workload choice.
Per-task slice metrics catch what aggregates hide.

Bridge. Celery covers the in-process background job path. The next module — 07_sqs — covers the cloud-native message queue: AWS SQS as broker, standalone queue, or alternative to Celery's broker entirely. → ../07_sqs/00-eli5.md