03. DLQ, throughput, cost — the production surface¶

~10 min read. DLQ that fills up. Cost surprises from empty receives. Throughput that doesn't scale linearly. The SQS production catalogue.

Builds on: 02-sdk-and-poll-loops-day-to-day.md.

The previous chapters covered the model and the SDK. This is what production teaches.

1) Dead-letter queues — design and triage¶

A DLQ is itself an SQS queue. The main queue's RedrivePolicy points to it:

{
  "maxReceiveCount": "5",
  "deadLetterTargetArn": "arn:aws:sqs:ap-south-1:123:my-orders-dlq"
}

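After 5 receives without a delete, SQS moves the message to the DLQ instead of delivering it a sixth time. The DLQ has its own retention; set it longer than the main queue's (14 days is the maximum) so messages survive until triage. On Standard queues the expiry clock keeps running from the original enqueue time, so a DLQ retention that is not longer than the source queue's can expire messages before anyone inspects them.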
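A minimal sketch of building that RedrivePolicy in Python. The helper name `redrive_attributes` is hypothetical; the attribute name and the two policy keys are the ones shown above. The point it illustrates is that SQS takes the policy as a JSON string inside the queue attributes, not as a nested object.

```python
import json


def redrive_attributes(dlq_arn: str, max_receive_count: int = 5) -> dict:
    """Build the queue attribute that points a main queue at its DLQ."""
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": str(max_receive_count),
    }
    # SQS expects the whole policy as a JSON *string*, not a nested dict.
    return {"RedrivePolicy": json.dumps(policy)}


attrs = redrive_attributes("arn:aws:sqs:ap-south-1:123:my-orders-dlq")
# With a boto3 SQS client (assumed, not created here) this would be applied as:
#   sqs.set_queue_attributes(QueueUrl=main_queue_url, Attributes=attrs)
```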

Sizing maxReceiveCount. Too low: transient errors push healthy messages to the DLQ. Too high: poison messages cycle for hours (roughly maxReceiveCount × visibility timeout) before reaching the DLQ. Common values: 3-5 for fast-handler tasks; 5-10 for tasks with retryable downstream calls.

DLQ triage. Without a process, the DLQ grows silently. Standard pattern:

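Monitor DLQ depth (the ApproximateNumberOfMessages queue attribute, published to CloudWatch as ApproximateNumberOfMessagesVisible); alert on any increase.
An engineer pulls messages from the DLQ, inspects them, and decides per message: replay (back to main queue), fix and replay, or discard.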
For high-volume DLQs, build a small UI listing messages with replay/discard actions.
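The per-message decision in that process could be sketched as below. This assumes `sqs` is a boto3 SQS client (or anything with the same `receive_message`, `send_message`, and `delete_message` methods) and a Standard main queue; `triage_dlq` and the `decide` callback are hypothetical names, and a real tool would add logging and an audit trail.

```python
from typing import Callable


def triage_dlq(sqs, dlq_url: str, main_url: str,
               decide: Callable[[dict], str], max_batches: int = 10) -> dict:
    """Read up to max_batches receives from the DLQ, applying decide() per message.

    decide(message) must return "replay", "discard", or "keep".
    """
    counts = {"replay": 0, "discard": 0, "keep": 0}
    for _ in range(max_batches):
        resp = sqs.receive_message(
            QueueUrl=dlq_url,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=1,
            MessageAttributeNames=["All"],
        )
        messages = resp.get("Messages", [])
        if not messages:
            break
        for msg in messages:
            action = decide(msg)
            if action not in counts:
                raise ValueError(f"unknown action: {action!r}")
            counts[action] += 1
            if action == "keep":
                continue  # left in the DLQ; visible again after its timeout
            if action == "replay":
                # Re-send first, delete second: a crash in between duplicates
                # the message rather than losing it.
                sqs.send_message(
                    QueueUrl=main_url,
                    MessageBody=msg["Body"],
                    MessageAttributes=msg.get("MessageAttributes", {}),
                )
            sqs.delete_message(QueueUrl=dlq_url, ReceiptHandle=msg["ReceiptHandle"])
    return counts
```

The send-then-delete order is deliberate: at-least-once delivery already requires idempotent consumers, so a duplicate is the cheaper failure.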

AWS has DLQ redrive: a one-click replay from the console (since 2021) and, more recently, an API (StartMessageMoveTask). Useful for replaying after a downstream fix.
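For scripted redrive, a short sketch using the StartMessageMoveTask API through boto3. It assumes `sqs` is a boto3 SQS client from a release that includes `start_message_move_task`; the wrapper name `start_redrive` and the default rate are illustrative choices, not recommendations.

```python
def start_redrive(sqs, dlq_arn: str, max_per_second: int = 50) -> str:
    """Start an asynchronous move of DLQ messages back to their source queue(s)."""
    resp = sqs.start_message_move_task(
        SourceArn=dlq_arn,
        # DestinationArn omitted: messages return to the queue they came from.
        MaxNumberOfMessagesPerSecond=max_per_second,
    )
    # The handle identifies the running task, e.g. for cancelling it later.
    return resp["TaskHandle"]
```

Throttling the move rate matters after an incident: replaying a large DLQ at full speed can overload the downstream that just recovered.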

2) Throughput — what to expect¶

Standard SQS. Effectively unbounded. AWS describes Standard throughput as nearly unlimited rather than publishing a hard per-queue ceiling; in practice you can push tens of thousands of messages per second per queue with batching, and AWS scales the queue automatically.

FIFO SQS. Hard caps:

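300 API calls per second per queue, per action (SendMessage, ReceiveMessage, and DeleteMessage each get their own 300).
3000 messages per second with batch operations (300 batch calls × 10 messages each).
High-throughput mode: the same allocation applies per partition instead of per queue, and message groups are hashed across partitions. The per-queue ceiling is region-dependent and has been raised over time; treat 30000 TPS as a planning figure and check the current quotas.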

For Standard, you rarely hit a queue limit. For FIFO, plan throughput carefully — you may need to partition across multiple FIFO queues if the workload exceeds the cap.
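Partitioning across several FIFO queues can be sketched as a stable hash of the message group ID. The function name and queue list are hypothetical; the one real constraint is that the mapping must be identical in every producer process, which rules out Python's built-in `hash()` for strings.

```python
import hashlib


def pick_fifo_queue(group_id: str, queue_urls: list[str]) -> str:
    """Map a message group to one of N FIFO queues, stably across processes."""
    if not queue_urls:
        raise ValueError("queue_urls must not be empty")
    # hashlib, not built-in hash(): hash() is salted per process for strings.
    digest = hashlib.sha256(group_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(queue_urls)
    return queue_urls[index]
```

Every message of a given group lands on the same queue, so per-group ordering is preserved. Changing the number of queues remaps groups, so resize only when the queues are drained.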

Receive throughput per consumer: up to 10 messages per ReceiveMessage call. With long polling + batch + parallel processing, a single consumer can handle hundreds of messages per second. Beyond that, scale horizontally.

3) Cost — the surprises¶

SQS pricing has two components:

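Per-request. $0.40 per million for Standard, $0.50 per million for FIFO (list prices beyond the free tier; they vary slightly by region). Each 64 KB chunk of payload is billed as one request.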
Data transfer. Out-of-region transfer is billed per GB.

Cost surprises:

Empty receives. Without long polling, consumers poll constantly even when the queue is empty. Each is a billable receive call. A consumer polling every 100ms generates 36000 calls per hour, about $0.014/hour per consumer at Standard pricing. Across 100 consumers, that's roughly $1.44/hour (over $1000/month) for nothing.

Always use long polling. WaitTimeSeconds=20 reduces empty receives to one every 20 seconds per consumer.
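The empty-receive arithmetic is easy to check with a few lines. This sketch assumes the Standard list price of $0.40 per million requests and a permanently empty queue; the function name is illustrative.

```python
def empty_receive_cost_per_hour(consumers: int, poll_interval_s: float,
                                price_per_million: float = 0.40) -> float:
    """Hourly cost of receive calls against an empty queue."""
    calls_per_hour = consumers * (3600 / poll_interval_s)
    return calls_per_hour * price_per_million / 1_000_000


# 100 consumers: short polling every 100 ms vs. one 20-second long poll at a time.
short_poll = empty_receive_cost_per_hour(consumers=100, poll_interval_s=0.1)
long_poll = empty_receive_cost_per_hour(consumers=100, poll_interval_s=20)
print(f"short polling: ${short_poll:.2f}/hour, long polling: ${long_poll:.4f}/hour")
```

The ratio between the two is the ratio of the poll intervals, 200× in this case, regardless of price.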

Batch operations. A single send_message_batch with 10 messages costs 1 request, not 10. Same for receive and delete batches. For high-volume producers, batching cuts the request count by 10×.
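A sketch of producer-side batching. `to_batches` is a hypothetical helper that shapes entries for `send_message_batch`; `send_all` assumes `sqs` is a boto3 SQS client and simply collects the entries SQS reports as failed, leaving the retry policy to the caller.

```python
def to_batches(bodies: list[str], batch_size: int = 10) -> list[list[dict]]:
    """Split message bodies into SendMessageBatch-shaped entry lists."""
    if not 1 <= batch_size <= 10:
        raise ValueError("SQS batches hold between 1 and 10 messages")
    batches = []
    for start in range(0, len(bodies), batch_size):
        chunk = bodies[start:start + batch_size]
        # Id only needs to be unique within one batch; it ties results to entries.
        batches.append([{"Id": str(i), "MessageBody": body}
                        for i, body in enumerate(chunk)])
    return batches


def send_all(sqs, queue_url: str, bodies: list[str]) -> list[dict]:
    """Send every body in batches; return the entries SQS reported as failed."""
    failed = []
    for entries in to_batches(bodies):
        resp = sqs.send_message_batch(QueueUrl=queue_url, Entries=entries)
        failed.extend(resp.get("Failed", []))
    return failed
```

A batch call can succeed as a whole while individual entries fail, so the `Failed` list has to be checked even when no exception is raised.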

Cross-region transfer. Sending from EC2 in us-east-1 to SQS in ap-south-1 incurs data transfer costs. Co-locate SQS with consumers.

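KMS. SSE-KMS adds KMS API calls. SQS reuses each data key for the data key reuse period (KmsDataKeyReusePeriodSeconds, 5 minutes by default), so the call volume depends on how many producers and consumers touch the queue and how short that period is, not strictly on message count. With many principals or a short reuse period, KMS cost can still exceed SQS cost itself. Use SSE-SQS (no KMS charges) unless KMS is a regulatory requirement.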

4) Message attributes vs. message body¶

Two ways to attach data:

Message body — up to 256 KB; the payload.
Message attributes — up to 10 per message, each a name, data type, and value; they count toward the same message size limit as the body and are returned separately from it.

Use attributes for:

Filtering at the consumer level (e.g., a producer sets priority=high; consumer can route or log).
SNS filtering policies (when SQS is fanned out from SNS).
Metadata that doesn't belong in the JSON body.

Caveat: server-side encryption covers the message body only; message attributes are not encrypted at rest. Never put secrets in attributes, and avoid putting them in the body as well.
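As an illustration, this is the shape SQS expects for string attributes. The helper `string_attributes` and the example values are hypothetical; the `DataType`/`StringValue` structure and the limit of 10 attributes are SQS's own.

```python
def string_attributes(**values: str) -> dict:
    """Build the MessageAttributes structure for plain string metadata."""
    if len(values) > 10:
        raise ValueError("SQS allows at most 10 message attributes")
    return {
        name: {"DataType": "String", "StringValue": value}
        for name, value in values.items()
    }


message = {
    "MessageBody": '{"order_id": "o-1001"}',
    "MessageAttributes": string_attributes(priority="high", event_type="order.created"),
}
# With a boto3 SQS client (assumed): sqs.send_message(QueueUrl=queue_url, **message)
```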

5) The visibility-timeout vs. retry-policy tension¶

Visibility timeout is the SQS-side re-delivery. The consumer's retry policy is application-side.

The trade-off:

Long visibility + few application retries. The application processes once; if it fails, SQS re-delivers after the timeout. Simple, predictable.
Short visibility + many application retries. The consumer retries internally before letting the message lapse. Faster recovery on transient errors, but consumer holds the message longer; risks duplicate processing if visibility runs out mid-retry.

The canonical pattern: visibility timeout sized for the work; application retries for fast transient errors; let SQS re-deliver for slower/persistent failures. Don't retry forever inside the consumer — let maxReceiveCount move the message to DLQ.
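A minimal sketch of that split of responsibilities: a few fast in-process retries, then hand the message back to SQS by not deleting it. `TransientError` and `process_with_retries` are hypothetical names; the total retry time (here well under a second) has to stay inside the visibility timeout.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying in-process."""


def process_with_retries(handler, message, attempts=3, base_delay_s=0.2,
                         sleep=time.sleep):
    """Retry fast transient failures a few times, then give up to SQS."""
    for attempt in range(1, attempts + 1):
        try:
            return handler(message)
        except TransientError:
            if attempt == attempts:
                # Do not delete the message: SQS redelivers it after the
                # visibility timeout, and maxReceiveCount eventually moves
                # it to the DLQ.
                raise
            sleep(base_delay_s * 2 ** (attempt - 1))  # 0.2s, 0.4s, ...
```

Any other exception type propagates immediately, which is the desired behaviour for persistent failures such as malformed payloads.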

6) Scaling consumers — what triggers what¶

For autoscaling SQS consumers:

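ApproximateNumberOfMessages. Messages waiting to be received (CloudWatch publishes it as ApproximateNumberOfMessagesVisible); in-flight and delayed messages are counted separately. Useful for fan-out workloads.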

ApproximateNumberOfMessagesNotVisible. Currently in-flight (being processed). High value with low queue depth means consumers are saturated with current work.

ApproximateAgeOfOldestMessage. How long the oldest queued message has been waiting. The latency-oriented metric. Page on this when the SLA is at risk.

In Kubernetes with KEDA, scale based on backlog-per-pod or queue depth. In AWS Lambda with SQS triggers, the event source mapping adds pollers and concurrent invocations as the backlog grows.
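Backlog-per-pod scaling reduces to one formula. The sketch below is not KEDA's implementation, only the arithmetic such a scaler performs; the function name, bounds, and target are assumptions to tune per workload.

```python
import math


def desired_replicas(visible_messages: int, target_per_replica: int,
                     min_replicas: int = 1, max_replicas: int = 50) -> int:
    """Backlog-per-replica scaling: each consumer should own ~target messages."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = math.ceil(visible_messages / target_per_replica)
    # Clamp so an empty queue keeps a floor and a spike cannot scale unbounded.
    return max(min_replicas, min(max_replicas, wanted))
```

A reasonable target is the number of messages one consumer can clear within the latency SLA, which ties the depth metric back to ApproximateAgeOfOldestMessage.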

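For Lambda specifically: each invocation receives one batch, 10 messages by default (the batch size is configurable). Lambda scales the number of concurrent invocations up to the function's concurrency limit, or to the maximum concurrency set on the event source mapping. Watch for partial batch failures: by default any error fails the entire batch, so every message in it, including those already processed, returns to the queue and is retried. Enable ReportBatchItemFailures so that only the failed messages are retried and the successful ones are deleted.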
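A sketch of a handler that reports partial failures, assuming a Standard queue and an event source mapping with ReportBatchItemFailures enabled. `process` is a hypothetical stand-in for the business logic; the `Records`, `messageId`, `body`, `batchItemFailures`, and `itemIdentifier` keys are the ones Lambda defines for SQS events.

```python
import json


def process(body: dict) -> None:
    """Hypothetical business logic; raises on failure."""
    if body.get("poison"):
        raise ValueError("cannot process this message")


def handler(event, context):
    """SQS-triggered Lambda entry point that reports per-message failures."""
    failures = []
    for record in event["Records"]:
        try:
            process(json.loads(record["body"]))
        except Exception:
            # Only this message is retried; Lambda deletes the others.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Returning an empty `batchItemFailures` list means the whole batch succeeded. Without the setting enabled on the trigger, Lambda ignores the return value and a handler that swallows exceptions like this would silently drop failed messages.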

7) Encryption at rest and in transit¶

In transit. Always HTTPS (the AWS SDK defaults to this). Don't override.

At rest.

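No encryption. Messages stored unencrypted. This was the historical default; newly created queues now get SSE-SQS by default.
SSE-SQS. SQS-managed key. No additional cost. Enable in queue config (SqsManagedSseEnabled).
SSE-KMS. KMS key, AWS-managed or customer-managed. Adds KMS API charges. Choose for regulated workloads.

For most workloads, SSE-SQS is the baseline. Older queues may still be unencrypted, which is rarely justified; audit them and turn encryption on.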
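The two encrypted options map to queue attributes as sketched below. The helper name is hypothetical; `SqsManagedSseEnabled`, `KmsMasterKeyId`, and `KmsDataKeyReusePeriodSeconds` are SQS queue attributes, and the 60 to 86400 second range is the documented bound for the reuse period.

```python
def encryption_attributes(kms_key_id=None, reuse_period_s=300):
    """Queue attributes for SSE-SQS (default) or SSE-KMS (when a key is given)."""
    if kms_key_id is None:
        return {"SqsManagedSseEnabled": "true"}
    if not 60 <= reuse_period_s <= 86400:
        raise ValueError("reuse period must be between 60 and 86400 seconds")
    return {
        "KmsMasterKeyId": kms_key_id,
        # Longer reuse means fewer KMS calls, at the cost of keys living longer.
        "KmsDataKeyReusePeriodSeconds": str(reuse_period_s),
    }


# With a boto3 SQS client (assumed):
#   sqs.set_queue_attributes(QueueUrl=queue_url, Attributes=encryption_attributes())
```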

8) FIFO-specific gotchas¶

Deduplication window is 5 minutes. A duplicate MessageDeduplicationId submitted within 5 minutes is dropped (consumer never sees it). After 5 minutes, the same ID is no longer deduped. For workloads where duplicates can occur > 5 minutes apart, the FIFO dedup is not enough — application-side dedup is still needed.
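Application-side dedup can be sketched as a set of seen IDs with a TTL longer than five minutes. This in-memory version is only an illustration of the idea: `SeenIds` is a hypothetical class, and a real deployment with several consumers would need a shared store with an atomic check-and-set.

```python
import time


class SeenIds:
    """In-memory record of processed IDs with a TTL longer than SQS's 5 minutes."""

    def __init__(self, ttl_s: float = 24 * 3600, clock=time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock
        self._expiry = {}  # dedup id -> time after which it may be forgotten

    def first_time(self, dedup_id: str) -> bool:
        """Return True and record the ID if it has not been seen within the TTL."""
        now = self._clock()
        # Drop expired entries so the table does not grow without bound.
        self._expiry = {k: t for k, t in self._expiry.items() if t > now}
        if dedup_id in self._expiry:
            return False
        self._expiry[dedup_id] = now + self._ttl_s
        return True
```

A consumer would call `first_time(message_id)` before doing the work and skip (but still delete) the message when it returns False.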

Per-group throughput is the constraint. A single MessageGroupId is processed serially. If one group's messages dominate, throughput collapses to that group's processing speed. Choose group IDs to distribute work evenly.

Visibility timeout per group. A FIFO message in flight blocks the next message in the same group. Choose visibility timeout carefully — too long, and a stuck message blocks the group's queue for that duration.

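High-throughput mode. Enabled by setting FifoThroughputLimit=perMessageGroupId and DeduplicationScope=messageGroup, at creation or later on an existing queue. Without this, even thousands of distinct groups can't exceed 300 TPS aggregate. With this, SQS spreads message groups across partitions and each partition gets its own throughput allocation, so many active groups run in parallel; a single group is still processed serially.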
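As a config sketch, these are the attributes a high-throughput FIFO queue is created with. The function name and the queue name in the comment are hypothetical; the attribute names and values are the ones quoted above, plus `FifoQueue`, which every FIFO queue needs.

```python
def high_throughput_fifo_attributes() -> dict:
    """Attributes for creating a FIFO queue in high-throughput mode."""
    return {
        "FifoQueue": "true",
        # Dedup is tracked per message group rather than across the whole queue.
        "DeduplicationScope": "messageGroup",
        # Throughput quota applies per message group ID rather than per queue.
        "FifoThroughputLimit": "perMessageGroupId",
    }


# With a boto3 SQS client (assumed); FIFO queue names must end in ".fifo":
#   sqs.create_queue(QueueName="my-orders.fifo",
#                    Attributes=high_throughput_fifo_attributes())
```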

9) SNS-to-SQS fan-out¶

A common pattern: one event needs to fan out to multiple consumers. SNS topic → multiple SQS queue subscriptions:

event producer → SNS topic → subscribed: SQS queue A, SQS queue B, SQS queue C
                                          ↓               ↓               ↓
                                       consumer A      consumer B      consumer C

Each subscribed SQS queue gets its own copy of the message. Consumers act independently.

Subscription filter policies let each queue receive only matching messages:

{"event_type": ["order.created", "order.cancelled"]}
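To make the matching rule concrete, here is a deliberately simplified model of how a policy like the one above selects messages. It handles only exact string matching on flat attributes; real SNS filter policies support more operators (prefix, numeric, anything-but, and others), and `matches_policy` is a hypothetical name, not an SNS API.

```python
def matches_policy(policy: dict, attributes: dict) -> bool:
    """Exact-string matching only: every policy key must match an allowed value."""
    return all(
        attributes.get(key) in allowed
        for key, allowed in policy.items()
    )


policy = {"event_type": ["order.created", "order.cancelled"]}
delivered = matches_policy(policy, {"event_type": "order.created"})
filtered_out = matches_policy(policy, {"event_type": "order.shipped"})
```

Keys within a policy combine with AND, values within a key with OR; a message that lacks the attribute entirely does not match.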

The fan-out pattern is the standard AWS event-driven architecture. Producers don't need to know consumers; consumers subscribe via IaC.

10) Operational checklist¶

The mature SQS deployment has:

Long polling enabled (queue-level default ReceiveMessageWaitTimeSeconds=20).
DLQ configured with maxReceiveCount=5-10; DLQ depth alerted.
Visibility timeout sized to 3× p99 task duration.
Encryption at rest (SSE-SQS minimum).
IAM roles per producer/consumer with least-privilege.
Monitoring: depth, age of oldest, send/receive/delete rates, DLQ ingress.
Producer-side retry on send failures with backoff.
Consumer-side heartbeat for long tasks.
For must-deliver semantics: transactional outbox.

This checklist is the difference between "we use SQS" and "SQS is operating safely."

Operational signals¶

Healthy. ApproximateAgeOfOldestMessage < SLA; DLQ depth steady or zero; send rate matches receive rate; per-consumer throughput stable.

First degrading metric. ApproximateAgeOfOldestMessage climbing — backlog growing, SLA at risk.

Misleading metric. Raw send count — high count without context can be normal load or amplification.

Expert graph. Per-queue: depth × age-of-oldest × DLQ ingress; consumer count × per-consumer throughput. The combination shows where pressure is and where it will surface.

Where this appears in production¶

Netflix — extensive SNS-SQS fan-out for internal event distribution.
Stripe (parts of infrastructure) — outbox pattern for must-deliver events to SQS.
Airbnb — SQS for many background pipelines; tuned maxReceiveCount per workload.
A Bengaluru fintech — DLQ with custom UI for triage; DLQ replay is a routine engineering activity.
A Mumbai SaaS — FIFO SQS with high-throughput mode; per-tenant message groups for ordering.
A Pune logistics platform — SNS topic for shipment.events; 12 SQS queues subscribed for different downstream consumers.
A Delhi e-commerce — visibility timeout tuned per workload; heartbeat for video transcoding.
A Goa-based AI startup — ReportBatchItemFailures on Lambda triggers; partial batch retry instead of all-or-nothing.

Recall / checkpoint¶

What does maxReceiveCount do?
What is the throughput limit of FIFO without high-throughput mode?
Why is long polling cost-relevant?
What is the SNS-to-SQS fan-out pattern?
What is the 5-minute deduplication window and when does it fail?
What is ReportBatchItemFailures and what does it solve?
When is the transactional outbox necessary?

Interview Q&A¶

Q1. The team's DLQ has 100K messages. Walk through the response. The DLQ is the failure surface; 100K means thousands of users have an unfinished interaction. Triage by task or error_type: many messages usually share a common cause (a bad deploy, a downstream change). Identify the cause; fix; replay the DLQ via SQS redrive (one-click since 2021). Going forward: DLQ size as a paging condition; ingress rate as an alert; a process for daily triage. The DLQ is not a graveyard; it is a backlog. Common wrong answer to avoid: "purge the DLQ" — discarding without diagnosis drops real failures.

Q2. SQS bills are 10× expected for a workload with low message volume. Walk through diagnosis. The most common cause is empty receives. Without long polling, consumers poll constantly; each poll is billed. 100 consumers polling 10 times/second = 1000 receives/second = 86M calls/day = ~$35/day for empty receives alone. Fix: set WaitTimeSeconds=20. Receive call rate drops by 100-200×. The bill drops accordingly. Secondary causes: KMS calls (if SSE-KMS enabled); cross-region transfer. Common wrong answer to avoid: "negotiate AWS pricing" — almost always config, not pricing.

Q3. The team's FIFO queue throughput is capped at 300 TPS; they need 5000 TPS. Walk through the response. Three options. (1) Enable high-throughput mode: FifoThroughputLimit=perMessageGroupId, DeduplicationScope=messageGroup. SQS then spreads message groups across partitions, each with its own throughput allocation, so the queue scales with the number of active groups (the per-queue ceiling is region-dependent; 30000 TPS is a planning figure). A single hot group is still serial. (2) Partition manually across multiple FIFO queues by hashing the group ID: the producer sends to queue number hash(group_id) % N. (3) If SQS-enforced ordering is not required, switch to Standard SQS and enforce per-customer ordering in code. Choice depends on the workload's ordering needs. Common wrong answer to avoid: "use Standard" by reflex; it only works if the application can tolerate or repair out-of-order delivery.

Q4. A team uses SQS as the Celery broker. A Celery task outlives the visibility timeout (SQS's queue-level default is 30s; Celery's documented default for the SQS transport is 30 minutes) and re-runs. Walk through the fix. Two layers. SQS side: raise the queue's VisibilityTimeout to a value larger than the longest task. Celery side: set visibility_timeout in broker_transport_options to match the SQS setting. A mismatch means Celery assumes more time than SQS actually grants, leading to double-processing. Make tasks idempotent regardless (at-least-once is the contract). Common wrong answer to avoid: "reduce the task duration"; sometimes tasks just are long, so size the timeout for reality.
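The Celery side of that fix might look like the fragment below. The option names (`region`, `visibility_timeout`, `wait_time_seconds`) are Celery's SQS transport options; the region and the one-hour value are assumptions for illustration and should be sized to the slowest task.

```python
# celeryconfig.py (sketch)
broker_url = "sqs://"  # credentials come from the environment or instance role
broker_transport_options = {
    "region": "ap-south-1",
    "visibility_timeout": 3600,  # seconds; must exceed the slowest task
    "wait_time_seconds": 20,     # long polling, to avoid empty receives
}
```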

Q5. A Lambda function processes SQS messages but on any error the entire batch is retried. Walk through the fix. Default behaviour: if the Lambda function raises, the whole batch is considered failed; nothing is deleted; all 10 messages reappear after the visibility timeout. This is wasteful if 9 succeeded. Fix: enable ReportBatchItemFailures on the trigger; the Lambda response includes a list of batchItemFailures (only the failed message IDs); Lambda deletes the rest. In addition: per-message idempotency in the handler so retries don't double-process. Common wrong answer to avoid: "process messages one at a time", which gives up batch efficiency unnecessarily.

Q6. The team needs to add a new consumer for an event without modifying the producer. Walk through the pattern. SNS-to-SQS fan-out. Producer publishes to an SNS topic (instead of directly to one SQS queue). Multiple SQS queues subscribe to the topic. Each new consumer adds a new SQS queue subscription. Producers never change. Subscriptions can include filter policies — each queue receives only matching events. The pattern is the foundation of decoupled event-driven architectures on AWS. Common wrong answer to avoid: "modify the producer" — couples producers to consumers; doesn't scale.

Operational memory¶

This chapter explained the production surface of SQS: DLQ design and triage, throughput limits and scaling, cost surprises (empty receives, KMS, cross-region), encryption, FIFO gotchas, SNS-SQS fan-out, and the operational checklist. The important idea is that SQS is simple to start with and has surprising depth at the production layer; the patterns that make it safe are the patterns this catalogue documents.

You learned to size DLQ, scale consumers per workload, control cost, choose encryption, plan FIFO throughput, and use SNS fan-out for decoupling. That completes the SQS production surface.

Carry this diagnostic forward: when SQS is suspected, ask which production surface is involved — DLQ, throughput, cost, encryption, FIFO ordering, or fan-out. Each has a structural fix.

Remember:

DLQ is a backlog, not a graveyard.
Long polling is the cost defence.
FIFO high-throughput mode unlocks per-group parallelism.
SSE-SQS by default; SSE-KMS only when regulated.
SNS-SQS fan-out decouples producers from consumers.

Bridge. SQS is the AWS-managed message queue. The next module — 08_kafka — covers the streaming log: Kafka as a distributed commit log with partitioned, retained, replayable streams. → ../08_kafka/00-eli5.md