01. Visibility timeout, FIFO, deduplication: how SQS actually delivers
~11 min read. Visibility timeout is the parameter you will tune again and again. FIFO ordering and exactly-once processing come with caveats. This chapter opens the SQS delivery model.
Builds on: 00-eli5.md.
The mailbox picture is enough to start. To debug "why did this message process three times?" you need the model.
1) Visibility timeout: the actual mechanics
When a consumer calls ReceiveMessage, SQS marks the message as "in flight" and sets a deadline. Until that deadline, no other consumer will receive the message. The consumer has two choices:
- Delete the message before the deadline → message is gone forever.
- Do nothing → after the deadline, the message reappears in the queue.
    T=0   Consumer A receives msg with visibility_timeout = 30s.
          Message is "in flight"; invisible to other consumers.
    T=5   Consumer A is still processing.
          Consumer B calls ReceiveMessage and gets nothing for this message.
    T=30  Deadline. Message becomes visible again.
    T=31  Consumer C calls ReceiveMessage and gets the same message.
    T=35  Consumer A finishes (slow path) and calls DeleteMessage.
          Delete succeeds (returns 200) but C is already processing.
    Outcome: the message was processed twice.
This is the at-least-once semantics in action. The fix is to make the work idempotent or extend the visibility timeout dynamically:
    sqs.change_message_visibility(
        QueueUrl=queue_url,
        ReceiptHandle=receipt_handle,
        VisibilityTimeout=60,  # message stays invisible for 60s from this call
    )
For long tasks, set the initial timeout long, or extend it periodically as a heartbeat from the consumer.
The maximum visibility timeout is 12 hours, measured from the moment the message was received. Extending with ChangeMessageVisibility does not reset that clock: the total time in flight for one receive still caps at 12 hours.
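The T=0 to T=35 timeline earlier in this section can be reproduced with a toy in-memory model. This is a sketch for intuition only: `ToyQueue` is a hypothetical class, not part of any AWS SDK, and it tracks a single message with plain integer timestamps.

```python
class ToyQueue:
    """In-memory stand-in for one SQS message; not the real service."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self.visible_at = 0      # time at which the message can be received
        self.deleted = False
        self.receive_count = 0

    def receive(self, now):
        """Return True if the message is handed to the caller at time `now`."""
        if self.deleted or now < self.visible_at:
            return False
        self.receive_count += 1
        self.visible_at = now + self.visibility_timeout  # start a new deadline
        return True

    def delete(self):
        self.deleted = True


queue = ToyQueue(visibility_timeout=30)
got_a = queue.receive(now=0)    # Consumer A receives; deadline is T=30
got_b = queue.receive(now=5)    # Consumer B: message is in flight
got_c = queue.receive(now=31)   # Consumer C: deadline passed, redelivered
queue.delete()                  # Consumer A finishes late at T=35
print(got_a, got_b, got_c, queue.receive_count)
```

Both A and C were handed the message, which is why the handler has to tolerate running twice.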
2) Setting the right visibility timeout
Rule of thumb: 3 × p99 task duration. If your p99 task is 5 seconds, set visibility to 15 seconds. The 3× gives margin for slow tasks, and it costs nothing on the fast path because consumers that finish early delete the message well before the timeout.
    task duration distribution     visibility timeout setting
    ─────────────────────────────  ──────────────────────────
    p50 = 200ms, p99 = 3s          10-15s (3-5× p99)
    p50 = 5s,    p99 = 30s         90-120s
    p50 = 30s,   p99 = 3min        10-15 min
    p50 = 5min,  p99 = 30min       90-120 min (or use heartbeat)
Too short: re-delivery during in-flight work; duplicate processing.
Too long: stuck consumer holds the message for a long time before redrive; latency for retries.
For variable-duration tasks, use heartbeat extension: every 30 seconds while the task runs, reset the timeout to 60 seconds from now with ChangeMessageVisibility. The task's clean exit deletes the message.
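The 3 × p99 rule of thumb from this section can be turned into a small helper. This is a minimal sketch under stated assumptions: the function names are hypothetical, the percentile uses the nearest-rank method, and the durations are assumed to be measured task times in seconds.

```python
import math

MAX_VISIBILITY_SECONDS = 12 * 60 * 60  # SQS upper bound: 12 hours


def p99(durations):
    """Nearest-rank 99th percentile of a non-empty list of durations (seconds)."""
    ordered = sorted(durations)
    rank = -(-99 * len(ordered) // 100)  # integer ceil(0.99 * n), 1-based
    return ordered[rank - 1]


def suggest_visibility_timeout(durations, multiplier=3):
    """Apply the multiplier-times-p99 rule, clamped to what SQS accepts."""
    seconds = math.ceil(multiplier * p99(durations))
    return max(1, min(seconds, MAX_VISIBILITY_SECONDS))


# 98 fast tasks and 2 slow ones: the p99 task takes 5 seconds
samples = [0.2] * 98 + [5.0] * 2
print(suggest_visibility_timeout(samples))
```

When the suggestion hits the 12-hour clamp, that is a sign the task needs a heartbeat or a different design rather than a larger timeout.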
3) FIFO queues: when order matters
A FIFO queue (MyQueue.fifo) provides:
- Strict ordering within a message group ID. Messages with the same MessageGroupId are delivered in the order they were sent.
- Exactly-once processing. SQS deduplicates messages with the same MessageDeduplicationId within a 5-minute window.
    sqs.send_message(
        QueueUrl=fifo_queue_url,
        MessageBody=json.dumps(payload),
        MessageGroupId=f'customer-{customer_id}',
        MessageDeduplicationId=str(event_id),
    )
Three quirks:
Throughput. Standard SQS offers nearly unlimited throughput. FIFO is capped at 300 TPS per queue by default (3000 messages per second with batches of 10). For higher FIFO throughput, use high-throughput mode (per-message-group-id parallelism), or partition manually across multiple FIFO queues.
Order is per-group, not global. Different MessageGroupId values are delivered in parallel without ordering between them. Choose group ID such that the things that must be ordered share a group, and unrelated things don't.
Deduplication is content-based or explicit. Either set MessageDeduplicationId explicitly (the safer pattern; you control the dedup token), or enable ContentBasedDeduplication (SQS hashes the message body and dedupes within 5 minutes). The 5-minute window is the trap, and it applies to both modes: a duplicate sent 6 minutes after the original is not deduped.
For most use cases, set MessageDeduplicationId to a stable token (event ID, payment ID) so dedup does not depend on the body being byte-identical. For duplicates that can arrive more than 5 minutes apart, the consumer still has to be idempotent.
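The 5-minute trap is easier to see with a toy model of the deduplication window. This is an illustrative sketch only: `ToyDedupWindow` is a hypothetical class, the real check happens inside SQS, and the exact behaviour at the window boundary is not modelled.

```python
DEDUP_WINDOW_SECONDS = 5 * 60


class ToyDedupWindow:
    """Toy model of FIFO deduplication; the real service does this server-side."""

    def __init__(self):
        self._accepted_at = {}  # dedup_id -> time the id was last accepted

    def accept(self, dedup_id, now):
        """Return True if a send with this id at time `now` would be enqueued."""
        seen_at = self._accepted_at.get(dedup_id)
        if seen_at is not None and now - seen_at < DEDUP_WINDOW_SECONDS:
            return False  # duplicate inside the window: not delivered again
        self._accepted_at[dedup_id] = now
        return True


window = ToyDedupWindow()
first = window.accept("payment-42", now=0)         # accepted
retry_fast = window.accept("payment-42", now=60)   # 1 min later: deduplicated
retry_slow = window.accept("payment-42", now=360)  # 6 min later: accepted again
print(first, retry_fast, retry_slow)
```

The slow retry gets through even though the token is identical, so late duplicates have to be caught by the consumer.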
4) Long polling: the under-used setting
By default, ReceiveMessage uses short polling: it samples a subset of SQS servers and returns immediately, even if it found nothing. This produces:
- Empty receives (cost: $0.40 per million; adds up).
- Slow apparent throughput (consumer round-trips with nothing).
- "Why did my consumer miss the message that was clearly in the queue?" SQS is distributed; one short poll may sample only servers that don't hold your message.
Long polling fixes all three:
    response = sqs.receive_message(
        QueueUrl=queue_url,
        WaitTimeSeconds=20,      # long poll
        MaxNumberOfMessages=10,  # batch up to 10
    )
The receive waits up to 20 seconds for a message; returns immediately if a message is available. With MaxNumberOfMessages=10, you can batch.
The trade-off: long polling holds a connection. If you have hundreds of consumer threads each long-polling, you have hundreds of open connections. For most workloads, this is fine. For very high consumer concurrency, tune WaitTimeSeconds down (1-5 seconds).
MaxNumberOfMessages=10 is the maximum; receiving 10 at a time amortises the API cost across more work.
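The cost of empty receives is simple arithmetic. The sketch below uses the $0.40 per million figure quoted above; the 100 ms idle loop and the fleet of 50 consumers are assumptions picked for illustration, not measurements.

```python
SECONDS_PER_DAY = 24 * 60 * 60
PRICE_PER_MILLION_REQUESTS = 0.40  # USD, figure quoted in this chapter


def idle_requests_per_day(seconds_per_receive, consumers=1):
    """Empty ReceiveMessage calls per day for consumers polling an empty queue."""
    return consumers * SECONDS_PER_DAY / seconds_per_receive


def daily_cost(requests):
    """Request cost in USD at the flat per-million price above."""
    return requests / 1_000_000 * PRICE_PER_MILLION_REQUESTS


# Assumption: a short-polling consumer loops roughly every 100 ms when idle.
short_poll = idle_requests_per_day(0.1, consumers=50)
# A long poll on an empty queue returns after WaitTimeSeconds=20.
long_poll = idle_requests_per_day(20, consumers=50)
print(f"short polling: {short_poll:,.0f} requests/day, ${daily_cost(short_poll):.2f}")
print(f"long polling:  {long_poll:,.0f} requests/day, ${daily_cost(long_poll):.2f}")
```

Under these assumptions the idle fleet makes 200 times fewer requests with long polling, which is the ratio of the two receive intervals.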
5) Message size and large payloads
SQS message size cap: 256 KB, counting the body and any message attributes. Larger payloads need a workaround.
Standard pattern: put the large payload in S3; put the S3 object key in the SQS message.
    import uuid

    key = f'task-{uuid.uuid4()}'  # one unique object key per payload
    s3.put_object(Bucket='myapp-payloads', Key=key, Body=payload)
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps({'s3_key': key}),
    )
The consumer fetches the body from S3 after receiving the SQS message.
For Java/JVM consumers, AWS provides the Amazon SQS Extended Client Library that handles this transparently. For Python and others, it's a few lines of glue.
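The consumer half of the S3-pointer pattern might look like the sketch below. It assumes `s3` is a boto3 S3 client (or anything with a compatible `get_object`), that the message body carries the `s3_key` field written by the producer snippet above, and that `load_payload` is a hypothetical helper name.

```python
import json


def load_payload(message, s3, bucket):
    """Resolve an S3-pointer message into the payload bytes it refers to."""
    pointer = json.loads(message["Body"])
    obj = s3.get_object(Bucket=bucket, Key=pointer["s3_key"])
    return obj["Body"].read()  # get_object returns the data as a stream
```

Delete the SQS message only after the payload has been fetched and processed, so a failed fetch is retried like any other failure.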
6) Encryption: SSE-SQS and KMS
SQS supports two server-side encryption modes:
- SSE-SQS. AWS-managed key. No additional cost. Simplest.
- SSE-KMS. Customer-managed KMS key. Adds per-request KMS API costs.
For most workloads, SSE-SQS is sufficient. For workloads with compliance requirements (HIPAA, PCI), SSE-KMS gives the customer-managed key control auditors want.
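SSE-KMS adds latency and cost through KMS API calls. SQS does not call KMS for every message: it reuses a data key for the data key reuse period (KmsDataKeyReusePeriodSeconds, 5 minutes by default). For high-throughput queues with many producers and consumers, KMS cost can still dominate. Audit before enabling.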
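Both modes are switched on through queue attributes. The sketch below builds the attribute maps; the helper names are hypothetical, `sqs` is assumed to be a boto3 SQS client, and the key alias is a placeholder. The data key reuse period influences how often SQS has to call KMS, so it is the main cost lever for SSE-KMS.

```python
def sse_sqs_attributes():
    """Queue attributes that turn on SQS-managed encryption (SSE-SQS)."""
    return {"SqsManagedSseEnabled": "true"}


def sse_kms_attributes(kms_key_id, reuse_seconds=300):
    """Queue attributes for SSE-KMS with an explicit data key reuse period."""
    if not 60 <= reuse_seconds <= 86400:
        raise ValueError("reuse period must be between 60 seconds and 24 hours")
    return {
        "KmsMasterKeyId": kms_key_id,
        # A longer reuse period means fewer KMS calls, at the cost of
        # each data key protecting more messages.
        "KmsDataKeyReusePeriodSeconds": str(reuse_seconds),
    }


def apply_encryption(sqs, queue_url, attributes):
    # Attribute values are passed as strings to SetQueueAttributes.
    sqs.set_queue_attributes(QueueUrl=queue_url, Attributes=attributes)
```

For example, `apply_encryption(sqs, queue_url, sse_kms_attributes("alias/my-queue-key", 3600))` would request an hourly data key rotation on the queue.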
7) The consumer poll loop pattern
    def poll_loop(queue_url):
        while True:
            try:
                response = sqs.receive_message(
                    QueueUrl=queue_url,
                    WaitTimeSeconds=20,
                    MaxNumberOfMessages=10,
                    AttributeNames=['ApproximateReceiveCount'],
                    MessageAttributeNames=['All'],
                )
            except Exception as e:
                log.exception("SQS receive failed: %s", e)
                time.sleep(5)
                continue
            messages = response.get('Messages', [])
            if not messages:
                continue  # long poll timed out; loop
            for message in messages:
                try:
                    process(message)
                    sqs.delete_message(
                        QueueUrl=queue_url,
                        ReceiptHandle=message['ReceiptHandle'],
                    )
                except Exception as e:
                    log.exception("Processing failed for message %s: %s",
                                  message['MessageId'], e)
                    # Don't delete; visibility timeout will redrive
What this poll loop does right:
- Long polling (20s wait).
- Batch receive (up to 10).
- Per-message try/except; one failure doesn't stop the batch.
- Explicit delete only on success.
- Retry on the receive itself if SQS API has a transient failure.
What's missing for production: heartbeat extension for long tasks, parallel processing within the batch (use threads or processes), graceful shutdown on SIGTERM, metrics.
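Of the production gaps listed above, graceful shutdown is the smallest to sketch. The version below is one possible shape, using only the standard library; `run_until_shutdown` and `poll_once` are hypothetical names, where `poll_once` stands for a single receive, process, and delete cycle.

```python
import signal
import threading

shutdown_event = threading.Event()


def request_shutdown(signum, frame):
    """Signal handler: ask the loop to stop after the current batch."""
    shutdown_event.set()


def install_signal_handlers():
    # Must be called from the main thread. SIGTERM is what most
    # orchestrators send before killing a process.
    signal.signal(signal.SIGTERM, request_shutdown)
    signal.signal(signal.SIGINT, request_shutdown)


def run_until_shutdown(poll_once):
    """Call poll_once() repeatedly until a shutdown has been requested."""
    batches = 0
    while not shutdown_event.is_set():
        poll_once()  # one receive + process + delete cycle
        batches += 1
    return batches
```

With a 20-second long poll, the loop can take up to 20 seconds plus one batch of processing to notice the signal, so the orchestrator's termination grace period should allow for that.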
8) The "ApproximateReceiveCount" trick
SQS adds an attribute to received messages: how many times this message has been received. Useful for poison-message detection:
    receive_count = int(message['Attributes']['ApproximateReceiveCount'])
    if receive_count > 5:
        log.warning("Poison message %s: receive_count=%d", message['MessageId'], receive_count)
        # Route to DLQ explicitly or skip processing and let redrive policy handle it
The redrive policy (chapter 03) usually handles this automatically by routing to DLQ after maxReceiveCount receives. But for tasks that need explicit awareness (an audit log, a custom alert), the count is in the message.
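For a batch receive, the same check can be applied to every message before any work starts. A minimal sketch, assuming the receive call requested the ApproximateReceiveCount attribute; `split_poison` and its threshold of 5 are illustrative choices, not an SQS feature.

```python
def split_poison(messages, max_receives=5):
    """Partition a received batch into (healthy, poison) by receive count."""
    healthy, poison = [], []
    for message in messages:
        # The attribute is present only if the receive call asked for it;
        # treat a missing value as a first delivery.
        attributes = message.get("Attributes", {})
        count = int(attributes.get("ApproximateReceiveCount", 1))
        (poison if count > max_receives else healthy).append(message)
    return healthy, poison
```

The poison list is where an audit log entry or a custom alert would be emitted; the redrive policy can still move those messages to the DLQ on its own.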
9) Delay queues and per-message delay
Two delay mechanisms:
Per-queue delay (DelaySeconds). Every message sent to the queue is delayed by this amount before becoming visible. Set on the queue itself; applies to all messages.
Per-message delay.
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=body,
        DelaySeconds=300,  # this message is invisible for 5 minutes
    )
Useful for retry-with-delay patterns: producer schedules a follow-up message with a delay. Max per-message delay is 15 minutes; for longer, persist and re-enqueue at the right time. FIFO queues accept only the per-queue delay, not a per-message DelaySeconds.
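A retry-with-delay producer usually grows the delay with each attempt. The sketch below shows one way to do that on a Standard queue, assuming `sqs` is a boto3 SQS client and that the caller tracks the attempt number itself (for example inside the message body); the 30-second base and the helper names are illustrative.

```python
MAX_DELAY_SECONDS = 900  # SQS limit for DelaySeconds: 15 minutes


def retry_delay(attempt, base=30):
    """Exponential backoff in seconds for a 0-based attempt number, capped."""
    return min(base * 2 ** attempt, MAX_DELAY_SECONDS)


def schedule_retry(sqs, queue_url, body, attempt):
    """Re-enqueue `body` with a delay that grows with each attempt."""
    delay = retry_delay(attempt)
    sqs.send_message(QueueUrl=queue_url, MessageBody=body, DelaySeconds=delay)
    return delay
```

Once the cap is reached every further retry waits the full 15 minutes, which is the point where persisting the task and re-enqueueing later becomes the better fit.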
10) The threaded example: an order processor
A team consumes an orders SQS queue. Each message is "process this order"; tasks take 2-15 seconds. The team's poll loop:
    import json
    import threading
    from concurrent.futures import ThreadPoolExecutor, as_completed

    def process_orders():
        while not shutdown_event.is_set():
            messages = sqs.receive_message(
                QueueUrl=orders_queue_url,
                WaitTimeSeconds=20,
                MaxNumberOfMessages=10,
            ).get('Messages', [])
            with ThreadPoolExecutor(max_workers=10) as pool:
                futures = [pool.submit(handle_message, m) for m in messages]
                for future in as_completed(futures):
                    try:
                        future.result()  # surface exceptions
                    except Exception:
                        # Log and keep polling; the undeleted message is retried
                        log.exception("Order handling failed")

    def handle_message(message):
        body = json.loads(message['Body'])
        receipt = message['ReceiptHandle']
        # Heartbeat thread to extend visibility every 30s
        stop_heartbeat = threading.Event()

        def heartbeat():
            # wait() returns False on timeout and True once the event is set
            while not stop_heartbeat.wait(30):
                sqs.change_message_visibility(
                    QueueUrl=orders_queue_url,
                    ReceiptHandle=receipt,
                    VisibilityTimeout=60,
                )

        threading.Thread(target=heartbeat, daemon=True).start()
        try:
            process_order(body)
            sqs.delete_message(QueueUrl=orders_queue_url, ReceiptHandle=receipt)
        finally:
            stop_heartbeat.set()
Initial visibility timeout: 60s. Heartbeat extends every 30s while the task runs. On success: delete. On failure: don't delete; the message reappears for retry.
The whole pattern is a few dozen lines and handles long-running tasks, parallelism, retries, and clean shutdown.
Operational signals
Healthy. ApproximateNumberOfMessagesVisible (CloudWatch metric) oscillates with load; ApproximateAgeOfOldestMessage stays low; DLQ depth near zero.
First degrading metric. ApproximateAgeOfOldestMessage climbing. Consumers can't keep up; backlog growing.
Misleading metric. Total receive count (NumberOfMessagesReceived): a high count can include both real work and re-deliveries.
Expert graph. Send rate, receive rate, delete rate, and DLQ ingress rate on one chart: together they reveal where messages flow and where they stall.
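The first degrading metric can be pulled programmatically for an alert or a runbook check. A minimal sketch, assuming `cloudwatch` is a boto3 CloudWatch client and that `oldest_message_age` is a hypothetical helper; the 15-minute lookback and 60-second period are arbitrary choices.

```python
from datetime import datetime, timedelta, timezone


def oldest_message_age(cloudwatch, queue_name, minutes=15):
    """Max ApproximateAgeOfOldestMessage (seconds) over the last `minutes`."""
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/SQS",
        MetricName="ApproximateAgeOfOldestMessage",
        Dimensions=[{"Name": "QueueName", "Value": queue_name}],
        StartTime=end - timedelta(minutes=minutes),
        EndTime=end,
        Period=60,  # one datapoint per minute
        Statistics=["Maximum"],
    )
    points = response.get("Datapoints", [])
    # None means CloudWatch returned no datapoints for the window.
    return max((p["Maximum"] for p in points), default=None)
```

A value that keeps rising across successive calls is the "consumers can't keep up" signal described above.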
Where this appears in production
- Amazon Prime Video: SQS for fan-out tasks; FIFO for per-customer event ordering.
- Airbnb: SQS for many internal pipelines; well-documented patterns on heartbeat.
- Slack: SQS for some background pipelines.
- A Bengaluru fintech: FIFO SQS with MessageGroupId = account_id for per-account event ordering.
- A Mumbai retail SaaS: Standard SQS for order processing; per-tenant queue with redrive to DLQ.
- A Pune analytics platform: SQS + S3 pattern for large data payloads.
- A Goa-based logistics SaaS: heartbeat extension for tasks that vary from seconds to 10 minutes.
- A Delhi food-delivery platform: long polling everywhere; cost on empty receives dropped 80%.
Recall / checkpoint
- What is visibility timeout and what happens at the deadline?
- How do you size visibility timeout for a workload?
- What guarantees does FIFO SQS add over Standard?
- What is the 5-minute deduplication window and where does it surprise teams?
- What is long polling and why is it almost always the right setting?
- How do you handle messages larger than 256 KB?
- What is the heartbeat pattern and when do you need it?
Interview Q&A
Q1. A task is being processed three times for the same message. Walk through diagnosis.
Either the visibility timeout is shorter than the task takes (most common), or the consumer is crashing before delete (less common). Diagnosis: check the task's actual duration distribution vs. the queue's visibility timeout. If the p99 task is 30s and the timeout is 30s, every long task gets re-delivered. Fix: raise the timeout, or add heartbeat extension. Verify by tracking ApproximateReceiveCount per message; healthy queues have most messages at receive_count=1. Common wrong answer to avoid: "SQS is broken" — at-least-once is the contract; tune around it.
Q2. A team needs ordered processing per customer. Walk through the FIFO choice.
FIFO queue with MessageGroupId = customer_id. Each customer's messages are processed in order. Different customers process in parallel: no global ordering, just per-customer. Throughput: 300 TPS per queue base; enable high-throughput mode for higher rates (per-group-id parallelism). Deduplication: set MessageDeduplicationId to a stable event ID so producer retries are dropped. Watch the 5-minute dedup window: a retry that arrives later than that is accepted as a new message, so keep the consumer idempotent. Common wrong answer to avoid: "use FIFO for everything". FIFO has a throughput cost; only use it when order matters.
Q3. The team's consumer is processing one message at a time with short polling. Walk through optimisations.
Three quick wins. (1) Switch to long polling (WaitTimeSeconds=20) — fewer empty receives, lower cost, faster pickup. (2) Batch receive (MaxNumberOfMessages=10) — amortise API cost across messages. (3) Process the batch in parallel (thread pool) — wall time drops. After all three, the consumer's throughput typically goes from ~5 msgs/sec to 50+ msgs/sec on the same infrastructure. Common wrong answer to avoid: "add more consumer instances" — first optimise the consumer; then scale.
Q4. A 1 MB payload needs to flow through SQS. Walk through the pattern.
SQS max message size is 256 KB. For larger payloads, use the S3-pointer pattern: producer puts the payload in S3 with a unique key; sends an SQS message with the key. Consumer reads the SQS message, fetches from S3, processes, deletes. AWS has the Extended Client Library for Java that does this transparently; for Python it's a few lines. Cleanup: delete the S3 object after successful processing, or use S3 lifecycle policies to expire payloads after N days. Common wrong answer to avoid: "split the message", which adds complexity and makes exactly-once harder.
Q5. A consumer holds messages for 10 minutes; the visibility timeout is 60 seconds. What happens, and what is the fix?
After 60s, the message becomes visible to other consumers. While the first consumer is still working, a second consumer picks up and processes. Result: double processing, with potential side-effect duplication. Fix: implement heartbeat — every 30s, the consumer calls ChangeMessageVisibility to extend the timeout by another 60s. The message stays in-flight as long as the consumer is alive. On consumer crash, no heartbeat is sent; the message reappears for retry. Common wrong answer to avoid: "raise the timeout to 10 minutes" — works but punishes failure cases with 10-minute redrive latency.
Q6. The team enabled SSE-KMS and SQS costs tripled. Walk through the diagnosis.
KMS charges per API call. With SSE-KMS, SQS calls KMS for data keys and reuses each data key for the data key reuse period (5 minutes by default), so the call volume grows with the number of producers and consumers and with how short that period is. At scale, KMS API costs can dominate. Options: (1) use SSE-SQS instead: AWS-managed keys, no per-API cost; sufficient for most compliance regimes. (2) lengthen the reuse period (KmsDataKeyReusePeriodSeconds, up to 24 hours) so each data key covers more messages. (3) audit whether KMS is actually required by compliance; sometimes SSE-SQS is enough. Common wrong answer to avoid: "disable encryption", which is almost never the right answer.
Operational memory
This chapter explained SQS delivery: visibility timeout mechanics, FIFO ordering and deduplication, long polling, message size limits, encryption, and the consumer poll loop. The important idea is that visibility timeout is the parameter that controls duplicate processing; sizing it correctly is the difference between safe and chaotic.
You learned to size visibility timeout, choose Standard vs. FIFO, use long polling and batching, handle large payloads via S3, and implement heartbeat for long tasks. That solves the delivery layer; day-to-day code patterns come next.
Carry this diagnostic forward: when SQS misbehaves, ask which delivery property is at fault — timeout, ordering, deduplication, polling mode, or message size. Each has a known fix.
Remember:
- Visibility timeout = 3× p99 task duration; extend with heartbeat for long tasks.
- FIFO for per-group ordering; Standard for everything else.
- Long polling + batch receive + parallel process is the consumer pattern.
- S3-pointer pattern for payloads > 256 KB.
- ApproximateReceiveCount > 1 means the message was re-delivered; investigate.
Bridge. The delivery model is set. Day-to-day, you write producers and consumers with the AWS SDK. The next chapter is that surface. → 02-sdk-and-poll-loops-day-to-day.md