00. SQS: ELI5

SQS is AWS's mailbox-as-a-service. You drop a letter; another process picks it up; if the picker doesn't confirm delivery quickly enough, the letter pops back into the mailbox for the next picker. Managed, durable, simple — and deeply unlike most queues you've used.

Picture a row of mailboxes at a post office. Producers walk up and drop letters in. Consumers walk up and ask "any letters?" and get one if there is one. They take it away to process. If they come back before the visibility timeout expires (30 seconds by default) with proof they processed it (a delete call carrying the receipt handle), the letter is removed from the mailbox forever. If not, the mailbox treats the letter as undelivered and gives it to the next consumer who asks.

That is SQS. A queue. Durable (AWS replicates across availability zones). Managed (no broker to operate). At-least-once delivery (re-delivers if not confirmed). The post-office bit you don't see — the persistence, the replication, the load balancing across pickers — is what you pay AWS to handle.
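The mailbox behaviour can be sketched as a toy in-memory model. This is not the SQS API: `ToyQueue`, its methods, and the explicit `now` clock are hypothetical names chosen for illustration, and the toy only honours the most recent receipt for a message, which is a simplification.

```python
import itertools
import uuid


class ToyQueue:
    """In-memory stand-in for the mailbox model; not the SQS API."""

    def __init__(self, visibility_timeout=30.0):
        self.visibility_timeout = visibility_timeout
        self._messages = {}  # message id -> [body, visible_at, latest_receipt]
        self._ids = itertools.count(1)

    def send(self, body, now=0.0):
        msg_id = next(self._ids)
        self._messages[msg_id] = [body, now, None]
        return msg_id

    def receive(self, now):
        """Return (body, receipt) for one visible message, or None."""
        for entry in self._messages.values():
            if entry[1] <= now:
                receipt = uuid.uuid4().hex
                entry[1] = now + self.visibility_timeout  # hide from other consumers
                entry[2] = receipt
                return entry[0], receipt
        return None

    def delete(self, receipt):
        """Remove the message whose latest receipt matches; True if removed."""
        for msg_id, entry in list(self._messages.items()):
            if entry[2] == receipt:
                del self._messages[msg_id]
                return True
        return False


queue = ToyQueue(visibility_timeout=30.0)
queue.send("letter-1", now=0.0)

first = queue.receive(now=1.0)      # consumer A takes the letter
hidden = queue.receive(now=10.0)    # consumer B gets nothing: still invisible
second = queue.receive(now=31.5)    # A never confirmed, so the letter is back
deleted = queue.delete(second[1])   # B confirms; the letter is gone
empty = queue.receive(now=100.0)    # nothing left to deliver
```

The same letter is handed out twice here, which is the at-least-once behaviour described above: a consumer that is merely slow, not dead, would have processed it as well.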

Two flavours:

Standard SQS. Best-effort ordering, at-least-once delivery, very high throughput (effectively unbounded). The default.
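FIFO SQS. Strict ordering within a message group, exactly-once processing (via deduplication), bounded throughput (by default 300 API calls per second per action, which works out to 3,000 messages per second with batches of 10; high-throughput mode raises this).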

Most workloads want Standard. FIFO is for cases where order matters within a customer or session (e.g., a customer's events must be processed in order).
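As a rough sketch of what "order within a customer" looks like in code, the helper below builds the arguments for a FIFO send. `MessageGroupId` and `MessageDeduplicationId` are real `send_message` parameters; the helper name `fifo_send_params`, the queue URL, the `customer-<id>` group naming, and the event shape are assumptions made for illustration.

```python
import hashlib
import json


def fifo_send_params(queue_url, customer_id, event):
    """Build keyword arguments for a boto3 SQS client's send_message call."""
    body = json.dumps(event, sort_keys=True)
    return {
        "QueueUrl": queue_url,
        "MessageBody": body,
        # Messages sharing a group id are ordered relative to each other.
        "MessageGroupId": f"customer-{customer_id}",
        # A repeated id within the 5-minute deduplication interval is
        # treated as a duplicate and not delivered again.
        "MessageDeduplicationId": hashlib.sha256(body.encode("utf-8")).hexdigest(),
    }


params = fifo_send_params(
    "https://sqs.us-east-1.amazonaws.com/123456789012/orders.fifo",  # placeholder URL
    customer_id=42,
    event={"type": "order_placed", "order_id": "A-1"},
)
# With credentials configured: boto3.client("sqs").send_message(**params)
```

Hashing the body is only one way to derive a deduplication id; a business key such as an order id may be a better fit when two distinct events can have identical bodies.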

The recurring vocabulary

Name	What it is
queue	a named SQS endpoint; you produce and consume against it
message	the payload, up to 256 KB; larger payloads are usually stored in S3 with the message carrying a pointer (the Extended Client Library pattern)
visibility timeout	seconds during which a received message is invisible to other consumers
receipt handle	the token returned by each receive, used to delete the message or change its visibility timeout
dead-letter queue (DLQ)	the queue that catches messages that fail too many times
redrive policy	the config tying a main queue to a DLQ
long polling	a receive call waits up to 20 seconds for a message instead of returning immediately when the queue is empty
batch operations	send/receive/delete up to 10 messages per API call
delay queue	per-queue delay before messages become visible after send
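Several of these terms map directly onto queue attributes. The sketch below builds the `Attributes` map that boto3's `create_queue` or `set_queue_attributes` accepts; the attribute names are the real SQS ones, while the helper name `queue_attributes`, the default values, and the DLQ ARN are placeholders.

```python
import json


def queue_attributes(dlq_arn, max_receive_count=5, visibility_timeout=60,
                     wait_time=20, delay=0):
    """Build an Attributes map; SQS expects every value as a string."""
    return {
        "VisibilityTimeout": str(visibility_timeout),      # seconds a received message stays hidden
        "ReceiveMessageWaitTimeSeconds": str(wait_time),   # long polling
        "DelaySeconds": str(delay),                        # delay queue
        # The redrive policy is itself a JSON document, passed as a string.
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": max_receive_count,
        }),
    }


attrs = queue_attributes("arn:aws:sqs:us-east-1:123456789012:orders-dlq")  # placeholder ARN
# With credentials configured:
# boto3.client("sqs").create_queue(QueueName="orders", Attributes=attrs)
```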

The picture

        producers (1+)
              │
              ▼ SendMessage
       ┌────────────┐
       │ SQS queue  │   AWS-managed, multi-AZ, durable
       └─────┬──────┘
             │ ReceiveMessage (long-poll)
             ▼
        consumers (1+)
             │ work the message
             │
             ├── on success → DeleteMessage (gone forever)
             │
             └── on failure → don't delete
                              ↓ (after visibility timeout)
                              ↓ message reappears
                              ↓
                       another consumer picks up
                              ↓
                       eventually: redrive to DLQ
                       (after maxReceiveCount tries)

The lifecycle is: send, receive (becomes invisible), succeed-and-delete (gone) or fail (reappears after timeout). The DLQ is where messages go when they've been received maxReceiveCount times without being deleted.
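A minimal sketch of one pass through that lifecycle, assuming `sqs` is a boto3 SQS client (or anything with the same `receive_message` and `delete_message` methods). The function name `drain_once` and the `handler` callback are hypothetical.

```python
def drain_once(sqs, queue_url, handler, wait_seconds=20):
    """Long-poll once and delete what the handler completes.

    Returns the number of messages deleted.
    """
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,        # batch receive, up to 10 per call
        WaitTimeSeconds=wait_seconds,  # long polling
    )
    deleted = 0
    # The "Messages" key is absent when nothing arrived within the wait.
    for message in response.get("Messages", []):
        try:
            handler(message["Body"])
        except Exception:
            # Do not delete: the message reappears after the visibility
            # timeout and is eventually redriven to the DLQ.
            continue
        sqs.delete_message(QueueUrl=queue_url,
                           ReceiptHandle=message["ReceiptHandle"])
        deleted += 1
    return deleted
```

A worker would typically call this in a loop, for example `while True: drain_once(boto3.client("sqs"), queue_url, process)`. Catching every exception is a deliberate simplification here; real consumers usually log the failure and may treat some errors as fatal.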

Two facts that surprise new SQS users

There is no peek and no "see all messages." You can't browse the queue. You receive a message, work on it, delete it. While it is invisible, other consumers don't see what you received. There is no API that lists the messages in a queue (ListQueues lists queues, not their contents).
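What SQS does expose is approximate counts. A small sketch, assuming `sqs` is a boto3 SQS client; the helper name `approximate_depth` is hypothetical, while `get_queue_attributes` and the two attribute names are part of the real API.

```python
def approximate_depth(sqs, queue_url):
    """Return (visible, in_flight) approximate message counts for a queue."""
    response = sqs.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=[
            "ApproximateNumberOfMessages",            # waiting to be received
            "ApproximateNumberOfMessagesNotVisible",  # received, not yet deleted
        ],
    )
    attrs = response["Attributes"]  # values come back as strings
    return (int(attrs["ApproximateNumberOfMessages"]),
            int(attrs["ApproximateNumberOfMessagesNotVisible"]))
```

The numbers are estimates, as the attribute names say, so they suit dashboards and scaling signals better than exact accounting.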

The visibility timeout is the single most-tuned parameter. Too short: your task is slower than the timeout, the message gets re-delivered while you're still working on it, you do the work twice. Too long: a stuck consumer holds the message for an hour before another consumer can retry. The right value is a few times your p99 task duration.
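The rule of thumb can be turned into a small calculation. This is a sketch under assumptions: the helper name `suggest_visibility_timeout`, the factor of 3, the nearest-rank percentile, and the sample durations are all illustrative; the 43,200-second cap reflects the 12-hour maximum SQS allows for a visibility timeout.

```python
import math


def suggest_visibility_timeout(durations_s, factor=3, max_timeout=43200):
    """Suggest a timeout in whole seconds: factor x nearest-rank p99, capped."""
    if not durations_s:
        raise ValueError("need at least one observed duration")
    ordered = sorted(durations_s)
    rank = (99 * len(ordered) + 99) // 100   # ceil(0.99 * n) in integer arithmetic
    p99 = ordered[rank - 1]
    return min(max_timeout, max(1, math.ceil(factor * p99)))


# Hypothetical task durations in seconds: mostly fast, with a slow tail.
samples = [0.8] * 95 + [2.0, 2.5, 3.0, 4.0, 9.0]
timeout = suggest_visibility_timeout(samples)   # p99 is 4.0 s, so 12 s
```

For tasks with a long tail, a consumer can also extend the timeout for a single in-flight message with `change_message_visibility` rather than raising the value for the whole queue.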

What this module covers

01-visibility-timeout-and-fifo-internals.md — How visibility timeouts actually work, FIFO ordering and deduplication, the failure modes that surface only in production.
02-sdk-and-poll-loops-day-to-day.md — Sending and receiving with boto3, batch operations, long polling, the patterns developers write daily.
03-deadletter-throughput-prod-gotchas.md — DLQ configuration, throughput scaling, message-size limits, KMS, cost surprises, the production catalogue.

Bridge. Before writing SQS code, we see why visibility timeout is the parameter that makes or breaks the system. → 01-visibility-timeout-and-fifo-internals.md