7.2 Advanced SQS Patterns — DLQ, Backpressure, and FIFO
Key Takeaways
- A Dead-Letter Queue (DLQ) captures messages that fail after maxReceiveCount delivery attempts, isolating poison-pill messages so one bad record cannot block the whole queue.
- Standard queues give at-least-once delivery and best-effort ordering at nearly unlimited throughput; FIFO queues give exactly-once processing and strict ordering at 300 (or 3,000 batched) messages per second.
- FIFO ordering is scoped to the Message Group ID: messages sharing a group are strictly serial, while different groups run in parallel, so the group ID is your parallelism key.
- Long polling (WaitTimeSeconds up to 20 seconds) sharply reduces empty ReceiveMessage responses, cutting API request charges versus short polling.
- Backpressure uses queue depth (ApproximateNumberOfMessagesVisible) to drive Auto Scaling of consumers, letting SQS absorb bursts so downstream systems are never overwhelmed.
Standard vs. FIFO: The First Decision
Every Amazon SQS question starts here. Standard queues deliver at-least-once (occasional duplicates) with best-effort ordering and effectively unlimited throughput. FIFO queues guarantee exactly-once processing and strict first-in-first-out ordering, but are capped at 300 API calls per second per action (send, receive, delete), which is 300 messages/second unbatched or 3,000 messages/second with batches of 10 messages per call. High-throughput mode raises these limits further, with quotas that vary by Region.
| Feature | Standard | FIFO |
|---|---|---|
| Ordering | Best-effort | Strict, per Message Group ID |
| Delivery | At-least-once (duplicates possible) | Exactly-once |
| Throughput | Nearly unlimited | 300/s, or 3,000/s batched |
| Name suffix | none | must end in .fifo |
Exam trap: "financial transactions, no duplicates, in order per account" is FIFO. "Decouple a high-volume image pipeline, occasional reprocessing is fine" is Standard. Do not choose FIFO just because ordering sounds nice; its throughput cap can disqualify it.
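The throughput cap reduces to simple arithmetic, which is worth checking before FIFO is chosen. A minimal sketch, assuming the default (non-high-throughput) limits quoted in this section; the function names are illustrative and not part of any AWS SDK:

```python
FIFO_CALLS_PER_SECOND = 300   # per API action, without high-throughput mode
MAX_BATCH_SIZE = 10           # messages per batched call


def fifo_max_messages_per_second(batch_size: int = 1) -> int:
    """Peak FIFO message rate for a given batch size (1 to 10)."""
    if not 1 <= batch_size <= MAX_BATCH_SIZE:
        raise ValueError("batch_size must be between 1 and 10")
    return FIFO_CALLS_PER_SECOND * batch_size


def fifo_can_sustain(required_rate: int, batch_size: int = MAX_BATCH_SIZE) -> bool:
    """True if a default FIFO queue can carry the required message rate."""
    return required_rate <= fifo_max_messages_per_second(batch_size)


print(fifo_max_messages_per_second(1))    # 300
print(fifo_max_messages_per_second(10))   # 3000
print(fifo_can_sustain(5000))             # False
```

A workload needing 5,000 messages/second fails the check even with full batching, which is the situation where the throughput cap disqualifies a default FIFO queue.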
Dead-Letter Queues and Redrive
A Dead-Letter Queue is an ordinary SQS queue you attach to a source queue via a redrive policy. The flow:
- A message is received; the consumer fails to process it.
- After the visibility timeout expires, the message becomes visible again.
- Once the receive count hits maxReceiveCount (for example, 3), SQS moves the message to the DLQ instead of redelivering forever.
- You alarm on DLQ depth, fix the bug, then use redrive to move messages back to the source queue.
| Setting | Guidance |
|---|---|
| maxReceiveCount | Typically 3–5; too low loses transient-failure retries |
| DLQ retention | Set longer than the source (up to 14 days) so you have time to investigate |
| Queue-type match | Standard source needs a Standard DLQ; FIFO source needs a FIFO DLQ |
| Alarm | CloudWatch alarm on ApproximateNumberOfMessagesVisible of the DLQ |
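The redrive policy itself is a small JSON document stored in the source queue's RedrivePolicy attribute. The sketch below builds that attribute and applies it through a client object; it assumes `sqs_client` is a `boto3.client("sqs")` instance, and the helper names are hypothetical.

```python
import json


def build_redrive_attributes(dlq_arn: str, max_receive_count: int = 3) -> dict:
    """Return the Attributes dict that attaches a DLQ to a source queue."""
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": str(max_receive_count),
    }
    # SQS expects the policy as a JSON string, not a nested dict.
    return {"RedrivePolicy": json.dumps(policy)}


def attach_dlq(sqs_client, source_queue_url: str, dlq_arn: str,
               max_receive_count: int = 3) -> None:
    """Apply the redrive policy to the source queue.

    sqs_client is assumed to be boto3.client("sqs").
    """
    sqs_client.set_queue_attributes(
        QueueUrl=source_queue_url,
        Attributes=build_redrive_attributes(dlq_arn, max_receive_count),
    )
```

The policy goes on the source queue and points at the DLQ's ARN, not the other way round.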
Worked example: A consumer crashes on a malformed record. Without a DLQ, that one message (a "poison pill") is redelivered until its retention period expires, wasting consumer capacity on every retry and, on a FIFO queue, blocking the healthy messages behind it in the same message group. With maxReceiveCount of 3 and a DLQ, the bad record is sidelined after three tries and processing continues.
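The worked example can be simulated in memory. This is a simplified model, not the SQS service: it treats a failed message as immediately visible again and dead-letters it once its receive count reaches the limit. The handler and message bodies are made up for illustration.

```python
from collections import deque


def drain(messages, handler, max_receive_count=3):
    """Process messages and return (processed, dead_lettered).

    A message whose handler raises ValueError is retried, and moves to
    the dead-letter list once it has been received max_receive_count times.
    """
    queue = deque((body, 0) for body in messages)  # (body, receives so far)
    processed, dead_lettered = [], []
    while queue:
        body, receives = queue.popleft()
        receives += 1
        try:
            handler(body)
            processed.append(body)
        except ValueError:
            if receives >= max_receive_count:
                dead_lettered.append(body)        # sidelined for inspection
            else:
                queue.append((body, receives))    # becomes visible again
    return processed, dead_lettered


def parse_amount(body):
    return int(body)  # raises ValueError on a malformed record


done, dlq = drain(["10", "oops", "25"], parse_amount)
print(done)  # ['10', '25']
print(dlq)   # ['oops']
```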
FIFO Internals: Group ID, Deduplication, Visibility
Message Group ID is the parallelism control. All messages sharing a group are processed strictly in order; messages in different groups process concurrently. Using a customer ID or device ID as the group ID gives per-entity ordering with cross-entity parallelism.
| Goal | Group ID strategy | Result |
|---|---|---|
| Total global order | One shared group ID | Fully serial (slow) |
| Order per customer | Customer ID | Serial per customer, parallel across customers |
| Order per device | Device ID | Serial per device, parallel across devices |
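To make the table concrete, the sketch below builds SendMessage parameters that use a customer ID as the group ID, then shows how a stream of messages splits into independently ordered lanes. The parameter names match the SQS SendMessage API; the helper functions and sample data are hypothetical.

```python
from collections import defaultdict


def fifo_send_params(queue_url, body, customer_id, dedup_id):
    """Keyword arguments for send_message on a FIFO queue."""
    return {
        "QueueUrl": queue_url,
        "MessageBody": body,
        "MessageGroupId": customer_id,        # ordering and parallelism key
        "MessageDeduplicationId": dedup_id,   # idempotency key
    }


def lanes_by_group(messages):
    """Split (group_id, body) pairs into per-group lists, keeping send order."""
    lanes = defaultdict(list)
    for group_id, body in messages:
        lanes[group_id].append(body)
    return dict(lanes)


stream = [("cust-1", "debit 5"), ("cust-2", "credit 9"), ("cust-1", "credit 2")]
print(lanes_by_group(stream))
# {'cust-1': ['debit 5', 'credit 2'], 'cust-2': ['credit 9']}
```

Each lane must be consumed serially, but separate lanes can be handled by separate consumers at the same time.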
Deduplication prevents duplicate sends within a 5-minute window. Use content-based dedup (SQS hashes the body) when identical bodies mean duplicates, or supply an explicit MessageDeduplicationId when you control idempotency keys.
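The deduplication window can be modelled in a few lines. This is an in-memory sketch of the behaviour described above, not the service's implementation: with no explicit ID it keys on a SHA-256 hash of the body, as content-based deduplication does, and the class name is illustrative.

```python
import hashlib

DEDUP_WINDOW_SECONDS = 300  # the 5-minute deduplication interval


class DedupWindow:
    """Tracks deduplication IDs and the time each was first accepted."""

    def __init__(self):
        self._first_seen = {}

    def accept(self, body, now, dedup_id=None):
        """Return True if the message would be delivered, False if dropped.

        now is a timestamp in seconds supplied by the caller.
        """
        key = dedup_id or hashlib.sha256(body.encode("utf-8")).hexdigest()
        first = self._first_seen.get(key)
        if first is not None and now - first < DEDUP_WINDOW_SECONDS:
            return False            # duplicate inside the window
        self._first_seen[key] = now
        return True


window = DedupWindow()
print(window.accept("pay 10", now=0))     # True
print(window.accept("pay 10", now=120))   # False
print(window.accept("pay 10", now=400))   # True
```

Two genuinely distinct payments with identical bodies would be collapsed by content-based deduplication, which is the case where an explicit MessageDeduplicationId is the safer choice.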
Visibility timeout is the silent culprit behind duplicate processing on Standard queues: if processing takes longer than the timeout, the message reappears and a second consumer grabs it. Tune the timeout above your worst-case processing time, or call ChangeMessageVisibility to extend it mid-flight.
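Extending the timeout mid-flight might look like the sketch below. It assumes `sqs_client` is a `boto3.client("sqs")` instance; `should_extend` and its safety margin are hypothetical helpers, not SDK features.

```python
MAX_VISIBILITY_SECONDS = 12 * 60 * 60  # 12-hour upper bound


def should_extend(elapsed_seconds, visibility_timeout, safety_margin=10):
    """True once processing is within safety_margin seconds of the timeout."""
    return elapsed_seconds >= visibility_timeout - safety_margin


def extend_visibility(sqs_client, queue_url, receipt_handle, new_timeout):
    """Reset the message's visibility timeout to new_timeout seconds.

    sqs_client is assumed to be boto3.client("sqs").
    """
    if not 0 <= new_timeout <= MAX_VISIBILITY_SECONDS:
        raise ValueError("new_timeout must be between 0 and 43200 seconds")
    sqs_client.change_message_visibility(
        QueueUrl=queue_url,
        ReceiptHandle=receipt_handle,
        VisibilityTimeout=new_timeout,
    )
```

A long-running worker would call `should_extend` periodically and, when it returns True, call `extend_visibility` with the receipt handle from the original receive.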
Long Polling and Backpressure
| Polling type | Behavior | Cost |
|---|---|---|
| Short polling | Returns immediately, often empty | Higher API charges |
| Long polling (WaitTimeSeconds 1–20) | Waits up to 20 s for a message | Lower, fewer empty calls |
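In code, the difference is a single parameter on the receive call. A minimal sketch, assuming `sqs_client` is a `boto3.client("sqs")` instance; the helper names and the one-call-per-second short-polling rate are assumptions chosen for illustration.

```python
def long_poll_once(sqs_client, queue_url, wait_seconds=20, max_messages=10):
    """Receive up to max_messages, waiting up to wait_seconds for the first.

    sqs_client is assumed to be boto3.client("sqs").
    """
    response = sqs_client.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=max_messages,
        WaitTimeSeconds=wait_seconds,
    )
    # The "Messages" key is absent when nothing arrived during the wait.
    return response.get("Messages", [])


def idle_receive_calls_per_hour(seconds_per_call):
    """Receive calls made per hour by one consumer on an empty queue."""
    return 3600 // seconds_per_call


print(idle_receive_calls_per_hour(1))    # 3600 (short poll, one call per second)
print(idle_receive_calls_per_hour(20))   # 180  (long poll, 20 s wait)
```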
Backpressure decouples producer rate from consumer rate: producers push freely, SQS buffers the surge, and an Auto Scaling policy on the ApproximateNumberOfMessagesVisible metric adds consumers as the backlog grows and removes them as it drains. Cap the maximum consumer count at what the downstream database can absorb, so that excess load waits in the queue instead of flooding the database.
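The scaling decision is backlog divided by per-consumer capacity, clamped to a fleet range. The sketch below shows that calculation only; it is not an Auto Scaling API call, and the function name, the per-consumer figure, and the bounds are assumptions for illustration.

```python
import math


def desired_consumers(visible_messages, messages_per_consumer,
                      min_consumers=1, max_consumers=20):
    """Consumer count needed for the backlog, clamped to the fleet limits.

    visible_messages would come from ApproximateNumberOfMessagesVisible;
    max_consumers should reflect what the downstream system can absorb.
    """
    if messages_per_consumer <= 0:
        raise ValueError("messages_per_consumer must be positive")
    needed = math.ceil(visible_messages / messages_per_consumer)
    return max(min_consumers, min(max_consumers, needed))


print(desired_consumers(0, 100))        # 1
print(desired_consumers(950, 100))      # 10
print(desired_consumers(10_000, 100))   # 20
```

In the last case the backlog calls for 100 consumers, but the cap holds the fleet at 20 and the remaining messages simply wait in the queue.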
Delivery Delays, Retention, and Choosing SNS, SQS, or EventBridge
Four timing knobs frequently appear in answer choices, and confusing them is a classic distractor.
| Setting | Range | What it does |
|---|---|---|
| Visibility timeout | 0 s to 12 hours (default 30 s) | Hides a received message from other consumers while it is processed |
| Message retention | 1 minute to 14 days (default 4 days) | How long an unprocessed message survives in the queue |
| Delivery delay | 0 to 15 minutes | Postpones first delivery of a new message |
| WaitTimeSeconds | 0 to 20 seconds | Long-poll wait per receive call |
A delay queue (a queue-level delivery delay) is the right answer for "hold every message 5 minutes before processing," whereas delaying an individual message uses a message timer (the DelaySeconds parameter on SendMessage), which Standard queues support but FIFO queues do not. Do not confuse delay with visibility timeout: delay applies before first delivery; visibility timeout applies after a message has been received.
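The two delay mechanisms differ only in where DelaySeconds is set. A minimal sketch of the request shapes, assuming boto3-style calls where queue attributes are strings and the per-message value is an integer; the helper names are hypothetical.

```python
MAX_DELAY_SECONDS = 15 * 60  # 15-minute upper bound


def _check_delay(delay_seconds):
    if not 0 <= delay_seconds <= MAX_DELAY_SECONDS:
        raise ValueError("delay must be between 0 and 900 seconds")


def delay_queue_attributes(delay_seconds):
    """Queue-level delay: applies to every message sent to the queue."""
    _check_delay(delay_seconds)
    return {"DelaySeconds": str(delay_seconds)}


def delayed_send_params(queue_url, body, delay_seconds):
    """Per-message timer: applies to this one message only."""
    _check_delay(delay_seconds)
    return {"QueueUrl": queue_url, "MessageBody": body,
            "DelaySeconds": delay_seconds}


print(delay_queue_attributes(300))   # {'DelaySeconds': '300'}
```

The first dict would be passed as `Attributes` to `create_queue` or `set_queue_attributes`; the second would be unpacked into `send_message`.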
Service-selection trap: SQS, SNS, and EventBridge all "decouple," so read the verb. "Buffer and process at the consumer's pace, possibly with many retries" is SQS. "Notify multiple subscribers instantly" is SNS. "Route different event types to different targets based on content rules, including SaaS sources" is EventBridge. For a message larger than the 256 KB SQS maximum, use the SQS Extended Client, which stores the payload in S3 and passes a pointer, rather than splitting the message.
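The pointer pattern behind the Extended Client can be illustrated without AWS at all. The sketch below is not the library's API or wire format: a plain dict stands in for the S3 bucket, the envelope layout is invented, and the size limit is the 256 KB figure quoted above.

```python
import json
import uuid

SQS_MAX_BYTES = 256 * 1024  # limit as quoted in this section


def encode_message(payload, store):
    """Return a queue-sized body, offloading oversized payloads to store."""
    envelope = json.dumps({"inline": payload})
    if len(envelope.encode("utf-8")) <= SQS_MAX_BYTES:
        return envelope
    key = f"payloads/{uuid.uuid4()}"
    store[key] = payload                     # stand-in for an S3 upload
    return json.dumps({"pointer": key})      # small pointer travels via SQS


def decode_message(body, store):
    """Recover the original payload on the consumer side."""
    envelope = json.loads(body)
    if "pointer" in envelope:
        return store[envelope["pointer"]]    # stand-in for an S3 download
    return envelope["inline"]


bucket = {}
big = "x" * 300_000
body = encode_message(big, bucket)
print(len(body) < SQS_MAX_BYTES)             # True
print(decode_message(body, bucket) == big)   # True
```

The consumer receives one intact message, which is why this approach is preferred over splitting a payload into fragments that must be reassembled.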
Worked example: An order pipeline must guarantee no order is lost even if the processor is down for an hour. The default 4-day retention already covers a one-hour outage; raising retention to the 14-day maximum adds margin for longer incidents. Let Auto Scaling restart consumers; SQS durably retains the backlog and processing resumes with zero data loss once capacity returns.
A payments system must process transactions exactly once and in submission order per account, while still allowing different accounts to be processed in parallel. How should SQS be configured?
Messages that hit a code bug are being redelivered indefinitely, blocking healthy messages behind them. What is the correct remedy?
A team sees their SQS bill dominated by ReceiveMessage requests, most of which return no messages. Which change reduces cost without losing messages?
Producers send bursts far faster than a downstream database can absorb, occasionally overwhelming it. Which SQS-based design protects the database?