Message delivery failures in Message Integration can cause data loss or processing delays. Retry policies automatically retry failed deliveries. Dead-letter queues (DLQs) capture undeliverable messages for inspection and reprocessing.
How it works
When a message delivery fails, Message Integration applies the configured retry policy. If all retries are exhausted, the fault tolerance policy determines what happens next:
1. Message delivery fails.
2. The configured retry policy applies (backoff or exponential decay) and the delivery is retried.
3. When all retries are exhausted, the fault tolerance policy determines the outcome:
   - Fault tolerance allowed: the message is sent to the dead-letter queue if one is configured; otherwise it is discarded.
   - Fault tolerance prohibited: the task status changes to Ready and processing stops.

If the system cannot attempt retries at all -- for example, due to an invalid resource configuration -- the task status changes to Start Failed and the normal retry flow does not apply.
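The delivery flow above can be sketched in code. This is a minimal illustration only; `handle_delivery` and its parameters are invented names, not part of the Message Integration API:

```python
# Illustrative sketch of the delivery flow: attempt delivery, retry,
# then apply the fault tolerance policy. All names here are hypothetical.
def handle_delivery(deliver, max_retries, fault_tolerance_allowed, dlq=None):
    """Attempt delivery with retries, then apply the fault tolerance policy."""
    for _ in range(1 + max_retries):  # initial attempt plus retries
        try:
            return deliver()
        except Exception:
            continue  # delivery failed; fall through to the next attempt
    # All retries exhausted: the fault tolerance policy decides.
    if fault_tolerance_allowed:
        if dlq is not None:
            dlq.append("raw message data")  # raw data goes to the DLQ
            return "sent_to_dlq"
        return "discarded"                  # no DLQ configured
    return "task_ready_processing_stops"    # prohibited: halt the task
```

The sketch mirrors the branches above: success returns immediately, and only the exhausted-retries path consults the fault tolerance policy.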
Retry policies
A retry policy determines how Message Integration retries failed deliveries. Two policies are available:
| Policy | Behavior | Max retries | Retry intervals | Total retry window |
|---|---|---|---|---|
| Backoff retry (default) | Fixed random interval | 3 | Random, 10-20 seconds each | 30-60 seconds |
| Exponential decay retry | Increasing interval | 176 | Starts at 1 second, doubles up to 512 seconds | 1 day |
Backoff retry
Backoff retry is the default policy. The system retries a failed message up to 3 times, with a random interval of 10 to 20 seconds between each attempt.
Use backoff retry when you need fast failure detection and want failed messages to reach the DLQ or trigger fault tolerance quickly.
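As an illustration of the schedule, the backoff intervals could be generated like this. The function name and `rng` parameter are invented for the sketch and are not a product API:

```python
import random

# Sketch of the backoff retry schedule: up to 3 retries, each after a
# random 10-20 second wait, so the total window is 30-60 seconds.
def backoff_intervals(max_retries=3, low=10.0, high=20.0, rng=None):
    rng = rng or random.Random()
    return [rng.uniform(low, high) for _ in range(max_retries)]
```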
Exponential decay retry
Exponential decay retry increases the wait time between attempts. The interval starts at 1 second and doubles with each retry, up to a maximum of 512 seconds. The full interval sequence is:
1s, 2s, 4s, 8s, 16s, 32s, 64s, 128s, 256s, 512s
After reaching 512 seconds, the system continues retrying at 512-second intervals for the remaining 166 attempts, for a total of 176 retries over approximately one day.
Use exponential decay retry when the downstream system may recover on its own. This policy maximizes the chance of successful delivery before routing to the DLQ.
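The schedule arithmetic can be checked with a short sketch (`exponential_decay_intervals` is an illustrative name, not a product API): doubling from 1 second to the 512-second cap and then repeating at the cap for 176 retries sums to roughly 86,000 seconds, about one day.

```python
# Sketch of the exponential decay schedule: intervals double from 1 s
# to a 512 s cap, then repeat at the cap until 176 retries in total.
def exponential_decay_intervals(total_retries=176, cap=512):
    intervals, wait = [], 1
    while len(intervals) < total_retries:
        intervals.append(wait)
        wait = min(wait * 2, cap)
    return intervals

schedule = exponential_decay_intervals()
# schedule[:10] is [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
# sum(schedule) is about 86,000 seconds, i.e. roughly one day
```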
Choose a retry policy
| Scenario | Recommended policy | Reason |
|---|---|---|
| Downstream failures are typically permanent (e.g., invalid message format) | Backoff retry | Fail fast and route to DLQ for manual inspection |
| Downstream failures are typically transient (e.g., temporary service unavailability) | Exponential decay retry | Give the downstream system time to recover |
| Message loss is unacceptable and downstream recovery is expected | Exponential decay retry + Fault tolerance prohibited | Maximum retry coverage with processing halt as final safeguard |
| High message throughput and DLQ-based reprocessing is in place | Backoff retry + Fault tolerance allowed | Minimize retry overhead and handle failures through the DLQ |
Fault tolerance policies
A fault tolerance policy determines what happens after all retries are exhausted. Two policies are available:
Fault tolerance allowed
Event processing continues even if a message fails after all retries. The failed message is either:
- Delivered to the dead-letter queue, if one is configured.
- Discarded, if no dead-letter queue is configured.
Use this mode when occasional message loss is acceptable, or when you have a DLQ-based reprocessing workflow in place.
Fault tolerance prohibited
Event processing stops if a message fails after all retries. The task status changes to Ready and no further messages are processed until you resolve the issue.
Use this mode when every message must be delivered and message loss is unacceptable.
Dead-letter queues
A dead-letter queue (DLQ) captures messages that fail after all retries are exhausted. The system sends the raw message data to the DLQ, where you can inspect or reprocess it. The DLQ feature is disabled by default. Enable it at the task level to start capturing failed messages.
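As a rough illustration of how these options fit together -- the key names below are invented for the sketch and are not the product's actual configuration schema:

```python
# Hypothetical task-level settings illustrating the concepts in this
# document. Key names are invented; consult the product console for
# the real configuration options.
task_config = {
    "retry_policy": "exponential_decay",  # or "backoff" (the default)
    "fault_tolerance": "allowed",         # or "prohibited"
    "dead_letter_queue": {
        "enabled": True,                  # DLQ capture is disabled by default
        "target_service": "ApsaraMQ for RocketMQ",
    },
}
```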
Supported DLQ targets
| Service | Target type |
|---|---|
| ApsaraMQ for RocketMQ | Queue |
| Simple Message Queue (formerly MNS) | Queue |
| ApsaraMQ for Kafka | Topic |
| EventBridge | Event bus |
Handle dead-lettered messages
After a message reaches the DLQ, take the following steps:
1. Identify the failure cause. Check the task logs and the raw message data in the DLQ. Determine whether the failure was caused by a downstream service error, invalid message format, or permission issue.
2. Resolve the root cause. Fix the underlying issue, such as restoring the downstream service, correcting the message format, or updating permissions.
3. Reprocess the message. Consume the message from the DLQ and resend it to the original target, or process it through an alternative path.
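The reprocessing step might be sketched as follows; `consume_dlq`, `resend`, and `can_resend` are hypothetical stand-ins for the target service's SDK calls, not real APIs:

```python
# Hypothetical DLQ reprocessing loop: drain dead-lettered messages,
# resending those whose root cause has been fixed and keeping the rest
# for manual inspection. All callables are assumed stand-ins.
def reprocess_dlq(consume_dlq, resend, can_resend):
    resent, kept = [], []
    for message in consume_dlq():
        if can_resend(message):   # root cause resolved for this message?
            resend(message)       # send back to the original target
            resent.append(message)
        else:
            kept.append(message)  # leave for manual inspection
    return resent, kept
```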
Set up monitoring on the DLQ to detect failed messages promptly. For example, configure alerts based on the message count in the DLQ target service to avoid unnoticed message accumulation.