Message delivery failures in Message Integration can cause data loss or processing delays. Retry policies automatically retry failed deliveries. Dead-letter queues (DLQs) capture undeliverable messages for inspection and reprocessing.
How it works
When a message delivery fails, Message Integration applies the configured retry policy. If all retries are exhausted, the fault tolerance policy determines what happens next:
1. Message delivery fails.
2. The configured retry policy applies (backoff or exponential decay) and the delivery is retried.
3. When all retries are exhausted, the fault tolerance policy determines the outcome:
   - Fault tolerance allowed: the message is sent to the dead-letter queue if one is configured; otherwise it is discarded.
   - Fault tolerance prohibited: the task status changes to Ready and processing stops.

If the system cannot attempt retries at all -- for example, due to an invalid resource configuration -- the task status changes to Start Failed and the normal retry flow does not apply.
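The delivery flow above can be sketched in code. This is a minimal illustration only; `handle_delivery` and its parameters are invented names, not part of the Message Integration API:

```python
# Illustrative sketch of the delivery flow: attempt delivery, retry,
# then apply the fault tolerance policy. All names here are hypothetical.
def handle_delivery(deliver, max_retries, fault_tolerance_allowed, dlq=None):
    """Attempt delivery with retries, then apply the fault tolerance policy."""
    for _ in range(1 + max_retries):  # initial attempt plus retries
        try:
            return deliver()
        except Exception:
            continue  # delivery failed; fall through to the next attempt
    # All retries exhausted: the fault tolerance policy decides.
    if fault_tolerance_allowed:
        if dlq is not None:
            dlq.append("raw message data")  # raw data goes to the DLQ
            return "sent_to_dlq"
        return "discarded"                  # no DLQ configured
    return "task_ready_processing_stops"    # prohibited: halt the task
```

The sketch mirrors the branches above: success returns immediately, and only the exhausted-retries path consults the fault tolerance policy.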
Retry policies
A retry policy determines how Message Integration retries failed deliveries. Two policies are available:
| Policy | Behavior | Max retries | Retry intervals | Total retry window |
|---|---|---|---|---|
| Backoff retry (default) | Fixed random interval | 3 | Random, 10-20 seconds each | 30-60 seconds |
| Exponential decay retry | Increasing interval | 176 | Starts at 1 second, doubles up to 512 seconds | 1 day |
Backoff retry
Backoff retry is the default policy. The system retries a failed message up to 3 times, with a random interval of 10 to 20 seconds between each attempt.
Use backoff retry when you need fast failure detection and want failed messages to reach the DLQ or trigger fault tolerance quickly.
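As an illustration of the schedule, the backoff intervals could be generated like this. The function name and `rng` parameter are invented for the sketch and are not a product API:

```python
import random

# Sketch of the backoff retry schedule: up to 3 retries, each after a
# random 10-20 second wait, so the total window is 30-60 seconds.
def backoff_intervals(max_retries=3, low=10.0, high=20.0, rng=None):
    rng = rng or random.Random()
    return [rng.uniform(low, high) for _ in range(max_retries)]
```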
Exponential decay retry
Exponential decay retry increases the wait time between attempts. The interval starts at 1 second and doubles with each retry, up to a maximum of 512 seconds. The full interval sequence is:
1s, 2s, 4s, 8s, 16s, 32s, 64s, 128s, 256s, 512s
After reaching 512 seconds, the system continues retrying at 512-second intervals for the remaining 166 attempts, for a total of 176 retries over approximately one day.
Use exponential decay retry when the downstream system may recover on its own. This policy maximizes the chance of successful delivery before routing to the DLQ.
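The schedule arithmetic can be checked with a short sketch (`exponential_decay_intervals` is an illustrative name, not a product API): doubling from 1 second to the 512-second cap and then repeating at the cap for 176 retries sums to roughly 86,000 seconds, about one day.

```python
# Sketch of the exponential decay schedule: intervals double from 1 s
# to a 512 s cap, then repeat at the cap until 176 retries in total.
def exponential_decay_intervals(total_retries=176, cap=512):
    intervals, wait = [], 1
    while len(intervals) < total_retries:
        intervals.append(wait)
        wait = min(wait * 2, cap)
    return intervals

schedule = exponential_decay_intervals()
# schedule[:10] is [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
# sum(schedule) is about 86,000 seconds, i.e. roughly one day
```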
Choose a retry policy
| Scenario | Recommended policy | Reason |
|---|---|---|
| Downstream failures are typically permanent (e.g., invalid message format) | Backoff retry | Fail fast and route to DLQ for manual inspection |
| Downstream failures are typically transient (e.g., temporary service unavailability) | Exponential decay retry | Give the downstream system time to recover |
| Message loss is unacceptable and downstream recovery is expected | Exponential decay retry + Fault tolerance prohibited | Maximum retry coverage with processing halt as final safeguard |
| High message throughput and DLQ-based reprocessing is in place | Backoff retry + Fault tolerance allowed | Minimize retry overhead and handle failures through the DLQ |
Fault tolerance policies
A fault tolerance policy determines what happens after all retries are exhausted. Two policies are available:
Fault tolerance allowed
Event processing continues even if a message fails after all retries. The failed message is either:
- Delivered to the dead-letter queue, if one is configured.
- Discarded, if no dead-letter queue is configured.
Use this mode when occasional message loss is acceptable, or when you have a DLQ-based reprocessing workflow in place.
Fault tolerance prohibited
Event processing stops if a message fails after all retries. The task status changes to Ready and no further messages are processed until you resolve the issue.
Use this mode when every message must be delivered and message loss is unacceptable.
Dead-letter queues
A dead-letter queue (DLQ) captures messages that fail after all retries are exhausted. The system sends the raw message data to the DLQ, where you can inspect or reprocess it. The DLQ feature is disabled by default. Enable it at the task level to start capturing failed messages.
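As a rough illustration of how these options fit together -- the key names below are invented for the sketch and are not the product's actual configuration schema:

```python
# Hypothetical task-level settings illustrating the concepts in this
# document. Key names are invented; consult the product console for
# the real configuration options.
task_config = {
    "retry_policy": "exponential_decay",  # or "backoff" (the default)
    "fault_tolerance": "allowed",         # or "prohibited"
    "dead_letter_queue": {
        "enabled": True,                  # DLQ capture is disabled by default
        "target_service": "ApsaraMQ for RocketMQ",
    },
}
```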
Supported DLQ targets
| Service | Target type |
|---|---|
| ApsaraMQ for RocketMQ | Queue |
| Simple Message Queue (formerly MNS) | Queue |
| ApsaraMQ for Kafka | Topic |
| EventBridge | Event bus |
Handle dead-lettered messages
After a message reaches the DLQ, take the following steps:
1. Identify the failure cause. Check the task logs and the raw message data in the DLQ. Determine whether the failure was caused by a downstream service error, invalid message format, or permission issue.
2. Resolve the root cause. Fix the underlying issue, such as restoring the downstream service, correcting the message format, or updating permissions.
3. Reprocess the message. Consume the message from the DLQ and resend it to the original target, or process it through an alternative path.
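The reprocessing step might be sketched as follows; `consume_dlq`, `resend`, and `can_resend` are hypothetical stand-ins for the target service's SDK calls, not real APIs:

```python
# Hypothetical DLQ reprocessing loop: drain dead-lettered messages,
# resending those whose root cause has been fixed and keeping the rest
# for manual inspection. All callables are assumed stand-ins.
def reprocess_dlq(consume_dlq, resend, can_resend):
    resent, kept = [], []
    for message in consume_dlq():
        if can_resend(message):   # root cause resolved for this message?
            resend(message)       # send back to the original target
            resent.append(message)
        else:
            kept.append(message)  # leave for manual inspection
    return resent, kept
```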
Set up monitoring on the DLQ to detect failed messages promptly. For example, configure alerts based on the message count in the DLQ target service to avoid unnoticed message accumulation.