Applications may encounter temporary faults caused by the network or the runtime environment, such as transient network jitter, temporarily unavailable services, and timeouts that occur when services are busy. You can configure an automatic retry mechanism on the client to handle such faults and improve the success rate of operations.
Causes for temporary faults
| Cause | Description |
| --- | --- |
| High availability mechanism triggered by a fault | Tair monitors the health status of nodes. If the master node of an instance fails, Tair automatically performs a master-replica switchover: the roles of the master and replica nodes are exchanged to keep the instance highly available. During the switchover, requests initiated by the client may temporarily fail. Note: For more information, see Master-replica switchovers. |
| Request blockage caused by slow queries | When commands with a time complexity of O(N) are run, the resulting slow queries can block subsequent requests. In this case, other requests initiated by the client may temporarily fail. |
| Complex network environments | A complex network environment between the client and the Tair server may cause issues such as occasional network jitter and data retransmission. In this case, requests initiated by the client may temporarily fail. |
Recommended retry rules
| Retry rule | Description |
| --- | --- |
| Retry only idempotent operations. | A timeout can occur at any phase of a request: the command may time out before it reaches Tair, while Tair executes it, or while the result is being returned to the client. Because the client cannot tell in which phase the timeout occurred, a retried command may be executed on Tair more than once. Therefore, not all operations are suitable for retries, and we recommend that you retry only idempotent operations. For example, the SET command is idempotent: no matter how many times you run SET a b, the value of a is always b. The LPUSH mylist a command is not idempotent: if it is retried, mylist may contain multiple a elements. |
| Configure appropriate retry counts and intervals. | Configure the retry count and retry interval based on your actual business requirements. If the retry count is too small or the interval is too long, the application may not complete the operation within an acceptable time. If the retry count is too large or the interval is too short, the application may consume excessive system resources, and the accumulated requests may put additional pressure on an already busy server. Common retry interval policies include immediate retry, fixed-interval retry, exponential backoff retry, and random backoff retry. For an exponential backoff example, see the sketch after this table. |
| Avoid retry nesting. | Nested retries may cause the number of retries to multiply or even lead to unlimited retries. |
| Record retry exceptions and generate failure reports. | During retries, we recommend that you generate retry logs at the WARN level, and generate them only when a retry fails. |
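The following Java sketch illustrates the exponential backoff with random jitter policy around an idempotent operation. The retryWithBackoff helper, its parameters, and the wrapped operation are illustrative only and are not part of any client library; the clients described below provide their own built-in retry settings.

import java.util.Random;
import java.util.concurrent.Callable;

public class BackoffRetryExample {
    private static final Random RANDOM = new Random();

    // Retries an idempotent operation with exponentially increasing, jittered wait times between attempts.
    static <T> T retryWithBackoff(Callable<T> operation, int maxAttempts, long initialDelayMillis) throws Exception {
        long delayMillis = initialDelayMillis;
        Exception lastException = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (Exception e) {
                lastException = e; // Record the failure; in real code, log it at the WARN level.
                if (attempt == maxAttempts) {
                    break; // No attempts left.
                }
                // Wait for the current delay plus random jitter, then double the delay for the next attempt.
                Thread.sleep(delayMillis + RANDOM.nextInt((int) delayMillis + 1));
                delayMillis *= 2;
            }
        }
        throw lastException; // All attempts failed: surface the last exception to the caller.
    }

    public static void main(String[] args) throws Exception {
        // Wrap only idempotent operations, for example a SET command issued through your Tair client.
        String result = retryWithBackoff(() -> "OK" /* replace with an idempotent call such as jedis.set("key", "value") */, 5, 100);
        System.out.println(result);
    }
}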
Jedis
We recommend that you use Jedis 4.0.0 or later, preferably the latest Jedis version. In the following example, Jedis 5.0.0 is used.
Add the following dependency to your pom.xml file to include Jedis:
<dependency>
    <groupId>redis.clients</groupId>
    <artifactId>jedis</artifactId>
    <version>5.0.0</version>
</dependency>
Use Jedis to retry a Tair operation.
If the instance is a standard instance or a cluster instance in proxy mode, you must use the JedisPool mode.
The following sample code automatically retries the SET command up to 5 times within a total retry duration of 10 seconds, with exponentially increasing wait times between each retry. If all retries fail, an exception is thrown.
PooledConnectionProvider provider = new PooledConnectionProvider(HostAndPort.from("127.0.0.1:6379"));
int maxAttempts = 5; // The maximum number of attempts.
Duration maxTotalRetriesDuration = Duration.ofSeconds(10); // The maximum total retry duration.
UnifiedJedis jedis = new UnifiedJedis(provider, maxAttempts, maxTotalRetriesDuration);
try {
    System.out.println("set key: " + jedis.set("key", "value"));
} catch (Exception e) {
    // If the exception is caught here, the operation still failed after maxAttempts attempts or after maxTotalRetriesDuration elapsed.
    e.printStackTrace();
}
If the instance is a cluster instance in direct connection mode, you must use the JedisCluster mode.
You can configure the maxAttempts parameter to define the number of retry attempts in case of failure, with a default value of 5. If the operation is still unsuccessful after the maximum number of attempts, an exception is thrown.
HostAndPort hostAndPort = HostAndPort.from("127.0.0.1:30001");
int connectionTimeout = 5000; // Connection timeout in milliseconds.
int soTimeout = 2000; // Socket (read) timeout in milliseconds.
int maxAttempts = 5; // The maximum number of attempts.
ConnectionPoolConfig config = new ConnectionPoolConfig();
JedisCluster jedisCluster = new JedisCluster(hostAndPort, connectionTimeout, soTimeout, maxAttempts, config);
try {
    System.out.println("set key: " + jedisCluster.set("key", "value"));
} catch (Exception e) {
    // If the exception is caught here, the operation still failed after maxAttempts attempts.
    e.printStackTrace();
}
Redisson
The Redisson client provides two parameters to control the retry logic:
retryAttempts: the number of retries. Default value: 3.
retryInterval: the retry interval. Default value: 1500. Unit: milliseconds.
Example of retry settings on the Redisson client:
Config config = new Config();
config.useSingleServer()
.setTimeout(1000)
.setRetryAttempts(3)
.setRetryInterval(1500) //ms
.setAddress("redis://127.0.0.1:6379");
RedissonClient connect = Redisson.create(config);
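All commands issued through this client then use these retry settings. The following is a minimal usage sketch that continues the preceding example; the key name is illustrative, and the org.redisson.api.RBucket import is required.

RBucket<String> bucket = connect.getBucket("key");
// If a transient failure occurs, Redisson resends the command up to retryAttempts times,
// waiting retryInterval milliseconds between attempts.
bucket.set("value");
System.out.println("get key: " + bucket.get());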
StackExchange.Redis
The StackExchange.Redis client supports only connection retries. Example of retry settings on the StackExchange.Redis client:
var conn = ConnectionMultiplexer.Connect("redis0:6380,redis1:6380,connectRetry=3");
For more information about the API-level retry mechanism, see Polly.
Lettuce
Although the Lettuce client does not provide a parameter for retrying a command after it times out, it supports the following command execution reliability modes, which are controlled by the autoReconnect client option:
at-most-once execution: A command is executed at most once. If the connection is lost and then re-established, commands that were in flight may be lost.
at-least-once execution (default): Lettuce ensures that a command is delivered at least once, which means that the command may be retransmitted until it succeeds. If a master-replica switchover occurs in the Tair instance while the client keeps retrying, a large number of retried commands may accumulate on the client. After the switchover is complete, these commands are replayed and the CPU utilization of the Tair instance may surge.
For more information, see Client-Options and Command execution reliability.
The following Lettuce source snippet shows how the command execution reliability mode is derived from the autoReconnect client option:
clientOptions.isAutoReconnect() ? Reliability.AT_LEAST_ONCE : Reliability.AT_MOST_ONCE;
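To choose between the two modes, set the autoReconnect option when you create the client. The following is a minimal sketch, assuming a Lettuce 6.x dependency and a local endpoint at 127.0.0.1:6379:

import io.lettuce.core.ClientOptions;
import io.lettuce.core.RedisClient;
import io.lettuce.core.api.StatefulRedisConnection;

RedisClient client = RedisClient.create("redis://127.0.0.1:6379");
client.setOptions(ClientOptions.builder()
        .autoReconnect(true) // true (default): at-least-once; false: at-most-once.
        .build());
StatefulRedisConnection<String, String> connection = client.connect();
System.out.println("set key: " + connection.sync().set("key", "value"));
connection.close();
client.shutdown();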