Causes of Tair (Redis OSS-compatible) client faults and retry mechanisms

Applications that use Tair (Redis OSS-compatible) may encounter temporary faults caused by the network or the runtime environment, such as transient network jitter, temporary service unavailability, and timeouts caused by busy services. You can configure an automatic retry mechanism so that operations recover from these temporary faults and eventually succeed.

Causes of temporary faults

Cause	Description
High availability mechanism triggered by a fault	Tair (Redis OSS-compatible) monitors the health status of nodes. If the master node in an instance fails, a master-replica switchover is automatically triggered: the master and replica nodes swap roles to keep the instance highly available. During the switchover, the client may encounter the following temporary faults: (1) transient disconnections that last a few seconds; (2) a read-only state that lasts up to 30 seconds, which prevents the potential data loss and dual writes that a master-replica switchover can cause. Note: For more information, see High availability.
Request blockage caused by slow queries	Commands with a time complexity of O(N) can become slow queries that block the requests queued behind them. In this case, other requests initiated by the client may temporarily fail.
Complex network environments	Complex network environments between the client and the server may cause problems such as occasional network jitter and data retransmission. In this case, requests initiated by the client may temporarily fail.

Recommended retry rules

Retry rule	Description
Retry only idempotent operations	A timeout can occur in any of the following phases: (1) the client has sent the command, but the command has not reached the server; (2) the command has reached the server, but its execution times out; (3) the server has executed the command, but a timeout occurs while the result is returned to the client. In the last case, a retried operation is executed on the server more than once. Therefore, not all operations are suitable for a retry mechanism. We recommend that you retry only idempotent operations. For example, SET is idempotent: no matter how many times SET a b runs successfully, the value of a is b. LPUSH is not idempotent: if LPUSH mylist a is retried, mylist may contain multiple a elements.
Configure an appropriate retry count and interval	Set the number of retries and the retry interval based on your business requirements. If there are too few retries or the interval is too long, the application may fail to complete operations. If there are too many retries or the interval is too short, the application may consume excessive system resources and overwhelm the server with repeated requests. Common retry interval policies include immediate retry, fixed-interval retry, exponential backoff retry, and random backoff retry.
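Avoid retry nesting	Retry nesting multiplies the number of attempts and may cause repeated or even unlimited retries. For example, if an outer layer makes 3 attempts around a client that itself makes 5 attempts per call, a single failing request can be sent up to 15 times.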
Record retry exceptions and generate failure reports	We recommend that you log retries at the WARN level, and only when a retry fails.
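
The rules above can be illustrated with a minimal sketch that uses only the Java standard library. RetrySketch, backoffCapMillis, and callWithRetry are hypothetical names chosen for this example and are not part of any Redis client. The sketch assumes that the caller passes only idempotent operations. It bounds the number of attempts, waits a random time up to an exponentially growing cap between attempts, and logs a single warning when all attempts have failed (java.util.logging uses WARNING as its equivalent of WARN).

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical helper for illustration; not part of Jedis, Redisson, or Lettuce. */
final class RetrySketch {
    private static final Logger LOG = Logger.getLogger(RetrySketch.class.getName());

    private RetrySketch() {}

    /**
     * Upper bound of the wait before retry number {@code retry} (1-based):
     * baseMillis * 2^(retry - 1), capped at maxMillis.
     */
    static long backoffCapMillis(int retry, long baseMillis, long maxMillis) {
        int shift = Math.min(Math.max(retry - 1, 0), 20); // limit the shift to avoid overflow
        return Math.min(maxMillis, baseMillis << shift);
    }

    /**
     * Calls an operation that the caller knows to be idempotent, making at most
     * maxAttempts attempts. Rethrows the last exception if every attempt fails.
     */
    static <T> T callWithRetry(Callable<T> idempotentOp, int maxAttempts,
                               long baseMillis, long maxMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return idempotentOp.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    // Random backoff: sleep between 0 and the exponential cap.
                    long cap = backoffCapMillis(attempt, baseMillis, maxMillis);
                    Thread.sleep(ThreadLocalRandom.current().nextLong(cap + 1));
                }
            }
        }
        // Log once, at warning level, only after the retries are exhausted.
        LOG.log(Level.WARNING, "Giving up after " + maxAttempts + " attempts", last);
        throw last;
    }
}
```

A call might look like RetrySketch.callWithRetry(() -> client.set("key", "value"), 3, 100, 2000), where client stands for whichever Redis client object the application uses. To avoid retry nesting, do not wrap a client that already retries internally.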

Jedis

We recommend that you use Jedis 4.0.0 or later, preferably the latest Jedis version. In the following example, Jedis 5.0.0 is used.

Add the following dependency to your pom.xml file to include Jedis:

    <dependency>
        <groupId>redis.clients</groupId>
        <artifactId>jedis</artifactId>
        <version>5.0.0</version>
    </dependency>

Use Jedis to retry an operation.

If the instance is a standard instance or a cluster instance in proxy mode, you must use the JedisPool mode.

The following sample code attempts the SET command up to 5 times within a total retry duration of 10 seconds, with exponentially increasing wait times between attempts. If all attempts fail, an exception is thrown.

import java.time.Duration;
import redis.clients.jedis.HostAndPort;
import redis.clients.jedis.UnifiedJedis;
import redis.clients.jedis.providers.PooledConnectionProvider;

PooledConnectionProvider provider = new PooledConnectionProvider(HostAndPort.from("127.0.0.1:6379"));
int maxAttempts = 5; // The maximum number of attempts, including the first one.
Duration maxTotalRetriesDuration = Duration.ofSeconds(10); // The maximum total retry duration.
UnifiedJedis jedis = new UnifiedJedis(provider, maxAttempts, maxTotalRetriesDuration);
try {
    System.out.println("set key: " + jedis.set("key", "value"));
} catch (Exception e) {
    // Reaching this block means that the operation still failed after maxAttempts attempts or after maxTotalRetriesDuration elapsed.
    e.printStackTrace();
}

If the instance is a cluster instance in direct connection mode, you must use the JedisCluster mode.

The maxAttempts parameter sets the maximum number of attempts for an operation. Default value: 5. If the operation still fails after the maximum number of attempts, an exception is thrown.

import redis.clients.jedis.ConnectionPoolConfig;
import redis.clients.jedis.HostAndPort;
import redis.clients.jedis.JedisCluster;

HostAndPort hostAndPort = HostAndPort.from("127.0.0.1:30001");
int connectionTimeout = 5000; // Connection timeout, in milliseconds.
int soTimeout = 2000; // Socket read timeout, in milliseconds.
int maxAttempts = 5; // The maximum number of attempts.
ConnectionPoolConfig config = new ConnectionPoolConfig();
JedisCluster jedisCluster = new JedisCluster(hostAndPort, connectionTimeout, soTimeout, maxAttempts, config);
try {
    System.out.println("set key: " + jedisCluster.set("key", "value"));
} catch (Exception e) {
    // Reaching this block means that the operation still failed after maxAttempts attempts.
    e.printStackTrace();
}

Redisson

The Redisson client provides two parameters to control the retry logic:

retryAttempts: the number of retries. Default value: 3.
retryInterval: the retry interval. Default value: 1500. Unit: milliseconds.

Example of retry settings on the Redisson client:

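import org.redisson.Redisson;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;

Config config = new Config();
config.useSingleServer()
    .setTimeout(1000) // Response timeout, in milliseconds.
    .setRetryAttempts(3) // The number of retries.
    .setRetryInterval(1500) // The retry interval, in milliseconds.
    .setAddress("redis://127.0.0.1:6379");
RedissonClient connect = Redisson.create(config);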

StackExchange.Redis

The StackExchange.Redis client supports only connection retries. Example of retry settings on the StackExchange.Redis client:

var conn = ConnectionMultiplexer.Connect("redis0:6380,redis1:6380,connectRetry=3");

Note

For more information about the API-level retry mechanism, see Polly.

Lettuce

The Lettuce client does not provide parameters for retrying a command after it times out. Instead, its behavior when a connection is lost is determined by one of the following command execution reliability modes:

at-most-once execution: The command can be executed once at most. If the client is disconnected and then reconnected, the command may be lost.
at-least-once execution (default): A minimum of one successful command execution is ensured. This indicates that multiple attempts may be made to ensure a successful execution. If this method is used and a master/replica switchover occurs in the instance while the client is making multiple retry attempts, a large number of retry commands may be accumulated on the client. After the master/replica switchover is complete, the CPU utilization of the instance may surge.

Note

For more information, see Client-Options and Command execution reliability.

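Lettuce selects the mode based on the autoReconnect client option, as the following expression shows: at-least-once if automatic reconnection is enabled, and at-most-once otherwise.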

clientOptions.isAutoReconnect() ?  Reliability.AT_LEAST_ONCE : Reliability.AT_MOST_ONCE;
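
To illustrate why commands can pile up on the client in at-least-once mode, the following toy model contrasts the two modes described above. It is a simplified sketch under stated assumptions and does not reproduce the Lettuce implementation: ReliabilitySketch and all of its members are hypothetical names, the "server" is an in-memory list, and a command sent while disconnected is either buffered for replay (at-least-once) or dropped (at-most-once).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy model for illustration only; it is not how Lettuce is implemented. */
final class ReliabilitySketch {
    enum Mode { AT_MOST_ONCE, AT_LEAST_ONCE }

    private final Mode mode;
    private final Deque<String> pending = new ArrayDeque<>();       // buffered while disconnected
    private final List<String> executedOnServer = new ArrayList<>(); // stands in for the server
    private boolean connected = true;

    ReliabilitySketch(Mode mode) {
        this.mode = mode;
    }

    /** Simulates losing the connection, for example during a master-replica switchover. */
    void disconnect() {
        connected = false;
    }

    /** Restores the connection and replays every buffered command. Returns the replay count. */
    int reconnect() {
        connected = true;
        int replayed = pending.size();
        while (!pending.isEmpty()) {
            executedOnServer.add(pending.poll());
        }
        return replayed;
    }

    /** Returns true if the command was executed or buffered, false if it was dropped. */
    boolean send(String command) {
        if (connected) {
            executedOnServer.add(command);
            return true;
        }
        if (mode == Mode.AT_LEAST_ONCE) {
            pending.add(command); // kept until the connection is back
            return true;
        }
        return false; // at-most-once: the command is lost
    }

    List<String> executedOnServer() {
        return executedOnServer;
    }
}
```

In this model, 1,000 commands sent during a disconnection in at-least-once mode all reach the server in one burst when reconnect() is called, which mirrors the CPU surge described above. In at-most-once mode, the same commands are dropped and the application must decide whether to reissue them.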