Impact of a data shard failure on the request success rate - Tair (Redis® OSS-Compatible)


Assume that the number of data shards in a cluster instance is N. When one of the data shards fails, a master-replica failover is initiated on the faulty data shard. The entire failover process takes anywhere from a few seconds to tens of seconds. During this process, all requests sent to the data shard fail. If the requests are evenly distributed across all data shards, the instance theoretically has a request failure rate of 1/N.
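
The 1/N estimate above can be checked with a short sketch. The function names and the round-robin request distribution are assumptions made for illustration and are not part of Tair; real traffic is only approximately even.

```python
def theoretical_failure_rate(num_shards: int) -> float:
    """Failure rate during failover when exactly one of N shards is down."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    return 1 / num_shards


def simulated_failure_rate(num_shards: int, num_requests: int, failed_shard: int = 0) -> float:
    """Spread requests round-robin over the shards and count those hitting the failed one."""
    failed = sum(1 for i in range(num_requests) if i % num_shards == failed_shard)
    return failed / num_requests


if __name__ == "__main__":
    for n in (2, 4, 8):
        print(n, theoretical_failure_rate(n), simulated_failure_rate(n, 10_000))
```

For the two-shard instance used in the examples below, this baseline is 1/2: half of all requests fail for the duration of the failover.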

However, the actual number of failed requests may be higher than the theoretical value for the following reasons. The examples use a cluster instance that contains two data shards and runs in proxy mode.

Note

For a cluster instance in direct connection mode, the use of multi-key commands is the only cause of an actual failure rate that is higher than the theoretical value.

Commands that involve multiple keys are used
- Proxy mode: The proxy server splits multi-key commands into multiple subcommands and routes them to the corresponding data shards based on the route table.
- Direct connection mode: The client sends each subcommand to the corresponding data shard.
When a multi-key command involves a data shard that experiences a failure, the entire request tends to fail, as shown in the following figure.
Optimization solution: Minimize the number of keys in a single command to reduce the probability of multi-key command failures when a data shard fails.
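
To see why fewer keys per command helps, the following sketch estimates how likely a multi-key command is to touch the failed shard. It assumes that each key maps to a shard independently and uniformly, which is a simplification and not something the text above guarantees. The function name is hypothetical.

```python
def multi_key_failure_probability(num_shards: int, num_keys: int) -> float:
    """Probability that at least one of num_keys keys lives on the single failed shard.

    Assumes keys are spread independently and uniformly over num_shards shards.
    """
    if num_shards < 1 or num_keys < 1:
        raise ValueError("num_shards and num_keys must be at least 1")
    # A key avoids the failed shard with probability (1 - 1/N);
    # the command succeeds only if every key avoids it.
    return 1 - (1 - 1 / num_shards) ** num_keys


if __name__ == "__main__":
    for keys in (1, 2, 4, 8):
        print(keys, multi_key_failure_probability(2, keys))
```

Under this assumption, a single-key command on a two-shard instance fails with probability 1/2, while a command with four keys fails with probability 1 - (1/2)^4, so the failure rate rises quickly with the number of keys per command.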
The client uses a single connection
Some clients, such as Lettuce, send requests asynchronously over a single connection. For example, when two GET requests are sent sequentially to a proxy server over one connection, Redis Serialization Protocol (RESP) requires that responses be returned in the same order in which the requests were sent. If the GET key2 request fails due to a node failure, the client cannot receive the response to the request that follows GET key2, even if that later request is executed successfully.
Optimization solution: Use clients such as Jedis that support the connection pool model.
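
The ordering effect described above can be modeled in a few lines. This is a simplified sketch, not client code: it assumes that a request to the failed shard never receives a response before its timeout, and the function names are hypothetical.

```python
from typing import Iterable, List, Set


def delivered_on_single_connection(keys: Iterable[str], failed_keys: Set[str]) -> List[str]:
    """Responses arrive in request order, so a stalled response blocks all later ones."""
    delivered = []
    for key in keys:
        if key in failed_keys:
            break  # no response for this request; later responses queue behind it
        delivered.append(key)
    return delivered


def delivered_with_separate_connections(keys: Iterable[str], failed_keys: Set[str]) -> List[str]:
    """With one connection per request, only requests to the failed shard are affected."""
    return [key for key in keys if key not in failed_keys]


if __name__ == "__main__":
    requests = ["key1", "key2", "key3"]
    print(delivered_on_single_connection(requests, {"key2"}))
    print(delivered_with_separate_connections(requests, {"key2"}))
```

In this model, a request for key3 that follows GET key2 on the same connection is lost together with it, whereas with separate connections only GET key2 fails.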
Connection pool resources on the client are exhausted
For clients that use the connection pool model, the maximum number of connections allowed can be configured. When the number of connections reaches the maximum limit and no idle connections are available, new requests fail or are blocked.
In the following figure, the Jedis client is used. If maxTotal is set to 3, timeout is set to 2000 (milliseconds), and three GET key2 requests are initiated within 2 seconds while the node that stores key2 does not respond, all three connections remain in use and the connection pool is exhausted. New requests then either fail or are blocked, depending on the blockWhenExhausted configuration, so client-side requests fail or time out.
Optimization solution: Configure the JedisPool resource pool size appropriately and set the timeout parameter to a lower value, such as 200 to 300 milliseconds (instead of the default 2,000 milliseconds). This adjustment helps achieve faster failure responses and prevents a large number of connections from being blocked when a failure occurs.
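
The effect of a shorter timeout can be illustrated with a small simulation. It is a rough model and does not use Jedis itself: it assumes that a request to the failed shard holds its connection for the full timeout, that a healthy request holds it for 1 ms, and that the pool rejects requests immediately when exhausted (the behavior with blockWhenExhausted set to false). The function name and the traffic pattern are hypothetical.

```python
import heapq
from typing import List, Tuple


def count_rejected(arrivals: List[Tuple[int, bool]], max_total: int,
                   timeout_ms: int, healthy_ms: int = 1) -> int:
    """Count requests rejected because no idle connection is available.

    arrivals: (arrival_time_ms, hits_failed_shard) pairs sorted by time.
    """
    busy_until: List[int] = []  # min-heap of times at which borrowed connections are returned
    rejected = 0
    for time_ms, hits_failed_shard in arrivals:
        # Return every connection whose request has finished or timed out.
        while busy_until and busy_until[0] <= time_ms:
            heapq.heappop(busy_until)
        if len(busy_until) >= max_total:
            rejected += 1
            continue
        hold_ms = timeout_ms if hits_failed_shard else healthy_ms
        heapq.heappush(busy_until, time_ms + hold_ms)
    return rejected


if __name__ == "__main__":
    # One request every 100 ms for 3 seconds, alternating healthy and failed shard.
    traffic = [(i * 100, i % 2 == 1) for i in range(30)]
    print("timeout 2000 ms:", count_rejected(traffic, max_total=3, timeout_ms=2000))
    print("timeout 200 ms:", count_rejected(traffic, max_total=3, timeout_ms=200))
```

With the 2,000 ms timeout, the three connections are soon all held by requests waiting on the failed shard, so requests to the healthy shard are rejected as well. With the 200 ms timeout, connections are returned quickly enough in this traffic pattern that the pool is never exhausted.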
