When the metrics of one data shard in a Tair cluster instance far exceed those of the other data shards, the instance most likely has data skew. Data skew typically shows up in memory usage, CPU utilization, bandwidth usage, and latency metrics. In this case, exceptions such as data eviction, out-of-memory (OOM) errors, and high latency may occur even when the overall memory usage of the instance is low.
Why does data skew occur?
Tair cluster instances adopt a distributed architecture. The storage space of a cluster instance is split into 16,384 slots, and each data shard stores and handles the data in a specific range of slots. For example, in a 3-shard cluster instance, slots are distributed among the three shards in the following way: the first shard handles slots in the [0,5460] range, the second shard handles slots in the [5461,10922] range, and the third shard handles slots in the [10923,16383] range. When you write a key to a cluster instance or update a key in a cluster instance, the client determines the slot to which the key belongs by using the following formula: Slot = CRC16(key) % 16384. Then, the client writes the key to the shard that handles that slot. In theory, this mechanism evenly distributes keys among data shards and keeps metrics such as memory usage and CPU utilization at almost the same level across data shards.
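For reference, the following Python sketch reproduces this slot calculation. It implements the CRC16-XMODEM variant that open-source Redis Cluster uses for key hashing; the function names are illustrative and are not part of any Tair SDK.

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16 with polynomial 0x1021, the variant used for key hashing."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def key_slot(key: str) -> int:
    # Slot = CRC16(key) % 16384
    return crc16_xmodem(key.encode()) % 16384

# In a 3-shard instance: slots 0-5460 -> shard 1,
# slots 5461-10922 -> shard 2, slots 10923-16383 -> shard 3.
slot = key_slot("key1")
print(slot, "-> shard", 1 if slot <= 5460 else (2 if slot <= 10922 else 3))
```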
However, in practice, data skew may occur because of a lack of advance planning, unusual data writes, or data access spikes.
Typically, data skew occurs when the resources of specific data shards are in much higher demand than those of other data shards.
You can view the metrics of data shards on the Data Node tab of the Performance Monitor page in the console. If the metrics of a data shard are consistently more than 20% higher than those of other data shards, data skew may be present. The larger the gap between the abnormal shard and the normal shards, the more severe the data skew.
The following figure shows two typical data skew scenarios. Although keys are evenly distributed across the cluster (two keys per data shard), data skew still occurs.
- The queries per second (QPS) of `key1` on Replica 1 are much higher than those of other keys. This is a case of data access skew, which can lead to high CPU utilization and bandwidth usage on the replica and affects the performance of all keys on the replica.
- The size of `key5` on Replica 2 is 1 MB, which is much larger than that of other keys. This is a case of data volume skew, which can lead to high memory usage and bandwidth usage on the replica and affects the performance of all keys on the replica.
This topic describes how to determine whether data skew occurs, identify the cause, and handle the issue. You can also refer to this topic to troubleshoot high memory usage, CPU utilization, bandwidth usage, and latency for Tair standard instances.
Check for data skew
- Use the diagnostic report feature to check whether data skew is present on the current instance.
- On the Data Node tab of the Performance Monitor page, view the metrics of data shards. For more information, see View performance monitoring data.
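If you prefer to check programmatically, the following minimal sketch uses redis-py to compare the memory usage of the primary nodes. It assumes a cluster that allows direct connections to data nodes (open-source Redis Cluster semantics); the endpoint is a placeholder, and the 20% threshold mirrors the guideline above.

```python
from redis.cluster import RedisCluster

# Placeholder endpoint; assumes direct connection mode so that the
# client can reach each data shard individually.
rc = RedisCluster(host="r-example.redis.rds.aliyuncs.com", port=6379)

# Fan INFO out to every primary and collect used_memory per node.
info = rc.info(section="memory", target_nodes=RedisCluster.PRIMARIES)
usage = {node: stats["used_memory"] for node, stats in info.items()}
average = sum(usage.values()) / len(usage)

for node, mem in sorted(usage.items(), key=lambda kv: -kv[1]):
    flag = "  <- possible skew" if mem > average * 1.2 else ""
    print(f"{node}: {mem / 1024**2:.1f} MB{flag}")
```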
Provisional solutions to data skew
If data skew is present, you can apply the provisional solutions described in the following table. These solutions temporarily mitigate data skew but do not resolve its root cause.
You can also temporarily reduce the number of requests sent to large keys and hotkeys. However, to resolve issues related to large keys and hotkeys, you must make adjustments at the application level. We recommend that you promptly identify the cause of data skew in your instance and handle the issue at the application level to optimize instance performance. For more information, see the Causes and solutions section of this topic.
| Issue | Possible cause | Provisional solution |
| --- | --- | --- |
| Memory usage skew | Large keys and hash tags | Upgrade your instance specifications. For more information, see Change the configurations of an instance. |
| Bandwidth usage skew | Large keys, hotkeys, and resource-intensive commands | Increase the bandwidth of one or more specific data shards. For more information, see Manually increase the bandwidth of an instance. Note: The bandwidth of a data shard can be increased to at most six times the default bandwidth, and the increase cannot exceed 192 Mbit/s. If this measure does not resolve the data skew issue, we recommend that you make adjustments at the application level. |
| CPU utilization skew | Large keys, hotkeys, and resource-intensive commands | No provisional solutions are available for this issue. Check your instance, identify the cause, and then make adjustments at the application level. |
Causes and solutions
To resolve the root cause of data skew, we recommend that you evaluate your business growth and plan ahead for it. For example, you can split large keys in advance and write data in a way that matches the expected access patterns.
| Cause | Description | Solution |
| --- | --- | --- |
| Large keys | A large key is identified based on the size of the key and the number of members in the key. Large keys are common in data structures such as Hash, List, Set, and Zset, and occur when these structures store a large number of fields or fields that are excessively large. Large keys are one of the main culprits of data skew. For more information, see Identify and handle large keys and hotkeys. | Split the large key into multiple smaller keys so that its data and traffic are spread across data shards, as shown in the sketch after this table. For more information, see Identify and handle large keys and hotkeys. |
| Hotkeys | Hotkeys are keys that have a much higher QPS than other keys. Hotkeys commonly appear during stress testing on a single key, or on the keys of popular products during flash sales. For more information, see Identify and handle large keys and hotkeys. | Cache hot data on the client side, or distribute reads for hotkeys across read replicas. For more information, see Identify and handle large keys and hotkeys. |
| Resource-intensive commands | Each command has a metric called time complexity that measures its resource and time consumption. In most cases, the higher the time complexity of a command, the more resources the command consumes. For example, the time complexity of the HGETALL command is O(N), where N is the number of fields in the Hash. Running such commands on large collections can overload the data shard that hosts them. | Avoid running high-complexity commands on large collections. For example, use SCAN instead of KEYS, and use HSCAN or HMGET instead of HGETALL when you need only specific fields. |
| Hash tags | Tair distributes a key to a specific data shard based on the content contained in the braces ({}) of a hash tag. If a key contains a hash tag, only the string inside the braces is used to calculate the slot. As a result, keys that share the same hash tag are stored on the same data shard. If a large number of keys share the same hash tag, the shard that hosts them can become a hotspot. | Use hash tags only when your business requires multiple keys to reside on the same data shard, and avoid concentrating a large number of keys under the same hash tag. |
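To make the large-key solution concrete, the following hedged sketch splits a large Hash into several smaller sub-hashes so that its fields, and the traffic on them, spread across slots and data shards. The endpoint, bucket count, and helper names are hypothetical, not part of any Tair SDK.

```python
import zlib
import redis

# Placeholder endpoint for illustration.
r = redis.Redis(host="r-example.redis.rds.aliyuncs.com", port=6379)

BUCKETS = 16  # number of sub-hashes to spread the fields over

def bucket_key(big_key: str, field: str) -> str:
    # Route each field to a stable sub-hash such as "big_hash:3".
    # Deliberately no hash tag: naming the bucket "{big_hash}:3" instead
    # would pin every bucket to the same slot and defeat the split.
    return f"{big_key}:{zlib.crc32(field.encode()) % BUCKETS}"

def hset_split(big_key: str, field: str, value: str) -> None:
    r.hset(bucket_key(big_key, field), field, value)

def hget_split(big_key: str, field: str):
    return r.hget(bucket_key(big_key, field), field)

hset_split("big_hash", "user42", "some value")
print(hget_split("big_hash", "user42"))
```

The same idea applies to Lists, Sets, and Zsets: shard the members across several smaller keys so that no single data shard has to store or serve the entire structure.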