When the metrics of one data shard in a Tair cluster instance far exceed those of the other data shards, the instance most likely has data skew. Data skew typically shows up in memory usage, CPU utilization, bandwidth usage, and latency metrics. In this case, exceptions such as data eviction, out-of-memory (OOM) errors, and high latency may occur even when the overall memory usage of the instance is low.
Why does data skew occur?
Tair cluster instances adopt a distributed architecture. The storage space of a cluster instance is split into 16,384 slots, and each data shard stores and handles the data in a specific range of slots. For example, in a 3-shard cluster instance, slots are distributed among the three shards in the following way: the first shard handles slots in the [0,5460] range, the second shard handles slots in the [5461,10922] range, and the third shard handles slots in the [10923,16383] range. When you write a key to a cluster instance or update a key in a cluster instance, the client determines the slot to which the key belongs by using the following formula: Slot = CRC16(key) % 16384. Then, the client writes the key to the shard that handles that slot. In theory, this mechanism evenly distributes keys among data shards and keeps metrics such as memory usage and CPU utilization at almost the same level across data shards.
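For reference, the following Python sketch reproduces this slot calculation. It implements the CRC16-XMODEM variant that open-source Redis Cluster uses for key hashing; the function names are illustrative and are not part of any Tair SDK.

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16 with polynomial 0x1021, the variant used for key hashing."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def key_slot(key: str) -> int:
    # Slot = CRC16(key) % 16384
    return crc16_xmodem(key.encode()) % 16384

# In a 3-shard instance: slots 0-5460 -> shard 1,
# slots 5461-10922 -> shard 2, slots 10923-16383 -> shard 3.
slot = key_slot("key1")
print(slot, "-> shard", 1 if slot <= 5460 else (2 if slot <= 10922 else 3))
```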
However, in practice, data skew may occur because of a lack of advance planning, unusual data writes, or data access spikes.
Typically, data skew occurs when the resources of specific data shards are in much higher demand than those of other data shards.
You can view the metrics of data shards on the Data Node tab of the Performance Monitor page in the console. If the metrics of a data shard are consistently more than 20% higher than those of other data shards, data skew may be present. The larger the gap between the abnormal shard and the normal shards, the more severe the data skew.
The following figure shows two typical data skew scenarios. Although keys are evenly distributed across the cluster (two keys per data shard), data skew still occurs.
- The queries per second (QPS) of `key1` on Replica 1 are much higher than those of other keys. This is a case of data access skew, which can lead to high CPU utilization and bandwidth usage on the replica and affects the performance of all keys on the replica.
- The size of `key5` on Replica 2 is 1 MB, which is much larger than that of other keys. This is a case of data volume skew, which can lead to high memory usage and bandwidth usage on the replica and affects the performance of all keys on the replica.
This topic describes how to determine whether data skew occurs, identify the cause, and handle the issue. You can also refer to this topic to troubleshoot high memory usage, CPU utilization, bandwidth usage, and latency for Tair standard instances.
Check for data skew
- Use the diagnostic report feature to check whether data skew is present on the current instance.
- On the Data Node tab of the Performance Monitor page, view the metrics of data shards. For more information, see View performance monitoring data.
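If you prefer to check programmatically, the following minimal sketch uses redis-py to compare the memory usage of the primary nodes. It assumes a cluster that allows direct connections to data nodes (open-source Redis Cluster semantics); the endpoint is a placeholder, and the 20% threshold mirrors the guideline above.

```python
from redis.cluster import RedisCluster

# Placeholder endpoint; assumes direct connection mode so that the
# client can reach each data shard individually.
rc = RedisCluster(host="r-example.redis.rds.aliyuncs.com", port=6379)

# Fan INFO out to every primary and collect used_memory per node.
info = rc.info(section="memory", target_nodes=RedisCluster.PRIMARIES)
usage = {node: stats["used_memory"] for node, stats in info.items()}
average = sum(usage.values()) / len(usage)

for node, mem in sorted(usage.items(), key=lambda kv: -kv[1]):
    flag = "  <- possible skew" if mem > average * 1.2 else ""
    print(f"{node}: {mem / 1024**2:.1f} MB{flag}")
```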
Provisional solutions to data skew
If data skew is present, you can apply the provisional solutions described in the following table. These solutions temporarily mitigate data skew but do not resolve its root cause.
You can also temporarily reduce the number of requests sent to large keys and hotkeys. However, to resolve issues related to large keys and hotkeys, you must make adjustments at the application level. We recommend that you promptly identify the cause of data skew in your instance and handle the issue at the application level to optimize instance performance. For more information, see the Causes and solutions section of this topic.
| Issue | Possible cause | Provisional solution |
| --- | --- | --- |
| Memory usage skew | Large keys and hash tags | Upgrade your instance specifications. For more information, see Change the configurations of an instance. |
| Bandwidth usage skew | Large keys, hotkeys, and resource-intensive commands | Increase the bandwidth of one or more specific data shards. For more information, see Manually increase the bandwidth of an instance. Note: The bandwidth of a data shard can be increased to at most six times the default bandwidth, and the increase cannot exceed 192 Mbit/s. If this measure does not resolve the data skew issue, we recommend that you make adjustments at the application level. |
| CPU utilization skew | Large keys, hotkeys, and resource-intensive commands | No provisional solutions are available for this issue. Check your instance, identify the cause, and then make adjustments at the application level. |
Causes and solutions
To resolve the root cause of data skew, we recommend that you evaluate your business growth and plan ahead for it. For example, you can split large keys in advance and write data in a way that matches the expected access patterns.
| Cause | Description | Solution |
| --- | --- | --- |
| Large keys | A large key is identified based on the size of the key and the number of members in the key. Large keys are common in data structures such as Hash, List, Set, and Zset, and occur when these structures store a large number of fields or fields that are excessively large. Large keys are one of the main culprits of data skew. For more information, see Identify and handle large keys and hotkeys. | Split the large key into multiple smaller keys so that its data and traffic are spread across data shards, as shown in the sketch after this table. For more information, see Identify and handle large keys and hotkeys. |
| Hotkeys | Hotkeys are keys that have a much higher QPS than other keys. Hotkeys commonly appear during stress testing on a single key, or on the keys of popular products during flash sales. For more information, see Identify and handle large keys and hotkeys. | Cache hot data on the client side, or distribute reads for hotkeys across read replicas. For more information, see Identify and handle large keys and hotkeys. |
| Resource-intensive commands | Each command has a metric called time complexity that measures its resource and time consumption. In most cases, the higher the time complexity of a command, the more resources the command consumes. For example, the time complexity of the HGETALL command is O(N), where N is the number of fields in the Hash. Running such commands on large collections can overload the data shard that hosts them. | Avoid running high-complexity commands on large collections. For example, use SCAN instead of KEYS, and use HSCAN or HMGET instead of HGETALL when you need only specific fields. |
| Hash tags | Tair distributes a key to a specific data shard based on the content contained in the braces ({}) of a hash tag. If a key contains a hash tag, only the string inside the braces is used to calculate the slot. As a result, keys that share the same hash tag are stored on the same data shard. If a large number of keys share the same hash tag, the shard that hosts them can become a hotspot. | Use hash tags only when your business requires multiple keys to reside on the same data shard, and avoid concentrating a large number of keys under the same hash tag. |
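To make the large-key solution concrete, the following hedged sketch splits a large Hash into several smaller sub-hashes so that its fields, and the traffic on them, spread across slots and data shards. The endpoint, bucket count, and helper names are hypothetical, not part of any Tair SDK.

```python
import zlib
import redis

# Placeholder endpoint for illustration.
r = redis.Redis(host="r-example.redis.rds.aliyuncs.com", port=6379)

BUCKETS = 16  # number of sub-hashes to spread the fields over

def bucket_key(big_key: str, field: str) -> str:
    # Route each field to a stable sub-hash such as "big_hash:3".
    # Deliberately no hash tag: naming the bucket "{big_hash}:3" instead
    # would pin every bucket to the same slot and defeat the split.
    return f"{big_key}:{zlib.crc32(field.encode()) % BUCKETS}"

def hset_split(big_key: str, field: str, value: str) -> None:
    r.hset(bucket_key(big_key, field), field, value)

def hget_split(big_key: str, field: str):
    return r.hget(bucket_key(big_key, field), field)

hset_split("big_hash", "user42", "some value")
print(hget_split("big_hash", "user42"))
```

The same idea applies to Lists, Sets, and Zsets: shard the members across several smaller keys so that no single data shard has to store or serve the entire structure.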