How to identify and optimize large keys and hotkeys - Tair (Redis® OSS-Compatible)

Large keys and hotkeys can lead to degraded service performance, poor user experience, or even system failures. This topic explains how to quickly identify and optimize large keys and hotkeys, analyzes their causes and the problems they create, and describes preventive measures that limit their impact on business operations.

Quickly identify large keys and hotkeys

Alibaba Cloud self-developed tools

Tair and Redis offer Top Key statistics and offline full key analysis features in the console to assist in quickly identifying large keys and hotkeys.

Method	Limits	Description
Top Key statistics (recommended)	Supported only on Redis open source edition 5.0 and later and on Tair (enterprise edition) memory type and persistent memory type instances.	Displays the top three large keys and hotkeys of each data type in each shard in real time. Allows you to view the historical information of large keys and hotkeys within the last four days. Has almost no impact on online services.
Offline full key analysis	Not supported on disk type instances.	Allows you to analyze Redis Database (RDB) backup files in a customized manner. You can view the statistics of keys in an instance, such as the memory usage, distribution, and time-to-live (TTL) of keys. Has no impact on online services. Does not allow for rapid analysis, and takes longer to analyze large RDB files. Cannot parse hotkey information.

If your instance cannot use the above features, consider the following methods.

Other methods to identify large keys and hotkeys

Method	Advantages and disadvantages	Description
Use the --bigkeys, --memkeys, and --hotkeys options of redis-cli to find large keys and hotkeys	Advantages: Convenient, fast, and secure. Disadvantages: The analysis results cannot be customized, and accuracy and timeliness are poor. The scan traverses all existing keys in the instance, which may affect instance performance.	The --bigkeys, --memkeys, and --hotkeys options of redis-cli report the overall statistics of keys and the single largest key or hottest key for each data type. They differ as follows: --bigkeys: Collects statistics of large keys. For collection or list types, it returns the number of elements. --memkeys: Collects statistics of large keys and returns the memory size occupied by keys of all data types. --hotkeys: Collects statistics of hotkeys. It requires the maxmemory-policy parameter to be set to an LFU policy (volatile-lfu or allkeys-lfu). Supported data types: STRING, LIST, HASH, SET, ZSET, and STREAM. For example, the command for --bigkeys is `redis-cli -h r-***************.redis.rds.aliyuncs.com -a <password> --bigkeys`.
Analyze a specific key using built-in commands	Advantages: Has little impact on online services. Disadvantages: The returned serialized length of a key is not equal to the actual size of the key in memory. This method provides limited precision and is for reference only.	Use the following low-risk commands, chosen by data type, to determine whether a key is a large key. STRING type: STRLEN command, which returns the number of bytes in the value of the key. LIST type: LLEN command, which returns the number of elements in the list. HASH type: HLEN command, which returns the number of fields in the hash. SET type: SCARD command, which returns the number of members in the set. ZSET type: ZCARD command, which returns the number of members in the sorted set. STREAM type: XLEN command, which returns the number of entries in the stream. Note: The DEBUG OBJECT and MEMORY USAGE commands consume a large amount of resources and have a time complexity of O(N). They may block the instance and are not recommended.
Identify hotkeys at the business layer	Advantages: Can identify hotkeys in a timely and accurate manner. Disadvantages: Increases the complexity of business code and may degrade performance.	Add code to the business layer to record requests sent to instances and asynchronously analyze the collected statistics.
Identify large keys in a customized manner using the redis-rdb-tools project	Advantages: Supports customized analysis without affecting online services. Disadvantages: Does not allow for rapid analysis, and takes longer to analyze large RDB files.	Redis-rdb-tools is an open-source tool written in Python that supports customized analysis of RDB snapshot files. After you download RDB files, you can analyze the memory usage of all keys in an instance based on your business requirements and perform flexible queries.
Find hotkeys using the MONITOR command	Advantages: Convenient and secure. Disadvantages: Consumes CPU, memory, and network resources and has poor timeliness and limited precision.	The MONITOR command prints every request that the instance processes, including the time, client information, command, and key. In an emergency, you can run the MONITOR command temporarily and write its output to a file. After you stop the MONITOR command, classify and analyze the requests in the file to identify the keys that were hot during the capture period. Note: The MONITOR command significantly degrades instance performance. Use it only in exceptional circumstances.
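
The offline analysis of captured MONITOR output described in the last row can be sketched as follows. This is a minimal illustration, not an official tool: it assumes the standard MONITOR line layout (timestamp, client information in brackets, then double-quoted command and arguments) and assumes that the key is the first argument, which holds for common single-key commands but not for every command. The function names are hypothetical.

```python
import re
from collections import Counter

# One double-quoted token of a MONITOR line; escaped quotes (\") may appear inside.
_TOKEN = re.compile(r'"((?:[^"\\]|\\.)*)"')


def count_first_keys(monitor_lines):
    """Count how often each key appears as the first argument of a command.

    Assumption: the key is the first argument. This is true for common
    single-key commands (GET, SET, HGET, ...) but not for every command.
    """
    counts = Counter()
    for line in monitor_lines:
        tokens = _TOKEN.findall(line)
        if len(tokens) < 2:
            continue  # not a command line, or a command without arguments
        counts[tokens[1]] += 1
    return counts


def top_keys(path, n=10):
    """Read a file of captured MONITOR output and return the n most frequent keys."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        return count_first_keys(fh).most_common(n)
```

For example, after redirecting a short MONITOR capture to a file (the file name `monitor.log` is only an example), `top_keys("monitor.log", 5)` lists the five most frequently addressed keys for that period. Keep the capture window short, because MONITOR itself loads the instance.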

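The per-type commands listed in the table above (STRLEN, LLEN, HLEN, SCARD, ZCARD, XLEN) can be combined into one probe, sketched below. It assumes a client object that exposes `type`, `strlen`, `llen`, `hlen`, `scard`, `zcard`, and `xlen` methods, as the redis-py client does. The function names and the two thresholds are hypothetical placeholders, since this topic does not define when a key counts as large.

```python
# Maps the type name returned by TYPE to the client method that reports its size.
_SIZE_METHODS = {
    "string": "strlen",  # number of bytes in the value
    "list": "llen",
    "hash": "hlen",
    "set": "scard",
    "zset": "zcard",
    "stream": "xlen",
}


def key_size(client, key):
    """Return (type_name, size); size is None for missing keys or other types."""
    type_name = client.type(key)
    if isinstance(type_name, bytes):  # clients without response decoding return bytes
        type_name = type_name.decode()
    method = _SIZE_METHODS.get(type_name)
    if method is None:
        return type_name, None
    return type_name, getattr(client, method)(key)


def looks_large(type_name, size, max_string_bytes=10 * 1024, max_elements=5000):
    """Apply hypothetical thresholds; tune both limits to your own workload."""
    if size is None:
        return False
    limit = max_string_bytes if type_name == "string" else max_elements
    return size > limit
```

As noted in the table, these values are element counts or value lengths, not the memory that the key actually occupies, so the result is for reference only.
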
Optimize large keys and hotkeys

Category	Handling method	Description
Large key	Compress large keys	It is recommended to reduce the storage space of large keys by using serialization or compression algorithms before data is written to the cache. If the key is still too large after compression, you can further split the key.
	Split large keys	For example, you can split a HASH key that contains tens of thousands of members into multiple HASH keys that each have an appropriate number of members. Splitting large keys can effectively prevent data skew.
	Delete large keys	You can store unsuitable data in other storage engines and delete the data from the instance. Note Redis open source edition 4.0 and later versions: You can safely delete large keys or even extra-large keys by using the UNLINK command. This command asynchronously clears keys to avoid blocking the main thread. Redis open source edition earlier than 4.0: It is recommended to first read part of the data by using the SCAN command and then delete the data to avoid blocking the main thread by deleting many keys at a time.
	Delete expired data on a regular basis	The accumulation of a large amount of expired data may produce large keys. For example, a HASH key may keep growing if data is continuously appended and the expiration of old fields is ignored. You can use scheduled tasks to delete invalid data. Note: When you clear HASH data, use the HSCAN command together with the HDEL command to delete invalid fields in batches, so that deleting a large amount of data at a time does not block the instance.
Hot key	Replicate hotkeys for cluster instances	Because a hotkey is stored as a whole in a single shard, requests cannot be distributed by migrating part of the data. As a result, the pressure on a single data shard cannot be reduced. In this case, you can replicate the corresponding hotkey and migrate it to other data shards. For example, you can replicate the hotkey foo into three identical keys named foo2, foo3, and foo4, and migrate these keys to other data shards to alleviate the pressure on a single data shard caused by the hotkey. Note The disadvantage of this solution is that you need to modify the code to maintain multiple replicas, and it is difficult to ensure data consistency among multiple replicas. For example, update operations need to be synchronized to all replicas. It is recommended to use this solution as a temporary solution to alleviate urgent issues.
	Enable read/write splitting	If a hotkey is caused by read requests, you can enable read/write splitting to reduce the read request load on each data shard. If the read request load is still high after read/write splitting is enabled, you can further reduce it by increasing the number of read-only nodes. Note: Read/write splitting also has disadvantages. Under extremely high request volumes, synchronization to read-only nodes inevitably lags, so clients may read dirty (stale) data. Therefore, read/write splitting is not recommended for scenarios with high read and write pressure and strict requirements for data consistency.
	Enable the proxy query cache feature	After this feature is enabled, Tair and Redis identify hotkeys (usually hotkeys with QPS greater than 5,000) based on algorithms. The proxy node caches the requests and query results of hotkeys (only the query results of point keys are cached, and the entire key does not need to be cached). If a proxy node receives a duplicate request within the validity period of the cached data, the proxy server directly returns the response of the request to the client without the need to interact with backend data shards. For more information, see optimize hotkey issues by using the proxy query cache.
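
The hotkey replication approach from the table (copying foo to foo2, foo3, and foo4) might look like the sketch below on the application side. It assumes a client with `set` and `get` methods, as in redis-py; the function names are hypothetical. Note that a different key name does not by itself guarantee a different data shard, so check where each replica actually lands.

```python
import random


def replica_names(base, copies=3):
    """Return the original key plus its copies, e.g. foo, foo2, foo3, foo4."""
    return [base] + [f"{base}{i}" for i in range(2, copies + 2)]


def write_all(client, base, value, copies=3):
    """Write the value to every replica so that reads stay consistent."""
    for name in replica_names(base, copies):
        client.set(name, value)


def read_any(client, base, copies=3, rng=random):
    """Spread read load by picking one replica at random."""
    return client.get(rng.choice(replica_names(base, copies)))
```

The writes in `write_all` are not atomic across replicas, which is the consistency drawback that the note in the table describes. This is one reason to treat the approach as a temporary measure.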

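The HSCAN plus HDEL cleanup recommended in the table above can be sketched as a batched loop. The sketch assumes a client whose `hscan` returns a `(cursor, {field: value})` pair and whose `hdel` accepts multiple fields, as in redis-py. The function name and the `is_stale` callback are hypothetical; how staleness is decided depends on your data.

```python
def purge_hash_fields(client, key, is_stale, batch_size=100):
    """Delete fields for which is_stale(field, value) is true, one HSCAN page at a time.

    Returns the number of fields that were deleted.
    """
    deleted = 0
    cursor = 0
    while True:
        # COUNT is only a hint; the server may return more or fewer fields per page.
        cursor, page = client.hscan(key, cursor=cursor, count=batch_size)
        stale = [field for field, value in page.items() if is_stale(field, value)]
        if stale:
            deleted += client.hdel(key, *stale)
        if cursor == 0:  # HSCAN returns cursor 0 once the iteration is complete
            break
    return deleted
```

Each round trip touches only one page of fields, so no single command has to process the whole hash. A scheduled task could call this function with a callback that, for example, compares a timestamp stored in the value against the current time.
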
Causes of large keys and hotkeys

In Tair and Redis, the key is the smallest unit of data distribution: each key is stored in one specific data shard and cannot be split across shards. Insufficient workload planning, accumulation of invalid data, and unexpected traffic spikes can therefore produce large keys and hotkeys in an instance. Typical causes include:

Large key
- Inappropriate use of Tair and Redis may result in excessively large key values, for example, using a STRING key to store large binary files.
- Lack of workload planning prior to releasing a feature can lead to some keys having more members than necessary.
- Accumulation of invalid data: Not regularly deleting invalid data can cause the number of members for a HASH key to continually increase.
- Code failures in consumer applications using LIST keys can result in an ever-increasing number of members.
Hot key
- Unexpected traffic spikes can occur for various reasons, such as viral marketing, a surge of "likes" from a livestream audience, or a large-scale event in a game.

Potential issues caused by large keys and hotkeys

Category	Description
Large key	The amount of time it takes for a client to run a command is longer. When the memory of an instance reaches the maxmemory limit, operations may be blocked, important keys may be evicted, or memory overflow (OOM) may occur. The memory usage of a data shard in a Tair cluster instance far exceeds that of other data shards, which results in imbalanced memory usage across data shards in the instance. When a read request is made for a large key, the response time may increase and other services may be affected. This is because the bandwidth of the instance to which the key belongs is exhausted. The primary database may be blocked for an extended period of time when a large key is being deleted. This may lead to a synchronization failure or a master-replica switchover.
Hot key	Consumes a large amount of CPU resources and may increase network bandwidth usage, which affects other requests and reduces overall performance. Request skews may take place for Tair cluster instances. Request skews occur when one data shard in an instance receives many requests while other data shards in the instance remain idle. In this situation, the maximum number of connections to a data shard may be reached and new connections to the shard may be rejected. During flash sales, overselling may occur if the key corresponding to a commodity receives more requests than can be handled by the instance. If the request pressure on a hotkey exceeds the capacity of an instance, cache penetration may occur. In this case, many requests are directly sent to the backend storage layer, which causes a surge in storage access and even breakdowns, affecting other businesses.

How to prevent large keys and hotkeys from affecting business

Method	Description
Configure an alert rule	You can specify appropriate alert thresholds in the monitoring system for metrics such as the CPU utilization, memory usage, and connection usage of an instance. For example, you can specify 70% as the alert threshold for the memory usage of an instance and 20% as the alert threshold for the memory usage increase of the instance over a 1-hour period. When an alert is triggered, you can identify and optimize large keys and hotkeys as described earlier in this topic before they affect business. For more information, see alert settings.
Use Tair (enterprise edition) to avoid clearing invalid data	For large key scenarios of the HASH type, Tair (enterprise edition) provides an enhanced data structure named TairHash. TairHash supports an expiration time and a version for each field, which removes the limitation of the Redis HASH type that an expiration time can be set only for the entire key. TairHash also uses an efficient active expiration algorithm to check and delete expired fields with almost no impact on response time. By using TairHash properly, you can significantly reduce the maintenance burden, reduce the complexity of business code, and effectively address the issues caused by large keys and hotkeys.
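
Per-field expiration with TairHash might be used as sketched below. This assumes the TairHash commands EXHSET (with the EX option for a relative expiration time in seconds) and EXHGET, sent through a client that offers a generic `execute_command` method, as redis-py does. The function names are hypothetical, and you should verify the command syntax against the TairHash command reference for your instance version.

```python
def set_field_with_ttl(client, key, field, value, ttl_seconds):
    """Write one TairHash field that expires on its own after ttl_seconds."""
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")
    # Assumption: EXHSET key field value EX seconds, per the TairHash command set.
    return client.execute_command("EXHSET", key, field, value, "EX", ttl_seconds)


def get_field(client, key, field):
    """Read one TairHash field; a field that has expired is no longer returned."""
    return client.execute_command("EXHGET", key, field)
```

Because each field carries its own expiration time, the scheduled cleanup task that a plain HASH key would need can be dropped for data written this way.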