Tair (Redis® OSS-Compatible): Troubleshoot high CPU utilization on a Tair instance

Last Updated: Oct 30, 2024

In most cases, an increase in the CPU utilization of a Tair (including Redis Open-Source Edition) instance has one of the following causes: high-concurrency and high-throughput applications, unexpectedly high workloads, or improper resource usage.

High-concurrency and high-throughput applications consume more CPU resources. If the CPU utilization is within the capacity of the system, the increased CPU activity is considered normal. If the Tair instance does not have sufficient CPU resources to meet business requirements, you can increase the number of shards or replicas, or upgrade the instance to a Tair (Enterprise Edition) instance, to alleviate the resource bottleneck.

Improper use of resources, such as CPU-intensive commands, hotkeys, and large keys, can result in abnormally high CPU utilization. If the average CPU utilization is higher than 70% and the average peak CPU utilization within a 5-minute period is higher than 90%, the stability of the application may be affected. Monitor this condition closely and troubleshoot it.
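The two thresholds above can be expressed as a small check. The following is a minimal sketch, not a Tair API: the function name and the input format (one CPU sample per minute, in percent) are assumptions, and "average peak within a 5-minute period" is interpreted here as the highest mean over any 5 consecutive samples.

```python
from statistics import mean

AVG_THRESHOLD = 70.0   # percent, taken from the text above
PEAK_THRESHOLD = 90.0  # percent, taken from the text above
WINDOW_MINUTES = 5


def cpu_needs_attention(per_minute_cpu: list[float]) -> bool:
    """Return True when both thresholds are exceeded.

    Assumption: per_minute_cpu holds one utilization sample per minute, and
    the 5-minute peak is the highest mean over 5 consecutive samples.
    """
    if len(per_minute_cpu) < WINDOW_MINUTES:
        return False  # not enough data to evaluate a 5-minute window
    overall = mean(per_minute_cpu)
    peak_window = max(
        mean(per_minute_cpu[i:i + WINDOW_MINUTES])
        for i in range(len(per_minute_cpu) - WINDOW_MINUTES + 1)
    )
    return overall > AVG_THRESHOLD and peak_window > PEAK_THRESHOLD
```

A sustained 75% load without a spike does not trip this check, because the second condition is not met. Adjust the interpretation of the 5-minute peak to match how your monitoring data is aggregated.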

What factors can cause an abnormal increase in CPU utilization?

  • CPU-intensive commands: commands that have a time complexity of O(N), where N is a large value. Examples: KEYS, HGETALL, MGET, MSET, HMSET, and HMGET. In most cases, a command that has a higher time complexity consumes more CPU resources. This increases CPU utilization.

    If Tair runs CPU-intensive commands, pending requests accumulate in the queue due to single-threading. This slows down the response of applications. In specific cases, a Tair instance may be overwhelmed by pending requests. An application may be disconnected due to these requests timing out. In addition, requests may be directly forwarded to backend databases and cause a cache avalanche.

    Note

    For information about the time complexity of each command, visit the Redis official website.

  • Hotkeys: A hotkey is a single key, or one of a small subset of keys, that is accessed much more frequently than other keys. Hotkeys can consume a substantial amount of the CPU resources of Tair, which increases the access latency of other keys. If hotkeys in a cluster instance are concentrated on specific data shards, CPU utilization skew may occur. In this case, the CPU utilization of specific shards is much higher than the CPU utilization of other shards.

  • Large keys: Large keys consume more memory. Access to large keys significantly increases the CPU load and traffic of Tair. Large keys are more prone to becoming hotspots, which can result in high CPU utilization. If large keys are concentrated on specific data shards, the CPU utilization, bandwidth usage, and memory usage may be skewed.

  • Short-lived connections: Frequent establishment of connections can lead to a substantial consumption of resources on a Tair instance due to the overhead associated with connection handling.

  • AOFs: By default, append-only file (AOF) persistence is enabled for Tair instances. If an instance is under heavy load, writing data to AOFs leads to increased CPU utilization and higher overall response latency of the instance.
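To illustrate the first cause, a full-keyspace KEYS call can usually be replaced by cursor-based SCAN iteration, which returns a bounded batch per call and lets other requests run between batches. The sketch below assumes a client object whose `scan(cursor, match, count)` method returns `(next_cursor, keys)`, which is the shape used by the redis-py client. `InMemoryClient` is a hypothetical stand-in so that the sketch runs without a server.

```python
import fnmatch


def iter_keys(client, pattern="*", batch=100):
    """Yield matching keys in small batches instead of one blocking KEYS call."""
    cursor = 0
    while True:
        cursor, keys = client.scan(cursor=cursor, match=pattern, count=batch)
        yield from keys
        if cursor == 0:  # the server signals the end of iteration with cursor 0
            break


class InMemoryClient:
    """Hypothetical stand-in that mimics the (cursor, keys) contract of SCAN."""

    def __init__(self, keys):
        self._keys = list(keys)

    def scan(self, cursor=0, match=None, count=10):
        chunk = self._keys[cursor:cursor + count]
        next_cursor = cursor + count
        if next_cursor >= len(self._keys):
            next_cursor = 0
        if match is not None:
            chunk = [k for k in chunk if fnmatch.fnmatchcase(k, match)]
        return next_cursor, chunk
```

With redis-py, the built-in `scan_iter` method performs the same loop. The same idea applies to large hashes, where HSCAN can take the place of HGETALL.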

Scenarios of high CPU utilization

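The following scenarios may result in high CPU utilization:

  • Sudden increase in CPU utilization

  • Inconsistent CPU utilization across data nodes

  • Inconsistent CPU utilization across proxy nodes

Take appropriate measures to reduce the CPU utilization based on the scenario.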

Sudden increase in CPU utilization

If the overall CPU utilization of an instance increases, perform the following steps to troubleshoot the issue:

Troubleshoot and disable CPU-intensive commands

Troubleshooting procedure

  1. Use the performance monitoring feature to identify the period of time during which CPU utilization is high. For more information, see View performance monitoring data.

  2. Use the following methods to identify CPU-intensive commands:

    • The latency insights feature records the latency of all commands and custom special events on a Tair instance. You can identify CPU-intensive commands that have longer response times based on the latency information collected over a specific period of time across different nodes. For more information, see Latency insights.

      Figure 1. Usage example of latency insights

    • Slow query logs record commands that run longer than the specified threshold. You can identify CPU-intensive commands based on the slow queries and their execution durations that are collected over a specific period of time across different nodes. For more information, see Query slow logs.

      Figure 2. Slow log query example
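If you export slow log entries for offline review, grouping them by command name makes the most expensive commands stand out. The following is a minimal sketch under stated assumptions: each entry is a dict with a `command` field (the full command line, as text or bytes) and a `duration` field in microseconds. This shape is loosely modeled on what the redis-py `slowlog_get()` call returns, and the function name is hypothetical.

```python
from collections import defaultdict


def summarize_slow_log(entries):
    """Group slow log entries by command name, sorted by total time spent."""
    totals = defaultdict(lambda: {"count": 0, "total_us": 0, "max_us": 0})
    for entry in entries:
        command = entry["command"]
        if isinstance(command, bytes):
            command = command.decode("utf-8", errors="replace")
        # The first token of the command line is the command name.
        name = command.split(maxsplit=1)[0].upper() if command.strip() else "<empty>"
        stats = totals[name]
        stats["count"] += 1
        stats["total_us"] += entry["duration"]
        stats["max_us"] = max(stats["max_us"], entry["duration"])
    return sorted(totals.items(), key=lambda item: item[1]["total_us"], reverse=True)
```

Commands at the top of the returned list are the first candidates to evaluate for replacement or disabling.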

Solutions

  • Evaluate and disable high-risk commands and CPU-intensive commands, such as FLUSHALL, KEYS, and HGETALL. For more information, see Disable high-risk commands.

  • Optional: Use one of the following methods to change the instance architecture based on your business requirements:

    • Change the architecture of the instance to read/write splitting to evenly distribute CPU-intensive commands or application requests. For more information about the read/write splitting architecture, see Read/write splitting instances.

    • Convert the instance into a Tair DRAM-based instance and use the multi-threading feature of DRAM-based instances to reduce the CPU utilization of the instance.

    Note

    For information about how to change the architecture and series type of an instance, see Change the configurations of an instance.

Troubleshoot and optimize short-lived connections

Troubleshooting procedure

  1. Use the performance monitoring feature to identify the period of time during which CPU utilization is high. For more information, see View performance monitoring data.

  2. On the performance monitoring dashboard, check whether CPU utilization is high while the number of connections is large but the queries per second (QPS) are lower than expected. If so, short-lived connections may exist. In this case, use one of the following solutions:

Solutions
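A common way to avoid short-lived connections, shown here as an illustrative sketch and not as a documented Tair procedure, is to open a fixed set of persistent connections once and reuse them for every request. `SimplePool` and its `connect` argument are hypothetical names. In practice, most clients ship their own pool, for example `redis.ConnectionPool` in redis-py.

```python
import queue
from contextlib import contextmanager


class SimplePool:
    """Keep a fixed number of connections open and hand them out for reuse."""

    def __init__(self, connect, size=4):
        # connect is any zero-argument callable that opens one connection.
        self._idle = queue.LifoQueue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def connection(self, timeout=5.0):
        # Blocks until a connection is free; raises queue.Empty on timeout.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)  # return the connection instead of closing it
```

With this pattern, the number of connection handshakes stays at the pool size regardless of how many requests the application sends.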

Disable AOF persistence

By default, AOF persistence is enabled for Tair instances. If an instance is under heavy load, frequent AOF operations may increase CPU utilization.

Disable AOF persistence if this does not adversely affect your business. In addition, you can back up the data of the Tair instance during off-peak hours or during the maintenance window to minimize the impact.

Warning

If your instance is a Tair DRAM-based instance, you cannot use AOFs to restore data of the Tair instance after you disable AOF persistence for the instance. In this case, the data flashback feature is unavailable. You can use only backup sets to restore data. For more information, see Restore data from a backup set to a new instance. Proceed with caution when you disable AOF persistence.

Evaluate the service performance

The preceding methods are used to optimize the performance of a Tair instance. If the average CPU utilization still exceeds 70% during business operations, the instance may have a performance bottleneck.

To resolve this issue, check whether application hosts send commands or requests that may degrade the instance performance. If such commands or requests exist, you must optimize your business system. If no such commands or requests exist but the CPU utilization is still high, we recommend that you upgrade the instance specifications to ensure business stability. You can also upgrade the instance to a cluster instance or read/write splitting instance. For more information about how to upgrade an instance, see Change the configurations of an instance.

Note

To ensure business stability, we recommend that you purchase a pay-as-you-go instance to run stress and compatibility tests before you upgrade the instance. You can release the pay-as-you-go instance after the tests are complete.

Inconsistent CPU utilization across data nodes

If specific data shards in a Tair instance that uses the cluster architecture or read/write splitting architecture have high CPU utilization whereas other data shards have low CPU utilization, perform the following steps to troubleshoot the issue:

Troubleshoot and optimize hotkeys

Troubleshooting procedure

  1. Use the performance monitoring feature to identify the period of time during which CPU utilization is high. For more information, see View performance monitoring data.

  2. On the History tab of the Real-time Key Statistics page, select the data nodes that have high CPU utilization, select time filter conditions based on Step 1, and then click Search. You can view the hotkeys that show high CPU utilization within the time range. For more information, see Use the real-time key statistics feature.

Solutions

  • Enable the proxy query cache feature. After you enable this feature, proxy nodes cache the request and response data of hotkeys. If a proxy node receives a duplicate request during the validity period of the cached data, the proxy node directly returns a response to the client without the need to interact with backend data shards. This helps prevent skewed requests caused by hotkeys that receive a large number of read requests. For more information, see Use proxy query cache to address issues caused by hotkeys.

    Note

    The proxy query cache feature is available only for DRAM-based and persistent memory-optimized instances.

  • If hotkeys are generated from read requests, you can change the instance into a read/write splitting instance to reduce the read pressure imposed on each data shard of the instance.

    Note

    If a large number of requests are sent to a read/write splitting instance, some latency is unavoidable, and stale (dirty) data may be read from the instance. Therefore, the read/write splitting architecture is not the optimal solution for scenarios that have high requirements for read and write capabilities and data consistency.
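The caching idea behind the proxy query cache can be illustrated with a small time-to-live (TTL) read cache. This is a conceptual sketch only, not how Tair proxy nodes are implemented: `TtlReadCache`, `fetch`, and `ttl_seconds` are hypothetical names, and `fetch` stands for whatever call reads the key from a data shard.

```python
import time


class TtlReadCache:
    """Serve repeated reads of the same key from memory for a short period."""

    def __init__(self, fetch, ttl_seconds=1.0, clock=time.monotonic):
        self._fetch = fetch      # callable that reads a key from the backend
        self._ttl = ttl_seconds  # validity period of a cached response
        self._clock = clock
        self._entries = {}       # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # still valid: no backend request is made
        value = self._fetch(key)
        self._entries[key] = (now + self._ttl, value)
        return value
```

Within the validity period, repeated reads of a hotkey never reach the backend, which is why this kind of cache reduces read skew. The trade-off is that a read may return a value that is up to one validity period old.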

Troubleshoot and disable CPU-intensive commands

Troubleshooting procedure

  1. Use the performance monitoring feature to identify the period of time during which CPU utilization is high. For more information, see View performance monitoring data.

  2. Use the following methods to identify CPU-intensive commands:

    • The latency insights feature records the latency of all commands and custom special events on a Tair instance. You can identify CPU-intensive commands that have longer response times based on the latency information collected over a specific period of time across different nodes. For more information, see Latency insights.

      Figure 1. Usage example of latency insights

    • Slow query logs record commands that run longer than the specified threshold. You can identify CPU-intensive commands based on the slow queries and their execution durations that are collected over a specific period of time across different nodes. For more information, see Query slow logs.

      Figure 2. Slow log query example

Solution

Evaluate and disable high-risk commands and CPU-intensive commands, such as FLUSHALL, KEYS, and HGETALL. For more information, see Disable high-risk commands.

Troubleshoot and optimize large keys

Troubleshooting procedure

  1. Use the performance monitoring feature to identify the period of time during which CPU utilization is high. For more information, see View performance monitoring data.

  2. On the Cache Analysis page, click Analyze. Select a data node that has high CPU utilization and click OK. You can view the large keys that show high CPU utilization within the time range. For more information, see Use the offline key analysis feature.

Solution

Split large keys into small keys based on actual business requirements to evenly distribute the request load.
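One way to split a large hash, sketched below under stated assumptions, is to route each field to one of several smaller hashes by using a stable checksum of the field name. The key naming scheme (`base_key:index`), the bucket count, and the function names are hypothetical. Choose them to fit your data model.

```python
import zlib


def bucket_key(base_key: str, field: str, buckets: int = 16) -> str:
    """Map a field to one of `buckets` smaller keys, deterministically."""
    if buckets < 1:
        raise ValueError("buckets must be at least 1")
    index = zlib.crc32(field.encode("utf-8")) % buckets
    return f"{base_key}:{index}"


def split_hash(base_key: str, mapping: dict, buckets: int = 16) -> dict:
    """Regroup one large field/value mapping into per-bucket mappings."""
    parts = {}
    for field, value in mapping.items():
        parts.setdefault(bucket_key(base_key, field, buckets), {})[field] = value
    return parts
```

Readers and writers must both call `bucket_key` to locate a field. Because each bucket is a separate key, a cluster instance can place the buckets on different data shards, which spreads the request load.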

Inconsistent CPU utilization across proxy nodes

If specific proxy nodes in a Tair instance that uses the cluster architecture or read/write splitting architecture have high CPU utilization whereas other proxy nodes have low CPU utilization, perform the following steps to troubleshoot the issue:

Troubleshooting procedure

On the Proxy Node tab of the Performance Trends page, check whether connection usage is evenly distributed. For more information, see View performance trends.

Solutions

Perform one of the following operations based on whether connection usage is evenly distributed:

  • If connection usage is unevenly distributed, restart the client or proxy node on which business applications are deployed to redistribute connections. For more information about how to restart a proxy node, see Restart or rebuild proxy nodes.

  • If connection usage is evenly distributed, the uneven CPU utilization is usually caused by a large scale of pipeline or batch operations. You can reduce the scale of pipeline or batch operations, for example, by splitting the operations into multiple smaller operations.
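Splitting a large pipeline or batch into smaller operations can be sketched as a generic chunking helper. This is a minimal illustration. The helper name and the chunk size are assumptions, and the right size depends on your value sizes and latency targets.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk
```

For example, instead of sending one MGET with 10,000 keys, an application can loop over `chunked(keys, 200)` and send one smaller MGET or pipeline per chunk.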