Troubleshoot high CPU utilization on an instance

In most cases, an increase in the CPU utilization of a Tair or Redis Open-Source Edition instance has one of the following causes: high-concurrency and high-throughput applications, unexpectedly high workloads, or improper resource usage.

High-concurrency and high-throughput applications consume more CPU resources. If the CPU utilization stays within the capacity of the instance, the increased CPU activity is considered normal. If a Redis Open-Source Edition instance does not have sufficient CPU resources to meet your business requirements, you can increase the number of shards or replicas, or upgrade the instance to a Tair (Enterprise Edition) instance to alleviate the resource bottleneck.

Improper resource usage, such as CPU-intensive commands, hotkeys, and large keys, can result in abnormally high CPU utilization. If the average CPU utilization is higher than 70% and the average peak CPU utilization within a 5-minute period is higher than 90%, the stability of the application may be affected. In this case, troubleshoot the issue as described in this topic.
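The 70% and 90% thresholds can be checked against exported monitoring data with a few lines of code. The following is a minimal sketch, not a Tair API: it assumes one CPU utilization sample per minute and reads "average peak within a 5-minute period" as the highest mean over any five consecutive samples. The function name and its parameters are hypothetical.

```python
from statistics import mean


def cpu_needs_attention(samples, window=5, avg_limit=70.0, peak_limit=90.0):
    """Return True if both thresholds from the text are exceeded.

    samples: CPU utilization percentages, assumed to be one per minute.
    window:  number of consecutive samples treated as "a 5-minute period".
    """
    if len(samples) < window:
        return False  # not enough data to evaluate a full window
    overall = mean(samples)
    # Highest mean over any run of `window` consecutive samples.
    peak_window = max(
        mean(samples[i:i + window])
        for i in range(len(samples) - window + 1)
    )
    return overall > avg_limit and peak_window > peak_limit
```

A result of True only flags the period for investigation. It does not identify the cause.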

What factors can cause an abnormal increase in CPU utilization?

CPU-intensive commands: commands that have a time complexity of O(N), where N is a large value. Examples: KEYS, HGETALL, MGET, MSET, HMSET, and HMGET. In most cases, a command that has a higher time complexity consumes more CPU resources. This increases CPU utilization.
Because commands are processed on a single thread, pending requests accumulate in the queue while a CPU-intensive command runs. This slows down the response of applications. In some cases, the instance is overwhelmed by pending requests, and applications are disconnected because the requests time out. Requests may then fall through to the backend databases and cause a cache avalanche.
Note
For information about the time complexity of each command, visit the Redis official website.
Hotkeys: A hotkey is a single key, or one of a small subset of keys, that is accessed far more frequently than other keys. Hotkeys can consume a substantial amount of CPU resources, which increases the access latency of other keys. If hotkeys in a cluster instance are concentrated on specific data shards, CPU utilization skew may occur. In this case, the CPU utilization of those shards is much higher than that of other shards.
Large keys: Large keys consume more memory. Access to large keys significantly increases the CPU load and traffic. Large keys are more prone to becoming hotspots, which can result in high CPU utilization. If large keys are concentrated on specific data shards, the CPU utilization, bandwidth usage, and memory usage may be skewed.
Short-lived connections: Frequent establishment of connections can lead to a substantial consumption of resources on an instance due to the overhead associated with connection handling.
AOFs: By default, append-only file (AOF) persistence is enabled for instances. If an instance is under heavy load, writing data to AOFs leads to increased CPU utilization and higher overall response latency of the instance.
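For the CPU-intensive commands described above, a common mitigation is to replace a single KEYS call with incremental SCAN calls, so that each call does a bounded amount of work and other requests can run in between. The following is a minimal sketch. It assumes a client object whose `scan(cursor=..., match=..., count=...)` method returns a `(next_cursor, keys)` pair, as the redis-py client does. The function name `scan_keys` is hypothetical.

```python
def scan_keys(client, pattern, count=100):
    """Yield keys that match `pattern` in small batches instead of one KEYS call.

    `client` is assumed to expose scan(cursor, match, count) and to return
    (next_cursor, keys); a next_cursor of 0 means the iteration is complete.
    """
    cursor = 0
    while True:
        cursor, keys = client.scan(cursor=cursor, match=pattern, count=count)
        yield from keys
        if cursor == 0:
            break
```

Note that `count` is only a hint to the server, and SCAN may return a key more than once. Deduplicate on the client side if that matters for your use case.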

Scenarios of high CPU utilization

The following scenarios may result in high CPU utilization:

During a specific period of time, the CPU utilization suddenly spikes to a high level, even reaching 100%. For information about the causes and solutions, see Sudden increase in CPU utilization.
A data node in an instance has higher CPU utilization compared with other data nodes. For information about the causes and solutions, see Inconsistent CPU utilization across data nodes.
A proxy node in an instance has higher CPU utilization compared with other proxy nodes. For information about the causes and solutions, see Inconsistent CPU utilization across proxy nodes.

Take appropriate measures to reduce the CPU utilization based on the scenario.

Sudden increase in CPU utilization

If the overall CPU utilization of an instance increases, perform the following steps to troubleshoot the issue:

Troubleshoot and disable CPU-intensive commands

Troubleshooting procedure

Use the performance monitoring feature to identify the period of time during which CPU utilization is high. For more information, see View performance monitoring data.
Use the following methods to identify CPU-intensive commands:
- The latency insights feature records the latency of all commands and custom special events on an instance. You can identify CPU-intensive commands that have longer response times based on the latency information collected over a specific period of time across different nodes. For more information, see Latency insights.
  Figure 1. Usage example of latency insights
- Slow query logs record commands that run longer than the specified threshold. You can identify CPU-intensive commands based on the slow queries and their execution durations that are collected over a specific period of time across different nodes. For more information, see Query slow logs.
  Figure 2. Slow log query example
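If you export slow log entries for offline analysis, grouping them by command name makes the most expensive commands stand out. The following is a minimal sketch. It assumes that each entry is a dictionary with a `command` field (text or bytes) and a `duration` field in microseconds, which is the shape that the redis-py `slowlog_get()` call returns. The function name `summarize_slow_log` is hypothetical.

```python
from collections import defaultdict


def summarize_slow_log(entries):
    """Group slow log entries by command name.

    Returns a list of (name, count, total_duration_us) tuples, sorted so that
    the command with the largest total duration comes first.
    """
    totals = defaultdict(lambda: [0, 0])  # name -> [count, total duration]
    for entry in entries:
        command = entry["command"]
        if isinstance(command, bytes):
            command = command.decode("utf-8", errors="replace")
        parts = command.split()
        name = parts[0].upper() if parts else "(unknown)"
        totals[name][0] += 1
        totals[name][1] += entry["duration"]
    return sorted(
        ((name, count, total) for name, (count, total) in totals.items()),
        key=lambda row: row[2],
        reverse=True,
    )
```

Commands that appear at the top of the result are candidates for the optimization or restriction described in the solutions.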

Solutions

Evaluate and disable high-risk commands and CPU-intensive commands, such as FLUSHALL, KEYS, and HGETALL. For more information, see Disable high-risk commands.
Optional: Use one of the following methods to change the instance architecture based on your business requirements:
- Change the architecture of the instance to read/write splitting to evenly distribute CPU-intensive commands or application requests. For more information about the read/write splitting architecture, see Read/write splitting instances.
- Convert the instance into a DRAM-based instance and use the multi-threading feature of DRAM-based instances to reduce the CPU utilization of the instance.
Note
For information about how to change the architecture and series type of an instance, see Change the configurations of an instance.

Troubleshoot and optimize short-lived connections

Troubleshooting procedure

Use the performance monitoring feature to identify the period of time during which CPU utilization is high. For more information, see View performance monitoring data.
On the performance monitoring dashboard, check whether CPU utilization is high and the number of connections is large while the queries per second (QPS) is lower than expected. If so, the instance is probably handling short-lived connections. In this case, use one of the following solutions:

Solutions

Change short-lived connections to persistent connections. For example, create a JedisPool connection pool. For more information, see Use a client to connect to an instance.
Convert the instance into a DRAM-based instance to optimize the processing of short-lived connections.
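The effect of a connection pool such as JedisPool can be illustrated with a small, library-independent sketch: connections are opened once and then reused, so the per-request connection overhead disappears. The class `SimplePool` and the `connect` factory are hypothetical and only show the idea. In production, use the pool provided by your client library.

```python
import queue
from contextlib import contextmanager


class SimplePool:
    """Keep up to `size` idle connections open and hand them out for reuse."""

    def __init__(self, connect, size=4):
        self._connect = connect                  # hypothetical factory that opens one connection
        self._idle = queue.LifoQueue(maxsize=size)
        self.opened = 0                          # number of connections actually created

    @contextmanager
    def connection(self):
        try:
            conn = self._idle.get_nowait()       # reuse an idle connection if one exists
        except queue.Empty:
            conn = self._connect()               # otherwise open a new one
            self.opened += 1
        try:
            yield conn
        finally:
            try:
                self._idle.put_nowait(conn)      # keep it open for the next caller
            except queue.Full:
                close = getattr(conn, "close", None)
                if close is not None:
                    close()                      # pool is full, so release the extra connection
```

With this sketch, a loop of sequential requests that each use `with pool.connection() as conn:` opens a single connection in total instead of one per request. The sketch does not validate connections before reuse, which a real pool would do.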

Disable AOF persistence

By default, AOF persistence is enabled for instances. If an instance is under heavy load, frequent AOF operations may increase CPU utilization.

Disable AOF persistence if this does not adversely affect your business. In addition, you can back up the data of an instance during off-peak hours or during the maintenance window to minimize the impact.

Warning

If you use a DRAM-based instance, you cannot use AOFs to restore data of the instance after you disable AOF persistence for the instance. In this case, the data flashback feature is unavailable. You can use only backup sets to restore data. For more information, see Restore data from a backup set to a new instance. Proceed with caution when you disable AOF persistence.

Evaluate the service performance

The preceding methods are used to optimize the performance of an instance. If the average CPU utilization still exceeds 70% during business operations, the instance may have a performance bottleneck.

To resolve this issue, check whether commands and requests from application hosts that may degrade the instance performance exist. If such commands or requests exist, you must optimize your business system. If no such commands or requests exist but the CPU utilization is still high, we recommend that you upgrade the instance specifications to ensure business stability. You can also upgrade the instance to a cluster instance or read/write splitting instance. For more information about how to upgrade an instance, see Change the configurations of an instance.

Note

To ensure business stability, we recommend that you purchase a pay-as-you-go instance and use it to run stress and compatibility tests before you upgrade the instance. You can release the pay-as-you-go instance after the tests are complete.

Inconsistent CPU utilization across data nodes

If specific data shards in an instance that uses the cluster architecture or read/write splitting architecture have high CPU utilization whereas other data shards have low CPU utilization, perform the following steps to troubleshoot the issue:

Troubleshoot and optimize hotkeys

Troubleshooting procedure

Use the performance monitoring feature to identify the period of time during which CPU utilization is high. For more information, see View performance monitoring data.
On the History tab of the Real-time Key Statistics page, select the data nodes that have high CPU utilization, select time filter conditions based on Step 1, and then click Search. You can view the hotkeys that show high CPU utilization within the time range. For more information, see Use the real-time key statistics feature.

Solutions

Enable the proxy query cache feature. After you enable this feature, proxy nodes cache the request and response data of hotkeys. If a proxy node receives a duplicate request during the validity period of the cached data, the proxy node directly returns a response to the client without interacting with backend data shards. This helps prevent skewed requests caused by hotkeys that receive a large number of read requests. For more information, see Use proxy query cache to address issues caused by hotkeys.
Note
The proxy query cache feature is available only for DRAM-based and persistent memory-optimized instances.
If hotkeys are generated from read requests, you can change the instance into a read/write splitting instance to reduce the read pressure imposed on each data shard of the instance.
Note
If a large number of requests are sent to a read/write splitting instance, some latency is unavoidable, and stale data may be read from the instance. Therefore, the read/write splitting architecture is not the optimal solution for scenarios that have high requirements for read and write capabilities and data consistency.
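The caching behavior described for the proxy query cache (answer duplicate requests from a cache until a validity period ends) can be illustrated with a small time-to-live cache. This sketch is not how the proxy nodes are implemented. The class `TtlCache`, the `fetch` callable, and the default validity period are assumptions made for illustration.

```python
import time


class TtlCache:
    """Serve repeated reads of the same key from memory until the entry expires."""

    def __init__(self, fetch, ttl_seconds=1.0, clock=time.monotonic):
        self._fetch = fetch          # hypothetical callable that reads from the data shard
        self._ttl = ttl_seconds      # validity period of a cached response
        self._clock = clock          # injectable so that the behavior can be tested
        self._entries = {}           # key -> (expires_at, value)
        self.backend_reads = 0       # number of reads that reached the backend

    def get(self, key):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]            # still valid: no backend access
        value = self._fetch(key)     # missing or expired: read from the backend
        self.backend_reads += 1
        self._entries[key] = (now + self._ttl, value)
        return value
```

The trade-off is visible in the code: within the validity period, a reader can receive a value that has since changed on the data shard, so a short validity period is the safer choice for frequently updated keys.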

Troubleshoot and disable CPU-intensive commands

Troubleshooting procedure

Use the performance monitoring feature to identify the period of time during which CPU utilization is high. For more information, see View performance monitoring data.
Use the following methods to identify CPU-intensive commands:
- The latency insights feature records the latency of all commands and custom special events on an instance. You can identify CPU-intensive commands that have longer response times based on the latency information collected over a specific period of time across different nodes. For more information, see Latency insights.
  Figure 1. Usage example of latency insights
- Slow query logs record commands that run longer than the specified threshold. You can identify CPU-intensive commands based on the slow queries and their execution durations that are collected over a specific period of time across different nodes. For more information, see Query slow logs.
  Figure 2. Slow log query example

Solution

Evaluate and disable high-risk commands and CPU-intensive commands, such as FLUSHALL, KEYS, and HGETALL. For more information, see Disable high-risk commands.

Troubleshoot and optimize large keys

Troubleshooting procedure

Use the performance monitoring feature to identify the period of time during which CPU utilization is high. For more information, see View performance monitoring data.
On the Cache Analysis page, click Analyze. Select a data node that has high CPU utilization and click OK. You can view the large keys that show high CPU utilization within the time range. For more information, see Use the offline key analysis feature.

Solution

Split large keys into small keys based on actual business requirements to evenly distribute the request load.
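One way to split a large hash is to spread its fields over a fixed number of smaller hashes and derive the target key from the field name. The following is a minimal sketch. The key naming scheme, the bucket count of 16, and the function name `bucket_key` are assumptions. Choose values that fit your data.

```python
import zlib


def bucket_key(base_key, field, buckets=16):
    """Map a hash field to one of `buckets` smaller keys, such as 'profile:7'.

    CRC32 is used only because it is stable across processes and restarts,
    so readers and writers always compute the same bucket for a field.
    """
    index = zlib.crc32(field.encode("utf-8")) % buckets
    return f"{base_key}:{index}"
```

A writer then stores a field in the hash named by `bucket_key("profile", field)` instead of in the single large hash `profile`, and a reader computes the same key before it reads. In a cluster instance, the smaller keys generally map to different slots, which spreads the load across data shards.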

Inconsistent CPU utilization across proxy nodes

If specific proxy nodes in an instance that uses the cluster architecture or read/write splitting architecture have high CPU utilization whereas other proxy nodes have low CPU utilization, perform the following steps to troubleshoot the issue:

Troubleshooting procedure

On the Proxy Node tab of the Performance Trends page, check whether connection usage is evenly distributed. For more information, see View performance trends.

Solutions

Perform one of the following operations based on whether connection usage is evenly distributed:

If connection usage is unevenly distributed, restart the client on which business applications are deployed, or restart the proxy node, to redistribute connections. For more information about how to restart a proxy node, see Restart or rebuild proxy nodes.
If connection usage is evenly distributed, the uneven CPU utilization is usually caused by large-scale pipeline or batch operations. You can reduce the scale of pipeline or batch operations, for example, by splitting the operations into multiple smaller operations.
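Splitting a batch operation into smaller operations can be sketched as follows. The chunk size of 100 and the function names are assumptions. The `client` object is assumed to expose an `mget(keys)` method that accepts a list of keys, as the redis-py client does.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items from `items`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def mget_in_chunks(client, keys, size=100):
    """Read many keys with several small MGET calls instead of one large call."""
    values = []
    for chunk in chunked(keys, size):
        values.extend(client.mget(chunk))
    return values
```

Smaller chunks cost more round trips but keep each request short, so a single request occupies a proxy node for less time.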
