Troubleshoot and resolve high CPU utilization or high CPU load issues on Linux ECS instances - Elastic Compute Service

This topic provides common cases of high CPU utilization or high CPU load issues that occur on Linux Elastic Compute Service (ECS) instances, and describes how to troubleshoot and resolve the issues.

Troubleshooting procedure

When you use a Linux ECS instance, the CPU utilization or CPU load of the instance may stay high. To troubleshoot and resolve the issue, perform the following steps:

Find the process that causes high CPU utilization or high CPU load.
Check whether the process that causes high CPU utilization or high CPU load is running as expected, and handle the issue based on the status of the process.
- If the process is running as expected, you can optimize the corresponding program or upgrade the instance type. For more information, see Upgrade the instance types of subscription instances or Change the instance type of a pay-as-you-go instance.
- If the process is not running as expected, you can manually check and terminate the process. You can also use a third-party security tool to check and terminate the process.
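The first step can also be done without an interactive tool. The following is a minimal sketch, not part of the original procedure. It assumes a ps implementation that accepts the -e and -o options (such as procps). The helper name rank_by_cpu and the row limit of 5 are illustrative.

```shell
#!/bin/sh
# Rank "PID %CPU COMMAND" rows read from stdin by %CPU, highest first,
# and print the first N rows (default 5). rank_by_cpu is an illustrative name.
rank_by_cpu() {
    sort -k2,2nr | head -n "${1:-5}"
}

# On a live system, feed the helper from ps. The "=" after each column name
# suppresses the header line so that only data rows are sorted.
if command -v ps >/dev/null 2>&1; then
    ps -eo pid=,pcpu=,comm= 2>/dev/null | rank_by_cpu 5
fi
```

The %CPU value that ps reports is the CPU time used divided by the time the process has been running, so a short spike is easier to see in top than in this listing.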

Query and analyze CPU loads

In a Linux system, you can run the following commands to view processes and system load. This topic describes only the vmstat (short for virtual memory statistics) and top commands.

vmstat
top
ps -aux
ps -ef

Run the vmstat command to query and analyze CPU loads

Run the vmstat command to view the overall statistics about virtual memory, processes, and CPUs in the operating system.

Command syntax

The following code shows the syntax of the vmstat command:

vmstat [-n] [delay [count]]

Note

[-n]: specifies that the header is displayed only once.
[delay]: specifies the refresh interval, in seconds. If you do not specify this parameter, only one result is displayed.
[count]: specifies the number of refreshes. If you do not specify the number of refreshes but specify the refresh interval, the number of refreshes is unlimited.

Example

Run the following vmstat command to collect system-wide statistics once per second, four times. The first line of the output reports averages since the last boot. Each later line reports the most recent 1-second interval.

vmstat -n 1 4

Sample output:

```
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 2684984 310452 2364304    0    0     5    17   19   35  4  2 94  0  0
 0  0      0 2687504 310452 2362268    0    0     0   252 1942 4326  5  2 93  0  0
 0  0      0 2687356 310460 2362252    0    0     0    68 1891 4449  3  2 95  0  0
 0  0      0 2687252 310460 2362256    0    0     0     0 1906 4616  4  1 95  0  0
```

Fields in the command output:

r: indicates the number of runnable processes, which are processes that are running or waiting for CPU time. A CPU core can run only one thread at a time. If the value stays larger than the number of CPU cores, processes are queuing for the CPU and the system runs slower.
us: indicates the percentage of CPU time consumed in user mode. A larger value indicates that more CPU time is consumed by user processes. If the value exceeds 50% for an extended period of time, consider optimizing the program algorithm or code.
sy: indicates the percentage of CPU time consumed in kernel mode.
wa: indicates the percentage of CPU time spent waiting for I/O operations to complete. If the value is high, the system experiences significant I/O waits. This may be caused by extensive random access to disks or a bottleneck in disk performance.
id: indicates the percentage of idle CPU time. If the value of this field remains 0 and the value of the sy field is twice the value of the us field, the CPU resources in the system are insufficient.
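To read several samples at a glance, the fields above can be averaged. The following is a minimal sketch under stated assumptions: the helper name summarize_vmstat is illustrative, and the columns are located by the header names that vmstat prints. The first sample is skipped because it reports averages since the last boot.

```shell
#!/bin/sh
# Average the r, us, sy, id, and wa columns of vmstat output read from stdin.
# summarize_vmstat is an illustrative helper name.
summarize_vmstat() {
    awk '
        # Header row: remember the position of each column by name.
        $1 == "r" { for (i = 1; i <= NF; i++) col[$i] = i; next }
        # Data rows start with a number (the r column).
        $1 ~ /^[0-9]+$/ && ("us" in col) {
            seen++
            if (seen == 1) next   # first sample is the average since boot
            n++
            r  += $(col["r"]);  us += $(col["us"]); sy += $(col["sy"])
            id += $(col["id"]); wa += $(col["wa"])
        }
        END {
            if (n == 0) { print "no samples"; exit 1 }
            printf "samples=%d r=%.1f us=%.1f sy=%.1f id=%.1f wa=%.1f\n",
                   n, r / n, us / n, sy / n, id / n, wa / n
        }'
}

# Demo on the sample output shown above.
# On a live system: vmstat -n 1 4 | summarize_vmstat
summarize_vmstat <<'EOF'
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 2684984 310452 2364304    0    0     5    17   19   35  4  2 94  0  0
 0  0      0 2687504 310452 2362268    0    0     0   252 1942 4326  5  2 93  0  0
 0  0      0 2687356 310460 2362252    0    0     0    68 1891 4449  3  2 95  0  0
 0  0      0 2687252 310460 2362256    0    0     0     0 1906 4616  4  1 95  0  0
EOF
```

For the sample output, the three interval samples average to an idle value of about 94%, which indicates that the CPU is not under pressure.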

Run the top command to query and analyze CPU loads

The top command is a common performance analysis tool in Linux. The command displays the resource usage of each process in the system in real time.

Command syntax

top [-n count] [-d delay]

Note

[-n]: specifies the number of refreshes. If you do not specify the number of refreshes but specify the refresh interval, the number of refreshes is unlimited.

[-d]: specifies the refresh interval.

Example

Connect to a Linux ECS instance.
For more information, see Connection method overview.
Run the following command to view the resource usage of each process in the system.
The command refreshes the resource usage of each process every 2 seconds and exits after five refreshes.
```
top -n 5 -d 2
```
Sample output:
```
top - 17:27:13 up 27 days,  3:13,  1 user,  load average: 0.02, 0.03, 0.05
Tasks:  94 total,   1 running,  93 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.3 us,  0.1 sy,  0.0 ni, 99.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.1 st
KiB Mem:   1016656 total,   946628 used,    70028 free,   169536 buffers
KiB Swap:        0 total,        0 used,        0 free.   448644 cached Mem
PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
1 root        20   0   41412   3824   2308 S  0.0  0.4   0:19.01 systemd
2 root        20   0       0      0      0 S  0.0  0.0   0:00.04 kthreadd
```
Fields in the command output:
For CPU utilization and CPU load issues, you need to focus only on the first and third lines of the command output.
- On the first line, 17:27:13 up 27 days, 3:13, 1 user, load average: 0.02, 0.03, 0.05 is displayed, which indicates the current system time, the system uptime, the number of logon users, and the system load. The three load average values cover the last 1, 5, and 15 minutes.
- On the third line, the overall usage of current CPU resources is displayed. The resource usage of each process is displayed below the line.
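The load average is easier to judge when it is compared with the number of CPU cores. The following is a minimal sketch. The helper name load_per_core is illustrative, and the interpretation that a sustained value above 1.00 per core means that tasks are queuing is a common rule of thumb, not a statement from this topic.

```shell
#!/bin/sh
# Print a load average divided by the number of CPU cores.
# Arguments: $1 = load average, $2 = number of cores.
# load_per_core is an illustrative helper name.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# On a live system, read the 1-minute load average from /proc/loadavg
# and the core count from nproc.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    read -r load1 _rest < /proc/loadavg
    load_per_core "$load1" "$(nproc)"
fi
```

For example, a load average of 6 on a 4-core instance gives 1.50 per core.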
When the top command runs, you can press the following keys to adjust the command output:
- Press the P key to sort processes in descending order of CPU utilization. This way, you can quickly find the processes that consume a large amount of CPU resources in the system.
- Press the M key to sort processes by memory usage.
- If a CPU has multiple cores, press the 1 key to display the load status of each CPU core.
Run the ll /proc/PID/exe command to view the program file that corresponds to a process. Replace PID with the process ID (PID) of the process. On many distributions, ll is an alias for ls -l.
Terminate the processes that consume a large amount of CPU resources.
1. Enter k.
2. Enter the PID of the process that you want to terminate and press the Enter key.
  The default PID is the first PID in the command output. If you want to terminate the process whose PID is 23, enter 23 and press the Enter key, as shown in the following figure.
3. After the operation is complete, a message similar to Send pid 23 signal [15/sigterm] appears. Press the Enter key to confirm.
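The same termination can be performed outside of top. The following is a minimal sketch that sends signal 15 (SIGTERM), the signal shown in the preceding message. The helper name terminate_pid is illustrative, and the PID 23 in the usage comment is the example value used above.

```shell
#!/bin/sh
# Send SIGTERM (signal 15) to one process.
# terminate_pid is an illustrative helper name.
terminate_pid() {
    # Accept only a non-empty string of digits as the PID.
    case "$1" in
        ''|*[!0-9]*) echo "usage: terminate_pid PID" >&2; return 2 ;;
    esac
    # Show the program file behind the PID first, if /proc exposes it.
    ls -l "/proc/$1/exe" 2>/dev/null || true
    kill -TERM "$1"
}

# Usage on a live system (terminates the process whose PID is 23):
# terminate_pid 23
```

SIGTERM asks the process to exit and lets it clean up. A process that ignores the signal keeps running.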

Common cases of high CPU resource usage

Case 1: The CPU utilization is low but the CPU load is high

Problem description

No business programs are running on a Linux ECS instance. The top command shows that the CPU utilization is low, but the CPU load (load average) is high, as shown in the following figure.

Cause

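A large number of processes are in the uninterruptible sleep (D) state, typically because they are waiting for I/O or for another resource.

The load average is used to evaluate the CPU load. A higher value indicates a longer task queue and more tasks waiting to be executed. On Linux, the load average counts processes in the uninterruptible sleep state in addition to runnable processes, so the value can be high even when the CPU utilization is low.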

Solution

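Run the ps -axjf command and check the STAT column for processes in the D state, which is the uninterruptible sleep state. The state is displayed as D or D+, where the plus sign (+) marks a process in the foreground process group.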

Processes in this state cannot be terminated or automatically exit. To resolve the issue that the system has processes in the D+ state, you can restore the dependency resources of the processes or restart the system.
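To list only the processes in the D state, the ps output can be filtered. The following is a minimal sketch. It assumes a ps implementation that accepts the -e and -o options (such as procps), and the helper name filter_d_state is illustrative.

```shell
#!/bin/sh
# Keep only rows whose first field (the STAT column) starts with "D".
# filter_d_state is an illustrative helper name.
filter_d_state() {
    awk '$1 ~ /^D/'
}

# On a live system, print the state, PID, and command name of each process
# in the D state. No output means that no such process exists right now.
if command -v ps >/dev/null 2>&1; then
    ps -eo stat=,pid=,comm= 2>/dev/null | filter_d_state
fi
```

A process can enter the D state briefly during normal I/O, so run the check several times and look for PIDs that stay in the list.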

Case 2: The kswapd0 process consumes a large amount of CPU resources

Problem description

A Linux ECS instance responds more slowly than expected. The top command shows that the kswapd0 process consumes 99% of the CPU resources.

Cause

Constant page swapping by the system causes the consumption of a large amount of CPU resources.

The kswapd0 process is a virtual memory management process that is responsible for page swapping. When the physical memory of the ECS instance is insufficient, the kswapd0 process performs a page-swapping operation. The operation consumes a large amount of CPU resources.

Solution

Modify the vm.swappiness kernel parameter to control how readily the kernel uses the swap space.

Log on to the Linux ECS instance.
For more information, see Connection method overview.
View the swappiness parameter.
```
cat /proc/sys/vm/swappiness
```
A command output similar to the one in the following figure is returned. In this example, the returned value is 40. The value is a relative weight that controls how readily the kernel swaps out memory pages instead of reclaiming page cache. It is not a fixed percentage of memory usage at which swapping starts.
Note
A smaller swappiness value indicates that the kernel uses less swap space and more physical memory. A larger swappiness value indicates that the kernel uses more swap space and less physical memory.
Modify the swappiness parameter based on your business requirements.
1. Open the kernel parameter configuration file named sysctl.conf.
```
vi /etc/sysctl.conf
```
2. Set the vm.swappiness parameter to the value that you require.
  For example, configure vm.swappiness = 10 in the sysctl.conf configuration file.
3. Press the Esc key and enter :wq to save the changes and exit the file.
4. Reload the sysctl.conf configuration file for the new configurations to take effect.
```
sysctl -p
```
  If the issue persists, we recommend that you upgrade the instance type of the ECS instance. For more information, see Upgrade the instance types of subscription instances or Change the instance type of a pay-as-you-go instance.
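The edit to sysctl.conf can also be scripted instead of made in vi. The following is a minimal sketch. The helper name set_sysctl_conf is illustrative, and the value 10 is the example value used above. The helper replaces an existing line for the key or appends a new line if the key is absent.

```shell
#!/bin/sh
# Set "KEY = VALUE" in a sysctl-style configuration file.
# Arguments: $1 = key, $2 = value, $3 = file.
# set_sysctl_conf is an illustrative helper name.
set_sysctl_conf() {
    key=$1 value=$2 file=$3
    tmp=$(mktemp) || return 1
    awk -v k="$key" -v v="$value" '
        # Match a line that starts with the key, followed by optional
        # blanks and "=". Commented-out lines are left untouched.
        index($0, k) == 1 && substr($0, length(k) + 1) ~ /^[ \t]*=/ {
            if (!done) print k " = " v
            done = 1
            next
        }
        { print }
        END { if (!done) print k " = " v }
    ' "$file" > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Usage on a live system (requires root privileges):
# set_sysctl_conf vm.swappiness 10 /etc/sysctl.conf && sysctl -p
# To change only the running kernel without editing the file:
# sysctl -w vm.swappiness=10
```

The helper writes the result back with cat rather than mv so that the ownership and permissions of the original file are kept.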

Case 3: The CPU utilization reached 100%

Problem description

The CPU utilization of a Linux ECS instance reached 100%. You cannot run commands such as top and htop to query processes that consume CPU resources.

Cause

The issue may be caused by a virus.

Solution

View monitoring data in the CloudMonitor console.
1. Log on to the CloudMonitor console.
2. In the left-side navigation pane, click Host Monitoring.
3. Find the ECS instance on which the issue occurred and click Monitoring Charts in the Actions column.
4. On the OS Monitoring tab, view and record the CPU utilization of the ECS instance at each point in time.
View the command modification records of the Linux ECS instance.
1. Log on to the Linux ECS instance.
  For more information, see Connection method overview.
2. Run the following command to check whether the commands of the Linux system were recently modified:
```
stat /usr/bin/top
```
  A command output similar to the one in the following figure is returned. If the Modify or Change timestamp is recent and you did not update the system at that time, the command file may have been tampered with. Check whether the points in time at which the commands were modified are consistent with the points in time at which the CPU utilization was 100%.
On an RPM-based distribution, run the following commands to check whether the files of the packages that provide the ps and top commands were modified after installation:
```
rpm -Vf /bin/ps
rpm -Vf /usr/bin/top
```
- If the instance runs as expected, no modification information is returned.
- If an exception occurs on the instance, a command output similar to the command output in the following figure may be returned. The command output indicates that the ps and top commands were modified.
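In the output of rpm -V, each line starts with a string of attribute flags, and a 5 in the third position means that the file digest differs from the packaged one. The following is a minimal sketch that picks out such lines. The helper name flag_changed_digest is illustrative, and the two sample lines are hypothetical, not output captured from a real instance.

```shell
#!/bin/sh
# Read rpm -V output from stdin and report files whose digest changed.
# The first field holds the attribute flags; position 3 is "5" when the
# file content no longer matches the package. Paths that contain spaces
# are not handled. flag_changed_digest is an illustrative helper name.
flag_changed_digest() {
    awk 'length($1) >= 8 && substr($1, 3, 1) == "5" { print "content changed: " $NF }'
}

# Demo on hypothetical sample lines.
# On a live system: rpm -Vf /bin/ps /usr/bin/top | flag_changed_digest
flag_changed_digest <<'EOF'
S.5....T.    /usr/bin/top
.......T.  c /etc/sysctl.conf
EOF
```

A changed digest on a binary such as top or ps is a stronger warning sign than a changed timestamp alone, because the timestamp can change without the content changing.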
Run the following command to check whether the instance is connected to an invalid domain name:
```
iftop -i [$Device] -n -P
```
Note
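Replace the [$Device] variable with the network interface controller (NIC) used by the current system, such as eth0. The -P option displays ports. The -n option disables hostname lookups, so peers are listed by IP address. Omit -n if you want iftop to resolve and display hostnames.
A command output similar to the command output in the following figure is returned. If you did not intentionally connect to crypto-pool.fr, treat crypto-pool.fr as an invalid domain name.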
Determine whether the ECS instance is infected with a virus based on the results of the preceding steps. If the top and ps commands were modified and the ECS instance is connected to an invalid domain name, the instance is most likely infected with a virus. To resolve this issue, perform the following steps:
1. Back up data on the ECS instance. For more information, see Create a snapshot for a disk.
2. Re-initialize the system disk of the ECS instance and then use Security Center to harden the security of the ECS instance. For more information, see Re-initialize a system disk and What is Security Center?