Troubleshoot and resolve high load issues on Linux instances - Elastic Compute Service

This topic describes how to troubleshoot and resolve high load issues that occur on Linux Elastic Compute Service (ECS) instances. High load issues may cause various exceptions, such as slow instance performance, automatic instance shutdown, automatic instance restart, and failed logons.

Problem description

When you use an ECS instance, the following high load issues may occur on the instance:

High CPU utilization or high CPU load: This issue occurs when CPU utilization reaches or exceeds 80%. If CPU utilization remains high, issues such as slow instance performance, automatic instance shutdown, automatic instance restart, and failed logons may occur.
High bandwidth utilization: This issue occurs when bandwidth utilization reaches or exceeds 80%. If bandwidth utilization remains high, the network connectivity or network throughput of the instance is affected. For example, the instance cannot be connected or has a low network speed.
High memory usage: This issue occurs when memory usage reaches or exceeds 80%. If memory usage remains high, issues such as system stutter and delayed responses may occur.
High I/O utilization: This issue occurs when the I/O utilization of a disk reaches or exceeds 80%. When I/O utilization is high, issues such as slow file read/write operations, decreased application performance, or application errors may occur.
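The 80% threshold above applies to each of the four metrics. As a minimal sketch, a shell helper can classify any utilization percentage against that threshold. The function name classify_utilization is hypothetical and not part of any tool mentioned in this topic.

```shell
# Classify a utilization percentage against the 80% threshold used in this topic.
# Usage: classify_utilization <percent>
# classify_utilization is a hypothetical helper name used for illustration only.
classify_utilization() {
    awk -v value="$1" -v threshold=80 'BEGIN {
        if (value + 0 >= threshold + 0) print "high"; else print "normal"
    }'
}

classify_utilization 85.5
classify_utilization 42
```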

Troubleshooting

You can use System Activity Reporter (SAR) or htop to identify high load issues and check the resource usage on your Linux instance.

Use SAR to check the resource usage on the instance

SAR collects statistics about system activities and reports the operating status of the system based on those statistics. SAR can continuously sample large amounts of system activity data and store the data and the analysis results in files. The collection itself places only a low load on the system.

SAR is a utility program that is used to analyze Linux system performance. SAR is an all-in-one monitoring tool that monitors and reports various system activities, such as file read/write operations, system calls, serial port activities, CPU activities, memory usage, process activity, and inter-process communication.
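Because SAR stores collected data in files, past activity can usually be read back with the -f option of sar. The following sketch builds the path of a daily data file. It assumes the default data directory of RHEL-family distributions (/var/log/sa) and the saDD naming scheme; both may differ on your instance. The helper name sa_file_for_day and the SA_DIR variable are hypothetical.

```shell
# Build the path of the sysstat daily data file for a two-digit day of the month.
# Assumption: files are named saDD under /var/log/sa (override with SA_DIR).
# sa_file_for_day and SA_DIR are hypothetical names used for illustration only.
sa_file_for_day() {
    printf '%s/sa%s\n' "${SA_DIR:-/var/log/sa}" "$1"
}

sa_file_for_day 04

# On a live instance, today's recorded CPU statistics could be read with:
#   sar -u -f "$(sa_file_for_day "$(date +%d)")"
```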

Install SAR

If SAR is not installed on your Linux instance, perform the following steps to install SAR:

Connect to the Linux instance by using Virtual Network Computing (VNC).
For more information, see Connect to a Linux instance by using VNC.
Run the following command to install SAR:
```
yum install sysstat
```
Run the following command to start the sysstat service:
```
systemctl start sysstat
```
Run the following command to check the status of the sysstat service:
```
systemctl status sysstat
```
If the sysstat service is started, the command output contains Active: active (exited).

Check CPU utilization

Run the following command to check the CPU load:

sar -u 1 5   # Sample once per second, five times.

A command output similar to the following one is displayed:

Linux 3.10.0-123.9.3.el7.x86_64 (iZ23pddtofdZ)     07/04/2016     _x86_64_    (1 CPU)
10:16:35 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
10:16:36 AM     all     14.14      0.00      1.01      0.00      0.00     84.85
10:16:37 AM     all     14.14      0.00      0.00      1.01      0.00     84.85
10:16:38 AM     all      0.00      0.00      1.01      0.00      0.00     98.99
10:16:39 AM     all      0.00      0.00      0.00      0.00      0.00    100.00
10:16:40 AM     all      1.00      0.00      0.00      0.00      0.00     99.00
Average:        all      5.86      0.00      0.40      0.20      0.00     93.54

Fields in the command output:

%user: the percentage of the CPU time consumed in user mode.
%nice: the percentage of the CPU time consumed by processes in user mode that have had their scheduling priority changed by using the nice value.
%system: the percentage of the CPU time consumed in system mode.
%iowait: the percentage of the time that the CPU is idle and waiting for disk I/O completion.
%steal: the percentage of time that the vCPUs of the instance spent in involuntary wait while the hypervisor (for example, Xen) was servicing another virtual processor.
%idle: the percentage of CPU idle time.
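A single busy percentage is often easier to compare against the 80% threshold than the individual fields. The following sketch derives it as 100 minus %idle from the Average line of the sar -u output. It assumes that %idle is the last column, as in the sample output above. The function name cpu_busy_from_sar is hypothetical.

```shell
# Print the average CPU busy percentage (100 - %idle) from "sar -u" output on stdin.
# Assumption: %idle is the last column of the "Average: all" line.
# cpu_busy_from_sar is a hypothetical helper name used for illustration only.
cpu_busy_from_sar() {
    awk '$1 == "Average:" && $2 == "all" { printf "%.2f\n", 100 - $NF }'
}

# Sample lines copied from the output above. On a live instance, use:
#   LC_ALL=C sar -u 1 5 | cpu_busy_from_sar
cpu_busy_from_sar <<'EOF'
10:16:40 AM     all      1.00      0.00      0.00      0.00      0.00     99.00
Average:        all      5.86      0.00      0.40      0.20      0.00     93.54
EOF
```

For the sample data, the helper prints 6.46, which is well below the 80% threshold.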

Check queue lengths and CPU load averages

Run the following command to view queue lengths and CPU load averages:

sar -q 1 10   # Sample once per second, 10 times.

A command output similar to the following one is displayed:

Linux 3.10.0-123.9.3.el7.x86_64 (iZ23pddtofdZ)     07/04/2016     _x86_64_    (1 CPU)
10:23:13 AM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15   blocked
10:23:14 AM         0       142      0.00      0.01      0.05         0
10:23:15 AM         0       142      0.00      0.01      0.05         0
10:23:16 AM         0       142      0.00      0.01      0.05         0
10:23:17 AM         0       142      0.00      0.01      0.05         0
10:23:18 AM         0       142      0.00      0.01      0.05         0
10:23:19 AM         0       142      0.00      0.01      0.05         0
Average:            0       142      0.00      0.01      0.05         0

Fields in the command output:

runq-sz: the run queue length, which indicates the number of tasks that are waiting for run time.
plist-sz: the number of processes and threads in the process list.
ldavg-1: the system load average for the previous minute.
ldavg-5: the system load average for the previous 5 minutes.
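ldavg-15: the system load average for the previous 15 minutes.
blocked: the number of tasks that are currently blocked and waiting for I/O operations to complete.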
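Load averages are easier to interpret when they are related to the number of CPUs. As a rough rule of thumb (not a statement from the sar documentation), a per-CPU load that stays above 1 suggests that tasks are queuing for CPU time or waiting for I/O. The following sketch performs the division. The function name load_per_cpu is hypothetical, and the input values in the example are illustrative.

```shell
# Print a load average divided by the number of CPUs.
# Usage: load_per_cpu <load-average> <cpu-count>
# load_per_cpu is a hypothetical helper name used for illustration only.
load_per_cpu() {
    awk -v load="$1" -v cpus="$2" 'BEGIN {
        if (cpus + 0 <= 0) exit 1      # reject a zero or negative CPU count
        printf "%.2f\n", load / cpus
    }'
}

# Illustrative values: a 1-minute load average of 6.00 on an 8-CPU instance.
load_per_cpu 6.00 8

# On a live instance, use:
#   load_per_cpu "$(cut -d ' ' -f 1 /proc/loadavg)" "$(nproc)"
```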

Check disk usage

Run the following command to check the disk read/write load:

sar -d 1 3   # Sample once per second, three times.

A command output similar to the following one is displayed:

Linux 5.10.134-13.al8.x86_64 (iZ2zegjvrdtgifd77gadyqZ)  03/09/2023      _x86_64_        (8 CPU)

02:41:04 PM       DEV       tps     rkB/s     wkB/s   areq-sz    aqu-sz     await     svctm     %util
02:41:05 PM  dev253-0      1.00      0.00      4.00      4.00      0.00      0.00      1.00      0.10
02:41:06 PM  dev253-0      1.00      0.00      4.00      4.00      0.00      1.00      1.00      0.10
02:41:07 PM  dev253-0      1.00      0.00      4.00      4.00      0.00      0.00      2.00      0.20
02:41:08 PM  dev253-0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
02:41:09 PM  dev253-0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:     dev253-0      0.60      0.00      2.40      4.00      0.00      0.33      1.33      0.08

Fields in the command output:

tps: the number of transfers per second.
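rkB/s: the amount of data read from the device in kilobytes per second. Older sysstat versions report rd_sec/s (sectors read per second) instead.
wkB/s: the amount of data written to the device in kilobytes per second. Older sysstat versions report wr_sec/s (sectors written per second) instead.
areq-sz: the average size (in kilobytes) of the I/O requests that were issued to the device. Older sysstat versions report avgrq-sz (in sectors) instead.
aqu-sz: the average queue length of the requests that were issued to the device. Older sysstat versions name this field avgqu-sz.
await: the average time (in milliseconds) for I/O requests to be served. This includes the time spent by the requests in queue and the time spent servicing the requests.
svctm: the average service time (in milliseconds) per I/O request. Recent sysstat versions mark this field as unreliable, so rely on await instead.
%util: the percentage of elapsed time during which I/O requests were issued to the device, which is the bandwidth utilization of the device. A value close to 100% indicates that the device is saturated, at least for devices that serve requests serially.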
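To spot busy disks quickly, the Average lines of the sar -d output can be filtered by the %util column. The following sketch assumes that %util is the last column, as in the sample output above. The function name busy_devices is hypothetical, and the dev253-16 row in the example is invented for illustration.

```shell
# List devices whose average %util meets or exceeds a threshold (default: 80).
# Reads "sar -d" output on stdin. Assumption: %util is the last column.
# busy_devices is a hypothetical helper name used for illustration only.
busy_devices() {
    awk -v threshold="${1:-80}" '
        $1 == "Average:" && $2 != "DEV" && $NF + 0 >= threshold + 0 { print $2, $NF }
    '
}

# The dev253-16 row is invented sample data. On a live instance, use:
#   LC_ALL=C sar -d 1 3 | busy_devices
busy_devices <<'EOF'
Average:          DEV       tps     rkB/s     wkB/s   areq-sz    aqu-sz     await     svctm     %util
Average:     dev253-0      0.60      0.00      2.40      4.00      0.00      0.33      1.33      0.08
Average:    dev253-16    950.00      0.00  60800.00     64.00      7.90      8.30      0.96     91.30
EOF
```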

Check the memory load

Run the following command to check the memory load:

sar -r 1 3   # Sample once per second, three times.

A command output similar to the following one is displayed:

Linux 3.10.0-123.9.3.el7.x86_64 (iZ23pddtofdZ)     07/04/2016     _x86_64_    (1 CPU)

10:27:34 AM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  kbactive  kbinact  kbdirty
10:27:35 AM    275992    740664     72.85    181552    315340    362052     35.61    471216   115828       60
10:27:36 AM    276024    740632     72.85    181552    315340    362052     35.61    471220   115828       64
10:27:37 AM    276024    740632     72.85    181552    315340    362052     35.61    471220   115828       64
Average:       276013    740643     72.85    181552    315340    362052     35.61    471219   115828       63

Fields in the command output:

kbmemfree: the amount of free memory available in kilobytes, excluding buffer space and cache space. kbmemfree is the same as the free column in the output of the free command.
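kbmemused: the amount of used memory in kilobytes. In the sample output above, kbmemused equals the total memory minus kbmemfree (275992 + 740664 = 1016656 KB in total), so it includes buffer space and cache space. Newer sysstat versions may also subtract buffers, cache, and slab memory from this value. Check man sar for the version on your instance.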
%memused: the percentage of used memory in relation to the total amount of memory. The total amount of memory does not include the amount of swap memory.
kbbuffers and kbcached: correspond to the buffer and cache columns in the output of the free command.
kbcommit: the amount of memory in kilobytes that is required for the current workload. This is an approximation of the total amount of RAM and swap memory that is required to prevent out-of-memory issues.
%commit: the percentage of memory required for current workload in relation to the total amount of RAM and swap memory.
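The %memused value in the sample output can be reproduced from kbmemfree and kbmemused. The following sketch assumes that the total memory is the sum of the two values, which holds for the sample output above but may not hold for every sysstat version. The function name mem_used_percent is hypothetical.

```shell
# Compute the percentage of used memory from kbmemfree and kbmemused.
# Assumption: total memory = kbmemfree + kbmemused, as in the sample output.
# Usage: mem_used_percent <kbmemfree> <kbmemused>
# mem_used_percent is a hypothetical helper name used for illustration only.
mem_used_percent() {
    awk -v free_kb="$1" -v used_kb="$2" 'BEGIN {
        total = free_kb + used_kb
        if (total <= 0) exit 1         # avoid division by zero
        printf "%.2f\n", used_kb * 100 / total
    }'
}

# Values from the first sample row above.
mem_used_percent 275992 740664
```

For the first sample row, the helper prints 72.85, which matches the %memused column.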

Check the I/O load

Run the following command to check the I/O load:

sar -b 1 10   # Sample once per second, 10 times.

A command output similar to the following one is displayed:

Linux 5.10.134-13.al8.x86_64 (iZ2zegjvrdtgifd77gadyqZ)  03/09/2023      _x86_64_        (8 CPU)

02:34:00 PM       tps      rtps      wtps   bread/s   bwrtn/s
02:34:01 PM      6.00      0.00      6.00      0.00     80.00
02:34:02 PM     55.00      0.00     55.00      0.00    632.00
02:34:03 PM      1.00      0.00      1.00      0.00      8.00
02:34:04 PM      0.00      0.00      0.00      0.00      0.00
02:34:05 PM      0.00      0.00      0.00      0.00      0.00
02:34:06 PM      2.00      0.00      2.00      0.00    136.00
02:34:07 PM     82.00      0.00     82.00      0.00    888.00
02:34:08 PM      0.00      0.00      0.00      0.00      0.00
02:34:09 PM      0.00      0.00      0.00      0.00      0.00
02:34:10 PM      0.00      0.00      0.00      0.00      0.00
Average:        14.60      0.00     14.60      0.00    174.40

Fields in the command output:

tps: the total number of transfers that were issued to physical devices per second.
rtps: the total number of read requests that were issued to physical devices per second.
wtps: the total number of write requests that were issued to physical devices per second.
bread/s: the total amount of data read from physical devices in blocks per second.
bwrtn/s: the total amount of data written to physical devices in blocks per second.
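The bread/s and bwrtn/s values are easier to compare with disk throughput limits after conversion to kilobytes. The following sketch assumes a block size of 512 bytes, which is how the sar manual page describes blocks; treat this as an assumption for your sysstat version. The function name blocks_to_kb is hypothetical.

```shell
# Convert a rate in blocks per second to kilobytes per second.
# Assumption: one block equals one 512-byte sector.
# Usage: blocks_to_kb <blocks-per-second>
# blocks_to_kb is a hypothetical helper name used for illustration only.
blocks_to_kb() {
    awk -v blocks="$1" 'BEGIN { printf "%.2f\n", blocks * 512 / 1024 }'
}

# Average bwrtn/s from the sample output above.
blocks_to_kb 174.40
```

For the sample average of 174.40 blocks per second, the helper prints 87.20.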

Check swapping activity

Run the following command to check swapping activity:

sar -W 1 3

A command output similar to the following one is displayed:

Linux 3.10.0-123.9.3.el7.x86_64 (iZ23pddtofdZ)     07/04/2016     _x86_64_    (1 CPU)

10:28:59 AM  pswpin/s pswpout/s
10:29:00 AM      0.00      0.00
10:29:01 AM      0.00      0.00
10:29:02 AM      0.00      0.00
Average:         0.00      0.00

Fields in the command output:

pswpin/s: the number of swap pages that the system brought in per second.
pswpout/s: the number of swap pages that the system brought out per second.
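Sustained nonzero swap rates usually indicate memory pressure. The following sketch reads the Average line of the sar -W output and reports whether any pages were swapped in or out during the sampling period. The function name swap_activity is hypothetical.

```shell
# Report whether "sar -W" output on stdin shows any swap activity on average.
# swap_activity is a hypothetical helper name used for illustration only.
swap_activity() {
    awk '$1 == "Average:" && $2 ~ /^[0-9.]+$/ {
        if ($2 + $3 > 0) print "swapping"; else print "no swapping"
    }'
}

# Sample lines copied from the output above. On a live instance, use:
#   LC_ALL=C sar -W 1 3 | swap_activity
swap_activity <<'EOF'
10:29:02 AM      0.00      0.00
Average:         0.00      0.00
EOF
```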

Common sar command parameters

The following options are case-sensitive. The descriptions apply to the sysstat implementation of sar on Linux. Available options vary by sysstat version, so check man sar on your instance:

-A: reports all collected statistics.
-B: reports paging statistics.
-b: reports I/O and transfer rate statistics.
-d: reports activity for each block device.
-n: reports network statistics. This option requires a keyword, such as DEV for network interfaces.
-P: reports per-processor statistics for the specified CPU, or for all CPUs if ALL is specified.
-q: reports run queue lengths and load averages.
-r: reports memory utilization statistics.
-S: reports swap space utilization statistics.
-u: reports CPU utilization statistics.
-v: reports the status of the inode table, file table, and other kernel tables.
-W: reports swapping statistics.
-w: reports task creation and context switching activities.
-y: reports teletypewriter (TTY) device activities.

Use htop to view the load of processes in the operating system

htop is an interactive tool that provides you with a visual representation of the usage and load averages of CPU, memory, and swap in Linux.

Connect to a Linux instance.
For more information, see Connection methods.
Run the following command to install htop:
```
yum install htop
```
Run the following command to start htop:
```
htop
```
View the system loads in htop.
The following figure shows the htop interface. The interface consists of the following sections:
- ①: The CPU utilization, memory usage, and swap usage are displayed on the left side, and the total number of tasks, the load averages, and the system uptime are displayed on the right side.
- ②: The resource usage of each process is displayed. You can click CPU% or MEM% to sort the processes by CPU utilization or memory usage and identify the processes that cause high CPU utilization or high memory usage.
- ③: The function keys F1 to F10 are displayed.
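When an interactive session is not practical, for example in a script, a similar ranking can be approximated with ps and sort. This is a swapped-in technique, not a feature of htop. The following sketch sorts rows in the form "PID %CPU %MEM COMMAND" by a numeric column. The function name top_processes is hypothetical, and the process rows in the example are invented for illustration.

```shell
# Sort "PID %CPU %MEM COMMAND" rows from stdin by a numeric column, highest first.
# Usage: top_processes <column-number> [row-count]
# top_processes is a hypothetical helper name used for illustration only.
top_processes() {
    sort -k "$1,$1" -n -r | head -n "${2:-5}"
}

# Invented sample rows. On a live instance, use (headers are suppressed by "="):
#   ps -eo pid=,pcpu=,pmem=,comm= | top_processes 2
top_processes 2 1 <<'EOF'
 101  2.0 10.5 nginx
 202 91.3  4.2 java
 303 15.7 30.1 mysqld
EOF
```

Passing 2 as the column number ranks by CPU utilization, and passing 3 ranks by memory usage.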
