Symptoms
-
Slow response: Secure Shell Protocol (SSH) commands are delayed. Website or API access is slow or times out.
-
High metrics: CPU, memory, and disk I/O metrics consistently exceed 80%.
-
Service interruption: The system terminates critical processes due to an out-of-memory (OOM) error, and the instance automatically restarts.
-
Logon failure: SSH connections are refused.
Causes
-
Application issues: The application code has performance bottlenecks or memory leaks.
-
Traffic spikes: Concurrent access exceeds the processing capacity of the instance.
-
I/O bottleneck: Disk read and write operations are saturated, which causes high CPU iowait.
Solutions
Step 1: Use htop to quickly identify abnormal processes
-
Log on to an ECS instance using a VNC connection.
-
Go to ECS console - Instances. In the top navigation bar, select the target region and resource group.
-
Go to the details page of the target instance. Click Connect and select VNC. Enter the username and password to log on to the ECS instance.
-
Install and run htop.
sudo yum install -y htop
htop
-
Analyze the output in the htop interface.
-
To find processes with high CPU consumption, press the F6 key and sort by PERCENT_CPU in descending order.
-
To find processes with high memory consumption, press the F6 key and sort by PERCENT_MEM in descending order.
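When htop is not available, or when the ranking needs to be captured in a script, a similar view can be approximated with ps and sort. The sketch below is an illustration rather than part of the original procedure: rank_procs is a hypothetical helper name, and the column numbers assume the ps output format shown in the usage comment.

```shell
# rank_procs COLUMN [COUNT]
# Sort whitespace-separated process rows from stdin by a numeric column,
# highest first, and print the top COUNT rows (default 5).
rank_procs() {
    LC_ALL=C sort -k"$1,$1nr" | head -n "${2:-5}"
}

# Usage (column 2 is %CPU and column 3 is %MEM in this ps format):
#   ps -eo pid,%cpu,%mem,comm --no-headers | rank_procs 2
#   ps -eo pid,%cpu,%mem,comm --no-headers | rank_procs 3
```

The command name is placed last in the ps format so that names containing spaces do not shift the numeric columns.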
Step 2: Use sar to diagnose resource bottlenecks
After you use htop to identify a symptom, use sar to obtain quantitative data and confirm whether the bottleneck is CPU, memory, or I/O.
-
Install and enable sysstat.
sudo yum install -y sysstat
sudo systemctl start sysstat && sudo systemctl enable sysstat
-
Run a targeted analysis.
-
Analyze CPU usage (sar -u) to confirm where CPU time is spent.
# Collect data once per second for a total of 5 times
sar -u 1 5
-
High %user: Indicates an application issue.
-
High %system: Indicates frequent kernel or I/O calls.
-
%iowait is consistently greater than 20%: Indicates a disk I/O bottleneck.
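The interpretation rules for sar -u can be sketched as a small classifier. This is a minimal illustration under stated assumptions: classify_cpu is a hypothetical helper, the 20% iowait threshold comes from the text above, and the 70% and 30% limits for %user and %system are placeholder values that the original does not specify.

```shell
# classify_cpu USER SYSTEM IOWAIT
# Map the %user, %system, and %iowait values from `sar -u` to a likely cause.
# Only the 20% iowait limit comes from the text; 70 and 30 are assumptions.
classify_cpu() {
    awk -v user="$1" -v sys="$2" -v iowait="$3" 'BEGIN {
        if (iowait > 20)    print "disk I/O bottleneck"
        else if (user > 70) print "application load"
        else if (sys > 30)  print "kernel or I/O call overhead"
        else                print "no CPU bottleneck detected"
    }'
}

# Usage, assuming the default column order
# (CPU, %user, %nice, %system, %iowait, %steal, %idle):
#   set -- $(LC_ALL=C sar -u 1 5 | awk '/^Average:/ { print $3, $5, $6 }')
#   classify_cpu "$@"
```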
-
Analyze the system load (sar -q) to measure how busy the system is.
# Collect data once every 2 seconds for a total of 5 times
sar -q 2 5
-
ldavg-1 is greater than the number of CPU cores: The system is overloaded.
-
High runq-sz: Many processes are in the queue waiting for the CPU.
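The comparison between the 1-minute load average and the core count can be checked directly. The sketch below assumes a Linux system where /proc/loadavg and nproc are available; is_overloaded is a hypothetical helper name.

```shell
# is_overloaded LOAD CORES
# Succeed (exit 0) when the load average is greater than the number of cores.
is_overloaded() {
    awk -v load="$1" -v cores="$2" 'BEGIN { exit !(load > cores) }'
}

# Usage: the first field of /proc/loadavg is the 1-minute load average,
# the same quantity that sar -q reports as ldavg-1.
#   if is_overloaded "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"; then
#       echo "load exceeds the number of CPU cores"
#   fi
```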
-
Analyze memory and swap activity (sar -r and sar -W) to determine whether memory is exhausted.
# Analyze memory usage
sar -r 1 3
# Analyze swap activity
sar -W 1 3
-
pswpin/s or pswpout/s is consistently greater than 0: Physical memory is insufficient and the system is swapping to disk, which degrades performance.
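The swap rule can be automated by reading the summary line of sar -W. This is a sketch, not a definitive check: swap_active is a hypothetical helper, and it uses the average over the sampling window as a stand-in for "consistently greater than 0".

```shell
# swap_active
# Read `sar -W` output on stdin and succeed (exit 0) when the Average row
# shows a nonzero pswpin/s or pswpout/s value.
swap_active() {
    awk '$1 == "Average:" && $2 ~ /^[0-9]/ { found = ($2 > 0 || $3 > 0) }
         END { exit !found }'
}

# Usage (LC_ALL=C keeps the "Average:" label in English):
#   LC_ALL=C sar -W 1 3 | swap_active && echo "system is swapping"
```

A longer sampling window gives a more reliable answer than three one-second samples.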
-
Analyze disk I/O (sar -d) to identify disk performance bottlenecks.
# Collect per-device statistics once per second for a total of 3 times
sar -d 1 3
-
%util is close to 100%: Disk I/O is saturated.
-
await is greater than 20 ms: I/O request processing time is too long.
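Because sar -d prints one row per device, it can help to filter the output down to the devices that cross the thresholds above. The following is a sketch under assumptions: flag_busy_disks is a hypothetical helper, "close to 100%" is interpreted as 90% or more, and the columns are located by header name because their positions differ between sysstat versions.

```shell
# flag_busy_disks
# Read `sar -d` output on stdin and print the name of each device whose
# Average row shows %util >= 90 (assumed cutoff) or await > 20 ms.
flag_busy_disks() {
    awk '
        # Header row: remember which columns hold await and %util.
        /%util/ {
            for (i = 1; i <= NF; i++) {
                if ($i == "await") a = i
                if ($i == "%util") u = i
            }
            next
        }
        # Average rows: the device name is the second field.
        $1 == "Average:" && u {
            if ($u + 0 >= 90 || (a && $a + 0 > 20)) print $2
        }
    '
}

# Usage:
#   LC_ALL=C sar -d 1 3 | flag_busy_disks
```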
Step 3: Apply targeted solutions and optimizations
-
For application processes with high CPU consumption:
-
Code optimization: Use tools such as perf (C/C++) and jstack (Java) to identify and optimize hot spot code.
-
Logic optimization: Check for and fix inefficient operations such as infinite loops and SQL queries that perform full table scans.
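For Java processes, one common way to connect a busy thread to its stack trace is to convert the thread ID shown by top -H into the hexadecimal nid value that HotSpot's jstack prints. The sketch below illustrates that conversion; tid_to_nid is a hypothetical helper, and PID and TID stand for values read from your own system.

```shell
# tid_to_nid TID
# Convert a decimal Linux thread ID to the nid=0x... form used in jstack output.
tid_to_nid() {
    printf 'nid=0x%x\n' "$1"
}

# Usage:
#   top -H -p "$PID"                                   # note the TID of the busiest thread
#   jstack "$PID" | grep -A 20 "$(tid_to_nid "$TID")"  # show that thread's stack
#
# For native (C/C++) processes, perf can sample hot functions directly:
#   sudo perf top -p "$PID"
```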
-
For insufficient memory or frequent swapping:
-
Investigate leaks: Use tools such as valgrind (C/C++) and jmap (Java) to analyze memory leaks.
-
Adjust configurations: Configure application memory parameters, such as the -Xms and -Xmx parameters for a Java Virtual Machine (JVM).
-
Upgrade resources: Increase physical memory by changing the instance type. For more information, see Overview of instance type changes.
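Before attaching a heavier tool such as valgrind or jmap, it can be useful to confirm that a process's resident memory actually grows over time. The sketch below is an assumption-based pre-check, not part of the original procedure: rss_kb is a hypothetical helper that reads the VmRSS field from /proc on Linux.

```shell
# rss_kb PID
# Print the resident set size of a process in kB, read from /proc/PID/status.
rss_kb() {
    awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}

# Usage: sample once per minute; a value that rises steadily under a
# constant workload suggests a leak worth investigating further.
#   while sleep 60; do
#       echo "$(date +%T) $(rss_kb "$PID") kB"
#   done
```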
-
For high disk I/O: See Troubleshoot high disk I/O load on Linux systems.
Next steps
-
Configure monitoring and alerts: Set alert thresholds for key metrics such as CPU, memory, load, and disk to receive early warnings.
-
Plan for Auto Scaling: For workloads with fluctuations, such as web applications, configure Auto Scaling policies to automatically add or remove instances in response to traffic changes.
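As a local complement to console-based alerts, a threshold check can also run on the instance itself, for example from cron. The sketch below is illustrative only: mem_used_pct is a hypothetical helper, the 80% limit mirrors the figure in the Symptoms section, and MemAvailable is assumed to be present in /proc/meminfo.

```shell
# mem_used_pct
# Read /proc/meminfo-style text on stdin and print the percentage of memory
# in use, computed as (MemTotal - MemAvailable) / MemTotal.
mem_used_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%d\n", (total - avail) * 100 / total }'
}

# Usage: write a syslog warning when usage exceeds 80%.
#   pct=$(mem_used_pct < /proc/meminfo)
#   if [ "$pct" -gt 80 ]; then
#       logger -p user.warning "memory usage at ${pct}%"
#   fi
```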