Trigger restarts of ECS instances with high CPU utilization based on alerts - CloudOps Orchestration Service

This topic describes how to use the alerting feature of CloudOps Orchestration Service (OOS) to automatically restart Elastic Compute Service (ECS) instances that have high CPU utilization.

Background information

When the CPU utilization of an ECS instance stays high, for known or unknown reasons, applications on the instance often slow down or stop responding. Restarting the instance quickly brings the CPU utilization back to a lower level and limits the impact on applications. In this scenario, you can use the alerting feature of OOS to automatically restart instances with high CPU utilization, which enables unattended recovery.

Procedure

Log on to the OOS console.
In the left-side navigation pane, choose Automated Task > Alert and Event O&M and click Create.
In the Trigger Rule section, select ECS from the Service drop-down list and configure the Rule Description parameter. In this example, CPU Utilization/cpu_total is selected and the threshold is set to 80%, so an ECS instance is restarted when its CPU utilization exceeds 80%. The Mute Period parameter defaults to 5 minutes, which means that an ECS instance is restarted at most once every 5 minutes even if the metric value exceeds the alert threshold several times in that period.
For the Resources Alerted parameter, select one or more instances whose CPU utilization you want to monitor.
In the Select Template section, select Public Template from the drop-down list next to the search box, and select the ACS-ECS-BulkyRebootInstances template.
In the Configure Template Parameters section, configure the parameters as required. Select Extract Value from Message Body for Template Parameters.
Use the default settings for the RegionId, TargetInstance and RateControl parameters.
Create a Resource Access Management (RAM) role for OOS as the source of the permissions, and select the RAM role from the Permissions drop-down list. For more information, see Use RAM to grant permissions to OOS. The following code shows the policy that is required to execute the template.
```
{
  "Version": "1",
  "Statement": [
    {
      "Action": [
        "ecs:RebootInstance",
        "ecs:DescribeInstances"
      ],
      "Resource": "*",
      "Effect": "Allow"
    }
  ]
}
```
After the configuration is complete, click Create.
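The interaction between the 80% threshold and the 5-minute mute period in the trigger rule can be sketched as a small simulation. This is only an illustration of the behavior described above, not how OOS or the alerting backend is implemented. The names `THRESHOLD`, `MUTE_SECONDS`, and `should_restart`, as well as the sample values, are hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative simulation only; this is not the OOS implementation.
# THRESHOLD, MUTE_SECONDS and should_restart are hypothetical names.

THRESHOLD=80        # alert threshold in percent, as in the example rule
MUTE_SECONDS=300    # mute period of 5 minutes

last_restart=-1     # time of the last triggered restart; -1 means "never"

# should_restart <time_in_seconds> <cpu_percent>
# Succeeds (and records the restart time) if a restart would be triggered.
should_restart() {
  local now=$1 cpu=$2
  # The metric must be greater than the threshold.
  if (( cpu <= THRESHOLD )); then
    return 1
  fi
  # Within the mute period, further alerts do not trigger another restart.
  if (( last_restart >= 0 && now - last_restart < MUTE_SECONDS )); then
    return 1
  fi
  last_restart=$now
  return 0
}

# Made-up samples in the form "time:cpu_percent".
for sample in 0:85 60:90 120:70 310:95; do
  t=${sample%%:*}
  cpu=${sample##*:}
  if should_restart "$t" "$cpu"; then
    echo "t=${t}s cpu=${cpu}% -> restart"
  else
    echo "t=${t}s cpu=${cpu}% -> no action"
  fi
done
```

In this sketch, the sample at 60 seconds exceeds the threshold but falls inside the mute period, so only the samples at 0 and 310 seconds lead to a restart.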

Check the result

In this example, you can use the stress testing tool stress-ng to simulate high CPU utilization.

Connect to the monitored ECS instance.

Install stress-ng.

# AliyunLinux/CentOS/RHEL
yum install stress-ng -y

# Ubuntu/Debian
apt-get install stress-ng -y

Run the stress-ng command to simulate high CPU utilization.

# Adjust the parameters to your stress testing requirements.
# In this example, stress-ng starts two CPU workers, each at 85% load, and stops after 5 minutes.
stress-ng --cpu 2 --cpu-load 85 --timeout 5m
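While stress-ng is running, you may want to check from a second session that the overall CPU utilization is actually above the 80% threshold. The following is a minimal sketch that estimates utilization from two samples of the aggregate `cpu` line in `/proc/stat`. The function name `cpu_util` and the `INTERVAL` variable are hypothetical, and the value that the alerting service reports for CPU Utilization/cpu_total may be computed differently.

```shell
#!/usr/bin/env bash
# Sketch: estimate overall CPU utilization from two /proc/stat samples.
# cpu_util and INTERVAL are illustrative names, not part of OOS or stress-ng.

# cpu_util "<first cpu line>" "<second cpu line>"
# Each argument is the aggregate "cpu ..." line from /proc/stat.
# Prints the busy percentage between the two samples as an integer.
cpu_util() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    {
      # Sum user, nice, system, idle, iowait, irq, softirq and steal.
      total = 0
      for (i = 2; i <= 9 && i <= NF; i++) total += $i
      idle = $5 + $6          # idle + iowait count as not busy
      if (NR == 1) { t1 = total; i1 = idle }
      else         { t2 = total; i2 = idle }
    }
    END {
      dt = t2 - t1
      if (dt <= 0) { print 0; exit }
      printf "%d\n", (dt - (i2 - i1)) * 100 / dt
    }'
}

interval=${INTERVAL:-5}   # seconds between the two samples

if [ -r /proc/stat ]; then
  first=$(head -n 1 /proc/stat)
  sleep "$interval"
  second=$(head -n 1 /proc/stat)
  echo "CPU utilization over ${interval}s: $(cpu_util "$first" "$second")%"
fi
```

Because the command above starts only two workers, the overall figure on an instance with more than two vCPUs will be lower than 85%. In that case, raise the `--cpu` value to match the number of vCPUs.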

After about 1 minute of stress testing, an alert is triggered, and the ECS instance that runs the command is restarted. After the restart, the CPU utilization of the ECS instance decreases.
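One way to confirm the restart from inside the instance is to reconnect and check how long the system has been up. The sketch below reads `/proc/uptime` and reports whether the instance booted within the last 10 minutes. The function name `booted_within` and the 10-minute window are assumptions made for illustration, not part of OOS.

```shell
#!/usr/bin/env bash
# Sketch: confirm a recent reboot by checking the system uptime.
# booted_within and the 600-second window are illustrative choices.

# booted_within <uptime_seconds> <window_seconds>
# Succeeds when the uptime is shorter than the window.
booted_within() {
  local up=${1%%.*}   # drop the fractional part used in /proc/uptime
  (( up < $2 ))
}

if [ -r /proc/uptime ]; then
  # The first field of /proc/uptime is the uptime in seconds.
  read -r up _ < /proc/uptime
  if booted_within "$up" 600; then
    echo "Instance booted ${up%%.*} seconds ago: a recent restart took place."
  else
    echo "Instance has been up for ${up%%.*} seconds: no recent restart."
  fi
fi
```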