Trigger automatic restarts for ECS instances in response to high CPU utilization alerts - CloudOps Orchestration Service

An abnormal increase in the CPU utilization of an Elastic Compute Service (ECS) instance can degrade the applications running on it, causing them to run slowly or stop responding. Restarting the instance is a quick way to bring CPU utilization back down and limit the impact on the applications. CloudOps Orchestration Service (OOS) provides an alerting feature that restarts an instance automatically, without manual intervention, when its CPU utilization is excessively high. This topic describes how to configure a CPU utilization alert that automatically restarts ECS instances when CPU utilization exceeds a threshold, so that service performance is restored quickly.

Preparations

You must create a RAM role for CloudOps Orchestration Service that has the permissions to restart ECS instances.

Create a custom policy that contains the ecs:RebootInstance and ecs:DescribeInstances permissions. For more information, see Create custom policies.

Permissions required for automatic restart

{
  "Version": "1",
  "Statement": [
    {
      "Action": [
        "ecs:RebootInstance",
        "ecs:DescribeInstances"
      ],
      "Resource": "*",
      "Effect": "Allow"
    }
  ]
}

Create a regular service role and configure CloudOps Orchestration Service as the trusted service. For more information about how to select a trusted service, see Create a RAM role for a trusted Alibaba Cloud service.
Attach the custom policy to the RAM role that you create. This way, the RAM role has the required permissions. For more information, see Grant permissions to a RAM role.
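
For readers who prefer the command line, the preparation steps above can be sketched with the Alibaba Cloud CLI. This is a minimal sketch, not the documented procedure: it assumes the `aliyun` CLI is installed and configured, the policy and role names are hypothetical, and the trust policy assumes `oos.aliyuncs.com` is the service principal for CloudOps Orchestration Service. By default the script only prints the commands it would run.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical names (not from the original topic); choose your own.
POLICY_NAME="OOSRebootEcsPolicy"
ROLE_NAME="OOSRebootEcsRole"
WORKDIR="$(mktemp -d)"

# Custom policy: the two permissions listed in this topic.
cat > "$WORKDIR/policy.json" <<'EOF'
{
  "Version": "1",
  "Statement": [
    {
      "Action": ["ecs:RebootInstance", "ecs:DescribeInstances"],
      "Resource": "*",
      "Effect": "Allow"
    }
  ]
}
EOF

# Trust policy: lets the OOS service assume the role.
# Assumption: oos.aliyuncs.com is the OOS service principal.
cat > "$WORKDIR/trust.json" <<'EOF'
{
  "Version": "1",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Effect": "Allow",
      "Principal": { "Service": ["oos.aliyuncs.com"] }
    }
  ]
}
EOF

# With DRY_RUN=1 (the default) commands are printed, not executed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf 'would run:'; printf ' %q' "$@"; printf '\n'
  else
    "$@"
  fi
}

# 1. Create the custom policy. 2. Create the role. 3. Attach the policy.
run aliyun ram CreatePolicy --PolicyName "$POLICY_NAME" \
  --PolicyDocument "$(cat "$WORKDIR/policy.json")"
run aliyun ram CreateRole --RoleName "$ROLE_NAME" \
  --AssumeRolePolicyDocument "$(cat "$WORKDIR/trust.json")"
run aliyun ram AttachPolicyToRole --PolicyType Custom \
  --PolicyName "$POLICY_NAME" --RoleName "$ROLE_NAME"
```

Review the printed commands first, then rerun with `DRY_RUN=0` to apply them. The role created this way is the one to select from the Permissions drop-down list in the procedure.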

Procedure

Log on to the CloudOps Orchestration Service console. In the left-side navigation pane, choose Automated Task > Alert and Event O&M.
On the Alert and Event O&M page, click Create. On the Create Alert and Event O&M page, select Threshold Alert.
In the Trigger Rule section, configure rule-related parameters and select desired instances.
In the Select Template section, select Public Template from the drop-down list next to the search box and select the ACS-ECS-BulkyRebootInstances template.
Retain the default configurations for the RegionId, TargetInstance, and RateConsole parameters and select a RAM role that has the permissions to restart ECS instances from the Permissions drop-down list.
Click Create. In the dialog box that appears, click OK.

Verify the result

In this example, the open source stress testing tool stress-ng is used to simulate high CPU utilization.

Connect to the monitored ECS instance. For more information, see Methods for connecting to an ECS instance.
Install the stress-ng tool.
Alibaba Cloud Linux, CentOS, and RHEL
```
yum install stress-ng -y
```
Ubuntu and Debian
```
apt-get install stress-ng -y
```
Run a stress test. In this example, stress-ng stresses two CPU cores at a CPU load of 85% for 5 minutes.
```
stress-ng --cpu 2 --cpu-load 85 --timeout 5m
```
Observe the CPU utilization. After the instance is restarted, the CPU utilization decreases.
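
One way to observe CPU utilization from inside the instance is to sample `/proc/stat` twice and compare the counters. The following is a minimal sketch under that assumption; the function names `cpu_util` and `sample_cpu_util` are illustrative and not part of any tool mentioned in this topic.

```shell
#!/usr/bin/env bash

# Print the CPU utilization (integer percent) between two aggregate
# "cpu ..." lines taken from /proc/stat.
cpu_util() {
  local -a a b
  read -r -a a <<< "$1"
  read -r -a b <<< "$2"
  # Fields 1-8: user nice system idle iowait irq softirq steal
  local i total_a=0 total_b=0
  for i in 1 2 3 4 5 6 7 8; do
    total_a=$(( total_a + ${a[i]:-0} ))
    total_b=$(( total_b + ${b[i]:-0} ))
  done
  # Idle time is counted as idle + iowait.
  local idle_a=$(( ${a[4]:-0} + ${a[5]:-0} ))
  local idle_b=$(( ${b[4]:-0} + ${b[5]:-0} ))
  local dt=$(( total_b - total_a )) di=$(( idle_b - idle_a ))
  # No elapsed ticks between samples: report 0 instead of dividing by zero.
  if [ "$dt" -le 0 ]; then echo 0; return; fi
  echo $(( 100 * (dt - di) / dt ))
}

# Sample /proc/stat twice, INTERVAL seconds apart (default 5),
# and print the utilization over that window.
sample_cpu_util() {
  local interval="${1:-5}" first second
  first="$(grep '^cpu ' /proc/stat)"
  sleep "$interval"
  second="$(grep '^cpu ' /proc/stat)"
  cpu_util "$first" "$second"
}
```

Running `sample_cpu_util 5` in a second session while stress-ng is active prints the utilization averaged across all cores, so the value depends on how many cores the instance has. The session is disconnected when the instance restarts; run the function again after reconnecting to confirm that utilization has dropped.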
