Use Managed Service for Prometheus to monitor ECS instances

This topic describes how to use Alibaba Cloud Managed Service for Prometheus to monitor the metrics of Elastic Compute Service (ECS) instances deployed in a virtual private cloud (VPC), discard metrics, and configure alerting. To monitor the ECS instances, you must enable the Host Monitor component.

Workflow

Monitoring ECS instances requires the following steps:

Enable Host Monitor: Enable the Host Monitor component and select a VPC. Then, the open source exporters are automatically installed. This way, the managed Prometheus agent automatically collects data.
(Optional) Modify Host Monitor: Modify the configurations of the Host Monitor component, such as the service port, to fix errors or meet new business requirements.
(Optional) Discard metrics: Discard metrics that you do not need. This refines metric collection and saves costs.
(Optional) Configure alerting: Configure alerting for specific metrics to detect changes and troubleshoot problems in a timely manner.

Prerequisites

A VPC is created. One or more ECS instances are created in the VPC. For more information, see Create and manage an ECS instance in the console (express version).
Alibaba Cloud Resource Center is activated. For more information, see Activate Resource Center.
Note
Before you monitor ECS instances, you must activate Resource Center. This is because the service discovery capabilities of Managed Service for Prometheus rely on the VPC data and ECS data of the current account provided by Resource Center.

1. Enable Host Monitor

After you enable Host Monitor and select the VPC, Node Exporter and Process Exporter are installed on the ECS instances in the VPC by default. Then, the managed Prometheus agent automatically collects data. You can store and visualize the data, and configure alerting for the data in a unified manner. About 1,000 metric entries are collected from each ECS instance in each collection cycle.
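As a rough sizing aid, the figure of about 1,000 metric entries per instance can be turned into a daily sample estimate. The following is a minimal sketch: the function name is hypothetical, and the 15-second interval used in the example is an assumption, not a documented default.

```python
def estimate_daily_samples(instances, interval_seconds, entries_per_instance=1000):
    """Estimate samples collected per day.

    entries_per_instance defaults to the approximate figure of 1,000
    metric entries per ECS instance; the interval is an assumption.
    """
    if instances < 0 or interval_seconds <= 0:
        raise ValueError("instances must be >= 0 and interval_seconds > 0")
    collections_per_day = 86_400 // interval_seconds
    return instances * entries_per_instance * collections_per_day


if __name__ == "__main__":
    # 10 instances collected every 15 seconds (assumed interval).
    print(estimate_daily_samples(10, 15))
```

An estimate like this can help you decide which metrics to discard before costs grow with the number of instances.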

1.1 Enable the component

In the left-side navigation pane of the ARMS console, click Integration Center. In the Infrastructure section, click Host Monitor.
In the panel that appears, select the VPC and configure the parameters as required. For more information, see Monitor ECS instances.
Click OK. Wait for 1 or 2 minutes.

1.2 View the dashboards

In the left-side navigation pane of the ARMS console, click Integration Management. On the Integrated Environments tab, click ECS Instance. Click the VPC ID and go to the environment details page.
On the Component Management tab, click Dashboards in the Addon Type section to view the built-in Grafana dashboards.
Note
If the dashboards have no data, check the security group settings. For more information, see Why do the dashboards have no data?

2. (Optional) Modify Host Monitor

You can modify the configurations of Host Monitor, such as the service discovery settings, the service port, or the data collection interval.

2.1 Procedure

In the left-side navigation pane of the ARMS console, click Integration Management. On the Integrated Environments tab, click ECS Instance. Click the VPC ID and go to the environment details page.
Find the exporter that you want to modify and click Settings.
Modify the configurations based on your needs and click OK. For more information, see Monitor ECS instances.

2.2 Verification

Refresh the page and click Settings again to check whether the modification takes effect.
View the dashboards and check whether the data meets your expectations. For more information, see section 1.2 View the dashboards.

3. (Optional) Discard metrics

You can discard metrics that you do not need to simplify data analysis and management.

3.1 Procedure

In the left-side navigation pane of the ARMS console, click Integration Management. On the Integrated Environments tab, click ECS Instance. Click the VPC ID and go to the environment details page.
In the Discard Metrics section of the Metric Scraping tab, select the metrics that you want to discard and click Update. For information about metrics, see Metrics.
Note
You cannot discard the basic metrics about ACK clusters.
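Conceptually, discarding a metric means that samples with that metric name are no longer kept. The following sketch illustrates that effect on Prometheus text-format output. It only illustrates the idea and is not how the managed agent implements it; `metric_name` and `drop_discarded` are hypothetical helpers.

```python
def metric_name(sample_line):
    """Return the metric name of a text-format sample line."""
    # The name ends at the first '{' (start of labels) or space (before the value).
    for i, ch in enumerate(sample_line):
        if ch in "{ ":
            return sample_line[:i]
    return sample_line


def drop_discarded(exposition, discarded):
    """Return the sample lines whose metric name is not in the discard set."""
    discarded = set(discarded)
    kept = []
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if metric_name(line) not in discarded:
            kept.append(line)
    return kept


if __name__ == "__main__":
    sample = 'node_load1 0.5\nnode_cpu_seconds_total{cpu="0",mode="idle"} 12.5\n'
    for line in drop_discarded(sample, {"node_load1"}):
        print(line)
```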

3.2 Verification

Click Update and refresh the page to check whether the modification takes effect.
View the dashboards and check whether the data meets your expectations. For more information, see section 1.2 View the dashboards.

4. (Optional) Configure alerting

You can configure alert rules to monitor specific metrics. When an alert is triggered due to metric changes, you are notified in a timely manner. This facilitates routine maintenance and troubleshooting.

Managed Service for Prometheus provides built-in alert rules and custom alert rules. You can configure the built-in alert rules or existing custom alert rules, or add custom alert rules based on your business requirements.

4.1 Configure built-in alert rules

By default, the built-in alert rules generate alert events. You can manually configure alert notifications.

In the left-side navigation pane of the ARMS console, click Integration Management. On the Integrated Environments tab, click ECS Instance. Click the VPC ID and go to the environment details page.
In the Addon Type section of the Component Management tab, click Alert Rule. Click View Alert Event or Edit to view or modify an alert rule.
Modify the alert rule based on your needs and click OK. For more information, see Prometheus alert rules.

4.2 Configure custom alert rules

If the built-in alert rules cannot meet your business requirements, you can configure custom alert rules for the ECS instances.

In the left-side navigation pane of the ARMS console, click Integration Management. On the Integrated Environments tab, click ECS Instance. Click the VPC ID and go to the environment details page.
On the Component Management tab, click the VPC next to Default Metric Storage in the Basic Information section.
On the Alert rules page, you can create, modify, or view custom alert rules. For more information, see Prometheus alert rules.

4.3 Verification

Refresh the page to check whether the modification takes effect.
Configure an alert rule that is easy to trigger and risk-free, and then try to trigger the rule to check whether it meets your expectations.
Note
How alert notifications are sent depends on the alert rule.

FAQ

Why do the dashboards have no data?

If the dashboards have no data, the security groups of the ECS instances may not allow access as required. Requirements:

The security group of each ECS instance allows access from the 100.64.0.0/10 and 192.168.0.0/18 CIDR blocks to the ports of Node Exporter and Process Exporter. The default port of Node Exporter is 9100, and the default port of Process Exporter is 9256. If you have modified the ports, use the modified ports. For information about how to view the security group rules of an ECS instance, see Search for security groups.
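The requirement above can be checked offline against a list of inbound rules that you copy from the console. The following is a minimal sketch that represents each rule as a simplified (CIDR block, port) pair. That representation and the `missing_rules` helper are assumptions for illustration, not an Alibaba Cloud API, and port ranges are not handled.

```python
import ipaddress

# Source CIDR blocks that must be allowed, as stated above.
REQUIRED_SOURCES = ("100.64.0.0/10", "192.168.0.0/18")


def missing_rules(rules, ports=(9100, 9256)):
    """Return the (cidr, port) pairs that no inbound rule covers.

    rules: iterable of (cidr, port) pairs, a simplified, hypothetical
    view of the inbound allow rules of a security group.
    ports: exporter ports; the defaults are 9100 and 9256.
    """
    parsed = [(ipaddress.ip_network(cidr), port) for cidr, port in rules]
    missing = []
    for required in REQUIRED_SOURCES:
        net = ipaddress.ip_network(required)
        for port in ports:
            covered = any(
                p == port and allowed.version == net.version and net.subnet_of(allowed)
                for allowed, p in parsed
            )
            if not covered:
                missing.append((required, port))
    return missing


if __name__ == "__main__":
    rules = [("100.64.0.0/10", 9100), ("100.64.0.0/10", 9256), ("192.168.0.0/16", 9100)]
    print(missing_rules(rules))
```

If you changed the exporter ports, pass the modified ports through the `ports` argument.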

Why did Node Exporter fail to be automatically installed in the ECS instance?

Perform the following operations:

Check whether the ECS instance is running as expected.
Check whether open source Node Exporter has been installed in the ECS instance and uses port 9100. If so, find the Node Exporter provided by Alibaba Cloud on the Component Management tab and click Settings to change the port.

How do I check whether Node Exporter has been installed?

Visit http://<ECS-IP>:<PORT>/metrics to check whether metric data is generated as expected. If metric data is available, Node Exporter has been installed.
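The same check can be scripted. The following is a minimal sketch that requests the /metrics endpoint and looks for sample lines that start with `node_`, the prefix that open source Node Exporter uses for its metrics. The helper names are hypothetical, and the script must run on a host that can reach the ECS instance.

```python
import urllib.request


def metrics_url(host, port=9100):
    """Build the metrics URL; 9100 is the default Node Exporter port."""
    return f"http://{host}:{port}/metrics"


def looks_like_node_exporter(body):
    """Return True if the response contains at least one node_* sample line."""
    return any(line.startswith("node_") for line in body.splitlines())


def node_exporter_installed(host, port=9100, timeout=5.0):
    """Fetch /metrics and report whether Node Exporter data is returned."""
    try:
        with urllib.request.urlopen(metrics_url(host, port), timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except OSError:
        # Covers refused connections, timeouts, and HTTP errors.
        return False
    return looks_like_node_exporter(body)
```

For example, `node_exporter_installed("<ECS-IP>")` checks the default port. Pass the `port` argument if you modified the port.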

How do I manually configure security group rules?

Log on to the ECS console. Manually add an inbound rule in the security group settings of each ECS instance to allow access from the CIDR block of the VPC to the ports of Node Exporter and Process Exporter. The default port of Node Exporter is 9100, and the default port of Process Exporter is 9256. If you have modified the ports, use the modified ports.

What do I do if network connection issues occur when I integrate ECS instances into Managed Service for Prometheus?

Make sure that each ECS instance and the Prometheus agent can access each other through the VPC. First, check the route table configurations of the VPC and make sure that the direction of the traffic is correct. Then, check whether the firewall or security group rules allow the traffic.
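A quick TCP check from another host in the same VPC can help separate routing or security group problems from exporter problems. The following is a minimal sketch with a hypothetical helper name. It tests reachability only from the host where it runs, so a successful result does not prove that the Prometheus agent itself can connect.

```python
import socket


def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


if __name__ == "__main__":
    # 127.0.0.1 is a placeholder; replace it with the private IP of the ECS instance.
    for name, port in (("Node Exporter", 9100), ("Process Exporter", 9256)):
        print(name, port, port_reachable("127.0.0.1", port))
```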

Why is the metric data inaccurate or missing?

If Node Exporter and Process Exporter are enabled, check whether they are running as expected. To do so, use a command line tool, such as cURL, to query metrics exposed by the exporters to check whether data is returned as expected. If exceptions occur on the exporters, check the logs.
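After you save the output of such a query, you can check it for the metrics that you expect. The following is a minimal sketch that parses Prometheus text-format output and lists the expected metric names that are absent. The helper names are hypothetical, and the metric names in the example are common open source exporter metrics used only for illustration.

```python
def present_metric_names(exposition):
    """Collect the metric names that appear in text-format output."""
    names = set()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The name is everything before the labels ('{') or the value.
        parts = line.split("{", 1)[0].split()
        if parts:
            names.add(parts[0])
    return names


def missing_metrics(exposition, expected):
    """Return the expected metric names that are absent, sorted."""
    present = present_metric_names(exposition)
    return sorted(name for name in expected if name not in present)


if __name__ == "__main__":
    output = 'node_load1 0.42\nnamedprocess_namegroup_num_procs{groupname="sshd"} 1\n'
    print(missing_metrics(output, ["node_load1", "node_cpu_seconds_total"]))
```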

What do I do if I want to enable the collection of process status data?

Process status data is collected by Process Exporter. By default, Process Exporter uses port 9256. Make sure that port 9256 is allowed in the security group of each ECS instance. Collecting process status data consumes a small amount of system resources and generally does not affect system performance. However, if system resources are already insufficient, proceed with caution.