Monitor Hologres instance metrics and configure alert rules by using CloudMonitor

CloudMonitor provides all-in-one monitoring solutions for enterprises in the cloud. The cloud service monitoring feature of CloudMonitor supports Hologres. You can use CloudMonitor to gain a comprehensive understanding of the resource utilization, business operations, and health status of Hologres instances. CloudMonitor can also send alert notifications to help you handle exceptions at the earliest opportunity and ensure that applications run as expected. This topic describes how to monitor Hologres instance metrics and configure alert rules by using CloudMonitor.

Prerequisites

A Hologres instance is purchased. For more information, see Purchase a Hologres instance.

Usage notes

CloudMonitor provides dedicated metrics for different instance types and displays the metrics on different tabs by instance type, such as Hologres follower instance, Hologres acceleration instance, Hologres standard instance, and Hologres warehouse instance. This facilitates business monitoring and troubleshooting. We recommend that you view metrics on the tab of a specific instance type instead of the Hologres tab.

Metrics

For more information about Hologres instance metrics that are supported by CloudMonitor, see Hologres metrics.

View metrics

You can log on to the CloudMonitor console to view metrics.

1. Log on to the CloudMonitor console.
2. In the left-side navigation pane, choose Cloud Service Monitoring > Cloud Service Monitoring.
3. In the Big Data section, click the desired instance type. The instance type can be Hologres follower instance, Hologres acceleration instance, Hologres standard instance, or Hologres warehouse instance.
4. Click the icon next to the region and select the region where your instance resides.
5. Click the ID of your instance or click Monitoring Charts in the Actions column of your instance.
Note
You can specify a time period to view the metrics of the instance. You can query only the metrics in the previous 30 days.
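
If you script metric queries, it can help to enforce the 30-day limit from the note above before you send a request. The following is a minimal sketch that uses only the Python standard library. The function name `clamp_query_window` and the constant `MAX_LOOKBACK` are illustrative and are not part of CloudMonitor.

```python
from datetime import datetime, timedelta, timezone

# Retention window taken from the note above: only the previous 30 days can be queried.
MAX_LOOKBACK = timedelta(days=30)


def clamp_query_window(start, end, now=None):
    """Limit (start, end) to the 30 days that precede `now`.

    All arguments are timezone-aware datetimes. `now` defaults to the
    current UTC time and is a parameter only to make the function testable.
    """
    now = now or datetime.now(timezone.utc)
    earliest = now - MAX_LOOKBACK
    start = max(start, earliest)  # drop anything older than 30 days
    end = min(end, now)           # a window cannot extend into the future
    if start >= end:
        raise ValueError("query window falls outside the last 30 days")
    return start, end
```

For example, a window that starts 60 days ago is shortened so that it starts exactly 30 days ago, and a window that lies entirely outside the retention period raises an error instead of returning an empty result.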

Configure alert rules

Enable initiative alert

You can enable the initiative alert feature for Hologres in the CloudMonitor console. The initiative alert feature allows you to configure default alert rules based on different metrics, such as metrics related to CPU utilization, disk usage, memory usage, and number of connections, for all Hologres instances of your Alibaba Cloud account. This helps you identify issues at the earliest opportunity. The following default alert rules are provided:

- If the average connection usage is greater than or equal to 95% in three consecutive cycles, an info-level alert notification is sent to the contacts in the alert contact group.
- If the average storage usage is greater than 90% in three consecutive cycles, a warn-level alert notification is sent to the contacts in the alert contact group.
- If the average memory usage is greater than or equal to 90% in three consecutive cycles, a warn-level alert notification is sent to the contacts in the alert contact group.
- If the average CPU utilization is greater than or equal to 99% in three consecutive cycles, an info-level alert notification is sent to the contacts in the alert contact group.

Note

By default, each cycle lasts for 5 minutes. You can also specify a custom cycle duration.
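
The "N consecutive cycles" condition used by the default rules above can be sketched as follows. This is an illustration of the rule semantics as described in this topic, not CloudMonitor's actual implementation. The function name, the `DEFAULT_RULES` structure, and the metric labels are hypothetical. Each value in `values` is assumed to be the average for one cycle, ordered from oldest to newest.

```python
def breaches_consecutively(values, threshold, cycles, inclusive=True):
    """Return True if the most recent `cycles` values all breach `threshold`.

    inclusive=True  means "greater than or equal to" the threshold.
    inclusive=False means strictly "greater than" the threshold.
    """
    if len(values) < cycles:
        return False  # not enough data points to satisfy the rule yet
    recent = values[-cycles:]
    if inclusive:
        return all(v >= threshold for v in recent)
    return all(v > threshold for v in recent)


# The four default rules listed above, each evaluated over three cycles.
# Labels are descriptive only and are not CloudMonitor metric identifiers.
DEFAULT_RULES = [
    {"metric": "connection usage", "threshold": 95, "inclusive": True, "level": "info"},
    {"metric": "storage usage", "threshold": 90, "inclusive": False, "level": "warn"},
    {"metric": "memory usage", "threshold": 90, "inclusive": True, "level": "warn"},
    {"metric": "CPU utilization", "threshold": 99, "inclusive": True, "level": "info"},
]


def triggered_levels(samples_by_metric, cycles=3):
    """Map each metric that breaches its default rule to the alert level."""
    result = {}
    for rule in DEFAULT_RULES:
        values = samples_by_metric.get(rule["metric"], [])
        if breaches_consecutively(values, rule["threshold"], cycles, rule["inclusive"]):
            result[rule["metric"]] = rule["level"]
    return result
```

Note that the storage rule is written as "greater than" rather than "greater than or equal to", which is why the sketch carries an `inclusive` flag: three cycles at exactly 90% trigger the memory rule but not the storage rule.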

Create alert rules

In addition to the initiative alert feature, you can perform the following steps to configure custom alert rules for metrics based on your business requirements:

1. Log on to the CloudMonitor console.
2. In the left-side navigation pane, choose Alerts > Alert Rules.
3. On the Alert Rules page, click Create Alert Rule. In the Create Alert Rule pane, configure parameters based on your business requirements. For more information, see Create an alert rule.

Best practices for configuring alert rules

This section describes the recommended alert rules for different metrics.

Instance CPU Usage(%)

This metric indicates whether resource bottlenecks exist or whether resources are fully utilized on your Hologres instance. Recommended configurations:

Alert rules:
- Critical: If the value of this metric is greater than or equal to 99% in 60 consecutive cycles, a critical-level alert is reported. Each cycle lasts for 1 minute. Based on this alert, you can effectively monitor the resource usage of an instance and determine whether to perform a scale-out operation.
- Warn: If the value of this metric is greater than or equal to 99% in 10 consecutive cycles, a warn-level alert is reported. Each cycle lasts for 1 minute. If this alert is reported, you can check whether the high CPU utilization is caused by business changes.
We recommend that you do not configure an alert that is triggered as soon as the value of this metric reaches 100%. A brief period of 100% CPU utilization does not by itself indicate an overload or an exception. It only shows that resources are fully utilized.

We recommend that you do not set the alert threshold of this metric to an excessively small value. Even when no tasks are running, some components remain active and consume resources.
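
The critical and warn rules above share the same 99% threshold and differ only in how long the condition must hold. The following sketch shows one way to reason about that tiering, assuming one sample per 1-minute cycle, ordered from oldest to newest. The function names are hypothetical, and the logic illustrates the rule semantics described here rather than how CloudMonitor evaluates rules internally.

```python
def consecutive_breaches(samples, threshold):
    """Count how many of the most recent samples are >= threshold without a gap."""
    count = 0
    for value in reversed(samples):
        if value < threshold:
            break
        count += 1
    return count


def cpu_alert_level(samples, threshold=99.0, warn_cycles=10, critical_cycles=60):
    """Return "critical", "warn", or None for a series of CPU usage samples (%)."""
    run = consecutive_breaches(samples, threshold)
    if run >= critical_cycles:
        return "critical"  # sustained for about an hour: consider scaling out
    if run >= warn_cycles:
        return "warn"      # sustained for 10 minutes: check for business changes
    return None            # short spikes are treated as normal utilization
```

With these defaults, five minutes at 100% produces no alert, which matches the recommendation not to alert on short bursts of full utilization.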

Worker CPU Usage(%)

This metric indicates whether resource bottlenecks exist or whether resources are fully utilized on each worker node of your Hologres instance. Recommended configurations:

Alert rules:
- Critical: If the value of this metric is greater than or equal to 99% in 60 consecutive cycles, a critical-level alert is reported. Each cycle lasts for 1 minute. Based on this alert, you can effectively monitor the resource usage of each worker node and determine whether to perform a scale-out operation.
- Warn: If the value of this metric is greater than or equal to 99% in 10 consecutive cycles, a warn-level alert is reported. Each cycle lasts for 1 minute. If this alert is reported, you can check whether the high CPU utilization is caused by business changes.
We recommend that you do not configure an alert that is triggered as soon as the value of this metric reaches 100%. A brief period of 100% CPU utilization does not by itself indicate an overload or an exception. It only shows that resources are fully utilized.
We recommend that you do not set the alert threshold of this metric to an excessively small value. Even when no tasks are running, some components remain active and consume resources.

Instance Memory Usage(%)

This metric indicates the memory usage of an instance. Recommended configurations:

Alert rules:
- Critical: If the value of this metric is greater than or equal to 99% in 60 consecutive cycles, a critical-level alert is reported. Each cycle lasts for 1 minute. Based on this alert, you can effectively monitor the memory usage of an instance and determine whether to perform a scale-out operation.
- Warn: If the value of this metric is greater than or equal to 99% in 10 consecutive cycles, a warn-level alert is reported. Each cycle lasts for 1 minute. If this alert is reported, you can check whether the high memory usage is caused by business changes.
We recommend that you do not set the threshold of this metric for triggering an alert to an excessively small value. In addition to queries, metadata and cached data consume memory resources. Memory resources are consumed even if no tasks are run on the instance.

Worker Memory Usage(%)

This metric indicates the memory usage of a worker node. Recommended configurations:

Alert rules:
- Critical: If the value of this metric is greater than or equal to 99% in 60 consecutive cycles, a critical-level alert is reported. Each cycle lasts for 1 minute. Based on this alert, you can effectively monitor the memory usage of each worker node on an instance and determine whether to perform a scale-out operation.
- Warn: If the value of this metric is greater than or equal to 99% in 10 consecutive cycles, a warn-level alert is reported. Each cycle lasts for 1 minute. If this alert is reported, you can check whether the high memory usage is caused by business changes.
We recommend that you do not set the threshold of this metric for triggering an alert to an excessively small value. In addition to queries, metadata and cached data occupy memory resources. Memory resources are consumed even if no tasks are run on the instance.

Max Connection Usage(%)

This metric indicates the maximum connection usage among FE nodes on an instance. Recommended configurations:

Warn: If the value of this metric is greater than or equal to 95% for five consecutive cycles, a warn-level alert is reported. Each cycle lasts for 1 minute. Based on this alert, you can effectively monitor the connection usage of an instance and close idle connections at the earliest opportunity.

Binlog WAL Sender Usage(%)

This metric indicates the maximum walsender usage among FE nodes. Recommended configurations:

Longest Active Query Time(milliseconds)

Based on this metric, you can check whether a long-running query exists on an instance at the specified point in time. Recommended configurations:

Warn: If the value of this metric is greater than or equal to 3,600,000 milliseconds in 10 consecutive cycles, a warn-level alert is reported. Each cycle lasts for 1 minute.
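
Thresholds for this metric are expressed in milliseconds, which are easy to misread by a factor of 1,000. As a quick sanity check, the following sketch converts between milliseconds and readable durations by using only the Python standard library. The helper names `to_ms` and `ms_to_timedelta` are illustrative and are not part of any Hologres or CloudMonitor API.

```python
from datetime import timedelta

_ONE_MS = timedelta(milliseconds=1)


def to_ms(**duration):
    """Convert a duration such as hours=1 or minutes=5 to whole milliseconds."""
    return timedelta(**duration) // _ONE_MS


def ms_to_timedelta(milliseconds):
    """Convert a millisecond threshold back to a timedelta for display."""
    return timedelta(milliseconds=milliseconds)


# The recommended warn threshold above corresponds to one hour.
one_hour_ms = to_ms(hours=1)
print(one_hour_ms, "ms =", ms_to_timedelta(one_hour_ms))
```

The recommended value of 3,600,000 milliseconds therefore means that a query has been active for at least one hour in each of the evaluated cycles.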

Serverless Computing Longest Active Query Time(milliseconds)

You can use this metric to effectively monitor the running status of tasks that use serverless computing resources. If the running duration of a task is excessively long, you can cancel the task at the earliest opportunity. Recommended configurations:

Warn: If the value of this metric is greater than or equal to 3,600,000 milliseconds in 10 consecutive cycles, a warn-level alert is reported. Each cycle lasts for 1 minute.

Failed Query QPS(countS)

This metric indicates the total number of failed queries per second on an instance. You can configure alert rules based on this metric. Recommended configurations:

Warn: If the value of this metric is greater than or equal to 10 in 10 consecutive cycles, a warn-level alert is reported. Each cycle lasts for 1 minute. If a large number of queries fail on an instance, we recommend that you check the failure details in the slow query logs and resolve the underlying causes.

FE Replay Running Time(milliseconds)

This metric indicates the replay duration of each FE node. If the value of this metric is excessively large, queries may be stuck at an FE node. In this case, perform troubleshooting at the earliest opportunity. Recommended configurations:

Alert rules:
Warn: If the value of this metric is greater than or equal to 300,000 milliseconds in 10 consecutive cycles, a warn-level alert is reported. Each cycle lasts for 1 minute. In this case, use the HoloWeb console to identify and cancel long-running queries.
We recommend that you do not set the alert threshold of this metric to an excessively small value. FE replay occurs whenever the metadata of an instance is modified. In most cases, a value of no more than a few seconds is considered normal.

Instance Sync Lag(milliseconds)

This metric is displayed only for secondary instances and indicates the latency of data synchronization from the primary instance to a secondary instance. Recommended configurations:

Warn: If the value of this metric is greater than or equal to 600,000 milliseconds for 10 consecutive cycles, a warn-level alert is reported. Each cycle lasts for 1 minute.

Stats Miss Table Num by DB(countS)

This metric indicates the performance of the auto-analyze feature. If statistical information of specific tables is not collected for a long period of time, manually execute the ANALYZE statement on the tables. For more information, see ANALYZE and auto-analyze. Recommended configurations:

Alert rules:
Warn: If the value of this metric is greater than or equal to 10 in 60 consecutive cycles, a warn-level alert is reported. Each cycle lasts for 1 minute.
We recommend that you do not set the threshold of this metric for triggering an alert to an excessively small value. This is because the execution speed of the auto-analyze feature decreases if an instance contains a large number of tables.

Troubleshoot metric-related issues

If a metric unexpectedly fluctuates or an alert is reported, you can troubleshoot the issue by following the instructions in Metric FAQ.

View metrics by calling API operations

In addition to the CloudMonitor console, you can view metrics from a custom dashboard or by calling API operations.

For more information about how to view Hologres metrics by calling API operations, see Cloud products.
For more information about how to view Hologres metrics from a custom dashboard, see Manage a custom dashboard.
For more information about how to view Hologres metrics by using Application Real-Time Monitoring Service (ARMS), see Integrate services or components.
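
When you query metrics through the API, the request is typically described by a namespace, a metric name, a time range, a period, and a set of dimensions. The following sketch assembles such parameters in the shape used by the CloudMonitor DescribeMetricList operation. It only builds the parameter dictionary and does not call any SDK. The function name is hypothetical, and the dimension key `instanceId` is an assumption. Take the actual namespace, metric names, and dimension keys from the Hologres metrics reference before you use it.

```python
import json
from datetime import datetime, timezone


def build_metric_query(namespace, metric_name, instance_id, start, end, period_seconds=60):
    """Build request parameters for a metric query over [start, end].

    `start` and `end` are timezone-aware datetimes. Timestamps are sent as
    milliseconds since the Unix epoch, and the period is sent in seconds.
    """
    if start >= end:
        raise ValueError("start must be earlier than end")
    return {
        "Namespace": namespace,       # placeholder: look up the Hologres namespace
        "MetricName": metric_name,    # placeholder: look up the metric identifier
        "Period": str(period_seconds),
        "StartTime": str(int(start.timestamp() * 1000)),
        "EndTime": str(int(end.timestamp() * 1000)),
        # Assumption: instances are identified by an "instanceId" dimension.
        "Dimensions": json.dumps([{"instanceId": instance_id}]),
    }


params = build_metric_query(
    namespace="<hologres-namespace>",
    metric_name="<metric-name>",
    instance_id="<instance-id>",
    start=datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc),
    end=datetime(2024, 1, 1, 1, 0, tzinfo=timezone.utc),
)
```

The resulting dictionary can be passed to whichever CloudMonitor client you use. Keeping parameter construction separate from the call makes it easier to validate the time range and period before a request is sent.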

Grant required permissions on CloudMonitor to a RAM user

By default, RAM users do not have permissions on CloudMonitor. You must grant the required permissions on CloudMonitor to a RAM user based on your business requirements.

You can log on to the Resource Access Management (RAM) console by using your Alibaba Cloud account and grant permissions by following the instructions in Grant permissions to a RAM user. The following table describes the permissions.

Note

You can grant the required permissions based on your business requirements.

Policy	Description
AliyunCloudMonitorFullAccess	Permissions to manage CloudMonitor.
AliyunCloudMonitorReadOnlyAccess	Read-only permissions on CloudMonitor.
AliyunCloudMonitorMetricDataReadOnlyAccess	Permissions to access time series metrics in CloudMonitor.