Cluster performance and resource group monitoring metrics - AnalyticDB

AnalyticDB for MySQL allows you to view the cluster performance metrics (such as CPU utilization and disk I/O usage) and resource group metrics of a cluster within a time range in the last month in the AnalyticDB for MySQL console. This helps you identify and resolve issues based on the performance and running status of a cluster.

Usage notes

You can view monitoring information for a time range of up to two days within the last month.
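
As an illustration, the limit above can be checked before you pick a time range. The following is a minimal sketch, not console or API behavior: it reads the note as "a window of at most two days that starts within the last month", and it assumes that "the last month" means 30 days. The function name `is_valid_range` and both constants are hypothetical.

```python
from datetime import datetime, timedelta

MAX_SPAN = timedelta(days=2)     # longest window, as read from the usage note
RETENTION = timedelta(days=30)   # assumption: "the last month" taken as 30 days


def is_valid_range(start: datetime, end: datetime, now: datetime) -> bool:
    """Return True if [start, end] fits the limits in the usage note."""
    if start >= end or end > now:
        return False             # empty, reversed, or future window
    if end - start > MAX_SPAN:
        return False             # window longer than two days
    return start >= now - RETENTION


now = datetime(2024, 5, 31, 12, 0)
print(is_valid_range(now - timedelta(days=1), now, now))  # one-day window
print(is_valid_range(now - timedelta(days=3), now, now))  # three-day window
```

The first call fits within both limits. The second is rejected because the window spans more than two days.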

Procedure

Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Clusters page, click an edition tab. Find the cluster that you want to manage and click the cluster ID.
Go to the Monitoring and Alerts page.
- For a Data Warehouse Edition cluster: In the left-side navigation pane, click Monitoring and Alerts.
- For a Data Lakehouse Edition cluster: In the left-side navigation pane, choose Cluster Management > Monitoring and Alerts.
On the Monitoring tab, click the Standard Views tab or the Custom Views tab to view the corresponding monitoring information.
- The Standard Views tab displays common metrics by default. To view all metrics, click More Metrics.
- To view only specific metrics, click More Metrics on the Standard Views tab, select the metrics that you want to view and clear the metrics that you do not want to view, and then click Save As to add the selected metrics to the Custom Views tab.

Data Lakehouse Edition and Data Warehouse Edition metrics

Health status metrics

Important

You can view the health status information only for clusters of V3.1.6 and later.
If the value of a health status metric is Risky or Unavailable, contact technical support.

Metric	Description
Cluster Access Node Status	The access layer of AnalyticDB for MySQL is composed of multiple cluster access nodes and provides features such as protocol layer access, SQL parsing and optimization, real-time sharding of written data, data scheduling, and query scheduling. Valid values: Healthy: the number of available cluster access nodes. Unavailable: the number of unavailable cluster access nodes.
Health Status of Compute Node Groups	The compute engine of AnalyticDB for MySQL is composed of compute node groups and supports the integrated execution of distributed massively parallel processing (MPP) and directed acyclic graph (DAG) architectures. The compute engine can work with intelligent optimizers to support high concurrency and hybrid loads of complex SQL statements. Additionally, the cloud native infrastructure allows compute nodes to be elastically scaled out within seconds based on business requirements. This way, resources can be used in an efficient manner. Valid values: Healthy: the number of available compute nodes. Unavailable: the number of unavailable compute nodes.
Health Status of Storage Node Groups	The storage engine of AnalyticDB for MySQL is composed of storage node groups and supports real-time data writes with strong consistency and high availability in compliance with the Raft consensus protocol. The storage engine uses data sharding and Multi-Raft to support parallel processing, tiered storage to separate hot and cold data at lower costs, and hybrid row-column storage and intelligent indexing to provide ultra-high performance. Valid values: Healthy: the number of available storage nodes. Risky: the number of at-risk storage nodes. Unavailable: the number of unavailable storage nodes.
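
The rule in the Important note (contact technical support when a health status metric reports Risky or Unavailable) can be sketched as a small check. This assumes that the per-state node counts for one node group have already been collected into a dictionary keyed by state name. The dictionary shape and the function name `needs_support` are hypothetical and are not part of any AnalyticDB for MySQL API.

```python
def needs_support(status: dict) -> bool:
    """Return True if any node in the group is Risky or Unavailable.

    `status` maps a state name ("Healthy", "Risky", "Unavailable") to the
    number of nodes in that state. Missing states count as zero.
    """
    return status.get("Risky", 0) > 0 or status.get("Unavailable", 0) > 0


storage_nodes = {"Healthy": 5, "Risky": 1, "Unavailable": 0}
access_nodes = {"Healthy": 3, "Unavailable": 0}

print(needs_support(storage_nodes))  # one storage node is at risk
print(needs_support(access_nodes))   # all access nodes are healthy
```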

Cluster Resource Monitoring metrics

Category	Metric	Unit	Sub-metric or description
Node Monitoring	CPU Utilization	%	Maximum CPU Utilization of Compute Nodes; P95 CPU Utilization of Compute Nodes; Average CPU Utilization of Compute Nodes; Maximum CPU Utilization of Storage Nodes; P95 CPU Utilization of Storage Nodes; Average CPU Utilization of Storage Nodes. Note: After you change a C32 Data Warehouse Edition cluster in reserved mode to elastic mode, the average CPU utilization increases. For more information, see the "FAQ" section of this topic.
	BUILD Jobs	N/A	Average BUILD Jobs: the average number of BUILD jobs that run across storage nodes. Maximum BUILD Jobs: the maximum number of BUILD jobs that run across storage nodes.
	Compute Memory Usage	%	Maximum Compute Memory Usage; P95 Compute Memory Usage; Average Compute Memory Usage
	Unavailable Nodes	N/A	Unavailable Storage Nodes; Unavailable Compute Nodes
	Amount of Read Table Data	MB	Maximum Amount of Read Table Data; Average Amount of Read Table Data
	CPU Utilization of Access Nodes	%	Maximum CPU Utilization of Access Nodes; P95 CPU Utilization of Access Nodes; Average CPU Utilization of Access Nodes
	Disk I/O Throughput	MB	Maximum Disk Read Throughput of Storage Nodes; P95 Disk Read Throughput of Storage Nodes; Average Disk Read Throughput of Storage Nodes; Maximum Disk Write Throughput of Storage Nodes; P95 Disk Write Throughput of Storage Nodes; Average Disk Write Throughput of Storage Nodes
	Disk IOPS	N/A	Maximum Disk Reads of Storage Nodes; P95 Disk Reads of Storage Nodes; Average Reads of Storage Nodes; Maximum Disk Writes of Storage Nodes; P95 Disk Writes of Storage Nodes; Average Writes of Storage Nodes
	Disk I/O Usage	%	Maximum Disk I/O Usage of Storage Nodes; P95 Disk I/O Usage of Storage Nodes; Average Disk I/O Usage of Storage Nodes
	Disk I/O Wait Time	ms	Maximum Disk I/O Wait Time of Storage Nodes; P95 Disk I/O Wait Time of Storage Nodes; Average Disk I/O Wait Time of Storage Nodes
Data Size Monitoring	Disk Usage	%	Average Disk Usage; Maximum Disk Usage
	Disk Space Used	GB	Cold Data Size (Note: This metric is not available for Data Warehouse Edition in reserved mode because the edition does not support tiered storage of hot and cold data.); Hot Data Size; Maximum Hot Data Size of Storage Nodes; Average Hot Data Size of Storage Nodes
Workload Monitoring	Cluster Connections	N/A	The number of successful connections.
	Query Failure Rate	%	The failure rate of queries. If you select a time range within 24 hours, the query failure rate per minute is displayed, which is calculated by using the following formula: `Query failure rate = (Number of failed SQL queries in 1 minute/Total number of SQL queries in 1 minute) × 100%`. If you select a time range that exceeds 24 hours, the query failure rate for every 5 minutes is displayed, which is calculated by using the following formula: `Query failure rate = (Number of failed SQL queries within 5 minutes/Total number of SQL queries within 5 minutes) × 100%`.
	QPS	N/A	QPS; ETL QPS
	Query Response Time	ms	Average Query Response Time; Maximum Query Response Time
	Query Wait Time	ms	Average Query Wait Time; Maximum Query Wait Time
	Write TPS	N/A	The write transactions per second (TPS) of a cluster.
	Write Response Time	ms	Average Write Response Time; Maximum Write Response Time
	Write Throughput	MB	The average write throughput of a cluster.
	Update TPS	N/A	The update TPS of a cluster.
	Update Response Time	ms	Average Update Response Time; Maximum Update Response Time
	Delete TPS	N/A	The delete TPS of a cluster.
	Delete Response Time	ms	Average Delete Response Time; Maximum Delete Response Time
	Load TPS	N/A	The load TPS of a cluster.
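
The Query Failure Rate formula in the table above can be worked through in a few lines. The following is a minimal sketch for illustration only. The function names `bucket_size` and `query_failure_rate` are hypothetical. Returning 0 for a bucket with no queries is an assumption, because the topic does not describe how empty buckets are displayed.

```python
from datetime import timedelta


def bucket_size(time_range: timedelta) -> timedelta:
    """Aggregation interval: 1 minute within 24 hours, 5 minutes beyond."""
    if time_range <= timedelta(hours=24):
        return timedelta(minutes=1)
    return timedelta(minutes=5)


def query_failure_rate(failed: int, total: int) -> float:
    """Failure rate in percent for one aggregation interval."""
    if total == 0:
        return 0.0  # assumption: behavior for empty intervals is not documented
    return 100.0 * failed / total


# 3 failed queries out of 120 in one interval
print(query_failure_rate(3, 120))          # 2.5
print(bucket_size(timedelta(hours=6)))     # 0:01:00
print(bucket_size(timedelta(hours=48)))    # 0:05:00
```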

Resource Group Monitoring metrics

Data Lakehouse Edition

Metric	Unit	Description
CPU Utilization	%	The CPU utilization of a resource group.
QPS	N/A	The queries per second processed by a resource group.
Query Response Time	ms	The average response time of queries processed by a resource group.
Query Wait Time	ms	The average wait time of queries processed by a resource group.
(XIHE) Running Queries	N/A	The number of running queries in a resource group.
Queued Queries	N/A	The number of queued queries in a resource group.

Data Warehouse Edition

Important

You can view Resource Group Monitoring metrics only for Data Warehouse Edition clusters that meet the following requirements:

The cluster is in elastic mode.
The cluster has 32 cores or more.

Metric	Unit	Description
CPU Utilization	%	The CPU utilization of a resource group.
Query Response Time	ms	The average response time of queries processed by a resource group.
QPS	N/A	The queries per second processed by a resource group.
Query Wait Time	ms	The average wait time of queries processed by a resource group.
Scheduled Nodes Actually Scaled Out	N/A	The number of nodes added to a resource group in a scheduled scaling plan.
Scheduled Nodes to Be Scaled Out	N/A	The number of nodes that need to be added to a resource group in a scheduled scaling plan. For information about how to create a scaling plan for a resource group, see Create a resource scaling plan.
Total Nodes	N/A	The total number of nodes in a resource group, calculated by using the following formula: Total number of nodes = Number of basic nodes + Number of effective nodes in scheduled scaling plans.
Basic Nodes	N/A	The number of basic nodes in a resource group.
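
The Total Nodes formula in the table above can be illustrated as follows. This is a sketch under stated assumptions: the `ScalingPlan` class, its fields, and the function `total_nodes` are hypothetical names that model a scheduled scaling plan as a node count plus a flag for whether the plan is currently in effect.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScalingPlan:
    nodes: int        # nodes that the plan adds to the resource group
    effective: bool   # whether the plan is currently in effect


def total_nodes(basic_nodes: int, plans: List[ScalingPlan]) -> int:
    """Basic nodes plus nodes from scheduled scaling plans that are in effect."""
    return basic_nodes + sum(p.nodes for p in plans if p.effective)


plans = [ScalingPlan(nodes=2, effective=True), ScalingPlan(nodes=3, effective=False)]
print(total_nodes(4, plans))  # 6: only the effective plan is counted
```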

FAQ

Q: Why does the average CPU utilization increase after I change a cluster from reserved mode to elastic mode?
A: After you change a C32 cluster from reserved mode to elastic mode, the specifications of a single node decrease to 8 cores. By default, BUILD jobs occupy 3 cores, so they take up a larger share of the CPU on each node and the average CPU utilization increases. If the increased average CPU utilization does not affect your business, ignore this change. If your business is affected, upgrade your cluster or submit a ticket. For information about BUILD jobs, see BUILD.
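
The arithmetic behind this answer can be sketched as follows. The 3 cores for BUILD jobs and the 8 cores per node come from the answer above. The 32-core figure for a node before the change is an assumption used only for comparison, and the function name `build_core_share` is hypothetical.

```python
def build_core_share(build_cores: int, node_cores: int) -> float:
    """Percentage of the cores on a node that BUILD jobs occupy."""
    return 100.0 * build_cores / node_cores


# After the change: 8 cores per node, 3 of them used by BUILD jobs
print(build_core_share(3, 8))    # 37.5

# Assumption for comparison only: a 32-core node before the change
print(build_core_share(3, 32))   # 9.375
```

With the same 3 cores reserved, the share of each node that BUILD jobs occupy is several times higher on the smaller node, which is consistent with the increase in the average CPU utilization metric.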
Q: Why are the values of Regular Index and Primary Key Index metrics large?
A: The preceding metrics may have large values due to the following reasons:
- Indexes and primary key indexes are created for a large number of columns.
- The length of a value in index columns is large, or the total length of all values in an index column is large. For example, the value of an index column is a long string.
- The number of distinct values in index columns is large, which results in a low index compression ratio. For example, if the values in Index Column A are all distinct (A1, A2, A3, and A4), the data is difficult to compress.
- The length of a value in the primary key is large or multiple columns comprise a composite primary key.
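
The effect of distinct values on compression can be illustrated with a general-purpose compressor. The following sketch uses `zlib` from the Python standard library purely as a stand-in. It does not reproduce the index format or the compression algorithm that AnalyticDB for MySQL uses, and the sample columns and the function name `compression_ratio` are made up for the demonstration.

```python
import zlib


def compression_ratio(values):
    """Raw size divided by compressed size. A higher ratio means better compression."""
    raw = "\n".join(values).encode("utf-8")
    return len(raw) / len(zlib.compress(raw))


rows = 10_000
# Few distinct values: the same two strings repeat across all rows
few_distinct = [f"B{i % 2 + 1}" for i in range(rows)]
# Many distinct values: every row holds a different string
many_distinct = [f"B{i * 7919 % rows}" for i in range(rows)]

print(compression_ratio(few_distinct) > compression_ratio(many_distinct))  # True
```

The column in which every row is distinct compresses far less than the column with repeated values, which matches the reasoning in the list above.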
Q: A long response time is displayed on the Monitoring and Alerts page, but no corresponding time-consuming SQL statements are found on the Diagnostics and Optimization page. Why?
A: When a query returns a large amount of data, caching the result set takes an extended period of time. However, the total duration that is displayed on the Diagnostics and Optimization page consists only of the queuing time, execution plan duration, and execution duration, and excludes the cache duration of the result set. We recommend that you view the corresponding time-consuming SQL statements on the SQL Audit page.
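
The gap between the two pages can be sketched as simple arithmetic. This assumes that the response time on the Monitoring and Alerts page includes the time spent caching the result set, as the answer above implies. The function names and the sample numbers are hypothetical.

```python
def diagnostics_total_ms(queuing_ms: int, planning_ms: int, execution_ms: int) -> int:
    """Total duration as described for the Diagnostics and Optimization page."""
    return queuing_ms + planning_ms + execution_ms


def uncounted_ms(response_ms: int, queuing_ms: int, planning_ms: int, execution_ms: int) -> int:
    """Response time not covered by the diagnostics total, such as result-set caching."""
    return response_ms - diagnostics_total_ms(queuing_ms, planning_ms, execution_ms)


# Made-up numbers: 5,000 ms response time, of which 1,000 ms is queuing, planning, and execution
print(diagnostics_total_ms(100, 50, 850))   # 1000
print(uncounted_ms(5000, 100, 50, 850))     # 4000
```

In this example, a query that looks fast on the Diagnostics and Optimization page still contributes a long response time to the monitoring metric.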

References and related operations

References

Cluster performance optimization

Related operations

Operation	Description
DescribeDBClusterHealthStatus	Queries the health status of an AnalyticDB for MySQL Data Lakehouse Edition cluster.
DescribeDBClusterPerformance	Queries the performance data of an AnalyticDB for MySQL Data Lakehouse Edition cluster.
DescribeComputeResourceUsage	Queries the monitoring information about resource groups within an AnalyticDB for MySQL Data Lakehouse Edition cluster.
DescribeDBClusterPerformance	Queries the performance data of an AnalyticDB for MySQL Data Warehouse Edition cluster.
DescribeDBClusterResourcePoolPerformance	Queries the monitoring information about resource groups within an AnalyticDB for MySQL Data Warehouse Edition cluster.