Use monitoring information to optimize cluster performance - AnalyticDB

AnalyticDB for MySQL provides various metrics on the Monitoring and Alerts page to allow you to view the performance and running status of AnalyticDB for MySQL clusters. This topic describes how to identify the causes of abnormal metrics.

For information about how to view the metrics of an AnalyticDB for MySQL cluster, see View monitoring information of AnalyticDB for MySQL.

Cluster resource metrics

CPU utilization metrics

AnalyticDB for MySQL uses CPU utilization metrics to display the maximum CPU utilization and the average CPU utilization across nodes. The following table describes the metrics supported by different AnalyticDB for MySQL editions.

Edition	Description
Data Lakehouse Edition	Maximum CPU utilization and average CPU utilization across storage nodes and compute nodes
Data Warehouse Edition in elastic mode
Data Warehouse Edition in reserved mode	Maximum CPU utilization and average CPU utilization across storage nodes

Increased average CPU utilization

The average CPU utilization metric indicates the average value of CPU utilization across nodes at a specific point in time. An increase in the average CPU utilization may affect cluster stability and result in slower query and write speeds. If the average CPU utilization continues to increase, the cluster is at risk and must be optimized at the earliest opportunity.

An increase in the average CPU utilization may be caused by the following reasons:

Queries
An increase in the average CPU utilization may be caused by queries, such as bad SQL queries. For example, an SQL query contains complex computing logic, must process large amounts of data, or results in a Cartesian product error due to missing JOIN conditions. In this case, you can use the diagnostics feature on the Monitoring and Alerts page to identify the problematic queries.
- For bad SQL detection results, an increase in the average CPU utilization may be caused by SQL queries that require an extended period of time to complete, process large amounts of data, involve multiple stages, and consume large amounts of CPU resources. You must analyze the causes based on the self-diagnostics results or execution plans.
- For abnormal pattern detection results, you can identify the causes from various factors such as large amounts of data, time, and CPU resources.
- If the average CPU utilization across compute nodes or storage nodes increases, you can use the abnormal operator detection results in compute layer detection and storage layer detection to analyze the causes. The operator details and operator summary help you identify abnormal operators based on the consumed CPU resources.
Writes
Write operations, such as the INSERT, UPDATE, DELETE, REPLACE, INSERT OVERWRITE, and INSERT INTO SELECT operations, may consume large amounts of CPU resources and cause the average CPU utilization across storage nodes to increase. In this case, you can check for a surge in metrics, such as the delete transactions per second (TPS), write TPS, update TPS, and load TPS metrics.
An increase in the average CPU utilization due to writes may be caused by the following reasons:
- Long primary keys
  A long primary key may result in a large primary key index, which consumes large amounts of CPU resources.
- DELETE statements
  If a single DELETE WHERE statement matches a large number of rows of data, the compute engine must calculate the primary keys of all rows and send the primary keys to storage nodes for deletion. A DELETE statement may be amplified many times and cause the average CPU utilization to increase.
- UPDATE statements
  If a single UPDATE WHERE statement matches a large number of rows, the compute engine must calculate the primary keys of all these rows, update the corresponding field values, and then send the primary keys and new values to storage nodes to label old rows and append new rows. An UPDATE statement may be amplified many times and cause the average CPU utilization to increase.
- INSERT OVERWRITE SELECT statements
  A batch load requires data parsing, data sorting based on clustered index fields (if existent), and the creation of primary key indexes and regular indexes. The preceding operations are CPU-bound and require one thread for each shard. A limit is imposed on the number of concurrent batch load operations. For example, you can execute up to two batch load statements at the same time. However, the CPU utilization may still increase because each shard requires one thread to perform batch load operations.
- INSERT INTO SELECT statements
  If a large amount of data is written in a short period of time, accumulated BUILD jobs in the background may cause real-time data to increase. In this case, if real-time data is involved in queries, AnalyticDB for MySQL must scan large amounts of real-time data that is not indexed. As a result, the CPU utilization increases.
BUILD jobs
A BUILD job contains operations such as index creation, partition creation, and partition deletion. The preceding operations may cause the CPU utilization to increase across storage nodes. You can compare the CPU utilization and the number of BUILD jobs in the AnalyticDB for MySQL console to view the correlation between the two metrics.
Note
- For more information about BUILD jobs, see BUILD.
- For information about how to identify and analyze the causes for increased resource usage related to BUILD jobs, see the "Increased number of BUILD jobs" section of this topic.

Skewed CPU utilization

The maximum CPU utilization metric indicates the maximum value of CPU utilization across nodes at a specific point in time. If a large difference exists between the maximum CPU utilization and the average CPU utilization for an extended period of time, a large amount of data is processed on specific nodes and a small amount of data is processed on other nodes. This may result in CPU utilization skew. If the CPU utilization skew is serious, cluster stability is significantly affected and resources are wasted. For example, the maximum CPU utilization is more than twice the average CPU utilization. The performance of distributed query tasks is limited by the maximum CPU utilization and cannot be improved. To resolve this issue, you must upgrade the node configurations. However, the CPU utilization of other nodes is not high.

A skewed CPU utilization may be caused by the following reasons:

Source table skew
In most cases, if you select an inappropriate distribution key when you create a table, data may be unevenly distributed across shards.
The following figure shows that data of a large table is unevenly distributed. Shard_0 and Shard_1 on Storage Node 0 contain a large amount of data, and Shard_2 and Shard_3 on Storage Node 1 contain a small amount of data. When you query the large table, Storage Node 0 must process more data than Storage Node 1. In this case, the CPU utilization of Storage Node 0 is persistently higher than that of Storage Node 1, which results in CPU utilization skew.
For information about how to diagnose source table skew, see Storage diagnostics. You can also use the diagnostics feature on the Monitoring and Alerts page to check for skewed tables that occupy a large disk storage space and analyze resource skew.
Intermediate data skew
Intermediate data skew is different from source table skew. In intermediate data skew scenarios, data of the source table may be evenly distributed across shards, but data of a specific field may be unevenly distributed across shards.
If you use an unevenly distributed field in a GROUP BY clause, aggregate function, or JOIN clause, AnalyticDB for MySQL redistributes data across nodes based on the field. Data of the same field value is distributed to the same node. As a result, uneven data distribution related to fields occurs.
The following figure shows that a table is distributed based on Field a. Data of the table is evenly distributed across storage nodes because data of Field a is evenly distributed. If you use Field b in the GROUP BY clause, Storage Node 1 distributes the rows whose value of Field b is b1 to Compute Node 1. To ensure that Compute Node 1 has all rows that contain the b1 value, Storage Node 2 distributes the rows whose value of Field b is b1 to Compute Node 1 and the rows whose value of Field b is b2 to Compute Node 2. As a result, Compute Node 1 has more rows than Compute Node 2, and data skew occurs. Compute Node 1 requires more cluster resources to process data than Compute Node 2.
For information about how to identify the causes for CPU utilization skew related to intermediate data skew, see Stage-level diagnostic results.

Disk read/write metrics

Increased disk I/O throughput

The disk I/O throughput metric indicates the throughput of the underlying storage media. Unit: MB/s. For information about the maximum value of the disk I/O throughput metric, see ESSDs. The maximum value is an ideal-state value and does not completely match the actual workload value. In most cases, the actual workload value can reach approximately 80% of the nominal value.

An increase in the disk I/O throughput may be caused by the following reasons:

A large amount of data is written. You can check whether the TPS-related metrics increase when the disk I/O throughput increases.
A large amount of source table data is read to perform queries. You can perform diagnostics on the Monitoring and Alerts page and identify the queries that read large amounts of data from the bad SQL detection results. You can also identify the problematic queries on the Diagnostics and Optimization page. The actual methods vary based on the cluster edition.
- Data Lakehouse Edition: In the left-side navigation pane, choose Diagnostics and Optimization > SQL Diagnostics and Optimization. On the SQL Queries tab, sort the values of the Scanned Data parameter in descending order when the disk I/O throughput increases.
- Data Warehouse Edition: In the left-side navigation pane, click Diagnostics and Optimization. On the SQL Queries tab, sort the values of the Average Data Scanned and Maximum Data Scanned parameters in descending order when the disk I/O throughput increases.
A large number of BUILD jobs run in the background at the same time. You can view the correlation between the I/O throughput and the number of BUILD jobs on the Monitoring and Alerts page.
The AnalyticDB for MySQL cluster is performing backup or scaling operations.

Increased disk IOPS

The disk IOPS metric indicates the number of I/O operations per second of the underlying storage media. For information about the maximum value of the disk IOPS metric, see ESSDs. The maximum value is an ideal-state value and does not completely match the actual workload value. In most cases, the actual workload value can reach approximately 80% of the nominal value.

An increase in the disk IOPS may be caused by the following reasons:

A large amount of data is written. You can check whether the TPS-related metrics increase when the disk IOPS increases.
A large number of point queries (for example, WHERE a=3 is specified) are concurrently performed and the queried data is scattered. In this case, you must perform multiple disk reads. As a result, the disk IOPS increases.
A large number of BUILD jobs run in the background at the same time. You can view the correlation between the IOPS and the number of BUILD jobs on the Monitoring and Alerts page.
The AnalyticDB for MySQL cluster is performing backup or scaling operations.

Memory metrics

Increased compute memory usage

AnalyticDB for MySQL consumes a large amount of memory resources to process large amounts of data. In most cases, memory-intensive SQL queries may include the Aggregation, TopN, Window, and Join operators.

Aggregation operators
Aggregation operators consume a large amount of memory resources because AnalyticDB for MySQL temporarily stores grouping information in the memory. If the GROUP BY field has a large number of distinct values, AnalyticDB for MySQL consumes a large amount of memory resources in the final stage of distributed aggregation. The partial aggregation stage consumes a small amount of memory resources because this stage does not require global aggregation. Each node can send data to downstream nodes after completing partial aggregation of partial data.
TopN operators
TopN operators are used to perform TopN calculations. For example, if you execute an SQL statement that contains the ORDER BY id LIMIT m,n clause in AnalyticDB for MySQL and the m field is assigned a large value, TopN operators in AnalyticDB for MySQL cache a large amount of data in the memory to perform global sorting. This process consumes large amounts of memory resources.
Window operators
Window operators are used to calculate window functions. Similar to Aggregation operators, Window operators cache a large amount of data in the memory to achieve the final semantic results.
Join operators
AnalyticDB for MySQL supports standard join operations. Hash and index algorithms are used to implement join operations. For more information, see the "Join" section of the Operators topic. The hash algorithm caches small tables (build tables) in the memory and builds a hash table for the build tables in the memory to accelerate the join process. The hash table may use a large amount of memory resources due to the following reasons:
- A large build table
  AnalyticDB for MySQL estimates the sizes of the two tables that you want to join based on statistics and uses the smaller table as the build table. However, the build table may contain large amounts of data.
- Expired or inaccurate statistics
  If the two tables that you want to join are not source tables, but tables that have undergone aggregation, filtering, and join operations, the sizes of the two tables in the source table-based statistics may be inaccurate. If the statistics are expired, AnalyticDB for MySQL may incorrectly select the larger table as the build table and build a hash table for the larger table. For more information, see Statistics.
- A large right table for a left join
  For semantic purposes, the right table of a left join must be used to build a hash table. If the right table of the left join is large, the join operation uses large amounts of memory resources.

For more information about operators, see Operators.

If a large number of SQL queries that include these operators are performed at the same time or a single operator uses large amounts of memory resources, the compute memory usage metric increases. As a result, cluster stability is affected and an error message is returned. The following section describes the common error messages:

Query exceeded reserved memory limit: A query uses a large amount of memory resources on a single node.
Query exceeded system memory pool limit: The length of a single field exceeds the limit or a large number of columns are involved.
Out of Memory Pool size pre cal. available: The physical memory pool is exhausted.
The cluster is out of memory, and your query was killed: The largest query that is running is terminated if the cluster does not have sufficient memory.

To decrease the compute memory usage, you must optimize the SQL queries that include these operators. For more information, see Operator-level diagnosis results.

Other resource metrics

Increased number of BUILD jobs

A BUILD job is used to create indexes for the written data, clear expired data, and asynchronously execute DDL statements. This helps improve read performance. In specific scenarios, BUILD jobs consume a large amount of the CPU and disk I/O resources of storage nodes, which may affect other operations and result in cluster stability issues. The following table describes the BUILD-related metrics.

Metric	Description
Maximum BUILD Jobs	The maximum number of BUILD jobs that are running across storage nodes at a specific point in time.
Average BUILD Jobs	The average number of BUILD jobs that are running across storage nodes at a specific point in time.

An increase in the number of BUILD jobs in an AnalyticDB for MySQL cluster may affect the CPU utilization of storage nodes. You can identify and analyze the causes from the following aspects:

A partition of a partitioned table contains large amounts of data. The probability that a large partition involves write, update, or delete operations is high, which easily triggers BUILD jobs. You can use storage diagnostics to identify these types of tables and perform schema optimization.
A non-partitioned table contains large amounts of data. The probability that a large non-partitioned table involves write, update, or delete operations is high, which easily triggers BUILD jobs.
A large number of read and write requests cause the CPU utilization of storage nodes to remain high for an extended period of time, which results in an extended execution duration of BUILD jobs.

Increased number of unavailable nodes

The nodes of an AnalyticDB for MySQL cluster may become unavailable. The increased number of unavailable nodes may affect cluster stability, slow down queries and writes, and cause errors. In this case, you can check whether the CPU utilization continuously increases and the I/O metrics reach the upper limits.

P95 metrics

AnalyticDB for MySQL provides P95 metrics that indicate the 95th percentile values (P95 values) of performance metrics such as CPU utilization, CPU utilization of access nodes, compute memory usage, disk I/O throughput, disk IOPS, disk I/O usage, and disk I/O wait time. A P95 value is the value that is greater than 95% of the monitored values. For example, if you use the P95 CPU Utilization of Compute Nodes metric for an AnalyticDB for MySQL cluster that has 100 compute nodes, the P95 value of the metric at a specific point in time is the CPU utilization of the 95th compute node in the list of CPU utilizations for all compute nodes sorted in ascending order.

The maximum value, average value, and P95 value have the following differences:

The maximum value indicates the upper limit of data. If a dataset contains outliers or extreme values, the maximum value of a metric may be significantly influenced by these values and may not accurately represent the general or typical level of the data.
The average value is designed to show the central tendency of data. However, if the dataset contains outliers or is skewed, the average value cannot accurately reflect the general level of the data.
The P95 value focuses on the performance of the highly-ranked data points and ignores the most extreme data points. It is suitable for assessing the data performance or level in most cases.

Business metrics

Query-related metrics

Extended query response time

The query response time metric indicates the period of time when a query is submitted up to when the query is queued and executed. For information about the execution duration of queries in AnalyticDB for MySQL, see the "A large response time is displayed on the Monitoring and Alerts page, but no corresponding time-consuming SQL queries are found on the Diagnostics and Optimization page. Why?" section of the Monitoring topic.

The query response time may be extended due to the following reasons:

Bad SQL statements
Bad SQL statements may consume large amounts of cluster resources and affect the execution of normal SQL statements.
Abnormal patterns
In specific scenarios, SQL statements that share the same pattern consume a small amount of resources but are frequently executed, or SQL statements that share the same pattern consume a large amount of resources. As a result, other queries are affected, and the overall query response time is extended.
Increased amount of written data
If the amount of written data increases, AnalyticDB for MySQL consumes large amounts of CPU and disk I/O resources and triggers more BUILD jobs. As a result, the overall query response time is extended.

Note

For more information about the factors that affect the query response time, see Factors that affect query performance.

You can perform diagnostics on the Monitoring and Alerts page and select a period of time during which the query response time is extended to analyze the causes based on the diagnostic results.

Extended query wait time

After a query is submitted to the access layer node of an AnalyticDB for MySQL cluster, the cluster checks whether the query must be queued based on the queue size to prevent excess SQL queries from being concurrently executed and affecting cluster stability. For more information, see Priority queue and concurrency of interactive resource groups.

In most cases, an extended query wait time is caused by performance degradation of the cluster. For example, bad SQL statements or abnormal patterns consume large amounts of cluster resources. You can perform diagnostics on the Monitoring and Alerts page and check the bad SQL and abnormal pattern detection results. An extended query wait time may also be caused by a large amount of written data. If a large amount of data is written, the CPU and disk I/O resources of storage nodes are highly used.

Increased query failure rate

The query failure rate metric measures the proportion of failed queries, but the metric does not provide the failure causes. Query failures may be caused by multiple reasons. The following section describes the common reasons:

SQL errors
- Syntax errors
  If an SQL statement does not conform to the SQL syntax of AnalyticDB for MySQL, an error message is returned in the syntax parsing stage. For example, an SQL statement is incomplete, the format is invalid, or keywords or punctuation marks are missing.
- Semantic errors
  If an SQL statement conforms to the SQL syntax of AnalyticDB for MySQL but a database object is incorrect, an error message is returned in the semantic parsing stage. For example, a table name is incorrect, a column does not exist, the GROUP BY field is missing, or the parameter type of a function is incorrect.
Cluster errors
- Query timeout
  AnalyticDB for MySQL provides a default query timeout period. You can configure a custom query timeout period based on your business requirements. If the query execution duration exceeds the specified timeout period, the query fails.
  Note
  - For information about the default query timeout period, see the "Timeout limits" section of the Limits topic.
  - For information about how to modify the query timeout period, see Config and hint configuration parameters.
- High cluster workloads
  If an AnalyticDB for MySQL cluster encounters high workloads, a node communication timeout or a process failure may result in query failures.
Read-only state
If a Raft log error occurs, AnalyticDB for MySQL sets the current process to the read-only state. In this case, an error message is returned for write operations.
Timeout
If AnalyticDB for MySQL cannot instantly consume Raft log queues, backpressure occurs. For example, long primary keys slow down write operations. As a result, a timeout error message is returned.

Amount of read table data

In AnalyticDB for MySQL, data is stored on different storage nodes. The metrics of the amounts of read table data indicate the amounts of data retrieved from the storage layer to the computing layer by using SQL queries at a specific point in time.

The following figure shows an example of data reading in queries. At the point in time Time_1, data is read from the user, report, customer, test, region, and partition tables by using the following six SQL queries: Query 1, Query 2, Query 3, Query 4, Query 5, and Query 6. At this point in time, the total amount of read table data across all storage nodes is 20.1 GB, which is calculated by using the following formula: 1.6 + 2 + 3 + 0.7 + 4.8 + 8 = 20.1 GB. The average amount of read table data across all storage nodes is 6.7 GB, which is the result of the total amount of read table data across all storage nodes divided by the number of storage nodes and calculated by using the following formula: (1.6 + 2 + 3 + 0.7 + 4.8 + 8)/3 = 6.7 GB. The maximum amount of read table data across all storage nodes is 12.8 GB, which is calculated by using the following formula: 4.8 + 8 = 12.8 GB.

The total amount of read table data, the maximum amount of read table data, and the average amount of read table data can reflect the pressure of SQL queries on an AnalyticDB for MySQL cluster.

A sudden increase in the average amount of read table data indicates that large amounts of data is transmitted from the storage layer to the compute layer for processing. The transmission process consumes more CPU resources and memory resources. At the same time, the amount of data read from the storage layer increases, which consumes more disk I/O resources.
A significant difference between the maximum and average amounts of read table data indicates that discrepancies exist in the amounts of data read from different storage nodes. Specific nodes reach the bottleneck of the amount of data that can be processed excessively earlier than expected, which affects the overall performance of the cluster. In most cases, the preceding issue occurs due to unreasonable table design. For example, if you select fields whose values are unevenly distributed as distribution fields, data is unevenly distributed across multiple storage nodes.

Write-related metrics

Extended response time for writes, deletes, and updates

The write response time metric indicates the amount of time required by an INSERT INTO VALUES statement to process each row of data. The delete response time metric indicates the amount of time required by a DELETE statement to process each row of data. The update response time metric indicates the amount of time required by an UPDATE statement to process each row of data. In most cases, the response time is affected by the following factors:

Increased CPU utilization of storage nodes
The CPU utilization of storage nodes may increase together with other metrics. For example, bad SQL statements cause specific metrics to increase or the write, delete, or update TPS metric increases.
Increased write-related I/O metrics of storage nodes
The write-related I/O metrics of storage nodes include the disk I/O throughput and disk IOPS. The metrics may increase due to the following reasons:
- The number of BUILD jobs increases.
- Backup or scaling operations are performed.
The response time may be affected because the preceding operations require disk writes.