During the daily O&M of ApsaraDB RDS for MySQL or the handling of instance failures, you must view related performance metrics. The standard monitoring feature of ApsaraDB RDS for MySQL provides various performance metrics and powerful diagnostic capabilities. This helps detect database anomalies at the earliest opportunity and provides required troubleshooting methods. The feature also provides diagnostic views for common issues on RDS instances to help you quickly locate anomalies.
Usage notes
The standard monitoring feature of ApsaraDB RDS for MySQL is upgraded and integrated with the performance trend feature of Database Autonomy Service (DAS) to provide more capabilities. For more information about the performance trend feature, see Performance trends.
DAS is a stable, secure, and efficient cloud service that is designed based on machine learning and expert experience to implement automated perception, recovery, optimization, O&M, and security assurance for databases. DAS helps prevent service failures that are caused by manual operations. For more information, see Overview of DAS.
Custom views: The standard monitoring feature provides more performance monitoring metrics and supports custom views. You can select metrics to create custom views based on your business requirements.
NoteFor more information about the performance parameters that correspond to the metrics, see Performance parameters.
Diagnostic views for common issues: Memory OOM diagnosis, Read-only instance delay diagnosis, Diagnosis of Space Full Problem, CPU jitter diagnosis, and Large transaction recognition diagnosis. You can select a diagnostic view based on your business requirements and use the diagnostic view to locate issues in a quick manner.
Automatic diagnostics: The standard monitoring feature provides powerful diagnostic capabilities to detect events on your RDS instance at the earliest opportunity and automatically diagnose the events to analyze root causes and provide suggestions.
Manual diagnostics: The standard monitoring feature also allows you to select a time period during which you want to perform manual diagnostics.
View standard monitoring information
- Go to the Instances page. In the top navigation bar, select the region in which the RDS instance resides. Then, find the RDS instance and click the ID of the instance.
In the left-side navigation pane, click Monitoring and Alerts.
On the Standard Monitoring tab, click the Standard View or Custom View tab and perform the following operations based on your business requirements:
On the Standard View tab, select a time range to view the trend of metrics and statistics about various events.
You can click Add Trend Comparison to view the comparison of performance trends for a metric within different time ranges.
NoteWhen you specify a time range, the end time must be later than the start time, and the interval between the start time and the end time cannot exceed 30 days.
Event statistics section: You can click View Details to go to the Performance Events tab. On this tab, you can view the details about exceptions, optimization events, and auto scaling events. The details include the events that are scheduled, being executed, and executed.
Trend charts section:
In addition to the classic view, the system provides diagnostic views for common issues. This allows you to quickly identify the causes of issues based on the trends of key metrics in the diagnostic views.
The following diagnostic views are provided: Memory OOM diagnosis, Read-only instance delay diagnosis, Diagnosis of Space Full Problem, CPU jitter diagnosis, and Large transaction recognition diagnosis. You can select a diagnostic view based on your business requirements. For more information, see Use diagnostic views.
NoteYou can click the icon to the right of a monitoring item to view the metrics of the monitoring item.
If you select Classic View, you can click More Metrics to select the monitoring items that you want the system to display.
If you select Classic View, you can select the severity levels of events. If events of the selected severity levels are detected, the system displays the detected events in the MySQL CPU Utilization/Memory Usage and Session trend charts.
You can click an event in a trend chart to view the diagnostic result in the event details.
You can specify a time range and click Diagnose in the trend chart of a monitoring item to analyze the metrics in the selected time range.
You can click Common Causes in the trend chart of a monitoring item to view the common causes of the exceptions in the monitoring item.
You can click Details in the trend chart of a monitoring item to expand the chart. You can also change the time range to view the change trends of the monitoring item in a specific time range.
On the Custom View tab, click Add Monitoring Dashboard to create a custom dashboard to display the metrics that you want to monitor.
You can click Add Node and Metrics to select the RDS instances and metrics for the dashboard.
You can set the Display Mode parameter to Merge Display or Separate Display.
If you select Merge Display, multiple metrics are displayed in the same trend chart.
If you select Separate Display, each metric is displayed in a separate trend chart.
You can configure the Chart Layout parameter to specify the number of trend charts of metrics in each row.
You can click Details in the trend chart of a metric to expand the chart. You can also change the time range to view the change trends of the metric in a specific time range.
Use diagnostic views
Memory OOM diagnosis
You can analyze and resolve the out of memory (OOM) issue based on the trend charts of monitoring items that are displayed in the Memory OOM diagnosis view.
Memory Usage:
If the memory usage slowly and continuously increases for a long period of time, such as more than seven days, and the usage of the InnoDB buffer pool remains unchanged, a memory leak may occur.
If the memory usage suddenly increases and the usage of the InnoDB buffer pool remains unchanged, your services may encounter traffic spikes.
If both the memory usage and the usage of the InnoDB buffer pool increase, InnoDB caches as it is accessed and works as expected.
Resident Memory: shows the memory capacity.
Open files, Temp File Size, Temp Disk Tables, and Sort Rows: show memory consumption.
The increase in memory usage is positively correlated with business metrics. In most cases, SQL statements that cause the increase in memory usage cannot be traced due to OOM errors before the execution of the SQL statements are completed. We recommend that you perform the following operations:
Check business logs to identify the cause of the increase in memory usage.
Expand the memory capacity and enable the SQL Explorer and Audit feature. This way, you can identify the cause of the increase in memory usage based on the points in time at which SQL statements are executed. For more information, see Use the SQL Explorer and Audit feature.
Read-only instance delay diagnosis
You can analyze and resolve latency-related issues of read-only RDS instances based on the trend charts of monitoring items that are displayed in the Read-only instance delay diagnosis view.
Active Session: shows whether metadata locks exist and cause congestions.
In most cases, if a large amount of data is queried, metadata locks cannot be obtained when DDL statements are executed. As a result, the DDL statements block other sessions and cause an accumulation of connections.
Resident Memory: shows the memory capacity.
DML Rows Processed, Pages Requested, DML/DDL Operations, and Temp Disk Space Used: show common business metrics.
Replication Delay: shows the replication latency.
Diagnosis of Space Full Problem
You can analyze and resolve exhausted storage issues based on the trend charts of monitoring items that are displayed in the Diagnosis of Space Full Problem view.
You can check the types of files that occupy storage and the storage usage trend of each type of file. In most cases, the following types of files occupy storage:
Data files (user_data_size): You can use the storage analysis feature to check the storage usage of each database and table. Then, you can expand the storage capacity or delete unnecessary data based on your business requirements. For more information about the storage analysis feature, see Use the storage analysis feature. For more information about how to handle data files, see What do I do if an ApsaraDB RDS for MySQL instance is in the Locked state because its storage capacity is exhausted by data files?
Temporary files (temp_file_size): Temporary tables may be generated when you execute SQL statements to sort and group data or associate tables. Binary log cache files are generated before large transactions are committed. These tables and files occupy storage. For more information about how to handle temporary files, see What do I do if an ApsaraDB RDS for MySQL instance is in the Locked state because its storage capacity is exhausted by temporary files?
Binary logs (binlog_size): Binary logs are quickly generated by large transactions. These logs occupy storage. For more information about how to handle binary logs, see What do I do if the storage capacity of an ApsaraDB RDS for MySQL instance is exhausted by binary log files?
NoteIf binary logs are subscribed, the binary logs may not be deleted at the earliest opportunity and occupy storage.
Undo logs (undo_log_size): In most cases, undo logs cannot be cleared due to long-running queries. You need to check whether queries that are executed for a long period of time and are not completed exist.
NoteIn MySQL 5.6 and earlier versions, no separate tablespaces are provided for undo logs.
Slow query logs (slowlog_size): If slow query logs occupy a large amount of storage, we recommend that you execute the
TRUNCATE
statement to clear slow query logs during off-peak hours.NoteThe
TRUNCATE
statement is supported in MySQL 5.7 with a minor engine version of 20210630 or later and in MySQL 8.0 with a minor engine version of 20210930 or later.General query logs (general_log_size): The value of this metric is the total size of error logs, Performance Agent logs, and recovery logs. In most cases, the total size of these logs is less than 1 GB. If the size significantly exceeds 1 GB, submit a ticket. This metric specifies the kernel metric data that is output by the MySQL kernel on a regular basis. It does not specify the log size of general_log in MySQL.
CPU jitter diagnosis
You can analyze and resolve CPU jitter issues based on the trend charts of monitoring items that are displayed in the CPU jitter diagnosis view. The following types of monitoring items are strongly correlated with CPU utilization.
Business monitoring items:
Page Request: In most cases, the trend of the number of requests in the buffer pool fluctuates based on CPU utilization.
Rows Processed: shows the relationship between CPU utilization and the number of rows that are processed by the system. You can check whether the number of processed rows suddenly increases when CPU utilization changes based on this monitoring item.
Queries: shows the types of SQL statements that are executed when CPU utilization changes.
Connection-related monitoring items:
Thread Running: High concurrency leads to high CPU utilization. Metadata locks or row locks cause an accumulation of connections, which affects CPU utilization.
Common causes of CPU jitters:
The Page Request or Rows Processed monitoring item changes. In this case, you can select the period of time during which CPU utilization changes and click Diagnose to obtain details about the issue.
The number of active connections increases. In this case, you need to resolve the issue from the business side.
Large transaction recognition diagnosis
You can analyze and resolve large transaction-related issues based on the trend charts of monitoring items that are displayed in the Large transaction recognition diagnosis view.
Threads Connected, Temp File Size, and Binlog space: show whether large transactions exist. If one of the following events occurs, a large transaction exists in the database:
Active sessions accumulate.
The storage that is occupied by temporary files increases and then decreases.
The storage that is occupied by temporary files decreases, but the storage that is occupied by binary logs increases.
Rows Processed, Logical Page Write, and Queries per Second: show the type of a large transaction.
For example, if a few queries are executed but a large number of rows are deleted, a large transaction that deletes data exists.
Large transactions cause the blocking of binary log writes.
When large transactions exist, the storage that is occupied by temporary tables (binlog cache) gradually increases and then remains stable.
When the storage that is occupied by temporary tables remains stable, the storage that is occupied by binary logs increases. Binary logs are globally written in serial mode. As a result, other transactions are blocked, which causes the accumulation of connections.
If the RDS instance runs RDS High-availability Edition, the statements that are executed to check the status of primary and secondary RDS instances for primary/secondary switchovers are blocked. As a result, a primary/secondary switchover is directly performed on the RDS instance.
We recommend that you split large transactions into small transactions and separately execute the small transactions. For example, you can add a WHERE
clause to the DELETE
statement to limit the volume of data that is deleted at a time and split the deletion operation into smaller operations.
References
The following topics provide troubleshooting methods for common performance issues:
Troubleshoot slow SQL statements on an ApsaraDB RDS for MySQL instance
Troubleshoot memory consumption issues on an ApsaraDB RDS for MySQL instance
Troubleshoot insufficient storage issues on an ApsaraDB RDS for MySQL instance
Troubleshoot the issues that cause high I/O on an ApsaraDB RDS for MySQL instance
You can use the autonomy service-related features to perform performance diagnosis and optimization on your RDS instance. For more information, see Overview of DAS.