This topic describes the functionality and classification of statistics in AnalyticDB for MySQL, the configurations for automatic statistics collection, and the methods used to manually collect and query statistics.
Overview
The query optimizer converts queries into execution plans that are executed by the execution engine. The quality of an execution plan affects its query performance. Statistics can be used to help the query optimizer generate high-quality execution plans.
AnalyticDB for MySQL provides the automatic statistics collection feature. By default, this feature is enabled. AnalyticDB for MySQL clusters of V3.1.9.2 or later also provide the column group statistics collection feature. By default, this feature is disabled. You can manually enable this feature. Within the maintenance window, AnalyticDB for MySQL automatically performs a full or sampled collection of basic, histogram, and column group statistics based on the data volume of the current table. If a large number of columns are involved, it may take a couple of days to perform a full collection on the columns. Outside the maintenance window, AnalyticDB for MySQL incrementally collects basic statistics on a regular basis.
The statistics collection policies of AnalyticDB for MySQL vary based on the data import methods.
If data is batch imported by using
INSERT OVERWRITE
, AnalyticDB for MySQL immediately collects basic statistics after the data import is complete.If data is imported in real time by using
INSERT INTO
orREPLACE INTO
, AnalyticDB for MySQL does not initiate an incremental collection job until the next maintenance window or the incremental collection period after each BUILD job is complete. We recommend that you manually collect basic statistics after data is imported.
You can also disable the automatic statistics collection feature and execute the ANALYZE TABLE
statement to collect statistics. For more information, see the "Manually collect statistics" section of this topic.
The automatic statistics collection feature of AnalyticDB for MySQL manages internal tables but not external tables. The manual statistics collection feature of AnalyticDB for MySQL manages both internal tables and external tables.
Usage notes
Only AnalyticDB for MySQL Data Lakehouse Edition (V3.0) and Data Warehouse Edition (V3.0) clusters of V3.1.6.1 or later support the statistics collection feature.
For information about how to query the minor version of an AnalyticDB for MySQL cluster, see How do I query the version of an AnalyticDB for MySQL cluster?
To update the minor version of a cluster, submit a ticket.
Classification and selection of statistics
AnalyticDB for MySQL collects three types of statistics: basic statistics, histogram statistics, and column group statistics. For histogram and column group statistics, you can use the full or sampled collection method. For basic statistics, you can use the full, sampled, or automatic incremental collection method. By default, automatic incremental collection is used.
Only AnalyticDB for MySQL clusters of V3.1.9.2 or later support sampled collection for basic, histogram, and column group statistics.
Basic statistics
Basic statistics include the maximum value, the minimum value, the average length (in bytes), the number of distinct values, and the proportion of NULL values in a column.
Collection of basic statistics is suitable for the following columns:
Columns that are not involved in filtering or join operations.
Columns in which data is evenly distributed, such as primary key columns.
Histogram statistics
Histograms are created by dividing basic statistics into data buckets by data range. Each bucket in a histogram describes the characteristics of the data within a specific range.
Histograms are classified into the following types:
Hybrid histogram: similar to an equal-height histogram. It can better describe hot spot values.
Frequency histogram: suitable for columns that have a small number of distinct values. Each value corresponds to a bucket.
AnalyticDB for MySQL automatically selects appropriate types of histograms.
Histogram statistics are suitable for the columns that contain unevenly distributed data and are involved in filtering and join operations. If data is evenly distributed, basic statistics can be used to replace histogram statistics in scenarios that involve filtering and join operations.
Compared with basic statistics, histograms can more accurately reflect the statistics about tables. If a large number of tables are involved, collecting histogram statistics of all columns may decrease the cache hit ratio. Histogram statistics occupy more statistics cache and incur higher costs than basic statistics. By default, the statistics cache can contain approximately 20,000 columns of histogram statistics or 2 million columns of basic statistics.
Column group statistics
Only AnalyticDB for MySQL clusters of V3.1.9.2 or later support column group statistics collection.
Basic and histogram statistics are collected on individual columns. Column group statistics are collected on multiple columns of a table and describe how these columns correlate with each other.
Column group statistics are suitable for aggregating multiple columns. If the columns highly correlate with each other, you can use column group statistics to estimate the number of output rows. This helps you choose an appropriate execution plan.
Automatic statistics collection
Enable or disable the automatic statistics collection feature
AnalyticDB for MySQL provides the automatic statistics collection feature. By default, this feature is enabled. You can execute the following statement to disable or re-enable the automatic statistics collection feature:
SET adb_config O_CBO_AUTONOMOUS_STATS_ENABLED = [false | true];
Enable or disable the automatic column group statistics collection feature
AnalyticDB for MySQL clusters of V3.1.9.2 or later provide the automatic column group statistics collection feature. By default, this feature is disabled. You can execute the following statement to enable or disable the automatic column group statistics collection feature:
SET adb_config O_CBO_MAINTENANCE_WINDOW_COLLECT_GROUP_STATS_ENABLED = [false | true];
Before you can use the automatic column group statistics collection feature, you must enable the automatic statistics collection feature.
Set a maintenance window
The default maintenance window for automatic statistics collection is from 04:00 to 05:00. You can execute the following statement to modify the maintenance window. We recommend that you set the maintenance window to off-peak hours. The interval between the start time and the end time can range from 1 minute to 3 hours. The start time must be earlier than the end time. If the specified maintenance window does not meet the preceding requirements, the default maintenance window is used.
The maintenance time must use the same time zone as the current time of the cluster.
SET adb_config O_CBO_MAINTENANCE_WINDOW_DURATION = [04:00-05:00];
Set a data volume threshold for statistics collection
You can execute the following statement to set a data volume threshold for statistics collection. By default, the threshold is 5 billion rows.
SET adb_config O_CBO_MAINTENANCE_WINDOW_COLLECTOR_ROW_LIMIT = 10000;
If the number of rows in a table exceeds the data volume threshold, the following policies apply:
For AnalyticDB for MySQL clusters of versions earlier than 3.1.9.2, statistics collection skips the table.
For AnalyticDB for MySQL clusters of V3.1.9.2 or later, a sampled collection is executed on the table.
Enable or disable scanning throttling for statistics collection
Within the maintenance window, AnalyticDB for MySQL limits the scanning rate for statistics collection to reduce I/O resource usage. By default, scanning throttling is enabled. If resources are idle within the maintenance window, you can disable scanning throttling to accelerate statistics collection.
SET adb_config O_CBO_AUTONOMOUS_STATS_SCAN_RATE_LIMIT_ENABLED = [false | true];
Collect statistics from a specified resource group
By default, a system account is used in automatic statistics collection. If you want to perform automatic statistics collection in a specific resource group, you can execute the following statement to specify a database account. After you specify a database account, AnalyticDB for MySQL performs automatic statistics collection in the resource group with which the database account is associated. Make sure that the database account has permissions to query all columns of all tables and is associated with a resource group. For more information, see Create a resource group.
SET adb_config O_CBO_AUTONOMOUS_STATS_ACCOUNT = [user_name];
Set an expiration ratio for columns
The default expiration ratio of columns is 0.1 (10%). If the ratio of the updated, deleted, inserted, or replaced rows to the total rows of a table is greater than the expiration ratio, the statistics of the table are considered expired. Then, AnalyticDB for MySQL re-collects statistics on all columns of the expired table within the maintenance window. If the actual expiration ratio of columns does not exceed the specified value, statistics are not automatically collected within the maintenance window.
SET adb_config O_CBO_STATS_EXPIRED_RATIO = 0.1;
Manually collect statistics
Collect statistics on the entire table
The ANALYZE TABLE
statement scans the entire table to collect statistics. For tables that have large volumes of data, a large amount of time is required to collect statistics on the entire table. We recommend that you execute this statement during off-peak hours or perform a sampled collection of statistics.
Syntax
ANALYZE TABLE [schema_name.]table_name [UPDATE [BASIC|HISTOGRAM|GROUP_STATS]] [ON column_name[,...]] [WITH ENABLE SAMPLING]
Parameters
Parameter | Required | Description |
| No | The name of the database. |
| Yes | The name of the table on which you want AnalyticDB for MySQL to collect statistics. You can specify only one table in each |
| No | The type of statistics. Valid values:
Important Only AnalyticDB for MySQL clusters of V3.1.9.2 or later support column group statistics collection. |
| No | The column on which you want to collect statistics. If no column is specified, statistics are collected on all columns of the table. |
| No | Enables sampled collection. Important Only AnalyticDB for MySQL clusters of V3.1.9.2 or later support sampled collection for basic, histogram, and column group statistics. |
Examples
Collect basic statistics on all columns of the
adb_demo.customer
table. You can execute one of the following statements:ANALYZE TABLE adb_demo.customer;
ANALYZE TABLE adb_demo.customer UPDATE BASIC;
Collect basic statistics on the
customer_id
column of theadb_demo.customer
table.ANALYZE TABLE adb_demo.customer UPDATE BASIC ON customer_id;
Collect histogram statistics on the
customer_id
andlogin_time
columns of theadb_demo.customer
table.ANALYZE TABLE adb_demo.customer UPDATE HISTOGRAM ON customer_id,login_time;
Perform a sampled collection of column group statistics on the
customer_id
andlogin_time
columns of theadb_demo.customer
table.ANALYZE TABLE adb_demo.customer UPDATE GROUP_STATS ON customer_id,login_time with enable sampling;
Collect statistics on partitions
Limits
Only AnalyticDB for MySQL Data Lakehouse Edition (V3.0) clusters of V3.1.9.1 or later allow you to execute the ANALYZE TABLE
statement to collect basic statistics on partitions of Object Storage Service (OSS) external tables.
For information about how to query the minor version of a cluster, see How do I query the version of an AnalyticDB for MySQL cluster? To update the minor version of a cluster, contact technical support.
Syntax
ANALYZE TABLE table_name WITH PARTITIONS = ARRAY[ARRAY[PARTITION_KEYS] [, PARTITION_KEYS, ....]]
Parameters
Parameter | Required | Description |
| Yes | The name of the table on which you want AnalyticDB for MySQL to collect statistics. You can specify only one table in each |
| Yes | The partition on which you want to collect statistics. |
Examples
Collect statistics on the 2023-01 and 2023-02 partitions of the
test1
table.ANALYZE TABLE test1 WITH PARTITIONS = ARRAY[ARRAY['2023-01'], ARRAY['2023-02']];
Collect statistics on the (1,1) and (1,0) partitions of the
test2
table.ANALYZE TABLE test2 WITH PARTITIONS = ARRAY[ARRAY[1, 1], ARRAY[1, 0]];
Query statistics
Statistics are stored in AnalyticDB for MySQL in the binary format. You can use the system table INFORMATION_SCHEMA
to query statistics.
Execute the following statement to query table-level statistics:
SELECT * FROM INFORMATION_SCHEMA.TABLE_STATISTICS;
Execute the following statement to query column-level statistics, including basic statistics, histogram statistics, and column group statistics:
SELECT * FROM INFORMATION_SCHEMA.COLUMN_STATISTICS;
FAQ
Why are ANALYZE statements misdiagnosed as slow queries?
ANALYZE
statements that are automatically initiated during the maintenance window are executed with I/O throttling and low CPU priority. The statements may be diagnosed as slow queries because they are executed for extended periods of time. However, this does not affect services. If the CPU load is not high or CPU overload is not closely associated with the maintenance window, you can ignore this issue. If the CPU is continuously overloaded, refer to the "What do I do if the query response time is affected by CPU overload that is caused by statistics collection?" section of this topic to resolve this issue.
Why does a CPU overload occur when I use the statistics feature?
A CPU may become overloaded due to the following reasons:
During the default maintenance window from 04:00 to 05:00, the system performs a full scan on each table to collect column statistics. During this period, the CPU is overloaded.
In most cases, statistics are incrementally collected, which does not consume large amounts of resources. By default, the statistics feature is enabled for AnalyticDB for MySQL Data Warehouse Edition (V3.0) clusters that run V3.1.6 or later. When a cluster is updated from an earlier minor version to V3.1.6 or later, full statistics are collected. This may cause CPU overloads during the first days after the minor version update. The CPU load decreases after the full data scan is complete.
When the CPU is overloaded, check whether the query response time is affected. If the average query response time does not significantly change, the query response time is not affected. The value of the CPU utilization metric may be high during statistics collection, but when queries are executed, resources are preferentially allocated for the execution. This is because ANALYZE
statements are executed with I/O throttling and low CPU priority.
What do I do if the query response time is affected by CPU overload that is caused by statistics collection?
You can resolve this issue by using the following solutions:
Change the maintenance window to off-peak hours.
set adb_config O_CBO_MAINTENANCE_WINDOW_DURATION = [04:00-05:00];
If you cannot determine the appropriate off-peak hours, we recommend that you change the I/O limit for system queries to a value that is greater than or equal to 16 MB. The default value is 50 MB.
set adb_config CSTORE_IO_LIMIT_SYSTEM_QUERY_BPS = 52428800;
Assign statistics collection to a low-priority resource group to isolate loads. For more information, see the "Automatic statistics collection" section of the Statistics topic.
set adb_config O_CBO_AUTONOMOUS_STATS_ACCOUNT = [user_name];
Increase the expiration ratio for columns to reduce the data to be collected. The default value is 0.1. The value ranges between 0 and 1. We recommend that you do not set the expiration ratio to a value that is greater than 0.5.
set adb_config O_CBO_STATS_EXPIRED_RATIO = 0.1;
If none of the preceding solutions resolve this issue, execute the set adb_config O_CBO_AUTONOMOUS_STATS_ENABLED=false;
statement to disable automatic statistics collection. However, performance may be degraded. If you want to collect statistics in the future, you must manually collect statistics. For more information, see the "Manually collect statistics" section of the Statistics topic.
The execution result of SELECT * FROM INFORMATION_SCHEMA.COLUMN_STATISTICS shows that statistics are not updated for several days. Why?
This issue is caused by the following reasons:
The statistics are not expired.
Statistics expire when the amount of data that is updated, inserted, or replaced reaches the expiration ratio. The default expiration ratio is 0.1 (10%). If only a small amount of your data is changed, you can continue to use your cluster as expected and observe this issue for another week.
A large amount of data is contained in many columns and tables.
By default, only 1 hour per day is required to collect statistics that are not included in incremental updates. If a large number of columns and tables are involved, such as more than 1,000 columns, the system may not be able to complete an update within one day. The update may require one week to complete. In this case, the absence of statistics updates within several days is normal. You can continue to use the cluster as expected and observe this issue.
Are statistics automatically updated after data is imported to a new table?
If data is batch imported by using the INSERT OVERWRITE
statement, basic statistics are automatically collected after the data import is complete. If data is imported in real time by using the INSERT INTO
or REPLACE INTO
statement, statistics are collected during the next maintenance window or an incremental collection task is triggered during the incremental collection period after each BUILD task is complete. We recommend that you manually collect basic statistics after data is imported. For more information, see the "Manually collect statistics" section of the Statistics topic.