Use ANALYZE statements and the auto-analyze feature to collect statistics about a database - Hologres

This topic describes how to use ANALYZE statements and the auto-analyze feature.

ANALYZE statements

Statistics determine whether a suitable execution plan can be generated. Hologres collects sample statistics about data distribution and characteristics, table statistics, column statistics, the number of rows, the number of columns, field width, cardinality, frequency, maximum value, minimum value, high-frequency values, and bucket distribution characteristics. These statistics can help the optimizer update the estimated cost for operator execution, narrow down the query range, and estimate the optimal join order, the memory overheads, and the degree of parallelism (DOP). This way, the optimizer can generate an optimal execution plan.

You can use ANALYZE statements to collect statistics about table data in your database. Then, the optimizer of Hologres generates an optimal query plan based on the statistics to improve query efficiency.

Syntax

-- The following ANALYZE statement is used to update the statistics about a table. The statistics about all columns in the table are collected by default.
analyze <tablename>;

-- The following ANALYZE statement is used to update the statistics about specified columns. This statement helps collect more data than the analyze <tablename> statement that is used to collect statistics about a table based on sampled data. Therefore, statistics collected by using this statement are more accurate. This statement is used to update the statistics about a column that stores conditions.
analyze <tablename>(<colname>, <colname>);

Parameters
The tablename parameter specifies the name of the table about which you want to update statistics, and the colname parameter specifies the name of the column about which you want to update statistics.
Syntax description
The two ANALYZE statements have the following similarities and differences:
- Similarities
  - The two ANALYZE statements can be used to collect statistics about one or more columns. Information that is collected includes the number of rows, column width, most common values in the columns, histograms of the columns, and number of distinct values (NDV) in the columns.
  - After you execute one ANALYZE statement for one or more specified columns, the existing statistics about the columns that are returned by using the other statement are overwritten. However, existing statistics about other columns are retained. For example, if you execute the analyze <tablename>(<colname1>); statement, the existing statistics about the colname1 column are overwritten or updated, but the existing statistics about the colname2 column are retained.
- Differences
  - analyze <tablename>;: Statistics returned by this statement are collected based on sampled data.
  - analyze <tablename>(<colname>, <colname>);: If you execute this statement, Hologres runs the APPROX_COUNT_DISTINCT function to calculate the NDV in columns. In most cases, the NDV calculated by using this statement is more accurate than that based on sampled data, but costs more overheads. Therefore, we recommend that you use the analyze <tablename>(<colname>, <colname>); statement only for specified columns. Statistics other than the NDV, such as the column width and the histograms of the columns, are collected based on sampled data.
  For a table with two columns named colname1 and colname2, the analyze table; statement is not equivalent to the analyze table(colname1, colname2); statement.
  We recommend that you use the analyze <tablename>(<colname>, <colname>); statement for JOIN columns or GROUP BY columns to obtain finer-grained statistics about these columns.
Scenarios
We recommend that you execute the analyze <tablename>; statement for a table in the following scenarios:
- A large number of INSERT, UPDATE, DELETE, or data import operations are performed on the table.
- The performance of a table JOIN operation is deteriorated. In this case, you can execute the ANALYZE statement for the JOIN columns and the GROUP BY columns before you perform the table JOIN operation.
- After you execute the CREATE FOREIGN TABLE statement, you want to collect statistics about a foreign table.
- After you execute the IMPORT FOREIGN SCHEMA statement, you want to collect statistics about the table that you want to query.
Usage notes
- In Hologres V0.10 and V1.1, if you want to query a parent table, execute an ANALYZE statement for the parent table. If you want to query a child table, execute an ANALYZE statement for the child table. If you want to query both the parent table and the child table, execute an ANALYZE statement for both tables to prevent some statistics from being omitted.
- If you encounter one of the following issues, you must execute an ANALYZE statement before you run an import task. This can improve efficiency.
  - An out-of-memory (OOM) exception occurs when multiple tables are joined. The following error message is reported: Query executor exceeded total memory limitation xxxxx: yyyy bytes used.
  - Import efficiency is low. When you query or import data in Hologres, tasks take a long time to complete. This results in low efficiency.
- If the table that you want to query contains ultra-wide columns, such as a column with bitmap indexes of the BYTEA type and a column of the TEXT type that is more than 1 KB in size, the statistics about these columns are useless and consume memory in data sampling. For such a table, we recommend that you do not execute the analyze <tablename>; statement. Instead, you can execute the analyze <tablename>(<colname>, <colname>); statement for columns except for ultra-wide columns. For example, you can execute the statement for the JOIN, GROUP BY, and FILTER columns.
  Note
  You can set a benchmark for ultra-wide columns based on your business requirements. Normally, columns whose data size is greater than 1 KB are considered ultra-wide.

Auto-analyze feature

Hologres V0.10 and later support the auto-analyze feature. After you enable this feature, you do not need to repeatedly and manually execute ANALYZE statements. Hologres determines whether to automatically perform ANALYZE operations on relevant tables based on the number of tables that are created, the amount of written data, and the amount of modified data. This reduces the complexity of operations and prevents omissions in statistics collection caused by omitted ANALYZE statements.

Syntax

The following statements are used to check whether the auto-analyze feature is enabled:

SHOW hg_enable_start_auto_analyze_worker;  -- Syntax in Hologres V1.1 and later.

SHOW hg_experimental_enable_start_auto_analyze_worker; -- Syntax in Hologres V0.10.

The following statements are used to enable or disable the auto-analyze feature for databases. You can execute the statements only as a superuser.

 -- Syntax in Hologres V1.1 and later.
ALTER DATABASE dbname SET hg_enable_start_auto_analyze_worker = ON; -- Enable the auto-analyze feature for the specified database. The default value of the hg_enable_start_auto_analyze_worker parameter is ON.
ALTER DATABASE dbname SET hg_enable_start_auto_analyze_worker = OFF; -- Disable the auto-analyze feature for the specified database.

-- Syntax in Hologres V0.10
ALTER DATABASE dbname SET hg_experimental_enable_start_auto_analyze_worker = ON; -- Enable the auto-analyze feature for the specified database. The default value of the hg_experimental_enable_start_auto_analyze_worker parameter is ON.
ALTER DATABASE dbname SET hg_experimental_enable_start_auto_analyze_worker = OFF; -- Disable the auto-analyze feature for the specified database.

Limits
When you use the auto-analyze feature of Hologres, take note of the following limits:
- The auto-analyze feature is supported only in Hologres V0.10 and later. You can view the version of your Hologres instance on the instance details page in the Hologres console. If the version of your instance is earlier than V0.10, manually upgrade your Hologres instance in the Hologres console or join a DingTalk group to apply for an instance upgrade. For more information about how to manually upgrade a Hologres instance, see Instance upgrades. For more information about how to join a DingTalk group, see Obtain online support for Hologres.
- Only superusers can enable or disable the auto-analyze feature.
- The auto-analyze feature has the following limits on partitioned tables:
  - Before the auto-analyze feature is implemented on child tables that have been modified, an ANALYZE operation is performed on the relevant parent table.
  - By default, during data sampling, the system can scan up to 2²⁴ (16,777,216) data entries at a time. If the total number of data entries in child tables exceeds 16,777,216, the data entries are distributed to different partitions and only data entries in specific partitions are sampled. The total number of data entries in specific partitions cannot exceed 16,777,216.
    Note
    Partitioning does not affect statistics collection on the partition key column, but affects statistics collection on columns whose data is distributed in the same way as the partition key column, such as a column that has the same data as the partition key column. If such columns exist, some data in these columns may not be sampled and the number of data entries may be inaccurate. If you have questions, contact technical support in the DingTalk group (ID 32314975). The technical support team can help you adjust the maximum number of data entries that can be scanned at a time based on your business requirements.
- By default, you can use the auto-analyze feature to collect statistics about up to 256 columns at a time. If a table has more than 256 columns, statistics about the first 256 columns are collected. You can change the value of the hg_experimental_auto_analyze_max_columns_count parameter to adjust the upper limit.
- By default, each worker used by the auto-analyze feature can occupy 4 GB of memory. If a table has ultra-wide columns, an ANALYZE operation may fail due to the lack of memory resources for data sampling. You can change the value of the auto_analyze_work_memory_mb parameter to adjust the upper limit. However, you must pay attention to the size of the remaining memory available for your Hologres instance. The selection of a more advanced instance type for your Hologres instance allows for a greater number of workers, and the auto-analyze feature consumes more memory resources.
How the auto-analyze feature works
After you enable the auto-analyze feature, the system periodically searches for tables that require ANALYZE operations.
- Regular tables, including internal tables such as non-partitioned tables and partitioned tables
  - The system checks for DML operations on a table at an interval of 1 minute. Such DML operations include INSERT, UPDATE, and DELETE operations and may cause data changes. If one of the following conditions is met, the system performs an ANALYZE operation to collect statistics about the table:
    - For a non-partitioned table, a DML operation is performed and more than 10% of data entries are changed. For a partitioned table, a DML operation is performed and more than 10% of data entries in the current partition are changed.
    - A TRUNCATE TABLE operation is performed to clear the table data.
    - A DDL operation is performed on the table. For example, a CREATE TABLE operation is performed to create a table, or an ALTER TABLE operation is performed to modify the table schema. CALL SET_TABLE_PROPERTY operations that are used to modify table attributes do not trigger an ANALYZE operation.
  - The system checks data changes in an internal table at an interval of 10 minutes. If more than 10% of data entries are changed compared with the previous check result, an ANALYZE operation is performed on this table at the backend.
    Note
    This prevents the system from omitting data changes caused by implicit DML operations, such as real-time data writes by using Realtime Compute for Apache Flink, Data Integration, or Holo Client.
- Foreign tables
  - You can execute ANALYZE statements or enable the auto-analyze feature for a foreign table only if the foreign table maps to a MaxCompute table.
  - The system checks data changes and metadata changes in a foreign table at an interval of 4 hours. If the following condition is met, the system performs an ANALYZE operation to collect statistics about the table:
    The data of the MaxCompute table that the foreign table maps to is changed within 4 hours after the previous check is complete. This condition is met if the value of the last_modify_time parameter of the MaxCompute table is changed within 4 hours after the previous check is complete.
Note
A check and its corresponding ANALYZE operation are in the same scheduling task. A check can start only if the previous ANALYZE operation is complete and the interval between two consecutive checks specified by the scheduling task elapses.

Set the parameters of the auto-analyze feature

After you enable the auto-analyze feature, Hologres periodically checks tables to determine whether to perform ANALYZE operations on the tables. If ANALYZE operations are performed, data is sampled and statistics are collected. This consumes the resources of your Hologres instance.

The default settings may not apply to specific business scenarios. For example, if data is not frequently written to your Hologres instance, you can decrease the frequency of automatic ANALYZE operations by modifying the default settings of the auto-analyze feature. You can modify the default settings of the auto-analyze feature based on your business requirements. This allows you to optimize the feature performance in a fine-grained manner.

Note

You can modify the default settings as a superuser. When you enable this feature, you need to set the required parameters at the database level. The settings take effect 1 minute later.

Syntax
```
-- Modify the default settings of the auto-analyze feature as a superuser.
ALTER DATABASE <dbname> SET <GUC>=<values>;
```
The dbname parameter specifies the name of the database, the GUC parameter specifies the name of the parameter, and the values parameter specifies the values of the GUC parameter.

Parameters

Parameter	Description	Supported version	Default value	Example
autovacuum_naptime	The interval between two consecutive checks for operations on tables. Unit: seconds.	Hologres V1.1.0 and later Note To modify the settings, contact technical support in the DingTalk group (ID 32314975).	60s	`ALTER DATABASE <dbname> SET autovacuum_naptime = 60;` `ALTER DATABASE <dbname> SET autovacuum_naptime = '60s';` `ALTER DATABASE <dbname> SET autovacuum_naptime = '10min';`
hg_auto_check_table_changes_interval	The interval between two consecutive checks of data changes in internal tables. Unit: seconds.	Hologres V1.1.0 and later	600s	`-- Syntax in Hologres V1.1 and later ALTER DATABASE <dbname> SET hg_auto_check_table_changes_interval = '600s'; -- Syntax in Hologres V0.10 ALTER DATABASE <dbname> SET hg_experimental_auto_check_table_changes_interval = '600s';`
hg_auto_check_foreign_table_changes_interval	The interval between two consecutive checks of data changes in foreign tables. Unit: seconds.	Hologres V1.1.0 and later	14400s	`-- Syntax in Hologres V1.1 and later ALTER DATABASE <dbname> SET hg_auto_check_foreign_table_changes_interval = '14400s'; -- Syntax in Hologres V0.10 ALTER DATABASE <dbname> SET hg_experimental_auto_check_foreign_table_changes_interval = '14400s';`
hg_experimental_auto_analyze_max_columns_count	The maximum number of columns whose statistics are automatically collected at a time.	Hologres V1.1.0 and later	256	`ALTER DATABASE <dbname> SET hg_experimental_auto_analyze_max_columns_count =300;`
auto_analyze_work_memory_mb	The maximum size of memory that can be occupied by the auto-analyze feature for a single table. Unit: MB.	Hologres V1.1.54 and later	By default, each worker used by the auto-analyze feature can occupy 4,096 MB (4 GB) of memory. The selection of a more advanced instance type for your Hologres instance allows for a greater number of workers, and the auto-analyze feature consumes more memory resources.	Change the upper limit of the memory size for a table to 9 GB: `ALTER DATABASE <dbname> SETauto_analyze_work_memory_mb =9216;`
hg_experimental_auto_analyze_start_time	The beginning of the time range when ANALYZE operations are performed every day. Note The specified beginning and end of the time range must be in the same time zone. The beginning of the time range must be equal to or earlier than the end of the time range.	Hologres V1.1.54 and later	00:00 +0800	Set the beginning of the time range when ANALYZE operations are performed to 00:00 and the end of the time range to 06:00. In the daytime, data in foreign tables is not changed, and ANALYZE operations are not required. `ALTER DATABASE <dbname> SET hg_experimental_auto_analyze_start_time = '00:00 +0800';` `ALTER DATABASE <dbname> SET hg_experimental_auto_analyze_end_time = '06:00 +0800';`
hg_experimental_auto_analyze_end_time	The end of the time range when ANALYZE operations are performed every day.	Hologres V1.1.54 and later	23:59 +0800
autovacuum_enabled	Specifies whether to enable the auto-analyze feature for the specified database.	Hologres V1.1.54 and later	true, which means to enable this feature for all tables in the database.	Disable the auto-analyze feature for a specified table. Then, ANALYZE operations are not performed on this table. Note You can use only the following statement to disable the auto-analyze feature for a Hologres internal table: `ALTER TABLE <tablename> SET (autovacuum_enabled = false);`

Query table statistics

Table statistics are recorded in a table named hologres_statistic.hg_table_statistic. You can execute the following statement to check the execution status of an ANALYZE operation on a table:

Note

If you want to query the information about the last ANALYZE operation, sort the information by using the analyze_timestamp parameter.

SELECT schema_name,                -- The schema of the table.
       table_name,                 -- The name of the table.
       schema_version,             -- The version of the table.
       statistic_version,          -- The version of the statistics obtained by using the last ANALYZE operation.
       total_rows,                 -- The number of rows involved in the last ANALYZE operation.
       analyze_timestamp           -- The end of the time range when the last ANALYZE operation is performed.
FROM   hologres_statistic.hg_table_statistic
WHERE  table_name = '<tablename>'
ORDER BY analyze_timestamp DESC;

The hologres_statistic.hg_table_statistic table stores 0 to n statistical records for each table. If the hologres_statistic.hg_table_statistic table stores 0 statistical records for a table, no ANALYZE operation is performed on this table before. If the hologres_statistic.hg_table_statistic table stores one or more statistical records for the table, at least one ANALYZE operation is performed on this table before.
If the hologres_statistic.hg_table_statistic table stores two or more statistical records for the table, the values of the schema_version parameter in the statistical records are different. This is because the schema of the table is changed. For example, if you execute a statement such as the ADD COLUMN statement, the system generates a new statistical record and discards the existing statistical record.

The following sample code provides an example of a query result for a table with two statistical records stored in the hologres_statistic.hg_table_statistic table. In this example, the value of the schema_version parameter in the first statistical record is greater than that in the second statistical record. Therefore, the second statistical record is discarded.

 schema_name |    table_name    | schema_version | statistic_version | total_rows |  analyze_timestamp
-------------+------------------+----------------+-------------------+------------+---------------------
 public      | tbl_name_example |             13 |              8580 |      10002 | 2022-04-29 16:03:18
 public      | tbl_name_example |             10 |              8576 |      10002 | 2022-04-29 15:41:20
(2 rows)

Hologres V0.10 and V1.1 do not delete discarded statistical records from the hologres_statistic.hg_table_statistic table. You do not need to pay attention to these records.

FAQ

In the following scenarios, the auto-analyze feature does not work properly. You can resolve the issues by following the solutions provided in this section.

Why does the hologres_statistic.hg_table_statistic table have no statistical records?
- Problem description: When I try to query statistics about a table in the hologres_statistic.hg_table_statistic table, no statistics are returned.
- Cause:
  - The auto-analyze feature is not enabled, or the table that you want to query does not meet the conditions to trigger an ANALYZE operation.
  - An error occurs in the auto-analyze feature. You can submit a ticket for troubleshooting.
- Solution: Execute an ANALYZE statement.
Why is the value of the analyze_timestamp parameter excessively small?
- Problem description: In a query result, the value of the analyze_timestamp parameter is much smaller than the current timestamp. This means that no ANALYZE operation is performed for a long period of time.
- Cause:
  - The auto-analyze feature does not work properly.
  - The auto-analyze feature was once manually disabled.
- Solution: Manually trigger an ANALYZE operation. Then, submit a ticket for troubleshooting.