MaxCompute uses a cost-based optimizer that estimates the cost of each execution plan based on metadata, such as the number of rows in a table and the average length of string columns. This topic describes how to collect that metadata so that the optimizer can generate better execution plans and improve query performance.
Background information
If the optimizer estimates costs based on inaccurate metadata, the estimates are inaccurate and a poor execution plan may be generated. Accurate metadata is therefore crucial to the optimizer. The core metadata of a table is the set of column stats metrics for the data in the table; other metadata is derived from these metrics.
MaxCompute allows you to use the following methods to collect column stats metrics:
Analyze: an asynchronous collection method. You run the analyze command to collect column stats metrics; you must actively trigger collection. Note: The version of the MaxCompute client must be later than 0.35.
Freeride: a synchronous collection method. While data is generated in a table, the column stats metrics of the data are automatically collected. This method is automated but has an impact on query latency.
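As a quick preview of the two methods, the following minimal commands show how each one is triggered. The table name my_table is a placeholder; both methods are described in detail in the following sections.
-- Analyze: actively trigger collection for all columns of a table.
analyze table my_table compute statistics for columns;
-- Freeride: enable automatic collection for the current session; metrics are then collected while data is generated.
set odps.optimizer.stat.collect.auto=true;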
The following table lists the column stats metrics that can be collected for different data types.
| Column stats metric / Data type | Numeric (TINYINT, SMALLINT, INT, BIGINT, DOUBLE, DECIMAL, and NUMERIC) | Character (STRING, VARCHAR, and CHAR) | Binary (BINARY) | Boolean (BOOLEAN) | Date and time (TIMESTAMP, DATE, and INTERVAL) | Complex data type (MAP, STRUCT, and ARRAY) |
| --- | --- | --- | --- | --- | --- | --- |
| min (minimum value) | Y | N | N | N | Y | N |
| max (maximum value) | Y | N | N | N | Y | N |
| nNulls (number of null values) | Y | Y | Y | Y | Y | Y |
| avgColLen (average column length) | N | Y | Y | N | N | N |
| maxColLen (maximum column length) | N | Y | Y | N | N | N |
| ndv (number of distinct values) | Y | Y | Y | Y | Y | N |
| topK (top K values with the highest frequency of occurrence) | Y | Y | Y | Y | Y | N |
Y indicates that a metric is supported. N indicates that a metric is not supported.
Scenarios
The following table describes the scenarios in which each column stats metric is used. A worked example of how these metrics feed the optimizer's estimates follows the table.
| Column stats metric | Optimization objective | Scenario | Description |
| --- | --- | --- | --- |
| min (minimum value) or max (maximum value) | Enhance the accuracy of performance optimization. | Scenario 1: Estimate the number of output records. | If only the data type is provided, the value range that the optimizer must assume is too large. If the min and max values are provided, the optimizer can more accurately estimate the selectivity of filter conditions and produce a better execution plan. |
| | | Scenario 2: Push filter conditions down to the storage layer to reduce the amount of data that needs to be read. | In MaxCompute, a filter condition can be pushed down to the storage layer, which uses the min and max values of a column to skip data that cannot match the condition and therefore reads less data. |
| nNulls (number of null values) | Improve the efficiency of the null value check. | Scenario 1: Reduce checks for null values when a job is run. | When a job runs, null values must be checked for all types of data. If the value of the nNulls metric is 0, this checking logic can be skipped, which improves computing performance. |
| | | Scenario 2: Filter out data based on filter conditions. | If a column contains only null values, the optimizer can use the nNulls metric together with the filter conditions on that column to filter out data at optimization time. |
| avgColLen (average column length) or maxColLen (maximum column length) | Estimate resource consumption to reduce shuffle operations. | Scenario 1: Estimate the memory of a hash-clustered table. | For example, the optimizer can estimate the memory usage of variable-length fields based on the avgColLen metric and derive the memory usage of data records. This way, the optimizer can selectively perform automatic map join operations. A broadcast join mechanism is used for the hash-clustered table to reduce shuffle operations. For a large input table, reducing shuffle operations can significantly improve performance. |
| | | Scenario 2: Reduce the amount of data that needs to be shuffled. | None |
| ndv (number of distinct values) | Improve the quality of an execution plan. | Scenario 1: Estimate the number of output records of a join operation. | The optimizer estimates the number of records that a join produces based on the ndv values of the join keys; a more accurate estimate helps the optimizer choose a better join plan. |
| | | Scenario 2: Sort join operations. | The optimizer can automatically adjust the join order based on the estimated number of output records. For example, it can move joins that filter data forward and joins that expand data backward. |
| topK (top K values with the highest frequency of occurrence) | Estimate data distribution to reduce the impact of data skew on performance. | Scenario 1: Optimize join operations that involve skewed data. | If both input tables of a join are large and a map join cannot be used to load the smaller table fully into memory, data skew can occur: the output for one join key is much larger than the output for the other join keys. MaxCompute can automatically use map joins to process the skewed keys, use merge joins to process the remaining keys, and then merge the results. This is especially effective for joins over large amounts of data and significantly reduces the cost of manual troubleshooting. |
| | | Scenario 2: Estimate the number of output records. | The ndv, min, and max metrics can accurately estimate the number of output records only if the data is evenly distributed. If the data is clearly skewed, estimates based on that assumption are distorted, so skewed values require special handling. The remaining data can still be estimated under the even-distribution assumption. |
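To make these estimation scenarios more concrete, the following sketch illustrates how the min, max, and ndv metrics can feed a selectivity estimate for simple filters. The table t, its statistics, and the formulas are only an illustration of the even-distribution assumption described above, not the exact behavior of the MaxCompute optimizer.
-- Hypothetical statistics for column t.amount: min = 0, max = 1000, ndv = 100, and 1,000,000 rows in total.
select * from t where amount > 900;
-- Under the even-distribution assumption, the estimated selectivity is
-- (max - 900) / (max - min) = 100 / 1000 = 0.1, that is, about 100,000 output records.
select * from t where amount = 100;
-- For an equality filter, the estimated selectivity is 1 / ndv = 1 / 100,
-- that is, about 10,000 output records. If the topK metric shows that the value 100
-- occurs far more often than other values, an estimate based only on ndv is distorted,
-- which is why topK is collected in addition to min, max, and ndv.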
Use Analyze
Collect the column stats metrics
This section uses a non-partitioned table and a partitioned table as examples to describe how to use Analyze.
Non-partitioned table
You can use Analyze to collect the column stats metrics of one or more specific columns or all columns in a non-partitioned table.
Run the following command on the MaxCompute client to create a non-partitioned table named analyze2_test:
create table if not exists analyze2_test (
    tinyint1 tinyint,
    smallint1 smallint,
    int1 int,
    bigint1 bigint,
    double1 double,
    decimal1 decimal,
    decimal2 decimal(20,10),
    string1 string,
    varchar1 varchar(10),
    boolean1 boolean,
    timestamp1 timestamp,
    datetime1 datetime
) lifecycle 30;
Run the following command to insert data into the table:
insert overwrite table analyze2_test select * from values
(1Y, 20S, 4, 8L, 123452.3, 12.4, 52.5, 'str1', 'str21', false, timestamp '2018-09-17 00:00:00', datetime '2018-09-17 00:59:59'),
(10Y, 2S, 7, 11111118L, 67892.3, 22.4, 42.5, 'str12', 'str200', true, timestamp '2018-09-17 00:00:00', datetime '2018-09-16 00:59:59'),
(20Y, 7S, 4, 2222228L, 12.3, 2.4, 2.57, 'str123', 'str2', false, timestamp '2018-09-18 00:00:00', datetime '2018-09-17 00:59:59'),
(null, null, null, null, null, null, null, null, null, null, null, null)
as t(tinyint1, smallint1, int1, bigint1, double1, decimal1, decimal2, string1, varchar1, boolean1, timestamp1, datetime1);
Run the analyze command to collect the column stats metrics of one or more specific columns or all columns in the table. Examples:
-- Collect the column stats metrics of the tinyint1 column.
analyze table analyze2_test compute statistics for columns (tinyint1);
-- Collect the column stats metrics of the smallint1, string1, boolean1, and timestamp1 columns.
analyze table analyze2_test compute statistics for columns (smallint1, string1, boolean1, timestamp1);
-- Collect the column stats metrics of all columns.
analyze table analyze2_test compute statistics for columns;
Run the show statistic command to verify the collection results. Examples:
-- Verify the collection result of the tinyint1 column.
show statistic analyze2_test columns (tinyint1);
-- Verify the collection results of the smallint1, string1, boolean1, and timestamp1 columns.
show statistic analyze2_test columns (smallint1, string1, boolean1, timestamp1);
-- Verify the collection results of all columns.
show statistic analyze2_test columns;
The following result is returned:
-- Collection result of the tinyint1 column:
ID = 20201126085225150gnqo****
tinyint1:MaxValue: 20 -- The value of max.
tinyint1:DistinctNum: 4.0 -- The value of ndv.
tinyint1:MinValue: 1 -- The value of min.
tinyint1:NullNum: 1.0 -- The value of nNulls.
tinyint1:TopK: {1=1.0, 10=1.0, 20=1.0} -- The value of topK. 10=1.0 indicates that the occurrence frequency of column value 10 is 1. Up to 20 values with the highest occurrence frequency can be returned.
-- Collection results of the smallint1, string1, boolean1, and timestamp1 columns:
ID = 20201126091636149gxgf****
smallint1:MaxValue: 20
smallint1:DistinctNum: 4.0
smallint1:MinValue: 2
smallint1:NullNum: 1.0
smallint1:TopK: {2=1.0, 7=1.0, 20=1.0}
string1:MaxLength 6.0 -- The value of maxColLen.
string1:AvgLength: 3.0 -- The value of avgColLen.
string1:DistinctNum: 4.0
string1:NullNum: 1.0
string1:TopK: {str1=1.0, str12=1.0, str123=1.0}
boolean1:DistinctNum: 3.0
boolean1:NullNum: 1.0
boolean1:TopK: {false=2.0, true=1.0}
timestamp1:DistinctNum: 3.0
timestamp1:NullNum: 1.0
timestamp1:TopK: {2018-09-17 00:00:00.0=2.0, 2018-09-18 00:00:00.0=1.0}
-- Collection results of all columns:
ID = 20201126092022636gzm1****
tinyint1:MaxValue: 20
tinyint1:DistinctNum: 4.0
tinyint1:MinValue: 1
tinyint1:NullNum: 1.0
tinyint1:TopK: {1=1.0, 10=1.0, 20=1.0}
smallint1:MaxValue: 20
smallint1:DistinctNum: 4.0
smallint1:MinValue: 2
smallint1:NullNum: 1.0
smallint1:TopK: {2=1.0, 7=1.0, 20=1.0}
int1:MaxValue: 7
int1:DistinctNum: 3.0
int1:MinValue: 4
int1:NullNum: 1.0
int1:TopK: {4=2.0, 7=1.0}
bigint1:MaxValue: 11111118
bigint1:DistinctNum: 4.0
bigint1:MinValue: 8
bigint1:NullNum: 1.0
bigint1:TopK: {8=1.0, 2222228=1.0, 11111118=1.0}
double1:MaxValue: 123452.3
double1:DistinctNum: 4.0
double1:MinValue: 12.3
double1:NullNum: 1.0
double1:TopK: {12.3=1.0, 67892.3=1.0, 123452.3=1.0}
decimal1:MaxValue: 22.4
decimal1:DistinctNum: 4.0
decimal1:MinValue: 2.4
decimal1:NullNum: 1.0
decimal1:TopK: {2.4=1.0, 12.4=1.0, 22.4=1.0}
decimal2:MaxValue: 52.5
decimal2:DistinctNum: 4.0
decimal2:MinValue: 2.57
decimal2:NullNum: 1.0
decimal2:TopK: {2.57=1.0, 42.5=1.0, 52.5=1.0}
string1:MaxLength 6.0
string1:AvgLength: 3.0
string1:DistinctNum: 4.0
string1:NullNum: 1.0
string1:TopK: {str1=1.0, str12=1.0, str123=1.0}
varchar1:MaxLength 6.0
varchar1:AvgLength: 3.0
varchar1:DistinctNum: 4.0
varchar1:NullNum: 1.0
varchar1:TopK: {str2=1.0, str200=1.0, str21=1.0}
boolean1:DistinctNum: 3.0
boolean1:NullNum: 1.0
boolean1:TopK: {false=2.0, true=1.0}
timestamp1:DistinctNum: 3.0
timestamp1:NullNum: 1.0
timestamp1:TopK: {2018-09-17 00:00:00.0=2.0, 2018-09-18 00:00:00.0=1.0}
datetime1:DistinctNum: 3.0
datetime1:NullNum: 1.0
datetime1:TopK: {1537117199000=2.0, 1537030799000=1.0}
Partitioned table
You can use Analyze to collect the column stats metrics of a specific partition in a partitioned table.
Run the following command on the MaxCompute client to create a partitioned table named srcpart_test:
create table if not exists srcpart_test (key string, value string) partitioned by (ds string, hr string) lifecycle 30;
Run the following command to insert data into the table:
insert into table srcpart_test partition(ds='20201220', hr='11') values ('123', 'val_123'), ('76', 'val_76'), ('447', 'val_447'), ('1234', 'val_1234');
insert into table srcpart_test partition(ds='20201220', hr='12') values ('3', 'val_3'), ('12331', 'val_12331'), ('42', 'val_42'), ('12', 'val_12');
insert into table srcpart_test partition(ds='20201221', hr='11') values ('543', 'val_543'), ('2', 'val_2'), ('4', 'val_4'), ('9', 'val_9');
insert into table srcpart_test partition(ds='20201221', hr='12') values ('23', 'val_23'), ('56', 'val_56'), ('4111', 'val_4111'), ('12333', 'val_12333');
Run the analyze command to collect the column stats metrics of a specific partition in the table. Example:
analyze table srcpart_test partition(ds='20201221') compute statistics for columns (key, value);
Run the show statistic command to verify the collection results. Example:
show statistic srcpart_test partition (ds='20201221') columns (key, value);
The following result is returned:
ID = 20210105121800689g28p****
(ds=20201221,hr=11) key:MaxLength 3.0
(ds=20201221,hr=11) key:AvgLength: 1.0
(ds=20201221,hr=11) key:DistinctNum: 4.0
(ds=20201221,hr=11) key:NullNum: 0.0
(ds=20201221,hr=11) key:TopK: {2=1.0, 4=1.0, 543=1.0, 9=1.0}
(ds=20201221,hr=11) value:MaxLength 7.0
(ds=20201221,hr=11) value:AvgLength: 5.0
(ds=20201221,hr=11) value:DistinctNum: 4.0
(ds=20201221,hr=11) value:NullNum: 0.0
(ds=20201221,hr=11) value:TopK: {val_2=1.0, val_4=1.0, val_543=1.0, val_9=1.0}
(ds=20201221,hr=12) key:MaxLength 5.0
(ds=20201221,hr=12) key:AvgLength: 3.0
(ds=20201221,hr=12) key:DistinctNum: 4.0
(ds=20201221,hr=12) key:NullNum: 0.0
(ds=20201221,hr=12) key:TopK: {12333=1.0, 23=1.0, 4111=1.0, 56=1.0}
(ds=20201221,hr=12) value:MaxLength 9.0
(ds=20201221,hr=12) value:AvgLength: 7.0
(ds=20201221,hr=12) value:DistinctNum: 4.0
(ds=20201221,hr=12) value:NullNum: 0.0
(ds=20201221,hr=12) value:TopK: {val_12333=1.0, val_23=1.0, val_4111=1.0, val_56=1.0}
Refresh the number of records in a table in the metadata
Various tasks in MaxCompute can change the number of records in a table, and most tasks record only the number of records that they themselves affect. Because distributed tasks are dynamic and the time at which data is updated is uncertain, these statistics may be inaccurate. You can therefore run the analyze command to refresh the record count of a table in the metadata and ensure that it is accurate. You can view the number of records in a table in DataMap of DataWorks. For more information, see View the details of a table.
Refresh the number of records in a table.
set odps.sql.analyze.table.stats=only;
analyze table <table_name> compute statistics for columns;
The table_name parameter specifies the name of the table.
Refresh the number of records based on a specific column of a table.
set odps.sql.analyze.table.stats=only;
analyze table <table_name> compute statistics for columns (<column_name>);
The table_name parameter specifies the name of the table. The column_name parameter specifies the name of the column.
Refresh the number of records based on a specific column of a partition in a table.
set odps.sql.analyze.table.stats=only;
analyze table <table_name> partition(<pt_spec>) compute statistics for columns (<column_name>);
The table_name parameter specifies the name of the table. The pt_spec parameter specifies the partition. The column_name parameter specifies the name of the column.
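For example, based on the srcpart_test table created earlier in this topic, the record count of a single partition can be refreshed as follows. The choice of the key column here is arbitrary; any column of the partition works.
set odps.sql.analyze.table.stats=only;
analyze table srcpart_test partition(ds='20201221') compute statistics for columns (key);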
Use Freeride
To use Freeride, you must set both of the following properties at the session level:
set odps.optimizer.stat.collect.auto=true;: Enables Freeride to automatically collect the column stats metrics of tables.
set odps.optimizer.stat.collect.plan=xx;: Configures a collection plan that specifies which column stats metrics to collect for which columns. Examples:
-- Collect the avgColLen metric of the key column in the target_table table.
set odps.optimizer.stat.collect.plan={"target_table":"{\"key\":\"AVG_COL_LEN\"}"};
-- Collect the min and max metrics of the s_binary column and the topK and nNulls metrics of the s_int column in the target_table table.
set odps.optimizer.stat.collect.plan={"target_table":"{\"s_binary\":\"MIN,MAX\",\"s_int\":\"TOPK,NULLS\"}"};
If no data is collected after you run these commands, Freeride may not be enabled. Check whether the odps.optimizer.stat.collect.auto property appears on the Json Summary tab of LogView. If the property is not found, the current server version does not support Freeride; support will be added in a later server version.
The values used in the set odps.optimizer.stat.collect.plan=xx; command map to the column stats metrics as follows (an example that uses these values follows the list):
min: MIN
max: MAX
nNulls: NULLS
avgColLen: AVG_COL_LEN
maxColLen: MAX_COL_LEN
ndv: NDV
topK: TOPK
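For example, the following hypothetical plan requests the ndv metric of the key column and the maxColLen metric of the value column of a table named target_table; the table and column names are placeholders.
set odps.optimizer.stat.collect.auto=true;
set odps.optimizer.stat.collect.plan={"target_table":"{\"key\":\"NDV\",\"value\":\"MAX_COL_LEN\"}"};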
MaxCompute allows you to trigger Freeride to collect column stats metrics by executing a CREATE TABLE, INSERT INTO, or INSERT OVERWRITE statement.
Before you can use Freeride, you must first prepare a source table. For example, run the following commands to create a source table named src_test and insert data into the table:
create table if not exists src_test (key string, value string);
insert overwrite table src_test values ('100', 'val_100'), ('100', 'val_50'), ('200', 'val_200'), ('200', 'val_300');
CREATE TABLE: Use Freeride to collect column stats metrics while you create a destination table named target_test. Sample statements:
-- Create the destination table.
set odps.optimizer.stat.collect.auto=true;
set odps.optimizer.stat.collect.plan={"target_test":"{\"key\":\"AVG_COL_LEN,NULLS\"}"};
create table target_test as select key, value from src_test;
-- Verify the collection results.
show statistic target_test columns;
The following result is returned:
key:AvgLength: 3.0
key:NullNum: 0.0
INSERT INTO: Use Freeride to collect column stats metrics while you execute the INSERT INTO statement to append data to a table. Sample statements:
-- Create a destination table.
create table freeride_insert_into_table like src_test;
-- Append data to the table.
set odps.optimizer.stat.collect.auto=true;
set odps.optimizer.stat.collect.plan={"freeride_insert_into_table":"{\"key\":\"AVG_COL_LEN,NULLS\"}"};
insert into table freeride_insert_into_table select key, value from src_test order by key, value limit 10;
-- Verify the collection results.
show statistic freeride_insert_into_table columns;
INSERT OVERWRITE: Use Freeride to collect column stats metrics while you execute the INSERT OVERWRITE statement to overwrite data in a table. Sample statements:
-- Create a destination table.
create table freeride_insert_overwrite_table like src_test;
-- Overwrite data in the table.
set odps.optimizer.stat.collect.auto=true;
set odps.optimizer.stat.collect.plan={"freeride_insert_overwrite_table":"{\"key\":\"AVG_COL_LEN,NULLS\"}"};
insert overwrite table freeride_insert_overwrite_table select key, value from src_test order by key, value limit 10;
-- Verify the collection results.
show statistic freeride_insert_overwrite_table columns;