All Products
Search
Document Center

ApsaraDB for SelectDB:Statistics on external tables

Last Updated:May 31, 2024

ApsaraDB for SelectDB allows you to collect statistics on external tables. This topic describes how the feature works and how to use the feature. This provides a reference for you to optimize query performance.

The collection methods and the content of statistics on external tables are basically the same as those for internal tables. For more information, see Statistics.

Usage notes

  • Statistics on external tables such as Hive, Iceberg, and Hudi tables can be collected.

  • The following statistics collection operations are not supported for external tables:

    • Histogram collection

    • Incremental collection and update of partitions

    • Automatic collection, which can be replaced with periodic collection

    • Sampling collection

Use the feature

This section provides examples on how to collect statistics on external tables by executing the ANALYZE statement.

Except for the four operations that are not supported for external tables, you can perform other operations in the same way as those for internal tables. In the following examples, the hive.tpch100 database is used. The tpch100 database has eight tables, including lineitem, orders, and region.

Collect statistics

ApsaraDB for SelectDB supports manual collection and periodic collection for statistics on external tables.

Manual collection

  • Collect statistics on the lineitem table and all its columns.

    ANALYZE TABLE hive.tpch100.lineitem;
    +--------------+-------------------------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
    | Catalog_Name | DB_Name                 | Table_Name | Columns                                                                                                                                                                                       | Job_Id |
    +--------------+-------------------------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
    | hive         | default_cluster:tpch100 | lineitem   | [l_returnflag,l_receiptdate,l_tax,l_shipmode,l_suppkey,l_shipdate,l_commitdate,l_partkey,l_orderkey,l_quantity,l_linestatus,l_comment,l_extendedprice,l_linenumber,l_discount,l_shipinstruct] | 16990  |
    +--------------+-------------------------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
    1 row in set (0.06 sec)

    This operation is asynchronously performed. A background collection job is created. You can view the job progress based on the job ID.

    SHOW ANALYZE 16990;
    +--------+--------------+-------------------------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------------+---------+----------------------+---------+---------------------------------------------+---------------+
    | job_id | catalog_name | db_name                 | tbl_name | col_name                                                                                                                                                                                      | job_type | analysis_type | message | last_exec_time_in_ms | state   | progress                                    | schedule_type |
    +--------+--------------+-------------------------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------------+---------+----------------------+---------+---------------------------------------------+---------------+
    | 16990  | hive         | default_cluster:tpch100 | lineitem | [l_returnflag,l_receiptdate,l_tax,l_shipmode,l_suppkey,l_shipdate,l_commitdate,l_partkey,l_orderkey,l_quantity,l_linestatus,l_comment,l_extendedprice,l_linenumber,l_discount,l_shipinstruct] | MANUAL   | FUNDAMENTALS  |         | 2023-07-27 16:01:52  | RUNNING | 2 Finished/0 Failed/15 In Progress/17 Total | ONCE          |
    +--------+--------------+-------------------------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------------+---------+----------------------+---------+---------------------------------------------+---------------+
    1 row in set (0.00 sec)

    You can also view the job status for each column.

    SHOW ANALYZE TASK STATUS 16990;
    +---------+-----------------+---------+------------------------+-----------------+----------+
    | task_id | col_name        | message | last_state_change_time | time_cost_in_ms | state    |
    +---------+-----------------+---------+------------------------+-----------------+----------+
    | 16991   | l_receiptdate   |         | 2023-07-27 16:01:29    | 0               | PENDING  |
    | 16992   | l_returnflag    |         | 2023-07-27 16:01:44    | 14394           | FINISHED |
    | 16993   | l_tax           |         | 2023-07-27 16:01:52    | 7975            | FINISHED |
    | 16994   | l_shipmode      |         | 2023-07-27 16:02:11    | 18961           | FINISHED |
    | 16995   | l_suppkey       |         | 2023-07-27 16:02:17    | 6684            | FINISHED |
    | 16996   | l_shipdate      |         | 2023-07-27 16:02:26    | 8518            | FINISHED |
    | 16997   | l_commitdate    |         | 2023-07-27 16:02:26    | 0               | RUNNING  |
    | 16998   | l_partkey       |         | 2023-07-27 16:01:29    | 0               | PENDING  |
    | 16999   | l_quantity      |         | 2023-07-27 16:01:29    | 0               | PENDING  |
    | 17000   | l_orderkey      |         | 2023-07-27 16:01:29    | 0               | PENDING  |
    | 17001   | l_comment       |         | 2023-07-27 16:01:29    | 0               | PENDING  |
    | 17002   | l_linestatus    |         | 2023-07-27 16:01:29    | 0               | PENDING  |
    | 17003   | l_extendedprice |         | 2023-07-27 16:01:29    | 0               | PENDING  |
    | 17004   | l_linenumber    |         | 2023-07-27 16:01:29    | 0               | PENDING  |
    | 17005   | l_shipinstruct  |         | 2023-07-27 16:01:29    | 0               | PENDING  |
    | 17006   | l_discount      |         | 2023-07-27 16:01:29    | 0               | PENDING  |
    | 17007   | TableRowCount   |         | 2023-07-27 16:01:29    | 0               | PENDING  |
    +---------+-----------------+---------+------------------------+-----------------+----------+
    17 rows in set (0.00 sec)
  • Collect statistics on all tables in the tpch100 database.

    ANALYZE DATABASE hive.tpch100;
    +--------------+---------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
    | Catalog_Name | DB_Name | Table_Name | Columns                                                                                                                                                                                       | Job_Id |
    +--------------+---------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
    | hive         | tpch100 | supplier   | [s_comment,s_phone,s_nationkey,s_name,s_address,s_acctbal,s_suppkey]                                                                                                                          | 17018  |
    | hive         | tpch100 | nation     | [n_comment,n_nationkey,n_regionkey,n_name]                                                                                                                                                    | 17027  |
    | hive         | tpch100 | region     | [r_regionkey,r_comment,r_name]                                                                                                                                                                | 17033  |
    | hive         | tpch100 | partsupp   | [ps_suppkey,ps_availqty,ps_comment,ps_partkey,ps_supplycost]                                                                                                                                  | 17038  |
    | hive         | tpch100 | orders     | [o_orderstatus,o_clerk,o_orderdate,o_shippriority,o_custkey,o_totalprice,o_orderkey,o_comment,o_orderpriority]                                                                                | 17045  |
    | hive         | tpch100 | lineitem   | [l_returnflag,l_receiptdate,l_tax,l_shipmode,l_suppkey,l_shipdate,l_commitdate,l_partkey,l_orderkey,l_quantity,l_linestatus,l_comment,l_extendedprice,l_linenumber,l_discount,l_shipinstruct] | 17056  |
    | hive         | tpch100 | part       | [p_partkey,p_container,p_name,p_comment,p_brand,p_type,p_retailprice,p_mfgr,p_size]                                                                                                           | 17074  |
    | hive         | tpch100 | customer   | [c_custkey,c_phone,c_acctbal,c_mktsegment,c_address,c_nationkey,c_name,c_comment]                                                                                                             | 17085  |
    +--------------+---------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
    8 rows in set (0.29 sec)

    This operation is also asynchronously performed. The collection jobs for all tables in the tpch100 database are submitted at a time. A job is created for each table. You can view the progress of the job for each table based on the job ID.

  • Perform synchronous collection.

    You can add WITH sync to the collection statement to synchronously collect statistics on a table or all tables in a database. In this case, no background jobs are created. The client is blocked until the statistics are collected and returned.

    ANALYZE TABLE hive.tpch100.orders WITH sync;
    Query OK, 0 rows affected (33.19 sec)

    Take note that synchronous collection is subject to the query_timeout session variable. If the collection fails due to a timeout, increase the value of this variable and try again. For example, set query_timeout = 3600 indicates that the timeout period is set to 1 hour.

Periodic collection

You can add WITH period to the collection statement to configure periodic statistics collection.

ANALYZE TABLE hive.tpch100.orders WITH period 86400;

The preceding statement creates a periodic collection job that automatically collects and updates the statistics on the orders table every day.

Manage jobs

You can manage statistics collection jobs for external tables in the same way as that for internal tables. You can view jobs, view tasks, and delete jobs. For more information, see Statistics.

  • View the status of all jobs.

    SHOW ANALYZE;
    +--------+--------------+-------------------------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------------+---------+----------------------+----------+---------------------------------------------+---------------+
    | job_id | catalog_name | db_name                 | tbl_name | col_name                                                                                                                                                                                      | job_type | analysis_type | message | last_exec_time_in_ms | state    | progress                                    | schedule_type |
    +--------+--------------+-------------------------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------------+---------+----------------------+----------+---------------------------------------------+---------------+
    | 16990  | hive         | default_cluster:tpch100 | lineitem | [l_returnflag,l_receiptdate,l_tax,l_shipmode,l_suppkey,l_shipdate,l_commitdate,l_partkey,l_orderkey,l_quantity,l_linestatus,l_comment,l_extendedprice,l_linenumber,l_discount,l_shipinstruct] | MANUAL   | FUNDAMENTALS  |         | 2023-07-27 16:05:02  | FINISHED | 17 Finished/0 Failed/0 In Progress/17 Total | ONCE          |
    +--------+--------------+-------------------------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------------+---------+----------------------+----------+---------------------------------------------+---------------+
  • View the status of all tasks in a job.

    SHOW ANALYZE TASK STATUS 16990;
    +---------+-----------------+---------+------------------------+-----------------+----------+
    | task_id | col_name        | message | last_state_change_time | time_cost_in_ms | state    |
    +---------+-----------------+---------+------------------------+-----------------+----------+
    | 16991   | l_receiptdate   |         | 2023-07-27 16:05:02    | 9560            | FINISHED |
    | 16992   | l_returnflag    |         | 2023-07-27 16:01:44    | 14394           | FINISHED |
    | 16993   | l_tax           |         | 2023-07-27 16:01:52    | 7975            | FINISHED |
    | 16994   | l_shipmode      |         | 2023-07-27 16:02:11    | 18961           | FINISHED |
    | 16995   | l_suppkey       |         | 2023-07-27 16:02:17    | 6684            | FINISHED |
    | 16996   | l_shipdate      |         | 2023-07-27 16:02:26    | 8518            | FINISHED |
    | 16997   | l_commitdate    |         | 2023-07-27 16:02:34    | 8380            | FINISHED |
    | 16998   | l_partkey       |         | 2023-07-27 16:02:40    | 6060            | FINISHED |
    | 16999   | l_quantity      |         | 2023-07-27 16:02:50    | 9768            | FINISHED |
    | 17000   | l_orderkey      |         | 2023-07-27 16:02:57    | 7200            | FINISHED |
    | 17001   | l_comment       |         | 2023-07-27 16:03:36    | 38468           | FINISHED |
    | 17002   | l_linestatus    |         | 2023-07-27 16:03:51    | 15226           | FINISHED |
    | 17003   | l_extendedprice |         | 2023-07-27 16:04:00    | 8713            | FINISHED |
    | 17004   | l_linenumber    |         | 2023-07-27 16:04:06    | 6659            | FINISHED |
    | 17005   | l_shipinstruct  |         | 2023-07-27 16:04:36    | 29777           | FINISHED |
    | 17006   | l_discount      |         | 2023-07-27 16:04:45    | 9212            | FINISHED |
    | 17007   | TableRowCount   |         | 2023-07-27 16:04:52    | 6974            | FINISHED |
    +---------+-----------------+---------+------------------------+-----------------+----------+
  • Stop an unfinished job.

    KILL ANALYZE [job_id]
  • Delete a periodic collection job.

    DROP ANALYZE JOB [JOB_ID]

View statistics

You can view table statistics and column statistics. Table statistics include the number of rows in a table. For more information, see Statistics.

Table statistics

SHOW TABLE [cached] stats TABLE_NAME;

View the number of rows in the specified table. If the cached parameter is specified, the cached information about the number of rows in the specified table is displayed. Example:

SHOW TABLE STATS hive.tpch100.orders;
+-----------+---------------------+---------------------+
| row_count | update_time         | last_analyze_time   |
+-----------+---------------------+---------------------+
| 150000000 | 2023-07-11 23:01:49 | 2023-07-11 23:01:44 |
+-----------+---------------------+---------------------+

Column statistics

SHOW COLUMN [cached] stats TABLE_NAME;

View the statistics on the columns in the specified table. If the cached parameter is specified, the cached statistics on the columns are displayed. Example:

SHOW COLUMN stats hive.tpch100.orders;
+-----------------+-------+--------------+----------+----------------------+---------------+-----------------------+----------------------------+
| column_name     | count | ndv          | num_null | data_size            | avg_size_byte | min                   | max                        |
+-----------------+-------+--------------+----------+----------------------+---------------+-----------------------+----------------------------+
| o_orderstatus   | 1.5E8 | 3.0          | 0.0      | 1.50000001E8         | 1.0           | 'F'                   | 'P'                        |
| o_clerk         | 1.5E8 | 100836.0     | 0.0      | 2.250000015E9        | 15.0          | 'Clerk#000000001'     | 'Clerk#000100000'          |
| o_orderdate     | 1.5E8 | 2417.0       | 0.0      | 6.00000004E8         | 4.0           | '1992-01-01'          | '1998-08-02'               |
| o_shippriority  | 1.5E8 | 1.0          | 0.0      | 6.00000004E8         | 4.0           | 0                     | 0                          |
| o_custkey       | 1.5E8 | 1.0023982E7  | 0.0      | 6.00000004E8         | 4.0           | 1                     | 14999999                   |
| o_totalprice    | 1.5E8 | 3.4424096E7  | 0.0      | 1.200000008E9        | 8.0           | 811.73                | 591036.15                  |
| o_orderkey      | 1.5E8 | 1.51621184E8 | 0.0      | 1.200000008E9        | 8.0           | 1                     | 600000000                  |
| o_comment       | 1.5E8 | 1.10204136E8 | 0.0      | 7.275038757500258E9  | 48.50025806   | ' Tiresias about the' | 'zzle? unusual requests w' |
| o_orderpriority | 1.5E8 | 5.0          | 0.0      | 1.2600248124001656E9 | 8.40016536    | '1-URGENT'            | '5-LOW'                    |
+-----------------+-------+--------------+----------+----------------------+---------------+-----------------------+----------------------------+

Modify statistics

You can manually modify column statistics. You can modify parameters such as row_count, ndv, num_nulls, min_value, max_value, and data_size for a specific column. For more information, see Statistics.

ALTER TABLE hive.tpch100.orders MODIFY COLUMN o_orderstatus SET STATS ('row_count'='6001215');
Query OK, 0 rows affected (0.03 sec)

SHOW COLUMN stats hive.tpch100.orders;
+-----------------+-----------+--------------+----------+----------------------+---------------+-----------------------+----------------------------+
| column_name     | count     | ndv          | num_null | data_size            | avg_size_byte | min                   | max                        |
+-----------------+-----------+--------------+----------+----------------------+---------------+-----------------------+----------------------------+
| o_orderstatus   | 6001215.0 | 0.0          | 0.0      | 0.0                  | 0.0           | 'NULL'                | 'NULL'                     |
| o_clerk         | 1.5E8     | 100836.0     | 0.0      | 2.250000015E9        | 15.0          | 'Clerk#000000001'     | 'Clerk#000100000'          |
| o_orderdate     | 1.5E8     | 2417.0       | 0.0      | 6.00000004E8         | 4.0           | '1992-01-01'          | '1998-08-02'               |
| o_shippriority  | 1.5E8     | 1.0          | 0.0      | 6.00000004E8         | 4.0           | 0                     | 0                          |
| o_custkey       | 1.5E8     | 1.0023982E7  | 0.0      | 6.00000004E8         | 4.0           | 1                     | 14999999                   |
| o_totalprice    | 1.5E8     | 3.4424096E7  | 0.0      | 1.200000008E9        | 8.0           | 811.73                | 591036.15                  |
| o_orderkey      | 1.5E8     | 1.51621184E8 | 0.0      | 1.200000008E9        | 8.0           | 1                     | 600000000                  |
| o_comment       | 1.5E8     | 1.10204136E8 | 0.0      | 7.275038757500258E9  | 48.50025806   | ' Tiresias about the' | 'zzle? unusual requests w' |
| o_orderpriority | 1.5E8     | 5.0          | 0.0      | 1.2600248124001656E9 | 8.40016536    | '1-URGENT'            | '5-LOW'                    |
+-----------------+-----------+--------------+----------+----------------------+---------------+-----------------------+----------------------------+

Delete statistics

You can delete the information about the number of rows in a table and column statistics. If you specify the names of columns, only the statistics on the specified columns are deleted. If no column name is specified, the statistics on all columns and information about the number of rows in the table are deleted. For more information, see Statistics.

  • Delete all statistics on a table.

    DROP STATS hive.tpch100.orders
  • Delete the statistics on the specified columns in a table.

    DROP STATS hive.tpch100.orders (o_orderkey, o_orderdate)

How it works

Statistics sources

The Nereids optimizer reads statistics from the cache. The cache data is from the following sources:

  • Internal Statistics table. The statistics that are collected by executing the ANALYZE statement are stored in the Statistics table. This source is the same as that for internal tables. You can execute the ANALYZE statement to collect statistics on external tables in the same way that you analyze internal tables.

  • Stats Collector. This source applies only to the cache for external tables. Stats Collector defines some operations for obtaining statistics from external data sources. For example, the Hive Metastore and Iceberg data sources are supported. These operations can be called to obtain the existing statistics from external data sources.

    For example, if you have executed the ANALYZE statement in Hive, ApsaraDB for SelectDB can directly load the existing statistics from Hive Metastore to the cache, including the number of rows and the maximum and minimum values of columns. If no statistics exist in external data sources, Stats Collector provides an approximate estimate of the number of rows to the optimizer based on the size of the data file for a table and the table schema. In this case, the column statistics are missing, and the optimizer may generate an inefficient execution plan. Stats Collector is automatically run if no data is found in the Statistics table. You do not need to manually start Stats Collector by running a command or configure Stats Collector.