Performance tuning for MaxCompute foreign tables - Hologres

MaxCompute and Hologres are seamlessly integrated. The native Hologres vector engine can directly read MaxCompute data and uses multi-level partition filtering and operator pushdown in MaxCompute to optimize query speed. This topic describes how to accelerate queries on MaxCompute data in Hologres by creating foreign tables or by importing data into Hologres.

Create a foreign table

You can create a foreign table in Hologres to directly accelerate queries on MaxCompute table data. This method does not require data import or export and does not create redundant storage. Compared to querying data directly in MaxCompute, this method improves query performance by 2 to 5 times. To further accelerate queries on MaxCompute data, you can use the following optimization strategies.

Use Serverless Computing to query foreign tables

Hologres V2.1.17 and later support Serverless Computing. For scenarios such as the offline import of large volumes of MaxCompute data, large-scale extract, transform, and load (ETL) jobs, or queries on large volumes of data in foreign tables, you can use Serverless Computing to run these tasks. This method uses additional Serverless resources directly instead of your instance's own resources. You do not need to reserve extra compute resources for the instance. This significantly improves instance stability, reduces the probability of out-of-memory (OOM) errors, and you are charged only for the task. For more information about Serverless Computing, see Serverless Computing. For more information about how to use Serverless Computing, see Use Serverless Computing.

Use the new HQE foreign table execution engine

Hologres V0.10 and later use a new execution engine, HQE, to accelerate queries on MaxCompute foreign tables. Compared to instances that run a version earlier than V0.10, query performance is improved by approximately 30% to 100%.

The HQE execution engine has the following limitations:

Currently, acceleration is effective only for MaxCompute tables that use the ORC format. Acceleration for other file formats, such as CFile, is not supported.
HQE does not accelerate queries on MaxCompute transactional tables or tables whose schemas have changed (Schema Evolution). In these cases, the system automatically switches to the SQE engine for execution.
Ensure that the data type mapping between MaxCompute and Hologres tables is correct. Otherwise, the acceleration effect is compromised.

Avoid full table scans

You can optimize your query statements as follows to avoid full table scans when you query foreign table data:

When you query data, use the SELECT a FROM xx statement to query specific columns. Do not use SELECT * FROM xx.
Add conditions to filter partitions or reduce the number of scanned partitions to decrease the amount of scanned data.

SQE query optimization

For query scenarios that HQE does not support, the system automatically switches to the SQE engine for execution. This may cause a performance regression. You can modify the MaxCompute data source table as follows to optimize query performance:

Convert MaxCompute tables to hash clustering tables
Hash clustering tables provide several advantages, including Bucket Pruning optimization, Aggregation optimization, and storage optimization. When you create a table, if you use `CLUSTERED BY` to specify a hash key, MaxCompute performs a hash operation on the specified columns and distributes the data into different buckets based on the hash values. For the hash key, select columns that have few duplicate key values.
The following statement shows how to convert a table to a hash clustering table.
```
ALTER TABLE <table_name> [CLUSTERED BY (<col_name> [, <col_name>, ...])
                       [SORTED BY (<col_name> [ASC | DESC] [, <col_name> [ASC | DESC] ...])]
                       INTO <number_of_buckets> BUCKETS];
```
After you use the ALTER TABLE statement to add the new clustering property, new partitions are stored using hash clustering. For more information about hash clustering, see Hash Clustering.
Convert MaxCompute tables to range clustering tables
Range clustering is a new data splitting method that provides a globally ordered data distribution. It can prevent potential data skew issues that are caused by hash clustering. Additionally, its ordered data distribution allows it to create a two-level index to support scenarios such as range queries on the clustering key and composite key queries.
The following statement shows how to convert a table to a range clustering table.
```
ALTER TABLE <table_name> [RANGE CLUSTERED BY (<col_name> [, <col_name>, ...])
                         [SORTED BY (<col_name> [ASC | DESC] [, <col_name> [ASC | DESC] ...])]
                         [INTO <number_of_buckets> BUCKETS];
```
After you use the ALTER TABLE statement to add the new clustering property, new partitions are stored using range clustering. For more information about range clustering, see Range Clustering.
Note
- Hash clustering tables and range clustering tables do not support the INSERT INTO statement. Use the INSERT OVERWRITE statement to overwrite data.
- Because data uploaded through Tunnel is unordered, you cannot upload data to hash clustering tables or range clustering tables using Tunnel.

Merge small files

Many small files in MaxCompute can slow down data queries.

You can run the following statement in MaxCompute to view the number of files in a table. For more information about small file optimization in MaxCompute, see FAQ about small file optimization and job diagnostics.

desc extended <table_name>;

In the query results, FileNum indicates the number of files in the target MaxCompute table, and Size indicates the total size of the table in bytes. If a table has more than 100 files and the average file size is less than 64 MB, you can consider merging the small files in MaxCompute.

Parameter tuning

When you query a foreign table, Hologres sets default parameters to increase the concurrency for data reads and improve query efficiency. If you have special requirements, you can configure the following parameters as needed.

Important

The following parameters are set to optimal values based on internal tuning and experiments. Typically, you do not need to change them.
Do not set the hg_foreign_table_executor_max_dop parameter to a very low value. For example, if you set this parameter to 1 and a table in the instance is being written to and queried at the same time, the load might be concentrated on specific workers. This can cause an OOM error in the instance.

Adjust the number of partitions hit by each query.

--The default value is 512 and the maximum value is 1024. Do not set this value too high because it may affect query performance.
set hg_foreign_table_max_partition_limit = 128;

Adjust the batch size for each read from a MaxCompute table.

--The default value is 8192.
set hg_experimental_query_batch_size = 4096;

Set the number of splits for accessing a MaxCompute table. This lets you adjust the concurrency.

--The default value is 64 MB. If the table is large, increase this value to prevent too many splits from affecting performance. This parameter is effective in Hologres V1.1 and later.
set hg_foreign_table_split_size = 128;

Set the maximum concurrency for accessing a foreign table. The default value is the number of cores in the instance.

--The maximum value is 128. Do not set a large value. This prevents foreign table queries, especially in data import scenarios, from affecting other queries and causing system busy errors. This parameter is effective in Hologres V1.1 and later.
set hg_foreign_table_executor_max_dop = 32;

Set the maximum concurrency for running DML statements when you access a foreign table.

--The default value is 32. This parameter is optimized for data import and export scenarios to prevent import operations from using too many system resources. This parameter is effective in Hologres V1.1 and later.
set hg_foreign_table_executor_dml_max_dop = 16;

Import data into Hologres (Recommended)

To perform extensive analytical calculations on foreign table data and join it with internal tables, you can create an internal table in Hologres and import the foreign table data. You can specify a suitable distribution key and index property for the internal table as needed to speed up queries.

Importing data into Hologres is faster than querying a foreign table. It can improve query performance by 10 to 100 times. For more information about how to import data from a MaxCompute foreign table into Hologres, see Import data from MaxCompute using SQL.