Hologres V2.0 and later support runtime filters. Runtime filters automatically optimize filter operations during multi-table join queries and help improve the performance of join queries. This topic describes how to use runtime filters in Hologres.
Background information
Scenarios
Hologres V2.0 and later support runtime filters. Runtime filters are suitable for scenarios in which two or more tables are hash joined for queries, such as the scenario of joining a large table with a small table. The optimizer and execution engine use runtime filters to automatically optimize filter operations during queries without the need for manual settings. After the optimization, I/O overheads are reduced and the performance of join queries is improved.
How it works
To understand the principles of runtime filters, you must understand the join process. In this example, the following SQL statement is executed to join two tables:
select * from test1 join test2 on test1.x = test2.x;
The following execution plan is used.
In the preceding execution plan, a hash table is built based on the test2
table, and data in the hash table is used to match data in the test1
table. Then, a result is returned. This join process involves the following concepts:
Build table: During a hash join of tables or subqueries, the table or subquery based on which a hash table is built is called the build table. The build table corresponds to the hash node in the execution plan.
Probe table: The other side of the hash join is the probe table. Data in the probe table is read and matched with data in the hash table that is built based on the build table.
In most cases, if the execution plan is valid, the small table is the build table and the large table is the probe table.
During hash joins, a lightweight filter is generated based on the distribution of data in the build table and sent to the probe table to prune data in the probe table. This helps reduce the amount of data in the probe table that needs to be hash joined and transmitted over the network. This way, join performance is improved.
Runtime filters deliver better performance in scenarios in which large and small tables are joined and the amounts of data in the tables vary greatly.
Limits and trigger conditions
Limits
Only Hologres V2.0 and later support runtime filters.
In Hologres V2.0, runtime filters can be triggered only if tables are joined based on one field. In Hologres V2.1 and later, runtime filters support table joins based on multiple fields. If the fields meet the trigger conditions, a runtime filter is triggered.
Trigger conditions
Hologres supports high-performance joins. A runtime filter is automatically triggered at the underlying layer if the following conditions are met for an SQL query:
The probe table contains 100,000 or more rows of data.
The amount of data to be scanned in the build table divided by the amount of data to be scanned in the probe table is less than or equal to 0.1
. A smaller value makes it easier to trigger a runtime filter.The amount of data to be joined in the build table divided by the amount of data to be joined in the probe table is less than or equal to 0.1
. A small value makes it easier to trigger a runtime filter.
Types of runtime filters
Runtime filters are classified into the following types based on different dimensions:
Local filters and global filters based on whether data in the probe table needs to be shuffled during hash joins
Local filters: supported in Hologres V2.0 and later. Local filters are used if data shuffling is not required in the probe table during hash joins and the build table meets one of the following conditions:
The values of the join keys in the build table and probe table are distributed in the same manner.
Data in the build table is broadcast to the probe table.
Data in the build table is shuffled to the probe table by using the same distribution method as data in the probe table.
Global filters: supported in Hologres V2.2 and later. If data in the probe table needs to be shuffled, runtime filters need to be merged. In this case, global filters are used.
Local filters may help reduce the amount of data to be scanned and the amount of data in hash joins. Global filters take effect before data is shuffled in the probe table and can reduce the amount of data to be transmitted over the network. The Hologres engine automatically selects a type of filter to adapt to different scenarios. No manual configuration is required.
Bloom filters, In filters, and MinMax filters
Bloom filters: supported in Hologres V2.0 and later. Bloom filters have a false positive rate and do not filter all data. However, Bloom filters are commonly used and can provide a high filtering efficiency for build tables that contain a large amount of data. This helps improve query performance.
In filters: supported in Hologres V2.0 and later. In filters are used when the number of distinct values (NDV) of the build table is small. An In filter uses data in the build table to build a hash set and sends the hash set to the probe table for filtering. In filters can filter all data that needs to be filtered and can be used with bitmap indexes.
MinMax filters: supported in Hologres V2.0 and later. A MinMax filter obtains the maximum and minimum values based on data in the build table and sends the values to the probe table for filtering. In this case, files or a batch of data may be filtered out based on the metadata information. This reduces I/O costs.
The Hologres engine automatically selects a type of filter to adapt to different join scenarios. No manual configuration is required.
Verify runtime filters
The following examples help you better understand runtime filters.
Example 1: Use a local filter for a join based on one column
Sample code
begin; create table test1(x int, y int); call set_table_property('test1', 'distribution_key', 'x'); create table test2(x int, y int); call set_table_property('test2', 'distribution_key', 'x'); end; insert into test1 select t, t from generate_series(1, 100000) t; insert into test2 select t, t from generate_series(1, 1000) t; analyze test1; analyze test2; explain analyze select * from test1 join test2 on test1.x = test2.x;
Execution plan
The
test2
table contains 1,000 rows of data and serves as the build table. Thetest1
table contains 100,000 rows of data and serves as the probe table. The amount of data in the build table divided by the amount of data in the probe table is 0.01, which is less than 0.1. The amount of data to be joined in the build table divided by the amount of data to be joined in the probe table is 0.01, which is less than 0.1. The conditions for triggering runtime filters are met. The Hologres engine automatically uses a runtime filter.The
test1
table that serves as the probe table contains theRuntime Filter Target Expr
node. This indicates that a runtime filter is pushed down to the probe table.The scan_rows parameter of the probe table indicates the amount of data to be read from storage, which is 100,000 rows of data. The rows parameter indicates the amount of data to be scanned after data is filtered by using a runtime filter, which is 1,000 rows of data. The values of the two parameters indicate that a large amount of data is filtered out by using the runtime filter.
Example 2: Use a local filter for a join based on multiple columns in Hologres V2.1 or later
Sample code
drop table if exists test1, test2; begin; create table test1(x int, y int); create table test2(x int, y int); end; insert into test1 select t, t from generate_series(1, 1000000) t; insert into test2 select t, t from generate_series(1, 1000) t; analyze test1; analyze test2; explain analyze select * from test1 join test2 on test1.x = test2.x and test1.y = test2.y;
Execution plan
Tables are joined based on multiple columns, and a runtime filter is used to filter data based on multiple columns.
Data in the build table is broadcast to the probe table, which meets the conditions for triggering a local filter.
Example 3: Use a global filter for a shuffle join in Hologres V2.2 or later
Sample code
SET hg_experimental_enable_result_cache = OFF; drop table if exists test1, test2; begin; create table test1(x int, y int); create table test2(x int, y int); end; insert into test1 select t, t from generate_series(1, 100000) t; insert into test2 select t, t from generate_series(1, 1000) t; analyze test1; analyze test2; explain analyze select * from test1 join test2 on test1.x = test2.x;
Execution plan
Data in the probe table is shuffled to the hash join operator. The Hologres engine automatically uses a global filter to accelerate queries.
Example 4: Use an In filter with bitmap indexes in Hologres V2.2 or later
Sample code
set hg_experimental_enable_result_cache=off; drop table if exists test1, test2; begin; create table test1(x text, y text); call set_table_property('test1', 'distribution_key', 'x'); call set_table_property('test1', 'bitmap_columns', 'x'); call set_table_property('test1', 'dictionary_encoding_columns', ''); create table test2(x text, y text); call set_table_property('test2', 'distribution_key', 'x'); end; insert into test1 select t::text, t::text from generate_series(1, 10000000) t; insert into test2 select t::text, t::text from generate_series(1, 50) t; analyze test1; analyze test2; explain analyze select * from test1 join test2 on test1.x = test2.x;
Execution plan
A bitmap index is used on the scan operator for the probe table. The In filter is used for accurate filtering, and 50 rows of data is remained after filtering. More than 7 million rows of data needs to be scanned, which is less than the original 10 million rows of data. This is because the In filter can be pushed down to the storage engine, which may reduce I/O costs. This way, less data is read from the storage engine. The In filter is used with bitmap indexes to significantly improve performance in scenarios in which the join key is of the STRING type.
Example 5: Use a MinMax filter to reduce I/O overheads in Hologres V2.2 or later
Sample code
set hg_experimental_enable_result_cache=off; drop table if exists test1, test2; begin; create table test1(x int, y int); call set_table_property('test1', 'distribution_key', 'x'); create table test2(x int, y int); call set_table_property('test2', 'distribution_key', 'x'); end; insert into test1 select t::int, t::int from generate_series(1, 10000000) t; insert into test2 select t::int, t::int from generate_series(1, 100000) t; analyze test1; analyze test2; explain analyze select * from test1 join test2 on test1.x = test2.x;
Execution plan
The scan_rows parameter of the probe table indicates that the amount of data to be read from storage is more than 320,000 rows of data, which is significantly less than the original 10 million rows of data. This is because the runtime filter is pushed down to the storage engine, and a batch of data is filtered based on metadata. This may greatly reduce I/O costs. In most cases, MinMax filters help reduce I/O costs when join keys are of a numeric type and the value range of data in the build table is less than the value range of data in the probe table.