All Products
Search
Document Center

MaxCompute:Dynamic filtering

Last Updated:Feb 19, 2025

Join operations are common in distributed systems. A join operation is both time- and resource-consuming because data is shuffled during the join process. This is the case especially when a large amount of data is involved. MaxCompute allows you to optimize shuffle operations by using the equi-join attribute of join operations.

Optimization principle

The following example shows a typical SQL statement that contains a join.

select * from (table1) A join (table2) B on A.a= B.b;

In this example, MaxCompute can generate a filter based on data in Table A by using the equi-join attribute of the join operation. Then, MaxCompute can filter data in Table B before the shuffle or join operation starts. MaxCompute can even apply the filter conditions to the underlying storage to filter data at the source. This feature is called dynamic filtering.

The following figure shows the execution plans of the preceding SQL statement before and after dynamic filtering is enabled.

JOIN

Scenarios

Dynamic filtering is implemented based on the equi-join attribute of join operations. MaxCompute uses this feature to dynamically generate a filter during the running process of a job and filter data before a shuffle or join operation starts. This accelerates queries. The feature is suitable for the join between a dimension table and a fact table.

Dynamic bloom filter and range filter

As shown in the preceding figure, the original execution plan of a job does not contain a filter. After you enable dynamic filtering, MaxCompute automatically generates a filter based on the equi-join attribute of the join operation. The filter is used to check whether elements in Table B exist in datasets generated by Table A and filter out those that do not exist in the datasets.

A dynamic bloom filter can help you effectively reduce the time and resources consumed by a join operation. MaxCompute also allows you to use dynamic range filters based on the value of [min, max] and use the IN predicate method to filter data.

A dynamic filter has a producer and a consumer, as shown in the following figure.动态过滤器

  • Dynamic filter producer (DFP) operator: the producer of dynamic filters. It generates a dynamic bloom filter based on data in the smaller table of a join operation. It also obtains the values of the min and max metrics of the join key to generate a dynamic range filter. Then, the producer sends the filters to the consumer.

  • Dynamic filter consumer (DFC) operator: the consumer of dynamic filters. It uses the received dynamic bloom filter and range filter to filter data in the larger table of the join operation. The dynamic range filter tries to apply the filter conditions to the underlying storage. If this operation succeeds, it filters data at the source.

The roles that a join object can play vary based on the join semantics:

  • A JOIN B: Either Table A or Table B can serve as the producer or consumer.

  • A LEFT JOIN B: Table A can serve only as the producer, and Table B can serve only as the consumer.

  • A RIGHT JOIN B: Table A can serve only as the consumer, and Table B can serve only as the producer.

  • A FULL OUTER JOIN B: Dynamic filtering is not supported.

For more information about how to enable dynamic filtering, see Enable dynamic filtering.

Dynamic partition pruning

The preceding example shows the optimization of join operations between non-partitioned tables. The join key in the example is not a partition key column. If the join key is a partition key column of a partitioned table, a dynamic bloom filter or range filter can still be used. The difference is that, before data in the partitioned table is filtered, MaxCompute reads data in all the partitions of the table. In this case, you can use the dynamic partition pruning feature to prune the partitions that do not need to be read before MaxCompute starts to read data.

The following example shows an SQL statement that contains a join:

-- Table A is a non-partitioned table, and the value of column a in the table is 20200701. 
-- Table B is a partitioned table. Its partition key column is ds, and this column contains three values (partition names): 20200701, 20200702, and 20200703. 
select * from (table1) A join (table2) B on A.a= B.ds;

After you enable dynamic partition pruning, the optimizer of MaxCompute checks whether the larger table of a join operation is a partitioned table. If it is a partitioned table, MaxCompute generates a dynamic bloom filter based on data in the smaller table. Then, MaxCompute checks the partition list, aggregates the partitions that need to be read, and prunes the partitions that do not need to be read. If all partitions to be read in a running process are pruned, the process is not scheduled.

In this example, the 20200702 and 20200703 partitions of Table B are pruned because column a of Table A has only one value, 20200701. This helps save resources and reduce the running duration of the job.

For more information about how to enable dynamic partition pruning, see Enable dynamic partition pruning.

Enable dynamic filtering

You can use one of the following methods to enable dynamic filtering:

  • Method 1: Run the following command to forcibly enable dynamic filtering at the session level. The command must be committed together with an SQL statement.

    set odps.optimizer.force.dynamic.filter=true;
    Note

    Although you can use this method at the project level, we recommend that you use it at the session level. If you use it at the project level, dynamic filtering is enabled for all join jobs of a project. As a result, if data cannot be filtered in a join operation, the join efficiency is low.

    If this method is used, dynamic filters are generated for all join operations for which the feature is enabled.

  • Method 2: Run the following command to optionally enable dynamic filtering at the session level:

    set odps.optimizer.enable.dynamic.filter=true;

    If this method is used, the optimizer estimates whether dynamic filtering can produce benefits in terms of resource or time consumption. If it can produce benefits, MaxCompute generates a dynamic filter. Otherwise, it does not generate a dynamic filter.

    Note

    This method depends on metadata statistics, such as the ndv metric. For more information about metadata statistics, see Collect information for the optimizer of MaxCompute. Metadata statistics are the estimation results of the optimizer and may be inaccurate. Therefore, sometimes a dynamic filter may fail to be generated as expected.

  • Method 3: Use a hint in an SQL statement to enable dynamic filtering.

    The hint must be in the /*+dynamicfilter(Producer, Consumer1[, Consumer2,...])*/ format. Each producer can generate dynamic filters for multiple consumers. Example:

    select /*+dynamicfilter(A, B)*/ * from (table1) A join (table2) B on A.a= B.b;

Enable dynamic partition pruning

Run the following command to enable dynamic partition pruning at the session level. The command must be committed together with an SQL statement.

set odps.optimizer.dynamic.filter.dpp.enable=true;
Note

Although you can use this method at the project level, we recommend that you use it at the session level. If you use it at the project level, dynamic partition pruning is enabled for all join jobs that involve partitioned tables in a project. As a result, if data cannot be filtered in a join operation, the join efficiency is low.

Perform verifications

After you complete the configurations described in Enable dynamic filtering and Enable dynamic partition pruning, you can use the following methods to check whether dynamic filtering and dynamic partition pruning are enabled.

Check whether dynamic filtering is enabled

After you run an SQL job, view the LogView information of the job. If an operator similar to DynamicFilterConsumer1 appears on the LogView page, a dynamic filter has been generated and used to filter data. 确认过滤器结果

Check whether dynamic partition pruning is enabled

After you run an SQL job, view the LogView information of the job. If an operator similar to DppDynamicProducer appears on the LogView page and the operator contains PartitionPruneInfos, dynamic partition pruning is enabled.

image