
The Intelligent Filter Condition Pushdown Principle of AnalyticDB for MySQL

This article explains the filter condition pushdown principle of AnalyticDB for MySQL, its advantages and disadvantages, and how intelligent pushdown addresses the drawbacks.

By Yuedong Xi (Xiwen)

Basic Knowledge

Filter Condition Pushdown

It is commonly believed that filter conditions should be pushed down as close to the data source as possible, so that irrelevant data is never read in the first place. In AnalyticDB for MySQL, where storage and computing are separated, predicate pushdown aims to push every predicate that can be pushed down to the storage nodes. As shown in the following figure, all filter conditions are pushed to the storage node, which reduces both the computation performed by subsequent operators and the amount of data transmitted over the network. Because predicate pushdown brings so many benefits, virtually every database treats it as an important optimization.

(Figure 1: All filter conditions pushed down to the storage node)
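As a rough illustration of this principle, the following Python sketch models a query plan as nested operators and shows a pushdown rule that moves a Filter from above an Exchange down onto the TableScan, so filtering happens on the storage node before data crosses the network. This is a simplified toy model, not the actual AnalyticDB for MySQL plan representation.

# A toy plan: operators are nested dicts; "child" points to the input operator.
plan = {"op": "Aggregate",
        "child": {"op": "Filter", "predicate": "age > 18",
                  "child": {"op": "Exchange",  # network transfer from storage to compute
                            "child": {"op": "TableScan", "table": "user"}}}}

def push_filter_down(node):
    """Move a Filter below an Exchange so it is evaluated on the storage node."""
    if node.get("child") is None:
        return node
    if node["op"] == "Filter" and node["child"]["op"] == "Exchange":
        exchange = node["child"]
        node["child"] = exchange["child"]   # Filter now sits directly on the TableScan
        exchange["child"] = node            # Exchange moves above the Filter
        return exchange
    node["child"] = push_filter_down(node["child"])
    return node

optimized = push_filter_down(plan)
# Result: Aggregate -> Exchange -> Filter(age > 18) -> TableScan(user)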

The Disadvantages of Full Predicate Pushdown to the Storage Layer

Will pushing filter conditions down to the storage layer necessarily speed up queries?

Let's take a look at the index structure of the AnalyticDB for MySQL storage node to answer this question. AnalyticDB for MySQL creates indexes on all columns by default and can combine multiple filter conditions through these indexes, so result sets that meet the conditions can be found in milliseconds. The following figure shows how a query with multiple filter conditions on one table is executed using indexes: an index lookup is performed for each column, and the resulting RowId sets are combined using intersection, union, and difference operations.

(Figure 2: Multi-condition query execution using per-column indexes and RowId set operations)
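Conceptually, this multi-condition index lookup can be modeled as set operations over RowIds, as in the following Python sketch. The index contents here are hypothetical toy data, not the actual storage-node implementation.

# Each column has an index mapping value -> set of RowIds (hypothetical toy data).
age_index = {17: {1, 4}, 25: {2, 3, 7}, 40: {5, 6}}
id_index  = {3: {2}, 8: {5}, 12: {1, 3, 4, 6, 7}}

def lookup(index, predicate):
    """Scan one column index and collect the RowIds whose value satisfies the predicate."""
    rows = set()
    for value, row_ids in index.items():
        if predicate(value):
            rows |= row_ids
    return rows

# One index lookup per filter condition, then combine the RowId sets:
rows_age = lookup(age_index, lambda v: v > 18)   # age > 18
rows_id  = lookup(id_index,  lambda v: v < 10)   # id < 10

# AND -> intersection, OR -> union, NOT -> difference of RowId sets.
result = rows_age & rows_id
print(result)   # only these rows are fetched from storage and returned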

To Push Down or Not to Push Down

Now that the index structure has been covered, a simple example illustrates a case where pushing a filter condition down to the storage layer actually makes filtering slower. Assume a simple cost model in which scanning one index entry costs A per row, obtaining details and transmitting data costs B per row, and filtering in the computing layer costs C per row. Consider the following SQL statements:

create table user (age int, id int);
select count(*) from user; -- The result is 10,000.
select count(*) from user where age > 18; -- The result is 9,000.
select count(*) from user where id < 10; -- The result is 20.
-- Now consider the following SQL statement.
select * from user where age > 18 and id < 10; -- The result is 10.
  • With full pushdown, the cost is 10,000A + 10,000A + 10B (the cost of scanning the indexes of both columns + the cost of obtaining details and transmitting 10 rows).
  • If only id < 10 is pushed down and age > 18 is evaluated in the computing layer, the cost is 10,000A + 20B + 20C (the cost of scanning the index of one column + the cost of obtaining details and transmitting 20 rows + the cost of filtering 20 rows in the computing layer).

It can be seen that if 10B + 20C is less than 10,000A, the plan that does not push down the age > 18 condition is better.

(Figure 3: Execution plans with and without full pushdown)
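Under this simple cost model, the two plans can be compared directly. The following Python sketch plugs in the numbers from the example; the per-row costs A, B, and C are hypothetical placeholders that would be measured in practice.

# Simple cost model from the example above.
total_rows = 10_000    # rows in the user table
rows_after_both = 10   # rows matching age > 18 AND id < 10
rows_after_id = 20     # rows matching id < 10 only

def full_pushdown_cost(A, B):
    # Scan the age index and the id index, then fetch only the final 10 rows.
    return total_rows * A + total_rows * A + rows_after_both * B

def partial_pushdown_cost(A, B, C):
    # Push down only id < 10; evaluate age > 18 on the 20 returned rows in the computing layer.
    return total_rows * A + rows_after_id * B + rows_after_id * C

A, B, C = 1.0, 5.0, 0.5                # hypothetical per-row costs
print(full_pushdown_cost(A, B))        # 20,000A + 10B
print(partial_pushdown_cost(A, B, C))  # 10,000A + 20B + 20C
# Partial pushdown wins whenever 10B + 20C < 10,000A.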

Optimizable Scenarios

By default, the optimizer pushes all filter conditions on indexed columns down to the storage layer, which reduces the amount of data read into the compute engine. However, using indexes to filter data is not recommended in the following scenarios:

  • The predicates have high selectivity (a large fraction of rows satisfy them) and there are many of them, so a large amount of data is still returned after filtering. In this case, looking up each index and then intersecting the RowId sets may not pay off.
  • Disk I/O usage is high. If your query workload or data writes already occupy a large amount of I/O resources, using indexes to filter data competes for disk I/O and lowers filtering efficiency.
  • The filter predicates contain expensive operations, such as string comparisons or LIKE matching, which consume a large amount of resources on storage nodes. If the amount of data returned after filtering is not large, not pushing these predicates down is better for overall performance.

Intelligent Pushdown

AnalyticDB for MySQL provides the intelligent pushdown feature to improve performance in the scenarios above and to save developers and users the effort of tuning SQL statements with hints. Based on accurate statistics, the optimizer intelligently decides whether each filter condition is pushed down to the storage nodes. Users no longer need to struggle with whether to push down a filter condition, and queries are accelerated with no extra effort.

Term Definitions

  • conjunction

A conjunction is each unit obtained by splitting a filter condition on its outermost AND. For example, ((A & B) | C) & D & E consists of three conjunctions: ((A & B) | C), D, and E. We use this definition because a conjunction is the smallest unit that can be pushed down to the storage layer: within a conjunction, either all conditions are pushed down or none are (see the sketch after this list).

  • selectivity

Selectivity is the fraction of rows that satisfy a condition. If 10 rows out of 100 satisfy A > 10, the selectivity of A > 10 is 10%.

  • connector

The AnalyticDB for MySQL optimizer supports multiple connectors, including the AnalyticDB for MySQL storage engine and OSS foreign tables. Different connectors need to be handled differently.
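To make the conjunction and selectivity terms concrete, here is a minimal Python sketch. It is an illustration only, not the optimizer's internal representation: a predicate is modeled as a nested tuple, and selectivity is estimated over a list of sample rows.

# Represent a predicate as a nested tuple: ("AND", p1, p2, ...) / ("OR", ...) / a leaf string.
predicate = ("AND", ("OR", ("AND", "A", "B"), "C"), "D", "E")

def split_conjunctions(pred):
    """Split on the outermost AND; each resulting unit is a conjunction."""
    if isinstance(pred, tuple) and pred[0] == "AND":
        return list(pred[1:])
    return [pred]

print(split_conjunctions(predicate))
# [('OR', ('AND', 'A', 'B'), 'C'), 'D', 'E'] -> three conjunctions, each pushed down all-or-nothing

def selectivity(rows, condition):
    """Fraction of rows that satisfy the condition, e.g. 10 of 100 rows -> 0.1."""
    matched = sum(1 for row in rows if condition(row))
    return matched / len(rows)

rows = [{"A": i} for i in range(100)]            # 100 sample rows
print(selectivity(rows, lambda r: r["A"] > 89))  # 10 matching rows -> 0.1, i.e. 10%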

Implementation

  • A Series of Rules

A series of existing rules in the AnalyticDB for MySQL optimizer push the filter down to the top of the table scan, the position closest to the storage node. The rest of the work is then handed over to the intelligent pushdown module, which determines which predicates can be pushed to the storage node.

  • Preprocessing

Short-circuit optimization: The module makes a rough estimate of the cost of scanning the entire table. If the table is small, the remaining steps are skipped to avoid their additional overhead.

Expression conversion: Based on the distributive law of Boolean algebra, the filter condition is converted into an AND-connected form as far as possible. For example, ( A & B ) | ( C & D ) cannot be partially pushed down, but after it is converted to ( A | C ) & ( A | D ) & ( B | C ) & ( B | D ), it can be. This conversion is applied with limits rather than blindly, because the converted expression becomes longer, which may cause the generated code (codegen) to exceed its size limits.

  • CalcSelectivity

The cardinality estimation module of the AnalyticDB for MySQL optimizer is called to calculate the selectivity, and a corresponding reliability, of each conjunction based on accurate statistics such as histograms. Relying on accurate statistics and cardinality estimation, this module provides high-quality selectivity values and accurate input for the subsequent cost calculation and pushdown plan selection.

  • Connector Cost Model

Each storage layer defines its own cost model, which uses the cost computed from selectivity to decide which filter conditions should not be pushed down. Thanks to this separation, any other storage layer connected to AnalyticDB for MySQL can support intelligent pushdown simply by implementing its connector cost model.

  • FilterPushDownSelection

This module feeds every conjunction, together with its selectivity and reliability, into the corresponding connector cost model, enumerates the pushdown combinations with a conjunction as the smallest unit, calculates the cost of each combination, and selects the pushdown plan with the lowest cost, as sketched below. In this way, filtering in AnalyticDB for MySQL has evolved from the previous full pushdown to the current intelligent pushdown.
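The following Python sketch shows the core idea under the simple cost model from the earlier example: enumerate pushdown combinations with a conjunction as the smallest unit, cost each combination through a connector-specific cost model, and keep the cheapest plan. The per-row costs and the cost formula are hypothetical simplifications, not the actual AnalyticDB for MySQL implementation.

from itertools import combinations

# Each conjunction is (text, selectivity). Per-row costs: A (index scan), B (fetch/transfer), C (compute filter).
TOTAL_ROWS = 10_000
A, B, C = 1.0, 5.0, 0.5   # hypothetical per-row costs; in practice they come from the connector

def connector_cost(pushed, not_pushed):
    """A toy connector cost model: index scans for pushed conjunctions, residual filtering in compute."""
    index_cost = len(pushed) * TOTAL_ROWS * A
    sel_pushed = 1.0
    for _, sel in pushed:
        sel_pushed *= sel                      # rows surviving the pushed-down conditions
    returned = TOTAL_ROWS * sel_pushed
    return index_cost + returned * B + returned * len(not_pushed) * C

def choose_pushdown(conjunctions):
    """Enumerate every subset of conjunctions to push down and pick the cheapest plan."""
    best = None
    for k in range(len(conjunctions) + 1):
        for pushed in combinations(conjunctions, k):
            not_pushed = [c for c in conjunctions if c not in pushed]
            cost = connector_cost(pushed, not_pushed)
            if best is None or cost < best[0]:
                best = (cost, pushed, not_pushed)
    return best

cost, pushed, not_pushed = choose_pushdown([("age > 18", 0.9), ("id < 10", 0.002)])
print(pushed, not_pushed, cost)   # pushes down only id < 10, whose selectivity (0.2%) is low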

The Effect of Full Pushdown and Intelligent Pushdown

The following figure shows the acceleration effect of intelligent pushdown on queries with long index scan times. The data comes from the change in average response time (RT) of online SQL statements before and after intelligent pushdown was enabled for internal canary (grayscale) customers. A significant performance improvement can be seen.

(Figure 4: Average RT of online SQL statements before and after intelligent pushdown was enabled)

Summary

The problems in actual production are far more complicated than those described in this article, and space does not allow more technical details to be explored in depth. Readers interested in the technology are welcome to join the AnalyticDB community for further discussion.
