Query Transformation of MySQL 8.0 Optimizer

By Daoke

1. Background and Architecture

In the Query Resolution of MySQL 8.0 Optimizer article, we focused on the resolving, setting, and transformation processes of SQL basic elements in MySQL 8.0.25, including tables, columns, functions, aggregation, grouping, sorting, etc.. This article introduces more complex transformation process of subqueries, partitioned tables, and JOIN operations. The outline is listed below:

Transformation

remove_redundant_subquery_clause:

Permanently remove redundant parts from the query if:

1) This is a subquery
2) Not normalizing a view.

Removal should take place when a query involving a view is optimized, not when the view is created.

remove_base_options:

Remove SELECT_DISTINCT options from a query block if it can skip distinct.

resolve_subquery: Resolve predicate involving subquery and perform early unconditional subquery transformations using one of the options below:
- Convert subquery predicate into semi-join
- Mark the subquery for execution using materialization
- Perform IN->EXISTS transformation
- Perform more/less ALL/ANY -> MIN/MAX rewrite
- Substitute trivial scalar-context subquery with its value
transform_scalar_subqueries_to_join_with_derived:

Transform eligible scalar subqueries to derived tables

flatten_subqueries:

Convert semi-join subquery predicates into semi-join join nests. Convert candidate subquery predicates into semi-join join nests. This transformation is performed once in query lifetime and is irreversible.

apply_local_transforms:
- delete_unused_merged_columns: If query block contains one or more merged derived tables/views, walk through lists of columns in select lists and remove unused columns.
- simplify_joins: Convert all outer joins to inner joins if possible.
- prune_partitions: Perform partition pruning for a given table and condition.
push_conditions_to_derived_tables:

Pushing conditions down to derived tables must be done after validity checks of grouped queries are done by apply_local_transforms().

Window::eliminate_unused_objects:

Eliminate unused objects like window definitions and redundant sorts.

2. Detailed Transformation Process

2.1 Resolve Subquery (resolve_subquery)

Statements with subqueries in resolution conditions require some early-stage unlimited subquery transformations, including:

Mark whether subquery becomes semi-join.
- Transform judgment conditions
- Ensure there are no restrictions on OPTIMIZER_SWITCH_SEMIJOIN and HINT
- Subquery is predicate of IN/=ANY and EXIST subquery.
- Subquery is simple query block instead of UNION.
- Subquery does not have implicit and explicit GROUP BY.
- Subquery has no HAVING or WINDOW functions.
- The resolution phase involves Query_block::RESOLVE_CONDITION and Query_block::RESOLVE_JOIN_NEST and does not use the latest Hyper optimizer.
- External query blocks can support semijoins.
- At least one table is needed, rather than using "SELECT 1".
- The strategy of the subquery is not specified (Subquery_strategy::UNSPECIFIED).
- The parent query also has at least one table.
- Neither parent query nor subquery can have straight join.
- The parent query block does not prohibit semijoin.
- The return value of the IN predicate is certain, not RAND.
- Determine whether the result needs to be transformed to true or false, whether it is NULL, and whether antijoin or semijoin operation can be performed.
- Antijoin or semijoin is supported.
- offset and limit are valid for semjoin. offset starts from the first line, and limit is not 0.
- Set Subquery_strategy::CANDIDATE_FOR_SEMIJOIN and add sj_candidates.
Mark whether subquery adopts the materialization scheme during execution.
- If it is impossible to transform to semijoin, try materialization method and transform judgment conditions.
- Optimizer enables subquery_to_derived=on.
- Subquery is IN/=ANY or EXISTS predicate.
- Subquery is simple query block instead of UNION.
- If it is [NOT] EXISTS, there must be no aggregation.
- Subquery predicate is in the WHERE clause (currently not implemented in the ON clause) and is the expression tree of ANDs or ORs.
- Parent query blocks can support semijoins.
- The strategy of the subquery is not specified (Subquery_strategy::UNSPECIFIED).
- The parent query also has at least one table, and LEFT JOIN operation can be performed.
- The parent query block does not prohibit semijoin.
- The return value of the IN predicate is certain, not RAND.
- Determine whether the result needs to be transformed to true or false, whether it is NULL, and whether antijoin or semijoin operation can be performed.
- If the left parameter is not multi-column subquery, it is not supported. (WHERE (outer_subq) = ROW(derived.col1,derived.col2)).
- This subquery cannot be transformed to derived table (m_subquery_to_derived_is_impossible).
- Set Subquery_strategy::CANDIDATE_FOR_DERIVED_TABLE and add sj_candidates.
If the two strategies cannot be used above, select transformer according to type:
- Item_singlerow_subselect::select_transformer
- For simple scalar subqueries, replace with the execution result in the query.

select * from t1 where a = (select 1);
=>
select * from t1 where a = 1;

Item_in_subselect/Item_allany_subselect::select_transformer->select_in_like_transformer
Use the select_in_like_transformer function to handle the transformation of IN/ALL/ANY/SOME subqueries
Handle "SELECT 1" (Item_in_optimizer)
If there is no subquery execution method currently (subqueries executed by semijoin/antijoin cannot be used), IN->EXISTS transformation will be performed. In essence, we make a choice between materialization execution and iterative loop execution. The IN syntax means that the non-related subquery is only executed once, and the query results are materialized into a temporary table for subsequent search. EXISTS means that for each record of the external table, subquery will be executed once in an iterative and loop manner. The subquery strategy is set to Subquery_strategy::CANDIDATE_FOR_IN2EXISTS_OR_MAT.
Rewrite IN/ALL/ANY subqueries of single-column (single_value_transformer).

oe $cmp$ (SELECT ie FROM ... WHERE subq_where ... HAVING subq_having)
=>
- oe $cmp$ (SELECT MAX(...) )  // Handled by Item_singlerow_subselect
- oe $cmp$ \<max\>(SELECT ...)   // Handled by Item_maxmin_subselect

fails=>Item_in_optimizer
- The materialization scheme is not transformed.
- Transform IN to EXISTS via equi-join.

For single ALL/ANY subquery predicate, try to transform it to MIN/MAX subquery:

SELECT * FROM t1 WHERE a < ANY (SELECT a FROM t1); 
=>
SELECT * FROM t1 WHERE a < (SELECT MAX(a) FROM t1)

If the conditions above are not met, call single_value_in_to_exists_transformer to transform IN to EXISTS:
- Set the subquery to the related subquery and set the UNCACHEABLE_DEPENDENT identification.
- If the subquery contains aggregate functions, window functions, GROUP syntax, and HAVING syntax, add the judgment conditions to the HAVING clause. In addition, NULL and False results are distinguished by ref_or_null_helper. If NULL IN (SELECT ...) needs to be handled, it needs to be encapsulated into the Item_func_trig_cond trigger.

SELECT ... FROM t1 WHERE t1.b IN (SELECT <expr of SUM(t1.a)> FROM t2)
=>
SELECT ... FROM t1 WHERE t1.b IN (SELECT <expr of SUM(t1.a)> FROM t2
                                 [trigcond] HAVING t1.b=ref-to-<expr of SUM(t1.a)>)

If the subquery does not contain aggregate functions, window functions, and GROUP syntax, it will be placed in the WHERE query conditions. If you need to handle NULL results, it must be put into the HAVING clause (Item_func_trig_cond + Item_is_not_null_test).

Subqueries that do not need to distinguish NULL and FALSE results:

SELECT 1 FROM ... WHERE (oe $cmp$ ie) AND subq_where

Subqueries that need to distinguish:
SELECT 1 FROM ...
      WHERE subq_where AND trigcond((oe $cmp$ ie) OR (ie IS NULL))
      HAVING trigcond(@<is_not_null_test@>(ie))

JOIN::optimize() calculates the materialization and EXISTS transformation costs for selection and sets m_subquery_to_derived_is_impossible=true.
ROW value is transformed through Item_in_optimizer but does not support ALL/ANY/SOME (row_value_transformer).
Item_in_subselect::row_value_in_to_exists_transformer

for (each left operand)
  create the equi-join condition
  if (is_having_used || !abort_on_null)
    create the "is null" and is_not_null_test items
  if (is_having_used)
    add the equi-join and the null tests to HAVING
  else
    add the equi-join and the "is null" to WHERE
add the is_not_null_test to HAVING

The HAVING expression is not contained.

(l1, l2, l3) IN (SELECT v1, v2, v3 ... WHERE where) =>
EXISTS (SELECT ... WHERE where and
                         (l1 = v1 or is null v1) and
                         (l2 = v2 or is null v2) and
                         (l3 = v3 or is null v3)
                   [ HAVING is_not_null_test(v1) and
                         is_not_null_test(v2) and
                         is_not_null_test(v3)) ] <- Ensure that it is not NULL. Thus, HAVING can be removed.

The HAVING expression is contained.

(l1, l2, l3) IN (SELECT v1, v2, v3 ... HAVING having) =>
EXISTS (SELECT ... HAVING having and
                          (l1 = v1 or is null v1) and
                          (l2 = v2 or is null v2) and
                          (l3 = v3 or is null v3) and
                          is_not_null_test(v1) and
                          is_not_null_test(v2) and
                          is_not_null_test(v3))

2.2 Transform Scalar Subqueries to Derived Table (transform_scalar_subqueries_to_join_with_derived)

This feature was officially added in version 8.0.16 to support the downward analysis of Secondary Engine (Heapwave) and enhance the transformation capability of subqueries. Let's look at the difference between the execution plan that involves transformation and the one that does not:

root:test> set optimizer_switch = 'subquery_to_derived=off';
Query OK, 0 rows affected (0.00 sec)

root:test> EXPLAIN SELECT b, MAX(a) AS ma FROM t4 GROUP BY b HAVING ma < (SELECT MAX(t2.a) FROM t2 WHERE t2.b=t4.b);
+----+--------------------+-------+------------+------+---------------+------+---------+------+------+----------+-----------------+
| id | select_type        | table | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra           |
+----+--------------------+-------+------------+------+---------------+------+---------+------+------+----------+-----------------+
|  1 | PRIMARY            | t4    | NULL       | ALL  | NULL          | NULL | NULL    | NULL |   10 |   100.00 | Using temporary |
|  2 | DEPENDENT SUBQUERY | t2    | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    3 |    33.33 | Using where     |
+----+--------------------+-------+------------+------+---------------+------+---------+------+------+----------+-----------------+
2 rows in set, 3 warnings (0.00 sec)

root:test> set optimizer_switch = 'subquery_to_derived=on';
Query OK, 0 rows affected (0.00 sec)

root:test> EXPLAIN SELECT b, MAX(a) AS ma FROM t4 GROUP BY b HAVING ma < (SELECT MAX(t2.a) FROM t2 WHERE t2.b=t4.b);
+----+-------------+------------+------------+------+---------------+------+---------+------+------+----------+--------------------------------------------+
| id | select_type | table      | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra                                      |
+----+-------------+------------+------------+------+---------------+------+---------+------+------+----------+--------------------------------------------+
|  1 | PRIMARY     | t4         | NULL       | ALL  | NULL          | NULL | NULL    | NULL |   10 |   100.00 | Using temporary                            |
|  1 | PRIMARY     | <derived2> | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    3 |   100.00 | Using where; Using join buffer (hash join) |
|  2 | DERIVED     | t2         | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    3 |   100.00 | Using temporary                            |
+----+-------------+------------+------------+------+---------------+------+---------+------+------+----------+--------------------------------------------+
3 rows in set, 3 warnings (0.01 sec)

The detailed transformation process of transform_scalar_subqueries_to_join_with_derived is listed below:
- First, collect scalar subqueries that can be transformed from the JOIN conditions, WHERE conditions, HAVING conditions, and SELECT list (Item::collect_scalar_subqueries).
- Traverse these subqueries to determine whether additional transformation can be added (transform_grouped_to_derived): turn an implicit GROUP BY scalar subquery into a derived table.

SELECT SUM(c1), (SELECT SUM(c1) FROM t3) scalar FROM t1;
Transform to
SELECT derived0.summ, derived1.scalar
FROM (SELECT SUM(a) AS summ FROM t1) AS derived0
       LEFT JOIN
       (SELECT SUM(b) AS scalar FROM t3) AS derived1
       ON TRUE
The execution plan is as follows:
explain SELECT SUM(a), (SELECT SUM(c1) FROM t3) scalar FROM t1;
+----+-------------+------------+------------+------+---------------+------+---------+------+------+----------+--------------------------------------------+
| id | select_type | table      | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra                                      |
+----+-------------+------------+------------+------+---------------+------+---------+------+------+----------+--------------------------------------------+
|  1 | PRIMARY     | <derived3> | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    1 |   100.00 | NULL                                       |
|  1 | PRIMARY     | <derived4> | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    1 |   100.00 | Using where; Using join buffer (hash join) |
|  4 | DERIVED     | t3         | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    1 |   100.00 | NULL                                       |
|  3 | DERIVED     | t1         | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    2 |   100.00 | NULL                                       |
+----+-------------+------------+------------+------+---------------+------+---------+------+------+----------+--------------------------------------------+

Collect the only Item list of aggregate functions (collect_aggregates). These items will be replaced by the columns of the new derived table.
- It is also necessary to add all fields referenced to these Items, including those in the SELECT list and those included in the Window function parameters, ORDER BY, and Partition BY. The ORDER BY columns in the query block are also involved because they will be imported to the derived table.
- Create Query_expression/Query_block required by a derived table (create_query_expr_and_block)
- Add derived table to query blocks and top_join_list
- Retain the old subquery unit block. If it contains a derived table that can be transformed, move it to Query_block under the Derived Table. If not, retain it in the original subquery block.
- Insert the previous list of aggregate function Items into the query block of the derived table
- Collect columns except for the columns in the GROUP AGG expression. Since these fields have been moved to the derived table, delete the unreasonable fields references.
- After collecting all unique columns and references to View, add them to the new derived table list.
- Perform flatten_subqueries/setup_tables operation on the new derived table
- Perform resolve_placeholder_tables operation again, but do not handle subqueries after transformation
- Handle the aggregate function Item in the newly added HAVING condition in the derived table. Refer it to new_derived->base_ref_items by Item_aggregate_refs instead of the previous parent query block base_ref_items.
- Permanently replace the list of aggregate functions in the parent query block, change to the columns of the derived table, and delete them.
- Unique columns and references to views previously saved and added to the derived table also need to be replaced with new fields.

Currently, the subquery cannot be contained in the HAVING expression. However, it can be transformed.

SELECT SUM(a), (SELECT SUM(b) FROM t3) scalar
FROM t1
HAVING SUM(a) > scalar;
Transform to
SELECT derived0.summ, derived1.scalar
FROM (SELECT SUM(a) AS summ FROM t1) AS derived0
       LEFT JOIN
       (SELECT SUM(b) AS scalar FROM t3) AS derived1
       ON TRUE
WHERE derived0.sum > derived1.scalar;

Next, traverse all subqueries that can be transformed and transform them into derived tables. Replace the corresponding expressions with columns (transform_subquery_to_derived):
- Generate a TABLE_LIST of the derived table (synthesize_derived).
- Set a where_cond that can be moved to the derived table to join_cond
- Add a derived table to the table collection of query blocks
- decorrelate_derived_scalar_subquery_pre
- Add unrelated reference columns (NCF) to the SELECT list. These conditions are referenced by the JOIN condition, and there is another field ('lifted_where') that contains columns related to external queries.
- Add COUNT(*) to the SELECT list so the transformed query blocks can be checked for cardinality. For example, check if there are aggregate functions in subquery. If aggregate functions are included in the subquery, the returned row must be NCF included in the GROUP BY list.
- Add NCF to the GROUP list of the subquery. If it already exists, add to the last. If the dependency check of GROUP BY columns fails, Item_func_any_value (non-aggregate columns) needs to be added to the SELECT list. For NCF, derived.field and derived.count(field) are created.
- Set up some preparations for materialization (setup_materialized_derived)
- decorrelate_derived_scalar_subquery_post:
- Create the corresponding 'lifted_fields'.
- Update the references of related columns in the JOIN condition. Instead of referencing external queries, columns related to the derived table are used.
The subquery expressions that replace the WHERE, JOIN, and HAVING conditions in the SELECT list become columns of the corresponding derived table.

The transformation process and results of this function are illustrated below:

2.3 Flatten Subquery (flatten_subqueries)

This function mainly transforms Semi-join subqueries to nested JOIN. This process is only completed once and is irreversible.

Simply put, the steps can be simplified and understood like this:
- Create a SEMI JOIN (it1 ... itN) syntax and add it to the execution plan of the outer query block
- Add the WHERE condition and JOIN condition of the subquery to the WHERE condition of the parent query
- Remove the subquery predicate from the judgment predicate of the parent query
Since the number of tables that MySQL can join in a query block is limited (MAX_TABLES), not all sj_candidates can perform the flatten_subqueries operation. So, the order of priority determines which needs to be unnested first. The priority rules are listed below:
- Related subqueries are prior to non-related ones.
- Subqueries with more inner tables are prior to those with fewer inner tables.
- Subqueries near the front are prior to those near the end.

subq_item->sj_convert_priority =
        (((dependent * MAX_TABLES_FOR_SIZE) +  // Dependent subqueries first
          child_query_block->leaf_table_count) *
         65536) +           // Then, those with many tables
        (65536 - subq_no);  // Then, choose based on position

In addition, since the recursive call of flatten_subqueries is bottom-up, the lower subqueries are expanded into the outer query block.

    for SELECT#1 WHERE X IN (SELECT #2 WHERE Y IN (SELECT#3)) :

    Query_block::prepare() (select#1)
       -> fix_fields() on IN condition
           -> Query_block::prepare() on subquery (select#2)
               -> fix_fields() on IN condition
                    -> Query_block::prepare() on subquery (select#3)
                    <- Query_block::prepare()
               <- fix_fields()
               -> flatten_subqueries: merge #3 in #2
               <- flatten_subqueries
           <- Query_block::prepare()
       <- fix_fields()
       -> flatten_subqueries: merge #2 in #1

Traverse the subquery list, delete subqueries marked as Subquery_strategy::DELETED by Item::clean_up_after_removal, and set sj_convert_priority according to priority. Sort by priority.
Traverse the sorted list of subqueries. For subqueries adopting Subquery_strategy::CANDIDATE_FOR_DERIVED_TABLE strategy, transform the subquery ([NOT] {IN, EXISTS}) to a JOIN derived table (transform_table_subquery_to_join_with_derived).

FROM [tables] WHERE ... AND/OR oe IN (SELECT ie FROM it) ...
=>
FROM (tables) LEFT JOIN (SELECT DISTINCT ie FROM it) AS derived
              ON oe = derived.ie WHERE ... AND/OR derived.ie IS NOT NULL ...

Set the strategy to Subquery_strategy::DERIVED_TABLE:
- The semijoin subquery and antijoin subquery cannot be nested with each other. If external query table has exceeded MAX_TABLE, no transformation should be done. Otherwise, the strategy is marked as Subquery_strategy::SEMIJOIN.
- Determine whether the WHERE condition of subquery is a constant. If the result is always FALSE, the subquery result is always empty. Thus, call Item::clean_up_after_removal to mark the strategy as Subquery_strategy::DELETED and delete the subquery.
- If it cannot be marked as Subquery_strategy::DELETED, reset the mark of Subquery_strategy::SEMIJOIN. It will turn into Subquery_strategy::UNSPECIFIED to continue the next one.
- Replace the subquery judgment condition in the WHERE condition of the outer query (replace_subcondition)
- If the condition in a subquery is not always FALSE, or if it is always FALSE, change it to antijoin. (In the case of antijoin, the subquery result is always empty, and the outer query condition is always met.) If this is the case, change the condition to Always True.
- If the subquery is always FALSE and not antijoin, replace the condition in the outer query with Always False.
Item_subselect::EXISTS_SUBS does not support aggregation operations.
- The convert_subquery_to_semijoin function can parse SQL statements in the following format.
- IN/=ANY predicate.
If the conditions meet decorrelation, perform decorrelate_condition operation.
Add decorrelated inner table expressions to the SELECT list.
Collect the derived table or join conditions related to the external table of the FROM clause.
Remove the correlation identification UNCACHEABLE_DEPENDENT and update the used table
Add the SELECT_DISTINCT identification in the derived table subquery.
Transform the subquery into a derived table and insert it after FROM of the query block to which it belongs (transform_subquery_to_derived)
Create derived table and its join condition
Traverse WHERE of the parent query block and replace the Item of the subquery with derived table (replace_subcondition)
For subqueries that adopt Subquery_strategy::CANDIDATE_FOR_SEMIJOIN strategy, traverse the list of sorted subqueries
Determine whether it can be transformed to semijoin.
Traverse the sorted list of subqueries. For subqueries adopting Subquery_strategy::SEMIJOIN, transform to semijoin/antijoin (convert_subquery_to_semijoin):
- The convert_subquery_to_semijoin function can parse SQL statements in the following format:
- The IN/=ANY predicate:

  SELECT ...
  FROM ot1 ... otN
  WHERE (oe1, ... oeM) IN (SELECT ie1, ..., ieM
                           FROM it1 ... itK
                          [WHERE inner-cond])
   [AND outer-cond]
  [GROUP BY ...] [HAVING ...] [ORDER BY ...]
=>
  SELECT ...
  FROM (ot1 ... otN) SJ (it1 ... itK)
                     ON (oe1, ... oeM) = (ie1, ..., ieM)
                        [AND inner-cond]
  [WHERE outer-cond]
  [GROUP BY ...] [HAVING ...] [ORDER BY ...]

The EXISTS predicate:

  SELECT ...
  FROM ot1 ... otN
  WHERE EXISTS (SELECT expressions
                FROM it1 ... itK
                [WHERE inner-cond])
   [AND outer-cond]
  [GROUP BY ...] [HAVING ...] [ORDER BY ...]
=>
  SELECT ...
  FROM (ot1 ... otN) SJ (it1 ... itK)
                     [ON inner-cond]
  [WHERE outer-cond]
  [GROUP BY ...] [HAVING ...] [ORDER BY ...]

The NOT EXISTS predicate:

  SELECT ...
  FROM ot1 ... otN
  WHERE NOT EXISTS (SELECT expressions
                FROM it1 ... itK
                [WHERE inner-cond])
   [AND outer-cond]
  [GROUP BY ...] [HAVING ...] [ORDER BY ...]
=>
  SELECT ...
  FROM (ot1 ... otN) AJ (it1 ... itK)
                     [ON inner-cond]
  [WHERE outer-cond AND is-null-cond(it1)]
  [GROUP BY ...] [HAVING ...] [ORDER BY ...]

The NOT IN predicate:

  SELECT ...
  FROM ot1 ... otN
  WHERE (oe1, ... oeM) NOT IN (SELECT ie1, ..., ieM
                               FROM it1 ... itK
                               [WHERE inner-cond])
   [AND outer-cond]
  [GROUP BY ...] [HAVING ...] [ORDER BY ...]
=>
  SELECT ...
  FROM (ot1 ... otN) AJ (it1 ... itK)
                     ON (oe1, ... oeM) = (ie1, ..., ieM)
                        [AND inner-cond]
  [WHERE outer-cond]
  [GROUP BY ...] [HAVING ...] [ORDER BY ...]

Find locations where semi-join nesting and the conditions it generates can be inserted. For example, for t1 LEFT JOIN t2, embedding_join_nest, the position is t2. t2 can also be nested, such as t1 LEFT JOIN (t2 JOIN t3)).
Generate a new semijoin-nested TABLE_LIST table
Handle Antijoin
Merge the underlying tables in the subquery into the preceding join table (TABLE_LIST::merge_underlying_tables)
Insert the leaf table of the subquery after the leaf table of the current query block and reset the sequence number of the subquery leaf table and the external table it depends on. Then, reset the leaf table of the subquery.
For outer join, pass nullability in the join linked list (propagate_nullability)
Decorrelate the correlation conditions in the inner subquery. These conditions are added to the semijoin list. These conditions must be certain. Only simple judgment conditions or AND conditions consisting of simple judgment conditions are supported (Query_block::decorrelate_condition):
Determine whether the left and right conditions only depend on the inner and outer tables, and add their expressions to the expression list of inner and outer tables of semijoin, respectively (decorrelate_equality).
Decorrelate the join condition of the inner query (Query_block::decorrelate_condition)
Remove AST of subquery expression from the parent query (Query_express::exclude_level)
Update the corresponding table bitmap based on the WHERE/JOIN condition generated by semi-join nesting (Query_block::fix_tables_after_pullout)
Pull up the WHERE condition of the subquery and update the information of the table used (Item_cond_and::fix_after_pullout())
Create AND condition according to the condition list of semijoin. If there is a condition with a constant that is True, the condition is removed. If the constant is False, all conditions are removed (Query_block::build_sj_cond).
Add the created semijoin condition to the WHERE condition of the outer query
Finally, traverse the sorted subquery list. For subqueries without transformation and those adopting Subquery_strategy::UNSPECIFIED strategy, perform IN->EXISTS transformation (select_transformer). If the original subquery already has alternative items, call replace_subcondition for resolution and add them to the appropriate WHERE or ON clause.
Clear all sj_candidates lists
There are five execution methods in Semi-join. This article does not introduce the optimizer and execution processes. Please refer to the introduction of semijoin for details. Now, let's briefly introduce the optimizer switch that controls the optimization and execution of semijoin. Here, semijoin=on/off is the main switch:

SELECT @@optimizer_switch\G
*************************** 1. row ***************************
@@optimizer_switch: ......
                    materialization=on,semijoin=on,loosescan=on,
                    firstmatch=on,
                    subquery_materialization_cost_based=on,
                    ......

The following example illustrates the transformation process:

SELECT * FROM t1 WHERE t1.a in (SELECT t2.c1 FROM t2 where t2.c1 > 0);
=>
/* select#1 */
SELECT `t1`.`a` AS `a`
FROM `t1`
SEMI JOIN (`t2`)
WHERE ((`t1`.`a` = `t2`.`c1`) and (`t2`.`c1` > 0))
The execution plan is as follows:
explain SELECT * FROM t1 WHERE t1.a in (SELECT t2.c1 FROM t2 where t2.c1 > 0);
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-----------------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra                                                     |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-----------------------------------------------------------+
|  1 | SIMPLE      | t2    | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    1 |   100.00 | Using where; Start temporary                              |
|  1 | SIMPLE      | t1    | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    2 |    50.00 | Using where; End temporary; Using join buffer (hash join) |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-----------------------------------------------------------+

2.4 Apply the Current Query Block for Transformation (apply_local_transforms)

This function is called in a bottom-up way after performing the flatten subqueries. It mainly contains the following steps:

Delete Unused Columns (delete_unused_merged_columns)

If some derived tables/views have been deleted from the query block, traverse the columns in the SELECT list and delete unnecessary columns.

Simplify JOIN (simplify_joins)

This function simplifies the nested join of top_join_list in Query_block to a flat join list. Nested connections include forms, such as table1 join table2 and table1 (table2, table3). The following figure shows the simplifying process:

Static Pruning of Partitioned Table (prune_partitions)

Pruning is different according to HASH/RANGE/LIST and list partition. Here we briefly introduce the pruning process. The existing prune_partitions is called in prepare and optimize phases after some constant subqueries are evaluated and executed.

struct TABLE {
   ......    
   partition_info *part_info{nullptr}; /* Partition related information */
   /* If true, all partitions have been pruned away */
   bool all_partitions_pruned_away{false};
   ......
}
    
SQL tranformation phase
SELECT_LEX::apply_local_transforms
--> prune_partitions

for example, select * from employee where company_id = 1000 ;

SQL optimizer phase
JOIN::prune_table_partitions
--> prune_partitions 
------>  based on tbl->join_cond_optim() or JOIN::where_cond

for example, explain select * from employee where company_id = (select c1 from t1);

The following example shows the RANGE pruning process:

root:ref> CREATE TABLE R2  (
    ->     a INT,
    ->     d INT
    -> ) PARTITION BY RANGE(a) (
    ->         PARTITION p20 VALUES LESS THAN (20),
    ->         PARTITION p40 VALUES LESS THAN (40),
    ->         PARTITION p60 VALUES LESS THAN (60),
    ->         PARTITION p80 VALUES LESS THAN (80),
    ->         PARTITION p100 VALUES LESS THAN MAXVALUE
    -> );
Query OK, 0 rows affected (0.09 sec)

root:ref> Select * From R2 where a > 40 and a < 80;

The detailed process of pruning is listed below:

Pruning requires the intersection of pruning results generated according to different conditions. Bitmap, such as read_partitions, is required to save judgment of whether to use the corresponding partition during pruning. In addition, the pruning process is similar to iterative judgment. Therefore, part_iterator is introduced to save the states of start, end, and current. It also saves the endpoint function that obtains the interval range and the iterator function that obtains the next value. Pointers are cleverly used here to be compatible with different partition types, such as Hash/Range/List, as shown in the following figure:

Obtain the SEL_TREE red-black tree of join_cond or m_where_cond (get_mm_tree).
Call find_used_partitions to obtain eligible partitions. For each interval of SEL_TREE:
- 1. Obtain the left and right endpoints of the interval
- 1. Continue to obtain the next eligible partition from the left until the end of the right endpoint. Intersection operation of bitmap needs to be performed after each call.
Set read_partitions and lock_partitions.
find_used_partitions performs recursion according to the structure of SEL_TREE, as shown in the figure. It traverses the next_key_part (and condition) from left to right and then traverses the left and right of SEL_TREE (the up and down directions or condition) to do in-depth recursion.

    (start)
     |                           $
     |   Partitioning keyparts   $  subpartitioning keyparts
     |                           $
     |     ...          ...      $
     |      |            |       $
     | +---------+  +---------+  $  +-----------+  +-----------+
     \-| par1=c1 |--| par2=c2 |-----| subpar1=c3|--| subpar2=c5|
       +---------+  +---------+  $  +-----------+  +-----------+
            |                    $        |             |
            |                    $        |        +-----------+
            |                    $        |        | subpar2=c6|
            |                    $        |        +-----------+
            |                    $        |
            |                    $  +-----------+  +-----------+
            |                    $  | subpar1=c4|--| subpar2=c8|
            |                    $  +-----------+  +-----------+
            |                    $
            |                    $
       +---------+               $  +------------+  +------------+
       | par1=c2 |------------------| subpar1=c10|--| subpar2=c12|
       +---------+               $  +------------+  +------------+
            |                    $
           ...                   $

For example, the traversed stack in the first row (par1=c1 and par2=c2 and subpar1=c3 and subpar2=c5) will be:
in find_used_partitions(key_tree = "subpar2=c5") (***)
in find_used_partitions(key_tree = "subpar1=c3")
in find_used_partitions(key_tree = "par2=c2")   (**)
in find_used_partitions(key_tree = "par1=c1")
in prune_partitions(...)
The same for the following conditions.
or（par1=c1 and par2=c2 and subpar1=c3 and subpar2=c6）
or（par1=c1 and par2=c2 and subpar1=c4 and subpar2=c8）
or（par1=c2 and subpar1=c10 and subpar2=c12）

The following figure shows the structure and process of pruning:

2.5 Push Conditions to Derived Table (push_conditions_to_derived_tables)

This function pushes the conditions down to derived tables. Please see WL#8084 - Condition pushdown to materialized derived table for more information.

root:test> set optimizer _switch = 'derived_ merge=off'; // Disable the dervied_merge to test the push-down capability
Query OK, 0 rows affected (0.00 sec)

root:test> EXPLAIN FORMAT=tree SELECT * FROM (SELECT c1,c2 FROM t1) as dt     WHERE c1 > 10;
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                                                                                             |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| -> Table scan on dt  (cost=2.51..2.51 rows=1)
    -> Materialize  (cost=2.96..2.96 rows=1)
        -> Filter: (t1.c1 > 10)  (cost=0.35 rows=1)
            -> Table scan on t1  (cost=0.35 rows=1)
|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

The process is listed below:

Traverse the derived table list to determine whether it can be pushed down (can_push_condition_to_derived). If the following situations are included, it cannot be pushed down:
- Derived table has UNION
- Derived table has LIMIT
- Derived table that is an inner table in an outer join, which results in more NULL compensated rows
- Derived table contained in CTE
Create a where cond of a derived table that can be pushed down (Condition_pushdown::make_cond_for_derived).
Retain the remaining conditions that cannot be pushed down (Condition_pushdown::get_remainder_cond)
Top-down recursively calls push_conditions_to_derived_tables.

The detailed illustration of the process is listed below:

3. Summary

The two articles focus on the rule-based optimization part of the optimizer but do not involve cost-based optimization. If the execution is accelerated by the direct application of rule-based optimization, you can directly perform transformation, especially for the change class transformation of the query structure, such as merge_derived. If we cannot determine whether the acceleration of execution is brought about by rule-based optimization, the optimizer will retain some temporary structures to provide more choices for subsequent cost estimation, such as IN/EXIST/Materialized transformation. MySQL currently does not support execution acceleration caused by query structure change together with the uncertainty of rule transformation. This article is detailed, but it cannot cover the whole situation. We hope this article can spur someone to come forward with valuable contributions. If you want to understand more about the specific process of a certain type of SQL, you may need to discover it through debugging.

4. References

Details of the Latest Architecture of MySQL 8.0 Server Layer
WL#13520: Transform correlated scalar subqueries
WL#8084 - Condition pushdown to materialized derived table
WL#2980: Subquery optimization: Semijoin
- WL#3740: Subquery optimization: Semijoin: Pull-out of inner tables
- WL#3741: Subquery optimization: Semijoin: Duplicate elimination strategy
- WL#3750: Subquery optimization: Semijoin: First-match strategy
- WL#3751: Subquery optimization: Semijoin: Inside-out strategy
WL#4389: Subquery optimizations: Make IN optimizations also handle EXISTS
WL#4245: Subquery optimization: Transform NOT EXISTS and NOT IN to anti-join
WL#2985: Perform Partition Pruning of Range conditions
MySQL • Source Code Analysis: Execution Code Analysis for Semi-join Optimizing
MySQL • Source Code Analysis: Source Code Analysis for Subquery Optimizing
Optimizing Subqueries, Derived Tables, View References, and Common Table Expressions

Community

Query Transformation of MySQL 8.0 Optimizer

1. Background and Architecture

Transformation

2. Detailed Transformation Process

2.1 Resolve Subquery (resolve_subquery)

2.2 Transform Scalar Subqueries to Derived Table (transform_scalar_subqueries_to_join_with_derived)

2.3 Flatten Subquery (flatten_subqueries)

2.4 Apply the Current Query Block for Transformation (apply_local_transforms)

Delete Unused Columns (delete_unused_merged_columns)

Simplify JOIN (simplify_joins)

Static Pruning of Partitioned Table (prune_partitions)

2.5 Push Conditions to Derived Table (push_conditions_to_derived_tables)

3. Summary

4. References

Read previous post:

Read next post:

ApsaraDB

You may also like

Comments

ApsaraDB

Related Products

PolarDB for MySQL

Serverless Workflow

AnalyticDB for MySQL

ApsaraDB RDS for SQL Server