By Daoke
In the Query Resolution of MySQL 8.0 Optimizer article, we focused on the resolving, setting, and transformation processes of SQL basic elements in MySQL 8.0.25, including tables, columns, functions, aggregation, grouping, sorting, etc.. This article introduces more complex transformation process of subqueries, partitioned tables, and JOIN operations. The outline is listed below:
Permanently remove redundant parts from the query if:
1) This is a subquery
2) Not normalizing a view.
Removal should take place when a query involving a view is optimized, not when the view is created.
Remove SELECT_DISTINCT options from a query block if it can skip distinct.
resolve_subquery: Resolve predicate involving subquery and perform early unconditional subquery transformations using one of the options below:
Transform eligible scalar subqueries to derived tables
Convert semi-join subquery predicates into semi-join join nests. Convert candidate subquery predicates into semi-join join nests. This transformation is performed once in query lifetime and is irreversible.
apply_local_transforms:
Pushing conditions down to derived tables must be done after validity checks of grouped queries are done by apply_local_transforms().
Eliminate unused objects like window definitions and redundant sorts.
Statements with subqueries in resolution conditions require some early-stage unlimited subquery transformations, including:
Mark whether subquery becomes semi-join.
Mark whether subquery adopts the materialization scheme during execution.
If the two strategies cannot be used above, select transformer according to type:
select * from t1 where a = (select 1);
=>
select * from t1 where a = 1;
oe $cmp$ (SELECT ie FROM ... WHERE subq_where ... HAVING subq_having)
=>
- oe $cmp$ (SELECT MAX(...) ) // Handled by Item_singlerow_subselect
- oe $cmp$ \<max\>(SELECT ...) // Handled by Item_maxmin_subselect
fails=>Item_in_optimizer
- The materialization scheme is not transformed.
- Transform IN to EXISTS via equi-join.
SELECT * FROM t1 WHERE a < ANY (SELECT a FROM t1);
=>
SELECT * FROM t1 WHERE a < (SELECT MAX(a) FROM t1)
If the conditions above are not met, call single_value_in_to_exists_transformer to transform IN to EXISTS:
SELECT ... FROM t1 WHERE t1.b IN (SELECT <expr of SUM(t1.a)> FROM t2)
=>
SELECT ... FROM t1 WHERE t1.b IN (SELECT <expr of SUM(t1.a)> FROM t2
[trigcond] HAVING t1.b=ref-to-<expr of SUM(t1.a)>)
Subqueries that do not need to distinguish NULL and FALSE results:
SELECT 1 FROM ... WHERE (oe $cmp$ ie) AND subq_where
Subqueries that need to distinguish:
SELECT 1 FROM ...
WHERE subq_where AND trigcond((oe $cmp$ ie) OR (ie IS NULL))
HAVING trigcond(@<is_not_null_test@>(ie))
for (each left operand)
create the equi-join condition
if (is_having_used || !abort_on_null)
create the "is null" and is_not_null_test items
if (is_having_used)
add the equi-join and the null tests to HAVING
else
add the equi-join and the "is null" to WHERE
add the is_not_null_test to HAVING
(l1, l2, l3) IN (SELECT v1, v2, v3 ... WHERE where) =>
EXISTS (SELECT ... WHERE where and
(l1 = v1 or is null v1) and
(l2 = v2 or is null v2) and
(l3 = v3 or is null v3)
[ HAVING is_not_null_test(v1) and
is_not_null_test(v2) and
is_not_null_test(v3)) ] <- Ensure that it is not NULL. Thus, HAVING can be removed.
(l1, l2, l3) IN (SELECT v1, v2, v3 ... HAVING having) =>
EXISTS (SELECT ... HAVING having and
(l1 = v1 or is null v1) and
(l2 = v2 or is null v2) and
(l3 = v3 or is null v3) and
is_not_null_test(v1) and
is_not_null_test(v2) and
is_not_null_test(v3))
This feature was officially added in version 8.0.16 to support the downward analysis of Secondary Engine (Heapwave) and enhance the transformation capability of subqueries. Let's look at the difference between the execution plan that involves transformation and the one that does not:
root:test> set optimizer_switch = 'subquery_to_derived=off';
Query OK, 0 rows affected (0.00 sec)
root:test> EXPLAIN SELECT b, MAX(a) AS ma FROM t4 GROUP BY b HAVING ma < (SELECT MAX(t2.a) FROM t2 WHERE t2.b=t4.b);
+----+--------------------+-------+------------+------+---------------+------+---------+------+------+----------+-----------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------------+-------+------------+------+---------------+------+---------+------+------+----------+-----------------+
| 1 | PRIMARY | t4 | NULL | ALL | NULL | NULL | NULL | NULL | 10 | 100.00 | Using temporary |
| 2 | DEPENDENT SUBQUERY | t2 | NULL | ALL | NULL | NULL | NULL | NULL | 3 | 33.33 | Using where |
+----+--------------------+-------+------------+------+---------------+------+---------+------+------+----------+-----------------+
2 rows in set, 3 warnings (0.00 sec)
root:test> set optimizer_switch = 'subquery_to_derived=on';
Query OK, 0 rows affected (0.00 sec)
root:test> EXPLAIN SELECT b, MAX(a) AS ma FROM t4 GROUP BY b HAVING ma < (SELECT MAX(t2.a) FROM t2 WHERE t2.b=t4.b);
+----+-------------+------------+------------+------+---------------+------+---------+------+------+----------+--------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------+------------+------+---------------+------+---------+------+------+----------+--------------------------------------------+
| 1 | PRIMARY | t4 | NULL | ALL | NULL | NULL | NULL | NULL | 10 | 100.00 | Using temporary |
| 1 | PRIMARY | <derived2> | NULL | ALL | NULL | NULL | NULL | NULL | 3 | 100.00 | Using where; Using join buffer (hash join) |
| 2 | DERIVED | t2 | NULL | ALL | NULL | NULL | NULL | NULL | 3 | 100.00 | Using temporary |
+----+-------------+------------+------------+------+---------------+------+---------+------+------+----------+--------------------------------------------+
3 rows in set, 3 warnings (0.01 sec)
The detailed transformation process of transform_scalar_subqueries_to_join_with_derived is listed below:
SELECT SUM(c1), (SELECT SUM(c1) FROM t3) scalar FROM t1;
Transform to
SELECT derived0.summ, derived1.scalar
FROM (SELECT SUM(a) AS summ FROM t1) AS derived0
LEFT JOIN
(SELECT SUM(b) AS scalar FROM t3) AS derived1
ON TRUE
The execution plan is as follows:
explain SELECT SUM(a), (SELECT SUM(c1) FROM t3) scalar FROM t1;
+----+-------------+------------+------------+------+---------------+------+---------+------+------+----------+--------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------+------------+------+---------------+------+---------+------+------+----------+--------------------------------------------+
| 1 | PRIMARY | <derived3> | NULL | ALL | NULL | NULL | NULL | NULL | 1 | 100.00 | NULL |
| 1 | PRIMARY | <derived4> | NULL | ALL | NULL | NULL | NULL | NULL | 1 | 100.00 | Using where; Using join buffer (hash join) |
| 4 | DERIVED | t3 | NULL | ALL | NULL | NULL | NULL | NULL | 1 | 100.00 | NULL |
| 3 | DERIVED | t1 | NULL | ALL | NULL | NULL | NULL | NULL | 2 | 100.00 | NULL |
+----+-------------+------------+------------+------+---------------+------+---------+------+------+----------+--------------------------------------------+
Collect the only Item list of aggregate functions (collect_aggregates). These items will be replaced by the columns of the new derived table.
SELECT SUM(a), (SELECT SUM(b) FROM t3) scalar
FROM t1
HAVING SUM(a) > scalar;
Transform to
SELECT derived0.summ, derived1.scalar
FROM (SELECT SUM(a) AS summ FROM t1) AS derived0
LEFT JOIN
(SELECT SUM(b) AS scalar FROM t3) AS derived1
ON TRUE
WHERE derived0.sum > derived1.scalar;
Next, traverse all subqueries that can be transformed and transform them into derived tables. Replace the corresponding expressions with columns (transform_subquery_to_derived):
The transformation process and results of this function are illustrated below:
This function mainly transforms Semi-join subqueries to nested JOIN. This process is only completed once and is irreversible.
Simply put, the steps can be simplified and understood like this:
Since the number of tables that MySQL can join in a query block is limited (MAX_TABLES), not all sj_candidates can perform the flatten_subqueries operation. So, the order of priority determines which needs to be unnested first. The priority rules are listed below:
subq_item->sj_convert_priority =
(((dependent * MAX_TABLES_FOR_SIZE) + // Dependent subqueries first
child_query_block->leaf_table_count) *
65536) + // Then, those with many tables
(65536 - subq_no); // Then, choose based on position
for SELECT#1 WHERE X IN (SELECT #2 WHERE Y IN (SELECT#3)) :
Query_block::prepare() (select#1)
-> fix_fields() on IN condition
-> Query_block::prepare() on subquery (select#2)
-> fix_fields() on IN condition
-> Query_block::prepare() on subquery (select#3)
<- Query_block::prepare()
<- fix_fields()
-> flatten_subqueries: merge #3 in #2
<- flatten_subqueries
<- Query_block::prepare()
<- fix_fields()
-> flatten_subqueries: merge #2 in #1
FROM [tables] WHERE ... AND/OR oe IN (SELECT ie FROM it) ...
=>
FROM (tables) LEFT JOIN (SELECT DISTINCT ie FROM it) AS derived
ON oe = derived.ie WHERE ... AND/OR derived.ie IS NOT NULL ...
Set the strategy to Subquery_strategy::DERIVED_TABLE:
Item_subselect::EXISTS_SUBS does not support aggregation operations.
Traverse the sorted list of subqueries. For subqueries adopting Subquery_strategy::SEMIJOIN, transform to semijoin/antijoin (convert_subquery_to_semijoin):
SELECT ...
FROM ot1 ... otN
WHERE (oe1, ... oeM) IN (SELECT ie1, ..., ieM
FROM it1 ... itK
[WHERE inner-cond])
[AND outer-cond]
[GROUP BY ...] [HAVING ...] [ORDER BY ...]
=>
SELECT ...
FROM (ot1 ... otN) SJ (it1 ... itK)
ON (oe1, ... oeM) = (ie1, ..., ieM)
[AND inner-cond]
[WHERE outer-cond]
[GROUP BY ...] [HAVING ...] [ORDER BY ...]
SELECT ...
FROM ot1 ... otN
WHERE EXISTS (SELECT expressions
FROM it1 ... itK
[WHERE inner-cond])
[AND outer-cond]
[GROUP BY ...] [HAVING ...] [ORDER BY ...]
=>
SELECT ...
FROM (ot1 ... otN) SJ (it1 ... itK)
[ON inner-cond]
[WHERE outer-cond]
[GROUP BY ...] [HAVING ...] [ORDER BY ...]
SELECT ...
FROM ot1 ... otN
WHERE NOT EXISTS (SELECT expressions
FROM it1 ... itK
[WHERE inner-cond])
[AND outer-cond]
[GROUP BY ...] [HAVING ...] [ORDER BY ...]
=>
SELECT ...
FROM (ot1 ... otN) AJ (it1 ... itK)
[ON inner-cond]
[WHERE outer-cond AND is-null-cond(it1)]
[GROUP BY ...] [HAVING ...] [ORDER BY ...]
SELECT ...
FROM ot1 ... otN
WHERE (oe1, ... oeM) NOT IN (SELECT ie1, ..., ieM
FROM it1 ... itK
[WHERE inner-cond])
[AND outer-cond]
[GROUP BY ...] [HAVING ...] [ORDER BY ...]
=>
SELECT ...
FROM (ot1 ... otN) AJ (it1 ... itK)
ON (oe1, ... oeM) = (ie1, ..., ieM)
[AND inner-cond]
[WHERE outer-cond]
[GROUP BY ...] [HAVING ...] [ORDER BY ...]
SELECT @@optimizer_switch\G
*************************** 1. row ***************************
@@optimizer_switch: ......
materialization=on,semijoin=on,loosescan=on,
firstmatch=on,
subquery_materialization_cost_based=on,
......
SELECT * FROM t1 WHERE t1.a in (SELECT t2.c1 FROM t2 where t2.c1 > 0);
=>
/* select#1 */
SELECT `t1`.`a` AS `a`
FROM `t1`
SEMI JOIN (`t2`)
WHERE ((`t1`.`a` = `t2`.`c1`) and (`t2`.`c1` > 0))
The execution plan is as follows:
explain SELECT * FROM t1 WHERE t1.a in (SELECT t2.c1 FROM t2 where t2.c1 > 0);
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-----------------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-----------------------------------------------------------+
| 1 | SIMPLE | t2 | NULL | ALL | NULL | NULL | NULL | NULL | 1 | 100.00 | Using where; Start temporary |
| 1 | SIMPLE | t1 | NULL | ALL | NULL | NULL | NULL | NULL | 2 | 50.00 | Using where; End temporary; Using join buffer (hash join) |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-----------------------------------------------------------+
This function is called in a bottom-up way after performing the flatten subqueries. It mainly contains the following steps:
If some derived tables/views have been deleted from the query block, traverse the columns in the SELECT list and delete unnecessary columns.
This function simplifies the nested join of top_join_list in Query_block to a flat join list. Nested connections include forms, such as table1 join table2 and table1 (table2, table3). The following figure shows the simplifying process:
Pruning is different according to HASH/RANGE/LIST and list partition. Here we briefly introduce the pruning process. The existing prune_partitions is called in prepare and optimize phases after some constant subqueries are evaluated and executed.
struct TABLE {
......
partition_info *part_info{nullptr}; /* Partition related information */
/* If true, all partitions have been pruned away */
bool all_partitions_pruned_away{false};
......
}
SQL tranformation phase
SELECT_LEX::apply_local_transforms
--> prune_partitions
for example, select * from employee where company_id = 1000 ;
SQL optimizer phase
JOIN::prune_table_partitions
--> prune_partitions
------> based on tbl->join_cond_optim() or JOIN::where_cond
for example, explain select * from employee where company_id = (select c1 from t1);
root:ref> CREATE TABLE R2 (
-> a INT,
-> d INT
-> ) PARTITION BY RANGE(a) (
-> PARTITION p20 VALUES LESS THAN (20),
-> PARTITION p40 VALUES LESS THAN (40),
-> PARTITION p60 VALUES LESS THAN (60),
-> PARTITION p80 VALUES LESS THAN (80),
-> PARTITION p100 VALUES LESS THAN MAXVALUE
-> );
Query OK, 0 rows affected (0.09 sec)
root:ref> Select * From R2 where a > 40 and a < 80;
The detailed process of pruning is listed below:
Call find_used_partitions to obtain eligible partitions. For each interval of SEL_TREE:
(start)
| $
| Partitioning keyparts $ subpartitioning keyparts
| $
| ... ... $
| | | $
| +---------+ +---------+ $ +-----------+ +-----------+
\-| par1=c1 |--| par2=c2 |-----| subpar1=c3|--| subpar2=c5|
+---------+ +---------+ $ +-----------+ +-----------+
| $ | |
| $ | +-----------+
| $ | | subpar2=c6|
| $ | +-----------+
| $ |
| $ +-----------+ +-----------+
| $ | subpar1=c4|--| subpar2=c8|
| $ +-----------+ +-----------+
| $
| $
+---------+ $ +------------+ +------------+
| par1=c2 |------------------| subpar1=c10|--| subpar2=c12|
+---------+ $ +------------+ +------------+
| $
... $
For example, the traversed stack in the first row (par1=c1 and par2=c2 and subpar1=c3 and subpar2=c5) will be:
in find_used_partitions(key_tree = "subpar2=c5") (***)
in find_used_partitions(key_tree = "subpar1=c3")
in find_used_partitions(key_tree = "par2=c2") (**)
in find_used_partitions(key_tree = "par1=c1")
in prune_partitions(...)
The same for the following conditions.
or(par1=c1 and par2=c2 and subpar1=c3 and subpar2=c6)
or(par1=c1 and par2=c2 and subpar1=c4 and subpar2=c8)
or(par1=c2 and subpar1=c10 and subpar2=c12)
This function pushes the conditions down to derived tables. Please see WL#8084 - Condition pushdown to materialized derived table for more information.
root:test> set optimizer _switch = 'derived_ merge=off'; // Disable the dervied_merge to test the push-down capability
Query OK, 0 rows affected (0.00 sec)
root:test> EXPLAIN FORMAT=tree SELECT * FROM (SELECT c1,c2 FROM t1) as dt WHERE c1 > 10;
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| -> Table scan on dt (cost=2.51..2.51 rows=1)
-> Materialize (cost=2.96..2.96 rows=1)
-> Filter: (t1.c1 > 10) (cost=0.35 rows=1)
-> Table scan on t1 (cost=0.35 rows=1)
|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
The process is listed below:
Traverse the derived table list to determine whether it can be pushed down (can_push_condition_to_derived). If the following situations are included, it cannot be pushed down:
The detailed illustration of the process is listed below:
The two articles focus on the rule-based optimization part of the optimizer but do not involve cost-based optimization. If the execution is accelerated by the direct application of rule-based optimization, you can directly perform transformation, especially for the change class transformation of the query structure, such as merge_derived. If we cannot determine whether the acceleration of execution is brought about by rule-based optimization, the optimizer will retain some temporary structures to provide more choices for subsequent cost estimation, such as IN/EXIST/Materialized transformation. MySQL currently does not support execution acceleration caused by query structure change together with the uncertainty of rule transformation. This article is detailed, but it cannot cover the whole situation. We hope this article can spur someone to come forward with valuable contributions. If you want to understand more about the specific process of a certain type of SQL, you may need to discover it through debugging.
WL#2980: Subquery optimization: Semijoin
ApsaraDB - May 18, 2022
ApsaraDB - March 10, 2022
ApsaraDB - October 19, 2023
Morningking - September 26, 2023
ApsaraDB - October 20, 2023
ApsaraDB - January 22, 2021
Alibaba Cloud PolarDB for MySQL is a cloud-native relational database service 100% compatible with MySQL.
Learn MoreVisualization, O&M-free orchestration, and Coordination of Stateful Application Scenarios
Learn MoreAnalyticDB for MySQL is a real-time data warehousing service that can process petabytes of data with high concurrency and low latency.
Learn MoreAn on-demand database hosting service for SQL Server with automated monitoring, backup and disaster recovery capabilities
Learn MoreMore Posts by ApsaraDB