Query Resolution of MySQL 8.0 Optimizer

This article introduces how MySQL 8.0 optimizer turns a simple string into an execution sequence the database executor can return the data to the client.

By Daoke

1. Background and Architecture

As we all know, we write programs to dynamically implement the logic required by our applications and run the programs to get the results we need. Database is an application that quickly obtains data by entering SQL strings. If there is no such system application like databases, how can we obtain data using programs? We may find that we still need to constantly modify programs to process different requests for data from different applications, regardless of data storage mode and concurrent data access. For example, we usually obtain data in the big data field through APIs of non-relational databases. This method is simple to get started, but it is extremely difficult to maintain and not so generic. Even if software architecture design or abstract refactoring is continuously carried out, it is still necessary to continuously transform applications. This is why non-relational databases embrace the SQL optimizer.

The SQL optimizer is essentially an implementation of a highly abstract data interface. Clients can use a more generic and easy-to-understand SQL language to operate and process data. They do not have to pay attention to and abstract their data interfaces, which facilitates their applications substantially.

This article introduces how MySQL 8.0 optimizer turns a simple string (SQL) into an execution sequence the database executor can understand and return the data to the client. The powerful optimizer allows clients not to pay attention to SQL optimization to obtain the required data faster, so the optimizer will make some equivalent changes to the original SQL. In the article entitled Details of the Latest Architecture of MySQL 8.0 Server Layer, we focused on the general introduction to the parser, optimizer, and executor at the Server layer, including the detailed display of some code structures and changes of the latest MySQL. We also demonstrated how MySQL optimizer simplifies nested Join optimization in logical transformation through the simple_joins function. This article explains the magic optimizer details step by step and how each step during optimization changes the final execution of SQL.

This article is based on the latest version MySQL 8.0.25. We will introduce the optimizer conversion part in two articles. The first one introduces the resolution and conversion processes of Setup and Resolve based on the basic structure. The second one introduces the more complex conversion process of complex subqueries, partitioned tables, and connections. The outline of this article is listed below:

Setup and Resolve

setup_tables: Set up table leaves in the query block based on the list of tables
resolve_placeholder_tables/merge_derived/setup_table_function/setup_materialized_derived: Resolve derived table, view, or table function references in query block
setup_natural_join_row_types: Compute and store the row types of the top-most NATURAL/USING joins
setup_wild: Expand all '*' in the list of expressions with the matching column references
setup_base_ref_items: Set query_block's base_ref_items
setup_fields: Check that all given fields exist and fill struct with current data
setup_conds: Resolve WHERE condition and join conditions
setup_group: Resolve and set up the GROUP BY list
m_having_cond->fix_fields: Set up the HAVING clause
resolve_rollup: Resolve items in SELECT list and ORDER BY list for rollup processing
resolve_rollup_item: Resolve an item (and its tree) for rollup processing by replacing items matching grouped expressions with Item_rollup_group_items and updating properties (m_nullable, PROP_ROLLUP_FIELD). Check any GROUPING function for incorrect columns
setup_order: Set up the ORDER BY clause
resolve_limits: Resolve OFFSET and LIMIT clauses
Window::setup_windows1: Set up windows after setup_order() and before setup_order_final()
setup_order_final: Do final setup of ORDER BY clause, after the query block is fully resolved
setup_ftfuncs: Set up full-text functions after resolving HAVING
resolve_rollup_wfs: Replace group by field references inside window functions with references in the presence of ROLLUP

2. Detailed Conversion Process

The whole framework of conversion is a top-down or bottom-up process of Query_expression calling prepare function (sql/sql_resolver.cc) in Query_block according to the requirements of different conversion rules.

2.1 Pass the List of Inner Tables from null to join (propagate_nullability)

The prepare function first processes nullable table, which refers to the table that may contain null rows. Null rows can be propagated according to the JOIN relationship (top_join_list). If a table is determined to be nullable, some optimizations will degenerate. For example, the access method cannot be EQ_REF, the outer join cannot be optimized to inner join, etc.

2.2 Function for Resolving and Setting leave_tables of Query Block (setup_tables)

SELECT
  t1.c1
FROM t1,
     (SELECT
       t2.c1
     FROM t2,
          (SELECT
            t3.c1
          FROM t3
          UNION
          SELECT
            t4.c1
          FROM t4) AS t3a) AS t2a;

The value of leaf_tables for each Query_block is 0 before being called by setup_table.

This function builds leaf_tables for subsequent optimization, including the lists of base tables and derived tables. setup_tables does not call recursively but only resolves the tables and counts the number of derived tables at this layer. Then, resolve_placeholder_tables()->resolve_derived()->derived(Query_expression)::prepare->Query_block::prepare will be called to recursively process Query_expression corresponding to the derived table.

Next, according to the call order of prepare, let's continue to look at the resolve_placeholder_tables function for derived table processing.

2.3 Function for Resolving Derived Table, View, and Table of Query Blocks (resolve_placeholder_tables)

This function is used to process derived table, view, and table functions. If a table has been merged or called due to the use of transform_grouped_to_derived(), and it has used the materialized table method, it is ignored.

The role of resolve_derived()was discussed earlier, so we will focus on the merge_derived() function. merge_derived changes the Query_expression/Query_block framework structure and merge derived table or view into the query block.

2.3.1 merge_derived for Processing and Merging Derived Tables

1) Prerequisites for merge_derived transformation

Whether the query block in the outer layer allows merging operation (allow_merge_derived):
- The outer query block is nullptr.
- The subquery of the outer query expression is nullptr, and the derived table is the first-layer subquery.
- Outer query block of the outer layer can ensure allow_merge_derived=true, or the outer query block of the outer layer (if not included) is SELECT/SET.
Whether the outer lex support merge operation (lex->can_use_merged()+lex->can_no_use_merged())
Whether the derived table has been marked as requiring materialization. For example, the method to create a view is CREATE ALGORITHM=TEMPTABLE VIEW (derived_table->algorithm == VIEW_ALGORITHM_TEMPTABLE).
The query expression unit in which the entire derived table resides cannot (Query_expression::is_mergeable() ):
- Be union query
- Contain aggregation, HAVING, DISTINCT, WINDOWS, or LIMIT
- Exclude any table list
HINT or optimizer_switch does not prohibit derived_merge.
heuristic suggests merging (derived_query_expressionmerge_heuristic()):
- If the subquery SELECT list contained in the derived table depends on its own columns, it is not supported.
- If the dependent subquery needs to be executed multiple times, it is not supported.
If the query block in derived table contains SEMI/ANTI-JOIN and specifies STRAIGHT_JOIN, it is not supported.
If the leaf table count of the merged derived table and the existing query block is greater than MAX_TABLES, it is not supported.

2) Conversion process of merge_derived transformation

• Use the derived_table->nested_join structure to assist in handling OUTER JOIN situations.

• Merge the table in the derived table to the NESTED_JOIN struct (derived_table->merge_underlying_tables()).

• Connect all tables in the derived table to the table_list list of the parent query, and delete the derived table from the parent query.

• Recalculate all relevant data structures of the parent query (leaf_table_count, derived_table_count, table_func_count, materialized_derived_table_count, has_sj_nests, has_aj_nests, partitioned_table_count, cond_count, between_count, select_n_having_items)

• Propagation sets the parent query OPTION_SCHEMA_TABLE (add_base_options()), and if it is the inner table of the outer query JOIN, propagation sets the nullable property (propagate_nullability()).

• Merge the where condition of the derived table to the outer query (merge_where())

• Establish a reference to the column to be obtained by the derived table (create_field_translation())

• Delete the structure of the derived table from the parent query (exclude_level())

• Merge the renaming of columns or tables in derived table into the parent query (fix_tables_after_pullout()/repoint_contexts_of_join_nests())

• The table contained in the derived table has been merged into the parent query, so the position of the table in TABLE_LIST needs to be relocated (remap_tables()).

• After merging the derived table into the parent query, we need to modify all the references in the originally derived table to all columns in the derived table (fix_tables_after_pullout()).

• If the derived table contains an ORDER BY statement and the following conditions are met, the derived table will retain ORDER BY and merge it into the parent query. In other cases, ORDER BY will be ignored.

- The parent query allows sorting and only contains the derived table.
- Not a UNION
- The WHERE condition is allowed, but GROUP BY or aggregate functions are not.
- It is not orderly in itself.

The process is simplified below:

Diagram of the merge_derived Process

It seems that the official derived merge is still not perfect enough to perform bottom-up merging.

opt trace contained:

trace_derived.add_utf8_table(derived_table)
       .add("select#", derived_query_block->select_number)
       .add("merged", true);

trace_derived.add_alnum("transformations_to_derived_table", "removed_ordering");

This optimization can be controlled by set optimizer_switch="derived_merge=on/off".

2.3.2 setup_materialized_derived for Setting Materialized Derived Tables

The remaining derived tables that cannot use the merge algorithm will be processed in materialized mode. However, only some preprocessing operations are done at this time, such as variable settings. The materialized execution is performed in the executor phase.

setup_materialized_derived_tmp_table(): Set a temporary table that contains all row data of the materialized derived table
check_materialized_derived_query_blocks(): Set query block structure that belongs to the current derived table

trace_derived.add_utf8_table(this)
       .add("select#", derived->first_query_block()->select_number)
       .add("materialized", true);

2.3.3 setup_table_function for Processing Tables

If the query block contains a table function, the entire process is completed twice. The table of the table function is skipped for the first time, and the preceding logic is executed for the table of the table function at the second time. The external environment is resolved (relative to the table function) first because it is possible that the function parameters have dependencies on external derived tables.

trace_derived.add_utf8_table(this)
       .add_utf8("function_name", func_name, func_name_len)
       .add("materialized", true);

2.4 Expand the Wildcard of SELECT * into Specific Fields (setup_wild)

2.5 Establish base_ref_items in the Query_block Level (setup_base_ref_items)

base_ref_items records the location of all items so other items of the query block can reference them. Other items can also directly refer to them through Item_ref and its subclasses, such as references to subqueries (Item_view_ref), aggregate functions (Item_aggregate_ref), and external query columns (Item_outer_ref), and reference assistance to subqueries that generate NULL values (Item_ref_null_helper).

We give an example below to help you understand the complex Item_outer_ref:

2.6 Perform fix_fields() and Column Permission Check on select_fields (setup_fields)

The following figure shows a relatively complex fixed field process with subquery. Some fields are associated with tables, and some need to add corresponding Item_xxx_ref references:

2.7 Perform fixed_fields on and Resolve WHERE and Join Conditions (setup_conds)

If there is nested_join, setup_join_cond will be called recursively for resolution and setting. Here, we also introduce the use of the simplify_const_condition function. If const Item that can be deleted is found, Item_func_true/Item_func_false will be used to replace the whole condition, as shown in the figure:

2.8 Resolve and Set the ROLLUP Statement (resolve_rollup)

In the database query statement, the WITH ROLLUP statement can be added after the GROUP BY expression to ensure that a single query statement can analyze and count data at different levels.

SELECT YEAR,
       country,
       product,
       SUM(profit) AS profit
FROM sales
GROUP BY YEAR,
         country,
         product WITH ROLLUP;

+------+---------+------------+--------+
| year | country | product    | profit |
+------+---------+------------+--------+
| 2000 | Finland | Computer   |   1500 |
| 2000 | Finland | Phone      |    100 |
| 2000 | Finland | NULL       |   1600 |
| 2000 | India   | Calculator |    150 |
| 2000 | India   | Computer   |   1200 |
| 2000 | India   | NULL       |   1350 |
| 2000 | USA     | Calculator |     75 |
| 2000 | USA     | Computer   |   1500 |
| 2000 | USA     | NULL       |   1575 |
| 2000 | NULL    | NULL       |   4525 |
| 2001 | Finland | Phone      |     10 |
| 2001 | Finland | NULL       |     10 |
| 2001 | USA     | Calculator |     50 |
| 2001 | USA     | Computer   |   2700 |
| 2001 | USA     | TV         |    250 |
| 2001 | USA     | NULL       |   3000 |
| 2001 | NULL    | NULL       |   3010 |
| NULL | NULL    | NULL       |   7535 |
+------+---------+------------+--------+

This is equivalent to the following query:

SELECT *
FROM
  (SELECT YEAR,
          country,
          product,
          SUM(profit) AS profit
   FROM sales
   GROUP BY YEAR,
            country,
            product
   UNION ALL SELECT YEAR,
                    country,
                    NULL,
                    SUM(profit) AS profit
   FROM sales
   GROUP BY YEAR,
            country
   UNION ALL SELECT YEAR,
                    NULL,
                    NULL,
                    SUM(profit) AS profit
   FROM sales
   GROUP BY YEAR
   UNION ALL SELECT NULL,
                    NULL,
                    NULL,
                    SUM(profit) AS profit
   FROM sales) AS sum_table
ORDER BY YEAR, country, product;

+------+---------+------------+--------+
| YEAR | country | product    | profit |
+------+---------+------------+--------+
| NULL | NULL    | NULL       |   7535 |
| 2000 | NULL    | NULL       |   4525 |
| 2000 | Finland | NULL       |   1600 |
| 2000 | Finland | Computer   |   1500 |
| 2000 | Finland | Phone      |    100 |
| 2000 | India   | NULL       |   1350 |
| 2000 | India   | Calculator |    150 |
| 2000 | India   | Computer   |   1200 |
| 2000 | USA     | NULL       |   1575 |
| 2000 | USA     | Calculator |     75 |
| 2000 | USA     | Computer   |   1500 |
| 2001 | NULL    | NULL       |   3010 |
| 2001 | Finland | NULL       |     10 |
| 2001 | Finland | Phone      |     10 |
| 2001 | USA     | NULL       |   3000 |
| 2001 | USA     | Calculator |     50 |
| 2001 | USA     | Computer   |   2700 |
| 2001 | USA     | TV         |    250 |
+------+---------+------------+--------+

Sorting has a NULL problem, so hierarchical summary is very difficult to complete. The group column and SQL complexity continuously change while ROLLUP can achieve the result easily. Let's look at what kind of conversion ROLLUP has done in the resolution process to achieve an amazing result.

2.9 Resolve and Set the GROUP BY/ORDER BY Statement (setup_group/setup_order)

find_order_in_list(): try to find columns that can be mapped in select fields. Otherwise, we will have to add the current column in all fields projected at the end and perform fix_fields.

m_having_cond->fix_fields: perform fixed_fields on the having condition
resolve_limits: handle OFFSET and LIMIT clauses (items of offset_limit and select_limit)
setup_ftfuncs: If there are full-text functions, perform fix_fields on the related items.

remove_redundant_subquery_clause: The expression of Table Subquery is usually IN/ANY/ALL/EXISTS/etc. If there is no aggregate function or Having clause, unnecessary ORDER/DISTINCT/GROUP BY can be deleted. This function supports REMOVE_ORDER, REMOVE_DISTINCT, and REMOVE_GROUP. If it is a subquery of SINGLEROW_SUBS, consider only deleting REMOVE_ORDER.

select c1 from t1 where t1.c2 in (select distinct c1 from t2 group by c1, c2 order by c1);
Convert to
select c1 from t1 where t1.c2 in (select c1 from t2);

The condition for deciding to delete unnecessary distinct statements is that all columns of GROUP BY are in the SELECT list, and there are no ROLLUP and Window functions.

is_grouped() && hidden_group_field_count == 0 && olap == UNSPECIFIED_OLAP_TYPE

For example, in the following scenario:

SELECT DISTINCT c1, max(c2) from t1 group by c1;

2.10 Resolve and Set the Window Function (Window::setup_windows1)

SELECT id,
       release_year,
       rating,
       avg(rating) over(PARTITION BY release_year) AS year_avg
FROM tw;

+------+--------------+--------+-------------------+
| id   | release_year | rating | year_avg          |
+------+--------------+--------+-------------------+
|    1 |         2015 |      8 |               8.5 |
|    3 |         2015 |      9 |               8.5 |
|    2 |         2015 |    8.5 |               8.5 |
|    4 |         2016 |    8.2 |               8.3 |
|    5 |         2016 |    8.4 |               8.3 |
|    6 |         2017 |      7 |                 7 |
+------+--------------+--------+-------------------+

The execution process and results are similar to the following figure:

Let's take a look at what it did before the Query_block::prepare resolution process:

If select_lex->m_windows is not empty, call Window::setup_windows1.

Traverse the window function list and call resolve_window_ordering to resolve m_partition_by and m_order_by
Handle the reference relationship of inter-window (such as WINDOW w1 AS (w2), w2 AS (), w3 AS (w1)), but it must be a directed acyclic graph (DAG).
Re-traverse to check check_unique_name and referenced items to create window partition by and window order by
Check the window function features (Window::check_window_functions1(THD thd, _block select))
- First, judge whether the current window is static or dynamic. If it is static, determine whether the definition of frame has defined upper and lower boundaries. If m_static_aggregates is true, it is a static window. Thus, each partition can be evaluated once. If ma_static_aggregates is false, it determine whether its sliding window is range-based or row-based. m_row_optimizable means row-based, and m_range_optimizable means range-based.
- When obtaining aggregate functions as window functions, the special specifications of window is wfs->check_wf_semantics1(thd, select, &reqs). This method actually judges whether row_buffer is needed for evaluation. If we only focus on rows in the current partition and correct calculations cannot be conducted, it is not needed. If the rows after or before rows in the current partition are involved, row_buffer is needed.

3. Summary

This article focuses on some of the rule-based optimizations of the optimizer. It emphasizes more on the basic operators in SQL, such as the resolving, setting, and some obvious structural changes of elements like tables, columns, functions, aggregations, grouping, and sorting. The next article will continue to introduce the transformation of subquery, partitioned table, and JOIN operations. Please look forward to it.

4. References

Details of the Latest Architecture of MySQL 8.0 Server Layer
MySQL derived_MySQL • New Feature Analysis • Derived Table Metamorphosis in 5.7
ROLLUP Performance Enhancement
WL#9236, WL#9603 and WL#9727 - Add SQL window functions to MySQL

Community

Query Resolution of MySQL 8.0 Optimizer

1. Background and Architecture

Setup and Resolve

2. Detailed Conversion Process

2.1 Pass the List of Inner Tables from null to join (propagate_nullability)

2.2 Function for Resolving and Setting leave_tables of Query Block (setup_tables)

2.3 Function for Resolving Derived Table, View, and Table of Query Blocks (resolve_placeholder_tables)

2.3.1 merge_derived for Processing and Merging Derived Tables

2.3.2 setup_materialized_derived for Setting Materialized Derived Tables

2.3.3 setup_table_function for Processing Tables

2.4 Expand the Wildcard of SELECT * into Specific Fields (setup_wild)

2.5 Establish base_ref_items in the Query_block Level (setup_base_ref_items)

2.6 Perform fix_fields() and Column Permission Check on select_fields (setup_fields)

2.7 Perform fixed_fields on and Resolve WHERE and Join Conditions (setup_conds)

2.8 Resolve and Set the ROLLUP Statement (resolve_rollup)

2.9 Resolve and Set the GROUP BY/ORDER BY Statement (setup_group/setup_order)

2.10 Resolve and Set the Window Function (Window::setup_windows1)

3. Summary

4. References

Read previous post:

Read next post:

ApsaraDB

You may also like

Comments

ApsaraDB

Related Products

PolarDB for MySQL

Serverless Workflow

AnalyticDB for MySQL

ApsaraDB RDS for SQL Server