OceanBase Rewriting Series: An Overview of the OceanBase Query Rewriting Practices

OceanBase is 100% proprietary. It has supported Double 11 steadily for seven consecutive years and launched a new city-level disaster recovery standard of three places and five centers. It is the only domestic native distributed database that has set world records in both TPC-C and TPC-H tests. It officially opened its source code in June 2021. The query optimizer is the core module of the relational database system, the focus and difficulty of database kernel development, and the measure of the maturity of the entire database system.

We will write a series of articles to help you understand the OceanBase query optimizer better. Each article will help you understand the essence of query rewriting, comprehend the equivalence of complex SQL, and write efficient SQL.

Columnist Introduction

The OceanBase Optimizer Team, led by Xifeng (OceanBase Senior Technical Expert) and Shanwen (Technical Expert), is committed to building the world's leading distributed query optimizer.

Structure

This query rewriting series includes subquery optimization, aggregate function optimization, window function optimization, complex expression optimization, and more modules. Please stay tuned.

1. Introduction

This article is Part 1 of the OceanBase Rewriting Series: OceanBase query rewriting practices. Most users usually use SELECT... FROM... WHERE operations when operating databases. GROUP-BY, ORDER-BY, and LIMIT are added in more complex statements. Features (like window functions) are not used in most cases. When you want to write a query statement, you must use your familiar grammar to write a semantically correct query. Intuitive and correct SQL is already a good SQL for users. Ee can write different SQL statements to complete the same query. For example, some application developers write SQL on the left side of figure (1) to implement "update type = 10 in the T table and order the smallest three rows by k." Semantically, this SQL has no problem. However, from a performance point of view, this SQL is not good enough. If no optimization is performed, the kernel will execute the subquery to calculate the set of id that meet the conditions first. Then, it will execute the outer main query to update the corresponding rows. This is not a good SQL for the kernel. We can use SQL on the right side of figure (1) to accomplish the same function.

Figure 1: Example of Query Rewriting

When the kernel executes this rewritten query, it only needs to access the T table once to find the rows that meet the conditions and update them directly. The service may write inefficient SQL statements since it does not have a comprehensive understanding of the syntax nor much about the way the kernel executes.

If every user that uses the database is a senior expert proficient in SQL writing with a deep understanding of the core processing of the database, ze can write good SQL. However, this is not what the language was originally designed to be. The goal of SQL is to make users only care about correctly describing their needs. How to handle it efficiently is the job of the database kernel. Therefore, the kernel development implements the query rewriting module, which is responsible for rewriting the user's good SQL equivalent to a kernel's good SQL. This article describes OceanBase's practices in query rewriting.

2. Overview

2.1 Rewriting Method

After OceanBase receives an SQL statement, it parses the syntax and converts it to an internal STMT structure. The input of the query rewriting module is an STMT S1. After a series of rewriting algorithms, it is equivalent to another STMT S2. The output S2 will be used to generate logical and physical execution plans. Query rewriting is essentially a pattern matching process. It traverses each object in the STMT structure to check whether the object matches the rewriting pattern. If satisfied, it will be rewritten to another form equivalently. Let’s take the constant-truth condition elimination in figure (2) as an example. The query rewriting traverses all the condition expressions in the S1 and checks whether they are constant truth conditions. Here, the 1=1 is a constant truth condition, which is removed by the rewriting algorithm.

Figure 2: Input and Output of Query Rewriting

2.2 Algorithm Classification

The query rewriting module contains many rewriting algorithms. Different rewriting algorithms match different patterns and perform corresponding equivalent transformations. A complete query rewrite attempts to optimize the input STMT with all rewrite algorithms. The query rewriting framework is responsible for iterating various rewriting algorithms to optimize STMT. On the whole, we hope each query rewrite can make SQL better. However, not all rewrite algorithms will transform SQL in a good direction. We divide the rewriting algorithm into the following two categories:

Rule-Based Rewriting: The main feature is that this rewriting can generate a better execution plan and is always effective. For example, eliminating constant true and constant false conditions is always conducive to generating a better execution plan. The overall process is pattern matching and perform rewriting. OceanBase implements rewriting algorithms, such as view merging, subquery promotion, inner join elimination, and outer join elimination.
Cost-Based Rewriting: The main feature is that rewriting can generate better execution plans in some scenarios. Whether this type of rewrite can be triggered depends on factors (such as the actual data distribution and whether an appropriate index is available). After the cost-based algorithm completes the rewrite, it needs to ask the physical optimizer: whether the rewritten STMT can generate a plan with less execution cost. If the cost is reduced, the rewrite will be triggered. The overall process is pattern matching, perform rewriting, and cost verification. OceanBase implements cost rewriting algorithms, such as OR Expansion, JA subquery promotion, Win Magic, and Group-By Placement.

Let's use the following example to explain why some rewrites are triggered based on cost.

Q1: 
SELECT * FROM T1 WHERE C1 < 20000 OR C2 < 30 ;
=>
Q2: 
SELECT /*SEL_1*/ * FROM T1 WHERE C1 < 20000 UNION ALL
SELECT /*SEL_2*/ * FROM T1 WHERE C2 < 30 AND LNNVL (C1 < 20000);

In OceanBase, Q1 is rewritten by OR Expansion to Q2. Semantically, the two queries are equivalent. Q1 finds the set of records that meet the C1 < 20000 or C2 < 30. In Q2, SEL_1 finds the set of records that meets the C1 < 20000, SEL_2 finds the set of records that meets the C2 < 30 and does not meet the C1 < 20000, and the union of the two parts is the final result. It is worth noting that SEL_2 uses LNNVL(...) instead of C1 >= 20000 to find records that do not meet the C1 < 20000. Examining the following example, given the value of r1 (C1, C2) as (NULL, 40), we expect that SEL_2 can return r1. However, if C1 >= 20000 is used, r1 will be filtered out.

For this example of OR Expansion, consider the following two scenarios:

The T1 table has indexes on both the C1 and C2 columns. Q1 fails to use these two indexes, and it can only scan the main table. In Q2, SEL_1 can use the index on C1, and SEL_2 can use the index on C2. If these two filter conditions have strong filtering, index scanning can reduce the overhead of reading data. As such, triggering OR Expansion rewrite helps generate a better execution plan.
There is no index on the T1 table. Both Q1 and Q2 require a full table scan. However, Q2 is split into two SELECT clauses, which require two full table scans. Therefore, the implementation cost of the rewrite will rise. As such, OR Expansion should not be triggered.

3. Completeness and Correctness

The author thinks the concept behind each type of rewrite algorithm is not complicated. If a person understands some rewrite rules, it is easy to manually rewrite SQL and convert it equivalently into a friendly form for the database kernel. However, it is challenging to implement these rewrite algorithms in the kernel. Rewriting SQL at the business layer is simple because our task is only to rewrite a known SQL correctly. Rewriting at the kernel layer is difficult since our task is to give any SQL. If we can rewrite it, we should rewrite it as much as possible, and the rewrite must be correct. This puts forward two requirements for the rewriting algorithm:

Correct: The SQL semantics should not be changed after the rewrite.
Complete: SQL statements that can be rewritten must be rewritten.

Correctness is easy to understand. If the result of a rewrite is wrong, it is negative for the value of the business. Completeness is equally important. It requires a rewriting algorithm to have good generality and can deal with complex scenes. With poor generality, the value of the business is limited. Of course, it is difficult for us to make a rewriting algorithm complete. Some complicated situations are difficult to deal with, and forced rewriting may introduce correctness issues. We will be as complete as possible on the premise of ensuring the correctness in the process of implementing a rewriting algorithm. In the following, we will explain the challenges of achieving a complete rewrite under the correctness premise through outer join elimination.

The outer join operation can be divided into left outer join, right outer join, and full outer join. During the process, the left and the right order of outer join is unable to be transformed, which limits the optimizer's choice of join order. Consider the left outer join (L LEFT JOIN R ON L.ID = R.ID) of L and R. If a row in L is not joined to any row in R, the result will fill the output of the column of R (please see figure 3). You can see the result of the left outer join is a superset of the inner join, which mainly increases the rows generated by the R table.

Figure 3: Differences between Inner and Outer Join

SELECT L.ID, L.C1, R.C2 FROM L LEFT JOIN R ON L.ID = R.ID WHERE R.C2 ='XXX';
/* OuterJoin Elimination * /
SELECT L.ID, L.C1, R.C2 FROM L, R WHERE L.ID = R.ID AND R.C2 ='XXX';

Let’s consider the result of the L LEFT JOIN R. If there is a filter condition R.C2 = 'XXX', the rows generated by filling will be filtered out. The result sets produced by the outer and inner join are the same. We can rewrite the outer join to the inner join. This helps the optimizer consider more of the join order and algorithm. We call condition R.C2 = 'XXX' a reject-NULL condition in the outer join elimination. The biggest challenge is that given any expression, we can accurately determine whether it is a reject-NULL condition. Let’s consider the following scenario:

R.C2 is in the judgment expression side (such as >, <, ==, and IS NOT NULL...). If the parameter on one side of this type of expression is NULL, the result must be unknown/false, so they are reject-NULL conditions.
F(R.C2) is in the judgment expression side (such as >, <, and ==). The determination is more difficult because it is difficult to determine what the output of F(R.C2) is when the input of R.C2 is NULL. For example, when F(R.C2) is NVL(R.C2, ?), what is the output? It is difficult for us to infer whether NVL(R.C2, ?) = 'XXX' constitutes a reject-NULL condition.

It is relatively easy to rewrite the first one, and the second one is more difficult. OceanBase implements the "NULL passing" determination of the expression. Given a nested expression and a set of parameters, it calculates whether the output of the expression is NULL when the input of the parameter is NULL. Based on this determination mechanism, OceanBase supports rewriting in the preceding two scenarios.

We have analyzed the reject-NULL conditions composed of a single column. All columns in the right table are filled in the result of the left outer join. Therefore, we can determine whether the filter expression composed of multiple right table columns is a reject-NULL condition. For example, the following SQL filter condition is an OR condition. Only considering R.C2, R.C2 IS NOT NULL is not a reject-NULL condition for the entire SQL. However, if both R.C2 and R.C3 are considered, the OR condition is a reject-NULL condition.

SELECT L.ID, L.C1, R.C2 FROM L LEFT JOIN R ON L.ID = R.ID 
            WHERE R.C2 IS NOT NULL OR R.C3 IS NOT NULL;

As you can see, outer join elimination is not complicated. It is easy for us to determine whether it is a reject-NULL condition for some simple scenes (i.e., R.C2 directly appears on one side of the judgment expressions). It is also relatively easy to determine some complex scenes (i.e., R.C2 is nested in four arithmetic operations (+,-,*,/)). The difficulty of judging is increased in some more complex scenarios (i.e., R.C2 is nested in some function expressions CASE WHEN, NVL, etc.). In terms of outer join elimination, we first need the abstract reject-NULL property to rewrite correctly. We also need the NULL passing property of abstract expressions to improve the generality of rewriting. Furthermore, the algorithm should be able to identify the reject-NULL condition formed by multiple NULL columns.

We have described how to determine whether a predicate is a reject-NULL predicate as complete as possible. Another important question is whether all reject-NULL predicates can be used to rewrite outer join to inner join. Compared with the previous problem, this problem is of higher importance. It forms the correctness boundary of outer join elimination. If the correctness boundary of the rewrite rule is not accurately summarized, the query rewrite will change the semantics of the business SQL statement, resulting in errors in the query results.

SELECT L.ID, R.ID, T.ID FROM (L LEFT JOIN R ON L.ID = R.ID) 
                              LEFT JOIN T ON (R.ID = T.ID);

Let’s consider the preceding query. The result of the L LEFT JOIN R will be outer joined with the T again. The join predicate is R.ID = T.ID. This predicate contains columns L LEFT JOIN R in the right table. Independently, this predicate does have the property of reject-NULL. However, as a join predicate of an outer join, even if the judgment result of this predicate is non-true, it will not filter the data on the left side of the outer join. Therefore, although the R.ID = T.ID determination result is not true when the R.ID value is NULL, this predicate cannot be used to rewrite the outer join to the inner join.

4. Framework Rewriting

The query rewriting module of OceanBase implements many rewriting rules. As described in Section 2, rewriting rules are divided into rule-based and cost-based.

After the SQL is parsed, an internal structure (STMT) is formed, which is then handed over to the query to rewrite for equivalence changes. The rewrite framework is responsible for trying each rewrite algorithm in a specific order. When a cost rewrite is encountered, it will trigger additional cost verification. The framework will continue to iterate over the rewrite strategy until the current STMT reaches the convergence state and no more rewrites can be triggered.

In the rewrite framework, the trigger order of each rewrite algorithm is important. Some rewrites need to be triggered before others. There are typical scenarios, such as view merging and predicate movement. View merging attempts to merge multiple query blocks into one, and predicate movement directly pulls up or down the predicate on multiple query blocks. Rewriting the framework will first attempt view merging before triggering predicate movement. This is because if view merging can be successful, it is no longer necessary to consider moving predicates between multiple query blocks. The query rewriting module of OceanBase carefully considers the relationships between the various rewriting algorithms and then determines their trigger order.

Another important issue in the rewrite framework is cost verification. Triggering a cost rewriting may or may not generate a better execution plan. The rewrite framework contains a module for cost verification. The framework iterates over the rewritten results for the rule rewrite set in the process of cost verification. After the cost is rewritten, there may be some new rewriting opportunities. Finally, the query optimizer generates an execution plan for the STMT before and after the rewrite. It determines whether the result after the rewrite can reduce the cost of the entire execution plan.

5. Summary

This article mainly introduces the query rewriting of OceanBase. It briefly introduces the rewriting methods, the types of rewriting algorithms, the main problems when implementing each rewriting algorithm, and how multiple rewriting rules are organized under a rewriting framework. Query rewriting is the focus and difficulty of the query optimizer module. It is also the basic knowledge that both SQL performance tuning workers and DBAs need to master. OceanBase implements many rewrite rules. This article provides an overview of the entire rewrite module. Based on this article, we will systematically launch a series of query rewriting articles to introduce different rewriting rules/algorithms implemented by OceanBase. We will introduce OceanBase's experience in optimizing different relational algebra and calculation expressions, such as subqueries, grouping operations, window functions, join types, and expressions. We believe this series of articles can be a references for readers. You are welcome to communicate and discuss the query rewriting issues with us.

Community

OceanBase Rewriting Series: An Overview of the OceanBase Query Rewriting Practices

Columnist Introduction

Structure

1. Introduction

2. Overview

2.1 Rewriting Method

2.2 Algorithm Classification

3. Completeness and Correctness

4. Framework Rewriting

5. Summary

Read previous post:

Read next post:

OceanBase

You may also like

Comments

OceanBase

Related Products

ApsaraDB for OceanBase

ApsaraDB RDS for SQL Server

Database for FinTech Solution

Oracle Database Migration Solution