The OceanBase Optimizer Team, led by Xifeng (OceanBase Senior Technical Expert) and Shanwen (Technical Expert), is dedicated to building the world's leading distributed query optimizer.
In the previous article, we overviewed the OceanBase query rewriting practice. We briefly introduced the rewriting methods, the types of rewriting algorithms, the main issues when implementing each rewriting algorithm, and how multiple rewriting rules are organized in a rewriting framework. This article is the second in the SQL Rewriting Series, which mainly explains the subquery promotion. Using subqueries allows users to write complex SQL statements with clear meaning in a concise and clear manner. This feature is user-friendly but not database-friendly. From a database perspective, processing subqueries is inefficient. Database systems often try to rewrite SQL to improve the processing of subqueries and eliminate subqueries.
Subqueries typically have two types of usage:
OceanBase uses different methods to rewrite these two subqueries.
This article will first introduce the rewrite strategy for the first type of subquery, which is the subquery that appears in the existential predicate.
Existential determination predicates are common in SQL statements. It is mainly used to determine whether there are records that meet the conditions in a set or not. Let’s take the movie table MOVIES and the schedule table PLAY as examples. The following SQL Q1 uses a typical existential determination predicate EXISTS (SELECT ...)
.
-- Movie table
MOVIES(movie_id, movie_name, rate);
-- Schedule table
PLAY(movie_id, time, price)
-- Q1: Find out the movies that are scheduled in 19: 00.
SELECT movie_name
FROM MOVIES
WHERE EXISTS (SELECT timeFROM PLAY
WHERE PLAY.movie_id = MOVIES.movie_id
AND PLAY.time = '19:00');
If the calculation form of the subquery is maintained, you need to scan the schedule table to check whether there is a '19:00' film schedule for each movie in the above query. This process is similar to the NEST-LOOP JOIN process. The number of times a subquery is executed is equal to the number of rows read from the MOVIES
table. OceanBase rewrites the preceding subquery into a join to enrich the space selected by the plan.
OceanBase rewrites the preceding Q1 query to a format similar to Q2. This change is equivalent. The semantics of semi-join in Q2 is – for each row in the MOVIES
, if it can be successfully joined with any row in the PLAY
according to the PLAY.movie_id = MOVIES.movie_id AND PLAY.time = '19:00'
, the row will be output. Otherwise, it will not be output. Intuitively, the semantics of Q1 and Q2 are the same.
-- Q2: Promote to SEMI JOIN.
SELECT movie_name
FROM MOVIES
SEMI JOIN PLAY
ON PLAY.movie_id = MOVIES.movie_id
AND PLAY.time = '19:00';
Rewriting a subquery into a SEMI JOIN has the following benefits:
MOVIES
and PLAY
can use richer join algorithms. For example, use merge-join or hash-join. In the original subquery, the PLAY
table is scanned multiple times. If there is no suitable index, the scanning cost may be high. After the rewrite, when you use merge-join or hash-join, only the PLAY
needs to be scanned once.PLAY.time = '19:00'
filtering is small, and the number of MOVIES
rows is large, this execution plan works well.-- Q3: The SEMI JOIN is rewritten to the INNER JOIN.
SELECT movie_name
FROM MOVIES, (SELECT DISTINCT move_id FROM PLAY WHERE PLAY.time ='19:00') V
WHERE V.movie_id = MOVIES.movie_id;
The EXISTS/ANY comparison predicate can be rewritten to SEMI JOIN. This type of predicate mainly determines whether a record exists that meets the filter conditions. Similarly, the NOT EXISTS/ALL comparison predicates are widely used in businesses. These predicates are used to determine non-existence. OceanBase attempts to rewrite such predicates to ANTI JOIN. For example, Q4 uses a NOT IN predicate (equal to != ALL), which can be rewritten to Q5.
-- Q4: Find the movies that are not scheduled at 19:00.
SELECT movie_name
FROM MOVIES M
WHERE movie_id NOT IN
(SELECT movie_id
FROM PLAY P
WHERE P.time = '19:00');
-- Q5: The NOT IN predicate is rewritten to ANTI JOIN.
SELECT movie_name
FROM MOVIES M
ANTI JOIN PLAY P ON (M.movie_id = P.movie_id
OR M.movie_id IS NULLOR P.movie_id IS NULL)
AND P.time = '19:00'
The semantics of ANTI JOIN here is – for a row in table M, if it can be joined with any row in table P, the row is filtered. Otherwise, the row is output.
It is worth noting that Q5 has a special ANTI JOIN join predicate: M.movie_id = P.movie_id OR M.movie_id IS NULL OR P.movie_id IS NULL
. This is mainly to handle the case where the movie_id value is NULL. Let’s consider the following scenario: given the m1
of a row in the M table, where the m1.movie_id
value is NULL, and the result set of the subquery is not empty. In Q4, the determination result of NOT IN is unknown, and the m1 is filtered out. In Q5, the OR join predicate is determined to be true. The ANTI JOIN can be successful, and the m1
is also filtered out. Most of the time, we may mistakenly believe that Q4 can be directly rewritten into the form of Q7 below. However, the semantics of the two are different. In Q7, due to m1.movie_id is NULL
, the join condition of ANTI JOIN is unknown. If the connection is not successful, the m1
will not be filtered out. Therefore, Q4 and Q7 are not equivalent. Similarly, you can analyze the query behavior when the P.movie_id
value is NULL. This semantic feature is worth mentioning. The particularity of NOT IN predicate when dealing with NULL values is easily ignored. We will analyze this further below.
During the Semantic and Performance Differences between NOT EXISTS and NOT IN in the process of business support, we noticed that many businesses did not realize the semantic difference between NOT EXISTS and NOT IN when processing NULL values. In some scenarios, NOT EXISTS can be used, but NOT IN is used, resulting in slow query performance or high resource overhead. Q6 uses the NOT EXISTS predicate, which is close to Q4. A big difference between the two is that the result set of Q6 will retain the result with the value of M.movie_id as NULL. Q4 will not.
Comparing Q5 with Q7, which are the rewrite results of Q4 and Q6 (respectively), we can further analyze the performance difference between the two predicates. In Q5, ANTI JOIN has no equivalent join predicate, while Q7 has an equivalent join predicate. When there is no equivalent join predicate, the query optimizer can only select nest loop join to join each row in M and P in pairs. If an equivalent join predicate exists, the query optimizer can select merge join/hash join or push the equivalent join predicate down to a filter predicate on the right table by nest loop join. In comparison, when there is an equivalent join predicate, the query can choose a richer join algorithm for better performance.
-Q6: Find movies that are not scheduled at 19:00.
SELECT movie_name
FROM MOVIES M
WHERE NOT EXISTS
(SELECT movie_id
FROM PLAY P
WHERE P.time = '19:00' AND M.movie_id = P.movie_id);
-- Q7: The NOT EXISTS predicate rewrites the ANTI JOIN.
SELECT movie_name
FROM MOIVES M
ANTI JOIN PLAY P ON P.time = '19:00' AND M.movie_id = P.movie_id;
Finally, we can discuss the optimization of Q5 additionally. OceanBase has designed other strategies for Q5 in terms of optimization and execution to improve query performance. Two mechanisms are briefly introduced here:
This article mainly introduces the strategy of rewriting subqueries in existential determination predicates into SEMI/ANTI JOIN.
This type of subquery mainly appears in the existential determination predicate (EXISTS/NOT EXISTS, ANY/ALL comparison). Its typical feature is that the result of the subquery is a set. Rewriting such subqueries into join allows the query optimizer to select a richer join algorithm. It can work with other rewriting strategies to generate a richer join order. On the whole, this is a rule-based rewriting strategy. In addition to such subqueries, many subqueries are used to calculate a single value in SQL. The next article will explain how to rewrite this type of subquery.
OceanBase Rewriting Series: An Overview of the OceanBase Query Rewriting Practices
OceanBase Rewriting Series: Best Practices for Promoting Subquery Performance
OceanBase - August 25, 2022
OceanBase - August 25, 2022
OceanBase - August 25, 2022
OceanBase - September 9, 2022
ApsaraDB - September 30, 2021
chenyucatcat - April 21, 2021
An on-demand database hosting service for SQL Server with automated monitoring, backup and disaster recovery capabilities
Learn MoreLeverage cloud-native database solutions dedicated for FinTech.
Learn MoreMigrate your legacy Oracle databases to Alibaba Cloud to save on long-term costs and take advantage of improved scalability, reliability, robust security, high performance, and cloud-native features.
Learn MoreMigrating to fully managed cloud databases brings a host of benefits including scalability, reliability, and cost efficiency.
Learn MoreMore Posts by OceanBase