OceanBase Rewriting Series: Best Practices for Promoting Subquery Performance

Part 3 of this Series introduces aggregation subquery promotion, a rewrite strategy for aggregation-related subqueries, and the pitfalls of this rewrite strategy.

Columnist Introduction

The OceanBase Optimizer Team, led by Xifeng (OceanBase Senior Technical Expert) and Shanwen (Technical Expert), is dedicated to building the world's leading distributed query optimizer.

1. Introduction

In the previous article about Subquery Promotion, we mentioned that there are two types of subqueries. The first type returns a set and the second type returns a specific value. The previous article introduced the rewriting and promotion of the first type. This article introduces the way to rewrite the second type of subquery.

The second type of rewriting-subquery results only returns one value, which is associated with aggregate functions. Q1 is a typical example. The way this subquery is used is common in business SQL. Here, a PLAY table is given, which records the schedule information of all movies. TICKETS record all the actual ticketing information. Q1 finds out all the shows where the ticket price is lower than the average selling price of the movie.

-- Movie table
MOVIE(movie_id, movie_name, release_date)
-- Schedule table
PLAY(play_id, movie_id, time, price, seats)
-- Ticket schedule
TICKETS(play_id, movie_id, real_price, sale_date);

Q1:
SELECT *
FROM PLAY P
    (SELECT AVG(T.real_price)
     FROM TICKETS T
     WHERE T.movie_id = P.movie_id);

When this SQL statement is executed, the execution engine iterates over each row in the PLAY. After it obtains the PLAY.movie_id, it fills in the subquery, calculates the aggregate function AVG(TICKETS.real_price), and determines whether the filter conditions are met. In essence, this process is similar to a NEST-LOOP Join. The number of times a subquery is calculated depends on the number of rows PLAY.

It is not difficult to find that Q1 has two main characteristics:

It is a related subquery, referring to the outer column.
It is not a simple SPJ query. Its projection item contains an aggregate function to calculate a statistical value.

This type of subquery is called an aggregate-related subquery in OceanBase. An aggregate-based subquery is promoted to rewrite this type of SQL.

2. The Aggregation-First Aggregation Subquery Promotion

Let’s take a movie schedule as an example. We know that a movie will schedule many shows, which means the schedule PLAY movie_id contains many repeated values. Take a closer look at Q1. It is not difficult to find that for PLAY.movie_id identical records, subqueries are repeatedly calculated, and the calculation results of these subqueries are the same. In response to this situation, the natural idea is can the subquery only be calculated once for the same PLAY.movie_id record? Based on this idea, OceanBase rewrites the preceding Q1 query to Q2.

Q2:
SELECT P.*
FROM PLAY P,
  (SELECT movie_id,
          AVG(real_price) AS avg_price
   FROM TICKETS T
   GROUP BY movie_id) V
WHERE P.movie_id = V.movie_id
  AND P.price < V.avg_price;

In Q2, we used a grouping query to calculate the average ticket price of each movie in advance. When the main query needs to use different statistical values, it can be obtained directly from the results aggregated in advance. View V in Q2 achieves this effect. It only needs to scan the TICKETS table once to get the average selling price of all movies. After that, Q2 only needs to join the PLAY and the V according to movie_id. Then, it can quickly find out which films have lower average selling prices.

View V in Q2:
SELECT movie_id,
       AVG(real_price) AS avg_price
FROM TICKETS T
GROUP BY movie_id;

The change from Q1 to Q2 is to rewrite the subqueries related to an aggregate class into GROUP BY + JOIN. The rewritten SQL first performs grouping aggregation and then joins, so we call it aggregation-first aggregation subquery promotion. The rewritten query is expected to scan the PLAY table and the TICKETS table once each. This rewriting has the following benefits:

If the PLAY and TICKETS have indexes on the movie_id, we can use merge aggregation to optimize the calculation of view V and use merge join to handle the join between PLAY and V.
If there are (movie_id, price) indexes on the PLAY, you can use the V as the driver table, use the NEST-LOOP JOIN to convert the P.movie_id = V.movie_id AND P.price < V.avg_price into the filter conditions on the PLAY, and use the index scan to reduce the number of scans by PLAY.
Even if there is no suitable index, we can still use HASH JOIN to calculate the join between the PLAY and the V.

As you can see, the rewritten SQL has more room for plan selection. In the original Q1 query, we can only use the PLAY in the main query to drive the calculation of the subquery, which is essentially a NEST LOOP JOIN process. After rewriting, we can use more JOIN algorithms and use the view generated by subquery promotion to drive tables in the main query to join.

2.1 A Small Trap of Rewriting

Next, let's look at a special example. Q3 SQL found out all the sessions with lower than 50% seating rate. If we rewrite this SQL with reference to Q2, we will get SQL in the form of Q4. Q4 and Q3 are not equivalent, which is a trap for the promotion of aggregate subqueries.

Q3:
SELECT *
FROM PLAY P
WHERE p.seats * 0.5 >
    (SELECT count(*)
     FROM TICKETS T
     WHERE T.play_id = P.play_id);
 
Q4:
SELECT P.*
FROM PLAY P,
  (SELECT play_id,
          count(*) AS cnt
   FROM TICKETS T
   GROUP BY play_id) V
WHERE P.play_id = V.play_id
  AND P.seats * 0.5 > V.cnt;

We consider an extreme situation where a movie has not sold a ticket, so the seating rate of this movie is 0%. Then, it should be exported to the results.

In Q3, this field cannot find matching rows in the subquery, but the subquery returns result 0 because of the nature of the count(*). If the result of further p.seats * 0.5 > 0 is true, the record of this field will be output to the result.
In Q4, if there is no record of this field in the TICKETS, the aggregation result of this field will not be included in the view V that is aggregated in advance, and the inner join of Play and V cannot be successful. The record of this session will not be output to Q4 results.

As you can see, Q3 and Q4 are semantically inequivalent.

Based on the preceding analysis, we found that for count(*), if the result of the relevant condition of the subquery is an empty set, the use of the inner join will get the wrong result. At the same time, the result of the count(*) is 0 when the result set of the relevant condition of the subquery is an empty set. Then, you can get a Q5 equivalent to Q3 by slightly modifying Q4.

Q5:
SELECT P.*
FROM PLAY P
LEFT JOIN
  (SELECT play_id,
          count(*) AS cnt
   FROM TICKETS T
   GROUP BY play_id) V ON P.play_id = V.play_id
WHERE P.seats * 0.5 > (CASE WHEN V.play_id IS NULL THEN 0 ELSE V.cnt end);

In Q5, we use left join and case when to distinguish between rows that can match data in subqueries and rows that fail to be matched. We can generate the correct filter conditions for both types to replace the filter conditions that originally contained subqueries.

(1) Let’s consider a record that can match data in a subquery: it can be joined to a V row. In the result of the outer join, the V.play_id will not be NULL, and the final filter condition will be P.seats * 0.5 > V.cnt.

(2) Let’s consider a record that cannot match data in a subquery: it fails to be joined to a V row. In the result of the outer join, the V.play_id must be NULL, and the final filter condition will be P.seats * 0.5 > 0.

3. The Join-First Aggregate Subquery Promotion

When the relevant conditions in the subquery are all equivalent conditions, we can use Q1->Q2 or Q3 ->Q4 to rewrite. Next, we consider a more complex subquery Q6: Find all the movies that earned more than $50,000 at the box office within a week of release. There is a non-equivalent condition in this query: T.sale_date < M.release_date +7. In Q6, the range to be aggregated by a subquery is changed to T.Movie_id = ?and T.sale_date < ?+7. This range cannot be aggregated in advance, and it fails to be transformed using the method given in the previous section.

Q6:
SELECT *
FROM MOVIE M
WHERE
    (SELECT sum(real_price)
     FROM TICKETS T
     WHERE T.movie_id = M.movie_id
       AND T.sale_date < M.release_date + 7) > 50000;

In OceanBase, Q6 is rewritten to Q7. For each row in the M table, Q7 will find the matching set of records in the T table and then perform grouping and summation statistics (movie_id is assumed to be the primary key of the MOVIE) to obtain the total box office. Finally, it is determined in Having whether the total box office exceeds 50000. Semantically, Q6 and Q7 are equivalent.

Q7:
SELECT M.*
FROM MOVIE M, TICKETS T
WHERE T.movie_id = M.movie_id
      AND T.sale_date < M.release_date + 7
GROUP BY M.movie_id
HAVING sum(T.real_price) > 50000;

The change from Q6 to Q7 is to rewrite a subquery related to an aggregate class into JOIN + GROUP BY. The rewritten SQL goes to join first and then group aggregation, so we call it join-first aggregation subquery promotion. Intuitively, the query after the rewrite does not improve the execution performance compared with the query before. Q7 still needs to find the records for each row of the MOVIE table that meet the conditional T.movie_id = M.movie_id AND T.sale_date < M.release_date +7 in TICKETS for aggregation. This rewrite method has the following benefits:

More Flexible join order: If a subquery only contains non-equivalent conditions, the rewritten query can only be NEST-LOOP JOIN. However, you can select the table in the subquery as the driver table.
A Richer join Algorithm: If the conditions of the subquery contain equivalent join conditions, you can also select HASH JOIN or MERGE JOIN for the rewritten query. As such, you only need to scan the table in the outer query and the table in the subquery once.

3.1 Thinking

It is similar to the aggregation-first subquery promotion. When the aggregation function in the subquery is count(*), the join-first aggregation subquery promotion also has a similar trap. The two have the same problem to consider, but the way to solve the problem is different. We will leave this problem as homework for readers. Please think about how Q8 should improve the join-first aggregation subquery.

Q8：
SELECT *
FROM MOVIE M
WHERE
    (SELECT count(*)
     FROM TICKETS T
     WHERE T.movie_id = M.movie_id
       AND T.sale_date > M.release_date + 30) > 0;

4. Summary

This article mainly introduces aggregation subquery promotion, a rewrite strategy for aggregation-related subqueries, and the pitfalls of this rewrite strategy. According to the different join conditions in the subquery, this rewrite strategy can be subdivided into aggregation-first subquery promotion and join-first subquery promotion. Aggregate subquery promotion can rewrite aggregate subqueries into join forms. It allows the optimizer to select richer join algorithms and join orders. On the whole, this is a rule-based rewriting strategy. We have introduced how OceanBase rewrites two types of subqueries. The next article will focus on GroupBy and introduce some rewriting and optimization practices of OceanBase in grouping.

Community

OceanBase Rewriting Series: Best Practices for Promoting Subquery Performance

Columnist Introduction

1. Introduction

2. The Aggregation-First Aggregation Subquery Promotion

2.1 A Small Trap of Rewriting

3. The Join-First Aggregate Subquery Promotion

3.1 Thinking

4. Summary

Read previous post:

Read next post:

OceanBase

You may also like

Comments

OceanBase

Related Products

ApsaraDB for OceanBase

Elastic High Performance Computing Solution

Elastic High Performance Computing

Remote Rendering Solution