By Fangwu
For better understanding of this article, we recommend you to read about the following concepts and principles in advance:
The definition of subquery is so simple, but the derived scenarios are so complex. It is both an expression and a query tree, which is flexible enough to become a constant and to express relationships between collections at any time. While writing subqueries in SQL is so natural, the way it really works isn't as straightforward.
According to the SQL standard, nesting a query block into an expression results in a subquery.
Query blocks in a relational database are usually expressed in a tree plan composed of relational algebraic operators. The relationship between the expressions of each operator and the outer-layer query tree is ambiguous.
The expression in the query tree hides another query tree. The essence of which lies in describing the relationships between query trees through expressions, such as user-defined functions, comparison operators, and logical operators. Due to a large number of expression types and their complexity, it is difficult to directly abstract the relationship between the subquery and the main query through traditional relational algebra. This makes it difficult to implement subqueries in relational databases.
Since subquery implementation is complex for databases, the execution efficiency may not be high. Why this SQL syntax has to be implemented? Is there any reason other than SQL standard?
... WHERE AGE > ALL(SELECT AGE ...)
... WHERE SALARY <= ANY(SELECT SALARY ...)
Each query tree is understood as a dataset. It is very obscure in some scenarios to use JOIN to describe the intersection and union between data. However, from the preceding SQL, the subquery description of AGE >ALL and SLARY
In a nutshell, a subquery is a classic example of leaving the complexity to the database and the simplicity to users.
To understand the subquery, classification is necessary. Semantically, it can be divided into the following aspects:
1) A scalar subquery returns a row of values. The two-dimensional table with rows and columns output by common operators is similar to a vector. Accordingly, a subquery that outputs only one row and one column is called a scalar.
SELECT c_custkey
FROM CUSTOMER
WHERE 1000000 < (
SELECT SUM(o_totalprice)
FROM ORDERS
WHERE o_custkey = c_custkey
)
2) Quantified comparison judges whether ALL or ANY item is met.
SELECT c_name
FROM CUSTOMER
WHERE c_nationkey <> ALL (SELECT s_nationkey FROM SUPPLIER)
3) Existence test judges the EXISTS or NOT EXISTS.
SELECT c_custkey
FROM CUSTOMER
WHERE c_nationkey = 86 AND EXISTS(
SELECT * FROM ORDERS
WHERE o_custkey = c_custkey
)
It is easier to understand the semantic division, but the database will abstract another division based on a logical operational perspective:
AGE >ALL (SELECT AGE)
can be expressed as a.age>b.age[0] AND a.age>b.age[1] AND ... AND a.age>b.age[n]
.This does not mean that a subquery is a common expression after it is logically disorganized. In expression evaluation of a relational database, functions try to use rows as the input and output.
A function that processes a set is usually abstracted into a new operator. For example, an aggregation function is processed by an aggregate operator alone. Similarly, a subquery should also be abstracted as an individual operator. The boundary division when operators and expressions are cut follows the same logic.
The database engine does not confuse complex dataset operations with expression evaluation. In addition, the boundary between set operations and row operations is clearly defined based on the implementation complexity.
Finally, based on whether the subquery itself has a related item, it can also be divided into correlated subquery and non-correlated subquery. Any non-correlated subquery can be regarded as a constant in nature, while the data relationship between the inner and outer layers of the subquery must be considered when processing the correlated subquery. The processing of this data relationship is the focus of subquery optimization.
Will subqueries in SQL statements lead to a sharp drop in query efficiency?
As a declarative programming language, SQL only describes what data it needs. The specific operation depends on the performance of the database itself. With the development of query optimization techniques, it is too arbitrary to say that subqueries will inevitably lead to performance degradation.
The essence of correlated subqueries is the operation of each row of data on the foreign table with the data of the subquery collection. When there are too many rows of data in the foreign table, this nesting process will inevitably lead to low performance. Therefore, a very important step in subquery optimization is "unnesting". Today's unnesting technology has become mature, and HyPer announced a paper Unnesting Arbitrary Queries in 2015.
As mentioned above, the subquery should be separated from the expression and displayed as an operator in the execution plan because it processes collection data. This operator, which refers to a subquery, is generally called Apply.
The name "Apply" comes from list processing (LISP), which is a special type of function. The input parameter is a set of parameter lists, and the returned value is a corresponding set of result lists. From the semantic point of view of relational algebra database, the process is similar to the nested execution of correlated subqueries, which was firstly proposed by the paper of Microsoft SQL Server.
After a subquery is converted to an Apply operator, the correlated items remain in the query tree of the subquery. Therefore, there is no way to directly process Apply through JOIN.
Therefore, the most important step for subquery optimization is unnesting.
SELECT c_custkey
FROM CUSTOMER
WHERE 1000000 < (
SELECT SUM(o_totalprice)
FROM ORDERS
WHERE o_custkey = c_custkey
)
Take the SQL statement as an example to see why unnesting is necessary for correlated subqueries.
The following figure shows the original relation tree that is not unnesting. In actual running mode, the executor runs the query tree on the right of Apply for each row of data. By doing so, the processing time consumed by nested processing of correlated subqueries increases linearly with the increase of the data volume. If multiple subqueries are nested, the processing time increases exponentially. To avoid poor experience due to long and time-consuming operations, unnesting is crucial.
In the 1980s and 1990s, the SQL standard expanded the scope of subqueries. Rule-based unnesting was one of the mainstream researches at that time.
The paper Orthogonal Optimization of Subqueries and Aggregation published in 2001 is a collection of rule-based unnesting, in which nine conversion rules are concluded:
The two transformations in the following figure correspond to Rule 3 and Rule 1 respectively.
The conversion in the preceding example takes effect on certain conditions. The key difference is the AND logical operator that matches the relationship between JOIN and Filter.
Rule-based transformations cannot process subqueries in all schemas, for example:
// Complex example 1: Subquery in a destruction
SELECT * FROM
supplier a
WHERE
s_address IN (select s_name from supplier where a.s_phone=s_phone)
OR s_name='1'
// Complex example 2: A subquery contains aggregation and non-equivalent items
SELECT * FROM T1
WHERE AGE >
(SELECT AVG(AGE) FROM T2 WHERE T1.NAME!=NAME)
// Complex example 3: Nested subqueries
...
(SELECT * FROM T1
WHERE ID IN
(SELECT ID FROM T2 WHERE T3.NAME=NAME))
...
Magic Set is a very old data processing technology, which was first applied in a deductive database. Currently, it also plays an important role in the subquery unnesting of relational databases.
In 1996, Cost-Based Optimization for Magic: Algebra and Implementation published by IBM DB2 introduced Magic Set as a relational algebraic operator into CBO. In 2015, HyPer developed its own side-ways information passing optimization based on this technology to process the unnesting of all types of subqueries. In the execution plan displayed on the HyPer official website, a similar operator is named Magic.
Consider such a SQL statement as
SELECT s.name, e.course
FROM students s, exams e
WHERE s.id=e.sid AND
(s.major = 'CS' or s.major = 'Games Eng') AND
e.grade >= (SELECT avg(e2.grade)+1 --one grade worse
FROM exams e2 --than the average grade
WHERE s.id=e2.sid OR --of exams taken by
(e2.curriculum=s.major AND --him/her or taken
s.year>e2.date)) --by elder peers
The unnesting of this SQL statement is mainly difficult to have a non-equivalent correlated item s.year>e2.date
, which will lead to failure to avoid JOIN computing in the filter layer where the correlated item is located.
Figure from HyPer's paper Unnesting Arbitrary Queries
HyPer attempted to copy the data of the columns involved in the correlated items and introduce the data to the subquery in his paper. In addition, it replaced correlated items through JOIN, thus achieving unnesting.
From an implementation point of view, unnesting of subqueries are at the cost of some space and additional JOIN computing.
Databases often use the SemiJoin operator clusters to express unnesting subqueries. Here, another issue that the definition of Apply and SemiJoin of the relational algebra cannot be completely equivalent will occur.
SemiJoin, which indicates that only the columns of one table are output, and the columns of other tables will not be output to the upper layer.
// Complex example 1: Subquery in disjunction
SELECT * FROM
supplier a
WHERE
s_address IN (select s_name from supplier where a.s_phone=s_phone)
OR s_name='1'
Considering the preceding SQL statements, the subquery is placed between OR expressions. As a result, it cannot be converted to a simple SemiJoin condition because the filtering relationship superimposed with Filter and SemiJoin is different from OR.
Mark Join: In addition to outputting the connected data, it also keeps a Mark to mark the connected results of this row, such as TRUE, FALSE, or NULL.
For this scenario, HyPer introduces the Mark Join to replace SemiJoin.
In the Filter for the Mark Join upper layer, an expression of markvalue OR sname='1'
is formed, avoiding violation with OR semantics by adding a column of the output.
Adopting the Mark mechanism, which allows JOIN to output an extra column, not only breaks the relational algebraic meaning of JOIN but also requires a massive transformation at the execution level. However, in addition to the OR subquery in the preceding example, it also supports subqueries of the Project and Scalar classes. This method is of great significance.
As of HyPer's 2015 paper Unnesting Arbitrary Queries, traditional database vendors such as SQL Server, Oracle did not support non-Scalar subqueries in Project, and support for unnesting of complex subqueries was very limited as well. HyPer, as the first database that claims to be able to perform unnesting for all subqueries, is indeed a very radical practice.
Oracle's 2009 publication, Enhanced subquery optimizations in Oracle, shows the vast number of subquery rewriting optimizations available. These technologies are specifically optimized based on subqueries in the TPC-H. It is estimated that Oracle programmers wrote these technologies when they made the first TPC-H.
These rewriting technologies have high requirements for parameter extraction and derivation.
The technology of merging multiple similar subqueries, such as Q2 to Q3, Q4 to Q5, is shown in the following figure.
A subquery can be merged with the primary table as well as with similar subqueries, for example, Q8 to Q11.
A technique for optimizing subqueries based on window functions was first proposed by IBM in WinMagic: Subquery Elimination Using Window Aggregation. In Oracle, the window function is rewritten as one of the representatives of its subquery rewriting as well.
In short, the subquery with aggregate operation is rewritten as window functions. The rewriting assumes that external query blocks contain all the tables of subqueries as well as filter conditions. For example:
Q1:
SELECT T1.X
FROM T1,T2
WHERE T1.y = T2.y and
T2.name='ming' and
T2.z relop (SELECT AGG(T2.w)
FROM T2
WHERE T2.y = T1.y and
T2.name='ming');
The external query block contains two tables, T1 and T2, and contains all subquery tables (T2) in the subquery. It also contains all the filter conditions (T2.name='ming'). "relop" is short for relational operators.
After the above conditions are met, Q1 can be rewritten as Q2:
Q2:
SELECT T1.x
FROM T1,(
SELECT AGG (T2.w) OVER (PARTITION BY y) AS win_agg,
z,
y
FROM T2
WHERE T2.name='ming' ) V
WHERE T1.y=V.y and
V.z relop win_agg
If the connection between T1 and T2 is lossless join, Q1 can be converted to Q3:
Q3:
SELECT V.x
FROM (
SELECT T1.x, T2.z,
AGG (T2.w) OVER (PARTITION BY T2.y) AS win_agg
FROM T1, T2
WHERE T1.y = T2.y ) V
WHERE V.z relop win_agg
The lossless join is not discussed here. In fact, the difference between Q2 and Q3 can be understood as the reorder of aggregate and JOIN.
After Q1 is rewritten into Q2, it is determined in CBO whether to be converted into the form of Q3 based on the cost. Here is a specific example Q4 (TPC-H Q17):
Q4:
SELECT sum(l_extendedprice) / 7.0 AS avg_yearly
FROM
lineitem,
part
WHERE
p_partkey = l_partkey
AND p_brand = 'Brand#23'
AND p_container = 'MED BOX'
AND l_quantity < (
SELECT 0.2 * avg(`l_quantity`)
FROM
lineitem
WHERE
l_partkey = p_partkey
);
In the scenario with 50 gigabytes, the computing workload for each plan is as follows:
After the original execution plan is rewritten to window functions, obviously, a scan of 10^8 levels is reduced.
Compared with the two execution plans rewritten as window functions, the Cost of the Plan 2 window functions is much smaller according to the number of rows processed. However, to be converted to Plan 2, the following decisions similar to aggregate and JOIN reorder should be made. In the lossless join scenario, Plan 2 can be directly converted. Q4 parts and line items are connected with foreign keys. Therefore, this condition is met.
In addition, the JOIN of Plan 2 uses the Batch Key Access algorithm, which is equivalent to performing 10^4 index scans on the line item table. Thus, the Cost is extremely low.
The optimization of the window function significantly improves the RT of TPC-H Q17 and Q2. in distributed databases, the optimization of the measured window functions plus BKA increases the RT of Q17 by nearly a hundred times.
Technically, query optimization is not the whole of the subquery. From the perspective of implementation, it is even more difficult to prevent traps in the execution.
Count traps mainly exist in Count subqueries without group by. Consider the following SQL:
SELECT * FROM T1
WHERE T1.NUM = (SELECT COUNT(ID) FROM T2
WHERE T1.AGE=AGE)
If the above SQL statement is converted into the following execution tree, when the number of certain ages of T2 is 0, the result of SemiJoin will be Null because of T1.age=T2.age
expression, thus the correct result cannot be output. This is essentially due to the particularity of the COUNT aggregate function. To resolve the problem, a node must be output Null rows in the JOIN node, which is similar to the LEFT type.
Other aggregate functions also have different problems. For example, >ALL subqueries cannot be directly converted to >MAX, because >ALL returns TRUE and >MAX returns FALSE if the result set is null.
Observe the following two subqueries that output people with different names or ages in the USER table:
// SQL1
SELECT * FROM USER T1
WHERE AGE NOT IN (SELECT AGE FROM USER T2
WHERE T1.NAME=T2.NAME);
// SQL2
SELECT * FROM USER T1
WHERE NOT EXISTS (SELECT 1 FROM USER T2
WHERE T1.NAME=T2.NAME AND T1.AGE=AGE);
It seems that both subqueries are equivalent. In fact, if a row in the USER table has an age that is NULL, SQL 1 will not output the data, while SQL 2 will.
NOT EXISTS is equivalent to Anti Join. When the Anti Join is used to process the ON condition, the execution result of the executor is null. If this is the case, no match will be found. The problem is that the operand of the not in subquery is also converted into one of the ON conditions, and is connected with the correlated item through AND. By doing so, the correctness problem may occur for multiple outputs.
If this is the case, the not in subquery may not be converted to a join. For example, PostgreSQL keeps the original processing method for processing the not in subquery.
However, another option is mentioned in the article Enhanced Subquery Optimizations in Oracle, which mentions that a new operator, Null-Aware Anti Join (NAAJ). Again, taking SQL 1 as an example, the NAAJ processing algorithm is as follows:
Returns all rows if T2 is null
If any row of Age for T2 is Null, no rows are returned.
If the Age of T1 is Null in a row, the row is not returned
For each row of T1, if the NA condition execution is Null or TRUE, the row is not returned; Otherwise, returned.
Note that the NA condition in NAAJ is reversed: NOT IN -> NA= >ALL -> NA<=
The advantage of distributed databases is that more computing and storage resources can be applied. However, as data is distributed across different nodes, the I/O during data transmission can easily lead to performance bottlenecks. When executing subqueries, it is important to consider how to foster strengths and circumvent weaknesses.
The nested execution of subquery increases network I/O with the amount of data. In addition to slow execution, this also results in higher overall resource consumption and lower system capacity.
Unnesting can eliminate the time complexity of O(N^2) to O(N), avoiding a large amount of network I/O overhead.
If the data distribution meets the conditions, the subquery can be pushed down to the storage nodes for computing, which makes full use of the computing performance of the cluster and avoids the transmission of a large amount of scanned data at the network layer.
Therefore, in a distributed system, it is particularly important to convert more subqueries into joined executions.
In addition to unnesting, materialization can also be used to reduce network transmission data in connection computing.
As shown in the figure, when the number of data on one side of the connection is small, all or part of them can be removed (Semi can be partial while Anti must be full) as constants to the execution plan on the other side for processing.
Even for subqueries that cannot be converted to JOINs, the logical characteristics of the subquery can still be used to reduce the amount of data in network I/O.
Assuming that the condition of the subquery is E, the following filters can be added to the subquery side of each subquery of the Semi type Apply operator:
E OR (E IS NULL)
For E1 OR E2 OR E3 ...
Expressions with En set to FALSE can be ignored.
For the anti type, the following filter conditions can be added:
// E' = NOT E
E' OR (E' IS NOT NULL)
For E1 AND E2 AND E3 ...
Expressions with En set to TRUE can be ignored.
For example:
SELECT * FRMO R
WHERE ID IN (SELECT ID FROM R'
...)
SELECT * FRMO R
WHERE AGE>ALL(SELECT AGE FROM R'
...)
Note that the expression of Anti push-down needs to be reversed.
By doing so, unwanted data can be filtered out from the child nodes in advance. Even in the case of null sets, this filter push-down is valid.
The Cache is an optimization technique that never goes out of style, which has a great effect on Apply.
Take the simplest multi-layer Apply SQL as an example:
SELECT *
FROM T1
WHERE 100>
(SELECT col1
FROM T2
WHERE T1.pk = T2.pk
AND 100 >
(SELECT col2
FROM T3
WHERE T2.pk = T3.pk))
Assuming that T1, T2, and T3 all have1000 lines of data, the execution process of Apply shown as follow:
As the outermost primary table, T1 only needs to be scanned once with 1,000 rows of data, thus making T2 will be scanned 1,000 times by Apply operation. Similarly, T3 needs to be scanned 1,000 times for T2 each time. Finally, the number of T2 scans is 1000 and the number of T3 scans is 10^6.
This means that, for a thousand rows of data, a multi-layer Apply operation can lead to hundreds of thousands or even millions of network I/O operations. Even if Apply is an expected slow query, the Cost is unacceptable.
This Cache is mainly introduced to reduce the number of network I/O:
As shown in the preceding figure, the number of I/O requests to the Cache Node remains unchanged, but the IO level has been downgraded from network to memory. The three I/O tables on the network are each performed only once. Since the access speed of the memory is quite different from that of the network, no specific SQL RT data will be presented here.
In this article, a lot of subquery optimization techniques have been introduced such as Magic Set and window functions. We believe that these can give you some insight into subquery optimization. Meanwhile, traps in practice can be used as references for both databases and developers.
Finally, we have also shared some thoughts about subqueries in distributed databases are shared for inspiration.
Detailed Explanation of Query Rewriting Based on Materialized Views
A Detailed Explanation of the Detection and Processing of BigKey and HotKey in Redis
ApsaraDB - March 26, 2024
OceanBase - August 25, 2022
OceanBase - August 25, 2022
ApsaraDB - May 16, 2022
ApsaraDB - October 19, 2023
chenyucatcat - April 21, 2021
Leverage cloud-native database solutions dedicated for FinTech.
Learn MoreMigrate your legacy Oracle databases to Alibaba Cloud to save on long-term costs and take advantage of improved scalability, reliability, robust security, high performance, and cloud-native features.
Learn MoreMigrating to fully managed cloud databases brings a host of benefits including scalability, reliability, and cost efficiency.
Learn MoreDBStack is an all-in-one database management platform provided by Alibaba Cloud.
Learn MoreMore Posts by ApsaraDB