An Introduction to Subqueries

By Fangwu

For better understanding of this article, we recommend you to read about the following concepts and principles in advance:

Relational algebra model in relational databases
Three-valued logic in database expression evaluation
Declarative programming language SQL that interacts with the database
Subquery and SemiJoin
Window functions and lossless join

An Overview of Subqueries

The definition of subquery is so simple, but the derived scenarios are so complex. It is both an expression and a query tree, which is flexible enough to become a constant and to express relationships between collections at any time. While writing subqueries in SQL is so natural, the way it really works isn't as straightforward.

According to the SQL standard, nesting a query block into an expression results in a subquery.

Query blocks in a relational database are usually expressed in a tree plan composed of relational algebraic operators. The relationship between the expressions of each operator and the outer-layer query tree is ambiguous.

The expression in the query tree hides another query tree. The essence of which lies in describing the relationships between query trees through expressions, such as user-defined functions, comparison operators, and logical operators. Due to a large number of expression types and their complexity, it is difficult to directly abstract the relationship between the subquery and the main query through traditional relational algebra. This makes it difficult to implement subqueries in relational databases.

Since subquery implementation is complex for databases, the execution efficiency may not be high. Why this SQL syntax has to be implemented? Is there any reason other than SQL standard?

... WHERE AGE > ALL(SELECT AGE ...)
... WHERE SALARY <= ANY(SELECT SALARY ...)

Each query tree is understood as a dataset. It is very obscure in some scenarios to use JOIN to describe the intersection and union between data. However, from the preceding SQL, the subquery description of AGE >ALL and SLARY

In a nutshell, a subquery is a classic example of leaving the complexity to the database and the simplicity to users.

Classification

To understand the subquery, classification is necessary. Semantically, it can be divided into the following aspects:

1) A scalar subquery returns a row of values. The two-dimensional table with rows and columns output by common operators is similar to a vector. Accordingly, a subquery that outputs only one row and one column is called a scalar.

SELECT c_custkey
FROM CUSTOMER
WHERE 1000000 < (
 SELECT SUM(o_totalprice)
 FROM ORDERS
 WHERE o_custkey = c_custkey
)

2) Quantified comparison judges whether ALL or ANY item is met.

SELECT c_name
FROM CUSTOMER
WHERE c_nationkey <> ALL (SELECT s_nationkey FROM SUPPLIER)

3) Existence test judges the EXISTS or NOT EXISTS.

SELECT c_custkey
FROM CUSTOMER
WHERE c_nationkey = 86 AND EXISTS(
 SELECT * FROM ORDERS
 WHERE o_custkey = c_custkey
)

It is easier to understand the semantic division, but the database will abstract another division based on a logical operational perspective:

Scalar: commonly known as scalar subquery, output one line to the outside
Semi: (EXISTS, IN, ANY) can be used to dissect data into logical relations. For example, the subquery of AGE IN (SELECT AGE) can be expressed as a.age=b.age[0] OR a.age=b.age[1] OR ... OR a.age=b.age[n].
Anti-Semi: (NOT EXISTS, NOT IN, ALL) can be broken down into logical relationships that are merged together. For example, the subquery of AGE >ALL (SELECT AGE) can be expressed as a.age>b.age[0] AND a.age>b.age[1] AND ... AND a.age>b.age[n].

This does not mean that a subquery is a common expression after it is logically disorganized. In expression evaluation of a relational database, functions try to use rows as the input and output.

A function that processes a set is usually abstracted into a new operator. For example, an aggregation function is processed by an aggregate operator alone. Similarly, a subquery should also be abstracted as an individual operator. The boundary division when operators and expressions are cut follows the same logic.

Boundary

The database engine does not confuse complex dataset operations with expression evaluation. In addition, the boundary between set operations and row operations is clearly defined based on the implementation complexity.

Scalar: Since it has only one line of output, the boundary of the subquery can be divided into the query block.

SEMI/ANTI: A query block can output multiple rows. The output can be limited to one row of boolean values only with the ANY and ALL expressions of the subquery. Therefore, the boundary must contain the input parameters preceding the subquery expression.

Finally, based on whether the subquery itself has a related item, it can also be divided into correlated subquery and non-correlated subquery. Any non-correlated subquery can be regarded as a constant in nature, while the data relationship between the inner and outer layers of the subquery must be considered when processing the correlated subquery. The processing of this data relationship is the focus of subquery optimization.

Optimization Technique

Will subqueries in SQL statements lead to a sharp drop in query efficiency?

As a declarative programming language, SQL only describes what data it needs. The specific operation depends on the performance of the database itself. With the development of query optimization techniques, it is too arbitrary to say that subqueries will inevitably lead to performance degradation.

The essence of correlated subqueries is the operation of each row of data on the foreign table with the data of the subquery collection. When there are too many rows of data in the foreign table, this nesting process will inevitably lead to low performance. Therefore, a very important step in subquery optimization is "unnesting". Today's unnesting technology has become mature, and HyPer announced a paper Unnesting Arbitrary Queries in 2015.

Unnesting

As mentioned above, the subquery should be separated from the expression and displayed as an operator in the execution plan because it processes collection data. This operator, which refers to a subquery, is generally called Apply.

The name "Apply" comes from list processing (LISP), which is a special type of function. The input parameter is a set of parameter lists, and the returned value is a corresponding set of result lists. From the semantic point of view of relational algebra database, the process is similar to the nested execution of correlated subqueries, which was firstly proposed by the paper of Microsoft SQL Server.

After a subquery is converted to an Apply operator, the correlated items remain in the query tree of the subquery. Therefore, there is no way to directly process Apply through JOIN.

Therefore, the most important step for subquery optimization is unnesting.

SELECT c_custkey
FROM CUSTOMER
WHERE 1000000 < (
 SELECT SUM(o_totalprice)
 FROM ORDERS
 WHERE o_custkey = c_custkey
)

Take the SQL statement as an example to see why unnesting is necessary for correlated subqueries.

The following figure shows the original relation tree that is not unnesting. In actual running mode, the executor runs the query tree on the right of Apply for each row of data. By doing so, the processing time consumed by nested processing of correlated subqueries increases linearly with the increase of the data volume. If multiple subqueries are nested, the processing time increases exponentially. To avoid poor experience due to long and time-consuming operations, unnesting is crucial.

Unnesting Based on Rule

In the 1980s and 1990s, the SQL standard expanded the scope of subqueries. Rule-based unnesting was one of the mainstream researches at that time.

The paper Orthogonal Optimization of Subqueries and Aggregation published in 2001 is a collection of rule-based unnesting, in which nine conversion rules are concluded:

The two transformations in the following figure correspond to Rule 3 and Rule 1 respectively.

The conversion in the preceding example takes effect on certain conditions. The key difference is the AND logical operator that matches the relationship between JOIN and Filter.

As an expression, the subquery acts as a filter in the Filter.
The logical relationship between the nodes that act as filters from the bottom to the top of the execution plan such as Filter and JOIN is AND.
After the subquery is converted to SemiJoin, subqueries are superimposed with the upper-layer Filter nodes. The AND relationship between the extracted expressions in the Filter will not be destroyed, but the OR relationship will be destroyed. The specific example is shown below.

Rule-based transformations cannot process subqueries in all schemas, for example:

// Complex example 1: Subquery in a destruction
SELECT * FROM
  supplier a     
WHERE        
  s_address IN (select s_name from supplier where a.s_phone=s_phone)
  OR s_name='1'

// Complex example 2: A subquery contains aggregation and non-equivalent items
SELECT * FROM T1
WHERE AGE >
      (SELECT AVG(AGE) FROM T2 WHERE T1.NAME!=NAME)
      
// Complex example 3: Nested subqueries
...
(SELECT * FROM T1
WHERE ID IN
      (SELECT ID FROM T2 WHERE T3.NAME=NAME))
...

Unnesting Based on Magic Set

Magic Set is a very old data processing technology, which was first applied in a deductive database. Currently, it also plays an important role in the subquery unnesting of relational databases.

In 1996, Cost-Based Optimization for Magic: Algebra and Implementation published by IBM DB2 introduced Magic Set as a relational algebraic operator into CBO. In 2015, HyPer developed its own side-ways information passing optimization based on this technology to process the unnesting of all types of subqueries. In the execution plan displayed on the HyPer official website, a similar operator is named Magic.

Consider such a SQL statement as

SELECT s.name, e.course
FROM students s, exams e
WHERE s.id=e.sid AND
     (s.major = 'CS' or s.major = 'Games Eng') AND
     e.grade >= (SELECT avg(e2.grade)+1 --one grade worse
                 FROM exams e2 --than the average grade
                 WHERE s.id=e2.sid OR --of exams taken by
                      (e2.curriculum=s.major AND --him/her or taken
                       s.year>e2.date)) --by elder peers

The unnesting of this SQL statement is mainly difficult to have a non-equivalent correlated item s.year>e2.date, which will lead to failure to avoid JOIN computing in the filter layer where the correlated item is located.

Figure from HyPer's paper Unnesting Arbitrary Queries

HyPer attempted to copy the data of the columns involved in the correlated items and introduce the data to the subquery in his paper. In addition, it replaced correlated items through JOIN, thus achieving unnesting.

From an implementation point of view, unnesting of subqueries are at the cost of some space and additional JOIN computing.

SemiJoin Operator Derivation

Databases often use the SemiJoin operator clusters to express unnesting subqueries. Here, another issue that the definition of Apply and SemiJoin of the relational algebra cannot be completely equivalent will occur.

SemiJoin, which indicates that only the columns of one table are output, and the columns of other tables will not be output to the upper layer.

// Complex example 1: Subquery in disjunction
SELECT * FROM
  supplier a     
WHERE        
  s_address IN (select s_name from supplier where a.s_phone=s_phone)
  OR s_name='1'

Considering the preceding SQL statements, the subquery is placed between OR expressions. As a result, it cannot be converted to a simple SemiJoin condition because the filtering relationship superimposed with Filter and SemiJoin is different from OR.

Mark Join: In addition to outputting the connected data, it also keeps a Mark to mark the connected results of this row, such as TRUE, FALSE, or NULL.

For this scenario, HyPer introduces the Mark Join to replace SemiJoin.

In the Filter for the Mark Join upper layer, an expression of markvalue OR sname='1' is formed, avoiding violation with OR semantics by adding a column of the output.

Adopting the Mark mechanism, which allows JOIN to output an extra column, not only breaks the relational algebraic meaning of JOIN but also requires a massive transformation at the execution level. However, in addition to the OR subquery in the preceding example, it also supports subqueries of the Project and Scalar classes. This method is of great significance.

As of HyPer's 2015 paper Unnesting Arbitrary Queries, traditional database vendors such as SQL Server, Oracle did not support non-Scalar subqueries in Project, and support for unnesting of complex subqueries was very limited as well. HyPer, as the first database that claims to be able to perform unnesting for all subqueries, is indeed a very radical practice.

Optimization of Subquery Rewriting

Oracle's 2009 publication, Enhanced subquery optimizations in Oracle, shows the vast number of subquery rewriting optimizations available. These technologies are specifically optimized based on subqueries in the TPC-H. It is estimated that Oracle programmers wrote these technologies when they made the first TPC-H.

These rewriting technologies have high requirements for parameter extraction and derivation.

Merging Between Subqueries

The technology of merging multiple similar subqueries, such as Q2 to Q3, Q4 to Q5, is shown in the following figure.

Merging a Subquery with a Primary Table

A subquery can be merged with the primary table as well as with similar subqueries, for example, Q8 to Q11.

Window Functions Optimization

A technique for optimizing subqueries based on window functions was first proposed by IBM in WinMagic: Subquery Elimination Using Window Aggregation. In Oracle, the window function is rewritten as one of the representatives of its subquery rewriting as well.

Window functions optimization: Triggered in the phase of the rule-based optimizer (RBO). When certain conditions are met, the connection relationship with aggregate operation is converted into window functions. This is a join or subquery rewriting that is applied to avoid a full table scan.

In short, the subquery with aggregate operation is rewritten as window functions. The rewriting assumes that external query blocks contain all the tables of subqueries as well as filter conditions. For example:

Q1:
SELECT T1.X
FROM T1,T2
WHERE T1.y = T2.y and
            T2.name='ming' and
          T2.z relop (SELECT AGG(T2.w)
                             FROM  T2
                         WHERE T2.y = T1.y and
                                       T2.name='ming');

The external query block contains two tables, T1 and T2, and contains all subquery tables (T2) in the subquery. It also contains all the filter conditions (T2.name='ming'). "relop" is short for relational operators.

After the above conditions are met, Q1 can be rewritten as Q2:

Q2:
SELECT T1.x
FROM T1,(
                SELECT AGG (T2.w) OVER (PARTITION BY y) AS win_agg,
                 z,
                 y
                FROM T2
              WHERE T2.name='ming' ) V
WHERE T1.y=V.y and 
      V.z relop win_agg

If the connection between T1 and T2 is lossless join, Q1 can be converted to Q3:

Q3:
SELECT V.x
FROM (
      SELECT T1.x, T2.z,
             AGG (T2.w) OVER (PARTITION BY T2.y) AS win_agg
      FROM T1, T2
      WHERE T1.y = T2.y ) V
WHERE V.z relop win_agg

The lossless join is not discussed here. In fact, the difference between Q2 and Q3 can be understood as the reorder of aggregate and JOIN.

After Q1 is rewritten into Q2, it is determined in CBO whether to be converted into the form of Q3 based on the cost. Here is a specific example Q4 (TPC-H Q17):

Q4:
SELECT sum(l_extendedprice) / 7.0 AS avg_yearly
FROM
  lineitem,
  part
WHERE
  p_partkey = l_partkey
  AND p_brand = 'Brand#23'
  AND p_container = 'MED BOX'
  AND l_quantity < (
    SELECT 0.2 * avg(`l_quantity`)
    FROM
      lineitem
    WHERE
      l_partkey = p_partkey
  );

In the scenario with 50 gigabytes, the computing workload for each plan is as follows:

After the original execution plan is rewritten to window functions, obviously, a scan of 10^8 levels is reduced.

Compared with the two execution plans rewritten as window functions, the Cost of the Plan 2 window functions is much smaller according to the number of rows processed. However, to be converted to Plan 2, the following decisions similar to aggregate and JOIN reorder should be made. In the lossless join scenario, Plan 2 can be directly converted. Q4 parts and line items are connected with foreign keys. Therefore, this condition is met.

In addition, the JOIN of Plan 2 uses the Batch Key Access algorithm, which is equivalent to performing 10^4 index scans on the line item table. Thus, the Cost is extremely low.

The optimization of the window function significantly improves the RT of TPC-H Q17 and Q2. in distributed databases, the optimization of the measured window functions plus BKA increases the RT of Q17 by nearly a hundred times.

Execution Traps

Technically, query optimization is not the whole of the subquery. From the perspective of implementation, it is even more difficult to prevent traps in the execution.

Count

Count traps mainly exist in Count subqueries without group by. Consider the following SQL:

 
SELECT * FROM T1
WHERE T1.NUM = (SELECT COUNT(ID) FROM T2
                WHERE T1.AGE=AGE)

If the above SQL statement is converted into the following execution tree, when the number of certain ages of T2 is 0, the result of SemiJoin will be Null because of T1.age=T2.age expression, thus the correct result cannot be output. This is essentially due to the particularity of the COUNT aggregate function. To resolve the problem, a node must be output Null rows in the JOIN node, which is similar to the LEFT type.

Other aggregate functions also have different problems. For example, >ALL subqueries cannot be directly converted to >MAX, because >ALL returns TRUE and >MAX returns FALSE if the result set is null.

Null-Aware Anti Join

Observe the following two subqueries that output people with different names or ages in the USER table:

// SQL1
SELECT * FROM USER T1
WHERE AGE NOT IN (SELECT AGE FROM USER T2
                 WHERE T1.NAME=T2.NAME);
             
// SQL2            
SELECT * FROM USER T1
WHERE NOT EXISTS (SELECT 1 FROM USER T2
                 WHERE T1.NAME=T2.NAME AND T1.AGE=AGE);

It seems that both subqueries are equivalent. In fact, if a row in the USER table has an age that is NULL, SQL 1 will not output the data, while SQL 2 will.

NOT EXISTS is equivalent to Anti Join. When the Anti Join is used to process the ON condition, the execution result of the executor is null. If this is the case, no match will be found. The problem is that the operand of the not in subquery is also converted into one of the ON conditions, and is connected with the correlated item through AND. By doing so, the correctness problem may occur for multiple outputs.

If this is the case, the not in subquery may not be converted to a join. For example, PostgreSQL keeps the original processing method for processing the not in subquery.

However, another option is mentioned in the article Enhanced Subquery Optimizations in Oracle, which mentions that a new operator, Null-Aware Anti Join (NAAJ). Again, taking SQL 1 as an example, the NAAJ processing algorithm is as follows:

Returns all rows if T2 is null
If any row of Age for T2 is Null, no rows are returned.
If the Age of T1 is Null in a row, the row is not returned
For each row of T1, if the NA condition execution is Null or TRUE, the row is not returned; Otherwise, returned.

Note that the NA condition in NAAJ is reversed: NOT IN -> NA= >ALL -> NA<=

Subquery in Distributed Databases

The advantage of distributed databases is that more computing and storage resources can be applied. However, as data is distributed across different nodes, the I/O during data transmission can easily lead to performance bottlenecks. When executing subqueries, it is important to consider how to foster strengths and circumvent weaknesses.

Leverage the computing performance of the cluster
Reduce network I/O overhead

Unnesting Is More Important

The nested execution of subquery increases network I/O with the amount of data. In addition to slow execution, this also results in higher overall resource consumption and lower system capacity.

Unnesting can eliminate the time complexity of O(N^2) to O(N), avoiding a large amount of network I/O overhead.

If the data distribution meets the conditions, the subquery can be pushed down to the storage nodes for computing, which makes full use of the computing performance of the cluster and avoids the transmission of a large amount of scanned data at the network layer.

Therefore, in a distributed system, it is particularly important to convert more subqueries into joined executions.

Different Applications of Materialization

In addition to unnesting, materialization can also be used to reduce network transmission data in connection computing.

As shown in the figure, when the number of data on one side of the connection is small, all or part of them can be removed (Semi can be partial while Anti must be full) as constants to the execution plan on the other side for processing.

PreFilter of Apply

Even for subqueries that cannot be converted to JOINs, the logical characteristics of the subquery can still be used to reduce the amount of data in network I/O.

SEMI: Expressions are connected by using OR.

Assuming that the condition of the subquery is E, the following filters can be added to the subquery side of each subquery of the Semi type Apply operator:

E OR (E IS NULL)

For E1 OR E2 OR E3 ...
Expressions with En set to FALSE can be ignored.

ANTI: Expressions are connected through AND.

For the anti type, the following filter conditions can be added:

// E' = NOT E
E' OR (E' IS NOT NULL)
       
For E1 AND E2 AND E3 ...
Expressions with En set to TRUE can be ignored.

For example:

SELECT * FRMO R 
WHERE ID IN (SELECT ID FROM R'
             ...)

SELECT * FRMO R 
WHERE AGE>ALL(SELECT AGE FROM R'
             ...)

Note that the expression of Anti push-down needs to be reversed.

By doing so, unwanted data can be filtered out from the child nodes in advance. Even in the case of null sets, this filter push-down is valid.

Cache

Cache: Nested execution of correlated subqueries can lead to significant performance overhead in scenarios that cannot be transformed into JOINs. Especially in distributed scenarios, more I/O occurs with the increase of data volume. In multi-level nesting scenarios, thousands of data can lead to minute-long query times. Proper use of caching can greatly avoid I/O consumption due to data growth, thus allowing the original semantics of subqueries to be executed quickly as well. The CBO can determine based on the data volume. If necessary, a Cache on the execution layer will be added, thus the network I/O is downgraded to memory.

The Cache is an optimization technique that never goes out of style, which has a great effect on Apply.

Take the simplest multi-layer Apply SQL as an example:

SELECT *
FROM T1
WHERE 100>
    (SELECT col1
     FROM T2
     WHERE T1.pk = T2.pk
       AND 100 >
         (SELECT col2
          FROM T3
          WHERE T2.pk = T3.pk))

Assuming that T1, T2, and T3 all have1000 lines of data, the execution process of Apply shown as follow:

As the outermost primary table, T1 only needs to be scanned once with 1,000 rows of data, thus making T2 will be scanned 1,000 times by Apply operation. Similarly, T3 needs to be scanned 1,000 times for T2 each time. Finally, the number of T2 scans is 1000 and the number of T3 scans is 10^6.

This means that, for a thousand rows of data, a multi-layer Apply operation can lead to hundreds of thousands or even millions of network I/O operations. Even if Apply is an expected slow query, the Cost is unacceptable.

This Cache is mainly introduced to reduce the number of network I/O:

As shown in the preceding figure, the number of I/O requests to the Cache Node remains unchanged, but the IO level has been downgraded from network to memory. The three I/O tables on the network are each performed only once. Since the access speed of the memory is quite different from that of the network, no specific SQL RT data will be presented here.

Summary

In this article, a lot of subquery optimization techniques have been introduced such as Magic Set and window functions. We believe that these can give you some insight into subquery optimization. Meanwhile, traps in practice can be used as references for both databases and developers.

Finally, we have also shared some thoughts about subqueries in distributed databases are shared for inspiration.

References

Parameterized Queries and Nesting Equivalencies - C Galindo-Legaria
Orthogonal Optimization of Subqueries and Aggregation - C Galindo-Legaria, M Joshi
Unnesting Arbitrary Queries - T Neumann, A Kemper
The Complete Story of Joins (inHyPer) - T Neumann, V Leis, A Kemper
Enhanced Subquery Optimizations in Oracle - S Bellamkonda
WinMagic: Subquery Elimination Using Window Aggregation - CZHPW Ma
Unnesting SQL Queries in the Presence of Disjunction - P.Parízek

Community