All Products
Search
Document Center

PolarDB:Correlated subquery pull-up

Last Updated:May 31, 2024

This topic describes the correlated subquery pull-up feature and its use cases.

Prerequisites

The feature is supported in PolarDB for PostgreSQL clusters that run PostgreSQL 14 (revision version: 14.8.11.0 or later).

Note

You can execute the following statement to view the revision version of your PolarDB for PostgreSQL cluster:

select version();

Background information

The PostgreSQL optimizer uses a SUBLINK to represent a combination of a subquery and the related operator in an expression. The following types of SUBLINKs are supported:

  • EXISTS_SUBLINK: implements an EXISTS (SELECT ...) subquery.

  • ALL_SUBLINK: implements an ALL (SELECT ...) subquery.

  • ANY_SUBLINK: implements an ANY (SELECT ...) or IN (SELECT ...) subquery.

The optimizer usually attempts to pull up correlated subqueries that are used with the ANY, IN, EXISTS, or NOT EXISTS operator. This way, an optimized execution plan that uses a semi-join or anti-join can be generated for a correlated subquery and its outer query, which improves query performance. If an ANY_SUBLINK subquery references the variables in the outer query, the subquery is not pulled up. As a result, the subquery is not optimized together with the outer query. The subquery can only be independently optimized, which results in a significant increase in SQL execution time.

PolarDB for PostgreSQL allows you to use a parameter to control the pull-up of correlated ANY_SUBLINK subqueries. If you set this parameter to ON, the optimizer pulls up a correlated subquery that is used with the IN or ANY operator even if the subquery references the variables in the outer query. This increases the search scope of the optimizer and allows the optimizer to generate a better execution plan.

Usage

You can configure the polar_enable_pullup_with_lateral parameter to specify whether to pull up correlated ANY_SUBLINK subqueries. Valid values:

  • ON (default)

  • OFF

Examples

Prepare data

CREATE TABLE t1 (a INT, b INT);
INSERT INTO t1 SELECT i, 1 FROM generate_series(1, 100000) i;
CREATE TABLE t2 AS SELECT * FROM t1;

View the execution plan and time after the correlated subquery pull-up feature is disabled

=> SET polar_enable_pullup_with_lateral TO OFF;

=> EXPLAIN (COSTS OFF, ANALYZE)
   SELECT * FROM t1
   WHERE t1.a IN (SELECT a FROM t2 WHERE t2.b = t1.b AND t2.b = 1);
                                   QUERY PLAN
---------------------------------------------------------------------------------
 Seq Scan on t1 (actual time=67.631..1641827.119 rows=100000 loops=1)
   Filter: (SubPlan 1)
   SubPlan 1
     ->  Result (actual time=0.005..13.124 rows=50000 loops=100000)
           One-Time Filter: (t1.b = 1)
           ->  Seq Scan on t2 (actual time=0.005..7.718 rows=50000 loops=100000)
                 Filter: (b = 1)
 Planning Time: 0.145 ms
 Execution Time: 1641847.702 ms
(9 rows)

View the execution plan and time after the correlated subquery pull-up feature is enabled

=> SET polar_enable_pullup_with_lateral TO ON;

=> EXPLAIN (COSTS OFF, ANALYZE)
   SELECT * FROM t1
   WHERE t1.a IN (SELECT a FROM t2 WHERE t2.b = t1.b AND t2.b = 1);
                                 QUERY PLAN
----------------------------------------------------------------------------
 Hash Semi Join (actual time=64.783..173.482 rows=100000 loops=1)
   Hash Cond: (t1.a = t2.a)
   ->  Seq Scan on t1 (actual time=0.016..25.440 rows=100000 loops=1)
         Filter: (b = 1)
   ->  Hash (actual time=64.550..64.551 rows=100000 loops=1)
         Buckets: 131072  Batches: 2  Memory Usage: 2976kB
         ->  Seq Scan on t2 (actual time=0.010..30.330 rows=100000 loops=1)
               Filter: (b = 1)
 Planning Time: 0.195 ms
 Execution Time: 178.050 ms
(10 rows)

As shown in the preceding example, the subquery and the outer query are optimized into a semi-join after the correlated subquery pull-up feature is enabled. The filter conditions in the subquery effectively filter the results of the outer query, and the execution time is significantly shortened. If the subquery is not pulled up, the rows to be returned to the outer query cannot be filtered by the subquery.