Partition-wise join

Updated at: 2024-05-17 01:53

PolarDB for PostgreSQL provides the partition-wise join feature to reduce the number of invalid joins between partitions and improve the performance of join queries.

Overview

The partition-wise join feature divides a large join into smaller joins between a pair of partitions from the two joined tables. It is used to optimize the joins between two partitions. If you want to use partition keys to join partitions, you can use the partition-wise join feature to reduce invalid joins between the partitions and improve the performance of join queries.

image.png

Usage notes

You can execute the following statement to enable partition-wise join:

set enable_partitionwise_join to on;

Examples

This section provides two examples to help you understand the partition-wise join feature.

In the following example, the measurement and sales tables are used to describe the partition-wise join feature.

CREATE TABLE measurement(
    city_id         int not null,
    logdate         date not null,
    peaktemp        int,
    unitsales       int
) PARTITION BY RANGE (logdate);
CREATE TABLE measurement_y2023q1 PARTITION OF measurement
    FOR VALUES FROM ('2023-01-01') TO ('2023-04-01');
CREATE TABLE measurement_y2023q2 PARTITION OF measurement
    FOR VALUES FROM ('2023-04-01') TO ('2023-07-01');
CREATE TABLE measurement_y2023q3 PARTITION OF measurement
    FOR VALUES FROM ('2023-07-01') TO ('2023-10-01');
CREATE TABLE measurement_y2023q4 PARTITION OF measurement
    FOR VALUES FROM ('2023-10-01') TO ('2024-04-01');
    
CREATE TABLE sales (
    dept_no     number,   
    part_no     varchar2,
    country     varchar2(20),
    date        date,
    amount      number
) PARTITION BY RANGE (date);
CREATE TABLE sales_y2023q1 PARTITION OF sales
    FOR VALUES FROM ('2023-01-01') TO ('2023-04-01');
CREATE TABLE sales_y2023q2 PARTITION OF sales
    FOR VALUES FROM ('2023-04-01') TO ('2023-07-01');
CREATE TABLE sales_y2023q3 PARTITION OF sales
    FOR VALUES FROM ('2023-07-01') TO ('2023-10-01');
CREATE TABLE sales_y2023q4 PARTITION OF sales
    FOR VALUES FROM ('2023-10-01') TO ('2024-04-01');

In the preceding statement:

  • The measurement table has the measurement_y2023q1, measurement_y2023q2, measurement_y2023q3, and measurement_y2023q4 partitions, corresponding to the four quarters of 2023, respectively.

  • The sales table has the sales_y2023q1, sales_y2023q2, sales_y2023q3, and sales_y2023q4 partitions, corresponding to the four quarters of 2023, respectively.

Execute the SQL query statement for the join between measurement and sales and view the query plan.

explain select a.* from sales a join measurement b on a.date = b.logdate where b.unitsales > 10;

When partition-wise join is disabled, measurement and sales are fully joined and the following query plan is used:

                                            QUERY PLAN                                            
--------------------------------------------------------------------------------------------------
 Aggregate  (cost=871.75..871.76 rows=1 width=8)
   ->  Merge Join  (cost=448.58..812.79 rows=23587 width=32)
         Merge Cond: (a.date = b.logdate)
         ->  Sort  (cost=185.83..191.03 rows=2080 width=40)
               Sort Key: a.date
               ->  Append  (cost=0.00..71.20 rows=2080 width=40)
                     ->  Seq Scan on sales_y2023q1 a  (cost=0.00..15.20 rows=520 width=40)
                     ->  Seq Scan on sales_y2023q2 a_1  (cost=0.00..15.20 rows=520 width=40)
                     ->  Seq Scan on sales_y2023q3 a_2  (cost=0.00..15.20 rows=520 width=40)
                     ->  Seq Scan on sales_y2023q4 a_3  (cost=0.00..15.20 rows=520 width=40)
         ->  Sort  (cost=262.75..268.42 rows=2268 width=8)
               Sort Key: b.logdate
               ->  Append  (cost=0.00..136.34 rows=2268 width=8)
                     ->  Seq Scan on measurement_y2023q1 b  (cost=0.00..31.25 rows=567 width=8)
                           Filter: (unitsales > 10)
                     ->  Seq Scan on measurement_y2023q2 b_1  (cost=0.00..31.25 rows=567 width=8)
                           Filter: (unitsales > 10)
                     ->  Seq Scan on measurement_y2023q3 b_2  (cost=0.00..31.25 rows=567 width=8)
                           Filter: (unitsales > 10)
                     ->  Seq Scan on measurement_y2023q4 b_3  (cost=0.00..31.25 rows=567 width=8)
                           Filter: (unitsales > 10)
(21 rows)

The query plan is designed to query data on the full join between measurement and sales. In this case, invalid joins exist. The join between sales_y2023q1 and measurement_y2023q3 is empty. This is because the join condition is an equi-join between the partition keys of sales_y2023q1 and measurement_y2023q3, but their partition keys mismatch. Query results are returned only if the partition keys of sales_y2023q1 and measurement_y2023q1 match.

Then, execute the following statement to enable partition-wise join:

set enable_partitionwise_join to on;

Execute the SQL query statement for the join between measurement and sales and view the query plan.

explain select a.* from sales a join measurement b on a.date = b.logdate where b.unitsales > 10;
                                       QUERY PLAN                                       
----------------------------------------------------------------------------------------
 Append  (cost=21.70..453.33 rows=5896 width=128)
   ->  Hash Join  (cost=21.70..105.96 rows=1474 width=128)
         Hash Cond: (b.logdate = a.date)
         ->  Seq Scan on measurement_y2023q1 b  (cost=0.00..31.25 rows=567 width=8)
               Filter: (unitsales > 10)
         ->  Hash  (cost=15.20..15.20 rows=520 width=128)
               ->  Seq Scan on sales_y2023q1 a  (cost=0.00..15.20 rows=520 width=128)
   ->  Hash Join  (cost=21.70..105.96 rows=1474 width=128)
         Hash Cond: (b_1.logdate = a_1.date)
         ->  Seq Scan on measurement_y2023q2 b_1  (cost=0.00..31.25 rows=567 width=8)
               Filter: (unitsales > 10)
         ->  Hash  (cost=15.20..15.20 rows=520 width=128)
               ->  Seq Scan on sales_y2023q2 a_1  (cost=0.00..15.20 rows=520 width=128)
   ->  Hash Join  (cost=21.70..105.96 rows=1474 width=128)
         Hash Cond: (b_2.logdate = a_2.date)
         ->  Seq Scan on measurement_y2023q3 b_2  (cost=0.00..31.25 rows=567 width=8)
               Filter: (unitsales > 10)
         ->  Hash  (cost=15.20..15.20 rows=520 width=128)
               ->  Seq Scan on sales_y2023q3 a_2  (cost=0.00..15.20 rows=520 width=128)
   ->  Hash Join  (cost=21.70..105.96 rows=1474 width=128)
         Hash Cond: (b_3.logdate = a_3.date)
         ->  Seq Scan on measurement_y2023q4 b_3  (cost=0.00..31.25 rows=567 width=8)
               Filter: (unitsales > 10)
         ->  Hash  (cost=15.20..15.20 rows=520 width=128)
               ->  Seq Scan on sales_y2023q4 a_3  (cost=0.00..15.20 rows=520 width=128)
(25 rows)

After partition-wise join is enabled, joins are created only between sales_y2023q2 and measurement_y2023q2, between sales_y2023q3 and measurement_y2023q3, and between sales_y2023q4 and measurement_y2023q4. Invalid joins between partitions are reduced, which significantly improves the performance of join queries.

  • On this page (1, T)
  • Overview
  • Usage notes
  • Examples
Feedback