By digoal
For partitioned tables, each partition has at least one data file, so a table with thousands of partitions has thousands of files. In versions earlier than PostgreSQL 11, a full-table aggregation (for example, count) over such a table took a long time because the files had to be scanned one by one.
Parallel computing improves performance significantly. In PostgreSQL 12, aggregating a partitioned table with 2,048 partitions and about 200 million records takes only about one second.
The degree of parallelism for a partitioned table depends on its partitions. In the setup script below, parallel_workers is forced to 52 for each partition.
do language plpgsql $$
declare
begin
  drop table if exists p;
  create table p (id int, info text, crt_time timestamp) partition by range (crt_time);
  create table p2020 partition of p FOR VALUES FROM ('2020-01-01') TO ('2021-01-01') partition by hash (id);
  create table p2021 partition of p FOR VALUES FROM ('2021-01-01') TO ('2022-01-01') partition by hash (id);
  for i in 0..1023 loop
    execute format('create table p2020_%s partition of p2020 FOR VALUES WITH (MODULUS 1024, REMAINDER %s)', i, i);
    execute format('create table p2021_%s partition of p2021 FOR VALUES WITH (MODULUS 1024, REMAINDER %s)', i, i);
    execute format('alter table p2020_%s set (parallel_workers=52)', i);
    execute format('alter table p2021_%s set (parallel_workers=52)', i);
  end loop;
end;
$$;
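To confirm that the script created all 2,048 leaf partitions, a quick sanity check (using pg_partition_tree, available since PostgreSQL 12) is:

```sql
-- Count the leaf partitions under p; the script above should produce 2048.
select count(*) from pg_partition_tree('p') where isleaf;
```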
Write data with a pgbench stress test:
vi test.sql
\set id random(1,2000000000)
insert into p values (:id, random()::text, now());
pgbench -M prepared -n -r -P 1 -f ./test.sql -c 52 -j 52 -T 600
progress: 14.0 s, 459890.7 tps, lat 0.113 ms stddev 0.132
progress: 15.0 s, 439633.9 tps, lat 0.118 ms stddev 0.209
progress: 16.0 s, 440804.9 tps, lat 0.118 ms stddev 0.233
progress: 17.0 s, 444148.2 tps, lat 0.117 ms stddev 0.148
progress: 18.0 s, 463751.1 tps, lat 0.112 ms stddev 0.094
progress: 19.0 s, 469972.4 tps, lat 0.110 ms stddev 0.078
......
Force parallelism to obtain the best aggregation performance for all partitions:
db1=# show max_worker_processes ;
max_worker_processes
----------------------
128
(1 row)
set max_parallel_workers =32;
set max_parallel_workers_per_gather =32;
set parallel_setup_cost =0;
set parallel_tuple_cost =0;
set min_parallel_table_scan_size =0;
set min_parallel_index_scan_size =0;
set parallel_leader_participation=off;
Set the remaining parameter, enable_partitionwise_aggregate, based on the situation.
Situation 1:
set enable_partitionwise_aggregate=off;
db1=# explain select count(*) from p;
QUERY PLAN
------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=847577.04..847577.05 rows=1 width=8)
-> Gather (cost=847576.95..847576.96 rows=32 width=8)
Workers Planned: 32
-> Partial Aggregate (cost=847576.95..847576.96 rows=1 width=8)
-> Parallel Append (cost=0.00..840170.98 rows=2962388 width=0)
-> Parallel Seq Scan on p2020_25 (cost=0.00..920.11 rows=3011 width=0)
-> Parallel Seq Scan on p2020_4 (cost=0.00..905.94 rows=2994 width=0)
-> Parallel Seq Scan on p2020_42 (cost=0.00..898.65 rows=2965 width=0)
-> Parallel Seq Scan on p2020_49 (cost=0.00..891.59 rows=2859 width=0)
-> Parallel Seq Scan on p2020_3 (cost=0.00..890.98 rows=2998 width=0)
-> Parallel Seq Scan on p2020_13 (cost=0.00..890.25 rows=3025 width=0)
This plan starts multiple parallel workers that aggregate several partitions at a time. No single partition is aggregated in parallel by itself; instead, the partition scans are appended in parallel (Parallel Append) and fed into one partial aggregate. This method suits tables with many small partitions, such as partitions smaller than 1 GB.
Situation 2:
set enable_partitionwise_aggregate=on;
db1=# explain select count(*) from p;
QUERY PLAN
------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=832959.57..832959.58 rows=1 width=8)
-> Gather (cost=874.54..832795.73 rows=65536 width=8)
Workers Planned: 32
-> Parallel Append (cost=874.54..832795.73 rows=2048 width=8)
-> Partial Aggregate (cost=927.63..927.64 rows=1 width=8)
-> Parallel Seq Scan on p2020_25 (cost=0.00..920.11 rows=3011 width=0)
-> Partial Aggregate (cost=913.43..913.44 rows=1 width=8)
-> Parallel Seq Scan on p2020_4 (cost=0.00..905.94 rows=2994 width=0)
-> Partial Aggregate (cost=906.07..906.08 rows=1 width=8)
-> Parallel Seq Scan on p2020_42 (cost=0.00..898.65 rows=2965 width=0)
-> Partial Aggregate (cost=898.73..898.74 rows=1 width=8)
-> Parallel Seq Scan on p2020_49 (cost=0.00..891.59 rows=2859 width=0)
-> Partial Aggregate (cost=898.47..898.48 rows=1 width=8)
-> Parallel Seq Scan on p2020_3 (cost=0.00..890.98 rows=2998 width=0)
This plan computes one partition at a time: multiple workers scan within a partition, produce a partial aggregate for it, and the partial results are then combined in a final aggregation stage.
This method suits tables with fewer than 16 partitions where each partition is relatively large, such as larger than 10 GB.
You can enable or disable enable_partitionwise_aggregate based on measured performance. As a rule of thumb, the computation time should be larger than the scheduling overhead for parallelism to pay off.
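Since the choice depends on how many partitions there are and how large each one is, the leaf partition sizes can be checked directly (again using pg_partition_tree):

```sql
-- Show the largest leaf partitions; many small partitions favor
-- enable_partitionwise_aggregate=off, a few large ones favor =on.
select relid::regclass as partition,
       pg_size_pretty(pg_relation_size(relid)) as size
from pg_partition_tree('p')
where isleaf
order by pg_relation_size(relid) desc
limit 5;
```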
db1=# select count(*) from p;
count
-----------
212758999
(1 row)
Time: 1098.609 ms (00:01.099)
You can see that the count query is slow when run immediately after the write completes. Why?
db1=# select wait_event, wait_event_type from pg_stat_activity ;
wait_event | wait_event_type
---------------------+-----------------
AutoVacuumMain | Activity
LogicalLauncherMain | Activity
ExecuteGather | IPC
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
|
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
SLRURead | IO
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
|
BgWriterMain | Activity
CheckpointerMain | Activity
WalWriterMain | Activity
(40 rows)
As the session list shows, most backends are waiting on CLogControlLock. Why does this lock appear during queries?
The header of each record in PostgreSQL stores the transaction status (committed or rolled back) for that record, but the status is not updated at the moment the transaction commits or rolls back. Instead of updating the header of every record the transaction wrote, PostgreSQL only writes the status to the clog (2 bits per transaction). That raises a question: when is the transaction status in the record header updated?
The first time PostgreSQL touches the record (for example during select, vacuum, analyze, autovacuum, or update), it reads the transaction status from the corresponding bits in the clog and sets the transaction status mask (the hint bits) in the record header.
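These hint bits can be observed directly with the pageinspect extension. A minimal sketch, assuming one of the partitions created above (p2020_0) already contains data:

```sql
create extension if not exists pageinspect;
-- Bit 0x0100 of t_infomask is HEAP_XMIN_COMMITTED, the hint bit that
-- marks the inserting transaction as known to have committed.
select lp, (t_infomask & x'0100'::int) <> 0 as xmin_committed_hint
from heap_page_items(get_raw_page('p2020_0', 0))
limit 5;
```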
Moreover, if data block checksums are enabled or wal_log_hints=on is set, setting hint bits also generates WAL records, so even read-only queries can generate WAL.
Why is the first query slow? Because it has to update the hint bits in each record header.
After that update, queries are fast: PostgreSQL no longer needs to consult the clog to determine a transaction's outcome; it simply checks the mask in the record header.
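One way to avoid paying this cost in the first query is to set the hint bits right after the bulk load, for example with a vacuum, which touches every tuple:

```sql
vacuum analyze p;         -- sets hint bits (and refreshes planner statistics)
select count(*) from p;   -- no longer has to look up the clog
```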
A possible kernel improvement is a switch that controls whether hint bits are updated, at the table, partition, or cluster level. For example, some log or stream tables are dropped shortly after being written; for such tables it is better not to update hint bits at all and to read the transaction status from the clog during queries instead.
Improving the kernel's handling of CLogControlLock concurrency to avoid this contention would let parallel computing show its full capability.