By digoal
For partitioned tables, each partition has at least one data file, so a table with thousands of partitions has thousands of files. In versions earlier than PostgreSQL 11, a full-table aggregation (for example, count) over such a table took a long time because the files had to be scanned one by one.
Parallel computing improves performance significantly. In PostgreSQL 12, aggregating a partitioned table with 2,048 partitions and about 200 million records takes only about one second.
The degree of parallelism for a partitioned table depends on its partitions. In the setup script below, parallel_workers is forced to 52 for each partition.
do language plpgsql $$
declare
begin
  drop table if exists p;
  create table p (id int, info text, crt_time timestamp) partition by range (crt_time);
  create table p2020 partition of p FOR VALUES FROM ('2020-01-01') TO ('2021-01-01') partition by hash (id);
  create table p2021 partition of p FOR VALUES FROM ('2021-01-01') TO ('2022-01-01') partition by hash (id);
  for i in 0..1023 loop
    execute format('create table p2020_%s partition of p2020 FOR VALUES WITH (MODULUS 1024, REMAINDER %s)', i, i);
    execute format('create table p2021_%s partition of p2021 FOR VALUES WITH (MODULUS 1024, REMAINDER %s)', i, i);
    execute format('alter table p2020_%s set (parallel_workers=52)', i);
    execute format('alter table p2021_%s set (parallel_workers=52)', i);
  end loop;
end;
$$;
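To confirm that the script created all 2,048 leaf partitions, a quick sanity check (using pg_partition_tree, available since PostgreSQL 12) is:

```sql
-- Count the leaf partitions under p; the script above should produce 2048.
select count(*) from pg_partition_tree('p') where isleaf;
```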
Write data with a pgbench stress test:
vi test.sql
\set id random(1,2000000000)
insert into p values (:id, random()::text, now());
pgbench -M prepared -n -r -P 1 -f ./test.sql -c 52 -j 52 -T 600
progress: 14.0 s, 459890.7 tps, lat 0.113 ms stddev 0.132
progress: 15.0 s, 439633.9 tps, lat 0.118 ms stddev 0.209
progress: 16.0 s, 440804.9 tps, lat 0.118 ms stddev 0.233
progress: 17.0 s, 444148.2 tps, lat 0.117 ms stddev 0.148
progress: 18.0 s, 463751.1 tps, lat 0.112 ms stddev 0.094
progress: 19.0 s, 469972.4 tps, lat 0.110 ms stddev 0.078
......
Force parallelism to obtain the best aggregation performance for all partitions:
db1=# show max_worker_processes ;
max_worker_processes
----------------------
128
(1 row)
set max_parallel_workers =32;
set max_parallel_workers_per_gather =32;
set parallel_setup_cost =0;
set parallel_tuple_cost =0;
set min_parallel_table_scan_size =0;
set min_parallel_index_scan_size =0;
set parallel_leader_participation=off;
Set the remaining parameter, enable_partitionwise_aggregate, based on the situation.
Situation 1:
set enable_partitionwise_aggregate=off;
db1=# explain select count(*) from p;
QUERY PLAN
------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=847577.04..847577.05 rows=1 width=8)
-> Gather (cost=847576.95..847576.96 rows=32 width=8)
Workers Planned: 32
-> Partial Aggregate (cost=847576.95..847576.96 rows=1 width=8)
-> Parallel Append (cost=0.00..840170.98 rows=2962388 width=0)
-> Parallel Seq Scan on p2020_25 (cost=0.00..920.11 rows=3011 width=0)
-> Parallel Seq Scan on p2020_4 (cost=0.00..905.94 rows=2994 width=0)
-> Parallel Seq Scan on p2020_42 (cost=0.00..898.65 rows=2965 width=0)
-> Parallel Seq Scan on p2020_49 (cost=0.00..891.59 rows=2859 width=0)
-> Parallel Seq Scan on p2020_3 (cost=0.00..890.98 rows=2998 width=0)
-> Parallel Seq Scan on p2020_13 (cost=0.00..890.25 rows=3025 width=0)
This plan starts multiple parallel workers that aggregate several partitions at a time. No single partition is aggregated in parallel by itself; instead, the partition scans are appended in parallel (Parallel Append) and fed into one partial aggregate. This method suits tables with many small partitions, such as partitions smaller than 1 GB.
Situation 2:
set enable_partitionwise_aggregate=on;
db1=# explain select count(*) from p;
QUERY PLAN
------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=832959.57..832959.58 rows=1 width=8)
-> Gather (cost=874.54..832795.73 rows=65536 width=8)
Workers Planned: 32
-> Parallel Append (cost=874.54..832795.73 rows=2048 width=8)
-> Partial Aggregate (cost=927.63..927.64 rows=1 width=8)
-> Parallel Seq Scan on p2020_25 (cost=0.00..920.11 rows=3011 width=0)
-> Partial Aggregate (cost=913.43..913.44 rows=1 width=8)
-> Parallel Seq Scan on p2020_4 (cost=0.00..905.94 rows=2994 width=0)
-> Partial Aggregate (cost=906.07..906.08 rows=1 width=8)
-> Parallel Seq Scan on p2020_42 (cost=0.00..898.65 rows=2965 width=0)
-> Partial Aggregate (cost=898.73..898.74 rows=1 width=8)
-> Parallel Seq Scan on p2020_49 (cost=0.00..891.59 rows=2859 width=0)
-> Partial Aggregate (cost=898.47..898.48 rows=1 width=8)
-> Parallel Seq Scan on p2020_3 (cost=0.00..890.98 rows=2998 width=0)
This plan computes one partition at a time: multiple workers scan within a partition, produce a partial aggregate for it, and the partial results are then combined in a final aggregation stage.
This method suits tables with fewer than 16 partitions where each partition is relatively large, such as larger than 10 GB.
You can enable or disable enable_partitionwise_aggregate based on measured performance. As a rule of thumb, the computation time should be larger than the scheduling overhead for parallelism to pay off.
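Since the choice depends on how many partitions there are and how large each one is, the leaf partition sizes can be checked directly (again using pg_partition_tree):

```sql
-- Show the largest leaf partitions; many small partitions favor
-- enable_partitionwise_aggregate=off, a few large ones favor =on.
select relid::regclass as partition,
       pg_size_pretty(pg_relation_size(relid)) as size
from pg_partition_tree('p')
where isleaf
order by pg_relation_size(relid) desc
limit 5;
```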
db1=# select count(*) from p;
count
-----------
212758999
(1 row)
Time: 1098.609 ms (00:01.099)
You can see that the count query is slow when run immediately after the write completes. Why?
db1=# select wait_event, wait_event_type from pg_stat_activity ;
wait_event | wait_event_type
---------------------+-----------------
AutoVacuumMain | Activity
LogicalLauncherMain | Activity
ExecuteGather | IPC
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
|
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
SLRURead | IO
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
|
BgWriterMain | Activity
CheckpointerMain | Activity
WalWriterMain | Activity
(40 rows)
As the session list shows, most backends are waiting on CLogControlLock. Why does this lock appear during queries?
The header of each record in PostgreSQL stores the transaction status (committed or rolled back) for that record, but the status is not updated at the moment the transaction commits or rolls back. Instead of updating the header of every record the transaction wrote, PostgreSQL only writes the status to the clog (2 bits per transaction). That raises a question: when is the transaction status in the record header updated?
The first time PostgreSQL touches the record (for example during select, vacuum, analyze, autovacuum, or update), it reads the transaction status from the corresponding bits in the clog and sets the transaction status mask (the hint bits) in the record header.
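These hint bits can be observed directly with the pageinspect extension. A minimal sketch, assuming one of the partitions created above (p2020_0) already contains data:

```sql
create extension if not exists pageinspect;
-- Bit 0x0100 of t_infomask is HEAP_XMIN_COMMITTED, the hint bit that
-- marks the inserting transaction as known to have committed.
select lp, (t_infomask & x'0100'::int) <> 0 as xmin_committed_hint
from heap_page_items(get_raw_page('p2020_0', 0))
limit 5;
```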
Moreover, if data block checksums are enabled or wal_log_hints=on is set, setting hint bits also generates WAL records, so even read-only queries can generate WAL.
Why is the first query slow? Because it has to update the hint bits in each record header.
After that update, queries are fast: PostgreSQL no longer needs to consult the clog to determine a transaction's outcome; it simply checks the mask in the record header.
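One way to avoid paying this cost in the first query is to set the hint bits right after the bulk load, for example with a vacuum, which touches every tuple:

```sql
vacuum analyze p;         -- sets hint bits (and refreshes planner statistics)
select count(*) from p;   -- no longer has to look up the clog
```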
A possible kernel improvement is a switch that controls whether hint bits are updated, at the table, partition, or cluster level. For example, some log or stream tables are dropped shortly after being written; for such tables it is better not to update hint bits at all and to read the transaction status from the clog during queries instead.
Improving the kernel's handling of CLogControlLock concurrency to avoid this contention would let parallel computing show its full capability.