By digoal
The first time a page is modified after a checkpoint, the entire page must be written into the WAL (a full page write, or FPW) to guard against torn pages. With FPW, damaged blocks can be repaired, and online copies (such as pg_basebackup) are safe: if a block is copied while only partially written, replaying the FPW restores it.
However, FPW makes the WAL larger, especially under frequent updates, and the effect is most pronounced when checkpoints are frequent. Yet the more frequent the checkpoints, the less WAL needs to be replayed during recovery, and the faster recovery is; crash recovery, for example, becomes very fast. Hence the contradiction: frequent checkpoints speed up recovery but amplify the FPW overhead.
PolarDB supports bad-block repair through its standby and can therefore disable FPW entirely. This resolves the contradiction and significantly improves performance: with high-frequency checkpoints, disabling FPW improves write-workload performance by about 30%.
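To get intuition for the overhead, here is a back-of-the-envelope sketch. The 8 KB page size is PostgreSQL's default block size; the ~120-byte average WAL record size is an assumed figure for illustration only:

```shell
# rough WAL-volume arithmetic for the first update of a page after a checkpoint
page=8192        # default PostgreSQL block size, bytes
record=120       # assumed average size of a plain update WAL record, bytes
with_fpw=$((page + record))
echo "WAL per first-touch update: ${with_fpw} bytes with FPW vs ${record} bytes without"
echo "amplification: ~$((with_fpw / record))x"
```

Under these assumptions, the first update of each page after a checkpoint writes roughly 70x more WAL with FPW enabled, which is why a small checkpoint interval (every page is "first-touched" more often) magnifies the cost.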
Test environment: 8 cores, 64 GB memory, 2 TB NVMe SSD.
rpm -ivh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
yum -y install coreutils glib2 lrzsz dstat sysstat e4fsprogs xfsprogs ntp readline-devel zlib-devel openssl-devel pam-devel libxml2-devel libxslt-devel python-devel tcl-devel gcc gcc-c++ make smartmontools flex bison perl-devel perl-ExtUtils* openldap-devel jadetex openjade bzip2 git iotop lvm2 perf centos-release-scl
yum install -y https://download.postgresql.org/pub/repos/yum/reporpms/EL-7-x86_64/pgdg-redhat-repo-latest.noarch.rpm
yum install -y postgresql14*
parted -a optimal -s /dev/vdb mklabel gpt mkpart primary 1MiB 100%
mkfs.ext4 /dev/vdb1 -m 0 -O extent,uninit_bg -b 4096 -L lv01
vi /etc/fstab
LABEL=lv01 /data01 ext4 defaults,noatime,nodiratime,nodelalloc,barrier=0,data=writeback 0 0
mkdir /data01
mount -a
vi /etc/sysctl.conf
# add by digoal.zhou
fs.aio-max-nr = 1048576
fs.file-max = 76724600
# Options: kernel.core_pattern = /data01/corefiles/core_%e_%u_%t_%s.%p
# Before testing, create the /data01/corefiles directory used to store core dumps with permission 777. If it is a symbolic link, set the target directory to 777 as well.
kernel.sem = 4096 2147483647 2147483646 512000
# Specify the semaphore settings. You can run ipcs -l or ipcs -u to view current semaphore usage. Every group of 16 processes needs one semaphore set of 17 semaphores.
kernel.shmall = 107374182
# Specify the total size of shared memory segments. Recommended value: 80% of the memory capacity. Unit: pages.
kernel.shmmax = 274877906944
# Specify the maximum size of a single shared memory segment. Recommended value: 50% of the memory capacity. Unit: bytes. In PostgreSQL versions later than 9.2, System V shared memory usage drops significantly.
kernel.shmmni = 819200
# Specify the total number of shared memory segments that can be generated. There are at least 2 shared memory segments within each PostgreSQL cluster.
net.core.netdev_max_backlog = 10000
net.core.rmem_default = 262144
# The default setting of the socket receive buffer in bytes.
net.core.rmem_max = 4194304
# The maximum receive socket buffer size in bytes
net.core.wmem_default = 262144
# The default setting (in bytes) of the socket send buffer.
net.core.wmem_max = 4194304
# The maximum send socket buffer size in bytes.
net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 4096
net.ipv4.tcp_keepalive_intvl = 20
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_mem = 8388608 12582912 16777216
net.ipv4.tcp_fin_timeout = 5
net.ipv4.tcp_synack_retries = 2
net.ipv4.tcp_syncookies = 1
# Enable SYN cookies. When the SYN backlog queue overflows, cookies help defend against low-volume SYN flood attacks.
net.ipv4.tcp_timestamps = 1
# Reduce time_wait.
net.ipv4.tcp_tw_recycle = 0
# If you set this parameter to 1, sockets in the TIME-WAIT state over TCP connections are recycled. However, if network address translation (NAT) is used, TCP connections may fail. We recommend that you set this parameter to 0 on the database server.
net.ipv4.tcp_tw_reuse = 1
# Enable the reuse function. This function enables network sockets in the TIME-WAIT state to be reused over new TCP connections.
net.ipv4.tcp_max_tw_buckets = 262144
net.ipv4.tcp_rmem = 8192 87380 16777216
net.ipv4.tcp_wmem = 8192 65536 16777216
net.nf_conntrack_max = 1200000
net.netfilter.nf_conntrack_max = 1200000
vm.dirty_background_bytes = 409600000
# When the amount of dirty memory reaches this value, the background flush process (pdflush or its successors) starts writing out dirty pages that are older than (dirty_expire_centisecs/100) seconds.
# The default limit is 10% of the memory capacity. We recommend that you specify the limit in bytes for machines with large memory capacity.
vm.dirty_expire_centisecs = 3000
# Specify the maximum period to retain dirty pages. Dirty pages are flushed to disks after the time period specified by this parameter elapses. The value 3000 indicates 30 seconds.
vm.dirty_ratio = 95
# If background flushing is too slow and dirty pages exceed 95% of memory, user processes that write to disk (via fsync, fdatasync, etc.) must flush dirty pages themselves.
# Set this high enough to keep user processes from being forced into synchronous flushing. This is especially effective when multiple instances share one machine and cgroups limit each instance's IOPS.
vm.dirty_writeback_centisecs = 100
# Specify the time interval at which the background scheduling process (such as pdflush or other processes) flushes dirty pages to disks. The value 100 indicates 1 second.
vm.swappiness = 0
# Disable the swap function.
vm.mmap_min_addr = 65536
vm.overcommit_memory = 0
# Control memory overcommit. The value 0 lets the kernel heuristically allow allocations beyond physical memory; the value 1 makes it always assume enough memory is available. On low-memory test machines, setting it to 1 is recommended.
vm.overcommit_ratio = 90
# Specify the memory capacity that can be allocated when the overcommit_memory parameter is set to 2.
vm.zone_reclaim_mode = 0
# Disable NUMA zone reclaim. NUMA can also be disabled in the kernel boot parameters.
net.ipv4.ip_local_port_range = 40000 65535
# Specify the range of TCP or UDP port numbers that are automatically allocated locally.
fs.nr_open=20480000
# Specify the maximum number of file handles that a single process can open.
# Take note of the following parameters:
# vm.extra_free_kbytes = 4096000
# vm.min_free_kbytes = 2097152 # We recommend setting vm.min_free_kbytes to 1 GB for every 32 GB of memory.
# If the physical host does not provide much memory, we recommend that you do not configure vm.extra_free_kbytes and vm.min_free_kbytes.
# vm.nr_hugepages = 66536
# If the size of the shared buffer exceeds 64 GB, we recommend that you use huge pages. You can specify the page size by setting the Hugepagesize parameter in the /proc/meminfo file.
# vm.lowmem_reserve_ratio = 1 1 1
# If the memory capacity exceeds 64 GB, we recommend that you set this parameter. Otherwise, we recommend that you retain the default value 256 256 32.
sysctl -p
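As a cross-check of the shared-memory formulas in the comments above (80% of RAM in pages for kernel.shmall, 50% of RAM in bytes for kernel.shmmax), here is the arithmetic for this 64 GB machine, assuming the usual 4 KB kernel page size. Note the values in the template above are generic and sized for a larger machine:

```shell
# recommended shmall/shmmax per the comments above, computed for 64 GB of RAM
mem_bytes=$((64 * 1024 * 1024 * 1024))
page=4096                                   # kernel page size, bytes
shmall=$((mem_bytes * 80 / 100 / page))     # 80% of RAM, in pages
shmmax=$((mem_bytes * 50 / 100))            # 50% of RAM, in bytes
echo "kernel.shmall = $shmall"
echo "kernel.shmmax = $shmmax"
```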
vi /etc/security/limits.d/20-nproc.conf
# If nofile exceeds 1048576, first raise fs.nr_open via sysctl; set nofile only after the sysctl change takes effect.
* soft nofile 1024000
* hard nofile 1024000
* soft nproc unlimited
* hard nproc unlimited
* soft core unlimited
* hard core unlimited
* soft memlock unlimited
* hard memlock unlimited
chmod +x /etc/rc.d/rc.local
vi /etc/rc.local
touch /var/lock/subsys/local
if test -f /sys/kernel/mm/transparent_hugepage/enabled; then
echo never > /sys/kernel/mm/transparent_hugepage/enabled
fi
echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource
useradd postgres
su - postgres
wget https://ftp.postgresql.org/pub/snapshot/dev/postgresql-snapshot.tar.bz2
tar -jxvf postgresql-snapshot.tar.bz2
cd postgresql-15devel
./configure --prefix=/home/postgres/pg15
make world -j 16
make install-world
cd ~
vi .bash_profile
# Append the following parameters:
export PS1="$USER@`/bin/hostname -s`-> "
export PGPORT=1921
export PGDATA=/data01/pg15_$PGPORT/pg_root
export LANG=en_US.utf8
export PGHOME=/home/postgres/pg15
export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:$LD_LIBRARY_PATH
export DATE=`date +"%Y%m%d%H%M"`
export PATH=$PGHOME/bin:$PATH:.
export MANPATH=$PGHOME/share/man:$MANPATH
export PGHOST=$PGDATA
export PGUSER=postgres
export PGDATABASE=postgres
alias rm='rm -i'
alias ll='ls -lh'
unalias vi
su - root
mkdir /data01/pg15_1921
chown postgres:postgres /data01/pg15_1921
su - postgres
initdb -D $PGDATA -U postgres -E UTF8 --lc-collate=C --lc-ctype=en_US.utf8
cd $PGDATA
vi postgresql.conf
listen_addresses = '0.0.0.0'
port = 1921
max_connections = 1000
superuser_reserved_connections = 13
unix_socket_directories = '., /tmp'
tcp_keepalives_idle = 60
tcp_keepalives_interval = 10
tcp_keepalives_count = 6
shared_buffers = 16GB
maintenance_work_mem = 1GB
dynamic_shared_memory_type = posix
vacuum_cost_delay = 0
bgwriter_delay = 10ms
bgwriter_lru_maxpages = 1000
bgwriter_lru_multiplier = 5.0
effective_io_concurrency = 0
max_parallel_workers_per_gather = 0
wal_level = replica
fsync = on
synchronous_commit = on
full_page_writes = on
wal_writer_delay = 10ms
wal_writer_flush_after = 1MB
max_wal_size = 1GB # A small value forces high-frequency checkpoints, which magnifies the impact of FPW
min_wal_size = 80MB
random_page_cost = 1.1
effective_cache_size = 64GB
log_destination = 'csvlog'
logging_collector = on
log_truncate_on_rotation = on
log_checkpoints = on
log_timezone = 'Asia/Shanghai'
autovacuum_vacuum_cost_delay = 0ms
vacuum_freeze_table_age = 750000000
vacuum_multixact_freeze_table_age = 750000000
datestyle = 'iso, mdy'
timezone = 'Asia/Shanghai'
lc_messages = 'en_US.utf8'
lc_monetary = 'en_US.utf8'
lc_numeric = 'en_US.utf8'
lc_time = 'en_US.utf8'
default_text_search_config = 'pg_catalog.english'
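Why max_wal_size = 1GB means high-frequency checkpoints: PostgreSQL triggers a checkpoint before WAL grows past max_wal_size. If the workload generates WAL at, say, 100 MB/s (an assumed figure for illustration, not a measurement from this test), the 1 GB budget is consumed in about 10 seconds:

```shell
# illustrative only: upper bound on checkpoint spacing with max_wal_size = 1GB
max_wal_mb=1024
wal_rate_mb_s=100      # assumed WAL generation rate under a heavy write load
interval=$((max_wal_mb / wal_rate_mb_s))
echo "checkpoint roughly every ${interval}s or less"
```

With checkpoints every few seconds, most page updates hit a page for the first time since the last checkpoint, so nearly every update pays the FPW cost.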
pg_ctl start
pgbench -i -s 5000
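The scale factor determines the dataset size: pgbench creates 100,000 pgbench_accounts rows per scale unit, so -s 5000 yields a half-billion-row accounts table (large enough that the working set does not fit in the 16 GB shared buffers):

```shell
# dataset size implied by -s 5000 (pgbench's standard 100000 accounts rows per scale unit)
scale=5000
rows=$((scale * 100000))
echo "pgbench_accounts rows: ${rows}"
```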
pgbench -M prepared -n -r -P 1 -c 16 -j 16 -T 120
Result with full_page_writes = on (as configured above):
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 5000
query mode: prepared
number of clients: 16
number of threads: 16
duration: 120 s
number of transactions actually processed: 1210084
latency average = 1.586 ms
latency stddev = 1.181 ms
initial connection time = 8.439 ms
tps = 10072.852551 (without initial connection time)
statement latencies in milliseconds:
0.001 \set aid random(1, 100000 * :scale)
0.000 \set bid random(1, 1 * :scale)
0.000 \set tid random(1, 10 * :scale)
0.000 \set delta random(-5000, 5000)
0.077 BEGIN;
0.681 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
0.111 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
0.117 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
0.107 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
0.091 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
0.399 END;
The same test was then repeated with full_page_writes disabled:
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 5000
query mode: prepared
number of clients: 16
number of threads: 16
duration: 120 s
number of transactions actually processed: 1569385
latency average = 1.223 ms
latency stddev = 0.970 ms
initial connection time = 9.154 ms
tps = 13070.045981 (without initial connection time)
statement latencies in milliseconds:
0.001 \set aid random(1, 100000 * :scale)
0.000 \set bid random(1, 1 * :scale)
0.000 \set tid random(1, 10 * :scale)
0.000 \set delta random(-5000, 5000)
0.075 BEGIN;
0.402 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
0.113 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
0.114 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
0.111 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
0.096 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
0.310 END;
After FPW is turned off, performance improves by about 30% (13,070 tps vs. 10,072 tps).
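The headline figure follows directly from the tps numbers reported in the two runs:

```shell
# percent improvement computed from the two pgbench results above
fpw_on=10072
fpw_off=13070
awk -v on="$fpw_on" -v off="$fpw_off" 'BEGIN{printf "+%.1f%%\n", (off - on) / on * 100}'
# prints "+29.8%"
```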