
PostgreSQL v12 Performance: Alibaba Cloud ECS Local SSD vs. ESSD PL3

In this article, the author discusses and demonstrates the performance of PostgreSQL v12 with respect to local and cloud SSD storage types in Alibaba Cloud.

By digoal

Background

Many options are available for storing data in the cloud: local SSD, SSD, ESSD, and ESSD PL1, PL2, and PL3. But how should you choose a storage type for a PostgreSQL database? What are the advantages and disadvantages of each option, and which one fits which scenario? Is the single-I/O latency of cloud disks a concern for every kind of workload, and how can it be mitigated?

First, we test the performance. Then we analyze the similarities and differences between local SSD and cloud SSD (ESSD) to settle on the appropriate storage option for the database.

The test environment is as follows:

Machines with Local SSD

Disk:

  • 1.8 TB * 8
  • Nominal IOPS: 240,000 per disk

CPU:

  • 32 cores, 64 threads, 2.5 GHz

Memory: 512 GB

# lscpu      
Architecture:          x86_64      
CPU op-mode(s):        32-bit, 64-bit      
Byte Order:            Little Endian      
CPU(s):                64      
On-line CPU(s) list:   0-63      
Thread(s) per core:    2      
Core(s) per socket:    32      
Socket(s):             1      
NUMA node(s):          1      
Vendor ID:             GenuineIntel      
CPU family:            6      
Model:                 85      
Model name:            Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz      
Stepping:              4      
CPU MHz:               2500.006      
BogoMIPS:              5000.01      
Hypervisor vendor:     KVM      
Virtualization type:   full      
L1d cache:             32K      
L1i cache:             32K      
L2 cache:              1024K      
L3 cache:              33792K      
NUMA node0 CPU(s):     0-63      
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 spec_ctrl intel_stibp      
      
# lsblk      
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT      
vda    253:0    0  100G  0 disk       
└─vda1 253:1    0  100G  0 part /      
vdb    253:16   0  1.8T  0 disk       
vdc    253:32   0  1.8T  0 disk       
vdd    253:48   0  1.8T  0 disk       
vde    253:64   0  1.8T  0 disk       
vdf    253:80   0  1.8T  0 disk       
vdg    253:96   0  1.8T  0 disk       
vdh    253:112  0  1.8T  0 disk       
vdi    253:128  0  1.8T  0 disk    

Machines with ESSD PL3

Disk 1:

  • 1.8 TB * 8
  • Nominal IOPS: 91,800 per disk

Disk 2:

  • 20 TB * 1
  • Nominal IOPS: 1,000,000 per disk

CPU:

  • 26 cores, 52 threads, 2.5 GHz

Memory: 192 GB

# lscpu      
Architecture:          x86_64      
CPU op-mode(s):        32-bit, 64-bit      
Byte Order:            Little Endian      
CPU(s):                52      
On-line CPU(s) list:   0-51      
Thread(s) per core:    2      
Core(s) per socket:    26      
Socket(s):             1      
NUMA node(s):          1      
Vendor ID:             GenuineIntel      
CPU family:            6      
Model:                 85      
Model name:            Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz      
Stepping:              7      
CPU MHz:               2500.002      
BogoMIPS:              5000.00      
Hypervisor vendor:     KVM      
Virtualization type:   full      
L1d cache:             32K      
L1i cache:             32K      
L2 cache:              1024K      
L3 cache:              36608K      
NUMA node0 CPU(s):     0-51      
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat avx512_vnni      
      
# lsblk      
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT      
vda    253:0    0  100G  0 disk       
└─vda1 253:1    0  100G  0 part /      
vdb    253:16   0  1.8T  0 disk       
vdc    253:32   0  1.8T  0 disk       
vdd    253:48   0  1.8T  0 disk       
vde    253:64   0  1.8T  0 disk       
vdf    253:80   0  1.8T  0 disk       
vdg    253:96   0  1.8T  0 disk       
vdh    253:112  0  1.8T  0 disk       
vdi    253:128  0  1.8T  0 disk       
vdj    253:144  0 19.5T  0 disk       

Machine Deployment

Kernel Parameter Setting

vi /etc/sysctl.conf          
          
# add by digoal.zhou              
fs.aio-max-nr = 1048576              
fs.file-max = 76724600              
              
# Optional: kernel.core_pattern = /data01/corefiles/core_%e_%u_%t_%s.%p
# Create /data01/corefiles in advance with permission 777; if it is a symlink, apply 777 to the target directory as well.
              
kernel.sem = 4096 2147483647 2147483646 512000                  
# Semaphores; check with ipcs -l or ipcs -u. Every 16 processes form a group, and each group requires 17 semaphores.

kernel.shmall = 107374182
# Limit on the total size of all shared memory segments (80% of RAM recommended), in pages.
kernel.shmmax = 274877906944
# Maximum size of a single shared memory segment (half of RAM recommended); PostgreSQL versions after 9.2 use far less shared memory. In bytes.
kernel.shmmni = 819200
# Total number of shared memory segments that can be created; each PostgreSQL cluster needs at least 2.
              
net.core.netdev_max_backlog = 10000              
net.core.rmem_default = 262144                     
# The default setting of the socket receive buffer in bytes.              
net.core.rmem_max = 4194304                        
# The maximum receive socket buffer size in bytes              
net.core.wmem_default = 262144                     
# The default setting (in bytes) of the socket send buffer.              
net.core.wmem_max = 4194304                        
# The maximum send socket buffer size in bytes.              
net.core.somaxconn = 4096              
net.ipv4.tcp_max_syn_backlog = 4096              
net.ipv4.tcp_keepalive_intvl = 20              
net.ipv4.tcp_keepalive_probes = 3              
net.ipv4.tcp_keepalive_time = 60              
net.ipv4.tcp_mem = 8388608 12582912 16777216              
net.ipv4.tcp_fin_timeout = 5              
net.ipv4.tcp_synack_retries = 2              
net.ipv4.tcp_syncookies = 1                  
# Enable SYN cookies: when the SYN backlog queue overflows, cookies handle the connections, which defends against small-scale SYN flood attacks.
net.ipv4.tcp_timestamps = 1
# Helps reduce TIME_WAIT sockets.
net.ipv4.tcp_tw_recycle = 0
# If set to 1, TIME-WAIT sockets are recycled quickly, but this can break connections behind NAT; keep it disabled on servers.
net.ipv4.tcp_tw_reuse = 1
# Allow TIME-WAIT sockets to be reused for new TCP connections.
net.ipv4.tcp_max_tw_buckets = 262144              
net.ipv4.tcp_rmem = 8192 87380 16777216              
net.ipv4.tcp_wmem = 8192 65536 16777216              
              
net.nf_conntrack_max = 1200000              
net.netfilter.nf_conntrack_max = 1200000              
              
vm.dirty_background_bytes = 409600000                     
#  When dirty pages reach this amount, the background flusher (pdflush or its successor) automatically writes out pages dirtied more than (dirty_expire_centisecs/100) seconds ago.
#  The default is 10% of RAM; on machines with large memory, set an absolute number of bytes instead.

vm.dirty_expire_centisecs = 3000
#  Dirty pages older than this are flushed to disk; 3000 means 30 seconds.
vm.dirty_ratio = 95
#  If background flushing is too slow and dirty pages exceed 95% of RAM, user processes that write to disk (fsync, fdatasync, etc.) must flush dirty pages themselves.
#  Keeping this high prevents user processes from flushing dirty pages; very effective when running multiple instances per host with per-instance IOPS limited via cgroups.

vm.dirty_writeback_centisecs = 100
#  Wake-up interval of the background flusher (pdflush or its successor); 100 means 1 second.

vm.swappiness = 0
#  Do not use swap.
              
vm.mmap_min_addr = 65536              
vm.overcommit_memory = 0                   
#  Allow a small amount of memory overcommit when allocating. If set to 1, the kernel assumes there is always enough memory; 1 can be used in test environments with little memory.

vm.overcommit_ratio = 90
#  When overcommit_memory = 2, this ratio is used to calculate how much memory may be committed.
vm.swappiness = 0
#  Disable swapping.
vm.zone_reclaim_mode = 0
# Disable NUMA zone reclaim (or disable NUMA in the kernel boot parameters).
net.ipv4.ip_local_port_range = 40000 65535
# Range of local ports automatically assigned for TCP and UDP.
fs.nr_open=20480000
# Upper limit on the number of file handles a single process may open.
              
# Pay attention to the following parameters:
vm.extra_free_kbytes = 4096000
vm.min_free_kbytes = 6291456  # recommended: reserve 1 GB of min_free_kbytes per 32 GB of RAM
# Do not set the two values above on machines with little memory.
# vm.nr_hugepages = 66536
#  Use huge pages when shared_buffers exceeds 64 GB; the page size is Hugepagesize in /proc/meminfo.
vm.lowmem_reserve_ratio = 1 1 1
# Recommended when RAM exceeds 64 GB; otherwise keep the default of 256 256 32

# Apply the settings
# sysctl -p
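
After reloading with sysctl -p, a few values can be spot-checked to confirm they took effect (an optional sanity check; query whichever parameters matter to you):

sysctl fs.aio-max-nr vm.swappiness vm.dirty_background_bytes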

Configure Resource Limits

vi /etc/security/limits.conf          
          
# If nofile exceeds 1048576, first set sysctl fs.nr_open to a larger value and apply it before raising nofile.
# Comment out the other lines and add the following:
      
* soft    nofile  1024000              
* hard    nofile  1024000              
* soft    nproc   unlimited              
* hard    nproc   unlimited              
* soft    core    unlimited              
* hard    core    unlimited              
* soft    memlock unlimited              
* hard    memlock unlimited     
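
The new limits apply to new login sessions. A quick optional check after logging in again:

ulimit -n     # open files, expect 1024000
ulimit -u     # max user processes, expect unlimited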

Also modify the following file if it exists:

/etc/security/limits.d/20-nproc.conf      

Disable Transparent Huge Pages (Optional)

echo never > /sys/kernel/mm/transparent_hugepage/enabled     

To make this setting persist across reboots, add it to /etc/rc.local:

chmod +x /etc/rc.d/rc.local    
    
    
vi /etc/rc.local          
          
touch /var/lock/subsys/local          
               
if test -f /sys/kernel/mm/transparent_hugepage/enabled; then              
   echo never > /sys/kernel/mm/transparent_hugepage/enabled              
fi    
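
To confirm that transparent huge pages are disabled (optional check):

cat /sys/kernel/mm/transparent_hugepage/enabled
# always madvise [never]   -- the bracketed value is the active setting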

Modify Clock (Optional)

vi /etc/rc.local    
    
touch /var/lock/subsys/local    
    
if test -f /sys/kernel/mm/transparent_hugepage/enabled; then    
   echo never > /sys/kernel/mm/transparent_hugepage/enabled    
fi    
       
echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource    

List the supported clock sources:

cat /sys/devices/system/clocksource/clocksource0/available_clocksource       
kvm-clock tsc acpi_pm     

Modify clock:

echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource  

Timing overhead with each clock source:

kvm-clock:       
      
pg_test_timing       
Testing timing overhead for 3 seconds.      
Per loop time including overhead: 38.05 ns      
Histogram of timing durations:      
  < us   % of total      count      
     1     96.22568   75875976      
     2      3.76667    2970102      
     4      0.00055        435      
     8      0.00704       5550      
    16      0.00005         42      
    32      0.00000          1      
    64      0.00000          1      
   128      0.00000          0      
   256      0.00000          0      
   512      0.00000          0      
  1024      0.00000          1      
      
tsc:      
      
pg_test_timing       
Testing timing overhead for 3 seconds.      
Per loop time including overhead: 31.93 ns      
Histogram of timing durations:      
  < us   % of total      count      
     1     96.82838   90964258      
     2      3.16507    2973390      
     4      0.00078        731      
     8      0.00574       5394      
    16      0.00003         26      
    32      0.00000          2      

TSC is slightly better.

Enable irqbalance Service

systemctl enable irqbalance      
systemctl start irqbalance      
systemctl status irqbalance      

Confirm That NIC Multi-Queue Is Enabled

ethtool -l eth0      
      
Channel parameters for eth0:      
Pre-set maximums:      
RX:             0      
TX:             0      
Other:          0      
Combined:       32      
Current hardware settings:      
RX:             0      
TX:             0      
Other:          0      
Combined:       32   

If the current Combined value is lower than the pre-set maximum, raise it to the maximum:

ethtool -L eth0 combined 32    
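
Re-check afterwards (optional):

ethtool -l eth0      # Combined under "Current hardware settings" should now match the pre-set maximum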

Deploy PostgreSQL v12

yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm      
yum install -y https://download.postgresql.org/pub/repos/yum/reporpms/EL-7-x86_64/pgdg-redhat-repo-latest.noarch.rpm      
yum install -y centos-release-scl
yum install -y postgresql12*      
yum install -y lvm2      
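
An optional sanity check that the PostgreSQL 12 binaries are in place before initializing:

/usr/pgsql-12/bin/postgres --version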

Deploy Storage

Logical Volume and File System Deployment

For Local SSD (1.8 TB * 8)

parted -a optimal -s /dev/vdb mklabel gpt mkpart primary 1MiB 100%FREE        
parted -a optimal -s /dev/vdc mklabel gpt mkpart primary 1MiB 100%FREE        
parted -a optimal -s /dev/vdd mklabel gpt mkpart primary 1MiB 100%FREE        
parted -a optimal -s /dev/vde mklabel gpt mkpart primary 1MiB 100%FREE        
parted -a optimal -s /dev/vdf mklabel gpt mkpart primary 1MiB 100%FREE        
parted -a optimal -s /dev/vdg mklabel gpt mkpart primary 1MiB 100%FREE        
parted -a optimal -s /dev/vdh mklabel gpt mkpart primary 1MiB 100%FREE        
parted -a optimal -s /dev/vdi mklabel gpt mkpart primary 1MiB 100%FREE       
      
wipefs -f -a /dev/vd[b-i]1      
pvcreate /dev/vd[b-i]1        
vgcreate -A y -s 128M vgdata01 /dev/vd[b-i]1         
lvcreate -A y -i 8 -I 8 -l 100%FREE -n lv01 vgdata01       
      
      
[root@iZbp12984c7sv43o6lbbxoZ ~]# pvs      
  PV         VG       Fmt  Attr PSize  PFree      
  /dev/vdb1  vgdata01 lvm2 a--  <1.75t    0       
  /dev/vdc1  vgdata01 lvm2 a--  <1.75t    0       
  /dev/vdd1  vgdata01 lvm2 a--  <1.75t    0       
  /dev/vde1  vgdata01 lvm2 a--  <1.75t    0       
  /dev/vdf1  vgdata01 lvm2 a--  <1.75t    0       
  /dev/vdg1  vgdata01 lvm2 a--  <1.75t    0       
  /dev/vdh1  vgdata01 lvm2 a--  <1.75t    0       
  /dev/vdi1  vgdata01 lvm2 a--  <1.75t    0       
[root@iZbp12984c7sv43o6lbbxoZ ~]# lvs      
  LV   VG       Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert      
  lv01 vgdata01 -wi-a----- <13.97t        
      
      
mkfs.ext4 /dev/mapper/vgdata01-lv01 -m 0 -O extent,uninit_bg -E lazy_itable_init=1,stride=2,stripe_width=16 -b 4096 -T largefile -L lv01      
      
vi /etc/fstab           
LABEL=lv01 /data01 ext4 defaults,noatime,nodiratime,nodelalloc,barrier=0,data=writeback 0 0        
      
mkdir /data01        
      
mount -a        
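
An optional check that the striped volume is mounted where expected:

df -h /data01
lsblk /dev/mapper/vgdata01-lv01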

For Cloud Disk — Logical Volume (1.8 TB * 8)

parted -a optimal -s /dev/vdb mklabel gpt mkpart primary 1MiB 100%FREE        
parted -a optimal -s /dev/vdc mklabel gpt mkpart primary 1MiB 100%FREE        
parted -a optimal -s /dev/vdd mklabel gpt mkpart primary 1MiB 100%FREE        
parted -a optimal -s /dev/vde mklabel gpt mkpart primary 1MiB 100%FREE        
parted -a optimal -s /dev/vdf mklabel gpt mkpart primary 1MiB 100%FREE        
parted -a optimal -s /dev/vdg mklabel gpt mkpart primary 1MiB 100%FREE        
parted -a optimal -s /dev/vdh mklabel gpt mkpart primary 1MiB 100%FREE        
parted -a optimal -s /dev/vdi mklabel gpt mkpart primary 1MiB 100%FREE       
      
wipefs -f -a /dev/vd[b-i]1      
pvcreate /dev/vd[b-i]1        
vgcreate -A y -s 128M vgdata01 /dev/vd[b-i]1         
lvcreate -A y -i 8 -I 8 -l 100%FREE -n lv01 vgdata01       
      
[root@iZbp1621kc3de3pm12a0exZ ~]# pvs      
  PV         VG       Fmt  Attr PSize  PFree      
  /dev/vdb1  vgdata01 lvm2 a--  <1.76t    0       
  /dev/vdc1  vgdata01 lvm2 a--  <1.76t    0       
  /dev/vdd1  vgdata01 lvm2 a--  <1.76t    0       
  /dev/vde1  vgdata01 lvm2 a--  <1.76t    0       
  /dev/vdf1  vgdata01 lvm2 a--  <1.76t    0       
  /dev/vdg1  vgdata01 lvm2 a--  <1.76t    0       
  /dev/vdh1  vgdata01 lvm2 a--  <1.76t    0       
  /dev/vdi1  vgdata01 lvm2 a--  <1.76t    0       
[root@iZbp1621kc3de3pm12a0exZ ~]# lvs      
  LV   VG       Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert      
  lv01 vgdata01 -wi-a----- 14.06t       
      
      
mkfs.ext4 /dev/mapper/vgdata01-lv01 -m 0 -O extent,uninit_bg -E lazy_itable_init=1,stride=2,stripe_width=16 -b 4096 -T largefile -L lv01      
      
vi /etc/fstab           
LABEL=lv01 /data01 ext4 defaults,noatime,nodiratime,nodelalloc,barrier=0,data=writeback 0 0        
      
mkdir /data01        
      
mount -a      

For Cloud Disk — Single Disk (20 TB)

parted -a optimal -s /dev/vdj mklabel gpt mkpart primary 1MiB 100%FREE        
      
mkfs.ext4 /dev/vdj1 -m 0 -O extent,uninit_bg -E lazy_itable_init=1 -b 4096 -T largefile -L vdj1      
      
vi /etc/fstab           
LABEL=vdj1 /data02 ext4 defaults,noatime,nodiratime,nodelalloc,barrier=0,data=writeback 0 0        
      
mkdir /data02        
      
mount -a 

PostgreSQL v12 Database Deployment

For Local SSD (1.8 TB * 8)

mkdir /data01/pg12_1921        
chown postgres:postgres /data01/pg12_1921     
  
su - postgres        
      
vi ~/.bash_profile      
      
# Append the following
export PS1="$USER@`/bin/hostname -s`-> "        
export PGPORT=1921        
export PGDATA=/data01/pg12_$PGPORT/pg_root       
      
export LANG=en_US.utf8        
export PGHOME=/usr/pgsql-12        
export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:$LD_LIBRARY_PATH        
export DATE=`date +"%Y%m%d%H%M"`        
export PATH=$PGHOME/bin:$PATH:.        
export MANPATH=$PGHOME/share/man:$MANPATH        
export PGHOST=$PGDATA        
export PGUSER=postgres        
export PGDATABASE=postgres        
alias rm='rm -i'        
alias ll='ls -lh'        
unalias vi        

initdb -D $PGDATA -U postgres -E SQL_ASCII --lc-collate=C --lc-ctype=en_US.utf8 --wal-segsize=1024  
    
vi $PGDATA/postgresql.auto.conf      
      
listen_addresses = '0.0.0.0'      
# listening port
port = 1921      
max_connections = 1000      
superuser_reserved_connections = 13      
unix_socket_directories = '., /var/run/postgresql, /tmp'      
unix_socket_permissions = 0700      
tcp_keepalives_idle = 60      
tcp_keepalives_interval = 10      
tcp_keepalives_count = 10      
shared_buffers = 32GB      
max_prepared_transactions = 1000      
work_mem = 8MB      
maintenance_work_mem = 2GB      
dynamic_shared_memory_type = posix      
vacuum_cost_delay = 0      
bgwriter_delay = 10ms      
bgwriter_lru_maxpages = 1000      
bgwriter_lru_multiplier = 10.0      
bgwriter_flush_after = 512kB      
effective_io_concurrency = 0      
max_worker_processes = 128      
max_parallel_maintenance_workers = 5      
max_parallel_workers_per_gather = 0      
parallel_leader_participation = off      
max_parallel_workers = 40      
backend_flush_after = 2MB      
wal_level = replica      
fsync = on      
synchronous_commit = off      
full_page_writes = on      
wal_compression = on      
wal_buffers = 16MB      
wal_writer_delay = 10ms      
wal_writer_flush_after = 1MB      
checkpoint_timeout = 20min      
max_wal_size = 128GB      
min_wal_size = 16GB      
checkpoint_completion_target = 0.2      
archive_mode = on      
archive_command = '/bin/date'      
max_wal_senders = 32      
wal_keep_segments = 0      
wal_sender_timeout = 10s      
max_replication_slots = 24      
random_page_cost = 1.1      
effective_cache_size = 128GB      
log_destination = 'csvlog'      
logging_collector = on      
log_directory = 'log'      
log_filename = 'postgresql-%a.log'      
log_truncate_on_rotation = on      
log_rotation_age = 1d      
log_rotation_size = 0      
log_checkpoints = on      
log_connections = on      
log_disconnections = on      
log_error_verbosity = verbose        
log_line_prefix = '%m [%p] '      
log_lock_waits = on      
log_statement = 'ddl'      
log_timezone = 'Asia/Shanghai'      
track_functions = pl      
autovacuum = on      
log_autovacuum_min_duration = 0      
autovacuum_max_workers = 8      
autovacuum_vacuum_scale_factor = 0.1      
autovacuum_analyze_scale_factor = 0.05      
autovacuum_freeze_max_age = 800000000      
autovacuum_multixact_freeze_max_age = 900000000      
autovacuum_vacuum_cost_delay = 0ms      
default_table_access_method = 'heap'      
check_function_bodies = on      
# Set the two timeout parameters below according to business requirements
lock_timeout = 6s      
idle_in_transaction_session_timeout = 60s      
vacuum_freeze_table_age = 750000000      
vacuum_multixact_freeze_table_age = 750000000      
datestyle = 'iso, mdy'      
timezone = 'Asia/Shanghai'      
lc_messages = 'en_US.utf8'      
lc_monetary = 'en_US.utf8'      
lc_numeric = 'en_US.utf8'      
lc_time = 'en_US.utf8'      
default_text_search_config = 'pg_catalog.english'   

Start database:

pg_ctl start    
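
With the environment variables from ~/.bash_profile loaded, a quick optional connectivity check:

psql -c "select version();"
psql -c "show data_directory;"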

For Cloud Disk — Logical Volume (1.8 TB * 8)

mkdir /data01/pg12_1921        
chown postgres:postgres /data01/pg12_1921       

su - postgres        
      
vi ~/.bash_profile      
      
# Append the following
export PS1="$USER@`/bin/hostname -s`-> "        
export PGPORT=1921        
export PGDATA=/data01/pg12_$PGPORT/pg_root       
      
export LANG=en_US.utf8        
export PGHOME=/usr/pgsql-12        
export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:$LD_LIBRARY_PATH        
export DATE=`date +"%Y%m%d%H%M"`        
export PATH=$PGHOME/bin:$PATH:.        
export MANPATH=$PGHOME/share/man:$MANPATH        
export PGHOST=$PGDATA        
export PGUSER=postgres        
export PGDATABASE=postgres        
alias rm='rm -i'        
alias ll='ls -lh'        
unalias vi        

initdb -D $PGDATA -U postgres -E SQL_ASCII --lc-collate=C --lc-ctype=en_US.utf8 --wal-segsize=1024   
   
vi $PGDATA/postgresql.auto.conf      
      
listen_addresses = '0.0.0.0'      
# listening port
port = 1921      
max_connections = 1000      
superuser_reserved_connections = 13      
unix_socket_directories = '., /var/run/postgresql, /tmp'      
unix_socket_permissions = 0700      
tcp_keepalives_idle = 60      
tcp_keepalives_interval = 10      
tcp_keepalives_count = 10      
shared_buffers = 32GB      
max_prepared_transactions = 1000      
work_mem = 8MB      
maintenance_work_mem = 2GB      
dynamic_shared_memory_type = posix      
vacuum_cost_delay = 0      
bgwriter_delay = 10ms      
bgwriter_lru_maxpages = 1000      
bgwriter_lru_multiplier = 10.0      
bgwriter_flush_after = 512kB      
effective_io_concurrency = 0      
max_worker_processes = 128      
max_parallel_maintenance_workers = 5      
max_parallel_workers_per_gather = 0      
parallel_leader_participation = off      
max_parallel_workers = 40      
backend_flush_after = 2MB      
wal_level = replica      
fsync = on      
synchronous_commit = off      
full_page_writes = on      
wal_compression = on      
wal_buffers = 16MB      
wal_writer_delay = 10ms      
wal_writer_flush_after = 1MB      
checkpoint_timeout = 20min      
max_wal_size = 128GB      
min_wal_size = 16GB      
checkpoint_completion_target = 0.2      
archive_mode = on      
archive_command = '/bin/date'      
max_wal_senders = 32      
wal_keep_segments = 0      
wal_sender_timeout = 10s      
max_replication_slots = 24      
random_page_cost = 1.1      
effective_cache_size = 128GB      
log_destination = 'csvlog'      
logging_collector = on      
log_directory = 'log'      
log_filename = 'postgresql-%a.log'      
log_truncate_on_rotation = on      
log_rotation_age = 1d      
log_rotation_size = 0      
log_checkpoints = on      
log_connections = on      
log_disconnections = on      
log_error_verbosity = verbose        
log_line_prefix = '%m [%p] '      
log_lock_waits = on      
log_statement = 'ddl'      
log_timezone = 'Asia/Shanghai'      
track_functions = pl      
autovacuum = on      
log_autovacuum_min_duration = 0      
autovacuum_max_workers = 8      
autovacuum_vacuum_scale_factor = 0.1      
autovacuum_analyze_scale_factor = 0.05      
autovacuum_freeze_max_age = 800000000      
autovacuum_multixact_freeze_max_age = 900000000      
autovacuum_vacuum_cost_delay = 0ms      
default_table_access_method = 'heap'      
check_function_bodies = on      
# Set the two timeout parameters below according to business requirements
lock_timeout = 6s      
idle_in_transaction_session_timeout = 60s      
vacuum_freeze_table_age = 750000000      
vacuum_multixact_freeze_table_age = 750000000      
datestyle = 'iso, mdy'      
timezone = 'Asia/Shanghai'      
lc_messages = 'en_US.utf8'      
lc_monetary = 'en_US.utf8'      
lc_numeric = 'en_US.utf8'      
lc_time = 'en_US.utf8'      
default_text_search_config = 'pg_catalog.english' 

Start database:

pg_ctl start

For Cloud Disk — Single Disk (20 TB)

mkdir /data02/pg12_1922        
chown postgres:postgres /data02/pg12_1922
       
su - postgres        
      
export PGPORT=1922        
export PGDATA=/data02/pg12_$PGPORT/pg_root       
export PGHOST=$PGDATA        
      
initdb -D $PGDATA -U postgres -E SQL_ASCII --lc-collate=C --lc-ctype=en_US.utf8 --wal-segsize=1024      

vi $PGDATA/postgresql.auto.conf      
      
listen_addresses = '0.0.0.0'      
# listening port
port = 1922      
max_connections = 1000      
superuser_reserved_connections = 13      
unix_socket_directories = '., /var/run/postgresql, /tmp'      
unix_socket_permissions = 0700      
tcp_keepalives_idle = 60      
tcp_keepalives_interval = 10      
tcp_keepalives_count = 10      
shared_buffers = 32GB      
max_prepared_transactions = 1000      
work_mem = 8MB      
maintenance_work_mem = 2GB      
dynamic_shared_memory_type = posix      
vacuum_cost_delay = 0      
bgwriter_delay = 10ms      
bgwriter_lru_maxpages = 1000      
bgwriter_lru_multiplier = 10.0      
bgwriter_flush_after = 512kB      
effective_io_concurrency = 0      
max_worker_processes = 128      
max_parallel_maintenance_workers = 5      
max_parallel_workers_per_gather = 0      
parallel_leader_participation = off      
max_parallel_workers = 40      
backend_flush_after = 2MB      
wal_level = replica      
fsync = on      
synchronous_commit = off      
full_page_writes = on      
wal_compression = on      
wal_buffers = 16MB      
wal_writer_delay = 10ms      
wal_writer_flush_after = 1MB      
checkpoint_timeout = 20min      
max_wal_size = 128GB      
min_wal_size = 16GB      
checkpoint_completion_target = 0.2      
archive_mode = on      
archive_command = '/bin/date'      
max_wal_senders = 32      
wal_keep_segments = 0      
wal_sender_timeout = 10s      
max_replication_slots = 24      
random_page_cost = 1.1      
effective_cache_size = 128GB      
log_destination = 'csvlog'      
logging_collector = on      
log_directory = 'log'      
log_filename = 'postgresql-%a.log'      
log_truncate_on_rotation = on      
log_rotation_age = 1d      
log_rotation_size = 0      
log_checkpoints = on      
log_connections = on      
log_disconnections = on      
log_error_verbosity = verbose        
log_line_prefix = '%m [%p] '      
log_lock_waits = on      
log_statement = 'ddl'      
log_timezone = 'Asia/Shanghai'      
track_functions = pl      
autovacuum = on      
log_autovacuum_min_duration = 0      
autovacuum_max_workers = 8      
autovacuum_vacuum_scale_factor = 0.1      
autovacuum_analyze_scale_factor = 0.05      
autovacuum_freeze_max_age = 800000000      
autovacuum_multixact_freeze_max_age = 900000000      
autovacuum_vacuum_cost_delay = 0ms      
default_table_access_method = 'heap'      
check_function_bodies = on      
# Set the two timeout parameters below according to business requirements
lock_timeout = 6s      
idle_in_transaction_session_timeout = 60s      
vacuum_freeze_table_age = 750000000      
vacuum_multixact_freeze_table_age = 750000000      
datestyle = 'iso, mdy'      
timezone = 'Asia/Shanghai'      
lc_messages = 'en_US.utf8'      
lc_monetary = 'en_US.utf8'      
lc_numeric = 'en_US.utf8'      
lc_time = 'en_US.utf8'      
default_text_search_config = 'pg_catalog.english'      

Start database:

pg_ctl start      

fsync Performance Test

This test measures single-I/O sync capability, which also indicates the synchronous commit latency the database can achieve.

For Local SSD (1.8 TB * 8)

su - postgres      
cd $PGDATA      
      
postgres@iZbp12984c7sv43o6lbbxoZ-> pg_test_fsync       
5 seconds per test      
O_DIRECT supported on this platform for open_datasync and open_sync.      
      
Compare file sync methods using one 8kB write:      
(in wal_sync_method preference order, except fdatasync is Linux's default)      
        open_datasync                     32804.010 ops/sec      30 usecs/op      
        fdatasync                         28216.659 ops/sec      35 usecs/op      
        fsync                             25325.493 ops/sec      39 usecs/op      
        fsync_writethrough                              n/a      
        open_sync                         29137.619 ops/sec      34 usecs/op      
      
Compare file sync methods using two 8kB writes:      
(in wal_sync_method preference order, except fdatasync is Linux's default)      
        open_datasync                     14620.630 ops/sec      68 usecs/op      
        fdatasync                         24263.659 ops/sec      41 usecs/op      
        fsync                             21762.878 ops/sec      46 usecs/op      
        fsync_writethrough                              n/a      
        open_sync                         13006.581 ops/sec      77 usecs/op      
      
Compare open_sync with different write sizes:      
(This is designed to compare the cost of writing 16kB in different write      
open_sync sizes.)      
         1 * 16kB open_sync write         25182.114 ops/sec      40 usecs/op      
         2 *  8kB open_sync writes        13001.270 ops/sec      77 usecs/op      
         4 *  4kB open_sync writes         7233.661 ops/sec     138 usecs/op      
         8 *  2kB open_sync writes         3865.482 ops/sec     259 usecs/op      
        16 *  1kB open_sync writes         1979.762 ops/sec     505 usecs/op      
      
Test if fsync on non-write file descriptor is honored:      
(If the times are similar, fsync() can sync data written on a different      
descriptor.)      
        write, fsync, close               24085.484 ops/sec      42 usecs/op      
        write, close, fsync               24072.174 ops/sec      42 usecs/op      
      
Non-sync'ed 8kB writes:      
        write                            192054.909 ops/sec       5 usecs/op     

For Cloud Disk — Logical Volume (1.8 TB * 8)

su - postgres      
cd $PGDATA      
      
postgres@iZbp1621kc3de3pm12a0exZ-> pg_test_fsync       
5 seconds per test      
O_DIRECT supported on this platform for open_datasync and open_sync.      
      
Compare file sync methods using one 8kB write:      
(in wal_sync_method preference order, except fdatasync is Linux's default)      
        open_datasync                      6345.217 ops/sec     158 usecs/op      
        fdatasync                          6087.221 ops/sec     164 usecs/op      
        fsync                              3689.149 ops/sec     271 usecs/op      
        fsync_writethrough                              n/a      
        open_sync                          3942.073 ops/sec     254 usecs/op      
      
Compare file sync methods using two 8kB writes:      
(in wal_sync_method preference order, except fdatasync is Linux's default)      
        open_datasync                      3020.085 ops/sec     331 usecs/op      
        fdatasync                          4684.567 ops/sec     213 usecs/op      
        fsync                              2785.027 ops/sec     359 usecs/op      
        fsync_writethrough                              n/a      
        open_sync                          1814.915 ops/sec     551 usecs/op      
      
Compare open_sync with different write sizes:      
(This is designed to compare the cost of writing 16kB in different write      
open_sync sizes.)      
         1 * 16kB open_sync write          2847.150 ops/sec     351 usecs/op      
         2 *  8kB open_sync writes         1920.576 ops/sec     521 usecs/op      
         4 *  4kB open_sync writes         1023.594 ops/sec     977 usecs/op      
         8 *  2kB open_sync writes           82.223 ops/sec   12162 usecs/op      
        16 *  1kB open_sync writes           63.008 ops/sec   15871 usecs/op      
      
Test if fsync on non-write file descriptor is honored:      
(If the times are similar, fsync() can sync data written on a different      
descriptor.)      
        write, fsync, close                3677.066 ops/sec     272 usecs/op      
        write, close, fsync                3708.723 ops/sec     270 usecs/op      
      
Non-sync'ed 8kB writes:      
        write                            297195.065 ops/sec       3 usecs/op     

For Cloud Disk — Single Disk (20 TB)

su - postgres        
      
export PGPORT=1922        
export PGDATA=/data02/pg12_$PGPORT/pg_root       
export PGHOST=$PGDATA        
cd $PGDATA      
      
      
postgres@iZbp1621kc3de3pm12a0exZ-> pg_test_fsync       
5 seconds per test      
O_DIRECT supported on this platform for open_datasync and open_sync.      
      
Compare file sync methods using one 8kB write:      
(in wal_sync_method preference order, except fdatasync is Linux's default)      
        open_datasync                      6103.193 ops/sec     164 usecs/op      
        fdatasync                          5901.509 ops/sec     169 usecs/op      
        fsync                              3893.064 ops/sec     257 usecs/op      
        fsync_writethrough                              n/a      
        open_sync                          3971.187 ops/sec     252 usecs/op      
      
Compare file sync methods using two 8kB writes:      
(in wal_sync_method preference order, except fdatasync is Linux's default)      
        open_datasync                      3010.795 ops/sec     332 usecs/op      
        fdatasync                          5034.260 ops/sec     199 usecs/op      
        fsync                              3648.353 ops/sec     274 usecs/op      
        fsync_writethrough                              n/a      
        open_sync                          2033.980 ops/sec     492 usecs/op      
      
Compare open_sync with different write sizes:      
(This is designed to compare the cost of writing 16kB in different write      
open_sync sizes.)      
         1 * 16kB open_sync write          3574.507 ops/sec     280 usecs/op      
         2 *  8kB open_sync writes         1996.539 ops/sec     501 usecs/op      
         4 *  4kB open_sync writes         1079.944 ops/sec     926 usecs/op      
         8 *  2kB open_sync writes           96.440 ops/sec   10369 usecs/op      
        16 *  1kB open_sync writes           75.738 ops/sec   13203 usecs/op      
      
Test if fsync on non-write file descriptor is honored:      
(If the times are similar, fsync() can sync data written on a different      
descriptor.)      
        write, fsync, close                3716.489 ops/sec     269 usecs/op      
        write, close, fsync                3641.653 ops/sec     275 usecs/op      
      
Non-sync'ed 8kB writes:      
        write                            297784.864 ops/sec       3 usecs/op    
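
Across all three volumes, open_datasync shows the lowest latency for a single 8 kB sync, while fdatasync wins when two 8 kB writes are flushed together. The configuration above leaves wal_sync_method at its Linux default (fdatasync); if one method is consistently faster on your hardware, it can optionally be pinned. A minimal sketch, assuming open_datasync wins:

psql -c "alter system set wal_sync_method = 'open_datasync';"
pg_ctl reload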

FIO Test (for IOPS and Throughput)

yum install -y fio      
yum install -y iotop      

Test script:

vi fiotest      
      
function RunFio      
{      
 numjobs=$1   # number of test threads, e.g. 8 in the example
 iodepth=$2   # upper limit on the number of in-flight I/Os, e.g. 64 in the example
 bs=$3        # block size of a single I/O, e.g. 4K in the example
 rw=$4        # read/write pattern, e.g. randwrite in the example
 filename=$5  # test file path, e.g. /data01/test in the example
 ioengine=$6  # I/O engine: libaio, sync, etc.; see man fio
 direct=$7    # whether to bypass the page cache; see man fio
      
 nr_cpus=`cat /proc/cpuinfo |grep "processor" |wc -l`      
 if [ $nr_cpus -lt $numjobs ];then      
     echo "Numjobs is more than cpu cores, exit!"
     exit -1      
 fi      
 let nu=$numjobs+1      
 cpulist=""      
 for ((i=1;i<10;i++))      
 do      
     list=`cat /sys/block/vdb/mq/*/cpu_list | awk '{if(i<=NF) print $i;}' i="$i" | tr -d ',' | tr '\n' ','`      
     if [ -z $list ];then      
         break      
     fi      
     cpulist=${cpulist}${list}      
 done      
 spincpu=`echo $cpulist | cut -d ',' -f 2-${nu}`      
 echo $spincpu      
 fio --ioengine=${ioengine} --runtime=60s --numjobs=${numjobs} --iodepth=${iodepth} --bs=${bs} --rw=${rw} --filename=${filename} --time_based=1 --direct=${direct} --name=test --group_reporting --cpus_allowed=$spincpu --cpus_allowed_policy=split --size=16G     
}      
      
# Set queue rq_affinity for the ESSD block devices, assuming there are 9 ESSD disks at vd[b-j]
echo 2 > /sys/block/vdb/queue/rq_affinity      
echo 2 > /sys/block/vdc/queue/rq_affinity      
echo 2 > /sys/block/vdd/queue/rq_affinity      
echo 2 > /sys/block/vde/queue/rq_affinity      
echo 2 > /sys/block/vdf/queue/rq_affinity      
echo 2 > /sys/block/vdg/queue/rq_affinity      
echo 2 > /sys/block/vdh/queue/rq_affinity      
echo 2 > /sys/block/vdi/queue/rq_affinity      
echo 2 > /sys/block/vdj/queue/rq_affinity      
      
sleep 5      
RunFio $1 $2 $3 $4 $5 $6 $7      
# RunFio 16 64 8k randwrite /data01/test libaio 1     

Or, with a configurable test data size:

vi fiotest_1    
    
function RunFio      
{      
 numjobs=$1   # number of test threads, e.g. 8 in the example
 iodepth=$2   # upper limit on the number of in-flight I/Os, e.g. 64 in the example
 bs=$3        # block size of a single I/O, e.g. 4K in the example
 rw=$4        # read/write pattern, e.g. randwrite in the example
 filename=$5  # test file path, e.g. /data01/test in the example
 ioengine=$6  # I/O engine: libaio, sync, etc.; see man fio
 direct=$7    # whether to bypass the page cache; see man fio
 size=$8      # amount of data to write, e.g. 16G
      
 nr_cpus=`cat /proc/cpuinfo |grep "processor" |wc -l`      
 if [ $nr_cpus -lt $numjobs ];then      
     echo "Numjobs is more than cpu cores, exit!"
     exit -1      
 fi      
 let nu=$numjobs+1      
 cpulist=""      
 for ((i=1;i<10;i++))      
 do      
     list=`cat /sys/block/vdb/mq/*/cpu_list | awk '{if(i<=NF) print $i;}' i="$i" | tr -d ',' | tr '\n' ','`      
     if [ -z $list ];then      
         break      
     fi      
     cpulist=${cpulist}${list}      
 done      
 spincpu=`echo $cpulist | cut -d ',' -f 2-${nu}`      
 echo $spincpu      
 fio --ioengine=${ioengine} --runtime=60s --numjobs=${numjobs} --iodepth=${iodepth} --bs=${bs} --rw=${rw} --filename=${filename} --time_based=1 --direct=${direct} --name=test --group_reporting --cpus_allowed=$spincpu --cpus_allowed_policy=split --size=${size}     
}      
      
# Set queue rq_affinity for the ESSD block devices, assuming there are 9 ESSD disks at vd[b-j]
echo 2 > /sys/block/vdb/queue/rq_affinity      
echo 2 > /sys/block/vdc/queue/rq_affinity     
echo 2 > /sys/block/vdd/queue/rq_affinity     
echo 2 > /sys/block/vde/queue/rq_affinity    
echo 2 > /sys/block/vdf/queue/rq_affinity     
echo 2 > /sys/block/vdg/queue/rq_affinity     
echo 2 > /sys/block/vdh/queue/rq_affinity     
echo 2 > /sys/block/vdi/queue/rq_affinity     
echo 2 > /sys/block/vdj/queue/rq_affinity     
      
sleep 5      
RunFio $1 $2 $3 $4 $5 $6 $7 $8     
# RunFio 16 64 8k randwrite /data01/test libaio 1 16G    

For Local SSD (1.8 TB * 8)

. ./fiotest 32 8 8K randwrite /data01/test libaio 1      
  write: IOPS=131k, BW=1024MiB/s (1074MB/s)(29.0GiB/30003msec)      
      
. ./fiotest 32 8 8K randread /data01/test libaio 1      
   read: IOPS=153k, BW=1193MiB/s (1250MB/s)(34.9GiB/30001msec)      
      
. ./fiotest 32 8 8K write /data01/test libaio 1      
  write: IOPS=134k, BW=1050MiB/s (1101MB/s)(30.8GiB/30001msec)      
      
. ./fiotest 32 8 8K read /data01/test libaio 1      
   read: IOPS=151k, BW=1182MiB/s (1240MB/s)(34.6GiB/30002msec)    
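
As a sanity check, bandwidth is roughly IOPS × block size: 131,000 × 8 KiB ≈ 1,024 MiB/s, which matches the reported randwrite figure.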

For Cloud Disk — Logical Volume (1.8 TB * 8)

. ./fiotest 32 8 8K randwrite /data01/test libaio 1      
write: IOPS=106k, BW=825MiB/s (865MB/s)(24.2GiB/30003msec)      
      
. ./fiotest 32 8 8K randread /data01/test libaio 1      
read: IOPS=106k, BW=826MiB/s (866MB/s)(24.2GiB/30016msec)      
      
. ./fiotest 32 8 8K write /data01/test libaio 1      
write: IOPS=106k, BW=826MiB/s (866MB/s)(24.2GiB/30003msec)      
      
. ./fiotest 32 8 8K read /data01/test libaio 1      
read: IOPS=106k, BW=829MiB/s (869MB/s)(24.3GiB/30003msec)     

For Cloud Disk — Single Disk (20 TB)

. ./fiotest 32 8 8K randwrite /data02/test libaio 1      
write: IOPS=106k, BW=826MiB/s (867MB/s)(24.2GiB/30003msec)      
      
. ./fiotest 32 8 8K randread /data02/test libaio 1      
read: IOPS=106k, BW=830MiB/s (870MB/s)(24.3GiB/30004msec)      
      
. ./fiotest 32 8 8K write /data02/test libaio 1      
write: IOPS=106k, BW=824MiB/s (865MB/s)(24.2GiB/30006msec)      
      
. ./fiotest 32 8 8K read /data02/test libaio 1      
read: IOPS=106k, BW=825MiB/s (866MB/s)(24.2GiB/30004msec)     

PostgreSQL: 1 Billion tpcb Test (Comprehensive Database Performance)

For Local SSD (1.8 TB * 8)

su - postgres      
pgbench -i -s 10000      
1000000000 of 1000000000 tuples (100%) done (elapsed 1058.21 s, remaining 0.00 s)      

checkpoint;      
set max_parallel_workers_per_gather =16;      
set enable_indexscan =on;      
set enable_indexonlyscan =on;      
set enable_bitmapscan =off;      
set enable_seqscan=off;      
explain (analyze) select count(*) from pgbench_accounts ;      
set enable_indexscan =off;      
set enable_indexonlyscan =off;      
set enable_bitmapscan =off;      
set enable_seqscan=on;      
explain (analyze) select count(*) from pgbench_accounts ;    

  
pgbench -M prepared -n -r -P 3 -c 128 -j 128 -T 120 -S      
      
transaction type: <builtin: select only>      
scaling factor: 10000      
query mode: prepared      
number of clients: 128      
number of threads: 128      
duration: 120 s      
number of transactions actually processed: 75279853      
latency average = 0.204 ms      
latency stddev = 0.557 ms      
tps = 627000.964875 (including connections establishing)      
tps = 627126.432692 (excluding connections establishing)      
statement latencies in milliseconds:      
         0.001  \set aid random(1, 100000 * :scale)      
         0.205  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;  

    
pgbench -M prepared -n -r -P 3 -c 128 -j 128 -T 120       
      
transaction type: <builtin: TPC-B (sort of)>      
scaling factor: 10000      
query mode: prepared      
number of clients: 128      
number of threads: 128      
duration: 120 s      
number of transactions actually processed: 3073411      
latency average = 4.997 ms      
latency stddev = 25.946 ms      
tps = 25595.742845 (including connections establishing)      
tps = 25600.289357 (excluding connections establishing)      
statement latencies in milliseconds:      
         0.001  \set aid random(1, 100000 * :scale)      
         0.001  \set bid random(1, 1 * :scale)      
         0.001  \set tid random(1, 10 * :scale)      
         0.001  \set delta random(-5000, 5000)      
         0.105  BEGIN;      
         3.458  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;      
         0.339  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;      
         0.220  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;      
         0.206  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;      
         0.196  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);      
         0.488  END;  

Data loading (write) takes 1,058 seconds.

The read-only qps is 627,000 (with the whole working set hitting the 512 GB of server memory).

The read/write tps is 25,595, and the read/write qps is 127,975.
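
(The read/write qps is derived as tps × 5, since each tpcb transaction issues five SQL statements, as shown in the statement latencies above (three UPDATEs, one SELECT, and one INSERT): 25,595 × 5 = 127,975. The same relationship holds for the later runs.)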

For Cloud Disk — Logical Volume (1.8 TB * 8)

su - postgres      
pgbench -i -s 10000      
1000000000 of 1000000000 tuples (100%) done (elapsed 948.07 s, remaining 0.00 s)   
   
checkpoint;      
set max_parallel_workers_per_gather =16;      
set enable_indexscan =on;      
set enable_indexonlyscan =on;      
set enable_bitmapscan =off;      
set enable_seqscan=off;      
explain (analyze) select count(*) from pgbench_accounts ;      
set enable_indexscan =off;      
set enable_indexonlyscan =off;      
set enable_bitmapscan =off;      
set enable_seqscan=on;      
explain (analyze) select count(*) from pgbench_accounts ;    
  
pgbench -M prepared -n -r -P 3 -c 128 -j 128 -T 120 -S      
      
transaction type: <builtin: select only>      
scaling factor: 10000      
query mode: prepared      
number of clients: 128      
number of threads: 128      
duration: 120 s      
number of transactions actually processed: 7802607      
latency average = 1.968 ms      
latency stddev = 9.345 ms      
tps = 64979.905944 (including connections establishing)      
tps = 64991.584022 (excluding connections establishing)      
statement latencies in milliseconds:      
         0.000  \set aid random(1, 100000 * :scale)      
         1.970  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;    

  
pgbench -M prepared -n -r -P 3 -c 128 -j 128 -T 120       
      
transaction type: <builtin: TPC-B (sort of)>      
scaling factor: 10000      
query mode: prepared      
number of clients: 128      
number of threads: 128      
duration: 120 s      
number of transactions actually processed: 3438342      
latency average = 4.467 ms      
latency stddev = 11.680 ms      
tps = 28627.417199 (including connections establishing)      
tps = 28633.295374 (excluding connections establishing)      
statement latencies in milliseconds:      
         0.001  \set aid random(1, 100000 * :scale)      
         0.000  \set bid random(1, 1 * :scale)      
         0.000  \set tid random(1, 10 * :scale)      
         0.000  \set delta random(-5000, 5000)      
         0.035  BEGIN;      
         3.967  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;      
         0.084  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;      
         0.117  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;      
         0.084  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;      
         0.071  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);      
         0.109  END;    

Data loading (write) takes 948 seconds.

The read-only qps is 64,979 (with a low cache-hit rate in the 192 GB of server memory and a worse I/O response time than the local disk).

The read/write tps is 28,627, and the read/write qps is 143,135.

For Cloud Disk — Single Disk (20 TB)

su - postgres       
export PGPORT=1922        
export PGDATA=/data02/pg12_$PGPORT/pg_root       
export PGHOST=$PGDATA        
      
pgbench -i -s 10000      
1000000000 of 1000000000 tuples (100%) done (elapsed 1082.59 s, remaining 0.00 s)   
   
checkpoint;      
set max_parallel_workers_per_gather =16;      
set enable_indexscan =on;      
set enable_indexonlyscan =on;      
set enable_bitmapscan =off;      
set enable_seqscan=off;      
explain (analyze) select count(*) from pgbench_accounts ;      
set enable_indexscan =off;      
set enable_indexonlyscan =off;      
set enable_bitmapscan =off;      
set enable_seqscan=on;      
explain (analyze) select count(*) from pgbench_accounts ;      


pgbench -M prepared -n -r -P 3 -c 128 -j 128 -T 120 -S       
      
transaction type: <builtin: select only>      
scaling factor: 10000      
query mode: prepared      
number of clients: 128      
number of threads: 128      
duration: 120 s      
number of transactions actually processed: 4473595      
latency average = 3.433 ms      
latency stddev = 5.682 ms      
tps = 37253.374526 (including connections establishing)      
tps = 37259.270387 (excluding connections establishing)      
statement latencies in milliseconds:      
         0.000  \set aid random(1, 100000 * :scale)      
         3.434  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;  

    
pgbench -M prepared -n -r -P 3 -c 128 -j 128 -T 120       
      
transaction type: <builtin: TPC-B (sort of)>      
scaling factor: 10000      
query mode: prepared      
number of clients: 128      
number of threads: 128      
duration: 120 s      
number of transactions actually processed: 3790414      
latency average = 4.052 ms      
latency stddev = 12.467 ms      
tps = 31545.429236 (including connections establishing)      
tps = 31550.985798 (excluding connections establishing)      
statement latencies in milliseconds:      
         0.001  \set aid random(1, 100000 * :scale)      
         0.000  \set bid random(1, 1 * :scale)      
         0.000  \set tid random(1, 10 * :scale)      
         0.000  \set delta random(-5000, 5000)      
         0.024  BEGIN;      
         3.700  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;      
         0.053  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;      
         0.077  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;      
         0.062  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;      
         0.081  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);      
         0.052  END;      

Data loading (write) takes 1,082 seconds.

The read-only qps is 37,253 (with a low cache-hit rate in the 192 GB of server memory and a worse I/O response time than the local disk).

The read/write tps is 31,545, and the read/write qps is 157,725.

PostgreSQL: 10 Billion tpcb Test (Comprehensive Database Performance)

postgres=# \dt+      
                          List of relations      
 Schema |       Name       | Type  |  Owner   |  Size   | Description       
--------+------------------+-------+----------+---------+-------------      
 public | pgbench_accounts | table | postgres | 1251 GB |       
 public | pgbench_branches | table | postgres | 3576 kB |       
 public | pgbench_history  | table | postgres | 0 bytes |       
 public | pgbench_tellers  | table | postgres | 42 MB   |       
(4 rows)      
      
postgres=# \di+      
                                      List of relations      
 Schema |         Name          | Type  |  Owner   |      Table       |  Size   | Description       
--------+-----------------------+-------+----------+------------------+---------+-------------      
 public | pgbench_accounts_pkey | index | postgres | pgbench_accounts | 209 GB  |       
 public | pgbench_branches_pkey | index | postgres | pgbench_branches | 4408 kB |       
 public | pgbench_tellers_pkey  | index | postgres | pgbench_tellers  | 43 MB   |       
(3 rows)     

For Local SSD (1.8 TB * 8)

10000000000 of 10000000000 tuples (100%) done (elapsed 9486.77 s, remaining 0.00 s)      
      
pgbench -M prepared -n -r -P 3 -c 128 -j 128 -T 120 -S      
      
transaction type: <builtin: select only>      
scaling factor: 100000      
query mode: prepared      
number of clients: 128      
number of threads: 128      
duration: 120 s      
number of transactions actually processed: 19059491      
latency average = 0.806 ms      
latency stddev = 14.075 ms      
tps = 158451.015605 (including connections establishing)      
tps = 158486.055663 (excluding connections establishing)      
statement latencies in milliseconds:      
         0.001  \set aid random(1, 100000 * :scale)      
         0.813  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;      
      
Total DISK READ :       4.56 G/s | Total DISK WRITE :      16.16 M/s    
Actual DISK READ:       4.56 G/s | Actual DISK WRITE:      16.49 M/s    
Peaked at about 7 GB/s.
      
pgbench -M prepared -n -r -P 3 -c 128 -j 128 -T 120       
      
transaction type: <builtin: TPC-B (sort of)>      
scaling factor: 100000      
query mode: prepared      
number of clients: 128      
number of threads: 128      
duration: 120 s      
number of transactions actually processed: 2969763      
latency average = 5.171 ms      
latency stddev = 27.769 ms      
tps = 24720.906138 (including connections establishing)      
tps = 24727.358605 (excluding connections establishing)      
statement latencies in milliseconds:      
         0.001  \set aid random(1, 100000 * :scale)      
         0.001  \set bid random(1, 1 * :scale)      
         0.001  \set tid random(1, 10 * :scale)      
         0.001  \set delta random(-5000, 5000)      
         0.088  BEGIN;      
         3.596  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;      
         0.304  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;      
         0.223  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;      
         0.218  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;      
         0.412  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);      
         0.340  END;     

For ESSD PL3 (1.8 TB * 8)

10000000000 of 10000000000 tuples (100%) done (elapsed 8195.55 s, remaining 0.00 s)      
      
pgbench -M prepared -n -r -P 3 -c 128 -j 128 -T 120 -S      
      
transaction type: <builtin: select only>      
scaling factor: 100000      
query mode: prepared      
number of clients: 128      
number of threads: 128      
duration: 120 s      
number of transactions actually processed: 2536629      
latency average = 6.055 ms      
latency stddev = 7.617 ms      
tps = 21123.418940 (including connections establishing)      
tps = 21126.649320 (excluding connections establishing)      
statement latencies in milliseconds:      
         0.001  \set aid random(1, 100000 * :scale)      
         6.054  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;      
      
Total DISK READ :     793.93 M/s | Total DISK WRITE :    1475.73 K/s    
Actual DISK READ:     793.21 M/s | Actual DISK WRITE:    1351.46 K/s    
Peaked at just over 900 MB/s.
      
pgbench -M prepared -n -r -P 3 -c 128 -j 128 -T 120       
      
transaction type: <builtin: TPC-B (sort of)>      
scaling factor: 100000      
query mode: prepared      
number of clients: 128      
number of threads: 128      
duration: 120 s      
number of transactions actually processed: 2071345      
latency average = 7.415 ms      
latency stddev = 9.884 ms      
tps = 17243.124457 (including connections establishing)      
tps = 17246.457208 (excluding connections establishing)      
statement latencies in milliseconds:      
         0.001  \set aid random(1, 100000 * :scale)      
         0.000  \set bid random(1, 1 * :scale)      
         0.000  \set tid random(1, 10 * :scale)      
         0.000  \set delta random(-5000, 5000)      
         0.019  BEGIN;      
         7.153  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;      
         0.047  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;      
         0.077  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;      
         0.051  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;      
         0.037  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);      
         0.029  END;     

For ESSD PL3 (20 TB)

10000000000 of 10000000000 tuples (100%) done (elapsed 7983.23 s, remaining 0.00 s)      
      
pgbench -M prepared -n -r -P 3 -c 128 -j 128 -T 120 -S      
      
transaction type: <builtin: select only>      
scaling factor: 100000      
query mode: prepared      
number of clients: 128      
number of threads: 128      
duration: 120 s      
number of transactions actually processed: 2702617      
latency average = 5.683 ms      
latency stddev = 4.769 ms      
tps = 22504.744988 (including connections establishing)      
tps = 22508.795301 (excluding connections establishing)      
statement latencies in milliseconds:      
         0.000  \set aid random(1, 100000 * :scale)      
         5.681  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;      
      
postgres=# select wait_event,count(*) from pg_stat_activity group by 1 order by 2 desc;      
     wait_event      | count       
---------------------+-------      
 DataFileRead        |   128      
 BgWriterMain        |     1      
                     |     1      
 AutoVacuumMain      |     1      
 CheckpointerMain    |     1      
 LogicalLauncherMain |     1      
 WalWriterMain       |     1      
(7 rows)      
      
pgbench -M prepared -n -r -P 3 -c 128 -j 128 -T 120       
      
transaction type: <builtin: TPC-B (sort of)>      
scaling factor: 100000      
query mode: prepared      
number of clients: 128      
number of threads: 128      
duration: 120 s      
number of transactions actually processed: 2052621      
latency average = 7.482 ms      
latency stddev = 5.452 ms      
tps = 17087.070214 (including connections establishing)      
tps = 17090.310318 (excluding connections establishing)      
statement latencies in milliseconds:      
         0.001  \set aid random(1, 100000 * :scale)      
         0.000  \set bid random(1, 1 * :scale)      
         0.000  \set tid random(1, 10 * :scale)      
         0.000  \set delta random(-5000, 5000)      
         0.020  BEGIN;      
         7.224  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;      
         0.046  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;      
         0.073  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;      
         0.052  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;      
         0.037  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);      
         0.029  END;  

Supplementary Test on 104-core 384 GB ECS with 20 TB ESSD PL3

The bandwidth and IOPS of a cloud disk depend on both the disk capacity and the ECS instance type. Therefore, an additional test is run on a 104-core 384 GB ECS instance with a 20 TB ESSD PL3.

104-core 384 GB ECS with 20 TB ESSD PL3:

pg_test_fsync     
5 seconds per test    
O_DIRECT supported on this platform for open_datasync and open_sync.    
    
Compare file sync methods using one 8kB write:    
(in wal_sync_method preference order, except fdatasync is Linux's default)    
        open_datasync                      6358.070 ops/sec     157 usecs/op    
        fdatasync                          6221.143 ops/sec     161 usecs/op    
        fsync                              3922.576 ops/sec     255 usecs/op    
        fsync_writethrough                              n/a    
        open_sync                          4078.494 ops/sec     245 usecs/op    
    
Compare file sync methods using two 8kB writes:    
(in wal_sync_method preference order, except fdatasync is Linux's default)    
        open_datasync                      3237.264 ops/sec     309 usecs/op    
        fdatasync                          5393.018 ops/sec     185 usecs/op    
        fsync                              3499.650 ops/sec     286 usecs/op    
        fsync_writethrough                              n/a    
        open_sync                          2117.949 ops/sec     472 usecs/op    
    
Compare open_sync with different write sizes:    
(This is designed to compare the cost of writing 16kB in different write    
open_sync sizes.)    
         1 * 16kB open_sync write          3684.151 ops/sec     271 usecs/op    
         2 *  8kB open_sync writes         2083.492 ops/sec     480 usecs/op    
         4 *  4kB open_sync writes         1142.101 ops/sec     876 usecs/op    
         8 *  2kB open_sync writes           85.792 ops/sec   11656 usecs/op    
        16 *  1kB open_sync writes           69.520 ops/sec   14384 usecs/op    
    
Test if fsync on non-write file descriptor is honored:    
(If the times are similar, fsync() can sync data written on a different    
descriptor.)    
        write, fsync, close                3878.116 ops/sec     258 usecs/op    
        write, close, fsync                3795.515 ops/sec     263 usecs/op    
    
Non-sync'ed 8kB writes:    
        write                            325926.283 ops/sec       3 usecs/op   


1 billion:

pgbench -i -s 10000      
1000000000 of 1000000000 tuples (100%) done (elapsed 1058.21 s, remaining 0.00 s)   
   
checkpoint;      
set max_parallel_workers_per_gather =16;      
set enable_indexscan =on;      
set enable_indexonlyscan =on;      
set enable_bitmapscan =off;      
set enable_seqscan=off;      
explain (analyze) select count(*) from pgbench_accounts ;      
set enable_indexscan =off;      
set enable_indexonlyscan =off;      
set enable_bitmapscan =off;      
set enable_seqscan=on;      
explain (analyze) select count(*) from pgbench_accounts ;     

 
pgbench -M prepared -n -r -P 3 -c 52 -j 52 -T 120 -S      
    
transaction type: <builtin: select only>    
scaling factor: 10000    
query mode: prepared    
number of clients: 52    
number of threads: 52    
duration: 120 s    
number of transactions actually processed: 106501930    
latency average = 0.058 ms    
latency stddev = 0.017 ms    
tps = 887505.958088 (including connections establishing)    
tps = 887576.878385 (excluding connections establishing)    
statement latencies in milliseconds:    
         0.001  \set aid random(1, 100000 * :scale)    
         0.058  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;   

 
pgbench -M prepared -n -r -P 3 -c 52 -j 52 -T 120     
    
transaction type: <builtin: TPC-B (sort of)>    
scaling factor: 10000    
query mode: prepared    
number of clients: 52    
number of threads: 52    
duration: 120 s    
number of transactions actually processed: 9363834    
latency average = 0.666 ms    
latency stddev = 4.141 ms    
tps = 78006.043823 (including connections establishing)    
tps = 78014.124808 (excluding connections establishing)    
statement latencies in milliseconds:    
         0.001  \set aid random(1, 100000 * :scale)    
         0.001  \set bid random(1, 1 * :scale)    
         0.000  \set tid random(1, 10 * :scale)    
         0.000  \set delta random(-5000, 5000)    
         0.020  BEGIN;    
         0.417  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;    
         0.039  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;    
         0.053  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;    
         0.046  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;    
         0.046  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);    
         0.044  END;   

10 billion:

pgbench -i -s 100000      
10000000000 of 10000000000 tuples (100%) done (elapsed 7789.99 s, remaining 0.00 s)     


pgbench -M prepared -n -r -P 3 -c 52 -j 52 -T 120 -S      
    
transaction type: <builtin: select only>    
scaling factor: 100000    
query mode: prepared    
number of clients: 52    
number of threads: 52    
duration: 120 s    
number of transactions actually processed: 3833782    
latency average = 1.627 ms    
latency stddev = 1.566 ms    
tps = 31946.758650 (including connections establishing)    
tps = 31949.973049 (excluding connections establishing)    
statement latencies in milliseconds:    
         0.001  \set aid random(1, 100000 * :scale)    
         1.628  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;    


pgbench -M prepared -n -r -P 3 -c 52 -j 52 -T 120     
    
transaction type: <builtin: TPC-B (sort of)>    
scaling factor: 100000    
query mode: prepared    
number of clients: 52    
number of threads: 52    
duration: 120 s    
number of transactions actually processed: 2287129    
latency average = 2.728 ms    
latency stddev = 2.637 ms    
tps = 19056.932156 (including connections establishing)    
tps = 19058.904144 (excluding connections establishing)    
statement latencies in milliseconds:    
         0.001  \set aid random(1, 100000 * :scale)    
         0.000  \set bid random(1, 1 * :scale)    
         0.000  \set tid random(1, 10 * :scale)    
         0.000  \set delta random(-5000, 5000)    
         0.016  BEGIN;    
         2.504  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;    
         0.038  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;    
         0.066  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;    
         0.046  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;    
         0.031  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);    
         0.024  END;   
Storage  | Disk Configuration | Nominal IOPS of the Disk    | CPU                          | Memory
ESSD PL3 | 20 TB              | 1,000,000, 4 GB/s bandwidth | 52 cores 104 threads 2.5 GHz | 384 GB

Test Case    | 104-thread machine with 20 TB ESSD PL3
fdatasync 8K | 6,221

Test Case             | 104-thread machine with 20 TB ESSD PL3
8 KB random write     | IOPS=167k, BW=1,303 MiB/s
8 KB random read      | IOPS=141k, BW=1,103 MiB/s
8 KB sequential write | IOPS=158k, BW=1,231 MiB/s
8 KB sequential read  | IOPS=121k, BW=943 MiB/s

Test Case                 | 104-thread machine with 20 TB ESSD PL3
1 billion write           | 1,058 seconds
1 billion read-only       | qps: 887,505
1 billion read and write  | tps: 78,006, qps: 390,030

Test Case                 | 104-thread machine with 20 TB ESSD PL3
10 billion write          | 7,789 seconds
10 billion read-only      | qps: 31,946
10 billion read and write | tps: 19,056, qps: 95,280

Supplementary Test 2

26-core 52-thread machine with 192 GB of memory and a single 1.8 TB ESSD PL3:

export PGPORT=1923       
export PGDATA=/data03/pg12_$PGPORT/pg_root       
export PGHOST=$PGDATA        
      
initdb -D $PGDATA -U postgres -E SQL_ASCII --lc-collate=C --lc-ctype=en_US.utf8 --wal-segsize=1024      


postgres@iZbp1621kc3de3pm12a0exZ-> pg_test_fsync     
5 seconds per test  
O_DIRECT supported on this platform for open_datasync and open_sync.  
  
Compare file sync methods using one 8kB write:  
(in wal_sync_method preference order, except fdatasync is Linux's default)  
        open_datasync                      6455.059 ops/sec     155 usecs/op  
        fdatasync                          6114.725 ops/sec     164 usecs/op  
        fsync                              4182.573 ops/sec     239 usecs/op  
        fsync_writethrough                              n/a  
        open_sync                          4318.160 ops/sec     232 usecs/op  
  
Compare file sync methods using two 8kB writes:  
(in wal_sync_method preference order, except fdatasync is Linux's default)  
        open_datasync                      3187.113 ops/sec     314 usecs/op  
        fdatasync                          5170.527 ops/sec     193 usecs/op  
        fsync                              3430.941 ops/sec     291 usecs/op  
        fsync_writethrough                              n/a  
        open_sync                          2123.727 ops/sec     471 usecs/op  
  
Compare open_sync with different write sizes:  
(This is designed to compare the cost of writing 16kB in different write  
open_sync sizes.)  
         1 * 16kB open_sync write          3902.495 ops/sec     256 usecs/op  
         2 *  8kB open_sync writes         2136.396 ops/sec     468 usecs/op  
         4 *  4kB open_sync writes         1138.692 ops/sec     878 usecs/op  
         8 *  2kB open_sync writes           90.899 ops/sec   11001 usecs/op  
        16 *  1kB open_sync writes           74.271 ops/sec   13464 usecs/op  
  
Test if fsync on non-write file descriptor is honored:  
(If the times are similar, fsync() can sync data written on a different  
descriptor.)  
        write, fsync, close                4072.978 ops/sec     246 usecs/op  
        write, close, fsync                4136.551 ops/sec     242 usecs/op  
  
Non-sync'ed 8kB writes:  
        write                            296604.607 ops/sec       3 usecs/op  


. ./fiotest 32 8 8K randwrite /data03/test libaio 1 16G  
  write: IOPS=91.9k, BW=718MiB/s (753MB/s)(42.1GiB/60019msec)  
  
. ./fiotest 32 8 8K randread /data03/test libaio 1 16G  
  read: IOPS=92.2k, BW=720MiB/s (755MB/s)(42.2GiB/60003msec)  
  
. ./fiotest 32 8 8K write /data03/test libaio 1 16G  
  write: IOPS=91.9k, BW=718MiB/s (753MB/s)(42.1GiB/60003msec)  
  
. ./fiotest 32 8 8K read /data03/test libaio 1 16G  
read: IOPS=92.0k, BW=719MiB/s (754MB/s)(42.2GiB/60026msec)    

  
pgbench -i -s 10000   
  
1000000000 of 1000000000 tuples (100%) done (elapsed 735.47 s, remaining 0.00 s)  


checkpoint;      
set max_parallel_workers_per_gather =16;      
set enable_indexscan =on;      
set enable_indexonlyscan =on;      
set enable_bitmapscan =off;      
set enable_seqscan=off;      
explain (analyze) select count(*) from pgbench_accounts ;      
set enable_indexscan =off;      
set enable_indexonlyscan =off;      
set enable_bitmapscan =off;      
set enable_seqscan=on;      
explain (analyze) select count(*) from pgbench_accounts ;    


pgbench -M prepared -n -r -P 3 -c 32 -j 32 -T 120 -S  
  
transaction type: <builtin: select only>  
scaling factor: 10000  
query mode: prepared  
number of clients: 32  
number of threads: 32  
duration: 120 s  
number of transactions actually processed: 9111047  
latency average = 0.421 ms  
latency stddev = 1.461 ms  
tps = 75913.573252 (including connections establishing)  
tps = 75917.478749 (excluding connections establishing)  
statement latencies in milliseconds:  
         0.000  \set aid random(1, 100000 * :scale)  
         0.421  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;  


pgbench -M prepared -n -r -P 3 -c 32 -j 32 -T 120   
  
transaction type: <builtin: TPC-B (sort of)>  
scaling factor: 10000  
query mode: prepared  
number of clients: 32  
number of threads: 32  
duration: 120 s  
number of transactions actually processed: 4021731  
latency average = 0.955 ms  
latency stddev = 1.840 ms  
tps = 33511.413835 (including connections establishing)  
tps = 33513.166609 (excluding connections establishing)  
statement latencies in milliseconds:  
         0.001  \set aid random(1, 100000 * :scale)  
         0.000  \set bid random(1, 1 * :scale)  
         0.000  \set tid random(1, 10 * :scale)  
         0.000  \set delta random(-5000, 5000)  
         0.019  BEGIN;  
         0.756  UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;  
         0.036  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;  
         0.046  UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;  
         0.039  UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;  
         0.032  INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);  
         0.025  END;  

Performance Summary

Test Environment Description

Storage   | Disk Configuration | Nominal IOPS of the Disk | CPU                          | Memory
Local SSD | 1.8 TB * 8         | 240,000 for each disk    | 32 cores 64 threads 2.5 GHz  | 512 GB
ESSD PL3  | 1.8 TB             | 91,800                   | 26 cores 52 threads 2.5 GHz  | 192 GB
ESSD PL3  | 1.8 TB * 8         | 91,800 for each disk     | 26 cores 52 threads 2.5 GHz  | 192 GB
ESSD PL3  | 20 TB              | 1,000,000                | 26 cores 52 threads 2.5 GHz  | 192 GB
ESSD PL3  | 20 TB              | 1,000,000                | 52 cores 104 threads 2.5 GHz | 384 GB

fsync Performance Test

This test reflects single I/O capability and, therefore, the synchronous commit latency of the database. For example, at roughly 161 usecs per fdatasync on the 20 TB ESSD PL3, a single session committing synchronously is capped at about 6,200 commits per second; with many concurrent sessions, group commit lets multiple transactions share each WAL flush.

Test Case              | Local SSD 1.8 TB * 8 | ESSD PL3 1.8 TB * 8 (26-core 52-thread, 192 GB) | ESSD PL3 20 TB (26-core 52-thread, 192 GB) | ESSD PL3 20 TB (52-core 104-thread, 384 GB) | ESSD PL3 1.8 TB (26-core 52-thread, 192 GB)
fdatasync 8K (ops/sec) | 28,216               | 6,087                                           | 5,901                                      | 6,221                                       | 6,114

FIO Test (IOPS and Throughput)

Test Case             | Local SSD 1.8 TB * 8      | ESSD PL3 1.8 TB * 8 (26-core 52-thread, 192 GB) | ESSD PL3 20 TB (26-core 52-thread, 192 GB) | ESSD PL3 20 TB (52-core 104-thread, 384 GB) | ESSD PL3 1.8 TB (26-core 52-thread, 192 GB)
8 KB random write     | IOPS=131k, BW=1024 MiB/s  | IOPS=106k, BW=825 MiB/s                         | IOPS=106k, BW=826 MiB/s                    | IOPS=167k, BW=1303 MiB/s                    | IOPS=91.9k, BW=718 MiB/s
8 KB random read      | IOPS=153k, BW=1193 MiB/s  | IOPS=106k, BW=826 MiB/s                         | IOPS=106k, BW=830 MiB/s                    | IOPS=141k, BW=1103 MiB/s                    | IOPS=92.2k, BW=720 MiB/s
8 KB sequential write | IOPS=134k, BW=1050 MiB/s  | IOPS=106k, BW=826 MiB/s                         | IOPS=106k, BW=824 MiB/s                    | IOPS=158k, BW=1231 MiB/s                    | IOPS=91.9k, BW=718 MiB/s
8 KB sequential read  | IOPS=151k, BW=1182 MiB/s  | IOPS=106k, BW=829 MiB/s                         | IOPS=106k, BW=825 MiB/s                    | IOPS=121k, BW=943 MiB/s                     | IOPS=92.0k, BW=719 MiB/s

PostgreSQL: 1 Billion tpcb Test (Comprehensive Database Performance)

Test Case                | Local SSD 1.8 TB * 8 | ESSD PL3 1.8 TB * 8 (26-core 52-thread, 192 GB) | ESSD PL3 20 TB (26-core 52-thread, 192 GB) | ESSD PL3 20 TB (52-core 104-thread, 384 GB) | ESSD PL3 1.8 TB (26-core 52-thread, 192 GB)
1 billion write          | 1,058 seconds | 948 seconds | 1,082 seconds | 1,058 seconds | 735 seconds
1 billion read-only      | qps: 627,000 (512 GB of server memory, data fully cached) | qps: 64,979 (192 GB of server memory, low cache-hit rate, worse I/O latency than local disk) | qps: 37,253 (192 GB of server memory, low cache-hit rate, worse I/O latency than local disk) | qps: 887,505 (384 GB of server memory, data fully cached) | qps: 75,913 (192 GB of server memory, low cache-hit rate, worse I/O latency than local disk)
1 billion read and write | tps: 31,545; qps: 157,725 | tps: 25,595; qps: 127,975 | tps: 28,627; qps: 143,135 | tps: 78,006; qps: 390,030 | tps: 33,511; qps: 167,555

PostgreSQL: 10 Billion tpcb Test (Comprehensive Database Performance)

Test Case                 | Local SSD 1.8 TB * 8      | ESSD PL3 1.8 TB * 8 (26-core 52-thread, 192 GB) | ESSD PL3 20 TB (26-core 52-thread, 192 GB) | ESSD PL3 20 TB (52-core 104-thread, 384 GB)
10 billion write          | 9,486 seconds             | 8,195 seconds                                   | 7,983 seconds                              | 7,789 seconds
10 billion read-only      | qps: 158,451              | qps: 21,123                                     | qps: 22,504                                | qps: 31,946
10 billion read and write | tps: 24,720; qps: 123,600 | tps: 17,243; qps: 86,215                        | tps: 17,087; qps: 85,435                   | tps: 19,056; qps: 95,280

Cloud Disks and Local Disks: Advantages and Disadvantages

Advantages of Cloud Disks:

  • More reliable: three replicas vs. one. Even if the database's standby node fails, the primary node's data still has three replicas at the storage layer, while a local disk keeps only a single copy on the physical machine.
  • More secure: cloud disk encryption is supported.
  • Larger capacity: up to 512 TB (32 TB * 16 disks) vs. 14 TB.
  • Faster backup: snapshots vs. file copies. Backing up a local disk of over 1 TB can take more than 10 hours, and copying data files off a local disk also consumes network bandwidth and OS page cache (buffered I/O).
  • Faster cloning: snapshot recovery vs. file copies.
  • Faster recovery from host failure: when the host machine fails, the cloud disk can simply be re-attached to another ECS instance, without rebuilding the data.
  • Smoother performance: a local SSD shows large latency jitter during garbage collection, while a cloud disk scatters data across many back-end devices, so overall I/O is smoother.

Cloud Disk Issues

The latency of a single I/O is higher on cloud disks (164 us vs. 30 us). How can this be mitigated?

  • Both the operating system and the database perform read-ahead and merge write I/O. Most PostgreSQL operations go through buffered I/O (shared buffers plus the OS page cache), so the impact of single-I/O latency on overall performance is usually small.
  • With enough concurrency, the extra waiting time introduced by single-I/O latency is amortized across sessions.
  • Partitioning is recommended to avoid one massive table, so that many vacuum operations can run in parallel (see the sketch after this list).
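
To make the partitioning point concrete, here is a minimal sketch (the table name, partition count, and connection options are hypothetical, not taken from the tests above): it hash-partitions a large table and then vacuums several partitions concurrently from the shell.

psql -U postgres <<'SQL'
-- a large table split into 32 hash partitions so each one can be vacuumed independently
CREATE TABLE accounts (
    aid      bigint NOT NULL,
    abalance int
) PARTITION BY HASH (aid);

-- generate and execute the 32 CREATE TABLE ... PARTITION OF statements
SELECT format(
  'CREATE TABLE accounts_p%s PARTITION OF accounts FOR VALUES WITH (MODULUS 32, REMAINDER %s);',
  i, i)
FROM generate_series(0, 31) AS i \gexec
SQL

# vacuum several partitions at the same time from the shell
for i in 0 1 2 3; do
  psql -U postgres -c "VACUUM (VERBOSE) accounts_p$i" &
done
wait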

Advice on Cloud Disk Usage

Parallelization is needed to make up for the cloud disk I/O latency.

1) Partition the table as much as possible to improve vacuum throughput.

2) Adopt a 32 KB block size for analytical use cases to increase the size of each I/O and the overall throughput (see the sketch after this list).

3) Enable read-ahead at the operating-system (block device) level, or rely on database-level prefetching (see the sketch after this list).

4) The IOPS and bandwidth of cloud disks are limited not only by the disk size but also by the ECS instance specification. ECS instances of higher specifications are recommended for I/O-sensitive use cases.
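
As a sketch of points 2) and 3) only (the paths, the device name /dev/vdb, and the read-ahead value are placeholders, not settings used in the tests above): the block size is fixed when PostgreSQL is compiled, via the --with-blocksize configure option, and block-device read-ahead can be raised with blockdev.

# point 2: a 32 KB block size requires building PostgreSQL with that block size
./configure --with-blocksize=32 --prefix=/usr/local/pgsql12_32k
make -j 16 && make install

# point 3: raise block-device read-ahead; the value is in 512-byte sectors
# (4096 sectors = 2 MB), and /dev/vdb stands for the ESSD device
blockdev --getra /dev/vdb
blockdev --setra 4096 /dev/vdb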

Kernel Optimization Suggestions for Cloud Disk Scenarios

I/O Parallelization

  • Parallelize bgwriter
  • Parallelize checkpoint
  • Parallelize walwriter
  • Parallelize vacuum of a single table

Other Parallelized Operations

Table creation, index builds, and queries can all run in parallel; see PostgreSQL v12 parallel computing. The sketch below shows the relevant settings.
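
A minimal sketch (the index name and worker counts are illustrative, not tuned values from the tests above):

psql -U postgres <<'SQL'
-- parallel B-tree index build (PostgreSQL 11 and later)
SET max_parallel_maintenance_workers = 8;
CREATE INDEX idx_accounts_abalance ON pgbench_accounts (abalance);

-- parallel scan and aggregation, as already used in the count(*) tests above
SET max_parallel_workers_per_gather = 16;
EXPLAIN (ANALYZE) SELECT count(*) FROM pgbench_accounts;
SQL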

Optimization Suggestions for DBA

Enable backend flushing (the backend_flush_after parameter) so that backend processes also write back dirty pages. This spreads I/O across more processes, improves throughput, and reduces the checkpoint burden; a minimal sketch follows.
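
The 1 MB values below are examples, not recommendations derived from the tests above:

psql -U postgres <<'SQL'
-- let backends, bgwriter, and checkpointer hand dirty pages to the kernel in
-- small batches instead of leaving all writeback to the checkpoint fsync
ALTER SYSTEM SET backend_flush_after = '1MB';
ALTER SYSTEM SET bgwriter_flush_after = '1MB';
ALTER SYSTEM SET checkpoint_flush_after = '1MB';
SELECT pg_reload_conf();
SQL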
