By digoal
The PolarDB for PostgreSQL project open-sourced the core storage architecture code on September 1, 2021. This architecture is similar to the Oracle RAC architecture: it supports multiple instances on a single shared storage volume, enabling one writer with many readers (write once, read many) and rapid read scaling. Currently, the open-source code can be quickly deployed on a stand-alone machine. The code of the distributed PolarDB File System (PFS) will be available soon; with PFS, you will be able to run PolarDB on distributed shared storage.
This is the industry's first open-source cloud-native database of its kind, and it contains a large number of core technologies developed in-house for Alibaba Cloud databases. In the future, features such as cross-machine parallel execution will be introduced to enhance analytical capabilities. The original distributed version continues to evolve on the distributed branch.
Stay tuned for more information: https://github.com/alibaba/PolarDB-for-PostgreSQL
Environment: CentOS 7.9, 16 cores, 64 GB memory, 2 TB SSD.
Please see Why Does PolarDB Solve the Performance Problem of FPW? for more information about machine deployment.
yum -y install coreutils glib2 lrzsz dstat sysstat e4fsprogs xfsprogs ntp readline-devel zlib-devel openssl-devel pam-devel libxml2-devel libxslt-devel python-devel tcl-devel gcc gcc-c++ make smartmontools flex bison perl-devel perl-ExtUtils* openldap-devel jadetex openjade bzip2 git iotop lvm2 perf centos-release-scl
rpm -ivh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
parted -a optimal -s /dev/vdb mklabel gpt mkpart primary 1MiB 100%
mkfs.ext4 /dev/vdb1 -m 0 -O extent,uninit_bg -b 4096 -L lv01
vi /etc/fstab
LABEL=lv01 /data01 ext4 defaults,noatime,nodiratime,nodelalloc,barrier=0,data=writeback 0 0
mkdir /data01
mount -a
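You can verify that the file system is mounted with the intended options (a quick check; the label and mount point come from the fstab entry above):
df -h /data01
mount | grep data01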
vi /etc/sysctl.conf
# add by digoal.zhou
fs.aio-max-nr = 1048576
fs.file-max = 76724600
# Optional: set the kernel.core_pattern parameter to /data01/corefiles/core_%e_%u_%t_%s.%p.
# Create the /data01/corefiles directory that stores core dumps with permission 777 before testing. If a symbolic link is used, grant permission 777 on the target directory as well.
kernel.sem = 4096 2147483647 2147483646 512000
# Specify the semaphore limits. You can run ipcs -l or ipcs -u to view the current limits and usage. Each group of 16 processes requires one semaphore set containing 17 semaphores.
kernel.shmall = 107374182
# Specify the total size of shared memory segments. Recommended value: 80% of the memory capacity. Unit: pages.
kernel.shmmax = 274877906944
# Specify the maximum size of a single shared memory segment. Recommended value: 50% of the memory capacity. Unit: bytes. In PostgreSQL versions later than 9.2, shared memory usage drops significantly.
kernel.shmmni = 819200
# Specify the total number of shared memory segments that can be generated. There are at least 2 shared memory segments within each PostgreSQL cluster.
net.core.netdev_max_backlog = 10000
net.core.rmem_default = 262144
# The default size, in bytes, of the socket receive buffer.
net.core.rmem_max = 4194304
# The maximum size, in bytes, of the socket receive buffer.
net.core.wmem_default = 262144
# The default size, in bytes, of the socket send buffer.
net.core.wmem_max = 4194304
# The maximum size, in bytes, of the socket send buffer.
net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 4096
net.ipv4.tcp_keepalive_intvl = 20
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_mem = 8388608 12582912 16777216
net.ipv4.tcp_fin_timeout = 5
net.ipv4.tcp_synack_retries = 2
net.ipv4.tcp_syncookies = 1
# Enable SYN cookies. If the SYN backlog queue overflows, SYN cookies help defend against small-scale SYN flood attacks.
net.ipv4.tcp_timestamps = 1
# Enable TCP timestamps (required for tcp_tw_reuse); this helps reduce TIME-WAIT buildup.
net.ipv4.tcp_tw_recycle = 0
# If you set this parameter to 1, sockets in the TIME-WAIT state are recycled quickly. However, when network address translation (NAT) is used, TCP connections may fail. We recommend that you set this parameter to 0 on database servers.
net.ipv4.tcp_tw_reuse = 1
# Enable the reuse function. This function enables network sockets in the TIME-WAIT state to be reused over new TCP connections.
net.ipv4.tcp_max_tw_buckets = 262144
net.ipv4.tcp_rmem = 8192 87380 16777216
net.ipv4.tcp_wmem = 8192 65536 16777216
net.nf_conntrack_max = 1200000
net.netfilter.nf_conntrack_max = 1200000
vm.dirty_background_bytes = 409600000
# When the volume of dirty pages reaches this limit, a background writeback process (for example, pdflush) is woken up to flush to disk the dirty pages generated more than n seconds ago, where n = dirty_expire_centisecs / 100.
# The default limit is 10% of the memory capacity. For machines with a large memory capacity, we recommend specifying the limit in bytes.
vm.dirty_expire_centisecs = 3000
# Specify the maximum period to retain dirty pages. Dirty pages are flushed to disks after the time period specified by this parameter elapses. The value 3000 indicates 30 seconds.
vm.dirty_ratio = 95
# If background flushing is too slow and dirty pages exceed 95% of memory, processes that write data must flush dirty pages to disk themselves (for example, via fsync or fdatasync).
# Set this parameter high enough to prevent user processes from being forced to flush dirty pages synchronously. This makes it possible to run multiple ApsaraDB RDS instances on a single server and use control groups (cgroups) to limit the IOPS of each instance.
vm.dirty_writeback_centisecs = 100
# Specify the time interval at which the background scheduling process (such as pdflush) flushes dirty pages to disks. The value 100 indicates 1 second.
vm.swappiness = 0
# Minimize swapping: a value of 0 makes the kernel avoid swap as much as possible.
vm.mmap_min_addr = 65536
vm.overcommit_memory = 0
# Control memory overcommit. The kernel can allow processes to allocate more virtual memory than is physically available. If you set this parameter to 1, the system always considers the available memory sufficient. If the test environment has little memory, we recommend that you set this parameter to 1.
vm.overcommit_ratio = 90
# Specify the memory capacity that can be allocated when the overcommit_memory parameter is set to 2.
vm.zone_reclaim_mode = 0
# Disable NUMA zone reclaim. Non-uniform memory access (NUMA) can also be disabled at the kernel level.
net.ipv4.ip_local_port_range = 40000 65535
# Specify the range of TCP and UDP port numbers that are automatically allocated locally.
fs.nr_open = 20480000
# Specify the maximum number of file handles a single process can open.
# Take note of the following parameters:
# vm.extra_free_kbytes = 4096000
# vm.min_free_kbytes = 2097152
# We recommend that you set vm.min_free_kbytes to 1 GB for every 32 GB of memory.
# If the physical host does not provide much memory, we recommend that you do not configure vm.extra_free_kbytes and vm.min_free_kbytes.
# vm.nr_hugepages = 66536
# If the size of the shared buffer exceeds 64 GB, we recommend that you use huge pages. You can check the huge page size via the Hugepagesize field in the /proc/meminfo file.
# vm.lowmem_reserve_ratio = 1 1 1
# If the memory capacity exceeds 64 GB, we recommend that you set this parameter. Otherwise, we recommend that you retain the default value 256 256 32.
sysctl -p
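After reloading, you can spot-check a few values to confirm that the settings took effect (a minimal sanity check using parameters from the file above):
sysctl -n kernel.shmmax
sysctl -n vm.dirty_background_bytes
sysctl -n net.ipv4.tcp_tw_reuse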
vi /etc/security/limits.d/20-nproc.conf
# If nofile needs to exceed 1048576, you must first set fs.nr_open in sysctl to a larger value; after that change takes effect, you can raise nofile further.
* soft nofile 1024000
* hard nofile 1024000
* soft nproc unlimited
* hard nproc unlimited
* soft core unlimited
* hard core unlimited
* soft memlock unlimited
* hard memlock unlimited
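The limits apply to new sessions. After logging in again, you can verify them in the current shell (the values should match the file above):
ulimit -n # max open files
ulimit -c # core file size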
chmod +x /etc/rc.d/rc.local
vi /etc/rc.local
touch /var/lock/subsys/local
if test -f /sys/kernel/mm/transparent_hugepage/enabled; then
echo never > /sys/kernel/mm/transparent_hugepage/enabled
fi
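The rc.local change only takes effect at boot. To disable transparent huge pages immediately and confirm the active mode (shown in brackets in the output):
echo never > /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/enabled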
yum install -y readline-devel zlib-devel perl-CPAN bison flex git
cpan -fi Test::More IPC::Run
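You can verify that the Perl modules needed by the test harness were installed (a minimal check; the message is printed only if both modules load):
perl -MTest::More -MIPC::Run -e 'print "perl modules ok\n"'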
useradd polardb -d /data01/polardb
su - polardb
cd ~
git clone https://github.com/alibaba/PolarDB-for-PostgreSQL.git
cd PolarDB-for-PostgreSQL
./polardb_build.sh --withrep --repnum=2 --withstandby
PS: To change the lc_collate used by initdb, modify the polardb_build.sh script. Likewise, to change parameters in postgresql.conf, modify polardb_build.sh.
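Optionally, you can confirm that each node is running with pg_ctl (a sketch based on the data directories created by the build script, which are listed later in this article):
$HOME/tmp_basedir_polardb_pg_1100_bld/bin/pg_ctl -D $HOME/tmp_primary_dir_polardb_pg_1100_bld status
$HOME/tmp_basedir_polardb_pg_1100_bld/bin/pg_ctl -D $HOME/tmp_replica_dir_polardb_pg_1100_bld1 status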
Test:
$HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql -p 5432 -c 'select version();'
$HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql -p 5432 -c 'select * from pg_replication_slots;'
After the database is started, the shared memory segments can be listed with ipcs:
[polardb@iZbp15sgewxe2ioglp30z4Z PolarDB-for-PostgreSQL]$ ipcs
------ Message Queues --------
key msqid owner perms used-bytes messages
------ Shared Memory Segments --------
key shmid owner perms bytes nattch status
0x0052e2c1 16 polardb 600 56 15
0x0052e6a9 17 polardb 600 56 7
0x0052ea91 18 polardb 600 56 7
0x0052ee79 19 polardb 600 56 6
------ Semaphore Arrays --------
key semid owner perms nsems
Directory Explanation:
cd ~
[polardb@iZbp15sgewxe2ioglp30z4Z ~]$ du -sh *
968M PolarDB-for-PostgreSQL PolarDB Source Code
92M tmp_basedir_polardb_pg_1100_bld PolarDB Binary Software
840M tmp_datadir_polardb_pg_1100_bld Shared Data of RW RO Instance
24M tmp_primary_dir_polardb_pg_1100_bld Local Content of RW Instance
23M tmp_replica_dir_polardb_pg_1100_bld1 Local Content of RO Instance
23M tmp_replica_dir_polardb_pg_1100_bld2 Local Content of RO Instance
40M tmp_standby_datadir_polardb_pg_1100_bld Shared Data of Standby Instance
23M tmp_standby_dir_polardb_pg_1100_bld Local Content of Standby Instance
Configure environment variables for ease of use:
vi ~/.bashrc
# Append the following parameters:
export PS1="$USER@`/bin/hostname -s`-> "
export PGPORT=5432
export PGDATA=/data01/polardb/tmp_primary_dir_polardb_pg_1100_bld
export LANG=en_US.utf8
export PGHOME=/data01/polardb/tmp_basedir_polardb_pg_1100_bld
export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:$LD_LIBRARY_PATH
export DATE=`date +"%Y%m%d%H%M"`
export PATH=$PGHOME/bin:$PATH:.
export MANPATH=$PGHOME/share/man:$MANPATH
export PGHOST=127.0.0.1
export PGUSER=postgres
export PGDATABASE=postgres
alias rm='rm -i'
alias ll='ls -lh'
unalias vi
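Apply the variables to the current session; with PATH and the PG* variables set, psql no longer needs the full binary path or connection flags (assuming the settings above):
source ~/.bashrc
psql -c 'select 1;'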
Query:
postgres=# select version();
version
------------------------------------------------------------------------------------------------------------
PostgreSQL 11beta2 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit
(1 row)
postgres=# select * from pg_replication_slots;
slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
-----------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
standby1 | | physical | | | f | t | 25125 | | | 0/185DCB0 |
replica2 | | physical | | | f | t | 25112 | | | 0/185DB98 |
replica1 | | physical | | | f | t | 25098 | | | 0/185DB98 |
(3 rows)
Why is it 11beta2? PolarDB for PG was developed from the 11beta2 code base, and later versions will be merged in gradually. We are looking forward to PG 14.
Those who know Greenplum will remember how Greenplum advanced to version 9.4 step by step, passing through every PG version along the way.
Perform stress testing:
pgbench -i -s 100
pgbench -M prepared -n -r -P 1 -c 16 -j 16 -T 120
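While pgbench runs, you can watch replication progress on the primary (a minimal sketch; pg_stat_replication is standard PostgreSQL, and the walsenders correspond to the slots shown earlier):
psql -c 'select application_name, state, replay_lsn from pg_stat_replication;'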
Some parameters:
postgres=# select name,setting from pg_settings where name ~ 'polar';
name | setting
--------------------------------------------+------------------------------------------------------------
polar_bg_replay_batch_size | 20000
polar_bgwriter_batch_size_flushlist | 100
polar_bgwriter_max_batch_size | 5000
polar_bgwriter_sleep_lsn_lag | 100
polar_buffer_copy_lsn_lag_with_cons_lsn | 100
polar_buffer_copy_min_modified_count | 5
polar_check_checkpoint_legal_interval | 1
polar_clog_max_local_cache_segments | 128
polar_clog_slot_size | 128
polar_copy_buffers | 16384
polar_datadir | file-dio:///data01/polardb/tmp_datadir_polardb_pg_1100_bld
polar_disk_name | data01
polar_enable_async_ddl_lock_replay | on
polar_enable_parallel_bgwriter | on
polar_enable_redo_logindex | on
polar_fullpage_keep_segments | 16
polar_hostid | 2
polar_logindex_bloom_blocks | 1024
polar_logindex_mem_size | 512
polar_logindex_table_batch_size | 100
polar_logindex_unit_test | 0
polar_max_logindex_files | 80
polar_openfile_with_readonly_in_replica | off
polar_parallel_bgwriter_check_interval | 10
polar_parallel_bgwriter_delay | 10
polar_parallel_bgwriter_enable_dynamic | on
polar_parallel_bgwriter_workers | 5
polar_parallel_new_bgwriter_threshold_lag | 1024
polar_parallel_new_bgwriter_threshold_time | 10
polar_read_ahead_xlog_num | 200
polar_storage_cluster_name |
polar_streaming_xlog_meta | on
polar_version | 1.1.15
polar_vfs.localfs_mode | on
polar_vfs.max_direct_io_size | 1048576
polar_vfs.pfs_force_mount | on
polar_worker.enable_polar_worker | on
polar_worker.polar_worker_check_interval | 5
polar_worker.prealloc_wal_file_num | 2
polar_worker.xlog_temp_outdate_time | -1
polar_xlog_page_buffers | 1024
polar_xlog_queue_buffers | 512
(42 rows)
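Each of these GUCs can also be inspected individually with SHOW, for example to confirm which node you are connected to or which version you are running:
psql -c 'show polar_hostid;'
psql -c 'show polar_version;'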
https://github.com/alibaba/PolarDB-for-PostgreSQL
We provide a script that uses the default configuration to compile PolarDB. This section describes how to use it to deploy a PolarDB instance with storage on your local disks. System requirements: CentOS 7.5 and above. The following setup steps were tested on CentOS 7.5.
1. Download the source code of PolarDB from https://github.com/alibaba/PolarDB-for-PostgreSQL/tree/main.
2. Install dependent packages:
sudo yum install readline-devel zlib-devel perl-CPAN bison flex
sudo cpan -fi Test::More IPC::Run
3. Select the command according to your needs to start deployment.
Run the following command to compile only the source code (without creating any instance):
./polardb_build.sh --noinit
Run the following command to compile the source code and create an instance with one node (the primary node). The port number is 5432.
./polardb_build.sh
Run the following command to compile the source code and create an instance with two nodes: one primary (read-write) node and one read-only replica node:
./polardb_build.sh --withrep --repnum=1
Run the following command to compile the source code and create an instance with three nodes: one primary node, one read-only replica node, and one standby node:
./polardb_build.sh --withrep --repnum=1 --withstandby
Run the following command to compile the source code and create an instance with four nodes: one primary node, two read-only replica nodes, and one standby node:
./polardb_build.sh --withrep --repnum=2 --withstandby
4. After the deployment, please refer to the following steps to check and test the created instance to make sure it is deployed correctly.
Run the following commands to check the instance:
$HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql -p 5432 -c 'select version();'
$HOME/tmp_basedir_polardb_pg_1100_bld/bin/psql -p 5432 -c 'select * from pg_replication_slots;'
Run the following command to perform a full regression test:
./polardb_build.sh --withrep --repnum=1 --withstandby -r-check-all -e -r-contrib -r-pl -r-external -r-installcheck-all
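As a quick functional check of the write-once-read-many architecture, you can attempt a write on a read-only node and expect it to be rejected (a sketch; port 5433 for the first replica is an assumption based on the build script's default port assignment):
psql -p 5433 -c 'create table t_should_fail(id int);'
# Expected: ERROR: cannot execute CREATE TABLE in a read-only transaction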