By digoal
Network Block Device (NBD) is a cheap shared storage option. It can serve as a lightweight shared storage test environment for deploying and learning Oracle RAC and PolarDB for PostgreSQL.
In addition to Transmission Control Protocol (TCP), NBD supports the Remote Direct Memory Access (RDMA)-based Sockets Direct Protocol (SDP), so it can also be used to test the performance of RAC and PolarDB for PostgreSQL.
The open-source address of PolarDB for PostgreSQL: https://github.com/ApsaraDB/PolarDB-for-PostgreSQL
Build Oracle-RAC using NBD: http://www.fi.muni.cz/~kripac/orac-nbd/
A few issues need attention, such as caching at the operating system level: if you fail over from one NBD server to another while data written to the image files is still only in the old server's cache, that data is lost. This can be avoided by using the nbd-server sync export mode.
Environment: one server (with two 100 GB disks, vdb and vdc) and two clients, all running CentOS 7.9.
yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
yum install -y https://download.postgresql.org/pub/repos/yum/reporpms/EL-7-x86_64/pgdg-redhat-repo-latest.noarch.rpm
yum install -y centos-release-scl
yum install -y postgresql14*
vi /etc/sysctl.conf
# add by digoal.zhou
fs.aio-max-nr = 1048576
fs.file-max = 76724600
# Optional: kernel.core_pattern = /data01/corefiles/core_%e_%u_%t_%s.%p
# Create the /data01/corefiles directory (used to store core dumps) with permission 777 before testing. If a symbolic link is used, set the target directory to 777 as well.
kernel.sem = 4096 2147483647 2147483646 512000
# Semaphore settings. Run ipcs -l or ipcs -u to view current semaphore usage. Processes are grouped in batches of 16, and each group requires 17 semaphores.
kernel.shmall = 107374182
# Specify the total size of shared memory segments. Recommended value: 80% of the memory capacity. Unit: pages.
kernel.shmmax = 274877906944
# Specify the maximum size of a single shared memory segment. Recommended value: 50% of the memory capacity. Unit: bytes. In PostgreSQL 9.2 and later, shared memory usage drops significantly.
kernel.shmmni = 819200
# Specify the total number of shared memory segments that can be generated. There are at least 2 shared memory segments within each PostgreSQL cluster.
net.core.netdev_max_backlog = 10000
net.core.rmem_default = 262144
# The default setting of the socket receive buffer in bytes.
net.core.rmem_max = 4194304
# The maximum receive socket buffer size in bytes
net.core.wmem_default = 262144
# The default setting (in bytes) of the socket send buffer.
net.core.wmem_max = 4194304
# The maximum send socket buffer size in bytes.
net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 4096
net.ipv4.tcp_keepalive_intvl = 20
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_mem = 8388608 12582912 16777216
net.ipv4.tcp_fin_timeout = 5
net.ipv4.tcp_synack_retries = 2
net.ipv4.tcp_syncookies = 1
# Enable SYN cookies. When the SYN backlog queue overflows, SYN cookies help defend against small-scale SYN flood attacks.
net.ipv4.tcp_timestamps = 1
# Reduce time_wait.
net.ipv4.tcp_tw_recycle = 0
# If you set this parameter to 1, sockets in the TIME-WAIT state over TCP connections are recycled. However, if network address translation (NAT) is used, TCP connections may fail. We recommend that you set this parameter to 0 on the database server.
net.ipv4.tcp_tw_reuse = 1
# Enable the reuse function. This function enables network sockets in the TIME-WAIT state to be reused over new TCP connections.
net.ipv4.tcp_max_tw_buckets = 262144
net.ipv4.tcp_rmem = 8192 87380 16777216
net.ipv4.tcp_wmem = 8192 65536 16777216
net.nf_conntrack_max = 1200000
net.netfilter.nf_conntrack_max = 1200000
vm.dirty_background_bytes = 409600000
# When the amount of dirty memory reaches this value, the background flush process (pdflush or its successors) starts writing out dirty pages that are older than (dirty_expire_centisecs/100) seconds.
# The default limit is 10% of the memory capacity. We recommend that you specify the limit in bytes for machines with large memory capacity.
vm.dirty_expire_centisecs = 3000
# Specify the maximum period to retain dirty pages. Dirty pages are flushed to disks after the time period specified by this parameter elapses. The value 3000 indicates 30 seconds.
vm.dirty_ratio = 95
# If background flushing is too slow and dirty pages exceed 95% of memory, processes that write to disk (via fsync, fdatasync, and so on) are forced to flush dirty pages synchronously.
# Setting this properly prevents user processes from being stalled flushing dirty pages, which is especially useful when a single machine hosts multiple instances and cgroups are used to limit per-instance IOPS.
vm.dirty_writeback_centisecs = 100
# Specify the time interval at which the background scheduling process (such as pdflush or other processes) flushes dirty pages to disks. The value 100 indicates 1 second.
vm.swappiness = 0
# Disable the swap partition
vm.mmap_min_addr = 65536
vm.overcommit_memory = 0
# Controls memory overcommit, that is, whether the kernel may grant more memory than is physically available. With the value 1, the kernel always considers available memory sufficient. If the test environment has little memory, we recommend setting this parameter to 1.
vm.overcommit_ratio = 90
# Specify the memory capacity that can be allocated when the overcommit_memory parameter is set to 2.
vm.zone_reclaim_mode = 0
# Disable NUMA memory reclaim (zone reclaim). NUMA can also be disabled entirely via the kernel boot parameters.
net.ipv4.ip_local_port_range = 40000 65535
# Specify the range of TCP or UDP port numbers that are automatically allocated locally.
fs.nr_open=20480000
# Specify the maximum number of file handles that a single process can open.
# Take note of the following parameters:
# vm.extra_free_kbytes = 4096000
# vm.min_free_kbytes = 6291456 # We recommend setting vm.min_free_kbytes to 1 GB for every 32 GB of memory.
# If the physical host does not provide much memory, we recommend that you do not configure vm.extra_free_kbytes and vm.min_free_kbytes.
# vm.nr_hugepages = 66536
# If the shared buffer exceeds 64 GB, we recommend using huge pages. The huge page size can be checked via the Hugepagesize field in /proc/meminfo.
# vm.lowmem_reserve_ratio = 1 1 1
# If the memory capacity exceeds 64 GB, we recommend that you set this parameter. Otherwise, we recommend that you retain the default value 256 256 32.
# Apply the settings:
# sysctl -p
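After running sysctl -p, a few of the values can be spot-checked to confirm they were applied; the keys below are just examples:
sysctl kernel.shmmax vm.swappiness fs.nr_open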
vi /etc/security/limits.conf
# If nofile needs to exceed 1048576, first raise fs.nr_open via sysctl; nofile can only be increased further after that change has taken effect.
# Comment out the other lines and add the following:
* soft nofile 1024000
* hard nofile 1024000
* soft nproc unlimited
* hard nproc unlimited
* soft core unlimited
* hard core unlimited
* soft memlock unlimited
* hard memlock unlimited
Modify the following file as well, if it exists:
/etc/security/limits.d/20-nproc.conf
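After logging back in, the new limits can be verified with ulimit; the values should match the configuration above:
ulimit -n    # open files, expect 1024000
ulimit -c    # core file size, expect unlimited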
echo never > /sys/kernel/mm/transparent_hugepage/enabled
To make the configuration persistent across reboots:
chmod +x /etc/rc.d/rc.local
vi /etc/rc.local
touch /var/lock/subsys/local
if test -f /sys/kernel/mm/transparent_hugepage/enabled; then
echo never > /sys/kernel/mm/transparent_hugepage/enabled
fi
echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource
Supported clocks:
cat /sys/devices/system/clocksource/clocksource0/available_clocksource
kvm-clock tsc acpi_pm
Modify the clock:
echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource
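To confirm the switch took effect (it should now print tsc):
cat /sys/devices/system/clocksource/clocksource0/current_clocksource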
The NBD userspace package can be installed directly, but the NBD kernel module has to be compiled on the client (the stock CentOS 7 kernel does not ship the nbd module by default).
yum install -y nbd
[root@iZbp1eo3op9s5gxnvc7aokZ ~]# ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.17.164.66 netmask 255.255.240.0 broadcast 172.17.175.255
inet6 fe80::216:3eff:fe00:851f prefixlen 64 scopeid 0x20<link>
ether 00:16:3e:00:85:1f txqueuelen 1000 (Ethernet)
RX packets 159932 bytes 229863288 (219.2 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 30124 bytes 3706650 (3.5 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
[root@iZbp1eo3op9s5gxnvc7aokZ ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vda 253:0 0 100G 0 disk
└─vda1 253:1 0 100G 0 part /
vdb 253:16 0 100G 0 disk
vdc 253:32 0 100G 0 disk
Write the nbd-server configuration file (see man 5 nbd-server for details).
Note: Do not leave trailing spaces at the end of configuration lines, or parsing problems may occur.
vi /root/nbd.conf
# This is a comment
[generic]
# The [generic] section is required, even if nothing is specified
# there.
# When either of these options are specified, nbd-server drops
# privileges to the given user and group after opening ports, but
# _before_ opening files.
# user = nbd
# group = nbd
listenaddr = 0.0.0.0
port = 1921
[export1]
exportname = /dev/vdb
readonly = false
multifile = false
copyonwrite = false
flush = true
fua = true
sync = true
[export2]
exportname = /dev/vdc
readonly = false
multifile = false
copyonwrite = false
flush = true
fua = true
sync = true
Start NBD-server:
[root@iZbp1eo3op9s5gxnvc7aokZ ~]# nbd-server -C /root/nbd.conf
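A quick way to confirm that nbd-server is running and listening on the configured port (ss and ps are available on a standard CentOS 7 installation):
ss -lntp | grep 1921
ps -ef | grep nbd-server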
On the clients, the NBD kernel module has to be compiled against the running kernel. Install the build dependencies first:
yum install -y kernel-devel kernel-headers elfutils-libelf-devel gcc gcc-c++
[root@iZbp1eo3op9s5gxnvc7aolZ ~]# uname -r
3.10.0-1160.42.2.el7.x86_64
In theory, the configuration for the 42.2 minor release should be the same as above, but no matching rpm was found at first; kernel-3.10.0-1160.el7 seemed to be the only version available.
PS: The source rpm for the exact minor version was later found, so the commands below also show the revised forms using 3.10.0-1160.42.2.el7.
curl https://vault.centos.org/7.9.2009/os/Source/SPackages/kernel-3.10.0-1160.el7.src.rpm -o ./kernel-3.10.0-1160.el7.src.rpm
Revised to:
curl https://vault.centos.org/7.9.2009/updates/Source/SPackages/kernel-3.10.0-1160.42.2.el7.src.rpm -o ./kernel-3.10.0-1160.42.2.el7.src.rpm
rpm -ivh kernel-3.10.0-1160.el7.src.rpm
Revised to:
rpm -ivh kernel-3.10.0-1160.42.2.el7.src.rpm
cd rpmbuild/SOURCES/
tar xvf linux-3.10.0-1160.el7.tar.xz -C /usr/src/kernels/
Revised to:
tar xvf linux-3.10.0-1160.42.2.el7.tar.xz -C /usr/src/kernels/
cd /usr/src/kernels/linux-3.10.0-1160.el7
Revised to:
cd /usr/src/kernels/linux-3.10.0-1160.42.2.el7
make mrproper
cp /usr/src/kernels/3.10.0-1160.42.2.el7.x86_64/Module.symvers ./
cp /boot/config-3.10.0-1160.el7.x86_64 ./.config
Revised to:
cp /boot/config-3.10.0-1160.42.2.el7.x86_64 ./.config
make oldconfig
make prepare
make scripts
The following fixes a compilation error: nbd.c uses REQ_TYPE_SPECIAL, which the newer include/linux/blkdev.h only defines under __GENKSYMS__, so the symbol is undefined when the module is built normally.
As can be seen in /usr/src/kernels/linux-3.10.0-1160.42.2.el7/include/linux/blkdev.h, REQ_TYPE_SPECIAL corresponds to the value 7:
/*
 * request command types
 */
enum rq_cmd_type_bits {
        REQ_TYPE_FS = 1,        /* fs request */
        REQ_TYPE_BLOCK_PC,      /* scsi command */
        REQ_TYPE_SENSE,         /* sense request */
        REQ_TYPE_PM_SUSPEND,    /* suspend request */
        REQ_TYPE_PM_RESUME,     /* resume request */
        REQ_TYPE_PM_SHUTDOWN,   /* shutdown request */
#ifdef __GENKSYMS__
        REQ_TYPE_SPECIAL,       /* driver defined type */
#else
        REQ_TYPE_DRV_PRIV,      /* driver defined type */
#endif
        /*
         * for ATA/ATAPI devices. this really doesn't belong here, ide should
         * use REQ_TYPE_DRV_PRIV and use rq->cmd[0] with the range of driver
         * private REQ_LB opcodes to differentiate what type of request this is
         */
        REQ_TYPE_ATA_TASKFILE,
        REQ_TYPE_ATA_PC,
};
Modify the file:
vi drivers/block/nbd.c
Change
sreq.cmd_type = REQ_TYPE_SPECIAL;
to
sreq.cmd_type = 7;
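Equivalently, the one-line change can be applied non-interactively; a minimal sketch (check the resulting line before building):
sed -i 's/sreq.cmd_type = REQ_TYPE_SPECIAL;/sreq.cmd_type = 7;/' drivers/block/nbd.c
grep -n 'sreq.cmd_type' drivers/block/nbd.c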
Continue the compilation:
make CONFIG_BLK_DEV_NBD=m M=drivers/block CONFIG_STACK_VALIDATION=
cp drivers/block/nbd.ko /lib/modules/3.10.0-1160.42.2.el7.x86_64/kernel/drivers/block/
Load the NBD module:
depmod -a
modinfo nbd
modprobe nbd
Configure automatic loading of the NBD module:
#cd /etc/sysconfig/modules/
#vi nbd.modules
Add the following contents to the file
#!/bin/sh
/sbin/modinfo -F filename nbd > /dev/null 2>&1
if [ $? -eq 0 ]; then
/sbin/modprobe nbd
fi
#chmod 755 nbd.modules    # This step is crucial.
#reboot
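After the reboot, it is worth confirming that the module was loaded automatically and the device nodes exist:
lsmod | grep nbd
ls /dev/nbd*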
Mount the network block device:
[root@iZbp1eo3op9s5gxnvc7aomZ ~]# nbd-client 172.17.164.66 1921 -N export1 /dev/nbd0
Negotiation: ..size = 102400MB
bs=1024, sz=107374182400 bytes
[root@iZbp1eo3op9s5gxnvc7aomZ ~]# nbd-client 172.17.164.66 1921 -N export2 /dev/nbd1
Negotiation: ..size = 102400MB
bs=1024, sz=107374182400 bytes
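Both exports should now appear as ordinary block devices on the client; a quick check:
lsblk /dev/nbd0 /dev/nbd1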
Format the file system and mount it:
mkfs.ext4 /dev/nbd0
mkfs.ext4 /dev/nbd1
mkdir /data01
mkdir /data02
mount /dev/nbd0 /data01
mount /dev/nbd1 /data02
Write tests:
# dd if=/dev/zero of=/data01/test oflag=direct bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 4.90611 s, 214 MB/s
# dd if=/dev/zero of=/data02/test oflag=direct bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 4.90611 s, 214 MB/s
df -h
/dev/nbd0 99G 1.1G 93G 2% /data01
/dev/nbd1 99G 1.1G 93G 2% /data02
Some I/O activity can be observed on the server side with iotop:
13899 be/4 root 0.00 B/s 42.56 M/s 0.00 % 73.39 % nbd-server -C /root/nbd.conf [pool]
13901 be/4 root 0.00 B/s 42.81 M/s 0.00 % 73.00 % nbd-server -C /root/nbd.conf [pool]
13897 be/4 root 0.00 B/s 42.56 M/s 0.00 % 72.95 % nbd-server -C /root/nbd.conf [pool]
13900 be/4 root 0.00 B/s 42.32 M/s 0.00 % 72.47 % nbd-server -C /root/nbd.conf [pool]
fsync test:
[root@iZbp1eo3op9s5gxnvc7aomZ data01]# /usr/pgsql-14/bin/pg_test_fsync -f /data01/test
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.
Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
open_datasync 1056.250 ops/sec 947 usecs/op
fdatasync 1032.631 ops/sec 968 usecs/op
fsync 404.807 ops/sec 2470 usecs/op
fsync_writethrough n/a
open_sync 414.387 ops/sec 2413 usecs/op
Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
open_datasync 553.453 ops/sec 1807 usecs/op
fdatasync 1011.726 ops/sec 988 usecs/op
fsync 404.171 ops/sec 2474 usecs/op
fsync_writethrough n/a
open_sync 208.758 ops/sec 4790 usecs/op
Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB in different write
open_sync sizes.)
1 * 16kB open_sync write 405.717 ops/sec 2465 usecs/op
2 * 8kB open_sync writes 208.324 ops/sec 4800 usecs/op
4 * 4kB open_sync writes 106.849 ops/sec 9359 usecs/op
8 * 2kB open_sync writes 52.999 ops/sec 18868 usecs/op
16 * 1kB open_sync writes 26.657 ops/sec 37513 usecs/op
Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written on a different
descriptor.)
write, fsync, close 413.350 ops/sec 2419 usecs/op
write, close, fsync 417.832 ops/sec 2393 usecs/op
Non-sync'ed 8kB writes:
write 608345.462 ops/sec 2 usecs/op
The other client is set up in the same way, except that mkfs is not needed. If you want a cluster file system (one that makes writes visible across nodes and supports distributed locking), you can use Linux GFS2.
Disconnect the NBD:
Unmount first:
umount /data01
umount /data02
Then disconnect the devices:
nbd-client -d /dev/nbd0
nbd-client -d /dev/nbd1
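If a client needs to be switched to a standby NBD server (as noted at the beginning, this is only safe when the export is served with sync = true so no data is left in the old server's cache), the sequence looks roughly as follows; 172.17.164.67 is only a placeholder address for the standby server:
umount /data01
nbd-client -d /dev/nbd0
nbd-client 172.17.164.67 1921 -N export1 /dev/nbd0
mount /dev/nbd0 /data01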
An introduction to NBD (from Wikipedia):
Network block device
From Wikipedia, the free encyclopedia
In Linux, a network block device is a device node whose content is provided by a remote machine. Typically, network block devices are used to access a storage device that does not physically reside in the local machine but on a remote one. As an example, the local machine can access a fixed disk that is attached to another computer.
Kernel client/userspace server
Technically, a network block device is realized by two components. In the client machine, where the device node is to work, a kernel module named nbd controls the device. Whenever a program tries to access the device, this kernel module forwards the request to the server machine, where the data physically resides.
On the server machine, requests from the client are handled by a userspace program called nbd-server. This program is not implemented as a kernel module because all it has to do is to serve network requests, which in turn just requires regular access to the server filesystem.
Example
If the file /tmp/xxx on ComputerA has to be made accessible on ComputerB, one performs the following steps:
On ComputerA:
nbd-server 2000 /tmp/xxx
On ComputerB:
modprobe nbd
nbd-client ComputerA 2000 /dev/nbd0
The file is now accessible on ComputerB as device /dev/nbd0. If the original file was for example a disk image, it could be mounted for example via mount /dev/nbd0 /mnt/whatever.
The command modprobe nbd is not necessary if module loading is done automatically. Once the module is in the kernel, nbd-client is used to send commands to it, such as associating a given remote file with a given local nbd device. To finish using /dev/nbd0, that is, to destroy its association with the file on the other computer, one can run nbd-client -d /dev/nbd0 on ComputerB.
In this example, 2000 is the number of the server port through which the file is made accessible. Any available port could be used.
Availability
The network block device client module is available on Linux and GNU Hurd.
Since the server is a userspace program, it can potentially run on every Unix-like platform. It was ported to Solaris.[1]
In CentOS or RHEL, you can use the EPEL additional repository to install NBD:
[root@150 postgresql-9.3.5]# yum install -y nbd
Loaded plugins: fastestmirror, refresh-packagekit, security, versionlock
Loading mirror speeds from cached hostfile
epel/metalink | 5.4 kB 00:00
* base: mirrors.skyshe.cn
* epel: mirrors.ustc.edu.cn
* extras: mirrors.163.com
* updates: centos.mirror.cdnetworks.com
base | 3.7 kB 00:00
extras | 3.3 kB 00:00
updates | 3.4 kB 00:00
updates/primary_db | 5.3 MB 00:21
Setting up Install Process
Resolving Dependencies
--> Running transaction check
---> Package nbd.x86_64 0:2.9.20-7.el6 will be installed
--> Finished Dependency Resolution
Dependencies Resolved
====================================================================================================================================
Package Arch Version Repository Size
====================================================================================================================================
Installing:
nbd x86_64 2.9.20-7.el6 epel 43 k
Transaction Summary
====================================================================================================================================
Install 1 Package(s)
Total download size: 43 k
Installed size: 83 k
Downloading Packages:
nbd-2.9.20-7.el6.x86_64.rpm | 43 kB 00:00
Running rpm_check_debug
Running Transaction Test
Transaction Test Succeeded
Running Transaction
Installing : nbd-2.9.20-7.el6.x86_64 1/1
Verifying : nbd-2.9.20-7.el6.x86_64 1/1
Installed:
nbd.x86_64 0:2.9.20-7.el6
Complete!
Package contents:
[root@iZbp1eo3op9s5gxnvc7aokZ ~]# rpm -ql nbd
/etc/sysconfig/nbd-server
/usr/bin/gznbd
/usr/bin/nbd-server
/usr/bin/nbd-trdump
/usr/lib/systemd/system/nbd-server.service
/usr/lib/systemd/system/nbd@.service
/usr/sbin/nbd-client
/usr/share/doc/nbd-3.14
/usr/share/doc/nbd-3.14/README.md
/usr/share/doc/nbd-3.14/proto.md
/usr/share/doc/nbd-3.14/todo.txt
/usr/share/licenses/nbd-3.14
/usr/share/licenses/nbd-3.14/COPYING
/usr/share/man/man1/nbd-server.1.gz
/usr/share/man/man1/nbd-trdump.1.gz
/usr/share/man/man5/nbd-server.5.gz
/usr/share/man/man5/nbdtab.5.gz
/usr/share/man/man8/nbd-client.8.gz