Alibaba Cloud Linux 3 provides Shared Memory Communication (SMC), a high-performance network protocol that functions in kernel space. SMC utilizes shared memory technology and works together with socket interfaces to establish network communications. SMC is classified into the following types based on shared memory technology: Shared Memory Communications - Direct Memory Access (SMC-D) and Shared Memory Communications over Remote Direct Memory Access (SMC-R). SMC-D uses internal shared memory (ISM) technology, and SMC-R uses remote direct memory access (RDMA) technology. This topic describes SMC-R and how to use it.
Background information
In 2017, IBM contributed SMC-R to the Linux kernel in version 4.11 and has maintained it since. For more information about SMC-R, see RFC 7609. Alibaba Cloud Linux 3 leverages Alibaba Cloud Elastic RDMA (eRDMA) to implement SMC-R in the cloud. SMC-R can transparently replace TCP in applications without loss of functionality and deliver high-performance, hardware-software co-designed networks that are accessible to all users.
The shared memory-based data exchange model of SMC-R relies on the atomic memory operations provided by RDMA. RDMA implements protocol stacks in RDMA network interface cards (RNICs) so that network nodes can bypass the kernel and directly access remote memory. Compared with traditional TCP networks, RDMA networks reduce memory-to-memory copies and consume fewer CPU resources during data transfers, which provides low-latency, high-throughput communications. The following figure shows the differences between TCP/IP and RDMA stacks.
RDMA is widely used in data-intensive and compute-intensive scenarios and is suitable for multiple fields, such as high-performance computing, machine learning, data centers, and massive storage.
In the past, RDMA could be used only together with network cards and switches in specific data centers. As a result, RDMA was complex to deploy. Alibaba Cloud eRDMA brings RDMA to the cloud. This allows all Elastic Compute Service (ECS) users to use RDMA to transmit data without the need to make complex configurations for the underlying physical network environment, such as network cards and switches.
RDMA relies on InfiniBand (IB) verbs interfaces, which are significantly different from traditional Portable Operating System Interface (POSIX) socket interfaces. Existing socket applications must be substantially modified before they can be migrated to RDMA networks, and applying RDMA requires a high level of technical expertise.
To utilize eRDMA and deliver higher network performance, Alibaba Cloud Linux 3 provides optimized SMC-R. Optimized SMC-R utilizes RDMA in an efficient manner and is compatible with standard TCP applications. This helps improve the performance of more applications without modifications.
Benefits
SMC-R provides the following benefits:
High performance
RDMA offloads protocol stacks from the kernel to network cards. In specific scenarios, this gives SMC-R lower network latency, higher throughput, and lower CPU utilization than traditional TCP stacks.
Hardware offloading
Reliable and efficient direct access to remote memory
Transparent replacement
SMC-R is compatible with POSIX socket interfaces and provides transparent replacement features to allow socket applications to switch from TCP stacks to SMC-R stacks without modifications or further development.
SMC-R can call socket interfaces to provide shared memory communications.
SMC-R enables multi-level transparent replacement of protocol stacks without functionality loss.
SMC-R provides automatic negotiation and secure fallback mechanisms.
Architecture
SMC-R architecture:
Protocol hierarchy and transparent replacement
SMC-R functions in kernel space and supports the network behaviors that user-mode programs describe by using socket interfaces. SMC-R also performs RDMA transmission by using IB verbs interfaces. SMC-R stacks use, manage, and maintain RDMA resources. Applications are not affected by the RDMA entities in the kernel. The following figure shows the architecture of SMC-R.
Alibaba Cloud Linux 3 provides a mechanism that can be used to transparently replace TCP stacks with SMC-R stacks at the process level or net namespace level. The mechanism uses LD_PRELOAD or sysctl net.smc.tcp2smc to transparently replace AF_INET sockets with AF_SMC sockets in applications. This type of replacement enables data transmission over SMC-R stacks and improves network performance based on RDMA without the need to modify applications.
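For comparison, the following minimal sketch shows what the replacement does on behalf of an application: it requests an AF_SMC stream socket instead of an AF_INET one. The sketch assumes a Linux system. The constant AF_SMC = 43 is defined by hand because Python's socket module may not export it (43 is the protocol family number that the kernel reports when the smc module is loaded). The open_stream_socket function is a hypothetical helper, not part of smc-tools, and its fallback on OSError is an application-side choice that is separate from the negotiated fallback in the kernel.

```python
import socket

# AF_SMC is protocol family 43 on Linux. It is defined by hand here because
# Python's socket module may not provide a named constant for it.
AF_SMC = 43


def open_stream_socket(prefer_smc=True):
    """Return a SOCK_STREAM socket: AF_SMC if the kernel offers it, else AF_INET."""
    if prefer_smc:
        try:
            # Protocol 0 selects SMC over IPv4.
            return socket.socket(AF_SMC, socket.SOCK_STREAM, 0)
        except OSError:
            # The family is not supported, for example because the smc
            # kernel module is not loaded. Use a plain TCP socket instead.
            pass
    return socket.socket(socket.AF_INET, socket.SOCK_STREAM, 0)
```

The returned socket is used with the usual connect(), send(), and recv() calls in either case, which is why applications do not need to change when the family is swapped for them.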
Automatic negotiation and secure fallback
SMC-R provides automatic negotiation capabilities and can dynamically fall back to TCP. To establish SMC-R communications, an SMC-R stack first establishes a TCP connection in the kernel to the peer node. During the handshake process, the local node uses specific TCP options to indicate that it supports SMC-R and checks whether the peer node also supports SMC-R.
If the negotiation is successful, the SMC-R stacks on the local and peer nodes create new RDMA resources or reuse existing RDMA resources to establish a usable RDMA reliable connection (RC). Data is transmitted between the nodes along the RDMA link.
If the negotiation fails, for example because the local or peer node does not have RDMA devices, the SMC-R stack automatically falls back to the TCP stack. The local and peer nodes use the TCP connection that is established during the negotiation to transmit data.
Note: SMC-R supports fallback to TCP stacks only during connection negotiation, but not during data transmission.
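The negotiation outcome described above can be summarized as a small decision function. This is a toy model for illustration only, not the kernel logic: the Node class, its two fields, and the negotiate function are hypothetical names, and real negotiation can fail for reasons that are not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Node:
    supports_smcr: bool    # an SMC-R stack is available on this node
    has_rdma_device: bool  # an RDMA device is present on this node


def negotiate(local: Node, peer: Node) -> str:
    """Return the transport chosen at connection setup: 'smc-r' or 'tcp'."""
    for node in (local, peer):
        if not (node.supports_smcr and node.has_rdma_device):
            # Negotiation fails, so the already established TCP
            # connection carries the data.
            return "tcp"
    return "smc-r"
```

The function is evaluated once per connection, which mirrors the rule that fallback is decided during negotiation and not during data transmission.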
The following figure shows the data flows for connection negotiation and data transmission.
Shared memory communications based on RDMA
After the negotiation is complete and a connection is established, each SMC-R stack locally allocates the SMC-R socket a ring-shaped send buffer (sndbuf) that is used to cache data to be sent and a ring-shaped remote memory buffer (RMB) that is used to cache data to be received.
When an application on the sending node attempts to send data, the application uses socket interfaces to copy the data to the local sndbuf. Then, the SMC-R stack on the sending node performs RDMA Write operations to write the data to the RMB of the receiving node, and performs RDMA Send or RDMA Receive operations to send or receive Connection Data Control (CDC) messages to update and synchronize cursors in ring-shaped buffers.
When the SMC-R stack on the receiving node detects that data is written to the RMB, the SMC-R stack uses different methods, such as epoll, to notify the application on the receiving node to copy the data from the RMB to the user-mode buffer. The data transmission is complete when the data is copied to the user-mode buffer. In SMC-R, RMBs are used as shared memory during data transmission.
The following figure shows the data transmission procedure.
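The cursor handling in a ring-shaped buffer can be sketched as follows. This is a simplified single-process model, assuming one producer cursor and one consumer cursor per buffer. In SMC-R the data arrives through RDMA Write operations and cursor updates travel in CDC messages, whereas here both are plain Python operations. The RingBuffer class and its method names are hypothetical.

```python
class RingBuffer:
    """Fixed-size ring with a producer cursor and a consumer cursor."""

    def __init__(self, size):
        self.buf = bytearray(size)
        self.size = size
        self.prod = 0  # total number of bytes written so far
        self.cons = 0  # total number of bytes read so far

    def free(self):
        """Number of bytes that can be written without overwriting unread data."""
        return self.size - (self.prod - self.cons)

    def write(self, data):
        """Store as much of data as fits and return the number of bytes stored."""
        n = min(len(data), self.free())
        for i in range(n):
            self.buf[(self.prod + i) % self.size] = data[i]
        self.prod += n  # in SMC-R, the peer learns this through a CDC message
        return n

    def read(self, max_bytes):
        """Copy out up to max_bytes of unread data and advance the consumer cursor."""
        n = min(max_bytes, self.prod - self.cons)
        out = bytes(self.buf[(self.cons + i) % self.size] for i in range(n))
        self.cons += n  # frees space for the sender
        return out


rmb = RingBuffer(8)
rmb.write(b"hello")           # stores 5 bytes
first = rmb.read(3)           # b"hel"
stored = rmb.write(b"world!")  # wraps around the end of the buffer
rest = rmb.read(16)           # b"loworld!"
```

Because the cursors only grow and positions are taken modulo the buffer size, a write can wrap around the end of the buffer without moving any data.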
Scenarios
SMC-R is suitable for the following scenarios:
Latency-sensitive data queries and processing
SMC-R is suitable for scenarios that involve high-performance data queries and data processing and require high network performance, such as Redis, Memcached, and PostgreSQL. SMC-R can replace TCP in applications in a transparent and non-invasive manner and allows applications to gain a 50% increase in queries per second (QPS) without further development or adaptation.
High-throughput data transmission
A large amount of bandwidth and CPU resources are consumed when data is exchanged or transmitted at a large scale within a cluster. RDMA enables SMC-R to deliver the same throughput at a lower CPU load than traditional TCP stacks. This saves computing resources.
During the handshake process of SMC-R, RDMA resources are requested and created. Therefore, SMC-R is not suitable for short-lived connection scenarios in which connections are frequently established and closed.
The number of connections that SMC-R supports for an ECS instance is subject to the following factors:
Available contiguous physical memory of the instance. By default, the sndbuf and the RMB that are used by each SMC-R socket use the contiguous physical memory that is allocated when an SMC-R connection is established. The default size of the sndbuf is the net.smc.wmem value, and the default size of the RMB is the net.smc.rmem value. You can run the following commands to view the default sizes:
sysctl net.smc.wmem # The default size of the sndbuf used by each SMC-R socket. Unit: bytes.
sysctl net.smc.rmem # The default size of the RMB used by each SMC-R socket. Unit: bytes.
Elastic RDMA Interface (ERI) eRDMA specifications. SMC-R creates RDMA resources for connections, including Queue Pairs (QPs), Memory Regions (MRs), Completion Queues (CQs), and Protection Domains (PDs). The maximum number of these resources varies based on the ERI eRDMA specifications of the instance.
If SMC-R cannot obtain the required resources, SMC-R securely falls back to TCP to ensure stable and reliable data transmission.
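To get a rough idea of the memory factor, you can multiply the number of connections by the combined buffer size. The following sketch assumes one sndbuf and one RMB per connection and ignores any buffer reuse, so treat the result as an estimate. It reads the values from /proc/sys/net/smc, which is where the sysctl tool looks up net.smc.* settings, and falls back to 256 KB (the initial value stated in this topic) when the files are unavailable. Both function names are hypothetical.

```python
from pathlib import Path

# Initial net.smc.wmem and net.smc.rmem value stated in this topic.
DEFAULT_BUF_BYTES = 256 * 1024


def read_smc_sysctl(name, root="/proc/sys/net/smc"):
    """Return net.smc.<name> as an int, or None if it cannot be read."""
    try:
        return int(Path(root, name).read_text().split()[0])
    except (OSError, ValueError, IndexError):
        # The smc module is not loaded, or the file has unexpected content.
        return None


def buffer_bytes_needed(connections, wmem=None, rmem=None):
    """Estimate buffer memory for the given number of SMC-R connections."""
    if wmem is None:
        wmem = read_smc_sysctl("wmem") or DEFAULT_BUF_BYTES
    if rmem is None:
        rmem = read_smc_sysctl("rmem") or DEFAULT_BUF_BYTES
    return connections * (wmem + rmem)
```

For example, buffer_bytes_needed(1000, 262144, 262144) returns 524288000, which is 500 MiB for 1,000 connections at the initial buffer sizes.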
Use SMC-R
Alibaba Cloud Linux 3 provides optimized SMC-R stacks in kernel. The SMC-R stacks are backed by comprehensive SMC-R monitoring and diagnostic tools. To use SMC-R, perform the following steps:
Create an ECS instance that supports ERI.
SMC-R relies on RDMA. Before you use SMC-R, you must create an ECS instance that supports the ERI feature to enable RDMA in the cloud. For more information, see Configure eRDMA on an enterprise-level instance.
Important: Alibaba Cloud ERI eRDMA devices and SMC do not support IPv6 addresses. If applications use IPv6, SMC falls back to TCP.
Run the following commands to load the smc and smc_diag kernel modules:
modprobe smc
modprobe smc_diag
You can run the dmesg command to view kernel-related messages. If the kernel modules are loaded, the following information is displayed:
smc: smc: load SMC module with reserve_mode
NET: Registered protocol family 43
smc: netns <netns ID> reserved ports [65500 ~ 65515] for eRDMA OOB
smc: adding ib device erdma_0 with port count 1
smc: ib device erdma_0 port 1 has pnetid
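As an alternative to reading the kernel messages, a script can check whether both modules are listed in /proc/modules. The following sketch assumes that smc and smc_diag are built as loadable modules, as the modprobe commands suggest. The loaded_modules function is a hypothetical helper.

```python
def loaded_modules(wanted=("smc", "smc_diag"), path="/proc/modules"):
    """Map each wanted module name to True if it is listed in /proc/modules."""
    try:
        with open(path) as f:
            # The first field of every line is the module name.
            present = {line.split()[0] for line in f if line.strip()}
    except OSError:
        present = set()
    return {name: name in present for name in wanted}
```

A result of {"smc": True, "smc_diag": True} indicates that both modules are loaded.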
Note: In kernel version 5.10.134-015 and later, because SMC-R is used together with eRDMA, loading the SMC modules reserves 16 socket ports (65500 to 65515) in each net namespace that can access ERIs. The ports are used to create out-of-band (OOB) RDMA connections. If the ports cannot be used, the SMC modules can still be loaded, but ERI eRDMA devices cannot be used. When the SMC modules are unloaded, the reserved socket ports are released.
You can run the following command to view the kernel version:
uname -r
Information displayed when ERI eRDMA devices cannot be used because the ports cannot be used when SMC modules are loaded:
smc: smc: load SMC module with reserve_mode
NET: Registered protocol family 43
warning: smc: netns <netns ID> reserved ports <Numbers of the ports that cannot be used> FAIL for eRDMA OOB
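Before you load the SMC modules, you can check whether another program already occupies one of the ports 65500 to 65515. The following sketch tries to bind each port and reports the ports that cannot be bound. It assumes that the reserved ports are TCP ports, which this topic does not state explicitly. The busy_ports function is a hypothetical helper. After the SMC modules are loaded, the ports are expected to be reported as busy because SMC has reserved them.

```python
import socket

# Ports 65500 to 65515 inclusive, as listed in the note above.
OOB_PORTS = range(65500, 65516)


def busy_ports(ports=OOB_PORTS, host="0.0.0.0"):
    """Return the ports that cannot currently be bound for TCP on host."""
    busy = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind((host, port))
            except OSError:
                # The port is in use or binding is not permitted.
                busy.append(port)
    return busy
```

An empty list means that all 16 ports could be bound at the time of the check.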
You can run the following commands to unload the SMC modules:
rmmod smc_diag
rmmod smc
Information displayed when SMC modules are unloaded:
NET: Unregistered protocol family 43
smc: removing ib device erdma_0
smc: netns <netns ID> released ports [65500 ~ 65515] used by eRDMA OOB
Run the following command to install the smc-tools toolkit:
yum install -y smc-tools
(Optional) Specify the default sndbuf and RMB sizes.
Each SMC-R stack locally allocates the SMC-R socket a ring-shaped sndbuf that is used to cache data to be sent and a ring-shaped RMB that is used to cache data to be received. For more information, see the Architecture section of this topic. Unlike the send buffers and receive buffers in TCP, sndbufs and RMBs in SMC-R are limited to sizes from 16 KB to 512 KB.
To maximize network acceleration based on SMC-R, you can use the following methods to change the default sndbuf and RMB sizes of SMC-R sockets for throughput-intensive network models.
Alibaba Cloud Linux 3 provides the sysctl net.smc.wmem and sysctl net.smc.rmem commands to configure the default sndbuf and RMB sizes for subsequent SMC-R sockets in the current net namespace. The default sndbuf and RMB sizes of existing SMC-R sockets are not affected.
sysctl net.smc.wmem=<sndbuf size, in bytes>
sysctl net.smc.rmem=<RMB size, in bytes>
The initial sysctl net.smc.wmem and sysctl net.smc.rmem values are 256 KB.
The application can also configure the SO_SNDBUF and SO_RCVBUF options by using the setsockopt() call to change the sizes of the sndbuf and RMB that are used by the SMC-R socket.
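The setsockopt() approach can be sketched as follows. The sketch checks the requested sizes against the 16 KB to 512 KB range given in this topic and then sets SO_SNDBUF and SO_RCVBUF. The options are set before the socket connects because the buffers are allocated when the connection is established. The set_buffer_sizes function is a hypothetical helper, and the kernel may adjust the requested values, so the sketch does not assume that the exact sizes take effect.

```python
import socket

SMC_BUF_MIN = 16 * 1024   # lower bound given in this topic
SMC_BUF_MAX = 512 * 1024  # upper bound given in this topic


def set_buffer_sizes(sock, sndbuf_bytes, rcvbuf_bytes):
    """Request sndbuf and RMB sizes on sock. Call this before connect()."""
    for size in (sndbuf_bytes, rcvbuf_bytes):
        if not SMC_BUF_MIN <= size <= SMC_BUF_MAX:
            raise ValueError(f"{size} bytes is outside the 16 KB to 512 KB range")
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, sndbuf_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf_bytes)
```

The same call works on a TCP socket that is later replaced by an SMC socket, because the replacement keeps the socket interface unchanged.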
Run TCP socket applications over SMC stacks.
Alibaba Cloud Linux 3 allows SMC to transparently replace TCP at the net namespace level or process level:
Net namespace-level transparent replacement
Alibaba Cloud Linux 3 provides the net namespace-level transparent replacement feature. The feature allows you to run the sysctl net.smc.tcp2smc command to replace TCP sockets with SMC sockets in a net namespace. The TCP sockets must meet the following conditions:
The family value is AF_INET.
The type value is SOCK_STREAM.
The protocol value is IPPROTO_IP or IPPROTO_TCP.
The following figure shows the replacement procedure.
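The three conditions can be expressed as a single check on the arguments of a socket() call. The following sketch only restates the conditions listed above. The is_replaceable function is a hypothetical name and is not the kernel implementation.

```python
import socket


def is_replaceable(family, type_, protocol):
    """Return True if a socket() call with these arguments meets all three conditions."""
    return (
        family == socket.AF_INET
        and type_ == socket.SOCK_STREAM
        and protocol in (socket.IPPROTO_IP, socket.IPPROTO_TCP)
    )
```

For example, is_replaceable(socket.AF_INET, socket.SOCK_STREAM, 0) is True, whereas an AF_INET6 or SOCK_DGRAM socket does not qualify and continues to use its original protocol stack.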
To configure transparent replacement for a net namespace, perform the following steps:
Run the following command to enable transparent replacement for a net namespace:
sysctl net.smc.tcp2smc=1
By default, sysctl net.smc.tcp2smc is set to 0, which indicates that transparent replacement is disabled.
Run the following command to run TCP socket applications in the net namespace:
./foo
The TCP sockets created by the foo application are transparently replaced by SMC sockets. The network behaviors of applications are handled by SMC-R stacks. If the local and peer nodes support SMC-R and the negotiation is successful, the nodes transmit data to each other based on RDMA. If the local or peer node does not support SMC-R or the negotiation fails, the nodes securely fall back to TCP for data transmission. For more information, see the Architecture section of this topic.
Run the following command to disable transparent replacement for the net namespace:
sysctl net.smc.tcp2smc=0
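A script that starts applications can check the current setting before it relies on the replacement. The following sketch reads /proc/sys/net/smc/tcp2smc, which is the file that the sysctl tool uses for net.smc.tcp2smc. The value applies to the net namespace of the process that reads it. The tcp2smc_enabled function is a hypothetical helper.

```python
def tcp2smc_enabled(path="/proc/sys/net/smc/tcp2smc"):
    """Return True or False for the current setting, or None if it cannot be read."""
    try:
        with open(path) as f:
            return int(f.read().strip()) != 0
    except (OSError, ValueError):
        # The smc module is not loaded or the kernel does not provide this setting.
        return None
```

A return value of None means that the setting is unavailable, in which case applications in the net namespace keep using TCP sockets.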
Process-level transparent replacement
Alibaba Cloud Linux 3 provides the process-level transparent replacement feature to replace TCP with SMC-R for an application. This feature requires the SMC-R monitoring and diagnostic toolkit smc-tools. For information about how to install smc-tools, see Step 3. Install smc-tools.
The following figure shows the replacement procedure.
When you execute the smc_run script from smc-tools to run applications, the smc_run script uses the LD_PRELOAD environment variable to set libsmc-preload.so in smc-tools as the dynamic library that must be loaded first. libsmc-preload.so replaces the TCP sockets in an application and in the child processes of the application with SMC sockets. The TCP sockets must meet the following conditions:
The family value is AF_INET.
The type value is SOCK_STREAM.
The protocol value is IPPROTO_IP or IPPROTO_TCP.
Run the following command to run foo over SMC-R stacks:
smc_run ./foo
The TCP sockets created by the foo application are transparently replaced by SMC sockets. The network behaviors of applications are handled by SMC-R stacks. If the local and peer nodes support SMC-R and the negotiation is successful, the nodes transmit data to each other based on RDMA. If the local or peer node does not support SMC-R or the negotiation fails, the nodes securely fall back to TCP for data transmission. For more information, see the Architecture section of this topic.
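The LD_PRELOAD mechanism that smc_run relies on can be reproduced from a launcher program. The following sketch builds an environment in which libsmc-preload.so is listed first in LD_PRELOAD and then starts the application with it. It is an approximation under stated assumptions: the bare library name relies on the dynamic linker finding the library in its standard search path, and the actual smc_run script may do more than this. Both function names are hypothetical.

```python
import os
import subprocess

# Library name given in this topic. A full path may be needed on some systems.
PRELOAD_LIB = "libsmc-preload.so"


def preload_env(base_env=None, lib=PRELOAD_LIB):
    """Return a copy of base_env with lib placed first in LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    # Keep any library that was already preloaded, but load lib before it.
    env["LD_PRELOAD"] = f"{lib}:{existing}" if existing else lib
    return env


def run_with_preload(argv, lib=PRELOAD_LIB):
    """Run argv with the preload library set and return the exit code."""
    return subprocess.run(argv, env=preload_env(lib=lib)).returncode
```

Calling run_with_preload(["./foo"]) is then roughly equivalent to smc_run ./foo. Child processes inherit the environment variable, which is consistent with the replacement also applying to the child processes of the application.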
Monitoring and diagnostics
When you use SMC, you can use the smc-tools toolkit to monitor and diagnose SMC stacks in the kernel. This allows you to understand various metrics of SMC network traffic and determine the network health status. For more information, see Monitor and check SMC.
References
For information about how to resolve SMC issues, such as communication failures or unavailability of specific ports, see Troubleshoot SMC.