
An In-Depth Understanding of RDMA Interaction Mechanism between Software and Hardware

This article analyzes the working principle and the interaction mechanism between the software and hardware of RDMA technology under the high-performance network environment of the data center.

By Yujing


Preface

As data centers rapidly develop, high-performance networks keep pushing the limits of bandwidth and latency. The bandwidth of network interface controllers (NICs) has grown from 10 Gb/s and 25 Gb/s in the past to 100 Gb/s and 200 Gb/s today, and is expected to reach 400 Gb/s in the next generation. This growth rate far exceeds that of the CPU. To meet the communication requirements of high-performance networks, Alibaba Cloud has not only developed high-performance user-mode protocol stacks (Luna and Solar), but has also adopted RDMA technology to take full advantage of high-performance networks, particularly in storage and AI. Compared with the Socket interface provided by Kernel TCP, the abstractions of RDMA are more complex. To use RDMA efficiently, it is essential to understand its working principles and mechanisms. This article uses NVIDIA RDMA NICs as an example to analyze the working principles of RDMA and the interaction mechanism between software and hardware.

What is RDMA?

RDMA (Remote Direct Memory Access) is designed to reduce the data-processing latency on servers during network transfers. It enables one system to transmit data quickly to another over the network with almost no CPU involvement. By eliminating memory copies and context switches, it frees memory bandwidth and CPU cycles and improves overall system performance.


First, let’s look at the most common Kernel TCP. Its process of receiving data mainly goes through the following stages:

  1. The NIC driver allocates DMA buffers from the kernel and fills the receive queue.
  2. The NIC receives a packet, initiates DMA, and writes it into a DMA buffer from the receive queue.
  3. The NIC raises an interrupt.
  4. The NIC driver checks the receive queue, takes out the DMA buffer, and hands it to the TCP stack.
  5. The TCP stack processes the packet.
  6. The operating system notifies the user-mode app of the read event.
  7. The user-mode app prepares a buffer and initiates a system call.
  8. The kernel copies the data into the user-mode application's buffer.
  9. The system call returns.

As you can see, there are three context switches (the interrupt plus the switches into and out of kernel mode for the system call) and one memory copy. Although the kernel has optimizations such as NAPI to reduce the number of interrupts, the latency and throughput of Kernel TCP still perform poorly in high-performance scenarios.
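For contrast, the user-mode half of this Kernel TCP path usually looks like the minimal sketch below. It assumes a connected socket fd that has already been registered with an epoll instance epfd (both hypothetical names); steps 6-9 above correspond to epoll_wait returning and the subsequent read system call.

#include <sys/epoll.h>
#include <unistd.h>

/* Minimal sketch of the user-mode half of Kernel TCP receive:
 * wait for a read event, then copy the data out of the kernel. */
int tcp_recv_loop(int epfd, int fd)
{
    char buf[4096];
    struct epoll_event ev;

    for (;;) {
        /* Step 6: the kernel reports a read event on the socket. */
        if (epoll_wait(epfd, &ev, 1, -1) <= 0)
            continue;

        /* Steps 7-9: prepare a user buffer and issue a system call;
         * the kernel copies the received data into buf. */
        ssize_t len = read(fd, buf, sizeof(buf));
        if (len == 0)
            return 0;   /* the peer closed the connection */
        if (len < 0)
            continue;   /* e.g. EAGAIN on a non-blocking socket */

        /* ... process len bytes in buf ... */
    }
}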

After using the RDMA technology, the main process for receiving data changes to the following stages (taking send/recv as an example):

  1. The user-mode app allocates buffers and fills the receive queue.
  2. The NIC receives a packet, initiates DMA, and writes it into a buffer from the receive queue.
  3. The NIC generates a completion event (but may not generate an interrupt).
  4. The user-mode app polls for the completion event.
  5. The user-mode app processes the buffer.

This process involves no context switch, no data copy, no protocol-stack processing (it is offloaded to the RDMA NIC), and no kernel involvement. The CPU can focus on data processing and business logic instead of spending large numbers of cycles on protocol processing and memory copies.
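A minimal sketch of the same receive path with libibverbs is shown below. It assumes that a QP, its receive CQ, and a registered MR already exist (qp, recv_cq, and mr are placeholders created elsewhere); it posts one receive buffer and then busy-polls the CQ for the completion.

#include <stdint.h>
#include <infiniband/verbs.h>

/* Sketch: post one receive buffer and poll for its completion. */
int rdma_recv_one(struct ibv_qp *qp, struct ibv_cq *recv_cq, struct ibv_mr *mr)
{
    struct ibv_sge sge = {
        .addr   = (uint64_t)mr->addr,   /* registered buffer */
        .length = mr->length,
        .lkey   = mr->lkey,
    };
    struct ibv_recv_wr wr = {
        .wr_id   = 0x1234,              /* echoed back in the completion */
        .sg_list = &sge,
        .num_sge = 1,
    }, *bad_wr = NULL;

    /* Step 1: hand the buffer to the receive queue. */
    if (ibv_post_recv(qp, &wr, &bad_wr))
        return -1;

    /* Steps 2-3 happen in hardware; step 4: busy-poll the CQ. */
    struct ibv_wc wc;
    while (ibv_poll_cq(recv_cq, 1, &wc) == 0)
        ;

    if (wc.status != IBV_WC_SUCCESS)
        return -1;

    /* Step 5: wc.byte_len bytes are now available at mr->addr. */
    return 0;
}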

Software Architecture of RDMA

From the analysis above, we can see that RDMA is completely independent of the traditional kernel protocol stack. As a result, its software architecture also differs from that of the kernel stack and consists of the following parts:


1. User-mode drivers (libibverbs, libmlx5, etc.): These libraries belong to the rdma-core project (https://github.com/linux-rdma/rdma-core). They provide the Verbs APIs for applications, plus some vendor-specific APIs. This layer is called "user-mode drivers" because RDMA interacts with the hardware directly from user mode: the traditional approach of implementing the hardware abstraction layer (HAL) in kernel mode cannot meet RDMA's requirements, so hardware details have to be handled in user mode.

2. Kernel-mode IB software stack: the kernel-mode abstraction layer that provides unified interfaces. These interfaces can be called from both user mode and kernel mode.

3. Kernel-mode drivers: the NIC drivers implemented by each vendor, which interact with the hardware directly.

Memory Management of RDMA

Since RDMA allows the hardware to read and write the memory of user-mode applications directly, several questions arise:

  1. Security risk: Can a user-mode app read and write arbitrary physical memory through the NIC?
  2. Address mapping: The user-mode app uses virtual addresses, while the physical addresses are managed by the operating system. How does the NIC learn the mapping between virtual and physical addresses?
  3. Changes to the address mapping: The operating system may swap or compress memory, and it also has complex mechanisms such as Copy-On-Write. In these cases, how do we ensure that the addresses the NIC accesses remain correct?

To address the above issues, RDMA introduces two concepts:

  1. PD (Protection Domain): In RDMA, a PD is a container for various resources, similar to a tenant ID, that protects those resources from unauthorized access. Multiple PDs can be created in one process; the resources in different PDs are isolated from each other and cannot be used together.
  2. MR (Memory Region): In RDMA, an MR is the mechanism that protects memory. Memory can be used by RDMA only after it has been registered as an MR. The attributes of an MR include the PD, lkey, rkey, address, length, and permissions. The PD is the protection domain the MR belongs to; contexts in other protection domains cannot access this MR (which rules out brute-force scans of another process's memory). The lkey and rkey are collectively called the mkey, a credential for accessing the memory (for local and remote access, respectively). Every RDMA operation must present an mkey.

This may sound abstract, so let's look at a practical example of how RDMA registers memory:

#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <infiniband/verbs.h>

int main(void)
{
    // Scan RDMA devices and select the device we want to use.
    int num_devices = 0, i;
    struct ibv_device **device_list = ibv_get_device_list(&num_devices);
    struct ibv_device *device = NULL;
    for (i = 0; i < num_devices; ++i) {
        if (strcmp("mlx5_bond_0", ibv_get_device_name(device_list[i])) == 0) {
            device = device_list[i];
            break;
        }
    }

    // Open the device and create a context.
    struct ibv_context *ibv_ctx = ibv_open_device(device);

    // Allocate a PD.
    struct ibv_pd *pd = ibv_alloc_pd(ibv_ctx);

    // Allocate 8 KB of memory, aligned to the page size (4 KB).
    size_t mr_len = 8192;
    void *alloc_region = NULL;
    posix_memalign(&alloc_region, sysconf(_SC_PAGESIZE), mr_len);

    // Register the memory.
    struct ibv_mr *reg_mr = ibv_reg_mr(pd, alloc_region, mr_len, IBV_ACCESS_LOCAL_WRITE);

    return reg_mr != NULL ? 0 : 1;
}

Turn on the dynamic debug output of the kernel modules:

echo "file mr.c +p" > /sys/kernel/debug/dynamic_debug/control
echo "file mem.c +p" > /sys/kernel/debug/dynamic_debug/control

Output logs:

infiniband mlx5_0: mlx5_ib_reg_user_mr:1300:(pid 100804): start 0xdb1000, virt_addr 0xdb1000, length 0x2000, access_flags 0x100007
infiniband mlx5_0: mr_umem_get:834:(pid 100804): npages 2, ncont 2, order 1, page_shift 12
infiniband mlx5_0: get_cache_mr:484:(pid 100804): order 2, cache index 0
infiniband mlx5_0: mlx5_ib_reg_user_mr:1369:(pid 100804): mkey 0xdd5d
infiniband mlx5_0: __mlx5_ib_populate_pas:158:(pid 100804): pas[0] 0x3e07a92003
infiniband mlx5_0: __mlx5_ib_populate_pas:158:(pid 100804): pas[1] 0x3e106a6003

Combining these logs with the kernel module's code, memory registration can be roughly divided into the following stages:

  1. The user-mode app calls ibv_reg_mr, which passes the request to the kernel through the uverbs interface (the actual system call is ioctl or write, operating on /dev/infiniband/uverbsX).
  2. After translation by the uverbs layer, the device driver's code is called. For the NVIDIA NIC, this function is mlx5_ib_reg_user_mr.


This function does two main things. First, it pins the user-mode memory through mr_umem_get and obtains the physical addresses. Second, it sends the mkey-creation command to the NIC through reg_create. mr_umem_get eventually calls __ib_umem_get in ib_core.ko.


This function uses the kernel's pin_user_pages_fast interface to prevent unexpected changes (such as swapping) to the mapping of the user memory.

reg_create then sends the mkey-creation command in the format defined in the reference manual.


After the above process, an mkey object is created in the NIC with the following attributes:

mkey 0xdd5d
virt_addr 0xdb1000, length 0x2000, access_flags 0x100007
pas[0] 0x3e07a92003
pas[1] 0x3e106a6003

With this information, when the NIC later receives a virtual address, it knows which physical pages back it and can enforce the access permissions.
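Conceptually, the address translation the NIC performs is then a simple lookup in the registered page list. The sketch below illustrates it with hypothetical names (mr_base, page_shift, and pas correspond to the virt_addr, page_shift, and pas[] values in the kernel log); the low bits of each pas entry carry attribute flags and must be masked off, and the real TPT lookup in hardware is of course more involved.

#include <stdint.h>

/* Illustrative translation of a virtual address inside a registered MR
 * to a physical address, using the page list recorded at registration. */
uint64_t mr_virt_to_phys(uint64_t va, uint64_t mr_base,
                         unsigned page_shift, const uint64_t *pas)
{
    uint64_t page_size = 1ULL << page_shift;
    uint64_t offset    = va - mr_base;                   /* offset inside the MR */
    uint64_t index     = offset >> page_shift;           /* which registered page */
    uint64_t page_phys = pas[index] & ~(page_size - 1);  /* strip the flag bits */

    return page_phys | (offset & (page_size - 1));       /* add the in-page offset */
}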

Now we can answer the above questions:

1.  Security risk: Can a user-mode app read and write arbitrary physical memory through the NIC?

No. RDMA enforces strict memory protection through PDs and MRs.

2.  Address mapping: The user-mode app uses virtual addresses, while the physical addresses are managed by the operating system. How does the NIC learn the mapping between virtual and physical addresses?

The driver tells the NIC the mapping at registration time. During subsequent data transfers, the NIC performs the translation itself.

3.  Changes to the address mapping: The operating system may swap or compress memory, and it also has complex mechanisms such as Copy-On-Write. In these cases, how do we ensure that the addresses the NIC accesses remain correct?

The driver guarantees this by calling pin_user_pages_fast. In addition, the user-mode driver marks the registered memory with the DONT_FORK flag to avoid Copy-On-Write.
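The DONT_FORK marking boils down to an madvise call on the registered range, roughly as sketched below. This is a simplified illustration of what rdma-core arranges internally when ibv_reg_mr is called; applications never need to do it themselves.

#define _GNU_SOURCE
#include <stdint.h>
#include <stddef.h>
#include <sys/mman.h>

/* Simplified illustration of the DONT_FORK marking: MADV_DONTFORK tells
 * the kernel not to duplicate this mapping into a child process, so a
 * later fork() cannot turn the DMA-target pages into Copy-On-Write pages. */
static int mark_dontfork(void *addr, size_t length, size_t page_size)
{
    uintptr_t start = (uintptr_t)addr & ~(page_size - 1);
    uintptr_t end   = ((uintptr_t)addr + length + page_size - 1) & ~(page_size - 1);

    return madvise((void *)start, end - start, MADV_DONTFORK);
}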

Interaction Between Software and Hardware of RDMA

The basic unit of interaction between RDMA software and hardware is the Work Queue. A Work Queue is a ring buffer with a single producer and a single consumer. Depending on its function, a Work Queue is an SQ (Send Queue), RQ (Receive Queue), CQ (Completion Queue), or EQ (Event Queue).
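Conceptually, each of these queues behaves like the single-producer/single-consumer ring sketched below (the entry type, depth, and index handling are illustrative; the real queues use hardware-defined entry formats, and the producer notifies the hardware with a Doorbell rather than a shared index alone, as described later).

#include <stdint.h>
#include <stdbool.h>

/* Conceptual single-producer/single-consumer ring, the shape shared by
 * SQ, RQ, CQ, and EQ. Memory barriers are omitted here; the real code
 * needs them, as the Doorbell discussion below shows. */
#define DEPTH 256                        /* must be a power of two */

struct entry { uint8_t data[64]; };

struct ring {
    struct entry slots[DEPTH];
    uint32_t prod;                       /* written only by the producer */
    uint32_t cons;                       /* written only by the consumer */
};

static bool ring_post(struct ring *r, const struct entry *e)
{
    if (r->prod - r->cons == DEPTH)
        return false;                    /* ring is full */
    r->slots[r->prod % DEPTH] = *e;      /* fill the slot first */
    r->prod++;                           /* then publish it */
    return true;
}

static bool ring_poll(struct ring *r, struct entry *e)
{
    if (r->cons == r->prod)
        return false;                    /* nothing to consume */
    *e = r->slots[r->cons % DEPTH];
    r->cons++;
    return true;
}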


The process of sending an RDMA request is as follows:

  1. The software constructs a WQE (Work Queue Element) and submits it to the Work Queue.
  2. The software writes the Doorbell to notify the hardware.
  3. The hardware fetches and processes the WQE.
  4. The hardware finishes processing, generates a CQE, and writes it into the CQ.
  5. The hardware raises an interrupt (optional).
  6. The software polls the CQ.
  7. The software reads the CQE written by the hardware and knows that the WQE is complete.

We will analyze it with a practical example for an in-depth understanding of the whole process:

#include <stdint.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

uint32_t send_demo(struct ibv_qp *qp, struct ibv_mr *mr)
{
    struct ibv_send_wr sq_wr = {}, *bad_wr_send = NULL;
    struct ibv_sge sq_wr_sge[1];

    // Send the MR's data.
    sq_wr_sge[0].lkey = mr->lkey;
    sq_wr_sge[0].addr = (uint64_t)mr->addr;
    sq_wr_sge[0].length = mr->length;

    sq_wr.next = NULL;
    sq_wr.wr_id = 0x31415926; // Request ID that can be matched when the WC is received.
    sq_wr.send_flags = IBV_SEND_SIGNALED; // Tell the NIC that this request should generate a CQE.
    sq_wr.opcode = IBV_WR_SEND;
    sq_wr.sg_list = sq_wr_sge;
    sq_wr.num_sge = 1;

    if (ibv_post_send(qp, &sq_wr, &bad_wr_send)) { // Submit the request to the NIC.
        exit(__LINE__);
    }

    struct ibv_wc wc = {};
    while (ibv_poll_cq(qp->send_cq, 1, &wc) == 0) {
        continue;
    }

    if (wc.status != IBV_WC_SUCCESS || wc.wr_id != 0x31415926) {
        exit(__LINE__);
    }

    return 0;
}

The core of this code is ibv_post_send and ibv_poll_cq. The former submits the request to the NIC; the latter polls the CQ to check whether any requests have completed.

In addition, the IB specification defines several abstractions:

  1. ibv_send_wr is an abstraction of an SQ WQE (the hardware WQE formats vary among manufacturers).
  2. ibv_sge is an abstraction of a data buffer, containing three fields (addr, length, and lkey).
  3. ibv_wc is an abstraction of a CQE; it is matched back to the ibv_send_wr through wr_id.

With the debug output of rdma-core enabled, we can see two logs:

dump wqe at 0x7fc59b357000
0000000a 00115003 00000008 00000000
00000000 c0000000 00000000 00000000
0000022a 00008a0b 00007fc5 9a5b2000

mlx5_get_next_cqe:565: dump cqe for cqn 0x4c2:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00004863 2a39b102 0a001150 00003f00

The first log is the WQE that rdma-core submits to the NIC when we call ibv_post_send.

The second log is the CQE that we receive from the NIC when we call ibv_poll_cq.

On the NVIDIA NIC, ibv_post_send eventually calls _mlx5_post_send.


The core functionalities of this function are:

  1. Translate the ibv_send_wr into the NVIDIA NIC's WQE format.
  2. Ring the Doorbell to notify the NIC.

Fill WQE

According to the NVIDIA manual, the WQE for a Send request consists of two parts: a control segment followed by one or more data segments.
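In C terms, the layout looks roughly like the structures below (the names are illustrative; the authoritative field definitions are in the NVIDIA reference manual): the control segment carries the opcode, the WQE index, the QP number, the WQE size, and the completion flags, while each data segment is a 16-byte (length, lkey, address) triple, one per SGE.

#include <stdint.h>

/* Rough sketch of the two WQE segments of a Send request.
 * Names are illustrative; all fields are big-endian on the wire. */
struct demo_wqe_ctrl_seg {
    uint32_t opmod_idx_opcode;  /* opmod | WQE index | opcode (e.g. SEND) */
    uint32_t qpn_ds;            /* QP number | WQE size in 16-byte units  */
    uint8_t  signature;
    uint8_t  rsvd[2];
    uint8_t  fm_ce_se;          /* fence / completion / solicited flags   */
    uint32_t imm;               /* immediate data, if any                 */
};

struct demo_wqe_data_seg {      /* one per SGE */
    uint32_t byte_count;        /* ibv_sge.length */
    uint32_t lkey;              /* ibv_sge.lkey   */
    uint64_t addr;              /* ibv_sge.addr   */
};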

We can now parse the WQE in the log above according to the definitions in the manual.


The result closely matches the ibv_send_wr filled in by the software. Note that the wr_id we filled in does not appear in the WQE. In fact, wr_id is stored by rdma-core: since WQEs are processed in order, rdma-core can find the originally submitted ibv_send_wr when the corresponding CQE is received.
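A provider library can therefore keep wr_id purely in host memory, roughly as sketched below (hypothetical names, not the actual rdma-core data structures): every posted WQE gets a slot in a shadow array indexed like the SQ, and because completions arrive in order, the oldest outstanding slot identifies the wr_id to report with the CQE.

#include <stdint.h>

/* Conceptual sketch of recovering wr_id without storing it in the
 * hardware WQE: keep a shadow array in host memory, indexed like the
 * send queue, and rely on in-order completion. */
#define SQ_DEPTH 256

struct sq_shadow {
    uint64_t wr_id[SQ_DEPTH];   /* one slot per WQE in the send queue */
    uint32_t head;              /* next WQE index to post             */
    uint32_t tail;              /* oldest WQE not yet completed       */
};

static void shadow_on_post(struct sq_shadow *s, uint64_t wr_id)
{
    s->wr_id[s->head % SQ_DEPTH] = wr_id;       /* remember the caller's id */
    s->head++;
}

static uint64_t shadow_on_cqe(struct sq_shadow *s)
{
    uint64_t id = s->wr_id[s->tail % SQ_DEPTH]; /* WQEs complete in order */
    s->tail++;
    return id;                                  /* reported back via ibv_wc.wr_id */
}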

Doorbell Notifies the NIC

The function post_send_db is short, but it contains many details.


First, the function issues a memory barrier to ensure that, by the time the Doorbell reaches the hardware, the WQE built earlier is visible to it.

Then it updates the QP's Doorbell record (essentially the tail pointer of the queue).

Finally, it writes to the NIC's register through mmio_write64_be to tell the NIC there is work to send.

mmio_write64_be uses an SSE instruction (on x86) to write the first part of the WQE (the control segment described above) to the NIC's register.


It is essentially an assignment, equivalent to the following statement:

*(uint64_t *)(bf->reg + bf->offset) = *(uint64_t *)ctrl;

SSE is mainly used to guarantee atomicity, so that the NIC never receives the first 4 bytes and then the last 4 bytes as separate writes.
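On x86-64 this amounts to a single 8-byte store, and an SSE register move is a convenient way to guarantee that the whole doorbell value reaches the device in one transaction. The sketch below shows the idea with SSE2 intrinsics; the actual mmio_write64_be implementation in rdma-core may differ in detail.

#include <emmintrin.h>   /* SSE2 intrinsics */

/* Rough sketch of an atomic 64-bit MMIO doorbell write on x86-64:
 * a single MOVQ store keeps the 8 bytes together, so the NIC never
 * sees the first 4 bytes without the last 4. */
static inline void doorbell_write64(void *nic_reg, const void *ctrl)
{
    __m128i v = _mm_loadl_epi64((const __m128i *)ctrl);  /* load 8 bytes of the control segment */
    _mm_storel_epi64((__m128i *)nic_reg, v);              /* one 8-byte store to the NIC register */
}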

All the emphasis, then, is on bf->reg. What is this buffer? Why does the NIC get notified when the software writes a uint64 to it?

In gdb, bf->reg looks like an ordinary virtual address, but gdb fails to read it.


Take a look at the memory maps of the process.


The memory is mapped write-only, which is why the read fails. So what is this memory?

Look up the page table with the crash utility.


The physical address is 1bffc020800. From /proc/iomem, we can see that this physical address belongs to the PCI device 0000:3b:00.0.


This device is exactly the RDMA NIC we are using, and the address falls within its BAR0.


According to the NVIDIA manual, the address region that the software writes to is called the UAR (User Access Region). The UAR is part of the PCI address space and is mapped directly into the application's address space by the NIC driver. When the software writes a Doorbell to the UAR, the NIC receives the PCI write and thus knows that the software has data to send.
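The mapping itself is established ahead of time: the kernel driver exposes a page of the NIC's BAR through the uverbs device file, and the user-mode driver maps it into the process with mmap. The sketch below shows the general shape of such a mapping (cmd_fd and uar_offset are illustrative; the offset encoding is device-specific, and rdma-core performs this mapping internally when the context and QP are created).

#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Sketch: map one UAR page of the NIC's BAR into the process.
 * cmd_fd is the open /dev/infiniband/uverbsX descriptor and uar_offset
 * is a driver-defined offset selecting which UAR page to map. The
 * mapping is write-only, which is why reading it (e.g. from gdb) fails. */
static void *map_uar_page(int cmd_fd, off_t uar_offset)
{
    long page_size = sysconf(_SC_PAGESIZE);

    void *uar = mmap(NULL, page_size, PROT_WRITE, MAP_SHARED, cmd_fd, uar_offset);
    return uar == MAP_FAILED ? NULL : uar;
}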


After that, the NIC schedules and arbitrates the work and fetches the WQE for execution. During execution, the NIC first extracts the mkey and the address from the WQE and passes them to the TPT (Translation and Protection) unit to look up the mkey. From this unit, the NIC obtains the physical addresses and permissions, performs the DMA once the checks pass, and finally sends the data out through the physical port. Once the data has been sent and acknowledged by the remote side, the NIC generates a CQE to notify the software. The software may recycle resources only after receiving the CQE. This differs from the Socket interface, which copies data: once write returns, the buffer can be reused immediately. RDMA, however, is zero-copy, which makes managing the resource lifecycle more complicated.

Summary

So far, we have covered the basic working principle of RDMA and the interaction mechanism between its software and hardware. The idea behind RDMA is simple: bypass as many layers as possible and deliver data directly to the user. However, RDMA also faces challenges such as complex programming and debugging, heavy dependence on hardware, and difficulty with long-distance transmission. For I/O-intensive services, RDMA can save CPU cycles and achieve high throughput and low latency; for compute-intensive services or services that must remain hardware-independent, TCP is more appropriate.

References

  1. IB Specification: https://www.infinibandta.org/ibta-specification/
  2. NVIDIA Reference manual: https://network.nvidia.com/files/doc-2020/ethernet-adapters-programming-manual.pdf


Disclaimer: The views expressed herein are for reference only and don't necessarily represent the official views of Alibaba Cloud.
