Elastic Compute Service:eRDMA

Last Updated:Dec 13, 2024

Compared with traditional Remote Direct Memory Access (RDMA), Elastic RDMA (eRDMA) can be used in a wider range of scenarios, such as Redis-based cache databases, Spark-based big data analytics, the Weather Research and Forecasting Model (WRF) in high performance computing (HPC), and AI training. You can use eRDMA to deploy HPC applications in the cloud and build highly elastic, high-performance application clusters at low cost. You can also replace a VPC with an eRDMA network to accelerate applications.

What is eRDMA?

eRDMA is an elastic Remote Direct Memory Access (RDMA) network developed by Alibaba Cloud for the cloud. eRDMA reuses virtual private clouds (VPCs) as the underlying link and uses a congestion control (CC) algorithm developed by Alibaba Cloud. eRDMA retains the high throughput and low latency of RDMA. Unlike traditional RDMA, eRDMA can set up large-scale RDMA networks within seconds. eRDMA supports traditional HPC applications, AI applications, and Transmission Control Protocol/Internet Protocol (TCP/IP) applications.

Why eRDMA?

The TCP/IP protocol stack provides the mainstream network communication protocols on which many applications are built. As data center workloads grow, they impose higher requirements on network performance, such as lower latency and higher throughput. TCP/IP has become a bottleneck for network performance because of its high copy overhead, processing across protocol stack layers, complex CC algorithms, and frequent context switching.

RDMA helps resolve the preceding pain points. RDMA provides features such as zero-copy and kernel bypass to avoid the overhead of data copies and frequent context switches. Compared with TCP/IP communication, RDMA features low latency, high throughput, and low CPU utilization. However, RDMA is used in only a few scenarios because of its high price and O&M costs.

Alibaba Cloud eRDMA is designed to be broadly compatible with diverse cloud environments. eRDMA provides low latency and lowers the barrier for a wide range of applications to adapt to cloud environments and improve their performance.

Benefits of eRDMA

  • High performance

    RDMA bypasses the kernel stack to transfer data from user-mode programs to the Host Channel Adapter (HCA) for network transmission. This greatly reduces the CPU load and latency. eRDMA keeps the advantages of traditional RDMA interfaces and applies RDMA to VPCs, which brings the ultra-low latency of RDMA to cloud networks.

    Note

    An HCA is a hardware network interface card (NIC) that connects a server to a network and provides support for RDMA.

  • Inclusiveness

    You can enable eRDMA free of charge. To enable eRDMA, you only need to select the Elastic RDMA Interface option when you purchase an ECS instance.

  • Large-scale deployment

    Traditional RDMA is based on lossless networks. This makes large-scale deployment costly and difficult. eRDMA uses the CC algorithm developed by Alibaba Cloud to control transmission quality in VPCs, such as latency and packet loss. eRDMA provides good performance in lossy networks.

  • Scalability

    Compared with RDMA that requires a separate hardware NIC, eRDMA uses an RDMA HCA card that has cloud attributes based on the Shenlong architecture. eRDMA can dynamically add devices when you use ECS and supports hot migration, which allows for flexible deployment.

  • Shared VPCs

    eRDMA depends on elastic network interfaces (ENIs) and reuses the networks to which the ENIs belong. This allows you to activate the RDMA feature in existing networks without the need to modify service networking.

Implement eRDMA communication

  • Enable eRDMA for Elastic Compute Service (ECS) instances: Alibaba Cloud provides flexible and convenient configuration options for you to quickly configure eRDMA for ECS instances, enable the RDMA feature in VPCs, and establish RDMA connections for communication. For more information, see Enable eRDMA on an ECS instance.

  • Quick adaptation of applications to eRDMA: If you want your applications to meet the requirements of low latency, high bandwidth, and low CPU utilization without implementing and configuring RDMA-related logic yourself, you can use Network Accelerator (NetACC) or Shared Memory Communications (SMC) to adapt your applications. For more information, see Overview of adapting eRDMA and applications.

Basic capabilities and specifications of eRDMA

In RDMA network communication, Queue Pair (QP), Completion Queue (CQ), Memory Region (MR), and verbs Opcode are the core components. They play important roles in RDMA communication and ensure high efficiency and low latency of RDMA network communication.

This section describes the specifications of eRDMA. When you use eRDMA, make sure that the service specification requirements are met. Otherwise, your applications may not work as expected.

QP

A QP is a basic communication entity in RDMA. It consists of a Send Queue (SQ) and a Receive Queue (RQ). The QP is used to manage the data sent and received.

  • Features: A QP allows applications to send and receive data and is the core of RDMA communication. The QP state machine manages the status of a connection, from initialization to termination.

  • eRDMA QP specifications:

    Item

    Specifications

    Description

    Connection establishment method

    RDMA_CM

    • RDMA_CM is used to manage the establishment, maintenance, and closure of RDMA connections. It simplifies the management process of RDMA connections and makes it easier for applications to use the RDMA feature. It is commonly used in scenarios such as Message Passing Interface (MPI), Shared Memory Communications over Remote Direct Memory Access (SMC-R), and PolarDB SCC. For more information, see Linux rdma_cm.

    • eRDMA provides the compat mode for applications in out-of-band (OOB) scenarios, such as TensorFlow, NVIDIA Collective Communications Library (NCCL), and better Remote Procedure Call (bRPC).

    QP types

    RC

    RC-type QPs provide reliable connection services. A QP of the RC-type supports send operations, RDMA write operations, RDMA read operations, and atomic operations.

    Shared Receive Queue (SRQ)

    Not supported.

    None.

    Maximum QPs (max_qp_num)

    This parameter varies based on the instance family. Up to 131,071 QPs can be created.

    • The maximum number of QPs that can be created on an RDMA device or network interface.

    • This parameter determines the maximum number of concurrent connections that can be created in an RDMA network, which affects the extensibility and concurrent processing capabilities of the network.

    Maximum send Work Request (WR) depth (max_send_wr)

    8,192

    • The maximum number of work requests of a QP send queue.

    • This parameter determines the number of transmission operations that the QP can initiate at the same time, which affects the transmission performance and throughput of the QP.

    Maximum receive WR depth (max_recv_wr)

    32,768

    • The maximum number of work requests in a QP receive queue.

    • This parameter determines the number of receive operations that the QP can handle simultaneously, which affects the QP receive performance and throughput.

    Maximum SGEs in a send WR (max_send_sge)

    6

    • The maximum number of scatter-gather elements (SGEs) in a send WR.

    • This parameter determines the maximum number of memory segments that a QP can handle in a single send operation, which affects the efficiency and flexibility of data transfers.

    Maximum SGEs in a receive WR (max_recv_sge)

    1

    • The maximum number of SGEs in a receive WR.

    • This parameter determines the maximum number of memory segments that a QP can process in one receive operation, which affects the efficiency and flexibility of data receiving.
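The per-QP limits above can be checked before a QP is requested. The following is a minimal sketch in plain Python, not a call into a real verbs library: the QpCaps class and the check_qp_caps and split_sges helpers are hypothetical names introduced for illustration, and only the four numeric limits come from the table.

```python
from __future__ import annotations

from dataclasses import dataclass

# Per-QP limits copied from the eRDMA QP specifications above.
ERDMA_LIMITS = {
    "max_send_wr": 8192,
    "max_recv_wr": 32768,
    "max_send_sge": 6,
    "max_recv_sge": 1,
}


@dataclass(frozen=True)
class QpCaps:
    """Hypothetical container that mirrors the parameter names in the table."""
    max_send_wr: int
    max_recv_wr: int
    max_send_sge: int
    max_recv_sge: int


def check_qp_caps(caps: QpCaps) -> list[str]:
    """Return one message per requested value that exceeds an eRDMA limit."""
    problems = []
    for name, limit in ERDMA_LIMITS.items():
        requested = getattr(caps, name)
        if requested > limit:
            problems.append(f"{name}={requested} exceeds eRDMA limit {limit}")
    return problems


def split_sges(segments: list, per_wr: int = ERDMA_LIMITS["max_send_sge"]) -> list[list]:
    """Group memory segments into batches that each fit in one send WR."""
    return [segments[i:i + per_wr] for i in range(0, len(segments), per_wr)]


# A request sized for hardware with more SGEs per WR than eRDMA offers.
problems = check_qp_caps(QpCaps(max_send_wr=1024, max_recv_wr=1024,
                                max_send_sge=8, max_recv_sge=2))
batches = split_sges(list(range(14)))
```

Splitting one send into several WRs produces several messages on the receiving side, so an approach like split_sges fits only protocols that tolerate that. With max_recv_sge limited to 1, each receive WR must point at a single contiguous buffer.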

CQ

A CQ is used to notify an application of the completion of a WR. When an RDMA operation, such as sending or receiving data, completes, the relevant completion information is put into the CQ.
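The completion model can be illustrated with a small in-memory stand-in. This is a toy sketch in plain Python rather than a real verbs object: the ToyCompletionQueue and WorkCompletion names are hypothetical, and the push method only simulates the device writing a completion entry.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkCompletion:
    wr_id: int    # identifier the application attached to the work request
    opcode: str   # for example "SEND" or "RECV"
    ok: bool      # True if the operation completed successfully


class ToyCompletionQueue:
    """In-memory stand-in for a CQ, used only to show the polling pattern."""

    def __init__(self, depth: int):
        self.depth = depth
        self._entries: deque[WorkCompletion] = deque()

    def push(self, wc: WorkCompletion) -> None:
        # Simulates the device adding a completion entry to the queue.
        if len(self._entries) >= self.depth:
            raise OverflowError("CQ overflow: poll more often or use a deeper CQ")
        self._entries.append(wc)

    def poll(self, max_entries: int) -> list[WorkCompletion]:
        # Returns up to max_entries completions in arrival order.
        out = []
        while self._entries and len(out) < max_entries:
            out.append(self._entries.popleft())
        return out


cq = ToyCompletionQueue(depth=4)
cq.push(WorkCompletion(wr_id=1, opcode="SEND", ok=True))
cq.push(WorkCompletion(wr_id=2, opcode="RECV", ok=True))
cq.push(WorkCompletion(wr_id=3, opcode="SEND", ok=False))

first_batch = cq.poll(max_entries=2)
second_batch = cq.poll(max_entries=2)
failed = [wc.wr_id for wc in first_batch + second_batch if not wc.ok]
```

The application matches each completion to its request through wr_id and handles failures there, which is why the CQ must be drained fast enough that it never fills up.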

  • Features: The CQ is the key to asynchronous completion notification in RDMA. It tells applications which operations have completed, which helps them manage asynchronous events and handle errors.

  • eRDMA CQ specifications:

    Item

    Specifications

    Description

    CQs

    The number of CQs varies based on the instance type. The maximum number of CQs is twice the number of QPs.

    None.

    Vectors in a CQ (vector_num)

    The number of vectors in a CQ varies based on the instance type. The maximum number of vectors in a CQ is 31. The number of CPUs is related to the number of QPs.

    • Each vector corresponds to a hardware interrupt. In actual usage, each CPU can be configured with up to one vector to meet communication requirements.

    • Each vector is associated with a completion event queue (CEQ) in eRDMA.

    Maximum CEQ depth

    4,096

    • The maximum CEQ depth is 256 in version 0.2.34.

    • In event mode, we recommend that you do not bind more than 4,096 CQs to each vector. Otherwise, CEQ overflow may occur.

    Maximum CQ depth

    1,048,576

    None.
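The recommendation to bind no more than 4,096 CQs to each vector in event mode can be turned into a quick planning check. The sketch below is plain Python under stated assumptions: the min_vectors_for and spread_cqs helpers are hypothetical, round-robin placement is only one possible policy, and the limit of 31 vectors is the documented maximum, which a given instance type may not reach.

```python
from __future__ import annotations

import math

# Values taken from the eRDMA CQ specifications above.
MAX_CQS_PER_VECTOR = 4096   # recommended bound in event mode
MAX_VECTORS = 31            # documented upper bound; actual value depends on instance type


def min_vectors_for(num_cqs: int) -> int:
    """Smallest number of vectors that keeps every vector within the bound."""
    return max(1, math.ceil(num_cqs / MAX_CQS_PER_VECTOR))


def spread_cqs(num_cqs: int, num_vectors: int) -> list[int]:
    """Place CQ i on vector i % num_vectors and return the per-vector CQ counts."""
    if not 1 <= num_vectors <= MAX_VECTORS:
        raise ValueError(f"num_vectors must be between 1 and {MAX_VECTORS}")
    base, extra = divmod(num_cqs, num_vectors)
    counts = [base + (1 if v < extra else 0) for v in range(num_vectors)]
    if max(counts) > MAX_CQS_PER_VECTOR:
        raise ValueError("a vector would exceed the recommended CQ count; CEQ overflow may occur")
    return counts


needed = min_vectors_for(10_000)
counts = spread_cqs(10_000, needed)
```

In the verbs API, the vector for a CQ is selected with the comp_vector argument of ibv_create_cq, so a plan like this would translate into the value passed for each CQ.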

RDMA memory management

MR and Memory Window (MW) are important concepts for memory management in RDMA.

  • MR: specifies a memory area that can be accessed by RDMA. After an application registers an MR, the RDMA hardware can directly access this memory area.

    • Features: The MR enables RDMA to directly perform operations, such as read and write operations, on the memory of a remote host. This is the basis of the zero-copy feature of RDMA.

    • eRDMA MR specifications:

      Item

      Specifications

      MRs

      The number of MRs varies based on the instance type. The maximum number of MRs is twice the number of QPs.

      Max MR size

      The size of MRs varies based on the underlying hardware. The minimum supported MR size is 2 GB and the maximum supported MR size is 64 GB.

  • MW: Alibaba Cloud does not support MW.
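Because the maximum MR size depends on the underlying hardware, a large buffer may need to be registered as several MRs. The following plain-Python sketch plans such a split. It assumes that "GB" in the table means GiB and uses 2 GB, the smallest documented value, as a conservative default. The plan_mr_chunks and max_mrs_for helpers are hypothetical names, and no real registration call is made.

```python
from __future__ import annotations

GIB = 1024 ** 3  # assumption: the table's "GB" is read as GiB


def plan_mr_chunks(total_bytes: int, max_mr_bytes: int = 2 * GIB) -> list[tuple[int, int]]:
    """Return (offset, length) pairs that cover total_bytes, each at most max_mr_bytes."""
    if total_bytes < 0 or max_mr_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and max_mr_bytes must be > 0")
    return [
        (offset, min(max_mr_bytes, total_bytes - offset))
        for offset in range(0, total_bytes, max_mr_bytes)
    ]


def max_mrs_for(max_qp_num: int) -> int:
    """Upper bound on MRs, using the 'twice the number of QPs' rule above."""
    return 2 * max_qp_num


chunks = plan_mr_chunks(5 * GIB)
mr_budget = max_mrs_for(131_071)
```

Each (offset, length) pair would become one registration, and the number of chunks counts against the MR budget of the instance.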

verbs interfaces

verbs is the foundation of RDMA programming. It defines a set of interfaces for controlling the behavior of RDMA devices. An opcode is the code used in these interfaces to specify the type of operation.

  • Features: The opcode defines the type of RDMA operation, such as send (SEND), receive (RECEIVE), read (READ), and write (WRITE). It tells the RDMA hardware which operation to perform, which allows applications to interact directly with the RDMA hardware for efficient data transfers.

  • Opcode support:

    Opcode                        Whether the operation is supported
    RDMA Write                    Supported
    RDMA Write with Immediate     Supported
    RDMA Read                     Supported
    Send                          Supported
    Send with Invalidate          Supported
    Send with Immediate           Supported
    Send with Solicited Event     Supported
    Local Invalidate              Supported
    Atomic Operation              Supported
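When you port an application, it can help to compare what the application needs against what this document lists. The sketch below is plain Python with hypothetical names (classify_requirements and the two sets). It treats anything that this document does not mention as "unlisted" rather than guessing whether it is supported.

```python
from __future__ import annotations

# Opcodes marked "Supported" in the opcode table above.
SUPPORTED_OPCODES = frozenset({
    "RDMA Write", "RDMA Write with Immediate", "RDMA Read", "Send",
    "Send with Invalidate", "Send with Immediate", "Send with Solicited Event",
    "Local Invalidate", "Atomic Operation",
})

# Features that this document explicitly marks as not supported.
UNSUPPORTED_FEATURES = frozenset({"SRQ", "MW"})


def classify_requirements(needs: list[str]) -> dict[str, list[str]]:
    """Sort application requirements into supported, unsupported, and unlisted."""
    result: dict[str, list[str]] = {"supported": [], "unsupported": [], "unlisted": []}
    for need in needs:
        if need in SUPPORTED_OPCODES:
            result["supported"].append(need)
        elif need in UNSUPPORTED_FEATURES:
            result["unsupported"].append(need)
        else:
            result["unlisted"].append(need)
    return result


report = classify_requirements(["RDMA Read", "Atomic Operation", "SRQ", "UD QP"])
```

An entry under "unsupported" means the application needs a code path that avoids that feature, while an entry under "unlisted" should be confirmed against the current product documentation.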