
Elastic Compute Service: Configure eRDMA on an enterprise-level instance

Last Updated: Dec 18, 2024

You can configure Elastic Remote Direct Memory Access (eRDMA) on specific enterprise-level Elastic Compute Service (ECS) instances. eRDMA provides low-latency, high-throughput, and highly scalable RDMA network services, which lets you improve network performance without modifying the network architecture.

Limits

Item

Description

Region

eRDMA is available in the following regions: China (Beijing), China (Shanghai), China (Hangzhou), China (Shenzhen), China (Guangzhou), China (Ulanqab), and China (Heyuan).

Instance family

The following instance families support eRDMA:

Image

  • Alibaba Cloud Linux 3 (recommended)

  • Alibaba Cloud Linux 2 for x86

  • CentOS 7.9 for x86

  • Ubuntu 18.04/20.04/22.04

  • Anolis OS 8.4 ANCK for Arm and Anolis OS 8.6 ANCK for Arm

Note

The available images vary based on the instance type. When you select an instance type that supports eRDMA on the instance buy page, the page displays the images that you can select.

Number of eRDMA devices

To query the maximum number of ERIs that you can bind to an ECS instance of a specific instance type, call the DescribeInstanceTypes operation and check the value of the EriQuantity parameter in the response. A value of 0 indicates that you cannot bind an ERI to an ECS instance of the instance type.

Network

  • You cannot assign IPv6 addresses to elastic RDMA interfaces (ERIs).

  • When two ECS instances communicate over eRDMA, the communication path cannot span across network elements, such as Server Load Balancer (SLB) instances.
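The EriQuantity check described in the Number of eRDMA devices row can be sketched in shell as follows. This is a minimal sketch, not an official tool: the helper name eri_quantity is hypothetical, the text-based parsing assumes the response contains an "EriQuantity": <number> field, and the aliyun CLI invocation in the comment assumes that the CLI is installed, configured, and exposes the API parameters as same-named options.

```shell
# eri_quantity: read a DescribeInstanceTypes JSON response on stdin and
# print the first EriQuantity value found (prints nothing if the field is absent).
# The helper name and the grep-based parsing are illustrative assumptions;
# a JSON tool such as jq would be more robust if it is available.
eri_quantity() {
  grep -o '"EriQuantity"[[:space:]]*:[[:space:]]*[0-9][0-9]*' \
    | head -n 1 \
    | grep -o '[0-9][0-9]*$'
}

# Possible usage (assumes a configured aliyun CLI):
#   aliyun ecs DescribeInstanceTypes --InstanceTypes.1 <instance_type> | eri_quantity
```

A printed value of 0 means that no ERI can be bound to instances of that instance type.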

Configure eRDMA on an enterprise-level ECS instance

Configure eRDMA when you create an ECS instance

Important
  • When you create an eRDMA-capable instance that runs Alibaba Cloud Linux, Ubuntu, or Anolis OS, you can enable eRDMA by performing two operations: select the Auto-install eRDMA Driver option to automatically install the eRDMA driver, and enable the ERI feature for the primary ENI.

  • If you cannot select the Auto-install eRDMA Driver option for the operating system version that you select or the eRDMA driver fails to be automatically installed, you can install the driver manually or by using a script after the instance is created. For more information, see the Configure eRDMA on an existing instance section of this topic.

  • After you start the ECS instance, wait for a period of time for the system to install the eRDMA driver.

  1. Go to the ECS instance buy page.

  2. Create an enterprise-level ECS instance that supports ERIs. When you create the ECS instance, take note of the following parameters or options. For information about other parameters on the ECS instance buy page, see Create an instance on the Custom Launch tab.

    • Instance and image: Select an instance type that supports eRDMA and install the eRDMA driver.

      Instance: For more information, see the Limits section of this topic.

    • ENI: Select the eRDMA Interface option on the right side of Primary ENI to bind an ERI to the ECS instance.


Note

When you create an enterprise-level instance, you can enable the ERI feature only for the primary elastic network interface (ENI). You can enable the ERI feature for a secondary ENI in the ECS console or by calling an API operation. For more information, see ERI.

Configure eRDMA on an existing ECS instance

  • Check whether eRDMA is configured as expected for the instance.

  • Install the eRDMA driver.

    If you do not select Auto-install eRDMA Driver when you create the instance, the eRDMA driver is not automatically installed on the instance. Install the eRDMA driver manually or by using a script based on the actual scenario.

    • If you use a script to install the eRDMA driver, the installation package for the latest stable eRDMA driver version is automatically downloaded.

    • If you want to manually install the eRDMA driver, you can download the package for a specific eRDMA driver version.

    Execute a script to install the eRDMA driver

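    • Run the following command to download the installation script:

      curl -O http://mirrors.cloud.aliyuncs.com/erdma/env_setup.sh
    • Run the following command to execute the script, which downloads and installs the latest stable eRDMA driver package. The redirection runs inside the privileged shell so that the log file in /var/log can be written even if you are not logged on as the root user:

      sudo /bin/bash -c '/bin/bash env_setup.sh > /var/log/erdma_install.log 2>&1'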

      The script automatically installs the dependencies that are required by the eRDMA driver and then the eRDMA driver. Wait for the script execution to complete.

      Note

      If the eRDMA driver fails to be installed by using the script, check logs in the /var/log/erdma_install.log file.

    Manually install the eRDMA driver

    • Update the prerequisite package.

      • For Alibaba Cloud Linux 3, CentOS, and Anolis OS, run the following command:

        sudo yum update -y
      • For Ubuntu, skip this step.

    • Run the following commands in sequence to query the installed kernel package versions and the running kernel version:

      rpm -qa | grep kernel  # Query the installed kernel package versions.
      uname -r  # Query the running kernel version.

      If the latest kernel package version is the same as the running kernel version, as in the following figure, no additional operations are required. If the versions are different, restart the ECS instance so that it runs the latest installed kernel.

      [Figure: sample command outputs in which the kernel package version matches the running kernel version]

    • Install dependency packages.

      • If the ECS instance is an x86 instance, run one of the following commands based on the instance operating system.

        • For Alibaba Cloud Linux 3, CentOS, and Anolis OS, run the following command:

          sudo yum install gcc-c++ dkms cmake kernel-devel kernel-headers libnl3 libnl3-devel
        • For Ubuntu, run the following command:

          sudo apt-get install dkms cmake libnl-3-dev libnl-route-3-dev linux-headers-$(uname -r)
      • If the ECS instance is an Arm instance, the driver is built from source code. The build requires a large number of dependencies, and the list is subject to change. You can skip this step and run the installation script directly. If the script cannot install a dependency package, it prompts you to install the package. Install the dependency packages as prompted and then re-install the eRDMA driver.

    • Download the driver installation package.

      • Run the following command to download the eRDMA driver installation package from an internal URL:

        wget http://mirrors.cloud.aliyuncs.com/erdma/erdma_installer-latest.tar.gz
      • Run the following command to download the eRDMA driver installation package from a public URL:

        wget https://mirrors.aliyun.com/erdma/erdma_installer-latest.tar.gz

      In this example, the installation package for the latest eRDMA driver version is downloaded. You can download the installation package for a specific eRDMA driver version based on your business scenarios. For information about the release of different versions of the eRDMA installation package, see the Install the eRDMA driver for an ECS instance section of the "Use eRDMA" topic.

    • Run the following command to decompress the installation package and then go to the directory to which the installation package is decompressed:

      tar -xvf erdma_installer-latest.tar.gz && cd erdma_installer
    • Use one of the following methods to install the eRDMA driver:

      • Method 1: Run the following command to install the eRDMA driver. During the installation process, confirm relevant uninstallation steps and automatic installation steps.

        sudo sh install.sh
      • Method 2: Run the following command to automatically install the eRDMA driver:

        sudo sh install.sh --batch

      View the command output to check whether the driver is installed.

      The following command output indicates that the eRDMA driver is installed.

      [Figure: command output for a successful installation]

      The following command output indicates that the eRDMA driver failed to be installed. Perform operations as prompted and then re-install the eRDMA driver.

      [Figure: command output for a failed installation]

      Note

      If the ECS instance runs CentOS 7 and an error message indicates that packages are missing when you re-install the driver, the default yum repositories may not provide those packages. In this case, run the yum install -y epel-release command to add the Extra Packages for Enterprise Linux (EPEL) repository, and then install the missing packages.
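The kernel version comparison in the manual installation steps can be sketched as a small shell helper. This is an illustration under stated assumptions: the helper name kernel_pkg_matches_running is hypothetical, and it assumes an RPM-based image on which kernel packages are named kernel-<version> and uname -r reports the same <version> string.

```shell
# kernel_pkg_matches_running PKG RUNNING
# Succeeds when an RPM kernel package name of the form "kernel-<version>"
# carries the same version string that `uname -r` reports.
# The helper name is illustrative and not part of the eRDMA tooling.
kernel_pkg_matches_running() {
  pkg_version=${1#kernel-}     # strip the leading "kernel-" package prefix
  [ "$pkg_version" = "$2" ]
}

# Possible usage on an RPM-based image: compare the most recently installed
# kernel package with the running kernel and report whether a restart is needed.
#   newest=$(rpm -q kernel --last | head -n 1 | awk '{print $1}')
#   kernel_pkg_matches_running "$newest" "$(uname -r)" || echo "Restart the instance."
```

The sketch compares against the most recently installed package, which is usually but not necessarily the highest version.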

  • Bind an ERI to the ECS instance.

    You can bind only one ERI to each enterprise-level ECS instance. For more information, see the Limits section of this topic.

    • Enable the ERI feature for an ENI that is bound to an ECS instance

      You can enable the ERI feature for an ENI that is bound to an ECS instance by modifying the attributes of the ENI. For more information, see the Change the status of the ERI feature for an existing ENI section of the "ERIs" topic.

    • Create an ERI and bind the ERI to an ECS instance

    • Call API operations to create an ERI and bind the ERI to an ECS instance

      Perform the following steps:

      1. Call an API operation to create an ERI.

        Call the CreateNetworkInterface operation to create an ENI and set the NetworkInterfaceTrafficMode parameter to HighPerformance to enable the ERI feature for the ENI.

        After the call is successful, record the return value of the NetworkInterfaceId parameter, which is the ERI ID.

      2. Call the AttachNetworkInterface operation to bind the ERI to the ECS instance. Set the NetworkInterfaceId parameter to the ERI ID recorded in the preceding step and the InstanceId parameter to the ID of the ECS instance.

        Important

        If the instance type of the ECS instance supports multiple ERIs per instance, we recommend that you set the NetworkCardIndex parameter to a different value for each ERI when you bind multiple ERIs to the instance. This ensures that the ERIs are bound to different channels and the maximum network bandwidth is achieved for the instance. For more information, see the Request parameters section of the "AttachNetworkInterface" topic.
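The two API calls above could be issued from a shell with the Alibaba Cloud CLI. The following sketch only prints the commands so that you can review them before running anything. It assumes that the aliyun CLI is installed and configured and that it exposes API parameters as same-named options; the function name print_eri_commands and all argument values are hypothetical placeholders, and RegionId, VSwitchId, and SecurityGroupId are the additional request parameters that CreateNetworkInterface is assumed to require.

```shell
# print_eri_commands REGION VSWITCH SECURITY_GROUP INSTANCE
# Print (do not run) the two CLI calls that mirror the API steps:
# create an ENI with the ERI feature enabled, then attach it to an instance.
# All values are placeholders supplied by the caller.
print_eri_commands() {
  region=$1; vswitch=$2; secgroup=$3; instance=$4
  echo "aliyun ecs CreateNetworkInterface --RegionId $region --VSwitchId $vswitch --SecurityGroupId $secgroup --NetworkInterfaceTrafficMode HighPerformance"
  echo "aliyun ecs AttachNetworkInterface --RegionId $region --InstanceId $instance --NetworkInterfaceId <NetworkInterfaceId returned by the first call>"
}

# Possible usage:
#   print_eri_commands <region_id> <vswitch_id> <security_group_id> <instance_id>
```

Printing the commands first keeps the sketch free of side effects; the NetworkInterfaceId placeholder in the second command must be replaced with the value returned by the first call.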

Test the eRDMA write latency

You can install Perftest and test the write latency by using ib_write_lat on two enterprise-level instances that have eRDMA configured. For information about Perftest tests, see the Perftest test set section of the "Use eRDMA" topic.

ib_write_lat parameters

  • -R: uses RDMA Connection Manager (CM) to establish connections. The RDMA CM protocol is used to manage RDMA connections and can establish RDMA connections without the need for global name services.

  • -a: sends test messages in all sizes. The size range is 2 bytes to 2^23 bytes. This allows you to test the impacts of different message sizes on latency.

  • -F: suppresses the CPU frequency warning. By default, perftest displays a warning if the CPU is not running at its maximum frequency, for example, when an on-demand frequency governor is active. Configure the -F parameter to run the test without this warning.

  • -s: specifies the size of messages to exchange. Default value: 2.

  • -u: specifies the exponent that is used to calculate the queue pair (QP) timeout period. The timeout period is calculated by using the following formula: 4 microseconds × 2^(<Value>). Default value: 14, which corresponds to a timeout period of approximately 65.5 milliseconds.
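The arithmetic behind the -u and -a parameters can be worked through in shell. The function names qp_timeout_usec and sweep_sizes are hypothetical, and the sketch assumes that the -a sweep steps through powers of two, which is consistent with the 2-byte minimum and 2^23-byte maximum stated in the parameter list.

```shell
# qp_timeout_usec POWER: QP timeout in microseconds for a given -u value,
# using the formula from the parameter list: 4 x 2^POWER.
qp_timeout_usec() {
  echo $(( 4 * (1 << $1) ))
}

# sweep_sizes: message sizes, in bytes, for a power-of-two sweep
# from 2^1 (2 bytes) to 2^23 (8388608 bytes).
sweep_sizes() {
  exp=1
  while [ "$exp" -le 23 ]; do
    echo $(( 1 << exp ))
    exp=$(( exp + 1 ))
  done
}

qp_timeout_usec 14          # default -u value: 65536 microseconds
sweep_sizes | wc -l         # number of message sizes in the sweep: 23
sweep_sizes | tail -n 1     # largest message size: 8388608 bytes
```

Each increment of the -u value doubles the timeout period, so -u 15 yields 131072 microseconds.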

Latency data in the ib_write_lat test results

  • #bytes: the size of the payload of a test message. Valid values: 2 to 8388608. Unit: bytes. Different message sizes help you understand the performance under different loads.

  • #iterations: the number of iterations, which specifies the number of times messages of each size are repeatedly tested. A larger value indicates more stable statistics results, including average values.

  • t_min[usec]: the minimum latency recorded in all tests. Unit: microseconds. This value provides a reference for the best-case network latency.

  • t_max[usec]: the maximum latency recorded in all tests. Unit: microseconds. A large value may indicate specific network issues or transient traffic congestion.

  • t_typical[usec]: the typical latency recorded in tests. Unit: microseconds. In most cases, the value is the median of all tests.

  • t_avg[usec]: the average latency of all tests. Unit: microseconds. The average latency reflects the overall user experience on network latency.

  • t_stdev[usec]: the standard deviation of the latency. Unit: microseconds. A smaller value indicates more stable latency. A larger value indicates that the latency fluctuates.

  • 99% percentile[usec]: the latency value at the 99th percentile, which indicates that 99% of the latency values in the test results are lower than this value. Unit: microseconds. The data points at the 99th percentile help you understand latency performance in extreme cases.

  • 99.9% percentile[usec]: the latency value at the 99.9th percentile, which indicates that 99.9% of the latency values in the test results are lower than this value. Unit: microseconds. The data points at the 99.9th percentile help you understand latency performance in extreme cases.

The preceding latency statistics provide you with a comprehensive understanding of the RDMA network performance to help you optimize network performance and troubleshoot network issues. For example, if the test results indicate a sudden increase in latency when test messages of a specific size are sent, you can check whether the network configurations or hardware performance meets your business requirements. If the test results indicate fluctuations in latency, you can check for traffic congestion or network instability issues.
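One way to spot the latency fluctuations described above is to compare the 99th-percentile latency with the typical latency for each message size. The following is a minimal sketch, assuming that the result rows use the column order of the list above (#bytes first, t_typical fifth, 99% percentile eighth). The function name flag_tail_latency and the default factor of 2 are hypothetical choices for illustration, not recommended thresholds.

```shell
# flag_tail_latency [FACTOR]
# Read ib_write_lat output on stdin and print "<bytes> <t_typical> <p99>"
# for every message size whose 99th-percentile latency exceeds
# FACTOR x t_typical. Header and informational lines are skipped because
# their first field is not a plain number.
# FACTOR defaults to 2, an arbitrary illustrative threshold.
flag_tail_latency() {
  awk -v factor="${1:-2}" '
    $1 ~ /^[0-9]+$/ && NF >= 9 {
      if ($8 > factor * $5) print $1, $5, $8
    }'
}

# Possible usage, assuming the client output was saved to a file:
#   flag_tail_latency 3 < ib_write_lat_client.log
```

Message sizes that appear in the output are candidates for a closer look at network congestion or instance load during the test.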

Prepare the environment

  1. Create two enterprise-level ECS instances to function as the server and the client. Make sure that eRDMA is configured on both instances, which includes installing the eRDMA software stack and enabling the ERI feature.

  2. Make sure that the instances have valid network configurations and can communicate with each other over the internal network. For more information, see Connect ECS instances through an internal network.

Procedure

  1. Connect to the two ECS instances.

    For more information, see Use Workbench to connect to a Linux instance over SSH.

  2. Verify that the eRDMA configurations on both instances are correct.

    For more information, see the Verify the correctness of eRDMA configurations section of the "Use eRDMA" topic.

  3. Install Perftest on each ECS instance.

    You can download the perftest package from the official perftest repository and install perftest, or use a Yellowdog Updater, Modified (YUM) or Advanced Packaging Tool (APT) repository to install perftest.

    Official perftest repository

    1. Enable public bandwidth for an ECS instance on which you want to install perftest. For more information, see Enable public bandwidth for an ECS instance.

    2. Download the perftest package from the official perftest repository and install perftest.
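Building perftest from source might look like the following sketch. It assumes that the official repository is the upstream linux-rdma/perftest project on GitHub and that git, autoconf, automake, libtool, a C compiler, and the RDMA development libraries (for example, the libibverbs and librdmacm headers) are already installed; package names differ between distributions. The function name build_perftest is hypothetical.

```shell
# build_perftest: fetch and build perftest from its upstream GitHub repository.
# The body runs in a subshell so that the caller's working directory is
# unchanged, and each step runs only if the previous step succeeded.
# Build prerequisites are assumed to be installed already.
build_perftest() (
  git clone https://github.com/linux-rdma/perftest.git &&
  cd perftest &&
  ./autogen.sh &&
  ./configure &&
  make &&
  sudo make install
)

# Call build_perftest on the instance after public bandwidth is enabled.
```

If the configure step reports a missing library, install the corresponding development package and run the function again from a clean directory.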

    YUM or APT repository

    Note

    Different versions of perftest are included in the repositories of different Linux distributions. Incompatibility may occur. To prevent incompatibility, we recommend that you identify the Linux distribution run by the ECS instance on which you want to install perftest and install the perftest version included in the repository of the same Linux distribution. Otherwise, download the perftest package from the official perftest repository and install perftest.

    • Alibaba Cloud Linux 3, CentOS, and Anolis OS

      sudo yum install perftest -y
    • Ubuntu

      sudo apt install perftest -y
  4. Test whether the eRDMA network latency meets the expected performance.

    1. On the server-side instance, run the following command to start ib_write_lat as a server that listens for connections from the client:

      ib_write_lat -R -a -F
    2. On the client-side instance, run the following command to start ib_write_lat and connect to the server:

      ib_write_lat -R -a -F <server_ip>

      Replace <server_ip> with the private IP address of the ERI bound to the server-side ECS instance. For information about how to query IP addresses, see View IP addresses.

    3. Check the test results.

      After the test is complete, ib_write_lat on the client outputs the test configuration information, connection information, and performance test results. The statistics include the minimum, maximum, and average latency. For more information, see the Latency data in the ib_write_lat test results section of this topic.
