Configure and use eRDMA to accelerate container networking in ACK clusters

Elastic Remote Direct Memory Access (eRDMA) is a low-latency, high-throughput, and highly scalable RDMA network service provided by Alibaba Cloud. eRDMA is built on the fourth-generation SHENLONG architecture and Virtual Private Cloud (VPC), is fully compatible with the RDMA ecosystem, and provides a large-scale, broadly accessible RDMA network for Elastic Compute Service (ECS) instances. This topic describes how to configure and use eRDMA in Container Service for Kubernetes (ACK) clusters.

Prerequisites

An ACK cluster is created. For more information, see Create an ACK managed cluster.
A node that supports eRDMA is created and added to a node pool.
Elastic RDMA Interfaces (ERIs) can be bound only to ECS instances of specific instance families. For information about the instance families that support ERIs, see Overview of instance families.

Step 1: Install ACK eRDMA Controller

You can perform the following steps to install ACK eRDMA Controller.

Note

If your ACK cluster uses Terway, configure an elastic network interface (ENI) filter for Terway to prevent Terway from modifying the eRDMA ENIs. For more information, see Configure an ENI filter.
If a node has multiple ENIs, ACK eRDMA Controller adds routes for the additional eRDMA ENIs with a routing priority of 200 by default, which is lower than the priority of the routes for other ENIs in the same CIDR block. If you manually configure ENIs after you install ACK eRDMA Controller, make sure that your routes do not conflict with these routes.

On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Operations > Add-ons.

On the Add-ons page, click the Networking tab, find ACK eRDMA Controller, and follow the on-screen instructions to configure and install the component.

Parameter: preferDriver (Driver type)

Description: The type of the eRDMA driver used on the cluster nodes. Valid values:

default: The default driver mode.
compat: The driver mode that is compatible with RDMA over Converged Ethernet (RoCE).
ofed: The OFED-based driver mode, which is intended for GPU-accelerated instances.

For more information about the driver types, see Use eRDMA.

Parameter: Specifies whether to assign all eRDMA devices of nodes to pods

Description: Valid values:

True: If you select this check box, all eRDMA devices on the node are assigned to the pod.
False: If you do not select this check box, the pod is assigned an eRDMA device based on the non-uniform memory access (NUMA) topology. You must enable the static CPU policy on the node so that NUMA-aware allocation works for pods and devices. For more information about how to configure CPU policies, see Create and manage node pools.

In the left-side navigation pane, choose Workloads > Pods. On the Pods page, select the ack-erdma-controller namespace to view the status of pods and ensure that the component runs as expected.
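
The same check can be scripted from the command line. The following is a minimal sketch and not part of ACK eRDMA Controller: the helper name all_pods_running is hypothetical, and it assumes the default column layout of kubectl get pods --no-headers, in which the third column is STATUS.

```shell
# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin.
# Succeeds only if at least one pod is listed and every pod is in the Running state.
all_pods_running() {
  awk '
    { total++ }
    $3 != "Running" { bad++; print "not running: " $1 " (" $3 ")" }
    END { exit ((total > 0 && bad == 0) ? 0 : 1) }
  '
}

# Example usage (requires kubectl access to the cluster):
#   kubectl get pods -n ack-erdma-controller --no-headers | all_pods_running
```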

Step 2: Use eRDMA to accelerate container networking

After you install ACK eRDMA Controller, you can use the following configuration to enable eRDMA for the pod.

Enable eRDMA

Specify the aliyun/erdma resource in the resource limits of the container in the pod.

```
spec:
  containers:
  - name: erdma-container
    resources:
      limits:
        aliyun/erdma: 1
```

Specifying the aliyun/erdma resource allocates eRDMA devices to the pod.

After the eRDMA devices are allocated, you can view them in the pod.

```
/# ls /dev/infiniband/
rdma_cm  uverbs0
```
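
To automate this check, for example in a container startup script, a sketch along the following lines may help. The helper name has_rdma_devices is hypothetical, and the check assumes that the rdma_cm entry and at least one uverbs entry, as in the listing above, are what a usable allocation looks like.

```shell
# Hypothetical helper: succeeds if the directory (default: /dev/infiniband)
# contains rdma_cm and at least one uverbs* entry.
has_rdma_devices() {
  dir=${1:-/dev/infiniband}
  [ -e "$dir/rdma_cm" ] || return 1
  for dev in "$dir"/uverbs*; do
    # If the glob matches nothing, it stays literal and the -e test fails.
    [ -e "$dev" ] && return 0
  done
  return 1
}

# Example usage inside the pod:
#   has_rdma_devices && echo "eRDMA device is available"
```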

Enable Shared Memory Communication over RDMA (SMC-R)

After you enable eRDMA, specify the network.alibabacloud.com/erdma-smcr: "true" annotation to accelerate TCP connections in the pod.

```
metadata:
  annotations:
    network.alibabacloud.com/erdma-smcr: "true"
```

After you enable SMC-R, eRDMA acceleration can be used only if you configure SMC-R on both ends of the TCP connection.

You can install smc-tools in the pod and run the smcss command to check the acceleration status of the connection.

Note

This feature is supported only in Alibaba Cloud Linux 3. The kernel version must be 5.10.134-17 or later. For more information, see Release notes for Alibaba Cloud Linux 3.
This feature is not supported if ofed or compat is selected as the eRDMA driver type.
Alibaba Cloud ERI eRDMA devices and SMC do not support IPv6 addresses. If applications use IPv6, SMC falls back to TCP.
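
The kernel requirement in the preceding note can be checked on a node before you enable SMC-R. The following is a minimal sketch: the helper name kernel_at_least is hypothetical, and it relies on sort -V (GNU coreutils version sort) to compare kernel release strings. It does not check whether the OS is Alibaba Cloud Linux 3.

```shell
# Hypothetical helper: succeeds if kernel release $1 is at least $2
# (default minimum: 5.10.134-17, as stated in the note above).
kernel_at_least() {
  current=$1
  minimum=${2:-5.10.134-17}
  # With version sort, the minimum must come first (or be equal to the current release).
  lowest=$(printf '%s\n%s\n' "$minimum" "$current" | sort -V | head -n 1)
  [ "$lowest" = "$minimum" ]
}

# Example usage on a node:
#   kernel_at_least "$(uname -r)" && echo "kernel meets the SMC-R requirement"
```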

Scenario 1: Use eRDMA to accelerate NCCL on GPU-accelerated nodes

When you install ACK eRDMA Controller as described in Step 1: Install ACK eRDMA Controller, set the preferDriver parameter to ofed to accelerate NVIDIA Collective Communications Library (NCCL).
Add GPU-accelerated nodes to the node pool. For more information, see Create and manage node pools.

Install the eRDMA-related packages when you build an application container image.


```
# Debian or Ubuntu: Make sure that the OS name and version in the sources.list entry ({OS|ubuntu} and {Version|focal}) match the OS that you use.
wget -qO - https://mirrors.aliyun.com/erdma/GPGKEY | apt-key add - && echo "deb [ arch=amd64 ] https://mirrors.aliyun.com/erdma/apt/{OS|ubuntu} {Version|focal}/erdma main" | tee /etc/apt/sources.list.d/erdma.list && apt update && apt install -y libibverbs1 ibverbs-providers ibverbs-utils librdmacm1

# Alibaba Cloud Linux or Red Hat Enterprise Linux (RHEL): Add the eRDMA repository to the yum.repos.d directory.
cat > /etc/yum.repos.d/erdma.repo <<EOF
[erdma]
name = ERDMA Repository
baseurl = http://mirrors.aliyun.com/erdma/yum/redhat/7/erdma/x86_64/
gpgcheck = 0
enabled = 1
EOF
# Quote the asterisk so that the shell does not expand it as a file glob.
yum install --disablerepo='*' --enablerepo erdma -y libibverbs ibverbs-providers ibverbs-utils librdmacm
```

Run a GPU application that uses eRDMA in a cluster. nccl-test is used as an example.


```
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nccltest
spec:
  selector:
    matchLabels:
      app: nccltest
  serviceName: "nccltest"
  replicas: 2
  template:
    metadata:
      labels:
        app: nccltest
    spec:
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      containers:
      - env:
        - name: NCCL_SOCKET_IFNAME
          value: "eth0"
        - name: NCCL_DEBUG
          value: "INFO"
        - name: NCCL_IB_GID_INDEX
          value: "1"
        image: <nccl-test-image-with-erdma>
        imagePullPolicy: Always
        name: nccltest
        securityContext:
          privileged: true
        resources:
          limits:
            nvidia.com/gpu: "8"
            aliyun/erdma: "1"
          requests:
            nvidia.com/gpu: "8"
            aliyun/erdma: "1"
```

Verify that eRDMA is used by NCCL.
Check the application logs for the communication type and the number of network interfaces that NCCL uses. If the logs list eRDMA devices such as erdma_0 and erdma_1, NCCL traffic is accelerated by eRDMA.
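
The log check can be scripted as a rough filter. The following sketch makes two assumptions that are not stated in this topic: that NCCL logs produced with NCCL_DEBUG=INFO mark the verbs-based transport with the string NET/IB, and that the same line names the device with an erdma_ prefix. Log formats differ between NCCL versions, so adjust the patterns to what your logs actually contain. The helper name nccl_uses_erdma is hypothetical.

```shell
# Hypothetical helper: reads NCCL logs on stdin and succeeds if at least one line
# mentions both the NET/IB transport and an erdma_ device (assumed log patterns).
nccl_uses_erdma() {
  awk '/NET\/IB/ && /erdma_/ { found = 1 } END { exit (found ? 0 : 1) }'
}

# Example usage (nccltest-0 is the first pod of the nccltest StatefulSet):
#   kubectl logs nccltest-0 | nccl_uses_erdma && echo "NCCL is using eRDMA"
```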

Scenario 2: Use SMC-R to accelerate application networking

When you install ACK eRDMA Controller as described in Step 1: Install ACK eRDMA Controller, set the preferDriver parameter to default to accelerate general TCP communication.

Create an application that can be accelerated by using SMC-R in a cluster based on the following sample code:


```
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: app-with-erdma
  name: app-with-erdma
spec:
  replicas: 2
  selector:
    matchLabels:
      app: app-with-erdma
  template:
    metadata:
      labels:
        app: app-with-erdma
      annotations:
        network.alibabacloud.com/erdma-smcr: "true"
    spec:
      containers:
      - image: <application image>
        imagePullPolicy: Always
        name: app-with-erdma
        resources:
          limits:
            aliyun/erdma: 1
```
Check the status of network connections in the pod.
You can install smc-tools in a container and run the smcss command to view the acceleration results.
```
/# smcss
State          UID   Inode   Local Address           Peer Address            Intf Mode 
ACTIVE         00000 0059964 172.17.192.73:47772     172.17.192.10:80        0000 SMCR
```
In the command output, SMCR is displayed in the Mode column, which indicates that eRDMA is used by the connection.
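
When a pod has many connections, it can be easier to count the accelerated ones than to read the table. The following is a minimal sketch that assumes smcss prints one connection per line with the mode as a separate whitespace-delimited field, as in the sample output above. The helper name count_smcr_connections is hypothetical.

```shell
# Hypothetical helper: reads smcss output on stdin and prints the number of
# connections whose line contains the SMCR mode field.
count_smcr_connections() {
  awk '{ for (i = 1; i <= NF; i++) if ($i == "SMCR") { n++; break } } END { print n + 0 }'
}

# Example usage inside the pod:
#   smcss | count_smcr_connections
```

A result of 0 suggests that the connections fell back to TCP, for example because SMC-R is not configured on the peer.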