
Alibaba Cloud Service Mesh:Accelerate network communications between pods in an ASM instance by eRDMA

Last Updated:Oct 08, 2024

Alibaba Cloud Linux 3 provides Shared Memory Communication (SMC), a high-performance network protocol that runs in kernel space. SMC uses Remote Direct Memory Access (RDMA) technology behind standard socket interfaces and can significantly improve the performance of network communications. However, when you use SMC in a native Elastic Compute Service (ECS) environment, you must carefully maintain the SMC whitelist and the configurations in the network namespace of each pod to prevent connections from being unexpectedly downgraded to TCP. Service Mesh (ASM) provides the SMC optimization capability in a controllable network environment (that is, a cluster) and automatically optimizes the network communications between pods in an ASM instance, so you do not need to manage the SMC configuration yourself.

Prerequisites

The cluster is added to the ASM instance.

Limits

Note

The feature that enables SMC in an ASM instance to accelerate network communications is in the beta phase.

Procedure

Step 1: Initialize the related nodes

SMC uses elastic RDMA interfaces (ERIs) to accelerate network communications. Before you enable SMC, you must initialize the related nodes.

  1. Upgrade the kernel version of Alibaba Cloud Linux 3 to 5.10.134-16.3 or later.

    Note

    The known issue of kernel version 5.10.134-16.3 and how to fix it are described in the Known issue section.

    1. Run the uname -r command to view the kernel version. If the kernel version is 5.10.134-16.3 or later, skip the kernel upgrade step.

      $ uname -r
      5.10.134-16.3.al8.x86_64
    2. View kernel versions that can be installed.

      $ sudo yum search kernel --showduplicates | grep kernel-5.10
      Last metadata expiration check: 3:01:27 ago on Tue 09 Apr 2024 07:40:15 AM CST.
      kernel-5.10.134-15.1.al8.x86_64 : The Linux kernel, based on version 5.10.134, heavily modified with backports
      kernel-5.10.134-15.2.al8.x86_64 : The Linux kernel, based on version 5.10.134, heavily modified with backports
      kernel-5.10.134-15.al8.x86_64 : The Linux kernel, based on version 5.10.134, heavily modified with backports
      kernel-5.10.134-16.1.al8.x86_64 : The Linux kernel, based on version 5.10.134, heavily modified with backports
      kernel-5.10.134-16.2.al8.x86_64 : The Linux kernel, based on version 5.10.134, heavily modified with backports
      kernel-5.10.134-16.3.al8.x86_64 : The Linux kernel, based on version 5.10.134, heavily modified with backports
      kernel-5.10.134-16.al8.x86_64 : The Linux kernel, based on version 5.10.134, heavily modified with backports
      [...]
    3. Install the latest version of the kernel or install a specific version of the kernel.

      1. Install the latest version of the kernel:

        $ sudo yum update kernel
      2. Install a specific version of the kernel. In this example, kernel-5.10.134-16.3.al8.x86_64 is used:

        $ sudo yum install kernel-5.10.134-16.3.al8.x86_64
    4. Restart the nodes. After a node restarts, run the uname -r command to check whether its kernel has been upgraded to the expected version.
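
The manual version comparison above can also be scripted. The following sketch uses version sort to check the running kernel against the 5.10.134-16.3 minimum; the suffix handling assumes the al8 version naming shown in the preceding outputs, and the sample value is illustrative:

```shell
# Compare the running kernel version against the required minimum using
# version sort. The suffix stripping assumes the al8 naming scheme.
required="5.10.134-16.3"
current="5.10.134-15.2.al8.x86_64"    # in practice: current=$(uname -r)
base=${current%%.al8*}                 # strip the ".al8.x86_64" suffix
lowest=$(printf '%s\n%s\n' "$required" "$base" | sort -V | head -n1)
if [ "$lowest" = "$required" ]; then
    echo "kernel OK"
else
    echo "kernel too old, upgrade required"
fi
```

With the illustrative 5.10.134-15.2 value above, the script reports that an upgrade is required; with a 5.10.134-16.3 or later kernel, it prints "kernel OK".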

  2. Configure an elastic network interface (ENI) filter for the Terway network plug-in of the ACK cluster. This way, the Terway network plug-in does not manage the secondary ERI that will be added. For more information, see Configure an ENI filter.

  3. For each node in the cluster, create a secondary ERI and bind it to the node. For more information, see Configure eRDMA on an existing instance.

    Note

    You only need to create secondary ERIs and bind them to nodes. In this scenario, each secondary ERI must be in the same subnet as the primary network interface card (NIC). For more information about how to configure a secondary ERI, see the next step.
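
After binding, you can confirm that the node sees the eRDMA device in the same way that the configuration script in the next step does: it takes the last field of each `rdma link show` output line as the netdev name. A sample-based sketch of that parsing step (the sample line is illustrative, not captured from a real node):

```shell
# The configuration script discovers the eRDMA-capable netdev by taking
# the last field of `rdma link show` output. Illustrative sample line:
sample='link erdma_0/1 state ACTIVE physical_state LINK_UP netdev eth2'
# On a real node you would run: eth=$(rdma link show | awk '{print $NF}')
eth=$(printf '%s\n' "$sample" | awk '{print $NF}')
echo "$eth"    # prints: eth2
```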

  4. Configure a secondary ERI on a node.

    1. Save the following script in any directory of the node. Run the sudo chmod +x asm_erdma_eth_config.sh command to grant execute permissions on the script.

      asm_erdma_eth_config.sh

      #!/bin/bash
      
      #
      # Params
      #
      mode=
      mac=
      ipv4=
      mask=
      gateway=
      state=  # UP/DOWN
      
      #
      # Functions
      #
      function find_erdma_eth
      {
              echo "$(rdma link show | awk '{print $NF}')"
      }
      
      function get_erdma_eth_info
      {
              e=$1
              echo "Find ethernet device with erdma: $e"
      
              # UP/DOWN
              ip link show $e | grep -q "state UP" && state="UP" || state="DOWN"
      
              # MAC address
              mac=$(ip a show dev $e | grep ether | awk '{print $2}')
      
              # IPv4 address
              ipv4=$(curl --silent --show-error --connect-timeout 5 \
                      http://100.100.100.200/latest/meta-data/network/interfaces/macs/"$mac"/primary-ip-address \
                      2>&1)
              if [ $? -ne 0 ]; then
                      echo "failed to retrieve $e IPv4 address. Error: $ipv4"
                      exit 1
              fi
      
              # Mask
              mask=$(curl --silent --show-error --connect-timeout 5 \
                      http://100.100.100.200/latest/meta-data/network/interfaces/macs/"$mac"/netmask \
                      2>&1)
              if [ $? -ne 0 ]; then
                      echo "failed to retrieve $e network mask. Error: $mask"
                      exit 1
              fi
      
              # Gateway
              gateway=$(curl --silent --show-error --connect-timeout 5 \
                      http://100.100.100.200/latest/meta-data/network/interfaces/macs/"$mac"/gateway \
                      2>&1)
              if [ $? -ne 0 ]; then
                      echo "failed to retrieve $e gateway. Error: $gateway"
                      exit 1
              fi
              echo "- state <$state>, IPv4 <$ipv4>, mask <$mask>, gateway <$gateway>"
      }
      
      function set_erdma_eth
      {
              local eths=()
      
              # find all eths with erdma
              eths=($(find_erdma_eth))
              if [ ${#eths[@]} -eq 0 ]; then
                      echo "Can't find ethernet device with erdma capability"
                      exit 1
              fi
      
              for e in ${eths[@]}
              do
                      if [[ $e == "eth0" ]]; then
                              echo "Skip eth0, no need to configure."
                              continue
                      fi
                      get_erdma_eth_info $e
                      echo "Config.."
      
                      # Enable
                      if [ "$state" == "DOWN" ]; then
                              ip link set $e up 1>/dev/null 2>&1
                              if ip link show $e | grep -q "state UP" ; then
                                      echo "- successfully set $e UP"
                              else
                                      echo "- failed to set $e UP"
                                      exit 1
                              fi
                      else
                              echo "- $e has been activated, nothing to do."
                      fi
      
                      # Set IP & mask
                      if ! ip addr show $e | grep -q "inet\b"; then
                              local eth0_metric=$(ip route | grep "dev eth0 proto kernel scope link" \
                                                      | awk '/metric/ {print $NF}')
                              ip addr add $ipv4/$mask dev $e metric $((eth0_metric + 1)) 1>/dev/null 2>&1
                              if [ $? -eq 0 ]; then
                                      echo "- successfully configured $e IPv4/mask and direct route"
                              else
                                      echo "- failed to configure $e IPv4/mask and direct route"
                              fi
                      else
                              echo "- $e has been configured with IPv4(s), nothing to do."
                      fi
      
                      echo "Complete all configurations of $e"
              done
      }
      
      function reset_erdma_eth
      {
              local eths=()
      
              # Find all eths with erdma
              eths=($(find_erdma_eth))
              if [ ${#eths[@]} -eq 0 ]; then
                      echo "Can't find ethernet device with erdma capability"
                      exit 1
              fi
      
              for e in ${eths[@]}
              do
                      if [[ $e == "eth0" ]]; then
                              echo "Skip eth0, no need to configure."
                              continue
                      fi
                      get_erdma_eth_info $e
                      echo "Reset.."
      
                      # Remove IPv4
                      ip addr flush dev $e scope global 1>/dev/null 2>&1
                      if [ $? -eq 0 ]; then
                              echo "- successfully flushed $e IPv4(s)"
                      else
                              echo "- failed to flush $e IPv4(s)"
                      fi
      
                      # Disable
                      ip link set $e down 1>/dev/null 2>&1
                      if [ $? -eq 0 ]; then
                              echo "- successfully set $e DOWN"
                      else
                              echo "- failed to set $e DOWN"
                      fi
                      echo "Complete all resets of $e"
              done
      }
      
      print_help() {
              echo "Usage: $0 [option]"
              echo "Options:"
              echo "  -s            Enable eRDMA-cap Eth and configure its IPv4"
              echo "  -r            Disable eRDMA-cap Eth and remove all its IPv4"
              echo "  -h, --help    Show this help message"
      }
      
      while [ "$1" != "" ]; do
              case $1 in
                      -s)
                              set_erdma_eth
                              exit 0
                              ;;
                      -r)
                              reset_erdma_eth
                              exit 0
                              ;;
                      -h | --help)
                              print_help
                              exit 0
                              ;;
                      *)
                              echo "Invalid option: $1"
                              print_help
                              exit 1
                              ;;
              esac
              shift
      done
      
      if [ -z "$1" ]; then
              print_help
              exit 1
      fi
      Note

      This script is only used to configure a secondary ERI in the current context and is not applicable to other NIC configuration scenarios.
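
The script passes the dotted netmask returned by the metadata service (for example, 255.255.255.0) directly to `ip addr add`, which iproute2 accepts. If you prefer CIDR notation, a small helper (an illustrative addition, not part of the script above) can convert the mask to a prefix length:

```shell
# Convert a dotted netmask to a CIDR prefix length by counting set bits.
# Valid for contiguous masks such as those returned by the metadata service.
mask_to_prefix() {
    local octet prefix=0
    for octet in $(printf '%s' "$1" | tr '.' ' '); do
        while [ "$octet" -gt 0 ]; do
            prefix=$((prefix + octet % 2))
            octet=$((octet / 2))
        done
    done
    echo "$prefix"
}
mask_to_prefix 255.255.255.0    # prints: 24
```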

    2. Run the sudo ./asm_erdma_eth_config.sh -s command to set the status of the new secondary ERI to UP and configure an IPv4 address for it. Expected output:

      $ sudo ./asm_erdma_eth_config.sh -s
      Find ethernet device with erdma: eth2
      - state <DOWN>, IPv4 <192.168.x.x>, mask <255.255.255.0>, gateway <192.168.x.x>
      Config..
      - successfully set eth2 UP
      - successfully configured eth2 IPv4/mask and direct route
      Complete all configurations of eth2
    3. (Optional) The preceding secondary ERI configuration does not persist across reboots, so you must repeat it each time the node restarts. To have the secondary ERI configured automatically at startup, perform the following steps to create the corresponding systemd service.

      1. Add the following asm_erdma_eth_config.service file to the /etc/systemd/system directory of the node, and replace /path/to/asm_erdma_eth_config.sh with the actual path of the asm_erdma_eth_config.sh script for the node.

        asm_erdma_eth_config.service

        [Unit]
        Description=Run asm_erdma_eth_config.sh script after network is up
        Wants=network-online.target
        After=network-online.target
        
        [Service]
        Type=oneshot
        ExecStart=/bin/sh /path/to/asm_erdma_eth_config.sh -s
        RemainAfterExit=yes
        
        [Install]
        WantedBy=multi-user.target
      2. Enable asm_erdma_eth_config.service.

        sudo systemctl daemon-reload
        sudo systemctl enable asm_erdma_eth_config.service

        The secondary ERI is then automatically configured at node startup. After the node starts, you can run the sudo systemctl status asm_erdma_eth_config.service command to view the status of asm_erdma_eth_config.service. The expected state is active. Expected output:

        # sudo systemctl status asm_erdma_eth_config.service
        ● asm_erdma_eth_config.service - Run asm_erdma_eth_config.sh script after network is up
           Loaded: loaded (/etc/systemd/system/asm_erdma_eth_config.service; enabled; vendor preset: enabled)
           Active: active (exited) since [time]
         Main PID: 1689 (code=exited, status=0/SUCCESS)
            Tasks: 0 (limit: 403123)
           Memory: 0B
           CGroup: /system.slice/asm_erdma_eth_config.service
        
        [time] <hostname> sh[1689]: Find ethernet device with erdma: eth2
        [time] <hostname> systemd[1]: Starting Run asm_erdma_eth_config.sh script after network is up...
        [time] <hostname> sh[1689]: - state <DOWN>, IPv4 <192.168.x.x>, mask <255.255.255.0>, gateway <192.168.x.x>
        [time] <hostname> sh[1689]: Config..
        [time] <hostname> sh[1689]: - successfully set eth2 UP
        [time] <hostname> sh[1689]: - successfully configured eth2 IPv4/mask and direct route
        [time] <hostname> sh[1689]: Complete all configurations of eth2
        [time] <hostname> systemd[1]: Started Run asm_erdma_eth_config.sh script after network is up.
      3. If asm_erdma_eth_config.service is no longer needed, you can run the sudo systemctl disable asm_erdma_eth_config.service command to remove it.

Step 2: Deploy test applications

  1. Enable automatic sidecar proxy injection for the default namespace, which is used in the following tests. For more information, see Enable automatic sidecar proxy injection.

  2. Create a fortioserver.yaml file that contains the following content:

    Expand to view the fortioserver.yaml file

    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: fortioserver
    spec:
      ports:
      - name: http-echo
        port: 8080
        protocol: TCP
      - name: tcp-echoa
        port: 8078
        protocol: TCP
      - name: grpc-ping
        port: 8079
        protocol: TCP
      selector:
        app: fortioserver
      type: ClusterIP
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: fortioserver
      name: fortioserver
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: fortioserver
      template:
        metadata:
          labels:
            app: fortioserver
          annotations:
            sidecar.istio.io/inject: "true"
            sidecar.istio.io/proxyCPULimit: 2000m
            proxy.istio.io/config: |
              concurrency: 2 
        spec:
          shareProcessNamespace: true
          containers:
          - name: captured
            image: fortio/fortio:latest_release
            ports:
            - containerPort: 8080
              protocol: TCP
            - containerPort: 8078
              protocol: TCP
            - containerPort: 8079
              protocol: TCP
          - name: anolis
            securityContext:
              runAsUser: 0
            image: openanolis/anolisos:latest
            args:
            - /bin/sleep
            - 3650d
    ---
    apiVersion: v1
    kind: Service
    metadata:
      annotations:
          service.beta.kubernetes.io/alibaba-cloud-loadbalancer-health-check-switch: "off"
      name: fortioclient
    spec:
      ports:
      - name: http-report
        port: 8080
        protocol: TCP
      selector:
        app: fortioclient
      type: LoadBalancer
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: fortioclient
      name: fortioclient
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: fortioclient
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "true"
            sidecar.istio.io/proxyCPULimit: 4000m
            proxy.istio.io/config: |
               concurrency: 4
          labels:
            app: fortioclient
        spec:
          shareProcessNamespace: true
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                  - key: app
                    operator: In
                    values:
                    - fortioserver
                topologyKey: "kubernetes.io/hostname"
          containers:
          - name: captured
            volumeMounts:
            - name: shared-data
              mountPath: /var/lib/fortio
            image: fortio/fortio:latest_release
            ports:
            - containerPort: 8080
              protocol: TCP
          - name: anolis
            securityContext:
              runAsUser: 0
            image: openanolis/anolisos:latest
            args:
            - /bin/sleep
            - 3650d
          volumes:
          - name: shared-data
            emptyDir: {}
    
  3. Use kubectl to connect to the ACK cluster based on the information in the kubeconfig file and then run the following command to deploy the test applications:

    kubectl apply -f fortioserver.yaml
  4. Run the following command to view the status of the test applications:

    kubectl get pods | grep fortio

    Expected output:

    NAME                            READY   STATUS    RESTARTS      
    fortioclient-8569b98544-9qqbj   3/3     Running   0
    fortioserver-7cd5c46c49-mwbtq   3/3     Running   0

    The output indicates that both test applications start as expected.

Step 3: Run a test in the baseline environment and view the baseline test results

After the fortio application starts, it listens on port 8080, which serves the fortio console page. To generate test traffic, map the port of the fortioclient application to a local port, and then open the fortio console page from your on-premises host.

  1. Use kubectl to connect to the ACK cluster based on the information in the kubeconfig file and then run the following command to map port 8080 of the fortioclient application to the local port 8080.

    kubectl port-forward service/fortioclient 8080:8080
  2. In the address bar of your browser, enter http://localhost:8080/fortio to access the console of the fortioclient application and modify related configurations.

    image

    Modify the parameter settings on the page shown in the preceding figure according to the following table.

    Parameter                           Example
    URL                                 http://fortioserver:8080/echo
    QPS                                 100000
    Duration                            30s
    Threads/Simultaneous connections    64
    Payload                             The following 128-byte string:

    xhsyL4ELNoUUbC3WEyvaz0qoHcNYUh0j2YHJTpltJueyXlSgf7xkGqc5RcSJBtqUENNjVHNnGXmoMyILWsrZL1O2uordH6nLE7fY6h5TfTJCZtff3Wib8YgzASha8T8g

  3. After the configuration is complete, click Start in the lower part of the page to start the test. The test ends after the progress bar reaches the end.

    image

    After the test ends, the results of the test are displayed on the page. The following figure is for reference only. Test results vary with test environments.

    image

    On the test results page, the x-axis of the histogram shows request latencies and the y-axis shows the number of processed requests, so the distribution of the bars shows the latency distribution. The purple curve shows the cumulative number of requests processed within each response time. The P50, P75, P90, P99, and P99.9 request latencies are listed at the top of the histogram. After you obtain the test data of the baseline environment, enable SMC for the applications and test their performance again.
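
As a concrete illustration of how the percentile figures are read, the following sketch computes a nearest-rank P90 over a toy latency sample; the numbers are invented for illustration and are not test results:

```shell
# Nearest-rank P90 over a sorted latency sample (milliseconds).
# The values are toy data for illustration only.
latencies="1.2 1.3 1.5 1.6 1.8 2.0 2.2 2.5 3.1 9.7"
p90=$(echo $latencies | awk '{ idx = int(0.90 * NF); if (idx < 1) idx = 1; print $idx }')
echo "P90 = ${p90}ms"    # prints: P90 = 3.1ms
```

Note how a single slow request (9.7 ms) barely moves the P90 but would dominate the P99.9, which is why the higher percentiles are the ones to watch.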

Step 4: Enable network communication acceleration based on SMC for your ASM instance and workloads

  1. Use kubectl to connect to the ASM instance based on the information in the kubeconfig file. Then, run the following command and add the smcEnabled: true field to the ASMMeshConfig object to enable network communication acceleration based on SMC.

    $ kubectl edit asmmeshconfig
    
    apiVersion: istio.alibabacloud.com/v1beta1
    kind: ASMMeshConfig
    metadata:
      name: default
    spec:
      ambientConfiguration:
        redirectMode: ""
        waypoint: {}
        ztunnel: {}
      cniConfiguration:
        enabled: true
        repair: {}
      smcEnabled: true
  2. Use kubectl to connect to the ACK cluster based on the information in the kubeconfig file. Then, modify the Deployments of the fortioserver and fortioclient applications to add the smc.asm.alibabacloud.com/enabled: "true" annotation to their pod templates.

    After you enable SMC for the ASM instance, you must also enable SMC for the workloads by adding the smc.asm.alibabacloud.com/enabled: "true" annotation to the related pods. SMC must be enabled for the workloads on both the client and server sides.

    1. Modify the Deployment of the fortioclient application.

      $ kubectl edit deployment fortioclient
      
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        ......
        name: fortioclient
      spec:
        ......
        template:
          metadata:
            ......
            annotations:
              smc.asm.alibabacloud.com/enabled: "true"
              
    2. Modify the Deployment of the fortioserver application.

      $ kubectl edit deployment fortioserver
      
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        ......
        name: fortioserver
      spec:
        ......
        template:
          metadata:
            ......
            annotations:
              smc.asm.alibabacloud.com/enabled: "true"
              
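If you prefer a non-interactive alternative to kubectl edit, the same annotation can be applied with kubectl patch. The following sketch only prints the commands (using the Deployment names from this example) so that you can review them before running:

```shell
# Print kubectl patch commands that add the SMC annotation to the pod
# template of each example Deployment; review the output, then run it.
patch='{"spec":{"template":{"metadata":{"annotations":{"smc.asm.alibabacloud.com/enabled":"true"}}}}}'
for d in fortioclient fortioserver; do
    echo "kubectl patch deployment $d --type merge -p '$patch'"
done
```

Patching the pod template triggers the same rolling restart as editing the Deployment interactively.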

Step 5: Run the test in the environment in which SMC is enabled and view the test results

The workloads restart after you modify the Deployments. Therefore, map the port of the fortioclient application to a local port again by referring to Step 3. Then, start the test again and wait until it is complete.

image

Compared with the test results before SMC is enabled, request latencies decrease and queries per second (QPS) increases significantly after you enable SMC for the ASM instance.
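
To quantify the difference between the two runs, you can compute the relative QPS change. The values below are placeholders that show the arithmetic, not measured results:

```shell
# Relative QPS change between the baseline run and the SMC-enabled run.
# Both values are illustrative placeholders; substitute your own results.
baseline_qps=52000
smc_qps=68000
awk -v b="$baseline_qps" -v s="$smc_qps" \
    'BEGIN { printf "QPS change: %+.1f%%\n", (s - b) / b * 100 }'
```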

Known issue

  1. After SMC is enabled on an ECS instance that runs Alibaba Cloud Linux 3 with kernel version 5.10.134-16.3, the system reports an error message similar to unregister_netdevice: waiting for eth* to become free. Usage count = * when the related pod is restarted, and the pod cannot be deleted.

    This issue occurs because the SMC kernel module does not properly release the reference count of the network interface. You can install the following hotfix to fix the issue. The issue does not exist in later kernel versions. For more information about kernel hotfixes, see Operations related to kernel hotfixes.

    $ sudo yum install kernel-hotfix-16664924-5.10.134-16.3