Alibaba Cloud Linux 3 provides Shared Memory Communication (SMC), a high-performance network protocol that runs in kernel space. SMC uses Remote Direct Memory Access (RDMA) technology behind standard socket interfaces to establish network communications and can significantly improve network performance. However, when you use SMC in a native Elastic Compute Service (ECS) environment, you must carefully maintain the SMC whitelist and the configurations in the network namespace of each related pod to prevent SMC connections from unexpectedly falling back to TCP. Service Mesh (ASM) provides the SMC optimization capability in a controlled network environment (that is, a cluster) to automatically accelerate network communications between pods in an ASM instance, so you do not need to manage the specific SMC configuration yourself.
Prerequisites
The version of your ASM instance is V1.21 or later. For more information about how to update an ASM instance, see Update an ASM instance.
Your Container Service for Kubernetes (ACK) cluster uses the Terway network plug-in. For more information, see Work with Terway.
Internet access to the API server of the ACK cluster is enabled. For more information, see Control public access to the API server of a cluster.
Nodes must use ECS instances that support elastic Remote Direct Memory Access (eRDMA). For more information, see Configure eRDMA on an enterprise-level instance.
Nodes must use Alibaba Cloud Linux 3. For more information, see Alibaba Cloud Linux 3.
Limits
The feature of enabling SMC in an ASM instance to accelerate network communications is in the beta phase.
In current versions, the network communication acceleration based on SMC and Sidecar Acceleration using eBPF features cannot be used simultaneously. This limit will be removed in later versions.
Procedure
Step 1: Initialize the related nodes
SMC uses elastic RDMA interface (ERI) to accelerate network communications. Before you enable SMC, you must initialize the related nodes.
Upgrade the kernel version of Alibaba Cloud Linux 3 to 5.10.134-16.3 or later.
Note: The known issue of kernel version 5.10.134-16.3 and how to fix it are described in the Known issue section.
Run the uname -r command to view the kernel version. If the kernel version is 5.10.134-16.3 or later, skip the kernel upgrade step.
$ uname -r
5.10.134-16.3.al8.x86_64
View kernel versions that can be installed.
$ sudo yum search kernel --showduplicates | grep kernel-5.10
Last metadata expiration check: 3:01:27 ago on Tue 09 Apr 2024 07:40:15 AM CST.
kernel-5.10.134-15.1.al8.x86_64 : The Linux kernel, based on version 5.10.134, heavily modified with backports
kernel-5.10.134-15.2.al8.x86_64 : The Linux kernel, based on version 5.10.134, heavily modified with backports
kernel-5.10.134-15.al8.x86_64 : The Linux kernel, based on version 5.10.134, heavily modified with backports
kernel-5.10.134-16.1.al8.x86_64 : The Linux kernel, based on version 5.10.134, heavily modified with backports
kernel-5.10.134-16.2.al8.x86_64 : The Linux kernel, based on version 5.10.134, heavily modified with backports
kernel-5.10.134-16.3.al8.x86_64 : The Linux kernel, based on version 5.10.134, heavily modified with backports
kernel-5.10.134-16.al8.x86_64 : The Linux kernel, based on version 5.10.134, heavily modified with backports
[...]
Install the latest version of the kernel or install a specific version of the kernel.
Install the latest version of the kernel:
$ sudo yum update kernel
Install a specific version of the kernel. In this example, kernel-5.10.134-16.3.al8.x86_64 is used:
$ sudo yum install kernel-5.10.134-16.3.al8.x86_64
Restart the nodes. After a node restarts, run the uname -r command to check whether the kernel has been upgraded to the expected version.
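The post-reboot check can also be scripted by comparing kernel versions with sort -V. The helper below is a hypothetical sketch, not part of the official tooling:

```shell
# Compare a running kernel version against the minimum required for SMC.
# kernel_at_least MIN CURRENT prints "ok" or "too-old".
kernel_at_least() {
  min="$1"
  cur="${2%%.al8*}"   # strip the distro suffix, e.g. ".al8.x86_64"
  lowest="$(printf '%s\n%s\n' "$min" "$cur" | sort -V | head -n 1)"
  if [ "$lowest" = "$min" ]; then echo ok; else echo too-old; fi
}

kernel_at_least 5.10.134-16.3 5.10.134-16.3.al8.x86_64   # → ok
kernel_at_least 5.10.134-16.3 5.10.134-15.2.al8.x86_64   # → too-old
# In practice: kernel_at_least 5.10.134-16.3 "$(uname -r)"
```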
Configure an elastic network interface (ENI) filter for the Terway network plug-in of the ACK cluster. This way, the Terway network plug-in does not manage the secondary ERI that will be added. For more information, see Configure an ENI filter.
For each node in the cluster, create a secondary ERI and bind it to the node. For more information, see Configure eRDMA on an existing instance.
Note: You only need to create secondary ERIs and bind them to nodes. In this scenario, each secondary ERI must be in the same subnet as the primary network interface card (NIC). For more information about how to configure a secondary ERI, see the next step.
Configure a secondary ERI on a node.
Save the following script in any directory on the node. Run the sudo chmod +x asm_erdma_eth_config.sh command to grant execute permissions on the script.
Note: This script is only used to configure a secondary ERI in the current context and is not applicable to other NIC configuration scenarios.
Run the sudo ./asm_erdma_eth_config.sh -s command to set the status of the new secondary ERI to UP and configure an IPv4 address for it. Expected output:
$ sudo ./asm_erdma_eth_config.sh -s
Find ethernet device with erdma: eth2
- state <DOWN>, IPv4 <192.168.x.x>, mask <255.255.255.0>, gateway <192.168.x.x>
Config..
- successed to set eth2 UP
- successed to configure eth2 IPv4/mask and direct route
Complete all configurations of eth2
(Optional) The preceding steps for configuring the secondary ERI must be performed again each time the node is restarted. If you want the secondary ERI to be configured automatically when the node restarts, perform the following steps to create the corresponding systemd service.
Add the following asm_erdma_eth_config.service file to the /etc/systemd/system directory of the node, and replace /path/to/asm_erdma_eth_config.sh with the actual path of the asm_erdma_eth_config.sh script on the node.
Enable asm_erdma_eth_config.service.
sudo systemctl daemon-reload
sudo systemctl enable asm_erdma_eth_config.service
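The content of the asm_erdma_eth_config.service file is not reproduced here. Based on the systemctl status output shown in this step (a oneshot service that runs after the network is up), a plausible sketch is the following; the exact directives are assumptions:

```ini
[Unit]
Description=Run asm_erdma_eth_config.sh script after network is up
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
# Replace with the actual script path on the node
ExecStart=/bin/sh /path/to/asm_erdma_eth_config.sh -s

[Install]
WantedBy=multi-user.target
```

Type=oneshot with RemainAfterExit=yes keeps the unit in the "active (exited)" state after the script completes, which matches the expected status output below.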
The secondary ERI is then automatically configured at node startup. After the node starts, you can run the sudo systemctl status asm_erdma_eth_config.service command to view the status of asm_erdma_eth_config.service. The expected state is active. Expected output:
# sudo systemctl status asm_erdma_eth_config.service
● asm_erdma_eth_config.service - Run asm_erdma_eth_config.sh script after network is up
   Loaded: loaded (/etc/systemd/system/asm_erdma_eth_config.service; enabled; vendor preset: enabled)
   Active: active (exited) since [time]
 Main PID: 1689 (code=exited, status=0/SUCCESS)
    Tasks: 0 (limit: 403123)
   Memory: 0B
   CGroup: /system.slice/asm_erdma_eth_config.service
[time] <hostname> systemd[1]: Starting Run asm_erdma_eth_config.sh script after network is up...
[time] <hostname> sh[1689]: Find ethernet device with erdma: eth2
[time] <hostname> sh[1689]: - state <DOWN>, IPv4 <192.168.x.x>, mask <255.255.255.0>, gateway <192.168.x.x>
[time] <hostname> sh[1689]: Config..
[time] <hostname> sh[1689]: - successed to set eth2 UP
[time] <hostname> sh[1689]: - successed to configure eth2 IPv4/mask and direct route
[time] <hostname> sh[1689]: Complete all configurations of eth2
[time] <hostname> systemd[1]: Started Run asm_erdma_eth_config.sh script after network is up.
If asm_erdma_eth_config.service is no longer needed, you can run the sudo systemctl disable asm_erdma_eth_config.service command to remove it.
Step 2: Deploy test applications
Enable automatic sidecar proxy injection for the default namespace, which is used in the following tests. For more information, see Enable automatic sidecar proxy injection.
Create a fortioserver.yaml file that contains the following content:
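The content of the fortioserver.yaml file is not reproduced here. As a rough sketch only, a manifest for this walkthrough would typically define the two Deployments and Services used in the following steps; the fortio/fortio image, labels, and container args below are assumptions, and the official manifest may differ:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: fortioserver
spec:
  selector:
    app: fortioserver
  ports:
  - name: http-echo      # matches the http://fortioserver:8080/echo test URL
    port: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fortioserver
spec:
  selector:
    matchLabels:
      app: fortioserver
  template:
    metadata:
      labels:
        app: fortioserver
    spec:
      containers:
      - name: fortio
        image: fortio/fortio   # assumption: the upstream fortio image
        args: ["server"]       # serves the echo endpoint and UI on 8080
---
apiVersion: v1
kind: Service
metadata:
  name: fortioclient
spec:
  selector:
    app: fortioclient
  ports:
  - name: http-ui        # console page used for port-forwarding below
    port: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fortioclient
spec:
  selector:
    matchLabels:
      app: fortioclient
  template:
    metadata:
      labels:
        app: fortioclient
    spec:
      containers:
      - name: fortio
        image: fortio/fortio
        args: ["server"]       # runs the fortio console used to drive the load
```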
Use kubectl to connect to the ACK cluster based on the information in the kubeconfig file and then run the following command to deploy the test applications:
kubectl apply -f fortioserver.yaml
Run the following command to view the status of the test applications:
kubectl get pods | grep fortio
Expected output:
NAME                            READY   STATUS    RESTARTS
fortioclient-8569b98544-9qqbj   3/3     Running   0
fortioserver-7cd5c46c49-mwbtq   3/3     Running   0
The output indicates that both test applications start as expected.
Step 3: Run a test in the baseline environment and view the baseline test results
After the fortio application starts, it exposes listening port 8080, which serves the console page of the fortio application. To generate test traffic, you can map the port of the fortioclient application to a local port and then open the fortio console page on your on-premises host.
Use kubectl to connect to the ACK cluster based on the information in the kubeconfig file and then run the following command to map port 8080 of the fortioclient application to the local port 8080.
kubectl port-forward service/fortioclient 8080:8080
In the address bar of your browser, enter http://localhost:8080/fortio to access the console of the fortioclient application and modify related configurations. Modify the parameter settings on the page shown in the preceding figure according to the following table.
Parameter                           Example
URL                                 http://fortioserver:8080/echo
QPS                                 100000
Duration                            30s
Threads/Simultaneous connections    64
Payload                             Enter the following string (128 bytes):
xhsyL4ELNoUUbC3WEyvaz0qoHcNYUh0j2YHJTpltJueyXlSgf7xkGqc5RcSJBtqUENNjVHNnGXmoMyILWsrZL1O2uordH6nLE7fY6h5TfTJCZtff3Wib8YgzASha8T8g
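For reference, an equivalent load can be generated with fortio's command line instead of the web console. Running this from inside the fortioclient container is an assumption, and -payload-size generates a random 128-byte payload rather than the exact string above:

```shell
fortio load -qps 100000 -t 30s -c 64 -payload-size 128 http://fortioserver:8080/echo
```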
After the configuration is complete, click Start in the lower part of the page to start the test. The test ends after the progress bar reaches the end.
After the test ends, the results of the test are displayed on the page. The following figure is for reference only. Test results vary with test environments.
On the test results page, the x-axis of the histogram shows request latencies and the y-axis shows the number of processed requests, so you can read the latency distribution directly from the histogram bars. The purple curve shows the cumulative number of requests processed within each response time. The P50, P75, P90, P99, and P99.9 request latencies are displayed at the top of the histogram. After you obtain the test data of the baseline environment, enable SMC for the applications and test their performance again.
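The percentile figures that fortio reports can be understood with a small worked example. The helper below is a hypothetical sketch (not part of fortio) that reads one latency sample per line and prints the nearest-rank percentile:

```shell
# Hypothetical nearest-rank percentile helper; not part of fortio.
# Usage: printf 'samples...\n' | percentile P
percentile() {
  sort -n | awk -v p="$1" '
    { a[NR] = $1 }                      # collect sorted samples
    END {
      idx = int(p / 100 * NR + 0.5)     # nearest-rank index
      if (idx < 1) idx = 1
      print a[idx]
    }'
}

# With samples 1 2 3 4 100 ms, P50 is 3 ms but P90 is 100 ms:
printf '1\n2\n3\n4\n100\n' | percentile 50   # → 3
printf '1\n2\n3\n4\n100\n' | percentile 90   # → 100
```

This also illustrates why P99 and P99.9 can sit far above the median: a few slow requests dominate the tail of the distribution.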
Step 4: Enable network communication acceleration based on SMC for your ASM instance and workloads
Use kubectl to connect to the ASM instance based on the information in the kubeconfig file. Then, run the following command and add the smcEnabled: true field to enable network communication acceleration based on SMC.
$ kubectl edit asmmeshconfig
apiVersion: istio.alibabacloud.com/v1beta1
kind: ASMMeshConfig
metadata:
  name: default
spec:
  ambientConfiguration:
    redirectMode: ""
    waypoint: {}
    ztunnel: {}
  cniConfiguration:
    enabled: true
    repair: {}
  smcEnabled: true
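Alternatively, the same field can be set non-interactively. This one-liner is a sketch that assumes the ASMMeshConfig object is named default, as in the output above:

```shell
kubectl patch asmmeshconfig default --type merge -p '{"spec":{"smcEnabled":true}}'
```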
Use kubectl to connect to the ACK cluster based on the information in the kubeconfig file. After you enable SMC for the ASM instance, you must also enable SMC for the workloads by adding the smc.asm.alibabacloud.com/enabled: "true" annotation to the related pods. In this example, you modify the Deployments of the fortioserver and fortioclient applications. You must enable SMC for workloads on both the client and server sides.
Modify the Deployment of the fortioclient application.
$ kubectl edit deployment fortioclient
apiVersion: apps/v1
kind: Deployment
metadata:
  ......
  name: fortioclient
spec:
  ......
  template:
    metadata:
      ......
      annotations:
        smc.asm.alibabacloud.com/enabled: "true"
Modify the Deployment of the fortioserver application.
$ kubectl edit deployment fortioserver
apiVersion: apps/v1
kind: Deployment
metadata:
  ......
  name: fortioserver
spec:
  ......
  template:
    metadata:
      ......
      annotations:
        smc.asm.alibabacloud.com/enabled: "true"
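As with the mesh configuration, the annotation can also be added non-interactively with kubectl patch; a sketch for both Deployments:

```shell
kubectl patch deployment fortioclient --type merge \
  -p '{"spec":{"template":{"metadata":{"annotations":{"smc.asm.alibabacloud.com/enabled":"true"}}}}}'
kubectl patch deployment fortioserver --type merge \
  -p '{"spec":{"template":{"metadata":{"annotations":{"smc.asm.alibabacloud.com/enabled":"true"}}}}}'
```

Patching the pod template triggers a rolling restart of the workloads, which is required for the annotation to take effect.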
Step 5: Run the test in the environment in which SMC is enabled and view the test results
The workloads restart after you modify the Deployments. Therefore, you must map the port of the fortioclient application to the local port again by referring to Step 3. Then, start the test again and wait until it completes.
Compared with the baseline results without SMC, request latencies decrease and queries per second (QPS) increase significantly after you enable SMC for the ASM instance.
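To check whether traffic is actually carried over SMC rather than having fallen back to TCP, you can inspect SMC sockets on a node with the smcss tool from the smc-tools package. Installing smc-tools this way, and the exact output, are assumptions not covered by the steps above:

```shell
sudo yum install smc-tools
sudo smcss
```

smcss lists active SMC sockets on the node; if a connection does not appear in its output, it is likely running over plain TCP.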
Known issue
After SMC is enabled on an ECS instance that runs Alibaba Cloud Linux 3 with kernel version 5.10.134-16.3, the system reports an error message similar to unregister_netdevice: waiting for eth* to become free. Usage count = * when the related pod is restarted, and the pod cannot be deleted successfully.
This issue occurs because the kernel module for SMC does not properly release the reference count of the network interface. You can install the following hotfix to fix this issue. This issue does not exist in later kernel versions. For more information about kernel hotfixes, see Operations related to kernel hotfixes.
$ sudo yum install kernel-hotfix-16664924-5.10.134-16.3