Enterprise SSDs (ESSDs) that support the Non-Volatile Memory Express (NVMe) protocol are called NVMe disks. NVMe disks support the multi-attach feature. This feature allows you to attach an NVMe disk to at most 16 Elastic Compute Service (ECS) instances and further use the NVMe reservation feature that complies with NVMe specifications. These features facilitate data sharing among all the nodes that run the pods of an application to improve data read and write performance. This topic describes how to use the multi-attach and NVMe reservation features of NVMe disks in a Container Service for Kubernetes (ACK) cluster.
Before you begin
To make better use of the multi-attach and NVMe reservation features of NVMe disks, we recommend that you understand the following content before you read this topic.
For more information about the NVMe protocol, see NVMe protocol.
For more information about NVMe disks, see NVMe disks.
Scenarios
The multi-attach feature is suitable for scenarios in which multiple nodes require concurrent access to the same data, such as the high-availability failover setup demonstrated by the sample application in this topic.
Limits
A single NVMe disk can be attached to at most 16 ECS instances that reside in the same zone.
If you use the multi-attach feature, ACK allows you to attach NVMe disks only by using the volumeDevices method. In this case, you can read data from and write data to the disks on multiple nodes, but the disks cannot be mounted as file systems, as illustrated in the fragment after this list.
For more information, see the Limits section of the "Enable multi-attach" topic.
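For reference, the following minimal pod manifest shows what the volumeDevices method looks like: the container sees the volume as a raw block device at the specified devicePath instead of a mounted directory. The pod name, image URL, and device path are illustrative; the complete example appears in Step 1.
apiVersion: v1
kind: Pod
metadata:
  name: block-consumer                    # illustrative name
spec:
  containers:
  - name: app
    image: registry.example.com/app:v1   # illustrative image
    volumeDevices:                       # raw block device, not a mounted file system
    - name: data
      devicePath: /dev/data-disk         # device node exposed inside the container
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: data-disk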
Prerequisites
An ACK managed cluster that runs Kubernetes 1.20 or later is created. For more information, see Create an ACK managed cluster.
The csi-plugin and csi-provisioner components of V1.24.10-7ae4421-aliyun or later are installed. For more information about how to update the csi-plugin and csi-provisioner components, see Manage the CSI plug-in.
The ACK cluster contains at least two nodes that reside in the same zone and support the multi-attach feature. For more information about the instance families that support the multi-attach feature, see the Limits section of the "Enable multi-attach" topic.
An application that meets the following requirements is prepared. The application is packaged into a container image for deployment in the ACK cluster.
The application supports simultaneous access to the data on a disk from multiple replicated pods.
The application can use standard features such as the NVMe reservation feature to ensure data consistency.
Billing
The multi-attach feature for NVMe disks is free of charge. However, you are charged for the resources that support NVMe, such as the disks themselves, based on their individual billing methods. For more information about disk billing, see Billing.
Sample application
In this example, a sample application is developed from source code and a Dockerfile and uploaded to an image repository for subsequent deployment in the cluster. Multiple replicated pods jointly manage a lease stored on the shared disk, but only one pod holds the lease at a time. If the pod that holds the lease cannot run as expected, another pod automatically preempts the lease. When you develop the sample application, take note of the following items:
The sample application uses O_DIRECT to open the block storage device and perform read and write operations. This prevents the tests in this example from being affected by the page cache.
The sample application uses the simplified reservation interface provided by the Linux kernel. You can also use one of the following methods to run NVMe reservation commands. In this case, you must grant the required permissions to your container.
C: ioctl(fd, NVME_IOCTL_IO_CMD, &cmd);
Command-line tool: nvme-cli
For more information about the NVMe reservation feature, see NVMe Specifications.
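For reference, the ioctl invocation above can be expanded into a small standalone program. The following is a minimal sketch that registers a reservation key through the NVMe passthrough interface; the device path and the key value are illustrative, and a production version needs more thorough error handling.
#include <fcntl.h>
#include <linux/nvme_ioctl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void) {
    int fd = open("/dev/nvme1n1", O_RDWR);  /* illustrative device path */
    if (fd < 0) { perror("open"); return 1; }

    int nsid = ioctl(fd, NVME_IOCTL_ID);    /* namespace ID of this block device */
    if (nsid < 0) { perror("NVME_IOCTL_ID"); return 1; }

    /* Reservation Register data structure: current key, then new key. */
    uint64_t keys[2] = { 0, 0x4745d0c5cd9a2fa4ULL };

    struct nvme_passthru_cmd cmd;
    memset(&cmd, 0, sizeof(cmd));
    cmd.opcode   = 0x0d;                    /* Reservation Register */
    cmd.nsid     = (uint32_t)nsid;
    cmd.addr     = (uint64_t)(uintptr_t)keys;
    cmd.data_len = sizeof(keys);
    cmd.cdw10    = 0;                       /* RREGA=0: register a new key */

    int err = ioctl(fd, NVME_IOCTL_IO_CMD, &cmd);
    if (err != 0) {                         /* >0: NVMe status; <0: errno */
        fprintf(stderr, "reservation register failed: %d\n", err);
        return 1;
    }
    puts("reservation key registered");
    close(fd);
    return 0;
}
Reservation Acquire (opcode 0x11), Reservation Release (opcode 0x15), and Reservation Report (opcode 0x0e) follow the same passthrough pattern with different opcodes and payloads.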
Step 1: Deploy an application and configure multi-attach
Create a StorageClass named alicloud-disk-shared, and enable the multi-attach feature for NVMe disks.
Create a persistent volume claim (PVC) named data-disk, and set the accessModes parameter to ReadWriteMany and the volumeMode parameter to Block.
Create a StatefulSet named lease-test that uses the image of the sample application in this example.
Create the lease.yaml file that contains the following content:
Replace the container image URL in the following YAML file with the image URL of your application.
Important: The NVMe reservation feature takes effect at the node level, so multiple pods on the same node may interfere with each other. In this example, podAntiAffinity settings are configured to prevent multiple pods from being scheduled to the same node. If your cluster contains other nodes that do not support the NVMe protocol, you must also configure nodeAffinity settings to schedule the pods only to nodes that use the NVMe protocol.
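The exact manifest depends on your application. The following is a minimal sketch of lease.yaml that reflects the settings summarized in the table below; the disk category, disk size, and image URL are placeholder assumptions that you must adapt.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: alicloud-disk-shared
provisioner: diskplugin.csi.alibabacloud.com
parameters:
  type: cloud_essd            # placeholder: an ESSD category that supports NVMe
  multiAttach: "true"         # enable the multi-attach feature
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-disk
spec:
  storageClassName: alicloud-disk-shared
  accessModes: ["ReadWriteMany"]
  volumeMode: Block            # access the disk as a raw block device
  resources:
    requests:
      storage: 20Gi            # placeholder size
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: lease-test
spec:
  serviceName: lease-test
  replicas: 2
  selector:
    matchLabels:
      app: lease-test
  template:
    metadata:
      labels:
        app: lease-test
    spec:
      affinity:
        podAntiAffinity:       # keep the replicas on different nodes
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: lease-test
            topologyKey: kubernetes.io/hostname
      containers:
      - name: lease
        image: <your-image-url>   # replace with the image URL of your application
        volumeDevices:
        - name: data
          devicePath: /dev/data-disk
      volumes:
      - name: data               # all replicas share the same PVC
        persistentVolumeClaim:
          claimName: data-disk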
The following table compares the multi-attach configuration with a regular attach configuration.

| Parameter or setting | Configure the multi-attach feature | Configure the regular attach feature |
| --- | --- | --- |
| StorageClass: parameters.multiAttach | true: enables the multi-attach feature for NVMe disks. | No configuration is required. |
| PVC: accessModes | ReadWriteMany | ReadWriteOnce |
| PVC: volumeMode | Block | Filesystem |
| Volume attach method | volumeDevices: uses a block storage device to access the data on the disk. | volumeMounts: mounts the volume as a file system. |
Run the following command to deploy the application:
kubectl apply -f lease.yaml
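After the pods are running, you can optionally confirm that the two replicas were scheduled to different nodes:
kubectl get pods -l app=lease-test -o wide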
Step 2: Verify the multi-attach and NVMe reservation features
To ensure data consistency for NVMe cloud disks, you can use the reservation feature to control read and write permissions in your application. If one pod is performing write operations, other pods can only perform read operations.
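The sample application issues the reservation commands programmatically, but for reference you can also manage reservations manually with the nvme-cli tool. For example, the following command acquires a Write Exclusive reservation (rtype 1, which matches the report output shown later in this step). The device path and key are illustrative, and the key must already be registered on the host, for example by using nvme resv-register --nrkey=0x4745d0c5cd9a2fa4.
nvme resv-acquire /dev/nvme1n1 --crkey=0x4745d0c5cd9a2fa4 --rtype=1 --racqa=0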
Read data from and write data to a disk on multiple nodes
Run the following command to query pod logs:
kubectl logs -l app=lease-test --prefix -f
Expected results:
[pod/lease-test-0/lease] Register as key 4745d0c5cd9a2fa4
[pod/lease-test-0/lease] Refreshed lease
[pod/lease-test-0/lease] Refreshed lease
[pod/lease-test-1/lease] Remote lease-test-0 refreshed lease
[pod/lease-test-0/lease] Refreshed lease
[pod/lease-test-1/lease] Remote lease-test-0 refreshed lease
[pod/lease-test-0/lease] Refreshed lease
[pod/lease-test-1/lease] Remote lease-test-0 refreshed lease
[pod/lease-test-0/lease] Refreshed lease
[pod/lease-test-1/lease] Remote lease-test-0 refreshed lease
The expected results show that Pod lease-test-1 can instantly read the data written by Pod lease-test-0.
Check whether an NVMe reservation is acquired
Run the following command to query the disk ID:
kubectl get pvc data-disk -ojsonpath='{.spec.volumeName}'
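The command returns the name of the bound persistent volume, which is the disk ID, in the following format:
d-2zxxxxxxxxxxx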
Log on to one of the two nodes and run the following command to check whether an NVMe reservation is acquired:
Replace 2zxxxxxxxxxxx in the following command with the content after d- in the disk ID that you queried in the previous step.
nvme resv-report -c 1 /dev/disk/by-id/nvme-Alibaba_Cloud_Elastic_Block_Storage_2zxxxxxxxxxxx
Expected results:
NVME Reservation status:

gen       : 3
rtype     : 1
regctl    : 1
ptpls     : 1
regctlext[0] :
  cntlid  : ffff
  rcsts   : 1
  rkey    : 4745d0c5cd9a2fa4
  hostid  : 4297c540000daf4a4*****
The expected results show that an NVMe reservation is acquired: the rkey value matches the key registered in the pod logs, and rtype 1 indicates a Write Exclusive reservation.
Block write operations on abnormal nodes by using the NVMe reservation feature
Log on to the node on which Pod lease-test-0 resides and run the following command to suspend the process for fault simulation:
pkill -STOP -f /usr/local/bin/lease
Wait 30 seconds and run the following command to query pod logs:
kubectl logs -l app=lease-test --prefix -f
Expected results:
[pod/lease-test-1/lease] Remote lease-test-0 refreshed lease
[pod/lease-test-1/lease] Remote is dead, preempting
[pod/lease-test-1/lease] Register as key 4745d0c5cd9a2fa4
[pod/lease-test-1/lease] Refreshed lease
[pod/lease-test-1/lease] Refreshed lease
[pod/lease-test-1/lease] Refreshed lease
The expected results show that Pod lease-test-1 has taken over the container lease and become the lease holder.
Log on to the node on which Pod lease-test-0 resides again and run the following command to resume the suspended process:
pkill -CONT -f /usr/local/bin/lease
Run the following command to query pod logs again:
kubectl logs -l app=lease-test --prefix -f
Expected results:
[pod/lease-test-0/lease] failed to write lease: Invalid exchange
The expected results show that Pod lease-test-0 can no longer write data to the disk and its container automatically restarts. This indicates that the write operations of Pod lease-test-0 are blocked by the NVMe reservation feature.
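For reference, this is how a blocked write typically surfaces to a Linux application: the write fails with errno EBADE, whose message is "Invalid exchange", as shown in the log above. A minimal sketch, assuming an illustrative device path and an O_DIRECT-aligned buffer:
#define _GNU_SOURCE             /* for O_DIRECT */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("/dev/data-disk", O_RDWR | O_DIRECT);  /* illustrative path */
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0) return 1; /* O_DIRECT alignment */
    memset(buf, 0, 4096);

    if (pwrite(fd, buf, 4096, 0) < 0) {
        if (errno == EBADE)     /* write blocked by a reservation held elsewhere */
            fprintf(stderr, "failed to write lease: %s\n", strerror(errno));
        else
            perror("pwrite");
        return 1;
    }
    puts("write succeeded");
    close(fd);
    return 0;
}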
References
If your NVMe disks do not have sufficient space, see Expand disk volumes.