
Container Service for Kubernetes:Use the multi-attach and NVMe reservation features of NVMe disks

Last Updated:Oct 09, 2024

Enterprise SSDs (ESSDs) that support the Non-Volatile Memory Express (NVMe) protocol are called NVMe disks. NVMe disks support the multi-attach feature. This feature allows you to attach an NVMe disk to at most 16 Elastic Compute Service (ECS) instances and to use the NVMe reservation feature defined in the NVMe specification. These features facilitate data sharing among all the nodes that run the pods of an application and improve data read and write performance. This topic describes how to use the multi-attach and NVMe reservation features of NVMe disks in a Container Service for Kubernetes (ACK) cluster.

Before you begin

To make better use of the multi-attach and NVMe reservation features of NVMe disks, we recommend that you understand the following content before you read this topic.

  • For more information about the NVMe protocol, see NVMe protocol.

  • For more information about NVMe disks, see NVMe disks.

Scenarios

The multi-attach feature is suitable for the following scenarios:

  • Data sharing

    Data sharing is the simplest use scenario of NVMe. After data is written to a shared NVMe disk from one attachment node, all other attachment nodes can access the data. This reduces storage costs and improves read/write performance. For example, a single NVMe-capable container image in the cloud can be read and loaded by multiple instances that run the same operating system.

  • High-availability failover

    High service availability is one of the most common application scenarios of shared disks. Traditional SAN-based databases, such as Oracle Real Application Clusters (RAC) and SAP High-performance ANalytic Appliance (HANA), as well as cloud-native high-availability databases, may encounter single points of failure (SPOFs) in actual business scenarios. Shared NVMe disks help ensure business continuity and high availability at the cloud storage and network layers when an SPOF occurs. Compute nodes may still experience outages, downtime, and hardware failures. To achieve high availability of compute nodes, you can deploy your business in primary/secondary mode.

    For example, in a database scenario, if the primary database fails, the secondary database quickly takes over to provide services. After the workload is switched from the instance that hosts the primary database to the instance that hosts the secondary database, you can run an NVMe Persistent Reservation (PR) command to revoke the write permissions of the faulty primary database. This prevents the faulty primary database from writing data and ensures data consistency. The failover process is as follows:

    Note

    PR is a part of the NVMe protocol that can precisely control read and write permissions on a cloud disk to ensure that the compute nodes can write data as expected. For more information, see NVM Express Base Specification.

    1. The primary database instance (Database Instance 1) fails, which causes the service to stop.

    2. Run an NVMe PR command to revoke the write permissions of Database Instance 1 so that data can be written only by the secondary database instance (Database Instance 2). A hedged nvme-cli sketch of this fencing step is provided at the end of this section.

    3. Restore Database Instance 2 to the same state as Database Instance 1 by using methods such as log replay.

    4. Database Instance 2 takes over as the primary database instance to provide services externally.

  • Distributed data cache acceleration

    Multi-attach-enabled cloud disks deliver high IOPS and throughput and can accelerate storage systems with slow or medium speeds. For example, data lakes are commonly built on top of Object Storage Service (OSS), and each data lake can be simultaneously accessed by multiple clients. Data lakes deliver high sequential read throughput and high append write throughput, but have high latency and low random read/write performance. To greatly improve access performance in scenarios such as data lakes, you can attach a high-speed, multi-attach-enabled cloud disk to compute nodes as a cache.

  • Machine learning

    In machine learning scenarios, after a sample is labeled and written, the sample is split and distributed across multiple nodes to facilitate parallel distributed computing. The multi-attach feature allows each compute node to directly access shared storage resources without the need to frequently transmit data over the network. This reduces data transfer latency and accelerates the model training process. The combination of high performance and the multi-attach feature allows cloud disks to provide an efficient and flexible storage solution for machine learning scenarios, such as large-scale model training tasks that require high-speed data access and processing. The storage solution significantly improves the efficiency and effectiveness of the machine learning process.

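As mentioned in the high-availability failover scenario, the faulty primary can be fenced by running NVMe PR commands. The following is a minimal, hedged sketch based on nvme-cli that uses the same flags as the sample Bash script later in this topic. The device path /dev/data-disk and the registration key are placeholders, and the commands must be run on the node that takes over as the new primary.

# Register this host's reservation key on the shared disk, ignoring any existing key.
nvme resv-register /dev/data-disk --iekey --nrkey=0x4745D0C5CD9A2FA4
# Preempt the previous holder (racqa=1) and take a Write Exclusive reservation (rtype=1),
# which revokes the faulty primary's write access to the shared disk.
nvme resv-acquire /dev/data-disk --racqa=1 --rtype=1 --prkey=0x4745D0C5CD9A2FA4 --crkey=0x4745D0C5CD9A2FA4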

Limits

  • A single NVMe disk can be attached to at most 16 ECS instances that reside in the same zone.

  • If you use the multi-attach feature, ACK allows you to attach NVMe disks only by using the volumeDevices method. In this case, you can read data from and write data to the disks on multiple nodes, but the disks cannot be accessed from file systems.

  • For more information, see the Limits section of the "Enable multi-attach" topic.

Prerequisites

  • An ACK managed cluster that runs Kubernetes 1.20 or later is created. For more information, see Create an ACK managed cluster.

  • The csi-plugin and csi-provisioner components of V1.24.10-7ae4421-aliyun or later are installed. For more information about how to update the csi-plugin and csi-provisioner components, see Manage the CSI plug-in.

  • The ACK cluster contains at least two nodes that reside in the same zone and support the multi-attach feature. For more information about the instance families that support the multi-attach feature, see the Limits section of the "Enable multi-attach" topic.

  • An application that meets the following requirements is prepared. The application is packaged into a container image for deployment in the ACK cluster.

    • The application supports simultaneous access to the data on a disk from multiple replicated pods.

    • The application can use standard features such as the NVMe reservation feature to ensure data consistency.

Billing

The multi-attach feature for NVMe disks is free of charge. You are charged for resources that support NVMe based on their individual billing methods. For more information about disk billing, see Billing.

Sample application

In this example, the following source code and Dockerfile are used to develop a sample application. The application is packaged into a container image and pushed to an image repository for subsequent deployment in the cluster. Multiple replicated pods jointly manage a lease stored on the shared disk, but only one pod holds the lease at a time. If the pod that holds the lease cannot run as expected, another pod automatically preempts the lease. When you develop the sample application, take note of the following items:

  • The sample application uses O_DIRECT to open block storage devices and perform read and write operations. This prevents the tests in this example from being affected by the cache.

  • The sample application uses the simplified persistent reservation interface (linux/pr.h) provided by the Linux kernel. You can also use one of the following methods to run NVMe reservation commands. In this case, you must grant the required permissions to your container.

    • C: ioctl(fd, NVME_IOCTL_IO_CMD, &cmd);

    • Command-line tool: nvme-cli

  • For more information about the NVMe reservation feature, see NVMe Specifications.

View the source code of the sample application

C (lease.c):

#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/pr.h>
#include <signal.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <time.h>
#include <unistd.h>

const char *disk_device = "/dev/data-disk";
uint64_t magic = 0x4745D0C5CD9A2FA4;

void panic(const char *restrict format, ...) {
    va_list args;
    va_start(args, format);
    vfprintf(stderr, format, args);
    va_end(args);
    exit(EXIT_FAILURE);
}

struct lease {
    uint64_t magic;
    struct timespec acquire_time;
    char holder[64];
};

volatile bool shutdown = false;
void on_term(int signum) {
    shutdown = true;
}

struct lease *lease;
const size_t lease_alloc_size = 512;

void acquire_lease(int disk_fd) {
    int ret;

    struct pr_registration pr_reg = {
        .new_key = magic,
        .flags = PR_FL_IGNORE_KEY,
    };
    ret = ioctl(disk_fd, IOC_PR_REGISTER, &pr_reg);
    if (ret != 0)
        panic("failed to register (%d): %s\n", ret, strerror(errno));

    struct pr_preempt pr_pre = {
        .old_key = magic,
        .new_key = magic,
        .type  = PR_WRITE_EXCLUSIVE,
    };
    ret = ioctl(disk_fd, IOC_PR_PREEMPT, &pr_pre);
    if (ret != 0)
        panic("failed to preempt (%d): %s\n", ret, strerror(errno));

    // register again in case we preempted ourselves
    ret = ioctl(disk_fd, IOC_PR_REGISTER, &pr_reg);
    if (ret != 0)
        panic("failed to register (%d): %s\n", ret, strerror(errno));
    fprintf(stderr, "Register as key %lx\n", magic);


    struct pr_reservation pr_rev = {
        .key   = magic,
        .type  = PR_WRITE_EXCLUSIVE,
    };
    ret = ioctl(disk_fd, IOC_PR_RESERVE, &pr_rev);
    if (ret != 0)
        panic("failed to reserve (%d): %s\n", ret, strerror(errno));

    lease->magic = magic;
    gethostname(lease->holder, sizeof(lease->holder));

    while (!shutdown) {
        clock_gettime(CLOCK_MONOTONIC, &lease->acquire_time);
        ret = pwrite(disk_fd, lease, lease_alloc_size, 0);
        if (ret < 0)
            panic("failed to write lease: %s\n", strerror(errno));
        fprintf(stderr, "Refreshed lease\n");
        sleep(5);
    }
}

int timespec_compare(const struct timespec *a, const struct timespec *b) {
    if (a->tv_sec < b->tv_sec)
        return -1;
    if (a->tv_sec > b->tv_sec)
        return 1;
    if (a->tv_nsec < b->tv_nsec)
        return -1;
    if (a->tv_nsec > b->tv_nsec)
        return 1;
    return 0;
}

int main() {
    assert(lease_alloc_size >= sizeof(struct lease));
    lease = aligned_alloc(512, lease_alloc_size);
    if (lease == NULL)
        panic("failed to allocate memory\n");

    // char *reg_key_str = getenv("REG_KEY");
    // if (reg_key_str == NULL)
    //     panic("REG_KEY env not specified");

    // uint64_t reg_key = atoll(reg_key_str) | (magic << 32);
    // fprintf(stderr, "Will register as key %lx", reg_key);


    int disk_fd = open(disk_device, O_RDWR|O_DIRECT);
    if (disk_fd < 0)
        panic("failed to open disk: %s\n", strerror(errno));

    // setup signal handler
    struct sigaction sa = {
        .sa_handler = on_term,
    };
    sigaction(SIGTERM, &sa, NULL);
    sigaction(SIGINT, &sa, NULL);

    struct timespec last_active_local = {0};   // local time at which a remote refresh was last observed
    struct timespec last_active_remote = {0};  // last acquire_time read from the on-disk lease

    int ret = pread(disk_fd, lease, lease_alloc_size, 0);
    if (ret < 0)
        panic("failed to read lease: %s\n", strerror(errno));

    if (lease->magic != magic) {
        // new disk, no lease
        acquire_lease(disk_fd);
    } else {
        // someone else has the lease
        while (!shutdown) {
            struct timespec now;
            clock_gettime(CLOCK_MONOTONIC, &now);
            if (timespec_compare(&lease->acquire_time, &last_active_remote)) {
                fprintf(stderr, "Remote %s refreshed lease\n", lease->holder);
                last_active_remote = lease->acquire_time;
                last_active_local = now;
            } else if (now.tv_sec - last_active_local.tv_sec > 20) {
                // remote is dead
                fprintf(stderr, "Remote is dead, preempting\n");
                acquire_lease(disk_fd);
                break;
            }
            sleep(5);
            int ret = pread(disk_fd, lease, lease_alloc_size, 0);
            if (ret < 0)
                panic("failed to read lease: %s\n", strerror(errno));
        }
    }

    close(disk_fd);
}

Bash (lease.sh):

#!/bin/bash

set -e

DISK_DEVICE="/dev/data-disk"
MAGIC=0x4745D0C5CD9A2FA4

SHUTDOWN=0
trap "SHUTDOWN=1" SIGINT SIGTERM

function acquire_lease() {
    # racqa:
    # 0: acquire
    # 1: preempt

    # rtype:
    # 1: write exclusive

    nvme resv-register $DISK_DEVICE --iekey --nrkey=$MAGIC
    nvme resv-acquire $DISK_DEVICE --racqa=1 --rtype=1 --prkey=$MAGIC --crkey=$MAGIC
    # register again in case we preempted ourselves
    nvme resv-register $DISK_DEVICE --iekey --nrkey=$MAGIC
    nvme resv-acquire $DISK_DEVICE --racqa=0 --rtype=1 --prkey=$MAGIC --crkey=$MAGIC

    while [[ $SHUTDOWN -eq 0 ]]; do
        echo "$MAGIC $(date +%s) $HOSTNAME" | dd of=$DISK_DEVICE bs=512 count=1 oflag=direct status=none
        echo "Refreshed lease"
        sleep 5
    done
}

LEASE=$(dd if=$DISK_DEVICE bs=512 count=1 iflag=direct status=none)

if [[ $LEASE != $MAGIC* ]]; then
    # new disk, no lease
    acquire_lease
else
    last_active_remote=-1
    last_active_local=-1
    while [[ $SHUTDOWN -eq 0 ]]; do
        now=$(date +%s)
        read -r magic timestamp holder < <(echo $LEASE)
        if [ "$last_active_remote" != "$timestamp" ]; then
            echo "Remote $holder refreshed the lease"
            last_active_remote=$timestamp
            last_active_local=$now
        elif (($now - $last_active_local > 10)); then
            echo "Remote is dead, preempting"
            acquire_lease
            break
        fi
        sleep 5
        LEASE=$(dd if=$DISK_DEVICE bs=512 count=1 iflag=direct status=none)
    done
fi

The lease.yaml file used in this example works as-is only if you deploy the C version of the application. If you deploy the Bash version, you must also grant the required permissions to your container in the YAML file. Sample code:

securityContext:
  capabilities:
    add: ["SYS_ADMIN"]

View the Dockerfile

C:

# syntax=docker/dockerfile:1.4

FROM buildpack-deps:bookworm as builder

COPY lease.c /usr/src/nvme-resv/
RUN gcc -o /lease -O2 -Wall /usr/src/nvme-resv/lease.c

FROM debian:bookworm-slim

COPY --from=builder --link /lease /usr/local/bin/lease
ENTRYPOINT ["/usr/local/bin/lease"]

Bash:

# syntax=docker/dockerfile:1.4
FROM debian:bookworm-slim

RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
    --mount=type=cache,target=/var/lib/apt,sharing=locked \
    sed -i 's/deb.debian.org/mirrors.aliyun.com/g' /etc/apt/sources.list.d/debian.sources && \
    rm -f /etc/apt/apt.conf.d/docker-clean && \
    apt-get update && \
    apt-get install -y nvme-cli

COPY --link lease.sh /usr/local/bin/lease
ENTRYPOINT ["/usr/local/bin/lease"]
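
After you build the images, push them to an image repository that your cluster can pull from. The Dockerfile file names and the registry address in the following sketch are assumptions for illustration only; replace them with your own paths and repository address.

# Build and push the C variant (assumed Dockerfile name: Dockerfile.c).
docker build -t registry.example.com/demo/lease:c -f Dockerfile.c .
docker push registry.example.com/demo/lease:c

# Build and push the Bash variant (assumed Dockerfile name: Dockerfile.bash).
docker build -t registry.example.com/demo/lease:bash -f Dockerfile.bash .
docker push registry.example.com/demo/lease:bash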

Step 1: Deploy an application and configure multi-attach

In this step, you create the following resources by using a single lease.yaml file:

  • A StorageClass named alicloud-disk-shared with the multi-attach feature enabled for NVMe disks.

  • A persistent volume claim (PVC) named data-disk whose accessModes parameter is set to ReadWriteMany and whose volumeMode parameter is set to Block.

  • A StatefulSet named lease-test that uses the image of the sample application in this example.

  1. Create the lease.yaml file that contains the following content:

    Replace the container image URL in the following YAML file with the image URL of your application.

    Important
    • NVMe reservations take effect at the node level. As a result, multiple pods on the same node may interfere with each other. In this example, podAntiAffinity settings are configured to prevent multiple pods from being scheduled to the same node.

    • If your cluster contains other nodes that do not use the NVMe protocol, you must configure nodeAffinity settings to schedule pods to the nodes that use the NVMe protocol.

    View the lease.yaml file

    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: alicloud-disk-shared
    parameters:
      type: cloud_essd
      multiAttach: "true"
    provisioner: diskplugin.csi.alibabacloud.com
    reclaimPolicy: Delete
    volumeBindingMode: WaitForFirstConsumer
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: data-disk
    spec:
      accessModes: [ "ReadWriteMany" ]
      storageClassName: alicloud-disk-shared
      volumeMode: Block
      resources:
        requests:
          storage: 20Gi
    ---
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: lease-test
    spec:
      replicas: 2
      serviceName: lease-test
      selector:
        matchLabels:
          app: lease-test
      template:
        metadata:
          labels:
            app: lease-test
        spec:
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                  - key: app
                    operator: In
                    values:
                    - lease-test
                topologyKey: "kubernetes.io/hostname"
          containers:
          - name: lease
            image: <IMAGE OF APP>   # Specify the image URL of your application. 
            volumeDevices:
            - name: data-disk
              devicePath: /dev/data-disk  
          volumes:
          - name: data-disk
            persistentVolumeClaim:
              claimName: data-disk

    | Parameter or setting | Configure the multi-attach feature | Configure the regular attach feature |
    | --- | --- | --- |
    | StorageClass: parameters.multiAttach | true: enables the multi-attach feature for NVMe disks. | No configuration is required. |
    | PVC: accessModes | ReadWriteMany | ReadWriteOnce |
    | PVC: volumeMode | Block | Filesystem |
    | Volume attach method | volumeDevices: uses block storage devices to access the data on disks. | volumeMounts: attaches the volumes of file systems. |

  2. Run the following command to deploy the application:

    kubectl apply -f lease.yaml
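
    After the deployment is complete, you can optionally confirm that the two pods are scheduled to different nodes, as enforced by the podAntiAffinity settings. The following sketch uses standard kubectl commands:

    # List the pods together with the nodes they run on; the two replicas are expected to land on different nodes.
    kubectl get pods -l app=lease-test -o wide
    # Confirm that the PVC is bound.
    kubectl get pvc data-disk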

Step 2: Verify the multi-attach and NVMe reservation features

To ensure data consistency for NVMe cloud disks, you can use the reservation feature to control read and write permissions in your application. If one pod is performing write operations, other pods can only perform read operations.

Read data from and write data to a disk on multiple nodes

Run the following command to query pod logs:

kubectl logs -l app=lease-test --prefix -f

Expected results:

[pod/lease-test-0/lease] Register as key 4745d0c5cd9a2fa4
[pod/lease-test-0/lease] Refreshed lease
[pod/lease-test-0/lease] Refreshed lease
[pod/lease-test-1/lease] Remote lease-test-0 refreshed lease
[pod/lease-test-0/lease] Refreshed lease
[pod/lease-test-1/lease] Remote lease-test-0 refreshed lease
[pod/lease-test-0/lease] Refreshed lease
[pod/lease-test-1/lease] Remote lease-test-0 refreshed lease
[pod/lease-test-0/lease] Refreshed lease
[pod/lease-test-1/lease] Remote lease-test-0 refreshed lease

The expected results show that Pod lease-test-1 can instantly read the data written by Pod lease-test-0.
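
Optionally, you can verify the shared data from the node side. The following hedged sketch reads the 512-byte lease block that the sample application writes at offset 0, directly from the block device on any node to which the disk is attached. The device path follows the naming convention used in the next section, and 2zxxxxxxxxxxx is a placeholder for your disk ID.

# Dump the raw lease block; both nodes are expected to see the same content.
dd if=/dev/disk/by-id/nvme-Alibaba_Cloud_Elastic_Block_Storage_2zxxxxxxxxxxx bs=512 count=1 iflag=direct status=none | od -c | head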

Check whether an NVMe reservation is acquired

  1. Run the following command to query the disk ID:

    kubectl get pvc data-disk -ojsonpath='{.spec.volumeName}'
  2. Log on to one of the two nodes and run the following command to check whether an NVMe reservation is acquired:

    Replace 2zxxxxxxxxxxx in the following code with the content after d- in the disk ID that you queried in the previous step.

    nvme resv-report -c 1 /dev/disk/by-id/nvme-Alibaba_Cloud_Elastic_Block_Storage_2zxxxxxxxxxxx

    Expected results:

    NVME Reservation status:
    
    gen       : 3
    rtype     : 1
    regctl    : 1
    ptpls     : 1
    regctlext[0] :
      cntlid     : ffff
      rcsts      : 1
      rkey       : 4745d0c5cd9a2fa4
      hostid     : 4297c540000daf4a4*****

    The expected results show that an NVMe reservation is acquired.
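
For convenience, the two steps above can be combined. The following hedged sketch assumes that the volume name returned for the PVC is the disk ID with the d- prefix used by Alibaba Cloud disks. Because kubectl typically runs on a machine with cluster access while the nvme command must be run on a node to which the disk is attached, the sketch only prints the command to run:

# Derive the disk ID from the PVC and print the matching resv-report command.
DISK_ID=$(kubectl get pvc data-disk -o jsonpath='{.spec.volumeName}' | sed 's/^d-//')
echo "nvme resv-report -c 1 /dev/disk/by-id/nvme-Alibaba_Cloud_Elastic_Block_Storage_${DISK_ID}"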

Block write operations on abnormal nodes by using the NVMe reservation feature

  1. Log on to the node on which Pod lease-test-0 resides and run the following command to suspend the process for fault simulation:

    pkill -STOP -f /usr/local/bin/lease
  2. Wait 30 seconds and run the following command to query pod logs:

    kubectl logs -l app=lease-test --prefix -f

    Expected results:

    [pod/lease-test-1/lease] Remote lease-test-0 refreshed lease
    [pod/lease-test-1/lease] Remote is dead, preempting
    [pod/lease-test-1/lease] Register as key 4745d0c5cd9a2fa4
    [pod/lease-test-1/lease] Refreshed lease
    [pod/lease-test-1/lease] Refreshed lease
    [pod/lease-test-1/lease] Refreshed lease

    The expected results show that Pod lease-test-1 has taken over the container lease and becomes the master node.

  3. Log on to the node on which Pod lease-test-0 resides again and run the following command to resume the suspended process:

    pkill -CONT -f /usr/local/bin/lease
  4. Run the following command to query pod logs again:

    kubectl logs -l app=lease-test --prefix -f

    Expected results:

    [pod/lease-test-0/lease] failed to write lease: Invalid exchange

    The expected results show that Pod lease-test-0 can no longer write data to the disk and its container is automatically restarted. This indicates that the write operations of Pod lease-test-0 are blocked by the NVMe reservation feature.

References

If your NVMe disks do not have sufficient space, see Expand disk volumes.