
Container Service for Kubernetes:Enable the auto recovery feature for FUSE mount targets

Last Updated:Feb 28, 2026

During the lifecycle of an application pod, the Filesystem in Userspace (FUSE) daemon may crash. As a result, the application pod can no longer use the FUSE file system to access data. This topic describes how to enable the auto recovery feature for the mount targets of a FUSE file system to restore access to application data without restarting the application pods.

Overview

Problem

Application pods that use Fluid datasets access data in a distributed cache system via the FUSE file system. Each FUSE file system corresponds to a FUSE daemon process, which handles the file access requests sent to the FUSE file system.

During the lifecycle of an application pod, the FUSE daemon may crash. For example, the daemon is killed when its memory usage exceeds the upper limit. As a result, the "Transport endpoint is not connected" error is reported when the application pod accesses files in the FUSE file system. Without auto recovery, you must manually restart or rebuild the application pod to restore access to the FUSE file system.

How it works

Fluid provides the auto recovery feature for the mount targets of a FUSE file system. By periodically querying the status of the FUSE file system mounted to each application pod on a node, Fluid detects when a FUSE daemon has crashed. After the FUSE daemon container is automatically restarted by Kubernetes, Fluid detects the recovery and remounts the file system, which restores data access for the application pods without requiring you to restart or rebuild them.
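The detection step can be sketched in a few lines of shell. This is an illustrative sketch only, not Fluid's actual implementation: a broken FUSE mount typically fails a stat call with the "Transport endpoint is not connected" error, while a healthy mount succeeds.

```shell
# Illustrative sketch of the detection step (not Fluid's actual code):
# a broken FUSE mount fails stat; a healthy mount succeeds.
check_mount() {
  local path="$1"
  if stat "$path" >/dev/null 2>&1; then
    echo "healthy"
  else
    # A real recovery agent would trigger a remount here.
    echo "broken"
  fi
}

check_mount /tmp                 # an accessible path prints "healthy"
check_mount /no-such-mount-path  # an inaccessible path prints "broken"
```

Fluid's real agent additionally tracks which application pods reference each mount so that the remount is propagated into the correct pod sandboxes.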

ACK cluster vs. serverless: key differences

The auto recovery feature is available in both ACK Pro clusters and ACK Serverless Pro clusters. The following table summarizes the key differences between the two approaches.

| Aspect | ACK Pro cluster | ACK Serverless Pro cluster |
| --- | --- | --- |
| Enable mechanism | Set the FuseRecovery=true feature gate on the CSI DaemonSet | Set the alibabacloud.com/fuse-recover-policy: auto annotation on each pod |
| Pod configuration | Label: fuse.serverful.fluid.io/inject: "true" | Label: alibabacloud.com/fluid-sidecar-target: eci + annotations |
| Expected pod READY status | 1/1 | 2/2 |
| FUSE crash simulation target | Separate FUSE pod on the same node | Sidecar container fluid-fuse-0 within the same pod |
| Additional prerequisites | Virtual nodes deployed in the cluster | None |

Limits

  • The auto recovery process involves a delay and is not seamless for business applications. Applications must tolerate data access failures and keep retrying until data access is restored.

  • You can enable auto recovery only for read-only datasets. If the cluster contains a dataset that can be read and written, make sure that this feature is disabled to prevent data from being unexpectedly written to the dataset.

  • This feature does not support mounting the persistent volume claims (PVCs) of datasets to application pods in subPath mode.

  • Auto recovery can start only after Kubernetes restarts the crashed FUSE daemon container. When the FUSE daemon crashes frequently, Kubernetes exponentially increases the interval between container restarts (CrashLoopBackOff), which prolongs the auto recovery process.
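You can gauge how deep into the restart backoff a crashing FUSE daemon is by checking its restart count. The label selectors below are the same ones used in the crash-simulation step later in this topic; adapt them to your dataset name. (This requires a live cluster, so no sample output is shown.)

```shell
# List the FUSE pods for the demo-dataset and show their restart counts.
# A rapidly growing count indicates the daemon is in CrashLoopBackOff.
kubectl get pod --selector role=jindofs-fuse,release=demo-dataset \
  -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount
```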

Common prerequisites

The following prerequisites apply to both the ACK Pro cluster and ACK Serverless Pro cluster scenarios.

Cluster requirements

Important

The ack-fluid component is not currently supported on ContainerOS.

Component requirements

  • The cloud-native AI suite is installed and the ack-fluid component is deployed.

Important

If you have already installed open source Fluid, uninstall Fluid and deploy the ack-fluid component.

  • If you have not installed the cloud-native AI suite, enable Fluid acceleration when you install the suite. For more information, see Install the cloud-native AI suite.

  • If you have already installed the cloud-native AI suite, go to the Cloud-native AI Suite page of the ACK console and deploy the ack-fluid component.

Infrastructure requirements

Additional prerequisites for ACK Pro clusters

Create a Fluid dataset

This section describes how to create a Fluid dataset backed by OSS. These steps are shared by both the ACK Pro cluster and ACK Serverless Pro cluster scenarios.

In this example, JindoFS is deployed to accelerate access to OSS.

  1. Create a file named secret.yaml and copy the following content to the file. In the file, fs.oss.accessKeyId and fs.oss.accessKeySecret specify the AccessKey ID and AccessKey secret used to access OSS.

    apiVersion: v1
    kind: Secret
    metadata:
      name: mysecret
    stringData:
      fs.oss.accessKeyId: <YOUR_ACCESS_KEY_ID>
      fs.oss.accessKeySecret: <YOUR_ACCESS_KEY_SECRET>
  2. Run the following command to create a Secret:

    kubectl create -f secret.yaml
  3. Create a file named dataset.yaml and copy the following content to the file. The following table describes the parameters.

    | Parameter | Description |
    | --- | --- |
    | mountPoint | The path of the UFS to mount, in the format oss://<oss_bucket>/<bucket_dir>. The endpoint is not required in the path. |
    | fs.oss.endpoint | The public or private endpoint of the OSS bucket. For more information, see OSS regions and endpoints. |
    | replicas | The number of workers in the JindoFS cluster. |
    | mediumtype | The cache type. When you create the JindoRuntime template, JindoFS supports only one of the following cache types: HDD, SSD, and MEM. |
    | path | The storage path. You can specify only one path. If you set mediumtype to MEM, you must specify an on-premises storage path to store data such as logs. |
    | quota | The maximum size of cached data. Unit: GB. |
    | high | The upper limit of the storage capacity. |
    | low | The lower limit of the storage capacity. |

    apiVersion: data.fluid.io/v1alpha1
    kind: Dataset
    metadata:
      name: demo-dataset
    spec:
      mounts:
        - mountPoint: oss://<oss_bucket>/<bucket_dir>
          options:
            fs.oss.endpoint: <oss_endpoint>
          name: mybucket
          path: "/"
          encryptOptions:
            - name: fs.oss.accessKeyId
              valueFrom:
                secretKeyRef:
                  name: mysecret
                  key: fs.oss.accessKeyId
            - name: fs.oss.accessKeySecret
              valueFrom:
                secretKeyRef:
                  name: mysecret
                  key: fs.oss.accessKeySecret
    ---
    apiVersion: data.fluid.io/v1alpha1
    kind: JindoRuntime
    metadata:
      name: demo-dataset
    spec:
      replicas: 2
      tieredstore:
        levels:
          - mediumtype: MEM
            path: /dev/shm
            volumeType: emptyDir
            quota: 2Gi
            high: "0.99"
            low: "0.95"
  4. Run the following command to create a Dataset object and a JindoRuntime object:

    kubectl create -f dataset.yaml
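Before mounting the dataset, you can check that it has finished setting up. A ready dataset reports a Bound phase; the exact column output may vary by Fluid version, so it is omitted here.

```shell
# Check that the Dataset is bound to the JindoRuntime before mounting it.
kubectl get dataset demo-dataset
kubectl get jindoruntime demo-dataset
```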

Enable auto recovery for FUSE mount targets in an ACK cluster

This section describes how to enable and verify the FUSE auto recovery feature in an ACK Pro cluster.

Step 1: Enable auto recovery for FUSE mount targets

Run the following command to enable auto recovery for FUSE mount targets:

kubectl get ds -n fluid-system csi-nodeplugin-fluid -oyaml | sed 's/FuseRecovery=false/FuseRecovery=true/g' | kubectl apply -f -

Expected output:

daemonset.apps/csi-nodeplugin-fluid configured

Run the following command to check whether auto recovery is enabled for FUSE mount targets:

kubectl get ds -n fluid-system csi-nodeplugin-fluid -oyaml | grep '\- \-\-feature-gates='

If the following output is returned, auto recovery is enabled for FUSE mount targets:

- --feature-gates=FuseRecovery=true

Step 2: Create an application pod and mount the Fluid dataset

In this example, a Fluid dataset is mounted to an NGINX pod and the pod is used to access the data in the dataset.

  1. Create a file named app.yaml and copy the following content to the file. The fuse.serverful.fluid.io/inject=true label enables auto recovery for the FUSE mount target of the pod.

    apiVersion: v1
    kind: Pod
    metadata:
      name: demo-app
      labels:
        fuse.serverful.fluid.io/inject: "true"
    spec:
      containers:
        - name: demo
          image: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/openanolis/nginx:1.14.1-8.6
          volumeMounts:
            - mountPath: /data
              name: data-vol
      volumes:
        - name: data-vol
          persistentVolumeClaim:
            claimName: demo-dataset   # The value of this parameter must be the same as the name of the Dataset.
  2. Run the following command to create an application pod:

    kubectl create -f app.yaml
  3. Run the following command to view the status of the application pod. If the STATUS field of the pod is Running, the application pod started successfully.

    kubectl get pod demo-app
    NAME       READY   STATUS    RESTARTS   AGE
    demo-app   1/1     Running   0          16s
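Once the application pod is running, a FUSE pod should be serving its mount on the same node. You can confirm this with the same node lookup and label selectors used in the crash-simulation step below.

```shell
# Find the FUSE pod that serves demo-app on the same node.
node=$(kubectl get pod demo-app -ojsonpath='{.spec.nodeName}')
kubectl get pod --field-selector spec.nodeName=$node \
  --selector role=jindofs-fuse,release=demo-dataset
```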

Step 3: Verify auto recovery for the FUSE mount target

  1. Run the following command to log on to the application pod and run a script that periodically accesses file metadata. The script lists the files in the mounted Fluid dataset every second.

    kubectl exec -it demo-app -- bash -c 'while true; do ls -l /data; sleep 1; done'
  2. Keep the preceding script running in the background and run the following command to simulate a crash in the FUSE component:

    # Obtain the node where the demo-app pod runs.
    demo_pod_node_name=$(kubectl get pod demo-app -ojsonpath='{.spec.nodeName}')
    # Obtain the name of the FUSE pod on the same node as demo-app.
    fuse_pod_name=$(kubectl get pod --field-selector spec.nodeName=$demo_pod_node_name --selector role=jindofs-fuse,release=demo-dataset -oname)
    # Simulate a crash by killing the FUSE daemon process (PID 1) in the FUSE pod.
    kubectl exec -it $fuse_pod_name -- bash -c 'kill 1'
  3. View the output of the script that is run in demo-app. If the following output is returned, the FUSE mount point is recovered successfully.

    ...
    total 172
    -rwxrwxr-x 1 root root          18 Jul  1 15:17 myfile
    -rwxrwxr-x 1 root root         154 Jul  1 17:06 myfile.txt
    total 172
    -rwxrwxr-x 1 root root          18 Jul  1 15:17 myfile
    -rwxrwxr-x 1 root root         154 Jul  1 17:06 myfile.txt
    ls: cannot access '/data/': Transport endpoint is not connected
    ls: cannot access '/data/': Transport endpoint is not connected
    ls: cannot access '/data/': Transport endpoint is not connected
    ls: cannot access '/data/': Transport endpoint is not connected
    ls: cannot access '/data/': Transport endpoint is not connected
    ls: cannot access '/data/': Transport endpoint is not connected
    ls: cannot access '/data/': Transport endpoint is not connected
    ls: cannot access '/data/': Transport endpoint is not connected
    total 172
    -rwxrwxr-x 1 root root          18 Jul  1 15:17 myfile
    -rwxrwxr-x 1 root root         154 Jul  1 17:06 myfile.txt
    total 172
    -rwxrwxr-x 1 root root          18 Jul  1 15:17 myfile
    -rwxrwxr-x 1 root root         154 Jul  1 17:06 myfile.txt
    ...
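After recovery, the RESTARTS counter of the FUSE pod should have increased by one, which confirms that Kubernetes restarted the daemon container and Fluid remounted the file system.

```shell
# The FUSE pod should show an incremented RESTARTS count after the simulated crash.
kubectl get pod --selector role=jindofs-fuse,release=demo-dataset
```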

Enable auto recovery for FUSE mount targets in a serverless environment

This section describes how to enable and verify the FUSE auto recovery feature in an ACK Serverless Pro cluster. In a serverless environment, you do not need to set a feature gate. Instead, you enable auto recovery through a pod annotation.

Make sure that you have created an ACK Serverless Pro cluster that runs an operating system other than ContainerOS and that the cluster version is 1.18 or later. For more information, see Create an ACK Serverless cluster.

Step 1: Create an application pod and mount the Fluid dataset

In this example, a Fluid dataset is mounted to an NGINX pod and the pod is used to access the data in the dataset.

  1. Create a file named app.yaml and copy the following content to the file. The alibabacloud.com/fuse-recover-policy: auto annotation enables auto recovery for the FUSE mount target of the pod. This annotation takes effect only on application pods that run in the serverless environment.

    apiVersion: v1
    kind: Pod
    metadata:
      name: demo-app
      labels:
        alibabacloud.com/fluid-sidecar-target: eci
      annotations:
        # Schedule the pod only to elastic container instances (ECI).
        alibabacloud.com/burst-resource: eci_only
        # Enable auto recovery for FUSE mount targets
        alibabacloud.com/fuse-recover-policy: auto
    spec:
      containers:
        - name: demo
          image: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/openanolis/nginx:1.14.1-8.6
          volumeMounts:
            - mountPath: /data
              name: data-vol
      volumes:
        - name: data-vol
          persistentVolumeClaim:
            claimName: demo-dataset  # The value of this parameter must be the same as the name of the Dataset.
  2. Run the following command to create an application pod:

    kubectl create -f app.yaml
  3. Run the following command to view the status of the application pod. If the STATUS field of the pod is Running, the application pod started successfully.

    kubectl get pod demo-app
    NAME       READY   STATUS    RESTARTS   AGE
    demo-app   2/2     Running   0          110s
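The 2/2 READY count reflects the injected FUSE sidecar. You can list the container names to confirm that the fluid-fuse-0 sidecar runs alongside the demo container.

```shell
# List the containers in the pod; the FUSE sidecar is named fluid-fuse-0.
kubectl get pod demo-app -o jsonpath='{.spec.containers[*].name}{"\n"}'
```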

Step 2: Verify the auto recovery feature for the FUSE mount target

  1. Run the following command to log on to the application pod and run a script that periodically accesses file metadata. The script lists the files in the mounted Fluid dataset every second.

    kubectl exec -it demo-app -c demo -- bash -c 'while true; do ls -l /data; sleep 1; done'
  2. Keep the preceding script running in the background and run the following command to simulate a crash in the FUSE component:

    # Simulate a crash in the FUSE pod.
    kubectl exec -it demo-app -c fluid-fuse-0 -- bash -c 'kill 1'
  3. View the output of the script that is run in demo-app. If the following output is returned, the FUSE mount point is recovered successfully.

    total 172
    -rwxrwxr-x 1 root root          18 Jul  1 15:17 myfile
    -rwxrwxr-x 1 root root         154 Jul  1 17:06 myfile.txt
    total 172
    -rwxrwxr-x 1 root root          18 Jul  1 15:17 myfile
    -rwxrwxr-x 1 root root         154 Jul  1 17:06 myfile.txt
    ls: cannot access '/data/demo2': Transport endpoint is not connected
    ls: cannot access '/data/demo2': Transport endpoint is not connected
    ls: cannot access '/data/demo2': Transport endpoint is not connected
    ls: cannot access '/data/demo2': Transport endpoint is not connected
    total 172
    -rwxrwxr-x 1 root root          18 Jul  1 15:17 myfile
    -rwxrwxr-x 1 root root         154 Jul  1 17:06 myfile.txt
    total 172
    -rwxrwxr-x 1 root root          18 Jul  1 15:17 myfile
    -rwxrwxr-x 1 root root         154 Jul  1 17:06 myfile.txt