Container Service for Kubernetes: FAQ about OSS volumes

Last Updated: Nov 21, 2024

This topic provides answers to some frequently asked questions about Object Storage Service (OSS) volumes.

The questions are grouped into the following categories:

  • FAQ about mounting OSS volumes

  • FAQ about using OSS volumes

  • FAQ about unmounting OSS volumes, for example, what do I do if a pod remains in the Terminating state when I fail to unmount a statically provisioned OSS volume from the pod?

  • FAQ about detection failures in the ACK console

  • Others

Why does it take a long time to mount an OSS volume?

Issue

Mounting an OSS volume takes a long time.

Causes

If both of the following conditions are met, kubelet performs the chmod or chown operation when the volume is mounted, which increases the mount time.

  • The AccessModes parameter is set to ReadWriteOnce in the persistent volume (PV) and persistent volume claim (PVC) templates.

  • The securityContext.fsGroup parameter is set in the application template.

Solutions

  • If the securityContext.fsGroup parameter is set in the application template, delete the fsGroup parameter from the securityContext section.

  • If you want to configure the user ID (UID) and mode of the files in the mounted path, you can manually mount the OSS bucket to an Elastic Compute Service (ECS) instance. You can then perform the chown and chmod operations by using a CLI and provision the OSS volume by using the Container Storage Interface (CSI) plug-in. For more information about how to provision OSS volumes by using the CSI plug-in, see Mount a statically provisioned OSS volume.

  • Apart from the preceding methods, for clusters that run Kubernetes 1.20 or later, you can set the fsGroupChangePolicy parameter to OnRootMismatch. This way, the chmod or chown operation is performed only when the system launches the pod for the first time. Mounting the OSS volume is slow only during the first launch; subsequent mounts are not affected. For more information about fsGroupChangePolicy, see Set the security context for a pod or a container. A minimal example is shown after this list.

  • We recommend that you do not write data to PVCs of OSS volumes mounted by ossfs. These PVCs are read-only by default.
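
The following is a minimal pod sketch that sets fsGroupChangePolicy to OnRootMismatch. The pod name, image, mount path, and PVC name (oss-pvc) are placeholders for illustration, not values from this topic.

apiVersion: v1
kind: Pod
metadata:
  name: app-with-oss
spec:
  securityContext:
    fsGroup: 1000
    # Perform chown and chmod only when the root of the volume does not already match the fsGroup
    fsGroupChangePolicy: "OnRootMismatch"
  containers:
    - name: app
      image: nginx # placeholder image
      volumeMounts:
        - name: oss-volume
          mountPath: /data
  volumes:
    - name: oss-volume
      persistentVolumeClaim:
        claimName: oss-pvc # placeholder PVC that is bound to the OSS PV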

How do I manage the permissions related to OSS volume mounting?

The Permission Denied error is displayed in the following scenarios.

Scenario 1: The Permission Denied error occurs when you access the mount target

Causes

By default, OSS volumes are mounted by the root user of Linux, and the mount target has the 700 permission. When a container process accesses the OSS volume as a non-root user, an error is returned due to insufficient permissions.

Solutions

Add the following mount options to modify the permissions on the mount target.

  • allow_other: sets the 777 permission on the mount target.

  • mp_umask: sets the umask of the mount target. This parameter takes effect only when the allow_other parameter is set. The default value is 000. Examples (see the PV sketch after this list):

    • To set the 770 permission on the mount target, add -o allow_other -o mp_umask=007.

    • To set the 700 permission on the mount target, add -o allow_other -o mp_umask=077.
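
For example, a statically provisioned PV can pass these options through the otherOpts field. The following is a minimal sketch based on the PV examples in this topic; the bucket name, endpoint, Secret, capacity, and access mode are placeholders, and the options grant the 770 permission on the mount target.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: oss-pv
spec:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: oss-pv # Specify the name of the PV.
    nodePublishSecretRef:
      name: oss-secret # placeholder Secret that stores the AccessKey pair
      namespace: default
    volumeAttributes:
      bucket: "bucket-name"
      url: "oss-cn-hangzhou.aliyuncs.com"
      # allow_other and mp_umask=007 set the 770 permission on the mount target
      otherOpts: "-o allow_other -o mp_umask=007"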

Scenario 2: The Permission Denied error occurs when you use ossutil, the OSS console, or the SDK to upload files

Causes

By default, files uploaded by using methods other than ossfs (such as ossutil, the OSS console, or the SDK) have the 640 permission in ossfs. When a container process accesses such a file as a non-root user, an error is returned due to insufficient permissions.

Solutions

Run the chmod command as a root user to modify the permissions on the desired file. You can also add the following configurations to modify the permissions on the subPath and files in the mount target.

umask: the umask of the subPath and files in the mount target. You can set the umask parameter in the same way as mp_umask. The umask parameter does not rely on the allow_other parameter.

The umask parameter defines only the permissions on existing files in the current ossfs process. It does not take effect on remounted files or files in other ossfs processes. Examples:

  • After you set -o umask=022, run stat to view the permission on a file uploaded from the OSS console. The permission on the file is displayed as 755. After you delete the -o umask=022 setting and remount the volume, the permission on the file reverts to 640.

  • After you set -o umask=133, run the chmod command as a root user in the current container process to set the permission on a file to 777. When you stat the file, the permission is displayed as 644. After you delete the -o umask=133 setting and remount the volume, the permission on the file changes to 777.

Scenario 3: The system prompts insufficient permissions when other container processes read or write files created in ossfs

Causes

The default permission on regular files created in ossfs is 644. After you set the fsGroup field in securityContext or run the chmod or chown command on a file, the permission or owner of the file may be changed. When another user accesses the file through a container process, the system may prompt insufficient permissions.

Solutions

Run the stat command to check the permissions on the file. If the permissions are insufficient, run the chmod command as a root user to modify them.

The preceding solutions resolve the issue that the user of the current container process does not have sufficient permissions on a path or file. You can also change the owner of the subPath and files in the mount target of ossfs to resolve this issue.

If you specified a non-root user for the container process when you built the container image, or if you set the securityContext.runAsUser and securityContext.runAsGroup fields in the application deployment to a non-root user, the container process runs as a non-root user.

Add the following configurations to change the UID and GID of the subPath and files in the mount target of ossfs to those of the user that runs the container process.

  • uid: the UID of the owner of the subPath and files in the mount target.

  • gid: the GID of the owner of the subPath and files in the mount target.

For example, if the corresponding IDs of the container process are uid=1000(biodocker), gid=1001(biodocker), and groups=1001(biodocker), set -o uid=1000 and -o gid=1001.
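
A minimal sketch of the corresponding PV volumeAttributes fragment, following the fragment style used later in this topic. The bucket name is a placeholder, the IDs match the biodocker user in the preceding example, and allow_other is included because non-root access to the FUSE mount typically also requires it (see Scenario 1).

...
    volumeAttributes:
      bucket: bucket-name
      # Make uid=1000 and gid=1001 the owner of the subPath and files in the mount target
      otherOpts: "-o allow_other -o uid=1000 -o gid=1001"
      ...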

Scenario 4: The AccessKey pair stored in the Secret cannot be used to access files in the OSS volume after you set the nodePublishSecretRef field in the PV to reference the Secret. The original AccessKey pair is revoked due to AccessKey pair rotation, and the renewed AccessKey pair in the Secret does not take effect.

Causes

OSS volumes are FUSE file systems mounted by using ossfs. The AccessKey pair of an OSS volume cannot be renewed after the OSS volume is mounted. The application that uses the OSS volume uses only the original AccessKey pair to send requests to the OSS server.

Solutions

After the AccessKey pair in the Secret is renewed, you must remount the OSS volume. If you use a non-containerized ossfs version, or a containerized ossfs version that mounts OSS volumes in exclusive mode, restart the application pod to trigger an ossfs restart. For more information, see How do I restart the ossfs process when the OSS volume is shared by multiple pods?

Scenario 5: The Operation not permitted error occurs when you create a hard link

Causes

OSS volumes do not support hard links. In earlier CSI versions, the Operation not permitted error is returned when you create hard links.

Solutions

Avoid using hard links if your application uses OSS volumes. If hard links are mandatory, we recommend that you change the storage service.

Scenario 6: The system prompts insufficient read or write permissions when you use subPath or subPathExpr to mount an OSS volume

Causes

When you use subPath to mount an OSS volume, the mount target on the OSS server is the path defined in the PV (/path in the preceding example), not /path/subpath/in/oss/. The allow_other or mp_umask setting takes effect only on the /path path, so the default permission on files in the /path/subpath/in/oss/ subPath is still 640. As a result, a container process that runs as a non-root user does not have permissions on the files in the /path/subpath/in/oss/ path.

Solutions

Use the umask parameter to modify the default permission on the subPath. For example, add -o umask=000 to set the default permission to 777.
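
A minimal sketch of the PV fragment for this scenario, assuming the /path mount target from the example above (the bucket name is a placeholder):

...
    volumeAttributes:
      bucket: bucket-name
      path: /path
      # Relax the default 640 permission on files under the subPath to 777
      otherOpts: "-o umask=000"
      ...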

What do I do if I fail to mount a statically provisioned OSS volume?

Issue

You failed to mount a statically provisioned OSS volume. The pod cannot be started and a FailedMount event is generated.

Causes

  • Cause 1: In earlier ossfs versions, you cannot mount OSS buckets to paths that do not exist. If the mount target does not exist, a mounting failure occurs.

    Important

    subPaths displayed in the OSS console may not exist on the OSS server. Use ossutil or the OSS API to confirm the subPaths. For example, if you create the /a/b/c/ path in the console, the /a/b/c/ path object is created, but the /a/ and /a/b/ path objects do not exist. If you directly upload an object such as /a/b or /a/c, the object exists, but the /a/ path object does not.

  • Cause 2: The mount failed because the AccessKey pair or the RAM Roles for Service Accounts (RRSA) configuration uses incorrect role information or has insufficient permissions.

  • Cause 3: For CSI version 1.30.4 and later, the pod that runs ossfs is located in the ack-csi-fuse namespace. During the mounting process, CSI first launches the pod that runs ossfs, and then initializes the ossfs process in that pod through a Remote Procedure Call (RPC) request. If the event log contains the message FailedMount /run/fuse.ossfs/xxxxxx/mounter.sock: connect: no such file or directory, it indicates that the mounting failed because the pod that runs ossfs was not started properly or was deleted unexpectedly.

  • Cause 4: If the event contains the Failed to find executable /usr/local/bin/ossfs: No such file or directory message, the mounting failed because ossfs failed to be installed on the node.

  • Cause 5: If the event contains the error while loading shared libraries: xxxxx: cannot open shared object file: No such file or directory message, the mounting failed because ossfs runs on nodes in the current CSI version but some dynamic libraries required by ossfs are missing in the operating system. Possible causes:

    • Another ossfs version was manually installed on the node and the required operating system differs from the operating system of the node.

    • The default OpenSSL version is changed after the node operating system is updated, such as an update from Alibaba Cloud Linux 2 to Alibaba Cloud Linux 3.

    • If ossfs runs on nodes, only the following operating systems are supported: CentOS, Alibaba Cloud Linux, ContainerOS, and Anolis OS.

    • Dynamic libraries required by ossfs, such as FUSE, cURL, and xml2, are deleted from the node that runs the required operating system, or the default OpenSSL version is changed.

  • Cause 6: A mirroring-based back-to-origin rule is configured for the bucket but the mount path is not synchronized from the origin.

  • Cause 7: Static website hosting is configured for the bucket. When ossfs checks the mount target on the OSS server, the index.html file is returned.

Solutions

  • Solution to cause 1:

    Check whether the subPath exists on the OSS server.

    Assume that the mount target of the PV is sub/path/. You can run the ossutil stat command (query bucket and object information) to query the object whose name is sub/path/, or call the HeadObject API operation to query the object whose key is sub/path/. If 404 is returned, the subPath does not exist on the OSS server.

    1. You can use ossutil, the OSS SDK, or the OSS console to create the missing bucket or subPath and mount the bucket again.

    2. In ossfs 1.91 and later, you can specify a mount target that does not exist. Therefore, you can also update ossfs to resolve this issue. For more information, see New features of ossfs 1.91 and later and stress tests.

  • Solution to cause 2:

    • Confirm that the policy permissions for the Resource Access Management (RAM) user or RAM role used for mounting are granted with the permissions listed in Step 2: Grant permissions to the demo-role-for-rrsa role.

    • Verify the file system permissions for the root directory and subPath of the mount target. For more information, see Scenario 1 and Scenario 6 in How do I manage the permissions related to OSS volume mounting?

    • For volumes mounted using AccessKey authentication as a RAM user, confirm that the AccessKey used during mounting is neither disabled nor rotated. For more information, see Scenario 4 in How do I manage the permissions related to OSS volume mounting?

    • For volumes mounted using RRSA authentication, confirm that the correct trust policy is configured for the RAM role. For more information about how to configure the trust policy, see (Optional) Step 1: Create a RAM role. By default, the trusted service account is csi-fuse-ossfs in the ack-csi-fuse namespace, rather than the service account used by the service.

    • Note

      The RRSA feature supports only ACK clusters that run Kubernetes 1.26 and later. ACK clusters that support the RRSA feature include ACK Basic clusters, ACK Pro clusters, ACK Serverless Basic clusters, and ACK Serverless Pro clusters. The version of the CSI component used by the cluster must be 1.30.4 or later. If you used the RRSA feature prior to version 1.30.4, you must attach policies to the RAM role. For more information, see [Product Changes] ossfs version upgrade and mounting process optimization in CSI.

  • Solution to cause 3:

    1. Run the following command to confirm that the pod running ossfs exists. Replace PV_NAME with the name of the OSS PV to be mounted, and NODE_NAME with the name of the node where the pod that requires the volume to be mounted resides.

      kubectl -n ack-csi-fuse get pod -l csi.alibabacloud.com/volume-id=<PV_NAME> -owide | grep <NODE_NAME>

      If the pod exists but is in an abnormal state, troubleshoot and make sure that the pod is in the Running state before restarting it to trigger a remount. If the pod does not exist, follow the subsequent steps to troubleshoot.

    2. (Optional) Verify if the pod was accidentally deleted by reviewing audit logs and other relevant sources. Common causes for accidental deletion include script cleanup, node draining, and node auto repair. We recommend that you make appropriate adjustments to prevent this issue from recurring.

    3. Make sure that both csi-provisioner and csi-plugin are updated to version 1.30.4 or later. Then, restart the pod to trigger a remount and verify that the pod running ossfs is created through a proper process.

  • Solution to cause 4:

    1. We recommend that you update csi-plugin to v1.26.2 or later. The issue that the ossfs installation fails during the initialization of newly added nodes is fixed in these versions.

    2. Run the following command to restart csi-plugin on the node and check whether the csi-plugin pod runs as normal. In the following code, csi-plugin-**** specifies the pod of csi-plugin.

      kubectl -n kube-system delete pod csi-plugin-****
    3. If the issue persists after you update or restart the component, log on to the node and run the following command:

      ls /etc/csi-tool

      Expected output:

      ... ossfs_<ossfsVer>_<ossfsArch>_x86_64.rpm ...
      • If the output contains the ossfs RPM package, run the following command to install ossfs, and then check whether the csi-plugin pod runs as normal.

        rpm -i /etc/csi-tool/ossfs_<ossfsVer>_<ossfsArch>_x86_64.rpm
      • If the output does not contain the ossfs RPM package, submit a ticket.

  • Solution to cause 5:

    • If you have manually installed ossfs, check whether the required operating system is the same as the operating system of the node.

    • If you have updated the operating system of the node, run the following command to restart csi-plugin, update ossfs, and remount the OSS volume.

      kubectl -n kube-system delete pod -l app=csi-plugin
    • We recommend that you update CSI to 1.28 or later. In these versions, ossfs runs in containers. Therefore, it does not have requirements on the node operating system.

    • If you cannot update CSI, install the required operating system or manually install the missing dynamic libraries. In the following example, the node runs Ubuntu:

      • Run the which command to query the installation path of ossfs. The default path is /usr/local/bin/ossfs.

        which ossfs
      • Run the ldd command to query the missing dynamic libraries required by ossfs.

        ldd /usr/local/bin/ossfs
      • Run the apt-file command to query the package of the missing dynamic libraries (such as libcrypto.so.10).

        apt-get install apt-file
        apt-file update
        apt-file search libcrypto.so.10
      • Run the apt-get command to install the package, such as libssl1.0.0.

        apt-get install libssl1.0.0
  • Solution to cause 6:

    Synchronize data from the origin before you mount the OSS volume. For more information, see Overview.

  • Solution to cause 7:

    Disable static website hosting or modify the configuration and try again. For more information, see Overview.

What do I do if I fail to access a statically provisioned OSS volume?

Issue

You failed to access a statically provisioned OSS volume.

Causes

You did not specify an AccessKey pair when you mounted the statically provisioned OSS volume.

Solutions

Specify an AccessKey pair in the configurations of the statically provisioned OSS volume. For more information, see Mount a statically provisioned OSS volume.
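
The following is a minimal sketch of a Secret and a PV that references it through nodePublishSecretRef, based on the PV examples in this topic. The Secret key names (akId and akSecret) follow the convention described in Mount a statically provisioned OSS volume; confirm them against that topic. The bucket, endpoint, and credential values are placeholders.

apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
  namespace: default
stringData:
  akId: "<your AccessKey ID>"         # placeholder
  akSecret: "<your AccessKey secret>" # placeholder
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: oss-pv
spec:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: oss-pv # Specify the name of the PV.
    nodePublishSecretRef:
      name: oss-secret     # the Secret that stores the AccessKey pair
      namespace: default
    volumeAttributes:
      bucket: "bucket-name"
      url: "oss-cn-hangzhou.aliyuncs.com"
      otherOpts: "-o max_stat_cache_size=0 -o allow_other"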

What do I do if the read speed of a statically provisioned OSS volume is slow?

Issue

The read speed of a statically provisioned OSS volume is slow.

Causes

  • Cause 1: OSS does not limit the number of objects. However, when the number of objects exceeds 1000, FUSE may access an excessive amount of metadata. Consequently, access to OSS buckets becomes slow.

  • Cause 2: Versioning is enabled for the bucket and large numbers of delete markers are generated, which degrades the performance of listObjectsV1.

  • Cause 3: The storage class of the bucket is not Standard. Access to buckets of other storage classes is slow.

Solutions

Solution to cause 1:

When you mount an OSS volume to a container, we recommend that you set the access mode of the OSS volume to read-only. If an OSS bucket stores a large number of files, we recommend that you use the OSS SDK or CLI to access the files in the bucket, instead of accessing the files by using a file system. For more information, see SDK demos overview.

Solution to cause 2:

  1. Update CSI plugin to v1.26.6 or later. In these versions, ossfs supports access to buckets by using listObjectsV2.

  2. Add -o listobjectsv2 to the otherOpts field of the PV corresponding to the statically provisioned OSS volume, as shown in the following example.
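
A minimal sketch of the PV volumeAttributes fragment, following the fragment style used later in this topic (the bucket name is a placeholder):

...
    volumeAttributes:
      bucket: bucket-name
      # List objects by using the listObjectsV2 API
      otherOpts: "-o listobjectsv2"
      ...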

Solution to cause 3:

Change the storage class or restore objects.

Why is 0 displayed for the size of a file in the OSS console after I write data to the file?

Issue

After you write data to an OSS volume mounted to a container, the size of the file displayed in the OSS console is 0.

Causes

The OSS bucket is mounted as a FUSE file system by using ossfs. In this case, a file is uploaded to the OSS server only after the close or flush operation is performed on the file.

Solutions

Run the lsof command with the name of the file to check whether the file is being used by processes. If the file is being used by processes, terminate the processes to release the file descriptor (FD) of the file. For more information about the lsof command, see lsof.

Why is a path displayed as an object after I mount the path to a container?

Issue

After you mount a path to a container, the path is displayed as an object.

Causes

Cause 1: The content type of the path on the OSS server is not the default application/octet-stream type, such as text/html or image/jpeg, or the size of the path object is not 0. In this case, ossfs displays the path as an object based on the metadata of the path object.

Cause 2: The metadata x-oss-meta-mode is missing in the path object.

Solutions

Solution to cause 1:

Use HeadObject or stat (query bucket and object information) to obtain the metadata of the path object. The path object must end with a forward slash (/), such as a/b/. Sample API response:

{
  "server": "AliyunOSS",
  "date": "Wed, 06 Mar 2024 02:48:16 GMT",
  "content-type": "application/octet-stream",
  "content-length": "0",
  "connection": "keep-alive",
  "x-oss-request-id": "65E7D970946A0030334xxxxx",
  "accept-ranges": "bytes",
  "etag": "\"D41D8CD98F00B204E9800998ECFxxxxx\"",
  "last-modified": "Wed, 06 Mar 2024 02:39:19 GMT",
  "x-oss-object-type": "Normal",
  "x-oss-hash-crc6xxxxx": "0",
  "x-oss-storage-class": "Standard",
  "content-md5": "1B2M2Y8AsgTpgAmY7Phxxxxx",
  "x-oss-server-time": "17"
}

In the preceding sample response:

  • content-type: The content type of the path object is application/octet-stream.

  • content-length: The size of the path object is 0.

If the preceding conditions are not met, perform the following steps:

  1. Use GetObject or ossutil to obtain the object and confirm the metadata. If the metadata of the object meets the requirement or you cannot confirm the metadata, we recommend that you back up the object. For example, change the name of the object and upload it to OSS. For a xx/ path object, do not use xx as the object name.

  2. Use DeleteObject or rm to delete the original path object and check whether ossfs displays the path object as normal.

Solution to cause 2:

If the issue persists after you perform the steps in the solution to cause 1, add -o complement_stat to the otherOpts field of the PV corresponding to the statically provisioned OSS volume when you mount the OSS volume.

Note

In CSI plugin v1.26.6 and later versions, this feature is enabled by default. You can update CSI plugin to v1.26.6 or later, and then restart the application pod and remount the OSS volume to resolve the issue.

What do I do if the OSS server identifies unexpected large numbers of requests?

Issue

When you mount an OSS volume to a container, the OSS server identifies unexpected large numbers of requests.

Causes

When ossfs mounts an OSS bucket, a mount target is generated on the node. When other processes on the ECS node scan the mount target, requests are sent to the OSS server. Fees are incurred if the number of requests exceeds the upper limit.

Solutions

Use auditd to track the processes that generate the requests and fix the issue. You can perform the following operations on the node.

  1. Run the following command to install and launch auditd:

    sudo yum install auditd
    sudo service auditd start
  2. Monitor the ossfs mount targets.

    • Run the following command to monitor all mount targets:

      for i in $(mount | grep -i ossfs | awk '{print $3}');do auditctl -w ${i};done
    • Run the following command to monitor the mount target of a PV: Replace <pv-name> with the name of the PV.

      for i in $(mount | grep -i ossfs | grep -i <pv-name> | awk '{print $3}');do auditctl -w ${i};done
  3. Run the following command to print the audit log to view the processes that access the mount target in the OSS bucket:

    ausearch -i 

    In the following sample audit log, the log data separated by the --- delimiter records an operation performed on the mount target that is monitored. The log data indicates that the updatedb process performs the open operation on the mount target. The PID is 1636611.

    ---
    type=PROCTITLE msg=audit (September 22, 2023 15:09:26.244:291) : proctitle=updatedb
    type=PATH msg=audit (September 22, 2023 15:09:26.244:291) : item=0 name=. inode=14 dev=00:153 mode=dir,755 ouid=root ogid=root rdev=00:00 nametype=NORMAL cap_fp=none cap_fi=none cap_fe=0 cap_fver=0
    type=CWD msg=audit (September 22, 2023 15:09:26.244:291) : cwd=/subdir1/subdir2
    type=SYSCALL msg=audit (September 22, 2023 15:09:26.244:291) : arch=x86_64 syscall=open success=yes exit=9 a0=0x55f9f59da74e a1=O_RDONLY|O_DIRECTORY|O_NOATIME a2=0x7fff78c34f40 a3=0x0 items=1 ppid=1581119 pid=1636611 auid=root uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=pts1 ses=1355 comm=updatedb exe=/usr/bin/updatedb key=(null)
    ---
  4. Check whether requests are sent from non-business processes and fix the issue.

    For example, the audit log indicates that the updatedb process scans all mount targets. In this case, modify /etc/updatedb.conf so that updatedb skips the ossfs mount targets (a sample configuration is shown after the following steps):

    1. Set PRUNEFS= to include fuse.ossfs.

    2. Set PRUNEPATHS= to include the mount target.
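
    For example, /etc/updatedb.conf may look like the following sketch. Keep any existing entries in these settings and append the new values; the mount target path shown here (paths under /var/lib/kubelet) is a placeholder.

      PRUNEFS = "fuse.ossfs"
      PRUNEPATHS = "/var/lib/kubelet"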

What do I do if the content type of the metadata of an object in an OSS volume is application/octet-stream?

Issue

The content type of the metadata of an object in an OSS volume is application/octet-stream. Consequently, the browser or other clients cannot recognize or process the object.

Causes

  • By default, ossfs uploads objects with the binary stream (application/octet-stream) content type.

  • After you modify the /etc/mime.types file to specify a content type, the modification does not take effect.

Solutions

  1. CSI 1.26.6 and 1.28.1 have compatibility issues in the content type setting. If you use the preceding versions, update CSI to the latest version. For more information, see [Component Notice] Incompatible configurations in csi-plugin 1.26.6 and 1.28.1, and csi-provisioner 1.26.6 and 1.28.1.

  2. If you have used mailcap or mime-support to generate the /etc/mime.types file on the node and specified the content type, update CSI and remount the OSS volume.

  3. If no content type is specified, specify a content type in the following ways:

    • Node-level setting: Generate a /etc/mime.types file on the node. The content type takes effect on all OSS volumes mounted to the node. For more information, see FAQ

    • Cluster-level setting: The content type takes effect on all newly mounted OSS volumes in the cluster. Make sure that the content of the /etc/mime.types file is the same as the default content generated by mailcap.

      1. Run the following command to check whether the csi-plugin configuration file exists.

        kubectl -n kube-system get cm csi-plugin

        If the file does not exist, create a csi-plugin ConfigMap with the same name based on the following content. If the ConfigMap already exists, add mime-support=true to the data.fuse-ossfs field of the ConfigMap.

        apiVersion: v1
        kind: ConfigMap
        metadata:
          name: csi-plugin
          namespace: kube-system
        data:
          fuse-ossfs: |
            mime-support=true
      2. Restart csi-plugin for the modification to take effect. The restart does not affect the volumes that are already mounted.

        kubectl -n kube-system delete pod -l app=csi-plugin
  4. Remount the desired OSS volume.

How do I use the specified ARNs or ServiceAccount in RRSA authentication?

By default, you cannot use the Alibaba Cloud Resource Names (ARNs) of third-party OpenID Connect identity providers (OIDC IdPs) or ServiceAccounts other than the default one when you use RRSA authentication for OSS volumes.

To enable CSI to obtain the default role ARN and OIDC IdP ARN, set the roleName parameter in the PV to the desired RAM role. To customize RRSA authentication, modify the PV configuration as follows:

Note

Configure both roleArn and oidcProviderArn. You do not need to set roleName after you configure the preceding parameters.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-oss
spec:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: pv-oss # Specify the name of the PV. 
    volumeAttributes:
      bucket: "oss"
      url: "oss-cn-hangzhou.aliyuncs.com"
      otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
      authType: "rrsa"
      oidcProviderArn: "<oidc-provider-arn>"
      roleArn: "<role-arn>"
      #roleName: "<role-name>" #The roleName parameter becomes invalid after roleArn and oidcProviderArn are configured. 
      serviceAccountName: "csi-fuse-<service-account-name>" 

  • oidcProviderArn: obtain the OIDC IdP ARN after the OIDC IdP is created. For more information, see Manage an OIDC IdP.

  • roleArn: obtain the role ARN after a RAM role whose trusted entity is the preceding OIDC IdP is created. For more information, see Step 2: Create a RAM role for the OIDC IdP in Alibaba Cloud.

  • serviceAccountName: optional. The ServiceAccount used by the ossfs pod. Make sure that the ServiceAccount is created. If you leave this parameter empty, the default ServiceAccount maintained by CSI is used.

    Important

    The name of the ServiceAccount must start with csi-fuse-.

What do I do if the "Operation not supported" or "Operation not permitted" error occurs when I create a hard link?

Issue

The Operation not supported or Operation not permitted error occurs when you create a hard link.

Causes

The Operation not supported error occurs because OSS volumes do not support hard links. In earlier CSI versions, the Operation not permitted error is returned when you create hard links.

Solutions

Avoid using hard links if your application uses OSS volumes. If hard links are mandatory, we recommend that you change the storage service.

What do I do if a pod remains in the Terminating state when I fail to unmount a statically provisioned OSS volume from the pod?

Issue

You failed to unmount a statically provisioned OSS volume from a pod and the pod remains in the Terminating state.

Causes

If a pod remains in the Terminating state when the system deletes the pod, check the kubelet log. Possible causes of OSS volume unmounting failures:

  • Cause 1: The mount target on the node is occupied. CSI cannot unmount the mount target.

  • Cause 2: The specified OSS bucket or path in the PV is deleted. The status of the mount target is unknown.

Solutions

Solution to cause 1:

  1. Run the following command in the cluster to query the pod UID.

    Replace <ns-name> and <pod-name> with the actual values.

    kubectl -n <ns-name> get pod <pod-name> -ogo-template --template='{{.metadata.uid}}'

    Expected output:

    5fe0408b-e34a-497f-a302-f77049****
  2. Log on to the node that hosts the pod in the Terminating state.

  3. Run the following command to check whether processes occupy the mount target.

    lsof /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<pv-name>/mount/

    If yes, confirm and terminate the processes.

Solution to cause 2:

  1. Log on to the OSS console.

  2. Check whether the OSS bucket or path is deleted. If you use subPath to mount the OSS volume, you also need to check whether the subPath is deleted.

  3. If the unmounting fails because the path is deleted, perform the following steps:

    1. Run the following command in the cluster to query the pod UID.

      Replace <ns-name> and <pod-name> with the actual values.

      kubectl -n <ns-name> get pod <pod-name> -ogo-template --template='{{.metadata.uid}}'

      Expected output:

      5fe0408b-e34a-497f-a302-f77049****
    2. Log on to the node that hosts the pod in the Terminating state and run the following command to query the mount target of the pod:

      mount | grep <pod-uid> | grep fuse.ossfs

      Expected output:

      ossfs on /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<pv-name>/mount type fuse.ossfs (ro,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)
      ossfs on /var/lib/kubelet/pods/<pod-uid>/volume-subpaths/<pv-name>/<container-name>/0 type fuse.ossfs (ro,relatime,user_id=0,group_id=0,allow_other)

      The path between ossfs on and type is the actual mount target on the node.

    3. Manually unmount the mount target.

      umount /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<pv-name>/mount
      umount /var/lib/kubelet/pods/<pod-uid>/volume-subpaths/<pv-name>/<container-name>/0
    4. Wait for the kubelet to retry, or forcibly delete the pod by using the --force option.

  4. If the issue persists, submit a ticket.

What do I do if the detection task in the ACK console becomes stuck for a long period of time, the detection task fails but no error message is displayed, or the system prompts "unknown error"?

Issue

The detection task becomes stuck for a long period of time, the detection task fails but no error message is displayed, or the system prompts "unknown error".

Causes

If the detection task becomes stuck for a long period of time, it is usually caused by a network issue. If it is caused by other unknown issues, print the log or use ossutil to manually locate the cause.

Solutions

You can use logs and ossutil to locate the cause.

Use logs to locate the cause

  1. Run the following command to find the pod that runs the detection task.

    • osssecret-namespace: the namespace of the Secret.

    • pv-name: the name of the PV.

    kubectl -n <osssecret-namespace> get pod | grep <pv-name>-check

    Expected output:

    <pv-name>-check-xxxxx
  2. Run the following command to locate the cause:

    kubectl -n <osssecret-namespace> logs -f <pv-name>-check-xxxxx

    Expected output:

    check ossutil
    endpoint: oss-<region-id>-internal.aliyuncs.com
    bucket: <bucket-name>
    path: <path>
    Error: oss: service returned error: StatusCode=403, ErrorCode=InvalidAccessKeyId, ErrorMessage="The OSS Access Key Id you provided does not exist in our records.", RequestId=65267325110A0C3130B7071C, Ec=0002-00000901, Bucket=<bucket-name>, Object=<path>

Use ossutil to locate the cause

If the pod that runs the detection task is already deleted, use ossutil to recreate the detection task and locate the cause.

stat (query bucket and object information) is used to detect OSS access. Install ossutil on any node and run the following command.

ossutil -e "<endpoint>" -i "<accessKeyID>" -k "<accessKeySecret>" stat oss://"<bucket><path>"

  • endpoint: set to oss-<region-id>-internal.aliyuncs.com if the internal endpoint of the bucket is used, or oss-<region-id>.aliyuncs.com if the external endpoint of the bucket is used.

  • accessKeyID: the AccessKey ID in the Secret.

  • accessKeySecret: the AccessKey secret in the Secret.

  • bucket: the name of the bucket.

  • path: the path. The path must end with a forward slash (/).

For example, if the volume uses the bucket cnfs-oss-xxx-xxx and the path /xx/, run the following command:

ossutil -e "oss-<region-id>-internal.aliyuncs.com" -i "<accessKeyID>" -k "<accessKeySecret>" stat oss://"cnfs-oss-xxx-xxx/xx/"

How do I handle the "connection timed out" network error?

Issue

The connection timed out error occurs.

Causes

Access to the OSS bucket times out. Possible causes:

  • If the bucket and cluster reside in different regions and the internal endpoint of the bucket is used, access to the bucket fails.

  • If the external endpoint of the bucket is used but the cluster does not have Internet access, access to the bucket fails.

Solutions

  • Recreate the PV and select the external endpoint of the bucket.

  • If the bucket and cluster reside in the same region, you can recreate the PV and use the internal endpoint. If not, check the security group and network configurations, fix the issue, and recreate the PV.

How do I handle the "StatusCode=403" permission error?

Issue

The service returned error: StatusCode=403 error occurs.

Causes

Your AccessKey pair does not have read permissions on the OSS bucket to be mounted.

  • The StatusCode=403, ErrorCode=AccessDenied, ErrorMessage="You do not have read acl permission on this object." error indicates that your AccessKey pair does not have the required permissions.

  • The StatusCode=403, ErrorCode=InvalidAccessKeyId, ErrorMessage="The OSS Access Key Id you provided does not exist in our records." error indicates that the AccessKey pair does not exist.

  • The StatusCode=403, ErrorCode=SignatureDoesNotMatch, ErrorMessage="The request signature we calculated does not match the signature you provided. Check your key and signing method." error indicates that the AccessKey pair may contain spelling errors.

Solutions

Make sure that the AccessKey pair exists, does not contain spelling errors, and has read permissions on the bucket.

What do I do if the bucket or path does not exist and the StatusCode=404 status code is returned?

Issue

The service returned error: StatusCode=404 error occurs.

Causes

You cannot mount statically provisioned OSS volumes to buckets or subPaths that do not exist. You must create the buckets or subPaths in advance.

  • The StatusCode=404, ErrorCode=NoSuchBucket, ErrorMessage="The specified bucket does not exist." error indicates that the bucket does not exist.

  • The StatusCode=404, ErrorCode=NoSuchKey, ErrorMessage="The specified key does not exist." error indicates that the subPath object does not exist.

    Important

    subPaths displayed in the OSS console may not exist on the OSS server. Use ossutil or the OSS API to confirm the subPaths. For example, if you create the /a/b/c/ path in the console, the /a/b/c/ path object is created, but the /a/ and /a/b/ path objects do not exist. If you directly upload an object such as /a/b or /a/c, the object exists, but the /a/ path object does not.

Solutions

Use ossutil, the SDK, or the OSS console to create the missing bucket or subPath, and then recreate the PV.

What do I do if other OSS status codes or error codes are returned?

Issue

The service returned error: StatusCode=xxx error occurs.

Causes

If an error occurs when you access OSS, OSS returns the status code, error code, and error message for troubleshooting.

Solutions

If OSS returns other status codes or error codes, see HTTP status code.

How do I launch ossfs in exclusive mode after ossfs is containerized?

Issue

Pods that use the same OSS volume on a node share the mount target.

Causes

Before ossfs is containerized, OSS volumes are mounted in exclusive mode by default. An ossfs process is launched for each pod that uses an OSS volume. Different ossfs processes have different mount targets. Therefore, pods that use the same OSS volume do not affect each other when they read or write data.

After ossfs is containerized, the ossfs process runs in the csi-fuse-ossfs-* pod in the kube-system or ack-csi-fuse namespace. In scenarios where an OSS volume is mounted to multiple pods, the exclusive mode launches large numbers of ossfs pods. As a result, elastic network interfaces (ENIs) may become insufficient. Therefore, after ossfs is containerized, the shared mode is used to mount OSS volumes. This allows pods that use the same OSS volume on a node to share the same mount target, which means that only one ossfs process, running in a single csi-fuse-ossfs-* pod, is launched.

Solutions

Important

In CSI 1.30.4 and later, the exclusive mode is no longer supported. If you need to restart or modify the configuration of ossfs, see How do I restart the ossfs process when the OSS volume is shared by multiple pods? If you have any other requirements for the exclusive mode with ossfs, Submit a ticket.

To use the exclusive mode after ossfs is containerized (in CSI versions earlier than 1.30.4), add the useSharedPath parameter and set it to "false" when you create the OSS volume. Example:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: oss-pv
spec:
  accessModes:
  - ReadOnlyMany
  capacity:
    storage: 5Gi
  csi:
    driver: ossplugin.csi.alibabacloud.com
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: bucket-name
      otherOpts: -o max_stat_cache_size=0 -o allow_other
      url: oss-cn-zhangjiakou.aliyuncs.com
      useSharedPath: "false" 
    volumeHandle: oss-pv
  persistentVolumeReclaimPolicy: Delete
  volumeMode: Filesystem

How do I restart the ossfs process when the OSS volume is shared by multiple pods?

Issue

After you modify the authentication information or ossfs version, the running ossfs processes cannot automatically update the information.

Causes

  • ossfs cannot automatically update the authentication configuration. To update the modified authentication configuration, you must restart the ossfs process (the csi-fuse-ossfs-* pod in the kube-system or ack-csi-fuse namespace after ossfs is containerized) and the application pod. This causes business interruptions. Therefore, CSI does not restart running ossfs processes to update configurations by default. You need to manually configure ossfs to remount the OSS volume.

  • Normally, the deployment and removal of ossfs are handled by CSI. Manually deleting the pod running the ossfs process does not trigger the CSI deployment process.

Solutions

Important

To restart the ossfs process, you need to restart the application pod that mounts the corresponding OSS volume. Proceed with caution.

If the CSI version you use is not containerized or the exclusive mode is used to mount OSS volumes, you can directly restart the application pod. In containerized CSI versions, the shared mode is used to mount OSS volumes by default. This means that pods using the same OSS volume on a node share the same ossfs process.

  1. Confirm the application pods that use the FUSE pod.

    1. Run the following command to confirm the csi-fuse-ossfs-* pod.

      Replace <pv-name> with the PV name and <node-name> with the node name.

      Use the following command if the CSI version is earlier than 1.30.4:

      kubectl -n kube-system get pod -lcsi.alibabacloud.com/volume-id=<pv-name> -owide | grep <node-name>

      Use the following command if the CSI version is 1.30.4 or later:

      kubectl -n ack-csi-fuse get pod -lcsi.alibabacloud.com/volume-id=<pv-name> -owide | grep <node-name>

      Expected output:

      csi-fuse-ossfs-xxxx   1/1     Running   0          10d     192.168.128.244   cn-beijing.192.168.XX.XX   <none>           <none>
    2. Run the following command to confirm all pods that use the OSS volume.

      Replace <ns> with the namespace name and <pvc-name> with the PVC name.

      kubectl -n <ns> describe pvc <pvc-name>

      Expected output (see the Used By field):

      Used By:       oss-static-94849f647-4****
                     oss-static-94849f647-6****
                     oss-static-94849f647-h****
                     oss-static-94849f647-v****
                     oss-static-94849f647-x****
    3. Run the following command to query the pods that run on the same node as the csi-fuse-ossfs-xxxx pod:

      kubectl -n <ns> get pod -owide | grep cn-beijing.192.168.XX.XX 

      Expected output:

      NAME                         READY   STATUS    RESTARTS   AGE     IP               NODE                         NOMINATED NODE   READINESS GATES
      oss-static-94849f647-4****   1/1     Running   0          10d     192.168.100.11   cn-beijing.192.168.100.3     <none>           <none>
      oss-static-94849f647-6****   1/1     Running   0          7m36s   192.168.100.18   cn-beijing.192.168.100.3     <none>           <none>
  2. Restart your application and the ossfs process.

    Delete the application pods (oss-static-94849f647-4**** and oss-static-94849f647-6**** in the preceding example) by using kubectl scale. When the OSS volume is not mounted to application pods, the csi-fuse-ossfs-xxxx pod is deleted. After the application pods are recreated, the OSS volume is mounted based on the new PV configuration by the ossfs process running in csi-fuse-ossfs-yyyy pod.

    A restart is triggered immediately when you delete a pod managed by a Deployment, StatefulSet, or DaemonSet. If you cannot delete all application pods at the same time and the pods can tolerate read and write failures, use one of the following methods:

    • If the CSI version is earlier than 1.30.4, you can directly delete the csi-fuse-ossfs-xxxx pod. In this case, the disconnected error is returned when the application pods read or write the OSS volume.

    • If the CSI version is 1.30.4 or later, run the following command:

      kubectl get volumeattachment | grep <pv-name> | grep cn-beijing.192.168.XX.XX 

      Expected output:

      csi-bd463c719189f858c2394608da7feb5af8f181704b77a46bbc219b**********   ossplugin.csi.alibabacloud.com    <pv-name>                   cn-beijing.192.168.XX.XX    true       12m

      If you directly delete this VolumeAttachment, the disconnected error is returned when the application pods read or write the OSS volume.

    Restart the application pods one by one. The restarted application pods will read and write the OSS volume through the csi-fuse-ossfs-yyyy pod created by CSI.

How do I view the access records of an OSS volume?

You can view the records of OSS operations in the OSS console. Make sure that the log query feature is enabled for OSS. For more information, see Enable real-time log query.

  1. Log on to the OSS console.

  2. In the left-side navigation pane, click Buckets. On the Buckets page, find and click the desired bucket.

  3. In the left-side navigation tree, choose Logging > Real-time Log Query.

  4. On the Real-time Log Query tab, enter a query statement and an analysis statement based on the query syntax and analysis syntax to analyze log fields. Use the user_agent and client_ip fields to confirm whether the log is generated by ACK.

    1. To find OSS requests sent by ACK, select the user_agent field. Requests whose user_agent contains ossfs are OSS requests sent by ACK.

      Important
      • The value of the user-agent field depends on the ossfs version, but the values all start with aliyun-sdk-http/1.0()/ossfs.

      • If you have used ossfs to mount OSS volumes on ECS instances, the relevant log data is also recorded.

    2. To locate an ECS instance or cluster, select the client_ip field and find the IP address of the ECS instance or cluster.

    The console then displays the log data filtered based on the preceding fields.

Fields that are queried

  • operation: the type of OSS operation. Examples: GetObject and GetBucketStat. For more information, see List of operations by function.

  • object: the name of the OSS object (path or file).

  • request_id: the request ID, which helps you find a request.

  • http_status and error_code: the returned status code or error code. For more information, see HTTP status code.

What do I do if an exception occurs when I use subPath or subPathExpr to mount an OSS volume?

Issue

The following exceptions occur when you use subPath or subPathExpr to mount OSS volumes:

  • Mounting failures: The pod to which the OSS volume is to be mounted remains in the CreateContainerConfigError state after the pod is created. In addition, the following event is generated.

    Warning  Failed          10s (x8 over 97s)  kubelet            Error: failed to create subPath directory for volumeMount "pvc-oss" of container "nginx"
  • Read and write exceptions: The Operation not permitted or Permission denied error is returned when an application pod reads or writes the OSS volume.

  • Unmounting failures: When the system deletes the pod to which the OSS volume is mounted, the pod remains in the Terminating state.

Causes

Assume that:

PV-related configuration:

...
    volumeAttributes:
      bucket: bucket-name
      path: /path
      ...

Pod-related configuration:

...
       volumeMounts:
      - mountPath: /path/in/container
        name: oss-pvc
        subPath: subpath/in/oss
      ...

In this case, the subPath on the OSS server is set to the /path/subpath/in/oss/ path in the bucket.

  • Cause 1: The mount target /path/subpath/in/oss/ does not exist on the OSS server and the user or role does not have the PutObject permission on the OSS volume. For example, only the OSS ReadOnly permission is granted in read-only scenarios.

    The kubelet attempts to create the mount target /path/subpath/in/oss/ on the OSS server but fails due to insufficient permissions.

  • Cause 2: Application containers run by non-root users do not have the required permissions on files in the /path/subpath/in/oss/ path. The default permission is 640. When you use subPath to mount an OSS volume, the mount target on the OSS server is the path defined in the PV, which is /path in the preceding example, but not /path/subpath/in/oss/. The allow_other or mp_umask setting takes effect only on the /path path. The default permission on the /path/subpath/in/oss/ subPath is still 640.

  • Cause 3: The mount target /path/subpath/in/oss/ on the OSS server is deleted. Consequently, the kubelet fails to unmount the subPath.

Solutions

Solution to cause 1:

  • Create the /path/subpath/in/oss/ subPath on the OSS server for the kubelet.

  • If you need to create a large number of paths and some paths do not exist when you mount the OSS volume, you can grant the putObject permission to the user or role. This case occurs when you use subPathExpr to mount OSS volumes.

Solution to cause 2: Use the umask parameter to modify the default permission on the subPath. For example, add -o umask=000 to set the default permission to 777.

Solution to cause 3: Refer to the solution to cause 2 in What do I do if a pod remains in the Terminating state when I fail to unmount a statically provisioned OSS volume from the pod?

Does the capacity setting take effect on OSS volumes? Do I need to expand a volume if the actual capacity of the volume exceeds the capacity setting?

OSS does not limit the size of buckets or subPaths or have any capacity limits. Therefore, the pv.spec.capacity and pvc.spec.resources.requests.storage settings do not take effect. You need only to make sure that the capacity values in the PV and PVC are the same.

You can continue to use a volume as normal when the actual capacity of the volume exceeds the capacity setting. No volume expansion is needed.
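
For example, the following minimal sketch uses the same placeholder value (20Gi) in both the PV and the PVC; any value works because it is not enforced for OSS volumes.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: oss-pv
spec:
  capacity:
    storage: 20Gi # must match the storage request in the PVC below
  accessModes:
    - ReadOnlyMany
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: oss-pv
    volumeAttributes:
      bucket: "bucket-name"
      url: "oss-cn-hangzhou.aliyuncs.com"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: oss-pvc
spec:
  accessModes:
    - ReadOnlyMany
  storageClassName: "" # bind to the statically provisioned PV instead of a StorageClass
  resources:
    requests:
      storage: 20Gi # same value as pv.spec.capacity.storage
  volumeName: oss-pv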