This topic provides answers to some frequently asked questions about Object Storage Service (OSS) volumes.
This FAQ is organized into the following categories: mounting OSS volumes, using OSS volumes, unmounting OSS volumes, detection failures in the ACK console, and others.
Why does it take a long time to mount an OSS volume?
Issue
Mounting an OSS volume takes a long time.
Causes
If both of the following conditions are met, the kubelet performs chmod or chown operations when the volume is mounted, which increases the mount time:
The AccessModes parameter is set to ReadWriteOnce in the persistent volume (PV) and persistent volume claim (PVC) templates.
The securityContext.fsGroup parameter is set in the application template.
Solutions
If the securityContext.fsGroup parameter is set in the application template, delete the fsGroup parameter from the securityContext section.
If you want to configure the user ID (UID) and mode of the files in the mounted path, you can manually mount the OSS bucket to an Elastic Compute Service (ECS) instance, run the chown and chmod commands from a CLI, and then provision the OSS volume by using the Container Storage Interface (CSI) plug-in. For more information about how to provision OSS volumes by using the CSI plug-in, see Mount a statically provisioned OSS volume.
Apart from the preceding methods, for clusters that run Kubernetes 1.20 or later, you can set the fsGroupChangePolicy parameter to OnRootMismatch. This way, the chmod or chown operation is performed only when the system launches the pod for the first time. The first launch therefore still takes a long time to mount the OSS volume, but the issue does not occur for subsequent mounts. For more information about fsGroupChangePolicy, see Set the security context for a pod or a container.
We recommend that you do not write to the PVCs of OSS volumes mounted by ossfs. These PVCs are read-only by default.
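For reference, the following is a minimal sketch of an application template that sets fsGroupChangePolicy to OnRootMismatch. The pod name, image, and PVC name are placeholders rather than values from this topic.
apiVersion: v1
kind: Pod
metadata:
  name: app-with-oss-volume             # hypothetical pod name
spec:
  securityContext:
    fsGroup: 1000
    # chown/chmod runs only if the volume root does not already match fsGroup,
    # so only the first launch pays the cost.
    fsGroupChangePolicy: "OnRootMismatch"
  containers:
    - name: app
      image: nginx                       # placeholder image
      volumeMounts:
        - name: oss-volume
          mountPath: /data
  volumes:
    - name: oss-volume
      persistentVolumeClaim:
        claimName: oss-pvc               # hypothetical PVC bound to the OSS PV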
How do I manage the permissions related to OSS volume mounting?
The Permission Denied error is displayed in the following scenarios.
Scenario 1: The Permission Denied error occurs when you access the mount target
Causes
By default, the root user of Linux is used to mount OSS volumes, and the mount target has the 700 permission. When a container process accesses the OSS volume as a non-root user, an error is returned due to insufficient permissions.
Solutions
Add configurations to modify the permissions on the root path.
Parameter | Description |
allow_other | Sets the 777 permission on the mount target. |
mp_umask | Sets the umask of the mount target. This parameter takes effect only when the allow_other parameter is set. |
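For reference, the following is a hedged sketch of how these options are passed to ossfs through the otherOpts field of the PV. The bucket name and endpoint are placeholders.
csi:
  driver: ossplugin.csi.alibabacloud.com
  volumeHandle: pv-oss                        # must match the PV name
  volumeAttributes:
    bucket: "example-bucket"                  # placeholder bucket name
    url: "oss-cn-hangzhou.aliyuncs.com"       # placeholder endpoint
    # allow_other opens the mount target to other users (777);
    # mp_umask=022 then masks the permission down to 755.
    otherOpts: "-o allow_other -o mp_umask=022"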
Scenario 2: The Permission Denied error occurs when you use ossutil, the OSS console, or the SDK to upload files
Causes
By default, ossfs applies the 640 permission to files that are uploaded by using methods other than ossfs. When a container process accesses such a file as a non-root user, an error is returned due to insufficient permissions.
Solutions
Run the chmod command as a root user to modify the permissions on the desired file. You can also add the following configurations to modify the permissions on the subPath and files in the mount target.
umask: the umask of the subPath and files in the mount target. You can set the umask parameter in the same way as mp_umask. The umask parameter does not rely on the allow_other parameter.
The umask parameter defines only the permissions on existing files in the current ossfs process. It does not take effect on remounted files or files in other ossfs processes. Examples:
After you set -o umask=022, run stat on a file uploaded from the OSS console. The permission on the file is 755. If you delete the -o umask=022 setting and remount the volume, the permission on the file is 640 again.
After you set -o umask=133, run the chmod command as the root user in the current container process to set the permission on a file to 777. If you then run stat on the file, the permission on the file is 644. If you delete the -o umask=133 setting and remount the volume, the permission on the file is changed to 777.
Scenario 3: The system prompts insufficient permissions when other container processes read or write files created in ossfs
Causes
The default permission on regular files created in ossfs is 644. After you set the fsGroup field in securityContext or run the chmod or chown command on a file, the permission or owner of the file may be changed. When another user accesses the file through a container process, an insufficient permissions error may be returned.
Solutions
Run stat on the file to check its permission. If an insufficient permissions error is returned, run the chmod command as the root user to modify the permission on the file.
The preceding solutions resolve the issue that the user of the current container process does not have sufficient permissions on a path or file. You can also change the owner of the subPath and files in the mount target of ossfs to resolve this issue.
If you specified the user of the container process when you built the container image or left the securityContext.runAsUser and securityContext.runAsGroup fields in the application deployment empty, the container process runs as a non-root user.
Add the following configurations to change the UID and GID of the subPath and files in the mount target of ossfs to those of the user that runs the container process.
Parameter | Description |
uid | The UID of the owner of the subPath and files in the mount target. |
gid | The GID of the owner of the subPath and files in the mount target. |
For example, if the corresponding IDs of the container process are uid=1000(biodocker), gid=1001(biodocker), and groups=1001(biodocker), set -o uid=1000 and -o gid=1001.
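As a sketch, the options from this example can be added to the otherOpts field of the PV. The bucket name and endpoint are placeholders.
volumeAttributes:
  bucket: "example-bucket"                    # placeholder bucket name
  url: "oss-cn-hangzhou.aliyuncs.com"         # placeholder endpoint
  # Present the subPath and files as owned by UID 1000 and GID 1001,
  # the user that runs the container process.
  otherOpts: "-o uid=1000 -o gid=1001"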
Scenario 4: After you set the nodePublishSecretRef field in the PV to reference a Secret, the AccessKey pair stored in the Secret cannot be used to access files in the OSS volume. The original AccessKey pair was revoked due to AccessKey pair rotation, and the renewed AccessKey pair in the Secret does not take effect.
Causes
OSS volumes are FUSE file systems mounted by using ossfs. The AccessKey pair of an OSS volume cannot be renewed after the OSS volume is mounted. The application that uses the OSS volume uses only the original AccessKey pair to send requests to the OSS server.
Solutions
After the AccessKey pair in the Secret is renewed, you need to remount the OSS volume. If you use a non-containerized ossfs version, or a containerized ossfs version that mounts OSS volumes in exclusive mode, restart the application pod to trigger an ossfs restart. For more information, see How do I restart the ossfs process when the OSS volume is shared by multiple pods?
Scenario 5: The Operation not permitted error occurs when you create a hard link
Causes
OSS volumes do not support hard links. In earlier CSI versions, the Operation not permitted error is returned when you create hard links.
Solutions
Avoid using hard links if your application uses OSS volumes. If hard links are mandatory, we recommend that you change the storage service.
Scenario 6: The system prompts insufficient read or write permissions when you use subPath or subPathExpr to mount an OSS volume
Causes
A container process that runs as a non-root user does not have permissions on the files in the /path/subpath/in/oss/ path. The default permission on the path is 640. When you use subPath to mount an OSS volume, the mount target on the OSS server is the path defined in the PV (for example, /path), not the /path/subpath/in/oss/ subPath. The allow_other or mp_umask setting takes effect only on the /path path. The default permission on the /path/subpath/in/oss/ subPath is still 640.
Solutions
Use the umask parameter to modify the default permission on the subPath. For example, add -o umask=000 to set the default permission to 777.
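The following is a minimal sketch of this setting in the PV. The bucket name and path are placeholders.
volumeAttributes:
  bucket: "example-bucket"                    # placeholder bucket name
  path: "/path"                               # the mount target defined in the PV
  # umask applies to files and subPaths under the mount target, so files under
  # /path/subpath/in/oss/ are presented with 777 permissions instead of 640.
  otherOpts: "-o umask=000"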
What do I do if I fail to mount a statically provisioned OSS volume?
Issue
You failed to mount a statically provisioned OSS volume. The pod cannot be started and a FailedMount event is generated.
Causes
Cause 1: In earlier ossfs versions, you cannot mount OSS buckets to paths that do not exist. If the mount target does not exist, a mounting failure occurs.
Important: subPaths displayed in the OSS console may not exist on the OSS server. Use ossutil or the OSS API to confirm the subPaths. For example, if you create the /a/b/c/ path, the /a/b/c/ path object is created, but the /a/ and /a/b/ path objects do not exist. If you upload the /a/* object, you can find the /a/b or /a/c object, but the /a/ path object does not exist.
Cause 2: The mount failed because the AccessKey pair or RAM Roles for Service Accounts (RRSA) configuration uses incorrect role information or has insufficient permissions.
Cause 3: For CSI 1.30.4 and later, the pod that runs ossfs is located in the ack-csi-fuse namespace. During the mounting process, CSI first launches the pod that runs ossfs, and then initializes the ossfs process in that pod through a Remote Procedure Call (RPC) request. If the event log contains the message FailedMount /run/fuse.ossfs/xxxxxx/mounter.sock: connect: no such file or directory, the mounting failed because the pod that runs ossfs was not started properly or was deleted unexpectedly.
Cause 4: If the event contains the Failed to find executable /usr/local/bin/ossfs: No such file or directory message, the mounting failed because ossfs failed to be installed on the node.
Cause 5: If the event contains the error while loading shared libraries: xxxxx: cannot open shared object file: No such file or directory message, the mounting failed because ossfs runs on nodes in the current CSI version but some dynamic libraries required by ossfs are missing from the operating system. Possible causes:
Another ossfs version was manually installed on the node and its required operating system differs from the operating system of the node.
The default OpenSSL version is changed after the node operating system is updated, such as an update from Alibaba Cloud Linux 2 to Alibaba Cloud Linux 3.
If ossfs runs on nodes, only the following operating systems are supported: CentOS, Alibaba Cloud Linux, ContainerOS, and Anolis OS.
Dynamic libraries required by ossfs, such as FUSE, cURL, and xml2, are deleted from the node that runs the required operating system, or the default OpenSSL version is changed.
Cause 6: A mirroring-based back-to-origin rule is configured for the bucket but the mount path is not synchronized from the origin.
Cause 7: Static website hosting is configured for the bucket. When ossfs checks the mount target on the OSS server, the index.html file is returned.
Solutions
Solution to cause 1:
Check whether the subPath exists on the OSS server.
Assume that the mount target of the PV is sub/path/. You can run stat (query bucket and object information) to query objects whose objectname is sub/path/, or call the HeadObject API operation to query objects whose key is sub/path/. If 404 is returned, the subPath does not exist on the OSS server.
You can use ossutil, the OSS SDK, or the OSS console to create the missing bucket or subPath and mount the bucket again.
In ossfs versions later than 1.91, you can specify a mount target that does not exist. Therefore, you can also update ossfs to resolve this issue. For more information, see New features of ossfs 1.91 and later and stress tests.
Solution to cause 2:
Confirm that the policy permissions for the Resource Access Management (RAM) user or RAM role used for mounting are granted with the permissions listed in Step 2: Grant permissions to the demo-role-for-rrsa role.
Verify the file system permissions for the root directory and subPath of the mount target. For more information, see Scenario 1 and Scenario 6 in How do I manage the permissions related to OSS volume mounting?
For volumes mounted using AccessKey authentication as a RAM user, confirm that the AccessKey used during mounting is neither disabled nor rotated. For more information, see Scenario 4 in How do I manage the permissions related to OSS volume mounting?
For volumes mounted using RRSA authentication, confirm that the correct trust policy is configured for the RAM role. For more information about how to configure the trust policy, see (Optional) Step 1: Create a RAM role. By default, the trusted service account is csi-fuse-ossfs in the ack-csi-fuse namespace, rather than the service account used by the service.
Note: The RRSA feature supports only ACK clusters that run Kubernetes 1.26 and later. ACK clusters that support the RRSA feature include ACK Basic clusters, ACK Pro clusters, ACK Serverless Basic clusters, and ACK Serverless Pro clusters. The version of the CSI component used by the cluster must be 1.30.4 or later. If you used the RRSA feature prior to version 1.30.4, you must attach policies to the RAM role. For more information, see [Product changes] ossfs version upgrade and mounting process optimization in CSI.
Solution to cause 3:
Run the following command to confirm that the pod running ossfs exists. Replace <PV_NAME> with the name of the OSS PV to be mounted, and <NODE_NAME> with the name of the node on which the pod that requires the volume runs.
kubectl -n ack-csi-fuse get pod -l csi.alibabacloud.com/volume-id=<PV_NAME> -owide | grep <NODE_NAME>
If the pod exists but is in an abnormal state, troubleshoot and make sure that the pod is in the Running state before restarting it to trigger a remount. If the pod does not exist, follow the subsequent steps to troubleshoot.
(Optional) Verify if the pod was accidentally deleted by reviewing audit logs and other relevant sources. Common causes for accidental deletion include script cleanup, node draining, and node auto repair. We recommend that you make appropriate adjustments to prevent this issue from recurring.
Make sure that both csi-provisioner and csi-plugin are updated to version 1.30.4 or later. Then, restart the pod to trigger a remount and verify that the pod running ossfs is created through a proper process.
Solution to cause 4:
We recommend that you update csi-plugin to v1.26.2 or later. The issue that the ossfs installation fails during the initialization of newly added nodes is fixed in these versions.
Run the following command to restart csi-plugin on the node and check whether the csi-plugin pod runs as normal. In the following command, csi-plugin-**** specifies the csi-plugin pod on the node.
kubectl -n kube-system delete pod csi-plugin-****
If the issue persists after you update or restart the component, log on to the node and run the following command:
ls /etc/csi-tool
Expected output:
... ossfs_<ossfsVer>_<ossfsArch>_x86_64.rpm ...
If the output contains the preceding ossfs RPM package, run the following command to install ossfs, and then check whether the csi-plugin pod runs as normal.
rpm -i /etc/csi-tool/ossfs_<ossfsVer>_<ossfsArch>_x86_64.rpm
If the output does not contain the ossfs RPM package, submit a ticket.
Solution to cause 5:
If you have manually installed ossfs, check whether the required operating system is the same as the operating system of the node.
If you have updated the operating system of the node, run the following command to restart csi-plugin, update ossfs, and remount the OSS volume.
kubectl -n kube-system delete pod -l app=csi-plugin
We recommend that you update CSI to 1.28 or later. In these versions, ossfs runs in containers. Therefore, it does not have requirements on the node operating system.
If you cannot update CSI, install the required operating system or manually install the missing dynamic libraries. In the following example, the node runs Ubuntu:
Run the which command to query the installation path of ossfs. The default path is /usr/local/bin/ossfs.
which ossfs
Run the ldd command to query the missing dynamic libraries required by ossfs.
ldd /usr/local/bin/ossfs
Run the apt-file command to query the package of the missing dynamic libraries (such as libcrypto.so.10).
apt-get install apt-file
apt-file update
apt-file search libcrypto.so.10
Run the apt-get command to install the package, such as libssl.1.0.0.
apt-get install libssl1.0.0
Solution to cause 6:
Synchronize data from the origin before you mount the OSS volume. For more information, see Overview.
Solution to cause 7:
Disable static website hosting or modify the configuration and try again. For more information, see Overview.
What do I do if I fail to access a statically provisioned OSS volume?
Issue
You failed to access a statically provisioned OSS volume.
Causes
You did not specify an AccessKey pair when you mounted the statically provisioned OSS volume.
Solutions
Specify an AccessKey pair in the configurations of the statically provisioned OSS volume. For more information, see Mount a statically provisioned OSS volume.
What do I do if the read speed of a statically provisioned OSS volume is slow?
Issue
The read speed of a statically provisioned OSS volume is slow.
Causes
Cause 1: OSS does not limit the number of objects. However, when the number of objects exceeds 1,000, FUSE may access an excessive amount of metadata. Consequently, access to the OSS bucket becomes slow.
Cause 2: After versioning is enabled for OSS, large numbers of delete markers are generated in the bucket, which degrade the performance of listObjectsV1.
Cause 3: The storage class of the bucket on the OSS server is not Standard. Access to buckets of these storage classes is slow.
Solutions
Solution to cause 1:
When you mount an OSS volume to a container, we recommend that you set the access mode of the OSS volume to read-only. If an OSS bucket stores a large number of files, we recommend that you use the OSS SDK or CLI to access the files in the bucket, instead of accessing the files by using a file system. For more information, see SDK demos overview.
Solution to cause 2:
After CSI plugin is updated to v1.26.6, ossfs allows you to access buckets by using listObjectsV2. Add -o listobjectsv2 to the otherOpts field of the PV corresponding to the statically provisioned OSS volume.
Solution to cause 3:
Change the storage class or restore objects.
Why is 0 displayed for the size of a file in the OSS console after I write data to the file?
Issue
After you write data to an OSS volume mounted to a container, the size of the file displayed in the OSS console is 0.
Causes
The OSS bucket is mounted as a FUSE file system by using ossfs. In this case, a file is uploaded to the OSS server only after the close or flush operation is performed on the file.
Solutions
Run the lsof command with the name of the file to check whether the file is being used by processes. If the file is being used by processes, terminate the processes to release the file descriptor (FD) of the file. For more information about the lsof command, see lsof.
Why is a path displayed as an object after I mount the path to a container?
Issue
After you mount a path to a container, the path is displayed as an object.
Causes
Cause 1: The content type of the path on the OSS server is not the default application/octet-stream type (for example, it is text/html or image/jpeg), or the size of the path object is not 0. In this case, ossfs displays the path as an object based on the metadata of the path object.
Cause 2: The x-oss-meta-mode metadata is missing from the path object.
Solutions
Solution to cause 1:
Use HeadObject or stat (query bucket and object information) to obtain the metadata of the path object. The path object must end with a forward slash (/), such as a/b/. Sample API response:
{
"server": "AliyunOSS",
"date": "Wed, 06 Mar 2024 02:48:16 GMT",
"content-type": "application/octet-stream",
"content-length": "0",
"connection": "keep-alive",
"x-oss-request-id": "65E7D970946A0030334xxxxx",
"accept-ranges": "bytes",
"etag": "\"D41D8CD98F00B204E9800998ECFxxxxx\"",
"last-modified": "Wed, 06 Mar 2024 02:39:19 GMT",
"x-oss-object-type": "Normal",
"x-oss-hash-crc6xxxxx": "0",
"x-oss-storage-class": "Standard",
"content-md5": "1B2M2Y8AsgTpgAmY7Phxxxxx",
"x-oss-server-time": "17"
}
In the preceding sample response:
content-type: The content type of the path object is application/octet-stream.
content-length: The size of the path object is 0.
If the preceding conditions are not met, perform the following steps:
Use GetObject or ossutil to obtain the object and confirm its content. If the content of the object needs to be retained or you cannot confirm the content, we recommend that you back up the object first. For example, change the name of the object and upload it to OSS. For an xx/ path object, do not use xx as the new object name.
Use DeleteObject or rm to delete the original path object and check whether ossfs displays the path object as normal.
Solution to cause 2:
If the issue persists after you perform the steps in the solution to cause 1, add -o complement_stat to the otherOpts field of the PV corresponding to the statically provisioned OSS volume when you mount the OSS volume.
In CSI plugin v1.26.6 and later versions, this feature is enabled by default. You can update CSI plugin to v1.26.6 or later, and then restart the application pod and remount the OSS volume to resolve the issue.
What do I do if the OSS server identifies unexpected large numbers of requests?
Issue
When you mount an OSS volume to a container, the OSS server identifies unexpected large numbers of requests.
Causes
When ossfs mounts an OSS bucket, a mount target is generated on the node. When other processes on the ECS node scan the mount target, requests are sent to the OSS server. Fees are incurred if the number of requests exceeds the upper limit.
Solutions
Use auditd to track the processes that generate the requests and fix the issue. You can perform the following operations on the node.
Run the following command to install and launch auditd:
sudo yum install auditd
sudo service auditd start
Monitor the ossfs mount targets.
Run the following command to monitor all mount targets:
for i in $(mount | grep -i ossfs | awk '{print $3}');do auditctl -w ${i};done
Run the following command to monitor the mount target of a specific PV. Replace <pv-name> with the name of the PV.
for i in $(mount | grep -i ossfs | grep -i <pv-name> | awk '{print $3}');do auditctl -w ${i};done
Run the following command to print the audit log to view the processes that access the mount target in the OSS bucket:
ausearch -i
In the following sample audit log, the log data separated by the --- delimiter records an operation performed on the monitored mount target. The log data indicates that the updatedb process performed the open operation on the mount target. The PID is 1636611.
---
type=PROCTITLE msg=audit (September 22, 2023 15:09:26.244:291) : proctitle=updatedb
type=PATH msg=audit (September 22, 2023 15:09:26.244:291) : item=0 name=. inode=14 dev=00:153 mode=dir,755 ouid=root ogid=root rdev=00:00 nametype=NORMAL cap_fp=none cap_fi=none cap_fe=0 cap_fver=0
type=CWD msg=audit (September 22, 2023 15:09:26.244:291) : cwd=/subdir1/subdir2
type=SYSCALL msg=audit (September 22, 2023 15:09:26.244:291) : arch=x86_64 syscall=open success=yes exit=9 a0=0x55f9f59da74e a1=O_RDONLY|O_DIRECTORY|O_NOATIME a2=0x7fff78c34f40 a3=0x0 items=1 ppid=1581119 pid=1636611 auid=root uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=pts1 ses=1355 comm=updatedb exe=/usr/bin/updatedb key=(null)
---
Check whether requests are sent from non-business processes and fix the issue.
For example, the audit log indicates that updatedb scans all mount targets. In this case, modify /etc/updatedb.conf to skip the ossfs mount targets. To do this, perform the following steps:
Set PRUNEFS= to include fuse.ossfs.
Set PRUNEPATHS= to include the mount target.
What do I do if the content type of the metadata of an object in an OSS volume is application/octet-stream?
Issue
The content type of the metadata of an object in an OSS volume is application/octet-stream. Consequently, the browser or other clients cannot recognize or process the object.
Causes
The default content type of objects in ossfs is binary stream.
After you modify the /etc/mime.types file to specify a content type, the modification does not take effect.
Solutions
CSI 1.26.6 and 1.28.1 have compatibility issues in the content type setting. If you use the preceding versions, update CSI to the latest version. For more information, see [Component Notice] Incompatible configurations in csi-plugin 1.26.6 and 1.28.1, and csi-provisioner 1.26.6 and 1.28.1.
If you have used mailcap or mime-support to generate the /etc/mime.types file on the node and specified the content type, update CSI and remount the OSS volume.
If no content type is specified, specify a content type in one of the following ways:
Node-level setting: Generate a /etc/mime.types file on the node. The content type takes effect on all OSS volumes mounted to the node. For more information, see FAQ.
Cluster-level setting: The content type takes effect on all newly mounted OSS volumes in the cluster. Make sure that the content of the /etc/mime.types file is the same as the default content generated by mailcap.
Run the following command to check whether the csi-plugin configuration file exists.
kubectl -n kube-system get cm csi-plugin
If the ConfigMap does not exist, create a csi-plugin ConfigMap with the same name based on the following content. If the ConfigMap already exists, add mime-support=true to the data.fuse-ossfs section of the ConfigMap.
apiVersion: v1
kind: ConfigMap
metadata:
  name: csi-plugin
  namespace: kube-system
data:
  fuse-ossfs: |
    mime-support=true
Restart csi-plugin for the modification to take effect. The restart does not affect the volumes that are already mounted.
kubectl -n kube-system delete pod -l app=csi-plugin
Remount the desired OSS volume.
How do I use the specified ARNs or ServiceAccount in RRSA authentication?
You cannot use the Alibaba Cloud Resource Names (ARNs) of third-party OpenID Connect identity providers (OIDC IdPs) or ServiceAccounts other than the default one if you use RRSA authentication for OSS volumes.
To enable CSI to obtain the default role ARN and OIDC IdP ARN, set the roleName parameter in the PV to the desired RAM role. To customize RRSA authentication, modify the PV configuration as follows:
Configure both roleArn and oidcProviderArn. You do not need to set roleName after you configure the preceding parameters.
apiVersion: v1
kind: PersistentVolume
metadata:
name: pv-oss
spec:
capacity:
storage: 5Gi
accessModes:
- ReadOnlyMany
persistentVolumeReclaimPolicy: Retain
csi:
driver: ossplugin.csi.alibabacloud.com
volumeHandle: pv-oss # Specify the name of the PV.
volumeAttributes:
bucket: "oss"
url: "oss-cn-hangzhou.aliyuncs.com"
otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
authType: "rrsa"
oidcProviderArn: "<oidc-provider-arn>"
roleArn: "<role-arn>"
#roleName: "<role-name>" #The roleName parameter becomes invalid after roleArn and oidcProviderArn are configured.
serviceAccountName: "csi-fuse-<service-account-name>"
Parameter | Description |
oidcProviderArn | Obtain the OIDC IdP ARN after the OIDC IdP is created. For more information, see Manage an OIDC IdP. |
roleArn | Obtain the role ARN after a RAM role whose trusted entity is the preceding OIDC IdP is created. For more information, see Step 2: Create a RAM role for the OIDC IdP in Alibaba Cloud. |
serviceAccountName | Optional. The ServiceAccount used by the ossfs pod. Make sure that the ServiceAccount is already created. If you leave this parameter empty, the default ServiceAccount maintained by CSI is used. Important: The name of the ServiceAccount must start with csi-fuse-. |
What do I do if the "Operation not supported" or "Operation not permitted" error occurs when I create a hard link?
Issue
The Operation not supported or Operation not permitted error occurs when you create a hard link.
Causes
The Operation not supported error occurs because OSS volumes do not support hard links. In earlier CSI versions, the Operation not permitted error is returned when you create hard links.
Solutions
Avoid using hard links if your application uses OSS volumes. If hard links are mandatory, we recommend that you change the storage service.
What do I do if a pod remains in the Terminating state when I fail to unmount a statically provisioned OSS volume from the pod?
Issue
You failed to unmount a statically provisioned OSS volume from a pod and the pod remains in the Terminating state.
Causes
If a pod remains in the Terminating state when the system deletes the pod, check the kubelet log. Possible causes of OSS volume unmounting failures:
Cause 1: The mount target on the node is occupied. CSI cannot unmount the mount target.
Cause 2: The specified OSS bucket or path in the PV is deleted. The status of the mount target is unknown.
Solutions
Solution to cause 1:
Run the following command in the cluster to query the pod UID.
Replace <ns-name> and <pod-name> with the actual values.
kubectl -n <ns-name> get pod <pod-name> -ogo-template --template='{{.metadata.uid}}'
Expected output:
5fe0408b-e34a-497f-a302-f77049****
Log on to the node that hosts the pod in the Terminating state.
Run the following command to check whether processes occupy the mount target.
lsof /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<pv-name>/mount/
If yes, confirm and terminate the processes.
Solution to cause 2:
Log on to the OSS console.
Check whether the OSS bucket or path is deleted. If you use subPath to mount the OSS volume, you also need to check whether the subPath is deleted.
If the unmounting fails because the path is deleted, perform the following steps:
Run the following command in the cluster to query the pod UID.
Replace <ns-name> and <pod-name> with the actual values.
kubectl -n <ns-name> get pod <pod-name> -ogo-template --template='{{.metadata.uid}}'
Expected output:
5fe0408b-e34a-497f-a302-f77049****
Log on to the node that hosts the pod in the Terminating state and run the following command to query the mount target of the pod:
mount | grep <pod-uid> | grep fuse.ossfs
Expected output:
ossfs on /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<pv-name>/mount type fuse.ossfs (ro,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)
ossfs on /var/lib/kubelet/pods/<pod-uid>/volume-subpaths/<pv-name>/<container-name>/0 type fuse.ossfs (ro,relatime,user_id=0,group_id=0,allow_other)
The path between ossfs on and type is the actual mount target on the node.
Manually unmount the mount target.
umount /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<pv-name>/mount
umount /var/lib/kubelet/pods/<pod-uid>/volume-subpaths/<pv-name>/<container-name>/0
Wait for the kubelet to retry, or delete the pod with the --force option.
If the issue persists, submit a ticket.
What do I do if the detection task in the ACK console becomes stuck for a long period of time, the detection task fails but no error message is displayed, or the system prompts "unknown error"?
Issue
The detection task becomes stuck for a long period of time, the detection task fails but no error message is displayed, or the system prompts "unknown error".
Causes
If the detection task becomes stuck for a long period of time, it is usually caused by a network issue. If it is caused by other unknown issues, print the log or use ossutil to manually locate the cause.
Solutions
You can use logs and ossutil to locate the cause.
Use logs to locate the cause
Run the following command to find the pod that runs the detection task. In the command, <osssecret-namespace> is the namespace of the Secret and <pv-name> is the name of the PV.
kubectl -n <osssecret-namespace> get pod | grep <pv-name>-check
Expected output:
<pv-name>-check-xxxxx
Run the following command to locate the cause:
kubectl -n <osssecret-namespace> logs -f <pv-name>-check-xxxxx
Expected output:
check ossutil endpoint: oss-<region-id>-internal.aliyuncs.com bucket: <bucket-name> path: <path>
Error: oss: service returned error: StatusCode=403, ErrorCode=InvalidAccessKeyId, ErrorMessage="The OSS Access Key Id you provided does not exist in our records.", RequestId=65267325110A0C3130B7071C, Ec=0002-00000901, Bucket=<bucket-name>, Object=<path>
Use ossutil to locate the cause
If the pod that runs the detection task is already deleted, use ossutil to recreate the detection task and locate the cause.
stat (query bucket and object information) is used to detect OSS access. Install ossutil on any node and run the following command.
ossutil -e "<endpoint>" -i "<accessKeyID>" -k "<accessKeySecret>" stat oss://"<bucket><path>"
Parameter | Description |
endpoint | The endpoint of the bucket, for example, oss-<region-id>-internal.aliyuncs.com for internal access. |
accessKeyID | The AccessKey ID in the Secret. |
accessKeySecret | The AccessKey secret in the Secret. |
bucket | The bucket name. |
path | The path, which must end with a forward slash (/). |
Example:
ossutil -e "oss-<region-id>-internal.aliyuncs.com" -i "<accessKeyID>" -k "<accessKeySecret>" stat oss://"cnfs-oss-xxx-xxx/xx/"
How do I handle the "connection timed out" network error?
Issue
The connection timed out error occurs.
Causes
Access to the OSS bucket times out. Possible causes:
If the bucket and cluster reside in different regions and the internal endpoint of the bucket is used, access to the bucket fails.
If the external endpoint of the bucket is used but the cluster does not have Internet access, access to the bucket fails.
Solutions
Recreate the PV and select the external endpoint of the bucket.
If the bucket and cluster reside in the same region, you can recreate the PV and use the internal endpoint. If not, check the security group and network configurations, fix the issue, and recreate the PV.
How do I handle the "StatusCode=403" permission error?
Issue
The service returned error: StatusCode=403 error occurs.
Causes
Your AccessKey pair does not have read permissions on the OSS bucket to be mounted.
The StatusCode=403, ErrorCode=AccessDenied, ErrorMessage="You do not have read acl permission on this object." error indicates that your AccessKey pair does not have the required permissions.
The StatusCode=403, ErrorCode=InvalidAccessKeyId, ErrorMessage="The OSS Access Key Id you provided does not exist in our records." error indicates that the AccessKey pair does not exist.
The StatusCode=403, ErrorCode=SignatureDoesNotMatch, ErrorMessage="The request signature we calculated does not match the signature you provided. Check your key and signing method." error indicates that the AccessKey pair may contain spelling errors.
Solutions
Make sure that the AccessKey pair exists, does not contain spelling errors, and has read permissions on the bucket.
What do I do if the bucket or path does not exist and the StatusCode=404 status code is returned?
Issue
The service returned error: StatusCode=404 error occurs.
Causes
You cannot mount statically provisioned OSS volumes to buckets or subPaths that do not exist. You must create the buckets or subPaths in advance.
The StatusCode=404, ErrorCode=NoSuchBucket, ErrorMessage="The specified bucket does not exist." error indicates that the bucket does not exist.
The StatusCode=404, ErrorCode=NoSuchKey, ErrorMessage="The specified key does not exist." error indicates that the subPath object does not exist.
Important: subPaths displayed in the OSS console may not exist on the OSS server. Use ossutil or the OSS API to confirm the subPaths. For example, if you create the /a/b/c/ path, the /a/b/c/ path object is created, but the /a/ and /a/b/ path objects do not exist. If you upload the /a/* object, you can find the /a/b or /a/c object, but the /a/ path object does not exist.
Solutions
Use ossutil, the SDK, or the OSS console to create the missing bucket or subPath, and then recreate the PV.
What do I do if other OSS status codes or error codes are returned?
Issue
The service returned error: StatusCode=xxx error occurs.
Causes
If an error occurs when you access OSS, OSS returns the status code, error code, and error message for troubleshooting.
Solutions
If OSS returns other status codes or error codes, see HTTP status code.
How do I launch ossfs in exclusive mode after ossfs is containerized?
Issue
Pods that use the same OSS volume on a node share the mount target.
Causes
Before ossfs is containerized, OSS volumes are mounted in exclusive mode by default. An ossfs process is launched for each pod that uses an OSS volume. Different ossfs processes have different mount targets. Therefore, pods that use the same OSS volume do not affect each other when they read or write data.
After ossfs is containerized, the ossfs process runs in the csi-fuse-ossfs-* pod in the kube-system or ack-csi-fuse namespace. In scenarios where an OSS volume is mounted to multiple pods, the exclusive mode launches large numbers of ossfs pods. As a result, elastic network interfaces (ENIs) become insufficient. Therefore, after ossfs is containerized, the shared mode is used to mount OSS volumes. This allows pods that use the same OSS volume on a node to share the same mount target, which means that only one ossfs process runs in the csi-fuse-ossfs-* pod.
Solutions
In CSI 1.30.4 and later, the exclusive mode is no longer supported. If you need to restart or modify the configuration of ossfs, see How do I restart the ossfs process when the OSS volume is shared by multiple pods? If you have any other requirements for the exclusive mode with ossfs, Submit a ticket.
To use the exclusive mode before ossfs is containerized, add the useSharedPath parameter and set it to "false" when you create the OSS volume. Example:
apiVersion: v1
kind: PersistentVolume
metadata:
name: oss-pv
spec:
accessModes:
- ReadOnlyMany
capacity:
storage: 5Gi
csi:
driver: ossplugin.csi.alibabacloud.com
nodePublishSecretRef:
name: oss-secret
namespace: default
volumeAttributes:
bucket: bucket-name
otherOpts: -o max_stat_cache_size=0 -o allow_other
url: oss-cn-zhangjiakou.aliyuncs.com
useSharedPath: "false"
volumeHandle: oss-pv
persistentVolumeReclaimPolicy: Delete
volumeMode: Filesystem
How do I restart the ossfs process when the OSS volume is shared by multiple pods?
Issue
After you modify the authentication information or ossfs version, the running ossfs processes cannot automatically update the information.
Causes
ossfs cannot automatically update the authentication configuration. To apply the modified authentication configuration, you must restart the ossfs process (the csi-fuse-ossfs-* pod in the kube-system or ack-csi-fuse namespace after ossfs is containerized) and the application pod, which causes business interruptions. Therefore, CSI does not restart running ossfs processes to update configurations by default. You need to manually remount the OSS volume.
Normally, the deployment and removal of ossfs are handled by CSI. Manually deleting the pod that runs the ossfs process does not trigger the CSI deployment process.
Solutions
To restart the ossfs process, you need to restart the application pod that mounts the corresponding OSS volume. Proceed with caution.
If the CSI version you use is not containerized or the exclusive mode is used to mount OSS volumes, you can directly restart the application pod. In containerized CSI versions, the shared mode is used to mount OSS volumes by default. This means that pods using the same OSS volume on a node share the same ossfs process.
Confirm the application pods that use the FUSE pod.
Run the following command to confirm the csi-fuse-ossfs-* pod. Replace <pv-name> with the PV name and <node-name> with the node name.
Use the following command if the CSI version is earlier than 1.30.4:
kubectl -n kube-system get pod -lcsi.alibabacloud.com/volume-id=<pv-name> -owide | grep <node-name>
Use the following command if the CSI version is 1.30.4 or later:
kubectl -n ack-csi-fuse get pod -lcsi.alibabacloud.com/volume-id=<pv-name> -owide | grep <node-name>
Expected output:
csi-fuse-ossfs-xxxx 1/1 Running 0 10d 192.168.128.244 cn-beijing.192.168.XX.XX <none> <none>
Run the following command to confirm all pods that use the OSS volume. Replace <ns> with the namespace name and <pvc-name> with the PVC name.
kubectl -n <ns> describe pvc <pvc-name>
Expected output (including the Used By field):
Used By:  oss-static-94849f647-4****
          oss-static-94849f647-6****
          oss-static-94849f647-h****
          oss-static-94849f647-v****
          oss-static-94849f647-x****
Run the following command to query the pods that run on the same node as the csi-fuse-ossfs-xxxx pod:
kubectl -n <ns> get pod -owide | grep cn-beijing.192.168.XX.XX
Expected output:
NAME                         READY   STATUS    RESTARTS   AGE     IP               NODE                       NOMINATED NODE   READINESS GATES
oss-static-94849f647-4****   1/1     Running   0          10d     192.168.100.11   cn-beijing.192.168.100.3   <none>           <none>
oss-static-94849f647-6****   1/1     Running   0          7m36s   192.168.100.18   cn-beijing.192.168.100.3   <none>           <none>
Restart your application and the ossfs process.
Delete the application pods (oss-static-94849f647-4**** and oss-static-94849f647-6**** in the preceding example) by using kubectl scale. When the OSS volume is not mounted to any application pod, the csi-fuse-ossfs-xxxx pod is deleted. After the application pods are recreated, the OSS volume is mounted based on the new PV configuration by the ossfs process running in the csi-fuse-ossfs-yyyy pod.
A restart is triggered immediately when you delete a pod managed by a Deployment, StatefulSet, or DaemonSet. If you cannot delete all application pods at the same time or the pods can tolerate read and write failures:
If the CSI version is earlier than 1.30.4, you can directly delete the csi-fuse-ossfs-xxxx pod. In this case, the disconnected error is returned when the application pods read or write the OSS volume.
If the CSI version is 1.30.4 or later, run the following command:
kubectl get volumeattachment | grep <pv-name> | grep cn-beijing.192.168.XX.XX
Expected output:
csi-bd463c719189f858c2394608da7feb5af8f181704b77a46bbc219b********** ossplugin.csi.alibabacloud.com <pv-name> cn-beijing.192.168.XX.XX true 12m
If you directly delete this VolumeAttachment, the disconnected error is returned when the application pods read or write the OSS volume.
Restart the application pods one by one. The restarted application pods read and write the OSS volume through the csi-fuse-ossfs-yyyy pod created by CSI.
How do I view the access records of an OSS volume?
You can view the records of OSS operations in the OSS console. Make sure that the log query feature is enabled for OSS. For more information, see Enable real-time log query.
Log on to the OSS console.
In the left-side navigation pane, click Buckets. On the Buckets page, find and click the desired bucket.
In the left-side navigation tree, choose Logging > Real-time Log Query.
On the Real-time Log Query tab, enter a query statement and an analysis statement based on the query syntax and analysis syntax to analyze log fields. Use the user_agent and client_ip fields to confirm whether the log is generated by ACK.
To find OSS requests sent by ACK, select the user_agent field. Requests whose user_agent contains ossfs are OSS requests sent by ACK.
Important: The value of the user_agent field depends on the ossfs version, but the values all start with aliyun-sdk-http/1.0()/ossfs.
If you have used ossfs to mount OSS volumes on ECS instances, the relevant log data is also recorded.
To locate an ECS instance or cluster, select the client_ip field and find the IP address of the ECS instance or cluster.
Fields that are queried
Field | Description |
operation | The type of OSS operation. Examples: GetObject and GetBucketStat. For more information, see List of operations by function. |
object | The name of the OSS object (path or file). |
request_id | The request ID, which helps you find a request. |
http_status and error_code | The returned status code or error code. For more information, see HTTP status code. |
What do I do if an exception occurs when I use subPath or subPathExpr to mount an OSS volume?
Issue
The following exceptions occur when you use subPath or subPathExpr to mount OSS volumes:
Mounting failures: The pod to which the OSS volume is to be mounted remains in the CreateContainerConfigError state after it is created. In addition, the following event is generated.
Warning Failed 10s (x8 over 97s) kubelet Error: failed to create subPath directory for volumeMount "pvc-oss" of container "nginx"
Read and write exceptions: The Operation not permitted or Permission denied error is returned when an application pod reads or writes the OSS volume.
Unmounting failures: When the system deletes the pod to which the OSS volume is mounted, the pod remains in the Terminating state.
Causes
Assume that:
PV-related configuration:
...
volumeAttributes:
bucket: bucket-name
path: /path
...
Pod-related configuration:
...
volumeMounts:
- mountPath: /path/in/container
name: oss-pvc
subPath: subpath/in/oss
...
In this case, the subPath on the OSS server is set to the /path/subpath/in/oss/ path in the bucket.
Cause 1: The mount target /path/subpath/in/oss/ does not exist on the OSS server, and the user or role does not have the PutObject permission on the OSS volume. For example, only the OSS ReadOnly permission is granted in read-only scenarios. The kubelet attempts to create the mount target /path/subpath/in/oss/ on the OSS server but fails due to insufficient permissions.
Cause 2: Application containers run by non-root users do not have the required permissions on files in the /path/subpath/in/oss/ path. The default permission is 640. When you use subPath to mount an OSS volume, the mount target on the OSS server is the path defined in the PV, which is /path in the preceding example, not /path/subpath/in/oss/. The allow_other or mp_umask setting takes effect only on the /path path. The default permission on the /path/subpath/in/oss/ subPath is still 640.
Cause 3: The mount target /path/subpath/in/oss/ on the OSS server is deleted. Consequently, the kubelet fails to unmount the subPath.
Solutions
Solution to cause 1:
Create the /path/subpath/in/oss/ subPath on the OSS server for the kubelet.
If you need to create a large number of paths and some paths do not exist when you mount the OSS volume, you can grant the PutObject permission to the user or role. This case occurs when you use subPathExpr to mount OSS volumes.
Solution to cause 2: Use the umask parameter to modify the default permission on the subPath. For example, add -o umask=000 to set the default permission to 777.
Solution to cause 3: Refer to the solution to cause 2 in What do I do if a pod remains in the Terminating state when I fail to unmount a statically provisioned OSS volume from the pod?
Does the capacity setting take effect on OSS volumes? Do I need to expand a volume if the actual capacity of the volume exceeds the capacity setting?
OSS does not limit the size of buckets or subPaths and does not impose any capacity limits. Therefore, the pv.spec.capacity and pvc.spec.resources.requests.storage settings do not take effect. You only need to make sure that the capacity values in the PV and PVC are the same.
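For example, the following is a minimal sketch of matching capacity values in a PV and PVC. The names and the 5Gi value are placeholders; the value is nominal for OSS and only needs to be identical in both objects.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-oss                                # placeholder PV name
spec:
  capacity:
    storage: 5Gi                              # nominal value for OSS
  accessModes:
    - ReadOnlyMany
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: pv-oss
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: oss-pvc                               # placeholder PVC name
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 5Gi                            # must equal the PV capacity above
  volumeName: pv-oss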
You can continue to use a volume as normal when the actual capacity of the volume exceeds the capacity setting. No volume expansion is needed.