This topic describes how to troubleshoot common issues related to storage and provides answers to some frequently asked questions about disk volumes and File Storage NAS (NAS) volumes.
Common issues
Perform the following steps to view the log of a volume plug-in and identify issues.
Run the following command to check whether events related to persistent volume claims (PVCs) or pods are generated:
kubectl get events
Expected output:
LAST SEEN   TYPE     REASON                 OBJECT                                                  MESSAGE
2m56s       Normal   FailedBinding          persistentvolumeclaim/data-my-release-mariadb-0         no persistent volumes available for this claim and no storage class is set
41s         Normal   ExternalProvisioning   persistentvolumeclaim/pvc-nas-dynamic-create-subpath8   waiting for a volume to be created, either by external provisioner "nasplugin.csi.alibabacloud.com" or manually created by system administrator
3m31s       Normal   Provisioning           persistentvolumeclaim/pvc-nas-dynamic-create-subpath8   External provisioner is provisioning volume for claim "default/pvc-nas-dynamic-create-subpath8"
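In clusters that generate many events, you can narrow the output to events that involve PVCs. A minimal example (run it in the namespace of the PVC, or add -A for all namespaces):

```shell
# List only the events whose involved object is a PersistentVolumeClaim
kubectl get events --field-selector involvedObject.kind=PersistentVolumeClaim
```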
Check whether the FlexVolume or CSI plug-in is deployed in the cluster.
Run the following command to check whether the FlexVolume plug-in is deployed in the cluster:
kubectl get pod -n kube-system | grep flexvolume
Expected output:
NAME             READY   STATUS    RESTARTS   AGE
flexvolume-***   4/4     Running   0          23d
Run the following command to check whether the CSI plug-in is deployed in the cluster:
kubectl get pod -n kube-system | grep csi
Expected output:
NAME                  READY   STATUS    RESTARTS   AGE
csi-plugin-***        4/4     Running   0          23d
csi-provisioner-***   7/7     Running   0          14d
Check whether the volume template matches the template of the volume plug-in used in the cluster. The supported volume plug-ins are FlexVolume and CSI.
If this is the first time you mount volumes in the cluster, check whether the driver specified in the persistent volume (PV) and StorageClass is a CSI driver or a FlexVolume driver. The name of the driver that you specified must be the same as the type of the volume plug-in that is deployed in the cluster.
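To verify the match, you can list the provisioner of each StorageClass in the cluster. A quick check (the provisioner naming patterns in the comment are the common ACK conventions; verify them against your cluster):

```shell
# Print each StorageClass together with its provisioner. CSI drivers end with
# "csi.alibabacloud.com" (for example, diskplugin.csi.alibabacloud.com);
# FlexVolume drivers use the "alicloud/..." form (for example, alicloud/disk).
kubectl get sc -o custom-columns=NAME:.metadata.name,PROVISIONER:.provisioner
```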
Check whether the volume plug-in is updated to the latest version.
Run the following command to query the image version of the FlexVolume plug-in:
kubectl get ds flexvolume -n kube-system -o yaml | grep image
Expected output:
image: registry.cn-hangzhou.aliyuncs.com/acs/Flexvolume:v1.14.8.109-649dc5a-aliyun
For more information about FlexVolume, see FlexVolume (Deprecated).
Run the following command to query the image version of the CSI plug-in:
kubectl get ds csi-plugin -n kube-system -o yaml | grep image
Expected output:
image: registry.cn-hangzhou.aliyuncs.com/acs/csi-plugin:v1.18.8.45-1c5d2cd1-aliyun
For more information about the CSI plug-in, see csi-plugin and csi-provisioner.
View logs.
If a PVC of the disk type is in the Pending state, the related PV has not been created. In this case, check the log of the provisioner plug-in.
If the FlexVolume plug-in is deployed in the cluster, run the following command to print the log of alicloud-disk-controller:
podid=$(kubectl get pod -n kube-system | grep alicloud-disk-controller | awk '{print $1}')
kubectl logs $podid -n kube-system
If the CSI plug-in is deployed in the cluster, run the following command to print the log of csi-provisioner:
podid=$(kubectl get pod -n kube-system | grep csi-provisioner | awk '{print $1}')
kubectl logs $podid -n kube-system -c csi-provisioner
Note: Two pods are created to run csi-provisioner. Therefore, the kubectl get pod -n kube-system | grep csi-provisioner | awk '{print $1}' command returns two pod IDs. Run the kubectl logs <PodID> -n kube-system -c csi-provisioner command for each of the two pods.
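To print the logs of both replicated pods in one pass, you can loop over the pod IDs, for example:

```shell
# Iterate over all csi-provisioner pods and print the log of each
for pod in $(kubectl get pod -n kube-system | grep csi-provisioner | awk '{print $1}'); do
  echo "==== $pod ===="
  kubectl logs "$pod" -n kube-system -c csi-provisioner
done
```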
If a mounting error occurs when the system starts a pod, you must check the log of FlexVolume or csi-plugin.
If the FlexVolume plug-in is deployed in the cluster, run the following command to print the log of FlexVolume:
kubectl get pod <pod-name> -owide
Log on to the Elastic Compute Service (ECS) instance where the pod runs and check the FlexVolume log files that match /var/log/alicloud/flexvolume_**.log.
If the CSI plug-in is deployed in the cluster, run the following command to print the log of csi-plugin:
nodeID=$(kubectl get pod <pod-name> -o wide | awk 'NR>1 {print $7}')
podID=$(kubectl get pods -n kube-system -o wide -l app=csi-plugin | grep $nodeID | awk '{print $1}')
kubectl logs $podID -n kube-system
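The awk expressions in these commands only select columns from the tabular kubectl output. The following self-contained sketch shows the extraction on sample output (the pod and node names are made up):

```shell
# Sample `kubectl get pod -o wide` output. NR>1 skips the header row,
# and $7 is the NODE column.
sample='NAME    READY   STATUS    RESTARTS   AGE   IP          NODE                   NOMINATED NODE   READINESS GATES
mypod   1/1     Running   0          1d    10.0.0.12   cn-hangzhou.10.0.0.1   <none>           <none>'
printf '%s\n' "$sample" | awk 'NR>1 {print $7}'   # prints cn-hangzhou.10.0.0.1
```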
View the log of kubelet.
Run the following command to query the node on which the pod runs:
kubectl get pod <pod-name> -owide | awk 'NR>1 {print $7}'
Log on to the node and check the /var/log/messages log file.
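To narrow the system log down to kubelet entries about volume mounting, you can filter it on the node; for example (the grep patterns are only illustrative):

```shell
# Show recent kubelet entries about volume mounting from the system log
grep kubelet /var/log/messages | grep -i mount | tail -n 20
# Equivalent on nodes that log through journald
journalctl -u kubelet --since "1 hour ago" | grep -i mount
```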
Quick recovery
If you fail to mount volumes to most of the pods on a node, you can schedule the pods to other nodes. For more information, see Schedule pods to specific nodes.
csi-plugin update failures
csi-plugin is deployed through a DaemonSet. If nodes that are in the NotReady state or a state other than Running exist in the cluster, Container Service for Kubernetes (ACK) fails to update csi-plugin. You need to manually fix the nodes and perform the update again. For more information, see Manage the CSI plug-in.
csi-plugin startup failures
Issue
csi-provisioner and csi-plugin fail to start. The main container logs of csi-plugin and csi-provisioner report the 403 - Forbidden error.
Cause
Security hardening is enabled for the metadata servers on nodes. The metadata cannot be accessed because CSI does not support security hardening.
Solution
Submit a ticket to contact the ECS team for technical support.
What do I do if the csi-provisioner update fails because the number of nodes in the cluster does not meet the requirements of the update precheck?
Issues
The csi-provisioner plug-in fails to pass the precheck because the number of nodes in the cluster does not meet the requirement.
The csi-provisioner plug-in passes the precheck and can be updated. However, the csi-provisioner pod crashes and the following 403 - Forbidden error is found in the log:
time="2023-08-05T13:54:00+08:00" level=info msg="Use node id : <?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\"\n \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\">\n <head>\n <title>403 - Forbidden</title>\n </head>\n <body>\n <h1>403 - Forbidden</h1>\n </body>\n</html>\n"
Cause
Cause for issue 1:
To ensure the high availability of csi-provisioner, csi-provisioner runs in a primary pod and a secondary pod. The primary and secondary pods are scheduled to different nodes. If your cluster has only one node, you cannot update csi-provisioner.
Cause for issue 2:
The security hardening mode is enabled for the node where csi-provisioner resides. This mode prevents access to the metadata server on the node.
Solutions
Solution for issue 1:
Update csi-provisioner. For more information, see Manage the CSI plug-in.
Solution for issue 2:
Disable the security hardening mode on the node to allow CSI to access the metadata of the node.
What do I do if the csi-provisioner update fails due to StorageClasses attribute changes?
Issue
csi-provisioner fails the precheck because the attributes of StorageClasses do not meet the requirements.
Cause
The attributes of the default StorageClasses were modified, or StorageClasses that use the same names as the default StorageClasses were deleted and recreated with different attributes. The attributes of the default StorageClasses must not be changed. Otherwise, csi-provisioner may fail to be updated.
Solution
Delete the following default StorageClasses: alicloud-disk-essd, alicloud-disk-available, alicloud-disk-efficiency, alicloud-disk-ssd, and alicloud-disk-topology. The deletion operation does not affect the applications in the cluster. Then, reinstall csi-provisioner. After csi-provisioner is reinstalled, the preceding default StorageClasses are automatically recreated.
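The deletion step described above can be performed in one command, for example:

```shell
# Delete the default disk StorageClasses; they are recreated automatically
# after csi-provisioner is reinstalled
kubectl delete sc alicloud-disk-essd alicloud-disk-available \
  alicloud-disk-efficiency alicloud-disk-ssd alicloud-disk-topology
```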
If you want to create custom StorageClasses, use names that are different from the names of the preceding default StorageClasses.
Do StorageClass changes affect existing volumes?
StorageClass changes do not affect existing volumes if the YAML files of the PVCs or PVs are not modified. For example, after you modify the allowVolumeExpansion setting in a StorageClass, the new setting takes effect on an existing volume only when you modify the storage capacity in the YAML file of the PVC.
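For example, assuming the StorageClass allows volume expansion, you can trigger the expansion of an existing volume by raising the requested capacity in the PVC. The PVC name and the size below are placeholders:

```shell
# Raise the requested storage of an existing PVC to trigger volume expansion
kubectl patch pvc <pvc-name> -p '{"spec":{"resources":{"requests":{"storage":"40Gi"}}}}'
```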
What do I do if the "failed to renew lease xxx timed out waiting for the condition" error is displayed in the log of csi-provisioner?
Issue
After you run the kubectl logs csi-provisioner-xxxx -n kube-system command to query the log of csi-provisioner, the failed to renew lease xxx timed out waiting for the condition error appears in the log.
Cause
Multiple replicated pods are provisioned for csi-provisioner to implement high availability. Kubernetes uses Leases to perform a leader election among the replicated pods of a component. During the election, csi-provisioner accesses the Kubernetes API server of the cluster to request the specified Lease. The replicated pod that acquires the Lease becomes the leader and provides services in the cluster. This issue occurs when csi-provisioner cannot access the Kubernetes API server of the cluster.
Solution
Check whether the cluster network and Kubernetes API server of the cluster are in the normal state. If the issue persists, submit a ticket.
OOM issues caused by volume plug-ins
csi-provisioner is a centralized volume plug-in. Its sidecar containers cache information about pods, PVs, and PVCs. As the cluster grows, out of memory (OOM) errors may occur. In this case, modify the resource limits based on the size of the cluster.
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Operations > Add-ons.
On the Add-ons page, click the icon in the lower-right part of the csi-provisioner component and click View in YAML.
Modify the resource limits in the YAML file based on the size of the cluster.
Why does the system prompt no volume plugin matched for the PVC when I create or mount a volume?
Issue
The system prompts Unable to attach or mount volumes: unmounted volumes=[xxx], unattached volumes=[xxx]: failed to get Plugin from volumeSpec for volume "xxx" err=no volume plugin matched for the PVC when you create or mount a volume.
Cause
The volume plug-in does not match the YAML template. As a result, the system cannot find the corresponding volume plug-in when creating or mounting a volume.
Solution
Check whether the volume plug-in exists in the cluster.
If the volume plug-in is not installed, install the plug-in. For more information, see Manage components.
If the volume plug-in is already installed, check whether the volume plug-in matches the YAML templates of the PV and PVC and whether the YAML templates meet the following requirements:
The CSI plug-in is deployed by following the steps as required. For more information, see CSI overview.
The FlexVolume plug-in is deployed by following the steps as required. For more information, see FlexVolume overview.
ImportantFlexVolume is deprecated. If the version of your ACK cluster is earlier than 1.18, we recommend that you migrate from FlexVolume to CSI. For more information, see Migrate from FlexVolume to CSI.
What do I do if a large volume of traffic is recorded in the monitoring data of the csi-plugin pod?
Issue
A large volume of traffic is recorded in the monitoring data of the csi-plugin pod.
Cause
csi-plugin is responsible for mounting NAS volumes to nodes. If a NAS volume is mounted to a pod on a node, requests from the pod to the NAS volume pass through the namespace where csi-plugin is deployed. The requests are monitored by the cluster. As a result, a large volume of traffic is recorded in the monitoring data of the csi-plugin pod.
Solution
You do not need to fix this issue. The volume of traffic that flows through csi-plugin does not double. In addition, the traffic that flows through csi-plugin does not consume additional network bandwidth.
Why does the system generate the 0/x nodes are available: x pod has unbound immediate PersistentVolumeClaims event for a pod?
Issue
The system generates the 0/x nodes are available: x pod has unbound immediate PersistentVolumeClaims. preemption: 0/x nodes are available: x Preemption is not helpful for scheduling event for a pod.
Cause
The custom StorageClass referenced by the pod is not found because the custom StorageClass does not exist.
Solution
If the pod uses a dynamically provisioned volume, find the custom StorageClass that is referenced by the pod. If the StorageClass does not exist, create one.
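A quick way to check whether the referenced StorageClass exists (the PVC name is a placeholder):

```shell
# Read the StorageClass that the PVC references, then verify that it exists
sc=$(kubectl get pvc <pvc-name> -o jsonpath='{.spec.storageClassName}')
kubectl get sc "$sc"
```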
What do I do if the PV is in the Released state and cannot be bound to the recreated PVC?
Issue
You accidentally deleted the PVC. The PV is in the Released state and cannot be bound to the PVC that you recreated.
Cause
If the reclaim policy (persistentVolumeReclaimPolicy) of the PV is Retain, the status of the PV changes to Released after you delete the PVC.
Solution
Delete the pv.spec.claimRef field from the PV and then bind the PV to the recreated PVC as a statically provisioned volume. The status of the PV then changes to Bound.
For more information about how to bind a statically provisioned PV that uses a NAS file system, see Use a statically provisioned NAS volume.
For more information about how to bind a statically provisioned PV that uses an Object Storage Service (OSS) bucket, see Use a statically provisioned OSS volume.
For more information about how to bind a statically provisioned PV that uses a disk, see Use a statically provisioned disk volume.
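The claimRef removal described above can be done with a JSON patch, for example (the PV name is a placeholder):

```shell
# Remove the stale claimRef so that the Released PV can bind to the new PVC
kubectl patch pv <pv-name> --type json -p '[{"op":"remove","path":"/spec/claimRef"}]'
```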
What do I do if the PV is in the Lost state and cannot be bound to the recreated PVC?
Issue
After the PVC and PV are created, the PV remains in the Lost state and cannot be bound to the PVC.
Cause
The PVC name that is specified in the claimRef field of the PV does not exist. As a result, the status of the PV changes to Lost.
Solution
Delete the pv.spec.claimRef field from the PV and then bind the PV to the PVC as a statically provisioned volume. The status of the PV then changes to Bound.
For more information about how to bind a statically provisioned PV that uses a NAS file system, see Use a statically provisioned NAS volume.
For more information about how to bind a statically provisioned PV that uses an Object Storage Service (OSS) bucket, see Use a statically provisioned OSS volume.
For more information about how to bind a statically provisioned PV that uses a disk, see Use a statically provisioned disk volume.
FAQ about migrating from FlexVolume to CSI
In earlier ACK versions, FlexVolume is used as the volume plug-in. FlexVolume is deprecated in later versions. If the version of your ACK cluster is earlier than 1.18, we recommend that you migrate from FlexVolume to CSI. For more information, see Migrate from FlexVolume to CSI.
Other StorageClass issues
If the mountOptions parameter contains spelling errors, the StorageClass referenced by a PVC does not exist, or the domain name of a mount target does not exist, the volume fails to be mounted. In these cases, we recommend that you use Container Network File System (CNFS) volumes. For more information about CNFS, see CNFS overview.
Can multiple applications in a cluster use the same volume?
Disk volumes: not supported
A disk volume can be mounted only to one pod and cannot be used by multiple applications.
NAS and OSS volumes: supported
NAS and OSS volumes can be shared by multiple pods. This means that a PVC can be used by multiple applications at the same time. For more information about the limits on concurrent writes to NAS, see How do I prevent exceptions that may occur when multiple processes or clients concurrently write data to a log file? and How do I resolve the latency in writing data to an NFS file system?
For more information about how to mount a NAS volume, see Use CNFS to manage NAS file systems (recommended), Mount a statically provisioned NAS volume, and Mount a dynamically provisioned NAS volume.
For more information about how to mount OSS volumes, see Mount a statically provisioned OSS volume. For more information about how to use CNFS to mount dynamically provisioned OSS volumes, see Manage the lifecycle of OSS buckets.
How do I change the configurations of the StorageClasses automatically created for a disk?
You cannot modify the StorageClasses that are automatically created.
After csi-provisioner is installed, StorageClasses such as alicloud-disk-topology-alltype are automatically created in the cluster. Do not modify these StorageClasses. For more information about the StorageClasses of disks, see StorageClass. If you need to modify the configurations of a StorageClass, such as the volume type, performance, and reclaim policy, you can create a new StorageClass. The number of StorageClasses that you can create is unlimited. For more information, see Create a StorageClass.