This topic describes the diagnostic procedure for pods and how to troubleshoot pod errors. This topic also provides answers to some frequently asked questions about pods.
To learn how to troubleshoot pod issues in the console, see Troubleshoot in the console. In the console, you can view the status, basic information, configuration, events, and logs of pods, use a terminal to access containers, and enable pod diagnostics.
Diagnostic procedure
If a pod does not run as expected, you can identify the cause by checking the events, logs, and configuration of the pod. The following figure shows the procedure.
Phase 1: Scheduling issues
The pod failed to be scheduled
If the pod remains in the Pending state for a long period of time, it means that the pod failed to be scheduled. The following table describes the possible causes.
Error message | Description | Suggested solution |
| The cluster does not have nodes available for pod scheduling. |
|
| The cluster does not have nodes that can fulfill the CPU or memory request of the pod. | On the Nodes page, view the pod, CPU, and memory usage to check the cluster resource utilization. Note: When the CPU and memory utilization of a node remains low, scheduling a new pod to the node does not exhaust the node's resources. However, the scheduler still checks whether the new pod may cause a resource shortage on the node during peak hours and tries to avoid improper scheduling. If the CPU or memory resources in the cluster are exhausted, use the following methods:
|
| The cluster does not have nodes that match the node affinity rules (nodeSelector) or the affinity and anti-affinity rules (podAffinity and podAntiAffinity) of the pod. |
|
| The volume used by the pod encounters a node affinity conflict because disk volumes cannot be mounted across zones. Consequently, the pod fails to be scheduled. |
|
| The disk type is not supported by Elastic Compute Service (ECS) instances. | Refer to Overview of instance families and check the disk types supported by ECS instances. Mount a disk that is supported by ECS instances to the pod. |
| The pod failed to be scheduled to the desired node because the node has a taint. |
|
| The node does not have sufficient ephemeral storage space. |
|
| The persistent volume claim (PVC) failed to be bound to the pod. | Check whether the PVC or PV specified for the pod is created. You can run the |
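For scheduling failures, the following commands are a minimal sketch of the first checks you can run. The pod, namespace, and node names are placeholders.

kubectl describe pod <pod-name> -n <namespace>                  # View the scheduling events in the Events section.
kubectl get nodes -o wide                                       # Check the status of the nodes in the cluster.
kubectl describe node <node-name> | grep -A 5 -i allocatable    # Check the allocatable resources of a node.
kubectl describe node <node-name> | grep -i taints              # Check whether the node has taints.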
The pod is already scheduled to a node
If the pod is already scheduled to a node but remains in the Pending state, refer to the following solution.
- Check whether the pod is configured with a `hostPort`: If the pod is configured with a `hostPort`, only one pod that uses this `hostPort` can run on each node. Therefore, the value of `Replicas` in a Deployment or ReplicationController must not exceed the number of nodes in the cluster. If the host port of a node is used by another application, the pod fails to be scheduled to the node. The `hostPort` setting increases the complexity of pod management and scheduling. We recommend that you use Services to access pods. For more information, see Services. You can check which pods use a `hostPort` as shown in the sketch after these steps.
- If the pod is not configured with a `hostPort`, perform the following steps:
  1. Run the `kubectl describe pod <pod-name>` command to view the events of the pod and troubleshoot the issue. The events may display the cause of the pod startup failure, such as image pulling failures, insufficient resources, limits due to security policies, or configuration errors.
  2. If you cannot locate the cause in the events, check the kubelet logs on the node. You can run the `grep -i <pod name> /var/log/messages* | less` command to check for log entries that contain the pod name in the system log files (`/var/log/messages*`).
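The following commands are a minimal sketch for checking hostPort usage in the cluster. The jsonpath expression and the placeholder names are examples, not part of any specific configuration.

# List pods and the hostPort values that their containers declare. An empty third column means that no hostPort is used.
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.containers[*].ports[*].hostPort}{"\n"}{end}'
kubectl describe pod <pod-name>     # View the scheduling events of the pending pod.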
Phase 2: Image pulling issues
Error message | Description | Suggested solution |
| Access to the image repository is denied because | Check whether the Secret specified in the If Container Registry is used, you can use the aliyun-acr-credential-helper component to pull images without a password. For more information, see Use the aliyun-acr-credential-helper component to pull images without using a secret. |
| Failed to parse the image address when pulling an image from the specified image repository over HTTP. |
|
| The node does not have sufficient disk space. | Refer to Methods for connecting to an ECS instance, log on to the node of the pod, and run the |
| The third-party image repository uses a certificate signed and issued by an unknown or untrusted certificate authority. |
|
| The operation is canceled because the image file is too large. Kubernetes has a default timeout period for image pulling. If the progress of image pulling is not updated within a specific period of time, Kubernetes considers that an error occurred or that no response was returned, and cancels the operation. |
|
| Failed to connect to the image repository. |
|
| Docker Hub sets a rate limit on image pulling requests. | Upload the image to Container Registry and pull the image from Container Registry. |
| The upper limit of image pulling by using the kubelet may be reached. | Refer to Customize the kubelet parameters of a node pool and modify the maximum image repository QPS (registryPullQPS) and the maximum size of a burst of image pulling (registryBurst). |
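If images fail to pull because authentication to the repository fails, the following sketch shows a common way to create an image pull Secret with kubectl and reference it in the pod spec. The Secret name, repository address, and credentials are placeholders.

# Create a Secret that stores the image repository credentials.
kubectl create secret docker-registry my-pull-secret \
  --docker-server=<registry-address> \
  --docker-username=<username> \
  --docker-password=<password> \
  -n <namespace>
# Then reference the Secret in the pod spec, for example:
#   spec:
#     imagePullSecrets:
#     - name: my-pull-secret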
Phase 3: Startup issues
The pod is in the Init state
Error message | Description | Suggested solution |
The pod remains in the | The pod contains M init containers. N init containers have completed, and M-N init containers have not completed. |
For more information about init containers, see Debug init containers. |
The pod remains in the | Init containers in the pod failed to start. | |
The pod remains in the | Init containers in the pod failed to start and repeatedly restart.
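To locate the failing init container, you can check its state and logs with commands similar to the following sketch. The pod, container, and namespace names are placeholders.

kubectl describe pod <pod-name> -n <namespace>                                # Check the state, exit code, and restart count of each init container.
kubectl logs <pod-name> -c <init-container-name> -n <namespace>               # View the logs of a specific init container.
kubectl logs <pod-name> -c <init-container-name> -n <namespace> --previous    # View the logs of the previous failed run, if any.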
The pod is being created (Creating)
Error message | Description | Suggested solution |
| This error is expected due to the design of the Flannel network plug-in. | Update Flannel to v0.15.1.11-7e95fe23-aliyun or later. For more information, see Flannel. |
If the cluster runs a Kubernetes version earlier than 1.20 and a pod repeatedly restarts or pods created by a CronJob complete their tasks and exit after a short period of time, an IP leak may occur. | Update the Kubernetes version of the cluster to 1.20 or later. We recommend that you update to the latest version. For more information, see Manually upgrade ACK clusters. |
containerd and runC have defects. | For more information, see What do I do if a pod fails to be started and the "no IP addresses available in range" error message appears? | |
| The internal database maintained by Terway to track and manage elastic network interfaces (ENIs) on the node of the pod is inconsistent with the actual network device configuration. Consequently, ENIs fail to be allocated. |
|
| Terway failed to request an IP address from the vSwitch. |
|
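Because pods that are stuck in the Creating state are often related to the network plug-in, you can first check whether the Terway or Flannel pods in the kube-system namespace are healthy. A minimal sketch with placeholder names:

kubectl get pods -n kube-system -o wide | grep -E 'terway|flannel'   # Check the network plug-in pods and the nodes that they run on.
kubectl logs -n kube-system <network-plug-in-pod-name>               # View the logs of the network plug-in pod on the affected node.
kubectl describe pod <pod-name> -n <namespace>                       # View the sandbox and IP allocation events of the pending pod.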
The pod failed to start (CrashLoopBackOff)
Error message | Description | Suggested solution |
The log displays |
| |
The event information displays | A health check failure occurred. | Check the liveness probe policy of the pod. Make sure that the liveness probing result can indicate the actual status of the applications running in the containers of the pod. |
The pod log displays | The disk space is insufficient. |
|
The pod failed to start and no event is generated | The resource limits of the pod are lower than the resources requested by the pod. Consequently, containers in the pod failed to start. | Check the resource configuration of the pod. You can enable resource profiling to obtain suggested resource requests and resource limits. |
The pod log displays | Containers in the pod encounter a port conflict. |
|
The pod log displays | A Secret is mounted to the workload, but the key values of the Secret are not Base64-encoded. |
|
A business issue exists. | Locate the cause based on the pod log. |
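To find out why a container repeatedly restarts, you can inspect the logs of the previous run and the last termination state, as shown in the following sketch. The pod and namespace names are placeholders.

kubectl logs <pod-name> -n <namespace> --previous    # View the logs of the last crashed container instance.
kubectl describe pod <pod-name> -n <namespace>       # Check Last State, Exit Code, and the restart count.
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'   # Show the termination reason, such as Error or OOMKilled.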
Phase 4: Pod operation issues
OOM
If the memory usage of a container in the cluster exceeds the specified memory limit, the container may be terminated and trigger an OOM event, which causes the container to exit. For more information about OOM events, see Allocate memory resources to containers and pods.
If the terminated process is critical to the container, the container may restart.
If an OOM error occurs, log on to the console and navigate to the pod details page. On the Events tab, you can view the following OOM event: pod was OOM killed.
If the cluster is configured with an alert rule for container replica exceptions, you will receive an alert when an OOM event occurs. For more information, see Alert management.
OOM level | Description | Suggested solution |
OS level | The kernel log ( | Possible causes are insufficient system global memory, insufficient node memory, or insufficient buddy system memory during memory fragmentation. For more information about the causes of memory shortage, see Possible causes. For more information about the solutions, see Solutions. |
cgroup level | The kernel log ( | If the process runs as normal, increase the memory limit of the pod accordingly. Make sure that the actual memory usage of the pod does not exceed 80% of the memory limit. For more information, see Modify the upper limit and lower limit of CPU and memory resources for pods. You can enable resource profiling to obtain suggested resource requests and resource limits for containers. |
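To confirm whether a container was terminated by the OOM killer and at which level, you can check the container status and the kernel log, as shown in the following sketch. The node command assumes that you have already logged on to the node.

kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'   # Returns OOMKilled if the container was terminated by the OOM killer.
# On the node:
dmesg -T | grep -i -E 'out of memory|oom-killer'     # Check the kernel log for OOM records.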
Terminating
Possible cause | Description | Suggested solution |
The node is abnormal and in the NotReady state. | After the node recovers and returns to the Ready state, the pods in the Terminating state on the node are automatically deleted. |
Finalizers are configured for the pod. | If finalizers are configured for the pod, Kubernetes performs the cleanup operations specified by the finalizers before deleting the pod. If no responses are returned for the cleanup operations, the pod remains in the Terminating state. | Run the |
The preStop hook of the pod is invalid. | If the preStop hook is configured for the pod, Kubernetes performs the operations specified by the preStop hook before terminating the pod. The pod is at the preStop stage and will enter the Terminating state. | Run the |
The pod is configured with a graceful shutdown period. | If the pod is configured with a graceful shutdown period ( | After the containers gracefully exit, Kubernetes deletes the pod. |
Containers do not respond. | After you initiate a request to terminate or delete the pod, Kubernetes sends the |
|
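The following sketch shows common checks for a pod that is stuck in the Terminating state, and a force deletion as a last resort. Use force deletion with caution because the containers may still be running on the node.

kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.metadata.finalizers}'                 # Check whether finalizers are configured for the pod.
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.terminationGracePeriodSeconds}'  # Check the graceful shutdown period.
kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force                          # Forcibly delete the pod as a last resort.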
Evicted
Possible cause | Description | Suggested solution |
The node does not have sufficient resources, such as memory or disk space. Consequently, the kubelet evicts one or more pods on the node to reclaim resources. | The node may not have sufficient memory, disk space, or PIDs. Run the
|
|
An unexpected eviction occurs. | The node that hosts the pod has the NoExecute taint. Consequently, an unexpected eviction occurs. | Run the |
The pod is not evicted as expected. |
| In a small cluster that contains no more than 50 nodes, if more than 55% of the nodes are unhealthy, pod eviction is stopped. For more information, see Rate limits on eviction. |
In a large cluster that contains more than 50 nodes, if the ratio of unhealthy nodes to the total number of nodes exceeds the value of | |
The pod is still frequently scheduled to the original node after it is evicted. | Pods are evicted from a node based on the resource usage of the node. The pod scheduling rule depends on the allocated resources on the node. An evicted pod may be scheduled to the original node again. | Check whether the resource requests of the pod are properly configured based on the allocatable resources on the node. For more information, see Modify the upper and lower limits of CPU and memory resources for a pod. You can enable resource profiling to obtain suggested resource requests and resource limits for containers. |
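To list evicted pods and check why the kubelet started evictions, you can use commands similar to the following sketch. The pod, namespace, and node names are placeholders.

kubectl get pods -A --field-selector=status.phase=Failed      # List failed pods; evicted pods show Evicted in the STATUS column.
kubectl describe pod <pod-name> -n <namespace>                # The Status and Message fields show the eviction reason, such as insufficient memory or disk space.
kubectl describe node <node-name> | grep -A 8 -i conditions   # Check node conditions such as MemoryPressure and DiskPressure.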
Completed
When a pod is in the Completed state, all containers in the pod have started and all processes in the containers have exited successfully. The Completed state typically occurs in pods created by Jobs and in init containers.
Other frequently asked questions
The pod remains in the Running state but does not run as expected
If the pod YAML file contains errors, the pod may remain in the Running state but not run as expected. To address this issue, perform the following steps.
Inspect the configurations of the pod and check whether the containers in the pod are configured as expected.
Use the following method to check whether the field names and keys in the pod YAML file contain spelling errors.
If a key contains a spelling error, for example, `command` is spelled as `commnd`, the cluster ignores the misspelled field and creates pods based on the YAML file. However, containers cannot execute the commands specified in the YAML file.
The following example describes how to identify spelling errors if you spell `command` as `commnd`:
1. Add `--validate` to the `kubectl apply -f` command and run the `kubectl apply --validate -f XXX.yaml` command.
If a spelling error exists, the `XXX] unknown field: commnd XXX] this may be a false alarm, see https://gXXXb.XXX/6842pods/test` message is displayed.
2. Run the following command to compare the pod.yaml file that you exported with the YAML file used to create pods (see the sketch after these steps).
Note: Replace `[$Pod]` with the name of the abnormal pod. You can run the `kubectl get pods` command to obtain the name.
kubectl get pods [$Pod] -o yaml > pod.yaml
- If the pod.yaml file contains more lines than the original YAML file used to create pods, the pods are created as expected.
- If the YAML lines used to create the pods are not found in the pod.yaml file, the original YAML file contains spelling errors.
Check the log of the pod and troubleshoot the issue based on the log data.
You can log on to a container of the pod by using the terminal and check the local files in the container.
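The comparison in the preceding steps can be performed with commands similar to the following sketch. The file name original.yaml is a placeholder for the YAML file that you used to create the pods.

kubectl get pods [$Pod] -o yaml > pod.yaml    # Export the configuration of the running pod.
diff pod.yaml original.yaml                   # Lines that exist only in original.yaml may indicate misspelled fields that the cluster dropped.
kubectl logs [$Pod] --tail=100                # Check the recent log entries of the pod.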
A network disconnection issue occasionally occurs when a pod accesses a database
If a network disconnection issue occasionally occurs when a pod in the ACK cluster accesses a database, you can perform the following operations to troubleshoot the issue.
1. Check the pod
View the events of the pod and check for unstable connection events, such as events related to network exceptions, restarts, and insufficient resources.
View the logs of the pod and check for database connection errors, such as connection timeouts, authentication failures, or reconnection attempts.
View the CPU and memory usage of the pod. Make sure that the application or database driver does not exit unexpectedly due to insufficient resources.
View the resource requests and limits of the pod. Make sure that the pod has sufficient CPU and memory resources.
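The checks in this step can be performed with commands similar to the following sketch. The pod and namespace names are placeholders, and kubectl top requires the metrics-server component.

kubectl describe pod <pod-name> -n <namespace>     # View events such as restarts and probe failures.
kubectl logs <pod-name> -n <namespace> --since=1h  # Check the recent application logs for connection errors.
kubectl top pod <pod-name> -n <namespace>          # View the current CPU and memory usage of the pod.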
2. Check the node
Check the resource usage of the node and make sure that resources such as memory and disk space are sufficient. For more information, see Monitor nodes.
Test whether network disconnection occasionally occurs between the node and the database.
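A minimal sketch for the node checks. The node name, database address, and port are placeholders.

kubectl top node <node-name>                  # Check the resource usage of the node (requires metrics-server).
# On the node, test connectivity to the database:
ping -c 10 <database-address>                 # Check for packet loss and latency fluctuations.
curl -v telnet://<database-address>:<port>    # Check whether the database port is reachable.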
3. Check the database
Check the status and performance metrics of the database to ensure that no restarts or performance bottlenecks exist.
Check the number of abnormal connections and the connection timeout settings, and modify the settings based on your business requirements.
Analyze the database logs related to disconnections from the database.
4. Check the status of the cluster components
Cluster component exceptions affect the communication between pods and other components in the cluster. Run the following command to check the status of the components in the ACK cluster:
kubectl get pod -n kube-system # View the component status.
Check the network components:
CoreDNS: Check the status and logs of the component. Make sure that the pod can resolve the address of the database service.
Flannel: Check the status and logs of kube-flannel.
Terway: Check the status and logs of terway-eniip.
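To verify that DNS resolution works from inside the cluster, you can run a temporary test pod similar to the following sketch. The busybox image, the database address, and the k8s-app=kube-dns label (commonly used by CoreDNS pods) are assumptions that may differ in your cluster.

kubectl run dns-test --rm -it --image=busybox:1.36 --restart=Never -- nslookup <database-address>   # Resolve the database address from inside the cluster.
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50                                           # Check the CoreDNS logs for resolution errors.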
5. Analyze the network traffic
You can use tcpdump to capture packets and analyze network traffic to help locate the cause.
Run the following command to identify the pod and node where the database disconnection issue occurred.
kubectl get pod -n [namespace] -o wide
Log on to the node. For more information, see Methods for connecting to an ECS instance.
Run the following commands to query the container PID based on different Kubernetes versions.
containerd (Kubernetes versions later than 1.22)
Run the following command to view the `CONTAINER` ID:
crictl ps | grep <Pod name keyword>
Expected output:
CONTAINER        IMAGE            CREATED       STATE
a1a214d2*****    35d28df4*****    2 days ago    Running
Specify the `CONTAINER ID` parameter and run the following command to view the container PID:
crictl inspect a1a214d2***** | grep -i PID
Expected output:
"pid": 2309838, # The PID of the container. "pid": 1 "type": "pid"
Docker (Kubernetes 1.22 and earlier)
Run the following command to view the `CONTAINER ID`:
docker ps | grep <Pod name or keyword>
Expected output:
CONTAINER ID     IMAGE            COMMAND
a1a214d2*****    35d28df4*****    "/nginx
Specify the `CONTAINER ID` parameter and run the following command to view the PID of the container:
docker inspect a1a214d2***** | grep -i PID
Expected output:
"Pid": 2309838, # The PID of the container. "PidMode": "", "PidsLimit": null,
Run the packet capture command.
Run the following command based on the container PID to capture packets transmitted between the pod and database:
nsenter -t <Container PID> tcpdump -i any -n -s 0 tcp and host <IP address of the database>
Run the following command based on the container PID to capture packets transmitted between the pod and host:
nsenter -t <Container PID> tcpdump -i any -n -s 0 tcp and host <IP address of the node>
Run the following command to capture packets transmitted between the host and database.
tcpdump -i any -n -s 0 tcp and host <IP address of the database>
6. Optimize the application
Configure your application to support the automatic reconnection mechanism to ensure that the application can automatically restore the connection without manual intervention when the database is changed or migrated.
Use persistent connections instead of short-lived connections to communicate with the database. Persistent connections can significantly reduce performance loss and resource consumption, and enhance the overall system efficiency.
Troubleshoot in the console
You can log on to the ACK console, navigate to the cluster details page, and then troubleshoot abnormal pods.
Operation | Layout of the console |
Check the status of a pod |
|
Check the basic information of a pod |
|
Check the configuration of a pod |
|
Check the events of a pod |
|
Check the logs of a pod |
Note: ACK clusters are integrated with Simple Log Service. You can enable Simple Log Service for your cluster to quickly collect container logs. For more information, see Collect text logs from Kubernetes containers in DaemonSet mode. |
Check the monitoring information about a pod |
Note: ACK clusters are integrated with Managed Service for Prometheus. You can enable Managed Service for Prometheus for an ACK cluster to monitor the cluster and containers in the cluster in real time. After you enable Managed Service for Prometheus, you can view metrics displayed on Grafana dashboards. For more information, see Managed Service for Prometheus. |
Log on to a container of a pod by using the terminal and view local files in the container |
|
Enable pod diagnostics |
Note: Container Intelligent Service provides the cluster diagnostics feature that allows you to diagnose pods, Services, and Ingresses with one click and helps you locate the cause. For more information, see Work with cluster diagnostics. |