Container Service for Kubernetes: Pod troubleshooting

Last Updated: May 13, 2024

This topic describes the diagnostic procedure for pods and how to troubleshoot pod errors. This topic also provides answers to some frequently asked questions about pods.

Table of contents

  • Diagnostic procedure

  • Common troubleshooting methods

  • FAQ and solutions

Diagnostic procedure

  1. Check whether the pod runs as normal. For more information, see Check the status of a pod.

    1. If the pod does not run as normal, you can identify the cause by checking the events, log, and configurations of the pod. For more information, see Common troubleshooting methods. For more information about the abnormal states of pods and how to troubleshoot pod errors, see Abnormal states of pods and troubleshooting.

    2. If the pod is in the Running state but does not run as normal, see Pods remain in the Running state but do not run as normal.

  2. If an out of memory (OOM) error occurs in the pod, see Troubleshoot OOM errors in pods.

  3. If the issue persists, submit a ticket.
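
If you manage the cluster with kubectl instead of the console, the following commands cover the same first-pass checks. This is only a sketch: [$Pod] and [$namespace] are placeholders for the pod name and the namespace.

    kubectl get pod [$Pod] -n [$namespace] -o wide    # Check the status of the pod and the node that hosts it.
    kubectl describe pod [$Pod] -n [$namespace]       # Check the events and container states of the pod.
    kubectl logs [$Pod] -n [$namespace]               # Check the log of the pod.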

Abnormal states of pods and troubleshooting

The following list describes the abnormal states of pods, what each state means, and the section that describes the corresponding solution.

  • Pending: The pod is not scheduled to a node. See Pods remain in the Pending state.

  • Init:N/M: The pod contains M init containers and only N of them have started. See Pods remain in the Init:N/M, Init:Error, or Init:CrashLoopBackOff state.

  • Init:Error: The init containers in the pod fail to start up. See Pods remain in the Init:N/M, Init:Error, or Init:CrashLoopBackOff state.

  • Init:CrashLoopBackOff: The init containers in the pod are stuck in a startup loop. See Pods remain in the Init:N/M, Init:Error, or Init:CrashLoopBackOff state.

  • Completed: The containers in the pod have completed the startup command and exited. See Pods remain in the Completed state.

  • CrashLoopBackOff: The pod is stuck in a startup loop. See Pods remain in the CrashLoopBackOff state.

  • ImagePullBackOff: The pod fails to pull the container image. See Pods remain in the ImagePullBackOff state.

  • Running: The pod works as normal and no operation is required, or the pod is running but does not work as normal. In the latter case, see Pods remain in the Running state but do not run as normal.

  • Terminating: The pod is being terminated. See Pods remain in the Terminating state.

  • Evicted: The pod is evicted from its node. See Pods remain in the Evicted state.

Common troubleshooting methods

Check the status of a pod

  1. Log on to the ACK console.

  2. In the left-side navigation pane of the ACK console, click Clusters.

  3. On the Clusters page, find the cluster that you want to manage and click the name of the cluster or click Details in the Actions column. The details page of the cluster appears.

  4. In the left-side navigation pane of the details page, choose Workloads > Pods.

  5. In the upper-left corner of the Pods page, select the namespace to which the pod belongs and check the status of the pod.

Check the details of a pod

  1. Log on to the ACK console.

  2. In the left-side navigation pane of the ACK console, click Clusters.

  3. On the Clusters page, click the name of the cluster or click Details in the Actions column of the cluster.

  4. In the left-side navigation pane of the details page, choose Workloads > Pods.

  5. In the upper-left corner of the Pods page, select the namespace to which the pod belongs. In the list of pods, find the pod and click the name of the pod or click View Details in the Actions column to view information about the pod. You can view the name, image, and IP address of the pod and the node that hosts the pod.

Check the configurations of a pod

  1. Log on to the ACK console.

  2. In the left-side navigation pane of the ACK console, click Clusters.

  3. On the Clusters page, find the cluster that you want to manage and click the name of the cluster or click Details in the Actions column. The details page of the cluster appears.

  4. In the left-side navigation pane of the details page, choose Workloads > Pods.

  5. In the upper-left corner of the Pods page, select the namespace to which the pod belongs. In the list of pods, find the pod and click the name of the pod or click View Details in the Actions column.

  6. In the upper-right corner of the pod details page, click Edit to view the YAML file and configurations of the pod.

Check the events of a pod

  1. Log on to the ACK console.

  2. In the left-side navigation pane of the ACK console, click Clusters.

  3. On the Clusters page, find the cluster that you want to manage and click the name of the cluster or click Details in the Actions column. The details page of the cluster appears.

  4. In the left-side navigation pane of the details page, choose Workloads > Pods.

  5. In the upper-left corner of the Pods page, select the namespace to which the pod belongs. In the list of pods, find the pod and click the name of the pod or click View Details in the Actions column.

  6. In the lower part of the pod details page, click the Events tab to view events of the pod.

    Note

    By default, Kubernetes retains the events that occurred within the previous hour. If you want to retain events that occurred within a longer time period, see Create and use an event center.
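
If you use kubectl, you can view the same events from the command line. This is a sketch: [$Pod] and [$namespace] are placeholders.

    kubectl describe pod [$Pod] -n [$namespace]                                      # The Events section is at the end of the output.
    kubectl get events -n [$namespace] --field-selector involvedObject.name=[$Pod]   # List only the events that reference the pod.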

Check the log of a pod

  1. Log on to the ACK console.

  2. In the left-side navigation pane of the ACK console, click Clusters.

  3. On the Clusters page, find the cluster that you want to manage and click the name of the cluster or click Details in the Actions column. The details page of the cluster appears.

  4. In the left-side navigation pane of the details page, choose Workloads > Pods.

  5. In the upper-left corner of the Pods page, select the namespace to which the pod belongs. In the list of pods, find the pod and click the name of the pod or click View Details in the Actions column.

  6. In the lower part of the pod details page, click the Logs tab to view the log data of the pod.

    Note

    Container Service for Kubernetes (ACK) is integrated with Simple Log Service. When you create an ACK cluster, you can enable Simple Log Service to collect log data from containers of the ACK cluster. Kubernetes writes log data to the standard output and text files. For more information, see Collect log data from containers by using Simple Log Service.
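
If you use kubectl, the following commands show the same log data. This is a sketch: [$Pod], [$namespace], and the container name are placeholders.

    kubectl logs [$Pod] -n [$namespace]                       # View the log of the pod.
    kubectl logs [$Pod] -n [$namespace] -c <container-name>   # Specify a container if the pod contains more than one container.
    kubectl logs [$Pod] -n [$namespace] --previous            # View the log of the previous container instance after a restart.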

Check the monitoring information about a pod

  1. Log on to the ACK console.

  2. In the left-side navigation pane of the ACK console, click Clusters.

  3. On the Clusters page, find the cluster that you want to manage and click the name of the cluster or click Details in the Actions column. The details page of the cluster appears.

  4. In the left-side navigation pane of the cluster details page, choose Operations > Prometheus Monitoring.

  5. On the Prometheus Monitoring page, click the Cluster Overview tab to view the following monitoring information about pods: CPU usage, memory usage, and network I/O.
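
If the metrics-server component is installed in the cluster, you can also query resource usage from the command line. This is a sketch: [$Pod] and [$namespace] are placeholders.

    kubectl top pod [$Pod] -n [$namespace]   # View the CPU and memory usage of the pod. Requires metrics-server.
    kubectl top node                         # View the CPU and memory usage of the nodes.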

Log on to a container by using the terminal

  1. Log on to the ACK console.

  2. In the left-side navigation pane of the ACK console, click Clusters.

  3. On the Clusters page, find the cluster that you want to manage and click the name of the cluster or click Details in the Actions column. The details page of the cluster appears.

  4. In the left-side navigation pane of the details page, choose Workloads > Pods.

  5. On the Pods page, find the pod that you want to manage and click Terminal in the Actions column.

    You can log on to a container of the pod by using the terminal and view local files in the container.
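
If you use kubectl, you can open an interactive shell in the container instead. This is a sketch: [$Pod], [$namespace], and the container name are placeholders.

    kubectl exec -it [$Pod] -n [$namespace] -c <container-name> -- /bin/sh   # Use /bin/bash if it is available in the image.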

Pod diagnostics

  1. Log on to the ACK console.

  2. In the left-side navigation pane of the ACK console, click Clusters.

  3. On the Clusters page, find the cluster that you want to manage and click the name of the cluster or click Details in the Actions column. The details page of the cluster appears.

  4. In the left-side navigation pane of the details page, choose Workloads > Pods.

  5. On the Pods page, find the pod that you want to manage and click Diagnose in the Actions column.

    After the pod diagnostic is completed, you can view the diagnostic result and troubleshoot the issue. For more information, see Work with cluster diagnostics.

Pods remain in the Pending state

Cause

If a pod remains in the Pending state, the pod cannot be scheduled to a node. This issue occurs if the resources on which the pod depends are missing, the cluster does not have sufficient resources, the pod uses a hostPort that is unavailable, or the nodes are configured with taints for which the pod has no toleration rules.

Symptom

The pod remains in the Pending state.

Solution

Check the events of the pod and identify the reason why the pod cannot be scheduled to a node based on the events. Possible causes:

  • Resource dependency

    Some pods cannot be created without specific cluster resources, such as ConfigMaps and persistent volume claims (PVCs). For example, before you specify a PVC for a pod, you must associate the PVC with a persistent volume (PV).

  • Insufficient resources

    1. On the cluster details page, choose Nodes > Nodes. On the Nodes page, check the usage of the following resources in the cluster: pod, CPU, and memory.

      Note

      Even if the current CPU and memory usage on a node is low, the scheduler does not schedule a pod to the node if doing so would cause the total resource requests on the node to exceed the allocatable resources of the node. This prevents the resources on the node from being exhausted during peak hours.

    2. If the CPU or memory resources in the cluster are exhausted, you can add nodes to the cluster, scale in workloads that are no longer needed, or reduce the resource requests of the pod.

  • Use of hostPort

    If you configure a hostPort for a pod, the number of replicas that you specify for the Deployment or ReplicationController cannot be greater than the number of nodes in the cluster, because each host port can be used by only one pod on each node. If the specified host port on a node is already used by another application, the pod cannot be scheduled to that node. We recommend that you do not use hostPort. Instead, create a Service and use the Service to access the pod. For more information, see Service.

  • Taints and toleration rules

    If the events of the pod contain Taints or Tolerations, the pod fails to be scheduled because of taints on the nodes. You can delete the taints or configure toleration rules for the pod, as shown in the sketch after this list. For more information, see Manage taints, Create a stateless application by using a Deployment, and Taints and Tolerations.
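
The following kubectl sketch shows how to check the most common causes from the command line. [$Pod], [$namespace], and the node and taint names are placeholders.

    kubectl describe pod [$Pod] -n [$namespace]                  # The Events section shows why the pod cannot be scheduled.
    kubectl describe node <node-name> | grep -A 5 Allocatable    # Check the allocatable CPU and memory of a node.
    kubectl taint nodes <node-name> <taint-key>:<effect>-        # Remove a taint. Note the trailing hyphen.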

Pods remain in the Init:N/M state, Init:Error state, or Init:CrashLoopBackOff state

Cause

  • If a pod remains in the Init:N/M state, the pod contains M init containers, N init containers have started, and the remaining M-N init containers have not started.

  • If a pod remains in the Init:Error state, the init containers in the pod fail to start up.

  • If a pod remains in the Init:CrashLoopBackOff state, the init containers in the pod are stuck in a startup loop.

Symptom

  • Pods remain in the Init:N/M state.

  • Pods remain in the Init:Error state.

  • Pods remain in the Init:CrashLoopBackOff state.

Solution

  1. View the events of the pod and check whether errors occur in the init containers that fail to start up in the pod. For more information, see Check the events of a pod.

  2. Check the logs of the init containers that fail to start up in the pod and troubleshoot the issue based on the log data. For more information, see Check the log of a pod.

  3. Check the configuration of the pod and make sure that the configuration of the init containers that fail to start up is valid. For more information, see Check the configurations of a pod. For more information about init containers, see Debug init containers.
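
The preceding checks can also be performed with kubectl. This is a sketch: [$Pod], [$namespace], and the init container name are placeholders.

    kubectl get pod [$Pod] -n [$namespace] -o jsonpath='{.status.initContainerStatuses[*].name}'   # List the init containers of the pod.
    kubectl logs [$Pod] -n [$namespace] -c <init-container-name>                                   # View the log of a specific init container.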

Pods remain in the ImagePullBackOff state

Cause

If a pod remains in the ImagePullBackOff state, the pod is scheduled to a node but the pod fails to pull the container image.

Symptom

Pods remain in the ImagePullBackOff state.

Solution

Check the events of the pod and identify the name of the container image that fails to be pulled. For more information, see Check the events of a pod.

  1. Check whether the name of the container image is valid.

  2. Log on to the node on which the pod is deployed and run the docker pull [$Image] command (or the crictl pull [$Image] command if the node uses the containerd runtime) to check whether the container image can be pulled.

    Note

    [$Image] specifies the name of the container image.
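
The following kubectl sketch shows how to verify the image configuration of the pod from the command line. [$Pod] and [$namespace] are placeholders.

    kubectl get pod [$Pod] -n [$namespace] -o jsonpath='{.spec.containers[*].image}'   # Confirm the image name and tag.
    kubectl get pod [$Pod] -n [$namespace] -o jsonpath='{.spec.imagePullSecrets}'      # Check whether a pull secret is configured for a private registry.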

Pods remain in the CrashLoopBackOff state

Cause

If a pod remains in the CrashLoopBackOff state, the application in the pod encounters an error.

Symptom

Pods remain in the CrashLoopBackOff state.

Solution

  1. View the events of the pod and check whether errors occur in the pod. For more information, see Check the events of a pod.

  2. Check the log of the pod and troubleshoot the issue based on the log data. For more information, see Check the log of a pod.

  3. Inspect the configurations of the pod and check whether the health check configurations are valid. For more information, see Check the configurations of a pod. For more information about health checks for pods, see Configure Liveness, Readiness and Startup Probes.
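
The preceding checks can also be performed with kubectl. This is a sketch: [$Pod] and [$namespace] are placeholders.

    kubectl describe pod [$Pod] -n [$namespace]       # Check the Last State, Reason, and Exit Code fields of the container.
    kubectl logs [$Pod] -n [$namespace] --previous    # View the log of the previous, crashed container instance.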

Pods remain in the Completed state

Cause

If a pod is in the Completed state, the containers in the pod have completed the startup command and all the processes in the containers have exited.

Symptom

Pods remain in the Completed state.

Solution

  1. Inspect the configurations of the pod and check the startup command that is executed by the containers in the pod. For more information, see Check the configurations of a pod.

  2. Check the log of the pod and troubleshoot the issue based on the log data. For more information, see Check the log of a pod.

Pods remain in the Running state but do not run as normal

Cause

The YAML file that is used to deploy the pod contains errors.

Symptom

Pods remain in the Running state but do not run as normal.

Solution

  1. Inspect the configurations of the pod and check whether the containers in the pod are configured as expected. For more information, see Check the configurations of a pod.

  2. Use the following method to check whether the keys in the YAML file contain spelling errors. A consolidated kubectl sketch is provided after this procedure.

    The following example describes how to identify spelling errors if you spell command as commnd.

    Note

    When you create a pod, the system ignores fields whose keys are misspelled. For example, if you spell command as commnd, you can still use the YAML file to create the pod. However, the pod does not run the command that is specified in the misspelled field. Instead, the pod runs the default command in the image.

    1. Add the --validate option to the kubectl apply command. For example, run the kubectl apply --validate -f XXX.yaml command.

      If you spell command as commnd, the following error occurs: XXX] unknown field: commnd XXX] this may be a false alarm, see https://gXXXb.XXX/6842pods/test.

    2. Run the following command and compare the pod.yaml file that is generated with the original file that is used to create the pod:

        kubectl get pods [$Pod] -o yaml > pod.yaml
      Note

      [$Pod] is the name of the abnormal pod. You can run the kubectl get pods command to view the name.

      • If the pod.yaml file contains more lines than the original file, the extra lines are fields that the system adds by default. In this case, the pod is created as expected.

      • If the pod.yaml file does not contain the command field that is specified in the original file, the key of the field in the original file may be misspelled.

  3. Check the log of the pod and troubleshoot the issue based on the log data. For more information, see Check the log of a pod.

  4. You can log on to a container in the pod by using the terminal and check the local files in the container. For more information, see Log on to a container by using the terminal.
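
The following sketch consolidates the checks in step 2 into a short kubectl sequence. app.yaml is a hypothetical name for the original file that is used to create the pod, and [$Pod] and [$namespace] are placeholders.

    kubectl apply --validate -f app.yaml                       # Report unknown fields such as commnd.
    kubectl get pod [$Pod] -n [$namespace] -o yaml > pod.yaml
    diff app.yaml pod.yaml                                     # Extra lines in pod.yaml are fields that the system adds by default.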

Pods remain in the Terminating state

Cause

If a pod is in the Terminating state, the pod is being terminated.

Symptom

Pods remain in the Terminating state.

Solution

Pods that remain in the Terminating state are deleted after a period of time. If a pod remains in the Terminating state for a long period of time, you can run the following command to forcefully delete the pod:

kubectl delete pod [$Pod] -n [$namespace] --grace-period=0 --force
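
Before you run the preceding command, you can check two common blockers. This is a sketch that uses the same placeholders as the command above.

    kubectl get pod [$Pod] -n [$namespace] -o jsonpath='{.metadata.finalizers}'   # Finalizers that are not removed can block deletion.
    kubectl get node                                                               # A NotReady node can also leave its pods in the Terminating state.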

Pods remain in the Evicted state

Cause

The kubelet automatically evicts one or more pods from a node to reclaim resources when the usage of certain resources on the node reaches a threshold. These resources include memory, storage, file system index nodes (inodes), and operating system process identifiers (PIDs).

Symptom

Pods remain in the Evicted state.

Solution

  1. Run the following command to query the status.message field of the pod and identify the reason for the eviction.

    kubectl  get pod [$Pod] -o yaml -n [$namespace]

    Expected output:

    status:
        message: 'Pod the node had condition: [DiskPressure].'
        phase: Failed
        reason: Evicted

    The status.message field in the preceding output indicates that the pod is evicted due to node disk pressure.

    Note

    Node disk pressure is used as an example. Other issues such as memory pressure and PID pressure are also displayed in a similar way.

  2. Run the following command to delete the evicted pod:

    kubectl get pods -n [$namespace]| grep Evicted | awk '{print $1}' | xargs kubectl delete pod -n [$namespace]

To avoid pod eviction, configure appropriate resource requests and limits for your pods and monitor the resource usage of nodes so that node resources are not exhausted.
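
As a minimal sketch, you can set the resource requests of a workload equal to its limits so that its pods get the Guaranteed QoS class, which is among the last to be evicted. The Deployment name my-app and the resource values are hypothetical.

    kubectl set resources deployment my-app -n [$namespace] --requests=cpu=500m,memory=512Mi --limits=cpu=500m,memory=512Mi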

Troubleshoot OOM errors in pods

Cause

If the memory usage of a container in the cluster exceeds the specified memory limit, the container is terminated and an OOM event is triggered. For more information about OOM events, see Allocate memory resources to containers and pods.

Symptom

  • If the process that is terminated due to the OOM error is a core process of the container, the container may restart.

  • If an OOM error occurs, log on to the Container Service for Kubernetes (ACK) console and navigate to the pod details page. On the Events tab, you can view the following OOM event: pod was OOM killed. For more information, see Check the events of a pod.

  • If you configure alert rules for pod exceptions in the cluster, you can receive alert notifications when an OOM event occurs. For more information about how to configure alert rules, see Alert management.

Solution

  1. Check the node that hosts the pod in which an OOM error occurs.

    • Use commands: Run the following command to query information about the pod:

      kubectl  get pod [$Pod] -o wide -n [$namespace]

      Expected output:

      NAME        READY   STATUS    RESTARTS   AGE   IP            NODE
      pod_name    1/1     Running   0          25h   172.20.6.53   cn-hangzhou.192.168.0.198
    • In the ACK console: For more information about how to view node information on the pod details page, see Check the details of a pod.

  2. Log on to the node and check the kernel log in the /var/log/messages file. Search for the out of memory keyword in the log file and identify the process that is terminated due to an OOM error. If the terminated process is a core process of a container, the container restarts after the process is terminated.

  3. Check the time when the error occurs based on the memory usage graph of the pod. For more information, see Check the monitoring information about a pod.

  4. Check whether memory leaks occur in the processes of the pod based on the following monitoring information: the points in time when spikes occur in memory usage, log data, and process names.

    • If the OOM error is caused by memory leaks, we recommend that you troubleshoot the issue based on your business scenario.

    • If the processes run as normal, increase the memory limit of the pod. Make sure that the actual memory usage of the pod does not exceed 80% of the memory limit of the pod. For more information, see Modify the upper limit and lower limit of CPU and memory resources for pods.
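
To confirm an OOM kill and raise the memory limit from the command line, you can use the following sketch. The Deployment name my-app and the resource values are hypothetical.

    kubectl get pod [$Pod] -n [$namespace] -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'   # Prints OOMKilled if the container was OOM killed.
    kubectl set resources deployment my-app -n [$namespace] --requests=memory=1Gi --limits=memory=2Gi                 # Choose values that fit your workload.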

A network disconnection issue occasionally occurs when a pod accesses a database

If a network disconnection issue occasionally occurs when a pod in the ACK cluster accesses a database, you can perform the following operations to troubleshoot the issue:

  1. Check the pod:

    • View the events of the pod in the ACK cluster. For more information, see Check the events of a pod.

      Check whether unstable connection events are generated, such as events related to network exceptions, restarts, and insufficient resources.

    • View the pod log to check whether any error messages related to the database connection exist. For more information, see Check the log of a pod.

      Check whether timeouts or authentication failures occur or the reconnection mechanism is triggered.

    • View the CPU and memory usage of the pod. Make sure that the application or database driver does not exceptionally exit due to insufficient resources. For more information, see Check the monitoring information about a pod.

    • Check the resource requests and limits of the pod to ensure that sufficient CPU and memory resources are allocated.

  2. Check the node:

    • Check the resource usage of the node and ensure that resources such as memory and disk space are sufficient. For more information, see Monitor nodes.

    • Test whether network disconnection occasionally occurs between the node and the database.

  3. Check the database:

    • Check the status and performance metrics of the database to ensure that no restarts or performance bottlenecks exist.

    • Check the number of abnormal connections and the connection timeout settings of the database, and modify the settings based on your business requirements.

    • Analyze the database logs that are related to the disconnections.

  4. Cluster component status:

    Cluster component exceptions affect the communication between pods and other components in the cluster. Run the following command to check the status of components in the ACK cluster:

    kubectl get pod -n kube-system  # View the component status.

    • Check the following network components:

      • Check the status and log of the CoreDNS component to ensure that the pod can resolve the address of the database.

      • If your cluster uses Flannel, check the status and log of the kube-flannel component.

      • If your cluster uses Terway, check the status and log of the terway-eniip component.

  5. Network traffic analysis:

    You can use tcpdump to capture packets and analyze network traffic to help locate the cause.

    1. Run the following command to identify the pod and node where the database disconnection issue occurred.

      kubectl  get pod -n [namespace] -o wide 
    2. Log on to the node. For more information, see Connect to a Linux instance by using a password or key.

      Run the following commands to query the container PID based on different Kubernetes versions.

      • For clusters that run Kubernetes versions later than 1.22, the container runtime is containerd.

        Run the following command to view the container ID:

        crictl ps |grep <Pod name or keyword>

        Expected output:

        CONTAINER           IMAGE               CREATED             STATE                      
        a1a214d2*****       35d28df4*****       2 days ago          Running

        Run the following command with the container ID that you obtained to view the PID of the container:

        crictl inspect  a1a214d2***** |grep -i PID

        Expected output:

        "pid": 2309838, # The PID of the container. 
                    "pid": 1
                    "type": "pid"
      • For clusters that run Kubernetes 1.22 or earlier, the container runtime is Docker.

        Run the following command to view the CONTAINER ID:

        docker ps | grep <Pod name or keyword>

        Expected output:

        CONTAINER ID        IMAGE                  COMMAND     
        a1a214d2*****       35d28df4*****          "/nginx

        Run the following command with the container ID that you obtained to view the PID of the container:

        docker inspect  a1a214d2***** |grep -i PID

        Expected output:

        "Pid": 2309838, # The PID of the container. 
                    "PidMode": "",
                    "PidsLimit": null,
    3. Run the packet capture command.

      Capture packets between the pod and the database. Run the following command with the obtained container PID:

      nsenter -t <Container PID> -n tcpdump -i any -n -s 0 tcp and host <IP address of the database>   # The -n option of nsenter enters the network namespace of the container.

      Capture packets between the pod and the host. Run the following command with the obtained container PID:

      nsenter -t <Container PID> -n tcpdump -i any -n -s 0 tcp and host <IP address of the node>

      Capture packets between the host and the database.

      tcpdump -i any -n -s 0 tcp and host <IP address of the database>
  6. Optimize your application:

    • Configure your application to support the automatic reconnection mechanism to ensure that the application can automatically restore the connection without manual intervention when the database is changed or migrated.

    • Use persistent connections instead of short-lived connections to communicate with the database. Persistent connections can significantly reduce performance loss and resource consumption, and enhance the overall efficiency of the system.