This topic describes how to collect diagnostic data from GPU-accelerated nodes.
Pod anomalies
If a pod that requests GPU resources fails to run normally on a GPU-accelerated node,
perform the following steps to collect diagnostic data:
- Run the following command to query the node on which the pod runs. In this example, the failed pod is named test-pod and belongs to the test-namespace namespace. A sample of the command output is shown after these steps.
kubectl get pod test-pod -n test-namespace -o wide
- Log on to the GPU-accelerated node and run the following command to download and run
a diagnostic script:
curl https://aliacs-k8s-cn-beijing.oss-cn-beijing.aliyuncs.com/diagnose/diagnose-gpu.sh | bash -s -- --pod test-pod
Expected output:
Please upload diagnose-gpu_xx-xx-xx_xx-xx-xx.tar.gz to ACK developers
- Submit a ticket to provide the diagnose-gpu_xx-xx-xx_xx-xx-xx.tar.gz and diagnose-gpu.log files in the current directory to the Container Service for Kubernetes (ACK) technical team for analysis. You can first verify that both files were generated, as shown after these steps.
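The NODE column of the kubectl get pod -o wide output in the first step shows the node that hosts the pod. The following sample output is for illustration only; the pod status, IP address, and node name are hypothetical:
NAME       READY   STATUS             RESTARTS   AGE   IP            NODE                       NOMINATED NODE   READINESS GATES
test-pod   0/1     CrashLoopBackOff   3          10m   172.20.1.23   cn-beijing.192.168.1.100   <none>           <none>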
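Before you submit the ticket, you can confirm that the diagnostic script generated both files in the current directory. The timestamp in the archive name varies with the time at which the script runs:
ls -l diagnose-gpu_*.tar.gz diagnose-gpu.log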
GPU-accelerated node anomalies
If a GPU-accelerated node fails to run normally or errors occur in the runtime environment
of the GPU-accelerated node, perform the following steps to collect diagnostic data:
- Log on to the GPU-accelerated node and run the following command to download and run a diagnostic script. An alternative that downloads the script to a local file before running it is shown after these steps.
curl https://aliacs-k8s-cn-beijing.oss-cn-beijing.aliyuncs.com/diagnose/diagnose-gpu.sh | bash
Expected output:
Please upload diagnose-gpu_xx-xx-xx_xx-xx-xx.tar.gz to ACK developers
- Submit a ticket to provide the diagnose-gpu_xx-xx-xx_xx-xx-xx.tar.gz file in the current directory to the ACK technical team for analysis.
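As an optional alternative to piping the script directly into bash, you can download the script to a local file, review it, and then run it. The local file name diagnose-gpu.sh below is only an example; any name works:
curl -o diagnose-gpu.sh https://aliacs-k8s-cn-beijing.oss-cn-beijing.aliyuncs.com/diagnose/diagnose-gpu.sh
bash diagnose-gpu.sh
Running the script this way should produce the same diagnose-gpu_xx-xx-xx_xx-xx-xx.tar.gz archive as the piped command.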