If exceptions occur on GPU monitoring dashboards or the dashboards display no data, you can perform the steps in this topic to troubleshoot the issues.
Procedure
Step 1: Check whether GPU nodes exist in the cluster
- Log on to the ACK console.
- In the left-side navigation pane of the ACK console, click Clusters.
- On the Clusters page, find the cluster that you want to manage and click the name of the cluster or click Details in the Actions column. The details page of the cluster appears.
- In the left-side navigation pane of the details page, choose Nodes > Nodes.
On the Nodes page, check whether GPU nodes exist in the cluster.
Note: On the Nodes page, if a value in the Configuration column contains ecs.gn, the cluster has GPU nodes.
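Alternatively, you can check from the command line. The following sketch assumes that kubectl is configured to access the cluster. It uses the standard node.kubernetes.io/instance-type label to display the instance type of each node; GPU nodes use instance types that start with ecs.gn:
kubectl get nodes -L node.kubernetes.io/instance-type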
Step 2: Check whether ack-arms-prometheus is installed
Check whether ack-arms-prometheus is installed in the cluster. For more information, see Enable Prometheus Service.
If ack-arms-prometheus is installed, run the following command to query the pods of ack-arms-prometheus:
kubectl get pods -n arms-prom
Expected output:
NAME                                             READY   STATUS    RESTARTS   AGE
arms-prom-ack-arms-prometheus-866cfd9f8f-x8jxl   1/1     Running   0          26d
If the pods are in the Running state, the pods run as expected. If the pods are not in the Running state, run the kubectl describe pod command to query the reason why the pods are not running.
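For example, with the pod name from the preceding output (replace it with the actual pod name in your cluster):
kubectl describe pod arms-prom-ack-arms-prometheus-866cfd9f8f-x8jxl -n arms-prom
The Events section at the end of the output typically shows why a pod cannot run, such as an image pull failure or an unschedulable node.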
Step 3: Check whether ack-prometheus-gpu-exporter is deployed
Run the following command to query the status and quantity of pods:
kubectl get pods -n arms-prom
Expected output:
NAME                                READY   STATUS    RESTARTS   AGE
ack-prometheus-gpu-exporter-6kpj7   1/1     Running   0          7d19h
ack-prometheus-gpu-exporter-bkbf8   1/1     Running   0          18h
ack-prometheus-gpu-exporter-blbnq   1/1     Running   0          18h
The preceding output shows that the number of pods equals the number of GPU nodes and that all pods are in the Running state. This indicates that ack-prometheus-gpu-exporter is deployed on the GPU nodes. If a pod is not in the Running state, run the kubectl describe pod command to query the reason why the pod is not running.
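To compare the two counts directly, you can use the following sketch. It assumes that kubectl is configured to access the cluster and that all GPU nodes use ecs.gn instance types, as described in Step 1:
# Count the gpu-exporter pods.
kubectl get pods -n arms-prom --no-headers | grep -c ack-prometheus-gpu-exporter
# Count the GPU nodes based on their instance types.
kubectl get nodes -L node.kubernetes.io/instance-type --no-headers | grep -c ecs.gn
If the two numbers differ, ack-prometheus-gpu-exporter is not deployed on some GPU nodes.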
Step 4: Check whether data is collected by ack-prometheus-gpu-exporter
Run the following command to log on to a node in the cluster by using SSH:
sudo ssh root@198.51.XX.XX
- root: the account name that you use to log on to the node.
- 198.51.XX.XX: the public IP address of the node.
Run the following command to query the internal IP addresses of the pods:
kubectl get pods -n arms-prom -o wide
Expected output:
NAME                                READY   STATUS    RESTARTS   AGE     IP             NODE                      NOMINATED NODE   READINESS GATES
ack-prometheus-gpu-exporter-4rdtl   1/1     Running   0          7h6m    172.21.XX.XX   cn-beijing.192.168.0.22   <none>           <none>
ack-prometheus-gpu-exporter-vdkqf   1/1     Running   0          6d16h   172.21.XX.XX   cn-beijing.192.168.94.7   <none>           <none>
ack-prometheus-gpu-exporter-x7v48   1/1     Running   0          7h6m    172.21.XX.XX   cn-beijing.192.168.0.23   <none>           <none>
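If you only need the names and IP addresses of the exporter pods for the next command, you can print them with a jsonpath expression. The app=ack-prometheus-gpu-exporter label selector in this sketch is an assumption; run kubectl get pods -n arms-prom --show-labels to confirm the actual labels on your exporter pods:
kubectl get pods -n arms-prom -l app=ack-prometheus-gpu-exporter -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.podIP}{"\n"}{end}'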
Run the following command to call the GPU exporter service to obtain GPU metrics. Note: By default, ack-prometheus-gpu-exporter uses port 9445.
curl 172.21.XX.XX:9445 | grep "nvidia_gpu"
Expected output:
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  7518  100  7518    0     0   101k      0 --:--:-- --:--:-- --:--:--  101k
# HELP nvidia_gpu_duty_cycle Percent of time over the past sample period during which one or more kernels were executing on the GPU device
# TYPE nvidia_gpu_duty_cycle gauge
nvidia_gpu_duty_cycle{allocate_mode="exclusive",container_name="tfserving-gpu",minor_number="0",name="Tesla T4",namespace_name="default",node_name="cn-beijing.192.168.0.22",pod_name="fashion-mnist-eci-2-predictor-0-tfserving-proxy-tfserving-v789b",uuid="GPU-293f6608-281a-cc66-fcb3-0d366f32a31d"} 0
# HELP nvidia_gpu_memory_total_bytes Total memory of the GPU device
# TYPE nvidia_gpu_memory_total_bytes gauge
nvidia_gpu_memory_total_bytes{allocate_mode="exclusive",container_name="tfserving-gpu",minor_number="0",name="Tesla T4",namespace_name="default",node_name="cn-beijing.192.168.0.22",pod_name="fashion-mnist-eci-2-predictor-0-tfserving-proxy-tfserving-v789b",uuid="GPU-293f6608-281a-cc66-fcb3-0d366f32a31d"} 1.5811477504e+10
# HELP nvidia_gpu_memory_used_bytes Memory used by the GPU device
# TYPE nvidia_gpu_memory_used_bytes gauge
nvidia_gpu_memory_used_bytes{allocate_mode="exclusive",container_name="tfserving-gpu",minor_number="0",name="Tesla T4",namespace_name="default",node_name="cn-beijing.192.168.0.22",pod_name="fashion-mnist-eci-2-predictor-0-tfserving-proxy-tfserving-v789b",uuid="GPU-293f6608-281a-cc66-fcb3-0d366f32a31d"} 1.488453632e+10
# HELP nvidia_gpu_num_devices Number of GPU devices
# TYPE nvidia_gpu_num_devices gauge
nvidia_gpu_num_devices{node_name="cn-beijing.192.168.0.22"} 1
# HELP nvidia_gpu_power_usage_milliwatts Power usage of the GPU device in watts
# TYPE nvidia_gpu_power_usage_milliwatts gauge
nvidia_gpu_power_usage_milliwatts{allocate_mode="exclusive",container_name="tfserving-gpu",minor_number="0",name="Tesla T4",namespace_name="default",node_name="cn-beijing.192.168.0.22",pod_name="fashion-mnist-eci-2-predictor-0-tfserving-proxy-tfserving-v789b",uuid="GPU-293f6608-281a-cc66-fcb3-0d366f32a31d"} 27000
# HELP nvidia_gpu_temperature_celsius Temperature of the GPU device in celsius
# TYPE nvidia_gpu_temperature_celsius gauge
nvidia_gpu_temperature_celsius{allocate_mode="exclusive",container_name="tfserving-gpu",minor_number="0",name="Tesla T4",namespace_name="default",node_name="cn-beijing.192.168.0.22",pod_name="fashion-mnist-eci-2-predictor-0-tfserving-proxy-tfserving-v789b",uuid="GPU-293f6608-281a-cc66-fcb3-0d366f32a31d"} 44
If the output contains metric records that start with nvidia_gpu, data is collected by ack-prometheus-gpu-exporter as expected.
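For a quick check, you can count the matching metric lines instead of reading the full output. This sketch reuses the sample pod IP address and default port from the preceding step; replace them with your own values:
curl -s 172.21.XX.XX:9445 | grep -c "nvidia_gpu"
If the count is greater than 0, ack-prometheus-gpu-exporter is collecting GPU metrics.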