Container Service for Kubernetes: Troubleshoot common GPU monitoring issues

Last updated: Jun 19, 2024

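If the GPU monitoring dashboard behaves abnormally or shows no data, follow the steps in this topic to troubleshoot common GPU monitoring issues.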

Procedure

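Step 1: Check whether the cluster has GPU nodes

  1. Log on to the Container Service management console.

  2. In the left-side navigation pane of the console, click Clusters.

  3. On the Clusters page, click the name of the target cluster, or click Details in the Actions column to the right of the target cluster.

  4. In the left-side navigation pane of the cluster management page, choose Node Management > Nodes.

  5. On the Nodes page, check whether the target cluster has GPU nodes.

    Note

    If a name in the Configuration column of the Nodes page contains ecs.gn, the cluster has GPU nodes.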
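The console check above can also be approximated from the command line. The following is a minimal sketch, not part of the original procedure. It assumes that the nodes carry the well-known node.kubernetes.io/instance-type label and that GPU instance types contain ecs.gn, as the note describes. The function name list_gpu_nodes is hypothetical.

```shell
# Hypothetical helper: reads the output of
#   kubectl get nodes --no-headers -L node.kubernetes.io/instance-type
# on stdin and prints the names of nodes whose instance type contains "ecs.gn".
# Assumption: the instance type is the last column (the one added by -L).
list_gpu_nodes() {
  awk '$NF ~ /ecs\.gn/ { print $1 }'
}

# Usage (requires kubectl access to the cluster):
#   kubectl get nodes --no-headers -L node.kubernetes.io/instance-type | list_gpu_nodes
```

If the function prints nothing, either the cluster has no GPU nodes or the nodes do not carry the assumed label, so confirm the result in the console.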

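Step 2: Check whether ack-arms-prometheus is installed correctly

  1. Check whether ack-arms-prometheus is installed in the target cluster. For details, see Enable Alibaba Cloud Prometheus monitoring.

  2. If ack-arms-prometheus is installed, run the following command to check the status of the ack-arms-prometheus Pod.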

    kubectl get pods -n arms-prom

    Expected output:

    NAME                                             READY   STATUS    RESTARTS   AGE
    arms-prom-ack-arms-prometheus-866cfd9f8f-x8jxl   1/1     Running   0          26d

    If the Pod is in the Running state, it is running normally. If the Pod is not in the Running state, run the kubectl describe pod command to find out why.
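When the namespace contains many Pods, it can help to filter for the ones that need attention. The following is a minimal sketch under the assumption that STATUS is the third column of the kubectl get pods output, as in the expected output above. The function name list_not_running is hypothetical.

```shell
# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin
# and prints the name of every Pod whose STATUS column (3rd) is not "Running".
list_not_running() {
  awk 'NF >= 3 && $3 != "Running" { print $1 }'
}

# Usage (requires kubectl access to the cluster):
#   kubectl get pods -n arms-prom --no-headers | list_not_running |
#     while read -r pod; do kubectl describe pod "$pod" -n arms-prom; done
```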

Step 3: Check whether ack-prometheus-gpu-exporter is deployed successfully

Run the following command to check the status and number of the Pods.

kubectl get pods -n arms-prom

Expected output:

NAME                                                  READY   STATUS    RESTARTS   AGE
ack-prometheus-gpu-exporter-6kpj7                     1/1     Running   0          7d19h
ack-prometheus-gpu-exporter-bkbf8                     1/1     Running   0          18h
ack-prometheus-gpu-exporter-blbnq                     1/1     Running   0          18h

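The output shows that the number of Pods matches the number of GPU nodes and that every Pod is in the Running state, which means ack-prometheus-gpu-exporter has been deployed on each GPU node. If a Pod is not in the Running state, run the kubectl describe pod command to find out why.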
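The count comparison described above can be scripted. The following is a minimal sketch, assuming that the exporter Pod names start with ack-prometheus-gpu-exporter- as in the expected output and that you already know the number of GPU nodes from Step 1. Both function names are hypothetical.

```shell
# Hypothetical helper: reads `kubectl get pods -n arms-prom --no-headers` output
# on stdin and prints how many ack-prometheus-gpu-exporter Pods are Running.
count_running_exporters() {
  awk '$1 ~ /^ack-prometheus-gpu-exporter-/ && $3 == "Running" { n++ } END { print n + 0 }'
}

# Hypothetical helper: compares the Running exporter count ($1) with the
# GPU node count ($2); returns non-zero when they differ.
check_exporter_coverage() {
  if [ "$1" -eq "$2" ]; then
    echo "OK: $1 exporter Pods for $2 GPU nodes"
  else
    echo "MISMATCH: $1 exporter Pods for $2 GPU nodes"
    return 1
  fi
}

# Usage (requires kubectl access; gpu_nodes is the count found in Step 1):
#   gpu_nodes=3
#   running=$(kubectl get pods -n arms-prom --no-headers | count_running_exporters)
#   check_exporter_coverage "$running" "$gpu_nodes"
```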

Step 4: Check whether ack-prometheus-gpu-exporter is collecting data

  1. Run the following command to log on to a node of the target cluster over SSH.

    sudo ssh root@198.51.XX.XX
    • root: the user name. Replace it with your own user name.

    • 198.51.XX.XX: the public IP address used to access the target cluster.

  2. Run the following command to view the internal IP addresses of the Pods.

    kubectl get pods -n arms-prom -o wide

    Expected output:

    NAME                                                   READY   STATUS    RESTARTS   AGE     IP             NODE                      NOMINATED NODE   READINESS GATES
    ack-prometheus-gpu-exporter-4rdtl                      1/1     Running   0          7h6m    172.21.XX.XX   cn-beijing.192.168.0.22   <none>           <none>
    ack-prometheus-gpu-exporter-vdkqf                      1/1     Running   0          6d16h   172.21.XX.XX   cn-beijing.192.168.94.7   <none>           <none>
    ack-prometheus-gpu-exporter-x7v48                      1/1     Running   0          7h6m    172.21.XX.XX   cn-beijing.192.168.0.23   <none>           <none>
  3. Run the following command to call the GPU exporter service and obtain GPU metrics.

    Note

    The default port of ack-prometheus-gpu-exporter is 9445.

    sudo curl 172.21.XX.XX:9445 | grep "nvidia_gpu"

    Expected output:

     % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100  7518  100  7518    0     0   101k      0 --:--:-- --:--:-- --:--:--  101k
    # HELP nvidia_gpu_duty_cycle Percent of time over the past sample period during which one or more kernels were executing on the GPU device
    # TYPE nvidia_gpu_duty_cycle gauge
    nvidia_gpu_duty_cycle{allocate_mode="exclusive",container_name="tfserving-gpu",minor_number="0",name="Tesla T4",namespace_name="default",node_name="cn-beijing.192.168.0.22",pod_name="fashion-mnist-eci-2-predictor-0-tfserving-proxy-tfserving-v789b",uuid="GPU-293f6608-281a-cc66-fcb3-0d366f32a31d"} 0
    # HELP nvidia_gpu_memory_total_bytes Total memory of the GPU device
    # TYPE nvidia_gpu_memory_total_bytes gauge
    nvidia_gpu_memory_total_bytes{allocate_mode="exclusive",container_name="tfserving-gpu",minor_number="0",name="Tesla T4",namespace_name="default",node_name="cn-beijing.192.168.0.22",pod_name="fashion-mnist-eci-2-predictor-0-tfserving-proxy-tfserving-v789b",uuid="GPU-293f6608-281a-cc66-fcb3-0d366f32a31d"} 1.5811477504e+10
    # HELP nvidia_gpu_memory_used_bytes Memory used by the GPU device
    # TYPE nvidia_gpu_memory_used_bytes gauge
    nvidia_gpu_memory_used_bytes{allocate_mode="exclusive",container_name="tfserving-gpu",minor_number="0",name="Tesla T4",namespace_name="default",node_name="cn-beijing.192.168.0.22",pod_name="fashion-mnist-eci-2-predictor-0-tfserving-proxy-tfserving-v789b",uuid="GPU-293f6608-281a-cc66-fcb3-0d366f32a31d"} 1.488453632e+10
    # HELP nvidia_gpu_num_devices Number of GPU devices
    # TYPE nvidia_gpu_num_devices gauge
    nvidia_gpu_num_devices{node_name="cn-beijing.192.168.0.22"} 1
    # HELP nvidia_gpu_power_usage_milliwatts Power usage of the GPU device in watts
    # TYPE nvidia_gpu_power_usage_milliwatts gauge
    nvidia_gpu_power_usage_milliwatts{allocate_mode="exclusive",container_name="tfserving-gpu",minor_number="0",name="Tesla T4",namespace_name="default",node_name="cn-beijing.192.168.0.22",pod_name="fashion-mnist-eci-2-predictor-0-tfserving-proxy-tfserving-v789b",uuid="GPU-293f6608-281a-cc66-fcb3-0d366f32a31d"} 27000
    # HELP nvidia_gpu_temperature_celsius Temperature of the GPU device in celsius
    # TYPE nvidia_gpu_temperature_celsius gauge
    nvidia_gpu_temperature_celsius{allocate_mode="exclusive",container_name="tfserving-gpu",minor_number="0",name="Tesla T4",namespace_name="default",node_name="cn-beijing.192.168.0.22",pod_name="fashion-mnist-eci-2-predictor-0-tfserving-proxy-tfserving-v789b",uuid="GPU-293f6608-281a-cc66-fcb3-0d366f32a31d"} 44

    If the output contains metrics whose names start with nvidia_gpu, ack-prometheus-gpu-exporter is collecting data successfully.
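The raw exporter output is long, so a summary of the metric names can make this check quicker to read. The following is a minimal sketch, assuming the exporter returns text in the format shown in the expected output above. The function name list_gpu_metric_names is hypothetical.

```shell
# Hypothetical helper: reads exporter output on stdin and prints the distinct
# metric names that start with "nvidia_gpu". Lines starting with "# HELP" or
# "# TYPE" are skipped because they do not begin with the metric name.
list_gpu_metric_names() {
  grep -E '^nvidia_gpu' | sed -E 's/[{ ].*$//' | sort -u
}

# Usage (run on a cluster node; replace the placeholder with a Pod IP from
# the previous step; -s suppresses the curl progress meter):
#   curl -s 172.21.XX.XX:9445 | list_gpu_metric_names
```

An empty result suggests that the exporter is reachable but is not reporting GPU metrics, or that the request did not reach the exporter at all.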