GPU monitoring built on NVIDIA Data Center GPU Manager (DCGM) gives you visibility into GPU utilization, health, and workload performance across your cluster. Use GPU metrics to diagnose issues, optimize resource allocation, and inform capacity planning.
How it works
DCGM is NVIDIA's tool for managing GPUs in large-scale clusters. A monitoring system built on DCGM provides:
GPU behavior monitoring
GPU configuration management
GPU policy management
GPU health diagnostics
GPU-level and thread-level statistics
NVSwitch configuration and monitoring
Prerequisites
Before you begin, make sure that you have:
An NVIDIA driver of version 418.87.01 or later on each node. Log on to a GPU node and run nvidia-smi to check the driver version. Profiling metrics require NVIDIA driver version 450.80.02 or later. For details, see Feature Overview.
Limitations
NVIDIA Multi-Instance GPU (MIG) monitoring is not supported.
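The driver check above can also be scripted. The following is a minimal sketch using the standard nvidia-smi --query-gpu flag and sort -V for version comparison; the driver value shown is a fixed example for illustration, not output from a real node:

```shell
#!/bin/sh
# Minimum driver version required for basic GPU metrics.
MIN_VERSION="418.87.01"

# On a GPU node you would read the installed version with:
#   DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
# A fixed example value is used here so the comparison logic is visible.
DRIVER_VERSION="450.80.02"

# sort -V orders version strings numerically; if the minimum sorts first,
# the installed driver is new enough (equal versions also pass).
lowest=$(printf '%s\n%s\n' "$MIN_VERSION" "$DRIVER_VERSION" | sort -V | head -n1)
if [ "$lowest" = "$MIN_VERSION" ]; then
  echo "Driver $DRIVER_VERSION meets the minimum ($MIN_VERSION)"
else
  echo "Driver $DRIVER_VERSION is older than the minimum ($MIN_VERSION)"
fi
```

Use 450.80.02 as MIN_VERSION instead if you need the profiling metrics mentioned above.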
Billing
GPU monitoring metrics are collected through Alibaba Cloud Managed Service for Prometheus. For pricing details, see Billing overview.
Step 1: Enable Prometheus monitoring
The ack-arms-prometheus component must be version 1.1.7 or later. Check the component version and upgrade if necessary.
Option A: Enable monitoring for an existing cluster
(Optional) For an ACK dedicated cluster, first grant authorization for monitoring policies to the cluster.
On the Clusters page, click the name of the target cluster. In the left-side navigation pane, choose Operations > Prometheus Monitoring.
On the Prometheus Monitoring page, select a container monitoring version and click Install.
After monitoring is enabled, default basic metrics are collected automatically. To collect custom metrics, see Collect custom metrics. Several preset dashboards are also available on this page, including Cluster Overview, Node Monitoring, Application Monitoring, Network Monitoring, and Storage Monitoring.
Option B: Enable monitoring when creating a cluster
ACK managed cluster Pro Edition:
On the Component Configuration page, in the Container Monitoring section, select Container Cluster Monitoring Pro Edition or Container Cluster Monitoring Basic Edition. For details, see Create an ACK managed cluster.
Note: Clusters created in Auto Mode enable Container Monitoring Basic Edition by default.
ACK managed cluster Basic Edition, ACS clusters, and ACK Serverless clusters:
On the Component Configurations page of the cluster creation wizard, in the Monitor containers section, select Enable Managed Service for Prometheus to install Container Monitoring Basic Edition.
After monitoring is enabled, default basic metrics are collected automatically. To collect custom metrics, see Collect custom metrics. On the cluster details page, choose Operations > Prometheus Monitoring to view preset dashboards such as Cluster Overview, Node Monitoring, Application Monitoring, Network Monitoring, and Storage Monitoring.
For more information about enabling Prometheus monitoring, see Enable Prometheus monitoring for ACK.
If you use a self-managed, open-source Prometheus service and need GPU monitoring, install the ack-gpu-exporter component.
Step 2: Verify GPU monitoring components
After enabling Prometheus monitoring, verify that the GPU exporter pods are running:
kubectl get pods -n arms-prom -l k8s-app=ack-prometheus-gpu-exporter
Each GPU node in the cluster should have a running ack-prometheus-gpu-exporter pod. This DaemonSet collects DCGM metrics from your GPU nodes.
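A quick sanity check is to compare the number of GPU nodes with the number of running exporter pods. The sketch below assumes GPU nodes carry the aliyun.accelerator/nvidia_name node label; adjust the node selector to however your GPU nodes are labeled:

```shell
#!/bin/sh
# Count GPU nodes (the label selector is an assumption; adjust for your cluster).
gpu_nodes=$(kubectl get nodes -l aliyun.accelerator/nvidia_name --no-headers | wc -l)

# Count running exporter pods in the arms-prom namespace.
exporter_pods=$(kubectl get pods -n arms-prom \
  -l k8s-app=ack-prometheus-gpu-exporter \
  --field-selector=status.phase=Running --no-headers | wc -l)

if [ "$gpu_nodes" -eq "$exporter_pods" ]; then
  echo "OK: $exporter_pods exporter pods for $gpu_nodes GPU nodes"
else
  echo "Mismatch: $gpu_nodes GPU nodes, $exporter_pods running exporter pods"
fi
```

A mismatch usually means an exporter pod is pending or crash-looping on one node; describe that pod to find out why.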
Step 3: Deploy a sample application
To generate GPU metrics, deploy a sample workload on a GPU node.
Create a file named tensorflow-benchmark.yaml with the following content:

apiVersion: batch/v1
kind: Job
metadata:
  name: tensorflow-benchmark
spec:
  parallelism: 1
  template:
    metadata:
      labels:
        app: tensorflow-benchmark
    spec:
      containers:
      - name: tensorflow-benchmark
        image: registry.cn-beijing.aliyuncs.com/ai-samples/gpushare-sample:benchmark-tensorflow-2.2.3
        command:
        - bash
        - run.sh
        - --num_batches=50000
        - --batch_size=8
        resources:
          limits:
            nvidia.com/gpu: 1 # Request one GPU.
        workingDir: /root
      restartPolicy: Never

Deploy the application:
kubectl apply -f tensorflow-benchmark.yaml
Check the pod status:
kubectl get pod
Expected output:
NAME                        READY   STATUS    RESTARTS   AGE
tensorflow-benchmark-k***   1/1     Running   0          114s
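If you want to script the readiness check rather than read the table by eye, the STATUS column can be parsed from plain kubectl get pod output. A minimal sketch; a captured sample is used in place of a live cluster so the parsing logic is visible:

```shell
#!/bin/sh
# On a live cluster you would pipe `kubectl get pod` into the awk command below.
# A captured sample stands in for that output here.
sample='NAME                        READY   STATUS    RESTARTS   AGE
tensorflow-benchmark-k***   1/1     Running   0          114s'

# Skip the header row (NR>1) and print the third column (STATUS)
# for the benchmark pod.
status=$(printf '%s\n' "$sample" | awk 'NR>1 && $1 ~ /^tensorflow-benchmark/ {print $3}')
echo "$status"   # → Running
```

The same one-liner works for any pod; change the name pattern in the awk condition.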
Step 4: View GPU monitoring data
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, click the name of the target cluster. In the left-side navigation pane, choose Operations > Prometheus Monitoring.
Click the GPU Monitoring tab, then click the GPUs-Pods tab. The monitoring data shows which node each GPU pod is running on (for example, cn-beijing.10.131.xx.xxx).
Click the GPUs-Nodes tab and set GPUNode to a specific node (for example, cn-beijing.10.131.xx.xxx) to view detailed GPU information for that node. For parameter descriptions, see Dashboard description.
FAQ
DCGM memory leak
The ack-prometheus-gpu-exporter DaemonSet starts automatically when you install Managed Service for Prometheus. DCGM sometimes does not release memory correctly during runtime, causing memory usage to increase over time.
A resources.limits setting is configured on the exporter pod to mitigate this. When the memory limit is reached, the pod restarts automatically. This typically happens about once a month. After a restart, metrics resume reporting normally. Grafana may display anomalies for a few minutes (for example, a sudden spike in node count), but the display corrects itself.
For more information, see The DCGM has a memory leak? on GitHub.
ack-prometheus-gpu-exporter is killed by an out-of-memory event
The ack-prometheus-gpu-exporter uses DCGM in embedded mode, which consumes a large amount of memory on multi-GPU nodes and is prone to memory leaks. If you run multiple GPU processes on an instance with multiple GPUs and allocate too little memory to the exporter, the pod may be killed by an out-of-memory (OOM) event.
The pod typically resumes reporting metrics after it restarts. If out-of-memory kills happen frequently, increase the memory limits for the ack-prometheus-gpu-exporter DaemonSet in the arms-prom namespace.
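One way to raise the limit is a JSON patch against the DaemonSet. This is a sketch that assumes the exporter is the first (index 0) container in the pod template and uses 2Gi purely as an example value; size the limit for the number of GPUs per node:

```shell
#!/bin/sh
# Raise the exporter's memory limit.
# Assumptions: container index 0 holds the exporter; 2Gi is an example value.
kubectl -n arms-prom patch daemonset ack-prometheus-gpu-exporter \
  --type='json' \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/resources/limits/memory","value":"2Gi"}]'

# Watch the DaemonSet roll its pods out with the new limit.
kubectl -n arms-prom rollout status daemonset/ack-prometheus-gpu-exporter
```

Note that a managed component may be reconciled back to its defaults; if the patched value does not persist, adjust the limit through the component's own configuration instead.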
ack-prometheus-gpu-exporter reports an error
If the pod logs contain an error similar to the following:
failed to get all process informations of gpu nvidia1,reason: failed to get gpu utilizations for all processes on device 1,reason: Not Found
This error occurs because older versions of ack-prometheus-gpu-exporter cannot retrieve GPU metrics when no tasks are running on certain GPU cards.
Upgrade the ack-arms-prometheus component to the latest version to resolve this issue.