
Container Service for Kubernetes: Enable GPU monitoring for an ACK cluster

Last Updated: Jul 23, 2024

GPU monitoring 2.0 is a sophisticated GPU monitoring system developed based on NVIDIA Data Center GPU Manager (DCGM). This topic describes how to enable GPU monitoring for a Container Service for Kubernetes (ACK) cluster.

Prerequisites

Background information

Monitoring large numbers of GPU devices in Kubernetes is important to O&M engineers. By collecting GPU metrics from a cluster, you can gain insights into the GPU usage, health status, workloads, and performance of the cluster. The monitoring data can help you quickly diagnose issues, optimize GPU resource allocation, and increase resource utilization. GPU monitoring also helps data scientists and AI algorithm engineers optimize GPU resource allocation and task scheduling.

GPU monitoring 1.0 uses the NVIDIA Management Library (NVML) to collect GPU metrics and uses Prometheus and Grafana to visualize the collected metrics. You can use GPU monitoring 1.0 to monitor the usage of GPU resources in your cluster. However, new-generation NVIDIA GPUs use more complex architectures to meet user requirements in diverse scenarios, and the GPU metrics provided by GPU monitoring 1.0 can no longer meet the growing monitoring demands.

New-generation NVIDIA GPUs support Data Center GPU Manager (DCGM), which can be used to manage large numbers of GPUs. GPU monitoring 2.0 is developed based on NVIDIA DCGM, which provides a rich set of GPU metrics and supports the following features (see the dcgmi example after this list for an illustration):

  • GPU behavior monitoring

  • GPU configuration management

  • GPU policy management

  • GPU health diagnostics

  • GPU statistics and process statistics

  • NVSwitch configuration and monitoring
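
The following commands are only an illustration of DCGM itself and assume that DCGM and its dcgmi CLI are installed on the host. GPU monitoring 2.0 collects these metrics through the exporter, so you do not need to run dcgmi in your cluster:

  # List the GPUs that DCGM discovers on this host.
  dcgmi discovery -l

  # Run a quick (level 1) health diagnostic.
  dcgmi diag -r 1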

Limits

  • The version of the NVIDIA GPU driver must be 418.87.01 or later. If you want to use the GPU profiling feature, make sure that NVIDIA GPU driver 450.80.02 or later is installed. For more information about GPU profiling, see Feature Overview. You can check the installed driver version as shown in the example below.

    Note
    • You cannot use GPU monitoring 2.0 to monitor the NVIDIA Multi-Instance GPU (MIG) feature.
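
To check the installed driver version, log on to a GPU-accelerated node (for example, over SSH) and run nvidia-smi. This is a minimal sketch that only assumes the NVIDIA driver is installed on the node:

  # Print the installed NVIDIA driver version; 418.87.01 or later is required,
  # and 450.80.02 or later is required for GPU profiling.
  nvidia-smi --query-gpu=driver_version --format=csv,noheader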

Usage notes

DCGM 2.3.6 has a memory leak issue. To work around this issue, ACK sets the resources.limits parameter for the exporter pod. When the memory usage of the exporter reaches the specified limit, the exporter is restarted. After the restart, the exporter continues to report metrics to Grafana as normal. However, certain metrics in Grafana may temporarily display abnormal values after a restart. For example, the number of nodes may appear to increase. The values of these metrics return to normal after a few minutes. In most cases, the exporter is restarted about once a month. For more information, see The DCGM has a memory leak?
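
To verify this behavior in your cluster, you can check the restart count and the configured memory limit of the exporter pods. The namespace (arms-prom) and the name filter (gpu-exporter) used below are assumptions and may differ depending on your component version:

  # List the exporter pods and their restart counts (namespace and name filter
  # are assumptions; adjust them to your cluster).
  kubectl get pods -n arms-prom | grep gpu-exporter

  # Show the memory limit configured for one exporter pod.
  kubectl describe pod <gpu-exporter-pod> -n arms-prom | grep -A 3 'Limits:'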

Billing rules

By default, the Managed Service for Prometheus metrics collected by the ack-gpu-exporter component in an ACK cluster are considered basic metrics and are free of charge. However, if you increase the default retention period of monitoring data defined by Alibaba Cloud for basic monitoring services, additional fees may be charged. For more information about the billing of custom metrics in Managed Service for Prometheus, see Billing overview.

Procedure

  1. Step 1: Enable Managed Service for Prometheus

    Make sure that the ack-arms-prometheus version is 1.1.7 or later and that the GPU dashboard version is V2 or later. You can check and update the versions in the console as described in the following note, or list the installed components from the command line as shown in the example after the note.

    Note
    • Check and update the ack-arms-prometheus version: Log on to the ACK console and go to the details page of the cluster. In the left-side navigation pane, choose Operations > Add-ons. On the Add-ons page, enter arms in the search box and click the search icon. After the search result appears, you can check and update the ack-arms-prometheus version.

    • Check and update the GPU dashboard version: Log on to the ACK console and go to the details page of the cluster. In the left-side navigation pane, choose Operations > Prometheus Monitoring. In the upper-right corner of the Prometheus Monitoring page, click Go to Prometheus Service. On the Dashboards page, you can check and update the GPU dashboard version.
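
    If you prefer the command line, you can also list the workloads deployed by the ack-arms-prometheus component and read the installed versions from their image tags. The arms-prom namespace used below is an assumption; adjust it if your cluster uses a different namespace:

    # List the Deployments and DaemonSets of the component together with their
    # images (the image tags indicate the installed versions).
    kubectl get deployments,daemonsets -n arms-prom -o wide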

  2. Step 2: Verify the GPU monitoring capability of Managed Service for Prometheus.

    1. Deploy an application named tensorflow-benchmark.

      1. Create a YAML file named tensorflow-benchmark.yaml and add the following content to the file:

        apiVersion: batch/v1
        kind: Job
        metadata:
          name: tensorflow-benchmark
        spec:
          parallelism: 1
          template:
            metadata:
              labels:
                app: tensorflow-benchmark
            spec:
              containers:
              - name: tensorflow-benchmark
                image: registry.cn-beijing.aliyuncs.com/ai-samples/gpushare-sample:benchmark-tensorflow-2.2.3
                command:
                - bash
                - run.sh
                - --num_batches=50000
                - --batch_size=8
                resources:
                  limits:
                    nvidia.com/gpu: 1 # Request one GPU. 
                workingDir: /root
              restartPolicy: Never
      2. Run the following command to deploy the tensorflow-benchmark application on a GPU-accelerated node:

        kubectl apply -f tensorflow-benchmark.yaml
      3. Run the following command to query the status of the pod that runs the application:

        kubectl get po

        Expected output:

        NAME                         READY   STATUS    RESTARTS   AGE
        tensorflow-benchmark-k***   1/1     Running   0          114s

        The output indicates that the pod is in the Running state.
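
        Before you check the dashboards, you can optionally follow the logs of the benchmark job to confirm that it is running. The job name comes from the manifest above:

        # Stream the logs of the benchmark job (press Ctrl+C to stop).
        kubectl logs -f job/tensorflow-benchmark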

    2. View GPU dashboards.

      1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

      2. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Operations > Prometheus Monitoring.

      3. On the Prometheus Monitoring page, click the GPU Monitoring tab and then click the GPUs - Cluster Dimension tab.

        The cluster dashboard shows that the GPU pod runs on the cn-beijing.192.168.10.163 node.

      4. Click the GPUs - Nodes tab, and then select cn-beijing.192.168.10.163 from the gpu_node drop-down list to view the GPU information of the node.
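
If you prefer to query the collected metrics directly instead of using the preset dashboards, you can call the standard Prometheus HTTP API of your Managed Service for Prometheus instance. The following is only a sketch: the endpoint placeholder, the authentication details, and the Hostname label are assumptions that depend on your instance and exporter version, while DCGM_FI_DEV_GPU_UTIL is the standard DCGM GPU utilization metric:

  # Query the average GPU utilization per node (replace the endpoint placeholder
  # with the HTTP API endpoint of your Prometheus instance; label names such as
  # Hostname may differ depending on the exporter version).
  curl -s "https://<prometheus-read-endpoint>/api/v1/query" \
    --data-urlencode 'query=avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL)'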