GPU monitoring built on NVIDIA Data Center GPU Manager (DCGM) gives you visibility into GPU utilization, health, and workload performance across your cluster. Use GPU metrics to diagnose issues, optimize resource allocation, and inform capacity planning.
How it works
DCGM is NVIDIA's tool for managing GPUs in large-scale clusters. A monitoring system built on DCGM provides:
GPU behavior monitoring
GPU configuration management
GPU policy management
GPU health diagnostics
GPU-level and thread-level statistics
NVSwitch configuration and monitoring
Prerequisites
Before you begin, make sure that you have:
An NVIDIA driver of version 418.87.01 or later on each node. Log on to a GPU node and run nvidia-smi to check the driver version. Profiling metrics require NVIDIA driver version 450.80.02 or later. For details, see Feature Overview.
Limitations
NVIDIA Multi-Instance GPU (MIG) monitoring is not supported.
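The driver check above can also be scripted. The following is a minimal sketch using the standard nvidia-smi --query-gpu flag and sort -V for version comparison; the driver value shown is a fixed example for illustration, not output from a real node:

```shell
#!/bin/sh
# Minimum driver version required for basic GPU metrics.
MIN_VERSION="418.87.01"

# On a GPU node you would read the installed version with:
#   DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
# A fixed example value is used here so the comparison logic is visible.
DRIVER_VERSION="450.80.02"

# sort -V orders version strings numerically; if the minimum sorts first,
# the installed driver is new enough (equal versions also pass).
lowest=$(printf '%s\n%s\n' "$MIN_VERSION" "$DRIVER_VERSION" | sort -V | head -n1)
if [ "$lowest" = "$MIN_VERSION" ]; then
  echo "Driver $DRIVER_VERSION meets the minimum ($MIN_VERSION)"
else
  echo "Driver $DRIVER_VERSION is older than the minimum ($MIN_VERSION)"
fi
```

Use 450.80.02 as MIN_VERSION instead if you need the profiling metrics mentioned above.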
Billing
GPU monitoring metrics are collected through Alibaba Cloud Managed Service for Prometheus. For pricing details, see Billing overview.
Step 1: Enable Prometheus monitoring
The ack-arms-prometheus component must be version 1.1.7 or later. Check the component version and upgrade if necessary.
Option A: Enable monitoring for an existing cluster
(Optional) For an ACK dedicated cluster, first grant authorization for monitoring policies to the cluster.
On the Clusters page, click the name of the target cluster. In the left-side navigation pane, choose Operations > Prometheus Monitoring.
On the Prometheus Monitoring page, select a container monitoring version and click Install.
After monitoring is enabled, default basic metrics are collected automatically. To collect custom metrics, see Collect custom metrics. Several preset dashboards are also available on this page, including Cluster Overview, Node Monitoring, Application Monitoring, Network Monitoring, and Storage Monitoring.
Option B: Enable monitoring when creating a cluster
ACK managed cluster Pro Edition:
On the Component Configuration page, in the Container Monitoring section, select Container Cluster Monitoring Pro Edition or Container Cluster Monitoring Basic Edition. For details, see Create an ACK managed cluster.
Note: Clusters created in Auto Mode enable Container Monitoring Basic Edition by default.
ACK managed cluster Basic Edition, ACS clusters, and ACK Serverless clusters:
On the Component Configurations page of the cluster creation wizard, in the Monitor containers section, select Enable Managed Service for Prometheus to install Container Monitoring Basic Edition.
After monitoring is enabled, default basic metrics are collected automatically. To collect custom metrics, see Collect custom metrics. On the cluster details page, choose Operations > Prometheus Monitoring to view preset dashboards such as Cluster Overview, Node Monitoring, Application Monitoring, Network Monitoring, and Storage Monitoring.
For more information about enabling Prometheus monitoring, see Enable Prometheus monitoring for ACK.
If you use a self-managed, open-source Prometheus service and need GPU monitoring, install the ack-gpu-exporter component.
Step 2: Verify GPU monitoring components
After enabling Prometheus monitoring, verify that the GPU exporter pods are running:
kubectl get pods -n arms-prom -l k8s-app=ack-prometheus-gpu-exporter
Each GPU node in the cluster should have a running ack-prometheus-gpu-exporter pod. This DaemonSet collects DCGM metrics from your GPU nodes.
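A quick sanity check is to compare the number of GPU nodes with the number of running exporter pods. The sketch below assumes GPU nodes carry the aliyun.accelerator/nvidia_name node label; adjust the node selector to however your GPU nodes are labeled:

```shell
#!/bin/sh
# Count GPU nodes (the label selector is an assumption; adjust for your cluster).
gpu_nodes=$(kubectl get nodes -l aliyun.accelerator/nvidia_name --no-headers | wc -l)

# Count running exporter pods in the arms-prom namespace.
exporter_pods=$(kubectl get pods -n arms-prom \
  -l k8s-app=ack-prometheus-gpu-exporter \
  --field-selector=status.phase=Running --no-headers | wc -l)

if [ "$gpu_nodes" -eq "$exporter_pods" ]; then
  echo "OK: $exporter_pods exporter pods for $gpu_nodes GPU nodes"
else
  echo "Mismatch: $gpu_nodes GPU nodes, $exporter_pods running exporter pods"
fi
```

A mismatch usually means an exporter pod is pending or crash-looping on one node; describe that pod to find out why.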
Step 3: Deploy a sample application
To generate GPU metrics, deploy a sample workload on a GPU node.
Create a file named tensorflow-benchmark.yaml with the following content:

apiVersion: batch/v1
kind: Job
metadata:
  name: tensorflow-benchmark
spec:
  parallelism: 1
  template:
    metadata:
      labels:
        app: tensorflow-benchmark
    spec:
      containers:
      - name: tensorflow-benchmark
        image: registry.cn-beijing.aliyuncs.com/ai-samples/gpushare-sample:benchmark-tensorflow-2.2.3
        command:
        - bash
        - run.sh
        - --num_batches=50000
        - --batch_size=8
        resources:
          limits:
            nvidia.com/gpu: 1 # Request one GPU.
        workingDir: /root
      restartPolicy: Never

Deploy the application:
kubectl apply -f tensorflow-benchmark.yaml
Check the pod status:
kubectl get pod
Expected output:
NAME                        READY   STATUS    RESTARTS   AGE
tensorflow-benchmark-k***   1/1     Running   0          114s
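If you want to script the readiness check rather than read the table by eye, the STATUS column can be parsed from plain kubectl get pod output. A minimal sketch; a captured sample is used in place of a live cluster so the parsing logic is visible:

```shell
#!/bin/sh
# On a live cluster you would pipe `kubectl get pod` into the awk command below.
# A captured sample stands in for that output here.
sample='NAME                        READY   STATUS    RESTARTS   AGE
tensorflow-benchmark-k***   1/1     Running   0          114s'

# Skip the header row (NR>1) and print the third column (STATUS)
# for the benchmark pod.
status=$(printf '%s\n' "$sample" | awk 'NR>1 && $1 ~ /^tensorflow-benchmark/ {print $3}')
echo "$status"   # → Running
```

The same one-liner works for any pod; change the name pattern in the awk condition.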
Step 4: View GPU monitoring data
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, click the name of the target cluster. In the left-side navigation pane, choose Operations > Prometheus Monitoring.
Click the GPU Monitoring tab, then click the GPUs-Pods tab. The monitoring data shows which node each GPU pod is running on (for example, cn-beijing.10.131.xx.xxx).
Click the GPUs-Nodes tab and set GPUNode to a specific node (for example, cn-beijing.10.131.xx.xxx) to view detailed GPU information for that node. For parameter descriptions, see Dashboard description.
FAQ
DCGM memory leak
The ack-prometheus-gpu-exporter DaemonSet starts automatically when you install Managed Service for Prometheus. DCGM sometimes does not release memory correctly during runtime, causing memory usage to increase over time.
A resources.limits setting is configured on the exporter pod to mitigate this. When the memory limit is reached, the pod restarts automatically. This typically happens about once a month. After a restart, metrics resume reporting normally. Grafana may display anomalies for a few minutes (for example, a sudden spike in node count), but the display corrects itself.
For more information, see The DCGM has a memory leak? on GitHub.
ack-prometheus-gpu-exporter is killed by an out-of-memory event
The ack-prometheus-gpu-exporter uses DCGM in embedded mode, which consumes a large amount of memory on multi-GPU nodes and is prone to memory leaks. If you run multiple GPU processes on an instance with multiple GPUs and allocate too little memory to the exporter, the pod may be killed by an out-of-memory (OOM) event.
The pod typically resumes reporting metrics after it restarts. If out-of-memory kills happen frequently, increase the memory limits for the ack-prometheus-gpu-exporter DaemonSet in the arms-prom namespace.
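One way to raise the limit is a JSON patch against the DaemonSet. This is a sketch that assumes the exporter is the first (index 0) container in the pod template and uses 2Gi purely as an example value; size the limit for the number of GPUs per node:

```shell
#!/bin/sh
# Raise the exporter's memory limit.
# Assumptions: container index 0 holds the exporter; 2Gi is an example value.
kubectl -n arms-prom patch daemonset ack-prometheus-gpu-exporter \
  --type='json' \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/resources/limits/memory","value":"2Gi"}]'

# Watch the DaemonSet roll its pods out with the new limit.
kubectl -n arms-prom rollout status daemonset/ack-prometheus-gpu-exporter
```

Note that a managed component may be reconciled back to its defaults; if the patched value does not persist, adjust the limit through the component's own configuration instead.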
ack-prometheus-gpu-exporter reports an error
If the pod logs contain an error similar to the following:
failed to get all process informations of gpu nvidia1,reason: failed to get gpu utilizations for all processes on device 1,reason: Not Found
This error occurs because older versions of ack-prometheus-gpu-exporter cannot retrieve GPU metrics when no tasks are running on certain GPU cards.
Upgrade the ack-arms-prometheus component to the latest version to resolve this issue.