GPU monitoring 2.0 combines a GPU exporter, Prometheus, and Grafana to build a GPU monitoring system that supports diverse scenarios. You can create Grafana dashboards that display GPU exporter metrics to monitor your Container Service for Kubernetes (ACK) clusters. This topic describes the metrics supported by GPU monitoring 2.0.
Description
The GPU exporter used by GPU monitoring 2.0 is compatible with the metrics provided by the DCGM exporter. The GPU exporter also provides custom metrics to meet the requirements of specific scenarios. For more information about the DCGM exporter, see DCGM exporter.
The GPU metrics used by GPU monitoring 2.0 include metrics supported by the DCGM exporter and custom metrics.
Billing description
Fees are charged for custom metrics used by GPU monitoring.
Before you enable this feature, we recommend that you read Billing overview to understand the billing rules of custom metrics. The fees may vary based on the cluster size and number of applications. You can follow the steps in View resource usage to monitor and manage resource usage.
Metrics supported by the DCGM exporter
Utilization metrics
Metric | Type | Unit | Description |
DCGM_FI_DEV_GPU_UTIL | Gauge | % | The GPU utilization within a cycle of 1 second or 1/6 second. The cycle varies based on the GPU model. A cycle is a period of time during which one or more kernel functions remain active. This metric indicates only that one or more kernel functions are occupying GPU resources; it does not indicate how intensively those resources are used. |
DCGM_FI_DEV_MEM_COPY_UTIL | Gauge | % | The memory bandwidth utilization. For example, the maximum memory bandwidth of a V100 GPU is 900 GB/second. If the current memory bandwidth usage is 450 GB/second, the memory bandwidth utilization is 50%. |
DCGM_FI_DEV_ENC_UTIL | Gauge | % | The encoder utilization. |
DCGM_FI_DEV_DEC_UTIL | Gauge | % | The decoder utilization. |
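These utilization metrics are exposed in standard Prometheus format, so they can be retrieved with PromQL. The following minimal sketch queries per-node average GPU utilization through the Prometheus HTTP API; the endpoint URL and the hostname label are assumptions that depend on your Prometheus deployment and exporter configuration.

```python
# A minimal sketch: query DCGM_FI_DEV_GPU_UTIL through the Prometheus HTTP API.
# The endpoint URL and the "hostname" label are assumptions; adjust them to
# match your Prometheus deployment and exporter labels.
import requests

PROMETHEUS_URL = "http://prometheus.example.com:9090"  # hypothetical endpoint

def instant_query(promql: str) -> list:
    """Run an instant PromQL query and return the result vector."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query", params={"query": promql}, timeout=10
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Average GPU utilization per node over the last 5 minutes.
expr = "avg by (hostname) (avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m]))"
for sample in instant_query(expr):
    node = sample["metric"].get("hostname", "<unknown>")
    print(f"{node}: {sample['value'][1]} %")
```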
Memory metrics
Metric | Type | Unit | Description |
DCGM_FI_DEV_FB_FREE | Gauge | MiB | The amount of free framebuffer memory. Note: Framebuffer memory is also known as GPU memory. |
DCGM_FI_DEV_FB_USED | Gauge | MiB | The amount of occupied framebuffer memory. The value of this metric is the same as the value of Memory-Usage returned by the nvidia-smi command. |
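Because used and free framebuffer memory are reported as separate gauges, overall memory utilization must be derived from the two. A small illustrative helper, with hypothetical values standing in for real metric samples:

```python
# Derive GPU memory utilization from the two framebuffer gauges. The sample
# values below are hypothetical; in practice they come from
# DCGM_FI_DEV_FB_USED and DCGM_FI_DEV_FB_FREE.
def fb_utilization(fb_used_mib: float, fb_free_mib: float) -> float:
    """Return used framebuffer memory as a fraction of the total."""
    total = fb_used_mib + fb_free_mib
    return fb_used_mib / total if total else 0.0

# Example: 12288 MiB used and 4096 MiB free -> 0.75 (75%).
print(fb_utilization(12288, 4096))
```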
Profiling metrics
Metric | Type | Unit | Description |
DCGM_FI_PROF_GR_ENGINE_ACTIVE | Gauge | % | The ratio of cycles during which a graphics engine or compute engine remains active. The value is an average of all graphics engines or compute engines. A graphics engine or compute engine is active if a graphics context or compute context is bound to the thread and the context is busy. |
DCGM_FI_PROF_SM_ACTIVE | Gauge | % | The ratio of cycles during which at least one warp on a streaming multiprocessor (SM) remains active. The value is an average of all SMs and does not vary with the number of warps in a thread block. A warp is considered active from the time it is scheduled and allocated resources; while active, it may be computing or waiting, for example, on memory requests. A value below 0.5 indicates low GPU utilization; for high GPU utilization, make sure that the value is greater than 0.8. For example, for a GPU with N SMs:
- If a kernel function uses N blocks and runs for the entire cycle, the value is 1 (100%).
- If a kernel function uses N/5 blocks and runs for the entire cycle, the value is 0.2 (20%).
- If a kernel function uses N blocks but runs for only one fifth of the cycle, while the SMs are otherwise idle, the value is also 0.2 (20%). |
DCGM_FI_PROF_SM_OCCUPANCY | Gauge | % | The ratio of the number of warps resident on an SM to the maximum number of warps supported by the SM. The value is an average of all SMs within a cycle. A larger value does not necessarily indicate higher GPU utilization; it does so only for workloads that are limited by GPU memory bandwidth, as indicated by DCGM_FI_PROF_DRAM_ACTIVE. |
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | Gauge | % | The ratio of cycles during which the tensor (HMMA/IMMA) pipe remains active. The value is an average over a cycle rather than an instantaneous value. A larger value indicates higher tensor core utilization. If the value is 1 (100%), a tensor instruction is issued every other clock cycle throughout the cycle. If the value is 0.2 (20%), one of the following conditions may exist:
- The tensor pipe runs at 100% utilization within 20% of the cycle.
- The tensor pipe runs at 20% utilization within the entire cycle.
- Any combination in between. |
DCGM_FI_PROF_PIPE_FP64_ACTIVE | Gauge | % | The ratio of cycles during which the fp64 (double-precision) pipe remains active. The value is an average over a cycle rather than an instantaneous value. A larger value indicates higher fp64 core utilization. If the value is 1 (100%), an fp64 instruction is issued every fourth clock cycle throughout the cycle on a Volta GPU. If the value is 0.2 (20%), one of the following conditions may exist:
- The fp64 pipe runs at 100% utilization within 20% of the cycle.
- The fp64 pipe runs at 20% utilization within the entire cycle.
- Any combination in between. |
DCGM_FI_PROF_PIPE_FP32_ACTIVE | Gauge | % | The ratio of cycles during which the Fused Multiply-Add (FMA) pipe remains active. FMA operations include fp32 (single-precision) operations and integer operations. The value is an average over a cycle rather than an instantaneous value. A larger value indicates higher fp32 core utilization. If the value is 1 (100%), an fp32 instruction is issued every other clock cycle throughout the cycle on a Volta GPU. If the value is 0.2 (20%), one of the following conditions may exist:
- The fp32 pipe runs at 100% utilization within 20% of the cycle.
- The fp32 pipe runs at 20% utilization within the entire cycle.
- Any combination in between. |
DCGM_FI_PROF_PIPE_FP16_ACTIVE | Gauge | % | The ratio of cycles during which the fp16 (half-precision) pipe remains active. The value is an average over a cycle rather than an instantaneous value. A larger value indicates higher fp16 core utilization. If the value is 1 (100%), an fp16 instruction is issued every other clock cycle throughout the cycle on a Volta GPU. If the value is 0.2 (20%), one of the following conditions may exist:
- The fp16 pipe runs at 100% utilization within 20% of the cycle.
- The fp16 pipe runs at 20% utilization within the entire cycle.
- Any combination in between. |
DCGM_FI_PROF_DRAM_ACTIVE | Gauge | % | The ratio of cycles during which the device memory interface remains active to send or receive data. The value is an average over a cycle rather than an instantaneous value. A larger value indicates higher device memory utilization. If the value is 1 (100%), a DRAM instruction is issued every clock cycle throughout the cycle; in practice, the peak value is about 0.8 (80%). If the value is 0.2 (20%), the device memory interface sends or receives data within 20% of the cycle. |
DCGM_FI_PROF_PCIE_TX_BYTES DCGM_FI_PROF_PCIE_RX_BYTES | Counter | B/s | The TX rate and RX rate of Peripheral Component Interconnect Express (PCIe). The bytes transmitted or received include both the header and payload. The value is an average over a cycle rather than an instantaneous value. For example, if 1 GB of data is transmitted within 1 second, the TX rate is 1 GB/second, regardless of whether the data is transmitted at a constant rate or in bursts. The theoretical maximum PCIe Gen 3 bandwidth is 985 MB/second per lane. |
DCGM_FI_PROF_NVLINK_TX_BYTES DCGM_FI_PROF_NVLINK_RX_BYTES | Counter | B/s | The TX rate and RX rate of NVLink. The bytes transmitted or received include both the header and payload. The value is an average over a cycle rather than an instantaneous value. For example, if 1 GB of data is transmitted within 1 second, the TX rate is 1 GB/second, regardless of whether the data is transmitted at a constant rate or in bursts. The theoretical maximum NVLink Gen 2 bandwidth is 25 GB/second per link in each direction. |
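As a worked illustration of the 0.5 and 0.8 thresholds given for DCGM_FI_PROF_SM_ACTIVE, the following hypothetical helper classifies a sample. The thresholds come from the table above; the function name and verdict strings are ours:

```python
# Classify a DCGM_FI_PROF_SM_ACTIVE sample (0..1) using the 0.5 and 0.8
# thresholds described above. The function name and verdict strings are
# illustrative, not part of the exporter.
def classify_sm_active(sm_active: float) -> str:
    if sm_active < 0.5:
        return "low GPU utilization"
    if sm_active <= 0.8:
        return "moderate GPU utilization"
    return "high GPU utilization"

for value in (0.2, 0.6, 0.9):
    print(value, "->", classify_sm_active(value))
```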
Clock metrics
Metric | Type | Unit | Description |
DCGM_FI_DEV_SM_CLOCK | Gauge | MHz | The SM clock. |
DCGM_FI_DEV_MEM_CLOCK | Gauge | MHz | The memory clock. |
DCGM_FI_DEV_APP_SM_CLOCK | Gauge | MHz | The SM application clock. |
DCGM_FI_DEV_APP_MEM_CLOCK | Gauge | MHz | The memory application clock. |
DCGM_FI_DEV_CLOCK_THROTTLE_REASONS | Gauge | - | The current clock throttle reasons, encoded as a bitmask of NVML throttle reason flags. |
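DCGM_FI_DEV_CLOCK_THROTTLE_REASONS reports a bitmask rather than a frequency. The sketch below decodes it using the NVML nvmlClocksThrottleReasons bit values; verify these constants against the NVML headers shipped with your driver version before relying on them.

```python
# Decode the DCGM_FI_DEV_CLOCK_THROTTLE_REASONS bitmask. The bit values follow
# the NVML nvmlClocksThrottleReasons constants; confirm them against your
# driver's NVML headers.
THROTTLE_REASONS = {
    0x0001: "GPU idle",
    0x0002: "applications clocks setting",
    0x0004: "SW power cap",
    0x0008: "HW slowdown",
    0x0010: "sync boost",
    0x0020: "SW thermal slowdown",
    0x0040: "HW thermal slowdown",
    0x0080: "HW power brake slowdown",
    0x0100: "display clock setting",
}

def decode_throttle_reasons(mask: int) -> list:
    """Return the human-readable reasons whose bits are set in the mask."""
    return [name for bit, name in THROTTLE_REASONS.items() if mask & bit]

# Example: 0x0044 -> ["SW power cap", "HW thermal slowdown"].
print(decode_throttle_reasons(0x0044))
```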
XID error and violation metrics
Metric | Type | Unit | Description |
DCGM_FI_DEV_XID_ERRORS | Gauge | - | The most recent XID error that occurred within a period of time. |
DCGM_FI_DEV_POWER_VIOLATION | Counter | μs | The power violation time. |
DCGM_FI_DEV_THERMAL_VIOLATION | Counter | μs | The thermal violation time. |
DCGM_FI_DEV_SYNC_BOOST_VIOLATION | Counter | μs | The sync boost violation time. |
DCGM_FI_DEV_BOARD_LIMIT_VIOLATION | Counter | μs | The board limit violation time. |
DCGM_FI_DEV_LOW_UTIL_VIOLATION | Counter | μs | The low utilization violation time. |
DCGM_FI_DEV_RELIABILITY_VIOLATION | Counter | μs | The board reliability violation time. |
BAR1 metrics
Metric | Type | Unit | Description |
DCGM_FI_DEV_BAR1_USED | Gauge | MB | The amount of used BAR1 memory. |
DCGM_FI_DEV_BAR1_FREE | Gauge | MB | The amount of free BAR1 memory. |
Temperature and power metrics
Metric | Type | Unit | Description |
DCGM_FI_DEV_MEMORY_TEMP | Gauge | °C | The memory temperature. |
DCGM_FI_DEV_GPU_TEMP | Gauge | °C | The GPU temperature. |
DCGM_FI_DEV_POWER_USAGE | Gauge | W | The power usage. |
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION | Counter | mJ | The total energy consumption since the driver was last reloaded. |
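Because DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION is a monotonically increasing counter, the average power draw over an interval can be derived from two of its samples; in PromQL, rate(DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION[5m]) / 1000 yields the same result in watts. A minimal sketch of the arithmetic, with hypothetical sample values:

```python
# Average power in watts from two samples of the energy counter (mJ):
# delta energy (mJ -> J) divided by elapsed time (s) gives watts.
def average_power_watts(energy_start_mj: float, energy_end_mj: float,
                        interval_seconds: float) -> float:
    return (energy_end_mj - energy_start_mj) / 1000.0 / interval_seconds

# Example: 30,000,000 mJ consumed over 120 s -> 250.0 W.
print(average_power_watts(0, 30_000_000, 120))
```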
Retired page metrics
Metric | Type | Unit | Description |
DCGM_FI_DEV_RETIRED_SBE | Gauge | - | The number of pages retired because of single-bit errors. |
DCGM_FI_DEV_RETIRED_DBE | Gauge | - | The number of pages retired because of double-bit errors. |
Custom metrics
Metric | Type | Unit | Description |
DCGM_CUSTOM_PROCESS_SM_UTIL | Gauge | % | The SM utilization per GPU process. |
DCGM_CUSTOM_PROCESS_MEM_COPY_UTIL | Gauge | % | The memory copy utilization per GPU process. |
DCGM_CUSTOM_PROCESS_ENCODE_UTIL | Gauge | % | The encoder utilization per GPU process. |
DCGM_CUSTOM_PROCESS_DECODE_UTIL | Gauge | % | The decoder utilization per GPU process. |
DCGM_CUSTOM_PROCESS_MEM_USED | Gauge | MiB | The amount of GPU memory occupied per GPU process. |
DCGM_CUSTOM_CONTAINER_MEM_ALLOCATED | Gauge | MiB | The amount of GPU memory allocated to containers. |
DCGM_CUSTOM_CONTAINER_CP_ALLOCATED | Gauge | - | The ratio of the GPU computing power allocated to a container to the total computing power provided by the GPU. The value ranges from 0 to 1. For example, if a GPU provides 100 CUs of computing power and 30 CUs are allocated to a container, the value for that container is 0.3 (30/100). The value is 0 in exclusive GPU mode and in shared GPU mode, because containers in these modes request only GPU memory and the allocation of GPU computing power is not limited. |
DCGM_CUSTOM_DEV_FB_TOTAL | Gauge | MiB | The total memory of the GPU. |
DCGM_CUSTOM_DEV_FB_ALLOCATED | Gauge | - | The ratio of allocated GPU memory to total GPU memory. The value ranges from 0 to 1. |
DCGM_CUSTOM_ALLOCATE_MODE | Gauge | - | The mode in which the node runs, for example, exclusive GPU mode or shared GPU mode. |
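DCGM_CUSTOM_DEV_FB_ALLOCATED and DCGM_CUSTOM_DEV_FB_TOTAL combine to give the absolute amount of GPU memory allocated on a GPU, which is also how the deprecated memory metrics in the next section are replaced. A small illustrative calculation with hypothetical values:

```python
# Absolute allocated GPU memory = DCGM_CUSTOM_DEV_FB_ALLOCATED (ratio, 0..1)
# multiplied by DCGM_CUSTOM_DEV_FB_TOTAL (MiB). Values below are hypothetical.
def allocated_memory_mib(fb_allocated_ratio: float, fb_total_mib: float) -> float:
    return fb_allocated_ratio * fb_total_mib

# Example: 0.25 of a 16384 MiB GPU -> 4096.0 MiB allocated.
print(allocated_memory_mib(0.25, 16384))
```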
Deprecated metrics
Deprecated metric | Metric for replacement | Description |
nvidia_gpu_temperature_celsius | DCGM_FI_DEV_GPU_TEMP | |
nvidia_gpu_power_usage_milliwatts | DCGM_FI_DEV_POWER_USAGE | |
nvidia_gpu_sharing_memory | DCGM_CUSTOM_DEV_FB_ALLOCATED * DCGM_CUSTOM_DEV_FB_TOTAL | The amount of GPU memory requested from a GPU equals the proportion of GPU memory allocated on the GPU multiplied by the total GPU memory of the GPU. |
nvidia_gpu_memory_used_bytes | DCGM_FI_DEV_FB_USED | |
nvidia_gpu_memory_total_bytes | DCGM_CUSTOM_DEV_FB_TOTAL | |
nvidia_gpu_memory_allocated_bytes | DCGM_CUSTOM_DEV_FB_ALLOCATED * DCGM_CUSTOM_DEV_FB_TOTAL | The amount of GPU memory requested from a GPU equals the proportion of GPU memory allocated on the GPU multiplied by the total GPU memory of the GPU. |
nvidia_gpu_duty_cycle | DCGM_FI_DEV_GPU_UTIL | |
nvidia_gpu_allocated_num_devices | sum(DCGM_CUSTOM_DEV_FB_ALLOCATED) | Summing the proportions of GPU memory allocated on each GPU of a node yields the total number of GPUs requested on the node. |
nvidia_gpu_num_devices | DCGM_FI_DEV_COUNT | |
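When migrating dashboards, note that some replacements also change units: nvidia_gpu_power_usage_milliwatts reports milliwatts while DCGM_FI_DEV_POWER_USAGE reports watts, and the byte-based memory metrics are replaced by MiB-based ones, so queries may need scaling factors. The sketch below encodes the replacement column as a lookup table that a migration script could use; the mapping comes from the table above, while the script structure is illustrative:

```python
# Map each deprecated metric to its replacement expression, per the table
# above. Unit conversions (mW -> W, bytes -> MiB) must still be applied by the
# caller where needed.
REPLACEMENTS = {
    "nvidia_gpu_temperature_celsius": "DCGM_FI_DEV_GPU_TEMP",
    "nvidia_gpu_power_usage_milliwatts": "DCGM_FI_DEV_POWER_USAGE",
    "nvidia_gpu_sharing_memory":
        "DCGM_CUSTOM_DEV_FB_ALLOCATED * DCGM_CUSTOM_DEV_FB_TOTAL",
    "nvidia_gpu_memory_used_bytes": "DCGM_FI_DEV_FB_USED",
    "nvidia_gpu_memory_total_bytes": "DCGM_CUSTOM_DEV_FB_TOTAL",
    "nvidia_gpu_memory_allocated_bytes":
        "DCGM_CUSTOM_DEV_FB_ALLOCATED * DCGM_CUSTOM_DEV_FB_TOTAL",
    "nvidia_gpu_duty_cycle": "DCGM_FI_DEV_GPU_UTIL",
    "nvidia_gpu_allocated_num_devices": "sum(DCGM_CUSTOM_DEV_FB_ALLOCATED)",
    "nvidia_gpu_num_devices": "DCGM_FI_DEV_COUNT",
}

def replacement_for(deprecated_metric: str) -> str:
    """Look up the replacement expression for a deprecated metric name."""
    return REPLACEMENTS[deprecated_metric]

print(replacement_for("nvidia_gpu_duty_cycle"))  # -> DCGM_FI_DEV_GPU_UTIL
```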