
Container Service for Kubernetes:Introduction to metrics

Last Updated:Jun 11, 2024

GPU monitoring 2.0 uses a combination of an exporter, Prometheus, and Grafana to build a GPU monitoring system that supports diverse scenarios. You can create Grafana dashboards based on GPU exporter metrics to monitor your Container Service for Kubernetes (ACK) clusters. This topic describes the metrics supported by GPU monitoring 2.0.

Description

The GPU exporter used by GPU monitoring 2.0 is compatible with the metrics provided by the DCGM exporter. The GPU exporter also provides custom metrics to meet the requirements of specific scenarios. For more information about the DCGM exporter, see DCGM exporter.

The GPU metrics used by GPU monitoring 2.0 include metrics supported by the DCGM exporter and custom metrics.

Billing description

Fees are charged for custom metrics used by GPU monitoring.

Before you enable this feature, we recommend that you read Billing overview to understand the billing rules of custom metrics. The fees may vary based on the cluster size and the number of applications. You can follow the steps in View resource usage to monitor and manage resource usage.

Metrics supported by the DCGM exporter

Utilization metrics

DCGM_FI_DEV_GPU_UTIL (Gauge, unit: %)

The GPU utilization over a sample period of 1 second or 1/6 second. The sample period varies based on the GPU model. The value is the fraction of the period during which one or more kernel functions were active on the GPU.

This metric only indicates that one or more kernel functions were occupying GPU resources. It does not show how heavily the GPU was used.

DCGM_FI_DEV_MEM_COPY_UTIL (Gauge, unit: %)

The memory bandwidth utilization.

For example, the maximum memory bandwidth of an NVIDIA V100 GPU is 900 GB/s. If the current memory bandwidth usage is 450 GB/s, the memory bandwidth utilization is 50%.

DCGM_FI_DEV_ENC_UTIL (Gauge, unit: %)

The encoder utilization.

DCGM_FI_DEV_DEC_UTIL (Gauge, unit: %)

The decoder utilization.
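These utilization metrics are typically read from Prometheus. The sketch below, under stated assumptions, extracts per-GPU values of DCGM_FI_DEV_GPU_UTIL from an instant-query response in the Prometheus HTTP API format; the label names `Hostname` and `gpu` are assumptions and may differ in your exporter configuration.

```python
# Sketch: map each GPU to its utilization percentage, given a
# Prometheus instant-query response for DCGM_FI_DEV_GPU_UTIL.
# The JSON shape follows the Prometheus HTTP API; the label names
# (Hostname, gpu) are assumptions, not guaranteed by the exporter.

def gpu_utilization(response: dict) -> dict:
    """Return a mapping of (hostname, gpu index) -> utilization percent."""
    result = {}
    for sample in response["data"]["result"]:
        labels = sample["metric"]
        key = (labels.get("Hostname", "unknown"), labels.get("gpu", "0"))
        # Each sample value is [timestamp, "string-encoded number"].
        result[key] = float(sample["value"][1])
    return result

# Example response in the Prometheus instant-query format.
sample_response = {
    "data": {
        "result": [
            {"metric": {"Hostname": "node-1", "gpu": "0"},
             "value": [1718000000, "85"]},
            {"metric": {"Hostname": "node-1", "gpu": "1"},
             "value": [1718000000, "10"]},
        ]
    }
}

print(gpu_utilization(sample_response))
# {('node-1', '0'): 85.0, ('node-1', '1'): 10.0}
```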

Memory metrics

DCGM_FI_DEV_FB_FREE (Gauge, unit: MiB)

The amount of free framebuffer memory.

Note: Framebuffer memory is also known as GPU memory.

DCGM_FI_DEV_FB_USED (Gauge, unit: MiB)

The amount of used framebuffer memory.

The value of this metric is the same as the value of Memory-Usage returned by the nvidia-smi command.
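Because the two framebuffer metrics are both reported in MiB, a memory utilization ratio can be derived from them directly. A minimal sketch (the function name is illustrative):

```python
# Sketch: derive GPU memory utilization from DCGM_FI_DEV_FB_USED and
# DCGM_FI_DEV_FB_FREE. Both metrics are in MiB, so
# used / (used + free) gives the utilization ratio.

def fb_utilization(fb_used_mib: float, fb_free_mib: float) -> float:
    total = fb_used_mib + fb_free_mib
    if total == 0:
        return 0.0
    return fb_used_mib / total

# A GPU with 16160 MiB total and 4040 MiB in use:
print(fb_utilization(4040, 12120))  # 0.25
```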

Profiling metrics

DCGM_FI_PROF_GR_ENGINE_ACTIVE (Gauge, unit: %)

The ratio of cycles during which a graphics engine or compute engine remains active. The value is an average across all graphics and compute engines.

A graphics engine or compute engine is active if a graphics context or compute context is bound to it and the context is busy.

DCGM_FI_PROF_SM_ACTIVE (Gauge, unit: %)

The ratio of cycles during which at least one warp on a streaming multiprocessor (SM) remains active. The value is an average across all SMs and does not vary with the number of warps in each thread block.

A warp is considered active from the time it is scheduled and resources are allocated to it. The warp may be computing or stalled, for example while it waits for memory requests.

A value below 0.5 indicates low GPU utilization. To ensure high GPU utilization, make sure that the value is greater than 0.8.

For example, assume a GPU has N SMs:

  • If a kernel function uses N thread blocks and runs on all SMs for the entire period, the value of this metric is 1 (100%).

  • If a kernel function uses N/5 thread blocks and runs for the entire period, the value of this metric is 0.2.

  • If a kernel function uses N thread blocks and runs during 20% of the period, the value of this metric is 0.2.

DCGM_FI_PROF_SM_OCCUPANCY (Gauge, unit: %)

The ratio of the number of warps resident on an SM to the maximum number of warps supported by the SM. The value is an average across all SMs over the sample period.

A larger value of this metric does not necessarily indicate higher GPU utilization. Higher occupancy indicates higher GPU utilization only for workloads that are limited by GPU memory bandwidth, as indicated by the DCGM_FI_PROF_DRAM_ACTIVE metric.

DCGM_FI_PROF_PIPE_TENSOR_ACTIVE (Gauge, unit: %)

The ratio of cycles during which the tensor (HMMA/IMMA) pipe remains active. The value is an average over the sample period rather than an instantaneous value. A larger value indicates higher tensor core utilization.

If the value is 1 (100%), a tensor instruction is issued every other cycle throughout the period. Each instruction occupies the pipe for two cycles.

If the value is 0.2 (20%), one of the following conditions may exist:

  • The tensor core utilization of 20% of the SMs over the period is 100%.

  • The tensor core utilization of all SMs over the period is 20%.

  • The tensor core utilization of all SMs during 20% of the period is 100%.

  • Other combinations of the preceding conditions.

DCGM_FI_PROF_PIPE_FP64_ACTIVE (Gauge, unit: %)

The ratio of cycles during which the fp64 (double-precision) pipe remains active. The value is an average over the sample period rather than an instantaneous value. A larger value indicates higher fp64 core utilization.

If the value is 1 (100%), an fp64 instruction is executed every four cycles throughout the period on a Volta GPU.

If the value is 0.2 (20%), one of the following conditions may exist:

  • The fp64 core utilization of 20% of the SMs over the period is 100%.

  • The fp64 core utilization of all SMs over the period is 20%.

  • The fp64 core utilization of all SMs during 20% of the period is 100%.

  • Other combinations of the preceding conditions.

DCGM_FI_PROF_PIPE_FP32_ACTIVE (Gauge, unit: %)

The ratio of cycles during which the Fused Multiply-Add (FMA) pipe remains active. FMA operations include fp32 (single-precision) operations and integer operations. The value is an average over the sample period rather than an instantaneous value. A larger value indicates higher fp32 core utilization.

If the value is 1 (100%), an fp32 instruction is executed every two cycles throughout the period on a Volta GPU.

If the value is 0.2 (20%), one of the following conditions may exist:

  • The fp32 core utilization of 20% of the SMs over the period is 100%.

  • The fp32 core utilization of all SMs over the period is 20%.

  • The fp32 core utilization of all SMs during 20% of the period is 100%.

  • Other combinations of the preceding conditions.

DCGM_FI_PROF_PIPE_FP16_ACTIVE (Gauge, unit: %)

The ratio of cycles during which the fp16 (half-precision) pipe remains active. The value is an average over the sample period rather than an instantaneous value. A larger value indicates higher fp16 core utilization.

If the value is 1 (100%), an fp16 instruction is executed every two cycles throughout the period on a Volta GPU.

If the value is 0.2 (20%), one of the following conditions may exist:

  • The fp16 core utilization of 20% of the SMs over the period is 100%.

  • The fp16 core utilization of all SMs over the period is 20%.

  • The fp16 core utilization of all SMs during 20% of the period is 100%.

  • Other combinations of the preceding conditions.

DCGM_FI_PROF_DRAM_ACTIVE (Gauge, unit: %)

The ratio of cycles during which the device memory interface remains active sending or receiving data. The value is an average over the sample period rather than an instantaneous value. A larger value indicates higher device memory utilization.

If the value is 1 (100%), a DRAM instruction is executed every cycle throughout the period. In practice, the peak value of this metric is about 0.8 (80%).

If the value is 0.2 (20%), the device memory interface sends or receives data during 20% of the cycles in the period.

DCGM_FI_PROF_PCIE_TX_BYTES and DCGM_FI_PROF_PCIE_RX_BYTES (Counter, unit: B/s)

The transmit (TX) and receive (RX) rates over Peripheral Component Interconnect Express (PCIe). The bytes transmitted or received include both the header and the payload. The value is an average over the sample period rather than an instantaneous value.

For example, if 1 GB of data is transmitted within 1 second, the TX rate is 1 GB/s regardless of whether the data was sent at a steady rate or in a burst. The theoretical maximum PCIe Gen 3 bandwidth is 985 MB/s per lane.

DCGM_FI_PROF_NVLINK_TX_BYTES and DCGM_FI_PROF_NVLINK_RX_BYTES (Counter, unit: B/s)

The transmit (TX) and receive (RX) rates over NVLink. The bytes transmitted or received include both the header and the payload. The value is an average over the sample period rather than an instantaneous value.

For example, if 1 GB of data is transmitted within 1 second, the TX rate is 1 GB/s regardless of whether the data was sent at a steady rate or in a burst. The theoretical maximum NVLink Gen 2 bandwidth is 25 GB/s per link in each direction.
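The PCIe rate metrics can be compared against the theoretical link bandwidth to get a utilization figure. A minimal sketch, assuming a PCIe Gen 3 x16 link (check the actual link width of your GPU, for example with nvidia-smi):

```python
# Sketch: compare the measured DCGM_FI_PROF_PCIE_TX_BYTES rate
# (bytes per second) against the theoretical PCIe Gen 3 limit of
# 985 MB/s per lane. The default lane count of 16 is an assumption.

PCIE_GEN3_MB_PER_LANE = 985

def pcie_utilization(tx_bytes_per_sec: float, lanes: int = 16) -> float:
    """Fraction of the theoretical PCIe bandwidth in use."""
    max_bytes_per_sec = PCIE_GEN3_MB_PER_LANE * 1_000_000 * lanes
    return tx_bytes_per_sec / max_bytes_per_sec

# 7.88 GB/s on an x16 Gen 3 link is half of the theoretical peak:
print(pcie_utilization(7.88e9))  # 0.5
```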

Clock metrics

DCGM_FI_DEV_SM_CLOCK (Gauge, unit: MHz)

The SM clock frequency.

DCGM_FI_DEV_MEM_CLOCK (Gauge, unit: MHz)

The memory clock frequency.

DCGM_FI_DEV_APP_SM_CLOCK (Gauge, unit: MHz)

The SM application clock frequency.

DCGM_FI_DEV_APP_MEM_CLOCK (Gauge, unit: MHz)

The memory application clock frequency.

DCGM_FI_DEV_CLOCK_THROTTLE_REASONS (Gauge, unit: -)

The reasons why the clocks are throttled, encoded as a bitmask.
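The throttle-reason metric is a bitmask rather than a frequency, so each set bit names one active throttle reason. A hedged sketch of decoding it; the bit values below follow the NVML clocks-throttle-reason constants, and you should verify them against the NVML documentation for your driver version before relying on them:

```python
# Sketch: decode DCGM_FI_DEV_CLOCK_THROTTLE_REASONS. The bit-to-name
# mapping is taken from the NVML nvmlClocksThrottleReasons constants
# and is an assumption here; verify it against your driver's NVML docs.

THROTTLE_REASONS = {
    0x0001: "GPU idle",
    0x0002: "Applications clocks setting",
    0x0004: "SW power cap",
    0x0008: "HW slowdown",
    0x0010: "Sync boost",
    0x0020: "SW thermal slowdown",
    0x0040: "HW thermal slowdown",
    0x0080: "HW power brake slowdown",
    0x0100: "Display clock setting",
}

def decode_throttle_reasons(mask: int) -> list:
    """Return the human-readable names of all set throttle bits."""
    return [name for bit, name in sorted(THROTTLE_REASONS.items())
            if mask & bit]

print(decode_throttle_reasons(0x0044))
# ['SW power cap', 'HW thermal slowdown']
```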

XID error and violation metrics

DCGM_FI_DEV_XID_ERRORS (Gauge, unit: -)

The most recent XID error that occurred within a period of time.

DCGM_FI_DEV_POWER_VIOLATION (Counter, unit: μs)

The power violation time.

DCGM_FI_DEV_THERMAL_VIOLATION (Counter, unit: μs)

The thermal violation time.

DCGM_FI_DEV_SYNC_BOOST_VIOLATION (Counter, unit: μs)

The sync boost violation time.

DCGM_FI_DEV_BOARD_LIMIT_VIOLATION (Counter, unit: μs)

The board limit violation time.

DCGM_FI_DEV_LOW_UTIL_VIOLATION (Counter, unit: μs)

The low utilization violation time.

DCGM_FI_DEV_RELIABILITY_VIOLATION (Counter, unit: μs)

The board reliability violation time.
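Because the violation metrics are cumulative microsecond counters, the useful quantity is the increase over a scrape window. A minimal sketch that turns two successive samples into the fraction of the window spent in violation (the function name is illustrative, not part of the exporter):

```python
# Sketch: convert a cumulative microsecond violation counter, such as
# DCGM_FI_DEV_POWER_VIOLATION, into the fraction of a time window
# spent in violation.

def violation_fraction(prev_us: int, curr_us: int,
                       window_seconds: float) -> float:
    """Fraction of the window spent throttled, from two counter samples."""
    delta_us = max(curr_us - prev_us, 0)  # counters reset on driver reload
    return delta_us / (window_seconds * 1_000_000)

# 3 seconds of power violation during a 30-second window:
print(violation_fraction(1_000_000, 4_000_000, 30))  # 0.1
```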

BAR1 metrics

DCGM_FI_DEV_BAR1_USED (Gauge, unit: MB)

The amount of used BAR1 memory.

DCGM_FI_DEV_BAR1_FREE (Gauge, unit: MB)

The amount of free BAR1 memory.

Temperature and power metrics

DCGM_FI_DEV_MEMORY_TEMP (Gauge, unit: °C)

The memory temperature.

DCGM_FI_DEV_GPU_TEMP (Gauge, unit: °C)

The GPU temperature.

DCGM_FI_DEV_POWER_USAGE (Gauge, unit: W)

The power usage.

DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION (Counter, unit: J)

The total energy consumption since the driver was last reloaded.
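Since the energy metric is a counter in joules, average power over a window is simply the energy delta divided by the elapsed time (1 W = 1 J/s), which makes a useful cross-check against the instantaneous power gauge. A minimal sketch:

```python
# Sketch: derive average power draw from two samples of
# DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION (joules). Because 1 W = 1 J/s,
# the energy delta divided by the elapsed seconds is the average power,
# comparable with the DCGM_FI_DEV_POWER_USAGE gauge.

def average_power_watts(prev_joules: float, curr_joules: float,
                        window_seconds: float) -> float:
    return (curr_joules - prev_joules) / window_seconds

# 18,000 J consumed over 60 s is an average draw of 300 W:
print(average_power_watts(100_000, 118_000, 60))  # 300.0
```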

Retired page metrics

DCGM_FI_DEV_RETIRED_SBE (Gauge, unit: -)

The number of pages retired because of single-bit errors.

DCGM_FI_DEV_RETIRED_DBE (Gauge, unit: -)

The number of pages retired because of double-bit errors.

Custom metrics

DCGM_CUSTOM_PROCESS_SM_UTIL (Gauge, unit: %)

The SM utilization of each GPU process.

DCGM_CUSTOM_PROCESS_MEM_COPY_UTIL (Gauge, unit: %)

The memory copy utilization of each GPU process.

DCGM_CUSTOM_PROCESS_ENCODE_UTIL (Gauge, unit: %)

The encoder utilization of each GPU process.

DCGM_CUSTOM_PROCESS_DECODE_UTIL (Gauge, unit: %)

The decoder utilization of each GPU process.

DCGM_CUSTOM_PROCESS_MEM_USED (Gauge, unit: MiB)

The amount of GPU memory occupied by each GPU process.

DCGM_CUSTOM_CONTAINER_MEM_ALLOCATED (Gauge, unit: MiB)

The amount of GPU memory allocated to containers.

DCGM_CUSTOM_CONTAINER_CP_ALLOCATED (Gauge, unit: -)

The ratio of the GPU computing power allocated to a container to the total computing power provided by the GPU. The value ranges from 0 to 1.

In exclusive GPU mode and in shared GPU mode, the value of this metric is 0 because containers in these modes request only GPU memory and their computing power is not limited.

For example, if a GPU provides 100 compute units (CUs) of computing power and allocates 30 CUs to a container, the ratio for the container is 0.3 (30/100).

DCGM_CUSTOM_DEV_FB_TOTAL (Gauge, unit: MiB)

The total GPU memory of the GPU.

DCGM_CUSTOM_DEV_FB_ALLOCATED (Gauge, unit: -)

The ratio of allocated GPU memory to total GPU memory. The value ranges from 0 to 1.

DCGM_CUSTOM_ALLOCATE_MODE (Gauge, unit: -)

The mode in which the node runs. Valid values:

  • 0: No GPU-accelerated pods are running on the node.

  • 1: GPU-accelerated pods are running in exclusive GPU mode on the node.

  • 2: GPU-accelerated pods are running in shared GPU mode on the node.
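The allocation metrics above can be combined into absolute values: because DCGM_CUSTOM_DEV_FB_ALLOCATED is a 0-1 ratio, multiplying it by DCGM_CUSTOM_DEV_FB_TOTAL (MiB) yields the allocated GPU memory in MiB. A minimal sketch (the helper name and mode labels are illustrative):

```python
# Sketch: combine the custom allocation metrics into absolute values.
# DCGM_CUSTOM_DEV_FB_ALLOCATED (ratio, 0-1) times
# DCGM_CUSTOM_DEV_FB_TOTAL (MiB) gives allocated GPU memory in MiB;
# DCGM_CUSTOM_ALLOCATE_MODE maps small integers to node modes.

ALLOCATE_MODES = {0: "no GPU pods", 1: "exclusive", 2: "shared"}

def allocated_memory_mib(fb_allocated_ratio: float,
                         fb_total_mib: float) -> float:
    return fb_allocated_ratio * fb_total_mib

# 30% of a 16160 MiB GPU allocated, on a node in shared GPU mode:
print(round(allocated_memory_mib(0.3, 16160), 1))  # 4848.0
print(ALLOCATE_MODES[2])  # shared
```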

Deprecated metrics

  • nvidia_gpu_temperature_celsius: replaced by DCGM_FI_DEV_GPU_TEMP.

  • nvidia_gpu_power_usage_milliwatts: replaced by DCGM_FI_DEV_POWER_USAGE.

  • nvidia_gpu_sharing_memory: replaced by DCGM_CUSTOM_DEV_FB_ALLOCATED × DCGM_CUSTOM_DEV_FB_TOTAL. The proportion of GPU memory allocated on a GPU multiplied by the total GPU memory of the GPU equals the amount of GPU memory allocated on the GPU.

  • nvidia_gpu_memory_used_bytes: replaced by DCGM_FI_DEV_FB_USED.

  • nvidia_gpu_memory_total_bytes: replaced by DCGM_CUSTOM_DEV_FB_TOTAL.

  • nvidia_gpu_memory_allocated_bytes: replaced by DCGM_CUSTOM_DEV_FB_ALLOCATED × DCGM_CUSTOM_DEV_FB_TOTAL. The proportion of GPU memory allocated on a GPU multiplied by the total GPU memory of the GPU equals the amount of GPU memory allocated on the GPU.

  • nvidia_gpu_duty_cycle: replaced by DCGM_FI_DEV_GPU_UTIL.

  • nvidia_gpu_allocated_num_devices: replaced by sum(DCGM_CUSTOM_DEV_FB_ALLOCATED). Summing the proportion of GPU memory allocated on each GPU of a node gives the total number of GPUs allocated on the node.

  • nvidia_gpu_num_devices: replaced by DCGM_FI_DEV_COUNT.
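When migrating dashboards or alerts, note that the deprecated metric names encode bytes and milliwatts while the DCGM replacements report MiB and watts. A minimal sketch of the unit conversions involved:

```python
# Sketch: unit conversions for migrating from the deprecated metrics
# (bytes, milliwatts) to their DCGM replacements (MiB, watts).

MIB = 1024 * 1024

def bytes_to_mib(value_bytes: float) -> float:
    """nvidia_gpu_memory_used_bytes scale -> DCGM_FI_DEV_FB_USED scale."""
    return value_bytes / MIB

def milliwatts_to_watts(value_mw: float) -> float:
    """nvidia_gpu_power_usage_milliwatts scale -> DCGM_FI_DEV_POWER_USAGE scale."""
    return value_mw / 1000.0

print(bytes_to_mib(4236247040))     # 4040.0
print(milliwatts_to_watts(250000))  # 250.0
```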