GPU monitoring 2.0 combines a GPU exporter, Prometheus, and Grafana to build a GPU monitoring system that supports diverse scenarios. You can create Grafana dashboards that display GPU exporter metrics to monitor your Container Service for Kubernetes (ACK) clusters. This topic describes the metrics supported by GPU monitoring 2.0.
Description
The GPU exporter used by GPU monitoring 2.0 is compatible with the metrics provided by the DCGM exporter. The GPU exporter also provides custom metrics to meet the requirements of specific scenarios. For more information about the DCGM exporter, see DCGM exporter.
The GPU metrics used by GPU monitoring 2.0 include metrics supported by the DCGM exporter and custom metrics.
Billing description
Fees are charged for custom metrics used by GPU monitoring.
Before you enable this feature, we recommend that you read Billing overview to understand the billing rules of custom metrics. The fees may vary based on the cluster size and number of applications. You can follow the steps in View resource usage to monitor and manage resource usage.
Metrics supported by the DCGM exporter
Utilization metrics
Metric | Type | Unit | Description |
DCGM_FI_DEV_GPU_UTIL | Gauge | % | The GPU utilization within a cycle of 1 second or 1/6 second. The cycle varies based on the GPU model. A cycle is a period of time during which one or more kernel functions remain active. This metric indicates only that one or more kernel functions are occupying GPU resources; it does not indicate how intensively those resources are used. |
DCGM_FI_DEV_MEM_COPY_UTIL | Gauge | % | The memory bandwidth utilization. For example, the maximum memory bandwidth of a V100 GPU is 900 GB/second. If the current memory bandwidth usage is 450 GB/second, the memory bandwidth utilization is 50%. |
DCGM_FI_DEV_ENC_UTIL | Gauge | % | The encoder utilization. |
DCGM_FI_DEV_DEC_UTIL | Gauge | % | The decoder utilization. |
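These utilization metrics are exposed in standard Prometheus format, so they can be retrieved with PromQL. The following minimal sketch queries per-node average GPU utilization through the Prometheus HTTP API; the endpoint URL and the hostname label are assumptions that depend on your Prometheus deployment and exporter configuration.

```python
# A minimal sketch: query DCGM_FI_DEV_GPU_UTIL through the Prometheus HTTP API.
# The endpoint URL and the "hostname" label are assumptions; adjust them to
# match your Prometheus deployment and exporter labels.
import requests

PROMETHEUS_URL = "http://prometheus.example.com:9090"  # hypothetical endpoint

def instant_query(promql: str) -> list:
    """Run an instant PromQL query and return the result vector."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query", params={"query": promql}, timeout=10
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Average GPU utilization per node over the last 5 minutes.
expr = "avg by (hostname) (avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m]))"
for sample in instant_query(expr):
    node = sample["metric"].get("hostname", "<unknown>")
    print(f"{node}: {sample['value'][1]} %")
```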
Memory metrics
Metric | Type | Unit | Description |
DCGM_FI_DEV_FB_FREE | Gauge | MiB | The amount of free framebuffer memory. Note: Framebuffer memory is also known as GPU memory. |
DCGM_FI_DEV_FB_USED | Gauge | MiB | The amount of occupied framebuffer memory. The value of this metric is the same as the value of Memory-Usage returned by the nvidia-smi command. |
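Because used and free framebuffer memory are reported as separate gauges, overall memory utilization must be derived from the two. A small illustrative helper, with hypothetical values standing in for real metric samples:

```python
# Derive GPU memory utilization from the two framebuffer gauges. The sample
# values below are hypothetical; in practice they come from
# DCGM_FI_DEV_FB_USED and DCGM_FI_DEV_FB_FREE.
def fb_utilization(fb_used_mib: float, fb_free_mib: float) -> float:
    """Return used framebuffer memory as a fraction of the total."""
    total = fb_used_mib + fb_free_mib
    return fb_used_mib / total if total else 0.0

# Example: 12288 MiB used and 4096 MiB free -> 0.75 (75%).
print(fb_utilization(12288, 4096))
```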
Profiling metrics
Metric | Type | Unit | Description |
DCGM_FI_PROF_GR_ENGINE_ACTIVE | Gauge | % | The ratio of cycles during which a graphics engine or compute engine remains active. The value is an average of all graphics engines or compute engines. A graphics engine or compute engine is active if a graphics context or compute context is bound to the thread and the context is busy. |
DCGM_FI_PROF_SM_ACTIVE | Gauge | % | The ratio of cycles during which at least one warp on a streaming multiprocessor (SM) remains active. The value is an average of all SMs and does not vary with the number of warps in a thread block. A warp is considered active from the time it is scheduled and allocated resources; while active, it may be computing or waiting, for example, on memory requests. A value below 0.5 indicates low GPU utilization; for high GPU utilization, make sure that the value is greater than 0.8. For example, for a GPU with N SMs:
- If a kernel function uses N blocks and runs for the entire cycle, the value is 1 (100%).
- If a kernel function uses N/5 blocks and runs for the entire cycle, the value is 0.2 (20%).
- If a kernel function uses N blocks but runs for only one fifth of the cycle, while the SMs are otherwise idle, the value is also 0.2 (20%). |
DCGM_FI_PROF_SM_OCCUPANCY | Gauge | % | The ratio of the number of warps resident on an SM to the maximum number of warps supported by the SM. The value is an average of all SMs within a cycle. A larger value does not necessarily indicate higher GPU utilization; it does so only for workloads that are limited by GPU memory bandwidth, as indicated by DCGM_FI_PROF_DRAM_ACTIVE. |
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | Gauge | % | The ratio of cycles during which the tensor (HMMA/IMMA) pipe remains active. The value is an average over a cycle rather than an instantaneous value. A larger value indicates higher tensor core utilization. If the value is 1 (100%), a tensor instruction is issued every other clock cycle throughout the cycle. If the value is 0.2 (20%), one of the following conditions may exist:
- The tensor pipe runs at 100% utilization within 20% of the cycle.
- The tensor pipe runs at 20% utilization within the entire cycle.
- Any combination in between. |
DCGM_FI_PROF_PIPE_FP64_ACTIVE | Gauge | % | The ratio of cycles during which the fp64 (double-precision) pipe remains active. The value is an average over a cycle rather than an instantaneous value. A larger value indicates higher fp64 core utilization. If the value is 1 (100%), an fp64 instruction is issued every fourth clock cycle throughout the cycle on a Volta GPU. If the value is 0.2 (20%), one of the following conditions may exist:
- The fp64 pipe runs at 100% utilization within 20% of the cycle.
- The fp64 pipe runs at 20% utilization within the entire cycle.
- Any combination in between. |
DCGM_FI_PROF_PIPE_FP32_ACTIVE | Gauge | % | The ratio of cycles during which the Fused Multiply-Add (FMA) pipe remains active. FMA operations include fp32 (single-precision) operations and integer operations. The value is an average over a cycle rather than an instantaneous value. A larger value indicates higher fp32 core utilization. If the value is 1 (100%), an fp32 instruction is issued every other clock cycle throughout the cycle on a Volta GPU. If the value is 0.2 (20%), one of the following conditions may exist:
- The fp32 pipe runs at 100% utilization within 20% of the cycle.
- The fp32 pipe runs at 20% utilization within the entire cycle.
- Any combination in between. |
DCGM_FI_PROF_PIPE_FP16_ACTIVE | Gauge | % | The ratio of cycles during which the fp16 (half-precision) pipe remains active. The value is an average over a cycle rather than an instantaneous value. A larger value indicates higher fp16 core utilization. If the value is 1 (100%), an fp16 instruction is issued every other clock cycle throughout the cycle on a Volta GPU. If the value is 0.2 (20%), one of the following conditions may exist:
- The fp16 pipe runs at 100% utilization within 20% of the cycle.
- The fp16 pipe runs at 20% utilization within the entire cycle.
- Any combination in between. |
DCGM_FI_PROF_DRAM_ACTIVE | Gauge | % | The ratio of cycles during which the device memory interface remains active to send or receive data. The value is an average over a cycle rather than an instantaneous value. A larger value indicates higher device memory utilization. If the value is 1 (100%), a DRAM instruction is issued every clock cycle throughout the cycle; in practice, the peak value is about 0.8 (80%). If the value is 0.2 (20%), the device memory interface sends or receives data within 20% of the cycle. |
DCGM_FI_PROF_PCIE_TX_BYTES DCGM_FI_PROF_PCIE_RX_BYTES | Counter | B/s | The TX rate and RX rate of Peripheral Component Interconnect Express (PCIe). The bytes transmitted or received include both the header and payload. The value is an average over a cycle rather than an instantaneous value. For example, if 1 GB of data is transmitted within 1 second, the TX rate is 1 GB/second, regardless of whether the data is transmitted at a constant rate or in bursts. The theoretical maximum PCIe Gen 3 bandwidth is 985 MB/second per lane. |
DCGM_FI_PROF_NVLINK_TX_BYTES DCGM_FI_PROF_NVLINK_RX_BYTES | Counter | B/s | The TX rate and RX rate of NVLink. The bytes transmitted or received include both the header and payload. The value is an average over a cycle rather than an instantaneous value. For example, if 1 GB of data is transmitted within 1 second, the TX rate is 1 GB/second, regardless of whether the data is transmitted at a constant rate or in bursts. The theoretical maximum NVLink Gen 2 bandwidth is 25 GB/second per link in each direction. |
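As a worked illustration of the 0.5 and 0.8 thresholds given for DCGM_FI_PROF_SM_ACTIVE, the following hypothetical helper classifies a sample. The thresholds come from the table above; the function name and verdict strings are ours:

```python
# Classify a DCGM_FI_PROF_SM_ACTIVE sample (0..1) using the 0.5 and 0.8
# thresholds described above. The function name and verdict strings are
# illustrative, not part of the exporter.
def classify_sm_active(sm_active: float) -> str:
    if sm_active < 0.5:
        return "low GPU utilization"
    if sm_active <= 0.8:
        return "moderate GPU utilization"
    return "high GPU utilization"

for value in (0.2, 0.6, 0.9):
    print(value, "->", classify_sm_active(value))
```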
Clock metrics
Metric | Type | Unit | Description |
DCGM_FI_DEV_SM_CLOCK | Gauge | MHz | The SM clock. |
DCGM_FI_DEV_MEM_CLOCK | Gauge | MHz | The memory clock. |
DCGM_FI_DEV_APP_SM_CLOCK | Gauge | MHz | The SM application clock. |
DCGM_FI_DEV_APP_MEM_CLOCK | Gauge | MHz | The memory application clock. |
DCGM_FI_DEV_CLOCK_THROTTLE_REASONS | Gauge | - | The current clock throttle reasons, encoded as a bitmask of NVML throttle reason flags. |
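DCGM_FI_DEV_CLOCK_THROTTLE_REASONS reports a bitmask rather than a frequency. The sketch below decodes it using the NVML nvmlClocksThrottleReasons bit values; verify these constants against the NVML headers shipped with your driver version before relying on them.

```python
# Decode the DCGM_FI_DEV_CLOCK_THROTTLE_REASONS bitmask. The bit values follow
# the NVML nvmlClocksThrottleReasons constants; confirm them against your
# driver's NVML headers.
THROTTLE_REASONS = {
    0x0001: "GPU idle",
    0x0002: "applications clocks setting",
    0x0004: "SW power cap",
    0x0008: "HW slowdown",
    0x0010: "sync boost",
    0x0020: "SW thermal slowdown",
    0x0040: "HW thermal slowdown",
    0x0080: "HW power brake slowdown",
    0x0100: "display clock setting",
}

def decode_throttle_reasons(mask: int) -> list:
    """Return the human-readable reasons whose bits are set in the mask."""
    return [name for bit, name in THROTTLE_REASONS.items() if mask & bit]

# Example: 0x0044 -> ["SW power cap", "HW thermal slowdown"].
print(decode_throttle_reasons(0x0044))
```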
XID error and violation metrics
Metric | Type | Unit | Description |
DCGM_FI_DEV_XID_ERRORS | Gauge | - | The most recent XID error that occurred within a period of time. |
DCGM_FI_DEV_POWER_VIOLATION | Counter | μs | The power violation time. |
DCGM_FI_DEV_THERMAL_VIOLATION | Counter | μs | The thermal violation time. |
DCGM_FI_DEV_SYNC_BOOST_VIOLATION | Counter | μs | The sync boost violation time. |
DCGM_FI_DEV_BOARD_LIMIT_VIOLATION | Counter | μs | The board limit violation time. |
DCGM_FI_DEV_LOW_UTIL_VIOLATION | Counter | μs | The low utilization violation time. |
DCGM_FI_DEV_RELIABILITY_VIOLATION | Counter | μs | The board reliability violation time. |
BAR1 metrics
Metric | Type | Unit | Description |
DCGM_FI_DEV_BAR1_USED | Gauge | MB | The amount of used BAR1 memory. |
DCGM_FI_DEV_BAR1_FREE | Gauge | MB | The amount of free BAR1 memory. |
Temperature and power metrics
Metric | Type | Unit | Description |
DCGM_FI_DEV_MEMORY_TEMP | Gauge | °C | The memory temperature. |
DCGM_FI_DEV_GPU_TEMP | Gauge | °C | The GPU temperature. |
DCGM_FI_DEV_POWER_USAGE | Gauge | W | The power usage. |
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION | Counter | mJ | The total energy consumption since the driver was last reloaded. |
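Because DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION is a monotonically increasing counter, the average power draw over an interval can be derived from two of its samples; in PromQL, rate(DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION[5m]) / 1000 yields the same result in watts. A minimal sketch of the arithmetic, with hypothetical sample values:

```python
# Average power in watts from two samples of the energy counter (mJ):
# delta energy (mJ -> J) divided by elapsed time (s) gives watts.
def average_power_watts(energy_start_mj: float, energy_end_mj: float,
                        interval_seconds: float) -> float:
    return (energy_end_mj - energy_start_mj) / 1000.0 / interval_seconds

# Example: 30,000,000 mJ consumed over 120 s -> 250.0 W.
print(average_power_watts(0, 30_000_000, 120))
```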
Retired page metrics
Metric | Type | Unit | Description |
DCGM_FI_DEV_RETIRED_SBE | Gauge | - | The number of pages retired because of single-bit errors. |
DCGM_FI_DEV_RETIRED_DBE | Gauge | - | The number of pages retired because of double-bit errors. |
Custom metrics
Metric | Type | Unit | Description |
DCGM_CUSTOM_PROCESS_SM_UTIL | Gauge | % | The SM utilization per GPU process. |
DCGM_CUSTOM_PROCESS_MEM_COPY_UTIL | Gauge | % | The memory copy utilization per GPU process. |
DCGM_CUSTOM_PROCESS_ENCODE_UTIL | Gauge | % | The encoder utilization per GPU process. |
DCGM_CUSTOM_PROCESS_DECODE_UTIL | Gauge | % | The decoder utilization per GPU process. |
DCGM_CUSTOM_PROCESS_MEM_USED | Gauge | MiB | The amount of GPU memory occupied per GPU process. |
DCGM_CUSTOM_CONTAINER_MEM_ALLOCATED | Gauge | MiB | The amount of GPU memory allocated to containers. |
DCGM_CUSTOM_CONTAINER_CP_ALLOCATED | Gauge | - | The ratio of the GPU computing power allocated to a container to the total computing power provided by the GPU. The value ranges from 0 to 1. For example, if a GPU provides 100 CUs of computing power and 30 CUs are allocated to a container, the value for that container is 0.3 (30/100). The value is 0 in exclusive GPU mode and in shared GPU mode, because containers in these modes request only GPU memory and the allocation of GPU computing power is not limited. |
DCGM_CUSTOM_DEV_FB_TOTAL | Gauge | MiB | The total memory of the GPU. |
DCGM_CUSTOM_DEV_FB_ALLOCATED | Gauge | - | The ratio of allocated GPU memory to total GPU memory. The value ranges from 0 to 1. |
DCGM_CUSTOM_ALLOCATE_MODE | Gauge | - | The mode in which the node runs, for example, exclusive GPU mode or shared GPU mode. |
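DCGM_CUSTOM_DEV_FB_ALLOCATED and DCGM_CUSTOM_DEV_FB_TOTAL combine to give the absolute amount of GPU memory allocated on a GPU, which is also how the deprecated memory metrics in the next section are replaced. A small illustrative calculation with hypothetical values:

```python
# Absolute allocated GPU memory = DCGM_CUSTOM_DEV_FB_ALLOCATED (ratio, 0..1)
# multiplied by DCGM_CUSTOM_DEV_FB_TOTAL (MiB). Values below are hypothetical.
def allocated_memory_mib(fb_allocated_ratio: float, fb_total_mib: float) -> float:
    return fb_allocated_ratio * fb_total_mib

# Example: 0.25 of a 16384 MiB GPU -> 4096.0 MiB allocated.
print(allocated_memory_mib(0.25, 16384))
```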
Deprecated metrics
Deprecated metric | Metric for replacement | Description |
nvidia_gpu_temperature_celsius | DCGM_FI_DEV_GPU_TEMP | |
nvidia_gpu_power_usage_milliwatts | DCGM_FI_DEV_POWER_USAGE | |
nvidia_gpu_sharing_memory | DCGM_CUSTOM_DEV_FB_ALLOCATED * DCGM_CUSTOM_DEV_FB_TOTAL | The amount of GPU memory requested from a GPU equals the proportion of GPU memory allocated on the GPU multiplied by the total GPU memory of the GPU. |
nvidia_gpu_memory_used_bytes | DCGM_FI_DEV_FB_USED | |
nvidia_gpu_memory_total_bytes | DCGM_CUSTOM_DEV_FB_TOTAL | |
nvidia_gpu_memory_allocated_bytes | DCGM_CUSTOM_DEV_FB_ALLOCATED * DCGM_CUSTOM_DEV_FB_TOTAL | The amount of GPU memory requested from a GPU equals the proportion of GPU memory allocated on the GPU multiplied by the total GPU memory of the GPU. |
nvidia_gpu_duty_cycle | DCGM_FI_DEV_GPU_UTIL | |
nvidia_gpu_allocated_num_devices | sum(DCGM_CUSTOM_DEV_FB_ALLOCATED) | Summing the proportions of GPU memory allocated on each GPU of a node yields the total number of GPUs requested on the node. |
nvidia_gpu_num_devices | DCGM_FI_DEV_COUNT | |
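When migrating dashboards, note that some replacements also change units: nvidia_gpu_power_usage_milliwatts reports milliwatts while DCGM_FI_DEV_POWER_USAGE reports watts, and the byte-based memory metrics are replaced by MiB-based ones, so queries may need scaling factors. The sketch below encodes the replacement column as a lookup table that a migration script could use; the mapping comes from the table above, while the script structure is illustrative:

```python
# Map each deprecated metric to its replacement expression, per the table
# above. Unit conversions (mW -> W, bytes -> MiB) must still be applied by the
# caller where needed.
REPLACEMENTS = {
    "nvidia_gpu_temperature_celsius": "DCGM_FI_DEV_GPU_TEMP",
    "nvidia_gpu_power_usage_milliwatts": "DCGM_FI_DEV_POWER_USAGE",
    "nvidia_gpu_sharing_memory":
        "DCGM_CUSTOM_DEV_FB_ALLOCATED * DCGM_CUSTOM_DEV_FB_TOTAL",
    "nvidia_gpu_memory_used_bytes": "DCGM_FI_DEV_FB_USED",
    "nvidia_gpu_memory_total_bytes": "DCGM_CUSTOM_DEV_FB_TOTAL",
    "nvidia_gpu_memory_allocated_bytes":
        "DCGM_CUSTOM_DEV_FB_ALLOCATED * DCGM_CUSTOM_DEV_FB_TOTAL",
    "nvidia_gpu_duty_cycle": "DCGM_FI_DEV_GPU_UTIL",
    "nvidia_gpu_allocated_num_devices": "sum(DCGM_CUSTOM_DEV_FB_ALLOCATED)",
    "nvidia_gpu_num_devices": "DCGM_FI_DEV_COUNT",
}

def replacement_for(deprecated_metric: str) -> str:
    """Look up the replacement expression for a deprecated metric name."""
    return REPLACEMENTS[deprecated_metric]

print(replacement_for("nvidia_gpu_duty_cycle"))  # -> DCGM_FI_DEV_GPU_UTIL
```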