
Container Service for Kubernetes:Implement auto scaling based on GPU metrics

Last Updated: Feb 09, 2026

Kubernetes provides a Custom Metrics mechanism that integrates with Managed Service for Prometheus to collect GPU metrics. This topic describes how to deploy Managed Service for Prometheus and provides an example of how to use Managed Service for Prometheus to monitor GPU metrics and implement auto scaling for containers.

Prerequisites

You have added a GPU node to your cluster or created a dedicated GPU cluster.

Overview

In high-performance computing (HPC) scenarios, such as deep learning model training and inference, GPUs are often used to accelerate computation. To save costs, you can implement auto scaling based on GPU metrics, such as GPU utilization and GPU memory usage.

By default, Kubernetes uses CPU and memory as metrics for Horizontal Pod Autoscaler (HPA) auto scaling. For more complex scenarios, such as auto scaling based on GPU metrics, use Prometheus Adapter to adapt the GPU metrics collected by Prometheus. Then, use the Custom Metrics API to extend the HPA metrics. This workflow lets you implement auto scaling based on metrics such as GPU utilization and GPU memory. The following figure shows how GPU auto scaling works.

(Figure: GPU auto scaling workflow)

Step 1: Deploy Managed Service for Prometheus and Metrics Adapter

  1. Enable Prometheus monitoring.

    Note

    If you chose to install Prometheus when you created the cluster, you do not need to install it again.

  2. Install and configure ack-alibaba-cloud-metrics-adapter.

    1. Obtain the HTTP API address

    1. Log on to the ARMS console.

    2. In the left navigation pane, choose Managed Service for Prometheus > Instances.

    3. At the top of the page, select the region where your Container Service for Kubernetes (ACK) cluster is located. Then, click the name of the target instance.

    4. On the instance details page, click the Settings tab. In the HTTP API URL (Grafana Read Address) section, copy the internal network address.

    2. Configure the Prometheus URL

    1. Log on to the ACK console. In the left-side navigation pane, choose Marketplace > App Marketplace.

    2. On the App Marketplace page, click the App Catalog tab. Search for and click ack-alibaba-cloud-metrics-adapter.

    3. On the ack-alibaba-cloud-metrics-adapter page, click Deploy.

    4. In the Basic Information wizard, select a cluster and namespace, and then click Next.

    5. In the Parameters wizard, select a Chart Version. In the Parameters section, set the value of the Prometheus url parameter to the HTTP API address that you obtained. Then, click OK.

Step 2: Configure Adapter Rules for GPU metrics

1. Query GPU metrics

Query the GPU metrics that Prometheus collects to identify the metrics that you want to use for scaling. For more information, see Monitoring metric descriptions.

2. Configure Adapter Rules

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, find the cluster you want and click its name. In the left-side navigation pane, choose Applications > Helm.

  3. In the Actions column of the Helm release, click Update for ack-alibaba-cloud-metrics-adapter. Add the following rules to the custom field.

    - metricsQuery: avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)
      resources:
        overrides:
          NodeName:
            resource: node
      seriesQuery: DCGM_FI_DEV_GPU_UTIL{} # GPU utilization
    - metricsQuery: avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)
      resources:
        overrides:
          NamespaceName:
            resource: namespace
          NodeName:
            resource: node
          PodName:
            resource: pod
      seriesQuery: DCGM_CUSTOM_PROCESS_SM_UTIL{} # Container GPU utilization.
    - metricsQuery: sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)
      resources:
        overrides:
          NodeName:
            resource: node
      seriesQuery: DCGM_FI_DEV_FB_USED{} # GPU memory usage.
    - metricsQuery: sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)
      resources:
        overrides:
          NamespaceName:
            resource: namespace
          NodeName:
            resource: node
          PodName:
            resource: pod
      seriesQuery: DCGM_CUSTOM_PROCESS_MEM_USED{} # Container GPU memory usage.
    - metricsQuery: sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>) / sum(DCGM_CUSTOM_CONTAINER_MEM_ALLOCATED{}) by (<<.GroupBy>>)
      name:
        as: ${1}_GPU_MEM_USED_RATIO
        matches: ^(.*)_MEM_USED
      resources:
        overrides:
          NamespaceName:
            resource: namespace
          PodName:
            resource: pod
      seriesQuery: DCGM_CUSTOM_PROCESS_MEM_USED{NamespaceName!="",PodName!=""}  # Container GPU memory utilization.

    After you add the rules, the configuration is displayed as shown in the following figure.

    (Figure: the adapter rules added to the custom field of the Helm release)

    Run the following command. If the output contains metrics that HPA can recognize, such as DCGM_FI_DEV_GPU_UTIL, DCGM_CUSTOM_PROCESS_SM_UTIL, DCGM_FI_DEV_FB_USED, and DCGM_CUSTOM_PROCESS_MEM_USED, the configuration is complete. This example uses DCGM_CUSTOM_PROCESS_SM_UTIL. The actual output may vary.

    kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1"

    Expected output (the resources section contains metrics related to DCGM_CUSTOM_PROCESS_SM_UTIL):

    {
      "kind": "APIResourceList",
      "apiVersion": "v1",
      "groupVersion": "custom.metrics.k8s.io/v1beta1",
      "resources": [
        {
          "name": "nodes/DCGM_CUSTOM_PROCESS_SM_UTIL",
          "singularName": "",
          "namespaced": false,
          "kind": "MetricValueList",
          "verbs": [
            "get"
          ]
        },
        {
          "name": "pods/DCGM_CUSTOM_PROCESS_SM_UTIL",
          "singularName": "",
          "namespaced": true,
          "kind": "MetricValueList",
          "verbs": [
            "get"
          ]
        },
        {
          "name": "namespaces/DCGM_CUSTOM_PROCESS_SM_UTIL",
          "singularName": "",
          "namespaced": false,
          "kind": "MetricValueList",
          "verbs": [
            "get"
          ]
        },
        {
          "name": "DCGM_CUSTOM_PROCESS_GPU_MEM_USED_RATIO",
          "singularName": "",
          "namespaced": false,
          "kind": "MetricValueList",
          "verbs": [
            "get"
          ]
        }
      ]
    }
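The <<.Series>>, <<.LabelMatchers>>, and <<.GroupBy>> placeholders in the adapter rules above are Go template variables that Prometheus Adapter fills in when it translates a Custom Metrics API request into a PromQL query. The substitution can be sketched in Python as follows (the label values below are hypothetical, and the adapter itself is implemented in Go, so this is only an illustration):

```python
# Rough illustration of how Prometheus Adapter expands a metricsQuery
# template into a PromQL query. Not the adapter's actual implementation.
def expand(metrics_query: str, series: str, label_matchers: str, group_by: str) -> str:
    return (metrics_query
            .replace("<<.Series>>", series)
            .replace("<<.LabelMatchers>>", label_matchers)
            .replace("<<.GroupBy>>", group_by))

# Hypothetical lookup: container GPU utilization for one pod.
query = expand(
    "avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)",
    series="DCGM_CUSTOM_PROCESS_SM_UTIL",
    label_matchers='NamespaceName="default",PodName="example-pod"',
    group_by="PodName",
)
print(query)
# avg(DCGM_CUSTOM_PROCESS_SM_UTIL{NamespaceName="default",PodName="example-pod"}) by (PodName)
```

The resources.overrides sections in the rules tell the adapter which Prometheus labels map to which Kubernetes resources, which is how a query scoped to a pod or node ends up with the matching label selectors.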

Step 3: Implement auto scaling based on GPU metrics

This example deploys a model inference service on a GPU node, and then runs a stress test against the service to verify auto scaling based on GPU utilization.

1. Deploy the inference service

  1. Run the following command to deploy the inference service.

    cat <<EOF | kubectl create -f -
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: bert-intent-detection
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: bert-intent-detection
      template:
        metadata:
          labels:
            app: bert-intent-detection
        spec:
          containers:
          - name: bert-container
            image: registry.cn-hangzhou.aliyuncs.com/ai-samples/bert-intent-detection:1.0.1
            ports:
            - containerPort: 80
            resources:
              limits:
                nvidia.com/gpu: 1
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: bert-intent-detection-svc
      labels:
        app: bert-intent-detection
    spec:
      selector:
        app: bert-intent-detection
      ports:
      - protocol: TCP
        name: http
        port: 80
        targetPort: 80
      type: LoadBalancer
    EOF
  2. Check the status of the pod and service.

    • Run the following command to check the pod status.

      kubectl get pods -o wide

      Expected output:

      NAME                                    READY   STATUS    RESTARTS   AGE     IP           NODE                        NOMINATED NODE   READINESS GATES
      bert-intent-detection-7b486f6bf-f****   1/1     Running   0          3m24s   10.15.1.17   cn-beijing.192.168.94.107   <none>           <none>

      The expected output shows that only one pod is deployed on the GPU node 192.168.94.107.

    • Run the following command to check the service status.

      kubectl get svc bert-intent-detection-svc

      Expected output:

      NAME                        TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
      bert-intent-detection-svc   LoadBalancer   172.16.186.159   47.95.XX.XX   80:30118/TCP   5m1s

      The output shows that the service has an external IP address, which indicates that the service is deployed.

  3. Log on to the GPU node 192.168.94.107 using Secure Shell (SSH). Then, run the following command to check the GPU usage.

    nvidia-smi

    Expected output:

    Wed Feb 16 11:48:07 2022
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 450.102.04   Driver Version: 450.102.04   CUDA Version: 11.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla V100-SXM2...  On   | 00000000:00:07.0 Off |                    0 |
    | N/A   32C    P0    55W / 300W |  15345MiB / 16160MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |    0   N/A  N/A   2305118      C   python                          15343MiB |
    +-----------------------------------------------------------------------------+

    The expected output shows that the inference service process is running on the GPU. Because no requests have been sent, the current GPU utilization is 0.

  4. Run the following command to call the inference service and verify the deployment.

    curl -v  "http://47.95.XX.XX/predict?query=Music"

    Expected output:

    *   Trying 47.95.XX.XX...
    * TCP_NODELAY set
    * Connected to 47.95.XX.XX (47.95.XX.XX) port 80 (#0)
    > GET /predict?query=Music HTTP/1.1
    > Host: 47.95.XX.XX
    > User-Agent: curl/7.64.1
    > Accept: */*
    >
    * HTTP 1.0, assume close after body
    < HTTP/1.0 200 OK
    < Content-Type: text/html; charset=utf-8
    < Content-Length: 9
    < Server: Werkzeug/1.0.1 Python/3.6.9
    < Date: Wed, 16 Feb 2022 03:52:11 GMT
    <
    * Closing connection 0
    PlayMusic # Intent recognition result.

    The HTTP request returns a status code of 200 and the intent recognition result. This indicates that the inference service is deployed.

2. Configure the HPA

This example uses GPU utilization. When the GPU utilization of a pod exceeds 20%, a scale-out is triggered. The HPA supports the following metrics.

  • DCGM_FI_DEV_GPU_UTIL (unit: %): GPU card utilization. This metric is valid only for dedicated GPU scheduling.

    Important

    For shared GPUs, the same GPU card is assigned to multiple pods. NVIDIA provides only card-level utilization, not application-level utilization. Therefore, the utilization shown when you run nvidia-smi in a pod is the utilization of the entire card.

  • DCGM_FI_DEV_FB_USED (unit: MiB): GPU card memory usage. This metric is valid only for dedicated GPU scheduling.

  • DCGM_CUSTOM_PROCESS_SM_UTIL (unit: %): GPU utilization of the container.

  • DCGM_CUSTOM_PROCESS_MEM_USED (unit: MiB): GPU memory usage of the container.

  • DCGM_CUSTOM_PROCESS_GPU_MEM_USED_RATIO (unit: %): GPU memory utilization of the container. GPU memory utilization of the container = Actual GPU memory usage of the current pod container (Used) / GPU memory allocated to the current pod container (Allocated).
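The DCGM_CUSTOM_PROCESS_GPU_MEM_USED_RATIO calculation can be sketched as a quick Python check (the memory values below are hypothetical):

```python
def gpu_mem_used_ratio(used_mib: float, allocated_mib: float) -> float:
    """Container GPU memory utilization: Used / Allocated, as a percentage."""
    return used_mib / allocated_mib * 100

# Hypothetical values: a container that uses 3072 MiB of a 4096 MiB allocation.
print(gpu_mem_used_ratio(3072, 4096))  # 75.0
```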

  1. Run the following command to deploy the HPA.

    v1.23 or later

    cat <<EOF | kubectl create -f -
    apiVersion: autoscaling/v2  # Use the autoscaling/v2 version for HPA configuration.
    kind: HorizontalPodAutoscaler
    metadata:
      name: gpu-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: bert-intent-detection
      minReplicas: 1
      maxReplicas: 10
      metrics:
      - type: Pods
        pods:
          metric:
            name: DCGM_CUSTOM_PROCESS_SM_UTIL
          target:
            type: AverageValue
            averageValue: 20 # A scale-out is triggered when the container's GPU utilization exceeds 20%.
    EOF

    Earlier than v1.23

    cat <<EOF | kubectl create -f -
    apiVersion: autoscaling/v2beta1  # Use the autoscaling/v2beta1 version for HPA configuration.
    kind: HorizontalPodAutoscaler
    metadata:
      name: gpu-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: bert-intent-detection
      minReplicas: 1
      maxReplicas: 10
      metrics:
      - type: Pods
        pods:
          metricName: DCGM_CUSTOM_PROCESS_SM_UTIL # GPU utilization of the pod.
          targetAverageValue: 20 # A scale-out is triggered when the container's GPU utilization exceeds 20%.
    EOF
  2. Run the following command to check the HPA status.

    kubectl get hpa

    Expected output:

    NAME      REFERENCE                          TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
    gpu-hpa   Deployment/bert-intent-detection   0/20      1         10        1          74s

    The expected output shows that TARGETS is 0/20. This indicates that the current GPU utilization is 0. A scale-out is triggered when the GPU utilization exceeds 20%.

3. Test auto scaling for the inference service

Scale-out

  1. Run the following command to perform a stress test. This example uses hey, an open source HTTP load testing tool.

    hey -n 10000 -c 200 "http://47.95.XX.XX/predict?query=music"

    Note

    The formula to calculate the desired number of replicas for an HPA scale-out is: Desired Replicas = ceil[Current Replicas × (Current Metric / Desired Metric)]. For example, if the current number of replicas is 1, the current metric is 23, and the desired metric is 20, the desired number of replicas is ceil[1 × (23 / 20)] = 2.

  2. During the stress test, monitor the status of the HPA and the pods.

    1. Run the following command to check the HPA status.

      kubectl get hpa

      Expected output:

      NAME      REFERENCE                          TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
      gpu-hpa   Deployment/bert-intent-detection   23/20     1         10        2          7m56s

      The output shows that the TARGETS value is 23/20, which indicates that the current GPU utilization exceeds the 20% target and triggers a scale-out.

    2. Run the following command to check the pod status.

      kubectl get pods

      Expected output:

      NAME                                    READY   STATUS    RESTARTS   AGE
      bert-intent-detection-7b486f6bf-f****   1/1     Running   0          44m
      bert-intent-detection-7b486f6bf-m****   1/1     Running   0          14s

      The expected output shows two pods, which matches the desired replica count of 2 that the formula calculates.

    The expected outputs for the HPA and pods indicate that the pods were successfully scaled out.
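The replica calculation from the note in step 1 can be sketched in Python. This is a minimal illustration of the HPA formula only; the actual controller also applies tolerance and stabilization windows:

```python
import math

def desired_replicas(current: int, current_metric: float, target_metric: float) -> int:
    """HPA formula: Desired Replicas = ceil[Current Replicas x (Current Metric / Desired Metric)]."""
    return math.ceil(current * current_metric / target_metric)

# 1 replica at 23% container GPU utilization against a 20% target:
print(desired_replicas(1, 23, 20))  # 2
```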

Scale-in

When the stress test stops, the GPU utilization drops below 20%. The system then starts to scale in.

  1. Run the following command to check the HPA status.

    kubectl get hpa

    Expected output:

    NAME      REFERENCE                          TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
    gpu-hpa   Deployment/bert-intent-detection   0/20      1         10        1          15m

    The expected output shows that TARGETS is 0/20, which indicates that the current GPU utilization is 0. After about five minutes, the system starts to scale in.

  2. Run the following command to check the pod status.

    kubectl get pods

    Expected output:

    NAME                                    READY   STATUS    RESTARTS   AGE
    bert-intent-detection-7b486f6bf-f****   1/1     Running   0          52m

    The expected output shows that the number of pods is 1. This indicates that the scale-in was successful.

FAQ

How can I determine if a GPU card is being used?

To determine whether a GPU card is in use, observe the fluctuations in GPU card utilization on the GPU Monitoring tab. If the utilization increases, the GPU card is being used. If the utilization does not change, the card is not being used. To do this, perform the following steps:

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, find the cluster you want and click its name. In the left navigation pane, choose Operations > Prometheus Monitoring.

  3. On the Prometheus Monitoring page, click the GPU Monitoring tab. Observe the fluctuations in GPU card utilization.