Kubernetes supports auto scaling based on custom metrics. Kubernetes can work with Managed Service for Prometheus to implement auto scaling based on GPU metrics. This topic describes how to deploy Managed Service for Prometheus to monitor applications. This topic also provides examples on how to view GPU metrics that are collected by Managed Service for Prometheus and enable auto scaling of pods based on GPU metrics.
Prerequisites
An ACK cluster with GPU-accelerated nodes or an ACK dedicated cluster with GPU-accelerated nodes is created.
Introduction
GPU-accelerated computing is widely used in high-performance computing scenarios, such as the training of deep learning models and inference. To reduce resource costs, you can enable cluster auto scaling based on GPU metrics, such as GPU utilization and GPU memory usage.
By default, Kubernetes enables horizontal pod autoscaling based on CPU and memory metrics. If you have higher requirements, you can use the Prometheus adapter to support the GPU metrics that are collected by Prometheus and use the custom metrics API to define custom metrics. This allows you to enable horizontal pod autoscaling based on GPU utilization and GPU memory usage. The following figure shows how auto scaling based on GPU metrics works.
Step 1: Deploy Managed Service for Prometheus and ack-alibaba-cloud-metrics-adapter
Enable Managed Service for Prometheus.
Note: You can select Enable Managed Service for Prometheus when you create a cluster. This way, you do not need to install Managed Service for Prometheus after the cluster is created.
Install and configure ack-alibaba-cloud-metrics-adapter.
a. Obtain the HTTP API endpoint
Log on to the ARMS console.
In the left-side navigation pane, choose .
In the top navigation bar, select the region where your ACK cluster is deployed. Then, click the name of a Prometheus instance whose Instance Type is Prometheus for Container Service. The details page of the Prometheus instance appears.
In the left-side navigation pane of the instance details page, click Settings and copy the internal endpoint in the HTTP API Address section.
b. Configure the Prometheus URL
Log on to the ACK console. In the left-side navigation pane, choose .
On the Marketplace page, click the App Catalog tab. Find and click ack-alibaba-cloud-metrics-adapter.
On the ack-alibaba-cloud-metrics-adapter page, click Deploy.
On the Basic Information wizard page, select a cluster and a namespace, and then click Next.
On the Parameters wizard page, select a chart version from the Chart Version drop-down list, set the Prometheus URL in the Parameters section to the HTTP API endpoint that you obtained, and then click OK.
Step 2: Configure rules for ack-alibaba-cloud-metrics-adapter
a. Query GPU metrics
Query GPU metrics. For more information, see Introduction to metrics.
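One way to query a GPU metric is through the standard Prometheus HTTP API (`/api/v1/query`). The following Python sketch builds the instant-query URL and parses a response body. It assumes that the HTTP API endpoint obtained in Step 1 accepts the standard Prometheus query path. The endpoint used in the usage note, the function names, and the sample response are hypothetical.

```python
import json
from typing import Dict, List, Tuple
from urllib.parse import urlencode


def build_query_url(http_api_endpoint: str, promql: str) -> str:
    """Return the instant-query URL for a PromQL expression.

    Assumption: the endpoint follows the standard Prometheus HTTP API layout.
    """
    base = http_api_endpoint.rstrip("/")
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"


def extract_samples(response_body: str) -> List[Tuple[Dict[str, str], float]]:
    """Parse an instant-query response into (labels, value) pairs."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    # Each vector sample has the form {"metric": {...}, "value": [timestamp, "value"]}.
    return [
        (item["metric"], float(item["value"][1]))
        for item in payload["data"]["result"]
    ]
```

For example, `build_query_url("http://prometheus.example.internal:9090", "DCGM_FI_DEV_GPU_UTIL")` returns a URL that you can pass to curl from a machine that can reach the internal endpoint.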
b. Configure rules for ack-alibaba-cloud-metrics-adapter
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, click the name of the cluster that you want to manage and choose in the left-side navigation pane.
On the Helm page, click Update in the Actions column of ack-alibaba-cloud-metrics-adapter. Add the following rules below custom. The following figure provides an example.
Run the following command. If the output includes DCGM_FI_DEV_GPU_UTIL, DCGM_CUSTOM_PROCESS_SM_UTIL, DCGM_FI_DEV_FB_USED, and DCGM_CUSTOM_PROCESS_MEM_USED, the rules are configured. In the following example, DCGM_CUSTOM_PROCESS_SM_UTIL is returned in the output.
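The check for the four metric names can also be scripted. The following Python sketch assumes that you saved the JSON discovery document of the custom metrics API, for example the output of `kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1"`, and that each entry in `resources` is named `<resource>/<metric>`. The function name and the sample document in the usage note are hypothetical.

```python
import json
from typing import Iterable, List

# The metric names that the rules are expected to expose.
REQUIRED_METRICS = (
    "DCGM_FI_DEV_GPU_UTIL",
    "DCGM_CUSTOM_PROCESS_SM_UTIL",
    "DCGM_FI_DEV_FB_USED",
    "DCGM_CUSTOM_PROCESS_MEM_USED",
)


def missing_metrics(
    api_resource_list_json: str,
    required: Iterable[str] = REQUIRED_METRICS,
) -> List[str]:
    """Return the required metric names that the discovery document lacks."""
    resources = json.loads(api_resource_list_json).get("resources", [])
    # Resource names look like "pods/<metric>"; keep only the metric part.
    exposed = {entry["name"].split("/", 1)[-1] for entry in resources}
    return [name for name in required if name not in exposed]
```

An empty list means that all four metrics are exposed. A non-empty list names the metrics for which the rules still need to be checked.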
Step 3: Enable auto scaling based on GPU metrics
The following example shows how to deploy a model inference service on a GPU-accelerated node and perform stress tests on the node to check whether auto scaling can be performed based on GPU metrics.
a. Deploy an inference service
Run the following command to deploy the inference service:
Query the status of the pod and Service.
Run the following command to query the status of the pod:
kubectl get pods -o wide
Expected output:
NAME                                    READY   STATUS    RESTARTS   AGE     IP           NODE                        NOMINATED NODE   READINESS GATES
bert-intent-detection-7b486f6bf-f****   1/1     Running   0          3m24s   10.15.1.17   cn-beijing.192.168.94.107   <none>           <none>
The output indicates that only one pod is deployed on the GPU-accelerated node 192.168.94.107.
Run the following command to query the status of the Service:
kubectl get svc bert-intent-detection-svc
Expected output:
NAME                        TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
bert-intent-detection-svc   LoadBalancer   172.16.186.159   47.95.XX.XX   80:30118/TCP   5m1s
If the output displays the name of the Service, the Service is deployed.
Log on to the node 192.168.94.107 by using SSH and run the following command to query GPU utilization:
nvidia-smi
Expected output:
Wed Feb 16 11:48:07 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.102.04   Driver Version: 450.102.04   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:07.0 Off |                    0 |
| N/A   32C    P0    55W / 300W |  15345MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   2305118      C   python                          15343MiB |
+-----------------------------------------------------------------------------+
The output indicates that the inference service is running on the GPU-accelerated node. The GPU utilization is 0 because no requests have been sent to the service.
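If you want to read the same values from a script instead of the formatted table, nvidia-smi can print them as CSV, for example with `nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits`. The following Python sketch parses that CSV output. The function name and the dictionary keys are hypothetical, and the sample line in the usage note mirrors the values in the preceding output.

```python
from typing import Dict, List


def parse_gpu_csv(output: str) -> List[Dict[str, int]]:
    """Parse 'utilization, memory used, memory total' lines, one per GPU."""
    gpus = []
    for line in output.strip().splitlines():
        # Fields are comma-separated and may carry surrounding spaces.
        util, used, total = (int(field.strip()) for field in line.split(","))
        gpus.append(
            {
                "util_percent": util,
                "memory_used_mib": used,
                "memory_total_mib": total,
            }
        )
    return gpus
```

For the node in this example, the input line `0, 15345, 16160` yields a utilization of 0% with 15345 MiB of 16160 MiB in use.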
Run the following command to send requests to the inference service and check whether the service is deployed:
curl -v "http://47.95.XX.XX/predict?query=Music"
Expected output:
*   Trying 47.95.XX.XX...
* TCP_NODELAY set
* Connected to 47.95.XX.XX (47.95.XX.XX) port 80 (#0)
> GET /predict?query=Music HTTP/1.1
> Host: 47.95.XX.XX
> User-Agent: curl/7.64.1
> Accept: */*
>
* HTTP 1.0, assume close after body
< HTTP/1.0 200 OK
< Content-Type: text/html; charset=utf-8
< Content-Length: 9
< Server: Werkzeug/1.0.1 Python/3.6.9
< Date: Wed, 16 Feb 2022 03:52:11 GMT
<
* Closing connection 0
PlayMusic # The query result.
If the HTTP status code 200 and the query result are returned, the inference service is deployed.
b. Configure the HPA
The following example describes how to trigger auto scaling when the GPU utilization of a pod exceeds 20%. The following table describes the metrics that are supported by the Horizontal Pod Autoscaler (HPA).
Metric | Description | Unit |
DCGM_FI_DEV_GPU_UTIL | | % |
DCGM_FI_DEV_FB_USED | | MiB |
DCGM_CUSTOM_PROCESS_SM_UTIL | The GPU utilization of pods. | % |
DCGM_CUSTOM_PROCESS_MEM_USED | The amount of GPU memory that is used by pods. | MiB |
DCGM_CUSTOM_PROCESS_GPU_MEM_USED_RATIO | The GPU memory utilization of pods. | % |
Run the following command to deploy the HPA:
Clusters that run Kubernetes 1.23 or later
cat <<EOF | kubectl create -f -
apiVersion: autoscaling/v2  # Use the HPA configuration for API version autoscaling/v2.
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: bert-intent-detection
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: DCGM_CUSTOM_PROCESS_SM_UTIL  # The GPU utilization of pods.
      target:
        type: AverageValue  # Pods metrics support only the AverageValue target type.
        averageValue: 20  # If the GPU utilization exceeds 20%, pods are scaled out.
EOF
Clusters that run Kubernetes versions earlier than 1.23
cat <<EOF | kubectl create -f -
apiVersion: autoscaling/v2beta1  # Use the HPA configuration for API version autoscaling/v2beta1.
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: bert-intent-detection
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metricName: DCGM_CUSTOM_PROCESS_SM_UTIL  # The GPU utilization of pods.
      targetAverageValue: 20  # If the GPU utilization exceeds 20%, pods are scaled out.
EOF
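The two variants differ only in the API version and in how the Pods metric is written. The following Python sketch generates the matching manifest from the minor version of the cluster, assuming a Kubernetes 1.x cluster. The function name and parameters are hypothetical. For autoscaling/v2, the sketch uses the AverageValue target type, which is the target type that the Kubernetes documentation describes for Pods metrics.

```python
import json
from typing import Any, Dict


def build_gpu_hpa(
    minor_version: int,
    metric: str = "DCGM_CUSTOM_PROCESS_SM_UTIL",
    target: int = 20,
) -> Dict[str, Any]:
    """Build the gpu-hpa manifest for a Kubernetes 1.<minor_version> cluster."""
    if minor_version >= 23:
        # autoscaling/v2: the metric name and target are nested objects.
        api_version = "autoscaling/v2"
        pods = {
            "metric": {"name": metric},
            "target": {"type": "AverageValue", "averageValue": target},
        }
    else:
        # autoscaling/v2beta1: flat metricName and targetAverageValue fields.
        api_version = "autoscaling/v2beta1"
        pods = {"metricName": metric, "targetAverageValue": target}
    return {
        "apiVersion": api_version,
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": "gpu-hpa"},
        "spec": {
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": "bert-intent-detection",
            },
            "minReplicas": 1,
            "maxReplicas": 10,
            "metrics": [{"type": "Pods", "pods": pods}],
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests, so the output can be piped to
    # "kubectl create -f -".
    print(json.dumps(build_gpu_hpa(23), indent=2))
```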
Run the following command to query the status of the HPA:
kubectl get hpa
Expected output:
NAME      REFERENCE                          TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
gpu-hpa   Deployment/bert-intent-detection   0/20      1         10        1          74s
The output shows that the TARGETS column displays 0/20, which means that the current GPU utilization is 0. When the GPU utilization exceeds 20%, pods are scaled out.
c. Test auto scaling on the inference service
Test scale-out activities
Run the following command to perform the stress test:
hey -n 10000 -c 200 "http://47.95.XX.XX/predict?query=music"
Note: The following formula is used to calculate the expected number of pods after auto scaling: Expected number of pods = ceil[Current number of pods × (Current GPU utilization/Expected GPU utilization)]. For example, if the current number of pods is 1, the current GPU utilization is 23, and the expected GPU utilization is 20, the expected number of pods after auto scaling is 2.
During the stress test, run the following commands to query the status of the HPA and the pods:
Run the following command to query the status of the HPA:
kubectl get hpa
Expected output:
NAME      REFERENCE                          TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
gpu-hpa   Deployment/bert-intent-detection   23/20     1         10        2          7m56s
The output indicates that the value in the TARGETS column is 23/20. The current GPU utilization exceeds the 20% threshold. In this case, auto scaling is triggered and the ACK cluster starts to scale out pods.
Run the following command to query the status of the pods:
kubectl get pods
Expected output:
NAME                                    READY   STATUS    RESTARTS   AGE
bert-intent-detection-7b486f6bf-f****   1/1     Running   0          44m
bert-intent-detection-7b486f6bf-m****   1/1     Running   0          14s
The output indicates that two pods are running. This value is the same as the expected number of pods calculated based on the preceding formula.
The output returned by the HPA and the pods indicates that the pods are scaled out.
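The formula for the expected number of pods can be checked with a short calculation. The following Python sketch implements it and clamps the result to the minReplicas and maxReplicas values of the HPA. The function name is hypothetical. The tolerance parameter is an addition that is not part of the formula in this topic: it models the default behavior of the Kubernetes HPA, which skips scaling when the ratio is within 10% of 1.

```python
import math


def expected_pods(
    current_pods: int,
    current_util: float,
    target_util: float,
    min_pods: int = 1,
    max_pods: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Expected number of pods = ceil(current pods x current / target)."""
    ratio = current_util / target_util
    # Assumption: within the tolerance band, the HPA keeps the current count.
    if abs(ratio - 1.0) <= tolerance:
        return current_pods
    desired = math.ceil(current_pods * ratio)
    # The result never leaves the [min_pods, max_pods] range.
    return max(min_pods, min(max_pods, desired))
```

With the values from the stress test, `expected_pods(1, 23, 20)` evaluates ceil(1 × 1.15) and returns 2, which matches the two running pods in the preceding output.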
Test scale-in activities
When the stress test stops and the GPU utilization drops below 20%, the ACK cluster starts to scale in pods.
Run the following command to query the status of the HPA:
kubectl get hpa
Expected output:
NAME      REFERENCE                          TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
gpu-hpa   Deployment/bert-intent-detection   0/20      1         10        1          15m
The output indicates that the value in the TARGETS column is 0/20. The current GPU utilization drops to 0. The ACK cluster starts to scale in pods after about 5 minutes.
Run the following command to query the status of the pods:
kubectl get pods
Expected output:
NAME                                    READY   STATUS    RESTARTS   AGE
bert-intent-detection-7b486f6bf-f****   1/1     Running   0          52m
The output indicates that the number of pods is 1. This means that the pods are scaled in.
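The delay of about 5 minutes before scale-in is consistent with the default scale-down stabilization window of the Kubernetes HPA (300 seconds). This topic does not name the mechanism, so treat that link as an assumption. During the window, the HPA acts on the highest replica recommendation that it recorded, which prevents pods from being removed when the load drops only briefly. The following Python sketch illustrates the idea. The function name and the sample timestamps are hypothetical.

```python
from typing import List, Tuple


def stabilized_replicas(
    recommendations: List[Tuple[float, int]],
    now: float,
    window_seconds: float = 300.0,
) -> int:
    """Return the highest recommendation recorded within the window.

    recommendations holds (timestamp_in_seconds, desired_replicas) pairs.
    """
    recent = [
        replicas
        for timestamp, replicas in recommendations
        if now - timestamp <= window_seconds
    ]
    if not recent:
        raise ValueError("no recommendation falls inside the window")
    return max(recent)
```

For example, if 2 pods were recommended at second 0 and 1 pod at seconds 60 and 120, the function still returns 2 at second 120. It returns 1 only after the recommendation of 2 pods is more than 300 seconds old.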
FAQ
How do I confirm whether a GPU is used?
You can check whether there are changes in the GPU utilization on the GPU Monitoring tab. If the GPU utilization increases, a GPU is used. If no changes are found in the GPU utilization, no GPU is used. To do this, perform the following steps:
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, click the name of the cluster that you want to manage. In the left-side navigation pane, choose .
On the Prometheus Monitoring page, click the GPU Monitoring tab and view changes in the GPU utilization.