This topic describes how to use Application Real-Time Monitoring Prometheus Service to monitor the GPU resources of a Kubernetes cluster.
Prerequisites
The following operations are completed:
- An ACK cluster with GPU-accelerated nodes is created. For more information, see Create an ACK managed cluster with GPU-accelerated nodes or Create an ACK dedicated cluster with GPU-accelerated nodes.
- Prometheus Service is enabled.
- Prometheus Service is installed. For more information, see Enable Prometheus Service.
Use Prometheus Service to monitor GPU resources
- Log on to the ACK console.
- On the Clusters page, find the cluster that you want to manage and click the name of the cluster or click Details in the Actions column. The details page of the cluster appears.
- In the left-side navigation pane of the details page, choose .
- On the Prometheus Monitoring page, you can click the GPU APP tab to view the GPU APP dashboard and click the GPU Node tab to view the GPU Node dashboard.
- The GPU APP dashboard displays monitoring information about the GPU resources used by each pod.
- The GPU Node dashboard displays monitoring information about the GPU resource usage of each node.
- Use the following YAML template to deploy an application on a GPU-accelerated node and test the monitoring of GPU resources.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: bert-intent-detection
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: bert-intent-detection
      template:
        metadata:
          labels:
            app: bert-intent-detection
        spec:
          containers:
          - name: bert-container
            image: registry.cn-beijing.aliyuncs.com/ai-samples/bert-intent-detection:1.0.1
            ports:
            - containerPort: 80
            resources:
              limits:
                nvidia.com/gpu: 1
    ---
    apiVersion: v1
    kind: Service
    metadata:
      labels:
        run: bert-intent-detection
      name: bert-intent-detection-svc
    spec:
      ports:
      - port: 8500
        targetPort: 80
      selector:
        app: bert-intent-detection
      type: LoadBalancer
- On the Prometheus Monitoring page, click the GPU APP tab. On the GPU APP tab, you can view various metrics of the GPU resources used by each pod, including the used GPU memory, GPU memory usage, power consumption, and stability. You can also view the applications deployed on each GPU-accelerated node.
- Run a stress test against the application deployed on the GPU-accelerated node and observe how the metrics on the GPU APP and GPU Node dashboards change.
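The stress test in the last step can be driven by any load generator. The following is a minimal sketch in Python, not an official tool: it sends concurrent HTTP requests so that the GPU metrics have something to react to. The `TARGET_URL` value, the request path, and the use of plain GET requests are assumptions, because this topic does not describe the API of the sample image. Replace the placeholder with the external IP address of `bert-intent-detection-svc` (port 8500 comes from the Service manifest) and adjust the request to match what the application actually serves.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Hypothetical endpoint: substitute the external IP of bert-intent-detection-svc.
# Port 8500 is taken from the Service manifest; the path "/" is an assumption.
TARGET_URL = "http://<EXTERNAL-IP>:8500/"


def http_get(url, timeout=5.0):
    """Send one GET request; return the HTTP status code, or None on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except Exception:
        # Connection errors, timeouts, and HTTP error statuses all count as failures.
        return None


def run_load(send, url, total_requests=200, concurrency=8):
    """Call send(url) total_requests times from a thread pool and summarize the results."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(send, [url] * total_requests))
    elapsed = time.monotonic() - start
    ok = sum(1 for status in results if status == 200)
    return {
        "sent": total_requests,
        "ok": ok,
        "failed": total_requests - ok,
        "seconds": elapsed,
    }
```

Calling `run_load(http_get, TARGET_URL)` after you set a real address returns a summary of sent, successful, and failed requests. Passing the sender as a parameter keeps the sketch easy to adapt, for example to a function that posts an inference payload instead of issuing a GET request.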