You can install the cloud-native AI monitoring component ack-arena-exporter to monitor the use of GPU resources by cluster, node, or training job, and gain insights into resource quotas in different namespaces. This topic introduces the dashboards for clusters, nodes, training jobs, and resource quotas, and describes how to install ack-arena-exporter and work with these dashboards.
Prerequisites
Only Container Service for Kubernetes (ACK) Pro clusters are supported, and the Kubernetes version of the cluster must be 1.18.8 or later.
The Arena component of V0.7.0 or later is installed. For more information, see Configure the Arena client. You can verify the Kubernetes and Arena versions with the commands shown after this list.
The ack-arms-prometheus monitoring component is installed. For more information, see arms-prometheus.
The AI Dashboard component is installed. For more information, see Deploy the cloud-native AI suite.
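The following commands provide a minimal way to confirm the version prerequisites. They assume only that kubectl is configured for the target cluster and that the Arena client is installed on your PATH.
# Check the Kubernetes version of the cluster nodes (must be 1.18.8 or later).
kubectl get nodes
# Check the version of the installed Arena client (must be V0.7.0 or later).
arena version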
Background information
After you use Arena to submit a training job, you may want to monitor the use of GPUs and GPU memory by cluster, node, training job, or resource quota. In traditional ACK clusters, you can monitor the use of GPU resources (including GPU utilization, GPU memory usage, and GPU power) only on a specific node or view GPU resources used by a specific pod.
To monitor resource usage in more dimensions, ACK offers cloud-native AI dashboards, which provide more comprehensive insights than the GPU dashboard of traditional ACK clusters.
The AI dashboards consist of the cluster dashboard, node dashboard, training job dashboard, and resource quota dashboard.
The cluster dashboard displays the total number of GPU-accelerated nodes, number of unhealthy GPU-accelerated nodes, average GPU utilization, and numbers of training jobs in different states.
The node dashboard displays the total number of GPUs, number of allocated GPUs, GPU utilization, and GPU memory usage on each node.
The training job dashboard displays the status, duration, number of requested GPUs, average GPU utilization, and GPU memory usage of each training job.
The resource quota dashboard displays the allocation and usage of a resource type in a specific namespace, such as Max Quota, Min Quota, and Used Quota.
Install ack-arena-exporter
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, click the name of the cluster that you want to manage, and then choose Applications > Cloud-native AI Suite in the left-side navigation pane.
Install ack-arena-exporter.
If you have installed the cloud-native AI suite:
On the Cloud-native AI Suite page, find the cloud-native AI monitoring component named ack-arena-exporter, and click Deploy in the Actions column.
If you have not installed the cloud-native AI suite:
On the Cloud-native AI Suite page, click Deploy. In the Monitoring section, select Monitoring Component and click Deploy Cloud-native AI Suite.
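After the deployment is complete, you can optionally verify that the exporter is running. The following check assumes that the cloud-native AI suite components are deployed to the default kube-ai namespace.
# Confirm that the arena exporter pod is in the Running state.
kubectl get pods -n kube-ai | grep -i exporter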
Introduction to AI dashboards
After you log on to AI Dashboard, you are redirected to the cluster dashboard by default. Click the icon in the upper-left corner of the page to switch dashboards.
Cluster: Cluster dashboard
Nodes: Node dashboard
TrainingJobs: Training job dashboard
Quota: Resource quota dashboard
Cluster dashboard
For more information about how to access AI Dashboard, see Method 1: Use an Internet-facing Ingress to access AI Dashboard.
You can view the following metrics in the cluster dashboard:
GPU Summary Of Cluster: displays the total number of GPU-accelerated nodes, the number of allocated GPU-accelerated nodes, and the number of unhealthy GPU-accelerated nodes in the cluster.
Total GPU Nodes: displays the total number of GPU-accelerated nodes in the cluster.
Unhealthy GPU Nodes: displays the number of unhealthy GPU-accelerated nodes in the cluster.
GPU Memory(Used/Total): displays the ratio of GPU memory used by the cluster to the total GPU memory.
GPU Memory(Allocated/Total): displays the ratio of GPU memory allocated by the cluster to the total GPU memory.
GPU Utilization: displays the average GPU utilization of the cluster.
GPUs(Allocated/Total): displays the ratio of the number of GPUs that are allocated by the cluster to the total number of GPUs.
Training Job Summary Of Cluster: displays the numbers of training jobs that are in the following states: Running, Pending, Succeeded, and Failed.
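If you want to cross-check the cluster-level GPU allocation shown in the cluster dashboard from the command line, you can use the Arena client. This is only a quick sketch; the exact columns of the output may vary with the Arena version.
# Print the GPU allocation of each GPU-accelerated node and a cluster-wide summary.
arena top node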
Node dashboard
On the Cluster page, click Nodes in the upper-right corner to navigate to the node dashboard.
You can view the following metrics in the node dashboard:
GPU Node Details: displays information about the cluster nodes in a table. The following information is displayed:
The name of each node
The IP address of each node in the cluster
The role of each node in the cluster
The status of each node
The GPU mode of each node: exclusive or shared
The number of GPUs owned by each node
The total amount of GPU memory owned by each node
The number of GPUs allocated on each node
The amount of GPU memory allocated on each node
The amount of GPU memory used on each node
The average GPU utilization on each node
GPU Duty Cycle: displays the utilization of each GPU on each node.
GPU Memory Usage: displays the memory usage of each GPU on each node.
GPU Memory Usage Percentage: displays the percentage of memory usage per GPU on each node.
Allocated GPUs Per Node: displays the number of GPUs allocated on each node.
GPU Number Per Node: displays the total number of GPUs on each node.
Total GPU Memory Per Node: displays the total amount of GPU memory on each node.
You can use the gpu_node and GPU filters in the upper-left corner of the Nodes page to filter nodes and GPUs. This allows you to view the GPU usage on only specified nodes.
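As a command-line complement to the node dashboard, you can query per-GPU details with the Arena client. The -d flag is assumed to be supported by your Arena version.
# Show per-GPU details, such as GPU memory, for each node.
arena top node -d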
Training job dashboard
On the Nodes page, click TrainingJobs in the upper-right corner to navigate to the training job dashboard.
You can view the following metrics in the training job dashboard:
Training Jobs: displays the information about each training job in a table. The following information is displayed:
The namespace of each training job
The name of each training job
The type of each training job
The status of each training job
The duration of each training job
The number of GPUs that are requested by each training job
The amount of GPU memory that is requested by each training job
The amount of GPU memory that is used by each training job
The average GPU utilization of each training job
Job Instance Used GPU Memory: displays the amount of GPU memory that is used by each job instance.
Job Instance Used GPU Memory Percentage: displays the percentage of GPU memory that is used by each job instance.
Job Instance GPU Duty Cycle: displays the GPU utilization of each job instance.
You can use the job_namespace, job_type, and job_name filters to filter training jobs by namespace, type, and name. This allows you to view detailed information about only specified training jobs.
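You can also cross-check the training job information from the command line with the Arena client. The <job-name> placeholder below is hypothetical; replace it with the name of an actual training job.
# Show the status, duration, and instances of a specific training job.
arena get <job-name>
# Show the GPU usage of training jobs, if your Arena version supports the top job subcommand.
arena top job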
Resource quota dashboard
On the Training Jobs page, click Quota in the upper-right corner to navigate to the resource quota dashboard.
You can view the following metrics in the resource quota dashboard:
Quota(cpu)
Quota(memory)
Quota(nvidia.com/gpu)
Quota(aliyun.com/gpu-mem)
Quota(aliyun.com/gpu)
Each of the preceding metrics displays information about resource quotas in a table. The following information is displayed:
Elastic Quota Name: displays the name of the quota group.
Namespace: displays the namespace to which resources belong.
Resource Name: displays the type of resources.
Max Quota: displays the maximum amount of resources that you can use in the specified namespace.
Min Quota: displays the minimum amount of resources that you can use in the specified namespace when the cluster does not have sufficient resources.
Used Quota: displays the amount of resources that are used in the specified namespace.
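The quotas shown in this dashboard are typically defined as elastic quota groups. Assuming that your cluster stores them in an ElasticQuotaTree custom resource in the kube-system namespace (this may differ in your cluster), you can inspect the Min Quota and Max Quota settings directly:
# View the elastic quota definitions, including the min and max quotas of each namespace.
kubectl get elasticquotatree -n kube-system -o yaml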
Examples on how to work with AI dashboards
In this example, Arena is used to submit an mpijob training job. After you submit the training job, you can view the changes in the AI dashboards.
After you log on to AI Dashboard, you are redirected to the cluster dashboard by default.
The following figure shows what the cluster dashboard looks like when no training job is submitted to the cluster.
The preceding figure indicates that the cluster contains two GPU-accelerated nodes and two GPUs.
Run the following command to submit an mpijob training job:
arena submit mpijob \
    --name=mpi-test \
    --gpus=1 \
    --workers=1 \
    --working-dir=/perseus-demo/tensorflow-demo/ \
    --image=registry.cn-beijing.aliyuncs.com/ai-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
    'mpirun python /benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --num_batches 5000 --variable_update horovod --train_dir=/training_logs --summary_verbosity=3 --save_summaries_steps=10'
Check the cluster dashboard again.
The preceding figure indicates that metrics such as GPU Utilization and GPUs (Allocated/Total) are updated.
Run the following command to query the node where the job runs:
arena list
Expected output:
NAME      STATUS   TRAINER  DURATION  GPU(Requested)  GPU(Allocated)  NODE
mpi-test  RUNNING  MPIJOB   8s        1               1               192.168.1.151
On the Cluster page, click Nodes in the upper-right corner to navigate to the node dashboard. The preceding output indicates that the IP address of the node where the job runs is 192.168.1.151. You can find that the GPU-related metrics of the node are updated, as shown in the following figure.
On the Nodes page, click TrainingJobs in the upper-right corner to navigate to the training job dashboard, where you can view information about the training jobs in the cluster, such as their names and status.
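After you finish verifying the dashboards, you can delete the test training job so that it no longer occupies GPU resources.
# Delete the training job that was submitted in this example.
arena delete mpi-test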