
Container Service for Kubernetes: Cloud-native AI suite O&M guide

Last Updated: Aug 15, 2024

After you deploy a cluster that has the cloud-native AI suite installed, you can allocate cluster resources and view resource usage in multiple dimensions. This helps you optimize the utilization of cluster resources. This topic describes basic O&M operations that you can perform on the cloud-native AI suite. For example, you can install the cloud-native AI suite, view resource dashboards, and manage users and quotas.

Background Information

After you deploy a cluster that has the cloud-native AI suite installed, you can allocate cluster resources and view resource usage in multiple dimensions. This helps you optimize the utilization of cluster resources.

If a cluster is used by multiple users, you must allocate a fixed amount of resources to each user to prevent the users from competing for resources. The traditional method is to use Kubernetes resource quotas to allocate a fixed amount of resources to each user. However, resource utilization varies by user group. To improve the overall utilization of cluster resources, you can allow the users to share resources after you allocate cluster resources to them.
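For reference, the traditional method mentioned above maps to a native Kubernetes ResourceQuota object that caps a namespace at a fixed amount of resources. The following is a minimal sketch; the namespace name and quota values are illustrative only.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota          # illustrative name
  namespace: team-a           # illustrative namespace that belongs to one user group
spec:
  hard:
    requests.cpu: "10"               # the namespace can request at most 10 CPU cores
    requests.memory: 20Gi            # at most 20 GiB of memory
    requests.nvidia.com/gpu: 2       # and at most 2 GPUs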

The following figure shows the organizational structure of an enterprise. You can set elastic quotas at different levels based on your business requirements. Each leaf node in the figure corresponds to a user group. To manage permissions and quotas separately, you can add users in a user group to one or more namespaces, and assign different roles to the users. This way, resources can be shared across user groups and users in the same user group can be isolated.

(Figure: example organizational structure; elastic quotas are set at each level and each leaf node corresponds to a user group)

Prerequisites

  • A Container Service for Kubernetes (ACK) Pro cluster is created. Make sure that Monitoring Agents and Simple Log Service are enabled on the Component Configurations wizard page when you create the cluster. For more information, see Create an ACK Pro cluster.

  • The Kubernetes version of the cluster is 1.18 or later.

Tasks

This topic describes how to complete the following tasks:

  • Install the cloud-native AI suite.

  • View resource dashboards.

  • Set resource quotas for user groups.

  • Manage users and groups.

  • Use idle resources to submit more workloads after the minimum amount of resources for each user is exhausted.

  • Set the maximum amount of resources for each user.

  • Set the minimum amount of resources for each user.

Step 1: Install the cloud-native AI suite

The cloud-native AI suite consists of components for task elasticity, data acceleration, AI task scheduling, AI task lifecycle management, AI Dashboard, and AI Developer Console. You can install the components based on your business requirements.

Deploy the cloud-native AI suite

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side pane, choose Applications > Cloud-native AI Suite.

  3. On the Cloud-native AI Suite page, click Deploy. On the page that appears, select the components that you want to install.

  4. Click Deploy Cloud-native AI Suite in the lower part of the page. The system checks the environment and the dependencies of the selected components. After the environment and the dependencies pass the check, the system deploys the selected components.

    After the components are installed, you can view the following information in the Components list:

    • You can view the names and versions of the components that are installed in the cluster. You can deploy or uninstall components.

    • If a component is updatable, you can update the component.

    • After you install ack-ai-dashboard and ack-ai-dev-console, you can find the hyperlinks to AI Dashboard and AI Developer Console in the upper-left corner of the Cloud-native AI Suite page. You can click a hyperlink to access the corresponding component.

  5. After the installation is complete, you can find the hyperlinks to AI Dashboard and AI Developer Console in the upper-left corner of the Cloud-native AI Suite page. You can click a hyperlink to access the corresponding component.

Install and configure AI Dashboard

  1. In the Interaction Mode section of the Cloud-native AI Suite page, select Console. The Note dialog box appears, as shown in the following figure.

    • If Authorized is displayed, perform Step 3.

    • If Unauthorized is displayed in red and the OK button is dimmed, perform Step 2.

      (Figure: Note dialog box)

  2. Create a custom policy to grant permissions to the RAM worker role.

    1. Create a custom policy.

      1. Log on to the RAM console. In the left-side navigation pane, choose Permissions > Policies.

      2. On the Policies page, click Create Policy.

      3. Click the JSON tab. Add the following content to the Action field and click Next to edit policy information.

         "log:GetProject",
         "log:GetLogStore",
         "log:GetConfig",
         "log:GetMachineGroup",
         "log:GetAppliedMachineGroups",
         "log:GetAppliedConfigs",
         "log:GetIndex",
         "log:GetSavedSearch",
         "log:GetDashboard",
         "log:GetJob",
         "ecs:DescribeInstances",
         "ecs:DescribeSpotPriceHistory",
         "ecs:DescribePrice",
         "eci:DescribeContainerGroups",
         "eci:DescribeContainerGroupPrice",
         "log:GetLogStoreLogs",
         "ims:CreateApplication",
         "ims:UpdateApplication",
         "ims:GetApplication",
         "ims:ListApplications",
         "ims:DeleteApplication",
         "ims:CreateAppSecret",
         "ims:GetAppSecret",
         "ims:ListAppSecretIds",
         "ims:ListUsers"
      4. Specify the Name parameter in the k8sWorkerRolePolicy-{ClusterID} format and click OK.
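      For context, the complete custom policy that you create in this procedure would look roughly like the following sketch. The Action values are those listed above; the "Resource": "*" scope is an assumption for illustration, so narrow it if your security requirements demand.

      {
        "Version": "1",
        "Statement": [
          {
            "Effect": "Allow",
            "Action": [
              "log:GetProject",
              "log:GetLogStore",
              "log:GetConfig",
              "log:GetMachineGroup",
              "log:GetAppliedMachineGroups",
              "log:GetAppliedConfigs",
              "log:GetIndex",
              "log:GetSavedSearch",
              "log:GetDashboard",
              "log:GetJob",
              "ecs:DescribeInstances",
              "ecs:DescribeSpotPriceHistory",
              "ecs:DescribePrice",
              "eci:DescribeContainerGroups",
              "eci:DescribeContainerGroupPrice",
              "log:GetLogStoreLogs",
              "ims:CreateApplication",
              "ims:UpdateApplication",
              "ims:GetApplication",
              "ims:ListApplications",
              "ims:DeleteApplication",
              "ims:CreateAppSecret",
              "ims:GetAppSecret",
              "ims:ListAppSecretIds",
              "ims:ListUsers"
            ],
            "Resource": "*"
          }
        ]
      }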

    2. Grant permissions to the RAM worker role of the cluster.

      1. Log on to the RAM console. In the left-side navigation pane, choose Identities > Roles.

      2. Enter the RAM worker role in the KubernetesWorkerRole-{ClusterID} format into the search box. Find the role that you want to manage and click Grant Permission in the Actions column.

      3. In the Select Policy section, click Custom Policy.

      4. Enter the name of the custom policy that you created in the k8sWorkerRolePolicy-{ClusterID} format into the search box and select the policy.

      5. Click OK.

    3. Return to the Note dialog box and click Authorization Check. If the authorization is successful, Authorized is displayed and the OK button becomes available. Then, perform Step 3.


  3. Set the Console Data Storage parameter.

    In this example, Pre-installed MySQL is selected. You can select ApsaraDB RDS in production environments. For more information, see Install and configure AI Dashboard and AI Developer Console.

  4. Click Deploy Cloud-native AI Suite.

    After the status of AI Dashboard changes to Ready, AI Dashboard is ready for use.

(Optional) Create a dataset

You can create and accelerate datasets based on the requirements of algorithm developers. The following section describes how to create a dataset in AI Dashboard or by using the CLI.

Create the fashion-mnist dataset

Use kubectl to create a persistent volume (PV) and a persistent volume claim (PVC) of the Object Storage Service (OSS) type in the cluster.

  1. Run the following command to create a namespace named demo-ns:

kubectl create ns demo-ns
  2. Create a YAML file named fashion-mnist.yaml:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: fashion-demo-pv         
spec:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 10Gi                
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeAttributes:
      bucket: fashion-mnist      
      otherOpts: "-o max_stat_cache_size=0 -o allow_other"
      url: oss-cn-beijing.aliyuncs.com   
      akId: "AKID"               
      akSecret: "AKSECRET"
    volumeHandle: fashion-demo-pv
  persistentVolumeReclaimPolicy: Retain
  storageClassName: oss
  volumeMode: Filesystem
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fashion-demo-pvc
  namespace: demo-ns
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  selector:
    matchLabels:
      alicloud-pvname: fashion-demo-pv
  storageClassName: oss
  volumeMode: Filesystem
  volumeName: fashion-demo-pv

The following list describes the key parameters in the template:

  • name: fashion-demo-pv: the name of the PV. The PV corresponds to the PVC named fashion-demo-pvc.

  • storage: 10Gi: the capacity of the PV. In this example, the capacity is 10 GiB.

  • bucket: fashion-mnist: the name of the OSS bucket.

  • url: oss-cn-beijing.aliyuncs.com: the endpoint of the OSS bucket. In this example, an OSS endpoint in the China (Beijing) region is used.

  • akId: "AKID" and akSecret: "AKSECRET": the AccessKey ID and AccessKey secret that are used to access the OSS bucket.

  • namespace: demo-ns: the namespace to which the PVC belongs.

  3. Run the following command to create the PV and the PVC:

kubectl create -f fashion-mnist.yaml
  4. Check the status of the PV and the PVC:

  1. Run the following command to query the status of the PV:

    kubectl get pv fashion-demo-pv -ndemo-ns

    Expected output:

    NAME                   CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                          STORAGECLASS   REASON   AGE
    fashion-demo-pv        10Gi       RWX            Retain           Bound    demo-ns/fashion-demo-pvc       oss                     8h
  2. Run the following command to query the status of the PVC:

    kubectl get pvc fashion-demo-pvc -ndemo-ns

    Expected output:

    NAME                       STATUS   VOLUME                 CAPACITY   ACCESS MODES   STORAGECLASS   AGE
    fashion-demo-pvc           Bound    fashion-demo-pv        10Gi       RWX            oss            8h

Accelerate a dataset

You can accelerate datasets on AI Dashboard. The following example shows how to accelerate a dataset named fashion-demo-pvc in the demo-ns namespace.

  1. Access AI Dashboard as an administrator.
  2. In the left-side navigation pane of AI Dashboard, choose Dataset > Dataset List.
  3. On the Dataset List page, find the dataset and click Accelerate in the Operator column.

    The accelerated dataset is displayed in the list, as shown in the following figure. (Figure: accelerated dataset)

Step 2: View resource dashboards

You can view the usage of cluster resources in multiple dimensions on resource dashboards provided by AI Dashboard. This helps you optimize resource allocation and improve resource utilization.

Cluster dashboard

After you log on to AI Dashboard, you are redirected to the cluster dashboard by default. You can view the following metrics on the cluster dashboard:

  • GPU Summary Of Cluster: displays the total number of GPU-accelerated nodes, the number of allocated GPU-accelerated nodes, and the number of unhealthy GPU-accelerated nodes in the cluster.

  • Total GPU Nodes: displays the total number of GPU-accelerated nodes in the cluster.

  • Unhealthy GPU Nodes: displays the number of unhealthy GPU-accelerated nodes in the cluster.

  • GPU Memory(Used/Total): displays the ratio of GPU memory used by the cluster to the total GPU memory.

  • GPU Memory(Allocated/Total): displays the ratio of GPU memory allocated by the cluster to the total GPU memory.

  • GPU Utilization: displays the average GPU utilization of the cluster.

  • GPUs(Allocated/Total): displays the ratio of the number of GPUs that are allocated by the cluster to the total number of GPUs.

  • Training Job Summary Of Cluster: displays the numbers of training jobs that are in the following states: Running, Pending, Succeeded, and Failed.

Node dashboard

On the Cluster page, click Nodes in the upper-right corner to navigate to the node dashboard. You can view the following metrics on the node dashboard:

  • GPU Node Details: displays information about the cluster nodes in a table. The following information is displayed: the name of each node, the IP address of each node, the role of each node, the GPU mode of each node (exclusive or shared), the number of GPUs provided by each node, the total amount of GPU memory provided by each node, the number of GPUs allocated on each node, the amount of GPU memory allocated on each node, the amount of GPU memory used on each node, and the average GPU utilization on each node.

  • GPU Duty Cycle: displays the utilization of each GPU on each node.

  • GPU Memory Usage: displays the memory usage of each GPU on each node.

  • GPU Memory Usage Percentage: displays the percentage of memory usage per GPU on each node.

  • Allocated GPUs Per Node: displays the number of GPUs allocated on each node.

  • GPU Number Per Node: displays the total number of GPUs on each node.

  • Total GPU Memory Per Node: displays the total amount of GPU memory on each node.

Training job dashboard

On the Nodes page, click TrainingJobs in the upper-right corner to navigate to the training job dashboard. You can view the following metrics in the training job dashboard:

  • Training Jobs: displays information about each training job in a table. The following information is displayed: the namespace of each training job, the name of each training job, the type of each training job, the status of each training job, the duration of each training job, the number of GPUs that are requested by each training job, the amount of GPU memory that is requested by each training job, the amount of GPU memory that is used by each training job, and the average GPU utilization of each training job.

  • Job Instance Used GPU Memory: displays the amount of GPU memory that is used by each job instance.

  • Job Instance Used GPU Memory Percentage: displays the percentage of GPU memory that is used by each job instance.

  • Job Instance GPU Duty Cycle: displays the GPU utilization of each job instance.

Resource quota dashboard

On the Training Jobs page, click Quota in the upper-right corner to navigate to the resource quota dashboard. You can view the following metrics on the resource quota dashboard: Quota (cpu), Quota (memory), Quota (nvidia.com/gpu), Quota (aliyun.com/gpu-mem), and Quota (aliyun.com/gpu). Each metric displays the information about resource quotas in a table. The following information is displayed:

  • Elastic Quota Name: displays the name of the quota group.

  • Namespace: displays the namespace to which resources belong.

  • Resource Name: displays the type of resources.

  • Max Quota: displays the maximum amount of resources that you can use in the specified namespace.

  • Min Quota: displays the minimum amount of resources that you can use in the specified namespace when the cluster does not have sufficient resources.

  • Used Quota: displays the amount of resources that are used in the specified namespace.

Step 3: Manage users and quotas

The cloud-native AI suite allows you to manage users and resource quotas by using the following resource objects: Users, User Groups, Quota Trees, Quota Nodes, and Kubernetes Namespaces. The following figure describes the relationships among these resource objects. (Figure: relationships among users, user groups, quota trees, quota nodes, and namespaces)

  • Quota trees allow you to configure hierarchical resource quotas. Quota trees are used by the capacity scheduling plug-in. To optimize the overall utilization of cluster resources, you can allow users to share resources after you use quota trees to allocate resources to the users.

  • Each user in Kubernetes owns a service account. The service account can be used as a credential to submit jobs and log on to the console. Permissions are granted to users based on user roles. For example, the admin role can log on to AI Dashboard and perform maintenance operations on a cluster. The researcher role can submit jobs, use cluster resources, and log on to AI Developer Console. The admin role has all permissions that the researcher role has.

  • User groups are the smallest unit in resource allocation. Each user group corresponds to a leaf node in quota trees. Users must be associated with user groups before the users can use resources that are associated with the user groups.

The following section describes how to use a quota tree to set hierarchical resource quotas and how to use a user group to allocate resources to users. The following section also describes how to share and reclaim CPU resources by submitting a simple job.

Add a quota node and set resource quotas

You can set resource quotas by specifying the Min and Max parameters of each resource. The Min parameter specifies the minimum amount of resources that can be used. The Max parameter specifies the maximum amount of resources that can be used. After you associate namespaces with a leaf node of a quota tree, limits that are set on nodes between the root node and the leaf node apply to the namespaces.

  1. If no namespace is available, you must first create namespaces. If namespaces are available, you must make sure that the namespace that you select does not contain pods in the Running state.

    kubectl create ns namespace1
    kubectl create ns namespace2
    kubectl create ns namespace3
    kubectl create ns namespace4
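    If you reuse existing namespaces, you can confirm that a namespace contains no pods in the Running state with a check like the following. The command is an illustrative sketch for namespace1; repeat it for each namespace that you plan to associate with a quota node.

    kubectl get pods -n namespace1 --field-selector=status.phase=Running

    If the command prints a message such as "No resources found", the namespace is ready to be associated with a quota node.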
  2. Create a quota node and associate it with a namespace.

Create users and user groups

A user can belong to one or more user groups. A user group can contain one or more users. You can associate user groups with a user, or associate users with a user group. You can allocate resources and grant permissions based on projects by using quota trees and user groups.

  1. Create a user. For more information, see Generate the kubeconfig file and logon token of the newly created user.

  2. Create user groups. For more information, see Add a user group.

Capacity scheduling example

The following example shows how capacity scheduling shares and reclaims resources by creating pods that request CPU cores. Each quota node is configured with a minimum and a maximum amount of CPU resources. The process is as follows:

  1. Set both the minimum amount of CPU resources and maximum amount of CPU resources to 40 for the root node. This ensures that the quota tree has 40 CPU cores available.

  2. Set the minimum amount of CPU resources to 20 and the maximum amount of CPU resources to 40 for root.a and root.b.

  3. Set the minimum amount of CPU resources to 10 and the maximum amount of CPU resources to 20 for root.a.1, root.a.2, root.b.1, and root.b.2.

  4. Submit a task that runs five pods (5 pods x 5 cores/pod = 25 cores) to namespace1. The maximum amount of CPU resources is set to 20 for root.a.1. Therefore, four pods (4 pods x 5 cores/pod = 20 cores) can run as normal.

  5. Submit a task that runs five pods (5 pods x 5 cores/pod = 25 cores) to namespace2. The maximum amount of CPU resources is set to 20 for root.a.2. Therefore, four pods (4 pods x 5 cores/pod = 20 cores) can run as normal.

  6. Submit a task that runs five pods (5 pods x 5 cores/pod = 25 cores) to namespace3. The minimum amount of CPU resources is set to 10 for root.b.1. Therefore, two pods (2 pods x 5 cores/pod = 10 cores) can run as normal. The scheduler considers the priority, availability, and creation time of each task and reclaims CPU resources from root.a. The scheduler reclaims five CPU cores (one pod) from root.a.1 and five CPU cores (one pod) from root.a.2. As a result, three pods (3 pods x 5 cores/pod = 15 cores) are running in each of namespace1 and namespace2.

  7. Submit a task that runs five pods (5 pods x 5 cores/pod = 25 cores) to namespace4. The minimum amount of CPU resources is set to 10 for root.b.2. Therefore, two pods (2 pods x 5 cores/pod = 10 cores) can run as normal.

Perform the following operations:

  1. Create namespaces and a quota tree.

    1. Run the following commands to create four namespaces:

      kubectl create ns namespace1
      kubectl create ns namespace2
      kubectl create ns namespace3
      kubectl create ns namespace4
    2. Create a quota tree based on the following figure. A declarative sketch of the same quota configuration appears after the figure.

      (Figure: example quota tree; root min/max: 40 CPU cores; root.a and root.b min: 20, max: 40; root.a.1, root.a.2, root.b.1, and root.b.2 min: 10, max: 20)
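      In this example, the quota tree is created on AI Dashboard. The hierarchical quotas are consumed by the capacity scheduling plug-in, and an equivalent declarative definition would look roughly like the following sketch of an ElasticQuotaTree resource. The apiVersion, kind, and field layout are assumptions based on the capacity scheduling component and may differ in your cluster version; the CPU values mirror the figure.

      apiVersion: scheduling.sigs.k8s.io/v1beta1
      kind: ElasticQuotaTree
      metadata:
        name: elasticquotatree
        namespace: kube-system          # the scheduler reads the quota tree from kube-system
      spec:
        root:
          name: root
          max:
            cpu: 40                     # root min = max = 40 cores
          min:
            cpu: 40
          children:
          - name: root.a
            max:
              cpu: 40
            min:
              cpu: 20
            children:
            - name: root.a.1
              namespaces:               # leaf nodes are bound to namespaces
              - namespace1
              max:
                cpu: 20
              min:
                cpu: 10
            - name: root.a.2
              namespaces:
              - namespace2
              max:
                cpu: 20
              min:
                cpu: 10
          - name: root.b
            max:
              cpu: 40
            min:
              cpu: 20
            children:
            - name: root.b.1
              namespaces:
              - namespace3
              max:
                cpu: 20
              min:
                cpu: 10
            - name: root.b.2
              namespaces:
              - namespace4
              max:
                cpu: 20
              min:
                cpu: 10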

  2. Create a Deployment in namespace1 by using the following YAML template. The Deployment provisions five pods and each pod requests five CPU cores.

    If elastic quotas are not used and each user is limited to a fixed quota equal to the minimum amount (10 CPU cores), only two pods can be created. After you set elastic quotas:

    • When 40 CPU cores are available in the cluster, four pods are created (4 pods x 5 cores/pod = 20 cores).

    • The last pod is in the Pending state because the maximum amount of resources (cpu.max=20) is reached. You can verify this as shown after the template.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nginx1
      namespace: namespace1
      labels:
        app: nginx1
    spec:
      replicas: 5
      selector:
        matchLabels:
          app: nginx1
      template:
        metadata:
          name: nginx1
          labels:
            app: nginx1
        spec:
          containers:
          - name: nginx1
            image: nginx
            resources:
              limits:
                cpu: 5
              requests:
                cpu: 5
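    To verify the behavior described above, list the pods in namespace1. The following output is an illustrative sketch; the pod name suffixes and ages will differ in your cluster.

    kubectl get pods -n namespace1

    Expected output (illustrative):

    NAME                      READY   STATUS    RESTARTS   AGE
    nginx1-xxxxxxxxxx-aaaaa   1/1     Running   0          1m
    nginx1-xxxxxxxxxx-bbbbb   1/1     Running   0          1m
    nginx1-xxxxxxxxxx-ccccc   1/1     Running   0          1m
    nginx1-xxxxxxxxxx-ddddd   1/1     Running   0          1m
    nginx1-xxxxxxxxxx-eeeee   0/1     Pending   0          1m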
  3. Create another Deployment in namespace2 by using the following YAML template. The Deployment provisions five pods and each pod requests five CPU cores.

    If elastic quotas are not used and each user is limited to a fixed quota equal to the minimum amount (10 CPU cores), only two pods can be created. After you set elastic quotas:

    • When 20 CPU cores (40 cores - 20 cores used in namespace1) are available in the cluster, four pods (4 pods x 5 cores/pod = 20 cores) are created.

    • The last pod is in the Pending state because the maximum amount of resources (cpu.max=20) is reached.

    • After you create the preceding two Deployments, the pods in namespace1 and namespace2 have used 40 CPU cores, which is the maximum number of CPU cores for the root node.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nginx2
      namespace: namespace2
      labels:
        app: nginx2
    spec:
      replicas: 5
      selector:
        matchLabels:
          app: nginx2
      template:
        metadata:
          name: nginx2
          labels:
            app: nginx2
        spec:
          containers:
          - name: nginx2
            image: nginx
            resources:
              limits:
                cpu: 5
              requests:
                cpu: 5
  4. Create a third Deployment in namespace3 by using the following YAML template. This Deployment provisions five pods and each pod requests five CPU cores.

    • The cluster does not have idle resources. The scheduler reclaims 10 CPU cores from root.a to guarantee the minimum amount of CPU resources for root.b.1.

    • Before the scheduler reclaims the temporarily used 10 CPU cores, it also considers other factors, such as the priority, availability, and creation time of the workloads of root.a. Therefore, after the pods of nginx3 are scheduled based on the reclaimed 10 CPU cores, two pods are in the Running state and the other three are in the Pending state.

    • After 10 CPU cores are reclaimed from root.a, both namespace1 and namespace2 contain three pods that are in the Running state and two pods that are in the Pending state.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nginx3
      namespace: namespace3
      labels:
        app: nginx3
    spec:
      replicas: 5
      selector:
        matchLabels:
          app: nginx3
      template:
        metadata:
          name: nginx3
          labels:
            app: nginx3
        spec:
          containers:
          - name: nginx3
            image: nginx
            resources:
              limits:
                cpu: 5
              requests:
                cpu: 5
  5. Create a fourth Deployment in namespace4 by using the following YAML template. This Deployment provisions five pods and each pod requests five CPU cores.

    • The scheduler considers the priority, availability, and creation time of each task and reclaims CPU resources from root.a.

    • The scheduler reclaims five CPU cores (one pod) from root.a.1 and five CPU cores (one pod) from root.a.2. In this scenario, two pods (2 pods x 5 CPU cores/pod = 10 CPU cores) are running in namespace1 and two pods (2 pods x 5 CPU cores/pod = 10 CPU cores) are running in namespace2.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nginx4
      namespace: namespace4
      labels:
        app: nginx4
    spec:
      replicas: 5
      selector:
        matchLabels:
          app: nginx4
      template:
        metadata:
          name: nginx4
          labels:
            app: nginx4
        spec:
          containers:
          - name: nginx4
            image: nginx
            resources:
              limits:
                cpu: 5
              requests:
                cpu: 5
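    At this point, each of the four leaf quota nodes uses exactly its minimum amount of 10 CPU cores. You can confirm the final distribution with a check like the following sketch; based on the scenario above, each namespace should report two Running pods, with the remaining three pods Pending.

    for ns in namespace1 namespace2 namespace3 namespace4; do
      echo "$ns: $(kubectl get pods -n $ns --field-selector=status.phase=Running --no-headers 2>/dev/null | wc -l) pods Running"
    done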

This example shows how capacity scheduling guarantees the minimum amount of resources for each group while allowing idle resources to be shared and reclaimed on demand.