After you deploy a cluster that has the cloud-native AI suite installed, you can allocate cluster resources and view resource usage in multiple dimensions. This helps you optimize the utilization of cluster resources. This topic describes basic O&M operations that you can perform on the cloud-native AI suite, such as installing the cloud-native AI suite, viewing resource dashboards, and managing users and quotas.
Background Information
After you deploy a cluster that has the cloud-native AI suite installed, you can allocate cluster resources and view resource usage in multiple dimensions. This helps you optimize the utilization of cluster resources.
If a cluster is used by multiple users, you must allocate a fixed amount of resources to each user to prevent the users from competing for resources. The traditional method is to use Kubernetes resource quotas to allocate a fixed amount of resources to each user, as shown in the sketch below. However, resource utilization varies across user groups. To improve the overall utilization of cluster resources, you can allow the users to share resources after you allocate cluster resources to them.
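The following is a minimal sketch of the traditional approach: a Kubernetes ResourceQuota that pins one team's namespace to a fixed amount of CPU and memory. The namespace name and quota values are hypothetical. Resources reserved this way cannot be borrowed by other namespaces even when they are idle, which is the limitation that elastic quotas address.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota        # hypothetical name, for illustration only
  namespace: team-a         # hypothetical namespace
spec:
  hard:
    requests.cpu: "20"      # the namespace can request at most 20 CPU cores
    requests.memory: 64Gi   # the namespace can request at most 64 GiB of memory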
The following figure shows the organizational structure of an enterprise. You can set elastic quotas at different levels based on your business requirements. Each leaf node in the figure corresponds to a user group. To manage permissions and quotas separately, you can add users in a user group to one or more namespaces, and assign different roles to the users. This way, resources can be shared across user groups and users in the same user group can be isolated.
Prerequisites
A Container Service for Kubernetes (ACK) Pro cluster is created. Make sure that Monitoring Agents and Simple Log Service are enabled on the Component Configurations wizard page when you create the cluster. For more information, see Create an ACK Pro cluster.
The Kubernetes version of the cluster is 1.18 or later.
Tasks
This topic describes how to complete the following tasks:
Install the cloud-native AI suite.
View resource dashboards.
Set resource quotas for user groups.
Manage users and groups.
Use idle resources to submit more workloads after the minimum amount of resources for each user is exhausted.
Set the maximum amount of resources for each user.
Set the minimum amount of resources for each user.
Step 1: Install the cloud-native AI suite
The cloud-native AI suite consists of components for task elasticity, data acceleration, AI task scheduling, AI task lifecycle management, AI Dashboard, and AI Developer Console. You can install the components based on your business requirements.
Deploy the cloud-native AI suite
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane of the details page, choose Applications > Cloud-native AI Suite.
On the Cloud-native AI Suite page, click Deploy. On the page that appears, select the components that you want to install.
Click Deploy Cloud-native AI Suite in the lower part of the page. The system checks the environment and the dependencies of the selected components. After the environment and the dependencies pass the check, the system deploys the selected components.
After the components are installed, you can view the following information in the Components list:
You can view the names and versions of the components that are installed in the cluster. You can deploy or uninstall components.
If a component is updatable, you can update the component.
After you install ack-ai-dashboard and ack-ai-dev-console, you can find the hyperlinks to AI Dashboard and AI Developer Console in the upper-left corner of the Cloud-native AI Suite page. You can click a hyperlink to access the corresponding component.
Install and configure AI Dashboard
In the Interaction Mode section of the Cloud-native AI Suite page, select Console. The Note dialog box appears, as shown in the following figure.
Create a custom policy to grant permissions to the RAM worker role.
Create a custom policy.
Log on to the RAM console. In the left-side navigation pane, choose Permissions > Policies.
On the Policies page, click Create Policy.
Click the JSON tab. Add the following content to the Action field and click Next to edit the policy information:
"log:GetProject",
"log:GetLogStore",
"log:GetConfig",
"log:GetMachineGroup",
"log:GetAppliedMachineGroups",
"log:GetAppliedConfigs",
"log:GetIndex",
"log:GetSavedSearch",
"log:GetDashboard",
"log:GetJob",
"ecs:DescribeInstances",
"ecs:DescribeSpotPriceHistory",
"ecs:DescribePrice",
"eci:DescribeContainerGroups",
"eci:DescribeContainerGroupPrice",
"log:GetLogStoreLogs",
"ims:CreateApplication",
"ims:UpdateApplication",
"ims:GetApplication",
"ims:ListApplications",
"ims:DeleteApplication",
"ims:CreateAppSecret",
"ims:GetAppSecret",
"ims:ListAppSecretIds",
"ims:ListUsers"
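For reference, the following is a minimal sketch of the complete policy document that the preceding Action values are placed in. Only a few actions are shown here; add the remaining actions from the list above in the same way. The Effect and Resource values are assumptions for illustration and follow the standard RAM policy format.
{
  "Version": "1",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "log:GetProject",
        "log:GetLogStore",
        "ecs:DescribeInstances",
        "ims:ListUsers"
      ],
      "Resource": "*"
    }
  ]
}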
Specify the Name parameter in the k8sWorkerRolePolicy-{ClusterID} format and click OK.
Grant permissions to the RAM worker role of the cluster.
Log on to the RAM console. In the left-side navigation pane, choose Identities > Roles.
Enter the name of the RAM worker role in the KubernetesWorkerRole-{ClusterID} format into the search box. Find the role that you want to manage and click Grant Permission in the Actions column. In the Select Policy section, click Custom Policy.
Enter the name of the custom policy that you created in the k8sWorkerRolePolicy-{ClusterID} format into the search box and select the policy. Click OK.
Return to the Note dialog box and click Authorization Check. If the authorization is successful, Authorized is displayed and the OK button becomes available. Then, perform Step 3.
Set the Console Data Storage parameter.
In this example, Pre-installed MySQL is selected. You can select ApsaraDB RDS in production environments. For more information, see Install and configure AI Dashboard and AI Developer Console.
Click Deploy Cloud-native AI Suite.
After the status of AI Dashboard changes to Ready, AI Dashboard is ready for use.
(Optional) Create a dataset
You can create and accelerate datasets based on the requirements of algorithm developers. The following section describes how to create a dataset in AI Dashboard or by using the CLI.
fashion-mnist dataset
Use kubectl to create a persistent volume (PV) and a persistent volume claim (PVC) of the Object Storage Service (OSS) type in the cluster.
Run the following command to create a namespace named demo-ns:
kubectl create ns demo-ns
Create a YAML file named fashion-mnist.yaml.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: fashion-demo-pv
  labels:
    alicloud-pvname: fashion-demo-pv   # required so that the PVC selector below can match this PV
spec:
  accessModes:
    - ReadWriteMany
  capacity:
    storage: 10Gi
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeAttributes:
      bucket: fashion-mnist
      otherOpts: "-o max_stat_cache_size=0 -o allow_other"
      url: oss-cn-beijing.aliyuncs.com
      akId: "AKID"
      akSecret: "AKSECRET"
    volumeHandle: fashion-demo-pv
  persistentVolumeReclaimPolicy: Retain
  storageClassName: oss
  volumeMode: Filesystem
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fashion-demo-pvc
  namespace: demo-ns
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  selector:
    matchLabels:
      alicloud-pvname: fashion-demo-pv
  storageClassName: oss
  volumeMode: Filesystem
  volumeName: fashion-demo-pv
| Parameter | Description |
| --- | --- |
| name: fashion-demo-pv | The name of the PV. The PV corresponds to the PVC named fashion-demo-pvc. |
| storage: 10Gi | The capacity of the PV is 10 GiB. |
| bucket: fashion-mnist | The name of the OSS bucket. |
| url: oss-cn-beijing.aliyuncs.com | The endpoint of the OSS bucket. In this example, an OSS endpoint in the China (Beijing) region is used. |
| akId: "AKID" and akSecret: "AKSECRET" | The AccessKey ID and AccessKey secret that are used to access the OSS bucket. |
| namespace: demo-ns | The name of the namespace. |
Create a PV and a PVC:
kubectl create -f fashion-mnist.yaml
Check the status of the PV and PVC:
Run the following command to query the status of the PV:
kubectl get pv fashion-demo-pv -ndemo-ns
Expected output:
NAME              CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                      STORAGECLASS   REASON   AGE
fashion-demo-pv   10Gi       RWX            Retain           Bound    demo-ns/fashion-demo-pvc   oss                     8h
Run the following command to query the status of the PVC:
kubectl get pvc fashion-demo-pvc -ndemo-ns
Expected output:
NAME               STATUS   VOLUME            CAPACITY   ACCESS MODES   STORAGECLASS   AGE
fashion-demo-pvc   Bound    fashion-demo-pv   10Gi       RWX            oss            8h
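To confirm that workloads can consume the dataset volume, you can mount the PVC in a test pod. The following is a minimal sketch; the pod name, image, and mount path are assumptions for illustration.
apiVersion: v1
kind: Pod
metadata:
  name: fashion-demo-reader    # hypothetical name, for illustration only
  namespace: demo-ns
spec:
  containers:
    - name: reader
      image: busybox           # assumed image
      command: ["sh", "-c", "ls /data && sleep 3600"]   # list the dataset files, then idle
      volumeMounts:
        - name: fashion-data
          mountPath: /data     # assumed mount path
  volumes:
    - name: fashion-data
      persistentVolumeClaim:
        claimName: fashion-demo-pvc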
Accelerate a dataset
You can accelerate datasets on AI Dashboard. The following example shows how to accelerate a dataset named fashion-demo-pvc in the demo-ns namespace.
Access AI Dashboard as an administrator.
In the left-side navigation pane of AI Dashboard, choose Dataset > Dataset List.
On the Dataset List page, find the dataset and click Accelerate in the Operator column.
The following figure shows the accelerated dataset.
Step 2: View resource dashboards
You can view the usage of cluster resources in multiple dimensions on resource dashboards provided by AI Dashboard. This helps you optimize resource allocation and improve resource utilization.
Cluster dashboard
After you log on to AI Dashboard, you are redirected to the cluster dashboard by default. You can view the following metrics on the cluster dashboard:
GPU Summary Of Cluster: displays the total number of GPU-accelerated nodes, the number of allocated GPU-accelerated nodes, and the number of unhealthy GPU-accelerated nodes in the cluster.
Total GPU Nodes: displays the total number of GPU-accelerated nodes in the cluster.
Unhealthy GPU Nodes: displays the number of unhealthy GPU-accelerated nodes in the cluster.
GPU Memory(Used/Total): displays the ratio of GPU memory used by the cluster to the total GPU memory.
GPU Memory(Allocated/Total): displays the ratio of GPU memory allocated by the cluster to the total GPU memory.
GPU Utilization: displays the average GPU utilization of the cluster.
GPUs(Allocated/Total): displays the ratio of the number of GPUs that are allocated by the cluster to the total number of GPUs.
Training Job Summary Of Cluster: displays the numbers of training jobs that are in the following states: Running, Pending, Succeeded, and Failed.
Node dashboard
On the Cluster page, click Nodes in the upper-right corner to navigate to the node dashboard. You can view the following metrics on the node dashboard:
GPU Node Details: displays a table that provides the following information about each node in the cluster: the node name, IP address, role, GPU mode (exclusive or shared), the total number of GPUs, the total amount of GPU memory, the number of allocated GPUs, the amount of allocated GPU memory, the amount of used GPU memory, and the average GPU utilization.
GPU Duty Cycle: displays the utilization of each GPU on each node.
GPU Memory Usage: displays the memory usage of each GPU on each node.
GPU Memory Usage Percentage: displays the percentage of memory usage per GPU on each node.
Allocated GPUs Per Node: displays the number of GPUs allocated on each node.
GPU Number Per Node: displays the total number of GPUs on each node.
Total GPU Memory Per Node: displays the total amount of GPU memory on each node.
Training job dashboard
On the Nodes page, click TrainingJobs in the upper-right corner to navigate to the training job dashboard. You can view the following metrics in the training job dashboard:
Training Jobs: displays a table that provides the following information about each training job: the namespace, name, type, status, and duration of the job, the number of GPUs that the job requests, the amount of GPU memory that the job requests, the amount of GPU memory that the job uses, and the average GPU utilization of the job.
Job Instance Used GPU Memory: displays the amount of GPU memory that is used by each job instance.
Job Instance Used GPU Memory Percentage: displays the percentage of GPU memory that is used by each job instance.
Job Instance GPU Duty Cycle: displays the GPU utilization of each job instance.
Resource quota dashboard
On the Training Jobs page, click Quota in the upper-right corner to navigate to the resource quota dashboard. You can view the following metrics on the resource quota dashboard: Quota (cpu), Quota (memory), Quota (nvidia.com/gpu), Quota (aliyun.com/gpu-mem), and Quota (aliyun.com/gpu). Each metric displays the information about resource quotas in a table. The following information is displayed:
Elastic Quota Name: displays the name of the quota group.
Namespace: displays the namespace to which resources belong.
Resource Name: displays the type of resources.
Max Quota: displays the maximum amount of resources that you can use in the specified namespace.
Min Quota: displays the minimum amount of resources that you can use in the specified namespace when the cluster does not have sufficient resources.
Used Quota: displays the amount of resources that are used in the specified namespace.
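If you prefer the CLI, you can cross-check the values on this dashboard against the quota configuration stored in the cluster. Assuming that the capacity scheduling component keeps its configuration in an ElasticQuotaTree object in the kube-system namespace, as in ACK capacity scheduling, a command like the following prints the configured Min and Max quotas:
kubectl get elasticquotatree -n kube-system -o yaml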
Step 3: Manage users and quotas
The cloud-native AI suite allows you to manage users and resource quotas by using the following resource objects: Users, User Groups, Quota Trees, Quota Nodes, and Kubernetes Namespaces. The following figure describes the relationships among these resource objects.
Quota trees allow you to configure hierarchical resource quotas. Quota trees are used by the capacity scheduling plug-in. To optimize the overall utilization of cluster resources, you can allow users to share resources after you use quota trees to allocate resources to the users.
Each user in Kubernetes owns a service account. The service account can be used as a credential to submit jobs and log on to the console. Permissions are granted to users based on user roles. For example, the admin role can log on to AI Dashboard and perform maintenance operations on a cluster. The researcher role can submit jobs, use cluster resources, and log on to AI Developer Console. The admin role has all permissions that the researcher role has.
User groups are the smallest unit in resource allocation. Each user group corresponds to a leaf node in the quota tree. Users must be associated with user groups before the users can use resources that are associated with the user groups.
The following section describes how to use a quota tree to set hierarchical resource quotas and how to use a user group to allocate resources to users. The following section also describes how to share and reclaim CPU resources by submitting a simple job.
Add a quota node and set resource quotas
You can set resource quotas by specifying the Min and Max parameters of each resource. The Min parameter specifies the minimum amount of resources that can be used. The Max parameter specifies the maximum amount of resources that can be used. After you associate namespaces with a leaf node of a quota tree, limits that are set on nodes between the root node and the leaf node apply to the namespaces.
If no namespace is available, you must first create namespaces. If namespaces are available, you must make sure that the namespace that you select does not contain pods in the Running state.
kubectl create ns namespace1
kubectl create ns namespace2
kubectl create ns namespace3
kubectl create ns namespace4
Create a quota node and associate it with a namespace.
Create users and user groups
A user can belong to one or more user groups. A user group can contain one or more users. You can associate user groups by users or associate users by user groups. You can allocate resources and grant permissions based on projects by using quota trees and user groups.
Create a user. For more information, see Generate the kubeconfig file and logon token of the newly created user.
Create user groups. For more information, see Add a user group.
Capacity scheduling example
The following section describes how capacity scheduling is used to share and reclaim resources by creating pods that request CPU cores. Each quota node is configured with the minimum amount of CPU resources and maximum amount of CPU resources. The following section describes the process:
Set both the minimum amount of CPU resources and maximum amount of CPU resources to 40 for the root node. This ensures that the quota tree has 40 CPU cores available.
Set the minimum amount of CPU resources to 20 and the maximum amount of CPU resources to 40 for root.a and root.b.
Set the minimum amount of CPU resources to 10 and the maximum amount of CPU resources to 20 for root.a.1, root.a.2, root.b.1, and root.b.2.
Submit a task that runs five pods (5 pods x 5 cores/pod = 25 cores) to namespace1. The maximum amount of CPU resources is set to 20 for root.a.1. Therefore, four pods (4 pods x 5 cores/pod = 20 cores) can run as normal.
Submit a task that runs five pods (5 pods x 5 cores/pod = 25 cores) to namespace2. The maximum amount of CPU resources is set to 20 for root.a.2. Therefore, four pods (4 pods x 5 cores/pod = 20 cores) can run as normal.
Submit a task that runs five pods (5 pods x 5 cores/pod = 25 cores) to namespace3. The minimum amount of CPU resources is set to 10 for root.b.1. Therefore, two pods (2 pods x 5 cores/pod = 10 cores) can run as normal. The scheduler considers the priority, availability, and creation time of each task and reclaims CPU resources from root.a. The scheduler reclaims one pod (5 CPU cores) from root.a.1 and one pod (5 CPU cores) from root.a.2. As a result, three pods (3 pods x 5 cores/pod = 15 cores) remain running in each of namespace1 and namespace2.
Submit a task that runs five pods (5 pods x 5 cores/pod = 25 cores) to namespace4. The minimum amount of CPU resources is set to 10 for root.b.2. Therefore, two pods (2 pods x 5 cores/pod = 10 cores) can run as normal.
Perform the following operations:
Create namespaces and a quota tree.
Run the following commands to create four namespaces:
kubectl create ns namespace1
kubectl create ns namespace2
kubectl create ns namespace3
kubectl create ns namespace4
Create a quota tree based on the following figure.
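If you prefer to define the quota tree declaratively instead of in AI Dashboard, the following is a minimal sketch of an ElasticQuotaTree object that matches the hierarchy described above: the root node has min and max set to 40 CPU cores, root.a and root.b have min 20 and max 40, and the four leaf nodes have min 10 and max 20, each bound to one namespace. The apiVersion and exact field names are assumptions based on the capacity scheduling component; verify them against the version that is installed in your cluster.
apiVersion: scheduling.sigs.k8s.io/v1beta1
kind: ElasticQuotaTree
metadata:
  name: elasticquotatree
  namespace: kube-system          # the quota tree is assumed to live in kube-system
spec:
  root:
    name: root
    min:
      cpu: 40
    max:
      cpu: 40
    children:
      - name: root.a
        min:
          cpu: 20
        max:
          cpu: 40
        children:
          - name: root.a.1
            namespaces:           # the leaf node is bound to namespace1
              - namespace1
            min:
              cpu: 10
            max:
              cpu: 20
          - name: root.a.2
            namespaces:
              - namespace2
            min:
              cpu: 10
            max:
              cpu: 20
      - name: root.b
        min:
          cpu: 20
        max:
          cpu: 40
        children:
          - name: root.b.1
            namespaces:
              - namespace3
            min:
              cpu: 10
            max:
              cpu: 20
          - name: root.b.2
            namespaces:
              - namespace4
            min:
              cpu: 10
            max:
              cpu: 20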
Create a Deployment in namespace1 by using the following YAML template. The Deployment provisions five pods and each pod requests five CPU cores.
If elastic quotas are not used, the namespace can use only the minimum amount of 10 CPU cores, which means that only two pods are created. After you set elastic quotas:
When 40 CPU cores are available in the cluster, four pods (4 pods x 5 cores/pod = 20 cores) are created.
The fifth pod is in the Pending state because the maximum amount of resources (cpu.max=20) for root.a.1 is reached.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx1
  namespace: namespace1
  labels:
    app: nginx1
spec:
  replicas: 5
  selector:
    matchLabels:
      app: nginx1
  template:
    metadata:
      name: nginx1
      labels:
        app: nginx1
    spec:
      containers:
        - name: nginx1
          image: nginx
          resources:
            limits:
              cpu: 5
            requests:
              cpu: 5
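To confirm the behavior, you can check the pod states in namespace1. The following command is a quick check; if the quotas are configured as described, four pods should be in the Running state and one pod should be in the Pending state.
kubectl get pods -n namespace1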
Create another Deployment in namespace2 by using the following YAML template. The Deployment provisions five pods and each pod requests five CPU cores.
If elastic quotas are not used, the namespace can use only the minimum amount of 10 CPU cores, which means that only two pods are created. After you set elastic quotas:
When 20 CPU cores (40 cores in total - 20 cores used in namespace1) are available in the cluster, four pods (4 pods x 5 cores/pod = 20 cores) are created.
The fifth pod is in the Pending state because the maximum amount of resources (cpu.max=20) for root.a.2 is reached.
After you create the preceding two Deployments, the pods in namespace1 and namespace2 have used 40 CPU cores, which is the maximum number of CPU cores for the root node.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx2
  namespace: namespace2
  labels:
    app: nginx2
spec:
  replicas: 5
  selector:
    matchLabels:
      app: nginx2
  template:
    metadata:
      name: nginx2
      labels:
        app: nginx2
    spec:
      containers:
        - name: nginx2
          image: nginx
          resources:
            limits:
              cpu: 5
            requests:
              cpu: 5
Create a third Deployment in namespace3 by using the following YAML template. This Deployment provisions five pods and each pod requests five CPU cores.
The cluster does not have idle resources. The scheduler reclaims 10 CPU cores from root.a to guarantee the minimum amount of CPU resources for root.b.1.
Before the scheduler reclaims the temporarily used 10 CPU cores, it also considers other factors, such as the priority, availability, and creation time of the workloads of root.a. Therefore, after the pods of nginx3 are scheduled based on the reclaimed 10 CPU cores, two pods are in the Running state and the other three are in the Pending state.
After 10 CPU cores are reclaimed from root.a, both namespace1 and namespace2 contain three pods that are in the Running state and two pods that are in the Pending state.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx3
  namespace: namespace3
  labels:
    app: nginx3
spec:
  replicas: 5
  selector:
    matchLabels:
      app: nginx3
  template:
    metadata:
      name: nginx3
      labels:
        app: nginx3
    spec:
      containers:
        - name: nginx3
          image: nginx
          resources:
            limits:
              cpu: 5
            requests:
              cpu: 5
Create a fourth Deployment in namespace4 by using the following YAML template. This Deployment provisions five pods and each pod requests five CPU cores.
The scheduler considers the priority, availability, and creation time of each task and reclaims CPU resources from root.a.
The scheduler reclaims one pod (5 CPU cores) from root.a.1 and one pod (5 CPU cores) from root.a.2. In this scenario, two pods (2 pods x 5 CPU cores/pod = 10 CPU cores) are running in namespace1, and two pods (2 pods x 5 CPU cores/pod = 10 CPU cores) are running in namespace2.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx4
  namespace: namespace4
  labels:
    app: nginx4
spec:
  replicas: 5
  selector:
    matchLabels:
      app: nginx4
  template:
    metadata:
      name: nginx4
      labels:
        app: nginx4
    spec:
      containers:
        - name: nginx4
          image: nginx
          resources:
            limits:
              cpu: 5
            requests:
              cpu: 5
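At this point, each leaf node of the quota tree should be consuming exactly its minimum quota of 10 CPU cores. You can verify the final state with a loop such as the following; each namespace should show two pods in the Running state and three pods in the Pending state.
for ns in namespace1 namespace2 namespace3 namespace4; do
  echo "== $ns =="
  kubectl get pods -n "$ns"
done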
The result shows the benefits of capacity scheduling in resource allocation.