Alibaba Cloud Machine Learning Platform for AI (PAI) provides Lingjun resources for AI development and training. You can create resource quotas for purchased Lingjun resources to perform high-performance AI training and computing. This topic describes how to create, manage, and use resource quotas.
Prerequisites
A dedicated resource group for Lingjun resources is created and Lingjun resources are purchased. For more information, see Create a resource group and purchase Lingjun resources.
A virtual private cloud (VPC), a vSwitch, and a security group are created. For more information, see Create and manage a VPC and Create a security group.
Create a resource quota
You can create a resource quota to allocate the resources in the resource pool. To create a resource quota, perform the following steps:
Log on to the PAI console. In the left-side navigation pane, choose AI Computing Resources > Resource Quota.
On the Intelligent Computing Lingjun resources tab, click Add Resource Quota.
On the Add Resource Quota page, configure the parameters and click Submit.
| Parameter | Description |
| --- | --- |
| Name | The name of the resource quota. |
| Scheduling Policy | The scheduling policy. Select an appropriate scheduling policy to improve the utilization of computing resources. Valid values: Intelligent, Balance, Round Robin, and FIFO. |
| Associate Workspace | The workspace with which the resource quota is associated. |
| Description | The description that is used to distinguish different resource quotas. |
| Source Type | The type of source of resources to be allocated to the resource quota. Valid values: Dedicated Resource Group (allocates resources from a dedicated resource group to the resource quota) and Existing Resource Quota (allocates resources from an existing resource quota to the resource quota). |
| Source | The source of resources to be allocated to the resource quota. Select a dedicated resource group or an existing resource quota from the Source drop-down list. |
| Specifications/Resources | Click Add. In the panel that appears, specify the specifications and node quantity for resources that you want to allocate from a dedicated resource group or an existing resource quota. |
| VPC, Security Group, vSwitch | Select a VPC, a vSwitch, and a security group from the drop-down lists. Note: If your Lingjun resources need to access the Internet, you must configure an Internet NAT gateway for the selected VPC and associate an elastic IP address (EIP) with the Internet NAT gateway. We recommend that you select the VPC that you want to use to access the Internet. For more information, see Use the SNAT feature of an Internet NAT gateway to access the Internet. |
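Before you fill in the Add Resource Quota page, it can help to write the intended values down and check them locally. The following Python sketch mirrors the parameters described above. It is only an illustration: the class `ResourceQuotaForm`, its field names, and the example values are hypothetical and are not part of any PAI SDK or API. Which fields the console actually treats as required is also an assumption.

```python
from dataclasses import dataclass, field

# The valid values below are copied from the parameter descriptions above.
# Everything else in this sketch (class name, field names, sample values)
# is hypothetical and is not part of any PAI SDK.
SCHEDULING_POLICIES = {"Intelligent", "Balance", "Round Robin", "FIFO"}
SOURCE_TYPES = {"Dedicated Resource Group", "Existing Resource Quota"}


@dataclass
class ResourceQuotaForm:
    """Local stand-in for the fields on the Add Resource Quota page."""
    name: str
    scheduling_policy: str
    source_type: str
    source: str
    vpc: str
    vswitch: str
    security_group: str
    # Maps a specification name to the number of nodes to allocate.
    specifications: dict = field(default_factory=dict)
    workspace: str = ""
    description: str = ""

    def problems(self) -> list:
        """Return readable problems; an empty list means the form looks complete."""
        found = []
        if not self.name.strip():
            found.append("Name is empty")
        if self.scheduling_policy not in SCHEDULING_POLICIES:
            found.append(f"Unknown scheduling policy: {self.scheduling_policy}")
        if self.source_type not in SOURCE_TYPES:
            found.append(f"Unknown source type: {self.source_type}")
        if not self.source.strip():
            found.append("Source is empty")
        if not self.specifications:
            found.append("No specifications added")
        for spec, nodes in self.specifications.items():
            if nodes < 1:
                found.append(f"Node quantity for {spec} must be at least 1")
        for label, value in (("VPC", self.vpc), ("vSwitch", self.vswitch),
                             ("Security Group", self.security_group)):
            if not value.strip():
                found.append(f"{label} is not selected")
        return found


form = ResourceQuotaForm(
    name="llm-training-quota",
    scheduling_policy="FIFO",
    source_type="Dedicated Resource Group",
    source="example-resource-group",          # placeholder name
    vpc="vpc-example",                        # placeholder ID
    vswitch="vsw-example",                    # placeholder ID
    security_group="sg-example",              # placeholder ID
    specifications={"example-gpu-spec": 2},   # placeholder specification
)
print(form.problems())
```

A check like this only catches omissions and mistyped option names. The console remains the authority on which combinations are accepted.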
Manage resource quotas
After you create a resource quota, you can click the name of the resource quota to view the basic information and resource usage and manage the resource quota. You can also increase or decrease the resource quota limit or create child-level resource quotas to optimize the allocation of resources. For more information, see Manage resource quotas.
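When you plan child-level resource quotas, a quick tally shows how many nodes remain in the parent. The following Python sketch assumes that each child-level quota draws nodes of a given specification from its parent and that the children together cannot exceed what the parent holds. That accounting model, the function name `remaining_nodes`, and the specification name are illustrative assumptions, not a description of how PAI implements quotas.

```python
def remaining_nodes(parent_nodes: dict, child_allocations: list) -> dict:
    """Return the nodes left in the parent, per specification, after all child allocations.

    parent_nodes maps a specification name to the parent's node count.
    child_allocations is a list of dicts with the same shape, one per child quota.
    Raises ValueError if a child requests an unknown specification or if the
    children together request more nodes than the parent holds.
    """
    remaining = dict(parent_nodes)  # copy so the caller's dict is not modified
    for child in child_allocations:
        for spec, nodes in child.items():
            if spec not in remaining:
                raise ValueError(f"Parent quota has no specification {spec!r}")
            remaining[spec] -= nodes
            if remaining[spec] < 0:
                raise ValueError(f"Child quotas over-allocate {spec!r}")
    return remaining


parent = {"example-gpu-spec": 8}                                  # placeholder specification
children = [{"example-gpu-spec": 3}, {"example-gpu-spec": 4}]
print(remaining_nodes(parent, children))  # {'example-gpu-spec': 1}
```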
Use a resource quota
Associate a resource quota with a workspace
Before you use a resource quota to perform AI development and training jobs, you must associate the resource quota with a workspace. For more information, see Overview.
Use a resource quota that is associated with a workspace for AI development and training
Select an image.
Submitting a Deep Learning Containers (DLC) training job by using a resource quota for Lingjun resources involves the integration of hardware and software, such as servers, networks, drivers, and training frameworks. Therefore, we recommend that you use the official PAI image or build an image based on the official PAI image.
Note: If you use a custom image, you may need to update the drivers, frameworks, and software to appropriate versions to make full use of the high-performance Lingjun resources.
| Image name | Framework | Model | CUDA | Operating system | Supported region | Programming language and version |
| --- | --- | --- | --- | --- | --- | --- |
| deepspeed-training:23.06-gpu-py310-cu121-ubuntu22.04 | PyTorch 2.1, Megatron-LM 23.06, DeepSpeed 0.9.5, Transformers 4.29.2, NeMo 1.19.0 | GPU | 12.1 | Ubuntu 22.04 | China (Ulanqab) | Python 3.10 |
| megatron-training:23.06-gpu-py310-cu121-ubuntu22.04 | PyTorch 2.1, Megatron-LM 23.06, DeepSpeed 0.9.5, Transformers 4.29.2, NeMo 1.19.0 | GPU | 12.1 | Ubuntu 22.04 | China (Ulanqab) | Python 3.10 |
| nemo-training:23.06-gpu-py310-cu121-ubuntu22.04 | PyTorch 2.1, Megatron-LM 23.06, DeepSpeed 0.9.5, Transformers 4.29.2, NeMo 1.19.0 | GPU | 12.1 | Ubuntu 22.04 | China (Ulanqab) | Python 3.10 |
Submit a DLC training job by using a resource quota for Lingjun resources. For more information, see Submit training jobs.
Create a Data Science Workshop (DSW) instance based on Lingjun resources. For more information, see Create a DSW instance.
Deploy services by using Elastic Algorithm Service (EAS). For more information, see Model service deployment by using the PAI console.
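The official image names listed in the Select an image step encode the release, device type, Python version, CUDA version, and operating system. If you build a custom image on top of an official one, decoding the name gives you a baseline to compare your own driver and framework versions against. The following Python sketch shows one way to do that. The pattern is inferred only from the three image names listed in this topic, so other PAI images may not match it, and the function name `parse_image_name` is hypothetical.

```python
import re

# Pattern inferred from the three image names listed in this topic.
# It assumes a single-digit Python major version and a two-digit CUDA
# major version; other images may follow a different naming scheme.
IMAGE_PATTERN = re.compile(
    r"^(?P<name>[a-z0-9-]+):(?P<release>\d+\.\d+)-(?P<device>[a-z]+)"
    r"-py(?P<py_major>\d)(?P<py_minor>\d+)"
    r"-cu(?P<cuda_major>\d{2})(?P<cuda_minor>\d)"
    r"-(?P<os_name>[a-z]+)(?P<os_version>\d+\.\d+)$"
)


def parse_image_name(image: str) -> dict:
    """Split an image name such as 'x-training:23.06-gpu-py310-cu121-ubuntu22.04'."""
    match = IMAGE_PATTERN.match(image)
    if match is None:
        raise ValueError(f"Image name does not match the expected pattern: {image}")
    parts = match.groupdict()
    return {
        "name": parts["name"],
        "release": parts["release"],
        "device": parts["device"],
        "python": f"{parts['py_major']}.{parts['py_minor']}",
        "cuda": f"{parts['cuda_major']}.{parts['cuda_minor']}",
        "os": f"{parts['os_name']} {parts['os_version']}",
    }


info = parse_image_name("deepspeed-training:23.06-gpu-py310-cu121-ubuntu22.04")
print(info["python"], info["cuda"], info["os"])  # 3.10 12.1 ubuntu 22.04
```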