After you complete the preparations, you can submit Deep Learning Containers (DLC) jobs in the Platform for AI (PAI) console or by using SDK for Python or the command line. This topic describes how to submit a DLC job.
Prerequisites
The required resources, images, datasets, and code builds are prepared. For more information, see Before you begin.
Environment variables are configured for using the SDK for Python to submit a DLC job. For more information, see the "Install the Credentials tool" section in the Manage access credentials topic and the "Step 2: Configure environment variables" section in the Get started with Alibaba Cloud Darabonba SDK for Python topic.
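Before running the SDK samples, you can verify that the credential environment variables are present. The sketch below is a convenience check, not part of the official workflow; the variable names `ALIBABA_CLOUD_ACCESS_KEY_ID` and `ALIBABA_CLOUD_ACCESS_KEY_SECRET` are the ones the Credentials tool conventionally reads, so confirm them against the topics linked above:

```python
import os

# The Credentials tool is expected to read the AccessKey pair from these
# environment variables; verify they are set before running the SDK samples.
REQUIRED_VARS = ('ALIBABA_CLOUD_ACCESS_KEY_ID', 'ALIBABA_CLOUD_ACCESS_KEY_SECRET')

def missing_credential_vars():
    """Return the names of credential variables that are not set."""
    return [name for name in REQUIRED_VARS if not os.environ.get(name)]

if __name__ == '__main__':
    missing = missing_credential_vars()
    if missing:
        print('Set these environment variables first:', ', '.join(missing))
    else:
        print('Credential environment variables are set.')
```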
Submit a job in the PAI console
Go to the Create Job page.
Log on to the PAI console. Select a region and a workspace. Then, click Enter Deep Learning Containers (DLC).
On the Deep Learning Containers (DLC) page, click Create Job. The Create Job page appears.
Basic Information
In the Basic Information section, configure the Job Name and Tag parameters.
Environment Information
In the Environment Information section, configure the key parameters. The following table describes the parameters.
Parameter | Description |
Node Image | The worker node image. You can select one of the following node images:
|
Data Set | Select one of the following dataset types. Set the Mount Path parameter to a path inside the DLC training container, such as /mnt/data/nas.
|
Direct mount | Click OSS to mount an OSS path to the specified DLC path. |
Startup Command | The command that the job runs. Shell commands are supported. When you submit a job, PAI automatically injects multiple general environment variables, which the startup command can reference to obtain their values.
|
Environment Variable | Additional configuration information or parameters, specified as key-value pairs. |
Third-party Libraries | Valid values:
|
Code Builds | Valid values:
|
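To illustrate how a training script launched by the startup command can consume environment variables (either the ones you configure or the general variables that PAI injects), the sketch below reads a few values with safe defaults. The names `MASTER_ADDR`, `RANK`, and `WORLD_SIZE` are assumptions based on common distributed-training conventions, not a guaranteed list:

```python
import os

# Hypothetical example: read distributed-training variables with safe
# defaults so the script also runs in a single-node setup.
master_addr = os.environ.get('MASTER_ADDR', 'localhost')
rank = int(os.environ.get('RANK', '0'))
world_size = int(os.environ.get('WORLD_SIZE', '1'))

print(f'worker {rank}/{world_size}, master at {master_addr}')
```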
Resource Information
In the Resource Information section, configure the key parameters. The following table describes the parameters.
Parameter | Description |
Instance Type | This parameter is available only if the workspace allows you to use Lingjun resources and general computing resources to submit jobs in DLC.
|
Source | Valid values: Public Resources and Resource Quota. A resource quota includes general computing resources and Lingjun resources.
|
Resource Quota | This parameter is required only if you set the Source parameter to Resource Quota. Select the resource quota that you prepared. For more information about how to prepare a resource quota, see Resource quota overview. |
Priority | This parameter is available only if you set the Source parameter to Resource Quota. Specify the priority for running the job. Valid values: 1 to 9. A greater value indicates a higher priority. |
Framework | Specify the deep learning training framework and tool. The framework provides rich features and operations that you can use to build, train, and optimize deep learning models.
Note If you set the Resource Quota parameter to Lingjun resources, you can submit only the following types of jobs: TensorFlow, PyTorch, ElasticBatch, MPIJob, Slurm, and Ray. |
Job Resource | Configure the following nodes based on the framework you selected: worker nodes, parameter server (PS) nodes, chief nodes, evaluator nodes, and GraphLearn nodes.
|
Node-Specific Scheduling | After you enable this feature, you can select nodes for scheduling. Note This parameter is available only when you use resource quota. |
CPU Affinity | Enabling CPU affinity allows processes in a container or Pod to be bound to specific CPU cores for execution. This approach can reduce CPU cache misses and context switching, improving CPU utilization and enhancing application performance. It is suitable for scenarios that are sensitive to performance and have high real-time requirements. |
Maximum Duration | You can specify the maximum duration for which a job runs. The job is automatically stopped when the uptime of the job exceeds the maximum duration. Default value: 30. Unit: hours. |
Retention Period | Specify the retention period of jobs after they complete or fail. During the retention period, the resources remain occupied. After the retention period ends, the jobs are deleted. Important DLC jobs that are deleted cannot be restored. Exercise caution when you delete jobs. |
VPC
This parameter is available only if you set the Source parameter to Public Resources.
If you do not configure a VPC, the job connects over the Internet. Because Internet bandwidth is limited, the job may stall or may not run as expected.
To ensure sufficient network bandwidth and stable performance, we recommend that you configure a VPC.
Select a VPC, a vSwitch, and a security group in the current region. When the configuration takes effect, the cluster on which the job runs directly accesses the services in the VPC and performs access control based on the selected security group.
You can also configure the Internet Access Gateway parameter.
Private Gateway: dedicated bandwidth. You can configure the bandwidth based on your business requirements. If you access the Internet by using a private gateway, you must create an Internet NAT gateway, associate an elastic IP address (EIP) with a DSW instance, and configure SNAT in the VPC that is associated with the DSW instance. For more information, see Enable Internet access for a DSW instance by using a private Internet NAT gateway.
Public Gateway: shared public bandwidth. The download rate is slow in high concurrency scenarios.
Before you run a DLC job, make sure that instances in the resource group and the OSS bucket of the dataset reside in the VPCs of the same region, and that the VPCs are connected to the networks of the code repository.
If you select a CPFS dataset, you must configure a VPC. The VPC must be the same as the VPC configured for the CPFS dataset. Otherwise, the job may stay in the preparing environment state for a long time.
Fault Tolerance and Diagnosis
In the Fault Tolerance and Diagnosis section, configure the key parameters. The following table describes the parameters.
Parameter | Description |
Automatic Fault Tolerance | After you turn on Automatic Fault Tolerance and configure the related parameters, the system checks jobs to identify algorithmic errors and improve GPU utilization. For more information, see AIMaster: elastic fault tolerance engine. Note After you enable Automatic Fault Tolerance, the system starts an AIMaster instance that runs together with the job instance and occupies the following resources:
|
Sanity Check | After you turn on Sanity Check, the system detects the resources that are used to run the jobs, isolates faulty nodes, and triggers automated O&M processes in the background. This prevents job failure in the early stage of training and improves the training success rate. For more information, see Sanity Check. Note You can enable the sanity check feature only for PyTorch jobs that run on Lingjun resources and use GPU. |
Roles and Permissions
In the Roles and Permissions section, configure the Instance RAM Role parameter. For more information, see Configure the DLC RAM role.
Instance RAM Role | Description |
Default Role of PAI | The default role of PAI is based on the AliyunPAIDLCDefaultRole role and has only the permissions to access MaxCompute and OSS. You can use this role to implement fine-grained permission management based on the temporary credentials that the role issues.
|
Custom Role | Select or create a custom Resource Access Management (RAM) role. You are granted the same permissions as the custom role you select when you call API operations of other Alibaba Cloud services by using Security Token Service (STS) temporary credentials. |
Does Not Associate Role | Do not associate a RAM role with the DLC job. By default, this option is selected. |
After you configure the parameters, click OK to submit the job.
Submit a job by using SDK for Python or the command line
Use SDK for Python
Step 1: Install SDK for Python
Install the workspace SDK.
pip install alibabacloud_aiworkspace20210204==3.0.1
Install the DLC SDK.
pip install alibabacloud_pai_dlc20201203==1.4.0
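As a quick sanity check after installation (a convenience sketch, not part of the official workflow), you can confirm that both SDK packages are visible to the current interpreter:

```python
import importlib.util

# Report whether each SDK package can be found by the current interpreter.
for module in ('alibabacloud_aiworkspace20210204', 'alibabacloud_pai_dlc20201203'):
    found = importlib.util.find_spec(module) is not None
    print(module, 'is installed' if found else 'is MISSING - run pip install')
```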
Step 2: Submit the job
If you want to submit a job that runs on pay-as-you-go resources, you can use public resources. Training jobs that run on public resources may encounter queuing delays. We recommend that you use public resources in time-insensitive scenarios that involve a small number of tasks.
If you want to submit a job that runs on subscription resources, you can use dedicated resources, such as general computing resources or Lingjun resources. You can use dedicated resources to ensure resource availability in high workload scenarios.
Use public resources to submit jobs
The following sample code provides an example on how to create and submit a DLC job:
#!/usr/bin/env python3
from __future__ import print_function
import json
import time
from alibabacloud_tea_openapi.models import Config
from alibabacloud_credentials.client import Client as CredClient
from alibabacloud_pai_dlc20201203.client import Client as DLCClient
from alibabacloud_pai_dlc20201203.models import (
ListJobsRequest,
ListEcsSpecsRequest,
CreateJobRequest,
GetJobRequest,
)
from alibabacloud_aiworkspace20210204.client import Client as AIWorkspaceClient
from alibabacloud_aiworkspace20210204.models import (
ListWorkspacesRequest,
CreateDatasetRequest,
ListDatasetsRequest,
ListImagesRequest,
ListCodeSourcesRequest
)
def create_nas_dataset(client, region, workspace_id, name,
nas_id, nas_path, mount_path):
'''Create a NAS dataset.
'''
response = client.create_dataset(CreateDatasetRequest(
workspace_id=workspace_id,
name=name,
data_type='COMMON',
data_source_type='NAS',
property='DIRECTORY',
uri=f'nas://{nas_id}.{region}{nas_path}',
accessibility='PRIVATE',
source_type='USER',
options=json.dumps({
'mountPath': mount_path
})
))
return response.body.dataset_id
def create_oss_dataset(client, region, workspace_id, name,
oss_bucket, oss_endpoint, oss_path, mount_path):
'''Create an OSS dataset.
'''
response = client.create_dataset(CreateDatasetRequest(
workspace_id=workspace_id,
name=name,
data_type='COMMON',
data_source_type='OSS',
property='DIRECTORY',
uri=f'oss://{oss_bucket}.{oss_endpoint}{oss_path}',
accessibility='PRIVATE',
source_type='USER',
options=json.dumps({
'mountPath': mount_path
})
))
return response.body.dataset_id
def wait_for_job_to_terminate(client, job_id):
while True:
job = client.get_job(job_id, GetJobRequest()).body
print('job({}) is {}'.format(job_id, job.status))
if job.status in ('Succeeded', 'Failed', 'Stopped'):
return job.status
time.sleep(5)
def main():
# Make sure that your Alibaba Cloud account has the required permissions on DLC.
region_id = 'cn-hangzhou'
# The AccessKey pair of an Alibaba Cloud account has permissions on all API operations. Using these credentials to perform operations is a high-risk operation. We recommend that you use a RAM user to call API operations or perform routine O&M. To create a RAM user, log on to the RAM console.
# We recommend that you do not save the AccessKey ID and the AccessKey secret in your project code. Otherwise, the AccessKey pair may be leaked, and this may compromise the security of all resources within your account.
# In this example, the Credentials SDK reads the AccessKey pair from the environment variables to perform identity verification.
cred = CredClient()
# 1. create client;
workspace_client = AIWorkspaceClient(
config=Config(
credential=cred,
region_id=region_id,
endpoint="aiworkspace.{}.aliyuncs.com".format(region_id),
)
)
dlc_client = DLCClient(
config=Config(
credential=cred,
region_id=region_id,
endpoint='pai-dlc.{}.aliyuncs.com'.format(region_id),
)
)
print('------- Workspaces -----------')
# Obtain the workspace list. You can specify the name of the workspace that you created in the workspace_name parameter.
workspaces = workspace_client.list_workspaces(ListWorkspacesRequest(
page_number=1, page_size=1, workspace_name='',
module_list='PAI'
))
for workspace in workspaces.body.workspaces:
print(workspace.workspace_name, workspace.workspace_id,
workspace.status, workspace.creator)
if len(workspaces.body.workspaces) == 0:
raise RuntimeError('found no workspaces')
workspace_id = workspaces.body.workspaces[0].workspace_id
print('------- Images ------------')
# Obtain the image list.
images = workspace_client.list_images(ListImagesRequest(
labels=','.join(['system.supported.dlc=true',
'system.framework=Tensorflow 1.15',
'system.pythonVersion=3.6',
'system.chipType=CPU'])))
for image in images.body.images:
print(json.dumps(image.to_map(), indent=2))
image_uri = images.body.images[0].image_uri
print('------- Datasets ----------')
# Obtain the dataset.
datasets = workspace_client.list_datasets(ListDatasetsRequest(
workspace_id=workspace_id,
name='example-nas-data', properties='DIRECTORY'))
for dataset in datasets.body.datasets:
print(dataset.name, dataset.dataset_id, dataset.uri, dataset.options)
if len(datasets.body.datasets) == 0:
# Create a dataset if the specified dataset does not exist.
dataset_id = create_nas_dataset(
client=workspace_client,
region=region_id,
workspace_id=workspace_id,
name='example-nas-data',
# The ID of the NAS file system.
# General-purpose NAS: 31a8e4****.
# Extreme NAS: The ID must start with extreme-. Example: extreme-0015****.
# CPFS: The ID must start with cpfs-. Example: cpfs-125487****.
nas_id='***',
nas_path='/',
mount_path='/mnt/data/nas')
print('create dataset with id: {}'.format(dataset_id))
else:
dataset_id = datasets.body.datasets[0].dataset_id
print('------- Code Sources ----------')
# Obtain the source code file list.
code_sources = workspace_client.list_code_sources(ListCodeSourcesRequest(
workspace_id=workspace_id))
for code_source in code_sources.body.code_sources:
print(code_source.display_name, code_source.code_source_id, code_source.code_repo)
print('-------- ECS SPECS ----------')
# Obtain the DLC node specification list.
ecs_specs = dlc_client.list_ecs_specs(ListEcsSpecsRequest(page_size=100, sort_by='Memory', order='asc'))
for spec in ecs_specs.body.ecs_specs:
        print(spec.instance_type, spec.cpu, spec.memory, spec.gpu, spec.gpu_type)
print('-------- Create Job ----------')
# Create a DLC job.
create_job_resp = dlc_client.create_job(CreateJobRequest().from_map({
'WorkspaceId': workspace_id,
'DisplayName': 'sample-dlc-job',
'JobType': 'TFJob',
'JobSpecs': [
{
"Type": "Worker",
"Image": image_uri,
"PodCount": 1,
"EcsSpec": ecs_specs.body.ecs_specs[0].instance_type,
"UseSpotInstance": False,
},
],
"UserCommand": "echo 'Hello World' && ls -R /mnt/data/ && sleep 30 && echo 'DONE'",
'DataSources': [
{
"DataSourceId": dataset_id,
},
],
}))
job_id = create_job_resp.body.job_id
wait_for_job_to_terminate(dlc_client, job_id)
print('-------- List Jobs ----------')
# Obtain the DLC job list.
jobs = dlc_client.list_jobs(ListJobsRequest(
workspace_id=workspace_id,
page_number=1,
page_size=10,
))
for job in jobs.body.jobs:
print(job.display_name, job.job_id, job.workspace_name,
job.status, job.job_type)
if __name__ == '__main__':
main()
Use subscription resources to submit jobs
Log on to the PAI console.
Follow the instructions to obtain your workspace ID on the Workspaces page.
Follow the instructions to obtain the resource quota ID of your dedicated resource group.
The following sample code provides an example on how to create and submit a job. For information about the available public images, see the "Step 2: Prepare an image" section in the Before you begin topic.
from alibabacloud_pai_dlc20201203.client import Client
from alibabacloud_credentials.client import Client as CredClient
from alibabacloud_tea_openapi.models import Config
from alibabacloud_pai_dlc20201203.models import (
    CreateJobRequest,
    JobSpec,
    ResourceConfig,
    GetJobRequest
)

# Initialize a client to access the DLC API operations.
region = 'cn-hangzhou'
# The AccessKey pair of an Alibaba Cloud account has permissions on all API operations. Using these credentials to perform operations is a high-risk operation. We recommend that you use a RAM user to call API operations or perform routine O&M. To create a RAM user, log on to the RAM console.
# We recommend that you do not save the AccessKey ID and the AccessKey secret in your project code. Otherwise, the AccessKey pair may be leaked, and this may compromise the security of all resources within your account.
# In this example, the Credentials SDK reads the AccessKey pair from the environment variables to perform identity verification.
cred = CredClient()
client = Client(
    config=Config(
        credential=cred,
        region_id=region,
        endpoint=f'pai-dlc.{region}.aliyuncs.com',
    )
)

# Specify the resource configurations of the job. You can select a public image or specify an image address. For information about the available public images, see the reference documentation.
spec = JobSpec(
    type='Worker',
    image='registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.15-cpu-py36-ubuntu18.04',
    pod_count=1,
    resource_config=ResourceConfig(cpu='1', memory='2Gi')
)

# Specify the execution information for the job.
req = CreateJobRequest(
    resource_id='<Your resource quota ID>',
    workspace_id='<Your workspace ID>',
    display_name='sample-dlc-job',
    job_type='TFJob',
    job_specs=[spec],
    user_command='echo "Hello World"',
)

# Submit the job.
response = client.create_job(req)
# Obtain the job ID.
job_id = response.body.job_id

# Query the job status.
job = client.get_job(job_id, GetJobRequest()).body
print('job status:', job.status)

# View the command that the job runs.
print(job.user_command)
Use the command line
Step 1: Download the DLC client and perform user authentication
Download the DLC client for your operating system and verify your credentials. For more information, see Before you begin.
Step 2: Submit the job
Log on to the PAI console.
Follow the instructions to obtain your workspace ID on the Workspaces page.
Follow the instructions to obtain the resource quota ID.
Create a parameter file named tfjob.params and copy the following content into the file. Change the parameter values based on your business requirements. For information about how to use the command line in the DLC client, see Supported commands.

name=test_cli_tfjob_001
workers=1
worker_cpu=4
worker_gpu=0
worker_memory=4Gi
worker_shared_memory=4Gi
worker_image=registry-vpc.cn-beijing.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-cpu-py27-ubuntu16.04
command=echo good && sleep 120
resource_id=<Your resource quota ID>
workspace_id=<Your workspace ID>
The following sample code provides an example on how to use the parameter file to submit a DLC job to the specified workspace and resource quota.
./dlc submit tfjob --job_file ./tfjob.params
The following sample code provides an example on how to query the DLC jobs that you created.
./dlc get job <jobID>
What to do next
After you submit the job, you can perform the following operations:
View the basic information, resource view, and operation logs of the job. For more information, see View training jobs.
Manage jobs, including cloning, stopping, and deleting jobs. For more information, see Manage training jobs.
View the training results on TensorBoard. For more information, see Use TensorBoard to view training results in DLC.
View the billing details when the job is completed. For more information, see Bill details.
Enable the log forwarding feature to forward logs of DLC jobs from the current workspace to a specific Logstore for custom analysis. For more information, see Subscribe to job logs.
Create a notification rule for a PAI workspace to track and monitor the status of DLC jobs. For more information, see Create a notification rule.
If you have other questions about DLC jobs, see FAQ about DLC.