Submit training jobs

Updated at: 2025-01-24 05:39

After you complete the preparations, you can submit Deep Learning Containers (DLC) jobs in the Platform for AI (PAI) console or by using SDK for Python or the command line. This topic describes how to submit a DLC job.

Prerequisites

The preparations described in the Before you begin topic are complete.

Submit a job in the PAI console

  1. Go to the Create Job page.

    1. Log on to the PAI console. Select a region and a workspace. Then, click Enter Deep Learning Containers (DLC).

    2. On the Deep Learning Containers (DLC) page, click Create Job.

  2. Configure the parameters in the following sections.

    • Basic Information

      In this section, configure the Job Name and Tag parameters.

    • Environment Information

      Parameter

      Description

      Node Image

      The worker node image. You can select one of the following node images:

      • Alibaba Cloud Image: an image provided by Alibaba Cloud PAI. Such images support different Python versions and deep learning frameworks, such as TensorFlow, PyTorch, Ray, and XGBoost. For more information, see Before you begin.

      • Custom Image: a custom image that you uploaded to PAI. For more information about how to upload a custom image, see Custom images.

        Note
        • If you want to use custom images with Lingjun resources, install Remote Direct Memory Access (RDMA) support in the image to use the high-performance RDMA network of Lingjun resources. For more information, see RDMA: high-performance networks for distributed training.

        • Set the image repository to public, or store the image in Alibaba Cloud Container Registry, so that the image can be pulled directly.

      • Image Address: a custom, community, or Alibaba Cloud image that can be accessed by using the image address. If you select Image Address, you must also specify the public URL of the Docker registry image that you want to access.

        If you want to specify the private URL of an image, click Enter Username and Password, and then set the Username and Password parameters to grant permissions on the private image repository.

        You can also use an accelerated image to accelerate model training. For more information, see Use an accelerated image in PAI.

      Data Set

      You can select one of the following dataset types.

      • Custom Dataset: Select a dataset that you prepared. If the dataset has multiple versions, you can click Versions in the Actions column to select the required version. For more information about how to prepare a dataset, see Step 3: Prepare a dataset.

      • Public Dataset: Select an existing public dataset provided by PAI. The mount option for the data is read-only.

      Set the Mount Path parameter to the specific path in the DLC training container, such as /mnt/data. DLC retrieves the required files based on the mount path you specified. For more information, see Use cloud storage for a DLC training job.

      Important
      • If you select an OSS dataset or a File Storage NAS (NAS) dataset, you must grant PAI the permissions to access OSS or NAS. Otherwise, PAI cannot read or write data. For more information, see the "Grant PAI the permissions to access OSS and NAS" section in the Grant the permissions that are required to use DLC topic.

      • If you select a Cloud Parallel File Storage (CPFS) dataset, you must configure a virtual private cloud (VPC). The VPC must be the same as the VPC configured for the CPFS dataset. Otherwise, the job may stay in the preparing environment state for a long time.

      Directly Mount

      You can click OSS, General purpose NAS, Extreme speed NAS, or BMCPFS to directly mount the related data source to a DLC container. You must specify the data source parameters and the mount path.

      Note

      Only jobs that use Lingjun resources support the BMCPFS data source.

      Startup Command

      The command that the job runs. Shell commands are supported. For example, you can use the python -c "print('Hello World')" command to run Python.

      When you submit a job, PAI automatically injects multiple general environment variables. To obtain the value of an environment variable in your command, reference it in the $Environment Variable Name format. For more information about the general environment variables provided by DLC, see General environment variables. A minimal entry-script sketch that uses an environment variable and the dataset mount path is provided at the end of this section.

      Note
      • If you configure a dataset, you can use commands to export the training results to the directory on which the dataset is mounted. This allows you to view the training results in the dataset.

      Environment Variable

      Additional configuration information or parameters. The format is key:value. You can configure up to 20 environment variables.

      Third-party Libraries

      Valid values:

      • Select from List: Enter the name of a third-party library in the field.

      • Directory of requirements.txt: Enter the path of the requirements.txt file in the field. You must first upload the file to the DLC container by using a code build, a dataset, or direct mounting, and then enter the path of the file in the DLC container. An example requirements.txt is shown below.
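
        For example, a requirements.txt that you uploaded to the container might contain pinned versions like the following (the package names and versions are illustrative):

          numpy==1.26.4
          pandas==2.2.2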

      Code Builds

      You must upload the code build that is required for the training to a DLC container. Valid values:

      • Online Configuration

        Specify the location of the repository that stores the code file. In this example, a code build that you prepared is selected. For information about how to create a code build, see the "Step 4: Prepare a code build" section in the Before you begin topic.

        Note

        DLC automatically downloads the code to the specified working path. Make sure that your account has permissions to access the repository.

      • Local Upload

        Click the upload icon and follow the on-screen instructions to upload the code build. After the upload succeeds, set the Mount Path parameter to a path in the container, such as /mnt/data.
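
      The following is a minimal sketch of a training entry script that ties the preceding parameters together: it reads a custom environment variable and writes output to the dataset mount path. The variable name MY_BATCH_SIZE and the path /mnt/data are hypothetical values for illustration.

        import os

        # MY_BATCH_SIZE is a hypothetical key configured in the Environment
        # Variable section (key:value format).
        batch_size = int(os.environ.get('MY_BATCH_SIZE', '32'))

        # /mnt/data is the mount path configured for the dataset. Files written
        # here are persisted to the mounted storage, so the training results
        # remain visible after the job ends.
        output_dir = '/mnt/data/output'
        os.makedirs(output_dir, exist_ok=True)

        with open(os.path.join(output_dir, 'result.txt'), 'w') as f:
            f.write('trained with batch_size={}\n'.format(batch_size))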

    • Resource Information

      Parameter

      Description

      Instance type

      This parameter is available only if the workspace allows you to use Lingjun resources and general computing resources to submit jobs in DLC. Valid values:

      • Lingjun Resources

        Note

        Lingjun resources are available only in the China (Ulanqab) and Singapore regions.

      • General Computing Resources

      Source

      Public Resources, Resource Quota, and Preemptible Resources are available. Resource Quota includes general computing resources and Lingjun resources.

      Note
      • Public resources can provide up to two GPUs and eight vCPUs. To increase the resource quota, contact your account manager.

      • For more information about the limitations and usage of preemptible resources, see Use a preemptible job.

      Resource Quota

      This parameter is required only if you set the Source parameter to Resource Quota. Select the resource quota that you prepared. For more information about how to prepare a resource quota, see Resource quota overview.

      Priority

      This parameter is available only if you set the Source parameter to Resource Quota.

      Specify the priority for running the job. Valid values: 1 to 9. A greater value indicates a higher priority.

      Framework

      Specify the deep learning training framework and tool. The framework provides rich features and operations that you can use to build, train, and optimize deep learning models.

      • TensorFlow

      • PyTorch

      • ElasticBatch

      • XGBoost

      • OneFlow

      • MPIJob

      • Ray

      Note

      If you set the Resource Quota parameter to Lingjun resources, you can submit only the following types of jobs: TensorFlow, PyTorch, ElasticBatch, MPIJob, and Ray.

      Job Resource

      Configure the resources of the following nodes based on the framework you selected: worker nodes, parameter server (PS) nodes, chief nodes, evaluator nodes, and GraphLearn nodes.

      • Use public resources

        Configure the following parameters:

        • Number of Nodes: the number of nodes on which the DLC job runs.

        • Resource Type: Click the icon to select an instance type. The prices of the instance types are displayed in the Instance Type panel. For information about the billing of resource specifications, see Billing of DLC.

      • Use general computing resources or Lingjun resources

        Configure the following parameters for the nodes: Number of Nodes, vCPUs, GPUs, Memory (GiB), and Shared Memory (GiB).

      • Use preemptible resources

        Configure the following parameters. For more information about preemptible resources, see Use a preemptible job.

        • Number of Nodes: the number of nodes on which the DLC job runs.

        • Resource Type: Click the icon to select an instance type.

        • Bid Price: the maximum bid price to apply for the preemptible resources. You can click the icon to switch the bidding method.

          • Bid Price (Discount): The maximum bid price ranges from 10% to 90% of the market price, in 10% increments. You can obtain the preemptible resources if your bid meets or exceeds the market price and inventory is available.

          • Bid Price ($/Minutes): The maximum bid price range is based on the market price range.

      Node-Specific Scheduling

      After you enable this feature, you can select nodes for scheduling.

      Note

      This parameter is available only when you use a resource quota.

      CPU Affinity

      Enabling CPU affinity allows processes in a container or Pod to be bound to specific CPU cores for execution. This approach can reduce CPU cache misses and context switching, improving CPU utilization and enhancing application performance. It is suitable for scenarios that are sensitive to performance and have high real-time requirements.

      Maximum Duration

      You can specify the maximum duration for which a job runs. The job is automatically stopped when the uptime of the job exceeds the maximum duration. Default value: 30. Unit: hours.

      Retention Period

      Specify the retention period of jobs after they complete or fail. During the retention period, the resources remain occupied. After the retention period ends, the jobs are deleted.

      Important

      DLC jobs that are deleted cannot be restored. Exercise caution when you delete the jobs.

    • VPC

      • If you do not configure a VPC, the job uses an Internet connection. Due to the limited bandwidth of the Internet, the job may not progress or may not run as expected.

      • To ensure sufficient network bandwidth and stable performance, we recommend that you configure a VPC.

        Select a VPC, a vSwitch, and a security group in the current region. When the configuration takes effect, the cluster on which the job runs directly accesses the services in the VPC and performs access control based on the selected security group.

        You can also configure the Internet Gateway parameter.

        • Private Gateway: You can select a dedicated bandwidth based on your business requirements. If you access the Internet by using a private gateway, you must create an Internet NAT gateway, associate an elastic IP address (EIP), and configure SNAT in the VPC. For more information, see Enable Internet access for a DSW instance by using a private Internet NAT gateway.

        • Public Gateway: The shared public bandwidth is used. The download rate is slow in high concurrency scenarios.

      Important
      • Before you run a DLC job, make sure that instances in the resource group and the OSS bucket of the dataset reside in the VPCs of the same region, and that the VPCs are connected to the networks of the code repository.

      • If you select a CPFS dataset, you must configure a VPC. The VPC must be the same as the VPC configured for the CPFS dataset. Otherwise, the job may stay in the preparing environment state for a long time.

      • If you use Lingjun preemptible resources to submit a DLC job, you must configure a VPC.

    • Fault Tolerance and Diagnosis

      Parameter

      Description

      Automatic Fault Tolerance

      After you turn on Automatic Fault Tolerance and configure the related parameters, the system checks jobs to identify algorithmic errors and improve GPU utilization. For more information, see AIMaster: elastic fault tolerance engine.

      Note

      After you enable Automatic Fault Tolerance, the system starts an AIMaster instance that runs together with the job instance and occupies the following resources:

      • Resource quota: one CPU core and 1 GB of memory.

      • Public resources: one ecs.c6.large instance.

      Sanity Check

      After you turn on Sanity Check, the system detects the resources that are used to run the jobs, isolates faulty nodes, and triggers automated O&M processes in the background. This prevents job failure in the early stage of training and improves the training success rate. For more information, see Sanity Check.

      Note

      You can enable the sanity check feature only for PyTorch jobs that run on Lingjun resources and use GPUs.

    • Roles and Permissions

      The following table describes how to configure the Instance RAM Role parameter. For more information, see Configure the DLC RAM role.

      Instance RAM Role

      Description

      Default Role of PAI

      The default role of PAI is based on the AliyunPAIDLCDefaultRole role and has only the permissions to access MaxCompute and OSS. You can use this role to implement fine-grained permission management. The temporary credentials issued by the default role of PAI grant the following permissions:

      • When you access MaxCompute tables, you have the same permissions as the owner of the DLC job.

      • When you access OSS, you can access only the bucket that is configured as the default storage path for the current workspace.

      Custom Roles

      Select or create a custom Resource Access Management (RAM) role. You are granted the same permissions as the custom role you select when you call API operations of other Alibaba Cloud services by using Security Token Service (STS) temporary credentials.

      Does Not Associate Role

      Do not associate a RAM role with the DLC job. This option is selected by default.

  3. After you configure the parameters, click Confirm.

Submit a job by using SDK for Python or the command line

Use SDK for Python

Step 1: Install SDK for Python

  • Install the workspace SDK.

    pip install alibabacloud_aiworkspace20210204==3.0.1
  • Install the DLC SDK.

    pip install alibabacloud_pai_dlc20201203==1.4.17
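
  Optionally, run a quick import check to confirm that both SDKs are installed. The __version__ attribute may not be defined in every release, so a fallback is used:

    # Quick sanity check: both SDKs should import without errors.
    import alibabacloud_aiworkspace20210204 as ws_sdk
    import alibabacloud_pai_dlc20201203 as dlc_sdk

    print('workspace SDK:', getattr(ws_sdk, '__version__', 'installed'))
    print('DLC SDK:', getattr(dlc_sdk, '__version__', 'installed'))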

Step 2: Submit the job

  • If you want to submit a job that runs on pay-as-you-go resources, you can use public resources. Training jobs that run on public resources may encounter queuing delays. We recommend that you use public resources in time-insensitive scenarios that involve a small number of tasks.

  • If you want to submit a job that runs on subscription resources, you can use AI computing resources, such as general computing resources or Lingjun resources. You can use AI computing resources to ensure resource availability in high workload scenarios.

  • If you want to reduce the resource cost for job execution, you can use preemptible resources. Preemptible resources offer a certain discount. However, preemptible resources may be preempted or released. For more information about the limitations and usage of preemptible resources, see Use a preemptible job.

Use public resources to submit jobs

The following sample code provides an example on how to create and submit a DLC job:

#!/usr/bin/env python3

from __future__ import print_function

import json
import time

from alibabacloud_tea_openapi.models import Config
from alibabacloud_credentials.client import Client as CredClient
from alibabacloud_pai_dlc20201203.client import Client as DLCClient
from alibabacloud_pai_dlc20201203.models import (
    ListJobsRequest,
    ListEcsSpecsRequest,
    CreateJobRequest,
    GetJobRequest,
)

from alibabacloud_aiworkspace20210204.client import Client as AIWorkspaceClient
from alibabacloud_aiworkspace20210204.models import (
    ListWorkspacesRequest,
    CreateDatasetRequest,
    ListDatasetsRequest,
    ListImagesRequest,
    ListCodeSourcesRequest
)


def create_nas_dataset(client, region, workspace_id, name,
                       nas_id, nas_path, mount_path):
    '''Create a NAS dataset. 
    '''
    response = client.create_dataset(CreateDatasetRequest(
        workspace_id=workspace_id,
        name=name,
        data_type='COMMON',
        data_source_type='NAS',
        property='DIRECTORY',
        uri=f'nas://{nas_id}.{region}{nas_path}',
        accessibility='PRIVATE',
        source_type='USER',
        options=json.dumps({
            'mountPath': mount_path
        })
    ))
    return response.body.dataset_id


def create_oss_dataset(client, region, workspace_id, name,
                       oss_bucket, oss_endpoint, oss_path, mount_path):
    '''Create an OSS dataset. 
    '''
    response = client.create_dataset(CreateDatasetRequest(
        workspace_id=workspace_id,
        name=name,
        data_type='COMMON',
        data_source_type='OSS',
        property='DIRECTORY',
        uri=f'oss://{oss_bucket}.{oss_endpoint}{oss_path}',
        accessibility='PRIVATE',
        source_type='USER',
        options=json.dumps({
            'mountPath': mount_path
        })
    ))
    return response.body.dataset_id



def wait_for_job_to_terminate(client, job_id):
    while True:
        job = client.get_job(job_id, GetJobRequest()).body
        print('job({}) is {}'.format(job_id, job.status))
        if job.status in ('Succeeded', 'Failed', 'Stopped'):
            return job.status
        time.sleep(5)


def main():

    # Make sure that your Alibaba Cloud account has the required permissions on DLC. 
    region_id = 'cn-hangzhou'
    # The AccessKey pair of an Alibaba Cloud account has permissions on all API operations. Using these credentials to perform operations is a high-risk operation. We recommend that you use a RAM user to call API operations or perform routine O&M. To create a RAM user, log on to the RAM console. 
    # We recommend that you do not save the AccessKey ID and the AccessKey secret in your project code. Otherwise, the AccessKey pair may be leaked, and this may compromise the security of all resources within your account. 
    # In this example, the Credentials SDK reads the AccessKey pair from the environment variables ALIBABA_CLOUD_ACCESS_KEY_ID and ALIBABA_CLOUD_ACCESS_KEY_SECRET to perform identity verification.
    cred = CredClient()

    # 1. Create the clients.
    workspace_client = AIWorkspaceClient(
        config=Config(
            credential=cred,
            region_id=region_id,
            endpoint="aiworkspace.{}.aliyuncs.com".format(region_id),
        )
    )

    dlc_client = DLCClient(
         config=Config(
            credential=cred,
            region_id=region_id,
            endpoint='pai-dlc.{}.aliyuncs.com'.format(region_id),
         )
    )

    print('------- Workspaces -----------')
    # Obtain the workspace list. You can specify the name of the workspace that you created in the workspace_name parameter. 
    workspaces = workspace_client.list_workspaces(ListWorkspacesRequest(
        page_number=1, page_size=1, workspace_name='',
        module_list='PAI'
    ))
    for workspace in workspaces.body.workspaces:
        print(workspace.workspace_name, workspace.workspace_id,
              workspace.status, workspace.creator)

    if len(workspaces.body.workspaces) == 0:
        raise RuntimeError('found no workspaces')

    workspace_id = workspaces.body.workspaces[0].workspace_id

    print('------- Images ------------')
    # Obtain the image list. 
    images = workspace_client.list_images(ListImagesRequest(
        labels=','.join(['system.supported.dlc=true',
                         'system.framework=Tensorflow 1.15',
                         'system.pythonVersion=3.6',
                         'system.chipType=CPU'])))
    for image in images.body.images:
        print(json.dumps(image.to_map(), indent=2))

    image_uri = images.body.images[0].image_uri

    print('------- Datasets ----------')
    # Obtain the dataset. 
    datasets = workspace_client.list_datasets(ListDatasetsRequest(
        workspace_id=workspace_id,
        name='example-nas-data', properties='DIRECTORY'))
    for dataset in datasets.body.datasets:
        print(dataset.name, dataset.dataset_id, dataset.uri, dataset.options)

    if len(datasets.body.datasets) == 0:
        # Create a dataset if the specified dataset does not exist. 
        dataset_id = create_nas_dataset(
            client=workspace_client,
            region=region_id,
            workspace_id=workspace_id,
            name='example-nas-data',
            # The ID of the NAS file system. 
            # General-purpose NAS: 31a8e4****. 
            # Extreme NAS: The ID must start with extreme-. Example: extreme-0015****. 
            # CPFS: The ID must start with cpfs-. Example: cpfs-125487****. 
            nas_id='***',
            nas_path='/',
            mount_path='/mnt/data/nas')
        print('create dataset with id: {}'.format(dataset_id))
    else:
        dataset_id = datasets.body.datasets[0].dataset_id

    print('------- Code Sources ----------')
    # Obtain the source code file list. 
    code_sources = workspace_client.list_code_sources(ListCodeSourcesRequest(
        workspace_id=workspace_id))
    for code_source in code_sources.body.code_sources:
        print(code_source.display_name, code_source.code_source_id, code_source.code_repo)

    print('-------- ECS SPECS ----------')
    # Obtain the DLC node specification list. 
    ecs_specs = dlc_client.list_ecs_specs(ListEcsSpecsRequest(page_size=100, sort_by='Memory', order='asc'))
    for spec in ecs_specs.body.ecs_specs:
        print(spec.instance_type, spec.cpu, spec.memory, spec.gpu, spec.gpu_type)

    print('-------- Create Job ----------')
    # Create a DLC job. 
    create_job_resp = dlc_client.create_job(CreateJobRequest().from_map({
        'WorkspaceId': workspace_id,
        'DisplayName': 'sample-dlc-job',
        'JobType': 'TFJob',
        'JobSpecs': [
            {
                "Type": "Worker",
                "Image": image_uri,
                "PodCount": 1,
                "EcsSpec": ecs_specs.body.ecs_specs[0].instance_type,
            },
        ],
        "UserCommand": "echo 'Hello World' && ls -R /mnt/data/ && sleep 30 && echo 'DONE'",
        'DataSources': [
            {
                "DataSourceId": dataset_id,
            },
        ],
    }))
    job_id = create_job_resp.body.job_id

    wait_for_job_to_terminate(dlc_client, job_id)

    print('-------- List Jobs ----------')
    # Obtain the DLC job list. 
    jobs = dlc_client.list_jobs(ListJobsRequest(
        workspace_id=workspace_id,
        page_number=1,
        page_size=10,
    ))
    for job in jobs.body.jobs:
        print(job.display_name, job.job_id, job.workspace_name,
              job.status, job.job_type)


if __name__ == '__main__':
    main()

Use subscription resources to submit jobs

  1. Log on to the PAI console.

  2. Obtain your workspace ID on the Workspaces page.

  3. Obtain the resource quota ID of your dedicated resource group.

  4. The following sample code provides an example on how to create and submit a job. For information about the available public images, see the "Step 2: Prepare an image" section in the Before you begin topic.

    from alibabacloud_pai_dlc20201203.client import Client
    from alibabacloud_credentials.client import Client as CredClient
    from alibabacloud_tea_openapi.models import Config
    from alibabacloud_pai_dlc20201203.models import (
        CreateJobRequest,
        JobSpec,
        ResourceConfig, GetJobRequest
    )
    
    # Initialize a client to access the DLC API operations. 
    region = 'cn-hangzhou'
    # The AccessKey pair of an Alibaba Cloud account has permissions on all API operations. Using these credentials to perform operations is a high-risk operation. We recommend that you use a RAM user to call API operations or perform routine O&M. To create a RAM user, log on to the RAM console. 
    # We recommend that you do not save the AccessKey ID and the AccessKey secret in your project code. Otherwise, the AccessKey pair may be leaked, and this may compromise the security of all resources within your account. 
    # In this example, the Credentials SDK reads the AccessKey pair from the environment variables ALIBABA_CLOUD_ACCESS_KEY_ID and ALIBABA_CLOUD_ACCESS_KEY_SECRET to perform identity verification. 
    cred = CredClient()
    client = Client(
        config=Config(
            credential=cred,
            region_id=region,
            endpoint=f'pai-dlc.{region}.aliyuncs.com',
        )
    )
    
    # Specify the resource configurations of the job. You can select a public image or specify an image address. For information about the available public images, see the reference documentation. 
    spec = JobSpec(
        type='Worker',
        image=f'registry-vpc.{region}.aliyuncs.com/pai-dlc/tensorflow-training:1.15-cpu-py36-ubuntu18.04',
        pod_count=1,
        resource_config=ResourceConfig(cpu='1', memory='2Gi')
    )
    
    # Specify the execution information for the job. 
    req = CreateJobRequest(
            resource_id='<Your resource quota ID>',
            workspace_id='<Your workspace ID>',
            display_name='sample-dlc-job',
            job_type='TFJob',
            job_specs=[spec],
            user_command='echo "Hello World"',
    )
    
    # Submit the job. 
    response = client.create_job(req)
    # Obtain the job ID. 
    job_id = response.body.job_id
    
    # Query the job status. 
    job = client.get_job(job_id, GetJobRequest()).body
    print('job status:', job.status)
    
    # View the command that the job runs. 
    print('job command:', job.user_command)
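
  To wait for the job to reach a terminal state, you can reuse the polling pattern from the wait_for_job_to_terminate helper in the public-resources example above:

    import time

    # Poll the job status every 5 seconds until it reaches a terminal state.
    while True:
        job = client.get_job(job_id, GetJobRequest()).body
        print('job({}) is {}'.format(job_id, job.status))
        if job.status in ('Succeeded', 'Failed', 'Stopped'):
            break
        time.sleep(5)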

Use the command line

Step 1: Download the DLC client and perform user authentication

Download the DLC client for your operating system and verify your credentials. For more information, see Before you begin.

Step 2: Submit the job

  1. Log on to the PAI console.

  2. Obtain your workspace ID on the Workspaces page.

  3. Obtain the resource quota ID.

  4. Create a parameter file named tfjob.params and copy the following content into the file. Change the parameter values based on your business requirements. For information about how to use the command line in the DLC client, see Supported commands.

    name=test_cli_tfjob_001
    workers=1
    worker_cpu=4
    worker_gpu=0
    worker_memory=4Gi
    worker_shared_memory=4Gi
    worker_image=registry-vpc.cn-beijing.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-cpu-py27-ubuntu16.04
    command=echo good && sleep 120
    resource_id=<Your resource quota ID>  
    workspace_id=<Your workspace ID> 
  5. The following sample command submits a DLC job to the specified workspace and resource quota by using the parameter file. For information about how to use the command line in the DLC client, see Supported commands.

    ./dlc submit tfjob --job_file ./tfjob.params
  6. The following sample command queries the DLC job that you created.

    ./dlc get job <jobID>

What to do next

After you submit the job, you can perform the following operations:

  • View the basic information, resource view, and operation logs of the job. For more information, see View training jobs.

  • Manage jobs, including cloning, stopping, and deleting jobs. For more information, see Manage training jobs.

  • View the training results on TensorBoard. For more information, see Use TensorBoard to view training results in DLC.

  • View the billing details when the job is completed. For more information, see Bill details.

  • Enable the log forwarding feature to forward logs of DLC jobs from the current workspace to a specific Logstore for custom analysis. For more information, see Subscribe to job logs.

  • Create a notification rule for a PAI workspace to track and monitor the status of DLC jobs. For more information, see Notification rule.

  • If you have other questions about DLC jobs, see FAQ about DLC.
