Platform For AI: Submit training jobs

Last Updated:Nov 28, 2024

After you complete the preparations, you can submit Deep Learning Containers (DLC) jobs in the Platform for AI (PAI) console or by using SDK for Python or the command line. This topic describes how to submit a DLC job.

Prerequisites

Submit a job in the PAI console

  1. Go to the Create Job page.

    1. Log on to the PAI console. Select a region and a workspace. Then, click Enter Deep Learning Containers (DLC).

    2. On the Deep Learning Containers (DLC) page, click Create Job.

  2. Configure the parameters in the following sections.

    Basic Information

    In the Basic Information section, configure the Job Name and Tag parameters.

    Environment Information

    In the Environment Information section, configure the key parameters. The following table describes the parameters.

    Parameter

    Description

    Node Image

    The worker node image. You can select one of the following node images:

    • Alibaba Cloud Image: an image provided by Alibaba Cloud PAI. Such images support different Python versions and deep learning frameworks, such as TensorFlow and PyTorch. For more information, see Before you begin.

    • Custom Image: a custom image that you uploaded to PAI. Before you select this option, you must upload your custom image to PAI. For more information about how to upload an image, see Custom images.

      Note

      If you want to use Lingjun resources, install Remote Direct Memory Access (RDMA) support in your custom image so that the job can use the high-performance RDMA network of Lingjun resources. For more information, see RDMA: high-performance networks for distributed training.

    • Image Address: a custom, community, or Alibaba Cloud image that can be accessed by using the image address. If you select Image Address, you must also specify the public URL of the Docker registry image that you want to access.

      If you want to specify the private URL of an image, click Enter and configure the Image Repository Username and Image Repository Password parameters to grant permissions on the private image registry.

      You can also use an accelerated image to accelerate model training. For more information, see Use an accelerated image in PAI.

    Data Set

    You can select one of the following dataset types.

    • Custom Dataset: Select a dataset that you prepared. For more information about how to prepare a dataset, see Step 3: Prepare a dataset.

    • Public Dataset: Select an existing public dataset provided by PAI. The mount option for the data is read-only.

    Set the Mount Path parameter to the specific path in the DLC training container, such as /mnt/data. DLC retrieves the required files based on the mount path you specified. For more information, see Use cloud storage for a DLC training job.

    Important
    • If you select an OSS dataset or a File Storage NAS (NAS) dataset, you must grant PAI the permissions to access OSS or NAS. Otherwise, PAI cannot read or write data. For more information, see the "Grant PAI the permissions to access OSS and NAS" section in the Grant the permissions that are required to use DLC topic.

    • If you select a Cloud Parallel File Storage (CPFS) dataset, you must configure a virtual private cloud (VPC). The VPC must be the same as the VPC configured for the CPFS dataset. Otherwise, the job may stay in the preparing environment state for a long time.

    Direct mount

    Click OSS to mount an OSS path to the specified DLC path.

    Startup Command

    The command that the job runs. Shell commands are supported. For example, you can use the python -c "print('Hello World')" command to run Python.

    When you submit a job, PAI automatically injects multiple general environment variables. To obtain the value of a specific environment variable, reference it in the command as $ followed by the variable name. For more information about the general environment variables provided by DLC, see General environment variables.

    Note
    • If you configure a dataset, the training results are stored in the directory on which the dataset is mounted.

    • If you specify the output path by using variables in the command, the training results are stored in the specified path.
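The note above can be illustrated with a short training-script sketch. It assumes a hypothetical variable named OUTPUT_DIR that you define yourself in the Environment Variable field, and a fallback directory under the example mount path /mnt/data. Neither name is prescribed by DLC.

```python
import os
from pathlib import Path


def resolve_output_dir(env=None, default="/mnt/data/output"):
    """Return the directory in which training results should be written.

    OUTPUT_DIR is a hypothetical variable name used for illustration only;
    the default assumes a dataset mounted at /mnt/data.
    """
    env = os.environ if env is None else env
    return Path(env.get("OUTPUT_DIR", default))


if __name__ == "__main__":
    out_dir = resolve_output_dir()
    print(f"training results will be written to {out_dir}")
```

Reading the path from the environment keeps the startup command unchanged when you switch between a mounted dataset and an explicitly specified output path.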

    Environment Variable

    Additional configuration information or parameters. The format is key:value. You can configure up to 20 environment variables.
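As a rough illustration of the key:value format and the limit of 20 variables, the following sketch validates a list of entries before you paste them into the console. The function name and the sample keys are hypothetical; the console performs its own validation, which may differ.

```python
MAX_ENV_VARS = 20  # limit stated in the parameter description


def parse_env_vars(lines):
    """Parse 'key:value' strings into a dict and enforce the 20-variable limit."""
    env = {}
    for line in lines:
        # Split on the first colon only, so values such as URLs stay intact.
        key, sep, value = line.partition(":")
        if not sep or not key.strip():
            raise ValueError(f"expected key:value, got {line!r}")
        env[key.strip()] = value.strip()
    if len(env) > MAX_ENV_VARS:
        raise ValueError(f"at most {MAX_ENV_VARS} environment variables are allowed")
    return env
```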

    Third-party Libraries

    Valid values:

    • Select from List: Enter the name of a third-party library in the field.

    • Directory of requirements.txt: Enter the path of the requirements.txt file in the field. You must include the address of the third-party library in the requirements.txt file.

    Code Builds

    Valid values:

    • Online Configuration

      Specify the location of the repository that stores the code file of the job. In this example, a code build that you prepared is selected. For information about how to create a code build, see the "Step 4: Prepare a code build" section in the Before you begin topic.

      Note

      DLC automatically downloads the code to the specified working path. Make sure that your account has permissions to access the repository.

    • Local Upload

      Click the icon and follow the on-screen instructions to upload the code build. After the upload succeeds, set the Mount Path parameter to the specified path in the container, such as /mnt/data.

    Resource Information

    In the Resource Information section, configure the key parameters. The following table describes the parameters.

    Parameter

    Description

    Instance type

    This parameter is available only if the workspace allows you to use Lingjun resources and general computing resources to submit jobs in DLC.

    • Lingjun Resources

      Note

      Lingjun resources are available only in the China (Ulanqab) and Singapore regions.

    • General Computing Resources

    Source

    Valid values: Public Resources and Resource Quota. Resource Quota includes general computing resources and Lingjun resources.

    Note
    • Public resources can provide up to two GPUs and eight vCPUs. To increase the resource quota, contact your account manager.

    • Preemptible resources have the following limitations. For more information, see Use a preemptible job.

      • Before you use a preemptible job, contact your business manager to add you to the whitelist.

      • Preemptible jobs are available only in the China (Ulanqab) and Singapore regions.

      • Only Lingjun resources support preemptible jobs.

    Resource Quota

    This parameter is required only if you set the Source parameter to Resource Quota. Select the resource quota that you prepared. For more information about how to prepare a resource quota, see Resource quota overview.

    Priority

    This parameter is available only if you set the Source parameter to Resource Quota.

    Specify the priority for running the job. Valid values: 1 to 9. A greater value indicates a higher priority.

    Framework

    Specify the deep learning training framework and tool. The framework provides rich features and operations that you can use to build, train, and optimize deep learning models.

    • TensorFlow

    • PyTorch

    • ElasticBatch

    • XGBoost

    • OneFlow

    • MPIJob

    • Slurm

    • Ray

    Note

    If you set the Resource Quota parameter to Lingjun resources, you can submit only the following types of jobs: TensorFlow, PyTorch, ElasticBatch, MPIJob, Slurm, and Ray.

    Job Resource

    Configure the following nodes based on the framework you selected: worker nodes, parameter server (PS) nodes, chief nodes, evaluator nodes, and GraphLearn nodes.

    • Use public resources

      Configure the following parameters:

      • Number of Nodes: the number of nodes on which the DLC job runs.

      • Instance Type: Click the icon to select an instance type. For information about the billing of resource specifications, see Billing of DLC.

    • Use general computing resources or Lingjun resources

      Configure the following parameters for the nodes: Number of Nodes, vCPUs, GPUs, Memory (GiB), and Shared Memory (GiB).

    • Use preemptible resources

      Configure the following parameters:

      • Number of Nodes: the number of nodes on which the DLC job runs.

      • Instance Type: Click the icon to select an instance type.

      • Maximum Bid Price: the maximum bid price to apply for the resources. The maximum bid price ranges from 10% to 90% of the market price in 10% increments. You obtain the preemptible resources if your bid meets or exceeds the market price and inventory is available.
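To make the bid range concrete, the following sketch lists the nine selectable bids for a given market price. The function name and the example price are illustrative; the actual market price is whatever the console reports.

```python
from decimal import Decimal


def allowed_bid_prices(market_price):
    """Return the selectable bids: 10%, 20%, ..., 90% of the market price."""
    price = Decimal(str(market_price))
    return [price * Decimal(pct) / Decimal(100) for pct in range(10, 100, 10)]


if __name__ == "__main__":
    # Hypothetical market price of 2.00 per hour, in an unspecified currency.
    for bid in allowed_bid_prices("2.00"):
        print(bid)
```

Decimal is used instead of float so that the computed prices do not pick up binary rounding noise.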

    Node-Specific Scheduling

    After you enable this feature, you can select nodes for scheduling.

    Note

    This parameter is available only when you use resource quota.

    CPU Affinity

    Enabling CPU affinity allows processes in a container or Pod to be bound to specific CPU cores for execution. This approach can reduce CPU cache misses and context switching, improving CPU utilization and enhancing application performance. It is suitable for scenarios that are sensitive to performance and have high real-time requirements.

    Maximum Duration

    You can specify the maximum duration for which a job runs. The job is automatically stopped when the uptime of the job exceeds the maximum duration. Default value: 30. Unit: hours.

    Retention Period

    Specify the retention period of jobs after they are completed or fail. During the retention period, the resources are occupied. After the retention period ends, the jobs are deleted.

    Important

    DLC jobs that are deleted cannot be restored. Exercise caution when you delete the jobs.

    VPC

    This parameter is available only if you set the Source parameter to Public Resources.

    • If you do not configure a VPC, an Internet connection is used. Due to the limited bandwidth of the Internet, the job may not progress or may not run as expected.

    • To ensure sufficient network bandwidth and stable performance, we recommend that you configure a VPC.

      Select a VPC, a vSwitch, and a security group in the current region. When the configuration takes effect, the cluster on which the job runs directly accesses the services in the VPC and performs access control based on the selected security group.

      You can also configure the Internet Access Gateway parameter.

      • Private Gateway: dedicated bandwidth. You can configure the bandwidth based on your business requirements. If you access the Internet by using a private gateway, you need to create an Internet NAT gateway, associate an elastic IP address (EIP) with a DSW instance and configure SNAT in the VPC that is associated with the DSW instance. For more information, see Enable Internet access for a DSW instance by using a private Internet NAT gateway.

      • Public Gateway: shared public bandwidth. The download rate is slow in high concurrency scenarios.

    Important
    • Before you run a DLC job, make sure that instances in the resource group and the OSS bucket of the dataset reside in the VPCs of the same region, and that the VPCs are connected to the networks of the code repository.

    • If you select a CPFS dataset, you must configure a VPC. The VPC must be the same as the VPC configured for the CPFS dataset. Otherwise, the job may stay in the preparing environment state for a long time.

    Fault Tolerance and Diagnosis

    In the Fault Tolerance and Diagnosis section, configure the key parameters. The following table describes the parameters.

    Parameter

    Description

    Automatic Fault Tolerance

    After you turn on Automatic Fault Tolerance and configure the related parameters, the system checks the jobs to identify algorithmic errors of the jobs and improve GPU utilization. For more information, see AIMaster: elastic fault tolerance engine.

    Note

    After you enable Automatic Fault Tolerance, the system starts an AIMaster instance that runs together with the job instance and occupies the following resources:

    • Resource quota: one CPU core and 1 GB of memory.

    • Public resources: uses ecs.c6.large.

    Sanity Check

    After you turn on Sanity Check, the system detects the resources that are used to run the jobs, isolates faulty nodes, and triggers automated O&M processes in the background. This prevents job failure in the early stage of training and improves the training success rate. For more information, see Sanity Check.

    Note

    You can enable the sanity check feature only for PyTorch jobs that run on Lingjun resources and use GPUs.

    Roles and Permissions

    In the Roles and Permissions section, configure the Instance RAM Role parameter. For more information, see Configure the DLC RAM role.

    Instance RAM Role

    Description

    Default Role of PAI

    The default role of PAI is developed based on the AliyunPAIDLCDefaultRole role and has only the permissions to access MaxCompute and OSS. You can use this role to implement fine-grained permission management. When you use the temporary credentials issued by the default role of PAI:

    • You are granted the same permissions as the owner of a DLC job when you access MaxCompute tables.

    • When you access OSS, you can access only the bucket that is configured as the default storage path for the current workspace.

    Custom Role

    Select or create a custom Resource Access Management (RAM) role. You are granted the same permissions as the custom role you select when you call API operations of other Alibaba Cloud services by using Security Token Service (STS) temporary credentials.

    Does Not Associate Role

    Do not associate a RAM role with the DLC job. By default, this option is selected.

  3. After you confirm the parameters, click OK to submit the job.

Submit a job by using SDK for Python or the command line

Use SDK for Python

Step 1: Install SDK for Python

  • Install the workspace SDK.

    pip install alibabacloud_aiworkspace20210204==3.0.1
  • Install the DLC SDK.

    pip install alibabacloud_pai_dlc20201203==1.4.0
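After installation, you may want to confirm that the pinned versions are the ones Python actually sees. The following sketch uses only the standard library; the helper name check_installed is hypothetical, and the package names and versions are copied from the pip commands above.

```python
from importlib.metadata import PackageNotFoundError, version

# Package names and versions copied from the pip commands above.
EXPECTED = {
    "alibabacloud_aiworkspace20210204": "3.0.1",
    "alibabacloud_pai_dlc20201203": "1.4.0",
}


def check_installed(expected=EXPECTED):
    """Map each package to (installed version or None, matches expected)."""
    report = {}
    for name, wanted in expected.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            found = None
        report[name] = (found, found == wanted)
    return report


if __name__ == "__main__":
    for name, (found, ok) in check_installed().items():
        print(f"{name}: installed={found} ok={ok}")
```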

Step 2: Submit the job

  • If you want to submit a job that runs on pay-as-you-go resources, you can use public resources. Training jobs that run on public resources may encounter queuing delays. We recommend that you use public resources in time-insensitive scenarios that involve a small number of tasks.

  • If you want to submit a job that runs on subscription resources, you can use dedicated resources, such as general computing resources or Lingjun resources. You can use dedicated resources to ensure resource availability in high workload scenarios.

Use public resources to submit jobs

The following sample code provides an example on how to create and submit a DLC job:

#!/usr/bin/env python3

from __future__ import print_function

import json
import time

from alibabacloud_tea_openapi.models import Config
from alibabacloud_credentials.client import Client as CredClient
from alibabacloud_pai_dlc20201203.client import Client as DLCClient
from alibabacloud_pai_dlc20201203.models import (
    ListJobsRequest,
    ListEcsSpecsRequest,
    CreateJobRequest,
    GetJobRequest,
)

from alibabacloud_aiworkspace20210204.client import Client as AIWorkspaceClient
from alibabacloud_aiworkspace20210204.models import (
    ListWorkspacesRequest,
    CreateDatasetRequest,
    ListDatasetsRequest,
    ListImagesRequest,
    ListCodeSourcesRequest
)


def create_nas_dataset(client, region, workspace_id, name,
                       nas_id, nas_path, mount_path):
    '''Create a NAS dataset. 
    '''
    response = client.create_dataset(CreateDatasetRequest(
        workspace_id=workspace_id,
        name=name,
        data_type='COMMON',
        data_source_type='NAS',
        property='DIRECTORY',
        uri=f'nas://{nas_id}.{region}{nas_path}',
        accessibility='PRIVATE',
        source_type='USER',
        options=json.dumps({
            'mountPath': mount_path
        })
    ))
    return response.body.dataset_id


def create_oss_dataset(client, region, workspace_id, name,
                       oss_bucket, oss_endpoint, oss_path, mount_path):
    '''Create an OSS dataset. 
    '''
    response = client.create_dataset(CreateDatasetRequest(
        workspace_id=workspace_id,
        name=name,
        data_type='COMMON',
        data_source_type='OSS',
        property='DIRECTORY',
        uri=f'oss://{oss_bucket}.{oss_endpoint}{oss_path}',
        accessibility='PRIVATE',
        source_type='USER',
        options=json.dumps({
            'mountPath': mount_path
        })
    ))
    return response.body.dataset_id



def wait_for_job_to_terminate(client, job_id):
    while True:
        job = client.get_job(job_id, GetJobRequest()).body
        print('job({}) is {}'.format(job_id, job.status))
        if job.status in ('Succeeded', 'Failed', 'Stopped'):
            return job.status
        time.sleep(5)


def main():

    # Make sure that your Alibaba Cloud account has the required permissions on DLC. 
    region_id = 'cn-hangzhou'
    # The AccessKey pair of an Alibaba Cloud account has permissions on all API operations. Using these credentials to perform operations is a high-risk operation. We recommend that you use a RAM user to call API operations or perform routine O&M. To create a RAM user, log on to the RAM console. 
    # We recommend that you do not save the AccessKey ID and the AccessKey secret in your project code. Otherwise, the AccessKey pair may be leaked, and this may compromise the security of all resources within your account. 
    # In this example, the Credentials SDK reads the AccessKey pair from the environment variables to perform identity verification. 
    cred = CredClient()

    # Create the workspace client and the DLC client.
    workspace_client = AIWorkspaceClient(
        config=Config(
            credential=cred,
            region_id=region_id,
            endpoint="aiworkspace.{}.aliyuncs.com".format(region_id),
        )
    )

    dlc_client = DLCClient(
         config=Config(
            credential=cred,
            region_id=region_id,
            endpoint='pai-dlc.{}.aliyuncs.com'.format(region_id),
         )
    )

    print('------- Workspaces -----------')
    # Obtain the workspace list. You can specify the name of the workspace that you created in the workspace_name parameter. 
    workspaces = workspace_client.list_workspaces(ListWorkspacesRequest(
        page_number=1, page_size=1, workspace_name='',
        module_list='PAI'
    ))
    for workspace in workspaces.body.workspaces:
        print(workspace.workspace_name, workspace.workspace_id,
              workspace.status, workspace.creator)

    if len(workspaces.body.workspaces) == 0:
        raise RuntimeError('found no workspaces')

    workspace_id = workspaces.body.workspaces[0].workspace_id

    print('------- Images ------------')
    # Obtain the image list. 
    images = workspace_client.list_images(ListImagesRequest(
        labels=','.join(['system.supported.dlc=true',
                         'system.framework=Tensorflow 1.15',
                         'system.pythonVersion=3.6',
                         'system.chipType=CPU'])))
    for image in images.body.images:
        print(json.dumps(image.to_map(), indent=2))

    if len(images.body.images) == 0:
        raise RuntimeError('found no images')

    image_uri = images.body.images[0].image_uri

    print('------- Datasets ----------')
    # Obtain the dataset. 
    datasets = workspace_client.list_datasets(ListDatasetsRequest(
        workspace_id=workspace_id,
        name='example-nas-data', properties='DIRECTORY'))
    for dataset in datasets.body.datasets:
        print(dataset.name, dataset.dataset_id, dataset.uri, dataset.options)

    if len(datasets.body.datasets) == 0:
        # Create a dataset if the specified dataset does not exist. 
        dataset_id = create_nas_dataset(
            client=workspace_client,
            region=region_id,
            workspace_id=workspace_id,
            name='example-nas-data',
            # The ID of the NAS file system. 
            # General-purpose NAS: 31a8e4****. 
            # Extreme NAS: The ID must start with extreme-. Example: extreme-0015****. 
            # CPFS: The ID must start with cpfs-. Example: cpfs-125487****. 
            nas_id='***',
            nas_path='/',
            mount_path='/mnt/data/nas')
        print('create dataset with id: {}'.format(dataset_id))
    else:
        dataset_id = datasets.body.datasets[0].dataset_id

    print('------- Code Sources ----------')
    # Obtain the source code file list. 
    code_sources = workspace_client.list_code_sources(ListCodeSourcesRequest(
        workspace_id=workspace_id))
    for code_source in code_sources.body.code_sources:
        print(code_source.display_name, code_source.code_source_id, code_source.code_repo)

    print('-------- ECS SPECS ----------')
    # Obtain the DLC node specification list. 
    ecs_specs = dlc_client.list_ecs_specs(ListEcsSpecsRequest(page_size=100, sort_by='Memory', order='asc'))
    for spec in ecs_specs.body.ecs_specs:
        print(spec.instance_type, spec.cpu, spec.memory, spec.gpu, spec.gpu_type)

    print('-------- Create Job ----------')
    # Create a DLC job. 
    create_job_resp = dlc_client.create_job(CreateJobRequest().from_map({
        'WorkspaceId': workspace_id,
        'DisplayName': 'sample-dlc-job',
        'JobType': 'TFJob',
        'JobSpecs': [
            {
                "Type": "Worker",
                "Image": image_uri,
                "PodCount": 1,
                "EcsSpec": ecs_specs.body.ecs_specs[0].instance_type,
                "UseSpotInstance": False,
            },
        ],
        "UserCommand": "echo 'Hello World' && ls -R /mnt/data/ && sleep 30 && echo 'DONE'",
        'DataSources': [
            {
                "DataSourceId": dataset_id,
            },
        ],
    }))
    job_id = create_job_resp.body.job_id

    wait_for_job_to_terminate(dlc_client, job_id)

    print('-------- List Jobs ----------')
    # Obtain the DLC job list. 
    jobs = dlc_client.list_jobs(ListJobsRequest(
        workspace_id=workspace_id,
        page_number=1,
        page_size=10,
    ))
    for job in jobs.body.jobs:
        print(job.display_name, job.job_id, job.workspace_name,
              job.status, job.job_type)


if __name__ == '__main__':
    main()
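The polling loop in the sample above runs until the job reaches a terminal state, however long that takes. A variant with a timeout might look like the following sketch. The helper name wait_for_status and its parameters are hypothetical; the terminal statuses are the ones used in the sample.

```python
import time

# Same terminal statuses as in the sample above.
TERMINAL_STATUSES = ("Succeeded", "Failed", "Stopped")


def wait_for_status(get_status, timeout_s=3600.0, interval_s=5.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns a terminal status or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout_s} s")
        sleep(interval_s)
```

With the clients from the sample, the call could be wait_for_status(lambda: dlc_client.get_job(job_id, GetJobRequest()).body.status). Passing the status lookup as a callable keeps the helper independent of the SDK, so it can be exercised without submitting a job.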

Use subscription resources to submit jobs

  1. Log on to the PAI console.

  2. Obtain your workspace ID on the Workspaces page.

  3. Obtain the resource quota ID of your dedicated resource group.

  4. The following sample code provides an example on how to create and submit a job. For information about the available public images, see the "Step 2: Prepare an image" section in the Before you begin topic.

    from alibabacloud_pai_dlc20201203.client import Client
    from alibabacloud_credentials.client import Client as CredClient
    from alibabacloud_tea_openapi.models import Config
    from alibabacloud_pai_dlc20201203.models import (
        CreateJobRequest,
        JobSpec,
        ResourceConfig, GetJobRequest
    )
    
    # Initialize a client to access the DLC API operations. 
    region = 'cn-hangzhou'
    # The AccessKey pair of an Alibaba Cloud account has permissions on all API operations. Using these credentials to perform operations is a high-risk operation. We recommend that you use a RAM user to call API operations or perform routine O&M. To create a RAM user, log on to the RAM console. 
    # We recommend that you do not save the AccessKey ID and the AccessKey secret in your project code. Otherwise, the AccessKey pair may be leaked, and this may compromise the security of all resources within your account. 
    # In this example, the Credentials SDK reads the AccessKey pair from the environment variables to perform identity verification. 
    cred = CredClient()
    client = Client(
        config=Config(
            credential=cred,
            region_id=region,
            endpoint=f'pai-dlc.{region}.aliyuncs.com',
        )
    )
    
    # Specify the resource configurations of the job. You can select a public image or specify an image address. For information about the available public images, see the reference documentation. 
    spec = JobSpec(
        type='Worker',
        image='registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.15-cpu-py36-ubuntu18.04',
        pod_count=1,
        resource_config=ResourceConfig(cpu='1', memory='2Gi')
    )
    
    # Specify the execution information for the job. 
    req = CreateJobRequest(
            resource_id='<Your resource quota ID>',
            workspace_id='<Your workspace ID>',
            display_name='sample-dlc-job',
            job_type='TFJob',
            job_specs=[spec],
            user_command='echo "Hello World"',
    )
    
    # Submit the job. 
    response = client.create_job(req)
    # Obtain the job ID. 
    job_id = response.body.job_id
    
    # Query the job status. 
    job = client.get_job(job_id, GetJobRequest()).body
    print('job status:', job.status)
    
    # Print the command that the job runs. 
    print('job command:', job.user_command)

Use the command line

Step 1: Download the DLC client and perform user authentication

Download the DLC client for your operating system and verify your credentials. For more information, see Before you begin.

Step 2: Submit the job

  1. Log on to the PAI console.

  2. Obtain your workspace ID on the Workspaces page.

  3. Obtain the resource quota ID.

  4. Create a parameter file named tfjob.params and copy the following content into the file. Change the parameter values based on your business requirements. For information about how to use the command line in the DLC client, see Supported commands.

    name=test_cli_tfjob_001
    workers=1
    worker_cpu=4
    worker_gpu=0
    worker_memory=4Gi
    worker_shared_memory=4Gi
    worker_image=registry-vpc.cn-beijing.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-cpu-py27-ubuntu16.04
    command=echo good && sleep 120
    resource_id=<Your resource quota ID>  
    workspace_id=<Your workspace ID> 
  5. Run the following command, which uses the --job_file parameter to submit a DLC job to the workspace and resource quota specified in the parameter file.

    ./dlc submit tfjob --job_file ./tfjob.params
  6. Run the following command to query the DLC job that you submitted.

    ./dlc get job <jobID>
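If you submit many jobs, the parameter file from step 4 can be generated instead of edited by hand. The following is a minimal sketch: the helper name render_params is hypothetical, and the keys are the ones shown in the sample tfjob.params file.

```python
def render_params(params):
    """Render a dict as key=value lines, one parameter per line."""
    lines = []
    for key, value in params.items():
        text = str(value)
        if "\n" in text:
            raise ValueError(f"value for {key!r} must fit on one line")
        lines.append(f"{key}={text}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_params({
        "name": "test_cli_tfjob_001",
        "workers": 1,
        "worker_cpu": 4,
        "worker_gpu": 0,
        "worker_memory": "4Gi",
        "command": "echo good && sleep 120",
    }), end="")
```

Write the result to disk with, for example, pathlib.Path('tfjob.params').write_text(...), and then pass the file to the submit command shown in step 5.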

What to do next

After you submit the job, you can perform the following operations:

  • View the basic information, resource view, and operation logs of the job. For more information, see View training jobs.

  • Manage jobs, including cloning, stopping, and deleting jobs. For more information, see Manage training jobs.

  • View the training results on TensorBoard. For more information, see Use TensorBoard to view training results in DLC.

  • View the billing details when the job is completed. For more information, see Bill details.

  • Enable the log forwarding feature to forward logs of DLC jobs from the current workspace to a specific Logstore for custom analysis. For more information, see Subscribe to job logs.

  • Create a notification rule for a PAI workspace to track and monitor the status of DLC jobs. For more information, see Create a notification rule.

  • If you have other questions about DLC jobs, see FAQ about DLC.