Platform For AI: Commands used to submit jobs

Last Updated: Nov 19, 2024

You can use the Deep Learning Containers (DLC) client to submit training jobs of various types. This topic describes the commands used to submit training jobs, including call formats, parameter descriptions, and usage examples.

Common parameters that are used to submit training jobs

The parameters described in the following table apply to all training jobs that you submit by using the DLC client, regardless of whether the jobs are of the TensorFlow, PyTorch, or XGBoost type. Only the parameters marked Yes in the Required column are mandatory.

Table 1. Common parameters used to submit training jobs

| Parameter | Required | Description | Type | Supported in the parameter description file |
| --- | --- | --- | --- | --- |
| name | Yes | The name of the job. The name does not need to be unique. | STRING | Yes |
| command | Yes | The command that is run to start the node. | STRING | Yes |
| data_sources | No | The ID of the associated dataset. You can obtain the dataset ID on the Datasets page. For more information, see Create and manage datasets. Separate multiple data sources with commas (,). By default, this parameter is left empty. | STRING | Yes |
| code_source | No | The ID of the code set. You can obtain the code set ID on the Source Code Repositories page. For more information, see Code builds. You can specify only a single code source. By default, this parameter is left empty. | STRING | Yes |
| code_branch | No | The branch of the code repository. This parameter is used together with the code_source parameter. | STRING | Yes |
| code_commit | No | The commit ID of the code repository. This parameter is used together with the code_source parameter. | STRING | Yes |
| thirdparty_libs | No | The third-party Python libraries to install. Separate multiple libraries with commas (,). By default, this parameter is left empty. | STRING | Yes |
| thirdparty_lib_dir | No | The directory that contains the text file named requirements.txt. The file is used to install third-party Python libraries. By default, this parameter is left empty. | STRING | No |
| vpc_id | No | The ID of the available virtual private cloud (VPC) for the job. By default, this parameter is left empty. | STRING | Yes |
| switch_id | No (required if the vpc_id parameter is configured) | The ID of the available vSwitch for the job in the VPC that is specified by the vpc_id parameter. By default, this parameter is left empty. | STRING | Yes |
| security_group_id | No (required if the vpc_id parameter is configured) | The ID of the available security group for the job in the VPC that is specified by the vpc_id parameter. By default, this parameter is left empty. | STRING | Yes |
| job_file | No | The parameter description file of the job. If this parameter is specified, the parameters described in the file take precedence. Specify the parameters in the description file in the key=value format. The keys are the same as the keys of the parameters used in the client. | STRING | No |
| interactive | No | Specifies whether to start the job in interactive mode. | BOOL | Yes |
| job_max_running_time_minutes | No | The maximum running time of the job, in minutes. Default value: 0, which indicates that the running time of the job is unlimited. | INT64 | Yes |
| success_policy | No | Applies only to TensorFlow jobs. Valid values: ChiefWorker (the job is complete if the pod on the chief node is terminated); AllWorkers (the job is complete only if the pods on all nodes are terminated). By default, this parameter is left empty, which is equivalent to AllWorkers. | STRING | Yes |
| envs | No | The environment variables for the worker node. Separate environment variables with commas (,). Separate a key and a value in an environment variable with an equal sign (=). Configure the environment variables in the key1=value1,key2=value2 format. | StringToString | Yes |
| tags | No | The tags that you want to add to the job. Separate tags with commas (,). Separate a key and a value in a tag with an equal sign (=). Configure the tags in the key1=value1,key2=value2 format. | StringToString | Yes |
| oversold_type | No | The way in which computing resources for off-peak hours are used for the job. Valid values: AcceptQuotaOverSold (computing resources for off-peak hours can be used for the job); ForceQuotaOverSold (only computing resources for off-peak hours can be used for the job); ForbiddenQuotaOverSold (only resources in the associated quota can be used for the job, and computing resources for off-peak hours cannot be used). | STRING | Yes |
| driver | No | The GPU driver version used for the job. | STRING | Yes |
| default_route | No | The method to access the Internet if you select a virtual private cloud (VPC). Valid values: eth0 (default; a public gateway is used to access the Internet); eth1 (a dedicated gateway is used to access the Internet over the selected VPC). | STRING | Yes |
| priority | No | The priority of the job. Valid values: 1 to 9. Default value: 1. The value 1 indicates the lowest priority, and the value 9 indicates the highest priority. | INT32 | Yes |
| exit_code_on_stopped | No | The exit code of the CML when a task that is run in interactive mode is stopped. Default value: 0. | INT32 | Yes |
| job_reserved_minutes | No | The retention period after the task ends. Unit: minutes. Default value: 0. | INT32 | Yes |
| job_reserved_policy | No | The policy that is used to retain the task. Valid values: Always (default; the task is retained regardless of whether it succeeds or fails); OnFailure (the task is retained if it fails); OnSucceed (the task is retained if it runs successfully). | STRING | Yes |
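
The constraints in Table 1 can be checked before a job is submitted. The following is a minimal sketch, not part of the DLC client: the function names check_job_file and job_file_value are hypothetical, and the script only assumes a parameter description file in the key=value format described for job_file. It checks that name and command are present, that switch_id and security_group_id accompany vpc_id, and that priority is in the range 1 to 9.

```shell
# Hypothetical helper: print the value of a key (the text after the first "=")
# from a key=value parameter description file, or nothing if the key is absent.
job_file_value() {
  sed -n "s/^$2=//p" "$1" | head -n 1
}

# Hypothetical pre-submit check. Prints one message per problem and
# returns a non-zero status if any problem was found.
check_job_file() {
  file=$1
  rc=0

  # name and command are marked as required in Table 1.
  for key in name command; do
    if [ -z "$(job_file_value "$file" "$key")" ]; then
      echo "missing required parameter: $key"
      rc=1
    fi
  done

  # switch_id and security_group_id are required once vpc_id is configured.
  if [ -n "$(job_file_value "$file" vpc_id)" ]; then
    for key in switch_id security_group_id; do
      if [ -z "$(job_file_value "$file" "$key")" ]; then
        echo "vpc_id is set but $key is missing"
        rc=1
      fi
    done
  fi

  # priority, when present, must be an integer from 1 to 9.
  priority=$(job_file_value "$file" priority)
  if [ -n "$priority" ]; then
    case $priority in
      [1-9]) ;;
      *) echo "priority must be an integer from 1 to 9"; rc=1 ;;
    esac
  fi

  return $rc
}
```

A possible use is to run the check first and submit only if it passes, for example `check_job_file my_job_file && ./dlc submit tfjob --job_file=my_job_file`, where my_job_file is a placeholder name.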

Submit TensorFlow training jobs

  • Feature description

    Submit TensorFlow training jobs.

  • Syntax

    You can use a command that contains related parameters or use a parameter description file to submit a TensorFlow training job.

    ./dlc submit tfjob [flags]
  • Parameter description

    If you use a command that contains related parameters, include both the parameter keys and their actual values in the command. If you use a parameter description file, specify related parameters in the <parameterName>=<parameterValue> format in the file. The parameters common to all types of training jobs are described in the "Common parameters used to submit training jobs" section of this topic. The following table describes the parameters specific to submitting TensorFlow jobs.

    Table 2. Parameters specific to submitting TensorFlow training jobs

    | Parameter | Required | Description | Type | Supported in the parameter description file |
    | --- | --- | --- | --- | --- |
    | workspace_id | Yes | The ID of the workspace that is used to submit the job. By default, this parameter is left empty. For information about how to create a workspace, see Create a workspace. | STRING | Yes |
    | chief | No | Specifies whether to start the chief node. Default value: false. Valid values: false (does not start the chief node); true (starts the chief node). | BOOL | Yes |
    | chief_image | No | The image of the chief node. By default, this parameter is left empty. | STRING | Yes |
    | chief_spec | No | The node type of the chief node. By default, this parameter is left empty. | STRING | Yes |
    | master_image | No | The image of the master node. By default, this parameter is left empty. | STRING | Yes |
    | master_spec | No | The node type of the master node. | STRING | Yes |
    | masters | No | The number of master nodes. Default value: 0. | INT | Yes |
    | ps | No | The number of parameter servers. Default value: 0. | INT | Yes |
    | ps_image | No | The image of the parameter server. By default, this parameter is left empty. | STRING | Yes |
    | ps_spec | No | The node type of the parameter server. By default, this parameter is left empty. | STRING | Yes |
    | worker_image | No | The image of the worker node. By default, this parameter is left empty. | STRING | Yes |
    | worker_spec | No | The node type of the worker node. By default, this parameter is left empty. | STRING | Yes |
    | workers | No | The number of worker nodes. Default value: 0. | INT | Yes |
    | evaluator_image | No | The image of the evaluator node. By default, this parameter is left empty. | STRING | Yes |
    | evaluator_spec | No | The node type of the evaluator node. By default, this parameter is left empty. | STRING | Yes |
    | evaluators | No | The number of evaluator nodes. Default value: 0. | INT | Yes |
    | graphlearn_image | No | The image of the GraphLearn node. By default, this parameter is left empty. | STRING | Yes |
    | graphlearn_spec | No | The node type of the GraphLearn node. By default, this parameter is left empty. | STRING | Yes |
    | graphlearns | No | The number of GraphLearn nodes. Default value: 0. | INT | Yes |

    Table 3. Parameters specific to submitting TensorFlow training jobs to dedicated resource groups

    | Parameter | Required | Description | Type | Supported in the parameter description file |
    | --- | --- | --- | --- | --- |
    | resource_id | No (required if you want to submit a job to a dedicated resource group) | The ID of the dedicated resource quota. By default, this parameter is left empty. For more information about how to create a dedicated resource quota, see General computing resource quotas. | STRING | Yes |
    | priority | No | The priority of the job. Default value: 1. | INT | Yes |
    | chief_cpu | No | The number of CPU cores used by the chief node. By default, this parameter is left empty. | STRING | Yes |
    | chief_gpu | No | The number of GPU cores used by the chief node. By default, this parameter is left empty. | STRING | Yes |
    | chief_gpu_type | No | The GPU type used by the chief node. By default, this parameter is left empty. Example: GU50. | STRING | Yes |
    | chief_memory | No | The amount of memory used by the chief node. By default, this parameter is left empty. Examples: 500Mi and 1Gi. | STRING | Yes |
    | chief_shared_memory | No | The amount of shared memory used by the chief node. By default, this parameter is left empty. Examples: 500Mi and 1Gi. | STRING | Yes |
    | master_cpu | No | The number of CPU cores used by the master node. By default, this parameter is left empty. | STRING | Yes |
    | master_gpu | No | The number of GPU cores used by the master node. By default, this parameter is left empty. | STRING | Yes |
    | master_gpu_type | No | The GPU type used by the master node. By default, this parameter is left empty. Example: GU50. | STRING | Yes |
    | master_memory | No | The amount of memory used by the master node. By default, this parameter is left empty. Examples: 500Mi and 1Gi. | STRING | Yes |
    | master_shared_memory | No | The amount of shared memory used by the master node. By default, this parameter is left empty. Examples: 500Mi and 1Gi. | STRING | Yes |
    | *_cpu | No | The number of CPU cores used by the type of node that is indicated by the wildcard character (*). The wildcard character (*) can represent a parameter server, worker, evaluator, or GraphLearn. By default, this parameter is left empty. | STRING | Yes |
    | *_gpu | No | The number of GPU cores used by the type of node that is indicated by the wildcard character (*). The wildcard character (*) can represent a parameter server, worker, evaluator, or GraphLearn. By default, this parameter is left empty. | STRING | Yes |
    | *_gpu_type | No | The GPU type used by the type of node that is indicated by the wildcard character (*). The wildcard character (*) can represent a parameter server, worker, evaluator, or GraphLearn. By default, this parameter is left empty. Example: GU50. | STRING | Yes |
    | *_memory | No | The amount of memory used by the type of node that is indicated by the wildcard character (*). The wildcard character (*) can represent a parameter server, worker, evaluator, or GraphLearn. By default, this parameter is left empty. Examples: 500Mi and 1Gi. | STRING | Yes |
    | *_shared_memory | No | The amount of shared memory used by the type of node that is indicated by the wildcard character (*). The wildcard character (*) can represent a parameter server, worker, evaluator, or GraphLearn. By default, this parameter is left empty. Examples: 500Mi and 1Gi. | STRING | Yes |

  • Examples

    • Run a command to submit a job that involves two worker nodes and one parameter server.

      ./dlc submit tfjob --name=test_2021 --ps=1 \
        --ps_spec=ecs.g6.8xlarge \
        --ps_image=registry-vpc.cn-beijing.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-cpu-py27-ubuntu16.04 \
        --workers=2 \
        --worker_spec=ecs.g6.4xlarge \
        --worker_image=registry-vpc.cn-beijing.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-cpu-py27-ubuntu16.04 \
        --command="python /root/data/dist_mnist/code/dist-main.py --max_steps=10000 --data_dir=/root/data/dist_mnist/data/" \
        --workspace_id=***** \
        --data_sources=data-2021xxxxxxxxxx-xxxxxxxxxxxx

      The system displays information similar to the following output:

      +----------------------------------+--------------------------------------+
      |              JobId               |              RequestId               |
      +----------------------------------+--------------------------------------+
      | dlcmp6vwljkz****                 | xxxxxxxx-79AF-4EFC-9CE9-xxxxxxxxxxxx |
      +----------------------------------+--------------------------------------+
    • Use a parameter description file to submit a job that involves two worker nodes and one parameter server.

      ./dlc submit tfjob --job_file=job_file.dist_mnist.1ps2w

      job_file.dist_mnist.1ps2w indicates the parameter description file in which parameters are provided in the <parameterName>=<parameterValue> format. The job_file.dist_mnist.1ps2w file contains the following content:

      name=test_2021
      workers=2
      worker_spec=ecs.g6.4xlarge
      worker_image=registry-vpc.cn-beijing.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-cpu-py27-ubuntu16.04
      ps=1
      ps_spec=ecs.g6.8xlarge
      ps_image=registry-vpc.cn-beijing.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-cpu-py27-ubuntu16.04
      command=python /root/data/dist_mnist/code/dist-main.py --max_steps=10000 --data_dir=/root/data/dist_mnist/data/
      workspace_id=*****
      data_sources=data-2021xxxxxxxxxx-xxxxxxxxxxxx
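
The wildcard rows in Table 3 (such as *_cpu) each stand for several concrete parameter names. The sketch below prints the expanded names. It assumes that the prefixes match the ones used by the ps_spec, worker_spec, evaluator_spec, and graphlearn_spec parameters in Table 2; the function name expand_wildcard_params is hypothetical.

```shell
# Print the concrete parameter names that the wildcard rows stand for.
# Assumption: the node-type prefixes are the same as in Table 2
# (ps, worker, evaluator, graphlearn).
expand_wildcard_params() {
  for role in ps worker evaluator graphlearn; do
    for suffix in cpu gpu gpu_type memory shared_memory; do
      echo "${role}_${suffix}"
    done
  done
}

expand_wildcard_params
```

The chief and master nodes are not part of the expansion because Table 3 lists their parameters explicitly.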

Submit PyTorch training jobs

  • Feature description

    Submit PyTorch training jobs.

  • Syntax

    You can use a command that contains related parameters or use a parameter description file to submit a PyTorch training job.

    ./dlc submit pytorchjob [flags]
  • Parameter description

    If you use a command that contains related parameters, include both the parameter keys and their actual values in the command. If you use a parameter description file, specify related parameters in the <parameterName>=<parameterValue> format in the file. The parameters common to all types of training jobs are described in the "Common parameters used to submit training jobs" section of this topic. The following table describes the parameters specific to submitting PyTorch jobs.

    Table 4. Parameters specific to submitting PyTorch training jobs

    | Parameter | Required | Description | Type | Supported in the parameter description file |
    | --- | --- | --- | --- | --- |
    | workspace_id | Yes | The ID of the workspace that is used to submit the job. By default, this parameter is left empty. For information about how to create a workspace, see Create a workspace. | STRING | Yes |
    | master_image | No | The image of the master node. By default, this parameter is left empty. | STRING | Yes |
    | master_spec | No | The node type of the master node. By default, this parameter is left empty. | STRING | Yes |
    | masters | No | The number of master nodes. Default value: 0. | INT | Yes |
    | worker_image | No | The image of the worker node. By default, this parameter is left empty. | STRING | Yes |
    | worker_spec | No | The node type of the worker node. By default, this parameter is left empty. | STRING | Yes |
    | workers | No | The number of worker nodes. Default value: 0. | INT | Yes |

    Table 5. Parameters specific to submitting PyTorch training jobs to dedicated resource groups

    | Parameter | Required | Description | Type | Supported in the parameter description file |
    | --- | --- | --- | --- | --- |
    | resource_id | No (required if you want to submit a job to a dedicated resource group) | The ID of the dedicated resource quota. By default, this parameter is left empty. For more information about how to create a dedicated resource quota, see General computing resource quotas. | STRING | Yes |
    | priority | No | The priority of the job. Default value: 1. | INT | Yes |
    | master_cpu | No | The number of CPU cores used by the master node. By default, this parameter is left empty. | STRING | Yes |
    | master_gpu | No | The number of GPU cores used by the master node. By default, this parameter is left empty. | STRING | Yes |
    | master_gpu_type | No | The GPU type used by the master node. By default, this parameter is left empty. Example: GU50. | STRING | Yes |
    | master_memory | No | The amount of memory used by the master node. By default, this parameter is left empty. Examples: 500Mi and 1Gi. | STRING | Yes |
    | master_shared_memory | No | The amount of shared memory used by the master node. By default, this parameter is left empty. Examples: 500Mi and 1Gi. | STRING | Yes |
    | worker_cpu | No | The number of CPU cores used by the worker node. By default, this parameter is left empty. | STRING | Yes |
    | worker_gpu | No | The number of GPU cores used by the worker node. By default, this parameter is left empty. | STRING | Yes |
    | worker_gpu_type | No | The GPU type used by the worker node. By default, this parameter is left empty. Example: GU50. | STRING | Yes |
    | worker_memory | No | The amount of memory used by the worker node. By default, this parameter is left empty. Examples: 500Mi and 1Gi. | STRING | Yes |
    | worker_shared_memory | No | The amount of shared memory used by the worker node. By default, this parameter is left empty. Examples: 500Mi and 1Gi. | STRING | Yes |

  • Examples

    Run a command that contains related parameters to submit a GPU model training job.

    ./dlc submit pytorchjob --name=test_pt_face \
      --workers=1 \
      --worker_spec=ecs.gn6e-c12g1.3xlarge \
      --worker_image=registry-vpc.cn-beijing.aliyuncs.com/pai-dlc/pytorch-training:1.7.1-gpu-py37-cu110-ubuntu18.04 \
      --command="apt-get update; apt-get -y --allow-downgrades install libpcre3=2:8.38-3.1 libpcre3-dev libgl1-mesa-glx libglib2.0-dev; cd /root/data/face; python train.py --num_workers 0 --save_folder outputs" \
      --data_sources=data-20210410224621-xxxxxxxxxxxx \
      --workspace_id=*****

    The system displays information similar to the following output:

    +----------------------------------+--------------------------------------+
    |              JobId               |              RequestId               |
    +----------------------------------+--------------------------------------+
    | dlcu704xxuxk****                 | xxxxxxxx-79AF-4EFC-9CE9-xxxxxxxxxxxx |
    +----------------------------------+--------------------------------------+
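
Scripts that submit a job often need the job ID for later steps. The following sketch extracts the JobId from output that has the table layout shown above. It assumes that the JobId is in the first column of the first data row; the function name extract_job_id is hypothetical and is not part of the DLC client.

```shell
# Read the submit output on standard input and print the JobId.
# Assumption: the output is a "|"-delimited table with a header row
# whose first column is "JobId", as in the sample output above.
extract_job_id() {
  awk -F'[|]' '
    /^[ \t]*\|/ {                         # only rows that start with "|"
      id = $2
      gsub(/^[ \t]+|[ \t]+$/, "", id)     # trim the column padding
      if (id != "" && id != "JobId") {    # skip the header row
        print id
        exit
      }
    }'
}
```

For example, the output of a submit command could be piped into the function: `job_id=$(./dlc submit pytorchjob --job_file=my_job_file | extract_job_id)`, where my_job_file is a placeholder name.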

Submit XGBoost training jobs

  • Feature description

    Submit XGBoost training jobs.

  • Syntax

    You can use a command that contains related parameters or use a parameter description file to submit an XGBoost training job.

    ./dlc submit xgboostjob [flags]
  • Parameter description

    If you use a command that contains related parameters, include both the parameter keys and their actual values in the command. If you use a parameter description file, specify related parameters in the <parameterName>=<parameterValue> format in the file. The parameters common to all types of training jobs are described in the "Common parameters used to submit training jobs" section of this topic. The following table describes the parameters specific to submitting XGBoost training jobs.

    Table 6. Parameters specific to submitting XGBoost training jobs

    | Parameter | Required | Description | Type | Supported in the parameter description file |
    | --- | --- | --- | --- | --- |
    | workspace_id | Yes | The ID of the workspace that is used to submit the job. By default, this parameter is left empty. For information about how to create a workspace, see Create a workspace. | STRING | Yes |
    | master_image | No | The image of the master node. By default, this parameter is left empty. | STRING | Yes |
    | master_spec | No | The node type of the master node. By default, this parameter is left empty. | STRING | Yes |
    | masters | No | The number of master nodes. Default value: 0. | INT | Yes |
    | worker_image | No | The image of the worker node. By default, this parameter is left empty. | STRING | Yes |
    | worker_spec | No | The node type of the worker node. By default, this parameter is left empty. | STRING | Yes |
    | workers | No | The number of worker nodes. Default value: 0. | INT | Yes |

    Table 7. Parameters specific to submitting XGBoost training jobs to dedicated resource groups

    | Parameter | Required | Description | Type | Supported in the parameter description file |
    | --- | --- | --- | --- | --- |
    | resource_id | No (required if you want to submit a job to a dedicated resource group) | The ID of the dedicated resource quota. By default, this parameter is left empty. For more information about how to create a dedicated resource quota, see General computing resource quotas. | STRING | Yes |
    | priority | No | The priority of the job. Default value: 1. | INT | Yes |
    | master_cpu | No | The number of CPU cores used by the master node. By default, this parameter is left empty. | STRING | Yes |
    | master_gpu | No | The number of GPU cores used by the master node. By default, this parameter is left empty. | STRING | Yes |
    | master_gpu_type | No | The GPU type used by the master node. By default, this parameter is left empty. Example: GU50. | STRING | Yes |
    | master_memory | No | The amount of memory used by the master node. By default, this parameter is left empty. Examples: 500Mi and 1Gi. | STRING | Yes |
    | master_shared_memory | No | The amount of shared memory used by the master node. By default, this parameter is left empty. Examples: 500Mi and 1Gi. | STRING | Yes |
    | worker_cpu | No | The number of CPU cores used by the worker node. By default, this parameter is left empty. | STRING | Yes |
    | worker_gpu | No | The number of GPU cores used by the worker node. By default, this parameter is left empty. | STRING | Yes |
    | worker_gpu_type | No | The GPU type used by the worker node. By default, this parameter is left empty. Example: GU50. | STRING | Yes |
    | worker_memory | No | The amount of memory used by the worker node. By default, this parameter is left empty. Examples: 500Mi and 1Gi. | STRING | Yes |
    | worker_shared_memory | No | The amount of shared memory used by the worker node. By default, this parameter is left empty. Examples: 500Mi and 1Gi. | STRING | Yes |

  • Examples

    Run a command that contains related parameters to submit an XGBoost training job.

    ./dlc submit xgboostjob --name=test_xgboost \
      --workers=1 \
      --worker_spec=ecs.gn6e-c12g1.3xlarge \
      --worker_image=xgboost-training:1.6.0-cpu-py36-ubuntu18.04 \
      --command="python /root/code/horovod/xgboost/main.py --job_type=Train --xgboost_parameter=objective:multi:softprob,num_class:3 --n_estimators=50 --model_path=autoAI/xgb-opt/2" \
      --workspace_id=*****

    The system displays information similar to the following output:

    +----------------------------------+--------------------------------------+
    |              JobId               |              RequestId               |
    +----------------------------------+--------------------------------------+
    | dlc1nvu3gli0****                 | xxxxxxxx-79AF-4EFC-9CE9-xxxxxxxxxxxx |
    +----------------------------------+--------------------------------------+
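
Because the keys in a parameter description file are the same as the parameter names used on the command line, a command such as the one above can be turned into a file mechanically. The following sketch does this conversion. The function name flags_to_job_file is hypothetical, and the sketch only handles arguments in the --key=value form. It skips job_file and thirdparty_lib_dir because Table 1 marks them as not supported in the parameter description file.

```shell
# Convert "--key=value" arguments into "key=value" lines that can be
# saved as a parameter description file.
flags_to_job_file() {
  for arg in "$@"; do
    case $arg in
      --job_file=*|--thirdparty_lib_dir=*)
        # Table 1 marks these as not supported in the description file.
        echo "skipping parameter not supported in a file: $arg" >&2 ;;
      --*=*)
        printf '%s\n' "${arg#--}" ;;
      *)
        echo "skipping unsupported argument: $arg" >&2 ;;
    esac
  done
}

# Convert the XGBoost example shown above.
flags_to_job_file \
  --name=test_xgboost \
  --workers=1 \
  --worker_spec=ecs.gn6e-c12g1.3xlarge \
  --worker_image=xgboost-training:1.6.0-cpu-py36-ubuntu18.04 \
  --command="python /root/code/horovod/xgboost/main.py --job_type=Train --xgboost_parameter=objective:multi:softprob,num_class:3 --n_estimators=50 --model_path=autoAI/xgb-opt/2" \
  --workspace_id='*****'
```

The printed lines could be redirected to a file and passed to the client with the job_file parameter. The quotation marks around the command value are removed by the shell, which matches the unquoted command line in the TensorFlow parameter description file example.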

Advanced parameters that are used to submit training jobs

Specify nodes when submitting jobs

You can configure parameters to specify nodes when submitting training jobs with Lingjun or general computing resource quotas by using the DLC client.

Note

This feature is available only for users in a whitelist. Contact your account manager to add your account to the whitelist.

  • Parameters

    | Parameter | Description | Example |
    | --- | --- | --- |
    | --allow_nodes="${allow_nodes}" | A list of allowed nodes. Separate multiple nodes with commas (,). We recommend that you do not include spaces between the nodes. | lingjuc47iextvg9-***,lingjuc47iextvg9-*** |
    | --deny_nodes="${deny_nodes}" | A list of denied nodes. Separate multiple nodes with commas (,). We recommend that you do not include spaces between the nodes. | lingjuc47iextvg9-***,lingjuc47iextvg9-*** |

  • Examples

    Command line parameters

    Sample command:

    • No nodes specified

      ./dlc submit pytorchjob --name=assign_node_test_no_node \
          --workers=1 \
          --worker_image=dsw-registry-vpc.****.cr.aliyuncs.com/pai/easyanimate:1.1.5-pytorch2.2.0-gpu-py310-cu118-ubuntu22.04 \
          --command="sleep 1000" \
          --workspace_id='****' \
          --resource_id='quotau2h98mt****' \
          --worker_cpu="1" \
          --worker_memory='2Gi'  
    • Specify allowed nodes

      ./dlc submit pytorchjob --name=assign_node_test_2_allow_nodes \
          --workers=1 \
          --worker_image=dsw-registry-vpc.****.cr.aliyuncs.com/pai/easyanimate:1.1.5-pytorch2.2.0-gpu-py310-cu118-ubuntu22.04 \
          --command="sleep 1000" \
          --workspace_id='****' \
          --resource_id='quotau2h98mt****' \
          --worker_cpu="1" \
          --worker_memory='2Gi' \
          --allow_nodes="lingjuc47iextvg9-****,lingjuc47iextvg9-****" 
    • Specify denied nodes

      ./dlc submit pytorchjob --name=assign_node_test_two_deny_nodes \
          --workers=1 \
          --worker_image=dsw-registry-vpc.****.cr.aliyuncs.com/pai/easyanimate:1.1.5-pytorch2.2.0-gpu-py310-cu118-ubuntu22.04 \
          --command="sleep 1000" \
          --workspace_id='****' \
          --resource_id='quotau2h98mt****' \
          --worker_cpu="1" \
          --worker_memory='2Gi' \
          --deny_nodes="lingjuc47iextvg9-****,lingjuc47iextvg9-****"
    • Specify allowed and denied nodes

      ./dlc submit pytorchjob --name=assign_node_test_two_allow_two_deny \
          --workers=1 \
          --worker_image=dsw-registry-vpc.****.cr.aliyuncs.com/pai/easyanimate:1.1.5-pytorch2.2.0-gpu-py310-cu118-ubuntu22.04 \
          --command="sleep 1000" \
          --workspace_id='****' \
          --resource_id='quotau2h98mt****' \
          --worker_cpu="1" \
          --worker_memory='2Gi' \
          --allow_nodes="lingjuc47iextvg9-****,lingjuc47iextvg9-****" \
          --deny_nodes="lingjuc47iextvg9-****,lingjuc47iextvg9-****"

    Read file

    • Sample command:

      ./dlc submit pytorchjob -f job_file
    • Example of job parameter configuration file, job_file:

      • No nodes specified

        name=assign_node_test_no_node
        workers=1
        worker_image=dsw-registry-vpc.****.cr.aliyuncs.com/pai/easyanimate:1.1.5-pytorch2.2.0-gpu-py310-cu118-ubuntu22.04
        command=sleep 1000
        workspace_id=****
        resource_id=quotau2h98mt****
        worker_cpu=1
        worker_memory=2Gi
        
      • Specify allowed nodes

        name=assign_node_test_2_allow_nodes
        workers=1
        worker_image=dsw-registry-vpc.****.cr.aliyuncs.com/pai/easyanimate:1.1.5-pytorch2.2.0-gpu-py310-cu118-ubuntu22.04
        command=sleep 1000
        workspace_id=****
        resource_id=quotau2h98mt****
        worker_cpu=1
        worker_memory=2Gi
        allow_nodes=lingjuc47iextvg9-****,lingjuc47iextvg9-****
        
      • Specify denied nodes

        name=assign_node_test_two_deny_nodes
        workers=1
        worker_image=dsw-registry-vpc.****.cr.aliyuncs.com/pai/easyanimate:1.1.5-pytorch2.2.0-gpu-py310-cu118-ubuntu22.04
        command=sleep 1000
        workspace_id=****
        resource_id=quotau2h98mt****
        worker_cpu=1
        worker_memory=2Gi
        deny_nodes=lingjuc47iextvg9-****,lingjuc47iextvg9-****
        
      • Specify allowed and denied nodes

        name=assign_node_test_two_allow_two_deny
        workers=1
        worker_image=dsw-registry-vpc.****.cr.aliyuncs.com/pai/easyanimate:1.1.5-pytorch2.2.0-gpu-py310-cu118-ubuntu22.04
        command=sleep 1000
        workspace_id=****
        resource_id=quotau2h98mt****
        worker_cpu=1
        worker_memory=2Gi
        allow_nodes=lingjuc47iextvg9-****,lingjuc47iextvg9-****
        deny_nodes=lingjuc47iextvg9-****,lingjuc47iextvg9-****
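
Because spaces in the node list are not recommended, it can help to build the value programmatically. The following sketch joins node names with commas and removes any spaces. The function name join_nodes and the node names node-a, node-b, and node-c are hypothetical placeholders.

```shell
# Join node names into one comma-separated value and drop any spaces.
join_nodes() {
  printf '%s\n' "$@" | tr -d ' ' | paste -sd, -
}

# Placeholder node names; replace them with real node IDs.
allow_nodes=$(join_nodes "node-a" " node-b" "node-c ")
echo "--allow_nodes=${allow_nodes}"
```

The resulting value can be passed to the allow_nodes or deny_nodes parameter on the command line or written to the parameter description file.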
        

Disable pay-as-you-go inventory check when submitting jobs

You can configure the disable_ecs_stock_check parameter to disable pay-as-you-go inventory check when submitting training jobs by using the DLC client.

  • Parameters

    | Parameter | Description | Example |
    | --- | --- | --- |
    | disable_ecs_stock_check | Specifies whether to disable the pay-as-you-go inventory check. Valid values: false (default; the inventory check is enabled); true (the inventory check is disabled). | true or false |

  • Examples

    Command line parameters

    Sample command:

    • Enable pay-as-you-go inventory check

      ./dlc submit pytorchjob \
          --name=test_skip_checking3 \
          --command='sleep 1000' \
          --workspace_id=**** \
          --priority=1 \
          --workers=1 \
          --worker_image=registry.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.12PAI-gpu-py36-cu101-ubuntu18.04 \
          --worker_spec=ecs.g6.xlarge  
    • Disable pay-as-you-go inventory check

      ./dlc submit pytorchjob \
          --name=test_skip_checking3 \
          --command='sleep 1000' \
          --workspace_id=**** \
          --priority=1 \
          --workers=1 \
          --worker_image=registry.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.12PAI-gpu-py36-cu101-ubuntu18.04 \
          --worker_spec=ecs.g6.xlarge \
          --disable_ecs_stock_check=true
       

    Read file

    Sample command:

    ./dlc submit pytorchjob -f job_file

    Example of job parameter configuration file, job_file:

    • Enable pay-as-you-go inventory check

      name=test_skip_checking3
      workers=1
      worker_image=registry.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.12PAI-gpu-py36-cu101-ubuntu18.04
      command=sleep 1000
      workspace_id=****
      worker_spec=ecs.g6.xlarge
      
    • Disable pay-as-you-go inventory check

      name=test_skip_checking3
      workers=1
      worker_image=registry.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.12PAI-gpu-py36-cu101-ubuntu18.04
      command=sleep 1000
      workspace_id=****
      worker_spec=ecs.g6.xlarge
      disable_ecs_stock_check=true
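
To switch the inventory check on or off in an existing parameter description file without editing it by hand, a small helper can rewrite the key. The following is a sketch under stated assumptions: the function name set_job_param is hypothetical, the file uses the key=value format and ends with a newline, and the value is a simple token such as true or false that contains no "|" or "&" characters.

```shell
# Set key=value in a parameter description file: replace the line if the
# key already exists, otherwise append a new line.
# Assumption: the value contains no "|" or "&" characters.
set_job_param() {
  file=$1
  key=$2
  value=$3
  if grep -q "^${key}=" "$file"; then
    tmp=$(mktemp)
    sed "s|^${key}=.*|${key}=${value}|" "$file" > "$tmp" && mv "$tmp" "$file"
  else
    printf '%s\n' "${key}=${value}" >> "$file"
  fi
}
```

For example, `set_job_param job_file disable_ecs_stock_check true` produces a file like the second example above, and running it again with false restores the default behavior.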
      
