You can use the Deep Learning Containers (DLC) client to submit training jobs of various types. This topic describes the commands used to submit training jobs, including call formats, parameter descriptions, and usage examples.
Common parameters that are used to submit training jobs
The parameters described in the following table are common to all training jobs that you submit by using the DLC client, regardless of whether the jobs are of the TensorFlow, PyTorch, or XGBoost type.
Table 1. Common parameters used to submit training jobs
Parameter | Required | Description | Type | Supported in the parameter description file |
name | Yes | The name of the job. The name does not need to be unique. | STRING | Yes |
command | Yes | The command that is run to start the node. | STRING | Yes |
data_sources | No | The ID of the associated dataset. You can obtain the dataset ID on the Datasets page. For more information, see Create and manage datasets. Separate multiple data sources with commas (,). By default, this parameter is left empty. | STRING | Yes |
code_source | No | The ID of the code set. You can obtain the code set ID on the Source Code Repositories page. For more information, see Code builds. You can specify only a single code source. By default, this parameter is left empty. | STRING | Yes |
code_branch | No | The branch of the code repository. This parameter is used together with the code_source parameter. | STRING | Yes |
code_commit | No | The commit ID of the code repository. This parameter is used together with the code_source parameter. | STRING | Yes |
thirdparty_libs | No | The third-party Python library. Separate multiple libraries with commas (,). By default, this parameter is left empty. | STRING | Yes |
thirdparty_lib_dir | No | The directory that contains the text file named requirements.txt. The file is used to install third-party Python libraries. By default, this parameter is left empty. | STRING | No |
vpc_id | No | The ID of the available virtual private cloud (VPC) for the job. By default, this parameter is left empty. | STRING | Yes |
switch_id | No (required if the vpc_id parameter is configured) | The ID of the available vSwitch for the job in the VPC that is specified by the vpc_id parameter. By default, this parameter is left empty. | STRING | Yes |
security_group_id | No (required if the vpc_id parameter is configured) | The ID of the available security group for the job in the VPC that is specified by the vpc_id parameter. By default, this parameter is left empty. | STRING | Yes |
job_file | No | The parameter description file of the job. If this parameter is specified, the parameters described in the file take precedence. Specify the parameters in the description file in the <parameterName>=<parameterValue> format. | STRING | No |
interactive | No | Specifies whether to start the job in interactive mode. | BOOL | Yes |
job_max_running_time_minutes | No | The maximum uptime of the job. The default value is 0, which indicates that the uptime of the job is unlimited. | INT64 | Yes |
success_policy | No | The success policy of the job. This parameter is supported only for TensorFlow jobs. By default, this parameter is left empty, which is equivalent to AllWorkers. | STRING | Yes |
envs | No | The environment variables for the worker nodes. Separate multiple environment variables with commas (,). Separate the key and value of each environment variable with an equal sign (=). For an example of this format, see the sketch that follows this table. | StringToString | Yes |
tags | No | The tags that you want to add to the job. Separate multiple tags with commas (,). Separate the key and value of each tag with an equal sign (=). | StringToString | Yes |
oversold_type | No | The way in which computing resources for off-peak hours are used for the job. | STRING | Yes |
driver | No | The GPU driver version used for the job. | STRING | Yes |
default_route | No | The method that is used to access the Internet when a virtual private cloud (VPC) is selected. | STRING | Yes |
priority | No | The priority of the job. Valid values: 1 to 9. Default value: 1. | INT32 | Yes |
exit_code_on_stopped | No | The exit code that the CLI returns when a job that runs in interactive mode is stopped. Default value: 0. | INT32 | Yes |
job_reserved_minutes | No | The retention period after the job ends. Unit: minutes. Default value: 0. | INT32 | Yes |
job_reserved_policy | No | The policy that is used to retain the job after it ends. | STRING | Yes |
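For reference, the following sketch shows how several of the preceding common parameters look in a parameter description file. All values are placeholders, and the job-type-specific parameters that are also required, such as workspace_id and the node settings, are described in the following sections.
name=example_job
command=python train.py
data_sources=data-2021xxxxxxxxxx-xxxxxxxxxxxx,data-2022xxxxxxxxxx-xxxxxxxxxxxx
envs=ENV_A=value_a,ENV_B=value_b
tags=team=algo,stage=test
job_max_running_time_minutes=120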
Submit TensorFlow training jobs
Feature description
Submit TensorFlow training jobs.
Syntax
You can use a command that contains related parameters or use a parameter description file to submit a TensorFlow training job.
./dlc submit tfjob [flags]
Parameter description
If you use a command that contains related parameters, include both the parameter keys and their actual values in the command. If you use a parameter description file, specify related parameters in the <parameterName>=<parameterValue> format in the file. The parameters common to all types of training jobs are described in the "Common parameters used to submit training jobs" section of this topic. The following table describes the parameters specific to submitting TensorFlow training jobs.
Table 2. Parameters specific to submitting TensorFlow training jobs
Parameter | Required | Description | Type | Supported in the parameter description file |
workspace_id | Yes | The ID of the workspace that is used to submit the job. By default, this parameter is left empty. For information about how to create a workspace, see Create a workspace. | STRING | Yes |
chief | No | Specifies whether to start the chief node. Default value: false. Valid values: true (starts the chief node) and false (does not start the chief node). | BOOL | Yes |
chief_image | No | The image of the chief node. By default, this parameter is left empty. | STRING | Yes |
chief_spec | No | The node type of the chief node. By default, this parameter is left empty. | STRING | Yes |
master_image | No | The image of the master node. By default, this parameter is left empty. | STRING | Yes |
master_spec | No | The node type of the master node. | STRING | Yes |
masters | No | The number of master nodes. Default value: 0. | INT | Yes |
ps | No | The number of parameter servers. Default value: 0. | INT | Yes |
ps_image | No | The image of the parameter server. By default, this parameter is left empty. | STRING | Yes |
ps_spec | No | The node type of the parameter server. By default, this parameter is left empty. | STRING | Yes |
worker_image | No | The image of the worker node. By default, this parameter is left empty. | STRING | Yes |
worker_spec | No | The node type of the worker node. By default, this parameter is left empty. | STRING | Yes |
workers | No | The number of worker nodes. Default value: 0. | INT | Yes |
evaluator_image | No | The image of the evaluator node. By default, this parameter is left empty. | STRING | Yes |
evaluator_spec | No | The node type of the evaluator node. By default, this parameter is left empty. | STRING | Yes |
evaluators | No | The number of evaluator nodes. Default value: 0. | INT | Yes |
graphlearn_image | No | The image of the GraphLearn node. By default, this parameter is left empty. | STRING | Yes |
graphlearn_spec | No | The node type of the GraphLearn node. By default, this parameter is left empty. | STRING | Yes |
graphlearns | No | The number of GraphLearn nodes. Default value: 0. | INT | Yes |
Table 3. Parameters specific to submitting TensorFlow training jobs to dedicated resource groups
Parameter | Required | Description | Type | Supported in the parameter description file |
resource_id | No (required if you want to submit a job to a dedicated resource group) | The ID of the dedicated resource quota. By default, this parameter is left empty. For more information about how to create a dedicated resource quota, see General computing resource quotas. | STRING | Yes |
priority | No | The priority of the job. Default value: 1. | INT | Yes |
chief_cpu | No | The number of CPU cores used by the chief node. By default, this parameter is left empty. | STRING | Yes |
chief_gpu | No | The number of GPU cores used by the chief node. By default, this parameter is left empty. | STRING | Yes |
chief_gpu_type | No | The GPU type used by the chief node. By default, this parameter is left empty. Example: GU50. | STRING | Yes |
chief_memory | No | The amount of memory used by the chief node. By default, this parameter is left empty. Examples: 500Mi and 1Gi. | STRING | Yes |
chief_shared_memory | No | The amount of memory shared by the chief node. By default, this parameter is left empty. Examples: 500Mi and 1Gi. | STRING | Yes |
master_cpu | No | The number of CPU cores used by the master node. By default, this parameter is left empty. | STRING | Yes |
master_gpu | No | The number of GPU cores used by the master node. By default, this parameter is left empty. | STRING | Yes |
master_gpu_type | No | The GPU type used by the master node. By default, this parameter is left empty. Example: GU50. | STRING | Yes |
master_memory | No | The amount of memory used by the master node. By default, this parameter is left empty. Examples: 500Mi and 1Gi. | STRING | Yes |
master_shared_memory | No | The amount of memory shared by the master node. By default, this parameter is left empty. Examples: 500Mi and 1Gi. | STRING | Yes |
*_cpu | No | The number of CPU cores used by the specified type of node, which is indicated by the wildcard character (*). By default, this parameter is left empty. The wildcard character (*) can represent a parameter server, worker, evaluator, or GraphLearn node. | STRING | Yes |
*_gpu | No | The number of GPU cores used by the specified type of node, which is indicated by the wildcard character (*). By default, this parameter is left empty. The wildcard character (*) can represent a parameter server, worker, evaluator, or GraphLearn node. | STRING | Yes |
*_gpu_type | No | The GPU type used by the specified type of node, which is indicated by the wildcard character (*). By default, this parameter is left empty. Example: GU50. The wildcard character (*) can represent a parameter server, worker, evaluator, or GraphLearn node. | STRING | Yes |
*_memory | No | The amount of memory used by the specified type of node, which is indicated by the wildcard character (*). By default, this parameter is left empty. Examples: 500Mi and 1Gi. The wildcard character (*) can represent a parameter server, worker, evaluator, or GraphLearn node. | STRING | Yes |
*_shared_memory | No | The amount of memory shared by the specified type of node, which is indicated by the wildcard character (*). By default, this parameter is left empty. Examples: 500Mi and 1Gi. The wildcard character (*) can represent a parameter server, worker, evaluator, or GraphLearn node. | STRING | Yes |
Examples
Run a command to submit a job that involves two worker nodes and one parameter server.
./dlc submit tfjob --name=test_2021 --ps=1 \
  --ps_spec=ecs.g6.8xlarge \
  --ps_image=registry-vpc.cn-beijing.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-cpu-py27-ubuntu16.04 \
  --workers=2 \
  --worker_spec=ecs.g6.4xlarge \
  --worker_image=registry-vpc.cn-beijing.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-cpu-py27-ubuntu16.04 \
  --command="python /root/data/dist_mnist/code/dist-main.py --max_steps=10000 --data_dir=/root/data/dist_mnist/data/" \
  --workspace_id=***** \
  --data_sources=data-2021xxxxxxxxxx-xxxxxxxxxxxx
The system displays information similar to the following output:
+----------------------------------+--------------------------------------+
| JobId                            | RequestId                            |
+----------------------------------+--------------------------------------+
| dlcmp6vwljkz****                 | xxxxxxxx-79AF-4EFC-9CE9-xxxxxxxxxxxx |
+----------------------------------+--------------------------------------+
Use a parameter description file to submit a job that involves two worker nodes and one parameter server.
./dlc submit tfjob --job_file=job_file.dist_mnist.1ps2w
job_file.dist_mnist.1ps2w indicates the parameter description file in which parameters are provided in the <parameterName>=<parameterValue> format. The job_file.dist_mnist.1ps2w file contains the following content:
name=test_2021
workers=2
worker_spec=ecs.g6.4xlarge
worker_image=registry-vpc.cn-beijing.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-cpu-py27-ubuntu16.04
ps=1
ps_spec=ecs.g6.8xlarge
ps_image=registry-vpc.cn-beijing.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-cpu-py27-ubuntu16.04
command=python /root/data/dist_mnist/code/dist-main.py --max_steps=10000 --data_dir=/root/data/dist_mnist/data/
workspace_id=*****
data_sources=data-2021xxxxxxxxxx-xxxxxxxxxxxx
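For a submission to a dedicated resource quota, the parameters in Table 3 are used instead of ECS specifications. The following parameter description file is a minimal sketch of this case; the workspace ID, resource quota ID, dataset ID, and per-node CPU and memory values are placeholders.
name=test_2021_dedicated
workers=2
worker_image=registry-vpc.cn-beijing.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-cpu-py27-ubuntu16.04
worker_cpu=4
worker_memory=8Gi
ps=1
ps_image=registry-vpc.cn-beijing.aliyuncs.com/pai-dlc/tensorflow-training:1.12.2PAI-cpu-py27-ubuntu16.04
ps_cpu=8
ps_memory=16Gi
command=python /root/data/dist_mnist/code/dist-main.py --max_steps=10000 --data_dir=/root/data/dist_mnist/data/
workspace_id=*****
resource_id=quota*********
data_sources=data-2021xxxxxxxxxx-xxxxxxxxxxxx
The file is submitted in the same way as the preceding example, for example by running ./dlc submit tfjob --job_file=<file name>.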
Submit PyTorch training jobs
Feature description
Submit PyTorch training jobs.
Syntax
You can use a command that contains related parameters or use a parameter description file to submit a PyTorch training job.
./dlc submit pytorchjob [flags]
Parameter description
If you use a command that contains related parameters, include both the parameter keys and their actual values in the command. If you use a parameter description file, specify related parameters in the <parameterName>=<parameterValue> format in the file. The parameters common to all types of training jobs are described in the "Common parameters used to submit training jobs" section of this topic. The following table describes the parameters specific to submitting PyTorch training jobs.
Table 4. Parameters specific to submitting PyTorch training jobs
Parameter | Required | Description | Type | Supported in the parameter description file |
workspace_id | Yes | The ID of the workspace that is used to submit the job. By default, this parameter is left empty. For information about how to create a workspace, see Create a workspace. | STRING | Yes |
master_image | No | The image of the master node. By default, this parameter is left empty. | STRING | Yes |
master_spec | No | The node type of the master node. By default, this parameter is left empty. | STRING | Yes |
masters | No | The number of master nodes. Default value: 0. | INT | Yes |
worker_image | No | The image of the worker node. By default, this parameter is left empty. | STRING | Yes |
worker_spec | No | The node type of the worker node. By default, this parameter is left empty. | STRING | Yes |
workers | No | The number of worker nodes. Default value: 0. | INT | Yes |
Table 5. Parameters specific to submitting PyTorch training jobs to dedicated resource groups
Parameter | Required | Description | Type | Supported in the parameter description file |
resource_id | No (required if you want to submit a job to a dedicated resource group) | The ID of the dedicated resource quota. By default, this parameter is left empty. For more information about how to create a dedicated resource quota, see General computing resource quotas. | STRING | Yes |
priority | No | The priority of the job. Default value: 1. | INT | Yes |
master_cpu | No | The number of CPU cores used by the master node. By default, this parameter is left empty. | STRING | Yes |
master_gpu | No | The number of GPU cores used by the master node. By default, this parameter is left empty. | STRING | Yes |
master_gpu_type | No | The GPU type used by the master node. By default, this parameter is left empty. Example: GU50. | STRING | Yes |
master_memory | No | The amount of memory used by the master node. By default, this parameter is left empty. Examples: 500Mi and 1Gi. | STRING | Yes |
master_shared_memory | No | The amount of memory shared by the master node. By default, this parameter is left empty. Examples: 500Mi and 1Gi. | STRING | Yes |
worker_cpu | No | The number of CPU cores used by the worker node. By default, this parameter is left empty. | STRING | Yes |
worker_gpu | No | The number of GPU cores used by the worker node. By default, this parameter is left empty. | STRING | Yes |
worker_gpu_type | No | The GPU type used by the worker node. By default, this parameter is left empty. Example: GU50. | STRING | Yes |
worker_memory | No | The amount of memory used by the worker node. By default, this parameter is left empty. Examples: 500Mi and 1Gi. | STRING | Yes |
worker_shared_memory | No | The amount of memory shared by the worker node. By default, this parameter is left empty. Examples: 500Mi and 1Gi. | STRING | Yes |
Examples
Run a command that contains related parameters to submit a GPU model training job.
./dlc submit pytorchjob --name=test_pt_face \
  --workers=1 \
  --worker_spec=ecs.gn6e-c12g1.3xlarge \
  --worker_image=registry-vpc.cn-beijing.aliyuncs.com/pai-dlc/pytorch-training:1.7.1-gpu-py37-cu110-ubuntu18.04 \
  --command="apt-get update; apt-get -y --allow-downgrades install libpcre3=2:8.38-3.1 libpcre3-dev libgl1-mesa-glx libglib2.0-dev; cd /root/data/face; python train.py --num_workers 0 --save_folder outputs" \
  --data_sources=data-20210410224621-xxxxxxxxxxxx \
  --workspace_id=*****
The system displays information similar to the following output:
+----------------------------------+--------------------------------------+
| JobId                            | RequestId                            |
+----------------------------------+--------------------------------------+
| dlcu704xxuxk****                 | xxxxxxxx-79AF-4EFC-9CE9-xxxxxxxxxxxx |
+----------------------------------+--------------------------------------+
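The same job can also be described in a parameter description file and submitted by using a command such as ./dlc submit pytorchjob --job_file=<file name>. The following sketch mirrors the preceding command; the workspace ID and dataset ID are placeholders.
name=test_pt_face
workers=1
worker_spec=ecs.gn6e-c12g1.3xlarge
worker_image=registry-vpc.cn-beijing.aliyuncs.com/pai-dlc/pytorch-training:1.7.1-gpu-py37-cu110-ubuntu18.04
command=apt-get update; apt-get -y --allow-downgrades install libpcre3=2:8.38-3.1 libpcre3-dev libgl1-mesa-glx libglib2.0-dev; cd /root/data/face; python train.py --num_workers 0 --save_folder outputs
data_sources=data-20210410224621-xxxxxxxxxxxx
workspace_id=*****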
Submit XGBoost training jobs
Feature description
Submit XGBoost training jobs.
Syntax
You can use a command that contains related parameters or use a parameter description file to submit an XGBoost training job.
./dlc submit xgboostjob [flags]
Parameter description
If you use a command that contains related parameters, include both the parameter keys and their actual values in the command. If you use a parameter description file, specify related parameters in the <parameterName>=<parameterValue> format in the file. The parameters common to all types of training jobs are described in the "Common parameters used to submit training jobs" section of this topic. The following table describes the parameters specific to submitting XGBoost training jobs.
Table 6. Parameters specific to submitting XGBoost training jobs
Parameter | Required | Description | Type | Supported in the parameter description file |
workspace_id | Yes | The ID of the workspace that is used to submit the job. By default, this parameter is left empty. For information about how to create a workspace, see Create a workspace. | STRING | Yes |
master_image | No | The image of the master node. By default, this parameter is left empty. | STRING | Yes |
master_spec | No | The node type of the master node. By default, this parameter is left empty. | STRING | Yes |
masters | No | The number of master nodes. Default value: 0. | INT | Yes |
worker_image | No | The image of the worker node. By default, this parameter is left empty. | STRING | Yes |
worker_spec | No | The node type of the worker node. By default, this parameter is left empty. | STRING | Yes |
workers | No | The number of worker nodes. Default value: 0. | INT | Yes |
Table 7. Parameters specific to submitting XGBoost training jobs to dedicated resource groups
Parameter | Required | Description | Type | Supported in the parameter description file |
resource_id | No (required if you want to submit a job to a dedicated resource group) | The ID of the dedicated resource quota. By default, this parameter is left empty. For more information about how to create a dedicated resource quota, see General computing resource quotas. | STRING | Yes |
priority | No | The priority of the job. Default value: 1. | INT | Yes |
master_cpu | No | The number of CPU cores used by the master node. By default, this parameter is left empty. | STRING | Yes |
master_gpu | No | The number of GPU cores used by the master node. By default, this parameter is left empty. | STRING | Yes |
master_gpu_type | No | The GPU type used by the master node. By default, this parameter is left empty. Example: GU50. | STRING | Yes |
master_memory | No | The amount of memory used by the master node. By default, this parameter is left empty. Examples: 500Mi and 1Gi. | STRING | Yes |
master_shared_memory | No | The amount of memory shared by the master node. By default, this parameter is left empty. Examples: 500Mi and 1Gi. | STRING | Yes |
worker_cpu | No | The number of CPU cores used by the worker node. By default, this parameter is left empty. | STRING | Yes |
worker_gpu | No | The number of GPU cores used by the worker node. By default, this parameter is left empty. | STRING | Yes |
worker_gpu_type | No | The GPU type used by the worker node. By default, this parameter is left empty. Example: GU50. | STRING | Yes |
worker_memory | No | The amount of memory used by the worker node. By default, this parameter is left empty. Examples: 500Mi and 1Gi. | STRING | Yes |
worker_shared_memory | No | The amount of memory shared by the worker node. By default, this parameter is left empty. Examples: 500Mi and 1Gi. | STRING | Yes |
Examples
Run a command that contains related parameters to submit an XGBoost training job.
./dlc submit xgboostjob --name=test_xgboost \
  --workers=1 \
  --worker_spec=ecs.gn6e-c12g1.3xlarge \
  --worker_image=xgboost-training:1.6.0-cpu-py36-ubuntu18.04 \
  --command="python /root/code/horovod/xgboost/main.py --job_type=Train --xgboost_parameter=objective:multi:softprob,num_class:3 --n_estimators=50 --model_path=autoAI/xgb-opt/2" \
  --workspace_id=*****
The system displays information similar to the following output:
+----------------------------------+--------------------------------------+
| JobId                            | RequestId                            |
+----------------------------------+--------------------------------------+
| dlc1nvu3gli0****                 | xxxxxxxx-79AF-4EFC-9CE9-xxxxxxxxxxxx |
+----------------------------------+--------------------------------------+
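The same job can also be described in a parameter description file and submitted by using a command such as ./dlc submit xgboostjob --job_file=<file name>. The following sketch mirrors the preceding command; the workspace ID is a placeholder.
name=test_xgboost
workers=1
worker_spec=ecs.gn6e-c12g1.3xlarge
worker_image=xgboost-training:1.6.0-cpu-py36-ubuntu18.04
command=python /root/code/horovod/xgboost/main.py --job_type=Train --xgboost_parameter=objective:multi:softprob,num_class:3 --n_estimators=50 --model_path=autoAI/xgb-opt/2
workspace_id=*****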
Advanced parameters that are used to submit training jobs
Specify nodes when submitting jobs
You can configure parameters to specify nodes when submitting training jobs with Lingjun or general computing resource quotas by using the DLC client.
This feature is available only for users in a whitelist. Contact your account manager to add your account to the whitelist.
Parameters
Parameter | Description | Example |
--allow_nodes="${allow_nodes}" | The list of allowed nodes. Separate multiple nodes with commas (,). We recommend that you do not include spaces between nodes. | lingjuc47iextvg9-***,lingjuc47iextvg9-*** |
--deny_nodes="${deny_nodes}" | The list of denied nodes. Separate multiple nodes with commas (,). We recommend that you do not include spaces between nodes. | lingjuc47iextvg9-***,lingjuc47iextvg9-*** |
Examples
Command line parameters
Sample command:
No nodes specified
./dlc submit pytorchjob --name=assign_node_test_no_node \
  --workers=1 \
  --worker_image=dsw-registry-vpc.****.cr.aliyuncs.com/pai/easyanimate:1.1.5-pytorch2.2.0-gpu-py310-cu118-ubuntu22.04 \
  --command="sleep 1000" \
  --workspace_id='****' \
  --resource_id='quotau2h98mt****' \
  --worker_cpu="1" \
  --worker_memory='2Gi'
Specify allowed nodes
./dlc submit pytorchjob --name=assign_node_test_2_allow_nodes \
  --workers=1 \
  --worker_image=dsw-registry-vpc.****.cr.aliyuncs.com/pai/easyanimate:1.1.5-pytorch2.2.0-gpu-py310-cu118-ubuntu22.04 \
  --command="sleep 1000" \
  --workspace_id='****' \
  --resource_id='quotau2h98mt****' \
  --worker_cpu="1" \
  --worker_memory='2Gi' \
  --allow_nodes="lingjuc47iextvg9-****,lingjuc47iextvg9-****"
Specify denied nodes
./dlc submit pytorchjob --name=assign_node_test_two_deny_nodes \
  --workers=1 \
  --worker_image=dsw-registry-vpc.****.cr.aliyuncs.com/pai/easyanimate:1.1.5-pytorch2.2.0-gpu-py310-cu118-ubuntu22.04 \
  --command="sleep 1000" \
  --workspace_id='****' \
  --resource_id='quotau2h98mt****' \
  --worker_cpu="1" \
  --worker_memory='2Gi' \
  --deny_nodes="lingjuc47iextvg9-****,lingjuc47iextvg9-****"
Specify allowed and denied nodes
./dlc submit pytorchjob --name=assign_node_test_two_allow_two_deny \
  --workers=1 \
  --worker_image=dsw-registry-vpc.****.cr.aliyuncs.com/pai/easyanimate:1.1.5-pytorch2.2.0-gpu-py310-cu118-ubuntu22.04 \
  --command="sleep 1000" \
  --workspace_id='****' \
  --resource_id='quotau2h98mt****' \
  --worker_cpu="1" \
  --worker_memory='2Gi' \
  --allow_nodes="lingjuc47iextvg9-****,lingjuc47iextvg9-****" \
  --deny_nodes="lingjuc47iextvg9-****,lingjuc47iextvg9-****"
Read file
Sample command:
./dlc submit pytorchjob -f job_file
Example of job parameter configuration file, job_file:
No nodes specified
name=assign_node_test_no_node
workers=1
worker_image=dsw-registry-vpc.****.cr.aliyuncs.com/pai/easyanimate:1.1.5-pytorch2.2.0-gpu-py310-cu118-ubuntu22.04
command=sleep 1000
workspace_id=****
resource_id=quotau2h98mt****
worker_cpu=1
worker_memory=2Gi
Specify allowed nodes
name=assign_node_test_2_allow_nodes
workers=1
worker_image=dsw-registry-vpc.****.cr.aliyuncs.com/pai/easyanimate:1.1.5-pytorch2.2.0-gpu-py310-cu118-ubuntu22.04
command=sleep 1000
workspace_id=****
resource_id=quotau2h98mt****
worker_cpu=1
worker_memory=2Gi
allow_nodes=lingjuc47iextvg9-****,lingjuc47iextvg9-****
Specify denied nodes
name=assign_node_test_two_deny_nodes
workers=1
worker_image=dsw-registry-vpc.****.cr.aliyuncs.com/pai/easyanimate:1.1.5-pytorch2.2.0-gpu-py310-cu118-ubuntu22.04
command=sleep 1000
workspace_id=****
resource_id=quotau2h98mt****
worker_cpu=1
worker_memory=2Gi
deny_nodes=lingjuc47iextvg9-****,lingjuc47iextvg9-****
Specify allowed and denied nodes
name=assign_node_test_two_allow_two_deny
workers=1
worker_image=dsw-registry-vpc.****.cr.aliyuncs.com/pai/easyanimate:1.1.5-pytorch2.2.0-gpu-py310-cu118-ubuntu22.04
command=sleep 1000
workspace_id=****
resource_id=quotau2h98mt****
worker_cpu=1
worker_memory=2Gi
allow_nodes=lingjuc47iextvg9-****,lingjuc47iextvg9-****
deny_nodes=lingjuc47iextvg9-****,lingjuc47iextvg9-****
Disable pay-as-you-go inventory check when submitting jobs
You can configure the disable_ecs_stock_check parameter to disable the pay-as-you-go inventory check when you submit training jobs by using the DLC client.
Parameters
Parameter | Description | Example |
disable_ecs_stock_check | Specifies whether to disable the pay-as-you-go inventory check. Valid values: false (default): enables the pay-as-you-go inventory check. true: disables the pay-as-you-go inventory check. | true or false |
Examples
Command line parameters
Sample command:
Enable pay-as-you-go inventory check
./dlc submit pytorchjob \
  --name=test_skip_checking3 \
  --command='sleep 1000' \
  --workspace_id=**** \
  --priority=1 \
  --workers=1 \
  --worker_image=registry.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.12PAI-gpu-py36-cu101-ubuntu18.04 \
  --worker_spec=ecs.g6.xlarge
Disable pay-as-you-go inventory check
./dlc submit pytorchjob \
  --name=test_skip_checking3 \
  --command='sleep 1000' \
  --workspace_id=**** \
  --priority=1 \
  --workers=1 \
  --worker_image=registry.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.12PAI-gpu-py36-cu101-ubuntu18.04 \
  --worker_spec=ecs.g6.xlarge \
  --disable_ecs_stock_check=true
Read file
Sample command:
./dlc submit pytorchjob -f job_file
Example of job parameter configuration file, job_file:
Enable pay-as-you-go inventory check
name=test_skip_checking3
workers=1
worker_image=registry.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.12PAI-gpu-py36-cu101-ubuntu18.04
command=sleep 1000
workspace_id=****
worker_spec=ecs.g6.xlarge
Disable pay-as-you-go inventory check
name=test_skip_checking3
workers=1
worker_image=registry.cn-hangzhou.aliyuncs.com/pai-dlc/tensorflow-training:1.12PAI-gpu-py36-cu101-ubuntu18.04
command=sleep 1000
workspace_id=****
worker_spec=ecs.g6.xlarge
disable_ecs_stock_check=true
References
After you submit a job, you can use the DLC client to manage the job. For more information, see Command used to stop training jobs and Commands used to query logs or jobs.
You can also manage submitted jobs in the PAI console. For more information, see Manage training jobs.