Platform for AI (PAI) provides the PAI-TensorFlow deep learning computing framework that supports training based on multiple models. This topic describes the command parameters and I/O parameters that are used to run PAI-TensorFlow tasks.
GPU-accelerated servers will be phased out. You can submit TensorFlow tasks that run on CPU servers. If you want to use GPU-accelerated instances for model training, go to Deep Learning Containers (DLC) to submit jobs. For more information, see Submit training jobs.
Commands and parameters
To initiate a PAI-TensorFlow task, you can run PAI commands on the MaxCompute client, an SQL node in the DataWorks console, or the Machine Learning Designer page in the PAI console. You can also use TensorFlow components provided by Machine Learning Designer. This section describes the PAI commands and parameters.
# Specify actual values for the parameters.
pai -name tensorflow1120_ext
-project algo_public
-Dscript='oss://<bucket_name>.<oss_host>.aliyuncs.com/*.tar.gz'
-DentryFile='entry_file.py'
-Dbuckets='oss://<bucket_name>.<oss_host>.aliyuncs.com/<path>'
-Dtables='odps://prj_name/tables/table_name'
-Doutputs='odps://prj_name/tables/table_name'
-DcheckpointDir='oss://<bucket_name>.<oss_host>.aliyuncs.com/<path>'
-Dcluster="{\"ps\":{\"count\":1},\"worker\":{\"count\":2,\"gpu\":100}}"
-Darn="acs:ram::******:role/aliyunodpspaidefaultrole"
-DossHost="oss-cn-beijing-internal.aliyuncs.com"
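Because the -D parameters are plain key-value pairs, a submission command can also be assembled programmatically. The following is a minimal sketch in Python; build_pai_command is a hypothetical helper written for illustration, not part of any PAI SDK:

```python
# Hypothetical helper that assembles a PAI submission command from a
# dict of -D parameters. Illustrative only; not part of the PAI toolchain.
def build_pai_command(name, project, params):
    parts = ["pai", "-name", name, "-project", project]
    for key, value in params.items():
        parts.append("-D{}='{}'".format(key, value))
    return " ".join(parts)

cmd = build_pai_command(
    "tensorflow1120_ext",
    "algo_public",
    {
        "script": "oss://<bucket_name>.<oss_host>.aliyuncs.com/demo.tar.gz",
        "entryFile": "entry_file.py",
        "tables": "odps://prj_name/tables/table_name",
    },
)
print(cmd)
```

Values that contain quotation marks, such as the cluster specification, need additional shell escaping before they can be passed this way.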
The following table describes the parameters in the preceding syntax.
The name and project parameters have fixed values and cannot be changed.
| Parameter | Description | Example | Default value | Required |
| --- | --- | --- | --- | --- |
| script | The TensorFlow algorithm script that is used to run the PAI-TensorFlow task, written in Python. The script can be a single Python file or a TAR package, such as a `.tar.gz` file. If the script is stored in Object Storage Service (OSS), specify it in the `oss://<bucket_name>.<oss_host>.aliyuncs.com/<path>` format. | `oss://<bucket_name>.<oss_host>.aliyuncs.com/*.tar.gz` | None | Yes |
| entryFile | The entry script. If the value of the script parameter is a TAR package, you must configure this parameter. If the value of the script parameter is a single file, you do not need to configure this parameter. | `entry_file.py` | None | Yes, if the script parameter specifies a TAR package |
| buckets | The input OSS buckets. Separate multiple buckets with commas (,). Each bucket path must end with a forward slash (/). | `oss://<bucket_name>.<oss_host>.aliyuncs.com/<path>` | None | No |
| tables | The input tables. Separate multiple tables with commas (,). | `odps://prj_name/tables/table_name` | None | No |
| outputs | The output tables. Separate multiple tables with commas (,). | `odps://prj_name/tables/table_name` | None | No |
| gpuRequired | The number of GPUs required by the server that runs the training script. A value of 100 specifies one GPU, and a value of 200 specifies two GPUs. Set this parameter to 0 if no GPUs are required. This parameter takes effect only for standalone training; for distributed training, use the cluster parameter. This parameter is available only for TensorFlow1120. | 100 | 100 | No |
| checkpointDir | The TensorFlow checkpoint directory. | `oss://<bucket_name>.<oss_host>.aliyuncs.com/<path>` | None | No |
| cluster | The configuration of the distributed servers on which you want to run the PAI-TensorFlow task. For more information, see the next table in this topic. | `{\"ps\":{\"count\":1},\"worker\":{\"count\":2,\"gpu\":100}}` | None | No |
| enableDynamicCluster | Specifies whether to enable failover for individual worker nodes. If you set this parameter to true, a worker node restarts after a failure, which allows the PAI-TensorFlow task to continue running despite worker node issues. | true | false | No |
| jobName | The name of the experiment, which allows you to search for historical data and analyze the performance of the experiment. Set this parameter to a descriptive string rather than a generic placeholder. | jk_wdl_online_job | None | Yes |
| maxHungTimeBeforeGCInSeconds | The maximum duration, in seconds, for which a GPU can remain suspended before it is automatically reclaimed. A value of 0 disables automatic reclamation. | 3600 | 3600 | No |
| ossHost | The endpoint of OSS. For more information, see Regions and endpoints. | oss-cn-beijing-internal.aliyuncs.com | None | No |
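At run time, values such as tables, outputs, buckets, and checkpointDir are handed to the entry script. The sketch below assumes they arrive as ordinary command-line flags, which is how typical PAI-TensorFlow entry scripts receive them; verify the exact mechanism for your platform version. The sample values are placeholders:

```python
import argparse

# Minimal entry-script argument handling. The flag names mirror the
# PAI parameters described above; the sample values are placeholders.
parser = argparse.ArgumentParser()
parser.add_argument("--tables", default="", help="comma-separated input table paths")
parser.add_argument("--outputs", default="", help="comma-separated output table paths")
parser.add_argument("--buckets", default="", help="input OSS buckets")
parser.add_argument("--checkpointDir", default="", help="checkpoint directory")

# Parse a sample argument list instead of sys.argv for illustration.
args = parser.parse_args([
    "--tables", "odps://prj_name/tables/table_name",
    "--checkpointDir", "oss://<bucket_name>.<oss_host>.aliyuncs.com/<path>",
])
input_tables = args.tables.split(",")  # multiple tables are comma-separated
```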
You can run a PAI-TensorFlow task in distributed mode by using the cluster parameter to specify the numbers of parameter servers (PSs) and workers. The value of the cluster parameter must be valid JSON, and the quotation marks must be escaped when the value is passed on the command line. Example:
{
"ps": {
"count": 2
},
"worker": {
"count": 4
}
}
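Because the cluster value is passed on a command line (as in the -Dcluster example earlier), its quotation marks must be escaped. The following sketch uses Python's json module to generate the escaped form instead of writing it by hand:

```python
import json

# Build the cluster spec as a dict, serialize it compactly, then
# escape the quotation marks for use inside -Dcluster="...".
cluster = {"ps": {"count": 2}, "worker": {"count": 4}}
compact = json.dumps(cluster, separators=(",", ":"))
escaped = compact.replace('"', '\\"')
print('-Dcluster="{}"'.format(escaped))
```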
The JSON value consists of two keys: ps and worker. The following table describes the parameters that are nested under each key.
| Parameter | Description | Default value | Required |
| --- | --- | --- | --- |
| count | The number of PSs or workers. | None | Yes |
| gpu | The number of GPUs for each PS or worker. A value of 100 specifies one GPU. If you set the gpu parameter under worker to 0, CPU clusters are scheduled for the PAI-TensorFlow task and no GPU resources are consumed. | 0 under ps, 100 under worker | No |
| cpu | The number of CPU cores for each PS or worker. A value of 100 specifies one CPU core. | 600 | No |
| memory | The memory size for each PS or worker, in MB. A value of 100 specifies 100 MB of memory. | 30000 | No |
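The resource units in the table above (gpu and cpu in hundredths of a device or core, memory in MB) can be converted into readable figures. A hypothetical helper, using the worker-style defaults from the table for cpu and memory:

```python
# Illustrative conversion of the cluster resource units described above.
# gpu/cpu: a value of 100 means one GPU / one CPU core; memory is in MB.
# The function and its defaults are for illustration, not a PAI API.
def describe_role(count, gpu=0, cpu=600, memory=30000):
    return {
        "instances": count,
        "gpus_per_instance": gpu / 100,
        "cores_per_instance": cpu / 100,
        "memory_mb_per_instance": memory,
    }

workers = describe_role(count=4, gpu=100)  # 4 workers, one GPU each
```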
I/O parameters
The following table describes the I/O parameters that are used to run PAI-TensorFlow tasks.
| Parameter | Description |
| --- | --- |
| tables | The path of the table from which you want to read data. |
| outputs | The path of the table to which you want to write data. Separate multiple paths with commas (,). |
| buckets | The OSS bucket that stores the objects that you want the algorithm to read. I/O operations on OSS objects are different from those on MaxCompute data, so reading OSS objects requires the role_arn and host parameters. To obtain the value of the role_arn parameter, log on to the PAI console and go to the Dependent Services page. In the Designer section, find OSS and click View authorization in the Actions column. For more information, see Grant the permissions that are required to use Machine Learning Designer. |
| checkpointDir | The OSS path to which checkpoint data is written. |
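As the examples in this topic show, table paths use the odps:// scheme and OSS paths use the oss:// scheme. The following hypothetical helper splits a comma-separated I/O value and classifies each path; it is for illustration only:

```python
# Illustration only: classify I/O paths by their scheme.
def classify_path(path):
    if path.startswith("odps://"):
        return "maxcompute-table"
    if path.startswith("oss://"):
        return "oss-object"
    raise ValueError("unsupported path: " + path)

value = "odps://prj_name/tables/table_name,oss://<bucket_name>.<oss_host>.aliyuncs.com/<path>"
kinds = [classify_path(p) for p in value.split(",")]
```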