Platform for AI (PAI) provides the PAI-TensorFlow deep learning computing framework that supports training based on multiple models. This topic describes the command parameters and I/O parameters that are used to run PAI-TensorFlow tasks.
GPU-accelerated servers will be phased out. You can submit TensorFlow tasks that run on CPU servers. If you want to use GPU-accelerated instances for model training, go to Deep Learning Containers (DLC) to submit jobs. For more information, see Submit training jobs.
Commands and parameters
To initiate a PAI-TensorFlow task, you can run PAI commands on the MaxCompute client, an SQL node in the DataWorks console, or the Machine Learning Designer page in the PAI console. You can also use TensorFlow components provided by Machine Learning Designer. This section describes the PAI commands and parameters.
# Specify actual values for the parameters.
pai -name tensorflow1120_ext
-project algo_public
-Dscript='oss://<bucket_name>.<oss_host>.aliyuncs.com/*.tar.gz'
-DentryFile='entry_file.py'
-Dbuckets='oss://<bucket_name>.<oss_host>.aliyuncs.com/<path>'
-Dtables='odps://prj_name/tables/table_name'
-Doutputs='odps://prj_name/tables/table_name'
-DcheckpointDir='oss://<bucket_name>.<oss_host>.aliyuncs.com/<path>'
-Dcluster="{\"ps\":{\"count\":1},\"worker\":{\"count\":2,\"gpu\":100}}"
-Darn="acs:ram::******:role/aliyunodpspaidefaultrole"
-DossHost="oss-cn-beijing-internal.aliyuncs.com"
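Because the -D parameters are plain key-value pairs, a submission command can also be assembled programmatically. The following is a minimal sketch in Python; build_pai_command is a hypothetical helper written for illustration, not part of any PAI SDK:

```python
# Hypothetical helper that assembles a PAI submission command from a
# dict of -D parameters. Illustrative only; not part of the PAI toolchain.
def build_pai_command(name, project, params):
    parts = ["pai", "-name", name, "-project", project]
    for key, value in params.items():
        parts.append("-D{}='{}'".format(key, value))
    return " ".join(parts)

cmd = build_pai_command(
    "tensorflow1120_ext",
    "algo_public",
    {
        "script": "oss://<bucket_name>.<oss_host>.aliyuncs.com/demo.tar.gz",
        "entryFile": "entry_file.py",
        "tables": "odps://prj_name/tables/table_name",
    },
)
print(cmd)
```

Values that contain quotation marks, such as the cluster specification, need additional shell escaping before they can be passed this way.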
The following table describes the parameters in the preceding syntax.
The name and project parameters have fixed values and cannot be changed.
| Parameter | Description | Example | Default value | Required |
| --- | --- | --- | --- | --- |
| script | The TensorFlow algorithm script that is used to run the PAI-TensorFlow task, written in Python. The script can be a single Python file or a TAR package, such as a `.tar.gz` file. If the script is stored in Object Storage Service (OSS), specify it in the `oss://<bucket_name>.<oss_host>.aliyuncs.com/<path>` format. | `oss://<bucket_name>.<oss_host>.aliyuncs.com/*.tar.gz` | None | Yes |
| entryFile | The entry script. If the value of the script parameter is a TAR package, you must configure this parameter. If the value of the script parameter is a single file, you do not need to configure this parameter. | `entry_file.py` | None | Yes, if the script parameter specifies a TAR package |
| buckets | The input OSS buckets. Separate multiple buckets with commas (,). Each bucket path must end with a forward slash (/). | `oss://<bucket_name>.<oss_host>.aliyuncs.com/<path>` | None | No |
| tables | The input tables. Separate multiple tables with commas (,). | `odps://prj_name/tables/table_name` | None | No |
| outputs | The output tables. Separate multiple tables with commas (,). | `odps://prj_name/tables/table_name` | None | No |
| gpuRequired | The number of GPUs required by the server that runs the training script. A value of 100 specifies one GPU, and a value of 200 specifies two GPUs. Set this parameter to 0 if no GPUs are required. This parameter takes effect only for standalone training; for distributed training, use the cluster parameter. This parameter is available only for TensorFlow1120. | 100 | 100 | No |
| checkpointDir | The TensorFlow checkpoint directory. | `oss://<bucket_name>.<oss_host>.aliyuncs.com/<path>` | None | No |
| cluster | The configuration of the distributed servers on which you want to run the PAI-TensorFlow task. For more information, see the next table in this topic. | `{\"ps\":{\"count\":1},\"worker\":{\"count\":2,\"gpu\":100}}` | None | No |
| enableDynamicCluster | Specifies whether to enable failover for individual worker nodes. If you set this parameter to true, a worker node restarts after a failure, which allows the PAI-TensorFlow task to continue running despite worker node issues. | true | false | No |
| jobName | The name of the experiment, which allows you to search for historical data and analyze the performance of the experiment. Set this parameter to a descriptive string rather than a generic placeholder. | jk_wdl_online_job | None | Yes |
| maxHungTimeBeforeGCInSeconds | The maximum duration, in seconds, for which a GPU can remain suspended before it is automatically reclaimed. A value of 0 disables automatic reclamation. | 3600 | 3600 | No |
| ossHost | The endpoint of OSS. For more information, see Regions and endpoints. | oss-cn-beijing-internal.aliyuncs.com | None | No |
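At run time, values such as tables, outputs, buckets, and checkpointDir are handed to the entry script. The sketch below assumes they arrive as ordinary command-line flags, which is how typical PAI-TensorFlow entry scripts receive them; verify the exact mechanism for your platform version. The sample values are placeholders:

```python
import argparse

# Minimal entry-script argument handling. The flag names mirror the
# PAI parameters described above; the sample values are placeholders.
parser = argparse.ArgumentParser()
parser.add_argument("--tables", default="", help="comma-separated input table paths")
parser.add_argument("--outputs", default="", help="comma-separated output table paths")
parser.add_argument("--buckets", default="", help="input OSS buckets")
parser.add_argument("--checkpointDir", default="", help="checkpoint directory")

# Parse a sample argument list instead of sys.argv for illustration.
args = parser.parse_args([
    "--tables", "odps://prj_name/tables/table_name",
    "--checkpointDir", "oss://<bucket_name>.<oss_host>.aliyuncs.com/<path>",
])
input_tables = args.tables.split(",")  # multiple tables are comma-separated
```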
You can run a PAI-TensorFlow task in distributed mode by using the cluster parameter to specify the numbers of parameter servers (PSs) and workers. The value of the cluster parameter must be valid JSON, and the quotation marks must be escaped when the value is passed on the command line. Example:
{
"ps": {
"count": 2
},
"worker": {
"count": 4
}
}
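Because the cluster value is passed on a command line (as in the -Dcluster example earlier), its quotation marks must be escaped. The following sketch uses Python's json module to generate the escaped form instead of writing it by hand:

```python
import json

# Build the cluster spec as a dict, serialize it compactly, then
# escape the quotation marks for use inside -Dcluster="...".
cluster = {"ps": {"count": 2}, "worker": {"count": 4}}
compact = json.dumps(cluster, separators=(",", ":"))
escaped = compact.replace('"', '\\"')
print('-Dcluster="{}"'.format(escaped))
```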
The JSON value consists of two keys: ps and worker. The following table describes the parameters that are nested under each key.
| Parameter | Description | Default value | Required |
| --- | --- | --- | --- |
| count | The number of PSs or workers. | None | Yes |
| gpu | The number of GPUs for each PS or worker. A value of 100 specifies one GPU. If you set the gpu parameter under worker to 0, CPU clusters are scheduled for the PAI-TensorFlow task and no GPU resources are consumed. | 0 under ps, 100 under worker | No |
| cpu | The number of CPU cores for each PS or worker. A value of 100 specifies one CPU core. | 600 | No |
| memory | The memory size for each PS or worker, in MB. A value of 100 specifies 100 MB of memory. | 30000 | No |
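The resource units in the table above (gpu and cpu in hundredths of a device or core, memory in MB) can be converted into readable figures. A hypothetical helper, using the worker-style defaults from the table for cpu and memory:

```python
# Illustrative conversion of the cluster resource units described above.
# gpu/cpu: a value of 100 means one GPU / one CPU core; memory is in MB.
# The function and its defaults are for illustration, not a PAI API.
def describe_role(count, gpu=0, cpu=600, memory=30000):
    return {
        "instances": count,
        "gpus_per_instance": gpu / 100,
        "cores_per_instance": cpu / 100,
        "memory_mb_per_instance": memory,
    }

workers = describe_role(count=4, gpu=100)  # 4 workers, one GPU each
```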
I/O parameters
The following table describes the I/O parameters that are used to run PAI-TensorFlow tasks.
| Parameter | Description |
| --- | --- |
| tables | The path of the table from which you want to read data. |
| outputs | The path of the table to which you want to write data. Separate multiple paths with commas (,). |
| buckets | The OSS bucket that stores the objects that you want the algorithm to read. I/O operations on OSS objects are different from those on MaxCompute data, so reading OSS objects requires the role_arn and host parameters. To obtain the value of the role_arn parameter, log on to the PAI console and go to the Dependent Services page. In the Designer section, find OSS and click View authorization in the Actions column. For more information, see Grant the permissions that are required to use Machine Learning Designer. |
| checkpointDir | The OSS path to which checkpoint data is written. |
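As the examples in this topic show, table paths use the odps:// scheme and OSS paths use the oss:// scheme. The following hypothetical helper splits a comma-separated I/O value and classifies each path; it is for illustration only:

```python
# Illustration only: classify I/O paths by their scheme.
def classify_path(path):
    if path.startswith("odps://"):
        return "maxcompute-table"
    if path.startswith("oss://"):
        return "oss-object"
    raise ValueError("unsupported path: " + path)

value = "odps://prj_name/tables/table_name,oss://<bucket_name>.<oss_host>.aliyuncs.com/<path>"
kinds = [classify_path(p) for p in value.split(",")]
```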