Platform for AI: Parameters of PAI-TensorFlow tasks

Last Updated: Aug 14, 2024

Platform for AI (PAI) provides the PAI-TensorFlow deep learning computing framework that supports training based on multiple models. This topic describes the command parameters and I/O parameters that are used to run PAI-TensorFlow tasks.

Warning

GPU-accelerated servers will be phased out. You can submit TensorFlow tasks that run on CPU servers. If you want to use GPU-accelerated instances for model training, go to Deep Learning Containers (DLC) to submit jobs. For more information, see Submit training jobs.

Commands and parameters

To initiate a PAI-TensorFlow task, you can run PAI commands on the MaxCompute client, on an SQL node in the DataWorks console, or on the Machine Learning Designer page in the PAI console. You can also use the TensorFlow components provided by Machine Learning Designer. This section describes the PAI commands and their parameters.

# Specify actual values for the parameters. 
pai -name tensorflow1120_ext
    -project algo_public
    -Dscript='oss://<bucket_name>.<oss_host>.aliyuncs.com/*.tar.gz'
    -DentryFile='entry_file.py'
    -Dbuckets='oss://<bucket_name>.<oss_host>.aliyuncs.com/<path>'
    -Dtables='odps://prj_name/tables/table_name'
    -Doutputs='odps://prj_name/tables/table_name'
    -DcheckpointDir='oss://<bucket_name>.<oss_host>.aliyuncs.com/<path>'
    -Dcluster="{\"ps\":{\"count\":1},\"worker\":{\"count\":2,\"gpu\":100}}"
    -Darn="acs:ram::******:role/aliyunodpspaidefaultrole"
    -DossHost="oss-cn-beijing-internal.aliyuncs.com"

The following table describes the parameters in the preceding syntax.

Important

The name and project parameters have fixed values and cannot be changed.

| Parameter | Description | Example | Default value | Required |
| --- | --- | --- | --- | --- |
| script | The TensorFlow algorithm script that is used to run the PAI-TensorFlow task. You can specify the script in the file:///path/to/file format (an absolute path) or in the project_name/resources/resource_name format. The script is a TensorFlow model file written in Python and can be an on-premises file, an on-premises TAR package that is compressed by using gzip and has the .tar.gz file name extension, or a Python file. If the file is stored in Object Storage Service (OSS), specify it in the oss://<bucket_name>.<oss_host>.aliyuncs.com/<path>/*.tar.gz or oss://<bucket_name>.<oss_host>.aliyuncs.com/<path>/*.py format. | oss://demo-yuze.oss-cn-beijing-internal.aliyuncs.com/deepfm/deepfm.tar.gz | None | Yes |
| entryFile | The entry script. If the script that you specify for the script parameter is a TAR package, you must configure this parameter. | main.py | If the script that you specify for the script parameter is a single file, you do not need to set this parameter. | Yes |
| buckets | The input bucket. Separate multiple buckets with commas (,). Each bucket path must end with a forward slash (/). | oss://<bucket_name>.<oss_host>.aliyuncs.com/ | None | No |
| tables | The input table. Separate multiple tables with commas (,). | odps://<prj_name>/tables/<table_name> | None | No |
| outputs | The output table. Separate multiple tables with commas (,). | odps://<prj_name>/tables/<table_name> | None | No |
| gpuRequired | Specifies whether the server that runs the training script specified by the script parameter requires GPUs. A value of 100 specifies one GPU, and a value of 200 specifies two GPUs. If you do not require GPUs, set this parameter to 0. This parameter takes effect only for standalone training. For multi-server training, use the cluster parameter. This feature is available only for TensorFlow1120. | 100 | 100 | No |
| checkpointDir | The TensorFlow checkpoint directory. | oss://<bucket_name>.<oss_host>.aliyuncs.com/<path>/ | None | No |
| cluster | The information about the distributed servers on which you want to run the PAI-TensorFlow task. For more information, see the following table. | {\"ps\":{\"count\":1},\"worker\":{\"count\":2,\"gpu\":100}} | None | No |
| enableDynamicCluster | Specifies whether to enable the failover feature for a single worker node. If you set this parameter to true, a worker node restarts after a failure occurs on the node, so that the PAI-TensorFlow task can continue to run even if a worker node fails. | true or false | false | No |
| jobName | The name of the experiment. You must specify a descriptive name instead of a value such as test, so that you can search for historical data and analyze the performance of the experiment. | jk_wdl_online_job | None | Yes |
| maxHungTimeBeforeGCInSeconds | The maximum duration, in seconds, for which a GPU can remain suspended before it is automatically reclaimed. If you set this parameter to 0, the automatic reclamation feature is disabled. This is a new parameter. | 3600 | 3600 | No |
| ossHost | The endpoint of OSS. For more information, see Regions and endpoints. | oss-cn-beijing-internal.aliyuncs.com | None | No |
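
Inside the entry script, parameters such as buckets, tables, outputs, and checkpointDir are typically consumed as command-line flags. The following minimal sketch uses the TensorFlow 1.x tf.app.flags pattern; it assumes that PAI forwards these parameters to the entry script as flags with the same names, which you should verify against your PAI-TensorFlow version:

# Minimal entry-script sketch (for example, entry_file.py).
# Assumption: PAI forwards -Dbuckets, -Dtables, -Doutputs, and
# -DcheckpointDir to the script as command-line flags of the same names.
import tensorflow as tf

tf.app.flags.DEFINE_string("buckets", "", "Input OSS bucket path")
tf.app.flags.DEFINE_string("tables", "", "Input MaxCompute table paths")
tf.app.flags.DEFINE_string("outputs", "", "Output MaxCompute table paths")
tf.app.flags.DEFINE_string("checkpointDir", "", "OSS checkpoint directory")
FLAGS = tf.app.flags.FLAGS

def main(_):
    print("Reading from:", FLAGS.tables)
    print("Writing checkpoints to:", FLAGS.checkpointDir)
    # Build and train the model here.

if __name__ == "__main__":
    tf.app.run()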

You can run a PAI-TensorFlow task in distributed mode by using the cluster parameter to specify the numbers of parameter servers (PSs) and workers. The value of the cluster parameter must be a JSON string, and the quotation marks in the string must be escaped when you pass the value on the command line. Example:

{
  "ps": {
    "count": 2
  },
  "worker": {
    "count": 4
  }
}

The JSON value consists of two keys: ps and worker. The following table describes the parameters that are nested under each key.

| Parameter | Description | Default value | Required |
| --- | --- | --- | --- |
| count | The number of PSs or workers. | None | Yes |
| gpu | The number of GPUs for PSs or workers. A value of 100 specifies one GPU. If you set the gpu parameter under worker to 0, CPU clusters are scheduled for the PAI-TensorFlow task and no GPU resources are consumed. | 0 under ps, 100 under worker | No |
| cpu | The number of CPU cores for PSs or workers. A value of 100 specifies one CPU core. | 600 | No |
| memory | The memory size for PSs or workers. A value of 100 specifies 100 MB of memory. | 30000 | No |
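
In distributed mode, each process needs to know its role (ps or worker), its index within that role, and the addresses of the other processes. The following sketch shows the standard TensorFlow 1.x between-graph replication pattern; the flag names job_name, task_index, ps_hosts, and worker_hosts are assumptions based on that convention and should be verified against the flags that your PAI-TensorFlow version actually passes:

# Distributed entry-script sketch based on the standard TensorFlow 1.x
# between-graph pattern. The flag names are assumptions; verify them
# against the flags that your PAI-TensorFlow version passes.
import tensorflow as tf

tf.app.flags.DEFINE_string("job_name", "", "Role of this process: ps or worker")
tf.app.flags.DEFINE_integer("task_index", 0, "Index of this task within its role")
tf.app.flags.DEFINE_string("ps_hosts", "", "Comma-separated PS host:port list")
tf.app.flags.DEFINE_string("worker_hosts", "", "Comma-separated worker host:port list")
FLAGS = tf.app.flags.FLAGS

def main(_):
    cluster = tf.train.ClusterSpec({
        "ps": FLAGS.ps_hosts.split(","),
        "worker": FLAGS.worker_hosts.split(","),
    })
    server = tf.train.Server(cluster, job_name=FLAGS.job_name,
                             task_index=FLAGS.task_index)
    if FLAGS.job_name == "ps":
        server.join()  # PS processes only host variables.
        return
    # Workers place variables on PSs and the compute graph on themselves.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % FLAGS.task_index,
            cluster=cluster)):
        pass  # Build the model graph here.

if __name__ == "__main__":
    tf.app.run()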

I/O parameters

The following table describes the I/O parameters that are used to run PAI-TensorFlow tasks.

| Parameter | Description |
| --- | --- |
| tables | The path of the table from which you want to read data. |
| outputs | The path of the table to which you want to write data. Separate multiple paths with commas (,). For a non-partitioned table, specify the path in the odps://<prj_name>/tables/<table_name> format. For a partitioned table, specify the path in the odps://<prj_name>/tables/<table_name>/<pt_key1=v1> format. For a multi-level partitioned table, specify the path in the odps://<prj_name>/tables/<table_name>/<pt_key1=v1>/<pt_key2=v2> format. |
| buckets | The OSS bucket that stores the objects that you want the algorithm to read. I/O operations on MaxCompute data are different from those on OSS objects. To read OSS objects, you must configure the role_arn and host parameters. To obtain the value of the role_arn parameter, log on to the PAI console, go to the Dependent Services page, find OSS in the Designer section, and click View authorization in the Actions column. For more information, see Grant the permissions that are required to use Machine Learning Designer. |
| checkpointDir | The OSS bucket to which you want to write checkpoint data. |
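
The following sketch shows how an entry script might read rows from the table that is passed through the tables parameter. It assumes the PAI-TensorFlow extension tf.TableRecordReader, which is specific to the PAI fork of TensorFlow 1.x; verify the API name and signature against your PAI-TensorFlow version:

# Sketch of reading an input MaxCompute table in the entry script.
# Assumption: the PAI fork of TensorFlow 1.x provides tf.TableRecordReader
# for odps:// table paths; verify against your PAI-TensorFlow version.
import tensorflow as tf

tf.app.flags.DEFINE_string("tables", "", "Input table path (odps://...)")
FLAGS = tf.app.flags.FLAGS

def main(_):
    queue = tf.train.string_input_producer([FLAGS.tables], num_epochs=1)
    reader = tf.TableRecordReader()  # assumed PAI-specific reader
    _, value = reader.read(queue)
    # Example: decode two columns, a float feature and an integer label.
    feature, label = tf.decode_csv(value, record_defaults=[[0.0], [0]])

    with tf.train.MonitoredTrainingSession() as sess:
        while not sess.should_stop():
            print(sess.run([feature, label]))

if __name__ == "__main__":
    tf.app.run()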