When you submit a training job in Deep Learning Containers (DLC) of Platform for AI (PAI), the system automatically injects several general environment variables that you can reference in your code. This topic describes the environment variables that DLC provides.
Common environment variables
For information about the environment variables that are used for Lingjun AI Computing Service (Lingjun), see the "Configure high-performance network variables" section in the RDMA: high-performance networks for distributed training topic.
PyTorch environment variables
In distributed PyTorch training jobs, the master and worker nodes play different roles and must connect to each other before training can start. DLC injects environment variables that carry the information required to establish this connection, such as the address and port of the master node. The following table describes the general environment variables for PyTorch training jobs in DLC. A minimal usage sketch follows the table.
| Environment variable | Description |
| --- | --- |
| MASTER_ADDR | The service address of the master node, that is, the hostname at which the master node can be reached. |
| MASTER_PORT | The port of the master node. Example: 23456. |
| WORLD_SIZE | The total number of nodes in the distributed training job. For example, if you submit a job that contains one master node and one worker node, WORLD_SIZE is set to 2. |
| RANK | The index of the node. For example, if you submit a job that contains one master node and two worker nodes, RANK is set to 0 on the master node, 1 on worker node 0, and 2 on worker node 1. |
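The following sketch shows how a training script can consume these variables. It is a minimal example that assumes one process per node, because DLC sets RANK and WORLD_SIZE per node. With `init_method="env://"`, PyTorch reads MASTER_ADDR and MASTER_PORT from the environment automatically.

```python
import os

import torch
import torch.distributed as dist

# DLC injects MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK into every node.
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])

# init_method="env://" makes PyTorch read MASTER_ADDR and MASTER_PORT
# from the environment to rendezvous with the master node.
dist.init_process_group(
    backend="nccl" if torch.cuda.is_available() else "gloo",
    init_method="env://",
    rank=rank,
    world_size=world_size,
)

print(
    f"Node {rank}/{world_size} connected to "
    f"{os.environ['MASTER_ADDR']}:{os.environ['MASTER_PORT']}"
)

dist.destroy_process_group()
```

If you run multiple GPU processes per node, you would typically launch the script with torchrun, which derives its own per-process RANK and WORLD_SIZE values instead of using the per-node values directly.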
TensorFlow environment variables
Distributed TensorFlow training jobs use the TF_CONFIG environment variable to describe the cluster topology: which nodes belong to the cluster and which role the current node plays. The following table describes the general environment variables for TensorFlow training jobs in DLC. A sample value and a parsing sketch follow the table.
| Environment variable | Description |
| --- | --- |
| TF_CONFIG | The cluster topology of the TensorFlow training job, encoded as a JSON string. The value lists the addresses of all nodes by role and identifies the role and index of the current node. See the sketch after the table. |
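The following sketch parses TF_CONFIG. The value shown in the comment is illustrative only; the actual hostnames, roles, and ports depend on your job configuration.

```python
import json
import os

# TF_CONFIG is a JSON string. An illustrative value (hostnames are
# hypothetical) might look like:
# {
#   "cluster": {
#     "chief": ["dlc-job-chief-0:2222"],
#     "worker": ["dlc-job-worker-0:2222", "dlc-job-worker-1:2222"],
#     "ps": ["dlc-job-ps-0:2222"]
#   },
#   "task": {"type": "worker", "index": 0}
# }
tf_config = json.loads(os.environ["TF_CONFIG"])

cluster_spec = tf_config["cluster"]  # maps each role to a list of host:port addresses
task = tf_config["task"]             # the role and index of the current node

total_nodes = sum(len(hosts) for hosts in cluster_spec.values())
print(f"This node is {task['type']}-{task['index']} of {total_nodes} nodes")
```

Note that tf.distribute strategies such as MultiWorkerMirroredStrategy read TF_CONFIG automatically, so most training code only needs the variable to be present and does not have to parse it by hand.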