When you submit a training job in Deep Learning Containers (DLC) of Platform for AI (PAI), the system automatically injects several general environment variables that you can reference in your code. This topic describes the environment variables that DLC provides.
Common environment variables
For information about the environment variables that are used for Lingjun AI Computing Service (Lingjun), see the "Configure high-performance network variables" section in the RDMA: high-performance networks for distributed training topic.
PyTorch environment variables
In distributed PyTorch training jobs, the master and worker nodes play different roles and must connect to each other before training can start. DLC injects environment variables that carry the information required to establish this connection, such as the address and port of the master node. The following table describes the general environment variables for PyTorch training jobs in DLC. A minimal usage sketch follows the table.
| Environment variable | Description |
| --- | --- |
| MASTER_ADDR | The service address of the master node, that is, the hostname at which the master node can be reached. |
| MASTER_PORT | The port of the master node. Example: 23456. |
| WORLD_SIZE | The total number of nodes in the distributed training job. For example, if you submit a job that contains one master node and one worker node, WORLD_SIZE is set to 2. |
| RANK | The index of the node. For example, if you submit a job that contains one master node and two worker nodes, RANK is set to 0 on the master node, 1 on worker node 0, and 2 on worker node 1. |
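The following sketch shows how a training script can consume these variables. It is a minimal example that assumes one process per node, because DLC sets RANK and WORLD_SIZE per node. With `init_method="env://"`, PyTorch reads MASTER_ADDR and MASTER_PORT from the environment automatically.

```python
import os

import torch
import torch.distributed as dist

# DLC injects MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK into every node.
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])

# init_method="env://" makes PyTorch read MASTER_ADDR and MASTER_PORT
# from the environment to rendezvous with the master node.
dist.init_process_group(
    backend="nccl" if torch.cuda.is_available() else "gloo",
    init_method="env://",
    rank=rank,
    world_size=world_size,
)

print(
    f"Node {rank}/{world_size} connected to "
    f"{os.environ['MASTER_ADDR']}:{os.environ['MASTER_PORT']}"
)

dist.destroy_process_group()
```

If you run multiple GPU processes per node, you would typically launch the script with torchrun, which derives its own per-process RANK and WORLD_SIZE values instead of using the per-node values directly.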
TensorFlow environment variables
Distributed TensorFlow training jobs use the TF_CONFIG environment variable to describe the cluster topology: which nodes belong to the cluster and which role the current node plays. The following table describes the general environment variables for TensorFlow training jobs in DLC. A sample value and a parsing sketch follow the table.
| Environment variable | Description |
| --- | --- |
| TF_CONFIG | The cluster topology of the TensorFlow training job, encoded as a JSON string. The value lists the addresses of all nodes by role and identifies the role and index of the current node. See the sketch after the table. |
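The following sketch parses TF_CONFIG. The value shown in the comment is illustrative only; the actual hostnames, roles, and ports depend on your job configuration.

```python
import json
import os

# TF_CONFIG is a JSON string. An illustrative value (hostnames are
# hypothetical) might look like:
# {
#   "cluster": {
#     "chief": ["dlc-job-chief-0:2222"],
#     "worker": ["dlc-job-worker-0:2222", "dlc-job-worker-1:2222"],
#     "ps": ["dlc-job-ps-0:2222"]
#   },
#   "task": {"type": "worker", "index": 0}
# }
tf_config = json.loads(os.environ["TF_CONFIG"])

cluster_spec = tf_config["cluster"]  # maps each role to a list of host:port addresses
task = tf_config["task"]             # the role and index of the current node

total_nodes = sum(len(hosts) for hosts in cluster_spec.values())
print(f"This node is {task['type']}-{task['index']} of {total_nodes} nodes")
```

Note that tf.distribute strategies such as MultiWorkerMirroredStrategy read TF_CONFIG automatically, so most training code only needs the variable to be present and does not have to parse it by hand.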