Distributed deep learning training across multiple nodes requires complex parallelism orchestration, inter-node communication, and synchronization logic. Easy Parallel Library (EPL) simplifies this by wrapping your existing TensorFlow code with annotations that handle data parallelism, tensor model parallelism, and pipeline parallelism automatically. Use EPL in Deep Learning Containers (DLC) on Platform for AI (PAI) to scale model training with minimal code changes.
How EPL works
EPL provides a unified interface for multiple parallelism strategies. Instead of rewriting your training scripts for distributed execution, add EPL annotations to your existing TensorFlow code. EPL then manages communication and synchronization across nodes.
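To illustrate the annotation style, the following sketch shows how a single-GPU TensorFlow 1.15 training script might be annotated for data parallelism. The `epl.init`, `epl.set_default_strategy`, and `epl.replicate` calls follow the examples in the EasyParallelLibrary repository; the toy model itself is illustrative, so verify the exact API against the EPL documentation. This fragment requires the TF 1.15 + EPL image and is not standalone.

```python
# Sketch of EPL data-parallel annotation (assumed from the
# EasyParallelLibrary examples; verify against the EPL documentation).
import epl
import tensorflow as tf

# Initialize EPL before building the graph.
epl.init()
# Replicate the whole model on each device: data parallelism.
epl.set_default_strategy(epl.replicate(device_count=1))

# The model and training op below are unchanged single-node TF 1.15 code.
images = tf.placeholder(tf.float32, [None, 224, 224, 3])
labels = tf.placeholder(tf.int64, [None])
logits = tf.layers.dense(tf.layers.flatten(images), 1000)
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels,
                                                   logits=logits))
train_op = tf.train.MomentumOptimizer(0.01, 0.9).minimize(loss)
```

EPL handles gradient aggregation and cross-node synchronization for the replicated graph; the training loop itself does not change.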
For API details and parallelism strategy options, see EPL documentation.
Prerequisites
Before you begin, make sure that you have:
Authorized the service-linked role for DLC. For details, see Cloud service dependencies and authorization: DLC.
An image running NVIDIA TensorFlow 1.15 or TensorFlow-GPU 1.15
Image selection
EPL availability depends on your image type:
| Image type | EPL installation | Details |
|---|---|---|
| Official image (PAI-optimized) | Pre-installed, ready to use | See Official images |
| Community image (standard) | Manual installation required | See Community images |
For DLC, use the community image tensorflow-training:1.15-gpu-py36-cu100-ubuntu18.04. Install EPL by running the commands in the startup script shown in Create a training job. For other environments, see Install EPL.
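As a reference, a manual installation inside the community image might look like the following. This is a sketch based on the install steps in the EasyParallelLibrary repository; it assumes `apt` and network access inside the container, and it is environment-specific rather than generally runnable.

```shell
# Manual EPL installation in a TensorFlow 1.15 GPU community image
# (sketch based on the EasyParallelLibrary repository; paths may differ).
apt update && apt install -y libnccl2 libnccl-dev  # NCCL for inter-GPU communication
git clone https://github.com/alibaba/EasyParallelLibrary.git
cd EasyParallelLibrary
pip install .
```

In DLC, these steps typically go into the job's startup command so that each node installs EPL before training starts.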
Set up a code build
A code build links a Git repository to DLC so that each training job automatically clones the latest code. This example uses the EPL repository, which includes a ResNet50 sample.
Log on to the PAI console.
In the left-side navigation pane, click Workspaces, and then click the name of your workspace.
In the left-side navigation pane, choose AI Asset Management > Source Code Repositories.
On the Code Configuration page, click Create Code Build.
Configure the following parameters and click Submit. For other parameters, see Code configuration.
| Parameter | Value |
|---|---|
| Git Repository Address | https://github.com/alibaba/EasyParallelLibrary.git |
| Code Branch | main |
Create a training job
Log on to the PAI console, select a region, select a workspace, and click Enter Deep Learning Containers (DLC).
On the Distributed Training (DLC) page, click Create Job.
In the Basic Information section, enter a job name.
In the Environment Information section, configure the following parameters.

| Parameter | Value |
|---|---|
| Node Image | Select Community Image > tensorflow-training:1.15-gpu-py36-cu100-ubuntu18.04 |
| Code Configuration | From the Online Configuration drop-down list, select the code build created in Set up a code build. Set Branch to main. |
| Startup Command | See the startup script below. |

The startup script installs NCCL dependencies, builds EPL from source, and launches a data-parallel ResNet50 training run:

```shell
apt update
apt install -y libnccl2 libnccl-dev
cd /root/code/EasyParallelLibrary/
pip install .
cd examples/resnet
bash scripts/train_dp.sh
```

In the Resource Information section, configure the following parameters.
| Parameter | Value |
|---|---|
| Resource Source | Select Public Resources |
| Framework | Select TensorFlow |

In the Job Resource Configuration section, configure the following parameters.
| Parameter | Value |
|---|---|
| Number Of Nodes | 2 (adjust based on your training requirements) |
| Node Configuration | On the GPU Instance tab, select ecs.gn6v-c8g1.2xlarge |
| Maximum Running Time | 2 (hours) |

Click OK to submit the job.
Verify the training job
After you submit the job, verify that it runs successfully:
On the Distributed Training Jobs page, click the job name to open the job details page.
Check the job status and wait until it changes to Succeeded.
Review the training logs to confirm that the model is training across both nodes.
For more information about job monitoring, see View training details.