
Platform for AI: Accelerate distributed training with EPL

Last Updated: Feb 28, 2026

Distributed deep learning training across multiple nodes requires complex parallelism orchestration, inter-node communication, and synchronization logic. Easy Parallel Library (EPL) simplifies this by wrapping your existing TensorFlow code with annotations that handle data parallelism, tensor model parallelism, and pipeline parallelism automatically. Use EPL in Deep Learning Containers (DLC) on Platform for AI (PAI) to scale model training with minimal code changes.

How EPL works

EPL provides a unified interface for multiple parallelism strategies. Instead of rewriting your training scripts for distributed execution, add EPL annotations to your existing TensorFlow code. EPL then manages communication and synchronization across nodes.
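For intuition, the core of data parallelism is averaging per-replica gradients before each weight update, a synchronization step that EPL's runtime performs for you. The following is a minimal, framework-free Python sketch of that step, not EPL's actual implementation:

```python
# Conceptual sketch of the gradient synchronization that EPL automates
# in data parallelism (not EPL's actual code): each replica computes
# gradients on its own data shard, and the gradients are averaged
# element-wise across replicas before the weight update.
def allreduce_mean(per_replica_grads):
    """Element-wise average of one gradient vector per replica."""
    n = len(per_replica_grads)
    return [sum(g) / n for g in zip(*per_replica_grads)]

# Two replicas, each holding gradients for two parameters.
grads = [
    [0.25, -0.5],   # replica 0
    [0.75, 0.0],    # replica 1
]
print(allreduce_mean(grads))  # [0.5, -0.25]
```

In a real multi-node job this averaging runs as an NCCL allreduce over the network, which is why the startup script later in this topic installs the NCCL libraries.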

For API details and parallelism strategy options, see EPL documentation.

Prerequisites

Before you begin, make sure that your environment meets the following requirements.

Image selection

EPL availability depends on your image type:

Image type                     | EPL installation             | Details
Official image (PAI-optimized) | Pre-installed, ready to use  | See Official images
Community image (standard)     | Manual installation required | See Community images
Note

For DLC, use the community image tensorflow-training:1.15-gpu-py36-cu100-ubuntu18.04. Install EPL by running the commands in the startup script shown in Create a training job. For other environments, see Install EPL.

Set up a code build

A code build links a Git repository to DLC so that each training job automatically clones the latest code. This example uses the EPL repository, which includes a ResNet50 sample.

  1. Log on to the PAI console.

  2. In the left-side navigation pane, click Workspaces. Click the name of your workspace.

  3. In the left-side navigation pane, choose AI Asset Management > Source Code Repositories.

  4. On the Code Configuration page, click Create Code Build.

  5. Configure the following parameters and click Submit. For other parameters, see Code configuration.

    Parameter              | Value
    Git Repository Address | https://github.com/alibaba/EasyParallelLibrary.git
    Code Branch            | main

Create a training job

  1. Log on to the PAI console, select a region, select a workspace, and click Enter Deep Learning Containers (DLC).

  2. On the Distributed Training (DLC) page, click Create Job.

  3. In the Basic Information section, enter a job name.

  4. In the Environment Information section, configure the following parameters. The startup script below installs NCCL dependencies, builds EPL from source, and launches a data-parallel ResNet50 training run.

    Parameter          | Value
    Node Image         | Select Community Image > tensorflow-training:1.15-gpu-py36-cu100-ubuntu18.04
    Code Configuration | From the Online Configuration drop-down list, select the code build created in Set up a code build. Set Branch to main.
    Startup Command    | See the startup script below.
       # Install the NCCL libraries used for multi-GPU communication
       apt update
       apt install -y libnccl2 libnccl-dev
       # Build and install EPL from the cloned repository
       cd /root/code/EasyParallelLibrary/
       pip install .
       # Launch the data-parallel ResNet50 example
       cd examples/resnet
       bash scripts/train_dp.sh
  5. In the Resource Information section, configure the following parameters.

    Parameter       | Value
    Resource Source | Select Public Resources
    Framework       | Select TensorFlow
  6. In the Job Resource Configuration section, configure the following parameters.

    Parameter            | Value
    Number Of Nodes      | 2 (adjust based on your training requirements)
    Node Configuration   | On the GPU Instance tab, select ecs.gn6v-c8g1.2xlarge
    Maximum Running Time | 2 hours
  7. Click OK to submit the job.

Verify the training job

After you submit the job, verify that it runs successfully:

  1. On the Distributed Training Jobs page, click the job name to open the job details page.

  2. Check the job status and wait for it to show Succeeded.

  3. Review the training logs to confirm that the model is training across both nodes.

For more information about job monitoring, see View training details.
