
Platform For AI: Distributed deep learning framework Whale

Last Updated: Nov 29, 2024

Easy Parallel Library (EPL) is an efficient and easy-to-use framework for distributed model training. EPL adopts multiple training optimization technologies and provides easy-to-use API operations for applying parallelism strategies. You can use EPL to reduce costs and improve the efficiency of distributed model training. This topic describes how to use EPL to accelerate TensorFlow distributed model training in Deep Learning Containers (DLC).

Prerequisites

Before you perform the operations described in this topic, make sure that the following requirements are met:

  • The required service-linked role is created for DLC. For more information, see Grant the permissions that are required to use DLC.

  • An official image or one of the following community images is prepared: NVIDIA TensorFlow 1.15 or TensorFlow-GPU 1.15.

    • If you use the official image, you can use EPL without the need to install it. For more information about official images, see Alibaba Cloud image.

    • If you use a community image, you must first install EPL. For more information about community images, see Community image. For more information about how to install EPL, see Install EPL.

    Note

    If you use DLC, we recommend that you select the community image tensorflow-training:1.15-gpu-py36-cu100-ubuntu18.04. You can run commands to install EPL in DLC.
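
    For example, assuming that the DLC job can access GitHub, you can install EPL from source with commands similar to the following. This is a sketch that mirrors the job command used in Step 2; see Install EPL for the authoritative steps.

        # Install the NCCL communication libraries that EPL requires,
        # then build and install EPL from source.
        apt update && apt install -y libnccl2 libnccl-dev
        git clone https://github.com/alibaba/EasyParallelLibrary.git
        cd EasyParallelLibrary
        pip install .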

Step 1: Configure a code build

You can use EPL to write code for TensorFlow-based distributed model training. For more information, see Quick Start.
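
The following is a minimal sketch of how EPL annotations are applied to a TensorFlow 1.15 graph. It is modeled on the data-parallelism example in the EPL documentation; the model definition here is a toy placeholder, and you should verify the exact API against the Quick Start.

    import tensorflow as tf
    import epl

    # Initialize EPL before building the TensorFlow graph.
    epl.init()

    # Replicate everything defined in this scope across the available GPUs
    # (data parallelism); each model replica occupies device_count GPUs.
    with epl.replicate(device_count=1):
        features = tf.placeholder(tf.float32, shape=[None, 224, 224, 3])
        labels = tf.placeholder(tf.int64, shape=[None])
        logits = tf.layers.dense(tf.layers.flatten(features), units=1000)
        loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
        optimizer = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
        train_op = optimizer.minimize(loss)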

You can also use the sample code provided by EPL to start a TensorFlow distributed model training. In this example, the ResNet-50 training sample code is used to create a code build. You can use the code build to submit a TensorFlow training job. Each time a training job runs, the latest version of the code is automatically cloned. To configure a code build, perform the following steps.

  1. Go to the code builds page.

    1. Log on to the PAI console.

    2. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.

    3. In the left-side navigation pane, choose AI Computing Asset Management > Source Code Repositories to go to the code builds page.

  2. On the code builds page, click Create Code Build.

  3. In the Create Code Build panel, configure the parameters and click Submit.

    Set the Repository parameter to https://github.com/alibaba/EasyParallelLibrary.git and the Code Branch parameter to main. For more information about other parameters, see Code builds.

Step 2: Start a training job

  1. Go to the Create Job page.

    1. Log on to the PAI console. Select a region and a workspace. Then, click Enter Deep Learning Containers (DLC).

    2. On the Deep Learning Containers (DLC) page, click Create Job.

  2. On the Create Job page, configure the parameters in the Basic Information and Resource Configuration sections, and then click Submit. For more information about other parameters, see Submit training jobs.

    • The following list describes the parameters in the Basic Information section.

      • Resource Quota: Public resource group.

      • Job Name: Specify a name for the training job.

      • Node Image: Click Community Image and select tensorflow-training:1.15-gpu-py36-cu100-ubuntu18.04 from the image list.

      • Framework: TensorFlow.

      • Code Builds: Click Online Configuration and select the code build that you configured in Step 1 from the drop-down list.

      • Code Branch: main.

      • Job Command:

        # Install the NCCL communication libraries that EPL requires.
        apt update
        apt install -y libnccl2 libnccl-dev
        # Install EPL from the cloned code build.
        cd /root/code/EasyParallelLibrary/
        pip install .
        # Start the data-parallel ResNet-50 training example.
        cd examples/resnet
        bash scripts/train_dp.sh
    • The following list describes the parameters in the Resource Configuration section.

      • Number of Nodes: Set the value to 2. You can change the value based on the requirements of the training job.

      • Node Configuration: On the GPU Instance tab, select ecs.gn6v-c8g1.2xlarge.

      • Maximum Duration: 2 hours.

  3. On the Distributed Training Jobs page, click the name of the job that you want to manage to go to the job details page, where you can view the running status of the job. For more information, see View training jobs.
