AIACC-Training allows you to run distributed training tasks by using models that are built based on mainstream AI computing frameworks, such as PyTorch, TensorFlow, MXNet, and Caffe. AIACC-Training is compatible with the APIs of PyTorch DistributedDataParallel (DDP) and Horovod. You can use AIACC-Training to directly improve the performance of code that is written in these distributed training frameworks. This topic describes how to install AIACC-Training 1.5.0.
Prerequisites
An Alibaba Cloud GPU-accelerated instance that meets the following requirements is created:
The OS of the instance is Alibaba Cloud Linux, CentOS 7.x or later, or Ubuntu 16.04 or later.
An NVIDIA driver and CUDA 10.0 or later are installed on the instance.
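The driver and CUDA prerequisites can be checked from a shell before installation. The following is a minimal sketch, not part of AIACC-Training. The helper name cuda_version_from_nvcc is hypothetical, and it assumes that the output of nvcc --version contains the usual "release X.Y" text.

```shell
# Hypothetical helper: read `nvcc --version` output on stdin and print the
# CUDA release (for example 11.0). Assumes the output contains "release X.Y".
cuda_version_from_nvcc() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Report whether the NVIDIA driver utility is on PATH.
if command -v nvidia-smi >/dev/null 2>&1; then
    echo "NVIDIA driver utility found: $(command -v nvidia-smi)"
else
    echo "nvidia-smi not found: check that the NVIDIA driver is installed"
fi

# Report the CUDA version if the CUDA compiler is on PATH.
if command -v nvcc >/dev/null 2>&1; then
    echo "CUDA version: $(nvcc --version | cuda_version_from_nvcc)"
else
    echo "nvcc not found: check that CUDA 10.0 or later is installed and on PATH"
fi
```

The reported CUDA version determines which rows of the supported frameworks table apply to the instance.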
Background information
You can use one of the following methods to install AIACC-Training based on your business scenarios. In this example, AIACC-Training 1.5.0 is installed.
Scenario | Installation method |
If an AI training environment for deep learning is deployed, you can install AIACC-Training in automatic or manual mode. | Method 1: Install AIACC-Training in an existing AI software environment |
If you want to use a Conda environment, you can run a single command to create a Conda environment that contains AIACC-Training. | Method 2: Install a Conda environment that contains AIACC-Training |
If you want to use a Docker environment, you can install a Docker image that is configured with AIACC-Training. | Method 3: Install a Docker image that is configured with AIACC-Training |
If you select AIACC-Training when you create a GPU-accelerated instance in the Elastic Compute Service (ECS) console, AIACC-Training 1.3.3 is automatically installed when the GPU-accelerated instance is created. We recommend that you install AIACC-Training 1.5.0 by using one of the methods provided in the preceding table.
Supported frameworks
Alibaba Cloud provides AIACC-Training software packages for different versions of deep learning frameworks. The following table lists the versions of the frameworks that are supported by AIACC-Training.
CUDA version | Framework type | Framework version |
10.0 | PyTorch | 1.2.0 and 1.3.0 |
10.0 | TensorFlow | 1.14.0, 1.15.0, and 2.0.0 |
10.0 | MXNet | 1.4.1, 1.5.0, and 1.7.0 |
10.1 | PyTorch | 1.4.0, 1.5.1, and 1.6.0 |
10.1 | TensorFlow | 2.1.0, 2.2.0, and 2.3.0 |
10.1 | MXNet | 1.4.1, 1.5.0, 1.6.0, 1.7.0, and 1.9.0 |
10.2 | PyTorch | 1.5.1, 1.6.0, 1.8.0, 1.8.2, 1.9.0, and 1.10.0 |
10.2 | MXNet | 1.9.0 |
11.0 | PyTorch | 1.7.0 and 1.7.1 |
11.0 | TensorFlow | 2.4.0 |
11.0 | MXNet | 1.9.0 |
TensorFlow and MXNet support only Python 3.6.
PyTorch supports Python 3.6 to Python 3.9. For a specific PyTorch version, the Python versions listed on the PyTorch download page take precedence.
If the version of the framework that you use is not contained in the preceding table, submit a ticket to obtain assistance.
Method 1: Install AIACC-Training in an existing AI software environment
If an AI training environment for deep learning is deployed, you can install AIACC-Training in automatic or manual mode. Before you install AIACC-Training, make sure that your environment meets the following requirements:
Python 3 and pip are installed.
A deep learning framework such as PyTorch, TensorFlow, or MXNet is installed.
If you re-install your deep learning framework, you must re-install AIACC-Training.
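To confirm these requirements, you can print the versions that Python sees. This is a sketch under stated assumptions: framework_version is a hypothetical helper, and it relies on the __version__ attribute that the torch, tensorflow, and mxnet modules expose.

```shell
# Hypothetical helper: print the version of an importable Python module,
# or "not installed" if python3 or the module is unavailable.
framework_version() {
    python3 -c "import $1; print($1.__version__)" 2>/dev/null || echo "not installed"
}

# Check that Python 3 and pip are available.
python3 --version 2>/dev/null || echo "python3 not found"
python3 -m pip --version 2>/dev/null || echo "pip not found"

# Print the version of each framework that is installed.
for module in torch tensorflow mxnet; do
    echo "$module: $(framework_version "$module")"
done
```

Compare the printed framework version with the supported frameworks table before you choose an installation command.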
(Recommended) Install AIACC-Training in automatic mode
AIACC-Training provides Python software packages for different framework versions. You can run a single script to install AIACC-Training in automatic mode. Sample code:
wget https://ali-perseus-release.oss-cn-huhehaote.aliyuncs.com/install_AIACC-Training.sh && bash install_AIACC-Training.sh
By default, the installation script uses python3 as the Python interpreter. To use a different interpreter, append its name to the end of the command. For example, to use python, run the following command:
wget https://ali-perseus-release.oss-cn-huhehaote.aliyuncs.com/install_AIACC-Training.sh && bash install_AIACC-Training.sh python
Install AIACC-Training in manual mode
Run one of the following pip commands to install the latest AIACC-Training software package manually.
If you use PyTorch, run the following command to install AIACC-Training:
pip install --force-reinstall https://ali-perseus-release.oss-cn-huhehaote.aliyuncs.com/cuda${cuda_version}/perseus_torch-1.5.0%2B${framework_version}-cp${python_version}-cp${python_version}m-linux_x86_64.whl --trusted-host mirrors.aliyun.com -i http://mirrors.aliyun.com/pypi/simple/
Parameters:
${cuda_version}: the CUDA version without periods (.). For example, if you use CUDA 11.0, set cuda_version=110.
${framework_version}: the framework version. For example, if you use PyTorch 1.7.1, set framework_version=1.7.1.
${python_version}: the Python version without periods (.). For example, if you use Python 3.6, set python_version=36.
For Python 3.8 or later, the WHL file name has no m suffix after the second cp${python_version} tag. Use the following download URL instead:
https://ali-perseus-release.oss-cn-huhehaote.aliyuncs.com/cuda${cuda_version}/perseus_${framework}-1.5.0%2B${framework_version}-cp${python_version}-cp${python_version}-linux_x86_64.whl
In this example, PyTorch 1.7.1, CUDA 11.0, and Python 3.6 are used. Sample code:
cuda_version=110  # The version cannot contain periods (.).
framework=torch
framework_version=1.7.1
python_version=36
pip install --force-reinstall https://ali-perseus-release.oss-cn-huhehaote.aliyuncs.com/cuda${cuda_version}/perseus_${framework}-1.5.0%2B${framework_version}-cp${python_version}-cp${python_version}m-linux_x86_64.whl --trusted-host mirrors.aliyun.com -i http://mirrors.aliyun.com/pypi/simple/
If you use TensorFlow or MXNet, run the following command to install AIACC-Training:
pip install --force-reinstall https://ali-perseus-release.oss-cn-huhehaote.aliyuncs.com/cuda${cuda_version}/perseus_${framework}-1.5.0%2B${framework_version}-py2.py3-none-manylinux1_x86_64.whl --trusted-host mirrors.aliyun.com -i http://mirrors.aliyun.com/pypi/simple/
In this example, TensorFlow 1.15.0, CUDA 10.0, and Python 3.6 are used. Sample code:
cuda_version=100  # The version cannot contain periods (.).
framework=tensorflow
framework_version=1.15.0
pip install --force-reinstall https://ali-perseus-release.oss-cn-huhehaote.aliyuncs.com/cuda${cuda_version}/perseus_${framework}-1.5.0%2B${framework_version}-py2.py3-none-manylinux1_x86_64.whl --trusted-host mirrors.aliyun.com -i http://mirrors.aliyun.com/pypi/simple/
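The URL patterns used in manual mode can be wrapped in one helper so that the periods and the m suffix are handled in a single place. This is only a sketch: aiacc_wheel_url is a hypothetical name, and the function only reproduces the patterns shown in this topic without checking that the file exists on the server.

```shell
# Hypothetical helper: print the AIACC-Training 1.5.0 wheel URL.
# Usage: aiacc_wheel_url FRAMEWORK FRAMEWORK_VERSION CUDA_VERSION [PYTHON_VERSION]
# FRAMEWORK is torch, tensorflow, or mxnet. CUDA_VERSION and PYTHON_VERSION
# may contain periods (11.0, 3.6); the periods are removed for the URL.
# PYTHON_VERSION is required only for torch.
aiacc_wheel_url() (
    base="https://ali-perseus-release.oss-cn-huhehaote.aliyuncs.com"
    framework=$1
    framework_version=$2
    cuda=$(printf '%s' "$3" | tr -d '.')
    if [ "$framework" = "torch" ]; then
        py=$(printf '%s' "$4" | tr -d '.')
        # Wheels for Python 3.6 and 3.7 carry an "m" suffix in the second tag;
        # wheels for Python 3.8 or later do not.
        case "$py" in
            36|37) abi="cp${py}m" ;;
            *)     abi="cp${py}" ;;
        esac
        tag="cp${py}-${abi}-linux_x86_64"
    else
        # The TensorFlow and MXNet pattern does not depend on the Python version.
        tag="py2.py3-none-manylinux1_x86_64"
    fi
    echo "${base}/cuda${cuda}/perseus_${framework}-1.5.0%2B${framework_version}-${tag}.whl"
)

# Example (not run here because it needs network access):
# pip install --force-reinstall "$(aiacc_wheel_url torch 1.7.1 11.0 3.6)" \
#     --trusted-host mirrors.aliyun.com -i http://mirrors.aliyun.com/pypi/simple/
```

The function body runs in a subshell, so its variables do not leak into the calling shell.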
Method 2: Install a Conda environment that contains AIACC-Training
Conda is an open source, cross-platform system for managing software packages and environments. You can create a Conda environment that contains AIACC-Training by running a single command. The environment includes CUDA Toolkit, Python 3, a deep learning framework, and the latest AIACC-Training software. This helps you quickly build and manage environments for different deep learning frameworks and framework versions, and improve training performance by using AIACC-Training.
Visit the Conda official website to download and install the latest version of Miniconda. For more information, see Miniconda.
Run the following command based on the required framework version and environment to create a Conda environment that contains AIACC-Training:
conda env create -f https://ali-perseus-release.oss-cn-huhehaote.aliyuncs.com/conda/latest/${framework}_${framework_version}_cu${cuda_version}_py${python_version}.yaml
In this example, PyTorch 1.7.1, CUDA 11.0, and Python 3.6 are used. Sample code:
cuda_version=11.0
framework=torch
framework_version=1.7.1
python_version=36
conda env create -f https://ali-perseus-release.oss-cn-huhehaote.aliyuncs.com/conda/latest/${framework}_${framework_version}_cu${cuda_version}_py${python_version}.yaml
Parameters:
${cuda_version}: the CUDA version. Specify the full version that contains periods (.). The version cannot be later than the CUDA version that is already installed on the GPU-accelerated instance.
${framework}: the type of the deep learning framework. If you use TensorFlow, MXNet, or PyTorch, set the parameter to tensorflow, mxnet, or torch, respectively.
${framework_version}: the version of the deep learning framework. For example, if you use PyTorch 1.7.1, set framework_version=1.7.1.
${python_version}: the Python version without periods (.). For example, if you use Python 3.6, set python_version=36.
Important: If the system prompts that the URL of the Conda environment cannot be found, the specified framework version is not supported. For more information, see Supported frameworks.
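Because a missing URL means that the combination is not supported, it can help to test the URL before you create the environment. The following sketch is not part of AIACC-Training. The helper names aiacc_conda_yaml_url and url_exists are hypothetical, and the check assumes that curl is installed and that the server answers HTTP HEAD requests.

```shell
# Hypothetical helper: print the URL of the Conda environment file.
# Usage: aiacc_conda_yaml_url FRAMEWORK FRAMEWORK_VERSION CUDA_VERSION PYTHON_VERSION
# CUDA_VERSION keeps its periods (11.0); PYTHON_VERSION has none (36).
aiacc_conda_yaml_url() {
    echo "https://ali-perseus-release.oss-cn-huhehaote.aliyuncs.com/conda/latest/${1}_${2}_cu${3}_py${4}.yaml"
}

# Hypothetical helper: succeed only if the URL answers an HTTP HEAD request
# without an error status.
url_exists() {
    curl --silent --head --fail --output /dev/null "$1"
}

# Example (not run here because it needs network access and Conda):
# url=$(aiacc_conda_yaml_url torch 1.7.1 11.0 36)
# if url_exists "$url"; then
#     conda env create -f "$url"
# else
#     echo "Unsupported combination: $url"
# fi
```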
Method 3: Install a Docker image that is configured with AIACC-Training
You can download a Docker image that is configured with AIACC-Training. The image includes CUDA, Python 3, a deep learning framework, and the latest AIACC-Training software. This helps you quickly deploy deep learning environments, manage different CUDA environments, and improve training performance by using AIACC-Training.
Before you install a Docker image, make sure that your environment meets the following requirements:
Docker is installed in the environment. For more information about how to install Docker on Alibaba Cloud Linux 2, see Install and use Docker on a Linux instance.
NVIDIA Container Toolkit is installed. For more information, see Installing the NVIDIA Container Toolkit.
Run the following command based on the required framework version and environment to download a Docker image that is configured with AIACC-Training:
docker pull registry.cn-beijing.aliyuncs.com/cto_office/perseus-training:${os_type}-cu${cuda_version}-${framework}${framework_version}-py${python_version}-latest
Parameters:
Parameter | Description | Example |
${os_type} | The OS type of the Docker image. Note: The OS type of the Docker image is independent of the OS of the GPU-accelerated instance. | centos7 |
${cuda_version} | The CUDA version. Note: The version must contain periods (.), and cannot be later than the CUDA version that is already installed on the GPU-accelerated instance. | 11.0 |
${framework} | The abbreviation of the deep learning framework. If you use TensorFlow, MXNet, or PyTorch, set the parameter to tf, mx, or pt, respectively. | tf |
${framework_version} | The version of the deep learning framework. The version is in the xx.xx.xx format. | 2.4.0 |
${python_version} | The Python version. Note: The version cannot contain periods (.). For example, if you use Python 3.6, Python 3.7, or Python 3.8, set the parameter to 36, 37, or 38, respectively. | 36 |
In this example, CentOS 7, CUDA 11.0, TensorFlow 2.4.0, and Python 3.6 are used. Sample code:
os_type=centos7
cuda_version=11.0
framework=tf
framework_version=2.4.0
python_version=36
docker pull registry.cn-beijing.aliyuncs.com/cto_office/perseus-training:${os_type}-cu${cuda_version}-${framework}${framework_version}-py${python_version}-latest
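The image tag uses framework abbreviations (tf, mx, pt), while the pip and Conda methods use tensorflow, mxnet, and torch. The following sketch shows one way to keep the two naming schemes straight. The helper names framework_abbrev and aiacc_image are hypothetical, and the function only assembles the tag pattern shown above without checking that the image exists.

```shell
# Hypothetical helper: map a framework name to the abbreviation used in image tags.
framework_abbrev() {
    case "$1" in
        tensorflow|tf)    echo tf ;;
        mxnet|mx)         echo mx ;;
        pytorch|torch|pt) echo pt ;;
        *) echo "unknown framework: $1" >&2; return 1 ;;
    esac
}

# Hypothetical helper: print the full image reference.
# Usage: aiacc_image OS_TYPE CUDA_VERSION FRAMEWORK FRAMEWORK_VERSION PYTHON_VERSION
# CUDA_VERSION keeps its periods (11.0); PYTHON_VERSION has none (36).
aiacc_image() (
    abbrev=$(framework_abbrev "$3") || exit 1
    echo "registry.cn-beijing.aliyuncs.com/cto_office/perseus-training:${1}-cu${2}-${abbrev}${4}-py${5}-latest"
)

# Example (not run here because it needs Docker and network access):
# docker pull "$(aiacc_image centos7 11.0 tensorflow 2.4.0 36)"
```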
For more information about how to use Docker to perform distributed training, see Horovod in Docker.
If the system prompts that the Docker image cannot be found, the specified framework version is not supported. For more information, see Supported frameworks.
If you use a container to perform distributed training, you must allocate more shared memory (SHM) when you run the docker run command to start the container. For example, you can add --shm-size=1g --ulimit memlock=-1 to the command.
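As an illustration, the shared memory options can be combined with the image name into one docker run command. This sketch only prints the command so that you can review it first. The helper name training_container_cmd is hypothetical. The --gpus all option and the bash entry command are assumptions that are not taken from this topic: --gpus requires Docker 19.03 or later with NVIDIA Container Toolkit, and the image is assumed to provide bash.

```shell
# Hypothetical helper: print (do not run) a docker run command that starts a
# training container with enlarged shared memory.
# Usage: training_container_cmd IMAGE [SHM_SIZE]
# SHM_SIZE defaults to 1g. "--gpus all" and "bash" are assumptions.
training_container_cmd() {
    echo "docker run --gpus all --shm-size=${2:-1g} --ulimit memlock=-1 -it $1 bash"
}

# Print the command for the image used in the Docker example.
training_container_cmd "registry.cn-beijing.aliyuncs.com/cto_office/perseus-training:centos7-cu11.0-tf2.4.0-py36-latest"
```

Increase the second argument, for example to 4g, if the training job reports shared memory errors.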