Alibaba Collective Communication Library (ACCL) is a collective communication library developed by Alibaba Cloud based on the NVIDIA Collective Communication Library (NCCL). It capitalizes on Alibaba Cloud's networking capabilities and deep expertise in communication optimization for large-scale model jobs to deliver enhanced communication performance for your jobs. ACCL also features fault diagnosis and self-recovery. This topic describes the key features of ACCL and how to install it.
Enhanced features supported by ACCL
ACCL supports a range of enhanced features, which can be enabled or disabled by using environment variables:
Fixes bugs found in the corresponding open-source version of NCCL.
Optimizes collective communication operators across message sizes, delivering better performance than the open-source NCCL.
Provides statistical analysis of collective communication during training to help diagnose slowdowns and hangs caused by device faults. When used with Alibaba Cloud PAI's AIMaster: Elastic fault tolerance engine and C4D: Model training job diagnosis tool, it enables rapid anomaly detection and automatic fault tolerance.
Supports multi-path transmission and load balancing to mitigate or eliminate congestion caused by uneven hashing in training clusters, improving overall training throughput.
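For example, the multi-path transmission and load-balancing feature in the last item is controlled by environment variables, which are described in detail later in this topic:
# Enable data splitting across multiple queue pairs and QP-level load balancing
export ACCL_IB_SPLIT_DATA_NUM=4
export ACCL_IB_QPS_LOAD_BALANCE=1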
Limits
ACCL must be installed before submitting DLC jobs with Lingjun resources and custom images in regions where Lingjun resources are available.
Install the ACCL library
The official images provided by Platform for AI (PAI) are pre-installed with ACCL. If you use the official images to submit DLC jobs, the following steps are not necessary.
Step 1: Check whether the NCCL library used by PyTorch in the image is dynamic
In a custom image container, take the following steps:
Determine the location of the PyTorch library.
If you know the path where PyTorch is installed, search within that path; if you do not, see the sketch after these steps. For example, if PyTorch is located in /usr/local/lib, use the following command to locate the libtorch.so file:
find /usr/local/lib -name "libtorch*"
# Example results:
# /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so
# /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so
# /usr/local/lib/python3.10/dist-packages/torch/lib/libtorchbind_test.so
Check the dependency of the PyTorch library on the NCCL library with the ldd command:
ldd libtorch.so | grep nccl
If a result similar to the following is returned, the NCCL library is dynamically linked and you can proceed to download and install ACCL:
libnccl.so.2 => /usr/lib/x86_64-linux-gnu/libnccl.so.2 (0x00007feab3b27000)
If no result is returned, NCCL is statically linked and you cannot install ACCL. You must build a custom image based on an official NVIDIA NGC image or switch to a PyTorch version that links the NCCL library dynamically.
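If you do not know where PyTorch is installed, you can locate it programmatically. A minimal sketch, assuming python and the torch package are available in the container:
# Locate the lib directory of the installed PyTorch package
TORCH_LIB=$(python -c "import torch, os; print(os.path.join(os.path.dirname(torch.__file__), 'lib'))")
# Run the same ldd check against the bundled libtorch.so
ldd "$TORCH_LIB/libtorch.so" | grep nccl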
Step 2: Check the CUDA version used in the image
In a custom image container, use the following command to verify the CUDA version:
nvidia-smi
In the output, check the CUDA Version field in the header. For example, CUDA Version: 12.2 indicates that the CUDA version is 12.2. Make sure to refer to the actual version returned by your command.
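If you only need the version string, a quick way to extract it, assuming a standard nvidia-smi header:
# Print only the CUDA version line from the nvidia-smi header
nvidia-smi | grep -o "CUDA Version: [0-9.]*"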
Step 3: Download ACCL corresponding to the CUDA version
Download links for ACCL:
| CUDA version | ACCL download link |
| --- | --- |
| 12.3 | https://accl-n.oss-cn-beijing.aliyuncs.com/cuda12.3/lib/libnccl.so.2 |
| 12.2 | https://accl-n.oss-cn-beijing.aliyuncs.com/cuda12.2/lib/libnccl.so.2 |
| 12.1 | https://accl-n.oss-cn-beijing.aliyuncs.com/cuda12.1/lib/libnccl.so.2 |
| 11.7 | https://accl-n.oss-cn-beijing.aliyuncs.com/cuda11.7/lib/libnccl.so.2 |
| 11.4 | https://accl-n.oss-cn-beijing.aliyuncs.com/cuda11.4/lib/libnccl.so.2 |
To download the ACCL library that corresponds to your CUDA version, run the following command in the custom image container. The following example uses CUDA 12.3:
wget https://accl-n.oss-cn-beijing.aliyuncs.com/cuda12.3/lib/libnccl.so.2
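If you prefer to derive the URL from the detected version instead of copying it from the table, a minimal sketch follows. Note that it only works when the version reported by nvidia-smi exactly matches one of the versions listed in the table above:
# Extract the CUDA version reported by nvidia-smi (for example, 12.3)
CUDA_VER=$(nvidia-smi | grep -o "CUDA Version: [0-9.]*" | awk '{print $3}')
# Substitute the version into the download URL pattern from the table
wget "https://accl-n.oss-cn-beijing.aliyuncs.com/cuda${CUDA_VER}/lib/libnccl.so.2"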
Step 4: Install ACCL
Before you install ACCL, check whether NCCL is already installed in your system. Run the following command to check whether the libnccl.so.2 file exists:
sudo find / -name "libnccl.so.2"
Depending on the query results, take one of the following actions. A combined sketch of both cases follows this list.
If the libnccl.so.2 file is not found, or if it is located in the system directories /usr/lib64 or /lib64, use the cp command to copy the newly downloaded libnccl.so.2 file to the appropriate system directory:
sudo cp -f ./libnccl.so.2 /usr/lib64
If the libnccl.so.2 file is found in a non-standard directory, such as /opt/xxx/, this may indicate a custom NCCL installation path. Use the cp command to overwrite the existing file with the newly downloaded libnccl.so.2:
sudo cp -f libnccl.so.2 /opt/xxx/
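The decision above can be scripted. A minimal sketch that overwrites the first existing libnccl.so.2 it finds, or installs into /usr/lib64 if none exists, assuming you run it from the directory that contains the downloaded file:
# Find the first existing libnccl.so.2, excluding the freshly downloaded copy
EXISTING=$(sudo find / -name "libnccl.so.2" -not -path "$(pwd)/*" 2>/dev/null | head -n 1)
if [ -n "$EXISTING" ]; then
  # Overwrite the existing library in place
  sudo cp -f ./libnccl.so.2 "$(dirname "$EXISTING")/"
else
  # No existing copy: install into the system directory
  sudo cp -f ./libnccl.so.2 /usr/lib64
fi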
Step 5: Refresh the dynamic library
Refresh the dynamic library cache with the following command:
sudo ldconfig
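You can then verify that the dynamic linker resolves the new library:
# Confirm that libnccl.so.2 is registered in the dynamic linker cache
ldconfig -p | grep libnccl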
Step 6: Check whether ACCL is loaded
Submit a DLC job by using a custom image. For more information, see Submit training jobs.
Check the job log. If the startup log displays ACCL version information, the ACCL library is loaded. For information about how to check job logs, see View training jobs.
Note: Ensure that the log includes the accl-n identifier. Otherwise, ACCL is not loaded. Example log entry:
NCCL version 2.20.5.7-accl-n+cuda12.4, COMMIT_ID Zeaa6674c2f1f896e3a6bbd77e85231e0700****, BUILD_TIME 2024-05-10 15:40:56
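If you have saved the job log locally, you can confirm the identifier with a quick grep. The file name job.log below is hypothetical:
# Print the first NCCL version line; it should contain the accl-n identifier
grep -m1 "accl-n" job.log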
Recommended environment variable configuration
Based on its extensive experience with ACCL, the PAI team has compiled a list of environment variables that can enhance communication throughput in various scenarios:
export NCCL_IB_TC=136
export NCCL_IB_SL=5
export NCCL_IB_GID_INDEX=3
export NCCL_SOCKET_IFNAME=eth
export NCCL_DEBUG=INFO
export NCCL_IB_HCA=mlx5
export NCCL_IB_TIMEOUT=22
export NCCL_IB_QPS_PER_CONNECTION=8
export NCCL_MIN_NCHANNELS=4
export NCCL_NET_PLUGIN=none
export ACCL_C4_STATS_MODE=CONN
export ACCL_IB_SPLIT_DATA_NUM=4
export ACCL_IB_QPS_LOAD_BALANCE=1
export ACCL_IB_GID_INDEX_FIX=1
export ACCL_LOG_TIME=1
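In a DLC job, these exports are typically placed at the beginning of the job's startup command so that they take effect before the training process starts. A minimal sketch, in which accl_env.sh (a file containing the exports above) and train.py are hypothetical names:
# Hypothetical DLC startup command: load the recommended variables, then launch training
source accl_env.sh && python train.py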
The following table describes the key environment variables:
| Environment variable | Description |
| --- | --- |
| NCCL_IB_TC | Specifies the network mapping rules used by Alibaba Cloud. If this parameter is not set or invalid, network performance may be degraded. |
| NCCL_IB_GID_INDEX | The GID that the RDMA protocol uses. If this parameter is not set or invalid, NCCL may encounter errors. |
| NCCL_SOCKET_IFNAME | The network interface that NCCL uses to establish connections. Different instance specifications require different interfaces. If this parameter is not set or invalid, NCCL may fail to establish connections. |
| NCCL_DEBUG | The log level of NCCL. Set this parameter to INFO to produce more logs for efficient troubleshooting. |
| NCCL_IB_HCA | The network interface card used for RDMA communication. If this parameter is not set or invalid, network performance may be degraded. |
| NCCL_IB_TIMEOUT | The RDMA connection timeout, which can enhance fault tolerance during training jobs. If this parameter is not set or invalid, interruptions may occur. |
| NCCL_IB_QPS_PER_CONNECTION | The number of queue pairs per connection. Increasing this number appropriately can significantly boost network throughput. |
| NCCL_NET_PLUGIN | The network plugin for NCCL. We recommend that you set this parameter to none so that no other plugins are loaded, which ensures performance. |
| ACCL_C4_STATS_MODE | The granularity of ACCL statistics. We recommend that you set this parameter to CONN, which aggregates statistics by connection. |
| ACCL_IB_SPLIT_DATA_NUM | Specifies whether to split data across multiple queue pairs for transmission. |
| ACCL_IB_QPS_LOAD_BALANCE | Specifies whether to enable load balancing. |
| ACCL_IB_GID_INDEX_FIX | Specifies whether to automatically check for and bypass GID anomalies before a job starts. Set this parameter to 1 to enable the feature. |
| ACCL_LOG_TIME | Specifies whether to prepend timestamps to log entries for troubleshooting. Set this parameter to 1 to enable the feature. |