Alibaba Collective Communication Library (ACCL) is a collective communication library developed by Alibaba Cloud based on the NVIDIA Collective Communication Library (NCCL). It capitalizes on Alibaba Cloud's networking capabilities and deep expertise in communication optimization for large-scale model jobs to deliver enhanced communication performance for your jobs. ACCL also features fault diagnosis and self-recovery. This topic describes the key features of ACCL and how to install it.
Enhanced features supported by ACCL
ACCL supports a range of enhanced features, which can be enabled or disabled by using environment variables:
Fixes bugs found in the corresponding open-source version of NCCL.
Optimizes collective communication operators across message sizes, delivering better performance than the open-source NCCL.
Provides statistical analysis of collective communication during training to help diagnose slowdowns and hangs caused by device faults. When used with Alibaba Cloud PAI's AIMaster: Elastic fault tolerance engine and C4D: Model training job diagnosis tool, it enables rapid anomaly detection and automatic fault tolerance.
Supports multi-path transmission and load balancing to mitigate or eliminate congestion caused by uneven hashing in training clusters, improving overall training throughput.
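For example, the multi-path transmission and load-balancing feature in the last item is controlled by environment variables, which are described in detail later in this topic:
# Enable data splitting across multiple queue pairs and QP-level load balancing
export ACCL_IB_SPLIT_DATA_NUM=4
export ACCL_IB_QPS_LOAD_BALANCE=1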
Limits
ACCL must be installed before submitting DLC jobs with Lingjun resources and custom images in regions where Lingjun resources are available.
Install the ACCL library
The official images provided by Platform for AI (PAI) are pre-installed with ACCL. If you use the official images to submit DLC jobs, the following steps are not necessary.
Step 1: Check whether the NCCL library used by PyTorch in the image is dynamic
In a custom image container, take the following steps:
Determine the location of the PyTorch library.
If you know the path where PyTorch is installed, search within that path; if you do not, see the sketch after these steps. For example, if PyTorch is located in /usr/local/lib, use the following command to locate the libtorch.so file:
find /usr/local/lib -name "libtorch*"
# Example results:
# /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so
# /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so
# /usr/local/lib/python3.10/dist-packages/torch/lib/libtorchbind_test.so
Check the dependency of the PyTorch library on the NCCL library with the ldd command:
ldd libtorch.so | grep nccl
If a result similar to the following is returned, the NCCL library is dynamically linked and you can proceed to download and install ACCL:
libnccl.so.2 => /usr/lib/x86_64-linux-gnu/libnccl.so.2 (0x00007feab3b27000)
If no result is returned, NCCL is statically linked and you cannot install ACCL. You must build a custom image based on an official NVIDIA NGC image or switch to a PyTorch version that links the NCCL library dynamically.
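If you do not know where PyTorch is installed, you can locate it programmatically. A minimal sketch, assuming python and the torch package are available in the container:
# Locate the lib directory of the installed PyTorch package
TORCH_LIB=$(python -c "import torch, os; print(os.path.join(os.path.dirname(torch.__file__), 'lib'))")
# Run the same ldd check against the bundled libtorch.so
ldd "$TORCH_LIB/libtorch.so" | grep nccl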
Step 2: Check the CUDA version used in the image
In a custom image container, use the following command to verify the CUDA version:
nvidia-smi
In the output, check the CUDA Version field in the header. For example, CUDA Version: 12.2 indicates that the CUDA version is 12.2. Make sure to refer to the actual version returned by your command.
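If you only need the version string, a quick way to extract it, assuming a standard nvidia-smi header:
# Print only the CUDA version line from the nvidia-smi header
nvidia-smi | grep -o "CUDA Version: [0-9.]*"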
Step 3: Download ACCL corresponding to the CUDA version
Download links for ACCL:
| CUDA version | ACCL download link |
| --- | --- |
| 12.3 | https://accl-n.oss-cn-beijing.aliyuncs.com/cuda12.3/lib/libnccl.so.2 |
| 12.2 | https://accl-n.oss-cn-beijing.aliyuncs.com/cuda12.2/lib/libnccl.so.2 |
| 12.1 | https://accl-n.oss-cn-beijing.aliyuncs.com/cuda12.1/lib/libnccl.so.2 |
| 11.7 | https://accl-n.oss-cn-beijing.aliyuncs.com/cuda11.7/lib/libnccl.so.2 |
| 11.4 | https://accl-n.oss-cn-beijing.aliyuncs.com/cuda11.4/lib/libnccl.so.2 |
To download the ACCL library that corresponds to your CUDA version, run the following command in the custom image container. The following example uses CUDA 12.3:
wget https://accl-n.oss-cn-beijing.aliyuncs.com/cuda12.3/lib/libnccl.so.2
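If you prefer to derive the URL from the detected version instead of copying it from the table, a minimal sketch follows. Note that it only works when the version reported by nvidia-smi exactly matches one of the versions listed in the table above:
# Extract the CUDA version reported by nvidia-smi (for example, 12.3)
CUDA_VER=$(nvidia-smi | grep -o "CUDA Version: [0-9.]*" | awk '{print $3}')
# Substitute the version into the download URL pattern from the table
wget "https://accl-n.oss-cn-beijing.aliyuncs.com/cuda${CUDA_VER}/lib/libnccl.so.2"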
Step 4: Install ACCL
Before you install ACCL, check whether NCCL is already installed in your system. Run the following command to check whether the libnccl.so.2 file exists:
sudo find / -name "libnccl.so.2"
Depending on the query results, take one of the following actions. A combined sketch of both cases follows this list.
If the libnccl.so.2 file is not found, or if it is located in the system directories /usr/lib64 or /lib64, use the cp command to copy the newly downloaded libnccl.so.2 file to the appropriate system directory:
sudo cp -f ./libnccl.so.2 /usr/lib64
If the libnccl.so.2 file is found in a non-standard directory, such as /opt/xxx/, this may indicate a custom NCCL installation path. Use the cp command to overwrite the existing file with the newly downloaded libnccl.so.2:
sudo cp -f libnccl.so.2 /opt/xxx/
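The decision above can be scripted. A minimal sketch that overwrites the first existing libnccl.so.2 it finds, or installs into /usr/lib64 if none exists, assuming you run it from the directory that contains the downloaded file:
# Find the first existing libnccl.so.2, excluding the freshly downloaded copy
EXISTING=$(sudo find / -name "libnccl.so.2" -not -path "$(pwd)/*" 2>/dev/null | head -n 1)
if [ -n "$EXISTING" ]; then
  # Overwrite the existing library in place
  sudo cp -f ./libnccl.so.2 "$(dirname "$EXISTING")/"
else
  # No existing copy: install into the system directory
  sudo cp -f ./libnccl.so.2 /usr/lib64
fi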
Step 5: Refresh the dynamic library
Refresh the dynamic library cache with the following command:
sudo ldconfig
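You can then verify that the dynamic linker resolves the new library:
# Confirm that libnccl.so.2 is registered in the dynamic linker cache
ldconfig -p | grep libnccl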
Step 6: Check whether ACCL is loaded
Submit a DLC job by using a custom image. For more information, see Submit training jobs.
Check the job log. If the startup log displays ACCL version information, the ACCL library is loaded. For information about how to check job logs, see View training jobs.
Note: Ensure that the log includes the accl-n identifier. Otherwise, ACCL is not loaded. Example log entry:
NCCL version 2.20.5.7-accl-n+cuda12.4, COMMIT_ID Zeaa6674c2f1f896e3a6bbd77e85231e0700****, BUILD_TIME 2024-05-10 15:40:56
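If you have saved the job log locally, you can confirm the identifier with a quick grep. The file name job.log below is hypothetical:
# Print the first NCCL version line; it should contain the accl-n identifier
grep -m1 "accl-n" job.log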
Recommended environment variable configuration
Based on its extensive experience with ACCL, the PAI team has compiled a list of environment variables that can enhance communication throughput in various scenarios:
export NCCL_IB_TC=136
export NCCL_IB_SL=5
export NCCL_IB_GID_INDEX=3
export NCCL_SOCKET_IFNAME=eth
export NCCL_DEBUG=INFO
export NCCL_IB_HCA=mlx5
export NCCL_IB_TIMEOUT=22
export NCCL_IB_QPS_PER_CONNECTION=8
export NCCL_MIN_NCHANNELS=4
export NCCL_NET_PLUGIN=none
export ACCL_C4_STATS_MODE=CONN
export ACCL_IB_SPLIT_DATA_NUM=4
export ACCL_IB_QPS_LOAD_BALANCE=1
export ACCL_IB_GID_INDEX_FIX=1
export ACCL_LOG_TIME=1
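In a DLC job, these exports are typically placed at the beginning of the job's startup command so that they take effect before the training process starts. A minimal sketch, in which accl_env.sh (a file containing the exports above) and train.py are hypothetical names:
# Hypothetical DLC startup command: load the recommended variables, then launch training
source accl_env.sh && python train.py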
The following table describes the key environment variables:
| Environment variable | Description |
| --- | --- |
| NCCL_IB_TC | Specifies the network mapping rules used by Alibaba Cloud. If this parameter is not set or invalid, network performance may be degraded. |
| NCCL_IB_GID_INDEX | The GID that the RDMA protocol uses. If this parameter is not set or invalid, NCCL may encounter errors. |
| NCCL_SOCKET_IFNAME | The network interface that NCCL uses to establish connections. Different instance specifications require different interfaces. If this parameter is not set or invalid, NCCL may fail to establish connections. |
| NCCL_DEBUG | The log level of NCCL. Set this parameter to INFO to produce more logs for efficient troubleshooting. |
| NCCL_IB_HCA | The network interface card used for RDMA communication. If this parameter is not set or invalid, network performance may be degraded. |
| NCCL_IB_TIMEOUT | The RDMA connection timeout, which can enhance fault tolerance during training jobs. If this parameter is not set or invalid, interruptions may occur. |
| NCCL_IB_QPS_PER_CONNECTION | The number of queue pairs per connection. Increasing this number appropriately can significantly boost network throughput. |
| NCCL_NET_PLUGIN | The network plugin for NCCL. We recommend that you set this parameter to none so that no other plugins are loaded, which ensures performance. |
| ACCL_C4_STATS_MODE | The granularity of ACCL statistics. We recommend that you set this parameter to CONN, which aggregates statistics by connection. |
| ACCL_IB_SPLIT_DATA_NUM | Specifies whether to split data across multiple queue pairs for transmission. |
| ACCL_IB_QPS_LOAD_BALANCE | Specifies whether to enable load balancing. |
| ACCL_IB_GID_INDEX_FIX | Specifies whether to automatically check for and bypass GID anomalies before a job starts. Set this parameter to 1 to enable the feature. |
| ACCL_LOG_TIME | Specifies whether to prepend timestamps to log entries for troubleshooting. Set this parameter to 1 to enable the feature. |