ack-ai-installer is a collection of device plug-ins that enhance the scheduling capabilities of Container Service for Kubernetes (ACK) Pro clusters and ACK Edge Pro clusters. You can use ack-ai-installer together with the ACK scheduler to schedule heterogeneous computing resources based on GPU sharing and topology-aware GPU scheduling. The ACK scheduler is a scheduling system developed on top of the Kubernetes scheduling framework to schedule different elastic resources to different workloads. This topic introduces ack-ai-installer and describes its usage notes and release notes.
Introduction
ack-ai-installer can work with the ACK scheduler to implement GPU sharing (including GPU memory isolation) and topology-aware GPU scheduling. ack-ai-installer consists of the following components.
gpushare-device-plugin and cgpu-installer
By default, the ACK scheduler used by ACK Pro clusters and ACK Edge Pro clusters schedules GPUs as exclusive resources. You can use ack-ai-installer (gpushare-device-plugin) with the ACK scheduler to implement GPU sharing and GPU memory isolation. GPU sharing allows multiple applications or processes to share the same GPU to improve resource utilization. You can also use ack-ai-installer (cgpu-installer) together with cGPU, a GPU virtualization and sharing service of Alibaba Cloud, to implement GPU memory isolation. GPU memory isolation keeps the GPU memory of different applications or processes separate to prevent mutual interference and improve the overall performance and efficiency of the system. In addition, ack-ai-installer (cgpu-installer) supports GPU computing power isolation and provides different scheduling policies, including fair-share scheduling, preemptive scheduling, and weight-based preemptive scheduling, so that GPU computing power can be scheduled and used in a more fine-grained manner. For more information about GPU sharing and GPU memory isolation, such as the installation procedure and use scenarios, see Configure the GPU sharing component and Use cGPU to allocate computing power.
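For reference, the following manifest is a minimal sketch of how a workload typically requests a slice of GPU memory after GPU sharing is set up. The Pod name, the image, and the extended resource name aliyun.com/gpu-mem are assumptions for illustration; confirm the exact resource names, units, and node configuration in Configure the GPU sharing component and Use cGPU to allocate computing power.

```yaml
# Minimal sketch: a Pod that requests a portion of GPU memory on a GPU-sharing
# node instead of an entire GPU. The resource name aliyun.com/gpu-mem and its
# unit (GiB) are assumptions; verify them in the GPU sharing documentation.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-share-sample            # hypothetical name for illustration
spec:
  containers:
  - name: cuda-app
    image: registry.example.com/cuda-app:latest   # placeholder image
    resources:
      limits:
        aliyun.com/gpu-mem: 4       # request 4 GiB of GPU memory; the rest of the GPU stays available to other Pods
```

This shows only the request side; the memory isolation itself is enforced on the node by cGPU once cgpu-installer is deployed.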
gputopo-device-plugin
You can use ack-ai-installer (gputopo-device-plugin) together with the ACK scheduler to implement topology-aware GPU scheduling and select the optimal combination of GPUs to accelerate training jobs. For more information about topology-aware GPU scheduling, such as the installation procedure and use scenarios, see GPU topology-aware scheduling.
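For reference, the following manifest is a minimal sketch of a multi-GPU training Pod that relies on topology-aware GPU scheduling. The node label ack.node.gpu.schedule=topology, the resource name aliyun.com/gpu, the Pod name, and the image are assumptions for illustration; see GPU topology-aware scheduling for the exact labels and resource names required by your cluster version.

```yaml
# Minimal sketch: a training Pod that asks the ACK scheduler for several GPUs
# on a topology-aware node so that a well-connected GPU combination (for
# example, GPUs linked by NVLink) can be selected. The label, resource name,
# and image below are assumptions; verify them in the linked topic.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-topology-sample         # hypothetical name for illustration
spec:
  nodeSelector:
    ack.node.gpu.schedule: topology # assumed label that marks topology-aware GPU nodes
  containers:
  - name: training
    image: registry.example.com/training-job:latest   # placeholder image
    resources:
      limits:
        aliyun.com/gpu: 4           # assumed resource name; the scheduler selects an optimal set of 4 GPUs on the node
```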
Usage notes
You can install ack-ai-installer only from the AI Developer Console page of an ACK Pro cluster or an ACK Edge Pro cluster that runs Kubernetes 1.18 or later. ack-ai-installer is pre-installed as a component in ACK Lingjun clusters that run Kubernetes 1.18 or later.
Release notes
November 2024
| Version | Description | Release date | Impact |
| --- | --- | --- | --- |
| 1.11.1 | cGPU is updated to 1.5.13. The issue that residual processes of the container may cause occasional kernel crashes is fixed. | 2024-11-19 | No impact on workloads. |
| 1.10.1 | cGPU is updated to 1.5.12. The issue that memory isolation for the Compute Unified Device Architecture (CUDA) API fails in driver version 535 or later is fixed. | 2024-11-07 | No impact on workloads. |
September 2024
| Version | Description | Release date | Impact |
| --- | --- | --- | --- |
| 1.9.16 |  | 2024-09-26 | No impact on workloads. |
| 1.9.15 | cGPU is updated to 1.5.11. The issues that are related to decoding are fixed. | 2024-09-19 | No impact on workloads. |
August 2024
| Version | Description | Release date | Impact |
| --- | --- | --- | --- |
| 1.9.14 |  | 2024-08-21 | No impact on workloads. |
| 1.9.14 | cGPU is updated to 1.5.9. Policy 6 is added for equal partitioning of compute resources and memory. | 2024-08-13 | No impact on workloads. |
May 2024
| Version | Description | Release date | Impact |
| --- | --- | --- | --- |
| 1.9.11 | cGPU is updated to 1.5.7. L series GPUs and GPU driver version 550 and later are supported. | 2024-05-14 | No impact on workloads. |
| 1.9.10 | cGPU is updated to 1.5.7. The issue that the | 2024-05-09 | No impact on workloads. |
January 2024
| Version | Description | Release date | Impact |
| --- | --- | --- | --- |
| 1.8.8 | cGPU is updated to 1.5.6. A new cGPU license server policy is released. | 2024-01-04 | No impact on workloads. |
December 2023
| Version | Description | Release date | Impact |
| --- | --- | --- | --- |
| 1.8.7 |  | 2023-12-20 | No impact on workloads. |
November 2023
| Version | Description | Release date | Impact |
| --- | --- | --- | --- |
| 1.8.5 | cGPU is updated to 1.5.5. The kernel panic issue triggered by cgpu-procfs is fixed. | 2023-11-23 | No impact on workloads. |
August 2023
| Version | Description | Release date | Impact |
| --- | --- | --- | --- |
| 1.8.2 |  | 2023-08-29 | No impact on workloads. |
July 2023
| Version | Description | Release date | Impact |
| --- | --- | --- | --- |
| 1.7.7 |  | 2023-07-04 | No impact on workloads. |
April 2023
| Version | Description | Release date | Impact |
| --- | --- | --- | --- |
| 1.7.6 |  | 2023-04-26 | No impact on workloads. |
| 1.7.5 | cGPU is updated to 1.5.2. | 2023-04-18 | No impact on workloads. |