
Container Service for Kubernetes:Introduction and release notes for ack-ai-installer

Last Updated: Dec 23, 2024

ack-ai-installer is a collection of device plug-ins that enhance the scheduling capabilities of Container Service for Kubernetes (ACK) Pro clusters and ACK Edge Pro clusters. You can use ack-ai-installer together with the ACK scheduler to schedule heterogeneous computing resources by using GPU sharing and topology-aware GPU scheduling. The ACK scheduler is a scheduling system developed based on the Kubernetes scheduling framework to schedule different elastic resources to different workloads. This topic introduces ack-ai-installer and describes its usage notes and release notes.

Introduction

ack-ai-installer can work with the ACK scheduler to implement GPU sharing (including GPU memory isolation) and topology-aware GPU scheduling. ack-ai-installer consists of the following components.

gpushare-device-plugin and cgpu-installer

By default, the ACK scheduler used by ACK Pro clusters and ACK Edge Pro clusters schedules GPUs as exclusive resources. You can use ack-ai-installer (gpushare-device-plugin) with the ACK scheduler to implement GPU sharing, which allows multiple applications or processes to share the same GPU and improves resource utilization.

You can also use ack-ai-installer (cgpu-installer) together with cGPU, a GPU virtualization and sharing service of Alibaba Cloud, to implement GPU memory isolation. GPU memory isolation separates the GPU memory used by different applications or processes to prevent mutual interference and improve the overall performance and efficiency of the system. In addition, ack-ai-installer (cgpu-installer) supports GPU computing power isolation and provides multiple scheduling policies, including fair-share scheduling, preemptive scheduling, and weight-based preemptive scheduling, so that you can schedule and use GPU computing power in a more fine-grained manner. For more information about GPU sharing and GPU memory isolation, such as the installation procedure and use scenarios, see Configure the GPU sharing component and Use cGPU to allocate computing power.
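As an illustration of how a workload consumes shared GPU resources after the GPU sharing component is installed, the following sketch requests GPU memory through the aliyun.com/gpu-mem extended resource. The resource name and unit (GiB) follow the GPU sharing documentation; the image name is hypothetical, so adjust the values to your cluster's configuration:

```yaml
# Example pod that requests 3 GiB of GPU memory on a shared GPU.
# The ACK scheduler places it on a node where gpushare-device-plugin
# advertises the aliyun.com/gpu-mem extended resource.
apiVersion: v1
kind: Pod
metadata:
  name: gpushare-example
spec:
  containers:
  - name: cuda-app
    image: registry.example.com/cuda-app:latest   # hypothetical image
    resources:
      limits:
        aliyun.com/gpu-mem: 3   # GPU memory in GiB, not whole GPUs
```

With cGPU memory isolation enabled, the container sees only the requested amount of GPU memory instead of the full device memory.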

gputopo-device-plugin

You can use ack-ai-installer (gputopo-device-plugin) together with the ACK scheduler to implement topology-aware GPU scheduling and select the optimal combination of GPUs to accelerate training jobs. For more information about topology-aware GPU scheduling, such as the installation procedure and use scenarios, see GPU topology-aware scheduling.
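For reference, a training pod that benefits from topology-aware GPU scheduling typically requests multiple GPUs at once so that the scheduler can select the GPUs with the best interconnect (for example, NVLink) on a node. The following is a minimal sketch; it assumes the aliyun.com/gpu extended resource name described in the GPU topology-aware scheduling topic, and the image name is hypothetical:

```yaml
# Example pod that requests 4 GPUs; with topology-aware GPU scheduling
# enabled, the ACK scheduler selects the combination of 4 GPUs with
# the optimal interconnect topology on the chosen node.
apiVersion: v1
kind: Pod
metadata:
  name: gputopo-example
spec:
  containers:
  - name: training-job
    image: registry.example.com/training-job:latest   # hypothetical image
    resources:
      limits:
        aliyun.com/gpu: 4   # number of GPUs requested as a group
```

In practice, topology-aware scheduling is typically used with distributed training jobs; see the linked topic for the supported job types and submission methods.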

Usage notes

You can install ack-ai-installer only from the AI Developer Console page of an ACK Pro cluster or an ACK Edge Pro cluster that runs Kubernetes 1.18 or later. ack-ai-installer is pre-installed as a component in ACK Lingjun clusters that run Kubernetes 1.18 or later.

Release notes

November 2024

| Version | Description | Release date | Impact |
| --- | --- | --- | --- |
| 1.11.1 | cGPU is updated to 1.5.13. The issue that residual processes in a container may cause occasional kernel crashes is fixed. | 2024-11-19 | No impact on workloads. |
| 1.10.1 | cGPU is updated to 1.5.12. The issue that memory isolation for Compute Unified Device Architecture (CUDA) APIs fails in driver versions 535 and later is fixed. | 2024-11-07 | No impact on workloads. |

September 2024

| Version | Description | Release date | Impact |
| --- | --- | --- | --- |
| 1.9.16 | • The version of cGPU is 1.5.11.<br>• The cGPU installation process is moved to an init container. | 2024-09-26 | No impact on workloads. |
| 1.9.15 | cGPU is updated to 1.5.11. Decoding-related issues are fixed. | 2024-09-19 | No impact on workloads. |

August 2024

| Version | Description | Release date | Impact |
| --- | --- | --- | --- |
| 1.9.14 | • Issues related to the usage of the Multi-Process Service (MPS) daemon are fixed.<br>• cGPU is updated to 1.5.10. Policy 6 is added for equal partitioning of compute resources and memory. | 2024-08-21 | No impact on workloads. |
| 1.9.14 | cGPU is updated to 1.5.9. Policy 6 is added for equal partitioning of compute resources and memory. | 2024-08-13 | No impact on workloads. |

May 2024

| Version | Description | Release date | Impact |
| --- | --- | --- | --- |
| 1.9.11 | cGPU is updated to 1.5.7. L-series GPUs and GPU driver versions 550 and later are supported. | 2024-05-14 | No impact on workloads. |
| 1.9.10 | cGPU is updated to 1.5.7. The issue that the cgpu policy set command does not take effect is fixed. | 2024-05-09 | No impact on workloads. |

January 2024

| Version | Description | Release date | Impact |
| --- | --- | --- | --- |
| 1.8.8 | cGPU is updated to 1.5.6. A new cGPU license server policy is released. | 2024-01-04 | No impact on workloads. |

December 2023

| Version | Description | Release date | Impact |
| --- | --- | --- | --- |
| 1.8.7 | • The version of cGPU is 1.5.5.<br>• GPU resources can be scheduled and shared by using MPS. | 2023-12-20 | No impact on workloads. |

November 2023

| Version | Description | Release date | Impact |
| --- | --- | --- | --- |
| 1.8.5 | cGPU is updated to 1.5.5. The kernel panic issue triggered by cgpu-procfs is fixed. | 2023-11-23 | No impact on workloads. |

August 2023

| Version | Description | Release date | Impact |
| --- | --- | --- | --- |
| 1.8.2 | • The version of cGPU is 1.5.3.<br>• Dynamic Multi-Instance GPU (MIG) splitting is supported.<br>• The issue that device-plugin-recover repeatedly restarts is fixed. | 2023-08-29 | No impact on workloads. |

July 2023

| Version | Description | Release date | Impact |
| --- | --- | --- | --- |
| 1.7.7 | • cGPU is updated to 1.5.3.<br>• The issue caused by the incorrect symlink configuration of nvidia-container-toolkit and nvidia-container-runtime-hook is fixed.<br>• The issue that cGPU is incompatible with later driver versions (470.182.03, 515.105.01, 525.105.17, and later) is fixed. | 2023-07-04 | No impact on workloads. |

April 2023

| Version | Description | Release date | Impact |
| --- | --- | --- | --- |
| 1.7.6 | • cGPU is updated to 1.5.2. The permission misconfiguration in systemd cgroups is fixed.<br>• The issue that cGPU is incompatible with driver versions later than 5xx is fixed.<br>• The issue that cGPU does not support nvidia-container-runtime versions later than 1.10 is fixed.<br>• The issue that cGPU 1.5.1 does not support containerd is fixed. | 2023-04-26 | No impact on workloads. |
| 1.7.5 | cGPU is updated to 1.5.2. | 2023-04-18 | No impact on workloads. |