
Container Service for Kubernetes:Introduction and release notes for ack-ai-installer

Last Updated:Jul 17, 2024

ack-ai-installer is a collection of device plug-ins that enhance the scheduling capabilities of Container Service for Kubernetes (ACK) Pro clusters and ACK Edge Pro clusters. When used together with the ACK scheduler, ack-ai-installer can schedule heterogeneous computing resources based on GPU sharing and topology-aware GPU scheduling. The ACK scheduler is a scheduling system developed based on the Kubernetes scheduling framework to schedule different elastic resources to different workloads. This topic introduces ack-ai-installer and provides its usage notes and release notes.

Introduction

ack-ai-installer can work with the ACK scheduler to implement GPU sharing (including GPU memory isolation) and topology-aware GPU scheduling. ack-ai-installer consists of the following components.

gpushare-device-plugin and cgpu-installer

By default, the ACK scheduler used by ACK Pro clusters and ACK Edge Pro clusters schedules GPUs in exclusive mode. gpushare-device-plugin works with the ACK scheduler to implement GPU sharing, which allows multiple applications or processes to share the same GPU and improves resource utilization.

cgpu-installer works with cGPU, a GPU virtualization and sharing service of Alibaba Cloud, to implement GPU memory isolation. GPU memory isolation prevents applications or processes that share a GPU from interfering with each other's GPU memory, which improves the overall performance and efficiency of the system. cgpu-installer also supports GPU computing power isolation and provides multiple scheduling policies, including fair-share scheduling, preemptive scheduling, and weight-based preemptive scheduling, to allocate GPU computing power in a more fine-grained manner. For more information about GPU sharing and GPU memory isolation, such as the installation procedure and use scenarios, see Configure the GPU sharing component and Use cGPU to allocate computing power.
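As a rough sketch, a pod that uses GPU sharing requests a slice of GPU memory through the extended resource advertised by gpushare-device-plugin instead of requesting a whole GPU. The resource name aliyun.com/gpu-mem below is an assumption based on ACK's GPU sharing feature; confirm the exact resource name and unit in Configure the GPU sharing component for your cluster version.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-share-demo
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:11.8.0-base-ubuntu22.04
    command: ["sleep", "infinity"]
    resources:
      limits:
        # Request 3 GiB of GPU memory rather than an exclusive GPU.
        # aliyun.com/gpu-mem is the extended resource exposed by
        # gpushare-device-plugin (assumed name; unit: GiB).
        aliyun.com/gpu-mem: 3
```

Multiple pods that request aliyun.com/gpu-mem can then be placed on the same physical GPU, and cGPU enforces the memory limit for each of them.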

gputopo-device-plugin

gputopo-device-plugin works with the ACK scheduler to implement topology-aware GPU scheduling and select the optimal combination of GPUs to accelerate training jobs. For more information about topology-aware GPU scheduling, such as the installation procedure and use scenarios, see GPU topology-aware scheduling.
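For illustration, a training pod that relies on topology-aware GPU scheduling requests whole GPUs through an extended resource, and the ACK scheduler (together with gputopo-device-plugin) selects the GPU combination with the best interconnect on the node. The resource name aliyun.com/gpu is an assumption; see GPU topology-aware scheduling for the exact pod configuration required by your cluster version.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-topo-demo
spec:
  containers:
  - name: trainer
    image: nvidia/cuda:11.8.0-base-ubuntu22.04
    command: ["sleep", "infinity"]
    resources:
      limits:
        # Request 4 whole GPUs. With topology-aware scheduling enabled,
        # the scheduler picks the 4 GPUs with the optimal interconnect
        # (for example, NVLink) to accelerate the training job.
        aliyun.com/gpu: 4
```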

Usage notes

You can install ack-ai-installer only from the AI Developer Console page of an ACK Pro cluster or ACK Edge Pro cluster that runs Kubernetes 1.18 or later. ack-ai-installer is pre-installed as a component in an ACK Lingjun cluster that runs Kubernetes 1.18 or later.

Release notes

December 2023

Version: 1.8.7

Description:

  • GPU resources can be scheduled and shared by using Multi-Process Service (MPS).

  • cGPU is updated to 1.5.5.

Release date: 2023-12-20

Impact: No impact on workloads.

August 2023

Version: 1.8.2

Description:

  • Dynamic Multi-Instance GPU splitting is supported.

  • The issue that device-plugin-recover repeatedly restarts is fixed.

  • cGPU is updated to 1.5.3.

Release date: 2023-08-29

Impact: No impact on workloads.

April 2023

Version: 1.7.6

Description:

  • cGPU is updated to 1.5.2.

  • The issue that cGPU is incompatible with driver versions later than 5xx is fixed.

  • The issue that cGPU does not support nvidia-container-runtime versions later than 1.10 is fixed.

  • The issue that cGPU 1.5.1 does not support containerd is fixed.

Release date: 2023-04-26

Impact: No impact on workloads.