ack-kube-queue is a kube-queue component provided by the cloud-native AI suite. It works with the scheduler and quota system to allow you to manage job queues, schedule jobs based on priorities, and use elastic quotas. ack-kube-queue can optimize the management and scheduling of AI/machine learning (ML) workloads and batch workloads in Kubernetes. This topic introduces ack-kube-queue and describes the usage notes and release notes for ack-kube-queue.
Introduction
AI/ML jobs and batch jobs in Kubernetes usually create a large number of pods, which increases the load on the scheduler. In addition, jobs submitted by different users may interfere with each other. ack-kube-queue provides all features of kube-queue to manage AI/ML workloads and batch workloads in Kubernetes. This component allows system administrators to customize job queue management to improve the flexibility of queues. Combined with a quota system, ack-kube-queue can automate and optimize the management of workloads and resource quotas to maximize resource utilization in Kubernetes clusters.
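The idea of combining a priority-ordered job queue with a resource quota can be illustrated with a small toy model. The sketch below is not the ack-kube-queue implementation: the `JobQueue` class, its methods, the single CPU quota, and the job names are all hypothetical and exist only for illustration. It assumes that a job leaves the queue only when its CPU request fits into the remaining quota, and that a job at the head of the queue that does not fit holds back the jobs behind it.

```python
import heapq
from itertools import count


class JobQueue:
    """Toy model: a priority-ordered queue guarded by one CPU quota."""

    def __init__(self, cpu_quota):
        self.cpu_quota = cpu_quota
        self.cpu_used = 0
        self._heap = []
        self._seq = count()  # tie-breaker: earlier submissions go first

    def submit(self, name, priority, cpu):
        # heapq pops the smallest item, so the priority is negated
        # to make higher-priority jobs come out first.
        heapq.heappush(self._heap, (-priority, next(self._seq), name, cpu))

    def dequeue_ready(self):
        """Dequeue jobs in priority order until the head job exceeds the quota."""
        admitted = []
        while self._heap:
            _, _, name, cpu = self._heap[0]
            if self.cpu_used + cpu > self.cpu_quota:
                break  # the head job does not fit, so the queue waits
            heapq.heappop(self._heap)
            self.cpu_used += cpu
            admitted.append(name)
        return admitted

    def release(self, cpu):
        """Return quota to the queue when a job finishes."""
        self.cpu_used -= cpu


if __name__ == "__main__":
    queue = JobQueue(cpu_quota=8)
    queue.submit("train-a", priority=1, cpu=4)
    queue.submit("train-b", priority=10, cpu=6)
    queue.submit("batch-c", priority=5, cpu=2)
    print(queue.dequeue_ready())  # train-b and batch-c fit; train-a waits
    queue.release(6)              # train-b finishes and frees its quota
    print(queue.dequeue_ready())  # train-a can now be dequeued
```

In this model the pods of a job are created only after the job is dequeued, so the scheduler never sees pods for jobs that cannot run yet.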
Usage notes
Only Container Service for Kubernetes (ACK) Pro clusters, ACK Serverless Pro clusters, and ACK Edge Pro clusters that run Kubernetes 1.18 or later support ack-kube-queue.
You can install ack-kube-queue when you deploy the cloud-native AI suite or install it after the cloud-native AI suite is deployed. After you install ack-kube-queue, you can use features such as blocking queues and strict priority scheduling. For more information about how to install and use ack-kube-queue, see Use ack-kube-queue to manage job queues.
Release notes
June 2023
Version | Description | Release date | Impact |
v0.1.10 | ARM-based nodes are supported by components such as kube-queue-controller, tf-operator-extension, and pytorch-operator-extension. | 2023-06-14 | No impact on workloads |
May 2023
Version | Description | Release date | Impact |
v0.1.9 | Jobs that remain pending for a long period of time can be resubmitted to the job queue, and multi-queue fair queuing is supported. If the pods created by a job remain pending for a long period of time due to topology-aware scheduling, node affinity, or resource fragmentation, ack-kube-queue reclaims the job and resubmits it to the queue. This releases the resource quota occupied by the job and improves the overall resource quota utilization. | 2023-05-16 | No impact on workloads |
April 2023
Version | Description | Release date | Impact |
v0.1.8 | Blocking queues and strict priority scheduling are supported. For more information, see Enable blocking queues and Enable strict priority scheduling. | 2023-04-25 | No impact on workloads |
March 2023
Version | Description | Release date | Impact |
v0.1.6 | The issue that the status of TensorFlow jobs is not displayed is fixed. | 2023-03-15 | No impact on workloads |
February 2023
Version | Description | Release date | Impact |
v0.1.5 | The issue that ack-kube-queue occasionally fails to delete jobs is fixed. | 2023-02-28 | No impact on workloads |
v0.1.4 | The issue that the Used information is occasionally lost after a job queue unit is dequeued is fixed. | 2023-02-14 | No impact on workloads |
January 2023
Version | Description | Release date | Impact |
v0.1.3 | The issue that job queue units are occasionally lost is fixed. | 2023-01-12 | No impact on workloads |
v0.1.2 | The issue that jobs occasionally cannot be dequeued for a long period of time is fixed. | 2023-01-12 | No impact on workloads |
v0.1.1 | Multi-queue is supported. Jobs with different resource quotas are submitted to different queues to avoid congestion. | 2023-01-10 | No impact on workloads |
October 2022
Version | Description | Release date | Impact |
v0.1.0 | This is the first release. | 2022-10-15 | No impact on workloads |