Container Service for Kubernetes:ack-kube-queue

Last Updated:Jan 09, 2025

ack-kube-queue is a kube-queue component provided by the cloud-native AI suite. It works with the scheduler and quota system to allow you to manage job queues, schedule jobs based on priorities, and use elastic quotas. ack-kube-queue can optimize the management and scheduling of AI/machine learning (ML) workloads and batch workloads in Kubernetes. This topic introduces ack-kube-queue and describes the usage notes and release notes for ack-kube-queue.

Introduction

AI/ML jobs and batch jobs in Kubernetes usually create a large number of pods, which increases the load on the scheduler. In addition, jobs submitted by different users may interfere with each other. ack-kube-queue provides all features of kube-queue to manage AI/ML workloads and batch workloads in Kubernetes. This component allows system admins to customize job queue management to improve the flexibility of queues. Combined with a quota system, ack-kube-queue can automate and optimize the management of workloads and resource quotas to maximize resource utilization in Kubernetes clusters.

Usage notes

ack-kube-queue is supported only by Container Service for Kubernetes (ACK) Pro clusters, ACK Serverless Pro clusters, and ACK Edge Pro clusters that run Kubernetes 1.18 or later.

You can install ack-kube-queue when you deploy the cloud-native AI suite or install it after the cloud-native AI suite is deployed. After you install ack-kube-queue, you can use features such as blocking queues and strict priority scheduling. For more information about how to install and use ack-kube-queue, see Use ack-kube-queue to manage job queues.
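The difference between a blocking queue and the default behavior, combined with strict priority ordering, can be illustrated with a small model. This is a simplified sketch and not the component's implementation: the `Job` type, the single `gpus` resource, the `dequeue` function, and the rule that a larger `priority` value is served first are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    priority: int  # assumption: larger value is served first
    gpus: int      # hypothetical single resource requested by the job


def dequeue(jobs, free_gpus, blocking):
    """Return the names of jobs admitted in one pass over the queue."""
    # Strict priority: highest priority first. sorted() is stable, so
    # jobs with equal priority keep their submission order.
    ordered = sorted(jobs, key=lambda j: -j.priority)
    admitted = []
    for job in ordered:
        if job.gpus <= free_gpus:
            free_gpus -= job.gpus
            admitted.append(job.name)
        elif blocking:
            # Blocking mode: the head job does not fit, so no job
            # behind it is allowed to pass.
            break
    return admitted


queue = [Job("small-a", 1, 1), Job("large", 5, 8), Job("small-b", 1, 2)]
print(dequeue(queue, free_gpus=4, blocking=True))   # []
print(dequeue(queue, free_gpus=4, blocking=False))  # ['small-a', 'small-b']
```

In the blocking case the free resources stay reserved until the large, high-priority job fits. In the non-blocking case the small jobs consume them first.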

Release notes

January 2024

| Version | Description | Release date | Impact |
| --- | --- | --- | --- |
| v0.3.4 | The head-of-line blocking that occasionally occurs in block mode when you delete the first task in the queue is fixed. | 2024-01-04 | This update has no impact on workloads. |

December 2023

| Version | Description | Release date | Impact |
| --- | --- | --- | --- |
| v0.3.3 | Setting blocking queues globally by using environment variables refreshes the blocking queue mode for all queues. | 2023-12-26 | This update has no impact on workloads. |

September 2023

| Version | Description | Release date | Impact |
| --- | --- | --- | --- |
| v0.3.1 | The queue errors that occasionally occur during QueueUnit deletion are fixed. | 2023-09-13 | This update has no impact on workloads. |
| v0.3.0 | Job sequence information can be retrieved from queues. | 2023-09-13 | This update has no impact on workloads. |

August 2023

| Version | Description | Release date | Impact |
| --- | --- | --- | --- |
| v0.2.1 | The issue that the NodeSelector in the template prevents scheduling on worker nodes is fixed. | 2023-08-31 | This update has no impact on workloads. |
| v0.2.0 | Message Passing Interface (MPI) jobs can be submitted by using Arena. Argo Workflows can be queued. The number of concurrently dequeued jobs can be limited by using kube-queue/max-jobs as the resource name in the ElasticQuotaTree. Logs for job dequeuing failures are optimized. | 2023-08-29 | This update has no impact on workloads. |
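The v0.2.0 note about kube-queue/max-jobs can be illustrated with a small generator for an ElasticQuotaTree manifest. Only the resource name kube-queue/max-jobs and the ElasticQuotaTree kind come from the note above. The apiVersion, the object name and namespace, the quota names, the CPU and memory values, the placement of the limit in the `max` map, and the helper name `quota_tree_with_max_jobs` are assumptions that should be checked against the ElasticQuotaTree definition installed in your cluster.

```python
import json


def quota_tree_with_max_jobs(max_jobs: int) -> dict:
    """Build an ElasticQuotaTree manifest that caps concurrently dequeued jobs."""
    return {
        # Assumed API group and version; verify against the installed CRD.
        "apiVersion": "scheduling.sigs.k8s.io/v1beta1",
        "kind": "ElasticQuotaTree",
        "metadata": {"name": "elasticquotatree", "namespace": "kube-system"},
        "spec": {
            "root": {
                "name": "root",
                "max": {"cpu": "40", "memory": "40Gi"},
                "min": {"cpu": "40", "memory": "40Gi"},
                "children": [
                    {
                        "name": "root.team-a",      # hypothetical quota name
                        "namespaces": ["team-a"],   # hypothetical namespace
                        "max": {
                            "cpu": "40",
                            "memory": "40Gi",
                            # Assumed placement: the job limit is listed
                            # like any other resource name.
                            "kube-queue/max-jobs": str(max_jobs),
                        },
                        "min": {"cpu": "10", "memory": "10Gi"},
                    }
                ],
            }
        },
    }


# JSON is valid input for kubectl, so the output can be saved to a file
# and reviewed before it is applied.
print(json.dumps(quota_tree_with_max_jobs(10), indent=2))
```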

July 2023

| Version | Description | Release date | Impact |
| --- | --- | --- | --- |
| v0.1.13 | The functional issue that may occur when the LastUpdateTime field is missing is fixed. | 2023-07-26 | This update has no impact on workloads. |
| v0.1.12 | A switch is added to configure the blocking queue feature for different queues. You can disable the re-queuing feature by setting the timeout parameter in the extension to 0. | 2023-07-20 | This update has no impact on workloads. |

June 2023

| Version | Description | Release date | Impact |
| --- | --- | --- | --- |
| v0.1.11 | The QueueUnit status is synchronized when tasks are updated. | 2023-06-30 | This update has no impact on workloads. |
| v0.1.10 | ARM-based nodes are supported by components such as kube-queue-controller, tf-operator-extension, and pytorch-operator-extension. | 2023-06-14 | This update has no impact on workloads. |

May 2023

| Version | Description | Release date | Impact |
| --- | --- | --- | --- |
| v0.1.9 | Jobs that remain pending for a long period of time can be resubmitted to the job queue, and multi-queue fair queuing is supported. If the pods created by a job remain pending for a long period of time due to topology-aware scheduling, node affinity, or resource fragmentation, ack-kube-queue reclaims the job and resubmits it to the queue. This releases the resource quota occupied by the job and improves overall resource quota utilization. | 2023-05-16 | This update has no impact on workloads. |
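The re-queuing behavior described for v0.1.9 can be modeled as follows. This is a minimal sketch under stated assumptions, not the component's code: the `DequeuedJob` type, the `requeue_stuck_jobs` function, the integer quota units, and the timeout expressed in seconds are hypothetical. Treating a timeout of 0 as "re-queuing disabled" follows the v0.1.12 note.

```python
from dataclasses import dataclass


@dataclass
class DequeuedJob:
    name: str
    quota: int            # quota units the job holds while dequeued
    pending_seconds: int  # how long its pods have been pending


def requeue_stuck_jobs(dequeued, queue, used_quota, timeout_seconds):
    """Move jobs that are pending longer than the timeout back to the queue.

    Returns the quota still in use afterwards.
    """
    if timeout_seconds == 0:
        return used_quota  # re-queuing is disabled
    for job in list(dequeued):  # iterate over a copy while removing
        if job.pending_seconds > timeout_seconds:
            dequeued.remove(job)
            queue.append(job.name)   # the job waits in the queue again
            used_quota -= job.quota  # its quota becomes available to others
    return used_quota


running = [DequeuedJob("train-a", 4, 30), DequeuedJob("train-b", 8, 900)]
queue = ["train-c"]
used = requeue_stuck_jobs(running, queue, used_quota=12, timeout_seconds=600)
print(used, queue)  # 4 ['train-c', 'train-b']
```

After the call, the quota that train-b held can be granted to another queued job instead of sitting idle behind pods that cannot be scheduled.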

April 2023

| Version | Description | Release date | Impact |
| --- | --- | --- | --- |
| v0.1.8 | Blocking queues and strict priority scheduling are supported. For more information, see Enable blocking queues and Enable strict priority scheduling. | 2023-04-25 | This update has no impact on workloads. |

March 2023

| Version | Description | Release date | Impact |
| --- | --- | --- | --- |
| v0.1.6 | The issue that the status of TensorFlow jobs is not displayed is fixed. | 2023-03-15 | This update has no impact on workloads. |

February 2023

| Version | Description | Release date | Impact |
| --- | --- | --- | --- |
| v0.1.5 | The issue that ack-kube-queue occasionally fails to delete jobs is fixed. | 2023-02-28 | This update has no impact on workloads. |
| v0.1.4 | The issue that the Used information is occasionally lost after a job queue unit is dequeued is fixed. | 2023-02-14 | This update has no impact on workloads. |

January 2023

| Version | Description | Release date | Impact |
| --- | --- | --- | --- |
| v0.1.3 | The issue that job queue units are occasionally lost is fixed. | 2023-01-12 | This update has no impact on workloads. |
| v0.1.2 | The issue that jobs occasionally cannot be dequeued for a long period of time is fixed. | 2023-01-12 | This update has no impact on workloads. |
| v0.1.1 | Multi-queue is supported. Jobs with different resource quotas are submitted to different queues to avoid congestion. | 2023-01-10 | This update has no impact on workloads. |

October 2022

| Version | Description | Release date | Impact |
| --- | --- | --- | --- |
| v0.1.0 | This is the first release. | 2022-10-15 | This update has no impact on workloads. |