ack-kube-queue is a job queue system for ACK clusters, designed to optimize the management and resource utilization of AI/ML and batch processing workloads. It helps system administrators improve cluster resource utilization and job execution efficiency through flexible job queue management, automatic workload allocation optimization, and resource quota management. This topic describes how to install and configure ack-kube-queue and how to submit jobs.
Limits
ack-kube-queue supports only ACK managed clusters, ACK Edge clusters, and ACK Lingjun clusters of version 1.18 or later.
Install the ack-kube-queue component
ACK managed clusters and ACK Edge clusters
Cloud-native AI suite not deployed
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side pane, go to the Cloud-native AI Suite page.
On the Cloud-native AI Suite page, click One-click Deployment at the bottom.
In the Scheduling area, select Kube Queue. In the Interaction Mode area, select Arena. Then click Deploy Cloud-native AI Suite at the bottom of the page.
Cloud-native AI suite deployed
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side pane, go to the Cloud-native AI Suite page.
Install ack-arena and ack-kube-queue.
On the Cloud-native AI Suite page, find the ack-arena component and click Deploy in the Operation column. On the Parameter Configuration page, click OK.
On the Cloud-native AI Suite page, find the ack-kube-queue component and click Deploy in the Operation column. In the dialog box that appears, click OK.
After ack-arena and ack-kube-queue are installed, the Status of each component in the Component List is Deployed.
ACK Lingjun clusters
Log on to the ACK console. In the left-side navigation pane, go to the Application Marketplace page.
On the Application Marketplace page, search for ack-kube-queue, and then click the application.
On the product page, click One-click Deployment in the upper-right corner to open the Basic Information page. Specify the target cluster, namespace, and release name, and then click Next.
On the Parameter Configuration page, select the latest Chart Version, and then click OK.
Configure the ack-kube-queue component
ack-kube-queue supports queuing for various job types, including TfJob, PytorchJob, MpiJob, Argo Workflow, RayJob, SparkApplication, and native Job. By default, ack-kube-queue enables queuing only for native Jobs (batch/v1 Job). You can enable or disable queuing for any job type as needed.
Limits
TfJob, PytorchJob, and MpiJob require the Operator provided by the ack-arena component.
Queuing for native Jobs requires a cluster of version 1.22 or later.
MpiJobs can currently be submitted only through Arena.
For Argo Workflow, queuing is currently supported only for the entire workflow. To specify the resources that the Workflow requires, declare the following annotation.
...
  annotations:
    kube-queue/min-resources: |
      cpu: 5
      memory: 5G
...
Enable job type support
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side pane, go to the page that lists the applications installed in the cluster.
Find the ack-kube-queue application, and click Update in the Operation column on its right.
To enable support for related workloads, modify the YAML content according to the table.
| Configuration item | Description |
| --- | --- |
| extension.argo.enable set to true | Enable Argo Workflow support. |
| extension.mpi.enable set to true | Enable MpiJob support. |
| extension.ray.enable set to true | Enable RayJob support. |
| extension.spark.enable set to true | Enable SparkApplication support. |
| extension.tf.enable set to true | Enable TfJob support. |
| extension.pytorch.enable set to true | Enable PytorchJob support. |
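As a rough sketch, the dotted configuration items above can be read as nested YAML keys. The nesting below is an assumption derived from the key paths and is not copied from the component's actual values; the YAML shown in the Update panel may contain other fields, which should be left unchanged.

```yaml
# Assumed layout: each dotted key expands into nested keys.
# Set a job type to true to enable queuing for it, or to false to disable it.
extension:
  argo:
    enable: true     # Argo Workflow
  mpi:
    enable: true     # MpiJob
  ray:
    enable: true     # RayJob
  spark:
    enable: true     # SparkApplication
  tf:
    enable: true     # TfJob
  pytorch:
    enable: true     # PytorchJob
```

Only the job types that you actually plan to queue need to be enabled.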
Submit jobs
Submit TfJob, PytorchJob, MpiJob
Add the scheduling.x-k8s.io/suspend="true" annotation to the job. The following example uses a TfJob.
apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "job1"
  annotations:
    scheduling.x-k8s.io/suspend: "true"
spec:
  ...
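For orientation, a more complete manifest might look like the following. This is a minimal sketch and not the original example: the spec body is an assumption based on the Kubeflow TFJob schema, and the image, command, and resource values are hypothetical placeholders.

```yaml
apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "job1"
  annotations:
    scheduling.x-k8s.io/suspend: "true"   # marks the job for queuing
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
          - name: tensorflow                      # TFJob expects this container name
            image: tensorflow/tensorflow:2.11.0   # hypothetical image
            command: ["python", "-c", "import tensorflow as tf; print(tf.__version__)"]
            resources:
              requests:
                cpu: "1"                          # hypothetical request
                memory: 2Gi
```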
Submit native Job
Set the suspend field of the Job to true. The following example generates a queuing unit that requires 100m CPU. When the queuing unit is dequeued, the suspend field of the Job is changed to false, and the cluster components start running the Job.
apiVersion: batch/v1
kind: Job
metadata:
  generateName: pi-
spec:
  suspend: true
  ...
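A fuller version of such a Job could look like the sketch below. The pod template is an assumption added for illustration: the container name, image, and command are hypothetical, and only the suspend field and the 100m CPU request come from the description above.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  generateName: pi-
spec:
  suspend: true                  # the Job stays suspended until it is dequeued
  template:
    spec:
      containers:
      - name: pi                 # hypothetical container name
        image: perl:5.34.0       # hypothetical image
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(1000)"]
        resources:
          requests:
            cpu: 100m            # the queuing unit requires 100m CPU
      restartPolicy: Never
```

Because the manifest uses generateName instead of name, submit it with kubectl create -f rather than kubectl apply -f.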
Submit Argo Workflow
Install the Argo Workflows component from the Application Marketplace in advance.
Add a custom template named kube-queue-suspend of type suspend to the Argo Workflow, and set the suspend field to true when you submit the Workflow. For example:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: $example-name
spec:
  suspend: true # This item needs to be set to true.
  entrypoint: $example-entrypoint
  templates:
  # A suspend template named kube-queue-suspend needs to be added.
  - name: kube-queue-suspend
    suspend: {}
  - name: $example-entrypoint
    ...
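The following sketch combines the suspend settings with the kube-queue/min-resources annotation described in the limits. It assumes that the annotation belongs in the Workflow's metadata.annotations; the name prefix, entrypoint name, image, and command are hypothetical placeholders.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: queue-demo-        # hypothetical name prefix
  annotations:
    # Resources required by the whole Workflow (assumed placement).
    kube-queue/min-resources: |
      cpu: 5
      memory: 5G
spec:
  suspend: true                    # required so that the Workflow waits in the queue
  entrypoint: main                 # hypothetical entrypoint name
  templates:
  # Suspend template required by ack-kube-queue.
  - name: kube-queue-suspend
    suspend: {}
  - name: main
    container:
      image: busybox               # hypothetical image
      command: ["echo", "hello"]
```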
Submit SparkApplication
Install the ack-spark-operator component from the Application Marketplace in advance.
When you submit a SparkApplication, add the scheduling.x-k8s.io/suspend="true" annotation to it.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  generateName: spark-pi-suspend-
  namespace: spark-operator
  annotations:
    scheduling.x-k8s.io/suspend: "true"
spec:
  ...
Submit RayJob
Install the managed Kuberay-Operator component on the component management page of the cluster in advance. For details, see Manage components.
When you submit a RayJob, set the spec.suspend field to true.
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample
spec:
  # Suspend specifies whether the RayJob controller should create a RayCluster instance.
  # If a job is applied with the suspend field set to true, the RayCluster will not be created and we will wait for the transition to false.
  # If the RayCluster is already created, it will be deleted. In the case of transition to false, a new RayCluster will be created.
  suspend: true
  ...