ack-kube-queue is designed to manage AI, machine learning, and batch workloads in Kubernetes. It allows system administrators to customize job queue management and improve the flexibility of queues. Integrated with a quota system, ack-kube-queue automates and optimizes the management of workloads and resource quotas to maximize resource utilization in Kubernetes clusters. This topic describes how to install and use ack-kube-queue.
Limits
Only Container Service for Kubernetes (ACK) Pro clusters whose Kubernetes versions are 1.18.aliyun.1 or later are supported.
Install ack-kube-queue
This section describes how to install ack-kube-queue in two scenarios.
Scenario 1: The cloud-native AI suite is not installed
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side pane, choose Applications > Cloud-native AI Suite.
In the lower part of the Cloud-native AI Suite page, click Deploy.
In the Scheduling section, select Kube Queue. In the Interactive Mode section, select Arena. In the lower part of the page, click Deploy Cloud-native AI Suite.
Scenario 2: The cloud-native AI suite is installed
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side pane, choose Applications > Cloud-native AI Suite.
Install ack-arena and ack-kube-queue.
On the Cloud-native AI Suite page, find ack-arena and click Deploy in the Actions column. In the Parameters panel, click OK.
On the Cloud-native AI Suite page, find ack-kube-queue and click Deploy in the Actions column. In the message that appears, click Confirm.
After ack-arena and ack-kube-queue are installed, Deployed is displayed in the Status column of the Components section.
Supported job types
ack-kube-queue supports TensorFlow jobs, PyTorch jobs, MPI jobs, Argo workflows, Ray jobs, Spark applications, and Kubernetes-native jobs.
Limits
You must use the Operator provided by ack-arena to submit TensorFlow jobs, PyTorch jobs, and MPI jobs to a queue.
To submit Kubernetes-native jobs to a queue, make sure that the Kubernetes version of the cluster is 1.22 or later.
You can submit MPI jobs to a queue only by using Arena.
You can submit only entire Argo workflows to a queue. You cannot submit individual steps of an Argo workflow to a queue. You can add the following annotation to declare the resources requested by an Argo workflow:
```
annotations:
  kube-queue/min-resources: |
    cpu: 5
    memory: 5G
```
Submit different types of jobs to a queue
By default, ack-kube-queue supports TensorFlow jobs and PyTorch jobs. You can change the job types supported by ack-kube-queue as needed.
Versions before v0.4.0
The kube-queue feature for each job type is controlled by a separate Deployment. To disable the feature for a job type, set the number of replicas of the corresponding Deployment in the kube-queue namespace to 0.
Versions v0.4.0 and later
Except for Argo workflows, the kube-queue feature for the other job types is controlled by the Job-Extension component. You can enable or disable the feature for a job type by modifying the value of --enabled-extensions in the startup command. Separate job types with commas (,). The following table describes the job types and the names used in the command.
| Job type | Name in the command |
| --- | --- |
| TfJob | tfjob |
| PytorchJob | pytorchjob |
| Job | job |
| SparkApplication | sparkapp |
| RayJob | rayjob |
| RayJob (v1alpha1) | rayjobv1alpha1 |
| MpiJob | mpiv1 |
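For illustration, the following sketch shows how the flag might appear in the container spec of the Job-Extension Deployment in the kube-queue namespace. The container name and image below are placeholders, not the component's actual values:

```
containers:
- name: job-extension          # placeholder container name
  image: <job-extension-image> # placeholder image
  args:
  # Enable queuing only for TfJob, PytorchJob, and Kubernetes-native Job.
  - --enabled-extensions=tfjob,pytorchjob,job
```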
Submit TensorFlow jobs, PyTorch jobs, and MPI jobs to a queue
You must add the scheduling.x-k8s.io/suspend="true"
annotation to a job. The following sample code submits a TensorFlow job to a queue.
```
apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "job1"
  annotations:
    scheduling.x-k8s.io/suspend: "true"
spec:
  ...
```
Submit Kubernetes-native jobs to a queue
You must set the suspend field of the job to true. The following sample code submits a Kubernetes-native job to a queue.
```
apiVersion: batch/v1
kind: Job
metadata:
  generateName: pi-
spec:
  suspend: true
  completions: 1
  parallelism: 1
  template:
    spec:
      schedulerName: default-scheduler
      containers:
      - name: pi
        image: perl:5.34.0
        command: ["sleep", "3s"]
        resources:
          requests:
            cpu: 100m
          limits:
            cpu: 100m
      restartPolicy: Never
```
In the preceding example, the job requests 100m of CPU resources and is queued. When the job is dequeued, the suspend field of the job is set to false and the job is executed by the cluster.
Submit Argo workflows to a queue
Install ack-workflow from the marketplace in the console.
You must add a custom suspend template named kube-queue-suspend to the Argo workflow and set the suspend field to true when you submit the workflow. Example:
```
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: $example-name
spec:
  suspend: true # Set this field to true.
  entrypoint: $example-entrypoint
  templates:
  # Add a suspend template named kube-queue-suspend.
  - name: kube-queue-suspend
    suspend: {}
  - name: $example-entrypoint
    ...
```
Submit Spark applications to a queue
Install ack-spark-operator from the marketplace in the console.
You must add the scheduling.x-k8s.io/suspend="true"
annotation to a Spark application.
```
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  generateName: spark-pi-suspend-
  namespace: spark-operator
  annotations:
    scheduling.x-k8s.io/suspend: "true"
spec:
  type: Scala
  mode: cluster
  image: registry-cn-beijing.ack.aliyuncs.com/acs/spark:v3.1.1
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar
  sparkVersion: "3.1.1"
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    serviceAccount: ack-spark-operator3.0-spark
  executor:
    cores: 1
    instances: 1
    memory: "512m"
```
Submit Ray jobs to a queue
Install ack-kuberay-operator from the marketplace in the console.
You must set the spec.suspend
field of a Ray job to true when you submit the Ray job to a queue.
```
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample
spec:
  entrypoint: python /home/ray/samples/sample_code.py
  runtimeEnvYAML: |
    pip:
      - requests==2.26.0
      - pendulum==2.1.2
    env_vars:
      counter_name: "test_counter"
  # Suspend specifies whether the RayJob controller should create a RayCluster instance.
  # If a job is applied with the suspend field set to true, the RayCluster will not be created and we will wait for the transition to false.
  # If the RayCluster is already created, it will be deleted. In the case of transition to false, a new RayCluster will be created.
  suspend: true
  rayClusterSpec:
    rayVersion: '2.9.0' # should match the Ray version in the image of the containers
    headGroupSpec:
      rayStartParams:
        dashboard-host: '0.0.0.0'
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray:2.9.0
            ports:
            - containerPort: 6379
              name: gcs-server
            - containerPort: 8265 # Ray dashboard
              name: dashboard
            - containerPort: 10001
              name: client
            resources:
              limits:
                cpu: "1"
              requests:
                cpu: "200m"
            volumeMounts:
            - mountPath: /home/ray/samples
              name: code-sample
          volumes:
          # You set volumes at the Pod level, then mount them into containers inside that Pod
          - name: code-sample
            configMap:
              # Provide the name of the ConfigMap you want to mount.
              name: ray-job-code-sample
              # An array of keys from the ConfigMap to create as files
              items:
              - key: sample_code.py
                path: sample_code.py
    workerGroupSpecs:
    - replicas: 1
      minReplicas: 1
      maxReplicas: 5
      groupName: small-group
      rayStartParams: {}
      template:
        spec:
          containers:
          - name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name' or '123-abc')
            image: rayproject/ray:2.9.0
            lifecycle:
              preStop:
                exec:
                  command: [ "/bin/sh","-c","ray stop" ]
            resources:
              limits:
                cpu: "1"
              requests:
                cpu: "200m"
```
Change the type of quota system
If a cluster is shared by multiple users, you must allocate a fixed amount of resources to each user to prevent the users from competing for resources. The traditional method is to use Kubernetes resource quotas to allocate a fixed amount of resources to each user. However, the resource requirements may differ among users. By default, ack-kube-queue uses elastic quotas to improve the overall resource utilization. For more information about elastic quotas, see ElasticQuota. If you want to use Kubernetes resource quotas instead, perform the following steps:
Run the following command to switch from elastic quotas to Kubernetes resource quotas:
```
kubectl edit deploy kube-queue-controller -n kube-queue
```
Change the value of the QueueGroupPlugin environment variable from elasticquota to resourcequota:
```
env:
- name: QueueGroupPlugin
  value: resourcequota
```
Save the file after you modify the configuration. Wait for kube-queue-controller to start up. Then, use Kubernetes resource quotas to allocate resources.
Enable blocking queues
Like kube-scheduler, ack-kube-queue processes jobs in a round-robin manner by default. The jobs in a queue request resources one after another. Jobs that fail to obtain resources are moved to the Unschedulable queue to wait for the next round of scheduling. When a cluster contains a large number of jobs with low resource demand, processing these jobs in a round-robin manner is time-consuming, and jobs with high resource demand are more likely to remain pending because the jobs with low resource demand keep acquiring the available resources. To avoid this problem, ack-kube-queue provides blocking queues. After you enable this feature, only the job at the head of the queue is scheduled, which makes it easier for jobs with high resource demand to obtain resources.
Procedure
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side pane, choose Workloads > Deployments.
Set the Namespace parameter to kube-queue. Then, click Edit in the Actions column of kube-queue-controller.
Click Add to the right of Environment Variable and add the following environment variable.
| Parameter | Value |
| --- | --- |
| Type | Custom |
| Variable Key | StrictPriority |
| Value/ValueFrom | true |
Click Update on the right side of the page. In the message that appears, click Confirm.
Enable strict priority scheduling
By default, jobs that fail to obtain resources are moved to the Unschedulable queue to wait for the next round of scheduling. When a job is completed, the resources that it occupied are released. However, these resources are not scheduled to high-priority jobs that are still in the Unschedulable queue. Consequently, idle resources may be scheduled to jobs with low priorities. To preferentially schedule idle resources to jobs with high priorities, ack-kube-queue provides the strict priority scheduling feature. After a job releases resources, the system attempts to schedule the resources to the job with the highest priority, which is the first job in the queue. This ensures that idle resources are preferentially scheduled to jobs with high priorities.
Jobs with low priorities can compete for idle resources when the idle resources are insufficient to fulfill jobs with high priorities.
Procedure
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side pane, choose Workloads > Deployments.
Set the Namespace parameter to kube-queue. Then, click Edit in the Actions column of kube-queue-controller.
Click Add to the right of Environment Variable and add the following environment variable.
| Parameter | Value |
| --- | --- |
| Type | Custom |
| Variable Key | StrictConsistency |
| Value/ValueFrom | true |
Click Update on the right side of the page. In the message that appears, click Confirm.
Use case of resource quotas
ElasticQuota
Use the following YAML template to create an ElasticQuotaTree:
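The template itself is not included above. The following is a minimal sketch of an ElasticQuotaTree consistent with the rest of this example, which allocates four vCores and 4 GiB of memory to the default namespace. The apiVersion, field layout, and child name are assumptions based on common ElasticQuota conventions; verify them against the ElasticQuota documentation referenced earlier before use:

```
apiVersion: scheduling.sigs.k8s.io/v1beta1  # assumed apiVersion
kind: ElasticQuotaTree
metadata:
  name: elasticquotatree
  namespace: kube-system
spec:
  root:
    name: root          # root quota of the tree
    max:
      cpu: 4
      memory: 4Gi
    min:
      cpu: 4
      memory: 4Gi
    children:
    - name: child       # placeholder child name
      max:
        cpu: 4
        memory: 4Gi
      min:
        cpu: 4
        memory: 4Gi
      namespaces:       # the quota applies to the default namespace
      - default
```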
Run the following command to check whether the ElasticQuotaTree is created:
kubectl get elasticquotatree -A
Expected output:
```
NAMESPACE     NAME               AGE
kube-system   elasticquotatree   7s
```
The ElasticQuotaTree is created.
Create jobs.
Note: To test the job queue management feature of ack-kube-queue, the resource quota must be less than the total amount of resources requested by the jobs that you create.
To simplify the test, the TensorFlow image used by the TensorFlow jobs is replaced by the BusyBox image. Each container sleeps for 30 seconds to simulate the training process.
Use the following YAML template to create two TensorFlow jobs:
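The job manifests are not included above. The following is a sketch of job1 consistent with the description in this example: one parameter server pod and two worker pods, each requesting one vCore, with the TensorFlow image replaced by BusyBox and each container sleeping for 30 seconds. job2 would be identical except for its name. The exact field values are illustrative:

```
apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "job1"
  annotations:
    scheduling.x-k8s.io/suspend: "true"   # submit the job to a queue
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1                         # one parameter server pod
      restartPolicy: Never
      template:
        spec:
          containers:
          - name: tensorflow
            image: busybox                # BusyBox replaces the TensorFlow image
            command: ["sleep", "30s"]     # simulate the training process
            resources:
              requests:
                cpu: 1
              limits:
                cpu: 1
    Worker:
      replicas: 2                         # two worker pods
      restartPolicy: Never
      template:
        spec:
          containers:
          - name: tensorflow
            image: busybox
            command: ["sleep", "30s"]
            resources:
              requests:
                cpu: 1
              limits:
                cpu: 1
```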
Query the status of the jobs after you submit the jobs.
Run the following command to query the status of the jobs:
kubectl get tfjob
Expected output:
```
NAME   STATE     AGE
job1   Running   3s
job2   Queuing   2s
```
The output indicates that job1 is in the Running state and job2 is in the Queuing state. This is because each TensorFlow job requests three vCores but the ElasticQuotaTree that you created allocates four vCores to the default namespace. Consequently, the two TensorFlow jobs cannot run at the same time.
Wait a period of time and run the following command again:
kubectl get tfjob
Expected output:
```
NAME   STATE       AGE
job1   Succeeded   77s
job2   Running     77s
```
The output indicates that job1 is completed. After job1 is completed, job2 starts to run. This shows that ack-kube-queue manages job queues as expected.
ResourceQuota
Use the following YAML template to create a ResourceQuota:
```
apiVersion: v1
kind: ResourceQuota
metadata:
  name: default
spec:
  hard:
    cpu: "4"
    memory: 4Gi
```
Run the following command to check whether the ResourceQuota is created:
kubectl get resourcequota default -o wide
Expected output:
```
NAME      AGE   REQUEST                   LIMIT
default   76s   cpu: 0/4, memory: 0/4Gi
```
The ResourceQuota is created.
Use the following YAML template to create two TensorFlow jobs:
After the two jobs are submitted, run the following command to query the status of the jobs:
```
kubectl get tfjob
NAME   STATE     AGE
job1   Running   5s
job2   Queuing   5s

kubectl get pods
NAME            READY   STATUS    RESTARTS   AGE
job1-ps-0       1/1     Running   0          8s
job1-worker-0   1/1     Running   0          8s
job1-worker-1   1/1     Running   0          8s
```
job1 is in the Running state and job2 is in the Queuing state. This is because each TensorFlow job requests three vCores: one vCore for the parameter server pod and one vCore for each of the two worker pods. However, the ResourceQuota that you created allows only four vCores in the default namespace. Consequently, the two TensorFlow jobs cannot run at the same time. The result indicates that ack-kube-queue manages job queues as expected.
Wait a period of time and then run the following command:
```
kubectl get tfjob
NAME   STATE       AGE
job1   Succeeded   77s
job2   Running     77s

kubectl get pods
NAME            READY   STATUS      RESTARTS   AGE
job1-worker-0   0/1     Completed   0          54s
job1-worker-1   0/1     Completed   0          54s
job2-ps-0       1/1     Running     0          22s
job2-worker-0   1/1     Running     0          22s
job2-worker-1   1/1     Running     0          21s
```
job1 is completed. job2 starts to run after job1 is completed. The result indicates that ack-kube-queue manages job queues as expected.
Limit the number of jobs that can be concurrently dequeued
In scenarios where an application is automatically scaled, the amount of resources required by the application may be unpredictable. In this case, you can limit the number of jobs that can be concurrently dequeued. To do this, define a kube-queue/max-jobs resource in the ElasticQuotaTree. After the limit is set, the number of queue units that can be dequeued below the quota cannot exceed the maximum number of jobs multiplied by the overcommitment ratio. Example:
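The example is not included above. The following sketch shows an ElasticQuotaTree in which the quota declares kube-queue/max-jobs: 2, so that at most two jobs (multiplied by the overcommitment ratio) from the default namespace can be dequeued concurrently. As with the quota values, the apiVersion and field layout are assumptions based on common ElasticQuota conventions:

```
apiVersion: scheduling.sigs.k8s.io/v1beta1  # assumed apiVersion
kind: ElasticQuotaTree
metadata:
  name: elasticquotatree
  namespace: kube-system
spec:
  root:
    name: root
    max:
      cpu: 4
      memory: 4Gi
      kube-queue/max-jobs: 2   # limit on concurrently dequeued jobs
    min:
      cpu: 4
      memory: 4Gi
    children:
    - name: child              # placeholder child name
      max:
        cpu: 4
        memory: 4Gi
        kube-queue/max-jobs: 2
      min:
        cpu: 4
        memory: 4Gi
      namespaces:
      - default
```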