Job distribution overview

Fleet instances of Distributed Cloud Container Platform for Kubernetes (ACK One) provide the job distribution feature for multi-cluster and hybrid cloud scenarios, enabling centralized scheduling and distribution of AI workloads. When a single Container Service for Kubernetes (ACK) cluster cannot meet the resource requirements for large-scale AI training or inference tasks, or when there are idle resources across multiple ACK clusters, this feature lets you distribute jobs across these clusters to optimize resource utilization.

Features

ACK One multi-cluster job distribution offers the following capabilities:

Multiple job types: Supports PyTorchJob, SparkApplication, and TFJob jobs.
Multi-cluster gang scheduling: Distribute jobs across clusters through resource pre-allocation or dynamic resource checks, ensuring successful task deployment to sub-clusters and improving overall scheduling efficiency.
Multi-tenant quota management: Enforce resource limits per tenant using ElasticQuotaTree-based namespace quotas in multi-tenant environments.
Priority-based scheduling: Prioritize important tasks for resource allocation based on PriorityClass defined in PodTemplate for AI jobs.
Multiple task queuing policies: Configure queue policies that favor either cluster utilization or task priority guarantees, with both blocking and non-blocking scheduling patterns supported.
Job rescheduling on failure: The Global Scheduler automatically reclaims failed jobs and reschedules them to eligible clusters with sufficient resources.
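
To make the difference between blocking and non-blocking queuing concrete, the following Python sketch models a single scheduling pass over a priority-ordered queue. It is a toy illustration, not the ACK One implementation: the Job class, the dequeue function, the GPU counts, and the reading of "blocking" as "a job that does not fit holds back all lower-priority jobs" are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: int  # hypothetical field: higher value is considered first
    gpus: int      # hypothetical field: total GPUs the job needs


def dequeue(jobs, free_gpus, blocking):
    """Return the names of the jobs admitted in one scheduling pass."""
    admitted = []
    # sorted() is stable, so jobs with equal priority keep submission order.
    for job in sorted(jobs, key=lambda j: -j.priority):
        if job.gpus <= free_gpus:
            free_gpus -= job.gpus
            admitted.append(job.name)
        elif blocking:
            # Blocking: a job that does not fit stops the pass, so
            # lower-priority jobs cannot overtake it (priority guarantee).
            break
        # Non-blocking: skip the job and keep filling idle capacity
        # with smaller jobs (utilization first).
    return admitted


jobs = [
    Job("train-large", priority=100, gpus=8),
    Job("train-small", priority=50, gpus=2),
    Job("infer", priority=10, gpus=1),
]

print(dequeue(jobs, free_gpus=4, blocking=True))   # []
print(dequeue(jobs, free_gpus=4, blocking=False))  # ['train-small', 'infer']
```

With 4 free GPUs, the blocking pass admits nothing because the highest-priority job needs 8, while the non-blocking pass skips it and admits the two smaller jobs. This is the trade-off between guaranteeing priority order and keeping clusters busy.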

How it works

Job submission: Submit a PyTorchJob, SparkApplication, or TFJob, together with a PropagationPolicy that defines the distribution policy, to the Fleet instance.
Priority and quota validation: The Fleet instance performs capacity scheduling based on job priorities and tenant quotas.
Global scheduling: The Global Scheduler in the Fleet instance applies multi-cluster dynamic resource scheduling and gang scheduling for dequeued jobs, reserving resources or dynamically checking for eligible clusters. If scheduling fails, the job is re-queued.
Job distribution: Successfully scheduled jobs are propagated to designated ACK clusters.
Failure retry: If a job fails in a sub-cluster, the Global Scheduler reclaims and reschedules it to other eligible clusters.
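
The workflow above can be sketched as a small Python simulation. This is a simplified model under stated assumptions, not the Global Scheduler's actual algorithm: the function names, the dictionary layout, the cluster names, and the use of a single GPU count as the only resource are all hypothetical. The gang-scheduling rule is modeled as all-or-nothing: a cluster is eligible only if it can host every replica of the job.

```python
def pick_cluster(free, needed, exclude=()):
    """Gang-style check: a cluster qualifies only if the whole job fits."""
    for name in sorted(free):
        if name not in exclude and free[name] >= needed:
            return name
    return None


def schedule(job, free, quota_left):
    """Return the chosen cluster, or None if the job must be re-queued."""
    needed = job["replicas"] * job["gpus_per_replica"]
    if quota_left[job["tenant"]] < needed:
        return None  # the tenant quota would be exceeded
    cluster = pick_cluster(free, needed)
    if cluster is None:
        return None  # no single cluster can host all replicas
    free[cluster] -= needed               # reserve resources
    quota_left[job["tenant"]] -= needed   # charge the tenant
    return cluster


def reschedule_after_failure(job, failed_cluster, free):
    """Reclaim the job's resources and try another eligible cluster."""
    needed = job["replicas"] * job["gpus_per_replica"]
    free[failed_cluster] += needed
    target = pick_cluster(free, needed, exclude={failed_cluster})
    if target is not None:
        free[target] -= needed
    return target


# Hypothetical fleet state: free GPUs per cluster, remaining quota per tenant.
free = {"cluster-a": 4, "cluster-b": 8, "cluster-c": 8}
quota_left = {"team-1": 16}
job = {"tenant": "team-1", "replicas": 4, "gpus_per_replica": 2}

first = schedule(job, free, quota_left)
print(first)   # cluster-b
second = reschedule_after_failure(job, first, free)
print(second)  # cluster-c
```

The job needs 8 GPUs in total, so cluster-a (4 free) is never eligible. The job lands on cluster-b first, and after a simulated failure there its resources are reclaimed and it moves to cluster-c.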