The gang scheduling feature provided by Container Service for Kubernetes (ACK) is developed on top of the new kube-scheduler framework. Gang scheduling ensures that a group of correlated pods is scheduled at the same time. If the scheduling requirements are not met, none of the pods is scheduled. Gang scheduling provides a solution to job scheduling in all-or-nothing scenarios. It is suitable for distributed applications, such as Spark and Hadoop jobs, that require all tasks of a big data computing job to be scheduled at the same time or to share resources. This topic describes how to enable gang scheduling.
Usage notes
Make sure that the resource capacity of the elastic node pool that you use and the node labels meet the requirements for pod scheduling. Otherwise, pods may fail to be scheduled to the nodes in the node pool.
Prerequisites
An ACK Pro cluster that runs Kubernetes 1.16 or later is created. For more information, see Create an ACK Pro cluster.
Important: Gang scheduling is available only for ACK Pro clusters. To enable gang scheduling in ACK dedicated clusters, submit a ticket.
Background information
Kubernetes is widely used for online service orchestration. ACK aims to use Kubernetes as a platform for the centralized management of online services and offline jobs, which improves the resource utilization and performance of clusters. However, kube-scheduler cannot meet the requirements of specific offline workloads that are migrated to Kubernetes clusters. For example, if a job requires all-or-nothing scheduling, all tasks of the job must be scheduled at the same time. If only some of the tasks are started, the started tasks must wait until all remaining tasks are scheduled. If each submitted job contains unscheduled tasks, all submitted jobs remain in the Pending state and the cluster is deadlocked. To avoid this situation, you must enable gang scheduling for kube-scheduler.
Gang scheduling is a scheduling algorithm that schedules multiple correlated processes to different processors in a parallel system and simultaneously starts these processes. Gang scheduling aims to start all correlated processes at the same time. This ensures that the process group is not stuck when the system fails to start some processes. For example, if you submit a batch job that contains multiple tasks, either all of the tasks are scheduled or none of them is scheduled. Task scheduling in all-or-nothing scenarios is known as gang scheduling.
In ACK, a PodGroup is a group of pods that need to be scheduled at the same time. When you submit a job that requires all-or-nothing scheduling, you can add labels to pods. The labels specify the name of the PodGroup to which the job belongs and the minimum number of tasks that must be scheduled to run the job. kube-scheduler schedules tasks based on the minimum number of tasks that must be scheduled. The tasks are scheduled only when the cluster resources are sufficient to schedule the required number of tasks. Otherwise, the job remains in the Pending state.
How to enable gang scheduling
To enable gang scheduling, set min-available and name by adding labels to the pods. When this method is used, kube-scheduler automatically creates a PodGroup named after the value of the pod-group.scheduling.sigs.k8s.io/name label. You must set the value of pod-group.scheduling.sigs.k8s.io/name to a valid subdomain name. For more information, see Object names and IDs.

    labels:
      pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu
      pod-group.scheduling.sigs.k8s.io/min-available: "3"

name: the name of a PodGroup.
min-available: the minimum number of pods that must be scheduled to run a job.
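For reference, the following is a minimal sketch of a pod that carries these labels. The pod name, container name, and image are illustrative placeholders and are not part of this topic's example.

    apiVersion: v1
    kind: Pod
    metadata:
      name: demo-worker-0                             # illustrative name
      labels:
        pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu
        pod-group.scheduling.sigs.k8s.io/min-available: "3"
    spec:
      containers:
      - name: worker                                  # illustrative container
        image: registry.example.com/demo:latest       # hypothetical image

kube-scheduler does not schedule any pod in this group until at least three pods that carry the same name label can be scheduled.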
You can also use one of the following methods to enable gang scheduling. In clusters that run Kubernetes 1.22 or later, the kube-scheduler version must be later than 1.xx.xx-aliyun-4.0.

Create a PodGroup and use the pod-group.scheduling.sigs.k8s.io or pod-group.scheduling.sigs.k8s.io/name label to specify the PodGroup to which your pods belong. The pods and the PodGroup must belong to the same namespace.

Important: Since version 1.31, ACK no longer supports the PodGroup resource of version scheduling.sigs.k8s.io/v1alpha1. It supports only the PodGroup resource of version scheduling.x-k8s.io/v1alpha1.

    # PodGroup CRD spec
    apiVersion: scheduling.sigs.k8s.io/v1alpha1
    kind: PodGroup
    metadata:
      name: nginx
    spec:
      scheduleTimeoutSeconds: 10
      minMember: 3
    ---
    # Add the pod-group.scheduling.sigs.k8s.io/name label to the pods.
    labels:
      pod-group.scheduling.sigs.k8s.io/name: nginx
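For example, the nginx PodGroup above can be paired with a workload whose pod template carries the name label. The following Deployment is a minimal sketch; the Deployment name, replica count, and image tag are illustrative.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nginx                    # illustrative; must be in the same namespace as the PodGroup
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: nginx
      template:
        metadata:
          labels:
            app: nginx
            pod-group.scheduling.sigs.k8s.io/name: nginx   # must match the PodGroup name
        spec:
          containers:
          - name: nginx
            image: nginx:1.25        # hypothetical image tag

Because minMember is 3, kube-scheduler schedules the pods of this Deployment only when all three replicas can be scheduled.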
Add the min-available and name annotations to the configurations of the pods that you want to manage. The total-number and mode parameters in the Koordinator API are not supported.

    annotations:
      gang.scheduling.koordinator.sh/name: "gang-example"
      gang.scheduling.koordinator.sh/min-available: "2"
Pods that belong to the same PodGroup must be assigned the same priority.
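The following is a minimal sketch of a batch Job whose pod template carries the Koordinator annotations shown above. The Job name, parallelism, and image are illustrative placeholders.

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: gang-example             # illustrative name
    spec:
      parallelism: 2
      completions: 2
      template:
        metadata:
          annotations:
            gang.scheduling.koordinator.sh/name: "gang-example"
            gang.scheduling.koordinator.sh/min-available: "2"
        spec:
          restartPolicy: Never
          containers:
          - name: main
            image: registry.example.com/batch-task:latest   # hypothetical image

Because min-available is 2, neither pod of the Job is scheduled until both pods can be scheduled.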
Advanced gang scheduling configurations
Limits
To use advanced gang scheduling configurations in clusters that run Kubernetes 1.22 or later, the kube-scheduler version must be later than 1.xx.xx-aliyun-4.0.
Declare a GangGroup
When you use gang scheduling, some jobs may contain different roles that have different requirements on the value of the min-available parameter. For example, a PyTorch training job uses parameter servers and workers. If you use a single PodGroup to manage the pods of all roles, the min-available requirements of the individual roles may not be met at the same time. If you instead create a separate PodGroup for each role, the pods of the different roles are no longer scheduled in one batch. To resolve this issue, we recommend that you use the GangGroup feature to manage multiple gangs as a group. The job can run only when the number of scheduled pods reaches the min-available value of each gang. This ensures that the min-available requirements of all roles are met. A sketch is provided after the following list.
If you use labels to enable gang scheduling, add the following label to the configurations of the pod:
pod-group.scheduling.sigs.k8s.io/groups: "[\"default/gang-example1\", \"default/gang-example2\"]"
If you use PodGroups to enable gang scheduling, add the following label to the configurations of the PodGroups:
pod-group.scheduling.sigs.k8s.io/groups: "[\"default/gang-example1\", \"default/gang-example2\"]"
If you use annotations to enable gang scheduling, add the following label to the configurations of the pod:
gang.scheduling.koordinator.sh/groups: "[\"default/gang-example1\", \"default/gang-example2\"]"
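For example, for a hypothetical training job whose parameter servers and workers form two gangs in the default namespace, the label-based configuration might look like the following sketch. The gang names and min-available values are illustrative.

    # Labels on the parameter server pods
    labels:
      pod-group.scheduling.sigs.k8s.io/name: gang-example1
      pod-group.scheduling.sigs.k8s.io/min-available: "1"
      pod-group.scheduling.sigs.k8s.io/groups: "[\"default/gang-example1\", \"default/gang-example2\"]"
    ---
    # Labels on the worker pods
    labels:
      pod-group.scheduling.sigs.k8s.io/name: gang-example2
      pod-group.scheduling.sigs.k8s.io/min-available: "4"
      pod-group.scheduling.sigs.k8s.io/groups: "[\"default/gang-example1\", \"default/gang-example2\"]"

With this configuration, the pods are scheduled only when at least one parameter server pod and four worker pods can be scheduled at the same time.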
Declare a match policy
When you enable gang scheduling, you can declare a match policy to specify which pods, based on their status, a PodGroup counts toward the min-available value.
If you use labels to enable gang scheduling, add the following label to the configurations of the pod:
pod-group.scheduling.sigs.k8s.io/match-policy: "waiting-and-running"
If you use PodGroups to enable gang scheduling, add the following label to the configurations of the PodGroups:
pod-group.scheduling.sigs.k8s.io/match-policy: "waiting-and-running"
If you use annotations to enable gang scheduling, only the once-satisfied match method is supported.
The following table describes different match methods.
Match method | Description |
only-waiting | Only pods that have completed resource preallocation are matched. |
waiting-and-running | Pods in the Running state and pods that have completed resource preallocation are matched. |
waiting-running-succeed | Pods in the Succeed state, pods in the Running state, and pods that have completed resource preallocation are matched. |
once-satisfied | Only pods that have completed resource preallocation are matched. The PodGroup becomes invalid after pods are matched. |
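The following is a minimal sketch of a PodGroup that declares a match policy, assuming the label is placed in the PodGroup's metadata.labels in the same way as the other PodGroup labels in this topic. The PodGroup name and minMember value are illustrative.

    apiVersion: scheduling.sigs.k8s.io/v1alpha1
    kind: PodGroup
    metadata:
      name: gang-example            # illustrative name
      labels:
        pod-group.scheduling.sigs.k8s.io/match-policy: "waiting-and-running"
    spec:
      minMember: 3                  # illustrative value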
Examples
In this example, a distributed TensorFlow job is used to demonstrate how to enable gang scheduling. The ACK cluster that is used in this example has four GPUs.
Install Arena and deploy an environment in your cluster to run TensorFlow jobs. For more information, see Install Arena.
Note: Arena is a subproject of Kubeflow. Kubeflow is an open source project for Kubernetes-based machine learning. Arena allows you to manage the lifecycle of machine learning jobs by using a CLI or SDK. Lifecycle management includes environment setup, data preparation, model development, model training, and model prediction. This improves the working efficiency of data scientists.
Use the following template to submit a distributed TensorFlow job to the ACK cluster. The job runs in one parameter server (PS) pod and four worker pods. Each worker pod requires two GPUs.
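A minimal sketch of such a job manifest, assuming the Kubeflow TFJob API (kubeflow.org/v1), is shown below. The image name is a hypothetical placeholder, and the exact template that you submit may differ.

    apiVersion: kubeflow.org/v1
    kind: TFJob
    metadata:
      name: tf-smoke-gpu
    spec:
      tfReplicaSpecs:
        PS:
          replicas: 1                # one parameter server pod
          template:
            spec:
              containers:
              - name: tensorflow
                image: registry.example.com/tf-benchmarks:latest   # hypothetical image
        Worker:
          replicas: 4                # four worker pods
          template:
            spec:
              containers:
              - name: tensorflow
                image: registry.example.com/tf-benchmarks:latest   # hypothetical image
                resources:
                  limits:
                    nvidia.com/gpu: 2    # each worker pod requires two GPUs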
Submit the distributed TensorFlow job without enabling gang scheduling
Run the following command to query the status of the pods that run the TensorFlow job:
kubectl get pods
The output shows that only two worker pods are running and the other worker pods are in the Pending state.
NAME                    READY   STATUS    RESTARTS   AGE
tf-smoke-gpu-ps-0       1/1     Running   0          6m43s
tf-smoke-gpu-worker-0   1/1     Running   0          6m43s
tf-smoke-gpu-worker-1   1/1     Running   0          6m43s
tf-smoke-gpu-worker-2   0/1     Pending   0          6m43s
tf-smoke-gpu-worker-3   0/1     Pending   0          6m43s
Run the following command to query the log data of the running worker pods.
kubectl logs -f tf-smoke-gpu-worker-0
The returned log data indicates that the two running worker pods have started and are waiting for the pending worker pods to start. The GPUs that the running worker pods occupy remain idle during this time.
INFO|2020-05-19T07:02:18|/opt/launcher.py|27| 2020-05-19 07:02:18.199696: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:3
INFO|2020-05-19T07:02:28|/opt/launcher.py|27| 2020-05-19 07:02:28.199798: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:2
Submit the distributed TensorFlow job with gang scheduling enabled
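To enable gang scheduling for this job, gang scheduling labels are added to the pod templates of the PS and worker replicas, as described earlier in this topic. The following is a minimal sketch; the min-available value of 5 is an illustrative choice that requires the PS pod and all four worker pods to be scheduled together.

    # Added to the pod template of each replica (PS and Worker)
    labels:
      pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu
      pod-group.scheduling.sigs.k8s.io/min-available: "5"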
Run the following command to query the status of the pods that run the TensorFlow job:
kubectl get pods
The computing resources in the cluster are insufficient to schedule the minimum number of pods. Therefore, the PodGroup cannot be scheduled and all pods are in the Pending state.
NAME                    READY   STATUS    RESTARTS   AGE
tf-smoke-gpu-ps-0       0/1     Pending   0          43s
tf-smoke-gpu-worker-0   0/1     Pending   0          43s
tf-smoke-gpu-worker-1   0/1     Pending   0          43s
tf-smoke-gpu-worker-2   0/1     Pending   0          43s
tf-smoke-gpu-worker-3   0/1     Pending   0          43s
After four more GPUs are added to the cluster, the computing resources in the cluster become sufficient to schedule the minimum number of pods. After the PodGroup is scheduled, the four worker pods start to run. Run the following command to query the status of the pods that run the TensorFlow job:
kubectl get pods
Expected output:
NAME                    READY   STATUS    RESTARTS   AGE
tf-smoke-gpu-ps-0       1/1     Running   0          3m16s
tf-smoke-gpu-worker-0   1/1     Running   0          3m16s
tf-smoke-gpu-worker-1   1/1     Running   0          3m16s
tf-smoke-gpu-worker-2   1/1     Running   0          3m16s
tf-smoke-gpu-worker-3   1/1     Running   0          3m16s
Run the following command to query the log data of a running worker pod:
kubectl logs -f tf-smoke-gpu-worker-0
The following output indicates that the training job has started.
INFO|2020-05-19T07:15:24|/opt/launcher.py|27| Running warm up
INFO|2020-05-19T07:21:04|/opt/launcher.py|27| Done warm up
INFO|2020-05-19T07:21:04|/opt/launcher.py|27| Step  Img/sec loss
INFO|2020-05-19T07:21:05|/opt/launcher.py|27| 1     images/sec: 31.6 +/- 0.0 (jitter = 0.0) 8.318
INFO|2020-05-19T07:21:15|/opt/launcher.py|27| 10    images/sec: 31.1 +/- 0.4 (jitter = 0.7) 8.343
INFO|2020-05-19T07:21:25|/opt/launcher.py|27| 20    images/sec: 31.5 +/- 0.3 (jitter = 0.7) 8.142
Error messages
Error message: "rejected by podgroup xxx".
Possible causes: If you create multiple PodGroups in a cluster, the pods in a PodGroup may fail to be scheduled at the same time due to the BackOff queue of kube-scheduler. In this case, pods that have completed resource pre-allocation may be rejected when the system schedules the pods in subsequent PodGroups. You can ignore the error if the situation lasts no more than 20 minutes. If the situation lasts more than 20 minutes, submit a ticket.
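To check whether a pod is affected, you can view its events, where scheduler messages of this kind are typically recorded. For example:

    kubectl describe pod <pod-name> -n <namespace>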
References
For more information about release notes for kube-scheduler, see kube-scheduler.
Kubernetes uses the ResourceQuota object to allocate resources statically. This method does not ensure high resource utilization in Kubernetes clusters. To improve the resource utilization of ACK clusters, Alibaba Cloud has developed the capacity scheduling feature based on the Yarn capacity scheduler and the Kubernetes scheduling framework. This feature uses elastic quota groups to meet the resource requests in an ACK cluster and share resources to improve resource utilization. For more information, see Work with capacity scheduling.