By Wang Qingcan (Li Fan) and Zhang Kai
With years of experience in supporting Kubernetes products and customers, the Alibaba Cloud Container Service for Kubernetes Team has significantly optimized and extended Kube-scheduler to stably and efficiently schedule various complex workloads in different scenarios. This series of articles, entitled "The Burgeoning Kubernetes Scheduling System," provides a comprehensive summary of our experiences, technical thinking, and specific implementation methods for Kubernetes users and developers. We hope that these articles can help you better understand the powerful capabilities and future trends of the Kubernetes scheduling system.
First, let's take a look at the definitions of coscheduling and gang scheduling. According to Wikipedia, coscheduling is the scheduling of related processes to run on different processors at the same time in concurrent systems. In coscheduling scenarios, the main principle is to ensure that all related processes can be started at the same time, which prevents an exception in some processes from blocking the entire process group. An abnormal process that blocks its group in this way is called a fragment.
During implementation, coscheduling can be classified into explicit coscheduling, local coscheduling, and implicit coscheduling based on whether fragments are allowed. Among them, explicit coscheduling is known as gang scheduling. Gang scheduling allows no fragments, which means "all or nothing."
By mapping the preceding concepts to Kubernetes, you can understand why the Kubernetes scheduling system supports coscheduling for batch jobs. A batch job (equivalent to a related process group) contains N pods (equivalent to processes). The Kubernetes scheduler schedules these N pods to run simultaneously on M nodes (equivalent to processors). Assume this batch job can run as long as a certain number of pods are started simultaneously. We define the minimum number of pods that must start simultaneously as min-available. When min-available is equal to N, the batch job must meet the gang scheduling requirement.
Kubernetes is widely used in online service orchestration. To improve the utilization and operating efficiency of clusters, we hope to use Kubernetes as a unified management platform to manage online services and offline jobs. The default scheduler schedules pods serially without considering the relationship between pods. However, many offline jobs that involve data computing require combined scheduling. Combined scheduling means that all tasks must be created before the overall job can run properly. If some tasks are started but other tasks are not, the started tasks wait for the scheduler to schedule the remaining tasks. This is a gang scheduling scenario.
As shown in the following figure, JobA can only run properly when four pods are started at the same time. Kube-scheduler sequentially schedules and creates three of the pods, but cluster resources are insufficient for it to schedule the fourth pod. As a result, the first three pods of JobA keep waiting for the fourth one and occupy cluster resources without doing any useful work. If the fourth pod cannot be started in time, the entire JobA cannot run, and worse still, the cluster resources occupied by the first three pods are wasted.
In an even worse case, as shown in the following figure, the remaining cluster resources are occupied by the first three pods of JobB, which is likewise waiting for its fourth pod to be created. Neither job can obtain the resources it needs to make progress, so a deadlock occurs, rendering the entire cluster inoperable.
To overcome the preceding pain points, the community provides the Kube-batch project and the Volcano project derived from the Kube-batch project. Specifically, the community developed a new scheduler to schedule PodGroups instead of pods during scheduling. In other words, the scheduler schedules pods by group. In these projects, the new scheduler schedules pods that require the coscheduling feature, and Kube-scheduler schedules other pods, such as those that run online services.
These projects can resolve coscheduling problems but create new ones. As we all know, scheduling decisions for the resources of a cluster need to be made by a single, centralized scheduler. If two schedulers coexist in the same cluster, they may make conflicting decisions: for example, the same unit of resources may be separately allocated to two different pods, so a pod scheduled to a node may fail to be created there due to insufficient resources. The only workarounds are to forcibly divide nodes between the schedulers using labels or to deploy multiple clusters. In either case, online services and offline jobs cannot truly share resources in the same Kubernetes cluster, which inevitably wastes cluster resources and increases O&M costs. Furthermore, to run the Volcano project, you must start the custom MutatingAdmissionWebhook and ValidatingAdmissionWebhook. These webhooks introduce single-point-of-failure risks: if a webhook fails, all the pods in the cluster may fail to be created. Running an additional scheduler also increases the complexity of maintenance and compromises compatibility with the upstream Kube-scheduler API.
In the first article in this series, we introduced the architectural principles and development method of the Kubernetes Scheduling Framework. On this basis, we can extend and implement a coscheduling plug-in to enable the native Kubernetes scheduler to schedule batch jobs while avoiding the problems of the preceding solution. The previous article also provided a detailed description of the Scheduling Framework. You are welcome to read it for more information.
To better manage scheduling plug-ins for different scenarios, the sig-scheduling team, which is responsible for Kube-scheduler in Kubernetes, created a project named scheduler-plugins. The coscheduling plug-in implemented based on the Scheduling Framework became the first official plug-in of this project. In the following sections, I will describe the implementation and usage of the coscheduling plug-in in detail.
We define PodGroups using labels. Pods with the same label belong to the same PodGroup. In addition, min-available is used to indicate the minimum number of replicas that a job of the PodGroup requires to run properly.
labels:
  pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu
  pod-group.scheduling.sigs.k8s.io/min-available: "2"
Note: Pods in the same PodGroup must have the same priority.
The Permit plug-in of the Scheduling Framework provides a delayed binding feature. Specifically, for a pod that enters the Permit phase, you can customize a condition to allow the pod to pass the phase, deny the pod, or keep the pod waiting; when you keep the pod waiting, you can also specify a timeout period. This delayed binding allows pods belonging to the same PodGroup to wait after they have been scheduled to nodes. Once the required number of pods of the PodGroup have been scheduled, the scheduler allows all of them to proceed so that they are bound and created together.
Assume that JobA can run properly only when four pods are started at the same time, but the current cluster resources allow only three of those pods to be created. Unlike the default scheduler, which would schedule and create the three pods right away, the Permit plug-in of the Scheduling Framework keeps them waiting instead of binding them.
Then, when idle resources are released in the cluster, and all of the resources required by the pods for JobA are available, the scheduler schedules and creates all four pods of JobA and runs JobA.
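To make the delayed-binding mechanism concrete, the following is a minimal sketch of how a Permit plug-in could hold pods of a PodGroup until min-available of them have been scheduled. It is not the official scheduler-plugins implementation: the Coscheduling struct, the in-memory counter, the simplified constructor, and the 30-second timeout are assumptions made for illustration, and the framework import path and some type names vary across Kubernetes versions (for example, framework/v1alpha1 in the 1.16–1.18 era).

package coscheduling

import (
	"context"
	"strconv"
	"sync"
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

const (
	nameLabel         = "pod-group.scheduling.sigs.k8s.io/name"
	minAvailableLabel = "pod-group.scheduling.sigs.k8s.io/min-available"
)

// Coscheduling is a simplified plug-in type used by the sketches in this
// article. The real plug-in lives in the scheduler-plugins project.
type Coscheduling struct {
	handle    framework.Handle
	mu        sync.Mutex
	scheduled map[string]int // pods per PodGroup that have passed scheduling
}

// newCoscheduling is a simplified constructor; the real framework plug-in
// factory signature is different.
func newCoscheduling(handle framework.Handle) *Coscheduling {
	return &Coscheduling{handle: handle, scheduled: map[string]int{}}
}

func (cs *Coscheduling) Name() string { return "Coscheduling" }

func podGroupName(p *v1.Pod) string { return p.Labels[nameLabel] }

func minAvailable(p *v1.Pod) int {
	n, _ := strconv.Atoi(p.Labels[minAvailableLabel])
	return n
}

// Permit keeps each scheduled pod of a PodGroup waiting until min-available
// pods of the group have been scheduled, then releases the whole group.
func (cs *Coscheduling) Permit(ctx context.Context, state *framework.CycleState,
	p *v1.Pod, nodeName string) (*framework.Status, time.Duration) {
	pgName := podGroupName(p)
	if pgName == "" {
		// Pods that do not belong to any PodGroup are bound immediately.
		return framework.NewStatus(framework.Success, ""), 0
	}

	cs.mu.Lock()
	cs.scheduled[pgName]++
	count := cs.scheduled[pgName]
	cs.mu.Unlock()

	if count < minAvailable(p) {
		// Too few pods of the group have been scheduled so far: keep this
		// pod waiting in the Permit phase, with a timeout.
		return framework.NewStatus(framework.Wait, ""), 30 * time.Second
	}

	// min-available is reached: allow every waiting pod of the same PodGroup
	// so that all of them are bound and created together.
	cs.handle.IterateOverWaitingPods(func(wp framework.WaitingPod) {
		if podGroupName(wp.GetPod()) == pgName {
			wp.Allow(cs.Name())
		}
	})
	return framework.NewStatus(framework.Success, ""), 0
}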
The queue of the default scheduler is not aware of PodGroup information, so pods of a PodGroup do not dequeue together. As shown in the following figure, pods of PodGroup A and PodGroup B are created at staggered times, so they end up interleaved when they enter the queue.
After a newly created pod is added to the queue, it is not adjacent to the other pods of its PodGroup. Instead, it is interleaved with pods of other PodGroups.
As a result, if pods of PodGroupA are held in the Permit phase, pods of PodGroupB that are scheduled later also end up waiting there. The resources held by the two half-scheduled groups block each other, so neither PodGroupA nor PodGroupB can be fully scheduled. The deadlock simply moves from the nodes to the Permit phase, and the preceding problem is not resolved.
To address the preceding problem, we implemented the QueueSort plug-in to ensure that pods of the same PodGroup are adjacent to each other in the queue. We define the Less method for the QueueSort plug-in to determine the order of pods in the queue:
func Less(podA *PodInfo, podB *PodInfo) bool
First, the plug-in inherits the default priority-based comparison method, ensuring that pods with higher priorities precede pods of lower priorities.
Then, we define a new queuing logic to support the sorting of pods in a PodGroup in the case where pods have the same priority.
With this queuing policy, pods in the same PodGroup are adjacent to each other in the queue.
After a pod is created and added to the queue, the pod will be adjacent to other pods belonging to the same PodGroup.
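Concretely, a minimal sketch of such a Less method could look like the following. It continues the hypothetical Coscheduling type and podGroupName helper from the Permit sketch above, and it tracks when each PodGroup was first seen as a stand-in for the group's creation time, which the real plug-in uses. The article shows the signature as Less(podA, podB *PodInfo); that type corresponds to *framework.QueuedPodInfo in recent Kubernetes versions.

// firstSeen records when each PodGroup was first observed in the scheduling
// queue. This is hypothetical bookkeeping for the sketch; the real plug-in
// compares the creation time of the PodGroup itself.
var (
	firstSeenMu sync.Mutex
	firstSeen   = map[string]time.Time{}
)

// groupTimestamp returns the first-seen time of the pod's PodGroup, or the
// pod's own queue timestamp if it does not belong to a PodGroup.
func groupTimestamp(pi *framework.QueuedPodInfo) time.Time {
	pg := podGroupName(pi.Pod)
	if pg == "" {
		return pi.Timestamp
	}
	firstSeenMu.Lock()
	defer firstSeenMu.Unlock()
	if t, ok := firstSeen[pg]; ok {
		return t
	}
	firstSeen[pg] = pi.Timestamp
	return pi.Timestamp
}

func podPriority(p *v1.Pod) int32 {
	if p.Spec.Priority != nil {
		return *p.Spec.Priority
	}
	return 0
}

// Less keeps the default priority ordering and, for pods of equal priority,
// sorts by PodGroup first-seen time and then by PodGroup name, so that
// members of the same PodGroup end up adjacent in the queue.
func (cs *Coscheduling) Less(a, b *framework.QueuedPodInfo) bool {
	pa, pb := podPriority(a.Pod), podPriority(b.Pod)
	if pa != pb {
		return pa > pb // higher priority first, as in the default plug-in
	}
	ta, tb := groupTimestamp(a), groupTimestamp(b)
	if !ta.Equal(tb) {
		return ta.Before(tb) // the group (or pod) queued earlier goes first
	}
	// Same priority and timestamp: fall back to the PodGroup name so that
	// pods of one group are contiguous rather than interleaved.
	return podGroupName(a.Pod) < podGroupName(b.Pod)
}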
To reduce ineffective scheduling operations and improve scheduling performance, we add a filtering condition in the Prefilter phase. Before scheduling a pod, the scheduler counts the total number of pods, including running pods, that belong to the same PodGroup as the pod. If this total is less than min-available, the min-available requirement cannot possibly be met, so the scheduler rejects the pod in the Prefilter phase and the pod does not enter the main scheduling process.
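Below is a sketch of that check, continuing the same hypothetical type. It counts the group's pods through the framework's shared informer lister, which is an assumption about how the bookkeeping is done, and it uses the older single-return PreFilter signature (newer Kubernetes versions also return a *PreFilterResult). It additionally needs the k8s.io/apimachinery/pkg/labels import.

// totalPodsInGroup counts the existing pods (running or pending) that carry
// the given PodGroup name label, using the framework's shared pod lister.
func (cs *Coscheduling) totalPodsInGroup(pg string) int {
	selector := labels.SelectorFromSet(labels.Set{nameLabel: pg})
	pods, err := cs.handle.SharedInformerFactory().Core().V1().Pods().Lister().List(selector)
	if err != nil {
		return 0
	}
	return len(pods)
}

// PreFilter rejects a pod early when its PodGroup cannot reach min-available,
// so the pod never enters the main scheduling cycle and no work is wasted.
func (cs *Coscheduling) PreFilter(ctx context.Context, state *framework.CycleState,
	p *v1.Pod) *framework.Status {
	pgName := podGroupName(p)
	if pgName == "" {
		return framework.NewStatus(framework.Success, "")
	}
	if cs.totalPodsInGroup(pgName) < minAvailable(p) {
		// The group does not even contain min-available pods yet.
		return framework.NewStatus(framework.Unschedulable,
			"the PodGroup has fewer pods than min-available")
	}
	return framework.NewStatus(framework.Success, "")
}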
If a pod times out in the Permit phase, it enters the UnReserve phase. There, the scheduler rejects all the other pods that belong to the same PodGroup, which prevents the remaining pods from waiting, and holding resources, for a long time.
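The corresponding Unreserve handling could look like this sketch, again continuing the hypothetical type from above: it rolls back the Permit bookkeeping for the timed-out pod and rejects every sibling pod that is still waiting. Note that WaitingPod.Reject takes only a message argument in older framework versions.

// Unreserve is called when a pod that passed Reserve/Permit is rolled back,
// for example after it times out in the Permit phase. The sketch rejects all
// other waiting pods of the same PodGroup so that they stop waiting and
// holding resources.
func (cs *Coscheduling) Unreserve(ctx context.Context, state *framework.CycleState,
	p *v1.Pod, nodeName string) {
	pgName := podGroupName(p)
	if pgName == "" {
		return
	}

	// Roll back the count recorded in the Permit phase for this pod.
	cs.mu.Lock()
	if cs.scheduled[pgName] > 0 {
		cs.scheduled[pgName]--
	}
	cs.mu.Unlock()

	// Reject the sibling pods that are still waiting in the Permit phase.
	cs.handle.IterateOverWaitingPods(func(wp framework.WaitingPod) {
		if podGroupName(wp.GetPod()) == pgName {
			wp.Reject(cs.Name(), "PodGroup did not reach min-available in time")
		}
	})
}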
You can try out coscheduling in a self-built Kubernetes cluster or in any dedicated Kubernetes service provided by a public cloud. Note: The cluster version must be 1.16 or later, and you must have permission to update the primary nodes of the cluster.
This article uses the Kubernetes cluster provided by Alibaba Cloud Container Service for Kubernetes (ACK) to test the coscheduling feature.
We have already built the code of the coscheduling plug-in and the native scheduler into new container images and provided a Helm Chart package named ack-coscheduling for automatic installation. The package starts a job to automatically replace the native scheduler installed on the cluster with the coscheduling scheduler and modify the Config file related to the scheduler so the Scheduling Framework can load the coscheduling plug-in. After the trial, you can restore the default scheduler and related configurations of the cluster using the uninstall feature described in the following section.
Download the Helm Chart package and run the following command to install the Helm Chart:
$ wget http://kubeflow.oss-cn-beijing.aliyuncs.com/ack-coscheduling.tar.gz
$ tar zxvf ack-coscheduling.tar.gz
$ helm install ack-coscheduling -n kube-system ./ack-coscheduling
NAME: ack-coscheduling
LAST DEPLOYED: Mon Apr 13 16:03:57 2020
NAMESPACE: kube-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
On the primary node, run the following command to verify that the coscheduling plug-in is installed:
$ helm get manifest ack-coscheduling -n kube-system | kubectl get -n kube-system -f -
NAME COMPLETIONS DURATION AGE
scheduler-update-clusterrole 1/1 8s 35s
scheduler-update 3/1 of 3 8s 35s
Run the following helm command to uninstall the coscheduling plug-in and roll back the version and configuration of kube-scheduler to the cluster's default state:
$ helm uninstall ack-coscheduling -n kube-system
To use coscheduling, you only need to configure the pod-group.scheduling.sigs.k8s.io/name and pod-group.scheduling.sigs.k8s.io/min-available labels in the YAML file that you use to create the job.
labels:
  pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu
  pod-group.scheduling.sigs.k8s.io/min-available: "3"
pod-group.scheduling.sigs.k8s.io/name: the name of the PodGroup.
pod-group.scheduling.sigs.k8s.io/min-available: indicates that the job can be scheduled as a whole only when the resources of the current cluster are sufficient to start at least min-available pods.
Note: Pods in the same PodGroup must have the same priority.
In the following section, we will demonstrate the coscheduling results by running a distributed TensorFlow training job (TFJob). The test cluster has four graphics processing units (GPUs).
1. Deploy a runtime environment for a TFJob in the existing Kubernetes cluster using Kubeflow's Arena.
Arena is one of the subprojects of Kubeflow, an open-source community for Kubernetes-based machine learning systems. Arena allows you to manage machine learning jobs using command lines and SDKs in the following phases of the lifecycle: environment installation, data preparation, model development, model training, and model prediction. Arena effectively improves the productivity of data scientists.
git clone https://github.com/kubeflow/arena.git
kubectl create ns arena-system
kubectl create -f arena/kubernetes-artifacts/jobmon/jobmon-role.yaml
kubectl create -f arena/kubernetes-artifacts/tf-operator/tf-crd.yaml
kubectl create -f arena/kubernetes-artifacts/tf-operator/tf-operator.yaml
Check whether the runtime environment is deployed.
$ kubectl get pods -n arena-system
NAME READY STATUS RESTARTS AGE
tf-job-dashboard-56cf48874f-gwlhv 1/1 Running 0 54s
tf-job-operator-66494d88fd-snm9m 1/1 Running 0 54s
2. Submit a TFJob to the cluster. In the following example, the TFJob consists of one parameter server (PS) pod and four worker pods, and each worker requires two GPUs. The PodGroup is configured with min-available set to 5, so the job can run only when all five pods of the PodGroup can be started.
apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "tf-smoke-gpu"
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      template:
        metadata:
          creationTimestamp: null
          labels:
            pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu
            pod-group.scheduling.sigs.k8s.io/min-available: "5"
        spec:
          containers:
          - args:
            - python
            - tf_cnn_benchmarks.py
            - --batch_size=32
            - --model=resnet50
            - --variable_update=parameter_server
            - --flush_stdout=true
            - --num_gpus=1
            - --local_parameter_device=cpu
            - --device=cpu
            - --data_format=NHWC
            image: registry.cn-hangzhou.aliyuncs.com/kubeflow-images-public/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
            resources:
              limits:
                cpu: '1'
            workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          restartPolicy: OnFailure
    Worker:
      replicas: 4
      template:
        metadata:
          creationTimestamp: null
          labels:
            pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu
            pod-group.scheduling.sigs.k8s.io/min-available: "5"
        spec:
          containers:
          - args:
            - python
            - tf_cnn_benchmarks.py
            - --batch_size=32
            - --model=resnet50
            - --variable_update=parameter_server
            - --flush_stdout=true
            - --num_gpus=1
            - --local_parameter_device=cpu
            - --device=gpu
            - --data_format=NHWC
            image: registry.cn-hangzhou.aliyuncs.com/kubeflow-images-public/tf-benchmarks-gpu:v20171202-bdab599-dirty-284af3
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
            resources:
              limits:
                nvidia.com/gpu: 2
            workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          restartPolicy: OnFailure
3. To see how the job behaves without the coscheduling feature, do the following:
Delete the pod-group.scheduling.sigs.k8s.io/name and pod-group.scheduling.sigs.k8s.io/min-available labels from the TFJob YAML file, which means the job does not use coscheduling. After you create the job, the cluster resources are only sufficient to start two workers, and the other two workers stay in the pending state.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
tf-smoke-gpu-ps-0 1/1 Running 0 6m43s
tf-smoke-gpu-worker-0 1/1 Running 0 6m43s
tf-smoke-gpu-worker-1 1/1 Running 0 6m43s
tf-smoke-gpu-worker-2 0/1 Pending 0 6m43s
tf-smoke-gpu-worker-3 0/1 Pending 0 6m43s
Check the logs of the running workers. You will find that both of them are waiting for the other two workers to start. In this case, all four GPUs of the cluster are occupied, yet no actual work is being done.
$ kubectl logs -f tf-smoke-gpu-worker-0
INFO|2020-05-19T07:02:18|/opt/launcher.py|27| 2020-05-19 07:02:18.199696: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:3
INFO|2020-05-19T07:02:28|/opt/launcher.py|27| 2020-05-19 07:02:28.199798: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:2
4. To see how the job behaves with the coscheduling feature, do the following:
Add the PodGroup labels back and create the job. Because the cluster resources cannot meet the min-available requirement of five pods, the PodGroup cannot be scheduled and all the pods remain in the pending state.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
tf-smoke-gpu-ps-0 0/1 Pending 0 43s
tf-smoke-gpu-worker-0 0/1 Pending 0 43s
tf-smoke-gpu-worker-1 0/1 Pending 0 43s
tf-smoke-gpu-worker-2 0/1 Pending 0 43s
tf-smoke-gpu-worker-3 0/1 Pending 0 43s
Now, if you scale out the cluster by adding four GPUs, the resources can meet the min-available requirements, the PodGroup can be scheduled, and all four workers will start to run.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
tf-smoke-gpu-ps-0 1/1 Running 0 3m16s
tf-smoke-gpu-worker-0 1/1 Running 0 3m16s
tf-smoke-gpu-worker-1 1/1 Running 0 3m16s
tf-smoke-gpu-worker-2 1/1 Running 0 3m16s
tf-smoke-gpu-worker-3 1/1 Running 0 3m16s
View the log of one of the workers. You will find that the training job has started.
$ kubectl logs -f tf-smoke-gpu-worker-0
INFO|2020-05-19T07:15:24|/opt/launcher.py|27| Running warm up
INFO|2020-05-19T07:21:04|/opt/launcher.py|27| Done warm up
INFO|2020-05-19T07:21:04|/opt/launcher.py|27| Step Img/sec loss
INFO|2020-05-19T07:21:05|/opt/launcher.py|27| 1 images/sec: 31.6 +/- 0.0 (jitter = 0.0) 8.318
INFO|2020-05-19T07:21:15|/opt/launcher.py|27| 10 images/sec: 31.1 +/- 0.4 (jitter = 0.7) 8.343
INFO|2020-05-19T07:21:25|/opt/launcher.py|27| 20 images/sec: 31.5 +/- 0.3 (jitter = 0.7) 8.142
Coscheduling is implemented based on the mechanism of the Kubernetes Scheduling Framework. It meets the requirements for combined scheduling in artificial intelligence (AI) and data computing batch jobs, reduces resource waste, and improves the overall resource utilization of clusters.
In subsequent articles in this series, we will provide more information about scheduling policies for batch jobs, including capacity scheduling and multi-queue management features. We will also describe the design and implementation of the scheduling policies in the Scheduling Framework. Stay tuned for more!
The Burgeoning Kubernetes Scheduling System – Part 1: Scheduling Framework
The Burgeoning Kubernetes Scheduling System – Part 3: Binpack Scheduling That Supports Batch Jobs