Kubernetes uses the ResourceQuota object to allocate resources statically. Static allocation does not ensure high resource utilization in Kubernetes clusters. To improve the resource utilization of Container Service for Kubernetes (ACK) clusters, Alibaba Cloud has developed the capacity scheduling feature based on the YARN capacity scheduler and the Kubernetes scheduling framework. This feature uses elastic quota groups to meet the resource requests in an ACK cluster and shares idle resources to improve resource utilization. This topic describes how to use the capacity scheduling feature.
Prerequisites
An ACK Pro cluster that runs Kubernetes 1.20 or later is created. For more information, see Create an ACK managed cluster.
Background information
If a cluster is used by multiple users, you must allocate a fixed amount of resources to each user to prevent the users from competing with each other. The traditional method is to use Kubernetes resource quotas to allocate a fixed amount of resources to each user. However, users may use resources in different ways at different times. As a result, some users may encounter resource shortages while the resources allocated to other users sit idle. In this case, the resource utilization of the cluster decreases and a considerable amount of resources is wasted.
Key features
To address this issue, Alibaba Cloud provides the capacity scheduling feature based on the Kubernetes scheduling framework to optimize resource allocation. This feature allows you to meet the resource requests in an ACK cluster and improve resource utilization by sharing resources. The capacity scheduling feature provides the following capabilities:
Supports hierarchical resource quotas. You can configure hierarchical resource quotas based on your requirements, such as creating resource quotas for the departments in an enterprise. In an elastic quota group, each namespace belongs to only one leaf, and a leaf can contain multiple namespaces.
Supports resource sharing and reclaiming between resource quotas:
Min: the minimum amount of resources that are guaranteed for use. To keep this guarantee valid even when cluster resources are insufficient, the total amount of minimum resources for all users must not exceed the total amount of resources of the cluster.
Max: the maximum amount of resources that you can use.
Note: Idle resource quotas of other users can be temporarily used by your workloads. However, the total amount of resources used by your workloads cannot exceed the maximum amount of the corresponding resource quota.
If the minimum amount of resources allocated to your workloads is idle, other users can temporarily use those resources. When your workloads require these resources, the scheduler reclaims and preempts them back for your workloads.
Supports multiple resource types. You can configure CPU and memory resource quotas. You can also configure resource quotas for extended resources that are supported by Kubernetes, such as GPUs, as shown in the fragment after this list.
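For example, a single leaf can declare min and max for CPU, memory, and an extended resource at the same time. The following fragment is an illustrative sketch; the nvidia.com/gpu resource name assumes that a GPU device plugin exposes this resource in your cluster:

# Illustrative fragment of a single leaf (not a complete template):
max:
  cpu: 20
  memory: 20Gi
  nvidia.com/gpu: 4   # extended resource; assumes a GPU device plugin is installed
min:
  cpu: 10
  memory: 10Gi
  nvidia.com/gpu: 2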
Examples of capacity scheduling
In this topic, an Elastic Compute Service (ECS) instance of the ecs.sn2.13xlarge type (56 vCPUs and 224 GiB of memory) is used to show how to configure resource quotas.
Run the following command to create namespaces:
kubectl create ns namespace1
kubectl create ns namespace2
kubectl create ns namespace3
kubectl create ns namespace4
Create an elastic quota group by using the following YAML template:
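The following template is a sketch reconstructed to be consistent with the quota values referenced in the subsequent steps (root.max.cpu=40, and min.cpu=10 and max.cpu=20 for each leaf). Only CPU quotas are shown; memory and extended resources can be added the same way. The apiVersion is an assumption that you should verify against the ElasticQuotaTree CRD installed in your cluster:

apiVersion: scheduling.sigs.k8s.io/v1beta1   # assumption; verify against your cluster's CRD
kind: ElasticQuotaTree
metadata:
  name: elasticquotatree
  namespace: kube-system      # created in kube-system, as queried in the next step
spec:
  root:
    name: root
    max:
      cpu: 40                 # min of the root must equal max of the root
    min:
      cpu: 40
    children:                 # quotas of child nodes are configured below children
    - name: root.a
      max:
        cpu: 40
      min:
        cpu: 20
      children:
      - name: root.a.1
        namespaces:           # each leaf binds one or more namespaces
        - namespace1
        max:
          cpu: 20
        min:
          cpu: 10
      - name: root.a.2
        namespaces:
        - namespace2
        max:
          cpu: 20
        min:
          cpu: 10
    - name: root.b
      max:
        cpu: 40
      min:
        cpu: 20
      children:
      - name: root.b.1
        namespaces:
        - namespace3
        max:
          cpu: 20
        min:
          cpu: 10
      - name: root.b.2
        namespaces:
        - namespace4
        max:
          cpu: 20
        min:
          cpu: 10

Save the template as, for example, elastic-quota-tree.yaml and create it by running kubectl apply -f elastic-quota-tree.yaml.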
In the preceding YAML template, the namespaces bound to each leaf are configured in the namespaces field, and the elastic quotas of child nodes are configured in the children field. The quota configurations must meet the following requirements:
The minimum amount of resources for a node cannot exceed the maximum amount of resources for that node.
The total amount of minimum resources for all children of a parent node cannot exceed the minimum amount of resources for the parent node.
The minimum amount of resources for the root must equal the maximum amount of resources for the root. This amount cannot exceed the total amount of resources of the cluster.
Each namespace belongs to only one leaf. A leaf can contain multiple namespaces.
Run the following command to check whether the elastic quota group is created:
kubectl get ElasticQuotaTree -n kube-system
Expected output:
NAME               AGE
elasticquotatree   68s
Create a Deployment named nginx1 in namespace1 by using the following YAML template. The Deployment runs five pods, and each pod requests five CPU cores.
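The following Deployment template is a minimal sketch: the nginx image and labels are illustrative, while the five-CPU request per pod is the value the example depends on:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx1
  namespace: namespace1
  labels:
    app: nginx1
spec:
  replicas: 5                # five pods in total
  selector:
    matchLabels:
      app: nginx1
  template:
    metadata:
      labels:
        app: nginx1
    spec:
      containers:
      - name: nginx1
        image: nginx         # illustrative image
        resources:
          requests:
            cpu: 5           # each pod requests five CPU cores
          limits:
            cpu: 5

After the Deployment is created, run the following command to query the status of the pods: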
kubectl get pods -n namespace1
Expected output:
NAME                      READY   STATUS    RESTARTS   AGE
nginx1-744b889544-52dbg   1/1     Running   0          70s
nginx1-744b889544-6l4s9   1/1     Running   0          70s
nginx1-744b889544-cgzlr   1/1     Running   0          70s
nginx1-744b889544-w2gr7   1/1     Running   0          70s
nginx1-744b889544-zr5xz   0/1     Pending   0          70s
The total CPU request of the pods in namespace1 exceeds 10 CPU cores (min.cpu=10), which is the minimum guaranteed for root.a.1. Because the maximum for the root is 40 CPU cores (root.max.cpu=40), these pods can temporarily use idle CPU cores in the cluster. However, the total amount used cannot exceed 20 CPU cores (max.cpu=20), which is the maximum for root.a.1. When the pods in namespace1 use 20 CPU cores, the next pod to be scheduled becomes pending. As a result, four of the five pods of this Deployment are in the Running state and one is in the Pending state.
Create a second Deployment named nginx2 in namespace2 by using the following YAML template. The Deployment runs five pods, and each pod requests five CPU cores.
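The following is again a minimal sketch, mirroring the nginx1 template with only the name and namespace changed:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx2
  namespace: namespace2
  labels:
    app: nginx2
spec:
  replicas: 5                # five pods in total
  selector:
    matchLabels:
      app: nginx2
  template:
    metadata:
      labels:
        app: nginx2
    spec:
      containers:
      - name: nginx2
        image: nginx         # illustrative image
        resources:
          requests:
            cpu: 5           # each pod requests five CPU cores
          limits:
            cpu: 5

After the Deployment is created, run the following commands to query the status of the pods in both namespaces: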
kubectl get pods -n namespace1
Expected output:
NAME                      READY   STATUS    RESTARTS   AGE
nginx1-744b889544-52dbg   1/1     Running   0          111s
nginx1-744b889544-6l4s9   1/1     Running   0          111s
nginx1-744b889544-cgzlr   1/1     Running   0          111s
nginx1-744b889544-w2gr7   1/1     Running   0          111s
nginx1-744b889544-zr5xz   0/1     Pending   0          111s
kubectl get pods -n namespace2
Expected output:
NAME                      READY   STATUS    RESTARTS   AGE
nginx2-556f95449f-4gl8s   1/1     Running   0          111s
nginx2-556f95449f-crwk4   1/1     Running   0          111s
nginx2-556f95449f-gg6q2   0/1     Pending   0          111s
nginx2-556f95449f-pnz5k   1/1     Running   0          111s
nginx2-556f95449f-vjpmq   1/1     Running   0          111s
This Deployment is similar to the nginx1 Deployment. The total CPU request of the pods in namespace2 exceeds 10 CPU cores (min.cpu=10), which is the minimum guaranteed for root.a.2. Because the maximum for the root is 40 CPU cores (root.max.cpu=40), these pods can also temporarily use idle CPU cores in the cluster. However, the total amount used cannot exceed 20 CPU cores (max.cpu=20), which is the maximum for root.a.2. When the pods in namespace2 use 20 CPU cores, the next pod to be scheduled becomes pending. As a result, four of the five pods of this Deployment are in the Running state and one is in the Pending state. After the preceding two Deployments are created, the pods in namespace1 and namespace2 use a total of 40 CPU cores (root.max.cpu=40), which is the maximum for the root node.
Create a third Deployment named nginx3 in namespace3 by using the following YAML template. This Deployment runs five pods, and each pod requests five CPU cores.
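Again a minimal sketch, with only the name and namespace changed from the previous templates:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx3
  namespace: namespace3
  labels:
    app: nginx3
spec:
  replicas: 5                # five pods in total
  selector:
    matchLabels:
      app: nginx3
  template:
    metadata:
      labels:
        app: nginx3
    spec:
      containers:
      - name: nginx3
        image: nginx         # illustrative image
        resources:
          requests:
            cpu: 5           # each pod requests five CPU cores
          limits:
            cpu: 5

After the Deployment is created, run the following commands to query the status of the pods across the namespaces: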
kubectl get pods -n namespace1
Expected output:
NAME                      READY   STATUS    RESTARTS   AGE
nginx1-744b889544-52dbg   1/1     Running   0          6m17s
nginx1-744b889544-cgzlr   1/1     Running   0          6m17s
nginx1-744b889544-nknns   0/1     Pending   0          3m45s
nginx1-744b889544-w2gr7   1/1     Running   0          6m17s
nginx1-744b889544-zr5xz   0/1     Pending   0          6m17s
kubectl get pods -n namespace2
Expected output:
NAME                      READY   STATUS    RESTARTS   AGE
nginx2-556f95449f-crwk4   1/1     Running   0          4m22s
nginx2-556f95449f-ft42z   1/1     Running   0          4m22s
nginx2-556f95449f-gg6q2   0/1     Pending   0          4m22s
nginx2-556f95449f-hfr2g   1/1     Running   0          3m29s
nginx2-556f95449f-pvgrl   0/1     Pending   0          3m29s
kubectl get pods -n namespace3
Expected output:
NAME                     READY   STATUS    RESTARTS   AGE
nginx3-578877666-msd7f   1/1     Running   0          4m
nginx3-578877666-nfdwv   0/1     Pending   0          4m10s
nginx3-578877666-psszr   0/1     Pending   0          4m11s
nginx3-578877666-xfsss   1/1     Running   0          4m22s
nginx3-578877666-xpl2p   0/1     Pending   0          4m10s
The min of root.b.1 is set to 10. Therefore, when the nginx3 Deployment is created, its pods request the guaranteed min resources. The scheduler reclaims the CPU cores that belong to root.b and are temporarily used by root.a. This guarantees 10 CPU cores (min.cpu=10) for the pod scheduling of nginx3. Before the scheduler reclaims the temporarily used 10 CPU cores, it also considers other factors, such as the priority classes, availability, and creation time of the workloads of root.a. Therefore, after the pods of nginx3 are scheduled on the reclaimed 10 CPU cores (min.cpu=10), two pods are in the Running state and the other three are in the Pending state.
Create a fourth Deployment named nginx4 in namespace4 by using the following YAML template. This Deployment runs five pods, and each pod requests five CPU cores.
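As before, a minimal sketch with only the name and namespace changed:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx4
  namespace: namespace4
  labels:
    app: nginx4
spec:
  replicas: 5                # five pods in total
  selector:
    matchLabels:
      app: nginx4
  template:
    metadata:
      labels:
        app: nginx4
    spec:
      containers:
      - name: nginx4
        image: nginx         # illustrative image
        resources:
          requests:
            cpu: 5           # each pod requests five CPU cores
          limits:
            cpu: 5

After the Deployment is created, run the following commands to query the status of the pods across the namespaces: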
kubectl get pods -n namespace1
Expected output:
NAME                      READY   STATUS    RESTARTS   AGE
nginx1-744b889544-cgzlr   1/1     Running   0          8m20s
nginx1-744b889544-cwx8l   0/1     Pending   0          55s
nginx1-744b889544-gjkx2   0/1     Pending   0          55s
nginx1-744b889544-nknns   0/1     Pending   0          5m48s
nginx1-744b889544-zr5xz   1/1     Running   0          8m20s
kubectl get pods -n namespace2
Expected output:
NAME                      READY   STATUS    RESTARTS   AGE
nginx2-556f95449f-cglpv   0/1     Pending   0          3m45s
nginx2-556f95449f-crwk4   1/1     Running   0          9m31s
nginx2-556f95449f-gg6q2   1/1     Running   0          9m31s
nginx2-556f95449f-pvgrl   0/1     Pending   0          8m38s
nginx2-556f95449f-zv8wn   0/1     Pending   0          3m45s
kubectl get pods -n namespace3
Expected output:
NAME                     READY   STATUS    RESTARTS   AGE
nginx3-578877666-msd7f   1/1     Running   0          8m46s
nginx3-578877666-nfdwv   0/1     Pending   0          8m56s
nginx3-578877666-psszr   0/1     Pending   0          8m57s
nginx3-578877666-xfsss   1/1     Running   0          9m8s
nginx3-578877666-xpl2p   0/1     Pending   0          8m56s
kubectl get pods -n namespace4
Expected output:
NAME                      READY   STATUS    RESTARTS   AGE
nginx4-754b767f45-g9954   1/1     Running   0          4m32s
nginx4-754b767f45-j4v7v   0/1     Pending   0          4m32s
nginx4-754b767f45-jk2t7   0/1     Pending   0          4m32s
nginx4-754b767f45-nhzpf   0/1     Pending   0          4m32s
nginx4-754b767f45-tv5jj   1/1     Running   0          4m32s
Similarly, the min of root.b.2 is set to 10. Therefore, when nginx4 is created, its pods request the guaranteed min resources. The scheduler reclaims the CPU cores that belong to root.b and are temporarily used by root.a. This guarantees 10 CPU cores (min.cpu=10) for the pod scheduling of nginx4. Before the scheduler reclaims the temporarily used 10 CPU cores, it also considers other factors, such as the priority classes, availability, and creation time of the workloads in root.a. Therefore, after the pods of nginx4 are scheduled on the reclaimed 10 CPU cores (min.cpu=10), two pods are in the Running state and the other three are in the Pending state. After all four Deployments are created, the running pods in each namespace use the minimum amount of resources (min) guaranteed by the corresponding resource quota.
References
For more information about the release notes for kube-scheduler, see kube-scheduler.
ACK supports the gang scheduling feature, which is developed based on the kube-scheduler framework. Gang scheduling ensures that a group of correlated pods is scheduled at the same time. If the scheduling requirements are not met, none of the pods is scheduled. Gang scheduling provides a solution to job scheduling in all-or-nothing scenarios and is suitable for distributed applications, such as Spark and Hadoop big data computing jobs, that strictly require all correlated pods to be scheduled at the same time. For more information, see Work with gang scheduling.