Kubernetes uses the ResourceQuota object to allocate resources statically. Static allocation does not ensure high resource utilization in Kubernetes clusters. To improve the resource utilization of Container Service for Kubernetes (ACK) clusters, Alibaba Cloud has developed the capacity scheduling feature based on the YARN capacity scheduler and the Kubernetes scheduling framework. This feature uses elastic quota groups to meet the resource requests in an ACK cluster and shares idle resources to improve resource utilization. This topic describes how to use the capacity scheduling feature.
Prerequisites
An ACK Pro cluster that runs Kubernetes 1.20 or later is created. For more information, see Create an ACK managed cluster.
Background information
If a cluster is used by multiple users, you must allocate a fixed amount of resources to each user to prevent the users from competing with each other. The traditional method is to use Kubernetes resource quotas to allocate a fixed amount of resources to each user. However, users may use resources in different ways at different times. As a result, some users may encounter resource shortages while the resources allocated to other users sit idle. In this case, the resource utilization of the cluster decreases and a considerable amount of resources is wasted.
Key features
To address this issue, Alibaba Cloud provides the capacity scheduling feature based on the Kubernetes scheduling framework to optimize resource allocation. This feature allows you to meet the resource requests in an ACK cluster and improve resource utilization by sharing resources. The capacity scheduling feature provides the following capabilities:
Supports hierarchical resource quotas. You can configure hierarchical resource quotas based on your requirements, such as creating resource quotas for the departments in an enterprise. In an elastic quota group, each namespace belongs to only one leaf, and a leaf can contain multiple namespaces.
Supports resource sharing and reclaiming between resource quotas:
Min: the minimum amount of resources that are guaranteed for use. To keep this guarantee valid even when cluster resources are insufficient, the total amount of minimum resources for all users must not exceed the total amount of resources of the cluster.
Max: the maximum amount of resources that you can use.
Note: Idle resource quotas of other users can be temporarily used by your workloads. However, the total amount of resources used by your workloads cannot exceed the maximum amount of the corresponding resource quota.
If the minimum amount of resources allocated to your workloads is idle, other users can temporarily use those resources. When your workloads require these resources, the scheduler reclaims and preempts them back for your workloads.
Supports multiple resource types. You can configure CPU and memory resource quotas. You can also configure resource quotas for extended resources that are supported by Kubernetes, such as GPUs, as shown in the fragment after this list.
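For example, a single leaf can declare min and max for CPU, memory, and an extended resource at the same time. The following fragment is an illustrative sketch; the nvidia.com/gpu resource name assumes that a GPU device plugin exposes this resource in your cluster:

# Illustrative fragment of a single leaf (not a complete template):
max:
  cpu: 20
  memory: 20Gi
  nvidia.com/gpu: 4   # extended resource; assumes a GPU device plugin is installed
min:
  cpu: 10
  memory: 10Gi
  nvidia.com/gpu: 2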
Examples of capacity scheduling
In this topic, an Elastic Compute Service (ECS) instance of the ecs.sn2.13xlarge type (56 vCPUs and 224 GiB of memory) is used to show how to configure resource quotas.
Run the following command to create namespaces:
kubectl create ns namespace1
kubectl create ns namespace2
kubectl create ns namespace3
kubectl create ns namespace4
Create an elastic quota group by using the following YAML template:
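The following template is a sketch reconstructed to be consistent with the quota values referenced in the subsequent steps (root.max.cpu=40, and min.cpu=10 and max.cpu=20 for each leaf). Only CPU quotas are shown; memory and extended resources can be added the same way. The apiVersion is an assumption that you should verify against the ElasticQuotaTree CRD installed in your cluster:

apiVersion: scheduling.sigs.k8s.io/v1beta1   # assumption; verify against your cluster's CRD
kind: ElasticQuotaTree
metadata:
  name: elasticquotatree
  namespace: kube-system      # created in kube-system, as queried in the next step
spec:
  root:
    name: root
    max:
      cpu: 40                 # min of the root must equal max of the root
    min:
      cpu: 40
    children:                 # quotas of child nodes are configured below children
    - name: root.a
      max:
        cpu: 40
      min:
        cpu: 20
      children:
      - name: root.a.1
        namespaces:           # each leaf binds one or more namespaces
        - namespace1
        max:
          cpu: 20
        min:
          cpu: 10
      - name: root.a.2
        namespaces:
        - namespace2
        max:
          cpu: 20
        min:
          cpu: 10
    - name: root.b
      max:
        cpu: 40
      min:
        cpu: 20
      children:
      - name: root.b.1
        namespaces:
        - namespace3
        max:
          cpu: 20
        min:
          cpu: 10
      - name: root.b.2
        namespaces:
        - namespace4
        max:
          cpu: 20
        min:
          cpu: 10

Save the template as, for example, elastic-quota-tree.yaml and create it by running kubectl apply -f elastic-quota-tree.yaml.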
In the preceding YAML template, the namespaces bound to each leaf are configured in the namespaces field, and the elastic quotas of child nodes are configured in the children field. The quota configurations must meet the following requirements:
The minimum amount of resources for a node cannot exceed the maximum amount of resources for that node.
The total amount of minimum resources for all children of a parent node cannot exceed the minimum amount of resources for the parent node.
The minimum amount of resources for the root must equal the maximum amount of resources for the root. This amount cannot exceed the total amount of resources of the cluster.
Each namespace belongs to only one leaf. A leaf can contain multiple namespaces.
Run the following command to check whether the elastic quota group is created:
kubectl get ElasticQuotaTree -n kube-system
Expected output:
NAME               AGE
elasticquotatree   68s
Create a Deployment named nginx1 in namespace1 by using the following YAML template. The Deployment runs five pods, and each pod requests five CPU cores.
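The following Deployment template is a minimal sketch: the nginx image and labels are illustrative, while the five-CPU request per pod is the value the example depends on:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx1
  namespace: namespace1
  labels:
    app: nginx1
spec:
  replicas: 5                # five pods in total
  selector:
    matchLabels:
      app: nginx1
  template:
    metadata:
      labels:
        app: nginx1
    spec:
      containers:
      - name: nginx1
        image: nginx         # illustrative image
        resources:
          requests:
            cpu: 5           # each pod requests five CPU cores
          limits:
            cpu: 5

After the Deployment is created, run the following command to query the status of the pods: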
kubectl get pods -n namespace1
Expected output:
NAME                      READY   STATUS    RESTARTS   AGE
nginx1-744b889544-52dbg   1/1     Running   0          70s
nginx1-744b889544-6l4s9   1/1     Running   0          70s
nginx1-744b889544-cgzlr   1/1     Running   0          70s
nginx1-744b889544-w2gr7   1/1     Running   0          70s
nginx1-744b889544-zr5xz   0/1     Pending   0          70s
The total CPU request of the pods in namespace1 exceeds 10 CPU cores (min.cpu=10), which is the minimum guaranteed for root.a.1. Because the maximum for the root is 40 CPU cores (root.max.cpu=40), these pods can temporarily use idle CPU cores in the cluster. However, the total amount used cannot exceed 20 CPU cores (max.cpu=20), which is the maximum for root.a.1. When the pods in namespace1 use 20 CPU cores, the next pod to be scheduled becomes pending. As a result, four of the five pods of this Deployment are in the Running state and one is in the Pending state.
Create a second Deployment named nginx2 in namespace2 by using the following YAML template. The Deployment runs five pods, and each pod requests five CPU cores.
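The following is again a minimal sketch, mirroring the nginx1 template with only the name and namespace changed:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx2
  namespace: namespace2
  labels:
    app: nginx2
spec:
  replicas: 5                # five pods in total
  selector:
    matchLabels:
      app: nginx2
  template:
    metadata:
      labels:
        app: nginx2
    spec:
      containers:
      - name: nginx2
        image: nginx         # illustrative image
        resources:
          requests:
            cpu: 5           # each pod requests five CPU cores
          limits:
            cpu: 5

After the Deployment is created, run the following commands to query the status of the pods in both namespaces: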
kubectl get pods -n namespace1
Expected output:
NAME                      READY   STATUS    RESTARTS   AGE
nginx1-744b889544-52dbg   1/1     Running   0          111s
nginx1-744b889544-6l4s9   1/1     Running   0          111s
nginx1-744b889544-cgzlr   1/1     Running   0          111s
nginx1-744b889544-w2gr7   1/1     Running   0          111s
nginx1-744b889544-zr5xz   0/1     Pending   0          111s
kubectl get pods -n namespace2
Expected output:
NAME                      READY   STATUS    RESTARTS   AGE
nginx2-556f95449f-4gl8s   1/1     Running   0          111s
nginx2-556f95449f-crwk4   1/1     Running   0          111s
nginx2-556f95449f-gg6q2   0/1     Pending   0          111s
nginx2-556f95449f-pnz5k   1/1     Running   0          111s
nginx2-556f95449f-vjpmq   1/1     Running   0          111s
This Deployment is similar to the nginx1 Deployment. The total CPU request of the pods in namespace2 exceeds 10 CPU cores (min.cpu=10), which is the minimum guaranteed for root.a.2. Because the maximum for the root is 40 CPU cores (root.max.cpu=40), these pods can also temporarily use idle CPU cores in the cluster. However, the total amount used cannot exceed 20 CPU cores (max.cpu=20), which is the maximum for root.a.2. When the pods in namespace2 use 20 CPU cores, the next pod to be scheduled becomes pending. As a result, four of the five pods of this Deployment are in the Running state and one is in the Pending state. After the preceding two Deployments are created, the pods in namespace1 and namespace2 use a total of 40 CPU cores (root.max.cpu=40), which is the maximum for the root node.
Create a third Deployment named nginx3 in namespace3 by using the following YAML template. This Deployment runs five pods, and each pod requests five CPU cores.
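Again a minimal sketch, with only the name and namespace changed from the previous templates:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx3
  namespace: namespace3
  labels:
    app: nginx3
spec:
  replicas: 5                # five pods in total
  selector:
    matchLabels:
      app: nginx3
  template:
    metadata:
      labels:
        app: nginx3
    spec:
      containers:
      - name: nginx3
        image: nginx         # illustrative image
        resources:
          requests:
            cpu: 5           # each pod requests five CPU cores
          limits:
            cpu: 5

After the Deployment is created, run the following commands to query the status of the pods across the namespaces: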
kubectl get pods -n namespace1
Expected output:
NAME                      READY   STATUS    RESTARTS   AGE
nginx1-744b889544-52dbg   1/1     Running   0          6m17s
nginx1-744b889544-cgzlr   1/1     Running   0          6m17s
nginx1-744b889544-nknns   0/1     Pending   0          3m45s
nginx1-744b889544-w2gr7   1/1     Running   0          6m17s
nginx1-744b889544-zr5xz   0/1     Pending   0          6m17s
kubectl get pods -n namespace2
Expected output:
NAME                      READY   STATUS    RESTARTS   AGE
nginx2-556f95449f-crwk4   1/1     Running   0          4m22s
nginx2-556f95449f-ft42z   1/1     Running   0          4m22s
nginx2-556f95449f-gg6q2   0/1     Pending   0          4m22s
nginx2-556f95449f-hfr2g   1/1     Running   0          3m29s
nginx2-556f95449f-pvgrl   0/1     Pending   0          3m29s
kubectl get pods -n namespace3
Expected output:
NAME                     READY   STATUS    RESTARTS   AGE
nginx3-578877666-msd7f   1/1     Running   0          4m
nginx3-578877666-nfdwv   0/1     Pending   0          4m10s
nginx3-578877666-psszr   0/1     Pending   0          4m11s
nginx3-578877666-xfsss   1/1     Running   0          4m22s
nginx3-578877666-xpl2p   0/1     Pending   0          4m10s
The min of root.b.1 is set to 10. Therefore, when the nginx3 Deployment is created, its pods request the guaranteed min resources. The scheduler reclaims the CPU cores that belong to root.b and are temporarily used by root.a. This guarantees 10 CPU cores (min.cpu=10) for the pod scheduling of nginx3. Before the scheduler reclaims the temporarily used 10 CPU cores, it also considers other factors, such as the priority classes, availability, and creation time of the workloads of root.a. Therefore, after the pods of nginx3 are scheduled on the reclaimed 10 CPU cores (min.cpu=10), two pods are in the Running state and the other three are in the Pending state.
Create a fourth Deployment named nginx4 in namespace4 by using the following YAML template. This Deployment runs five pods, and each pod requests five CPU cores.
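As before, a minimal sketch with only the name and namespace changed:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx4
  namespace: namespace4
  labels:
    app: nginx4
spec:
  replicas: 5                # five pods in total
  selector:
    matchLabels:
      app: nginx4
  template:
    metadata:
      labels:
        app: nginx4
    spec:
      containers:
      - name: nginx4
        image: nginx         # illustrative image
        resources:
          requests:
            cpu: 5           # each pod requests five CPU cores
          limits:
            cpu: 5

After the Deployment is created, run the following commands to query the status of the pods across the namespaces: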
kubectl get pods -n namespace1
Expected output:
NAME                      READY   STATUS    RESTARTS   AGE
nginx1-744b889544-cgzlr   1/1     Running   0          8m20s
nginx1-744b889544-cwx8l   0/1     Pending   0          55s
nginx1-744b889544-gjkx2   0/1     Pending   0          55s
nginx1-744b889544-nknns   0/1     Pending   0          5m48s
nginx1-744b889544-zr5xz   1/1     Running   0          8m20s
kubectl get pods -n namespace2
Expected output:
NAME                      READY   STATUS    RESTARTS   AGE
nginx2-556f95449f-cglpv   0/1     Pending   0          3m45s
nginx2-556f95449f-crwk4   1/1     Running   0          9m31s
nginx2-556f95449f-gg6q2   1/1     Running   0          9m31s
nginx2-556f95449f-pvgrl   0/1     Pending   0          8m38s
nginx2-556f95449f-zv8wn   0/1     Pending   0          3m45s
kubectl get pods -n namespace3
Expected output:
NAME                     READY   STATUS    RESTARTS   AGE
nginx3-578877666-msd7f   1/1     Running   0          8m46s
nginx3-578877666-nfdwv   0/1     Pending   0          8m56s
nginx3-578877666-psszr   0/1     Pending   0          8m57s
nginx3-578877666-xfsss   1/1     Running   0          9m8s
nginx3-578877666-xpl2p   0/1     Pending   0          8m56s
kubectl get pods -n namespace4
Expected output:
NAME                      READY   STATUS    RESTARTS   AGE
nginx4-754b767f45-g9954   1/1     Running   0          4m32s
nginx4-754b767f45-j4v7v   0/1     Pending   0          4m32s
nginx4-754b767f45-jk2t7   0/1     Pending   0          4m32s
nginx4-754b767f45-nhzpf   0/1     Pending   0          4m32s
nginx4-754b767f45-tv5jj   1/1     Running   0          4m32s
Similarly, the min of root.b.2 is set to 10. Therefore, when nginx4 is created, its pods request the guaranteed min resources. The scheduler reclaims the CPU cores that belong to root.b and are temporarily used by root.a. This guarantees 10 CPU cores (min.cpu=10) for the pod scheduling of nginx4. Before the scheduler reclaims the temporarily used 10 CPU cores, it also considers other factors, such as the priority classes, availability, and creation time of the workloads in root.a. Therefore, after the pods of nginx4 are scheduled on the reclaimed 10 CPU cores (min.cpu=10), two pods are in the Running state and the other three are in the Pending state. After all four Deployments are created, the running pods in each namespace use the minimum amount of resources (min) guaranteed by the corresponding resource quota.
References
For more information about the release notes for kube-scheduler, see kube-scheduler.
ACK supports the gang scheduling feature, which is developed based on the kube-scheduler framework. Gang scheduling ensures that a group of correlated pods is scheduled at the same time. If the scheduling requirements are not met, none of the pods is scheduled. Gang scheduling provides a solution to job scheduling in all-or-nothing scenarios and is suitable for distributed applications, such as Spark and Hadoop big data computing jobs, that strictly require all correlated pods to be scheduled at the same time. For more information, see Work with gang scheduling.