In a Kubernetes cluster, the scheduling component named kube-scheduler schedules pods to nodes based on the overall resource allocation of the cluster to ensure the high availability of applications and improve cluster resource utilization. Container Service for Kubernetes (ACK) provides a variety of flexible scheduling policies for different workloads, including job scheduling, topology-aware scheduling, quality of service (QoS)-aware scheduling, and descheduling.
Before you start
This topic describes cluster scheduling solutions for cluster O&M engineers (including cluster resource administrators) and application developers. You can select a scheduling policy based on your business scenario and business role.
O&M engineers are concerned about cluster costs and must maximize the utilization of cluster resources to avoid resource waste. They also need to ensure the high availability of clusters. They want to balance the loads among nodes and avoid single points of failure (SPOFs) by scheduling pods properly.
Application developers want a simple way to deploy and manage applications, and expect applications to obtain the resources that they require, such as CPU, GPU, and memory resources, based on their performance requirements.
To better use the scheduling policies provided by ACK, we recommend that you learn about scheduling terms. For more information, see Kubernetes Scheduler, Node labels, Node-pressure Eviction, and Pod topology spread constraints.
The default scheduling policy of the ACK scheduler is the same as the default scheduling policy of the open source Kubernetes scheduler. The default scheduling policy consists of the Filter and Score plug-ins.
Kubernetes-native scheduling policies
Kubernetes-native scheduling policies can be classified into node scheduling policies and inter-pod scheduling policies.
Node scheduling policies: focus on the characteristics and resource conditions of nodes to ensure that pods are scheduled to nodes that meet their requirements.
Inter-pod scheduling policies: focus on controlling the distribution of pods to optimize the overall deployment of pods and ensure the high availability of applications.
| Policy | Description | Scenario |
| --- | --- | --- |
| NodeSelector | A simple targeted scheduling mechanism. You can label nodes by adding key-value pairs and then use `nodeSelector` to schedule pods to nodes with the specified labels. For example, you can use `nodeSelector` to schedule pods to specific nodes or to a specific node pool. | A basic node selection feature. It does not support complex scheduling features, such as anti-affinity rules. |
| Node affinity | More flexible and fine-grained than `nodeSelector`. For example, the `preferredDuringSchedulingIgnoredDuringExecution` field expresses a soft preference instead of a hard requirement. | Affinity rules are used to schedule pods to nodes with certain characteristics, such as regions, device types, and hardware configurations. Anti-affinity rules are used to avoid scheduling pods to nodes with certain characteristics, in order to spread pods across nodes to improve application availability. |
| Taints and tolerations | A taint consists of a key, a value, and an effect. Common effects are `NoSchedule`, `PreferNoSchedule`, and `NoExecute`. A node taint repels pods that do not declare a matching toleration. | Dedicate nodes to specific workloads, for example, to reserve GPU-accelerated nodes for pods that request GPU resources. |
| Pod affinity and anti-affinity | Pod labels are used to decide whether a pod can be scheduled to a node based on the pods that already run on it. Similar to node affinity rules, the `podAffinity` and `podAntiAffinity` fields support both required and preferred rules. | Pod affinity co-locates pods that frequently communicate with each other. Pod anti-affinity spreads the replicas of an application across nodes or zones to avoid single points of failure. |
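The native policies above can be combined in a single pod spec. The following minimal sketch shows `nodeSelector`, a soft node affinity rule, and a toleration together; the label keys, taint, and zone value are illustrative examples, not values required by ACK:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: scheduling-demo
spec:
  # nodeSelector: only nodes that carry this label are candidates.
  nodeSelector:
    disktype: ssd
  affinity:
    nodeAffinity:
      # Soft rule: prefer nodes in this zone, but fall back to others.
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values: ["cn-hangzhou-a"]
  # Toleration: allow scheduling to nodes tainted dedicated=gpu:NoSchedule.
  tolerations:
  - key: dedicated
    operator: Equal
    value: gpu
    effect: NoSchedule
  containers:
  - name: app
    image: nginx:1.25
```

Without the toleration, the pod would never land on a node that carries the `dedicated=gpu:NoSchedule` taint, even if the node matched the `nodeSelector` and affinity rules.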
Scheduling policies provided by ACK
If the Kubernetes-native scheduling policies cannot meet your business requirements, such as sequential scaling-out and reverse scaling-in of different instance resources, load-aware scheduling based on the actual resource usage of nodes, QoS guarantee in colocation scenarios, and pod descheduling and load balancing after descheduling, you can refer to the following scheduling policies provided by ACK.
Configure priority-based resource scheduling
Intended role: Cluster O&M engineers.
Description: If your ACK cluster contains various types of instance resources, such as Elastic Compute Service (ECS) instances and elastic container instances, with different billing methods, such as subscription, pay-as-you-go, and preemptible instances, we recommend that you configure priority-based resource scheduling. This allows you to specify the order in which different types of node resources are selected during pod scheduling and perform scale-in activities in a reverse order.
| Policy | Description | Scenario | Reference |
| --- | --- | --- | --- |
| Custom priority-based resource scheduling | You can create a `ResourcePolicy` to specify the order in which different types of node resources are selected during pod scheduling. During application scale-in, the cluster deletes pods in the reverse order: it first deletes pods on elastic container instances to release the node resources that they provide, then deletes pods on pay-as-you-go ECS instances, and lastly deletes pods on subscription ECS instances. | Clusters that contain multiple types of instance resources with different billing methods, where you want to control the order in which resources are used during scale-out and released during scale-in. | |
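A priority policy is declared as a `ResourcePolicy` custom resource. The sketch below assumes the CRD schema documented by ACK (`scheduling.alibabacloud.com/v1alpha1`); the node pool ID is a hypothetical placeholder, and you should verify the field names against the CRD version installed in your cluster:

```yaml
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: demo-policy
  namespace: default
spec:
  # Pods whose labels match this selector are governed by the policy.
  selector:
    app: demo
  strategy: prefer
  # Units are tried in order during scale-out; scale-in runs in reverse.
  units:
  - resource: ecs                                # 1) ECS nodes in a node pool
    nodeSelector:
      alibabacloud.com/nodepool-id: np-xxx       # hypothetical node pool ID
  - resource: eci                                # 2) elastic container instances
```

With this policy, new replicas of `app: demo` fill the ECS node pool first and overflow to elastic container instances; scale-in removes the elastic container instance pods first.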
Job scheduling
Intended role: Cluster O&M engineers.
Description: The Kubernetes scheduler decides which node runs a pod based on predefined rules. However, it is not well suited to scheduling the pods of batch jobs. ACK supports gang scheduling and capacity scheduling for batch jobs.
| Policy | Description | Scenario | Reference |
| --- | --- | --- | --- |
| Gang scheduling | In a concurrent system, multiple correlated processes in a job that requires all-or-nothing scheduling can be scheduled to run simultaneously on different processors. All related pods are scheduled, or none of them is, which prevents abnormal processes from blocking the entire group of correlated processes. | Distributed jobs, such as distributed machine learning training, whose pods must start together because partially scheduled pods only occupy resources while they wait for the rest of the group. | |
| Capacity scheduling | This feature allows clusters to reserve a specific amount of resources for specific namespaces or user groups, and improves the overall resource utilization by implementing resource sharing when cluster resources are barely sufficient. | In a multi-tenant scenario, different tenants demand different resource lifecycles and use resources in different ways. Consequently, the cluster resource utilization is low. Resource sharing and reclaiming based on a fixed amount of resources are needed to improve resource utilization. | |
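Gang scheduling in ACK is typically declared through pod-group labels. The sketch below uses the community coscheduling label keys that ACK documents; treat the group name and image as illustrative and confirm the label keys against the gang scheduling reference:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trainer-0
  labels:
    # All pods that carry the same pod-group name are scheduled as one unit.
    pod-group.scheduling.sigs.k8s.io/name: tf-job
    # Schedule the group only if at least this many pods can be placed.
    pod-group.scheduling.sigs.k8s.io/min-available: "3"
spec:
  containers:
  - name: trainer
    image: tensorflow/tensorflow:2.15.0
```

If fewer than three pods of the `tf-job` group can be placed, none of them is scheduled, so no cluster resources are held by an incomplete job.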
Affinity scheduling
Intended role: Cluster O&M engineers.
Description: You can schedule workloads to specific instance resources, such as field-programmable gate array (FPGA)-accelerated nodes and ARM-based nodes, based on Kubernetes-native scheduling policies. ACK clusters enhance scheduling capabilities by allowing the scheduler to loop through all topology domains during pod scheduling until the scheduler finds a topology domain that meets the requirements of all pods created by a job.
| Policy | Description | Scenario | Reference |
| --- | --- | --- | --- |
| Topology-aware scheduling | The scheduler adds gang scheduling labels to a job to ensure that the resource requests of all pods are fulfilled at the same time. You can also use topology-aware scheduling to enable the scheduler to loop through a list of topology domains until it finds a topology domain that meets the requirements of all pods. You can associate node pools with deployment sets to schedule pods to ECS instances in the same lower-latency deployment set to improve the performance of jobs. | In machine learning or big data analysis jobs, pods need to communicate frequently. The scheduler loops through a list of topology domains until it finds one that meets the requirements of all pods. This way, the amount of time required to complete the job is reduced. | |
| Schedule workloads to FPGA-accelerated nodes | Workloads are scheduled to FPGA-accelerated nodes by using node labels and node affinity rules. | In heterogeneous computing scenarios, you need to schedule workloads to FPGA-accelerated nodes to improve the utilization of FPGA resources. | |
| Schedule workloads to ARM-based nodes | By default, ACK schedules all workloads to x86-based worker nodes. You can configure node affinity rules and tolerations to schedule workloads to ARM-based nodes. | Your cluster contains ARM-based nodes and other nodes, such as x86-based nodes, and you want to schedule ARM workloads only to ARM-based nodes or preferably schedule multi-arch workloads to ARM-based nodes. | |
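Restricting a workload to ARM-based nodes can be done with a required node affinity rule on the standard `kubernetes.io/arch` label, which the kubelet sets automatically on every node. A minimal sketch (the Deployment name and image are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: arm-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: arm-app
  template:
    metadata:
      labels:
        app: arm-app
    spec:
      affinity:
        nodeAffinity:
          # Hard rule: schedule only to ARM64 nodes.
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/arch
                operator: In
                values: ["arm64"]
      containers:
      - name: app
        image: nginx:1.25
```

To *prefer* ARM-based nodes for a multi-arch image instead of requiring them, replace the required rule with a `preferredDuringSchedulingIgnoredDuringExecution` entry. If the ARM-based nodes in your cluster carry taints, also add matching tolerations.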
Load-aware scheduling
Intended role: Cluster O&M engineers and application developers.
Description: With Kubernetes-native scheduling policies, the scheduler places a pod based on resource allocation: it compares the resource requests of the pod with the allocatable resources on each node to decide whether the pod can be scheduled there. However, the actual resource usage of nodes changes dynamically over time with the cluster environment, traffic, and workload requests, and the Kubernetes scheduler cannot detect the actual resource loads on nodes.
| Policy | Description | Scenario | Reference |
| --- | --- | --- | --- |
| Load-aware scheduling | By reviewing the historical load statistics of nodes and estimating the resource usage of newly scheduled pods, the ACK scheduler monitors node loads and schedules pods to nodes with lower loads to implement load balancing. This prevents application or node crashes caused by a single overloaded node. You can work with load-aware hotspot descheduling to prevent imbalanced load distribution among nodes. | Applications that are sensitive to load or access latency, or that have requirements on the QoS class of resources. | |
QoS-aware scheduling
Intended role: Cluster O&M engineers and application developers.
Description: You can configure QoS classes for pods, including Guaranteed, Burstable, and BestEffort. When node resources are insufficient, the kubelet decides which pods to evict from a node based on the QoS class. For applications with different QoS classes, ACK provides the service level objective (SLO)-aware resource scheduling feature to enhance the performance and service quality of latency-sensitive applications and ensure the resource usage for lower-priority jobs.
| Policy | Description | Scenario | Reference |
| --- | --- | --- | --- |
| CPU Burst | Due to the CPU limit, the operating system limits the use of resources within a cycle, which may cause containers to experience CPU throttling. CPU Burst allows a container to accumulate CPU time slices when the container is idle. The container can use the accumulated CPU time slices to burst above the CPU limit when the resource demand spikes. This enhances container performance, reduces latency, and improves service quality. | Latency-sensitive applications whose CPU usage occasionally spikes and that are affected by CPU throttling. | |
| Topology-aware CPU scheduling | Pin the pods of CPU-sensitive workloads to specific CPU cores on nodes. This addresses application performance degradation caused by frequent CPU context switching and memory access across NUMA nodes. | CPU-sensitive workloads, such as latency-sensitive online applications, that run on multi-core or multi-NUMA nodes. | |
| Topology-aware GPU scheduling | When multiple GPUs are deployed in a cluster and multiple GPU-intensive pods run at the same time, the pods may compete for GPU resources and frequently switch between different GPUs or NUMA nodes, which affects application performance. Topology-aware GPU scheduling schedules workloads to different GPUs, which reduces memory access across NUMA nodes and improves application performance and response speed. | Nodes with multiple GPUs that run several GPU-intensive pods at the same time. | |
| Dynamic resource overcommitment | Quantify the resources that are allocated to pods but are not in use and schedule them to low-priority jobs to achieve resource overcommitment. Single-node QoS policies must be used together to prevent applications from interfering with each other. | Improve the resource utilization of clusters by using colocation. Typical colocation scenarios include machine learning model training and inference, big data batch processing and data analysis, online services, and offline backup services. | |
| Dynamically modify the resource parameters of a pod | In a cluster that runs Kubernetes 1.27 or earlier, modifying the container parameters of a pod requires changing the pod spec and submitting the change, after which the pod is deleted and recreated. ACK allows you to modify the CPU parameters, the memory parameters, and the disk IOPS limit of a pod without restarting the pod. | Scenarios in which you want to temporarily adjust the CPU and memory resources of a running pod. | |
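With ack-koordinator installed, CPU Burst can be enabled per pod through an annotation. The sketch below uses the Koordinator annotation key and `auto` policy; confirm the exact key and value format against the CPU Burst reference for your ack-koordinator version:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: latency-sensitive
  annotations:
    # "auto": the node agent decides when the container may burst
    # above its CPU limit using accumulated idle time slices.
    koordinator.sh/cpuBurst: '{"policy": "auto"}'
spec:
  containers:
  - name: app
    image: nginx:1.25
    resources:
      requests:
        cpu: "1"
      limits:
        cpu: "2"
```

Without the annotation, the container is throttled whenever it exceeds 2 CPUs within a CFS period; with it, short spikes can draw on the time slices accumulated while the container was idle.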
Descheduling
Intended role: Cluster O&M engineers and application developers.
Description: The Kubernetes scheduler schedules pods to proper nodes based on the current cluster status. However, the cluster status constantly changes. In some scenarios, you may need to migrate running pods to other nodes, which means you must reschedule the pod to a different node.
| Policy | Description | Scenario | Reference |
| --- | --- | --- | --- |
| Descheduling | In scenarios in which hotspot nodes exist because cluster resources are unequally used across nodes, or in which pods no longer comply with the predefined scheduling policy due to changes in node attributes, descheduling migrates improperly placed pods to other nodes so that pods run on the optimal nodes. This ensures the high availability and efficiency of the workloads in your cluster. | Clusters in which resource utilization is imbalanced across nodes, or in which node attributes have changed after pods were scheduled. | |
| Work with load-aware hotspot descheduling | You can use a combination of load-aware scheduling and hotspot descheduling to monitor changes in node loads and automatically rebalance nodes that exceed the load threshold to prevent node overloading. | Clusters in which node loads fluctuate and hotspot nodes need to be rebalanced continuously. | |
Billing
When you use the scheduling feature of ACK, you are charged for cluster management and cloud resources based on the billing rules. In addition, you are charged the following fees for the scheduling component:
The default ACK scheduler, kube-scheduler, is a control plane component that you can use free of charge.
The resource scheduling optimization and descheduling features of ACK are implemented based on ack-koordinator. The installation and use of ack-koordinator are free of charge, but additional fees may be generated in specific scenarios. For more information, see ack-koordinator (ack-slo-manager).
FAQ
If you encounter any problems when you use the scheduling feature, see Scheduling FAQ for troubleshooting.
How do I avoid pod startup failures due to insufficient IP addresses provided by the vSwitch?
How do I migrate from ack-descheduler to Koordinator Descheduler?
Why are pods not scheduled to the new node that I added to the cluster?
What are the precautions for using the descheduling feature in ACK? Does this feature restart pods?
How do I ensure the high availability of pods when scheduling the pods of a workload?
References
For more information about the introduction and release notes of kube-scheduler and ack-koordinator, see kube-scheduler and ack-koordinator (ack-slo-manager).
For more information about how to customize the behavior of kube-scheduler to optimize pod scheduling, see Configure the custom parameters of kube-scheduler.
For more information about the best practices in scheduling scenarios, such as how to ensure service quality in a colocation architecture and achieve stable resource overcommitment, see Best practices for resource scheduling.
You can enable cost insights to get the usage and cost allocation of resources in ACK clusters. Cost insights also provide suggestions on cost savings to improve the overall resource utilization. For more information, see Overview of cost insights.
For more information about how to implement GPU scheduling and memory isolation, see GPU sharing overview.
For more information about the scheduling solutions for virtual nodes, see Compare and introduce virtual node scheduling schemes.