
Container Service for Kubernetes: Scheduling overview

Last Updated: Aug 02, 2024

In a Kubernetes cluster, the scheduling component, kube-scheduler, schedules pods to nodes based on the overall resource allocation of the cluster to ensure the high availability of applications and improve cluster resource utilization. Container Service for Kubernetes (ACK) provides flexible and diverse scheduling policies for different workloads, including job scheduling, topology-aware scheduling, quality of service (QoS)-aware scheduling, and descheduling.

Before you start

  • This topic describes cluster scheduling solutions for cluster O&M engineers (including cluster resource administrators) and application developers. You can select a scheduling policy based on your business scenario and business role.

    • O&M engineers are concerned about cluster costs and must maximize the utilization of cluster resources to avoid resource waste. They also need to ensure the high availability of clusters. They want to balance the loads among nodes and avoid single points of failure (SPOFs) by scheduling pods properly.

    • Application developers want to deploy and manage applications in a simple way, and expect their applications to obtain the required resources, such as CPU, GPU, and memory resources, based on performance requirements.

  • To better use the scheduling policies provided by ACK, we recommend that you learn about scheduling terms. For more information, see Kubernetes Scheduler, Node labels, Node-pressure Eviction, and Pod topology spread constraints.

    The default scheduling policy of the ACK scheduler is the same as that of the open source Kubernetes scheduler and consists of the Filter and Score plug-ins.

Kubernetes-native scheduling policies

Kubernetes-native scheduling policies can be classified into node scheduling policies and inter-pod scheduling policies.

  • Node scheduling policies: focus on the characteristics and resource conditions of nodes to ensure that pods are scheduled to nodes that meet their requirements.

  • Inter-pod scheduling policies: focus on controlling the distribution of pods to optimize the overall deployment of pods and ensure the high availability of applications.

nodeSelector

Description: A simple targeted scheduling mechanism. You can label nodes with key-value pairs and then use nodeSelector to schedule pods to nodes that carry the specified labels.

Scenario: For example, you can use nodeSelector to schedule pods to specific nodes or to a specific node pool. This is a basic node selection feature that does not support complex scheduling features, such as anti-affinity rules.
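The following is a minimal sketch of a pod that uses nodeSelector. The disktype=ssd label is a hypothetical example that you would first add to a node, for example with kubectl label nodes <node-name> disktype=ssd:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-ssd
spec:
  containers:
    - name: nginx
      image: nginx:stable
  # Schedule this pod only to nodes that carry the disktype=ssd label.
  nodeSelector:
    disktype: ssd
```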

nodeAffinity

Description: A more flexible and fine-grained scheduling policy than nodeSelector. For example, the requiredDuringSchedulingIgnoredDuringExecution rule ensures that pods are scheduled only to nodes that meet the specified conditions, whereas the preferredDuringSchedulingIgnoredDuringExecution rule gives preference to qualifying nodes without enforcing the requirement.

Scenario: Affinity rules are used to schedule pods to nodes with certain characteristics, such as a specific region, device type, or hardware configuration. Anti-affinity rules are used to avoid scheduling pods to nodes with certain characteristics, in order to spread pods across nodes and improve application availability.
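The following sketch combines a hard node affinity rule with a soft one. The zone value and the disktype label are hypothetical examples:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-affinity
spec:
  containers:
    - name: app
      image: nginx:stable
  affinity:
    nodeAffinity:
      # Hard requirement: only nodes in the listed zone are eligible.
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                  - cn-hangzhou-h
      # Soft preference: among eligible nodes, prefer those labeled disktype=ssd.
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
              - key: disktype
                operator: In
                values:
                  - ssd
```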

Taints and tolerations

Description: A taint consists of a key, a value, and an effect. The effect can be NoSchedule, PreferNoSchedule, or NoExecute. After you add a taint to a node, only pods that are configured with a matching toleration in their YAML files can be scheduled to the node.

Scenario:

  • Taints and tolerations can be used to reserve dedicated node resources for specific applications, such as reserving GPU-accelerated nodes for AI or machine learning (ML) workloads.

    ACK allows you to add taints or labels to node pools to schedule certain application pods to the specified node pool. For more information, see Create a node pool and Modify a node pool.

  • Taints and tolerations can also be used to evict pods. For example, after you add a taint with the NoExecute effect to a node, pods that do not tolerate the taint are evicted from the node, and new pods that do not tolerate the taint are not scheduled to the node.
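As a sketch, the following pod tolerates a hypothetical workload-type=gpu:NoSchedule taint, so it can be scheduled to nodes reserved in this way while pods without the toleration cannot. The taint key and the image are example values:

```yaml
# Reserve a node first, for example:
#   kubectl taint nodes <node-name> workload-type=gpu:NoSchedule
apiVersion: v1
kind: Pod
metadata:
  name: ml-training
spec:
  containers:
    - name: trainer
      image: registry.example.com/trainer:v1  # hypothetical image
  # Only pods with a matching toleration can be scheduled to the tainted node.
  tolerations:
    - key: workload-type
      operator: Equal
      value: gpu
      effect: NoSchedule
```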

Inter-pod affinity and anti-affinity

Description: Pod labels are used to control the nodes to which pods are scheduled, relative to the pods that already run on those nodes. Similar to node affinity, both the podAffinity and podAntiAffinity fields support requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution rules.

Scenario:

  • Schedule pods that work collaboratively on the same node or on neighboring nodes to reduce network latency and improve communication efficiency. For example, deploy front-end services and back-end services on the same node.

  • Distribute business-critical application pods across different nodes or fault domains. For example, deploy different replicas of a database on different nodes.
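The following sketch spreads the replicas of a hypothetical Redis Deployment across nodes with a required pod anti-affinity rule, so that no two replicas share a node:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
spec:
  replicas: 3
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:7
      affinity:
        podAntiAffinity:
          # Never place two pods labeled app=redis on the same node.
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: redis
              topologyKey: kubernetes.io/hostname
```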

Scheduling policies provided by ACK

If the Kubernetes-native scheduling policies cannot meet your business requirements, such as sequential scaling-out and reverse scaling-in of different instance resources, load-aware scheduling based on the actual resource usage of nodes, QoS guarantee in colocation scenarios, and pod descheduling and load balancing after descheduling, you can refer to the following scheduling policies provided by ACK.

Configure priority-based resource scheduling

  • Intended role: Cluster O&M engineers.

  • Description: If your ACK cluster contains various types of instance resources, such as Elastic Compute Service (ECS) instances and elastic container instances, with different billing methods, such as subscription, pay-as-you-go, and preemptible instances, we recommend that you configure priority-based resource scheduling. This allows you to specify the order in which different types of node resources are selected during pod scheduling and perform scale-in activities in a reverse order.

Custom priority-based resource scheduling

Description: You can create a custom ResourcePolicy during application releases or scaling to specify the order in which different types of node resources are selected during pod scheduling, as shown in the sketch below. For example, pods can be preferentially scheduled to subscription ECS instances, then to pay-as-you-go ECS instances, and lastly to elastic container instances. During application scale-in, pods are deleted in the reverse order: the cluster preferentially deletes pods on elastic container instances to release the resources that they occupy, then deletes pods on pay-as-you-go ECS instances, and lastly deletes pods on subscription ECS instances.

Scenario:

  • Specify the nodes that are preferred or that must be avoided to balance the resource usage of nodes in the cluster.

  • If an application requires high performance, its pods are preferentially scheduled to high-performance nodes.

  • If an application does not require high performance, its pods are preferentially scheduled to preemptible instances or nodes that have idle computing resources. This reduces resource costs.

Reference: Configure priority-based resource scheduling
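The following is a sketch of a ResourcePolicy that implements the order described above. The field layout follows the examples in the ACK documentation, but the node pool IDs and the app label are hypothetical; verify the apiVersion and fields against the current ResourcePolicy API before use:

```yaml
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: web-resource-policy
  namespace: default
spec:
  # Pods with this label are scheduled according to the unit order below.
  selector:
    app: web
  strategy: prefer
  units:
    # 1. Prefer the subscription ECS node pool (hypothetical node pool ID).
    - resource: ecs
      nodeSelector:
        alibabacloud.com/nodepool-id: np-subscription-example
    # 2. Then fall back to the pay-as-you-go ECS node pool.
    - resource: ecs
      nodeSelector:
        alibabacloud.com/nodepool-id: np-payg-example
    # 3. Finally, scale out to elastic container instances.
    - resource: eci
```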

Job scheduling

  • Intended role: Cluster O&M engineers.

  • Description: The Kubernetes scheduler decides the node on which to run a pod based on predefined rules. However, it schedules pods one by one and is therefore not suitable for the pods of batch jobs. ACK supports gang scheduling and capacity scheduling for batch jobs.

Gang Scheduling

Description: Gang scheduling schedules the correlated pods of a job in an all-or-nothing manner: either all pods in the group are scheduled at the same time, or none of them is scheduled. This allows multiple correlated processes to run simultaneously on different nodes and prevents abnormal processes from blocking the entire group.

Scenario:

  • Batch jobs: A job contains multiple interdependent tasks that must be processed at the same time.

  • Distributed computing: Machine learning training jobs or other distributed applications whose processes must strictly start at the same time.

  • High-performance computing: A job may require all resources to be available at the same time before the job can be executed.

Reference: Work with gang scheduling
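As a sketch, gang scheduling in ACK is declared through pod labels that name a pod group and its minimum group size. The label keys follow the ACK gang scheduling documentation; the group name, image, and resource values are hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-worker-0
  labels:
    # All pods that share this group name are scheduled all-or-nothing.
    pod-group.scheduling.sigs.k8s.io/name: tf-training
    # The group is scheduled only when at least 3 member pods can be placed.
    pod-group.scheduling.sigs.k8s.io/min-available: "3"
spec:
  containers:
    - name: worker
      image: registry.example.com/tf-worker:v1  # hypothetical image
      resources:
        requests:
          cpu: "4"
          memory: 8Gi
```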

Capacity Scheduling

Description: This feature allows clusters to reserve a specific amount of resources for specific namespaces or user groups, and improves the overall resource utilization by sharing idle resources when cluster resources are insufficient.

Scenario: In a multi-tenant scenario, different tenants demand different resource lifecycles and use resources in different ways, which results in low cluster resource utilization. Resource sharing and reclaiming based on a fixed amount of reserved resources are needed to improve resource utilization.

Reference: Work with capacity scheduling
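Capacity scheduling in ACK is configured through an ElasticQuotaTree object. The following is a sketch under the assumption that the apiVersion and field names match the ACK capacity scheduling documentation; the quota values and the dev namespace are hypothetical:

```yaml
apiVersion: scheduling.sigs.k8s.io/v1beta1
kind: ElasticQuotaTree
metadata:
  name: elasticquotatree
  namespace: kube-system
spec:
  root:
    name: root
    max:
      cpu: "40"
      memory: 40Gi
    min:
      cpu: "40"
      memory: 40Gi
    children:
      # The dev namespace is guaranteed 10 cores (min) and can borrow idle
      # capacity up to 20 cores (max) when other groups do not need it.
      - name: root.dev
        max:
          cpu: "20"
          memory: 20Gi
        min:
          cpu: "10"
          memory: 10Gi
        namespaces:
          - dev
```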

Affinity scheduling

  • Intended role: Cluster O&M engineers.

  • Description: You can schedule workloads to specific instance resources, such as field-programmable gate array (FPGA)-accelerated nodes and ARM-based nodes, based on Kubernetes-native scheduling policies. ACK clusters enhance scheduling capabilities by allowing the scheduler to loop through all topology domains during pod scheduling until the scheduler finds a topology domain that meets the requirements of all pods created by a job.

Topology-aware scheduling

Description: The scheduler adds gang scheduling labels to a job to ensure that the resource requests of all pods are fulfilled at the same time. You can also use topology-aware scheduling to enable the scheduler to loop through a list of topology domains until it finds a topology domain that meets the requirements of all pods. In addition, you can associate node pools with deployment sets to schedule pods to ECS instances in the same low-latency deployment set to improve job performance.

Scenario: In machine learning or big data analysis jobs, pods need to communicate frequently. The scheduler loops through a list of topology domains until it finds a topology domain that meets the requirements of all pods. This reduces the amount of time required to complete the job.

Schedule workloads to FPGA-accelerated nodes

Description: Use nodeSelector to schedule workloads to FPGA-accelerated nodes.

Scenario: In heterogeneous computing scenarios, you need to schedule workloads to FPGA-accelerated nodes to improve the utilization of FPGA resources.

Schedule workloads to ARM-based nodes

Description: By default, ACK schedules all workloads to x86-based worker nodes. You can configure nodeSelector or nodeAffinity to schedule workloads to ARM-based nodes, as shown in the sketch below.

Scenario: Your cluster contains ARM-based nodes and other nodes, such as x86-based nodes, and you want to schedule ARM workloads only to ARM-based nodes or preferably schedule multi-arch workloads to ARM-based nodes.
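The following minimal sketch schedules a pod only to ARM-based nodes by using the well-known kubernetes.io/arch label that the kubelet sets on every node:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: arm-app
spec:
  containers:
    - name: app
      # Use a multi-arch image that provides an arm64 variant.
      image: nginx:stable
  # Schedule this pod only to ARM-based nodes.
  nodeSelector:
    kubernetes.io/arch: arm64
```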

Load-aware scheduling

  • Intended role: Cluster O&M engineers and application developers.

  • Description: The Kubernetes-native scheduling policy makes decisions based on resource allocation: the scheduler compares the resource requests of a pod with the allocatable resources on a node to determine whether to schedule the pod to the node. However, the actual resource usage of nodes changes dynamically with time, the cluster environment, traffic, and workload requests, and the Kubernetes scheduler cannot detect the actual resource loads on nodes.

Load-aware scheduling

Description: The ACK scheduler monitors node loads by reviewing the historical load statistics of nodes and estimating the resource usage of newly scheduled pods, and it schedules pods to nodes with lower loads to implement load balancing. This prevents application or node crashes caused by a single overloaded node.

Scenario: Applications that are sensitive to loads or access latency, or that have requirements on the QoS class of resources.

Reference: Use load-aware scheduling

Note: Work with load-aware hotspot descheduling to prevent imbalanced load distribution among nodes.

QoS-aware scheduling

  • Intended role: Cluster O&M engineers and application developers.

  • Description: You can configure QoS classes for pods, including Guaranteed, Burstable, and BestEffort, as shown in the sketch after this list. When node resources are insufficient, the kubelet decides which pods to evict from a node based on their QoS classes. For applications with different QoS classes, ACK provides the service level objective (SLO)-aware resource scheduling feature to enhance the performance and service quality of latency-sensitive applications while ensuring resources for lower-priority jobs.
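As a sketch, the QoS class of a pod is derived from its resource configuration. In the following pod, requests equal limits for every resource of every container, so Kubernetes assigns the Guaranteed QoS class:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: latency-sensitive-app
spec:
  containers:
    - name: app
      image: nginx:stable
      resources:
        # requests == limits for all resources -> Guaranteed QoS class.
        requests:
          cpu: "2"
          memory: 4Gi
        limits:
          cpu: "2"
          memory: 4Gi
```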

CPU Burst

Description: When a CPU limit is set, the operating system restricts the CPU time that a container can use within each scheduling period, which may cause CPU throttling. CPU Burst allows a container to accumulate CPU time slices while it is idle and use the accumulated time slices to burst above the CPU limit when the resource demand spikes, as shown in the sketch below. This enhances container performance, reduces latency, and improves service quality.

Scenario:

  • Scenarios in which a container consumes large amounts of CPU resources during the startup and loading phases but occupies a regular amount of CPU resources after loading is complete.

  • Scenarios in which the CPU resource demand suddenly increases, such as e-commerce, online gaming, and other web services and applications, and containers must quickly respond to surges in business traffic.

Reference: CPU Burst
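ack-koordinator enables CPU Burst through a pod annotation. The following sketch assumes the koordinator annotation key and value shown below; verify them against the version of ack-koordinator in your cluster:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-server
  annotations:
    # Let the component decide when the container may burst above its CPU limit.
    koordinator.sh/cpuBurst: '{"policy": "auto"}'
spec:
  containers:
    - name: web
      image: nginx:stable
      resources:
        requests:
          cpu: "1"
        limits:
          cpu: "2"
```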

Topology-aware CPU scheduling

Description: Pin the pods of CPU-sensitive workloads to specific CPU cores on nodes. This addresses application performance degradation caused by frequent CPU context switching and memory access across NUMA nodes.

Scenario:

  • Applications that are not adapted to cloud-native scenarios. For example, the number of threads is specified based on the total physical cores of the device instead of the container specifications. As a result, application performance degrades.

  • Applications that run on multi-core ECS Bare Metal instances with Intel CPUs or AMD CPUs and experience performance degradation due to memory access across NUMA nodes.

  • Applications that are highly sensitive to CPU context switching and cannot tolerate performance fluctuations.

Reference: Topology-aware CPU scheduling

Topology-aware GPU scheduling

Description: When multiple GPUs are deployed in a cluster and multiple GPU-intensive pods run at the same time, the pods may compete for GPU resources and frequently switch between different GPUs or NUMA nodes, which degrades application performance. Topology-aware GPU scheduling schedules workloads to different GPUs, which reduces memory access across NUMA nodes and improves application performance and response speed.

Scenario:

  • Large-scale distributed computing scenarios that require efficient data transfer and processing, such as high-performance computing.

  • Scenarios that require a large amount of GPU resources for learning and training and proper allocation of training jobs to different GPUs, such as machine learning and deep learning.

  • Scenarios that require efficient allocation of rendering jobs to different GPUs, such as graphics rendering and game development.

Dynamic resource overcommitment

Description: Quantify the resources that are allocated to pods but are not in use and schedule them to low-priority jobs to achieve resource overcommitment. The following single-node QoS policies must be used together to prevent applications from affecting each other:

  • CPU Suppress: Limit the amount of CPU resources that can be used by low-priority pods to keep the overall resource usage of the node below the threshold. This ensures the stability of the containers on the node.

  • CPU QoS: Use QoS classes to ensure that sufficient CPU resources are allocated to high-priority applications.

  • Memory QoS: Use QoS classes to ensure that sufficient memory resources are allocated to high-priority applications and to delay the triggering of the memory reclaim mechanism.

  • Resource isolation based on the L3 cache and Memory Bandwidth Allocation (MBA): Use QoS classes to ensure that the L3 cache and memory bandwidth are prioritized for high-priority applications.

Scenario: Improve the resource utilization of clusters by using colocation (see the sketch below). Typical colocation scenarios include machine learning model training and inference, big data batch processing and data analysis, online services, and offline backup services.
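As a sketch of the colocation model, a low-priority job declares the BE (best-effort) QoS class and requests the overcommitted "batch" resources that are reclaimed from allocated-but-unused capacity. The label key and extended resource names follow the koordinator conventions used by ack-koordinator; treat them as assumptions to verify for your component version:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-analytics
  labels:
    # Mark the pod as a low-priority colocation workload.
    koordinator.sh/qosClass: BE
spec:
  containers:
    - name: analytics
      image: registry.example.com/analytics:v1  # hypothetical image
      resources:
        # Request reclaimed batch resources instead of regular cpu/memory.
        requests:
          kubernetes.io/batch-cpu: "1000"    # millicores
          kubernetes.io/batch-memory: 2Gi
        limits:
          kubernetes.io/batch-cpu: "1000"
          kubernetes.io/batch-memory: 2Gi
```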

Dynamically modify the resource parameters of a pod

Description: In a cluster that runs Kubernetes 1.27 or earlier, if you want to modify the container parameters of a pod, you must modify the PodSpec and submit the change, after which the pod is deleted and recreated. ACK allows you to modify the CPU parameters, memory parameters, and disk IOPS limit of a pod without restarting the pod.

Scenario: This feature is suitable for scenarios in which you want to temporarily adjust CPU and memory resources.

Reference: Dynamically modify the resource parameters of a pod

Descheduling

  • Intended role: Cluster O&M engineers and application developers.

  • Description: The Kubernetes scheduler schedules pods to proper nodes based on the current cluster status. However, the cluster status constantly changes. In some scenarios, you may need to migrate running pods to other nodes, that is, reschedule the pods to different nodes.

Descheduling

Description: Hotspot nodes may appear because cluster resources are not evenly used across nodes, and pods may no longer comply with the predefined scheduling policy after node attributes change. In these cases, descheduling evicts pods that are no longer properly placed on a node so that they can be rescheduled to more suitable nodes. This ensures the high availability and efficiency of the workloads in your cluster.

Scenario:

  • The workloads in the cluster are not evenly distributed and some nodes are overloaded. For example, different applications are scheduled to the same node in colocation scenarios.

  • The overall resource utilization of the cluster is low and you want to remove some nodes to reduce costs.

  • The cluster contains a large number of resource fragments. As a result, a node may not have sufficient resources even if the total amount of resources in the cluster is sufficient.

  • Taints or labels are added to or removed from a node.

Load-aware hotspot descheduling

Description: You can use a combination of load-aware scheduling and hotspot descheduling to monitor changes in the loads of nodes and automatically optimize the nodes that exceed the load threshold to prevent node overloading.

Reference: Work with load-aware hotspot descheduling

Billing

When you use the scheduling feature of ACK, you are charged for cluster management and cloud resources based on the billing rules. In addition, you are charged the following fees for the scheduling component:

  • The default ACK scheduler, which is based on kube-scheduler, is a control plane component that you can use free of charge.

  • The resource scheduling optimization and descheduling features of ACK are implemented based on ack-koordinator. The installation and use of ack-koordinator are free of charge, but additional fees may be generated in specific scenarios. For more information, see ack-koordinator (ack-slo-manager).

FAQ

If you encounter any problems when you use the scheduling feature, see Scheduling FAQ for troubleshooting.
