Priority-based resource scheduling is an elastic scheduling policy provided by Alibaba Cloud. It allows you to use the ResourcePolicy resource to define the order in which application instance pods are scheduled to different types of node resources during deployment or scale-out activities. During scale-in activities, pods are removed in the reverse order of the original scheduling sequence.
Important
As of kube-scheduler v1.x.x-aliyun-6.4, the default value of the ignorePreviousPod parameter of the priority-based resource scheduling feature is false, and the default value of the ignoreTerminatingPod parameter is true. Existing ResourcePolicies that use these parameters are not affected by this change or by subsequent updates.
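If your workloads depend on a specific counting behavior, you can set both parameters explicitly in the ResourcePolicy spec instead of relying on the scheduler default. A minimal fragment (both fields are described in the Procedure section below):
spec:
  ignorePreviousPod: false
  ignoreTerminatingPod: true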
Prerequisites
- You have created an ACK Pro cluster running Kubernetes 1.20.11 or later. For details on upgrading, see Upgrade the Kubernetes Version of an ACK Cluster.
- Ensure that the scheduler version meets the requirements for your ACK cluster version. For more information about the features supported by different scheduler versions, see kube-scheduler.
  ACK version | Scheduler version
  1.20 | v1.20.4-ack-7.0 or later
  1.22 | v1.22.15-ack-2.0 or later
  1.24 or later | All versions are supported
- If ECI resources are required, ack-virtual-node is deployed. For more information, see Use ECI in ACK.
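One way to confirm that the virtual node component is ready is to list the cluster nodes and check for virtual nodes; the virtual-kubelet name prefix shown here is the default used by ack-virtual-node and may differ in your cluster:
kubectl get nodes | grep virtual-kubelet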
Limits
- This feature cannot be used together with the pod-deletion-cost feature. For more information, see pod-deletion-cost.
- This feature cannot be used together with ECI-based elastic scheduling. For more information, see Elastic Scheduling with ElasticResource (deprecated).
- This feature currently uses the BestEffort policy and does not guarantee that pods are removed in reverse order during scale-in activities.
- The max parameter is available only if your cluster runs Kubernetes 1.22 or later and the scheduler version is 5.0 or higher.
- When this feature is used with elastic node pools, invalid nodes may be added. Make sure that the elastic node pools are included in units and do not specify the max parameter for these units.
- If your scheduler version is earlier than 5.0 or the Kubernetes version of your cluster is 1.20 or earlier, existing pods are prioritized during scale-in activities, even if the ResourcePolicy was created after them.
- If your scheduler version is earlier than 6.1 or the Kubernetes version of your cluster is 1.20 or earlier, do not modify the ResourcePolicy until all pods selected by it are deleted.
Procedure
To define priority-based resource scheduling, create a ResourcePolicy:
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: test
  namespace: default
spec:
  selector:
    key1: value1
  strategy: prefer
  units:
  - nodeSelector:
      unit: first
    resource: ecs
  - nodeSelector:
      unit: second
    max: 10
    resource: ecs
  - resource: eci
  preemptPolicy: AfterAllUnits
  ignorePreviousPod: false
  ignoreTerminatingPod: true
  matchLabelKeys:
  - pod-template-hash
  podLabels:
    key1: value1
  podAnnotations:
    key1: value1
  whenTryNextUnits:
    policy: TimeoutOrExceedMax
    timeout: 1m
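To apply the policy, save the manifest to a file and create it with kubectl. The file name below is only an example, and the get command assumes the ResourcePolicy CRD is installed in the cluster:
kubectl apply -f resourcepolicy.yaml
kubectl get resourcepolicy test -n default -o yaml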
- selector: Defines the ResourcePolicy as applicable to pods with the label key1=value1 within the same namespace. If selector is not set, the ResourcePolicy applies to all pods in the namespace.
- strategy: Defines the scheduling strategy. Currently, only prefer is supported.
- units: User-defined scheduling units. During scale-out operations, resources are allocated according to the order specified under units. During scale-in operations, resources are released in the reverse order.
- resource: Specifies the type of elastic resources. Supported types include eci, ecs, elastic, and acs. The elastic type is available for clusters running Kubernetes 1.24 or later with a scheduler version of 6.4.3 or higher. The acs type is available for clusters running Kubernetes 1.26 or later with a scheduler version of 6.7.1 or higher.
  Note
  The elastic type will be deprecated. We recommend that you use the auto-scaling node pool feature by adding the k8s.aliyun.com/resource-policy-wait-for-ecs-scaling: "true" label to podLabels. See the example after this parameter list.
  Note
  The acs type automatically adds the alibabacloud.com/compute-class: default and alibabacloud.com/compute-class: general-purpose labels to pods. You can overwrite these default values by specifying different values in podLabels. If the alpha.alibabacloud.com/compute-qos-strategy annotation is present in podAnnotations, the alibabacloud.com/compute-class: default label is not added to pods.
  Important
  Scheduler versions earlier than 6.8.3 do not support multiple units of the acs type.
- nodeSelector: Identifies the nodes in this scheduling unit by node label. This parameter applies only to ecs resources.
- max (available for scheduler versions 5.0 or higher): Sets the maximum number of pod replicas that can be scheduled to the unit.
- podAnnotations: A map[string]string{} type. Key-value pairs in podAnnotations are added to pods by the scheduler. Only pods with these key-value pairs are counted when tallying the number of pods in this unit.
- podLabels: A map[string]string{} type. Key-value pairs in podLabels are added to pods by the scheduler. Only pods with these key-value pairs are counted when tallying the number of pods in this unit.
  Note
  If the podLabels parameter of a unit includes the k8s.aliyun.com/resource-policy-wait-for-ecs-scaling: "true" label and the number of pods in the unit is less than the value of the max parameter, the scheduler waits for pods in the unit. The maximum wait time is defined by the whenTryNextUnits parameter. The k8s.aliyun.com/resource-policy-wait-for-ecs-scaling: "true" label itself is not added to pods, and pods do not need this label to be counted.
- preemptPolicy (available for scheduler versions 6.1 or higher): Determines whether the ResourcePolicy can preempt resources when pod scheduling to a unit fails. If set to BeforeNextUnit, the scheduler tries to preempt resources each time it fails to schedule pods to a unit. If set to AfterAllUnits, the scheduler tries to preempt resources only after failing to schedule pods to all units. The default is AfterAllUnits.
- ignorePreviousPod (available for scheduler versions 6.1 or higher): Must be used with the max parameter in units. If set to true, pods that were scheduled before the ResourcePolicy was created are not counted when tallying the number of pods.
- ignoreTerminatingPod (available for scheduler versions 6.1 or higher): Must be used with the max parameter in units. If set to true, pods in the Terminating state are not counted when tallying the number of pods.
- matchLabelKeys (available for scheduler versions 6.2 or higher): Must be used with the max parameter in units. Pods are grouped by the values of the specified labels, and the max limit is applied to each group separately. Pods that do not carry the labels specified in matchLabelKeys are rejected.
- whenTryNextUnits (available for clusters running Kubernetes 1.24 or later with a scheduler version of 6.4 or higher): Defines the conditions under which pods can use resources in subsequent units.
  - policy: Specifies the policy for pods. Valid options: ExceedMax, LackResourceAndNoTerminating, TimeoutOrExceedMax, and LackResourceOrExceedMax (default).
    - ExceedMax: If the max parameter of the current unit is unspecified, or the number of pods in the unit reaches or exceeds the max value, pods may use resources in the next unit. This policy can be combined with Auto Scaling and Elastic Container Instance (ECI) to prioritize node pool scaling through Auto Scaling.
      Important
      - If the autoscaler fails to add nodes to a node pool for an extended period, pods may remain pending for a long time.
      - The autoscaler does not recognize the max limit of the ResourcePolicy. The actual number of instances added may exceed the max limit. This issue will be addressed in future versions.
    - TimeoutOrExceedMax: Pods wait in the current unit if either of the following conditions is met:
      - The max parameter of the current unit is set, and the number of pods in the unit is less than the max value.
      - The max parameter of the current unit is not set, and its podLabels parameter includes the k8s.aliyun.com/resource-policy-wait-for-ecs-scaling: "true" label.
      If resources in the current unit are insufficient for pod scheduling, pods remain queued in the unit for at most the duration specified by the timeout parameter. This policy can be combined with Auto Scaling and Elastic Container Instance to preferentially scale out a node pool through Auto Scaling and fall back to elastic container instances after the timeout. See the example after this parameter list.
      Important
      If newly added nodes do not reach the Ready state before the timeout ends and pods are not configured to tolerate the NotReady taint, pods will still be scheduled to elastic container instances. A sample toleration is shown after this parameter list.
    - LackResourceOrExceedMax: If the number of pods in the current unit reaches or exceeds the max value, or resources in the current unit are insufficient, pods may use resources in the next unit. This is the default policy and accommodates most scenarios.
    - LackResourceAndNoTerminating: If the number of pods in the current unit reaches or exceeds the max value, resources in the current unit are insufficient, and none of the pods in the unit are in the Terminating state, pods may use resources in the next unit. This policy pairs well with a rolling update strategy because it avoids dispatching new pods to subsequent units while pods in the current unit are still terminating.
  - timeout: Defines the timeout period when the policy parameter is set to TimeoutOrExceedMax. If not specified, the default is 15 minutes.
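To make the combined behavior concrete, the following sketch prefers an auto-scaling node pool and falls back to elastic container instances only after a wait: the first unit carries the k8s.aliyun.com/resource-policy-wait-for-ecs-scaling: "true" label recommended above, matchLabelKeys counts each Deployment revision against max separately, and whenTryNextUnits makes pods wait up to five minutes for node pool scale-out. The node pool ID, selector label, and max value are placeholders; adjust them to your cluster.
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: wait-for-scaling-example
  namespace: default
spec:
  selector:
    app: nginx                            # pods selected by this ResourcePolicy
  strategy: prefer
  units:
  - resource: ecs                         # preferred: an auto-scaling node pool
    max: 10                               # at most 10 pods per group in this unit
    nodeSelector:
      alibabacloud.com/nodepool-id: np-example****   # placeholder node pool ID
    podLabels:
      k8s.aliyun.com/resource-policy-wait-for-ecs-scaling: "true"   # wait for node pool scale-out
  - resource: eci                         # fallback: elastic container instances
  matchLabelKeys:
  - pod-template-hash                     # count pods of each Deployment revision separately
  whenTryNextUnits:
    policy: TimeoutOrExceedMax            # wait in the first unit before trying the next one
    timeout: 5m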
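As noted above for TimeoutOrExceedMax, newly added nodes may still be NotReady when the timeout expires. If you would rather let pods bind to such nodes than fall back to elastic container instances, one option is to tolerate the standard Kubernetes node.kubernetes.io/not-ready taint in the pod template, for example:
tolerations:
- key: node.kubernetes.io/not-ready   # taint carried by nodes that are not yet Ready
  operator: Exists                    # tolerate the taint for any effect (NoSchedule and NoExecute)
Whether this is appropriate depends on your workload, because pods may then start on nodes that are still initializing.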
Sample scenarios
Scenario 1: Priority-Based Scheduling for Node Pools
When deploying a Deployment in a cluster with two node pools, Node Pool A and Node Pool B, you may want to prioritize Node Pool A and schedule pods to Node Pool B only if Node Pool A's resources are insufficient. During scale-in activities, pods in Node Pool B should be deleted first. In this example, cn-beijing.10.0.3.137 and cn-beijing.10.0.3.138 are in Node Pool A, while cn-beijing.10.0.6.47 and cn-beijing.10.0.6.46 are in Node Pool B. Each node has 2 vCPUs and 4 GB of memory. Follow these steps to configure priority-based resource scheduling for node pools:
- Create a ResourcePolicy using the following YAML file to specify the scheduling sequence for the node pools.
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: nginx
  namespace: default
spec:
  selector:
    app: nginx
  strategy: prefer
  units:
  - resource: ecs
    nodeSelector:
      alibabacloud.com/nodepool-id: np7ec79f2235954e879de07b780058****
  - resource: ecs
    nodeSelector:
      alibabacloud.com/nodepool-id: npab2df797738644e3a7b7cbf532bb****
Note
You can find the ID of a node pool on the Node Management > Node Pool page of the cluster. For more details, see Create and Manage Node Pools.
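The node pool ID is also exposed as a node label (the same key used in the nodeSelector above), so you can list it per node from the command line:
kubectl get nodes -L alibabacloud.com/nodepool-id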
- Deploy two pods using the following YAML file to create a Deployment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      name: nginx
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        resources:
          limits:
            cpu: 2
          requests:
            cpu: 2
- Create the Nginx application and verify the deployment outcome.
- Execute the following command to create the Nginx application.
kubectl apply -f nginx.yaml
Expected output:
deployment.apps/nginx created
- Check the deployment result with the following command.
kubectl get pods -o wide
Expected output:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-9cdf7bbf9-b**** 1/1 Running 0 17s 172.29.112.216 cn-beijing.10.0.3.137 <none> <none>
nginx-9cdf7bbf9-k**** 1/1 Running 0 17s 172.29.113.24 cn-beijing.10.0.3.138 <none> <none>
The output indicates that the two pods are scheduled to nodes in Node Pool A.
- Expand the number of pods.
- Scale out the pods to four replicas with the following command.
kubectl scale deployment nginx --replicas 4
Expected output:
deployment.apps/nginx scaled
- Verify the pod status with the following command.
kubectl get pods -o wide
Expected output:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-9cdf7bbf9-b**** 1/1 Running 0 101s 172.29.112.216 cn-beijing.10.0.3.137 <none> <none>
nginx-9cdf7bbf9-k**** 1/1 Running 0 101s 172.29.113.24 cn-beijing.10.0.3.138 <none> <none>
nginx-9cdf7bbf9-m**** 1/1 Running 0 18s 172.29.113.156 cn-beijing.10.0.6.47 <none> <none>
nginx-9cdf7bbf9-x**** 1/1 Running 0 18s 172.29.113.89 cn-beijing.10.0.6.46 <none> <none>
The output shows that additional pods are scheduled to Node Pool B due to insufficient resources in Node Pool A.
- Reduce the number of pods.
- Scale in the pods from four replicas to two with the following command.
kubectl scale deployment nginx --replicas 2
Expected output:
deployment.apps/nginx scaled
- Check the pod status with the following command.
kubectl get pods -o wide
Expected output:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-9cdf7bbf9-b**** 1/1 Running 0 2m41s 172.29.112.216 cn-beijing.10.0.3.137 <none> <none>
nginx-9cdf7bbf9-k**** 1/1 Running 0 2m41s 172.29.113.24 cn-beijing.10.0.3.138 <none> <none>
nginx-9cdf7bbf9-m**** 0/1 Terminating 0 78s 172.29.113.156 cn-beijing.10.0.6.47 <none> <none>
nginx-9cdf7bbf9-x**** 0/1 Terminating 0 78s 172.29.113.89 cn-beijing.10.0.6.46 <none> <none>
The output indicates that pods on nodes in Node Pool B are deleted in the reverse order of the scheduling sequence.
Scenario 2: Mixed Scheduling for ECS and ECI
When deploying a Deployment, if your cluster has subscription ECS instances, pay-as-you-go ECS instances, and elastic container instances, you may want to schedule pods based on cost efficiency: subscription ECS instances first, then pay-as-you-go ECS instances, and finally elastic container instances. During scale-in activities, pods should be deleted in the reverse order: elastic container instances first, followed by pay-as-you-go ECS instances, and lastly subscription ECS instances. In this example, each node has 2 vCPUs and 4 GB of memory. Follow these steps to configure mixed scheduling for ECS and ECI:
- Assign different labels to nodes based on their billing types using the following commands (alternatively, use the node pool feature to automatically manage labels).
kubectl label node cn-beijing.10.0.3.137 paidtype=subscription
kubectl label node cn-beijing.10.0.3.138 paidtype=subscription
kubectl label node cn-beijing.10.0.6.46 paidtype=pay-as-you-go
kubectl label node cn-beijing.10.0.6.47 paidtype=pay-as-you-go
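Before creating the ResourcePolicy, you can verify that the labels were applied as expected by listing the nodes with the paidtype label shown as a column:
kubectl get nodes -L paidtype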
- Create a ResourcePolicy specifying the scheduling sequence for resources using the following YAML file.
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: nginx
  namespace: default
spec:
  selector:
    app: nginx
  strategy: prefer
  units:
  - resource: ecs
    nodeSelector:
      paidtype: subscription
  - resource: ecs
    nodeSelector:
      paidtype: pay-as-you-go
  - resource: eci
- Deploy two pods using the following YAML file to create a Deployment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      name: nginx
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        resources:
          limits:
            cpu: 2
          requests:
            cpu: 2
- Create the Nginx application and verify the deployment outcome.
- Execute the following command to create the Nginx application.
kubectl apply -f nginx.yaml
Expected output:
deployment.apps/nginx created
- Check the deployment result with the following command.
kubectl get pods -o wide
Expected output:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-9cdf7bbf9-b**** 1/1 Running 0 66s 172.29.112.215 cn-beijing.10.0.3.137 <none> <none>
nginx-9cdf7bbf9-r**** 1/1 Running 0 66s 172.29.113.23 cn-beijing.10.0.3.138 <none> <none>
The output indicates that the first two pods are scheduled to nodes with the label paidtype=subscription.
- Expand the number of pods.
- Scale out the pods to four replicas with the following command.
kubectl scale deployment nginx --replicas 4
Expected output:
deployment.apps/nginx scaled
- Verify the pod status with the following command.
kubectl get pods -o wide
Expected output:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-9cdf7bbf9-4**** 1/1 Running 0 16s 172.29.113.155 cn-beijing.10.0.6.47 <none> <none>
nginx-9cdf7bbf9-b**** 1/1 Running 0 3m48s 172.29.112.215 cn-beijing.10.0.3.137 <none> <none>
nginx-9cdf7bbf9-f**** 1/1 Running 0 16s 172.29.113.88 cn-beijing.10.0.6.46 <none> <none>
nginx-9cdf7bbf9-r**** 1/1 Running 0 3m48s 172.29.113.23 cn-beijing.10.0.3.138 <none> <none>
The output shows that when nodes with the label paidtype=subscription are insufficient, pods are scheduled to nodes with the label paidtype=pay-as-you-go.
- Increase the pod count to six replicas with the following command.
kubectl scale deployment nginx --replicas 6
Expected output:
deployment.apps/nginx scaled
- Check the pod status with the following command.
kubectl get pods -o wide
Expected output:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-9cdf7bbf9-4**** 1/1 Running 0 3m10s 172.29.113.155 cn-beijing.10.0.6.47 <none> <none>
nginx-9cdf7bbf9-b**** 1/1 Running 0 6m42s 172.29.112.215 cn-beijing.10.0.3.137 <none> <none>
nginx-9cdf7bbf9-f**** 1/1 Running 0 3m10s 172.29.113.88 cn-beijing.10.0.6.46 <none> <none>
nginx-9cdf7bbf9-r**** 1/1 Running 0 6m42s 172.29.113.23 cn-beijing.10.0.3.138 <none> <none>
nginx-9cdf7bbf9-s**** 1/1 Running 0 36s 10.0.6.68 virtual-kubelet-cn-beijing-j <none> <none>
nginx-9cdf7bbf9-v**** 1/1 Running 0 36s 10.0.6.67 virtual-kubelet-cn-beijing-j <none> <none>
The output indicates that additional pods are scheduled to elastic container instances due to a shortage of ECS nodes.
- Reduce the number of pods.
- Scale in the pods from six replicas to four with the following command.
kubectl scale deployment nginx --replicas 4
Expected output:
deployment.apps/nginx scaled
- Verify the pod status with the following command.
kubectl get pods -o wide
Expected output:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-9cdf7bbf9-4**** 1/1 Running 0 4m59s 172.29.113.155 cn-beijing.10.0.6.47 <none> <none>
nginx-9cdf7bbf9-b**** 1/1 Running 0 8m31s 172.29.112.215 cn-beijing.10.0.3.137 <none> <none>
nginx-9cdf7bbf9-f**** 1/1 Running 0 4m59s 172.29.113.88 cn-beijing.10.0.6.46 <none> <none>
nginx-9cdf7bbf9-r**** 1/1 Running 0 8m31s 172.29.113.23 cn-beijing.10.0.3.138 <none> <none>
nginx-9cdf7bbf9-s**** 1/1 Terminating 0 2m25s 10.0.6.68 virtual-kubelet-cn-beijing-j <none> <none>
nginx-9cdf7bbf9-v**** 1/1 Terminating 0 2m25s 10.0.6.67 virtual-kubelet-cn-beijing-j <none> <none>
The output shows that pods on elastic container instances are deleted first, in the reverse order of the scheduling sequence.
- Scale in the pods from four replicas to two with the following command.
kubectl scale deployment nginx --replicas 2
Expected output:
deployment.apps/nginx scaled
- Check the pod status with the following command.
kubectl get pods -o wide
Expected output:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-9cdf7bbf9-4**** 0/1 Terminating 0 6m43s 172.29.113.155 cn-beijing.10.0.6.47 <none> <none>
nginx-9cdf7bbf9-b**** 1/1 Running 0 10m 172.29.112.215 cn-beijing.10.0.3.137 <none> <none>
nginx-9cdf7bbf9-f**** 0/1 Terminating 0 6m43s 172.29.113.88 cn-beijing.10.0.6.46 <none> <none>
nginx-9cdf7bbf9-r**** 1/1 Running 0 10m 172.29.113.23 cn-beijing.10.0.3.138 <none> <none>
The output shows that pods on nodes with the label paidtype=pay-as-you-go are prioritized for deletion, in the reverse order of the scheduling sequence.
- Verify the pod status with the following command.
kubectl get pods -o wide
Expected output:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-9cdf7bbf9-b**** 1/1 Running 0 11m 172.29.112.215 cn-beijing.10.0.3.137 <none> <none>
nginx-9cdf7bbf9-r**** 1/1 Running 0 11m 172.29.113.23 cn-beijing.10.0.3.138 <none> <none>
The output confirms that only pods on nodes with the label paidtype=subscription remain.
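If you deployed the sample workload only for testing, you can clean it up afterwards. The resource names match the ones used in this scenario, and the resourcepolicy resource name assumes the default naming of the CRD:
kubectl delete deployment nginx
kubectl delete resourcepolicy nginx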
References
- When deploying Services in an ACK cluster, you can configure tolerations and node affinity to use only ECS instances or elastic container instances, or allow the scheduler to automatically request elastic container instances when ECS instances are insufficient. Different scheduling policies can be configured to scale resources in various scenarios. For more information, see Specify ECS and ECI Resource Allocation.
- Ensuring high availability and performance is critical for distributed tasks. Within ACK Pro clusters, you can leverage Kubernetes-native scheduling semantics to distribute tasks across multiple zones for high availability, or target specific zones to optimize for performance. For more information, see Distribute and affinity schedule ECI pods across zones.