Custom elastic resource priority scheduling is an elastic scheduling policy provided by Alibaba Cloud. During application deployment or scale-out, you can define a ResourcePolicy to specify the order in which application instance pods are scheduled to different types of node resources. During scale-in, pods are removed in the reverse order of their scheduling.
Do not use system-reserved labels, such as alibabacloud.com/compute-class or alibabacloud.com/compute-qos, in the label selector of a workload, such as the spec.selector.matchLabels field of a Deployment. These labels may be modified by the system during custom priority scheduling, which causes the controller to frequently rebuild pods and affects application stability.
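For example, a Deployment used with this feature should select its pods only by labels that your application owns. The following is a minimal sketch that assumes a hypothetical app: web label; it is an illustration, not a required configuration.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2
  selector:
    matchLabels:
      # Safe: an application-owned label. Do not add system-reserved labels
      # such as alibabacloud.com/compute-class or alibabacloud.com/compute-qos here.
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx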
Prerequisites
An ACK managed cluster Pro edition of version 1.20.11 or later is created. To upgrade the cluster, see Manually upgrade a cluster.
The scheduler version must meet the following requirements for different ACK cluster versions. For more information about the features supported by each scheduler version, see kube-scheduler.
| ACK version   | Scheduler version          |
| ------------- | -------------------------- |
| 1.20          | v1.20.4-ack-7.0 or later   |
| 1.22          | v1.22.15-ack-2.0 or later  |
| 1.24 or later | All versions are supported |
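To verify that your cluster meets these requirements, you can check the Kubernetes version reported by the API server; the kube-scheduler version is typically shown on the cluster's component management page in the ACK console.
kubectl version
The Server Version field in the output indicates the cluster version.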
To use ECI resources, the ack-virtual-node component must be deployed. For more information, see Use ECI in ACK.
Precautions
Starting from scheduler version v1.x.x-aliyun-6.4, the default value of the ignorePreviousPod field for custom elastic resource priority is changed to False, and the default value of ignoreTerminatingPod is changed to True. Existing ResourcePolicy objects and their subsequent updates are not affected.
This feature conflicts with pod-deletion-cost and cannot be used at the same time.
This feature cannot be used with ECI elastic scheduling that is implemented through ElasticResource. For more information, see Use ElasticResource for elastic scheduling of ECI pods.
This feature uses a BestEffort policy and does not guarantee that pods are scaled in strictly in reverse order.
The max field is available only in clusters of version 1.22 or later with scheduler version 5.0 or later.
When used with elastic node pools, this feature may cause the node pools to create invalid nodes. To prevent this, include the elastic node pool in a unit and do not set the max field for that unit (see the example at the end of this section).
If your scheduler version is earlier than 5.0 or your cluster version is 1.20 or earlier, note that pods that exist before the ResourcePolicy is created are the first to be scaled in.
If your scheduler version is earlier than 6.1 or your cluster version is 1.20 or earlier, do not modify a ResourcePolicy while its associated pods are not completely deleted.
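The following sketch illustrates the elastic node pool precaution above. It is an illustration only: the policy name, selector label, and node pool ID are placeholders. The unit that contains the elastic node pool does not set max.
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: elastic-pool-example        # Placeholder name.
  namespace: default
spec:
  selector:
    app: demo                       # Placeholder application label.
  strategy: prefer
  units:
  - resource: ecs
    nodeSelector:
      alibabacloud.com/nodepool-id: np-elastic-************   # Placeholder ID of the elastic node pool. No max is set for this unit.
  - resource: eci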
Usage
Create a ResourcePolicy to define the elastic resource priority:
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: test
  namespace: default
spec:
  selector:
    key1: value1
  strategy: prefer
  units:
  - nodeSelector:
      unit: first
    podLabels:
      key1: value1
    podAnnotations:
      key1: value1
    resource: ecs
  - nodeSelector:
      unit: second
    max: 10
    resource: ecs
  - resource: eci
  # Optional, advanced configurations
  preemptPolicy: AfterAllUnits
  ignorePreviousPod: false
  ignoreTerminatingPod: true
  matchLabelKeys:
  - pod-template-hash
  whenTryNextUnits:
    policy: TimeoutOrExceedMax
    timeout: 1m
Parameter descriptions:
selector: Specifies that the ResourcePolicy applies to pods in the same namespace that have the label key1=value1. If the selector is empty, the policy applies to all pods in the namespace.
strategy: The scheduling strategy. Currently, only prefer is supported.
units: User-defined scheduling units. During scale-out, pods are scheduled to resources in the order defined in units. During scale-in, pods are removed in the reverse order.
resource: The type of elastic resource. The supported types are eci, ecs, elastic, and acs. The elastic type is available in clusters of version 1.24 or later with scheduler version 6.4.3 or later. The acs type is available in clusters of version 1.26 or later with scheduler version 6.7.1 or later.
Note: The elastic type is being deprecated. You can use auto-scaling node pools instead by setting k8s.aliyun.com/resource-policy-wait-for-ecs-scaling: "true" in podLabels.
Note: By default, the acs type adds the alibabacloud.com/compute-class: default and alibabacloud.com/compute-class: general-purpose labels to pods. You can overwrite these default values by specifying different values in podLabels. However, if alpha.alibabacloud.com/compute-qos-strategy is specified in podAnnotations, the alibabacloud.com/compute-class: default label is not added.
Note: The acs and eci types add tolerations for virtual node taints to pods by default. The scheduler adds these tolerations internally; they are not reflected in the pod spec. Pods can be scheduled to virtual nodes without additional taint toleration configurations.
Important: In scheduler versions earlier than 6.8.3, you cannot use multiple units of the acs type at the same time.
nodeSelector: Identifies the nodes in this scheduling unit by their node labels.
max (available in scheduler version 5.0 or later): The maximum number of pod replicas that can be scheduled in this unit.
maxResources (available in scheduler version 6.9.5 or later): The maximum amount of resources that can be scheduled for pods in this unit.
podAnnotations: The type is map[string]string{}. The key-value pairs configured in podAnnotations are added to the pod by the scheduler. Only pods with these key-value pairs are counted when calculating the number of pods in the unit.
podLabels: The type is map[string]string{}. The key-value pairs configured in podLabels are added to the pod by the scheduler. Only pods with these key-value pairs are counted when calculating the number of pods in the unit.
Note: If the podLabels of a unit include the k8s.aliyun.com/resource-policy-wait-for-ecs-scaling: "true" label, or if the number of pods in the current unit is less than the max value, the scheduler keeps the pod in a waiting state in the current unit. You can set the waiting time in whenTryNextUnits. The k8s.aliyun.com/resource-policy-wait-for-ecs-scaling: "true" label itself is not applied to the pod and is not required for pod counting.
Note: When a ResourcePolicy is used with auto scaling, it must be used with instant elasticity. Otherwise, cluster-autoscaler may trigger incorrect node pool scaling.
preemptPolicy: Specifies the preemption policy when a ResourcePolicy contains multiple units. BeforeNextUnit indicates that the scheduler attempts preemption each time it fails to schedule a unit. AfterAllUnits indicates that the scheduler attempts preemption only after it fails to schedule the last unit. The default value is AfterAllUnits. This parameter is available in scheduler version 6.1 or later and does not apply to ACS. You can enable preemption by configuring ACK scheduler parameters. For more information, see Enable preemption.
ignorePreviousPod (available in scheduler version 6.1 or later): Must be used with max in units. If this value is true, pods that were scheduled before the ResourcePolicy was created are ignored during pod counting.
ignoreTerminatingPod (available in scheduler version 6.1 or later): Must be used with max in units. If this value is true, pods in the Terminating state are ignored during pod counting.
matchLabelKeys (available in scheduler version 6.2 or later): Must be used with max in units. Pods are grouped based on the values of the specified labels, and the max count is applied to each group separately. If a pod is missing a label declared in matchLabelKeys, the scheduler rejects the pod.
whenTryNextUnits (available in cluster version 1.24 or later with scheduler version 6.4 or later): Describes the conditions under which a pod is allowed to use resources from subsequent units.
policy: The policy used by the pod. Valid values are ExceedMax, LackResourceAndNoTerminating, TimeoutOrExceedMax, and LackResourceOrExceedMax (default).
ExceedMax: Allows a pod to use resources from the next unit if the max and maxResources fields are not set for the current unit, or if the number of pods in the current unit is greater than or equal to the max value, or if the resources used in the current unit plus the resources of the current pod exceed the maxResources value. This policy can be used together with auto scaling and ECI to prioritize the scale-out of node pools.
Important: If the auto-scaling node pool cannot create nodes for a long time, this policy may cause pods to remain in the Pending state.
Currently, cluster-autoscaler is not aware of the max limit in a ResourcePolicy, so the actual number of created instances may be greater than the value of max. This issue will be optimized in a future version.
TimeoutOrExceedMax: A pod may wait in the current unit when one of the following conditions is met:
- The max field of the current unit is set and the number of pods in the unit is less than the value of max, or the maxResources field is set and the scheduled resources plus the resources of the current pod are less than the value of maxResources.
- The max field of the current unit is not set, and the podLabels of the current unit contain k8s.aliyun.com/resource-policy-wait-for-ecs-scaling: "true".
If the current unit has insufficient resources to schedule the pod, the pod waits in the current unit for at most the duration specified by timeout. This policy can be used together with auto scaling and ECI to prioritize scaling out the node pool and automatically fall back to ECI after the timeout.
Important: If a node is created during the timeout period but is not yet in the Ready state, and the pod does not tolerate the NotReady taint, the pod is still scheduled to ECI.
LackResourceOrExceedMax: Allows a pod to use resources from the next unit if the number of pods in the current unit is greater than or equal to the max value, or if the current unit runs out of resources. This is the default policy and is suitable for most basic requirements.
LackResourceAndNoTerminating: Allows a pod to use resources from the next unit if the current unit lacks available resources or has reached its maximum pod count (max), and no pods in the current unit are in the Terminating state. This policy is suitable for rolling update scenarios because it prevents new pods from being scheduled to subsequent units while pods in the current unit are terminating.
timeout: The timeout period that applies when policy is set to TimeoutOrExceedMax. If this field is empty, the default value is 15 minutes. This parameter is not supported for ACS units, which are limited only by max.
An example that combines the k8s.aliyun.com/resource-policy-wait-for-ecs-scaling label with the TimeoutOrExceedMax policy is shown after this parameter list.
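The following sketch shows one way to combine these parameters. It is an illustration only: the policy name, application label, and node pool ID are placeholders that you must replace with values from your cluster. The first unit targets an auto-scaling node pool and waits for ECS scale-out, and the second unit falls back to ECI after the timeout.
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: wait-for-scaling-example    # Placeholder name.
  namespace: default
spec:
  selector:
    app: demo                       # Placeholder application label.
  strategy: prefer
  units:
  - resource: ecs
    nodeSelector:
      alibabacloud.com/nodepool-id: np1234567890abcdef****    # Placeholder ID of an auto-scaling node pool.
    podLabels:
      k8s.aliyun.com/resource-policy-wait-for-ecs-scaling: "true"    # Wait for the node pool to scale out.
  - resource: eci                   # Fall back to ECI when the timeout expires.
  whenTryNextUnits:
    policy: TimeoutOrExceedMax
    timeout: 10m                    # Pods wait at most 10 minutes in the first unit.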
Scenario examples
Scenario 1: Schedule based on node pool priority
You need to deploy a Deployment. The cluster has two node pools: Node Pool A and Node Pool B. You want to schedule pods to Node Pool A first. If Node Pool A has insufficient resources, schedule the pods to Node Pool B. When scaling in, you want to remove pods from Node Pool B first, and then from Node Pool A. In this example, cn-beijing.10.0.3.137 and cn-beijing.10.0.3.138 belong to Node Pool A. cn-beijing.10.0.6.47 and cn-beijing.10.0.6.46 belong to Node Pool B. All nodes have 2 vCPUs and 4 GB of memory. The following steps describe how to schedule based on node pool priority:
Use the following YAML content to create a ResourcePolicy to customize the node pool scheduling order.
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: nginx
  namespace: default
spec:
  selector:
    app: nginx # This must be associated with the label of the pod that you will create later.
  strategy: prefer
  units:
  - resource: ecs
    nodeSelector:
      alibabacloud.com/nodepool-id: np7ec79f2235954e879de07b780058****
  - resource: ecs
    nodeSelector:
      alibabacloud.com/nodepool-id: npab2df797738644e3a7b7cbf532bb****
Note: You can obtain the node pool ID from the Node Management > Node Pools page of the cluster. For more information, see Create and manage a node pool.
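Optionally, before you create the ResourcePolicy, confirm which nodes belong to each node pool by displaying the node pool ID label as an extra column (the -L flag is a standard kubectl option):
kubectl get nodes -L alibabacloud.com/nodepool-id
The output lists each node together with the ID of the node pool that it belongs to.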
Use the following YAML content to create a Deployment with two pods.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      name: nginx
      labels:
        app: nginx # This must be associated with the selector of the ResourcePolicy created in the previous step.
    spec:
      containers:
      - name: nginx
        image: nginx
        resources:
          limits:
            cpu: 2
          requests:
            cpu: 2
Create the Nginx application and view the deployment result.
Run the following command to create the Nginx application.
kubectl apply -f nginx.yaml
Expected output:
deployment.apps/nginx created
Run the following command to view the deployment result.
kubectl get pods -o wide
Expected output:
NAME                    READY   STATUS    RESTARTS   AGE   IP               NODE                    NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-b****   1/1     Running   0          17s   172.29.112.216   cn-beijing.10.0.3.137   <none>           <none>
nginx-9cdf7bbf9-k****   1/1     Running   0          17s   172.29.113.24    cn-beijing.10.0.3.138   <none>           <none>
The output shows that the first two pods are scheduled to the nodes in Node Pool A.
Scale out the pods.
Run the following command to scale out the pods to four replicas.
kubectl scale deployment nginx --replicas 4
Expected output:
deployment.apps/nginx scaled
Run the following command to check the pod status.
kubectl get pods -o wide
Expected output:
NAME                    READY   STATUS    RESTARTS   AGE    IP               NODE                    NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-b****   1/1     Running   0          101s   172.29.112.216   cn-beijing.10.0.3.137   <none>           <none>
nginx-9cdf7bbf9-k****   1/1     Running   0          101s   172.29.113.24    cn-beijing.10.0.3.138   <none>           <none>
nginx-9cdf7bbf9-m****   1/1     Running   0          18s    172.29.113.156   cn-beijing.10.0.6.47    <none>           <none>
nginx-9cdf7bbf9-x****   1/1     Running   0          18s    172.29.113.89    cn-beijing.10.0.6.46    <none>           <none>
The output shows that when the nodes in Node Pool A have insufficient resources, the new pods are scheduled to the nodes in Node Pool B.
Scale in the pods.
Run the following command to scale in the pods from four replicas to two.
kubectl scale deployment nginx --replicas 2
Expected output:
deployment.apps/nginx scaled
Run the following command to check the pod status.
kubectl get pods -o wide
Expected output:
NAME                    READY   STATUS        RESTARTS   AGE     IP               NODE                    NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-b****   1/1     Running       0          2m41s   172.29.112.216   cn-beijing.10.0.3.137   <none>           <none>
nginx-9cdf7bbf9-k****   1/1     Running       0          2m41s   172.29.113.24    cn-beijing.10.0.3.138   <none>           <none>
nginx-9cdf7bbf9-m****   0/1     Terminating   0          78s     172.29.113.156   cn-beijing.10.0.6.47    <none>           <none>
nginx-9cdf7bbf9-x****   0/1     Terminating   0          78s     172.29.113.89    cn-beijing.10.0.6.46    <none>           <none>
The output shows that pods in Node Pool B are removed first, which is the reverse of the scheduling order.
Scenario 2: Hybrid scheduling of ECS and ECI
You need to deploy a Deployment. The cluster has three types of resources: subscription ECS instances, pay-as-you-go ECS instances, and ECI instances. To reduce resource costs, you want the service deployment to follow this priority order: subscription ECS, pay-as-you-go ECS, and then ECI. When scaling in, you want to remove pods from ECI instances first, then from pay-as-you-go ECS instances, and finally from subscription ECS instances. In this example, the nodes have 2 vCPUs and 4 GB of memory. The following steps describe how to perform hybrid scheduling of ECS and ECI:
Run the following commands to add different labels to nodes that use different billing methods. You can also use the node pool feature to automatically add labels.
kubectl label node cn-beijing.10.0.3.137 paidtype=subscription
kubectl label node cn-beijing.10.0.3.138 paidtype=subscription
kubectl label node cn-beijing.10.0.6.46 paidtype=pay-as-you-go
kubectl label node cn-beijing.10.0.6.47 paidtype=pay-as-you-go
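Optionally, verify that the labels are applied before you create the ResourcePolicy. The -L flag adds the label value as a column in the output:
kubectl get nodes -L paidtype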
Use the following YAML content to create a ResourcePolicy to customize the scheduling order of the billing types.
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: nginx
  namespace: default
spec:
  selector:
    app: nginx # This must be associated with the label of the pod that you will create later.
  strategy: prefer
  units:
  - resource: ecs
    nodeSelector:
      paidtype: subscription
  - resource: ecs
    nodeSelector:
      paidtype: pay-as-you-go
  - resource: eci
Use the following YAML content to create a Deployment with two pods.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      name: nginx
      labels:
        app: nginx # This must be associated with the selector of the ResourcePolicy created in the previous step.
    spec:
      containers:
      - name: nginx
        image: nginx
        resources:
          limits:
            cpu: 2
          requests:
            cpu: 2
Create the Nginx application and view the deployment result.
Run the following command to create the Nginx application.
kubectl apply -f nginx.yaml
Expected output:
deployment.apps/nginx created
Run the following command to view the deployment result.
kubectl get pods -o wide
Expected output:
NAME                    READY   STATUS    RESTARTS   AGE   IP               NODE                    NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-b****   1/1     Running   0          66s   172.29.112.215   cn-beijing.10.0.3.137   <none>           <none>
nginx-9cdf7bbf9-r****   1/1     Running   0          66s   172.29.113.23    cn-beijing.10.0.3.138   <none>           <none>
The output shows that the first two pods are scheduled to nodes with the label paidtype=subscription.
Scale out the pods.
Run the following command to scale out the pods to four replicas.
kubectl scale deployment nginx --replicas 4
Expected output:
deployment.apps/nginx scaled
Run the following command to check the pod status.
kubectl get pods -o wide
Expected output:
NAME                    READY   STATUS    RESTARTS   AGE     IP               NODE                    NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-4****   1/1     Running   0          16s     172.29.113.155   cn-beijing.10.0.6.47    <none>           <none>
nginx-9cdf7bbf9-b****   1/1     Running   0          3m48s   172.29.112.215   cn-beijing.10.0.3.137   <none>           <none>
nginx-9cdf7bbf9-f****   1/1     Running   0          16s     172.29.113.88    cn-beijing.10.0.6.46    <none>           <none>
nginx-9cdf7bbf9-r****   1/1     Running   0          3m48s   172.29.113.23    cn-beijing.10.0.3.138   <none>           <none>
The output shows that when nodes with the label paidtype=subscription have insufficient resources, the new pods are scheduled to nodes with the label paidtype=pay-as-you-go.
Run the following command to scale out the pods to six replicas.
kubectl scale deployment nginx --replicas 6
Expected output:
deployment.apps/nginx scaled
Run the following command to check the pod status.
kubectl get pods -o wide
Expected output:
NAME                    READY   STATUS    RESTARTS   AGE     IP               NODE                           NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-4****   1/1     Running   0          3m10s   172.29.113.155   cn-beijing.10.0.6.47           <none>           <none>
nginx-9cdf7bbf9-b****   1/1     Running   0          6m42s   172.29.112.215   cn-beijing.10.0.3.137          <none>           <none>
nginx-9cdf7bbf9-f****   1/1     Running   0          3m10s   172.29.113.88    cn-beijing.10.0.6.46           <none>           <none>
nginx-9cdf7bbf9-r****   1/1     Running   0          6m42s   172.29.113.23    cn-beijing.10.0.3.138          <none>           <none>
nginx-9cdf7bbf9-s****   1/1     Running   0          36s     10.0.6.68        virtual-kubelet-cn-beijing-j   <none>           <none>
nginx-9cdf7bbf9-v****   1/1     Running   0          36s     10.0.6.67        virtual-kubelet-cn-beijing-j   <none>           <none>
The output shows that when ECS resources are insufficient, the new pods are scheduled to ECI resources.
Scale in the pods.
Run the following command to scale in the pods from six replicas to four.
kubectl scale deployment nginx --replicas 4
Expected output:
deployment.apps/nginx scaled
Run the following command to check the pod status.
kubectl get pods -o wide
Expected output:
NAME                    READY   STATUS        RESTARTS   AGE     IP               NODE                           NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-4****   1/1     Running       0          4m59s   172.29.113.155   cn-beijing.10.0.6.47           <none>           <none>
nginx-9cdf7bbf9-b****   1/1     Running       0          8m31s   172.29.112.215   cn-beijing.10.0.3.137          <none>           <none>
nginx-9cdf7bbf9-f****   1/1     Running       0          4m59s   172.29.113.88    cn-beijing.10.0.6.46           <none>           <none>
nginx-9cdf7bbf9-r****   1/1     Running       0          8m31s   172.29.113.23    cn-beijing.10.0.3.138          <none>           <none>
nginx-9cdf7bbf9-s****   1/1     Terminating   0          2m25s   10.0.6.68        virtual-kubelet-cn-beijing-j   <none>           <none>
nginx-9cdf7bbf9-v****   1/1     Terminating   0          2m25s   10.0.6.67        virtual-kubelet-cn-beijing-j   <none>           <none>
The output shows that pods on ECI instances are removed first, which is the reverse of the scheduling order.
Run the following command to scale in the pods from four replicas to two.
kubectl scale deployment nginx --replicas 2
Expected output:
deployment.apps/nginx scaled
Run the following command to check the pod status.
kubectl get pods -o wide
Expected output:
NAME                    READY   STATUS        RESTARTS   AGE     IP               NODE                    NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-4****   0/1     Terminating   0          6m43s   172.29.113.155   cn-beijing.10.0.6.47    <none>           <none>
nginx-9cdf7bbf9-b****   1/1     Running       0          10m     172.29.112.215   cn-beijing.10.0.3.137   <none>           <none>
nginx-9cdf7bbf9-f****   0/1     Terminating   0          6m43s   172.29.113.88    cn-beijing.10.0.6.46    <none>           <none>
nginx-9cdf7bbf9-r****   1/1     Running       0          10m     172.29.113.23    cn-beijing.10.0.3.138   <none>           <none>
The output shows that pods on nodes with the label paidtype=pay-as-you-go are removed next, which is the reverse of the scheduling order.
After the terminating pods are deleted, run the following command to check the pod status again.
kubectl get pods -o wide
Expected output:
NAME                    READY   STATUS    RESTARTS   AGE   IP               NODE                    NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-b****   1/1     Running   0          11m   172.29.112.215   cn-beijing.10.0.3.137   <none>           <none>
nginx-9cdf7bbf9-r****   1/1     Running   0          11m   172.29.113.23    cn-beijing.10.0.3.138   <none>           <none>
The output shows that only pods on nodes with the label paidtype=subscription remain.
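After you finish the scenarios, you can clean up the example resources. The commands below assume the resource names used in this topic; if the resourcepolicy resource name is not recognized by kubectl in your cluster, use the fully qualified name resourcepolicies.scheduling.alibabacloud.com. Removing the paidtype labels is optional.
kubectl delete deployment nginx
kubectl delete resourcepolicy nginx
kubectl label nodes cn-beijing.10.0.3.137 cn-beijing.10.0.3.138 cn-beijing.10.0.6.46 cn-beijing.10.0.6.47 paidtype-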
References
When you deploy services in an ACK cluster, you can use tolerations and node affinity to declare that only ECS or ECI elastic resources are used, or to automatically request ECI resources when ECS resources are insufficient. By configuring scheduling policies, you can meet different requirements for elastic resources in various workload scenarios. For more information, see Specify resource allocation for ECS and ECI.
High availability (HA) and high performance are important requirements for running distributed tasks. In an ACK managed cluster Pro edition, you can use native Kubernetes scheduling semantics to discretize distributed tasks across zones to meet HA deployment requirements. You can also use native Kubernetes scheduling semantics to implement affinity-based deployment of distributed tasks in specified zones to meet high-performance deployment requirements. For more information, see Implement zone-based discretization and affinity scheduling for ECI pods.