ack-koordinator provides the load-aware hotspot descheduling feature, which can sense the changes in the loads on cluster nodes and automatically optimize the nodes that exceed the safety load to prevent extreme load imbalance. This topic describes how to work with the load-aware hotspot descheduling feature and how to configure advanced settings for this feature.
Limits
Only ACK Pro clusters support load-aware hotspot descheduling. For more information, see Create an ACK managed cluster.
To use the load-aware hotspot descheduling feature, make sure that the following requirements are met.
Component | Required version |
ACK scheduler | v1.22.15-ack-4.0 and later or v1.24.6-ack-4.0 and later |
ack-koordinator (ack-slo-manager) | v1.1.1-ack.1 and later |
Helm | 3.0 and later |
The descheduler only evicts pods, and the ACK scheduler reschedules the pods. We recommend that you use the descheduling feature together with load-aware scheduling. This enables the ACK scheduler to avoid scheduling pods to hotspots again.
During the rescheduling process, the old pods are evicted and then new pods are created. Make sure that your application has sufficient redundant replicas to avoid affecting application availability during eviction.
During the descheduling process, the standard Kubernetes eviction API is used to evict pods. Make sure that the logic of your application pods is reentrant and that the service does not go down when pods are evicted and restarted. For more information, see API-initiated Eviction.
Billing rules
No fee is charged when you install and use the ack-koordinator component. However, fees may be charged in the following scenarios:
ack-koordinator is a non-managed component that occupies worker node resources after it is installed. You can specify the amount of resources requested by each module when you install the component.
By default, ack-koordinator exposes the monitoring metrics of features such as resource profiling and fine-grained scheduling as Prometheus metrics. If you enable Prometheus metrics for ack-koordinator and use Managed Service for Prometheus, these metrics are considered custom metrics and fees are charged for them. The fee depends on factors such as the size of your cluster and the number of applications. Before you enable Prometheus metrics, we recommend that you read the Billing overview topic of Managed Service for Prometheus to learn about the free quota and billing rules of custom metrics. For more information about how to monitor and manage resource usage, see Resource usage and bills.
Introduction to load-aware hotspot descheduling
This section describes the terms used in load-aware hotspot descheduling.
Load-aware pod scheduling
The ACK scheduler supports load-aware scheduling, which can schedule pods to nodes that run with low loads. Due to the changes in the cluster environment, traffic, and requests, the utilization of nodes dynamically changes and may break the load balance between nodes in the cluster, and even result in extreme load imbalance. This affects the runtime quality of the workload. ack-koordinator can identify changes in the loads of nodes and automatically optimize nodes that exceed the safety load to prevent extreme load imbalance. You can use a combination of load-aware scheduling and hotspot descheduling to achieve optimal load balancing among nodes. For more information, see Use load-aware pod scheduling.
How koord-descheduler works
koord-descheduler is a module of the ack-koordinator component. Its LowNodeLoad plug-in can identify changes in node loads and perform hotspot descheduling. Unlike the Kubernetes-native descheduler plug-in LowNodeUtilization, which makes descheduling decisions based on resource allocation, LowNodeLoad makes decisions based on the actual resource utilization of nodes.
Descheduling procedure
koord-descheduler periodically performs descheduling. The following figure shows the steps of descheduling within each cycle.
Data collection: collects information about nodes and workloads in the cluster and the resource utilization statistics.
Policy execution: the descheduling policy is implemented by the policy plug-in. This section uses LowNodeLoad as an example. The plug-in performs the following operations:
Identifies hotspot nodes. For more information about the classification of nodes, see Load thresholds.
Traverses all hotspot nodes, identifies the pods that can be migrated on each node, and sorts the pods. For more information about how pods are scored and sorted, see Pod scoring policy.
Traverses all pods to be migrated and checks whether the pods meet the requirements for migration based on constraints such as the cluster size, resource utilization, and the ratio of replicated pods. For more information, see Load-aware hotspot descheduling policies.
Only pods that meet the requirements are migrated. If no pod meets the requirements on the current node, LowNodeLoad continues to traverse the pods on other hotspot nodes.
Pod eviction and migration: evicts the pods that meet the requirements for migration. For more information, see API-initiated Eviction.
Load thresholds
The LowNodeLoad plug-in allows you to set the following load thresholds:
highThresholds: specifies the high load threshold. Pods on nodes whose load is higher than this threshold are descheduled. We recommend that you enable the load-aware scheduling feature of the ACK scheduler. For more information, see Scheduling policies. For more information about how to use the two features together, see How do I use a combination of load-aware scheduling and load-aware hotspot descheduling?
lowThresholds: specifies the low load threshold. Pods on nodes whose load is lower than this threshold are not descheduled.
In the following figure, lowThresholds is set to 45% and highThresholds is set to 70%. Nodes are classified based on their loads and the thresholds. If the values of lowThresholds and highThresholds change, the standards for node classification also change.
The resource utilization statistics are updated every minute and the average values within the previous 5 minutes are displayed.
Idle nodes: nodes whose resource utilization is lower than 45%.
Normal nodes: nodes whose resource utilization is higher than or equal to 45% but lower than or equal to 70%. This is the desired resource utilization range for cluster nodes.
Hotspot nodes: nodes whose resource utilization is higher than 70%. Pods on hotspot nodes will be evicted until the resource utilization of these nodes drops to 70% or lower.
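For reference, the thresholds in the preceding example map onto the LowNodeLoad plug-in configuration as follows. This is a minimal sketch: the lowThresholds and highThresholds field names match the Advanced settings section of this topic, the values are illustrative, and the surrounding configuration is omitted.

```yaml
# Illustrative LowNodeLoad threshold values matching the example above (45% / 70%).
# Values are percentages of node resource capacity.
lowThresholds:
  cpu: 45       # nodes below 45% usage are treated as idle nodes
  memory: 45
highThresholds:
  cpu: 70       # nodes above 70% usage are treated as hotspot nodes
  memory: 70
```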
Load-aware hotspot descheduling policies
Policy | Description |
Hotspot detection frequency policy | To accurately identify hotspot nodes and avoid frequent pod migration caused by the delayed monitoring data, koord-descheduler allows you to specify the frequency of hotspot detection. A node is considered a hotspot node only if the number of times that the node consecutively exceeds the load threshold reaches the specified frequency value. |
Node sorting policy | When hotspot nodes are identified, koord-descheduler sorts them in descending order of resource usage and then deschedules them in sequence. Both memory and CPU usage are considered, and nodes with higher resource usage are preferentially descheduled. |
Pod scoring policy | koord-descheduler scores and sorts the pods on each hotspot node and then evicts them to idle nodes in order of their scores. Note If you have specific requirements for the eviction order of pods, we recommend that you configure different priorities or QoS classes for them. |
Filtering policy | koord-descheduler allows you to configure various pod and node filters to control the scope of descheduling. For more information, see LowNodeLoad settings. |
Precheck policy | koord-descheduler prechecks pods before migrating them. For example, it checks whether other nodes in the cluster can provide sufficient resources for the pods to be evicted. |
Migration control policy | To ensure the high availability of applications during pod migration, koord-descheduler provides multiple features that allow you to control pod migration. You can specify the maximum number of pods that can be migrated at the same time per node, namespace, or workload. koord-descheduler also allows you to specify a pod migration time window to prevent pods that belong to the same workload from being migrated too frequently. koord-descheduler is compatible with the Pod Disruption Budgets (PDB) mechanism of open source Kubernetes, which helps you guarantee the high availability of your applications in a fine-grained manner. For more information, see Specifying a Disruption Budget for your Application and the example PDB after this table. |
Observability policy | You can collect events to monitor the descheduling procedure and view the reason and status of the descheduling in the event details. For an example of the event output, see Step 4: Verify load-aware hotspot descheduling. |
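The migration control policy in the preceding table honors standard Kubernetes PodDisruptionBudgets. The following is a minimal example of such a budget; the workload name and label are hypothetical. Evictions that would violate this budget are not performed by the descheduler.

```yaml
# A hypothetical PDB that keeps at least 2 replicas of the "web" application available.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web
```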
Step 1: Install or modify ack-koordinator and enable load-aware hotspot descheduling
Install ack-koordinator
Install ack-koordinator. On the Install ack-koordinator(ack-slo-manager) page, select Enable Descheduler for ack-koordinator. For more information, see Install ack-koordinator.
Modify ack-koordinator (ack-koordinator is already installed)
Modify ack-koordinator. On the ack-koordinator Parameters page, select Enable Descheduler for ack-koordinator. For more information, see Modify ack-koordinator.
Step 2: Enable the LowNodeLoad plug-in
Create a file named koord-descheduler-config.yaml and add the following YAML content to the file:
The koord-descheduler-config.yaml file is a ConfigMap used to enable the LowNodeLoad plug-in.
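The exact ConfigMap content depends on your component version and requirements. The following sketch enables the LowNodeLoad plug-in with the thresholds used in the verification example later in this topic (lowThresholds of 20% and highThresholds of 50% for CPU). The ConfigMap name and namespace, and the DeschedulerConfiguration and LowNodeLoadArgs field layout, are assumptions based on the koord-descheduler configuration format; verify them against the component documentation for your version.

```yaml
# Sketch of koord-descheduler-config.yaml. Names and structure are assumptions
# based on the koord-descheduler configuration format; adjust to your version.
apiVersion: v1
kind: ConfigMap
metadata:
  name: koord-descheduler-config
  namespace: kube-system
data:
  koord-descheduler-config: |
    apiVersion: descheduler/v1alpha2
    kind: DeschedulerConfiguration
    deschedulingInterval: 120s            # run a descheduling cycle every 2 minutes
    profiles:
      - name: koord-descheduler
        plugins:
          balance:
            enabled:
              - name: LowNodeLoad         # enable load-aware hotspot descheduling
        pluginConfig:
          - name: LowNodeLoad
            args:
              apiVersion: descheduler/v1alpha2
              kind: LowNodeLoadArgs
              lowThresholds:
                cpu: 20                   # below 20% CPU usage: idle node
              highThresholds:
                cpu: 50                   # above 50% CPU usage: hotspot node
              anomalyCondition:
                consecutiveAbnormalities: 5   # 5 consecutive abnormal cycles -> hotspot
```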
Run the following command to apply the configuration to the cluster:
kubectl apply -f koord-descheduler-config.yaml
Run the following command to restart koord-descheduler.
After koord-descheduler is restarted, the modified configuration takes effect.
kubectl -n kube-system scale deploy ack-koord-descheduler --replicas 0
deployment.apps/ack-koord-descheduler scaled
kubectl -n kube-system scale deploy ack-koord-descheduler --replicas 1
deployment.apps/ack-koord-descheduler scaled
Step 3: (Optional) Enable the load-aware scheduling plug-in
Enable the load-aware scheduling plug-in to achieve optimal load balancing among nodes. For more information, see Step 1: Enable load-aware scheduling.
Step 4: Verify load-aware hotspot descheduling
In this section, a cluster that contains three nodes is used as an example. Each node has 104 vCores and 396 GB of memory.
Create a file named stress-demo.yaml and add the following content to the file.
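The original manifest is not reproduced here. The following is a sketch of a CPU stress Deployment that you could use instead; the image, command, and resource requests are assumptions, not the exact values used in this example.

```yaml
# Hypothetical stress workload: a single replica that requests and burns CPU
# so that the node it lands on becomes a hotspot.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stress-demo
  labels:
    app: stress-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: stress-demo
  template:
    metadata:
      labels:
        app: stress-demo
    spec:
      containers:
        - name: stress
          image: polinux/stress                  # assumed stress image
          command: ["stress", "--cpu", "64"]     # burn 64 CPU cores
          resources:
            requests:
              cpu: "64"
```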
Run the following command to create a pod for stress testing:
kubectl create -f stress-demo.yaml
Expected output:
deployment.apps/stress-demo created
Run the following command to view the status of the pod until it starts to run:
kubectl get pod -o wide
Expected output:
NAME                           READY   STATUS    RESTARTS   AGE   IP            NODE                     NOMINATED NODE   READINESS GATES
stress-demo-588f9646cf-s****   1/1     Running   0          82s   10.XX.XX.53   cn-beijing.10.XX.XX.53   <none>           <none>
The output indicates that pod stress-demo-588f9646cf-s**** is scheduled to node cn-beijing.10.XX.XX.53.
Increase the load of node cn-beijing.10.XX.XX.53. Then, run the following command to check the load of each node:
kubectl top node
Expected output:
NAME                      CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
cn-beijing.10.XX.XX.215   17611m       17%    24358Mi         6%
cn-beijing.10.XX.XX.53    63472m       63%    11969Mi         3%
The output indicates that the load of node cn-beijing.10.XX.XX.53 is 63%, which exceeds the high resource threshold of 50%. The load of node cn-beijing.10.XX.XX.215 is 17%, which is lower than the low resource threshold of 20%.
Enable load-aware hotspot descheduling. For more information, see Step 2: Enable the LowNodeLoad plug-in.
Run the following command to view the changes of the pods.
Wait for the descheduler to identify hotspot nodes and evict pods.
Note A node is considered a hotspot node if the node consecutively exceeds the high resource threshold 5 times within 10 minutes.
kubectl get pod -w
Expected output:
NAME                           READY   STATUS              RESTARTS   AGE   IP             NODE                      NOMINATED NODE   READINESS GATES
stress-demo-588f9646cf-s****   1/1     Terminating         0          59s   10.XX.XX.53    cn-beijing.10.XX.XX.53    <none>           <none>
stress-demo-588f9646cf-7****   1/1     ContainerCreating   0          10s   10.XX.XX.215   cn-beijing.10.XX.XX.215   <none>           <none>
Run the following command to view the event:
kubectl get event | grep stress-demo-588f9646cf-s****
Expected output:
2m14s   Normal   Evicting        podmigrationjob/00fe88bd-8d4c-428d-b2a8-d15bcdeb****   Pod "default/stress-demo-588f9646cf-s****" evicted from node "cn-beijing.10.XX.XX.53" by the reason "node is overutilized, cpu usage(68.53%)>threshold(50.00%)"
101s    Normal   EvictComplete   podmigrationjob/00fe88bd-8d4c-428d-b2a8-d15bcdeb****   Pod "default/stress-demo-588f9646cf-s****" has been evicted
2m14s   Normal   Descheduled     pod/stress-demo-588f9646cf-s****                       Pod evicted from node "cn-beijing.10.XX.XX.53" by the reason "node is overutilized, cpu usage(68.53%)>threshold(50.00%)"
2m14s   Normal   Killing         pod/stress-demo-588f9646cf-s****                       Stopping container stress
The output indicates the migration result. The pods on the hotspot node are migrated to the idle node.
Advanced settings
Advanced settings for koord-descheduler
The configuration of koord-descheduler is stored in a ConfigMap. The following code block shows the advanced settings for load-aware hotspot descheduling.
koord-descheduler system settings
Parameter | Type | Valid value | Description | Example |
dryRun | boolean | true or false | The global read-only mode. After this mode is enabled, pods cannot be migrated. | false |
deschedulingInterval | time.Duration | >0s | The descheduling interval. | 120s |
Migration control settings
Parameter | Type | Valid value | Description | Example |
maxMigratingPerNode | int64 | ≥ 0 (default value: 2) | The maximum number of pods that can be migrated at the same time on a node. A value of 0 indicates that no limit is set. | 2 |
maxMigratingPerNamespace | int64 | ≥ 0 (Default value: 0) | The maximum number of pods that can be migrated at the same time in a namespace. A value of 0 indicates that no limit is set. | 1 |
maxMigratingPerWorkload | intOrString | ≥ 0 (Default value: 10%) | The maximum number or percentage of pods that can be migrated at the same time in a workload, such as a Deployment. A value of 0 indicates that no limit is set. If the workload contains only one replicated pod, the workload is excluded for descheduling. | 1 or 10% |
maxUnavailablePerWorkload | intOrString | ≥ 0 (Default value: 10%) and smaller than the number of replicated pods of the workload | The maximum number or percentage of unavailable replicated pods that are allowed in a workload, such as a Deployment. A value of 0 indicates that no limit is set. | 1 or 10% |
evictLocalStoragePods | boolean | true or false | Specifies whether pods that use emptyDir or hostPath volumes can be descheduled. By default, this feature is disabled to ensure data security. | false |
objectLimiters.workload | Structure | | Workload-specific pod migration control. You can limit how many replicated pods of a workload can be migrated within a time window. For example, you can allow only one replicated pod of a workload to be migrated within 5 minutes (see the sketch after this table). | |
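The parameters in this table map onto the migration controller's arguments in the descheduler configuration. The sketch below shows one possible layout; the MigrationControllerArgs kind and the objectLimiters field names (duration, maxMigrating) follow the koord-descheduler configuration format and should be treated as assumptions to verify against your component version.

```yaml
# Sketch of migration control settings (field names are assumptions).
apiVersion: descheduler/v1alpha2
kind: MigrationControllerArgs
maxMigratingPerNode: 2            # at most 2 pods migrating per node at a time
maxMigratingPerNamespace: 1       # at most 1 pod migrating per namespace at a time
maxMigratingPerWorkload: 10%      # at most 10% of a workload's replicas migrating at a time
maxUnavailablePerWorkload: 10%    # at most 10% of a workload's replicas unavailable
evictLocalStoragePods: false      # keep pods with emptyDir/hostPath out of descheduling
objectLimiters:
  workload:
    duration: 5m                  # time window
    maxMigrating: 1               # at most 1 replica of a workload migrated per window
```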
LowNodeLoad settings
Parameter | Type | Valid value | Description | Example |
highThresholds | map[string]float64 Note Set the parameter to a percentage value. You can specify this parameter for pods or nodes. | [0,100] | The high resource threshold. Pods on nodes whose load exceeds this threshold are descheduled. Note You can use load-aware hotspot descheduling together with load-aware scheduling. Set this parameter and the loadAwareThreshold parameter to the same value. This way, the scheduler does not schedule pods to hotspots. For more information, see Scheduling policies. | |
lowThresholds | map[string]float64 Note Set the parameter to a percentage value. You can specify this parameter for pods or nodes. | [0,100] | The low resource threshold. Pods on nodes whose load is lower than this threshold are not descheduled. | |
anomalyCondition.consecutiveAbnormalities | int64 | > 0 (Default value: 5) | Hotspot detection frequency. A node is considered a hotspot node if its load exceeds highThresholds for the specified number of consecutive hotspot detection cycles. Hotspot nodes are descheduled and then the counter is reset. | 5 |
evictableNamespaces | | Namespaces in the cluster | The namespaces that you want to include or exclude for descheduling. If you leave this parameter empty, all pods can be descheduled. You can specify the include list or the exclude list. The two lists are mutually exclusive. | |
nodeSelector | metav1.LabelSelector | For more information about the format of LabelSelector, see Labels and Selectors. | Uses a LabelSelector to select nodes. You can specify one node pool or multiple node pools when you configure this parameter. | See the sketch after this table. |
podSelectors | A list of PodSelector entries | For more information about the format of LabelSelector, see Labels and Selectors. | Specifies the pods that can be descheduled. | See the sketch after this table. |
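The following sketch shows how the filtering parameters in this table could be combined in the LowNodeLoad arguments. The label keys and values are hypothetical; the overall structure follows the koord-descheduler configuration format and should be checked against your component version.

```yaml
# Sketch of LowNodeLoad arguments with namespace, node, and pod filters
# (label keys and values are hypothetical).
apiVersion: descheduler/v1alpha2
kind: LowNodeLoadArgs
lowThresholds:
  cpu: 45
  memory: 30
highThresholds:
  cpu: 70
  memory: 80
anomalyCondition:
  consecutiveAbnormalities: 5
evictableNamespaces:
  include:                        # only pods in these namespaces can be descheduled;
    - default                     # include and exclude are mutually exclusive
nodeSelector:
  matchLabels:
    alibabacloud.com/nodepool-id: np-example   # hypothetical node pool label value
podSelectors:
  - name: demo
    selector:
      matchLabels:
        app: stress-demo          # hypothetical pod label
```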
FAQ
What do I do if the resource utilization of a node has reached the high threshold but no pod on the node is evicted?
The following table describes the possible causes.
Category | Cause | Solution |
Ineffective component configuration | No pods or nodes specified | No pods or nodes are specified in the configuration of the descheduler. Check whether namespaces and nodes are specified. |
Descheduler not restarted after modification | After you modify the configuration of the descheduler, you need to restart it for the modification to take effect. For more information about how to restart the descheduler, see Step 2: Enable the LowNodeLoad plug-in. | |
Invalid node status | Average node resource utilization lower than the threshold for a long period of time | The descheduler continuously monitors the resource utilization within a period of time and calculates the average value. Descheduling is triggered only if the average value remains above the threshold for a certain period of time. The default time period is 10 minutes. The resource utilization returned by the kubectl top node command reflects real-time values, which may differ from the average values used by the descheduler. For more information, see What is the utilization algorithm used by the descheduler? |
Insufficient available resources in the cluster | The descheduler checks other nodes in the cluster to ensure that the nodes can provide sufficient available resources before the descheduler evicts pods. For example, the descheduler wants to evict a pod that requests 8 vCores and 16 GB of memory. However, no node in the cluster can provide sufficient available resources to meet the requirement of the pod. In this case, the descheduler does not evict the pod. To resolve this issue, you can add nodes to the cluster. | |
Workload limits | Only one replicated pod in the workload | By default, if a workload contains only one replicated pod, the pod is excluded from descheduling. This ensures the high availability of the application that runs in the pod. If you want to deschedule such a pod, add the corresponding annotation to the workload. Note This annotation configuration supports only v1.2.0-ack1.3 and earlier versions. |
Pods configured with emptyDir or hostPath volumes | By default, pods that use emptyDir or hostPath volumes are excluded from descheduling to ensure data security. If you want to deschedule these pods, refer to the evictLocalStoragePods setting. For more information, see Migration control settings. | |
Excessive number of unavailable replicated pods or replicated pods that are being migrated | The number of unavailable replicated pods or replicated pods that are being migrated in a workload (Deployment or StatefulSet) exceeds the upper limit specified by maxUnavailablePerWorkload or maxMigratingPerWorkload. For example, both maxUnavailablePerWorkload and maxMigratingPerWorkload are set to 20%, and the expected number of replicated pods for the Deployment is set to 10. If two pods are being migrated or are being updated during an application release, the upper limit is reached and the descheduler does not evict additional pods. Wait until the migration or release is complete, or increase the values of the preceding parameters. | |
Incorrect replicated pod limits | When the number of replicated pods in a workload is smaller than or equal to the maximum number of pods allowed to migrate specified in maxMigratingPerWorkload or the maximum number of unavailable pods allowed specified in maxUnavailablePerWorkload, the descheduler does not deschedule the pods in the workload. Decrease the values of the preceding parameters and set the parameters to percentage values. |
Why does the descheduler frequently restart?
The format of the ConfigMap of the descheduler is invalid or the ConfigMap does not exist. Refer to Advanced settings and check the content and format of the ConfigMap, modify the ConfigMap, and then restart the descheduler. For more information about how to restart the descheduler, see Step 2: Enable the LowNodeLoad plug-in.
How do I use a combination of load-aware scheduling and load-aware hotspot descheduling?
After you enable load-aware hotspot descheduling, pods on hotspot nodes are evicted, and the ACK scheduler selects proper nodes for the new pods that are created by the upper-layer controllers (such as Deployments). To achieve optimal load balancing, we recommend that you enable load-aware scheduling at the same time. For more information, see Use load-aware scheduling.
We recommend that you set the loadAwareThreshold parameter of the scheduler and the highThresholds parameter of the descheduler to the same value. For more information, see Scheduling policies. When the load of a node exceeds highThresholds, the descheduler evicts pods on the node. The scheduler stops scheduling new pods to the hotspot node due to loadAwareThreshold. If you do not set the parameters to the same value, pods may be scheduled to the hotspot node. This issue is more likely to occur when a pod has specified the scope of schedulable nodes but only a small number of nodes are available and the resource utilization values of these nodes are close.
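For example, assuming the LowNodeLoadArgs structure shown in Advanced settings, the pairing could look like the following. The scheduler side is only outlined in comments because the exact loadAwareThreshold syntax is described in the Scheduling policies topic and is not reproduced here.

```yaml
# Descheduler side: evict pods when CPU usage on a node exceeds 70%.
highThresholds:
  cpu: 70
# Scheduler side (syntax not shown here): set loadAwareThreshold for CPU to the
# same value (70) in the ACK scheduler configuration so that evicted pods are not
# rescheduled onto nodes that are already above 70% CPU usage.
```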
What is the utilization algorithm used by the descheduler?
The descheduler continuously monitors the resource utilization within a period of time and calculates the average value. Descheduling is triggered only if the average value remains above the threshold for a certain period of time. The default time period is 10 minutes. In addition, the memory utilization counted by the descheduler excludes the page cache because memory used by the page cache can be reclaimed by the operating system. The memory utilization queried by using the kubectl top node command includes the page cache. You can view the actual memory utilization in the Managed Service for Prometheus console.