Descheduling is the process of evicting pods that match eviction rules from one node so that they can be rescheduled to another node. This feature is suitable for scenarios such as uneven cluster resource utilization, heavily loaded nodes, and the need to roll out new scheduling policies. Descheduling helps maintain cluster health, optimize resource usage, and improve the quality of service of workloads. This topic uses the node taint verification plug-in RemovePodsViolatingNodeTaints as an example to describe how to enable the descheduling feature based on the ack-koordinator component.
Before you start
Before you read this topic, we recommend that you learn about the features, use scenarios, workflow, and basic concepts of descheduling. For more information, see Descheduling overview.
This topic uses the node taint verification plug-in RemovePodsViolatingNodeTaints as an example. For more information about how taints and tolerations, such as NoSchedule effects, are used to schedule and evict pods, see Taints and Tolerations. If you use Kubernetes Descheduler, we recommend that you learn about the differences between Koordinator Descheduler and Kubernetes Descheduler and migrate to Koordinator Descheduler. For more information, see Comparison between Koordinator Descheduler and Kubernetes Descheduler.
If you are familiar with the basic operations of the descheduling feature, you can explore the advanced configurations, such as the system configurations, template configurations, policy plug-in configurations, and evictor plug-in configurations. For more information, see Configure advanced parameters.
Prerequisites
An ACK Pro cluster is created. For more information, see Create an ACK managed cluster.
Note: The descheduling feature cannot be used on virtual nodes.
A kubectl client is connected to the ACK cluster. For more information, see Obtain the kubeconfig file of a cluster and use kubectl to connect to the cluster.
Usage notes
Koordinator Descheduler only evicts running pods; it does not recreate or reschedule the evicted pods. After a pod is evicted, it is recreated by its workload controller, such as a Deployment or StatefulSet, and the recreated pod is scheduled by the scheduler as usual.
During descheduling, old pods are evicted before new pods are created. Make sure that your application has sufficient replicas so that application availability is not affected during eviction.
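For example, you can pair a multi-replica workload with a PodDisruptionBudget. The following is a minimal sketch that uses a hypothetical application label app: my-app; evictions issued through the Kubernetes Eviction API honor such a budget:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  maxUnavailable: 1      # Allow at most one replica to be down at a time during eviction.
  selector:
    matchLabels:
      app: my-app        # Hypothetical label that selects the pods of the multi-replica workload.
```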
Examples
This topic describes how to enable the descheduling feature and implement a descheduling policy based on the ack-koordinator component. The node taint verification plug-in RemovePodsViolatingNodeTaints is used as an example.
By default, the RemovePodsViolatingNodeTaints policy checks pods on nodes that have taints whose effect is NoSchedule and evicts pods that cannot tolerate those taints. For example, a node hosts running pods and an administrator adds the deschedule=not-allow:NoSchedule taint to the node. If the pods on the node are not configured with a toleration that matches the taint, they are evicted by the descheduling policy. For more information, see RemovePodsViolatingNodeTaints.
The RemovePodsViolatingNodeTaints policy allows you to set the excludedTaints field to ignore specific node taints. If a taint key or key-value pair (key=value) matches an entry in the excludedTaints list, the taint is ignored, as shown in the following snippet.
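For example, the following args snippet (the maintenance taint name is illustrative) ignores any taint whose key is maintenance as well as the specific taint deschedule=not-allow:

```yaml
args:
  excludedTaints:
    - maintenance            # Key-only entry: matches taints with this key, regardless of value.
    - deschedule=not-allow   # key=value entry: matches only this exact key-value pair.
```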
In this example, the plug-in verifies taints that meet both of the following conditions:
- The taint effect on the node is set to NoSchedule.
- Among the NoSchedule taint attributes, the taint key-value pair is not the excluded deschedule=not-allow pair.
On nodes that meet the preceding conditions, if the running pods do not have the matching toleration, the pods will be evicted by the descheduler.
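Conversely, a pod that carries a matching toleration, such as the following sketch, can remain on a node with the deschedule=not-allow:NoSchedule taint and is not evicted by this policy:

```yaml
tolerations:
  - key: deschedule
    operator: Equal
    value: not-allow
    effect: NoSchedule   # Tolerates the deschedule=not-allow:NoSchedule taint.
```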
Step 1: Install or modify the ack-koordinator component and enable descheduling
You can follow the steps in this section to install the ack-koordinator component (formerly known as ack-slo-manager) and use the descheduling feature provided by Koordinator Descheduler. Koordinator Descheduler is deployed as a Deployment in the cluster.
If you have installed the ack-koordinator component, make sure that the version of the component is 1.2.0-ack.2 or later.
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Operations > Add-ons.
Find the ack-koordinator component and click Install in the lower-right corner. In the Install ack-koordinator dialog box, select Enable Descheduler for ACK-Koordinator to enable the descheduling module. Then, configure and install the component as prompted.
Step 2: Enable the descheduling plug-in named RemovePodsViolatingNodeTaints
Use the following YAML content to create a file named koord-descheduler-config.yaml:
The koord-descheduler-config.yaml file is a ConfigMap that enables and configures the descheduling plug-in RemovePodsViolatingNodeTaints.

```yaml
# koord-descheduler-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: koord-descheduler-config
  namespace: kube-system
data:
  koord-descheduler-config: |
    # Do not modify the following system configuration of koord-descheduler.
    apiVersion: descheduler/v1alpha2
    kind: DeschedulerConfiguration
    leaderElection:
      resourceLock: leases
      resourceName: koord-descheduler
      resourceNamespace: kube-system
    deschedulingInterval: 120s # The interval at which the descheduler runs. The interval is set to 120 seconds in this example.
    dryRun: false # The global read-only mode. After you enable this mode, koord-descheduler does not perform any operations.
    # The preceding configuration is the system configuration.
    profiles:
      - name: koord-descheduler
        plugins:
          deschedule:
            enabled:
              - name: RemovePodsViolatingNodeTaints # Enable the node taint verification plug-in.
        pluginConfig:
          - name: RemovePodsViolatingNodeTaints # The configurations of the node taint verification plug-in.
            args:
              excludedTaints:
                - deschedule=not-allow # Ignore nodes whose taint key is deschedule and taint value is not-allow.
```
Run the following command to deploy the koord-descheduler-config.yaml file in the cluster:
```
kubectl apply -f koord-descheduler-config.yaml
```
Run the following commands to restart the descheduler module, Koordinator Descheduler:
```
kubectl -n kube-system scale deploy ack-koord-descheduler --replicas 0
# Expected output:
# deployment.apps/ack-koord-descheduler scaled
kubectl -n kube-system scale deploy ack-koord-descheduler --replicas 1
# Expected output:
# deployment.apps/ack-koord-descheduler scaled
```
Setting the number of replicas of the ack-koord-descheduler Deployment to 0 and then back to 1 restarts the Koordinator Descheduler module. The latest configuration is used after the restart.
Step 3: Verify the descheduling feature
A cluster that contains three nodes is used as an example.
Use the following YAML content to create a file named stress-demo.yaml:
A sample application is defined in the stress-demo.yaml file.
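The full sample YAML is not reproduced here; the following is a minimal sketch that is consistent with the outputs shown later in this example. The image, command, and resource values are assumptions; any multi-replica Deployment works for this verification:

```yaml
# stress-demo.yaml -- sketch; the container name "stress" matches the Killing event shown later.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stress-demo
  namespace: default
  labels:
    app: stress-demo
spec:
  replicas: 3
  selector:
    matchLabels:
      app: stress-demo
  template:
    metadata:
      labels:
        app: stress-demo
    spec:
      containers:
        - name: stress
          image: polinux/stress            # Assumption: any stress-test image works here.
          command: ["stress", "--cpu", "1", "--vm", "1", "--vm-bytes", "128M"]
          resources:
            requests:
              cpu: "1"
              memory: 256Mi
```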
Run the following command to deploy the stress-demo.yaml file and create the test pods:
```
kubectl create -f stress-demo.yaml
```
Run the following command to view the status of the pods until all of them enter the Running state:
```
kubectl get pod -o wide
```
Expected output:
```
NAME                         READY   STATUS    RESTARTS   AGE   IP             NODE                       NOMINATED NODE   READINESS GATES
stress-demo-5f6cddf9-9****   1/1     Running   0          10s   192.XX.XX.27   cn-beijing.192.XX.XX.247   <none>           <none>
stress-demo-5f6cddf9-h****   1/1     Running   0          10s   192.XX.XX.20   cn-beijing.192.XX.XX.249   <none>           <none>
stress-demo-5f6cddf9-v****   1/1     Running   0          10s   192.XX.XX.32   cn-beijing.192.XX.XX.248   <none>           <none>
```
Run the following commands to add taints in the key=value:NoSchedule format to the nodes.

Add the deschedule=not-allow:NoSchedule taint to the node named cn-beijing.192.XX.XX.247:

```
kubectl taint nodes cn-beijing.192.XX.XX.247 deschedule=not-allow:NoSchedule
```
Expected output:
```
node/cn-beijing.192.XX.XX.247 tainted
```
Add the deschedule=allow:NoSchedule taint to the node named cn-beijing.192.XX.XX.248:

```
kubectl taint nodes cn-beijing.192.XX.XX.248 deschedule=allow:NoSchedule
```
Expected output:
```
node/cn-beijing.192.XX.XX.248 tainted
```
Run the following command to view the changes of the pods:
```
kubectl get pod -o wide -w
```
Wait for the descheduler module to check the node taints and perform the eviction operation.
Expected output:
```
NAME                         READY   STATUS              RESTARTS   AGE     IP             NODE                       NOMINATED NODE   READINESS GATES
stress-demo-5f6cddf9-9****   1/1     Running             0          5m34s   192.XX.XX.27   cn-beijing.192.XX.XX.247   <none>           <none>
stress-demo-5f6cddf9-h****   1/1     Running             0          5m34s   192.XX.XX.20   cn-beijing.192.XX.XX.249   <none>           <none>
stress-demo-5f6cddf9-v****   1/1     Running             0          5m34s   192.XX.XX.32   cn-beijing.192.XX.XX.248   <none>           <none>
stress-demo-5f6cddf9-v****   1/1     Terminating         0          7m58s   192.XX.XX.32   cn-beijing.192.XX.XX.248   <none>           <none>
stress-demo-5f6cddf9-j****   0/1     ContainerCreating   0          0s      <none>         cn-beijing.192.XX.XX.249   <none>           <none>
stress-demo-5f6cddf9-j****   1/1     Running             0          2s      192.XX.XX.32   cn-beijing.192.XX.XX.249   <none>           <none>
```
The output indicates the following information:
- The pod stress-demo-5f6cddf9-v**** on the node cn-beijing.192.XX.XX.248, which has the taint deschedule=allow:NoSchedule, is evicted.
- The pod stress-demo-5f6cddf9-9**** on the node cn-beijing.192.XX.XX.247, which has the taint deschedule=not-allow:NoSchedule, is not evicted.
- The evicted pod stress-demo-5f6cddf9-v**** is rescheduled to the node cn-beijing.192.XX.XX.249, which does not have a NoSchedule taint.
Run the following command to view the events of the evicted pod:
```
kubectl get event | grep stress-demo-5f6cddf9-v****
```
Expected output:
```
3m24s   Normal   Evicting        podmigrationjob/b0fba65f-7fab-4a99-96a9-c71a3798****   Pod "default/stress-demo-5f6cddf9-v****" evicted from node "cn-beijing.192.XX.XX.248" by the reason "RemovePodsViolatingNodeTaints"
2m51s   Normal   EvictComplete   podmigrationjob/b0fba65f-7fab-4a99-96a9-c71a3798****   Pod "default/stress-demo-5f6cddf9-v****" has been evicted
3m24s   Normal   Descheduled     pod/stress-demo-5f6cddf9-v****                         Pod evicted from node "cn-beijing.192.XX.XX.248" by the reason "RemovePodsViolatingNodeTaints"
3m24s   Normal   Killing         pod/stress-demo-5f6cddf9-v****                         Stopping container stress
```
The expected output shows the migration record of the pod. The pod is not configured with a toleration that matches the deschedule=allow:NoSchedule taint of the node cn-beijing.192.XX.XX.248. Consequently, the pod is descheduled to another node. The results meet expectations.
Configure advanced parameters
In addition to the preceding operations, you can also use a ConfigMap to configure advanced parameters for Koordinator Descheduler.
Example of advanced configurations
The following YAML shows an example of the advanced configurations of Koordinator Descheduler. The configurations modify the behavior of Koordinator Descheduler through the DeschedulerConfiguration API, enable the RemovePodsViolatingNodeTaints node taint verification policy, and use MigrationController as the pod evictor.
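The sample below is reconstructed from the parameters that are described in the following sections. Treat it as a sketch: the nodeSelector label and the numeric limits are illustrative values, not required settings.

```yaml
# Sketch of an advanced koord-descheduler configuration.
apiVersion: v1
kind: ConfigMap
metadata:
  name: koord-descheduler-config
  namespace: kube-system
data:
  koord-descheduler-config: |
    apiVersion: descheduler/v1alpha2
    kind: DeschedulerConfiguration
    leaderElection:
      resourceLock: leases
      resourceName: koord-descheduler
      resourceNamespace: kube-system
    dryRun: false                        # Global read-only mode. See "System configurations".
    deschedulingInterval: 120s           # Descheduling interval. See "System configurations".
    nodeSelector:                        # Deschedule only nodes that match this selector (illustrative label).
      matchLabels:
        descheduler: enabled
    maxNoOfPodsToEvictPerNode: 10        # Node-level eviction limit.
    profiles:
      - name: koord-descheduler
        plugins:
          deschedule:
            enabled:
              - name: RemovePodsViolatingNodeTaints   # Node taint verification policy.
          evict:
            enabled:
              - name: MigrationController             # Pod evictor.
        pluginConfig:
          - name: RemovePodsViolatingNodeTaints
            args:
              excludedTaints:
                - deschedule=not-allow
          - name: MigrationController
            args:
              maxMigratingPerNode: 2                  # See "Configure evictor plug-ins".
```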
For more information about the parameters in the sample configurations, see the following sections.
System configurations
You can configure the global, system-level behavior of Koordinator Descheduler in DeschedulerConfiguration.
| Parameter | Type | Valid value | Description | Example |
| --- | --- | --- | --- | --- |
| dryRun | boolean | true or false (default value: false) | The global read-only mode. After this mode is enabled, pods cannot be migrated. | false |
| deschedulingInterval | time.Duration | >0s | The descheduling interval. | 120s |
| nodeSelector | Structure | N/A | Limits the nodes on which descheduling takes effect. The descheduling policy takes effect only on the nodes that match the node selector. For more information about the node selector, see Kubernetes labelSelector. | |
| maxNoOfPodsToEvictPerNode | int | ≥ 0 (default value: 0) | The maximum number of pods that can be evicted at the same time on a node. This parameter takes effect during descheduling. | 10 |
| maxNoOfPodsToEvictPerNamespace | int | ≥ 0 (default value: 0) | The maximum number of pods that can be evicted at the same time in a namespace. This parameter takes effect during descheduling. | 10 |
Template configurations
Koordinator Descheduler uses descheduling templates to manage descheduling policies and pod evictors. You can define one or more descheduling templates in the profiles field of DeschedulerConfiguration. In each descheduling template, the descheduling policy and pod evictor are configured as plug-ins. A descheduling template contains the following parts:
name

The value is a string. You can customize the name of the descheduling template.
plugins

Configure the descheduling policies (deschedule, balance) that you want to enable or disable, the pod eviction plug-ins (evict), and the filter policies that are used before pod eviction (filter). Each of the deschedule, balance, evict, and filter fields is a structure in the following format:

```go
type PluginList struct {
    Enabled  []Plugin
    Disabled []Plugin
}

type Plugin struct {
    Name string
}
```

Enabled and Disabled are Plugin lists that specify the plug-ins to enable and disable. All plug-ins are disabled by default.

deschedule

Specify the Deschedule policies to enable. Example:

```yaml
plugins:
  deschedule:
    enabled:
      - name: PodLifeTime
      - name: RemovePodsViolatingNodeTaints
```

The following Deschedule policies are supported:
- RemovePodsViolatingInterPodAntiAffinity: evicts pods that violate inter-pod anti-affinity rules.
- RemovePodsViolatingNodeAffinity: evicts pods that do not match node affinity rules.
- RemovePodsViolatingNodeTaints: evicts pods that cannot tolerate node taints.
- RemovePodsHavingTooManyRestarts: evicts pods that frequently restart.
- PodLifeTime: evicts pods whose time to live (TTL) has expired.
- RemoveFailedPod: evicts pods that are in the Failed state.
balance

Specify the Balance policies to enable. Example:

```yaml
plugins:
  balance:
    enabled:
      - name: RemoveDuplicates
      - name: LowNodeLoad
```

The following Balance policies are supported:
- RemoveDuplicates: spreads replicated pods across nodes.
- LowNodeUtilization: performs hotspot spreading based on node resource allocation.
- HighNodeUtilization: performs load aggregation based on node resource allocation. Pods are migrated from nodes with low resource utilization to nodes with high resource utilization if the policy allows.
- RemovePodsViolatingTopologySpreadConstraint: evicts pods that violate topology spread constraints.
- LowNodeLoad: performs hotspot spreading based on node resource utilization.
evict

Specify the pod evictor to enable. Valid values: MigrationController and DefaultEvictor. MigrationController is enabled by default. Do not enable multiple evict plug-ins at the same time. Example:

```yaml
plugins:
  evict:
    enabled:
      - name: MigrationController
```
filter

Specify the eviction filtering policy that is used before pod eviction. Valid values: MigrationController and DefaultEvictor. MigrationController is enabled by default. Do not enable multiple filter plug-ins at the same time. Example:

```yaml
plugins:
  filter:
    enabled:
      - name: MigrationController
```
pluginConfig

Configure advanced parameters for each plug-in. Specify the name of the plug-in that you want to configure in the name field. For more information about how to configure a plug-in in the args field, see Configure policy plug-ins and Configure evictor plug-ins.
Configure policy plug-ins
Koordinator Descheduler supports six Deschedule policy plug-ins and five Balance policy plug-ins. The LowNodeLoad plug-in is provided by Koordinator. For more information, see Work with load-aware hotspot descheduling. The following descheduling plug-ins are provided by Kubernetes Descheduler:
| Policy type | Policy feature | Policy setting |
| --- | --- | --- |
| Deschedule | Evicts pods that violate inter-pod anti-affinity rules. | RemovePodsViolatingInterPodAntiAffinity |
| Deschedule | Evicts pods that do not match node affinity rules. | RemovePodsViolatingNodeAffinity |
| Deschedule | Evicts pods that cannot tolerate node taints. | RemovePodsViolatingNodeTaints |
| Deschedule | Evicts pods that frequently restart. | RemovePodsHavingTooManyRestarts |
| Deschedule | Evicts pods whose TTLs have expired. | PodLifeTime |
| Deschedule | Evicts pods that are in the Failed state. | RemoveFailedPod |
| Balance | Spreads replicated pods across nodes. | RemoveDuplicates |
| Balance | Performs hotspot spreading based on node resource allocation. | LowNodeUtilization |
| Balance | Performs load aggregation based on node resource allocation. | HighNodeUtilization |
| Balance | Evicts pods that violate topology spread constraints. | RemovePodsViolatingTopologySpreadConstraint |
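Policy plug-ins take their parameters through the pluginConfig section of a descheduling template. The following sketch configures RemovePodsHavingTooManyRestarts with the args that Kubernetes Descheduler defines for this plug-in; the threshold value is illustrative:

```yaml
pluginConfig:
  - name: RemovePodsHavingTooManyRestarts
    args:
      podRestartThreshold: 100        # Evict a pod after its containers restart at least 100 times.
      includingInitContainers: true   # Also count init container restarts.
```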
Configure evictor plug-ins
Koordinator Descheduler supports two evictor plug-ins: DefaultEvictor and MigrationController.
MigrationController

The following table describes the advanced configurations of the MigrationController evictor plug-in.
| Parameter | Type | Valid value | Description | Example |
| --- | --- | --- | --- | --- |
| evictLocalStoragePods | boolean | true or false (default value: false) | Specifies whether pods that are configured with emptyDir or hostPath volumes can be descheduled. For security reasons, this parameter is disabled by default. | false |
| maxMigratingPerNode | int64 | ≥ 0 (default value: 2) | The maximum number of pods that can be migrated at the same time on a node. A value of 0 indicates that no limit is set. | 2 |
| maxMigratingPerNamespace | int64 | ≥ 0 (default value: 0) | The maximum number of pods that can be migrated at the same time in a namespace. A value of 0 indicates that no limit is set. | 1 |
| maxMigratingPerWorkload | intOrString | ≥ 0 (default value: 10%) | The maximum number or percentage of pods that can be migrated at the same time in a workload, such as a Deployment. A value of 0 indicates that no limit is set. If a workload contains only one replicated pod, the workload is excluded from descheduling. | 1 or 10% |
| maxUnavailablePerWorkload | intOrString | ≥ 0 (default value: 10%) and smaller than the number of replicated pods in the workload | The maximum number or percentage of unavailable replicated pods that are allowed in a workload, such as a Deployment. A value of 0 indicates that no limit is set. | 1 or 10% |
| objectLimiters | Structure | N/A | Workload-specific pod migration throttling. For example, a limiter with duration set to 5m and maxMigrating set to 1 indicates that at most one replicated pod of a workload can be migrated within 5 minutes. | See the sketch after this table. |
| evictionPolicy | string | Multiple eviction modes are supported (default value: Eviction) | The method that is used to evict pods. The default Eviction mode calls the Eviction API. | Eviction |
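Put together, a MigrationController entry in pluginConfig might look like the following sketch. Field names follow the descriptions in the preceding table, and all values are illustrative:

```yaml
pluginConfig:
  - name: MigrationController
    args:
      evictLocalStoragePods: false     # Keep pods with emptyDir or hostPath volumes out of descheduling.
      maxMigratingPerNode: 2           # At most two pods can be migrating on a node at the same time.
      maxMigratingPerNamespace: 1      # At most one pod can be migrating in a namespace at the same time.
      maxMigratingPerWorkload: "10%"   # Per-workload concurrent migration limit.
      maxUnavailablePerWorkload: "10%" # Per-workload unavailability ceiling during migration.
      objectLimiters:
        workload:
          duration: 5m                 # Within any 5-minute window...
          maxMigrating: 1              # ...migrate at most one replica of a workload.
      evictionPolicy: Eviction         # Evict pods through the Eviction API.
```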
DefaultEvictor
The DefaultEvictor plug-in is provided by Kubernetes Descheduler. For more information, see DefaultEvictor.
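To switch to DefaultEvictor, enable it in both the evict and filter plug-in lists in place of MigrationController. The following is a sketch; whether the default MigrationController must be explicitly disabled can depend on the component version:

```yaml
plugins:
  evict:
    enabled:
      - name: DefaultEvictor
    disabled:
      - name: MigrationController
  filter:
    enabled:
      - name: DefaultEvictor
    disabled:
      - name: MigrationController
```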
Comparison
The following table compares the pod eviction capabilities between DefaultEvictor and MigrationController.
| Item | DefaultEvictor | MigrationController |
| --- | --- | --- |
| Eviction methods | Calls the Eviction API to evict pods. | Multiple eviction methods are supported. For more information, see Configure evictor plug-ins. |
| Eviction limits | Supports the system-level limits on the number of pods that can be evicted per node and per namespace. | Supports limits on the number of pods that can be migrated at the same time at the node, namespace, and workload levels. |
| Eviction throttling | Not supported | A time window-based throttling mechanism is adopted to ensure that pods that belong to the same workload are not frequently migrated. |
| Eviction observation | Component logs can be used to view pod eviction information. | In addition to component logs, events and PodMigrationJob records can be used to track pod migration progress and results. |
References
Some descheduling features depend on the ACK scheduler. For more information, see Work with load-aware hotspot descheduling.
You can use the descheduling feature together with cost insights to obtain the resource usage and cost distribution of your clusters. This helps generate cost-saving suggestions that increase the resource utilization of clusters. For more information, see Overview of cost insights.
For information about how to troubleshoot issues that you may encounter, see FAQ about scheduling.
For an introduction to ack-koordinator and its release notes, see ack-koordinator (ack-slo-manager).
ack-descheduler is discontinued. We recommend that you migrate from ack-descheduler to Koordinator Descheduler. For more information, see How to migrate from ack-descheduler to Koordinator Descheduler.