
Container Service for Kubernetes: Work with load-aware hotspot descheduling

Last Updated: Oct 27, 2024

ack-koordinator provides the load-aware hotspot descheduling feature, which detects changes in the loads on cluster nodes and automatically optimizes nodes whose load exceeds the safety threshold to prevent extreme load imbalance. This topic describes how to work with the load-aware hotspot descheduling feature and how to configure advanced settings for this feature.

Limits

  • Only ACK Pro clusters support load-aware hotspot descheduling. For more information, see Create an ACK managed cluster.

  • To use the load-aware hotspot descheduling feature, make sure that the following requirements are met.

    • ACK scheduler: v1.22.15-ack-4.0 and later or v1.24.6-ack-4.0 and later

    • ack-koordinator (ack-slo-manager): v1.1.1-ack.1 and later

    • Helm: 3.0 and later

Important
  • The descheduler only evicts pods, and the ACK scheduler reschedules the pods. We recommend that you use the descheduling feature together with load-aware scheduling. This enables the ACK scheduler to avoid scheduling pods to hotspots again.

  • During the rescheduling process, the old pods are evicted and then new pods are created. Make sure that your application has sufficient redundant replicas so that application availability is not affected during eviction. A PodDisruptionBudget can serve as an additional safeguard, as shown in the sketch after this note.

  • During the descheduling process, the standard Kubernetes eviction API is used to evict pods. Make sure that the application logic in the pods can tolerate restarts after eviction so that the service does not go down when pods are evicted and recreated. For more information, see API-initiated Eviction.
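
For example, the following is a minimal sketch of a PodDisruptionBudget that limits how many replicas can be unavailable while the descheduler evicts pods. The name, namespace, and minAvailable value are placeholders; the app: stress-demo label matches the example workload used later in this topic.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: stress-demo-pdb # Hypothetical name. 
  namespace: default
spec:
  minAvailable: 2 # Keep at least 2 replicas available during evictions. 
  selector:
    matchLabels:
      app: stress-demo # Must match the labels of the pods of your workload. 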

Billing rules

No fee is charged when you install and use the ack-koordinator component. However, fees may be charged in the following scenarios:

  • ack-koordinator is a non-managed component that occupies worker node resources after it is installed. You can specify the amount of resources requested by each module when you install the component.

  • By default, ack-koordinator exposes the monitoring metrics of features such as resource profiling and fine-grained scheduling as Prometheus metrics. If you enable Prometheus metrics for ack-koordinator and use Managed Service for Prometheus, these metrics are considered custom metrics and fees are charged for them. The fee depends on factors such as the size of your cluster and the number of applications. Before you enable Prometheus metrics, we recommend that you read the Billing overview topic of Managed Service for Prometheus to learn about the free quota and billing rules of custom metrics. For more information about how to monitor and manage resource usage, see Resource usage and bills.

Introduction to load-aware hotspot descheduling

This section describes the terms used in load-aware hotspot descheduling.

Load-aware pod scheduling

The ACK scheduler supports load-aware scheduling, which schedules pods to nodes that run with low loads. However, because the cluster environment, traffic, and requests change over time, node utilization changes dynamically. This may break the load balance between the nodes in the cluster and even result in extreme load imbalance, which affects the runtime quality of workloads. ack-koordinator can identify changes in node loads and automatically optimize nodes whose load exceeds the safety threshold to prevent extreme load imbalance. You can use a combination of load-aware scheduling and hotspot descheduling to achieve optimal load balancing among nodes. For more information, see Use load-aware pod scheduling.

How koord-descheduler works

koord-descheduler is a module of the ack-koordinator component. Its LowNodeLoad plug-in can identify changes in node loads and perform hotspot descheduling. Unlike the Kubernetes-native descheduler plug-in LowNodeUtilization, which makes descheduling decisions based on resource allocation, the LowNodeLoad plug-in makes decisions based on the actual utilization of nodes.

Descheduling procedure

koord-descheduler periodically performs descheduling. The following figure shows the steps of descheduling within each cycle.

(Figure: the descheduling procedure of koord-descheduler)

  1. Data collection: collects information about nodes and workloads in the cluster and the resource utilization statistics.

  2. Policy execution: the policy plug-in implements the descheduling policy. The following steps use LowNodeLoad as an example.

    1. Identifies hotspot nodes. For more information about the classification of nodes, see Load thresholds.

    2. Traverses all hotspot nodes, identifies the pods that can be migrated from each node, and sorts the pods. For more information about how pods are scored and sorted, see Pod scoring policy.

    3. Traverses all pods to be migrated and checks whether the pods meet the requirements for migration based on constraints such as the cluster size, resource utilization, and the ratio of replicated pods. For more information, see Load-aware hotspot descheduling policies.

    4. Only pods that meet the requirements are migrated. If no pod meets the requirements on the current node, LowNodeLoad continues to traverse the pods on other hotspot nodes.

  3. Pod eviction and migration: evicts the pods that meet the requirements for migration. For more information, see API-initiated Eviction.

Load thresholds

The LowNodeLoad plug-in allows you to set two load thresholds: lowThresholds and highThresholds.

In the following figure, lowThresholds is set to 45% and highThresholds is set to 70%. Nodes are classified based on their loads and the thresholds. If the values of lowThresholds and highThresholds change, the standards for node classification also change.

(Figure: node classification based on lowThresholds and highThresholds)

The resource utilization statistics are updated every minute and the average values within the previous 5 minutes are displayed.

  • Idle nodes: nodes whose resource utilization is lower than 45%.

  • Normal nodes: nodes whose resource utilization is higher than or equal to 45% but lower than or equal to 70%. This is the desired resource utilization range for cluster nodes.

  • Hotspot nodes: nodes whose resource utilization is higher than 70%. Pods on hotspot nodes will be evicted until the resource utilization of these nodes drops to 70% or lower.
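
For reference, the thresholds in the preceding figure map to the lowThresholds and highThresholds fields of the LowNodeLoadArgs configuration that is described later in this topic. The following sketch uses the 45% and 70% values from the figure and, for simplicity, applies the same percentage to both CPU and memory:

apiVersion: descheduler/v1alpha2
kind: LowNodeLoadArgs
lowThresholds:   # Nodes whose utilization is lower than these values are idle nodes. 
  cpu: 45
  memory: 45
highThresholds:  # Nodes whose utilization is higher than these values are hotspot nodes. 
  cpu: 70
  memory: 70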

Load-aware hotspot descheduling policies

koord-descheduler supports the following descheduling policies.

Hotspot detection frequency policy

To accurately identify hotspot nodes and avoid frequent pod migration caused by the delayed monitoring data, koord-descheduler allows you to specify the frequency of hotspot detection. A node is considered a hotspot node only if the number of times that the node consecutively exceeds the load threshold reaches the specified frequency value.
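
This frequency is controlled by the anomalyCondition.consecutiveAbnormalities field of the LowNodeLoad plug-in, which is described in the Advanced settings section. For example:

anomalyCondition:
  consecutiveAbnormalities: 5 # A node becomes a hotspot node only after it exceeds highThresholds in 5 consecutive detection cycles. 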

Node sorting policy

When hotspot nodes are identified, koord-descheduler sorts the nodes in descending order of resource usage and then deschedules them in sequence. koord-descheduler compares the memory and CPU usage of the hotspot nodes and preferentially deschedules the nodes with higher resource usage.

Pod scoring policy

koord-descheduler scores and sorts the pods on each hotspot node and then evicts the pods to idle nodes in the following order:

  1. Pods with lower priorities. The default priority is 0, which is the lowest.

  2. Pods with lower quality of service (QoS) classes.

  3. For pods with the same priority and QoS class, koord-descheduler will sort them based on factors such as resource usage and startup time.

Note

If you have specific requirements for the eviction order of pods, we recommend that you configure different priorities or QoS classes for them.

Filtering policy

koord-descheduler allows you to configure various pod and node filters to control descheduling.

  • Filter by Namespace: specifies the namespaces of the pods that can be descheduled. For more information, see evictableNamespaces.

  • Filter by pod selector: specifies the label selectors of the pods that can be descheduled. For more information, see podSelectors.

  • Filter by node selector: specifies the label selectors of the nodes that can be descheduled. For more information, see nodeSelector.
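
These filters map to the following fields of the LowNodeLoad plug-in configuration. The values shown here, such as the namespace, node pool ID, and QoS label, are illustrative and are described in the Advanced settings section:

evictableNamespaces: # Only pods in the default namespace can be descheduled. 
  include:
    - default
nodeSelector:        # Only nodes in the specified node pool can be descheduled. 
  matchLabels:
    alibabacloud.com/nodepool-id: np77f520e1108f47559e63809713ce****
podSelectors:        # Only pods with the LS QoS class can be descheduled. 
- name: lsPods
  selector:
    matchLabels:
      koordinator.sh/qosClass: "LS"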

Precheck policy

koord-descheduler can precheck pods before migrating the pods.

  • Checks the node affinity of the pods and the nodes available for pod scheduling before descheduling the pods. The following attributes are checked: node affinity, node selector, tolerations, and unallocated resources.

  • Checks the resource usage on idle nodes to ensure that the resource usage does not exceed the load threshold after pods are scheduled to the nodes. This avoids triggering descheduling frequently.

    Formula: Available resources on an idle node = (highThresholds - Load of the idle node) × Total resources on the idle node

    For example, the load of the idle node is 20%, the value of highThresholds is 70%, and the node has 96 vCores. The available number of vCores on the node is calculated as follows: (70% - 20%) × 96 = 48. In this scenario, koord-descheduler ensures that the total number of vCores requested by the migrated pods does not exceed 48.

Migration control policy

To ensure the high availability of applications during pod migration, koord-descheduler provides multiple features to allow you to control pod migration. You can specify the maximum number of pods that can be migrated at the same time per node, namespace, or workload. koord-descheduler also allows you to specify a pod migration time window to prevent pods that belong to the same workload from being migrated too frequently. koord-descheduler is compatible with the Pod Disruption Budgets (PDB) mechanism of open source Kubernetes, which helps you guarantee the high availability of your applications in a fine-grained manner. For more information, see Specifying a Disruption Budget for your Application.
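
These controls correspond to the parameters of the MigrationController plug-in in the descheduler ConfigMap, which are described in the Advanced settings section. A short excerpt:

apiVersion: descheduler/v1alpha2
kind: MigrationControllerArgs
maxMigratingPerNode: 1       # At most 1 pod can be migrated at the same time per node. 
maxMigratingPerNamespace: 1  # At most 1 pod can be migrated at the same time per namespace. 
maxMigratingPerWorkload: 1   # At most 1 pod can be migrated at the same time per workload. 
maxUnavailablePerWorkload: 2 # At most 2 replicated pods of a workload can be unavailable at the same time. 
objectLimiters:
  workload:                  # At most 1 replicated pod of a workload can be migrated within a 5-minute window. 
    duration: 5m
    maxMigrating: 1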

Observability policy

You can collect events to monitor the descheduling procedure, and view the reason and status of the descheduling in event details. The following code block shows an example:

kubectl get event | grep stress-demo-588f9646cf-7****
55s         Normal    Evicting           podmigrationjob/3bf8f623-4d10-4fc5-ab4e-2bead3c4****   Pod "default/stress-demo-588f9646cf-7****" evicted from node "cn-beijing.10.XX.XX.53" by the reason "node is overutilized, cpu usage(76.72%)>threshold(50.00%)"
22s         Normal    EvictComplete      podmigrationjob/3bf8f623-4d10-4fc5-ab4e-2bead3c4****   Pod "default/stress-demo-588f9646cf-7****" has been evicted
55s         Normal    Descheduled        pod/stress-demo-588f9646cf-7****                       Pod evicted from node "cn-beijing.10.XX.XX.53" by the reason "node is overutilized, cpu usage(76.72%)>threshold(50.00%)"
55s         Normal    Killing            pod/stress-demo-588f9646cf-7****                       Stopping container stress
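
To browse descheduling-related events for the whole cluster instead of a single pod, you can filter by event reason. This uses standard kubectl field selectors; the reason values are the ones shown in the preceding output:

kubectl get events -A --field-selector reason=Descheduled
kubectl get events -A --field-selector reason=Evicting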

Step 1: Install or modify ack-koordinator and enable load-aware hotspot descheduling

Install ack-koordinator

Install ack-koordinator. On the Install ack-koordinator(ack-slo-manager) page, select Enable Descheduler for ack-koordinator. For more information, see Install ack-koordinator.

Modify ack-koordinator (ack-koordinator is already installed)

Modify ack-koordinator. On the ack-koordinator Parameters page, select Enable Descheduler for ack-koordinator. For more information, see Modify ack-koordinator.
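
After the component is installed or modified with the descheduler enabled, you can check that the descheduler Deployment exists in the kube-system namespace. This is the same Deployment that is restarted in Step 2:

kubectl -n kube-system get deploy ack-koord-descheduler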

Step 2: Enable the LowNodeLoad plug-in

  1. Create a file named koord-descheduler-config.yaml and add the following YAML content to the file:

    The koord-descheduler-config.yaml file is a ConfigMap used to enable the LowNodeLoad plug-in.


    # koord-descheduler-config.yaml
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: koord-descheduler-config
      namespace: kube-system
    data:
      koord-descheduler-config: |
        # Do not modify the following system configuration of koord-descheduler. 
        apiVersion: descheduler/v1alpha2
        kind: DeschedulerConfiguration
        leaderElection:
          resourceLock: leases
          resourceName: koord-descheduler
          resourceNamespace: kube-system
        deschedulingInterval: 120s # The interval at which LowNodeLoad runs. The interval is set to 120 seconds in this example. 
        dryRun: false # The global read-only mode. After you enable this mode, koord-descheduler does not perform any operations. 
        # The preceding configuration is the system configuration. 
    
        profiles:
        - name: koord-descheduler
          plugins:
            deschedule:
              disabled:
                - name: "*"
            balance:
              enabled:
                - name: LowNodeLoad # Enable the LowNodeLoad plug-in. 
            evict:
              disabled:
                - name: "*"
              enabled:
                - name: MigrationController # Enable the migration controller. 
    
          pluginConfig:
          - name: MigrationController # Configure the parameters of the migration controller. 
            args:
              apiVersion: descheduler/v1alpha2
              kind: MigrationControllerArgs
              defaultJobMode: EvictDirectly
    
          - name: LowNodeLoad # Configure the LowNodeLoad plug-in. 
            args:
              apiVersion: descheduler/v1alpha2
              kind: LowNodeLoadArgs
    
              lowThresholds:  # Specify the low resource threshold for identifying idle nodes. Nodes whose load is lower than the threshold are considered idle nodes. 
                cpu: 20 # The low CPU utilization threshold is 20%. 
                memory: 30  # The low memory utilization threshold is 30%. 
              highThresholds: # Specify the high resource threshold for identifying hotspot nodes. Nodes whose load is higher than the threshold are considered hotspot nodes. 
                cpu: 50  # The high CPU utilization threshold is 50%. 
                memory: 60 # The high memory utilization threshold is 60%. 
    
              evictableNamespaces: # Specify the namespaces that you want to include or exclude for descheduling. The include and exclude lists are mutually exclusive. 
                include: # Specify the namespaces that you want to include for descheduling. 
                  - default
                # exclude: # Specify the namespaces that you want to exclude for descheduling. 
                  # - "kube-system"
                  # - "koordinator-system"
  2. Run the following command to apply the configuration to the cluster:

    kubectl apply -f koord-descheduler-config.yaml
  3. Run the following commands to restart koord-descheduler.

    After koord-descheduler is restarted, the modified configuration takes effect.

    kubectl -n kube-system scale deploy ack-koord-descheduler --replicas 0
    deployment.apps/ack-koord-descheduler scaled
    kubectl -n kube-system scale deploy ack-koord-descheduler --replicas 1
    deployment.apps/ack-koord-descheduler scaled
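
    Optionally, you can confirm that the ConfigMap exists and that the descheduler Deployment is back to one ready replica:

    kubectl -n kube-system get configmap koord-descheduler-config
    kubectl -n kube-system get deploy ack-koord-descheduler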

Step 3: (Optional) Enable the load-aware scheduling plug-in

Enable the load-aware scheduling plug-in to achieve optimal load balancing among nodes. For more information, see Step 1: Enable load-aware scheduling.

Step 4: Verify load-aware hotspot descheduling

In this section, a cluster that contains three nodes is used as an example. Each node has 104 vCores and 396 GB of memory.

  1. Create a file named stress-demo.yaml and add the following content to the file.


    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: stress-demo
      namespace: default
      labels:
        app: stress-demo
    spec:
      replicas: 6
      selector:
        matchLabels:
          app: stress-demo
      template:
        metadata:
          name: stress-demo
          labels:
            app: stress-demo
        spec:
          containers:
            - args:
                - '--vm'
                - '2'
                - '--vm-bytes'
                - '1600M'
                - '-c'
                - '2'
                - '--vm-hang'
                - '2'
              command:
                - stress
              image: polinux/stress
              imagePullPolicy: Always
              name: stress
              resources:
                limits:
                  cpu: '2'
                  memory: 4Gi
                requests:
                  cpu: '2'
                  memory: 4Gi
          restartPolicy: Always
  2. Run the following command to create the pods for stress testing:

    kubectl create -f stress-demo.yaml
    deployment.apps/stress-demo created
  3. Run the following command to view the status of the pod until it starts to run:

    kubectl get pod -o wide

    Expected output:

    NAME                           READY   STATUS    RESTARTS   AGE   IP            NODE                    NOMINATED NODE   READINESS GATES
    stress-demo-588f9646cf-s****   1/1     Running   0          82s   10.XX.XX.53   cn-beijing.10.XX.XX.53   <none>           <none>

    The output indicates that pod stress-demo-588f9646cf-s**** is scheduled to node cn-beijing.10.XX.XX.53.

  4. Increase the load of node cn-beijing.10.XX.XX.53. Then, run the following command to check the load of each node:

    kubectl top node

    Expected output:

    NAME                      CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
    cn-beijing.10.XX.XX.215   17611m       17%    24358Mi         6%
    cn-beijing.10.XX.XX.53    63472m       63%    11969Mi         3%

    The output indicates that the load of node cn-beijing.10.XX.XX.53 is higher, which is 63%. The load exceeds the high resource threshold of 50%. The load of node cn-beijing.10.XX.XX.215 is lower, which is 17%. The load is lower than the low resource threshold of 20%.

  5. Enable load-aware hotspot descheduling. For more information, see Step 2: Enable the LowNodeLoad plug-in.

  6. Run the following command to watch the changes to the pods.

    Wait for the descheduler to identify the hotspot node and evict pods.

    Note

    A node is considered a hotspot node if the node consecutively exceeds the high resource threshold 5 times within 10 minutes.

    kubectl get pod -w

    Expected output:

    NAME                           READY   STATUS               RESTARTS   AGE     IP           NODE                     NOMINATED NODE   READINESS GATES
    stress-demo-588f9646cf-s****   1/1     Terminating          0          59s   10.XX.XX.53    cn-beijing.10.XX.XX.53     <none>           <none>
    stress-demo-588f9646cf-7****   1/1     ContainerCreating    0          10s   10.XX.XX.215   cn-beijing.10.XX.XX.215    <none>           <none>
  7. Run the following command to view the event:

    kubectl get event | grep stress-demo-588f9646cf-s****

    Expected output:

    2m14s       Normal    Evicting            podmigrationjob/00fe88bd-8d4c-428d-b2a8-d15bcdeb****   Pod "default/stress-demo-588f9646cf-s****" evicted from node "cn-beijing.10.XX.XX.53" by the reason "node is overutilized, cpu usage(68.53%)>threshold(50.00%)"
    101s        Normal    EvictComplete       podmigrationjob/00fe88bd-8d4c-428d-b2a8-d15bcdeb****   Pod "default/stress-demo-588f9646cf-s****" has been evicted
    2m14s       Normal    Descheduled         pod/stress-demo-588f9646cf-s****                       Pod evicted from node "cn-beijing.10.XX.XX.53" by the reason "node is overutilized, cpu usage(68.53%)>threshold(50.00%)"
    2m14s       Normal    Killing             pod/stress-demo-588f9646cf-s****                       Stopping container stress

    The output indicates the migration result. The pods on the hotspot node are migrated to the idle node.

Advanced settings

Advanced settings for koord-descheduler

The configuration of koord-descheduler is stored in a ConfigMap. The following code block shows the advanced settings for load-aware hotspot descheduling.


# koord-descheduler-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: koord-descheduler-config
  namespace: kube-system
data:
  koord-descheduler-config: |
    # Do not modify the following system configuration of koord-descheduler. 
    apiVersion: descheduler/v1alpha2
    kind: DeschedulerConfiguration
    leaderElection:
      resourceLock: leases
      resourceName: koord-descheduler
      resourceNamespace: kube-system
    deschedulingInterval: 120s # The interval at which LowNodeLoad runs. The interval is set to 120 seconds in this example. 
    dryRun: false # The global read-only mode. After you enable this mode, koord-descheduler does not perform any operations. 
    # The preceding configuration is the system configuration. 

    profiles:
    - name: koord-descheduler
      plugins:
        deschedule:
          disabled:
            - name: "*"
        balance:
          enabled:
            - name: LowNodeLoad # Enable the LowNodeLoad plug-in. 
        evict:
          disabled:
            - name: "*"
          enabled:
            - name: MigrationController # Enable the migration controller. 

      pluginConfig:
      - name: MigrationController # Configure the parameters of the migration controller. 
        args:
          apiVersion: descheduler/v1alpha2
          kind: MigrationControllerArgs
          defaultJobMode: EvictDirectly
          maxMigratingPerNode: 1 # The maximum number of pods that can be migrated at the same time on a node. 
          maxMigratingPerNamespace: 1  # The maximum number of pods that can be migrated at the same time in a namespace. 
          maxMigratingPerWorkload: 1 # The maximum number of pods that can be migrated at the same time in a workload, such as a Deployment. 
          maxUnavailablePerWorkload: 2 # The maximum number of unavailable replicated pods that are allowed in a workload, such as a Deployment. 
          evictLocalStoragePods: false # Specify whether pods that are configured with emptyDir or hostPath volumes can be descheduled.
          objectLimiters:
            workload: # Control workload-specific pod migration. By default, the system can migrate only one replicated pod within 5 minutes after the first eviction. 
              duration: 5m
              maxMigrating: 1

      - name: LowNodeLoad # Configure the LowNodeLoad plug-in. 
        args:
          apiVersion: descheduler/v1alpha2
          kind: LowNodeLoadArgs

          lowThresholds:  # Specify the low resource threshold for identifying idle nodes. Nodes whose load is lower than the threshold are considered idle nodes. 
            cpu: 20 # The low CPU utilization threshold is 20%. 
            memory: 30  # The low memory utilization threshold is 30%. 
          highThresholds: # Specify the high resource threshold for identifying hotspot nodes. Nodes whose load is higher than the threshold are considered hotspot nodes. 
            cpu: 50  # The high CPU utilization threshold is 50%. 
            memory: 60 # The high memory utilization threshold is 60%. 

          anomalyCondition: # The hotspot node detection setting. 
            consecutiveAbnormalities: 5 # A node is considered a hotspot node if the node exceeds the highThresholds within five consecutive hotspot detection cycles. Hotspot nodes are descheduled and then the counter is reset. 

          evictableNamespaces: # Specify the namespaces that you want to include or exclude for descheduling. The include and exclude lists are mutually exclusive. 
            include: # Specify the namespaces that you want to include for descheduling. 
              - default
            # exclude: # Specify the namespaces that you want to exclude for descheduling. 
              # - "kube-system"
              # - "koordinator-system"

          nodeSelector: # Only the specified nodes can be descheduled. 
            matchLabels:
              alibabacloud.com/nodepool-id: np77f520e1108f47559e63809713ce****

          podSelectors: # Only the specified pods can be descheduled. 
          - name: lsPods
            selector:
              matchLabels:
                koordinator.sh/qosClass: "LS"

koord-descheduler system settings

dryRun

  • Type: boolean
  • Valid values: true or false (default value: false)
  • Description: The global read-only mode. After this mode is enabled, pods cannot be migrated.
  • Example: false

deschedulingInterval

  • Type: time.Duration
  • Valid values: > 0s
  • Description: The descheduling interval.
  • Example: 120s

Migration control settings

maxMigratingPerNode

  • Type: int64
  • Valid values: ≥ 0 (default value: 2)
  • Description: The maximum number of pods that can be migrated at the same time on a node. A value of 0 indicates that no limit is set.
  • Example: 2

maxMigratingPerNamespace

  • Type: int64
  • Valid values: ≥ 0 (default value: 0)
  • Description: The maximum number of pods that can be migrated at the same time in a namespace. A value of 0 indicates that no limit is set.
  • Example: 1

maxMigratingPerWorkload

  • Type: intOrString
  • Valid values: ≥ 0 (default value: 10%)
  • Description: The maximum number or percentage of pods that can be migrated at the same time in a workload, such as a Deployment. A value of 0 indicates that no limit is set. If the workload contains only one replicated pod, the workload is excluded from descheduling.
  • Example: 1 or 10%

maxUnavailablePerWorkload

  • Type: intOrString
  • Valid values: ≥ 0 (default value: 10%) and smaller than the number of replicated pods of the workload
  • Description: The maximum number or percentage of unavailable replicated pods that are allowed in a workload, such as a Deployment. A value of 0 indicates that no limit is set.
  • Example: 1 or 10%

evictLocalStoragePods

  • Type: boolean
  • Valid values: true or false (default value: false)
  • Description: Specifies whether pods that are configured with emptyDir or hostPath volumes can be descheduled. By default, this feature is disabled to ensure data security.
  • Example: false

objectLimiters.workload

  • Type: a structure in the following format:

    type MigrationObjectLimiter struct {
        Duration time.Duration `json:"duration,omitempty"`
        MaxMigrating *intstr.IntOrString `json:"maxMigrating,omitempty"`
    }

  • Valid values: Duration > 0 (default value: 5m); MaxMigrating ≥ 0 (default value: 10%)
  • Description: Workload-specific pod migration control.
    • Duration: specifies a time window. For example, 5m indicates 5 minutes.
    • MaxMigrating: the maximum number or percentage of replicated pods that can be migrated at the same time. Set it to an integer or a percentage value. By default, the value of maxMigratingPerWorkload is used.
  • Example:

    objectLimiters:
      workload:
        duration: 5m
        maxMigrating: 1

    The example indicates that at most one replicated pod of a workload can be migrated within 5 minutes.

LowNodeLoad settings

highThresholds

  • Type: map[string]float64
  • Valid values: [0,100]. Set the parameter to a percentage value. You can specify the parameter for CPU and memory separately.
  • Description: The high resource threshold. Pods on nodes whose load exceeds this threshold are descheduled.

    Note: You can use load-aware hotspot descheduling together with load-aware scheduling. Set this parameter and the loadAwareThreshold parameter to the same value. This way, the scheduler does not schedule pods to hotspot nodes. For more information, see Scheduling policies.

  • Example:

    highThresholds:
      cpu: 55 # The high CPU utilization threshold is set to 55%. 
      memory: 75 # The high memory utilization threshold is set to 75%.

lowThresholds

  • Type: map[string]float64
  • Valid values: [0,100]. Set the parameter to a percentage value. You can specify the parameter for CPU and memory separately.
  • Description: The low resource threshold. Pods on nodes whose load is lower than this threshold are not descheduled.
  • Example:

    lowThresholds:
      cpu: 25 # The low CPU utilization threshold is set to 25%. 
      memory: 25 # The low memory utilization threshold is set to 25%. 

anomalyCondition.consecutiveAbnormalities

  • Type: int64
  • Valid values: > 0 (default value: 5)
  • Description: The hotspot detection frequency. A node is considered a hotspot node if the node exceeds highThresholds within the specified number of consecutive hotspot detection cycles. Hotspot nodes are descheduled and then the counter is reset.
  • Example: 5

evictableNamespaces

  • Type: include: string; exclude: string
  • Valid values: namespaces in the cluster
  • Description: The namespaces that you want to include or exclude for descheduling. If you leave this parameter empty, all pods can be descheduled. You can specify either the include list or the exclude list. The lists are mutually exclusive.
    • include: the namespaces that you want to include for descheduling.
    • exclude: the namespaces that you want to exclude for descheduling.
  • Example:

    evictableNamespaces:
      exclude:
        - "kube-system"
        - "koordinator-system"

nodeSelector

  • Type: metav1.LabelSelector. For more information about the format of LabelSelector, see Labels and Selectors.
  • Description: Uses the LabelSelector to specify the nodes that can be descheduled. You can specify one node pool or multiple node pools when you configure this parameter.
  • Example:

    • Method 1

      # Select nodes in one node pool.
      nodeSelector:
        matchLabels:
          alibabacloud.com/nodepool-id: np77f520e1108f47559e63809713ce****

    • Method 2

      # Select nodes in multiple node pools.
      nodeSelector:
        matchExpressions:
        - key: alibabacloud.com/nodepool-id
          operator: In
          values:
          - node-pool1
          - node-pool2

podSelectors

  • Type: a list of PodSelector objects in the following format. For more information about the format of LabelSelector, see Labels and Selectors.

    type PodSelector struct {
        name     string
        selector metav1.LabelSelector
    }

  • Description: Specifies the pods that can be descheduled.
  • Example:

    # Only Latency-Sensitive (LS) pods can be descheduled. 
    podSelectors:
    - name: lsPods
      selector:
        matchLabels:
          koordinator.sh/qosClass: "LS"

FAQ

What do I do if the resource utilization of a node has reached the high threshold but no pod on the node is evicted?

The following list describes the possible causes and solutions, grouped by category.

  • Ineffective component configuration

    • No pods or nodes specified: No pods or nodes are specified in the configuration of the descheduler. Check whether the namespaces and nodes that you want to deschedule are specified.

    • Descheduler not restarted after modification: After you modify the configuration of the descheduler, you must restart the descheduler for the modification to take effect. For more information about how to restart the descheduler, see Step 2: Enable the LowNodeLoad plug-in.

  • Invalid node status

    • Average node resource utilization lower than the threshold for a long period of time: The descheduler continuously monitors the resource utilization within a period of time and calculates the average value. Descheduling is triggered only if the average value remains above the threshold for a certain period of time. The default time period is 10 minutes. The resource utilization returned by kubectl top node is an average value within 1 minute. We recommend that you monitor the resource utilization for a longer period of time and then adjust the hotspot detection frequency and detection interval.

    • Insufficient available resources in the cluster: Before the descheduler evicts a pod, it checks whether other nodes in the cluster can provide sufficient available resources. For example, if the descheduler wants to evict a pod that requests 8 vCores and 16 GB of memory but no node in the cluster can provide sufficient available resources for the pod, the descheduler does not evict the pod. To resolve this issue, add nodes to the cluster.

  • Workload limits

    • Only one replicated pod in the workload: By default, if a workload contains only one replicated pod, the pod is excluded from descheduling. This ensures the high availability of the application that runs in the pod. If you still want to deschedule such a pod, add the descheduler.alpha.kubernetes.io/evict: "true" annotation to the pod or to the TemplateSpec of the workload (Deployment or StatefulSet), as shown in the sketch after the following note.

      Note: This annotation configuration supports only v1.2.0-ack1.3 and earlier versions.
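
      A minimal sketch of adding the annotation to the pod template of a Deployment (the workload name, labels, and image are placeholders):

      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: single-replica-app # Hypothetical workload name. 
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: single-replica-app
        template:
          metadata:
            labels:
              app: single-replica-app
            annotations:
              descheduler.alpha.kubernetes.io/evict: "true" # Allow the descheduler to evict this single-replica workload. 
          spec:
            containers:
              - name: app
                image: nginx # Placeholder image. 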

    • Pods configured with emptyDir or hostPath volumes: By default, pods that are configured with emptyDir or hostPath volumes are excluded from descheduling to ensure data security. If you want to deschedule these pods, see the evictLocalStoragePods setting. For more information, see Migration control settings.

    • Excessive number of unavailable or migrating replicated pods: The number of unavailable replicated pods or replicated pods that are being migrated in a workload (Deployment or StatefulSet) exceeds the upper limit specified by maxUnavailablePerWorkload or maxMigratingPerWorkload. For example, both maxUnavailablePerWorkload and maxMigratingPerWorkload are set to 20%, the expected number of replicated pods of the Deployment is 10, and two pods are already being migrated or released. In this case, the descheduler does not evict more pods. Wait until the migration or release is complete, or increase the values of the preceding parameters.

    • Incorrect replicated pod limits: If the number of replicated pods in a workload is smaller than or equal to the maximum number of pods that can be migrated at the same time (maxMigratingPerWorkload) or the maximum number of unavailable pods that are allowed (maxUnavailablePerWorkload), the descheduler does not deschedule pods in the workload. Decrease the values of the preceding parameters and set the parameters to percentage values.

Why does the descheduler frequently restart?

The format of the ConfigMap of the descheduler is invalid or the ConfigMap does not exist. Refer to Advanced settings and check the content and format of the ConfigMap, modify the ConfigMap, and then restart the descheduler. For more information about how to restart the descheduler, see Step 2: Enable the LowNodeLoad plug-in.

How do I use a combination of load-aware scheduling and load-aware hotspot descheduling?

After you enable load-aware hotspot descheduling, pods on hotspot nodes are evicted. The ACK scheduler then selects proper nodes for the new pods that are created by the upper-layer controllers (such as Deployments). To achieve optimal load balancing, we recommend that you enable load-aware scheduling at the same time. For more information, see Use load-aware scheduling.

We recommend that you set the loadAwareThreshold parameter of the scheduler and the highThresholds parameter of the descheduler to the same value. For more information, see Scheduling policies. When the load of a node exceeds highThresholds, the descheduler evicts pods on the node. The scheduler stops scheduling new pods to the hotspot node due to loadAwareThreshold. If you do not set the parameters to the same value, pods may be scheduled to the hotspot node. This issue is more likely to occur when a pod has specified the scope of schedulable nodes but only a small number of nodes are available and the resource utilization values of these nodes are close.

What is the utilization algorithm used by the descheduler?

The descheduler continuously monitors the resource utilization within a period of time and calculates the average value. Descheduling is triggered only if the average value remains above the threshold for a certain period of time. The default time period is 10 minutes. In addition, the memory utilization counted by the descheduler excludes the page cache because the memory resource used by the page cache can be recycled by the operating system. The memory utilization queried by using the kubectl top node command includes the page cache. You can view the actual memory utilization in the Managed Service for Prometheus console.