Koordinator v1.1: Load-Aware Scheduling and Interference Detection Collection

By Koordinator Community

Background

Koordinator is a QoS-based scheduling system for hybrid orchestration workloads on Kubernetes. Its goal is to improve the runtime efficiency and reliability of both latency sensitive workloads and batch jobs, simplify the complexity of resource-related configuration tuning, and increase pod deployment density to improve resource utilization.

Since its release in April 2022, Koordinator has released nine versions. During the development of the project for more than half a year, the community absorbed a large number of excellent engineers to promote the maturity of the Koordinator project.

We are pleased to announce the official release of Koordinator v1.1, which includes load-aware scheduling/rescheduling, cgroup v2 support, interference detection metrics collection, and other optimization. Next, we will give an in-depth interpretation and explanation of these new features.

In-Depth Interpretation

Load-Aware Scheduling

It Supports Load Statistics and Load Balancing Based on the Workload Type.

Koordinator v1.0 and earlier versions provide load-aware scheduling and basic utilization threshold to prevent nodes with high load water levels from deteriorating and affecting the runtime quality of workloads and resolve cold node overload through the prediction mechanism. Existing load-aware scheduling can solve problems in many common scenarios. However, as an optimization method, load-aware scheduling still needs to be improved in many scenarios.

The current load-aware scheduling mainly implements the load balancing effect of the whole machine dimension within the cluster, but there may be some special cases: a lot of offline Pods are deployed on the nodes to run, raising the utilization of the whole machine, but the overall utilization rate of the online application workload is low. At this time, if there is a new online Pod and the resources in the entire cluster are insufficient, the following issues may occur:

Pods may not be scheduled to this node because the utilization rate of the whole machine exceeds the security threshold of the whole machine.
It is also possible that although the utilization rate of a node is relatively low, all online applications run on it. From the perspective of online applications, the utilization rate is already high, but according to the current scheduling policy, this Pod will continue to be scheduled, resulting in a large number of online applications accumulated on the node, and the overall running effect is not good.

In Koordinator v1.1, the koord-scheduler supports workload type awareness to differentiate water levels and policies for scheduling.

In the Filter phase:

The prodUsageThresholds parameter of threshold configuration is added to indicate the security threshold of the online application. This parameter is left empty by default. If the Pod to be scheduled is of the Prod type, the koord-scheduler collects the total utilization rate of all online applications from the NodeMetric of the current node. If the total utilization rate exceeds the prodUsageThresholds, the node is filtered out. If the Pod is offline or no prodUsageThresholds is configured, the original logic is used to process the Pod based on the utilization of the whole machine.

In the Score phase:

The scoreAccordingProdUsage switch indicates whether to score according to the Prod Pod utilization. This parameter is not enabled by default. If the switch is enabled and the current Pod is of the Prod type, the koord-scheduler only processes the Pods of the Prod type in the prediction algorithm. The current utilization rate of the Pods of other online applications that are not processed by the prediction algorithm in NodeMetrics is summed up. The summed value is used for the final score. If the scoreAccordingProdUsage is not enabled or the Pod is offline, the original logic is used to process the Pod based on the utilization rate of the whole machine.

It Supports Balancing by Percentile Utilization Rate.

Koordinator v1.0 and earlier perform filtering and scoring based on the average utilization data reported by koordlet. However, the average value hides a relatively large amount of information. Therefore, in Koordinator v1.1, koordlet supports data aggregation based on the percentile utilization rate. The scheduler side has also adapted accordingly.

Change the configuration of the LoadAware plugin of the scheduler. Aggregated indicates that data is aggregated based on percentile statistics for scoring and filtering.

aggregated.usageThresholds indicates the threshold for filtering. aggregated.usageAggregationType indicates the percentile type of the machine's utilization when filtering, including avg, p50, p90, p95, and p99. aggregated.usageAggregatedDuration indicates the statistical period of the percentile of the machine's utilization when filtering. When this field is not set, the scheduler uses the data of the maximum period in NodeMetrics by default. aggregated.scoreAggregationType indicates the percentile type of the machine's utilization when scoring.

aggregated.scoreAggregatedDuration indicates the statistical period of the percentile of Prod Pod's utilization when scoring. When this field is not set, the scheduler uses the data of the maximum period in NodeMetrics by default.

In the Filter phase:

If the aggregated.usageThresholds and the corresponding aggregation type are configured, the scheduler will perform filtering based on the percentile statistics.

In the Score phase:

If the aggregated.scoreAggregationType parameter is configured, the scheduler will score data based on the percentile statistics. Currently, percentile filtering is not supported for Prod pods.

Example

1. Change the koord-scheduler configuration, enable the utilization statistics by Prod, and make it take effect in the filtering and scoring phase. The percentile statistics utilization of the whole machine takes effect in the filtering and scoring phase.

apiVersion: v1
kind: ConfigMap
metadata:
  name: koord-scheduler-config
  ...
data:
  koord-scheduler-config: |
    apiVersion: kubescheduler.config.k8s.io/v1beta2
    kind: KubeSchedulerConfiguration
    profiles:
      - schedulerName: koord-scheduler
        plugins:
          # enable the LoadAwareScheduling plugin
          filter:
            enabled:
              - name: LoadAwareScheduling
              ...
          score:
            enabled:
              - name: LoadAwareScheduling
                weight: 1
              ...
          reserve:
            enabled:
              - name: LoadAwareScheduling
          ...
        pluginConfig:
        # configure the thresholds and weights for the plugin
        - name: LoadAwareScheduling
          args:
            apiVersion: kubescheduler.config.k8s.io/v1beta2
            kind: LoadAwareSchedulingArgs
            # whether to filter nodes where koordlet fails to update NodeMetric
            filterExpiredNodeMetrics: true
            # the expiration threshold seconds when using NodeMetric
            nodeMetricExpirationSeconds: 300
            # weights of resources
            resourceWeights:
              cpu: 1
              memory: 1
            # thresholds (%) of resource utilization
            usageThresholds:
              cpu: 75
              memory: 85
            # thresholds (%) of resource utilization of Prod Pods
            prodUsageThresholds:
              cpu: 55
              memory: 65
            # enable score according Prod usage
            scoreAccordingProdUsage: true
            # the factor (%) for estimating resource usage
            estimatedScalingFactors:
              cpu: 80
              memory: 70
            # enable resource utilization filtering and scoring based on percentile statistics
            aggregated:
              usageThresholds:
                cpu: 65
                memory: 75
              usageAggregationType: "p99"
              scoreAggregationType: "p99"

2. Deploy a Pod for stress testing

apiVersion: apps/v1
kind: Deployment
metadata:
  name: stress-demo
  namespace: default
  labels:
    app: stress-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: stress-demo
  template:
    metadata:
      name: stress-demo
      labels:
        app: stress-demo
    spec:
      containers:
        - args:
            - '--vm'
            - '2'
            - '--vm-bytes'
            - '1600M'
            - '-c'
            - '2'
            - '--vm-hang'
            - '2'
          command:
            - stress
          image: polinux/stress
          imagePullPolicy: Always
          name: stress
          resources:
            limits:
              cpu: '2'
              memory: 4Gi
            requests:
              cpu: '2'
              memory: 4Gi
      restartPolicy: Always
      schedulerName: koord-scheduler # use the koord-scheduler

$ kubectl create -f stress-demo.yaml
deployment.apps/stress-demo created

Wait for the stress testing Pod to be in the Running state

$ kubectl get pod -o wide
NAME                           READY   STATUS    RESTARTS   AGE   IP           NODE                    NOMINATED NODE   READINESS GATES
stress-demo-7fdd89cc6b-gcnzn   1/1     Running   0          82s   10.0.3.114   cn-beijing.10.0.3.112   <none>           <none>

The Pod is scheduled to the cn-beijing.10.0.3.112 node

3. Check the load of each node

$ kubectl top node
NAME                    CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
cn-beijing.10.0.3.110   92m          2%     1158Mi          9%
cn-beijing.10.0.3.111   77m          1%     1162Mi          9%
cn-beijing.10.0.3.112   2105m        53%    3594Mi          28%

The output shows that the node cn-beijing.10.0.3.111 has the lowest load, and the node cn-beijing.10.0.3.112 has the highest load.

4. Deploy an online Pod

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-with-loadaware
  labels:
    app: nginx
spec:
  replicas: 6
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      name: nginx
      labels:
        app: nginx
    spec:
      # Use koord-prod to indicate that the Pod is Prod
      priorityClassName: "koord-prod"
      schedulerName: koord-scheduler # use the koord-scheduler
      containers:
      - name: nginx
        image: nginx
        resources:
          limits:
            cpu: 500m
          requests:
            cpu: 500m

$ kubectl create -f nginx-with-loadaware.yaml
deployment/nginx-with-loadawre created

5. Check the scheduling result

$ kubectl get pods | grep nginx
nginx-with-loadaware-5646666d56-224jp   1/1     Running   0          18s   10.0.3.118   cn-beijing.10.0.3.110   <none>           <none>
nginx-with-loadaware-5646666d56-7glt9   1/1     Running   0          18s   10.0.3.115   cn-beijing.10.0.3.110   <none>           <none>
nginx-with-loadaware-5646666d56-kcdvr   1/1     Running   0          18s   10.0.3.119   cn-beijing.10.0.3.110   <none>           <none>
nginx-with-loadaware-5646666d56-qzw4j   1/1     Running   0          18s   10.0.3.113   cn-beijing.10.0.3.111   <none>           <none>
nginx-with-loadaware-5646666d56-sbgv9   1/1     Running   0          18s   10.0.3.120   cn-beijing.10.0.3.111   <none>           <none>
nginx-with-loadaware-5646666d56-z79dn   1/1     Running   0          18s   10.0.3.116   cn-beijing.10.0.3.111   <none>           <none>

The preceding output shows that since the cluster has enabled the load-aware scheduling function, it can sense the node load. After using the scheduling policy, the Pods are preferentially scheduled to nodes other than cn-beijing.10.0.3.112.

Load-Aware Rescheduling

Koordinator has continued to evolve the rescheduler over the past few releases, successively opening the source of the complete framework, enhancing security, and preventing excessive eviction of Pod from affecting the stability of online applications. This also affected the progress of the rescheduling function. Koordinator did not have much power to build the rescheduling capability in the past. This situation will be changed.

Koordinator v1.1 has added the load-aware rescheduling feature. The new plugin is called LowNodeLoad. This plugin cooperates with the load-aware scheduling capability of the scheduler to form a closed loop. The load-aware scheduling of the scheduler makes decisions to select the optimal node at the scheduling moment. However, as the time and cluster environment and the traffic/requests faced by the workload change, load-aware rescheduling can intervene to help optimize the nodes whose load level exceeds the safety threshold. The difference between LowNodeLoad and K8s descheduler plugin LowNodeUtilization is that LowNodeLoad determines rescheduling based on the actual node utilization, while LowNodeUtilization determines rescheduling based on the resource allocation rate.

Basic Principles

The LowNodeLoad plugin has two important parameters:

HighThresholds indicates the target security threshold of the load level. Pods on nodes that exceed this threshold will be rescheduled.
LowThresholds indicates the idle safe water level of the load water level. Pods on nodes below this threshold are not rescheduled.

In the following figure, the value of lowThresholds is 45%, and the value of highThresholds is 70%. We can classify nodes into three categories:

Idle Node Nodes with resource utilization lower than 45%
Normal Node Nodes whose resource utilization is higher than 45% but lower than 70%. This load level range is a reasonable range that we expect.
Hotspot Node If the resource utilization of a node is higher than 70%, the node is considered to be unsafe and a hotspot node. Some Pods should be evicted, and the load level should be reduced to no more than 70%.

After identifying which nodes are hotspot nodes, descheduler performs migration and eviction to evict some Pods on the hotspot nodes to idle nodes.

If the total number of idle nodes in a cluster is not large, the rescheduling is terminated. This may be helpful in large clusters, where some nodes may be underused often or for a short time. By default, numberOfNodes is set to zero. You can set the numberOfNodes parameter to enable this feature.

Before the migration, descheduler calculates the actual idle capacity to ensure that the sum of the actual utilization of the Pods to be migrated does not exceed the total idle capacity in the cluster. This actual idle capacity comes from idle nodes. The actual idle capacity of one idle node = (highThresholds - the current load of the node) the total capacity of the node. Assuming the load level of node A is 20%, the value of highThresholds is 70%, and the total CPU capacity of node A is 96C, then (70% - 20%) 96 = 48C, which is the idle capacity that can be used.

In addition, when migrating hotspot nodes, Pods on the hotspot nodes are filtered. Currently, the descheduler supports a variety of filtering parameters, which can avoid the migration and eviction of very important pods.

Filter by Namespace: You can configure the descheduler to only filter certain namespaces or filter out certain namespaces.
Filter by Pod Selector: You can use the label selector to filter Pods or exclude Pods that have certain labels
Configure nodeFit to check whether the scheduling rule has alternative nodes. After it is enabled, the descheduler checks whether the Pod has a matching Node based on the Node Affinity/Node Selector/Toleration. If the Pod does not have a matching node, it will not be evicted and migrated. If you set the nodeFit parameter to false, the migration controller at the underlying layer of descheduler completes the capacity reservation to ensure resources are available before the migration starts.

After Pods are filtered, they are sorted by QoSClass, Priority, actual usage, and creation time.

After Pods are filtered and sorted, the migration starts. Before the migration, the system checks whether the remaining idle capacity is met and whether the load level of the current node is higher than the target safety threshold. If one of the two conditions is not met, the system stops rescheduling. Each time a Pod is migrated, the remaining idle capacity is withheld, and the load level of the current node is adjusted until the remaining capacity is insufficient or the water level reaches the safety threshold.

Example

1. Change the koord-descheduler configuration and enable LowNodeLoad

apiVersion: v1
kind: ConfigMap
metadata:
  name: koord-descheduler-config
  ...
data:
  koord-descheduler-config: |
    apiVersion: descheduler/v1alpha2
    kind: DeschedulerConfiguration
    ...
    deschedulingInterval: 60s  # The execution cycle. The LowNodeLoad plugin is executed once in 60s
    profiles:
      - name: koord-descheduler
        plugins:
          deschedule:
            disabled:
              - name: "*"
          balance:
            enabled:
              - name: LowNodeLoad  # Enable the LowNodeLoad plugin
          ....
        pluginConfig:
        # Parameters of the LowNodeLoad plugin
        - name: LowNodeLoad
          args:
            apiVersion: descheduler/v1alpha2
            kind: LowNodeLoadArgs
            evictableNamespaces:
            # Include and exclude are mutually exclusive. You can configure only one of them. 
            # include: # include indicates that only the following configured namespace is processed
            #   - test-namespace
              exclude:
                - "kube-system" # The namespace to be excluded
                - "koordinator-system"
            lowThresholds:  # lowThresholds indicates the access watermark threshold of an idle node
              cpu: 20 # CPU utilization is 20%
              memory: 30  # Memory utilization is 30%
            highThresholds: # highThresholds indicates the target security threshold. Nodes that exceed this threshold are determined as hotspot nodes
              cpu: 50  # CPU utilization is 50%
              memory: 60 # Memory utilization is 60%
        ....

2. Deploy a Pod for stress testing

apiVersion: apps/v1
kind: Deployment
metadata:
  name: stress-demo
  namespace: default
  labels:
    app: stress-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: stress-demo
  template:
    metadata:
      name: stress-demo
      labels:
        app: stress-demo
    spec:
      containers:
        - args:
            - '--vm'
            - '2'
            - '--vm-bytes'
            - '1600M'
            - '-c'
            - '2'
            - '--vm-hang'
            - '2'
          command:
            - stress
          image: polinux/stress
          imagePullPolicy: Always
          name: stress
          resources:
            limits:
              cpu: '2'
              memory: 4Gi
            requests:
              cpu: '2'
              memory: 4Gi
      restartPolicy: Always
      schedulerName: koord-scheduler # use the koord-scheduler

$ kubectl create -f stress-demo.yaml
deployment.apps/stress-demo created

Wait for the stress testing Pod to be in the Running state

$ kubectl get pod -o wide
NAME                           READY   STATUS    RESTARTS   AGE   IP           NODE                    NOMINATED NODE   READINESS GATES
stress-demo-7fdd89cc6b-gcnzn   1/1     Running   0          82s   10.0.3.114   cn-beijing.10.0.3.121   <none>           <none>

Pod stress-demo-7fdd89cc6b-gcnzn is scheduled to the cn-beijing.10.0.3.121 node.

3. Check the load of each node

$ kubectl top node
NAME                    CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
cn-beijing.10.0.3.121   2106m        54%    4452Mi          35%
cn-beijing.10.0.3.124   73m          1%     1123Mi          8%
cn-beijing.10.0.3.125   69m          1%     1064Mi          8%

The output shows that the cn-beijing.10.0.3.124 node and the cn-beijing.10.0.3.125 node have the lowest load, and the node cn-beijing.10.0.3.112 has the highest load, which exceeds the configured highThresholds.

4. Observe the Pod changes and wait for the rescheduler to execute the eviction and migration operation

$ kubectl get pod -w
NAME                           READY   STATUS               RESTARTS   AGE     IP           NODE                    NOMINATED NODE   READINESS GATES
stress-demo-7fdd89cc6b-l7psv   1/1     Running              0          4m45s   10.0.3.127   cn-beijing.10.0.3.121   <none>           <none>
stress-demo-7fdd89cc6b-l7psv   1/1     Terminating          0          8m34s   10.0.3.127   cn-beijing.10.0.3.121   <none>           <none>
stress-demo-7fdd89cc6b-b4c5g   0/1     Pending              0          0s      <none>       <none>                  <none>           <none>
stress-demo-7fdd89cc6b-b4c5g   0/1     Pending              0          0s      <none>       <none>                  <none>           <none>
stress-demo-7fdd89cc6b-b4c5g   0/1     Pending              0          0s      <none>       cn-beijing.10.0.3.124   <none>           <none>
stress-demo-7fdd89cc6b-b4c5g   0/1     ContainerCreating    0          0s      <none>       cn-beijing.10.0.3.124   <none>           <none>
stress-demo-7fdd89cc6b-b4c5g   0/1     ContainerCreating    0          3s      <none>       cn-beijing.10.0.3.124   <none>           <none>
stress-demo-7fdd89cc6b-b4c5g   1/1     Running              0          20s     10.0.3.130   cn-beijing.10.0.3.124   <none>           <none>

5. Observe the Event. You can see the following migration records:

$ kubectl get event |grep stress-demo-7fdd89cc6b-l7psv
2m45s       Normal    Evicting                  podmigrationjob/20c8c445-7fa0-4cf7-8d96-7f03bb1097d9   Try to evict Pod "default/stress-demo-7fdd89cc6b-l7psv"
2m12s       Normal    EvictComplete             podmigrationjob/20c8c445-7fa0-4cf7-8d96-7f03bb1097d9   Pod "default/stress-demo-7fdd89cc6b-l7psv" has been evicted
11m         Normal    Scheduled                 pod/stress-demo-7fdd89cc6b-l7psv                       Successfully assigned default/stress-demo-7fdd89cc6b-l7psv to cn-beijing.10.0.3.121
11m         Normal    AllocIPSucceed            pod/stress-demo-7fdd89cc6b-l7psv                       Alloc IP 10.0.3.127/24
11m         Normal    Pulling                   pod/stress-demo-7fdd89cc6b-l7psv                       Pulling image "polinux/stress"
10m         Normal    Pulled                    pod/stress-demo-7fdd89cc6b-l7psv                       Successfully pulled image "polinux/stress" in 12.687629736s
10m         Normal    Created                   pod/stress-demo-7fdd89cc6b-l7psv                       Created container stress
10m         Normal    Started                   pod/stress-demo-7fdd89cc6b-l7psv                       Started container stress
2m14s       Normal    Killing                   pod/stress-demo-7fdd89cc6b-l7psv                       Stopping container stress
11m         Normal    SuccessfulCreate          replicaset/stress-demo-7fdd89cc6b                      Created pod: stress-demo-7fdd89cc6b-l7psv

cgroup v2 Support

Background

Many standalone QoS capabilities and resource throttling/scaling policies in Koordinator are built on the Linux Control Group (cgroups) mechanism, such as CPU QoS (cpu), Memory QoS (memory), CPU Burst (cpu), and CPU Suppress (cpu, cpuset). The koordlet component can use cgroups (v1) to limit the time slice, weight, priority, topology, and other attributes of the available resources of the container. The high-version Linux kernel is also continuously enhancing and updating the cgroups mechanism, bringing the cgroups v2 mechanism to unify the cgroups directory structure, improving cooperation between different subsystems/cgroup controllers in v1, and enhancing the resource management and monitoring capabilities of some subsystems. Kubernetes has adopted cgroups v2 as a general availability (GA) feature since 1.25. This feature is enabled in Kubelet to perform resource management of containers and set resource isolation parameters for containers at a unified cgroups layer to support the enhanced feature of MemoryQoS.

In Koordinator v1.1, the standalone component koordlet adds the support for cgroups v2, including the following work:

The Resource Executor module is refactored to unify the file operations of the same or similar cgroup interfaces on different versions of v1 and v2. This facilitates the compatibility of koldlet features with cgroups v2 and merging Write–Read (WR) Conflict.
Adapt cgroups v2 to the currently available standalone features, replace cgroup operations with a new Resource Executor module, and optimize error logs in different system environments.

Most koordlet features in Koordinator v1.1 are compatible with cgroups v2, including (but not limited to):

Resource Utilization Collection
Dynamic Resource Overcommitment
Batch Resource Isolation (BatchResource, deprecated BECgroupReconcile)
CPU QoS (GroupIdentity)
Memory QoS (CgroupReconcile)
CPU Dynamic Suppression (BECPUSuppress)
Memory Eviction (BEMemoryEvict)
CPU Burst (CPUBurst)
L3 Cache and Memory Bandwidth Isolation (RdtResctrl)

Incompatible features (such as PSICollector) will be adapted in the following v1.2, and you can follow issue#407 for the latest progress. More cgroups v2 enhancements will be introduced in the next Koordinator release.

Use cgroups v2

In Koordinator v1.1, the adaptation of koordlet to cgroups v2 is transparent to the upper-layer feature configurations. You do not need to change the ConfigMap slo-controller-config and other feature-gate configurations except for the feature-gate of deprecated features. When koordlet runs on a node with cgroups v2 enabled, the corresponding standalone feature automatically switches to operate on the cgroups-v2 system interface.

In addition, cgroups v2 is a feature of the higher-version Linux kernel (recommended >= 5.8) and depends on the system kernel version and Kubernetes version. We recommend using a Linux distribution with cgroups v2 enabled by default and Kubernetes v1.24 and above.

Please see the documentation for more information about how to enable cgroups v2.

Develop Support for cgroups v2 in Koordlet

If you expect to develop and customize new features that support cgroups v2 in the koordlet component, you are welcome to learn about the new system resource interface, Resource, and system file operation module, ResourceExecutor, in Koordinator v1.1, which are designed to optimize the consistency and compatibility of system file operations (such as cgroups and resctrl).

You can modify the resource isolation parameters of a container using a common cgroups interface:

var (
    // NewCgroupReader() generates a cgroup reader for reading cgroups with the current cgroup version.
// e.g. read `memory.limit_in_bytes` on v1, while read `memory.max` on v2.
    cgroupReader = resourceexecutor.NewCgroupReader()
    // NewResourceUpdateExecutor() generates a resource update executor for updating system resources (e.g. cgroups, resctrl) cacheablely and in order.
    executor = resourceexecutor.NewResourceUpdateExecutor()
)

// readPodCPUSet reads the cpuset CPU IDs of the given pod.
// e.g. read `/sys/fs/cgroup/cpuset/kubepods.slice/kubepods-podxxx.slice/cpuset.cpus` -> `6-15`
func readPodCPUSet(podMeta *statesinformer.PodMeta) (string, error) {
    podParentDir := koordletutil.GetPodCgroupDirWithKube(podMeta.CgroupDir)
    cpus, err := cgroupReader.ReadCPUSet(podParentDir)
    if err != nil {
        return "", err
    }
    return cpus.String(), nil
}

func updatePodCFSQuota(podMeta *statesinformer.PodMeta, cfsQuotaValue int64) error {
    podDir := koordletutil.GetPodCgroupDirWithKube(podMeta.CgroupDir)
    cfsQuotaStr := strconv.FormatInt(cfsQuotaValue, 10)
    // DefaultCgroupUpdaterFactory.New() generates a cgroup updater for cacheable updating cgroups with the current cgroup version.
    // e.g. update `cpu.cfs_quota_us` on v1, while update `cpu.max` on v2.
    updater, err := resourceexecutor.DefaultCgroupUpdaterFactory.New(system.CPUCFSQuotaName, podParentDir, cfsQuotaStr)
    if err != nil {
        return err
    }
    // Use executor to cacheable update the cgroup resource, and avoid the repeated but useless writes.
  _, err := executor.Update(true, updater)
    if err != nil {
    return err
  }
    return nil
}

You can also add and register cgroups resources and update functions in the following ways:

// package system

const (
    // Define the cgroup filename as the resource type of the cgroup resource.
    CgroupXName   = "xx.xxx"
    CgroupYName   = "yy.yyy"
    CgroupXV2Name = "xx.xxxx"
    CgroupYV2Name = "yy.yy"
)

var (
    // New a cgroup v1 resource with the filename and the subsystem (e.g. cpu, cpuset, memory, blkio).
    // Optional: add a resource validator to validate the written values, and add a check function to check if the system supports this resource.
    CgroupX = DefaultFactory.New(CgroupXName, CgroupXSubfsName).WithValidator(cgroupXValidator).WithCheckSupported(cgroupXCheckSupportedFunc)
    CgroupY = DefaultFactory.New(CgroupYName, CgroupYSubfsName)
    // New a cgroup v2 resource with the corresponding v1 filename and the v2 filename.
    // Optional: add a resource validator to validate the written values, and add a check function to check if the system supports this resource.
    CgroupXV2 = DefaultFactory.NewV2(CgroupXName, CgroupXV2Name).WithValidator(cgroupXValidator).WithCheckSupported(cgroupXV2CheckSupportedFunc)
    CgroupYV2 = DefaultFactory.NewV2(CgroupYName, CgroupYV2Name).WithCheckSupported(cgroupYV2CheckSupportedFunc)
)

func init() {
    // Register the cgroup resource with the corresponding cgroup version.
    DefaultRegistry.Add(CgroupVersionV1, CgroupX, CgroupY)
    DefaultRegistry.Add(CgroupVersionV2, CgroupXV2, CgroupYV2)
}

// package resourceexecutor

func init() {
    // Register the cgroup updater with the resource type and the generator function.
    DefaultCgroupUpdaterFactory.Register(NewCommonCgroupUpdater,
    system.CgroupXName,
        system.CgroupYName,
}

Interference Detection Metric Collection

In the real production environment, the standalone runtime status is a chaotic system, and the application interference caused by resource competition cannot be avoided. Koordinator is building the capability of interference detection and optimization. It performs real-time analysis and detection by extracting metrics of the application running status and adopts more targeted strategies for target applications and interference sources after interference is detected.

Koordinator has implemented a series of Performance Collectors. It collects underlying metrics that are highly correlated with the running status of applications on a standalone side and exposes them through Prometheus, thus providing support for interference detection and cluster application scheduling.

Metric Collection

Performance Collector is controlled by multiple feature-gates. Koordinator currently provides the following metrics collectors:

CPICollector: It is used to control the CPI metric collector. Cycles Per Instruction (CPI) is the average number of clock cycles required for an instruction to execute in a computer. The CPI collector is implemented based on the two Kernel Performance Monitoring Unit (PMU) events (Cycles and Instructions) and perf_event_open(2) system call.
PSICollector: It is used to control the PSI metric collector. Pressure Stall Information (PSI) indicates the number of blocked tasks of the container during the collection interval while waiting for CPU, memory, and IO resources to be allocated. Before you use the PSI collector, you must enable the PSI feature in Anolis OS.

Performance Collector is currently turned off by default. You can use it by modifying the feature-gates of koordlet, which does not affect other feature-gates:

kubectl edit ds koordlet -n koordinator-system

...
spec:
  ...
    spec:
      containers:
      - args:
        ...
        # modify here
        # - -feature-gates=BECPUEvict=true,BEMemoryEvict=true,CgroupReconcile=true,Accelerators=true
        - -feature-gates=BECPUEvict=true,BEMemoryEvict=true,CgroupReconcile=true,Accelerators=true,CPICollector=true,PSICollector=true

ServiceMonitor

In Koordinator v1.1.0, the ServiceMonitor feature is introduced to koordlet to expose collected metrics through Prometheus. You can use this feature to collect metrics for application system analysis and management.

apiVersion: v1
kind: Service
metadata:
  labels:
    koord-app: koordlet
  name: koordlet
  namespace: koordinator-system
spec:
  clusterIP: None
  ports:
  - name: koordlet-service
    port: 9316
    targetPort: 9316
  selector:
    koord-app: koordlet
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    koord-app: koordlet
  name: koordlet
  namespace: koordinator-system
spec:
  endpoints:
  - interval: 30s
    port: koordlet-service
    scheme: http
  jobLabel: koord-app
  selector:
    matchLabels:
      koord-app: koordlet

ServiceMonitor is introduced by Prometheus. Therefore, the installation of ServiceMonitor is disabled by default in the Helm chart. You can run the following command to install ServiceMonitor:

helm install koordinator https://... --set koordlet.enableServiceMonitor=true

After deployment, you can find the Targets in the Prometheus UI:

# HELP koordlet_container_cpi Container cpi collected by koordlet
# TYPE koordlet_container_cpi gauge
koordlet_container_cpi{container_id="containerd://498de02ddd3ad7c901b3c80f96c57db5b3ed9a817dbfab9d16b18be7e7d2d047",container_name="koordlet",cpi_field="cycles",node="your-node-name",pod_name="koordlet-x8g2j",pod_namespace="koordinator-system",pod_uid="3440fb9c-423b-48e9-8850-06a6c50f633d"} 2.228107503e+09
koordlet_container_cpi{container_id="containerd://498de02ddd3ad7c901b3c80f96c57db5b3ed9a817dbfab9d16b18be7e7d2d047",container_name="koordlet",cpi_field="instructions",node="your-node-name",pod_name="koordlet-x8g2j",pod_namespace="koordinator-system",pod_uid="3440fb9c-423b-48e9-8850-06a6c50f633d"} 4.1456092e+09

It is expected that the ability of Koordinator to detect interference needs more detection metrics in more complex real-world scenarios. We will continue to make efforts in the collection and construction of metrics for many other resources (such as memory and disk IO).

Other Updates

You can see the new features on the v1.1 release page.

v1.1 release page: https://github.com/koordinator-sh/koordinator/releases/tag/v1.1.0

Planning

The Koordinator community will continue to enrich the forms of big data computing tasks, expand the co-location support for multiple computing frameworks, and enrich co-location task solutions. It will continue to improve the interference detection and problem diagnosis systems, promote the integration of more load types into the Koordinator ecosystem, and achieve better resource operation efficiency.

Click here to learn more product features of Koordinator v1.1.

Community

Koordinator v1.1: Load-Aware Scheduling and Interference Detection Collection

Background

In-Depth Interpretation

Load-Aware Scheduling

It Supports Load Statistics and Load Balancing Based on the Workload Type.

It Supports Balancing by Percentile Utilization Rate.

Example

Load-Aware Rescheduling

Basic Principles

Example

cgroup v2 Support

Background

Use cgroups v2

Develop Support for cgroups v2 in Koordlet

Interference Detection Metric Collection

Metric Collection

ServiceMonitor

Other Updates

Planning

Read previous post:

Read next post:

Alibaba Cloud Native Community

You may also like

Comments

Alibaba Cloud Native Community

Related Products

Cloud-Native Applications Management Solution

Function Compute

Managed Service for Prometheus

Container Service for Kubernetes