
Container Service for Kubernetes: Topology-aware CPU scheduling

Last Updated: May 16, 2024

Container Service for Kubernetes (ACK) provides the topology-aware CPU scheduling feature based on the new Kubernetes scheduling framework. This feature can improve the performance of CPU-sensitive workloads. This topic describes how to enable topology-aware CPU scheduling.

How it works

Multiple pods can run on a node in a Kubernetes cluster, and some of them may belong to CPU-intensive workloads. In this case, the pods compete for CPU resources. When the competition becomes intense, the set of CPU cores that is allocated to each pod may change frequently. This situation intensifies on Non-Uniform Memory Access (NUMA) nodes, and the frequent changes degrade workload performance. The Kubernetes CPU manager provides a CPU scheduling solution that mitigates this issue within a node. However, the CPU manager cannot find the optimal way to allocate CPU cores at the cluster level. In addition, the CPU manager works only on guaranteed pods, in which each container is configured with a CPU request and a CPU limit that are set to the same value.
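
For reference, the following minimal pod spec is a guaranteed pod: every container sets its requests equal to its limits for all resources. The pod name and image below are placeholders.

    apiVersion: v1
    kind: Pod
    metadata:
      name: guaranteed-demo # Placeholder name.
    spec:
      containers:
      - name: app
        image: nginx:latest # Placeholder image.
        resources:
          requests:
            cpu: 2        # Requests equal limits for every resource...
            memory: 4Gi
          limits:
            cpu: 2        # ...so the QoS class of the pod is Guaranteed.
            memory: 4Gi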

Topology-aware CPU scheduling applies to the following scenarios:

  • The workload is compute-intensive.

  • The application is CPU-sensitive.

  • The workload runs on multi-core Elastic Compute Service (ECS) bare metal instances with Intel CPUs or AMD CPUs.

    To test topology-aware CPU scheduling, stress tests are performed on two NGINX applications, each of which requests 4 CPU cores and 8 GB of memory. One application is deployed on an ECS Bare Metal instance with 104 Intel CPU cores, and the other is deployed on an ECS Bare Metal instance with 256 AMD CPU cores. The results show that application performance improves by 22% to 43% when topology-aware CPU scheduling is enabled. The following table shows the details.

    Performance metric    Intel               AMD
    QPS                   Improved by 22.9%   Improved by 43.6%
    AVG RT                Reduced by 26.3%    Reduced by 42.5%

    Important

    Different applications have varying sensitivities to the CPU core binding policy, so the data from the preceding tests is for reference only. We recommend that you deploy your application in a stable environment, adjust the stress level based on the device type and other environmental factors so that the application runs normally, and then compare the performance statistics to evaluate whether to enable topology-aware CPU scheduling.

When you enable topology-aware CPU scheduling, you can set cpu-policy to static-burst in the template.metadata.annotations section of the Deployment object or in the metadata.annotations section of the Pod object to adjust the automatic CPU core binding policy. The policy is suitable for compute-intensive workloads. It efficiently reduces CPU core contention among processes and memory access across NUMA nodes, maximizes the utilization of fragmented CPU resources, and optimizes resource allocation for compute-intensive workloads without modifying the hardware or VM resources. This further improves CPU utilization.
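
For reference, the following snippet shows how the two settings appear together in the pod template of a Deployment. It mirrors the examples later in this topic:

    template:
      metadata:
        annotations:
          cpuset-scheduler: "true"   # Enable topology-aware CPU scheduling.
          cpu-policy: 'static-burst' # Adjust the automatic CPU core binding policy.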

For more information about how topology-aware CPU scheduling is implemented, see Practice of Fine-grained Cgroups Resources Scheduling in Kubernetes.

Prerequisites

  • An ACK Pro cluster is created. For more information, see Create an ACK Pro cluster.

  • ack-koordinator (FKA ack-slo-manager) is installed. For more information, see ack-koordinator (FKA ack-slo-manager).

    Note

    ack-koordinator is upgraded and optimized based on resource-controller. You must uninstall resource-controller after you install ack-koordinator. For more information about how to uninstall resource-controller, see resource-controller.

Limits

The following table describes the versions that are required for the system components.

Component         Required version
Kubernetes        ≥ 1.18
ack-koordinator   ≥ 0.2.0
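
You can check the Kubernetes version with kubectl. The second command below is a sketch that lists the pods of ack-koordinator in the kube-system namespace; the exact pod names depend on your installation and are an assumption here.

    # Check the cluster version (must be 1.18 or later).
    kubectl version
    # List the ack-koordinator pods. Pod names vary by installation (assumption).
    kubectl get pods -n kube-system | grep koord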

Billing

No fee is charged when you install and use the ack-koordinator component. However, fees may be charged in the following scenarios:

  • ack-koordinator is a non-managed component that occupies worker node resources after it is installed. You can specify the amount of resources requested by each module when you install the component.

  • By default, ack-koordinator exposes the monitoring metrics of features such as resource profiling and fine-grained scheduling as Prometheus metrics. If you enable Prometheus metrics for ack-koordinator and use Managed Service for Prometheus, these metrics are considered custom metrics and fees are charged for them. The fee depends on factors such as the size of your cluster and the number of applications. Before you enable Prometheus metrics, we recommend that you read the Billing topic of Managed Service for Prometheus to learn the free quota and billing rules of custom metrics. For more information about how to monitor and manage resource usage, see Query the amount of observable data and bills.

Usage notes

  • Before you enable topology-aware CPU scheduling, make sure that ack-koordinator is deployed.

  • When you enable topology-aware CPU scheduling, make sure that cpu-policy=none is configured for the nodes. You can check the kubelet setting as shown in the sketch after this list.

  • To limit pod scheduling, add the nodeSelector parameter.

    Important

    Do not add the nodeName parameter. The pod scheduler cannot parse nodeName when topology-aware CPU scheduling is enabled.
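
To check the CPU policy of a node, you can inspect the kubelet configuration on the node. The following sketch assumes the default kubelet configuration file path, which may differ in your environment:

    # Run on the node. No output or a value of "none" means the default policy is used.
    grep cpuManagerPolicy /var/lib/kubelet/config.yaml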

Enable topology-aware CPU scheduling

To enable topology-aware CPU scheduling, configure the following annotations and containers parameters of your pods, and then perform the steps below.

  • Set cpuset-scheduler to true in the template.metadata.annotations section of the Deployment object or in the metadata.annotations section of the Pod object to enable topology-aware CPU scheduling.

  • Set the resources.limits.cpu parameter in the containers section to an integer.

  1. Create a file named go-demo.yaml based on the following content and configure the Deployment to use topology-aware CPU scheduling.

    Important
    • You need to configure pod annotations in the template.metadata section of the Deployment.

    • When you configure topology-aware CPU scheduling, you can set cpu-policy to static-burst in the annotations section to adjust the automatic CPU core binding policy. To use the setting, delete the number sign (#) before cpu-policy.


    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: go-demo
    spec:
      replicas: 4
      selector:
        matchLabels:
          app: go-demo
      template:
        metadata:
          annotations:
            cpuset-scheduler: "true" # Enable topology-aware CPU scheduling. 
            #cpu-policy: 'static-burst' # Optional: delete the leading # to adjust the automatic CPU core binding policy. 
          labels:
            app: go-demo
        spec:
          containers:
          - name: go-demo
            image: registry.cn-hangzhou.aliyuncs.com/polinux/stress/go-demo:1k
            imagePullPolicy: Always
            ports:
            - containerPort: 8080
            resources:
              requests:
                cpu: 1
              limits: 
                cpu: 4  # Set resources.limits.cpu to an integer.

  2. Run the following command to create a Deployment:

    kubectl create -f go-demo.yaml
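
    To confirm that the pods are created and scheduled, you can list them by the app=go-demo label from the preceding YAML:

    kubectl get pods -l app=go-demo -o wide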

Verify topology-aware CPU scheduling

In this example, the following conditions apply:

  • The Kubernetes version of the ACK Pro cluster is 1.20.

  • Two cluster nodes are used in the test. One is used as the load generator. The other runs the workloads and serves as the tested machine.

Important

The following deployment and stress testing commands are for reference only. You can adjust the resource specifications and request stress according to your experimental environment to ensure that the application is in a normal state before proceeding with the experiment.

  1. Run the following command to add a label to the tested machine. The label value must match the nodeSelector in the Deployment YAML that is used later (intel7 in this example):

     kubectl label node 192.168.XX.XX policy=intel7
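
     To confirm that the label is added, you can run the following command:

     kubectl get node 192.168.XX.XX --show-labels | grep policy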
  2. Deploy the NGINX service on the tested machine.

    1. Use the following YAML templates to create resources for the NGINX service:

      service.yaml:

      apiVersion: v1
      kind: Service
      metadata:
        name: nginx-service-nodeport
      spec:
        selector:
            app: nginx
        ports:
          - name: http
            port: 8000
            protocol: TCP
            targetPort: 80
            nodePort: 32257
        type: NodePort

      configmap.yaml:

      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: nginx-configmap
      data:
        nginx_conf: |-
          user  nginx;
          worker_processes  4;
          error_log  /var/log/nginx/error.log warn;
          pid        /var/run/nginx.pid;
          events {
              worker_connections  65535;
          }
          http {
              include       /etc/nginx/mime.types;
              default_type  application/octet-stream;
              log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                            '$status $body_bytes_sent "$http_referer" '
                            '"$http_user_agent" "$http_x_forwarded_for"';
              access_log  /var/log/nginx/access.log  main;
              sendfile        on;
              #tcp_nopush     on;
              keepalive_timeout  65;
              #gzip  on;
              include /etc/nginx/conf.d/*.conf;
          }

      nginx.yaml:

      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: nginx-deployment
        labels:
          app: nginx
      spec:
        replicas: 2
        selector:
          matchLabels:
            app: nginx
        template:
          metadata:
            annotations:
              #cpuset-scheduler: "true" # Topology-aware CPU scheduling is disabled by default. Delete the leading # to enable it. 
            labels:
              app: nginx
          spec:
            nodeSelector:
              policy: intel7
            containers:
            - name: nginx
              image: nginx:latest
              ports:
              - containerPort: 80
              resources:
                requests:
                  cpu: 4
                  memory: 8Gi
                limits:
                  cpu: 4
                  memory: 8Gi
              volumeMounts:
                 - mountPath: /etc/nginx/nginx.conf
                   name: nginx
                   subPath: nginx.conf
            volumes:
              - name: nginx
                configMap:
                  name: nginx-configmap
                  items:
                    - key: nginx_conf
                      path: nginx.conf
    2. Run the following command to create the resources that are provisioned for the NGINX service:

      kubectl create -f service.yaml
      kubectl create -f configmap.yaml
      kubectl create -f nginx.yaml
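
      To confirm that the resources are created, you can run the following commands:

      kubectl get service nginx-service-nodeport
      kubectl get configmap nginx-configmap
      kubectl get deployment nginx-deployment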
  3. Log on to the load generator, download the wrk2 open source stress test tool, and decompress the package. For more information, see wrk2 official site.

    Note

    For more information about how to log on to a node, see Connect to an instance by using VNC or Connect to a Windows instance by using a username and password.

  4. Run the following command to perform stress tests and record the test data:

    wrk --timeout 2s -t 20 -c 100 -d 60s --latency http://<IP address of the tested machine>:32257

    Expected output:

    20 threads and 100 connections
    Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   600.58us    3.07ms 117.51ms   99.74%
    Req/Sec    10.67k     2.38k   22.33k    67.79%
    Latency Distribution
    50%  462.00us
    75%  680.00us
    90%  738.00us
    99%    0.90ms
    12762127 requests in 1.00m, 10.10GB read
    Requests/sec: 212350.15
    Transfer/sec:    172.13MB
  5. Run the following command to delete the NGINX Deployment:

    kubectl delete deployment nginx-deployment

    Expected output:

    deployment "nginx" deleted
  6. Use the following YAML template to deploy an NGINX Deployment with topology-aware CPU scheduling enabled:


    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nginx-deployment
      labels:
        app: nginx
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: nginx
      template:
        metadata:
          annotations:
            cpuset-scheduler: "true"
          labels:
            app: nginx
        spec:
          nodeSelector:
            policy: intel7
          containers:
          - name: nginx
            image: nginx:latest
            ports:
            - containerPort: 80
            resources:
              requests:
                cpu: 4
                memory: 8Gi
              limits:
                cpu: 4
                memory: 8Gi
            volumeMounts:
               - mountPath: /etc/nginx/nginx.conf
                 name: nginx
                 subPath: nginx.conf
          volumes:
            - name: nginx
              configMap:
                name: nginx-configmap
                items:
                  - key: nginx_conf
                    path: nginx.conf
  7. Run the following command to perform stress tests and record the test data for comparison:

    wrk --timeout 2s -t 20 -c 100 -d 60s --latency http://<IP address of the tested machine>:32257

    Expected output:

    20 threads and 100 connections
    Thread Stats   Avg         Stdev     Max       +/- Stdev
    Latency            345.79us    1.02ms    82.21ms   99.93%
    Req/Sec            15.33k      2.53k     25.84k    71.53%
    Latency Distribution
    50%  327.00us
    75%  444.00us
    90%  479.00us
    99%  571.00us
    18337573 requests in 1.00m, 14.52GB read
    Requests/sec: 305119.06
    Transfer/sec:    247.34MB

    Compare the data of the preceding tests. The comparison indicates that the throughput of the NGINX service improves by approximately 43% (from 212,350 to 305,119 requests per second) after topology-aware CPU scheduling is enabled.

Verify that the automatic CPU core binding policy improves performance

In this example, the automatic CPU core binding policy is configured for a workload that runs on a node with 64 CPU cores. After you configure the policy for an application that has topology-aware CPU scheduling enabled, CPU utilization can be further improved by 7% to 8%.

Important

The following deployment and stress testing commands are for reference only. You can adjust the resource specifications according to your own experimental environment to ensure that the application is in a normal state before proceeding with the experiment.

  1. Run the following command to query the pods:

    kubectl get pods | grep cal-pi

    Expected output:

    NAME                 READY     STATUS    RESTARTS   AGE
    cal-pi-d****         1/1       Running   1          9h
  2. Run the following command to query the log of the cal-pi-d**** application:

    kubectl logs cal-pi-d****

    Expected output:

    computing Pi with 3000 Threads...computed the first 20000 digets of pi in 620892 ms! 
    the first digets are: 3.14159264
    writing to pi.txt...
    finished!
  3. Use topology-aware CPU scheduling.

    Configure the Deployment to use topology-aware CPU scheduling and configure the automatic CPU core binding policy. For more information, see Enable topology-aware CPU scheduling.

    1. Create a file named go-demo.yaml based on the following content and configure the Deployment to use topology-aware CPU scheduling.

      Important

      You need to configure pod annotations in the template.metadata section of the Deployment.


      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: go-demo
      spec:
        replicas: 4
        selector:
          matchLabels:
            app: go-demo
        template:
          metadata:
            annotations:
              cpuset-scheduler: "true" # Enable topology-aware CPU scheduling. 
              cpu-policy: 'static-burst' # Configure the automatic CPU core binding policy and improve the utilization of fragmented CPU resources. 
            labels:
              app: go-demo
          spec:
            containers:
            - name: go-demo
              image: registry.cn-hangzhou.aliyuncs.com/polinux/stress/go-demo:1k
              imagePullPolicy: Always
              ports:
              - containerPort: 8080
              resources:
                requests:
                  cpu: 1
                limits: 
                  cpu: 4  # Set resources.limits.cpu to an integer.

    2. Run the following command to create a Deployment:

      kubectl create -f go-demo.yaml
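
      To confirm that the annotations are applied to the pods, you can run the following command:

      kubectl get pod -l app=go-demo -o jsonpath='{.items[0].metadata.annotations}'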
  4. Run the following command to query the pods:

    kubectl get pods | grep go-demo

    Expected output:

    NAME                 READY     STATUS    RESTARTS   AGE
    go-demo-e****        1/1       Running   1          9h
  5. Run the following command to query the log of the go-demo-e**** application:

    kubectl logs go-demo-e****

    Expected output:

    computing Pi with 3000 Threads...computed the first 20000 digets of pi in 571221 ms!
    the first digets are: 3.14159264
    writing to pi.txt...
    finished!

    Compare the log data with the log data in Step 2. The computation time decreases from 620,892 ms to 571,221 ms, which indicates that the performance of the pod that is configured with the CPU core binding policy improves by approximately 8%.

References

Kubernetes is unaware of the topology of GPU resources on nodes. Therefore, Kubernetes schedules GPU resources in a random manner. As a result, the GPU acceleration for training jobs considerably varies based on the scheduling results of GPU resources. To avoid this situation, ACK supports topology-aware GPU scheduling based on the scheduling framework of Kubernetes. You can use this feature to select multiple GPUs from GPU-accelerated nodes to achieve optimal GPU acceleration for training jobs. For more information, see Topology-aware GPU scheduling.