When multiple pods run on the same node, the pods compete for CPU resources. The CPU cores that are allocated to each pod may frequently change, leading to performance jitter. For performance-sensitive applications, you can enable the topology-aware CPU scheduling feature to pin pods to the CPU cores on the node. This approach reduces performance issues caused by CPU context switching and memory access across Non-Uniform Memory Access (NUMA) nodes.
To better understand and use this feature, we recommend that you refer to the official Kubernetes documentation to learn about pod QoS classes, assigning memory resources to containers and pods, and the CPU management policies on nodes, including the none policy and the static policy.
Scenarios
In a Kubernetes cluster, multiple pods may share CPU cores on the same node. However, in the following scenarios, some applications may need to be pinned to specific CPU cores:
Applications that are not adapted to cloud-native scenarios. For example, the number of threads is specified based on the total physical cores of the device instead of the container specifications. As a result, application performance degrades.
Applications that run on multi-core ECS Bare Metal instances with Intel CPUs or AMD CPUs and experience performance degradation due to memory access across NUMA nodes.
Applications that are highly sensitive to CPU context switching and cannot tolerate performance jitter.
To address these concerns, ACK supports topology-aware CPU scheduling based on the new scheduling framework of Kubernetes. You can enable this feature through pod annotations to optimize the performance of CPU-sensitive workloads.
Topology-aware CPU scheduling overcomes the limitations of the CPU Manager provided by Kubernetes. The CPU Manager addresses CPU affinity by applying the static policy to applications that have high CPU affinity and performance requirements. This allows these applications to exclusively use specific CPU cores on a node and ensures stable computing resources. However, the CPU Manager provides only a node-level CPU scheduling solution and cannot find the optimal way to allocate CPU cores at the cluster level. In addition, the static policy configured through the CPU Manager affects only Guaranteed pods and does not apply to other pod types, such as Burstable and BestEffort pods. In a Guaranteed pod, each container is configured with both a CPU request and a CPU limit, and the two values are identical.
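For reference, the following is a minimal sketch of a pod that Kubernetes classifies as Guaranteed because every container declares CPU and memory requests that equal the corresponding limits. The name, image, and resource values are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: qos-guaranteed-demo      # Hypothetical name used for illustration.
spec:
  containers:
  - name: app
    image: nginx:1.25            # Example image; the QoS class depends only on the resource settings.
    resources:
      requests:
        cpu: "2"                 # Requests equal limits for every resource ...
        memory: 4Gi
      limits:
        cpu: "2"                 # ... so the pod receives the Guaranteed QoS class.
        memory: 4Gi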
Prerequisites
An ACK Pro cluster has been created, and the CPU policy of the node pool is set to None. For more information, see Create an ACK Pro cluster.
The ack-koordinator component has been installed, and the component version is 0.2.0 or later. For more information, see ack-koordinator.
Billing rules
No fee is charged when you install and use the ack-koordinator component. However, fees may be charged in the following scenarios:
ack-koordinator is a non-managed component that occupies worker node resources after it is installed. You can specify the amount of resources requested by each module when you install the component.
By default, ack-koordinator exposes the monitoring metrics of features such as resource profiling and fine-grained scheduling as Prometheus metrics. If you enable Prometheus metrics for ack-koordinator and use Managed Service for Prometheus, these metrics are considered custom metrics and fees are charged for these metrics. The fee depends on factors such as the size of your cluster and the number of applications. Before you enable Prometheus metrics, we recommend that you read the Billing overview topic of Managed Service for Prometheus to learn about the free quota and billing rules of custom metrics. For more information about how to monitor and manage resource usage, see Resource usage and bills.
Procedure
This topic uses an NGINX application as an example to demonstrate how to enable topology-aware CPU scheduling to implement processor affinity.
Step 1: Deploy a sample application
Use the following YAML template to deploy an NGINX application:
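A minimal sketch of such a template is shown below, assuming an NGINX Deployment with a CPU limit of 4 cores so that it matches the verification output later in this topic. The names, image tag, and resource values are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-demo               # Hypothetical name used for illustration.
  labels:
    app: nginx-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx-demo
  template:
    metadata:
      labels:
        app: nginx-demo
    spec:
      containers:
      - name: nginx
        image: nginx:1.25        # Example image.
        resources:
          requests:
            cpu: "2"             # A request lower than the limit keeps the pod Burstable,
          limits:                # which matches the kubepods-burstable cgroup path shown below.
            cpu: "4"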
On the node where the pod is deployed, run the following command to view the CPU cores that are bound to the container:
# The path can be obtained by concatenating the pod UID and the container ID.
cat /sys/fs/cgroup/cpuset/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podf9b79bee_eb2a_4b67_befe_51c270f8****.slice/cri-containerd-aba883f8b3ae696e99c3a920a578e3649fa957c51522f3fb00ca943dc2c7****.scope/cpuset.cpus
Expected output:
# The output shows that the CPU cores available to the container range from 0 to 31 before you bind CPU cores to the container.
0-31
Step 2: Enable topology-aware CPU scheduling
You can enable topology-aware CPU scheduling through pod annotations to implement processor affinity.
Note: When you use topology-aware CPU scheduling, do not specify nodeName in the pod spec, because kube-scheduler is not involved in scheduling pods that specify nodeName. Instead, use fields such as nodeSelector or node affinity to control the nodes to which the pod can be scheduled, as shown in the sketch below.
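The following sketch shows a pod that uses nodeSelector instead of nodeName. It assumes a custom node label named topology-aware that you have applied to the target nodes; the pod name and image are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: cpuset-demo                # Hypothetical name used for illustration.
  annotations:
    cpuset-scheduler: "true"       # Enables topology-aware CPU scheduling.
spec:
  # Do not set spec.nodeName; pods that specify nodeName bypass kube-scheduler.
  nodeSelector:
    topology-aware: "enabled"      # Assumed custom label; label the target nodes accordingly.
  containers:
  - name: app
    image: nginx:1.25
    resources:
      limits:
        cpu: "4"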
Standard CPU core binding
You can enable topology-aware CPU scheduling through the pod annotation cpuset-scheduler, and the system implements processor affinity for you.
1. In the metadata.annotations field of the pod YAML file, set cpuset-scheduler to true to enable topology-aware CPU scheduling.
Note: To apply the configuration to a workload, such as a Deployment, set the annotation for the pod in the template.metadata field.
2. In the containers field, set resources.limits.cpu to an integer value to limit the number of CPU cores.
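Putting both settings together, a sketch of the Deployment from Step 1 with standard CPU core binding enabled might look as follows. The annotation and field names come from this topic; the other names and values are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-demo
  labels:
    app: nginx-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx-demo
  template:
    metadata:
      labels:
        app: nginx-demo
      annotations:
        cpuset-scheduler: "true"   # Enables topology-aware CPU scheduling for the pod.
    spec:
      containers:
      - name: nginx
        image: nginx:1.25
        resources:
          requests:
            cpu: "2"
          limits:
            cpu: "4"               # Integer value; the container is bound to 4 CPU cores.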
Automatic CPU core binding
You can enable topology-aware CPU scheduling and the automatic CPU core binding policy through annotations at the same time. After configuration, the scheduler will automatically determine the number of bound CPU cores based on the pod specifications while attempting to avoid cross-NUMA memory access.
1. In the metadata.annotations field of the pod YAML file, set cpuset-scheduler to true and cpu-policy to static-burst to enable automatic CPU core binding.
Note: To apply the configuration to a workload, such as a Deployment, set the annotations for the pod in the template.metadata field.
2. In the containers field, set resources.limits.cpu to an integer value, which serves as the reference upper limit of the number of CPU cores that can be bound.
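A sketch of a pod with automatic CPU core binding enabled, using the annotations described above; the name, image, and CPU value are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: nginx-static-burst         # Hypothetical name used for illustration.
  annotations:
    cpuset-scheduler: "true"       # Enables topology-aware CPU scheduling.
    cpu-policy: "static-burst"     # The scheduler chooses the cores to bind and tries to avoid cross-NUMA memory access.
spec:
  containers:
  - name: nginx
    image: nginx:1.25
    resources:
      limits:
        cpu: "4"                   # Integer value used as the reference upper limit of bound cores.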
Result verification
Take the standard CPU core binding as an example to verify whether topology-aware CPU scheduling is successfully enabled. The verification process for the automatic CPU core binding is similar.
After you enable the standard CPU core binding, run the following command on the node where the pod is deployed to view the CPU cores that are bound to the container:
# The path can be obtained by concatenating the Pod UID and the Container ID.
cat /sys/fs/cgroup/cpuset/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podf9b79bee_eb2a_4b67_befe_51c270f8****.slice/cri-containerd-aba883f8b3ae696e99c3a920a578e3649fa957c51522f3fb00ca943dc2c7****.scope/cpuset.cpus
Expected output:
# The number of CPU cores in the output is the same as the CPU limit of the container.
0-3
The expected output indicates that the container can use CPU cores 0 to 3. The number of available CPU cores is consistent with the resources.limits.cpu value declared in the YAML file.
References
Kubernetes is unaware of the topology of GPU resources on nodes. Therefore, Kubernetes schedules GPU resources in a random manner. As a result, GPU acceleration for training jobs varies considerably based on the scheduling results of GPU resources. We recommend that you enable topology-aware GPU scheduling to achieve optimal GPU acceleration for training jobs. For more information, see Topology-aware GPU scheduling.
You can quantify the resources that are allocated to pods but are not in use and schedule these resources to low-priority jobs to achieve resource overcommitment. For more information, see Enable dynamic resource overcommitment.