When multiple pods run on the same node, the pods compete for CPU resources. The CPU cores that are allocated to each pod may frequently change, leading to performance jitter. For performance-sensitive applications, you can enable the topology-aware CPU scheduling feature to pin pods to the CPU cores on the node. This approach reduces performance issues caused by CPU context switching and memory access across Non-Uniform Memory Access (NUMA) nodes.
To better understand and use this feature, we recommend that you refer to the official Kubernetes documentation to learn about pod QoS classes, assigning memory resources to containers and pods, and the CPU management policies on nodes, including the none policy and the static policy.
Scenarios
In a Kubernetes cluster, multiple pods may share CPU cores on the same node. However, in the following scenarios, some applications may need to be pinned to specific CPU cores:
Applications that are not adapted to cloud-native scenarios. For example, the number of threads is specified based on the total physical cores of the device instead of the container specifications. As a result, application performance degrades.
Applications that run on multi-core ECS Bare Metal instances with Intel CPUs or AMD CPUs and experience performance degradation due to memory access across NUMA nodes.
Applications that are highly sensitive to CPU context switching and cannot tolerate performance jitter.
To address these concerns, ACK supports topology-aware CPU scheduling based on the new scheduling framework of Kubernetes. You can enable this feature through pod annotations to optimize the performance of CPU-sensitive workloads.
Topology-aware CPU scheduling overcomes the limitations of the CPU Manager provided by Kubernetes. The CPU Manager addresses CPU affinity by applying the static policy to applications that have high CPU affinity and performance requirements. This allows these applications to exclusively use specific CPU cores on a node and ensures stable computing resources. However, the CPU Manager provides only a node-level CPU scheduling solution and cannot find the optimal way to allocate CPU cores at the cluster level. In addition, the static policy configured through the CPU Manager affects only Guaranteed pods and does not apply to other pod types, such as Burstable and BestEffort pods. In a Guaranteed pod, each container is configured with both a CPU request and a CPU limit, and the two values are identical.
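For reference, the following is a minimal sketch of a pod that Kubernetes classifies as Guaranteed because every container declares CPU and memory requests that equal the corresponding limits. The name, image, and resource values are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: qos-guaranteed-demo      # Hypothetical name used for illustration.
spec:
  containers:
  - name: app
    image: nginx:1.25            # Example image; the QoS class depends only on the resource settings.
    resources:
      requests:
        cpu: "2"                 # Requests equal limits for every resource ...
        memory: 4Gi
      limits:
        cpu: "2"                 # ... so the pod receives the Guaranteed QoS class.
        memory: 4Gi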
Prerequisites
An ACK Pro cluster has been created, and the CPU policy of the node pool is set to None. For more information, see Create an ACK Pro cluster.
The ack-koordinator component has been installed, and the component version is 0.2.0 or later. For more information, see ack-koordinator.
Billing rules
No fee is charged when you install and use the ack-koordinator component. However, fees may be charged in the following scenarios:
ack-koordinator is a non-managed component that occupies worker node resources after it is installed. You can specify the amount of resources requested by each module when you install the component.
By default, ack-koordinator exposes the monitoring metrics of features such as resource profiling and fine-grained scheduling as Prometheus metrics. If you enable Prometheus metrics for ack-koordinator and use Managed Service for Prometheus, these metrics are considered custom metrics and fees are charged for these metrics. The fee depends on factors such as the size of your cluster and the number of applications. Before you enable Prometheus metrics, we recommend that you read the Billing overview topic of Managed Service for Prometheus to learn about the free quota and billing rules of custom metrics. For more information about how to monitor and manage resource usage, see Resource usage and bills.
Procedure
This topic uses an NGINX application as an example to demonstrate how to enable topology-aware CPU scheduling to implement processor affinity.
Step 1: Deploy a sample application
Use the following YAML template to deploy an NGINX application:
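A minimal sketch of such a template is shown below, assuming an NGINX Deployment with a CPU limit of 4 cores so that it matches the verification output later in this topic. The names, image tag, and resource values are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-demo               # Hypothetical name used for illustration.
  labels:
    app: nginx-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx-demo
  template:
    metadata:
      labels:
        app: nginx-demo
    spec:
      containers:
      - name: nginx
        image: nginx:1.25        # Example image.
        resources:
          requests:
            cpu: "2"             # A request lower than the limit keeps the pod Burstable,
          limits:                # which matches the kubepods-burstable cgroup path shown below.
            cpu: "4"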
On the node where the pod is deployed, run the following command to view the CPU cores that are bound to the container:
# The path can be obtained by concatenating the pod UID and the container ID.
cat /sys/fs/cgroup/cpuset/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podf9b79bee_eb2a_4b67_befe_51c270f8****.slice/cri-containerd-aba883f8b3ae696e99c3a920a578e3649fa957c51522f3fb00ca943dc2c7****.scope/cpuset.cpus
Expected output:
# The output shows that the CPU cores available to the container range from 0 to 31 before you bind CPU cores to the container.
0-31
Step 2: Enable topology-aware CPU scheduling
You can enable topology-aware CPU scheduling through pod annotations to implement processor affinity.
Note: When you use topology-aware CPU scheduling, do not specify nodeName in the pod spec, because kube-scheduler is not involved in scheduling pods that specify nodeName. Instead, use fields such as nodeSelector or node affinity to control the nodes to which the pod can be scheduled, as shown in the sketch below.
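The following sketch shows a pod that uses nodeSelector instead of nodeName. It assumes a custom node label named topology-aware that you have applied to the target nodes; the pod name and image are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: cpuset-demo                # Hypothetical name used for illustration.
  annotations:
    cpuset-scheduler: "true"       # Enables topology-aware CPU scheduling.
spec:
  # Do not set spec.nodeName; pods that specify nodeName bypass kube-scheduler.
  nodeSelector:
    topology-aware: "enabled"      # Assumed custom label; label the target nodes accordingly.
  containers:
  - name: app
    image: nginx:1.25
    resources:
      limits:
        cpu: "4"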
Standard CPU core binding
You can enable topology-aware CPU scheduling through the pod annotation cpuset-scheduler, and the system implements processor affinity for you.
1. In the metadata.annotations field of the pod YAML file, set cpuset-scheduler to true to enable topology-aware CPU scheduling.
Note: To apply the configuration to a workload, such as a Deployment, set the annotation for the pod in the template.metadata field.
2. In the containers field, set resources.limits.cpu to an integer value to limit the number of CPU cores.
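Putting both settings together, a sketch of the Deployment from Step 1 with standard CPU core binding enabled might look as follows. The annotation and field names come from this topic; the other names and values are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-demo
  labels:
    app: nginx-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx-demo
  template:
    metadata:
      labels:
        app: nginx-demo
      annotations:
        cpuset-scheduler: "true"   # Enables topology-aware CPU scheduling for the pod.
    spec:
      containers:
      - name: nginx
        image: nginx:1.25
        resources:
          requests:
            cpu: "2"
          limits:
            cpu: "4"               # Integer value; the container is bound to 4 CPU cores.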
Automatic CPU core binding
You can enable topology-aware CPU scheduling and the automatic CPU core binding policy through annotations at the same time. After configuration, the scheduler will automatically determine the number of bound CPU cores based on the pod specifications while attempting to avoid cross-NUMA memory access.
1. In the metadata.annotations field of the pod YAML file, set cpuset-scheduler to true and cpu-policy to static-burst to enable automatic CPU core binding.
Note: To apply the configuration to a workload, such as a Deployment, set the annotations for the pod in the template.metadata field.
2. In the containers field, set resources.limits.cpu to an integer value, which serves as the reference upper limit of the number of CPU cores that can be bound.
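A sketch of a pod with automatic CPU core binding enabled, using the annotations described above; the name, image, and CPU value are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: nginx-static-burst         # Hypothetical name used for illustration.
  annotations:
    cpuset-scheduler: "true"       # Enables topology-aware CPU scheduling.
    cpu-policy: "static-burst"     # The scheduler chooses the cores to bind and tries to avoid cross-NUMA memory access.
spec:
  containers:
  - name: nginx
    image: nginx:1.25
    resources:
      limits:
        cpu: "4"                   # Integer value used as the reference upper limit of bound cores.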
Result verification
Take the standard CPU core binding as an example to verify whether topology-aware CPU scheduling is successfully enabled. The verification process for the automatic CPU core binding is similar.
After you enable the standard CPU core binding, run the following command on the node where the pod is deployed to view the CPU cores that are bound to the container:
# The path can be obtained by concatenating the Pod UID and the Container ID.
cat /sys/fs/cgroup/cpuset/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podf9b79bee_eb2a_4b67_befe_51c270f8****.slice/cri-containerd-aba883f8b3ae696e99c3a920a578e3649fa957c51522f3fb00ca943dc2c7****.scope/cpuset.cpus
Expected output:
# The number of CPU cores in the output is the same as the CPU limit of the container.
0-3
The expected output indicates that the container can use CPU cores 0 to 3. The number of available CPU cores is consistent with the resources.limits.cpu value declared in the YAML file.
References
Kubernetes is unaware of the topology of GPU resources on nodes. Therefore, Kubernetes schedules GPU resources in a random manner. As a result, GPU acceleration for training jobs varies considerably based on the scheduling results of GPU resources. We recommend that you enable topology-aware GPU scheduling to achieve optimal GPU acceleration for training jobs. For more information, see Topology-aware GPU scheduling.
You can quantify the resources that are allocated to pods but are not in use and schedule these resources to low-priority jobs to achieve resource overcommitment. For more information, see Enable dynamic resource overcommitment.