Kubernetes allows you to specify the resource requests and limits of containers. The memory available to an application depends on various factors, such as page cache reclamation and excessive memory consumption by other applications. Insufficient memory on a node degrades the performance of the applications on the node, and in extreme cases causes out of memory (OOM) errors. The ack-koordinator component provides the memory quality of service (QoS) feature for containers. You can use the component to assign different QoS classes to different containers based on your business requirements. This allows you to prioritize the memory requests of applications with high QoS classes while ensuring fair memory allocation.
To help you better understand and use the memory QoS feature, we recommend that you first read the following topics in the Kubernetes official documentation: Pod Quality of Service Classes and Assign Memory Resources to Containers and Pods.
Feature introduction
Why memory QoS?
To ensure that pods can efficiently and securely run in Kubernetes clusters, Kubernetes allows you to specify the resource requests and limits of pods. The memory request and limit of a pod are described as follows:
Memory request (requests.memory): The memory request of a pod takes effect during the scheduling process of the pod. The system schedules the pod to a node that meets the memory request of the pod.
Memory limit (limits.memory): The memory limit of a pod limits the amount of memory that the pod can use on the node. The memory.limit_in_bytes parameter in the cgroup file specifies the upper limit of memory that can be used by the pod.
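For reference, the following is a minimal sketch of how these two values appear in a pod spec. The pod name, container name, and image are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: demo          # placeholder name
spec:
  containers:
  - name: app
    image: nginx      # placeholder image
    resources:
      requests:
        memory: "2Gi" # used by the scheduler to place the pod
      limits:
        memory: "6Gi" # enforced through memory.limit_in_bytes in the cgroup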
The memory usage of a container depends on the memory limit of the container and the memory capacity of the node:
Container memory limit: If the amount of memory that a container uses, including the page cache, is about to reach the memory limit of the container, memory control group (memcg)-level direct memory reclamation is triggered for the pod. As a result, the processes in the pod are blocked. In this case, if the pod applies for memory at a faster rate than the memory is reclaimed, an OOM error occurs and the pod is terminated.
Node memory capacity: The memory limit of a container can be greater than the memory request of the container. When multiple containers are deployed on a node, the sum of the memory limits of the containers may exceed the memory capacity of the node. If the overall memory usage on a node is excessively high, the OS kernel may reclaim memory from containers. As a result, the performance of your application is degraded. In extreme cases, OOM errors occur due to insufficient memory on the node, and your application is terminated.
Feature description
To improve application performance and node stability, ack-koordinator provides the memory QoS feature for containers that run on different Alibaba Cloud Linux kernel versions. ack-koordinator automatically configures the memcg based on the configuration of the container to enable other features such as Memcg QoS, Memcg backend asynchronous reclamation, and Memcg global minimum watermark rating. This optimizes the performance of memory-sensitive applications while ensuring fair memory scheduling among containers.
Memory reclamation and memory lock policies
The memory QoS feature involves the following cgroup parameters:
memory.limit_in_bytes: the upper limit of memory that can be used by a pod.
memory.high: the memory throttling threshold. The OS kernel reclaims memory to prevent the memory usage from exceeding this value.
memory.wmark_high: the memory reclamation threshold (wmarkRatio). Asynchronous reclamation is performed on reclaimable memory to keep memory usage below this threshold.
memory.min: the memory lock threshold. You can configure the absolute lock threshold (minLimitPercent) and the relative lock threshold (lowLimitPercent).
For more information about the preceding parameters, see Advanced parameters.
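These values can be inspected directly on a node. The following is a minimal sketch; it assumes Alibaba Cloud Linux, which exposes these memcg interfaces under cgroup v1, and the pod cgroup path is a placeholder that depends on the cgroup driver, the pod's QoS class, and the pod UID:

# Placeholder: substitute the actual cgroup directory of your pod.
POD_CGROUP=/sys/fs/cgroup/memory/kubepods.slice/kubepods-pod<pod-uid>.slice
cat $POD_CGROUP/memory.limit_in_bytes   # upper limit of memory that the pod can use
cat $POD_CGROUP/memory.high             # memory throttling threshold
cat $POD_CGROUP/memory.wmark_high       # asynchronous reclamation threshold
cat $POD_CGROUP/memory.min              # memory lock threshold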
The memory QoS feature provides the following benefits:
When the memory used by a pod is about to reach the memory limit of the pod, the memcg performs asynchronous reclamation for a specific amount of memory. This prevents the reclamation of all the memory used by the pod and therefore minimizes the adverse impact on the application performance caused by direct memory reclamation.
Memory reclamation is performed more fairly among pods. When the available memory on a node becomes insufficient, memory reclamation is first performed on pods that use more memory than their memory requests. This ensures sufficient memory on the node when a pod applies for a large amount of memory.
When the system reclaims memory, the system prioritizes the memory requests of latency-sensitive (LS) pods, including Guaranteed pods and Burstable pods.
Flexible configuration and multi-environmental compatibility
The memory QoS feature provided by open source Kubernetes was introduced in Kubernetes 1.22 and supports only cgroup v2. To enable memory QoS, you must manually configure the kubelet. Memory QoS takes effect on all pods and nodes in the cluster and therefore does not support fine-grained configurations. Compared with the memory QoS feature provided by open source Kubernetes, the memory QoS feature provided by ack-koordinator is optimized in the following aspects:
Provides advanced features such as memcg backend asynchronous reclamation and minimum watermark rating based on Alibaba Cloud Linux and is compatible with the cgroup v1 and cgroup v2 interfaces. For more information about the OS kernel features required by the memory QoS feature of Container Service for Kubernetes (ACK), see Overview of kernel features and interfaces.
Allows you to use annotations or ConfigMaps to easily and flexibly configure fine-grained memory QoS for containers in a specific pod, namespace, or cluster.
Prerequisites
An ACK cluster that meets the following requirements is created:
Kubernetes version: 1.18 or later. For more information about how to update an ACK cluster, see Manually update ACK clusters.
OS: Alibaba Cloud Linux. Some parameters required by the memory QoS feature rely on Alibaba Cloud Linux. For more information, see Advanced parameters.
ack-koordinator 0.8.0 or later is installed. For more information, see ack-koordinator (FKA ack-slo-manager).
Billing
No fee is charged when you install or use the ack-koordinator component. However, fees may be charged in the following scenarios:
ack-koordinator is a non-managed component that occupies worker node resources after it is installed. You can specify the amount of resources requested by each module when you install the component.
By default, ack-koordinator exposes the monitoring metrics of features such as resource profiling and fine-grained scheduling as Prometheus metrics. If you enable Prometheus metrics for ack-koordinator and use Managed Service for Prometheus, these metrics are considered custom metrics and fees are charged for these metrics. The fee depends on factors such as the size of your cluster and the number of applications. Before you enable Prometheus metrics, we recommend that you read the Billing topic of Managed Service for Prometheus to learn about the free quota and billing rules of custom metrics. For more information about how to monitor and manage resource usage, see Query the amount of observable data and bills.
Usage notes
When you enable the memory QoS feature for pods, the cgroup parameters are automatically configured based on the specified ratios and pod parameters. This section describes how to enable memory QoS for containers in a specific pod, namespace, or cluster.
Use annotations to enable memory QoS for containers in a specific pod
You can use the following pod annotation to enable memory QoS for containers in a specific pod.
annotations:
# To enable memory QoS for the containers in a pod, set the value to auto.
koordinator.sh/memoryQOS: '{"policy": "auto"}'
# To disable memory QoS for the containers in a pod, set the value to none.
koordinator.sh/memoryQOS: '{"policy": "none"}'
Use ConfigMaps to enable memory QoS for containers in a specific cluster
You can configure a ConfigMap to enable memory QoS for all containers in a specific cluster. You can use the koordinator.sh/qosClass pod label to centrally manage memory QoS parameters based on application characteristics. If you set the value of the koordinator.sh/qosClass label to LS or BE, no annotation is required for enabling memory QoS.
The following sample ConfigMap provides an example on how to enable memory QoS for containers in a specific cluster:
apiVersion: v1
kind: ConfigMap
metadata:
  name: ack-slo-config
  namespace: kube-system
data:
  resource-qos-config: |-
    {
      "clusterStrategy": {
        "lsClass": {
          "memoryQOS": {
            "enable": true
          }
        },
        "beClass": {
          "memoryQOS": {
            "enable": true
          }
        }
      }
    }
Use the pod YAML template to set the QoS class to LS or BE.

Note: If the pod does not have the koordinator.sh/qosClass label, ack-koordinator configures the memory QoS parameters based on the original QoS class of the pod. A Guaranteed pod is assigned the default memory QoS settings. A Burstable pod is assigned the default memory QoS settings for the LS QoS class. A BestEffort pod is assigned the default memory QoS settings for the BE QoS class.

apiVersion: v1
kind: Pod
metadata:
  name: pod-demo
  labels:
    koordinator.sh/qosClass: 'LS' # Set the QoS class of the pod to LS.
Check whether the ack-slo-config ConfigMap exists in the kube-system namespace.

If the ack-slo-config ConfigMap exists, we recommend that you run the kubectl patch command to update the ConfigMap. This avoids changing other settings in the ConfigMap.
kubectl patch cm -n kube-system ack-slo-config --patch "$(cat configmap.yaml)"
If the ack-slo-config ConfigMap does not exist, run the following command to create a ConfigMap:
kubectl apply -f configmap.yaml
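Either way, you can verify the resulting configuration by printing the ConfigMap:

kubectl get configmap ack-slo-config -n kube-system -o yaml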
Optional. Configure advanced parameters.
Use ConfigMaps to enable memory QoS for containers in a specific namespace
If you want to enable or disable memory QoS for pods of the LS and BE QoS classes in a specific namespace, specify the namespaces in the ConfigMap.
First, use the following sample ConfigMap to enable memory QoS for the LS and BE QoS classes:
apiVersion: v1
kind: ConfigMap
metadata:
  name: ack-slo-config
  namespace: kube-system
data:
  resource-qos-config: |-
    {
      "clusterStrategy": {
        "lsClass": {
          "memoryQOS": {
            "enable": true
          }
        },
        "beClass": {
          "memoryQOS": {
            "enable": true
          }
        }
      }
    }
Create a file named ack-slo-pod-config.yaml and copy the following content to the file.
The following code block enables memory QoS for containers in the allow-ns namespace and disables it for containers in the block-ns namespace:
apiVersion: v1
kind: ConfigMap
metadata:
  name: ack-slo-pod-config
  namespace: kube-system # Create the ConfigMap manually the first time you use it.
data:
  # Enable or disable memory QoS for containers in the specified namespaces.
  memory-qos: |
    {
      "enabledNamespaces": ["allow-ns"],
      "disabledNamespaces": ["block-ns"]
    }
Run the following command to update the ConfigMap:
kubectl patch cm -n kube-system ack-slo-pod-config --patch "$(cat ack-slo-pod-config.yaml)"
Optional. Configure advanced parameters.
Example
In this section, a Redis pod is used as an example. The following conditions are used to compare the latency and throughput of the pod before memory QoS is enabled and after memory QoS is enabled in memory overcommitment scenarios:
An ACK Pro cluster is used.
The cluster contains 2 nodes, each of which has 8 vCPUs and 32 GB of memory. One node is used to perform stress tests. The other node runs the workload and serves as the tested machine.
Procedure
Create a file named redis-demo.yaml and copy the following content to the file:
apiVersion: v1
kind: ConfigMap
metadata:
  name: redis-demo-config
data:
  redis-config: |
    appendonly yes
    appendfsync no
---
apiVersion: v1
kind: Pod
metadata:
  name: redis-demo
  labels:
    name: redis-demo # Required so that the redis-demo Service can select this pod.
    koordinator.sh/qosClass: 'LS' # Set the QoS class of the Redis pod to LS.
  annotations:
    koordinator.sh/memoryQOS: '{"policy": "auto"}' # Add this annotation to enable memory QoS.
spec:
  containers:
  - name: redis
    image: redis:5.0.4
    command:
    - redis-server
    - "/redis-master/redis.conf"
    env:
    - name: MASTER
      value: "true"
    ports:
    - containerPort: 6379
    resources:
      limits:
        cpu: "2"
        memory: "6Gi"
      requests:
        cpu: "2"
        memory: "2Gi"
    volumeMounts:
    - mountPath: /redis-master-data
      name: data
    - mountPath: /redis-master
      name: config
  volumes:
  - name: data
    emptyDir: {}
  - name: config
    configMap:
      name: redis-demo-config
      items:
      - key: redis-config
        path: redis.conf
  nodeName: # Set nodeName to the name of the tested node.
---
apiVersion: v1
kind: Service
metadata:
  name: redis-demo
spec:
  ports:
  - name: redis-port
    port: 6379
    protocol: TCP
    targetPort: 6379
  selector:
    name: redis-demo
  type: ClusterIP
Run the following command to deploy Redis Server as the test application.
You can access the redis-demo Service from within the cluster.
kubectl apply -f redis-demo.yaml
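Before you proceed, you can confirm that the pod is scheduled to the tested node and that the Service has been created:

kubectl get pod redis-demo -o wide
kubectl get svc redis-demo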
Simulate memory overcommitment.
Use the Stress tool to increase the load on memory and trigger memory reclamation. The sum of the memory limits of all pods on the node exceeds the physical memory of the node.
Create a file named stress-demo.yaml and copy the following content to the file:
apiVersion: v1
kind: Pod
metadata:
  name: stress-demo
  labels:
    koordinator.sh/qosClass: 'BE' # Set the QoS class of the Stress pod to BE.
  annotations:
    koordinator.sh/memoryQOS: '{"policy": "auto"}' # Add this annotation to enable memory QoS.
spec:
  containers:
  - name: stress
    image: polinux/stress
    imagePullPolicy: Always
    command:
    - stress
    args:
    - '--vm'
    - '2'
    - '--vm-bytes'
    - '11G'
    - '-c'
    - '2'
    - '--vm-hang'
    - '2'
  restartPolicy: Always
  nodeName: # Set nodeName to the name of the tested node, which is the node on which the Redis pod is deployed.
Run the following command to deploy stress-demo:
kubectl apply -f stress-demo.yaml
Run the following command to query the global minimum watermark of the node:
Note: In memory overcommitment scenarios, if the global minimum watermark of the node is set to a low value, OOM killers may be triggered for all pods on the node even before memory reclamation is performed. Therefore, we recommend that you set the global minimum watermark to a high value. In this example, the global minimum watermark is set to 4,000,000 KB for the tested node that has 32 GiB of memory.
cat /proc/sys/vm/min_free_kbytes
Expected output:
4000000
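If the value on your node is lower, you can raise it. A minimal sketch, assuming root access on the node; the change does not persist across reboots unless you also add it to the sysctl configuration:

sudo sysctl -w vm.min_free_kbytes=4000000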
Use the following YAML template to deploy the memtier-benchmark tool to send requests to the tested node:
apiVersion: v1
kind: Pod
metadata:
  name: memtier-demo
  labels:
    name: memtier-demo
spec:
  containers:
  - name: memtier
    image: 'redislabs/memtier_benchmark:1.3.0'
    command:
    - memtier_benchmark
    - '-s'
    - 'redis-demo'
    - '--data-size'
    - '200000'
    - '--ratio'
    - '1:4'
  restartPolicy: Never
  nodeName: # Set nodeName to the name of the node that is used to send requests.
Run the following command to query the test results from memtier-benchmark:
kubectl logs -f memtier-demo
Use the following YAML template to disable memory QoS for the Redis pod and Stress pod. Then, perform stress tests again and compare the results.
apiVersion: v1
kind: Pod
metadata:
  name: redis-demo
  labels:
    koordinator.sh/qosClass: 'LS'
  annotations:
    koordinator.sh/memoryQOS: '{"policy": "none"}' # Disable memory QoS.
spec:
  ...
---
apiVersion: v1
kind: Pod
metadata:
  name: stress-demo
  labels:
    koordinator.sh/qosClass: 'BE'
  annotations:
    koordinator.sh/memoryQOS: '{"policy": "none"}' # Disable memory QoS.
Analyze the results
The following table describes the stress test results when memory QoS is enabled and disabled.
Disabled: The memory QoS policy of the pod is set to none.
Enabled: The memory QoS policy of the pod is set to auto and the recommended memory QoS settings are used.
The data in the following table is for reference only. The actual data generated in your test environment shall prevail.
| Metric | Disabled | Enabled |
| --- | --- | --- |
| Latency | 51.32 ms | 47.25 ms |
| Throughput | 149.0 MB/s | 161.9 MB/s |
The table shows that the latency of the Redis pod is reduced by 7.9% and the throughput of the Redis pod is increased by 8.7% after memory QoS is enabled. This indicates that the memory QoS feature can optimize the performance of applications in memory overcommitment scenarios.
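These figures follow directly from the table: (51.32 − 47.25)/51.32 ≈ 7.9% and (161.9 − 149.0)/149.0 ≈ 8.7%.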
Advanced parameters
You can enable memory QoS for containers in a specific pod, namespace, or cluster. If both pod annotations and ConfigMaps are used to configure memory QoS parameters, the pod annotations take precedence. If no pod annotation is added to configure memory QoS, ack-koordinator retrieves memory QoS parameters from the namespace-scoped configuration. If no namespace-scoped configuration exists, ack-koordinator retrieves memory QoS parameters from the cluster-wide ConfigMap.
The Pod annotation and ConfigMap columns indicate whether you can configure the parameter by using pod annotations or the ConfigMap. ✓ indicates supported and × indicates not supported.
| Parameter | Type | Value range | Description | Pod annotation | ConfigMap |
| --- | --- | --- | --- | --- | --- |
| enable | Boolean | true, false | Default value: false. This parameter specifies whether to enable memory QoS for all containers of a QoS class in the cluster. If you set this parameter to false, the memory QoS parameters of the containers are restored to their original values. | × | ✓ |
| policy | String | auto, none | The memory QoS policy of the containers in a pod. auto enables memory QoS with the recommended settings, and none disables memory QoS. | ✓ | × |
| minLimitPercent | Int | 0~100 | Unit: %. Default value: 0. This parameter specifies the unreclaimable proportion of the memory request of a pod. This parameter is suitable for scenarios where applications are sensitive to the page cache. You can use this parameter to cache files to optimize read and write performance. For more information, see the Alibaba Cloud Linux topic Memcg QoS feature of the cgroup v1 interface. The amount of unreclaimable memory is calculated based on the following formula: Value of memory.min = Memory request × minLimitPercent/100. | ✓ | ✓ |
| lowLimitPercent | Int | 0~100 | Unit: %. Default value: 0. This parameter specifies the relatively unreclaimable proportion of the memory request of a pod. For more information, see the Alibaba Cloud Linux topic Memcg QoS feature of the cgroup v1 interface. The amount of relatively unreclaimable memory is calculated based on the following formula: Value of memory.low = Memory request × lowLimitPercent/100. | ✓ | ✓ |
| throttlingPercent | Int | 0~100 | Unit: %. Default value: 0, which disables memory throttling. This parameter specifies the memory throttling threshold for the ratio of the memory usage of a container to the memory limit of the container. If the memory usage of a container exceeds the memory throttling threshold, the memory used by the container will be reclaimed. This parameter is suitable for container memory overcommitment scenarios. You can use this parameter to prevent cgroups from triggering OOM. For more information, see the Alibaba Cloud Linux topic Memcg QoS feature of the cgroup v1 interface. The memory throttling threshold for memory usage is calculated based on the following formula: Value of memory.high = Memory limit × throttlingPercent/100. | ✓ | ✓ |
| wmarkRatio | Int | 0~100 | Unit: %. Default value: 95. This parameter specifies the asynchronous memory reclamation threshold as the ratio of memory usage to memory limit or memory usage to the value of memory.high. If throttlingPercent is disabled, the memory reclamation threshold for memory usage is calculated based on the following formula: Value of memory.wmark_high = Memory limit × wmarkRatio/100. If throttlingPercent is enabled, the memory reclamation threshold for memory usage is calculated based on the following formula: Value of memory.wmark_high = Value of memory.high × wmarkRatio/100. | ✓ | ✓ |
| wmarkMinAdj | Int | -25~50 | Unit: %. Default value: -25 for the LS QoS class and 50 for the BE QoS class. This parameter specifies the adjustment to the global minimum watermark for a container. A negative value decreases the global minimum watermark and therefore postpones memory reclamation for the container. A positive value increases the global minimum watermark and therefore advances memory reclamation for the container. For more information, see the Alibaba Cloud Linux topic Memcg global minimum watermark rating. For example, if you create a pod whose QoS class is LS, the default setting of this parameter is -25. | ✓ | ✓ |
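As an illustration, the advanced parameters can be combined with the policy in a single annotation. This is a sketch only: the values below are arbitrary examples rather than recommendations, and the exact set of fields accepted in the annotation should be checked against your ack-koordinator version:

annotations:
  koordinator.sh/memoryQOS: '{"policy": "auto", "minLimitPercent": 30, "wmarkRatio": 95}' # Example values, not recommendations.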
FAQ
Is the memory QoS feature that is enabled based on the earlier version of the ack-slo-manager protocol still supported after I upgrade from ack-slo-manager to ack-koordinator?
In earlier versions (earlier than 0.8.0) of the ack-slo-manager protocol, the following pod annotations are used:
alibabacloud.com/qosClass
alibabacloud.com/memoryQOS
ack-koordinator is compatible with earlier versions of the ack-slo-manager protocol. You can seamlessly upgrade from ack-slo-manager to ack-koordinator. ack-koordinator remains compatible with the earlier protocol versions until July 30, 2023. We recommend that you upgrade the resource parameters in an earlier protocol version to the latest version.
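For example, migrating a pod from the earlier protocol amounts to renaming the keys; the value format stays the same. The following is a sketch based on the keys listed above:

# Earlier ack-slo-manager protocol (earlier than 0.8.0):
annotations:
  alibabacloud.com/qosClass: 'LS'
  alibabacloud.com/memoryQOS: '{"policy": "auto"}'
# Current koordinator.sh protocol:
labels:
  koordinator.sh/qosClass: 'LS'
annotations:
  koordinator.sh/memoryQOS: '{"policy": "auto"}'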
The following table describes the compatibility between different versions of ack-koordinator and the memory QoS feature.
| ack-koordinator version | alibabacloud.com protocol | koordinator.sh protocol |
| --- | --- | --- |
| ≥ 0.3.0 and < 0.8.0 | ✓ | × |
| ≥ 0.8.0 | ✓ | ✓ |