In Container Service for Kubernetes (ACK) clusters, you can enable the dynamic resource overcommitment feature to schedule resources that are allocated to pods but are not in use to low-priority applications. The dynamic resource overcommitment feature monitors the loads of a node in real time, calculates the CPU and memory resources that are allocated but are not in use in the cluster, and schedules resources to BestEffort pods. This ensures that resources are fairly scheduled among BestEffort pods.
To help you better understand and use this feature, we recommend that you first read the following topics in the Kubernetes official documentation: Pod Quality of Service Classes and Assign Memory Resources to Containers and Pods.
Why do I need to enable dynamic resource overcommitment?
Kubernetes manages resources that are used by pods on a node based on the quality of service (QoS) classes of the pods. For example, Kubernetes controls the out of memory (OOM) priorities. The QoS class of a pod can be Guaranteed, Burstable, or BestEffort.
To improve the stability of applications, application administrators reserve resources for pods when applications are deployed. The reserved resources are used to handle fluctuating workloads in upstream and downstream services. In most cases, the resource request of a pod is much higher than the actual resource usage. To improve the resource usage in a cluster, cluster administrators may provision BestEffort pods. BestEffort pods can share resources that are allocated to other pods but are not in use. This mechanism is known as resource overcommitment. However, resource overcommitment has the following disadvantages:
The system cannot determine whether to provide more resources for BestEffort pods based on the actual loads of a node. As a result, even if the node is overloaded, the system still schedules BestEffort pods to the node because BestEffort pods do not have resource limits.
Resources cannot be fairly allocated among BestEffort pods. Each pod requires a different amount of resources, but BestEffort pods cannot declare these amounts in their configuration files.
To resolve the preceding issues, ACK provides the capability to calculate resources that can be dynamically overcommitted. The ack-koordinator component monitors the loads of a node and synchronizes resource statistics to the node metadata as extended resources in real time. To allow BestEffort pods to use reclaimed resources, you can configure the requests and limits of reclaimed resources for the BestEffort pods. The ACK scheduler can schedule BestEffort pods to nodes based on resource requests and configure the resource limits in the cgroup of the node to ensure that the pods can use resources properly.
To differentiate reclaimed resources from regular resources, ack-koordinator introduces the Batch concept to describe reclaimed resources: batch-cpu indicates CPU resources and batch-memory indicates memory resources. In this model, Reclaimed refers to the amount of resources that can be dynamically overcommitted, Buffered refers to reserved resources, and Usage refers to the actual resource usage.
Billing rules
No fee is charged when you install and use the ack-koordinator component. However, fees may be charged in the following scenarios:
ack-koordinator is a non-managed component that occupies worker node resources after it is installed. You can specify the amount of resources requested by each module when you install the component.
By default, ack-koordinator exposes the monitoring metrics of features such as resource profiling and fine-grained scheduling as Prometheus metrics. If you enable Prometheus metrics for ack-koordinator and use Managed Service for Prometheus, these metrics are considered custom metrics and fees are charged for them. The fee depends on factors such as the size of your cluster and the number of applications. Before you enable Prometheus metrics, we recommend that you read the Billing overview topic of Managed Service for Prometheus to learn the free quota and billing rules of custom metrics. For more information about how to monitor and manage resource usage, see Resource usage and bills.
Prerequisites
An ACK Pro cluster is created. For more information, see Create an ACK Pro cluster.
The ack-koordinator component is installed and the version of the component is 0.8.0 or later. For more information, see ack-koordinator.
Procedure
You can use a ConfigMap to enable the dynamic resource overcommitment feature. Then, you can specify the QoS class of a pod by using the koordinator.sh/qosClass label in the YAML file of the pod and configure the requests and limits of Batch resources for the pod. This way, the pod can use the dynamic resource overcommitment feature.
1. Enable dynamic resource overcommitment
You can use a ConfigMap to enable the dynamic resource overcommitment feature. You can also configure parameters in the ConfigMap to flexibly manage reclaimed resources, such as the reclaim thresholds and the policy for calculating the amount of batch-memory resources on a node.
Create a file named configmap.yaml based on the following ConfigMap content:
apiVersion: v1
kind: ConfigMap
metadata:
  name: ack-slo-config
  namespace: kube-system
data:
  colocation-config: |
    {
      "enable": true,
      "metricAggregateDurationSeconds": 60,
      "cpuReclaimThresholdPercent": 60,
      "memoryReclaimThresholdPercent": 70,
      "memoryCalculatePolicy": "usage"
    }
You can flexibly manage the batch-cpu and batch-memory resources by modifying the parameters in the ConfigMap.
Parameters
Parameter | Format | Description |
enable | Boolean | Specifies whether to dynamically update the statistics about Batch resources. If you disable this feature, the amount of reclaimed resources is reset to 0. Default value: false. |
metricAggregateDurationSeconds | Int | The minimum interval at which the statistics about Batch resources are updated. Unit: seconds. We recommend that you use the default value, which is 60. |
cpuReclaimThresholdPercent | Int | The reclaim threshold of batch-cpu resources. Default value: 65. Unit: percentage (%). |
memoryReclaimThresholdPercent | Int | The reclaim threshold of batch-memory resources. Default value: 65. Unit: percentage (%). |
memoryCalculatePolicy | String | The policy for calculating the amount of batch-memory resources. Valid values: "usage": the amount of batch-memory resources is calculated based on the actual memory usage of pods whose QoS classes are Burstable or Guaranteed, and includes resources that are not allocated and resources that are allocated but are not in use. This is the default value. "request": the amount of batch-memory resources is calculated based on the memory requests of pods whose QoS classes are Burstable or Guaranteed, and includes only resources that are not allocated. |
Calculate the amount of Batch resources
The amount of Batch resources is calculated based on the resource requests or the actual resource usage of pods whose QoS classes are Guaranteed or Burstable, depending on the calculation policy that you configure. A rough sketch of the calculation is shown below.
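This sketch is only an illustrative approximation derived from the parameters described above, not the authoritative formula used by ack-koordinator:

# Illustrative sketch. The actual calculation performed by ack-koordinator may differ in detail.
batch-cpu    ≈ allocatableCPU × cpuReclaimThresholdPercent / 100 − CPU used by Guaranteed and Burstable pods and system processes
batch-memory ≈ allocatableMemory × memoryReclaimThresholdPercent / 100 − memory requested ("request" policy) or used ("usage" policy) by Guaranteed and Burstable pods and system processes

For example, on a hypothetical node with 100 allocatable cores, a cpuReclaimThresholdPercent of 60, and about 10 cores in use by Guaranteed and Burstable pods and system processes, roughly 100 × 0.6 − 10 = 50 cores (50000 millicores) would be exposed as kubernetes.io/batch-cpu.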
Check whether the ConfigMap named ack-slo-config exists in the kube-system namespace.
If the ack-slo-config ConfigMap exists, we recommend that you run the kubectl patch command to update the ConfigMap. This avoids changing other settings in the ConfigMap.
kubectl patch cm -n kube-system ack-slo-config --patch "$(cat configmap.yaml)"
If the ack-slo-config ConfigMap does not exist, run the following command to create a ConfigMap:
kubectl apply -f configmap.yaml
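To confirm that the configuration took effect, you can optionally view the ConfigMap that you created or updated. The colocation-config field in the output should contain the values that you specified:

kubectl get configmap ack-slo-config -n kube-system -o yaml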
2. Apply for Batch resources for pods
After the configuration is complete, you can specify the QoS class of a pod by using the koordinator.sh/qosClass label in the metadata field of the YAML file of the pod, and specify the Batch resource request and limit of the pod based on the total amount of Batch resources on the node.
If you do not add the koordinator.sh/qosClass label to the pod, ack-koordinator manages the pod based on its native Kubernetes QoS class. BE indicates BestEffort pods, and LS indicates Burstable or Guaranteed pods.
Run the following command to query the total amount of available Batch resources on the node:
# Replace $nodeName with the name of the node that you want to query.
kubectl get node $nodeName -o yaml
Expected output:
# Node information.
status:
  allocatable:
    # Unit: millicores. In the following example, 50 cores can be allocated.
    kubernetes.io/batch-cpu: 50000
    # Unit: bytes. In the following example, 50 GB of memory can be allocated.
    kubernetes.io/batch-memory: 53687091200
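If you only need the Batch resource values, you can also read them directly with a jsonpath query instead of browsing the full YAML output. The following commands are one way to do this; the backslashes escape the dots in the resource names as required by kubectl jsonpath syntax:

# Query the allocatable batch-cpu and batch-memory resources of the node.
kubectl get node $nodeName -o jsonpath='{.status.allocatable.kubernetes\.io/batch-cpu}'
kubectl get node $nodeName -o jsonpath='{.status.allocatable.kubernetes\.io/batch-memory}'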
Create a pod and apply for Batch resources.
Important
If you provision a pod by using a Deployment or other types of workloads, configure the template.metadata field. A pod cannot apply for Batch resources and regular resources at the same time.
ack-koordinator dynamically adjusts the amount of Batch resources that can be allocated to pods based on the actual loads of the node. In rare cases, the kubelet may synchronize node status information with a delay. As a result, pods may fail to be scheduled due to insufficient resources. In this case, delete and recreate the pods.
In Kubernetes, the amount of an extended resource must be set to an integer. The unit of batch-cpu resources is millicores.
metadata:
  labels:
    # Required. Set the QoS class of the pod to BestEffort.
    koordinator.sh/qosClass: "BE"
spec:
  containers:
  - resources:
      requests:
        # Unit: millicores. In the following example, the CPU request is set to one core.
        kubernetes.io/batch-cpu: "1k"
        # Unit: bytes. In the following example, the memory request is set to 1 GB.
        kubernetes.io/batch-memory: "1Gi"
      limits:
        kubernetes.io/batch-cpu: "1k"
        kubernetes.io/batch-memory: "1Gi"
Example
This example shows how to deploy a BestEffort pod that applies for Batch resources after the dynamic resource overcommitment feature is enabled. After the deployment is complete, verify the result by checking whether the resource limits of the BestEffort pod take effect in the cgroup of the node.
Run the following command to query the total amount of available Batch resources on the node:
kubectl get node $nodeName -o yaml
Expected output:
# Node information.
status:
  allocatable:
    # Unit: millicores. In the following example, 50 cores can be allocated.
    kubernetes.io/batch-cpu: 50000
    # Unit: bytes. In the following example, 50 GB of memory can be allocated.
    kubernetes.io/batch-memory: 53687091200
Create a file named be-pod-demo.yaml and copy the following content to the file:
apiVersion: v1
kind: Pod
metadata:
  labels:
    koordinator.sh/qosClass: "BE"
  name: be-demo
spec:
  containers:
  - command:
    - "sleep"
    - "100h"
    image: registry-cn-beijing.ack.aliyuncs.com/acs/stress:v1.0.4
    imagePullPolicy: Always
    name: be-demo
    resources:
      limits:
        kubernetes.io/batch-cpu: "50k"
        kubernetes.io/batch-memory: "10Gi"
      requests:
        kubernetes.io/batch-cpu: "50k"
        kubernetes.io/batch-memory: "10Gi"
  schedulerName: default-scheduler
Run the following command to deploy be-pod-demo as the test application:
kubectl apply -f be-pod-demo.yaml
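Before you check the cgroup, you can confirm that the pod is scheduled and running, and note the node that it runs on. For example:

kubectl get pod be-demo -o wide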
Check whether the resource limits of the BestEffort pod take effect in the cgroup of the node.
Run the following command to query the CPU limit:
cat /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod4b6e96c8_042d_471c_b6ef_b7e0686a****.slice/cri-containerd-11111c202adfefdd63d7d002ccde8907d08291e706671438c4ccedfecba5****.scope/cpu.cfs_quota_us
Expected output:
# The CPU limit in the cgroup is set to 50 cores.
5000000
Run the following command to query the memory limit:
cat /sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod4b6e96c8_042d_471c_b6ef_b7e0686a****.slice/cri-containerd-11111c202adfefdd63d7d002ccde8907d08291e706671438c4ccedfecba5****.scope/memory.limit_in_bytes
Expected output:
# The memory limit in the cgroup is set to 10 GB.
10737418240
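These values match the Batch resource limits declared in be-pod-demo.yaml: with the default CFS period of 100,000 microseconds, a limit of 50,000 millicores (50 cores) corresponds to a cpu.cfs_quota_us value of 50 × 100,000 = 5,000,000, and a limit of 10 Gi corresponds to 10 × 1024 × 1024 × 1024 = 10,737,418,240 bytes.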
What to do next
View the usage of Batch resources in Managed Service for Prometheus
ACK clusters are integrated with Managed Service for Prometheus, which provides Prometheus dashboards. You can view the usage of Batch resources in the ACK console.
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side pane, choose .
Choose Others > k8s-reclaimed-resource.
On the k8s-reclaimed-resource tab, you can view details about the Batch resources. The details include the total amount of Batch resources provided and requested by each node and the cluster. For more information, see Basic monitoring capabilities.
If you created a Prometheus dashboard, you can view the data of colocated resources based on the following metrics:
# The amount of allocatable batch-cpu resources on the node.
koordlet_node_resource_allocatable{resource="kubernetes.io/batch-cpu",node="$node"}
# The amount of batch-cpu resources that are allocated on the node.
koordlet_container_resource_requests{resource="kubernetes.io/batch-cpu",node="$node"}
# The amount of allocatable batch-memory resources on the node.
kube_node_status_allocatable{resource="kubernetes.io/batch-memory",node="$node"}
# The amount of batch-memory resources that are allocated on the node.
koordlet_container_resource_requests{resource="kubernetes.io/batch-memory",node="$node"}
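Based on these metrics, you can also build simple aggregate views. The following PromQL query is a sketch that assumes the metrics above are collected by your Prometheus instance; it computes the cluster-wide allocation rate of batch-cpu resources:

# Ratio of allocated batch-cpu resources to allocatable batch-cpu resources in the cluster.
sum(koordlet_container_resource_requests{resource="kubernetes.io/batch-cpu"}) / sum(koordlet_node_resource_allocatable{resource="kubernetes.io/batch-cpu"})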
FAQ
Is the resource overcommitment feature that is enabled based on the earlier version of the ack-slo-manager protocol supported after I upgrade from ack-slo-manager to ack-koordinator?
The earlier version of the ack-slo-manager protocol includes the following components:
The alibabacloud.com/qosClass pod annotation.
The alibabacloud.com/reclaimed field that is used to specify the resource requests and limits of pods.
ack-koordinator is compatible with the earlier version of the ack-slo-manager protocol. The scheduler of an ACK Pro cluster can calculate the amount of requested resources and the amount of available resources based on the earlier protocol version and the new protocol version. You can seamlessly upgrade from ack-slo-manager to ack-koordinator.
ack-koordinator maintains compatibility with the earlier protocol version only until July 30, 2023. We recommend that you replace the resource parameters used in the earlier protocol version with those used in the latest version.
The following table describes the compatibility between the scheduler of an ACK Pro cluster, ack-koordinator, and different protocols.
Scheduler version | ack-koordinator (ack-slo-manager) | alibabacloud.com protocol | koordinator.sh protocol |
≥ 1.18 and < 1.22.15-ack-2.0 | ≥ 0.3.0 | Supported | Not supported |
≥ 1.22.15-ack-2.0 | ≥ 0.8.0 | Supported | Supported |
Why does the memory usage suddenly increase after an application uses Batch resources?
For applications configured with kubernetes.io/batch-memory (the Batch memory limit), ack-koordinator specifies the memory limit in the cgroup of the node based on the Batch memory limit after the container is created. Some applications automatically request memory based on the cgroup of the container when they are started. If an application is started before the memory limit of the cgroup is configured, the actual memory usage may exceed the Batch memory limit. However, the operating system does not immediately reduce the memory usage of the process. As a result, the memory limit in the cgroup of the container cannot be set until the actual usage falls below the Batch memory limit.
In this case, we recommend that you modify the application configuration to ensure that the actual memory usage remains below the Batch memory limit. You can also check the memory limit parameters in the application boot script to ensure that the parameters are configured before the application starts. This ensures that the memory usage of the application is properly limited and avoids OOM errors.
Run the following command in the container to query the memory limit:
# Unit: bytes.
cat /sys/fs/cgroup/memory/memory.limit_in_bytes
# The expected output.
1048576000
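The following is a minimal sketch of how an application boot script could derive its memory setting from the cgroup limit instead of hard-coding it. The script, the headroom ratio, and the --max-memory-bytes flag are hypothetical examples and not part of ack-koordinator; adapt them to your application's actual configuration options.

#!/bin/sh
# Read the memory limit of the container from the cgroup (cgroup v1 path). Unit: bytes.
LIMIT_BYTES=$(cat /sys/fs/cgroup/memory/memory.limit_in_bytes)
# Reserve 20% of the limit as headroom and use the rest as the application memory cap.
MAX_MEMORY_BYTES=$((LIMIT_BYTES / 5 * 4))
echo "cgroup memory limit: ${LIMIT_BYTES} bytes, application memory cap: ${MAX_MEMORY_BYTES} bytes"
# Start the application with the derived memory cap (hypothetical flag; replace with your application's option).
# exec my-app --max-memory-bytes "${MAX_MEMORY_BYTES}"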