Customize the OS parameters of a node pool

If the default parameter settings of the node OS, such as Linux, do not meet your business requirements, you can customize the OS parameters of your node pools to improve OS performance. After you customize the OS parameters of a node pool, Container Service for Kubernetes (ACK) updates the nodes in the node pool in batches. The new OS parameters immediately take effect on existing nodes in the node pool. Newly added nodes also use the new OS parameters.

Limits

This feature is supported only by ACK clusters that run Kubernetes 1.28 or later. For more information, see Create an ACK managed cluster, Create an ACK dedicated cluster, and Create an ACK edge cluster. To update an ACK cluster, see Manually upgrade ACK clusters.

Usage notes

Dynamically modifying node OS configurations may change the configurations of existing pods on nodes. As a result, pods may be recreated. Before you modify node OS configurations, we recommend that you ensure the high availability of your applications.
Modifications to OS parameters may affect the Linux kernel, which may cause node performance degradation or unavailability. As a result, your applications may be affected. Before you modify an OS parameter in the production environment, we recommend that you learn the purpose of the parameter and test the impact of the parameter change.
Do not use other methods to modify OS parameters that cannot be customized in the ACK console. Doing so may make nodes unavailable or cause other OS parameter modifications to be overwritten. For example, if you manually edit the /etc/sysctl.d/99-k8s.conf file from the CLI, the modifications you made may be overwritten when the system performs cluster O&M operations, such as cluster updates and custom parameter changes.
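Before you test a parameter change, it can help to record the current values on a node so that you can compare them afterward. The following is a minimal read-only sketch, assuming a Linux node where sysctl values are exposed under /proc/sys. The function names `sysctl_path` and `snapshot` are hypothetical helpers and are not part of ACK.

```python
from pathlib import Path

# Standard Linux location of sysctl values.
PROC_SYS = Path("/proc/sys")

def sysctl_path(name: str, root: Path = PROC_SYS) -> Path:
    """Map a dotted sysctl name such as fs.file-max to its file under root."""
    return root.joinpath(*name.split("."))

def snapshot(names, root: Path = PROC_SYS) -> dict:
    """Return {name: list of value fields}, or None for unreadable parameters."""
    values = {}
    for name in names:
        try:
            values[name] = sysctl_path(name, root).read_text().split()
        except OSError:
            values[name] = None
    return values

if __name__ == "__main__":
    # Read only; this sketch never writes to /proc/sys or /etc/sysctl.d.
    for name, value in snapshot(["fs.file-max", "net.ipv4.tcp_mem"]).items():
        print(name, value)
```

The sketch only reads values. Changes should still be made through the ACK console, as described in the usage notes.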

Customizable sysctl parameters in the ACK console

Parameter	Description	Default	Suggested value range
fs.aio-max-nr	The maximum number of asynchronous I/O operations supported by the system.	65536	[65536, 6553500]
fs.file-max	The maximum number of file handles that can be allocated by the system.	2097152	[8192, 12000500]
fs.inotify.max_user_watches	The maximum number of inotify watches that can be created by a user.	524288	[524288, 2097152]
fs.nr_open	The maximum number of file descriptors that can be allocated by a process.	1048576	[1000000, 20000500] The value of this parameter must be less than the value of the fs.file-max parameter.
kernel.pid_max	The maximum number of process IDs (PIDs) that can be allocated by the system.	4194303	> 1048575
kernel.threads-max	The maximum number of threads that can be created by the system.	504581	> 500000
net.core.netdev_max_backlog	The maximum number of packets supported by the input queue when the packet receive rate on the interface is higher than the processing rate of the kernel.	16384	[1000, 3240000]
net.core.optmem_max	The maximum ancillary buffer size supported by a socket. Unit: bytes.	20480	[20480, 4194304]
net.core.rmem_max	The maximum receive buffer size supported by a socket. Unit: bytes.	16777216	[212992, 134217728]
net.core.wmem_max	The maximum send buffer size supported by a socket. Unit: bytes.	16777216	[212992, 134217728]
net.core.wmem_default	The default send buffer size supported by a socket. Unit: bytes.	212992	≥ 212992
net.ipv4.tcp_mem	The maximum size of memory that can be used by the TCP stack. Unit: pages. In most cases, the page size is 4 KB. The value of this parameter consists of three integers that specify different memory watermarks for the TCP stack. The first integer specifies the minimum memory watermark. The second integer specifies the stressful memory watermark. The third integer specifies the maximum memory watermark.	The value is dynamically calculated based on the total memory provided by the system.	The three values increase in sequence. Minimum value: 80000.
net.ipv4.neigh.default.gc_thresh1	The minimum number of entries to retain in the Address Resolution Protocol (ARP) cache. The system will not perform garbage collection if the number of entries in the cache falls below this value.	128	[128, 80000]
net.ipv4.neigh.default.gc_thresh2	The maximum number of entries in the ARP cache. This is a soft limit. When the number of entries in the cache reaches this value, the system will start the garbage collection process after 5 seconds.	1024	[512, 90000]
net.ipv4.neigh.default.gc_thresh3	The maximum number of entries to retain in the ARP cache. This is a hard limit. When the number of entries in the cache reaches this value, the system will immediately perform garbage collection. If the number of entries continuously exceeds this value, the system will perform ongoing cleanup operations.	8192	[1024, 100000]
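The suggested ranges in the preceding table can be checked before you submit a change. The following sketch copies the ranges from the table into a dictionary and validates a set of proposed values. The `validate` function and `RANGES` dictionary are hypothetical, and two points are assumptions rather than documented behavior: the "Minimum value: 80000" rule for net.ipv4.tcp_mem is applied to the smallest of the three values, and fs.nr_open is compared with the documented default of fs.file-max when no new fs.file-max value is proposed.

```python
# (min, max) copied from the table; None means no documented upper bound.
RANGES = {
    "fs.aio-max-nr": (65536, 6553500),
    "fs.file-max": (8192, 12000500),
    "fs.inotify.max_user_watches": (524288, 2097152),
    "fs.nr_open": (1000000, 20000500),
    "kernel.pid_max": (1048576, None),        # table: > 1048575
    "kernel.threads-max": (500001, None),     # table: > 500000
    "net.core.netdev_max_backlog": (1000, 3240000),
    "net.core.optmem_max": (20480, 4194304),
    "net.core.rmem_max": (212992, 134217728),
    "net.core.wmem_max": (212992, 134217728),
    "net.core.wmem_default": (212992, None),  # table: >= 212992
    "net.ipv4.neigh.default.gc_thresh1": (128, 80000),
    "net.ipv4.neigh.default.gc_thresh2": (512, 90000),
    "net.ipv4.neigh.default.gc_thresh3": (1024, 100000),
}
DEFAULT_FILE_MAX = 2097152  # documented default of fs.file-max

def validate(params: dict) -> list:
    """Return a list of human-readable problems; an empty list means no problem found."""
    errors = []
    for name, value in params.items():
        if name == "net.ipv4.tcp_mem":
            # Three strictly increasing integers (assumption: smallest >= 80000).
            if len(value) != 3 or list(value) != sorted(set(value)) or value[0] < 80000:
                errors.append(f"{name}: expected three increasing values, each >= 80000")
            continue
        if name not in RANGES:
            errors.append(f"{name}: not in the customizable list")
            continue
        low, high = RANGES[name]
        if value < low or (high is not None and value > high):
            errors.append(f"{name}: {value} is outside the suggested range")
    if "fs.nr_open" in params:
        file_max = params.get("fs.file-max", DEFAULT_FILE_MAX)
        if params["fs.nr_open"] >= file_max:
            errors.append("fs.nr_open must be less than fs.file-max")
    return errors

print(validate({"fs.file-max": 4194304, "fs.nr_open": 2097152}))
```

A value that passes this check is only within the suggested range. It still needs to be tested against your workload.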

Customizable THP parameters in the ACK console

The Transparent Huge Pages (THP) feature is a common feature in the Linux kernel. THP can merge small pages (typically 4 KB in size) into huge pages (typically 2 MB or larger in size) to reduce the number of page table entries (PTEs) and the memory accesses needed for address translation. This relieves pressure on the translation lookaside buffer (TLB) and improves application performance.
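To illustrate the scale of the reduction, the following sketch counts how many pages, and therefore how many mapping entries, are needed to cover a 1 GiB region with 4 KB pages and with 2 MB pages. The region size is an arbitrary example, and `pages_needed` is a hypothetical helper.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

def pages_needed(region_bytes: int, page_bytes: int) -> int:
    """Number of pages required to cover a region, rounded up."""
    return -(-region_bytes // page_bytes)

small = pages_needed(1 * GIB, 4 * KIB)  # regular pages
huge = pages_needed(1 * GIB, 2 * MIB)   # transparent huge pages
print(small, huge, small // huge)       # prints "262144 512 512"
```

With 2 MB pages, the same region needs 512 times fewer entries, which is why a single TLB entry can cover far more memory.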

Note

This feature is in canary release. To use it, submit a ticket.
The default values in the following table are the default settings used by systems that run Alibaba Cloud Linux 2 with kernel version 4.19.91-18 and later.

Parameter	Description	Default	Valid value
transparent_enabled	Specifies whether to globally enable the THP feature.	always	always: globally enables the THP feature. never: globally disables the THP feature. madvise: enables the THP feature only in memory regions that an application marks with `MADV_HUGEPAGE` by calling the `madvise()` system call.
transparent_defrag	Specifies whether to enable the THP defragmentation feature. After you enable THP defragmentation, small pages can be merged into huge pages, which reduces the page table size and improves system performance.	madvise	always: If no transparent huge page is available, the system stalls the memory allocation and performs direct memory reclaim and direct memory compaction. If sufficient contiguous free memory is available after direct reclaim and direct compaction are complete, the system proceeds to allocate the transparent huge page. defer: If no transparent huge page is available, the system allocates regular pages (4 KB in size). Meanwhile, the system wakes the `kswapd` kernel daemon to perform background reclaim and the `kcompactd` kernel daemon to perform background compaction. If sufficient contiguous free memory is available after these operations run for a period of time, the `khugepaged` kernel daemon merges the previously allocated regular pages (4 KB in size) into transparent huge pages (2 MB in size). madvise: In memory regions that are marked with `MADV_HUGEPAGE` through the `madvise()` system call, memory allocation behaves the same as with the always option. In other memory regions, the system allocates regular pages (4 KB in size) when a page fault occurs. defer+madvise: In memory regions that are marked with `MADV_HUGEPAGE` through the `madvise()` system call, memory allocation behaves the same as with the always option. In other memory regions, memory allocation behaves the same as with the defer option. never: disables THP defragmentation.
khugepaged_defrag	`khugepaged` is a kernel thread that is used for transparent huge page management and defragmentation. It reduces memory fragmentation and improves system performance. `khugepaged` monitors transparent huge pages in the system and attempts to merge scattered transparent huge pages into larger pages, which improves memory utilization and performance. This operation locks the memory directory. In addition, the `khugepaged` kernel daemon may scan and convert regular pages to transparent huge pages at inopportune times. As a result, this operation may affect application performance.	1	0: disables khugepaged defragmentation. 1: The system periodically wakes the `khugepaged` kernel daemon when the system is idle and attempts to merge consecutive regular pages (4 KB in size) into transparent huge pages (2 MB in size).
khugepaged_alloc_sleep_millisecs	If THP allocation fails, the `khugepaged` kernel daemon waits for the specified period of time before it retries the allocation. This helps prevent consecutive THP allocation failures within a short period of time. Unit: milliseconds.	60000 (equivalent to 60 seconds)
khugepaged_scan_sleep_millisecs	The interval at which the system wakes the `khugepaged` kernel daemon. Unit: milliseconds.	10000 (equivalent to 10 seconds)
khugepaged_pages_to_scan	The number of pages that the `khugepaged` kernel daemon scans each time it is woken. Unit: pages.	4096
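To check which THP settings are active on a node, you can read the kernel's sysfs files. The following read-only sketch assumes that the parameter names in the preceding table map to the standard Linux files under /sys/kernel/mm/transparent_hugepage. This mapping is an assumption based on the upstream kernel layout and is not stated by ACK. Files such as `enabled` list every option and put the active one in brackets, for example `always [madvise] never`. The helper names are hypothetical.

```python
from pathlib import Path

THP_ROOT = Path("/sys/kernel/mm/transparent_hugepage")

# Assumed mapping from the console parameter names to sysfs files.
THP_FILES = {
    "transparent_enabled": "enabled",
    "transparent_defrag": "defrag",
    "khugepaged_defrag": "khugepaged/defrag",
    "khugepaged_alloc_sleep_millisecs": "khugepaged/alloc_sleep_millisecs",
    "khugepaged_scan_sleep_millisecs": "khugepaged/scan_sleep_millisecs",
    "khugepaged_pages_to_scan": "khugepaged/pages_to_scan",
}

def active_value(raw: str) -> str:
    """Return the bracketed option if present, otherwise the stripped raw value."""
    for token in raw.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return raw.strip()

def read_thp(root: Path = THP_ROOT) -> dict:
    """Return {parameter: active value}, or None where the file cannot be read."""
    settings = {}
    for name, rel in THP_FILES.items():
        try:
            settings[name] = active_value((root / rel).read_text())
        except OSError:
            settings[name] = None
    return settings

if __name__ == "__main__":
    for name, value in read_thp().items():
        print(f"{name} = {value}")
```

On a system that does not expose THP, every value is reported as None instead of raising an error.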

Customize the OS parameters of a node pool in the ACK console

After you customize the OS parameters of a node pool, ACK updates the nodes in the node pool in batches. The new OS parameters immediately take effect on existing nodes in the node pool. Newly added nodes also use the new OS parameters. Changing OS parameters on existing nodes may affect the applications that run on those nodes. We recommend that you perform this operation during off-peak hours.

Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Nodes > Node Pools.
On the Node Pools page, select the node pool that you want to manage and choose More > OS Configuration in the Actions column.
Read the configuration notes. Click + Custom Parameters and select the parameters that you want to modify. Specify the Maximum Number of Nodes to Repair per Batch parameter. Then, click Submit and follow the instructions to complete subsequent operations.
After you specify the Maximum Number of Nodes to Repair per Batch parameter, the new OS configurations are applied to the nodes in the node pool in batches based on the value that you specified. You can view the progress of the update in the Event Rotation section, and you can pause, resume, or cancel the update. Pausing the update lets you verify the nodes that are already updated. After you pause the update, the nodes in the current batch still finish updating. The remaining batches are not updated until you resume the update.
Important
We recommend that you complete the update at the earliest opportunity. If the update remains paused for seven days, the system automatically cancels the update and deletes the related events and logs.
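The batch behavior described above can be pictured with a short sketch. It assumes that nodes are grouped in list order into batches no larger than the Maximum Number of Nodes to Repair per Batch value. The actual order in which ACK selects nodes is not documented here, and `plan_batches` and `remaining_after_pause` are hypothetical names used only for illustration.

```python
def plan_batches(nodes: list, max_per_batch: int) -> list:
    """Split nodes into consecutive batches of at most max_per_batch nodes."""
    if max_per_batch < 1:
        raise ValueError("max_per_batch must be at least 1")
    return [nodes[i:i + max_per_batch] for i in range(0, len(nodes), max_per_batch)]

def remaining_after_pause(batches: list, current_index: int) -> list:
    """Batches that wait for a resume when the update is paused during
    batch current_index. The current batch itself still completes."""
    return batches[current_index + 1:]

batches = plan_batches(["node-1", "node-2", "node-3", "node-4", "node-5"], 2)
print(batches)
print(remaining_after_pause(batches, 0))
```

With five nodes and a batch size of 2, the sketch produces three batches. Pausing during the first batch leaves the other two pending until the update is resumed.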