The Page Cache Limit feature is provided in Alibaba Cloud Linux 3 starting with kernel version 5.10.134-14
to resolve system instability issues that are caused by unlimited page cache usage, such as business jitters and unexpected out-of-memory (OOM) errors.
Background information
In the kernel, when memory is allocated to a memory control group (memcg) that has reached its upper memory limit, direct memory reclaim is triggered on the memcg. The reclaim may affect the performance of the current process. The memcg backend asynchronous reclaim feature is provided to resolve this issue. However, that feature is not very effective for burst memory requests. When specific workloads such as Spark jobs are running, a large amount of memory is used as page cache. Most pages in the page cache are dirty pages, and dirty pages are reclaimed slowly. As a result, the kernel may be unable to get enough memory to continue operating, and OOM errors unexpectedly occur. To ensure business stability and prevent unexpected OOM errors, page cache usage must be limited.
Alibaba Cloud Linux 3 provides the Page Cache Limit feature to limit page cache usage for memcgs, including root memcgs. You can use the Page Cache Limit feature to specify a limit for page cache usage and asynchronously or synchronously reclaim excess page cache when the limit is exceeded. This prevents larger-than-expected amounts of memory from being used for page cache and improves system stability and reliability.
Interfaces
Interface | Description |
/sys/kernel/mm/pagecache_limit/enabled | The switch that controls whether to enable the Page Cache Limit feature globally in the kernel. Valid values: 0 and 1. Default value: 0. |
/sys/fs/cgroup/memory/<Memcg directory name>/memory.pagecache_limit.enable | The switch that controls whether to enable the Page Cache Limit feature for a specific memcg. Valid values: 0 and 1. Default value: 0. |
/sys/fs/cgroup/memory/<Memcg directory name>/memory.pagecache_limit.size | The maximum page cache usage of a specific memcg. Unit: bytes. Valid values: 0 to the |
/sys/fs/cgroup/memory/<Memcg directory name>/memory.pagecache_limit.sync | Controls whether to perform asynchronous or synchronous reclaim when the memcg exceeds the limit for page cache usage. Valid values: 0 (asynchronous reclaim) and 1 (synchronous reclaim). Default value: 0. |
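To see how the interfaces in the table fit together, the following minimal sketch prints the current value of each one for a single memcg. The function name show_pagecache_limit and the PCL_GLOBAL and CGROUP_ROOT variables are hypothetical helpers introduced here for illustration; only the file paths come from the table.

```shell
#!/bin/sh
# Hypothetical helper: print the current value of each Page Cache Limit
# interface for one memcg. PCL_GLOBAL and CGROUP_ROOT are illustrative
# variables; their defaults are the paths listed in the table.
PCL_GLOBAL="${PCL_GLOBAL:-/sys/kernel/mm/pagecache_limit/enabled}"
CGROUP_ROOT="${CGROUP_ROOT:-/sys/fs/cgroup/memory}"

show_pagecache_limit() {
    memcg="$1"   # memcg directory name, for example "test"
    for f in "$PCL_GLOBAL" \
             "$CGROUP_ROOT/$memcg/memory.pagecache_limit.enable" \
             "$CGROUP_ROOT/$memcg/memory.pagecache_limit.size" \
             "$CGROUP_ROOT/$memcg/memory.pagecache_limit.sync"; do
        if [ -r "$f" ]; then
            printf '%s = %s\n' "$f" "$(cat "$f")"
        else
            # The file is missing on kernels without the feature
            # or when the memcg directory does not exist.
            printf '%s = (not available)\n' "$f"
        fi
    done
}
```

For example, show_pagecache_limit test prints one line per interface for the test memcg, and reports (not available) for any file that the running kernel does not expose.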
How the feature works
After you enable the Page Cache Limit feature, it applies to memcgs as follows:
When page cache is allocated to a process in a memcg, the feature determines whether the current memcg exceeds its limit for page cache usage, and traverses upwards from the memcg to check the memory.pagecache_limit values of the parent memcgs level by level. If the memory.pagecache_limit value of a parent memcg is 0, the Page Cache Limit feature is disabled for that parent memcg, and page cache usage is not limited for the parent memcg or its child memcgs.
If the current memcg exceeds the limit for page cache usage, the feature determines whether to perform synchronous or asynchronous reclaim based on the memory.pagecache_limit.sync value.
The feature reclaims page cache:
Synchronous reclaim: By default, only unmapped file pages can be reclaimed. When the kernel performs more than four scans, mapped file pages can also be reclaimed.
Asynchronous reclaim: By default, unmapped and mapped file pages can be reclaimed. When the kernel performs more than two scans, dirty pages can be reclaimed.
Note: The reclaim rules distinguish the following types of memory pages:
Unmapped file pages: page cache pages that hold file data but are not mapped into the address space of any process. In most cases, these pages are created by regular read and write calls on files.
Mapped file pages: page cache pages that are mapped into the address space of a process, for example by mmap. These pages allow processes to read and write file data directly in memory, which enables random access to files.
Dirty pages: file pages that are modified in memory and not yet written back to disks. When a process writes data to a file page, the page is marked dirty. The mark indicates that the copy in memory differs from the file on disk. Dirty pages are periodically written back to disks to ensure data persistence.
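The hierarchical check described in this section can be approximated from user space. The following sketch walks from a memcg directory up to the cgroup root and reports the first level whose limit value is 0. It is an illustration, not the kernel logic: the function name and the CGROUP_ROOT and LIMIT_FILE variables are hypothetical, and the text does not say which file holds the memory.pagecache_limit value, so the sketch assumes memory.pagecache_limit.size.

```shell
#!/bin/sh
# Illustrative sketch only. LIMIT_FILE is an assumption: the section names
# "memory.pagecache_limit" without a suffix, and this sketch guesses ".size".
CGROUP_ROOT="${CGROUP_ROOT:-/sys/fs/cgroup/memory}"
LIMIT_FILE="${LIMIT_FILE:-memory.pagecache_limit.size}"

# Usage: pagecache_limit_effective <memcg path relative to CGROUP_ROOT>
# Returns 0 and prints "limited" if no level reports 0, otherwise returns 1
# and prints the directory that disables the limit.
pagecache_limit_effective() {
    dir="$CGROUP_ROOT/$1"
    while :; do
        # A value of 0 at any level means page cache usage is not limited
        # for that memcg and its child memcgs.
        if [ -r "$dir/$LIMIT_FILE" ] && [ "$(cat "$dir/$LIMIT_FILE")" = "0" ]; then
            echo "unlimited: $dir"
            return 1
        fi
        # Stop at the cgroup root (or at / as a safety net).
        if [ "$dir" = "$CGROUP_ROOT" ] || [ "$dir" = "/" ]; then
            break
        fi
        dir=$(dirname "$dir")
    done
    echo "limited"
    return 0
}
```

Running pagecache_limit_effective test before you generate load is a quick way to confirm that no parent memcg silently disables the limit that you configured on a child.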
Example on how to configure the interfaces
In this example, a 20 MiB page cache is created and page cache usage is limited to 10 MiB. After you enable the Page Cache Limit feature, verify whether the feature works as expected.
Connect to an Elastic Compute Service (ECS) instance.
For more information, see Connect to a Linux instance by using a password or key.
Run the following command to enable the Page Cache Limit feature globally:
sudo sh -c 'echo 1 > /sys/kernel/mm/pagecache_limit/enabled'
Enable the Page Cache Limit feature and limit the page cache usage for a specific memcg.
Run the following command to create a memcg directory. Example: /sys/fs/cgroup/memory/test/.
sudo mkdir -p /sys/fs/cgroup/memory/test/
Run the following command to specify a limit for page cache usage for the memcg.
In this example, the page cache usage limit of the memcg is set to 10,485,760 bytes (10 × 1024 × 1024 bytes, which is exactly 10 MiB).
sudo sh -c 'echo 10485760 > /sys/fs/cgroup/memory/test/memory.pagecache_limit.size'
Configure a page cache reclaim scheme for the memcg.
To use the asynchronous reclaim scheme, run the following command:
sudo sh -c 'echo 0 > /sys/fs/cgroup/memory/test/memory.pagecache_limit.sync'
To use the synchronous reclaim scheme, run the following command:
sudo sh -c 'echo 1 > /sys/fs/cgroup/memory/test/memory.pagecache_limit.sync'
Run the following command to enable the Page Cache Limit feature for the memcg:
sudo sh -c 'echo 1 > /sys/fs/cgroup/memory/test/memory.pagecache_limit.enable'
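The per-memcg commands in this step can be gathered into one helper. The following is a minimal sketch under stated assumptions: the function name configure_pagecache_limit and the CGROUP_ROOT variable are hypothetical, the function must run as root on a real system because it writes to the cgroup file system, and it does not touch the global switch.

```shell
#!/bin/sh
# Hypothetical helper that repeats the per-memcg steps of this example.
# CGROUP_ROOT is an illustrative variable; its default is the path used above.
CGROUP_ROOT="${CGROUP_ROOT:-/sys/fs/cgroup/memory}"

# Usage: configure_pagecache_limit <memcg name> <limit in MiB> <sync: 0|1>
configure_pagecache_limit() {
    dir="$CGROUP_ROOT/$1"
    limit_bytes=$(( $2 * 1024 * 1024 ))   # MiB to bytes: 10 gives 10485760
    case "$3" in
        0|1) ;;
        *) echo "sync must be 0 (asynchronous) or 1 (synchronous)" >&2
           return 1 ;;
    esac
    mkdir -p "$dir" || return 1
    # Same order as the steps above: size, reclaim scheme, then enable.
    echo "$limit_bytes" > "$dir/memory.pagecache_limit.size"   || return 1
    echo "$3"           > "$dir/memory.pagecache_limit.sync"   || return 1
    echo 1              > "$dir/memory.pagecache_limit.enable" || return 1
}
```

With this helper, configure_pagecache_limit test 10 0 reproduces the asynchronous variant of the example, and passing 1 as the last argument selects synchronous reclaim.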
Create a page cache.
Run the following command to install the libcgroup-tools package.
The cgexec command is required to create a page cache. In most cases, the cgexec command is provided as part of libcgroup and must be installed separately. If the cgexec command is unavailable in your system, install the package.
sudo yum install libcgroup-tools
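Run the following commands to create a page cache.
In this example, the dd command creates a 20-MiB file by writing a 1-MiB block 20 times in a row. Because oflag=direct is specified, the write bypasses the page cache. The cat command then reads the file inside the test memcg, which loads the file data into page cache that is charged to the memcg.
sudo dd if=/dev/zero of=./testfile bs=1M count=20 oflag=direct
sudo cgexec -g "memory:test" cat ./testfile > /dev/null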
Check whether the Page Cache Limit feature works as expected.
Run the following command to check the page cache usage:
grep cache /sys/fs/cgroup/memory/test/memory.stat
The following command output is returned.
In the preceding command output, cache indicates that page cache usage is limited to 10,543,104 bytes (approximately equal to 10 MiB).
Run the following command to check whether the Page Cache Limit feature reclaims page cache as expected:
cat /sys/fs/cgroup/memory/test/memory.exstat
The following command output is returned.
In the preceding command output, pagecache_limit_reclaimed_kb indicates that 10,108 KB (approximately equal to 10 MiB) of page cache is reclaimed.
The verification results show that a 20-MiB page cache is created and page cache usage is limited to 10 MiB. When page cache usage exceeds the limit, 10 MiB of page cache is reclaimed by the Page Cache Limit feature as expected.
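The comparison in this verification step can also be scripted. The following sketch reads the cache line from a memory.stat file and checks it against a limit plus an optional slack. The function name cache_within_limit and the slack parameter are hypothetical. The slack exists because the 10,543,104-byte value reported in this example is 57,344 bytes above the configured 10,485,760 bytes, so an exact comparison would fail.

```shell
#!/bin/sh
# Hypothetical helper: compare the "cache" value of a memcg with a limit.
# Usage: cache_within_limit <memory.stat path> <limit in bytes> [slack in bytes]
cache_within_limit() {
    stat_file="$1"
    limit="$2"
    slack="${3:-0}"
    # memory.stat holds "name value" pairs; match the name exactly so that
    # lines such as total_cache are not picked up.
    cache=$(awk '$1 == "cache" { print $2 }' "$stat_file")
    if [ -z "$cache" ]; then
        echo "no cache line in $stat_file" >&2
        return 2
    fi
    echo "cache=$cache bytes, limit=$limit bytes"
    # Exit status 0 when usage is at or below limit + slack, 1 otherwise.
    [ "$cache" -le $(( limit + slack )) ]
}
```

For example, cache_within_limit /sys/fs/cgroup/memory/test/memory.stat 10485760 65536 succeeds as long as usage stays within 64 KiB of the limit. The 64-KiB slack is an arbitrary choice for illustration, not a value from the feature.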
Note: If the pagecache_limit_reclaimed_kb value is higher than expected, a possible cause is that too much data is read ahead (prefetched) during sequential read operations, which results in excessive reclaim of page cache. We recommend that you run the following command to configure the read_ahead_kb parameter for the disk. The read_ahead_kb parameter specifies the number of kilobytes that the kernel reads ahead during a sequential read operation. Replace <Disk device name> with the name of your disk device, for example, vda. Then, verify the Page Cache Limit feature again.
echo 128 | sudo tee /sys/block/<Disk device name>/queue/read_ahead_kb
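Before you change read_ahead_kb, it can help to record the current value for each disk so that you can restore it later. The following sketch lists the value for every block device that exposes the parameter. The function name list_read_ahead_kb and the SYS_BLOCK variable are hypothetical; only the /sys/block/<Disk device name>/queue/read_ahead_kb path comes from the note.

```shell
#!/bin/sh
# Hypothetical helper: print "<device> <read_ahead_kb>" for each block device.
# SYS_BLOCK is an illustrative variable; its default is the path from the note.
SYS_BLOCK="${SYS_BLOCK:-/sys/block}"

list_read_ahead_kb() {
    for f in "$SYS_BLOCK"/*/queue/read_ahead_kb; do
        # Skip the unexpanded pattern when no device matches.
        [ -r "$f" ] || continue
        dev=${f#"$SYS_BLOCK"/}   # strip the leading directory
        dev=${dev%%/*}           # keep only the device name
        printf '%s %s\n' "$dev" "$(cat "$f")"
    done
}
```

Reading these files does not require root privileges. Only the write shown in the note does.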