By Yu Xu
This article introduces memory folios. The basic part of the folio work was first merged in Linux 5.16, and as of Linux 6.5 folios have made great progress.
Referred from LWN:
The key points are as follows.
Add memory folios, a new type to represent either order-0 pages or the head page of a compound page.
A folio can be seen as a thin wrapper around pages that adds no overhead. A folio is either a single page or a compound page.
(The figure references Extreme Optimization of HugeTLB)
The preceding figure shows the layout of struct page, which uses 64 bytes to hold management information such as flags, lru, mapping, index, private, {ref_, map_}count, and memcg_data. In a compound page, these fields are only meaningful in the head page, while the tail pages hold information such as compound_{head, mapcount, order, nr, dtor}.
struct folio {
    /* private: don't document the anon union */
    union {
        struct {
            /* public: */
            unsigned long flags;
            struct list_head lru;
            struct address_space *mapping;
            pgoff_t index;
            void *private;
            atomic_t _mapcount;
            atomic_t _refcount;
#ifdef CONFIG_MEMCG
            unsigned long memcg_data;
#endif
            /* private: the union with struct page is transitional */
        };
        struct page page;
    };
};
In the folio structure definition, fields such as flags and lru have exactly the same layout as in struct page, so the two can share a union. This lets you write folio->flags directly instead of folio->page.flags.
#define page_folio(p) (_Generic((p), \
    const struct page *: (const struct folio *)_compound_head(p), \
    struct page *: (struct folio *)_compound_head(p)))
#define nth_page(page,n) ((page) + (n))
#define folio_page(folio, n) nth_page(&(folio)->page, n)
At first glance, page_folio may look a little confusing, but it is conceptually equivalent to the following pseudocode (C has no switch on types):
switch (typeof(p)) {
    case const struct page *:
        return (const struct folio *)_compound_head(p);
    case struct page *:
        return (struct folio *)_compound_head(p);
}
It's as simple as that.
_Generic is the generic selection expression defined in section 6.5.1.1 of the C11 standard. Its syntax is as follows:
Generic selection
Syntax
generic-selection:
_Generic ( assignment-expression , generic-assoc-list )
generic-assoc-list:
generic-association
generic-assoc-list , generic-association
generic-association:
type-name : assignment-expression
default : assignment-expression
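To make the grammar above concrete, here is a minimal user-space sketch of generic selection. The struct page and struct folio types are simplified stand-ins, and as_folio, type_name, and generic_selection_demo are hypothetical names invented for this illustration; none of them are kernel APIs.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel types; only the type names matter. */
struct page  { int id; };
struct folio { struct page page; };

/* Const in, const out: the result type follows the argument's static type. */
#define as_folio(p) (_Generic((p),                       \
    const struct page *: (const struct folio *)(p),      \
    struct page *:       (struct folio *)(p)))

/* Report which association the compiler selected for an expression. */
#define type_name(x) (_Generic((x),                      \
    const struct page *: "const struct page *",          \
    struct page *:       "struct page *",                \
    default:             "other"))

/* Exercise both macros; returns 1 when every check holds. */
int generic_selection_demo(void)
{
    struct folio f = { .page = { .id = 7 } };
    struct page *p = &f.page;
    const struct page *cp = p;

    assert(strcmp(type_name(p), "struct page *") == 0);
    assert(strcmp(type_name(cp), "const struct page *") == 0);
    assert(strcmp(type_name(42), "other") == 0);

    assert(as_folio(p) == &f);            /* yields struct folio * */
    assert(as_folio(cp)->page.id == 7);   /* yields const struct folio * */
    return 1;
}
```

The selection happens at compile time, so only the chosen branch is ever evaluated. Writing through as_folio(cp) would be rejected by the compiler, which is the const-preserving behaviour page_folio relies on.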
The conversion between pages and folios is simple. Converting either a head page or a tail page to a folio yields the folio of the corresponding head page. When converting a folio back to a page, use &folio->page to get the head page and folio_page(folio, n) to get the n-th page, which is a tail page for n > 0.
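The round trip described above can be tried out in user space. The following is only a sketch under simplifying assumptions: struct page is cut down to two fields, the pages of a compound page are modelled as a plain array, and make_compound is a hypothetical helper that does not exist in the kernel. The macros mirror the shape of the kernel ones without the _Generic const handling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified model of struct page: just enough for the conversion. */
struct page {
    unsigned long flags;
    uintptr_t compound_head;   /* tail pages: address of head page | 1 */
};

/* As in the kernel definition, a folio overlays its head page. */
struct folio {
    union {
        struct {
            unsigned long flags;
        };
        struct page page;
    };
};

/* Bit 0 set means "tail page"; the rest of the word is the head address. */
static inline struct page *compound_head(struct page *page)
{
    uintptr_t head = page->compound_head;

    if (head & 1)
        return (struct page *)(head - 1);
    return page;
}

#define page_folio(p)        ((struct folio *)compound_head(p))
#define folio_page(folio, n) (&(folio)->page + (n))

/* Hypothetical helper: link pages[1..nr-1] to pages[0] as their head. */
void make_compound(struct page *pages, size_t nr)
{
    assert(nr >= 1);
    pages[0].compound_head = 0;
    for (size_t i = 1; i < nr; i++)
        pages[i].compound_head = (uintptr_t)&pages[0] | 1;
}
```

With a four-page array prepared by make_compound, page_folio(&pages[3]) and page_folio(&pages[0]) return the same folio, folio_page(folio, 2) points at pages[2], and a value stored in pages[0].flags is visible as folio->flags.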
However, the question remains: Since pages can already represent both base pages and compound pages, why introduce folios?
The folio type allows a function to declare that it's expecting only a head page. Almost incidentally, this allows us to remove various calls to VM_BUG_ON(PageTail(page)) and compound_head().
Folios were introduced because struct page is overloaded: a page can be a base page, a compound head page, or a compound tail page.
As mentioned above, page metadata such as page->mapping and page->index lives in the head page (a base page can be regarded as its own head page). On the mm path, however, every function that receives a page must first determine whether it is a head page or a tail page. Because that result is not cached anywhere in the calling context, the mm path can end up making a large number of duplicate compound_head calls.
Take mem_cgroup_move_account as an example. A single call to it can execute compound_head up to seven times.
static inline struct page *compound_head(struct page *page)
{
    unsigned long head = READ_ONCE(page->compound_head);

    if (unlikely(head & 1))
        return (struct page *) (head - 1);
    return page;
}
Next, take page_mapping(page) as an example. On entry, the function calls compound_head(page) so that it can read fields such as page->mapping from the head page. It also contains a PageSwapCache(page) branch. Because that helper takes a page as its argument, it has to call compound_head(page) again internally before it can read the page flags.
struct address_space *page_mapping(struct page *page)
{
    struct address_space *mapping;

    page = compound_head(page);

    /* This happens if someone calls flush_dcache_page on slab page */
    if (unlikely(PageSlab(page)))
        return NULL;

    if (unlikely(PageSwapCache(page))) {
        swp_entry_t entry;

        entry.val = page_private(page);
        return swap_address_space(entry);
    }

    mapping = page->mapping;
    if ((unsigned long)mapping & PAGE_MAPPING_ANON)
        return NULL;

    return (void *)((unsigned long)mapping & ~PAGE_MAPPING_FLAGS);
}
EXPORT_SYMBOL(page_mapping);
After switching to folios, page_mapping(page) becomes folio_mapping(folio). Since a folio is by definition a head page, both compound_head(page) calls are omitted.
mem_cgroup_move_account is just the tip of the iceberg. The mm path is full of compound_head calls. Removing them reduces the execution cost, and the folio type tells developers that what they hold must be a head page, which removes conditional branches. The benefits are:
1) Reduce redundant compound_head calls.
2) If developers see a folio, they can conclude that it is a head page.
3) Fix potential bugs caused by tail pages.
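The first point can be illustrated with a toy model that counts head lookups. This is a sketch, not kernel code: the struct page here is a hypothetical miniature with a plain head pointer instead of the tagged compound_head word, the swap cache branch simply returns NULL, and page_mapping_old, folio_mapping_new, lookups, and reset_lookups are names made up for the comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical miniature page: tail pages point at their head page. */
struct page {
    struct page *head;      /* NULL for a head or base page */
    bool swapcache;
    void *mapping;
};
struct folio { struct page page; };

static unsigned long head_lookups;   /* number of compound_head() calls */

static struct page *compound_head(struct page *page)
{
    assert(page != NULL);
    head_lookups++;
    return page->head ? page->head : page;
}

/* Page-based helpers: each one must normalise its own argument. */
static bool PageSwapCache(struct page *page)
{
    return compound_head(page)->swapcache;
}

void *page_mapping_old(struct page *page)
{
    page = compound_head(page);
    if (PageSwapCache(page))
        return NULL;        /* simplified: the kernel returns the swap mapping */
    return page->mapping;
}

/* Folio-based helpers: the type already guarantees a head page. */
static bool folio_test_swapcache(struct folio *folio)
{
    return folio->page.swapcache;
}

void *folio_mapping_new(struct folio *folio)
{
    if (folio_test_swapcache(folio))
        return NULL;
    return folio->page.mapping;
}

/* The single conversion point from "any page" to "head page". */
struct folio *page_folio(struct page *page)
{
    return (struct folio *)compound_head(page);
}

unsigned long lookups(void) { return head_lookups; }
void reset_lookups(void) { head_lookups = 0; }
```

In this model, calling page_mapping_old twice on a tail page costs four head lookups, while converting once with page_folio and calling folio_mapping_new twice costs one. The real saving depends on the call path, as the mem_cgroup_move_account example shows.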
Here's an example where our current confusion between "any page"
and "head page" at least produces confusing behaviour, if not an
outright bug, isolate_migratepages_block():
page = pfn_to_page(low_pfn);
if (PageCompound(page) && !cc->alloc_contig) {
const unsigned int order = compound_order(page);
if (likely(order < MAX_ORDER))
low_pfn += (1UL << order) - 1;
goto isolate_fail;
}
compound_order() does not expect a tail page; it returns 0 unless it's
a head page. I think what we actually want to do here is:
if (!cc->alloc_contig) {
struct page *head = compound_head(page);
if (PageHead(head)) {
const unsigned int order = compound_order(head);
low_pfn |= (1UL << order) - 1;
goto isolate_fail;
}
}
Not earth-shattering; not even necessarily a bug. But it's an example
of the way the code reads is different from how the code is executed,
and that's potentially dangerous. Having a different type for tail
and not-tail pages prevents the muddy thinking that can lead to
tail pages being passed to compound_order().
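The PFN arithmetic in the quoted example is easy to check in isolation. The sketch below assumes, as the |= in the suggested fix also does, that a compound page of a given order starts at a PFN aligned to that order. skip_by_add and skip_by_or are hypothetical helper names for the two update rules.

```c
#include <assert.h>

/* Original rule: advance by the order reported for the page we landed on. */
unsigned long skip_by_add(unsigned long low_pfn, unsigned int order)
{
    assert(order < 8 * sizeof(unsigned long));
    return low_pfn + ((1UL << order) - 1);
}

/* Suggested rule: move to the last PFN of the aligned block of that order. */
unsigned long skip_by_or(unsigned long low_pfn, unsigned int head_order)
{
    assert(head_order < 8 * sizeof(unsigned long));
    return low_pfn | ((1UL << head_order) - 1);
}
```

Consider an order-9 compound page covering PFNs 0x200 to 0x3ff. Starting from the head page, both rules land on 0x3ff. Starting from the tail page at 0x2a5, the original code sees an order of 0 and skip_by_add(0x2a5, 0) stays at 0x2a5, so the scanner walks the remaining tail pages one by one. skip_by_or(0x2a5, 9) jumps straight to 0x3ff. Simply adding the head's order from a tail page would overshoot: skip_by_add(0x2a5, 9) is 0x4a4.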
This converts just parts of the core MM and the page cache.
willy/pagecache.git has a total of 209 commits. In the Linux 5.16 merge window, the author, Matthew Wilcox (Oracle) <willy@infradead.org>, first merged the basic part of the folio work, tagged folio-5.16. This merge contains 90 commits that change 74 files, with 2914 additions and 1703 deletions. Apart from foundational infrastructure such as the folio definition, it primarily touches memcg, filemap, and writeback.
The way folio-5.16 gradually replaces pages with folios is noteworthy. Given the sheer number of mm paths, replacing everything at once is impractical. A top-down approach would change pages to folios starting from where they are allocated and keep going until every page is replaced. That is unrealistic, because it would require modifying the entire mm directory in one go.
In folio-5.16, a bottom-up approach is adopted. Starting from a specific function in the mm paths, pages are replaced with folios, and all internal implementations use folios to form a "closure." Then, the caller functions are modified to pass folios as parameters. Once all the caller functions are updated, this "closure" gains an additional layer. However, some functions have numerous callers and cannot be changed immediately. In such cases, folio-5.16 provides a wrapper. Taking page_mapping and folio_mapping as an example:
First, the closure includes foundational components like folio_test_slab(folio) and folio_test_swapcache(folio), which then expands to folio_mapping. There are numerous callers of page_mapping. Although mem_cgroup_move_account can smoothly call folio_mapping, page_evictable still uses page_mapping. Thus, the expansion of the closure stops at this point.
struct address_space *folio_mapping(struct folio *folio)
{
    struct address_space *mapping;

    /* This happens if someone calls flush_dcache_page on slab page */
    if (unlikely(folio_test_slab(folio)))
        return NULL;

    if (unlikely(folio_test_swapcache(folio)))
        return swap_address_space(folio_swap_entry(folio));

    mapping = folio->mapping;
    if ((unsigned long)mapping & PAGE_MAPPING_ANON)
        return NULL;

    return (void *)((unsigned long)mapping & ~PAGE_MAPPING_FLAGS);
}

struct address_space *page_mapping(struct page *page)
{
    return folio_mapping(page_folio(page));
}

mem_cgroup_move_account(page, ...) {
    folio = page_folio(page);
    mapping = folio_mapping(folio);
}

page_evictable(page, ...) {
    ret = !mapping_unevictable(page_mapping(page)) && !PageMlocked(page);
}
Many of you may wonder: Is that all? Is it just a compound_head problem?
I had to study the LWN article A discussion on folios and the LPC 2021 File Systems MC to see how the experts talk about folios. It turns out that Matthew Wilcox's focus is not "The folio" but "Efficient Buffered I/O". Things are not that simple.
Folio-5.16 merged all of the FS-related code this time. The group experts mentioned that "Linux-mm community experts do not agree to replace all the pages with folios, and anonymous pages and slabs cannot be replaced in the short term." So I kept going through the Linux-mm mailing list.
Currently, pages in the page cache are 4 KB in size, and the huge pages that do appear there, such as code huge pages, are read-only. The reason why transparent huge pages in the page cache have not been implemented can be found in this LWN article. One of the reasons is that supporting read-write file THP makes page cache handling complex for buffer_head-based file systems.
• buffer_head
A buffer_head describes how a block on the block device maps to physical memory. A buffer_head usually covers 4 KB, so one buffer_head corresponds to one page. Some file systems use a smaller block size, such as 1 KB or 512 bytes. In that case a 4 KB page needs four or eight buffer_head structures to describe where its contents live on disk. This complicates multi-page reads and writes: the relationship between page and disk offset has to be obtained through get_block for each page, which is inefficient and complex.
• Iomap
iomap was originally extracted from XFS and is based on extents, naturally supporting multi-page functionality. This means that when processing multi-page reading and writing, only one translation is required to obtain the relationship between all pages and disk offsets. With iomap, the file system is isolated from the page cache. For example, both express size in bytes instead of the number of pages. Therefore, Matthew Wilcox recommends that any file system directly using the page cache should consider switching to iomap or netfs_lib. There may be other ways to isolate the file system and the page cache besides using folio, but scatter gather is not accepted due to its overly complicated abstraction.
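The difference between the two models can be approximated by counting mapping lookups. This is a rough sketch with assumptions of my own: struct extent, lookups_per_block, and lookups_per_extent are hypothetical and do not correspond to the buffer_head or iomap interfaces. They only model "one translation per block" against "one translation per extent".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical extent: a run of file bytes mapped to contiguous disk bytes. */
struct extent {
    unsigned long long file_off;
    unsigned long long disk_off;
    unsigned long long len;
};

/* Block-based model: one mapping lookup for every block the I/O covers. */
size_t lookups_per_block(size_t io_bytes, size_t block_size)
{
    assert(block_size > 0);
    return (io_bytes + block_size - 1) / block_size;
}

/* Extent-based model: one lookup for every extent the byte range touches. */
size_t lookups_per_extent(const struct extent *ext, size_t nr,
                          unsigned long long off, unsigned long long len)
{
    size_t n = 0;

    for (size_t i = 0; i < nr; i++) {
        /* Half-open ranges [off, off+len) and [file_off, file_off+len) overlap. */
        if (ext[i].file_off < off + len && off < ext[i].file_off + ext[i].len)
            n++;
    }
    return n;
}
```

In this model, a 4 KB page with 1 KB blocks needs four lookups and a 16-page read needs 64, while the same 64 KB read over a file stored in one contiguous extent needs a single lookup. Because the extent model works in bytes, it does not care how many pages back the range.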
This explains why folio was first implemented in XFS and AFS, as these file systems are based on iomap.
This also explains why FS developers strongly hope for the merging of folio. It would allow them to easily utilize larger pages in the page cache, making the file system's I/O more efficient.
buffer_head has some features that the current iomap does not have. The integration of folio enables the promotion of iomap, allowing block-based file systems to use iomap.
The main objection comes from Johannes Weiner, who acknowledged the issue with compound_head but believed that introducing such a significant change to fix the problem was not worth it. Additionally, he believed that anonymous pages did not require the optimizations provided by folios for file systems.
Unlike the filesystem side, this seems like a lot of churn for very little tangible value. And leaves us with an end result that nobody appears to be terribly excited about.
But the folio abstraction is too low-level to use JUST for file cache and NOT for anon. It's too close to the page layer itself and would duplicate too much of it to be maintainable side by side.
In the end, Johannes Weiner compromised after Kirill A. Shutemov, Michal Hocko, and other experts voiced their support for folios.
By the end of the community discussion, there were no remaining objections to the folio-5.15 code. It had missed the Linux 5.15 merge window, however, so it was merged intact as folio-5.16.
I think the problem with folio is that everybody wants to read in her/his hopes and dreams into it and gets disappointed when see their somewhat related problem doesn't get magically fixed with folio.
Folio started as a way to relief pain from dealing with compound pages. It provides an unified view on base pages and compound pages. That's it.
It is required ground work for wider adoption of compound pages in page cache. But it also will be useful for anon THP and hugetlb.
Based on adoption rate and resulting code, the new abstraction has nice downstream effects. It may be suitable for more than it was intended for initially. That's great.
But if it doesn't solve your problem... well, sorry...
The patchset makes a nice step forward and cuts back on mess I created on the way to huge-tmpfs.
I would be glad to see the patchset upstream.
--Kirill A. Shutemov
Everyone knows about the confusion surrounding struct page, but no one has set out to solve it. Everyone silently endures this long-standing problem, and the code is full of patterns like the following.
if (compound_head(page)) // do A;
else // do B;
The folio is not perfect. Perhaps people's expectations for it are too high, so a few people are disappointed with the final implementation of the folio. But most people think that the folio is an important step in the right direction. After all, there is still more work to be done.
For 5.17, we intend to convert various filesystems (XFS and AFS are ready; other filesystems may make it) and also convert more of the MM and page cache to folios. For 5.18, multi-page folios should be ready.
The 80% win is real, but appears to be an artificial benchmark (postgres startup, which isn't a serious workload). Real workloads (eg building the kernel, running postgres in a steady state, etc) seem to benefit between 0-10%.
Since folio-5.16 reduces the number of compound_head calls, micro-benchmarks that spend most of their time in the kernel should see a performance improvement. However, this has not been verified by actual testing.
After folio-5.18 adds support for multi-page folios, there is a theoretical improvement in I/O efficiency. We will have to wait and see for practical results.
The primary task for file system developers is to transition file systems that currently utilize buffer head to use iomap for I/O, particularly for block-based file systems.
Other developers should readily embrace folio. Any new features developed based on Linux 5.16 and later should make extensive use of folios and familiarize themselves with the associated APIs. The fundamental aspects of APIs, such as memory allocation and recycling, remain unchanged.