By Huaxiong Song, from the ApsaraDB RDS for MySQL kernel team
When InnoDB is compiled without UNIV_PFS_MEMORY, UT_NEW and the related macros call the plain new, delete, malloc, and free interfaces to allocate and release memory. When UNIV_PFS_MEMORY is defined, they go through the internally encapsulated ut_allocator, which adds memory-tracking information that can be inspected through Performance Schema (PFS) tables.
ut_allocator can also serve as the allocator of standard containers such as std::map, so that a container's internal memory is allocated through InnoDB and stays traceable. The following describes the allocation methods that ut_allocator provides.
#ifdef UNIV_PFS_MEMORY
#define UT_NEW(expr, key) ::new (ut_allocator<decltype(expr)>(key).allocate(1, NULL, key, false, false)) expr
...
#define ut_malloc(n_bytes, key) static_cast<void *>(ut_allocator<byte>(key).allocate(n_bytes, NULL, UT_NEW_THIS_FILE_PSI_KEY, false, false))
...
#else /* UNIV_PFS_MEMORY */
#define UT_NEW(expr, key) ::new (std::nothrow) expr
...
#define ut_malloc(n_bytes, key) ::malloc(n_bytes)
...
#endif
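To illustrate how a tracking allocator plugs into a standard container, here is a minimal sketch of an STL-compatible allocator that records live bytes against a counter. It is not ut_allocator: MemCounter is a hypothetical stand-in for a PFS key, and plain ::operator new replaces InnoDB's allocation path.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <new>
#include <utility>

// Hypothetical stand-in for a PFS key: tracks bytes currently allocated.
struct MemCounter {
  std::size_t live_bytes = 0;
};

// Minimal allocator that charges every allocation to a counter, in the
// spirit of ut_allocator<T>(key). Illustrative only.
template <typename T>
struct CountingAllocator {
  using value_type = T;
  MemCounter *counter;

  explicit CountingAllocator(MemCounter *c) noexcept : counter(c) {}
  // Rebinding constructor: containers allocate nodes, not T, so they
  // convert the allocator to other element types.
  template <typename U>
  CountingAllocator(const CountingAllocator<U> &other) noexcept
      : counter(other.counter) {}

  T *allocate(std::size_t n) {
    T *p = static_cast<T *>(::operator new(n * sizeof(T)));
    counter->live_bytes += n * sizeof(T);
    return p;
  }
  void deallocate(T *p, std::size_t n) noexcept {
    counter->live_bytes -= n * sizeof(T);
    ::operator delete(p);
  }
};

template <typename T, typename U>
bool operator==(const CountingAllocator<T> &a, const CountingAllocator<U> &b) {
  return a.counter == b.counter;
}
template <typename T, typename U>
bool operator!=(const CountingAllocator<T> &a, const CountingAllocator<U> &b) {
  return !(a == b);
}

using TrackedMap = std::map<int, int, std::less<int>,
                            CountingAllocator<std::pair<const int, int>>>;
```

When the map is destroyed the counter drops back to zero, showing that every node allocation and release went through the allocator, which is what lets container memory be attributed to a key.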
allocate(): when PFS memory instrumentation is enabled, an extra ut_new_pfx_t header is allocated in front of each request. The header stores information such as the key, the size, and the owner.
// Reserve room for the extra pfx header.
total_bytes += sizeof(ut_new_pfx_t);
// Allocate the memory (header + payload).
...
// Return the address just past the header.
return (reinterpret_cast<pointer>(pfx + 1));
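allocate() also retries failed allocations: it calls malloc() (or calloc() when zeroed memory is requested) and, if that fails, sleeps for one second and tries again, up to alloc_max_retries times.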
for (size_t retries = 1;; retries++) {
// Allocate with calloc() if zeroed memory is requested, otherwise malloc().
ptr = set_to_zero ? calloc(1, total_bytes) : malloc(total_bytes);
if (ptr != nullptr || retries >= alloc_max_retries) break;
std::this_thread::sleep_for(std::chrono::seconds(1));
}
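The retry loop can be sketched as a standalone function. This is an illustration, not InnoDB's code: the raw allocation function and the delay are injected (assumptions made so the loop can be exercised without real memory pressure), whereas allocate() calls malloc()/calloc() directly and sleeps one second.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <thread>

// Try raw_alloc up to alloc_max_retries times, sleeping between attempts.
void *alloc_with_retry(
    std::size_t total_bytes, bool set_to_zero, std::size_t alloc_max_retries,
    std::chrono::milliseconds delay,
    const std::function<void *(std::size_t, bool)> &raw_alloc) {
  void *ptr = nullptr;
  for (std::size_t retries = 1;; retries++) {
    ptr = raw_alloc(total_bytes, set_to_zero);
    // Stop on success, or once the retry budget is exhausted.
    if (ptr != nullptr || retries >= alloc_max_retries) break;
    std::this_thread::sleep_for(delay);
  }
  return ptr;
}

// Default raw allocator: calloc() when zeroing is requested, else malloc().
void *system_alloc(std::size_t n, bool set_to_zero) {
  return set_to_zero ? std::calloc(1, n) : std::malloc(n);
}
```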
deallocate(): the pfx trace is released first, and then the whole allocation is freed, starting from the header.
deallocate_trace(pfx);
free(pfx);
reallocate(): similar to allocate(), it recalculates the size (including the header) and moves the tracking information from the old ut_new_pfx_t to the new one (pfx_old to pfx_new).
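The header-prefix technique behind allocate(), deallocate(), and reallocate() can be sketched as follows. PfxHeader is a hypothetical, trimmed-down stand-in for ut_new_pfx_t (key and size only), and g_tracked_bytes stands in for the PFS accounting done by allocate_trace() and deallocate_trace().

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical header placed in front of every allocation.
struct PfxHeader {
  unsigned key;
  std::size_t size;  // total bytes, including this header
};

static std::size_t g_tracked_bytes = 0;  // stand-in for PFS accounting

void *pfx_allocate(std::size_t n_bytes, unsigned key) {
  std::size_t total_bytes = n_bytes + sizeof(PfxHeader);  // extra pfx memory
  auto *pfx = static_cast<PfxHeader *>(std::malloc(total_bytes));
  if (pfx == nullptr) return nullptr;
  pfx->key = key;
  pfx->size = total_bytes;
  g_tracked_bytes += total_bytes;  // like allocate_trace()
  return pfx + 1;                  // caller sees the memory after the header
}

void pfx_deallocate(void *ptr) {
  if (ptr == nullptr) return;
  PfxHeader *pfx = static_cast<PfxHeader *>(ptr) - 1;  // step back to header
  g_tracked_bytes -= pfx->size;                        // like deallocate_trace()
  std::free(pfx);
}

void *pfx_reallocate(void *ptr, std::size_t n_bytes, unsigned key) {
  if (ptr == nullptr) return pfx_allocate(n_bytes, key);
  PfxHeader *pfx_old = static_cast<PfxHeader *>(ptr) - 1;
  std::size_t old_size = pfx_old->size;  // read before realloc invalidates it
  std::size_t total_bytes = n_bytes + sizeof(PfxHeader);
  auto *pfx_new = static_cast<PfxHeader *>(std::realloc(pfx_old, total_bytes));
  if (pfx_new == nullptr) return nullptr;  // old block is still valid
  g_tracked_bytes = g_tracked_bytes - old_size + total_bytes;
  pfx_new->size = total_bytes;  // the key was copied along with the data
  return pfx_new + 1;
}
```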
allocate_large(): allocates the large memory used in buf_chunk_init() and adds pfx information. Note that memory obtained through mmap does not consume real physical memory until its pages are touched, and because it bypasses malloc, it cannot be tracked by tools such as jemalloc.
pointer ptr = reinterpret_cast<pointer>(os_mem_alloc_large(&n_bytes));
|->mmap()/shmget(), shmat(), shmctl()
...
allocate_trace(n_bytes, PSI_NOT_INSTRUMENTED, pfx);
deallocate_large(): the pfx trace is released, and then the large memory is freed.
deallocate_trace(pfx);
os_mem_free_large(ptr, pfx->m_size);
|->munmap()/shmdt()
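A minimal Linux-only sketch of the large-allocation pair, assuming the mmap() path: the request is rounded up to whole pages (an assumption suggested by os_mem_alloc_large() taking &n_bytes), and the mapped size is kept in a separate record, the hypothetical LargePfx, because munmap() needs it. The shmget()/shmat() path is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical out-of-band record, standing in for the pfx passed to
// allocate_large(): the size lives outside the mapped region.
struct LargePfx {
  std::size_t size;
};

// Round n up to the page size and map anonymous memory.
void *alloc_large(std::size_t n, LargePfx *pfx) {
  const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
  std::size_t rounded = (n + page - 1) / page * page;
  void *ptr = mmap(nullptr, rounded, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (ptr == MAP_FAILED) return nullptr;
  pfx->size = rounded;  // deallocation needs the exact mapped length
  return ptr;
}

void free_large(void *ptr, const LargePfx *pfx) { munmap(ptr, pfx->size); }
```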
The aligned_memory family, including aligned_pointer and aligned_array_pointer, is encapsulated separately in the code, but its underlying layer is still ut_alloc and ut_free, so we will not go into details here. One use is building the log_t structure: the aligned memory matches the sector size during I/O writes, which improves I/O efficiency.
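The idea behind an aligned pointer can be sketched with plain malloc(): over-allocate by alignment - 1 bytes, round the address up, and keep the raw pointer for free(). AlignedBuffer is a hypothetical helper; the real aligned_memory classes may differ in detail.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper: `alignment` must be a power of two.
struct AlignedBuffer {
  void *raw = nullptr;      // what malloc() returned; passed to free()
  void *aligned = nullptr;  // what callers use, e.g. for sector-sized I/O

  bool alloc(std::size_t size, std::size_t alignment) {
    raw = std::malloc(size + alignment - 1);
    if (raw == nullptr) return false;
    auto addr = reinterpret_cast<std::uintptr_t>(raw);
    // Round the address up to the next multiple of `alignment`.
    addr = (addr + alignment - 1) &
           ~(static_cast<std::uintptr_t>(alignment) - 1);
    aligned = reinterpret_cast<void *>(addr);
    return true;
  }

  void dealloc() {
    std::free(raw);
    raw = aligned = nullptr;
  }
};
```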
Like ut_allocator, mem_heap_allocator can be used as the allocator of STL containers. Note, however, that it only provides mem_heap_alloc() for allocating memory; it has no release, reuse, or merge operations.
class mem_heap_allocator {
...
pointer allocate(size_type n, const_pointer hint = nullptr) {
return (reinterpret_cast<pointer>(mem_heap_alloc(m_heap, n * sizeof(T)))); // mem_heap_alloc applies for memory.
}
void deallocate(pointer p, size_type n) {}  // Release and similar operations are no-ops.
...
};
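A sketch of an allocator in the style of mem_heap_allocator, where deallocate() does nothing. BumpArena is a hypothetical fixed-size stand-in for mem_heap_t (4 KB, no growth); the point is that memory a container gives back is only reclaimed when the arena itself goes away.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical bump arena: one fixed buffer, no reuse of released memory.
struct BumpArena {
  alignas(std::max_align_t) unsigned char buf[4096];
  std::size_t used = 0;

  void *alloc(std::size_t n) {
    constexpr std::size_t a = alignof(std::max_align_t);
    std::size_t rounded = (n + a - 1) & ~(a - 1);
    if (used + rounded > sizeof(buf)) throw std::bad_alloc();
    void *p = buf + used;
    used += rounded;
    return p;
  }
};

// allocate() forwards to the arena; deallocate() is intentionally empty.
template <typename T>
struct ArenaAllocator {
  using value_type = T;
  BumpArena *arena;

  explicit ArenaAllocator(BumpArena *a) noexcept : arena(a) {}
  template <typename U>
  ArenaAllocator(const ArenaAllocator<U> &o) noexcept : arena(o.arena) {}

  T *allocate(std::size_t n) {
    return static_cast<T *>(arena->alloc(n * sizeof(T)));
  }
  void deallocate(T *, std::size_t) noexcept {}  // no release, reuse, or merge
};

template <typename T, typename U>
bool operator==(const ArenaAllocator<T> &a, const ArenaAllocator<U> &b) {
  return a.arena == b.arena;
}
template <typename T, typename U>
bool operator!=(const ArenaAllocator<T> &a, const ArenaAllocator<U> &b) {
  return !(a == b);
}
```

After the vector below is destroyed, the arena's used count stays where it was, including the buffers abandoned at each vector reallocation. That is the trade-off described above.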
mem_heap_t is a non-empty linked list of memory blocks, in which mem_block_t blocks of different sizes are connected linearly. Let's focus on free_block and buf_block: to some extent, these two pointers determine where the data actually lives, and depending on the request type, data is stored in the memory pointed to by one of them. Allocating through mem_heap_t merges many small allocations into a few larger ones, so subsequent requests are served inside InnoDB from memory it has already obtained. This reduces the time and performance overhead of frequent malloc and free calls.
typedef struct mem_block_info_t mem_block_t;
typedef mem_block_t mem_heap_t;
...
/** The info structure stored at the beginning of a heap block */
struct mem_block_info_t {
...
UT_LIST_BASE_NODE_T(mem_block_t) base; /* Base node of the block list; defined only in the first block. */
UT_LIST_NODE_T(mem_block_t) list; /* Links this block into the block list. */
ulint len; /*!< The size of the current block. */
ulint total_size; /*!< The total size of all blocks. */
ulint type; /*!< The allocation type. */
ulint free; /*!< The available location of the current block. */
ulint start; /*!< The value of free when the block was created. (I haven't seen it used much.) */
void *free_block; /* For heaps whose type includes MEM_HEAP_BTR_SEARCH, the heap root keeps a spare block here to obtain more memory later; for other types, the pointer is null. */
void *buf_block; /* If the block's memory comes from the buffer pool, the owning buf_block_t pointer; otherwise null. */
};
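The following sketch models the block list and its bump allocation in simplified form. BlockHeap and HeapBlock are hypothetical; field names echo mem_block_info_t, but the growth rule (double the previous block size, capped at one 16 KB page) and the 8-byte rounding are assumptions made for the example. As with mem_heap_t, individual allocations are never freed; the destructor releases the blocks from last to first, mirroring mem_heap_free().

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Block header, followed in memory by `len` bytes of data.
struct HeapBlock {
  HeapBlock *prev;   // previous block in the list
  std::size_t len;   // usable bytes in this block
  std::size_t free;  // offset of the first free byte
  unsigned char *data() { return reinterpret_cast<unsigned char *>(this + 1); }
};

constexpr std::size_t kMaxBlock = 16 * 1024;  // e.g. one 16 KB page

class BlockHeap {
 public:
  BlockHeap() = default;
  BlockHeap(const BlockHeap &) = delete;
  BlockHeap &operator=(const BlockHeap &) = delete;

  // Like mem_heap_free(): release blocks from the last one back to the first.
  ~BlockHeap() {
    while (m_last != nullptr) {
      HeapBlock *prev = m_last->prev;
      std::free(m_last);
      m_last = prev;
    }
  }

  void *alloc(std::size_t n) {
    n = (n + 7) & ~std::size_t{7};
    if (m_last == nullptr || m_last->len - m_last->free < n) {
      if (!add_block(n)) return nullptr;
    }
    void *p = m_last->data() + m_last->free;
    m_last->free += n;  // bump the free offset; nothing is returned early
    return p;
  }

  std::size_t total_size() const { return m_total_size; }

 private:
  bool add_block(std::size_t n) {
    std::size_t len = m_last ? 2 * m_last->len : 64;
    if (len > kMaxBlock) len = kMaxBlock;
    if (len < n) len = n;  // an oversized request still gets a block
    void *mem = std::malloc(sizeof(HeapBlock) + len);
    if (mem == nullptr) return false;
    m_last = new (mem) HeapBlock{m_last, len, 0};
    m_total_size += len;
    return true;
  }

  HeapBlock *m_last = nullptr;
  std::size_t m_total_size = 0;
};
```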
mem_heap_t can be classified into the following types based on the source of the requested memory:
#define MEM_HEAP_DYNAMIC 0 /* The original request path: allocate through ut_allocator. */
#define MEM_HEAP_BUFFER 1 /* Obtain memory from the buffer pool. */
#define MEM_HEAP_BTR_SEARCH 2 /* Use memory in free_block. */
More combined allocation modes are defined on this basis, making memory allocation more flexible.
/** Different type of heaps in terms of which data structure is using them */
#define MEM_HEAP_FOR_BTR_SEARCH (MEM_HEAP_BTR_SEARCH | MEM_HEAP_BUFFER)
#define MEM_HEAP_FOR_PAGE_HASH (MEM_HEAP_DYNAMIC)
#define MEM_HEAP_FOR_RECV_SYS (MEM_HEAP_BUFFER)
#define MEM_HEAP_FOR_LOCK_HEAP (MEM_HEAP_BUFFER)
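Because MEM_HEAP_DYNAMIC is 0, it cannot be tested as a bit; the other types are bit flags combined with |. The sketch below copies the values defined above and adds hypothetical helper predicates (not InnoDB functions) that show how a type is decoded. The BUFFER bit only permits buffer-pool allocation: as mem_heap_create_block() shows later, small blocks are still obtained with malloc.

```cpp
#include <cassert>

// Values copied from the #define lines above.
constexpr unsigned long MEM_HEAP_DYNAMIC = 0;
constexpr unsigned long MEM_HEAP_BUFFER = 1;
constexpr unsigned long MEM_HEAP_BTR_SEARCH = 2;
constexpr unsigned long MEM_HEAP_FOR_BTR_SEARCH =
    MEM_HEAP_BTR_SEARCH | MEM_HEAP_BUFFER;
constexpr unsigned long MEM_HEAP_FOR_PAGE_HASH = MEM_HEAP_DYNAMIC;
constexpr unsigned long MEM_HEAP_FOR_LOCK_HEAP = MEM_HEAP_BUFFER;

// Hypothetical helpers. DYNAMIC is 0, so it is checked by equality.
constexpr bool is_dynamic(unsigned long type) {
  return type == MEM_HEAP_DYNAMIC;
}
// The BUFFER bit allows large blocks to come from the buffer pool.
constexpr bool may_use_buffer_pool(unsigned long type) {
  return (type & MEM_HEAP_BUFFER) != 0;
}
// The BTR_SEARCH bit makes later large blocks come from the root's free_block.
constexpr bool may_use_free_block(unsigned long type) {
  return (type & MEM_HEAP_BTR_SEARCH) != 0;
}

static_assert(MEM_HEAP_FOR_BTR_SEARCH == 3, "BTR_SEARCH | BUFFER == 3");
```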
mem_heap_create_func() builds a memory heap from the input size and heap type. If size is 0, the default MEM_BLOCK_START_SIZE (64 bytes) is used. The internal block construction logic shows that a single mem_block allocated from the buffer pool is at most one page (UNIV_PAGE_SIZE), which is generally 16 KB.
To create mem_heap_t, you first need to build the root node of the linked list described earlier, which is done by the block creation function mem_heap_create_block(). Its first parameter is heap=nullptr, which indicates that the block is the first node of mem_heap_t. When the type contains the MEM_HEAP_BTR_SEARCH bit, construction may fail; the detailed logic and the reasons for failure are presented later.
After the first block is created, it is set as the base node and the linked list is initialized, which produces the root node of mem_heap_t.
mem_heap_t *mem_heap_create_func(ulint size, ulint type) {
mem_block_t *block;
if (!size) {
size = MEM_BLOCK_START_SIZE;
}
// Create the first block of mem_heap. The first parameter specified is nullptr.
block = mem_heap_create_block(nullptr, size, type, file_name, line);
// In the MEM_HEAP_BTR_SEARCH mode, there is a possibility that the construction fails and a null pointer is returned.
if (block == nullptr) {
return (nullptr);
}
// Due to the possibility of BP resizing, the first block cannot be obtained from BP.
ut_ad(block->buf_block == nullptr);
// Initialize the base node of the linked list. If the base is not null, the node is marked as the base node.
UT_LIST_INIT(block->base, &mem_block_t::list);
UT_LIST_ADD_FIRST(block->base, block);
return (block);
}
As mentioned earlier, if the type includes the MEM_HEAP_BTR_SEARCH bit, data may be stored in the memory unit held by free_block. When the heap is freed, this free_block must therefore be released separately first, and then the blocks on the mem_heap_t linked list are released one by one in reverse order.
void mem_heap_free(mem_heap_t *heap) {
...
// Obtain the last node in the linked list.
block = UT_LIST_GET_LAST(heap->base);
// Release the free_block node that is created in the MEM_HEAP_BTR_SEARCH mode.
if (heap->free_block) {
mem_heap_free_block_free(heap);
}
// Release blocks one by one in reverse order.
while (block != nullptr) {
/* Store the contents of info before freeing current block
(it is erased in freeing) */
prev_block = UT_LIST_GET_PREV(list, block);
mem_heap_block_free(heap, block);
block = prev_block;
}
}
mem_heap_create_block() is the core of mem_heap_t memory allocation: it applies a different allocation strategy for each heap type, as shown below:
// case 1
if (type == MEM_HEAP_DYNAMIC || len < UNIV_PAGE_SIZE / 2) {
ut_ad(type == MEM_HEAP_DYNAMIC || n <= MEM_MAX_ALLOC_IN_BUF);
block = static_cast<mem_block_t *>(ut_malloc_nokey(len));
} else {
len = UNIV_PAGE_SIZE;
// case 2
if ((type & MEM_HEAP_BTR_SEARCH) && heap) {
// Obtain the memory from the free block of the heap root.
buf_block = static_cast<buf_block_t *>(heap->free_block);
heap->free_block = nullptr;
if (UNIV_UNLIKELY(!buf_block)) {
return (nullptr);
}
} else {
// case 3
buf_block = buf_block_alloc(nullptr);
}
block = (mem_block_t *)buf_block->frame;
}
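The three cases work as follows:
Case 1: for MEM_HEAP_DYNAMIC heaps, or when the requested block is smaller than half a page, the block is allocated with ut_malloc_nokey().
Case 2: if the type contains the MEM_HEAP_BTR_SEARCH bit and the heap already exists, the block length is set to one page and the memory is taken from the free_block held by the heap root.
Case 3: otherwise, a whole page is allocated from the buffer pool with buf_block_alloc().
Setting heap->free_block=nullptr in Case 2 ensures that the free block of the root node is never reused. This also explains why allocation may fail when the type contains the MEM_HEAP_BTR_SEARCH bit: once the spare free_block has been consumed, or if none was provided, the next request in Case 2 finds a null pointer and mem_heap_create_block() returns nullptr instead of falling back to the buffer pool.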
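The case selection can be modeled without real memory by reporting which source would be used. pick_source(), HeapRoot, and Source are hypothetical names; the conditions follow the code excerpt above, with a 16 KB page assumed.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of the three cases in mem_heap_create_block().
constexpr unsigned long kHeapDynamic = 0;
constexpr unsigned long kHeapBuffer = 1;
constexpr unsigned long kHeapBtrSearch = 2;
constexpr std::size_t kPageSize = 16 * 1024;  // assumed UNIV_PAGE_SIZE

enum class Source { Malloc, RootFreeBlock, BufferPool, Failed };

struct HeapRoot {
  bool has_free_block = false;  // models heap->free_block != nullptr
};

// `heap` is nullptr when the first block of a new heap is being created.
Source pick_source(HeapRoot *heap, std::size_t len, unsigned long type) {
  // Case 1: dynamic heaps and small blocks use ut_malloc_nokey().
  if (type == kHeapDynamic || len < kPageSize / 2) return Source::Malloc;
  // Case 2: BTR_SEARCH heaps take the root's spare block, at most once.
  if ((type & kHeapBtrSearch) != 0 && heap != nullptr) {
    bool had_block = heap->has_free_block;
    heap->has_free_block = false;  // heap->free_block = nullptr
    return had_block ? Source::RootFreeBlock : Source::Failed;
  }
  // Case 3: a whole page from the buffer pool via buf_block_alloc().
  return Source::BufferPool;
}
```

The second large BTR_SEARCH request fails because the first one consumed the spare block; nothing falls back to the buffer pool.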
Next, mem_heap_create_block() sets the header fields of the new block, including len, type, and free. This article focuses on how buf_block and free_block are set, which is quite subtle.
UNIV_MEM_FREE(block, len);
UNIV_MEM_ALLOC(block, MEM_BLOCK_HEADER_SIZE);
block->buf_block = buf_block;
block->free_block = nullptr;
The first two lines mark the block's memory as free and then mark its header as allocated, preparing for the initialization of len and the other header fields. The last two lines record where the memory came from: in Case 1, buf_block is null because the memory came from ut_malloc_nokey(), while in Cases 2 and 3 it points to the buffer-pool block that backs this mem_block. In every case, the new block's own free_block starts out null.
The final form of the memory structure in Case 2/3 is the same, except that the structure in Case 2 is converted from free_block to buf_block, while that in Case 3 is directly applied from BP. The free_block parameter is generally specified during the construction of mem_heap_t.
It can be seen that whether in Case 1, Case 2, Case 3, or a combination of different cases, the data can be set correctly through the modification of buf_block and free_block.
In addition to plain alloc/free calls, the SQL layer mainly uses the MEM_ROOT structure to reduce the time and resource cost of memory operations. This article focuses on MEM_ROOT.
As a general-purpose memory management object, MEM_ROOT is widely used at the SQL layer; for example, structures such as THD and TABLE_SHARE contain one as their memory allocator. MEM_ROOT itself only manages memory; the memory is actually held in blocks. MEM_ROOT directly references only one block, the current one, and each block holds a pointer to the previous block, so all blocks form a linked list.
Unlike mem_heap_t, summarized in 1.2.1, which is implemented separately in InnoDB and handles memory allocation at the InnoDB layer, MEM_ROOT handles memory allocation at the SQL layer. The structures and implementations of the two are nevertheless similar.
The basic MEM_ROOT constructor is very simple: it only assigns m_block_size, m_orig_block_size, and m_psi_key. In addition, a MEM_ROOT can take over the memory held by another MEM_ROOT through its move constructor and move assignment operator. The logic is as follows:
// Move constructor.
MEM_ROOT(MEM_ROOT &&other)
noexcept
: m_xxx(other.m_xxx),
...{
other.m_xxx = nullptr/0/origin_value; // Reset the source to an empty state.
...
}
// Move assignment.
MEM_ROOT &operator=(MEM_ROOT &&other) noexcept {
Clear();
::new (this) MEM_ROOT(std::move(other));
return *this;
}
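The move pattern above can be reproduced on a small hypothetical class. MoveOnlyArena owns at most one heap block; its move constructor steals the pointer and resets the source, and its move assignment first calls Clear() and then re-constructs itself in place from the source, as MEM_ROOT's operator= does (with an extra self-move check added here). Because Clear() has already released everything, reusing the storage without running the destructor leaks nothing.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <utility>

// Hypothetical arena that follows MEM_ROOT's move pattern.
class MoveOnlyArena {
 public:
  explicit MoveOnlyArena(std::size_t size)
      : m_block_size(size), m_orig_block_size(size) {}

  // Move constructor: take over the block, then reset the source.
  MoveOnlyArena(MoveOnlyArena &&other) noexcept
      : m_block(other.m_block),
        m_block_size(other.m_block_size),
        m_orig_block_size(other.m_orig_block_size) {
    other.m_block = nullptr;
    other.m_block_size = other.m_orig_block_size;
  }

  // Move assignment: release our memory, then re-construct in place.
  MoveOnlyArena &operator=(MoveOnlyArena &&other) noexcept {
    if (this == &other) return *this;  // extra guard, not in MEM_ROOT
    Clear();
    ::new (this) MoveOnlyArena(std::move(other));
    return *this;
  }

  MoveOnlyArena(const MoveOnlyArena &) = delete;
  MoveOnlyArena &operator=(const MoveOnlyArena &) = delete;
  ~MoveOnlyArena() { Clear(); }

  // Lazily allocate the single block and return it.
  char *Alloc() {
    if (m_block == nullptr) m_block = new char[m_block_size];
    return m_block;
  }

  void Clear() {
    delete[] m_block;
    m_block = nullptr;
    m_block_size = m_orig_block_size;
  }

  char *block() const { return m_block; }
  std::size_t block_size() const { return m_block_size; }

 private:
  char *m_block = nullptr;
  std::size_t m_block_size;
  std::size_t m_orig_block_size;
};
```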
Alloc() returns the next free address in the current block for the requested size and updates the memory usage information. If the current block does not have enough free space, AllocSlow() is called to allocate and manage a new block. Note that the returned address is always 8-byte aligned.
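The rounding and the fast path of Alloc() can be sketched as follows. my_align() uses the standard power-of-two round-up trick; using sizeof(double) as the alignment unit is an assumption consistent with the 8-byte alignment described above. alloc_fast() is a hypothetical helper that returns nullptr where the real code would fall back to AllocSlow().

```cpp
#include <cassert>
#include <cstddef>

// Round `a` up to a multiple of `l`; `l` must be a power of two.
constexpr std::size_t my_align(std::size_t a, std::size_t l) {
  return (a + l - 1) & ~(l - 1);
}

// Assumed alignment unit: sizeof(double), 8 bytes on common platforms.
constexpr std::size_t align_size(std::size_t a) {
  return my_align(a, sizeof(double));
}

// Carve `length` bytes (rounded) from the current block, or return nullptr
// so the caller can take the slow path.
char *alloc_fast(char *&free_start, char *free_end, std::size_t length) {
  length = align_size(length);
  if (static_cast<std::size_t>(free_end - free_start) < length) return nullptr;
  char *p = free_start;
  free_start += length;
  return p;
}
```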
AllocSlow() allocates a new block. Underneath, it uses one of two allocation strategies depending on the request, and the memory addresses it returns are also aligned.
void *MEM_ROOT::AllocSlow(size_t length) {
// The memory applied is very large or an exclusive memory is required.
if (length >= m_block_size || MEM_ROOT_SINGLE_CHUNKS) {
Block *new_block =
AllocBlock(/*wanted_length=*/length, /*minimum_length=*/length);
if (new_block == nullptr) return nullptr;
if (m_current_block == nullptr) {
new_block->prev = nullptr;
m_current_block = new_block;
m_current_free_end = new_block->end;
m_current_free_start = m_current_free_end;
} else {
// Insert the new block in the second-to-last position.
new_block->prev = m_current_block->prev;
m_current_block->prev = new_block;
}
return pointer_cast<char *>(new_block) + ALIGN_SIZE(sizeof(*new_block));
} else { // Normal conditions.
if (ForceNewBlock(/*minimum_length=*/length)) {
return nullptr;
}
char *new_mem = m_current_free_start;
m_current_free_start += length;
return new_mem;
}
}
AllocBlock() is the basic block allocation function. Underneath, it calls my_malloc(), and the memory is accounted for according to the PSI key and the PFS settings. my_malloc() and my_free() are briefly described later.
A large request may fail when it would exceed the MEM_ROOT's capacity limit and the corresponding error flag is set. AllocBlock() takes both wanted_length and minimum_length, so in some cases it can fall back to a smaller block that still satisfies minimum_length. After each block allocation, m_block_size is multiplied by 1.5, so later blocks are larger and fewer block allocations are needed.
ForceNewBlock() implements the second (normal) allocation path of AllocSlow(). It calls AllocBlock() to allocate a new block, links it at the end of the block list, and makes it the current block managed by MEM_ROOT.
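Putting Alloc(), AllocSlow(), ForceNewBlock(), and the 1.5x growth of AllocBlock() together gives the following simplified sketch. MiniRoot is hypothetical: it omits PSI accounting, capacity limits, minimum_length handling, and MEM_ROOT_SINGLE_CHUNKS, and it uses malloc() instead of my_malloc().

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

class MiniRoot {
 public:
  struct Block {
    Block *prev;  // older block
    char *end;    // one past the last usable byte
  };

  explicit MiniRoot(std::size_t block_size) : m_block_size(block_size) {}
  MiniRoot(const MiniRoot &) = delete;
  MiniRoot &operator=(const MiniRoot &) = delete;
  ~MiniRoot() {
    while (m_current_block != nullptr) {
      Block *prev = m_current_block->prev;
      std::free(m_current_block);
      m_current_block = prev;
    }
  }

  // Fast path: bump-allocate from the current block.
  void *Alloc(std::size_t length) {
    length = (length + 7) & ~std::size_t{7};
    if (static_cast<std::size_t>(m_current_free_end - m_current_free_start) >=
        length) {
      char *p = m_current_free_start;
      m_current_free_start += length;
      return p;
    }
    return AllocSlow(length);
  }

  std::size_t block_size() const { return m_block_size; }
  Block *current_block() const { return m_current_block; }

 private:
  static constexpr std::size_t kHeader = (sizeof(Block) + 7) & ~std::size_t{7};

  Block *AllocBlock(std::size_t length) {
    void *mem = std::malloc(kHeader + length);
    if (mem == nullptr) return nullptr;
    Block *b = new (mem) Block{nullptr, static_cast<char *>(mem) + kHeader + length};
    m_block_size += m_block_size / 2;  // grow by 1.5x after every block
    return b;
  }

  bool ForceNewBlock(std::size_t minimum_length) {
    std::size_t want = m_block_size > minimum_length ? m_block_size : minimum_length;
    Block *b = AllocBlock(want);
    if (b == nullptr) return true;
    b->prev = m_current_block;  // the new block becomes the current one
    m_current_block = b;
    m_current_free_start = reinterpret_cast<char *>(b) + kHeader;
    m_current_free_end = b->end;
    return false;
  }

  void *AllocSlow(std::size_t length) {
    if (length >= m_block_size) {  // dedicated block for a large request
      Block *b = AllocBlock(length);
      if (b == nullptr) return nullptr;
      if (m_current_block == nullptr) {
        b->prev = nullptr;
        m_current_block = b;
        m_current_free_start = m_current_free_end = b->end;  // already full
      } else {
        b->prev = m_current_block->prev;  // second-to-last position
        m_current_block->prev = b;
      }
      return reinterpret_cast<char *>(b) + kHeader;
    }
    if (ForceNewBlock(length)) return nullptr;
    char *p = m_current_free_start;
    m_current_free_start += length;
    return p;
  }

  Block *m_current_block = nullptr;
  char *m_current_free_start = nullptr;
  char *m_current_free_end = nullptr;
  std::size_t m_block_size;
};
```

The large request does not replace the current block: it is linked behind it, so the 48 bytes still free in the current block serve the next small Alloc().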
Clear() has simple execution logic: it frees every block in the list and resets the MEM_ROOT to its initial state, including restoring m_block_size to its original value.
ClearForReuse() is useful when the memory allocated so far is no longer needed but the MEM_ROOT will be used again, and you do not want to go through the block allocation process again. Unlike Clear(), which frees all blocks, ClearForReuse() keeps the current block and frees the others, so after the call only that one block remains in the list, ready to be reused from its start. In the exclusive-memory scenario (MEM_ROOT_SINGLE_CHUNKS), however, ClearForReuse() simply calls Clear().
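The difference between Clear() and ClearForReuse() can be shown on a hypothetical arena whose blocks are linked through prev pointers, with the newest block current. ReusableRoot is a sketch, not MEM_ROOT: Alloc() only bump-allocates from the current block, and AddBlock() is called explicitly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

class ReusableRoot {
 public:
  struct Block {
    Block *prev;
    std::size_t size;  // usable bytes after the header
  };

  ReusableRoot() = default;
  ReusableRoot(const ReusableRoot &) = delete;
  ReusableRoot &operator=(const ReusableRoot &) = delete;
  ~ReusableRoot() { Clear(); }

  // Push a new block and make it current.
  void AddBlock(std::size_t size) {
    void *mem = std::malloc(sizeof(Block) + size);
    if (mem == nullptr) throw std::bad_alloc();
    m_current = new (mem) Block{m_current, size};
    m_used = 0;
  }

  // Bump-allocate from the current block; nullptr if it does not fit.
  char *Alloc(std::size_t n) {
    if (m_current == nullptr || m_current->size - m_used < n) return nullptr;
    char *p = reinterpret_cast<char *>(m_current + 1) + m_used;
    m_used += n;
    return p;
  }

  void Clear() {  // free every block
    FreeBlocks(m_current);
    m_current = nullptr;
    m_used = 0;
  }

  void ClearForReuse() {  // free all but the current block, then rewind it
    if (m_current == nullptr) return;
    FreeBlocks(m_current->prev);
    m_current->prev = nullptr;
    m_used = 0;
  }

  std::size_t block_count() const {
    std::size_t n = 0;
    for (Block *b = m_current; b != nullptr; b = b->prev) ++n;
    return n;
  }
  Block *current() const { return m_current; }

 private:
  static void FreeBlocks(Block *b) {
    while (b != nullptr) {
      Block *prev = b->prev;
      std::free(b);
      b = prev;
    }
  }

  Block *m_current = nullptr;
  std::size_t m_used = 0;  // bytes used in the current block
};
```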
MEM_ROOT allocations are aligned: upper-layer interfaces such as Alloc() round the requested length up. However, MEM_ROOT also provides interfaces for "non-standard" operations, such as Peek() and RawCommit(), which operate directly on the underlying blocks. Such operations are rare, and the length is rounded again the next time an interface such as Alloc() is used.
MEM_ROOT is used heavily at the SQL layer, for example in THD, THD::transactions, Prepared_statement, TABLE_SHARE, sp_head, and table_mapping. Taking the common THD scenario as an example, this article briefly introduces how MEM_ROOT is used at the SQL layer.
THD contains three MEM_ROOT members (objects and pointers): main_mem_root, user_var_events_alloc, and mem_root.
main_mem_root: a MEM_ROOT object that is destroyed together with THD and is mainly used to store the parsing and runtime data involved in executing SQL statements.
This memory root is used for two purposes: - for conventional queries, to allocate structures stored in main_lex during parsing, and allocate runtime data (execution plan, etc.) during execution. - for prepared queries, only to allocate runtime data. The parsed tree itself is reused between executions and thus is stored elsewhere.
THD::THD(bool enable_plugins)
: Query_arena(&main_mem_root, STMT_REGULAR_EXECUTION),
...
lex_returning(new im::Lex_returning(false, &main_mem_root)),
... {
main_lex->reset();
set_psi(nullptr);
mdl_context.init(this);
init_sql_alloc(key_memory_thd_main_mem_root, &main_mem_root,
global_system_variables.query_alloc_block_size,
global_system_variables.query_prealloc_size);
...
}
mem_root: this pointer points to main_mem_root when THD is initialized, but it changes in practice. Code temporarily redirects mem_root to the MEM_ROOT of another object so that allocations go there, and afterwards restores it to main_mem_root.
Q: Why is mem_root designed as a changeable object? Why is the memory pointer of mem_root embedded into THD?
A: This design makes it easy to control the memory lifecycle. If thd->mem_root always pointed to main_mem_root, the corresponding memory would persist until THD is destructed. By redirecting the mem_root pointer, temporarily used memory can be released separately from long-lived memory. Embedding the mem_root pointer in THD (more precisely, in its parent class Query_arena) also gives clearer statistics on the memory attributed to THD and simplifies management: although this memory is produced during statement execution rather than by THD itself, the "responsibility" is assigned to THD. It also simplifies parameter passing, because functions can receive THD instead of an additional MEM_ROOT parameter.
// THD::THD() initializes Query_arena(&main_mem_root, STMT_REGULAR_EXECUTION),
// so thd->mem_root starts out pointing to main_mem_root.
...
MEM_ROOT *old_mem_root = thd->mem_root; // Save the original mem_root (main_mem_root).
thd->mem_root = xxx_mem_root; // Usually a temporary MEM_ROOT.
// Do something that allocates memory.
...
thd->mem_root = old_mem_root; // Restore the original mem_root (main_mem_root).
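The manual save/restore above is easy to get wrong on early returns. MySQL also provides guard classes for this, such as Swap_mem_root_guard and Thd_mem_root_guard listed below. A minimal RAII version might look like this; Thd, MemRoot, and MemRootSwapGuard are hypothetical stand-ins, not the MySQL types.

```cpp
#include <cassert>

// Stand-ins for MEM_ROOT and THD; only the pointer swap matters here.
struct MemRoot {
  const char *name;
};

struct Thd {
  MemRoot main_mem_root{"main"};
  MemRoot *mem_root = &main_mem_root;  // starts at main_mem_root
};

// Point thd->mem_root at a temporary root for one scope, and always restore
// the previous pointer when the scope ends (including early returns).
class MemRootSwapGuard {
 public:
  MemRootSwapGuard(Thd *thd, MemRoot *temp)
      : m_thd(thd), m_old(thd->mem_root) {
    m_thd->mem_root = temp;
  }
  ~MemRootSwapGuard() { m_thd->mem_root = m_old; }
  MemRootSwapGuard(const MemRootSwapGuard &) = delete;
  MemRootSwapGuard &operator=(const MemRootSwapGuard &) = delete;

 private:
  Thd *m_thd;
  MemRoot *m_old;
};
```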
Temporary replacement of mem_root occurs in the following locations. Thanks to the design of MEM_ROOT, such as move construction, memory statistics continue to use the original PSI_MEMORY_KEY, so replacing mem_root does not make the statistics complex or confusing.
// sql/dd_table_share.cc
open_table_def()
// sql/sp_head.cc
sp_parser_data::start_parsing_sp_body() &&
sp_parser_data::finish_parsing_sp_body()
// sql/sp_instr.cc PSI_NOT_INSTRUMENTED
LEX *sp_lex_instr::parse_expr()
// sql/sql_cursor.cc
Query_result_materialize::start_execution()
// sql/sql_table.cc
rm_table_do_discovery_and_lock_fk_tables()
drop_base_table()
lock_check_constraint_names()
// sql/thd_raii.h The type and where it is called (sql/auth/sql_auth_cache.cc:grant_load()).
class Swap_mem_root_guard;
// sql/auth/sql_authorization.cc
mysql_table_grant() // Stores table-level and column-level privileges.
mysql_routine_grant() // Stores routine-level privileges.
/* sql/dd/upgrade_57/global.h storage/ndb/plugin/ndb_dd_upgrade_table.cc
The type and where it is called. */
class Thd_mem_root_guard
user_var_events_alloc: the MEM_ROOT pointer used to allocate Binlog_user_var_event array elements in THD; it usually points to the same MEM_ROOT as thd->mem_root.
MySQL has put significant effort into optimizing memory allocation, usage, and management. Each module has its own memory allocation and management system, and their design and usage policies are worth learning from.
In the latest 8.0 versions, ut_allocator has been removed from the InnoDB layer, and the corresponding memory allocation and release code has been rewritten as template functions. mem_heap_t effectively reduces memory fragmentation, which makes it suitable for scenarios where small amounts of memory are allocated many times within a short period. However, mem_heap_t does not free memory while it is in use, so some memory is wasted when a block becomes idle.
MEM_ROOT is the most commonly used memory allocator at the SQL layer. Like mem_heap_t, it faces block fragmentation, but its design provides the ClearForReuse() interface to release previously occupied memory promptly. MEM_ROOT also handles the exclusive-memory and large-memory scenarios: a large request gets its own block, inserted behind the current block, so the current block remains available for subsequent allocations. Furthermore, the flexible use of the mem_root pointer in the THD structure offers new ideas for memory usage and is worth learning from.