An Introduction to MySQL Parallel Query and DDL

This article describes the basic usage, effects, principles, and restrictions of the parallel acceleration technologies provided by MySQL at the InnoDB layer.

By Ke Yu

1. Overview

As the amount of data on user tables increases, the execution time of some SQL statements becomes increasingly longer. For example, for some SQL queries that require a full table scan, the data scan process takes a certain amount of time. The larger the table size, the longer the scan time. In addition, some DDL operations to create secondary indexes and rebuild tables will scan the whole table and perform sort operations. The larger the table size, the longer the scan and sort time.

To improve SQL query performance, MySQL Community Edition introduced the InnoDB parallel scan feature in version 8.0.14, which can control the number of threads for the parallel scan of clustered indexes through innodb_parallel_read_threads variables. In version 8.0.27, parallel index creation is supported and the number of parallel threads for creating secondary indexes can be controlled through innodb_ddl_threads to accelerate the index creation process. Parallel processing in MySQL is currently limited to the InnoDB layer for full table operations.

Based on the 8.0.37 code, this article provides some basic introduction and principle analysis of the parallel technologies at the InnoDB layer.

2. Usage

2.1 Parameter Introduction

innodb_parallel_read_threads

This parameter controls the number of threads for the parallel scan of primary key indexes. Currently, only a parallel scan of primary key indexes is supported, and that of secondary indexes is not supported. If you set this parameter to a value greater than 1, the parallel scan is enabled. The maximum value is 256, which is also the maximum value of the sum of parallel threads for all user connections. When the value reaches the limit, the user connection falls back to a single-threaded scan.

innodb_ddl_threads

This parameter specifies the number of threads for parallel sort during secondary index creation, as well as the number of threads for parallel sort and parallel construction of the B+ tree during table rebuilds.

innodb_ddl_buffer_size

This parameter represents the total size of the sort buffer used during DDL operations. Memory sort is done by parallel scan threads, and the size of the sort buffer allocated to each parallel thread is innodb_ddl_buffer_size/innodb_parallel_read_threads. Therefore, when innodb_parallel_read_threads is increased, innodb_ddl_buffer_size also needs to be increased.

There are three involving parameters, which are summarized in the following table.

Parameter	Description	Dynamic modification	Scope	Default value	Minimum value	Maximum value
innodb_parallel_read_threads	Number of threads for parallel scans of primary key indexes	Supported	Connection level	4	1	256
innodb_ddl_threads	Number of parallel threads during DDL operations	Supported	Connection level	4	1	64
innodb_ddl_buffer_size	Size of the sort buffer used by DDL operations	Supported	Connection level	1048576	65536	4294967295

2.2 SQL Statements Supporting Parallel Operations

• SELECT COUNT(*) FROM table1;

The SELECT COUNT(*) statement supports parallel scans of primary key indexes but does not support scans of secondary indexes.

• CHECK TABLE table1;

During the execution of the CHECK TABLE statement, the primary key index is scanned twice. When the primary key index is scanned for the second time, the parallel scan is supported.

• CREATE INDEX index1 ON table1 (col1); ALTER TABLE table1 ADD INDEX index1 (col1);

When you execute the CREATE INDEX statement, parallel operations are supported during the scan and sort of primary key indexes. However, they are not supported during the B+ tree construction stage. Virtual column indexes, full-text indexes, and sparse indexes are not supported.

• ALTER TABLE table1 ENGINE=INNODB; OPTIMIZE TABLE table1; (hereafter referred to as REBUILD TABLE. optimize table = rebuild table + analyze table. At the REBUILD TABLE stage, parallel operations are supported.)

When the table is rebuilt, parallel operations are supported during the sorting stage and the construction of secondary indexes. During the scan of primary key indexes, parallel operations are not supported.

The following table summarizes the parallel support for the preceding SQL statements. Some SQL statements that require a full table scan of primary key indexes and DDL statements that create secondary indexes (CREATE INDEX) support parallel operations. Note that these SQL statements do not support parallel operations at all stages.

SQL Statements Supporting Parallel Operations	Parallel Scan	Parallel Sort	Parallel Construction of B+ Tree Indexes
SELECT COUNT(*) FROM table1;	Supported	NULL	NULL
CHECK TABLE table1;	Supported	NULL	NULL
CREATE INDEX index1 ON table1 (col1); ALTER TABLE table1 ADD INDEX index1 (col1);	Supported	Supported	Not supported
ALTER TABLE table1 ENGINE=INNODB; OPTIMIZE TABLE table1;	Not supported	Supported	Supported

3. Parallel Query

MySQL's parallel query is completed at the InnoDB layer, which is actually a parallel scan process of the B+ tree.

The InnoDB layer provides an index scan interface row_scan_index_for_mysql for a full record scan of indexes. The SELECT COUNT(*) and CHECK TABLE statements share this interface for parallel scan of indexes.

row_scan_index_for_mysql  // Full record scan of indexes
|-> row_mysql_parallel_select_count_star // Parallel count(*)
|    |-> Parallel_reader reader; // Create a reader
|    |-> reader.add_scan(count_callback); // Set the callback function for the COUNT(*) statement to perform pre-sharding.
|    |-> reader.run // Start parallel scan
|
|-> parallel_check_table // Parallel CHECK TABLE
|    |-> Parallel_reader reader; // Create a reader
|    |-> reader.add_scan(check_callback); // Set the callback function for the CHECK TABLE statement to perform pre-sharding.
|    |-> reader.run // Start parallel scan

At the stage of scanning primary key indexes for creating secondary indexes, a parallel scan is also used. The implementation is the same except for a different scan entry.

3.1 Parallel `COUNT(*)`

First, COUNT(*) is used as an example to describe how to perform a single-threaded scan.

How does a single thread scan a B+ tree? Let's briefly review the structure of the B+ tree. The above figure shows a 3-layer B+ tree. The first layer is the root node page, and the third layer is the leaf node. Its characteristics are as follows:

InnoDB organizes data through the B+ tree. Non-leaf nodes hold pointer information, and leaf nodes hold all records in order.
The upper and lower layers are connected by one-way pointers.
Pages at the same level are connected by a doubly-linked list.
The records within the page are connected by a one-way linked list to form an ordered linked list in ascending order.
Starting from the first record of the leftmost leaf node, the full record scan of the B+ tree can be completed in order by traversing the B+ tree backward.

Before parallel queries, the basic execution process of a single-threaded SELECT COUNT(*) full table at the InnoDB layer is as follows:

First access the minimum record of the root node (ID=1), then locate its leftmost child node based on the record, and then continue to locate the minimum record of the leaf node (ID=1).
During positioning, add INDEX S lock, ROOT PAGE S lock, NO LEAF PAGE S lock, and LEAF PAGE S lock in sequence. Release all S locks except LEAF PAGE S lock after positioning.
Scan from the minimum record in the B+ tree to the maximum record. During the scanning process, MVCC visibility is checked for each record. If it is visible, the row record is processed. For COUNT(*) statements, the counter is incremented by 1.
After scanning the B+ tree, the counter results are returned to the SERVER layer.

In the scanning process described above, the row_search_mvcc function completes the minimum record positioning and acquires the next record. During the scanning process, only a single LEAF PAGE S lock is held. Each time a record is scanned within the page, the page will be locked and unlocked:

After one page is scanned, the next page will be scanned. The preceding process is repeated until the last record.

InnoDB provides the parallel scan feature to accelerate that process. If you need to perform a full record scan on the B+ tree, you can use the parallel scan feature.

Parallel Scan

A B+ tree is composed of multiple subtrees, and the parallel scan is performed by dividing the B+ tree into multiple subtrees. The scanning of each subtree is a parallel task that is handled by multiple threads. The specific process is as follows:

A user thread initiates a COUNT(*) statement and then enters the parallel scan logic of the InnoDB layer.
The user thread pre-shards the B+ tree (divides into subtrees) and puts the shards into the task queue. In the preceding figure, the B+ tree is divided into two shards. Each shard consists of 2-tuples, which record the start record and the end record of the subtree. Here, the user thread first performs coarser-grained sharding to reduce the burden on the user thread, and then the work thread will perform finer-grained sharding.
The user thread enables multiple work threads and waits for the work thread to end (the user thread is not working during this time).
The work thread takes the shards from the task queue in sequence.
If the sharding granularity is large, the work thread divides the shard into shards with smaller granularity again and puts the divided shards into queues.
The work thread executes scan tasks on each shard in turn.
After the parallel thread is executed, the results are summarized and returned to the SERVER layer.

Sharding Process

Each shard (Range) represents a set of primary key records with consecutive logic. It consists of two Iters forming a left-closed and right-open interval [start, end).

struct RANGE {
  Iter start;
  Iter end;
}

Each Iter marks the location of a record.

struct Iter {
...
  const rec_t *m_rec{}; // The record that marks the boundary of the shard
  btr_pcur_t *m_pcur{}; // The B+ tree cursor that marks the page number of the record
...
}

The key to parallel scan lies in how to perform sharding and how to lock the B+ tree during the sharding process. It is necessary to ensure that the shards are as evenly distributed as possible and that the shard boundaries are continuous and non-overlapping.

The sharding strategy of the user thread is to pre-shard based on the number of subtrees of the root node. For example, if the root node contains N subtrees (N records), it is divided into N shards. The sharding process is to locate the leaf node row records based on the node pointer of ROOT. The pre-sharding process is as follows:

1. INDEX S lock;

2. ROOT PAGE S lock. The whole sharding process holds the INDEX S lock and ROOT S lock to ensure that the ROOT

PAGE record is not changed and no new subtree is generated.
Access the first record of ROOT.
Locate the leaf node record based on the record, and add an S lock to the PAGE along the process.
Create a shard based on the leaf node record as the starting position of the shard and use the record as the ending position of the previous shard.
Release the PAGE locks along the process and retain the ROOT and INDEX locks.
Access the second record of the ROOT PAGE. Return to step 2.a to repeat the above process.

3. After the records of ROOT PAGE are traversed, release the ROOT S lock and INDEX S lock to complete the pre-sharding.

Assuming that the B+ tree is relatively balanced, pre-sharding can divide the B+ tree more evenly.

Why does the work thread need to perform additional sharding? How does the work thread determine that the sharding granularity is large enough to be divided again?

The initial design of parallel scan has the issue of large sharding granularity, which would lead to insufficient utilization of parallel threads for very large trees. Assume that there are four parallel scan threads and a B+ tree with five subtrees, the user thread divides it into five shards. Then, the first four subtrees can be scanned in parallel, and one thread needs to scan two subtrees. Assuming that it takes 1 minute to scan each subtree, the total parallel scan time would be 2 minutes.

The solution is to divide the B+ tree into subtrees with smaller granularity to improve the utilization of parallel threads. The user thread performs pre-sharding, and the work thread will divide some shards again. This way, the total scanning time in the above example can be reduced to 1 minute and 15 seconds (5min/4).

The optimized sharding strategy is:

Pre-sharding: The user thread pre-shards based on the number of subtrees on the root node. Additionally, the residual shards (the remainder of the number of shards minus the number of parallel threads) are marked as needing to be divided again. For example, in the following figure, there are four parallel work threads and a B+ tree with five subtrees. The user thread pre-shards into five shards. Then, the fifth shard is marked as the shard that needs to be divided again. Then, the pre-shards are put into the queue in sequence, and the shards that are put first are first fetched by the work thread.
The first four shards are fetched by the work thread and executed in parallel.
After the first round of tasks for a worker thread ends, if it acquires the fifth shard and finds that the shard needs further subdivision, it will divide the subtree into multiple smaller subtrees.

When sharding again, the sharding process and locking logic are basically the same as those of the ROOT PAGE. However, the subtree structure may change because the INDEX and ROOT PAGE S locks are released during the two sharding processes. Therefore, it cannot be divided directly by the number of subtrees of a shard but needs to be divided by the RANGE of the shard.

The following content describes the details:

The following figure locates the record at leaf node ID=301 as a boundary point.
Get a record after the 301 of the SUB TREE ROOT and locate the leaf node as a boundary point.
By the same token, the final boundary points are divided into 301, 401, 501, and null.

After secondary sharding, the tree is divided into two layers. The root node is divided once, and the subtree of the root node is divided again. The B+ tree is divided twice at most since it is short and wide without a very high level. The large table can be divided into many subtrees after being divided twice, which can improve the utilization of parallel threads.

Scanning Process

The scanning process of each work thread on the subtree shard [start, end) is basically the same as the process of scanning the entire B+ tree with a single thread:

1. Restore the leaf node page corresponding to the cursor of start.

Optimistically add an S lock to the leaf node.
If optimistic locking fails, it is necessary to relocate the record from ROOT and add an S lock to the leaf node.

2. Scan the records within the page, check the MVCC visibility for each record, and construct the old version of the data.

3. Reach the maximum record of the page, obtain the next page, and add an S lock to the next page.

4. Release the S lock of the page and repeat the preceding scanning process for the next page until the end of the shard is scanned.

Each thread only holds a single LEAF PAGE S lock during the scanning process. After the S lock is added to the page, it processes all records within the page at one time, and then unlocks the page to process the next page. This is more efficient than locking and unlocking each record within the page during a single-threaded scan.

Callback Function

parallel_work
|--> process one row
|    |--> callback
|    |    |--> counter++

Parallel scan has designed a set of common interfaces. During parallel scan, the row records are processed through callback functions. Each work thread calls the callback function once every time a row of records is scanned.

For the COUNT(*) statement, the callback function increments 1 to the thread-level counter. After the parallel scan ends, all counters are accumulated to obtain the result.

Parallel Effect

Let's take a look at the effect of COUNT(*) parallel scan. We test it through a Sysbench table containing 400 million rows of records with the following configuration:

Sysbench tests a single table with 400 million rows of data, which is about 95GB in size.
innodb_buffer_pool_size=128GB (pure memory scenario).
cpu: Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz, 52 cores
os: 4.19.91-013.ali4000.alios7.x86_64
Disk: SSD
MySQL: 8.0.37
Before the test, preheat multiple times to load all data into the buffer pool. During the test, the buffer pool hit rate is 100%.

• The parallel effect is evident. When the number of threads increases between 2 and 32, each time the number of threads doubles, the time consumed decreases by half, showing a linear downward trend.

• However, under high concurrency, there is no linear trend, and issues with the utilization of concurrent threads still exist.

• When the number of threads increases from 1 to 2, the time becomes 1/4 of the original. The reason is that when the number of threads is 1, it does not use a single worker thread for scanning but reverts to the original serial scanning logic of the user thread. After each row is scanned, the row_search_mvcc function (1600 rows) is used to locate the next row. The row_search_mvcc function is less efficient in this scenario, as many variable definitions and if conditions do not need to be executed, and the locking efficiency is low. The parallel work thread scans and locates the next line of records with only 100 lines of code and has higher locking efficiency, making it more efficient than row_search_mvcc.

Performance Fallback

In official versions 8.0.14 to 8.0.36, the COUNT(*) statement supports parallel scans of primary key indexes instead of parallel scans of secondary indexes or single-threaded scans. Even if the indexes selected by the optimizer are secondary indexes, InnoDB forces primary key indexes to perform parallel scans. This results in a performance fallback: The time required to scan a primary key index by using multiple threads may be longer than that required to scan a secondary index by using a single thread.

Why is there a performance fallback? The time complexity of single-threaded primary key scanning and single-threaded secondary index scanning is o(n). In a pure memory scenario, only the COUNT++ operation is performed on each row of data, and no performance fallback occurs. The following table describes the time complexity.

However, when the buffer pool is small, the time is mainly spent on reading data pages into the buffer pool. Because the primary key index occupies much more disk space than the secondary index, the I/O times for scanning the primary key are much higher than that for scanning the secondary index, so multi-threaded scanning of the primary key index is not as fast as single-threaded scanning of the secondary index. The larger the table's primary key index is, the more serious the performance impact will be.

The preceding table shows the test results of a 95GB Sysbench table with a buffer pool configuration of 6GB. Among these results, parallel scanning with 16 threads on the primary key index outperforms single-threaded scanning of the secondary index. The larger the table's primary key index is, the more serious the performance impact will be.

AliSQL 8.0.25 discovered and fixed this issue. If the specified innodb_parallel_read_threads is 0, the index selected by the optimizer is used for scanning.

This issue is officially fixed in 8.0.37, where the primary key index is no longer forced to execute COUNT(*) statements. When the optimizer selects a secondary index, a single-threaded scan is performed; when the optimizer selects a primary key index, a parallel scan is performed.

release note: "MySQL no longer ignores the Optimizer hint to use a secondary index scan, which instead forced a clustered (parallel) index scan.".

3.2 Parallel CHECK TABLE

This section describes how to execute the CHECK TABLE statement at the InnoDB layer. When the CHECK TABLE statement is executed at the InnoDB layer, it traverses each index of the table and performs the following checks on each index in turn:

1. The first index scan verifies the correctness of the B+ tree structure, scanning layer by layer from the root layer to the leaf layer.

• Starting from the leftmost node of the current level, traverse to the rightmost node in sequence.

• Check page consistency: FLAG checks, PAGE_MAX_TRX_ID checks, PAGE directory, and RECORD verification. It will traverse from the minimum record to the maximum record within the page.

• Check the direction of the node pointer.

• Check the direction of the parent node.

• Check the order of records between adjacent pages: the maximum record of the node is smaller than the minimum record of its right node.

• …

2. The second index scan is a row-by-row record scan, traversing from the minimum record in the B+ tree to the maximum record, and comparing the size of adjacent records in turn to see if they meet requirements.

In this example, the first index scan is single-threaded. During the second scan, if the index is a primary key index, the parallel scan is supported. For the secondary index, the single-threaded scan is performed.

Parallel Scan

The implementation of the CHECK TABLE parallel scan is the same as that of the SELECT COUNT(*) statement. The difference lies in the callback function.

COUNT*(*): The work thread counts each subtree, and the counts are aggregated at the end.

CHECK TABLE: The work thread compares the order of leaf node records within each subtree. There is no check for the order between subtrees because the order between pages is checked during the first scan.

When the CHECK TABLE work thread scans the subtree, a callback function is executed every time a row of primary key records is scanned. The logic of the callback function is to compare the order of the current row with the previous row.

If it is the leftmost record of a shard, set pre_record = nullptr.
If the pre_record pointer is not null, compare whether the size of current_record and pre_record is as expected. If not, an error is returned; otherwise, continue to scan.
Set pre_record = current_record.
Return a success report.

Parallel Effect

First, let's look at the effect of the CHECK TABLE parallel scan. A Sysbench table containing 400 million rows of records is used for testing. The test configuration is the same as that in COUNT(*). A Sysbench native table contains one primary key index and one secondary index. Each index is scanned twice when you execute the CHECK TABLE statement.

• When the number of threads increases from 1 to 2, the time consumed is significantly reduced. However, the benefit here is not brought by parallel scans. The reason is the same as the scenario where the number of threads in COUNT(*) changes from 1 to 2: the original single-threaded scan efficiency is poor.

• After the number of threads increases from 2, the effect is improved but not obvious, indicating that the majority of time consumption during CHECK TABLE is on the first scan.

Optimizable Points

After viewing the parallel process of CHECK TABLE, we can find some areas that can be optimized.

Parallel operation can be performed at the index level. Parallel check between indexes is not supported currently. If there are a large number of secondary indexes, the checking process will be time-consuming. You can add a logic for parallel check at the index level.
During the first scan of indexes in a CHECK TABLE operation, it can theoretically be divided into multiple subtrees for parallel check. However, there are two additional considerations: 1) non-leaf nodes also need to be scanned; 2) subtree boundary nodes also need to be checked.
During the first scan of indexes in a CHECK TABLE operation, all records have been traversed. During the traversal process, the order between records can be compared, and the order of adjacent pages has also been checked. It seems unnecessary to scan the index again for the second time to compare the order of all records.
Check whether the parallel operation can be supported during the second scan of the secondary index in the CHECK TABLE operation. It seems unnecessary to support it because the main focus here is on scanning the primary key index.

4. Parallel DDL

4.1 Parallel Secondary Index Creation

First, we briefly describe the execution process of creating a secondary index statement at the InnoDB layer, which can be divided into three stages:

1. Scan: Scan the clustered index from the minimum record to the maximum record in turn to build the secondary index row records, and then perform a few memory sort operations on the secondary index row records. If the memory is full, write the records to the temporary files.

2. Sort: Sort the temporary files.

3. Build: Insert the sorted data into the secondary index tree.

The granularity and implementation of parallel operations in each stage are different:

The first stage can be accelerated by the parallel scan described above, with parallel granularity at the subtree level, and the number of parallel scan threads can be configured through the innodb_parallel_read_threads parameter. The buffer size for each thread memory sort is innodb_ddl_buffer_size/ innodb_parallel_read_threads.
The second stage can use multiple threads for parallel sort, and the parallel granularity is the number of temporary sort files. The number of parallel sort threads can be configured by the innodb_ddl_threads parameter.
In the third stage, the process of building the sorted data into a secondary index tree is executed by a single thread.

ddl::Context::build // Create a secondary index and rebuild the table entry.
|-> ddl::Loader::build_all
|   |-> scan_and_build_indexes // There are multiple indexes when rebuilding the table.
|   |   |-> scan // The primary key index is scanned in parallel during CREATE INDEX and is scanned in a single thread when rebuilding the table.
|   |   |   |-> Parallel_reader.add_scan(bulk_inserter) // The callback function for scanning each row
|   |   |   |-> Parallel_reader.run // Parallel scan
|   |   |-> load // Multi-threaded sort and build
|   |   |   |-> mt_execute
|   |   |   |   |-> Loader::Task::operator()
|   |   |   |   |   |-> Builder::merge_sort  // Sort，Builder::State::SORT
|   |   |   |   |   |
|   |   |   |   |   |-> Builder::btree_build // Build，Builder::State::BTREE_BUILD

Parallel Scan

During the parallel scan, each work thread will create a temporary file and a sort buffer. Each time the work thread scans a row of primary key records, it will call a callback function. The callback function is responsible for:

Build a secondary index row record based on the primary key record.
Write the row record to the thread-level buffer.
If the buffer is not full, return directly.
After the buffer is full, use the merge sort algorithm to sort the records in the buffer. After sorting, an ordered list is obtained and is written to a temporary file. Then clear the buffer.

The following figure shows the process of writing the first ordered list to a temporary file. The size of an ordered list is the buffer size.

After work thread 1 scans a shard, there are multiple ordered lists on temporary file 1.

The work thread continuously retrieves shards from the queue for scanning. During the scanning process, ordered lists are written to corresponding temporary files. Assume that the innodb_parallel_read_threads is 3, then there are three parallel scan threads and three temporary files in total. After all shards are scanned in parallel, three partially ordered files will be obtained, each of which consists of multiple ordered lists.

Summary

Here is a summary of the parallel scan. The parallel scan work thread executes a callback function every time it scans a row of primary key records. The corresponding callback functions are designed for different upper-level SQL statements to complete the corresponding parallel tasks. The parallel scan execution processes of COUNT(*) full table, CHECK TABLE, and secondary index creation are the same. The parallel scan thread is controlled by the innodb_parallel_read_threads parameter. The main difference between the three lies in the callback function, which is summarized in the following table.

	COUNT(*) Full Table	CHECK TABLE	Secondary Index Creation
Callback Function	Increment the count by 1.	Check the sequence of the current record and the previous record.	Write records to the sort file and do a small amount of sorting.

Theoretically, any process that requires scanning a B+ tree can use this set of parallel interfaces to divide the tree into multiple subtrees for parallel scan, such as:

aggregate functions other than COUNT(*) (sum and avg)
WHERE statements
range query (start and end)
analyze table

Parallel Sort

The sort and build of DDL are completed as tasks by the work thread. The work thread takes the tasks from the task queue for execution and determines whether to sort or build based on the task status.

DDL_parallel_work
|-> Loader::Task::operator() // Execute the task.
|   |-> Builder::merge_sort  // Sort，Builder::State::SORT
|   |
|   |-> Builder::btree_build // Build，Builder::State::BTREE_BUILD

After the parallel scan is completed, three partially ordered temporary files are obtained, and the collection of the three files constitutes all records of a secondary index. The user thread will create three sort tasks (in fact, sort tasks may also be created by the work thread, which is simplified here) and set the task status to Builder::State::SORT.

The parallel granularity is at the temporary file level, and each temporary file corresponds to a sort task. After the user thread creates a work thread, it obtains tasks from the queue for execution. The work thread continuously retrieves tasks from the queue to execute. This is different from the parallel scan, where the user thread is not used as the worker thread.

It can be found that the number of parallel sort tasks is determined by the number of temporary files, while the number of temporary files is determined by the number of parallel scan threads. In the preceding section, it is assumed that the innodb_parallel_read_threads is 3. Here, even if the innodb_ddl_threads is set to 6, there are only three tasks, and the number of DDL sort threads actually executed is also three. If you set the innodb_ddl_threads to 2, then one DDL thread will process two temporary files. Therefore, it is recommended to set innodb_parallel_read_threads and innodb_ddl_threads both to the same value.

The DDL thread file sort process uses the merge sort algorithm, which creates a new temporary file to assist the sort. Assuming that temporary file 1 contains four ordered lists, a file containing two ordered lists is obtained after the first round of merge sort, and a globally ordered temporary file is obtained after two rounds.

After all parallel sort tasks are completed, three ordered temporary files are obtained. These three files are no longer merged and sorted into one file.

Single-threaded Build

After all DDL work threads complete parallel sort, three ordered temporary files are obtained, and the three files are not merged and sorted into one ordered file. The three temporary files are constructed as a priority queue, adding a build task, and the task status is set to Builder::State::BTREE_BUILD. The build task is then executed by a work thread, which takes records from the priority queue in order and writes them to the B+ tree row by row to complete the build process. The specific build process is not described in this article.

Parallel Effect

To see the effect of parallel indexing, a Sysbench table containing 40 million rows of records (about 9.5GB in size) was used for testing. The remaining configuration is the same as that in section 3.1. Test SQL to add a secondary index to the pad column of the Sysbench table:

create index idx1 on sbtest1 (pad);

To observe the parallel effect of each stage, several control experiments were conducted.

innodb_parallel_read_threads	innodb_ddl_threads	Time consumed (seconds)	Difference from the previous row
1	1	174.3	NULL
2	1	114.1	60.2
4	1	82.8	31.3
8	1	69.3	13.5
16	1	64.3	5
32	1	63.7	0.6

To analyze the effect of the parallel scan in the first stage, the above table sets innodb_ddl_threads to 1 and innodb_ddl_buffer_size to 4GB. Setting this value as large as possible can make all data sorted in memory, which can minimize the time required for sorting files in the second stage, and reduce the interference with the parallel scan in the first stage. After increasing innodb_parallel_read_threads:

• The time consumed has decreased, indicating that the parallel scan in the first stage is effective, and the time consumed in the second and third stages is not significant.

• The time consumed does not decrease linearly with the increase of innodb_parallel_read_threads, indicating that the second and third stages will consume a portion of the time.

• The time difference between each two experiments basically decreases linearly with the increase of innodb_parallel_read_threads, indicating that the parallel scan effect in the first stage is in line with expectations.

• Unlike COUNT(*) and CHECK TABLE parallel scans, when innodb_parallel_read_threads is 1, a single parallel work thread is used to scan the index, so the efficiency of COUNT(*) in single concurrency is not poor.

innodb_parallel_read_threads	innodb_ddl_threads	Time consumed (seconds)
1	1	168.3
1	2	170.2
1	4	173.7
1	8	171.8

To analyze the effect of the parallel sort in the second stage, the above table sets innodb_parallel_read_threads to 1 and innodb_ddl_buffer_size to 32MB. After increasing innodb_ddl_threads, it can be found that DDL parallel threads do not work. The reason will be explained later.

innodb_parallel_read_threads	innodb_ddl_threads	Time consumed (seconds)	Difference from the previous row
16	1	182	NULL
16	2	137.7	44.3
16	4	117.5	20.2
16	8	109.1	8.4
16	16	105.5	3.6
16	32	108.3	-2.8

To further analyze the effect of the parallel sort in the second stage, the above table sets innodb_parallel_read_threads to 16, so that the parallel scan time in the first stage is as small as possible, reducing the interference to the second stage.

The innodb_ddl_buffer_size is 32MB, and the buffer size allocated to each parallel thread is 32MB/16=2MB. Since the buffer is used for memory sort, setting a relatively small buffer size can reduce the amount of memory sorting, thus increasing the time spent on file sorting tasks. This makes it easier to observe the performance improvement of parallel sort (it can simulate large table scenarios online, where creating a secondary index requires extensive file sorting). Then gradually increase the innodb_ddl_threads.

• The time difference between each two experiments basically decreases linearly as innodb_ddl_threads increases, indicating that parallel sort in the second stage is in line with expectations.

4.2 Parallel Table Rebuild

The following section describes the execution process of the InnoDB REBUILD TABLE statement, which can be divided into three stages:

1. Scan: Traverse the B+ tree from the first record of the primary key B+ tree to the last record, skipping empty records. Perform the following operations after scanning each row:

Directly write the primary key records to the new primary key index B+ tree in order (the primary key is ordered and does not need to be sorted again).
For the secondary index to be rebuilt, build the secondary index row record, write the record to the sort buffer, and then write to the temporary file when the buffer is full.

2. Sort: Sort the secondary index temporary files.

3. Build: Insert the sorted secondary index data into the secondary index tree.

Compared with the CREATE INDEX statement, there is an additional operation of 1.a to build a primary key B+ tree, and multiple secondary indexes may be created. If only one secondary index is rebuilt, steps 1.b, 2, and 3 are a CREATE INDEX operation.

When rebuilding a table, whether each stage supports parallel operations and the granularity of parallel operations are quite different from those of creating secondary indexes in parallel.

One of the main purposes of rebuilding a table is to rebuild the primary key index B+ tree, reclaim holes, and shrink the table space. The primary key B+ tree itself is ordered, so there is no need to sort it in parallel. It is only need to scan the primary key index serially in ascending order and then insert the primary key row records into the new B+ tree to obtain a compact primary key B+ tree.

It seems unnecessary to parallelize the process of rebuilding primary key indexes. MySQL itself does the same: primary key indexes cannot be scanned in parallel when rebuilding tables, and the innodb_parallel_read_threads parameter does not take effect on rebuilding tables.

The side effect of this is that for a single secondary index to be rebuilt, multiple temporary files cannot be written in parallel during the scan stage. In addition, you cannot perform parallel sort on a single index during the sort stage.

However, the sort and build stages of secondary indexes still support parallel operations, but the granularity becomes the index level. Assume that there are four secondary indexes to be rebuilt, the sort and build of the four indexes can be processed in parallel by DDL parallel threads.

Single-threaded Scan

The scanning process for rebuilding the primary key B+ tree is completed by the user thread in a single thread. Assuming that the original table has one primary key index and three secondary indexes, the scanning process creates a temporary sort file for each secondary index, resulting in a total of three temporary files. Each time a row of primary key records is scanned, the following operations are performed:

Insert the primary key record into the new primary key B+ tree.
Build three corresponding secondary index row records based on the primary key record and write them to three temporary files respectively.

After the single-threaded scan is completed, a new compact primary key B+ tree and three partially ordered temporary files are obtained, each of which corresponds to a set of secondary index records:

Parallel Sort

The parallel sort of REBUILD TABLE is the same as that of CREATE INDEX in implementation except that the parallel granularity is different.

As mentioned earlier, when creating a secondary index, the number of parallel tasks for multi-threaded sorting is determined by the innodb_parallel_read_threads, and the corresponding number of temporary files is created for a single index, which is sorted in parallel within the index. If a single-threaded scan is used to rebuild the table, the innodb_parallel_read_threads is 1. Then, there is only one temporary file for a single secondary index. In this parallel sort architecture, parallel sort cannot be performed inside the secondary index.

However, multiple secondary indexes support parallel operations. In other words, the granularity of parallel operations is at the index level. The temporary files of the three indexes here are added to the task queue as three sort tasks. The work thread takes out the tasks to sort the temporary files. The task queue is as follows: each task corresponds to an index and a partially ordered temporary file.

Assuming that the innodb_ddl_threads is configured to 3, then there will be three threads in parallel, each thread corresponding to an index and a temporary file. Each thread will merge and sort the temporary files. After that, three files are obtained, and each file records all the records of the corresponding secondary index in sequence.

Parallel Build

There is no need to create a new work thread here. The work thread created in the previous step will not exit. The thread still retrieves the task from the task queue for execution, but the task is changed from sorting to building. In the previous step, each sort task corresponds to an index. After each sort task is completed, a build task is automatically added and put into the task queue with the task status set to Builder::State::BTREE_BUILD. Three build tasks are added here, each of which corresponds to the build of an index.

Each DDL thread is responsible for scanning the build of an index, reading row records in sequence, and building a new compact secondary index B+ tree row by row. Finally, after the process of rebuilding the table, one new compact primary key B+ tree and three new compact secondary index B+ trees are obtained.

Summary

Parallel build and sort of REBUILD TABLE and parallel sort and single-threaded build of CREATE INDEX have similar implementations, the difference lies in the parallel granularity and parallel task. Both parallel build and sort of REBUILD TABLE are performed at the index granularity, and the number of threads is determined by the innodb_ddl_threads. The summary is as follows.

	Number of parallel threads	Parallel granularity
Parallel sort stage for creating secondary indexes	min (innodb_parallel_read_threads, innodb_ddl_threads)	The granularity is a single temporary file. The number of temporary files is determined by innodb_parallel_read_threads.
Build stage for creating secondary indexes	Single work thread	Single work thread
Parallel sort stage for rebuilding the table	min (the number of secondary indexes in the original table, innodb_ddl_threads)	The granularity is INDEX.
Parallel build stage for rebuilding the table	min (the number of secondary indexes in the original table, innodb_ddl_threads)	The granularity is INDEX.

Parallel Effect

Let's take a look at the effect of the parallel rebuild of the table. A Sysbench table containing 40 million rows of records (about 9.5GB in size) was used for testing. The innodb_ddl_buffer_size is set to 65,536, and the rest of the configuration is the same as that in section 3.1. The test includes two types of tables:

1. Sysbench native table: contains one primary key index and one secondary index.

2. Sysbench table with two additional indexes: contains one primary key index and three secondary indexes. Add two secondary indexes to the pad and c columns based on the original Sysbench table:

create index idx1 on sbtest1 (pad);
create index idx2 on sbtest1 (c);

The test SQL statement is a REBUILD TABLE statement:

ALTER TABLE sbtest1 ENGINE = INNODB;

	innodb_parallel_read_threads	innodb_ddl_threads	Time consumed (seconds)
Sysbench native table: contains one primary key index and one secondary index	1	3	186.6
	2	3	185.8
	4	3	187.3
Sysbench table with two additional indexes: contains one primary key index and three secondary indexes	1	1	849.5
	1	2	602.3
	1	3	534.9
	1	4	533.3

• It can be found that the innodb_parallel_read_threads parameter does not result in performance improvement when rebuilding the table.

• The innodb_ddl_threads parameter improves performance by enabling parallel processing for REBUILD TABLE statements that contain multiple secondary indexes. The more secondary indexes there are, the higher the degree of parallelism.

5. Summary

This article describes the basic usage, effects, principles, and restrictions of the parallel acceleration technologies provided by MySQL at the InnoDB layer. Some queries and DDL statements support parallel scan, parallel sort, and parallel construction of the B+ tree. All of these technologies are available in AliSQL. Feel free to use them. MySQL currently supports a limited number of parallel statements. We expect the community to launch more supported parallel statements in the future. The AliSQL team will also add features on parallel queries and DDL, and will support parallel scans of secondary indexes in the future. Below are some practical suggestions.

• We recommend that you set the appropriate innodb_parallel_read_threads, innodb_ddl_threads, and innodb_ddl_buffer_size parameters based on the instance specifications and load.

• Since parallel scan significantly improves the effect of the COUNT(*) full table statement, you can set an appropriate innodb_parallel_read_threads parameter based on the table size.

• As parallel scan has little effect on improving CHECK TABLE statements, it is recommended to set the innodb_parallel_read_threads value to 2 during CHECK TABLE.

• For statements that create secondary indexes, parallel scan and parallel sort can significantly improve the performance. We recommend that you set the innodb_parallel_read_threads and innodb_ddl_threads to the same value. You can set applicable values based on the table size. When you increase innodb_parallel_read_threads, you'd better increase innodb_ddl_buffer_size simultaneously.

• For REBUILD TABLE statements, parallel operations significantly improve the performance. You can sort and rebuild files with indexes as the parallel granularity. You can set the innodb_ddl_threads value based on the number of original table indexes. There is no need to set the innodb_parallel_read_threads parameter and the innodb_ddl_buffer_size parameter can be set based on memory usage.

References

[01] https://github.com/mysql/mysql-server
[02] http://mysql.taobao.org/monthly/2020/11/03/
[03] http://mysql.taobao.org/monthly/2019/03/05/
[04] http://mysql.taobao.org/monthly/2021/03/03/
[05] http://mysql.taobao.org/monthly/2019/12/05/
[06]https://dev.mysql.com/doc/refman/8.4/en/online-ddl-parallel-thread-configuration.html
[07] https://dev.mysql.com/blog-archive/mysql80-innodb-parallel-threads-ddl/
[08] http://mysql.taobao.org/monthly/2019/12/05/

Community

An Introduction to MySQL Parallel Query and DDL

1. Overview

2. Usage

2.1 Parameter Introduction

innodb_parallel_read_threads

innodb_ddl_threads

innodb_ddl_buffer_size

2.2 SQL Statements Supporting Parallel Operations

3. Parallel Query

3.1 Parallel COUNT(*)

Parallel Scan

Sharding Process

Scanning Process

Callback Function

Parallel Effect

Performance Fallback

3.2 Parallel CHECK TABLE

Parallel Scan

Parallel Effect

Optimizable Points

4. Parallel DDL

4.1 Parallel Secondary Index Creation

Parallel Scan

Summary

Parallel Sort

Single-threaded Build

Parallel Effect

4.2 Parallel Table Rebuild

Single-threaded Scan

Parallel Sort

Parallel Build

Summary

Parallel Effect

5. Summary

References

Read previous post:

Read next post:

ApsaraDB

You may also like

Comments

ApsaraDB

Related Products

PolarDB for MySQL

Time Series Database (TSDB)

Database Security Solutions

AnalyticDB for MySQL

3.1 Parallel `COUNT(*)`