By Guoqing Ma (Yiyun)
When you get started to learn about databases, you will always be asked a question: What is the biggest advantage of database systems over file systems?. There are many specific advantages, and one of the most important parts is that the database system can perform better concurrent access control.
So, how does the database system perform concurrent access control?
This series of articles takes MySQL 8.0.35 code as an example, and is divided into two parts to provide an overall introduction to concurrent access control in MySQL. The previous article mainly introduces table-level concurrent access control, and this article is part two, which will focus on page-level and row-level concurrent access control.
The InnoDB engine stores data through B+tree. There are many articles on the Internet about B+tree. You can search and learn by yourself. Here is a brief introduction to the basic structure of B+tree for subsequent description.
The preceding figure shows a typical three-layer B+tree, in which:
As can be seen from the above figure, there is a one-way pointer between layers (from the upper layer to the lower layer), and there is a two-way pointer between different nodes at the same layer. Each node in the B+tree is a data page (Page), which is also the smallest unit for reading and writing data in InnoDB. The default page size in InnoDB is 16KB.
For InnoDB tables, we often hear a concept called "index organization tree". From the author's perspective, each index of each InnoDB table is a B+tree and the data is stored on the B+tree. The concepts of more indexes in InnoDB, including primary key indexes, secondary indexes, clustered indexes, and covering indexes, are not the focus of this article, so they are not elaborated here. Interested readers can query and learn by themselves.
Page-level concurrent access control is mainly realized through locks on index and page, which is actually the locking process of B+tree. Before introducing the locking process, let's first look at the access path of B+tree based on the data.
The primary key information (ID) is added to the previous B+tree structure. Assuming that the data to be accessed now is a row with ID = 400, the access path of B+tree is shown in the above figure. As you can see, first access the root node, then find the non-leaf node at the next layer according to the primary key, and then continue to find the corresponding leaf nodes to read the data.
In fact, the locking process of B+tree is carried out according to the above access path. Taking the above query as an example, the locking process of B+tree is shown in the following figure:
The specific steps are as follows:
Similarly, if it is an optimistic update on the page (or an insert in the page), the locking process of B+tree is shown in the following figure:
The specific steps are as follows:
It can be seen that if it is an in-page modification, the locking logic is actually similar to the locking of the read process, except that the type of locking on the leaf node is different.
The above section introduces the query process and the locking process of B+tree when an in-page modification occurs. If the updated data cannot be completed within the page, or the modification will cause a change of the B+tree structure (SMO, Structure Modify Operation), how should the locking be carried out?
When InnDB performs a data update operation, it first tries to use the optimistic update (MODIFY LEAF). If the optimistic update fails, it enters the logic of the pessimistic update (MODIFY TREE). The following figure shows the locking process of the pessimistic update.
The specific steps are as follows:
It can be seen that, unlike the previous process, when entering the pessimistic update logic, the SX lock will be directly added to the index (in the version earlier than 5.7, the X lock was directly added, the 5.7 version introduced the SX lock, the SX lock and the S lock are not mutually exclusive, so it can be read at this time). Therefore, in the subsequent B+tree traversal process, only the nodes passed by the index will be collected first and there is no direct locking. Only when passing the leaf node to be modified will it determine which non-leaf nodes may also be modified, thus adding the X lock.
Therefore, during the entire SMO period, except for leaf nodes and non-leaf nodes that may be modified, X locks are not added to other nodes (index and root nodes are added with SX locks), and read operations on non-modified nodes can be performed normally. However, there can only be one SMO on a B+tree at the same time.
The whole B+tree locking process is more complicated. Here, the main code stack is explained through the insert process on a primary key:
|--> row_ins_clust_index_entry
| |--> row_ins_clust_index_entry_low(..., BTR_MODIFY_LEAF, ...) // Optimistic
| | |--> pcur.open(index, ...)
| | | |--> btr_cur_search_to_nth_level // Traverse B+tree
| | | | |--> // switch (latch_mode)
| | | | |--> // default
| | | | |--> mtr_s_lock(dict_index_get_lock(index), ...) // Add the S lock on the index
| | | | |--> btr_cur_latch_for_root_leaf
| | | | |
| | | | |--> // search_loop
| | | | |--> // retry_page_get
| | | | |--> buf_page_get_gen(..., rw_latch, ...)
| | | | | |--> mtr_add_page // Lock the page by type
| |
| |--> row_ins_clust_index_entry_low(..., BTR_MODIFY_TREE, ...) // Pessimistic
| | |--> pcur.open(index, ...)
| | | |--> btr_cur_search_to_nth_level // Traverse B+tree
| | | | |--> // switch (latch_mode)
| | | | |--> // BTR_MODIFY_TREE
| | | | |--> mtr_sx_lock(dict_index_get_lock(index), ...) // Add the SX lock on the index
| | | | |--> btr_cur_latch_for_root_leaf
| | | | |
| | | | |--> // search_loop
| | | | |--> // retry_page_get
| | | | |--> buf_page_get_gen(..., rw_latch, ...)
| | | | | |--> mtr_add_page // Lock the page by type
The above section is only an introduction to the B+tree locking process. The detailed locking logic can be developed by itself through the above entry, and will not be further elaborated here.
The above content is the B+tree locking process. Let's make a summary:
Before introducing row-level concurrent access control, let's first look at an interesting example:
CREATE TABLE `t1` (
`id` int NOT NULL,
`c1` int DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB;
INSERT INTO t1 VALUES (1, 10);
INSERT INTO t1 VALUES (2, 20);
INSERT INTO t1 VALUES (3, 30);
In the preceding example, the transaction isolation level is RC (READ-COMMITTED) by default. Session 1 starts a transaction, then inserts a row of data, and does not commit it. Session 2 and session 3 subsequently insert the same data, but both are blocked. Finally, if session 1 performs a rollback operation, what happens to session 2 and session 3 respectively?
Readers who know about the principle of MySQL row locks may give the following answer: session 2 inserts successfully and session 3 reports an error with the type 'Duplicate key'. So is that really the case?
The author directly posts the execution results on MySQL 8.0.35 here.
[session 2] Execution Result:
[session 3] Execution Result:
As you can see, session 2 does insert successfully, and session 3 also reports an error, but this error seems to be not in line with our general understanding. Why is the Deadlock found, and where does the deadlock come from?
To figure out this problem, the author turned off the deadlock detection logic on MySQL (innodb_deadlock_detect
was set to OFF), and then tried the above operation again. The results show that session 2 and session 3 are indeed stuck. View the data_locks table under the performance_schema mentioned earlier:
According to the lock wait relationship in the data_locks table, it is found that both session 2 (THREAD_ID=69) and session 3 (THREAD_ID=70) are waiting for the intention lock. The implied semantics is that both sessions hold the Gap Lock, which causes a deadlock.
Therefore, as you can see from the above example, even a simple primary key insert at the RC isolation level does not just lock a single row of records, and may also cause a deadlock.
As you can see from the above example, row locks in InnoDB are not just locks on a single row of records. In fact, row locks are divided into the following types within InnoDB:
1. Rec Lock: the lock added to a single row of records. Its name in the official code is LOCK_REC_NOT_GAP. In terms of the lock type, Rec Lock can be divided into the record read lock (S lock) and the record write lock (X lock).
2. Gap Lock: the lock added to the gap between rows of records. Its name in the official code is LOCK_GAP. (In addition, many articles on the Internet claim that Gap Lock is to solve the phantom read problem of write at RR (REPEATABLE-READ) isolation level. In fact, it is not entirely true, as can be illustrated by the previous example of inserting a deadlock. The issue of transaction isolation levels is not the focus of this article and will not be elaborated here.)
3. Next-Key Lock: the lock that can be simply understood as the combination of Rec Lock and Gap Lock (gap before the record). Its name in the official code is LOCK_ORDINARY.
4. Insert Intention Lock: if the Gap Lock has been added in the inserted position by another transaction, then the current operation needs to wait. At this time, an Insert Intention Lock will be generated, and its name in the official code is LOCK_INSERT_INTENTION.
The above is the basic concept of row locks in InnoDB, which looks very simple, but it is not easy to really understand, so the following section will provide a detailed analysis mainly through some typical cases.
Before entering the case analysis, take a write process as an example to explain the main logic based on the code. The main code stack is as follows:
|--> ha_innobase::write_row
| |--> row_insert_for_mysql
| | |--> row_insert_for_mysql_using_ins_graph
| | | |--> // run_again
| | | |--> row_ins_step
| | | | |--> row_ins
| | | | | |--> row_ins_index_entry_step
| | | | | | |--> row_ins_index_entry
| | | | | | | |--> row_ins_clust_index_entry // Insert the primary key
| | | | | | | | |--> row_ins_clust_index_entry_low
| | | | | | | |
| | | | | | | |--> row_ins_sec_index_entry // Insert a secondary index
| | | |
| | | |--> row_mysql_handle_errors
| | | | |--> lock_wait_suspend_thread // Lock wait, wake up and enter run_again
|--> row_ins_clust_index_entry_low
| |--> btr_pcur_t::open // Traverse B+tree
| |
| |--> row_ins_duplicate_error_in_clust // The first insert will not enter (implicit lock)
| | |--> row_ins_set_rec_lock
| | | |--> lock_clust_rec_read_check_and_lock
| | | | |--> lock_rec_convert_impl_to_expl // Convert implicit lock to explicit lock
| | | | | |--> lock_rec_convert_impl_to_expl_for_trx
| | | | | | |--> lock_rec_add_to_queue
| | | | | | | |--> rec_lock.create // RecLock::create
| | | | | | | | |--> lock_alloc
| | | | | | | | |--> lock_add
| | | | | | | | | |--> // No lock wait
| | | | | | | | | |--> lock_rec_insert_to_granted
| | | | | | | | | |--> locksys::add_to_trx_locks
| | | | |--> lock_rec_lock // Construct a lock wait
| | | | | |--> lock_rec_lock_fast
| | | | | | |--> rec_lock.create // RecLock::create
| | | | | |--> lock_rec_lock_slow
| | | | | | |--> lock_rec_has_expl
| | | | | | |--> lock_rec_other_has_conflicting // Check for conflicts
| | | | | | |--> rec_lock.add_to_waitq
| | | | | | | |--> create // RecLock::create
| | | | | | | | |--> lock_alloc
| | | | | | | | |--> lock_add
| | | | | | | | | |--> // Wait
| | | | | | | | | |--> lock_rec_insert_to_waiting
| | | | | | | | | |--> locksys::add_to_trx_locks
| | | | | | | | | |--> lock_set_lock_and_trx_wait
| |
| |--> btr_cur_optimistic_insert // Optimistic insert
| | |--> btr_cur_ins_lock_and_undo
| | | |--> lock_rec_insert_check_and_lock // Check for lock conflicts before insertion
| | | | |--> lock_rec_other_has_conflicting
| | | | |--> rec_lock.add_to_waitq
| | | |--> trx_undo_report_row_operation
| |--> btr_cur_pessimistic_insert // Pessimistic insert
From the above code, the insert process in the initial example should be like this:
1. When session 1 inserts, because it is the first time to insert, it inserts directly without the need to explicitly create a lock.
2. When the session 2 inserts and the row_ins_duplicate_error_in_clust function is used for conflict check:
3. When session 3 inserts, the basic process is the same as that of session 2, except that Rec X Lock already exists. There is no need to trigger the logic of converting implicit lock to explicit lock. It directly creates a Rec S Lock for itself and enters waiting.
At this point, the waiting logic of session 2 and session 3 is relatively clear. But why do session 2 and session 3 form a deadlock after session 1 rolls back? According to the previous analysis, the rollback of session 1 releases Rec X Lock, so session 2 and session 3 are awakened. The thread that enters the insert logic first inserts successfully and the other thread fails. In addition, as mentioned earlier, session 2 and session 3 finally wait for the Insert Intention Lock. Theoretically, the occurrence of this lock requires other threads to hold the Gap Lock. How does this wait relationship occur?
The author also thought about this question for a long time and finally had to find the answer by debugging the source code. Simply put:
lock_rec_inherit_to_gap
, which converts Rec Lock on session 2 and session 3 to Gap Lock.lock_rec_insert_check_and_lock
) before insertion, and then generate the Insert Intention Lock.Regarding the locking process in the write process, the above section is only a very simple introduction with the help of the case of deadlock caused by the insert. The key point is to clarify the basic logic of row lock generation and lock wait generation in the insert process. The knowledge of InnoDB row lock is very rich, which will not be further elaborated here, and can be introduced separately later when possible.
Note: The following scenarios are mainly derived from: https://www.modb.pro/db/1703591734429175808
The table structures used in each scenario are as follows:
DROP TABLE IF EXISTS `t1`;
CREATE TABLE `t1` (
`id` int NOT NULL AUTO_INCREMENT,
`a` int DEFAULT NULL,
`b` int DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `uk_a` (`a`)
) ENGINE=InnoDB;
INSERT INTO t1 values (1, 10, 0);
INSERT INTO t1 values (2, 20, 0);
INSERT INTO t1 values (3, 30, 0);
INSERT INTO t1 values (4, 40, 0);
INSERT INTO t1 values (5, 50, 0);
Note: You must disable the deadlock detection logic on MySQL (innodb_deadlock_detect is set to OFF).
T1: Session 1 inserts for the first time and does not explicitly create a lock.
T2: Session 2 inserts duplicate rows. It first creates a Next-Key Lock (X) on the UK (30, 35] for session 1 (trx 1), and then creates a Next-Key Lock (S) on the UK (30, 35] for itself. At this time, the Gap Lock does not conflict, so the session is the Rec Lock (S) waiting on 35.
T3: Session 1 inserts again. Since the inserted interval is still (30, 35), which conflicts with the interval lock held by session 2, an Insert Intention Lock is generated, resulting in a deadlock.
T1: When session 1 is updated, a UK conflict is detected and the Next-Key Lock (X) is created on the UK (30, 40] and (40, 50].
T2: When session 2 is updated, a UK conflict is also detected, and the Next-Key Lock (X) needs to be created on the UK (20, 30] and (30, 40]. Since session 1 already holds the Rec Lock (X) on 40, session 2 can only wait for the Rec Lock (X) on 40.
T3: When session 3 is updated, a UK conflict is also detected, and the Next-Key Lock (X) needs to be created on the UK (30, 40] and (40, 50]. Since session 1 already holds the Rec Lock (X) on 40, session 3 can only wait for the Rec Lock (X) on 40 and the Next-Key Lock (X) on (40,50] has not yet been processed.
T4: Session 1 rolls back and releases the lock; session 2 obtains the Rec Lock (X) on 40, but since the inserted interval is (20,40) and session 3 already holds the Gap Lock on (30,40), an Insert Intention Lock needs to be generated, resulting in a deadlock.
Note: There is actually an interesting question here. Why does the update on the UK require two Next-Key Locks? Interested readers can refer to the following article.
🔗 References: http://mysql.taobao.org/monthly/2022/05/02/
T1: Session 1 executes the query. Due to for UPDATE
protection, a Rec Lock (X) on 40 needs to be added.
Subsequent steps are the same as those in scenario 2, and details are not described again.
This scenario is the same as the deadlock scenario explained earlier (An Interesting Deadlock Problem) and will not be analyzed again.
The above example describes some deadlock problems caused by row-level concurrent access control. Finally, let's briefly describe the troubleshooting ideas for deadlock problems:
innodb_deadlock_detect
). In principle, we recommend that you do not manually disable it. In addition, we recommend that you do not set the innodb_lock_wait_timeout
parameter too large.The above content introduces the row-level locking process, and let's make a summary:
This series of articles is mainly a compilation of the concurrent access control and locking logic on tables, pages, and rows in MySQL. In general:
ApsaraDB - November 22, 2024
ApsaraDB - April 1, 2024
ApsaraDB - June 1, 2022
ApsaraDB - February 21, 2023
ApsaraDB - April 2, 2024
Alibaba Cloud Storage - June 19, 2019
Alibaba Cloud PolarDB for MySQL is a cloud-native relational database service 100% compatible with MySQL.
Learn MoreAnalyticDB for MySQL is a real-time data warehousing service that can process petabytes of data with high concurrency and low latency.
Learn MoreAn on-demand database hosting service for MySQL with automated monitoring, backup and disaster recovery capabilities
Learn MoreApsaraDB Dedicated Cluster provided by Alibaba Cloud is a dedicated service for managing databases on the cloud.
Learn MoreMore Posts by ApsaraDB