PolarDB-X source code interpretation series
In the previous source code reading, we introduced the execution process of INSERT. Unlike INSERT, INSERT IGNORE needs to determine whether there is a unique key conflict with the inserted value and ignore the conflicting inserted value. Therefore, this article will further introduce the execution process of INSERT IGNORE in PolarDB-X, which will vary depending on whether the inserted table has GSI.
Push down execution
If the inserted table has only one main table and no GSI, you only need to send INSERT IGNORE directly to the corresponding physical table, and the DN will ignore the conflicting values. In this case, the execution process of INSERT IGNORE is basically the same as that of INSERT. Readers can refer to the previous source code to read the article.
Logical execution
In the case of GSI, the INSERT IGNORE cannot be simply distributed to the main table and the corresponding physical sub table of GSI, or the data of the main table and GSI may be inconsistent. for instance:
create table t1 (a int primary key, b int, global index g1(b) dbpartition by hash(b)) dbpartition by hash(a);
insert ignore into t1 values (1,1),(1,2);
For the two inserted records, they are located in the same physical table (a is the same) on the main table, but they are located in different physical tables (b is different) on the GSI. If you directly issue INSERT IGNORE, only (1,1) on the main table can be successfully inserted (primary key conflict), while (1,1) and (1,2) on the GSI can be successfully inserted, so the GSI has one more piece of data than the main table.
In this case, one solution is to first select the data that may conflict in the database to CN according to the Unique Key in the inserted value, and then judge the conflicting value in CN and delete it.
When performing a SELECT, the simplest way is to send all the SELECT directly to the main table. However, the main table may not have a corresponding Unique Key, which will result in a full table scan during the SELECT process, affecting performance. So in the optimizer stage, we will determine whether the corresponding SELECT needs to be sent to the main table or GSI according to whether the Unique Key is defined on the main table or GSI. The specific code location is:
com.alibaba.polardbx.optimizer.core.planner.rule.OptimizeLogicalInsertRule#groupUkByTable
At the execution stage, we process INSERT IGNORE in LogicalInsertIgnoreHandler. First, we will enter the getDuplicatedValues function, which will find the records of conflicting Unique Keys in the table by issuing SELECT. We set the column selected in the issued SELECT statement to (value_index, uk_index, pk). Where value_ Index and uk_ All indexes are constants of.
For example, suppose there is a table:
CREATE TABLE `t` (
`id` int(11) NOT NULL,
`a` int(11) NOT NULL,
`b` int(11) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE GLOBAL KEY `g_ i_ a` (`a`) COVERING (`id`) DBPARTITION BY HASH(`a`)
) DBPARTITION BY HASH(`id`)
And an INSERT IGNORE statement:
INSERT IGNORE INTO t VALUES (1,2,3),(2,3,4),(3,4,5);
Suppose that when it is executed in PolarDB-X, it will number the Unique Key as
0: id
1: g_ i_ a
Each value inserted in the INSERT IGNORE statement is numbered
0: (1,2,3)
1: (2,3,4)
2: (3,4,5)
Then the SELECT on the GSI constructed by UNIQUE KEY of (2,3,4) is
Query GSI
SELECT 1 as `value_ index`, 1 as `uk_ index`, `id`
FROM `g_ i_ a_ xxxx`
WHERE `a` in 3;
If (5,3,6) already exists in the table, the return result of this SELECT is (1,1,5). In addition, since the return formats of SELECT for different Unique Keys are the same, we will send different SELECT queries UNION on the same physical library to get multiple results at once, reducing the number of interactions between CN and DN. As long as a unique key has duplicate values, we can_ Index and uk_ The index determines which Unique Key in which row the value is inserted is duplicate.
After getting all the returned results, we will de duplicate the data. We put the conflicting values obtained in the previous step into a SET, and then scan all the inserted values in each row in sequence. If duplicate values are found, skip the row, or add the row to the SET (because there may also be conflicts between inserted values). After de duplication, we get all the values that do not conflict. After inserting these values into the table, we complete the execution of an INSERT IGNORE.
RETURNING optimization
Although the logical execution mode of INSERT IGNORE mentioned in the previous section ensures the correctness of data, it also requires at least two interactions between CN and DN to complete an INSERT IGNORE statement (the first SELECT and the second INSERT), which affects the execution performance of INSERT IGNORE.
The current DN already supports the RETURNING optimization of AliSQL, which can return the successfully inserted value after the DN INSERT IGNORE is executed. With this function, PolarDB-X further optimizes the INSERT IGNORE: directly issue the INSERT IGNORE. If all the INSERT IGNORE are successfully returned on the main table and GSI, it means that there is no conflict in the inserted values, and the INSERT IGNORE is successfully executed; Otherwise, the multiple inserted values will be deleted.
When executing, CN will first issue a physical INSERT IGNORE statement with RETURNING to DN according to the syntax above, for example:
call dbms_ trans. returning("a", "insert into t1_xxxx values(1,1)");
The return column is the primary key, which is used to identify which of the inserted batch of data has been successfully inserted; t1_ Xxxx is a physical sub table of logical table t1. After all INSERT IGNORE on the main table and GSI are executed, we calculate the intersection of successfully inserted values in the main table and GSI as the final result, and then delete the multiple inserted values. This part of the code is in
com.alibaba.polardbx.repo.mysql.handler.LogicalInsertIgnoreHandler#getRowsToBeRemoved
Compared with the "pessimistic execution" of logic execution in the previous section, INSERT IGNORE optimized with RETURNING is equivalent to "optimistic execution". If the inserted values do not conflict with each other, only one interaction between CN and DN is required for an INSERT IGNORE statement; In case of conflict, we need to issue a DELETE statement to delete the multiple inserted values in the main table or GSI. Therefore, CN and DN need to interact twice. It can be seen that even in case of conflict, the number of interactions between CN and DN will not exceed the logic execution in the previous section. Therefore, when it is impossible to directly push down, the execution strategy of INSERT IGNORE is to use RETURNING to optimize execution by default.
Of course, the use of RETURNING optimization also has some limitations. For example, the inserted Value cannot be used when there are duplicate primary keys, because in this case, it is impossible to determine which row was successfully inserted and which row needs to be deleted; For details, please read the conditional judgment in the code. When RETURNING optimization cannot be used, the system will automatically select the logic execution mode in the previous section to execute this INSERT IGNORE statement to ensure the correctness of the data.
Execution process optimized with RETURNING:
com.alibaba.polardbx.repo.mysql.handler.LogicalInsertIgnoreHandler#doExecute
Summary
This article introduces the execution process of INSERT IGNORE in PolarDB-X. In addition to INSERT IGNORE, there are also some DML statements that need to be judged for duplicate values during execution, such as REPLACE, INSERT ON DUPLICATE KEY UPDATE, etc. These statements all adopt the logical execution method in the case of GSI, that is, select first and then perform operations such as duplicate judgment and update. Interested readers can read the relevant code by themselves.
Push down execution
If the inserted table has only one main table and no GSI, you only need to send INSERT IGNORE directly to the corresponding physical table, and the DN will ignore the conflicting values. In this case, the execution process of INSERT IGNORE is basically the same as that of INSERT. Readers can refer to the previous source code to read the article.
Logical execution
In the case of GSI, the INSERT IGNORE cannot be simply distributed to the main table and the corresponding physical sub table of GSI, or the data of the main table and GSI may be inconsistent. for instance:
create table t1 (a int primary key, b int, global index g1(b) dbpartition by hash(b)) dbpartition by hash(a);
insert ignore into t1 values (1,1),(1,2);
For the two inserted records, they are located in the same physical table (a is the same) on the main table, but they are located in different physical tables (b is different) on the GSI. If you directly issue INSERT IGNORE, only (1,1) on the main table can be successfully inserted (primary key conflict), while (1,1) and (1,2) on the GSI can be successfully inserted, so the GSI has one more piece of data than the main table.
In this case, one solution is to first select the data that may conflict in the database to CN according to the Unique Key in the inserted value, and then judge the conflicting value in CN and delete it.
When performing a SELECT, the simplest way is to send all the SELECT directly to the main table. However, the main table may not have a corresponding Unique Key, which will result in a full table scan during the SELECT process, affecting performance. So in the optimizer stage, we will determine whether the corresponding SELECT needs to be sent to the main table or GSI according to whether the Unique Key is defined on the main table or GSI. The specific code location is:
com.alibaba.polardbx.optimizer.core.planner.rule.OptimizeLogicalInsertRule#groupUkByTable
At the execution stage, we process INSERT IGNORE in LogicalInsertIgnoreHandler. First, we will enter the getDuplicatedValues function, which will find the records of conflicting Unique Keys in the table by issuing SELECT. We set the column selected in the issued SELECT statement to (value_index, uk_index, pk). Where value_ Index and uk_ All indexes are constants of.
For example, suppose there is a table:
CREATE TABLE `t` (
`id` int(11) NOT NULL,
`a` int(11) NOT NULL,
`b` int(11) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE GLOBAL KEY `g_ i_ a` (`a`) COVERING (`id`) DBPARTITION BY HASH(`a`)
) DBPARTITION BY HASH(`id`)
And an INSERT IGNORE statement:
INSERT IGNORE INTO t VALUES (1,2,3),(2,3,4),(3,4,5);
Suppose that when it is executed in PolarDB-X, it will number the Unique Key as
0: id
1: g_ i_ a
Each value inserted in the INSERT IGNORE statement is numbered
0: (1,2,3)
1: (2,3,4)
2: (3,4,5)
Then the SELECT on the GSI constructed by UNIQUE KEY of (2,3,4) is
Query GSI
SELECT 1 as `value_ index`, 1 as `uk_ index`, `id`
FROM `g_ i_ a_ xxxx`
WHERE `a` in 3;
If (5,3,6) already exists in the table, the return result of this SELECT is (1,1,5). In addition, since the return formats of SELECT for different Unique Keys are the same, we will send different SELECT queries UNION on the same physical library to get multiple results at once, reducing the number of interactions between CN and DN. As long as a unique key has duplicate values, we can_ Index and uk_ The index determines which Unique Key in which row the value is inserted is duplicate.
After getting all the returned results, we will de duplicate the data. We put the conflicting values obtained in the previous step into a SET, and then scan all the inserted values in each row in sequence. If duplicate values are found, skip the row, or add the row to the SET (because there may also be conflicts between inserted values). After de duplication, we get all the values that do not conflict. After inserting these values into the table, we complete the execution of an INSERT IGNORE.
RETURNING optimization
Although the logical execution mode of INSERT IGNORE mentioned in the previous section ensures the correctness of data, it also requires at least two interactions between CN and DN to complete an INSERT IGNORE statement (the first SELECT and the second INSERT), which affects the execution performance of INSERT IGNORE.
The current DN already supports the RETURNING optimization of AliSQL, which can return the successfully inserted value after the DN INSERT IGNORE is executed. With this function, PolarDB-X further optimizes the INSERT IGNORE: directly issue the INSERT IGNORE. If all the INSERT IGNORE are successfully returned on the main table and GSI, it means that there is no conflict in the inserted values, and the INSERT IGNORE is successfully executed; Otherwise, the multiple inserted values will be deleted.
When executing, CN will first issue a physical INSERT IGNORE statement with RETURNING to DN according to the syntax above, for example:
call dbms_ trans. returning("a", "insert into t1_xxxx values(1,1)");
The return column is the primary key, which is used to identify which of the inserted batch of data has been successfully inserted; t1_ Xxxx is a physical sub table of logical table t1. After all INSERT IGNORE on the main table and GSI are executed, we calculate the intersection of successfully inserted values in the main table and GSI as the final result, and then delete the multiple inserted values. This part of the code is in
com.alibaba.polardbx.repo.mysql.handler.LogicalInsertIgnoreHandler#getRowsToBeRemoved
Compared with the "pessimistic execution" of logic execution in the previous section, INSERT IGNORE optimized with RETURNING is equivalent to "optimistic execution". If the inserted values do not conflict with each other, only one interaction between CN and DN is required for an INSERT IGNORE statement; In case of conflict, we need to issue a DELETE statement to delete the multiple inserted values in the main table or GSI. Therefore, CN and DN need to interact twice. It can be seen that even in case of conflict, the number of interactions between CN and DN will not exceed the logic execution in the previous section. Therefore, when it is impossible to directly push down, the execution strategy of INSERT IGNORE is to use RETURNING to optimize execution by default.
Of course, the use of RETURNING optimization also has some limitations. For example, the inserted Value cannot be used when there are duplicate primary keys, because in this case, it is impossible to determine which row was successfully inserted and which row needs to be deleted; For details, please read the conditional judgment in the code. When RETURNING optimization cannot be used, the system will automatically select the logic execution mode in the previous section to execute this INSERT IGNORE statement to ensure the correctness of the data.
Execution process optimized with RETURNING:
com.alibaba.polardbx.repo.mysql.handler.LogicalInsertIgnoreHandler#doExecute
Summary
This article introduces the execution process of INSERT IGNORE in PolarDB-X. In addition to INSERT IGNORE, there are also some DML statements that need to be judged for duplicate values during execution, such as REPLACE, INSERT ON DUPLICATE KEY UPDATE, etc. These statements all adopt the logical execution method in the case of GSI, that is, select first and then perform operations such as duplicate judgment and update. Interested readers can read the relevant code by themselves.
Related Articles
-
A detailed explanation of Hadoop core architecture HDFS
Knowledge Base Team
-
What Does IOT Mean
Knowledge Base Team
-
6 Optional Technologies for Data Storage
Knowledge Base Team
-
What Is Blockchain Technology
Knowledge Base Team
Explore More Special Offers
-
Short Message Service(SMS) & Mail Service
50,000 email package starts as low as USD 1.99, 120 short messages start at only USD 1.00