
Core Technology of PolarDB-X Storage Engine | Lizard XA Two-phase Commit Algorithm

PolarDB-X's Lizard storage engine optimizes distributed two-phase commit via transaction log sinking, branch parallelization, and asynchronous commit for better performance.

By Cuanye and Wuzhe

Classic Two-phase Commit Algorithm

As early as the 1970s, the Two-Phase Commit Protocol (2PC) algorithm began to attract attention from researchers as a potential solution for the consistency issue in distributed computing. Later, with the development of distributed databases and systems, the importance of 2PC in handling distributed transactions increased rapidly. In modern times, as cloud computing develops, distributed computing is facing new challenges, leading to the development of more complex algorithms and protocols such as Paxos, three-phase commit protocol (3PC), and XA protocol. The influence of 2PC can still be seen in these newer methods. Today, the basic concepts of 2PC remain an important starting point for understanding distributed consistency issues.

Roles of Classic Two-phase Commit Algorithm

[Figure 1: Roles of the classic two-phase commit algorithm]

The preceding figure shows a classic and mature 2PC algorithm in the distributed database field. The concepts of each component are as follows:

  1. AP (Application Program): This is the user of database services, such as an application program.
  2. TM (Transaction Manager): Acts as the coordinator in the 2PC algorithm.
  3. RM (Resource Manager): Acts as a participant in the 2PC algorithm. A single distributed transaction may involve multiple RMs, each of which may have multiple branch transactions. All branch transactions together constitute a complete distributed transaction.
  4. GMS (Global Metadata Service): In the distributed database field, to provide Multi-version Concurrency Control (MVCC) capability, the GMS service is utilized to sequence all distributed transactions.

Classic Two-phase Commit Protocol

When an AP initiates a commit request for a distributed transaction:

1. PREPARE phase

The TM issues a PREPARE command to all participating RMs. When all RMs respond to the request and return a success message, the PREPARE phase is completed.

2. GMS sequencing phase

The TM accesses GMS and obtains the Global Commit Number (GCN) to sequence the distributed transaction.

It is worth noting that this phase is not necessarily required in the 2PC algorithm. However, global sequencing is still a well-tested and accepted solution for achieving read consistency in the database industry.

3. COMMIT phase

The TM sends a COMMIT command to all participating RMs. When all the RMs respond to the request and return a success message, the distributed transaction is committed.
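The flow above can be sketched in a few lines of coordinator-side code. This is only an illustrative Python sketch against a hypothetical RM/GMS client interface; prepare(), commit(), rollback(), and get_gcn() are assumed method names, not PolarDB-X APIs.

```python
# A sketch of the classic 2PC flow against a hypothetical RM/GMS client
# interface; prepare(), commit(), rollback(), and get_gcn() are assumed names.

class ClassicTwoPhaseCoordinator:
    """Plays the TM role: PREPARE all branches, sequence via GMS, then COMMIT."""

    def __init__(self, gms, resource_managers):
        self.gms = gms                # Global Metadata Service client
        self.rms = resource_managers  # participating RM clients

    def commit(self, xid):
        # RTT-1: PREPARE phase. Every branch must vote "yes" before proceeding.
        if not all(rm.prepare(xid) for rm in self.rms):
            for rm in self.rms:
                rm.rollback(xid)
            return False

        # RTT-2: GMS sequencing phase. Obtain the Global Commit Number (GCN).
        gcn = self.gms.get_gcn()

        # Persist the commit decision (TLOG) so that a crash between PREPARE
        # and COMMIT can still be recovered.
        self.persist_tlog(xid, decision="COMMIT", gcn=gcn)

        # RTT-3: COMMIT phase. Drive every branch into the committed state.
        for rm in self.rms:
            rm.commit(xid, gcn)
        return True

    def persist_tlog(self, xid, decision, gcn):
        pass  # placeholder: a real TM writes the TLOG durably before COMMIT
```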

Dilemmas of the Classic Two-phase Commit Algorithm

As one of the oldest algorithms in distributed computing, the 2PC algorithm might seem to have exhausted all areas for discussion. However, in practical engineering applications, we have found three major dilemmas of the classic 2PC algorithm in the field of distributed databases.

Dilemma 1: High Transaction Log Forwarding Costs

In the classic 2PC algorithm, after the PREPARE phase (and GMS sequencing) is completed, the TM needs to drive the branch transactions on each RM to commit.

To ensure that transactions meet atomicity (for example, multiple branch transactions are eventually fully committed or rolled back), the TM usually needs to persist a transaction log (TLOG) before the COMMIT phase. This ensures that, in the event of an unexpected crash and subsequent recovery, the TM can still drive the pending branch transactions (hereinafter referred to as hanging transactions) to commit or roll back.

Generally, the structured abstraction of TLOG is as follows:

Distributed transaction ID | Transaction completion action | Transaction commit number | Participants | ...
GTRID_1                    | COMMIT                        | 225208                    | 5            | ...
GTRID_2                    | ROLLBACK                      | /                         | 3            | ...

TLOG is the core log of the 2PC process. If TLOG is unavailable or lost:

Resources cannot be released: The hanging transactions cannot be completed, and the occupied lock resources and UNDO resources cannot be released.

Transaction features are compromised: Losing the TLOG will cause distributed transactions to lose consistency and atomicity.

Therefore, ensuring efficient and reliable use of TLOG is one of the most critical parts of the 2PC algorithm. Generally, TLOG has the following difficulties in engineering:

TLOG storage cost

TLOG is naturally structured data and is well suited to being stored in an RM as a table. However, modifying it requires a complete transaction of its own; that is, committing a branch transaction on an RM entails an additional transaction operation on the TLOG table. As we know, a single transaction operation comes with significant overhead.
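To make the overhead concrete, here is a hedged sketch of what a table-based TLOG write looks like: one extra transactional INSERT per distributed commit, on top of the branch commits themselves. The table layout and the DB-API-style connection are illustrative assumptions, not the PolarDB-X implementation.

```python
# A hedged sketch of a table-based TLOG write (table name and connection API
# are illustrative assumptions).

def persist_tlog_entry(conn, gtrid, action, gcn, participants):
    """Durably record the commit decision before the COMMIT phase starts."""
    with conn:  # the context manager wraps the INSERT in its own transaction
        conn.execute(
            "INSERT INTO tlog (gtrid, action, gcn, participants) "
            "VALUES (?, ?, ?, ?)",
            (gtrid, action, gcn, participants),
        )
    # This INSERT is a full transaction in its own right (redo, fsync,
    # replication), which is exactly the extra cost described above.
```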

TLOG management cost

TLOG cannot grow indefinitely: the TM needs to regularly clean up TLOG entries that are no longer needed, which brings additional management overhead and certain risks. In extreme scenarios, the cleanup may even affect your business.

TLOG high-availability cost

Since TLOG is the core log for 2PC, its availability requires the highest level of assurance. In a financial-grade database such as PolarDB-X, TLOG is required to store more than two copies to ensure availability.

Of course, this is not without cost. Replica-based disaster recovery design often means high communication, storage, and synchronization waiting costs, and in this case those costs are incurred right in the middle of the 2PC process.

Dilemma 2: Single RM with Multiple Branches Cannot Execute in Parallel

Transaction splitting and parallel execution are among the key factors behind the performance advantage of distributed databases over standalone databases. When it comes to implementing transaction splitting at the storage engine level, various solutions exist across industry products such as Oracle and PolarDB-X. Taking PolarDB-X as an example, distributed transactions are split into multiple branch transactions that execute in parallel on the RMs:

For RM, each branch transaction is an independent transaction and needs to meet the atomicity, consistency, isolation, and durability (ACID) features of the transaction.

For TM, multiple branches collectively form a complete distributed transaction and should exhibit the characteristics of a complete transaction.

This contradiction is mainly reflected in two aspects:

Isolation of write branch transactions

Although multiple concurrent branch transactions under the same RM belong to the same distributed transaction, their modifications are invisible to each other. This is because the branch transaction in the context of 2PC is an independent transaction for RM, and its modifications can only be seen by other transactions after the transaction is committed. This greatly limits the scope where branch transaction parallelization can be effectively applied.

Read query cannot read consistent states

In the traditional 2PC algorithm, multiple branch transactions under the same RM are always committed in sequence, so a query cannot be guaranteed to see either all or none of them. This means that, from the AP's viewpoint, a query might observe only part of the modifications of a distributed transaction while missing the rest.

Dilemma 3: Deteriorated Performance Caused by Multiple RTTs

For local standalone transactions, the commit phase typically requires only one Round-Trip Time (RTT). However, compared with local standalone transactions, the classic 2PC algorithm usually requires three RTTs in distributed database applications:

  • RTT-1: PREPARE phase, driving branches to enter the 2PC process.
  • RTT-2: GMS sequencing phase, sequencing the distributed transaction through GMS.
  • RTT-3: COMMIT phase, driving each branch to complete the transaction.

However, the impact of RTT on the overall throughput of the system is obvious:

Write performance: Each RTT involves communication costs, such as network latency across nodes or the bus latency of the node. Since the 2PC process involves blocking and synchronous waits, excessive latency can significantly degrade the database's write performance.

Query performance: Since global sequencing isn't performed during the 2PC process, the visibility of the data modified by the transaction cannot be determined. In most situations, you need to block queries to ensure query consistency. If the latency of the commit process is large, the query performance of the database will deteriorate.

PolarDB-X Distributed Database

Given the above problems, PolarDB-X provides a complete solution through the deep cooperation of multiple modules. This article focuses on the ideas and design of the storage engine. However, before that, let's briefly introduce the overall structure of PolarDB-X to set the stage.

PolarDB-X Capabilities

PolarDB-X is a high-performance cloud-native distributed database service independently developed by Alibaba Cloud. It offers high throughput, large storage, low latency, high scalability, and ultra-high availability to cater to a wide variety of business requirements. As a product of the cloud-native era, PolarDB-X can be summarized by a few keywords: financial-grade high availability, transparent distribution, integrated centralized-distributed architecture, HTAP, open source and multi-cloud deployment, and security and stability.

PolarDB-X Service Architecture

The following figure shows the architecture of PolarDB-X.

[Figure 2: PolarDB-X service architecture]

The modules are as follows:

Global meta service (GMS): provides distributed metadata and a global timestamp distributor named Timestamp Oracle (TSO) and maintains meta information such as tables, schemas, and statistics. It assumes the role of distributed transaction global sequencing in 2PC.

Compute node (CN): provides a distributed SQL engine that contains core optimizers and executors. A CN uses a stateless SQL engine to provide distributed routing and computing and uses the two-phase commit protocol (2PC) to coordinate distributed transactions. A CN also executes DDL statements in a distributed manner and maintains global indexes. It assumes the role of the TM in 2PC.

Data node (DN): provides the capabilities to efficiently and reliably store, retrieve, and process massive data in distributed scenarios through the self-developed Lizard transaction engine. A data node also uses the distributed consensus protocol to ensure high data availability. It assumes the role of the RM in 2PC.

CDC node: provides a primary/secondary replication protocol that is compatible with MySQL, including the protocols and data formats supported by MySQL binary logging. CDC uses this replication protocol to exchange data.

Columnar node: provides persistent columnstore indexes and maintains and updates columnstore indexes in real time based on changes recorded in distributed transaction logs to facilitate efficient analytical query processing. By leveraging object storage and working in tandem with CNs, a columnar node provides the scalability required for real-time updates and the capability to execute snapshot-consistent queries.

Lizard Two-phase Commit Algorithm

Lizard is a new transaction engine for the PolarDB-X storage engine. Different from standalone transaction systems such as InnoDB, it was designed to provide transaction engine solutions in distributed scenarios. The Lizard transaction system has three key strategies for the three major dilemmas currently encountered by the classic 2PC algorithm.

TLOG Log Sinking

TLOG is a core module of the 2PC algorithm. The Lizard transaction system introduces the Lizard transaction slot mechanism to solve the problem of high TLOG forwarding costs.

Lizard Transaction Slot Design

The Lizard transaction system allocates a transaction slot to each transaction. The transaction slot maintains status information about the transactions. A typical transaction slot is structured as follows:

Transaction ID | Transaction status | Transaction commit number | XA member information | XA transaction group information
drds-1ef@23425 | COMMIT             | 225208                    | {drds-1ef...}          | {236029, 112899..}
drds-2cb@34856 | ROLLBACK           | 225209                    | {drds-2cb...}          | {258019, 164115..}

Typical transaction information is as follows:

XA transaction ID: the globally unique identifier of the branch transaction.

XA transaction status: the status of the branch transaction, including ACTIVE, PREPARE, COMMIT, and FORGET.

XA transaction commit number: branch transaction commit number, used for sequencing, including external (global) commit numbers and internal commit numbers.

XA transaction member information: distributed transaction participant information, including global and local participant information.

XA transaction group information: the relationship between multiple participants in a distributed transaction as a transaction group.

As can be seen, TLOG is sunk from an upper-layer logical table into the storage engine, where it is taken over by the Lizard transaction slot.
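Based on the fields listed above, a transaction slot record can be pictured roughly as follows. This is a minimal illustrative sketch; the field names and Python types are assumptions, not the on-disk layout.

```python
# A minimal sketch of the information a Lizard transaction slot carries,
# following the table above. Names and types are illustrative assumptions.

from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class XAStatus(Enum):
    ACTIVE = "ACTIVE"
    PREPARE = "PREPARE"
    COMMIT = "COMMIT"
    FORGET = "FORGET"

@dataclass
class TransactionSlot:
    xa_trx_id: str                          # globally unique branch transaction ID
    status: XAStatus                        # branch transaction status
    gcn: Optional[int] = None               # external (global) commit number
    local_commit_no: Optional[int] = None   # internal commit number (assumed name)
    members: list = field(default_factory=list)  # participant information
    group: list = field(default_factory=list)    # transaction group peers
```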

Lizard Transaction Slot Management

The lifecycle of a Lizard transaction slot is managed by the Lizard transaction system. The lifecycle can be roughly divided into the following phases:

ACTIVE: After a transaction starts, the Lizard transaction system allocates a transaction slot for it and places the slot in the ACTIVE region for quick access and updates.

FINISH: After a transaction completes the 2PC process, it enters the FINISH phase. The transaction slot records the relevant XA transaction information in a timely manner and is moved from the ACTIVE region to the HISTORY region, where it is retained long enough according to the retention policy.

FORGET: Once it is confirmed that the transaction slot is no longer needed, it is cleaned up by the Lizard cleanup system in time and the space is reclaimed for the next use.
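The lifecycle can be sketched as a simple three-region state machine. Again, this is only an illustrative sketch; the region containers and method names are assumptions.

```python
# A sketch of the slot lifecycle described above:
# ACTIVE -> FINISH (moved to the HISTORY region) -> FORGET (space reclaimed).

class SlotLifecycle:
    def __init__(self):
        self.active = {}    # xid -> slot info, ACTIVE region
        self.history = {}   # xid -> slot info, HISTORY region

    def start(self, xid):
        self.active[xid] = {"status": "ACTIVE"}

    def finish(self, xid, status, gcn):
        slot = self.active.pop(xid)              # leaves the ACTIVE region
        slot.update(status=status, gcn=gcn)      # record final XA information
        self.history[xid] = slot                 # retained per the policy

    def forget(self, xid):
        self.history.pop(xid, None)              # cleaned up, space reclaimed
```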

Lizard Transaction Slot Replica

To provide financial-grade availability, Lizard transaction slots are synchronized to other nodes. Currently, PolarDB-X uses logical logs to synchronize slots to Standby nodes. The Standby node rebuilds a logical replica of the Lizard transaction slot based on the logical log.

Lizard Transaction Slot Advantages

Compared with the TLOG solution based on logical tables, Lizard's TLOG sinking solution based on transaction slots has significant advantages:

Low storage cost of transaction slots: Transaction slots travel with the original transaction, so no additional transaction object needs to be maintained. The cost is merely some extra REDO log volume and a small amount of space, with no added synchronization waiting overhead.

Transaction slot autonomy: The lifecycle of transaction slots is maintained by the Lizard transaction system. This ensures that required transaction slots are retained while transaction slots that are no longer required are cleared to avoid space expansion.

Lower high-availability cost of transaction slots: Transaction slots are synchronized to the Standby node along with the replication log of the transaction without additional RTT overheads and synchronization waiting overheads. This greatly reduces the high-availability cost.

Branch Transaction Parallelization

Enabling full parallelization of branch transactions is one of the core issues of distributed database storage engines. To address this, the Lizard transaction system provides a variety of transaction policies to allow the TM to handle branch transactions with ease. Some of these policies might appear to challenge the database ACID theory, but they are natural and necessary in distributed scenarios.

Break Down Barriers: Uncommitted Modifications Become Visible

If someone told me that a transaction's uncommitted modifications should be visible, I would probably turn to ACID theory to argue otherwise. However, this seemingly strange behavior can be necessary in distributed scenarios.

The reason is that although multiple branch transactions on an RM reuse the standalone transaction object structure, together they form one complete distributed transaction, which means that modifications made within the (distributed) transaction should be visible within that same (distributed) transaction.

The Lizard transaction system introduces the transaction group to break down the barriers between transactions in the same group, making their modifications visible to each other, which provides full possibility and flexibility for branch transaction parallelization.

This is also a full embodiment of PolarDB-X's integrated centralized-distributed architecture concept:

• When isolation is needed (centralized mode), standalone transactions remain strictly separated, as if divided by a wide river.

• When it is not (distributed mode), branch transactions can bridge the gap as easily as lifting a sheet of paper, establishing connections wherever needed.

Shared Commit Status, Like an Atomic Transaction

Multiple branch transactions on the same RM commit in sequence, and the storage engine cannot prevent queries from arriving in the intervals between branch commits. This means that a query can see the modifications of some branch transactions while missing the modifications of others.

To make a multi-branch transaction look like a single transaction, the Lizard transaction system identifies the primary branch and the secondary branch in the transaction group, establishing a relationship between them. Through this association, branch transactions that belong to the same transaction group share the same commit status. Therefore, when an external query is initiated, the multi-branch transaction looks like an atomic transaction.
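The two ideas above can be pictured with a short sketch: branches in the same transaction group see each other's uncommitted changes, while an external query always consults the shared (primary) commit status, so the group appears atomic. All names here are illustrative assumptions rather than the engine's internal interfaces.

```python
# A hedged sketch of transaction-group visibility. Group membership, the
# primary/secondary relationship, and the GCN comparison are illustrative.

class TransactionGroup:
    def __init__(self, primary_branch):
        self.primary = primary_branch
        self.branches = {primary_branch}
        self.commit_gcn = None            # set once the primary branch commits

    def add_secondary(self, branch_id):
        self.branches.add(branch_id)

    def visible_to_branch(self, reader_branch, writer_branch, writer_committed):
        # Intra-group: uncommitted modifications are visible to group peers.
        if reader_branch in self.branches and writer_branch in self.branches:
            return True
        return writer_committed

    def visible_to_query(self, snapshot_gcn):
        # External query: every branch defers to the shared (primary) commit
        # status, so the whole group is either visible or invisible as a unit.
        return self.commit_gcn is not None and self.commit_gcn <= snapshot_gcn
```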

Asynchronous Commit

The MVCC solution for PolarDB-X distributed transactions uses the global Timestamp Oracle (TSO) solution.

In the PolarDB-X storage engine, the sequence number generated by the TSO (undertaken by GMS) is called GCN (Global Commit Number).

The core idea of incorporating TSO into the 2PC process can be summarized as:

Step 1: "Lock" records that are being committed so they cannot be queried (PREPARE).

Step 2: Obtain the commit number GCN of the distributed transaction (GMS sequencing).

Step 3: Release the "lock" in the first step and use the GCN obtained in the second step to drive the transaction into its completed status (COMMIT).

Through the preceding analysis, we know that the traditional TSO-based 2PC scheme incurs significant overhead, and the whole process contains three RTTs. To solve this problem, PolarDB-X adopts the asynchronous commit solution, which greatly reduces the cost of committing transactions.

The PolarDB-X Lizard storage engine upgrades XA branch transactions to AC (Async Commit) transactions, enabling PolarDB-X to achieve optimal commit performance. The following describes AC transactions from the perspective of the PolarDB-X storage engine.

AC Transaction Commit

The commit process of an AC transaction can be simplified as follows:

RTT-1: GMS Pre-sequencing

The TM obtains a pre-commit number PRE_GCN from the GMS.

RTT-2: AC PREPARE

The TM carries PRE_GCN to each RM and initiates an AC PREPARE request. From the responses, the TM learns the largest external number that each participating RM has already obtained; the largest of these, together with PRE_GCN, becomes the final commit number GCN of the distributed transaction.

After all RMs have completed the PREPARE status transitions and responded to the request, the commit status (including the commit number) of the distributed transaction is determined.

The TM may then return an OK message to the AP, indicating that the transaction has been committed. Only two RTTs are used in the entire process.

Asynchronous: AC COMMIT

The TM drives the transaction into the completed status through AC COMMIT. The process is asynchronous and is not in the user's flow.
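Putting the three steps together, the coordinator-side flow can be sketched as follows. As with the earlier sketch, the GMS/RM client interface (get_gcn, ac_prepare, ac_commit) is an assumed illustration, not the actual PolarDB-X API.

```python
# A sketch of the AC commit flow; get_gcn(), ac_prepare(), and ac_commit() are
# assumed client methods.

class AsyncCommitCoordinator:
    def __init__(self, gms, resource_managers):
        self.gms = gms
        self.rms = resource_managers

    def commit(self, xid):
        # RTT-1: GMS pre-sequencing, obtain the pre-commit number PRE_GCN.
        pre_gcn = self.gms.get_gcn()

        # RTT-2: AC PREPARE. Each RM prepares its branch and reports the
        # largest external number it has obtained so far.
        max_gcns = [rm.ac_prepare(xid, pre_gcn) for rm in self.rms]

        # The commit outcome is now fully determined, so the TM can already
        # report success to the AP after only two RTTs.
        c_gcn = max([pre_gcn, *max_gcns])

        # Asynchronous AC COMMIT: drive the branches to the completed status
        # off the user's synchronous path (a real TM hands this to a
        # background task rather than looping inline).
        for rm in self.rms:
            rm.ac_commit(xid, c_gcn)
        return c_gcn
```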

AC Transaction Recovery

When a distributed transaction crashes abnormally during the 2PC process, the TM needs to handle hanging transactions in the recovery process and eventually drive all branch transactions to a consistent completed status.

For this reason, the Lizard transaction engine maintains TLOG through transaction slots for easy lookup of transaction status and provides AC interaction capability for TM to drive the state transitions of transactions.

AC Transaction Status Query

The Lizard transaction system provides the ability to query XA transaction status. Its core module is the TLOG maintained by the Lizard transaction slot. While the Lizard TLOG offers numerous advantages, its chain-based structure means that index queries are not as efficient as those on logical tables with dedicated indexing structures. Therefore, the Lizard transaction system introduces two query methods:

Optimistic query: The Lizard TLOG allows you to directly locate and perform an optimistic query based on the physical location.

Pessimistic query: The Lizard TLOG is mapped to specific data segments based on the transaction identification information, which greatly improves query efficiency.

By combining the two query modes, the Lizard transaction system can query the XA transaction status at a lower cost.

AC Transaction Status Transition

The Lizard transaction system defines the status information required for XA transaction recovery so that the TM can transition the status according to the branch transaction status, thus ensuring a correct recovery process. Currently, the Lizard transaction system supports the following status expressions:

STATUS             | Description
ATTACHED           | A session is processing this transaction, which means that the transaction is not completed.
DETACHED_PREPARE   | No session is processing this transaction, and the transaction is in the PREPARE state.
COMMIT             | The transaction is in the committed state.
ROLLBACK           | The transaction is in the rolled-back state.
NOTSTART_OR_FORGET | The transaction information has been cleared, or the transaction never existed.

Through the XA transaction status query, the TM determines the status transition direction of other hanging branch transactions.
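As an illustration of how these statuses might drive recovery, here is a hedged sketch of a per-branch decision function. The action names and the policy shown are simplified assumptions for the sketch, not the actual TM logic.

```python
# A simplified, illustrative mapping from branch status to a recovery action.

def recovery_action(branch_status):
    if branch_status == "ATTACHED":
        return "WAIT"                # a session is still working on the branch
    if branch_status == "COMMIT":
        return "COMMIT_OTHERS"       # the decision was commit: drive the rest forward
    if branch_status == "ROLLBACK":
        return "ROLLBACK_OTHERS"     # the decision was rollback: roll the rest back
    if branch_status == "DETACHED_PREPARE":
        return "DECIDE"              # consult the other branches / TLOG for the outcome
    if branch_status == "NOTSTART_OR_FORGET":
        # Ambiguous on its own: the slot was cleared after completion, or the
        # branch never started; a real TM resolves this with other branches.
        return "DECIDE"
    raise ValueError(f"unknown status: {branch_status}")
```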

AC Transaction Sequencing

The core problem of MVCC theory in distributed databases is actually a sequencing problem. Consistency can be achieved when all queries and modifications have a sequence number that conforms to the operation sequence.

However, as seen from the above analysis, the sequencing scheme of AC transactions is significantly different from that of the traditional 2PC algorithm. You may wonder how AC transactions can guarantee read consistency after adjusting the sequencing rules. Limited by space, I do not intend to introduce rigorous mathematical proofs here but rather to try to use a simple example to explore how AC transactions solve the consistency problem through sequencing.

• Before the PREPARE phase, the AC transaction obtains the pre-commit number (PRE_GCN) from GMS at time T1.

• In the PREPARE phase, the AC transaction "locks" the records that are being committed at time T2.

PRE_GCN cannot be used directly as the final commit number C_GCN of the distributed transaction. The reason is that a query issued between T1 and T2 inevitably obtains a sequence number S_GCN (snapshot GCN) larger than PRE_GCN while still finding the AC transaction in the ACTIVE status.

In simpler terms, if PRE_GCN were used as the final commit number, that query would find the transaction's modifications sometimes invisible (while the transaction is still ACTIVE) and sometimes visible (once the commit completes, the commit number would place the query after the modification).

For this reason, after T2 each branch returns to the TM the largest commit number max_GCN[i] among all distributed transactions that have occurred on RM[i] by T2. The final commit number is:

C_GCN = max {PRE_GCN, max_GCN[0], max_GCN[1]...}

It can be seen that S_GCN <= C_GCN, which reflects the actual operation sequence, so the above problem will no longer exist.
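A toy numeric illustration of the rule above, with made-up values:

```python
# Toy numbers for the rule above (made up for illustration only).
PRE_GCN = 100                  # pre-commit number obtained from GMS at T1
max_gcn = [103, 98]            # largest external numbers reported by RM[0], RM[1] at T2

C_GCN = max([PRE_GCN, *max_gcn])
print(C_GCN)                   # 103

# A query that snapshotted at S_GCN = 102 between T1 and T2 saw the transaction
# as ACTIVE; since S_GCN <= C_GCN, the commit is ordered after that query and
# the "sometimes visible, sometimes invisible" anomaly no longer arises.
S_GCN = 102
assert S_GCN <= C_GCN
```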

Of course, the above inference is oversimplified and not rigorous, especially for a distributed database such as PolarDB-X, which supports a variety of transaction types beyond regular distributed transactions, such as single-shard transactions, flashback transactions, and primary/secondary transactions. Its distributed MVCC mechanism is complex and hard to cover briefly. I have also deliberately avoided framing the discussion in terms of causal consistency, sequential consistency, or eventual consistency. The discussion above is therefore only the tip of the iceberg; if you are interested in this aspect, please stay tuned for our subsequent technical articles.

AC Transaction Benefits

AC transactions have the following advantages over traditional TSO transactions:

AC transactions significantly reduce latency. A traditional TSO transaction requires three RTTs, while an AC transaction requires only two RTTs, greatly reducing the response latency of the transaction commit. It is worth noting that for financial-grade high-availability databases like PolarDB-X, all status transitions require reaching a majority through a distributed consensus protocol. While it appears that the number of synchronous waits decreases from 3 RTTs to 2 RTTs, the eliminated RTT specifically pertains to the COMMIT request/response. This means that there is no need to wait for the consensus protocol to reach a majority on the synchronous path.

AC transactions improve system throughput. Response latency and system throughput are two sides of the same coin. Reducing latency means effectively improving system throughput.

AC transactions improve resource utilization efficiency. Because the final COMMIT status transition happens in an asynchronous process rather than on the synchronous path, scheduling policies can be designed more efficiently, for example, merging requests by group so that multiple transactions are committed within one RTT.

Lizard Two-phase Commit Algorithm Summary

The Lizard two-phase commit algorithm is a distributed transaction solution guided by PolarDB-X 2.0's integrated centralized-distributed architecture and transparent distribution. Experimental data has proven that the overall performance can be improved by more than 20% after adopting the Lizard two-phase commit strategy.

However, this is not without cost. The new transaction strategy also brings more complex recovery processes and sequencing designs to ensure the ACID features of transactions. Through in-depth collaboration between modules, PolarDB-X provides users with technology benefits transparently and seamlessly. For a deeper dive into these technical aspects, stay tuned for our upcoming articles.
