×
Community Blog PolarDB-X Sharding Rule Changes

PolarDB-X Sharding Rule Changes

This article covers PolarDB-X's sharding rule changes, a cloud-native database that distributes data across multiple nodes and supports dynamic repartitioning.

By Guxu

Background

PolarDB-X, as a cloud-native distributed relational database, supports distributing the data of a logical table across multiple data nodes through sharding rules. It also supports changing these sharding rules for data repartition. Data repartition is one of the core capabilities of distributed databases. When the business increases, it can distribute data across more nodes to achieve horizontal scale-out. When business rules change dramatically, it can distribute data according to new sharding rules, thus improving query performance and better adapting to new business rules. The following figure illustrates how a non-partitioned table can be online changed into a sharded table through a simple DDL statement. For more information, see Alibaba Cloud documentation.

1

How is Data Repartitioned?

Back in the heyday of distributed database middleware, developers tried various ways to solve the problem of data repartition. Without exception, these solutions were complex and risky, requiring developers to perform repartition during low-traffic periods in the middle of the night, and the whole process required detailed plans, justifications, and rollback strategies. Due to the distributed nature of the systems, during the transition period when switching from old to new data, they had to struggle against the choice between data consistency and availability. As a result, many solutions involved a write-stop phase, a data verification phase, or both.

After thoroughly evaluating all technical details of data repartition in distributed databases, PolarDB-X achieves strong data consistency, high availability, and transparency to businesses, and can be implemented by a simple DDL statement.

Key Issues in Data Repartition

This topic discusses data repartition at table level. For example, you can repartition the data in a sharded table to different data nodes based on a new rule. However, the repartition process takes time, during which new incremental data enters the system. Once all data has been repartitioned, old and new data must be switched, and finally, dual writing should stop and all old data should be deleted. After carefully studying the details of the above process, it can be summarized into three key sub-issues according to which we will evaluate these implementations:

  1. How to synchronize existing and incremental data?
  2. Primarily consider dual writing at the application layer, data synchronization middleware, and database triggers.
  3. How to switch traffic?
  4. Primarily consider the migration of read and write traffic between old and new data.
  5. How to control the overall repartition process?
  6. Primarily consider how to coordinate these steps, such as when to start/stop incremental data synchronization and when to switch traffic.
  7. How to recover or roll back in case of failures?

Traditional Repartition Solution

2

A traditional repartition process is as follows:

  1. In the initial state, the read and write traffic of the application cluster is on the old data table.
  2. Add a new table and enable dual writing. Dual writing might be implemented at the application layer, based on binlog data synchronization, or by using database triggers. However, this step can easily lead to data inconsistency if not guaranteed by transactions.
  3. Start existing data synchronization.
  4. A data comparison service may be enabled. If data is inconsistent, manual intervention is also required.
  5. Once existing data synchronization is complete and incremental data is synchronized, switch read traffic to the new table while maintaining dual writing for a period.
  6. Complete the repartition by removing write traffic from the old table.

However, upon further examination, we find that data consistency cannot guaranteed in many steps:

  1. In Step 2, when dual writing is enabled, the distributed system cannot ensure that all nodes start dual writing simultaneously. Therefore, there must be a period when only part of the nodes start dual writing. Then there will be an Orphan Data Anomaly issue, resulting in data inconsistency between the old and new tables. The traditional repartition solution would stop writing to address this issue.
  2. Similarly, in Step 6, removing write traffic from the old table can introduce inconsistency.
  3. In Step 2, ensuring consistency of incremental data during dual writing is challenging.

In addition to data consistency, there are still many cumbersome but important issues:

  1. The entire process is largely manually orchestrated, and human intervention is involved in multiple steps, so this solution requires a learning and understanding cost at the beginning and a participation cost during the process.
  2. To avoid data inconsistency, it should stop writing at critical points, but this will disrupt business operations.
  3. The process is loosely coupled with poor fault tolerance.

PolarDB-X Repartition Solution

PolarDB-X is a distributed database that separates storage from compute. Therefore, its architecture includes Compute Node (CN), Data Node (DN), and Global Meta Service (GMS). CNs are responsible for SQL parsing, optimization, and execution, DNs manage data storage, and GMS stores metadata. For performance reasons, each CN caches a copy of the metadata.

3

How to Synchronize Existing and Incremental Data

Dual Writing of Incremental Data

In distributed incremental data dual-writing scenarios, the two ends of dual writing are often located on different data nodes, making standalone transactions unavailable. As previously discussed, all of the XA transactions, binlog synchronization, and triggers cannot guarantee strong consistency between the two ends of dual writing. However, PolarDB-X uses its built-in TSO-based distributed transactions to implement incremental data synchronization, thus ensuring strong consistency of the data switched by read traffic at any time during the repartition process.

Note that if the shard key column value of a row is modified, the row of data might be routed to another data node. Then, what is actually executed at this time is the delete operation of the original data node + the insert operation of the new data node. Therefore, during data repartition, due to the different sharding rules before and after, an update operation on a row could become a distributed transaction involving four data nodes (and even more if there are global secondary indexes). PolarDB-X will handle all of these issues, allowing users to use it as a standalone database without awareness.

Existing Data Synchronization

PolarDB-X performs existing data synchronization in segments. For each segmented synchronization, PolarDB-X attempts to obtain the S lock of the source data within a TSO transaction before writing to the destination. If the destination segment contains the same data, it indicates that the data has been synchronized during the incremental dual-writing phase and can be ignored. However, distributed transactions, like standalone transactions, can cause deadlocks. When adding an S lock to a segment of the original table during the existing data synchronization, a large volume of business update traffic may lead to distributed deadlocks. Therefore, PolarDB-X provides a distributed deadlock detection module to address this issue. After the deadlock is released, the existing data synchronization module retries the operation.

How to Switch Traffic

Online Schema Change

First, let's take a look at how the Orphan Data Anomaly issue as mentioned earlier occurs. When incremental data dual writing is initiated, the metadata in the memory of PolarDB-X compute nodes is not refreshed simultaneously but in a sequence. Therefore, there is always a period when some compute nodes have started dual writing while others have not. This leads to the following situation:

  1. The compute node CN0 has started dual writing and inserted three records into the old table and the new table respectively, as shown in the following figure.
  2. The compute node CN1 has not started dual writing but executed a delete statement, that is, delete a record with id=3 from the old table, but the data in the new table has not been deleted.
  3. The data in the old table and the new table become inconsistent.

4

This issue has been discussed in detail in Google's paper Online, Asynchronous Schema Change in F1. PolarDB-X introduces Online Schema Change [8] to address such issues. For more information, please refer to the previous articles. For repartition, PolarDB-X introduces the states as shown in the following figure to ensure that any two adjacent states are compatible and avoid data consistency issues. Specifically, let's look at some of the most critical states:

target_delete_only and target_write_only: As mentioned above, when we have multiple compute nodes, directly enabling incremental double writing will cause Orphan Data Anomaly. Therefore, before enabling dual writing, all compute nodes should reach the target_delete_only state first and then the target_write_only state (which is the dual writing state). Under the target_delete_only state, compute nodes will only execute delete statements (update statements will be converted to delete ones before executing). For example, in the above figure: CN1 reaches the target_delete_only state first, so even the dual writing is not enabled, it can still delete the data with id=3 from the new table to ensure data consistency.

source_delete_only and source_absent: As discussed earlier, directly stopping dual writing of the old table can cause data inconsistency. Therefore, PolarDB-X introduces the source_delete_only state before source_absent (the state when dual writing is stopped). It also ensures that the Orphan Data Anomaly issue will not occur when the old table is offline.

5

How to Control the Overall Repartition Process and Ensure Stability

PolarDB-X provides users with the ability to change sharding rules (that is, repartition) through DDL. However, the ACID of DDL should also be guaranteed, and it may take a long time to repartition data, so system failures due to power outages or other reasons are inevitable.

DDL Engine

PolarDB-X also implements a stable DDL execution framework that divides DDL into many steps, each of which is idempotent. This ensures that the DDL task can be interrupted at any time and then resumed or rolled back. By linking all steps through DDL and excluding all manual operations, developers no longer need to design repartition plans or manually perform database operations in the middle of the night.

Distributed Metadata Deadlock Detection

After MySQL introduced Online DDL capabilities in version 5.7, DDL could run more effectively with read and write transactions, which is significantly improved compared with previous versions. The basic principle of Online DDL is to obtain MDL only at critical moments, rather than holding an MDL throughout the entire DDL process. When executing repartition, PolarDB-X also obtains MDLs in multiple stages, allowing higher concurrency for transactions. However, MDLs are fair locks and may cause metadata deadlocks.

Obtaining MDL locks multiple times improves performance but increases the possibility of metadata deadlocks. Once a metadata deadlock occurs, all subsequent read and write transactions are blocked. The default MDL timeout in MySQL is one year, which poses a greater risk than ordinary data deadlocks. Therefore, PolarDB-X provides a distributed metadata deadlock detection module to release distributed metadata deadlocks at critical moments.

Summary

Flexible sharding rule change capability is crucial for distributed databases. PolarDB-X supports three types of tables: non-partitioned tables, broadcast tables, and sharded tables. With sharding rule change capabilities, users can convert data tables into any of these types to better adapt to business growth. In addition to sharding rule change capabilities, PolarDB-X also ensures strong data consistency, high availability, transparency to businesses, and ease of use. This article briefly discusses the various technical points used in PolarDB-X to implement sharding rule changes. As can be seen, integrating sharding rule change capabilities into the database kernel is necessary to solve many data consistency issues. This is also one of the distinguishing features of distributed databases compared with distributed database middleware.

The sharding rule change capability is just one of many features in PolarDB-X. For more information, please refer to other articles about PolarDB-X.

References

  1. Asymmetric-Partition Replication for Highly Scalable Distributed Transaction Processing in Practice
  2. Online, Asynchronous Schema Change in F1
  3. What's Really New with NewSQL
  4. https://dev.mysql.com/doc/refman/5.6/en/innodb-online-ddl.html
  5. https://dev.mysql.com/doc/refman/5.6/en/metadata-locking.html
  6. https://zhuanlan.zhihu.com/p/289870241
  7. https://zhuanlan.zhihu.com/p/329978215
  8. https://zhuanlan.zhihu.com/p/341685541
  9. https://zhuanlan.zhihu.com/p/346026906

Try out database products for free:

lQLPJw7V5gCNgtfNBITNCvSwSh_pHTRWM4UGiQoky9W4AA_2804_1156

0 1 0
Share on

ApsaraDB

439 posts | 93 followers

You may also like

Comments

ApsaraDB

439 posts | 93 followers

Related Products

  • PolarDB for MySQL

    Alibaba Cloud PolarDB for MySQL is a cloud-native relational database service 100% compatible with MySQL.

    Learn More
  • Database for FinTech Solution

    Leverage cloud-native database solutions dedicated for FinTech.

    Learn More
  • Lindorm

    Lindorm is an elastic cloud-native database service that supports multiple data models. It is capable of processing various types of data and is compatible with multiple database engine, such as Apache HBase®, Apache Cassandra®, and OpenTSDB.

    Learn More
  • Oracle Database Migration Solution

    Migrate your legacy Oracle databases to Alibaba Cloud to save on long-term costs and take advantage of improved scalability, reliability, robust security, high performance, and cloud-native features.

    Learn More