By Ziyang
As a cloud-native distributed database compatible with MySQL, PolarDB-X aims not merely to match MySQL's behavior, but to leverage cloud-native architecture and distributed extensibility to deliver capabilities beyond what MySQL offers. This article focuses on the advanced capabilities of PolarDB-X CDC.
For more information about PolarDB-X CDC, refer to our previous articles.
The PolarDB-X CDC component writes binlog files to local disks first and can upload them to remote storage such as Object Storage Service (OSS) in real time. Binlog files are typically kept on local disks (hot files) for a short period and retained on remote storage (cold files) for a longer period, such as 15 days. The CDC component provides a transparent consumption feature that hides the storage difference between local disks and remote storage, so downstream systems can consume binlog files on remote storage without any extra adaptation.
For example, after logging on to a PolarDB-X CDC node and entering the /home/admin/binlog directory, you can see that binlog files numbered below 65 have been archived to OSS.
Local binlog files
Binlog files archived in OSS
If you execute show binary logs on the PolarDB-X command line, the files numbered below 65 are still visible. If you execute show full binary logs, detailed information about each binlog file is displayed, including the start and end times of its binlog events, the upload status, and the storage location.
You can use the mysqlbinlog tool to transparently consume files that have been archived in OSS, or use show binlog events to achieve the same effect.
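For example, the following standard MySQL statements work unchanged against archived files (a minimal sketch; the file name below is hypothetical, pick any archived file listed by show full binary logs):
show full binary logs;
show binlog events in 'binlog.000032' from 4 limit 10;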
Transparent consumption is also fast, with throughput exceeding 200 MB/s.
In addition, the built-in SQL flashback feature lets you select the time range to be flashed back. Whether the binlog files are stored on local disks, archived in OSS, or both, they can be flashed back as long as the files still exist, which meets data repair and data audit requirements.
MySQL primary/secondary replication has an "annoying" design: it supports consumption based on binlog_file_name + binlog_file_position, but it cannot guarantee binlog file continuity between master and slave nodes (that is, binlog files with the same name may have different contents on the master and on the slave). When a primary/secondary switchover occurs, downstream consumers therefore cannot resume from their previous positions, and various additional coordination mechanisms have to be introduced. For example:
1. Coordination Mechanism Based on Global Administrator
This approach is mainly used in MySQL HA solutions. A typical product is MHA, which monitors the binlog replication status of every node in the MySQL cluster. When an HA switchover is required, it rebuilds the replication relationships between nodes based on the new binlog_file_name and binlog_file_position through a series of log compensation mechanisms and switchover algorithms. The details are omitted here; for more information, refer to the link and the figure.
2. Auto-coordination Mechanism Based on Timestamp
This approach is mainly used for automatic recovery in data synchronization processes. A typical product is Alibaba Canal, which configures multiple nodes of a MySQL cluster in a data synchronization task and automatically switches to another node when one node becomes unavailable. The binlog positions on the new node are inconsistent with those on the old one, so Canal provides a timestamp-based backtracking mechanism: it traverses the binlog files to find a position whose event timestamp is earlier than that of the last event consumed before the HA switchover. In other words, it tolerates repeated consumption of binlog events within a time range to ensure that no data is lost. Because events can be consumed more than once in this mode, the consumer must be idempotent. The details are omitted here; for more information, refer to the link and the figure.
In response to this "annoying" problem, MySQL introduced GTID-based replication (available since MySQL 5.6) to make up for the shortcomings of the binlog_file_name + binlog_file_position scheme. GTIDs are globally unique, so they shield the binlog file differences between nodes and greatly reduce this complexity.
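For reference, a minimal sketch of enabling GTID-based, position-free replication on a MySQL replica (host and credentials are placeholders, and gtid_mode must be ON on both sides):
change master to
  master_host = 'source_host',
  master_user = 'repl',
  master_password = '***',
  master_auto_position = 1;  -- resume by GTID set rather than binlog_file_name + binlog_file_position
start slave;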
However, this convenience is available only when binlogs are consumed by connecting directly to the MySQL server process. Once binlogs have been archived, they must still be managed and addressed by file name, and in practice the archived binlog files are likely to be discontinuous (for example, after HA switchovers, node rebuilds, or cross-machine configuration changes).
As shown in the following figure, two nodes of an RDS MySQL instance have undergone cross-machine migration. After migration, the binlog files of the two new nodes are numbered from 000001 again. Therefore, there is log data overlap during a certain time range between the new 000001 file and the last binlog file of the old node.
In this scenario, a downstream consumer that wants to start from a specific historical time point still has to deal with this potential inconsistency, which is very troublesome. With PolarDB-X binlogs, however, you do not have to worry about such problems at all: regardless of O&M actions (upgrades, configuration changes, rebuilds, and so on), PolarDB-X always keeps the binlog file sequence continuous and provides a transparent, seamless experience for downstream consumers.
A common data synchronization workflow is to first migrate full data as of a specific time point and then, once the full migration is complete, migrate incremental data by backtracking binlog files from the time the full migration started. In architectures built around the idea of "event sourcing", it is also common to consume binlogs from a very early time point (or even the starting point) to construct new application data. Both scenarios require binlog files to be retained for a long time. For example, when you use Data Transmission Service (DTS) to migrate data, the product documentation provides the following prompt:
To retain binlog data for a long time, users can build their own binlog store, but this introduces additional development and O&M costs. With PolarDB-X and its seamless, continuous consumption capability, users do not need to worry about this problem at all. That is the advantage of cloud native: transparent and simple.
Finally, how do users enable transparent consumption?
If you are using the commercial version of PolarDB-X, transparent consumption of binlogs is enabled by default (out of the box). If you are using the community version, you need to enable it manually, which is also simple: after the instance is up, execute the following SQL statements in sequence:
-- Currently, two types of backup storage are supported: OSS and LINDORM
-- If you choose OSS, run the following commands
stop master;
set global cdc_binlog_backup_type=OSS;
set global cdc_oss_endpoint=xxx;
set global cdc_oss_bucket_name=xxx;
set global cdc_oss_access_key_id=xxx;
set global cdc_oss_access_key_secret=xxx;
start master;
-- If you choose LINDORM, run the following commands
stop master;
set global cdc_binlog_backup_type=LINDORM;
set global cdc_lindorm_endpoint=xxx;
set global cdc_lindorm_thrift_port=xxx;
set global cdc_lindorm_s3_port=xxx;
set global cdc_lindorm_bucket_name=xxx;
set global cdc_lindorm_access_key_id=xxx;
set global cdc_lindorm_access_key_secret=xxx;
start master;
Q: How many DNs can PolarDB-X support at most?
A: Theoretically, no limit is imposed. The largest test we've performed so far is 1,024 nodes.
Q: How many CDC nodes can PolarDB-X support at most?
A: Theoretically, no limit is imposed. However, global binlogs (binlogs in single-stream mode) have a single-point bottleneck. You cannot achieve linear scaling by adding nodes. For example, when the number of DNs increases, the latency of binlogs in single-stream mode may also increase. Or, even if global binlogs do not have performance bottlenecks, the downstream consumption program may have already reached the upper limit.
Q: How does PolarDB-X solve the single-point bottleneck of binlogs in single-stream mode?
A: PolarDB-X data tables can be partitioned and distributed across multiple DNs, and binlog streams can also be partitioned and distributed across multiple CDC nodes. These are the binlogs in multi-stream mode (Binlog-X mode) provided by PolarDB-X. For more information, please refer to the introduction.
Q: What instance sizes are PolarDB-X binlogs in multi-stream mode suitable for?
A: There is no exact answer to this question. As the number of DNs and the write traffic keep increasing, the chance of hitting the bottleneck of the single-stream replication process increases; it is also possible that the downstream consumer, rather than the binlog itself, hits its performance limit first. Generally speaking, when the number of DNs is less than or equal to 50, the latency of binlog generation can be kept within 500 ms regardless of the traffic, and up to 70 binlog files of 500 MB each can be generated per minute (roughly 580 MB/s of binlog output).
Q: Is the threshold for using binlogs in multi-stream mode in PolarDB-X high?
A: Binlogs in single-stream mode are fully compatible with MySQL. Binlogs in multi-stream mode are also highly compatible with MySQL. For more information, please refer to the introduction. In summary, whether it is a binlog in single-stream or multi-stream mode, for each log stream, PolarDB-X can provide a user experience that is fully compatible with MySQL.
Note: there is an extra "s" here. Each binlog stream is a Master, and a PolarDB-X instance is a superset of Masters.
Based on PolarDB-X CDC binlogs, you can configure data replication not only from PolarDB-X to MySQL but also from PolarDB-X to PolarDB-X. In the former case, PolarDB-X presents itself as MySQL; in the latter, it presents itself as PolarDB-X.
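For instance, because the global binlog speaks the MySQL binlog dialect, a stand-alone MySQL replica can be pointed at a PolarDB-X CN endpoint with ordinary replication statements. The following is a minimal sketch in which the host, credentials, and starting position are placeholders:
change master to
  master_host = 'polardbx-cn-host',
  master_port = 3306,
  master_user = 'repl',
  master_password = '***',
  master_log_file = 'binlog.000001',
  master_log_pos = 4;
start slave;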
Now the question is: if the following DDL statement is executed in PolarDB-X, how should it be represented in the binlog?
create table t1(
  a int primary key,
  b int,
  c int,
  d varchar(10),
  global index (b) partition by key(b) partitions 2,
  unique global index g2(c) partition by key(c) partitions 2,
  unique global key g3(c) partition by key(c) partitions 2,
  global unique index g4(c) partition by key(c) partitions 2,
  global unique key g5(c) partition by key(c) partitions 2)
partition by key(a) partitions 3
If the SQL statement is stored as it is in the binlog file, a replication error will be reported in the downstream MySQL database because MySQL cannot recognize the PolarDB-X-specific syntax (global index...) in the SQL statement. However, if the PolarDB-X-specific syntax in the SQL statement is removed, the downstream PolarDB-X cannot completely replay the DDL operations of the upstream PolarDB-X.
The solution is simple but effective: strip the PolarDB-X-specific syntax from the DDL statement, convert it into a form that stand-alone MySQL can execute, output that form for stand-alone MySQL and related synchronization tools to consume, and attach the original statement to the output as a comment for PolarDB-X to use. If a statement exists only in a PolarDB-X-specific form, only the commented form is output.
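For example, here is a purely illustrative sketch of the idea; the exact comment marker and output formatting used by PolarDB-X CDC may differ:
-- MySQL-compatible form written to the binlog: PolarDB-X-specific clauses are removed
-- and the global (unique) indexes degenerate into ordinary (unique) keys
create table t1(a int primary key, b int, c int, d varchar(10), index (b), unique key g2(c), unique key g3(c), unique key g4(c), unique key g5(c));
-- Original statement attached as a comment so a downstream PolarDB-X can replay the complete DDL
-- (shortened here; the marker name below is hypothetical)
# POLARX_ORIGIN_SQL: create table t1(a int primary key, ..., global index (b) partition by key(b) partitions 2, ...) partition by key(a) partitions 3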
You can deploy PolarDB-X instances by using the three data centers across two regions architecture, as shown in the following figure. For more information, please refer to the introduction.
With this architecture, one of the core technical challenges is cross-region data replication: facing a cross-region network RTT of about 30 ms, how can we maximize the data transmission speed to reduce RPO? The following are several common approaches:
1. Compress data before transmission.
Compressing data before transmission is a common way to reduce the amount of data sent over the network, and MySQL provides the corresponding capability: when configuring data replication, you can enable compressed transmission by properly setting MASTER_COMPRESSION_ALGORITHMS and MASTER_ZSTD_COMPRESSION_LEVEL. For more information, please refer to the introduction. Compression certainly reduces bandwidth consumption, but compression itself takes time and may therefore hurt replication latency. For example, if compressing 100 MB of data takes 5 s while the network could transmit 200 MB of raw data in those 5 s, compression actually slows the transfer down. In different scenarios, the compression strategy therefore needs to be weighed against factors such as network bandwidth, network RTT, data growth rate, compression speed, and latency requirements.
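As a reference, a minimal sketch of enabling compressed replication on a MySQL replica (the algorithm and level shown are illustrative choices, not recommendations):
change master to
  master_compression_algorithms = 'zstd',   -- permitted values: 'zlib', 'zstd', 'uncompressed'
  master_zstd_compression_level = 3;        -- 1 (fastest) to 22 (highest compression)
start slave;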
2. Increase the parallelism of data replication.
When network bandwidth is sufficient, increasing the parallelism of data transmission can effectively increase the transmission speed and smooth out the bottleneck caused by network latency. A typical case here is Alibaba Otter: to address the bottleneck of cross-continent transmission between data centers in China and the United States, Otter adopts the strategy of sending data in parallel at the sending end and re-serializing and reordering it at the receiving end, which increases the amount of data in flight on the network per unit time. For more information, please refer to the Otter scheduling model.
3. Customized optimization at the product level, such as protocol optimization
With the three data centers across two regions architecture of PolarDB-X, cross-region Paxos data transmission and binlog data transmission take place between DNs. Moreover, data is transmitted in parallel across different DN clusters, and the transmission within each DN cluster also supports compression and other mechanisms, which together better meet low-latency replication requirements. Besides replication between DNs, data replication between the primary instance and the remote secondary instance is also essential, and it must be completed based on the global binlog provided by PolarDB-X CDC. This places higher requirements on CDC:
For the above requirements, there are two implementation ideas. One idea is to directly increase the number of CDC nodes. CDC nodes can be deployed in each data center to meet Requirement 1. The cross-node binlog replication capability provided by CDC itself can meet Requirement 2. Rack awareness capability and appropriate transformation can meet Requirement 3, as shown in the following figure:
The preceding solution is prone to the performance bottleneck of cross-region data transmission. Since the binlogs in single-stream mode integrate the binlog data of all DNs, single-process cross-region transmission is not an ideal choice. Even if the compression capability is introduced, the drawback is still obvious. The following test case demonstrates that more clearly.
Because the test needs to generate a large number of binlogs, choose a higher disk specification when purchasing the instance so that the disk does not become the bottleneck of the test. Note that this test does not add extra data disks; the selected system disk is an ESSD (500 GB, PL2) with a maximum throughput of about 370 MB/s.
• You must ensure that the CDC cluster of the instance has nodes on both A and B and that the replication direction of Dumper is A to B. For the sample topology configuration file, please refer to the Appendix.
• After the cluster is up, set the binlog_purge_enabled parameter to false to disable automatic cleanup of binlog files, as sketched below.
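A minimal sketch, assuming this parameter is set with SET GLOBAL in the same way as the CDC parameters shown earlier (the exact setting method may differ by version):
set global binlog_purge_enabled = false;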
Run a few rounds of TPC-C tests to construct enough binlog files, then clear the binlog files on B, start replication again from file 000001, and observe the BPS metric.
Execute the following command on A to simulate network latency: tc qdisc add dev eth0 root netem delay xxms (when running it again to change the delay, replace add with replace). You can use the ping command to verify that the delay has taken effect, as shown below:
Under different PolarDB-X versions, different environments, and different TCP protocol parameters, the obtained data may be quite different. The main focus should be on the changing trend of data replication throughput under different network latency conditions.
Injected Delay | Compression | BPS | Binlog Files Synchronized per Minute
None | No | 385 MB/s | 45
10 ms | No | 225 MB/s | 26
20 ms | No | 136 MB/s | 16
30 ms | No | 96 MB/s | 11
40 ms | No | 75 MB/s | 9
50 ms | No | 60 MB/s | 7
None | Yes (LZ4) | 300 MB/s | 33
10 ms | Yes (LZ4) | 170 MB/s | 20
20 ms | Yes (LZ4) | 125 MB/s | 14
30 ms | Yes (LZ4) | 84 MB/s | 10
40 ms | Yes (LZ4) | -- | --
50 ms | Yes (LZ4) | -- | --
Notes:
1. gRPC uses Gzip compression by default, and its performance is relatively low: with no delay injected, only 4 to 5 binlog files can be synchronized per minute.
2. After switching to the LZ4 compression algorithm, performance improves markedly compared with Gzip, but it shows no advantage over uncompressed transmission; throughput is even slightly lower.
3. Enabling compression does not achieve the expected result and even lowers performance, mainly because the time spent on compression and decompression exceeds the network transmission time it saves. Introducing multi-threaded replication and asynchronous processing may help, but this requires further adjustment and testing at the program level.
Therefore, it is not an ideal choice to simply increase the number of CDC nodes to meet the requirements of three data centers across two regions.
A new idea: theoretically, a PolarDB-X instance can have an unlimited number of CDC clusters. We can therefore pull up an independent CDC cluster in the remote region and build single-stream binlogs there, independent of the central region, directly from the physical binlogs of the remote DNs. The single-stream binlogs can then be consumed locally in the remote region and no longer need to be transmitted across regions.
Multi-cluster support is precisely a basic capability that CDC reserved for future extensibility at the beginning of its design: most CDC metadata tables carry a cluster_id field so that multiple clusters can run at the same time.
Specifically, PolarDB-X can provide an unlimited number of clusters of binlogs in single-stream or multi-stream mode. Moreover, PolarDB-X Columnar clusters and Replica clusters are also managed through the multi-cluster capability of CDC, which makes them very convenient to extend. Independent binlog clusters deployed in different regions look as follows:
The CDC clusters in the two regions can generate the same binlogs in single-stream mode, but they are independent of each other.
This article has summarized some advanced capabilities of PolarDB-X CDC, but these are by no means its final goal. Guided by simplicity, ecosystem compatibility, and flexible scaling, PolarDB-X takes advantage of the cloud to provide users with easy-to-use features. Based on the two basic capabilities provided by CDC, binlog and replica, we are building GDN, a native cross-region and multi-cloud disaster recovery capability. Stay tuned!
Sample PXD configuration file
version: v1
type: polardbx
cluster:
  name: pxc_test
  gms:
    image: polardbx/polardbx-engine:latest
    host_group: [172.16.0.236]
    resources:
      mem_limit: 16G
  cn:
    image: polardbx/polardbx-sql:latest
    replica: 1
    nodes:
      - host: 172.16.0.236
    resources:
      mem_limit: 32G
  dn:
    image: polardbx/polardbx-engine:latest
    replica: 2
    nodes:
      - host_group: [172.16.0.236]
      - host_group: [172.16.0.236]
      - host_group: [172.16.0.236]
    resources:
      mem_limit: 32G
  cdc:
    image: polardbx/polardbx-cdc:latest
    replica: 2
    nodes:
      - host: 172.16.0.236
      - host: 172.16.0.235
    resources:
      mem_limit: 16G