The MySQL data source provides bidirectional channels for reading and writing MySQL data. This topic describes the MySQL data synchronization capabilities supported by DataWorks.
Supported MySQL versions
Offline read and write:
MySQL 5.5.x, MySQL 5.6.x, MySQL 5.7.x, and MySQL 8.0.x are supported, compatible with Amazon RDS for MySQL, Azure MySQL, and Amazon Aurora MySQL.
Offline synchronization supports reading from views.
Real-time reading:
Real-time reading of MySQL data in Data Integration is implemented based on real-time subscription to MySQL. It currently supports only MySQL 5.5.x, 5.6.x, 5.7.x, and 8.0.x (excluding new features introduced in 8.0, such as functional indexes; only existing features are compatible), and is compatible with Amazon RDS for MySQL, Azure MySQL, and Amazon Aurora MySQL.
If you need to synchronize DRDS MySQL, do not configure it as a MySQL data source. Instead, see the Configure DRDS data source topic and configure it directly as a DRDS data source.
Limits
Real-time reading
Synchronization of data from MySQL read-only instances is not supported.
Synchronization of tables containing functional indexes is not supported.
XA ROLLBACK is not supported.
For transaction data that has entered XA PREPARE, real-time synchronization writes it to the destination. If XA ROLLBACK then occurs, real-time synchronization does not roll back the writes for the XA PREPARE data. To handle XA ROLLBACK scenarios, manually remove the affected table from the real-time synchronization task, then add the table back and resynchronize it.
Only MySQL servers whose binlog format is set to ROW are supported.
Real-time synchronization does not synchronize records in associated tables that are deleted by cascade operations.
For Amazon Aurora MySQL databases, you must connect to the primary (writer) instance, because AWS does not allow enabling the binlog feature on Aurora MySQL read replicas, and real-time synchronization tasks require the binlog to capture incremental updates.
Real-time synchronization of online DDL changes supports only adding columns (ADD COLUMN) to MySQL tables through Data Management (DMS).
Reading stored procedures from MySQL is not supported.
Offline reading
When the MySQL Reader plugin synchronizes multiple tables, such as in sharding scenarios, a single table is split only if the number of concurrent tasks is greater than the number of tables. Otherwise, the number of split tasks equals the number of tables.
Reading stored procedures from MySQL is not supported.
Supported field types
For the complete list of field types for various MySQL versions, see the MySQL official documentation. The following table lists the current support status of major fields using MySQL 8.0.x as an example.
Field type | Offline reading (MySQL Reader) | Offline writing (MySQL Writer) | Real-time reading | Real-time writing |
TINYINT | ||||
SMALLINT | ||||
INTEGER | ||||
BIGINT | ||||
FLOAT | ||||
DOUBLE | ||||
DECIMAL/NUMERIC | ||||
REAL | ||||
VARCHAR | ||||
JSON | ||||
TEXT | ||||
MEDIUMTEXT | ||||
LONGTEXT | ||||
VARBINARY | ||||
BINARY | ||||
TINYBLOB | ||||
MEDIUMBLOB | ||||
LONGBLOB | ||||
ENUM | ||||
SET | ||||
BOOLEAN | ||||
BIT | ||||
DATE | ||||
DATETIME | ||||
TIMESTAMP | ||||
TIME | ||||
YEAR | ||||
LINESTRING | ||||
POLYGON | ||||
MULTIPOINT | ||||
MULTILINESTRING | ||||
MULTIPOLYGON | ||||
GEOMETRYCOLLECTION |
Preparation for data synchronization: MySQL environment preparation
Before performing data synchronization in DataWorks, prepare the MySQL environment as described in this section so that MySQL data synchronization tasks can be configured and run normally in DataWorks. The following describes the environment preparation required for MySQL synchronization.
Confirm MySQL version
Data Integration has requirements on the MySQL version. See the Supported MySQL versions section above to check whether the MySQL instance to be synchronized meets the version requirements. You can query the current MySQL version by running the following statement in the MySQL database.
SELECT version();
Configure account permissions
It is recommended that you plan and create a dedicated MySQL account for DataWorks to access the data source in advance. Proceed as follows.
Optional: Create an account.
For operation details, see Create MySQL account.
Configure permissions.
Offline
In offline synchronization scenarios:
When reading MySQL data offline, the account needs the read (SELECT) permission on the tables to be synchronized.
When writing MySQL data offline, the account needs the write (INSERT, DELETE, UPDATE) permissions on the tables to be synchronized.
Real-time
In real-time synchronization scenarios, the account needs the SELECT, REPLICATION SLAVE, and REPLICATION CLIENT permissions on the database.
You can refer to the following commands to grant these permissions to the account, or directly grant the account the SUPER permission. When you run the following statements, replace 'sync account' with the account created above.
-- Create the sync account and set a password, if the account does not exist yet. % allows login from any host.
-- CREATE USER 'sync account'@'%' IDENTIFIED BY 'password';
-- Grant SELECT, REPLICATION SLAVE, and REPLICATION CLIENT permissions to the sync account.
GRANT SELECT, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'sync account'@'%';
*.* means the permissions are granted to the sync account on all tables in all databases. You can also grant the permissions on specific tables in a target database. For example, to grant them on the user table in the test database, use the statement GRANT SELECT, REPLICATION CLIENT ON test.user TO 'sync account'@'%';. Note that REPLICATION SLAVE is a global permission and cannot be granted to the sync account on specific tables in a target database.
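After granting permissions, you can verify what the sync account actually holds. A quick check, assuming the account name used above:

```sql
-- List the privileges granted to the sync account.
SHOW GRANTS FOR 'sync account'@'%';
-- For real-time synchronization, the output should include a line similar to:
-- GRANT SELECT, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO `sync account`@`%`
```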
(Required only for real-time synchronization) Enable MySQL binlog
Data Integration achieves real-time incremental data synchronization through real-time subscription to MySQL Binlog. You need to enable the MySQL Binlog service before configuring synchronization on DataWorks. The operation is as follows:
While the binlog is being consumed, the database cannot purge it. If a real-time synchronization task lags, the source binlog may be held for a long time. Configure the task's latency alert appropriately and monitor the database's disk space.
Retain the binlog for at least 72 hours. Otherwise, if a task fails and the binlog has already been purged, the task cannot be reset to a position before the failure when it restarts, and data is lost. In that case, only a full offline synchronization can backfill the data.
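On a self-managed MySQL 8.0 server, you can check and, if needed, extend the binlog retention period as sketched below (on RDS or PolarDB, configure retention in the respective console instead; MySQL 5.7 uses the expire_logs_days variable):

```sql
-- Check the current binlog retention in seconds (MySQL 8.0).
SHOW VARIABLES LIKE 'binlog_expire_logs_seconds';
-- Retain binlogs for at least 72 hours (259200 seconds).
SET GLOBAL binlog_expire_logs_seconds = 259200;
```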
Check if Binlog is enabled.
Use the following statement to check if Binlog is enabled.
SHOW variables LIKE "log_bin";
When the return result is ON, it indicates that Binlog is enabled.
If you synchronize data from a standby database, you can also check whether Binlog is enabled using the following statement.
SHOW variables LIKE "log_slave_updates";
When the return result is ON, it indicates that Binlog is enabled on the standby database.
If the returned result does not match the above results:
For open-source MySQL, refer to the MySQL official documentation to enable Binlog.
For Alibaba Cloud RDS MySQL, refer to RDS MySQL log backup to enable Binlog.
For Alibaba Cloud PolarDB MySQL, refer to Enable Binlog to enable Binlog.
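For a self-managed (open-source) MySQL server, the binlog settings required in this topic can be placed in the server configuration file. A minimal sketch (option names per the MySQL reference manual; the server must be restarted after editing):

```ini
[mysqld]
# Enable the binlog; the value is the base name of the binlog files.
log_bin                    = mysql-bin
server_id                  = 1        # Must be unique among servers in the replication topology.
binlog_format              = ROW      # DataWorks real-time synchronization requires ROW.
binlog_row_image           = FULL     # Full before/after row images are required.
binlog_expire_logs_seconds = 259200   # Retain binlogs for at least 72 hours (MySQL 8.0).
```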
Query the Binlog format.
Use the following statement to query the Binlog format.
SHOW variables LIKE "binlog_format";
Explanation of return results:
Returns ROW, indicating that the enabled Binlog format is ROW.
Returns STATEMENT, indicating that the enabled Binlog format is STATEMENT.
Returns MIXED, indicating that the enabled Binlog format is MIXED.
DataWorks real-time synchronization supports only MySQL servers whose binlog format is set to ROW. If the returned value is not ROW, modify the binlog format.
Check if the complete Binlog log is enabled.
Use the following statement to check if the complete Binlog log is enabled.
SHOW variables LIKE "binlog_row_image";
Explanation of return results:
Returns FULL, indicating that the complete log is enabled for Binlog.
Returns MINIMAL, indicating that minimal logging is enabled for Binlog and full logging is not enabled.
DataWorks real-time synchronization supports only MySQL servers with full binlog logging enabled. If the query returns a value other than FULL, modify the binlog_row_image configuration.
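The three checks above can be combined into a single query. A sketch:

```sql
-- Check the binlog switch, format, and row image in one statement.
SHOW VARIABLES WHERE Variable_name IN
  ('log_bin', 'binlog_format', 'binlog_row_image');
-- Real-time synchronization requires:
-- log_bin = ON, binlog_format = ROW, binlog_row_image = FULL.
```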
OSS binlog read authorization configuration
When adding a MySQL data source, if Configuration Mode is set to Alibaba Cloud Instance Mode and the RDS MySQL instance is in the same region as the DataWorks workspace, you can enable Support OSS Binlog Reading. After you enable it, when the RDS binlog cannot be accessed, the task attempts to obtain the binlog from OSS to avoid interrupting real-time synchronization.
If the selected OSS Binlog Access Identity is Alibaba Cloud RAM User or Alibaba Cloud RAM Role, you need to configure account authorization as follows.
Alibaba Cloud RAM user.
Log in to the RAM Access Control-Users console and find the user that needs authorization. Specific operation:
In the Actions column, click Add Permissions.
Configure the following key parameters, then click Confirm.
Resource Scope: Account level.
Policy: System policy.
Policy Name: AliyunDataWorksAccessingRdsOSSBinlogPolicy.
Alibaba Cloud RAM role.
Log in to the RAM Access Control-Roles console and create a RAM role. For specific operations, see Create a RAM role with a trusted entity of an Alibaba Cloud account.
Key parameters:
Trusted Entity Type: Alibaba Cloud account.
Trusted Entity Name: Other Alibaba Cloud Account; enter the Alibaba Cloud account that owns the DataWorks workspace.
Role Name: Custom.
Grant precise permissions to the created RAM role. For specific operations, see Grant permissions to a RAM role.
Key parameters:
Policy: System policy.
Policy Name: AliyunDataWorksAccessingRdsOSSBinlogPolicy.
Modify the trust policy for the created RAM role. For specific operations, see Modify the trust policy of a RAM role.
{
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Effect": "Allow",
      "Principal": {
        "Service": [
          "@cdp.aliyuncs.com",
          "@dataworks.aliyuncs.com"
        ]
      }
    }
  ],
  "Version": "1"
}
Add a data source
Before you develop a synchronization task in DataWorks, you must add the required data source to DataWorks by following the instructions in Add and manage data sources. You can view the infotips of parameters in the DataWorks console to understand the meanings of the parameters when you add a data source.
Data synchronization task development: MySQL synchronization process guide
For information about the entry point for and the procedure of configuring a synchronization task, see the following configuration guides.
Single table offline synchronization task configuration guide
For the operation process, see Configure offline synchronization tasks using the codeless UI or Configure offline synchronization tasks using the code editor.
For the full parameters and script demo in code editor mode, see Appendix: MySQL script demo and parameter description below.
Single table real-time synchronization task configuration guide
For the operation process, see DataStudio real-time synchronization task configuration.
Configuration guide for database-level synchronization such as offline database, real-time full and incremental database, and real-time sharding database
For the operation process, see Data Integration synchronization task configuration.
FAQ
For more common Data Integration issues, see Common issues in Data Integration.
Appendix: MySQL script demo and parameter description
Configure a batch synchronization task by using the code editor
If you want to configure a batch synchronization task by using the code editor, you must configure the related parameters in the script based on the unified script format requirements. For more information, see Configure a batch synchronization task by using the code editor. The following information describes the parameters that you must configure for data sources when you configure a batch synchronization task by using the code editor.
Reader script demo
This document provides configuration examples for single database single table and sharding:
The comments in the JSON examples in this topic are only intended to explain important parameters. Remove the comments before use.
Configure single database single table
{
  "type": "job",
  "version": "2.0",//Version number.
  "steps": [
    {
      "stepType": "mysql",//Plugin name.
      "parameter": {
        "column": [//Column names.
          "id"
        ],
        "connection": [
          {
            "querySql": [
              "select a,b from join1 c join join2 d on c.id = d.id;"
            ],
            "datasource": ""//Data source name.
          }
        ],
        "where": "",//Filter condition.
        "splitPk": "",//Split key.
        "encoding": "UTF-8"//Encoding format.
      },
      "name": "Reader",
      "category": "reader"
    },
    {
      "stepType": "stream",
      "parameter": {},
      "name": "Writer",
      "category": "writer"
    }
  ],
  "setting": {
    "errorLimit": {
      "record": "0"//Number of error records.
    },
    "speed": {
      "throttle": true,//When throttle is false, the mbps parameter is ineffective (no rate limiting); when throttle is true, the rate is limited.
      "concurrent": 1,//Number of concurrent jobs.
      "mbps": "12"//Rate limit; 1 mbps = 1 MB/s.
    }
  },
  "order": {
    "hops": [
      {
        "from": "Reader",
        "to": "Writer"
      }
    ]
  }
}
Configure sharding
Sharding means that multiple MySQL tables with identical schemas are selected on the MySQL Reader side and written to the same destination table. To configure database-level sharding, create a task on the Data Integration page and select the database sharding capability.
{
  "type": "job",
  "version": "2.0",
  "steps": [
    {
      "stepType": "mysql",
      "parameter": {
        "indexes": [
          {
            "type": "unique",
            "column": [
              "id"
            ]
          }
        ],
        "envType": 0,
        "useSpecialSecret": false,
        "column": [
          "id",
          "buyer_name",
          "seller_name",
          "item_id",
          "city",
          "zone"
        ],
        "tableComment": "Test order table",
        "connection": [
          {
            "datasource": "rds_dataservice",
            "table": [
              "rds_table"
            ]
          },
          {
            "datasource": "rds_workshop_log",
            "table": [
              "rds_table"
            ]
          }
        ],
        "where": "",
        "splitPk": "id",
        "encoding": "UTF-8"
      },
      "name": "Reader",
      "category": "reader"
    },
    {
      "stepType": "odps",
      "parameter": {},
      "name": "Writer",
      "category": "writer"
    },
    {
      "name": "Processor",
      "stepType": null,
      "category": "processor",
      "copies": 1,
      "parameter": {
        "nodes": [],
        "edges": [],
        "groups": [],
        "version": "2.0"
      }
    }
  ],
  "setting": {
    "executeMode": null,
    "errorLimit": {
      "record": ""
    },
    "speed": {
      "concurrent": 2,
      "throttle": false
    }
  },
  "order": {
    "hops": [
      {
        "from": "Reader",
        "to": "Writer"
      }
    ]
  }
}
Reader script parameters
Script parameter name | Description | Required | Default value |
datasource | Data source name. The code editor mode supports adding data sources, and the content filled in this configuration item must be consistent with the added data source name. | Yes | None |
table | The name of the table to be synchronized. A data integration task can only read data from one table. Advanced usage examples of table for configuring ranges:
The task will read all matched tables and specifically read the columns specified in the column configuration item from these tables. If the table does not exist, or the columns being read do not exist, the task will fail. | Yes | None |
column | The set of column names in the configured table that need to be synchronized, described using a JSON array. The default is to use all columns configuration, such as [*].
| Yes | None |
splitPk | When MySQL Reader extracts data, if splitPk is specified, it means you want to use the field represented by splitPk for data sharding. Data synchronization will then start concurrent tasks for data synchronization, improving the efficiency of data synchronization.
| No | None |
where | Filter condition. In actual business scenarios, you often want to synchronize data for the current day, so you can specify the where condition as
| No | None |
querySql (Advanced mode, this parameter configuration is not supported in codeless UI mode) | In some business scenarios, the where configuration item is not sufficient to describe the filtering conditions. You can use this configuration item to customize filtering SQL. After configuring this item, the data synchronization system will ignore the tables, columns, and splitPk configuration items, and directly use the content configured in this item to filter data. For example, if you need to join multiple tables and then synchronize the data, you can use querySql is case-sensitive. For example, writing it as querysql will not work. | No | None |
useSpecialSecret | For multiple source data sources, whether to use each data source's password. Values include the following:
If you have configured multiple source data sources and each data source uses different usernames and passwords, you can set this parameter to true to use each data source's password. | No | false |
Writer script demo
{
"type": "job",
"version": "2.0",//Version number.
"steps": [
{
"stepType": "stream",
"parameter": {},
"name": "Reader",
"category": "reader"
},
{
"stepType": "mysql",//Plugin name.
"parameter": {
"postSql": [],//Statements to execute after import.
"datasource": "",//Data source.
"column": [//Column names.
"id",
"value"
],
"writeMode": "insert",//Write mode, you can set it to insert, replace, or update.
"batchSize": 1024,//Number of records in a single batch commit.
"table": "",//Table name.
"nullMode": "skipNull",//NULL value handling strategy.
"skipNullColumn": [//Columns to skip NULL values.
"id",
"value"
],
"preSql": [
"delete from XXX;"//Statements to execute before import.
]
},
"name": "Writer",
"category": "writer"
}
],
"setting": {
"errorLimit": {//Number of error records.
"record": "0"
},
"speed": {
"throttle": true,//When throttle is false, the mbps parameter is ineffective, indicating no rate limiting; when throttle is true, it indicates rate limiting.
"concurrent": 1,//Number of concurrent jobs.
"mbps": "12"//Rate limit, controls the maximum rate of synchronization to prevent excessive read/write pressure on upstream/downstream databases, here 1mbps = 1MB/s.
}
},
"order": {
"hops": [
{
"from": "Reader",
"to": "Writer"
}
]
}
}
Writer script parameters
Script parameter name | Description | Required | Default value |
datasource | Data source name. The code editor mode supports adding data sources, and the content filled in this configuration item must be consistent with the added data source name. | Yes | None |
table | The name of the table to be synchronized. | Yes | None |
writeMode | Select the import mode, which can support insert into, on duplicate key update, and replace into methods:
| No | insert |
nullMode | NULL value handling strategy, value range:
When configured as skipNull, the task will dynamically concatenate SQL statements for writing data to support target default values, which will increase the number of FLUSH operations and reduce synchronization speed. In the worst case, it will FLUSH once per data record. | No | writeNull |
skipNullColumn | When nullMode is configured as skipNull, the columns configured in this parameter will not be forced to write as Configuration format: | No | Default is all columns configured for this task. |
column | Fields in the target table that need to be written with data, separated by commas, for example | Yes | None |
preSql | SQL statements to execute before the data synchronization task. Currently, the codeless UI mode only allows executing one SQL statement, while the code editor mode can support multiple SQL statements. For example, clearing old data from the table before execution (truncate table tablename). When there are multiple SQL statements, transactions are not supported. | No | None |
postSql | SQL statements to execute after the data synchronization task. Currently, the codeless UI mode only allows executing one SQL statement, while the code editor mode can support multiple SQL statements. For example, adding a timestamp When there are multiple SQL statements, transactions are not supported. | No | None |
batchSize | The number of records in a single batch commit. This value can greatly reduce the number of network interactions between the data synchronization system and MySQL, and improve overall throughput. If this value is set too high, it may cause OOM exceptions in the data synchronization process. | No | 256 |
updateColumn | When writeMode is configured as update, this specifies the fields to update when a primary key/unique index conflict occurs. Fields are separated by commas, for example | No | None |
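The three writeMode values map to different SQL statements on the MySQL side. A sketch of the generated statements, assuming a hypothetical table t(id PRIMARY KEY, value):

```sql
-- writeMode = insert: on primary-key/unique-index conflict, the row is
-- treated as dirty data and counted against errorLimit.
INSERT INTO t (id, value) VALUES (1, 'a');

-- writeMode = replace: on conflict, the old row is deleted and the new row
-- is inserted; columns not written revert to their defaults.
REPLACE INTO t (id, value) VALUES (1, 'b');

-- writeMode = update: on conflict, only the columns listed in updateColumn
-- are updated; other columns keep their existing values.
INSERT INTO t (id, value) VALUES (1, 'c')
  ON DUPLICATE KEY UPDATE value = VALUES(value);
```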