TiDB data source - DataWorks - Alibaba Cloud Documentation Center

DataWorks provides TiDB Reader for you to read data from TiDB data sources in offline mode. This topic describes the capabilities of synchronizing data from TiDB data sources.

Supported TiDB versions

Batch data read: TiDB 7.x and TiDB 8.x databases are supported.
Batch data write: No TiDB database versions are supported.
Real-time data read and write: No TiDB database versions are supported.

Note

TiDB is highly compatible with the MySQL protocol and the features and syntax commonly used by MySQL 5.7 and MySQL 8.0. Data synchronization from a TiDB data source in offline mode is implemented based on the MySQL protocol. For information about compatibility comparison between TiDB and MySQL, see MySQL Compatibility.

Supported data types

For information about all data types in each TiDB version, see Data Types. The following table provides the support status of main data types in TiDB.

Data type	TiDB Reader for batch data read
TINYINT	Supported
SMALLINT	Supported
MEDIUMINT	Supported
INTEGER	Supported
BIGINT	Supported
FLOAT	Supported
DOUBLE	Supported
DECIMAL	Supported
CHAR	Supported
VARCHAR	Supported
JSON	Supported
TEXT	Supported
TINYTEXT	Supported
MEDIUMTEXT	Supported
LONGTEXT	Supported
VARBINARY	Supported
BINARY	Supported
BLOB	Supported
TINYBLOB	Supported
MEDIUMBLOB	Supported
LONGBLOB	Supported
ENUM	Supported
SET	Supported
BOOLEAN	Supported
BIT	Supported
DATE	Supported
DATETIME	Supported
TIMESTAMP	Supported
TIME	Supported
YEAR	Supported

Prepare a TiDB environment before data synchronization

Before you use DataWorks to synchronize data from a TiDB data source, you must prepare a TiDB environment. This ensures that a data synchronization task can be configured and can synchronize data from the TiDB data source as expected. The following information describes how to prepare a TiDB environment for data synchronization from a TiDB data source.

Preparation 1: Check the version of your TiDB database

Data Integration requires that the version of the TiDB database be of 7.x or 8.x. You must check whether the version of your TiDB database from which you want to synchronize data meets the requirement. You can execute the following statement to check the version of your TiDB database:

SELECT TIDB_VERSION()\G

*************************** 1. row ***************************
TIDB_VERSION(): Release Version: v8.1.1
Edition: Community
Git Commit Hash: 821e491a20fbab36604b36b647b5bae26a2c1418
Git Branch: HEAD
UTC Build Time: 2024-08-27 19:16:25
GoVersion: go1.21.10
Race Enabled: false
Check Table Before Drop: false
Store: tikv
1 row in set (0.00 sec)

Preparation 2: Prepare an account that has the required permissions

We recommend that you plan and create an account for DataWorks to access your TiDB database. To prepare such an account, perform the following steps:

Optional. Create an account. For more information, see TiDB User Account Management.
Grant the required permissions to the account.
Batch data read: The account must have the SELECT permission.
You can execute the following statement to grant permissions to the account. Alternatively, you can grant the SUPER permission to the account. Replace Account for data synchronization in the statement with the created account.
```
-- CREATE USER 'Account for data synchronization'@'%' IDENTIFIED BY 'Password'; //Create an account that can be used for data synchronization and specify a password. This way, you can use the account and password to access the database from any host. % indicates a host. 
GRANT SELECT ON *.* TO 'Account for data synchronization'@'%'; // Grant the SELECT permission to the account.
```
*.* indicates that the account is granted the preceding permissions on all tables in all databases. You can also grant the preceding permissions on specific tables in a database to the account. For example, to grant the account the preceding permissions on the user table in the test database, execute the following statement: GRANT SELECT ON test.user TO 'Account for data synchronization'@'%';.
Note
The REPLICATION SLAVE permission is a global permission. You cannot grant this permission on specific tables in a database to the account.

Add a data source

Before you develop a synchronization task in DataWorks, you must add the required data source to DataWorks by following the instructions in Add and manage data sources. You can view the infotips of parameters in the DataWorks console to understand the meanings of the parameters when you add a data source.

Appendix: Code and parameters

Configure a batch synchronization task by using the code editor

If you want to configure a batch synchronization task by using the code editor, you must configure the related parameters in the script based on the unified script format requirements. For more information, see Configure a batch synchronization task by using the code editor. The following information describes the parameters that you must configure for data sources when you configure a batch synchronization task by using the code editor.

Code for TiDB Reader

The following code provides an example on how to configure a synchronization task to read data from a TiDB data source.

Note

The comments in the JSON example are used to only indicate the definitions of some major parameters. Remove the comments when you configure the parameters.

{
  "type": "job",
  "version": "2.0",
  "steps":
  [
    {
      "stepType": "tidb",
      "parameter":
      {
        "column":
        [
          "id",
          "name"
        ],
        "where": "",
        "splitPk": "id",
        "connection":
        [
          {
            "selectedDatabase": "test_database",
            "datasource": "test_datasource",
            "table":
            [
              "test_table"
            ]
          }
        ]
      },
      "name": "Reader",
      "category": "reader"
    },
    {
      "stepType": "odps",
      "parameter":
      {
      },
      "name": "Writer",
      "category": "writer"
    }
  ],
  "setting":
  {
    "errorLimit":
    {
      "record": "0"
    },
    "speed":
    {
      "throttle": false,
      "concurrent": 3
    }
  },
  "order":
  {
    "hops":
    [
      {
        "from": "Reader",
        "to": "Writer"
      }
    ]
  }
}

Parameters in code for TiDB Reader

Parameter	Description	Required	Default value
datasource	The name of the data source. It must be the same as the name of the added data source. You can add data sources by using the code editor.	Yes	No default value
table	The name of the table from which you want to read data. You can make advanced configurations for the table parameter to specify a table range. Examples: Set the table parameter to `'table_[0-99]'`. This value indicates that TiDB Reader reads data from the tables `'table_0'` to `'table_99'` in sharded databases. Set the table parameter to `'"table": ["table_00[0-9]", "table_0[10-99]", "table_[100-999]"]'`. This value indicates that TiDB Reader reads data from tables `'table_000'` to `'table_999'`. You can use this method only if the numerical suffixes of the names of your tables are of the same length.	Yes	No default value
column	The names of the columns from which you want to read data. Specify the names in a JSON array. The default value is `[ * ]`, which indicates all columns in the source table. You can select specific columns to read. The column order can be changed. This indicates that you can specify columns in an order different from the order specified by the schema of the source table. Constants are supported. The column names must be arranged in compliance with the SQL syntax that is supported by MySQL, such as `["id","table","1","'test_constant'","'null'","to_char(a+1)","2.3","true"]`. `id`: a column name. `table`: the name of a column that contains reserved keywords. `1`: an integer constant. `'test_constant'`: a string constant, which is enclosed in single quotation marks ('). null: `" "` indicates an empty string. `null` indicates a null value. `'null'` indicates the string null. `to_char(a+1)`: a function expression that is used to calculate the length of a string. `2.3`: a floating-point constant. `true`: a Boolean value. The `column` parameter must explicitly specify all the columns from which you want to read data. This parameter cannot be left empty.	Yes	No default value
splitPk	The field that is used for data sharding when TiDB Reader reads data. If you configure this parameter, data sharding is performed based on the value of this parameter, and parallel threads can be used to read data. This improves data synchronization efficiency. We recommend that you specify the name of the primary key column of a source table as the shard key. This way, data can be evenly distributed to different shards based on the primary key column, instead of being intensively distributed only to specific shards. A shard key can be used to shard data only of an integer data type. If you use a shard key to shard data of an unsupported data type, the batch synchronization task ignores the shard key that you specified and uses a single thread to read data. If no shard key is specified, a data synchronization task uses a single thread to read data.	No	No default value
where	The WHERE clause. For example, you can set this parameter to `gmt_create>$bizdate` to read the data that is generated on the current day. You can use the WHERE clause to read incremental data. If the where parameter is not provided or is left empty, TiDB Reader reads all data. The WHERE clause does not support the LIMIT keyword.	No	No default value