All Products
Search
Document Center

DataWorks:TiDB data source

Last Updated:Dec 27, 2024

DataWorks provides TiDB Reader for you to read data from TiDB data sources in offline mode. This topic describes the capabilities of synchronizing data from TiDB data sources.

Supported TiDB versions

  • Batch data read: TiDB 7.x and TiDB 8.x databases are supported.

  • Batch data write: No TiDB database versions are supported.

  • Real-time data read and write: No TiDB database versions are supported.

Note

TiDB is highly compatible with the MySQL protocol and the features and syntax commonly used by MySQL 5.7 and MySQL 8.0. Data synchronization from a TiDB data source in offline mode is implemented based on the MySQL protocol. For information about compatibility comparison between TiDB and MySQL, see MySQL Compatibility.

Supported data types

For information about all data types in each TiDB version, see Data Types. The following table provides the support status of main data types in TiDB.

Data type

TiDB Reader for batch data read

TINYINT

Supported

SMALLINT

Supported

MEDIUMINT

Supported

INTEGER

Supported

BIGINT

Supported

FLOAT

Supported

DOUBLE

Supported

DECIMAL

Supported

CHAR

Supported

VARCHAR

Supported

JSON

Supported

TEXT

Supported

TINYTEXT

Supported

MEDIUMTEXT

Supported

LONGTEXT

Supported

VARBINARY

Supported

BINARY

Supported

BLOB

Supported

TINYBLOB

Supported

MEDIUMBLOB

Supported

LONGBLOB

Supported

ENUM

Supported

SET

Supported

BOOLEAN

Supported

BIT

Supported

DATE

Supported

DATETIME

Supported

TIMESTAMP

Supported

TIME

Supported

YEAR

Supported

Prepare a TiDB environment before data synchronization

Before you use DataWorks to synchronize data from a TiDB data source, you must prepare a TiDB environment. This ensures that a data synchronization task can be configured and can synchronize data from the TiDB data source as expected. The following information describes how to prepare a TiDB environment for data synchronization from a TiDB data source.

Preparation 1: Check the version of your TiDB database

Data Integration requires that the version of the TiDB database be of 7.x or 8.x. You must check whether the version of your TiDB database from which you want to synchronize data meets the requirement. You can execute the following statement to check the version of your TiDB database:

SELECT TIDB_VERSION()\G

*************************** 1. row ***************************
TIDB_VERSION(): Release Version: v8.1.1
Edition: Community
Git Commit Hash: 821e491a20fbab36604b36b647b5bae26a2c1418
Git Branch: HEAD
UTC Build Time: 2024-08-27 19:16:25
GoVersion: go1.21.10
Race Enabled: false
Check Table Before Drop: false
Store: tikv
1 row in set (0.00 sec)

Preparation 2: Prepare an account that has the required permissions

We recommend that you plan and create an account for DataWorks to access your TiDB database. To prepare such an account, perform the following steps:

  1. Optional. Create an account. For more information, see TiDB User Account Management.

  2. Grant the required permissions to the account.

    Batch data read: The account must have the SELECT permission.

    You can execute the following statement to grant permissions to the account. Alternatively, you can grant the SUPER permission to the account. Replace Account for data synchronization in the statement with the created account.

    -- CREATE USER 'Account for data synchronization'@'%' IDENTIFIED BY 'Password'; //Create an account that can be used for data synchronization and specify a password. This way, you can use the account and password to access the database from any host. % indicates a host. 
    GRANT SELECT ON *.* TO 'Account for data synchronization'@'%'; // Grant the SELECT permission to the account.

    *.* indicates that the account is granted the preceding permissions on all tables in all databases. You can also grant the preceding permissions on specific tables in a database to the account. For example, to grant the account the preceding permissions on the user table in the test database, execute the following statement: GRANT SELECT ON test.user TO 'Account for data synchronization'@'%';.

    Note

    The REPLICATION SLAVE permission is a global permission. You cannot grant this permission on specific tables in a database to the account.

Add a data source

Before you develop a synchronization task in DataWorks, you must add the required data source to DataWorks by following the instructions in Add and manage data sources. You can view the infotips of parameters in the DataWorks console to understand the meanings of the parameters when you add a data source.

Appendix: Code and parameters

Configure a batch synchronization task by using the code editor

If you want to configure a batch synchronization task by using the code editor, you must configure the related parameters in the script based on the unified script format requirements. For more information, see Configure a batch synchronization task by using the code editor. The following information describes the parameters that you must configure for data sources when you configure a batch synchronization task by using the code editor.

Code for TiDB Reader

The following code provides an example on how to configure a synchronization task to read data from a TiDB data source.

Note

The comments in the JSON example are used to only indicate the definitions of some major parameters. Remove the comments when you configure the parameters.

{
  "type": "job",
  "version": "2.0",
  "steps":
  [
    {
      "stepType": "tidb",
      "parameter":
      {
        "column":
        [
          "id",
          "name"
        ],
        "where": "",
        "splitPk": "id",
        "connection":
        [
          {
            "selectedDatabase": "test_database",
            "datasource": "test_datasource",
            "table":
            [
              "test_table"
            ]
          }
        ]
      },
      "name": "Reader",
      "category": "reader"
    },
    {
      "stepType": "odps",
      "parameter":
      {
      },
      "name": "Writer",
      "category": "writer"
    }
  ],
  "setting":
  {
    "errorLimit":
    {
      "record": "0"
    },
    "speed":
    {
      "throttle": false,
      "concurrent": 3
    }
  },
  "order":
  {
    "hops":
    [
      {
        "from": "Reader",
        "to": "Writer"
      }
    ]
  }
}

Parameters in code for TiDB Reader

Parameter

Description

Required

Default value

datasource

The name of the data source. It must be the same as the name of the added data source. You can add data sources by using the code editor.

Yes

No default value

table

The name of the table from which you want to read data.

You can make advanced configurations for the table parameter to specify a table range. Examples:

  • Set the table parameter to 'table_[0-99]'. This value indicates that TiDB Reader reads data from the tables 'table_0' to 'table_99' in sharded databases.

  • Set the table parameter to '"table": ["table_00[0-9]", "table_0[10-99]", "table_[100-999]"]'. This value indicates that TiDB Reader reads data from tables 'table_000' to 'table_999'. You can use this method only if the numerical suffixes of the names of your tables are of the same length.

Yes

No default value

column

The names of the columns from which you want to read data. Specify the names in a JSON array. The default value is [ * ], which indicates all columns in the source table.

  • You can select specific columns to read.

  • The column order can be changed. This indicates that you can specify columns in an order different from the order specified by the schema of the source table.

  • Constants are supported. The column names must be arranged in compliance with the SQL syntax that is supported by MySQL, such as ["id","table","1","'test_constant'","'null'","to_char(a+1)","2.3","true"].

    • id: a column name.

    • table: the name of a column that contains reserved keywords.

    • 1: an integer constant.

    • 'test_constant': a string constant, which is enclosed in single quotation marks (').

    • null:

      • " " indicates an empty string.

      • null indicates a null value.

      • 'null' indicates the string null.

    • to_char(a+1): a function expression that is used to calculate the length of a string.

    • 2.3: a floating-point constant.

    • true: a Boolean value.

  • The column parameter must explicitly specify all the columns from which you want to read data. This parameter cannot be left empty.

Yes

No default value

splitPk

The field that is used for data sharding when TiDB Reader reads data. If you configure this parameter, data sharding is performed based on the value of this parameter, and parallel threads can be used to read data. This improves data synchronization efficiency.

  • We recommend that you specify the name of the primary key column of a source table as the shard key. This way, data can be evenly distributed to different shards based on the primary key column, instead of being intensively distributed only to specific shards.

  • A shard key can be used to shard data only of an integer data type. If you use a shard key to shard data of an unsupported data type, the batch synchronization task ignores the shard key that you specified and uses a single thread to read data.

  • If no shard key is specified, a data synchronization task uses a single thread to read data.

No

No default value

where

The WHERE clause. For example, you can set this parameter to gmt_create>$bizdate to read the data that is generated on the current day.

  • You can use the WHERE clause to read incremental data. If the where parameter is not provided or is left empty, TiDB Reader reads all data.

  • The WHERE clause does not support the LIMIT keyword.

No

No default value