All Products
Search
Document Center

DataWorks:Configure a batch synchronization task to synchronize data from tables in sharded databases

Last Updated:Dec 10, 2024

DataWorks allows you to configure a batch synchronization task by using the codeless user interface (UI) or code editor. You can configure a batch synchronization task to synchronize data from tables in sharded databases to a single table. In this example, a batch synchronization task is configured to synchronize data from tables in sharded MySQL databases.

Prerequisites

The data sources from which you want to read data and to which you want to write data are prepared. For more information, see Supported data source types, Reader plug-ins, and Writer plug-ins.

Background information

The following table compares the configuration methods of batch synchronization tasks for synchronizing data from tables in sharded databases to a single table.

Comparison item

Sharded data source + codeless UI (recommended)

Common data source + codeless UI

Code editor

Support for the codeless UI

Supported.

Supported.

Not Supported.

Support for specifying source table names by rule

Specifying source table names based on a regular expression is supported. When the related synchronization task is run, the synchronization task matches tables based on the regular expression and synchronizes data from the matched tables.

Not supported.

Specifying source table names based on a number range is supported. Example: tb_[1-10]. The numbers in the range must be consecutive, and each table specified by the range must exist.

Consistency of table schemas

A policy for missing fields can be configured. Missing fields in some source tables are allowed, and the corresponding fields are synchronized to the destination as NULL values.

The schemas of all source tables must be the same.

Reference for field mappings

Fields in the first mapped table in the meta data source specified for the sharded data source are used as references for field mappings.

Fields in the first table in the first data source are used as references for field mappings.

You must manually specify the fields for which you want to configure mappings.

Number of supported data sources

A maximum of 5,000 data sources are supported.

We recommend that you configure no more than 50 data sources.

Requirement for modifying or deploying the related synchronization task if data sources are changed

You do not need to modify or deploy the synchronization task. After you change the data sources for the synchronization task, the change takes effect when the latest instance generated for the synchronization task is run.

You must modify and deploy the synchronization task. You must add the required data source for the synchronization task and specify the names of tables from which you want to synchronize data.

Supported data source types

MySQL and PolarDB

MySQL, PolarDB, and AnalyticDB

MySQL, PolarDB, AnalyticDB, SQL Server, Oracle, PostgreSQL, DM, DB2, and Oracle

Sharded data source + codeless UI

  1. Go to the Data Integration page.

    Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and Governance > Data Integration. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Integration.

  2. In the left-side navigation pane of the Data Integration page, choose Data Sources > Data Sources. On the Data Sources page, click Add Data Source. In the Add Data Source dialog box, select the desired data source type. This method supports only MySQL and PolarDB data sources.

  3. Add each desired database to DataWorks as a common data source. In this example, MySQL sharded databases are added as common data sources. For more information, see Add and manage data sources in Data Integration.

    image

  4. Merge multiple common data sources into one sharded data source and use the sharded data source as the source of a synchronization task.

    1. In the Add Data Source dialog box, click Sub-warehouse and sub-table.image

    2. Select a data source type based on your business requirements and configure the related parameters. In this example, MySQL is selected.

      image

      Take note of the following parameters:

      • Data Source Name: Specify a name based on your business requirements.

      • Select Data Source: Select the common data sources that are added.

      • Meta Data Source: Select the data source that you want to use as the default metadata template. If you want to synchronize data from a sharded data source, make sure that the schemas of all tables in the selected common data sources are the same. After you specify a meta data source, the system can extract the default schema when you configure a synchronization task.

        Important

        If the schemas of tables in the data sources are different, an error is reported when the synchronization task is run.

  5. Create a batch synchronization node.

  6. Configure a synchronization task to synchronize data from tables in sharded databases.

    Configure the synchronization task by using the codeless UI. In the Configure Network Connections and Resource Group step, select MySQL(Sharding) from the Source drop-down list, and then select the sharded data source that you added from the Data Source Name drop-down list. For more information about how to configure the synchronization task, see Configure a batch synchronization task by using the codeless UI.

    Note

    In this example, a MySQL sharded data source is used. You can select a sharded data source based on your business requirements.

    image

  7. Click Next.

  8. In the Configure tasks step, select the tables from which you want to synchronize data and the table to which you want to write data. Then, save and deploy the synchronization task and perform subsequent operations.

    image

Common data source + codeless UI

  1. Add each desired database to DataWorks as a common data source. In this example, MySQL sharded databases are added as common data sources. For more information, see Add and manage data sources in Data Integration.

    image

  2. Create a batch synchronization node.

  3. Configure a synchronization task to synchronize data from tables in sharded databases.

    Configure the synchronization task by using the codeless UI. In the Configure Network Connections and Resource Group step, click Edit Source to select multiple data sources after you select the source type. For more information about how to configure the synchronization task, see Configure a batch synchronization task by using the codeless UI.

    image

  4. In the Edit Data Source dialog box, select the desired data sources in the Data Sources to Be Selected list and click the rightward arrow to move the selected data sources to the Selected Data Sources list. Then, click OK.

    image

  5. Click Next.

  6. In the Configure tasks step, select the tables from which you want to synchronize data in each selected data source in the Source section, and configure the parameters in the Destination section. Then, save and deploy the synchronization task and perform subsequent operations.

    image

Code editor

  1. Create a batch synchronization node.

  2. Configure a synchronization task to synchronize data from tables in sharded databases.

    Configure the synchronization task by using the code editor. The following code provides a sample script. For more information about how to configure the synchronization task, see Configure a batch synchronization task by using the code editor.

    Important

    You must delete the comments from the following code before you run the code.

    {
        "type":"job",
        "version":"2.0",
        "steps":[
            {
                "stepType":"mysql",
                "parameter":{
                    "envType":0,
                    "column":[
                        "id",
                        "name"
                    ],
                    "socketTimeout":3600000,
                    "tableComment":"",
                    "connection":[    // The sharded data source.
                        {
                            "datasource": "datasourceName1"  // The name of the first sharded database.
                            "table":[           // The names of the source tables in the first sharded database.
                                "tb1"
                            ]
                        },
                        {
                            "datasource": "datasourceName2" // The name of the second sharded database.
                            "table":[          // The names of the source tables in the second sharded database.
                                "tb2",
                                "tb3"
                            ]
                        }
                    ],
                    "useSpecialSecret":true,// Specify whether to use separate passwords for the sharded databases.
                    "where":"",
                    "splitPk":"id",
                    "encoding":"UTF-8"
                    },
                "name":"Reader",
                "category":"reader"
                },
            {
                "stepType":"odps",
                "parameter":{
                    "partition":"pt=${bizdate}",
                    "truncate":true,
                    "datasource":"odpsname",
                    "envType":0,
                    "isSupportThreeModel":false,
                    "column":[
                        "id",
                        "name"
                    ],
                    "emptyAsNull":false,
                    "tableComment":"",
                    "table":"t1",
                    "consistencyCommit":false
                    },
                "name":"Writer",
                "category":"writer"
                }
        ],
        "setting":{
            "executeMode":null,
            "errorLimit":{
                "record":""
                },
            "speed":{
                "concurrent":2,
                "throttle":false
                }
        },
        "order":{
            "hops":[
                {
                    "from":"Reader",
                    "to":"Writer"
                    }
            ]
        }
    }