All Products
Search
Document Center

ApsaraDB for ClickHouse:Use DataWorks to synchronize data

Last Updated:Aug 05, 2024

This topic describes how to use DataWorks to synchronize data from MaxCompute to ApsaraDB for ClickHouse.

Background information

You can use the batch data synchronization feature of DataWorks to synchronize data from various data sources to ApsaraDB for ClickHouse. For more information about the data sources supported by batch data synchronization, see Supported data source types, Reader plug-ins, and Writer plug-ins.

Prerequisites

Usage notes

  • ApsaraDB for ClickHouse supports only exclusive resource groups for Data Integration.

  • If you want to synchronize a table that has been synchronized before, execute the TRUNCATE TABLE <Table name>; statement to clear synchronized data in the ApsaraDB for ClickHouse table.

Procedure

  1. Add data sources.

    You need to add data sources for MaxCompute and ApsaraDB for ClickHouse.

    Note

    For more information, see Add a MaxCompute data source and Add a ClickHouse data source.

  2. Create a MaxCompute table.

    1. Log on to the DataWorks console.

    2. In the left-side navigation pane, click Workspaces.

    3. In the top navigation bar, select the region where the desired workspace resides. On the Workspaces page, find the workspace and choose Shortcuts > Data Development in the Actions column.

    4. On the DataStudio page, move the pointer over the 新建 icon and choose Create Table > MaxCompute > Table.

    5. In the Create Table dialog box, select a path from the Path drop-down list and configure the Name parameter. In this example, odptabletest1 is used as the table name. Click Create.

    6. In the General section, configure the parameters.

      基本属性

      The following table describes the parameters.

      Parameter

      Description

      Display Name

      The display name of the table.

      Theme

      The folders that are used to store and manage the table. You can specify the level-1 and level-2 folders to store the table. The Level-1 Theme and Level-2 Theme parameters can be used to categorize tables based on business categories. You can store tables of the same business category in the same folder.

      Note

      The level-1 and level-2 themes in the Workspace Tables pane of the DataStudio page help you better manage tables in folders. You can quickly find the current table in the Workspace Tables pane by theme. If no theme is available, you can create one. For information about how to create a theme, see the Create or manage folders for tables section of the "Manage settings for tables" topic.

    7. Click DDL in the toolbar.

    8. In the DDL dialog box, enter the following statement and click Generate Table Schema:

      CREATE TABLE IF NOT EXISTS odptabletest1
      (    
      v1  TINYINT,    
      v2  SMALLINT                
      );
    9. Click Commit to Development Environment and Commit to Production Environment in sequence.

  3. Write data to the MaxCompute table.

    1. On the DataStudio page, click Ad Hoc Query in the left-side navigation pane.

    2. Move the pointer over the 新建 icon and choose Create > ODPS SQL.

    3. In the Create Node dialog box, select a path from the Path drop-down list and configure the Name parameter.

    4. Click Confirm.

    5. On the edit page of the node, enter the following statement to write data to the MaxCompute table:

      insert into odptabletest1 values (1,"a"),(2,"b"),(3,"c"),(4,"d");
    6. Click the 执行 icon in the toolbar.

    7. In the Estimate MaxCompute Computing Cost dialog box, click Run.

  4. Create an ApsaraDB for ClickHouse table.

    1. Log on to the ApsaraDB for ClickHouse console.

    2. In the top navigation bar, select the region where the desired cluster is deployed.

    3. On the Clusters page, click a tab based on your cluster edition and click the ID of the cluster that you want to manage.

    4. On the Cluster Information page, click Log On to Database in the upper-right corner.

    5. In the Log on to Database Instance dialog box, enter the username and password of your database account and click Login.

    6. Enter the following statement and click Execute(F8). Sample statement:

      create table default.dataworktest ON CLUSTER default (
      v1 Int, 
      v2 String
      ) ENGINE = MergeTree ORDER BY v1;
      Note

      The schema type of the ApsaraDB for ClickHouse table must map the schema type of the MaxCompute table.

  5. Create a workflow.

    If you already have a workflow, skip this step.

    1. On the DataStudio page, click Scheduled Workflow in the left-side navigation pane.

    2. Move the pointer over the 新建 icon and select Create Workflow.

    3. In the Create Workflow dialog box, configure the Workflow Name parameter.

      Important

      The name must be 1 to 128 characters in length and can contain letters, digits, underscores (_), and periods (.).

    4. Click Create.

  6. Create a batch synchronization node.

    1. Click the newly created workflow and right-click Data Integration.

    2. Choose Create Node > Offline synchronization.

    3. In the Create Node dialog box, configure the Name parameter, and select a path from the Path drop-down list.

      Important

      The node name must be 1 to 128 characters in length and can contain letters, digits, underscores (_), and periods (.).

    4. Click Confirm.

  7. Configure the data source and destination.

    1. Source: Select a supported data source. In this example, MaxCompute is selected.

      选择数据源

      The following table describes the parameters.

      Parameter

      Description

      Connection

      The type and name of the data source.

      Production Project Name

      The name of the project in the production environment. You cannot change the value.

      Table

      The table that you want to synchronize.

      Partition Key Column

      If your daily incremental data is stored in the partitions of a specific date, you can specify the partition information to synchronize the daily incremental data. For example, set v17 to ${bizdate}.

      Note

      DataWorks cannot map the fields in partitioned MaxCompute tables. If you want to read data from a partitioned MaxCompute table, you must specify each desired partition when you configure MaxCompute Reader.

      Note

      For more information about the parameters, see MaxCompute Reader.

    2. Target: Select ClickHouse.

      选择数据源

      The following table describes the parameters.

      Parameter

      Description

      Connection

      The type and name of the data source. Select ClickHouse.

      Table

      The table to which you want to import the synchronized data.

      Primary key or unique key conflict handling

      Set this value to insert into (Insert).

      Pre sql

      The SQL statement that you want to execute before the synchronization task is run.

      Post sql

      The SQL statement that you want to execute after the synchronization task is run.

      Batch insert byte size

      The maximum number of bytes to be inserted.

      Number of batches

      The number of data entries to be inserted in one batch.

    3. (Optional) Mappings: You can select the field mapping. The Field on the left side corresponds to the Field on the right side.

      字段映射

      Note

      For information about the parameters, see the Step 4: Configure mappings between source fields and destination fields section of the Configure a batch synchronization task by using the codeless UI topic.

    4. (Optional) Channel: Configure the maximum transmission rate and dirty data check rules.

      通道控制

      Note

      For information about the parameters, see the Step 5: Configure channel control policies section of the Configure a batch synchronization task by using the codeless UI topic.

  8. Configure the resource group for Data Integration.

    Click Resource Group configuration on the right and select a group from the Exclusive Resource Group drop-down list.集成资源组配置

  9. Run and save the synchronization task.

    1. Click the 保存 icon in the toolbar to save the synchronization task.

    2. Click the 运行 icon in the toolbar to run the synchronization task.p353650.png

Verify the data synchronization result

  1. Log on to the ApsaraDB for ClickHouse console.

  2. In the top navigation bar, select the region where the desired cluster is deployed.

  3. On the Clusters page, click a tab based on your cluster edition and click the ID of the cluster that you want to manage.

  4. On the Cluster Information page, click Log On to Database in the upper-right corner.

  5. In the Log on to Database Instance dialog box, enter the username and password of your database account and click Login.

  6. Enter the following query statement and click Execute(F8) to check whether the data has been synchronized:

    SELECT * FROM dataworktest;

    The following result is returned.查询结果

    Note

    If the result is returned after you execute the query statement, the data has been synchronized from MaxComput to ApsaraDB for ClickHouse.