This topic describes how to use Data Integration to import offline data to DataHub. In this example, a batch synchronization node is configured by using the code editor to import data from Stream to DataHub.
Prerequisites
An Alibaba Cloud account and its AccessKey pair are created. For more information, see Activate DataWorks.
MaxCompute is activated. After you activate MaxCompute, a default MaxCompute data source is automatically generated. The Alibaba Cloud account is used to log on to the DataWorks console.
A workspace is created in the DataWorks console. This way, you can collaborate with other members in the workspace to develop workflows and maintain data and tasks in the workspace. For information about how to create a workspace, see Create a workspace.
Note: If you want to create a data integration task as a RAM user, grant the required permissions to the RAM user. For information about how to create a RAM user and grant permissions to the RAM user, see Prepare a RAM user and Manage permissions on workspace-level services.
Background information
Data Integration is a data synchronization platform that is provided by Alibaba Cloud. The platform is reliable, secure, cost-effective, and scalable. It can be used to synchronize data across heterogeneous data storage systems and provides offline data synchronization channels for more than 20 types of data sources in diverse network environments. For more information, see Supported data source types, Reader plug-ins, and Writer plug-ins.
In this example, a DataHub data source is used. For information about how to use other types of data sources to configure synchronization tasks, see Supported data source types and synchronization operations.
Procedure
Go to the DataStudio page.
Log on to the DataWorks console.
In the left-side navigation pane, click Workspaces.
In the top navigation bar, select the region in which the created workspace resides. On the Workspaces page, find the desired workspace and use the shortcut in the Actions column to go to the DataStudio page.
In the Scheduled Workflow pane of the DataStudio page, find the desired workflow and click its name. Right-click Data Integration and choose Create Node > Offline synchronization.
In the Create Node dialog box, configure the Name and Path parameters and click Confirm.
Note: The task name cannot exceed 128 characters in length.
The Path parameter specifies the auto triggered workflow in which you want to create the batch synchronization task. For information about how to create an auto triggered workflow, see the "Create an auto triggered workflow" section in Create a workflow.
After the batch synchronization task is created, configure items such as network connectivity and resources based on your business requirements and click Next. Then, click the icon for switching to the code editor in the top toolbar of the configuration tab of the batch synchronization task.
In the Tips message, click OK to switch to the code editor.
Click the Import Template icon in the top toolbar.
In the Import Template dialog box, configure the Source type, Target type, and Data source parameters to generate an import template used to import data from Stream to DataHub. Then, click Confirmation.
After the template is imported, edit code in the code editor based on your business requirements.
{ "type": "job", "version": "1.0", "configuration": { "setting": { "errorLimit": { "record": "0" }, "speed": { "mbps": "1",// The maximum transmission rate. Unit: MB/s. "concurrent": 1,// The maximum number of parallel threads. "throttle": false } }, "reader": { "plugin": "stream", "parameter": { "column": [// The names of the columns from which you want to read data. { "value": "field", // The column property. "type": "string" }, { "value": true, "type": "bool" }, { "value": "byte string", "type": "bytes" } ], "sliceRecordCount": "100000" } }, "writer": { "plugin": "datahub", "parameter": { "datasource": "datahub",// The name of the data source. "topic": "xxxx",// The minimum unit for data subscription and publication in DataHub. You can use topics to distinguish different types of streaming data. "mode": "random",// The write mode. The value random indicates that data is randomly written. "shardId": "0",// Shards are parallel channels that are used for data transmission in a topic. Each shard has a unique ID. "maxCommitSize": 524288,// The amount of data that Data Integration buffers before Data Integration sends the data to the destination for the purpose of improving writing efficiency. Unit: MB. The default value is 1 MB. "maxRetryCount": 500 } } } }
After the configuration is complete, click the Save and Run icons in the top toolbar of the configuration tab of the batch synchronization task.
Note: You can import data to DataHub only in the code editor.
If you want to change the template, click the Import Template icon in the top toolbar again. The original content is overwritten after you apply the new template.
If you click the Run icon after you save the batch synchronization task, the task is immediately run.
You can also click the Submit icon to commit the batch synchronization task to the scheduling system. The scheduling system periodically runs the batch synchronization task from the next day based on the scheduling properties configured for the task.