Step 1: Activate DataHub
Log on to the DataHub console.
Activate DataHub as prompted.
Step 2: Create a project and a topic
Log on to the DataHub console.
On the Project List page, click Create Project in the upper-right corner and set the parameters as required to create a project.
Parameter | Description |
---|---|
Name | The name of the project. A project is an organizational unit in DataHub and contains one or more topics. DataHub projects are independent from MaxCompute projects. Projects that you created in MaxCompute cannot be used in DataHub. |
Description | The description of the project. |
On the details page of a project, click Create Topic in the upper-right corner and set the parameters as required to create a topic.
Parameter | Description |
---|---|
Creation Type | The method that is used to create the topic. |
Name | The name of the topic. |
Type | The type of the data in the topic. TUPLE indicates structured data. BLOB indicates unstructured data. |
Schema Details | The details of the schema. The Schema Details parameter is displayed if you set the Type parameter to TUPLE. You can create fields based on your business requirements. If you select Allow Null for a field, the field is set to NULL if the field does not exist in the upstream. If you clear Allow Null for a field, the field configuration is strictly verified. An error is returned if the type specified for the field is invalid. |
Number of Shards | The number of shards in the topic. Shards ensure the concurrent data transmission of a topic. Each shard has a unique ID. A shard may be in one of the following states: Opening: The shard is starting. Active: The shard is started and available. Each available shard consumes resources on the server. We recommend that you create shards as needed. |
Lifecycle | The maximum period during which data written to the topic can be stored in DataHub, in days. Minimum value: 1. Maximum value: 7. To modify the time-to-live (TTL) period of a topic, call the updateTopic method by using DataHub SDK for Java. For more information, see DataHub SDK for Java. |
Description | The description of the topic. |
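The Allow Null verification described for Schema Details can be sketched as a client-side check (the schema and field names below are hypothetical; DataHub itself performs this validation when data is written):

```python
# Illustrative sketch of TUPLE schema validation; DataHub performs
# this check on the server side when records are written.
schema = [
    {"name": "id", "type": "BIGINT", "allow_null": False},
    {"name": "remark", "type": "STRING", "allow_null": True},
]

def validate(record: dict) -> dict:
    checked = {}
    for field in schema:
        value = record.get(field["name"])
        if value is None:
            if field["allow_null"]:
                # Field missing in the upstream data: set it to NULL.
                checked[field["name"]] = None
            else:
                # Allow Null cleared: the field is strictly verified.
                raise ValueError(f"field {field['name']} must not be null")
        else:
            checked[field["name"]] = value
    return checked

print(validate({"id": 1}))  # {'id': 1, 'remark': None}
```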
Step 3: Write data to the created topic
DataHub provides multiple methods for you to write data. You can use plug-ins such as Apache Flume to write logs. If you want to write data stored in databases, you can use Data Transmission Service (DTS), Canal, or an SDK. In this example, the console command-line tool is used to write data by uploading a file.
Download and decompress the installation package of the console command-line tool, and then specify an AccessKey pair and an endpoint as required. For more information, see Console command-line tool.
Run the following command to upload a file:
```
uf -f /temp/test.csv -p test_topic -t test_topic -m "," -n 1000
```
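The uploaded file is a plain CSV whose column order matches the TUPLE schema of the topic. A minimal script to produce such a test file might look as follows (the columns are hypothetical; adjust them to your own schema):

```python
import csv
import io

# Build a small CSV in memory; in practice, write it to /temp/test.csv
# and upload it with the uf command shown above.
rows = [(i, f"user_{i}", 20 + i % 30) for i in range(5)]
buf = io.StringIO()
writer = csv.writer(buf)
# No header row: the command-line tool maps columns by position.
writer.writerows(rows)
print(buf.getvalue())
```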
Sample data to assess data quality.
Select a shard, such as Shard 0. In the Sample: 0 panel, set the number of data entries to be sampled and the start time for sampling.
Click Sample. The sampled data is displayed.
Step 4: Synchronize data
Synchronize data to MaxCompute.
In the left-side navigation pane of the DataHub console, click Project Manager. On the Project List page, find a project and click View in the Actions column. On the details page of the project, find a topic and click View in the Actions column.
On the details page of the topic, click Connector in the upper-right corner. In the Create Connector panel, create a DataConnector as required. Click MaxCompute. The following parameters are displayed.
Description of some parameters:
The following part describes some of the parameters that are used to create a DataConnector in the console. To create a DataConnector in a more flexible manner, use an SDK.
Import Fields
You can specify the columns to be synchronized to the destination MaxCompute table.
Partition Mode
The partition mode determines to which partition in MaxCompute data is written. The following table describes the partition modes supported by DataHub.
Partition mode | Partition basis | Supported data type of a topic | Description |
---|---|---|---|
USER_DEFINE | Based on the values in the partition key column in the records. The name of the partition key column must be the same as that of the partition field in MaxCompute. | TUPLE | 1. The schema of the topic must contain the partition field in MaxCompute. 2. The column values must be |
SYSTEM_TIME | Based on the timestamps when the records are written to DataHub. | TUPLE and BLOB | 1. You must set the Partition Config parameter to specify one or more formats to which timestamps are converted for time-based partitioning in MaxCompute. 2. You must set the Timezone parameter to specify a time zone. |
EVENT_TIME | Based on the values in the | TUPLE | 1. You must set the Partition Config parameter to specify one or more formats to which timestamps are converted for time-based partitioning in MaxCompute. 2. You must set the Timezone parameter to specify a time zone. |
META_TIME | Based on the values in the | TUPLE and BLOB | 1. You must set the Partition Config parameter to specify one or more formats to which timestamps are converted for time-based partitioning in MaxCompute. 2. You must set the Timezone parameter to specify a time zone. |
In SYSTEM_TIME, EVENT_TIME, or META_TIME mode, data is synchronized to different partitions in the destination MaxCompute table based on the timestamps and the specified time zone. By default, the timestamps are in microseconds.
The Partition Config parameter specifies the configurations that are used to convert timestamps to implement time-based partitioning in the destination MaxCompute table. The following table describes the default MaxCompute time formats that are supported in the DataHub console.
Partition type | Time format | Description |
---|---|---|
ds | %Y%m%d | Day |
hh | %H | Hour |
mm | %M | Minute |
The Time Range parameter specifies the interval at which partitions are generated in the destination MaxCompute table. Valid values: 15 to 1440, in minutes. The step size is 15.
The Timezone parameter specifies the time zone that is used to implement time-based partitioning.
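The timestamp-to-partition conversion described above can be sketched as follows. This is an illustration of the behavior, not DataHub code: it takes a microsecond timestamp, applies the default ds/hh/mm time formats, and floors the minute to a 15-minute Time Range window.

```python
from datetime import datetime, timezone

TIME_RANGE_MINUTES = 15  # valid values: 15 to 1440, in steps of 15

def partition_values(ts_us: int, tz=timezone.utc) -> dict:
    """Map a microsecond timestamp to ds/hh/mm partition values."""
    dt = datetime.fromtimestamp(ts_us / 1_000_000, tz)
    # Floor the minute to the start of its Time Range window.
    floored = dt.minute - dt.minute % TIME_RANGE_MINUTES
    return {
        "ds": dt.strftime("%Y%m%d"),  # day partition (%Y%m%d)
        "hh": dt.strftime("%H"),      # hour partition (%H)
        "mm": f"{floored:02d}",       # minute partition, floored (%M)
    }

# 2024-01-02 03:47:05 UTC, expressed in microseconds
print(partition_values(1704167225000000))
```

With a 15-minute Time Range, a record written at 03:47 lands in the partition that starts at 03:45.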
If you synchronize data of the BLOB type to MaxCompute, you can use hexadecimal delimiters to split the data before synchronization. For example, you can set the Split Key parameter to 0A, which indicates line feeds (\n).
By default, topics whose data type is BLOB store binary data. However, such data is mapped to columns of the STRING type in MaxCompute. Therefore, Base64 encoding is automatically enabled when you create a DataConnector in the DataHub console. If you want to customize your DataConnectors, use an SDK.
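The splitting and encoding described above can be sketched as follows (an illustration of the behavior, not DataHub's implementation): the BLOB payload is split on the hexadecimal delimiter, and each chunk is Base64-encoded so that it fits a STRING column.

```python
import base64

def split_and_encode(payload: bytes, split_key: str = "0A") -> list:
    """Split BLOB data on a hexadecimal delimiter, then Base64-encode
    each chunk so it can be stored in a STRING column in MaxCompute."""
    delimiter = bytes.fromhex(split_key)  # "0A" -> b"\n" (line feed)
    chunks = payload.split(delimiter)
    return [base64.b64encode(chunk).decode("ascii") for chunk in chunks]

print(split_and_encode(b"alpha\nbeta"))  # two Base64-encoded rows
```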
Step 5: View the DataConnector
For more information, see Synchronize data to MaxCompute.