Step 1: Activate DataHub
Log on to the DataHub console.
Activate DataHub as prompted.
Step 2: Create a project and a topic
Log on to the DataHub console.
On the Project List page, click Create Project in the upper-right corner and set the parameters as required to create a project.
Parameter | Description |
---|---|
Name | The name of the project. A project is an organizational unit in DataHub and contains one or more topics. DataHub projects are independent from MaxCompute projects. Projects that you created in MaxCompute cannot be used in DataHub. |
Description | The description of the project. |
On the details page of a project, click Create Topic in the upper-right corner and set the parameters as required to create a topic.
Parameter | Description |
---|---|
Creation Type | The method that is used to create the topic. |
Name | The name of the topic. |
Type | The type of the data in the topic. TUPLE indicates structured data. BLOB indicates unstructured data. |
Schema Details | The details of the schema. The Schema Details parameter is displayed if you set the Type parameter to TUPLE. You can create fields based on your business requirements. If you select Allow Null for a field, the field is set to NULL if the field does not exist in the upstream. If you clear Allow Null for a field, the field configuration is strictly verified. An error is returned if the type specified for the field is invalid. |
Number of Shards | The number of shards in the topic. Shards ensure the concurrent data transmission of a topic. Each shard has a unique ID. A shard may be in one of the following states: Opening: The shard is starting. Active: The shard is started and available. Each available shard consumes resources on the server. We recommend that you create shards as needed. |
Lifecycle | The maximum period during which data written to the topic can be stored in DataHub, in days. Minimum value: 1. Maximum value: 7. To modify the time-to-live (TTL) period of a topic, call the updateTopic method by using DataHub SDK for Java. For more information, see DataHub SDK for Java. |
Description | The description of the topic. |
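The Allow Null verification described for Schema Details can be sketched as a client-side check (the schema and field names below are hypothetical; DataHub itself performs this validation when data is written):

```python
# Illustrative sketch of TUPLE schema validation; DataHub performs
# this check on the server side when records are written.
schema = [
    {"name": "id", "type": "BIGINT", "allow_null": False},
    {"name": "remark", "type": "STRING", "allow_null": True},
]

def validate(record: dict) -> dict:
    checked = {}
    for field in schema:
        value = record.get(field["name"])
        if value is None:
            if field["allow_null"]:
                # Field missing in the upstream data: set it to NULL.
                checked[field["name"]] = None
            else:
                # Allow Null cleared: the field is strictly verified.
                raise ValueError(f"field {field['name']} must not be null")
        else:
            checked[field["name"]] = value
    return checked

print(validate({"id": 1}))  # {'id': 1, 'remark': None}
```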
Step 3: Write data to the created topic
DataHub provides multiple methods for you to write data. You can use plug-ins such as Apache Flume to write logs. If you want to write data stored in databases, you can use Data Transmission Service (DTS), Canal, or an SDK. In this example, the console command-line tool is used to write data by uploading a file.
Download and decompress the installation package of the console command-line tool, and then specify an AccessKey pair and an endpoint as required. For more information, see Console command-line tool.
Run the following command to upload a file:
```
uf -f /temp/test.csv -p test_topic -t test_topic -m "," -n 1000
```
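The uploaded file is a plain CSV whose column order matches the TUPLE schema of the topic. A minimal script to produce such a test file might look as follows (the columns are hypothetical; adjust them to your own schema):

```python
import csv
import io

# Build a small CSV in memory; in practice, write it to /temp/test.csv
# and upload it with the uf command shown above.
rows = [(i, f"user_{i}", 20 + i % 30) for i in range(5)]
buf = io.StringIO()
writer = csv.writer(buf)
# No header row: the command-line tool maps columns by position.
writer.writerows(rows)
print(buf.getvalue())
```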
Sample data to assess data quality.
Select a shard, such as Shard 0. In the Sample: 0 panel, set the number of data entries to be sampled and the start time for sampling.
Click Sample. The sampled data is displayed.
Step 4: Synchronize data
Synchronize data to MaxCompute.
In the left-side navigation pane of the DataHub console, click Project Manager. On the Project List page, find a project and click View in the Actions column. On the details page of the project, find a topic and click View in the Actions column.
On the details page of the topic, click Connector in the upper-right corner. In the Create Connector panel, create a DataConnector as required. Click MaxCompute. The following parameters are displayed.
Description of some parameters:
The following part describes some of the parameters that are used to create a DataConnector in the console. To create a DataConnector in a more flexible manner, use an SDK.
Import Fields
You can specify the columns to be synchronized to the destination MaxCompute table.
Partition Mode
The partition mode determines to which partition in MaxCompute data is written. The following table describes the partition modes supported by DataHub.
Partition mode | Partition basis | Supported data type of a topic | Description |
---|---|---|---|
USER_DEFINE | Based on the values in the partition key column in the records. The name of the partition key column must be the same as that of the partition field in MaxCompute. | TUPLE | 1. The schema of the topic must contain the partition field in MaxCompute. 2. The column values must be |
SYSTEM_TIME | Based on the timestamps when the records are written to DataHub. | TUPLE and BLOB | 1. You must set the Partition Config parameter to specify one or more formats to which timestamps are converted for time-based partitioning in MaxCompute. 2. You must set the Timezone parameter to specify a time zone. |
EVENT_TIME | Based on the values in the | TUPLE | 1. You must set the Partition Config parameter to specify one or more formats to which timestamps are converted for time-based partitioning in MaxCompute. 2. You must set the Timezone parameter to specify a time zone. |
META_TIME | Based on the values in the | TUPLE and BLOB | 1. You must set the Partition Config parameter to specify one or more formats to which timestamps are converted for time-based partitioning in MaxCompute. 2. You must set the Timezone parameter to specify a time zone. |
In SYSTEM_TIME, EVENT_TIME, or META_TIME mode, data is synchronized to different partitions in the destination MaxCompute table based on the timestamps and the specified time zone. By default, the timestamps are in microseconds.
The Partition Config parameter specifies the configurations that are used to convert timestamps to implement time-based partitioning in the destination MaxCompute table. The following table describes the default MaxCompute time formats that are supported in the DataHub console.
Partition type | Time format | Description |
---|---|---|
ds | %Y%m%d | Day |
hh | %H | Hour |
mm | %M | Minute |
The Time Range parameter specifies the interval at which partitions are generated in the destination MaxCompute table. Valid values: 15 to 1440, in minutes. The step size is 15.
The Timezone parameter specifies the time zone that is used to implement time-based partitioning.
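The timestamp-to-partition conversion described above can be sketched as follows. This is an illustration of the behavior, not DataHub code: it takes a microsecond timestamp, applies the default ds/hh/mm time formats, and floors the minute to a 15-minute Time Range window.

```python
from datetime import datetime, timezone

TIME_RANGE_MINUTES = 15  # valid values: 15 to 1440, in steps of 15

def partition_values(ts_us: int, tz=timezone.utc) -> dict:
    """Map a microsecond timestamp to ds/hh/mm partition values."""
    dt = datetime.fromtimestamp(ts_us / 1_000_000, tz)
    # Floor the minute to the start of its Time Range window.
    floored = dt.minute - dt.minute % TIME_RANGE_MINUTES
    return {
        "ds": dt.strftime("%Y%m%d"),  # day partition (%Y%m%d)
        "hh": dt.strftime("%H"),      # hour partition (%H)
        "mm": f"{floored:02d}",       # minute partition, floored (%M)
    }

# 2024-01-02 03:47:05 UTC, expressed in microseconds
print(partition_values(1704167225000000))
```

With a 15-minute Time Range, a record written at 03:47 lands in the partition that starts at 03:45.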
If you synchronize data of the BLOB type to MaxCompute, you can use hexadecimal delimiters to split the data before synchronization. For example, you can set the Split Key parameter to 0A, which indicates line feeds (\n).
By default, topics whose data type is BLOB store binary data. However, such data is mapped to columns of the STRING type in MaxCompute. Therefore, Base64 encoding is automatically enabled when you create a DataConnector in the DataHub console. If you want to customize your DataConnectors, use an SDK.
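The splitting and encoding described above can be sketched as follows (an illustration of the behavior, not DataHub's implementation): the BLOB payload is split on the hexadecimal delimiter, and each chunk is Base64-encoded so that it fits a STRING column.

```python
import base64

def split_and_encode(payload: bytes, split_key: str = "0A") -> list:
    """Split BLOB data on a hexadecimal delimiter, then Base64-encode
    each chunk so it can be stored in a STRING column in MaxCompute."""
    delimiter = bytes.fromhex(split_key)  # "0A" -> b"\n" (line feed)
    chunks = payload.split(delimiter)
    return [base64.b64encode(chunk).decode("ascii") for chunk in chunks]

print(split_and_encode(b"alpha\nbeta"))  # two Base64-encoded rows
```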
Step 5: View the DataConnector
For more information, see Synchronize data to MaxCompute.