
DataWorks:Configure LogHub Reader

Last Updated:Nov 27, 2024

LogHub Reader uses the LogHub SDK to read data from LogHub topics in real time. It supports shard merging and splitting; after shards are merged or split, duplicate data records may exist, but no data is lost.
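Because delivery after a shard merge or split is at-least-once, downstream consumers that cannot tolerate duplicates must deduplicate on their own. A minimal sketch in Python (the record shape and the choice of deduplication key are illustrative assumptions, not part of the LogHub API):

```python
def dedupe(records, key=lambda r: (r["__source__"], r["__time__"], r["content"])):
    """Drop records delivered more than once (at-least-once semantics).

    Assumes a (source, write time, content) tuple identifies a record;
    adapt the key to whatever is actually unique in your Logstore.
    """
    seen = set()
    unique = []
    for record in records:
        k = key(record)
        if k not in seen:
            seen.add(k)
            unique.append(record)
    return unique
```
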

Background information

The following table describes the metadata fields that LogHub Reader for real-time synchronization provides.

| Field | Data type | Description |
| --- | --- | --- |
| __time__ | STRING | A reserved field of Simple Log Service. The field specifies the time when logs are written to Simple Log Service. The field value is a UNIX timestamp in seconds. |
| __source__ | STRING | A reserved field of Simple Log Service. The field specifies the source device from which logs are collected. |
| __topic__ | STRING | A reserved field of Simple Log Service. The field specifies the name of the topic for logs. |
| __tag__:__receive_time__ | STRING | The time when logs arrive at the server. If you enable the public IP address recording feature, this field is added to each raw log when the server receives the logs. The field value is a UNIX timestamp in seconds. |
| __tag__:__client_ip__ | STRING | The public IP address of the source device. If you enable the public IP address recording feature, this field is added to each raw log when the server receives the logs. |
| __tag__:__path__ | STRING | The path of the log file collected by Logtail. Logtail automatically adds this field to logs. |
| __tag__:__hostname__ | STRING | The hostname of the device from which Logtail collects data. Logtail automatically adds this field to logs. |
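All of the reserved fields arrive as STRING values, and the two time fields hold UNIX timestamps in seconds. A minimal sketch of decoding them from a synchronized record (the sample record below is illustrative, not actual Simple Log Service output):

```python
from datetime import datetime, timezone

# Illustrative record carrying the metadata fields described above;
# the values are made up for the example.
record = {
    "__time__": "1732665600",
    "__source__": "192.168.0.10",
    "__topic__": "nginx-access",
    "__tag__:__receive_time__": "1732665602",
    "__tag__:__hostname__": "web-01",
}

# Both time fields are UNIX timestamps in seconds, stored as strings.
written_at = datetime.fromtimestamp(int(record["__time__"]), tz=timezone.utc)
received_at = datetime.fromtimestamp(int(record["__tag__:__receive_time__"]), tz=timezone.utc)

# Lag between the log being written and the server receiving it.
ingest_lag = (received_at - written_at).total_seconds()
print(written_at.isoformat(), ingest_lag)
```
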

Procedure

  1. Go to the DataStudio page.

    Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and Governance > Data Development. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.

  2. In the Scheduled Workflow pane of the DataStudio page, move the pointer over the Create icon and choose Create Node > Data Integration > Real-time Synchronization.

    Alternatively, find the desired workflow in the Scheduled Workflow pane, right-click the workflow name, and then choose Create Node > Data Integration > Real-time Synchronization.

  3. In the Create Node dialog box, set the Sync Method parameter to End-to-end ETL and configure the Name and Path parameters.

    Important

    The node name cannot exceed 128 characters in length and can contain only letters, digits, underscores (_), and periods (.).

  4. Click Confirm.

  5. On the configuration tab of the real-time synchronization node, drag LogHub in the Input section to the canvas on the right.

  6. Click the LogHub node. In the panel that appears, configure the parameters.


    | Parameter | Description |
    | --- | --- |
    | Data source | The name of the LogHub data source that you added to DataWorks. You can select only a LogHub data source. If no data source is available, click New data source on the right to go to the Data Sources page in Management Center and add one. For more information, see Add a LogHub (SLS) data source. |
    | Logstore | The name of the Logstore from which you want to read data. You can click Data preview to preview data in the selected Logstore. |
    | Advanced configuration | Specifies whether to split the synchronization task across shards. If you set the Split tasks parameter to Split, you must also configure the Split rules parameter. A split rule takes the form shardId % X = Y, where shardId is the ID of a Logstore shard, X is the total number of split tasks, and Y identifies one of those tasks; Y must be in the range [0, X-1]. For example, shardId % 5 = 3 splits synchronization into five tasks, and the task with this rule reads the shards whose IDs leave a remainder of 3 when divided by 5. |
    | Output Fields | The fields that you want to synchronize. For the field descriptions, see Background information. |

  7. In the top toolbar of the configuration tab of the real-time synchronization node, click the Save icon to save the node.
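The Split rules expression from step 6 is plain modular arithmetic, so the shard-to-task assignment can be checked offline. A minimal sketch (the function name is illustrative; this is not DataWorks code):

```python
def shards_for_task(shard_ids, x, y):
    """Return the shard IDs that a split task with rule shardId % X = Y reads.

    x is the total number of split tasks; y, in [0, x-1], selects one task.
    """
    if not 0 <= y < x:
        raise ValueError("Y must be in the range [0, X-1]")
    return [s for s in shard_ids if s % x == y]

# With shards 0-9 and the rule shardId % 5 = 3, this task reads shards 3 and 8.
print(shards_for_task(range(10), 5, 3))  # [3, 8]
```

Every shard is read by exactly one task, since each shard ID has exactly one remainder modulo X.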