
DataWorks:Configure LogHub Reader

Last Updated:Nov 27, 2024

LogHub Reader uses the LogHub SDK to read data from LogHub topics in real time. It supports shard merging and splitting; after shards are merged or split, duplicate data records may exist, but no data is lost.
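Because delivery after a shard merge or split is at-least-once, downstream consumers that cannot tolerate duplicates must deduplicate on their own. A minimal sketch in Python (the record shape and the choice of deduplication key are illustrative assumptions, not part of the LogHub API):

```python
def dedupe(records, key=lambda r: (r["__source__"], r["__time__"], r["content"])):
    """Drop records delivered more than once (at-least-once semantics).

    Assumes a (source, write time, content) tuple identifies a record;
    adapt the key to whatever is actually unique in your Logstore.
    """
    seen = set()
    unique = []
    for record in records:
        k = key(record)
        if k not in seen:
            seen.add(k)
            unique.append(record)
    return unique
```
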

Background information

The following table describes the metadata fields that LogHub Reader for real-time synchronization provides.

| Field | Data type | Description |
| --- | --- | --- |
| __time__ | STRING | A reserved field of Simple Log Service. The field specifies the time when logs are written to Simple Log Service. The field value is a UNIX timestamp in seconds. |
| __source__ | STRING | A reserved field of Simple Log Service. The field specifies the source device from which logs are collected. |
| __topic__ | STRING | A reserved field of Simple Log Service. The field specifies the name of the topic for logs. |
| __tag__:__receive_time__ | STRING | The time when logs arrive at the server. If you enable the public IP address recording feature, this field is added to each raw log when the server receives the logs. The field value is a UNIX timestamp in seconds. |
| __tag__:__client_ip__ | STRING | The public IP address of the source device. If you enable the public IP address recording feature, this field is added to each raw log when the server receives the logs. |
| __tag__:__path__ | STRING | The path of the log file collected by Logtail. Logtail automatically adds this field to logs. |
| __tag__:__hostname__ | STRING | The hostname of the device from which Logtail collects data. Logtail automatically adds this field to logs. |
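All of the reserved fields arrive as STRING values, and the two time fields hold UNIX timestamps in seconds. A minimal sketch of decoding them from a synchronized record (the sample record below is illustrative, not actual Simple Log Service output):

```python
from datetime import datetime, timezone

# Illustrative record carrying the metadata fields described above;
# the values are made up for the example.
record = {
    "__time__": "1732665600",
    "__source__": "192.168.0.10",
    "__topic__": "nginx-access",
    "__tag__:__receive_time__": "1732665602",
    "__tag__:__hostname__": "web-01",
}

# Both time fields are UNIX timestamps in seconds, stored as strings.
written_at = datetime.fromtimestamp(int(record["__time__"]), tz=timezone.utc)
received_at = datetime.fromtimestamp(int(record["__tag__:__receive_time__"]), tz=timezone.utc)

# Lag between the log being written and the server receiving it.
ingest_lag = (received_at - written_at).total_seconds()
print(written_at.isoformat(), ingest_lag)
```
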

Procedure

  1. Go to the DataStudio page.

    Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and Governance > Data Development. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.

  2. In the Scheduled Workflow pane of the DataStudio page, move the pointer over the Create icon and choose Create Node > Data Integration > Real-time Synchronization.

    Alternatively, find the desired workflow in the Scheduled Workflow pane, right-click the workflow name, and then choose Create Node > Data Integration > Real-time Synchronization.

  3. In the Create Node dialog box, set the Sync Method parameter to End-to-end ETL and configure the Name and Path parameters.

    Important

    The node name cannot exceed 128 characters in length and can contain only letters, digits, underscores (_), and periods (.).

  4. Click Confirm.

  5. On the configuration tab of the real-time synchronization node, drag LogHub in the Input section to the canvas on the right.

  6. Click the LogHub node. In the panel that appears, configure the parameters.


    | Parameter | Description |
    | --- | --- |
    | Data source | The name of the LogHub data source that you added to DataWorks. You can select only a LogHub data source. If no data source is available, click New data source on the right to go to the Data Sources page in Management Center and add one. For more information, see Add a LogHub (SLS) data source. |
    | Logstore | The name of the Logstore from which you want to read data. You can click Data preview to preview data in the selected Logstore. |
    | Advanced configuration | Specifies whether to split the synchronization task across shards. If you set the Split tasks parameter to Split, you must also configure the Split rules parameter. A split rule takes the form shardId % X = Y, where shardId is the ID of a Logstore shard, X is the total number of split tasks, and Y identifies one of those tasks; Y must be in the range [0, X-1]. For example, shardId % 5 = 3 splits synchronization into five tasks, and the task with this rule reads the shards whose IDs leave a remainder of 3 when divided by 5. |
    | Output Fields | The fields that you want to synchronize. For the field descriptions, see Background information. |

  7. In the top toolbar of the configuration tab of the real-time synchronization node, click the Save icon to save the node.
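The Split rules expression from step 6 is plain modular arithmetic, so the shard-to-task assignment can be checked offline. A minimal sketch (the function name is illustrative; this is not DataWorks code):

```python
def shards_for_task(shard_ids, x, y):
    """Return the shard IDs that a split task with rule shardId % X = Y reads.

    x is the total number of split tasks; y, in [0, x-1], selects one task.
    """
    if not 0 <= y < x:
        raise ValueError("Y must be in the range [0, X-1]")
    return [s for s in shard_ids if s % x == y]

# With shards 0-9 and the rule shardId % 5 = 3, this task reads shards 3 and 8.
print(shards_for_task(range(10), 5, 3))  # [3, 8]
```

Every shard is read by exactly one task, since each shard ID has exactly one remainder modulo X.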