LogHub Reader reads data from LogHub topics in real time by using the LogHub SDK and supports shard merging and splitting. After shards are merged or split, duplicate data records may exist, but no data is lost. In other words, delivery is at least once.
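Because delivery is at least once, a downstream consumer should be prepared to drop re-delivered records. The following is a minimal de-duplication sketch; keying on a content hash is an assumption for illustration, and you should use whatever unique identity your own pipeline can guarantee:

```python
import hashlib
import json

def dedup(records, seen):
    """Return only records not seen before, keyed by a hash of their content.

    `records` are dicts as delivered by the reader; `seen` is a set of hashes
    that the caller persists across batches.
    """
    unique = []
    for record in records:
        # Stable serialization so the same record always hashes the same way.
        key = hashlib.sha256(
            json.dumps(record, sort_keys=True).encode("utf-8")
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

seen = set()
batch1 = [{"__time__": "1700000000", "msg": "a"},
          {"__time__": "1700000001", "msg": "b"}]
# Batch 2 re-delivers record "b", e.g. after a shard split.
batch2 = [{"__time__": "1700000001", "msg": "b"},
          {"__time__": "1700000002", "msg": "c"}]
print(len(dedup(batch1, seen)))  # 2
print(len(dedup(batch2, seen)))  # 1 -- the duplicate "b" is dropped
```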
Background information
The following table describes the metadata fields that LogHub Reader for real-time synchronization provides.
| Field | Data type | Description |
| --- | --- | --- |
| __time__ | STRING | A reserved field of Simple Log Service. Specifies the time when the log is written to Simple Log Service, as a UNIX timestamp in seconds. |
| __source__ | STRING | A reserved field of Simple Log Service. Specifies the source device from which the log was collected. |
| __topic__ | STRING | A reserved field of Simple Log Service. Specifies the name of the topic of the log. |
| __tag__:__receive_time__ | STRING | The time when the log arrived at the server, as a UNIX timestamp in seconds. If the public IP address recording feature is enabled, the server adds this field to each raw log on receipt. |
| __tag__:__client_ip__ | STRING | The public IP address of the source device. If the public IP address recording feature is enabled, the server adds this field to each raw log on receipt. |
| __tag__:__path__ | STRING | The path of the log file from which Logtail collected the log. Logtail adds this field automatically. |
| __tag__:__hostname__ | STRING | The hostname of the device from which Logtail collected the log. Logtail adds this field automatically. |
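The time-related metadata fields arrive as strings holding UNIX timestamps in seconds, so they need to be parsed before use. A short sketch of consuming them downstream (the sample record values are invented for illustration):

```python
from datetime import datetime, timezone

# A sample record shaped like the metadata fields in the table above.
record = {
    "__time__": "1700000000",
    "__source__": "192.168.0.10",
    "__topic__": "nginx-access",
    "__tag__:__receive_time__": "1700000002",
    "__tag__:__hostname__": "web-01",
}

def written_at(record):
    # __time__ is a UNIX timestamp in seconds, delivered as a STRING.
    return datetime.fromtimestamp(int(record["__time__"]), tz=timezone.utc)

def ingest_delay_seconds(record):
    # Delay between the log being written and the server receiving it.
    return int(record["__tag__:__receive_time__"]) - int(record["__time__"])

print(written_at(record).isoformat())  # 2023-11-14T22:13:20+00:00
print(ingest_delay_seconds(record))    # 2
```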
Procedure
Go to the DataStudio page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.
In the Scheduled Workflow pane of the DataStudio page, move the pointer over the icon and choose .
Alternatively, find the desired workflow in the Scheduled Workflow pane, right-click the workflow name, and then choose .
In the Create Node dialog box, set the Sync Method parameter to End-to-end ETL and configure the Name and Path parameters.
Important: The node name cannot exceed 128 characters in length and can contain only letters, digits, underscores (_), and periods (.).
Click Confirm.
On the configuration tab of the real-time synchronization node, drag LogHub in the Input section to the canvas on the right.
Click the LogHub node. In the panel that appears, configure the parameters.
| Parameter | Description |
| --- | --- |
| Data source | The name of the LogHub data source that you added to DataWorks. You can select only a LogHub data source. If no data source is available, click New data source on the right to go to the Data Sources page in Management Center and add one. For more information, see Add a LogHub (SLS) data source. |
| Logstore | The name of the Logstore from which you want to read data. You can click Data preview to preview data in the selected Logstore. |
| Advanced configuration | Specifies whether to split the read workload across multiple tasks. If you set the Split tasks parameter to Split, you must also configure the Split rules parameter. A split rule has the format shardId % X = Y: shardId is the ID of a source shard, X is the total number of split tasks, and Y identifies one of those tasks, with Y in the range [0, X-1]. The task with rule shardId % X = Y reads every shard whose ID leaves remainder Y when divided by X. For example, shardId % 5 = 3 splits the workload across five tasks, and the task with this rule reads the shards whose IDs are 3, 8, 13, and so on. |
| Output Fields | The fields from which you want to synchronize data. For descriptions of the fields, see Background information. |
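The split rule shardId % X = Y is plain modular arithmetic over shard IDs: together, the X tasks partition the shards, and each shard is read by exactly one task. A small sketch (the function name and the ten-shard example are illustrative):

```python
def shards_for_task(shard_ids, num_tasks, task_index):
    """Shards handled by one split task under the rule shardId % X = Y."""
    return [s for s in shard_ids if s % num_tasks == task_index]

shard_ids = list(range(10))

# Rule shardId % 5 = 3: this task reads shards 3 and 8.
print(shards_for_task(shard_ids, 5, 3))  # [3, 8]

# Together the five tasks (Y = 0..4) cover every shard exactly once.
covered = sorted(s for y in range(5)
                 for s in shards_for_task(shard_ids, 5, y))
print(covered == shard_ids)  # True
```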
In the top toolbar of the configuration tab of the real-time synchronization node, click the icon to save the node.