DataWorks provides LogHub Reader and LogHub Writer for you to read data from and write data to Simple Log Service data sources. This topic describes the capabilities of synchronizing data from or to Simple Log Service data sources.
Limits
When you use DataWorks Data Integration to run batch synchronization tasks to write data to Simple Log Service, Simple Log Service does not ensure idempotence. If you rerun a failed task, redundant data may be generated.
Data types
The following table provides the support status of main data types in Simple Log Service.
Data type | LogHub Reader for batch data read | LogHub Writer for batch data write | LogHub Reader for real-time data read |
STRING | Supported | Supported | Supported |
LogHub Writer for batch data write
LogHub Writer converts the data types supported by Data Integration to STRING before data is written to Simple Log Service. The following table lists the data type mappings based on which LogHub Writer converts data types.
Data Integration data type | Simple Log Service data type |
LONG | STRING |
DOUBLE | STRING |
STRING | STRING |
DATE | STRING |
BOOLEAN | STRING |
BYTES | STRING |
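For example (the field values are illustrative, not taken from a real task): a record that a reader produces with the LONG value 123, the DOUBLE value 3.14, the BOOLEAN value true, and a DATE value of 2024-05-01 00:00:00 arrives in Simple Log Service as the strings "123", "3.14", "true", and "2024-05-01 00:00:00". The exact string form of a DATE value depends on how the reader serializes it.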
LogHub Reader for real-time data read
The following table describes the metadata fields that LogHub Reader for real-time data synchronization provides.
Field provided by LogHub Reader for real-time data synchronization | Data type | Description |
__time__ | STRING | A reserved field of Simple Log Service. The field specifies the time when logs are written to Simple Log Service. The field value is a UNIX timestamp in seconds. |
__source__ | STRING | A reserved field of Simple Log Service. The field specifies the source device from which logs are collected. |
__topic__ | STRING | A reserved field of Simple Log Service. The field specifies the name of the topic for logs. |
__tag__:__receive_time__ | STRING | The time when logs arrive at the server. If you enable the public IP address recording feature, this field is added to each raw log when the server receives the logs. The field value is a UNIX timestamp in seconds. |
__tag__:__client_ip__ | STRING | The public IP address of the source device. If you enable the public IP address recording feature, this field is added to each raw log when the server receives the logs. |
__tag__:__path__ | STRING | The path of the log file collected by Logtail. Logtail automatically adds this field to logs. |
__tag__:__hostname__ | STRING | The hostname of the device from which Logtail collects data. Logtail automatically adds this field to logs. |
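For reference, the following is a hypothetical example of a single log record, serialized as JSON, that a real-time synchronization task might produce. All values are illustrative; status is a user-defined log field, and the remaining keys are the metadata fields listed in the preceding table.
{
    "__time__": "1700000000",
    "__source__": "192.168.0.10",
    "__topic__": "nginx_access",
    "__tag__:__receive_time__": "1700000001",
    "__tag__:__client_ip__": "203.0.113.5",
    "__tag__:__path__": "/var/log/nginx/access.log",
    "__tag__:__hostname__": "web-01",
    "status": "200"
}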
Add a data source
Before you develop a synchronization task in DataWorks, you must add the required data source to DataWorks by following the instructions in Add and manage data sources. You can view the infotips of parameters in the DataWorks console to understand the meanings of the parameters when you add a data source.
Develop a data synchronization task
For information about the entry point for and the procedure of configuring a synchronization task, see the following configuration guides.
Note
When you configure a data synchronization task that synchronizes data from a Simple Log Service data source, the data source allows you to filter data by using the query syntax of Simple Log Service and SLS Processing Language (SPL) statements. Simple Log Service uses SPL to process logs. For more information, see Appendix 2: SPL syntax.
Appendix 1: Code and parameters
Configure a batch synchronization task by using the code editor
If you want to configure a batch synchronization task by using the code editor, you must configure the related parameters in the script based on the unified script format requirements. For more information, see Configure a batch synchronization task by using the code editor. The following information describes the parameters that you must configure for data sources when you configure a batch synchronization task by using the code editor.
Code for LogHub Reader
{
    "type":"job",
    "version":"2.0",
    "steps":[
        {
            "stepType":"LogHub",
            "parameter":{
                "datasource":"",
                "column":[
                    "col0",
                    "col1",
                    "col2",
                    "col3",
                    "col4",
                    "C_Category",
                    "C_Source",
                    "C_Topic",
                    "C_MachineUUID",
                    "C_HostName",
                    "C_Path",
                    "C_LogTime"
                ],
                "beginDateTime":"",
                "batchSize":"",
                "endDateTime":"",
                "fieldDelimiter":",",
                "logstore":""
            },
            "name":"Reader",
            "category":"reader"
        },
        {
            "stepType":"stream",
            "parameter":{},
            "name":"Writer",
            "category":"writer"
        }
    ],
    "setting":{
        "errorLimit":{
            "record":"0"
        },
        "speed":{
            "throttle":true,
            "concurrent":1,
            "mbps":"12"
        }
    },
    "order":{
        "hops":[
            {
                "from":"Reader",
                "to":"Writer"
            }
        ]
    }
}
Parameters in code for LogHub Reader
Parameter | Description | Required | Default Value |
endPoint | The endpoint of Simple Log Service. The endpoint is a URL that you can use to access the project and the log data in the project. The endpoint varies based on the project name and the Alibaba Cloud region where the project resides. For more information about the endpoints of Simple Log Service in each region, see Endpoints. | Yes | No default value |
accessId | The AccessKey ID of the Alibaba Cloud account that is used to access the Simple Log Service project. | Yes | No default value |
accessKey | The AccessKey secret of the Alibaba Cloud account that is used to access the Simple Log Service project. | Yes | No default value |
project | The name of the Simple Log Service project. A project is the basic unit for managing resources in Simple Log Service. Projects are used to isolate resources and control access to the resources. | Yes | No default value |
logstore | The name of the Logstore. A Logstore is a basic unit that you can use to collect, store, and query log data in Simple Log Service. | Yes | No default value |
batchSize | The number of data entries to read from Simple Log Service at a time. | No | 128 |
column | The names of the columns. You can also specify Simple Log Service metadata fields as columns. Supported metadata includes the log topic, unique identifier of the host, hostname, path, and log time. Note Column names are case-sensitive. For more information about column names in Simple Log Service, see Introduction. | Yes | No default value |
beginDateTime | The start time of data consumption. The value is the time at which log data arrives at Simple Log Service. This parameter defines the left boundary of a left-closed, right-open interval in the format of yyyyMMddHHmmss, such as 20180111013000. This parameter can work with the scheduling parameters in DataWorks. For example, if you enter beginDateTime=${yyyymmdd-1} in the Parameters field on the Properties tab, you can set Start Timestamp to ${beginDateTime}000000 on the task configuration tab to consume logs that are generated from 00:00:00 of the data timestamp. For more information, see Supported formats of scheduling parameters. Note The beginDateTime and endDateTime parameters must be used in pairs. | Yes | No default value |
endDateTime | The end time of data consumption. This parameter defines the right boundary of a left-closed, right-open interval in the format of yyyyMMddHHmmss, such as 20180111013010. This parameter can work with the scheduling parameters in DataWorks. For example, if you enter endDateTime=${yyyymmdd} in the Parameters field on the Properties tab, you can set End Timestamp to ${endDateTime}000000 on the task configuration tab to consume logs that are generated until 00:00:00 of the next day of the data timestamp. For more information, see Supported formats of scheduling parameters. Note The time that is specified by the endDateTime parameter of the previous interval cannot be earlier than the time that is specified by the beginDateTime parameter of the current interval. Otherwise, data in some time ranges may not be read. For an example that combines these parameters with scheduling parameters, see the sketch after this table. | Yes | No default value |
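The following snippet is a minimal sketch of the reader parameter block that uses the scheduling parameters described above. It assumes that beginDateTime=${yyyymmdd-1} and endDateTime=${yyyymmdd} are defined in the Parameters field on the Properties tab; the data source name, Logstore name, column list, and batchSize are placeholders.
"parameter":{
    "datasource":"my_sls_datasource",
    "logstore":"my_logstore",
    "column":["col0","C_Topic","C_LogTime"],
    "batchSize":"256",
    "beginDateTime":"${beginDateTime}000000",
    "endDateTime":"${endDateTime}000000",
    "fieldDelimiter":","
}
With this configuration, each run reads only the logs that arrived at Simple Log Service between 00:00:00 of the data timestamp and 00:00:00 of the next day.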
Code for LogHub Writer
{
    "type": "job",
    "version": "2.0",
    "steps": [
        {
            "stepType": "stream",
            "parameter": {},
            "name": "Reader",
            "category": "reader"
        },
        {
            "stepType": "LogHub",
            "parameter": {
                "datasource": "",
                "column": [
                    "col0",
                    "col1",
                    "col2",
                    "col3",
                    "col4",
                    "col5"
                ],
                "topic": "",
                "batchSize": "1024",
                "logstore": ""
            },
            "name": "Writer",
            "category": "writer"
        }
    ],
    "setting": {
        "errorLimit": {
            "record": ""
        },
        "speed": {
            "throttle": true,
            "concurrent": 3,
            "mbps": "12"
        }
    },
    "order": {
        "hops": [
            {
                "from": "Reader",
                "to": "Writer"
            }
        ]
    }
}
Parameters in code for LogHub Writer
Note
LogHub Writer obtains data from a reader and converts the data types supported by Data Integration into STRING. When the number of buffered data records reaches the value specified for the batchSize parameter, LogHub Writer sends the buffered records to Simple Log Service in a single request by using Simple Log Service SDK for Java.
Parameter | Description | Required | Default Value |
endpoint | The endpoint of Simple Log Service. The endpoint is a URL that you can use to access the project and the log data in the project. The endpoint varies based on the project name and the Alibaba Cloud region where the project resides. For more information about the endpoints of Simple Log Service in each region, see Endpoints. | Yes | No default value |
accessKeyId | The AccessKey ID of the Alibaba Cloud account that is used to access the Simple Log Service project. | Yes | No default value |
accessKeySecret | The AccessKey secret of the Alibaba Cloud account that is used to access the Simple Log Service project. | Yes | No default value |
project | The name of the Simple Log Service project. | Yes | No default value |
logstore | The name of the Logstore. A Logstore is a basic unit that you can use to collect, store, and query log data in Simple Log Service. | Yes | No default value |
topic | The name of the topic. | No | An empty string |
batchSize | The number of data records to write to Simple Log Service at a time. Default value: 1024. Maximum value: 4096. Note The size of the data to write to Simple Log Service at a time cannot exceed 5 MB. You can change the value of this parameter based on the size of a single data record, as shown in the sizing example after this table. | No | 1024 |
column | The names of columns in each data record. | Yes | No default value |
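As a rough sizing example (the record sizes below are assumptions, not measured values): because a single request cannot exceed 5 MB, records that average about 4 KB allow a batchSize of up to roughly 1,280 (5,120 KB / 4 KB), so the default of 1024 is safe; for records that average about 10 KB, a value of about 500 keeps each request under the limit.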
Appendix 2: SPL syntax
When you configure a data synchronization task that synchronizes data from a Simple Log Service data source, the data source allows you to filter data by using the query syntax of Simple Log Service and SPL statements. Simple Log Service uses SPL to process logs. The following examples describe the SPL syntax in different scenarios and compare it with the equivalent SQL statements.
Data filtering
SQL statement:
SELECT * WHERE Type='write'
SPL statement:
| where Type='write'
Field processing and filtering
SQL statement: search for a field in exact mode and rename the field.
SELECT "__tag__:node" AS node, path
SPL statements:
Search for a field in exact mode and rename the field.
| project node="__tag__:node", path
Search for fields by mode.
| project -wildcard "__tag__:*"
Rename a field without affecting other fields.
| project-rename node="__tag__:node"
Remove fields by mode.
| project-away -wildcard "__tag__:*"
Data standardization (SQL function calls)
SQL statement: convert a data type and parse time.
SELECT
    CAST(Status AS BIGINT) AS Status,
    date_parse(Time, '%Y-%m-%d %H:%i') AS Time
SPL statements: convert a data type and parse time.
| extend Status=cast(Status as BIGINT)
| extend Time=date_parse(Time, '%Y-%m-%d %H:%i')
Field extraction
SPL statements:
Extract data by using a regular expression based on one-time matching.
| parse-regexp protocol, '(\w+)/(\d+)' as scheme, version
Extract JSON data based on full expansion.
| parse-json -path='$.0' content
Extract data from a CSV file.
| parse-csv -delim='^_^' content as ip, time, host