DataWorks:Simple Log Service data source

Last Updated: Nov 19, 2024

DataWorks provides LogHub Reader and LogHub Writer for you to read data from and write data to Simple Log Service data sources. This topic describes the capabilities of synchronizing data from or to Simple Log Service data sources.

Limits

When you use DataWorks Data Integration to run batch synchronization tasks to write data to Simple Log Service, Simple Log Service does not ensure idempotence. If you rerun a failed task, redundant data may be generated.

Data types

The following table provides the support status of main data types in Simple Log Service.

| Data type | LogHub Reader for batch data read | LogHub Writer for batch data write | LogHub Reader for real-time data read |
| --- | --- | --- | --- |
| STRING | Supported | Supported | Supported |

  • LogHub Writer for batch data write

    LogHub Writer converts the data types supported by Data Integration to STRING before data is written to Simple Log Service. The following table lists the data type mappings based on which LogHub Writer converts data types.

    | Data Integration data type | Simple Log Service data type |
    | --- | --- |
    | LONG | STRING |
    | DOUBLE | STRING |
    | STRING | STRING |
    | DATE | STRING |
    | BOOLEAN | STRING |
    | BYTES | STRING |

  • LogHub Reader for real-time data read

    The following table describes the metadata fields that LogHub Reader for real-time data synchronization provides.

    | Field | Data type | Description |
    | --- | --- | --- |
    | __time__ | STRING | A reserved field of Simple Log Service. The field specifies the time when logs are written to Simple Log Service. The field value is a UNIX timestamp in seconds. |
    | __source__ | STRING | A reserved field of Simple Log Service. The field specifies the source device from which logs are collected. |
    | __topic__ | STRING | A reserved field of Simple Log Service. The field specifies the name of the topic for logs. |
    | __tag__:__receive_time__ | STRING | The time when logs arrive at the server. If you enable the public IP address recording feature, this field is added to each raw log when the server receives the logs. The field value is a UNIX timestamp in seconds. |
    | __tag__:__client_ip__ | STRING | The public IP address of the source device. If you enable the public IP address recording feature, this field is added to each raw log when the server receives the logs. |
    | __tag__:__path__ | STRING | The path of the log file collected by Logtail. Logtail automatically adds this field to logs. |
    | __tag__:__hostname__ | STRING | The hostname of the device from which Logtail collects data. Logtail automatically adds this field to logs. |
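
    For reference, a log entry that LogHub Reader consumes during real-time synchronization might look similar to the following example. All values are illustrative, and the business fields (request_method and status) are hypothetical; only the reserved fields and tag fields are provided by Simple Log Service.

    {
        "__time__": "1700000000",                      // UNIX timestamp in seconds. 
        "__source__": "192.168.1.10",
        "__topic__": "nginx-access",
        "__tag__:__receive_time__": "1700000001",
        "__tag__:__client_ip__": "203.0.113.10",
        "__tag__:__path__": "/var/log/nginx/access.log",
        "__tag__:__hostname__": "web-server-01",
        "request_method": "GET",                       // Business field from the raw log (hypothetical). 
        "status": "200"                                // Business field from the raw log (hypothetical). 
    }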

Add a data source

Before you develop a synchronization task in DataWorks, you must add the required data source to DataWorks by following the instructions in Add and manage data sources. You can view the infotips of parameters in the DataWorks console to understand the meanings of the parameters when you add a data source.

Develop a data synchronization task

For information about the entry point and the procedure for configuring a synchronization task, see the following configuration guides.

Note

When you configure a data synchronization task that synchronizes data from a Simple Log Service data source, the data source allows you to filter data by using the query syntax of Simple Log Service and SLS Processing Language (SPL) statements. Simple Log Service uses SPL to process logs. For more information, see Appendix 2: SPL syntax.

Configure a batch synchronization task to synchronize data of a single table

Configure a real-time synchronization task to synchronize data of a single table

For more information about the configuration procedure, see Create a real-time synchronization task to synchronize incremental data from a single table and Configure a real-time synchronization task in DataStudio.

Configure synchronization settings to implement batch synchronization of all data in a database, real-time synchronization of full data or incremental data in a database, and real-time synchronization of data from sharded tables in a sharded database

For more information about the configuration procedure, see Configure a synchronization task in Data Integration.

FAQ

For more information, see FAQ about Data Integration.

Appendix 1: Code and parameters

Configure a batch synchronization task by using the code editor

If you want to configure a batch synchronization task by using the code editor, you must configure the related parameters in the script based on the unified script format requirements. For more information, see Configure a batch synchronization task by using the code editor. The following information describes the parameters that you must configure for data sources when you configure a batch synchronization task by using the code editor.

Code for LogHub Reader

{
 "type":"job",
 "version":"2.0",// The version number. 
 "steps":[
     {
         "stepType":"LogHub",// The plug-in name. 
         "parameter":{
             "datasource":"",// The name of the data source. 
             "column":[// The names of the columns. 
                 "col0",
                 "col1",
                 "col2",
                 "col3",
                 "col4",
                 "C_Category",
                 "C_Source",
                 "C_Topic",
                 "C_MachineUUID", // The log topic. 
                 "C_HostName", // The hostname. 
                 "C_Path", // The path. 
                 "C_LogTime" // The time when the event occurred. 
             ],
             "beginDateTime":"",// The start time of data consumption. 
             "batchSize":"",// The number of data entries that are queried at a time. 
             "endDateTime":"",// The end time of data consumption. 
             "fieldDelimiter":",",// The column delimiter. 
             "logstore":""// The name of the Logstore. 
         },
         "name":"Reader",
         "category":"reader"
     },
     { 
         "stepType":"stream",
         "parameter":{},
         "name":"Writer",
         "category":"writer"
     }
 ],
 "setting":{
     "errorLimit":{
         "record":"0"// The maximum number of dirty data records allowed. 
     },
     "speed":{
         "throttle":true,// Specifies whether to enable throttling. The value false indicates that throttling is disabled, and the value true indicates that throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true. 
            "concurrent":1 // The maximum number of parallel threads. 
            "mbps":"12",// The maximum transmission rate. Unit: MB/s. 
     }
 },
 "order":{
     "hops":[
         {
             "from":"Reader",
             "to":"Writer"
         }
     ]
 }
}

Parameters in code for LogHub Reader

  • endPoint (required; no default value)

    The endpoint of Simple Log Service. The endpoint is a URL that you can use to access the project and the log data in the project. The endpoint varies based on the project name and the Alibaba Cloud region where the project resides. For more information about the endpoints of Simple Log Service in each region, see Endpoints.

  • accessId (required; no default value)

    The AccessKey ID of the Alibaba Cloud account that is used to access the Simple Log Service project.

  • accessKey (required; no default value)

    The AccessKey secret of the Alibaba Cloud account that is used to access the Simple Log Service project.

  • project (required; no default value)

    The name of the Simple Log Service project. A project is the basic unit for managing resources in Simple Log Service. Projects are used to isolate resources and control access to the resources.

  • logstore (required; no default value)

    The name of the Logstore. A Logstore is a basic unit that you can use to collect, store, and query log data in Simple Log Service.

  • batchSize (optional; default value: 128)

    The number of data entries to read from Simple Log Service at a time.

  • column (required; no default value)

    The names of the columns. You can set this parameter to the metadata in Simple Log Service. Supported metadata includes the log topic, unique identifier of the host, hostname, path, and log time.

    Note: Column names are case-sensitive. For more information about column names in Simple Log Service, see Introduction.

  • beginDateTime (required; no default value)

    The start time of data consumption. The value is the time at which log data arrives at Simple Log Service. This parameter defines the left boundary of a left-closed, right-open interval in the format of yyyyMMddHHmmss, such as 20180111013000. This parameter can work with the scheduling parameters in DataWorks.

    For example, if you enter beginDateTime=${yyyymmdd-1} in the Parameters field on the Properties tab, you can set Start Timestamp to ${beginDateTime}000000 on the task configuration tab to consume logs that are generated from 00:00:00 of the data timestamp. For more information, see Supported formats of scheduling parameters.

    Note: The beginDateTime and endDateTime parameters must be used in pairs.

  • endDateTime (required; no default value)

    The end time of data consumption. This parameter defines the right boundary of a left-closed, right-open interval in the format of yyyyMMddHHmmss, such as 20180111013010. This parameter can work with the scheduling parameters in DataWorks.

    For example, if you enter endDateTime=${yyyymmdd} in the Parameters field on the Properties tab, you can set End Timestamp to ${endDateTime}000000 on the task configuration tab to consume logs that are generated until 00:00:00 of the day after the data timestamp. For more information, see Supported formats of scheduling parameters.

    Note: The time that is specified by the endDateTime parameter of the previous interval cannot be earlier than the time that is specified by the beginDateTime parameter of the current interval. Otherwise, data in some time ranges may not be read.
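
The following snippet sketches how the beginDateTime and endDateTime parameters can be combined with scheduling parameters. It assumes that beginDateTime=${yyyymmdd-1} and endDateTime=${yyyymmdd} are defined in the Parameters field on the Properties tab; the data source and Logstore names are placeholders.

"parameter":{
    "datasource":"my_sls_datasource",// A placeholder data source name. 
    "logstore":"my_logstore",// A placeholder Logstore name. 
    "column":["col0","col1"],
    "batchSize":"256",
    "beginDateTime":"${beginDateTime}000000",// Resolves to 00:00:00 of the data timestamp. 
    "endDateTime":"${endDateTime}000000"// Resolves to 00:00:00 of the day after the data timestamp. 
}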

Code for LogHub Writer

{
    "type": "job",
    "version": "2.0",// The version number. 
    "steps": [
        { 
            "stepType": "stream",
            "parameter": {},
            "name": "Reader",
            "category": "reader"
        },
        {
            "stepType":"LogHub",// The plug-in name. 
            "parameter": {
                "datasource": "",// The name of the data source. 
                "column": [// The names of the columns. 
                    "col0",
                    "col1",
                    "col2",
                    "col3",
                    "col4",
                    "col5"
                ],
                "topic": "",// The name of the topic. 
                "batchSize": "1024",// The number of data records to write at a time. 
                "logstore": ""// The name of the Logstore. 
            },
            "name": "Writer",
            "category": "writer"
        }
    ],
    "setting": {
        "errorLimit": {
            "record": ""// The maximum number of dirty data records allowed. 
        },
        "speed": {
            "throttle":true,// Specifies whether to enable throttling. The value false indicates that throttling is disabled, and the value true indicates that throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true. 
            "concurrent":3, // The maximum number of parallel threads. 
            "mbps":"12"// The maximum transmission rate. Unit: MB/s. 
        }
    },
    "order": {
        "hops": [
            {
                "from": "Reader",
                "to": "Writer"
            }
        ]
    }
}

Parameters in code for LogHub Writer

Note

LogHub Writer obtains data from a reader and converts the data types supported by Data Integration to STRING. When the number of buffered data records reaches the value specified by the batchSize parameter, LogHub Writer sends the records to Simple Log Service in one batch by using Simple Log Service SDK for Java.

  • endpoint (required; no default value)

    The endpoint of Simple Log Service. The endpoint is a URL that you can use to access the project and the log data in the project. The endpoint varies based on the project name and the Alibaba Cloud region where the project resides. For more information about the endpoints of Simple Log Service in each region, see Endpoints.

  • accessKeyId (required; no default value)

    The AccessKey ID of the Alibaba Cloud account that is used to access the Simple Log Service project.

  • accessKeySecret (required; no default value)

    The AccessKey secret of the Alibaba Cloud account that is used to access the Simple Log Service project.

  • project (required; no default value)

    The name of the Simple Log Service project.

  • logstore (required; no default value)

    The name of the Logstore. A Logstore is a basic unit that you can use to collect, store, and query log data in Simple Log Service.

  • topic (optional; default value: an empty string)

    The name of the topic.

  • batchSize (optional; default value: 1024)

    The number of data records to write to Simple Log Service at a time. Maximum value: 4096.

    Note: The size of the data to write to Simple Log Service at a time cannot exceed 5 MB. You can change the value of this parameter based on the size of a single data record.

  • column (required; no default value)

    The names of columns in each data record.
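
The following snippet is a rough sizing sketch; the record size is an assumption, not a measured value, and the data source, Logstore, and column names are placeholders. If a single record is about 10 KB, the default batchSize of 1024 would produce batches of roughly 10 MB, which exceeds the 5 MB limit, so a smaller value is needed.

"parameter":{
    "datasource":"my_sls_datasource",// A placeholder data source name. 
    "logstore":"my_logstore",// A placeholder Logstore name. 
    "topic":"",
    "column":["col0","col1","col2"],
    "batchSize":"256"// 256 records x ~10 KB is about 2.5 MB per batch, which stays under the 5 MB limit. 
}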

Appendix 2: SPL syntax

When you configure a data synchronization task that synchronizes data from a Simple Log Service data source, the data source allows you to filter data by using the query syntax of Simple Log Service and SPL statements. Simple Log Service uses SPL to process logs. The following examples show the SPL syntax in different scenarios and, where applicable, the equivalent SQL statements.

Note

For more information about SPL, see SPL overview.

  • Data filtering

    SQL statement:

    SELECT * WHERE Type='write'

    SPL statement:

    | where Type='write'

  • Field processing and filtering

    SQL statement (search for a field in exact mode and rename the field):

    SELECT "__tag__:node" AS node, path

    SPL statements:

    Search for a field in exact mode and rename the field.
    | project node="__tag__:node", path

    Search for fields by using a wildcard pattern.
    | project -wildcard "__tag__:*"

    Rename a field without affecting other fields.
    | project-rename node="__tag__:node"

    Remove fields by using a wildcard pattern.
    | project-away -wildcard "__tag__:*"

  • Data standardization (SQL function calls)

    Convert a data type and parse time.

    SQL statement:

    SELECT 
      CAST(Status AS BIGINT) AS Status, 
      date_parse(Time, '%Y-%m-%d %H:%i') AS Time

    SPL statement:

    | extend Status=cast(Status as BIGINT), Time=date_parse(Time, '%Y-%m-%d %H:%i')

  • Field extraction

    SPL statements:

    Extract data by using a regular expression based on one-time matching.
    | parse-regexp protocol, '(\w+)/(\d+)' as scheme, version

    Extract JSON data based on full expansion.
    | parse-json -path='$.0' content

    Extract data from a CSV file.
    | parse-csv -delim='^_^' content as ip, time, host
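
SPL instructions can also be chained into a single pipeline. The following statement is a hypothetical example that assumes the logs contain Type and Status fields; it filters logs, converts a field type, and removes tag fields in one pass:

| where Type='write' | extend Status=cast(Status as BIGINT) | project-away -wildcard "__tag__:*"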