By Lingsheng
As mentioned in the previous article, Comprehensive Upgrade of SLS Data Transformation: Integrate SPL Syntax, SLS data transformation now integrates the SPL syntax. Compared with the old data transformation domain-specific language (DSL), SPL offers a much more concise syntax for unstructured data scenarios, including improvements in maintaining intermediate types and referencing fields, as well as seamless compatibility with SQL functions.
Here we continue comparing the new data transformation SPL with the old data transformation DSL across various data processing requirements. For pure data synchronization scenarios that require no transformation, both the new SPL and the old DSL can simply specify empty logic, so that case is not covered further here.
In routine O&M, error log analysis is a crucial step in identifying and locating issues. Here, we use service logs as an example to show how to apply data transformation for data cleaning.
In the old data transformation DSL, data cleaning is done with e_keep/e_drop, whereas the new data transformation SPL uses the where instruction.
First, we need to filter error logs, that is, logs where the value of the level field is the string ERROR.
New SPL: Directly reference the level field.
    | where level='ERROR'

Old DSL: The field must be referenced through a function, v("level"). Several implementations are possible:
    e_keep(v("level") == "ERROR")
    e_drop(v("level") != "ERROR")
    e_if(v("level") != "ERROR", e_drop())
    e_keep(e_search("level==ERROR"))
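To make the semantics concrete, here is a minimal Python sketch (not SLS code) of what both variants do to a batch of events; the sample events are invented for illustration.

    # Each SLS event is conceptually a dict of string fields.
    events = [
        {"level": "ERROR", "msg": "disk full"},
        {"level": "INFO", "msg": "heartbeat ok"},
    ]

    # where level='ERROR'  /  e_keep(v("level") == "ERROR"):
    # keep only the events whose level field equals "ERROR".
    kept = [e for e in events if e.get("level") == "ERROR"]
    print(kept)  # [{'level': 'ERROR', 'msg': 'disk full'}]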
Because coding standards differ across service modules, the value of the level field may not be fixed: it could be ERROR, ERR, or E, among others. In this scenario, fuzzy string matching is needed.
New SPL: Reuse the SQL like expression.
    | where level like '%E%'

Old DSL: Several implementations are possible:
    e_keep(op_in(v("level"), "E"))
    e_keep(e_search("level: E"))
    e_if(op_not_in(v("level"), "E"), e_drop())
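As a rough Python illustration (sample levels invented), like '%E%' keeps any event whose level contains the letter E:

    events = [{"level": lv} for lv in ("ERROR", "ERR", "E", "WARN")]

    # where level like '%E%': substring match on the level field.
    kept = [e for e in events if "E" in e.get("level", "")]
    print([e["level"] for e in kept])  # ['ERROR', 'ERR', 'E']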
Besides filtering on text, we also need to compare value ranges. For instance, in access logs we want to filter user errors to analyze which operations may be poorly designed; specifically, we extract data where the value of the status field falls within the 4xx range.
New SPL: Cast the status field once and keep the intermediate type, with no repeated conversions.
    | extend cast(status as bigint) as status
    | where status>=400 and status<500

Old DSL: Every comparison needs its own conversion.
    e_keep(ct_int(v("status"))>=400 and ct_int(v("status"))<500)
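The following Python sketch (sample statuses invented) shows the range check; note that SPL casts status to bigint once, while the old DSL repeats ct_int in every comparison:

    events = [{"status": s} for s in ("200", "404", "403", "502")]

    # extend cast(status as bigint) as status | where status>=400 and status<500
    kept = [e for e in events if 400 <= int(e["status"]) < 500]
    print([e["status"] for e in kept])  # ['404', '403']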
In another O&M scenario, if the service runs into an error, it writes an error field; otherwise, the error field does not exist. We need to keep only the log entries that contain the error field.
New SPL: Use the SQL null expression to check whether a field exists.
    | where error is not null

Old DSL: Use the e_has function to check whether a field exists.
    e_keep(e_has("error"))
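In Python terms (sample events invented), a missing field behaves like SQL null, so the check reduces to key existence:

    events = [
        {"msg": "write failed", "error": "EIO"},
        {"msg": "write ok"},
    ]

    # where error is not null  /  e_keep(e_has("error"))
    kept = [e for e in events if e.get("error") is not None]
    print(kept)  # [{'msg': 'write failed', 'error': 'EIO'}]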
SPL uses the extend instruction to assign values to fields, which is equivalent to the e_set function in the old data transformation DSL.
New SPL:
1. Set a fixed field; the value type is retained and can be used directly.
    | extend kb=1024
    | extend size=size/1024
2. Extract an information item with a regular expression.
    | extend version=regexp_extract(data, '"version":\d+')
3. Extract an information item from JSON and assign it to a field; for the JSON object path syntax in SPL, see JsonPath.
    | extend version=json_extract(data, '$.version')

Old DSL:
1. Set a fixed field; the value is stored as str and must be converted again on each use.
    e_set("kb", 1024)
    e_set("size", ct_int(v("size")) / ct_int(v("kb")))
2. Extract an information item with a regular expression and assign it to a field.
    e_set("version", regex_select(v("data"), r'"version":\d+'))
3. Extract an information item from JSON and assign it to a field; for the JSON query syntax, see JMESPath.
    e_set("version", json_select(v("data"), "version"))
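A minimal Python sketch of the three assignments (the sample event is invented); it also shows the type difference: the SPL result stays numeric, while the old DSL stores every field as a string:

    import json
    import re

    event = {"size": "4096", "data": '{"version": 3}'}

    # 1. extend kb=1024 | extend size=size/1024 -- kb stays numeric in SPL;
    #    the old DSL would store "1024" as a string and need ct_int on use.
    event["kb"] = 1024
    event["size"] = int(event["size"]) // event["kb"]

    # 2. extend version=regexp_extract(data, '"version":\d+')
    m = re.search(r'"version":\d+', event["data"])
    event["version_raw"] = m.group(0) if m else None

    # 3. extend version=json_extract(data, '$.version')
    event["version"] = json.loads(event["data"])["version"]

    print(event)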
SPL allows you to process specified fields in place. You therefore do not need to declare a complete data schema (a list of fields and their types); you can operate directly on the specified fields without affecting other, irrelevant fields.
New SPL:
1. Select fields precisely.
    | project node="__tag__:node", path
2. Select fields by pattern, using the -wildcard switch.
    | project -wildcard "__tag__:*"
3. Rename some fields in place.
    | project-rename node="__tag__:node"
4. Exclude fields by pattern, using the -wildcard switch.
    | project-away -wildcard "__tag__:*"

Old DSL:
1. Select fields precisely.
    e_keep_fields("__tag__:node", "path", regex=False)
2. Select fields by pattern.
    e_keep_fields("__tag__:.*", regex=True)
3. Rename some fields in place.
    e_rename("__tag__:node", "node")
4. Exclude fields by pattern.
    e_drop_fields("__tag__:.*", regex=True)
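A small Python sketch of the field selection semantics (the sample event is invented); wildcard selection is modeled with a prefix test:

    event = {"__tag__:node": "node-1", "__tag__:pod": "pod-7",
             "path": "/var/log/app.log", "msg": "started"}

    # project node="__tag__:node", path: keep only the listed fields, renaming one.
    projected = {"node": event["__tag__:node"], "path": event["path"]}

    # project -wildcard "__tag__:*": keep only the fields matching the pattern.
    tags_only = {k: v for k, v in event.items() if k.startswith("__tag__:")}

    # project-away -wildcard "__tag__:*": drop the fields matching the pattern.
    no_tags = {k: v for k, v in event.items() if not k.startswith("__tag__:")}

    print(projected, tags_only, no_tags, sep="\n")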
Conditional expressions are essential for handling mixed data. SPL evaluates conditions through SQL expressions.
New SPL:
1. The SQL IF expression returns the value of different expressions based on a condition.
    | extend valid=IF(type is not null, 'true', 'false')
2. The SQL COALESCE expression takes the value of the first non-null expression, or supplies a default.
    | extend size=COALESCE(input, output, 0)
3. The SQL CASE-WHEN expression categorizes or selects data based on conditions.
    | extend size=CASE WHEN dir='I' THEN input WHEN dir='O' THEN output ELSE 0 END

Old DSL:
1. op_if returns the value of different expressions based on a condition.
    e_set("valid", op_if(e_has("type"), "true", "false"))
2. op_coalesce takes the value of the first non-null expression, or supplies a default.
    e_set("size", op_coalesce(v("input"), v("output"), "0"))
3. e_if_else controls the flow and constructs fields.
    e_if_else(e_has("type"), e_set("valid", "true"), e_set("valid", "false"))
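The three conditional forms translate to ordinary Python expressions (the sample event is invented):

    event = {"dir": "I", "input": 10, "output": 20}

    # IF(type is not null, 'true', 'false')
    valid = "true" if event.get("type") is not None else "false"

    # COALESCE(input, output, 0): the first non-null value wins.
    size = next(v for v in (event.get("input"), event.get("output"), 0) if v is not None)

    # CASE WHEN dir='I' THEN input WHEN dir='O' THEN output ELSE 0 END
    if event["dir"] == "I":
        size = event["input"]
    elif event["dir"] == "O":
        size = event["output"]
    else:
        size = 0

    print(valid, size)  # false 10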
During SPL execution, the data type of an SLS time field is always INTEGER or BIGINT. The SLS time fields are the data timestamp field __time__ and the data time nanosecond field __time_ns_part__. To update the data time, you need to use the extend instruction.
New SPL:
1. Extract the log time into the __time__ field.
    | extend time=date_parse(time, '%Y/%m/%d %H-%i-%S')
    | extend __time__=cast(to_unixtime(time) as bigint)
2. Normalize the time format.
    | extend time=date_parse(time, '%Y/%m/%d %H-%i-%S')
    | extend time=date_format(time, '%Y-%m-%d %H:%i:%S')

Old DSL:
1. Extract the log time into the __time__ field.
    e_set("__time__", dt_parsetimestamp(v("time"), fmt="%Y/%m/%d %H-%M-%S"))
2. Normalize the time format.
    e_set("time", dt_strftime(dt_parse(v("time"), fmt="%Y/%m/%d %H-%M-%S"), fmt="%Y-%m-%d %H:%M:%S"))
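A Python sketch of the two steps (sample timestamp invented; timezone handling simplified to UTC for illustration). Note that SPL uses MySQL-style format specifiers, so %i means minutes, whereas Python's strptime uses %M:

    from datetime import datetime, timezone

    event = {"time": "2024/06/01 08-30-15"}

    # date_parse(time, '%Y/%m/%d %H-%i-%S'), then
    # extend __time__=cast(to_unixtime(time) as bigint)
    dt = datetime.strptime(event["time"], "%Y/%m/%d %H-%M-%S")
    dt = dt.replace(tzinfo=timezone.utc)  # assume UTC for illustration
    event["__time__"] = int(dt.timestamp())

    # date_format(time, '%Y-%m-%d %H:%i:%S'): normalized string form.
    event["time"] = dt.strftime("%Y-%m-%d %H:%M:%S")

    print(event)  # {'time': '2024-06-01 08:30:15', '__time__': 1717230615}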
In machine data processing scenarios, extracting key information from unstructured or semi-structured data can be tedious. Because the data lacks a fixed pattern, many processing details must be considered, yet high efficiency is still required. SPL provides instructions for different extraction methods, such as regular expressions, JSON, and CSV.
New SPL: Use the parse-regexp instruction.
    | parse-regexp data, '(\S+)\s+(\w+)' as time, level

Old DSL: Use the e_regex function.
    e_regex("data", r"(\S+)\s+(\w+)", ["time", "level"])
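A Python re equivalent of the same extraction (the sample log line is invented):

    import re

    event = {"data": "2024-06-01T08:30:15 ERROR connection refused"}

    # parse-regexp data, '(\S+)\s+(\w+)' as time, level
    m = re.match(r"(\S+)\s+(\w+)", event["data"])
    if m:
        event["time"], event["level"] = m.group(1), m.group(2)

    print(event["time"], event["level"])  # 2024-06-01T08:30:15 ERROR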
New SPL: Use the parse-json instruction; for the JSON object path syntax, see JsonPath.
    | parse-json -path='$.x.y.z' data

Old DSL: Use the e_json function; for the JSON query syntax, see JMESPath.
    e_json("data", depth=1, jmes="x.y.z")
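Roughly, parse-json expands the object found at the given path into event fields; a Python sketch with an invented payload:

    import json

    event = {"data": '{"x": {"y": {"z": {"code": 200, "user": "u1"}}}}'}

    # parse-json -path='$.x.y.z' data: expand the object at $.x.y.z into fields.
    obj = json.loads(event["data"])["x"]["y"]["z"]
    event.update({k: str(v) for k, v in obj.items()})

    print(event["code"], event["user"])  # 200 u1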
New SPL:
❖ Single-character delimiter (CSV per RFC 4180); the delimiter and quote character can be specified.
    | parse-csv -delim='0' -quote='"' data as time, addr, user
❖ Multi-character delimiter.
    | parse-csv -delim='^_^' data as time, addr, user

Old DSL: Use the e_csv function.
    e_csv("data", ["time", "addr", "user"], sep="0", quote='"')
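A Python csv sketch of both cases (sample rows invented; a comma delimiter is used here for readability). Python's csv module only accepts a single-character delimiter, so the multi-character case falls back to str.split:

    import csv
    import io

    # Single-character delimiter with quoting, per RFC 4180.
    event = {"data": '2024-06-01,10.0.0.1,"alice, admin"'}
    row = next(csv.reader(io.StringIO(event["data"]), delimiter=",", quotechar='"'))
    event.update(dict(zip(["time", "addr", "user"], row)))

    # Multi-character delimiter: plain split, no quote handling.
    time_, addr, user = "2024-06-01^_^10.0.0.2^_^bob".split("^_^")

    print(event["user"], user)  # alice, admin bob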