By Lingsheng
As mentioned in the previous article, Comprehensive Upgrade of SLS Data Transformation: Integrate SPL Syntax, SLS data transformation now integrates the SPL syntax. Compared with the old data transformation domain-specific language (DSL), SPL is considerably more concise in unstructured data scenarios: it preserves intermediate value types, lets you reference fields directly, and is seamlessly compatible with SQL functions.
Here we continue comparing the new data transformation SPL with the old data transformation DSL across common data processing requirements. For pure data synchronization, where no transformation is needed, both the new SPL and the old DSL can simply specify empty logic, so that case is not covered further here.
In routine O&M, error log analysis is a crucial step in identifying and locating issues. Here, we use service logs as an example to show how data transformation can be applied to clean the data.
In the old data transformation DSL, filtering was done with the e_keep/e_drop functions, whereas the new data transformation SPL uses the where instruction.
The first requirement is to retain only the error logs, that is, the logs whose level field equals the string ERROR.
New SPL: Directly reference the level field.

```
| where level='ERROR'
```

Old DSL: The field must be referenced through a function, v("level"). Several equivalent implementations:

```
e_keep(v("level") == "ERROR")
e_drop(v("level") != "ERROR")
e_if(v("level") != "ERROR", e_drop())
e_keep(e_search("level==ERROR"))
```
Due to differences in coding standards across various service modules, the value of the level field might not be fixed and could be ERROR, ERR, or E, among others. In this scenario, a string fuzzy match is necessary.
New SPL: Reuse the SQL like expression.

```
| where level like '%E%'
```

Old DSL: Different implementations:

```
e_keep(op_in(v("level"), "E"))
e_keep(e_search("level: E"))
e_if(op_not_in(v("level"), "E"), e_drop())
```
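Putting the exact and fuzzy cases together, a single where filter can cover a known level vocabulary. A minimal sketch (it assumes the vocabulary is limited to ERROR, ERR, and E, and uses the standard SQL in operator, which SPL's where instruction accepts as an SQL expression):

```
| where level in ('ERROR', 'ERR', 'E')
```

Note that a fuzzy pattern such as like '%E%' also matches unrelated values containing an E (for example DEBUG), so an explicit value list is the safer choice when the vocabulary is known.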
Besides filtering on text values, we also need to compare value ranges. For instance, in access logs we want to single out user errors to analyze which operations may be poorly designed; specifically, we extract the entries whose status field falls within the 4xx range.
New SPL: Maintain the intermediate type of the status field, with no repeated conversions.

```
| extend cast(status as bigint) as status
| where status>=400 and status<500
```

Old DSL:

```
e_keep(ct_int(v("status"))>=400 and ct_int(v("status"))<500)
```
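Because the cast result keeps its BIGINT type across instructions, later steps can keep working with the numeric status field directly. A minimal sketch that builds on the rule above (the status_class field is hypothetical, added only for illustration):

```
| extend cast(status as bigint) as status
| where status>=400 and status<500
| extend status_class=CASE WHEN status=404 THEN 'not_found' ELSE 'client_error' END
```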
In another O&M scenario, the service writes an error field when it runs into an error; otherwise, the field does not exist. We need to retain only the log entries that contain the error field.
New SPL: Use the SQL null expression to test whether a field exists.

```
| where error is not null
```

Old DSL: Use the e_has function to test whether a field exists.

```
e_keep(e_has("error"))
```
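A field that is present but empty is not null. If empty error values should also be dropped, the null test can be combined with an emptiness check; a minimal sketch (the empty-string condition is an assumption about how the service writes the field):

```
| where error is not null and error <> ''
```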
SPL uses the extend instruction to assign values to fields, which is equivalent to the e_set function in the old data transformation DSL.
New SPL:
1. Set a fixed field; the value keeps its type and can be used directly.

```
| extend kb=1024
| extend size=size/1024
```

2. Extract an information item with a regular expression.

```
| extend version=regexp_extract(data, '"version":\d+')
```

3. Extract an information item from JSON and assign it to a field; for the JSON object path syntax in SPL, see JsonPath.

```
| extend version=json_extract(data, '$.version')
```

Old DSL:
1. Set a fixed field; the type becomes str and must be converted again on every use.

```
e_set("kb", 1024)
e_set("size", ct_int(v("size")) / ct_int(v("kb")))
```

2. Extract an information item with a regular expression and assign it to a field.

```
e_set("version", regex_select(v("data"), r'"version":\d+'))
```

3. Extract an information item from JSON and assign it to a field; for the JSON query syntax, see JMESPath syntax.

```
e_set("version", json_select(v("data"), "version"))
```
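As a combined illustration, the following sketch derives a kilobyte size and pulls a version number out of a JSON payload in one pass (the field names size and data follow the examples above; the leading cast is an assumption that size arrives as a string):

```
| extend cast(size as bigint) as size
| extend size_kb=size/1024
| extend version=json_extract(data, '$.version')
```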
SPL processes specified fields in place. You therefore do not need to declare a complete data schema (a list of fields and their types); you can operate directly on the fields of interest without affecting the other, irrelevant fields.
New SPL:
1. Select fields precisely.

```
| project node="__tag__:node", path
```

2. Select fields by pattern, with the Wildcard switch turned on.

```
| project -wildcard "__tag__:*"
```

3. Rename some fields in place.

```
| project-rename node="__tag__:node"
```

4. Exclude fields by pattern, with the Wildcard switch turned on.

```
| project-away -wildcard "__tag__:*"
```

Old DSL:
1. Select fields precisely.

```
e_keep_fields("__tag__:node", "path", regex=False)
```

2. Select fields by pattern.

```
e_keep_fields("__tag__:.*", regex=True)
```

3. Rename some fields in place.

```
e_rename("__tag__:node", "node")
```

4. Exclude fields by pattern.

```
e_drop_fields("__tag__:.*", regex=True)
```
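A common cleanup combines these instructions: rename the one tag worth keeping first, then drop the remaining tags by wildcard. A minimal sketch built only from the instructions above:

```
| project-rename node="__tag__:node"
| project-away -wildcard "__tag__:*"
```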
Conditional expressions are essential for handling mixed data. SPL performs conditional evaluation through SQL expressions.
New SPL:
1. SQL expression IF: return the value of different expressions based on a condition.

```
| extend valid=IF(type is not null, 'true', 'false')
```

2. SQL expression COALESCE: take the first non-null expression, or fall back to a default value.

```
| extend size=COALESCE(input, output, 0)
```

3. SQL expression CASE WHEN: categorize or select data based on conditions.

```
| extend size=CASE WHEN dir='I' THEN input WHEN dir='O' THEN output ELSE 0 END
```

Old DSL:
1. op_if: return the value of different expressions based on a condition.

```
e_set("valid", op_if(e_has("type"), "true", "false"))
```

2. op_coalesce: take the first non-null expression, or fall back to a default value.

```
e_set("size", op_coalesce(v("input"), v("output"), "0"))
```

3. e_if_else: control the flow and construct fields.

```
e_if_else(
    e_has("type"),
    e_set("valid", "true"),
    e_set("valid", "false"),
)
```
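Combining these expressions, the following sketch derives a direction-dependent size and a validity flag in one pass (the field names dir, input, output, and type are taken from the examples above):

```
| extend size=CASE WHEN dir='I' THEN input WHEN dir='O' THEN output ELSE 0 END
| extend valid=IF(type is not null, 'true', 'false')
```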
During SPL execution, the data type of an SLS time field is always INTEGER or BIGINT. The SLS time fields are the data timestamp field __time__ and the nanosecond part field __time_ns_part__. To update the data time, use the extend instruction.
New SPL:
1. Extract the log time field __time__.

```
| extend time=date_parse(time, '%Y/%m/%d %H-%i-%S')
| extend __time__=cast(to_unixtime(time) as bigint)
```

2. Normalize the time format.

```
| extend time=date_parse(time, '%Y/%m/%d %H-%i-%S')
| extend time=date_format(time, '%Y-%m-%d %H:%i:%S')
```

Old DSL:
1. Extract the log time field __time__.

```
e_set(
    "__time__",
    dt_parsetimestamp(v("time"), fmt="%Y/%m/%d %H-%M-%S"),
)
```

2. Normalize the time format.

```
e_set(
    "time",
    dt_strftime(
        dt_parse(v("time"), fmt="%Y/%m/%d %H-%M-%S"),
        fmt="%Y-%m-%d %H:%M:%S",
    ),
)
```
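Because date_parse returns a TIMESTAMP that later instructions can keep using, the two steps above can share a single parse. A minimal sketch (the helper field ts is hypothetical, introduced only to avoid parsing twice; it can be excluded afterward with the project-away instruction shown earlier):

```
| extend ts=date_parse(time, '%Y/%m/%d %H-%i-%S')
| extend __time__=cast(to_unixtime(ts) as bigint)
| extend time=date_format(ts, '%Y-%m-%d %H:%i:%S')
```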
In machine data processing scenarios, extracting key information from unstructured or semi-structured data can be tedious: the data lacks a fixed pattern, so many processing details have to be considered, yet high efficiency is still required. SPL provides dedicated instructions for the common extraction formats, such as regular expressions, JSON, and CSV.
New SPL: Use the parse-regexp instruction.

```
| parse-regexp data, '(\S+)\s+(\w+)' as time, level
```

Old DSL: Use the e_regex function.

```
e_regex("data", r"(\S+)\s+(\w+)", ["time", "level"])
```
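Extraction composes naturally with the filters from the earlier sections. A minimal sketch that parses the raw data field and then keeps only the error entries (the pattern and field names follow the example above):

```
| parse-regexp data, '(\S+)\s+(\w+)' as time, level
| where level='ERROR'
```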
New SPL: Use the parse-json instruction; for the JSON object path syntax, see JsonPath.

```
| parse-json -path='$.x.y.z' data
```

Old DSL: Use the e_json function; for the JSON query syntax, see JMESPath syntax.

```
e_json("data", depth=1, jmes="x.y.z")
```
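Once the nested object is expanded into top-level fields, the raw payload is often no longer needed. A minimal sketch of that cleanup (here project-away is assumed to accept an exact field name, in addition to the wildcard form shown earlier):

```
| parse-json -path='$.x.y.z' data
| project-away data
```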
New SPL:
1. Single-character delimiter (RFC 4180 CSV); both the delimiter and the quote can be specified.

```
| parse-csv -delim='0' -quote='"' data as time, addr, user
```

2. Multi-character delimiter.

```
| parse-csv -delim='^_^' data as time, addr, user
```

Old DSL: Use the e_csv function.

```
e_csv("data", ["time", "addr", "user"], sep="0", quote='"')
```
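Finally, extraction and filtering chain into one end-to-end rule. A minimal sketch (a plain comma delimiter is assumed here instead of the custom delimiters above, and the addr check is purely illustrative):

```
| parse-csv -delim=',' -quote='"' data as time, addr, user
| where addr is not null
```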