Data transformation is a service provided by Alibaba Cloud Log Service to extract, transform, and load (ETL) log data. It supports filtering, distribution, enrichment, and other transformation operations.
The data transformation service is integrated into Log Service.
The following figure shows the common scenarios supported by the data transformation service.
1. Data standardization (one-to-one)
2. Data distribution (one-to-many)
In the following section, we will use the parsing of Nginx logs as an example to help you quickly get started with data transformation for Alibaba Cloud Log Service.
Assume that we have collected the default Nginx log in simple mode. The default Nginx log is in the following format:
log_format main '$remote_addr - $remote_user [$time_local] "$request" '
'$status $body_bytes_sent "$http_referer" '
'"$http_user_agent" "$http_x_forwarded_for"';
The following figure shows the log on a server.
The following figure shows the log collected by Alibaba Cloud Log Service in simple mode.
Log on to the console and enable Data Transformation. Enter domain specific language (DSL) statements in the text box and click Preview Data to preview the data transformation result.
Extract fields from Nginx logs by using a regular expression. The named capture groups in the regex determine the output field names.
e_regex("Source field name", "Regex or named capture regex", "Target field name or array (optional)", mode="fill-auto")
We recommend the regex testing tool available at https://regex101.com/ for building and debugging patterns.
DSL statement used:
e_regex("content",'(?<remote_addr>[0-9:\.]*) - (?<remote_user>[a-zA-Z0-9\-_]*) \[(?<local_time>[a-zA-Z0-9\/ :\-\+]*)\] "(?<request>[^"]*)" (?<status>[0-9]*) (?<body_bytes_sent>[0-9\-]*) "(?<refer>[^"]*)" "(?<http_user_agent>[^"]*)"')
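The DSL itself only runs inside Log Service, but the extraction step can be sketched in plain Python, since Python's `re` module supports the same named-capture idea (the sample log line below is made up for illustration):

```python
import re

# Same pattern as the DSL statement, written with Python's (?P<name>...) syntax.
pattern = (
    r'(?P<remote_addr>[0-9:\.]*) - (?P<remote_user>[a-zA-Z0-9\-_]*) '
    r'\[(?P<local_time>[a-zA-Z0-9\/ :\-\+]*)\] "(?P<request>[^"]*)" '
    r'(?P<status>[0-9]*) (?P<body_bytes_sent>[0-9\-]*) '
    r'"(?P<refer>[^"]*)" "(?P<http_user_agent>[^"]*)"'
)

# A fabricated Nginx access-log line in the default format.
content = ('192.168.1.10 - - [10/Oct/2021:13:55:36 +0800] '
           '"GET /index.html?a=1 HTTP/1.1" 200 612 "-" "Mozilla/5.0 (Macintosh)"')

# Each named capture group becomes a field, mirroring e_regex on "content".
fields = re.match(pattern, content).groupdict()
print(fields["remote_addr"], fields["status"])
```

Each group name (`remote_addr`, `status`, and so on) becomes a field name in the transformed log, which is exactly how the `e_regex` call behaves in the console preview.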
The default local time format is hard to read, so we can convert it into a more readable format.
DSL statement used:
e_set("Field name", "Fixed value or expression function", ..., mode="overwrite")
dt_strftime(Date and time expression, "Format string")
dt_strptime('Value such as v("Field name")', "Format string")
DSL statement used:
e_set("local_time", dt_strftime(dt_strptime(v("local_time"),"%d/%b/%Y:%H:%M:%S %z"),"%Y-%m-%d %H:%M:%S"))
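The same parse-then-reformat step can be reproduced with Python's standard `datetime` module, which uses the same format codes as the DSL functions (the timestamp below is a made-up example):

```python
from datetime import datetime

# Mimic dt_strptime + dt_strftime: parse the raw Nginx timestamp,
# then re-emit it in a readable form.
raw = "10/Oct/2021:13:55:36 +0800"
parsed = datetime.strptime(raw, "%d/%b/%Y:%H:%M:%S %z")
readable = parsed.strftime("%Y-%m-%d %H:%M:%S")
print(readable)  # 2021-10-10 13:55:36
```

Note that `%z` consumes the `+0800` timezone offset during parsing, which is why the regex in the previous step must capture it.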
Next, we want to extract the request field. We can see that the request field consists of HTTP_METHOD, URI, and HTTP version.
We can use the following function for implementation:
e_regex("Source field name", "Regex or named capture regex", "Target field name or array (optional)", mode="fill-auto")
# Decode the URI
url_decoding('Value such as v("Field name")')
# Set the field value
e_set("Field name", "Fixed value or expression function", ..., mode="overwrite")
e_kv extracts key-value pairs from the request URI.
Statement
e_regex("request", "(?<request_method>[^\s]*) (?<request_uri>[^\s]*) (?<http_version>[^\s]*)")
e_set("request_uri", url_decoding(v("request_uri")))
e_kv("request_uri")
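These three DSL calls can be approximated locally with the `re` and `urllib.parse` standard-library modules (the request string below is fabricated; `%32` URL-decodes to the character `2`):

```python
import re
from urllib.parse import unquote, parse_qsl

request = "GET /url2?k1=v1&k2=v%32 HTTP/1.1"

# Split the request into method, URI, and HTTP version, like the e_regex call.
m = re.match(r"(?P<request_method>\S*) (?P<request_uri>\S*) (?P<http_version>\S*)",
             request)
fields = m.groupdict()

# URL-decode the URI (like url_decoding), then pull out the
# query-string key-value pairs (like e_kv).
fields["request_uri"] = unquote(fields["request_uri"])
query = fields["request_uri"].split("?", 1)[1]
fields.update(parse_qsl(query))
print(fields)
```

After these steps the log carries `request_method`, `request_uri`, `http_version`, and one field per query-string key, matching what the console preview shows.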
Result
If we want to map an HTTP code to a specific code description, such as map "404" to "not found", we can use the e_dict_map function.
e_dict_map("Dictionary such as {'k1':'v1', 'k2':'v2'}", "Source field regex or list", "Target field name")
If no key matches, the value of the wildcard key (*) is used.
e_dict_map({'200':'OK',
            '304':'304 Not Modified',
            '400':'Bad Request',
            '401':'Unauthorized',
            '403':'Forbidden',
            '404':'Not Found',
            '500':'Internal Server Error',
            '*':'unknown'}, "status", "status_desc")
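The mapping behaves like a plain dictionary lookup with a fallback, which can be sketched in Python as:

```python
# Local sketch of e_dict_map: look up the status code, falling back
# to the wildcard '*' entry when no key matches.
status_map = {
    "200": "OK",
    "304": "304 Not Modified",
    "400": "Bad Request",
    "401": "Unauthorized",
    "403": "Forbidden",
    "404": "Not Found",
    "500": "Internal Server Error",
}

def map_status(status):
    # dict.get's default plays the role of the '*' key.
    return status_map.get(status, "unknown")

print(map_status("404"))  # Not Found
print(map_status("302"))  # unknown
```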
Result:
If we want to know the operating system of a client, we can match the user agent in the log content against regular expressions. The following DSL functions are used:
e_switch("Condition 1 e_match(...)", "Operation 1 such as e_regex(...)", "Condition 2", "Operation 2", ..., default="Optional operation upon no match")
regex_match('Value such as v("Field name")', r"Regex", full=False)
e_set("Field name", "Fixed value or expression function", ..., mode="overwrite")
DSL statement used:
e_switch(regex_match(v("content"), "Mac"), e_set("os", "osx"),
regex_match(v("content"), "Linux"), e_set("os", "linux"),
regex_match(v("content"), "Windows"), e_set("os", "windows"),
default=e_set("os", "unknown")
)
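The `e_switch` call tries each condition in order and falls back to the default, which is equivalent to this Python sketch (the sample user-agent strings are made up):

```python
import re

# Local sketch of the e_switch logic: test the log content against each
# pattern in order and return the first matching OS label.
def detect_os(content):
    for pattern, os_name in (("Mac", "osx"),
                             ("Linux", "linux"),
                             ("Windows", "windows")):
        if re.search(pattern, content):
            return os_name
    return "unknown"  # the default= branch

print(detect_os("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15)"))  # osx
print(detect_os("curl/7.68.0"))  # unknown
```

As in the DSL, order matters: the first matching condition wins and later patterns are not evaluated.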
Result:
We can use the e_output function to ship logs and use the regex_match function to match fields.
regex_match('Value such as v("Field name")', r"Regex", full=False)
e_output(name=None, project=None, logstore=None, topic=None, source=None, tags=None)
e_if("Condition 1 such as e_match(...)", "Operation 1 such as e_regex(...)", "Condition 2", "Operation 2", ....)
DSL statement used:
e_if(regex_match(v("status"),"^4.*"),
     e_output(name="logstore_4xx",
              project="dashboard-demo",
              logstore="dsl-nginx-out-4xx"))
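Conceptually, this is a conditional routing rule: logs whose status matches `^4.*` go to a separate destination. A minimal Python sketch of the decision (the project and Logstore names simply mirror the DSL example above):

```python
import re

# Sketch of the e_if + e_output routing: pick a destination for 4xx logs,
# leaving everything else on the default output path.
def route(log):
    if re.match(r"^4.*", log["status"]):
        return {"project": "dashboard-demo", "logstore": "dsl-nginx-out-4xx"}
    return None  # no match: the log follows the default destination

print(route({"status": "404"}))
print(route({"status": "200"}))
```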
We can see the result in the preview. When we save the transformation result, we need to set the AccessKey information of the corresponding project and Logstore.
Complete DSL code
# Extract general fields
e_regex("content",'(?<remote_addr>[0-9:\.]*) - (?<remote_user>[a-zA-Z0-9\-_]*) \[(?<local_time>[a-zA-Z0-9\/ :\-\+]*)\] "(?<request>[^"]*)" (?<status>[0-9]*) (?<body_bytes_sent>[0-9\-]*) "(?<refer>[^"]*)" "(?<http_user_agent>[^"]*)"')
# Set the local time
e_set("local_time", dt_strftime(dt_strptime(v("local_time"),"%d/%b/%Y:%H:%M:%S %z"),"%Y-%m-%d %H:%M:%S"))
# Extract the URI field
e_regex("request", "(?<request_method>[^\s]*) (?<request_uri>[^\s]*) (?<http_version>[^\s]*)")
e_set("request_uri", url_decoding(v("request_uri")))
e_kv("request_uri")
# Map the HTTP code
e_dict_map({'200':'OK',
'304':'304 Not Modified',
'400':'Bad Request',
'401':'Unauthorized',
'403':'Forbidden',
'404':'Not Found',
'500':'Internal Server Error',
'*':'unknown'}, "status", "status_desc")
# Identify the User Agent field
e_switch(regex_match(v("content"), "Mac"), e_set("os", "osx"),
regex_match(v("content"), "Linux"), e_set("os", "linux"),
regex_match(v("content"), "Windows"), e_set("os", "windows"),
default=e_set("os", "unknown")
)
# Ship the log to a specified Logstore
e_if(regex_match(v("status"),"^4.*"),
     e_output(name="logstore_4xx", project="dashboard-demo", logstore="dsl-nginx-out-4xx"))
After the code is complete, submit it on the page and save the transformation configuration.
Configure the destination Logstore. If the e_output function is used, we need to specify the destination storage name, project, and Logstore, which must be the same as those in the code.
After we save the transformation result, the data is published. We can find the task under Data Transformation > Data Transformation. After we click the task name, we can find information such as the transformation delay.
If we need to modify the task, we can also click the task name and modify it on the page that appears.
Alibaba Clouder - August 2, 2018