By Weilong Pan (Huolang)
Simple Log Service (SLS) is a cloud-native observability and analysis platform that provides large-scale, low-cost, real-time services for log, metric, and trace data. With its convenient data access, system logs and business logs can be collected into SLS for storage and analysis. Alibaba Cloud Realtime Compute for Apache Flink is a big data analysis platform built by Alibaba Cloud on top of Apache Flink and is widely used for real-time data analysis and risk detection. Realtime Compute for Apache Flink supports the SLS Connector, which lets users configure SLS as a source table or a result table on the platform.
The SLS Connector for Alibaba Cloud Realtime Compute for Apache Flink keeps the handling of log data simple: through configuration, it maps SLS log fields directly to the columns of a Flink SQL table. However, a significant portion of business logs is not fully structured. For example, all log content may be written into a single field, and structured fields then have to be extracted with regular expressions, delimiter splitting, and similar methods. This article describes how to configure the SLS Connector with SLS SPL (SLS Processing Language) to standardize such data, covering log cleansing and format standardization.
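To illustrate the direct field mapping mentioned above, the following is a minimal sketch of an SLS source table for an already structured Logstore. It assumes the Logstore contains ready-made fields named TaskID, TaskType, and ts (taken from the sample log later in this article); the endpoint, credentials, project, and Logstore name are placeholders.
CREATE TEMPORARY TABLE sls_input_simple (
  TaskID STRING,   -- mapped 1:1 from the SLS log field TaskID
  TaskType STRING, -- mapped 1:1 from the SLS log field TaskType
  ts STRING        -- mapped 1:1 from the SLS log field ts
) WITH (
  'connector' = 'sls',
  'endpoint' = 'cn-beijing-intranet.log.aliyuncs.com',
  'accessId' = '${ak}',
  'accessKey' = '${sk}',
  'project' = '${project}',
  'logstore' = '${logstore}'
);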
Log data often comes from multiple sources in various formats and usually lacks a fixed schema. Therefore, before processing and analyzing the data, we need to cleanse and standardize it to ensure a consistent format. The data can be in different formats such as JSON strings, CSV, or irregular Java stack logs.
Flink SQL is a real-time computing model that supports SQL syntax and enables analysis of structured data. However, it requires a fixed schema for the source data, including fixed field names, types, and quantities. This requirement serves as the foundation of the SQL computing model.
There exists a gap between the weakly structured nature of log data and the structured analysis capability of Flink SQL. Consequently, a middle layer is necessary to bridge the gap and perform data cleansing and standardization. Several options are available for this middle layer. In the following section, we will briefly compare different solutions and propose a new approach based on SLS SPL to address the lightweight data cleansing and standardization needs.
The following is an example of a log. The format is complex: the Payload field is a serialized JSON string, the error field mixes plain text with an embedded JSON string, and fields such as __tag__:__path__ and caller carry values that still need to be split apart.
{
"Payload": "{\"lastNotified\": 1705030483, \"serverUri\": \"http://test.alert.com/alert-api/tasks\", \"jobID\": \"44d6ce47bb4995ef0c8052a9a30ed6d8\", \"alertName\": \"alert-12345678-123456\", \"project\": \"test-sls-project\", \"projectId\": 123, \"aliuid\": \"1234567890\", \"alertDisplayName\": \"\\u6d4b\\u8bd5\\u963f\\u91cc\\u4e91\\u544a\\u8b66\", \"checkJobUri\": \"http://test.alert.com/alert-api/task_check\", \"schedule\": {\"timeZone\": \"\", \"delay\": 0, \"runImmediately\": false, \"type\": \"FixedRate\", \"interval\": \"1m\"}, \"jobRunID\": \"bf86aa5e67a6891d-61016da98c79b-5071a6b\", \"firedNotNotified\": 25161}",
"TaskID": "bf86aa5e67a6891d-61016da98c79b-5071a6b-334f81a-5c38aaa1-9354-43ec-8369-4f41a7c23887",
"TaskType": "ALERT",
"__source__": "11.199.97.112",
"__tag__:__hostname__": "iabcde12345.cloud.abc121",
"__tag__:__path__": "/var/log/service_a.LOG",
"caller": "executor/pool.go:64",
"error": "CouldNotExecuteQuery : {\n \"httpCode\": 404,\n \"errorCode\": \"LogStoreNotExist\",\n \"errorMessage\": \"logstore k8s-event does not exist\",\n \"requestID\": \"65B7C10AB43D9895A8C3DB6A\"\n}",
"requestURL": "/apis/autoscaling/v2beta1/namespaces/python-etl/horizontalpodautoscalers/cn-shenzhen-56492-1234567890123?timeout=30s",
"ts": "2024-01-29 22:57:13"
}
Data cleansing is required to extract more valuable information from such logs. First, we need to extract important fields and then perform data analysis on these fields. This article focuses on the extraction of important fields. The analysis can still be performed in Flink.
Assume that the specific requirements for extracting fields are as follows:
Extract httpCode, errorCode, errorMessage, and requestID from the JSON string embedded in the error field.
Extract fileName and fileNo from the caller field (for example, pool.go and 64).
Take the host name in __tag__:__hostname__ as serviceHost.
Extract scheduleType from the schedule object inside the Payload JSON string, and keep the project field parsed from Payload.
The final required fields are httpCode, errorCode, errorMessage, requestID, fileName, fileNo, serviceHost, scheduleType, and project. With such a table model, we can easily use Flink SQL for data analysis.
There are many ways to implement such data cleansing. The following solutions are based on SLS and Flink; none of them is absolutely the best, and each suits different scenarios.
Data processing solution: Create a destination Logstore in the SLS console and create a data processing job to clean data.
Flink solution: Specify error and Payload as source table fields, use SQL regular expression functions and JSON functions to parse the fields, write the parsed fields into a temporary table, and then analyze that table (a minimal sketch of this approach follows this list).
SPL solution: Configure SPL statements in the Flink SLS Connector to cleanse the data. In Flink, source table fields are defined as the cleansed data structure.
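For comparison, the following is a minimal sketch of the Flink solution described above, not the approach this article recommends. It assumes the built-in Flink SQL functions REGEXP_EXTRACT and JSON_VALUE are available in your engine version and shows only two of the target fields; the connector options are placeholders.
-- Ingest the raw fields as-is; all parsing happens inside Flink.
CREATE TEMPORARY TABLE sls_input_raw (
  Payload STRING,
  `error` STRING
) WITH (
  'connector' = 'sls',
  'endpoint' = 'cn-beijing-intranet.log.aliyuncs.com',
  'accessId' = '${ak}',
  'accessKey' = '${sk}',
  'project' = '${project}',
  'logstore' = '${logstore}'
);

-- An intermediate view holds the parsed fields before analysis.
CREATE TEMPORARY VIEW sls_parsed AS
SELECT
  JSON_VALUE(REGEXP_EXTRACT(`error`, 'CouldNotExecuteQuery : ({[\w":\s,\-}]+)', 1), '$.errorCode') AS errorCode,
  JSON_VALUE(Payload, '$.project') AS project
FROM sls_input_raw;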
We can conclude from the principles of the preceding three solutions that, in scenarios where data cleansing is required, configuring SPL in the SLS Connector is the lightweight option that is easy to maintain and easy to scale.
In scenarios where log data is weakly structured, the SPL solution avoids the temporary intermediate Logstore of the first solution and the temporary table in Flink of the second solution. It cleanses data closer to the data source and leaves the computing platform focused on business logic, so the separation of duties is clearer.
Next, a weakly structured log is used as an example to describe how to use Flink based on SLS SPL. For better demonstration, we configure the source table of SLS in the Flink console and then start a continuous query to observe the effect. In actual use, you only need to modify the configurations of the SLS source table to complete data cleansing and field standardization.
You can enable scan mode in the Logstore. The SLS SPL pipeline syntax uses the | delimiter to separate commands. Each time you add a command, you can view its results in real time, and you can then add more pipeline stages to reach the final result in a progressive, exploratory way.
The complete SPL statement used in scan mode is as follows:
* | project Payload, error, "__tag__:__path__", "__tag__:__hostname__", caller
| parse-json Payload
| project-away Payload
| parse-regexp error, 'CouldNotExecuteQuery : ({[\w":\s,\-}]+)' as errorJson
| parse-json errorJson
| parse-regexp "__tag__:__path__", '\/var\/log\/([\w\_]+).LOG' as serviceName
| parse-regexp caller, '\w+/([\w\.]+):(\d+)' as fileName, fileNo
| project-rename serviceHost="__tag__:__hostname__"
| extend scheduleType = json_extract_scalar(schedule, '$.type')
| project httpCode, errorCode, errorMessage, requestID, fileName, fileNo, serviceHost, scheduleType, project
Create a blank SQL stream job draft in the Realtime Compute for Apache Flink console, and then click Next to start writing the job.
Enter the following statements to create a temporary table in the job draft:
CREATE TEMPORARY TABLE sls_input_complex (
  errorCode STRING,
  errorMessage STRING,
  fileName STRING,
  fileNo STRING,
  httpCode STRING,
  requestID STRING,
  scheduleType STRING,
  serviceHost STRING,
  project STRING,
  proctime AS PROCTIME()
) WITH (
  'connector' = 'sls',
  'endpoint' = 'cn-beijing-intranet.log.aliyuncs.com',
  'accessId' = '${ak}',
  'accessKey' = '${sk}',
  'starttime' = '2024-02-01 10:30:00',
  'project' = '${project}',
  'logstore' = '${logstore}',
  'query' = '* | project Payload, error, "__tag__:__path__", "__tag__:__hostname__", caller | parse-json Payload | project-away Payload | parse-regexp error, ''CouldNotExecuteQuery : ({[\w":\s,\-}]+)'' as errorJson | parse-json errorJson | parse-regexp "__tag__:__path__", ''\/var\/log\/([\w\_]+).LOG'' as serviceName | parse-regexp caller, ''\w+/([\w\.]+):(\d+)'' as fileName, fileNo | project-rename serviceHost="__tag__:__hostname__" | extend scheduleType = json_extract_scalar(schedule, ''$.type'') | project httpCode, errorCode, errorMessage, requestID, fileName, fileNo, serviceHost, scheduleType, project'
);
Enter the analysis statement in the job and view the result data:
SELECT * FROM sls_input_complex
Click Debug in the upper-right corner to start debugging. In the result table, you can view the value of each column, which corresponds to the output of the SPL processing.
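Because the fields that arrive in Flink are already standardized, downstream analysis remains ordinary Flink SQL. As a hypothetical example, the following query counts errors per service and error code over one-minute processing-time windows, using the proctime column defined in the table above.
-- Tumbling-window aggregation on the cleansed source table.
SELECT
  serviceHost,
  errorCode,
  TUMBLE_START(proctime, INTERVAL '1' MINUTE) AS window_start,
  COUNT(*) AS error_cnt
FROM sls_input_complex
GROUP BY
  serviceHost,
  errorCode,
  TUMBLE(proctime, INTERVAL '1' MINUTE);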
To meet the requirements of weakly structured log data, the Flink SLS Connector has been upgraded: SPL can now be configured directly on the connector, so that cleansing is pushed down to the SLS data source. Compared with the original data processing solution and the original Flink SLS Connector solution, this approach is lighter weight, particularly for extracting fields with regular expressions, from JSON, or from CSV, and it simplifies the data cleansing task.
Furthermore, performing data cleansing at the data source end reduces network transmission traffic, ensuring that the data arriving in Flink is already standardized. This enables you to focus more on business data analysis in Flink.
In addition, the SLS scan-based query now supports SPL queries, making SPL verification and testing easier. You can view the real-time execution results of the SPL pipeline syntax.