Because business systems are complex, data developers must handle many types of data from many sources. They need to ingest this data into the data platform and extract the key business fields according to the designed data model, forming a two-dimensional table for subsequent statistical analysis and contextual computing in the big data platform or data warehouse.
This article uses a specific case to show how to convert JSON-formatted log data with MaxCompute.
The solution presented in this article uses the Alibaba Cloud Log Service + MaxCompute product portfolio to meet these business requirements. Log Service only performs log collection and delivery; it does not parse or convert the data.
Log data is collected by Log Service into a Logstore (for this part, see the Log Service Help Documentation).
Using the delivery function of Log Service, the logs are regularly delivered and archived to a raw log table in MaxCompute, where all the information of each log is written into the "content" field of that table.
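For illustration, a single log record written into the "content" field might look like the following (the field names match those used in the extraction code below; all values are hypothetical):

{
    "DeviceID": "dev-001",
    "UniqueIdentifier": "uid-0001",
    "GameID": "game-01",
    "Device": "android",
    "Sessions": [
        {
            "SessionID": "s1",
            "Events": [
                {"Name": "Start", "Timestamp": "1532656000", "NetworkStatus": "4G"}
            ]
        },
        {
            "SessionID": "s2",
            "Events": [
                {"Name": "Click", "Timestamp": "1532656100", "NetworkStatus": "WiFi"}
            ]
        }
    ]
}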
Use MaxCompute to parse the raw data and extract the fields.
Use the built-in function get_json_object to extract data:
select
    get_json_object(content, '$.DeviceID') as DeviceID,
    get_json_object(content, '$.UniqueIdentifier') as UniqueIdentifier,
    get_json_object(content, '$.GameID') as GameID,
    get_json_object(content, '$.Device') as Device,
    get_json_object(content, '$.Sessions[0].SessionID') as Session1_ID,
    get_json_object(content, '$.Sessions[0].Events[0].Name') as Session1_EventName,
    get_json_object(content, '$.Sessions[1].SessionID') as Session2_ID,
    get_json_object(content, '$.Sessions[1].Events[0].Name') as Session2_EventName
from log_target_json where pt='20180725' limit 10;
The extraction results are as follows:
Summary of the Solution:
The processing logic above extracts the business fields of a log into separate columns, which is suitable for JSON records whose structure is fixed and maps directly to table fields. In the example above, the information of session1 and session2 is extracted into separate sets of columns. However, if the number of sessions in each log record is dynamic rather than fixed, this logic can hardly meet the requirement: if the next log contains 3 sessions, the SQL must be extended with Session3_ID and Session3_EventName logic, and what if a log contains 100 sessions? This extraction method cannot reasonably handle such cases.
In this case, a custom UDTF (user-defined table function) can be used.
Given the characteristics of the data, one log contains multiple sessions, a 1:N relationship. When the log is converted into a two-dimensional data warehouse table, the session information must be parsed down to the smallest granularity, so one row must be converted into N rows to extract all session information. The business objective is as follows:
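Conceptually, a record like the sample JSON above should be flattened into one output row per event, with the device-level fields repeated on every row (values hypothetical):

DeviceID | SessionID | EventName | Timestamp  | NetworkStatus
dev-001  | s1        | Start     | 1532656000 | 4G
dev-001  | s2        | Click     | 1532656100 | WiFi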
In MaxCompute, a UDTF is needed to convert one row of records into multiple rows.
Taking a Java UDTF as an example, each JSON record in the "content" field is parsed to obtain and return the business fields to be extracted. The UDTF processing logic here goes down to the third JSON level, looping over the data at the smallest granularity and returning multiple records.
package com.aliyun.odps;

import com.aliyun.odps.udf.UDFException;
import com.aliyun.odps.udf.UDTF;
import com.aliyun.odps.udf.annotation.Resolve;
import com.google.gson.Gson;
import java.io.IOException;
import java.util.List;
import java.util.Map;

// One string input (the raw JSON in "content"), eight string outputs.
@Resolve("string->string,string,string,string,string,string,string,string")
public class get_json_udtf extends UDTF {
    @Override
    public void process(Object[] objects) throws UDFException, IOException {
        String input = (String) objects[0];
        // Parse the JSON record into a generic map.
        Map map = new Gson().fromJson(input, Map.class);
        // Device-level fields, shared by all output rows of this record.
        Object deviceID = map.get("DeviceID");
        Object uniqueIdentifier = map.get("UniqueIdentifier");
        Object gameID = map.get("GameID");
        Object device = map.get("Device");
        // Loop over the sessions, then over the events of each session,
        // and forward one output row per event (1 row in -> N rows out).
        List sessions = (List) map.get("Sessions");
        for (Object session : sessions) {
            Map sMap = (Map) session;
            Object sessionID = sMap.get("SessionID");
            List events = (List) sMap.get("Events");
            for (Object event : events) {
                String name = (String) ((Map) event).get("Name");
                String timestamp = (String) ((Map) event).get("Timestamp");
                String networkStatus = (String) ((Map) event).get("NetworkStatus");
                forward(deviceID, uniqueIdentifier, gameID, device,
                        sessionID, name, timestamp, networkStatus);
            }
        }
    }
}
Note: For details on writing, packaging, uploading, and creating UDF functions, see the official documentation.
After the program is written, you need to package the UDTF, upload it, and create the function.
Package the compiled program into a jar and upload it as a resource in the MaxCompute client (odpscmd):
add jar maxcompute_demo-1.0-SNAPSHOT.jar -f;
Then, create a function from the command line:
create function get_json_udtf as 'com.aliyun.odps.get_json_udtf' using 'maxcompute_demo-1.0-SNAPSHOT.jar';
View the function after creation:
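For example, the list functions command in odpscmd shows the functions registered in the current project:

list functions;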
Test and verification:
Query the table containing the raw logs, passing the "content" field to the newly created get_json_udtf.
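A sketch of such a query, assuming the same raw table and partition as in the built-in function example above (the output column aliases are illustrative):

select get_json_udtf(content) as (DeviceID, UniqueIdentifier, GameID, Device,
       SessionID, EventName, Timestamp, NetworkStatus)
from log_target_json
where pt='20180725';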
As the query results show, the UDTF processes each JSON record and generates multiple records as expected.
Once the processing logic is finalized, you can also use the "insert into" syntax to write the parsed results to a new table, and implement periodic data conversion through task scheduling.
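A minimal sketch, assuming a pre-created target table named log_parsed whose columns match the UDTF output and which is partitioned by pt (the table name and column aliases are hypothetical):

insert into table log_parsed partition (pt='20180725')
select get_json_udtf(content) as (DeviceID, UniqueIdentifier, GameID, Device,
       SessionID, EventName, Timestamp, NetworkStatus)
from log_target_json
where pt='20180725';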
Based on a log analysis scenario and taking a common JSON log processing requirement as an example, this article introduced how to collect logs into MaxCompute through Log Service, and how to use MaxCompute built-in functions and UDTFs to parse and convert JSON-formatted log data, extract the key business fields, and generate log tables ready for subsequent analysis.