This topic describes how to use the data transformation feature of Simple Log Service to transform complex JSON data.
Transform complex JSON data with multiple subkeys, each of which is an array
Logs written by programs often use a statistical JSON format that contains basic information plus multiple subkeys, each of which is an array. For example, a server writes a log every minute that records the overall service status together with the status of each server and client that generates logs.
Sample log
__source__: 192.0.2.1
__topic__:
content: {
    "service": "search_service",
    "overal_status": "yellow",
    "servers": [
        {"host": "192.0.2.1", "status": "green"},
        {"host": "192.0.2.2", "status": "green"}
    ],
    "clients": [
        {"host": "192.0.2.3", "status": "green"},
        {"host": "192.0.2.4", "status": "red"}
    ]
}
Data transformation requirements
Split the raw log by topic into overall_type, client_status, and server_status. Store different information in each topic as follows:
overall_type: stores the server count, client count, overall status (color), and service information.
client_status: stores the host IP address, status, and service information.
server_status: stores the host IP address, status, and service information.
Expected result
__source__: 192.0.2.1
__topic__: overall_type
client_count: 2
overal_status: yellow
server_count: 2
service: search_service

__source__: 192.0.2.1
__topic__: client_status
host: 192.0.2.4
status: red
service: search_service

__source__: 192.0.2.1
__topic__: client_status
host: 192.0.2.3
status: green
service: search_service

__source__: 192.0.2.1
__topic__: server_status
host: 192.0.2.1
status: green
service: search_service

__source__: 192.0.2.1
__topic__: server_status
host: 192.0.2.2
status: green
service: search_service
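Before writing any DSL rules, it can help to prototype the target fan-out outside Simple Log Service. The following is a minimal Python sketch (plain Python, not the DSL) that derives the same five events from the sample content field:

import json

# The content field of the sample log, decoded from JSON.
content = json.loads('''{
    "service": "search_service",
    "overal_status": "yellow",
    "servers": [{"host": "192.0.2.1", "status": "green"},
                {"host": "192.0.2.2", "status": "green"}],
    "clients": [{"host": "192.0.2.3", "status": "green"},
                {"host": "192.0.2.4", "status": "red"}]
}''')

# One overall_type event with the counts and the overall status.
events = [{"__topic__": "overall_type",
           "service": content["service"],
           "overal_status": content["overal_status"],
           "server_count": len(content["servers"]),
           "client_count": len(content["clients"])}]

# One server_status or client_status event per array element.
for topic, key in (("server_status", "servers"), ("client_status", "clients")):
    for element in content[key]:
        events.append({"__topic__": topic, "service": content["service"], **element})

for event in events:
    print(event)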
Solution
Split the raw log into three logs by setting the __topic__ field to a comma-separated list of topics and splitting on that field. After the splitting, the three logs are identical except for the __topic__ field.

e_set("__topic__", "server_status,client_status,overall_type")
e_split("__topic__")
The log after processing is as follows:
__source__: 192.0.2.1
__topic__: server_status    // The topics of the other two logs are client_status and overall_type. Except for the __topic__ field, the three logs are identical.
content: {... Same as that in the raw log ...}
Expand the first layer of the JSON data in the content field and delete the content field.

e_json('content', depth=1)
e_drop_fields("content")
The log after processing is as follows:
__source__: 192.0.2.1
__topic__: overall_type    // The topics of the other two logs are client_status and server_status. Except for the __topic__ field, the three logs are identical.
clients: [{"host": "192.0.2.3", "status": "green"}, {"host": "192.0.2.4", "status": "red"}]
overal_status: yellow
servers: [{"host": "192.0.2.1", "status": "green"}, {"host": "192.0.2.2", "status": "green"}]
service: search_service
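Note that e_json with depth=1 expands only the first layer: servers and clients stay serialized as JSON arrays instead of being expanded into subfields. Roughly, in Python terms (a sketch of the semantics, not the actual implementation):

import json

event = {"__topic__": "overall_type",
         "content": '{"service": "search_service", "overal_status": "yellow", '
                    '"servers": [{"host": "192.0.2.1", "status": "green"}], '
                    '"clients": []}'}

# depth=1: each first-level key becomes its own field; non-string values
# are kept as JSON strings rather than expanded further.
for key, value in json.loads(event.pop("content")).items():
    event[key] = value if isinstance(value, str) else json.dumps(value)
print(event)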
For the log with the topic overall_type, compute the values of client_count and server_count.

e_if(e_search("__topic__==overall_type"),
     e_compose(
         e_set("client_count", json_select(v("clients"), "length([*])", default=0)),
         e_set("server_count", json_select(v("servers"), "length([*])", default=0))
     ))
The log after processing is as follows:
__topic__: overall_type
server_count: 2
client_count: 2
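json_select evaluates a JMESPath expression against the JSON value of a field, and length([*]) counts the elements of an array. If you want to verify such an expression outside Simple Log Service, the standalone jmespath Python package (an assumption here; install it with pip install jmespath) evaluates the same syntax:

import jmespath

clients = [{"host": "192.0.2.3", "status": "green"},
           {"host": "192.0.2.4", "status": "red"}]

# [*] projects every element of the array; length() counts the projection.
print(jmespath.search("length([*])", clients))  # 2
print(jmespath.search("length([*])", []))       # 0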
Delete the clients and servers fields.
e_if(e_search("__topic__==overall_type"), e_drop_fields("clients", "servers"))
Further split the log with the topic server_status.

e_if(e_search("__topic__==server_status"),
     e_compose(
         e_split("servers"),
         e_json("servers", depth=1)
     ))
The log is split into the following two logs:
__topic__: server_status
servers: {"host": "192.0.2.1", "status": "green"}
host: 192.0.2.1
status: green

__topic__: server_status
servers: {"host": "192.0.2.2", "status": "green"}
host: 192.0.2.2
status: green
Delete the servers field.
e_if(e_search("__topic__==server_status"), e_drop_fields("servers"))
Further split the log with the topic client_status and delete the clients field.

e_if(e_search("__topic__==client_status"),
     e_compose(
         e_split("clients"),
         e_json("clients", depth=1),
         e_drop_fields("clients")
     ))
The log is split into the following two logs:
__topic__: client_status
host: 192.0.2.3
status: green

__topic__: client_status
host: 192.0.2.4
status: red
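Conceptually, the e_split("servers") and e_json("servers", depth=1) pair fans one event out into one event per array element and then lifts that element's keys to the top level. The following Python sketch mirrors that behavior (split_and_expand is a hypothetical helper, not part of the DSL):

import json

def split_and_expand(event, field):
    """Fan one event out per element of a JSON-array field (like e_split),
    then lift each element's first-level keys (like e_json(field, depth=1))."""
    children = []
    for element in json.loads(event[field]):
        child = dict(event)
        child[field] = json.dumps(element)  # e_split keeps the element in the field
        child.update(element)
        children.append(child)
    return children

event = {"__topic__": "server_status",
         "servers": '[{"host": "192.0.2.1", "status": "green"}, '
                    '{"host": "192.0.2.2", "status": "green"}]'}
for child in split_and_expand(event, "servers"):
    print(child)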
To sum up, use the following LOG domain-specific language (DSL) rules:

# Split the raw log by topic.
e_set("__topic__", "server_status,client_status,overall_type")
e_split("__topic__")
e_json('content', depth=1)
e_drop_fields("content")

# Process the log with the topic overall_type.
e_if(e_search("__topic__==overall_type"),
     e_compose(
         e_set("client_count", json_select(v("clients"), "length([*])", default=0)),
         e_set("server_count", json_select(v("servers"), "length([*])", default=0))
     ))
e_if(e_search("__topic__==overall_type"), e_drop_fields("clients", "servers"))

# Process the log with the topic server_status.
e_if(e_search("__topic__==server_status"),
     e_compose(
         e_split("servers"),
         e_json("servers", depth=1)
     ))
e_if(e_search("__topic__==server_status"), e_drop_fields("servers"))

# Process the log with the topic client_status.
e_if(e_search("__topic__==client_status"),
     e_compose(
         e_split("clients"),
         e_json("clients", depth=1),
         e_drop_fields("clients")
     ))
Solution optimization
The preceding solution does not work well if the content.clients or content.servers field is empty. Assume that the raw log is as follows:
__source__: 192.0.2.1
__topic__:
content:{
"service": "search_service",
"overal_status": "yellow",
"servers": [ ],
"clients": [ ]
}
If you split this raw log into three logs by using the preceding solution, the logs with the topics client_status and server_status are empty except for the service information:
__source__: 192.0.2.1
__topic__: overall_type
client_count: 0
overal_status: yellow
server_count: 0
service: search_service
__source__: 192.0.2.1
__topic__: client_status
service: search_service
__source__: 192.0.2.1
__topic__: server_status
service: search_service
Optimized solution 1
Check whether the logs with the topics server_status and client_status are empty after the raw log is split. If a log is empty, discard it; otherwise, retain it.

# Check the log with the topic server_status. If its servers field is empty, discard the log; otherwise, retain it.
e_if(e_search("__topic__==server_status"),
     e_keep(json_select(v("servers"), "length([*])")))
# Check the log with the topic client_status. If its clients field is empty, discard the log; otherwise, retain it.
e_if(e_search("__topic__==client_status"),
     e_keep(json_select(v("clients"), "length([*])")))
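The emptiness test works because length([*]) evaluates to 0 for an empty array, which the DSL treats as false. A small Python sketch of the same keep-or-discard decision (keep is a hypothetical helper; it uses the standalone jmespath package):

import json
import jmespath

def keep(event):
    """Return True if the event should be retained: a server_status or
    client_status event survives only if its array field is non-empty."""
    for topic, field in (("server_status", "servers"), ("client_status", "clients")):
        if event.get("__topic__") == topic:
            return bool(jmespath.search("length([*])", json.loads(event.get(field, "[]"))))
    return True  # events with other topics, such as overall_type, pass through

print(keep({"__topic__": "server_status", "servers": "[]"}))  # False: discarded
print(keep({"__topic__": "overall_type"}))                    # True: retained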
To sum up, use the following LOG DSL rules:
# Split the raw log by topic.
e_set("__topic__", "server_status,client_status,overall_type")
e_split("__topic__")
e_json('content', depth=1)
e_drop_fields("content")

# Process the log with the topic overall_type.
e_if(e_search("__topic__==overall_type"),
     e_compose(
         e_set("client_count", json_select(v("clients"), "length([*])", default=0)),
         e_set("server_count", json_select(v("servers"), "length([*])", default=0))
     ))
e_if(e_search("__topic__==overall_type"), e_drop_fields("clients", "servers"))

# (New) Check the log with the topic server_status. If its servers field is empty, discard the log; otherwise, retain it.
e_if(e_search("__topic__==server_status"),
     e_keep(json_select(v("servers"), "length([*])")))

# Process the log with the topic server_status.
e_if(e_search("__topic__==server_status"),
     e_compose(
         e_split("servers"),
         e_json("servers", depth=1)
     ))
e_if(e_search("__topic__==server_status"), e_drop_fields("servers"))

# (New) Check the log with the topic client_status. If its clients field is empty, discard the log; otherwise, retain it.
e_if(e_search("__topic__==client_status"),
     e_keep(json_select(v("clients"), "length([*])")))

# Process the log with the topic client_status.
e_if(e_search("__topic__==client_status"),
     e_compose(
         e_split("clients"),
         e_json("clients", depth=1),
         e_drop_fields("clients")
     ))
Optimized solution 2
Check whether a field is empty before splitting the raw log. If the field is not empty, split the raw log based on the field.
# Set the initial topic.
e_set("__topic__", "overall_type")
# If the content.servers field is not empty, split the log to obtain an extra log with the topic server_status.
e_if(json_select(v("content"), "length(servers[*])"),
     e_compose(
         e_set("__topic__", "server_status,overall_type"),
         e_split("__topic__")
     ))
# If the content.clients field is not empty, further split the log to obtain an extra log with the topic client_status.
e_if(op_and(e_search("__topic__==overall_type"), json_select(v("content"), "length(clients[*])")),
     e_compose(
         e_set("__topic__", "client_status,overall_type"),
         e_split("__topic__")
     ))
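The decision logic of this solution, reduced to plain Python (topics_for is a hypothetical helper that only illustrates the check-before-split idea):

def topics_for(content):
    """Decide which topics to split into before any splitting happens."""
    topics = ["overall_type"]
    if content.get("servers"):  # same test as length(servers[*]) being non-zero
        topics.append("server_status")
    if content.get("clients"):
        topics.append("client_status")
    return topics

print(topics_for({"servers": [{"host": "192.0.2.1"}], "clients": []}))
# ['overall_type', 'server_status'] -- no empty client_status event is ever created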
To sum up, use the following LOG DSL rules:
# Split the raw log.
e_set("__topic__", "overall_type")
# If the content.servers field is not empty, split the log to obtain an extra log with the topic server_status.
e_if(json_select(v("content"), "length(servers[*])"),
     e_compose(
         e_set("__topic__", "server_status,overall_type"),
         e_split("__topic__")
     ))
# If the content.clients field is not empty, further split the log to obtain an extra log with the topic client_status.
e_if(op_and(e_search("__topic__==overall_type"), json_select(v("content"), "length(clients[*])")),
     e_compose(
         e_set("__topic__", "client_status,overall_type"),
         e_split("__topic__")
     ))
# Expand the first layer of the content field and delete it.
e_json('content', depth=1)
e_drop_fields("content")

# Process the log with the topic overall_type.
e_if(e_search("__topic__==overall_type"),
     e_compose(
         e_set("client_count", json_select(v("clients"), "length([*])", default=0)),
         e_set("server_count", json_select(v("servers"), "length([*])", default=0))
     ))
e_if(e_search("__topic__==overall_type"), e_drop_fields("clients", "servers"))

# Process the log with the topic server_status.
e_if(e_search("__topic__==server_status"),
     e_compose(
         e_split("servers"),
         e_json("servers", depth=1)
     ))
e_if(e_search("__topic__==server_status"), e_drop_fields("servers"))

# Process the log with the topic client_status.
e_if(e_search("__topic__==client_status"),
     e_compose(
         e_split("clients"),
         e_json("clients", depth=1),
         e_drop_fields("clients")
     ))
Solution comparison
Optimized solution 1 contains redundant logic because it first produces empty logs from the raw log and then discards them. However, its rules are simple and easy to maintain. We recommend that you use this solution by default.
Optimized solution 2 is more efficient because it checks whether the fields are empty before splitting. However, its rules are more complex. We recommend that you use this solution only in specific scenarios, for example, when splitting the raw log may otherwise produce a large number of unnecessary events.
Transform complex JSON data with multiple layers of nested array objects
Take the following complex JSON data with multiple layers of nested arrays as an example. Assume that you want to split the logon information stored in the login_histories field of each object in the users field into separate logon events.
Raw log
__source__: 192.0.2.1
__topic__:
content: {
    "users": [
        {
            "name": "user1",
            "login_histories": [
                {"date": "2019-10-10 0:0:0", "login_ip": "192.0.2.6"},
                {"date": "2019-10-10 1:0:0", "login_ip": "192.0.2.6"},
                {... More logon information ...}
            ]
        },
        {
            "name": "user2",
            "login_histories": [
                {"date": "2019-10-11 0:0:0", "login_ip": "192.0.2.7"},
                {"date": "2019-10-11 1:0:0", "login_ip": "192.0.2.9"},
                {... More logon information ...}
            ]
        },
        {... More users ...}
    ]
}
Expected logs after splitting
__source__: 192.0.2.1
name: user1
date: 2019-10-10 1:0:0
login_ip: 192.0.2.6

__source__: 192.0.2.1
name: user1
date: 2019-10-10 0:0:0
login_ip: 192.0.2.6

__source__: 192.0.2.1
name: user2
date: 2019-10-11 0:0:0
login_ip: 192.0.2.7

__source__: 192.0.2.1
name: user2
date: 2019-10-11 1:0:0
login_ip: 192.0.2.9

... More logs ...
Solution
Split the log by the users array in the content field and expand each element.

e_split("content", jmes='users[*]', output='item')
e_json("item", depth=1)
The log after processing is as follows:
__source__: 192.0.2.1
__topic__:
content: {... Same as that in the raw log ...}
item: {"name": "user1", "login_histories": [{"date": "2019-10-10 0:0:0", "login_ip": "192.0.2.6"}, {"date": "2019-10-10 1:0:0", "login_ip": "192.0.2.6"}]}
login_histories: [{"date": "2019-10-10 0:0:0", "login_ip": "192.0.2.6"}, {"date": "2019-10-10 1:0:0", "login_ip": "192.0.2.6"}]
name: user1

__source__: 192.0.2.1
__topic__:
content: {... Same as that in the raw log ...}
item: {"name": "user2", "login_histories": [{"date": "2019-10-11 0:0:0", "login_ip": "192.0.2.7"}, {"date": "2019-10-11 1:0:0", "login_ip": "192.0.2.9"}]}
login_histories: [{"date": "2019-10-11 0:0:0", "login_ip": "192.0.2.7"}, {"date": "2019-10-11 1:0:0", "login_ip": "192.0.2.9"}]
name: user2
Split the log by login_histories and expand each element.

e_split("login_histories")
e_json("login_histories", depth=1)
The log after processing is as follows:
__source__: 192.0.2.1
__topic__:
content: {... Same as that in the raw log ...}
date: 2019-10-11 0:0:0
item: {"name": "user2", "login_histories": [{"date": "2019-10-11 0:0:0", "login_ip": "192.0.2.7"}, {"date": "2019-10-11 1:0:0", "login_ip": "192.0.2.9"}]}
login_histories: {"date": "2019-10-11 0:0:0", "login_ip": "192.0.2.7"}
login_ip: 192.0.2.7
name: user2

__source__: 192.0.2.1
__topic__:
content: {... Same as that in the raw log ...}
date: 2019-10-11 1:0:0
item: {"name": "user2", "login_histories": [{"date": "2019-10-11 0:0:0", "login_ip": "192.0.2.7"}, {"date": "2019-10-11 1:0:0", "login_ip": "192.0.2.9"}]}
login_histories: {"date": "2019-10-11 1:0:0", "login_ip": "192.0.2.9"}
login_ip: 192.0.2.9
name: user2

__source__: 192.0.2.1
__topic__:
content: {... Same as that in the raw log ...}
date: 2019-10-10 1:0:0
item: {"name": "user1", "login_histories": [{"date": "2019-10-10 0:0:0", "login_ip": "192.0.2.6"}, {"date": "2019-10-10 1:0:0", "login_ip": "192.0.2.6"}]}
login_histories: {"date": "2019-10-10 1:0:0", "login_ip": "192.0.2.6"}
login_ip: 192.0.2.6
name: user1

__source__: 192.0.2.1
__topic__:
content: {... Same as that in the raw log ...}
date: 2019-10-10 0:0:0
item: {"name": "user1", "login_histories": [{"date": "2019-10-10 0:0:0", "login_ip": "192.0.2.6"}, {"date": "2019-10-10 1:0:0", "login_ip": "192.0.2.6"}]}
login_histories: {"date": "2019-10-10 0:0:0", "login_ip": "192.0.2.6"}
login_ip: 192.0.2.6
name: user1
Delete irrelevant fields.
e_drop_fields("content", "item", "login_histories")
The log after processing is as follows:
__source__: 192.0.2.1
__topic__:
name: user1
date: 2019-10-10 1:0:0
login_ip: 192.0.2.6

__source__: 192.0.2.1
__topic__:
name: user1
date: 2019-10-10 0:0:0
login_ip: 192.0.2.6

__source__: 192.0.2.1
__topic__:
name: user2
date: 2019-10-11 0:0:0
login_ip: 192.0.2.7

__source__: 192.0.2.1
__topic__:
name: user2
date: 2019-10-11 1:0:0
login_ip: 192.0.2.9
To sum up, use the following LOG DSL rules:
e_split("content", jmes='users[*]', output='item') e_json("item",depth=1) e_split("login_histories") e_json("login_histories", depth=1) e_drop_fields("content", "item", "login_histories")
Conclusion: For similar requirements, split the log level by level based on the specified fields, expand the resulting JSON elements, and then delete the irrelevant intermediate fields.