By Lige
Efficiently utilizing MaxCompute (ODPS) SQL and pushing basic SQL syntax to its limits.
In today's era of big data, SQL is not only used by data professionals but also by non-technical personnel in product and operations roles. Efficiently harnessing the power of SQL to leverage data becomes crucial.
MaxCompute (ODPS) SQL has evolved to a mature stage. As an SQL dialect, it offers comprehensive syntax support, rich built-in functions, and advanced features such as window functions, user-defined functions, and types. It can efficiently handle various data processing scenarios.
This article aims to explore the full potential of SQL capabilities by taking a unique approach that emphasizes flexible and divergent data processing thinking using basic syntax.
This article can benefit both beginners and experienced engineers, but it is best suited to intermediate and advanced readers. It focuses on data processing thinking and avoids excessive use of advanced syntax; functions and syntax features are not covered exhaustively, so readers can explore them as needed.
The article covers sequence generation, interval transformation, permutation and combination, and continuity determination, illustrated with practical case studies. The topics build on one another slightly, so it is recommended to read them in order.
SQL statements in this article utilize only the basic syntax features of MaxCompute (ODPS) SQL. In theory, all SQL statements can run in the latest version. Runtime environment and compatibility issues are not discussed.
Sequences are common data forms, typically finite. This section starts with the simplest increasing sequence, identifying general methods and applying them to more general scenarios.
First, we introduce a simple increasing integer sequence scenario:
• Start from value 0;
• Each subsequent value is increased by 1;
• End with value 3;
How do you generate a sequence that satisfies the above three conditions? That is, [0, 1, 2, 3].
In fact, there are many ways to generate this sequence, and here is a simple and general scheme.
-- SQL - 1
select
t.pos as a_n
from (
select posexplode(split(space(3), space(1), false))
) t;
The preceding SQL snippet shows that generating an increasing sequence only requires the following three steps:
1) Generate an array of appropriate length, and the elements in the array do not need to have actual meaning;
2) Use the UDTF function posexplode to generate an index subscript for each element in the array;
3) Take out the index subscript of each element.
The above three steps can be extended to more general sequence scenarios: the arithmetic sequence and the geometric sequence. Based on this scheme, the implementation templates are given directly below.
If the first term is a and the common difference is d, the general term formula of the arithmetic sequence is a_n = a + (n - 1) * d.
-- SQL - 2
select
a + t.pos * d as a_n
from (
select posexplode(split(space(n - 1), space(1), false))
) t;
If the first term is a and the common ratio is q, the general term formula of the geometric sequence is a_n = a * q^(n - 1).
-- SQL - 3
select
a * pow(q, t.pos) as a_n
from (
select posexplode(split(space(n - 1), space(1), false))
) t;
Note: You can also use the MaxCompute (ODPS) system function sequence to generate a sequence.
-- SQL - 4
select sequence(1, 3, 1);
-- result
[1, 2, 3]
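For instance, under hypothetical parameter values (first term a = 1, common difference d = 3, number of terms n = 4), the arithmetic sequence template above can also be expressed as a single call to the sequence function, as a minimal sketch:
-- Sketch: sequence(start, end, step), with end = a + (n - 1) * d = 10
select sequence(1, 10, 3);
-- expected result
[1, 4, 7, 10]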
In multi-dimensional analysis scenarios, advanced grouping features such as CUBE, ROLLUP, and GROUPING SETS may be used to perform aggregate statistics on data across different dimension combinations.
In the existing user access log table visit_log, each row of data represents a user access log.
-- SQL - 5
with visit_log as (
select stack (
6,
'2024-01-01', '101', 'Hubei', 'Wuhan', 'Android',
'2024-01-01', '102', 'Hunan', 'Changsha', 'IOS',
'2024-01-01', '103', 'Sichuan', 'Chengdu', 'Windows',
'2024-01-02', '101', 'Hubei', 'Xiaogan', 'Mac',
'2024-01-02', '102', 'Hunan', 'Shaoyang', 'Android',
'2024-01-03', '101', 'Hubei', 'Wuhan', 'IOS'
)
-- Fields: date, user, province, city, and device type
as (dt, user_id, province, city, device_type)
)
select * from visit_log;
Suppose the three dimension columns province, city, and device_type are aggregated with GROUPING SETS to obtain the user page views under different dimension combinations. Questions:
1) How can I know which dimension columns a statistical result is aggregated based on?
2) What should I do if I want to generate the names of aggregated dimension columns for downstream reporting scenarios?
You can use the GROUPING__ID provided by MaxCompute to solve these problems. The core method is to reverse the GROUPING__ID, that is, to map each GROUPING__ID back to the dimension columns it aggregates on.
The procedure is described as follows:
1) Prepare all GROUPING__IDs.
Generate an increasing sequence of integers from 0 to 2^n - 1, convert each value into an n-bit binary string, and expand each bit of the string.
GROUPING__ID | bits |
0 | { ..., 0, 0, 0 } |
1 | { ..., 0, 0, 1 } |
2 | { ..., 0, 1, 0 } |
3 | { ..., 0, 1, 1 } |
... | ... |
2^n - 1 | ... |
n indicates the number of dimension columns, and 2^n indicates the number of dimension combinations. Each value corresponds to one GROUPING__ID.
2) Prepare all dimension names.
Generate a string sequence to store dimension column names, that is,
{ dim_name_1, dim_name_2, ..., dim_name_n }
3) Map the GROUPING__ID to the names of dimension columns.
For each value in the GROUPING__ID increasing sequence, match each bit of the value with the dimension name at the same subscript, and output the dimension names whose bit is 0. For example:
GROUPING__ID:3 => { 0, 1, 1 }
Sequence of dimension names: {province, city, device type}
Mapping: { 0: province, 1: city, 1: device type}
The aggregate dimension of the data row whose GROUPING__ID is 3 is province.
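As a quick check of this bit expansion, the following minimal sketch uses the same functions that appear in SQL - 6 below (conv converts the GROUPING__ID to binary, lpad pads it to the number of dimension columns, and regexp_extract_all expands the bits), assuming 3 dimension columns:
-- Sketch: expand GROUPING__ID 3 into its bits
select regexp_extract_all(lpad(conv(3, 10, 2), 3, '0'), '(0|1)');
-- expected result
['0', '1', '1']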
-- SQL - 6
with group_dimension as (
select -- Dimension fields corresponding to each group
gb.group_id, concat_ws(",", collect_list(case when gb.placeholder_bit = 0 then dim_col.val else null end)) as dimension_name
from (
select groups.pos as group_id, pe.*
from (
select posexplode(split(space(cast(pow(2, 3) as int) - 1), space(1), false))
) groups -- All groups
lateral view posexplode(regexp_extract_all(lpad(conv(groups.pos,10,2), 3, "0"), '(0|1)')) pe as placeholder_idx, placeholder_bit -- Bit information of each group
) gb
left join ( -- All dimension fields
select posexplode(split("province, city, device type", ','))
) dim_col on gb.placeholder_idx = dim_col.pos
group by gb.group_id
)
select
group_dimension.dimension_name,
province, city, device_type,
visit_count
from (
select
grouping_id(province, city, device_type) as group_id,
province, city, device_type,
count(1) as visit_count
from visit_log b
group by province, city, device_type
GROUPING SETS(
(province),
(province, city),
(province, city, device_type)
)
) t
join group_dimension on t.group_id = group_dimension.group_id
order by group_dimension.dimension_name;
Intervals have different data characteristics compared with sequences, but in practice, the processing of sequences and intervals has many similarities. This section will introduce some common interval scenarios and abstract out general solutions.
If you have a numeric interval [a, b], how do you divide this interval into n equal subintervals?
The question can be simplified to a sequence problem: the start value of the i-th subinterval is a + i * d and the end value is a + (i + 1) * d, where d = (b - a) / n and i = 0, 1, ..., n - 1. The specific steps are as follows:
1) Generate an array with a length of n;
2) Use the UDTF function posexplode to generate an index subscript for each element in the array;
3) Take out the index subscript of each element and perform a sequence formula calculation to obtain the start and end values of each subinterval.
-- SQL - 7
select
a + t.pos * d as sub_interval_start, -- Start value of the subinterval
a + (t.pos + 1) * d as sub_interval_end -- End value of subinterval
from (
select posexplode(split(space(n - 1), space(1), false))
) t;
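For instance, with the hypothetical values a = 0, b = 10, and n = 5 (so d = (b - a) / n = 2), the template instantiates as the following sketch:
-- Sketch: split the interval [0, 10] into 5 equal subintervals (a = 0, d = 2, n = 5)
select
0 + t.pos * 2 as sub_interval_start, -- Start value of the subinterval
0 + (t.pos + 1) * 2 as sub_interval_end -- End value of the subinterval
from (
select posexplode(split(space(5 - 1), space(1), false))
) t;
-- expected subintervals: [0, 2], [2, 4], [4, 6], [6, 8], [8, 10]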
It is known that the two date intervals have overlap - ['2024-01-01', '2024-01-03'] and ['2024-01-02', '2024-01-04']. Questions:
1) How can I merge two date intervals and return the merged new interval?
2) How can I know which date is a cross date and return the number of cross times for that date?
There are many ways to solve the above problems, and here is a simple and general solution. The core idea is to combine the sequence generation and interval division methods. Firstly, decompose the date interval into the smallest processing unit, that is, the sequence composed of multiple dates. Then, make statistics based on the date granularity. The specific steps are as follows:
1) Get the number of days contained in each date interval;
2) Split the date interval into a corresponding number of increasing date sequences based on the results obtained in the previous step;
3) Count the merged interval and the number of crossings through the date sequence.
-- SQL - 8
with dummy_table as (
select stack(
2,
'2024-01-01', '2024-01-03',
'2024-01-02', '2024-01-04'
) as (date_start, date_end)
)
select
min(date_item) as date_start_merged,
max(date_item) as date_end_merged,
collect_set( -- Cross date count
case when date_item_cnt > 1 then concat(date_item, ':', date_item_cnt) else null end
) as overlap_date
from (
select
-- Single date after disassembly
date_add(date_start, pos) as date_item,
-- The number of occurrences of a single date after disassembly
count(1) over (partition by date_add(date_start, pos)) as date_item_cnt
from dummy_table
lateral view posexplode(split(space(datediff(date_end, date_start)), space(1), false)) t as pos, val
) t;
date_start_merged | date_end_merged | overlap_date |
2024-01-01 | 2024-01-04 | ["2024-01-02:2","2024-01-03:2"] |
Let's make it harder!
If there are multiple date intervals and the crossing status between the intervals is unknown, how do you solve the above problem? That is:
1) How can I merge multiple date intervals and return the merged multiple new intervals?
2) How can I know which date is a cross date and return the number of cross times for that date?
SQL implementation:
-- SQL - 9
with dummy_table as (
select stack(
5,
'2024-01-01', '2024-01-03',
'2024-01-02', '2024-01-04',
'2024-01-06', '2024-01-08',
'2024-01-08', '2024-01-08',
'2024-01-07', '2024-01-10'
) as (date_start, date_end)
)
select
min(date_item) as date_start_merged,
max(date_item) as date_end_merged,
collect_set( -- Cross date count
case when date_item_cnt > 1 then concat(date_item, ':', date_item_cnt) else null end
) as overlap_date
from (
select
-- Single date after disassembly
date_add(date_start, pos) as date_item,
-- The number of occurrences of a single date after disassembly
count(1) over (partition by date_add(date_start, pos)) as date_item_cnt,
-- The single date after disassembly is reorganized as the mark of the new interval
date_add(date_add(date_start, pos), 1 - dense_rank() over (order by date_add(date_start, pos))) as cont
from dummy_table
lateral view posexplode(split(space(datediff(date_end, date_start)), space(1), false)) t as pos, val
) t
group by cont;
date_start_merged | date_end_merged | overlap_date |
2024-01-01 | 2024-01-04 | ["2024-01-02:2","2024-01-03:2"] |
2024-01-06 | 2024-01-10 | ["2024-01-07:2","2024-01-08:3"] |
Scenario Description
In the existing user repayment plan table user_repayment, each row of data indicates the daily repayment amount of a user within the specified date interval [date_start, date_end].
-- SQL - 10
with user_repayment as (
select stack(
3,
'101', '2024-01-01', '2024-01-15', 10,
'102', '2024-01-05', '2024-01-20', 20,
'103', '2024-01-10', '2024-01-25', 30
)
-- Fields: user, start date, end date, and daily repayment amount
as (user_id, date_start, date_end, repayment)
)
select * from user_repayment;
How can I count the total amount of repayment payable by all users every day in any period of time (e.g., from 2024-01-15 to 2024-01-16)?
The core idea is to convert the date interval into a date sequence, and then summarize statistics by the date sequence.
-- SQL - 11
select
date_item as day,
sum(repayment) as total_repayment
from (
select
date_add(date_start, pos) as date_item,
repayment
from user_repayment
lateral view posexplode(split(space(datediff(date_end, date_start)), space(1), false)) t as pos, val
) t
where date_item >= '2024-01-15' and date_item <= '2024-01-16'
group by date_item
order by date_item;
day | total_repayment |
2024-01-15 | 60 |
2024-01-16 | 50 |
Permutations and combinations are commonly used ways of organizing discrete data. This section introduces how to implement permutations and combinations, and uses examples to focus on processing data with combinations.
Given the character sequence [ 'A', 'B', 'C' ], if 2 characters are selected at a time and repetition is allowed, how can all permutations be obtained?
With the help of multiple lateral views, the overall implementation is relatively simple.
-- SQL - 12
select
concat(val1, val2) as perm
from (select split('A,B,C', ',') as characters) dummy
lateral view explode(characters) t1 as val1
lateral view explode(characters) t2 as val2;
perm |
AA |
AB |
AC |
BA |
BB |
BC |
CA |
CB |
CC |
Given the character sequence [ 'A', 'B', 'C' ], if 2 characters are selected at a time and repetition is allowed, how can all combinations be obtained?
With the help of multiple lateral views, the overall implementation is relatively simple.
-- SQL - 13
select
concat(least(val1, val2), greatest(val1, val2)) as comb
from (select split('A,B,C', ',') as characters) dummy
lateral view explode(characters) t1 as val1
lateral view explode(characters) t2 as val2
group by least(val1, val2), greatest(val1, val2);
comb |
AA |
AB |
AC |
BB |
BC |
CC |
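If repetition is not allowed, the same template still applies; as a minimal sketch, filtering out pairs of identical characters (the same trick used again in SQL - 16 below) leaves only the combinations without repetition:
-- Sketch: SQL - 13 with identical pairs filtered out
select
concat(least(val1, val2), greatest(val1, val2)) as comb
from (select split('A,B,C', ',') as characters) dummy
lateral view explode(characters) t1 as val1
lateral view explode(characters) t2 as val2
where val1 <> val2
group by least(val1, val2), greatest(val1, val2);
-- expected result: AB, AC, BC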
Note: You can also use the MaxCompute (ODPS) system function, combinations, to generate combinations quickly.
-- SQL - 14
select combinations(array('foo', 'bar', 'boo'), 2);
-- result
[['foo', 'bar'], ['foo', 'boo'], ['bar', 'boo']]
In the existing delivery strategy conversion table strategy_order, each row of data indicates the number of orders generated by a delivery strategy within one day.
-- SQL - 15
with strategy_order as (
select stack(
3,
'2024-01-01', 'Strategy A', 10,
'2024-01-01', 'Strategy B', 20,
'2024-01-01', 'Strategy C', 30
)
-- Fields: date, delivery strategy, and order volume
as (dt, strategy, order_cnt)
)
select * from strategy_order;
How can I create pairwise comparison groups based on the delivery strategy and show the conversion volume of different strategies by groups?
Comparison group | Delivery strategy | Conversion volume |
Strategy A-Strategy B | Strategy A | xxx |
Strategy A-Strategy B | Strategy B | xxx |
The core idea is to extract two strategies from the list of all delivery strategies without repetition, generate all combination results, and then associate the grouping statistical results of the strategy_order table.
-- SQL - 16
select /*+ mapjoin(combs) */
combs.strategy_comb,
so.strategy,
so.order_cnt
from strategy_order so
join ( -- Generate all comparison groups
select
concat(least(val1, val2), '-', greatest(val1, val2)) as strategy_comb,
least(val1, val2) as strategy_1, greatest(val1, val2) as strategy_2
from (
select collect_set(strategy) as strategies
from strategy_order
) dummy
lateral view explode(strategies) t1 as val1
lateral view explode(strategies) t2 as val2
where val1 <> val2
group by least(val1, val2), greatest(val1, val2)
) combs on 1 = 1
where so.strategy in (combs.strategy_1, combs.strategy_2)
order by combs.strategy_comb, so.strategy;
Comparison group | Delivery strategy | Conversion volume |
Strategy A-Strategy B | Strategy A | 10 |
Strategy A-Strategy B | Strategy B | 20 |
Strategy A-Strategy C | Strategy A | 10 |
Strategy A-Strategy C | Strategy C | 30 |
Strategy B-Strategy C | Strategy B | 20 |
Strategy B-Strategy C | Strategy C | 30 |
This section mainly introduces continuity and focuses on common continuous activity scenarios. Different implementation schemes are described for static continuous activity and dynamic continuous activity respectively.
In the existing user access log table visit_log, each row of data represents a user access log.
-- SQL - 17
with visit_log as (
select stack (
6,
'2024-01-01', '101', 'Hubei', 'Wuhan', 'Android',
'2024-01-01', '102', 'Hunan', 'Changsha', 'IOS',
'2024-01-01', '103', 'Sichuan', 'Chengdu', 'Windows',
'2024-01-02', '101', 'Hubei', 'Xiaogan', 'Mac',
'2024-01-02', '102', 'Hunan', 'Shaoyang', 'Android',
'2024-01-03', '101', 'Hubei', 'Wuhan', 'IOS'
)
-- Fields: date, user, province, city, and device type
as (dt, user_id, province, city, device_type)
)
select * from visit_log;
How can I get users with continuous access greater than or equal to 2 days?
In the preceding question, the continuity result is measured against a fixed threshold. This is classified as the common continuous activity scenario: counting users with N or more continuous active days.
Implementation based on adjacent date difference (lag/lead version)
The overall implementation is relatively simple.
-- SQL - 18
select user_id
from (
select
*,
lag(dt, 2 - 1) over (partition by user_id order by dt) as lag_dt
from (select dt, user_id from visit_log group by dt, user_id) t0
) t1
where datediff(dt, lag_dt) + 1 = 2
group by user_id;
user_id |
101 |
102 |
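The lag/lead version generalizes directly to an arbitrary threshold of N continuous active days. As a sketch with a hypothetical N = 3:
-- Sketch: SQL - 18 generalized to N = 3 consecutive days
select user_id
from (
select
*,
lag(dt, 3 - 1) over (partition by user_id order by dt) as lag_dt
from (select dt, user_id from visit_log group by dt, user_id) t0
) t1
where datediff(dt, lag_dt) + 1 = 3
group by user_id;
-- With the sample data above, only user 101 is active for 3 consecutive days.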
Implementation based on adjacent date difference (sorted version)
The overall implementation is relatively simple.
-- SQL - 19
select user_id
from (
select *,
dense_rank() over (partition by user_id order by dt) as dr
from visit_log
) t1
where datediff(dt, date_add(dt, 1 - dr)) + 1 = 2
group by user_id;
user_id |
101 |
102 |
It can be regarded as a derivative of implementation based on adjacent date difference (sorted version). This implementation can obtain more information, such as the number of consecutive active days.
-- SQL - 20
select user_id
from (
select
*,
-- consecutive active days
count(distinct dt)
over (partition by user_id, cont) as cont_days
from (
select
*,
date_add(dt, 1 - dense_rank()
over (partition by user_id order by dt)) as cont
from visit_log
) t1
) t2
where cont_days >= 2
group by user_id;
user_id |
101 |
102 |
It can be regarded as a derivative of the implementation based on adjacent date differences (sorted version). This implementation can obtain more information, such as continuous active intervals.
-- SQL - 21
select user_id
from (
select
user_id, cont,
-- continuous active interval
min(dt) as cont_date_start, max(dt) as cont_date_end
from (
select
*,
date_add(dt, 1 - dense_rank()
over (partition by user_id order by dt)) as cont
from visit_log
) t1
group by user_id, cont
) t2
where datediff(cont_date_end, cont_date_start) + 1 >= 2
group by user_id;
user_id |
101 |
102 |
In the existing user access log table visit_log, each row of data represents a user access log.
-- SQL - 22
with visit_log as (
select stack (
6,
'2024-01-01', '101', 'Hubei', 'Wuhan', 'Android',
'2024-01-01', '102', 'Hunan', 'Changsha', 'IOS',
'2024-01-01', '103', 'Sichuan', 'Chengdu', 'Windows',
'2024-01-02', '101', 'Hubei', 'Xiaogan', 'Mac',
'2024-01-02', '102', 'Hunan', 'Shaoyang', 'Android',
'2024-01-03', '101', 'Hubei', 'Wuhan', 'IOS'
)
-- Fields: date, user, province, city, and device type
as (dt, user_id, province, city, device_type)
)
select * from visit_log;
How can I obtain the two users with the most continuous active days, and output each user, the number of continuous active days, and the corresponding continuous active date interval?
In the preceding question, the continuity result is not, and cannot be, compared with a fixed threshold; instead, each user's longest run of continuous active days serves as a dynamic threshold. This is classified as the dynamic continuous activity scenario.
The following code provides the final SQL statement based on the idea of common continuous activity scenario statistics:
-- SQL - 23
select
user_id,
-- The maximum number of continuous active days
datediff(max(dt), min(dt)) + 1 as cont_days,
-- The longest continuous active date interval
min(dt) as cont_date_start, max(dt) as cont_date_end
from (
select
*,
date_add(dt, 1 - dense_rank()
over (partition by user_id order by dt)) as cont
from visit_log
) t1
group by user_id, cont
order by cont_days desc
limit 2;
user_id | cont_days | cont_date_start | cont_date_end |
101 | 3 | 2024-01-01 | 2024-01-03 |
102 | 2 | 2024-01-01 | 2024-01-02 |
This section combines and varies the content of the previous sections through a more complex practical case.
In the existing user WiFi record table user_wifi_log, each row of data represents a log of a user scanning or connecting to WiFi at a certain moment.
-- SQL - 24
with user_wifi_log as (
select stack (
9,
'2024-01-01 10:01:00', '101', 'cmcc-Starbucks', 'scan', -- Scan
'2024-01-01 10:02:00', '101', 'cmcc-Starbucks', 'scan',
'2024-01-01 10:03:00', '101', 'cmcc-Starbucks', 'scan',
'2024-01-01 10:04:00', '101', 'cmcc-Starbucks', 'conn', -- Connect
'2024-01-01 10:05:00', '101', 'cmcc-Starbucks', 'conn',
'2024-01-01 10:06:00', '101', 'cmcc-Starbucks', 'conn',
'2024-01-01 11:01:00', '101', 'cmcc-Starbucks', 'conn',
'2024-01-01 11:02:00', '101', 'cmcc-Starbucks', 'conn',
'2024-01-01 11:03:00', '101', 'cmcc-Starbucks', 'conn'
)
-- Fields: time, user, WIFI, and status (scan, connect)
as (time, user_id, wifi, status)
)
select * from user_wifi_log;
User behavior analysis is required. How can I divide the records into different WiFi behavior intervals? Requirements:
1) There are two types of behavior: connect and scan;
2) Definition of a behavior interval: the behavior types are the same, and the time difference between two adjacent behaviors does not exceed 30 minutes;
3) Each behavior interval should be as long as possible while still meeting the definition.
user_id | wifi | status | time_start | time_end | Note |
101 | cmcc-Starbucks | scan | 2024-01-01 10:01:00 | 2024-01-01 10:03:00 | The user has scanned WiFi |
101 | cmcc-Starbucks | conn | 2024-01-01 10:04:00 | 2024-01-01 10:06:00 | The user has connected WiFi |
101 | cmcc-Starbucks | conn | 2024-01-01 11:01:00 | 2024-01-01 11:02:00 | It has been more than 30 minutes since the last connection, which is considered a new connection behavior. |
The preceding question is more complex. It can be considered a variant of the longest-continuous-active-days problem in dynamic continuous activity statistics: combine a continuity threshold with the context information in the behavior sequence to divide the longest subintervals.
Core logic: group by user and WiFi, and combine the continuity threshold with the behavior sequence context to divide behavior intervals.
Detailed steps:
1) Sort the data in ascending time order in the grouping window by user and WIFI groups.
2) Traverse adjacent records in the grouping window in sequence. If the time difference between two adjacent records exceeds 30 minutes, or the behavior state (scan or connect) changes between them, split the behavior interval at that point. Repeat until all records are traversed.
3) Final output: user, WIFI, behavior state (scanning state or connection state), behavior start time, and behavior end time.
-- SQL - 25
select
user_id,
wifi,
max(status) as status,
min(time) as start_time,
max(time) as end_time
from (
select *,
max(if(lag_status is null or lag_time is null or status <> lag_status or datediff(time, lag_time, 'ss') > 60 * 30, rn, null))
over (partition by user_id, wifi order by time) as group_idx
from (
select *,
row_number() over (partition by user_id, wifi order by time) as rn,
lag(time, 1) over (partition by user_id, wifi order by time) as lag_time,
lag(status, 1) over (partition by user_id, wifi order by time) as lag_status
from user_wifi_log
) t1
) t2
group by user_id, wifi, group_idx;
user_id | wifi | status | start_time | end_time |
101 | cmcc-Starbucks | scan | 2024-01-01 10:01:00 | 2024-01-01 10:03:00 |
101 | cmcc-Starbucks | conn | 2024-01-01 10:04:00 | 2024-01-01 10:06:00 |
101 | cmcc-Starbucks | conn | 2024-01-01 11:01:00 | 2024-01-01 11:03:00 |
The conditions for determining continuity in this case can be extended to more scenarios. For example, scenarios that involve date difference, time difference, enumeration type, and distance difference as determining conditions for continuity.
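To illustrate, only the splitting condition inside the max(if(...)) expression of SQL - 25 needs to change. The following minimal sketch assumes a hypothetical table user_track(user_id, time, distance) that records the cumulative distance (in meters) along a user's track, and splits the track into segments whenever the distance gap between adjacent points exceeds a hypothetical 500-meter threshold:
-- Sketch: continuity determined by distance difference instead of time difference
with user_track as (
select stack(
4,
'101', '2024-01-01 10:00:00', 100,
'101', '2024-01-01 10:01:00', 300,
'101', '2024-01-01 10:02:00', 1200,
'101', '2024-01-01 10:03:00', 1300
) as (user_id, time, distance)
)
select
user_id,
min(time) as start_time,
max(time) as end_time
from (
select *,
max(if(lag_distance is null or distance - lag_distance > 500, rn, null))
over (partition by user_id order by time) as group_idx
from (
select *,
row_number() over (partition by user_id order by time) as rn,
lag(distance, 1) over (partition by user_id order by time) as lag_distance
from user_track
) t1
) t2
group by user_id, group_idx;
-- expected: two segments, 10:00:00 - 10:01:00 and 10:02:00 - 10:03:00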
Through flexible and divergent data processing thinking, this article explores the use of basic syntax to solve complex data scenarios. General solutions are provided for common scenarios such as sequence generation, interval transformation, permutation and combination, and continuity determination, with practical examples.
This article takes a unique approach, emphasizing flexible data processing thinking in the hopes of providing valuable insights and assistance to readers. Due to personal limitations, the ideas presented may not be optimal and could contain errors. Therefore, feedback and suggestions are welcome.
Disclaimer: The views expressed herein are for reference only and don't necessarily represent the official views of Alibaba Cloud.