By Lige
Efficiently utilizing MaxCompute (ODPS) SQL and pushing basic SQL syntax to its limits.
In today's era of big data, SQL is used not only by data professionals but also by non-technical personnel in product and operations roles. Efficiently harnessing the power of SQL to leverage data has therefore become crucial.
MaxCompute (ODPS) SQL has evolved to a mature stage. As an SQL dialect, it offers comprehensive syntax support, rich built-in functions, and advanced features such as window functions, user-defined functions, and types. It can efficiently handle various data processing scenarios.
This article aims to explore the full potential of SQL capabilities by taking a unique approach that emphasizes flexible and divergent data processing thinking using basic syntax.
This article can help both beginners and experienced practitioners, but it is best suited for intermediate and advanced readers. It focuses on data processing thinking and avoids excessive use of advanced syntax; individual functions and syntax features are not covered extensively, so readers can explore them as needed.
The article covers sequence generation, interval transformation, permutation and combination, and continuity discrimination, and demonstrates practical applications through case studies. The topics build on one another slightly, so it is recommended to read them in order.
SQL statements in this article utilize only the basic syntax features of MaxCompute (ODPS) SQL. In theory, all SQL statements can run in the latest version. Runtime environment and compatibility issues are not discussed.
Sequences are common data forms, typically finite. This section starts with the simplest increasing sequence, identifying general methods and applying them to more general scenarios.
First, we introduce a simple increasing integer sequence scenario:
• Start from value 0;
• Each subsequent value is increased by 1;
• End with value 3;
How do you generate a sequence that satisfies the above three conditions? That is, [0, 1, 2, 3].
In fact, there are many ways to generate this sequence, and here is a simple and general scheme.
-- SQL - 1
select
t.pos as a_n
from (
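-- space(3) returns a string of 3 spaces; splitting it on a single space (keeping the empty
-- segments) yields a 4-element array, so posexplode produces the indexes 0, 1, 2, 3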
select posexplode(split(space(3), space(1), false))
) t;
The preceding SQL snippet shows that generating an increasing sequence only requires the following three steps:
1) Generate an array of appropriate length, and the elements in the array do not need to have actual meaning;
2) Use the UDTF function posexplode to generate an index subscript for each element in the array;
3) Take out the index subscript of each element.
The above three steps can be extended to more general sequence scenarios: arithmetic sequences and geometric sequences. Based on this scheme, the implementation templates are given directly below.
If the first term is a and the common difference is d, the general term formula of the arithmetic sequence is a_n = a + (n − 1) × d.
-- SQL - 2
select
a + t.pos * d as a_n
from (
select posexplode(split(space(n - 1), space(1), false))
) t;
If the first term is a and the common ratio is q, the general term formula of the geometric sequence is a_n = a × q^(n − 1).
-- SQL - 3
select
a * pow(q, t.pos) as a_n
from (
select posexplode(split(space(n - 1), space(1), false))
) t;
Note: You can also use the MaxCompute (ODPS) system function sequence to generate a sequence.
-- SQL - 4
select sequence(1, 3, 1);
-- result
[1, 2, 3]
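As a usage note, the sequence function should presumably also cover the arithmetic template above in a single call by passing the first term, the last term, and the common difference as the step. A minimal sketch with the illustrative values a = 1, d = 2, n = 4:
-- SQL (illustrative)
select sequence(1, 1 + (4 - 1) * 2, 2);
-- expected result
[1, 3, 5, 7]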
In multi-dimensional analysis scenarios, advanced grouping syntax, such as CUBE, ROLLUP, and GROUPING SETS, may be used to perform aggregate statistics on data under different dimension combinations.
In the existing user access log table visit_log, each row of data represents a user access log.
-- SQL - 5
with visit_log as (
select stack (
6,
'2024-01-01', '101', 'Hubei', 'Wuhan', 'Android',
'2024-01-01', '102', 'Hunan', 'Changsha', 'IOS',
'2024-01-01', '103', 'Sichuan', 'Chengdu', 'Windows',
'2024-01-02', '101', 'Hubei', 'Xiaogan', 'Mac',
'2024-01-02', '102', 'Hunan', 'Shaoyang', 'Android',
'2024-01-03', '101', 'Hubei', 'Wuhan', 'IOS'
)
-- Fields: date, user, province, city, and device type
as (dt, user_id, province, city, device_type)
)
select * from visit_log;
The user page views under different dimension combinations are obtained by aggregating the three dimension columns province, city, and device_type with GROUPING SETS. Questions:
1) How can I know which dimension columns a statistical result is aggregated based on?
2) What should I do if I want to generate the names of aggregated dimension columns for downstream reporting scenarios?
You can use the GROUPING__ID provided by MaxCompute to solve these problems. The core idea is to reverse the GROUPING__ID, that is, to map each GROUPING__ID back to the names of the dimension columns it aggregates.
The procedure is described as follows:
1) Prepare all GROUPING__IDs.
Generate an increasing sequence containing the 2^n numeric values 0 to 2^n − 1, convert each numeric value into a binary string, and expand each bit of the string.
GROUPING__ID | bits |
0 | { ..., 0, 0, 0 } |
1 | { ..., 0, 0, 1 } |
2 | { ..., 0, 1, 0 } |
3 | { ..., 0, 1, 1 } |
... | ... |
2^n − 1 | ... |
n indicates the number of dimension columns, and 2^n indicates the number of dimension combinations. Each value indicates a GROUPING__ID.
2) Prepare all dimension names.
Generate a string sequence to store dimension column names, that is,
{ dim_name_1, dim_name_2, ..., dim_name_n }
3) Map the GROUPING__ID to the names of dimension columns.
For each value in the GROUPING__ID increasing sequence, map each bit of the value with the subscript of the dimension name sequence, and generate the dimension names corresponding to bit 0. For example:
GROUPING__ID:3 => { 0, 1, 1 }
Sequence of dimension names: {province, city, device type}
Mapping: { 0: province, 1: city, 1: device type}
The aggregate dimension of the data row whose GROUPING__ID is 3 is province.
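Before reading the full query, it may help to check how a single GROUPING__ID is expanded into bits. The following sketch uses the same functions that appear in SQL - 6 (conv, lpad, and regexp_extract_all), taking GROUPING__ID 3 and 3 dimension columns as an example:
-- SQL (illustrative)
select
conv(3, 10, 2) as bin_str, -- '11': decimal 3 converted to binary
lpad(conv(3, 10, 2), 3, '0') as padded_bits, -- '011': left-padded to the number of dimension columns
regexp_extract_all(lpad(conv(3, 10, 2), 3, '0'), '(0|1)') as bit_array; -- ['0', '1', '1']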
-- SQL - 6
with group_dimension as (
select -- Dimension fields corresponding to each group
gb.group_id, concat_ws(",", collect_list(case when gb.placeholder_bit = 0 then dim_col.val else null end)) as dimension_name
from (
select groups.pos as group_id, pe.*
from (
select posexplode(split(space(cast(pow(2, 3) as int) - 1), space(1), false))
) groups -- All groups
lateral view posexplode(regexp_extract_all(lpad(conv(groups.pos,10,2), 3, "0"), '(0|1)')) pe as placeholder_idx, placeholder_bit -- Bit information of each group
) gb
left join ( -- All dimension fields
select posexplode(split("province, city, device type", ','))
) dim_col on gb.placeholder_idx = dim_col.pos
group by gb.group_id
)
select
group_dimension.dimension_name,
province, city, device_type,
visit_count
from (
select
grouping_id(province, city, device_type) as group_id,
province, city, device_type,
count(1) as visit_count
from visit_log b
group by province, city, device_type
GROUPING SETS(
(province),
(province, city),
(province, city, device_type)
)
) t
join group_dimension on t.group_id = group_dimension.group_id
order by group_dimension.dimension_name;
Intervals have different data characteristics compared with sequences, but in practice, the processing of sequences and intervals has many similarities. This section will introduce some common interval scenarios and abstract out general solutions.
If you have a numeric interval [a, b], how do you divide this interval into n equal subintervals?
The question can be reduced to a sequence problem: with a subinterval length of d = (b − a) / n, the i-th subinterval is [a + i × d, a + (i + 1) × d] for i = 0, 1, ..., n − 1. The specific steps are as follows:
1) Generate an array with a length of n;
2) Use the UDTF function posexplode to generate an index subscript for each element in the array;
3) Take out the index subscript of each element and perform a sequence formula calculation to obtain the start and end values of each subinterval.
-- SQL - 7
select
a + t.pos * d as sub_interval_start, -- Start value of the subinterval
a + (t.pos + 1) * d as sub_interval_end -- End value of subinterval
from (
select posexplode(split(space(n - 1), space(1), false))
) t;
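As a usage example, substituting a = 0, d = 2.5, and n = 4 into the template above (that is, dividing the interval [0, 10] into 4 equal subintervals, so d = (10 − 0) / 4) gives:
-- SQL (illustrative)
select
0 + t.pos * 2.5 as sub_interval_start,
0 + (t.pos + 1) * 2.5 as sub_interval_end
from (
select posexplode(split(space(4 - 1), space(1), false))
) t;
-- expected subintervals: [0, 2.5], [2.5, 5], [5, 7.5], [7.5, 10]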
It is known that two date intervals overlap: ['2024-01-01', '2024-01-03'] and ['2024-01-02', '2024-01-04']. Questions:
1) How can I merge the two date intervals and return the merged new interval?
2) How can I know which dates overlap and return the number of overlaps for each such date?
There are many ways to solve these problems; here is a simple and general solution. The core idea is to combine the sequence generation and interval division methods: first decompose each date interval into its smallest processing unit, that is, a sequence of individual dates, and then aggregate at the date granularity. The specific steps are as follows:
1) Get the number of days contained in each date interval;
2) Split the date interval into a corresponding number of increasing date sequences based on the results obtained in the previous step;
3) Count the merged interval and the number of crossings through the date sequence.
-- SQL - 8
with dummy_table as (
select stack(
2,
'2024-01-01', '2024-01-03',
'2024-01-02', '2024-01-04'
) as (date_start, date_end)
)
select
min(date_item) as date_start_merged,
max(date_item) as date_end_merged,
collect_set( -- Overlapping dates and their overlap counts
case when date_item_cnt > 1 then concat(date_item, ':', date_item_cnt) else null end
) as overlap_date
from (
select
-- Single date after disassembly
date_add(date_start, pos) as date_item,
-- The number of occurrences of a single date after disassembly
count(1) over (partition by date_add(date_start, pos)) as date_item_cnt
from dummy_table
lateral view posexplode(split(space(datediff(date_end, date_start)), space(1), false)) t as pos, val
) t;
date_start_merged | date_end_merged | overlap_date |
2024-01-01 | 2024-01-04 | ["2024-01-02:2","2024-01-03:2"] |
Let's make it harder!
If there are multiple date intervals and the overlap relationships between them are unknown, how do you solve the preceding problem? That is:
1) How can I merge multiple date intervals and return the merged new intervals?
2) How can I know which dates overlap and return the number of overlaps for each such date?
SQL implementation:
-- SQL - 9
with dummy_table as (
select stack(
5,
'2024-01-01', '2024-01-03',
'2024-01-02', '2024-01-04',
'2024-01-06', '2024-01-08',
'2024-01-08', '2024-01-08',
'2024-01-07', '2024-01-10'
) as (date_start, date_end)
)
select
min(date_item) as date_start_merged,
max(date_item) as date_end_merged,
collect_set( -- Overlapping dates and their overlap counts
case when date_item_cnt > 1 then concat(date_item, ':', date_item_cnt) else null end
) as overlap_date
from (
select
-- Single date after disassembly
date_add(date_start, pos) as date_item,
-- The number of occurrences of a single date after disassembly
count(1) over (partition by date_add(date_start, pos)) as date_item_cnt,
-- Anchor that marks which merged interval the disassembled date belongs to
date_add(date_add(date_start, pos), 1 - dense_rank() over (order by date_add(date_start, pos))) as cont
from dummy_table
lateral view posexplode(split(space(datediff(date_end, date_start)), space(1), false)) t as pos, val
) t
group by cont;
date_start_merged | date_end_merged | overlap_date |
2024-01-01 | 2024-01-04 | ["2024-01-02:2","2024-01-03:2"] |
2024-01-06 | 2024-01-10 | ["2024-01-07:2","2024-01-08:3"] |
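The grouping key cont in SQL - 9 works because subtracting a date's rank offset from the date itself yields the same anchor date for every day in one consecutive run. A small illustrative check with the dates and ranks of the first merged interval above, assuming the same implicit string-to-date conversion used in the article's samples:
-- SQL (illustrative): every date in one consecutive run maps to the same anchor
select
date_add('2024-01-01', 1 - 1), -- 2024-01-01
date_add('2024-01-02', 1 - 2), -- 2024-01-01
date_add('2024-01-03', 1 - 3), -- 2024-01-01
date_add('2024-01-04', 1 - 4); -- 2024-01-01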
Scenario Description
In the existing user repayment plan table user_repayment, each row of data indicates a user's daily repayment amount within the specified date interval [date_start, date_end].
-- SQL - 10
with user_repayment as (
select stack(
3,
'101', '2024-01-01', '2024-01-15', 10,
'102', '2024-01-05', '2024-01-20', 20,
'103', '2024-01-10', '2024-01-25', 30
)
-- Fields: user, start date, end date, and daily repayment amount
as (user_id, date_start, date_end, repayment)
)
select * from user_repayment;
How can I count the total repayment amount payable by all users on each day of an arbitrary period (for example, from 2024-01-15 to 2024-01-16)?
The core idea is to convert the date interval into a date sequence, and then summarize statistics by the date sequence.
-- SQL - 11
select
date_item as day,
sum(repayment) as total_repayment
from (
select
date_add(date_start, pos) as date_item,
repayment
from user_repayment
lateral view posexplode(split(space(datediff(date_end, date_start)), space(1), false)) t as pos, val
) t
where date_item >= '2024-01-15' and date_item <= '2024-01-16'
group by date_item
order by date_item;
day | total_repayment |
2024-01-15 | 60 |
2024-01-16 | 50 |
Permutation and combination are commonly used methods for organizing discrete data. This section introduces how to implement permutations and combinations, and focuses on processing data with combinations through examples.
Given that 2 characters are selected at a time, with repetition allowed, from the character sequence [ 'A', 'B', 'C' ], how can all permutations be obtained?
With the help of multiple lateral views, the overall implementation is relatively simple.
-- SQL - 12
select
concat(val1, val2) as perm
from (select split('A,B,C', ',') as characters) dummy
lateral view explode(characters) t1 as val1
lateral view explode(characters) t2 as val2;
perm |
AA |
AB |
AC |
BA |
BB |
BC |
CA |
CB |
CC |
Given that 2 characters are selected at a time, with repetition allowed, from the character sequence [ 'A', 'B', 'C' ], how can all combinations be obtained?
With the help of multiple lateral views, the overall implementation is relatively simple.
-- SQL - 13
select
concat(least(val1, val2), greatest(val1, val2)) as comb
from (select split('A,B,C', ',') as characters) dummy
lateral view explode(characters) t1 as val1
lateral view explode(characters) t2 as val2
group by least(val1, val2), greatest(val1, val2);
comb |
AA |
AB |
AC |
BB |
BC |
CC |
Note: You can also use the MaxCompute (ODPS) system function, combinations, to generate combinations quickly.
-- SQL - 14
select combinations(array('foo', 'bar', 'boo'),2);
-- result
[['foo', 'bar'], ['foo', 'boo'], ['bar', 'boo']]
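If the nested result of combinations needs to be flattened into one row per pair (similar to the output of SQL - 13), one possible sketch is to explode the outer array and join each inner array into a string. Note that combinations does not pair an element with itself, so results such as AA do not appear:
-- SQL (illustrative)
select concat_ws('', pair) as comb
from (select combinations(array('A', 'B', 'C'), 2) as pairs) t
lateral view explode(pairs) e as pair;
-- expected result: AB, AC, BC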
In the existing delivery strategy conversion table strategy_order, each row of data indicates the number of orders generated by a delivery strategy within one day.
-- SQL - 15
with strategy_order as (
select stack(
3,
'2024-01-01', 'Strategy A', 10,
'2024-01-01', 'Strategy B', 20,
'2024-01-01', 'Strategy C', 30
)
-- Fields: date, delivery strategy, and order volume
as (dt, strategy, order_cnt)
)
select * from strategy_order;
How can I create pairwise comparison groups based on the delivery strategy and show the conversion volume of different strategies by groups?
Comparison group | Delivery strategy | Conversion volume |
Strategy A-Strategy B | Strategy A | xxx |
Strategy A-Strategy B | Strategy B | xxx |
The core idea is to select two strategies at a time, without repetition, from the list of all delivery strategies to generate all combinations, and then join the combinations with the grouped statistics of the strategy_order table.
-- SQL - 16
select /*+ mapjoin(combs) */
combs.strategy_comb,
so.strategy,
so.order_cnt
from strategy_order so
join ( -- Generate all comparison groups
select
concat(least(val1, val2), '-', greatest(val1, val2)) as strategy_comb,
least(val1, val2) as strategy_1, greatest(val1, val2) as strategy_2
from (
select collect_set(strategy) as strategies
from strategy_order
) dummy
lateral view explode(strategies) t1 as val1
lateral view explode(strategies) t2 as val2
where val1 <> val2
group by least(val1, val2), greatest(val1, val2)
) combs on 1 = 1
where so.strategy in (combs.strategy_1, combs.strategy_2)
order by combs.strategy_comb, so.strategy;
Comparison group | Delivery strategy | Conversion volume |
Strategy A-Strategy B | Strategy A | 10 |
Strategy A-Strategy B | Strategy B | 20 |
Strategy A-Strategy C | Strategy A | 10 |
Strategy A-Strategy C | Strategy C | 30 |
Strategy B-Strategy C | Strategy B | 20 |
Strategy B-Strategy C | Strategy C | 30 |
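If a side-by-side layout (one row per comparison group) is preferred over the long format above, one possible alternative sketch is a direct self-join on the same sample data, following the on 1 = 1 pattern of SQL - 16 and keeping only the pairs in which the first strategy sorts before the second:
-- SQL (illustrative)
with strategy_order as (
select stack(
3,
'2024-01-01', 'Strategy A', 10,
'2024-01-01', 'Strategy B', 20,
'2024-01-01', 'Strategy C', 30
)
as (dt, strategy, order_cnt)
)
select
concat(a.strategy, '-', b.strategy) as strategy_comb,
a.order_cnt as order_cnt_1,
b.order_cnt as order_cnt_2
from strategy_order a
join strategy_order b on 1 = 1
where a.strategy < b.strategy;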
This section mainly introduces continuity and focuses on common continuous activity scenarios. Different implementation schemes are explained for static continuous activity and dynamic continuous activity respectively.
In the existing user access log table visit_log, each row of data represents a user access log.
-- SQL - 17
with visit_log as (
select stack (
6,
'2024-01-01', '101', 'Hubei', 'Wuhan', 'Android',
'2024-01-01', '102', 'Hunan', 'Changsha', 'IOS',
'2024-01-01', '103', 'Sichuan', 'Chengdu', 'Windows',
'2024-01-02', '101', 'Hubei', 'Xiaogan', 'Mac',
'2024-01-02', '102', 'Hunan', 'Shaoyang', 'Android',
'2024-01-03', '101', 'Hubei', 'Wuhan', 'IOS'
)
-- Fields: date, user, province, city, and device type
as (dt, user_id, province, city, device_type)
)
select * from visit_log;
How can I get the users whose consecutive active days are greater than or equal to 2?
In the preceding question, the continuity result is compared against a fixed threshold, so it is classified as the common (static) continuous activity scenario: statistics on users with N or more consecutive active days.
Implementation based on adjacent date difference (lag/lead version)
The overall implementation is relatively simple.
-- SQL - 18
select user_id
from (
select
*,
lag(dt, 2 - 1) over (partition by user_id order by dt) as lag_dt
from (select dt, user_id from visit_log group by dt, user_id) t0
) t1
where datediff(dt, lag_dt) + 1 = 2
group by user_id;
user_id |
101 |
102 |
Implementation based on adjacent date difference (sorted version)
The overall implementation is relatively simple.
-- SQL - 19
select user_id
from (
select *,
dense_rank() over (partition by user_id order by dt) as dr
from visit_log
) t1
where datediff(dt, date_add(dt, 1 - dr)) + 1 = 2
group by user_id;
user_id |
101 |
102 |
It can be regarded as a derivative of the implementation based on adjacent date difference (sorted version). This implementation can obtain more information, such as the number of consecutive active days.
-- SQL - 20
select user_id
from (
select
*,
-- consecutive active days
count(distinct dt)
over (partition by user_id, cont) as cont_days
from (
select
*,
date_add(dt, 1 - dense_rank()
over (partition by user_id order by dt)) as cont
from visit_log
) t1
) t2
where cont_days >= 2
group by user_id;
user_id |
101 |
102 |
It can be regarded as a derivative of the implementation based on adjacent date differences (sorted version). This implementation can obtain more information, such as continuous active intervals.
-- SQL - 21
select user_id
from (
select
user_id, cont,
-- continuous active interval
min(dt) as cont_date_start, max(dt) as cont_date_end
from (
select
*,
date_add(dt, 1 - dense_rank()
over (partition by user_id order by dt)) as cont
from visit_log
) t1
group by user_id, cont
) t2
where datediff(cont_date_end, cont_date_start) + 1 >= 2
group by user_id;
user_id |
101 |
102 |
In the existing user access log table visit_log, each row of data represents a user access log.
-- SQL - 22
with visit_log as (
select stack (
6,
'2024-01-01', '101', 'Hubei', 'Wuhan', 'Android',
'2024-01-01', '102', 'Hunan', 'Changsha', 'IOS',
'2024-01-01', '103', 'Sichuan', 'Chengdu', 'Windows',
'2024-01-02', '101', 'Hubei', 'Xiaogan', 'Mac',
'2024-01-02', '102', 'Hunan', 'Shaoyang', 'Android',
'2024-01-03', '101', 'Hubei', 'Wuhan', 'IOS'
)
-- Fields: date, user, province, city, and device type
as (dt, user_id, province, city, device_type)
)
select * from visit_log;
How can I obtain the two users with the longest consecutive active days, and return each user, the longest number of consecutive active days, and the corresponding longest consecutive active date interval?
In the preceding question, the continuity result is not, and cannot be, compared with a fixed threshold; instead, each user's longest consecutive active period itself serves as a dynamic threshold, so this is classified as dynamic continuous activity scenario statistics.
The following code provides the final SQL statement based on the idea of common continuous activity scenario statistics:
-- SQL - 23
select
user_id,
-- The maximum number of continuous active days
datediff(max(dt), min(dt)) + 1 as cont_days,
-- The longest continuous active date interval
min(dt) as cont_date_start, max(dt) as cont_date_end
from (
select
*,
date_add(dt, 1 - dense_rank()
over (partition by user_id order by dt)) as cont
from visit_log
) t1
group by user_id, cont
order by cont_days desc
limit 2;
user_id | cont_days | cont_date_start | cont_date_end |
101 | 3 | 2024-01-01 | 2024-01-03 |
102 | 2 | 2024-01-01 | 2024-01-02 |
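Depending on how the question is interpreted, one user could in principle contribute more than one interval to the top 2 rows of SQL - 23. If only each user's single longest interval should compete, one possible refinement is to rank the intervals within each user first; a minimal sketch reusing the same visit_log data and the same cont grouping:
-- SQL (illustrative)
select user_id, cont_days, cont_date_start, cont_date_end
from (
select
user_id, cont_days, cont_date_start, cont_date_end,
-- keep only the longest interval of each user
row_number() over (partition by user_id order by cont_days desc) as rk
from (
select
user_id,
datediff(max(dt), min(dt)) + 1 as cont_days,
min(dt) as cont_date_start, max(dt) as cont_date_end
from (
select
*,
date_add(dt, 1 - dense_rank()
over (partition by user_id order by dt)) as cont
from visit_log
) t0
group by user_id, cont
) t1
) t2
where rk = 1
order by cont_days desc
limit 2;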
This section introduces a case that combines and varies the contents of the previous sections and is therefore more complex.
In the existing WiFi scan and connection record table user_wifi_log, each row of data represents a log of a user scanning or connecting to WiFi at a certain moment.
-- SQL - 24
with user_wifi_log as (
select stack (
9,
'2024-01-01 10:01:00', '101', 'cmcc-Starbucks', 'scan', -- Scan
'2024-01-01 10:02:00', '101', 'cmcc-Starbucks', 'scan',
'2024-01-01 10:03:00', '101', 'cmcc-Starbucks', 'scan',
'2024-01-01 10:04:00', '101', 'cmcc-Starbucks', 'conn', -- Connect
'2024-01-01 10:05:00', '101', 'cmcc-Starbucks', 'conn',
'2024-01-01 10:06:00', '101', 'cmcc-Starbucks', 'conn',
'2024-01-01 11:01:00', '101', 'cmcc-Starbucks', 'conn',
'2024-01-01 11:02:00', '101', 'cmcc-Starbucks', 'conn',
'2024-01-01 11:03:00', '101', 'cmcc-Starbucks', 'conn'
)
-- Fields: time, user, WIFI, and status (scan, connect)
as (time, user_id, wifi, status)
)
select * from user_wifi_log;
User behavior analysis is required. How can I divide the records into different WiFi behavior intervals? Requirements:
1) There are two types of behavior: connect and scan;
2) Definition of a behavior interval: the behavior types are the same, and the time difference between two adjacent behaviors does not exceed 30 minutes;
3) Each behavior interval should be as long as possible while still meeting the definition.
user_id | wifi | status | time_start | time_end | Note |
101 | cmcc-Starbucks | scan | 2024-01-01 10:01:00 | 2024-01-01 10:03:00 | The user has scanned WiFi |
101 | cmcc-Starbucks | conn | 2024-01-01 10:04:00 | 2024-01-01 10:06:00 | The user has connected WiFi |
101 | cmcc-Starbucks | conn | 2024-01-01 11:01:00 | 2024-01-01 11:02:00 | It has been more than 30 minutes since the last connection, which is considered a new connection behavior. |
The preceding question is a bit more complex and can be regarded as a variant of the longest-consecutive-active-days problem in dynamic continuous activity statistics. It can be described as dividing the longest subintervals by combining a continuity threshold with the context information of the behavior sequence.
Core logic: Group by user and WiFi, and divide the behavior intervals by combining the continuity threshold with the context information of the behavior sequence.
Detailed steps:
1) Sort the data in ascending time order in the grouping window by user and WIFI groups.
2) Traverse adjacent records in the grouping window in sequence. If the time difference between the two records exceeds 30 minutes, or the behavior state (scanning or connecting) changes between the two records, divide the behavior interval at that critical point. Continue until all records are traversed.
3) Final output: user, WIFI, behavior state (scanning state or connection state), behavior start time, and behavior end time.
-- SQL - 25
select
user_id,
wifi,
max(status) as status,
min(time) as start_time,
max(time) as end_time
from (
select *,
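-- A record opens a new interval when it is the first in the window, when its status differs
-- from the previous record, or when the gap from the previous record exceeds 30 minutes;
-- the running max then carries the row number of the latest boundary forward as the interval id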
max(if(lag_status is null or lag_time is null or status <> lag_status or datediff(time, lag_time, 'ss') > 60 * 30, rn, null))
over (partition by user_id, wifi order by time) as group_idx
from (
select *,
row_number() over (partition by user_id, wifi order by time) as rn,
lag(time, 1) over (partition by user_id, wifi order by time) as lag_time,
lag(status, 1) over (partition by user_id, wifi order by time) as lag_status
from user_wifi_log
) t1
) t2
group by user_id, wifi, group_idx;
user_id | wifi | status | start_time | end_time |
101 | cmcc-Starbucks | scan | 2024-01-01 10:01:00 | 2024-01-01 10:03:00 |
101 | cmcc-Starbucks | conn | 2024-01-01 10:04:00 | 2024-01-01 10:06:00 |
101 | cmcc-Starbucks | conn | 2024-01-01 11:01:00 | 2024-01-01 11:03:00 |
The conditions for determining continuity in this case can be extended to more scenarios, for example, scenarios where a date difference, time difference, enumeration type, or distance difference serves as the condition for determining continuity.
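As an illustration of that extensibility, the following sketch applies the same boundary-marking pattern with a numeric distance difference as the continuity condition. The sensor_log table, its fields, and the 100-unit threshold are purely hypothetical and used only for demonstration:
-- SQL (illustrative)
with sensor_log as (
select stack(
5,
1, 'dev-1', 10.0,
2, 'dev-1', 20.0,
3, 'dev-1', 500.0, -- a jump greater than 100 starts a new segment
4, 'dev-1', 510.0,
5, 'dev-1', 520.0
) as (reading_id, device_id, position)
)
select
device_id,
min(reading_id) as segment_start,
max(reading_id) as segment_end
from (
select *,
max(if(lag_position is null or abs(position - lag_position) > 100, rn, null))
over (partition by device_id order by reading_id) as segment_idx
from (
select *,
row_number() over (partition by device_id order by reading_id) as rn,
lag(position, 1) over (partition by device_id order by reading_id) as lag_position
from sensor_log
) t1
) t2
group by device_id, segment_idx;
-- expected segments for dev-1: readings 1-2 and readings 3-5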
Through flexible and divergent data processing thinking, this article explores the use of basic syntax to solve complex data scenarios. General solutions are provided for common scenarios such as sequence generation, interval transformation, permutation and combination, and continuous discrimination, with practical examples.
This article takes a unique approach, emphasizing flexible data processing thinking in the hopes of providing valuable insights and assistance to readers. Due to personal limitations, the ideas presented may not be optimal and could contain errors. Therefore, feedback and suggestions are welcome.
Disclaimer: The views expressed herein are for reference only and don't necessarily represent the official views of Alibaba Cloud.