This topic describes how to use funnel and retention functions to perform funnel analysis in AnalyticDB for MySQL.
Prerequisites
The minor version of an AnalyticDB for MySQL cluster is 3.1.6.0 or later.
For information about how to query the minor version of a cluster, see How do I query the version of an AnalyticDB for MySQL cluster? To update the minor version of a cluster, contact technical support.
Background information
Funnel analysis is a common type of conversion analysis that is used to reflect the conversion rates of user behavior in various stages of a process. Funnel analysis is widely used for user behavior and app data analysis scenarios, such as traffic analysis and product conversion rate analysis. AnalyticDB for MySQL supports the following funnel and retention functions:
window_funnel: searches an event list in a sliding time window and counts the maximum number of consecutive events.
retention: analyzes whether an event meets specified conditions.
retention_range_count and retention_range_sum: record and summarize retention status.
Test dataset
In this topic, the Taobao behavior data from Tianchi Lab is used to test funnel and retention functions. For more information, see User Behavior Data from Taobao for Recommendation.
The following table describes four types of user behavior that are provided in the test dataset.
Behavior type: | Description |
pv | Browses a commodity. |
buy | Purchases a commodity. |
cart | Adds a commodity to the shopping cart. |
fav | Adds a commodity to favorites. |
Before you perform a test, you must upload the test dataset to Object Storage Service (OSS) and then import the data from OSS to AnalyticDB for MySQL.
Upload the test dataset to OSS. For more information, see Upload objects.
Use an OSS external table to import the data to AnalyticDB for MySQL.
Create an OSS external table.
CREATE TABLE `user_behavior_oss` ( `user_id` string, `item_id` string, `cate_id` string, `event` string, `ts` bigint) ENGINE = 'oss' TABLE_PROPERTIES = '{ "endpoint":"oss-cn-zhangjiakou.aliyuncs.com", "accessid":"******", "accesskey":"*******", "url":"oss://<bucket-name>/user_behavior/", "delimiter":"," }'
For more information about the syntax of OSS external tables, see the "Create an OSS external table for a non-partitioned object" section of the Use external tables to import data to Data Warehouse Edition topic.
Create an AnalyticDB for MySQL test table.
CREATE TABLE user_behavior( uid string, event string, ts string )
Import data from the OSS external table to the test table.
SUBMIT JOB INSERT OVERWRITE user_behavior SELECT user_id , event, ts FROM user_behavior_oss;
window_funnel
The window_funnel function is used to search an event list in a sliding time window and count the maximum number of consecutive events. The window_funnel function starts the count from the first event in the event list that you specify, checks the events in sequence, and then returns the maximum number of consecutive events.
If you specify an event list that contains c1, c2, and c3 and the user data contains c1, c2, c3, and c4, the function returns 3.
If you specify an event list that contains c1, c2, and c3 and the user data contains c4, c3, c2, and c1, the function returns 1.
If you specify an event list that contains c1, c2, and c3 and the user data contains c4 and c3, the function returns 0.
Syntax
window_funnel(window, mode, timestamp, cond1, cond2, ..., condN)
Parameters
Parameter | Description |
window | The size of the sliding time window. |
mode | The working mode of the function. This parameter is an extensible parameter. Set the value to default. |
timestamp | The time column, which must be of the BIGINT data type. If the time column is of another data type, you must convert it to the BIGINT type. For example, for the TIMESTAMP data type, you must use the TIMESTAMPDIFF function to query the difference between the timestamp and the initial time. Sample statement:
|
cond | The event steps. |
Examples
For example, you want to analyze the conversion status from 2017-11-25 00:00:00 to 2017-11-26 00:00:00 based on the following conversion path: browse a commodity > add the commodity to favorites > add the commodity to the shopping cart > purchase the commodity. The related SQL statement involves the following conditions:
The size of the sliding time window is 30 minutes, which is equivalent to 1,800 seconds.
The query time is from 2017-11-25 00:00:00 to 2017-11-26 00:00:00, which is equivalent to 1511539200 to 1511625600 in the timestamp format.
The event steps are performed in the following order: browse a commodity > add the commodity to favorites > add the commodity to the shopping cart > purchase the commodity.
Execute the following SQL statement:
SELECT funnel, count(1) FROM (SELECT uid, window_funnel(cast(1800 as integer),"default", ts, event='pv', event='fav', event='cart', event='buy') AS funnel FROM user_behavior WHERE ts > 1511539200 AND ts < 1511625600 group by uid) GROUP BY funnel;
Sample result:
+--------+----------+
| funnel | count(1) |
+--------+----------+
| 0 | 19687 |
| 2 | 78458 |
| 3 | 11640 |
| 1 | 596104 |
| 4 | 746 |
+--------+----------+
5 rows in set (0.64 sec)
retention
The retention function can use a set of conditions as parameters to analyze whether an event meets the conditions.
Syntax
retention(cond1, cond2, ..., cond32)
Parameters
Parameter | Description |
cond | The analysis condition, which is of the UINT8 data type. You can specify 1 to 32 conditions. If a condition is met, the return value is increased by 1. If a condition is not met, the return value is increased by 0. |
Examples
For example, you want to analyze the user retention status as of November 25, 2017. The related SQL statement involves the following conditions:
The start date for analysis is 2017-11-25.
The number of active users on the first day is
sum(r[1])
. The numbers of retained users on the second day to seventh day aresum(r[2]), sum(r[3])...sum(r[7])
.
Execute the following SQL statement:
SELECT sum(r[1]),sum(r[2]),sum(r[3]),sum(r[4]),sum(r[5]),sum(r[6]),sum(r[7]) FROM (SELECT retention(ds='2017-11-25' AND event='pv', ds='2017-11-25', ds='2017-11-26',ds='2017-11-27',ds='2017-11-28',ds='2017-11-29',ds='2017-11-30') AS r FROM user_behavior_date GROUP BY uid);
Sample result:
+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
| sum(r[1]) | sum(r[2]) | sum(r[3]) | sum(r[4]) | sum(r[5]) | sum(r[6]) | sum(r[7]) |
+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
| 686953 | 686953 | 544367 | 529979 | 523516 | 524530 | 528105 |
+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
1 row in set (2.96 sec)
retention_range_count and retention_range_sum
The retention analysis functions (retention_range_count and retention_range_sum) can be used to generate visualized graphs and analyze user retention in user growth analysis scenarios. The retention_range_count function can be used to record the retention status of each user. The return value of this function is a 2-dimensional array and can serve as an input parameter of the retention_range_sum function. The retention_range_sum function can be used to summarize the daily retention rate of all users.
Syntax
retention_range_count
retention_range_count(is_first, is_active, dt, intervals, outputFormat)
retention_range_sum
retention_range_sum(retention_range_count_result)
Parameters
Parameter | Description |
is_first | Specifies whether the first event is matched. Valid values:
|
is_active | Specifies whether the retention condition is met. Valid values:
|
dt | The date when the user behavior occurs. Example: |
intervals[] | The retention interval. Up to 15 retention intervals are supported. |
outputFormat | The format of the return value. Valid values:
The default value is normal. |
Examples
For example, you want to query the retention status on May 1, 2022 and May 2, 2022 based on the user data from May 1, 2022 to May 4, 2022. The activation event is login, and the retention event is pay.
Create a test table and insert data into the table.
Create a table.
CREATE TABLE event(uid string, event string, ds date);
Insert data into the table.
INSERT INTO event VALUES("user1", "pay", "2022-05-01"),("user1", "login", "2022-05-01"),("user1", "pay", "2022-05-02"),("user1", "login", "2022-05-02"),("user2", "login", "2022-05-01"),("user3", "login", "2022-05-02"),("user3", "pay", "2022-05-03"),("user3", "pay", "2022-05-04");
Sample test data:
+-------+-------+------------+ | uid | event | ds | +-------+-------+------------+ | user1 | login | 2022-05-01 | | user1 | pay | 2022-05-01 | | user1 | login | 2022-05-02 | | user1 | pay | 2022-05-02 | | user2 | login | 2022-05-01 | | user3 | login | 2022-05-02 | | user3 | pay | 2022-05-03 | | user3 | pay | 2022-05-04 | +-------+-------+------------+
Query the retention status of each user.
SELECT uid, r FROM ( SELECT uid, retention_range_count(event = 'login', event = 'pay', ds, array(1, 2)) AS r FROM event GROUP BY uid ) AS t ORDER BY uid;
Sample result:
+-------+-----------------------------+ | uid | r | +-------+-----------------------------+ | user1 | [[738642,0,0],[738641,1,0]] | | user2 | [[738641,0,0]] | | user3 | [[738642,1,1]] | +-------+-----------------------------+
Query the daily retention rate of all users.
SELECT from_days(u [1]) AS ds, u [3] / u [2] AS retention_d1, u [4] / u [2] AS retention_d2 FROM ( SELECT retention_range_sum(r) AS r FROM ( SELECT uid, retention_range_count(event = 'login', event = 'pay', ds, array(1, 2)) AS r FROM event GROUP BY uid ) AS t ORDER BY uid ) AS r, unnest(r.r) AS t(u);
Sample result:
+------------+--------------+--------------+ | ds | retention_d1 | retention_d2 | +------------+--------------+--------------+ | 2022-05-02 | 0.5 | 0.5 | | 2022-05-01 | 0.5 | 0.0 | +------------+--------------+--------------+