All Products
Search
Document Center

AnalyticDB:Funnel and retention functions

Last Updated:Mar 11, 2024

This topic describes how to use funnel and retention functions to perform funnel analysis in AnalyticDB for MySQL.

Prerequisites

The minor version of an AnalyticDB for MySQL cluster is 3.1.6.0 or later.

Note

For information about how to query the minor version of a cluster, see How do I query the version of an AnalyticDB for MySQL cluster? To update the minor version of a cluster, contact technical support.

Background information

Funnel analysis is a common type of conversion analysis that is used to reflect the conversion rates of user behavior in various stages of a process. Funnel analysis is widely used for user behavior and app data analysis scenarios, such as traffic analysis and product conversion rate analysis. AnalyticDB for MySQL supports the following funnel and retention functions:

Test dataset

In this topic, the Taobao behavior data from Tianchi Lab is used to test funnel and retention functions. For more information, see User Behavior Data from Taobao for Recommendation.

The following table describes four types of user behavior that are provided in the test dataset.

Behavior type:

Description

pv

Browses a commodity.

buy

Purchases a commodity.

cart

Adds a commodity to the shopping cart.

fav

Adds a commodity to favorites.

Before you perform a test, you must upload the test dataset to Object Storage Service (OSS) and then import the data from OSS to AnalyticDB for MySQL.

  1. Upload the test dataset to OSS. For more information, see Upload objects.

  2. Use an OSS external table to import the data to AnalyticDB for MySQL.

    1. Create an OSS external table.

      CREATE TABLE `user_behavior_oss` (
        `user_id` string,
        `item_id` string,
        `cate_id` string,
        `event` string,
         `ts` bigint) ENGINE = 'oss'
      TABLE_PROPERTIES = '{
      "endpoint":"oss-cn-zhangjiakou.aliyuncs.com",
      "accessid":"******",
      "accesskey":"*******",
      "url":"oss://<bucket-name>/user_behavior/",
      "delimiter":","
      }'

      For more information about the syntax of OSS external tables, see the "Create an OSS external table for a non-partitioned object" section of the Use external tables to import data to Data Warehouse Edition topic.

    2. Create an AnalyticDB for MySQL test table.

      CREATE TABLE user_behavior(
        uid string,
        event string,
        ts string
      )
    3. Import data from the OSS external table to the test table.

      SUBMIT JOB INSERT OVERWRITE user_behavior SELECT user_id , event, ts FROM user_behavior_oss;

window_funnel

The window_funnel function is used to search an event list in a sliding time window and count the maximum number of consecutive events. The window_funnel function starts the count from the first event in the event list that you specify, checks the events in sequence, and then returns the maximum number of consecutive events.

  • If you specify an event list that contains c1, c2, and c3 and the user data contains c1, c2, c3, and c4, the function returns 3.

  • If you specify an event list that contains c1, c2, and c3 and the user data contains c4, c3, c2, and c1, the function returns 1.

  • If you specify an event list that contains c1, c2, and c3 and the user data contains c4 and c3, the function returns 0.

Syntax

window_funnel(window, mode, timestamp, cond1, cond2, ..., condN)

Parameters

Parameter

Description

window

The size of the sliding time window.

mode

The working mode of the function. This parameter is an extensible parameter. Set the value to default.

timestamp

The time column, which must be of the BIGINT data type. If the time column is of another data type, you must convert it to the BIGINT type.

For example, for the TIMESTAMP data type, you must use the TIMESTAMPDIFF function to query the difference between the timestamp and the initial time. Sample statement:

select uid, window_funnel(cast(1800 as integer),"default", TIMESTAMPDIFF('second','2017-11-25 00:00:00.000',ts),event = 'pv',event = 'fav',event = 'cart',event = 'buy') as funnel from user_behavior where ts > '2017-11-25 00:00:00.000' and ts < '2017-11-26 00:00:000.000'

cond

The event steps.

Examples

For example, you want to analyze the conversion status from 2017-11-25 00:00:00 to 2017-11-26 00:00:00 based on the following conversion path: browse a commodity > add the commodity to favorites > add the commodity to the shopping cart > purchase the commodity. The related SQL statement involves the following conditions:

  • The size of the sliding time window is 30 minutes, which is equivalent to 1,800 seconds.

  • The query time is from 2017-11-25 00:00:00 to 2017-11-26 00:00:00, which is equivalent to 1511539200 to 1511625600 in the timestamp format.

  • The event steps are performed in the following order: browse a commodity > add the commodity to favorites > add the commodity to the shopping cart > purchase the commodity.

Execute the following SQL statement:

SELECT funnel, count(1) FROM (SELECT uid, window_funnel(cast(1800 as integer),"default", ts, event='pv',  event='fav', event='cart', event='buy') AS funnel FROM user_behavior WHERE ts > 1511539200 AND ts < 1511625600 group by uid) GROUP BY funnel;

Sample result:

+--------+----------+
| funnel | count(1) |
+--------+----------+
|      0 |    19687 |
|      2 |    78458 |
|      3 |    11640 |
|      1 |   596104 |
|      4 |      746 |
+--------+----------+
5 rows in set (0.64 sec)

retention

The retention function can use a set of conditions as parameters to analyze whether an event meets the conditions.

Syntax

retention(cond1, cond2, ..., cond32)

Parameters

Parameter

Description

cond

The analysis condition, which is of the UINT8 data type. You can specify 1 to 32 conditions.

If a condition is met, the return value is increased by 1. If a condition is not met, the return value is increased by 0.

Examples

For example, you want to analyze the user retention status as of November 25, 2017. The related SQL statement involves the following conditions:

  • The start date for analysis is 2017-11-25.

  • The number of active users on the first day is sum(r[1]). The numbers of retained users on the second day to seventh day are sum(r[2]), sum(r[3])...sum(r[7]).

Execute the following SQL statement:

SELECT sum(r[1]),sum(r[2]),sum(r[3]),sum(r[4]),sum(r[5]),sum(r[6]),sum(r[7]) FROM (SELECT retention(ds='2017-11-25' AND event='pv', ds='2017-11-25', ds='2017-11-26',ds='2017-11-27',ds='2017-11-28',ds='2017-11-29',ds='2017-11-30') AS r FROM user_behavior_date GROUP BY uid);

Sample result:

+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
| sum(r[1]) | sum(r[2]) | sum(r[3]) | sum(r[4]) | sum(r[5]) | sum(r[6]) | sum(r[7]) |
+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
|    686953 |    686953 |    544367 |    529979 |    523516 |    524530 |    528105 |
+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
1 row in set (2.96 sec)

retention_range_count and retention_range_sum

The retention analysis functions (retention_range_count and retention_range_sum) can be used to generate visualized graphs and analyze user retention in user growth analysis scenarios. The retention_range_count function can be used to record the retention status of each user. The return value of this function is a 2-dimensional array and can serve as an input parameter of the retention_range_sum function. The retention_range_sum function can be used to summarize the daily retention rate of all users.

Syntax

  • retention_range_count

    retention_range_count(is_first, is_active, dt, intervals, outputFormat)
  • retention_range_sum

    retention_range_sum(retention_range_count_result)

Parameters

Parameter

Description

is_first

Specifies whether the first event is matched. Valid values:

  • true

  • false

is_active

Specifies whether the retention condition is met. Valid values:

  • true

  • false

dt

The date when the user behavior occurs. Example: 2022-05-01.

intervals[]

The retention interval. Up to 15 retention intervals are supported.

outputFormat

The format of the return value. Valid values:

  • normal: The return value is in the [[d1(Start date), 1, 0...], [d2(Start date), 1, 0...], [d3(Start date), 1, 0...]] format. A value of 1 indicates that a retention event exists. A value of 0 indicates that no retention event exists.

  • expand: The return value is in the [[d1(Start date),d1+1(Retention date)], [d1, d1+2], [d2, d2+1], [d2, d2+3]] format.

The default value is normal.

Examples

For example, you want to query the retention status on May 1, 2022 and May 2, 2022 based on the user data from May 1, 2022 to May 4, 2022. The activation event is login, and the retention event is pay.

  1. Create a test table and insert data into the table.

    1. Create a table.

      CREATE TABLE event(uid string, event string, ds date);
    2. Insert data into the table.

      INSERT INTO event VALUES("user1", "pay", "2022-05-01"),("user1", "login", "2022-05-01"),("user1", "pay", "2022-05-02"),("user1", "login", "2022-05-02"),("user2", "login", "2022-05-01"),("user3", "login", "2022-05-02"),("user3", "pay", "2022-05-03"),("user3", "pay", "2022-05-04");

      Sample test data:

      +-------+-------+------------+
      | uid   | event | ds         |
      +-------+-------+------------+
      | user1 | login | 2022-05-01 |
      | user1 | pay   | 2022-05-01 |
      | user1 | login | 2022-05-02 |
      | user1 | pay   | 2022-05-02 |
      | user2 | login | 2022-05-01 |
      | user3 | login | 2022-05-02 |
      | user3 | pay   | 2022-05-03 |
      | user3 | pay   | 2022-05-04 |
      +-------+-------+------------+
  2. Query the retention status of each user.

    SELECT
      uid,
      r
    FROM
      (
        SELECT
          uid,
          retention_range_count(event = 'login', event = 'pay', ds, array(1, 2)) AS r
        FROM
          event
        GROUP BY
          uid
      ) AS t
    ORDER BY
      uid;

    Sample result:

    +-------+-----------------------------+
    | uid   | r                           |
    +-------+-----------------------------+
    | user1 | [[738642,0,0],[738641,1,0]] |
    | user2 | [[738641,0,0]]              |
    | user3 | [[738642,1,1]]              |
    +-------+-----------------------------+
  3. Query the daily retention rate of all users.

    SELECT
      from_days(u [1]) AS ds,
      u [3] / u [2] AS retention_d1,
      u [4] / u [2] AS retention_d2
    FROM
      (
        SELECT
          retention_range_sum(r) AS r
        FROM
          (
            SELECT
              uid,
              retention_range_count(event = 'login', event = 'pay', ds, array(1, 2)) AS r
            FROM
              event
            GROUP BY
              uid
          ) AS t
        ORDER BY
          uid
      ) AS r,
      unnest(r.r) AS t(u);

    Sample result:

    +------------+--------------+--------------+
    | ds         | retention_d1 | retention_d2 |
    +------------+--------------+--------------+
    | 2022-05-02 |          0.5 |          0.5 |
    | 2022-05-01 |          0.5 |          0.0 |
    +------------+--------------+--------------+