Retail eCommerce Funnel Analysis Demo with 1 million members and 87 million record dataset using StarRocks | Demo of StarRocks

Note

Below is an example scenario. Please see demo/documentation-samples/datalakehouse at master · StarRocks/demo · GitHub for a more detailed example with detailed configurations.

The dataset

The data comes from a user behavior dataset. This dataset randomly selected 1 million users, recording their actions on taobao from November 25, 2017 to December 3, 2017.

The dataset contains 86,953,525 records with five dimensions: userID, itemID, categoryID, behaviorType, and timestamp. The data spans eight days from November 25, 2017, to December 2, 2017. It involves a total of 987,982 unique users, 3,962,559 unique products, and 9,377 product categories. The dataset includes four types of user behaviors: page view (PV), purchase (Buy), add to cart (Cart), and favorite (Fav).

Environment

You need at least 16GB of ram to run this tutorial.

Where to download it.

The 1.1 GB file can be downloaded at https://cdn.starrocks.io/dataset/user_behavior_sample_data.parquet

Loading the data

We offer 2 different ways to load the data.

Using INSERT INTO FILES()

create table user_behavior as
    SELECT * FROM FILES(
        "path" = "s3://warehouse/user_behavior_sample_data.parquet",
        "format" = "parquet",
        "aws.s3.access_key" = "admin",
        "aws.s3.secret_key" = "password",
        "aws.s3.region" = "us-west-2",
        "aws.s3.use_instance_profile" = "false",
        "aws.s3.enable_path_style_access" = "true",
         "aws.s3.endpoint" = "http://minio:9000"
);

Using Create Table and Broker Load

Create database demo;
Use demo;
Drop table user_behavior;
CREATE TABLE `user_behavior` (
  `UserID` int(11) NULL COMMENT "",
  `ItemID` int(11) NULL COMMENT "",
  `CategoryID` int(11) NULL COMMENT "",
  `BehaviorType` varchar(65533) NULL COMMENT "",
  `Timestamp` datetime NULL COMMENT ""
) ENGINE=OLAP 
DUPLICATE KEY(`UserID`)
COMMENT "OLAP"
DISTRIBUTED BY HASH(`UserID`) BUCKETS 1 
PROPERTIES (
"replication_num" = "1",
"in_memory" = "false",
"storage_format" = "DEFAULT",
"enable_persistent_index" = "false",
"replicated_storage" = "true",
"compression" = "LZ4"
);
LOAD LABEL demo.user_behavior_3
(
    DATA INFILE("s3a://s3loadtable1/user_behavior_sample_data.parquet")
    into table user_behavior
    format as "parquet"
 )
 with BROKER
 (
   "aws.s3.use_instance_profile" = "false",
   "aws.s3.region" = "us-west-2",
   "aws.s3.access_key" = "XXXX",
   "aws.s3.secret_key" = "YYYYY"
   );
   show load where label = 'user_behavior_3';

If you get an error like type:LOAD_RUN_FAIL; msg:Memory of Query3a47bc7f-46b5-428f-bfe7-43dd10d6a141 exceed limit. Pipeline Backend: 10.0.25.171, fragment: 3a47bc7f-46b5-428f-bfe7-43dd10d6a142 Used: 2333611624, Limit: 2147483648. Mem usage has exceed the limit of single query, You can change the limit by set session variable query_mem_limit., you can execute SET GLOBAL query_mem_limit = 137438953472; to change the memory used for the load and then retry the load again.

Create the item table by CTAS.

create table item ( ItemID bigint(20), Name String);
insert into item(ItemID, name) select distinct ItemID, concat("item ", ItemID) from user_behavior;

Scenario 1: Understanding the ecommerce conversion funnel

with tmp1 as (
  with tmp as (
    select 
      t.level as level, 
      count(UserID) as res 
    from 
      (
        select 
          UserID, 
          window_funnel(
            18000, 
            `Timestamp`, 
            0, 
            [BehaviorType = 'pv' , 
            BehaviorType = 'cart', 
            BehaviorType = 'buy' ]
          ) as level 
        from 
          user_behavior 
        where `Timestamp` >= '2017-12-02 00:00:00' 
            and `Timestamp` <= '2017-12-02 23:59:59'
        group by 
          UserID
      ) as t 
    where 
      t.level > 0 
    group by 
      t.level 
  ) 
  select 
    tmp.level, 
    sum(tmp.res) over (
      order by 
        tmp.level rows between current row 
        and unbounded following
    ) as retention 
  from 
    tmp
) 
select 
  tmp1.level, 
  tmp1.retention, 
  last_value(tmp1.retention) over(
    order by 
      tmp1.level rows between current row 
      and 1 following
  )/ tmp1.retention as retention_ratio 
from 
  tmp1;
+-------+-----------+---------------------+
| level | retention | retention_ratio     |
+-------+-----------+---------------------+
|     1 |    913314 | 0.34725078122091635 |
|     2 |    317149 | 0.23266981765668504 |
|     3 |     73791 |                   1 |
+-------+-----------+---------------------+
3 rows in set (1.85 sec)

This indicates that only 34% of users who viewed the product added it to their cart, and only 23% of users who added it to their cart proceeded to place an order. The conversion rate is not good.

Scenario 2: Examine the item IDs of the top ten products with the worst conversion rate from PV (page views) to buy.

with tmp1 as (
  with tmp as (
    select 
      ItemID, 
      t.level as level, 
      count(UserID) as res 
    from 
      (
        select 
          ItemID, 
          UserID, 
          window_funnel(
            1800, 
            timestamp, 
            0, 
            [BehaviorType = 'pv', 
            BehaviorType ='buy' ]
          ) as level 
        from 
          user_behavior 
        where timestamp >= '2017-12-02 00:00:00' 
            and timestamp <= '2017-12-02 23:59:59'
        group by 
          ItemID, 
          UserID
      ) as t 
    where 
      t.level > 0 
    group by 
      t.ItemID, 
      t.level 
  ) 
  select 
    tmp.ItemID, 
    tmp.level, 
    sum(tmp.res) over (
      partition by tmp.ItemID 
      order by 
        tmp.level rows between current row 
        and unbounded following
    ) as retention 
  from 
    tmp
) 
select 
  tmp1.ItemID, 
  i.name,
  tmp1.level, 
  tmp1.retention / last_value(tmp1.retention) over(
    partition by tmp1.ItemID 
    order by 
      tmp1.level desc rows between current row 
      and 1 following
  ) as retention_ratio 
from 
  tmp1 
JOIN item i ON tmp1.ItemID = i.ItemID
order by 
  tmp1.level desc, 
  retention_ratio 
limit 
  10;
+---------+--------------+-------+-----------------------+
| ItemID  | name         | level | retention_ratio       |
+---------+--------------+-------+-----------------------+
|   59883 | item 59883   |     2 | 0.0003616636528028933 |
|  394978 | item 394978  |     2 | 0.0006357279084551812 |
| 1164931 | item 1164931 |     2 | 0.0006648936170212766 |
| 4622270 | item 4622270 |     2 | 0.0007692307692307692 |
|  812879 | item 812879  |     2 | 0.0009121313469139556 |
| 1783990 | item 1783990 |     2 | 0.0009132420091324201 |
| 3847054 | item 3847054 |     2 |  0.000925925925925926 |
| 2742138 | item 2742138 |     2 | 0.0009881422924901185 |
|  530918 | item 530918  |     2 | 0.0010193679918450561 |
|  600756 | item 600756  |     2 | 0.0010319917440660474 |
+---------+--------------+-------+-----------------------+
10 rows in set (5.07 sec)

At this point, we have identified items with poor conversion, such as item_id=59883.

Scenario 3: Would like to see the user paths of those who dropped off?

select 
  log.BehaviorType, 
  count(log.BehaviorType) 
from 
  (
    select 
      ItemID, 
      UserID, 
      window_funnel(
        1800, 
        timestamp, 
        0, 
        [BehaviorType = 'pv' , 
        BehaviorType = 'buy' ]
      ) as level 
    from 
      user_behavior 
    where timestamp >= '2017-12-02 00:00:00' 
        and timestamp <= '2017-12-02 23:59:59'
    group by 
      ItemID, 
      UserID
  ) as list 
  left join (
    select 
      UserID, 
      array_agg(BehaviorType) as BehaviorType 
    from 
      user_behavior 
    where 
      ItemID = 59883 
      and timestamp >= '2017-12-02 00:00:00' 
      and timestamp <= '2017-12-02 23:59:59' 
    group by 
      UserID
  ) as log on list.UserID = log.UserID 
where 
  list.ItemID = 59883
  and list.level = 1 
group by 
  log.BehaviorType 
order by 
  count(BehaviorType) desc;
+------------------------------------------------------+-------------------------+
| BehaviorType                                         | count(log.BehaviorType) |
+------------------------------------------------------+-------------------------+
| ["pv"]                                               |                    2704 |
| ["pv","pv"]                                          |                      43 |
| ["pv","pv","pv"]                                     |                       4 |
| ["cart","pv"]                                        |                       3 |
| ["fav","pv"]                                         |                       3 |
| ["pv","pv","pv","pv"]                                |                       2 |
| ["pv","cart"]                                        |                       1 |
| ["pv","cart","pv"]                                   |                       1 |
| ["cart","pv","pv"]                                   |                       1 |
| ["pv","pv","pv","pv","pv"]                           |                       1 |
| ["fav","pv","pv","pv","pv","pv","pv","pv","pv","pv"] |                       1 |
+------------------------------------------------------+-------------------------+
11 rows in set (3.38 sec)

We can see that the majority of people just viewed the products and left without taking any further actions.

1 Like