Note
Below is an example scenario. See the demo/documentation-samples/datalakehouse directory in the StarRocks/demo repository on GitHub for a more detailed example with full configurations.
The dataset
The data comes from a user behavior dataset: 1 million randomly selected users and the actions they took on Taobao.
The dataset contains 86,953,525 records with five dimensions: UserID, ItemID, CategoryID, BehaviorType, and Timestamp. The data spans eight days, from November 25, 2017 to December 2, 2017, and covers 987,982 unique users, 3,962,559 unique products, and 9,377 product categories. The dataset includes four types of user behavior: page view (pv), purchase (buy), add to cart (cart), and favorite (fav).
Environment
You need at least 16 GB of RAM to run this tutorial.
Where to download it
The 1.1 GB Parquet file can be downloaded from https://cdn.starrocks.io/dataset/user_behavior_sample_data.parquet
Loading the data
We offer two different ways to load the data.
Using CREATE TABLE AS SELECT with FILES()
CREATE TABLE user_behavior AS
SELECT * FROM FILES(
"path" = "s3://warehouse/user_behavior_sample_data.parquet",
"format" = "parquet",
"aws.s3.access_key" = "admin",
"aws.s3.secret_key" = "password",
"aws.s3.region" = "us-west-2",
"aws.s3.use_instance_profile" = "false",
"aws.s3.enable_path_style_access" = "true",
"aws.s3.endpoint" = "http://minio:9000"
);
Using CREATE TABLE and Broker Load
CREATE DATABASE demo;
USE demo;
DROP TABLE IF EXISTS user_behavior;
CREATE TABLE `user_behavior` (
`UserID` int(11) NULL COMMENT "",
`ItemID` int(11) NULL COMMENT "",
`CategoryID` int(11) NULL COMMENT "",
`BehaviorType` varchar(65533) NULL COMMENT "",
`Timestamp` datetime NULL COMMENT ""
) ENGINE=OLAP
DUPLICATE KEY(`UserID`)
COMMENT "OLAP"
DISTRIBUTED BY HASH(`UserID`) BUCKETS 1
PROPERTIES (
"replication_num" = "1",
"in_memory" = "false",
"storage_format" = "DEFAULT",
"enable_persistent_index" = "false",
"replicated_storage" = "true",
"compression" = "LZ4"
);
LOAD LABEL demo.user_behavior_3
(
DATA INFILE("s3a://s3loadtable1/user_behavior_sample_data.parquet")
INTO TABLE user_behavior
FORMAT AS "parquet"
)
WITH BROKER
(
"aws.s3.use_instance_profile" = "false",
"aws.s3.region" = "us-west-2",
"aws.s3.access_key" = "XXXX",
"aws.s3.secret_key" = "YYYYY"
);
SHOW LOAD WHERE label = 'user_behavior_3';
If you get an error like type:LOAD_RUN_FAIL; msg:Memory of Query3a47bc7f-46b5-428f-bfe7-43dd10d6a141 exceed limit. Pipeline Backend: 10.0.25.171, fragment: 3a47bc7f-46b5-428f-bfe7-43dd10d6a142 Used: 2333611624, Limit: 2147483648. Mem usage has exceed the limit of single query, You can change the limit by set session variable query_mem_limit., the load exceeded the per-query memory limit. Raise the limit (here to 137438953472 bytes, i.e. 128 GB) and retry the load:
SET GLOBAL query_mem_limit = 137438953472;
Create the item table and populate it from user_behavior.
CREATE TABLE item (ItemID BIGINT, Name STRING);
INSERT INTO item (ItemID, Name) SELECT DISTINCT ItemID, CONCAT('item ', ItemID) FROM user_behavior;
Scenario 1: Understanding the ecommerce conversion funnel
with tmp1 as (
with tmp as (
select
t.level as level,
count(UserID) as res
from
(
select
UserID,
window_funnel(
18000,
`Timestamp`,
0,
[BehaviorType = 'pv' ,
BehaviorType = 'cart',
BehaviorType = 'buy' ]
) as level
from
user_behavior
where `Timestamp` >= '2017-12-02 00:00:00'
and `Timestamp` <= '2017-12-02 23:59:59'
group by
UserID
) as t
where
t.level > 0
group by
t.level
)
select
tmp.level,
sum(tmp.res) over (
order by
tmp.level rows between current row
and unbounded following
) as retention
from
tmp
)
select
tmp1.level,
tmp1.retention,
last_value(tmp1.retention) over(
order by
tmp1.level rows between current row
and 1 following
)/ tmp1.retention as retention_ratio
from
tmp1;
+-------+-----------+---------------------+
| level | retention | retention_ratio |
+-------+-----------+---------------------+
| 1 | 913314 | 0.34725078122091635 |
| 2 | 317149 | 0.23266981765668504 |
| 3 | 73791 | 1 |
+-------+-----------+---------------------+
3 rows in set (1.85 sec)
This indicates that only about 35% of the users who viewed a product added it to their cart, and only about 23% of those who added it to their cart went on to place an order. The conversion rate is poor.
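To make the semantics of window_funnel concrete, here is a small Python sketch (an approximation for illustration, not the engine's actual algorithm): it computes, for one user's events, the longest prefix of the step conditions matched in time order, with every step falling inside the window that opens at the first matched event.

```python
from datetime import datetime, timedelta

def window_funnel(window_seconds, events, steps):
    """Approximate window_funnel (mode 0) for a single user.

    events: list of (datetime, behavior_type) tuples.
    steps:  ordered list of behavior types, e.g. ["pv", "cart", "buy"].
    Returns the deepest funnel level reached: the longest prefix of
    `steps` seen in time order within `window_seconds` of the step-1
    event that starts the chain.
    """
    events = sorted(events)
    best = 0
    for i, (t0, e0) in enumerate(events):
        if e0 != steps[0]:
            continue  # a chain can only start at a step-1 event
        level = 1
        deadline = t0 + timedelta(seconds=window_seconds)
        for t, e in events[i + 1:]:
            if t > deadline or level == len(steps):
                break
            if e == steps[level]:
                level += 1
        best = max(best, level)
    return best
```

With the Scenario 1 window of 18000 seconds, a pv at 10:00, cart at 10:05, and buy at 12:00 all fit in one chain (level 3), while the Scenario 2 window of 1800 seconds cuts the chain off after the cart (level 2).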
Scenario 2: Find the ten products with the worst conversion rate from page view (pv) to buy.
with tmp1 as (
with tmp as (
select
ItemID,
t.level as level,
count(UserID) as res
from
(
select
ItemID,
UserID,
window_funnel(
1800,
timestamp,
0,
[BehaviorType = 'pv',
BehaviorType ='buy' ]
) as level
from
user_behavior
where timestamp >= '2017-12-02 00:00:00'
and timestamp <= '2017-12-02 23:59:59'
group by
ItemID,
UserID
) as t
where
t.level > 0
group by
t.ItemID,
t.level
)
select
tmp.ItemID,
tmp.level,
sum(tmp.res) over (
partition by tmp.ItemID
order by
tmp.level rows between current row
and unbounded following
) as retention
from
tmp
)
select
tmp1.ItemID,
i.name,
tmp1.level,
tmp1.retention / last_value(tmp1.retention) over(
partition by tmp1.ItemID
order by
tmp1.level desc rows between current row
and 1 following
) as retention_ratio
from
tmp1
JOIN item i ON tmp1.ItemID = i.ItemID
order by
tmp1.level desc,
retention_ratio
limit
10;
+---------+--------------+-------+-----------------------+
| ItemID | name | level | retention_ratio |
+---------+--------------+-------+-----------------------+
| 59883 | item 59883 | 2 | 0.0003616636528028933 |
| 394978 | item 394978 | 2 | 0.0006357279084551812 |
| 1164931 | item 1164931 | 2 | 0.0006648936170212766 |
| 4622270 | item 4622270 | 2 | 0.0007692307692307692 |
| 812879 | item 812879 | 2 | 0.0009121313469139556 |
| 1783990 | item 1783990 | 2 | 0.0009132420091324201 |
| 3847054 | item 3847054 | 2 | 0.000925925925925926 |
| 2742138 | item 2742138 | 2 | 0.0009881422924901185 |
| 530918 | item 530918 | 2 | 0.0010193679918450561 |
| 600756 | item 600756 | 2 | 0.0010319917440660474 |
+---------+--------------+-------+-----------------------+
10 rows in set (5.07 sec)
At this point, we have identified items with poor conversion, such as ItemID = 59883.
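The retention arithmetic in the Scenario 2 query can be sketched in Python (an illustrative approximation; the input rows and helper name are hypothetical): per item, count the users who reached at least the pv step and at least the buy step, then rank items by the resulting retention ratio.

```python
from collections import defaultdict

def worst_converting_items(user_levels, n=10):
    """user_levels: list of (item_id, level) rows, one per (item, user)
    pair, where level comes from window_funnel (1 = pv only, 2 = pv
    then buy).  Returns the n item_ids with the lowest buy/pv ratio.
    """
    reached = defaultdict(lambda: [0, 0])  # item -> [users at >=1, users at >=2]
    for item, level in user_levels:
        if level >= 1:
            reached[item][0] += 1
        if level >= 2:
            reached[item][1] += 1  # a buyer also counts toward the pv total
    ratios = {item: lvl2 / lvl1
              for item, (lvl1, lvl2) in reached.items() if lvl1}
    return sorted(ratios, key=ratios.get)[:n]
```

This mirrors what the cumulative window sum and the last_value division do in SQL: the denominator is everyone who reached the funnel at all, the numerator is everyone who made it to buy.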
Scenario 3: Examine the behavior paths of the users who dropped off.
select
log.BehaviorType,
count(log.BehaviorType)
from
(
select
ItemID,
UserID,
window_funnel(
1800,
timestamp,
0,
[BehaviorType = 'pv' ,
BehaviorType = 'buy' ]
) as level
from
user_behavior
where timestamp >= '2017-12-02 00:00:00'
and timestamp <= '2017-12-02 23:59:59'
group by
ItemID,
UserID
) as list
left join (
select
UserID,
array_agg(BehaviorType) as BehaviorType
from
user_behavior
where
ItemID = 59883
and timestamp >= '2017-12-02 00:00:00'
and timestamp <= '2017-12-02 23:59:59'
group by
UserID
) as log on list.UserID = log.UserID
where
list.ItemID = 59883
and list.level = 1
group by
log.BehaviorType
order by
count(BehaviorType) desc;
+------------------------------------------------------+-------------------------+
| BehaviorType | count(log.BehaviorType) |
+------------------------------------------------------+-------------------------+
| ["pv"] | 2704 |
| ["pv","pv"] | 43 |
| ["pv","pv","pv"] | 4 |
| ["cart","pv"] | 3 |
| ["fav","pv"] | 3 |
| ["pv","pv","pv","pv"] | 2 |
| ["pv","cart"] | 1 |
| ["pv","cart","pv"] | 1 |
| ["cart","pv","pv"] | 1 |
| ["pv","pv","pv","pv","pv"] | 1 |
| ["fav","pv","pv","pv","pv","pv","pv","pv","pv","pv"] | 1 |
+------------------------------------------------------+-------------------------+
11 rows in set (3.38 sec)
We can see that the majority of users just viewed the product and left without taking any further action.
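The per-user path aggregation in the Scenario 3 query can be mimicked in plain Python (a sketch with hypothetical input rows; note that array_agg in SQL does not guarantee ordering, so here the events are sorted by timestamp explicitly):

```python
from collections import Counter, defaultdict

def path_counts(rows):
    """rows: list of (user_id, timestamp, behavior_type) tuples.

    Collects each user's behaviors in time order (like array_agg per
    UserID) and counts how many users share each path, which is what
    the Scenario 3 query's group-by over the aggregated array does.
    """
    per_user = defaultdict(list)
    for user, ts, behavior in sorted(rows, key=lambda r: (r[0], r[1])):
        per_user[user].append(behavior)
    return Counter(tuple(path) for path in per_user.values())
```

Feeding in the dropped-off users for a single item would reproduce counts like the ["pv"] and ["pv","pv"] rows above.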