在数仓游戏产品中,用户粘性指标很重要。请看下面数据结构,统计出最近一月内连续3天登陆的用户。请根据下边数据创建一张表,要求日期格式是一个集合。输出结果:
10002 ["2021-10-02","2021-10-03","2021-10-04"]
10002 ["2021-09-03","2021-09-04","2021-09-05"]
10004 ["2021-08-02","2021-08-03","2021-08-04"]
10006 ["2021-10-02","2021-10-03","2021-10-04"]
10001 2021-08-01,2021-08-02,2021-08-03,2021-08-04 10002 2021-08-01,2021-08-02,2021-09-03,2021-09-04,2021-09-05,2021-10-02,2021-10-03,2021-10-04 10003 2021-08-01,2021-08-02,2021-09-03,2029-08-04 10004 2021-07-01,2021-08-02,2021-08-03,2021-08-04 10005 2021-07-01,2021-08-02,2021-08-03,2021-07-04 10006 2021-08-01,2021-10-02,2021-10-03,2021-10-04创建表加载数据
--创建表-- drop table if exists user_login_log; create table user_login_log( id STRING, login_date_list ARRAY查看数据 解题思路) row format delimited fields terminated by 't' collection items terminated by ','; --加载数据-- load data local inpath "/opt/user_login_log.txt" into table user_login_log;
化繁为简逐步击破。
第一步:将数据进行炸裂实现列转行;
第二步:根据用户进行开窗,根据日期进行排名rk;
第三步:使用date_sub(dt,rk)函数求出差距日期date_dif;
第四步:根据id,date_diff分组聚合。
第一步 将数据进行炸裂实现列转行--语句-- select id, login_date from user_login_log LATERAL VIEW explode (login_date_list) dateTable AS login_date; --结果-- 10001 2021-08-01 10001 2021-08-02 10001 2021-08-03 10001 2021-08-04 10002 2021-08-01 10002 2021-08-02 10002 2021-09-03 10002 2021-09-04 10002 2021-09-05 10002 2021-10-02 10002 2021-10-03 10002 2021-10-04 10003 2021-08-01 10003 2021-08-02 10003 2021-09-03 10003 2029-08-04 10004 2021-07-01 10004 2021-08-02 10004 2021-08-03 10004 2021-08-04 10005 2021-07-01 10005 2021-08-02 10005 2021-08-03 10005 2021-07-04 10006 2021-08-01 10006 2021-10-02 10006 2021-10-03 10006 2021-10-04
第二步 根据用户进行开窗,根据日期进行排名rk;
--语句-- select id, login_date, rank() over (partition by id order by id,login_date) rk from ( select id, login_date from user_login_log LATERAL VIEW explode (login_date_list) dateTable AS login_date )t1; --结果-- 10001 2021-08-01 1 10001 2021-08-02 2 10001 2021-08-03 3 10001 2021-08-04 4 10002 2021-08-01 1 10002 2021-08-02 2 10002 2021-09-03 3 10002 2021-09-04 4 10002 2021-09-05 5 10002 2021-10-02 6 10002 2021-10-03 7 10002 2021-10-04 8 10004 2021-07-01 1 10004 2021-08-02 2 10004 2021-08-03 3 10004 2021-08-04 4 10005 2021-07-01 1 10005 2021-07-04 2 10005 2021-08-02 3 10005 2021-08-03 4 10003 2021-08-01 1 10003 2021-08-02 2 10003 2021-09-03 3 10003 2029-08-04 4 10006 2021-08-01 1 10006 2021-10-02 2 10006 2021-10-03 3 10006 2021-10-04 4第三步 使用date_sub(dt,rk)函数求出差距日期date_dif
--语句-- select id, -- login_date, date_sub(login_date,rk) date_diff from ( select id, login_date, rank() over (partition by id order by login_date) rk from ( select id, login_date from user_login_log LATERAL VIEW explode (login_date_list) dateTable AS login_date )t1 )t2 ; --数据--- 10001 2021-08-01 2021-07-31 10001 2021-08-02 2021-07-31 10001 2021-08-03 2021-07-31 10001 2021-08-04 2021-07-31 10002 2021-08-01 2021-07-31 10002 2021-08-02 2021-07-31 10002 2021-09-03 2021-08-31 10002 2021-09-04 2021-08-31 10002 2021-09-05 2021-08-31 10002 2021-10-02 2021-09-26 10002 2021-10-03 2021-09-26 10002 2021-10-04 2021-09-26 10004 2021-07-01 2021-06-30 10004 2021-08-02 2021-07-31 10004 2021-08-03 2021-07-31 10004 2021-08-04 2021-07-31 10005 2021-07-01 2021-06-30 10005 2021-07-04 2021-07-02 10005 2021-08-02 2021-07-30 10005 2021-08-03 2021-07-30 10003 2021-08-01 2021-07-31 10003 2021-08-02 2021-07-31 10003 2021-09-03 2021-08-31 10003 2029-08-04 2029-07-31 10006 2021-08-01 2021-07-31 10006 2021-10-02 2021-09-30 10006 2021-10-03 2021-09-30 10006 2021-10-04 2021-09-30
说明实际中不需要login_date这一列,为了看得懂特意加上,仔细观察date_dif的个数,如果出现三个以上相同的就是连续登录三天的用户。
第四步 根据id,date_diff分组聚合select id, date_list from ( select id , datediff(max(login_date),min(login_date)) as diff_days, COLLECT_set(login_date) date_list from ( select id, login_date, date_sub(login_date,rk) date_diff from ( select id, login_date, rank() over (partition by id order by login_date) rk from ( select id, login_date from user_login_log LATERAL VIEW explode (login_date_list) dateTable AS login_date )t1 )t2 )t3 group by id ,date_diff )t4 where diff_days=2 order by id; --结果-- 10002 ["2021-10-02","2021-10-03","2021-10-04"] 10002 ["2021-09-03","2021-09-04","2021-09-05"] 10004 ["2021-08-02","2021-08-03","2021-08-04"] 10006 ["2021-10-02","2021-10-03","2021-10-04"]
思考:做报表可定这样不行的,那么如何统计各个天数的用户人数,当前登陆天数的占比,小于等于当前天数的占比。
--将上边数据保存起来-- select id, login_date, datediff(max(login_date),min(login_date)) as diff_days, COLLECT_set(last_day) from user_login_log LATERAL VIEW explode (login_date_list) dateTable AS login_date; drop table if exists tmp_continue_login_user_status; create temporary table tmp_continue_login_user_status as select id , datediff(max(login_date),min(login_date)) as diff_days, COLLECT_set(login_date) date_list from ( select id, login_date, date_sub(login_date,rk) date_diff from ( select id, login_date, rank() over (partition by id order by login_date) rk from ( select id, login_date from user_login_log LATERAL VIEW explode (login_date_list) dateTable AS login_date )t1 )t2 )t3 group by id ,date_diff; --在上边结果集合进行统计-- select diff_days, -- 连续登录天数 users, -- 连续登录天数对应的用户数 (sum(users) over()) as total_users, -- 总体用户数 (sum(id) over( order by diff_days desc )) as aggregate_users, --小于当前天数的累计用户数 id/(sum(users) over()) as diff_days_rate, -- 连续登录天数对应的用户占用户用户的比列 (sum(id) over( order by diff_days desc ))/(sum(users) over()) as aggregate_users_rate -- 小于等于当前天数 连续登录的用户数累计占比 from ( select diff_days, count(distinct id) as users from tmp_continue_login_user_status group by diff_days ) tmp order by diff_days desc; --结果展示-- diff_days users total_users aggregate_users diff_days_rate aggregate_users_rate 3 1 11 1 0.09 0.09 2 3 11 4 0.27 0.36 1 3 11 7 0.27 0.64 0 4 11 11 0.36 1.0



