栏目分类:
子分类:
返回
名师互学网用户登录
快速导航关闭
当前搜索
当前分类
子分类
实用工具
热门搜索
名师互学网 > IT > 前沿技术 > 大数据 > 大数据系统

Hive实战系列之用户粘性统计

Hive实战系列之用户粘性统计

需求 

在数仓游戏产品中,用户粘性指标很重要。请看下面数据结构,统计出最近一月内连续3天登陆的用户。请根据下边数据创建一张表,要求日期格式是一个集合。输出结果:

 10002    ["2021-10-02","2021-10-03","2021-10-04"]
 10002    ["2021-09-03","2021-09-04","2021-09-05"]
 10004    ["2021-08-02","2021-08-03","2021-08-04"]
 10006    ["2021-10-02","2021-10-03","2021-10-04"]

 10001   2021-08-01,2021-08-02,2021-08-03,2021-08-04
 10002   2021-08-01,2021-08-02,2021-09-03,2021-09-04,2021-09-05,2021-10-02,2021-10-03,2021-10-04
 10003   2021-08-01,2021-08-02,2021-09-03,2029-08-04
 10004   2021-07-01,2021-08-02,2021-08-03,2021-08-04
 10005   2021-07-01,2021-08-02,2021-08-03,2021-07-04
 10006   2021-08-01,2021-10-02,2021-10-03,2021-10-04
创建表加载数据
--创建表--
drop table if exists user_login_log;
create table user_login_log(
id STRING,
login_date_list ARRAY
)
row format delimited fields terminated by 't'
collection items terminated by ',';
--加载数据--
load data local inpath "/opt/user_login_log.txt" into table user_login_log;
查看数据

 解题思路

化繁为简逐步击破。

第一步:将数据进行炸裂实现列转行;

第二步:根据用户进行开窗,根据日期进行排名rk;

第三步:使用date_sub(dt,rk)函数求出差距日期date_dif;

第四步:根据id,date_diff分组聚合。

第一步 将数据进行炸裂实现列转行
--语句--
select
id,
login_date
from user_login_log
LATERAL VIEW explode (login_date_list)  dateTable AS login_date;
--结果--
 10001	2021-08-01
 10001	2021-08-02
 10001	2021-08-03
 10001	2021-08-04
 10002	2021-08-01
 10002	2021-08-02
 10002	2021-09-03
 10002	2021-09-04
 10002	2021-09-05
 10002	2021-10-02
 10002	2021-10-03
 10002	2021-10-04
 10003	2021-08-01
 10003	2021-08-02
 10003	2021-09-03
 10003	2029-08-04
 10004	2021-07-01
 10004	2021-08-02
 10004	2021-08-03
 10004	2021-08-04
 10005	2021-07-01
 10005	2021-08-02
 10005	2021-08-03
 10005	2021-07-04
 10006	2021-08-01
 10006	2021-10-02
 10006	2021-10-03
 10006	2021-10-04

 第二步 根据用户进行开窗,根据日期进行排名rk;

--语句--
select
id,
login_date,
rank() over (partition by id order by id,login_date)  rk
from
(
select
id,
login_date
from user_login_log
LATERAL VIEW explode (login_date_list)  dateTable AS login_date
)t1;

--结果--
 10001	2021-08-01	1
 10001	2021-08-02	2
 10001	2021-08-03	3
 10001	2021-08-04	4
 10002	2021-08-01	1
 10002	2021-08-02	2
 10002	2021-09-03	3
 10002	2021-09-04	4
 10002	2021-09-05	5
 10002	2021-10-02	6
 10002	2021-10-03	7
 10002	2021-10-04	8
 10004	2021-07-01	1
 10004	2021-08-02	2
 10004	2021-08-03	3
 10004	2021-08-04	4
 10005	2021-07-01	1
 10005	2021-07-04	2
 10005	2021-08-02	3
 10005	2021-08-03	4
 10003	2021-08-01	1
 10003	2021-08-02	2
 10003	2021-09-03	3
 10003	2029-08-04	4
 10006	2021-08-01	1
 10006	2021-10-02	2
 10006	2021-10-03	3
 10006	2021-10-04	4
  第三步 使用date_sub(dt,rk)函数求出差距日期date_dif
--语句--

select
id,
-- login_date,
date_sub(login_date,rk) date_diff
from
(
select
id,
login_date,
rank() over (partition by id order by  login_date)  rk
from
(
select
id,
login_date
from user_login_log
LATERAL VIEW explode (login_date_list)  dateTable AS login_date
)t1
)t2
;

--数据---
 10001	2021-08-01	2021-07-31
 10001	2021-08-02	2021-07-31
 10001	2021-08-03	2021-07-31
 10001	2021-08-04	2021-07-31
 10002	2021-08-01	2021-07-31
 10002	2021-08-02	2021-07-31
 10002	2021-09-03	2021-08-31
 10002	2021-09-04	2021-08-31
 10002	2021-09-05	2021-08-31
 10002	2021-10-02	2021-09-26
 10002	2021-10-03	2021-09-26
 10002	2021-10-04	2021-09-26
 10004	2021-07-01	2021-06-30
 10004	2021-08-02	2021-07-31
 10004	2021-08-03	2021-07-31
 10004	2021-08-04	2021-07-31
 10005	2021-07-01	2021-06-30
 10005	2021-07-04	2021-07-02
 10005	2021-08-02	2021-07-30
 10005	2021-08-03	2021-07-30
 10003	2021-08-01	2021-07-31
 10003	2021-08-02	2021-07-31
 10003	2021-09-03	2021-08-31
 10003	2029-08-04	2029-07-31
 10006	2021-08-01	2021-07-31
 10006	2021-10-02	2021-09-30
 10006	2021-10-03	2021-09-30
 10006	2021-10-04	2021-09-30

说明实际中不需要login_date这一列,为了看得懂特意加上,仔细观察date_dif的个数,如果出现三个以上相同的就是连续登录三天的用户。

  第四步 根据id,date_diff分组聚合
select
id,
date_list
from
(
select
id ,
datediff(max(login_date),min(login_date)) as diff_days, 
COLLECT_set(login_date) date_list
from
(
select
id,
login_date,
date_sub(login_date,rk) date_diff
from
(
select
id,
login_date,
rank() over (partition by id order by  login_date)  rk
from
(
select
id,
login_date
from user_login_log
LATERAL VIEW explode (login_date_list)  dateTable AS login_date
)t1
)t2
)t3
group by id ,date_diff
)t4
where diff_days=2
order by id;

--结果--
 10002	["2021-10-02","2021-10-03","2021-10-04"]
 10002	["2021-09-03","2021-09-04","2021-09-05"]
 10004	["2021-08-02","2021-08-03","2021-08-04"]
 10006	["2021-10-02","2021-10-03","2021-10-04"]


思考:做报表可定这样不行的,那么如何统计各个天数的用户人数,当前登陆天数的占比,小于等于当前天数的占比。

--将上边数据保存起来--
select
id,
login_date,
datediff(max(login_date),min(login_date)) as diff_days, 
COLLECT_set(last_day)
from user_login_log
LATERAL VIEW explode (login_date_list)  dateTable AS login_date;

drop table if exists tmp_continue_login_user_status;
create  temporary table tmp_continue_login_user_status as 
select
id ,
datediff(max(login_date),min(login_date)) as diff_days, 
COLLECT_set(login_date) date_list
from
(
select
id,
login_date,
date_sub(login_date,rk) date_diff
from
(
select
id,
login_date,
rank() over (partition by id order by  login_date)  rk
from
(
select
id,
login_date
from user_login_log
LATERAL VIEW explode (login_date_list)  dateTable AS login_date
)t1
)t2
)t3
group by id ,date_diff;






--在上边结果集合进行统计--
select
diff_days, -- 连续登录天数
users, -- 连续登录天数对应的用户数
(sum(users) over()) as total_users, -- 总体用户数
(sum(id) over( order by diff_days desc )) as aggregate_users, --小于当前天数的累计用户数
id/(sum(users) over()) as diff_days_rate, -- 连续登录天数对应的用户占用户用户的比列
(sum(id) over( order by diff_days desc ))/(sum(users) over()) as aggregate_users_rate -- 小于等于当前天数 连续登录的用户数累计占比
from
(
select
diff_days,
count(distinct id) as users
from
tmp_continue_login_user_status
group by diff_days
)  tmp
order by diff_days desc;



--结果展示--
diff_days	users	total_users	aggregate_users	diff_days_rate	aggregate_users_rate
3	1	11	1	0.09	0.09
2	3	11	4	0.27	0.36
1	3	11	7	0.27	0.64
0	4	11	11	0.36	1.0




转载请注明:文章转载自 www.mshxw.com
本文地址:https://www.mshxw.com/it/423287.html
我们一直用心在做
关于我们 文章归档 网站地图 联系我们

版权所有 (c)2021-2022 MSHXW.COM

ICP备案号:晋ICP备2021003244-6号