目录
0 需求分析
1 数据准备
2 数据分析
3 小 结
0 需求分析
业务中涉及到学生退费的统计问题,求截止当前月退费总人数【退费人数:上月存在,这月不存在的学生个数】。
数据如下:
hive> select * from stu; OK 2020-01-02 1001 2020-01-02 1002 2020-02-02 1001 2020-02-02 1002 2020-02-02 1003 2020-02-02 1004 2020-03-02 1001 2020-03-02 1002 2020-04-02 1005 2020-05-02 1006 Time taken: 0.487 seconds, Fetched: 10 row(s)
1 数据准备
(1)建表
create table stu
(
day string ,--'日期',
stu_id int -- '学生id'
)row format delimited fields terminated by 't';
(2) 数据
insert into stu values
("2020-01-02", 1001),
("2020-01-02", 1002),
("2020-02-02", 1001),
("2020-02-02", 1002),
("2020-02-02", 1003),
("2020-02-02", 1004),
("2020-03-02", 1001),
("2020-03-02", 1002),
("2020-04-02", 1005),
("2020-05-02", 1006);
2 数据分析
目标:求上月存在,这月不存在的学生个数。
(1)按照每个月将stu_id进行合并。
合并利用collect_list()函数完成。具体SQL如下:
select substr(day,1,7) as month
,collect_list(stu_id) as stu_id_arr
from stu
group by substr(day,1,7)
计算结果如下:
2020-01 [1001,1002] 2020-02 [1001,1002,1003,1004] 2020-03 [1001,1002] 2020-04 [1005] 2020-05 [1006]
(2)按照月份排序,获取下一月的合并后的值,SQL如下
select month
,stu_id_arr
,lead(stu_id_arr,1) over(order by month) as lag_stu_id_arr
from(
select substr(day,1,7) as month
,collect_list(stu_id) as stu_id_arr
from stu
group by substr(day,1,7)
) m
计算结果如下:
OK 2020-01 [1001,1002] [1001,1002,1003,1004] 2020-02 [1001,1002,1003,1004] [1001,1002] 2020-03 [1001,1002] [1005] 2020-04 [1005] [1006] 2020-05 [1006] NULL Time taken: 15.035 seconds, Fetched: 5 row(s)
如果hive中有计算两个数组之间的差集的函数,我们可以利用数组差集函数求出差集然后再size()求出个数,最后在sum求和即可。查阅hive中的关于数组函数,只有array_contains能够帮助我们,该函数表示在数组中查询某个元素是否存在,因而我们可以借助该函数查询某个学生当前id是否在下一组记录里面存在,如果存在标记为0,不存在标记为1。于是思路如下:利用步骤2结果与原表进行关联,获取当前学生id。
查询中间结果如下:
select t0.*
,t1.*
from
(select substr(day,1,7) as month
,stu_id
from stu) t0
left join
(
select month
,lead(stu_id_arr,1) over(order by month) as lag_stu_id_arr
from(
select substr(day,1,7) as month
,collect_list(stu_id) as stu_id_arr
from stu
group by substr(day,1,7)
) m
) t1
on t0.month = t1.month
结果如下:
OK 2020-01 1001 2020-01 [1001,1002,1003,1004] 2020-01 1002 2020-01 [1001,1002,1003,1004] 2020-02 1001 2020-02 [1001,1002] 2020-02 1002 2020-02 [1001,1002] 2020-02 1003 2020-02 [1001,1002] 2020-02 1004 2020-02 [1001,1002] 2020-03 1001 2020-03 [1005] 2020-03 1002 2020-03 [1005] 2020-04 1005 2020-04 [1006] 2020-05 1006 2020-05 NULL Time taken: 10.885 seconds, Fetched: 10 row(s)
(3)利用array_contains()函数判断当前的stu_id是否在下一月array数组中,如果存在标记0,不存在标记1.其意义就是,如果当前的学生在下一个月中出现了,说明不是要找的学生,如果不在,就说明在本月出现在下一个月未出现,就是我们要找的学生。具体SQL如下:
select t0.month
,t0.stu_id
,if(!array_contains(t1.lag_stu_id_arr,t0.stu_id),1,0) flag
from
(select substr(day,1,7) as month
,stu_id
from stu) t0
left join
(
select month
,lead(stu_id_arr,1) over(order by month) as lag_stu_id_arr
from(
select substr(day,1,7) as month
,collect_list(stu_id) as stu_id_arr
from stu
group by substr(day,1,7)
) m
) t1
on t0.month = t1.month
结果如下:
-------------------------------------------------------------------------------- OK 2020-01 1001 0 2020-01 1002 0 2020-02 1001 0 2020-02 1002 0 2020-02 1003 1 2020-02 1004 1 2020-03 1001 1 2020-03 1002 1 2020-04 1005 1 2020-05 1006 1 Time taken: 9.941 seconds, Fetched: 10 row(s)
(4) 基于步骤3对上述flag按月求和即为下一个月退费中人数。具体SQL如下:
select distinct t0.month
,sum(if(!array_contains(t1.lag_stu_id_arr,t0.stu_id),1,0)) over(partition by t0.month)
from
(select substr(day,1,7) as month
,stu_id
from stu) t0
left join
(
select month
,lead(stu_id_arr,1) over(order by month) as lag_stu_id_arr
from(
select substr(day,1,7) as month
,collect_list(stu_id) as stu_id_arr
from stu
group by substr(day,1,7)
) m
) t1
on t0.month = t1.month
计算结果如下:
-------------------------------------------------------------------------------- OK 2020-01 0 2020-02 2 2020-03 2 2020-04 1 2020-05 1 Time taken: 11.048 seconds, Fetched: 5 row(s)
(5)计算截止当前月的退费人数。
由于步骤4计算的是下一个月退费的人数,要计算截止当前退费人数,我们先将上一个月的退费人数获取到当前行再进行累加,SQL如下:
select month
,lag(next_month_cnt,1,0) over(order by month) as lag_month_cnt
from(
select distinct t0.month as month
,sum(if(!array_contains(t1.lag_stu_id_arr,t0.stu_id),1,0)) over(partition by t0.month) as next_month_cnt
from
(select substr(day,1,7) as month
,stu_id
from stu) t0
left join
(
select month
,lead(stu_id_arr,1) over(order by month) as lag_stu_id_arr
from(
select substr(day,1,7) as month
,collect_list(stu_id) as stu_id_arr
from stu
group by substr(day,1,7)
) m
) t1
on t0.month = t1.month
) n
计算结果如下:
-------------------------------------------------------------------------------- OK 2020-01 0 2020-02 0 2020-03 2 2020-04 2 2020-05 1 Time taken: 23.148 seconds, Fetched: 5 row(s)
计算截止当前月的退费人数,最终SQL如下:
select month
,sum(lag_month_cnt) over(order by month)
from(
select month
,lag(next_month_cnt,1,0) over(order by month) as lag_month_cnt
from(
select distinct t0.month as month
,sum(if(!array_contains(t1.lag_stu_id_arr,t0.stu_id),1,0)) over(partition by t0.month) as next_month_cnt
from
(select substr(day,1,7) as month
,stu_id
from stu) t0
left join
(
select month
,lead(stu_id_arr,1) over(order by month) as lag_stu_id_arr
from(
select substr(day,1,7) as month
,collect_list(stu_id) as stu_id_arr
from stu
group by substr(day,1,7)
) m
) t1
on t0.month = t1.month
) n
) o
计算结果如下:
-------------------------------------------------------------------------------- OK 2020-01 0 2020-02 0 2020-03 2 2020-04 4 2020-05 5 Time taken: 20.76 seconds, Fetched: 5 row(s)
3 小 结
本文对存在性问题进行了探讨和研究,此类问题往往需要对不同的记录做对比分析,我们可以先将符合条件的数据域按照collect_set()或collect_list()函数进行聚合转换成数组,然后获取历史的数据域放入当前行,最后利用hive中数组的相关处理手段进行对比分析。常用的hive数组处理函数如expode()、size()、array()、array_contains()等函数,本题就借助于hive ,array_contains()函数进行存在性问题分析。
针对本题的问题分析方法,读者可自行尝试完成以下存在性问题分析,具体需求如下:
- 订单表:
- order_id(订单id), user_id(学生id), lesson_id(课程id), subject_id(学科id), order_time(订单时间)
- 求当前订单之前, 非当前学科,存在多少课程 和 学科。



