0 需求分析

1 数据准备

2 数据分析

3 小结

0 需求分析

业务中涉及到学生退费的统计问题，求截止当前月退费总人数【退费人数：上月存在，这月不存在的学生个数】。

数据如下：

hive> select * from stu;
OK
2020-01-02	1001
2020-01-02	1002
2020-02-02	1001
2020-02-02	1002
2020-02-02	1003
2020-02-02	1004
2020-03-02	1001
2020-03-02	1002
2020-04-02	1005
2020-05-02	1006
Time taken: 0.487 seconds, Fetched: 10 row(s)

1 数据准备

（1）建表

create table stu
(
    day string ,--'日期',
    stu_id int -- '学生id' 
)row format delimited fields terminated by 't';

(2) 数据

insert into stu values 
("2020-01-02", 1001),
("2020-01-02", 1002),
("2020-02-02", 1001),
("2020-02-02", 1002),
("2020-02-02", 1003),
("2020-02-02", 1004),
("2020-03-02", 1001),
("2020-03-02", 1002),
("2020-04-02", 1005),
("2020-05-02", 1006);

2 数据分析

目标：求上月存在，这月不存在的学生个数。

（1）按照每个月将stu_id进行合并。

合并利用collect_list()函数完成。具体SQL如下：

    select substr(day,1,7) as month
         ,collect_list(stu_id) as stu_id_arr
    from stu
    group by substr(day,1,7)

计算结果如下：

2020-01	[1001,1002]
2020-02	[1001,1002,1003,1004]
2020-03	[1001,1002]
2020-04	[1005]
2020-05	[1006]

（2）按照月份排序，获取下一月的合并后的值，SQL如下

select month
      ,stu_id_arr
      ,lead(stu_id_arr,1) over(order by month) as lag_stu_id_arr
from(
    select substr(day,1,7) as month
         ,collect_list(stu_id) as stu_id_arr
    from stu
    group by substr(day,1,7)
) m

计算结果如下：

OK
2020-01	[1001,1002]	[1001,1002,1003,1004]
2020-02	[1001,1002,1003,1004]	[1001,1002]
2020-03	[1001,1002]	[1005]
2020-04	[1005]	[1006]
2020-05	[1006]	NULL
Time taken: 15.035 seconds, Fetched: 5 row(s)

如果hive中有计算两个数组之间的差集的函数，我们可以利用数组差集函数求出差集然后再size()求出个数，最后在sum求和即可。查阅hive中的关于数组函数，只有array_contains能够帮助我们，该函数表示在数组中查询某个元素是否存在，因而我们可以借助该函数查询某个学生当前id是否在下一组记录里面存在，如果存在标记为0，不存在标记为1。于是思路如下：利用步骤2结果与原表进行关联，获取当前学生id。

查询中间结果如下：

select t0.*
      ,t1.*
from
(select substr(day,1,7) as month
      ,stu_id
from stu) t0
left join
(
select month
      ,lead(stu_id_arr,1) over(order by month) as lag_stu_id_arr
from(
     select substr(day,1,7) as month
           ,collect_list(stu_id) as stu_id_arr
     from stu
     group by substr(day,1,7)
    ) m
) t1
on t0.month = t1.month

结果如下：

OK
2020-01	1001	2020-01	[1001,1002,1003,1004]
2020-01	1002	2020-01	[1001,1002,1003,1004]
2020-02	1001	2020-02	[1001,1002]
2020-02	1002	2020-02	[1001,1002]
2020-02	1003	2020-02	[1001,1002]
2020-02	1004	2020-02	[1001,1002]
2020-03	1001	2020-03	[1005]
2020-03	1002	2020-03	[1005]
2020-04	1005	2020-04	[1006]
2020-05	1006	2020-05	NULL
Time taken: 10.885 seconds, Fetched: 10 row(s)

（3）利用array_contains()函数判断当前的stu_id是否在下一月array数组中，如果存在标记0，不存在标记1.其意义就是，如果当前的学生在下一个月中出现了，说明不是要找的学生，如果不在，就说明在本月出现在下一个月未出现，就是我们要找的学生。具体SQL如下：

select t0.month
     ,t0.stu_id
     ,if(!array_contains(t1.lag_stu_id_arr,t0.stu_id),1,0) flag
from
(select substr(day,1,7) as month
      ,stu_id
from stu) t0
left join
(
select month
      ,lead(stu_id_arr,1) over(order by month) as lag_stu_id_arr
from(
     select substr(day,1,7) as month
           ,collect_list(stu_id) as stu_id_arr
     from stu
     group by substr(day,1,7)
    ) m
) t1
on t0.month = t1.month

结果如下：

--------------------------------------------------------------------------------
OK
2020-01	1001	0
2020-01	1002	0
2020-02	1001	0
2020-02	1002	0
2020-02	1003	1
2020-02	1004	1
2020-03	1001	1
2020-03	1002	1
2020-04	1005	1
2020-05	1006	1
Time taken: 9.941 seconds, Fetched: 10 row(s)

(4) 基于步骤3对上述flag按月求和即为下一个月退费中人数。具体SQL如下：

select distinct t0.month
     ,sum(if(!array_contains(t1.lag_stu_id_arr,t0.stu_id),1,0)) over(partition by t0.month) 
from
(select substr(day,1,7) as month
      ,stu_id
from stu) t0
left join
(
select month
      ,lead(stu_id_arr,1) over(order by month) as lag_stu_id_arr
from(
     select substr(day,1,7) as month
           ,collect_list(stu_id) as stu_id_arr
     from stu
     group by substr(day,1,7)
    ) m
) t1
on t0.month = t1.month

计算结果如下：

--------------------------------------------------------------------------------
OK
2020-01	0
2020-02	2
2020-03	2
2020-04	1
2020-05	1
Time taken: 11.048 seconds, Fetched: 5 row(s)

（5）计算截止当前月的退费人数。

由于步骤4计算的是下一个月退费的人数，要计算截止当前退费人数，我们先将上一个月的退费人数获取到当前行再进行累加，SQL如下：

select month
     ,lag(next_month_cnt,1,0) over(order by month) as lag_month_cnt
from(
    select distinct t0.month as month
         ,sum(if(!array_contains(t1.lag_stu_id_arr,t0.stu_id),1,0)) over(partition by t0.month) as next_month_cnt
    from
        (select substr(day,1,7) as month
              ,stu_id
        from stu) t0
    left join
    (
        select month
              ,lead(stu_id_arr,1) over(order by month) as lag_stu_id_arr
        from(
             select substr(day,1,7) as month
                   ,collect_list(stu_id) as stu_id_arr
             from stu
             group by substr(day,1,7)
            ) m
    ) t1
    on t0.month = t1.month
) n

计算结果如下：

--------------------------------------------------------------------------------
OK
2020-01	0
2020-02	0
2020-03	2
2020-04	2
2020-05	1
Time taken: 23.148 seconds, Fetched: 5 row(s)

计算截止当前月的退费人数，最终SQL如下：

select month
      ,sum(lag_month_cnt) over(order by month)
from(
    select month
         ,lag(next_month_cnt,1,0) over(order by month) as lag_month_cnt
    from(
        select distinct t0.month as month
             ,sum(if(!array_contains(t1.lag_stu_id_arr,t0.stu_id),1,0)) over(partition by t0.month) as next_month_cnt
        from
            (select substr(day,1,7) as month
                  ,stu_id
            from stu) t0
        left join
        (
            select month
                  ,lead(stu_id_arr,1) over(order by month) as lag_stu_id_arr
            from(
                 select substr(day,1,7) as month
                       ,collect_list(stu_id) as stu_id_arr
                 from stu
                 group by substr(day,1,7)
                ) m
        ) t1
        on t0.month = t1.month
    ) n
) o

计算结果如下：

--------------------------------------------------------------------------------
OK
2020-01	0
2020-02	0
2020-03	2
2020-04	4
2020-05	5
Time taken: 20.76 seconds, Fetched: 5 row(s)

3 小结
本文对存在性问题进行了探讨和研究，此类问题往往需要对不同的记录做对比分析，我们可以先将符合条件的数据域按照collect_set（）或collect_list（）函数进行聚合转换成数组，然后获取历史的数据域放入当前行，最后利用hive中数组的相关处理手段进行对比分析。常用的hive数组处理函数如expode()、size()、array()、array_contains()等函数，本题就借助于hive ,array_contains()函数进行存在性问题分析。

针对本题的问题分析方法，读者可自行尝试完成以下存在性问题分析，具体需求如下：

订单表:
order_id(订单id), user_id(学生id), lesson_id(课程id), subject_id(学科id), order_time(订单时间)
求当前订单之前，非当前学科，存在多少课程和学科。

SQL之存在性问题分析-HQL面试题39

大数据系统相关栏目本月热门文章