栏目分类:
子分类:
返回
名师互学网用户登录
快速导航关闭
当前搜索
当前分类
子分类
实用工具
热门搜索
名师互学网 > IT > 前沿技术 > 大数据 > 大数据系统

SQL之存在性问题分析-HQL面试题39

SQL之存在性问题分析-HQL面试题39

目录

0 需求分析

1 数据准备

2 数据分析

3 小 结


0 需求分析

业务中涉及到学生退费的统计问题,求截止当前月退费总人数【退费人数:上月存在,这月不存在的学生个数】。

数据如下:

hive> select * from stu;
OK
2020-01-02	1001
2020-01-02	1002
2020-02-02	1001
2020-02-02	1002
2020-02-02	1003
2020-02-02	1004
2020-03-02	1001
2020-03-02	1002
2020-04-02	1005
2020-05-02	1006
Time taken: 0.487 seconds, Fetched: 10 row(s)

1 数据准备

(1)建表

create table stu
(
    day string ,--'日期',
    stu_id int -- '学生id' 
)row format delimited fields terminated by 't';

(2) 数据

insert into stu values 
("2020-01-02", 1001),
("2020-01-02", 1002),
("2020-02-02", 1001),
("2020-02-02", 1002),
("2020-02-02", 1003),
("2020-02-02", 1004),
("2020-03-02", 1001),
("2020-03-02", 1002),
("2020-04-02", 1005),
("2020-05-02", 1006);

2 数据分析

目标:求上月存在,这月不存在的学生个数。

(1)按照每个月将stu_id进行合并。

合并利用collect_list()函数完成。具体SQL如下:

    select substr(day,1,7) as month
         ,collect_list(stu_id) as stu_id_arr
    from stu
    group by substr(day,1,7)

计算结果如下:

2020-01	[1001,1002]
2020-02	[1001,1002,1003,1004]
2020-03	[1001,1002]
2020-04	[1005]
2020-05	[1006]

(2)按照月份排序,获取下一月的合并后的值,SQL如下

select month
      ,stu_id_arr
      ,lead(stu_id_arr,1) over(order by month) as lag_stu_id_arr
from(
    select substr(day,1,7) as month
         ,collect_list(stu_id) as stu_id_arr
    from stu
    group by substr(day,1,7)
) m

计算结果如下:

OK
2020-01	[1001,1002]	[1001,1002,1003,1004]
2020-02	[1001,1002,1003,1004]	[1001,1002]
2020-03	[1001,1002]	[1005]
2020-04	[1005]	[1006]
2020-05	[1006]	NULL
Time taken: 15.035 seconds, Fetched: 5 row(s)

如果hive中有计算两个数组之间的差集的函数,我们可以利用数组差集函数求出差集然后再size()求出个数,最后在sum求和即可。查阅hive中的关于数组函数,只有array_contains能够帮助我们,该函数表示在数组中查询某个元素是否存在,因而我们可以借助该函数查询某个学生当前id是否在下一组记录里面存在,如果存在标记为0,不存在标记为1。于是思路如下:利用步骤2结果与原表进行关联,获取当前学生id。

查询中间结果如下:

select t0.*
      ,t1.*
from
(select substr(day,1,7) as month
      ,stu_id
from stu) t0
left join
(
select month
      ,lead(stu_id_arr,1) over(order by month) as lag_stu_id_arr
from(
     select substr(day,1,7) as month
           ,collect_list(stu_id) as stu_id_arr
     from stu
     group by substr(day,1,7)
    ) m
) t1
on t0.month = t1.month

结果如下:

OK
2020-01	1001	2020-01	[1001,1002,1003,1004]
2020-01	1002	2020-01	[1001,1002,1003,1004]
2020-02	1001	2020-02	[1001,1002]
2020-02	1002	2020-02	[1001,1002]
2020-02	1003	2020-02	[1001,1002]
2020-02	1004	2020-02	[1001,1002]
2020-03	1001	2020-03	[1005]
2020-03	1002	2020-03	[1005]
2020-04	1005	2020-04	[1006]
2020-05	1006	2020-05	NULL
Time taken: 10.885 seconds, Fetched: 10 row(s)

(3)利用array_contains()函数判断当前的stu_id是否在下一月array数组中,如果存在标记0,不存在标记1.其意义就是,如果当前的学生在下一个月中出现了,说明不是要找的学生,如果不在,就说明在本月出现在下一个月未出现,就是我们要找的学生。具体SQL如下:

select t0.month
     ,t0.stu_id
     ,if(!array_contains(t1.lag_stu_id_arr,t0.stu_id),1,0) flag
from
(select substr(day,1,7) as month
      ,stu_id
from stu) t0
left join
(
select month
      ,lead(stu_id_arr,1) over(order by month) as lag_stu_id_arr
from(
     select substr(day,1,7) as month
           ,collect_list(stu_id) as stu_id_arr
     from stu
     group by substr(day,1,7)
    ) m
) t1
on t0.month = t1.month

结果如下:

--------------------------------------------------------------------------------
OK
2020-01	1001	0
2020-01	1002	0
2020-02	1001	0
2020-02	1002	0
2020-02	1003	1
2020-02	1004	1
2020-03	1001	1
2020-03	1002	1
2020-04	1005	1
2020-05	1006	1
Time taken: 9.941 seconds, Fetched: 10 row(s)

(4) 基于步骤3对上述flag按月求和即为下一个月退费中人数。具体SQL如下:

select distinct t0.month
     ,sum(if(!array_contains(t1.lag_stu_id_arr,t0.stu_id),1,0)) over(partition by t0.month) 
from
(select substr(day,1,7) as month
      ,stu_id
from stu) t0
left join
(
select month
      ,lead(stu_id_arr,1) over(order by month) as lag_stu_id_arr
from(
     select substr(day,1,7) as month
           ,collect_list(stu_id) as stu_id_arr
     from stu
     group by substr(day,1,7)
    ) m
) t1
on t0.month = t1.month

计算结果如下:

--------------------------------------------------------------------------------
OK
2020-01	0
2020-02	2
2020-03	2
2020-04	1
2020-05	1
Time taken: 11.048 seconds, Fetched: 5 row(s)

(5)计算截止当前月的退费人数。

由于步骤4计算的是下一个月退费的人数,要计算截止当前退费人数,我们先将上一个月的退费人数获取到当前行再进行累加,SQL如下:

select month
     ,lag(next_month_cnt,1,0) over(order by month) as lag_month_cnt
from(
    select distinct t0.month as month
         ,sum(if(!array_contains(t1.lag_stu_id_arr,t0.stu_id),1,0)) over(partition by t0.month) as next_month_cnt
    from
        (select substr(day,1,7) as month
              ,stu_id
        from stu) t0
    left join
    (
        select month
              ,lead(stu_id_arr,1) over(order by month) as lag_stu_id_arr
        from(
             select substr(day,1,7) as month
                   ,collect_list(stu_id) as stu_id_arr
             from stu
             group by substr(day,1,7)
            ) m
    ) t1
    on t0.month = t1.month
) n

计算结果如下:

--------------------------------------------------------------------------------
OK
2020-01	0
2020-02	0
2020-03	2
2020-04	2
2020-05	1
Time taken: 23.148 seconds, Fetched: 5 row(s)

计算截止当前月的退费人数,最终SQL如下:

select month
      ,sum(lag_month_cnt) over(order by month)
from(
    select month
         ,lag(next_month_cnt,1,0) over(order by month) as lag_month_cnt
    from(
        select distinct t0.month as month
             ,sum(if(!array_contains(t1.lag_stu_id_arr,t0.stu_id),1,0)) over(partition by t0.month) as next_month_cnt
        from
            (select substr(day,1,7) as month
                  ,stu_id
            from stu) t0
        left join
        (
            select month
                  ,lead(stu_id_arr,1) over(order by month) as lag_stu_id_arr
            from(
                 select substr(day,1,7) as month
                       ,collect_list(stu_id) as stu_id_arr
                 from stu
                 group by substr(day,1,7)
                ) m
        ) t1
        on t0.month = t1.month
    ) n
) o

计算结果如下:

--------------------------------------------------------------------------------
OK
2020-01	0
2020-02	0
2020-03	2
2020-04	4
2020-05	5
Time taken: 20.76 seconds, Fetched: 5 row(s)

3 小 结

本文对存在性问题进行了探讨和研究,此类问题往往需要对不同的记录做对比分析,我们可以先将符合条件的数据域按照collect_set()或collect_list()函数进行聚合转换成数组,然后获取历史的数据域放入当前行,最后利用hive中数组的相关处理手段进行对比分析。常用的hive数组处理函数如expode()、size()、array()、array_contains()等函数,本题就借助于hive ,array_contains()函数进行存在性问题分析。

针对本题的问题分析方法,读者可自行尝试完成以下存在性问题分析,具体需求如下:

  • 订单表:
  • order_id(订单id), user_id(学生id), lesson_id(课程id), subject_id(学科id), order_time(订单时间)
  • 求当前订单之前, 非当前学科,存在多少课程 和 学科。

转载请注明:文章转载自 www.mshxw.com
本文地址:https://www.mshxw.com/it/434104.html
我们一直用心在做
关于我们 文章归档 网站地图 联系我们

版权所有 (c)2021-2022 MSHXW.COM

ICP备案号:晋ICP备2021003244-6号