栏目分类:
子分类:
返回
名师互学网用户登录
快速导航关闭
当前搜索
当前分类
子分类
实用工具
热门搜索
名师互学网 > IT > 前沿技术 > 大数据 > 大数据系统

大数据SQL题

大数据SQL题

1:每个app下访问次数最多的5个用户

用户登陆日志
app, user_id, datetime
a1,user1,2020-01-01 00:00:00
a2,user2,2020-01-01 00:00:01
a3,user3,2020-01-01 00:00:02
a4,user4,2020-01-01 00:00:03
……
输出每个app下访问次数最多的5个用户

考察窗口函数:

select app,userid,rank from (
select app, userid, row_number() over (partition by app,userid order by c desc) as rank
from (
    select app, userid, count(*) as c 
        from table group by app,userid
)
)
where rank <5

2:一定认识的用户的组合数

现有城市网吧访问数据,字段:网吧id,访客id(身份证号),上线时间,下线时间
规则1、如果有两个用户在一家网吧的前后上下线时间在10分钟以内,则两人可能认识
规则2、如果这两个用户在三家以上网吧出现【规则1】的情况,则两人一定认识
需求:
该城市上网用户中两人一定认识的组合数。

理解题目意思,梳理 join 条件,过滤符合条件的组合。

组合?t1.userid != t2.userid 这样在自己join自己的时候会出现两对。因此用的是t1.userid > t2.userid

select u1, u2,  from (
select t1.userid as u1, t2.userid as u2, t1.store_id as store_id from 
    select * from table as t1
join
    select * from table as t2
on t1.store_id = t2.store_id and t1.userid > t2.userid and 
abs(t1.online_timestamp - t2.online_timestamp) < 600 and abs(t1.offline_timestamp - t2.offline_timestamp) < 600 )
group by u1,u2
having count (distinct store_id) > 3

3:每年收入前一万

现在有中国人从1980年到2020年每年每人的收入记录表A如下:
id,year,income
求每年收入前一万的人 窗口函数+ desc

select id,year,rank from (
select id,year,row_number() over (partition by year order by income desc) as rank
from table)
where rank < 10000

深度思考:每年13亿人,每个reducer处理的数据量很大,那么我们可以将id的前四位当作partition by 的参数,分组取topn,再合并。

select id, year, row_number() over (partition by year order by income desc) as rank_year from (
select id,year,rank from (
select id, year, 
 row_number() over (partition by year,substr(id,1,4) order by income desc) as rank
from table
)
where rank < 10000
)
where rank_year < 10000

参考链接:c(72条消息) SparkSQL—最大连续登录天数、当前连续登录天数、最大连续未登录天数问题_helloxiaozhe的博客-CSDN博客

4:最大连续登录天数(最大连续活跃天数)

表字段,userid, install_time, active_time

为什么要给你注册时间,关键点就在于这个字段。

登陆时间 与 某一个固定时间的间隔天数  date_diff

按照userid分区,按照登陆时间排序的 rank

date_diff 和rank之间的差值 value

方案1:row_number 函数,让两条数据之间产生联系。(最好的方法)

所有的登陆时间减去这个注册时间,就是注册与登录间隔的天数 diff_date,然后按照登陆时间大小给每条排序 rank,如果是连续登录 diff_date - rank 应该是一个固定的值,求取这个固定值 group by count(*) 再取MAX。

2021.1.1  2021.1.2     diff_date: 1   rank:1   value:0

2021.1.1  2021.1.3    diff_date: 2    rank:2   value:0

2021.1.1 2021.1.4    diff_date: 3     rank:3   value:0

2021.1.1 2021.1.6    diff_date: 5     rank:4   value:1

最大连续登陆天数为 3 天

select userid, max(c) from ( 
select userid, value, count(1) as c from (
    select userid, row_number() over (partition by userid order by active_time desc) as 
            rank ,date_diff(active_time, install_time) as diff_time , diff_time-rank as value
    from table
)
group by userid, value)
group by userid

5:当前连续登陆天数(也就是最后一次连续登录的天数)

思路类似上题,上一题是求所有 value 求count之后的最大值,这一题是求最大 value 的count值

spark sql 的 datediff 函数:

> SELECt datediff('2009-07-31', '2009-07-30');

1

--先准备好三个值
with active as (
select userid, row_number() over (partition by userid order by active_time) as rank,
date_diff(active_time, install_time) as date_interval, date_interval - rank as value
from table
)

--当前连续登陆天数,那么肯定要去max(value) 的 value 的count,需要一次join

select userid, count(1) as last_active_date_interval from 
select * from active t1
left join
select userid, max(value) from active group by userid t1
on t1.userid = t2.userid and t1.value= t2.value
group by userid

6:最大连续未登录天数:

2021.1.1  2021.1.2     diff_date: 1   rank:1   interval - rank : 0

2021.1.1  2021.1.3    diff_date: 2    rank:2   interval - rank : 0

2021.1.1 2021.1.4    diff_date: 3     rank:3   interval - rank : 0

2021.1.1 2021.1.6    diff_date: 5     rank:4   interval - rank : 1

假设今天是2021.1.10,今天未登录,没有记录,那么还需考虑用户最后一次活跃日期距当前日期的时间间隔,然后再取最大值。

正常情况,求的应该是 rank 相差等于 1 的之前 diff_date的差值。

最大连续登录是 3 天,当前连续登录是 1 天,最大连续未登录天数 4 天.

这一题求得是rank相差等于1的登陆天数之前的value值的差值(等于连续未登录时间长度),然后求这个长度的最大值即可

先求出登陆时间与注册时间的差值,以及当前登陆时间的rank值,以及时间差值与rank值之间的差值:

with active as (
select userid, row_number() over (partition by userid order by active_date desc) as rank,
date_diff(active_date - install_date) as date_interval, date_interval - rank as value 
from table
)


select userid,max(non_active_date_interval ) from (
select t1.userid as userid, t1.value- t2.value as non_active_date_interval from 
select * from active t1
left join
select * from active t2
on t1.userid = t2.userid and t1.rank = t2.rank-1

union all

select userid, date_diff(current_date - max(active_date)) as non_active_date_interval 
from table group by userid
)
group by userid

7:用户自从安装app以来的连续未登录天数的历史记录。

rank相差为1的记录相join时,将t1的active_date 当作是起始时间,t2的active_date 当作终止时间,这两个时间的中间段是连续未登录天数

with active as (
select userid, row_number() over (partition by userid order by active_date desc) as rank,
date_diff(active_date - install_date) as date_interval, date_interval - rank as value 
from table
)


select userid, start_date, end_date, non_active_date_interval from (
select t1.userid as userid, t1.active_date as start_date, t2.active_date as end_date, t1.value- t2.value as non_active_date_interval from 
select * from active t1
left join
select * from active t2
on t1.userid = t2.userid and t1.rank = t2.rank-1

union all

select userid, max(active_date) as start_date, current_date as end_date, date_diff(current_date - max(active_date)) as non_active_date_interval 
from table group by userid
)
where non_active_date_interval >0

8:连续登录超过3天的用户:

思路同上

with active as (
    select userid, row_number() over (partition by userid order by active_date) as rank, 
           date_diff(active_date, install_date) as date_interval, 
           date_interval - rank as value
    from table
)


select userid from (
    select userid, date_interval, count(1) 
    from active group by userid, date_interval
    having count(1) > 3
)

 方案2:lead 和 lag 函数 (不适用一天多次登录记录的情况呀~有限~)

lead的函数是把某一列数据的后面第n行数据拼接到当前行,lag是把指定列的前面第n行数据拼接到当前行。

lag(column,n,default)
lead(column,n,default)

求登陆时长超过3天的用户

9: 选出连续座位的编号

原数据表:user_seat表

表字段:座位号编号、座位是否可选

 为什么要用abs,都自己join自己了,感觉不需要。。。

select distinct t1.seatedid 
from table t1
join table t2
on abs(t1.seatid - t2.seatid) = 1 and t1.free = 1 and t2.free = 1
select t1.seatedid 
from table t1
join table t2
on t1.seatid - t2.seatid = 1 and t1.free = 1 and t2.free = 1

转载请注明:文章转载自 www.mshxw.com
本文地址:https://www.mshxw.com/it/676094.html
我们一直用心在做
关于我们 文章归档 网站地图 联系我们

版权所有 (c)2021-2022 MSHXW.COM

ICP备案号:晋ICP备2021003244-6号