- 1. Window function basics
- 1.1 The over keyword
- 1.2 The partition by clause
- 1.3 The order by clause
- 2. The window clause
- 3. Ranking functions
- 4. Sequence functions
- 5. GROUPING SETS, GROUPING__ID, CUBE and ROLLUP
Window functions, also known as analytic (OLAP) functions, are powerful tools for complex reporting needs and show up in many scenarios. A window function computes an aggregate over a group of rows, but unlike an aggregate function, which returns one row per group, a window function returns a value for every row in the group.
1. Window function basics

1.1 The over keyword

Before using a window function, you generally open a window with over():
```sql
select ename,
       sal,
       sum(sal) over() salsum,
       concat(round(sal / sum(sal) over() * 100, 1), '%') ratiosal
from emp;
```

```
+---------+-------+---------+-----------+
| ename   | sal   | salsum  | ratiosal  |
+---------+-------+---------+-----------+
| MILLER  | 1300  | 29025   | 4.5%      |
| FORD    | 3000  | 29025   | 10.3%     |
| JAMES   | 950   | 29025   | 3.3%      |
| ADAMS   | 1100  | 29025   | 3.8%      |
| TURNER  | 1500  | 29025   | 5.2%      |
| KING    | 5000  | 29025   | 17.2%     |
| SCOTT   | 3000  | 29025   | 10.3%     |
| CLARK   | 2450  | 29025   | 8.4%      |
| BLAKE   | 2850  | 29025   | 9.8%      |
| MARTIN  | 1250  | 29025   | 4.3%      |
| JONES   | 2975  | 29025   | 10.2%     |
| WARD    | 1250  | 29025   | 4.3%      |
| ALLEN   | 1600  | 29025   | 5.5%      |
| SMITH   | 800   | 29025   | 2.8%      |
+---------+-------+---------+-----------+
```
A window function produces a value for every row; with no arguments, over() uses the entire result set as the window.
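As a cross-check outside Hive, the same semantics can be sketched with Python's sqlite3 module (SQLite 3.25+ implements standard window functions); the emp rows below are a made-up subset:

```python
import sqlite3

# In-memory stand-in for the emp table (made-up subset of rows).
con = sqlite3.connect(":memory:")
con.execute("create table emp (ename text, sal int)")
con.executemany("insert into emp values (?, ?)",
                [("KING", 5000), ("CLARK", 2450), ("MILLER", 1300)])

# over() with no arguments: every row sees the aggregate over the whole result set.
rows = con.execute("""
    select ename, sal,
           sum(sal) over() as salsum,
           round(sal * 100.0 / sum(sal) over(), 1) as ratio
    from emp
""").fetchall()
for r in rows:
    print(r)
```

Every row carries the same salsum (8750 for this toy data) alongside its own sal.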
1.2 The partition by clause

partition by splits the over() window into partitions and computes the statistic per partition; the window size equals the partition size:
```sql
select deptno,
       ename,
       sal,
       sum(sal) over(partition by deptno) salsum
from emp
order by deptno
limit 20;
```

```
+---------+---------+-------+---------+
| deptno  | ename   | sal   | salsum  |
+---------+---------+-------+---------+
| 10      | MILLER  | 1300  | 8750    |
| 10      | KING    | 5000  | 8750    |
| 10      | CLARK   | 2450  | 8750    |
| 20      | ADAMS   | 1100  | 10875   |
| 20      | SCOTT   | 3000  | 10875   |
| 20      | FORD    | 3000  | 10875   |
| 20      | JONES   | 2975  | 10875   |
| 20      | SMITH   | 800   | 10875   |
| 30      | BLAKE   | 2850  | 9400    |
| 30      | WARD    | 1250  | 9400    |
| 30      | MARTIN  | 1250  | 9400    |
| 30      | TURNER  | 1500  | 9400    |
| 30      | JAMES   | 950   | 9400    |
| 30      | ALLEN   | 1600  | 9400    |
+---------+---------+-------+---------+
```
1.3 The order by clause
```sql
select ename,
       sal,
       deptno,
       sum(sal) over(partition by deptno) salsum1,
       sum(sal) over(partition by deptno order by sal desc) salsum
from emp;
```

```
+---------+-------+---------+----------+---------+
| ename   | sal   | deptno  | salsum1  | salsum  |
+---------+-------+---------+----------+---------+
| KING    | 5000  | 10      | 8750     | 5000    |
| CLARK   | 2450  | 10      | 8750     | 7450    |
| MILLER  | 1300  | 10      | 8750     | 8750    |
| FORD    | 3000  | 20      | 10875    | 6000    |
| SCOTT   | 3000  | 20      | 10875    | 6000    |
| JONES   | 2975  | 20      | 10875    | 8975    |
| ADAMS   | 1100  | 20      | 10875    | 10075   |
| SMITH   | 800   | 20      | 10875    | 10875   |
| BLAKE   | 2850  | 30      | 9400     | 2850    |
| ALLEN   | 1600  | 30      | 9400     | 4450    |
| TURNER  | 1500  | 30      | 9400     | 5950    |
| MARTIN  | 1250  | 30      | 9400     | 8450    |
| WARD    | 1250  | 30      | 9400     | 8450    |
| JAMES   | 950   | 30      | 9400     | 9400    |
+---------+-------+---------+----------+---------+
```

Note that in dept 20, FORD and SCOTT tie on sal = 3000, so both rows get the same cumulative salsum (6000).
Without an order by clause, over() returns the aggregate over the whole partition on every row; with order by, it returns a running (cumulative) total within the partition, and rows that tie on the sort key share the same cumulative value.
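The contrast between the two forms, and the tie behavior, can be sketched with sqlite3 over made-up dept-20 rows (SQLite follows the same SQL-standard default frame, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which takes in every peer of the current sort key):

```python
import sqlite3

# Toy dept-20 rows, including the FORD/SCOTT tie on sal = 3000.
con = sqlite3.connect(":memory:")
con.execute("create table emp (ename text, sal int, deptno int)")
con.executemany("insert into emp values (?, ?, ?)", [
    ("FORD", 3000, 20), ("SCOTT", 3000, 20), ("JONES", 2975, 20), ("SMITH", 800, 20),
])

rows = con.execute("""
    select ename, sal,
           sum(sal) over(partition by deptno)                   as total,    -- whole partition
           sum(sal) over(partition by deptno order by sal desc) as running   -- cumulative
    from emp order by sal desc
""").fetchall()
for r in rows:
    print(r)
# The two tied rows both get running = 6000, since the default frame
# with order by includes all peers of the current row's sort key.
```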
2. The window clause

rows between ... and ...

To slice the window at a finer granularity, use the window (frame) clause, which offers the following bounds:
- unbounded preceding: the first row of the partition
- n preceding: the n rows before the current row
- current row: the current row
- n following: the n rows after the current row
- unbounded following: the last row of the partition
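A minimal sqlite3 sketch of a moving frame built from these bounds (made-up data):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table t (k int, v int)")
con.executemany("insert into t values (?, ?)", [(i, i * 10) for i in range(1, 6)])

# Moving 3-row frame: previous row + current row + next row.
rows = con.execute("""
    select k, v,
           sum(v) over(order by k rows between 1 preceding and 1 following) as mov
    from t order by k
""").fetchall()
print(rows)
# At the edges the frame simply shrinks: the first row has no preceding row,
# the last row has no following row.
```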
```sql
-- Equivalent queries: within each partition, the sum from the first row to the current row
select ename,
       sal,
       deptno,
       sum(sal) over(partition by deptno order by ename)
from emp;

select ename,
       sal,
       deptno,
       sum(sal) over(partition by deptno order by ename
                     rows between unbounded preceding and current row)
from emp;
```

```
+---------+-------+---------+---------------+
| ename   | sal   | deptno  | sum_window_0  |
+---------+-------+---------+---------------+
| CLARK   | 2450  | 10      | 2450          |
| KING    | 5000  | 10      | 7450          |
| MILLER  | 1300  | 10      | 8750          |
| ADAMS   | 1100  | 20      | 1100          |
| FORD    | 3000  | 20      | 4100          |
| JONES   | 2975  | 20      | 7075          |
| SCOTT   | 3000  | 20      | 10075         |
| SMITH   | 800   | 20      | 10875         |
| ALLEN   | 1600  | 30      | 1600          |
| BLAKE   | 2850  | 30      | 4450          |
| JAMES   | 950   | 30      | 5400          |
| MARTIN  | 1250  | 30      | 6650          |
| TURNER  | 1500  | 30      | 8150          |
| WARD    | 1250  | 30      | 9400          |
+---------+-------+---------+---------------+
```
```sql
-- Within each partition, the sum from the first row to the last row
-- Equivalent to sum(sal) over(partition by deptno)
select ename,
       sal,
       deptno,
       sum(sal) over(partition by deptno order by ename
                     rows between unbounded preceding and unbounded following)
from emp;
```

```
+---------+-------+---------+---------------+
| ename   | sal   | deptno  | sum_window_0  |
+---------+-------+---------+---------------+
| CLARK   | 2450  | 10      | 8750          |
| KING    | 5000  | 10      | 8750          |
| MILLER  | 1300  | 10      | 8750          |
| ADAMS   | 1100  | 20      | 10875         |
| FORD    | 3000  | 20      | 10875         |
| JONES   | 2975  | 20      | 10875         |
| SCOTT   | 3000  | 20      | 10875         |
| SMITH   | 800   | 20      | 10875         |
| ALLEN   | 1600  | 30      | 9400          |
| BLAKE   | 2850  | 30      | 9400          |
| JAMES   | 950   | 30      | 9400          |
| MARTIN  | 1250  | 30      | 9400          |
| TURNER  | 1500  | 30      | 9400          |
| WARD    | 1250  | 30      | 9400          |
+---------+-------+---------+---------------+
```
```sql
-- Within each partition, the sum of the previous row, the current row and the next row
select ename,
       sal,
       deptno,
       sum(sal) over(partition by deptno order by ename
                     rows between 1 preceding and 1 following)
from emp;
```

```
+---------+-------+---------+---------------+
| ename   | sal   | deptno  | sum_window_0  |
+---------+-------+---------+---------------+
| CLARK   | 2450  | 10      | 7450          |
| KING    | 5000  | 10      | 8750          |
| MILLER  | 1300  | 10      | 6300          |
| ADAMS   | 1100  | 20      | 4100          |
| FORD    | 3000  | 20      | 7075          |
| JONES   | 2975  | 20      | 8975          |
| SCOTT   | 3000  | 20      | 6775          |
| SMITH   | 800   | 20      | 3800          |
| ALLEN   | 1600  | 30      | 4450          |
| BLAKE   | 2850  | 30      | 5400          |
| JAMES   | 950   | 30      | 5050          |
| MARTIN  | 1250  | 30      | 3700          |
| TURNER  | 1500  | 30      | 4000          |
| WARD    | 1250  | 30      | 2750          |
+---------+-------+---------+---------------+
```
3. Ranking functions
All ranking functions start from 1 and produce each row's rank within its partition.

- row_number(): ranks strictly increase and never repeat, e.g. 1, 2, 3, 4, ...
- rank(): tied values share a rank and leave a gap afterwards, e.g. 1, 2, 2, 4, 5, ...
- dense_rank(): tied values share a rank with no gap, e.g. 1, 2, 2, 3, 4, ...
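The three behaviors side by side, as a sqlite3 sketch over made-up scores:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table scores (s int)")
con.executemany("insert into scores values (?)", [(100,), (100,), (99,), (98,)])

rows = con.execute("""
    select s,
           row_number() over(order by s desc) as rn,   -- 1,2,3,4: never repeats
           rank()       over(order by s desc) as rk,   -- 1,1,3,4: ties leave a gap
           dense_rank() over(order by s desc) as dr    -- 1,1,2,3: no gap after ties
    from scores order by rn
""").fetchall()
print(rows)
```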
Data file (rank.txt):

```
class1,s01,100
class1,s03,100
class1,s05,100
class1,s07,99
class1,s09,98
class1,s02,98
class1,s04,97
class2,s21,100
class2,s24,99
class2,s27,99
class2,s22,98
class2,s25,98
class2,s28,97
class2,s26,96
```

```sql
-- create the table and load the data
create table rank(
  cname string,
  sname string,
  score int
)
row format delimited fields terminated by ',';

load data local inpath '/root/bigdata/test_data/hive/rank.txt' into table rank;

select cname,
       sname,
       score,
       row_number() over (partition by cname order by score desc) rank1,
       rank() over (partition by cname order by score desc) rank2,
       dense_rank() over (partition by cname order by score desc) rank3
from rank;
```

```
+---------+--------+--------+--------+--------+--------+
| cname   | sname  | score  | rank1  | rank2  | rank3  |
+---------+--------+--------+--------+--------+--------+
| class1  | s01    | 100    | 1      | 1      | 1      |
| class1  | s03    | 100    | 2      | 1      | 1      |
| class1  | s05    | 100    | 3      | 1      | 1      |
| class1  | s07    | 99     | 4      | 4      | 2      |
| class1  | s09    | 98     | 5      | 5      | 3      |
| class1  | s02    | 98     | 6      | 5      | 3      |
| class1  | s04    | 97     | 7      | 7      | 4      |
| class2  | s21    | 100    | 1      | 1      | 1      |
| class2  | s24    | 99     | 2      | 2      | 2      |
| class2  | s27    | 99     | 3      | 2      | 2      |
| class2  | s22    | 98     | 4      | 4      | 3      |
| class2  | s25    | 98     | 5      | 4      | 3      |
| class2  | s28    | 97     | 6      | 6      | 4      |
| class2  | s26    | 96     | 7      | 7      | 5      |
+---------+--------+--------+--------+--------+--------+
```

4. Sequence functions
- lag: returns the value from the previous row of the current row
- lead: returns the value from the next row of the current row
- first_value: the first value, in partition sort order, up to the current row
- last_value: the last value, in partition sort order, up to the current row (the default frame ends at the current row, so widen it with rows between unbounded preceding and unbounded following to get the partition's true last value)
- ntile: cuts the partition's ordered rows into n slices and returns the slice number of the current row
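A sqlite3 sketch of lag, lead and ntile over made-up page views (the table and column names here are hypothetical):

```python
import sqlite3

# Made-up daily page views for one cookie; d stands in for the date.
con = sqlite3.connect(":memory:")
con.execute("create table pv (d int, v int)")
con.executemany("insert into pv values (?, ?)", [(1, 1), (2, 5), (3, 7), (4, 3)])

rows = con.execute("""
    select d, v,
           lag(v)   over(order by d) as prev,    -- NULL on the first row
           lead(v)  over(order by d) as nxt,     -- NULL on the last row
           ntile(2) over(order by d) as bucket   -- 4 rows cut into 2 slices of 2
    from pv order by d
""").fetchall()
for r in rows:
    print(r)
```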
Data file (userpv.dat):

```
cookie1,2019-04-10,1
cookie1,2019-04-11,5
cookie1,2019-04-12,7
cookie1,2019-04-13,3
cookie1,2019-04-14,2
cookie1,2019-04-15,4
cookie1,2019-04-16,4
cookie2,2019-04-10,2
cookie2,2019-04-11,3
cookie2,2019-04-12,5
cookie2,2019-04-13,6
cookie2,2019-04-14,3
cookie2,2019-04-15,9
cookie2,2019-04-16,7
```

```sql
-- create the table
create table userpv(
  cid string,
  ctime date,
  pv int
)
row format delimited fields terminated by ",";

-- load the data
load data local inpath '/home/hadoop/data/userpv.dat' into table userpv;

select * from userpv;
```

```
+-------------+---------------+------------+
| userpv.cid  | userpv.ctime  | userpv.pv  |
+-------------+---------------+------------+
| cookie1     | 2019-04-10    | 1          |
| cookie1     | 2019-04-11    | 5          |
| cookie1     | 2019-04-12    | 7          |
| cookie1     | 2019-04-13    | 3          |
| cookie1     | 2019-04-14    | 2          |
| cookie1     | 2019-04-15    | 4          |
| cookie1     | 2019-04-16    | 4          |
| cookie2     | 2019-04-10    | 2          |
| cookie2     | 2019-04-11    | 3          |
| cookie2     | 2019-04-12    | 5          |
| cookie2     | 2019-04-13    | 6          |
| cookie2     | 2019-04-14    | 3          |
| cookie2     | 2019-04-15    | 9          |
| cookie2     | 2019-04-16    | 7          |
+-------------+---------------+------------+
```
```sql
-- lag: the value from the previous row
-- lead: the value from the next row
select cid,
       ctime,
       pv,
       lag(pv) over(partition by cid order by ctime) lagpv,
       lead(pv) over(partition by cid order by ctime) leadpv
from userpv;
```

```
+----------+-------------+-----+--------+---------+
| cid      | ctime       | pv  | lagpv  | leadpv  |
+----------+-------------+-----+--------+---------+
| cookie1  | 2019-04-10  | 1   | NULL   | 5       |
| cookie1  | 2019-04-11  | 5   | 1      | 7       |
| cookie1  | 2019-04-12  | 7   | 5      | 3       |
| cookie1  | 2019-04-13  | 3   | 7      | 2       |
| cookie1  | 2019-04-14  | 2   | 3      | 4       |
| cookie1  | 2019-04-15  | 4   | 2      | 4       |
| cookie1  | 2019-04-16  | 4   | 4      | NULL    |
| cookie2  | 2019-04-10  | 2   | NULL   | 3       |
| cookie2  | 2019-04-11  | 3   | 2      | 5       |
| cookie2  | 2019-04-12  | 5   | 3      | 6       |
| cookie2  | 2019-04-13  | 6   | 5      | 3       |
| cookie2  | 2019-04-14  | 3   | 6      | 9       |
| cookie2  | 2019-04-15  | 9   | 3      | 7       |
| cookie2  | 2019-04-16  | 7   | 9      | NULL    |
+----------+-------------+-----+--------+---------+
```

```sql
-- first_value / last_value over the whole partition
select cid,
       ctime,
       pv,
       first_value(pv) over (partition by cid order by ctime
                             rows between unbounded preceding and unbounded following) as firstpv,
       last_value(pv) over (partition by cid order by ctime
                            rows between unbounded preceding and unbounded following) as lastpv
from userpv;
```

```
+----------+-------------+-----+----------+---------+
| cid      | ctime       | pv  | firstpv  | lastpv  |
+----------+-------------+-----+----------+---------+
| cookie1  | 2019-04-10  | 1   | 1        | 4       |
| cookie1  | 2019-04-11  | 5   | 1        | 4       |
| cookie1  | 2019-04-12  | 7   | 1        | 4       |
| cookie1  | 2019-04-13  | 3   | 1        | 4       |
| cookie1  | 2019-04-14  | 2   | 1        | 4       |
| cookie1  | 2019-04-15  | 4   | 1        | 4       |
| cookie1  | 2019-04-16  | 4   | 1        | 4       |
| cookie2  | 2019-04-10  | 2   | 2        | 7       |
| cookie2  | 2019-04-11  | 3   | 2        | 7       |
| cookie2  | 2019-04-12  | 5   | 2        | 7       |
| cookie2  | 2019-04-13  | 6   | 2        | 7       |
| cookie2  | 2019-04-14  | 3   | 2        | 7       |
| cookie2  | 2019-04-15  | 9   | 2        | 7       |
| cookie2  | 2019-04-16  | 7   | 2        | 7       |
+----------+-------------+-----+----------+---------+
```

```sql
-- ntile: group by cid and cut each group's rows into 2 slices
select cid,
       ctime,
       pv,
       ntile(2) over(partition by cid order by ctime) ntile
from userpv;
```

```
+----------+-------------+-----+--------+
| cid      | ctime       | pv  | ntile  |
+----------+-------------+-----+--------+
| cookie1  | 2019-04-10  | 1   | 1      |
| cookie1  | 2019-04-11  | 5   | 1      |
| cookie1  | 2019-04-12  | 7   | 1      |
| cookie1  | 2019-04-13  | 3   | 1      |
| cookie1  | 2019-04-14  | 2   | 2      |
| cookie1  | 2019-04-15  | 4   | 2      |
| cookie1  | 2019-04-16  | 4   | 2      |
| cookie2  | 2019-04-10  | 2   | 1      |
| cookie2  | 2019-04-11  | 3   | 1      |
| cookie2  | 2019-04-12  | 5   | 1      |
| cookie2  | 2019-04-13  | 6   | 1      |
| cookie2  | 2019-04-14  | 3   | 2      |
| cookie2  | 2019-04-15  | 9   | 2      |
| cookie2  | 2019-04-16  | 7   | 2      |
+----------+-------------+-----+--------+
```

5. GROUPING SETS, GROUPING__ID, CUBE and ROLLUP
Data file (cookie5.txt):

```
2015-03,2015-03-10,cookie1
2015-03,2015-03-10,cookie5
2015-03,2015-03-12,cookie7
2015-04,2015-04-12,cookie3
2015-04,2015-04-13,cookie2
2015-04,2015-04-13,cookie4
2015-04,2015-04-16,cookie4
2015-03,2015-03-10,cookie2
2015-03,2015-03-10,cookie3
2015-04,2015-04-12,cookie5
2015-04,2015-04-13,cookie6
2015-04,2015-04-15,cookie3
2015-04,2015-04-15,cookie2
2015-04,2015-04-16,cookie1
```

```sql
create table cookie5(month string, day string, cookieid string)
row format delimited fields terminated by ',';

load data local inpath "/home/hadoop/cookie5.txt" into table cookie5;

select * from cookie5;
```

```
+----------------+--------------+-------------------+
| cookie5.month  | cookie5.day  | cookie5.cookieid  |
+----------------+--------------+-------------------+
| 2015-03        | 2015-03-10   | cookie1           |
| 2015-03        | 2015-03-10   | cookie5           |
| 2015-03        | 2015-03-12   | cookie7           |
| 2015-04        | 2015-04-12   | cookie3           |
| 2015-04        | 2015-04-13   | cookie2           |
| 2015-04        | 2015-04-13   | cookie4           |
| 2015-04        | 2015-04-16   | cookie4           |
| 2015-03        | 2015-03-10   | cookie2           |
| 2015-03        | 2015-03-10   | cookie3           |
| 2015-04        | 2015-04-12   | cookie5           |
| 2015-04        | 2015-04-13   | cookie6           |
| 2015-04        | 2015-04-15   | cookie3           |
| 2015-04        | 2015-04-15   | cookie2           |
| 2015-04        | 2015-04-16   | cookie1           |
+----------------+--------------+-------------------+
```
```sql
-- In one GROUP BY query, aggregate along several dimension combinations at once;
-- this is equivalent to a UNION ALL of the individual GROUP BY result sets.
-- GROUPING__ID indicates which grouping set a row belongs to.
select month,
       day,
       count(distinct cookieid) as uv,
       GROUPING__ID
from cookie5
group by month, day
grouping sets (month, day)
order by GROUPING__ID;

-- Equivalent to
SELECT month, NULL, COUNT(DISTINCT cookieid) AS uv, 1 AS GROUPING__ID FROM cookie5 GROUP BY month
UNION ALL
SELECT NULL, day, COUNT(DISTINCT cookieid) AS uv, 2 AS GROUPING__ID FROM cookie5 GROUP BY day;
```

```
+----------+-------------+-----+---------------+
| month    | day         | uv  | grouping__id  |
+----------+-------------+-----+---------------+
| 2015-04  | NULL        | 6   | 1             |
| 2015-03  | NULL        | 5   | 1             |
| NULL     | 2015-04-13  | 3   | 2             |
| NULL     | 2015-03-10  | 4   | 2             |
| NULL     | 2015-04-16  | 2   | 2             |
| NULL     | 2015-04-15  | 2   | 2             |
| NULL     | 2015-04-12  | 2   | 2             |
| NULL     | 2015-03-12  | 1   | 2             |
+----------+-------------+-----+---------------+
```

```sql
SELECT month,
       day,
       COUNT(DISTINCT cookieid) AS uv,
       GROUPING__ID
FROM cookie5
GROUP BY month, day
GROUPING SETS (month, day, (month, day))
ORDER BY GROUPING__ID
LIMIT 100;

-- Equivalent to (the GROUPING__ID literals match the output below)
SELECT month, day, COUNT(DISTINCT cookieid) AS uv, 0 AS GROUPING__ID FROM cookie5 GROUP BY month, day
UNION ALL
SELECT month, NULL, COUNT(DISTINCT cookieid) AS uv, 1 AS GROUPING__ID FROM cookie5 GROUP BY month
UNION ALL
SELECT NULL, day, COUNT(DISTINCT cookieid) AS uv, 2 AS GROUPING__ID FROM cookie5 GROUP BY day;
```

```
+----------+-------------+-----+---------------+
| month    | day         | uv  | grouping__id  |
+----------+-------------+-----+---------------+
| 2015-03  | 2015-03-12  | 1   | 0             |
| 2015-03  | 2015-03-10  | 4   | 0             |
| 2015-04  | 2015-04-13  | 3   | 0             |
| 2015-04  | 2015-04-15  | 2   | 0             |
| 2015-04  | 2015-04-12  | 2   | 0             |
| 2015-04  | 2015-04-16  | 2   | 0             |
| 2015-03  | NULL        | 5   | 1             |
| 2015-04  | NULL        | 6   | 1             |
| NULL     | 2015-03-12  | 1   | 2             |
| NULL     | 2015-04-12  | 2   | 2             |
| NULL     | 2015-04-13  | 3   | 2             |
| NULL     | 2015-04-15  | 2   | 2             |
| NULL     | 2015-04-16  | 2   | 2             |
| NULL     | 2015-03-10  | 4   | 2             |
+----------+-------------+-----+---------------+
```
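The UNION ALL equivalence can be checked in any engine; SQLite has no GROUPING SETS, so the sqlite3 sketch below builds the equivalent by hand over a made-up visits table (a small stand-in for cookie5):

```python
import sqlite3

# Made-up visits table standing in for cookie5.
con = sqlite3.connect(":memory:")
con.execute("create table visits (month text, day text, cookieid text)")
con.executemany("insert into visits values (?, ?, ?)", [
    ("2015-03", "2015-03-10", "cookie1"), ("2015-03", "2015-03-10", "cookie2"),
    ("2015-04", "2015-04-12", "cookie3"), ("2015-04", "2015-04-13", "cookie2"),
])

# GROUPING SETS (month, day) written out as the UNION ALL it is equivalent to.
rows = con.execute("""
    select month, null as day, count(distinct cookieid) as uv, 1 as gid
    from visits group by month
    union all
    select null, day, count(distinct cookieid) as uv, 2 as gid
    from visits group by day
    order by gid, month, day
""").fetchall()
for r in rows:
    print(r)
```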
```sql
-- CUBE: aggregate over every combination of the GROUP BY dimensions
SELECT month,
       day,
       COUNT(DISTINCT cookieid) AS uv,
       GROUPING__ID
FROM cookie5
GROUP BY month, day
WITH CUBE
ORDER BY GROUPING__ID;

-- Equivalent to (the GROUPING__ID literals match the output below)
SELECT month, day, COUNT(DISTINCT cookieid) AS uv, 0 AS GROUPING__ID FROM cookie5 GROUP BY month, day
UNION ALL
SELECT month, NULL, COUNT(DISTINCT cookieid) AS uv, 1 AS GROUPING__ID FROM cookie5 GROUP BY month
UNION ALL
SELECT NULL, day, COUNT(DISTINCT cookieid) AS uv, 2 AS GROUPING__ID FROM cookie5 GROUP BY day
UNION ALL
SELECT NULL, NULL, COUNT(DISTINCT cookieid) AS uv, 3 AS GROUPING__ID FROM cookie5;
```

```
+----------+-------------+-----+---------------+
| month    | day         | uv  | grouping__id  |
+----------+-------------+-----+---------------+
| 2015-03  | 2015-03-10  | 4   | 0             |
| 2015-03  | 2015-03-12  | 1   | 0             |
| 2015-04  | 2015-04-12  | 2   | 0             |
| 2015-04  | 2015-04-13  | 3   | 0             |
| 2015-04  | 2015-04-15  | 2   | 0             |
| 2015-04  | 2015-04-16  | 2   | 0             |
| 2015-04  | NULL        | 6   | 1             |
| 2015-03  | NULL        | 5   | 1             |
| NULL     | 2015-04-12  | 2   | 2             |
| NULL     | 2015-03-10  | 4   | 2             |
| NULL     | 2015-04-13  | 3   | 2             |
| NULL     | 2015-04-15  | 2   | 2             |
| NULL     | 2015-04-16  | 2   | 2             |
| NULL     | 2015-03-12  | 1   | 2             |
| NULL     | NULL        | 7   | 3             |
+----------+-------------+-----+---------------+
```
```sql
-- ROLLUP: a subset of CUBE; hierarchical aggregation anchored on the leftmost dimension
SELECT month,
       day,
       COUNT(DISTINCT cookieid) AS uv,
       GROUPING__ID
FROM cookie5
GROUP BY month, day
WITH ROLLUP
ORDER BY GROUPING__ID;

-- Equivalent to (the GROUPING__ID literals match the output below)
SELECT month, day, COUNT(DISTINCT cookieid) AS uv, 0 AS GROUPING__ID FROM cookie5 GROUP BY month, day
UNION ALL
SELECT month, NULL, COUNT(DISTINCT cookieid) AS uv, 1 AS GROUPING__ID FROM cookie5 GROUP BY month
UNION ALL
SELECT NULL, NULL, COUNT(DISTINCT cookieid) AS uv, 3 AS GROUPING__ID FROM cookie5;
```

```
+----------+-------------+-----+---------------+
| month    | day         | uv  | grouping__id  |
+----------+-------------+-----+---------------+
| 2015-04  | 2015-04-16  | 2   | 0             |
| 2015-04  | 2015-04-13  | 3   | 0             |
| 2015-04  | 2015-04-12  | 2   | 0             |
| 2015-04  | 2015-04-15  | 2   | 0             |
| 2015-03  | 2015-03-12  | 1   | 0             |
| 2015-03  | 2015-03-10  | 4   | 0             |
| 2015-04  | NULL        | 6   | 1             |
| 2015-03  | NULL        | 5   | 1             |
| NULL     | NULL        | 7   | 3             |
+----------+-------------+-----+---------------+
```
Reference: https://www.cnblogs.com/qingyunzong/p/8798987.html



