The difference between rollup and cube in Spark

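In Spark, these two functions are the DataFrame counterparts of the GROUPING SETS clause in Spark SQL.
Both aggregate over groupings of multiple columns. Since Spark already has groupBy, what are these two functions for, and how do they differ from each other?
Spark: The Definitive Guide explains them as follows: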
The cube function:
Rather than treating elements hierarchically, a cube
does the same thing across all dimensions

The rollup function:
A rollup is a multidimensional aggregation that performs a variety of group-by style calculations
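On their own, these definitions are not very illuminating.
Put simply, cube aggregates over every possible combination of the grouping keys, that is, every subset of the key list including the empty set (the grand total), so n keys produce 2^n groupings.
rollup treats the keys as a hierarchy and drills down from left to right: it aggregates over each left-to-right prefix of the key list plus the grand total, so n keys produce n + 1 groupings.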

groupBy itself, by comparison, aggregates over a single grouping only: the full combination of all the listed keys.
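
The grouping rule described above can be sketched in plain Python, without Spark. This is only an illustration of which key combinations each operator covers; the helper names cube_sets and rollup_sets are made up for this sketch and are not part of any Spark API.

```python
from itertools import combinations

def cube_sets(keys):
    """Every subset of keys, from the full list down to the empty set (grand total)."""
    return [combo
            for size in range(len(keys), -1, -1)
            for combo in combinations(keys, size)]

def rollup_sets(keys):
    """Left-to-right prefixes of keys, ending with the empty set (grand total)."""
    return [tuple(keys[:size]) for size in range(len(keys), -1, -1)]

keys = ["year", "month", "day"]
print(len(cube_sets(keys)))    # 8 grouping sets: 2 ** 3
print(len(rollup_sets(keys)))  # 4 grouping sets: 3 + 1
print(rollup_sets(keys))
```

Every rollup grouping is also a cube grouping, and the first entry of either list is the one grouping that a plain groupBy computes.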

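An example makes this clearer. Suppose we group by the three columns year, month, and day and count the rows in each group.
Then cube is equivalent to running all of the following queries: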

SELECT COUNT(*) FROM table GROUP BY year, month, day
SELECT COUNT(*) FROM table GROUP BY year, month
SELECT COUNT(*) FROM table GROUP BY year, day
SELECT COUNT(*) FROM table GROUP BY year
SELECT COUNT(*) FROM table GROUP BY month, day
SELECT COUNT(*) FROM table GROUP BY month
SELECT COUNT(*) FROM table GROUP BY day
SELECT COUNT(*) FROM table    (grand total; year, month, and day are all null)

The corresponding rollup is equivalent to running the following queries:

SELECT COUNT(*) FROM table GROUP BY year, month, day
SELECT COUNT(*) FROM table GROUP BY year, month
SELECT COUNT(*) FROM table GROUP BY year
SELECT COUNT(*) FROM table    (grand total; year, month, and day are all null)

groupBy, by contrast, runs only SELECT COUNT(*) FROM table GROUP BY year, month, day.
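
The separate queries listed above can also be written as one query with a GROUPING SETS clause, which is the SQL feature mentioned at the start of this article. The following is a minimal sketch that only builds the SQL string; the function grouping_sets_sql, the table name events, and the alias cnt are assumptions made for illustration.

```python
def grouping_sets_sql(table, grouping_sets):
    """Build one query with a GROUPING SETS clause from a list of column tuples."""
    # Collect the columns in order of first appearance across all grouping sets.
    columns = []
    for cols in grouping_sets:
        for col in cols:
            if col not in columns:
                columns.append(col)
    sets_sql = ", ".join("(" + ", ".join(cols) + ")" for cols in grouping_sets)
    select_list = ", ".join(columns + ["COUNT(*) AS cnt"])
    return f"SELECT {select_list} FROM {table} GROUP BY GROUPING SETS ({sets_sql})"

# The four groupings that rollup covers for year, month, day.
rollup = [("year", "month", "day"), ("year", "month"), ("year",), ()]
print(grouping_sets_sql("events", rollup))
# SELECT year, month, day, COUNT(*) AS cnt FROM events GROUP BY GROUPING SETS ((year, month, day), (year, month), (year), ())
```

Passing the eight cube groupings instead yields the cube equivalent. Spark SQL also accepts GROUP BY ROLLUP(...) and GROUP BY CUBE(...) as shorthand for these two cases.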

Going further, here is a PySpark example.
The data looks like this:

+---------------+---------+--------+
|       category|     name|how_many|
+---------------+---------+--------+
|      insurance|   Janusz|       0|
|savings account|  Grażyna|       1|
|    credit card|Sebastian|       0|
|       mortgage|   Janusz|       2|
|   term deposit|   Janusz|       4|
|      insurance|  Grażyna|       2|
|savings account|   Janusz|       5|
|    credit card|Sebastian|       2|
|       mortgage|Sebastian|       4|
|   term deposit|   Janusz|       9|
|      insurance|  Grażyna|       3|
|savings account|  Grażyna|       1|
|savings account|Sebastian|       0|
|savings account|Sebastian|       2|
|    credit card|Sebastian|       1|
+---------------+---------+--------+
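Cube operation
from pyspark.sql import functions as F  # F.sum is Spark's aggregate, not Python's built-in sum
df.cube('category', 'name').agg(F.sum('how_many')).show()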

+---------------+---------+-------------+
|       category|     name|sum(how_many)|
+---------------+---------+-------------+
|           null|  Grażyna|            7|
|       mortgage|     null|            6|
|           null|     null|           36|
|      insurance|     null|            5|
|savings account|  Grażyna|            2|
|    credit card|Sebastian|            3|
|   term deposit|     null|           13|
|      insurance|  Grażyna|            5|
|           null|Sebastian|            9|
|   term deposit|   Janusz|           13|
|savings account|     null|            9|
|      insurance|   Janusz|            0|
|       mortgage|Sebastian|            4|
|savings account|   Janusz|            5|
|       mortgage|   Janusz|            2|
|savings account|Sebastian|            2|
|    credit card|     null|            3|
|           null|   Janusz|           20|
+---------------+---------+-------------+

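As the output shows, cube performed four kinds of grouping:

1. the sum for each category alone (name is null)
2. the sum for each name alone (category is null)
3. the sum for each combination of category and name
4. the grand total over all categories and names (both columns are null)

Rollup operation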

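from pyspark.sql import functions as F  # F.sum is Spark's aggregate, not Python's built-in sum
df.rollup('category', 'name').agg(F.sum('how_many')).show()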

+---------------+---------+-------------+
|       category|     name|sum(how_many)|
+---------------+---------+-------------+
|       mortgage|     null|            6|
|           null|     null|           36|
|      insurance|     null|            5|
|savings account|  Grażyna|            2|
|    credit card|Sebastian|            3|
|   term deposit|     null|           13|
|      insurance|  Grażyna|            5|
|   term deposit|   Janusz|           13|
|savings account|     null|            9|
|      insurance|   Janusz|            0|
|       mortgage|Sebastian|            4|
|savings account|   Janusz|            5|
|       mortgage|   Janusz|            2|
|savings account|Sebastian|            2|
|    credit card|     null|            3|
+---------------+---------+-------------+

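rollup performed three kinds of grouping:

1. the sum for each combination of category and name
2. the sum for each category (name is null)
3. the grand total over all categories (both columns are null)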
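
The two result tables can be cross-checked without a Spark session. The sketch below re-implements the grouping logic in plain Python over the same sample data; the helper aggregate and the index-based grouping sets are illustrative assumptions, not how Spark computes the result internally.

```python
from collections import defaultdict

# The sample data from the table above: (category, name, how_many).
ROWS = [
    ("insurance", "Janusz", 0), ("savings account", "Grażyna", 1),
    ("credit card", "Sebastian", 0), ("mortgage", "Janusz", 2),
    ("term deposit", "Janusz", 4), ("insurance", "Grażyna", 2),
    ("savings account", "Janusz", 5), ("credit card", "Sebastian", 2),
    ("mortgage", "Sebastian", 4), ("term deposit", "Janusz", 9),
    ("insurance", "Grażyna", 3), ("savings account", "Grażyna", 1),
    ("savings account", "Sebastian", 0), ("savings account", "Sebastian", 2),
    ("credit card", "Sebastian", 1),
]

def aggregate(rows, grouping_sets):
    """Sum how_many per grouping set; columns outside the set become None."""
    totals = defaultdict(int)
    for keep in grouping_sets:  # keep holds the indices of the grouped columns
        for category, name, how_many in rows:
            key = tuple(value if i in keep else None
                        for i, value in enumerate((category, name)))
            totals[key] += how_many
    return dict(totals)

# Index 0 is category, index 1 is name.
cube = aggregate(ROWS, [(0, 1), (0,), (1,), ()])
rollup = aggregate(ROWS, [(0, 1), (0,), ()])

print(len(cube), len(rollup))  # 18 15
print(cube[(None, None)], cube[(None, "Janusz")], rollup[("mortgage", None)])  # 36 20 6
```

The three extra rows in the cube result are the per-name totals (category is null), which rollup never computes because name is not a left-to-right prefix of the key list.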

Reference:
1. Source of the main example in this article
2. What is the difference between cube, rollup and groupBy operators? (Stack Overflow)
