GROUPING SETS 在 GROUP BY 中,允许同一份记录集合按多种 group by 选项。所有的 GROUPING SET 等价于 多个 GROUP BY 的结果用 UNIOn 连接起来。GROUPING SET 中,空的集合()代表是全局的汇聚。
表1 – Grouping set 查询和等价的 GROUP BY 汇聚查询
| GROUPING SETS 汇聚查询 | 等价的 group by查询 |
|---|---|
| SELECt a, b, SUM© FROM tab1 GROUP BY a, b GROUPING SETS ( (a,b) ) | SELECt a, b, SUM© FROM tab1 GROUP BY a, b |
| SELECt a, b, SUM( c ) FROM tab1 GROUP BY a, b GROUPING SETS ( (a,b), a) | SELECt a, b, SUM( c ) FROM tab1 GROUP BY a, b UNIOn SELECt a, null, SUM( c ) FROM tab1 GROUP BY a |
| SELECt a,b, SUM( c ) FROM tab1 GROUP BY a, b GROUPING SETS (a,b) | SELECt a, null, SUM( c ) FROM tab1 GROUP BY a UNIOn SELECt null, b, SUM( c ) FROM tab1 GROUP BY b |
| SELECt a, b, SUM( c ) FROM tab1 GROUP BY a, b GROUPING SETS ( (a, b), a, b, ( ) ) | SELECt a, b, SUM( c ) FROM tab1 GROUP BY a, b UNIOn SELECt a, null, SUM( c ) FROM tab1 GROUP BY a, null UNIOn SELECt null, b, SUM( c ) FROM tab1 GROUP BY null, b UNIOn SELECt null, null, SUM( c ) FROM tab1 |
RULL UP 代表按层级进行汇总。
如 GROUP BY a, b, c WITH RULL UP 等价于 GROUP BY a, b, c GROUPING SETS ((a, b, c),(b, c), (a), ())
CUBE 按所有的可能的汇聚组合进行汇聚
如 GROUP BY a, b, c WITH CUBE 等价于 GROUP BY a, b, c GROUPING SETS (( a, b, c), (a, b), (b, c), (a, c), a, b, c, ());
如有表 T 内容如下:
| Column1(key) | Column2(value) |
|---|---|
| 1 | NULL |
| 1 | 2 |
| 2 | 2 |
| 3 | 3 |
| 3 | NULL |
| 4 | 5 |
建表和插入数据的 SQL 如下:
create table t1(key int, value int); insert into t1 values(1,null),(1,1),(2,2),(3,3),(3,null),(4,5);
检索 SQL
select key,value,count(*) from t1 group by key,value with rollup;
| key | value | count |
|---|---|---|
| 1 | NULL | 1 |
| 1 | 1 | 1 |
| 2 | NULL | 1 |
| 4 | NULL | 1 |
| 4 | 5 | 1 |
| NULL | NULL | 6 |
| 1 | NULL | 2 |
| 2 | 2 | 1 |
| 3 | NULL | 1 |
| 3 | NULL | 2 |
| 3 | 3 | 1 |
请看第1行为 1 null 1,第 7 行为 1 null 2 。1 null 1 是按 (key, value)这两个字段 group by 的结果,1 null 2 是按 key group by 的结果,没有按 value 进行 group by。
从数据上不好区分,GROUPING__ID 用于区分某一列本身是 null,还是没按此字段汇聚。
GROUPING__ID 返回一个位向量,进行汇聚的列对应位置是1,否则是 0。按列进行汇聚,此列的 null 值没有意义。
如 group by key, value,则对应的表格如下。
| GROUP BY key? | GROUP BY value? | bit vector | GROUPING__ID |
|---|---|---|---|
| 否 | 否 | 00 | 0 |
| 否 | 是 | 01 | 1 |
| 是 | 否 | 10 | 2 |
| 是 | 是 | 11 | 3 |
select key,value,grouping__id,count(*) from t1 group by key,value with rollup;
| key | value | GOUPING_ID | count |
|---|---|---|---|
| 1 | NULL | 0 | 1 |
| 1 | 1 | 0 | 1 |
| 2 | NULL | 1 | 1 |
| 4 | NULL | 1 | 1 |
| 4 | 5 | 0 | 1 |
| NULL | NULL | 3 | 6 |
| 1 | NULL | 1 | 2 |
| 2 | 2 | 0 | 1 |
| 3 | NULL | 0 | 1 |
| 3 | NULL | 1 | 2 |
| 3 | 3 | 0 | 1 |
请看第1行为 1 null 0 1,第 7 行为 1 null 1 2 。
第 1 列的 GROUPING_ID 为 0,代表 1 null 分别对应列的值,列的值有意义。
第 1 列的 GROUPING_ID 为 1,对应的二进制为 01,说明 key 对应的是 0,value 对应的是1。说明 key 对应的列有意义,value 对应的 NULL 不对应真实数据,没有意义。
GROUPING__ID 按GROUP BY 列的顺序显示对应的位向量的值,GROUPING 函数可以指定列和列的顺序
如
SELECt key, value, GROUPING__ID, grouping(key, value), grouping(value, key), grouping(key), grouping(value) FROM t1 GROUP BY key, value WITH ROLLUP;
| key | value | GROUPING__ID | grouping(key,value) | grouping(value,key) | grouping(key) | grouping(value) |
|---|---|---|---|---|---|---|
| 1 | NULL | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | NULL | 1 | 1 | 2 | 0 | 1 |
| 4 | NULL | 1 | 1 | 2 | 0 | 1 |
| 4 | 5 | 0 | 0 | 0 | 0 | 0 |
| NULL | NULL | 3 | 3 | 3 | 1 | 1 |
| 1 | NULL | 1 | 1 | 2 | 0 | 1 |
| 2 | 2 | 0 | 0 | 0 | 0 | 0 |
| 3 | NULL | 0 | 0 | 0 | 0 | 0 |
| 3 | NULL | 1 | 1 | 2 | 0 | 1 |
| 3 | 3 | 0 | 0 | 0 | 0 | 0 |
可以看到,GROUPING__ID 和 grouping(key, value) 的值完全相同。
第 3 行,grouping(key) 为0, grouping(value)为1,grouping(key, value) 对应的二进制为 01,所以10 进制为 1。grouping(value, key) 对应的二进制为 10,所以10 进制为 2。
超过此限额的 group by 组合,使用新的 作业。
如 SQL select a, b, c, count(1) from T group by a, b, c with rollup; Map 端对于每条数据,生成 4 条数据 (a, b, c), (a, b, null), (a, null, null), (null, null, null),每条记录加上对应的 GROUPING__ID,导致 map 输出的数据量膨胀,而 map 端的汇聚不能有效减少数据量。
以上SQL 的 group by 的组合的数量是 4,没有到 30,所以只用一个作业完成计算。
hive.new.job.grouping.set.cardinality 30 Whether a new map-reduce job should be launched for grouping sets/rollups/cubes. For a query like: select a, b, c, count(1) from T group by a, b, c with rollup; 4 rows are created per row: (a, b, c), (a, b, null), (a, null, null), (null, null, null). This can lead to explosion across map-reduce boundary if the cardinality of T is very high, and map-side aggregation does not do a very good job. This parameter decides if Hive should add an additional map-reduce job. If the grouping set cardinality (4 in the example above), is more than this value, a new MR job is added under the assumption that the original group by will reduce the data size.



