先说结论,两者没有区别,先看执行计划
1、group byexplain select prov_id from dim.dim_city group by prov_id ; STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 STAGE PLANS: Stage: Stage-1 Map Reduce Map Operator Tree: TableScan alias: dim_city Statistics: Num rows: 3775 Data size: 522191 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: prov_id (type: int) outputColumnNames: prov_id Statistics: Num rows: 3775 Data size: 522191 Basic stats: COMPLETE Column stats: NONE Group By Operator keys: prov_id (type: int) mode: hash outputColumnNames: _col0 Statistics: Num rows: 3775 Data size: 522191 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: _col0 (type: int) sort order: + Map-reduce partition columns: _col0 (type: int) Statistics: Num rows: 3775 Data size: 522191 Basic stats: COMPLETE Column stats: NONE Reduce Operator Tree: Group By Operator keys: KEY._col0 (type: int) mode: mergepartial outputColumnNames: _col0 Statistics: Num rows: 1887 Data size: 261026 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 1887 Data size: 261026 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink
2、distinct
explain select distinct prov_id from dim.dim_city ; STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 STAGE PLANS: Stage: Stage-1 Map Reduce Map Operator Tree: TableScan alias: dim_city Statistics: Num rows: 3775 Data size: 522191 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: prov_id (type: int) outputColumnNames: prov_id Statistics: Num rows: 3775 Data size: 522191 Basic stats: COMPLETE Column stats: NONE Group By Operator keys: prov_id (type: int) mode: hash outputColumnNames: _col0 Statistics: Num rows: 3775 Data size: 522191 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: _col0 (type: int) sort order: + Map-reduce partition columns: _col0 (type: int) Statistics: Num rows: 3775 Data size: 522191 Basic stats: COMPLETE Column stats: NONE Reduce Operator Tree: Group By Operator keys: KEY._col0 (type: int) mode: mergepartial outputColumnNames: _col0 Statistics: Num rows: 1887 Data size: 261026 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 1887 Data size: 261026 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink
执行过程完全一致,distinct在map端同样会先做group by聚合,而不是都在reduce端做这个操作,老版本的hive没有这个优化,都在reduce端执行的话会有很大的性能差异



