For simple HiveQL, reading the execution plan tells you in detail where a query can be optimized.
Take the following simple SQL and look at what EXPLAIN produces:
```sql
select prov_id
      ,deep
      ,count(1) as num
from dim.dim_city
where prov_id = 110000
  and deep = 1
group by prov_id
        ,deep
;
```
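The plan discussed below is obtained by prefixing the query with the EXPLAIN keyword (standard Hive syntax):

```sql
explain
select prov_id
      ,deep
      ,count(1) as num
from dim.dim_city
where prov_id = 110000
  and deep = 1
group by prov_id
        ,deep
;
```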
The execution plan:
```
// Dependencies between stages
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

// Details of each stage
STAGE PLANS:
  // Details of Stage-1
  Stage: Stage-1
    Map Reduce
      // Map-phase operators
      Map Operator Tree:
          TableScan
            // The table being scanned
            alias: dim_city
            // Filter expression pushed down to the scan
            filterExpr: ((prov_id = 110000) and (deep = 1)) (type: boolean)
            // Row count and data size at this point; both are estimates
            Statistics: Num rows: 3775 Data size: 522191 Basic stats: COMPLETE Column stats: NONE
            // The actual filter operation, i.e. discarding non-matching rows
            Filter Operator
              predicate: ((prov_id = 110000) and (deep = 1)) (type: boolean)
              // Estimated size after filtering
              Statistics: Num rows: 943 Data size: 130444 Basic stats: COMPLETE Column stats: NONE
              // Projection of the selected columns
              Select Operator
                Statistics: Num rows: 943 Data size: 130444 Basic stats: COMPLETE Column stats: NONE
                // Group-and-aggregate over the filtered rows; aggregation already
                // happens on the map side because hive.map.aggr = true is set
                Group By Operator
                  aggregations: count()
                  keys: 110000 (type: int)
                  // The aggregation algorithm
                  mode: hash
                  // Internal names of the output columns
                  outputColumnNames: _col0, _col1
                  Statistics: Num rows: 943 Data size: 130444 Basic stats: COMPLETE Column stats: NONE
                  // Output of the map-side aggregation; this can be read as the shuffle phase
                  Reduce Output Operator
                    // Map and reduce output are key-value pairs: key -> _col0, value -> _col1
                    key expressions: _col0 (type: int)
                    // Whether the output is sorted; + means ascending
                    sort order: +
                    // Column used to partition map output across reducers, i.e. partition by _col0
                    Map-reduce partition columns: _col0 (type: int)
                    Statistics: Num rows: 943 Data size: 130444 Basic stats: COMPLETE Column stats: NONE
                    value expressions: _col1 (type: bigint)
      // Next comes the reduce phase
      Reduce Operator Tree:
        Group By Operator
          // The aggregation is a count over _col0 of the value
          aggregations: count(VALUE._col0)
          // The aggregation key
          keys: KEY._col0 (type: int)
          mode: mergepartial
          // Two output columns: the group key and the merged count
          outputColumnNames: _col0, _col1
          Statistics: Num rows: 471 Data size: 65152 Basic stats: COMPLETE Column stats: NONE
          // Final projection: the constants prov_id = 110000 and deep = 1, plus the aggregated count
          Select Operator
            expressions: 110000 (type: int), 1 (type: int), _col1 (type: bigint)
            outputColumnNames: _col0, _col1, _col2
            Statistics: Num rows: 471 Data size: 65152 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 471 Data size: 65152 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  // The serialization format
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  // Stage-0 only fetches the result of Stage-1; limit: -1 means the query has no LIMIT clause
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
```
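The map-side Group By Operator in the plan exists only because hive.map.aggr is enabled (it defaults to true in current Hive versions). As a sketch of how to verify this yourself: disable the setting and re-run EXPLAIN, and the map-side aggregation should disappear, meaning every raw row is shuffled to the reducers and aggregated there in one step, which typically increases shuffle volume:

```sql
-- Disable map-side hash aggregation (session-level setting).
set hive.map.aggr = false;

-- Re-running EXPLAIN now should show no Group By Operator inside the
-- Map Operator Tree; the Reduce Operator Tree's Group By Operator does
-- all the work in a single pass instead of merging partial results.
explain
select prov_id
      ,deep
      ,count(1) as num
from dim.dim_city
where prov_id = 110000
  and deep = 1
group by prov_id
        ,deep
;
```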
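Beyond the basic form, Hive offers EXPLAIN variants that are useful when tuning (these are standard Hive commands, though the exact set varies by version):

```sql
-- Adds low-level detail: input file paths, full serde properties, etc.
explain extended
select count(1) from dim.dim_city;

-- Prints a JSON list of the tables and partitions the query reads,
-- handy for checking that partition pruning works as expected.
explain dependency
select count(1) from dim.dim_city;
```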



