Hive 中子查询VS条件过滤.

基于Hive 2.1.x 首先直接做关联，查看执行计划

-- 先连接后过滤
explain 
select
	count(1)
from student t1 inner join student t2
	on t1.id = t2.id and t2.name = 'shanghai'
where
	t1.city = 'beijing';

1	STAGE DEPENDENCIES:
2	  Stage-5 is a root stage
3	  Stage-2 depends on stages: Stage-5
4	  Stage-0 depends on stages: Stage-2
5	
6	STAGE PLANS:
7	  Stage: Stage-5
8	    Map Reduce Local Work
9	      Alias -> Map Local Tables:
10	        t1 
11	          Fetch Operator
12	            limit: -1
13	      Alias -> Map Local Operator Tree:
14	        t1 
15	          TableScan
16	            alias: t1
17	            filterExpr: (id is not null and (city = 'beijing')) (type: boolean)
18	            Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
19	            Filter Operator
20	              predicate: (id is not null and (city = 'beijing')) (type: boolean)
21	              Statistics: Num rows: 1 Data size: 3 Basic stats: COMPLETE Column stats: NONE
22	              HashTable Sink Operator
23	                keys:
24	                  0 id (type: int)
25	                  1 id (type: int)
26	
27	  Stage: Stage-2
28	    Map Reduce
29	      Map Operator Tree:
30	          TableScan
31	            alias: t2
32	            filterExpr: ((name = 'shanghai') and id is not null) (type: boolean)
33	            Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
34	            Filter Operator
35	              predicate: ((name = 'shanghai') and id is not null) (type: boolean)
36	              Statistics: Num rows: 1 Data size: 3 Basic stats: COMPLETE Column stats: NONE
37	              Map Join Operator
38	                condition map:
39	                     Inner Join 0 to 1
40	                keys:
41	                  0 id (type: int)
42	                  1 id (type: int)
43	                Statistics: Num rows: 1 Data size: 3 Basic stats: COMPLETE Column stats: NONE
44	                Group By Operator
45	                  aggregations: count(1)
46	                  mode: hash
47	                  outputColumnNames: _col0
48	                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
49	                  Reduce Output Operator
50	                    sort order: 
51	                    Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
52	                    value expressions: _col0 (type: bigint)
53	      Local Work:
54	        Map Reduce Local Work
55	      Reduce Operator Tree:
56	        Group By Operator
57	          aggregations: count(VALUE._col0)
58	          mode: mergepartial
59	          outputColumnNames: _col0
60	          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
61	          File Output Operator
62	            compressed: false
63	            Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
64	            table:
65	                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
66	                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
67	                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
68	
69	  Stage: Stage-0
70	    Fetch Operator
71	      limit: -1
72	      Processor Tree:

再使用子查询进行先过滤，再连接。

explain 
select
	count(1)
from 
	(select * from student where city = 'beijing') t1
	inner join student t2
	on t1.id = t2.id and t2.name = 'shanghai';

1	STAGE DEPENDENCIES:
2	  Stage-5 is a root stage
3	  Stage-2 depends on stages: Stage-5
4	  Stage-0 depends on stages: Stage-2
5	
6	STAGE PLANS:
7	  Stage: Stage-5
8	    Map Reduce Local Work
9	      Alias -> Map Local Tables:
10	        t1:student 
11	          Fetch Operator
12	            limit: -1
13	      Alias -> Map Local Operator Tree:
14	        t1:student 
15	          TableScan
16	            alias: student
17	            filterExpr: ((city = 'beijing') and id is not null) (type: boolean)
18	            Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
19	            Filter Operator
20	              predicate: ((city = 'beijing') and id is not null) (type: boolean)
21	              Statistics: Num rows: 1 Data size: 3 Basic stats: COMPLETE Column stats: NONE
22	              Select Operator
23	                expressions: id (type: int)
24	                outputColumnNames: _col0
25	                Statistics: Num rows: 1 Data size: 3 Basic stats: COMPLETE Column stats: NONE
26	                HashTable Sink Operator
27	                  keys:
28	                    0 _col0 (type: int)
29	                    1 id (type: int)
30	
31	  Stage: Stage-2
32	    Map Reduce
33	      Map Operator Tree:
34	          TableScan
35	            alias: t2
36	            filterExpr: ((name = 'shanghai') and id is not null) (type: boolean)
37	            Statistics: Num rows: 3 Data size: 9 Basic stats: COMPLETE Column stats: NONE
38	            Filter Operator
39	              predicate: ((name = 'shanghai') and id is not null) (type: boolean)
40	              Statistics: Num rows: 1 Data size: 3 Basic stats: COMPLETE Column stats: NONE
41	              Map Join Operator
42	                condition map:
43	                     Inner Join 0 to 1
44	                keys:
45	                  0 _col0 (type: int)
46	                  1 id (type: int)
47	                Statistics: Num rows: 1 Data size: 3 Basic stats: COMPLETE Column stats: NONE
48	                Group By Operator
49	                  aggregations: count(1)
50	                  mode: hash
51	                  outputColumnNames: _col0
52	                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
53	                  Reduce Output Operator
54	                    sort order: 
55	                    Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
56	                    value expressions: _col0 (type: bigint)
57	      Local Work:
58	        Map Reduce Local Work
59	      Reduce Operator Tree:
60	        Group By Operator
61	          aggregations: count(VALUE._col0)
62	          mode: mergepartial
63	          outputColumnNames: _col0
64	          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
65	          File Output Operator
66	            compressed: false
67	            Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
68	            table:
69	                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
70	                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
71	                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
72	
73	  Stage: Stage-0
74	    Fetch Operator
75	      limit: -1
76	      Processor Tree:
77	        ListSink

从中我们可以清晰的看到，两者的执行计划和执行效率其实都差不多。应该是Hive 引擎自动做了谓词下推的优化机制。

Hive 中子查询VS条件过滤.

大数据系统相关栏目本月热门文章