Spark优化之执行计划

1.sparksql 执行计划

spark.sql("select a.uid,count(1) from a join u on a.uid=u.uid where a.actorlevel>10 group by a.uid order by a.uid").explain()

（1）explain() 查看物理计划

== Physical Plan ==
*(4) Sort [uid#84 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(uid#84 ASC NULLS FIRST, 200)
+- *(3) HashAggregate(keys=[uid#84], functions=[count(1)])
+- Exchange hashpartitioning(uid#84, 200)
+- *(2) HashAggregate(keys=[uid#84], functions=[partial_count(1)])
+- *(2) Project [uid#84]
+- *(2) BroadcastHashJoin [cast(uid#84 as bigint)], [uid#23L], Inner, BuildRight
:- *(2) Project [uid#84]
: +- *(2) Filter ((isnotnull(actorlevel#50) && (cast(actorlevel#50 as int) > 10)) && isnotnull(uid#84))
: +- *(2) FileScan json [actorlevel#50,uid#84] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/D:/workspace/sparktest/datas/actorinfo], PartitionFilters: [], PushedFilters: [IsNotNull(actorlevel), IsNotNull(uid)], ReadSchema: struct
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true]))
+- *(1) Project [uid#23L]
+- *(1) Filter isnotnull(uid#23L)
+- *(1) FileScan json [uid#23L] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/D:/workspace/sparktest/datas/yhtx_1022], PartitionFilters: [], PushedFilters: [IsNotNull(uid)], ReadSchema: struct

（2）explain(true)查看整个SQL的执行计划，主要分为4个阶段

　--1:解析过程

== Parsed Logical Plan ==
'Sort ['a.uid ASC NULLS FIRST], true
+- 'Aggregate ['a.uid], ['a.uid, unresolvedalias('count(1), None)]
+- 'Filter ('a.actorlevel > 10)
+- 'Join Inner, ('a.uid = 'u.uid)
:- 'UnresolvedRelation `a`
+- 'UnresolvedRelation `u`

　　--2:逻辑阶段

== Analyzed Logical Plan ==
uid: string, count(1): bigint
Project [uid#84, count(1)#181L]
+- Sort [uid#84 ASC NULLS FIRST], true
+- Aggregate [uid#84], [uid#84, count(1) AS count(1)#181L]
+- Filter (cast(actorlevel#50 as int) > 10)
+- Join Inner, (cast(uid#84 as bigint) = uid#23L)
:- SubqueryAlias a
: +- Relation[actorid#48,actorinfoflag#49,actorlevel#50,actvdate#51,adchannel#52,adsubchannel#53,areaid#54,channel#55,city#56,clienLogDate#57,deviceid#58,edition#59,gameid#60,guid#61,imei#62,language#63,lastgamedate#64,logdate#65,logindates#66,logtype#67L,lvlupdate#68,lvluptimes#69L,money#70,netsp#71,... 13 more fields] json
+- SubqueryAlias u
+- Relation[actorid#6L,actorname#7,approach#8,areaid#9L,channel#10L,deviceid#11,edition#12L,gameid#13L,imei#14,item#15L,itemcount#16L,itemname#17,logdate#18,logtype#19L,opersys#20L,subchannel#21,type#22L,uid#23L] json

　--3:优化阶段

== Optimized Logical Plan ==
Sort [uid#84 ASC NULLS FIRST], true
+- Aggregate [uid#84], [uid#84, count(1) AS count(1)#181L]
+- Project [uid#84]
+- Join Inner, (cast(uid#84 as bigint) = uid#23L)
:- Project [uid#84]
: +- Filter ((isnotnull(actorlevel#50) && (cast(actorlevel#50 as int) > 10)) && isnotnull(uid#84))
: +- Relation[actorid#48,actorinfoflag#49,actorlevel#50,actvdate#51,adchannel#52,adsubchannel#53,areaid#54,channel#55,city#56,clienLogDate#57,deviceid#58,edition#59,gameid#60,guid#61,imei#62,language#63,lastgamedate#64,logdate#65,logindates#66,logtype#67L,lvlupdate#68,lvluptimes#69L,money#70,netsp#71,... 13 more fields] json
+- Project [uid#23L]
+- Filter isnotnull(uid#23L)
+- Relation[actorid#6L,actorname#7,approach#8,areaid#9L,channel#10L,deviceid#11,edition#12L,gameid#13L,imei#14,item#15L,itemcount#16L,itemname#17,logdate#18,logtype#19L,opersys#20L,subchannel#21,type#22L,uid#23L] json

--4:物理执行计划

== Physical Plan ==
*(4) Sort [uid#84 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(uid#84 ASC NULLS FIRST, 200)
+- *(3) HashAggregate(keys=[uid#84], functions=[count(1)], output=[uid#84, count(1)#181L])
+- Exchange hashpartitioning(uid#84, 200)
+- *(2) HashAggregate(keys=[uid#84], functions=[partial_count(1)], output=[uid#84, count#184L])
+- *(2) Project [uid#84]
+- *(2) BroadcastHashJoin [cast(uid#84 as bigint)], [uid#23L], Inner, BuildRight
:- *(2) Project [uid#84]
: +- *(2) Filter ((isnotnull(actorlevel#50) && (cast(actorlevel#50 as int) > 10)) && isnotnull(uid#84))
: +- *(2) FileScan json [actorlevel#50,uid#84] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/D:/workspace/sparktest/datas/actorinfo], PartitionFilters: [], PushedFilters: [IsNotNull(actorlevel), IsNotNull(uid)], ReadSchema: struct
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true]))
+- *(1) Project [uid#23L]
+- *(1) Filter isnotnull(uid#23L)
+- *(1) FileScan json [uid#23L] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/D:/workspace/sparktest/datas/yhtx_1022], PartitionFilters: [], PushedFilters: [IsNotNull(uid)], ReadSchema: struct

Spark优化之执行计划

大数据系统相关栏目本月热门文章