栏目分类:
子分类:
返回
名师互学网用户登录
快速导航关闭
当前搜索
当前分类
子分类
实用工具
热门搜索
名师互学网 > IT > 软件开发 > 后端开发 > Python

with as 语句真的会把数据存内存嘛?(源码剖析)

Python 更新时间: 发布时间: IT归档 最新发布 模块sitemap 名妆网 法律咨询 聚返吧 英语巴士网 伯小乐 网商动力

with as 语句真的会把数据存内存嘛?(源码剖析)

最近有好几个朋友都有咨询这个问题 大概有两类

1、为啥我用了with..as效率没有提高

2、sql跑不动 改成with..as的写法 会不会更好些

网上博客几乎都有结论with ... as语句会把数据放在内存

 

一、 Hive-SQL

 在hive中有一个参数

hive.optimize.cte.materialize.threshold

 这个参数在默认情况下是-1 关闭的 当开启 大于0 比如设置为2 则如果with..as语句被引用2次及以上时 会把with..as语句生成的table物化 从而做到with..as语句只执行一次 来提高效率。

1.1 测试 
explain
with atable as (
 SELECt id,source,channel
 FROM test
select source from atable WHERe channel 直播 
union ALL
select source from atable WHERe channel 视频 

 不设置该参数时 执行计划

STAGE DEPENDENCIES:
 Stage-1 is a root stage
 Stage-0 depends on stages: Stage-1
STAGE PLANS:
 Stage: Stage-1
 Map Reduce
 Map Operator Tree:
 TableScan
 alias: test
 Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
 Filter Operator
 predicate: (channel 直播 ) (type: boolean)
 Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
 Select Operator
 expressions: source (type: string)
 outputColumnNames: _col0
 Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
 Union
 Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
 File Output Operator
 compressed: false
 Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
 table:
 input format: org.apache.hadoop.mapred.SequenceFileInputFormat
 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
 TableScan
 alias: test
 Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
 Filter Operator
 predicate: (channel 视频 ) (type: boolean)
 Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
 Select Operator
 expressions: source (type: string)
 outputColumnNames: _col0
 Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
 Union
 Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
 File Output Operator
 compressed: false
 Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
 table:
 input format: org.apache.hadoop.mapred.SequenceFileInputFormat
 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
 Stage: Stage-0
 Fetch Operator
 limit: -1
 Processor Tree:
 ListSink

从执行计划上看 test表被读两次。

设置set hive.optimize.cte.materialize.threshold 1 执行计划

STAGE DEPENDENCIES:
 Stage-1 is a root stage
 Stage-6 depends on stages: Stage-1 , consists of Stage-3, Stage-2, Stage-4
 Stage-3
 Stage-0 depends on stages: Stage-3, Stage-2, Stage-5
 Stage-8 depends on stages: Stage-0
 Stage-2
 Stage-4
 Stage-5 depends on stages: Stage-4
 Stage-7 depends on stages: Stage-8
STAGE PLANS:
 Stage: Stage-1
 Map Reduce
 Map Operator Tree:
 TableScan
 alias: test
 Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
 Select Operator
 expressions: id (type: int), source (type: string), channel (type: string)
 outputColumnNames: _col0, _col1, _col2
 Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
 File Output Operator
 compressed: false
 Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
 table:
 input format: org.apache.hadoop.mapred.TextInputFormat
 output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
 name: default.atable
 Stage: Stage-6
 Conditional Operator
 Stage: Stage-3
 Move Operator
 files:
 hdfs directory: true
 destination: hdfs://localhost:9000/tmp/hive/bytedance/bae441cb-0ef6-4e9c-9f7a-8a5f97d0e560/_tmp_space.db/ce44793b-6eed-4299-b737-f05c66b2281b/.hive-staging_hive_2021-03-24_20-17-38_169_5695913330535939856-1/-ext-10002
 Stage: Stage-0
 Move Operator
 files:
 hdfs directory: true
 destination: hdfs://localhost:9000/tmp/hive/bytedance/bae441cb-0ef6-4e9c-9f7a-8a5f97d0e560/_tmp_space.db/ce44793b-6eed-4299-b737-f05c66b2281b
 Stage: Stage-8
 Map Reduce
 Map Operator Tree:
 TableScan
 alias: atable
 Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
 Filter Operator
 predicate: (channel 直播 ) (type: boolean)
 Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
 Select Operator
 expressions: source (type: string)
 outputColumnNames: _col0
 Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
 Union
 Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
 File Output Operator
 compressed: false
 Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
 table:
 input format: org.apache.hadoop.mapred.SequenceFileInputFormat
 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
 TableScan
 alias: atable
 Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
 Filter Operator
 predicate: (channel 视频 ) (type: boolean)
 Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
 Select Operator
 expressions: source (type: string)
 outputColumnNames: _col0
 Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
 Union
 Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
 File Output Operator
 compressed: false
 Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
 table:
 input format: org.apache.hadoop.mapred.SequenceFileInputFormat
 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
 Stage: Stage-2
 Map Reduce
 Map Operator Tree:
 TableScan
 File Output Operator
 compressed: false
 table:
 input format: org.apache.hadoop.mapred.TextInputFormat
 output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
 name: default.atable
 Stage: Stage-4
 Map Reduce
 Map Operator Tree:
 TableScan
 File Output Operator
 compressed: false
 table:
 input format: org.apache.hadoop.mapred.TextInputFormat
 output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
 name: default.atable
 Stage: Stage-5
 Move Operator
 files:
 hdfs directory: true
 destination: hdfs://localhost:9000/tmp/hive/bytedance/bae441cb-0ef6-4e9c-9f7a-8a5f97d0e560/_tmp_space.db/ce44793b-6eed-4299-b737-f05c66b2281b/.hive-staging_hive_2021-03-24_20-17-38_169_5695913330535939856-1/-ext-10002
 Stage: Stage-7
 Fetch Operator
 limit: -1
 Processor Tree:
 ListSink

可以看到test表被物化了。

1.2 源码

 从源码看 在获取元数据时 会做参数判断 判断参数阈值及cte的引用次数

二、spark-sql

spark对cte的操作比较少 在spark侧 现在还没发现有相关的优化参数

with atable as (
 SELECT content_type,
 channel,
 channel_note,
 enter_method,
 enter_method_note
 FROM search_dw.dim_ecom_enter_channel_df
select channel from atable WHERe content_type 直播 
union ALL
select channel from atable WHERe content_type 视频 

 参考资料 公众号 数据仓库与Python大数据 -《with as 语句真的会把数据存内存嘛 源码剖析 》

转载请注明:文章转载自 www.mshxw.com
本文地址:https://www.mshxw.com/it/267667.html
我们一直用心在做
关于我们 文章归档 网站地图 联系我们

版权所有 (c)2021-2022 MSHXW.COM

ICP备案号:晋ICP备2021003244-6号