Hive Tuning Notes (Part 2)

I. CBO Optimization

Hive's cost-based optimizer (CBO) is enabled by default. The relevant parameters:
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
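
These settings only help when statistics exist, because the CBO estimates plan costs from table and column statistics. A minimal sketch of collecting them, assuming the bigtable table used in the examples in these notes:
-- table-level statistics (row count, data size)
analyze table bigtable compute statistics;
-- column-level statistics (distinct values, min/max, null counts)
analyze table bigtable compute statistics for columns;
-- inspect what was recorded in the table parameters
describe formatted bigtable;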

Predicate pushdown (enabled by default):
set hive.optimize.ppd=true;

Example: the filter o.id <= 10 is applied to the rows of alias o before the reduce stage:
explain select o.id from bigtable b join bigtable o on
o.id = b.id where o.id <= 10;
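
For comparison, a hand-written form of the same pushdown. This is only a sketch of what the optimizer does conceptually, not its actual rewritten plan. With hive.optimize.ppd=true, both queries should filter o at the same point, which can be checked by comparing the two explain outputs:
explain select o.id from bigtable b
join (select id from bigtable where id <= 10) o
on o.id = b.id;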

II. MapJoin (small table join large table)

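1. How it works:
MapJoin distributes the smaller of the two joined tables directly into the memory of every map task and performs the join inside the map tasks. This skips the reduce step and therefore speeds up the query. For outer joins the optimization applies only when the small table is on the non-preserved side: in a left join it does not take effect if the small table is the left-hand table.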

2. Enable automatic MapJoin conversion (default: true)
set hive.auto.convert.join=true; 

3. Threshold between large and small tables (by default, a table under roughly 25 MB counts as small):
set hive.mapjoin.smalltable.filesize=25000000;
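
A sketch for checking whether a join is actually converted. It assumes a hypothetical table smalltable (not defined in these notes) with id and keyword columns and files below the threshold. If the conversion applies, the explain output should contain a Map Join Operator instead of a reduce-side Join Operator:
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=25000000;
-- smalltable is a hypothetical name used only for illustration
explain select b.id, s.keyword
from bigtable b
join smalltable s
on s.id = b.id;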

III. SMB Join (large table join large table)

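1. How it works: an SMB (sort-merge-bucket) join buckets both large tables by id and processes them bucket by bucket (divide and conquer). The bucket counts of the two tables must be equal or one a multiple of the other, so that rows that match on id are guaranteed to sit in corresponding buckets.
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;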
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
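
To see why equal or multiple bucket counts keep matching ids together, the sketch below uses pmod(id, n) as a simplified stand-in for Hive's bucket assignment (an assumption: the real hash function depends on the column type and Hive version). Any id with pmod(id, 6) = k has pmod(id, 12) equal to k or k + 6, so bucket k of a 6-bucket table only has to be matched against those two buckets of a 12-bucket table:
-- pmod is used here only as an illustrative bucket function
select id,
       pmod(id, 6)  as bucket_of_6,
       pmod(id, 12) as bucket_of_12
from bigtable
limit 10;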

2. Load the data of table 1 into a temporary bucketed table
create table bigtable_buck1(
id bigint,
t bigint,
uid string,
keyword string,
url_rank int,
click_num int,
click_url string)
clustered by(id)
sorted by(id)
into 6 buckets   -- for bigtable_buck2, pick a bucket count equal to this one or a multiple of it
row format delimited fields terminated by '\t';

load data local inpath '/opt/module/data/bigtable' into table
bigtable_buck1;
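
The join in the next step also reads bigtable_buck2, whose definition is not shown in these notes. A sketch of it, assuming the same columns as bigtable_buck1, the same bucket count, and the same source file path (all three are assumptions):
-- assumed to mirror bigtable_buck1; adjust columns and path to the real second table
create table bigtable_buck2(
id bigint,
t bigint,
uid string,
keyword string,
url_rank int,
click_num int,
click_url string)
clustered by(id)
sorted by(id)
into 6 buckets
row format delimited fields terminated by '\t';

load data local inpath '/opt/module/data/bigtable' into table
bigtable_buck2;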

3. Run the join
insert overwrite table jointable
select b.id, b.t, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from bigtable_buck1 s
join bigtable_buck2 b
on b.id = s.id;
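
To confirm that the SMB path is taken, the select part can be run under explain. A sketch, assuming both bucketed tables are populated and the settings from step 1 are active in the session. The plan is expected to show a Sorted Merge Bucket Map Join Operator rather than a reduce-side join:
explain
select b.id, b.t, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from bigtable_buck1 s
join bigtable_buck2 b
on b.id = s.id;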
