Hive: Merging Large and Small Files

The command below runs an insert on Tez with output-file merging enabled. $1 and $2 are the script's positional arguments (domain id and import date). The settings it uses:
tez.queue.name: the queue to submit to
hive.execution.engine: the execution engine
hive.merge.tezfiles: enable merging of the output files
hive.merge.smallfiles.avgsize: merge threshold; a merge runs when the average output file size is below 16000000 bytes (about 16 MB)
hive.merge.size.per.task: target size of the merged files; once a merged file reaches 128000000 bytes (about 128 MB), merging continues in another file

hive -e "set tez.queue.name=usershell;
set hive.execution.engine=tez;
set hive.merge.tezfiles=true;
set hive.merge.smallfiles.avgsize=16000000;
set hive.merge.size.per.task=128000000;
insert into table hot_words_view.vvv4_hot_words_new_two partition (domain_id, import_date)
select a.hot_words hot_words,a.rank rank_one,a.leng1 leng,a.total total1,b.total total2,a.mysql_id mysql_id,b.hot_words two_word,b.rank rank_two
,a.domain_id domain_id,a.import_date import_date
from (select import_date,domain_id,hot_words,collect_set(search_word)[0] search_word,number,
collect_set(leng)[0] leng1,mysql_id,collect_set(rank)[0] rank,count(1) total
from hot_words_view.vvv4_hot_words_new where leng =2 and domain_id='$1' and import_date='$2'
group by import_date,domain_id,hot_words,number,mysql_id) a
left join
(select import_date,domain_id,hot_words,collect_set(search_word)[0] search_word,number,
mysql_id,collect_set(rank)[0] rank,count(1) total
from hot_words_view.vvv4_hot_words_new where leng >=2 and domain_id='$1' and import_date='$2' group by import_date,domain_id,hot_words,number,mysql_id) b
on a.number=b.number and a.import_date=b.import_date and a.domain_id=b.domain_id
where a.total <= b.total;";

hive

1 Merge small files
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=32000…
set hive.merge.smallfiles.avgsize=12…
2 Compress files (compress the final output)
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set mapred.output.compression.type=BLOCK;
3 Intermediate compression
set hive.exec.compress.intermediate=true;
set hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.intermediate.compression.type=BLOCK;
4 Dynamic partitions (dynamic partition inserts are disabled by default before Hive 0.9.0 and enabled by default in Hive 0.9.0 and later; the related configuration properties are:)
set hive.mapred.mode=nonstrict;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000; (default 100)
5 Data skew caused by GROUP BY
set hive.groupby.skewindata=true;
6 Automatic map join
set hive.auto.convert.join=true;
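7 Automatic local mode (run on a single node)
Hive can choose local mode for task execution. Most Hadoop jobs need the full scalability of Hadoop to process large data sets. Sometimes, however, the input to a Hive query is very small, and the overhead of launching the tasks can far exceed the actual run time of the job. In most of these cases Hive can process all the tasks on a single machine in local mode, which shortens execution time noticeably for small data sets.
set hive.exec.mode.local.auto=true; (default false)
set hive.exec.mode.local.auto.inputbytes.max=134217728; (default 128 MB)
set hive.exec.mode.local.auto.tasks.max=2; (default 4; newer Hive releases use hive.exec.mode.local.auto.input.files.max instead)
A job actually runs in local mode only when all of the following hold:
1. The total input size of the job is smaller than hive.exec.mode.local.auto.inputbytes.max.
2. The number of map tasks of the job is smaller than hive.exec.mode.local.auto.tasks.max.
3. The number of reduce tasks of the job is 0 or 1.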
8 Limit the number of reducers
set mapred.reduce.tasks=-1;
set hive.exec.reducers.bytes.per.reducer=1000… (default 1000 MB);
set hive.exec.reducers.max=999; (default)
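As a rough illustration of how these settings interact: when mapred.reduce.tasks is -1, Hive estimates the reducer count from the input size. The formula below is a simplified sketch of that estimate. The 10 GB input size is a made-up figure, and the 1 GB per reducer value assumes the default quoted above.

```latex
N_{\text{reducers}}
  = \min\!\left(\texttt{hive.exec.reducers.max},\
      \left\lceil \frac{\text{input bytes}}{\texttt{hive.exec.reducers.bytes.per.reducer}} \right\rceil\right)
  = \min\!\left(999,\ \left\lceil \frac{10 \times 10^{9}}{1 \times 10^{9}} \right\rceil\right)
  = 10
```

Lowering the bytes-per-reducer value raises the estimate, up to the cap set by the maximum.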
9 Set the number of reducers explicitly
set mapred.reduce.tasks=15;
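10 Configure map input merging
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; (merge small files before the map phase)
set hive.merge.mapfiles=true; (merge small files at the end of a map-only job)
set hive.merge.mapredfiles=false; (when true, merge small files at the end of a MapReduce job)
set hive.merge.size.per.task=25600000; (size of the merged files, in bytes)
set mapred.max.split.size=25600; (maximum split size per map, in bytes)
set mapred.min.split.size.per.node=1; (minimum split size on one node, in bytes)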
11 Set the execution engine
set hive.execution.engine=mr;
set hive.execution.engine=spark;
set hive.execution.engine=tez;
12 Queue
set mapred.job.queue.name=; (legacy property name; fill in the queue name)
set mapreduce.job.queuename=etl; (when using native MapReduce)
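13 Merge small input files
set mapred.max.split.size=256000000; (maximum input size handled by each map, 256 MB)
set mapred.min.split.size.per.node=10000; (minimum split size on one node)
set mapred.min.split.size.per.rack=1000; (minimum split size under one rack switch)
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; (with this input format enabled, multiple small files on one DataNode are combined into a single split; how many files go into one split is bounded by mapred.max.split.size)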

tez

1 Use a Tez queue
set tez.queue.name=user;
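2 Merge large and small files
set tez.queue.name=usershell; (queue to submit to)
set hive.execution.engine=tez; (execution engine)
set hive.merge.tezfiles=true; (enable merging)
set hive.merge.smallfiles.avgsize=16000000; (merge threshold: a merge runs when the average output file size is below 16000000 bytes, about 16 MB)
set hive.merge.size.per.task=128000000; (target size of merged files: once a merged file reaches 128000000 bytes, about 128 MB, merging continues in another file)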

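Hive optimization
1. Fetch task: set hive.fetch.task.conversion to more in the configuration file. With this setting, SELECT * queries, queries that select specific columns, and LIMIT queries do not launch a MapReduce job.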
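A minimal sketch of queries that the fetch setting is meant to cover, set at session level instead of in the configuration file. The table demo_logs and its columns are hypothetical names used only for illustration.

```sql
-- Hypothetical table: demo_logs(user_id STRING, url STRING, ...)
set hive.fetch.task.conversion=more;

-- Full-row select with LIMIT: expected to be served by a fetch task
SELECT * FROM demo_logs LIMIT 10;

-- Column projection with LIMIT: also a candidate for a fetch task
SELECT user_id, url FROM demo_logs LIMIT 10;
```

Running EXPLAIN on either query is one way to check whether a MapReduce stage is still planned.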

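2. Local mode: most Hadoop jobs need the full scalability of Hadoop to process large data sets. Sometimes, though, the input to a Hive query is very small, and launching the job can take longer than running it. In that case local mode lets Hive process the small input on a single machine.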

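3. GROUP BY: by default, all rows with the same key are sent from the map phase to a single reducer, so one very large key causes skew. Not every aggregation has to be finished on the reduce side; many can be partially aggregated on the map side first, with the reduce side producing the final result. (1) Enable map-side aggregation: hive.map.aggr=true. (2) Number of rows handled per map-side aggregation check: hive.groupby.mapaggr.checkinterval=100000. (3) Load-balance when the data is skewed: hive.groupby.skewindata=true.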
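The sketch below shows the settings applied to a skewed aggregation, followed by a manual two-stage rewrite that follows the same idea (spread each key randomly for a partial aggregation, then combine by key). The table demo_events, the column user_key, and the salt range of 10 are assumptions chosen for illustration.

```sql
-- Hypothetical table: demo_events(user_key STRING, ...)

-- Option 1: let Hive balance the load
set hive.map.aggr=true;
set hive.groupby.skewindata=true;
SELECT user_key, COUNT(*) AS cnt
FROM demo_events
GROUP BY user_key;

-- Option 2: manual two-stage aggregation with a random salt
SELECT user_key, SUM(partial_cnt) AS cnt
FROM (
  -- Stage 1: each key is split across up to 10 salted groups
  SELECT user_key, salt, COUNT(*) AS partial_cnt
  FROM (
    SELECT user_key, CAST(FLOOR(RAND() * 10) AS INT) AS salt
    FROM demo_events
  ) salted
  GROUP BY user_key, salt
) p
-- Stage 2: combine the partial counts per key
GROUP BY user_key;
```

The manual form only works for aggregates that can be combined from partial results, such as COUNT and SUM.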

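4. COUNT(DISTINCT): on a large data set, a COUNT(DISTINCT) has to be completed by a single reducer, which becomes the bottleneck. Deduplicate with GROUP BY first, then count the result.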
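A minimal sketch of the rewrite, assuming a hypothetical table demo_events with a user_id column:

```sql
-- Original form: the distinct count is funneled through one reducer
SELECT COUNT(DISTINCT user_id) FROM demo_events;

-- Rewritten form: GROUP BY deduplicates across many reducers,
-- and the outer query only counts the already-distinct values.
-- COUNT(user_id) skips the NULL group, matching COUNT(DISTINCT) semantics.
SELECT COUNT(user_id)
FROM (
  SELECT user_id
  FROM demo_events
  GROUP BY user_id
) t;
```

The rewrite adds an extra job, so it tends to pay off only when the data set is large.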

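5. Row and column filtering. Columns: select only the columns you need and avoid SELECT *; if the table is partitioned, filter on the partition column whenever possible. Rows: with an outer join, a filter on the secondary table that is written in the WHERE clause is applied only after the full join; put the filter in a subquery so that rows are removed before the join.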
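A sketch of the row-filtering advice with an outer join. The tables demo_orders and demo_users, their columns, and the date value are hypothetical.

```sql
-- Hypothetical tables:
--   demo_orders(order_id, user_id)
--   demo_users(user_id, city), partitioned by dt

-- Filter in WHERE: the condition on demo_users is evaluated after the join
SELECT o.order_id, u.city
FROM demo_orders o
LEFT JOIN demo_users u ON o.user_id = u.user_id
WHERE u.dt = '2020-01-01';

-- Filter in a subquery: demo_users is cut down to one partition before the join,
-- and only the needed columns are read
SELECT o.order_id, u.city
FROM demo_orders o
LEFT JOIN (
  SELECT user_id, city
  FROM demo_users
  WHERE dt = '2020-01-01'
) u ON o.user_id = u.user_id;
```

The two queries are not strictly equivalent: the WHERE version also drops orders without a matching user, while the subquery version keeps them with a NULL city. Whether the optimizer pushes the WHERE filter down on its own depends on the Hive version and settings.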

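6. Dynamic partition tuning: (1) Enable dynamic partitioning: hive.exec.dynamic.partition=true. (2) Set non-strict mode: hive.exec.dynamic.partition.mode=nonstrict. (3) In practice: create a partitioned table, load data into it, create the target partitioned table, set the dynamic partition properties and run the insert, then check the partitions of the target table.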
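The steps above can be sketched roughly as follows. The table and column names are hypothetical, and the source table is left unpartitioned here to keep the example short.

```sql
-- Hypothetical source and target tables
CREATE TABLE demo_src (id INT, name STRING, dt STRING);
CREATE TABLE demo_target (id INT, name STRING) PARTITIONED BY (dt STRING);

-- Enable dynamic partitioning in non-strict mode
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- The partition column must come last in the SELECT list;
-- Hive creates one partition per distinct dt value
INSERT INTO TABLE demo_target PARTITION (dt)
SELECT id, name, dt FROM demo_src;

-- Check which partitions were created
SHOW PARTITIONS demo_target;
```

If the insert produces more partitions per node than hive.exec.max.dynamic.partitions.pernode allows, the job fails, so that limit may need raising for high-cardinality partition columns.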

7. Merge small files: combine small files before the map phase to reduce the number of map tasks: set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.

8. Tune the number of map tasks and reduce tasks.

9. Parallel execution: a Hive query may be compiled into many stages, and not all of them depend on each other. Independent stages can run in parallel: set hive.exec.parallel=true.
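A small sketch of a query with independent stages that can benefit from this setting. The tables demo_a and demo_b are hypothetical; hive.exec.parallel.thread.number caps how many stages run at once.

```sql
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;  -- 8 is the documented default

-- The two aggregations do not depend on each other,
-- so their stages are candidates for parallel execution
SELECT src, cnt
FROM (
  SELECT 'a' AS src, COUNT(*) AS cnt FROM demo_a
  UNION ALL
  SELECT 'b' AS src, COUNT(*) AS cnt FROM demo_b
) u;
```

Parallel stages compete for the same cluster resources, so the gain is largest when the cluster has spare capacity.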

10. JVM reuse: the drawback is that a reused JVM keeps its task slot occupied until the whole job finishes.
