A Bad Experience Using Hive's Tez Engine Together with union all

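Yesterday, while computing a metric, I used the Tez engine and union all in the same query, and it went badly. The details:

1. When a union all query runs on the Tez engine, the output of each union all branch is written to its own subdirectory under the table directory (it looks like a partition directory), and a subsequent Sqoop export of that data fails:

(tool.ExportTool: Encountered IOException running export job: FileNotFoundException: Path is not a file)

The cause is that the path Sqoop tries to read is a directory, not an actual data file.

2. For example: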

-- Run the query on the Tez engine
set hive.execution.engine=tez;

-- Overwrite table test with the rows of both source tables
insert overwrite table test
select a, b, c
from table_a
union all
select a, b, c
from table_b;

The output on HDFS ends up in two subdirectories (two folders):

/apps/hive/warehouse/test.db/test/1/00000_0

/apps/hive/warehouse/test.db/test/2/00000_1

As a result, the Sqoop export by default tries to read /apps/hive/warehouse/test.db/test/1 as a file, but it is a directory, so the job fails with the file-not-found error (FileNotFoundException: Path is not a file).
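To confirm that this is what happened before exporting, the table directory can be listed recursively from inside the Hive CLI. This is a minimal sketch that assumes the example warehouse path used above; substitute the real location of your table.

```sql
-- Assumption: the table lives at the example path from this post.
-- `dfs` is the Hive CLI passthrough to the HDFS shell; -R lists recursively.
dfs -ls -R /apps/hive/warehouse/test.db/test;
```

If the listing shows data files only inside subdirectories such as 1 and 2, and none directly under the table directory, the export will hit the problem described here.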

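3. Cause:
1) On Tez, the branches of a union all are independent and run in parallel, each keeping its result in memory. A branch that finishes first cannot hold on to that memory while it waits for the others, so whichever branch finishes first writes to HDFS first, each into its own subdirectory.
2) On MR, computation is disk-based, so the result of every union all branch sits on disk, and everything is written to HDFS together at the end.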
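One way to check this explanation on your own cluster is to compare the execution plans Hive produces for the same statement under each engine. The sketch below reuses the example tables from this post (test, table_a, table_b); the plan text differs between Hive versions, so no expected output is shown.

```sql
-- Plan for the statement on Tez
set hive.execution.engine=tez;
explain
insert overwrite table test
select a, b, c from table_a
union all
select a, b, c from table_b;

-- Plan for the same statement on MR
set hive.execution.engine=mr;
explain
insert overwrite table test
select a, b, c from table_a
union all
select a, b, c from table_b;
```

explain only prints the plan and does not run the insert, so it is safe to try against a production table.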

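4. Solutions:
1) Use the MR engine.
2) Replace union all with union. As explained above, union all branches run in parallel, and on Tez each one writes to HDFS as soon as it finishes; union has to wait for all branches to finish and deduplicate the combined result before anything is written to HDFS.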
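The two workarounds can be sketched as follows, again using the example tables from this post. The second variant assumes a Hive version that accepts a bare union (union distinct) as a set operator.

```sql
-- Workaround 1: keep union all, but run the statement on the MR engine
set hive.execution.engine=mr;

insert overwrite table test
select a, b, c from table_a
union all
select a, b, c from table_b;

-- Workaround 2: stay on Tez and use union instead of union all
-- (assumption: your Hive version supports union without "all")
set hive.execution.engine=tez;

insert overwrite table test
select a, b, c from table_a
union
select a, b, c from table_b;
```

Keep in mind that union removes duplicate rows, so the second variant gives the same result as union all only when the combined data contains no duplicates that need to be kept.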

That's all.
