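Filter by airport (inner join) and aggregate before the UNION ALL to reduce the dataset passed to the final aggregation reducer. The UNION ALL subqueries with joins should run in parallel and finish faster than a single join over the larger dataset after the UNION ALL.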
    SELECT f.airport, SUM(cnt) AS Total_Flights
    FROM (
        SELECT a.airport, COUNT(*) AS cnt
        FROM flights_stats f
        INNER JOIN airports a ON f.Origin = a.iata AND a.country = 'USA'
        WHERE Cancelled = 0 AND Month IN (3,4)
        GROUP BY a.airport
        UNION ALL
        SELECT a.airport, COUNT(*) AS cnt
        FROM flights_stats f
        INNER JOIN airports a ON f.Dest = a.iata AND a.country = 'USA'
        WHERE Cancelled = 0 AND Month IN (3,4)
        GROUP BY a.airport
    ) f
    GROUP BY f.airport
    ORDER BY Total_Flights DESC
    LIMIT 10;
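For contrast, here is a rough sketch of the shape the query above avoids: the two legs are unioned first, and the join and aggregation run once over the combined, larger row set. It uses the same tables and columns; the alias `iata` inside the subquery is only an illustrative name. It should return the same result, but every filtered flight row is carried through the UNION ALL into a single join.

```sql
-- Sketch only: join AFTER UNION ALL (the slower shape being compared against)
SELECT a.airport, COUNT(*) AS Total_Flights
FROM (
    SELECT Origin AS iata FROM flights_stats
    WHERE Cancelled = 0 AND Month IN (3,4)
    UNION ALL
    SELECT Dest AS iata FROM flights_stats
    WHERE Cancelled = 0 AND Month IN (3,4)
) f
INNER JOIN airports a ON f.iata = a.iata AND a.country = 'USA'
GROUP BY a.airport
ORDER BY Total_Flights DESC
LIMIT 10;
```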
Tune the map join and enable parallel execution:

    set hive.exec.parallel=true;
    set hive.auto.convert.join=true; --this enables map-join
    set hive.mapjoin.smalltable.filesize=25000000; --size of table to fit in memory
Use Tez and vectorization, and tune mapper and reducer parallelism: https://stackoverflow.com/a/48487306/2700344
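As a minimal sketch of what those settings might look like in a session: the property names below are standard Hive and Tez settings, but the byte values are illustrative assumptions, not recommendations from this answer, and should be tuned against your own data volumes and cluster.

```sql
-- Run on Tez with vectorized execution
set hive.execution.engine=tez;
set hive.vectorized.execution.enabled=true;
set hive.vectorized.execution.reduce.enabled=true;

-- Mapper parallelism on Tez: smaller split groups mean more mappers
-- (values are assumed for illustration)
set tez.grouping.min-size=32000000;
set tez.grouping.max-size=67108864;

-- Reducer parallelism: fewer bytes per reducer means more reducers
-- (value is assumed for illustration)
set hive.exec.reducers.bytes.per.reducer=67108864;
```

Lowering the split and bytes-per-reducer sizes increases parallelism at the cost of more containers, so it is worth changing one setting at a time and comparing run times.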



