1.上线该优化后任务性能表现
任务执行时长:
任务一:该任务为wps_android国内应用的天任务耗时截图,执行时长缩短3-4倍。该任务是天任务中最长的任务,也就是说以现在的任务流程部署情况,整体天任务执行时长缩短3倍以上。
任务二:该任务为国内应用的周任务耗时截图,执行时长缩短3-4倍。
任务三:该任务为数仓单个子任务中耗时最长的任务(30天活跃任务),执行时间缩短5-6倍。其中任务的执行时间包括执行hive sql查询的时间+数据写入RDS的时间。
任务四:该任务为数仓月任务耗时截图,执行时长整体缩短2-3倍。
hive SQL数:
数仓计算天任务SQL数从2W降低到8K左右
2.原理分析
grouping sets 语法是hive0.10版引进的高级函数。
我们知道,下面sql
SELECt month, day,COUNT(DISTINCT device_id) AS uv,grouping_id FROM dwd_test GROUP BY month,day GROUPING SETS (month,day,(month,day)) ORDER BY grouping_id ;
等价于
SELECt month,NULL,COUNT(DISTINCT device_id) AS uv,1 AS grouping_id FROM dwd_test GROUP BY month UNIOn ALL SELECt NULL,day,COUNT(DISTINCT device_id) AS uv,2 AS grouping_id FROM dwd_test GROUP BY day UNIOn ALL SELECt month,day,COUNT(DISTINCT device_id) AS uv,3 AS grouping_id FROM dwd_test GROUP BY month,day
查询结果
那为什么我们优化SQL时不直接使用UNIOn ALL语法呢?
因为使用UNIOn ALL合并分组集的缺点是代码太长并且性能比较低下,因为每一个查询都会分别单独扫描一次源表。
而使用GROUPINGS SETS合并分组集可明显减少代码长度,并且能够优化扫描源表的次数,不必为每个分组集单独对表源进行扫描,从而提高性能。
EXPLAIN
SELECt SOURCE,
app_version,
channel,
sum(front_start_pv) AS front_start_pv,
count(distinct(if(front_start_pv>0,deviceId,NULL))) AS front_start_uv,
sum(all_start_pv) AS all_start_pv,
count(distinct(deviceId)) AS all_start_uv
FROM
(SELECt SOURCE,
nvl(app_version,'') AS app_version,
nvl(channel,'') AS channel,
deviceId,
sum(front_start_pv) AS front_start_pv,
sum(all_start_pv) AS all_start_pv
FROM dw_seek_edu.dwd_app_start_d
WHERe dt = '2020-04-21'
AND SOURCE='dw'
GROUP BY SOURCE,
nvl(app_version,''),
nvl(channel,''),
deviceId) a
GROUP BY SOURCE,
app_version,
channel
GROUPING
SETS ((SOURCE), (SOURCE,
app_version), (SOURCE,
channel), (SOURCE,
app_version,
channel))
explain SELECt SOURCE, app_version, channel, province, model, sum(front_start_pv), sum(if(front_start_pv>0,1,0)), sum(all_start_pv), sum(1) FROM (SELECt SOURCE, app_version, channel, province, model, deviceId, sum(front_start_pv) AS front_start_pv, sum(all_start_pv) AS all_start_pv FROM dw_seek_edu.dwd_app_start_d WHERe dt='2020-04-21' AND SOURCE='dw' GROUP BY SOURCE, app_version, channel, province, model, deviceId) a GROUP BY SOURCE, app_version, channel, province, model
对于SQL分析执行计划。图一为合并后的sql,图二为合并前四条SQL中的一条。从图中可知,grouping sets语法优化后的sql还是一个map一个reduce完成所有指标的计算,并没有更多的计算过程,最后多一个reduce完成数据的整理。
3.使用中遇到的坑
一、数据上报null值与grouping sets合并各维度组合产生的null区分
SQL1 SELECt SOURCE, app_version, channel, sum(front_start_pv) AS front_start_pv, count(distinct(if(front_start_pv>0,deviceId,NULL))) AS front_start_uv, sum(all_start_pv) AS all_start_pv, count(distinct(deviceId)) AS all_start_uv FROM (SELECt SOURCE, nvl(app_version,'') AS app_version, nvl(channel,'') AS channel, deviceId, sum(front_start_pv) AS front_start_pv, sum(all_start_pv) AS all_start_pv FROM dw_seek_edu.dwd_app_start_d WHERe dt = '2020-04-21' AND SOURCE='dw' GROUP BY SOURCE,nvl(app_version,''),nvl(channel,''),deviceId ) a GROUP BY SOURCE,app_version,channel GROUPING SETS ((SOURCE), (SOURCE,app_version), (SOURCE,channel), (SOURCE,app_version,channel))
我们可以通过nvl(app_version,'') 函数对上报的null值填充,这样结果中所有的null值就都是组合产生的null值了。
二、不同维度组合join中关联条件不同容易引起数据的不一致
SQL2 SELECt SOURCE, app_version, channel, count(distinct(a.deviceId)) AS uv FROM (SELECt SOURCE, nvl(app_version,'') AS app_version, nvl(channel,'') AS channel, deviceId FROM dw_wps_android.dwd_app_start_d WHERe dt='2020-05-24' AND SOURCE='dw' AND front_start_pv>0 GROUP BY SOURCE, nvl(app_version,''), nvl(channel,''), deviceId) a JOIN (SELECt SOURCE, deviceId FROM dw_wps_android.dwd_app_start_d WHERe dt = '2020-06-23' AND SOURCE='dw' AND front_start_pv>0 GROUP BY SOURCE, deviceId) b ON a.source = b.source AND a.deviceId = b.deviceId GROUP BY a.source, app_version, channel GROUPING SETS ((a.source), (a.source, app_version), (a.source, channel), (a.source, app_version, channel))
SQL3 SELECt SOURCE, app_version, channel, count(distinct(a.deviceId)) AS uv FROM (SELECt SOURCE, nvl(app_version,'') AS app_version, nvl(channel,'') AS channel, deviceId FROM dw_wps_android.dwd_app_start_d WHERe dt='2020-05-24' AND front_start_pv>0 GROUP BY SOURCE, nvl(app_version,''), nvl(channel,''), deviceId) a JOIN (SELECt SOURCE, deviceId FROM dw_wps_android.dwd_app_start_d WHERe dt = '2020-06-23' AND front_start_pv>0 GROUP BY SOURCE, deviceId) b ON a.deviceId = b.deviceId GROUP BY a.source, app_version, channel GROUPING SETS ((), (app_version), (channel), (app_version, channel))
如果直接把这两个sql合成一个,会因为不同维度组合join时关联条件不同导致最终数据偏多。



