背景:为满足业务方分析需求,数仓基础统计中存在大量多维度的指标计算。原有计算中根据不同维度拆分多个sql串行提交hive集群进行查询,多个任务在map阶段存在重复读取源数据等问题,因此引入grouping sets语法合并业务逻辑为一条sql。
一次map一次reduce完成一个基础指标的统计,减少集群资源浪费。现以安卓国内版天活跃启动统计任务为例进行测试。
1.sql1 按province,model维度统计source为dw的前后台启动活跃数据。
sql1
SELECt SOURCE,
province,
model,
sum(front_start_pv),
sum(if(front_start_pv>0,1,0)),
sum(all_start_pv),
sum(1)
FROM
(SELECt SOURCE,
province,
model,
deviceId,
sum(front_start_pv) AS front_start_pv,
sum(all_start_pv) AS all_start_pv
FROM dw_wps_android.dwd_app_start_d
WHERe dt='2020-04-04'
AND SOURCE='dw'
GROUP BY SOURCE,
province,
model,
deviceId) a
GROUP BY SOURCE,
province,
model
2.sql2 按channel,province,model维度统计source为dw的前后台启动活跃数据。
sql2
SELECt SOURCE,
channel,
province,
model,
sum(front_start_pv),
sum(if(front_start_pv>0,1,0)),
sum(all_start_pv),
sum(1)
FROM
(SELECt SOURCE,
channel,
province,
model,
deviceId,
sum(front_start_pv) AS front_start_pv,
sum(all_start_pv) AS all_start_pv
FROM dw_wps_android.dwd_app_start_d
WHERe dt='2020-04-04'
AND SOURCE='dw'
GROUP BY SOURCE,
channel,
province,
model,
deviceId) a
GROUP BY SOURCE,
channel,
province,
model
sql3 按app_version,province,model维度统计source为dw的前后台启动活跃数据。
sql3 SELECt SOURCE, app_version, province, model, sum(front_start_pv), sum(if(front_start_pv>0,1,0)), sum(all_start_pv), sum(1) FROM (SELECt SOURCE, app_version, province, model, deviceId, sum(front_start_pv) AS front_start_pv, sum(all_start_pv) AS all_start_pv FROM dw_wps_android.dwd_app_start_d WHERe dt='2020-04-04' AND SOURCE='dw' GROUP BY SOURCE, app_version, province, model, deviceId) a GROUP BY SOURCE, app_version, province, model
4.sql4 按app_version,channel,province,model维度统计source为dw的前后台启动活跃数据。
sql4 SELECt SOURCE, app_version, channel, province, model, sum(front_start_pv), sum(if(front_start_pv>0,1,0)), sum(all_start_pv), sum(1) FROM (SELECt SOURCE, app_version, channel, province, model, deviceId, sum(front_start_pv) AS front_start_pv, sum(all_start_pv) AS all_start_pv FROM dw_wps_android.dwd_app_start_d WHERe dt='2020-04-04' AND SOURCE='dw' GROUP BY SOURCE, app_version, channel, province, model, deviceId) a GROUP BY SOURCE, app_version, channel, province, model
5.sql5 按province,model维度统计source为所有平台的前后台启动活跃数据
sql5
SELECt 'all' AS SOURCE,
province,
model,
sum(front_start_pv),
sum(if(front_start_pv>0,1,0)),
sum(all_start_pv),
sum(1)
FROM
(SELECt 'all' AS SOURCE,
province,
model,
deviceId,
sum(front_start_pv) AS front_start_pv,
sum(all_start_pv) AS all_start_pv
FROM dw_wps_android.dwd_app_start_d
WHERe dt='2020-04-04'
GROUP BY 'all',
province,
model,
deviceId) a
GROUP BY 'all',
province,
model
6.sql6 按channel,province,model维度统计source为所有平台的前后台启动活跃数据
sql6
SELECt 'all' AS SOURCE,
channel,
province,
model,
sum(front_start_pv),
sum(if(front_start_pv>0,1,0)),
sum(all_start_pv),
sum(1)
FROM
(SELECt 'all' AS SOURCE,
channel,
province,
model,
deviceId,
sum(front_start_pv) AS front_start_pv,
sum(all_start_pv) AS all_start_pv
FROM dw_wps_android.dwd_app_start_d
WHERe dt='2020-04-04'
GROUP BY 'all',
channel,
province,
model,
deviceId) a
GROUP BY 'all',
channel,
province,
model
7.sql7 按app_version,province,model维度统计source为所有平台的前后台启动活跃数据
sql7
SELECt 'all' AS SOURCE,
app_version,
province,
model,
sum(front_start_pv),
sum(if(front_start_pv>0,1,0)),
sum(all_start_pv),
sum(1)
FROM
(SELECt 'all' AS SOURCE,
app_version,
province,
model,
deviceId,
sum(front_start_pv) AS front_start_pv,
sum(all_start_pv) AS all_start_pv
FROM dw_wps_android.dwd_app_start_d
WHERe dt='2020-04-04'
GROUP BY 'all',
app_version,
province,
model,
deviceId) a
GROUP BY 'all',
app_version,
province,
model
8.sql8 按app_version,channel,province,model维度统计source为所有平台的前后台启动活跃数据
sql8
SELECt 'all' AS SOURCE,
app_version,
channel,
province,
model,
sum(front_start_pv),
sum(if(front_start_pv>0,1,0)),
sum(all_start_pv),
sum(1)
FROM
(SELECt 'all' AS SOURCE,
app_version,
channel,
province,
model,
deviceId,
sum(front_start_pv) AS front_start_pv,
sum(all_start_pv) AS all_start_pv
FROM dw_wps_android.dwd_app_start_d
WHERe dt='2020-04-04'
GROUP BY 'all',
app_version,
channel,
province,
model,
deviceId) a
GROUP BY 'all',
app_version,
channel,
province,
model
9.sql9 合并source,app_version,channel,province,model维度组合统计前后台启动活跃数据
sql9
SELECt nvl(source,"all") as source,
app_version,
channel,
province,
model,
sum(front_start_pv) as front_start_pv,
sum(if(front_start_pv>0,1,0)) as front_start_uv,
sum(all_start_pv) as all_start_pv,
sum(1) as all_start_uv
FROM
(SELECt source,
nvl(app_version,"dw_null") as app_version,
nvl(channel,"dw_null") as channel,
nvl(province,"dw_null") as province,
nvl(model,"dw_null") as model,
deviceId,
sum(front_start_pv) as front_start_pv,
sum(all_start_pv) as all_start_pv
FROM dw_wps_android.dwd_app_start_d
WHERe dt='2020-04-04'
GROUP BY source,
nvl(app_version,"dw_null"),
nvl(channel,"dw_null"),
nvl(province,"dw_null"),
nvl(model,"dw_null"),
deviceId) a
GROUP BY source,
app_version,
channel,
province,
model
GROUPING SETS(
(source,province,model),
(source,channel,province,model),
(source,app_version,province,model),
(source,app_version,channel,province,model),
(province,model),
(channel,province,model),
(app_version,province,model),
(app_version,channel,province,model)
)HAVINg source='dw' OR source is null
| sql行号 | 数据行 | time(秒) |
|---|---|---|
| sql1 | 191618 | 54.289 |
| sql2 | 349692 | 186.34 |
| sql3 | 811334 | 48.341 |
| sql4 | 1059390 | 50.373 |
| sql5 | 197314 | 66.308 |
| sql6 | 405496 | 68.312 |
| sql7 | 1077795 | 152.358 |
| sql8 | 1376569 | 182.347 |
| 总和 | 5469208 | 808.668 |
| sql行号 | 数据行 | time(秒) |
|---|---|---|
| sql9 | 5469208 | 134.432 |
说明:sql都是都测试队列相同集群环境下测试的。都是在container复用的情况下进行测试,排除了第一个sql从0申请资源执行时间过长的因素。
结论:采用grouping set语法进行sql合并能有效减少hive任务执行时间,同时以上9个sql占用的集群资源基本相当,优化后会大量减少集群资源的消耗。之后会测试数仓批量任务执行优化前后的时间对比,进行更全面的对比。



