- 数据汇总层:进行轻度聚合,不跨主题域,提升公共指标的复用性。
(1)DWS层数据对于DWD层进行轻度汇总,比如sum或count操作。
(2)DWS层数据业务场景不会跨主题域,且表内指标都是相同统计范围,比如最近1天,或者累加至今,原则上一般不做维度冗余。
1.3、DWS层表结构(1)表结构:实体+度量
(2)命名规则:dws_业务表名__di
(3)生命周期:182层
2、DWS层数据处理 2.1、进行轻度聚合形成宽表模型- 根据不同维度组成多个宽表。
(1)最细粒度:维度1,维度2,维度3,维度4…
(2)宽表力度:维度1,维度2,维度3
2.2、DWS层数据应用(1)主要对外开放的数据,也是使用频率最高的部分,该层的核心在于计算和分许
(2)将上册数据整合成通用数据,提供指标汇总层
3、DWS实例(用户全量表) 绪论- 需求:统计小说用户全量表,日更新
create table ods.ods_ios_di( p_date string ,etl_time string ,server_ip string ,log_time string ,guid string ,search_words string ,age String ,city String ,phone string ,appid string ,msgid string ,httpcode string ,httpcontent string ,msg string ) partitioned by (`pt` string) row format delimited fields terminated by ',';3.1、创建Hive目标表
(1)编写建表脚本
create table dws.dws_novel_guid( guid string ) partitioned by (`pt` string) row format delimited fields terminated by ',';
(2)配置SQL组件创建表
- 数据源:hive::dws
- sql类型:非查询
(3)上线部署
- 上线后进入画布
(4)运行脚本
3.2、编写业务逻辑进行数据测试(1)编写业务逻辑脚本
insert overwrite table dws.dws_novel_guid partition (pt=${pt})
select if(t2.guid is not null,t2.guid,t1.guid) as guid
from (
select guid
from ods.ods_ios_di
where pt = ${pt}
group by guid
) t1
full join
(select guid
from dws_novel_guid
where pt = ${pt}-1
group by guid
)
on t1.guid = t2.guid
(2)配置SQL组件创建表
- 数据源:hive::dws
- sql类型:非查询
- 自定义参数: p t 、 {pt} 、 pt、{pt_one}
- 前置SQL
alter table dws.dws_novel_guid drop if exists partition(pt=${pt})
alter table dws.dws_novel_guid add if not exists partition (pt=${pt})
(4)部署上线
- 上线后进入画布
(5)运行测试
- 进入画布,运行数据链路的补充节点(前置任务已跑多日)
- 补数,跑一周数据后进行对数
- 串行执行:资源不足时使用
- 并行执行:资源充足时使用
(1)调度周期
- 每天凌晨6点调度前一天数据
(2)失败策略
- 失败策略:结束
(3)通知策略
- 通知策略:失败发
- 需求:统计每个人的小说阅读pv,是否新用户
create table dwd.dwd_ios_di( p_date string ,etl_time string ,server_ip string ,log_time string ,guid string ,search_words string ,age String ,city String ,phone string ,appid string ,msgid string ,httpcode string ,httpcontent string ,msg string ) partitioned by (`pt` string) row format delimited fields terminated by ',';4.1、创建Hive表
(1)创建Hive目标表
CREATE TABLE dws.dws_npvel_guid_pv_di( p_date string COMMENT '分区日期', platform string COMMENT '手机平台', version string COMMENT '手机版本', read_pv string COMMENT '小说pv', channel_id string COMMENT '小说渠道', top_channel string COMMENT '一级渠道名称', second_channel string COMMENT '二级渠道名称', thrid_channel string COMMENT '三级渠道名称', is_new string COMMENT '是否新用户', ) partitioned by (`pt` string) row format delimited fields terminated by ',';
(2)配置SQL组件创建表
- 数据源:hive::dws
- sql类型:非查询
(3)上线部署
- 上线后进入画布
(4)运行脚本
4.2、编写业务逻辑进行数据测试(1)编写业务逻辑
insert overwrite table dws.dws_npvel_guid_pv_di partition (pt=${pt})
select
${pt} as p_date
,t1.guid as guid
,if(split(t1.phone,'-')[0] in ('小米','华为'),'android','ios') as platform
,nvl(split(t1.phone,'-')[0],'未知') as version
,sum(guid_novel_pv) as guid_novel_pv
,channel_id as channel_id
,top_channel_name
,second_channel_name
,thrid_channel_name
,case when t2.guid is not null then 1 else 0 end as is_new
case when guid is not null then 1 else 0 end as is is_new
from
(
guid
,sum(guid_pv) as guid_novel_pv
,phone
,appid as channel_id
,split(appid,'_')[0] as top_channel
,split(appid,'_')[1] as second_channel
,split(appid,'_')[2] as thrid_channel
from dwd.dwd_ios_di
where pt = ${pt}
and appid in ('1_2_3_4','2_3_4_5','3_4_5_6','4_5_6_7')
group by
guid,phone,appid,split(appid,'_')[0],split(appid,'_')[1],split(appid,'_')[2]
) t1
full join
(
select guid
--小用户全量表
from dws.dwd_novel_guid
where pt=${pt_one}
and guid RLIKE '^[4-9A-Z]{24}$'
) t2
on t1.guid = t2.guid
left join
(
-- 渠道维表
select top_channel,top_channel_name,second_channel,second_channel_name, thrid_channel,thrid_channel_name
from dim.dim_novel_channel_di
where pt = ${pt}
) t3
on t1.top_channel = t3.top_channel and t1.second_channel = t3.second_channel and t1.thrid_channel = t3.thrid_channel
(2)配置SQL组件
- 数据源:hive::dws
- sql类型:非查询
- 自定义参数: p t 、 {pt} 、 pt、{pt_one}
- 前置SQL
alter table dws.dws_npvel_guid_pv_di drop if exists partition(pt=${pt})
alter table dws.dws_npvel_guid_pv_di add if not exists partition (pt=${pt})
(3)配置依赖组件
- 添加维表依赖,依赖于T-1的数据
- 添加全量表依赖,注意全量表依赖于T-2的数据
- 配置起始任务组件,方便运行调试任务流
(4)部署上线
- 上线后进入画布
(5)运行测试
- 进入画布,运行数据链路的起始节点
- 补数,跑一周数据后进行对数
- 串行执行:资源不足时使用
- 并行执行:资源充足时使用
(1)调度周期
- 每天凌晨6点调度前一天数据
(2)失败策略
- 失败策略:结束
(3)通知策略
- 通知策略:失败发



