- 概述
- 行转多列
- 数仓详细
- 数据路径
- 代码
- 1、数据准备
- 2、设置动态分区
- 3、第一天数据写入
- 数据查询
- 数据写入
- 4、第二天数据写入
- 数据查询
- 数据写入
- 补充
-
什么是事实表?
每行数据代表一个业务事件,通常有很多外键(地区、用户…)
业务事件可以是:下单、支付、退款、评价…
业务事件有数字度量,如:数量、金额、次数…
行数较多,列数较少
每天很多新增 -
事实表的分类
| 分类 | 说明 | 特点 | 场景 |
|---|---|---|---|
| 事务型事实表 | 以每个事务为单位 | 数据只追加不修改 | 一个订单支付 一笔订单退款 |
| 周期型快照事实表 | 保留固定时间间隔的数据 | 不会保留所有数据 | 点赞数 |
| 累积型快照事实表 | 跟踪业务事实的变化 | 数据可修改 | 订单状态 |
- 本文以订单状态表为例
1、按订单ID分组,聚合订单状态和时间,转为MAP
SELECt
order_id,
STR_TO_MAP(CONCAT_WS(',',COLLECT_SET(CONCAT(order_status,'=',operation_time))),',','=') m
FROM
ods_order_status
GROUP BY
order_id
打印结果
+--------+----------------------------------------------------------+ |order_id|m | +--------+----------------------------------------------------------+ |P2 |[end -> 2020-01-01 23:45:00, start -> 2020-01-01 22:45:00]| |P3 |[start -> 2020-01-01 23:30:00] | |P1 |[start -> 2020-01-01 08:00:00, end -> 2020-01-01 08:01:00]| +--------+----------------------------------------------------------+
2、按Key获取MAP值
WITH
t1 AS (
SELECt
order_id,
STR_TO_MAP(CONCAT_WS(',',COLLECT_SET(CONCAT(order_status,'=',operation_time))),',','=') m
FROM
ods_order_status
GROUP BY
order_id
)
SELECt
order_id,
m['start'] start_time,
m['end'] end_time
FROM
t1
打印结果
+--------+-------------------+-------------------+ |order_id|start_time |end_time | +--------+-------------------+-------------------+ |P2 |2020-01-01 22:45:00|2020-01-01 23:45:00| |P3 |2020-01-01 23:30:00|null | |P1 |2020-01-01 08:00:00|2020-01-01 08:01:00| +--------+-------------------+-------------------+数仓详细 数据路径
| 表名 | 表名 | 路径 | 策略 | 备注 |
|---|---|---|---|---|
| ods_order | 订单表 | sqoop > hdfs > ods | 增量变化同步 按create_time和operate_time | 此处省略 |
| ods_order_status | 订单状态流水表 | sqoop > hdfs > ods | 增量同步 按operate_time | |
| dwd_order | 订单表 | ods > dwd | 未结束订单写到9999-12-31分区 结束订单按结束日期写到日期分区 |
-- 建库:e-commerce
DROp DATAbase IF EXISTS ec CASCADE;
CREATE DATAbase ec LOCATION '/ec';
USE ec;
-- 建表:原始层,订单状态表
DROP TABLE IF EXISTS ec.ods_order_status;
CREATE TABLE ec.ods_order_status (
order_id STRING,
order_status STRING,
operation_time STRING)
PARTITIonED BY (ymd STRING)
LOCATION '/ec/ods_order_status';
-- 建表:明细层,订单(累积型快照事实)表
DROP TABLE IF EXISTS ec.dwd_order;
CREATE TABLE ec.dwd_order (
order_id STRING,
start_time STRING,
end_time STRING)
PARTITIonED BY (ymd STRING)
LOCATION '/ec/dwd_order';
-- 造数据,写到原始层
INSERT INTO TABLE ec.ods_order_status PARTITION(ymd='2020-01-01') VALUES
("P1","start","2020-01-01 08:00:00"),
("P1","end","2020-01-01 08:01:00"),
("P2","start","2020-01-01 22:45:00"),
("P3","start","2020-01-01 23:30:00");
INSERT INTO TABLE ec.ods_order_status PARTITION(ymd='2020-01-02') VALUES
("P3","end","2020-01-02 00:15:00"),
("P4","start","2020-01-02 06:30:00");
2、设置动态分区
-- 开启动态分区功能 SET hive.exec.dynamic.partition=true; -- 设置动态分区为非严格模式 SET hive.exec.dynamic.partition.mode=nonstrict;3、第一天数据写入 数据查询
WITH
t1 AS(
SELECT
order_id,
STR_TO_MAP(CONCAT_WS(',',COLLECT_SET(CONCAT(order_status,'=',operation_time))),',','=') m
FROM ec.ods_order_status
WHERe ymd='2020-01-01'
GROUP BY order_id
)
SELECt
order_id,
m['start'] start_time,
m['end'] end_time,
CASE
WHEN m['end'] IS NOT NULL THEN '2020-01-01'
ELSE '9999-12-31'
END ymd
FROM t1;
数据写入查询结果
注意:语法要求WITH写在INSERT前面
WITH
t1 AS(
SELECt
order_id,
STR_TO_MAP(CONCAT_WS(',',COLLECT_SET(CONCAT(order_status,'=',operation_time))),',','=') m
FROM ec.ods_order_status
WHERe ymd='2020-01-01'
GROUP BY order_id
)
INSERT OVERWRITE TABLE ec.dwd_order PARTITION(ymd)
SELECt
order_id,
m['start'] start_time,
m['end'] end_time,
CASE
WHEN m['end'] IS NOT NULL THEN '2020-01-01'
ELSE '9999-12-31'
END ymd
FROM t1;
4、第二天数据写入 数据查询写入后结果
WITH
t1 AS(
SELECt
order_id,
STR_TO_MAP(CONCAT_WS(',',COLLECT_SET(CONCAT(order_status,'=',operation_time))),',','=')m
FROM ec.ods_order_status
WHERe ymd='2020-01-02'
GROUP BY order_id
),
new AS(
SELECt
order_id,
m['start'] start_time,
m['end'] end_time,
CASE
WHEN m['end'] IS NOT NULL THEN '2020-01-02'
ELSE '9999-12-31'
END ymd
FROM
t1
),
old AS (SELECt * FROM ec.dwd_order WHERe ymd='9999-12-31')
SELECt
NVL(new.order_id,old.order_id) order_id,
NVL(new.start_time,old.start_time) start_time,
NVL(new.end_time,old.end_time) end_time,
NVL(new.ymd,old.ymd) ymd
FROM new
FULL OUTER JOIN old
ON new.order_id=old.order_id;
数据写入查询结果
注意:语法要求WITH写在INSERT前面
WITH
t1 AS(
SELECt
order_id,
STR_TO_MAP(CONCAT_WS(',',COLLECT_SET(CONCAT(order_status,'=',operation_time))),',','=')m
FROM ec.ods_order_status
WHERe ymd='2020-01-02'
GROUP BY order_id
),
new AS(
SELECt
order_id,
m['start'] start_time,
m['end'] end_time,
CASE
WHEN m['end'] IS NOT NULL THEN '2020-01-02'
ELSE '9999-12-31'
END ymd
FROM
t1
),
old AS (SELECt * FROM ec.dwd_order WHERe ymd='9999-12-31')
INSERT OVERWRITE TABLE ec.dwd_order PARTITION(ymd)
SELECt
NVL(new.order_id,old.order_id) order_id,
NVL(new.start_time,old.start_time) start_time,
NVL(new.end_time,old.end_time) end_time,
NVL(new.ymd,old.ymd) ymd
FROM new
FULL OUTER JOIN old
ON new.order_id=old.order_id;
补充写入后结果
上面的订单状态比较简单,这个全一点,SQL的思路是一样的
| 状态 | 时间字段 | 说明 | 备注 |
|---|---|---|---|
| 待支付 | create_time | 创建时间 | |
| 已支付 | pay_time | 支付时间 | |
| 确认收货 | /confirm/i_time | 确认时间 | 到货后7天内,买家可主动确认收货或退款;7天后没有操作将会自动确认收货 |
| 已取消 | cancel_time | 取消时间 | 下单后支付前,主动取消订单 |
| 支付过期 | overdue_time | 过期时间 | 下单后1小时内没有支付 |
| 退款中 | refund_time | 退款申请时间 | |
| 退款完成 | refund_finish_time | 退款完成时间 | |
| 结束 | end_time | 结束时间 |
WITH
t1 AS(
SELECt
order_id,
STR_TO_MAP(CONCAT_WS(',',COLLECT_SET(CONCAT(order_status,'=',operation_time))),',','=') m
FROM ec.ods_order_status
WHERe ymd='2020-01-01'
GROUP BY order_id
)
SELECt
order_id,
m['已支付'] pay_time,
m['已取消'] cancel_time,
m['确认收货'] /confirm/i_time,
m['退款中'] refund_time,
m['退款完成'] refund_finish_time,
m['支付过期'] overdue_time,
m['结束'] end_time,
CASE
WHEN m['结束'] IS NOT NULL THEN '昨天'
ELSE '9999-12-31'
END ymd
FROM t1;
另外,订单状态表(ods_order_status)要和订单表(ods_order)连接,本文就不JOIN了



