1.HDFS读写过程
1.1HDFS读流程1.2 HDFS写流程 2.rows between
2.1 通过题目理解2.2 总结 3.连续登陆
1.HDFS读写过程 1.1HDFS读流程RPC:远程过程调用
HDFS client 调用FileSystem.open(filePath), 与NN进行【RPC】通信,返回该文件的block列表
返回【FSDataInputStream】对象
若block列表较长,则分多次返回,多次运行下面流程
HDFS client调用【FSDataInputStream】的read方法 与第一个块最近的DN进行读取,读取完成后,检查是否ok
如果ok,机会关闭与DN的通信 如果不ok,就从第二个节点去读取,以此类推 当block列表全部读取完成了,
HDFS client调用FSDataInpuitStream的close方法
关闭数据流
HDFS client 调用FileSystem.create(filePath)方法
与NN进行【RPC】通信
NN会检查这个文件夹是否存在?是否有权限创建这个文件
如果都可以,就创建一个文件
但是这个时候是没有数据的,也不关联block
NN会再根据 文件的大小,块大小,副本数等
计算要上传多少的块和对应的DN节点
最终这个信息返回给客户端【FSDataOutputStream】对象
HDFS client调用客户端【FSDataOutputStream】的write方法
根据NN返回的信息,将第一个块的第一个副本写到DN1
写完复制到DN2,再复制到DN3
当我们三个副本写完,DN3返回ack【确认字符】到DN2
DN2接收到ack,返回ack给DN1
DN1接收到ack,返回ack给到【FSDataOutputStream】
告诉它第一个块的三副本写完了
以此类推
当所有的块全部写完
HDFS client调用【FSDataOutputStream】的close方法
关闭数据流
然后调用FileSystem.complete方法,告诉NN文件写成功。
例.求近5天的登陆次数总和
SELECt x.*, sum( click_pv ) over ( PARTITION BY spu_id ORDER BY pt rows BETWEEN current ROW AND 4 following ) FROM click AS x
结果为
- 求每个spu_id的click_pv之和
SELECt spu_id, pt, click_pv, sum( click_pv ) over ( PARTITION BY spu_id ORDER BY pt ) FROM click
- 求每个spu_id,当前日期和之后所有日期的click_pv之和
SELECt spu_id, pt, click_pv, sum( click_pv ) over ( PARTITION BY spu_id ORDER BY pt rows BETWEEN current ROW AND unbounded following ) FROM click
3. 求每个spu_id,当前日期和之前所有日期的click_pv之和
SELECt spu_id, pt, click_pv, sum( click_pv ) over ( PARTITION BY spu_id ORDER BY pt rows BETWEEN unbounded preceding AND current ROW ) FROM click
4. 求每个spu_id,当前日期和之前2天的click_pv之和
SELECt spu_id, pt, click_pv, sum( click_pv ) over ( PARTITION BY spu_id ORDER BY pt rows BETWEEN 2 preceding AND current ROW ) FROM click
- 求每个spu_id,当前日期和之后2天的click_pv之和
SELECt spu_id, pt, click_pv, sum( click_pv ) over ( PARTITION BY spu_id ORDER BY pt rows BETWEEN current ROW AND 2 following ) FROM click
- 求每个spu_id,之前2天和之后2天的click_pv之和
SELECt spu_id, pt, click_pv, sum( click_pv ) over ( PARTITION BY spu_id ORDER BY pt rows BETWEEN 2 preceding AND 2 following ) FROM click
但是在有些数据库内开窗函数不写rows between会报错
所以回到第一题研究默认情况下的rows between的写法
1.求每个spu_id的click_pv之和
SELECt spu_id, pt, click_pv, sum( click_pv ) over ( PARTITION BY spu_id ORDER BY pt rows BETWEEN unbounded preceding AND unbounded following ) FROM click2.2 总结
unbounded preceding 前所有行
n preceding 前面n行
current row 当前行
n following 后面n行
unbounded following 后面所有行
- 每个id的最大连续天数 找参考系
SELECt id, max( cnt ) FROM ( SELECt id, date_sub, count( 1 ) AS cnt FROM ( SELECt id, login_date, row_number() over ( PARTITION BY id ORDER BY login_date ), dayofyear( login_date )- row_number() over ( PARTITION BY id ORDER BY login_date ) AS date_sub FROM login ) as t1 GROUP BY id, date_sub ) AS x GROUP BY id;
但是这种解法仅限于当年有跨年的数据就不能实现了
所以使用一种通用的方法
SELECt id, max( cnt ) FROM ( SELECt id, date_sub, count( 1 ) AS cnt FROM ( SELECt id, login_date, row_number() over ( PARTITION BY id ORDER BY login_date ), DATE_SUB( login_date, INTERVAL row_number() over ( PARTITION BY id ORDER BY login_date ) day) AS date_sub FROM login ) as t1 GROUP BY id, date_sub ) as t2 GROUP BY id;
可以看到结果变为正确的
- 每个id最大连续登陆天数和对应的开始时间和结束时间
SELECt id, max_login_date, min_login_date, cnt FROM ( SELECt id, max_login_date, min_login_date, cnt, max( cnt ) over ( PARTITION BY cnt ORDER BY date_sub DESC ) AS max_cnt FROM ( SELECt id, date_sub, max( login_date ) AS max_login_date, min( login_date ) AS min_login_date, count( 1 ) AS cnt FROM ( SELECt id, login_date, row_number() over ( PARTITION BY id ORDER BY login_date ), date_sub( login_date, INTERVAL row_number() over ( PARTITION BY id ORDER BY login_date ) DAY ) AS date_sub FROM login ) as t1 GROUP BY id, date_sub )as t3) as t2 WHERe cnt = max_cnt;



