Global and Local Sorting in Hive

3. Global and Local Sorting in Hive
- 3.1. Using a Single Reducer
- 3.2. Using Multiple Reducers
  - 3.2.1. order by
  - 3.2.2. sort by
  - 3.2.3. distribute by
  - 3.2.4. cluster by

3. Global and Local Sorting in Hive

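How does a distributed database implement operations such as order by and limit 10000,100?

My data set is genuinely large (too large to sort globally with order by), yet the requirement is still a global sort. What then?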

3.1. Using a Single Reducer

Set the parameter:

set mapreduce.job.reduces=1;

SQL:

select * from table order by age desc;

3.2. Using Multiple Reducers

create database if not exists exercise_db;
use exercise_db;
drop table if exists exercise_student;
create table exercise_student(id int, name string, sex string, age int, department string) row format delimited fields terminated by ',';
load data local inpath "/home/bigdata/students.txt" into table exercise_student;
select * from exercise_student;

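Requirement: produce a global sort and find the 3 oldest people, without using a single reduce task.

The final SQL, step by step:

Step 1: query the maximum and minimum age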

select max(age) as maxage, min(age) as minage from exercise_student;

Result:

+---------+---------+
| maxage  | minage  |
+---------+---------+
| 23      | 17      |
+---------+---------+

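If the data is roughly uniformly distributed, every partition can cover a range of the same fixed length.
If it is not, the ranges cannot be fixed-length and you need a statistical approach to choose them. Height, for example, follows a normal distribution.

A general technique: sampling! A sample shows how the data is distributed, and from that you can pick the boundaries.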

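It is simple. (Suppose you sampled 1 GB of data and want to split it into 10 partitions. The sample must be sorted first.)
1. Read from 0 to 100 MB and take the value of the bucketing field from the record at the 100 MB position.
2. Do the same for the 100 MB to 200 MB range.
....
This gives you the boundary values of the bucketing field for every range.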
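
The boundary-picking procedure above can be sketched in Python. This is a minimal illustration, not what Hive does internally: `split_points` is a hypothetical helper and the sample ages are made up. It sorts the sample and reads the value at each 1/k position, the same idea as taking the record at every 100 MB mark of a sorted 1 GB sample.

```python
def split_points(sample, num_partitions):
    """Return num_partitions - 1 boundary values taken from a sorted sample.

    Hypothetical helper for illustration: boundary i is the value found
    i/num_partitions of the way through the sorted sample.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if not sample:
        raise ValueError("sample must not be empty")
    ordered = sorted(sample)  # the sample must be sorted first
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]


# Made-up sample of the bucketing field (ages), for illustration only.
sample_ages = [19, 18, 20, 21, 17, 23, 19, 22, 18, 20, 19, 21]
boundaries = split_points(sample_ages, 3)
print(boundaries)  # [19, 21]
```

With 12 sampled values and 3 partitions, the boundaries are the values at positions 4 and 8 of the sorted sample. On a real table the sample would come from one of the sampling methods discussed next.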

How do you sample?

1. Sample after bucketing (take 100 rows, 100 MB, or 5% of the data).
2. select * from exercise_student distribute by rand() sort by rand() limit 3;

Step 2: check how many distinct ages there are

select distinct age from exercise_student order by age desc;

Result:

+------+
| age  |
+------+
| 23   |
| 22   |
| 21   |
| 20   |
| 19   |
| 18   |
| 17   |
+------+

Step 3: bucketing

set mapreduce.job.reduces=3;
insert overwrite directory "hdfs://hadoop277ha/hive_student_out_order3" 
select id,name,sex,age,department 
from exercise_student 
distribute by (
    case
    when age > 20 then 0
    when age > 18 then 1
    else 2
    end)
sort by age desc;

Check the result:

hadoop fs -ls /hive_student_out_order3
hadoop fs -cat /hive_student_out_order3/000000_0
hadoop fs -cat /hive_student_out_order3/000001_0
hadoop fs -cat /hive_student_out_order3/000002_0

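A MapReduce program can sort massive data sets in a distributed way. Here the same thing is done in SQL.

Detail: order by always ends up running in a single reduce task, so it is not workable when the data set is very large.

Approach: use multiple reduce tasks plus sort by for local sorting.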

set mapreduce.job.reduces = 10;
select id,name,sex,age,department from student sort by age desc;

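This does not produce a global sort, but a small change fixes it. The cause is that the default partitioning is hash-based.

Necessary condition: the partitions must be ordered among themselves. All data in the first partition is smaller than all data in the second, all data in the second is smaller than all data in the third, and so on (for a descending sort, reverse the comparison).

… The final approach: replace hash partitioning with range partitioning, and sort each partition in descending order, with the largest values in the first partition.
The ranges look like 0-100, 100-200, 200-300,…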

set mapreduce.job.reduces = 3;
insert overwrite directory "hdfs://hadoop277ha/hive_student_out_order3"
select id,name,sex,age,department
from exercise_student
distribute by (
    case
        when age > 20 then 0
        when age > 18 then 1
        else 2
    end)
sort by age desc;
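
To see why range partitioning plus a local sort gives a global order, here is a small Python simulation of the query above. It is only a sketch under assumptions: the ages are made up, `bucket_for` mirrors the CASE expression, and each CASE value is assumed to land on its own reduce task.

```python
def bucket_for(age):
    """Mirror the CASE expression: larger ages go to lower-numbered buckets."""
    if age > 20:
        return 0
    if age > 18:
        return 1
    return 2


def range_partitioned_sort(ages, num_buckets=3):
    """Route each age to its range bucket, then sort every bucket descending."""
    buckets = [[] for _ in range(num_buckets)]
    for age in ages:
        buckets[bucket_for(age)].append(age)           # distribute by (case ...)
    return [sorted(b, reverse=True) for b in buckets]  # sort by age desc


# Made-up ages, for illustration only.
ages = [19, 18, 20, 21, 21, 17, 23, 22, 19, 18]
buckets = range_partitioned_sort(ages)

# Reading the buckets in order 0, 1, 2 yields a globally sorted sequence.
merged = [age for bucket in buckets for age in bucket]
print(buckets)     # [[23, 22, 21, 21], [20, 19, 19], [18, 18, 17]]
print(merged[:3])  # [23, 22, 21]
```

Because every value in bucket 0 is larger than every value in bucket 1, and so on, the top 3 can be read from the start of the first output file without merging the others.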

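One question: how do you know that the boundaries of these three buckets should be 20 and 18?

A possible problem: the data may be unevenly distributed across the three buckets.

1. First determine max and min.
2. Then sample to learn how the data is distributed.
    Hive sampling options:
        1. Take 5% of the full data set
        2. Take 100 rows of the full data set
        3. Take 100 MB of the full data set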
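
Once you have a sample, you can check how evenly a candidate set of boundaries splits it before running the real job. The following Python sketch is illustrative only: `bucket_sizes` is a hypothetical helper and the sample ages are made up.

```python
from collections import Counter


def bucket_sizes(values, boundaries):
    """Count how many values fall into each range bucket.

    boundaries are descending cut-offs, e.g. [20, 18] means
    bucket 0: value > 20, bucket 1: value > 18, bucket 2: the rest.
    """
    counts = Counter()
    for value in values:
        bucket = next((i for i, b in enumerate(boundaries) if value > b),
                      len(boundaries))
        counts[bucket] += 1
    return [counts[i] for i in range(len(boundaries) + 1)]


# Made-up sample of ages, for illustration only.
sample = [19, 18, 20, 21, 21, 20, 18, 18, 21, 18, 17, 19]
sizes = bucket_sizes(sample, [20, 18])
ideal = len(sample) / len(sizes)  # size of a perfectly even bucket

print(sizes)               # [3, 4, 5]
print(max(sizes) / ideal)  # 1.25
```

A ratio close to 1 means the largest bucket is near the ideal size. A much larger ratio suggests moving the boundaries, since one reduce task would carry most of the work.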

select * from student order by age;
select * from student sort by age;
select * from student distribute by age;
select * from student cluster by age;

3.2.1. order by

Global sort! The sort is ultimately done by a single reduce task, even if you configure more than one.

If the data set is large, sorting with order by is unwise.

3.2.2. sort by

Local sort!

If the job runs with multiple reduce tasks, the output of each reduce task is sorted, but the combined output of all of them is not.
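
A short Python simulation makes the difference visible. It is a sketch under assumptions: the ages are made up, and rows are assumed to be routed to reducer `age % num_reducers` as a stand-in for hash partitioning. Any routing that is not range-based gives the same kind of result.

```python
def hash_partitioned_sort(ages, num_reducers):
    """Simulate sort by age desc with hash-style partitioning.

    Assumption for illustration: a row goes to reducer age % num_reducers.
    """
    reducers = [[] for _ in range(num_reducers)]
    for age in ages:
        reducers[age % num_reducers].append(age)
    return [sorted(r, reverse=True) for r in reducers]


# Made-up ages, for illustration only.
ages = [19, 18, 20, 21, 21, 17, 23, 22, 19, 18]
outputs = hash_partitioned_sort(ages, 3)
combined = [age for out in outputs for age in out]

print(outputs)   # [[21, 21, 18, 18], [22, 19, 19], [23, 20, 17]]
print(combined == sorted(ages, reverse=True))  # False
```

Each reducer's output is in descending order, but the largest age (23) sits in the last output, so concatenating the files does not give a global order.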

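A classic requirement:

My data set is genuinely large (too large to sort globally with order by), yet the requirement is still a global sort.
For the approach, look at the design of HBase (range partitioning + local ordering).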

3.2.3. distribute by

Bucketed query, with two conditions:

The number of reduce tasks must be set: set mapreduce.job.reduces=4
The query must use distribute by to define the bucketing rule, which is hash-based.

3.2.4. cluster by

If the sort by field is the same as the distribute by field, the two clauses can be abbreviated:

cluster by age = distribute by age sort by age;

This is the only case where the shorthand applies. Nothing else can be abbreviated this way.

SQL:

select * from nx_student cluster by id;

95004 张立 男 19 IS
95008 李娜 女 18 CS
95012 孙花 女 20 CS
95016 钱国 男 21 MA
95020 赵钱 男 21 IS
95001 李勇 男 20 CS
95005 刘刚 男 18 MA
95009 梦圆圆 女 18 MA
95013 冯伟 男 21 CS
95017 王风娟 女 18 IS
95021 周二 男 17 MA
95002 刘晨 女 19 IS
95006 孙庆 男 23 CS
95010 孔小涛 男 19 CS
95014 王小丽 女 19 CS
95018 王一 女 19 IS
95022 郑明 男 20 MA
95003 王敏 女 22 MA
95007 易思玲 女 19 MA
95011 包小柏 男 18 MA
95015 王君 男 18 MA
95019 邢小丽 女 19 IS

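Conclusion:

The data is split into 4 segments. Within each segment, every ID gives the same result under the hash rule, and the IDs are in ascending order:
hash(id) % 4 = remainder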
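
The segmentation can be reproduced with a few lines of Python. This is a sketch, assuming the hash of an integer id is the id itself, which is consistent with the output above; `cluster_by` is a hypothetical helper, not a Hive API.

```python
def cluster_by(ids, num_reducers):
    """Simulate cluster by id: distribute by id, then sort by id.

    Assumption for illustration: hash(id) is the id itself, so a row
    goes to reducer id % num_reducers.
    """
    reducers = [[] for _ in range(num_reducers)]
    for row_id in ids:
        reducers[row_id % num_reducers].append(row_id)
    return [sorted(r) for r in reducers]


ids = range(95001, 95023)  # the 22 IDs that appear in the output above
segments = cluster_by(ids, 4)
for reducer, segment in enumerate(segments):
    print(reducer, segment)
```

The first segment comes out as 95004, 95008, 95012, 95016, 95020 (remainder 0), followed by the IDs with remainder 1, 2 and 3, matching the order of the rows listed above.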
