hive的几种排序
准备数据一、order by二、sort by三、distribute by四、cluster by
hive的几种排序 准备数据1、创建一个sort.txt文件,输入内容
tom,chinese,100 tom,math,90 tom,english,95 tom,history,88 tom,chirst,98 jery,chinese,90 jery,math,92 jery,english,95 jery,history,90 jery,chirst,89 may,chinese,90 may,math,89 may,english,85 may,history,98 may,chirst,91
2、创建表
create table sort_test( name string, major string, score int ) row format delimited fields terminated by ",";
3、加载数据
load data local inpath '/opt/module/datas/sort.txt' into table sort_test;一、order by
全局排序,只有一个reduce,会把全局数据做个排序
select * from sort_test order by score desc;
tom chinese 100 tom chirst 98 may history 98 tom english 95 jery english 95 jery math 92 may chirst 91 tom math 90 jery chinese 90 jery history 90 may chinese 90 jery chirst 89 may math 89 tom history 88 may english 85二、sort by
对于大规模的数据集 order by 的效率非常低。在很多情况下,并不需要全局排 序,此时可以使用 sort by
注意,是对每一个得reduce得数据进行排序,如果只有一个reduce,那么效果和order by是没有差别得
select * from sort_test sort by score desc;
tom chinese 100 tom chirst 98 tom english 95 jery english 95 may chinese 90 jery chirst 89 may math 89 tom history 88 may english 85 may history 98 jery math 92 may chirst 91 tom math 90 jery chinese 90 jery history 90三、distribute by
在有些情况下,我们需要控制某个特定行应该到哪个 reducer,通常是为 了进行后续的聚集操作。distribute by 子句可以做这件事。distribute by 类似 MR 中 partition (自定义分区),进行分区,结合 sort by 使用。
select * from sort_test distribute by name sort by score desc;
tom chinese 100 tom chirst 98 may history 98 tom english 95 jery english 95 jery math 92 may chirst 91 tom math 90 jery chinese 90 jery history 90 may chinese 90 jery chirst 89 may math 89 tom history 88 may english 85
好像没有什么区别,为啥呢,因为这个分区啊,得设置多个reduce才行
set mapreduce.job.reduces=3;
may history 98 may chirst 91 may chinese 90 may math 89 may english 85 tom chinese 100 tom chirst 98 tom english 95 jery english 95 jery math 92 tom math 90 jery chinese 90 jery history 90 jery chirst 89 tom history 88
写到文件中看一看
insert overwrite local directory '/opt/module/hive/datas/distribute-sort' select * from sort_test distribute by name sort by score desc;四、cluster by
select * from sort_test cluster by major;
当你得分区字段和排序字段是一样得,就可以使用cluster by,注意,cluster by只能有正序,不能有倒序



