Hive Dynamic Partitioning and Bucketing

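1. Hive Dynamic Partitioning and Bucketing
- 1. Hive dynamic partitioning
  - 1. Introduction to dynamic partitioning
  - 2. Dynamic partitioning configuration
  - 3. Dynamic partitioning syntax
- 2. Hive bucketing
  - 1. Introduction to bucketing
  - 2. Bucketing configuration
  - 3. Sampling queries on bucketed tables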

1. Hive Dynamic Partitioning and Bucketing

1. Hive dynamic partitioning

1. Introduction to dynamic partitioning

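With static partitioning, the user must specify the value of every partition column by hand when inserting data. This makes inserts more laborious, and each insert can only target the one partition that was named, so a single statement cannot spread its rows across several partitions. A better approach is to let Hive route each row to the right partition directory at insert time, based on the value of one or more of its columns. This is what dynamic partitioning provides.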

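2. Dynamic partitioning configuration

-- Enable dynamic partitioning
	set hive.exec.dynamic.partition=true;
	-- Default: true
-- Dynamic partition mode
	set hive.exec.dynamic.partition.mode=nonstrict;
	-- Default: strict (at least one partition column must be static)
-- Maximum number of dynamic partitions that each mapper/reducer node may create (default 100).
-- "set <property>;" without a value prints the current setting; append "=<n>" to change it.
	set hive.exec.max.dynamic.partitions.pernode;
-- Maximum total number of dynamic partitions that may be created across all mapper/reducer nodes (default 1000)
	set hive.exec.max.dynamic.partitions;
-- Maximum number of files that all mappers/reducers of an MR job may create (default 100000)
	set hive.exec.max.created.files;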
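The difference between strict and nonstrict mode can be made concrete with a small model. The sketch below is not Hive code; `check_partition_spec` is a hypothetical helper that mimics two rules under stated assumptions: strict mode needs at least one static partition column, and static columns have to come before dynamic ones in the PARTITION clause. A value of `None` stands for a dynamic column.

```python
def check_partition_spec(spec, mode="strict"):
    """Model of the partition-spec rules (illustrative, not Hive's implementation).

    spec: ordered list of (column, value); value None means the column is dynamic.
    Returns "ok" or a short description of the rule that was violated.
    """
    seen_dynamic = False
    static_count = 0
    for column, value in spec:
        if value is None:
            seen_dynamic = True
        else:
            if seen_dynamic:
                # A dynamic column cannot be the parent of a static one.
                return f"static column {column!r} must come before dynamic columns"
            static_count += 1
    if mode == "strict" and seen_dynamic and static_count == 0:
        return "strict mode requires at least one static partition column"
    return "ok"


# PARTITION(dt='2008-06-08', country): one static, one dynamic column
print(check_partition_spec([("dt", "2008-06-08"), ("country", None)]))
# PARTITION(age, gender): all dynamic, rejected in strict mode
print(check_partition_spec([("age", None), ("gender", None)]))
# The same spec is accepted once the mode is nonstrict
print(check_partition_spec([("age", None), ("gender", None)], mode="nonstrict"))
```

This is why a fully dynamic insert has to be preceded by `set hive.exec.dynamic.partition.mode=nonstrict;`.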

3. Dynamic partitioning syntax

-- Hive extension (dynamic partition inserts):
	INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...)
		select_statement FROM from_statement;

	INSERT INTO TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...)
		select_statement FROM from_statement;

-- LOAD DATA syntax
LOAD DATA [LOCAL] INPATH hdfs_path INTO TABLE table_name 
	[PARTITION(date='2008-06-08', country='US')] -- static partitions only: every partition value must be given

-- Insert with a dynamic partition: dt is static, country is taken from the last SELECT column
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
       SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.country;

-- Prepare the data
cat > dynamic_data.txt <<-EOF
1,小明1,11,man,抽烟-喝酒-烫头,北京:王府井-上海:浦东
2,小明2,12,women,lol-book-movie,beijing:aaa-shanghai:jingan
3,小明3,13,man,抽烟-喝酒-烫头,北京:王府井-上海:浦东
4,小明4,13,women,lol-book-movie,beijing:aaa-shanghai:jingan
5,小明5,14,man,抽烟-喝酒-烫头,北京:王府井-上海:浦东
6,小明6,15,women,lol-book-movie,beijing:aaa-shanghai:jingan
7,小明7,16,man,抽烟-喝酒-烫头,北京:王府井-上海:浦东
EOF

hdfs dfs -mkdir /hive_dynamic
hdfs dfs -put dynamic_data.txt /hive_dynamic

-- Create a regular (non-partitioned) table
create table psn21
(
    id int,
    name string,
    age int,
    gender string,
    likes array<string>,
    address map<string,string>
)
row format delimited
    fields terminated by ','
    collection items terminated by '-'
    map keys terminated by ':';

-- Load the data into the regular table
load data inpath '/hive_dynamic/dynamic_data.txt' overwrite into table psn21;

-- Create the partitioned table (age and gender are partition columns, not regular columns)
create table psn22
(
id int,
name string,
likes array<string>,
address map<string,string>
)
partitioned by (age int,gender string)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':';

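-- Load data into the partitioned table; the partitions are created automatically
set hive.exec.dynamic.partition.mode=nonstrict;
-- The SELECT list must follow the table's column order, with the partition columns last,
-- in the same order as in the PARTITION clause
insert into table psn22 
	partition(age,gender) 
	select id,name,likes,address,age,gender from psn21;

-- FROM-first form of the same insert, but with OVERWRITE: existing data in the partitions it writes is replaced
from psn21 pp
insert overwrite table psn22 partition(age,gender)
select pp.id,pp.name,pp.likes,pp.address,pp.age,pp.gender;

-- Append
from psn21 pp
insert into table psn22 partition(age,gender)
select pp.id,pp.name,pp.likes,pp.address,pp.age,pp.gender;

-- Remove all data from the table
truncate table psn22;

-- Drop partitions together with their data (a partial spec drops every partition that matches it)
alter table psn22 drop partition(gender='man');
alter table psn22 drop partition(age=13);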
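To get a feel for what this insert produces, the following sketch derives the partition directories from the (id, age, gender) values of the seven sample rows in dynamic_data.txt. It only models Hive's `column=value` directory naming for partitions; `partition_dir` is an illustrative helper, not a Hive API, and the paths are relative to the table's warehouse directory.

```python
# (id, age, gender) for each row of dynamic_data.txt
rows = [
    (1, 11, "man"), (2, 12, "women"), (3, 13, "man"), (4, 13, "women"),
    (5, 14, "man"), (6, 15, "women"), (7, 16, "man"),
]


def partition_dir(age, gender):
    # Nested directories follow the order in "partitioned by (age, gender)".
    return f"age={age}/gender={gender}"


# Group row ids by the partition directory they would be written to.
partitions = {}
for row_id, age, gender in rows:
    partitions.setdefault(partition_dir(age, gender), []).append(row_id)

for path in sorted(partitions):
    print(path, partitions[path])
print(len(partitions), "partitions")
```

Every distinct (age, gender) pair becomes its own partition, so this small file already creates seven of them. On real data the count can grow quickly, which is what the hive.exec.max.dynamic.partitions limits guard against.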

2. Hive bucketing

1. Introduction to bucketing

	Bucketed tables are fantastic in that they allow much more efficient sampling than do non-bucketed tables, and they may later allow for time saving operations such as mapside joins. However, the bucketing specified at table creation is not enforced when the table is written to, and so it is possible for the table's metadata to advertise properties which are not upheld by the table's actual layout. This should obviously be avoided. Here's how to do it right.

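Notes:

1. A bucketed table hashes the value of the bucketing column and stores rows in different files according to that hash.

2. Any Hive table or partition can be further divided into buckets.

3. The bucket a row goes to is the remainder of the column's hash value divided by the number of buckets (hash modulo bucket count).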
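The modulo rule in note 3 can be sketched in a few lines. This is a simplified model, assuming the legacy bucketing scheme in which an int column hashes to its own value and the sign bit is masked off before taking the remainder; newer Hive versions may use a different hash function, so actual file placement can differ. `bucket_for` and the sample ages are illustrative only.

```python
INT_MAX = 0x7FFFFFFF  # mask that keeps the hash non-negative


def bucket_for(hash_code, num_buckets):
    """Return the 0-based bucket index for a hash code."""
    return (hash_code & INT_MAX) % num_buckets


# Assumption: for an int column the hash code is the value itself.
for age in [11, 22, 33, 44]:
    print(age, "-> bucket", bucket_for(age, 4))
```

With 4 buckets, ages 11, 22, 33 and 44 land in buckets 3, 2, 1 and 0 respectively, so rows with the same column value always end up in the same file.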

Use cases

1. Convenient sampling
2. More efficient join queries

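2. Bucketing configuration

-- Make Hive enforce bucketing on insert (needed on Hive 1.x; Hive 2.0 and later always enforce bucketing and no longer have this setting)
	set hive.enforce.bucketing=true;
	-- MR then sets the number of reduce tasks to the number of buckets automatically.
	-- The task count can also be set by hand through mapred.reduce.tasks, but this is not recommended for bucketed tables.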

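3. Sampling queries on bucketed tables

-- Example ("columns" stands for the bucketing column or columns)
	select * from bucket_table tablesample(bucket 1 out of 4 on columns)
-- TABLESAMPLE syntax:
	TABLESAMPLE(BUCKET x OUT OF y)
		x: the bucket to start sampling from (numbered from 1)
		y: must be a multiple or a factor of the table's total number of buckets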
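The effect of x and y is easier to see when computed. The sketch below is a model of the rule just described, not Hive's implementation: `sampled_buckets` is a hypothetical helper that returns which bucket files a TABLESAMPLE(BUCKET x OUT OF y) clause reads and what fraction of each file it keeps. It assumes the table is bucketed on the sampled column and rejects any y that is neither a factor nor a multiple of the bucket count.

```python
from fractions import Fraction


def sampled_buckets(x, y, num_buckets):
    """Return (1-based bucket numbers read, fraction of each bucket kept)."""
    if not 1 <= x <= y:
        raise ValueError("x must be between 1 and y")
    if num_buckets % y == 0:
        # y is a factor: read every y-th bucket starting at x, each in full.
        return list(range(x, num_buckets + 1, y)), Fraction(1)
    if y % num_buckets == 0:
        # y is a multiple: read only part of a single bucket.
        return [(x - 1) % num_buckets + 1], Fraction(num_buckets, y)
    raise ValueError("y must be a multiple or a factor of the bucket count")


print(sampled_buckets(2, 4, 4))    # 4 buckets, bucket 2 out of 4
print(sampled_buckets(3, 16, 32))  # 32 buckets, bucket 3 out of 16
print(sampled_buckets(3, 64, 32))  # 32 buckets, bucket 3 out of 64
```

With 32 buckets, "bucket 3 out of 16" reads buckets 3 and 19 in full, while "bucket 3 out of 64" reads half of bucket 3. In both cases the sample covers 1/y of the table.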

cat >> hive_data.txt <<-EOF
1,小明1,抽烟-喝酒-烫头,上海:静安
2,小明2,抽烟-喝酒-烫头,北京:王府井
1,小明1,抽烟-喝酒-烫头,上海:静安
2,小明2,抽烟-喝酒-烫头,北京:王府井
1,小明1,抽烟-喝酒-烫头,上海:静安
2,小明2,抽烟-喝酒-烫头,北京:王府井
1,小明1,抽烟-喝酒-烫头,上海:静安
2,小明2,抽烟-喝酒-烫头,北京:王府井
1,小明1,抽烟-喝酒-烫头,上海:静安
2,小明2,抽烟-喝酒-烫头,北京:王府井
1,小明1,抽烟-喝酒-烫头,上海:静安
2,小明2,抽烟-喝酒-烫头,北京:王府井
EOF

hdfs dfs -put -f hive_data.txt /data/

select count(*) from psn21;
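-- Note that the job runs with a single map task and a single reduce task

-- The bucket a row is placed in is the remainder of its hash value divided by the number of buckets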
cat > bucket <<-EOF
1,tom,11
2,cat,22
3,dog,33
4,hive,44
5,hbase,55
6,mr,66
7,alice,77
8,scala,88
EOF

-- Upload to HDFS
hdfs dfs -put -f bucket /data/

-- Create a regular table
CREATE TABLE psn31( id INT, name STRING, age INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Load the data
load data inpath '/data/bucket' into table psn31;


-- Create the bucketed table
create table psn32( id int, name string, age int)
clustered by (age) into 4 buckets
row format delimited fields terminated by ',';

-- Populate it with INSERT ... SELECT so that rows are hashed into buckets (LOAD DATA only moves files and does not bucket them)
insert into table psn32 select id, name, age from psn31;

-- List the bucket files
dfs -ls -R /user/hive_remote/warehouse/psn32;

-- Sample
select id, name, age from psn32 tablesample(bucket 2 out of 4 on age);

bucket 2 out of 4
x=2
y=4
Sampling starts at the second bucket and reads (number of buckets / y) = 4 / 4 = 1 bucket.
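As a worked example, the sketch below replays this query on the eight rows of the bucket file. It is a simulation, assuming the legacy scheme in which an int column hashes to its own value (newer Hive versions may hash differently, so the real result can contain other rows). `bucket_index` and `tablesample` are illustrative helpers, not Hive functions.

```python
INT_MAX = 0x7FFFFFFF
NUM_BUCKETS = 4

# (id, name, age) rows from the "bucket" file
rows = [
    (1, "tom", 11), (2, "cat", 22), (3, "dog", 33), (4, "hive", 44),
    (5, "hbase", 55), (6, "mr", 66), (7, "alice", 77), (8, "scala", 88),
]


def bucket_index(age, n):
    # Assumption: the hash of an int is the int itself.
    return (age & INT_MAX) % n


# Distribute the rows over the bucket files, as the INSERT ... SELECT would.
files = {i: [] for i in range(NUM_BUCKETS)}
for row in rows:
    files[bucket_index(row[2], NUM_BUCKETS)].append(row)


def tablesample(table_rows, x, y):
    """Keep the rows whose hash falls into bucket x (1-based) out of y."""
    return [r for r in table_rows if bucket_index(r[2], y) == x - 1]


for i, content in files.items():
    print("bucket file", i + 1, content)
print("sample:", tablesample(rows, 2, 4))
```

Under these assumptions the second bucket file holds the rows with ages 33 and 77, and those two rows are exactly what `tablesample(bucket 2 out of 4 on age)` returns.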
