Contents
I. Hive Partitioned Tables
  1. Creating a partitioned table
  2. Inserting data
  3. Viewing partitions
  4. Renaming a table partition
  5. Dropping a table partition
II. Hive Buckets
  1. Creating a bucketed table
  2. Inserting data
  3. Sampling a bucketed table
Prerequisites
A fully distributed Hadoop cluster (one master and two workers is enough), plus working MySQL and Hive installations.
I. Hive Partitioned Tables
1. Creating a partitioned table
Create a partitioned table goods in Hive with two string columns, goods_id and goods_status, partitioned by a string column cat_id, with '\t' (tab) as the field delimiter:
hive> create table if not exists goods(goods_id string,goods_status string)
> partitioned by (cat_id string)
> row format delimited fields terminated by '\t';
OK
Time taken: 0.424 seconds
Check the structure of the goods table:
hive> desc goods;
OK
goods_id            	string
goods_status        	string
cat_id              	string

# Partition Information
# col_name            	data_type           	comment

cat_id              	string
Time taken: 0.226 seconds, Fetched: 8 row(s)
And with that, the partitioned table is ready.
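On HDFS, each partition value gets its own subdirectory under the table's warehouse path (the `Moving data to directory ...` line in the insert output later in this walkthrough shows exactly this layout). A minimal Python sketch of how such a path is formed; the helper name is made up for illustration:

```python
def partition_path(table_dir: str, **partition_spec) -> str:
    """Build the HDFS directory Hive uses for one partition.

    Each key=value pair of the partition spec becomes one
    path component, e.g. cat_id=52052.
    """
    parts = "/".join(f"{k}={v}" for k, v in partition_spec.items())
    return f"{table_dir.rstrip('/')}/{parts}"

# The goods table from this walkthrough lives under the db.db warehouse dir:
print(partition_path("/user/hive/warehouse/db.db/goods", cat_id="52052"))
# -> /user/hive/warehouse/db.db/goods/cat_id=52052
```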
2. Inserting data
Insert data into the partitioned table from the goods file under the local directory /home/data/hive-data.
First, create a non-partitioned table goods1 in Hive to hold the contents of the local goods file:
hive> create table goods1(goods_id string,goods_status string,cat_id string)
> row format delimited fields terminated by '\t';
OK
Time taken: 0.084 seconds
Load the local data into goods1:
hive> load data local inpath '/home/data/hive-data/goods' into table goods1;
Loading data to table db.goods1
OK
Time taken: 1.316 seconds
Then insert the data from goods1 into the partitioned table goods:
hive> insert into table goods partition(cat_id='52052') select goods_id,goods_status from goods1 where cat_id='52052';
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20220306171401_5343b66f-8685-487a-be77-537fd35ccb3a
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1646556082284_0003, Tracking URL = http://server:8088/proxy/application_1646556082284_0003/
Kill Command = /usr/local/src/hadoop/bin/hadoop job -kill job_1646556082284_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2022-03-06 17:14:52,886 Stage-1 map = 0%, reduce = 0%
2022-03-06 17:14:59,984 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.7 sec
MapReduce Total cumulative CPU time: 1 seconds 700 msec
Ended Job = job_1646556082284_0003
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://192.168.64.183:9000/user/hive/warehouse/db.db/goods/cat_id=52052/.hive-staging_hive_2022-03-06_17-14-01_837_747076726384884371-1/-ext-10000
Loading data to table db.goods partition (cat_id=52052)
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 1.7 sec   HDFS Read: 213175 HDFS Write: 378 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 700 msec
OK
Time taken: 60.848 seconds
Check the result:
hive> select * from goods limit 10;
OK
1020405	6	52052
1020406	6	52052
1020408	6	52052
1020326	6	52052
1020407	6	52052
1020404	6	52052
1020327	6	52052
1020409	6	52052
1020325	6	52052
1024641	1	52052
Time taken: 0.136 seconds, Fetched: 10 row(s)
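The payoff of partitioning is partition pruning: because cat_id is a partition column, a query with `where cat_id='52052'` reads only that one directory and skips every other partition's files. A toy Python model of the idea (partition values and rows below are made-up illustrations):

```python
# Toy model of a partitioned table: partition value -> that partition's rows.
partitions = {
    "52052": [("1020405", "6"), ("1024641", "1")],
    "52063": [("1000016", "4")],
}

def pruned_scan(partitions, cat_id):
    """Read only the directory matching the requested partition value;
    all other partitions are skipped without being opened."""
    return partitions.get(cat_id, [])

print(pruned_scan(partitions, "52052"))
```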
3. Viewing partitions
List the partitions of the goods table:
hive> show partitions goods;
OK
cat_id=52052
Time taken: 0.128 seconds, Fetched: 1 row(s)
4. Renaming a table partition
Rename the cat_id=52052 partition of goods to cat_id=52051, then check the new partition name:
hive> alter table goods partition(cat_id=52052) rename to partition(cat_id=52051);
OK
Time taken: 0.391 seconds
hive> show partitions goods;
OK
cat_id=52051
Time taken: 0.061 seconds, Fetched: 1 row(s)
5. Dropping a table partition
Drop the cat_id=52051 partition from the goods table:
hive> alter table goods drop if exists partition (cat_id='52051');
Dropped the partition cat_id=52051
OK
Time taken: 0.456 seconds
hive> show partitions goods;
OK
Time taken: 0.059 seconds
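Note that for a managed table like goods, dropping a partition removes both the partition metadata and the data files underneath it, and `if exists` makes a missing partition a silent no-op instead of an error. A toy Python model of those semantics (the dict-based "table" is purely illustrative):

```python
# Toy model of a managed partitioned table: partition spec -> rows.
table = {("cat_id", "52051"): [("1020405", "6"), ("1020406", "6")]}

def drop_partition_if_exists(table, spec):
    """Mirror ALTER TABLE ... DROP IF EXISTS PARTITION on a managed
    table: partition entry and its rows go away together; an absent
    partition is silently ignored (returns None)."""
    return table.pop(spec, None)

drop_partition_if_exists(table, ("cat_id", "52051"))
print(table)  # {} -- metadata and data are both gone
```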
II. Hive Buckets
1. Creating a bucketed table
Set the session variable that enforces bucketing on insert:
hive> set hive.enforce.bucketing=true;
Create a table named goods_t with two string columns, goods_id and goods_status, partitioned by a string column cat_id, clustered by goods_status and sorted by goods_id, divided into two buckets:
hive> create table goods_t(goods_id string,goods_status string)
> partitioned by (cat_id string)
> clustered by (goods_status)
> sorted by (goods_id)
> into 2 buckets;
OK
Time taken: 0.08 seconds
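When rows are inserted, Hive decides which bucket file a row lands in by hashing the clustering column and taking the remainder modulo the bucket count. A Python sketch of that rule, assuming the classic bucketing scheme used here by Hive 2, where string columns are hashed with Java's String.hashCode (Hive 3 tables default to a Murmur-based hash instead):

```python
def java_string_hashcode(s: str) -> int:
    """Java's String.hashCode reproduced in Python: h = 31*h + char,
    wrapped to a signed 32-bit integer."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

def bucket_of(value: str, num_buckets: int) -> int:
    # Hive masks off the sign bit, then takes the remainder.
    return (java_string_hashcode(value) & 0x7FFFFFFF) % num_buckets

# The goods_status values seen earlier fall into different buckets:
print(bucket_of("4", 2), bucket_of("1", 2))  # -> 0 1
```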
2. Inserting data
Insert the data from goods1 into goods_t:
hive> insert overwrite table goods_t partition(cat_id='52063') select goods_id,goods_status from goods1;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20220306173319_d8682df4-91ba-42cc-b3cd-76ced5fdf898
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 2
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1646556082284_0004, Tracking URL = http://server:8088/proxy/application_1646556082284_0004/
Kill Command = /usr/local/src/hadoop/bin/hadoop job -kill job_1646556082284_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 2
2022-03-06 17:33:45,839 Stage-1 map = 0%, reduce = 0%
2022-03-06 17:33:51,130 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.31 sec
2022-03-06 17:33:58,420 Stage-1 map = 100%, reduce = 50%, Cumulative CPU 3.31 sec
2022-03-06 17:34:12,914 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 5.1 sec
MapReduce Total cumulative CPU time: 5 seconds 100 msec
Ended Job = job_1646556082284_0004
Loading data to table db.goods_t partition (cat_id=52063)
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 2   Cumulative CPU: 5.1 sec   HDFS Read: 220296 HDFS Write: 112278 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 100 msec
OK
Time taken: 54.668 seconds
3. Sampling a bucketed table
Sample the table, reading only the first of its two buckets:
hive> select * from goods_t tablesample(bucket 1 out of 2 on goods_status) limit 10;
OK
1000016	4	52063
1000018	4	52063
1000096	4	52063
1000218	4	52063
1000219	4	52063
1000223	4	52063
1000310	4	52063
1000368	4	52063
1000371	4	52063
1000380	4	52063
Time taken: 0.131 seconds, Fetched: 10 row(s)
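TABLESAMPLE(BUCKET x OUT OF y ON col) keeps the rows whose hash of col lands in bucket x of y (1-based), which is why the sample above contains only goods_status = '4' rows: '4' hashes into bucket 1 while '1' hashes into bucket 2. A Python simulation of that selection, again assuming the classic Java String.hashCode hashing of this Hive 2 setup:

```python
def java_string_hashcode(s: str) -> int:
    """Java's String.hashCode as a signed 32-bit integer."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

def tablesample(rows, col, x, y):
    """Keep rows whose hash of `col` falls into bucket x of y (1-based),
    mirroring TABLESAMPLE(BUCKET x OUT OF y ON col)."""
    return [r for r in rows
            if (java_string_hashcode(r[col]) & 0x7FFFFFFF) % y == x - 1]

rows = [{"goods_id": "1000016", "goods_status": "4"},
        {"goods_id": "1024641", "goods_status": "1"}]
print(tablesample(rows, "goods_status", 1, 2))
# -> only the goods_status='4' row, matching the query output above
```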



