Contents
I. Hive Partitioned Tables
  1. Creating a partitioned table
  2. Inserting data
  3. Viewing partitions
  4. Renaming a table partition
  5. Dropping a table partition
II. Hive Buckets
  1. Creating a bucketed table
  2. Inserting data
  3. Sampling a bucketed table
Prerequisites
A fully distributed Hadoop cluster (one master and two workers is enough), plus working MySQL and Hive installations.
I. Hive Partitioned Tables
1. Creating a partitioned table
Create a partitioned table goods in Hive with two string columns, goods_id and goods_status, partitioned by a string column cat_id, with '\t' (tab) as the field delimiter:
hive> create table if not exists goods(goods_id string,goods_status string)
> partitioned by (cat_id string)
> row format delimited fields terminated by '\t';
OK
Time taken: 0.424 seconds
Check the structure of the goods table:
hive> desc goods;
OK
goods_id            	string
goods_status        	string
cat_id              	string

# Partition Information
# col_name            	data_type           	comment

cat_id              	string
Time taken: 0.226 seconds, Fetched: 8 row(s)
And with that, the partitioned table is ready.
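On HDFS, each partition value gets its own subdirectory under the table's warehouse path (the `Moving data to directory ...` line in the insert output later in this walkthrough shows exactly this layout). A minimal Python sketch of how such a path is formed; the helper name is made up for illustration:

```python
def partition_path(table_dir: str, **partition_spec) -> str:
    """Build the HDFS directory Hive uses for one partition.

    Each key=value pair of the partition spec becomes one
    path component, e.g. cat_id=52052.
    """
    parts = "/".join(f"{k}={v}" for k, v in partition_spec.items())
    return f"{table_dir.rstrip('/')}/{parts}"

# The goods table from this walkthrough lives under the db.db warehouse dir:
print(partition_path("/user/hive/warehouse/db.db/goods", cat_id="52052"))
# -> /user/hive/warehouse/db.db/goods/cat_id=52052
```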
2. Inserting data
Insert data into the partitioned table from the goods file under the local directory /home/data/hive-data.
First, create a non-partitioned table goods1 in Hive to hold the contents of the local goods file:
hive> create table goods1(goods_id string,goods_status string,cat_id string)
> row format delimited fields terminated by '\t';
OK
Time taken: 0.084 seconds
Load the local data into goods1:
hive> load data local inpath '/home/data/hive-data/goods' into table goods1;
Loading data to table db.goods1
OK
Time taken: 1.316 seconds
Then insert the data from goods1 into the partitioned table goods:
hive> insert into table goods partition(cat_id='52052') select goods_id,goods_status from goods1 where cat_id='52052';
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20220306171401_5343b66f-8685-487a-be77-537fd35ccb3a
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1646556082284_0003, Tracking URL = http://server:8088/proxy/application_1646556082284_0003/
Kill Command = /usr/local/src/hadoop/bin/hadoop job -kill job_1646556082284_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2022-03-06 17:14:52,886 Stage-1 map = 0%, reduce = 0%
2022-03-06 17:14:59,984 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.7 sec
MapReduce Total cumulative CPU time: 1 seconds 700 msec
Ended Job = job_1646556082284_0003
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://192.168.64.183:9000/user/hive/warehouse/db.db/goods/cat_id=52052/.hive-staging_hive_2022-03-06_17-14-01_837_747076726384884371-1/-ext-10000
Loading data to table db.goods partition (cat_id=52052)
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 1.7 sec   HDFS Read: 213175 HDFS Write: 378 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 700 msec
OK
Time taken: 60.848 seconds
Check the result:
hive> select * from goods limit 10;
OK
1020405	6	52052
1020406	6	52052
1020408	6	52052
1020326	6	52052
1020407	6	52052
1020404	6	52052
1020327	6	52052
1020409	6	52052
1020325	6	52052
1024641	1	52052
Time taken: 0.136 seconds, Fetched: 10 row(s)
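The payoff of partitioning is partition pruning: because cat_id is a partition column, a query with `where cat_id='52052'` reads only that one directory and skips every other partition's files. A toy Python model of the idea (partition values and rows below are made-up illustrations):

```python
# Toy model of a partitioned table: partition value -> that partition's rows.
partitions = {
    "52052": [("1020405", "6"), ("1024641", "1")],
    "52063": [("1000016", "4")],
}

def pruned_scan(partitions, cat_id):
    """Read only the directory matching the requested partition value;
    all other partitions are skipped without being opened."""
    return partitions.get(cat_id, [])

print(pruned_scan(partitions, "52052"))
```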
3. Viewing partitions
List the partitions of the goods table:
hive> show partitions goods;
OK
cat_id=52052
Time taken: 0.128 seconds, Fetched: 1 row(s)
4. Renaming a table partition
Rename the cat_id=52052 partition of goods to cat_id=52051, then check the new partition name:
hive> alter table goods partition(cat_id=52052) rename to partition(cat_id=52051);
OK
Time taken: 0.391 seconds
hive> show partitions goods;
OK
cat_id=52051
Time taken: 0.061 seconds, Fetched: 1 row(s)
5. Dropping a table partition
Drop the cat_id=52051 partition from the goods table:
hive> alter table goods drop if exists partition (cat_id='52051');
Dropped the partition cat_id=52051
OK
Time taken: 0.456 seconds
hive> show partitions goods;
OK
Time taken: 0.059 seconds
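Note that for a managed table like goods, dropping a partition removes both the partition metadata and the data files underneath it, and `if exists` makes a missing partition a silent no-op instead of an error. A toy Python model of those semantics (the dict-based "table" is purely illustrative):

```python
# Toy model of a managed partitioned table: partition spec -> rows.
table = {("cat_id", "52051"): [("1020405", "6"), ("1020406", "6")]}

def drop_partition_if_exists(table, spec):
    """Mirror ALTER TABLE ... DROP IF EXISTS PARTITION on a managed
    table: partition entry and its rows go away together; an absent
    partition is silently ignored (returns None)."""
    return table.pop(spec, None)

drop_partition_if_exists(table, ("cat_id", "52051"))
print(table)  # {} -- metadata and data are both gone
```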
II. Hive Buckets
1. Creating a bucketed table
Set the session variable that enforces bucketing on insert:
hive> set hive.enforce.bucketing=true;
Create a table named goods_t with two string columns, goods_id and goods_status, partitioned by a string column cat_id, clustered by goods_status and sorted by goods_id, divided into two buckets:
hive> create table goods_t(goods_id string,goods_status string)
> partitioned by (cat_id string)
> clustered by (goods_status)
> sorted by (goods_id)
> into 2 buckets;
OK
Time taken: 0.08 seconds
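When rows are inserted, Hive decides which bucket file a row lands in by hashing the clustering column and taking the remainder modulo the bucket count. A Python sketch of that rule, assuming the classic bucketing scheme used here by Hive 2, where string columns are hashed with Java's String.hashCode (Hive 3 tables default to a Murmur-based hash instead):

```python
def java_string_hashcode(s: str) -> int:
    """Java's String.hashCode reproduced in Python: h = 31*h + char,
    wrapped to a signed 32-bit integer."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

def bucket_of(value: str, num_buckets: int) -> int:
    # Hive masks off the sign bit, then takes the remainder.
    return (java_string_hashcode(value) & 0x7FFFFFFF) % num_buckets

# The goods_status values seen earlier fall into different buckets:
print(bucket_of("4", 2), bucket_of("1", 2))  # -> 0 1
```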
2. Inserting data
Insert the data from goods1 into goods_t:
hive> insert overwrite table goods_t partition(cat_id='52063') select goods_id,goods_status from goods1;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20220306173319_d8682df4-91ba-42cc-b3cd-76ced5fdf898
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 2
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1646556082284_0004, Tracking URL = http://server:8088/proxy/application_1646556082284_0004/
Kill Command = /usr/local/src/hadoop/bin/hadoop job -kill job_1646556082284_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 2
2022-03-06 17:33:45,839 Stage-1 map = 0%, reduce = 0%
2022-03-06 17:33:51,130 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.31 sec
2022-03-06 17:33:58,420 Stage-1 map = 100%, reduce = 50%, Cumulative CPU 3.31 sec
2022-03-06 17:34:12,914 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 5.1 sec
MapReduce Total cumulative CPU time: 5 seconds 100 msec
Ended Job = job_1646556082284_0004
Loading data to table db.goods_t partition (cat_id=52063)
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 2   Cumulative CPU: 5.1 sec   HDFS Read: 220296 HDFS Write: 112278 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 100 msec
OK
Time taken: 54.668 seconds
3. Sampling a bucketed table
Sample the table, reading only the first of its two buckets:
hive> select * from goods_t tablesample(bucket 1 out of 2 on goods_status) limit 10;
OK
1000016	4	52063
1000018	4	52063
1000096	4	52063
1000218	4	52063
1000219	4	52063
1000223	4	52063
1000310	4	52063
1000368	4	52063
1000371	4	52063
1000380	4	52063
Time taken: 0.131 seconds, Fetched: 10 row(s)
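TABLESAMPLE(BUCKET x OUT OF y ON col) keeps the rows whose hash of col lands in bucket x of y (1-based), which is why the sample above contains only goods_status = '4' rows: '4' hashes into bucket 1 while '1' hashes into bucket 2. A Python simulation of that selection, again assuming the classic Java String.hashCode hashing of this Hive 2 setup:

```python
def java_string_hashcode(s: str) -> int:
    """Java's String.hashCode as a signed 32-bit integer."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

def tablesample(rows, col, x, y):
    """Keep rows whose hash of `col` falls into bucket x of y (1-based),
    mirroring TABLESAMPLE(BUCKET x OUT OF y ON col)."""
    return [r for r in rows
            if (java_string_hashcode(r[col]) & 0x7FFFFFFF) % y == x - 1]

rows = [{"goods_id": "1000016", "goods_status": "4"},
        {"goods_id": "1024641", "goods_status": "1"}]
print(tablesample(rows, "goods_status", 1, 2))
# -> only the goods_status='4' row, matching the query output above
```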



