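Hive Partitioned Tables and Bucketed Tables

1. Partitioning

Each partition of a partitioned table corresponds to its own directory on HDFS, and that directory holds all of the partition's data. Multiple partition columns behave like nested directories, and a where clause on the partition columns lets a query read only the partitions it needs.

1.1 Partitioned table operations

Create a single-level partitioned table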

create external table weblog00 (line string)
partitioned by (dt string);

 Load data into the partitioned table

load data local inpath '/root/weblogs/access.log-20211101'
    into table weblog00 partition(dt='20211101');
load data local inpath '/root/weblogs/access.log-20211102'
    into table weblog00 partition(dt='20211102');

 Query data in the partitioned table

select * from weblog00 where dt='20211101' limit 10;

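 Add partitions

-- add a single partition
alter table weblog00 add partition (dt='001');
-- add multiple partitions (one partition clause per partition, separated by spaces)
alter table weblog00 add partition (dt='002') partition (dt='003');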

 Drop partitions

-- drop a single partition
alter table weblog00 drop partition (dt='001');
-- drop multiple partitions (partition clauses separated by commas)
alter table weblog00 drop partition (dt='002'), partition (dt='003');

 List the partitions of the table

show partitions weblog00;
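
To see the partition-to-directory mapping described at the top of this section, a quick check along the following lines may help. This is only a sketch: it assumes the dt='20211101' partition loaded above still exists, and the exact output layout depends on the Hive version.

```sql
-- Show the partition's metadata; the Location field is its HDFS directory.
describe formatted weblog00 partition (dt='20211101');

-- Count rows per partition; dt can be selected and grouped like a normal column.
select dt, count(*) as line_count
from weblog00
group by dt;
```

The partition column dt is not stored inside the data files; its value comes from the partition the files belong to.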

1.2 Two-level partitions

Create a two-level partitioned table

-- reuses the name weblog00, so the single-level table from 1.1 must no longer exist
create table weblog00 (line string)
partitioned by (month string, day string);

 Load data into two-level partitions

load data local inpath '/root/weblogs/access.log-20211101' 
	into table weblog00 partition (month='11', day='01');
load data local inpath '/root/weblogs/access.log-20211102' 
	into table weblog00 partition (month='11', day='02');

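 Upload data to the partition directory, then repair the table

-- the partition directory must live under the table that msck repairs (weblog00 here)
dfs -mkdir -p /user/hive/warehouse/hive_test01.db/weblog00/month=11/day=01;
dfs -put /root/weblogs/access.log-20211101 /user/hive/warehouse/hive_test01.db/weblog00/month=11/day=01;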

msck repair table weblog00;

 Upload data to the partition directory, then add the partition

dfs -mkdir -p /user/hive/warehouse/hive_test01.db/weblog00/month=11/day=05;
dfs -put /root/weblogs/access.log-20211101 /user/hive/warehouse/hive_test01.db/weblog00/month=11/day=05;

alter table weblog00 add partition (month='11', day='05');

 Create the directory, then load data into the partition

dfs -mkdir -p /user/hive/warehouse/hive_test01.db/weblog00/month=11/day=06;

load data local inpath '/root/weblogs/access.log-20211101' 
	into table weblog00 partition (month='11', day='06');
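
Whichever approach above is used, the result can be checked the same way. The statements below are a minimal sketch that assumes the two-level weblog00 table and the month='11' partitions from this section.

```sql
-- List only the day partitions under month=11 (partial partition spec).
show partitions weblog00 partition (month='11');

-- Filter on both partition columns so only one partition directory is read.
select count(*) as line_count
from weblog00
where month='11' and day='05';
```

If a directory was uploaded but neither msck repair nor add partition was run, the partition is missing from the first statement's output and the count for it is 0.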

1.3 Multi-level partitioned table

Create a multi-level partitioned table

create external table weblog01 (line string)
partitioned by (ym string, d string, M string, S string);

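 Upload data into the table's partition directories with Flume

Two nodes are used: node1 reads data from a local directory and sends it to node2, which receives the data and uploads it to the target directory on HDFS.

For more on Flume, see the related post: Flume topologies

Flume start command: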
flume-ng agent -c  -f  -n  -Dflume.root.logger=INFO,console

 Refresh the table's partition metadata

msck repair table weblog01;

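2. Bucketing

The rows of a table can be divided further by the values of one column and organized into buckets, a finer-grained split of the data than partitions. Hive hashes the column value to decide which bucket each row is stored in.

As I see it, bucketing works much like setting the number of reducers in MR: Hive computes the hash of each row's bucketing-column value, takes it modulo the number of buckets to get the bucket index, and puts the row into that bucket.

2.1 Bucketed table operations

Data preparation

 Create the bucketed table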

CREATE TABLE `weblog_extracted_bkt`
(
    `ip`   string,
    `dt`   string,
    `code` string,
    `type` string,
    `url`  string,
    `up`   bigint,
    `down` bigint
)
    clustered by (code)
        into 11 buckets;

 Insert data into the bucketed table

insert into weblog_extracted_bkt select * from weblog_extracted;
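
The hash-then-modulo idea can be approximated directly in a query. The sketch below assumes the bucket index equals pmod(hash(code), 11), which reflects the older Java-style bucketing hash; newer Hive versions may use a different hash function for bucketing, so treat the result as an illustration rather than the exact file layout.

```sql
-- Approximate bucket index per row: non-negative hash of code modulo 11 buckets.
select pmod(hash(code), 11) as bucket_idx,
       count(*)             as row_count
from weblog_extracted_bkt
group by pmod(hash(code), 11)
order by bucket_idx;
```

Rows with the same code always hash to the same bucket, so a column with only a few distinct values fills only a few of the 11 buckets.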

 Sample from the buckets

Syntax: tablesample(bucket x out of y [on col]), where x must be between 1 and y

select * from weblog_extracted_bkt 
	tablesample ( bucket 3 out of 4 on code) limit 10;
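
As a further sketch of how x and y behave, assuming the 11-bucket table created above: conceptually, a row falls in sample bucket x when the hash of the on expression modulo y equals x - 1, so y does not have to match the table's bucket count.

```sql
-- y equals the table's bucket count and the on column is the clustering column.
select * from weblog_extracted_bkt
    tablesample (bucket 1 out of 11 on code) limit 10;

-- Sampling on rand() instead of a column picks rows at random (about 1 in 4 here).
select * from weblog_extracted_bkt
    tablesample (bucket 1 out of 4 on rand()) limit 10;
```

Sampling on rand() scans the whole table, whereas sampling on the clustering column can let Hive read only the matching bucket files.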

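3. Final notes

Ways to put data into a partitioned table:

Load data into the partitioned table
Upload data to the partition directory, then repair the table
Upload data to the partition directory, then add the partition
Create the directory, then load data into the partition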

❤️ END ❤️
