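1. Databases and Tables

1.1 Creating a Database

Create a database

create database if not exists database_name;
use database_name;

Note: the default location where Hive stores tables is set by a property in hive-site.xml. The property name and its default value are:

hive.metastore.warehouse.dir
/user/hive/warehouse

Create a database and specify its HDFS storage location

create database database_name location 'path_name';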

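Modifying a database

The alter database command can change some properties of a database (its dbproperties). The core metadata, namely the database name and the directory that holds its existing data, cannot be changed this way. Newer Hive releases accept set location, but that only affects tables created afterwards.

alter database database_name set dbproperties('createtime'='20210611');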

Viewing database details

Show basic information about a database

desc database database_name;

Show extended information about a database

desc database extended database_name;

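Dropping a database

Drop an empty database. If the database still contains tables, the statement fails

drop database database_name;

Force-drop a database together with all the tables in it

drop database database_name cascade;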

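1.2 Creating Tables

Managed (internal) tables

Create a table and specify the field delimiter

create table if not exists stu2(id int, name string) row format delimited fields terminated by '\t' stored as textfile location '/user/stu2';

Create a table from a query result (copies both the table structure and the data)

create table stu3 as select * from stu2;

Create a table from an existing table's structure (copies the structure only)

create table stu4 like stu2;

Check the table type

desc formatted stu2;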

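External tables:

create external table techer (t_id string, t_name string) row format delimited fields terminated by '\t';

When data is loaded from the local file system (load data local inpath), the file is copied and the local original stays in place. When data is loaded from HDFS (load data inpath), the source file is moved into the table's location. If no location was given when the table was created, the table lives under Hive's default warehouse directory.

Managed versus external tables:

The external keyword determines whether a table is managed (internal) or external.

Managed table: dropping the table also deletes its data on HDFS.

External table: dropping the table removes only the metadata and leaves the data on HDFS untouched, because an external table typically points at data in some other HDFS path that Hive does not own.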
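The difference between the two table types can be checked with a short session like the following. This is a minimal sketch: the location /user/hive/external/techer and the file /hivedatas/techer.csv are hypothetical paths chosen for illustration, and the dfs command assumes the Hive CLI.

```sql
-- Hypothetical HDFS directory for the external table's data.
create external table if not exists techer (t_id string, t_name string)
row format delimited fields terminated by '\t'
location '/user/hive/external/techer';

-- No "local" keyword: the HDFS file is moved into the table's location.
-- /hivedatas/techer.csv is an assumed tab-separated file already on HDFS.
load data inpath '/hivedatas/techer.csv' into table techer;

-- The "Table Type" row reads EXTERNAL_TABLE here
-- (it reads MANAGED_TABLE for an internal table).
desc formatted techer;

-- Dropping an external table removes only the metastore entry.
drop table techer;

-- The data files should still be listed after the drop.
dfs -ls /user/hive/external/techer;
```

Running the same sequence without the external keyword should leave the directory empty or gone after drop table, which is the behavior described for managed tables.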

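Partitioned tables:

There is no separate partitioned-table type: a partitioned table is either a managed table or an external table that has partitions. The core idea is divide and conquer: the less data a query has to scan, the faster it runs. Data is split into directories according to some rule (for example, one directory per month), and a query that names a partition reads only that directory.

Syntax for creating a partitioned table

create table score(s_id string, c_id string, s_score int) partitioned by (month string) row format delimited fields terminated by '\t';

Load data into a partition

load data local inpath '/opt/module/hivedatas/score.csv' into table score partition (month='201806');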
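Once a partition has been loaded, the following statements are a quick way to see how partitions behave. They are a sketch against the score table above; the extra partition value 201805 is only an illustration.

```sql
-- List the partitions; each one maps to a subdirectory such as
-- .../score/month=201806 under the table's location.
show partitions score;

-- Add another, initially empty, partition.
alter table score add partition (month='201805');

-- Filtering on the partition column lets Hive read only the matching directory.
select s_id, c_id, s_score from score where month = '201806';

-- Drop a partition; for a managed table this also deletes its data.
alter table score drop partition (month='201805');
```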

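Bucketed tables

Bucketing splits a table's data into a fixed number of files based on one column. That column acts as the map output key (k2): the MapReduce partitioning rule (HashPartitioner) is applied to it, and each reducer writes one output file.

Enable bucketing enforcement (needed on Hive 1.x; the property was removed in Hive 2.0, where bucketing is always enforced)

set hive.enforce.bucketing=true;

Set the number of reducers

set mapreduce.job.reduces=3;

Create the bucketed table

create table course (c_id string, c_name string, t_id string) clustered by(c_id) into 3 buckets row format delimited fields terminated by '\t';

Loading data into a bucketed table: placing files with hdfs dfs -put or using load data does not distribute the rows into buckets, so the data has to be written with insert overwrite from another table.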
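One common way to do this is to load the raw file into a plain staging table and then rewrite it into the bucketed table. The sketch below assumes a staging table named course_common and a local file course.csv, neither of which appears in the original text.

```sql
-- Plain (non-bucketed) staging table with the same columns.
-- The name course_common is an assumption.
create table course_common (c_id string, c_name string, t_id string)
row format delimited fields terminated by '\t';

-- Hypothetical local file holding the course rows.
load data local inpath '/opt/module/hivedatas/course.csv' into table course_common;

-- Rewrite the rows through a job so they are hashed on c_id
-- into the 3 bucket files of the course table.
insert overwrite table course select * from course_common cluster by (c_id);

-- Read a single bucket out of the three.
select * from course tablesample (bucket 1 out of 3 on c_id);
```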

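Modifying tables

Rename a table

alter table old_table_name rename to new_table_name;

Add or modify columns

(1) Show the table structure

desc score5;

(2) Add columns

alter table score5 add columns (mycol string, mysco string);

(3) Show the table structure

desc score5;

(4) Change a column (rename it and change its type)

alter table score5 change column mysco mysconew int;

(5) Show the table structure

desc score5;

Dropping a table

drop table score5;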

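Loading data into Hive tables

Load data with load

load data local inpath '/opt/module/hivedatas/score.csv' overwrite into table score partition(month='201806');

With overwrite, the existing data in the target partition is replaced. Without it, the new file is added alongside the existing data.

Load data from a query

create table score4 like score;

insert overwrite table score4 partition(month = '201806') select s_id,c_id,s_score from score;

The overwrite keyword here replaces whatever the partition already holds. To append instead, write insert into.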

Multi-insert mode

Load data into the score table

load data local inpath '/opt/module/hivedatas/score.csv' overwrite into table score partition(month='201806');

Create the first target table:

create table score_first(s_id string, c_id string) partitioned by (month string) row format delimited fields terminated by '\t';

Create the second target table:

create table score_second(c_id string, s_score int) partitioned by (month string) row format delimited fields terminated by '\t';

Populate both target tables from a single scan of score

from score insert overwrite table score_first partition(month='201806') select s_id,c_id insert overwrite table score_second partition(month = '201806') select c_id,s_score;

Create a table and load data in one statement (as select)

create table score5 as select * from score;

Exporting data from Hive tables

Export with insert

Export a query result to the local file system

insert overwrite local directory '/opt/module/exporthive' select * from score;

Export a formatted query result to the local file system

insert overwrite local directory '/opt/module/exporthive' row format delimited fields terminated by '\t' collection items terminated by '#' select * from student;

Export a query result to HDFS (no local keyword)

insert overwrite directory '/opt/module/exporthive' row format delimited fields terminated by '\t' collection items terminated by '#' select * from score;

Copy the result to the local file system with a Hadoop command

dfs -get /opt/module/exporthive/000000_0 /opt/module/exporthive/local.txt;

Export a table to HDFS with export

export table score to '/export/exporthive/score';
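An exported directory is normally consumed by the matching import statement. The sketch below reuses the export path from above; the target name score_imported is hypothetical.

```sql
-- export writes the data files plus a _metadata file to the HDFS directory.
export table score to '/export/exporthive/score';

-- import recreates the table, including its partitions, from that directory.
-- score_imported is an assumed name for the new table.
import table score_imported from '/export/exporthive/score';

-- Check that the partitions came across.
show partitions score_imported;
```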

Truncating a table

Only managed (internal) tables can be truncated

truncate table score6;

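Hive Functions

Built-in functions

1) List the built-in functions

hive> show functions;

2) Show the usage of a built-in function

hive> desc function upper;

3) Show the detailed usage of a built-in function

hive> desc function extended upper;

User-defined functions:

User-defined functions fall into three categories:

(1) UDF (User-Defined Function)

One row in, one row out

(2) UDAF (User-Defined Aggregation Function)

Aggregate function: many rows in, one row out

Similar to count/max/min

(3) UDTF (User-Defined Table-Generating Function)

One row in, many rows out

For example, lateral view explode()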
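The three categories can be seen with built-in functions that have the same shapes. This is an illustrative sketch: it uses the score table from the earlier sections and literal strings, so no extra tables are assumed.

```sql
-- UDF shape: one value in, one value out.
select upper('hive');

-- UDAF shape: many rows in, one value out.
select count(*), max(s_score), min(s_score) from score;

-- UDTF shape: one value in, several rows out (one row per array element).
select explode(split('a,b,c', ','));

-- lateral view joins the generated rows back to the row they came from.
select s.line, t.word
from (select 'a,b,c' as line) s
lateral view explode(split(s.line, ',')) t as word;
```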

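Development steps:

Step 1: Create a project and add the Hive jar dependencies

Step 2: Create a Java class that extends UDF

Step 3: Define a method named evaluate. It must return a value and take a parameter that receives the input data

Step 4: Implement the UDF logic in that method

Step 5: Package the project as a jar and put it in Hive's lib directory

Step 6: In the Hive client, register the jar with add jar

Step 7: Create a temporary function and bind it to the UDF class

create temporary function tolowercase as 'cn.itcast.udf.ItcastUDF';

Step 8: Use the UDF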
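Steps 6 to 8 might look like this in a Hive session. The jar path is hypothetical, and the example assumes that the evaluate method of cn.itcast.udf.ItcastUDF lower-cases its string argument, which the text does not confirm.

```sql
-- Step 6: register the jar built in step 5 (the path is an assumption).
add jar /opt/module/hive/lib/my_udf.jar;

-- Step 7: bind a function name to the UDF class for this session.
create temporary function tolowercase as 'cn.itcast.udf.ItcastUDF';

-- Step 8: call it like any built-in function.
select tolowercase('ABC');
select id, tolowercase(name) from stu2;

-- A temporary function disappears when the session ends;
-- it can also be removed explicitly.
drop temporary function if exists tolowercase;
```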

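Data Compression in Hive

Snappy is the usual recommendation: it compresses and decompresses much faster than gzip or bzip2, at the cost of a lower compression ratio.

Enabling compression of map output

Compressing map output reduces the amount of data transferred between the map and reduce tasks of a job. The configuration is as follows:

1) Enable compression of Hive's intermediate data

hive (default)>set hive.exec.compress.intermediate=true;

2) Enable map output compression in MapReduce

hive (default)>set mapreduce.map.output.compress=true;

3) Set the compression codec for map output

hive (default)>set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

4) Run a query

select count(1) from score;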

Enabling compression of reduce output

When Hive writes output to a table, that output can be compressed as well. The property hive.exec.compress.output controls this. Its default value is false, which produces uncompressed plain-text files. Setting it to true in a query session or in a script turns output compression on.

1) Enable compression of Hive's final output

hive (default)>set hive.exec.compress.output=true;

2) Enable compression of MapReduce's final output

hive (default)>set mapreduce.output.fileoutputformat.compress=true;

3) Set the compression codec for the final output

hive (default)>set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

4) Set the final output to block compression

hive (default)>set mapreduce.output.fileoutputformat.compress.type=BLOCK;

5) Check that the output files are compressed

insert overwrite local directory '/opt/module/snappy' select * from score distribute by s_id sort by s_id desc;

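Three Ways to Set Hive Parameters

First: hive-site.xml, which applies to every Hive client

Second: a command-line option, bin/hive -hiveconf name=value, which applies to the session started by that command

Third: a set statement, set name=value, which applies to the statements that follow it in the current session

Precedence: set statement > command-line option > configuration file (hive-site.xml)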
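The precedence can be observed from inside a session, because set with a name and no value prints the value currently in effect. A small sketch, using mapreduce.job.reduces purely as an example parameter:

```sql
-- Print the value currently in effect; at this point it comes from
-- the configuration files or from a -hiveconf option, if one was given.
set mapreduce.job.reduces;

-- Override it for the rest of this session.
set mapreduce.job.reduces=3;

-- Print it again; the session-level value now takes precedence.
set mapreduce.job.reduces;
```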

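Hive Storage Formats:

Two families, four formats:

Row-oriented: TextFile, SequenceFile

Column-oriented: Parquet, ORC

Compression ratio, summarized:

ORC > Parquet > TextFile

Query speed, summarized:

ORC > TextFile > Parquet

Storage format and compression, summarized:

In real projects, Hive tables are usually stored as ORC or Parquet, and Snappy is the usual choice of compression.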
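A table that follows this recommendation could be declared as below. This is a sketch: score_orc is a hypothetical table name, and the dfs path assumes the default database and the default warehouse directory.

```sql
-- ORC storage with Snappy compression, set through the orc.compress table property.
create table score_orc (s_id string, c_id string, s_score int)
stored as orc
tblproperties ("orc.compress"="SNAPPY");

-- ORC files cannot be produced with load data from a text file,
-- so the rows are written through a query.
insert overwrite table score_orc select s_id, c_id, s_score from score;

-- Inspect the storage settings and the on-disk size.
desc formatted score_orc;
dfs -du -h /user/hive/warehouse/score_orc;
```

Comparing the dfs -du output with the size of the original text table gives a rough feel for the compression ratio on your own data.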
