Hadoop-Hive Usage Notes

Hadoop HDFS storage commands:

hadoop fs -ls /dir

Loading a local data file into a Hive table (note: the file must be UTF-8 encoded)
1. Append to the table

hive> LOAD DATA LOCAL INPATH '/home/edgeuser/pake/20210602/mm.txt'
INTO TABLE S11.ld_cust_m
PARTITION (end_dt = '20210227');  -- load into the specified partition

2. Overwrite the table

hive> LOAD DATA LOCAL INPATH '/home/edgeuser/pake/20210602/mm.txt' OVERWRITE INTO TABLE S11.ld_cust_m;

-- Check disk space usage on the CDH host
df -h
-- Find files (for example logs) larger than 100 MB under the current directory
find ./ -type f -size +100M -print0 | xargs -0 du -h
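
When the list of large files needs further processing, the same search can be scripted. The following is a minimal Python sketch and not part of the original notes; the function name `find_large_files` and the demo files are hypothetical, and the demo uses a tiny threshold instead of 100 MB.

```python
import os
import tempfile
from pathlib import Path


def find_large_files(root, min_bytes):
    """Return (path, size) pairs for regular files under root larger than min_bytes."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            # Skip symlinks so only regular files are measured, like `find -type f`.
            if path.is_symlink() or not path.is_file():
                continue
            size = path.stat().st_size
            if size > min_bytes:
                hits.append((str(path), size))
    # Largest first, which is usually what matters when freeing space.
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Demo on a throwaway directory with a tiny threshold instead of 100 MB.
with tempfile.TemporaryDirectory() as demo_dir:
    Path(demo_dir, "small.log").write_bytes(b"x" * 10)
    Path(demo_dir, "big.log").write_bytes(b"x" * 5000)
    demo_hits = find_large_files(demo_dir, min_bytes=1000)
    demo_names = [(os.path.basename(p), s) for p, s in demo_hits]
```

For the 100 MB case the call would be `find_large_files("./", 100 * 1024 * 1024)`.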

Hive basic commands

Authentication command
kinit -kt /opt/Talend-7.1.1/keytabs/hive.keytab hive/name01.cdh.bank.cn@CDHUAT.BANK.CN
Check whether authentication succeeded: klist

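**Reduce settings**

Set the number of reducers: set mapreduce.job.reduces=3;
Show the current number of reducers: set mapreduce.job.reduces;
Start the Hive CLI with logs printed to the console:
hive --hiveconf hive.root.logger=INFO,console
The older, deprecated name of the same property also works:
set mapred.reduce.tasks=100;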

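Raise the limits on the number of dynamic partitions. In the development environment the default is 100, so loading more than 100 partitions raises an error. Both properties must be set together: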

set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;

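Checking how much storage a table uses

Note: SHOW TABLE STATS, SHOW COLUMN STATS and COMPUTE STATS are Impala statements (run them in impala-shell). The Hive counterpart is ANALYZE TABLE tablename COMPUTE STATISTICS.

SHOW TABLE STATS tablename;
SHOW COLUMN STATS tablename;
COMPUTE STATS tablename;  -- after this runs, SHOW TABLE STATS tablename shows the row count of every partition

Example:
show table stats s0.ld_cust_m;
show column stats s0.ld_cust_m;
compute stats s0.ld_cust_m;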

Viewing the execution plan of a Hive query

EXPLAIN <select statement>;  (the output is only the estimated plan; the query itself is not executed)

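Storage formats used when creating Hive tables, and how they differ

-- SequenceFile: row-oriented binary key/value file format
CREATE TABLE tablename
(a bigint
,b bigint
)
PARTITIONED BY (end_dt string)   -- define the partition column
STORED AS SEQUENCEFILE;          -- specify the storage format
SET hive.exec.dynamic.partition = true;   -- enable dynamic partitioning
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Parquet: columnar file format
CREATE TABLE tablename
(a bigint
,b bigint
)
PARTITIONED BY (end_dt string)   -- define the partition column
STORED AS PARQUET;               -- specify the storage format
SET hive.exec.dynamic.partition = true;   -- enable dynamic partitioning
SET hive.exec.dynamic.partition.mode = nonstrict;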

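4. coalesce and nvl serve much the same purpose

1.	coalesce returns the first non-null operand in its list; if all operands are null, it returns null.
2.	A coalesce expression needs at least one operand that is not the literal NULL.
3.	nvl and nvl2 evaluate all operands before testing them; coalesce tests first and evaluates an operand only when it is needed.
select coalesce(null,'11') from s02.card_hs limit 1;
select nvl(null,'11') from s02.card_hs limit 1;
select nvl2(null,'11','1') from s02.card_hs limit 1;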
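
The difference described in point 3 can be made visible outside the database. The following is a small Python sketch of the semantics as the notes describe them, not of Hive's implementation; the helper names `coalesce`, `nvl`, `nvl2` and `tracked` are hypothetical, and `None` stands in for SQL NULL.

```python
def coalesce(*operands):
    """Each operand is a zero-argument callable; stop at the first non-None result."""
    for thunk in operands:
        value = thunk()
        if value is not None:
            return value
    return None


def nvl(value, default):
    """Return default when value is None, otherwise value."""
    return default if value is None else value


def nvl2(value, if_not_null, if_null):
    """Return if_not_null when value is not None, otherwise if_null."""
    return if_not_null if value is not None else if_null


calls = []


def tracked(name, value):
    """Wrap a value in a callable that records when it is evaluated."""
    def thunk():
        calls.append(name)
        return value
    return thunk


# The third operand is never evaluated because the second one is already non-null.
result = coalesce(tracked("a", None), tracked("b", "11"), tracked("c", "22"))
```

After this runs, `calls` holds only the operands that were actually evaluated, which is the lazy behaviour attributed to coalesce above.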

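regexp_replace: cleaning up commas
The inner call removes a comma at the start or end of the string; the outer call replaces each run of 1 to 4 commas with a single comma.

select regexp_replace(regexp_replace(hx,'^,|,$',''),',{1,4}',',') from card_hs_20201126;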
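
To see what the two patterns do on concrete strings, here is a minimal Python sketch using the same regular expressions with `re.sub`. The function name `clean_commas` and the sample inputs are hypothetical; Hive itself uses Java regular expressions, which treat these two simple patterns the same way.

```python
import re


def clean_commas(text):
    """Mirror the nested regexp_replace: trim one edge comma, then squeeze runs."""
    trimmed = re.sub(r"^,|,$", "", text)      # drop one leading and one trailing comma
    return re.sub(r",{1,4}", ",", trimmed)    # each run of 1 to 4 commas becomes one


cleaned = clean_commas(",a,,b,,,c,")
```

Two limits follow from the patterns themselves: only one comma is trimmed at each end, and a run of five or more commas is split into several matches, so it leaves more than one comma behind.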

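5. Grouping rows and concatenating them into one line
collect_set drops duplicate values within each group; use collect_list to keep them.

select zhanghao,concat_ws(',',collect_set(cast(concat(hangyuanmingcheng,bili) as string))) from ifrs.ceshi group by zhanghao;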
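
The effect of combining group by, collect_set and concat_ws can be sketched in Python as follows. This is an illustration only: the function name `group_concat_distinct` and the sample rows are hypothetical, and the sketch keeps first-seen order, whereas Hive does not guarantee the order of elements returned by collect_set.

```python
from collections import OrderedDict


def group_concat_distinct(rows, sep=","):
    """Group (key, value) rows by key and join the distinct values with sep."""
    groups = OrderedDict()
    for key, value in rows:
        # A dict used as an ordered set: duplicates collapse, first-seen order is kept.
        groups.setdefault(key, OrderedDict())[value] = None
    return {key: sep.join(values) for key, values in groups.items()}


# Hypothetical rows: (zhanghao, hangyuanmingcheng concatenated with bili)
sample_rows = [
    ("A001", "Zhang0.6"),
    ("A001", "Li0.4"),
    ("A001", "Zhang0.6"),  # duplicate, dropped the way collect_set would drop it
    ("B002", "Wang1.0"),
]
grouped = group_concat_distinct(sample_rows)
```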

6. Querying the current time

select current_date from bd_signdetail;
select current_timestamp from bd_signdetail;
select from_unixtime(unix_timestamp());

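Formatting a date:
date_format(date, pattern)
select date_format('2021-08-24 12:00:00','yyyy-MM-dd');  -- MM is the month; lowercase mm means minutes
Date addition and subtraction:
date_add
select date_add('2021-03-01 12:00:00',-1);
select datediff('2019-10-01','2019-10-01');
-- last day of the previous month, shifted back by one more month
select add_months(date_add(concat(substring(current_date(),1,8),'01'),-1),-1);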
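
The day arithmetic in these statements can be checked outside Hive. The following is a minimal Python sketch under the assumption that inputs are 'yyyy-MM-dd' strings with an optional time part; the function names mirror the Hive functions but are hypothetical local helpers, and add_months is deliberately left out.

```python
from datetime import date, datetime, timedelta


def to_day(text):
    """Parse the date part of a 'yyyy-MM-dd[ HH:mm:ss]' string."""
    return datetime.strptime(text[:10], "%Y-%m-%d").date()


def date_add(start, days):
    """Shift a date string by a whole number of days (negative moves back)."""
    return to_day(start) + timedelta(days=days)


def datediff(end, start):
    """Number of days from start to end."""
    return (to_day(end) - to_day(start)).days


def last_day_of_previous_month(today):
    """First day of the current month minus one day."""
    return today.replace(day=1) - timedelta(days=1)


shifted = date_add("2021-03-01 12:00:00", -1)
```

`last_day_of_previous_month` corresponds to the inner `date_add(concat(substring(current_date(),1,8),'01'),-1)` part of the last statement.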
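
Analytic (window) functions:
```sql
-- ROW_NUMBER(): numbers rows within each group with no ties
-- (the OVER clause here reuses the one from the RANK example)
ROW_NUMBER() OVER (PARTITION BY R ORDER BY DATA_DT DESC) RN

-- RANK(): rows with equal values within a group share the same rank
RANK() OVER (PARTITION BY R ORDER BY DATA_DT DESC) RAN
```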
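
To make the difference concrete, the following Python sketch computes both numberings for the rows of a single partition ordered descending. It is an illustration of the semantics only; the function name `row_number_and_rank` and the sample DATA_DT values are hypothetical.

```python
def row_number_and_rank(values):
    """Return (value, row_number, rank) tuples for values ordered descending."""
    ordered = sorted(values, reverse=True)
    result = []
    for position, value in enumerate(ordered, start=1):
        if result and result[-1][0] == value:
            rank = result[-1][2]  # tie: reuse the rank of the previous row
        else:
            rank = position       # new value: rank jumps to the row position
        result.append((value, position, rank))
    return result


# Hypothetical DATA_DT values inside one partition, with a tie on 20210602.
numbered = row_number_and_rank(["20210601", "20210602", "20210602", "20210531"])
```

The tied rows get row numbers 1 and 2 but share rank 1, and the next row gets rank 3, so rank leaves a gap after a tie.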

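Altering table structure

-- Change the data type of a column (repeat the column name, then give the new type)
ALTER TABLE table_name CHANGE COLUMN col_name col_name new_data_type;

-- Rename a column
ALTER TABLE table_name CHANGE COLUMN old_col_name new_col_name data_type;

-- Alter table structure (add columns)

ALTER TABLE table_name ADD COLUMNS (col_name col_type);
Note: after columns are added with ALTER TABLE table_name ADD COLUMNS, the existing data in the table can end up misaligned; the partition data then has to be cleared and re-inserted.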

5. Exporting Hive table data to a file
Method 1:

hive -e "
set hive.exec.compress.output=false;
insert overwrite local directory 'home/data/'
row format delimited fields terminated by ','
stored as textfile
select * from table_name;"

Decoding a URL-encoded GBK string by calling Java's URLDecoder through reflect:
select reflect('java.net.URLDecoder','decode','%cb%d1%b9%b7',"GBK");
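
The same decoding can be cross-checked outside Hive. This is a minimal Python sketch using the standard library's `urllib.parse`; it only illustrates what decoding percent-escapes as GBK means and is not what Hive runs internally.

```python
from urllib.parse import quote, unquote

encoded = "%cb%d1%b9%b7"
# Decode the percent-escapes as GBK bytes instead of the UTF-8 default.
decoded = unquote(encoded, encoding="gbk")
# Re-encoding with the same charset should reproduce the original escapes.
round_trip = quote(decoded, encoding="gbk").lower()
```

The four escaped bytes form two double-byte GBK characters, so `decoded` has length 2.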

Spark basic commands

-- Enable dynamic partitioning

hive.exec.dynamic.partition            true
hive.exec.dynamic.partition.mode       nonstrict

-- Set the query engine to spark
hive.execution.engine                 spark
set hive.execution.engine=spark;

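Spark engine settings

When Hive reports return code 2 or return code 3:
It may be caused by data skew.
It may also be caused by insufficient resources (no fix; you can only wait).
Settings to change:

set hive.auto.convert.join = false;
set hive.ignore.mapjoin.hint = false;
set hive.exec.parallel = true;

-- Add the following setting only when the query involves GROUP BY aggregation
set hive.groupby.skewindata=true;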

Spark: adjusting memory size

"hive.exec.dynamic.partition"      "true"
"hive.exec.dynamic.partition.mode" "nonstrict"
"hive.execution.engine"           "spark"
"spark.executor.memory"  "18589934592"
"spark.driver.memory"      "18589934592"
"17179869184"

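The memory values above are long digit strings that are hard to read at a glance. Assuming they are plain byte counts (the notes do not state the unit), this short Python sketch converts them to GiB; the helper name `bytes_to_gib` is hypothetical.

```python
GIB = 1024 ** 3  # bytes per GiB


def bytes_to_gib(value):
    """Convert a byte count (int or numeric string) to GiB."""
    return int(value) / GIB


sizes = {raw: round(bytes_to_gib(raw), 2) for raw in ("18589934592", "17179869184")}
```

Under that assumption the second value is exactly 16 GiB and the first is a little over 17 GiB.
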