1.0 Hadoop Configuration
  • Extract the tar package
  • Configure environment variables
  • Configuration files

  • hadoop-env.sh
export JAVA_HOME=/usr/java/jdk1.8.0_212
  • core-site.xml

	<property>
		<name>fs.default.name</name>
		<value>hdfs://master:9000</value>
	</property>
  • hdfs-site.xml

	<property>
		<name>dfs.replication</name>
		<value>2</value>
	</property>
	<property>
		<name>dfs.datanode.data.dir</name>
		<value>file:/usr/hadoop/hadoop-2.7.7/hdfs/data</value>
	</property>
	<property>
		<name>dfs.namenode.name.dir</name>
		<value>file:/usr/hadoop/hadoop-2.7.7/hdfs/name</value>
	</property>
  • yarn-env.sh
export JAVA_HOME=/usr/java/jdk1.8.0_212
  • yarn-site.xml

	<property>
		<name>yarn.resourcemanager.admin.address</name>
		<value>master:18141</value>
	</property>
	<property>
		<name>yarn.nodemanager.aux-services</name>
		<value>mapreduce_shuffle</value>
	</property>
  • mapred-site.xml

	<property>
		<name>mapreduce.framework.name</name>
		<value>yarn</value>
	</property>
  • Set the node files
    vim master
    vim slaves
  • Format the NameNode
    hadoop namenode -format
  • Start the cluster
    start-all.sh
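The "configure environment variables" step above normally means appending exports to /etc/profile and sourcing it. A minimal sketch, using a throwaway file (/tmp/profile.demo, an illustrative stand-in for /etc/profile) and the install prefix assumed throughout these notes:

```shell
# Append the Hadoop exports to a profile-style file and source it.
# /tmp/profile.demo stands in for /etc/profile; adjust paths to your layout.
cat >> /tmp/profile.demo <<'EOF'
export HADOOP_HOME=/usr/hadoop/hadoop-2.7.7
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
EOF
. /tmp/profile.demo
echo "HADOOP_HOME is $HADOOP_HOME"
```

After sourcing, hadoop and the start-*.sh scripts resolve from any directory.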
2.0 Spark Configuration
2.1 Scala Configuration
  • Extract the archive
  • Configure environment variables
2.2 Spark Configuration
  • Extract the archive
  • Configure environment variables
  • Configuration files
    cd /usr/spark/spark/conf
    cp spark-env.sh.template spark-env.sh
    vim spark-env.sh
export SPARK_MASTER_IP=master
export SCALA_HOME=/usr/scala/scala
export JAVA_HOME=/usr/java/jdk1.8.0_212
export HADOOP_HOME=/usr/hadoop/hadoop-2.7.7
export HADOOP_CONF_DIR=/usr/hadoop/hadoop-2.7.7/etc/hadoop

mv slaves.template slaves    # then append slave1 and slave2

  • Start the cluster: sbin/start-all.sh
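A minimal sketch of the spark-env.sh edit above, writing the same exports into a scratch copy (/tmp/spark-env.demo.sh, an illustrative path, not the real conf directory) and sanity-checking the result:

```shell
# Write the spark-env.sh entries listed above and verify them.
cat > /tmp/spark-env.demo.sh <<'EOF'
export SPARK_MASTER_IP=master
export SCALA_HOME=/usr/scala/scala
export JAVA_HOME=/usr/java/jdk1.8.0_212
export HADOOP_HOME=/usr/hadoop/hadoop-2.7.7
export HADOOP_CONF_DIR=/usr/hadoop/hadoop-2.7.7/etc/hadoop
EOF
grep -c '^export' /tmp/spark-env.demo.sh   # counts the 5 export lines
```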
3.0 Flume Configuration
  • Extract the archive
  • Delete guava-11.0.2.jar from the lib directory
4.0 Kafka Configuration
  • Extract the archive
  • Go to the config directory: cd /usr/kafka/config
  • Edit server.properties
#broker id
broker.id=0

delete.topic.enable=true

#directory for Kafka data/log files
log.dirs=/usr/kafka/kafka/logs

#Zookeeper cluster connection string
zookeeper.connect=master:2181,slave1:2181,slave2:2181

1. Edit the Kafka configuration file:
vim /opt/kafka/kafka_2.11-2.0.0/config/server.properties

Add the following settings to the file:

listeners=PLAINTEXT://0.0.0.0:9092
advertised.host.name=master
advertised.listeners=PLAINTEXT://192.168.3.89:9092

2. Find the running Kafka process with jps and kill it:
kill -9 <pid>

3. Start Kafka again:
/opt/kafka/kafka_2.11-2.0.0/bin/kafka-server-start.sh -daemon /opt/kafka/kafka_2.11-2.0.0/config/server.properties
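Each broker in the cluster needs a distinct broker.id. A sketch of scripting that edit with sed on a scratch copy of server.properties (the file path and the slave id value are illustrative assumptions):

```shell
# Build a scratch server.properties, then derive slave1's copy with a new broker.id.
cat > /tmp/server.properties.demo <<'EOF'
broker.id=0
delete.topic.enable=true
log.dirs=/usr/kafka/kafka/logs
zookeeper.connect=master:2181,slave1:2181,slave2:2181
EOF
sed 's/^broker.id=0$/broker.id=1/' /tmp/server.properties.demo > /tmp/server.properties.demo.slave1
grep '^broker.id' /tmp/server.properties.demo.slave1
# -> broker.id=1
```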

5.0 MySQL

5.1 Check whether mysql-community-server is installed

rpm -qa | grep mysql-community-server

5.2 If it is not installed, install it as follows
  • Install wget:

    yum install wget

  • Download the yum repository package:

    wget http://repo.mysql.com/mysql57-community-release-el7.rpm

  • Check the CentOS version:

    cat /etc/centos-release

  • Install the downloaded rpm:

    rpm -ivh mysql57-community-release-el7.rpm

  • Install MySQL:

    yum -y install mysql-community-server

  • Start the MySQL service:

    systemctl start mysqld

  • Retrieve the temporary initial password:

    grep 'temporary password' /var/log/mysqld.log

  • Change the password:

    mysql_secure_installation

    When "Enter password for user root:" appears, enter the temporary password obtained above and press Enter.

    Answer y to every yes/no confirmation.

    When "New password:" appears, type the new password, press Enter, then type it again to confirm.

  • Enable the MySQL service at boot

    systemctl enable mysqld

  • Open firewall port 3306

    If this port is closed, MySQL cannot be reached over the network; it is normally opened so that remote clients can connect.

    firewall-cmd --zone=public --add-service=mysql --permanent
    firewall-cmd --reload

5.3 Check whether the MySQL service is running

systemctl status mysqld

5.4 The temporary password is written to mysqld.log

grep "temporary password" /var/log/mysqld.log

5.6 Set the local MySQL root password to 123456

5.7 Log in to MySQL

mysql -uroot -p

2. Set the password policy to LOW:

mysql> set global validate_password_policy=0;

3. Set the minimum password length to 6:

mysql> set global validate_password_length=6;

4. Change the local password:

mysql> alter user 'root'@'localhost' identified by '123456';

5. Grant root remote access from any host (otherwise Hive cannot connect to MySQL later):

mysql> grant all privileges on *.* to 'root'@'%' identified by '123456' with grant option;

6. Flush privileges:

mysql> flush privileges;
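The interactive mysql> statements above can also be collected into one script and replayed non-interactively, e.g. with `mysql -uroot -p < /tmp/init.demo.sql`. A sketch; /tmp/init.demo.sql is an illustrative path:

```shell
# Collect the initialization statements into one replayable SQL script.
cat > /tmp/init.demo.sql <<'EOF'
set global validate_password_policy=0;
set global validate_password_length=6;
alter user 'root'@'localhost' identified by '123456';
grant all privileges on *.* to 'root'@'%' identified by '123456' with grant option;
flush privileges;
EOF
grep -c ';' /tmp/init.demo.sql   # one statement per line
# -> 5
```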

1.0 Extract the Hive package to /usr/hive
mkdir -p /usr/hive
cd /usr/package
tar -zxvf apache-hive-2.1.1-bin.tar.gz -C /usr/hive
2.0 Set the Hive environment variables
vi /etc/profile

#add the following lines
export HIVE_HOME=/usr/hive/apache-hive-2.1.1-bin
export PATH=$PATH:$HIVE_HOME/bin

#make the change take effect
source /etc/profile
3.0 Set up the Hive runtime environment
cd /usr/hive/apache-hive-2.1.1-bin/conf
mv hive-env.sh.template hive-env.sh
3.1 In hive-env.sh, set the Hadoop install path, the Hive configuration directory (conf), and the Hive library path (lib)
vi hive-env.sh

Set the following:

export HADOOP_HOME=/usr/hadoop/hadoop-2.7.7
export HIVE_CONF_DIR=/usr/hive/apache-hive-2.1.1-bin/conf
export HIVE_AUX_JARS_PATH=/usr/hive/apache-hive-2.1.1-bin/lib
3.2 Create the file hive-site.xml
vi hive-site.xml

On slave1, add the following:

<configuration>

    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive_remote/warehouse</value>
    </property>

    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://slave2:3306/hive?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8&amp;useSSL=false</value>
        <description>JDBC connect string for a JDBC metastore</description>
    </property>

    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>

    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
    </property>

    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>123456</value>
    </property>

    <property>
        <name>hive.metastore.schema.verification</name>
        <value>false</value>
    </property>

    <property>
        <name>datanucleus.schema.autoCreateAll</name>
        <value>true</value>
    </property>

</configuration>

On master, add the following:

<configuration>

    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive_remote/warehouse</value>
    </property>

    <property>
        <name>hive.metastore.local</name>
        <value>false</value>
    </property>

    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://slave1:9083</value>
    </property>

</configuration>

3.3 Copy the JDBC driver jar into $HIVE_HOME/lib

The Hive server side needs to communicate with MySQL, so copy the MySQL connector jar into $HIVE_HOME/lib. Do this on slave1 only.

cd /usr/package
cp mysql-connector-java-5.1.47-bin.jar /usr/hive/apache-hive-2.1.1-bin/lib
4.0 Resolve the jline version conflict

Copy the jline jar from Hive's lib directory into the Hadoop tree:

cd /usr/hive/apache-hive-2.1.1-bin/lib
cp jline-2.12.jar /usr/hadoop/hadoop-2.7.7/share/hadoop/yarn/lib
5.0 Start Hive and verify

Start the Hadoop cluster before every Hive start. On master:

cd /usr/hadoop/hadoop-2.7.7
start-dfs.sh
start-yarn.sh

On slave1, initialize the metastore database:

schematool -dbType mysql -initSchema

On slave1, start the metastore service:

cd /usr/hive/apache-hive-2.1.1-bin
bin/hive --service metastore

On master, start the Hive CLI:

cd /usr/hive/apache-hive-2.1.1-bin
bin/hive

After Hive starts, run show databases; if the database list is printed, the installation succeeded.

Error:
org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The auxService:mapreduce_shuffle does not exist

Fix: add the following to yarn-site.xml and restart YARN; Hadoop does not need to be reformatted.

<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
<property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
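Before restarting YARN it is worth confirming the aux-service is actually declared. A sketch of that check against a scratch copy of the file (the real path would be $HADOOP_HOME/etc/hadoop/yarn-site.xml):

```shell
# Sanity-check that the mapreduce_shuffle aux-service is declared.
cat > /tmp/yarn-site.demo.xml <<'EOF'
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
EOF
grep -q '<value>mapreduce_shuffle</value>' /tmp/yarn-site.demo.xml && echo 'aux-service declared'
```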
6.0 Common Hive Commands

  • Load data (in Hive the field/line delimiters belong on the table DDL, not in the LOAD statement)

    load data local inpath '/root/customer.txt'
    into table CUSTOMER;
    
  • LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename 
    [PARTITION (partcol1=val1, partcol2=val2 ...)]
    
  1. Create a partitioned table

    create [external] table employees(id int, name string, age int)
    comment 'employees home address'
    partitioned by (province string, city string);
    
  2. Insert data into the partitioned table

    insert into table employees partition(province='henan',city='zhumadian')
    values(1,'zhangsan',23),(2,'zhangsan',23);
    
  3. List the partitions of a table

    hive> show partitions tablename;
    -- inspect a specific partition
    hive> show partitions tablename partition(province='henan');
    
  4. Dynamic partitioning

    -- dynamic partitioning requires changing a default: enable non-strict mode
    set hive.exec.dynamic.partition.mode=nonstrict;

    -- at insert time, partitions are derived from the last columns of the select list
    1. Create the partitioned table
    create [external] table employees2(id int, name string, age int)
    comment 'employees home address'
    partitioned by (province string, city string);
    2. Insert data; partitions are created automatically from the province and city values, and the column names must match
    insert overwrite table employees2 partition(province,city)
    select id,name,age,province,city from employees;
    
  5. Export table data

    -- the LOCAL keyword controls whether output goes to the local filesystem or HDFS
    INSERT OVERWRITE [LOCAL] DIRECTORY '/home/hadoop/output' 
    [ROW FORMAT DELIMITED 
    FIELDS TERMINATED BY '\t']
    select * from employees2; 
    
  6. Show column headers in query results

    set hive.cli.print.header=true;
    
6.1 The over() clause in Hive

Window functions

Analytic function syntax: over(partition by <column> order by <column> rows between <start> and <end>)
  • Row_Number, Rank, and Dense_Rank are the most widely used window functions.

  • row_number():

    Numbers the rows of each partition sequentially starting from 1; the values never repeat, and tied rows are ordered by their position in the table. Typically used to fetch the first-ranked record in a group, e.g. the first referrer in a session.

  • rank():

    Ranks rows within the partition; tied rows share a rank and leave gaps in the sequence.

  • dense_rank():

    Ranks rows within the partition; tied rows share a rank and leave no gaps.
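The difference between the three functions can be seen outside Hive on a small, assumed score list (already sorted descending), emulated with awk: row_number is the line number, rank jumps after ties, dense_rank does not.

```shell
# Emulate row_number()/rank()/dense_rank() over a descending score column.
# Output columns: score, row_number, rank, dense_rank
printf '95\n95\n90\n90\n85\n' | awk '
  { if ($1 != prev) { dense++; rnk = NR }  # new value: dense rank +1, rank jumps to the row number
    prev = $1
    print $1, NR, rnk, dense }'
# -> 95 1 1 1
#    95 2 1 1
#    90 3 3 2
#    90 4 3 2
#    85 5 5 3
```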

Sample data (id, user_id, day, time_stamp):

1,1,20201210,1607558400
2,2,20201210,1607558402
3,3,20201210,1607558403
4,4,20201210,1607558406
5,1,20201210,1607558500
6,2,20201210,1607558510
7,3,20201210,1607558520
8,5,20201210,1607558525
9,6,20201210,1607558529
10,4,20201210,1607558532
11,2,20201210,1607558535
12,5,20201210,1607558540
13,1,20201210,1607558545
14,6,20201210,1607558550
15,2,20201210,1607558560
16,1,20201210,1607558570
17,1,20201211,1607644805
18,2,20201211,1607644806
19,3,20201211,1607644809
20,4,20201211,1607644812
21,1,20201211,1607644815
22,2,20201211,1607644820
23,3,20201211,1607644828
24,5,20201211,1607644832
25,6,20201211,1607644843
26,4,20201211,1607644849
27,2,20201211,1607644856
28,5,20201211,1607644860
29,1,20201211,1607644863
30,6,20201211,1607644878
31,2,20201211,1607644885
32,1,20201211,1607644899

  • Create the table and load the data

    CREATE EXTERNAL TABLE IF NOT EXISTS douyin_maidian_log
    (
    	id string,
        user_id string,
        day string,
        time_stamp int
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
    
    load data local inpath 'filepath' into table douyin_maidian_log;
    
  • Query user details together with the total row count in the table

    select *, count(user_id) over() as total from douyin_maidian_log;
    

  • Query user details together with the number of events per day

    select *, count(user_id) over(partition by day) from douyin_maidian_log;
    

  • Query user details together with the number of distinct users per day

    First count the users per day:
    	select day, count(distinct user_id) from douyin_maidian_log group by day;
    Then join it back:
    	select t1.*, t2.total from douyin_maidian_log t1,
       (
        select day, count(distinct user_id) as total from douyin_maidian_log group by day
        ) as t2  
        where t1.day = t2.day;
    
  • Query each user's per-session online time

    -- step 1: number each user's events by timestamp
    select user_id,time_stamp,row_number() over(partition by user_id order by time_stamp) as rn from douyin_maidian_log;
    --------------------------------------------------------------------------------------
    -- step 2: keep the even rows (logouts) and shift rn down by 1 so it lines up with the matching login row
    select t1.user_id user_id,t1.time_stamp time_stamp,t1.rn-1 rn from ( select user_id,time_stamp,row_number() over(partition by user_id order by time_stamp) as rn from douyin_maidian_log ) t1 where t1.rn%2=0;
    --------------------------------------------------------------------------------------

    row_number() assigns one sequence number to every row in the partition; here the odd rows are logins and the even rows the matching logouts.

    -- step 3: join the logins (odd rn) with the shifted logouts and subtract
    select t12.user_id,t11.time_stamp-t12.time_stamp,row_number() over(partition by t12.user_id) from 
    (select t1.user_id user_id,t1.time_stamp time_stamp,t1.rn rn from
    (select user_id,time_stamp,row_number() over(partition by user_id order by time_stamp) as rn from douyin_maidian_log) t1 where t1.rn%2=1) t12,
    (select t1.user_id user_id,t1.time_stamp time_stamp,t1.rn-1 rn from
    ( select user_id,time_stamp,row_number() over(partition by user_id order by time_stamp) as rn from douyin_maidian_log ) t1 where t1.rn%2=0) t11
    where t12.user_id=t11.user_id and t12.rn=t11.rn; 
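The pairing logic of the query above (odd rn = login, even rn = logout) can be checked outside Hive with awk, using user 1's sorted timestamps from the sample data:

```shell
# user 1's sorted timestamps: odd rows are logins, even rows logouts;
# print the length of each session.
printf '1607558400\n1607558500\n1607558545\n1607558570\n' |
awk 'NR % 2 { start = $1; next }   # odd row: remember the login time
     { print $1 - start }'         # even row: logout minus login
# -> 100
#    25
```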
    
    
  • Query each user's total online time

    select tb.user_id,sum(tb.total) from
    (
        select t12.user_id user_id,t11.time_stamp-t12.time_stamp total,row_number() over(partition by t12.user_id) rn from 
    (
        select t1.user_id user_id,t1.time_stamp time_stamp,t1.rn rn from
    (
        select user_id,time_stamp,row_number() over(partition by user_id order by time_stamp) as rn 
        from douyin_maidian_log) t1 where t1.rn%2=1) t12,
    (select t1.user_id user_id,t1.time_stamp time_stamp,t1.rn-1 rn 
     from
    (
        select user_id,time_stamp,row_number() over(partition by user_id order by time_stamp) as rn 
        from douyin_maidian_log) t1 where t1.rn%2=0) t11
     where t12.user_id=t11.user_id and t12.rn=t11.rn) tb group by tb.user_id;
    
6.1.2 Hive window function example 2
  • Test data

    20191020,11111,85
    20191020,22222,83
    20191020,33333,86
    20191021,11111,87
    20191021,22222,65
    20191021,33333,98
    20191022,11111,67
    20191022,22222,34
    20191022,33333,88
    20191023,11111,99
    20191023,22222,33
    
  • Create the test table and load the data

    CREATE EXTERNAL TABLE IF NOT EXISTS user_score
    (day string,
    userid string,
    score int)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
    
    load data local inpath '/home/test/user_score.log' into table user_score;
    
    
  • Count, for each day, the users with a score above 80

    select t.day,max(t.total) from
    (select  day,userid,score,count(userid) over(partition by day rows between unbounded preceding and current row) total 
    from  user_score
    where score>80) t group by t.day;
    --------------------------------------------------------------------------------------
    select day,count(*) from user_score where score > 80 group by day order by day ;
    

    The rows between unbounded preceding and current row clause is optional here; without it the count covers the whole partition, and the maximum is the same.

  • For each user, count the days to date with a score above 80

     select *,count(userid) over(partition  by userid order by day) as total
    from user_score where score>80 order by day,userid;
    
6.1.3 Hive window function example 3
  • Test data

    jack,2017-01-01,10
    tony,2017-01-02,15
    jack,2017-02-03,23
    tony,2017-01-04,29
    jack,2017-01-05,46
    jack,2017-04-06,42
    tony,2017-01-07,50
    jack,2017-01-08,55
    mart,2017-04-08,62
    mart,2017-04-09,68
    neil,2017-05-10,12
    mart,2017-04-11,75
    neil,2017-06-12,80
    mart,2017-04-13,94
    
  • Create the test table and load the data

    create table business
    (
    name string, 
    orderdate string,
    cost int  
    )ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
    
    #load the data
    load data local inpath "/home/test/business.log" into table business;
    
  • Find the customers who purchased in April 2017, together with the total number of such customers

    select *,count(name) over() as total from business where substr(orderdate,1,7)='2017-04';
    

    Syntax: substr(string A, int start) and substring(string A, int start) behave identically and take two arguments.

    Return value: string

    Returns the substring of A from position start through the end of the string.

  • Query each customer's purchase details and the monthly purchase total

    select *,sum(cost) over(partition by name,substr(orderdate,1,7)) total_amount from business;
    
  • Query each customer's previous purchase date (null for a first purchase)

    select name,orderdate,cost,lag(orderdate,1) over(partition by name order by orderdate) last_date from business;
    
6.1.4 Hive window function example 4
  • Prepare test data

    孙悟空,语文,87
    孙悟空,数学,95
    孙悟空,英语,68
    大海,语文,94
    大海,数学,56
    大海,英语,84
    宋宋,语文,64
    宋宋,数学,86
    宋宋,英语,84
    婷婷,语文,65
    婷婷,数学,85
    婷婷,英语,78
    
  • Create the table and load the data

    create table score
    (
    name string,
    subject string, 
    score int
    ) row format delimited fields terminated by ",";
    
    #load the data
    load data local inpath '/usr/data/test4.log' into table score;
    
    
  • Rank students' scores within each subject

    select *,
    row_number() over(partition by subject order by score desc), -- no tied ranks
    rank() over(partition by subject order by score desc),       -- ties share a rank, leaving gaps
    dense_rank() over(partition by subject order by score desc)  -- ties share a rank, no gaps
    from score;
    
    Math (数学) only:
    select t1.name,t1.subject,t1.score,t1.ranking from(select name,subject,score,row_number() over(partition by subject order by score desc)as ranking from score) t1 where t1.subject="数学";
    
  • Query the top-n students per subject

    select * from 
    (select *,row_number() over(partition by subject order by score desc) rmp from score
    ) t where t.rmp<=3;
    
6.4 Hive data deduplication

Case 1: rows fully duplicated

name      score
Computer  1600
Phone     12
Phone     12

Steps:

  1. Copy the table structure

    CREATE TABLE <new_table> LIKE <table>;
    
  2. Insert the deduplicated data

    insert overwrite table <new_table>
    select distinct * from <table>;
    

Case 2: rows partially duplicated

name      score  type
Computer  1600   2
Phone     12     1
Phone     15     1

Steps:

  1. Copy the table structure

    CREATE TABLE <new_table> LIKE <table>;
    
  2. Insert the deduplicated data

    insert overwrite table <new_table>
    select t.name, t.score, t.type
    from (
    	select
        	name, score, type, row_number() over(distribute by name sort by score) as rn
        from <table>
      ) t where t.rn = 1;
    
  3. Summary

    insert overwrite table <new_table>
    select <fields>
    from (
    select <fields>, row_number() over(distribute by <duplicated field> sort by <ordering field>) as rn
    from <table>
    ) t where t.rn = 1;
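The rn = 1 filter above amounts to keeping the first row seen per key. The same idea sketched with awk on assumed comma-separated sample rows (sort the input first to control which row survives, mirroring sort by):

```shell
# Keep the first row seen for each key in column 1, like filtering rn = 1.
printf 'Phone,12\nPhone,15\nComputer,1600\n' | awk -F, '!seen[$1]++'
# -> Phone,12
#    Computer,1600
```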
    
6.5 Filling missing fields in Hive

select t1.name,t1.rk from(
select *,NVL(comment,-1) rk from region )t1
6.6 Computing user online time
  • Test data

    'aa','2020-04-28 09:30:00','login'
    'aa','2020-04-28 10:15:00','logout'
    'aa','2020-04-28 13:30:00','login'
    'aa','2020-04-28 14:15:00','logout'
    'bb','2020-04-28 06:30:00','login'
    'bb','2020-04-28 09:15:00','logout'
    'bb','2020-04-28 12:30:00','login'
    'cc','2020-04-28 11:25:00','logout'
    'cc','2020-04-28 16:30:00','login'
    'cc','2020-04-28 20:10:00','logout'
    'dd','2020-04-28 05:25:00','logout'
    'dd','2020-04-28 10:30:00','login'
    'dd','2020-04-28 11:10:00','logout'
    'dd','2020-04-28 21:15:00','login'
    
  • Create the table (the data has three columns: name, a timestamp, and a login/logout flag)

    CREATE TABLE user_test(
    name string,
    ts string,
    action string
    )row format delimited fields terminated by ',';
    
    #load the data
    load data local inpath '/usr/data/test7.txt' into table user_test;
    
Reprinted from www.mshxw.com
Original: https://www.mshxw.com/it/673774.html