Data volumes keep growing, analysis is expected ever closer to real time, and the results are applied ever more widely — big data technology emerged to meet these demands.
Big data: the collective term for techniques that collect, organize, and process very large data sets and extract results from them.
Big data processing frameworks
Processing framework: the set of components that actually carry out operations on the data.
Common frameworks
- Batch processing frameworks: process large, bounded data sets in bulk and can operate on the data set as a whole, e.g. Apache Hadoop
- Stream processing frameworks: compute in real time over data as it enters the system — an "unbounded data" style of operation, e.g. Apache Storm, Apache Samza
- Hybrid frameworks: handle both batch and stream workloads, e.g. Apache Spark, Apache Flink
Hadoop introduction
Overview
Hadoop is a distributed-systems foundation developed by the Apache Software Foundation, aimed chiefly at storing and analyzing massive data sets. In the broad sense, "Hadoop" usually refers to a wider concept — the Hadoop ecosystem.
Hadoop is reliable, scalable open-source software for distributed computing. It scales from a single server to thousands of machines, each offering local computation and storage.
Hadoop treats hardware failure as the norm: faults are handled in software, so high availability is achieved at the software level.
Hadoop is a big data processing framework that allows distributed processing of large data sets across clusters of computers using simple programming models.
Projects
Core projects
- Hadoop HDFS: a distributed file system providing high-throughput access to application data
- Hadoop YARN: a framework for job scheduling and cluster resource management
- Hadoop MapReduce: a YARN-based system for parallel processing of large data sets
- Hadoop Common: common utilities that support the other Hadoop modules
- Hadoop Ozone: an object store for Hadoop clusters
Other projects
- Ambari: a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides a dashboard for viewing cluster health (e.g. heatmaps), the ability to view MapReduce, Pig, and Hive applications visually, and features for diagnosing their performance characteristics in a user-friendly way
- Avro: a data serialization system
- HBase: a scalable distributed database that supports structured data storage for large tables
- Mahout: a scalable machine learning and data mining library
- Spark: a fast, general-purpose compute engine for Hadoop data. Spark offers a simple, expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation
- ZooKeeper: a high-performance coordination service for distributed applications
HDFS
Introduction
HDFS: the Hadoop Distributed File System. HDFS is a highly fault-tolerant system that can be deployed on inexpensive machines and provides high-throughput data access, making it well suited to applications with very large data sets (massive-scale analytics, machine learning, and so on).
Features:
- Supports very large files: suited to TB-scale data files, stored in blocks
- Write once, read many times; sequential reads
- Runs on inexpensive hardware
- Tolerates hardware failure
Read/write flow
Terminology
Block: the most basic unit of storage. Files are split into blocks, typically 128 MB per block.
Hadoop cluster: a one-master, multi-slave architecture
NameNode: the master node
Holds the directory information, file information, and block information for the entire file system
Functions:
- Accepts client operation requests
- Maintains the file system's directory structure
- Manages the mapping between files and blocks
- Manages the mapping between blocks and DataNodes
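The two mappings can be pictured as small lookup tables. Here is a toy sketch in shell — the file name, block IDs, and DataNode names are all invented for illustration, and real NameNode metadata is of course kept in memory, not in flat files:

```shell
cd "$(mktemp -d)"
# file -> blocks table (which blocks make up each file)
echo '/data/a.txt blk_1 blk_2' > file_blocks.txt
# block -> DataNodes table (where each replica of a block lives)
printf 'blk_1 dn1 dn2\nblk_2 dn2 dn3\n' > block_nodes.txt
# reading /data/a.txt: resolve file -> blocks, then each block -> its DataNodes
blocks=$(awk '$1 == "/data/a.txt" { $1 = ""; print }' file_blocks.txt)
for b in $blocks; do
  grep "^$b " block_nodes.txt   # prints "blk_1 dn1 dn2", then "blk_2 dn2 dn3"
done
```

A client read works the same way: ask the NameNode which blocks a file has and where they live, then fetch each block directly from a DataNode.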
DataNode: the slave nodes
Distributed across inexpensive machines; stores the block files
Storage layout: a file is split into blocks stored on the DataNodes' disks, and each block can be stored as multiple replicas
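The splitting itself can be tried locally with plain coreutils — a sketch only, where `demo.bin` and a 128-byte "block size" stand in for real data and HDFS's default 128 MB blocks:

```shell
cd "$(mktemp -d)"
# a 300-byte file cut into 128-byte "blocks", the way HDFS cuts files into 128 MB blocks
head -c 300 /dev/zero > demo.bin
split -b 128 demo.bin blk_    # writes blk_aa, blk_ab, blk_ac
wc -c blk_*                   # two full 128-byte blocks plus a 44-byte final block
```

As in HDFS, only the final block may be smaller than the block size; replication would then place copies of each `blk_*` file on several DataNodes.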
Write flow: the client sends a request to the NameNode; the client then requests connections to the DataNodes; the client writes the data to the DataNodes
Read flow
MapReduce
Introduction
MapReduce: a method for extracting and analyzing elements from massive source data and returning a result set. The core of the framework has two phases: Map and Reduce. Map splits the big data into small pieces and computes over them, handing its results to Reduce via the shuffle; Reduce then aggregates the Map results.
Example:
Sum: 1 + 5 + 7 + 3 + 4 + 9 + 3 + 5 + 6 = ?
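The split-then-combine can be sketched in plain shell; the `sum_chunk` helper is invented here and plays the role of a single Map task:

```shell
# map: three "tasks", each summing one chunk of the input
sum_chunk() { s=0; for n in "$@"; do s=$((s + n)); done; echo "$s"; }
p1=$(sum_chunk 1 5 7)     # partial sum: 13
p2=$(sum_chunk 3 4 9)     # partial sum: 16
p3=$(sum_chunk 3 5 6)     # partial sum: 14
# reduce: combine the partial sums
echo $((p1 + p2 + p3))    # prints 43
```

Each chunk can be summed on a different node; only the small partial results travel to the reducer.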
Workflow
When a compute job is submitted to the MapReduce framework, the framework first splits it into a number of Map tasks and assigns them to different nodes (DataNodes). Each Map task processes a portion of the input data; when it completes, it produces intermediate files that become the input of the Reduce tasks. The goal of a Reduce task is to aggregate the output of the preceding Map tasks into a single result.
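The map → shuffle → reduce flow above can be mimicked with a Unix pipeline — a sketch of the idea, not how Hadoop itself is implemented: `tr` plays Map, `sort` plays the shuffle, and `uniq -c` plays Reduce.

```shell
# map: emit one word (key) per line; shuffle: sort identical keys next to each other;
# reduce: count each group of identical keys
printf 'zhangsan lisi\nzhangsan jack\n' | tr -s ' ' '\n' | sort | uniq -c
# prints counts: 1 jack, 1 lisi, 2 zhangsan
```

This is exactly the shape of the wordcount example used later in this document to verify the cluster.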
YARN
YARN: Yet Another Resource Negotiator
Main functions: job scheduling and cluster resource management, allowing the Hadoop platform's performance and scalability to be exploited more fully
YARN is a subproject added in Hadoop 2.0. It addresses the shortcomings of Hadoop 1.0 (MRv1): poor scalability and reliability, low resource utilization, and no support for compute frameworks other than MapReduce. In Hadoop's next-generation compute framework, MRv2, resource management was abstracted into a general-purpose system — YARN. In other words, YARN was split out of MapReduce.
Benefits: less idle cluster capacity and better resource utilization; lower maintenance cost; data sharing, avoiding data movement between clusters.
Master/slave architecture
ResourceManager: resource management (master)
Performs unified resource management and task scheduling across the NodeManagers
NodeManager: node management (slave)
Runs on each compute node; receives compute tasks from the RM's ApplicationsManager, starts/stops tasks, reports to and negotiates resources with the RM's Scheduler, and monitors and reports the state of its own node
Hadoop deployment
Common deployment modes
Standalone
Standalone (local) mode is Hadoop's default deployment mode. With empty configuration files, Hadoop runs entirely locally: it does not need to interact with other nodes, does not use HDFS, and loads none of the Hadoop daemons. This mode is mainly used to develop and debug the application logic of MapReduce programs.
Pseudo-distributed
The Hadoop daemons all run on the local machine, simulating a small cluster. On top of standalone mode this adds code-debugging capability, letting you inspect memory usage, HDFS input/output, and interaction with the other daemons.
Fully distributed
Introduction
Standalone and pseudo-distributed deployments are only for test environments; production requires a fully distributed deployment. A fully distributed deployment really uses multiple Linux hosts: the cluster is planned so that the various Hadoop components run on different machines. Because the NameNode and ResourceManager follow a one-master, multi-slave pattern, they must be made highly available.
NameNode HA: failover
Under one NameService there are two NameNodes, one Active and one Standby. ZooKeeper coordinates the election, ensuring only one NameNode is active at a time. As soon as the Active node goes down, the Standby is switched to Active.
The ZKFailoverController acts as a client of the ZooKeeper (ZK) ensemble and monitors the NameNode's state. Every node that runs a NameNode must also run a ZKFC (ZKFailoverController).
ZKFC functions
- Health monitoring: the ZKFC periodically issues health-check commands to its local NameNode; if the NameNode responds correctly it is considered healthy, otherwise it is marked as failed
- ZooKeeper session management: while the local NameNode is healthy, the ZKFC holds a session in ZooKeeper. If the local NameNode is also the active one, the ZKFC additionally holds an ephemeral znode as a lock; if the local NameNode fails, that znode is deleted automatically
- ZooKeeper-based election: if the local NameNode is healthy and the ZKFC sees that no other NameNode holds the exclusive lock, it tries to acquire the lock. On success it performs the failover, and its NameNode becomes the active one.
NameNode HA: data sharing
The NameNode maintains two main files: the fsimage and the editlog.
- fsimage holds the latest metadata checkpoint and contains information on every directory and file in the HDFS file system — for files, the block descriptions, modification time, access time, and so on; for directories, the modification time and access-control information (owning user and group), and so on.
- editlog records the updates made to HDFS while the NameNode is running; every write an HDFS client performs is recorded in the editlog. The editlog is stored on the JournalNodes, and the Standby NameNode reads it from the JournalNodes to keep itself in sync.
Single-node deployment
Obtaining the software
Official site: https://hadoop.apache.org/
Prepare the Java environment
[root@server5 ~]# java -version
openjdk version "1.8.0_312"
OpenJDK Runtime Environment (build 1.8.0_312-b07)
OpenJDK 64-Bit Server VM (build 25.312-b07, mixed mode)
Replace the JDK (downloaded in advance)
[root@server5 ~]# tar xf jdk-8u191-linux-x64.tar.gz
[root@server5 ~]# mv jdk1.8.0_191/ /usr/local/jdk
Add it to the environment variables (the order matters)
[root@server5 ~]# vim /etc/profile
[root@server5 ~]# tail -2 /etc/profile
export JAVA_HOME=/usr/local/jdk
export PATH=${JAVA_HOME}/bin:$PATH
[root@server5 ~]# source /etc/profile
[root@server5 ~]# java -version
java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
Install Hadoop
[root@server5 ~]# rz
[root@server5 ~]# tar xf hadoop-2.8.5.tar.gz
[root@server5 ~]# mv hadoop-2.8.5 /opt/
Add it to the environment variables:
[root@server5 ~]# echo 'PATH=$PATH:/opt/hadoop-2.8.5/bin' >> /etc/profile
[root@server5 ~]# source /etc/profile
Word count: verify that it works
Prepare a file:
[root@server5 ~]# mkdir /tmp/input
[root@server5 ~]# vim /tmp/input/test.txt
[root@server5 ~]# cat /tmp/input/test.txt
zhangsan lisi zhangsan 192.168.139.10 lisi 192.168.139.20 zhangsan 192.168.139.10 jack 192.168.139.30
Run the word count:
[root@server5 ~]# hadoop jar /opt/hadoop-2.8.5/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.5.jar wordcount /tmp/input /tmp/output
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
Check the result:
[root@server5 ~]# ls /tmp/output/
part-r-00000  _SUCCESS
# _SUCCESS only indicates the job succeeded; part-r-00000 holds the output
[root@server5 ~]# cat /tmp/output/part-r-00000
192.168.139.10  2
192.168.139.20  1
192.168.139.30  1
jack    1
lisi    2
zhangsan        3
Pseudo-distributed deployment
This builds on the standalone deployment; for those steps, see the standalone section above. Set the Java JDK path
[root@server5 ~]# cd /opt/hadoop-2.8.5/etc/
[root@server5 etc]# cd hadoop/
[root@server5 hadoop]# ls
capacity-scheduler.xml      hadoop-policy.xml        kms-log4j.properties        ssl-client.xml.example
configuration.xsl           hdfs-site.xml            kms-site.xml                ssl-server.xml.example
container-executor.cfg      httpfs-env.sh            log4j.properties            yarn-env.cmd
core-site.xml               httpfs-log4j.properties  mapred-env.cmd              yarn-env.sh
hadoop-env.cmd              httpfs-signature.secret  mapred-env.sh               yarn-site.xml
hadoop-env.sh               httpfs-site.xml          mapred-queues.xml.template
hadoop-metrics2.properties  kms-acls.xml             mapred-site.xml.template
hadoop-metrics.properties   kms-env.sh               slaves
Set JAVA_HOME:
[root@server5 hadoop]# vim hadoop-env.sh
[root@server5 hadoop]# grep -Ev '^#|^$' hadoop-env.sh | head -1
export JAVA_HOME=/usr/local/jdk
[root@server5 hadoop]# vim mapred-env.sh
[root@server5 hadoop]# grep -Ev '^#|^$' mapred-env.sh | head -1
export JAVA_HOME=/usr/local/jdk
[root@server5 hadoop]# vim yarn-env.sh
[root@server5 hadoop]# grep -Ev '#|^$' yarn-env.sh | grep 'export JAVA_HOME'
export JAVA_HOME=/usr/local/jdk
Configure the default file system and the NameNode data directory
[root@server5 hadoop]# echo '192.168.139.50 hd1' >> /etc/hosts
[root@server5 hadoop]# vim core-site.xml
[root@server5 hadoop]# tail -12 core-site.xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hd1:8020</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/data/tmp</value>
    </property>
</configuration>
Configure the number of replicas
[root@server5 hadoop]# vim hdfs-site.xml
[root@server5 hadoop]# tail -7 hdfs-site.xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
Format HDFS
[root@server5 hadoop]# hdfs namenode -format
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hd1/192.168.139.50
************************************************************/
[root@server5 hadoop]# ls /opt/data/tmp/
dfs
[root@server5 hadoop]# ls /opt/data/tmp/dfs/name/current/
fsimage_0000000000000000000  fsimage_0000000000000000000.md5  seen_txid  VERSION
Start the daemons
Add /opt/hadoop-2.8.5/sbin to PATH:
[root@server5 ~]# vim /etc/profile
[root@server5 ~]# tail -1 /etc/profile
PATH=$PATH:/opt/hadoop-2.8.5/sbin
[root@server5 ~]# . /etc/profile
Start the daemons:
[root@server5 ~]# hadoop-daemon.sh start namenode
starting namenode, logging to /opt/hadoop-2.8.5/logs/hadoop-root-namenode-server5.out
[root@server5 ~]# hadoop-daemon.sh start datanode
starting datanode, logging to /opt/hadoop-2.8.5/logs/hadoop-root-datanode-server5.out
Check the Java processes to confirm they started:
[root@server5 ~]# jps
5072 DataNode
5157 Jps
4941 NameNode
Test HDFS file operations
[root@server5 ~]# hdfs dfs --help
--help: Unknown command
Usage: hadoop fs [generic options]
        [-appendToFile ... ]
        [-cat [-ignoreCrc] ...]
        [-checksum ...]
        [-chgrp [-R] GROUP PATH...]
        [-chmod [-R] PATH...]
        [-chown [-R] [OWNER][:[GROUP]] PATH...]
        [-copyFromLocal [-f] [-p] [-l] [-d] ... ]
        [-copyToLocal [-f] [-p] [-ignoreCrc] [-crc] ... ]
        [-count [-q] [-h] [-v] [-t [ ]] [-u] [-x] ...]
        [-cp [-f] [-p | -p[topax]] [-d] ... ]
        [-createSnapshot [ ]]
        [-deleteSnapshot ]
        [-df [-h] [ ...]]
        [-du [-s] [-h] [-x] ...]
        [-expunge]
        [-find ... ...]
        [-get [-f] [-p] [-ignoreCrc] [-crc] ... ]
        [-getfacl [-R] ]
        [-getfattr [-R] {-n name | -d} [-e en] ]
        [-getmerge [-nl] [-skip-empty-file] ]
        [-help [cmd ...]]
        [-ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [ ...]]
        [-mkdir [-p] ...]
        [-moveFromLocal ... ]
        [-moveToLocal ]
        [-mv ... ]
        [-put [-f] [-p] [-l] [-d] ... ]
        [-renameSnapshot ]
        [-rm [-f] [-r|-R] [-skipTrash] [-safely] ...]
        [-rmdir [--ignore-fail-on-non-empty] ...]
        [-setfacl [-R] [{-b|-k} {-m|-x } ]|[--set ]]
        [-setfattr {-n name [-v value] | -x name} ]
        [-setrep [-R] [-w] ...]
        [-stat [format] ...]
        [-tail [-f] ]
        [-test -[defsz] ]
        [-text [-ignoreCrc] ...]
        [-touchz ...]
        [-truncate [-w] ...]
        [-usage [cmd ...]]
Create a directory:
[root@server5 ~]# hdfs dfs -mkdir /test
List it:
[root@server5 ~]# hdfs dfs -ls /
Found 1 items
drwxr-xr-x   - root supergroup          0 2021-12-19 16:30 /test
Upload a file:
[root@server5 ~]# echo 'this is a test file' > 1.txt
[root@server5 ~]# hdfs dfs -put 1.txt /test
[root@server5 ~]# hdfs dfs -ls /test
Found 1 items
-rw-r--r--   1 root supergroup         20 2021-12-19 16:32 /test/1.txt
Read the file:
[root@server5 ~]# hdfs dfs -cat /test/1.txt
this is a test file
Download the file:
[root@server5 ~]# rm -rf 1.txt
[root@server5 ~]# hdfs dfs -get /test/1.txt
[root@server5 ~]# cat 1.txt
this is a test file
Configure YARN
[root@server5 ~]# cd /opt/hadoop-2.8.5/etc/hadoop/
[root@server5 hadoop]# cp mapred-site.xml.template mapred-site.xml
[root@server5 hadoop]# vim mapred-site.xml
[root@server5 hadoop]# tail -7 mapred-site.xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
[root@server5 hadoop]# vim yarn-site.xml
[root@server5 hadoop]# tail -12 yarn-site.xml
<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hd1</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
Start YARN
[root@server5 hadoop]# yarn-daemon.sh start resourcemanager
starting resourcemanager, logging to /opt/hadoop-2.8.5/logs/yarn-root-resourcemanager-server5.out
[root@server5 hadoop]# yarn-daemon.sh start nodemanager
starting nodemanager, logging to /opt/hadoop-2.8.5/logs/yarn-root-nodemanager-server5.out
[root@server5 hadoop]# jps
5072 DataNode
8565 ResourceManager
8392 NodeManager
4941 NameNode
8781 Jps
Word count: verify
[root@server5 hadoop]# hadoop jar /opt/hadoop-2.8.5/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.5.jar wordcount /test/2.txt /output/00
21/12/19 20:42:49 INFO mapreduce.Job: Counters: 13
        Job Counters
                Failed map tasks=4
                Killed reduce tasks=1
                Launched map tasks=4
                Other local map tasks=3
                Data-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=13
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=13
                Total time spent by all reduce tasks (ms)=0
                Total vcore-milliseconds taken by all map tasks=13
                Total vcore-milliseconds taken by all reduce tasks=0
                Total megabyte-milliseconds taken by all map tasks=13312
                Total megabyte-milliseconds taken by all reduce tasks=0
# Error message:
Container launch failed for container_1639916924136_0003_01_000002 : org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The auxService:mapreduce_shuffle does not exist
Error: Container launch failed for container_1639916924136_0003_01_000002 : org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The auxService:mapreduce_shuffle does not exist
Cause: yarn-site.xml was written incorrectly
Fix: as follows
Edit yarn-site.xml again
[root@server5 hadoop]# vim yarn-site.xml
[root@server5 hadoop]# tail -12 yarn-site.xml
<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hd1</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
Restart Hadoop
yarn-daemon.sh stop resourcemanager
yarn-daemon.sh stop nodemanager
hadoop-daemon.sh stop namenode
hadoop-daemon.sh stop datanode
hadoop-daemon.sh start namenode
hadoop-daemon.sh start datanode
yarn-daemon.sh start resourcemanager
yarn-daemon.sh start nodemanager
Remove the empty output left by the failed run:
[root@server5 hadoop]# hdfs dfs -rmdir /output/00
[root@server5 hadoop]# hdfs dfs -rmdir /output
Run the job again:
[root@server5 hadoop]# hadoop jar /opt/hadoop-2.8.5/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.5.jar wordcount /test/2.txt /output/00
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
Check the result:
[root@server5 hadoop]# hdfs dfs -ls /output/00
Found 2 items
-rw-r--r--   1 root supergroup          0 2021-12-20 14:18 /output/00/_SUCCESS
-rw-r--r--   1 root supergroup         47 2021-12-20 14:18 /output/00/part-r-00000
[root@server5 hadoop]# hdfs dfs -cat /output/00/part-r-00000
168     1
186     2
192.168.139.10  1
lisi    2
zhangsan        1
Add a hosts entry on Windows in C:\Windows\System32\drivers\etc\hosts, then browse to 192.168.139.50:8088
Enable the job history server
[root@server5 hadoop]# mr-jobhistory-daemon.sh start historyserver
starting historyserver, logging to /opt/hadoop-2.8.5/logs/mapred-root-historyserver-server5.out
[root@server5 hadoop]# jps
8177 Jps
6665 NodeManager
8137 JobHistoryServer
6363 ResourceManager
6157 NameNode
6255 DataNode
View the history server in a browser: hd1:19888
Fully distributed deployment
| Hostname | IP address | Role | Notes |
|---|---|---|---|
| hd1 | 192.168.139.10 | NameNode | runs ZKFC |
| hd2 | 192.168.139.20 | NameNode | runs ZKFC |
| hd3 | 192.168.139.30 | ResourceManager | |
| hd4 | 192.168.139.40 | DataNode / NodeManager / JournalNode | ZooKeeper installed |
| hd5 | 192.168.139.50 | DataNode / NodeManager / JournalNode | ZooKeeper installed |
| hd6 | 192.168.139.60 | DataNode / NodeManager / JournalNode | ZooKeeper installed |
Set the hostname:
hostnamectl set-hostname hd1
su
Configure a static IP:
vim /etc/sysconfig/network-scripts/ifcfg-ens33   # only UUID and IPADDR need changing per host
-----------------------------------------------------
TYPE="Ethernet"
PROXY_METHOD="none"
BROWSER_ONLY="no"
BOOTPROTO="static"
DEFROUTE="yes"
IPV4_FAILURE_FATAL="no"
IPV6INIT="yes"
IPV6_AUTOCONF="yes"
IPV6_DEFROUTE="yes"
IPV6_FAILURE_FATAL="no"
IPV6_ADDR_GEN_MODE="stable-privacy"
NAME="ens33"
DEVICE="ens33"
ONBOOT="yes"
IPADDR=192.168.139.10
GATEWAY=192.168.139.2
NETMASK=255.255.255.0
DNS1=114.114.114.114
-----------------------------------------------------
systemctl restart network
Name resolution:
cat >> /etc/hosts << EOF
192.168.139.10 hd1
192.168.139.20 hd2
192.168.139.30 hd3
192.168.139.40 hd4
192.168.139.50 hd5
192.168.139.60 hd6
EOF
JDK deployment:
yum install -y lrzsz
rz
tar xf jdk-8u191-linux-x64.tar.gz
mv jdk1.8.0_191/ /usr/local/jdk
cat >> /etc/profile << EOF
export JAVA_HOME=/usr/local/jdk
export PATH=${JAVA_HOME}/bin:$PATH
EOF
source /etc/profile
java -version
ZooKeeper deployment (nodes 4, 5, and 6)
Install ZooKeeper on node 4:
[root@hd4 ~]# rz
[root@hd4 ~]# tar xf zookeeper-3.4.14.tar.gz
[root@hd4 ~]# mv zookeeper-3.4.14 /usr/local/zookeeper
[root@hd4 ~]# cd /usr/local/zookeeper/conf/
Edit the config file zoo.cfg (this is the default name; renaming it may cause startup failures).
Each server gets a different myid; here server.1 has myid 1, server.2 has myid 2, server.3 has myid 3.
Port 2888: followers connect to the leader on this port
Port 3888: leader-election port
[root@hd4 conf]# cp zoo_sample.cfg zoo.cfg
[root@hd4 conf]# vim zoo.cfg
[root@hd4 conf]# cat zoo.cfg | grep -v "#"
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/opt/data
clientPort=2181
server.1=hd4:2888:3888
server.2=hd5:2888:3888
server.3=hd6:2888:3888
Copy the ZooKeeper configuration to nodes 5 and 6:
[root@hd4 conf]# scp -r /usr/local/zookeeper/ hd5:/usr/local/
[root@hd4 conf]# scp -r /usr/local/zookeeper/ hd6:/usr/local/
Create the dataDir and the myid files:
mkdir /opt/data
[root@hd4 ~]# echo 1 > /opt/data/myid
[root@hd5 ~]# echo 2 > /opt/data/myid
[root@hd6 ~]# echo 3 > /opt/data/myid
Add /usr/local/zookeeper/bin to PATH:
echo 'PATH=$PATH:/usr/local/zookeeper/bin' >> /etc/profile
source /etc/profile
Start ZooKeeper:
zkServer.sh start
Check the status (all three nodes must be started first):
[root@hd4 conf]# zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /usr/local/zookeeper/bin/../conf/zoo.cfg
Mode: follower
[root@hd5 ~]# zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /usr/local/zookeeper/bin/../conf/zoo.cfg
Mode: leader
[root@hd6 ~]# zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /usr/local/zookeeper/bin/../conf/zoo.cfg
Mode: follower
Hadoop deployment
Set JAVA_HOME in the relevant config files
[root@hd1 ~]# cd /opt/hadoop/etc/hadoop/
[root@hd1 hadoop]# vim hadoop-env.sh
25 export JAVA_HOME=/usr/local/jdk
[root@hd1 hadoop]# vim mapred-env.sh
16 export JAVA_HOME=/usr/local/jdk
[root@hd1 hadoop]# vim yarn-env.sh
23 export JAVA_HOME=/usr/local/jdk
Edit the configuration files
[root@hd1 hadoop]# vim core-site.xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://ns1</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/data/tmp</value>
    </property>
    <property>
        <name>ha.zookeeper.quorum</name>
        <value>hd4:2181,hd5:2181,hd6:2181</value>
    </property>
</configuration>
[root@hd1 hadoop]# vim hdfs-site.xml
<configuration>
    <property>
        <name>dfs.nameservices</name>
        <value>ns1</value>
    </property>
    <property>
        <name>dfs.ha.namenodes.ns1</name>
        <value>nn1,nn2</value>
    </property>
    <property>
        <name>dfs.namenode.rpc-address.ns1.nn1</name>
        <value>hd1:9000</value>
    </property>
    <property>
        <name>dfs.namenode.http-address.ns1.nn1</name>
        <value>hd1:50070</value>
    </property>
    <property>
        <name>dfs.namenode.rpc-address.ns1.nn2</name>
        <value>hd2:9000</value>
    </property>
    <property>
        <name>dfs.namenode.http-address.ns1.nn2</name>
        <value>hd2:50070</value>
    </property>
    <property>
        <name>dfs.namenode.shared.edits.dir</name>
        <value>qjournal://hd4:8485;hd5:8485;hd6:8485/ns1</value>
    </property>
    <property>
        <name>dfs.journalnode.edits.dir</name>
        <value>/opt/data/journal</value>
    </property>
    <property>
        <name>dfs.ha.automatic-failover.enabled</name>
        <value>true</value>
    </property>
    <property>
        <name>dfs.client.failover.proxy.provider.ns1</name>
        <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
    </property>
    <property>
        <name>dfs.ha.fencing.methods</name>
        <value>sshfence</value>
    </property>
    <property>
        <name>dfs.ha.fencing.ssh.private-key-files</name>
        <value>/root/.ssh/id_rsa</value>
    </property>
</configuration>
[root@hd1 hadoop]# vim slaves
[root@hd1 hadoop]# cat slaves
hd4
hd5
hd6
[root@hd1 hadoop]# cp mapred-site.xml.template mapred-site.xml
[root@hd1 hadoop]# vim mapred-site.xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
[root@hd1 hadoop]# vim yarn-site.xml
<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hd3</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
Copy the Hadoop package and configuration to the other nodes
[root@hd1 ~]# for i in hd{2..6}; do scp -r /opt/hadoop/ $i:/opt/; done
Add /opt/hadoop/bin and /opt/hadoop/sbin to PATH
echo 'PATH=$PATH:/opt/hadoop/bin' >> /etc/profile
echo 'PATH=$PATH:/opt/hadoop/sbin' >> /etc/profile
source /etc/profile
Start the cluster
Start ZooKeeper on nodes 4, 5, and 6:
[root@hd4 conf]# zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /usr/local/zookeeper/bin/../conf/zoo.cfg
Mode: follower
[root@hd5 ~]# zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /usr/local/zookeeper/bin/../conf/zoo.cfg
Mode: leader
[root@hd6 ~]# zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /usr/local/zookeeper/bin/../conf/zoo.cfg
Mode: follower
Start the JournalNodes:
[root@hd1 ~]# hadoop-daemons.sh start journalnode
hd4: starting journalnode, logging to /opt/hadoop/logs/hadoop-root-journalnode-hd4.out
hd5: starting journalnode, logging to /opt/hadoop/logs/hadoop-root-journalnode-hd5.out
hd6: starting journalnode, logging to /opt/hadoop/logs/hadoop-root-journalnode-hd6.out
[root@hd4 ~]# ls /opt/data/journal
[root@hd4 ~]# jps
2083 QuorumPeerMain
5444 Jps
5383 JournalNode
Format the NameNode:
[root@hd1 ~]# hdfs namenode -format
[root@hd1 ~]# scp -r /opt/data/ hd2:/opt/
Format the ZKFC:
[root@hd1 ~]# hdfs zkfc -formatZK
Start HDFS:
[root@hd1 ~]# start-dfs.sh
Starting namenodes on [hd1 hd2]
hd1: starting namenode, logging to /opt/hadoop/logs/hadoop-root-namenode-hd1.out
hd4: starting datanode, logging to /opt/hadoop/logs/hadoop-root-datanode-hd4.out
hd6: starting datanode, logging to /opt/hadoop/logs/hadoop-root-datanode-hd6.out
hd5: starting datanode, logging to /opt/hadoop/logs/hadoop-root-datanode-hd5.out
Starting journal nodes [hd4 hd5 hd6]
hd5: journalnode running as process 50021. Stop it first.
hd6: journalnode running as process 5477. Stop it first.
hd4: journalnode running as process 5383. Stop it first.
Starting ZK Failover Controllers on NN hosts [hd1 hd2]
hd1: starting zkfc, logging to /opt/hadoop/logs/hadoop-root-zkfc-hd1.out
hd2: starting zkfc, logging to /opt/hadoop/logs/hadoop-root-zkfc-hd2.out
Start YARN:
[root@hd3 ~]# start-yarn.sh
Check:
[root@hd1 ~]# jps
5374 NameNode
5678 DFSZKFailoverController
5774 Jps
[root@hd2 ~]# jps
5638 NameNode
5737 DFSZKFailoverController
5850 Jps
[root@hd3 ~]# jps
5587 Jps
5309 ResourceManager
[root@hd4 ~]# jps
2083 QuorumPeerMain
5619 NodeManager
5749 Jps
5383 JournalNode
5479 DataNode
[root@hd5 ~]# jps
50162 DataNode
50339 NodeManager
47301 QuorumPeerMain
50021 JournalNode
50476 Jps
[root@hd6 ~]# jps
5552 DataNode
5682 NodeManager
5477 JournalNode
5799 Jps
5179 QuorumPeerMain
Verify the cluster works: word count
[root@hd1 ~]# vim 1.txt
[root@hd1 ~]# hdfs dfs -mkdir /input
[root@hd1 ~]# hdfs dfs -put 1.txt /input
[root@hd1 ~]# hdfs dfs -ls /input
Found 1 items
-rw-r--r--   3 root supergroup        219 2021-12-21 23:02 /input/1.txt
[root@hd1 ~]# yarn jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.5.jar wordcount /input /output/00
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
[root@hd1 ~]# hdfs dfs -ls /output/00
Found 2 items
-rw-r--r--   3 root supergroup          0 2021-12-21 23:05 /output/00/_SUCCESS
-rw-r--r--   3 root supergroup        237 2021-12-21 23:05 /output/00/part-r-00000
[root@hd1 ~]# hdfs dfs -cat /output/00/part-r-00000
"ens33" 2
"no"    1
"stable-privacy"        1
"yes"   3
114.114.114.114 1
192.168.139.10  1
192.168.139.2   1
255.255.255.0   1
CONF    1
DEVICE  1
DNS1    1
GATEWAY 1
IPADDR  1
IPV6_ADDR_GEN_MODE      1
IPV6_DEFROUTE   1
IPV6_FAILURE_FATAL      1
NAME    1
NETMASK 1
ONBOOT  1
Access from Windows
Add host entries on Windows in C:\Windows\System32\drivers\etc\hosts:
192.168.139.10 hd1
192.168.139.20 hd2
192.168.139.30 hd3
192.168.139.40 hd4
192.168.139.50 hd5
192.168.139.60 hd6
Check the NameNode status:
hd1:50070
hd2:50070
Check the YARN status:
hd3:8088
Automated Hadoop deployment with Ambari
Ambari introduction
The Apache Ambari project aims to simplify Hadoop management by developing software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs.
Features
Provision a Hadoop cluster
- Ambari provides a step-by-step wizard for installing Hadoop services across any number of hosts
- Ambari handles the Hadoop service configuration for the cluster
Manage a Hadoop cluster
Monitor a Hadoop cluster
- Ambari provides a dashboard for monitoring the health and status of the Hadoop cluster
- Ambari uses the Ambari Metrics System for metrics collection
- Ambari uses the Ambari Alert Framework for system alerting and notifies you when your attention is needed (e.g. a node goes down, remaining disk space is low)
Integrate Hadoop provisioning, management, and monitoring into your own applications with the Ambari REST APIs
Ambari itself is also distributed software, made up of two main parts: the Ambari Server and the Ambari Agents
Through the Ambari Server the user tells the Ambari Agents which software to install. The Agents periodically report the state of each software module on their machines back to the Server, and these states are shown in Ambari's GUI, so the user can see the status of every component in the cluster and plan maintenance accordingly.
The Ambari Agents actually deploy the Hadoop cluster; the Ambari Server manages and monitors it
Deployment documentation: https://docs.cloudera.com/HDPdocuments/Ambari-2.6.1.5/bk_ambari-installation/content/determine_product_interop.html
Environment preparation
| Hostname | IP address | Role | Notes |
|---|---|---|---|
| hd1 | 192.168.139.10 | agent | runs ZKFC |
| hd2 | 192.168.139.20 | agent | runs ZKFC |
| hd3 | 192.168.139.30 | agent | |
| hd4 | 192.168.139.40 | agent | ZooKeeper installed |
| hd5 | 192.168.139.50 | agent | ZooKeeper installed |
| hd6 | 192.168.139.60 | agent | ZooKeeper installed |
| ambari_server | 192.168.139.70 | server | |
Set the hostname:
hostnamectl set-hostname ambari_server
su
Configure a static IP:
vim /etc/sysconfig/network-scripts/ifcfg-ens33   # only UUID and IPADDR need changing
-----------------------------------------------------
TYPE="Ethernet"
PROXY_METHOD="none"
BROWSER_ONLY="no"
BOOTPROTO="static"
DEFROUTE="yes"
IPV4_FAILURE_FATAL="no"
IPV6INIT="yes"
IPV6_AUTOCONF="yes"
IPV6_DEFROUTE="yes"
IPV6_FAILURE_FATAL="no"
IPV6_ADDR_GEN_MODE="stable-privacy"
NAME="ens33"
DEVICE="ens33"
ONBOOT="yes"
IPADDR=192.168.139.70
GATEWAY=192.168.139.2
NETMASK=255.255.255.0
DNS1=114.114.114.114
-----------------------------------------------------
systemctl restart network
Name resolution:
cat >> /etc/hosts << EOF
192.168.139.10 hd1
192.168.139.20 hd2
192.168.139.30 hd3
192.168.139.40 hd4
192.168.139.50 hd5
192.168.139.60 hd6
192.168.139.70 ambari_server
EOF
JDK deployment:
[root@ambari_server ~]# yum install -y lrzsz
[root@ambari_server ~]# rz
[root@ambari_server ~]# tar xf jdk-8u191-linux-x64.tar.gz
[root@ambari_server ~]# mv jdk1.8.0_191/ /usr/local/jdk
[root@ambari_server ~]# for i in hd{1..6}; do scp -r /usr/local/jdk $i:/usr/local/; done
[root@ambari_server ~]# cat >> /etc/profile << EOF
export JAVA_HOME=/usr/local/jdk
export PATH=${JAVA_HOME}/bin:$PATH
EOF
[root@ambari_server ~]# source /etc/profile
[root@ambari_server ~]# java -version
[root@ambari_server ~]# for i in hd{1..6}; do scp -r /etc/profile $i:/etc/profile; done
On each of the six agent nodes:
source /etc/profile
java -version
Database deployment (ambari_server node)
Install MariaDB:
[root@ambari_server ~]# yum install -y mariadb mariadb-server.x86_64
[root@ambari_server ~]# systemctl start mariadb.service
[root@ambari_server ~]# systemctl enable mariadb.service
[root@ambari_server ~]# mysqladmin -uroot password 123456
Create the database and grant privileges:
[root@ambari_server ~]# mysql -uroot -p123456
MariaDB [(none)]> create database ambari character set utf8;
MariaDB [(none)]> use ambari
MariaDB [ambari]> grant all on ambari.* to 'ambari'@'ambari_server' identified by '123456';
MariaDB [ambari]> grant all on ambari.* to 'ambari'@'%' identified by '123456';   # '%' means every host except the local one
MariaDB [(none)]> flush privileges;
Verify the grant:
[root@ambari_server ~]# mysql -h ambari_server -uambari -p123456
ERROR 1045 (28000): Access denied for user 'ambari'@'ambari_server' (using password: YES)
Error: ERROR 1045 (28000): Access denied for user 'ambari'@'ambari_server' (using password: YES)
Solution:
Log in to the mysql database:
[root@ambari_server ~]# mysql -uroot -p123456
MariaDB [(none)]> use mysql
MariaDB [mysql]> select host,user,password from user;
+----------------+--------+-------------------------------------------+
| host           | user   | password                                  |
+----------------+--------+-------------------------------------------+
| localhost      | root   | *6BB4837EB74329105EE4568DDA7DC67ED2CA2AD9 |
| ambari_server  | root   |                                           |
| 127.0.0.1      | root   |                                           |
| ::1            | root   |                                           |
| localhost      |        |                                           |
| ambari_server  |        |                                           |
| ambari_server  | ambari | *6BB4837EB74329105EE4568DDA7DC67ED2CA2AD9 |
| %              | ambari | *6BB4837EB74329105EE4568DDA7DC67ED2CA2AD9 |
+----------------+--------+-------------------------------------------+
Delete the rows in the user table whose user column is empty (the anonymous accounts):
MariaDB [mysql]> delete from user where user='';
MariaDB [mysql]> select host,user,password from user;
+----------------+--------+-------------------------------------------+
| host           | user   | password                                  |
+----------------+--------+-------------------------------------------+
| localhost      | root   | *6BB4837EB74329105EE4568DDA7DC67ED2CA2AD9 |
| ambari_server  | root   |                                           |
| 127.0.0.1      | root   |                                           |
| ::1            | root   |                                           |
| ambari_server  | ambari | *6BB4837EB74329105EE4568DDA7DC67ED2CA2AD9 |
| %              | ambari | *6BB4837EB74329105EE4568DDA7DC67ED2CA2AD9 |
+----------------+--------+-------------------------------------------+
Reload the grant tables:
MariaDB [(none)]> flush privileges;
MariaDB [(none)]> exit
The login now succeeds:
[root@ambari_server ~]# mysql -h ambari_server -uambari -p123456
MariaDB [(none)]>
Java–MySQL connector:
[root@ambari_server ~]# yum install -y mysql-connector-java.noarch
Local yum repository
Install httpd:
[root@ambari_server ~]# yum install -y httpd
Upload the Ambari packages:
[root@ambari_server ~]# cd /var/www/html/
[root@ambari_server ~]# rz
[root@ambari_server ~]# ls
ambari.zip HDP.zip HDP-UTIL.zip
[root@ambari_server html]# unzip ambari.zip
[root@ambari_server html]# unzip HDP.zip
[root@ambari_server html]# unzip HDP-UTIL.zip
Create the yum repo files:
[root@localhost ~]# vim /etc/yum.repos.d/ambari.repo
[root@ambari_server yum.repos.d]# cat ambari.repo
[ambari-2.6.1.5]
name=ambari Version - ambari-2.6.1.5
baseurl=http://ambari_server/ambari
gpgcheck=0
enabled=1
priority=1
[root@ambari_server yum.repos.d]# vim hdp.repo
[root@ambari_server yum.repos.d]# cat hdp.repo
#VERSION_NUMBER=2.6.1.0-129
[HDP-2.6.1.0]
name=HDP Version - HDP-2.6.1.0
baseurl=http://ambari_server/HDP
gpgcheck=1
gpgkey=http://ambari_server/HDP/RPM-GPG-KEY/RPM-GPG-KEY-Jenkins
enabled=1
priority=1
# the HDP-UTIL repo is still missing here; no package source has been found for it yet
Start httpd:
[root@ambari_server ~]# systemctl start httpd
[root@ambari_server ~]# systemctl enable httpd
Verify:
[root@ambari_server ~]# yum clean all
[root@ambari_server ~]# yum makecache
[root@ambari_server ~]# yum repolist
Install Ambari
[root@ambari_server ~]# cd /var/www/html
[root@ambari_server html]# rz   # upload the ambari-server and ambari-agent packages
[root@ambari_server html]# ls
ambari-agent-2.5.1.0-159.x86_64.rpm ambari-server-2.5.1.0-159.x86_64.rpm
[root@ambari_server html]# yum install -y /var/www/html/ambari-server-2.5.1.0-159.x86_64.rpm
[root@ambari_server html]# for i in hd{1..6}
do
scp -r ambari-agent-2.5.1.0-159.x86_64.rpm $i:/root
done
Install ambari-agent on every agent node:
yum install -y /root/ambari-agent-2.5.1.0-159.x86_64.rpm
Initialize the Ambari Server
Ambari's schema file:
[root@ambari_server ~]# ls /var/lib/ambari-server/resources/Ambari-DDL-MySQL-CREATE.sql
/var/lib/ambari-server/resources/Ambari-DDL-MySQL-CREATE.sql
Import the schema into the database:
[root@ambari_server ~]# mysql -h ambari_server -uambari -p123456
MariaDB [(none)]> use ambari
MariaDB [ambari]> source /var/lib/ambari-server/resources/Ambari-DDL-MySQL-CREATE.sql
MariaDB [ambari]> show tables;
+-------------------------------+
| Tables_in_ambari              |
+-------------------------------+
| ClusterHostMapping            |
| QRTZ_BLOB_TRIGGERS            |
| QRTZ_CALENDARS                |
| QRTZ_CRON_TRIGGERS            |
...
| widget                        |
| widget_layout                 |
| widget_layout_user_widget     |
+-------------------------------+
105 rows in set (0.00 sec)
Initialize:
[root@ambari_server ~]# ambari-server setup
Using python /usr/bin/python
Setup ambari-server
Checking SELinux...
SELinux status is 'disabled'
Customize user account for ambari-server daemon [y/n] (n)? y
Enter user account for ambari-server daemon (root):root
Adjusting ambari-server permissions and ownership...
Checking firewall status...
Checking JDK...
[1] Oracle JDK 1.8 + Java Cryptography Extension (JCE) Policy Files 8
[2] Oracle JDK 1.7 + Java Cryptography Extension (JCE) Policy Files 7
[3] Custom JDK
==============================================================================
Enter choice (1): 3        # choose Custom JDK
WARNING: JDK must be installed on all hosts and JAVA_HOME must be valid on all hosts.
WARNING: JCE Policy files are required for configuring Kerberos security. If you plan to use Kerberos, please make sure JCE Unlimited Strength Jurisdiction Policy Files are valid on all hosts.
Path to JAVA_HOME: /usr/local/jdk
Validating JDK on Ambari Server...done.
Completing setup...
Configuring database...
Enter advanced database configuration [y/n] (n)? y        # configure the database
Configuring database...
==============================================================================
Choose one of the following options:
[1] - PostgreSQL (Embedded)
[2] - Oracle
[3] - MySQL / MariaDB
[4] - PostgreSQL
[5] - Microsoft SQL Server (Tech Preview)
[6] - SQL Anywhere
[7] - BDB
==============================================================================
Enter choice (1): 3        # choose MySQL / MariaDB
Hostname (localhost): ambari_server
Invalid hostname.
Hostname (localhost): ambari-server.a.com        # the hostname must follow naming rules
Port (3306): 3306
Database name (ambari): ambari
Username (ambari): ambari
Enter Database Password (bigdata): 123456
Re-enter password: 123456
Configuring ambari database...
Configuring remote database connection properties...
WARNING: Before starting Ambari Server, you must run the following DDL against the database to create the schema: /var/lib/ambari-server/resources/Ambari-DDL-MySQL-CREATE.sql
Proceed with configuring remote database connection properties [y/n] (y)? y        # configure the remote connection
Extracting system views...
ambari-admin-2.5.1.0.159.jar
...........
Adjusting ambari-server permissions and ownership...
Ambari Server 'setup' completed successfully.
Start the Ambari Server:
[root@ambari_server ~]# ambari-server start
Using python /usr/bin/python
Starting ambari-server
Ambari Server running with administrator privileges.
Organizing resource files at /var/lib/ambari-server/resources...
Ambari database consistency check started...
Server PID at: /var/run/ambari-server/ambari-server.pid
Server out at: /var/log/ambari-server/ambari-server.out
Server log at: /var/log/ambari-server/ambari-server.log
Waiting for server start........................
Server started listening on 8080
DB configs consistency check: no errors and warnings were found.
Ambari Server 'start' completed successfully.
Enable at boot:
[root@ambari_server ~]# chkconfig --list | grep ambari
ambari-server   0:off 1:off 2:on 3:on 4:on 5:on 6:off
Browse to 192.168.139.70:8080; the username and password are both admin.