The NameNode process has died and its stored data has also been lost. How do we recover the NameNode?
6.1.2. Fault simulation
(1) kill -9 the NameNode process
[atguigu@hadoop102 current]$ kill -9 19886
(2) Delete the data stored by the NameNode (/opt/module/hadoop-3.1.3/data/dfs/name)
[atguigu@hadoop102 hadoop-3.1.3]$ rm -rf /opt/module/hadoop-3.1.3/data/dfs/name/*
6.1.3. Solution
(1) Copy the SecondaryNameNode's data into the original NameNode data directory
[atguigu@hadoop102 dfs]$ scp -r atguigu@hadoop104:/opt/module/hadoop-3.1.3/data/dfs/namesecondary/* ./name/
(2) Restart the NameNode
[atguigu@hadoop102 hadoop-3.1.3]$ hdfs --daemon start namenode
(3) Upload a file to the cluster; the NameNode is back to normal. A small amount of data may still be lost, so in production NameNode High Availability (HA) is normally used rather than relying on the SecondaryNameNode.
6.2. Cluster Safe Mode & Disk Repair
6.2.1. Safe mode: the file system accepts only read requests; it rejects change requests such as deletes and modifications
6.2.2. Scenarios that trigger safe mode: the NameNode is in safe mode while loading the fsimage and edit log, and while receiving DataNode registrations
6.2.3. Conditions for leaving safe mode
dfs.namenode.safemode.min.datanodes: minimum number of available DataNodes, default 0
dfs.namenode.safemode.threshold-pct: fraction of blocks whose replica count meets the minimum requirement, out of all blocks in the system, default 0.999f (i.e., only about one block in a thousand may be missing)
dfs.namenode.safemode.extension: stabilization time, default 30000 ms, i.e., 30 s
6.2.4. Basic syntax
While the cluster is in safe mode, important (write) operations cannot be performed. After cluster startup completes, safe mode is exited automatically.
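A minimal sketch of the exit condition these parameters imply (hypothetical helper function for illustration, not HDFS source code):

```python
# Sketch of the safe-mode exit check implied by the dfs.namenode.safemode.*
# parameters. Hypothetical helper for illustration only, not HDFS source code.
def can_leave_safemode(blocks_ok, blocks_total, live_datanodes,
                       threshold_pct=0.999, min_datanodes=0):
    """Return True when the NameNode may leave safe mode
    (ignoring the 30 s dfs.namenode.safemode.extension window)."""
    if live_datanodes < min_datanodes:
        return False
    if blocks_total == 0:          # empty namespace: nothing to wait for
        return True
    return blocks_ok / blocks_total >= threshold_pct

# 1000 blocks, 1 missing: 999/1000 = 0.999 >= 0.999 -> may leave
print(can_leave_safemode(999, 1000, 3))   # True
# 2 of 1000 blocks missing: 0.998 < 0.999 -> stays in safe mode
print(can_leave_safemode(998, 1000, 3))   # False
```

This is why, in the disk-repair case below, deleting just two blocks is enough to keep the cluster in safe mode.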
(1) bin/hdfs dfsadmin -safemode get    (check the safe mode state)
(2) bin/hdfs dfsadmin -safemode enter  (enter safe mode)
(3) bin/hdfs dfsadmin -safemode leave  (leave safe mode)
(4) bin/hdfs dfsadmin -safemode wait   (wait until safe mode is off)
6.2.5. Case 1: starting the cluster enters safe mode
(1) Restart the cluster
```shell
[atguigu@hadoop102 subdir0]$ myhadoop.sh stop
[atguigu@hadoop102 subdir0]$ myhadoop.sh start
```
(2) Immediately after the cluster starts, try to delete data on it; HDFS reports that the cluster is in safe mode
6.2.6. Case 2: disk repair
Requirement: data blocks are corrupted and the cluster has entered safe mode. How do we handle it?
(1) On hadoop102, hadoop103, and hadoop104, go to the /opt/module/hadoop-3.1.3/data/dfs/data/current/BP-1015489500-192.168.10.102-1611909480872/current/finalized/subdir0/subdir0 directory and delete the same 2 blocks on every node
[atguigu@hadoop102 subdir0]$ pwd
/opt/module/hadoop-3.1.3/data/dfs/data/current/BP-1015489500-192.168.10.102-1611909480872/current/finalized/subdir0/subdir0
[atguigu@hadoop102 subdir0]$ rm -rf blk_1073741847 blk_1073741847_1023.meta
[atguigu@hadoop102 subdir0]$ rm -rf blk_1073741865 blk_1073741865_1042.meta
Note: repeat the commands above on hadoop103 and hadoop104.
(2) Restart the cluster
[atguigu@hadoop102 subdir0]$ myhadoop.sh stop
[atguigu@hadoop102 subdir0]$ myhadoop.sh start
(3) Check http://hadoop102:9870/dfshealth.html#tab-overview
Note: safe mode is on because the number of available blocks has not reached the required threshold.
(4) Leave safe mode
[atguigu@hadoop102 subdir0]$ hdfs dfsadmin -safemode get
Safe mode is ON
[atguigu@hadoop102 subdir0]$ hdfs dfsadmin -safemode leave
Safe mode is OFF
(5) Check http://hadoop102:9870/dfshealth.html#tab-overview again
(6) Delete the metadata of the lost files
(7) Check http://hadoop102:9870/dfshealth.html#tab-overview; the cluster is back to normal
Requirement: simulate waiting for safe mode to turn off
(1) Check the current mode
[atguigu@hadoop102 hadoop-3.1.3]$ hdfs dfsadmin -safemode get
Safe mode is OFF
(2) First enter safe mode
[atguigu@hadoop102 hadoop-3.1.3]$ bin/hdfs dfsadmin -safemode enter
(3) Create and run the following script
Under /opt/module/hadoop-3.1.3, create a script called safemode.sh
[atguigu@hadoop102 hadoop-3.1.3]$ vim safemode.sh
#!/bin/bash
hdfs dfsadmin -safemode wait
hdfs dfs -put /opt/module/hadoop-3.1.3/README.txt /
[atguigu@hadoop102 hadoop-3.1.3]$ chmod 777 safemode.sh
[atguigu@hadoop102 hadoop-3.1.3]$ ./safemode.sh
(4) Open another terminal window and run
[atguigu@hadoop102 hadoop-3.1.3]$ bin/hdfs dfsadmin -safemode leave
(5) Go back to the previous window; it now prints
Safe mode is OFF
(6) The uploaded file is now on the HDFS cluster
6.3. Slow Disk Monitoring
A "slow disk" is a disk that writes data very slowly. Slow disks are actually not rare: as a machine runs for a long time and hosts more and more tasks, disk read/write performance naturally degrades, and in severe cases writes become noticeably delayed.
6.3.1. How to detect a slow disk?
Creating a directory on HDFS normally takes well under 1 s. If creating a directory occasionally takes a minute or more (not every time, just intermittently), a slow disk is very likely the cause. The following methods can locate which disk is slow:
6.3.1.1. Via the heartbeat last-contact time. A slow disk usually affects the heartbeat between the DataNode and the NameNode. The normal heartbeat interval is 3 s; anything over 3 s indicates a problem.
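The heartbeat check can be sketched like this (hypothetical monitoring snippet; in a real cluster the last-contact values come from the NameNode web UI's DataNode tab or JMX metrics, and the function name is my own):

```python
# Flag DataNodes whose last heartbeat is older than the normal 3 s interval.
# Hypothetical monitoring sketch; in practice the last-contact ages come from
# the NameNode web UI (http://hadoop102:9870) or its JMX metrics.
HEARTBEAT_INTERVAL_S = 3

def find_suspect_datanodes(last_contact_by_node, threshold_s=HEARTBEAT_INTERVAL_S):
    """Return nodes whose last-contact age exceeds the heartbeat interval."""
    return sorted(node for node, age in last_contact_by_node.items()
                  if age > threshold_s)

last_contact = {"hadoop102": 1, "hadoop103": 2, "hadoop104": 11}
print(find_suspect_datanodes(last_contact))   # ['hadoop104']
```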
6.3.1.2. Benchmark disk read/write performance with fio
(1) Sequential read test
[atguigu@hadoop102 ~]# sudo yum install -y fio
[atguigu@hadoop102 ~]# sudo fio -filename=/home/atguigu/test.log -direct=1 -iodepth 1 -thread -rw=read -ioengine=psync -bs=16k -size=2G -numjobs=10 -runtime=60 -group_reporting -name=test_r
Run status group 0 (all jobs):
   READ: bw=360MiB/s (378MB/s), 360MiB/s-360MiB/s (378MB/s-378MB/s), io=20.0GiB (21.5GB), run=56885-56885msec
The result shows an overall sequential read speed of 360 MiB/s.
(2) Sequential write test
[atguigu@hadoop102 ~]# sudo fio -filename=/home/atguigu/test.log -direct=1 -iodepth 1 -thread -rw=write -ioengine=psync -bs=16k -size=2G -numjobs=10 -runtime=60 -group_reporting -name=test_w
Run status group 0 (all jobs):
   WRITE: bw=341MiB/s (357MB/s), 341MiB/s-341MiB/s (357MB/s-357MB/s), io=19.0GiB (21.4GB), run=60001-60001msec
The result shows an overall sequential write speed of 341 MiB/s.
(3) Random write test
[atguigu@hadoop102 ~]# sudo fio -filename=/home/atguigu/test.log -direct=1 -iodepth 1 -thread -rw=randwrite -ioengine=psync -bs=16k -size=2G -numjobs=10 -runtime=60 -group_reporting -name=test_randw
Run status group 0 (all jobs):
   WRITE: bw=309MiB/s (324MB/s), 309MiB/s-309MiB/s (324MB/s-324MB/s), io=18.1GiB (19.4GB), run=60001-60001msec
The result shows an overall random write speed of 309 MiB/s.
(4) Mixed random read/write test:
[atguigu@hadoop102 ~]# sudo fio -filename=/home/atguigu/test.log -direct=1 -iodepth 1 -thread -rw=randrw -rwmixread=70 -ioengine=psync -bs=16k -size=2G -numjobs=10 -runtime=60 -group_reporting -name=test_r_w -ioscheduler=noop
Run status group 0 (all jobs):
   READ: bw=220MiB/s (231MB/s), 220MiB/s-220MiB/s (231MB/s-231MB/s), io=12.9GiB (13.9GB), run=60001-60001msec
   WRITE: bw=94.6MiB/s (99.2MB/s), 94.6MiB/s-94.6MiB/s (99.2MB/s-99.2MB/s), io=5674MiB (5950MB), run=60001-60001msec
The result shows mixed random read/write speeds of 220 MiB/s for reads and 94.6 MiB/s for writes.
6.4. Small File Archiving
6.4.1. Drawbacks of storing small files on HDFS
Every file is stored in blocks, and each block's metadata lives in the NameNode's memory, so storing many small files on HDFS is very inefficient: a large number of small files will consume most of the NameNode's memory.
Note, however, that the disk space needed to store a small file is unrelated to the block size. For example, a 1 MB file stored with a 128 MB block size uses 1 MB of disk space, not 128 MB.
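The point above can be sketched with simple arithmetic (illustrative helper, names are my own; assumes the default 128 MB block size):

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # default HDFS block size, 128 MB

def hdfs_usage(file_size_bytes, replication=1):
    """Disk space used and number of blocks for one file.
    Disk usage tracks the actual file size, not the block size."""
    blocks = max(1, math.ceil(file_size_bytes / BLOCK_SIZE))
    return file_size_bytes * replication, blocks

# A 1 MB file occupies 1 MB of disk per replica, not 128 MB,
# but it still costs a full block's worth of NameNode metadata.
disk, blocks = hdfs_usage(1 * 1024 * 1024)
print(disk, blocks)   # 1048576 1
```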
An HDFS archive file, or HAR file, is a more efficient file-archiving tool: it packs files into HDFS blocks, reducing NameNode memory usage while still allowing transparent access to the files. Concretely, a HAR file still appears as individual files to clients, but to the NameNode it is a single unit, which reduces the NameNode's memory footprint.
(1) Start the YARN processes (hadoop archive runs as a MapReduce job)
[atguigu@hadoop102 hadoop-3.1.3]$ start-yarn.sh
(2) Archive the files
Archive all files in the /input directory into a single archive called input.har, and store the archive under the /output path.
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop archive -archiveName input.har -p /input /output
(3) Inspect the archive
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -ls /output/input.har
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -ls har:///output/input.har
(4) Extract files from the archive
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -cp har:///output/input.har/* /
7. HDFS—Cluster Migration
7.1. Copying Data Between Apache Hadoop Clusters
1) Use scp to copy files between two remote hosts
scp -r hello.txt root@hadoop103:/user/atguigu/hello.txt    # push
scp -r root@hadoop103:/user/atguigu/hello.txt hello.txt    # pull
scp -r root@hadoop103:/user/atguigu/hello.txt root@hadoop104:/user/atguigu
# The third form copies between two remote hosts by relaying through the local
# host; it can be used when SSH is not configured between the two remote hosts.
2) Use the distcp command for recursive data copying between two Hadoop clusters (data is copied between the two NameNodes' clusters)
[atguigu@hadoop102 hadoop-3.1.3]$ bin/hadoop distcp hdfs://hadoop102:8020/user/atguigu/hello.txt hdfs://hadoop105:8020/user/atguigu/hello.txt
8. MapReduce Production Experience
8.1. Why MapReduce Runs Slowly
MapReduce performance bottlenecks come down to two things:
1) Machine performance
CPU, memory, disk, network
2) I/O operation issues
(1) Data skew
(2) Map tasks running too long, making Reduce tasks wait too long
(3) Too many small files
8.2. The Data Skew Problem
Data frequency skew: the amount of data in one partition is far larger than in the others.
Data size skew: some records are far larger than the average.
(1) First check whether the skew is caused by too many null keys
In production you can simply filter out the null keys; if you need to keep them, use a custom partitioner to scatter them by appending a random number to the key, then aggregate a second time.
(2) Handle as much as possible early, in the Map phase, e.g., with a Combiner or a Map Join
(3) Increase the number of Reduce tasks
For details, see the earlier documentation and the YARN tuning notes.
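The null-key salting approach from (1) can be sketched outside MapReduce as a two-stage aggregation (illustrative Python, not an actual MR job; the function name and salt format are my own):

```python
import random
from collections import Counter

def salted_two_stage_count(keys, hot_key="", salts=4, seed=42):
    """Two-stage aggregation: append a random salt to the hot key so its
    records spread across partitions, count per salted key (stage 1),
    then strip the salt and merge by original key (stage 2)."""
    rng = random.Random(seed)
    stage1 = Counter()
    for k in keys:
        salted = f"{k}#{rng.randrange(salts)}" if k == hot_key else k
        stage1[salted] += 1                  # first (distributed) aggregation
    stage2 = Counter()
    for salted, cnt in stage1.items():
        stage2[salted.split("#")[0]] += cnt  # strip salt, second aggregation
    return dict(stage2)

data = [""] * 6 + ["a", "b"]                 # "" stands in for the null key
print(salted_two_stage_count(data))          # {'': 6, 'a': 1, 'b': 1}
```

In a real job, stage 1 runs in the mappers/first reducers and stage 2 in a second reduce, so no single reducer receives all the hot-key records.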
8.3. Too Many Small Files
Every file on HDFS has corresponding metadata on the NameNode, roughly 150 bytes per metadata object. When small files are numerous, this metadata both occupies a large share of the NameNode's memory and slows down metadata lookups.
With too many small files, a MapReduce job generates too many splits and therefore launches too many MapTasks. Each MapTask processes very little data, so its processing time is shorter than its startup time, wasting resources.
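The NameNode memory cost described above can be roughly estimated (illustrative sketch; the ~150-byte-per-object figure is the approximation quoted above, and the helper name is my own):

```python
METADATA_BYTES = 150   # approximate NameNode memory per file/block object

def namenode_memory_mb(num_files, blocks_per_file=1):
    """Approximate NameNode heap consumed by file + block metadata."""
    objects = num_files * (1 + blocks_per_file)   # one inode + its blocks
    return objects * METADATA_BYTES / 1024 / 1024

# 10 million small files (1 block each) vs. the same data in 10,000 big files
print(round(namenode_memory_mb(10_000_000)))  # 2861 (MB)
print(round(namenode_memory_mb(10_000)))      # 3 (MB)
```

Roughly three orders of magnitude of NameNode memory are at stake, which is why the remedies below all aim to reduce the file count.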
1) At data-ingestion time, merge small files or small batches of data into large files before uploading to HDFS (data source side)
2) Hadoop Archive (storage side)
An efficient archiving tool that packs small files into HDFS blocks: it bundles multiple small files into one HAR file, reducing the NameNode's memory usage.
3) CombineTextInputFormat (compute side)
CombineTextInputFormat merges many small files into a single split, or a small number of splits, during the splitting phase.
4) Enable uber mode for JVM reuse (compute side)
By default, every Task starts its own JVM. If the Tasks process only small amounts of data, we can let multiple Tasks of the same Job run in one JVM instead of starting a JVM per Task.
(1) With uber mode off, upload several small files to /input and run the wordcount program
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount /input /output2
(2) Check the console output
2021-02-14 16:13:50,607 INFO mapreduce.Job: Job job_1613281510851_0002 running in uber mode : false
(3) Check http://hadoop103:8088/cluster
(4) Enable uber mode by adding the following configuration to mapred-site.xml
```xml
<!-- Enable uber mode -->
<property>
    <name>mapreduce.job.ubertask.enable</name>
    <value>true</value>
</property>
<!-- Maximum number of MapTasks allowed for an uber job -->
<property>
    <name>mapreduce.job.ubertask.maxmaps</name>
    <value>9</value>
</property>
<!-- Maximum number of ReduceTasks allowed for an uber job -->
<property>
    <name>mapreduce.job.ubertask.maxreduces</name>
    <value>1</value>
</property>
<!-- Maximum input size for an uber job; no value was given in the source
     (an empty value means the default, the block size) -->
<property>
    <name>mapreduce.job.ubertask.maxbytes</name>
    <value></value>
</property>
```
(5) Distribute the configuration
[atguigu@hadoop102 hadoop]$ xsync mapred-site.xml
(6) Run the wordcount program again
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount /input /output3
(7) Check the console output
2021-02-14 16:28:36,198 INFO mapreduce.Job: Job job_1613281510851_0003 running in uber mode : true
(8) Check http://hadoop103:8088/cluster
10.2. Benchmarking MapReduce Compute Performance
Use the Sort program to benchmark MapReduce.
Note: do not run this benchmark on a virtual machine with less than 150 GB of disk.
(1) Use RandomWriter to generate random data: each node runs 10 Map tasks, and each Map produces about 1 GB of random binary data
[atguigu@hadoop102 mapreduce]$ hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar randomwriter random-data
(2) Run the Sort program
[atguigu@hadoop102 mapreduce]$ hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar sort random-data sorted-data
(3) Verify that the data is actually sorted
[atguigu@hadoop102 mapreduce]$ hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.1.3-tests.jar testmapredsort -sortInput random-data -sortOutput sorted-data
10.3. Enterprise Development Scenario Case Study
10.3.1. Requirements
(1) Requirement: count the occurrences of each word in 1 GB of data. Three servers, each with 4 GB of memory, 4 CPU cores, and 4 threads.
(2) Analysis:
1 GB / 128 MB = 8 MapTasks; 1 ReduceTask; 1 MrAppMaster
On average, 10 containers / 3 nodes ≈ 3 tasks per node (distributed as 4, 3, 3)
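The analysis above reduces to simple arithmetic, sketched here (illustrative helper, not part of Hadoop; it counts containers only and ignores memory headroom):

```python
import math

def plan_tasks(input_bytes, block_bytes=128 * 1024 * 1024, nodes=3):
    """Estimate container counts for a WordCount-style job:
    one MapTask per input split, one ReduceTask, one MrAppMaster."""
    map_tasks = math.ceil(input_bytes / block_bytes)
    containers = map_tasks + 1 + 1        # maps + reduce + AppMaster
    per_node = math.ceil(containers / nodes)
    return map_tasks, containers, per_node

GB = 1024 ** 3
maps, total, per_node = plan_tasks(1 * GB)
print(maps, total, per_node)   # 8 10 4  -> roughly 3-4 containers per node (4, 3, 3)
```

These 10 containers on 3 nodes motivate the memory and vcore limits chosen in the configuration below.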
10.3.2. HDFS Parameter Tuning
(1) Modify hadoop-env.sh
export HDFS_NAMENODE_OPTS="-Dhadoop.security.logger=INFO,RFAS -Xmx1024m"
export HDFS_DATANODE_OPTS="-Dhadoop.security.logger=ERROR,RFAS -Xmx1024m"
(2) Modify hdfs-site.xml
```xml
<!-- NameNode worker threads that handle client and DataNode RPCs -->
<property>
    <name>dfs.namenode.handler.count</name>
    <value>21</value>
</property>
```
(3) Modify core-site.xml
```xml
<!-- How long (in minutes) deleted files are kept in the trash -->
<property>
    <name>fs.trash.interval</name>
    <value>60</value>
</property>
```
(4) Distribute the configuration
[atguigu@hadoop102 hadoop]$ xsync hadoop-env.sh hdfs-site.xml core-site.xml
10.3.3. MapReduce Parameter Tuning
(1) Modify mapred-site.xml
```xml
<!-- Size in MB of the in-memory sort buffer used for map output -->
<property>
    <name>mapreduce.task.io.sort.mb</name>
    <value>100</value>
</property>
<!-- Fill fraction of the sort buffer at which a spill to disk starts -->
<property>
    <name>mapreduce.map.sort.spill.percent</name>
    <value>0.80</value>
</property>
<!-- Number of spill files merged at once -->
<property>
    <name>mapreduce.task.io.sort.factor</name>
    <value>10</value>
</property>
<property>
    <name>mapreduce.map.memory.mb</name>
    <value>-1</value>
    <description>The amount of memory to request from the scheduler for each map task. If this is not specified or is non-positive, it is inferred from mapreduce.map.java.opts and mapreduce.job.heap.memory-mb.ratio. If java-opts are also not specified, we set it to 1024.</description>
</property>
<!-- vcores per MapTask -->
<property>
    <name>mapreduce.map.cpu.vcores</name>
    <value>1</value>
</property>
<!-- Maximum retries for a failed MapTask -->
<property>
    <name>mapreduce.map.maxattempts</name>
    <value>4</value>
</property>
<!-- Number of parallel copy threads a reduce uses to fetch map output -->
<property>
    <name>mapreduce.reduce.shuffle.parallelcopies</name>
    <value>5</value>
</property>
<!-- Fraction of reduce heap used to buffer shuffled map output -->
<property>
    <name>mapreduce.reduce.shuffle.input.buffer.percent</name>
    <value>0.70</value>
</property>
<!-- Buffer fill fraction at which shuffled data is merged and spilled -->
<property>
    <name>mapreduce.reduce.shuffle.merge.percent</name>
    <value>0.66</value>
</property>
<property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>-1</value>
    <description>The amount of memory to request from the scheduler for each reduce task. If this is not specified or is non-positive, it is inferred from mapreduce.reduce.java.opts and mapreduce.job.heap.memory-mb.ratio. If java-opts are also not specified, we set it to 1024.</description>
</property>
<!-- vcores per ReduceTask -->
<property>
    <name>mapreduce.reduce.cpu.vcores</name>
    <value>2</value>
</property>
<!-- Maximum retries for a failed ReduceTask -->
<property>
    <name>mapreduce.reduce.maxattempts</name>
    <value>4</value>
</property>
<!-- Fraction of maps that must finish before reduces may start -->
<property>
    <name>mapreduce.job.reduce.slowstart.completedmaps</name>
    <value>0.05</value>
</property>
<!-- Task timeout in milliseconds -->
<property>
    <name>mapreduce.task.timeout</name>
    <value>600000</value>
</property>
```
(2) Distribute the configuration
[atguigu@hadoop102 hadoop]$ xsync mapred-site.xml
10.3.4. YARN Parameter Tuning
(1) Modify the yarn-site.xml configuration parameters as follows:
```xml
<property>
    <description>The class to use as the resource scheduler.</description>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>
<property>
    <description>Number of threads to handle scheduler interface.</description>
    <name>yarn.resourcemanager.scheduler.client.thread-count</name>
    <value>8</value>
</property>
<property>
    <description>Enable auto-detection of node capabilities such as memory and CPU.</description>
    <name>yarn.nodemanager.resource.detect-hardware-capabilities</name>
    <value>false</value>
</property>
<property>
    <description>Flag to determine if logical processors (such as hyperthreads) should be counted as cores. Only applicable on Linux when yarn.nodemanager.resource.cpu-vcores is set to -1 and yarn.nodemanager.resource.detect-hardware-capabilities is true.</description>
    <name>yarn.nodemanager.resource.count-logical-processors-as-cores</name>
    <value>false</value>
</property>
<property>
    <description>Multiplier to determine how to convert physical cores to vcores. This value is used if yarn.nodemanager.resource.cpu-vcores is set to -1 (which implies auto-calculate vcores) and yarn.nodemanager.resource.detect-hardware-capabilities is set to true. The number of vcores will be calculated as number of CPUs * multiplier.</description>
    <name>yarn.nodemanager.resource.pcores-vcores-multiplier</name>
    <value>1.0</value>
</property>
<property>
    <description>Amount of physical memory, in MB, that can be allocated for containers. If set to -1 and yarn.nodemanager.resource.detect-hardware-capabilities is true, it is automatically calculated (in case of Windows and Linux). In other cases, the default is 8192MB.</description>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>4096</value>
</property>
<property>
    <description>Number of vcores that can be allocated for containers. This is used by the RM scheduler when allocating resources for containers. This is not used to limit the number of CPUs used by YARN containers. If it is set to -1 and yarn.nodemanager.resource.detect-hardware-capabilities is true, it is automatically determined from the hardware in case of Windows and Linux. In other cases, number of vcores is 8 by default.</description>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>4</value>
</property>
<property>
    <description>The minimum allocation for every container request at the RM in MBs. Memory requests lower than this will be set to the value of this property. Additionally, a node manager that is configured to have less memory than this value will be shut down by the resource manager.</description>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1024</value>
</property>
<property>
    <description>The maximum allocation for every container request at the RM in MBs. Memory requests higher than this will throw an InvalidResourceRequestException.</description>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>2048</value>
</property>
<property>
    <description>The minimum allocation for every container request at the RM in terms of virtual CPU cores. Requests lower than this will be set to the value of this property. Additionally, a node manager that is configured to have fewer virtual cores than this value will be shut down by the resource manager.</description>
    <name>yarn.scheduler.minimum-allocation-vcores</name>
    <value>1</value>
</property>
<property>
    <description>The maximum allocation for every container request at the RM in terms of virtual CPU cores. Requests higher than this will throw an InvalidResourceRequestException.</description>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>2</value>
</property>
<property>
    <description>Whether virtual memory limits will be enforced for containers.</description>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>
<property>
    <description>Ratio between virtual memory to physical memory when setting memory limits for containers. Container allocations are expressed in terms of physical memory, and virtual memory usage is allowed to exceed this allocation by this ratio.</description>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>2.1</value>
</property>
```
(2) Distribute the configuration
[atguigu@hadoop102 hadoop]$ xsync yarn-site.xml
10.3.5. Run the Program
(1) Restart the cluster
[atguigu@hadoop102 hadoop-3.1.3]$ sbin/stop-yarn.sh
[atguigu@hadoop103 hadoop-3.1.3]$ sbin/start-yarn.sh
(2) Run the WordCount program
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount /input /output
(3) Check the YARN application page
http://hadoop103:8088/cluster/apps



