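Locating and repairing an HDFS block error in Hadoop: org.apache.hadoop.hdfs.CannotObtainBlockLengthException: Cannot obtain block length for LocatedBlock
1. Problem
After the Hadoop cluster was restarted, a job failed with an exception.
Exception message: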
Error: java.io.IOException: org.apache.hadoop.hdfs.CannotObtainBlockLengthException: Cannot obtain block length for LocatedBlock{BP-1982579562-192.168.xxx.32-1629880080614:blk_1083851475_10110700; getBlockSize()=29733; corrupt=false; offset=0; locs=[DatanodeInfoWithStorage[192.168.114.33:50010,DS-c7e1e9b5-cea8-43cb-87a4-f429602b0e03,DISK], DatanodeInfoWithStorage[192.168.114.35:50010,DS-79ec8e0d-bb51-4779-aee8-53d8a98809d6,DISK], DatanodeInfoWithStorage[192.168.114.32:50010,DS-cf7e207c-0e1d-4b65-87f7-608450271039,DISK]]} of /log_collection/ods/ods_xxxx_log/dt=2021-11-24/log.1637744364144.lzo
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:420)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:175)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:444)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:349)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
Caused by: org.apache.hadoop.hdfs.CannotObtainBlockLengthException: Cannot obtain block length for LocatedBlock{BP-1982579562-192.168.114.32-1629880080614:blk_1083851475_10110700; getBlockSize()=29733; corrupt=false; offset=0; locs=[DatanodeInfoWithStorage[192.168.114.33:50010,DS-c7e1e9b5-cea8-43cb-87a4-f429602b0e03,DISK], DatanodeInfoWithStorage[192.168.114.35:50010,DS-79ec8e0d-bb51-4779-aee8-53d8a98809d6,DISK], DatanodeInfoWithStorage[192.168.114.32:50010,DS-cf7e207c-0e1d-4b65-87f7-608450271039,DISK]]} of /log_collection/ods/ods_xxxx_log/dt=2021-11-24/log.1637744364144.lzo
    at org.apache.hadoop.hdfs.DFSInputStream.readBlockLength(DFSInputStream.java:440)
    at org.apache.hadoop.hdfs.DFSInputStream.getLastBlockLength(DFSInputStream.java:349)
    at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:330)
    at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:230)
    at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:196)
    at org.apache.hadoop.hdfs.DFSClient.openInternal(DFSClient.java:1048)
    at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1011)
    at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:321)
    at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:317)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:329)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:899)
    at ...
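2. Solution
First check the affected path with hdfs fsck /path -openforwrite.
Taken literally, "Cannot obtain block length for LocatedBlock" suggests that a file is still open for write and was never closed, so the client cannot determine the length of its block from the corresponding DataNodes.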
[hadoop@node1 talents_bash]$ hdfs fsck /log_collection/ods/ods_xxxx_log/dt=2021-11-24/ -openforwrite
Connecting to namenode via http://node3.bigdata.59wanmei.com:50070/fsck?ugi=hadoop&openforwrite=1&path=%2Flog_collection%2Fods%2Fods_xxxx_log%2Fdt%3D2021-11-24
FSCK started by hadoop (auth:SIMPLE) from /192.168.114.31 for path /log_collection/ods/ods_xxxx_log/dt=2021-11-24 at Thu Nov 25 11:34:10 CST 2021
/log_collection/ods/ods_xxxx_log/dt=2021-11-24/log.1637740751701.lzo 5214 bytes, replicated: replication=3, 1 block(s), OPENFORWRITE:
.
Status: HEALTHY
 Number of data-nodes: 7
 Number of racks: 1
 Total dirs: 1
 Total symlinks: 0

Replicated Blocks:
 Total size: 150068174 B
 Total files: 112
 Total blocks (validated): 112 (avg. block size 1339894 B)
 Minimally replicated blocks: 111 (99.10714 %)
 Over-replicated blocks: 0 (0.0 %)
 Under-replicated blocks: 0 (0.0 %)
 Mis-replicated blocks: 0 (0.0 %)
 Default replication factor: 3
 Average block replication: 2.9732144
 Missing blocks: 0
 Corrupt blocks: 0
 Missing replicas: 0 (0.0 %)

Erasure Coded Block Groups:
 Total size: 0 B
 Total files: 0
 Total block groups (validated): 0
 Minimally erasure-coded block groups: 0
 Over-erasure-coded block groups: 0
 Under-erasure-coded block groups: 0
 Unsatisfactory placement block groups: 0
 Average block group size: 0.0
 Missing block groups: 0
 Corrupt block groups: 0
 Missing internal blocks: 0
FSCK ended at Thu Nov 25 11:34:10 CST 2021 in 5 milliseconds

The filesystem under path '/log_collection/ods/ods_xxxx_log/dt=2021-11-24' is HEALTHY
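When a directory holds many files, the OPENFORWRITE lines are easy to miss in the fsck report. The sketch below filters them out. The function name list_open_files is made up for illustration, and it assumes the per-file line format shown in the output above (path first, then the size, with OPENFORWRITE somewhere on the line) and HDFS paths that contain no spaces.

```shell
#!/usr/bin/env bash

# list_open_files (hypothetical helper): read an fsck report on stdin and
# print the path of every file that is reported as still open for write.
# Assumption: the path is the first whitespace-separated field of the line
# and starts with "/", as in the report above.
list_open_files() {
  awk '/OPENFORWRITE/ && $1 ~ /^\// { print $1 }'
}

# Usage on a live cluster (replace /path with the directory to check):
#   hdfs fsck /path -openforwrite | list_open_files
```

For the report above, this would print only the log.1637740751701.lzo path.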
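Inference: the HDFS file lease was not released
For background on the HDFS lease mechanism, see http://www.cnblogs.com/cssdongl/p/6699919.html
With that background, we know that a client acquires a lease when it opens an HDFS file for writing and releases the lease once the write is finished, at which point the file is closed. (Leases guard writers; plain reads do not take one.)
In this case, however, the Hadoop cluster was shut down first and the Flume HDFS sink only afterwards. Flume was still writing to HDFS files when the cluster went down, so was the lease ever released? Almost certainly not.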
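Recovering the lease
Deleting files in this broken state with rm is a heavy-handed fix: what if the upstream no longer retains the data for that date? Since the lease was never released, the better option is to recover the lease and close the file with the following command: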
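hdfs debug recoverLease -path <path> -retries <num-retries>
Replace <path> with the HDFS path of the file whose lease is in an inconsistent state. If many files need recovering, you can write a script that finds all of them and recovers their leases in one pass.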
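One possible shape for such a script is sketched below. It is not from the original post: the names recover_leases, HDFS_CMD and RETRIES are hypothetical, and the script assumes that hdfs debug recoverLease exits with a non-zero status when recovery fails. It reads one HDFS path per line on stdin, so the list can come from a filtered fsck report or from a file prepared by hand.

```shell
#!/usr/bin/env bash

# Illustrative knobs, not part of Hadoop:
#   HDFS_CMD  command used to reach the hdfs CLI (override for a dry run)
#   RETRIES   value passed to -retries for every file
HDFS_CMD="${HDFS_CMD:-hdfs}"
RETRIES="${RETRIES:-10}"

# recover_leases (hypothetical helper): read one HDFS path per line on stdin
# and run recoverLease on each. Returns non-zero if any recovery failed.
recover_leases() {
  local path failed=0
  while IFS= read -r path; do
    # Skip blank lines in the input list.
    [ -z "$path" ] && continue
    # </dev/null keeps the hdfs command from consuming the path list.
    if ! "$HDFS_CMD" debug recoverLease -path "$path" -retries "$RETRIES" </dev/null; then
      echo "recoverLease failed: $path" >&2
      failed=1
    fi
  done
  return "$failed"
}

# Usage on a live cluster (replace /path with the directory to repair):
#   hdfs fsck /path -openforwrite \
#     | awk '/OPENFORWRITE/ && $1 ~ /^\// { print $1 }' \
#     | recover_leases
```

Setting HDFS_CMD=echo gives a dry run that only prints the commands that would be executed, which is a cheap way to review the file list before touching the cluster.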
After running the command, cat on the HDFS file no longer throws the exception and prints the contents normally. Problem solved.
[hadoop@node1 talents_bash]$ hdfs debug recoverLease -path /log_collection/ods/ods_xxxx_log/dt=2021-11-24/log.1637740751701.lzo -retries 10
recoverLease SUCCEEDED on /log_collection/ods/ods_xxxx_log/dt=2021-11-24/log.1637740751701.lzo
Reference solution:
https://www.cnblogs.com/cssdongl/p/6700512.html