
Kudu fails to start after a server crash (Data length checksum does not match: Incorrect checksum in file ... : Checksum )


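One day a server dropped offline unexpectedly and could not be logged into. Once it came back, the agent service restarted automatically and the roles on the server began to recover, but Kudu did not. A manual restart also failed. The logs showed that Kudu had been writing data when the server went down, which left its data files corrupted. The error was: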

++ date
+ timestamp='Wed Oct 13 10:57:02 CST 2021'
+ echo 'Wed Oct 13 10:57:02 CST 2021: Found master(s) on hadoop11,hadoop12,hadoop13'
+ echo 'Wed Oct 13 10:57:02 CST 2021: Found master(s) on hadoop11,hadoop12,hadoop13'
Wed Oct 13 10:57:02 CST 2021: Found master(s) on hadoop11,hadoop12,hadoop13
+ '[' false == true ']'
+ KUDU_ARGS=
+ '[' false == true ']'
+ '[' tserver = master ']'
+ '[' tserver = tserver ']'
+ KUDU_ARGS=' --tserver_master_addrs=hadoop11,hadoop12,hadoop13'
+ exec /opt/cloudera/parcels/CDH-5.15.0-1.cdh5.15.0.p0.21/lib/kudu/sbin/kudu-tserver --tserver_master_addrs=hadoop11,hadoop12,hadoop13 --flagfile=/run/cloudera-scm-agent/process/18986-kudu-KUDU_TSERVER/gflagfile
F1013 10:57:03.360226 62788 tablet_server_main.cc:80] Check failed: _s.ok() Bad status: Corruption: Failed to load FS layout: Could not process records in container /mnt/sdf/kudu/tserver/data/e68f1a45d9f144d9a5242a2067cb8d37: Data length checksum does not match: Incorrect checksum in file /mnt/sdf/kudu/tserver/data/e68f1a45d9f144d9a5242a2067cb8d37.metadata at offset 2697092: Checksum does not match. Expected: 0. Actual: 1214729159
*** Check failure stack trace: ***
Wrote minidump to /var/log/kudu/minidumps/kudu-tserver/498615d9-40ee-493b-14fc78f5-777a0a19.dmp
*** Aborted at 1634093823 (unix time) try "date -d @1634093823" if you are using GNU date ***
PC: @     0x7f7760f4e1d7 __GI_raise
*** SIGABRT (@0x3cf0000f544) received by PID 62788 (TID 0x7f776350b9c0) from PID 62788; stack trace: ***
    @     0x7f7762ed5370 (unknown)
    @     0x7f7760f4e1d7 __GI_raise
    @     0x7f7760f4f8c8 __GI_abort
    @          0x1b49fe9 (unknown)
    @           0x8e27ad google::LogMessage::Fail()
    @           0x8e4703 google::LogMessage::SendToLog()
    @           0x8e2309 google::LogMessage::Flush()
    @           0x8e508f google::LogMessageFatal::~LogMessageFatal()
    @           0x883d86 (unknown)
    @     0x7f7760f3ab35 __libc_start_main
    @           0x883755 (unknown)
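
When many containers are affected, it helps to pull the damaged file paths out of the fatal message instead of reading them by eye. The following is a minimal sketch that assumes the message format shown in the log above; the function name `corrupt_metadata_paths` and the log file name in the usage comment are hypothetical.

```shell
#!/usr/bin/env bash
# Read log text on stdin and print each distinct .metadata path that appears
# after "Incorrect checksum in file" (format assumed from the log above).
corrupt_metadata_paths() {
  grep -o 'Incorrect checksum in file [^ ]*\.metadata' \
    | sed 's/^Incorrect checksum in file //' \
    | sort -u
}

# Usage (hypothetical file name; point it at a saved copy of the tserver log):
# corrupt_metadata_paths < tserver-stderr.log
```

Note that the tablet server aborts on the first bad container it meets, so a single start attempt normally reports only one path.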

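A Baidu search suggested this approach: find the modification time of the file named in the error, delete the last line of every metadata file modified at that moment, restart, and repeat for each file that still fails. In practice there were too many files to handle this way and it took too long.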

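# Show the modification time (HH:MM:SS) of the metadata file named in the error
sudo ls -l --full-time /mnt/sdh/kudu/tserver/data/29676e68406a4421b04d368798607062.metadata | awk '{print $7}' | cut -c 1-8

# Delete the last line of every .metadata file modified at that second
for i in $(sudo ls -l --full-time /mnt/sdh/kudu/tserver/data/ | grep "2021-10-13 10:37:19" | grep '\.metadata$' | awk '{print $9}'); do sudo sed -i '$d' "/mnt/sdh/kudu/tserver/data/$i"; done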
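
Since `sed -i '$d'` edits the metadata files in place, a copy taken beforehand makes the step reversible. Below is a hedged sketch of the same idea that avoids parsing `ls` output and copies each file to a separate directory first. It assumes GNU `stat` and `sed`, a stopped tablet server, and a user with write access to the data directory; both function names and the backup directory are hypothetical.

```shell
#!/usr/bin/env bash
# List *.metadata files in DIR whose modification time, to the second,
# equals STAMP ("YYYY-MM-DD HH:MM:SS", local time). Requires GNU stat.
metadata_files_at() {
  local dir=$1 stamp=$2 f
  for f in "$dir"/*.metadata; do
    [ -e "$f" ] || continue
    # %y prints e.g. "2021-10-13 10:37:19.000000000 +0800"; keep 19 chars.
    if [ "$(stat -c '%y' "$f" | cut -c 1-19)" = "$stamp" ]; then
      printf '%s\n' "$f"
    fi
  done
}

# Read file paths on stdin; copy each one into BACKUP_DIR, then delete its
# last line with the same sed expression used above.
trim_last_line_with_backup() {
  local backup_dir=$1 f
  mkdir -p "$backup_dir" || return 1
  while IFS= read -r f; do
    cp -p "$f" "$backup_dir/" && sed -i '$d' "$f"
  done
}

# Usage (the backup directory is an assumption):
# metadata_files_at /mnt/sdh/kudu/tserver/data "2021-10-13 10:37:19" \
#   | trim_last_line_with_backup /root/kudu-metadata-backup
```

Running `metadata_files_at` on its own first gives a dry-run list of what would be modified.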
	 

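Final fix: only one tablet server's data files were affected and the data is kept in three replicas, so I stopped the Kudu service on that server, deleted its Kudu role, discarded the old Kudu data files, and added the tablet server back.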

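From that server's Kudu configuration, note down the data directories and the WAL directory; they will be renamed as a backup in a later step: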

Kudu Tablet Server WAL Directory

Kudu Tablet Server Data Directories

In the CM UI, stop the Kudu Tablet Server on the failed server, then delete the Kudu role from that server.

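Log in to the failed server and rename the configured data and WAL directories to back them up (if they are not renamed, re-adding the tablet server on this machine later fails with an error).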
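
The rename step can be sketched as a small function. This is only an illustration: the function name `backup_kudu_dirs` and the `.bak.<timestamp>` suffix are my own choices, and the paths in the usage comment are placeholders for the WAL and data directories recorded from CM.

```shell
#!/usr/bin/env bash
# Rename each given directory to DIR.bak.<timestamp> so that a re-added
# tablet server starts with empty paths. Never overwrites an existing backup.
backup_kudu_dirs() {
  local suffix dir
  suffix=".bak.$(date +%Y%m%d%H%M%S)"
  for dir in "$@"; do
    dir=${dir%/}                      # drop one trailing slash, if present
    if [ ! -d "$dir" ]; then
      echo "skip (not a directory): $dir" >&2
      continue
    fi
    if [ -e "$dir$suffix" ]; then
      echo "backup already exists: $dir$suffix" >&2
      return 1
    fi
    mv "$dir" "$dir$suffix" && echo "$dir -> $dir$suffix"
  done
}

# Usage (placeholder paths; run as a user allowed to rename them):
# backup_kudu_dirs /path/to/kudu/tserver/wal /path/to/kudu/tserver/data
```

Renaming rather than deleting keeps the old files available until the cluster is confirmed healthy again.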

Running ksck from the command line showed that the cluster was still trying to reach the removed tablet server:

sudo -u kudu kudu cluster ksck hadoopap11  | head -n 10

Connected to the Master
WARNING: Unable to connect to Tablet Server a8c0534fc01d4c3bae02faec3d3fddd4 (hadoop 32:7050): Network error: could not send Ping RPC to server: Client connection negotiation failed: client connection to 172.18.8.52:7050: connect: Connection refused (error 111)
WARNING: Fetched info from 18 Tablet Servers, 1 weren't reachable
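
To check for stale entries without scanning the output by hand, the warning lines can be filtered. This is a rough sketch that assumes the ksck warning format shown above, which may differ between Kudu versions; the function name `unreachable_tservers` is hypothetical.

```shell
#!/usr/bin/env bash
# Read ksck output on stdin and print the UUID of every tablet server that
# ksck reported as unreachable, one per line (warning format assumed from
# the output above).
unreachable_tservers() {
  sed -n 's/^WARNING: Unable to connect to Tablet Server \([0-9a-f]\{32\}\) .*/\1/p'
}

# Usage (same ksck invocation as above; 2>&1 in case warnings go to stderr):
# sudo -u kudu kudu cluster ksck hadoopap11 2>&1 | unreachable_tservers
```

An empty result means ksck did not report any unreachable tablet server in this format.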

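It turned out that the Kudu masters have to be restarted; otherwise the stale entry for the old tablet server never goes away (I simply did a rolling restart of all Kudu roles).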

After that, ksck came back clean, and I re-added the Kudu Tablet Server role on this server.
