2021-10-18 18:23:58,158 INFO jdbc.Utils: Supplied authorities: 192.168.xx.xx:10000
2021-10-18 18:23:58,158 INFO jdbc.Utils: Resolved authority: 192.168.xx.xx:10000
2021-10-18 18:24:24,234 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on node25:33917 in memory (size: 4.8 KiB, free: 366.3 MiB)
2021-10-18 18:24:24,253 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on node25:42263 in memory (size: 4.8 KiB, free: 366.3 MiB)
2021-10-18 18:24:24,264 INFO storage.BlockManagerInfo: Removed broadcast_2_piece0 on node25:33917 in memory (size: 7.3 KiB, free: 366.3 MiB)
2021-10-18 18:24:24,268 INFO storage.BlockManagerInfo: Removed broadcast_2_piece0 on node25:42263 in memory (size: 7.3 KiB, free: 366.3 MiB)
At first I attributed the problem to the data itself: data skew might be making the job appear hung. So I killed the running job with yarn application -kill ${application_id} and reran the test against an empty table. The same symptom appeared, which ruled out the initial data-skew hypothesis, so I went back to troubleshooting.
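For reference, a minimal sketch of that kill-and-retest step (the application id shown is hypothetical):

yarn application -list -appStates RUNNING    # find the id of the suspect job
yarn application -kill application_1634567890123_0001    # hypothetical id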
As an aside, here is an article on resolving data skew that I came across while troubleshooting: "Spark performance optimization".
Continuing the investigation, the log then threw the following error:
2021-10-18 18:44:35,265 ERROR yarn.ApplicationMaster: User class threw exception: java.sql.SQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:java.net.SocketTimeoutException: callTimeout=1200000, callDuration=1219044: callTimeout=1200000, callDuration=1238974: Failed to find location, tableName=hbase:meta, row=xxxxx, reload=true row 'xxxx' on table 'hbase:meta' at null
at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:148)
at org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3078)
at org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3070)
at org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:445)
at org.apache.hadoop.hive.hbase.HBaseMetaHook.preCreateTable(HBaseMetaHook.java:135)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:832)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:822)
at sun.reflect.GeneratedMethodAccessor42.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:212)
at com.sun.proxy.$Proxy34.createTable(Unknown Source)
at sun.reflect.GeneratedMethodAccessor42.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:2773)
at com.sun.proxy.$Proxy34.createTable(Unknown Source)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:921)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:937)
at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:4954)
at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:428)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:205)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:97)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2664)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:2335)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:2011)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1709)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1703)
at org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:157)
at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:224)
at org.apache.hive.service.cli.operation.SQLOperation.access$700(SQLOperation.java:87)
at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:316)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845)
at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork.run(SQLOperation.java:329)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.SocketTimeoutException: callTimeout=1200000, callDuration=1238974: Failed to find location, tableName=hbase:meta, row=xxxxx, reload=true row 'xxxxx' on table 'hbase:meta' at null
at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:148)
at org.apache.hadoop.hbase.client.HTable.get(HTable.java:395)
at org.apache.hadoop.hbase.client.HTable.get(HTable.java:369)
at org.apache.hadoop.hbase.MetaTableAccessor.getTableState(MetaTableAccessor.java:1108)
at org.apache.hadoop.hbase.MetaTableAccessor.tableExists(MetaTableAccessor.java:427)
at org.apache.hadoop.hbase.client.HBaseAdmin$6.rpcCall(HBaseAdmin.java:448)
at org.apache.hadoop.hbase.client.HBaseAdmin$6.rpcCall(HBaseAdmin.java:445)
at org.apache.hadoop.hbase.client.RpcRetryingCallable.call(RpcRetryingCallable.java:58)
at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:106)
... 40 more
Caused by: java.io.IOException: Failed to find location, tableName=hbase:meta, row=xxxxx, reload=true
at org.apache.hadoop.hbase.client.RegionServerCallable.prepare(RegionServerCallable.java:222)
at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:104)
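A timeout locating hbase:meta points at the HBase server side rather than at the Spark job itself. One quick way to confirm that (a sketch; exact commands and output vary by HBase version) is to check whether the daemons are alive and whether the meta table answers at all:

jps | grep -E 'HMaster|HRegionServer'    # are the HBase daemons still running?
hbase shell
status 'simple'                          # cluster status as reported by the master
scan 'hbase:meta', {LIMIT => 1}          # does hbase:meta respond at all?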
At this point I discovered that the HBase service had stopped unexpectedly. Worse, when I tried to restart it, HBase would not come back up: the HMaster failed to start every time. The HMaster startup log showed the following:
2021-10-18 20:17:12,663 WARN [main-SendThread(192.168.17.23:2181)] zookeeper.ClientCnxn: Session 0x0 for server 192.168.17.23/192.168.17.23:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:378)
at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
2021-10-18 20:17:26,533 INFO [main] zookeeper.ZooKeeper: Session: 0x0 closed
2021-10-18 20:17:26,533 ERROR [main] regionserver.HRegionServer: Failed construction RegionServer
org.apache.hadoop.hbase.ZooKeeperConnectionException: master:16000-0x0, quorum=192.168.17.22:2181,192.168.17.23:2181,192.168.17.24:2181, baseZNode=/hbase Unexpected KeeperException creating base node
at org.apache.hadoop.hbase.zookeeper.ZKWatcher.createBaseZNodes(ZKWatcher.java:192)
at org.apache.hadoop.hbase.zookeeper.ZKWatcher.<init>(ZKWatcher.java:167)
at org.apache.hadoop.hbase.zookeeper.ZKWatcher.<init>(ZKWatcher.java:119)
at org.apache.hadoop.hbase.regionserver.HRegionServer.<init>(HRegionServer.java:652)
at org.apache.hadoop.hbase.master.HMaster.<init>(HMaster.java:494)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.hbase.master.HMaster.constructMaster(HMaster.java:2937)
at org.apache.hadoop.hbase.master.HMasterCommandLine.startMaster(HMasterCommandLine.java:236)
at org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:140)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:149)
at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:2955)
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.createNonSequential(RecoverableZooKeeper.java:547)
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.create(RecoverableZooKeeper.java:526)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.createWithParents(ZKUtil.java:1197)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.createWithParents(ZKUtil.java:1175)
at org.apache.hadoop.hbase.zookeeper.ZKWatcher.createBaseZNodes(ZKWatcher.java:183)
... 14 more
So the cause was found: HBase failed to start because it could not connect to ZooKeeper. However, ps -ef | grep zookeeper showed that the ZooKeeper processes were running, so the failure was not simply a stopped ZooKeeper service. I then noticed the java.io.IOException: Connection reset by peer, which typically means the server is dropping sockets because too many connections are open. So I checked ZooKeeper's connection count:
netstat -anl | grep 2181 | grep -i 'zk_ip' | grep ESTABLISHED | wc -l
The count came back as 240, while ZooKeeper's default limit (maxClientCnxns) is only 60 concurrent connections per client IP, so I decided to raise ZooKeeper's maximum allowed connections.
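Since maxClientCnxns is enforced per client IP, it can also help to see which clients hold how many connections. ZooKeeper's built-in four-letter cons command prints one line per open connection (a sketch; on ZooKeeper 3.5+ four-letter commands must first be whitelisted via 4lw.commands.whitelist in zoo.cfg):

echo cons | nc 192.168.17.23 2181            # one line per open client connection
echo cons | nc 192.168.17.23 2181 | wc -l    # rough total, for comparison with netstat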
In ZooKeeper's zoo.cfg, add the following parameter:
# the maximum number of client connections.
# increase this if you need to handle more clients
maxClientCnxns=2000
Then restart the ZooKeeper cluster, run rmr /hbase in the ZooKeeper shell to clear HBase's stale znodes, and restart HBase. After that, the service started normally.
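For completeness, the whole recovery sequence looks roughly like this (a sketch; script locations depend on the installation, and on ZooKeeper 3.5+ rmr has been replaced by deleteall):

# on every ZooKeeper node, restart to pick up the new maxClientCnxns
zkServer.sh restart

# clear HBase's stale znodes; HBase recreates them on the next start
zkCli.sh -server 192.168.17.22:2181
rmr /hbase    # on ZooKeeper 3.5+: deleteall /hbase

# bring HBase back up and watch the HMaster log
start-hbase.sh
tail -f $HBASE_HOME/logs/hbase-*-master-*.log

Note that removing /hbase only discards HBase's transient ZooKeeper state; table data in HDFS is untouched, but it should only be done while HBase is stopped.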



