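The production environment sent an error alert email with the message KeeperErrorCode = ConnectionLoss.
Troubleshooting steps:
1. The error message points to a problem with the ZK connection, so restart the service first. (The service recovered after the restart. When a production service suddenly fails, restart it immediately to restore service, and analyze the cause afterwards.)
2. Look at the detailed error information. KeeperErrorCode = ConnectionLoss alone does not reveal the cause. The surrounding log lines show the ZK client repeatedly opening sockets (Opening socket connection ...); once the connection is established, the read fails (Unable to read additional data from server ...). The message suggests that the ZK server closed the connection.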
2021-10-31 00:00:17,959 INFO [main-SendThread(10.128.16.39:2181)] [org.apache.zookeeper.ClientCnxn] - Opening socket connection to server 10.128.16.39/10.128.16.39:2181. Will not attempt to authenticate using SASL (unknown error)
2021-10-31 00:00:17,959 INFO [main-SendThread(10.128.16.39:2181)] [org.apache.zookeeper.ClientCnxn] - Socket connection established to 10.128.16.39/10.128.16.39:2181, initiating session
2021-10-31 00:00:17,963 INFO [main-SendThread(10.128.16.39:2181)] [org.apache.zookeeper.ClientCnxn] - Session establishment complete on server 10.128.16.39/10.128.16.39:2181, sessionid = 0x27bdf1c3c985743, negotiated timeout = 40000
2021-10-31 00:00:17,971 INFO [main-EventThread] [org.apache.curator.framework.state.ConnectionStateManager] - State change: RECONNECTED
2021-10-31 00:00:17,971 WARN [main-EventThread] [org.apache.curator.framework.state.ConnectionStateManager] - ConnectionStateManager queue full - dropping events to make room
2021-10-31 00:00:17,979 INFO [main-SendThread(10.128.16.39:2181)] [org.apache.zookeeper.ClientCnxn] - Unable to read additional data from server sessionid 0x27bdf1c3c985743, likely server has closed socket, closing socket connection and attempting reconnect
2021-10-31 00:00:18,079 INFO [main-EventThread] [org.apache.curator.framework.state.ConnectionStateManager] - State change: LOST
2021-10-31 00:00:18,079 WARN [main-EventThread] [org.apache.curator.framework.state.ConnectionStateManager] - ConnectionStateManager queue full - dropping events to make room
2021-10-31 00:00:18,079 ERROR [main-EventThread] [org.apache.curator.framework.imps.CuratorFrameworkImpl] - Background operation retry gave up
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:728) ~[curator-framework-2.10.0.jar:?]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:516) ~[curator-framework-2.10.0.jar:?]
    at org.apache.curator.framework.imps.BackgroundSyncImpl$1.processResult(BackgroundSyncImpl.java:50) ~[curator-framework-2.10.0.jar:?]
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:609) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
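The pattern in the client log (session established, then the socket dropped almost immediately) can be made concrete by measuring the gap between the two entries. The following is only a minimal sketch: the class name ReconnectGap and its helper methods are hypothetical, and it assumes every log line starts with a timestamp in the format shown above.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

final class ReconnectGap {
    // Timestamp layout used by the client log above, e.g. "2021-10-31 00:00:17,963".
    private static final DateTimeFormatter TS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    // Assumption: the first 23 characters of a log line are the timestamp.
    static LocalDateTime timestampOf(String logLine) {
        return LocalDateTime.parse(logLine.substring(0, 23), TS);
    }

    // Milliseconds between "Session establishment complete" and the socket drop.
    static long gapMillis(String establishedLine, String dropLine) {
        return Duration.between(timestampOf(establishedLine), timestampOf(dropLine)).toMillis();
    }

    public static void main(String[] args) {
        String established = "2021-10-31 00:00:17,963 INFO [main-SendThread(10.128.16.39:2181)] "
                + "[org.apache.zookeeper.ClientCnxn] - Session establishment complete on server";
        String dropped = "2021-10-31 00:00:17,979 INFO [main-SendThread(10.128.16.39:2181)] "
                + "[org.apache.zookeeper.ClientCnxn] - Unable to read additional data from server";
        System.out.println("gap in ms: " + gapMillis(established, dropped));
    }
}
```

For the two entries above the gap is 16 ms, which fits the reading that the server rejects something the client sends right after the session is re-established, rather than the connection timing out.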
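3. Using the timestamp of the client's first error, check the ZK server log. At that time ZK was running a leader election (during an election it does not serve clients), and the election lasted 1 minute. Immediately afterwards the server logged: Exception causing close of session 0x27bdf1c3c985743 due to java.io.IOException: Len error 1054010. The ZK server limits the length of a packet it will read (jute.maxbuffer, default 0xfffff bytes, just under 1 MB), and the 1054010-byte packet from the client exceeds that limit.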
2021-10-30 23:59:57,575 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /10.128.0.22:4099
2021-10-30 23:59:57,577 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@861] - Client attempting to renew session 0x27bdf1c3c985743 at /10.128.0.22:4099
2021-10-30 23:59:57,578 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@108] - Revalidating client: 0x27bdf1c3c985743
2021-10-30 23:59:57,578 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@617] - Established session 0x27bdf1c3c985743 with negotiated timeout 40000 for client /10.128.0.22:4099
2021-10-30 23:59:57,595 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x27bdf1c3c985743 due to java.io.IOException: Len error 1054010
2021-10-30 23:59:57,595 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /10.128.0.22:4099 which had sessionid 0x27bdf1c3c985743
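To check a "Len error" entry against the limit, the number can be pulled out of the server log line and compared with the default jute.maxbuffer value. This is a minimal sketch, not ZooKeeper code: the class LenErrorCheck and its methods are hypothetical names, and the comparison only approximates the server-side length check.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LenErrorCheck {
    // Default jute.maxbuffer: 0xfffff bytes, just under 1 MB.
    static final int DEFAULT_JUTE_MAXBUFFER = 0xfffff;

    private static final Pattern LEN_ERROR = Pattern.compile("Len error (-?\\d+)");

    // Extracts the packet length from a "Len error N" server log line, if present.
    static OptionalInt parseLenError(String logLine) {
        Matcher m = LEN_ERROR.matcher(logLine);
        return m.find() ? OptionalInt.of(Integer.parseInt(m.group(1))) : OptionalInt.empty();
    }

    // Assumption: a packet is rejected when its length is negative or above the limit.
    static boolean exceedsLimit(int packetLength, int limit) {
        return packetLength < 0 || packetLength > limit;
    }

    public static void main(String[] args) {
        String line = "Exception causing close of session 0x27bdf1c3c985743 "
                + "due to java.io.IOException: Len error 1054010";
        int len = parseLenError(line).orElseThrow(IllegalStateException::new);
        System.out.println("packet length: " + len);
        System.out.println("over default limit by: " + (len - DEFAULT_JUTE_MAXBUFFER) + " bytes");
        System.out.println("rejected: " + exceedsLimit(len, DEFAULT_JUTE_MAXBUFFER));
    }
}
```

With the value from the log above, the packet is 5435 bytes over the default limit, so the overshoot is small and the request was only slightly too large.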
4. There are two possible fixes:
(1) Increase jute.maxbuffer.
(2) Find the code that sends the oversized request and fix it.
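Both fixes can be sketched briefly. For the first, jute.maxbuffer is a Java system property, typically passed as a JVM flag such as -Djute.maxbuffer=4194304 (the value here is only an example), and it should be set consistently on the servers and the clients. For the second, one possible approach is a guard in the application that rejects oversized payloads before they reach ZK. The class ZkPayloadGuard, its method names, and the 1024-byte headroom below are hypothetical choices for illustration, not part of ZooKeeper or Curator.

```java
import java.nio.charset.StandardCharsets;

final class ZkPayloadGuard {
    // Default jute.maxbuffer: 0xfffff bytes, just under 1 MB.
    static final int DEFAULT_JUTE_MAXBUFFER = 0xfffff;

    // Hypothetical headroom for request headers and length prefixes (an assumption).
    static final int HEADROOM_BYTES = 1024;

    // Reads the limit from the jute.maxbuffer system property, else uses the default.
    static int effectiveLimit() {
        return Integer.getInteger("jute.maxbuffer", DEFAULT_JUTE_MAXBUFFER);
    }

    // Throws if the path plus data, with headroom, would not fit within the limit.
    static void checkPayload(String path, byte[] data, int limit) {
        long estimated = (long) path.getBytes(StandardCharsets.UTF_8).length
                + data.length + HEADROOM_BYTES;
        if (estimated > limit) {
            throw new IllegalArgumentException("payload for " + path + " is about "
                    + estimated + " bytes, limit is " + limit);
        }
    }

    public static void main(String[] args) {
        int limit = effectiveLimit();
        checkPayload("/config/small", new byte[512], limit); // passes silently
        try {
            checkPayload("/config/large", new byte[2 * 1024 * 1024], DEFAULT_JUTE_MAXBUFFER);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Calling such a guard before each write makes the offending code path fail fast with a clear message in the application, instead of surfacing later as ConnectionLoss on the client and Len error on the server.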



