etcd error in k8s: "etcd component unhealthy"

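1. After logging in to the cluster, the page showed the error "etcd component unhealthy", although the nodes themselves showed no problems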

Checking etcd on the backend shows that one member (192.168.0.4) is missing from the endpoint status list:

+--------------------------+------------------+---------+---------+-----------+-----------+------------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+--------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://192.168.0.3:2379 | 78b2261e379e9a4c | 3.3.15 | 19 MB | true | 16537 | 14306144 |
| https://192.168.0.2:2379 | d5a8f8671df6bb3b | 3.3.15 | 18 MB | false | 16537 | 14306144 |

2. A health check confirms that this etcd member is unhealthy

Sangfor:PaaS/private-master-01-a7aeca ~ x docker exec -e ETCDCTL_ENDPOINTS=$(docker exec etcd /bin/sh -c "etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ','") etcd etcdctl endpoint health
{"level":"warn","ts":"2022-02-11T07:55:41.631Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-9aae1d5c-2c6b-4474-87e4-35cca2d5cba7/192.168.0.4:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = "transport: Error while dialing dial tcp 192.168.0.4:2379: connect: connection refused""}
https://192.168.0.3:2379 is healthy: successfully committed proposal: took = 13.843404ms
https://192.168.0.2:2379 is healthy: successfully committed proposal: took = 13.728998ms
https://192.168.0.4:2379 is unhealthy: failed to commit proposal: context deadline exceeded
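
The long command above first asks etcd for its member list, extracts the client URL of every member, and hands the comma-joined list to etcdctl through ETCDCTL_ENDPOINTS. Below is a minimal sketch of that extraction step on its own. It assumes the five-column `etcdctl member list` output of the v3 API in etcd 3.3 (ID, status, name, peer URL, client URL). The member names etcd-a, etcd-b and etcd-c are made up for illustration; only the IDs and addresses come from this cluster.

```shell
#!/bin/sh
# Sample `etcdctl member list` output. Member names are hypothetical.
sample_member_list() {
cat <<'EOF'
2bfd771eb734cc67, started, etcd-a, https://192.168.0.4:2380, https://192.168.0.4:2379
78b2261e379e9a4c, started, etcd-b, https://192.168.0.3:2380, https://192.168.0.3:2379
d5a8f8671df6bb3b, started, etcd-c, https://192.168.0.2:2380, https://192.168.0.2:2379
EOF
}

# Same pipeline as in the command above:
#   cut   keeps the 5th comma-separated field (the client URL)
#   sed   strips the spaces around it
#   paste joins all lines into one comma-separated string
endpoints_from_member_list() {
  cut -d, -f5 | sed -e 's/ //g' | paste -sd ',' -
}

sample_member_list | endpoints_from_member_list
```

Because the list is built from the member list rather than typed by hand, a member that is registered but down (here 192.168.0.4) is still probed and shows up as unhealthy.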

3. The etcd log on the affected etcd host contains errors

2022-01-14 21:11:44.560318 W | etcdserver: failed to reach the peerURL(https://192.168.0.2:2380) of member d5a8f8671df6bb3b (Get https://192.168.0.2:2380/version: dial tcp 192.168.0.2:2380: connect: no route to host)
2022-01-14 21:11:44.560368 W | etcdserver: cannot get the version of member d5a8f8671df6bb3b (Get https://192.168.0.2:2380/version: dial tcp 192.168.0.2:2380: connect: no route to host)
2022-01-14 21:11:47.267996 W | rafthttp: health check for peer d5a8f8671df6bb3b could not connect: dial tcp 192.168.0.2:2380: connect: no route to host (prober "ROUND_TRIPPER_SNAPSHOT")
2022-01-14 21:11:47.296968 W | rafthttp: health check for peer d5a8f8671df6bb3b could not connect: dial tcp 192.168.0.2:2380: connect: no route to host (prober "ROUND_TRIPPER_RAFT_MESSAGE")
2022-01-14 21:11:50.573183 W | etcdserver: failed to reach the peerURL(https://192.168.0.2:2380) of member d5a8f8671df6bb3b (Get https://192.168.0.2:2380/version: dial tcp 192.168.0.2:2380: connect: no route to host)
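
When the log is long, it can help to reduce it to the distinct peer addresses that are reported as unreachable. The following is a rough sketch, not part of the original procedure; the function name `unreachable_peers` is hypothetical, and the sample lines are copied from the log above.

```shell
#!/bin/sh
# Print each distinct peer address that the log reports as "no route to host".
unreachable_peers() {
  grep 'no route to host' \
    | grep -o 'dial tcp [0-9.]*:[0-9]*' \
    | awk '{print $3}' \
    | sort -u
}

# Two sample lines taken from the log above.
sample_log() {
cat <<'EOF'
2022-01-14 21:11:44.560368 W | etcdserver: cannot get the version of member d5a8f8671df6bb3b (Get https://192.168.0.2:2380/version: dial tcp 192.168.0.2:2380: connect: no route to host)
2022-01-14 21:11:47.267996 W | rafthttp: health check for peer d5a8f8671df6bb3b could not connect: dial tcp 192.168.0.2:2380: connect: no route to host (prober "ROUND_TRIPPER_SNAPSHOT")
EOF
}

sample_log | unreachable_peers
```

On a live host the input would presumably come from something like `docker logs etcd 2>&1 | unreachable_peers`, assuming the container is named etcd as in this cluster.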
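4. According to a post I found, the cause is as follows: in etcd1's unit file /etc/systemd/system/etcd.service, the startup settings have ETCD_INITIAL_CLUSTER_STATE set to new, while ETCD_INITIAL_CLUSTER lists the IP:PORT of etcd2 and etcd3. etcd1 therefore tries to connect to etcd2 and etcd3, but their etcd services have not started yet. So the etcd services on etcd2 and etcd3 need to be started first, and etcd1 afterwards.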
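
To make the explanation from that post concrete, here is a small sketch that lists the peers a member would try to reach according to ETCD_INITIAL_CLUSTER. The two variable names are real etcd settings, but the values, the member names etcd1 to etcd3, their mapping to IP addresses, and the helper `peers_to_contact` are all assumptions made for illustration; the actual configuration of this cluster is not shown in the article.

```shell
#!/bin/sh
# Hypothetical values; the real unit file of this cluster is not shown.
ETCD_INITIAL_CLUSTER_STATE="new"
ETCD_INITIAL_CLUSTER="etcd1=https://192.168.0.2:2380,etcd2=https://192.168.0.3:2380,etcd3=https://192.168.0.4:2380"

# Print the peer URL of every member in ETCD_INITIAL_CLUSTER except the given one.
peers_to_contact() {
  self_name=$1
  printf '%s\n' "$ETCD_INITIAL_CLUSTER" \
    | tr ',' '\n' \
    | grep -v "^${self_name}=" \
    | cut -d= -f2
}

# Peers that etcd1 expects to be reachable on startup.
peers_to_contact etcd1
```

If any of the printed peer URLs is not listening yet, the connection errors described in the post are what one would expect to see until those members come up.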

5. With this in mind, I tried restarting etcd. It worked, and the cluster returned to normal

Sangfor:PaaS/private-master-03-e672f6 ~ o docker restart etcd
etcd
Sangfor:PaaS/private-master-03-e672f6 ~ o docker ps -a | grep etcd
5e5c47f5b434 10.113.67.53/multi-arch/library/sangforpaas/coreos-etcd:v3.3.15-sangfor1 "/usr/local/bin/etcd…" 4 weeks ago Up 3 seconds etcd
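
Right after a restart the member may need a few seconds before it answers health checks, so a single check can fail spuriously. A minimal retry helper could look like the sketch below. `retry_until_ok` is a hypothetical name, and the docker command in the comment assumes that the container is named etcd and that etcdctl inside it is already configured with the right API version and certificates, as the commands in this article suggest.

```shell
#!/bin/sh
# Run a command up to N times, sleeping DELAY seconds between attempts.
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
retry_until_ok() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Intended use on the restarted master (not executed here):
#   retry_until_ok 10 3 docker exec etcd etcdctl endpoint health
# Harmless demonstration with a command that always succeeds:
retry_until_ok 3 0 true && echo "command succeeded"
```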
6. Checking etcd health again shows no problems

Sangfor:PaaS/private-master-01-a7aeca ~ o docker exec -e ETCDCTL_ENDPOINTS=$(docker exec etcd /bin/sh -c "etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ','") etcd etcdctl endpoint health
https://192.168.0.2:2379 is healthy: successfully committed proposal: took = 17.85983ms
https://192.168.0.3:2379 is healthy: successfully committed proposal: took = 17.603519ms
https://192.168.0.4:2379 is healthy: successfully committed proposal: took = 19.782918ms
Sangfor:PaaS/private-master-01-a7aeca ~ o docker exec -e ETCDCTL_ENDPOINTS=$(docker exec etcd /bin/sh -c "etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ','") etcd etcdctl endpoint status --write-out table
+--------------------------+------------------+---------+---------+-----------+-----------+------------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+--------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://192.168.0.4:2379 | 2bfd771eb734cc67 | 3.3.15 | 18 MB | false | 16563 | 14309175 |
| https://192.168.0.3:2379 | 78b2261e379e9a4c | 3.3.15 | 19 MB | false | 16563 | 14309175 |
| https://192.168.0.2:2379 | d5a8f8671df6bb3b | 3.3.15 | 18 MB | true | 16563 | 14309175 |
+--------------------------+------------------+---------+---------+-----------+-----------+------------+
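
The table can also be summarized mechanically: how many members answered, how many claim to be leader, and whether they all report the same RAFT INDEX. The sketch below assumes the column layout shown above; `summarize_status` is a hypothetical helper. On a busy cluster the RAFT INDEX can differ slightly between members at the moment of sampling, so a mismatch is only a hint, not proof of a fault.

```shell
#!/bin/sh
# Summarize `etcdctl endpoint status --write-out table` output read from stdin.
summarize_status() {
  awk -F'|' '
    $2 ~ /:\/\// {                       # data rows: column 2 holds an endpoint URL
      members++
      if ($6 ~ /true/) leaders++         # column 6 is IS LEADER
      idx = $8; gsub(/ /, "", idx)       # column 8 is RAFT INDEX
      seen[idx] = 1
    }
    END {
      n = 0
      for (i in seen) n++
      printf "members=%d leaders=%d distinct_raft_index=%d\n", members, leaders, n
    }'
}

# Data rows copied from the table above.
sample_status() {
cat <<'EOF'
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
| https://192.168.0.4:2379 | 2bfd771eb734cc67 | 3.3.15 | 18 MB | false | 16563 | 14309175 |
| https://192.168.0.3:2379 | 78b2261e379e9a4c | 3.3.15 | 19 MB | false | 16563 | 14309175 |
| https://192.168.0.2:2379 | d5a8f8671df6bb3b | 3.3.15 | 18 MB | true | 16563 | 14309175 |
EOF
}

sample_status | summarize_status
```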

This article refers to the following post:

"Handling the error when etcd fails to start during a Kubernetes binary installation [source code included]", 安享落幕, 51CTO Blog
