Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

etcd v3.5.0 with cherry-pick(pull/13348) still resurrects ancient members, it restarts failed with "member count is unequal"... #13697

Closed
jiapeish opened this issue Feb 15, 2022 · 6 comments
Labels

Comments

@jiapeish
Copy link

jiapeish commented Feb 15, 2022

What happened?

1 etcd node(etcd-node-0) restarts about every 3 seconds, with log "closing etcd server" and "member count is unequal".

We deployed 3 etcd pod in kubernetes. this Monday we performed an rolling update of the etcd pods(only the App above etcd updated, etcd version is both v3.5.0 with fix #13348), but after this update, 1 etcd pod start failed.

Log is like this:

query etcd-node-0:
vm-0:/home/paas # ETCDCTL_API=3 ./etcdctl --endpoints=176.228.0.62:2379 member list -w table
Error: dial tcp 176.228.0.62:2379: getsockopt: connection refused

query etcd-node-1:
vm-0:/home/paas # ETCDCTL_API=3 ./etcdctl --endpoints=176.228.1.7:2379 member list -w table
+------------------+-----------+-------------+--------------------------+-------------------------+
|        ID        |  STATUS   |    NAME     |        PEER ADDRS        |      CLIENT ADDRS       |
+------------------+-----------+-------------+--------------------------+-------------------------+
| 9127ba3ed067c444 |   started | 176.228.1.1 |  http://176.228.1.1:2380 | http://176.228.1.1:2379 |
| 960522629f2e76c3 | unstarted |             | http://176.228.0.62:2380 |                         |
| a3191ee449725555 |   started | 176.228.1.7 |  http://176.228.1.7:2380 | http://176.228.1.7:2379 |
+------------------+-----------+-------------+--------------------------+-------------------------+

query etcd-node-2:
vm-0:/home/paas # ETCDCTL_API=3 ./etcdctl --endpoints=176.228.1.1:2379 member list -w table
+------------------+-----------+--------------+--------------------------+-------------------------+
|        ID        |  STATUS   |     NAME     |        PEER ADDRS        |      CLIENT ADDRS       |
+------------------+-----------+--------------+--------------------------+-------------------------+
| 9127ba3ed067c444 |   started |  176.228.1.1 |  http://176.228.1.1:2380 | http://176.228.1.1:2379 |
| 960522629f2e76c3 | unstarted |              | http://176.228.0.62:2380 |                         |
| a3191ee449725555 |   started |  176.228.1.7 |  http://176.228.1.7:2380 | http://176.228.1.7:2379 |
| ccd3ecb34f88ff13 |   started | 176.228.0.62 | http://176.228.0.62:2380 |                         |
| ed254075ccc97692 |   started |  176.228.1.7 |  http://176.228.1.7:2380 |                         |
+------------------+-----------+--------------+--------------------------+-------------------------+
[2022-02-14 03:16:53.61431936 +0000 GMT m=+1.322807682] [etcdStartup] restart etcd failed: error validating peerURLs {ClusterID:29db5d86d64823d4 Members:[&{ID:9127ba3ed067c444 RaftAttributes:{PeerURLs:[http://176.228.1.1:2380] IsLearner:false} Attributes:{Name:176.228.1.1 ClientURLs:[http://176.228.1.1:2379]}} &{ID:a3191ee449725555 RaftAttributes:{PeerURLs:[http://176.228.1.7:2380] IsLearner:false} Attributes:{Name:176.228.1.7 ClientURLs:[http://176.228.1.7:2379]}} &{ID:ccd3ecb34f88ff13 RaftAttributes:{PeerURLs:[http://176.228.0.62:2380] IsLearner:false} Attributes:{Name:176.228.0.62 ClientURLs:[]}} &{ID:ed254075ccc97692 RaftAttributes:{PeerURLs:[http://176.228.1.7:2380] IsLearner:false} Attributes:{Name:176.228.1.7 ClientURLs:[]}} &{ID:f21dd0ca326247f3 RaftAttributes:{PeerURLs:[http://176.228.0.62:2380] IsLearner:false} Attributes:{Name: ClientURLs:[]}}] RemovedMemberIDs:[]}: 
member count is unequal: ems(5) lms(3)
{"level":"info","ts":"2022-02-14T02:55:16.077Z","caller":"embed/etcd.go:369","msg":"closing etcd server","name":"176.228.0.62","data-dir":"/opt/ETCDdata/","advertise-peer-urls":["http://176.228.0.62:2380"],"advertise-client-urls":["http://176.228.0.62:2379"]}

{"level":"info","ts":"2022-02-14T02:55:16.077Z","caller":"embed/etcd.go:371","msg":"closed etcd server","name":"176.228.0.62","data-dir":"/opt/ETCDdata/","advertise-peer-urls":["http://176.228.0.62:2380"],"advertise-client-urls":["http://176.228.0.62:2379"]}

What did you expect to happen?

We expect the 3 etcd pod updated successfully.

How can we reproduce it (as minimally and precisely as possible)?

We just performed a rolling update for the 3 etcd pod. It's the Kubernetes's manner to operate them.

Anything else we need to know?

We've cherry-picked fix into our etcd cluster.

Etcd version (please run commands below)

$ etcd --version
v3.5.0

$ etcdctl version
etcdctl version: 3.3.6
API version: 3.3

Etcd configuration (command line flags or environment variables)

节点名称

name: 'etcd0'

data-dir: "/opt/ETCDdata/"

listen-peer-urls: "http://0.0.0.0:2380"

listen-client-urls: "http://0.0.0.0:2379"

initial-advertise-peer-urls: "http://0.0.0.0:2380"

initial-cluster: "etcd0=http://0.0.0.0:2380"

initial-cluster-state: "new"

initial-cluster-token: "etcd-cluster"

advertise-client-urls: "http://0.0.0.0:2379"

auto-compaction-mode: "revision"

auto-compaction-retention: "1000"

strict-reconfig-check: false

pre-vote: true

enable-grpc-gateway: true

Etcd debug information (please run commands blow, feel free to obfuscate the IP address or FQDN in the output)

$ vm-0:/home/paas # ETCDCTL_API=3 ./etcdctl --endpoints=176.228.1.7:2379 member list -w table
+------------------+-----------+-------------+--------------------------+-------------------------+
|        ID        |  STATUS   |    NAME     |        PEER ADDRS        |      CLIENT ADDRS       |
+------------------+-----------+-------------+--------------------------+-------------------------+
| 9127ba3ed067c444 |   started | 176.228.1.1 |  http://176.228.1.1:2380 | http://176.228.1.1:2379 |
| 960522629f2e76c3 | unstarted |             | http://176.228.0.62:2380 |                         |
| a3191ee449725555 |   started | 176.228.1.7 |  http://176.228.1.7:2380 | http://176.228.1.7:2379 |
+------------------+-----------+-------------+--------------------------+-------------------------+

query the other node is like this:
vm-0:/home/paas # ETCDCTL_API=3 ./etcdctl --endpoints=176.228.1.1:2379 member list -w table
+------------------+-----------+--------------+--------------------------+-------------------------+
|        ID        |  STATUS   |     NAME     |        PEER ADDRS        |      CLIENT ADDRS       |
+------------------+-----------+--------------+--------------------------+-------------------------+
| 9127ba3ed067c444 |   started |  176.228.1.1 |  http://176.228.1.1:2380 | http://176.228.1.1:2379 |
| 960522629f2e76c3 | unstarted |              | http://176.228.0.62:2380 |                         |
| a3191ee449725555 |   started |  176.228.1.7 |  http://176.228.1.7:2380 | http://176.228.1.7:2379 |
| ccd3ecb34f88ff13 |   started | 176.228.0.62 | http://176.228.0.62:2380 |                         |
| ed254075ccc97692 |   started |  176.228.1.7 |  http://176.228.1.7:2380 |                         |
+------------------+-----------+--------------+--------------------------+-------------------------+


$ etcdctl --endpoints=<member list> endpoint status -w table
Failed to get the status of endpoint 176.228.0.62:2379 (context deadline exceeded)
+------------------+------------------+---------+---------+-----------+-----------+------------+
|     ENDPOINT     |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+------------------+------------------+---------+---------+-----------+-----------+------------+
| 176.228.1.7:2379 | a3191ee449725555 |   3.5.0 |   23 MB |      true |         2 |      62619 |
| 176.228.1.1:2379 | 9127ba3ed067c444 |   3.5.0 |   23 MB |     false |         2 |      62619 |
+------------------+------------------+---------+---------+-----------+-----------+------------+

Relevant log output

No response

@jiapeish
Copy link
Author

jiapeish commented Feb 15, 2022

The failing node's complete log:

{"level":"warn","ts":"2022-02-10T17:37:34.168Z","logger":"raft","caller":"etcdserver/zap_raft.go:85","msg":"ccd3ecb34f88ff13 stepped down to follower since quorum is not active"}
{"level":"info","ts":"2022-02-10T17:37:34.168Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"ccd3ecb34f88ff13 became follower at term 5"}
{"level":"info","ts":"2022-02-10T17:37:34.168Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"raft.node: ccd3ecb34f88ff13 lost leader ccd3ecb34f88ff13 at term 5"}
{"level":"warn","ts":"2022-02-10T17:37:34.168Z","caller":"etcdserver/server.go:1032","msg":"update leader as:","lead-id":"0000000000000000"}
{"level":"info","ts":"2022-02-10T17:37:34.278Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"ccd3ecb34f88ff13 no leader at term 5; dropping index reading msg"}
{"level":"info","ts":"2022-02-10T17:37:34.341Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"ccd3ecb34f88ff13 no leader at term 5; dropping index reading msg"}
{"level":"info","ts":"2022-02-10T17:37:34.341Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"ccd3ecb34f88ff13 [logterm: 5, index: 5195, vote: ccd3ecb34f88ff13] cast MsgPreVote for ed254075ccc97692 [logterm: 5, index: 5195] at term 5"}
{"level":"info","ts":"2022-02-10T17:37:34.341Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"ccd3ecb34f88ff13 [term: 5] received a MsgVote message with higher term from ed254075ccc97692 [term: 6]"}
{"level":"info","ts":"2022-02-10T17:37:34.341Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"ccd3ecb34f88ff13 became follower at term 6"}
{"level":"info","ts":"2022-02-10T17:37:34.341Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"ccd3ecb34f88ff13 [logterm: 5, index: 5195, vote: 0] cast MsgVote for ed254075ccc97692 [logterm: 5, index: 5195] at term 6"}
{"level":"info","ts":"2022-02-10T17:37:34.341Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"raft.node: ccd3ecb34f88ff13 elected leader ed254075ccc97692 at term 6"}
{"level":"warn","ts":"2022-02-10T17:37:34.341Z","caller":"etcdserver/server.go:1032","msg":"update leader as:","lead-id":"ed254075ccc97692"}
{"level":"warn","ts":"2022-02-14T01:42:11.286Z","caller":"rafthttp/stream.go:421","msg":"lost TCP streaming connection with remote peer","stream-reader-type":"stream Message","local-member-id":"ccd3ecb34f88ff13","remote-peer-id":"ed254075ccc97692","error":"unexpected EOF"}
{"level":"warn","ts":"2022-02-14T01:42:11.286Z","caller":"rafthttp/peer_status.go:66","msg":"peer became inactive (message send to peer failed)","peer-id":"ed254075ccc97692","error":"failed to read ed254075ccc97692 on stream Message (unexpected EOF)"}
{"level":"warn","ts":"2022-02-14T01:42:11.286Z","caller":"rafthttp/stream.go:421","msg":"lost TCP streaming connection with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"ccd3ecb34f88ff13","remote-peer-id":"ed254075ccc97692","error":"unexpected EOF"}
{"level":"warn","ts":"2022-02-14T01:42:11.289Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"9127ba3ed067c444","rtt":"403.783µs","error":"dial tcp 176.228.1.1:2380: connect: connection refused"}
{"level":"warn","ts":"2022-02-14T01:42:11.289Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"9127ba3ed067c444","rtt":"700.936µs","error":"dial tcp 176.228.1.1:2380: connect: connection refused"}
{"level":"warn","ts":"2022-02-14T01:42:12.227Z","caller":"etcdserver/v3_server.go:815","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":18380174024408422142,"retry-timeout":"500ms"}
{"level":"info","ts":"2022-02-14T01:42:12.654Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"ccd3ecb34f88ff13 is starting a new election at term 6"}
{"level":"info","ts":"2022-02-14T01:42:12.654Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"ccd3ecb34f88ff13 became pre-candidate at term 6"}
{"level":"info","ts":"2022-02-14T01:42:12.654Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"ccd3ecb34f88ff13 received MsgPreVoteResp from ccd3ecb34f88ff13 at term 6"}
{"level":"info","ts":"2022-02-14T01:42:12.654Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"ccd3ecb34f88ff13 [logterm: 6, index: 63539] sent MsgPreVote request to 9127ba3ed067c444 at term 6"}
{"level":"info","ts":"2022-02-14T01:42:12.654Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"ccd3ecb34f88ff13 [logterm: 6, index: 63539] sent MsgPreVote request to ed254075ccc97692 at term 6"}
{"level":"info","ts":"2022-02-14T01:42:12.654Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"raft.node: ccd3ecb34f88ff13 lost leader ed254075ccc97692 at term 6"}
{"level":"warn","ts":"2022-02-14T01:42:12.654Z","caller":"etcdserver/server.go:1032","msg":"update leader as:","lead-id":"0000000000000000"}
{"level":"warn","ts":"2022-02-14T01:42:12.655Z","caller":"rafthttp/stream.go:223","msg":"lost TCP streaming connection with remote peer","stream-writer-type":"stream Message","local-member-id":"ccd3ecb34f88ff13","remote-peer-id":"ed254075ccc97692"}
{"level":"warn","ts":"2022-02-14T01:42:12.727Z","caller":"etcdserver/v3_server.go:815","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":18380174024408422142,"retry-timeout":"500ms"}
{"level":"info","ts":"2022-02-14T01:42:14.053Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"ccd3ecb34f88ff13 is starting a new election at term 6"}
{"level":"info","ts":"2022-02-14T01:42:14.053Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"ccd3ecb34f88ff13 became pre-candidate at term 6"}
{"level":"info","ts":"2022-02-14T01:42:14.054Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"ccd3ecb34f88ff13 received MsgPreVoteResp from ccd3ecb34f88ff13 at term 6"}
{"level":"info","ts":"2022-02-14T01:42:14.054Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"ccd3ecb34f88ff13 [logterm: 6, index: 63539] sent MsgPreVote request to 9127ba3ed067c444 at term 6"}
{"level":"info","ts":"2022-02-14T01:42:14.054Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"ccd3ecb34f88ff13 [logterm: 6, index: 63539] sent MsgPreVote request to ed254075ccc97692 at term 6"}
{"level":"warn","ts":"2022-02-14T01:42:14.229Z","caller":"etcdserver/v3_server.go:815","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":18380174024408422142,"retry-timeout":"500ms"}

@serathius
Copy link
Member

serathius commented Feb 15, 2022

Please reproduce the problem on official etcd v3.5.2 binaries.

@jiapeish
Copy link
Author

As an etcd user, I think it's an unselfish act to report a bug.
First, this bug maybe not reproducible. It's a cop-out excuse for refusing to debug/fix a serious problem until the user has reproducible steps.
Second, we've adequate logging what's happened. If that is not enough, what about adding more debug information to etcd for maintenance?

@jiapeish
Copy link
Author

jiapeish commented Feb 16, 2022

I've compared the difference between official v3.5.2 and our enhanced/cherry-picked v3.5.0, which have the cherry-pick of such below commits:

  • fix runlock bug
  • lease checkpoint feature
  • set the backend again after recovering v3 backend from snapshot
  • implement a retry logic for auth old revision in the client
  • some other client fix or tests case added
  • server: Ensure that adding and removing members handle storev2 and backend out of sync
  • version: 3.5.0 - base line

So it seems that there is not any big difference in v3.5.2 related to this issue...

@serathius
Copy link
Member

cc @ptabor

@jiapeish
Copy link
Author

jiapeish commented Mar 4, 2022

Close it since it occured only once and there is no further response.

@jiapeish jiapeish closed this as completed Mar 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

No branches or pull requests

2 participants