[BUG] Restore failed! #563

Closed
lixiaoyuner opened this issue Dec 19, 2022 · 3 comments

Labels
kind/bug Bug status/closed Issue is closed (either delivered or triaged)

Comments

@lixiaoyuner

Describe the bug:

I have a k8s cluster with only two masters (vm1 & vm2), which means there are two etcd nodes. I ran the command below on one of the masters to upload backup files:

etcdbrctl snapshot --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key --storage-provider="ABS" --store-container="etcd-backup-test" --schedule "*/1 * * * *" --delta-snapshot-period=10s --max-backups=10 --garbage-collection-policy='LimitBased'

I then destroyed the k8s cluster and tried to use the commands below to restore the two masters:

# on vm1
etcdbrctl restore --store-container="etcd-backup-test" --storage-provider="ABS" --data-dir="/var/lib/etcd" --initial-advertise-peer-urls="https://192.168.122.10:2380" --initial-cluster="vm1=https://192.168.122.10:2380,vm2=https://192.168.122.50:2380" --name="vm1"
# on vm2
etcdbrctl restore --store-container="etcd-backup-test" --storage-provider="ABS" --data-dir="/var/lib/etcd" --initial-advertise-peer-urls="https://192.168.122.50:2380" --initial-cluster="vm1=https://192.168.122.10:2380,vm2=https://192.168.122.50:2380" --name="vm2"

I expected this to only initialize the data directory, but it also tries to boot an etcd server, as shown below:

{"level":"info","ts":"2022-12-19T14:58:10.014Z","caller":"raft/raft.go:811","msg":"aa2f014e32cf986f [logterm: 1, index: 2] sent MsgVote request to 75478cf6d34328ee at term 42"}
{"level":"info","ts":"2022-12-19T14:58:11.130Z","caller":"etcdserver/server.go:1472","msg":"skipped leadership transfer; local server is not leader","local-member-id":"aa2f014e32cf986f","current-leader-member-id":"0"}
{"level":"warn","ts":"2022-12-19T14:58:11.130Z","caller":"etcdserver/server.go:2066","msg":"failed to publish local member to cluster through raft","local-member-id":"aa2f014e32cf986f","local-member-attributes":"{Name:default ClientURLs:[http://localhost:0]}","request-path":"/0/members/aa2f014e32cf986f/attributes","publish-timeout":"7s","error":"etcdserver: request cancelled"}
{"level":"warn","ts":"2022-12-19T14:58:11.130Z","caller":"etcdserver/server.go:2066","msg":"failed to publish local member to cluster through raft","local-member-id":"aa2f014e32cf986f","local-member-attributes":"{Name:default ClientURLs:[http://localhost:0]}","request-path":"/0/members/aa2f014e32cf986f/attributes","publish-timeout":"7s","error":"etcdserver: request cancelled"}
{"level":"warn","ts":"2022-12-19T14:58:11.130Z","caller":"etcdserver/server.go:2052","msg":"stopped publish because server is stopped","local-member-id":"aa2f014e32cf986f","local-member-attributes":"{Name:default ClientURLs:[http://localhost:0]}","publish-timeout":"7s","error":"etcdserver: server stopped"}
{"level":"info","ts":"2022-12-19T14:58:11.130Z","caller":"rafthttp/peer.go:333","msg":"stopping remote peer","remote-peer-id":"75478cf6d34328ee"}
{"level":"warn","ts":"2022-12-19T14:58:11.130Z","caller":"rafthttp/stream.go:301","msg":"stopped TCP streaming connection with remote peer","stream-writer-type":"unknown stream","remote-peer-id":"75478cf6d34328ee"}
{"level":"warn","ts":"2022-12-19T14:58:11.130Z","caller":"rafthttp/stream.go:301","msg":"stopped TCP streaming connection with remote peer","stream-writer-type":"unknown stream","remote-peer-id":"75478cf6d34328ee"}
{"level":"info","ts":"2022-12-19T14:58:11.130Z","caller":"rafthttp/pipeline.go:86","msg":"stopped HTTP pipelining with remote peer","local-member-id":"aa2f014e32cf986f","remote-peer-id":"75478cf6d34328ee"}
{"level":"info","ts":"2022-12-19T14:58:11.130Z","caller":"rafthttp/stream.go:459","msg":"stopped stream reader with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"aa2f014e32cf986f","remote-peer-id":"75478cf6d34328ee"}
{"level":"info","ts":"2022-12-19T14:58:11.130Z","caller":"rafthttp/stream.go:459","msg":"stopped stream reader with remote peer","stream-reader-type":"stream Message","local-member-id":"aa2f014e32cf986f","remote-peer-id":"75478cf6d34328ee"}
{"level":"info","ts":"2022-12-19T14:58:11.130Z","caller":"rafthttp/peer.go:340","msg":"stopped remote peer","remote-peer-id":"75478cf6d34328ee"}
{"level":"info","ts":"2022-12-19T14:58:11.134Z","caller":"embed/etcd.go:363","msg":"closing etcd server","name":"default","data-dir":"/root/etcd.default","advertise-peer-urls":["http://localhost:0"],"advertise-client-urls":["http://localhost:0"]}

How can I correctly initialize the data directory only?

Environment (please complete the following information):

  • Etcd version/commit ID :
  • Etcd-backup-restore version/commit ID:
  • Cloud Provider [All/AWS/GCS/ABS/Swift/OSS]:

Anything else we need to know?:

@lixiaoyuner lixiaoyuner added the kind/bug Bug label Dec 19, 2022
@shreyas-s-rao
Collaborator

@lixiaoyuner I don't understand how you're running an etcd cluster with only two members. Also, you cannot restore the same backup for two different members of the same etcd cluster - this will not work due to a mismatch in the metadata of the etcd members. Instead, you need to restore one member, start etcd for this first member (which will be the etcd cluster leader) and wait for it to become healthy, and then simply start the second member as a learner, while ensuring that the clustering configuration is provided correctly. This allows the second member to join as a non-voting follower and sync its data with the leader. Once this is successful, you can promote the learner to a voting member (see the sketch below).
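A minimal sketch of that sequence, reusing the peer URLs, member names and container name from the issue above. The etcdctl endpoints, TLS options and the <MEMBER_ID_OF_VM2> placeholder are illustrative and must be adapted to your setup; this is not a verbatim procedure.

# 1) Restore only the first member (vm1) from the backup, with only vm1 in --initial-cluster
etcdbrctl restore --store-container="etcd-backup-test" --storage-provider="ABS" --data-dir="/var/lib/etcd" --initial-advertise-peer-urls="https://192.168.122.10:2380" --initial-cluster="vm1=https://192.168.122.10:2380" --name="vm1"

# 2) Start etcd on vm1 and wait for it to report healthy
etcdctl --endpoints=https://192.168.122.10:2379 endpoint health

# 3) From vm1, register vm2 as a learner (non-voting member)
etcdctl --endpoints=https://192.168.122.10:2379 member add vm2 --learner --peer-urls="https://192.168.122.50:2380"

# 4) Start etcd on vm2 with an empty data directory and --initial-cluster-state=existing,
#    so that it syncs its data from the leader instead of being restored from the backup

# 5) Once vm2 has caught up, promote it to a voting member
etcdctl --endpoints=https://192.168.122.10:2379 member promote <MEMBER_ID_OF_VM2>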

etcd-backup-restore was built as a backup-restore tool primarily for single-node etcd clusters. To make it work for multi-node etcd clusters, we have worked extensively on gardener/etcd-druid#107 - etcd-druid is an operator, i.e. an external manager that is necessary to orchestrate the lifecycle of each etcd member, because it is not possible for an etcd member to manage the lifecycle of the cluster from within.

@shreyas-s-rao
Collaborator

Closing this issue for now. Feel free to re-open it if you feel there's a bug in the way etcd-backup-restore works.
/close

@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Dec 20, 2022
@ishan16696
Member

The problem with a cluster of 2 members is that if 1 member goes down, you end up losing quorum. The learner approach will also not work in this case, because to add a learner you must have a leader in the cluster, but quorum has already been lost.
I recommend using an odd number of members in the etcd cluster, say 3. In etcd-druid, we enforce the use of an odd number of members in the etcd cluster.
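For reference, the quorum arithmetic behind this recommendation is quorum(n) = floor(n/2) + 1:

# quorum(2) = 2 -> a 2-member cluster tolerates 0 failures; losing 1 member loses quorum
# quorum(3) = 2 -> a 3-member cluster keeps quorum with 1 member down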
