[BUG] Restore failed! #563

Closed
lixiaoyuner opened this issue Dec 19, 2022 · 3 comments

Labels
kind/bug Bug status/closed Issue is closed (either delivered or triaged)

Comments

@lixiaoyuner

Describe the bug:

I have a k8s cluster with only two masters (vm1 & vm2), which means there are two etcd nodes. I ran the command below on one of the masters to upload backup files:

etcdbrctl snapshot --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key --storage-provider="ABS" --store-container="etcd-backup-test" --schedule "*/1 * * * *" --delta-snapshot-period=10s --max-backups=10 --garbage-collection-policy='LimitBased'

I then destroyed the k8s cluster and tried to use the commands below to restore the two masters:

# on vm1
etcdbrctl restore --store-container="etcd-backup-test" --storage-provider="ABS" --data-dir="/var/lib/etcd" --initial-advertise-peer-urls="https://192.168.122.10:2380" --initial-cluster="vm1=https://192.168.122.10:2380,vm2=https://192.168.122.50:2380" --name="vm1"
# on vm2
etcdbrctl restore --store-container="etcd-backup-test" --storage-provider="ABS" --data-dir="/var/lib/etcd" --initial-advertise-peer-urls="https://192.168.122.50:2380" --initial-cluster="vm1=https://192.168.122.10:2380,vm2=https://192.168.122.50:2380" --name="vm2"

I expected this to only initialize the data directory, but it also tries to boot an etcd server, as shown below:

{"level":"info","ts":"2022-12-19T14:58:10.014Z","caller":"raft/raft.go:811","msg":"aa2f014e32cf986f [logterm: 1, index: 2] sent MsgVote request to 75478cf6d34328ee at term 42"}
{"level":"info","ts":"2022-12-19T14:58:11.130Z","caller":"etcdserver/server.go:1472","msg":"skipped leadership transfer; local server is not leader","local-member-id":"aa2f014e32cf986f","current-leader-member-id":"0"}
{"level":"warn","ts":"2022-12-19T14:58:11.130Z","caller":"etcdserver/server.go:2066","msg":"failed to publish local member to cluster through raft","local-member-id":"aa2f014e32cf986f","local-member-attributes":"{Name:default ClientURLs:[http://localhost:0]}","request-path":"/0/members/aa2f014e32cf986f/attributes","publish-timeout":"7s","error":"etcdserver: request cancelled"}
{"level":"warn","ts":"2022-12-19T14:58:11.130Z","caller":"etcdserver/server.go:2066","msg":"failed to publish local member to cluster through raft","local-member-id":"aa2f014e32cf986f","local-member-attributes":"{Name:default ClientURLs:[http://localhost:0]}","request-path":"/0/members/aa2f014e32cf986f/attributes","publish-timeout":"7s","error":"etcdserver: request cancelled"}
{"level":"warn","ts":"2022-12-19T14:58:11.130Z","caller":"etcdserver/server.go:2052","msg":"stopped publish because server is stopped","local-member-id":"aa2f014e32cf986f","local-member-attributes":"{Name:default ClientURLs:[http://localhost:0]}","publish-timeout":"7s","error":"etcdserver: server stopped"}
{"level":"info","ts":"2022-12-19T14:58:11.130Z","caller":"rafthttp/peer.go:333","msg":"stopping remote peer","remote-peer-id":"75478cf6d34328ee"}
{"level":"warn","ts":"2022-12-19T14:58:11.130Z","caller":"rafthttp/stream.go:301","msg":"stopped TCP streaming connection with remote peer","stream-writer-type":"unknown stream","remote-peer-id":"75478cf6d34328ee"}
{"level":"warn","ts":"2022-12-19T14:58:11.130Z","caller":"rafthttp/stream.go:301","msg":"stopped TCP streaming connection with remote peer","stream-writer-type":"unknown stream","remote-peer-id":"75478cf6d34328ee"}
{"level":"info","ts":"2022-12-19T14:58:11.130Z","caller":"rafthttp/pipeline.go:86","msg":"stopped HTTP pipelining with remote peer","local-member-id":"aa2f014e32cf986f","remote-peer-id":"75478cf6d34328ee"}
{"level":"info","ts":"2022-12-19T14:58:11.130Z","caller":"rafthttp/stream.go:459","msg":"stopped stream reader with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"aa2f014e32cf986f","remote-peer-id":"75478cf6d34328ee"}
{"level":"info","ts":"2022-12-19T14:58:11.130Z","caller":"rafthttp/stream.go:459","msg":"stopped stream reader with remote peer","stream-reader-type":"stream Message","local-member-id":"aa2f014e32cf986f","remote-peer-id":"75478cf6d34328ee"}
{"level":"info","ts":"2022-12-19T14:58:11.130Z","caller":"rafthttp/peer.go:340","msg":"stopped remote peer","remote-peer-id":"75478cf6d34328ee"}
{"level":"info","ts":"2022-12-19T14:58:11.134Z","caller":"embed/etcd.go:363","msg":"closing etcd server","name":"default","data-dir":"/root/etcd.default","advertise-peer-urls":["http://localhost:0"],"advertise-client-urls":["http://localhost:0"]}

How can I correctly initialize the data directory only?

Environment (please complete the following information):

  • Etcd version/commit ID :
  • Etcd-backup-restore version/commit ID:
  • Cloud Provider [All/AWS/GCS/ABS/Swift/OSS]:

Anything else we need to know?:

@lixiaoyuner lixiaoyuner added the kind/bug Bug label Dec 19, 2022
@shreyas-s-rao
Collaborator

@lixiaoyuner I don't understand how you're running an etcd cluster with only two members. Also, you cannot restore the same backup for two different members of the same etcd cluster - this will not work due to a mismatch in the metadata of the etcd members. Instead, you need to restore one member, start etcd for this first member (which will be the etcd cluster leader) and wait for it to become healthy, and then simply start the second member as a learner, while ensuring that the clustering configuration is provided correctly. This allows the second member to join as a non-voting follower and sync its data with the leader. Once this is successful, you can promote the learner to a voting member (see the sketch below).
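A minimal sketch of that sequence, reusing the peer URLs, member names and container name from the issue above. The etcdctl endpoints, TLS options and the <MEMBER_ID_OF_VM2> placeholder are illustrative and must be adapted to your setup; this is not a verbatim procedure.

# 1) Restore only the first member (vm1) from the backup, with only vm1 in --initial-cluster
etcdbrctl restore --store-container="etcd-backup-test" --storage-provider="ABS" --data-dir="/var/lib/etcd" --initial-advertise-peer-urls="https://192.168.122.10:2380" --initial-cluster="vm1=https://192.168.122.10:2380" --name="vm1"

# 2) Start etcd on vm1 and wait for it to report healthy
etcdctl --endpoints=https://192.168.122.10:2379 endpoint health

# 3) From vm1, register vm2 as a learner (non-voting member)
etcdctl --endpoints=https://192.168.122.10:2379 member add vm2 --learner --peer-urls="https://192.168.122.50:2380"

# 4) Start etcd on vm2 with an empty data directory and --initial-cluster-state=existing,
#    so that it syncs its data from the leader instead of being restored from the backup

# 5) Once vm2 has caught up, promote it to a voting member
etcdctl --endpoints=https://192.168.122.10:2379 member promote <MEMBER_ID_OF_VM2>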

etcd-backup-restore was built as a backup-restore tool primarily for single-node etcd clusters. To make it work for multi-node etcd clusters, we have worked extensively on gardener/etcd-druid#107 - etcd-druid is an operator, i.e. an external manager that is necessary to orchestrate the lifecycle of each etcd member, because it is not possible for an etcd member to manage the lifecycle of the cluster from within.

@shreyas-s-rao
Collaborator

Closing this issue for now. Feel free to re-open it if you feel there's a bug in the way etcd-backup-restore works.
/close

@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Dec 20, 2022
@ishan16696
Member

The problem with a cluster of 2 members is that if 1 member goes down, you end up losing quorum. The learner approach will also not work in this case, because to add a learner you must have a leader in the cluster, but quorum has already been lost.
I recommend using an odd number of members in the etcd cluster, say 3. In etcd-druid, we enforce the use of an odd number of members in the etcd cluster.
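For reference, the quorum arithmetic behind this recommendation is quorum(n) = floor(n/2) + 1:

# quorum(2) = 2 -> a 2-member cluster tolerates 0 failures; losing 1 member loses quorum
# quorum(3) = 2 -> a 3-member cluster keeps quorum with 1 member down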
