Resilient server cluster in an environment where containers get new IPs #1306
Comments
We're pretty much in the same scenario. We also run Consul in Docker containers (on AWS EC2) and basically want a healthy system in all cases. So when one instance (out of three) goes down, I just want to provision a new one automatically, without any manual interaction. For our other components (especially Cassandra) we've got a good setup, but with Consul we also run into the issues you mentioned (and the ones mentioned in the posts you referenced). I'm going to automate the process of editing the peers.json files and cycling the affected Consul instances. The most annoying thing is that we also use Consul DNS for service discovery, so we might run into a couple of failed lookups if a DNS TTL expiry happens to coincide with a restart. As an addition, I'm exploring how force-leave affects the
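For reference, the raft/peers.json file being edited here is just a JSON array of the servers' Raft (server RPC) addresses, something like the following (addresses are made up):

```json
["10.0.1.10:8300", "10.0.1.11:8300", "10.0.1.12:8300"]
```

Cycling a server after correcting this file means it picks up the fixed peer set when it restarts.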
+1 to track. Running into this right now. Re: automating the editing of peers.json, I've got a setup where, when a node goes down, the replacement likely knows the address of the peer it is replacing (network-attached storage). The script I have sets up a new container, but also issues a "force-leave" on the prior node's IP. My understanding is that this impacts the upper layer but not the Raft layer (peers.json). @josdirksen - how are you using force-leave? I'm trying to use it as mentioned in the prior statement, yet I don't see peers.json updated on the remaining/healthy nodes, and thus run into leader election issues.
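Roughly, the flow looks like this (a sketch; node name, container name, and image are illustrative, and note that force-leave takes the member's node name rather than its IP):

```bash
#!/usr/bin/env bash
# Bring up a replacement Consul server container, then force-leave
# the failed member it replaces.
set -euo pipefail

FAILED_NODE="consul-server-2"                 # Serf node name of the failed member
docker run -d --name consul-server-2-new my-consul-image
consul force-leave "${FAILED_NODE}"
```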
I run the following every five minutes to check whether there are any dead nodes. I do this on each node of the consul cluster, since (if I remember correctly) force-leave states don't automatically propagate between cluster members.
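Roughly, it boils down to this (a sketch, since the exact script isn't shown here; it assumes the stock `consul members` and `consul force-leave` commands):

```bash
#!/usr/bin/env bash
# Find members that Serf reports as "failed" and force-leave them.
# Run periodically (e.g. from cron) on every Consul server.
set -euo pipefail

consul members | awk '$3 == "failed" { print $1 }' | while read -r node; do
  echo "force-leaving failed node: ${node}"
  consul force-leave "${node}"
done
```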
Not that exciting, but it seems to work in our scenario, and keeps the peers.json set up correctly without having to change it manually. But I'll do a double-check based on your comment.
Just a note - see below (sanitised slightly), showing that RequestVote calls are still attempted after the
In addition,
So while I think the

A big nice-to-have would be the ability to remove a Raft peer programmatically (RPC/API calls, a CLI command, the same as most other operations), in a way that it modifies the
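Something along these lines is what I have in mind (the exact command and endpoint below are assumptions for illustration, not something confirmed to exist in the Consul version discussed here):

```bash
# Hypothetical: remove a dead peer from the Raft configuration directly,
# instead of hand-editing peers.json on every healthy server.
consul operator raft remove-peer -address=10.0.1.5:8300

# Or the equivalent HTTP API call against the cluster:
curl -X DELETE "http://127.0.0.1:8500/v1/operator/raft/peer?address=10.0.1.5:8300"
```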
Looks like my request above (re managing
@slackpad are you able to provide any feedback on this one?
Related: #1562 (comment) (3-4 comments there)
@sean- @slackpad - how would you feel about augmenting RemoveFailedNode to support cleaning up the Raft entry in addition to cleaning up the Serf side, for

I wonder if the correct way to do this is to use the Serf layer to broadcast a Raft removal which is then actioned on each node locally? Or potentially hook into the lanNodeFailed/wanNodeFailed hooks as a Plan B? Not 100% sure if these are called on
I have managed to improve the resiliency of my cluster so far with the following:
Instead of just a SIGTERM, I now perform a
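One common shape for this kind of graceful shutdown is a clean leave before the container stops; a sketch, assuming that's the sort of sequence meant here (container name is illustrative):

```bash
# Ask the local agent to leave the cluster gracefully, so the remaining
# servers don't hold on to a dead peer, then stop the container.
consul leave
docker stop consul-server-1
```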
I've been reading through #993 #454 and other posts, and would echo the comment made by @pikeas here:
#993 (comment)
We have had some failures in our test environments, where we've been able to recover the cluster.
We have some specific constraints, which might be unique to us today, but I think are potentially common to a number of implementations going forwards:
We've currently got the following design, and would like some feedback on how valid it is, and how we can improve it:
Our main practical concerns are: