Add ability to leave a cluster. #432

JimtotheB · 2017-04-18T03:58:12Z

I'm using this cookbook on Opsworks with Chef 12. The ability to leave a cluster during the "shutdown" lifecycle event would be helpful.

michaelklishin · 2017-04-18T08:23:20Z

Mind linking to the docs so that we at least can evaluate how unsafe it can potentially be?

majormoses · 2017-04-18T23:14:41Z

https://www.rabbitmq.com/clustering.html

I am thinking that if we want to do anything, it should use reset as opposed to remove, to avoid accidental removal of anything but self:

rabbitmqctl stop_app
rabbitmqctl reset

I am just starting to play with RabbitMQ after a few years of not using it, so I am not 100% sure of the risk, but we are creating a shutdown script with those commands to ensure that when AWS nodes are terminated (say, by an ASG) they are properly removed from the cluster.
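For illustration, the shutdown-script idea above could be sketched as a small POSIX shell function. This is a minimal sketch, not the cookbook's implementation: the function name `leave_cluster` and the `RABBITMQCTL` override variable are hypothetical, and it assumes the node is still running and can reach its peers when the hook fires.

```shell
#!/bin/sh
# Hypothetical shutdown-hook helper; names are illustrative only.
# RABBITMQCTL may be overridden (for example with a stub) for dry runs.
RABBITMQCTL="${RABBITMQCTL:-rabbitmqctl}"

# Remove the local node from its cluster.
# stop_app stops the RabbitMQ application while the Erlang VM keeps running;
# reset then returns the node to a blank state, which also removes it from
# any cluster it belongs to. reset is skipped if stop_app fails.
leave_cluster() {
  "$RABBITMQCTL" stop_app || return 1
  "$RABBITMQCTL" reset
}
```

A shutdown hook would source this file and call `leave_cluster`. Because `reset` wipes the node's local state, this only makes sense for nodes that are being decommissioned or that will rejoin from scratch.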

majormoses · 2017-04-18T23:19:42Z

I can see the value of removing nodes after they have been terminated, but this should be implemented with extreme caution, as RMQ clustering favors consistency over availability, unlike clustered solutions such as Elasticsearch or Consul.

michaelklishin · 2017-04-18T23:49:58Z

It should be done with caution because machines are generally terrible at knowing when something is gone for good vs. temporarily down. A human operator knows better. RabbitMQ is far from being the only tool where node removal must be pretty explicit.

If a node was terminated and that's known for sure, there is no risk in doing anything, but stop_app and reset is not what you are looking for. That will in no way notify the rest of the cluster, and resetting the node itself is pointless: it's already been decommissioned, so you might as well destroy its disk entirely.

michaelklishin · 2017-04-18T23:51:22Z

What you are looking for is rabbitmqctl forget_cluster_node and there are only two questions:

When exactly should it be invoked
What node should invoke it (as a disconnected or terminated node cannot do that) and against what current cluster member
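One possible answer to the second question is to run the command against a node that is still a cluster member, naming the departed node explicitly. The sketch below assumes that arrangement. The function name `forget_node` and the `RABBITMQCTL` override variable are hypothetical; `-n` is rabbitmqctl's option for selecting the node to talk to. It deliberately leaves the first question (when to invoke it) to the caller.

```shell
#!/bin/sh
# Hypothetical helper; names are illustrative only.
# RABBITMQCTL may be overridden (for example with a stub) for dry runs.
RABBITMQCTL="${RABBITMQCTL:-rabbitmqctl}"

# forget_node <surviving_member> <departed_node>
# Asks a current cluster member to drop a node that is gone for good.
forget_node() {
  member="$1"
  departed="$2"
  if [ -z "$member" ] || [ -z "$departed" ]; then
    echo "usage: forget_node <surviving_member> <departed_node>" >&2
    return 2
  fi
  # The departed node cannot run this itself, so the two must differ.
  if [ "$member" = "$departed" ]; then
    echo "refusing: the departed node cannot be the one that forgets it" >&2
    return 2
  fi
  "$RABBITMQCTL" -n "$member" forget_cluster_node "$departed"
}
```

For example, `forget_node rabbit@node1 rabbit@node2` would ask `rabbit@node1` to forget `rabbit@node2`; both node names here are placeholders.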

majormoses · 2017-04-19T00:48:01Z

Apologies for any ignorance with RMQ, as I am picking it up after not using it for several years. We are in the beginning stages of our RMQ deployment, so I would certainly love to hear others' thoughts on the matter.

@michaelklishin to be clear, I mean a normal shutdown/restart via a shutdown script; we remove the node and rejoin on startup, as it could be gone for some amount of time (or never come back at all) and will not be trusted when it comes back up, since it will have outdated information. Is this not safe? I would imagine that monitoring and alerting are your friends to prevent you from dropping below a threshold. If all your machines shut down or restart at once, I suppose this could be a DR nightmare (though at that point you probably have much larger problems to address) if you do not have any sort of playback on messages.

Nodes that are abruptly terminated are a different story and need forget_cluster_node (sorry, I referred to it as remove earlier since I didn't consult the documentation). I am thinking that if we build this at all, it should be a library-only function and should not be included in any community cookbook recipes. This is, as you pointed out, something that could be very tricky and dangerous to use, and therefore when and how to use it should fall to the wrapper cookbook. I would not recommend doing this at all without something like a lifecycle hook, a CloudWatch event rule, etc. that will definitively tell us that the machine will never come back. This feels like it needs a solution outside of Chef and is more along the lines of what https://github.com/eheydrick/aws-cleaner solves.

JimtotheB · 2017-04-19T01:29:42Z

@michaelklishin These are the Opsworks lifecycle hooks, where you can specify Chef recipes to run. http://docs.aws.amazon.com/opsworks/latest/userguide/workingcookbook-events.html

We have 30 or so instances across 6 different Opsworks stacks, and the only time I ever shell into any of them is to manually remove a server from the cluster. We don't hand-manage any of our instances; if there is a problem, we just crumple it up and bring up a new one.

Is it even possible for a cluster member to say "Hey I don't want to be a part of this cluster anymore?"

majormoses · 2017-04-19T01:40:07Z

@JimtotheB yes there are 2 ways:
remove self from cluster:

rabbitmqctl stop_app
rabbitmqctl reset

from another node in the cluster:

rabbitmqctl forget_cluster_node <nodename>

JimtotheB · 2017-04-19T02:04:43Z

@majormoses I was aware of your second example, as that's what I have been doing manually on our RMQ instances. Barring the addition this issue is asking for, I'm just going to write a shell recipe that calls your first example directly during the shutdown hook on Opsworks.

With the way we do our deployments and load balancing, there is zero chance of a stopped instance ever coming back again, but I can't speak for anyone else's use case here.

majormoses · 2017-04-19T16:06:20Z

I think we should provide both scenarios as options as library helper methods and leave the implementation of how and when up to the end user, @michaelklishin thoughts? If this sounds good I can put a PR together for this.

This adds some helper functions to allow a wrapper cookbook to implement removing a node from a cluster in 2 scenarios:
- removing self from cluster (helpful for normal decommission)
- removing any node from cluster (helpful for abruptly terminated machines)

Neither should actually be consumed in this cookbook; they are simply provided for wrappers to use as they see fit. This should be implemented only after careful thought, design, testing, etc.

michaelklishin · 2017-04-19T19:45:19Z

Having library functions would be a good start.

majormoses · 2017-04-28T20:53:30Z

@JimtotheB let me know if you have any issues

JimtotheB · 2017-04-28T23:23:38Z

@majormoses Thanks for the alert. I'm just about to come back around to this in the next couple of days. I'll let you know if I run into any issues.

michaelklishin mentioned this issue Apr 19, 2017

closes #432 #433

Merged

jjasghar closed this as completed in 4556fa9 Apr 28, 2017

jjasghar pushed a commit that referenced this issue Apr 28, 2017

Merge pull request #433 from majormoses/feature/rmq-cluster-leave

a160903

closes #432
