Add ability to leave a cluster. #432

Closed
JimtotheB opened this issue Apr 18, 2017 · 13 comments

@JimtotheB

JimtotheB commented Apr 18, 2017

We use this cookbook on OpsWorks with Chef 12. The ability to leave a cluster during the "shutdown" lifecycle event would be helpful.

@michaelklishin
Member

Mind linking to the docs so that we can at least evaluate how unsafe it could be?

@majormoses
Contributor

majormoses commented Apr 18, 2017

See https://www.rabbitmq.com/clustering.html for the docs. I am thinking that if we want to do anything, it should use reset as opposed to remove, to avoid accidental removal of anything but self:

rabbitmqctl stop_app
rabbitmqctl reset

I am just starting to play with RabbitMQ again after a few years of not using it, so I am not 100% sure of the risk. We are creating a shutdown script with those commands to ensure that when AWS nodes are terminated (say, by an ASG) they are properly removed from the cluster.
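
A minimal sketch of such a shutdown script, assuming `rabbitmqctl` is on the PATH and the script runs as a user allowed to manage the broker. The `RABBITMQCTL` variable and the `leave_cluster` function name are hypothetical, introduced here only so the commands can be dry-run; they are not part of the cookbook.

```shell
#!/bin/sh
# Sketch: have the local node leave its cluster during a graceful shutdown.
# RABBITMQCTL is a hypothetical override so the commands can be dry-run.
RABBITMQCTL="${RABBITMQCTL:-rabbitmqctl}"

leave_cluster() {
  # Stop the RabbitMQ application but keep the Erlang VM running;
  # reset only works while the application is stopped.
  "$RABBITMQCTL" stop_app || return 1
  # reset returns the node to a blank state: it leaves any cluster
  # it belongs to and wipes its local data.
  "$RABBITMQCTL" reset || return 1
}
```

Calling `leave_cluster` from the shutdown hook runs both steps and stops at the first failure. Setting `RABBITMQCTL=echo` before the call prints the commands instead of running them, which may help when testing the hook.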

@majormoses
Contributor

I can see the value of removing nodes after they have been terminated, but this should be implemented with extreme caution, as RMQ clustering favors consistency over availability, unlike clustered solutions such as Elasticsearch or Consul.

@michaelklishin
Member

It should be done with caution because machines are generally terrible at knowing when something is gone for good vs. temporarily down. A human operator knows better. RabbitMQ is far from being the only tool where node removal must be pretty explicit.

If a node was terminated and that's known for sure, there is no risk in doing anything, but stop_app and reset is not what you are looking for. That will in no way notify the rest of the cluster, and resetting the node itself is pointless: it's already been decommissioned, so you might as well destroy its disk entirely.

@michaelklishin
Member

What you are looking for is rabbitmqctl forget_cluster_node, and there are only two questions:

  • When exactly should it be invoked?
  • What node should invoke it (as a disconnected or terminated node cannot do that), and against what current cluster member?
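
One possible answer to the second question, as a hedged sketch: a surviving cluster member runs `forget_cluster_node` and names the dead node as the target. The `forget_node` function, the `RABBITMQCTL` override, and the node names are hypothetical; the caller is assumed to know both its own node name and that the target is gone for good.

```shell
#!/bin/sh
# Sketch: remove a permanently gone node, run on a surviving cluster member.
# RABBITMQCTL is a hypothetical override so the command can be dry-run.
RABBITMQCTL="${RABBITMQCTL:-rabbitmqctl}"

forget_node() {
  dead_node="$1"    # node to remove, e.g. rabbit@old-host (hypothetical name)
  local_node="$2"   # optional: this member's own node name, supplied by the caller
  if [ -z "$dead_node" ]; then
    echo "usage: forget_node <dead-node> [local-node]" >&2
    return 2
  fi
  # forget_cluster_node is issued by a live member against an offline node,
  # so the target must never be the node running this function.
  if [ "$dead_node" = "$local_node" ]; then
    echo "refusing to forget the local node $local_node" >&2
    return 1
  fi
  "$RABBITMQCTL" forget_cluster_node "$dead_node"
}
```

For example, `forget_node rabbit@old-host rabbit@this-host` on a healthy member. The sketch deliberately leaves the first question (when to invoke it) to the caller.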

@majormoses
Contributor

Apologies for any ignorance about RMQ, as I am picking it up after not using it for several years. We are in the beginning stages of our RMQ deployment, so I would certainly love to hear others' thoughts on the matter.

@michaelklishin to be clear, I mean a normal shutdown/restart via a shutdown script. We are removing the node and rejoining on startup, since it could be gone for some amount of time (or never come back at all) and will not be trusted when it comes back up, as it will have outdated information. Is this not safe? I would imagine that monitoring and alerting are your friends for preventing the cluster from dropping below a threshold. If all your machines shut down or restart at once, I suppose this could be a DR nightmare if you do not have any sort of playback on messages (though at that point you probably have much larger problems to address).
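
The rejoin-on-startup half could look roughly like this. It is a sketch that assumes the node was already reset before shutdown and that a running cluster member is reachable; `rejoin_cluster`, the `RABBITMQCTL` override, and the seed node name are hypothetical.

```shell
#!/bin/sh
# Sketch: rejoin a cluster on startup, assuming the node was reset on shutdown.
# RABBITMQCTL is a hypothetical override so the commands can be dry-run.
RABBITMQCTL="${RABBITMQCTL:-rabbitmqctl}"

rejoin_cluster() {
  seed_node="$1"   # a running cluster member, e.g. rabbit@seed-host (hypothetical name)
  if [ -z "$seed_node" ]; then
    echo "usage: rejoin_cluster <seed-node>" >&2
    return 2
  fi
  # join_cluster requires the local RabbitMQ application to be stopped.
  "$RABBITMQCTL" stop_app || return 1
  "$RABBITMQCTL" join_cluster "$seed_node" || return 1
  # Start the application again once the node is a cluster member.
  "$RABBITMQCTL" start_app
}
```

How the seed node is chosen (a fixed name, service discovery, or something else) is left open here, since the thread does not say.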

For nodes that are abruptly terminated this is a different story, and we need to use forget_cluster_node (sorry, I referred to it as remove earlier since I didn't consult the documentation). I am thinking that if we build this at all, it should be a library-only function and should not be included in any community cookbook recipes. This, as you pointed out, is something that could be very tricky and dangerous to use, so when and how to use it should fall to the wrapper cookbook. I would not recommend doing this at all without something like a lifecycle hook or a CloudWatch event rule that will definitively tell us that the machine will never come back. This feels like it needs a solution outside of Chef, more along the lines of what https://github.com/eheydrick/aws-cleaner solves.

@JimtotheB
Author

@michaelklishin These are the Opsworks lifecycle hooks, where you can specify Chef recipes to run. http://docs.aws.amazon.com/opsworks/latest/userguide/workingcookbook-events.html

We have 30 or so instances across 6 different OpsWorks stacks, and the only time I ever shell into any of them is to manually remove a server from the cluster. We don't hand-manage any of our instances; if there is a problem, we just crumple it up and bring up a new one.

Is it even possible for a cluster member to say "Hey I don't want to be a part of this cluster anymore?"

@majormoses
Contributor

majormoses commented Apr 19, 2017

@JimtotheB yes, there are 2 ways.
To remove the node itself from the cluster, run on that node:

rabbitmqctl stop_app
rabbitmqctl reset

To remove it from another node in the cluster:

rabbitmqctl forget_cluster_node <nodename>

@JimtotheB
Author

@majormoses I was aware of your second example, as that's what I have been doing manually on our RMQ instances. Barring the addition this issue is asking for, I'm just going to write a shell recipe that calls your first example directly during the shutdown hook on OpsWorks.

With the way we do our deployments and load balancing, there is zero chance of a stopped instance ever coming back again, but I can't speak for anyone else's use case here.

@majormoses
Contributor

I think we should provide both scenarios as library helper methods and leave how and when to use them up to the end user. @michaelklishin, thoughts? If this sounds good, I can put a PR together for this.

majormoses added a commit to majormoses/chef-cookbook that referenced this issue Apr 19, 2017
This adds some helper functions to allow a wrapper cookbook to implement removing a node from a cluster in 2 scenarios:
- removing self from cluster (helpful for normal decommission)
- removing any node from cluster (helpful for abruptly terminated machines)

Neither should actually be consumed in this cookbook; both are simply provided for wrappers to use as they see fit. This should be implemented only after careful thought, design, testing, etc.
@michaelklishin
Member

michaelklishin commented Apr 19, 2017

Having library functions would be a good start.

majormoses added a commit to majormoses/chef-cookbook that referenced this issue Apr 28, 2017
@majormoses
Contributor

@JimtotheB let me know if you have any issues

@JimtotheB
Author

@majormoses Thanks for the alert. I'm just about to come back around to this in the next couple of days. I'll let you know if I run into any issues.

jjasghar pushed a commit that referenced this issue Jun 16, 2017