Add ability to leave a cluster. #432
Comments
Mind linking to the docs so that we can at least evaluate how unsafe it can potentially be?
https://www.rabbitmq.com/clustering.html. I am thinking that if we want to do anything, it should use reset as opposed to remove, to avoid accidental removal of anything but self.
I am just starting to play with RabbitMQ again after a few years of not using it, so I am not 100% sure of the risk, but we are creating a shutdown script with that to ensure that when AWS nodes are terminated (say, by an ASG) they are properly removed from the cluster.
I can see the value of removing nodes after they have been terminated, but this should be implemented with extreme caution, as RMQ clustering favors consistency over availability, unlike clustered solutions such as Elasticsearch or Consul.
It should be done with caution because machines are generally terrible at knowing when something is gone for good vs. temporarily down. A human operator knows better. RabbitMQ is far from being the only tool where node removal must be pretty explicit. If a node was terminated and that's known for sure, there is no risk in removing it. What you are looking for is forget_cluster_node.
Apologies for any ignorance with RMQ, as I am picking it up again after not using it for several years. We are in the beginning stages of our RMQ deployment, so I would certainly love to hear others' thoughts on the matter.

@michaelklishin to be clear, I mean a normal shutdown/restart via a shutdown script; we are removing and rejoining on startup because the node could be gone for some amount of time (or never come back at all) and will not be trusted when it comes back up, as it will have outdated information. Is this not safe? I would imagine that monitoring and alerting is your friend to prevent you from dropping below a threshold. If you have all your machines shut down/restart at once, I suppose this could be a DR nightmare (though at that point you probably have much larger problems to address) if you do not have any sort of playback on messages.

For nodes that are abruptly terminated this is a different story, and you need to use forget_cluster_node (sorry, I referred to it as remove earlier since I didn't consult the documentation). I am thinking that if we build this at all, it should be a library-only function and should not be included in any community cookbook recipes. This is, as you pointed out, something that could be very tricky and dangerous to use, and therefore when and how to use it should fall to the wrapper cookbook. I would not recommend doing this at all without something like a lifecycle hook, a CloudWatch event rule, etc. that will definitively tell us that the machine will never come back. This feels like it needs a solution outside of Chef and is more along the lines of https://github.com/eheydrick/aws-cleaner to solve.
@michaelklishin These are the Opsworks lifecycle hooks, where you can specify Chef recipes to run: http://docs.aws.amazon.com/opsworks/latest/userguide/workingcookbook-events.html

We have 30 or so instances across 6 different Opsworks stacks, and the only time I ever shell into any of them is to manually remove a server from the cluster. We don't hand-manage any of our instances; if there is a problem we just crumple it up and bring up a new one. Is it even possible for a cluster member to say "Hey, I don't want to be a part of this cluster anymore"?
@JimtotheB yes, there are two ways: from the node that is leaving itself, or from another node in the cluster.
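A sketch of the two approaches as Chef execute resources (the resource names and the rabbit@old-node node name are placeholders, and both assume rabbitmqctl is on the PATH):

```ruby
# Sketch only: the two removal paths discussed above.

# 1. Run on the node that is leaving: reset drops it from the cluster.
execute 'leave_cluster' do
  command 'rabbitmqctl stop_app && rabbitmqctl reset && rabbitmqctl start_app'
  action :nothing
end

# 2. Run on any surviving member: forget a node that is already stopped.
execute 'forget_node' do
  command 'rabbitmqctl forget_cluster_node rabbit@old-node'
  action :nothing
end
```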
@majormoses I was aware of your second example, as that's what I have been doing manually on our RMQ instances. Barring the addition this issue is asking for, I'm just going to write a shell recipe that calls your first example directly during the shutdown hook on Opsworks. With the way we do our deployments and load balancing, there is zero chance of a stopped instance ever coming back again, but I can't speak for anyone else's use case here.
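A minimal sketch of what such a shutdown recipe might look like, assuming it is mapped to the Opsworks Shutdown lifecycle event (the recipe and guard are illustrative, not part of this cookbook):

```ruby
# Hypothetical wrapper recipe attached to the Opsworks Shutdown lifecycle event.
# Resets the local node so it cleanly leaves the cluster before the instance is terminated.
execute 'leave_rabbitmq_cluster' do
  command 'rabbitmqctl stop_app && rabbitmqctl reset'
  only_if 'rabbitmqctl status' # skip if RabbitMQ is not running on this node
end
```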
I think we should provide both scenarios as library helper methods and leave the implementation of how and when up to the end user. @michaelklishin, thoughts? If this sounds good I can put a PR together for this.
This adds some helper functions to allow a wrapper cookbook to implement removing a node from a cluster in 2 scenarios:
- removing self from cluster (helpful for normal decommission)
- removing any node from cluster (helpful for abruptly terminated machines)

These should not actually be consumed in this cookbook, and are simply provided for wrappers to use as they see fit. This should be implemented only after careful thought, design, testing, etc.
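A rough sketch of what such library helpers could look like in a wrapper cookbook (module and method names are illustrative, not the cookbook's actual API):

```ruby
# Hypothetical library file, e.g. libraries/cluster_helpers.rb
require 'chef/mixin/shell_out'

module RabbitMQCluster
  module Helpers
    include Chef::Mixin::ShellOut

    # Scenario 1: remove the local node from its cluster (normal decommission).
    def leave_cluster
      shell_out!('rabbitmqctl stop_app')
      shell_out!('rabbitmqctl reset')
      shell_out!('rabbitmqctl start_app')
    end

    # Scenario 2: from a surviving member, forget a node that was terminated
    # abruptly and will never come back.
    def forget_cluster_node(node_name)
      shell_out!("rabbitmqctl forget_cluster_node #{node_name}")
    end
  end
end
```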
Having library functions would be a good start.
@JimtotheB let me know if you have any issues
@majormoses Thanks for the alert, I'm just about to come back around to this in the next couple of days. I'll let you know if I run into any issues.
We are using this cookbook on Opsworks with Chef 12. Having the ability to leave a cluster during the "shutdown" lifecycle event would be helpful.