Remediate RabbitMQ reset failures #449
We were getting intermittent failures after the erlang cookie was changed. They looked like this:

```
Mixlib::ShellOut::ShellCommandFailed
------------------------------------
Expected process to exit with [0], but received '69'
---- Begin output of rabbitmqctl stop_app && rabbitmqctl reset && rabbitmqctl start_app ----
STDOUT: Stopping rabbit application on node '3f49a593-39c1-4954-9c38-f3e763cb4ee3@ip-10-72-81-113' ...
STDERR: Error: unable to connect to node '3f49a593-39c1-4954-9c38-f3e763cb4ee3@ip-10-72-81-113': nodedown

DIAGNOSTICS
===========

attempted to contact: ['3f49a593-39c1-4954-9c38-f3e763cb4ee3@ip-10-72-81-113']

3f49a593-39c1-4954-9c38-f3e763cb4ee3@ip-10-72-81-113:
  * connected to epmd (port 4369) on ip-10-72-81-113
  * epmd reports: node '3f49a593-39c1-4954-9c38-f3e763cb4ee3' not running at all
                  no other nodes on ip-10-72-81-113
  * suggestion: start the node

current node details:
- node name: 'rabbitmq-cli-19@ip-10-72-81-113'
- home dir: /var/lib/rabbitmq
- cookie hash: WYSyTQI4sAl0fW/1IdQOyQ==
---- End output of rabbitmqctl stop_app && rabbitmqctl reset && rabbitmqctl start_app ----
Ran rabbitmqctl stop_app && rabbitmqctl reset && rabbitmqctl start_app returned
```

The Erlang VM was up but it had not had time to bring up the RabbitMQ app.

This patch adds some retries to the command to give the time needed. Given that we only sometimes saw this error, a minute of retries should be more than enough.

I did not add any tests because I am not sure how to test an intermittent failure.
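For readers who want to see the retry idea in isolation, here is a minimal plain-Ruby sketch. It is not the code from this patch: `with_retries` is a hypothetical helper, and the values of 12 retries with a 5-second delay are assumptions chosen only to add up to roughly the "minute of retries" mentioned above. In a Chef recipe the same effect is normally obtained with the `retries` and `retry_delay` properties on the resource rather than a hand-written loop.

```ruby
# Hypothetical helper, not part of the cookbook: re-run a block until it
# succeeds or the retry budget is exhausted.
# Assumed defaults: 12 retries x 5 seconds, about one minute of waiting.
def with_retries(retries: 12, retry_delay: 5)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue StandardError
    # Out of retries: re-raise the last error unchanged.
    raise if attempts > retries
    sleep retry_delay
    retry
  end
end

# Simulated flaky command: fails twice, as if the RabbitMQ app were still
# starting inside the Erlang VM, then succeeds on the third attempt.
calls = 0
result = with_retries(retries: 12, retry_delay: 0) do |attempt|
  calls += 1
  raise 'nodedown' if attempt < 3
  "ok after #{attempt} attempts"
end
puts result
```

In real use the block would shell out to the `rabbitmqctl stop_app && rabbitmqctl reset && rabbitmqctl start_app` command and raise on a non-zero exit status. The simulated block here also hints at one way to exercise an intermittent failure in a test: make the failure deterministic by counting attempts.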
This will help us with our current race condition problem.
@jjasghar, are you fine with merging this?
@jkugler thank you!
Looks like FoodCritic doesn't like some of the providers, but I didn't change those. Will that prevent a merge?
I'll release this cookbook tomorrow :)
Will this be version 5.1.1?
I'll have to verify the changes, but I'm pretty sure it will be.
OK, sounds good.
Thanks for the new release!
Still having reset failures when recreating all nodes:
STDERR: Error: {no_running_cluster_nodes,"You cannot leave a cluster if no online nodes are present."}
Looks like another timing issue, but may not be related to this fix. In my understanding, RabbitMQ shouldn't care that you tell it to leave a node-less cluster.