Remediate RabbitMQ reset failures #449

Merged: 1 commit, Jun 15, 2017

Conversation

jkugler
Contributor

@jkugler jkugler commented Jun 15, 2017

We were getting intermittent failures after the Erlang cookie was changed.
They looked like this:

```
Mixlib::ShellOut::ShellCommandFailed
------------------------------------
Expected process to exit with [0], but received '69'
---- Begin output of rabbitmqctl stop_app && rabbitmqctl reset && rabbitmqctl start_app ----
STDOUT: Stopping rabbit application on node '3f49a593-39c1-4954-9c38-f3e763cb4ee3@ip-10-72-81-113' ...
STDERR: Error: unable to connect to node '3f49a593-39c1-4954-9c38-f3e763cb4ee3@ip-10-72-81-113': nodedown

DIAGNOSTICS
===========

attempted to contact: ['3f49a593-39c1-4954-9c38-f3e763cb4ee3@ip-10-72-81-113']

3f49a593-39c1-4954-9c38-f3e763cb4ee3@ip-10-72-81-113:
  * connected to epmd (port 4369) on ip-10-72-81-113
  * epmd reports: node '3f49a593-39c1-4954-9c38-f3e763cb4ee3' not running at all
                  no other nodes on ip-10-72-81-113
  * suggestion: start the node

current node details:
- node name: 'rabbitmq-cli-19@ip-10-72-81-113'
- home dir: /var/lib/rabbitmq
- cookie hash: WYSyTQI4sAl0fW/1IdQOyQ==
---- End output of rabbitmqctl stop_app && rabbitmqctl reset && rabbitmqctl start_app ----
Ran rabbitmqctl stop_app && rabbitmqctl reset && rabbitmqctl start_app returned
```

The Erlang VM was up but it had not had time to bring up the RabbitMQ app.
This patch adds some retries to the command to give the time needed.
Given that we only sometimes saw this error, a minute of retries
should be more than enough.
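
For readers unfamiliar with the retry approach, here is a minimal plain-Ruby sketch of the idea. It is not the code from this patch: `with_retries`, `reset_node`, and the values 12 and 5 are hypothetical, chosen only so that the total wait is about a minute. In a Chef recipe the same effect would normally come from the common `retries` and `retry_delay` resource properties rather than a hand-written helper.

```ruby
# Hypothetical helper, not part of the cookbook: run the block, and on failure
# retry it up to `retries` more times, waiting `retry_delay` seconds in between.
# `sleeper` is injectable so the waiting can be stubbed out.
def with_retries(retries:, retry_delay:, sleeper: ->(seconds) { sleep(seconds) })
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue StandardError
    # Give up once the first attempt plus all retries have failed.
    raise if attempts > retries
    sleeper.call(retry_delay)
    retry
  end
end

# The command string is taken from the failure output above.
RESET_COMMAND = 'rabbitmqctl stop_app && rabbitmqctl reset && rabbitmqctl start_app'

# Assumed values: 12 retries * 5 seconds is roughly one minute of waiting.
def reset_node
  with_retries(retries: 12, retry_delay: 5) do
    system(RESET_COMMAND) || raise('rabbitmqctl reset failed')
  end
end
```

Calling `reset_node` would rerun the whole command chain until it exits successfully or the retries run out, which gives the RabbitMQ app time to come up inside the already running Erlang VM.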

I did not add any tests because I am not sure how to test an intermittent failure.
@amulyas
Contributor

amulyas commented Jun 15, 2017

This will help us with the race condition we are currently hitting.

@michaelklishin
Member

@jjasghar fine with you to merge?

@michaelklishin
Member

@jkugler thank you!

@jkugler
Contributor Author

jkugler commented Jun 15, 2017

Looks like FoodCritic doesn't like some of the providers...but I didn't change those. Will that prevent a merge?

https://travis-ci.org/rabbitmq/chef-cookbook/jobs/243461797

@jjasghar jjasghar merged commit 959f2c5 into rabbitmq:master Jun 15, 2017
@jjasghar
Contributor

I'll release this cookbook tomorrow :)

@jkugler jkugler deleted the fix_node_reset_failure branch June 15, 2017 23:24
@jkugler
Contributor Author

jkugler commented Jun 15, 2017

Will this be version 5.1.1?

@jjasghar
Contributor

I'll have to verify the changes, but I'm pretty sure it'll be 5.2.0.

@jkugler
Contributor Author

jkugler commented Jun 15, 2017

OK, sounds good.

@jkugler
Contributor Author

jkugler commented Jun 16, 2017

Thanks for the new release!

@amulyas
Contributor

amulyas commented Jul 26, 2017

Still seeing a reset failure when recreating all nodes:

    ================================================================================
    Error executing action `run` on resource 'execute[reset-node]'
    ================================================================================

    Mixlib::ShellOut::ShellCommandFailed
    ------------------------------------
    Expected process to exit with [0], but received '70'
    ---- Begin output of rabbitmqctl stop_app && rabbitmqctl reset && rabbitmqctl start_app ----
    STDOUT: Stopping rabbit application on node '347ad9c0-05d5-4d46-a97c-77c110753d7f@ip-10-72-80-163'
    Resetting node '347ad9c0-05d5-4d46-a97c-77c110753d7f@ip-10-72-80-163'
    STDERR: Error: {no_running_cluster_nodes,"You cannot leave a cluster if no online nodes are present."}
    ---- End output of rabbitmqctl stop_app && rabbitmqctl reset && rabbitmqctl start_app ----
    Ran rabbitmqctl stop_app && rabbitmqctl reset && rabbitmqctl start_app returned 70

    Cookbook Trace:
    ---------------
    /var/chef/cache/cookbooks/compat_resource/files/lib/chef_compat/monkeypatches/chef/runner.rb:78:in `run_action'

@amulyas
Contributor

amulyas commented Jul 26, 2017

STDERR: Error: {no_running_cluster_nodes,"You cannot leave a cluster if no online nodes are present."}

@jkugler
Contributor Author

jkugler commented Jul 26, 2017

Looks like another timing issue, but may not be related to this fix. In my understanding, RabbitMQ shouldn't care that you tell it to leave a node-less cluster.
