Nodes try to rejoin cluster when first listed node is down #347

ccrebolder · 2016-02-22T18:23:44Z

Regarding https://github.com/jjasghar/rabbitmq/blob/master/providers/cluster.rb#L202: the elsif statement checks whether var_node_name_to_join is part of cluster_status, but I think it should check var_node_name. var_node_name_to_join is just set to the first node name in the array passed into the LWRP.

I discovered that when powering down or stopping the first listed node in node['rabbitmq']['clustering']['cluster_nodes'], the other nodes worked fine until chef-client ran. They would then attempt to rejoin the cluster because the first node was no longer listed in the "running_nodes" output of rabbitmqctl cluster_status, and to rejoin they would try to connect again to the first node, which would fail as it was turned off. This would result in the whole cluster coming down.
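To make the failure concrete, here is a minimal Ruby sketch of the membership check described above. It is not the cookbook's code: the helper `listed_in_running_nodes?` and the node names are made up for illustration, and the running-nodes list is written out by hand rather than parsed from rabbitmqctl.

```ruby
# Hypothetical helper (not from the cookbook): is `name` among the running nodes?
def listed_in_running_nodes?(running_nodes, name)
  running_nodes.include?(name)
end

# Illustrative node names, assumed for this example.
cluster_nodes = ['rabbit@node1', 'rabbit@node2', 'rabbit@node3']
node_to_join  = cluster_nodes.first # always the first listed node
current_node  = 'rabbit@node2'

# What node2 would see in "running_nodes" after node1 has been powered off.
running_nodes = ['rabbit@node2', 'rabbit@node3']

# Checking the join target: false, so the provider would try to rejoin via node1.
checks_join_target = listed_in_running_nodes?(running_nodes, node_to_join)

# Checking the current node: true, so no rejoin would be attempted.
checks_self = listed_in_running_nodes?(running_nodes, current_node)

puts "join target listed:  #{checks_join_target}"
puts "current node listed: #{checks_self}"
```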

Let me know if you'd like more info, or a PR for this.


The call to `joined_cluster?` was passing in the `to_join` node name instead of the current node name. This resulted in the nodes trying to rejoin whenever the `to_join` node was offline. Resolves rabbitmq#347

Rarian · 2016-03-07T20:51:14Z

+1

opsline-radek · 2016-04-10T19:02:24Z

This fix breaks new cluster builds. When a new node comes up with clustering enabled, it is the sole member of its own cluster. The check will therefore always return true, and the node will never be allowed to join another one.

When I run cluster status on a new node I get this:

# rabbitmqctl cluster_status
Cluster status of node 'rabbit@production-rabbitmq-6' ...
[{nodes,[{disc,['rabbit@production-rabbitmq-6']}]},
 {running_nodes,['rabbit@production-rabbitmq-6']},
 {cluster_name,<<"mycluster">>},
 {partitions,[]},
 {alarms,[{'rabbit@production-rabbitmq-6',[]}]}]

Let's say the cluster_nodes list contains production-rabbitmq-5 and production-rabbitmq-6. With the new code, node 6 checks whether it is itself part of the cluster, and according to the output above, it is, so it will never join 5. It should instead check whether node 5 is part of the cluster and, if not, join it. The original code was correct.
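The same point can be checked against the output above with a short Ruby sketch. The `running_nodes` parser below is a hypothetical regex-based helper written for this example, not the cookbook's implementation, and it only handles the output format shown here.

```ruby
# cluster_status output from the fresh node, copied from the comment above.
status_output = <<~OUT
  Cluster status of node 'rabbit@production-rabbitmq-6' ...
  [{nodes,[{disc,['rabbit@production-rabbitmq-6']}]},
   {running_nodes,['rabbit@production-rabbitmq-6']},
   {cluster_name,<<"mycluster">>},
   {partitions,[]},
   {alarms,[{'rabbit@production-rabbitmq-6',[]}]}]
OUT

# Hypothetical parser: pull the quoted node names out of the running_nodes tuple.
def running_nodes(status)
  match = status.match(/\{running_nodes,\[(.*?)\]\}/m)
  return [] unless match

  match[1].scan(/'([^']+)'/).flatten
end

running   = running_nodes(status_output)
self_name = 'rabbit@production-rabbitmq-6'
peer_name = 'rabbit@production-rabbitmq-5'

# A fresh node always lists itself, so a check on its own name passes
# even though it has not joined anything yet.
self_listed = running.include?(self_name)

# A check on the node to join fails, which is what triggers the join.
peer_listed = running.include?(peer_name)

puts "self listed: #{self_listed}, peer listed: #{peer_listed}"
```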

The fix is simple: if the node is down and has been removed from the Chef server, update the cluster_nodes attribute to remove it, or better, make the list dynamic based on a search in a wrapper cookbook.
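As a rough sketch of the dynamic approach, a wrapper cookbook could build the list from search results. Everything here is an assumption for illustration: the helper name `build_cluster_nodes`, the `role:rabbitmq` query, and the name/type shape of each cluster_nodes entry, which should be checked against the cookbook version in use.

```ruby
# Hypothetical helper for a wrapper cookbook: turn search results into a
# cluster_nodes list. The entry shape ('name' and 'type' keys) is assumed.
def build_cluster_nodes(found_nodes, type: 'disc')
  found_nodes
    .map { |n| n['hostname'] }
    .compact
    .uniq
    .sort # stable order, so every node agrees on which entry comes first
    .map { |host| { 'name' => "rabbit@#{host}", 'type' => type } }
end

# In a recipe, the input would come from a Chef search, for example
# (the role name is an assumption):
#   found = search(:node, "role:rabbitmq AND chef_environment:#{node.chef_environment}")
#   node.default['rabbitmq']['clustering']['cluster_nodes'] = build_cluster_nodes(found)

# Stand-in for search results so the sketch runs outside of Chef.
found = [
  { 'hostname' => 'production-rabbitmq-6' },
  { 'hostname' => 'production-rabbitmq-5' }
]

cluster_nodes = build_cluster_nodes(found)
puts cluster_nodes.inspect
```

A node that has been deleted from the Chef server no longer appears in the search results, so it drops out of the list on the next chef-client run.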


ccrebolder mentioned this issue Feb 23, 2016

Fix check for whether node has joined cluster #348

Merged

jjasghar closed this as completed in #348 Mar 8, 2016
