
Bunny doesn't recover from network failure with Timeout::Error #536

Closed
michal-maroody opened this issue Dec 31, 2017 · 3 comments

@michal-maroody
Contributor

We are running a RabbitMQ cluster of 3 nodes on AWS, with queues mirrored to one additional node.
All nodes are behind an ELB.
The ELB health check is configured with an interval of 30 seconds and an unhealthy threshold of 3, and the check targets the HTTP aliveness-test endpoint (meaning it takes the ELB 90 seconds to decide that a node is dead).

The client is a sneakers (v2.6.0) worker using a bunny (v2.7.3) connection.
We connect to the RabbitMQ cluster through a DNS name.
We configured the worker to use automatically_recover=true
and unlimited reconnect attempts.
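
For reference, the connection setup described above is roughly equivalent to the sketch below (the hostname is a hypothetical placeholder for the ELB DNS name; the sneakers wiring is omitted):

    require "bunny"

    # Sketch only: the host stands in for the ELB DNS name in front of the cluster.
    conn = Bunny.new("amqp://rabbit-elb.example.internal",
                     automatically_recover:     true, # reconnect after network failures
                     recovery_attempts:         nil,  # nil = retry indefinitely
                     network_recovery_interval: 5)    # Bunny's default, relevant in step 6 below
    conn.start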

When we have network issues we get this exception:
"Got an exception when receiving data: IO timeout when reading 7 bytes (Timeout::Error)"

Bunny tries to reconnect automatically and then gets the same exception again, but the second time nothing catches it, the reader loop stops, and we lose the worker.

Steps to reproduce (we tested with 3 nodes, but it might work with one, too):

  1. Start a RabbitMQ cluster with 3 nodes, automatic sync, and one mirror per queue,
    using the following policy (see the rabbitmqctl sketch after this list):
    ha-mode: exactly
    ha-params: 2
    ha-sync-mode: automatic
    queue-master-locator: min-masters

  2. Connect all nodes to an ELB with this health check:
    interval: 30 seconds
    unhealthy threshold: 3

  3. Create a sneakers worker (v2.6.0) with bunny (v2.7.3) and start it. This will create a queue on
    RabbitMQ.

  4. Stream data to the queue.

  5. Check which node is the queue's master (shown in the RabbitMQ admin UI -> Queues tab ->
    the queue's node).
    Simulate an unclean shutdown of that master node. We reproduced this by connecting to the
    node via ssh and disconnecting it from the network with the command "ifdown eth0".

  6. Check the sneakers worker's log and see that it no longer receives messages, and that it logs
    the exception "Got an exception when receiving data: IO timeout when reading 7
    bytes (Timeout::Error)".

    Why did we get it?
    Bunny's default network_recovery_interval is 5 seconds, whereas the ELB waits 1.5 minutes
    before it removes the dead node from service (per the health check definition), so the
    reconnect attempts keep going back to the dead node and time out.

  7. Check the RabbitMQ admin UI and see that the queue has already moved to the slave
    node, yet the sneakers worker still cannot reconnect.
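
For step 1, the mirroring policy can be set with rabbitmqctl roughly as follows (the policy name and pattern are placeholders):

    rabbitmqctl set_policy ha-exactly-two "^" \
      '{"ha-mode":"exactly","ha-params":2,"ha-sync-mode":"automatic","queue-master-locator":"min-masters"}' \
      --apply-to queues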

How to work around it?
After reviewing the bunny code, we saw that Session#recover_from_network_failure (https://github.com/ruby-amqp/bunny/blob/master/lib/bunny/session.rb#L745) does not catch Timeout::Error, so the exception propagates up until it reaches the reader loop at https://github.com/ruby-amqp/bunny/blob/master/lib/bunny/reader_loop.rb#L42. Since that code is already inside a rescue clause, it becomes an unhandled exception and, we presume, breaks out of the thread uncleanly.

We did manage to fix this issue by adding Timeout::Error to the rescue inside recover_from_network_failure (https://github.com/ruby-amqp/bunny/blob/master/lib/bunny/session.rb#L745).
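
Conceptually, the change we tried looks like the sketch below. This is not bunny's actual code: the helper name is hypothetical and the exception list is simplified. The point is only that a Timeout::Error raised during a recovery attempt should be retried like any other transient network error rather than escaping into the reader loop thread.

    # Sketch of the idea only, not the upstream implementation.
    def recover_from_network_failure
      reconnect_tcp_and_renegotiate_amqp                 # hypothetical placeholder for the recovery work
    rescue Bunny::NetworkFailure, Timeout::Error => e    # rescuing Timeout::Error is the addition
      sleep network_recovery_interval                    # wait before the next attempt (5s by default)
      retry
    end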

However, we haven't adopted that fix for now, since we don't know what its side effects might be.
So currently our workaround is simply to increase the read_timeout configuration in sneakers; in that case bunny just blocks until the ELB replaces the node.
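
In Bunny terms the workaround amounts to something like the sketch below; the read_timeout option name and the way the value is threaded through sneakers' configuration are assumptions here, and the value just needs to exceed the ~90 seconds the ELB takes to mark a node unhealthy.

    require "bunny"

    # Sketch: a read timeout longer than the ELB's ~90s failover window, so the
    # blocked read outlives the dead node instead of raising Timeout::Error.
    conn = Bunny.new("amqp://rabbit-elb.example.internal",  # placeholder for the ELB DNS name
                     read_timeout: 120,                     # assumes Bunny accepts a read_timeout option
                     automatically_recover: true)
    conn.start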

@michaelklishin
Member

michaelklishin commented Jan 1, 2018

For this particular scenario it makes sense. Catching timeout errors is, in most scenarios, a bad idea that hides a lot of problems in the system as a whole that should not be hidden. Since the code in #537 is only concerned with the TCP connection plus AMQP 0-9-1 connection negotiation, I'm willing to merge it and let time tell us whether there are any side effects we couldn't foresee.

Thanks.

@michaelklishin
Member

@michal-maroody thank you. Feel free to give master a try and report any side effects you may come across.

@michal-maroody
Contributor Author

@michaelklishin Thanks! I will report back on any issues we come across.
