
Bunny doesn't recover from network failure with Timeout::Error #536

Closed
michal-maroody opened this issue Dec 31, 2017 · 3 comments

@michal-maroody
Contributor

We are running a RabbitMQ cluster of 3 nodes on AWS, with queues mirrored to one additional node.
All nodes are behind an ELB.
The ELB health check is configured with an interval of 30 seconds and an unhealthy threshold of 3, and the check targets the HTTP aliveness-test endpoint (meaning it takes the ELB 90 seconds to decide that a node is dead).

The client is a sneakers (v2.6.0) worker using a bunny (v2.7.3) connection.
We connect to the RabbitMQ cluster through a DNS name.
We configured the worker to use automatically_recover=true
and unlimited reconnect attempts.
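
For reference, the connection setup described above is roughly equivalent to the sketch below (the hostname is a hypothetical placeholder for the ELB DNS name; the sneakers wiring is omitted):

    require "bunny"

    # Sketch only: the host stands in for the ELB DNS name in front of the cluster.
    conn = Bunny.new("amqp://rabbit-elb.example.internal",
                     automatically_recover:     true, # reconnect after network failures
                     recovery_attempts:         nil,  # nil = retry indefinitely
                     network_recovery_interval: 5)    # Bunny's default, relevant in step 6 below
    conn.start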

When we have network issues we get this exception:
"Got an exception when receiving data: IO timeout when reading 7 bytes (Timeout::Error)"

Bunny tries to reconnect automatically and then gets the same exception again, but the second time nothing catches it, the reader loop stops, and we lose the worker.

Steps to reproduce (we tested with 3 nodes, but it might work with one, too):

  1. Start a RabbitMQ cluster with 3 nodes, automatic sync, and one mirror per queue,
    using the following policy (see the rabbitmqctl sketch after this list):
    ha-mode: exactly
    ha-params: 2
    ha-sync-mode: automatic
    queue-master-locator: min-masters

  2. Connect all nodes to an ELB with this health check:
    interval: 30 seconds
    unhealthy threshold: 3

  3. Create a sneakers worker (v2.6.0) with bunny (v2.7.3) and start it. This will create a queue on
    RabbitMQ.

  4. Stream data to the queue.

  5. Check which node is the queue's master (shown in the RabbitMQ admin UI -> Queues tab ->
    the queue's node).
    Simulate an unclean shutdown of that master node. We reproduced this by connecting to the
    node via ssh and disconnecting it from the network with the command "ifdown eth0".

  6. Check the sneakers worker's log and see that it no longer receives messages, and that it logs
    the exception "Got an exception when receiving data: IO timeout when reading 7
    bytes (Timeout::Error)".

    Why did we get it?
    Bunny's default network_recovery_interval is 5 seconds, whereas the ELB waits 1.5 minutes
    before it removes the dead node from service (per the health check definition), so the
    reconnect attempts keep going back to the dead node and time out.

  7. Check the RabbitMQ admin UI and see that the queue has already moved to the slave
    node, yet the sneakers worker still cannot reconnect.
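
For step 1, the mirroring policy can be set with rabbitmqctl roughly as follows (the policy name and pattern are placeholders):

    rabbitmqctl set_policy ha-exactly-two "^" \
      '{"ha-mode":"exactly","ha-params":2,"ha-sync-mode":"automatic","queue-master-locator":"min-masters"}' \
      --apply-to queues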

How to work around it?
After reviewing the bunny code, we saw that Session#recover_from_network_failure (https://github.com/ruby-amqp/bunny/blob/master/lib/bunny/session.rb#L745) does not catch Timeout::Error, so the exception propagates up until it reaches the reader loop at https://github.com/ruby-amqp/bunny/blob/master/lib/bunny/reader_loop.rb#L42. Since that code is already inside a rescue clause, it becomes an unhandled exception and, we presume, breaks out of the thread uncleanly.

We did manage to fix this issue by adding Timeout::Error to the rescue inside recover_from_network_failure (https://github.com/ruby-amqp/bunny/blob/master/lib/bunny/session.rb#L745).
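
Conceptually, the change we tried looks like the sketch below. This is not bunny's actual code: the helper name is hypothetical and the exception list is simplified. The point is only that a Timeout::Error raised during a recovery attempt should be retried like any other transient network error rather than escaping into the reader loop thread.

    # Sketch of the idea only, not the upstream implementation.
    def recover_from_network_failure
      reconnect_tcp_and_renegotiate_amqp                 # hypothetical placeholder for the recovery work
    rescue Bunny::NetworkFailure, Timeout::Error => e    # rescuing Timeout::Error is the addition
      sleep network_recovery_interval                    # wait before the next attempt (5s by default)
      retry
    end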

However, we haven't adopted that fix for now, since we don't know what its side effects might be.
So currently our workaround is simply to increase the read_timeout configuration in sneakers; in that case bunny just blocks until the ELB replaces the node.
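
In Bunny terms the workaround amounts to something like the sketch below; the read_timeout option name and the way the value is threaded through sneakers' configuration are assumptions here, and the value just needs to exceed the ~90 seconds the ELB takes to mark a node unhealthy.

    require "bunny"

    # Sketch: a read timeout longer than the ELB's ~90s failover window, so the
    # blocked read outlives the dead node instead of raising Timeout::Error.
    conn = Bunny.new("amqp://rabbit-elb.example.internal",  # placeholder for the ELB DNS name
                     read_timeout: 120,                     # assumes Bunny accepts a read_timeout option
                     automatically_recover: true)
    conn.start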

@michaelklishin
Member

michaelklishin commented Jan 1, 2018

For this particular scenario it makes sense. Catching timeout errors is, in most scenarios, a bad idea that hides a lot of problems in the system as a whole that should not be hidden. Since the code in #537 is only concerned with the TCP connection plus AMQP 0-9-1 connection negotiation, I'm willing to merge it and let time tell us whether there are any side effects we couldn't foresee.

Thanks.

@michaelklishin
Member

@michal-maroody thank you. Feel free to give master a try and report any side effects you may come across.

@michal-maroody
Contributor Author

@michaelklishin Thanks! I will report back on any issues we come across.
