We are running a RabbitMQ cluster of 3 nodes on AWS, with one mirror per queue.
All nodes are behind an ELB.
The ELB health check is configured with an interval of 30 seconds and an unhealthy threshold of 3, and it uses the HTTP aliveness-test (meaning it takes the ELB 90 seconds to decide that a node is dead).
The client is a Sneakers (v2.6.0) worker with a Bunny (v2.7.3) connection.
We connect to the RabbitMQ cluster through DNS.
We configured the worker to use automatically_recover=true and unlimited reconnect attempts.
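For reference, here is a minimal sketch of the Bunny connection options involved; the host name is a placeholder and the Sneakers wiring around the connection is omitted:

```ruby
require "bunny"

# A sketch of the connection options described above; the host is a placeholder
# for the ELB DNS name, and the values shown are Bunny's defaults.
conn = Bunny.new(
  host: "rabbitmq.example.internal",   # ELB DNS name (placeholder)
  automatically_recover: true,          # recover from network failures
  network_recovery_interval: 5          # seconds between recovery attempts (default)
)
conn.start
# Reconnect attempts are unlimited unless a limit is configured explicitly.
```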
When we have network issues we get this exception:
"Got an exception when receiving data: IO timeout when reading 7 bytes (Timeout::Error)"
Bunny tries to reconnect automatically and then hits the same exception again, but the second time nothing catches it, the reader loop stops, and we lose the worker.
Steps to reproduce (we tested with 3 nodes, but it might work with one, too):
1. Start a RabbitMQ cluster with 3 nodes, automatic sync, and one mirror, using the following policy:
   - ha-mode: exactly
   - ha-params: 2
   - ha-sync-mode: automatic
   - queue-master-locator: min-masters
2. Connect all nodes to an ELB with this health check:
   - interval: 30 seconds
   - unhealthy threshold: 3
3. Create a Sneakers worker (v2.6.0) with Bunny (v2.7.3) and start it. This creates a queue on RabbitMQ (a minimal worker and publisher sketch follows this list).
4. Stream data to the queue.
5. Check which node is the queue's master (RabbitMQ admin -> Queues tab -> our queue's node).
6. Simulate an unclean shutdown of the queue's master node. We reproduced this by connecting to that node via SSH and disconnecting it from the network with "ifdown eth0".
7. Check the Sneakers worker's log and see that it no longer receives messages and that it logs the exception "Got an exception when receiving data: IO timeout when reading 7 bytes (Timeout::Error)".
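For steps 3 and 4, here is a minimal sketch of the worker and the publisher we used; the queue and host names are placeholders, and the real worker does more than log the payload:

```ruby
require "sneakers"
require "bunny"

# Step 3: a minimal Sneakers worker; the queue name is a placeholder.
class TestWorker
  include Sneakers::Worker
  from_queue "bunny_timeout_test"

  def work(msg)
    logger.info "received: #{msg}"
    ack!
  end
end

# Step 4: stream test data to the queue through the ELB (host is a placeholder).
conn = Bunny.new(host: "rabbitmq.example.internal")
conn.start
ch = conn.create_channel
100.times { |i| ch.default_exchange.publish("message #{i}", routing_key: "bunny_timeout_test") }
conn.close
```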
Why did we get it?
The default network_recovery_interval is 5 seconds, whereas the ELB waits 1.5 minutes before it removes the dead node from service (per the health check definition above).
Checking the RabbitMQ admin, we saw that the queue we were working on had already moved to the slave node, yet Sneakers still couldn't reconnect.
How to workaround it?
After reviewing the Bunny code, we saw that Bunny::Session#recover_from_network_failure (https://github.com/ruby-amqp/bunny/blob/master/lib/bunny/session.rb#L745) does not catch Timeout::Error, so the exception propagates up to the reader loop at https://github.com/ruby-amqp/bunny/blob/master/lib/bunny/reader_loop.rb#L42. Since that point is already inside a rescue clause, it becomes an unhandled exception, and we presume it breaks out of the thread uncleanly.
We did manage to fix the issue by adding Timeout::Error to the rescue inside recover_from_network_failure.
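The effect of that one-line change can be illustrated with a small self-contained script; this is not Bunny's code, it only mimics the interaction between the recovery routine and the reader thread:

```ruby
require "timeout"

# Demonstration only (NOT Bunny's code): a recovery routine that does not rescue
# Timeout::Error lets the exception escape and kill the calling thread, while
# rescuing it keeps retrying until the broker is reachable again.
def recover_connection(rescue_timeouts:)
  attempts = 0
  begin
    attempts += 1
    # Stand-in for re-opening the TCP/AMQP connection: fails twice, then succeeds.
    raise Timeout::Error, "IO timeout when reading 7 bytes" if attempts < 3
    puts "recovered after #{attempts} attempts"
  rescue Timeout::Error
    raise unless rescue_timeouts  # without the fix, the error escapes the thread
    sleep 0.1                     # stands in for network_recovery_interval
    retry
  end
end

[false, true].each do |rescue_timeouts|
  thread = Thread.new { recover_connection(rescue_timeouts: rescue_timeouts) }
  begin
    thread.join
  rescue Timeout::Error => e
    puts "reader thread died: #{e.message}"
  end
end
```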
However, we didn't keep that fix for now, since we don't know what its side effects would be. So currently our workaround is just to increase the read_timeout configuration in Sneakers; in that case Bunny blocks on the read until the ELB takes the dead node out of service.
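A rough sketch of that workaround, assuming the Bunny connection is built by hand and handed to Sneakers via its connection: option (the host and the 90-second value are illustrative; 90s matches the ELB's 3 x 30s health-check window, and whether read_timeout can be set through Sneakers' own config may depend on the version):

```ruby
require "sneakers"
require "bunny"

# Workaround sketch: give Bunny a read_timeout that outlives the ELB failover
# window, so the read blocks until the ELB stops routing to the dead node.
connection = Bunny.new(
  host: "rabbitmq.example.internal",  # ELB DNS name (placeholder)
  automatically_recover: true,
  read_timeout: 90                    # > 3 x 30s unhealthy threshold
)

Sneakers.configure(connection: connection)
```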
For this particular scenario it makes sense. Catching timeout errors in most scenarios is a bad idea that hides a lot of problems in the system as a whole that should not be hidden. Since the code in #537 is only concerned with the TCP connection + AMQP 0-9-1 connection negotiation, I'm willing to merge it and let time tell us whether there are any side effects we couldn't foresee.