Retry policy backoff with jitter ignored when connection timeout increased from 30s #928
Comments
Hi @hansdals, I'm working on this issue. I'll see what kind of repro I can get by increasing OPTION_CONNECTION_TIMEOUT while using IOTHUB_CLIENT_RETRY_EXPONENTIAL_BACKOFF_WITH_JITTER as the retry policy, as you've done.
One point which may be worth clarifying is how the
Here is an example: the limit would be 0, 1, then 2, 4, 8, 16, 32, 64, ... seconds, give or take a few seconds due to the jitter. Let's say you set the connection timeout to 120 seconds. Initially, this will look like 120 seconds between each retry. This is because on the 1st retry, the minimum time between retries is 0, and 120 seconds is more than 0, so it will retry. On the 2nd retry, the minimum time between retries is 1, and 120 seconds is more than 1, so it will retry. On the 3rd retry, the minimum time between retries is 2, and 120 seconds is more than 2, so it will retry. On the 4th retry... you get the idea.
This will happen until the minimum time between retries has grown to more than 120 seconds. At that point you will start seeing the exponential backoff with jitter take effect. On the 9th retry the minimum time will be approximately 256 seconds, and 120 seconds is less than 256, so after the connection has timed out, the MQTT transport on the client will wait until 256 seconds have elapsed, at which point the retry will be attempted.
@hansdals it's been a while since you last responded to this thread, so we will close it. If you have more questions or need more help, please reopen this issue.
So what you are saying is that the last test I posted above wasn't run long enough to see the effect? Thanks for the help, @YoDaMa. No need to reopen this issue. The fix in release 2019-03-18 seems to have done it. What happened before was that when the network was lost, the MQTT client segfaulted when retrying. Then procd re-spawned the MQTT client. This caused the 30k clients to reconnect aggressively, which defeated the backoff. We would see large dips in connections to IoT Hub every 1-2 weeks, and even when no dropout showed, MQTT clients would sometimes stop responding to direct methods. We have never had a more stable MQTT client than now.
@hansdals glad to hear it! If that log snippet is from the beginning of a run, then yes. Admittedly, the consistent delay of slightly under a minute between retries seems a little peculiar. We'd have to look deeper into the application code to diagnose why that is.
Development Machine, OS, Compiler (and Other Relevant Toolchain Info)
Ubuntu 16.04
Cross Compiled on Ubuntu 16.04
toolchain-arm_xscale_gcc-4.8-linaro_uClibc-0.9.33.2_eabi
target_arm_xscale_uClibc-0.9.33.2_eabi linux ver 3.4 (OpenWRT based SDK)
SDK Version (Please Give Commit SHA if Manually Compiling)
Release 2019-03-18
Protocol
MQTT
Describe the Bug
We do not get reconnect backoff with jitter when we increase OPTION_CONNECTION_TIMEOUT while the iothub is overloaded.
We have many MQTT clients on SDK 2018-09-11 that are now trying to reconnect to the iothub, which seems to be overloaded by our roughly 30k+ MQTT clients all trying to reconnect. Running the client on a device, I see "InitializeConnection Line:2358 mqtt_client timed out waiting for CONNACK". Related issue: #889. I have compiled the release with that fix (2019-03-18) and installed it on a single device, and the segfault on reconnect is fixed! Thanks!
With the segfault fixed we can now successfully apply the exponential backoff with jitter retry policy, and it seems to work fine with default OPTION_CONNECTION_TIMEOUT (30s):
Options and retry policy set by me in devicemethod_simplesample_run(void) function from ./Serializer:
Notice that time between CONNECT is exponential with some jitter. Great!
Since our iothub is still overloaded, and this condition is hard to recreate, I'm holding off on reinstalling this on all the 30k MQTT clients so that I can explore the overloaded condition some more. I increased OPTION_CONNECTION_TIMEOUT from the default 30s to 90s. Now the client can connect after some attempts, but no exponential backoff is visible in the timestamps, and the errors reported by the ConnectionStatusCallback are different (unauthorized, no network, device disabled). This seems incorrect and looks like a bug.
added to the code to increase connection-timeout