Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry policy backoff with jitter ignored when connection timeout increased from 30s #928

Closed
hansdals opened this issue Mar 29, 2019 · 6 comments
Assignees
Labels

Comments

@hansdals
Copy link

hansdals commented Mar 29, 2019

Development Machine, OS, Compiler (and Other Relevant Toolchain Info)

Ubuntu 16.04
Cross Compiled on Ubuntu 16.04
toolchain-arm_xscale_gcc-4.8-linaro_uClibc-0.9.33.2_eabi
target_arm_xscale_uClibc-0.9.33.2_eabi linux ver 3.4 (OpenWRT based SDK)

SDK Version (Please Give Commit SHA if Manually Compiling)

Release 2019-03-18

Protocol

MQTT

Describe the Bug

We do not get re-connect backoff with jitter when we increase OPTION_CONNECTION_TIMEOUT when the iothub is overloaded.

We have many mqtt clients with SDK 2018-09-11 now trying to reconnect with the iothub that seems to be overloaded by our about 30k+ mqtt clients trying to reconnect. Running the client on a device I see "InitializeConnection Line:2358 mqtt_client timed out waiting for CONNACK". Related issue: #889 . I have compiled that (2019-03-18) and installed it on a single device, and the segfault on re-connect is fixed! Thanks!
With the segfault fixed we can now successfully apply the exponential backoff with jitter retry policy, and it seems to work fine with default OPTION_CONNECTION_TIMEOUT (30s):

Options and retry policy set by me in devicemethod_simplesample_run(void) function from ./Serializer:

int lifetime = 1 * 60 * 60 * 24;
IoTHubClient_LL_SetOption(iotHubClientHandle, OPTION_SAS_TOKEN_LIFETIME, &lifetime);
bool trace = true;
IoTHubClient_LL_SetOption(iotHubClientHandle, OPTION_LOG_TRACE, &trace);
size_t retrytimeoutlimit = 2 * 60 * 60;
IoTHubClient_LL_SetRetryPolicy(iotHubClientHandle, IOTHUB_CLIENT_RETRY_EXPONENTIAL_BACKOFF_WITH_JITTER, &retrytimeoutlimit);
size_t messagetimeout = 90;
IoTHubClient_LL_SetOption(iotHubClientHandle, OPTION_MESSAGE_TIMEOUT, &messagetimeout);

IoTHubClient accepted the message for delivery
IoTHubClient_LL_SetMessageCallback...successful.
-> 15:05:29 CONNECT | VER: 4 | KEEPALIVE: 240 | FLAGS: 192 | USERNAME: iothuburl.net/XXSerialnoXX/?api-version=2017-11-08-preview&DeviceClientType=iothubclient%2f1.3.0%20(native%3b%20Linux%3b%20armv7l) | PWD: XXXX | CLEAN: 0
Error: Time:Fri Mar 29 15:06:00 2019 File:./build_dir/target-arm_xscale_uClibc-0.9.33.2_eabi/AzureIoTHubSDK-2019-03-18/iothub_client/src/iothubtransport_mqtt_common.c Func:InitializeConnection Line:2358 mqtt_client timed out waiting for CONNACK
Error: Time:Fri Mar 29 15:06:31 2019 File:./build_dir/target-arm_xscale_uClibc-0.9.33.2_eabi/AzureIoTHubSDK-2019-03-18/iothub_client/src/iothubtransport_mqtt_common.c Func:InitializeConnection Line:2358 mqtt_client timed out waiting for CONNACK
-> 15:06:31 CONNECT | VER: 4 | KEEPALIVE: 240 | FLAGS: 192 | USERNAME: iothuburl.net/XXSerialnoXX/?api-version=2017-11-08-preview&DeviceClientType=iothubclient%2f1.3.0%20(native%3b%20Linux%3b%20armv7l) | PWD: XXXX | CLEAN: 0
Error: Time:Fri Mar 29 15:07:02 2019 File:./build_dir/target-arm_xscale_uClibc-0.9.33.2_eabi/AzureIoTHubSDK-2019-03-18/iothub_client/src/iothubtransport_mqtt_common.c Func:InitializeConnection Line:2358 mqtt_client timed out waiting for CONNACK
Error: Time:Fri Mar 29 15:07:33 2019 File:./build_dir/target-arm_xscale_uClibc-0.9.33.2_eabi/AzureIoTHubSDK-2019-03-18/iothub_client/src/iothubtransport_mqtt_common.c Func:InitializeConnection Line:2358 mqtt_client timed out waiting for CONNACK
-> 15:07:33 CONNECT | VER: 4 | KEEPALIVE: 240 | FLAGS: 192 | USERNAME: iothuburl.net/XXSerialnoXX/?api-version=2017-11-08-preview&DeviceClientType=iothubclient%2f1.3.0%20(native%3b%20Linux%3b%20armv7l) | PWD: XXXX | CLEAN: 0
Error: Time:Fri Mar 29 15:08:04 2019 File:./build_dir/target-arm_xscale_uClibc-0.9.33.2_eabi/AzureIoTHubSDK-2019-03-18/iothub_client/src/iothubtransport_mqtt_common.c Func:InitializeConnection Line:2358 mqtt_client timed out waiting for CONNACK
Error: Time:Fri Mar 29 15:08:35 2019 File:./build_dir/target-arm_xscale_uClibc-0.9.33.2_eabi/AzureIoTHubSDK-2019-03-18/iothub_client/src/iothubtransport_mqtt_common.c Func:InitializeConnection Line:2358 mqtt_client timed out waiting for CONNACK
-> 15:08:37 CONNECT | VER: 4 | KEEPALIVE: 240 | FLAGS: 192 | USERNAME: iothuburl.net/XXSerialnoXX/?api-version=2017-11-08-preview&DeviceClientType=iothubclient%2f1.3.0%20(native%3b%20Linux%3b%20armv7l) | PWD: XXXX | CLEAN: 0
Error: Time:Fri Mar 29 15:09:07 2019 File:./build_dir/target-arm_xscale_uClibc-0.9.33.2_eabi/AzureIoTHubSDK-2019-03-18/iothub_client/src/iothubtransport_mqtt_common.c Func:InitializeConnection Line:2358 mqtt_client timed out waiting for CONNACK
Error: Time:Fri Mar 29 15:10:12 2019 File:./build_dir/target-arm_xscale_uClibc-0.9.33.2_eabi/AzureIoTHubSDK-2019-03-18/iothub_client/src/iothubtransport_mqtt_common.c Func:InitializeConnection Line:2358 mqtt_client timed out waiting for CONNACK
-> 15:11:55 CONNECT | VER: 4 | KEEPALIVE: 240 | FLAGS: 192 | USERNAME: iothuburl.net/XXSerialnoXX/?api-version=2017-11-08-preview&DeviceClientType=iothubclient%2f1.3.0%20(native%3b%20Linux%3b%20armv7l) | PWD: XXXX | CLEAN: 0
Error: Time:Fri Mar 29 15:12:25 2019 File:./build_dir/target-arm_xscale_uClibc-0.9.33.2_eabi/AzureIoTHubSDK-2019-03-18/iothub_client/src/iothubtransport_mqtt_common.c Func:InitializeConnection Line:2358 mqtt_client timed out waiting for CONNACK
Error: Time:Fri Mar 29 15:16:53 2019 File:./build_dir/target-arm_xscale_uClibc-0.9.33.2_eabi/AzureIoTHubSDK-2019-03-18/iothub_client/src/iothubtransport_mqtt_common.c Func:InitializeConnection Line:2358 mqtt_client timed out waiting for CONNACK
-> 15:24:56 CONNECT | VER: 4 | KEEPALIVE: 240 | FLAGS: 192 | USERNAME: iothuburl.net/XXSerialnoXX/?api-version=2017-11-08-preview&DeviceClientType=iothubclient%2f1.3.0%20(native%3b%20Linux%3b%20armv7l) | PWD: XXXX | CLEAN: 0

Notice that time between CONNECT is exponential with some jitter. Great!

Since our iothub is still overloaded, and it is hard to recreate this error, i'm waiting with reinstalling this on all the 30k mqtt clients to explore this overloaded condition some more. I increase the option OPTION_CONNECTION_TIMEOUT from default 30s to 90s. Now the client can connect after some attempts, but no exponential backoff seen in the timestamps and the errors seen from the ConnectionStatusCallback are different (unauthorized, no network, device disabled). This seems incorrect and a bug.

added to the code to increase connection-timeout

size_t connectiontimeout = 120;
IoTHubClient_LL_SetOption(iotHubClientHandle, OPTION_CONNECTION_TIMEOUT, &connectiontimeout);

IoTHubClient accepted the message for delivery
IoTHubClient_LL_SetMessageCallback...successful.
-> 14:20:04 CONNECT | VER: 4 | KEEPALIVE: 240 | FLAGS: 192 | USERNAME: iothuburl.net/_______/?api-version=2017-11-08-preview&DeviceClientType=iothubclient%2f1.3.0%20(native%3b%20Linux%3b%20armv7l) | PWD: XXXX | CLEAN: 0
<- 14:20:58 CONNACK | SESSION_PRESENT: false | RETURN_CODE: 0x5
2019-03-29 14:20:58 ________mqtt_client INFO: IOTHUB_CLIENT_CONNECTION_UNAUTHENTICATED
2019-03-29 14:20:58 ________mqtt_client ERR: IOTHUB_CLIENT_CONNECTION_DEVICE_DISABLED
Error: Time:Fri Mar 29 14:20:58 2019 File:./build_dir/target-arm_xscale_uClibc-0.9.33.2_eabi/AzureIoTHubSDK-2019-03-18/iothub_client/src/iothubtransport_mqtt_common.c Func:mqtt_operation_complete_callback Line:1728 Connection Not Accepted: 0x5: Not Authorized
2019-03-29 14:20:58 ________mqtt_client INFO: IOTHUB_CLIENT_CONNECTION_UNAUTHENTICATED
2019-03-29 14:20:58 ________mqtt_client ERR: IOTHUB_CLIENT_CONNECTION_NO_NETWORK
-> 14:20:59 CONNECT | VER: 4 | KEEPALIVE: 240 | FLAGS: 192 | USERNAME: iothuburl.net/_______/?api-version=2017-11-08-preview&DeviceClientType=iothubclient%2f1.3.0%20(native%3b%20Linux%3b%20armv7l) | PWD: XXXX | CLEAN: 0
<- 14:21:53 CONNACK | SESSION_PRESENT: false | RETURN_CODE: 0x5
2019-03-29 14:21:53 ________mqtt_client INFO: IOTHUB_CLIENT_CONNECTION_UNAUTHENTICATED
2019-03-29 14:21:53 ________mqtt_client ERR: IOTHUB_CLIENT_CONNECTION_DEVICE_DISABLED
Error: Time:Fri Mar 29 14:21:53 2019 File:./build_dir/target-arm_xscale_uClibc-0.9.33.2_eabi/AzureIoTHubSDK-2019-03-18/iothub_client/src/iothubtransport_mqtt_common.c Func:mqtt_operation_complete_callback Line:1728 Connection Not Accepted: 0x5: Not Authorized
2019-03-29 14:21:53 ________mqtt_client INFO: IOTHUB_CLIENT_CONNECTION_UNAUTHENTICATED
2019-03-29 14:21:53 ________mqtt_client ERR: IOTHUB_CLIENT_CONNECTION_NO_NETWORK
-> 14:21:53 CONNECT | VER: 4 | KEEPALIVE: 240 | FLAGS: 192 | USERNAME: iothuburl.net/_______/?api-version=2017-11-08-preview&DeviceClientType=iothubclient%2f1.3.0%20(native%3b%20Linux%3b%20armv7l) | PWD: XXXX | CLEAN: 0
<- 14:22:47 CONNACK | SESSION_PRESENT: false | RETURN_CODE: 0x5
2019-03-29 14:22:47 ________mqtt_client INFO: IOTHUB_CLIENT_CONNECTION_UNAUTHENTICATED
2019-03-29 14:22:47 ________mqtt_client ERR: IOTHUB_CLIENT_CONNECTION_DEVICE_DISABLED
Error: Time:Fri Mar 29 14:22:47 2019 File:./build_dir/target-arm_xscale_uClibc-0.9.33.2_eabi/AzureIoTHubSDK-2019-03-18/iothub_client/src/iothubtransport_mqtt_common.c Func:mqtt_operation_complete_callback Line:1728 Connection Not Accepted: 0x5: Not Authorized
2019-03-29 14:22:47 ________mqtt_client INFO: IOTHUB_CLIENT_CONNECTION_UNAUTHENTICATED
2019-03-29 14:22:47 ________mqtt_client ERR: IOTHUB_CLIENT_CONNECTION_NO_NETWORK
-> 14:22:48 CONNECT | VER: 4 | KEEPALIVE: 240 | FLAGS: 192 | USERNAME: iothuburl.net/_______/?api-version=2017-11-08-preview&DeviceClientType=iothubclient%2f1.3.0%20(native%3b%20Linux%3b%20armv7l) | PWD: XXXX | CLEAN: 0
^[[A^[[A<- 14:23:42 CONNACK | SESSION_PRESENT: false | RETURN_CODE: 0x5
2019-03-29 14:23:42 ________mqtt_client INFO: IOTHUB_CLIENT_CONNECTION_UNAUTHENTICATED
2019-03-29 14:23:42 ________mqtt_client ERR: IOTHUB_CLIENT_CONNECTION_DEVICE_DISABLED
Error: Time:Fri Mar 29 14:23:42 2019 File:./build_dir/target-arm_xscale_uClibc-0.9.33.2_eabi/AzureIoTHubSDK-2019-03-18/iothub_client/src/iothubtransport_mqtt_common.c Func:mqtt_operation_complete_callback Line:1728 Connection Not Accepted: 0x5: Not Authorized
2019-03-29 14:23:42 ________mqtt_client INFO: IOTHUB_CLIENT_CONNECTION_UNAUTHENTICATED
2019-03-29 14:23:42 ________mqtt_client ERR: IOTHUB_CLIENT_CONNECTION_NO_NETWORK
-> 14:23:42 CONNECT | VER: 4 | KEEPALIVE: 240 | FLAGS: 192 | USERNAME: iothuburl.net/_______/?api-version=2017-11-08-preview&DeviceClientType=iothubclient%2f1.3.0%20(native%3b%20Linux%3b%20armv7l) | PWD: XXXX | CLEAN: 0
<- 14:24:37 CONNACK | SESSION_PRESENT: false | RETURN_CODE: 0x5
2019-03-29 14:24:37 ________mqtt_client INFO: IOTHUB_CLIENT_CONNECTION_UNAUTHENTICATED
2019-03-29 14:24:37 ________mqtt_client ERR: IOTHUB_CLIENT_CONNECTION_DEVICE_DISABLED
Error: Time:Fri Mar 29 14:24:37 2019 File:./build_dir/target-arm_xscale_uClibc-0.9.33.2_eabi/AzureIoTHubSDK-2019-03-18/iothub_client/src/iothubtransport_mqtt_common.c Func:mqtt_operation_complete_callback Line:1728 Connection Not Accepted: 0x5: Not Authorized
2019-03-29 14:24:37 ________mqtt_client INFO: IOTHUB_CLIENT_CONNECTION_UNAUTHENTICATED
2019-03-29 14:24:37 ________mqtt_client ERR: IOTHUB_CLIENT_CONNECTION_NO_NETWORK
-> 14:24:37 CONNECT | VER: 4 | KEEPALIVE: 240 | FLAGS: 192 | USERNAME: iothuburl.net/_______/?api-version=2017-11-08-preview&DeviceClientType=iothubclient%2f1.3.0%20(native%3b%20Linux%3b%20armv7l) | PWD: XXXX | CLEAN: 0
<- 14:25:31 CONNACK | SESSION_PRESENT: false | RETURN_CODE: 0x5
2019-03-29 14:25:31 ________mqtt_client INFO: IOTHUB_CLIENT_CONNECTION_UNAUTHENTICATED
2019-03-29 14:25:31 ________mqtt_client ERR: IOTHUB_CLIENT_CONNECTION_DEVICE_DISABLED
Error: Time:Fri Mar 29 14:25:31 2019 File:./build_dir/target-arm_xscale_uClibc-0.9.33.2_eabi/AzureIoTHubSDK-2019-03-18/iothub_client/src/iothubtransport_mqtt_common.c Func:mqtt_operation_complete_callback Line:1728 Connection Not Accepted: 0x5: Not Authorized
2019-03-29 14:25:31 ________mqtt_client INFO: IOTHUB_CLIENT_CONNECTION_UNAUTHENTICATED
2019-03-29 14:25:31 ________mqtt_client ERR: IOTHUB_CLIENT_CONNECTION_NO_NETWORK
-> 14:25:32 CONNECT | VER: 4 | KEEPALIVE: 240 | FLAGS: 192 | USERNAME: iothuburl.net/_______/?api-version=2017-11-08-preview&DeviceClientType=iothubclient%2f1.3.0%20(native%3b%20Linux%3b%20armv7l) | PWD: XXXX | CLEAN: 0
<- 14:26:26 CONNACK | SESSION_PRESENT: true | RETURN_CODE: 0x0
2019-03-29 14:26:26 ________mqtt_client INFO: IOTHUB_CLIENT_CONNECTION_AUTHENTICATED
2019-03-29 14:26:26 ________mqtt_client INFO: IOTHUB_CLIENT_CONNECTION_OK, ID _______
-> 14:26:26 SUBSCRIBE | PACKET_ID: 2 | TOPIC_NAME: devices/_______/messages/devicebound/# | QOS: 1 | TOPIC_NAME: $iothub/methods/POST/# | QOS: 0
-> 14:26:26 PUBLISH | IS_DUP: false | RETAIN: 0 | QOS: DELIVER_AT_LEAST_ONCE | TOPIC_NAME: devices/_______/messages/events/ | PACKET_ID: 3 | PAYLOAD_LEN: 31
<- 14:26:26 SUBACK | PACKET_ID: 2 | RETURN_CODE: 1 | RETURN_CODE: 0
<- 14:26:26 PUBACK | PACKET_ID: 3
Message Id: 0 Received.
Result Call Back Called! Result is: IOTHUB_CLIENT_CONFIRMATION_OK 

@hansdals hansdals changed the title Retry policy backoff with jitter ignored when connection timeout increased to >30s default Retry policy backoff with jitter ignored when connection timeout increased from 30s Mar 29, 2019
@YoDaMa
Copy link
Contributor

YoDaMa commented Apr 11, 2019

Hi @hansdals I'm working on this issue. I'll see what kind of repro I can get by increasing the OPTION_CONNECTION_TIMEOUT when using IOTHUB_CLIENT_RETRY_EXPONENTIAL_BACKOFF_WITH_JITTER for our retry policy, as you've done.

@YoDaMa
Copy link
Contributor

YoDaMa commented Apr 11, 2019

One point which may be worth clarifying is how the OPTION_CONNECTION_TIMEOUT and the IoTHubClient_LL_SetRetryPolicy interact with each other.

OPTION_CONNECTION_TIMEOUT is defaulted to 30 seconds. What does this mean? If a connection is broken, then the client will wait 30 seconds for a CONNACK in MQTT before doing anything. After that 30 seconds, it will check with the retry policy. If the retry policy says "yes, the minimum time between the last retry and the current retry has been met", then the client will retry a connection.

Here is an example:
When a user sets the retry policy to for instance exponential backoff with jitter, there will be limits set in the SDK specifying the minimum time between consecutive retries.

The limit would be 0, 1, then 2, 4, 8, 16, 32, 64, ... give or take a few seconds due to the jitter.

Let's say you set OPTION_CONNECTION_TIMEOUT to 120 seconds. Well, then the connection will be attempted, and after 120 seconds the transport layer of the client will give up. It will then consult with the retry policy, and if the minimum time between retries has been met, it will retry connecting to the IoT Hub.

Initially, this will look like 120 seconds between each retry. This is because on the 1st retry, the minimum time between retries is 0, and 120 seconds is more than 0 so it will retry. On the 2nd retry, the minimum time between retries is 1, and 120 seconds is more than 1 so it will retry. On the 3rd retry, the minimum time between retries is 2, and 120 seconds is more than 2 so it will retry. On the 4th retry... you get the idea. This will happen until the minimum time between retries has been incremented to be greater than 120 seconds. At this point you will start seeing more of that exponential backoff with jitter come into effect. On the 9th retry the minimum time will be approximately 256 seconds, and 120 seconds is less than 256, so after the connection is timed out, the MQTT transport on the client will wait until 256 seconds has been reached, at which point the retry will be attempted.

@YoDaMa
Copy link
Contributor

YoDaMa commented Apr 17, 2019

@hansdals it's been a while since you've last responded to this thread so we will close it. If you have more questions or need more help please re open this issue.

@YoDaMa YoDaMa closed this as completed Apr 17, 2019
@az-iot-builder-01
Copy link
Collaborator

@hansdals, @YoDaMa, thank you for your contribution to our open-sourced project! Please help us improve by filling out this 2-minute customer satisfaction survey

@hansdals
Copy link
Author

What you are saying is that if I had run my last test that i posted above wasn't run long enough to see the effect ?

Thanks for help @YoDaMa . No need to re-open this issue. The fix in release 2019-03-18 seemed to have done it. What happened before was that when network was lost, the mqtt client segfaulted when retrying. Then procd re-spawned the MQTT client. This caused 30k to reconnect aggressively rendering backoff not working. We would see large dropouts in connections to IoT Hub dip every 1-2 weeks, and when it didn't show dropout devices sometimes MQTT clients would be not responding to direct methods. We have never had a more stable MQTT client than now.

@YoDaMa
Copy link
Contributor

YoDaMa commented Apr 24, 2019

@hansdals glad to hear it!

If that log snippet is from the beginning of a run, then yes. Although admittedly, the slightly under a minute consistent delay between retries seems a little peculiar. We'd have to look deeper into the application code written to diagnose why that is.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants