Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fail to reconnect after many published data queued (mqtt_client timed out waiting for CONNACK) #889

Closed
qiying opened this issue Mar 4, 2019 · 12 comments

Comments

@qiying
Copy link

qiying commented Mar 4, 2019

  • OS and version used:
    linux ver2.6 (Embedded System)
  • SDK version used:
    1.2.14

Description of the issue:

We are testing an exception:
Connect to iothub -> publish data -> disconnect network (unplug the network route) -> publish data -> connect network back -> publish data.

When there was only a few of data, every thing was fine: connection_status_callback(IOTHUB_CLIENT_CONNECTION_NO_NETWORK) would be shown when the network was disconnect. When the network was reconnected, azure sdk connected to iothub again successfully.

But when there were a lots of data (Thousands of MQTT packets) published at the period of disconnection, no connection_status_callback was occurred. Nevertheless, after we plunged the network route back, azure sdk could not connect to iothub no longer.

Code sample exhibiting the issue:

main code which only was called once:

info->device_handle = IoTHubDeviceClient_CreateFromConnectionString(info->cfg.deviceConnectionString, protocol);
(void)IoTHubDeviceClient_SetConnectionStatusCallback(info->device_handle, azure_connection_status_callback, NULL);

status callback function:

static void azure_connection_status_callback(IOTHUB_CLIENT_CONNECTION_STATUS result, IOTHUB_CLIENT_CONNECTION_STATUS_REASON reason, void *user_context)
{
    azure_info *info;
    (void)reason;
    (void)user_context;
    printf("%s(%d).result=%s, reason=%s\r\n", __FUNCTION__, __LINE__, ENUM_TO_STRING(IOTHUB_CLIENT_CONNECTION_STATUS, result), ENUM_TO_STRING(IOTHUB_CLIENT_CONNECTION_STATUS_REASON, reason));
}

Publish cyclic data:

        message_handle = IoTHubMessage_CreateFromString(jp_handle->str);
        if (!message_handle) printf("error: message_handle == NULL\n");
        (void)IoTHubMessage_SetContentTypeSystemProperty(message_handle, "application%2fjson");
        (void)IoTHubMessage_SetContentEncodingSystemProperty(message_handle, "utf-8");
        ret = IoTHubDeviceClient_SendEventAsync(info->device_handle, message_handle, azure_msg_send_confirm, NULL);
        IoTHubMessage_Destroy(message_handle);

Console log of the issue:

If there was only a few data published during the period of disconnection, every thing was fine:

Error: Time:Wed Feb 27 18:40:03 2019 File:
/azure-iot-sdk-c/c-utility/adapters/socketio_berkeley.c Func:lookup_address_and_initiate_socket_c
onnection Line:277 Failure: getaddrinfo failure -3.                             
Error: Time:Wed Feb 27 18:40:03 2019 File:
/azure-iot-sdk-c/c-utility/adapters/socketio_berkeley.c Func:socketio_open Line:760 lookup_addres
s_and_connect_socket failed                                                     
Error: Time:Wed Feb 27 18:40:03 2019 File:
/azure-iot-sdk-c/c-utility/adapters/tlsio_openssl.c Func:on_underlying_io_open_complete Line:760 
Invalid tlsio_state. Expected state is TLSIO_STATE_OPENING_UNDERLYING_IO.       
Error: Time:Wed Feb 27 18:40:03 2019 File:
/azure-iot-sdk-c/c-utility/adapters/tlsio_openssl.c Func:tlsio_openssl_open Line:1258 Failed open
ing the underlying I/O.                                                         
Error: Time:Wed Feb 27 18:40:03 2019 File:
/azure-iot-sdk-c/umqtt/src/mqtt_client.c Func:mqtt_client_connect Line:1001 Error: io_open failed
                                                                                
Error: Time:Wed Feb 27 18:40:03 2019 File:
/azure-iot-sdk-c/
iothub_client/src/iothubtransport_mqtt_common.c Func:SendMqttConnectMsg Line:215
0 failure connecting to address xxx.azure-devices.net.                 
**azure_connection_status_callback(214).result=IOTHUB_CLIENT_CONNECTION_UNA
UTHENTICATED, reason=IOTHUB_CLIENT_CONNECTION_NO_NETWORK**   

If there was a lots of data published during the period of disconnection, no connection_status_callback occurred, and azure sdk could not reconnect to iothub :

Error: Time:Wed Feb 27 18:50:21 2019 File:
/azure-iot-sdk-c/
c-utility/adapters/socketio_berkeley.c Func:lookup_address_and_initiate_socket_c
onnection Line:277 Failure: getaddrinfo failure -3.                             
Error: Time:Wed Feb 27 18:50:21 2019 File:
/azure-iot-sdk-c/c-utility/adapters/socketio_berkeley.c Func:socketio_open Line:760 lookup_addres
s_and_connect_socket failed                                                     
Info: Closing tlsio from a state other than TLSIO_STATE_EXT_OPEN or TLSIO_STATE_EXT_ERROR                                                                       
Error: Time:Wed Feb 27 18:50:21 2019 File:
/azure-iot-sdk-c/c-utility/adapters/tlsio_openssl.c Func:on_underlying_io_open_complete Line:760 
Invalid tlsio_state. Expected state is TLSIO_STATE_OPENING_UNDERLYING_IO.       
Error: Time:Wed Feb 27 18:50:21 2019 File:
/azure-iot-sdk-c/c-utility/adapters/tlsio_openssl.c Func:tlsio_openssl_open Line:1258 Failed open
ing the underlying I/O.                                                         
Error: Time:Wed Feb 27 18:50:21 2019 File:
/azure-iot-sdk-c/umqtt/src/mqtt_client.c Func:mqtt_client_connect Line:1001 Error: io_open failed
                                                                                
Error: Time:Wed Feb 27 18:50:21 2019 File:
/azure-iot-sdk-c/iothub_client/src/iothubtransport_mqtt_common.c Func:SendMqttConnectMsg Line:215
0 failure connecting to address xxx.azure-devices.net.           

And then, try to reconnect...but show the message below continuously:

on/shareApps/protocol/module/cloud/azure/stack/sdk/azure-iot-sdk-c/
iothub_client/src/iothubtransport_mqtt_common.c Func:InitializeConnection Line:2
229 mqtt_client timed out waiting for CONNACK                                   
Error: Time:Wed Feb 27 18:55:11 2019 File:
/azure-iot-sdk-c/iothub_client/src/iothubtransport_mqtt_common.c Func:InitializeConnection Line:2
229 mqtt_client timed out waiting for CONNACK                                   
Error: Time:Wed Feb 27 18:55:11 2019 File:
/azure-iot-sdk-c/iothub_client/src/iothubtransport_mqtt_common.c Func:InitializeConnection Line:2
229 mqtt_client timed out waiting for CONNACK                                   
Error: Time:Wed Feb 27 18:55:11 2019 File:
/azure-iot-sdk-c/iothub_client/src/iothubtransport_mqtt_common.c Func:InitializeConnection Line:2
229 mqtt_client timed out waiting for CONNACK    
@qiying qiying changed the title Fail to reconnect after many published data queued Fail to reconnect after many published data queued (mqtt_client timed out waiting for CONNACK) Mar 4, 2019
@qiying
Copy link
Author

qiying commented Mar 5, 2019

After I added #842, "mqtt_client timed out waiting for CONNACK" has fixed.
But I still cannot get azure_connection_status_callback when network is disconnect.

@qiying
Copy link
Author

qiying commented Mar 5, 2019

Now I realized that "max_retry_time_in_secs=0" means retry forever, and would not call connectionStatusCallback(IOTHUB_CLIENT_CONNECTION_RETRY_EXPIRED).

Hope there is a mechanism: When the max retry count reached, a connectionStatusCallback(IOTHUB_CLIENT_CONNECTION_RETRY_EXPIRED) occur, but connection still retry forever. Or provide a API for user to retry the connection which retry_action has been RETRY_ACTION_STOP_RETRYING.

@ewertons
Copy link
Contributor

ewertons commented Mar 7, 2019

Hi @qiying ,
you can set the maximum retry time.
See here:

IoTHubClient_SetRetryPolicy( IOTHUB_CLIENT_HANDLE iotHubClientHandle, IOTHUB_CLIENT_RETRY_POLICY retryPolicy, size_t retryTimeoutLimitInSeconds);

You can define the maximum retry timeout using the last argument above.

Does that address your concern?

@qiying
Copy link
Author

qiying commented Mar 7, 2019

Hi, now I know where is the problem.
At the beginning, the most serious problem is: when I recover the network, azure SDK cannot connect successfully by printing

"InitializeConnection Line:2229 mqtt_client timed out waiting for CONNACK       

now this problem has fixed from #842

After this problem solved, I still have another problem: Normally, disconnection status callback(no network) occurs if the messages is published seldom. But, if I publish messages too frequently (10times/1sec) by IoTHubDeviceClient_SendEventAsync(), when the network lost, there is not disconnection status callback occurs. I have to get the disconnection status callback.
Therefore, I consider that maybe I can use retryTimeoutLimitInSeconds to trigger disconnection status callback(retry expire). But it not works, because when disconnection status callback(retry expire) occur, retry stop too. I want connection retry forever. I have to set retryTimeoutLimitInSeconds as 0.
The way I solved my problem is only publish message after I get the confirm callback from previous message. Then I can get disconnection status callback(no network) which I want.

I think when messages are published too frequently, azure SDK does not have time to close socket completely, so disconnection status callback(no network) would not occur. This may be a problem azure SDK should fix.

I use MQTT only.
Thanks!

@qiying
Copy link
Author

qiying commented Mar 8, 2019

Hi, here is some logs.

If messages is published seldom, I can get disconnection status callback:

Error: Time:Wed Mar  6 11:43:44 2019 File:/azure-iot-sdk-c/c-utility/adapters/socketio_berkeley.c Func:lookup_address_and_initiate_socket_connection Line:277 Failure: getaddrinfo failure -3.
Error: Time:Wed Mar  6 11:43:44 2019 File:/azure-iot-sdk-c/c-utility/adapters/socketio_berkeley.c Func:socketio_open Line:760 lookup_address_and_connect_socket failed
onOpenComplete(388).mqtt_client=0x9ee3b08, open_result=1
onOpenComplete(391).socketConnected=0
mqtt_error_callback(1859).conn err cb
IoTHubClientCore_LL_ConnectionStatusCallBack(591).
iothub_ll_connection_status_callback(421).
Error: Time:Wed Mar  6 11:43:44 2019 File:/azure-iot-sdk-c/c-utility/adapters/tlsio_openssl.c Func:on_underlying_io_open_complete Line:760 Invalid tlsio_state. Expected state is TLSIO_STATE_OPENING_UNDERLYING_IO.
Error: Time:Wed Mar  6 11:43:44 2019 File:/azure-iot-sdk-c/c-utility/adapters/tlsio_openssl.c Func:tlsio_openssl_open Line:1258 Failed opening the underlying I/O.
Error: Time:Wed Mar  6 11:43:44 2019 File:/azure-iot-sdk-c/umqtt/src/mqtt_client.c Func:mqtt_client_connect Line:1005 Error: io_open failed
Error: Time:Wed Mar  6 11:43:44 2019 File:/azure-iot-sdk-c/iothub_client/src/iothubtransport_mqtt_common.c Func:SendMqttConnectMsg Line:2268 failure connecting to address april-iothub.azure-devices.net.
dispatch_user_callbacks(667).CALLBACK_TYPE_CONNECTION_STATUS
dispatch_user_callbacks(670).connection_status_callback
connection status callback: ret=IOTHUB_CLIENT_CONNECTION_UNAUTHENTICATED, reason=IOTHUB_CLIENT_CONNECTION_NO_NETWORK. 

If messages is published frequently, I cannot get disconnection status callback:

Error: Time:Wed Mar  6 11:39:17 2019 File:/azure-iot-sdk-c/c-utility/adapters/socketio_berkeley.c Func:lookup_address_and_initiate_socket_connection Line:277 Failure: getaddrinfo failure -3.
Error: Time:Wed Mar  6 11:39:17 2019 File:/azure-iot-sdk-c/c-utility/adapters/socketio_berkeley.c Func:socketio_open Line:760 lookup_address_and_connect_socket failed
onOpenComplete(388).mqtt_client=0x9ef5b08, open_result=1
onOpenComplete(391).socketConnected=1
Info: Closing tlsio from a state other than TLSIO_STATE_EXT_OPEN or TLSIO_STATE_EXT_ERROR
Error: Time:Wed Mar  6 11:39:17 2019 File:/azure-iot-sdk-c/c-utility/adapters/tlsio_openssl.c Func:on_underlying_io_open_complete Line:760 Invalid tlsio_state. Expected state is TLSIO_STATE_OPENING_UNDERLYING_IO.
Error: Time:Wed Mar  6 11:39:17 2019 File:/azure-iot-sdk-c/c-utility/adapters/tlsio_openssl.c Func:tlsio_openssl_open Line:1258 Failed opening the underlying I/O.

something may wrong with socketConnected

@ewertons
Copy link
Contributor

Hi @qiying ,

we are glad that fixing #842 helped you.

Unfortunately it is not clear why you are not seeing the connection status callback.
We are investigating it, and will post an update as soon as we get it.

@YoDaMa
Copy link
Contributor

YoDaMa commented Apr 22, 2019

I will pick up investigating this.

@YoDaMa YoDaMa assigned YoDaMa and unassigned jebrando Apr 22, 2019
@YoDaMa
Copy link
Contributor

YoDaMa commented Apr 23, 2019

An update: I have tried reproducing this using the LL layer to see if it is an issue with the DoWork loop being blocked when a message is sent every DoWork. To simulate a similar scenario I have a sample using the LL layer on Windows 10 using MQTT doing the following:

  1. Send 5 Messages, and then run do work until all 5 messages have been published and receive PUBACK.
  2. Disconnect the internet cable.
  3. Send 1 Message per DoWork call, for 995 messages.
  4. After a while, plug back in the cable and let the messages that haven't timed out send.

Doing this, I received error messages that showed the connection status callback being invoked:

rror: Time:Mon Apr 22 16:41:46 2019 File:..\c-utility\adapters\tlsio_schannel.c Func:_tlsio_schannel_open Line:1238 xio_open failed
Error: Time:Mon Apr 22 16:41:46 2019 File:..\umqtt\src\mqtt_client.c Func:_mqtt_client_connect Line:1005 Error: io_open failed
Error: Time:Mon Apr 22 16:41:46 2019 File:..\iothub_client\src\iothubtransport_mqtt_common.c Func:_SendMqttConnectMsg Line:2263 failure connecting to address yosephhub.azure-devices.net.
Error: Time:Mon Apr 22 16:41:50 2019 File:..\c-utility\adapters\socketio_win32.c Func:_lookup_address_and_initiate_socket_connection Line:299 Failure: getaddrinfo failure 11001.
Error: Time:Mon Apr 22 16:41:50 2019 File:..\c-utility\adapters\socketio_win32.c Func:_socketio_open Line:390 lookup_address_and_connect_socket failed
The device client has been disconnected (PRINTED BY CONNECTION_STATUS_CALLBACK)

So from this, the connection_status_callback is being invoked in the LL (non-threaded) layer. If there is an issue it is involving the threaded layer (the non _LL_ layer).

@YoDaMa
Copy link
Contributor

YoDaMa commented May 9, 2019

Hi @qiying apologies for getting to this slowly. I've repro'd it using the convenience sample using MQTT and still in the convenience sample I am seeing that the connection status callback is still being invoked.
This is doing the exact same process as I did using the LL layer except I am using the non LL layer, so the events are multithreaded.

Could you verify on your end that the connection status callback. It could be specific to the TLS stack on your embedded device.

Here's the gist with the logs I'm getting: https://gist.github.com/YoDaMa/022546ac983f6eb4814bb7cf4482be2e#file-multithreaded-log

@YoDaMa
Copy link
Contributor

YoDaMa commented May 17, 2019

@qiying since I was unable to repro this I am going to close this github issue.

@YoDaMa YoDaMa closed this as completed May 17, 2019
@az-iot-builder-01
Copy link
Collaborator

@qiying, @ewertons, @YoDaMa, thank you for your contribution to our open-sourced project! Please help us improve by filling out this 2-minute customer satisfaction survey

@rajaggrawal
Copy link
Contributor

rajaggrawal commented Dec 18, 2019

Hi,
i am trying to connect with AzureIoTHub , initially i dont have internet when system comes up and azure connection gets fail.
But it never connect even after internet connection comes up.

seeing the below logs.

2019-10-14T09:52:20.517980+00:00 ClickShare-1862300213 iotagent[733]: [ERROR] AZURE_PLATFORM: ERROR: Failure: getaddrinfo failure -3.
2019-10-14T09:52:20.518115+00:00 ClickShare-1862300213 iotagent[733]: [ERROR] AZURE_PLATFORM: ERROR: lookup_address_and_connect_socket failed
2019-10-14T09:52:20.518209+00:00 ClickShare-1862300213 iotagent[733]: [ERROR] AZURE_PLATFORM: ERROR: Http connection failed to connect
2019-10-14T09:52:20.518280+00:00 ClickShare-1862300213 iotagent[733]: [ERROR] AZURE_PLATFORM: ERROR: Invalid tlsio_state. Expected state is TLSIO_STATE_OPENING_UNDERLYING_IO.
2019-10-14T09:52:20.518387+00:00 ClickShare-1862300213 iotagent[733]: [ERROR] AZURE_PLATFORM: ERROR: Failed opening the underlying I/O.
2019-10-14T09:52:20.518603+00:00 ClickShare-1862300213 iotagent[733]: [ERROR] AZURE_PLATFORM: ERROR: opening xio failed
2019-10-14T09:52:20.518691+00:00 ClickShare-1862300213 iotagent[733]: [ERROR] AZURE_PLATFORM: ERROR: failed to open http client

could you please help here..

retry is set with below option
IoTHubDeviceClient_SetRetryPolicy(iotHubHandle, IOTHUB_CLIENT_RETRY_INTERVAL,0)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants