Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Random timeouts waiting for CONNACK cause process crashes in iothubtransport_mqtt_common module #842

Closed
RivaStyle opened this issue Feb 1, 2019 · 14 comments
Assignees
Labels

Comments

@RivaStyle
Copy link

  • OS and version used: Ubuntu 16.04

  • SDK version used: 2019-01-31 (v1.2.13)

Description of the issue:

When connecting to IoT Hub to push data, it randomly happens (after many successful attempts) that the MQTT connection times out at initialization, waiting for CONNACK.

This triggers DisconnectFromClient where the case is:

  • transport_data->mqttClientStatus = MQTT_CLIENT_STATUS_CONNECTING
  • transport_data->isDestroyCalled = false

Despite that isDestroyCalled flag, a xio_destroy is called anyway on transport_data->xioTransport, causing a crash.

Could it be an issue similar to #446?

Code sample exhibiting the issue:

iothub_client_sample_mqtt_esp8266, put in a state machine and set to push a single message in a user-selectable time interval (between 5 minutes and 1 hour).

Modifications made:

  • Added Connection Status callback: it basically quits the main do-while loop after CONN_RETRY_NUM (set to 4) unsuccessful connection attempts.
static void ConnStatCallback(IOTHUB_CLIENT_CONNECTION_STATUS result, IOTHUB_CLIENT_CONNECTION_STATUS_REASON reason, void* userContextCallback)
{
    int* counter = (int*)userContextCallback;

    if (result == IOTHUB_CLIENT_CONNECTION_AUTHENTICATED)
    {
        IOTHUB_LOGDEBUG("Attempt %d of %d: MQTT connection established", *counter, CONN_RETRY_NUM);
    }
    else
    {
        (*counter)++;

        switch (reason)
        {
            case IOTHUB_CLIENT_CONNECTION_EXPIRED_SAS_TOKEN:
            {
                IOTHUB_LOGDEBUG("Attempt %d of %d: Id rejected because of expired SAS Token", *counter, CONN_RETRY_NUM);
            }
            break;

            case IOTHUB_CLIENT_CONNECTION_DEVICE_DISABLED:
            {
                IOTHUB_LOGDEBUG("Attempt %d of %d: Device disabled by IoT Hub", *counter, CONN_RETRY_NUM);
            }
            break;

            case IOTHUB_CLIENT_CONNECTION_BAD_CREDENTIAL:
            {
                // Bad login credentials, no need to retry again and again - exit immediately
                *counter = CONN_RETRY_NUM;
                IOTHUB_LOGDEBUG("Attempt %d of %d: Invalid Device Id or Shared Access Key", *counter, CONN_RETRY_NUM);
            }
            break;

            case IOTHUB_CLIENT_CONNECTION_RETRY_EXPIRED:
            {
                // Retry timeout set by policy has expired, exit immediately
                *counter = CONN_RETRY_NUM;
                IOTHUB_LOGDEBUG("Attempt %d of %d: Retry Policy timeout expired", *counter, CONN_RETRY_NUM);
            }
            break;

            case IOTHUB_CLIENT_CONNECTION_NO_NETWORK:
            {
                // VMU-C is not connected to a network, exit immediately
                *counter = CONN_RETRY_NUM;
                IOTHUB_LOGDEBUG("Attempt %d of %d: Network error", *counter, CONN_RETRY_NUM);
            }
            break;

            case IOTHUB_CLIENT_CONNECTION_COMMUNICATION_ERROR:
            default:
            {
                IOTHUB_LOGDEBUG("Attempt %d of %d: Unauthorized MQTT connection", *counter, CONN_RETRY_NUM);
            }
            break;
        }
    }

    IOTHUB_LOGDEBUG("ConnStat callback counter = %d, reason = %d", *counter, reason);
    if (*counter >= CONN_RETRY_NUM)
    {
        IOTHUB_LOGDEBUG("Maximum number of attempts reached, exiting IoT Hub Client");
        g_retVal = STATUSCODE_SRV_UNAV;
        g_continueRunning = false;
    }
}
  • Modified Send Confirmation callback, so as to quit the loop after one message:
static void SendConfirmationCallback(IOTHUB_CLIENT_CONFIRMATION_RESULT result, void* userContextCallback)
{
    EVENT_INSTANCE* eventInstance = (EVENT_INSTANCE*)userContextCallback;
    IOTHUB_LOGDEBUG("Confirmation[%d] received for message tracking id = %zu with result = %s",
                    callbackCounter, eventInstance->messageTrackingId,
                    ENUM_TO_STRING(IOTHUB_CLIENT_CONFIRMATION_RESULT, result));

    callbackCounter++;
    IoTHubMessage_Destroy(eventInstance->messageHandle);

    // Quit after receiving last callback
    IOTHUB_LOGDEBUG("callbackCounter = %d", (int)callbackCounter);
    if (callbackCounter >= MESSAGE_COUNT)
    {
        g_continueRunning = false;
    }
}

Console log of the issue:

2019-02-01 10:55:50  IOTHUB_DEBUG: 250 iothub_clientMQTTRun() IoTHubClient_LL_SetConnectionStatusCallback...successful.
2019-02-01 10:55:50  IOTHUB_DEBUG: 262 iothub_clientMQTTRun() IoTHubClient_LL_SetMessageCallback...successful.
2019-02-01 10:55:50  IOTHUB_DEBUG: 297 iothub_clientMQTTRun() IoTHubClient_LL_SendEventAsync accepted message [0] for transmission to IoT Hub.
-> 10:55:51 CONNECT | VER: 4 | KEEPALIVE: 240 | FLAGS: 192 | USERNAME: smxiothub.azure-devices.net/GavazziTest-device_VMU-C/?api-version=2017-11-08-preview&DeviceClientType=iothubclient%2f1.2.13%20(native%3b%20Linux%3b%20armv5tejl) | PWD: XXXX | CLEAN: 0
Error: Time:Fri Feb  1 10:56:21 2019 File:/mnt/ssd/Progetti/uwp_workspace_photon/UWP3/userapps/public/libs/azure-iot-sdk-c/iothub_client/src/iothubtransport_mqtt_common.c Func:InitializeConnection Line:2344 mqtt_client timed out waiting for CONNACK
* crash *
sigChildHnd(): sig=17 killed taskpid=17745 
sigChildHnd(): removed taskpid=17745 from list
@jebrando
Copy link
Contributor

jebrando commented Feb 4, 2019

@RivaStyle How are you building the SDK on the ESP32, I'm looking at recreated, and I'd like to get as close to your configuration as possible.

Thanks.

@RivaStyle
Copy link
Author

These are my CMake flags:

	@cd azure-iot-sdk-c; \
	mkdir -p cmake
	cd azure-iot-sdk-c/cmake; \
	cmake \
	-Duse_mqtt=ON  \
	-Duse_amqp=OFF \
	-Duse_http=ON \
	-Dskip_samples=ON \
	-Dbuild_service_client=OFF \
	-Dbuild_serializer=OFF \
	-Dbuild_as_dynamic=ON \
	-Ddont_use_uploadtoblob=ON \
	-DCMAKE_INSTALL_PREFIX="$(COMPILE_DIR)/usr/local/azure-iot" \
	-DCMAKE_SYSTEM_NAME=Linux \
	-DCMAKE_SYSTEM_VERSION=1 \
	-DCMAKE_TOOLCHAIN_FILE=../build_all/linux/toolchain-uwp.cmake \
	-DCMAKE_C_STANDARD=99 \
	-DCMAKE_CXX_STANDARD=11 \
	-DCMAKE_FIND_ROOT_PATH_MODE_PROGRAM=NEVER \
	-DCMAKE_FIND_ROOT_PATH_MODE_LIBRARY=ONLY \
	-DCMAKE_FIND_ROOT_PATH_MODE_INCLUDE=ONLY \
	-DCMAKE_VERBOSE_MAKEFILE=OFF \
	-DOPENSSL_ROOT_DIR=$(COMPILE_DIR)/usr/local/openssl \
	-DOPENSSL_LIBRARIES=$(COMPILE_DIR)/usr/local/openssl/lib \
	-DCURL_INCLUDE_DIR=$(COMPILE_DIR)/usr/local/curl/include \
	-DCURL_LIBRARY=$(COMPILE_DIR)/usr/local/curl/lib \
	-DcompileOption_C="\
	$(USER_FLAGS) \
	-L$(INSTALL_DIR)/lib -I$(COMPILE_DIR)/usr/local/libuuid/include \
	-I$(COMPILE_DIR)/usr/local/curl/include \
	-lcares" \
	..

where $(USER_FLAGS) is:

-O2 -marm -mabi=aapcs-linux -mno-thumb-interwork \
-march=armv5te -mtune=arm926ej-s \
-msoft-float -Uarm -fomit-frame-pointer

"build_all/linux/toolchain-uwp.cmake" is:

INCLUDE(CMakeForceCompiler)

SET(CMAKE_SYSTEM_NAME Linux)     # this one is important
SET(CMAKE_SYSTEM_VERSION 1)     # this one not so much

# this is the location of the amd64 toolchain targeting the Raspberry Pi
#SET(CMAKE_C_COMPILER $ENV{RPI_ROOT}/../bin/arm-linux-gnueabihf-gcc)

# this is the file system root of the target
#SET(CMAKE_FIND_ROOT_PATH $(COMPILE_DIR) $(INSTALL_DIR) $(COMPILE_DIR)/usr/local/libuuid/include)

# search for programs in the build host directories
SET(CMAKE_FIND_ROOT_PATH_MODE_PROGRAM NEVER)

# for libraries and headers in the target directories
SET(CMAKE_FIND_ROOT_PATH_MODE_LIBRARY ONLY)
SET(CMAKE_FIND_ROOT_PATH_MODE_INCLUDE ONLY)

As a side note, I didn't have to specify the C_STANDARD and CXX_STANDARD till 1.2.12 (my update steps have been 1.2.10->1.2.12->1.2.13 in a matter of a couple weeks). In any case the issue was present as well using 1.2.10.

@RivaStyle
Copy link
Author

(I forgot to add an information, that "build_serializer" is a custom option I added to not build the serializer module as it wasn't needed)

@jebrando
Copy link
Contributor

@RivaStyle (As a side note I really like that build_serializer flag, I will try to get the team to incorporate it). I guess there are two issues here, the wait for CONNACK and the segfault. For the CONNACK, did you lose network connectivity around this time? It's unusual for the IoTHub to not send back a CONNACK unless the device loses it's network.

For the Segfault, do you now what communication stack you're using (tlsio_openssl, tlsio_mbedtls, tlsio_wolfssl, ...)? I believe that there is some clue in the issue.

Thanks,
Jelani

@RivaStyle
Copy link
Author

No, when the CONNACK timeout happens I don't lose network connectivity, but it could be some kind of overload as other heavy processes are running alongside it. I'll try to disable it and have a further check.

Regarding the communication stack, it's tlsio_openssl.

Thanks,
Luca

@jitin17
Copy link
Contributor

jitin17 commented Feb 18, 2019

@RivaStyle @jebrando I could reproduce this issue. I suspect the problem is with InitializeConnection function in iothubtransport_mqtt_common.c. InitializeConnection does not handle all retry actions appropriately.

I am adding some pain points in the InitializeConnection function, which I think causes the crash. (Note: This is compared against master branch, latest commit 1239a17)

Line 2331 : View Code

I think result = 0 is incorrect, and this is causing a crash when I add IoTHubDeviceClient_LL_SetRetryPolicy(iotHubClientHandle, IOTHUB_CLIENT_RETRY_INTERVAL, 20) in the application logic.

RETRY_ACTION_RETRY_LATER retry action is not being handled

When retry action is RETRY_ACTION_RETRY_LATER, the return value of InitializeConnection is 0, which also results in crash. The return value when retry action is RETRY_ACTION_RETRY_LATER should be something other than 0.

@jebrando Can you confirm whether this is true, and let me know if you would like me to raise a PR to resolve this issue. Thanks.

@RivaStyle
Copy link
Author

@jitin17 if you wanna check if that can intercept my case, the retry policy I'm using is IoTHubClient_LL_SetRetryPolicy(iotHubClientHandle, IOTHUB_CLIENT_RETRY_EXPONENTIAL_BACKOFF_WITH_JITTER, 10) .

@jitin17
Copy link
Contributor

jitin17 commented Feb 18, 2019

@RivaStyle This is the default retry policy. And this fails because RETRY_ACTION_RETRY_LATER is not being handled.
This case can be handled by setting result as non-zero in InitializeConnection function of iothubtransport_mqtt_common.c file when retry action is RETRY_ACTION_RETRY_LATER. Let me know, if this works for you.

mahavirj pushed a commit to espressif/esp-azure that referenced this issue Feb 20, 2019
…val without the need of changing sleep time

Also, added support for IOTHUB_CLIENT_RETRY_INTERVAL connection retry policy, which will take care
of internet connection loss and WiFi connection loss.

By default, retry policy is IOTHUB_CLIENT_RETRY_EXPONENTIAL_BACKOFF_WITH_JITTER,
but it causes a bug due to this known issue Azure/azure-iot-sdk-c#842.
As soon as this bug is resolved, we will make retry policy setting configurable.
@RivaStyle
Copy link
Author

@jitin17 @jebrando a non-zero value in case of RETRY_ACTION_RETRY_LATER actually prevented crashes on library side... I'm still seeing some crashes, but now it's more on my side, about how I call my push client to exit.

@jitin17
Copy link
Contributor

jitin17 commented Feb 21, 2019

@RivaStyle I have raised a PR #875 to fix this issue.

@ewertons
Copy link
Contributor

ewertons commented Mar 7, 2019

Hi @RivaStyle , PR #875 has been merged.
Could you verify it fixes your issue please?

@RivaStyle
Copy link
Author

I confirm that the library, as it was cloned from Git last Friday, definitely fixes the issue.

@ewertons
Copy link
Contributor

Alright, thanks for confirming.
We will close this issue, but please feel free to reopen it if you would like to follow up.
Thanks,
Azure IoT SDK Team

@az-iot-builder-01
Copy link
Collaborator

@jebrando, @RivaStyle, @ewertons, thank you for your contribution to our open-sourced project! Please help us improve by filling out this 2-minute customer satisfaction survey

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

7 participants