
[BUG] Invalid calls to vSocketClose() when system bogged down and multiple TCP ports are closed all at once #570

Closed
phelter opened this issue Nov 8, 2022 · 10 comments
Assignees
Labels
bug Something isn't working

Comments

@phelter

phelter commented Nov 8, 2022

Describe the bug

When the system is bogged down (CPU oversubscribed), multiple TCP sockets are open, and all of them are closed at once, there are cases where extra calls to vSocketClose() are performed on sockets that have already been closed, operating on stale socket data.

After much review of the environment and of the message queue feeding the FreeRTOS-Plus-TCP IP task, I narrowed this down to a bug in the lifetime management of sockets in the socket layer.

Due to the use of vSocketCloseNextTime(), a socket may be closed twice:

  • once by the TCP layer calling vSocketCloseNextTime( pxSocket ), which stores pxSocket in the xSocketToClose global variable; the socket is then actually closed by a later call to vSocketCloseNextTime( NULL ), performed first thing in prvIPTask();
  • and once by the user calling FreeRTOS_closesocket(), since the user still holds a local reference to the opened FreeRTOS_Socket_t.

I believe that at some point the TCP code issues a vTCPStateChange( pxSocket, eCLOSE_WAIT ) without any intervention by the user. The user then also calls FreeRTOS_closesocket() to terminate the socket, but by then the socket has already been closed.

Because vSocketClose() deletes its resources, closing an already-closed socket triggers a DataAbort interrupt: the code dereferences a bad address and attempts to delete a list item twice.

Target

  • Development board: [e.g. Zynq]
  • Instruction Set Architecture: [ARM CortexA9]
  • IDE and version: VSCode + CMake
  • Toolchain and version: arm-none-eabi-gcc 10.3.1

Host

  • Host OS: Ubuntu 18.04
  • Version: [e.g. Mojave 10.14.6]

To Reproduce

  • Unable to provide an example at the moment, but inspection of the code shows this scenario is in fact possible.

Expected behavior
The socket layer should maintain a well-defined lifetime for every socket created by the user, and should ensure that if a socket is closed for internal reasons, the same action is never performed again.

Screenshots
None

Wireshark logs
None

Additional context
Debug - additional messages from Close Execution:

vSocketClose: Caller name: 1315BCS
vSocketClose: pxSocket: 0x0036FDD0
vSocketClose: pxSocket.usLocalPort: 12121
Lost: Socket 12121 now has 0 / 5 children
FreeRTOS_closesocket[12121 to C0A80064ip:49305]: buffers 64 socks 4
______________________prvProcessIPEventsAndTimers: vSocketClose about to get called______________________
vSocketClose: Caller name: 12A184S
vSocketClose: pxSocket: 0x00308870
vSocketClose: pxSocket.usLocalPort: 12345
Lost: Socket 12345 now has 0 / 5 children
FreeRTOS_closesocket[12345 to C0A80064ip:49304]: buffers 64 socks 3
______________________prvProcessIPEventsAndTimers: vSocketClose about to get called______________________
vSocketClose: Caller name: 12A184S
vSocketClose: pxSocket: 0x0036FDD0
vSocketClose: pxSocket.usLocalPort: 12602
FreeRTOS_closesocket: xSocket: 0x0036FDD0
FreeRTOS_closesocket: xSocket.usLocalPort: 12602

Note that vSocketClose() is called twice on pxSocket 0x0036FDD0, triggered by the close of another socket.

Where a dump of the callers is:

0x0012cfe0 vSocketClose(): FreeRTOS_Sockets.c, line 1550	
0x001315bc vSocketCloseNextTime(): FreeRTOS_TCP_IP.c, line 126	
0x0013d7c0 vCheckNetworkTimers(): FreeRTOS_IP_Timers.c, line 243	
0x00129ffc prvProcessIPEventsAndTimers(): FreeRTOS_IP.c, line 290	
0x00129fe8 prvIPTask(): FreeRTOS_IP.c, line 271	

So the first close of the first socket is due to the vSocketCloseNextTime() and the second one is due to user request. On the second close, a DataAbort occurs inside the uxListRemove function.


@phelter phelter added the bug Something isn't working label Nov 8, 2022
phelter added a commit to phelter/FreeRTOS-Plus-TCP that referenced this issue Nov 9, 2022
@htibosch

Hello @phelter, thank you for this report.

There is something I don't understand: when vSocketCloseNextTime() is called, the socket is still an orphan; it has not been passed to the application yet. Therefore it is both necessary and safe for the IP-task to call vSocketClose() (in a delayed manner).

This is where vSocketCloseNextTime() is called the first time:

if( ( eTCPState == eCLOSED ) ||
    ( eTCPState == eCLOSE_WAIT ) )
{
    if( ( pxSocket->u.xTCP.bits.bPassQueued != pdFALSE_UNSIGNED ) ||
        ( pxSocket->u.xTCP.bits.bPassAccept != pdFALSE_UNSIGNED ) )
    {
        if( pxSocket->u.xTCP.bits.bReuseSocket == pdFALSE_UNSIGNED )
        {
            vSocketCloseNextTime( pxSocket );
        }
    }
}

So the TCP status must be eCLOSED or eCLOSE_WAIT, it has not been passed to the application, and it is not a reusable socket.

The time between the following statements is very small:

    vSocketCloseNextTime( pxSocket );
    ...
    vSocketCloseNextTime( NULL );

    /* Now the IP-task will block again */
    xQueueReceive( xNetworkEventQueue, ...);

I would be curious to see/understand your TCP-related code, especially the API calls, and the call to FreeRTOS_closesocket().

Why does your code call FreeRTOS_closesocket(), what has happened just before that?

In the PR you show a sequence of events in a log. Do you have the same type of log without applying the patch?

And just out of curiosity: what priorities have you assigned to the IP-task and to the tasks that call the TCP APIs?

@phelter

phelter commented Nov 10, 2022

@htibosch The use case is:

There are two server sockets that are created - each in their own task defined by prvConnectionListeningTask and with different ports and different task priorities:

Server A: - is at priority 14 - (configMAX_PRIORITIES - 2)
Server B: - is at priority 1 - (tskIDLE_PRIORITY + 1)

The first thing that gets executed within each of the tasks is vCreateTCPServerSocket

void prvConnectionListeningTask( void *pvParameters )
{
    struct freertos_sockaddr xClient;
    Socket_t                 xListeningSocket = FREERTOS_INVALID_SOCKET;
    Socket_t                 xConnectedSocket = FREERTOS_INVALID_SOCKET;
    socklen_t xSize = sizeof( xClient );
    serverTaskInfo_t        *pxServerTaskInfo = NULL;

    pxServerTaskInfo = ( serverTaskInfo_t* ) pvParameters;

    xListeningSocket = vCreateTCPServerSocket(pxServerTaskInfo->portNum);
    FreeRTOS_listen( xListeningSocket, pxServerTaskInfo->connectionBacklog );

    for( ;; )
    {
        xConnectedSocket = FreeRTOS_accept( xListeningSocket, &xClient, &xSize );
        configASSERT( xConnectedSocket != FREERTOS_INVALID_SOCKET ); // NOLINT performance-no-int-to-ptr

        xTaskCreate( pxServerTaskInfo->serverHandlerTask,
                     pxServerTaskInfo->taskName,
                     pxServerTaskInfo->stackSize,
                     ( void * ) xConnectedSocket,
                     pxServerTaskInfo->taskPriority,
                     NULL );
    }
}

static Socket_t vCreateTCPServerSocket( uint16_t pPortNum )
{
    struct freertos_sockaddr xBindAddress;
    Socket_t                 xListeningSocket = FREERTOS_INVALID_SOCKET;
    static const TickType_t xReceiveTimeOut = portMAX_DELAY;

#if( ipconfigUSE_TCP_WIN == 1 )
    WinProperties_t xWinProps;

    xWinProps.lTxBufSize = ipconfigTCP_TX_BUFFER_LENGTH;
    xWinProps.lTxWinSize = ipconfigTCP_WIN_SEG_COUNT;
    xWinProps.lRxBufSize = ipconfigTCP_RX_BUFFER_LENGTH;
    xWinProps.lRxWinSize = ipconfigTCP_WIN_SEG_COUNT;
#endif /* ipconfigUSE_TCP_WIN */

    xListeningSocket = FreeRTOS_socket( FREERTOS_AF_INET, FREERTOS_SOCK_STREAM, FREERTOS_IPPROTO_TCP );
    configASSERT( xListeningSocket != FREERTOS_INVALID_SOCKET );

    FreeRTOS_setsockopt( xListeningSocket, 0, FREERTOS_SO_RCVTIMEO, &xReceiveTimeOut, sizeof( xReceiveTimeOut ) );

#if( ipconfigUSE_TCP_WIN == 1 )
    FreeRTOS_setsockopt( xListeningSocket, 0, FREERTOS_SO_WIN_PROPERTIES, ( void * ) &xWinProps, sizeof( xWinProps ) );
#endif /* ipconfigUSE_TCP_WIN */

    xBindAddress.sin_port = FreeRTOS_htons( pPortNum );
    FreeRTOS_bind( xListeningSocket, &xBindAddress, sizeof( xBindAddress ) );
    return xListeningSocket;
}

As you can see the listening socket spawns an independent task for each accepted connection at the same priority as the listening server socket.

The function that is passed as: pxServerTaskInfo->serverHandlerTask is:

void vServerConnectionInstance( void *pvParameters )
{
    Socket_t                xSocket = FREERTOS_INVALID_SOCKET;
    static char cRxedData[ TCP_RX_DATA_BUFFER_SIZE ];
    CR02BaseType_t          lBytesReceived  = 0;
    static const TickType_t xReceiveTimeOut = pdMS_TO_TICKS( CONFIG_SERVER_TCP_TIMEOUT_MS );
    static const TickType_t xSendTimeOut =    pdMS_TO_TICKS( CONFIG_SERVER_TCP_TIMEOUT_MS );

    /* The socket has already been created and connected before
    being passed into this RTOS task using the RTOS task's parameter. */
    xSocket = ( Socket_t ) pvParameters;

    FreeRTOS_setsockopt( xSocket, 0, FREERTOS_SO_RCVTIMEO, &xReceiveTimeOut, sizeof( xReceiveTimeOut ) );
    FreeRTOS_setsockopt( xSocket, 0, FREERTOS_SO_SNDTIMEO, &xSendTimeOut, sizeof( xSendTimeOut ) );
    for( ;; )
    {
        lBytesReceived = FreeRTOS_recv( xSocket, &cRxedData, TCP_RX_DATA_BUFFER_SIZE, 0 );

        if( lBytesReceived > 0 )
        {
            prvProcessRXConfigData( xSocket, cRxedData, (size_t)lBytesReceived );
        }
        else if( lBytesReceived == 0 )
        {
        }
        else
        {
            FreeRTOS_shutdown( xSocket, FREERTOS_SHUT_RDWR );
            break;
        }
    }

    while( FreeRTOS_recv( xSocket, &cRxedData, TCP_RX_DATA_BUFFER_SIZE, 0 ) >= 0 )
    {
        vTaskDelay( pdMS_TO_TICKS( 250 ) );
    }

    FreeRTOS_closesocket( xSocket );

    vTaskDelete( NULL );
}

This is the only place FreeRTOS_closesocket() is used in the entire code base, and it is only used on the connection sockets. There is no FreeRTOS_closesocket() performed on the listening sockets, as they are assumed to stay open forever (until reset). Based on the logs, the bug only occurs when the LAST connection socket is being closed.

The failure happens because these are not REUSE sockets, and some internal state of the socket in the TCP layer is being used to decide the socket-layer lifetime. The changes provided ensure that the socket layer maintains its own knowledge of a socket's lifetime and acts accordingly (regardless of the TCP layer's state definition).

For logs: the same log output occurs when the patch is NOT applied, but when the vSocketCloseNextTime( NULL ) is performed, a DataAbort ISR occurs because the contents of the FreeRTOS_Socket_t have been overwritten with 0xa5a5a5a5 fill data instead of NULLs, which indicates a double-free scenario.

I believe you have provided the reason for this bug in your previous comment:

The time between the following statements is very small :

    vSocketCloseNextTime( pxSocket );
    ...
    vSocketCloseNextTime( NULL );

    /* Now the IP-task will block again */
    xQueueReceive( xNetworkEventQueue, ...);

If this is an assumption that must hold true, then there is a bug here. xSocketToClose is a global that is affected by any socket closure, so if another socket needs to use xSocketToClose, the socket previously stored there will be closed outside of that socket's own context. One must ensure those statements, and anything in between, are atomic (e.g. protected by disabling interrupts); otherwise a task can be preempted between them and cause a socket to close even though it is not in a closable state. It is just less likely to happen when the two calls are close together (including immediately after each other).

There might be a simpler approach: check xSocketToClose inside the vSocketClose() function and, if it refers to the same socket, clear xSocketToClose because it has already been dealt with. However, I wanted to separate the socket layer from the TCP layer and also confirm the lifetime requirements of a socket for the other API calls.

@htibosch

htibosch commented Nov 11, 2022

Thank you for the code and your comments.

Your assumption about a possible race condition does not seem correct to me.

As long as xSocketToClose is set, it is impossible to call vSocketClose() from another task:

  xQueueReceive()                    // returns 'eNetworkRxEvent'
  vSocketCloseNextTime( pxSocket );  // xSocketToClose = pxSocket

  // A competing task calls 'FreeRTOS_closesocket()', it must wait

  vSocketCloseNextTime( NULL );      // pxSocket is closed, xSocketToClose = NULL

  xQueueReceive();                   // returns 'eSocketCloseEvent'
  vSocketClose();                    // Because a 'eSocketCloseEvent' was received

Between two calls to xQueueReceive(), only one message will be handled, no task can come in-between.

The buffer parameter to FreeRTOS_recv() was preceded by an ampersand:

     static char cRxedData[ TCP_RX_DATA_BUFFER_SIZE ];
-    lBytesReceived = FreeRTOS_recv( xSocket, &cRxedData, TCP_RX_DATA_BUFFER_SIZE, 0 );
+    lBytesReceived = FreeRTOS_recv( xSocket, cRxedData, TCP_RX_DATA_BUFFER_SIZE, 0 );

which may lead to data corruption.

I don't know if it is easy to reproduce the problem, but if it is, I would rewrite vSocketClose() to just print that it is being called and show the socket pointer. Also, I would be curious about these flags at the time vSocketClose() is called:

  bPassAccept : 1,   /**< when true, this socket may be returned in a call to accept() */
  bPassQueued : 1,   /**< when true, this socket is an orphan until it gets connected
                      * Why an orphan? Because it may not be returned in a accept() call until it
                      * gets the state eESTABLISHED */
  bReuseSocket : 1,  /**< When a listening socket gets a connection, do not create a new instance but keep on using it */

I had some more minor remarks about the code and attach them here: demo_code.zip

One remark: shutdown() does not have to be called by a TCP server, unless it wants to take the initiative to disconnect from a TCP client.

And also: an invalid socket is defined as a socket that is either NULL or FREERTOS_INVALID_SOCKET. The function xSocketValid() tests for both values.

Could you post the FreeRTOS_IPConfig.h that you used for your project?

@phelter

phelter commented Nov 12, 2022

Hello @htibosch ,

While I agree that the use of &cRxedData vs cRxedData is misleading if you prefer the implicit conversion of an array to a pointer to its first element, it is not an issue here: the array is defined above as a fixed-size array of char, and the data is stored starting at the first location of the array.

Please see ISO/IEC 9899, section 6.3.2.1, paragraph 2.
And the explanation here.
This means &cRxedData evaluates to a pointer to the whole array type char[TCP_RX_DATA_BUFFER_SIZE], whose address coincides with that of the first element. The difference between the two:
Currently implemented: &cRxedData → pointer to char[TCP_RX_DATA_BUFFER_SIZE], whose address is the first char of that buffer.
What you proposed: cRxedData → pointer to char through implicit conversion, which is also the address of the first char of that buffer.

So it is not this.

All I know is that when the code provided in PR #571, which explicitly checks the lifetime of a socket, is put in place, this issue no longer happens.

As mentioned, the sockets returned to the user from the FreeRTOS_accept() API are sockets that have been created and handed to the user. It is also acceptable to close these client sockets at any time using FreeRTOS_closesocket(), even after the far end may have closed the connection. I believe the far end closing the socket before the client does is also a possible cause of the error.

It is very difficult to prove that the state of a TCP socket is correct, or that only sockets NOT provided back to the user will be terminated internally. However, the fact that the logs in the original description show the same socket being closed multiple times (once by the user, and once internally) should be sufficient proof that there is a lifetime issue in this code, and that closing a socket and destroying a socket are two independent actions unless the close is requested by the user.

I have provided the necessary info and a fix for this issue, which has been proven to work on our machine. If you require further proof that the issue exists, I suggest creating a unit test or a proof that a socket returned by FreeRTOS_accept() is never closed internally unless requested by the user via FreeRTOS_closesocket(). My belief is that the very existence of vSocketCloseNextTime() proves this false.

@phelter

phelter commented Nov 12, 2022

@htibosch

One remark shutdown() does not have to be called by a TCP server, unless it wants to take the initiative to disconnect from a TCP client.

Yes, I am aware. I did not use shutdown() in any of the examples associated with this particular bug, and I have used the Berkeley API for over 20 years now.

And also: an invalid socket is defined as a socket that is either NULL or FREERTOS_INVALID_SOCKET. The function xSocketValid() tests for both values.

Yes, I am aware. And I fixed several places in PR #571 where the socket received from the IP task event queue was not being checked for both NULL and FREERTOS_INVALID_SOCKET.

Could you post the FreeRTOS_IPConfig.h that you used for your project?
See Attached: FreeRTOSIPConfig.zip

@htibosch

htibosch commented Nov 12, 2022

Hi @phelter, mind you that I am ( we are ) very grateful that you came up with this issue, and that you spent so much time on it.

About &cRxedData vs cRxedData : I responded too quickly. Your code is of course correct.

And about my remark about shutdown(): don't take it personally. One of my tasks (on the FreeRTOS forum) is to give information to developers, and sometimes I tell people things they already know.

Before we make a repair or a patch that protects against unwanted behaviour, I would like to understand why a socket can "have two owners": the IP-task and the application.
I would like to see a patch that prevents this, instead of one protecting vSocketClose() against "bad behaviour".

About Priorities:

The FreeRTOS+TCP library was written with the silent assumption that the IP-task has a higher priority than any of its users. In fact, we have always recommended this scheme:

Higher : Network Interface (niEMAC_HANDLER_TASK_PRIORITY)
Normal : IP-task (ipconfigIP_TASK_PRIORITY)
Lower : any task that makes use of +TCP

In FreeRTOS_accept(), you can see that the user task is altering the contents of a socket:

vTaskSuspendAll();
{
    if( pxSocket->u.xTCP.bits.bReuseSocket == pdFALSE_UNSIGNED )
    {
        pxClientSocket = pxSocket->u.xTCP.pxPeerSocket;
    }
    else
    {
        pxClientSocket = pxSocket;
    }

    if( pxClientSocket != NULL )
    {
        pxSocket->u.xTCP.pxPeerSocket = NULL;

        /* Is it still not taken ? */
        if( pxClientSocket->u.xTCP.bits.bPassAccept != pdFALSE_UNSIGNED )
        {
            pxClientSocket->u.xTCP.bits.bPassAccept = pdFALSE;
        }
        else
        {
            pxClientSocket = NULL;
        }
    }
}
( void ) xTaskResumeAll();

This is safe to do because the scheduler is temporarily suspended by calling vTaskSuspendAll().

When the user task has a higher priority than the IP-task, we can:

  1. Add a new API for accept() to guarantee a strict sequential order of events ( good idea )
  2. Let the IP-task suspend the scheduler at certain places ( bad idea )
  3. Make it mandatory to assign a higher priority to the IP-task ( doubtful ).

The cause

I think that the following happens:

For an unknown (but likely legitimate) reason, the IP-task puts the socket in status eCLOSED or eCLOSE_WAIT, and orders the deletion of it by calling vSocketCloseNextTime().
The (listening) parent socket still has a reference to the child socket in xTCP.pxPeerSocket.

The user application calls FreeRTOS_accept() and finds a new client socket in xTCP.pxPeerSocket. It quickly discovers that the socket is already closed, and so it calls FreeRTOS_closesocket(), which leads to an exception.

In a solution, I think that xTCP.pxPeerSocket of the parent should be cleared as soon as the IP-task wants to delete the socket.

A TCP connection can be closed because of the following reasons:

eCLOSED:

  • A TCP/RST packet was received.

eCLOSE_WAIT:

  • Malloc failed.
  • Closing because non-activity ( HANG protection ).
  • FIN stage ready ( graceful closure ).
  • Unexpected TCP flags received.
  • Transmission times out after 3 attempts.

Question 1, do you have any idea why the sockets were put into eCLOSED or eCLOSE_WAIT state?

Question 2, could you describe the steps that must be taken to reproduce the problem?

Question 3, you wrote "When the system is bogged down (cpu oversubscribed)"
What are the consequences for the 2 servers when:
IP-task : - is at priority 14 - (configMAX_PRIORITIES - 2) (ipconfigIP_TASK_PRIORITY)
Server A: - is at priority 14 - (configMAX_PRIORITIES - 2)
Server B: - is at priority 1 - (tskIDLE_PRIORITY + 1)
Did they come to a stop during the over-subscription?

Thanks

@phelter

phelter commented Nov 12, 2022

Hello @htibosch, thank you for your response.

Regarding:

And about my remark about shutdown(): don't take it personally. One of my tasks (on the FreeRTOS forum) is to give information to developers, and sometimes I tell people things they already know.

I did not take it personally; however, it would be appreciated if bugs of this nature were treated as guilty until proven innocent, rather than innocent until proven guilty, by the maintainers of the code base. This is a very serious issue, with proof provided that using the API in a valid and acceptable way leads to the catastrophic result of memory being used after it is freed. I have spent a significant number of hours identifying, explaining, resolving and testing the issue, and my hope was that the remainder of the work in supporting this would be taken on by the maintainers of the FreeRTOS-Plus-TCP repo.

Before we make a repair or a patch that protects against unwanted behaviour, I would like to understand why a socket can "have two owners": the IP-task and the application.

A socket does not have two owners (in C++ terms, one site of construction and destruction); however, there are two actors that can close a socket: the user application of the server, which on the server side is also the owner of the socket, and the client, which does so via a close request from the far end, with the IP task acting as a proxy for the client on the server side to complete this action. The IP task is allowed to close the socket, but it is not allowed to destroy it, as that is the job of its one owner.

Question 1, do you have any idea why the sockets were put into eCLOSED or eCLOSE_WAIT state?

Without confirmation - my current theory is:

A socket is returned from FreeRTOS_accept()

  • the pxSocket is Valid, not a ReuseSocket, eTCPState is Listening, and the peerSocket is not Null.
  • this results in
    pxClientSocket = pxSocket->u.xTCP.pxPeerSocket;
    ...
    pxSocket->u.xTCP.pxPeerSocket = NULL;
    ...
    pxClientSocket->u.xTCP.bits.bPassAccept = pdFALSE;
    ( void ) xSendEventStructToIPTask( &xAskEvent, portMAX_DELAY );
    return pxClientSocket;

At this point the pxClientSocket is returned to the user on the server side with:
pxClientSocket->u.xTCP.bits.bPassAccept = pdFALSE_UNSIGNED
pxClientSocket->u.xTCP.bits.bPassQueued = ??

Due to the way the u.xTCP.bits.bPassAccept and bPassQueued are being used and changed, their state may not correctly reflect the actual state of the TCP socket.

These bits are consumed in the following functions in the associated ways:
vTCPStateChange - (Read/Modify/Write) on pre-existing sockets that have been returned to a user
prvTCPSocketCopy - (Write) on new sockets that have not been returned to a user (OK).
prvHandleListen - (Write) on pre-existing sockets
FreeRTOS_accept - (Read/Modify/Write) on socket that has not been returned to a user (OK).
prvTCPStatusAgeCheck - (Read) - pre-existing socket
xTCPCheckNewClient - (Read) - pre-existing socket
vSocketSelect - (Read)- pre-existing socket
prvTCPSetSocketCount - (Read) - pre-existing socket.

vTCPStateChange is called both from API calls, e.g. FreeRTOS_listen(), and from private functions which I believe run in the IP task (though I have not confirmed all of this). This means multiple tasks can read/modify/write this data without a proper mutex locking scheme to ensure the state is consumed atomically.

The code assumes these bits form an atomic, combined state: when both are used together in an if statement, it assumes correct knowledge of bPassQueued and bPassAccept as a single combined state.

eg: the case where:

            if( ( pxSocket->u.xTCP.bits.bPassQueued != pdFALSE_UNSIGNED ) ||
                ( pxSocket->u.xTCP.bits.bPassAccept != pdFALSE_UNSIGNED ) )

This translates into two actions: checking bPassQueued != false and checking bPassAccept != false. In between, the task can be interrupted by another task that changes both to false or true, which means you do not have the combined state. A similar case can be made for checking a value and then changing it, or for writing both values together.

The solution to this is to either:

  • use a mutex on all read, read/modify/write, or write accesses to these state bits of the Socket_t; this would require a mutex per Socket_t, or
  • suspend the scheduler, as is already done in the one spot in FreeRTOS_accept(), around all accesses to these state bits, using vTaskSuspendAll() and xTaskResumeAll()

Question 2, could you describe the steps that must be taken to reproduce the problem?

Unfortunately no; I am working remotely, helping a team that is performing the debugging on the device itself. I believe I have provided the information needed to reproduce the issue, but I cannot share all of the code.

Question 3, you wrote "When the system is bogged down (cpu oversubscribed)"
What are the consequences for the 2 servers when:
IP-task : - is at priority 14 - (configMAX_PRIORITIES - 2) (ipconfigIP_TASK_PRIORITY)
Server A: - is at priority 14 - (configMAX_PRIORITIES - 2)
Server B: - is at priority 1 - (tskIDLE_PRIORITY + 1)
Did they come to a stop during the over-subscription?

No, I don't believe so; they were all being serviced, just not necessarily very often, due to other tasks in the system.

@shubnil

shubnil commented Feb 13, 2023

Hello @phelter, thank you for this report.
Apologies for the long wait.
We recently encountered a few scenarios where multiple frees of a socket are observed.
The scenario is something like this:
if the server task calling FreeRTOS_accept() runs at a priority lower than the IP task's, we observe a double free of the socket, leading to memory corruption, when a RST is received quickly after the connection is established. Here is the sequence of steps leading to the double free:

  1. When a connection is received, this line sets the event flag so that the server task can return from the FreeRTOS_accept call.
  2. Since the server task's priority is lower than the IP task's, it does not run immediately.
  3. The IP task continues and sets u.xTCP.bits.bPassAccept to true.
  4. The client sends a RST, and this line determines that the socket is orphaned because u.xTCP.bits.bPassAccept is true, so it calls vSocketCloseNextTime.
  5. The next run of the IP task calls vSocketCloseNextTime with NULL and deletes the socket.
  6. When the IP task is no longer ready to run, the server task returns from the FreeRTOS_accept call and eventually calls FreeRTOS_closesocket, leading to a double free of the socket.

Please suggest if your observation is something like this.

@shubnil

shubnil commented Feb 13, 2023

We have created change #705 for the above observation. Please check whether it helps solve your scenario as well.

@shubnil shubnil self-assigned this Feb 13, 2023
@shubnil

shubnil commented Feb 21, 2023

Closing this issue as a probable fix is available.
