Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add shutdown_socket() function for follow up of #505 #506

Merged
merged 38 commits into from
Mar 9, 2025
Merged

Conversation

Jakio815
Copy link
Collaborator

@Jakio815 Jakio815 commented Dec 20, 2024

This PR follow ups #505.
Please merge with lf-lang/lingua-franca#2474.

New Integrated Function: shutdown_socket()

int shutdown_socket(int* socket, bool read_before_closing);

Arguments

  • socket: A pointer to the socket descriptor that needs to be shut down and closed.
  • read_before_closing: A boolean indicating whether the socket should read any remaining incoming data before closing:
    • true: Initiates a graceful shutdown by sending a FIN packet (SHUT_WR) and waits for EOF (0-length message) from the peer.
    • false: Immediately shuts down both reading and writing directions (SHUT_RDWR).

Return Values

  • 0: Indicates successful shutdown and closure of the socket.
  • -1: Indicates a failure during shutdown or closure, with errno set to describe the specific error.

Function Description
This function gracefully shuts down and closes a socket.

  • When read_before_closing is false:
    • The function calls shutdown(SHUT_RDWR) to immediately stop both sending and receiving data. Then calls close().
  • When read_before_closing is true:
    • The function calls shutdown(SHUT_WR) to signal the end of writing but allows reading to continue.
    • It waits for the peer to send an EOF (indicated by read() returning 0), and discards all received bytes.

This function holds a new mutex named shutdown_mutex, which ensures preventing deadlocks.

Refactoring on close_inbound_socket() and close_outbound_socket() from federate.c

close_inbound_socket()

The original code looked not updated. There were no use case when the argument flag was not 1. I removed the flag argument, and all unused code.

close_outbound_socket()

The original flags were only set by this one line code.
int flag = _lf_normal_termination ? 1 : -1;
So it depends on _lf_normal_termination. However, this function can directly use _lf_normal_termination, so I removed the flag, and directly used _lf_normal_termination.

Minor implementation details for macOS.

When the federate sends the MSG_TYPE_RESIGN signal, the RTI calls shutdown(SHUT_WR). Then, in the federate side, the thread from listen_to_rti_TCP()'s read_from_socket()'s read() returns with a EOF. Then, it calls shutdown_socket().

However, the shutdown() fails, with an errno Socket is not connected on only MacOS. even though the connection is half open. However, since the connection is half p[em, the federate should still send a FIN_ACK packet to the RTI, because the RTI is blocked on the read().

Thus, the federate does not return on failure of shutdown(), and calls the close(), to ensure the FIN_ACK packet is sent.

To summarize, even when shutdown() fails, the code will try close(), to ensure the socket is closed.

Minor addition of logic in federate.c's handle_tagged_message()

In federate.c's handle_tagged_message(), when the received message already missed the tag, it closes the inbound socket, however does not return -1 for failure of reading the message. This is logic is added.

Some mutex locks removed.

socket_mutex was only used as federate's inbound socket mutex. Especially, it was only used when closing inbound sockets. The new shutdown_socket() function holds its own mutex, so removed the duplication of mutex locks.

@Jakio815 Jakio815 marked this pull request as ready for review December 21, 2024 01:18
@Jakio815 Jakio815 changed the title Draft: Add shutdown_socket() function. Add shutdown_socket() function. Dec 21, 2024
@Jakio815 Jakio815 changed the title Add shutdown_socket() function. Add shutdown_socket() function for follow up of #505 Dec 21, 2024
Copy link
Member

@hokeun hokeun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice refactoring. Just left minor suggestions.

@Jakio815
Copy link
Collaborator Author

Jakio815 commented Dec 22, 2024

The macOS test is not passing for some unknown reason.

The CI test log seems to not work starting from Running: src/federated/DistributedCountPhysical.lf (28%), however if you run the federated tests on the local machine, it actually doesn't work starting from Dataflow.lf.

The reason for not passing is that the program blocks on the final shutdown phase.

First, this is how the LF code is intended to work.

  1. The federate sends a RESIGN signal.
  2. The RTI calls shutdown(socket, SHUT_WR), and then immediately calls read()
    • The shutdown() sends a FIN packet to the federate.
  3. The federate was blocked on read() to listen messages from the RTI.
    • The federate receives the FIN packet, and finishes reading the data in the socket buffer, and then returns EOF.
  4. The federate will also send a FIN packet and the RTI's read() will return 0.

This normally works on Linux, thus passing all tests. However, this fails on Mac.
Where it fails is exactly in step 3. The federate received a FIN packet, and the read() should return EOF, however it does not, and gets blocked forever.

I also checked that the FIN packet was properly sent, and the ACK from the federate to the RTI was also sent.

// netstat -ant | grep 15045
tcp4       0      0  127.0.0.1.15045        127.0.0.1.64842        FIN_WAIT_2
tcp4       0      0  127.0.0.1.64842        127.0.0.1.15045        CLOSE_WAIT

The RTI(15045) is in FIN_WAIT_2 state, which means that it sent the FIN, and received the ACK.
The federate(64842) is in CLOSE_WAIT which means that the remote side's(RTI) connection is closed, and waiting to connect its own connection.

This error can be reproduced by the lingua-franca's repo on branch refactor-only-comm-type and reactor-c branch on shutdown, and it should be done on a MacOS machine.

@hokeun @edwardalee Could anyone help investigating this problem?

@Jakio815 Jakio815 changed the base branch from refactor-only-comm-type to main January 8, 2025 04:45
@Jakio815 Jakio815 requested a review from hokeun January 24, 2025 18:43
Copy link
Member

@hokeun hokeun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice and clean!

@Jakio815
Copy link
Collaborator Author

@edwardalee Would you please review this?

Copy link
Contributor

@edwardalee edwardalee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've requested what I think is just a documentation change, but also a sanity check to make sure the right mutex is held by all callers of this new function.

Copy link
Contributor

@edwardalee edwardalee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@edwardalee edwardalee added this pull request to the merge queue Mar 9, 2025
Merged via the queue into main with commit 0923caa Mar 9, 2025
30 checks passed
@Jakio815
Copy link
Collaborator Author

@edwardalee Please merge with lf-lang/lingua-franca#2474.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants