Various bugfixes and cleanups in the support for federated programs #323

edwardalee · 2023-12-21T00:10:17Z

This replaces PR #319 and #317 with a cleaner commit history. There is a companion PR in lingua-franca.

This PR removes absent messages except where they are absolutely needed, namely in zero-delay cycles.

This PR also improves the port search done by the RTI if it cannot bind to the specified or default port. This previously did not work and resulted in federates taking a long time to fail as they searched over possible ports.

This PR also fixes two issues that can cause the RTI to deadlock during shutdown, particularly if a task is killed with a SIGINT during execution. Both of these issues had previously been found by @erlingrj, but were fixed only on the federate side, not on the RTI side.

The first is that if a write to a socket fails because of a broken pipe, a SIGPIPE signal is issued, and, by default, the process is terminated. The reason for this is that a common use of sockets is to pipe one process's output to another, e.g. `foo | bar`, and if the second process crashes, you want the first process to exit. However, in the RTI (and the federates), there is a mutex lock that is held when writing to an outgoing socket (to prevent multiple threads from writing simultaneously to the socket). So, what can happen is that the process acquires the mutex and calls write(), but then write() fails, and before it can return, a registered termination function is called. That termination function tries to acquire the same mutex in order to close the sockets. The result is a deadlock.
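One common way to keep a failed write from terminating the process is to ignore SIGPIPE and check `errno` for `EPIPE` instead. The sketch below illustrates that idea only; the helper names are hypothetical, and the description does not say whether the PR uses `signal(SIGPIPE, SIG_IGN)` or some other mechanism.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

// Hypothetical helper (not the reactor-c API): ignore SIGPIPE process-wide so
// that write() on a broken pipe or socket returns -1 with errno == EPIPE
// instead of terminating the process. Returns 0 on success, -1 on failure.
int ignore_sigpipe(void) {
    return signal(SIGPIPE, SIG_IGN) == SIG_ERR ? -1 : 0;
}

// Hypothetical helper: attempt one write and report whether it failed
// because the peer has gone away.
bool write_hit_broken_pipe(int fd, const void* buf, size_t len) {
    assert(buf != NULL);
    ssize_t n = write(fd, buf, len);
    return n < 0 && errno == EPIPE;
}
```

With the signal ignored, the thread that called write() regains control, so it can release the socket mutex and report the error before any termination code runs.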

The second problem is similar, but only occurs if tracing is turned on. When a process exits, the termination function tries to acquire a lock to flush tracing output, but that lock may be held by a thread that has exited anomalously while in a critical section.

It is possible that either of these problems could cause CI tests to hang rather than time out. When they time out, the CI tools try to kill the process with a SIGINT, and hence could trigger one of the above deadlocks, thereby failing to kill the process. The CI job will then run for the full six hours before being cancelled.

This PR also simplifies what is done when an abnormal termination occurs, e.g. SIGINT. In particular, it avoids acquiring any mutex locks and does not bother freeing memory (the memory will be freed by the OS anyway). The reason for freeing memory on normal termination is so that valgrind or similar tools can be used to check for memory leaks, but checking for memory leaks in a program that gets terminated with Control-C is not reasonable.
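The split between the two termination paths can be sketched as follows. This is a minimal illustration under assumptions, not the PR's code: `socket_mutex`, `terminate_normally`, and the other names are hypothetical stand-ins, and the demo function exists only to show that a process interrupted while holding the mutex still exits.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Stand-in for the mutex that guards outgoing sockets (name is an assumption).
static pthread_mutex_t socket_mutex = PTHREAD_MUTEX_INITIALIZER;

// Abnormal path: the interrupted thread may hold socket_mutex, so take no
// locks and free nothing. _exit() is async-signal-safe and skips atexit
// handlers; the OS reclaims memory and descriptors.
static void on_sigint(int signum) {
    (void)signum;
    _exit(EXIT_FAILURE);
}

// Normal path: no thread is stuck in a critical section, so locking is safe.
void terminate_normally(void) {
    int rc = pthread_mutex_lock(&socket_mutex);
    assert(rc == 0);
    (void)rc;
    // Close sockets and free memory here so leak checkers stay quiet.
    pthread_mutex_unlock(&socket_mutex);
}

// Install the SIGINT handler. Returns 0 on success, -1 on failure.
int install_sigint_handler(void) {
    struct sigaction sa;
    sa.sa_handler = on_sigint;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(SIGINT, &sa, NULL);
}

// Demo: a child process is interrupted while holding the mutex.
// Returns the child's exit status, or -1 if it did not exit normally.
int exit_status_of_interrupted_child(void) {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        if (install_sigint_handler() != 0) _exit(2);
        pthread_mutex_lock(&socket_mutex);
        raise(SIGINT);  // the handler runs while the mutex is held
        _exit(3);       // not reached if the handler ran
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status)) return -1;
    return WEXITSTATUS(status);
}
```

If the handler instead tried to lock `socket_mutex` before exiting, the child in the demo would block forever, which is the deadlock described above.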

This PR also fixes some problems that could occur when safe-to-process violations occur and, in case the problems resurface in the future, causes the program to error out rather than deadlock.

Prevent exiting via SIGPIPE on socket write failure and instead handle the error. Distinguish normal termination from interrupted termination and avoid mutexes in the latter and avoid lf_print in the former. Remove unnecessary absent messages using EIMT and EIMT_strict. Eliminate the bogus port search algorithm (which never worked) and just use DEFAULT_PORT or an override given on the command line or an `at` clause. Also, calloc instead of malloc federate info so that pointers are reliably NULL. Also, added lf_print_error_system_failure utility to print system call error information.
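The last two points can be illustrated with a small sketch. The struct and both functions below are hypothetical stand-ins, not the real federate info type or the real `lf_print_error_system_failure`; they only show why zero-initialized allocation helps and how a system call error can be reported with its `errno` text. (The C standard does not strictly guarantee that all-zero bytes represent a null pointer, but this holds on mainstream platforms.)

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical record; the real federate info struct has different fields.
typedef struct {
    int socket_fd;
    unsigned char* in_buffer;
    char* hostname;
} fed_info_t;

// calloc zero-fills the allocation, so the pointer fields start out NULL and
// cleanup code can safely test them or pass them to free(). With malloc they
// would hold indeterminate values.
fed_info_t* new_fed_info(void) {
    return calloc(1, sizeof(fed_info_t));
}

// Hypothetical formatter: combine a caller-supplied description with the
// system's text for an errno value. Returns the snprintf result.
int format_system_failure(char* out, size_t out_size, const char* what, int err) {
    assert(out != NULL && what != NULL);
    return snprintf(out, out_size, "%s: %s (errno %d)", what, strerror(err), err);
}
```

A caller would typically capture `errno` immediately after the failing system call and pass it in, since later library calls may overwrite it.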

Distinguish normal termination from interrupted termination and avoid mutexes in the latter and avoid lf_print in the former. Remove unnecessary absent messages using EIMT and EIMT_strict. Eliminate the bogus port search algorithm (which never worked) and just use DEFAULT_PORT or an override given on the command line or an `at` clause. Replace last_time field on a trigger with last_tag. Also, implement and use lf_print_error_system_failure to report system call errors.

lhstrh

Looks great! Thanks for the clean up...

…en received

* Federate sends NEVER_TAG in RESIGN to indicate error and RTI returns an error code.
* RTI reads the tag on the RESIGN message always, not just if tracing is enabled.
* Avoid attempting to close stdin, which results from faulty initialization of array.
* More systematic socket shutdown process that ensures acknowledgements are received.

petervdonovan · 2024-01-11T01:17:03Z

Is clock synchronization broken in this branch? It looks like here it is assumed that the return value of write_to_socket is the number of bytes written, which seems to be no longer true due to refactoring in net_util.c.

edwardalee · 2024-01-11T15:32:16Z

> Is clock synchronization broken in this branch? It looks like here it is assumed that the return value of write_to_socket is the number of bytes written, which seems to be no longer true due to refactoring in net_util.c.

My bad. I've pushed a fix. Thanks for spotting this. Not sure how I missed it.
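The kind of mismatch discussed here can be sketched as follows. The thread does not spell out the new return convention of write_to_socket, so this example assumes a status code (0 on success, -1 on failure); `write_fully` and `send_succeeded` are hypothetical names, not the functions in net_util.c.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

// Hypothetical writer: keeps calling write() until all bytes are sent.
// Returns 0 on success and -1 on failure (errno is set by write()).
int write_fully(int fd, const unsigned char* buf, size_t len) {
    assert(buf != NULL);
    size_t sent = 0;
    while (sent < len) {
        ssize_t n = write(fd, buf + sent, len - sent);
        if (n < 0) {
            if (errno == EINTR) continue;  // interrupted, try again
            return -1;
        }
        sent += (size_t)n;
    }
    return 0;
}

// Caller-side check under the assumed convention. Comparing the result with
// len, as code written for a byte-count convention would, treats every
// successful write as a failure.
int send_succeeded(int fd, const unsigned char* buf, size_t len) {
    return write_fully(fd, buf, len) == 0;
}
```

Call sites written against a byte-count return value need to be updated together with the function, which is what the fix above addresses for clock synchronization.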

petervdonovan

I still need to review federate.c, but so far I agree that this PR is bringing many improvements.

core/federated/RTI/main.c

core/federated/RTI/rti_common.c

core/trace.c

core/utils/util.c

core/reactor_common.c

Co-authored-by: Peter Donovan <33707478+petervdonovan@users.noreply.github.com>

petervdonovan

OK, finally signing off on the review of this PR. Thanks for waiting, I wanted to make sure I understood what changed.

Overall the changes look like a clear improvement.

core/federated/federate.c

edwardalee added 4 commits December 20, 2023 15:18

Use lf_print_error_system_failure

e41eb63

Point to lingua-franca/federated-cleanup

fb47b2b

edwardalee added the enhancement, bugfix, and federated labels Dec 21, 2023

edwardalee requested a review from lhstrh December 21, 2023 00:10

lhstrh changed the title ~~Federated cleanup~~ Various bugfixes and clean ups in the support for federated programs Dec 21, 2023

lhstrh approved these changes Dec 21, 2023

This was referenced Dec 21, 2023

Port search removal #319

Closed

Remove absent messages #317

Closed

lhstrh changed the title ~~Various bugfixes and clean ups in the support for federated programs~~ Various bugfixes and cleanups in the support for federated programs Dec 21, 2023

lhstrh mentioned this pull request Dec 21, 2023

Various bugfixes and cleanups in the support for federated programs lf-lang/lingua-franca#2140

Merged

edwardalee added 16 commits December 20, 2023 16:44

Fix possible segfault on tracing termination

c1c755d

Remove one more deadlock risk

548e278

Fix compile error and bogus comparison

444ebee

Prevent sending redundant reply to stop request

ab7605e

Fixed compile error

b9b17af

Treat the stop request from the RTI as if a local stop request had been received

d451273

Adjust port binding retries to realistic times

5bad8b9

RTI sends RESIGN on abnormal termination

4875564

Free environment only after all logging and debug statements

59ab5d2

Major refactoring of network functions

24dab5a

Send all messages to stdout, not stderr

6a1e313

Allow scheduling at current time before execution starts

a31a5d4

Better handling of startup

b849176

Made execution_started an environment flag

535cacb

Prevent spurious error at start

e17ee9a

Fixed use of write_to_socket

a961d9c

petervdonovan and others added 3 commits January 12, 2024 19:06

Fix deadlock caused by STP violation

9a797bb

Merge branch 'main' into federated-cleanup

08309a9

Removed outdated comments

5e1d9bb

petervdonovan mentioned this pull request Jan 14, 2024

Add linter and ensure that code is warning-free #268

Open

petervdonovan reviewed Jan 14, 2024

edwardalee and others added 13 commits January 14, 2024 09:57

Update core/federated/RTI/main.c

f6e090d

Co-authored-by: Peter Donovan <33707478+petervdonovan@users.noreply.github.com>

Update core/federated/RTI/rti_common.c

f6e685e

Co-authored-by: Peter Donovan <33707478+petervdonovan@users.noreply.github.com>

Absorb delay functionality into lf_tag_add()

edfde09

Clarify comments for eimt_strict()

161f00a

Print error on failure to write trace file

fd05ada

Comment only

f4ab3d8

Update core/federated/RTI/rti_remote.c

e1783f1

Co-authored-by: Peter Donovan <33707478+petervdonovan@users.noreply.github.com>

Comment only

4b7c940

Comment only

5ff00a2

Move freeing of local RTI to termination function

193bd66

Don't exit immediately on federate failure

7f84a33

Clean up error handling in receive_and_check_fed_id_message

753d79c

Do not overwrite NET with message tag unless less

e44d284

petervdonovan approved these changes Jan 18, 2024

Comments only

c37968d

edwardalee mentioned this pull request Jan 20, 2024

Bugfix to avoid deadlock on STP violation #325

Closed

edwardalee added 3 commits January 20, 2024 14:54

Comments only

30601a8

Trace before write and after read

ea398c7

Do not acquire mutex during abnormal termination

6e4af8e

edwardalee merged commit 36d3249 into main Jan 22, 2024
28 checks passed

edwardalee deleted the federated-cleanup branch January 22, 2024 01:39

erlingrj mentioned this pull request Apr 12, 2024

read_from_socket should always retry if it returns with errno=EAGAIN #408

Open
