-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Various bugfixes and cleanups in the support for federated programs #323
Conversation
Prevent exiting via SIGPIPE on socket write failure and instead handle the error. Distinguish normal termination from interrupted termination and avoid mutexes in the latter and avoid lf_print in the former. Remove unnecessary absent messages using EIMT and EIMT_strict. Eliminate the bogus port search algorithm (which never worked) and just use DEFAULT_PORT or an override given on the command line or an `at` clause. Also, calloc instead of malloc federate info so that pointers are reliably NULL. Also, added lf_print_error_system_failure utility to print system call error information.
Distinguish normal termination from interrupted termination and avoid mutexes in the latter and avoid lf_print in the former. Remove unnecessary absent messages using EIMT and EIMT_strict. Eliminate the bogus port search algorithm (which never worked) and just use DEFAULT_PORT or an override given on the command line or an `at` clause. Replace last_time field on a trigger with last_tag. Also, implement and use lf_print_error_system_failure to report system call errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Thanks for the clean up...
* Federate sends NEVER_TAG in RESIGN to indicate error and RTI returns an error code. * RTI reads the tag on the RESIGN message always, not just if tracing is enabled. * Avoid attempting to close stdin, which results from faulty initialization of array. * More systematic socket shutdown process that ensures acknowledgements are received.
Is clock synchronization broken in this branch? It looks like here it is assumed that the return value of |
My bad. I've pushed a fix. Thanks for spotting this. Not sure how I missed it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still need to review federate.c
, but so far I agree that this PR is bringing many improvements.
Co-authored-by: Peter Donovan <33707478+petervdonovan@users.noreply.github.com>
Co-authored-by: Peter Donovan <33707478+petervdonovan@users.noreply.github.com>
Co-authored-by: Peter Donovan <33707478+petervdonovan@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, finally signing off on the review of this PR. Thanks for waiting, I wanted to make sure I understood what changed.
Overall the changes look like a clear improvement.
This replaces PR #319 and #317 with a cleaner commit history. There is a companion PR in lingua-franca.
This PR removes absent messages except where they are absolutely needed, namely in zero-delay cycles.
This PR also improves the port search done by the RTI if it cannot bind to the specified or default port. This previously did not work and resulted in federates taking a long time to fail as they searched over possible ports.
This PR also fixes two issues that can cause the RTI to deadlock during shutdown, particularly if a task is kill with a SIGINT during execution. Both of these issues had been previously found by @erlingrj, but fixed only the federate side, not on the RTI side.
The first is that if a write to a socket fails because of a broken pipe, a SIGPIPE signal is issued, and, by default, the process is terminated. The reason for this is that a common use of sockets is to pipe one process's output to another, e.g.
foo | bar
, and if the second process crashes, you want the first process to exit. However, in the RTI (and the federates), there is a mutex lock that is held when writing to an outgoing socket (to prevent multiple threads from writing simultaneously to the socket). So, what can happen is that the process acquires the mutex and calls write(), but then write() fails, and before it can be return, a registered termination function is called. That termination function tries to acquire the same mutex in order to close the sockets. The result is a deadlock.The second problem is similar, but only occurs if tracing is turned on. When a process exits the termination function tries to acquire a lock to flush tracing output, but that lock may be held by a thread that has exited anomalously while in a critical section.
It is possible that either of these problems could cause CI tests to hang rather than time out. When they time out, the CI tools try to kill the process with a SIGINT, and hence could trigger one of the above deadlocks, thereby failing to kill the process. The CI job will then run for the full six hours before being cancelled.
This PR also simplifies what is done when an abnormal termination occurs, e.g. SIGINT. In particular, it avoids acquiring any mutex locks and does not bother freeing memory (the memory will be freed by the OS anyway). The reason for freeing memory on normal termination is so that valgrind or similar tools can be used to check for memory leaks, but checking for memory leaks in a program that gets terminated with Control-C is not reasonable.
This PR also fixes some problems that could occur when safe-to-process violations occur and, in case problem resurface in the future, causes the program to error out rather than to deadlock.