Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removal of TAN messages and new capability to record in-transit messages in the RTI #61

Merged
merged 39 commits into from
Jun 11, 2022

Conversation

Soroosh129
Copy link
Contributor

@Soroosh129 Soroosh129 commented Apr 4, 2022

This PR removes TAN messages entirely.

Instead, a federate with a physical action (that is connected to a network output) is going to periodically create a dummy event (with the period controlled by coordination-options: {advance-message-interval: 10 msec}) which forces the federate to advance its tag and allow downstream federates to make progress.

After fixing this bug, another bug was exposed in the RTI, in which the RTI could potentially lose track of a federate's actual earliest next event (see this comment for more detail). This caused the RTI to grant incorrect tag advance grant (TAG) messages. This bug was fixed by adding a queue to the RTI that keeps a record of all currently in-transit messages.

@Soroosh129 Soroosh129 requested a review from edwardalee April 4, 2022 17:06
@Soroosh129 Soroosh129 changed the title Hotfix: Fixed an issue wher TAN was being sent incorrectly Hotfix: Fixed an issue where TAN messages were being sent incorrectly Apr 4, 2022
Copy link
Contributor

@edwardalee edwardalee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@Soroosh129
Copy link
Contributor Author

Soroosh129 commented Apr 4, 2022

Actually, I realized that this fix is a bit of a red herring.

The added second check for _lf_bounded_NET(&tag) is checking a modified version of tag, where tag.time = get_physical_time() at some point in the near past, against the current physical time. So it always returns false.

This is effectively disabling TAN, and that "fixed" the issue. However, TAN messages are important to ensure that progress is made.

Background of the problem: TAN messages appear to cause incorrect Tag Advance Grants to be sent by the RTI, causing STP violations in centralized coordination. I'm not sure why. It's very clear they are the issue because commenting out the content of handle_time_advance_notice in the RTI causes the STP violations to go away.

Copy link
Contributor

@edwardalee edwardalee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cosmetic changes look good to me, but the one substantive change is probably not right. Before the change, a TAN would result in calling send_downstream_advance_grants_if_safe for all downstream federates, and after the change only for immediately downstream federates. But at worst, it should be harmless to call it for all downstream federates because send_downstream_advance_grants_if_safe calls send_advance_grant_if_safe, which checks for each federate whether it is actually safe to send a TAG. Since the docs for that latter function say clearly it should be called on all downstream federates, I suspect there was a reason for that. I suggest reverting this change and merging in the cosmetic changes.

core/federated/RTI/rti.c Outdated Show resolved Hide resolved
Soroosh129 and others added 2 commits April 5, 2022 13:02
Race condition is where a NET message from a federate from a previous cycle crosses a message to the federate being forwarded by the RTI, which causes the RTI's view of the NET of the federate to be incorrect.
@edwardalee
Copy link
Contributor

The RTI, as defined in this branch, causes the following tests in federated to lock up and time out: LoopDistributedDouble.lf, PingPongDistributed.lf, LoopDistributedCentralized.lf. Looks like TANs are not being sent when they should be.

@Soroosh129 Soroosh129 changed the title Hotfix: Fixed an issue where TAN messages were being sent incorrectly Remove TAN messages Jun 3, 2022
@Soroosh129 Soroosh129 requested review from edwardalee and lhstrh June 3, 2022 22:33
@Soroosh129
Copy link
Contributor Author

@lhstrh @edwardalee While the number of changed lines appears to be large (+1,134 −697), it is mostly inflated by replacement of tags with spaces. I appreciate your indulgence for this long overdue stylistic change :)

Copy link
Contributor

@edwardalee edwardalee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, except for one potential error raised in the code. Also, the PR is misnamed because it does much more than remove TAN messages. Perhaps "Remove TAN messages and record in-transit messages in the RTI"? I did a double take on in_transit_message_record_q_t not being a pointer, but then I realized it is a pair of pointers, so this seems reasonable to me.

core/federated/RTI/message_record/message_record.c Outdated Show resolved Hide resolved
@Soroosh129 Soroosh129 changed the title Remove TAN messages Remove TAN messages and record in-transit messages in the RTI Jun 4, 2022
in_transit_message_record_t* head_of_in_transit_messages = (in_transit_message_record_t*)pqueue_peek(queue->main_queue);
while (head_of_in_transit_messages != NULL) { // Queue is not empty
// The message record queue is ordered according to the `time` field, so we need to check
// all records with the minimum `time` and find those that have the smallest tag.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand, the reason why this procedure (and the one above it) are complicated is that pqueue priorities have to be 64 bits, which makes it hard to sort items by tag instead of time. This implementation looks like it might be complicated, and it looks like it might have suboptimal time complexity in programs that frequently schedule events a microstep in the future.

I have already suggested that attempts to cram priorities into a single word might not be serving us very well; maybe now is a good time to reconsider that?

Copy link
Contributor Author

@Soroosh129 Soroosh129 Jun 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been part of a long-running and interesting conversation. Would you like to create an issue or a discussion for this?

core/federated/RTI/rti.h Outdated Show resolved Hide resolved
@Soroosh129 Soroosh129 merged commit c1eb156 into main Jun 11, 2022
@Soroosh129 Soroosh129 deleted the hotfix-TAN branch June 11, 2022 05:26
@lhstrh lhstrh changed the title Remove TAN messages and record in-transit messages in the RTI Removal of TAN messages and new capability to record in-transit messages in the RTI Jul 20, 2022
@lhstrh lhstrh added the enhancement Enhancement of existing feature label Jul 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhancement of existing feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants