Removal of TAN messages and new capability to record in-transit messages in the RTI #61
LGTM.
Actually, I realized that this fix is a bit of a red herring. The added second check for This is effectively disabling TAN, and that "fixed" the issue. However, TAN messages are important to ensure that progress is made. Background of the problem: TAN messages appear to cause incorrect Tag Advance Grants to be sent by the RTI, causing STP violations in centralized coordination. I'm not sure why. It's very clear they are the issue because commenting out the content of
The cosmetic changes look good to me, but the one substantive change is probably not right. Before the change, a TAN would result in calling `send_downstream_advance_grants_if_safe` for all downstream federates, and after the change only for immediately downstream federates. But at worst, it should be harmless to call it for all downstream federates because `send_downstream_advance_grants_if_safe` calls `send_advance_grant_if_safe`, which checks for each federate whether it is actually safe to send a TAG. Since the docs for that latter function say clearly it should be called on all downstream federates, I suspect there was a reason for that. I suggest reverting this change and merging in the cosmetic changes.
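To make the "all downstream" versus "immediately downstream" distinction concrete, here is a minimal sketch of a transitive downstream traversal. It is not the RTI's code: the adjacency matrix, `SKETCH_NUM_FEDERATES`, and `sketch_visit_downstream` are hypothetical names invented for illustration. In the real RTI, the per-federate safety check would run at each federate this traversal marks.

```c
#include <stdbool.h>

// Hypothetical federation size, for illustration only.
#define SKETCH_NUM_FEDERATES 4

// adj[i][j] is true if federate j is immediately downstream of federate i.
// Marks every federate reachable from `fed` (transitively downstream) in
// `visited`. The visited array also guards against cycles.
static void sketch_visit_downstream(bool adj[SKETCH_NUM_FEDERATES][SKETCH_NUM_FEDERATES],
                                    int fed,
                                    bool visited[SKETCH_NUM_FEDERATES]) {
    for (int j = 0; j < SKETCH_NUM_FEDERATES; j++) {
        if (adj[fed][j] && !visited[j]) {
            visited[j] = true;
            // A per-federate "is it safe to send a TAG?" check would go here.
            sketch_visit_downstream(adj, j, visited);
        }
    }
}
```

With a chain 0 → 1 → 2, visiting only the immediate neighbors of federate 0 would reach federate 1 alone, whereas this traversal also reaches federate 2.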
The race condition occurs when a NET message sent by a federate in a previous cycle crosses a message that the RTI is forwarding to that federate, which causes the RTI's view of the federate's NET to be incorrect.
The RTI, as defined in this branch, causes the following tests in federated to lock up and time out: LoopDistributedDouble.lf, PingPongDistributed.lf, LoopDistributedCentralized.lf. Looks like TANs are not being sent when they should be.
The new logic does two things:
1. The RTI double-checks that, if we are replacing the NET of a federate with a larger value, it has finished the previously (already) promised NET.
2. The RTI now attempts to send TAGs and PTAGs if it is updating the next event of a federate upon forwarding a message.
@lhstrh @edwardalee While the number of changed lines appears to be large (+1,134 −697), it is mostly inflated by the replacement of tabs with spaces. I appreciate your indulgence for this long overdue stylistic change :)
LGTM, except for one potential error raised in the code. Also, the PR is misnamed because it does much more than remove TAN messages. Perhaps "Remove TAN messages and record in-transit messages in the RTI"? I did a double take on in_transit_message_record_q_t not being a pointer, but then I realized it is a pair of pointers, so this seems reasonable to me.
in_transit_message_record_t* head_of_in_transit_messages = (in_transit_message_record_t*)pqueue_peek(queue->main_queue);
while (head_of_in_transit_messages != NULL) { // Queue is not empty
// The message record queue is ordered according to the `time` field, so we need to check
// all records with the minimum `time` and find those that have the smallest tag.
If I understand correctly, the reason this procedure (and the one above it) is complicated is that pqueue priorities have to be 64 bits, which makes it hard to sort items by tag instead of time. This implementation looks like it might be complicated, and it looks like it might have suboptimal time complexity in programs that frequently schedule events a microstep in the future.
I have already suggested that attempts to cram priorities into a single word might not be serving us very well; maybe now is a good time to reconsider that?
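For illustration, here is a minimal sketch of what ordering by full tag looks like, as opposed to ordering by a single 64-bit `time` priority. The `sketch_tag_t` type and both functions are hypothetical and are not taken from this PR; the linear scan stands in for the extra work needed when the queue itself only orders records by `time`.

```c
#include <stdint.h>

// Hypothetical tag type: logical time plus microstep.
typedef struct {
    int64_t time;
    uint32_t microstep;
} sketch_tag_t;

// Lexicographic comparison: time first, then microstep.
// Returns -1, 0, or 1 if a is less than, equal to, or greater than b.
static int sketch_compare_tags(sketch_tag_t a, sketch_tag_t b) {
    if (a.time < b.time) return -1;
    if (a.time > b.time) return 1;
    if (a.microstep < b.microstep) return -1;
    if (a.microstep > b.microstep) return 1;
    return 0;
}

// Finds the smallest tag among `count` records (count must be at least 1).
// When many records share the same time and differ only in microstep,
// this scan touches all of them.
static sketch_tag_t sketch_min_tag(const sketch_tag_t* records, int count) {
    sketch_tag_t min = records[0];
    for (int i = 1; i < count; i++) {
        if (sketch_compare_tags(records[i], min) < 0) {
            min = records[i];
        }
    }
    return min;
}
```

A queue whose comparator accepted a function like `sketch_compare_tags` directly would not need the second pass over records with equal `time`.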
This has been part of a long-running and interesting conversation. Would you like to create an issue or a discussion for this?
This PR removes TAN messages entirely.
Instead, a federate with a physical action (that is connected to a network output) is going to periodically create a dummy event (with the period controlled by `coordination-options: {advance-message-interval: 10 msec}`), which forces the federate to advance its tag and allows downstream federates to make progress.

After fixing this bug, another bug was exposed in the RTI, in which the RTI could potentially lose track of a federate's actual earliest next event (see this comment for more detail). This caused the RTI to send incorrect tag advance grant (TAG) messages. This bug was fixed by adding a queue to the RTI that keeps a record of all currently in-transit messages.
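As a rough sketch of the idea, and not the implementation in this PR, the RTI can keep the tags of messages it has forwarded to a federate and treat the earliest of them as a bound on that federate's next event, even if a stale NET report says otherwise. All names here (`sketch_federate_t`, `sketch_record_in_transit`, and so on) are hypothetical, a fixed-size array stands in for the real priority queue, and the rule for when a record is cleared is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { int64_t time; uint32_t microstep; } sketch_tag_t;

#define SKETCH_MAX_IN_TRANSIT 16

// Hypothetical per-federate state kept by the RTI.
typedef struct {
    sketch_tag_t in_transit[SKETCH_MAX_IN_TRANSIT]; // tags of forwarded messages
    size_t count;
    sketch_tag_t reported_net; // last NET reported by the federate
} sketch_federate_t;

static int sketch_tag_less(sketch_tag_t a, sketch_tag_t b) {
    return a.time < b.time || (a.time == b.time && a.microstep < b.microstep);
}

// Records a message the RTI forwards to `fed`. Returns -1 if the buffer is full.
static int sketch_record_in_transit(sketch_federate_t* fed, sketch_tag_t tag) {
    if (fed->count == SKETCH_MAX_IN_TRANSIT) return -1;
    fed->in_transit[fed->count++] = tag;
    return 0;
}

// Assumed clearing rule: drop records with tags at or before `completed`.
static void sketch_clear_in_transit_up_to(sketch_federate_t* fed, sketch_tag_t completed) {
    size_t kept = 0;
    for (size_t i = 0; i < fed->count; i++) {
        if (sketch_tag_less(completed, fed->in_transit[i])) {
            fed->in_transit[kept++] = fed->in_transit[i];
        }
    }
    fed->count = kept;
}

// The earlier of the reported NET and the earliest in-transit message tag.
static sketch_tag_t sketch_effective_next_event(const sketch_federate_t* fed) {
    sketch_tag_t result = fed->reported_net;
    for (size_t i = 0; i < fed->count; i++) {
        if (sketch_tag_less(fed->in_transit[i], result)) {
            result = fed->in_transit[i];
        }
    }
    return result;
}
```

In this sketch, a federate that reported a NET of 100 but has a message with tag 50 still in flight is treated as having a next event at 50, so a TAG beyond 50 would not be issued on the basis of the stale report.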