communicator: fix max_local_peers value in disjoint function #12223
I don't see why we need to compute However, I think we need to keep the renaming part of the PR, to make it clear that we are using nonblocking reductions during the communicator creation.
@bosilca
OK, I misread the code (confused the send and recv buffers in the allreduce). However, my comment stands: we don't need an additional member of the structure; we can use
I see. @jiaxiyan Can you try I somehow remember some of the
@bosilca Now I remember - we cannot use
Btw, if the communicator is an intercomm, then what's the point of updating the
We had a discussion offline. Got clarification from community members which I agree with -
I see... for an inter-communicator there can be 2 groups sharing the same nodes, so the local peers count might be 1 even though there are 2 processes from 2 groups sharing the same node. In this case local peers do not make sense. But I think we still need this allreduce call as a dual-purpose barrier. From the code history I can see it is used to signal that the communicator is functional and can be used for messaging.
@bosilca Could you please give the PR a 2nd review? Thanks!
I confirm that
This explanation makes no sense, because it links
A communicator does not require any barrier-like collective to be valid, it only needs a unique context id. This Now that we only do the
@bosilca Can you review this PR again? Thanks!!
The allreduce_fn is non-blocking. Rename it to iallreduce_fn to make it clear. Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
…local_peers value local_peers is passed in the non-blocking function iallreduce_fn as a stack variable. Change it to be part of the context struct so the correct value is passed. Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
@hppritcha The changes made in 23df181 were reverted in #12246
The change in #12246 is not the problem; it's the check for whether the communicator is an intra- or inter-communicator and, if the latter, not doing the reduction that is causing the problem. The problem we are seeing only goes away if we remove all of the changes done in 23df181 to the
Why exactly is that a problem?
That does not mean removing the allreduce is the problem.
We had this conversation earlier: the allreduce had a sync effect, and I still don't understand why it's necessary. But I'm ok with removing the conditional check as a temporary mitigation. @hppritcha Do you have a theory for this behavior?
I remember many, many years ago, when I worked on the communicator code, that we had to add a sync step (which I think later became the communicator activation routine) to avoid receiving messages on some process for a communicator ID that was not yet known on that node. Basically, because the last collective operation confirming the CID finished earlier on some processes than others, a message could be received on a process for a CID that it wasn't aware of yet. Not sure whether it is the same issue here.
@edgargabriel that issue was fixed by adding a pending queue of unmatchable messages into the PML ( What I don't understand yet is that, if the communicator is not correctly set up, then any collective should trigger the deadlock, be that the barrier we are talking about adding or the reduce we removed. Something more subtle is going on: I noticed the execution path for the inter-communicator creation (which, btw, has been made extremely complicated, undocumented and costly) going into code that is marked as intra-communicator. I could not figure out yet the reason behind that, but as far as I can tell all messages are correctly sent and extracted from the network, but not correctly matched, as if they were put in a queue for the wrong process (which could be explained by the intra vs. inter communicator creation path).
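To make the "pending queue of unmatchable messages" idea above concrete, here is a minimal sketch. It is not Open MPI's PML code: `matcher_t`, `msg_t`, `matcher_receive`, `matcher_add_comm` and the fixed-size limits are all hypothetical names invented for illustration. A message that arrives for a CID the process does not know yet is parked, then replayed once the communicator is registered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_CID     16  /* hypothetical limit, for the sketch only */
#define MAX_PENDING 32  /* hypothetical limit, for the sketch only */

typedef struct { int cid; int payload; } msg_t;

typedef struct {
    bool   known[MAX_CID];       /* CIDs this process has registered */
    int    delivered[MAX_CID];   /* messages handed to matching, per CID */
    msg_t  pending[MAX_PENDING]; /* messages that arrived too early */
    size_t npending;
} matcher_t;

static void matcher_init(matcher_t *m)
{
    memset(m, 0, sizeof(*m));
}

/* Returns true if the message was matched right away, false if it was
 * parked because its CID is not known on this process yet. */
static bool matcher_receive(matcher_t *m, msg_t msg)
{
    assert(msg.cid >= 0 && msg.cid < MAX_CID);
    if (m->known[msg.cid]) {
        m->delivered[msg.cid]++;
        return true;
    }
    assert(m->npending < MAX_PENDING);
    m->pending[m->npending++] = msg;
    return false;
}

/* Registers a CID and replays every parked message addressed to it.
 * Returns the number of replayed messages. */
static size_t matcher_add_comm(matcher_t *m, int cid)
{
    size_t kept = 0, replayed = 0;

    assert(cid >= 0 && cid < MAX_CID);
    m->known[cid] = true;
    for (size_t i = 0; i < m->npending; i++) {
        if (m->pending[i].cid == cid) {
            m->delivered[cid]++;
            replayed++;
        } else {
            m->pending[kept++] = m->pending[i]; /* keep for another CID */
        }
    }
    m->npending = kept;
    return replayed;
}
```

With a queue like this, a process that finishes the CID agreement late does not lose early messages, which is why an extra synchronization step is not needed for that particular race.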
local_peers is passed in the non-blocking function allreduce_fn as a stack variable.
Change it to be part of the context struct so the correct value is passed.
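The lifetime issue described above can be sketched without MPI. In the sketch below, `fake_request_t`, `fake_context_t`, `fake_iallreduce_start` and `fake_iallreduce_complete` are hypothetical stand-ins, not Open MPI functions. The point it illustrates is that a non-blocking operation only records the buffer address when it starts and reads the data later, so the send buffer has to live in storage that outlives the function that started the operation.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a non-blocking reduction request: starting it
 * only records the buffer addresses; the data is read at completion. */
typedef struct {
    const int *sendbuf;
    int       *recvbuf;
} fake_request_t;

/* Hypothetical context struct that outlives the function that starts the
 * operation, so both buffers stay valid until completion. */
typedef struct {
    int            local_peers;     /* send buffer */
    int            max_local_peers; /* receive buffer */
    fake_request_t req;
} fake_context_t;

/* Start: nothing is read from sendbuf yet. */
static void fake_iallreduce_start(fake_request_t *req,
                                  const int *sendbuf, int *recvbuf)
{
    req->sendbuf = sendbuf;
    req->recvbuf = recvbuf;
}

/* Complete: sendbuf is dereferenced only here. With a single process the
 * "maximum" is simply the local value. */
static void fake_iallreduce_complete(fake_request_t *req)
{
    assert(req->sendbuf != NULL && req->recvbuf != NULL);
    *req->recvbuf = *req->sendbuf;
}

/* Returns before the operation completes. Passing the address of a local
 * variable here instead of &ctx->local_peers would leave the request with
 * a dangling pointer once this function returns. */
static fake_context_t *start_with_context(int local_peers)
{
    fake_context_t *ctx = malloc(sizeof(*ctx));
    if (NULL == ctx) {
        return NULL;
    }
    ctx->local_peers     = local_peers;
    ctx->max_local_peers = 0;
    fake_iallreduce_start(&ctx->req, &ctx->local_peers,
                          &ctx->max_local_peers);
    return ctx;
}
```

A caller would invoke `start_with_context()`, later call `fake_iallreduce_complete(&ctx->req)`, read `ctx->max_local_peers`, and then `free(ctx)`.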