
communicator: fix max_local_peers value in disjoint function #12223

Merged
merged 2 commits into open-mpi:main on Jan 18, 2024

Conversation

jiaxiyan
Contributor

@jiaxiyan jiaxiyan commented Jan 9, 2024

local_peers is passed to the non-blocking function allreduce_fn as a stack variable.
Change it to be part of the context struct so the correct value is passed.

@jiaxiyan jiaxiyan marked this pull request as ready for review January 9, 2024 20:58
@bosilca
Member

bosilca commented Jan 9, 2024

I don't see why we need to compute local_peers. As it is today it is useless and never used.

However, I think we need to keep the renaming part of the PR, to make it clear that we are using nonblocking reductions during the communicator creation.

@wenduwan
Contributor

wenduwan commented Jan 9, 2024

@bosilca local_peers is actually used to determine communicator disjointness. What we found was this part:

int ompi_comm_activate_nb(...) {
    ...
    /* local_peers is a local (stack) variable of this function... */
    local_peers = context->max_local_peers;
    /* ...yet it is handed to the non-blocking allreduce as the send buffer,
     * which may still be read after ompi_comm_activate_nb has returned */
    ret = context->allreduce_fn (&local_peers, &context->max_local_peers, 1, MPI_MAX, context,
                                 &subreq);
    ...
}

local_peers is a stack variable scoped to the ompi_comm_activate_nb function, and we are wrongly using it as the sendbuf in the non-blocking allreduce. This PR fixes that by moving the value to the heap with the new context->local_peers variable.
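
To make the lifetime issue concrete, here is a minimal standalone sketch of the bug class and the fix, written against plain MPI rather than the OMPI-internal context/allreduce_fn (all names below are illustrative, not the actual OMPI code):

#include <mpi.h>

/* Context that outlives the function posting the non-blocking reduction;
 * it loosely mirrors the role of the OMPI request context. */
struct activate_ctx {
    int local_peers;       /* persistent send buffer (the fix) */
    int max_local_peers;   /* receive buffer                   */
    MPI_Request req;
};

/* Broken pattern: the send buffer is a stack variable that goes out of
 * scope when the function returns, while the iallreduce is still in flight. */
static void start_reduce_broken(struct activate_ctx *ctx, MPI_Comm comm)
{
    int local_peers = ctx->max_local_peers;
    MPI_Iallreduce(&local_peers, &ctx->max_local_peers, 1, MPI_INT,
                   MPI_MAX, comm, &ctx->req);
    /* returning here invalidates &local_peers before the request completes */
}

/* Fixed pattern: the send buffer lives in the long-lived context, so it
 * remains valid until the request is completed. */
static void start_reduce_fixed(struct activate_ctx *ctx, MPI_Comm comm)
{
    ctx->local_peers = ctx->max_local_peers;
    MPI_Iallreduce(&ctx->local_peers, &ctx->max_local_peers, 1, MPI_INT,
                   MPI_MAX, comm, &ctx->req);
}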

@bosilca
Member

bosilca commented Jan 9, 2024

ok, I misread the code (confused the send and recv buffers in the allreduce). However, my comment stands: we don't need an additional member in the structure; we can use MPI_IN_PLACE.

@wenduwan
Contributor

wenduwan commented Jan 9, 2024

I see.

@jiaxiyan Can you try iallreduce_fn(MPI_IN_PLACE, &context->max_local_peers, ...)?

I somehow remember that some of the iallreduce_fn implementations in this file did not support MPI_IN_PLACE...
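
i.e. something like the following, following the call shape from the excerpt above (an untested sketch, not a verified change):

/* suggested in-place variant of the earlier call */
ret = context->iallreduce_fn (MPI_IN_PLACE, &context->max_local_peers,
                              1, MPI_MAX, context, &subreq);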

@wenduwan
Contributor

wenduwan commented Jan 9, 2024

@bosilca Now I remember - we cannot use MPI_IN_PLACE because the communicator can be an intercommunicator, so we have to use a separate send buffer.

@bosilca
Member

bosilca commented Jan 10, 2024

MPI_IN_PLACE is valid for inter-communicators as well. If we have issues with some of the backend implementations of the allreduce, then we should document it, otherwise we should use all the capabilities provided by the collective framework.

Btw, if the communicator is an intercomm, then what's the point of updating the max_local_peers (as it will not have the same meaning as in the case of intracomms)?

@wenduwan
Contributor

We had a discussion offline and got clarification from community members, which I agree with: MPI_IN_PLACE is only applicable to intra-communicators. So I think we need context->local_peers as the sendbuf to cover both the intra- and inter-communicator cases.

if the communicator is an intercomm, then what's the point of updating the max_local_peers (as it will not have the same meaning as in the case of intracomms)?

I see... for an inter-communicator there can be 2 groups sharing the same nodes, so the local peer count within a group might be 1 even though 2 processes from the 2 groups share the same node. In that case the local peer count does not make sense.
@jiaxiyan In this case I think we should make another change - the disjoint flag should only be set if OMPI_COMM_IS_INTRA(comm) is true.

But I think we still need this allreduce call as a dual-purpose barrier. From code history I can see it is used to signal that the communicator is functional and can be used for messaging.
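
Roughly something like this (the helper below is hypothetical, just to illustrate the guard, not the actual patch):

if (OMPI_COMM_IS_INTRA(comm)) {
    /* derive disjointness from the reduced max_local_peers value
     * (threshold/semantics as defined elsewhere in the communicator code) */
    ompi_comm_set_disjointness(comm, context->max_local_peers);  /* hypothetical helper */
}
/* for an inter-communicator the disjoint flag is simply left unset */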

@wenduwan
Contributor

@bosilca Could you please give the PR a 2nd review? Thanks!

@wenduwan wenduwan requested a review from bosilca January 11, 2024 18:45
@bosilca
Member

bosilca commented Jan 12, 2024

I confirm that MPI_IN_PLACE is only valid for intra-communicators. This is clearly spelled out in the MPI standard, section 6.2.3:

Note that the “in place” option for intra-communicators does not apply to inter-communicators since in the inter-communicator case there is no communication from an MPI process to itself.

This explanation makes no sense, because it links MPI_IN_PLACE to communication to self, which has no meaning for a collective communication. But it's there.
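
For reference, a small plain-MPI sketch of what the quoted rule allows (standard MPI only, not the OMPI-internal code path touched by this PR):

#include <mpi.h>

/* MPI_IN_PLACE is a valid sendbuf for an allreduce on an intra-communicator,
 * while an inter-communicator reduction needs distinct send/recv buffers. */
void reduce_max_peers(MPI_Comm comm, int *max_local_peers)
{
    int is_inter = 0;
    MPI_Comm_test_inter(comm, &is_inter);

    if (!is_inter) {
        /* intra-communicator: input is read from and result written to the same buffer */
        MPI_Allreduce(MPI_IN_PLACE, max_local_peers, 1, MPI_INT, MPI_MAX, comm);
    } else {
        /* inter-communicator: a separate send buffer is required */
        int sendval = *max_local_peers;
        MPI_Allreduce(&sendval, max_local_peers, 1, MPI_INT, MPI_MAX, comm);
    }
}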

@bosilca
Member

bosilca commented Jan 12, 2024

A communicator does not require any barrier-like collective to be valid; it only needs a unique context id. This cid is established well before we do the last couple of allreduces.

Now that we only do the max_peers reduction for intra-communicators, maybe we can remove local_peers and use MPI_IN_PLACE?

@wenduwan
Contributor

@bosilca Thanks! So we really only need to allreduce max_local_peers for intra-communicators, and can skip the allreduce altogether otherwise.

@jiaxiyan Could you please try that? There used to be an allreduce at this place for reasons that I don't understand. We should test the change again.
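
i.e. roughly the following (a sketch based on the excerpt quoted earlier in the thread, with the renamed iallreduce_fn; not the literal merged code):

/* perform the max_local_peers reduction only for intra-communicators,
 * in place; skip it entirely for inter-communicators */
if (OMPI_COMM_IS_INTRA(comm)) {
    ret = context->iallreduce_fn (MPI_IN_PLACE, &context->max_local_peers,
                                  1, MPI_MAX, context, &subreq);
}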

@jiaxiyan
Contributor Author

@bosilca Can you review this PR again? Thanks!!

The allreduce_fn is non-blocking. Rename it to iallreduce_fn to make it clear.

Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
…local_peers value

local_peers is passed to the non-blocking function iallreduce_fn as a stack variable.
Change it to be part of the context struct so the correct value is passed.

Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
Contributor

@wenduwan wenduwan left a comment


@bosilca Thank you for the review and explaining the code path.

@jiaxiyan Great job catching the bug!

@wenduwan wenduwan merged commit c3b6852 into open-mpi:main Jan 18, 2024
9 of 10 checks passed
@hppritcha
Member

@jiaxiyan @wenduwan What problem was this patch addressing? It's causing a regression - see #12367.

@jiaxiyan
Contributor Author

jiaxiyan commented Mar 8, 2024

@hppritcha The changes made in 23df181 were reverted in #12246.

@hppritcha
Member

The change in #12246 is not the problem; it's the check for whether the communicator is intra or inter, and, if the latter, skipping the reduction, that is causing the problem. The problem we are seeing only goes away if we remove all of the changes made in 23df181 to the ompi_comm_activate_nb function.

@bosilca
Member

bosilca commented Mar 8, 2024

Why exactly is that a problem?

@hppritcha
Member

If you revert the commit that adds a conditional check for whether the comm is an intercomm or not, @dalcinl's issue #12367 goes away - at least for me.

@bosilca
Member

bosilca commented Mar 8, 2024

That does not mean removing the allreduce is the problem.

@wenduwan
Contributor

If you revert the commit that adds a conditional check for whether the comm is an intercomm or not, @dalcinl's issue #12367 goes away - at least for me.

We had this conversation earlier: the allreduce had a sync effect, though I still don't understand why it's necessary. But I'm OK with removing the conditional check as a temporary mitigation.

@hppritcha Do you have a theory for this behavior?

@edgargabriel
Member

I remember, many years ago when I worked on the communicator code, that we had to add a sync step (which I think later became the communicator activation routine) to avoid receiving messages on a process for a communicator ID that was not yet known on that node. Basically, because the last collective operation confirming the CID finished earlier on some processes than on others, a message could be received on a process for a CID that it wasn't yet aware of. Not sure whether it is the same issue here.

@bosilca
Member

bosilca commented Mar 11, 2024

@edgargabriel that issue was fixed by adding a pending queue of unmatchable messages into the PML (mca_pml_ob1.non_existing_communicator_pending). I am able to replicate this issue but if I check that list it is empty, so that does not seem to be the issue.

What I don't understand yet is that, if the communicator is not correctly set up, then any collective should trigger the deadlock, be it the barrier we are talking about adding or the reduce we removed. Something more subtle is going on: I noticed the execution path for inter-communicator creation (which, btw, has been made extremely complicated, undocumented, and costly) going into code that is marked as intra-communicator.

I could not yet figure out the reason behind that, but as far as I can tell all messages are correctly sent and extracted from the network, but not correctly matched - as if they were put in a queue for the wrong process (which could be explained by the intra vs. inter communicator creation path).
