
communicator: fix max_local_peers value in disjoint function #12223

Merged
merged 2 commits into open-mpi:main on Jan 18, 2024

Conversation

jiaxiyan
Contributor

@jiaxiyan jiaxiyan commented Jan 9, 2024

local_peers is passed to the non-blocking function allreduce_fn as a stack variable.
Change it to be part of the context struct so the correct value is passed.

@jiaxiyan jiaxiyan marked this pull request as ready for review January 9, 2024 20:58
@bosilca
Member

bosilca commented Jan 9, 2024

I don't see why we need to compute local_peers. As it is today it is useless and never used.

However, I think we need to keep the renaming part of the PR, to make it clear that we are using nonblocking reductions during the communicator creation.

@wenduwan
Contributor

wenduwan commented Jan 9, 2024

@bosilca local_peers is actually used to determine communicator disjointness. What we found was this part:

int ompi_comm_activate_nb(...) {
    ...
    /* local_peers is a local (stack) variable of this function... */
    local_peers = context->max_local_peers;
    /* ...yet it is handed to the non-blocking allreduce as the send buffer,
     * which may still be read after ompi_comm_activate_nb has returned */
    ret = context->allreduce_fn (&local_peers, &context->max_local_peers, 1, MPI_MAX, context,
                                 &subreq);
    ...
}

local_peers is a stack variable scoped to the ompi_comm_activate_nb function, and we are wrongly using it as the sendbuf in the non-blocking allreduce. This PR fixes that by moving the value to the heap with the new context->local_peers variable.
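
To make the lifetime issue concrete, here is a minimal standalone sketch of the bug class and the fix, written against plain MPI rather than the OMPI-internal context/allreduce_fn (all names below are illustrative, not the actual OMPI code):

#include <mpi.h>

/* Context that outlives the function posting the non-blocking reduction;
 * it loosely mirrors the role of the OMPI request context. */
struct activate_ctx {
    int local_peers;       /* persistent send buffer (the fix) */
    int max_local_peers;   /* receive buffer                   */
    MPI_Request req;
};

/* Broken pattern: the send buffer is a stack variable that goes out of
 * scope when the function returns, while the iallreduce is still in flight. */
static void start_reduce_broken(struct activate_ctx *ctx, MPI_Comm comm)
{
    int local_peers = ctx->max_local_peers;
    MPI_Iallreduce(&local_peers, &ctx->max_local_peers, 1, MPI_INT,
                   MPI_MAX, comm, &ctx->req);
    /* returning here invalidates &local_peers before the request completes */
}

/* Fixed pattern: the send buffer lives in the long-lived context, so it
 * remains valid until the request is completed. */
static void start_reduce_fixed(struct activate_ctx *ctx, MPI_Comm comm)
{
    ctx->local_peers = ctx->max_local_peers;
    MPI_Iallreduce(&ctx->local_peers, &ctx->max_local_peers, 1, MPI_INT,
                   MPI_MAX, comm, &ctx->req);
}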

@bosilca
Member

bosilca commented Jan 9, 2024

ok, I misread the code (confused the send and recv buffers in the allreduce). However, my comment stands: we don't need an additional member in the structure; we can use MPI_IN_PLACE.

@wenduwan
Contributor

wenduwan commented Jan 9, 2024

I see.

@jiaxiyan Can you try iallreduce_fn(MPI_IN_PLACE, &context->max_local_peers, ...)?

I somehow remember that some of the iallreduce_fn implementations in this file did not support MPI_IN_PLACE...
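
i.e. something like the following, following the call shape from the excerpt above (an untested sketch, not a verified change):

/* suggested in-place variant of the earlier call */
ret = context->iallreduce_fn (MPI_IN_PLACE, &context->max_local_peers,
                              1, MPI_MAX, context, &subreq);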

@wenduwan
Contributor

wenduwan commented Jan 9, 2024

@bosilca Now I remember - we cannot use MPI_IN_PLACE because the communicator can be an intercommunicator, so we have to use a separate send buffer.

@bosilca
Member

bosilca commented Jan 10, 2024

MPI_IN_PLACE is valid for inter-communicators as well. If we have issues with some of the backend implementations of the allreduce, then we should document it, otherwise we should use all the capabilities provided by the collective framework.

Btw, if the communicator is an intercomm, then what's the point of updating the max_local_peers (as it will not have the same meaning as in the case of intracomms)?

@wenduwan
Contributor

We had a discussion offline and got clarification from community members, which I agree with: MPI_IN_PLACE is only applicable to intra-communicators. So I think we need context->local_peers as the sendbuf to cover both the intra- and inter-communicator cases.

if the communicator is an intercomm, then what's the point of updating the max_local_peers (as it will not have the same meaning as in the case of intracomms)?

I see... for an inter-communicator there can be 2 groups sharing the same nodes, so the local peer count within a group might be 1 even though 2 processes from the 2 groups share the same node. In that case the local peer count does not make sense.
@jiaxiyan In this case I think we should make another change - the disjoint flag should only be set if OMPI_COMM_IS_INTRA(comm) is true.

But I think we still need this allreduce call as a dual-purpose barrier. From code history I can see it is used to signal that the communicator is functional and can be used for messaging.
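
Roughly something like this (the helper below is hypothetical, just to illustrate the guard, not the actual patch):

if (OMPI_COMM_IS_INTRA(comm)) {
    /* derive disjointness from the reduced max_local_peers value
     * (threshold/semantics as defined elsewhere in the communicator code) */
    ompi_comm_set_disjointness(comm, context->max_local_peers);  /* hypothetical helper */
}
/* for an inter-communicator the disjoint flag is simply left unset */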

@wenduwan
Contributor

@bosilca Could you please give the PR a 2nd review? Thanks!

@wenduwan wenduwan requested a review from bosilca January 11, 2024 18:45
@bosilca
Member

bosilca commented Jan 12, 2024

I confirm that MPI_IN_PLACE is only valid for intra-communicators. This is clearly spelled out in the MPI standard, section 6.2.3:

Note that the “in place” option for intra-communicators does not apply to inter-communicators since in the inter-communicator case there is no communication from an MPI process to itself.

This explanation makes no sense, because it links MPI_IN_PLACE to communication to self, which has no meaning for a collective communication. But it's there.
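
For reference, a small plain-MPI sketch of what the quoted rule allows (standard MPI only, not the OMPI-internal code path touched by this PR):

#include <mpi.h>

/* MPI_IN_PLACE is a valid sendbuf for an allreduce on an intra-communicator,
 * while an inter-communicator reduction needs distinct send/recv buffers. */
void reduce_max_peers(MPI_Comm comm, int *max_local_peers)
{
    int is_inter = 0;
    MPI_Comm_test_inter(comm, &is_inter);

    if (!is_inter) {
        /* intra-communicator: input is read from and result written to the same buffer */
        MPI_Allreduce(MPI_IN_PLACE, max_local_peers, 1, MPI_INT, MPI_MAX, comm);
    } else {
        /* inter-communicator: a separate send buffer is required */
        int sendval = *max_local_peers;
        MPI_Allreduce(&sendval, max_local_peers, 1, MPI_INT, MPI_MAX, comm);
    }
}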

@bosilca
Member

bosilca commented Jan 12, 2024

A communicator does not require any barrier-like collective to be valid; it only needs a unique context id. This cid is established well before we do the last couple of allreduces.

Now that we only do the max_peers reduction for intra-communicators, maybe we can remove local_peers and use MPI_IN_PLACE?

@wenduwan
Contributor

@bosilca Thanks! So we really only need to allreduce max_local_peers for intra-communicators, and can skip the allreduce altogether otherwise.

@jiaxiyan Could you please try that? There used to be an allreduce at this place for reasons that I don't understand. We should test the change again.
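
i.e. roughly the following (a sketch based on the excerpt quoted earlier in the thread, with the renamed iallreduce_fn; not the literal merged code):

/* perform the max_local_peers reduction only for intra-communicators,
 * in place; skip it entirely for inter-communicators */
if (OMPI_COMM_IS_INTRA(comm)) {
    ret = context->iallreduce_fn (MPI_IN_PLACE, &context->max_local_peers,
                                  1, MPI_MAX, context, &subreq);
}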

@jiaxiyan
Contributor Author

@bosilca Can you review this PR again? Thanks!!

The allreduce_fn is non-blocking. Rename it to iallreduce_fn to make it clear.

Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
…local_peers value

local_peers is passed to the non-blocking function iallreduce_fn as a stack variable.
Change it to be part of the context struct so the correct value is passed.

Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
Contributor

@wenduwan wenduwan left a comment


@bosilca Thank you for the review and explaining the code path.

@jiaxiyan Great job catching the bug!

@wenduwan wenduwan merged commit c3b6852 into open-mpi:main Jan 18, 2024
9 of 10 checks passed
@hppritcha
Member

@jiaxiyan @wenduwan What problem was this patch addressing? It's causing a regression - see #12367.

@jiaxiyan
Contributor Author

jiaxiyan commented Mar 8, 2024

@hppritcha The changes made in 23df181 were reverted in #12246.

@hppritcha
Member

The change in #12246 is not the problem; it's the check for whether the communicator is intra or inter, and, if the latter, skipping the reduction, that is causing the problem. The problem we are seeing only goes away if we remove all of the changes made in 23df181 to the ompi_comm_activate_nb function.

@bosilca
Member

bosilca commented Mar 8, 2024

Why exactly is that a problem?

@hppritcha
Member

If you revert the commit that adds a conditional check for whether the comm is an intercomm or not, @dalcinl's issue #12367 goes away - at least for me.

@bosilca
Member

bosilca commented Mar 8, 2024

That does not mean removing the allreduce is the problem.

@wenduwan
Contributor

If you revert the commit that adds a conditional check for whether the comm is an intercomm or not, @dalcinl's issue #12367 goes away - at least for me.

We had this conversation earlier: the allreduce had a sync effect, though I still don't understand why it's necessary. But I'm OK with removing the conditional check as a temporary mitigation.

@hppritcha Do you have a theory for this behavior?

@edgargabriel
Member

I remember, many years ago when I worked on the communicator code, that we had to add a sync step (which I think later became the communicator activation routine) to avoid receiving messages on a process for a communicator ID that was not yet known on that node. Basically, because the last collective operation confirming the CID finished earlier on some processes than on others, a message could be received on a process for a CID that it wasn't yet aware of. Not sure whether it is the same issue here.

@bosilca
Member

bosilca commented Mar 11, 2024

@edgargabriel that issue was fixed by adding a pending queue of unmatchable messages into the PML (mca_pml_ob1.non_existing_communicator_pending). I am able to replicate this issue but if I check that list it is empty, so that does not seem to be the issue.

What I don't understand yet is that, if the communicator is not correctly set up, then any collective should trigger the deadlock, be it the barrier we are talking about adding or the reduce we removed. Something more subtle is going on: I noticed the execution path for inter-communicator creation (which, btw, has been made extremely complicated, undocumented, and costly) going into code that is marked as intra-communicator.

I could not yet figure out the reason behind that, but as far as I can tell all messages are correctly sent and extracted from the network, but not correctly matched - as if they were put in a queue for the wrong process (which could be explained by the intra vs. inter communicator creation path).
