
v3.0.x: mpi/finalized: revamp INITIALIZED/FINALIZED #5219

Merged (3 commits) Jun 12, 2018

Conversation

@jsquyres (Member) commented Jun 3, 2018

Per MPI-3.1:8.7.1 p361:11-13, it's valid for MPI_FINALIZED to be
invoked during an attribute destruction callback (e.g., during the
destruction of keyvals on MPI_COMM_SELF during the very beginning of
MPI_FINALIZE). In such cases, MPI_FINALIZED must return "false".

Prior to this commit, we hung in FINALIZED if it were invoked during
a COMM_SELF attribute destruction callback in FINALIZE. See

This commit converts the MPI_INITIALIZED / MPI_FINALIZED
infrastructure to use a single enum (ompi_mpi_state, set atomically)
to represent the state of MPI:

  • not initialized
  • init started
  • init completed
  • finalize started
  • finalize past COMM_SELF destruction
  • finalize completed

The "finalize past COMM_SELF destruction" state is what allows us to
return "false" from MPI_FINALIZED before COMM_SELF has been fully
destroyed / all attribute callbacks have been invoked.

Since this state is checked at nearly every MPI API call (to see if
we're outside of the INIT/FINALIZE epoch), care was taken to use
atomics to set the ompi_mpi_state value in ompi_mpi_init() and
ompi_mpi_finalize(), but performance-critical code paths can simply
read the variable without needing to use a slow call to an
opal_atomic_*() function.

Thanks to @AndrewGaspar for reporting the issue.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 35438ae)

Refs #5084

@jsquyres jsquyres added the bug label Jun 3, 2018
@jsquyres jsquyres added this to the v3.0.2 milestone Jun 3, 2018
@jsquyres changed the title from "mpi/finalized: revamp INITIALIZED/FINALIZED" to "v3.0.x: mpi/finalized: revamp INITIALIZED/FINALIZED" Jun 3, 2018
@jsquyres jsquyres requested a review from bosilca June 3, 2018 18:55
@jsquyres (Member, Author) commented Jun 6, 2018

Waiting on #5234

@jsquyres force-pushed the pr/v3.0.x/fix-finalized-hang branch from 7ef6fc5 to f5abf95 on June 7, 2018 19:14
@ibm-ompi commented Jun 7, 2018

The IBM CI (XL Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/6f9d7ef6e6dbf7f10d906c0cf500a439

jsquyres added 3 commits June 7, 2018 12:47
There was a race condition in 35438ae: if multiple threads invoked
ompi_mpi_init() simultaneously (which could happen from both MPI and
OSHMEM), the code did not catch this condition -- Bad Things would
happen.

Now use an atomic cmp/set to ensure that only one thread is able to
advance ompi_mpi_init from NOT_INITIALIZED to INIT_STARTED.

Additionally, change the prototype of ompi_mpi_init() so that
oshmem_init() can safely invoke ompi_mpi_init() multiple times (as
long as MPI_FINALIZE has not started) without displaying an error.  If
multiple threads invoke oshmem_init() simultaneously, one of them will
actually do the initialization, and the rest will loop waiting for it
to complete.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 67ba8da)
(Note: this commit would have been squashed with the prior commit, but
open-mpi#5092 was accidentally merged
before the 2 commits were squashed together)

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 9b9cb5f)
@ibm-ompi commented Jun 7, 2018

The IBM CI (PGI Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/891251a13bc7f3c724638c2291bae584

@jsquyres force-pushed the pr/v3.0.x/fix-finalized-hang branch from f5abf95 to 3cc727f on June 7, 2018 20:35
@jsquyres (Member, Author) commented Jun 7, 2018

@artpol84 @bosilca Ready for review.

@ibm-ompi commented Jun 8, 2018

The IBM CI (GNU Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/ibm-ompi/88b480fb98dfd2bdac939f023e1f8fe6

@bwbarrett bwbarrett merged commit 3d96261 into open-mpi:v3.0.x Jun 12, 2018
@jsquyres jsquyres deleted the pr/v3.0.x/fix-finalized-hang branch December 7, 2021 22:24
4 participants