Upgrading Summit modules #2856

Merged
merged 2 commits into from
Apr 22, 2019

jayeshkrishna
Contributor

@jayeshkrishna jayeshkrishna commented Apr 15, 2019

Upgrading the summit cmake, essl and MPI modules.

Also updating the ROMIO version to prevent OOM errors.

Fixes #2847

[BFB]

@jayeshkrishna
Contributor Author

Tried X cases successfully with ibm and pgi compilers on Summit

@sarats
Member

sarats commented Apr 15, 2019

For the MMF runs, we had to add the following to bypass ROMIO out-of-memory errors.
Error: 847 Out of memory in file ../../../../../../../opensrc/ompi/ompi/mca/io/romio321/romio/adio/ad_gpfs/ad_gpfs_rdcoll.c, line 1178

Fix:

    <env name="OMPI_MCA_io">romio314</env>
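
To check whether a given run hit the ROMIO error quoted above, the run log can be searched for its signature. This is a minimal sketch, not part of the PR: the function name `has_romio_oom` and the log file name in the example are hypothetical, and the pattern is taken only from the single error message shown in this thread.

```shell
# Sketch: report whether a log file contains the ROMIO out-of-memory signature.
# has_romio_oom is a hypothetical helper; pass it the path to a run log.
has_romio_oom() {
    # Matches messages of the form "Out of memory in file <path containing romio>".
    grep -q -E 'Out of memory in file .*romio' "$1"
}

# Example (hypothetical log name):
#   has_romio_oom e3sm.log && echo "ROMIO OOM found; consider OMPI_MCA_io=romio314"
```

The function returns 0 when the signature is present and non-zero otherwise, so it can be used directly in an `if` or `&&` chain.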

@sarats
Member

sarats commented Apr 15, 2019

A request: there is a new default cmake module. It might be good to update that as well.
cmake/3.13.4

@jayeshkrishna
Contributor Author

jayeshkrishna commented Apr 15, 2019

Sure, I will update the branch

@jayeshkrishna jayeshkrishna force-pushed the jayeshkrishna/summit_module_fixes branch from ba0ff02 to fba6edf Compare April 15, 2019 19:03
Upgrading the summit cmake, essl and MPI modules.

With the older version of MPI modules, MPI_Finalize call hangs.
The older version of essl module is no longer available.

Fixes #2847
@jayeshkrishna jayeshkrishna force-pushed the jayeshkrishna/summit_module_fixes branch from fba6edf to c6709af Compare April 15, 2019 19:04
@minxu74
Contributor

minxu74 commented Apr 16, 2019

@sarats @jayeshkrishna I got the OOM error while running the high-res F case with I/O, as mentioned by @sarats. So we need to add the fix to the machine file too.

@jayeshkrishna
Contributor Author

Ok, I will add the env and rebase the branch (@minxu74 : Did the env work for you?)

@sarats
Member

sarats commented Apr 16, 2019

Just FYI: the text below, from the OLCF page on Summit resolved issues, explains the context for the env variable setting.
So the new ROMIO is needed for better parallel HDF5 performance and for preventing hangs in MPI_Finalize, but it seems to trigger OOM errors.

The old ROMIO doesn't result in OOM errors but may cause performance issues. Perhaps we need to test the new spectrum-mpi with darshan-runtime disabled, or contact OLCF to figure out the path ahead.

Job hangs in MPI_Finalize
There is a known issue in Spectrum MPI 10.2.0.10 provided by the spectrum-mpi/10.2.0.10-20181214 modulefile that causes a hang in MPI_Finalize when ROMIO 3.2.1 is being used and the darshan-runtime modulefile is loaded. The recommended and default Spectrum MPI version as of March 3, 2019 is Spectrum MPI 10.2.0.11 provided by the spectrum-mpi/10.2.0.11-20190201 modulefile. If you are seeing this issue, please make sure that you are using the latest version of Spectrum MPI.

If you need to use a previous version of Spectrum MPI, your options are:
Unload the darshan-runtime modulefile.
Alternatively, set export OMPI_MCA_io=romio314 in your environment to use the previous version of ROMIO. Please note that this version has known performance issues with parallel HDF5 (see “Slow performance using parallel HDF5” issue below).
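
The two workarounds in the OLCF text above can be sketched as shell helpers. This is an illustration under stated assumptions, not OLCF's or the PR's script: the function names `unload_darshan` and `use_previous_romio` are hypothetical, and the first one assumes a `module` command (Lmod or Environment Modules) is available in the current shell.

```shell
# Option 1 from the OLCF text: unload the darshan-runtime modulefile.
# Assumes a `module` command exists in this shell; returns 1 if it does not.
unload_darshan() {
    if command -v module >/dev/null 2>&1; then
        module unload darshan-runtime
    else
        echo "module command not found; nothing unloaded" >&2
        return 1
    fi
}

# Option 2 from the OLCF text: select the previous ROMIO version.
# Note the parallel HDF5 performance caveat mentioned above.
use_previous_romio() {
    export OMPI_MCA_io=romio314
}
```

The OLCF text presents these as alternatives, so a job script would normally call one or the other before launching the application, not both.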

@minxu74
Contributor

minxu74 commented Apr 16, 2019

@jayeshkrishna The test run is in the queue now. However, based on the response from @sarats, the fix may not be needed because you already use the latest Spectrum MPI. I used the old Spectrum MPI and got the OOM.

@sarats
Member

sarats commented Apr 16, 2019

@minxu74 To clarify, my OOM errors went away only after setting that environment variable.

@minxu74
Contributor

minxu74 commented Apr 16, 2019

@jayeshkrishna @sarats My test run confirmed that the env setting fixed the OOM error.

@jayeshkrishna
Contributor Author

jayeshkrishna commented Apr 16, 2019

@minxu74 : Did you run the tests with the latest MPI module, spectrum-mpi/10.2.0.11-20190201? If so, did you still need the env setting to fix the OOM error?

@minxu74
Contributor

minxu74 commented Apr 16, 2019

@jayeshkrishna I ran with the old spectrum-mpi/10.2.0.10-20181214. For the new version, I submitted a test run without the env fix, and it is in the queue now.

@jayeshkrishna
Contributor Author

@minxu74 : Did your job complete? Does the new MPI lib need the romio environment variable?

@minxu74
Contributor

minxu74 commented Apr 19, 2019

@jayeshkrishna The test case with the new MPI and w/o the romio env actually used PIO2, and it failed on Summit with a segmentation fault, not OOM. So the new MPI may not need the romio env. To double-check, I will run another test with the new MPI and PIO1, again without the romio env.

@jayeshkrishna
Contributor Author

When using PIO2 with E3SM you need to use the PIO2 in our fork (replace cime/src/externals/pio2 with master from https://github.com/E3SM-Project/ParallelIO). I will be updating the version of PIO in CIME soon.
Please try with PIO1 and see if you get the OOM error.

@minxu74
Contributor

minxu74 commented Apr 20, 2019

@jayeshkrishna Thanks for the PIO2 tip. The run with the new MPI but w/o romio env failed with the same OOM error. So the romio env setting is needed. I have run a case successfully with the new MPI and romio env setting.
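
The combination reported as working above (new Spectrum MPI module plus the romio env setting) can be checked before submitting a job. This is a hedged sketch only: the function name `check_summit_mpi_env` is hypothetical, and it assumes the module system exports a colon-separated `LOADEDMODULES` variable, which should be verified on the target machine.

```shell
# check_summit_mpi_env MODULES IO_SETTING
#   MODULES:    colon-separated module list (assumed: the value of $LOADEDMODULES)
#   IO_SETTING: the value of $OMPI_MCA_io
# Prints a one-line verdict and returns 0 only for the combination
# reported as working in this thread.
check_summit_mpi_env() {
    modules="$1"
    io="$2"
    case ":$modules:" in
        *":spectrum-mpi/10.2.0.11-20190201:"*) ;;
        *) echo "spectrum-mpi/10.2.0.11-20190201 not loaded"; return 1 ;;
    esac
    if [ "$io" != "romio314" ]; then
        echo "OMPI_MCA_io is not romio314; OOM errors were seen in this state"
        return 1
    fi
    echo "ok"
}

# Example:
#   check_summit_mpi_env "$LOADEDMODULES" "$OMPI_MCA_io"
```

The check takes both values as arguments rather than reading the environment directly, so it can be exercised with sample inputs away from Summit.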

@jayeshkrishna
Contributor Author

Great, I will add the env setting to the branch and merge it to next

OLCF recommends setting "OMPI_MCA_io" env variable to "romio314"
to prevent out of memory issues with the code. In our testing we
have found that setting this env gets rid of OOM errors with
certain E3SM simulations (ECP simulation runs).

However also note that setting this environment variable reduces
the performance of parallel HDF5 (when using NetCDF4P PIO iotype
to write data).

See Issue #2856
@jayeshkrishna
Contributor Author

@minxu74 / @sarats : Do the changes look good to you? I am planning to merge this to next and master today

@jayeshkrishna
Contributor Author

Tested the latest branch with X case (ran successfully) on Summit

@minxu74
Contributor

minxu74 commented Apr 22, 2019

@jayeshkrishna The changes look good to me. Thanks a lot for your PR.

@jayeshkrishna
Contributor Author

Adding some text from the Summit user guide:

The following issue was resolved with the software default changes from March 12, 2019 that set Spectrum MPI 10.2.0.11 (20190201) as default and moved ROMIO to version 3.2.1

So the above doc seems to suggest that the ROMIO version was upgraded to 3.2.1 (and the env should only have been needed with the older version of the MPI lib). We can handle this issue in a future PR, since @minxu74 observed that this env setting was required with both the old and new MPI libs.

jayeshkrishna added a commit that referenced this pull request Apr 22, 2019
Upgrading the summit cmake, essl and MPI modules.

Also updating the ROMIO version to prevent OOM errors.

Fixes #2847

[BFB]
@jayeshkrishna
Contributor Author

These changes are only to summit machine files (should not impact nightlies or any other machine), so merging to master

@jayeshkrishna jayeshkrishna merged commit 5cdf492 into master Apr 22, 2019
jgfouca pushed a commit that referenced this pull request Jun 25, 2019
OLCF recommends setting "OMPI_MCA_io" env variable to "romio314"
to prevent out of memory issues with the code. In our testing we
have found that setting this env gets rid of OOM errors with
certain E3SM simulations (ECP simulation runs).

However also note that setting this environment variable reduces
the performance of parallel HDF5 (when using NetCDF4P PIO iotype
to write data).

See Issue #2856
jgfouca pushed a commit that referenced this pull request Jun 25, 2019
…e_fixes

Upgrading the summit cmake, essl and MPI modules.

Also updating the ROMIO version to prevent OOM errors.

Fixes #2847

[BFB]
jgfouca pushed a commit that referenced this pull request Jun 26, 2024
…p_to_share

Automatically Merged using E3SM Pull Request AutoTester
PR Title: Move IOP from control/ to share/iop
PR Author: tcclevenger
PR LABELS: cmake, BFB, AT: AUTOMERGE, bugfix, DP-SCREAM
Development

Successfully merging this pull request may close these issues.

E3SM hangs in MPI finalize on Summit
3 participants