Upgrading Summit modules #2856
Tried X cases successfully with the IBM and PGI compilers on Summit.
For the MMF runs, we had to add the following to bypass ROMIO out-of-memory errors. Fix:
A request: there is a new default cmake module. It might be good to update that as well.
Sure, I will update the branch.
Upgrading the summit cmake, essl and MPI modules. With the older version of MPI modules, MPI_Finalize call hangs. The older version of essl module is no longer available. Fixes #2847
@sarats @jayeshkrishna I got the OOM error while running the high-res F case with I/O, as mentioned by @sarats. So we need to add the fix to the machine file too.
Ok, I will add the env and rebase the branch (@minxu74: Did the env work for you?)
Just FYI: this explains the context for the env variable setting, from the OLCF page on resolved Summit issues. The old ROMIO doesn't result in OOM errors but may cause performance issues. Perhaps we need to check the new spectrum-mpi while disabling darshan-runtime, or contact OLCF to figure out the path ahead.
@jayeshkrishna The test run is in the queue now. However, based on the response from @sarats, the fix is not needed because you already use the latest Spectrum MPI. I used the old Spectrum MPI and got the OOM.
@minxu74 To clarify, my OOM errors went away only after setting that environment variable.
@jayeshkrishna @sarats My test run confirmed that the env setting fixed the OOM error.
@minxu74: Did you run the tests with the latest MPI module, spectrum-mpi/10.2.0.11-20190201? Did you need the env setting to fix the OOM error with the latest MPI module?
@jayeshkrishna I ran with the old spectrum-mpi/10.2.0.10-20181214. For the new version, I submitted a test run without the env fix, and it is in the queue now.
@minxu74: Did your job complete? Does the new MPI lib need the romio environment variable?
@jayeshkrishna The test case with the new MPI and w/o the romio env actually used PIO2, and it failed on Summit with a segmentation error, not OOM. So the new MPI may not need the romio env. I will do another test with the new MPI and PIO1, but without the romio env, to double-check.
When using PIO2 with E3SM you need to use the PIO2 in our fork (replace cime/src/externals/pio2 with master from https://github.com/E3SM-Project/ParallelIO). I will be updating the version of PIO in CIME soon.
@jayeshkrishna Thanks for the PIO2 tip. The run with the new MPI but w/o the romio env failed with the same OOM error, so the romio env setting is needed. I have run a case successfully with the new MPI and the romio env setting.
Great, I will add the env setting to the branch and merge it to next.
OLCF recommends setting "OMPI_MCA_io" env variable to "romio314" to prevent out of memory issues with the code. In our testing we have found that setting this env gets rid of OOM errors with certain E3SM simulations (ECP simulation runs). However also note that setting this environment variable reduces the performance of parallel HDF5 (when using NetCDF4P PIO iotype to write data). See Issue #2856
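For anyone reproducing this outside the machine file, here is a minimal sketch of the setting described in the commit message above. It assumes a bash-like job script on Summit; where the line is placed (before the MPI launch) and the echo check are illustrative assumptions, not the actual change made in this PR.

```shell
# Assumption: this runs in the job script before the MPI application is launched.
# OMPI_MCA_io selects the MPI-IO component; "romio314" is the value OLCF recommends.
export OMPI_MCA_io=romio314

# Print the value so it shows up in the job log.
echo "OMPI_MCA_io=${OMPI_MCA_io}"
```

As noted above, this setting trades away some parallel HDF5 performance (NetCDF4P PIO iotype), so it may be worth applying only to the runs that actually hit the OOM error.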
Tested the latest branch with the X case (ran successfully) on Summit.
@jayeshkrishna The changes look good to me. Thanks a lot for your PR.
Adding some text from the Summit user guide:
So the above doc seems to suggest that the ROMIO version was upgraded to 3.2.1 (and the env should only have been needed with the older version of the MPI lib). We can handle this issue in a future PR, since @minxu74 observed that this env setting was required with both the old and new MPI libs.
Upgrading the summit cmake, essl and MPI modules. Also updating the ROMIO version to prevent OOM errors. Fixes #2847 [BFB]
These changes are only to the Summit machine files (they should not impact nightlies or any other machine), so merging to master.