
For pm-cpu/pm-gpu, update some module versions to current machine defaults #5533

Merged: 1 commit merged from ndk/machinefiles/pm-module-updateMar2023 into master on Mar 22, 2023

Conversation

@ndkeen (Contributor) commented on Mar 17, 2023

Minor version increases for several modules.
Does not impact compiler versions.
Motivation is to keep up-to-date with machine defaults.
Do not see any measurable performance changes.

cray-mpich/8.1.22 -> cray-mpich/8.1.24
cray-hdf5-parallel/1.12.2.1 -> cray-hdf5-parallel/1.12.2.3
cray-netcdf-hdf5parallel/4.9.0.1 -> cray-netcdf-hdf5parallel/4.9.0.3
cray-parallel-netcdf/1.12.3.1 -> cray-parallel-netcdf/1.12.3.3
cmake/3.22.0 -> cmake/3.24.3

Added specific version numbers for craype and cray-libsci to reduce surprises when the default version is changed (these were already using the default versions).
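Not part of the change itself, but as a rough illustration of what pinning buys: a small Python sketch of how one could check that a login shell on pm-cpu/pm-gpu picks up the versions listed above. The `module -t list` invocation, the output parsing, and the function name are assumptions about the Lmod/environment-modules setup, not something taken from this PR, and would need adjusting for the actual site configuration.

    #!/usr/bin/env python3
    """Rough sketch (not part of this PR): check that the modules loaded in the
    current shell match the versions pinned here for pm-cpu/pm-gpu.
    Assumes an Lmod/environment-modules setup where `module -t list` prints one
    module per line; output may go to stderr, so both streams are scanned."""
    import subprocess

    # Module versions pinned by this PR.
    PINNED = {
        "cray-mpich": "8.1.24",
        "cray-hdf5-parallel": "1.12.2.3",
        "cray-netcdf-hdf5parallel": "4.9.0.3",
        "cray-parallel-netcdf": "1.12.3.3",
        "cmake": "3.24.3",
    }

    def loaded_modules():
        """Return {name: version} for the modules loaded in a login shell."""
        # `module` is usually a shell function, so invoke it through bash -lc.
        proc = subprocess.run(["bash", "-lc", "module -t list"],
                              capture_output=True, text=True)
        loaded = {}
        for line in (proc.stdout + proc.stderr).splitlines():
            name, sep, version = line.strip().partition("/")
            if sep:
                loaded[name] = version
        return loaded

    if __name__ == "__main__":
        have = loaded_modules()
        for name, want in PINNED.items():
            got = have.get(name, "<not loaded>")
            flag = "OK      " if got == want else "MISMATCH"
            print(f"{flag} {name}: expected {want}, found {got}")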

Also added a couple of modules to the remove list, just in case they are loaded.

Alvarez is updated the same way, though that machine may be going away.

Fixes #5525
[bfb]

@ndkeen self-assigned this on Mar 17, 2023
@ndkeen added the Machine Files, pm-gpu (Perlmutter machine at NERSC, GPU nodes), and pm-cpu (Perlmutter at NERSC, CPU-only nodes) labels on Mar 17, 2023
@ndkeen requested a review from rljacob on March 17, 2023 at 22:54
@ndkeen (Contributor, Author) commented on Mar 17, 2023

I tested against baselines on pm-cpu with e3sm_integration, and I have been running larger cases with these module versions in a scream repo (to test pm-gpu).

Example of some of the changes:

  Changes in Cray MPICH 8.1.24

      - CAST-24802 - Fix for MPI_CXX datatypes in the mpi header
      - CAST-26727 - Error instead of warning when NIC asymmetry is detected at start-up
      - CAST-31527 - Remove shared object constructors in MPI-IO
      - PE-44058 - Remove fallback usage of VNI and depend on launcher definitions
      - PE-44653 - Fix regression with memory debugging in OFI code
      - PE-44772 - Add support for collecting CXI counters
      - PE-44989 - Fix MPI-IO debug trace output
      - PE-45030 - Fix environment variable printing for MPI_DPM_DIR and MPICH_SPAWN_USE_RANKPOOL
      - PE-45042 - Add support for lmod auto swapping to the Cray-MPICH module
      - PE-45094 - Add support for program environment swapping and Cray-MPIXlate
      - PE-45160 - Enable GPU kernel-based optimizations by default

PnetCDF 1.12.3 release notes: https://github.com/Parallel-NetCDF/Parallel-NetCDF.github.io/blob/master/Release_notes/1.12.3.md

@ndkeen (Contributor, Author) commented on Mar 20, 2023

There are some TPUT failures when I run against the baselines again, but they look fine to me -- it is just comparing two very fast cases. Would I need to bless the TPUT fails, or increase the TPUT tolerance for the machine?
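For context, TPUT is the throughput comparison against the baseline: a failure means the measured throughput (simulated years per day) fell outside the allowed tolerance. The sketch below only illustrates that kind of relative check; the function name, exact formula, and 10% tolerance are assumptions for illustration, not the test system's actual code or default.

    def tput_ok(current_sypd: float, baseline_sypd: float, tolerance: float = 0.10) -> bool:
        """Illustrative throughput check (assumed form, not the actual test code):
        pass if the current throughput has not dropped more than `tolerance`
        (fractionally) below the baseline throughput."""
        drop = (baseline_sypd - current_sypd) / baseline_sypd
        return drop <= tolerance

    # Example: two very short test cases whose timings are noisy.
    print(tput_ok(current_sypd=8.5, baseline_sypd=10.0))  # False: ~15% drop exceeds 10%
    print(tput_ok(current_sypd=9.5, baseline_sypd=10.0))  # True:  ~5% drop within 10%

On very short test cases the measured throughput is noisy, which is why a difference above the tolerance does not necessarily indicate a real slowdown.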

ndkeen added a commit that referenced this pull request on Mar 20, 2023 (…5533)
@ndkeen (Contributor, Author) commented on Mar 20, 2023

Merged to next. The test runs are way too short to actually see any performance differences between before and after. In my testing, some tests report a TPUT difference larger than 10%, but those cases seem fine, and I have not seen any issues with larger/longer cases.

Actually, I forgot that NERSC currently has an environment variable set (FI_MR_CUDA_CACHE_MONITOR_ENABLED=0) to avoid NODE_FAILs, but it slows many things down, which could be the reason for some of the TPUT errors.

Looks like all the GNU tests passed on CDash. Manually checked the nvidia developer tests -- they completed as expected.

@ndkeen merged commit c648dc2 into master on Mar 22, 2023
@ndkeen deleted the ndk/machinefiles/pm-module-updateMar2023 branch on March 22, 2023 at 20:45
Successfully merging this pull request may close the linked issue: "For Perlmutter (pm-cpu/pm-cpu), update module versions to defaults"