
For pm-cpu/pm-gpu, update some module versions to current machine defaults #5533

Merged: 1 commit merged from ndk/machinefiles/pm-module-updateMar2023 into master on Mar 22, 2023

Conversation

@ndkeen (Contributor) commented on Mar 17, 2023

Minor version increases for several modules.
Does not impact compiler versions.
Motivation is to keep up-to-date with machine defaults.
Do not see any measurable performance changes.

cray-mpich/8.1.22 -> cray-mpich/8.1.24
cray-hdf5-parallel/1.12.2.1 -> cray-hdf5-parallel/1.12.2.3
cray-netcdf-hdf5parallel/4.9.0.1 -> cray-netcdf-hdf5parallel/4.9.0.3
cray-parallel-netcdf/1.12.3.1 -> cray-parallel-netcdf/1.12.3.3
cmake/3.22.0 -> cmake/3.24.3

Added specific version numbers for craype and cray-libsci to reduce surprises when the default version is changed (these were already using the default versions).
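Not part of the change itself, but as a rough illustration of what pinning buys: a small Python sketch of how one could check that a login shell on pm-cpu/pm-gpu picks up the versions listed above. The `module -t list` invocation, the output parsing, and the function name are assumptions about the Lmod/environment-modules setup, not something taken from this PR, and would need adjusting for the actual site configuration.

    #!/usr/bin/env python3
    """Rough sketch (not part of this PR): check that the modules loaded in the
    current shell match the versions pinned here for pm-cpu/pm-gpu.
    Assumes an Lmod/environment-modules setup where `module -t list` prints one
    module per line; output may go to stderr, so both streams are scanned."""
    import subprocess

    # Module versions pinned by this PR.
    PINNED = {
        "cray-mpich": "8.1.24",
        "cray-hdf5-parallel": "1.12.2.3",
        "cray-netcdf-hdf5parallel": "4.9.0.3",
        "cray-parallel-netcdf": "1.12.3.3",
        "cmake": "3.24.3",
    }

    def loaded_modules():
        """Return {name: version} for the modules loaded in a login shell."""
        # `module` is usually a shell function, so invoke it through bash -lc.
        proc = subprocess.run(["bash", "-lc", "module -t list"],
                              capture_output=True, text=True)
        loaded = {}
        for line in (proc.stdout + proc.stderr).splitlines():
            name, sep, version = line.strip().partition("/")
            if sep:
                loaded[name] = version
        return loaded

    if __name__ == "__main__":
        have = loaded_modules()
        for name, want in PINNED.items():
            got = have.get(name, "<not loaded>")
            flag = "OK      " if got == want else "MISMATCH"
            print(f"{flag} {name}: expected {want}, found {got}")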

Also added a couple of modules to the remove list, just in case they are loaded.

Alvarez is updated the same way, though that machine may be going away.

Fixes #5525
[bfb]

@ndkeen self-assigned this on Mar 17, 2023
@ndkeen added the Machine Files, pm-gpu (Perlmutter machine at NERSC, GPU nodes), and pm-cpu (Perlmutter at NERSC, CPU-only nodes) labels on Mar 17, 2023
@ndkeen requested a review from rljacob on March 17, 2023 at 22:54
@ndkeen (Contributor, Author) commented on Mar 17, 2023

I tested against baselines on pm-cpu with e3sm_integration, and I have been running larger cases with these module versions in a scream repo (to test pm-gpu).

Example of some of the changes:

  Changes in Cray MPICH 8.1.24

      - CAST-24802 - Fix for MPI_CXX datatypes in the mpi header
      - CAST-26727 - Error instead of warning when NIC asymmetry is detected at start-up
      - CAST-31527 - Remove shared object constructors in MPI-IO
      - PE-44058 - Remove fallback usage of VNI and depend on launcher definitions
      - PE-44653 - Fix regression with memory debugging in OFI code
      - PE-44772 - Add support for collecting CXI counters
      - PE-44989 - Fix MPI-IO debug trace output
      - PE-45030 - Fix environment variable printing for MPI_DPM_DIR and MPICH_SPAWN_USE_RANKPOOL
      - PE-45042 - Add support for lmod auto swapping to the Cray-MPICH module
      - PE-45094 - Add support for program environment swapping and Cray-MPIXlate
      - PE-45160 - Enable GPU kernel-based optimizations by default

PnetCDF 1.12.3 release notes: https://github.com/Parallel-NetCDF/Parallel-NetCDF.github.io/blob/master/Release_notes/1.12.3.md

@ndkeen (Contributor, Author) commented on Mar 20, 2023

There are some TPUT failures when I run against the baselines again, but they look fine to me -- it is just comparing two very fast cases. Would I need to bless the TPUT fails, or increase the TPUT tolerance for the machine?
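For context, TPUT is the throughput comparison against the baseline: a failure means the measured throughput (simulated years per day) fell outside the allowed tolerance. The sketch below only illustrates that kind of relative check; the function name, exact formula, and 10% tolerance are assumptions for illustration, not the test system's actual code or default.

    def tput_ok(current_sypd: float, baseline_sypd: float, tolerance: float = 0.10) -> bool:
        """Illustrative throughput check (assumed form, not the actual test code):
        pass if the current throughput has not dropped more than `tolerance`
        (fractionally) below the baseline throughput."""
        drop = (baseline_sypd - current_sypd) / baseline_sypd
        return drop <= tolerance

    # Example: two very short test cases whose timings are noisy.
    print(tput_ok(current_sypd=8.5, baseline_sypd=10.0))  # False: ~15% drop exceeds 10%
    print(tput_ok(current_sypd=9.5, baseline_sypd=10.0))  # True:  ~5% drop within 10%

On very short test cases the measured throughput is noisy, which is why a difference above the tolerance does not necessarily indicate a real slowdown.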

ndkeen added a commit that referenced this pull request on Mar 20, 2023 (…5533)
@ndkeen (Contributor, Author) commented on Mar 20, 2023

Merged to next. The test runs are way too short to actually see any performance differences between before and after. In my testing, some tests report a TPUT difference larger than 10%, but those cases seem fine, and I have not seen any issues with larger/longer cases.

Actually, I forgot that NERSC currently has an environment variable set (FI_MR_CUDA_CACHE_MONITOR_ENABLED=0) to avoid NODE_FAILs, but it slows many things down, which could be the reason for some of the TPUT errors.

Looks like all the GNU tests passed on CDash. Manually checked the nvidia developer tests -- they completed as expected.

@ndkeen merged commit c648dc2 into master on Mar 22, 2023
@ndkeen deleted the ndk/machinefiles/pm-module-updateMar2023 branch on March 22, 2023 at 20:45
Successfully merging this pull request may close the linked issue: "For Perlmutter (pm-cpu/pm-cpu), update module versions to defaults"