UFS-WM regression test failures: spack stack PR #1707 #708

Closed
jkbk2004 opened this issue Aug 10, 2023 · 13 comments
Labels: bug (Something is not working), INFRA (JEDI Infrastructure)

Comments

@jkbk2004

Describe the bug
A reproducibility problem was found during regression testing of ufs-community/ufs-weather-model#1707.

  1. After creating baselines for cpld_2threads_p8_intel, cpld_mpi_p8_intel, cpld_bmark_p8_intel, and hafs_regional_docn_oisst_intel, the regression test runs that check bit-for-bit (B4B) agreement against those baselines do not reproduce the results.
  2. The experiments run to completion, but the results of the second run are not identical to the first.

To Reproduce
Steps to reproduce the behavior: create baselines for those cases, then run the regression tests on Gaea, e.g. ./rt.sh -c -e -a [account name] followed by ./rt.sh -m -e -a [account name]

Expected behavior
Results of the second run should be bit-for-bit identical to the newly created baseline. Instead, nccmp shows differences in fields:

Jong.Kim@gaea10:/lustre/f2/pdata/ncep/Jong.Kim/rt-1707/tests/logs$ nccmp -d -S -q -f -g -B --Attribute=checksum --warn=format  /lustre/f2/pdata/ncep_shared/emc.nemspara/RT/NEMSfv3gfs/develop-20230807/hafs_regional_docn_oisst_intel/sfcf006.nc /lustre/f2/scratch/Jong.Kim/FV3_RT/rt_39997/hafs_regional_docn_oisst_intel/sfcf006.nc
Variable      Group   Count          Sum      AbsSum          Min         Max       Range         Mean      StdDev
acond         /      624432    -0.813792     77.8705    -0.033305   0.0246161   0.0579211 -1.30325e-06 0.000420176
albdo_ave     /         144    0.0202661    0.306048   -0.0882874    0.120814    0.209102  0.000140737   0.0128115
cduvb_ave     /         195  6.97528e-08 4.56946e-07 -2.71248e-08 4.27826e-08 6.99074e-08  3.57707e-10 5.75421e-09
cnwat         /       83295     -18.2661     340.479         -0.5    0.499961    0.999961 -0.000219294   0.0264393
cpofp         /         128     -3.78541      14.625     -0.86563    0.740301     1.60593   -0.0295735    0.225738
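When many variables differ, it can help to rank them by the size of the difference. The following is a minimal sketch, assuming the statistics table printed by nccmp -S has the whitespace-separated layout shown above; the function name parse_nccmp_stats and the SAMPLE text (two rows copied from the table above) are illustrative and not part of nccmp or rt.sh.

```python
def parse_nccmp_stats(text):
    """Parse a whitespace-separated statistics table like the one above.

    Assumes a header line starting with "Variable" followed by one row
    per differing variable with the same number of columns.
    """
    rows = []
    header = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Variable":
            header = fields
            continue
        # Skip anything that does not look like a table row.
        if header is None or len(fields) != len(header):
            continue
        row = {"Variable": fields[0], "Group": fields[1], "Count": int(fields[2])}
        for name, value in zip(header[3:], fields[3:]):
            row[name] = float(value)
        rows.append(row)
    return rows


# Two rows copied from the nccmp output in this issue.
SAMPLE = """\
Variable      Group   Count          Sum      AbsSum          Min         Max       Range         Mean      StdDev
acond         /      624432    -0.813792     77.8705    -0.033305   0.0246161   0.0579211 -1.30325e-06 0.000420176
cnwat         /       83295     -18.2661     340.479         -0.5    0.499961    0.999961 -0.000219294   0.0264393
"""

rows = parse_nccmp_stats(SAMPLE)
# Rank by the spread of the differences (the Range column).
worst = max(rows, key=lambda r: r["Range"])
print(worst["Variable"], worst["Range"])
```

Sorting on Range is only one possible choice; Count or AbsSum may be more informative depending on the field.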

System:
Gaea; we suspect similar behavior on Acorn as well.

Additional context
We didn't see any problem when we merged in the latest library update PR ufs-community/ufs-weather-model#1745. A suggestion is to revert the jasper and zlib updates, set exactly the same library options as the current UFS-WM develop, and start debugging from there.

@AlexanderRichert-NOAA
Collaborator

I'm currently exploring this on Gaea using the cpld_2threads_p8 case. I tried reverting a number of libraries, including jasper+zlib, to no avail. Interestingly, when I swap out a number of libraries for hpc-stack ones, I can get my cpld_2threads_p8 results to match cpld_control_p8 baseline results based on hpc-stack (20230804). I'll keep updating here as I find out more.

@AlexanderRichert-NOAA
Collaborator

AlexanderRichert-NOAA commented Aug 11, 2023

As a general observation, the main differences between spack-stack and hpc-stack are:

  • jasper and zlib versions (I don't think this is it but @ulmononian is going to look deeper at this)
  • in spack-stack on Gaea, everything is built with cray wrappers, whereas with hpc-stack, only the MPI-based libraries are; all the non-MPI libraries are built with icc/ifort
  • spack sets different defaults for some (relatively) obscure cmake options. In the past this has impacted one or two libraries that behaved differently between CMAKE_BUILD_TYPE=Release and CMAKE_BUILD_TYPE=RelWithDebInfo (where the latter is the spack default, though it can be changed). Once I whittle down the list of libraries at issue, I'll look more closely at these.

I have not been able to find any difference in terms of which packages are built with OpenMP support, which seemed like a logical thing to check based on the inconsistency between the cpld_control_p8 and cpld_2threads_p8 cases.

@AlexanderRichert-NOAA
Collaborator

AlexanderRichert-NOAA commented Aug 11, 2023

I can get the cpld_2threads_p8 test on Gaea to match the hpc-stack (20230804) baseline results if I swap the hpc-stack esmf library into my spack-stack installation but leave everything else exactly the same. For what it's worth, the references in libesmf.a to other libraries (NetCDF, PIO) are identical between the spack-stack (failing) and hpc-stack (succeeding) versions of the library, and in my passing test I'm linking entirely against the spack-stack versions of NetCDF and PIO. Long story short, I'm pretty confident the problem is with the build of the ESMF library itself and not its dependencies. Looking at the libesmf.a symbols (nm output), I can see some differences in debug info. Also, Spack inserts a lot of its own build options like CPU arch targets and such, so I suspect the issue, at least for cpld_2threads_p8, lies in some subtle but critical difference in the build settings (not all of which are necessarily visible in the build logs on account of cray and spack wrappers and their use of environment variables, so, yeah, that'll be fun).

@AlexanderRichert-NOAA
Collaborator

AlexanderRichert-NOAA commented Aug 11, 2023

I've got cpld_2threads_p8 running successfully on Gaea with a pure spack-stack installation by making just a couple of build setting changes to ESMF. The most critical ones appear to be adding "-fp-model precise" to the F90 and CXX flags; I also fiddled with a few other settings, so I'll keep testing to make sure that the "-fp-model precise" change is the only tweak needed. If that does it, then we should either update our spack-stack configuration (use fflags="-fp-model precise" cxxflags="-fp-model precise" for the intel version of ESMF), or add a variant (variant("fpprecise", default=False, when="%intel")). NOTE: previously I put 'strict' where I should have put 'precise'.
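For readers wondering why a floating-point flag matters for bit-for-bit comparisons: -fp-model precise tells the Intel compilers to avoid floating-point optimizations that are not value-safe, such as reordering additions. The short sketch below only illustrates that floating-point addition is not associative, so a reordered sum can differ in the last bits. Whether reassociation specifically was what changed the ESMF results here is an assumption, not something established in this thread.

```python
# Floating-point addition is not associative: the grouping of the
# operands changes the rounding, and therefore the last bits.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # sum evaluated left to right
right = a + (b + c)  # same values, different grouping

print(left == right)  # False
print(left, right)    # 0.6000000000000001 0.6
```

A difference this small in one library routine is enough to fail a B4B check once it propagates through a coupled model run.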

@jkbk2004
Author

I've got cpld_2threads_p8 running successfully on Gaea with a pure spack-stack installation by making just a couple of build setting changes to ESMF. The most critical ones appear to be adding "-fp-model strict" to the F90 and CXX flags; I also fiddled with a few other settings so I'll keep testing to make sure that the "-fp-model strict" change is the only tweak needed. If that does it, then we should either update our spack-stack configuration (use fflags="-fp-model strict" cxxflags="-fp-model strict for intel version of ESMF), or add a variant (variant("fpstrict", default=False, when="%intel").

Sounds like progress!

@AlexanderRichert-NOAA
Collaborator

Solved. I can get these RTs to match the hpc-stack-based baseline data by adding -fp-model precise to ESMF's F90 and CXX flags, as well as setting MAPL's CMAKE_BUILD_TYPE to Release (rather than Spack's default RelWithDebInfo). I'll create separate issues for the long-term fixes, so let's wait to close this issue until we have the ufs-pio-2.5.10 environments updated.
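As a rough sketch of what the two fixes look like as Spack specs: the helpers below just build spec strings, using Spack's fflags/cxxflags compiler-flag syntax and the build_type variant that CMake-based packages expose. The function names esmf_spec and mapl_spec are hypothetical, and the resulting strings are not necessarily the exact entries used in the spack-stack configuration.

```python
def esmf_spec(fp_model="precise"):
    """Spec string for ESMF with -fp-model added to Fortran and C++ flags."""
    flag = f"-fp-model {fp_model}"
    return f'esmf fflags="{flag}" cxxflags="{flag}"'


def mapl_spec(build_type="Release"):
    """Spec string for MAPL with an explicit CMake build type."""
    return f"mapl build_type={build_type}"


print(esmf_spec())
print(mapl_spec())
```

These strings could be passed to spack install or placed in an environment's spec list; the Intel-only restriction discussed above is not encoded here.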

@jkbk2004
Author

Awesome!

@climbfuji
Collaborator

The ufs-pio-2.5.10 environment is now also available on S4 with the modifications from Alex described above. I don't know if the ufs-pio-2.5.10 environments were rebuilt according to those instructions on the other platforms, since I was out on vacation.

@ulmononian
Collaborator

ulmononian commented Aug 15, 2023

the tests reported here by @jkbk2004 now pass (see attached log; each test was compared against a baseline created with spack-stack PR 1707 w/ alex's test stack with the esmf/mapl fixes). @climbfuji i think we can go ahead and perform alex's modifications as described in his email for all machines w/ the ufs-pio env.

RegressionTests_gaea.log

@climbfuji
Collaborator

Thanks @ulmononian - can you create a list of machines that need the updated installation? Since I was on vacation when the chained environments were created, I don't know for sure which machines received those updates.

@ulmononian
Collaborator

@climbfuji #712

@climbfuji added the INFRA (JEDI Infrastructure) label Aug 17, 2023
@ulmononian
Collaborator

ulmononian commented Aug 18, 2023

@jkbk2004 @climbfuji since these tests now pass on gaea c4 in a spack-stack - spack-stack comparison (i.e. the method suggested by jong in the issue description), can we close this issue?

@climbfuji
Collaborator

The UFS transitioned to spack-stack two weeks ago after @AlexanderRichert-NOAA and @ulmononian figured out the b4b reproducibility issues. We can close this issue as completed.
