Many tests with LND fail with GNU on cori-knl and cori-haswell #3270

Closed
ndkeen opened this issue Oct 30, 2019 · 14 comments · Fixed by #3787

@ndkeen
Contributor

ndkeen commented Oct 30, 2019

After the Cori upgrade, I had trouble with GNU. In fact, all GNU tests failed immediately with a runtime error that was eventually attributed to the hugepages module. NERSC added this module by default, and we now remove it. After the PR to remove hugepages, I ran e3sm_developer with GNU and some tests passed, but most failed. I think most or all fail with:
Program received signal SIGILL: Illegal instruction.

Example of how to reproduce (I just tried with master as of Oct 23rd):
ERS.f09_g16.I1850CLM45CN --compiler=gnu

Looking back at e3sm_developer test after the PR to remove hugepages, the only tests that passed were:

ERS.f09_g16_g.MALISIA.cori-knl_gnu
ERS.f19_g16_rx1.A.cori-knl_gnu
ERS.ne30_g16_rx1.A.cori-knl_gnu
ERS_Ld5.T62_oQU120.CMPASO-NYF.cori-knl_gnu
SEQ.f19_g16.X.cori-knl_gnu
SMS.ne30_f19_g16_rx1.A.cori-knl_gnu

Trying with DEBUG=TRUE gives the same failure, and I do get a stack trace (below).
I also tried using only 1 thread: same failure.
Same error using gnu 8.3.0 (currently we use 8.2.0) and a higher version of cray-libsci.
Same error on cori-haswell with gnu 8.2.0.

However, with gnu 8.3.0 (as well as higher version of cray-libsci), the test works on cori-haswell.

Stack from DEBUG runs:

29: Program received signal SIGILL: Illegal instruction.
29: 
29: Backtrace for this error:
29: 
29: Program received signal SIGILL: Illegal instruction.
29: 
29: Backtrace for this error:
29: #0  0x2aaab21ea1df in ???
29: #0  0x2aaab21ea1df in ???
29: #1  0xa2934a in __soiltemperaturemod_MOD_setmatrix_snownonurban
29: #1  0xa2934a in __soiltemperaturemod_MOD_setmatrix_snownonurban
29:     at /global/cscratch1/sd/ndk/wacmy/m17-oct23/components/clm/src/biogeophys/SoilTemperatureMod.F90:3729
29:     at /global/cscratch1/sd/ndk/wacmy/m17-oct23/components/clm/src/biogeophys/SoilTemperatureMod.F90:3729
29: #2  0xa2e2de in __soiltemperaturemod_MOD_setmatrix_snow
29: #2  0xa2e2de in __soiltemperaturemod_MOD_setmatrix_snow
29:     at /global/cscratch1/sd/ndk/wacmy/m17-oct23/components/clm/src/biogeophys/SoilTemperatureMod.F90:3467
29:     at /global/cscratch1/sd/ndk/wacmy/m17-oct23/components/clm/src/biogeophys/SoilTemperatureMod.F90:3467
29: #3  0xa31364 in __soiltemperaturemod_MOD_setmatrix
29: #3  0xa31364 in __soiltemperaturemod_MOD_setmatrix
29:     at /global/cscratch1/sd/ndk/wacmy/m17-oct23/components/clm/src/biogeophys/SoilTemperatureMod.F90:3234
29:     at /global/cscratch1/sd/ndk/wacmy/m17-oct23/components/clm/src/biogeophys/SoilTemperatureMod.F90:3234
29: #4  0xa5b7f4 in __soiltemperaturemod_MOD_solvetemperature
29: #4  0xa5b7f4 in __soiltemperaturemod_MOD_solvetemperature
29:     at /global/cscratch1/sd/ndk/wacmy/m17-oct23/components/clm/src/biogeophys/SoilTemperatureMod.F90:811
29:     at /global/cscratch1/sd/ndk/wacmy/m17-oct23/components/clm/src/biogeophys/SoilTemperatureMod.F90:811
29: #5  0xa60515 in __soiltemperaturemod_MOD_soiltemperature
29: #5  0xa60515 in __soiltemperaturemod_MOD_soiltemperature
29:     at /global/cscratch1/sd/ndk/wacmy/m17-oct23/components/clm/src/biogeophys/SoilTemperatureMod.F90:470
29:     at /global/cscratch1/sd/ndk/wacmy/m17-oct23/components/clm/src/biogeophys/SoilTemperatureMod.F90:470
29: #6  0x543d4a in __clm_driver_MOD_clm_drv._omp_fn.4
29:     at /global/cscratch1/sd/ndk/wacmy/m17-oct23/components/clm/src/main/clm_driver.F90:1296
29: #6  0x543d4a in __clm_driver_MOD_clm_drv._omp_fn.4
29:     at /global/cscratch1/sd/ndk/wacmy/m17-oct23/components/clm/src/main/clm_driver.F90:1296
29: #7  0x2aaab1b3bb0e in GOMP_parallel
29:     at ../../../cray-gcc-8.3.0-201903122028.16ea96cb84a9a/libgomp/parallel.c:168
29: #8  0x545e75 in __clm_driver_MOD_clm_drv
29:     at /global/cscratch1/sd/ndk/wacmy/m17-oct23/components/clm/src/main/clm_driver.F90:1296
29: #7  0x2aaab1b4443d in gomp_thread_start
29:     at ../../../cray-gcc-8.3.0-201903122028.16ea96cb84a9a/libgomp/team.c:120
29: #8  0x2aaab15de568 in ???
29: #9  0x2aaab22aca2e in ???
29: #10  0xffffffffffffffff in ???

/global/cscratch1/sd/ndk/acme_scratch/cori-knl/m17-oct23/ERS_D.f09_g16.I1850CLM45CN.cori-knl_gnu.20191030_115956_s1tz98

@ndkeen ndkeen added the Cori label Oct 30, 2019
@ndkeen ndkeen changed the title from "GNU broken on cori-knl" to "GNU broken on cori-knl and cori-haswell" Oct 30, 2019
@ndkeen
Contributor Author

ndkeen commented Nov 1, 2019

Adjusting the DEBUG flags, I get this error:

37: At line 1342 of file /global/cscratch1/sd/ndk/wacmy/m17-oct23/components/clm/src/main/histFileMod.F90
37: Fortran runtime error: Index '1' of dimension 2 of array 'clmptr_ra' outside of expected range (0:0)
37:
37: Error termination. Backtrace:
37: #0  0x7a57fa in hist_update_hbuf_field_2d
37:     at /global/cscratch1/sd/ndk/wacmy/m17-oct23/components/clm/src/main/histFileMod.F90:1342
37: #1  0x7bb4ef in __histfilemod_MOD_hist_update_hbuf._omp_fn.0
37:     at /global/cscratch1/sd/ndk/wacmy/m17-oct23/components/clm/src/main/histFileMod.F90:1005
37: #2  0x2aaab1b3bb0e in GOMP_parallel
37:     at ../../../cray-gcc-8.3.0-201903122028.16ea96cb84a9a/libgomp/parallel.c:168
37: #3  0x7b1522 in __histfilemod_MOD_hist_update_hbuf
37:     at /global/cscratch1/sd/ndk/wacmy/m17-oct23/components/clm/src/main/histFileMod.F90:1001
37: #4  0x61c3d0 in __clm_driver_MOD_clm_drv
37:     at /global/cscratch1/sd/ndk/wacmy/m17-oct23/components/clm/src/main/clm_driver.F90:1413
37: #5  0x5fe36b in __lnd_comp_mct_MOD_lnd_run_mct
37:     at /global/cscratch1/sd/ndk/wacmy/m17-oct23/components/clm/src/cpl/lnd_comp_mct.F90:509
37: #6  0x438248 in __component_mod_MOD_component_run
37:     at /global/cscratch1/sd/ndk/wacmy/m17-oct23/cime/src/drivers/mct/main/component_mod.F90:714
37: #7  0x415859 in __cime_comp_mod_MOD_cime_run
37:     at /global/cscratch1/sd/ndk/wacmy/m17-oct23/cime/src/drivers/mct/main/cime_comp_mod.F90:2612
37: #8  0x43522e in cime_driver
37:     at /global/cscratch1/sd/ndk/wacmy/m17-oct23/cime/src/drivers/mct/main/cime_driver.F90:133
37: #9  0x435284 in main
37:     at /global/cscratch1/sd/ndk/wacmy/m17-oct23/cime/src/drivers/mct/main/cime_driver.F90:23

    if (no_snow_behavior /= no_snow_unset) then
       ! For multi-layer snow fields, build a special output variable that handles
       ! missing snow layers appropriately

       ! Note, regarding bug 1786: The following allocation is not what we would want if
       ! this routine were operating in a threaded region (or, more generally, within a
       ! loop over nclumps) - in that case we would want to use the bounds information for
       ! this clump. But currently that's not possible because the bounds of some fields
       ! have been reset to 1 - see also bug 1786. Similarly, if we wanted to allow
       ! operation within a loop over clumps, we would need to pass 'bounds' to
       ! hist_set_snow_field_2d rather than relying on beg1d & end1d (which give the proc,
       ! bounds not the clump bounds)

       allocate(field(lbound(clmptr_ra(hpindex)%ptr, 1) : ubound(clmptr_ra(hpindex)%ptr, 1), 1:num2d))
       field_allocated = .true.

       call hist_set_snow_field_2d(field, clmptr_ra(hpindex)%ptr, no_snow_behavior, type1d, &
            beg1d, end1d)
    else
       field => clmptr_ra(hpindex)%ptr(:,1:num2d)  ! <-- line 1342
       field_allocated = .false.
    end if

@ndkeen
Contributor Author

ndkeen commented Nov 2, 2019

Noting that this test also fails with Intel DEBUG as in #3284

The above issue is fixed (intel+debug)

@amametjanov
Member

In tests on Theta, SIGILL: Illegal instruction with GNU 8.3.0 is fixed by adding -O0 to the debug flags. The existing -Og flag must be adding "debugger-friendly" instructions that do not play well with KNL. Appending -O0 overrides -Og. I did not change -Og to -O0, to avoid affecting GNU builds on other machines.

@ndkeen
Contributor Author

ndkeen commented Nov 7, 2019

I tested using -O0 on two of my repos: one with gcc 8.2.0 and one with gcc 8.3.0. They both failed, but may not be failing with the same error. Now that we have the fix in #3260, I will test there.

@ndkeen
Contributor Author

ndkeen commented Nov 7, 2019

Using repo from Nov 6th, I tried SMS_D.f09_g16.I1850CLM45CN.cori-knl_gnu with -O0 instead of -Og. It still fails with illegal instruction, but it may be a different error. The e3sm.log files are messy for this failure so it's not easy to see.

I also tried SMS_D_PMx1.f09_g16.I1850CLM45CN.cori-knl_gnu with similar result.

The case dirs are here:
/global/cscratch1/sd/ndk/acme_scratch/cori-knl/m19-nov6

@amametjanov
Member

amametjanov commented Nov 8, 2019

Looks like an issue with an uninitialized variable, hitting a NaN. After replacing -Og with -O0, same error in both threaded and MPI-only runs: e.g. SMS_D_PMx1.f09_g16.I1850CLM45CN.cori-knl_gnu

11:
11: Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
11:
11: Backtrace for this error:
11: #0  0x28b3d8f in ???
11:     at /home/abuild/rpmbuild/BUILD/glibc-2.26/nptl/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
11: #1  0x229d59b in __shr_infnan_mod_MOD_shr_infnan_isnan_double
11:     at /global/cscratch1/sd/azamat/acme_scratch/cori-knl/SMS_D_PMx1.f09_g16.I1850CLM45CN.cori-knl_gnu.20191108_101128_9uh7ns/bld/gnu/mpt/debug/nothreads/mct/mct/noesmf/c1a1l1i1o1r1g1w1i1e1/csm_share/shr_infnan_mod.F90:227
11: #2  0x1584ae2 in __firemod_MOD_firearea
11:     at /global/u2/a/azamat/cori/repos/E3SM/components/clm/src/biogeochem/FireMod.F90:375
11: #3  0x152a9a4 in __ecosystemdynmod_MOD_ecosystemdynnoleaching2
11:     at /global/u2/a/azamat/cori/repos/E3SM/components/clm/src/biogeochem/EcosystemDynMod.F90:837
11: #4  0x64f7b6 in __clm_driver_MOD_clm_drv
11:     at /global/u2/a/azamat/cori/repos/E3SM/components/clm/src/main/clm_driver.F90:939
11: #5  0x6226e5 in __lnd_comp_mct_MOD_lnd_run_mct
11:     at /global/u2/a/azamat/cori/repos/E3SM/components/clm/src/cpl/lnd_comp_mct.F90:509
11: #6  0x4312d6 in __component_mod_MOD_component_run
11:     at /global/u2/a/azamat/cori/repos/E3SM/cime/src/drivers/mct/main/component_mod.F90:714
11: #7  0x40f601 in __cime_comp_mod_MOD_cime_run
11:     at /global/u2/a/azamat/cori/repos/E3SM/cime/src/drivers/mct/main/cime_comp_mod.F90:2612
11: #8  0x42e734 in cime_driver
11:     at /global/u2/a/azamat/cori/repos/E3SM/cime/src/drivers/mct/main/cime_driver.F90:133
11: #9  0x42e79e in main
11:     at /global/u2/a/azamat/cori/repos/E3SM/cime/src/drivers/mct/main/cime_driver.F90:23

@bbye
Contributor

bbye commented Nov 8, 2019

That looks oddly familiar - see #1832

@amametjanov
Member

Thanks for pointing to that issue and the one it points to, https://github.com/ESMCI/cime/issues/1974. So the current, latest GNU compilers (8.2.0 and 8.3.0) still have the issue of throwing SIGFPE on calls to the isnan intrinsic.
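
For background on why a NaN test can itself trap: IEEE 754 distinguishes quiet NaNs from signaling NaNs, and with invalid-operation trapping enabled, a floating-point operation on a signaling NaN raises the exception. A check that only inspects the bit pattern performs no floating-point arithmetic on the value. The sketch below is an illustration only, not how shr_infnan_mod is implemented; the helper names are hypothetical, and the quiet-bit position assumes the common convention (as on x86) that the top fraction bit marks a quiet NaN.

```python
import struct

EXP_MASK = 0x7FF0000000000000   # 11 exponent bits of a 64-bit double
FRAC_MASK = 0x000FFFFFFFFFFFFF  # 52 fraction bits
QUIET_BIT = 0x0008000000000000  # top fraction bit (assumed quiet-NaN marker)


def nan_kind(bits: int):
    """Classify a 64-bit IEEE 754 pattern as 'quiet', 'signaling', or None."""
    # A NaN has all exponent bits set and a nonzero fraction.
    if (bits & EXP_MASK) != EXP_MASK or (bits & FRAC_MASK) == 0:
        return None
    return "quiet" if bits & QUIET_BIT else "signaling"


def double_bits(x: float) -> int:
    """Return the raw 64-bit pattern of a Python float."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


if __name__ == "__main__":
    for pattern in (0x7FF8000000000000, 0x7FF0000000000001, 0x7FF0000000000000):
        print(hex(pattern), nan_kind(pattern))
```

Only integer masks are applied to the pattern, so the classification itself cannot raise a floating-point exception.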

My vote is to remove invalid from the -ffpe-trap list in DEBUG FFLAGS. From the CESM config_compilers.xml:

  <CFLAGS>
    <append DEBUG="TRUE"> -g -Wall -Og -fbacktrace -ffpe-trap=invalid,zero,overflow -fcheck=bounds </append>
  </CFLAGS>
  <FFLAGS>
    <!-- Ideally, we would also have 'invalid' in the ffpe-trap list. But at
         least with some versions of gfortran (confirmed with 5.4.0, 6.3.0 and
         7.1.0), gfortran's isnan (which is called in cime via the
         CPRGNU-specific shr_infnan_isnan) causes a floating point exception
         when called on a signaling NaN. -->
    <append DEBUG="TRUE"> -g -Wall -Og -fbacktrace -ffpe-trap=zero,overflow -fcheck=bounds </append>
  </FFLAGS>

Doing that allowed SMS_D_PMx1.f09_g16.I1850CLM45CN.cori-knl_gnu to complete without errors. Thoughts?
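
The proposed flag change can be sketched as a small string transformation. This is a minimal illustration, assuming the flags are available as a single whitespace-separated string; drop_fpe_trap is a hypothetical helper, not part of CIME, and the actual change would be an edit to config_compilers.xml.

```python
def drop_fpe_trap(flags: str, trap: str = "invalid") -> str:
    """Return flags with `trap` removed from any -ffpe-trap=<list> option."""
    out = []
    for tok in flags.split():
        if tok.startswith("-ffpe-trap="):
            traps = tok.split("=", 1)[1].split(",")
            kept = [t for t in traps if t and t != trap]
            if kept:
                out.append("-ffpe-trap=" + ",".join(kept))
            # If no traps remain, the option is dropped entirely.
        else:
            out.append(tok)
    return " ".join(out)


if __name__ == "__main__":
    debug_flags = "-g -Wall -Og -fbacktrace -ffpe-trap=invalid,zero,overflow -fcheck=bounds"
    print(drop_fpe_trap(debug_flags))
```

Applied to the DEBUG flag string that still includes invalid, this yields the same flags with -ffpe-trap=zero,overflow.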

There's still an issue with DEBUG runs multi-threaded: SMS_D.f09_g16.I1850CLM45CN.cori-knl_gnu

32: *** Error in `/global/cscratch1/sd/azamat/acme_scratch/cori-knl/SMS_D.f09_g16.I1850CLM45CN.cori-knl_gnu.20191108_101128_9uh7ns/bld/e3sm.exe': free(): invalid size: 0x00002aaaf8000be0 ***
32:
32: Program received signal SIGABRT: Process abort signal.
32:
32: Backtrace for this error:
32: #0  0x28c048f in ???
32:     at /home/abuild/rpmbuild/BUILD/glibc-2.26/nptl/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
32: #1  0x43afe80 in raise
32:     at ../sysdeps/unix/sysv/linux/raise.c:51
32: #2  0x4471330 in abort
32:     at /home/abuild/rpmbuild/BUILD/glibc-2.26/stdlib/abort.c:79
32: #3  0x4498376 in __libc_message
32:     at ../sysdeps/posix/libc_fatal.c:181
32: #4  0x449e7e2 in malloc_printerr
32:     at /home/abuild/rpmbuild/BUILD/glibc-2.26/malloc/malloc.c:5428
32: #5  0x44a0090 in _int_free
32:     at /home/abuild/rpmbuild/BUILD/glibc-2.26/malloc/malloc.c:4170
32: #6  0x22a7c1e in __shr_log_mod_MOD_shr_log_errmsg
32:     at /global/u2/a/azamat/cori/repos/E3SM/cime/src/share/util/shr_log_mod.F90:78
32: #7  0xf6b59b in __snowsnicarmod_MOD_snicar_rt
32:     at /global/u2/a/azamat/cori/repos/E3SM/components/clm/src/biogeophys/SnowSnicarMod.F90:377
32: #8  0x115f11b in __surfacealbedomod_MOD_surfacealbedo
32:     at /global/u2/a/azamat/cori/repos/E3SM/components/clm/src/biogeophys/SurfaceAlbedoMod.F90:782
32: #9  0x6500be in __clm_driver_MOD_clm_drv._omp_fn.4
32:     at /global/u2/a/azamat/cori/repos/E3SM/components/clm/src/main/clm_driver.F90:1296
32: #10  0x43b299d in gomp_thread_start
32:     at ../../../cray-gcc-8.2.0-201811010913.df0113f60eb17/libgomp/team.c:120
32: #11  0x28bb7e8 in start_thread
32:     at /home/abuild/rpmbuild/BUILD/glibc-2.26/nptl/pthread_create.c:465
32: #12  0x44e4a6e in ???
32:     at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
32: #13  0xffffffffffffffff in ???
srun: error: nid02519: task 32: Aborted

But at least there's a way to run DEBUG in MPI-only mode.

@ndkeen
Contributor Author

ndkeen commented Nov 8, 2019

Huh, I had forgotten about #1832. And you're right that it looks similar.

I would be OK with adjusting the compiler flags to stop on fewer conditions, but do we know there really is no issue?

@ndkeen ndkeen changed the title from "GNU broken on cori-knl and cori-haswell" to "Many tests with LND fail with GNU on cori-knl and cori-haswell" Nov 8, 2019
@ndkeen
Contributor Author

ndkeen commented Dec 13, 2019

Update:
The Cori software upgrade just before these tests began failing did update the version of SLES. I'm being told that this included a newer version of glibc (2.26 vs 2.22). On theta, where these GNU tests still work, the version of SLES is the same as it was before the Cori upgrade.

cori03% cat /etc/os-release 
NAME="SLES"
VERSION="15"
VERSION_ID="15"
PRETTY_NAME="SUSE Linux Enterprise Server 15"
ID="sles"
ID_LIKE="suse"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sles:15"

cori08% ls -l /lib64/libc-*.so
-rwxr-xr-x 1 root root 2034840 Jul  9 05:18 /lib64/libc-2.26.so*


thetalogin4% cat /etc/os-release 
NAME="SLES"
VERSION="12-SP3"
VERSION_ID="12.3"
PRETTY_NAME="SUSE Linux Enterprise Server 12 SP3"
ID="sles"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sles:12:sp3"

thetalogin4% ls -l /lib64/libc-*.so
-rwxr-xr-x 1 root root 1933128 Jul  8 13:54 /lib64/libc-2.22.so*
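
The comparison above can be automated when checking several machines. The following is a rough sketch, assuming the os-release text and the libc file name have already been collected (for example by the commands shown above); parse_os_release and glibc_version_from_path are hypothetical helper names.

```python
import re


def parse_os_release(text: str) -> dict:
    """Parse KEY="value" lines in the style of /etc/os-release into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        info[key] = value.strip().strip('"')
    return info


def glibc_version_from_path(path: str):
    """Extract (major, minor) from a name like /lib64/libc-2.26.so, else None."""
    m = re.search(r"libc-(\d+)\.(\d+)\.so", path)
    return (int(m.group(1)), int(m.group(2))) if m else None


if __name__ == "__main__":
    cori = parse_os_release('NAME="SLES"\nVERSION="15"\nVERSION_ID="15"')
    theta = parse_os_release('NAME="SLES"\nVERSION="12-SP3"\nVERSION_ID="12.3"')
    print(cori["VERSION_ID"], glibc_version_from_path("/lib64/libc-2.26.so*"))
    print(theta["VERSION_ID"], glibc_version_from_path("/lib64/libc-2.22.so*"))
```

Returning the version as a tuple of integers lets two machines be compared directly, which avoids the pitfalls of comparing version strings.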

I've also found that removing the -Og flag from our DEBUG set allows several of the GNU DEBUG runs to complete. I see that on theta, -O0 is used, but I found that it is not needed.
Of course, this doesn't help the non-DEBUG runs that are still failing with illegal instructions.

@ndkeen
Contributor Author

ndkeen commented Dec 20, 2019

I've been running a few more experiments and I've found that
a) Almost all GNU DEBUG tests I've tried pass when I change "-Og -ffpe-trap=invalid,zero,overflow" to "-ffpe-trap=zero,overflow " (and apply PR #3324)
b) All GNU non-DEBUG tests I've tried pass when I change "-O" to "-O2"

I would like to make this change so that GNU tests work now, and then later try to get the invalid trap working again. My guess about -O2 is that it is more widely tested in general. The -O flag (equivalent to -O1) enables a smaller set of optimizations than -O2. FWIW, in previous projects I've worked on, we have always used -O2.

@ndkeen
Contributor Author

ndkeen commented Jan 9, 2020

I was going to make two PRs: one to remove -Og from DEBUG GNU builds and another to bump optimization for GNU from -O1 to -O2. I wanted to do this only for Cori, as it's the only machine right now with this issue. However, there are two things I'm struggling with:

  1. Because of the way CIME interprets config_compilers.xml, we can basically only make machine-specific edits by appending compiler flags. Not sure I can find a way to negate the -Og flag with another flag, but it may be that it's actually best to remove -Og for DEBUG GNU builds on all machines. It's not an important flag and perhaps is a simplification. This flag allows for optimizations that do not interfere with debugging.

I think I can effectively use -O2 for cori only (as we can only append flags, the compiler should interpret -O1 -O2 as -O2 -- which is messy but could work for now).

These changes allow the GNU tests (that I have run so far on cori) to pass, but certainly the optimization level change is likely not BFB compared to previous runs.

  2. This may not be addressing the root problem. We still don't know why these runs are hitting illegal instructions in the first place or why the compiler flag changes seem to work. As noted above, there is the version change of libc on this machine (from 2.22 to 2.26). Perhaps we should find a way of testing GNU builds with the same libc version on another machine? Will future machine updates on other machines also hit this same issue? Or is it possible it's something odd with the install on Cori? It certainly doesn't seem like a code issue, but I also can't rule that out.
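
The append-only workaround described above relies on GCC honoring only the last -O option on the command line, and on a bare -O being equivalent to -O1. A minimal sketch of that rule follows, useful for checking what a concatenated flag string resolves to; effective_opt_level is a hypothetical helper and does not call the compiler.

```python
from typing import Optional


def effective_opt_level(flags: str) -> Optional[str]:
    """Return the level of the last -O option in flags ('0', '1', '2', 'g', ...).

    Returns None when no -O option is present.
    """
    last = None
    for tok in flags.split():
        if tok.startswith("-O"):
            # A bare -O is treated as -O1.
            last = tok[2:] or "1"
    return last


if __name__ == "__main__":
    print(effective_opt_level("-O -fconvert=big-endian -O2"))
    print(effective_opt_level("-g -Wall -Og -fbacktrace -O0"))
```

Under this rule, appending -O2 after a base -O gives -O2, and appending -O0 after -Og gives -O0, so a machine-specific append can override the base optimization level without editing the shared flags.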

@ndkeen
Contributor Author

ndkeen commented May 8, 2020

Cori now has a new version of GNU compilers (gcc/9.2.0). However, with initial tests, I think it's still behaving the same way. I am motivated to make a PR that removes the -Og flag from all GNU builds so that we at least have DEBUG builds working on cori.

@ndkeen
Contributor Author

ndkeen commented Jul 7, 2020

With #3629, we can now use GNU with DEBUG, but the issue still exists without DEBUG. It's still true that changing the Fortran flag for GNU from -O to -O2 will allow tests to run. This adds optimizations and will not be BFB (when comparing -O to -O2), but may be an acceptable way to run for users.
