For cori-knl, update PE layout for F compsets at ne4 resolution. #3920

Merged Nov 12, 2020 (1 commit)

ndkeen
Contributor

@ndkeen ndkeen commented Oct 29, 2020

This change makes the default PE layout for ne4 F compsets more efficient by using only 3 nodes.
It keeps the fast-as-possible Large (L) layout and adds a single-node Small (S) layout (for ensemble testing?).

Nodes    SYPD    Current       With this change
1        18.7                  S
3        32.6                  M (default)
13       52.1    M (default)   L
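
The efficiency argument behind the table above can be made explicit with a little arithmetic. The following is a minimal sketch, not part of the PR: it takes the three SYPD values from the table and computes throughput per node and parallel efficiency relative to the single-node run. The function and variable names are hypothetical.

```python
# SYPD figures copied from the table above: {node count: simulated years per day}.
sypd_by_nodes = {1: 18.7, 3: 32.6, 13: 52.1}

base_nodes = 1
base_sypd = sypd_by_nodes[base_nodes]


def scaling_rows(table):
    """Return (nodes, sypd, sypd_per_node, efficiency_vs_one_node) tuples."""
    rows = []
    for nodes, sypd in sorted(table.items()):
        per_node = sypd / nodes
        speedup = sypd / base_sypd
        # Ideal scaling would give speedup == nodes / base_nodes.
        efficiency = speedup / (nodes / base_nodes)
        rows.append((nodes, sypd, per_node, efficiency))
    return rows


for nodes, sypd, per_node, eff in scaling_rows(sypd_by_nodes):
    print(f"{nodes:2d} node(s): {sypd:5.1f} SYPD, "
          f"{per_node:5.2f} SYPD/node, {eff:6.1%} efficiency")
```

By this measure the 3-node layout retains roughly 58% of single-node efficiency, while the 13-node layout retains roughly 21%, which is the trade-off the new default is meant to address.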

Fixes #3939

[bfb]

ndkeen added a commit that referenced this pull request Nov 4, 2020
@ndkeen
Contributor Author

ndkeen commented Nov 4, 2020

Merged to next

@singhbalwinder
Contributor

@ndkeen : I recently tried GNU on Cori KNL. The model blew up with the following error for ./create_test SMS_D.ne4_ne4.FC5AV1C-L --compiler gnu -p e3sm:

*** Error in `/global/cscratch1/sd/bsingh/e3sm_scratch/cori-knl/SMS_D.ne4_ne4.FC5AV1C-L.cori-knl_gnu.mem_corrupt_bug_mstr_c0155f2975778a/bld/e3sm.exe': corrupted size vs. prev_size: 0x00002aab28058aa0 ***

My naive git bisect led to this PR. I reverted this change in my copy of master and the code worked fine, so I suspect this PR is causing the failure. Here is some info on the PE layout this test picked up:

Pes setting: grid match    is a%ne4np4_
Pes setting: machine match is cori-knl
Pes setting: grid          is a%ne4np4_l%ne4np4_oi%ne4np4_r%null_g%null_w%null_z%null_m%oQU240
Pes setting: compset       is 2000_EAM%AV1C-L_ELM%SPBC_CICE%PRES_DOCN%DOM_SROF_SGLC_SWAV_SIAC_SESP
Pes setting: tasks       is {'NTASKS_ATM': 96, 'NTASKS_ICE': 96, 'NTASKS_CPL': 96, 'NTASKS_LND': 96, 'NTASKS_WAV': 96, 'NTASKS_ROF': 96, 'NTASKS_OCN': 96, 'NTASKS_GLC': 96}
Pes setting: threads     is {'NTHRDS_ICE': 4, 'NTHRDS_ATM': 4, 'NTHRDS_ROF': 4, 'NTHRDS_LND': 4, 'NTHRDS_WAV': 4, 'NTHRDS_OCN': 4, 'NTHRDS_CPL': 4, 'NTHRDS_GLC': 4}
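
The node count implied by this layout can be checked with a little arithmetic. The sketch below parses the two `Pes setting` lines with Python's `ast.literal_eval` and divides tasks × threads by a per-node slot count. The value `SLOTS_PER_NODE = 128` and the assumption that all components share the same MPI ranks are not stated in this thread; treat both as hypothetical inputs.

```python
import ast
import math

# Assumption (not stated in this thread): a cori-knl node offers 128
# task*thread slots to the model. Adjust to the machine configuration in use.
SLOTS_PER_NODE = 128


def parse_pes_dict(line):
    """Extract the dict literal from a 'Pes setting: ... is {...}' log line."""
    return ast.literal_eval(line[line.index("{"):])


def nodes_needed(tasks, threads, slots_per_node=SLOTS_PER_NODE):
    """Estimate nodes for a layout whose components all share the same ranks."""
    # Assumption: components overlap on the same MPI ranks (root PEs are not
    # shown in the log), so the widest component sets the job size.
    ntasks = max(tasks.values())
    nthrds = max(threads.values())
    return math.ceil(ntasks * nthrds / slots_per_node)


tasks = parse_pes_dict(
    "Pes setting: tasks       is {'NTASKS_ATM': 96, 'NTASKS_ICE': 96, "
    "'NTASKS_CPL': 96, 'NTASKS_LND': 96, 'NTASKS_WAV': 96, 'NTASKS_ROF': 96, "
    "'NTASKS_OCN': 96, 'NTASKS_GLC': 96}"
)
threads = parse_pes_dict(
    "Pes setting: threads     is {'NTHRDS_ICE': 4, 'NTHRDS_ATM': 4, "
    "'NTHRDS_ROF': 4, 'NTHRDS_LND': 4, 'NTHRDS_WAV': 4, 'NTHRDS_OCN': 4, "
    "'NTHRDS_CPL': 4, 'NTHRDS_GLC': 4}"
)

print(nodes_needed(tasks, threads))
```

Under those assumptions the layout needs 96 × 4 / 128 = 3 nodes, which agrees with the 3-node default described in the PR.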

Let me know if I missed anything or if there is some flag I need to add to make it work with GNU.

@ndkeen
Contributor Author

ndkeen commented Dec 10, 2020

I don't think the change of PE layouts is the root cause. The error message you posted is incomplete, but I suspect it's the same failure we saw in #3270:

13:
13: Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

That failure appeared after Cori did some software upgrades. The "fix" I ended up with was simply to alter the compiler flags, which allowed the tests to pass, but I knew it wasn't the right fix.

I think that before this PR, the ne4 cases were using 1 MPI rank per column (13 nodes) and no threading. You can still get that very same layout by requesting the "L" layout:

./create_test SMS_PL_D.ne4_ne4.FC5AV1C-L --compiler gnu
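
To make the role of the `PL` modifier in that test name concrete, here is a small sketch that splits a test name into its parts. It assumes the usual CIME naming pattern `TESTTYPE_MODIFIERS.GRID.COMPSET` and that a modifier beginning with `P` selects the PE layout; the function name is hypothetical, and what a given layout letter maps to is defined by the machine's PE configuration, not by this code.

```python
def parse_test_name(name):
    """Split 'TESTTYPE_MOD1_MOD2.GRID.COMPSET' into its parts."""
    head, grid, compset = name.split(".")[:3]
    test_type, *modifiers = head.split("_")
    # Assumption: a modifier starting with 'P' (e.g. PS, PM, PL) picks the PE layout.
    pe_layout = next(
        (m[1:] for m in modifiers if m.startswith("P") and len(m) > 1), None
    )
    return {
        "test_type": test_type,
        "modifiers": modifiers,
        "pe_layout": pe_layout,
        "grid": grid,
        "compset": compset,
    }


print(parse_test_name("SMS_PL_D.ne4_ne4.FC5AV1C-L"))
print(parse_test_name("SMS_D.ne4_ne4.FC5AV1C-L"))
```

The first name carries an explicit layout request (`L`), while the second has none and so picks up the default layout.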

In my recent GNU testing, I've also run into the same issue, which shows that changing the compiler flags isn't a cure-all. I can try debugging further, but I don't think the problem is in our code.

Here are some examples of runs (with master of Nov 19th) that failed in this way. But note many other tests work.

SMS_D_Ln5.ne16_ne16.FC5AV1C-L.cori-knl_gnu.r01gnu/run/e3sm.log.36421669.201119-182756:371: *** Error in `/global/cscratch1/sd/ndk/e3sm_scratch/cori-knl/m49-nov19/SMS_D_Ln5.ne16_ne16.FC5AV1C-L.cori-knl_gnu.r01gnu/bld/e3sm.exe': munmap_chunk(): invalid pointer: 0x000000001e018670 ***
SMS_D_Ln5.ne30_ne30.FC5AV1C-L.cori-knl_gnu.r01gnu/run/e3sm.log.36421683.201119-183012:1105: *** Error in `/global/cscratch1/sd/ndk/e3sm_scratch/cori-knl/m49-nov19/SMS_D_Ln5.ne30_ne30.FC5AV1C-L.cori-knl_gnu.r01gnu/bld/e3sm.exe': munmap_chunk(): invalid pointer: 0x000000001ef232b0 ***
SMS_D_Ln5.ne4_ne4.FC5AV1C-L.cori-knl_gnu.r01gnu/run/e3sm.log.36420891.201119-175420: 9: *** Error in `/global/cscratch1/sd/ndk/e3sm_scratch/cori-knl/m49-nov19/SMS_D_Ln5.ne4_ne4.FC5AV1C-L.cori-knl_gnu.r01gnu/bld/e3sm.exe': free(): invalid size: 0x000000001ca32600 ***
SMS_PS_Ld5.ne4_ne4.FC5AV1C-L.cori-knl_gnu.r01gnu/run/e3sm.log.36424423.201119-192408:13: Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
SMS_PT.ne30_oECv3_ICG.A_WCYCL1850S_CMIP6.cori-knl_gnu.r01gnu/run/e3sm.log.36425595.201120-020131: 612: Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
SMS_PT.ne30_oECv3_ICG.A_WCYCL1950S_CMIP6_LRtunedHR.cori-knl_gnu.eam-cosplite.r01gnu/run/e3sm.log.36425605.201120-020746: 605: Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
SMS_PT.ne30_oECv3_ICG.A_WCYCL1950S_CMIP6_LRtunedHR.cori-knl_gnu.r01gnu/run/e3sm.log.36425600.201120-020746:   4: Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
SMS_PT.ne30_oECv3_ICG.A_WCYCL2000S.cori-knl_gnu.r01gnu/run/e3sm.log.36425617.201120-021149: 545: Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
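
Since the failures above show up under several different messages, it can help to group them by signature. The following is a rough sketch, assuming grep-style lines like the ones listed; the regular expressions and function names are hypothetical and only cover the two message shapes seen in this thread (glibc heap errors and `Program received signal ...`).

```python
import re
from collections import Counter

# glibc heap-corruption reports: *** Error in `<exe>': <message>: 0x<addr> ***
GLIBC_RE = re.compile(r"\*\*\* Error in `[^']*': (.*?): 0x[0-9a-fA-F]+ \*\*\*")
# gfortran runtime reports: Program received signal <NAME>: ...
SIGNAL_RE = re.compile(r"Program received signal (\w+)")


def failure_signature(line):
    """Return a short label for the failure reported on one log line."""
    m = GLIBC_RE.search(line)
    if m:
        return m.group(1)
    m = SIGNAL_RE.search(line)
    if m:
        return m.group(1)
    return "unknown"


def tally(lines):
    """Count failure signatures over non-empty lines."""
    return Counter(failure_signature(l) for l in lines if l.strip())


# Two of the lines listed above, copied verbatim.
sample = [
    "SMS_D_Ln5.ne4_ne4.FC5AV1C-L.cori-knl_gnu.r01gnu/run/e3sm.log.36420891.201119-175420: 9: *** Error in `/global/cscratch1/sd/ndk/e3sm_scratch/cori-knl/m49-nov19/SMS_D_Ln5.ne4_ne4.FC5AV1C-L.cori-knl_gnu.r01gnu/bld/e3sm.exe': free(): invalid size: 0x000000001ca32600 ***",
    "SMS_PS_Ld5.ne4_ne4.FC5AV1C-L.cori-knl_gnu.r01gnu/run/e3sm.log.36424423.201119-192408:13: Program received signal SIGSEGV: Segmentation fault - invalid memory reference.",
]

for signature, count in tally(sample).items():
    print(f"{count:3d}  {signature}")
```

Heap-checker messages such as `free(): invalid size` and plain `SIGSEGV` crashes are both typical symptoms of memory corruption, so a tally like this mainly shows how the same underlying problem can surface differently from run to run.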

So I think we just don't yet know why GNU is not working in all cases on Cori. Note that I've also tried other versions of GNU, but have not yet been able to build or run with the most recent GNU 10 (there is an issue about that as well).

@singhbalwinder
Contributor

Thanks for this info. I didn't do any extensive testing except for the git bisect. It might just be a fluke that it worked for me twice (master and my branch) after I reverted this commit. GNU is working fine on other machines so it might just be a Cori-KNL issue (like you mentioned).

The error message and the backtrace keep changing, pointing to different files each time we hit this error, but I am pasting one complete error message here for future reference:

0: *** Error in `/global/cscratch1/sd/bsingh/e3sm_scratch/cori-knl/SMS_D.ne4_ne4.FC5AV1C-L.cori-knl_gnu.mem_corrupt_bug_mstr_c0155f2975778a/bld/e3sm.exe': corrupted size vs. prev_size: 0x00002aab240876e0 ***
10: 
10: Program received signal SIGABRT: Process abort signal.
10: 
10: Backtrace for this error:
10: #0  0x4473e3f in ???
10:     at /home/abuild/rpmbuild/BUILD/glibc-2.26/nptl/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
10: #1  0x7ab7990 in raise
10:     at ../sysdeps/unix/sysv/linux/raise.c:51
10: #2  0x7b60960 in abort
10:     at /home/abuild/rpmbuild/BUILD/glibc-2.26/stdlib/abort.c:79
10: #3  0x7b7f946 in __libc_message
10:     at ../sysdeps/posix/libc_fatal.c:181
10: #4  0x7b85db2 in malloc_printerr
10:     at /home/abuild/rpmbuild/BUILD/glibc-2.26/malloc/malloc.c:5428
10: #5  0x7b86261 in malloc_consolidate
10:     at /home/abuild/rpmbuild/BUILD/glibc-2.26/malloc/malloc.c:4501
10: #6  0x7b8917c in _int_malloc
10:     at /home/abuild/rpmbuild/BUILD/glibc-2.26/malloc/malloc.c:3701
10: #7  0x7b8ab41 in __libc_malloc
10:     at /home/abuild/rpmbuild/BUILD/glibc-2.26/malloc/malloc.c:3081
10: #8  0x44996c7 in data_transfer_init
10:     at ../../../cray-gcc-8.3.0-201903122028.16ea96cb84a9a/libgfortran/io/transfer.c:2842
10: #9  0x3f8b4ca in __shr_strconvert_mod_MOD_i4tostring
10:     at /global/u2/b/bsingh/delete/E3SM/cime/src/share/util/shr_strconvert_mod.F90:74
10: #10  0x3ee2082 in __shr_log_mod_MOD_shr_log_errmsg
10:     at /global/u2/b/bsingh/delete/E3SM/cime/src/share/util/shr_log_mod.F90:78
10: #11  0x2c4023e in calc_beta_leepielke1992
10:     at /global/u2/b/bsingh/delete/E3SM/components/elm/src/biogeophys/SurfaceResistanceMod.F90:123
10: #12  0x2c42f9c in __surfaceresistancemod_MOD_calc_soilevap_stress
10:     at /global/u2/b/bsingh/delete/E3SM/components/elm/src/biogeophys/SurfaceResistanceMod.F90:80
10: #13  0x28b795f in __canopytemperaturemod_MOD_canopytemperature
10:     at /global/u2/b/bsingh/delete/E3SM/components/elm/src/biogeophys/CanopyTemperatureMod.F90:229
10: #14  0x21f2714 in __elm_driver_MOD_elm_drv._omp_fn.4
10:     at /global/u2/b/bsingh/delete/E3SM/components/elm/src/main/elm_driver.F90:1296
10: #15  0x7ab85ed in gomp_thread_start
10:     at ../../../cray-gcc-8.3.0-201903122028.16ea96cb84a9a/libgomp/team.c:120
10: #16  0x446eec8 in start_thread
10:     at /home/abuild/rpmbuild/BUILD/glibc-2.26/nptl/pthread_create.c:465
10: #17  0x7bcb74e in ???
10:     at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
10: #18  0xffffffffffffffff in ???

@rljacob deleted the ndk/machinefiles/cori-knl-ne4-layout branch January 11, 2021 03:50
Successfully merging this pull request may close these issues.

Reduce number of nodes used by test MVK_PL.ne4_ne4.FC5AV1C-L.cori-knl
3 participants