
History Crash in UFS w/ newer Intel compilers #2213

Closed
ulmononian opened this issue Jun 27, 2023 · 35 comments

@ulmononian

ulmononian commented Jun 27, 2023

when testing the ufs weather model in coupled mode (w/ waves and aerosols; i.e. S2SWA) on msu's hercules and gfdl's gaea c5, the model compiles successfully but fails at runtime in what seems to be mapl-related code (during the gocart run step). note that the model runs successfully if aerosols are turned off (and mapl, in turn, is not used).

the tests on both machines use newer intel compilers (2022.2.1). the mpi libraries are intel-oneapi-mpi/2021.7.1 and cray-mpich/8.1.25, respectively. we are using mapl/2.35.2 and esmf/8.4.2 (can provide the full library stack if useful). the aerosol model (gocart) hashes we've tested w/ are c485cbc and b94145f; the results are the same with each.

the model fails at the same place on each machine. for example, on hercules the err file (when using esmf + mapl debug versions) shows:

118: [hercules-08-44:97124:0:97124] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
139: ==== backtrace (tid:  97145) ====
122: ==== backtrace (tid:  97128) ====
122:  0 0x0000000000054d90 __GI___sigaction()  :0
122:  1 0x00000000070507b8 do_alloc_assign()  for_alloc_copy.c:0
122:  2 0x000000000320c59a mapl_historygridcompmod_mp_initialize_()  /work/noaa/epic-ps/role-epic-ps/spack-stack/spack-stack-1.4.0-hercules/cache/build_stage/spack-stage-mapl-2.35.2-fwxcinvghuqqqbos47hu33njf677dc4q/spack-src/gridcomps/History/MAPL_HistoryGridComp.F90:2336
122:  3 0x00000000017c00f4 ESMCI::FTable::callVFuncPtr()  /work/noaa/epic-ps/role-epic-ps/spack-stack/spack-stack-1.4.0-hercules/cache/build_stage/spack-stage-esmf-8.4.2-rhqmf6ses26aijoekktbtisfq7er5krw/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:2167
122:  4 0x00000000017c40da ESMCI_FTableCallEntryPointVMHop()  /work/noaa/epic-ps/role-epic-ps/spack-stack/spack-stack-1.4.0-hercules/cache/build_stage/spack-stage-esmf-8.4.2-rhqmf6ses26aijoekktbtisfq7er5krw/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:824
122:  5 0x000000000195448f ESMCI::VMK::enter()  /work/noaa/epic-ps/role-epic-ps/spack-stack/spack-stack-1.4.0-hercules/cache/build_stage/spack-stage-esmf-8.4.2-rhqmf6ses26aijoekktbtisfq7er5krw/spack-src/src/Infrastructure/VM/src/ESMCI_VMKernel.C:2320
122:  6 0x0000000001940ad2 ESMCI::VM::enter()  /work/noaa/epic-ps/role-epic-ps/spack-stack/spack-stack-1.4.0-hercules/cache/build_stage/spack-stage-esmf-8.4.2-rhqmf6ses26aijoekktbtisfq7er5krw/spack-src/src/Infrastructure/VM/src/ESMCI_VM.C:1216
122:  7 0x00000000017c1547 c_esmc_ftablecallentrypointvm_()  /work/noaa/epic-ps/role-epic-ps/spack-stack/spack-stack-1.4.0-hercules/cache/build_stage/spack-stage-esmf-8.4.2-rhqmf6ses26aijoekktbtisfq7er5krw/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:981
122:  8 0x00000000007b6e0d esmf_compmod_mp_esmf_compexecute_()  /work/noaa/epic-ps/role-epic-ps/spack-stack/spack-stack-1.4.0-hercules/cache/build_stage/spack-stage-esmf-8.4.2-rhqmf6ses26aijoekktbtisfq7er5krw/spack-src/src/Superstructure/Component/src/ESMF_Comp.F90:1223
122:  9 0x0000000001003499 esmf_gridcompmod_mp_esmf_gridcompinitialize_()  /work/noaa/epic-ps/role-epic-ps/spack-stack/spack-stack-1.4.0-hercules/cache/build_stage/spack-stage-esmf-8.4.2-rhqmf6ses26aijoekktbtisfq7er5krw/spack-src/src/Superstructure/Component/src/ESMF_GridComp.F90:1412
122: 10 0x000000000333be70 mapl_genericmod_mp_mapl_genericwrapper_()  /work/noaa/epic-ps/role-epic-ps/spack-stack/spack-stack-1.4.0-hercules/cache/build_stage/spack-stage-mapl-2.35.2-fwxcinvghuqqqbos47hu33njf677dc4q/spack-src/generic/MAPL_Generic.F90:1813
122: 11 0x00000000017c00f4 ESMCI::FTable::callVFuncPtr()  /work/noaa/epic-ps/role-epic-ps/spack-stack/spack-stack-1.4.0-hercules/cache/build_stage/spack-stage-esmf-8.4.2-rhqmf6ses26aijoekktbtisfq7er5krw/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:2167
122: 12 0x00000000017c40da ESMCI_FTableCallEntryPointVMHop()  /work/noaa/epic-ps/role-epic-ps/spack-stack/spack-stack-1.4.0-hercules/cache/build_stage/spack-stage-esmf-8.4.2-rhqmf6ses26aijoekktbtisfq7er5krw/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:824
122: 13 0x000000000195448f ESMCI::VMK::enter()  /work/noaa/epic-ps/role-epic-ps/spack-stack/spack-stack-1.4.0-hercules/cache/build_stage/spack-stage-esmf-8.4.2-rhqmf6ses26aijoekktbtisfq7er5krw/spack-src/src/Infrastructure/VM/src/ESMCI_VMKernel.C:2320
122: 14 0x0000000001940ad2 ESMCI::VM::enter()  /work/noaa/epic-ps/role-epic-ps/spack-stack/spack-stack-1.4.0-hercules/cache/build_stage/spack-stage-esmf-8.4.2-rhqmf6ses26aijoekktbtisfq7er5krw/spack-src/src/Infrastructure/VM/src/ESMCI_VM.C:1216
122: 15 0x00000000017c1547 c_esmc_ftablecallentrypointvm_()  /work/noaa/epic-ps/role-epic-ps/spack-stack/spack-stack-1.4.0-hercules/cache/build_stage/spack-stage-esmf-8.4.2-rhqmf6ses26aijoekktbtisfq7er5krw/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:981
122: 16 0x00000000007b6e0d esmf_compmod_mp_esmf_compexecute_()  /work/noaa/epic-ps/role-epic-ps/spack-stack/spack-stack-1.4.0-hercules/cache/build_stage/spack-stage-esmf-8.4.2-rhqmf6ses26aijoekktbtisfq7er5krw/spack-src/src/Superstructure/Component/src/ESMF_Comp.F90:1223

and the out file shows:

180:  WW3 log written to /work2/noaa/epic-ps/cbook/stmp/cbook/FV3_RT/rt_221844/cpld_c
180:  ontrol_p8_intel/./log.ww3
  0:  Starting pFIO input server on Clients
  0:  Starting pFIO output server on Clients
  0:  Character Resource Parameter: ROOT_CF:AERO.rc
  0:  Character Resource Parameter: ROOT_NAME:AERO
  0:  Character Resource Parameter: HIST_CF:AERO_HISTORY.rc
  0:  Character Resource Parameter: EXTDATA_CF:AERO_ExtData.rc
  0:  DU::SetServices: Dust emission scheme is fengsha
  0:  WARNING: falling back on MAPL NUM_BANDS
  0:  GOCART2G::Initialize: Starting...
  0:   
  0:  Integer*4 Resource Parameter: RUN_DT:720
  0:  ===================>
  0:  MAPL_StateCreateFromSpecNew: var SU_NO3 already exists. Skipping ...
  0:  ===================>
  0:  MAPL_StateCreateFromSpecNew: var SU_OH already exists. Skipping ...
  0:  ===================>
  0:  MAPL_StateCreateFromSpecNew: var SU_H2O2 already exists. Skipping ...
  0:   oserver is not split
  0:  
  0:  EXPSRC:GEOSgcm-v10.16.0
  0:  EXPID: gocart
  0:  Descr: GOCART2g_diagnostics_at_c360
  0:  DisableSubVmChecks: F
  0:  
  0:  Reading HISTORY RC Files:
  0:  -------------------------
  0:  NOT using buffer I/O for file: AERO_HISTORY.rc
  0:  NOT using buffer I/O for file: inst_aod.rcx
  0:  
  0:  Freq: 00060000  Dur: 00010000  TM:   -1  Collection: inst_aod

@mathomp4 pointed out that the model seems to be dying in History, though he and @bena-nasa did not notice anything particularly wrong with the history file being used (ufs-community/ufs-weather-model#1791 (comment)). @bena-nasa suggested it may be a compiler or memory bug, which was followed by comments from @mathomp4 about potential issues with mapl & intel compilers newer than 2021.7.x (again, we are using 2022.2.1).

i was just wondering if there is any further information regarding mapl's compatibility with newer intel compilers? we don't have access to sys-admin-installed intel compilers older than 2022.2.1 on either hercules or gaea c5 at this time, so we are hoping to find a solution using the available compiler version.

some additional details can be found in the ufs weather model issue #1791.

thank you!!!

@tclune
Collaborator

tclune commented Jun 27, 2023

Hmm. The code is really doing something pretty boring at the line that is failing - basically copying a string. My one thought is that perhaps the string argument list(n)%positive is itself not allocated. That could give that sort of error. Unfortunately, I don't know the layer well enough to immediately speculate where that argument should have been previously established.

If you have the ability to recompile and try again, I would suggest adding the following line just before line 2336 in MAPL_HistoryGridComp.F90:

_ASSERT(allocated(list(n)%positive), "unallocated string: 'list(n)%positive'")

But hopefully @bena-nasa is back from vacation today and can provide a more complete diagnosis before we trudge down the debugging path ...
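
For context, here is a minimal sketch of the failure mode that assertion would trap, using a stand-in type rather than the actual MAPL history collection (everything in this sketch besides the component name is hypothetical):

! Stand-in type only -- not the actual MAPL code. Reading a
! deferred-length allocatable string that was never allocated is
! undefined behavior and can segfault inside the Intel runtime,
! consistent with the do_alloc_assign() frame in the traceback above.
program unallocated_string_demo
   implicit none
   type :: collection_t
      character(len=:), allocatable :: positive
   end type collection_t
   type(collection_t) :: list(1)
   character(len=:), allocatable :: copy
   integer :: n
   n = 1
   print *, 'allocated? ', allocated(list(n)%positive)   ! prints F
   ! The suggested _ASSERT would trap here; without it, this assignment
   ! reads an unallocated component and may crash.
   copy = list(n)%positive
end program unallocated_string_demo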

@mathomp4
Member

Wait. I think Intel confused me once again, @ulmononian.

I believe Intel oneAPI 2022.1.0 is actually ifort 2021.6.0 which is what we use operationally:

$ which ifort
/usr/local/intel/oneapi/2021/compiler/2022.1.0/linux/bin/intel64/ifort
$ ifort -V
Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.6.0 Build 20220226_000000
Copyright (C) 1985-2022 Intel Corporation.  All rights reserved.

Thus, this should work. It's ifort 2021.7 that had problems (and I think 2021.9 on my mac failed as well).

@mathomp4
Member

Even odder, I'm not sure that can ever not be filled. From #941 this code came in:

          call ESMF_ConfigFindLabel(cfg,trim(string)//'positive:',isPresent=isPresent,_RC)
          if (isPresent) then
             call ESMF_ConfigGetAttribute(cfg,value=list(n)%positive,_RC)
             _ASSERT(list(n)%positive=='down'.or.list(n)%positive=='up',"positive value for collection must be down or up")
          else
             list(n)%positive = 'down'
          end if

From what I see of your history, you don't specify a positive, so you must be in the down case.

bena-nasa changed the title from "MAPL functionality w/ newer Intel compilers" to "History Crash in UFS w/ newer Intel compilers" on Jun 27, 2023
@mathomp4
Member

I suppose that is something you could try. Add:

  inst_aod.positive: 'down',

to your HISTORY and force the subject?

Note: I just tried your inst_aod collection in a GEOS run here with and without that line and no difference. Both ran just fine!

@bena-nasa
Collaborator

That string on the right-hand side is always filled with "up" or "down": either it is in the HISTORY file because the user provided it (and if it is not one of those, the run should die), or, if not, it defaults to "down". There's no way to get to that point in the code without it being one of those two.

@bena-nasa
Collaborator

Does this fail in the same place with optimization?

@tclune
Collaborator

tclune commented Jun 27, 2023

@weiyuan-jiang I think you are going to need to attempt reproducing this on Orion unfortunately. Please work with our NOAA counterparts to get the details. Once you can reproduce, pull @bena-nasa to debug.

@ulmononian
Author

ulmononian commented Jun 27, 2023

Wait. I think Intel confused me once again, @ulmononian.

I believe Intel oneAPI 2022.1.0 is actually ifort 2021.6.0 which is what we use operationally:

$ which ifort
/usr/local/intel/oneapi/2021/compiler/2022.1.0/linux/bin/intel64/ifort
$ ifort -V
Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.6.0 Build 20220226_000000
Copyright (C) 1985-2022 Intel Corporation.  All rights reserved.

Thus, this should work. It's ifort 2021.7 that had problems (and I think 2021.9 on my mac failed as well).

with intel-oneapi-compilers/2022.2.1, which we use on hercules, the underlying compiler is ifort 2021.7.1:

$ which ifort
/apps/spack-managed/gcc-11.3.1/intel-oneapi-compilers-2022.2.1-z2sjni66fcyqcsamnoccgb7c77mn37qj/compiler/2022.2.1/linux/bin/intel64/ifort
$ ifort -V
Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.7.1 Build 20221019_000000
Copyright (C) 1985-2022 Intel Corporation.  All rights reserved.

if 2021.7.x is the problematic ifort, perhaps this could be a clue

@ulmononian
Author

Does this fail in the same place with optimization?

can i ask what optimization you are referring to?

@ulmononian
Author

Hmm. The code is really doing something pretty boring at the line that is failing - basically copying a string. My one thought is that perhaps the string argument list(n)%positive is itself not allocated. That could give that sort of error. Unfortunately, I don't know the layer well enough to immediately speculate where that argument should have been previously established.

If you have the ability to recompile and try again, I would suggest adding the following line just before line 2336 in MAPL_HistoryGridComp.F90:

_ASSERT(allocated(list(n)%positive), "unallocated string: 'list(n)%positive'")

But hopefully @bena-nasa is back from vacation today and can provide a more complete diagnosis before we trudge down the debugging path ...

thanks for this suggestion! since we build mapl as part of a larger stack using spack, it would be a bit of an endeavor to re-compile w/ this code change, but i can certainly do it. is this something i should pursue or hold off until the team has a chance to look into this a bit more?

@tclune
Collaborator

tclune commented Jun 27, 2023

Based on guidance from @bena-nasa I would not bother. At this point the focus should be on getting my team to reproduce the problem on Orion. We can then play with flags and such to see where that takes us.

@weiyuan-jiang
Contributor

@ulmononian Do you have instructions for me to reproduce this on Orion?

@ulmononian
Author

ulmononian commented Jun 27, 2023

@ulmononian Do you have instructions for me to reproduce this on Orion?

i am not sure this is reproducible on orion... the same ufs model config (w/ the same mapl & gocart versions) runs fine on orion; the only differences are the compiler/mpi versions there (and, obviously, the machine architecture/cpu config/etc. between hercules and orion, or c5 and orion). the specific branches we are testing are for porting the weather model to hercules and gaea c5, and the problem only arises on these two machines.

@tclune
Collaborator

tclune commented Jun 27, 2023

Our ability to troubleshoot this is very limited if there is not an environment where we can reproduce it. I thought that was the point of Orion.

My only other suggestion would then be "pair debugging", where someone on the NOAA end drives the keyboard and someone on our end screenshares and suggests next steps. It would work, but ...

@ulmononian
Author

ulmononian commented Jun 27, 2023

i completely understand. i think that anyone with orion access should also be able to access hercules, though, as they are both msu machines and share a filesystem. @weiyuan-jiang are you able to use hercules by chance? to log in, use the same ssh command as for orion, but with @orion replaced by @hercules.

@weiyuan-jiang
Contributor

It seems I can log in to hercules. So please give me instructions to reproduce the issue.

@ulmononian
Author

ulmononian commented Jun 27, 2023

@weiyuan-jiang awesome! to reproduce there, please do the following:

git clone --recursive -b feature/add_hercules https://github.com/ulmononian/ufs-weather-model.git
cd ufs-weather-model/tests
./rt.sh -a <slurm_account_you_can_charge_to> -c -n cpld_control_p8 intel
vim <rt_dir>/cpld_control_p8_intel/out
vim <rt_dir>/cpld_control_p8_intel/err

if you want to use debug versions of esmf and mapl for your test, please edit ufs-weather-model/modulefiles/ufs_hercules.intel.lua and add the following lines below the loading of ufs_common (line 19):

load("esmf/8.4.2-debug")
load("mapl/2.35.2-debug-esmf-8.4.2-debug")

@weiyuan-jiang
Contributor

It turned out my access to Hercules is short-lived. I am asking for help now.

@weiyuan-jiang
Contributor

weiyuan-jiang commented Jun 29, 2023

The crash happened earlier in my run than in yours. Should I change something to reproduce your error message? @ulmononian

150: WARNING from PE 0: Unused line in INPUT/MOM_input : ODA_INCUPD_NHOURS = 6
150:
103: forrtl: severe (189): LHS and RHS of an assignment statement have incompatible types
103: Image PC Routine Line Source
103: fv3.exe 00000000020A05CA Unknown Unknown Unknown
103: fv3.exe 0000000000B4BA14 Unknown Unknown Unknown
103: fv3.exe 0000000000B4F98F Unknown Unknown Unknown
103: fv3.exe 0000000000CBDCFA Unknown Unknown Unknown
103: fv3.exe 0000000000CAA362 Unknown Unknown Unknown
103: fv3.exe 0000000000B4CE6A Unknown Unknown Unknown
103: fv3.exe 00000000005B18F0 Unknown Unknown Unknown
103: fv3.exe 0000000000894A71 Unknown Unknown Unknown
103: fv3.exe 00000000021CFEA0 Unknown Unknown Unknown
103: fv3.exe 0000000000B4BA14 Unknown Unknown Unknown
103: fv3.exe 0000000000B4F98F Unknown Unknown Unknown
103: fv3.exe 0000000000CBDCFA Unknown Unknown Unknown
103: fv3.exe 0000000000CAA362 Unknown Unknown Unknown
103: fv3.exe 0000000000B4CE6A Unknown Unknown Unknown
103: fv3.exe 00000000005B18F0 Unknown Unknown Unknown
103: fv3.exe 0000000000894A71 Unknown Unknown Unknown
103: fv3.exe 000000000206339F Unknown Unknown Unknown
103: fv3.exe 0000000002068216 Unknown Unknown Unknown
103: fv3.exe 0000000000B4BA14 Unknown Unknown Unknown
103: fv3.exe 0000000000B4F98F Unknown Unknown Unknown
103: fv3.exe 0000000000CBDCFA Unknown Unknown Unknown
103: fv3.exe 0000000000CAA362 Unknown Unknown Unknown
103: fv3.exe 0000000000B4CE6A Unknown Unknown Unknown
103: fv3.exe 00000000005B18F0 Unknown Unknown Unknown
103: fv3.exe 0000000000894A71 Unknown Unknown Unknown
103: fv3.exe 00000000020608CD Unknown Unknown Unknown
103: fv3.exe 0000000001D9F59F aerosol_cap_mp_mo 348 Aerosol_Cap.F90

@weiyuan-jiang
Contributor

@ulmononian Do you have instructions to build MAPL on Hercules?

@ulmononian
Author

The crash happened earlier in my run than in yours. Should I change something to reproduce your error message? @ulmononian


interesting. there is nothing more you should need to do. can you share your run directory? i will take a look.

as for building mapl on hercules: we are using spack to build it as part of the full stack for the ufs-wm and other applications. i can provide you with instructions on how to install the stack, but i have not installed mapl manually on hercules. the script that spack uses to build mapl is: https://github.com/JCSDA/spack/tree/158dada02ce08a0b42606f82059c51e8f9f02ef0/var/spack/repos/builtin/packages/mapl/package.py.

@weiyuan-jiang
Contributor

weiyuan-jiang commented Jun 30, 2023

Never mind, I think my error message is the same as yours, because history init is within the cap init. I will need to build MAPL first so I can insert something and get more information about the crash.

@ulmononian
Author

some follow-up on how mapl is built:

these variants are applied:

 mapl@2.35.2%intel+debug ^esmf@8.4.2%intel+external-parallelio~pnetcdf~shared~xerces snapshot=none

 mapl@2.35.2%intel@2021.7.1 cflags="-diag-disable=10441" cxxflags="-diag-disable=10441" +debug+esma_gfe_namespace~extdata2g~fargparse~flap~ipo~pflogger+pnetcdf~shared build_system=cmake build_type=RelWithDebInfo generator=make arch=linux-rocky9-icelake

for the variants: ~ indicates a setting is turned off; + indicates it is turned on. the variants listed above can be found in the package.py script i linked to in my previous comment. they essentially correspond to cmake flags that are set for the build.
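
as a quick way to check how a spec like the ones above resolves, spack spec concretizes and prints the build configuration without installing anything (the exact spec string below is illustrative, not site-exact):

# illustrative only -- adjust the spec to your site configuration.
# `spack spec` concretizes and prints the build config without installing.
spack spec mapl@2.35.2%intel@2021.7.1+debug+pnetcdf~shared \
    ^esmf@8.4.2+external-parallelio~pnetcdf~shared~xerces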

@ulmononian
Author

Never mind, I think my error message is the same as yours, because history init is within the cap init. I will need to build MAPL first so I can insert something and get more information about the crash.

ok -- yes it is indeed the same error i received when i ran with the NON-debug version of mapl (as in this log https://github.com/ufs-community/ufs-weather-model/files/11682589/hercules_err.txt).

re-compiling & running the WM should yield the full error log that points to the history crash...

@weiyuan-jiang
Contributor

I believe this is a compiler problem. For example, it reports the error "0x00000000074e08a8 do_alloc_assign() for_alloc_copy.c:0..." at this line:

list(n)%vdata = VerticalData(positive=list(n)%positive,_RC)

if I change it to (and, of course, declare vdata as allocatable):

allocate(list(n)%vdata, source = VerticalData(positive=list(n)%positive))

it passes that line.

In MAPL there are many assignments like that; I am wondering if we should make such changes for a buggy compiler. @tclune @ulmononian
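
For illustration, a minimal self-contained sketch of the two patterns, with stand-in types rather than the actual MAPL VerticalData and history-collection classes:

! Stand-in types only -- not the actual MAPL classes.
module vdata_demo
   implicit none
   type :: vertical_data_t
      character(len=:), allocatable :: positive
   end type vertical_data_t
   type :: collection_t
      ! the workaround requires this component to be allocatable
      type(vertical_data_t), allocatable :: vdata
   end type collection_t
contains
   function new_vertical_data(positive) result(v)
      character(len=*), intent(in) :: positive
      type(vertical_data_t) :: v
      v%positive = positive
   end function new_vertical_data
end module vdata_demo

program assignment_workaround
   use vdata_demo
   implicit none
   type(collection_t) :: list(1)
   integer :: n
   n = 1
   ! Pattern that crashed with ifort 2021.7.x: intrinsic assignment
   ! routes through the runtime deep-copy (do_alloc_assign).
   list(n)%vdata = new_vertical_data('down')
   ! Workaround: explicit sourced allocation avoids that code path.
   if (allocated(list(n)%vdata)) deallocate(list(n)%vdata)
   allocate(list(n)%vdata, source=new_vertical_data('down'))
   print *, list(n)%vdata%positive
end program assignment_workaround

Both forms are standard-conforming Fortran; sourced allocation simply sidesteps the intrinsic-assignment deep-copy path in which this ifort version crashes.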

@ulmononian
Author

ulmononian commented Jun 30, 2023

I believe this is a compiler problem. For example, it reports the error "0x00000000074e08a8 do_alloc_assign() for_alloc_copy.c:0..." at this line:

list(n)%vdata = VerticalData(positive=list(n)%positive,_RC)

if I change it to (and, of course, declare vdata as allocatable):

allocate(list(n)%vdata, source = VerticalData(positive=list(n)%positive))

it passes that line.

In MAPL there are many assignments like that; I am wondering if we should make such changes for a buggy compiler. @tclune @ulmononian

that's great news that you can get it to pass by making this adjustment. did the model run to completion with these changes, by chance?

if these changes are possible at the mapl level, it would be most appreciated. is there anything we can do on our side?

thank you!!

@weiyuan-jiang
Contributor

No, the model didn't complete because of the same assignment error at other locations. I think there are many assignments like that.

@ulmononian
Author

No, the model didn't complete because of the same assignment error at other locations. I think there are many assignments like that.

ok -- makes sense.

it is not clear to me, based on your example, in what file this is happening. further: are these assignments restricted to this single file, or do they occur throughout the code?

@climbfuji
Copy link

climbfuji commented Jul 1, 2023 via email

@tclune
Collaborator

tclune commented Jul 3, 2023

Yeah - this is an odd regression for Intel. Hopefully they have lots of other customers with similar bugs, as it can be a bit tricky to boil down a standalone reproducer for these. Clearly most of the allocate-on-assignment statements are working just fine. There is a chance that a small reproducer that just brings in the VerticalData class can reproduce, so that should be attempted. (@weiyuan-jiang )

@ulmononian
Author

ulmononian commented Jul 6, 2023

Yeah - this is an odd regression for Intel. Hopefully they have lots of other customers with similar bugs, as it can be a bit tricky to boil down a standalone reproducer for these. Clearly most of the allocate-on-assignment statements are working just fine. There is a chance that a small reproducer that just brings in the VerticalData class can reproduce, so that should be attempted. (@weiyuan-jiang )

@weiyuan-jiang just wanted to touch base and see if you would be able to create the reproducer patch/bugfix to address this intel compiler issue. it would be most appreciated and help us get the weather model running on hercules (and potentially gaea c5 too)! thank you!

@weiyuan-jiang
Contributor

@ulmononian I was unable to create a simple reproducer.

@ulmononian
Author

@ulmononian I was unable to create a simple reproducer.

thanks for trying. given that, is there anything else that can be done on the mapl side? otherwise, we may have to take this to intel and/or the hercules/gaea sys admins.

@ulmononian
Author

@weiyuan-jiang if you didn't find it, it should be at the top of the out file in the rt_*/compile_* directory generated after you run ./rt.sh. however, you can compile the ufs-weather-model manually (and save the cmake output) like:

git clone --recursive -b feature/add_hercules https://github.com/ulmononian/ufs-weather-model.git
cd ufs-weather-model
module use modulefiles; module load ufs_hercules.intel
mkdir build; cd build
# cmake command with output saved below
cmake  -DAPP=S2SWA -DCCPP_SUITES=FV3_GFS_v17_coupled_p8 -DMPI=ON -DCMAKE_BUILD_TYPE=Release -DMOM6SOLO=ON .. 2>&1 | tee log.cmake
make -j <#>

@ulmononian
Author

this was resolved by upgrading the intel compiler version on hercules and c5 (see JCSDA/spack-stack#673). i believe this issue can be closed. thanks for all the help!

tclune closed this as completed Aug 22, 2023