Fortran runtime error: Index '1' of dimension 2 of array 'this' outside of expected range SMS_D.f19_g16.I1850ELM.machine_compiler.elm-betr with invalid #5832

Closed
ndkeen opened this issue Jul 24, 2023 · 14 comments · Fixed by #6214

@ndkeen
Contributor

ndkeen commented Jul 24, 2023

As we closed #5539, I'm opening another issue here for the same error.
We are trying to add the invalid check (-ffpe-trap=invalid) to the Fortran debug flags for the GNU compiler.

With SMS_D.f19_g16.I1850ELM.pm-cpu_gnu.elm-betr:

 3: At line 124 of file /global/cfs/cdirs/e3sm/ndk/repos/ndk_mf_gnu-add-invalid-to-DEBUG/components/elm/src/external_models/sbetr/src/betr/betr_core/TracerStateType.F90
 3: Fortran runtime error: Index '1' of dimension 2 of array 'this' outside of expected range (140737046949536:40202912)

/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/mfgnuinvalid/SMS_D.f19_g16.I1850ELM.pm-cpu_gnu.elm-betr.gh5539

To add the invalid check:

login04% git diff cime_config/machines/cmake_macros/gnu.cmake
diff --git a/cime_config/machines/cmake_macros/gnu.cmake b/cime_config/machines/cmake_macros/gnu.cmake
index eae59e3e4b..a8fce54cbf 100644
--- a/cime_config/machines/cmake_macros/gnu.cmake
+++ b/cime_config/machines/cmake_macros/gnu.cmake
@@ -19,7 +19,8 @@ endif()
 if (DEBUG)
   string(APPEND CFLAGS " -g -Wall -fbacktrace -fcheck=bounds -ffpe-trap=invalid,zero,overflow")
   string(APPEND CXXFLAGS " -g -Wall -fbacktrace")
-  string(APPEND FFLAGS " -g -Wall -fbacktrace -fcheck=bounds -ffpe-trap=zero,overflow")
+  string(APPEND FFLAGS " -g -Wall -fbacktrace -fcheck=bounds,pointer -ffpe-trap=invalid,zero,overflow")

@jinyun1tang
Contributor

@ndkeen I fixed the issue in branch jinyuntang/fix5832; could you test it? The problem is an array size inconsistency between elm and sbetr. A small update to sbetr fixed the problem, as far as I can tell from my test.

@ndkeen
Contributor Author

ndkeen commented Aug 19, 2023

When I add the invalid flag to recent master and try the test, I now see a different error message than the one reported above.

 95: Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
 95:
 95: Backtrace for this error:
 95: #0  0x14f0c72dedbf in ???
 95: #1  0x1f604a7 in __tracerparamsmod_MOD_calc_aerecond
 95:    at /global/cfs/cdirs/e3sm/ndk/repos/me24-aug15/components/elm/src/external_models/sbetr/src/betr/betr_para/TracerParamsMod.F90:1271
 95: #2  0x1f4c777 in __betrbgcmod_MOD_stage_tracer_transport
 95:    at /global/cfs/cdirs/e3sm/ndk/repos/me24-aug15/components/elm/src/external_models/sbetr/src/betr/betr_main/BetrBGCMod.F90:203
 95: #3  0x1e420e8 in __betrtype_MOD_step_without_drainage
 95:    at /global/cfs/cdirs/e3sm/ndk/repos/me24-aug15/components/elm/src/external_models/sbetr/src/driver/shared/BeTRType.F90:375
 95: #4  0x1b25651 in __betrsimulationelm_MOD_elmstepwithoutdrainage
 95:    at /global/cfs/cdirs/e3sm/ndk/repos/me24-aug15/components/elm/src/external_models/sbetr/src/driver/elm/BeTRSimulationELM.F90:314
 95: #5  0x6862a8 in __elm_driver_MOD_elm_drv
 95:    at /global/cfs/cdirs/e3sm/ndk/repos/me24-aug15/components/elm/src/main/elm_driver.F90:1178
 95: #6  0x6509c7 in __lnd_comp_mct_MOD_lnd_run_mct
 95:    at /global/cfs/cdirs/e3sm/ndk/repos/me24-aug15/components/elm/src/cpl/lnd_comp_mct.F90:514

If I check out your branch and add the invalid flag, I do not see a crash. However, I can't tell from the branch what changes you made.

@jinyun1tang
Contributor

@ndkeen the problem is due to a more recent update of maxpft from a small number to a larger number, 50, causing a mismatch between sbetr and elm. If you find that my fix solves the problem, I will update sbetr, update e3sm, and create a pull request based on this.

@ndkeen
Contributor Author

ndkeen commented Aug 19, 2023

Note that above I show how to add the invalid check, so you can try it yourself. Then go ahead and make a PR.

@jinyuntang
Contributor

jinyuntang commented Aug 19, 2023 via email

@ndkeen
Contributor Author

ndkeen commented Aug 19, 2023

Great! Then it sounds like you have fixed this issue. You would not want to include the compiler flag change in your PR. We would like to add it, but are still trying to fix issues that it uncovered (like this one).

@jinyuntang
Contributor

jinyuntang commented Aug 19, 2023 via email

@ndkeen
Contributor Author

ndkeen commented Oct 27, 2023

With an Oct 27th checkout, I still see this error.

@ndkeen
Contributor Author

ndkeen commented Jan 19, 2024

I'm still seeing the same error with Jan18th master and Jan23rd master

@jinyun1tang
Contributor

@ndkeen Is there any change I need to make to run the test? I recall that last time you instructed me to make some changes at the time.
Now, after trying
./create_test SMS_D.f19_g16.I1850ELM.pm-cpu_gnu.elm-betr
I got the following error: "FAIL SMS_D.f19_g16.I1850ELM.pm-cpu_gnu.elm-betr (phase CREATE_NEWCASE)". I have no clue what is going on.
Thanks.

@ndkeen
Contributor Author

ndkeen commented Feb 6, 2024

Yes, that is the correct command. I don't have enough info there to know what's wrong, but if I were to guess: Are you trying that on Perlmutter? If you are on another machine, you need that machine's name instead of pm-cpu. Are you running it from cime/scripts? I guess so, as it would otherwise say create_test not found.

When I try this test on master:
create_test SMS_D.f19_g16.I1850ELM.pm-cpu_gnu.elm-betr
I still see the same error as noted above

Note that the change I mention above (regarding compiler flags) should no longer be needed, as master has had this change for quite a while.
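
A quick way to confirm that a checkout already carries the flag change is to look at the FFLAGS line in the GNU macro file. The sketch below assumes the file is still cime_config/machines/cmake_macros/gnu.cmake (the path from the diff above) and that the trap list sits on the same line as FFLAGS. The function name fflags_trap_invalid is made up for illustration.

```shell
# Hypothetical helper: succeeds if the given CMake macro file appends
# -ffpe-trap=...invalid... to FFLAGS on a single line.
fflags_trap_invalid() {
  grep -Eq 'FFLAGS.*-ffpe-trap=[a-z,]*invalid' "$1"
}

# Example, run from the top of an E3SM clone (path taken from the diff above):
#   fflags_trap_invalid cime_config/machines/cmake_macros/gnu.cmake \
#     && echo "invalid trap already enabled" \
#     || echo "invalid trap not found"
```

If the check fails on an older checkout, the diff at the top of this issue shows the one-line change that enables the trap.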

@jinyun1tang
Contributor

@ndkeen It appears I had to update the submodules. After that, it is now working. I will report back the result once it is done.

@ndkeen
Contributor Author

ndkeen commented Feb 6, 2024

Ah, yep, that's another common mistake I should have mentioned
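
For anyone else who hits a CREATE_NEWCASE failure after pulling, as in the exchange above, stale submodules can be spotted before rerunning the test. This is a minimal sketch: it relies on the prefix characters that git submodule status prints (- for uninitialized, + for a different commit checked out, U for merge conflicts), and the function name submodules_stale is hypothetical.

```shell
# Hypothetical helper: reads `git submodule status` output on stdin and
# succeeds if any submodule is uninitialized (-), at a different commit (+),
# or has merge conflicts (U).
submodules_stale() {
  grep -Eq '^[-+U]'
}

# Example, run from the top of the repository clone:
#   if git submodule status --recursive | submodules_stale; then
#     git submodule update --init --recursive
#   fi
```

Reading the status from stdin keeps the check separate from the update, so it can be run on its own first to see whether an update is needed.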

@jinyuntang
Contributor

@ndkeen, just letting you know that the tests passed.

bishtgautam added a commit that referenced this issue Mar 21, 2024
This update synchronizes the data structure change between sbetr and elm.
It fixes the failure of SMS_D.f19_g16.I1850ELM test.

[BFB].

Fixes #5832