floating invalid error in 20tr_cam5_av1c-04p2 on cori-knl #3061
Comments
Are you able to provide a simple way for someone else to recreate the error, either a script or a create_test command? Note that switching to cori from edison does change a few things, but it could easily be the case that we could make the code fail in the same way on edison as well. Certainly, adjusting the PE layout can exercise the code in different ways. Things that are easy to try: change the number of MPI tasks, turn off threads, run in DEBUG, run on cori-haswell (instead of cori-knl)... |
What do you mean by "r8 (double precision) flag "? |
Here is the PE layout I used for the simulation. I will try cori-haswell to see what happened.

    else if (
      e3sm_print 'using custom layout for cori-knl because $processor_config = '$processor_config
      ${xmlchange_exe} MAX_TASKS_PER_NODE="64"
      ${xmlchange_exe} NTASKS_ATM="5400"
      ${xmlchange_exe} NTASKS_LND="320"
      ${xmlchange_exe} NTASKS_ICE="5120"
      ${xmlchange_exe} NTASKS_OCN="3840"
      ${xmlchange_exe} NTASKS_CPL="5120"
      ${xmlchange_exe} NTASKS_GLC="320"
      ${xmlchange_exe} NTASKS_ROF="320"
      ${xmlchange_exe} NTASKS_WAV="5120"
      ${xmlchange_exe} NTHRDS_ATM="1"
    endif
|
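As a quick sanity check on a layout like the one quoted above, the following Python sketch computes how many whole nodes each component would occupy at 64 tasks per node, and how many task slots on the last node would sit idle. The task counts are copied from the quoted layout; the helper names are hypothetical, and the sketch assumes each component is placed on whole nodes by itself, which ignores root PEs, threading, and any sharing of nodes between components.

```python
# Task counts copied from the cori-knl layout quoted in the thread.
NTASKS = {"ATM": 5400, "LND": 320, "ICE": 5120, "OCN": 3840,
          "CPL": 5120, "GLC": 320, "ROF": 320, "WAV": 5120}


def nodes_needed(ntasks, tasks_per_node):
    """Whole nodes needed to place ntasks MPI tasks (integer ceiling division)."""
    return -(-ntasks // tasks_per_node)


def describe_layout(ntasks_by_comp, tasks_per_node):
    """Return {component: (nodes, idle_slots_on_last_node)} (hypothetical helper)."""
    result = {}
    for comp, ntasks in ntasks_by_comp.items():
        nodes = nodes_needed(ntasks, tasks_per_node)
        result[comp] = (nodes, nodes * tasks_per_node - ntasks)
    return result


if __name__ == "__main__":
    for comp, (nodes, idle) in describe_layout(NTASKS, 64).items():
        print(f"{comp}: {nodes} nodes, {idle} idle task slots")
```

With these numbers the ATM count of 5400 does not divide evenly by 64, so its last node would be only partly filled; whether that matters for this failure is not established in the thread.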
I mean the FC_AUTO_R8 flag. <FC_AUTO_R8> When I tried to read the new file I created for soil erodibility, the values of the variable did not seem right without this flag.
|
This is still not enough info. We need the "create_newcase" line which you can find in README.case in the case directory. And the full path to your case directory. |
My case directory is listed below. I changed the access permissions; let me know if you cannot access it.
|
I still think it's better if we can recreate the case.
|
Can you try one more time to access the case directory? Or let me know how I can share the script with you to recreate the case. |
We STRONGLY discourage autopromotion. Please explicitly type your variables with the correct type (r8 I assume, using the usual "types" module). |
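To illustrate why relying on autopromotion is fragile, here is a small Python sketch (an analogy only, not E3SM code). As far as I understand it, autopromotion changes the storage size of default-kind reals at compile time, so a variable that is not explicitly typed holds either 4 or 8 bytes depending on the flag. The sketch shows what a value looks like after being stored in a 4-byte real. The function names and the sample value are hypothetical, and the thread does not confirm that this is what affected the soil erodibility input.

```python
import struct


def as_real4(x):
    """Round a Python float (8-byte double) to the nearest 4-byte IEEE single."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def relative_error(x):
    """Relative error introduced by storing x in a 4-byte real."""
    return abs(as_real4(x) - x) / abs(x)


if __name__ == "__main__":
    value = 0.1  # made-up sample value, not taken from the thread
    print(repr(value), repr(as_real4(value)), relative_error(value))
```

Declaring the variable with an explicit kind removes the dependence on the compiler flag, which is the approach recommended in the comment above.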
I copied your run_e3sm script here:
And made some changes to allow this to work for me. When I set
Are there code changes you are making? When I set One thing you can easily try yourself, is building with DEBUG=TRUE. This might easily catch some floating-point issues and give you more information. |
Also I see a potential issue in the way your script is setting the PE layout.
MAX_MPITASKS_PER_NODE has a new name and is the most important setting. The COSTPES variables are not needed at all. This will certainly impact your PE layout, but may not fix the error. In your casedir, the CaseStatus file has:
Which is not what you want. |
I created the new chemistry module called "linoz_mam4_resus_mom_soag_biop" that is specifically designed to include both soluble and insoluble phosphorus aerosol emitted from different sources from landscapes (e.g., fires, fossil fuel, dust, etc) into the atmosphere. Could you use the option "include_fire=False" (that will use the chemistry module "linoz_mam4_resus_mom_soag") to see if you can compile and run the model successfully?
|
OK, I will remove this flag from config_compilers.xml.
|
I see. It is good to know that. I will modify the PE layout and try one more time. Thanks!
|
I now see that with or without fire, I would need access to |
I changed the access permissions for the inputdata directory for the run without fires; please check whether you can access those data now.
|
Let us know how the test goes when you have the correct number of MPI tasks per node. And if you could try with DEBUG=TRUE (I assume you know you can xmlchange DEBUG=TRUE before building to get this). I did try again but the permissions are still off.
|
I can try the correct PE layout and switch on DEBUG=TRUE first. BTW, I modified the permission for the /global/u1/l/lix011/fire.test/ELMv1.ne30_ne30.restart.clm2.r.1997-01-01-00000.nc.
|
OK, I was able to build/run and even with DEBUG=TRUE, I get the same error as you. Without DEBUG=TRUE, I actually did get a different error though (COSP related).
|
That IS exactly the same error I had!
|
Rebuilding with the GNU compiler, I get an error in perhaps the same place:
|
Does that mean that the compset "F20TRC5AV1C-04P2" perhaps is not working on cori-knl? There is no code change at all in the no-fire run (except those input files I created). The model uses the default chemistry module defined in the CAM5 namelist of E3SM.
|
I think there is likely a floating-point issue as the compiler reports. The fact that you did not see it with edison may be more a function of the software (compiler version, for example), than the actual hardware. I added a write statement just before the line that produces a floating-point issue and it looks like the array Different compilers treat using uninitialized variables differently, but typically we want to find/fix those. Here is what I printed:
I don't think I also ran this same case with nothing in user_nl_cam and user_nl_clm and I see the same result. Do we expect this type of test to work? I did just try:
which both passed. |
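The debugging step described above, printing an array just before the line that traps, can be sketched generically. The Python below is an illustration only, with hypothetical names and made-up data: it scans a sequence for the first NaN or Inf and builds a one-line diagnostic. Note that it would not catch an uninitialized value that happens to be finite garbage.

```python
import math


def first_nonfinite(values):
    """Return (index, value) of the first NaN/Inf entry, or None if all are finite."""
    for i, v in enumerate(values):
        if not math.isfinite(v):
            return i, v
    return None


def report(name, values):
    """Build a one-line diagnostic, similar in spirit to a debug write statement."""
    bad = first_nonfinite(values)
    if bad is None:
        return f"{name}: all {len(values)} values finite"
    return f"{name}: non-finite value {bad[1]} at index {bad[0]}"


if __name__ == "__main__":
    column = [1.0, 2.5, float("nan"), 4.0]  # made-up data
    print(report("column", column))
```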
I also tested without COSP -- same failure. |
@ndkeen Thanks for helping me test the code! I tried multiple things: switching the master (the newest vs. the older one published in Aug. 2018), modifying the compiler options (back and forth), and unlimiting the stacksize and coredumpsize. The error changed to Then I switched off the "-cosp" option in CAM_CONFIG_OPTS, and the 3-day test run succeeded. Afterwards I switched "-cosp" back on but used cosp_lite=.true. in the namelist; so far so good. I finished a one-year run and the executable seems to be working now. |
Hmmm. That's interesting that when I switched off COSP entirely, I still got the same error. |
@ndkeen I tried to use the newest one but got the same error. When I switched back to the older master (Aug. 2018 version), it surprisingly passed. It IS interesting, isn't it? |
When I run a F-case (ATM only) using the compset
With these, it will fail with a floating invalid error (run in DEBUG). It is a different error than posted above, but I can change the PE layout in the run_e3sm script and also get this same error:
So then I started taking off some of those options to see if I could narrow anything down. I found that using:
Will cause the above error, while other combinations do not. It looks like I cannot try simply The source where it is stopping is here:
Which makes me think it's another example of an array being used before init. |
In github issue #3142, I'm now seeing that I get the same error with our "normal" F compset cases -- but only if I force 1 thread (pure MPI). |
Hi @lxu16 and @ndkeen -- just wanted to let you know that I got an identical error message today while building/running an F compset in ne30 from the current maint-v1.0 branch on cori. The traceback indicates the failure is occurring at line 1584 in clubb_intr.F90, the same as @lxu16 's original report. I am testing the solution proposed in @ndkeen 's issue #3142 (which looks like it might be relevant to this issue, too). I'll keep you posted on how this goes, but also wanted to ask if you have made any further progress / resolved the issue in the meantime? Thanks! |
Did you try to define cosp_lite=.true. in the namelist and use "-cosp" in the runscript? I used this strategy to solve the error.
(In reply to the comment by susburrows of Wednesday, August 28, 2019, 12:20.)
|
@lxu16: I don't think setting cosp_lite to true is a solution to this problem. It may have allowed you to continue to run, but it could be dangerous. Also, I think you are using a PE layout meant for coupled case, while you appear to actually only be doing ATM-only. |
@ndkeen I agree. This is a workaround for the error. Please keep me posted if you find a better way to resolve this issue. Thanks!
|
I think the fix in PR3324 should work here, but I'm unable to run the same script as I did before (even after making Cori module changes). If someone could simply add the one line (to initialize qrl_idx=0) and try again, that would be great. Or provide me with an updated script that works on Cori and I can try. |
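The thread does not show how qrl_idx is used, so the following is only a sketch of the general pattern that such a one-line fix suggests: give an optional index an explicit sentinel value at initialization and test it before use. It is written in Python rather than Fortran, every name in it is hypothetical, and it assumes that 0 means "no field registered" and that valid indices are 1-based.

```python
UNSET = 0  # sentinel meaning "no field registered", mirroring an index set to 0


class FieldIndex:
    """Hypothetical holder for an optional 1-based field index."""

    def __init__(self):
        self.idx = UNSET  # explicit initialization, never left undefined

    def register(self, idx):
        """Record a valid 1-based index for the field."""
        if idx < 1:
            raise ValueError("field indices are 1-based")
        self.idx = idx


def read_field(fields, holder):
    """Return the field for holder, or None when no index was registered."""
    if holder.idx == UNSET:
        return None
    return fields[holder.idx - 1]  # convert the 1-based index to 0-based


if __name__ == "__main__":
    holder = FieldIndex()
    print(read_field([[290.0, 291.5]], holder))  # no index registered yet
    holder.register(1)
    print(read_field([[290.0, 291.5]], holder))
```

Without the explicit initialization, the guard would be testing whatever value the variable happened to contain, which is the kind of compiler-dependent behavior discussed earlier in the thread.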
I have been debugging this floating invalid error for a while on the cori machine without finding any clues. I included the e3sm.log error below. I can compile and run the code on edison without any problem; the error appeared only after switching machines from edison to cori. I suspect it is related to the r8 (double precision) flag and NetCDF file input, because the code is the same as before.
Could I get some help from the E3SM community? I appreciate your suggestions.