NaNs in ieflx_gmean calc on KNLs with Intel compiler #2178
Comments
Thanks @amametjanov for reporting this. I have seen this error before, but with a different E3SM configuration. Which compset and resolution are you using? Is this reproducible? If so, do you also get it with debug flags turned on? The hope is that the debug flags will reveal where the NaN first originates. |
Yes, second time seeing this with |
Looking through the history file (
Theta's machine precision is
Arithmetic ops on such small numbers may be what is causing the NaNs. |
Could those numbers be junk from an array that wasn't initialized to zero? I doubt small numbers like that are computed by the model. |
It looks like they are being computed :) |
We can catch instances where these small numbers are generated by using the compiler's underflow flags. I have never seen them cause any issues in the past. I remember using such a flag a long time ago, and the code would crash very early on with underflow detection. They always seemed harmless to me, but it may depend on the compiler: some compilers automatically flush underflows to zero while others do not. |
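For anyone who wants to check a dumped field offline, here is a rough Python sketch (not model code) of how values below the smallest normal double can be flagged. The helper names `classify` and `count_subnormals` and the sample values are made up for illustration; the only assumption is that the field is stored as IEEE-754 doubles.

```python
import math
import sys

def classify(x: float) -> str:
    """Return a short label for the IEEE-754 class of a double."""
    if math.isnan(x):
        return "nan"
    if math.isinf(x):
        return "inf"
    if x == 0.0:
        return "zero"
    # Nonzero magnitudes below the smallest normal double are subnormal.
    if abs(x) < sys.float_info.min:
        return "subnormal"
    return "normal"

def count_subnormals(values) -> int:
    """Count the entries of a field that are subnormal (hypothetical helper)."""
    return sum(1 for v in values if classify(v) == "subnormal")

# Synthetic sample values, not taken from any history file.
field = [0.0, 1.0e-3, 4.9e-324, 1.0e-310, 2.5]
print([classify(v) for v in field])
print(count_subnormals(field))
```

A subnormal value is still a finite number, so a count like this only shows where underflow happens; it does not by itself show where a NaN comes from.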
Additional data point about initialization. ATM after 4th step in the continued run has:
And these checks:
show in e3sm.log:
So the arrays of size 16 are initialized to 0, but get NaNs in fields
May need to check these fields in history files. |
@amametjanov - are you saying that these NaNs show up for the first time after step 4, or are they formed upon initialization and persist through step 4? |
Yes, showing up for the first time after step 4 (twice in these restart runs and once after step 1 in a 145-node startup run). Reprosum calculation has a check for NaNs and INFs and will abort/endrun if there is any such value in the summation. If ne30 runs did not encounter the |
Hmm. I think we can conclude from the fact that NaNs show up on step 4 that this is not an initialization problem. Does this problem always show up on step 4, or does the timestep it shows up on vary? If always step 4, is there something special about step 4 (e.g. radiation is called every hour = every 4 steps)? |
I think cflx is dimensioned (pcols, pcnst), so the issue is in column 9.
|
Sorry forgot to mention, I changed |
Sorry, I don't think columns should be 8 instead of 9. One reason for this behavior might be that |
Chunks can have different numbers of columns (and not use all of the available space in a chunk). If some routine is using pcols instead of ncols, this could cause the situation @singhbalwinder is conjecturing about. |
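To make the pcols-vs-ncols conjecture concrete, here is a small Python sketch (illustration only, not the model's Fortran). It assumes a chunk is allocated with `PCOLS` slots of which only the first `ncol` are ever written; the unwritten tail is filled with NaN here to stand in for whatever junk might be in memory. `PCOLS = 16` and the helper names are assumptions chosen for the example.

```python
import math

PCOLS = 16  # allocated columns per chunk (assumed value for this example)

def make_chunk(ncol: int, fill: float = 1.0):
    """Build one chunk: the first ncol entries are valid, the rest is padding.

    The padding is set to NaN to mimic memory that was never written.
    """
    return [fill] * ncol + [math.nan] * (PCOLS - ncol)

def sum_all_slots(chunk):
    """Buggy pattern: reduce over the allocated size (pcols)."""
    return sum(chunk[:PCOLS])

def sum_valid(chunk, ncol: int):
    """Intended pattern: reduce over the active columns only (ncol)."""
    return sum(chunk[:ncol])

chunk = make_chunk(ncol=8)
print(sum_all_slots(chunk))  # picks up the padding
print(sum_valid(chunk, 8))   # active columns only
```

With a partially filled chunk the first sum is contaminated by the padding while the second is not; with a full chunk (`ncol == PCOLS`) both agree, which would be consistent with the problem only appearing on some chunks.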
Should be fixed by #2208. I was not able to reproduce this after that PR. |
Az, if these are KNL-specific bugs with KNL-specific solutions, please change the title to include "on KNL". |
To help searches by future users. |
Logging an issue to track down the location of this error. The NaN is in column 8 of chunk 127455.
The source of the NaN is in one of:
cam_in(lchnk)%cflx(:ncol,1)
cam_out(lchnk)%precsc(:ncol)
cam_out(lchnk)%precsl(:ncol)
cam_out(lchnk)%precc(:ncol)
cam_out(lchnk)%precl(:ncol)
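A rough way to narrow this down is to scan each candidate field over its active columns and report the first non-finite entry. The Python sketch below shows the idea on synthetic data (the NaN is planted by hand); `first_nonfinite` is a hypothetical helper, and in the model the equivalent check would live in Fortran next to the sum.

```python
import math

def first_nonfinite(fields, ncol):
    """Return (name, column) of the first NaN/Inf in the active columns, or None.

    Columns are reported 1-based, as they would be in Fortran.
    """
    for name, values in fields.items():
        for i, v in enumerate(values[:ncol]):
            if not math.isfinite(v):
                return name, i + 1
    return None

ncol = 16
# Synthetic stand-ins for the five candidate fields; all zeros to start with.
fields = {name: [0.0] * ncol
          for name in ("cflx", "precsc", "precsl", "precc", "precl")}
fields["cflx"][7] = math.nan  # planted by hand for the example

print(first_nonfinite(fields, ncol))  # ('cflx', 8)
```

Only the first `ncol` entries are scanned, so padding beyond the active columns is never reported.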