NaNs in ieflx_gmean calc on KNLs with Intel compiler #2178

Closed
amametjanov opened this issue Mar 20, 2018 · 17 comments

Comments

@amametjanov
Member

Logging an issue to track down the location of this error. The NaN is in column 8 of chunk 127455.

SHR_REPROSUM_CALC: Input contains  0.10000E+01 NaNs and  0.00000E+00 INFs on process   20527
 ERROR: shr_reprosum_calc ERROR: NaNs or INFs in input
Image              PC                Routine            Line        Source
e3sm.exe           000000000312B45F  shr_abort_mod_mp_         114  shr_abort_mod.F90
e3sm.exe           000000000324F04E  shr_reprosum_mod_         428  shr_reprosum_mod.F90
e3sm.exe           00000000005BBA2B  phys_gmean_mp_gme         417  phys_gmean.F90
e3sm.exe           00000000010594C0  check_energy_mp_i         709  check_energy.F90
e3sm.exe           0000000000623B8F  physpkg_mp_phys_r        1210  physpkg.F90
e3sm.exe           0000000000505626  cam_comp_mp_cam_r         285  cam_comp.F90
e3sm.exe           00000000004F2287  atm_comp_mct_mp_a         501  atm_comp_mct.F90
e3sm.exe           0000000000429664  component_mod_mp_         728  component_mod.F90
e3sm.exe           000000000040F172  cime_comp_mod_mp_        3371  cime_comp_mod.F90
e3sm.exe           0000000000429370  MAIN__                    103  cime_driver.F90

The source of the NaN is one of:

  • cam_in(lchnk)%cflx(:ncol,1)
  • cam_out(lchnk)%precsc(:ncol)
  • cam_out(lchnk)%precsl(:ncol)
  • cam_out(lchnk)%precc(:ncol)
  • cam_out(lchnk)%precl(:ncol)
 632 subroutine ieflx_gmean(state, tend, pbuf2d, cam_in, cam_out, nstep)
 ...
 673     ienet = 0._r8
 674 
 675 !DIR$ CONCURRENT
 676     do lchnk = begchunk, endchunk
 677 
 678        ncol = state(lchnk)%ncol
>679        qflx(:ncol,lchnk) = cam_in(lchnk)%cflx(:ncol,1)
>680        snow(:ncol,lchnk) = cam_out(lchnk)%precsc(:ncol) + cam_out(lchnk)%precsl(:ncol)
>681        rain(:ncol,lchnk) = cam_out(lchnk)%precc(:ncol)  + cam_out(lchnk)%precl(:ncol) - snow(:ncol,lchnk)
 682 
 683        select case (ieflx_opt)
 684 
 685        !!..................................................................................... 
 686        !! Calculate the internal energy flux at surface (imitate what is considered in the ocean model)   
 687        !! 
 688        !! ieflx_opt = 1 : air temperature in the lowest model layer will be used 
 689        !! ieflx_opt = 2 : skin temperature (from lnd/ocn/ice components) will be used  
 690        !! 
 691        !! ieflx_opt = 2 is recommended for now. 
 692        !! 
 693        !! (rhow*) converts the unit of precipitation from m/s to kg/m2/s 
 694        !!..................................................................................... 
 695 
 696        case(1)
 697           ienet(:ncol,lchnk) = cpsw * qflx(:ncol,lchnk) * cam_in(lchnk)%ts(:ncol) - &
 698                                cpsw * rhow * ( rain(:ncol,lchnk) + snow(:ncol,lchnk) ) * cam_out(lchnk)%tbot(:ncol)
 699        case(2)
>700           ienet(:ncol,lchnk) = cpsw * qflx(:ncol,lchnk) * cam_in(lchnk)%ts(:ncol) - &
>701                                cpsw * rhow * ( rain(:ncol,lchnk) + snow(:ncol,lchnk) ) * cam_in(lchnk)%ts(:ncol)
 702        case default
 703           call endrun('*** incorrect ieflx_opt ***')
 704        end select
 705 
 706 
 707     end do
 708 
>709     call gmean(ienet, ieflx_glob)
@singhbalwinder
Contributor

Thanks @amametjanov for reporting this. I have seen this error before, but with a different E3SM configuration. Which compset and resolution are you using? Is this reproducible? If so, do you also get it with debug flags turned on? The hope is that the debug flags might reveal where the NaN first originates.

@amametjanov
Member Author

Yes, this is the second time I have seen this with SMS_PXL.ne120_oRRS18v3_ICG.A_WCYCL1950S_CMIP6_HR.theta_intel.cam-cosplite. Debug runs are also in progress but have not hit this yet.

@amametjanov
Member Author

Looking through the history file (*.cam.rh0.0001-01-31-00000.nc), fields like cam_out(lchnk)%precsc(:ncol) have very small values: e.g.

    1.88079096131566e-40, 7.7582627154271e-40, 2.1084395886461e-84, 0,
    4.17619485951906e-56, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

Theta's machine precision is

 Single precision =  0.11920929E-06 or 2^-23
 Double precision =  0.22204460E-15 or 2^-52

Arithmetic operations on such small numbers may be what is producing the NaNs.
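As a side check on the numbers above, here is a minimal standalone sketch (not E3SM code; the program name is hypothetical). It assumes the fields are stored in double precision and uses only standard Fortran intrinsics. The machine precision quoted above is `epsilon`, the spacing of values near 1, whereas the underflow threshold is `tiny`, the smallest normal number. A value such as 1.88e-40 lies below the single-precision `tiny` but is still a normal double-precision number.

```fortran
program tiny_vs_epsilon
  use, intrinsic :: iso_fortran_env, only: real32, real64
  use, intrinsic :: ieee_arithmetic, only: ieee_is_normal, ieee_is_nan
  implicit none
  real(real64) :: x, y

  ! One of the small precsc values quoted from the history file.
  x = 1.88079096131566e-40_real64

  ! epsilon: relative spacing near 1.0 (the "machine precision" above).
  print *, 'epsilon (real32, real64):', epsilon(1.0_real32), epsilon(1.0_real64)
  ! tiny: smallest positive normal number (the underflow threshold).
  print *, 'tiny    (real32, real64):', tiny(1.0_real32), tiny(1.0_real64)

  print *, 'x is a normal real64 value:', ieee_is_normal(x)
  print *, 'x is below real32 tiny    :', x < real(tiny(1.0_real32), real64)

  ! Scaling a small double-precision value does not by itself yield a NaN.
  y = x * 1.0e-3_real64
  print *, 'x * 1e-3 is NaN           :', ieee_is_nan(y)
end program tiny_vs_epsilon
```

If the values really are double precision, small magnitudes alone would not be expected to produce NaNs; an invalid operation (for example 0/0 or Inf - Inf) or an unset memory location would be a more likely source.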

@rljacob
Member

rljacob commented Mar 21, 2018

Could those numbers be junk from an array that wasn't initialized to zero? I doubt small numbers like that are computed by the model.

@amametjanov
Member Author

It looks like they are being computed :) cam_in and cam_out arrays are initialized to 0 in components/cam/src/control/camsrfexch.F90. History files are here: e.g. /projects/ClimateEnergy_2/azamatm/SMS_Ld31_PXL.ne120_oRRS18v3_ICG.A_WCYCL1950S_CMIP6_HR.theta_intel.cam-cosplite.20180313_181914_l5lwlk/precsc.out (ncdump -t -v var1 *.nc).

@singhbalwinder
Contributor

We can catch instances where these small numbers are generated by using the compiler's underflow flags. I have never seen them cause any issues in the past. I remember using such a flag a long time ago, and the code would crash very early on with underflow detection. The underflows always seemed harmless to me, but it may depend on the compiler, since some compilers automatically flush underflows to zero while others do not.
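Independently of any compiler-specific flag, underflow can also be detected from inside the code with the standard `ieee_arithmetic` module. The following is a minimal standalone sketch (hypothetical program name, not E3SM code). It assumes the compiler supports the IEEE underflow flag, which the program checks at startup.

```fortran
program detect_underflow
  use, intrinsic :: iso_fortran_env, only: real64
  use, intrinsic :: ieee_arithmetic
  implicit none
  ! volatile discourages the compiler from folding the product at compile time.
  real(real64), volatile :: a
  real(real64) :: b
  logical :: underflowed

  if (.not. ieee_support_flag(ieee_underflow, 1.0_real64)) then
    stop 'IEEE underflow flag is not supported for real64'
  end if

  ! Clear the flag, then perform an operation whose result is below tiny().
  call ieee_set_flag(ieee_underflow, .false.)
  a = tiny(1.0_real64)
  b = a * 1.0e-10_real64

  ! Query whether the operation signalled underflow.
  call ieee_get_flag(ieee_underflow, underflowed)
  print *, 'b =', b
  print *, 'underflow flag raised:', underflowed
end program detect_underflow
```

Whether `b` prints as a subnormal number or as zero depends on whether the build flushes underflows to zero, which is the compiler-dependent behavior mentioned above.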

@amametjanov
Member Author

An additional data point about initialization: after the 4th step of the continued run, ATM reports:

   Current step number:              3841
 ...
 nstep, te     3841   0.33411446843176217E+10   0.33411468040667338E+10   0.23448612056170461E-03   0.98507177590207240E+05
 nstep, te     3842   0.33411357703415484E+10   0.33411378602920594E+10   0.23118982348604186E-03   0.98507170438543981E+05
 nstep, te     3843   0.33411305601672096E+10   0.33411319091116810E+10   0.14921992066070427E-03   0.98507165635626530E+05
 nstep, te     3844   0.33411247440697255E+10   0.33411262608280902E+10   0.16778346050163977E-03   0.98507161897076017E+05

And these checks:

@@ -702,7 +702,29 @@ subroutine ieflx_gmean(state, tend, pbuf2d, cam_in, cam_out, nstep)
        case default
           call endrun('*** incorrect ieflx_opt ***')
        end select
-
+       do i=1,ncol
+         if (cam_in(lchnk)%cflx(i,1) /= cam_in(lchnk)%cflx(i,1)) then
+           write(iulog,*) 'NaN in clfx',i,lchnk,ncol,ieflx_opt,cam_in(lchnk)%cflx(:,1)
+         endif
+         if (cam_in(lchnk)%ts(i) /= cam_in(lchnk)%ts(i)) then
+           write(iulog,*) 'NaN in ts',i,lchnk,ncol,ieflx_opt,cam_in(lchnk)%ts(:)
+         endif
+         if (cam_out(lchnk)%precc(i) /= cam_out(lchnk)%precc(i)) then
+           write(iulog,*) 'NaN in precc',i,lchnk,ncol,ieflx_opt,cam_out(lchnk)%precc(:),'rain',rain(:,lchnk)
+         endif
+         if (cam_out(lchnk)%precl(i) /= cam_out(lchnk)%precl(i)) then
+           write(iulog,*) 'NaN in precl',i,lchnk,ncol,ieflx_opt,cam_out(lchnk)%precl(:),'rain',rain(:,lchnk)
+         endif
+         if (cam_out(lchnk)%precsc(i) /= cam_out(lchnk)%precsc(i)) then
+           write(iulog,*) 'NaN in precsc',i,lchnk,ncol,ieflx_opt,cam_out(lchnk)%precsc(:),'snow',snow(:,lchnk)
+         endif
+         if (cam_out(lchnk)%precsl(i) /= cam_out(lchnk)%precsl(i)) then
+           write(iulog,*) 'NaN in precsl',i,lchnk,ncol,ieflx_opt,cam_out(lchnk)%precsl(:),'snow',snow(:,lchnk)
+         endif
+         if (ienet(i,lchnk) /= ienet(i,lchnk)) then
+           write(iulog,*) 'NaN in ienet',i,lchnk,ncol,ieflx_opt,ienet(i,lchnk)
+         endif
+       enddo

show in e3sm.log:

(seq_domain_areafactinit) : min/max drv2mdl   0.999999923354153       1.00000007509469    areafact_o_OCN
 NaN in clfx           9      114015           9           2
  3.971921184855017E-005  1.642525045843797E-004  2.947733991644233E-005
  4.784900201901546E-005  7.489103276731714E-005  9.108095325323595E-006
  4.129502134087975E-006  2.144881251408060E-004                     NaN
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000
 NaN in ts           9      114015           9           2
   300.460276761804        295.163237366316        294.255842339761
   301.902920996472        299.266058591620        286.463401697433
   252.900251885251        288.404882510593                          NaN
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000
 NaN in precl           9      114015           9           2
  8.721439523793770E-010  6.797591931365890E-013  7.631128467863829E-011
  5.912707206395664E-020  3.992641589920015E-007  3.504465270537491E-010
  5.558511966756323E-011  1.501216548661270E-008                     NaN
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000 rain  3.849603926957925E-009  6.797591931365890E-013
  7.631128467863829E-011  5.912707206395664E-020  5.036713899589927E-007
  5.794703939578485E-008  0.000000000000000E+000  5.997597268333353E-010
                     NaN  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000
 NaN in precsl           9      114015           9           2
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  4.504836162751876E-021
  5.558511966756323E-011  1.441240575977937E-008                     NaN
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000 snow  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  4.504836162751876E-021  5.558511966756323E-011  1.635619849632815E-008
                     NaN  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
  0.000000000000000E+000  0.000000000000000E+000
 NaN in ienet           9      114015           9           2
                     NaN

So the arrays of size 16 are initialized to 0, but NaNs then appear in these fields:

  162 PRECL                            m/s                 1 A  Large-scale (stable) precipitation rate (liq + ice)
  168 PRECSL                           m/s                 1 A  Large-scale (stable) snow rate (water equivalent)
  174 QFLX                             kg/m2/s             1 A  Surface water flux
  188 TS                               K                   1 A  Surface temperature (radiative)

May need to check these fields in history files.

@PeterCaldwell
Contributor

@amametjanov - are you saying that these NaNs show up for the first time after step 4, or are they formed upon initialization and persist through step 4?
Also, it looks like the problem is at the 9th level from the top of the domain... which is kind of weird. I could understand a problem at the top level, but how can levels above and below get assigned and an intermediate level have problems? One possibility is that cloud physics isn't applied at the top few levels of the domain. Last I checked, cloud physics in the top 7 levels are ignored: https://acme-climate.atlassian.net/wiki/spaces/ATM/pages/129511233/Does+trop+cloud+top+press+have+any+impact . I was thinking there could be a problem with stitching the cloudy and cloud-free parts of the model together. Another possibility is that there's a problem with the bottom of the sponge layer at the top of the model. @mt5555 - do you know how many layers are part of the sponge?
I'm also curious whether these NaNs show up in low-res simulations. Could you re-run A_WCYCL1850S at ne30 with your print-statemented code, @amametjanov?

@amametjanov
Member Author

Yes, showing up for the first time after step 4 (twice in these restart runs and once after step 1 in a 145-node startup run). Reprosum calculation has a check for NaNs and INFs and will abort/endrun if there is any such value in the summation. If ne30 runs did not encounter the SHR_REPROSUM_CALC endrun, then they never had a NaN/INF in their calculations. Turning to compiler flags to catch NaNs earlier.
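The kind of input check described above can be sketched roughly as follows. This is not the actual shr_reprosum_mod implementation, only a standalone illustration with a hypothetical program name and made-up data: count the NaNs and INFs in the input and refuse to sum if any are present.

```fortran
program guarded_sum
  use, intrinsic :: iso_fortran_env, only: real64
  use, intrinsic :: ieee_arithmetic, only: ieee_is_nan, ieee_is_finite, &
                                           ieee_value, ieee_quiet_nan
  implicit none
  real(real64) :: field(4)
  integer :: n_nan, n_inf

  ! Made-up input with one NaN injected for illustration.
  field = [1.0_real64, 2.0_real64, 3.0_real64, 4.0_real64]
  field(3) = ieee_value(1.0_real64, ieee_quiet_nan)

  ! A value that is not finite is either a NaN or an infinity.
  n_nan = count(ieee_is_nan(field))
  n_inf = count(.not. ieee_is_finite(field) .and. .not. ieee_is_nan(field))

  if (n_nan + n_inf > 0) then
    print *, 'input contains', n_nan, 'NaNs and', n_inf, 'INFs; not summing'
  else
    print *, 'sum =', sum(field)
  end if
end program guarded_sum
```

A check of this kind only reports that a bad value reached the summation; it says nothing about where the value was first produced, which is why earlier detection is needed.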

@PeterCaldwell
Contributor

Hmm. I think we can conclude from the fact that NaNs show up on step 4 that this is not an initialization problem. Does this problem always show up on step 4, or does the timestep it shows up on vary? If always step 4, is there something special about step 4 (e.g. radiation is called every hour = every 4 steps)?
Is this error reproducible in the sense that all identical simulations fail, or do some get past step 4? If the latter, then we have a reproducibility problem (which my analysis today of my current movie run also suggests).

@singhbalwinder
Contributor

singhbalwinder commented Mar 23, 2018

Also, it looks like the problem is at the 9th level from the top of the domain... which is kind of weird.

I think cflx is dimensioned (pcols, pcnst), so the issue is in column 9. ncol is also 9, so either something is not assigned to the last column of the chunk, which then becomes NaN, or ncol should be 8 instead of 9 for this chunk. It will be clearer if @amametjanov changes all his "if" conditions to look like:

if (cam_in(lchnk)%cflx(i,1) /= cam_in(lchnk)%cflx(i,1)) then
         write(iulog,*) 'NaN in clfx',i,lchnk,ncol,ieflx_opt,cam_in(lchnk)%cflx(i,1)
endif
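The same per-element reporting could be factored into a small helper so that each field needs only one call. This is a sketch under assumptions: the module and subroutine names are hypothetical, the data are made up, and `ieee_is_nan` is used in place of the `x /= x` idiom, which some aggressive floating-point optimization settings may remove.

```fortran
module nan_report_mod
  use, intrinsic :: iso_fortran_env, only: real64, output_unit
  use, intrinsic :: ieee_arithmetic, only: ieee_is_nan
  implicit none
contains
  ! Report each NaN among the first ncol entries of a chunk-local field.
  subroutine report_nans(name, field, ncol, lchnk, unit)
    character(len=*), intent(in) :: name
    real(real64),     intent(in) :: field(:)
    integer,          intent(in) :: ncol, lchnk, unit
    integer :: i
    do i = 1, ncol
      if (ieee_is_nan(field(i))) then
        write(unit,*) 'NaN in ', name, ': column', i, 'chunk', lchnk, 'ncol', ncol
      end if
    end do
  end subroutine report_nans
end module nan_report_mod

program demo_report_nans
  use nan_report_mod
  use, intrinsic :: ieee_arithmetic, only: ieee_value, ieee_quiet_nan
  implicit none
  real(real64) :: ts(16)

  ! Made-up chunk: 9 active columns, with the 9th left as NaN.
  ts = 0.0_real64
  ts(1:8) = 300.0_real64
  ts(9) = ieee_value(1.0_real64, ieee_quiet_nan)

  call report_nans('ts', ts, 9, 114015, output_unit)
end program demo_report_nans
```

Printing only the offending index keeps the log short and makes it obvious whether the NaN always sits in the last active column of a chunk.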

@singhbalwinder
Contributor

Sorry, I forgot to mention: I changed ":" to "i" in the cflx subscript.

@singhbalwinder
Contributor

ncol should be 8 instead of 9 for this chunk

Sorry, I don't think the number of columns should be 8 instead of 9. One possible reason for this behavior is that cam_in is passed to some subroutine with intent(out) and only 8 columns are updated in that subroutine. That would leave the 9th column with undefined values, which can be NaN.

@worleyph
Contributor

Chunks can have different numbers of columns (and need not use all of the available space in a chunk). If some routine is using pcols instead of ncol, this could cause the situation @singhbalwinder is conjecturing about.
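The conjectured failure mode can be illustrated with a small standalone sketch. Everything here is hypothetical (the routine name, the sizes, and the off-by-one in the caller); it is not the code path that was eventually fixed. The array is poisoned with NaN beforehand so that any column the routine fails to set is easy to see.

```fortran
program chunk_padding_demo
  use, intrinsic :: iso_fortran_env, only: real64
  use, intrinsic :: ieee_arithmetic, only: ieee_value, ieee_quiet_nan, ieee_is_nan
  implicit none
  integer, parameter :: pcols = 16   ! allocated columns per chunk
  integer :: ncol                    ! active columns in this chunk
  real(real64) :: ts(pcols)

  ncol = 9

  ! Poison the whole chunk so unset columns are detectable.
  ts = ieee_value(1.0_real64, ieee_quiet_nan)

  ! Hypothetical bug: the routine is told to fill only 8 of the 9 columns.
  call fill_surface_temp(ts, ncol - 1)

  ! Downstream code reads 1:ncol and therefore picks up the unset 9th column.
  print *, 'unset columns within 1:ncol:', count(ieee_is_nan(ts(1:ncol)))

contains

  ! Hypothetical stand-in for a routine that updates surface temperature.
  subroutine fill_surface_temp(t, n)
    real(real64), intent(inout) :: t(:)
    integer,      intent(in)    :: n
    t(1:n) = 300.0_real64
  end subroutine fill_surface_temp

end program chunk_padding_demo
```

In a real build the unset column would hold whatever was in memory rather than a deliberate NaN, which could explain why the failure appears only on some runs.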

@amametjanov
Member Author

This should be fixed by #2208. I was not able to reproduce the error after that PR.

@rljacob
Member

rljacob commented Apr 5, 2018

Az, if these are KNL-specific bugs with KNL-specific solutions, please change the title to include "on KNL".

@rljacob
Member

rljacob commented Apr 5, 2018

To help searches by future users.

@amametjanov changed the title from "NaNs in ieflx_gmean calc" to "NaNs in ieflx_gmean calc on KNLs with Intel compiler" on Apr 5, 2018
brhillman pushed a commit that referenced this issue Feb 16, 2023
…torrange_in_eamxx

Automatically Merged using E3SM Pull Request AutoTester
PR Title: Use TeamVectorRange for flat-loops
PR Author: tcclevenger