Cannot find NaN in mo_drydep.F90 #2205

Closed
amametjanov opened this issue Mar 28, 2018 · 5 comments

@amametjanov
Member

Logging an issue to track the error in ne120-wcycl runs on Cori-KNL:

 3003:  dvel_inti: cannot find                      NaN  at j,pos_min,diff_min =
 3003:       410572         -99   10.0000000000000
 3003:  dvel_inti: imin,nlat_lai =            1         360
 3003:  dvel_inti: lat_lai
 3003:  -89.750     -89.250     -88.750     -88.250     -87.750     -87.250     -86.750     -86.250     -85.750     -85.250
 3003: Image              PC                Routine            Line        Source
 3003: e3sm.exe           00000000054BCF36  Unknown               Unknown  Unknown
 3003: e3sm.exe           00000000037222DC  shr_abort_mod_mp_         114  shr_abort_mod.F90
 3003: e3sm.exe           0000000001AC1574  mo_drydep_mp_dvel        1956  mo_drydep.F90
 3003: e3sm.exe           0000000001A94C39  mo_chemini_mp_che         212  mo_chemini.F90
 3003: e3sm.exe           00000000019230A2  chemistry_mp_chem         982  chemistry.F90
 3003: e3sm.exe           000000000061554A  physpkg_mp_phys_i         793  physpkg.F90
 3003: e3sm.exe           00000000004F09D2  cam_comp_mp_cam_i         178  cam_comp.F90
 3003: e3sm.exe           00000000004E5392  atm_comp_mct_mp_a         260  atm_comp_mct.F90
 3003: e3sm.exe           000000000042BC3F  component_mod_mp_         267  component_mod.F90
 3003: e3sm.exe           0000000000419C31  cime_comp_mod_mp_        1174  cime_comp_mod.F90
 3003: e3sm.exe           0000000000428AAF  MAIN__                     92  cime_driver.F90
 3003: e3sm.exe           000000000040B01E  Unknown               Unknown  Unknown
 3003: e3sm.exe           00000000055EDC19  Unknown               Unknown  Unknown
 3003: e3sm.exe           000000000040AF09  Unknown               Unknown  Unknown
 3003: Rank 3003 [Sat Mar 24 22:49:24 2018] [c6-1c2s14n3] application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 3003
 3003: forrtl: error (76): Abort trap signal
 3003: Image              PC                Routine            Line        Source
 3003: e3sm.exe           00000000054C8E2E  Unknown               Unknown  Unknown
 3003: e3sm.exe           0000000004D8C910  Unknown               Unknown  Unknown
 3003: e3sm.exe           000000000548DAAB  Unknown               Unknown  Unknown
 3003: e3sm.exe           00000000055F471A  Unknown               Unknown  Unknown
 3003: e3sm.exe           0000000004F78BA2  Unknown               Unknown  Unknown
 3003: e3sm.exe           000000000502207B  Unknown               Unknown  Unknown
 3003: e3sm.exe           0000000004F4CDB5  Unknown               Unknown  Unknown
 3003: e3sm.exe           0000000003832DEA  shr_mpi_mod_mp_sh        2127  shr_mpi_mod.F90
 3003: e3sm.exe           000000000372238A  shr_abort_mod_mp_          69  shr_abort_mod.F90
 3003: e3sm.exe           0000000001AC1574  mo_drydep_mp_dvel        1956  mo_drydep.F90
 3003: e3sm.exe           0000000001A94C39  mo_chemini_mp_che         212  mo_chemini.F90
 3003: e3sm.exe           00000000019230A2  chemistry_mp_chem         982  chemistry.F90
 3003: e3sm.exe           000000000061554A  physpkg_mp_phys_i         793  physpkg.F90
 3003: e3sm.exe           00000000004F09D2  cam_comp_mp_cam_i         178  cam_comp.F90
 3003: e3sm.exe           00000000004E5392  atm_comp_mct_mp_a         260  atm_comp_mct.F90
 3003: e3sm.exe           000000000042BC3F  component_mod_mp_         267  component_mod.F90
 3003: e3sm.exe           0000000000419C31  cime_comp_mod_mp_        1174  cime_comp_mod.F90
 3003: e3sm.exe           0000000000428AAF  MAIN__                     92  cime_driver.F90
 3003:  ERROR: Unknown error submitted to shr_abort_abort.

Since this happens only in threaded runs, a possible culprit is a missing OMP barrier:

1935        allocate(clat(plat))
1936        call get_horiz_grid_d(plat, clat_d_out=clat)
1937        jl = 1
1938        ju = plat
1939     end if
1940     imin = 1
1941     do j = 1,ju
1942        diff_min = 10._r8
1943        pos_min  = -99
1944        target_lat = clat(j)*r2d
1945        do i = imin,nlat_lai
1946           if( abs(lat_lai(i) - target_lat) < diff_min ) then
1947              diff_min = abs(lat_lai(i) - target_lat)
1948              pos_min  = i
1949           end if
1950        end do
1951        if( pos_min < 0 ) then
1952           write(iulog,*) 'dvel_inti: cannot find ',target_lat,' at j,pos_min,diff_min = ',j,pos_min,diff_min
1953           write(iulog,*) 'dvel_inti: imin,nlat_lai = ',imin,nlat_lai
1954           write(iulog,*) 'dvel_inti: lat_lai'
1955           write(iulog,'(1p,10g12.5)') lat_lai(:)
1956           call endrun
1957        end if
 ...
1992     deallocate( lat_lai, wk_lai, clat, index_season_lai_j)
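
For reference, the symptom in the log (target_lat = NaN, pos_min = -99, diff_min = 10.0) is what this search loop produces whenever target_lat is NaN, because every comparison against NaN is false. Below is a minimal Python sketch of the inner loop, not the Fortran itself. The helper name `nearest_lat_index` is hypothetical, and the uniform 0.5-degree grid beyond the ten values printed in the log is an assumption:

```python
import math

def nearest_lat_index(target_lat, lat_lai, imin=1):
    """Mimic the inner loop of the search: return (pos_min, diff_min), 1-based."""
    diff_min = 10.0   # same initial tolerance as the Fortran (10._r8)
    pos_min = -99     # same "not found" sentinel
    for i in range(imin, len(lat_lai) + 1):
        diff = abs(lat_lai[i - 1] - target_lat)
        if diff < diff_min:   # always False when target_lat is NaN
            diff_min = diff
            pos_min = i
    return pos_min, diff_min

# Assumed grid: 360 latitudes from -89.75 in 0.5-degree steps
# (only the first ten values appear in the log).
lat_lai = [-89.75 + 0.5 * k for k in range(360)]

ok = nearest_lat_index(-85.1, lat_lai)       # finite target: a match is found
bad = nearest_lat_index(math.nan, lat_lai)   # NaN target: sentinel survives
print(ok[0], bad)
```

With a NaN target the function returns the untouched sentinel pair, which matches the values printed before the abort. So the search itself may be behaving as written, and the open question is where the NaN in clat comes from.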

Proposed patch to diagnose this:

> git diff components/cam/src/chemistry/mozart/mo_drydep.F90
diff --git a/components/cam/src/chemistry/mozart/mo_drydep.F90 b/components/cam/src/chemistry/mozart/mo_drydep.F90
index bca17d4f5..01e679977 100644
--- a/components/cam/src/chemistry/mozart/mo_drydep.F90
+++ b/components/cam/src/chemistry/mozart/mo_drydep.F90
@@ -1559,7 +1559,7 @@ contains
     integer :: k, num_max, k_max
     integer :: num_seas(5)
     integer :: plon, plat
-    integer :: ierr
+    integer :: ierr, ithr

     real(r8)              :: spc_mass
     real(r8)              :: diff_min, target_lat
@@ -1938,6 +1938,10 @@ contains
        ju = plat
     end if
     imin = 1
+    ithr = 0
+#ifdef _OPENMP
+    ithr = omp_get_thread_num()
+#endif
     do j = 1,ju
        diff_min = 10._r8
        pos_min  = -99
@@ -1950,7 +1954,7 @@ contains
        end do
        if( pos_min < 0 ) then
           write(iulog,*) 'dvel_inti: cannot find ',target_lat,' at j,pos_min,diff_min = ',j,pos_min,diff_min
-          write(iulog,*) 'dvel_inti: imin,nlat_lai = ',imin,nlat_lai
+          write(iulog,*) 'dvel_inti: imin,nlat_lai,ithr = ',imin,nlat_lai,ithr
           write(iulog,*) 'dvel_inti: lat_lai'
           write(iulog,'(1p,10g12.5)') lat_lai(:)
           call endrun
@@ -1989,6 +1993,7 @@ contains
        end do
     end do

+    !$OMP BARRIER
     deallocate( lat_lai, wk_lai, clat, index_season_lai_j)

   end subroutine dvel_inti_xactive

Peter (@PeterCaldwell), can you apply this patch in your runs to see if the error recurs? Thanks.

@mt5555
Contributor

mt5555 commented Mar 28, 2018

@amametjanov: is this a workaround, or an actual bug? It seems that if those variables are thread-private, the barrier shouldn't be needed. But if they are shared, shouldn't the deallocate also be inside an OMP MASTER?

For performance reasons, it seems like a bad idea to be allocating and deallocating arrays inside the physics. I hope this is an initialization routine?

@amametjanov
Member Author

Yes, this is supposed to be part of single-threaded initialization. The patch is just meant to rule out multi-threading as the cause: all ithr values should come out as 0, and !$OMP BARRIER should have no effect. If threading is not the cause, then call get_horiz_grid_d is possibly returning a NaN.
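
To separate the two hypotheses, one option is to scan the returned latitudes for non-finite values before the search loop runs. Below is a rough Python sketch of that check. The helper name `first_non_finite` is hypothetical; in the Fortran, the equivalent test could use `ieee_is_nan` from the standard `ieee_arithmetic` module.

```python
import math

def first_non_finite(values):
    """Return the 1-based index of the first NaN/Inf entry, or 0 if all are finite."""
    for j, v in enumerate(values, start=1):
        if not math.isfinite(v):
            return j
    return 0

# Made-up latitudes in radians, for illustration only.
clat_ok = [-1.5, 0.0, 1.5]
clat_bad = [-1.5, math.nan, 1.5]

print(first_non_finite(clat_ok))
print(first_non_finite(clat_bad))
```

If such a check flags an entry right after the grid query with ithr equal to 0, that would point at the returned data rather than at threading.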

@PeterCaldwell
Contributor

Ok, I will add this change to my ongoing cori run for its next submission. Thanks for working on this.

@ndkeen
Contributor

ndkeen commented Mar 30, 2018

I got the same error with SMS.ne30_oECv3_ICG.A_WCYCL1850S. Others have seen it with highres on cori, but this is the first time I've seen it, and the first time with lowres. This is with master as of March 29th (7e53f3e). I have many recent successful tests with repos prior to that.

@amametjanov
Member Author

Should be fixed by #2208. If the error recurs, please re-open.
