I discovered a bug in the FV3 dycore that affects regional runs when the dycore is built in double precision (64-bit). The halo exchange routine `exch_uv` in `model/fv_regional_bc.F90` uses `MPI_REAL` as the MPI datatype regardless of the precision of the actual Fortran variables. This leads to corrupt data being received at the other end.

When compiled with full optimization flags, this manifests as run-to-run differences in the results on Cheyenne with Intel 19 and SGI MPT. For reasons unknown to me, no such run-to-run differences occur on Hera; my assumption is that Intel MPI handles the mismatch differently. On all systems, however, the code crashes with `SIGFPE` messages in the corresponding section of the code (adiabatic init, i.e. calling `fv_dynamics` forward and backward) when the code is compiled with debug flags.

This code can be fixed as follows: see PR #25.
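For illustration, here is a minimal sketch of the underlying issue (not the literal PR #25 change): the MPI datatype must match the Fortran kind of the halo buffers. The routine name `exchange_halo` and the kind parameter `r_grid` are assumptions for the example, not the actual `exch_uv` interface.

```fortran
subroutine exchange_halo(send_buf, recv_buf, n, peer, tag, comm)
  use mpi
  implicit none
  ! In a double-precision dycore build the halo buffers are 64-bit reals.
  integer, parameter :: r_grid = selected_real_kind(12)
  integer, intent(in)            :: n, peer, tag, comm
  real(kind=r_grid), intent(in)  :: send_buf(n)
  real(kind=r_grid), intent(out) :: recv_buf(n)
  integer :: mpi_dtype, ierr, stat(MPI_STATUS_SIZE)

  ! Derive the MPI datatype from the declared kind. Hard-coding MPI_REAL
  ! makes MPI treat each 8-byte value as a 4-byte real, so the receiving
  ! rank unpacks corrupt data.
  if (r_grid == kind(1.0d0)) then
    mpi_dtype = MPI_DOUBLE_PRECISION
  else
    mpi_dtype = MPI_REAL
  end if

  call MPI_Sendrecv(send_buf, n, mpi_dtype, peer, tag, &
                    recv_buf, n, mpi_dtype, peer, tag, &
                    comm, stat, ierr)
end subroutine exchange_halo
```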
Note:

(1) With this bugfix, the debug jobs run further and, for the particular test that I am running, now crash later in the code, in `nh_utils.F90` or `nh_core.F90`, for both the 32-bit and the 64-bit dycore builds.

(2) With this bugfix, we still get b4b differences on Cheyenne with Intel. I don't know yet whether this is due to an issue with our test or a second bug in the dycore.

(3) While I am at it, I would also like to note that there is still a `! FIXME: MPI_COMM_WORLD` comment in routine `exch_uv` (see the sketch after this list).
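For completeness, a hedged sketch of what resolving that FIXME could look like: pass the component communicator into the exchange rather than hard-coding `MPI_COMM_WORLD`. All names below are hypothetical; this is not the actual `exch_uv` interface.

```fortran
subroutine halo_exchange(buf, n, peer, tag, comm)
  use mpi
  implicit none
  integer, intent(in) :: n, peer, tag
  integer, intent(in) :: comm   ! communicator supplied by the caller,
                                ! instead of a hard-coded MPI_COMM_WORLD
  real, intent(inout) :: buf(n)
  integer :: ierr, stat(MPI_STATUS_SIZE)

  ! MPI_REAL is correct here only because the buffer is default real
  ! in this sketch; see the datatype discussion above.
  call MPI_Sendrecv_replace(buf, n, MPI_REAL, peer, tag, peer, tag, &
                            comm, stat, ierr)
end subroutine halo_exchange
```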
Update. As it turns out, this is not the only update required in this particular routine. With additional changes in commit b2b0d33 (or the corrected version 2d0479d), results are now bit-for-bit identical from run to run on Cheyenne using Intel 19.1.1 and SGI MPT 2.19. Further, the runs no longer crash in DEBUG mode!