Divide by zero on KNL in lapack/blas with many F compsets (OK without -fpe0) #1183

ndkeen · 2016-12-15T15:45:00Z

In debug, we throw the -fpe0 flag for intel builds which will catch certain floating-point exceptions, including divide by zero. I've seen this for a while, but my solution has been to simply remove this flag and carry on. In past work, I've seen this flag cause problems within a solver and so we would disable it from within the code before calling solvers. However, maybe someone already knows about this, wants to know, or has a fix.

It looks like it happens with all tests like this, here is one example:
SMS_D_Ld1.ne16_ne16.FC5ATMMOD.cori-knl_intel

60: forrtl: error (73): floating divide by zero
60: Image              PC                Routine            Line        Source
...
60: acme.exe           00000000007D0182  atm_comp_mct_mp_a         341  atm_comp_mct.F90
60: acme.exe           000000000044C5DC  component_mod_mp_         227  component_mod.F90
60: acme.exe           0000000000428DE3  Unknown               Unknown  Unknown
60: acme.exe           0000000000443508  MAIN__                     62  cesm_driver.F90
60: acme.exe (deleted  000000000040A8DE  Unknown               Unknown  Unknown

The text was updated successfully, but these errors were encountered:

mt5555 · 2016-12-16T16:51:22Z

in the latest ACME master, line 341, atm_comp_mct.F90 is:
call cam_run1 ( cam_in, cam_out )

so the floating point divide by zero must be lower down in the call stack

ndkeen · 2016-12-16T16:59:47Z

Yea, Erich pointed out this as well -- I omitted many lines that just said Unknown where the ... is. At one pointed, I recall tracking this down further and found that it was in a lapack/scalapack call. I no longer remember how I did that. I will try to update with more information. I might have used a debugger. Just to be clear, with -fpe0, all of the F cases I've run debug will stop for me, while removing this gets around this. And this is happening on cori-knl, but I'm pretty sure it also happens on cori-haswell. That is, I think the issue is happening with Intel v17.

ndkeen · 2016-12-16T18:00:50Z

OK, I ran another one and it looks like some tasks are pointing directly to lapack call.

56: Image              PC                Routine            Line        Source             
56: acme.exe           000000000A461F21  Unknown               Unknown  Unknown
56: acme.exe (deleted  000000000A46005B  Unknown               Unknown  Unknown
56: acme.exe           000000000A40E9D4  Unknown               Unknown  Unknown
56: acme.exe (deleted  000000000A40E7E6  Unknown               Unknown  Unknown
56: acme.exe           000000000A38CF06  Unknown               Unknown  Unknown
56: acme.exe (deleted  000000000A3990D9  Unknown               Unknown  Unknown
56: acme.exe (deleted  000000000A045DA0  Unknown               Unknown  Unknown
56: acme.exe           00000000088F91B8  Unknown               Unknown  Unknown
56: acme.exe           00000000085B7DF3  Unknown               Unknown  Unknown
56: acme.exe           0000000008477C02  Unknown               Unknown  Unknown
56: acme.exe (deleted  00000000083C19DE  Unknown               Unknown  Unknown
56: acme.exe           0000000003A0AAC5  lapack_wrap_mp_tr  
56:        269  lapack_wrap.F90
56: acme.exe (deleted  00000000036825CD  Unknown               Unknown  Unknown
56: acme.exe           000000000367DE98  Unknown               Unknown  Unknown
56: acme.exe           000000000363DB01  Unknown               Unknown  Unknown
56: acme.exe (deleted  00000000020E5541  Unknown               Unknown  Unknown
56: acme.exe (deleted  0000000000CF459C  Unknown               Unknown  Unknown
56: acme.exe           0000000000CCEB96  Unknown               Unknown  Unknown
56: acme.exe (deleted  00000000008008B4  Unknown               Unknown  Unknown
56: acme.exe (deleted  00000000007D0182  Unknown               Unknown  Unknown
56: acme.exe           000000000044C5DC  Unknown               Unknown  Unknown
56: acme.exe           0000000000428DE3  Unknown               Unknown  Unknown
56: acme.exe           0000000000443508  Unknown               Unknown  Unknown
56: acme.exe (deleted  000000000040A8DE  Unknown               Unknown  Unknown
56: acme.exe           000000000A47EBF0  Unknown               Unknown  Unknown
56: acme.exe
56:            000000000040A7C7  Unknown               Unknown  Unknown

which is:

      call dgtsv( ndim, nrhs, subd(2:ndim), diag, supd(1:ndim-1),  & 
                  rhs, ndim, info )

ndkeen · 2017-01-17T02:23:28Z

I verified that cori-haswell with Intel v17 is OK with this. It's only hitting a floating-point issue on KNL's. As stated, I suspect the issue is within the low-level solver.

ndkeen · 2017-01-17T18:24:20Z

I also verified that simply backing intel compiler version to 16 does not make this go away.

ndkeen · 2017-01-17T18:57:02Z

Here is another div-by-zero that was caught with Intels -fpe0 on Cori. This one has more information.

13: forrtl: error (73): floating divide by zero
13: Image              PC                Routine            Line        Source             
13: acme.exe           000000000A113D21  Unknown               Unknown  Unknown
13: acme.exe (deleted  000000000A111E5B  Unknown               Unknown  Unknown
13: acme.exe (deleted  000000000A0C07D4  Unknown               Unknown  Unknown
13: acme.exe           000000000A0C05E6  Unknown               Unknown  Unknown
13: acme.exe (deleted  000000000A03ED06  Unknown               Unknown  Unknown
13: acme.exe (deleted  000000000A04AED9  Unknown               Unknown  Unknown
13: acme.exe (deleted  0000000009CF7BA0  Unknown               Unknown  Unknown
13: acme.exe (deleted  00000000085AAFB8  Unknown               Unknown  Unknown
13: acme.exe           0000000008269BF3  Unknown               Unknown  Unknown
13: acme.exe           0000000008129952  Unknown               Unknown  Unknown
13: acme.exe (deleted  000000000807372E  Unknown               Unknown  Unknown
13: acme.exe           000000000384214D  lapack_wrap_mp_tr  
13:        269  lapack_wrap.F90
13: acme.exe           000000000354400D  advance_windm_eds        1141  advance_windm_edsclrm_module.F90
13: acme.exe           000000000353F8D8  advance_windm_eds         486  advance_windm_edsclrm_module.F90
13: acme.exe           00000000034FF541  advance_clubb_cor        1940  advance_clubb_core_module.F90
13: acme.exe           00000000020C8E99  Unknown               Unknown  Unknown
13: acme.exe           0000000000CDCAA0  physpkg_mp_tphysb        2362  physpkg.F90
13: acme.exe           0000000000CB709A  physpkg_mp_phys_r        1008  physpkg.F90
13: acme.exe (deleted  00000000008008B4  Unknown               Unknown  Unknown
13: acme.exe (deleted  00000000007D0182  Unknown               Unknown  Unknown
13: acme.exe (deleted  000000000044C5DC  Unknown               Unknown  Unknown
13: acme.exe           0000000000428DE3  Unknown               Unknown  Unknown
13: acme.exe           0000000000443508  MAIN__                     62  cesm_driver.F90
13: acme.exe           000000000040A8DE  Unknown               Unknown  Unkn
13: own
13: acme.exe           000000000A1309F0  Unknown               Unknown  Unknown
13: acme.exe (deleted  000000000040A7C7  Unknown               Unknown  Unknown

/global/cscratch1/sd/ndk/acme_scratch/mndk-knl/SMS_D_Ln5.ne16_ne16.FC5AV1C.cori-knl_intel.20170116_132036

ndkeen · 2017-01-17T19:04:24Z

I don't seem to hit this issue on the haswell nodes, only the KNL nodes. So I will look closer at the external libs I am linking in to see if that could be a culprit.

For the SMS_D_Ln5.ne16_ne16.F1850C5AV1C-04.cori-knl_intel problem, I rebuilt using craype-mic-knl module and then linked with a simple -mkl instead of the current way we link in MKL. This was just a test to make sure I'm linking with correct/best MKL, although I've been reassured the way we are doing it is also fine. I still get same divide-by-zero.

Worked with Stephen Leak at NERSC on this yesterday. We tried a few things, but no solutions yet. The test will work with GNU, but a) this forces the use of libsci, not MKL and b) apparently our GNU builds to not throw any floating-point trap flags. I tried adding one (-ffpe-trap=zero) and didn't get anything, however, I would want to make sure it's doing what I want.

ndkeen · 2017-01-19T22:28:18Z

It was suggested that I try the executable that we build for haswell nodes on the KNL nodes. Easy test to learn more about the issue.

I just did this (simply copying an executable that was built/run successfully on haswell nodes over to my bld/ directory) and I see the same divide-by-zero error.

I've tried several other minor experiments, without a good solution. For example, I verified I can build optimized, add the -fpe0 flag, and still hit the same error. I've also tried other MKL linking methods, all with same result. I've used this tool to ensure we have the right flags and tinker with them. https://software.intel.com/en-us/articles/intel-mkl-link-line-advisor

ndkeen · 2017-01-23T19:38:08Z

I exchanged emails with @vlarson and he suggested I try to change the level of debug in the code.
Change two clubb_at_least_debug_level( 2 ) to be clubb_at_least_debug_level( 0 ) in advance_clubb_core_module.F90 and then make a similar change (going from 1 to 0) in advance_windm_edsclrm_module.F90. I still saw the same behavior when compiled with -fpe0 (divide by zero error without addl info). I then recompiled the ATM without the -fpe0 flag, the code completed, and I don't see any addl warning messages in the log files.

Just a snip of the code where it is stopping (it's calling tridag_solve):

   ! Solve tridiagonal system for xm.                                                                                                             
   if ( l_stats_samp .and. ixm_matrix_condt_num > 0 ) then
     call tridag_solvex &
          ( "windm_edsclrm", gr%nz, nrhs, &                            ! Intent(in)                                                               
            lhs(kp1_tdiag,:), lhs(k_tdiag,:), lhs(km1_tdiag,:), rhs, & ! Intent(inout)                                                            
            solution, rcond, err_code )                                ! Intent(out)                                                              

     ! Est. of the condition number of the variance LHS matrix                                                                                    
     call stat_update_var_pt( ixm_matrix_condt_num, 1, 1.0_core_rknd/rcond, &  ! Intent(in)                                                       
                              stats_sfc )                                      ! Intent(inout)                                                    
   else

     call tridag_solve( "windm_edsclrm", gr%nz, nrhs, &                             ! In                                                          
                        lhs(kp1_tdiag,:),  lhs(k_tdiag,:), lhs(km1_tdiag,:), rhs, & ! Inout                                                       
                        solution, err_code )                                        ! Out                                                         
   end if

And in subroutine tridag_solve of lapack_wrap.F90,
This is where it leaves ACME, presumably goes into lapack and hits divide-by-zero

   if ( kind( diag(1) ) == dp ) then
     call dgtsv( ndim, nrhs, subd(2:ndim), diag, supd(1:ndim-1),  &   ! ndk line where we leave ACME and get divide-by-zero with intel on KNL     
                 rhs, ndim, info )

   else if ( kind( diag(1) ) == sp ) then
     call sgtsv( ndim, nrhs, subd(2:ndim), diag, supd(1:ndim-1),  &
                 rhs, ndim, info )

   else
…

ndkeen · 2017-01-25T18:01:47Z

I was able to ask this question directly to Martyn Corden (Intel Compiler guy) who had some suggestions and gave us a possible theory as to why this might happen only on KNL's. In my mind, his solution is the same as what I had originally suggested, which is to find a way to turn off floating-point exception trapping for math library calls such as lapack. I did try one of his suggestions to give all fortran builds "-fpe-all=0" (replacing "-fpe0") and then for the one fortran file that makes the offending call, compile without this flag (did this using Depends.*). And this "worked" -- ie code is running well beyond where it was stopping. However, it seems to be running painfully slow and therefore may not be the ultimate solution. Next I would like to try unmasking floating-point exceptions JUST for certain calls using the API. I have examples of this for C/C++, but not fortran.

This is basically suggesting that there may not be a problem with our code or the data going into the LAPACK call. We just have to find a work-around if we want to continue using -fpe0 for DEBUG builds.

I'm pasting in the email from Martyn.

Noel,
              It sounds like a divide-by-zero is generated inside LAPACK, probably by the vectorizer, e.g. from code something like this:
   DO J=1,N
      IF(A(J).GT.0.)  B(J) = 1.0/A(J)
   ENDDO

The vectorizer assumes the default floating-point environment, in which floating-point exceptions are masked.
It may choose to calculate 1./A(J) for all values of J, but store into B(J) only for those values of J for which A(J)>0. So any values A(J)=0. might cause a divide-by-zero exception if exceptions are unmasked.
-fpe0 causes exceptions to be unmasked once at the start of the main program. They remain unmasked for the entire application unless you make a call to mask them again.

You say this happens only on KNL. A perhaps more likely variation is 
   DO J=1,N
       B(J) = 1.0/A(J)        ! double precision
   ENDDO

Suppose N=1607, for example. Iterations 1-1600 will be performed in a vectorized kernel.
Iterations 1601-1607 will be performed in a separate remainder loop. On KNL, the compiler may decide this is worth vectorizing. It may perform the division for values of J from 1601 to 1608, (even if it is above the upper array bound), but mask off the last element in the store to B. If the memory location corresponding to A(1608) contains zero, this may result in a divide by zero exception.

One solution would be to compile LAPACK with -fp-speculation safe. This would work for user code, but LAPACK is in MKL and you don't have the source.

The other solution is for exceptions to be masked when you call LAPACK, but unmasked for the rest of the application you are trying to debug. The quick and dirty way to do this is instead of using -fpe0, (which only affects the Fortran main program, nothing else), to use -fpe-all=0  for all source files except those that contain calls to LAPACK. That unmasks exceptions at the star of each routine in those source files, and masks them again at the exit from each routine. This might have some impact on performance, but that probably wouldn't matter if you were debugging.

The more precise way to do this is with explicit calls to the IEEE_EXCEPTIONS module of Fortran2003.
You can still build with -fpe0, so that the application starts with exceptions unmasked.
Before each call to LAPACK, you mask them, and after the return, you unmask them again.

Here's an example, (with an explicit divide by zero, not a vectorizer speculation):

program test
 integer, parameter :: N=16
 real, dimension(N) :: x=(/(mod(i,8),i=1,N)/), y

 call sub1(x,y,N)
 print *, 'Y = ', y(4:N:4)
 call sub2(x,y,N)
 print *, 'Y = ', y(4:N:4)
end program test

subroutine sub1(x,y,N)
 use, intrinsic :: ieee_arithmetic
 real, dimension(N) :: x,y
 call ieee_set_halting_mode(ieee_usual, .false.)  
 call sub2(x,y,N)
 call ieee_set_halting_mode(ieee_usual, .true.)
end

subroutine sub2(x,y,N)
 real, dimension(N) :: x,y
 y = 1. / x  
end

ifort -fpe0 -xmic-avx512 -fno-inline -traceback test_div.f90; ./a.out
Y =   0.2500000           Infinity  0.2500000           Infinity
forrtl: error (73): floating divide by zero
Image              PC                Routine            Line        Source
a.out              000000000047D301  Unknown               Unknown  Unknown
a.out              000000000047B43B  Unknown               Unknown  Unknown
a.out              0000000000448284  Unknown               Unknown  Unknown
a.out              0000000000448096  Unknown               Unknown  Unknown
a.out              0000000000427839  Unknown               Unknown  Unknown
a.out              0000000000403B39  Unknown               Unknown  Unknown
libpthread-2.17.s  00007F3011615100  Unknown               Unknown  Unknown
a.out              0000000000402F52  sub2_..0                   21  test_div.f90
a.out              0000000000402B7D  MAIN__                      7  test_div.f90
a.out              0000000000402A5E  Unknown               Unknown  Unknown
libc-2.17.so       00007F3011062B15  __libc_start_main     Unknown  Unknown
a.out              0000000000402969  Unknown               Unknown  Unknown
Aborted (core dumped)

The divide-by-zero in the first call to sub2 is protected, but the second is not.

Actually, I just came up with a slightly better example, which I am attaching.
In this case, the exception is due to speculation and it can be suppressed by -fp-speculation safe.
Note that -fpe0 itself sets -speculation safe, to avoid this issue, so that if your whole application is compiled with -fp-speculation safe, there is no problem. Your problem arises because your user code is compiled with -fpe0 but LAPACK inside MKL is not.
  Build and run the attached example:

$  ifort -c -xmic-avx512 -traceback sub2.f90
$  ifort -fpe0 -xmic-avx512 -traceback test_div0.f90 sub2.o; ./a.out
Y =   0.2500000      0.0000000E+00  0.2500000      0.0000000E+00
forrtl: error (73): floating divide by zero
Image              PC                Routine            Line        Source
a.out              000000000047D371  Unknown               Unknown  Unknown
a.out              000000000047B4AB  Unknown               Unknown  Unknown
a.out              00000000004482F4  Unknown               Unknown  Unknown
a.out              0000000000448106  Unknown               Unknown  Unknown
a.out              00000000004278A9  Unknown               Unknown  Unknown
a.out              0000000000403BA9  Unknown               Unknown  Unknown
libpthread-2.17.s  00007F7EB1989100  Unknown               Unknown  Unknown
a.out              0000000000402F98  sub2_                       4  sub2.f90
a.out              0000000000402C03  MAIN__                      7  test_div0.f90
a.out              0000000000402A5E  Unknown               Unknown  Unknown
libc-2.17.so       00007F7EB13D6B15  __libc_start_main     Unknown  Unknown
a.out              0000000000402969  Unknown               Unknown  Unknown
Aborted (core dumped)

You can experiment with commenting or uncommenting the calls to ieee_set_halting_mode or the compiler switches -fpe0 and -fp-speculation safe to see their effect.

ndkeen · 2017-01-25T19:06:50Z

Example I found online of how to use the ieee extensions. Testing this now. Not sure if it needs to be behind guards for certain compilers that cannot support it yet.

!http://fortran71.rssing.com/chan-21772141/all_p35.html 
! ftn -fpp -fpe0 ieee_div_by_zero.f90
program test

   use, intrinsic :: ieee_exceptions
   implicit none

   integer, parameter :: RKIND = 16

   real(RKIND) :: a, b, c
   logical     :: ieee_flag_is_raised

   a = 1.0_RKIND
   b = 0.0_RKIND
   !
   ! Division by zero allowed
   !
   call ieee_set_halting_mode(IEEE_DIVIDE_BY_ZERO, .false.)
   c = a / b
   ieee_flag_is_raised = .false.
   call ieee_get_flag(IEEE_DIVIDE_BY_ZERO, ieee_flag_is_raised)
   print *, "ieee_flag_is_raised=", ieee_flag_is_raised
   print *, "c=", c
   !
   ! Division by zero forbidden
   !
   call ieee_set_halting_mode(IEEE_DIVIDE_BY_ZERO, .true.)
   c = a / b
   call ieee_get_flag(IEEE_DIVIDE_BY_ZERO, ieee_flag_is_raised)
   print *, "ieee_flag_is_raised=", ieee_flag_is_raised
   print *, "c=", c

end program test

ndkeen · 2017-01-25T21:47:07Z

Looks like this is working in real acme code. Will make a PR.

singhbalwinder · 2017-01-25T23:38:05Z

This is great! Thanks Noel for sorting this out.

I have seen issues like this before where libraries were compiled using different flags than the source code which is using them. In most of those cases, the compiler either won't allow me to use those flags or issue warnings about it. Were there any warnings regarding this in the atm.bldlog.* file?

I must be missing something as it seems like the proposed solution of setting IEEE_DIVIDE_BY_ZERO will cause the compiler to skip legitimate divide by zeros also. Is that the case?

ndkeen · 2017-01-25T23:45:43Z

Yes, all this is doing is telling that particular LAPACK call (which is part of Intels MKL library) to NOT catch any floating-point issues (in DEBUG mode). We could make the case that the MKL library should behave better, but we don't have control over that code (or how it is built). If you read what Martyn was saying, it makes sense how this could only happen on KNL's and not actually be a real math issue. Note the library still has checks, but would not fail on this.

We could just add this for cori-knl/intel. Or let the tests continue to fail in DEBUG until another solution is found.

singhbalwinder · 2017-01-25T23:56:04Z

Thanks. I re-read the example above and you are right. I think doing this only for Cori-KNL/Intel makes sense to me. If we see this issue on other platforms, we can look into it further.

ndkeen · 2017-01-26T00:22:51Z

It also looks like I can use a slightly different formulation and get the same results. I'm asking what the difference is. It may be that this is better.

use, intrinsic :: ieee_arithmetic
…
     call ieee_set_halting_mode(ieee_usual, .false.)  
     call dgtsv( ndim, nrhs, subd(2:ndim), diag, supd(1:ndim-1),  &   ! ndk line where we leave ACME and get divide-by-zero with intel on KNL
                 rhs, ndim, info )
     call ieee_set_halting_mode(ieee_usual, .true.)  
…

(whoops, i tested incorrectly and trying again -- i turned it off, then off again, instead of back on)

mt5555 · 2017-01-26T00:24:33Z

should this be inside an #ifdef DEBUG?

in non-debug mode, no reason to call extra functions?

ndkeen · 2017-01-26T00:26:10Z

Yes, you are right -- it should at least be behind #ifdef DEBUG. But should it be behind more is my question (ie just for certain compilers/machines),

ndkeen · 2017-01-27T23:41:17Z

How does this sort of code look?

#ifndef NDEBUG
#ifdef ARCH_MIC_KNL
      ! when floating-point exceptions are turned on, this call was failing with a div-by-zero on KNL with Intel/MKL. Solution 
      ! was to turn off exceptions only here at this call (and only for machine with ARCH_MIC_KNL defined) (github 1183)
      call ieee_set_halting_mode(IEEE_DIVIDE_BY_ZERO, .false.) ! Turn off stopping on div-by-zero only
      !call ieee_set_halting_mode(ieee_usual, .false.)  ! Turn off stopping on all floating-point exceptions
#endif
#endif

I add -DARCH_MIC_KNL to the CPPDEFS for cori-knl. In this way, it only happens for DEBUG builds on this machine. I could also test for the Intel compiler? I did test that these routines work for GNU and they do. I think GNU 5.1 and above (at least) have the ieee_instrinsics.

I have tested this on acme_developer and all of the affected tests pass.

I also exchanged more emails with Martyn which I will paste here again just to keep it all in one place.


Hi Noel,
            The difference between the examples was that one masked only divide-by-zero exceptions, whereas the other masked also floating-invalid and overflow exceptions in addition. So it would also protect against things like log(-1.)   or 0./0. 

            No, MKL is not misbehaving. It, via the compiler, assumes the default floating-point environment. If you change the default floating-point environment, e.g. by changing the rounding mode or by unmasking exceptions, you may get unexpected results.
This applies to all platforms, not just KNL, even though it only happened on KNL for your particular case.
If MKL had to save, set and then restore the floating-point environment for every call, there could be a significant performance impact. 

        Masking exceptions before a library call and re-enabling afterwards sounds like something you might put under a DEBUG flag or macro?

I replied:

OK, your explanation makes sense.  Thanks.
And I have a fix in place that will turn off this FP checking for that one call, for this one machine/compiler.

However, I still have a nagging concern. Which others on the team have also hinted at.
Is there indeed a “real” division by zero or not?  (that our data being sent to that algorithm caused)
I’m wondering if perhaps there should not be, yet the way in which MKL behaves, it is
saying that there is one.  Now, if there really is a div-by-zero, then by all means we
should try to debug and fix.  But this is a test that is part of the regression tests
and is run by many different machines/compliers, etc.

I’ve looked at the nominal source code for the dgtsv routine and I see checks in place for
some divisions, but not for others.

And he said:

I would say that there is no evidence of a "real" divide by zero inside MKL.
You can't distinguish directly between a "real" or a "speculated" divide by zero without recompiling that part of MKL.
However, speculated divides-by-zero are expected whereas real divides-by-zero are not.
If a real divide-by-zero were occurring as part of the main computation, I would expect that MKL  would return an error flag (such as for a singular matrix) and/or the results would contain infinities or NaNs. Once you have an infinity as an intermediate result, it's hard to make it go away, even multiplying by zero only makes it a NaN.

There are Fortran 2003 intrinsic functions such as ieee_is_nan() and ieee_is_finite() to let you test the returned data.
You should be able to adapt my little test program to try them out.

singhbalwinder · 2017-01-28T00:04:14Z

It looks good to me! I think ACME already needs GNU 5.0 or later version (https://acme-climate.atlassian.net/wiki/display/SE/Configuration+Management?focusedCommentId=7373315#comment-7373315) so we should be good there.

@jedwards4b

Allow passing file and line to shr_assert Up until now, many calls to shr_assert (often done through the SHR_ASSERT macro) gave file and line information via a call to shr_log_errMsg. However, the shr_log_errMsg function is terrible for performance for some reason. This PR adds optional arguments to the shr_assert interface (and related macros) so that you can pass the file and line directly into that function. Then the cost of building an error message is only borne if you actually abort. So, for example, you can replace SHR_ASSERT(condition, shr_log_errMsg(__FILE__, __LINE__)) with SHR_ASSERT_FL(condition, __FILE__, __LINE__) (where 'FL' stands for FileLine). I have done this replacement for all relevant cime Fortran code. Test suite: cime Fortran unit tests and ./create_test cime_developer on yellowstone; also ran debug tests on yellowstone-intel, yellowstone-pgi, yellowstone-gnu and hobart-nag in a version rebased onto cime5.2.0-alpha.9 Test baseline: N/A Test namelist changes: none Test status: bit for bit User interface changes?: none Code review: @jedwards4b

…ons module to turn off FP trapping that is enabled by -fpe0. Only for DEBUG builds, only for cori-knl, and only for Intel compiler. Fixes #1183 [BFB]

Disable FP checks for a certain LAPACK routine, by using ieee_exceptions module to turn off FP trapping that is enabled by -fpe0. Only for DEBUG builds, only for cori-knl, and only for Intel compiler. Fixes #1183 [BFB]

Disable FP checks for a certain LAPACK routine (only for cori-knl, only DEBUG, only Intel compiler) Disable FP checks for a certain LAPACK routine, by using ieee_exceptions module to turn off FP trapping that is enabled by -fpe0. Only for DEBUG builds, only for cori-knl, and only for Intel compiler. Fixes #1183 [BFB]

@cameronsmith1

* master-main: (5708 commits) Revert "Merge branch 'jinyuntang/lnd/cifix' (PR #1272)" Update the script version and the update notes Set the walltime to default to get more on Edison in the basic state; hopefully this won't cause test failures due to not getting through the queue Give project a default value, in which case CIME will determine the project to use Also correct the capitalization of ACME for the simulations directory Add a comment back Fix usage of xmlchange for the customknl configuration Rename case_root_dir to acme_simulations_dir Add vowels to 'bld' Change the default case_root_dir to acme_simulations Change the default configuration from 'S' to 'M' to get a working PE layout on Edison compare_test_results: should work much better when cases are not writable Revert "compare_test_results should not require write access to case dirs" compare_test_results should not require write access to case dirs Change handle-preexisting-dirs to use the existing ones rather than abort on finding them Fix issues with checking whether a variables is defined Fix missing endif Add comments for the section explicitly used to determine where the output should go Fix the directory structure; use the proper variables for EXEROOT and RUNDIR Jenkins jobs should always have umask 002 Remove run_name from the list of configuration variables Add in @cameronsmith1 's hack for run_acme's preferred directory structure Add in a warning if the default output location is the users home directory Add in setting the start date for simulations which require it bless_test_results should work without write access to case Move last remaining files out of cime_config jenkins_generic_job add better support for changing scratch root Disable FP checks for a certain LAPACK routine, by using ieee_exceptions module to turn off FP trapping that is enabled by -fpe0. Only for DEBUG builds, only for cori-knl, and only for Intel compiler. Fixes #1183 [BFB] change underscore to hyphens Minor change to cori scratch dirs Added a S,M,L layouts for ne4_oQU240 ...

…ons module to turn off FP trapping that is enabled by -fpe0. Only for DEBUG builds, only for cori-knl, and only for Intel compiler. Fixes #1183 [BFB]

Disable FP checks for a certain LAPACK routine (only for cori-knl, only DEBUG, only Intel compiler) Disable FP checks for a certain LAPACK routine, by using ieee_exceptions module to turn off FP trapping that is enabled by -fpe0. Only for DEBUG builds, only for cori-knl, and only for Intel compiler. Fixes #1183 [BFB]

…ons module to turn off FP trapping that is enabled by -fpe0. Only for DEBUG builds, only for cori-knl, and only for Intel compiler. Fixes #1183 [BFB]

Disable FP checks for a certain LAPACK routine (only for cori-knl, only DEBUG, only Intel compiler) Disable FP checks for a certain LAPACK routine, by using ieee_exceptions module to turn off FP trapping that is enabled by -fpe0. Only for DEBUG builds, only for cori-knl, and only for Intel compiler. Fixes #1183 [BFB]

…ons module to turn off FP trapping that is enabled by -fpe0. Only for DEBUG builds, only for cori-knl, and only for Intel compiler. Fixes #1183 [BFB]

Disable FP checks for a certain LAPACK routine (only for cori-knl, only DEBUG, only Intel compiler) Disable FP checks for a certain LAPACK routine, by using ieee_exceptions module to turn off FP trapping that is enabled by -fpe0. Only for DEBUG builds, only for cori-knl, and only for Intel compiler. Fixes #1183 [BFB]

amametjanov · 2018-03-25T23:55:37Z

Okay, I see this issue being referenced in the source, so resurrecting it. In non-debug (release) mode, and additional compiler flags

ifeq ($(DEBUG), FALSE)
   FFLAGS +=  -O2 -debug minimal -qno-opt-dynamic-align -check uninit -check bounds -check pointers -fpe0 -check noarg_temp_created -check format -check output_conversion
   CFLAGS +=  -O2 -debug minimal
endif

ne120-wcycl run (SMS_Ld5.ne120_oRRS18v3_ICG.A_WCYCL1950S_CMIP6_HR.theta_intel.cam-cosplite with COSP and high-frequency IO) got into:

(seq_domain_areafactinit) : min/max drv2mdl   0.999999923354153       1.00000007509469    areafact_o_OCN
forrtl: error (73): floating divide by zero
Image              PC                Routine            Line        Source
e3sm.exe           000000000227ACE6  lapack_wrap_mp_tr         282  lapack_wrap.F90
e3sm.exe           000000000210D56E  advance_windm_eds        1141  advance_windm_edsclrm_module.F90
e3sm.exe           000000000210623E  advance_windm_eds         486  advance_windm_edsclrm_module.F90
e3sm.exe           00000000020EBA88  advance_clubb_cor        2019  advance_clubb_core_module.F90
e3sm.exe           00000000016ABC08  clubb_intr_mp_clu        1855  clubb_intr.F90
e3sm.exe           000000000084D44A  physpkg_mp_tphysb        2461  physpkg.F90
e3sm.exe           0000000000846C2B  physpkg_mp_phys_r        1027  physpkg.F90
e3sm.exe           00000000005FC48D  cam_comp_mp_cam_r         250  cam_comp.F90
e3sm.exe           00000000005E8B00  atm_comp_mct_mp_a         348  atm_comp_mct.F90
e3sm.exe           0000000000439581  component_mod_mp_         267  component_mod.F90
e3sm.exe           0000000000423FF5  cime_comp_mod_mp_        1958  cime_comp_mod.F90
e3sm.exe           0000000000430BFC  MAIN__                     92  cime_driver.F90

And the surrounding source is

269 !-----------------------------------------------------------------------
270 !       *** The LAPACK Routine ***
271 !       SUBROUTINE DGTSV( N, NRHS, DL, D, DU, B, LDB, INFO )
272 !-----------------------------------------------------------------------
273 
274     if ( kind( diag(1) ) == dp ) then
275 #ifndef NDEBUG
276 #if defined(ARCH_MIC_KNL) && defined(CPRINTEL)
277       ! when floating-point exceptions are turned on, this call was failing with a div-by-zero on KNL with Intel/MKL. Solution 
278       ! was to turn off exceptions only here at this call (and only for machine with ARCH_MIC_KNL defined) (github 1183)
279       call ieee_set_halting_mode(IEEE_DIVIDE_BY_ZERO, .false.) ! Turn off stopping on div-by-zero only
280 #endif
281 #endif
282       call dgtsv( ndim, nrhs, subd(2:ndim), diag, supd(1:ndim-1),  &
283                   rhs, ndim, info )
284 #ifndef NDEBUG
285 #if defined(ARCH_MIC_KNL) && defined(CPRINTEL)
286       call ieee_set_halting_mode(IEEE_DIVIDE_BY_ZERO, .true.) ! Turn back on stopping on div-by-zero only
287 #endif
288 #endif

So we could be getting NaNs, because of this issue. I am commenting out #ifndef NDEBUG and continuing the run to see if NaNs disappear! (NDEBUG is defined in release mode and ARCH_MIC_KNL and CPRINTEL are also defined).

ndkeen · 2018-04-19T20:51:50Z

I think the changes in the PR that addresses this issue should be put back the way they were.
I'm seeing odd behavior when we both add -fpe0, or remove -fpe0 and remove the #ifndef NDEBUG.

ndkeen · 2020-07-09T00:06:34Z

I think we can close this. Certainly we run DEBUG F cases frequently.

ndkeen assigned mt5555 and ndkeen Dec 15, 2016

ndkeen changed the title ~~Divide by zero via -fpe0 and intel v17 with many F compsets~~ Divide by zero on KNL in lapack/blas with many F compsets (OK without -fpe0) Jan 18, 2017

ndkeen mentioned this issue Feb 5, 2017

Improve default PE layouts on Edison #1241

Merged

ndkeen mentioned this issue Feb 14, 2017

Disable FP checks for a certain LAPACK routine. #1268

Closed

rljacob added Atmosphere bug Cori labels Feb 28, 2017

singhbalwinder mentioned this issue Mar 1, 2017

CAM/CLUBB detects failures but does not abort #1292

Closed

ndkeen mentioned this issue Apr 21, 2017

Disable FP checks for a certain LAPACK routine (only for cori-knl, only DEBUG, only Intel compiler) #1438

Closed

ndkeen mentioned this issue Apr 24, 2017

Disable FP checks for a certain LAPACK routine (only for cori-knl, only DEBUG, only Intel compiler) #1444

Merged

ndkeen closed this as completed in #1444 Apr 26, 2017

amametjanov reopened this Mar 25, 2018

amametjanov mentioned this issue Mar 29, 2018

Turn on FP exception checking on KNL nodes with Intel compiler #2208

Merged

singhbalwinder mentioned this issue Dec 14, 2018

Clamp number of LND MPI ranks when over the number of available chunks #1952

Closed

ndkeen closed this as completed Nov 11, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Divide by zero on KNL in lapack/blas with many F compsets (OK without -fpe0) #1183

Divide by zero on KNL in lapack/blas with many F compsets (OK without -fpe0) #1183

ndkeen commented Dec 15, 2016

mt5555 commented Dec 16, 2016

ndkeen commented Dec 16, 2016 •

edited

Loading

ndkeen commented Dec 16, 2016 •

edited

Loading

ndkeen commented Jan 17, 2017

ndkeen commented Jan 17, 2017

ndkeen commented Jan 17, 2017

ndkeen commented Jan 17, 2017 •

edited

Loading

ndkeen commented Jan 19, 2017 •

edited

Loading

ndkeen commented Jan 23, 2017

ndkeen commented Jan 25, 2017

ndkeen commented Jan 25, 2017 •

edited

Loading

ndkeen commented Jan 25, 2017

singhbalwinder commented Jan 25, 2017

ndkeen commented Jan 25, 2017

singhbalwinder commented Jan 25, 2017

ndkeen commented Jan 26, 2017 •

edited

Loading

mt5555 commented Jan 26, 2017

ndkeen commented Jan 26, 2017

ndkeen commented Jan 27, 2017

singhbalwinder commented Jan 28, 2017 •

edited

Loading

amametjanov commented Mar 25, 2018

ndkeen commented Apr 19, 2018

ndkeen commented Jul 9, 2020

Divide by zero on KNL in lapack/blas with many F compsets (OK without -fpe0) #1183

Divide by zero on KNL in lapack/blas with many F compsets (OK without -fpe0) #1183

Comments

ndkeen commented Dec 15, 2016

mt5555 commented Dec 16, 2016

ndkeen commented Dec 16, 2016 • edited Loading

ndkeen commented Dec 16, 2016 • edited Loading

ndkeen commented Jan 17, 2017

ndkeen commented Jan 17, 2017

ndkeen commented Jan 17, 2017

ndkeen commented Jan 17, 2017 • edited Loading

ndkeen commented Jan 19, 2017 • edited Loading

ndkeen commented Jan 23, 2017

ndkeen commented Jan 25, 2017

ndkeen commented Jan 25, 2017 • edited Loading

ndkeen commented Jan 25, 2017

singhbalwinder commented Jan 25, 2017

ndkeen commented Jan 25, 2017

singhbalwinder commented Jan 25, 2017

ndkeen commented Jan 26, 2017 • edited Loading

mt5555 commented Jan 26, 2017

ndkeen commented Jan 26, 2017

ndkeen commented Jan 27, 2017

singhbalwinder commented Jan 28, 2017 • edited Loading

amametjanov commented Mar 25, 2018

ndkeen commented Apr 19, 2018

ndkeen commented Jul 9, 2020

ndkeen commented Dec 16, 2016 •

edited

Loading

ndkeen commented Dec 16, 2016 •

edited

Loading

ndkeen commented Jan 17, 2017 •

edited

Loading

ndkeen commented Jan 19, 2017 •

edited

Loading

ndkeen commented Jan 25, 2017 •

edited

Loading

ndkeen commented Jan 26, 2017 •

edited

Loading

singhbalwinder commented Jan 28, 2017 •

edited

Loading