-
Notifications
You must be signed in to change notification settings - Fork 383
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Divide by zero on KNL in lapack/blas with many F compsets (OK without -fpe0) #1183
Comments
in the latest ACME master, line 341, atm_comp_mct.F90 is: so the floating point divide by zero must be lower down in the call stack |
Yea, Erich pointed out this as well -- I omitted many lines that just said Unknown where the |
OK, I ran another one and it looks like some tasks are pointing directly to lapack call.
which is:
|
I verified that cori-haswell with Intel v17 is OK with this. It's only hitting a floating-point issue on KNL's. As stated, I suspect the issue is within the low-level solver. |
I also verified that simply backing intel compiler version to 16 does not make this go away. |
Here is another div-by-zero that was caught with Intels -fpe0 on Cori. This one has more information.
|
I don't seem to hit this issue on the haswell nodes, only the KNL nodes. So I will look closer at the external libs I am linking in to see if that could be a culprit. For the Worked with Stephen Leak at NERSC on this yesterday. We tried a few things, but no solutions yet. The test will work with GNU, but a) this forces the use of libsci, not MKL and b) apparently our GNU builds to not throw any floating-point trap flags. I tried adding one ( |
It was suggested that I try the executable that we build for haswell nodes on the KNL nodes. Easy test to learn more about the issue. I just did this (simply copying an executable that was built/run successfully on haswell nodes over to my bld/ directory) and I see the same divide-by-zero error. I've tried several other minor experiments, without a good solution. For example, I verified I can build optimized, add the -fpe0 flag, and still hit the same error. I've also tried other MKL linking methods, all with same result. I've used this tool to ensure we have the right flags and tinker with them. https://software.intel.com/en-us/articles/intel-mkl-link-line-advisor |
I exchanged emails with @vlarson and he suggested I try to change the level of debug in the code. Just a snip of the code where it is stopping (it's calling tridag_solve):
And in subroutine tridag_solve of lapack_wrap.F90,
|
I was able to ask this question directly to Martyn Corden (Intel Compiler guy) who had some suggestions and gave us a possible theory as to why this might happen only on KNL's. In my mind, his solution is the same as what I had originally suggested, which is to find a way to turn off floating-point exception trapping for math library calls such as lapack. I did try one of his suggestions to give all fortran builds "-fpe-all=0" (replacing "-fpe0") and then for the one fortran file that makes the offending call, compile without this flag (did this using Depends.*). And this "worked" -- ie code is running well beyond where it was stopping. However, it seems to be running painfully slow and therefore may not be the ultimate solution. Next I would like to try unmasking floating-point exceptions JUST for certain calls using the API. I have examples of this for C/C++, but not fortran. This is basically suggesting that there may not be a problem with our code or the data going into the LAPACK call. We just have to find a work-around if we want to continue using -fpe0 for DEBUG builds. I'm pasting in the email from Martyn.
|
Example I found online of how to use the ieee extensions. Testing this now. Not sure if it needs to be behind guards for certain compilers that cannot support it yet.
|
Looks like this is working in real acme code. Will make a PR. |
This is great! Thanks Noel for sorting this out. I have seen issues like this before where libraries were compiled using different flags than the source code which is using them. In most of those cases, the compiler either won't allow me to use those flags or issue warnings about it. Were there any warnings regarding this in the atm.bldlog.* file? I must be missing something as it seems like the proposed solution of setting |
Yes, all this is doing is telling that particular LAPACK call (which is part of Intels MKL library) to NOT catch any floating-point issues (in DEBUG mode). We could make the case that the MKL library should behave better, but we don't have control over that code (or how it is built). If you read what Martyn was saying, it makes sense how this could only happen on KNL's and not actually be a real math issue. Note the library still has checks, but would not fail on this. We could just add this for cori-knl/intel. Or let the tests continue to fail in DEBUG until another solution is found. |
Thanks. I re-read the example above and you are right. I think doing this only for Cori-KNL/Intel makes sense to me. If we see this issue on other platforms, we can look into it further. |
It also looks like I can use a slightly different formulation and get the same results. I'm asking what the difference is. It may be that this is better.
(whoops, i tested incorrectly and trying again -- i turned it off, then off again, instead of back on) |
should this be inside an #ifdef DEBUG? in non-debug mode, no reason to call extra functions? |
Yes, you are right -- it should at least be behind #ifdef DEBUG. But should it be behind more is my question (ie just for certain compilers/machines), |
How does this sort of code look?
I add I have tested this on acme_developer and all of the affected tests pass. I also exchanged more emails with Martyn which I will paste here again just to keep it all in one place.
I replied:
And he said:
|
It looks good to me! I think ACME already needs GNU 5.0 or later version (https://acme-climate.atlassian.net/wiki/display/SE/Configuration+Management?focusedCommentId=7373315#comment-7373315) so we should be good there. |
Allow passing file and line to shr_assert Up until now, many calls to shr_assert (often done through the SHR_ASSERT macro) gave file and line information via a call to shr_log_errMsg. However, the shr_log_errMsg function is terrible for performance for some reason. This PR adds optional arguments to the shr_assert interface (and related macros) so that you can pass the file and line directly into that function. Then the cost of building an error message is only borne if you actually abort. So, for example, you can replace SHR_ASSERT(condition, shr_log_errMsg(__FILE__, __LINE__)) with SHR_ASSERT_FL(condition, __FILE__, __LINE__) (where 'FL' stands for FileLine). I have done this replacement for all relevant cime Fortran code. Test suite: cime Fortran unit tests and ./create_test cime_developer on yellowstone; also ran debug tests on yellowstone-intel, yellowstone-pgi, yellowstone-gnu and hobart-nag in a version rebased onto cime5.2.0-alpha.9 Test baseline: N/A Test namelist changes: none Test status: bit for bit User interface changes?: none Code review: @jedwards4b
…ons module to turn off FP trapping that is enabled by -fpe0. Only for DEBUG builds, only for cori-knl, and only for Intel compiler. Fixes #1183 [BFB]
Disable FP checks for a certain LAPACK routine, by using ieee_exceptions module to turn off FP trapping that is enabled by -fpe0. Only for DEBUG builds, only for cori-knl, and only for Intel compiler. Fixes #1183 [BFB]
Disable FP checks for a certain LAPACK routine (only for cori-knl, only DEBUG, only Intel compiler) Disable FP checks for a certain LAPACK routine, by using ieee_exceptions module to turn off FP trapping that is enabled by -fpe0. Only for DEBUG builds, only for cori-knl, and only for Intel compiler. Fixes #1183 [BFB]
* master-main: (5708 commits) Revert "Merge branch 'jinyuntang/lnd/cifix' (PR #1272)" Update the script version and the update notes Set the walltime to default to get more on Edison in the basic state; hopefully this won't cause test failures due to not getting through the queue Give project a default value, in which case CIME will determine the project to use Also correct the capitalization of ACME for the simulations directory Add a comment back Fix usage of xmlchange for the customknl configuration Rename case_root_dir to acme_simulations_dir Add vowels to 'bld' Change the default case_root_dir to acme_simulations Change the default configuration from 'S' to 'M' to get a working PE layout on Edison compare_test_results: should work much better when cases are not writable Revert "compare_test_results should not require write access to case dirs" compare_test_results should not require write access to case dirs Change handle-preexisting-dirs to use the existing ones rather than abort on finding them Fix issues with checking whether a variables is defined Fix missing endif Add comments for the section explicitly used to determine where the output should go Fix the directory structure; use the proper variables for EXEROOT and RUNDIR Jenkins jobs should always have umask 002 Remove run_name from the list of configuration variables Add in @cameronsmith1 's hack for run_acme's preferred directory structure Add in a warning if the default output location is the users home directory Add in setting the start date for simulations which require it bless_test_results should work without write access to case Move last remaining files out of cime_config jenkins_generic_job add better support for changing scratch root Disable FP checks for a certain LAPACK routine, by using ieee_exceptions module to turn off FP trapping that is enabled by -fpe0. Only for DEBUG builds, only for cori-knl, and only for Intel compiler. Fixes #1183 [BFB] change underscore to hyphens Minor change to cori scratch dirs Added a S,M,L layouts for ne4_oQU240 ...
…ons module to turn off FP trapping that is enabled by -fpe0. Only for DEBUG builds, only for cori-knl, and only for Intel compiler. Fixes #1183 [BFB]
Disable FP checks for a certain LAPACK routine (only for cori-knl, only DEBUG, only Intel compiler) Disable FP checks for a certain LAPACK routine, by using ieee_exceptions module to turn off FP trapping that is enabled by -fpe0. Only for DEBUG builds, only for cori-knl, and only for Intel compiler. Fixes #1183 [BFB]
…ons module to turn off FP trapping that is enabled by -fpe0. Only for DEBUG builds, only for cori-knl, and only for Intel compiler. Fixes #1183 [BFB]
Disable FP checks for a certain LAPACK routine (only for cori-knl, only DEBUG, only Intel compiler) Disable FP checks for a certain LAPACK routine, by using ieee_exceptions module to turn off FP trapping that is enabled by -fpe0. Only for DEBUG builds, only for cori-knl, and only for Intel compiler. Fixes #1183 [BFB]
…ons module to turn off FP trapping that is enabled by -fpe0. Only for DEBUG builds, only for cori-knl, and only for Intel compiler. Fixes #1183 [BFB]
Disable FP checks for a certain LAPACK routine (only for cori-knl, only DEBUG, only Intel compiler) Disable FP checks for a certain LAPACK routine, by using ieee_exceptions module to turn off FP trapping that is enabled by -fpe0. Only for DEBUG builds, only for cori-knl, and only for Intel compiler. Fixes #1183 [BFB]
Okay, I see this issue being referenced in the source, so resurrecting it. In non-debug (release) mode, and additional compiler flags
ne120-wcycl run (
And the surrounding source is
So we could be getting NaNs, because of this issue. I am commenting out |
I think the changes in the PR that addresses this issue should be put back the way they were. |
I think we can close this. Certainly we run DEBUG F cases frequently. |
In debug, we throw the -fpe0 flag for intel builds which will catch certain floating-point exceptions, including divide by zero. I've seen this for a while, but my solution has been to simply remove this flag and carry on. In past work, I've seen this flag cause problems within a solver and so we would disable it from within the code before calling solvers. However, maybe someone already knows about this, wants to know, or has a fix.
It looks like it happens with all tests like this, here is one example:
SMS_D_Ld1.ne16_ne16.FC5ATMMOD.cori-knl_intel
The text was updated successfully, but these errors were encountered: