Turn on FP exception checking on KNL nodes with Intel compiler #2208

amametjanov · 2018-03-29T17:01:14Z

Turn on FP exception checking on KNL nodes with Intel compiler.

Also, don't halt on div-by-0 in a clubb MKL call on KNLs when FP exception-checking (-fpe0) is turned on. This makes debug and non-debug runs behave the same with -fpe0 on.

[BFB]

amametjanov · 2018-03-29T19:53:12Z

Additional reference about the -fpe0 flag with Intel v18 compiler: https://software.intel.com/en-us/fortran-compiler-18.0-developer-guide-and-reference-fpe

> Sets option -fp-speculation=strict (Linux* and macOS*) or /Qfp-speculation:strict (Windows*) for any program unit compiled with -fpe0 (Linux* and macOS*) or /fpe:0 (Windows*). This disables certain optimizations in cases where speculative execution of floating-point operations could lead to floating-point exceptions that would not occur in the absence of speculation. For example, this may prevent the vectorization of some loops containing conditionals.
>
> Disables certain optimizations that generate calls to the Short Vector Math Library that could lead to floating-point exceptions for extreme input arguments that would not occur if libm was called instead. For example, this may prevent the vectorization of some loops containing calls to transcendental math functions.

This appears to fix all of the NaN, FP invalid and div-by-0 errors in ne120-wcycl runs on Theta (~10 different runs).

mt5555 · 2018-03-29T22:54:13Z

@amametjanov - is this really needed? I would have thought that divide-by-zeros in MKL, from speculative execution, would be harmless.

@ndkeen, IIRC, had figured out a way in DEBUG mode to allow speculative execution without aborting on the (harmless) NaNs it sometimes produces.

rljacob · 2018-03-29T22:58:22Z

If it makes the random fails while running high-res on KNL go away, then yes, it's necessary.

amametjanov · 2018-03-29T23:34:11Z

Yes, this enables harmless divide by zeros inside MKL in production mode. Previously, it was enabled only in debug mode: #1183.

Debug mode sets -fpe0, and because of that, none of the errors seen in production runs could be reproduced in debug mode.

mt5555 · 2018-03-29T23:46:47Z

I'm still confused: in non-debug mode, we want to enable vectorization, and we want to allow the harmless divide-by-zeros, since the NaNs produced by speculative execution should not show up in the data we care about?

amametjanov · 2018-03-30T00:52:10Z

-fpe0 does not completely disable vectorization; it should only disable potentially dangerous, speculative vectorization and compiler-generated calls to SVML functions. Somewhere in the E3SM code, these two optimizations are producing NaNs.
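As a rough analogy for the speculation discussed here, a vectorized loop over a conditional may perform the division in every lane and only afterwards throw away the masked-out results. The Python sketch below is not E3SM, MKL, or compiler output; all function names are hypothetical. Python's float division always raises on a zero divisor, which stands in for trapping under -fpe0, while `nontrapping_div` mimics the IEEE default of quietly returning inf or NaN.

```python
import math
import operator

def guarded_divide(a, b):
    """Scalar-style loop: the division runs only where the guard holds."""
    return [x / y if y != 0.0 else 0.0 for x, y in zip(a, b)]

def nontrapping_div(x, y):
    """Hypothetical helper: IEEE-style division that never raises."""
    if y != 0.0:
        return x / y
    if x == 0.0 or math.isnan(x):
        return math.nan
    return math.copysign(math.inf, x) * math.copysign(1.0, y)

def speculative_divide(a, b, div):
    """Vector-style: divide in every lane first, then apply the mask."""
    quotients = [div(x, y) for x, y in zip(a, b)]  # speculative work
    return [q if y != 0.0 else 0.0 for q, y in zip(quotients, b)]

a = [1.0, 2.0, 3.0]
b = [2.0, 0.0, 4.0]

guarded = guarded_divide(a, b)
# Without trapping, the inf produced in the masked lane is discarded.
masked = speculative_divide(a, b, nontrapping_div)

# With trapping, the same speculative division aborts the computation.
try:
    speculative_divide(a, b, operator.truediv)
    trapped = False
except ZeroDivisionError:
    trapped = True
```

In this sketch the guarded and the non-trapping speculative versions return the same values, and only the trapping variant fails, which is the sense in which such exceptions can be harmless to the final data yet fatal once halting is enabled.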

We can try to track down NaN-sensitive source files and put them in Depends.intel to be compiled with -fp-speculation=strict, -fimf-use-svml=false or similar flags and eventually remove the -fpe0 flag.

worleyph · 2018-03-30T02:37:11Z

Just tried master

-compset A_WCYCL1950S_CMIP6_HR -res ne120_oRRS18v3_ICG

on Titan using PGI, once with 1 thread and once with 2 threads, with the same number of MPI tasks otherwise. The two runs diverge at timestep 7 in atm.log.

worleyph · 2018-03-30T02:39:38Z

Sorry - this should go on one of the GitHub issue pages, or the Confluence page, but it appears that this case and master have problems beyond KNL systems and the Intel compiler.

PeterCaldwell · 2018-03-30T05:40:30Z

Thanks for checking this, Pat. I've been wanting to know how the model does on machines other than KNL...

PeterCaldwell · 2018-03-30T14:06:18Z

I was so excited last night that someone had tried running on Titan that I forgot to respond to the bad-news part of your post. Did you do these runs using -fpe0 or not? It might be worth repeating your experiments with -fpe0 set to the opposite of what it was for your runs, just to get more data points. Also, could you add your results to the end of the https://acme-climate.atlassian.net/wiki/spaces/SIM/pages/626721264/KNL+Reproducibility+-+v1+High-Res+Coupled page, where I added a table for non-bfb behavior on Titan? Thanks!

worleyph · 2018-03-30T14:11:00Z

Not -fpe0 yet. Not even sure that this is a thing with the PGI compiler / for the AMD processor. I want to do some more studies first: see if results are deterministic, then see if same issue arises for a low res case, then see which component introduces non-b4b behavior when adding threading, then start seeing when this behavior first occurred. I will add some information to the Confluence page as soon as I get the chance. For the Titan results, I am assuming that this is a model bug, not a compiler bug.

amametjanov · 2018-03-30T16:25:19Z

Pat, were the two runs (1 and 2 threads) with the same executable (or one pureMPI and the other threaded)? A quick way to test reproducibility with the default pelayout is to run:

./cime/scripts/create_test ERS.ne120_oRRS18v3_ICG.A_WCYCL1950S_CMIP6_HR.titan_pgi.cam-cosplite

IIRC, there were PGI compiler upgrades recently and we don't run high-res tests on Titan yet.

worleyph · 2018-03-30T16:44:26Z

> Pat, were the two runs (1 and 2 threads) with the same executable (or one pureMPI and the other threaded)?

I did a --clean-all and a fresh build after changing the PE layout. Since BUILD_THREADED is TRUE, they could have been identical. I did not check.

amametjanov · 2018-03-30T16:53:15Z

By default, if all NTHRDS are 1, then BUILD_THREADED=FALSE (unless explicitly set to TRUE prior to building). So the first one must have been a pureMPI run and the second -- threaded. In this case, reproducibility is not expected, and so we are still OK.
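The default described above can be written as a small rule. This is a minimal sketch of the logic as stated in this comment, not CIME's actual implementation: the function name and the dictionary layout are hypothetical, and it assumes that any component with more than one thread implies a threaded build.

```python
def build_threaded(nthrds, explicitly_true=False):
    """Effective BUILD_THREADED under the default rule described above.

    nthrds: mapping of component name to thread count (illustrative keys).
    explicitly_true: True if the user set BUILD_THREADED=TRUE before building.
    """
    # Assumption: more than one thread in any component forces a threaded build.
    return explicitly_true or any(n > 1 for n in nthrds.values())

# The two Titan runs discussed in this thread, with illustrative component keys.
run_one_thread = {"ATM": 1, "OCN": 1}
run_two_threads = {"ATM": 2, "OCN": 2}

first_is_threaded = build_threaded(run_one_thread)
second_is_threaded = build_threaded(run_two_threads)

# Differing build modes mean bit-for-bit agreement is not expected.
comparable = first_is_threaded == second_is_threaded
```

Under this rule the 1-thread run is a pureMPI build and the 2-thread run is a threaded build, so the comparison only becomes meaningful if the first case is built with BUILD_THREADED explicitly set to TRUE.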

worleyph · 2018-03-30T17:01:37Z

You are correct - I thought we set BUILD_THREADED to be TRUE by default so that we could better evaluate reproducibility. Am I dreaming? If not, when was this changed back?

amametjanov · 2018-03-30T17:09:59Z

It's been like this for a while (a couple of years). To check threaded reproducibility, there are PET (PE Threaded) tests that are part of the e3sm_integration test suite, and they pass with the Intel compiler. Min (@minxu74) can switch from the current e3sm_developer to e3sm_integration for regular (twice-a-week) testing on Titan to get coverage of threaded runs.

worleyph · 2018-03-30T17:10:08Z

Never mind. It was broken for a while, and we fixed it to be what it was before (MPI-only had BUILD_THREADED=FALSE).

worleyph · 2018-03-30T17:13:37Z

> The two runs diverge at timestep 7 in atm.log.

So one build was without -mp and one was with. I do not trust this result as indicating a problem. Sorry about that, and thanks to @amametjanov for pointing this out. I'll do the "correct" comparison next. I'll be sticking with the production case since this is what is showing the problems on other systems.

worleyph · 2018-03-31T15:47:06Z

Update: On Titan/PGI, low res (ne30_oECv3_ICG) compset A_WCYCL1950S_CMIP6_LR is not b4b with a change in threading. I tracked it down to ATM only, and it is fixed by reintroducing the istanbul CPU target. Either the compiler upgrade or the use of COSP reintroduced this need (probably the latter). In any case, this does not shed any light on the problems this PR is trying to address. Sorry for the distraction. I have nothing to add on high-resolution runs on Titan for this case.

rljacob · 2018-04-02T03:33:27Z

@worleyph can you open a new issue about the threading BFB problem you found?

worleyph · 2018-04-02T03:41:22Z

Will do. I had hoped to find the fix and submit a PR and open an issue at the same time, but my guess as to the cause was not accurate. This affects only PGI on Titan and is solved by specifying a CPU target of istanbul for ATM, but that is overkill. Finding which files need it could be a pain though.

rljacob · 2018-04-02T03:46:53Z

How do you know it's only PGI and Titan? I'm not sure a PET test has been run with this compset on other platforms.

worleyph · 2018-04-02T03:53:42Z

What I found is solved by setting the CPU target to Istanbul. I can’t comment on non-B4B behavior on other systems or in other situations.

rljacob · 2018-04-02T19:42:23Z

@amametjanov go ahead and merge this to next.

amametjanov · 2018-04-02T20:02:16Z

Merged to next.

rljacob · 2018-04-03T18:13:26Z

@amametjanov please merge this to master.

amametjanov · 2018-04-03T19:52:16Z

Merged to master.

amametjanov added 2 commits March 28, 2018 22:25

Don't halt on div-by-0 in a clubb MKL call on KNLs

8a7e295

Turn on FP exception checking on KNL nodes with Intel compiler

d1b0715

amametjanov added Cori Theta HighRes labels Mar 29, 2018

rljacob assigned amametjanov Mar 29, 2018

amametjanov merged commit d1b0715 into master Apr 3, 2018

amametjanov deleted the azamat/knl/rm-clubb-mkl-div-by-0-halts branch April 3, 2018 19:52

ndkeen mentioned this pull request Nov 26, 2019

floating invalid of F2010C5-CMIP6-LR.ne30_oECv3 on cori-knl with maint-1.0 #3328

Open

brhillman pushed a commit that referenced this pull request Apr 5, 2023

Merge Pull Request #2208 from E3SM-Project/scream/jgfouca/imprv_jenki…

3fed8ba

…ns_output Automatically Merged using E3SM Pull Request AutoTester PR Title: Improve jenkins output when things fail. PR Author: jgfouca
