Turn on FP exception checking on KNL nodes with Intel compiler #2208
Additional reference about the
This appears to fix all of the NaN, FP invalid and div-by-0 errors in ne120-wcycl runs on Theta (~10 different runs). |
@amametjanov - is this really needed? I would have thought that divide by zeros in MKL, from speculative execution, would be harmless. @ndkeen, IIRC, had figured out a way in DEBUG mode to allow speculative execution but not abort on the (harmless) NaNs sometimes produced by speculative execution. |
If it makes the random fails while running high-res on KNL go away, then yes, it's necessary. |
Yes, this enables harmless divide by zeros inside MKL in production mode. Previously, it was enabled only in debug mode: #1183. Debug mode sets |
I'm still confused: in non-debug mode, we want to enable vectorization, and we want to allow the harmless divide-by-zeros, since the NaNs produced by speculative execution should not show up in the data we care about? |
We can try to track down NaN-sensitive source files and put them in Depends.intel to be compiled with |
Just tried master on Titan using PGI, once with 1 thread and once with 2 threads, same number of MPI tasks otherwise. The two runs diverge at timestep 7 in atm.log. |
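A comparison like the one above can be sketched as follows. This is a minimal illustration in Python, assuming each run has already been reduced to a list of per-timestep values; the function name `first_divergence` is hypothetical, and extracting values from atm.log is not shown because its format is not assumed here.

```python
def first_divergence(run_a, run_b):
    """Return the 1-based timestep at which two runs first differ.

    run_a, run_b: sequences of per-timestep values (hypothetical inputs,
    e.g. a global diagnostic already extracted from each log).
    Returns None if the runs agree over their common length.
    """
    for step, (a, b) in enumerate(zip(run_a, run_b), start=1):
        # Bit-for-bit comparison: any difference counts as divergence.
        if a != b:
            return step
    return None
```

Exact equality is used on purpose, since the thread is about bit-for-bit reproducibility rather than agreement within a tolerance.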
Sorry - this should go on one of the github issue pages, or the Confluence page, but it appears that this case and master have problems beyond KNL systems and the intel compiler. |
Thanks for checking this, Pat. I've been wanting to know how the model does on machines other than KNL... |
I was so excited last night that someone had tried running on Titan that I forgot to respond to the bad news part of your post - did you do these runs using -fpe0 or not? It might be worth repeating your experiments with -fpe0 set to the opposite of what it was for your runs just to get more data points. Also, could you add your results to the end of the https://acme-climate.atlassian.net/wiki/spaces/SIM/pages/626721264/KNL+Reproducibility+-+v1+High-Res+Coupled page, where I added a table for non-bfb behavior on titan? Thanks! |
Not -fpe0 yet. Not even sure that this is a thing with the PGI compiler / for the AMD processor. I want to do some more studies first: see if results are deterministic, then see if same issue arises for a low res case, then see which component introduces non-b4b behavior when adding threading, then start seeing when this behavior first occurred. I will add some information to the Confluence page as soon as I get the chance. For the Titan results, I am assuming that this is a model bug, not a compiler bug. |
Pat, were the two runs (1 and 2 threads) with the same executable (or one pureMPI and the other threaded)? A quick way to test reproducibility with the default pelayout is to
IIRC, there were PGI compiler upgrades recently and we don't run high-res tests on Titan yet. |
I did a --clean-all and a fresh build after changing the PE layout. Since BUILD_THREADED is TRUE, they could have been identical. I did not check. |
By default, if all NTHRDS are 1, then BUILD_THREADED=FALSE (unless explicitly set to TRUE prior to building). So the first one must have been a pureMPI run and the second -- threaded. In this case, reproducibility is not expected, and so we are still OK. |
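The default described above can be sketched as follows. This is an illustration in Python, not the actual build-system logic: the function name, the dictionary shape, and the `explicit` argument are hypothetical, and the handling of an explicit FALSE is an assumption.

```python
def resolve_build_threaded(nthrds_by_component, explicit=None):
    """Return the effective BUILD_THREADED value (illustrative only).

    nthrds_by_component: dict of component name -> NTHRDS (hypothetical shape).
    explicit: True if BUILD_THREADED was set to TRUE before building, else None.
    """
    # An explicit TRUE set prior to building always wins.
    if explicit is True:
        return True
    # Otherwise the build is threaded only if some component uses >1 thread.
    return any(n > 1 for n in nthrds_by_component.values())
```

Under this rule, a 1-thread layout and a 2-thread layout produce a pure-MPI and a threaded executable respectively, which is why the two runs are not expected to match.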
You are correct - I thought we set BUILD_THREADED to be TRUE by default so that we could better evaluate reproducibility. Am I dreaming? If not, when was this changed back? |
It's been like this for a while (couple of years). To check threaded reproducibility, there are |
Never mind. It was broken for a while, and we fixed it to be what it was before (MPI-only had BUILD_THREADED=FALSE). |
So one build was without -mp and one was with. I do not trust this result as indicating a problem. Sorry about that, and thanks to @amametjanov for pointing this out. I'll do the "correct" comparison next. I'll be sticking with the production case since this is what is showing the problems on other systems. |
Update: On Titan/pgi, low res (ne30_oECv3_ICG) compset A_WCYCL1950S_CMIP6_LR is not b4b with a change in threading. I tracked it down to ATM only, and it is fixed by reintroducing the istanbul CPU target. Either the compiler upgrade or the use of COSP reintroduced this need (probably the latter). In any case, this does not shed any light on the problems this PR is trying to address. Sorry for the distraction. I have nothing to add on high resolution runs on Titan for this case. |
@worleyph can you open a new issue about the threading BFB problem you found? |
Will do. I had hoped to find the fix and submit a PR and open an issue at the same time, but my guess as to the cause was not accurate. This affects only PGI on Titan and is solved by specifying a CPU target of istanbul for ATM, but that is overkill. Finding which files need it could be a pain though.
|
How do you know it's only PGI and Titan? I'm not sure a PET test has been run with this compset on other platforms. |
What I found is solved by setting the CPU target to Istanbul. I can’t comment on non-B4B behavior on other systems or in other situations.
|
@amametjanov go ahead and merge this to next. |
…2208) Turn on FP exception checking on KNL nodes with Intel compiler. Also, don't halt on div-by-0 in a clubb MKL call on KNLs when FP exception-checking (-fpe0) is turned on. This makes debug and non-debug runs behave the same with -fpe0 on. [BFB]
Merged to next. |
@amametjanov please merge this to master. |
Turn on FP exception checking on KNL nodes with Intel compiler. Also, don't halt on div-by-0 in a clubb MKL call on KNLs when FP exception-checking (-fpe0) is turned on. This makes debug and non-debug runs behave the same with -fpe0 on. [BFB]
Merged to master. |
Turn on FP exception checking on KNL nodes with Intel compiler.
Also, don't halt on div-by-0 in a clubb MKL call on KNLs when FP exception-checking (-fpe0) is turned on. This makes debug and non-debug runs behave the same with -fpe0 on. [BFB]