-
Notifications
You must be signed in to change notification settings - Fork 383
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Restrict use of istanbul cpu target on Titan when using PGI #1767
Conversation
Early on, it was determined that the PGI compiler required that the CPU target be specified to be istanbul (instead of the actual processor interlagos) in order for ACME (or CESM) to be reproducible with respect to changes in process or thread counts. Here this global specification of istanbul as the CPU target is removed and only applied to the files that require it. Based on experimentation using an A_WCYCL test case, only CAM required the modified CPU target, and we eventually identified a small number of files within CAM for which this is required. We have not yet excamined these files to determine why these files require it. Other changes include removing the dependence on the version of pgi/17.5.0 installed in Dave Norton's directories, moving instead to the now official OLCF installation. Also fixed a typo in the module switch command for the PGI version used with PGIACC, and changed the 'pin' flag to 'pinned' for PGIACC, as 'pin' is no longer legal. [Non-BFB] (on Titan when using PGI compiler)
Testing has been done fairly extensively using A_WCYCL2000 and ne30_oEC, but some sanity checks are probably called for, to determine that this change does preserve reproducibility in other instances. As this is not BFB, a little effort should be spent verifying that this is just roundoff differences. Note that all that has changed is compiler options, so if this is climate changing, then it is a compiler bug. |
@worleyph its not clear who is going to confirm that this PR is just roundoff differences. |
@rljacob , it has been along time since I have submitted a PR that was not BFB (or just a change in namelists). How is any PR confirmed to just be BFB? (Is this a RTFM?) Probably my comment is just stupid and should be ignored? All that is changing here is compiler options, no code, so perhaps is the clearest case for a round-off-level change PR? |
This is a part of PR #1767 submitted by Pat Worley, focusing on the changes of the settings of pgi compiler in Titan. As stated in PR #1767: "Changes including removing the dependence on the version of pgi/17.5.0 installed in Dave Norton's directories, moving instead to the now official OLCF installation. Also fixed a typo in the module switch command for the PGI version used with PGIACC, and changed the 'pin' flag to 'pinned' for PGIACC, as 'pin' is no longer legal." [Non-BFB] - Non Bit-For-Bit
This is a part of PR #1767 submitted by Pat Worley, focusing on the changes of the settings of pgi compiler in Titan. As stated in PR #1767: "Changes including removing the dependence on the version of pgi/17.5.0 installed in Dave Norton's directories, moving instead to the now official OLCF installation. Also fixed a typo in the module switch command for the PGI version used with PGIACC, and changed the 'pin' flag to 'pinned' for PGIACC, as 'pin' is no longer legal." [Non-BFB] * minxu74/titan/fix-pgi-compiler: Fixed the settings of pgi compiler on Titan
How can we determine that the non-BFB here just a roundoff difference or climate changing? just based on where we make changes in the codes? |
I'm not finding anything in the development or integrator guide on this topic. Based on the type of change here (compiler option change, and still reproducible), guess that we can assume that it is roundoff difference only. Looking at the atm.log, it is the same for step 0 and differs by step 1: (before) (after) so is consistent with being a roundoff level change. |
We are working on developing test methods to answer exactly this question. These methods are still under development but they will be soon available in test suites on experimental basis. Tagging @huiwanpnnl and @salilmahajan . |
@worleyph : From your experience, what would be different for a "climate-changing" change? I am just trying to understand how you decided that this change looks like a roundoff level change. For a "climate changing" scenario:
Thanks! |
I don't have any experience with climate-changing change, so a climate-changing change could very look like this as well. This is simply also consistent with a round-off level change when there are multiple sources, in my experience. 8-10 digits of agreement for the large values and 4 digits of agreement for the smaller values don't raise any red flags for me. If this was a change of a single line of code, then I would expect a much smaller change, but this affects the whole code in who knows what ways. Based on current practice when updating compiler versions (where we don't run experiments to verify that the climate has changed, we just update the baselines), I don't think that this is any worse. It would be nice to be more conservative, but we need the tools in current development. |
@worleyph @singhbalwinder Thanks a lot for your comments. We are looking forward to that tool. |
At timestep 1, there is also no input from the other components. Looking at timestep 97: (old) (new) growth in these is not too bad. But a climate scientist would need to comment :-). Also comparing intel with pgi: (intel, unchanged, step 1) nstep, te 1 0.33533586386590753E+10 0.33534149449900842E+10 0.31135420187541362E-02 0.98531035521301863E+05 the differences are very similar to those when changing the CPU target for PGI. (pgi, original, step 1) nstep, te 1 0.33533586601617603E+10 0.33534149692914491E+10 0.31136967736702307E-02 0.98531035591945751E+05 |
Corrected previous comment as I had duplicated the intel data rather than comparing intel with pgi. |
Actually, the Intel compiler has never had the istanbul cpu target, and comparing Intel with PGI after the modifications: (intel, unchanged, step 1) nstep, te 1 0.33533586386590753E+10 0.33534149449900842E+10 0.31135420187541362E-02 0.98531035521301863E+05 (pgi, with restricted use of cpu target, step 1) nstep, te 1 0.33533586385805354E+10 0.33534149449283252E+10 0.31135429466769334E-02 0.98531035521281417E+05 the results are more similar. Interesting ... |
Yes, they seems to be diverging very slowly which corroborates your hypothesis that it looks like a round-off level change. |
It is interesting that the different compilers got more similar results than the same compiler with different options. What are these values? global average of some variables? or just outputs at a single random grid? |
global averages and yes that looks like roundoff. |
@singhbalwinder can hopefully provide a better answer - these are global averages output from check_energy.F90 in CAM physics:
They are pretty sensitive, so is where I look to check for reproducibility during development. |
Like @worleyph and @rljacob mentioned, they are the global average sums which capture model state. Any change in the model state will bring change to these numbers. So far I have been using it for checking reproducibility just like @worleyph. @wlin7 or @philrasch might have more insights into how they are computed and how they capture model state. |
refactored datamodels refactored datamodels for new modularity This PR contains a refactorization of the data models in order to separate out the driver specfic parts of dxxx_comp_mod.F90 (which now contains datatypes such as cdata, infodata, etc. that are specific to the current mct cpl7 driver) and permit new driver implementations (e.g. nuopc) to use the core parts of the data models. In addition, a new directory data_comps/dshare now contains data model share code that was previously in share/utils. Test suite: scripts_regressions_tests The following tests had baselines generated without the refactor and tests were run to verify that the data model refactor maintained bfb answers - ERI.f09_g17.ETEST.cheyenne_intel.cice-default - ERI.f09_g17.ETEST.cheyenne_intel.cice-default - ERI.f09_g17.ETEST.cheyenne_intel.cice-default - ERI.T62_g16.C1850ECO.cheyenne_intel.pop-ecosys - ERI.T62_g16.C.cheyenne_intel.pop-default - ERI.T62_g16.CIAF.cheyenne_intel.pop-default - ERI.T62_g16.GIAF.cheyenne_intel.pop-default - ERI.T62_g17.DTEST.cheyenne_intel.cice-default - ERI.T62_g17.G.cheyenne_gnu.pop-cice - ERP_D_Ln9.f19_f19_mg17.QPC6.cheyenne_intel.cam-outfrq9s - ERP_D_Ln9.f19_f19_mg17.QSC6.cheyenne_intel.cam-outfrq9s - ERP_Ln9.f09_f09_mg17.F1850_DONOTUSE.cheyenne_intel.cam-outfrq9s - ERP_Ln9.f09_f09_mg17.FHIST_DEV.cheyenne_intel.cam-outfrq9s - ERP_Ln9.f19_f19_mg17.F2000_DEV.cheyenne_gnu.cam-outfrq9s - ERR.f45_g37_rx1.A.cheyenne_intel - ERS.f09_g16.ADLND.cheyenne_intel - ERS_Ld3.f45_g37_rx1.A.cheyenne_intel - ERS_Lm3.T62_g16.AIAF.cheyenne_intel - ERS_Ly20_N2_P2.f09_g16_gl20.T1850G1.cheyenne_intel - ERS_Ly3.f09_g16_gl4.T1850G.cheyenne_intel - PEM.f19_g16_rx1.A.cheyenne_intel - PEM.f19_g16_rx1.A.cheyenne_intel - SMS.f09_g16.ADWAV.cheyenne_intel Also ran most aux_clm5 tests and verified bfb with clm4_5_1_r251 Test baseline: see above Test namelist changes: Test status: none Fixes #1713 User interface changes?: None Update gh-pages html (Y/N)?: N Code review: jedwards, rjacob
@worleyph I ran the acme_developer test suite using your branch. There were two more failure tests compared with the results from cdash using next branch. The two tests are listed as follows and both of them have an active MPAS Land Ice model. The errors are also very similar. There are a lot of log.landice.????.err files in the RUN directory. The error message is "CRITICAL ERROR: An error has occurred in mpas_core_run. Aborting..." |
@minxu74 , I never tested with active land ice. Sorry. I'll look at this now. There may be a simple fix for this PR (just re-add the istanbul cpu target for glc), and we can revisit why this is necessary next quarter. I'll give this a try first. Thanks. |
Cases with active land ice are failing at runtime with the error: > Advective CFL violation on this processor. Maximum allowable time > step for this processor is (Days_hhh:mmm:sss): 0000000000_000:000:000 Adding the istanbul cpu target back just for GLC eliminates the error. A future task will be to determine which files in MPASLI need this cpu target.
@minxu74 , I added the istanbul CPU target back for builds of GLC. With this change, the cases
and
completed successfully. (Without this change, I observed the same runtime error that you reported.) Please update your copy of this branch and rerun the acme_developer branch and see if this eliminates the failures. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it looks good to me
@mrnorman , this is still waiting on your review. |
@minxu74 , what do you want to do about this? You and @mrnorman were concerned about the glboal application for the istanbul target, and this addresses it. Do you care any more? I have not yet looked at pgiacc, in fact have not even tried to repeat your success (from using more OpenMP and less MPI per node). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks good to me.
Sorry yall were waiting on my review so long. I checked my e-mail, and I didn't appear to get any. Maybe I set a bad setting somewhere on github that I need to fix? |
Thanks. |
@mrnorman check your notifications. I've also had some epic struggles with GitHub not sending me important emails. GitHub has three email systems, one for notifications, one for at-mentions, and a third I can't remember. If one of those systems sends you an email and it bounces, that system will stop sending emails entirely until you contact GitHub and tell them to clear the bounce flag. |
@minxu74 please start merging this. |
This is a part of PR #1767 submitted by Pat Worley, focusing on the changes of the settings of pgi compiler in Titan. As stated in PR #1767: "Changes including removing the dependence on the version of pgi/17.5.0 installed in Dave Norton's directories, moving instead to the now official OLCF installation. Also fixed a typo in the module switch command for the PGI version used with PGIACC, and changed the 'pin' flag to 'pinned' for PGIACC, as 'pin' is no longer legal." [Non-BFB] - Non Bit-For-Bit
@minxu74 Is your github email not working? |
Early on, it was determined that the PGI compiler required that the CPU target be specified to be istanbul (instead of the actual processor interlagos) in order for ACME (or CESM) to be reproducible with respect to changes in process or thread counts. Here this global specification of istanbul as the CPU target is removed and only applied to the files that require it. Based on experimentation, only CAM and MPASLI required the modified CPU target, and we eventually identified a small number of files within CAM for which this is required. We have not yet examined these files to determine why these files require it. We also have not yet looked at individual MPASLI files, and instead continue to apply the istanbul cpu target for builds of the entire GLC component. Other changes include removing the dependence on the version of pgi/17.5.0 installed in Dave Norton's directories, moving instead to the now official OLCF installation. Also fixed a typo in the module switch command for the PGI version used with PGIACC, and changed the 'pin' flag to 'pinned' for PGIACC, as 'pin' is no longer legal. [Non-BFB] (on Titan when using PGI compiler) Fixes #1620 Fixes #1610 (Note #1610 might need to be reopened, but this nominally addresses the primary issues.) * worleyph/machines/Titan_pgi_updates: Add istanbul cpu target back for GLC component Restrict use of istanbul cpu target Conflicts: cime/config/acme/machines/Depends.titan.pgiacc Solved the conflict caused by changing the name of "solver_init_mod" to "model_init_mod" in Depends.titan.pgiacc introduced by PR #1824
) Early on, it was determined that the PGI compiler required that the CPU target be specified to be istanbul (instead of the actual processor interlagos) in order for ACME (or CESM) to be reproducible with respect to changes in process or thread counts. Here this global specification of istanbul as the CPU target is removed and only applied to the files that require it. Based on experimentation, only CAM and MPASLI required the modified CPU target, and we eventually identified a small number of files within CAM for which this is required. We have not yet examined these files to determine why these files require it. We also have not yet looked at individual MPASLI files, and instead continue to apply the istanbul cpu target for builds of the entire GLC component. Other changes include removing the dependence on the version of pgi/17.5.0 installed in Dave Norton's directories, moving instead to the now official OLCF installation. Also fixed a typo in the module switch command for the PGI version used with PGIACC, and changed the 'pin' flag to 'pinned' for PGIACC, as 'pin' is no longer legal. [Non-BFB] (on Titan when using PGI compiler) Fixes #1620 Fixes #1610 (Note #1610 might need to be reopened, but this nominally addresses the primary issues.)
This is a part of PR #1767 submitted by Pat Worley, focusing on the changes of the settings of pgi compiler in Titan. As stated in PR #1767: "Changes including removing the dependence on the version of pgi/17.5.0 installed in Dave Norton's directories, moving instead to the now official OLCF installation. Also fixed a typo in the module switch command for the PGI version used with PGIACC, and changed the 'pin' flag to 'pinned' for PGIACC, as 'pin' is no longer legal." [Non-BFB] - Non Bit-For-Bit
) Early on, it was determined that the PGI compiler required that the CPU target be specified to be istanbul (instead of the actual processor interlagos) in order for ACME (or CESM) to be reproducible with respect to changes in process or thread counts. Here this global specification of istanbul as the CPU target is removed and only applied to the files that require it. Based on experimentation, only CAM and MPASLI required the modified CPU target, and we eventually identified a small number of files within CAM for which this is required. We have not yet examined these files to determine why these files require it. We also have not yet looked at individual MPASLI files, and instead continue to apply the istanbul cpu target for builds of the entire GLC component. Other changes include removing the dependence on the version of pgi/17.5.0 installed in Dave Norton's directories, moving instead to the now official OLCF installation. Also fixed a typo in the module switch command for the PGI version used with PGIACC, and changed the 'pin' flag to 'pinned' for PGIACC, as 'pin' is no longer legal. [Non-BFB] (on Titan when using PGI compiler) Fixes #1620 Fixes #1610 (Note #1610 might need to be reopened, but this nominally addresses the primary issues.)
This is a part of PR #1767 submitted by Pat Worley, focusing on the changes of the settings of pgi compiler in Titan. As stated in PR #1767: "Changes including removing the dependence on the version of pgi/17.5.0 installed in Dave Norton's directories, moving instead to the now official OLCF installation. Also fixed a typo in the module switch command for the PGI version used with PGIACC, and changed the 'pin' flag to 'pinned' for PGIACC, as 'pin' is no longer legal." [Non-BFB] - Non Bit-For-Bit
) Early on, it was determined that the PGI compiler required that the CPU target be specified to be istanbul (instead of the actual processor interlagos) in order for ACME (or CESM) to be reproducible with respect to changes in process or thread counts. Here this global specification of istanbul as the CPU target is removed and only applied to the files that require it. Based on experimentation, only CAM and MPASLI required the modified CPU target, and we eventually identified a small number of files within CAM for which this is required. We have not yet examined these files to determine why these files require it. We also have not yet looked at individual MPASLI files, and instead continue to apply the istanbul cpu target for builds of the entire GLC component. Other changes include removing the dependence on the version of pgi/17.5.0 installed in Dave Norton's directories, moving instead to the now official OLCF installation. Also fixed a typo in the module switch command for the PGI version used with PGIACC, and changed the 'pin' flag to 'pinned' for PGIACC, as 'pin' is no longer legal. [Non-BFB] (on Titan when using PGI compiler) Fixes #1620 Fixes #1610 (Note #1610 might need to be reopened, but this nominally addresses the primary issues.)
) Early on, it was determined that the PGI compiler required that the CPU target be specified to be istanbul (instead of the actual processor interlagos) in order for ACME (or CESM) to be reproducible with respect to changes in process or thread counts. Here this global specification of istanbul as the CPU target is removed and only applied to the files that require it. Based on experimentation, only CAM and MPASLI required the modified CPU target, and we eventually identified a small number of files within CAM for which this is required. We have not yet examined these files to determine why these files require it. We also have not yet looked at individual MPASLI files, and instead continue to apply the istanbul cpu target for builds of the entire GLC component. Other changes include removing the dependence on the version of pgi/17.5.0 installed in Dave Norton's directories, moving instead to the now official OLCF installation. Also fixed a typo in the module switch command for the PGI version used with PGIACC, and changed the 'pin' flag to 'pinned' for PGIACC, as 'pin' is no longer legal. [Non-BFB] (on Titan when using PGI compiler) Fixes #1620 Fixes #1610 (Note #1610 might need to be reopened, but this nominally addresses the primary issues.)
Early on, it was determined that the PGI compiler required that
the CPU target be specified to be istanbul (instead of the actual
processor interlagos) in order for ACME (or CESM) to be
reproducible with respect to changes in process or thread counts.
Here this global specification of istanbul as the CPU target is
removed and only applied to the files that require it. Based
on experimentation, only CAM and MPASLI required
the modified CPU target, and we eventually identified a small
number of files within CAM for which this is required. We have not
yet examined these files to determine why these files require it.
We also have not yet looked at individual MPASLI files, and
instead continue to apply the istanbul cpu target for builds
of the entire GLC component.
Other changes include removing the dependence on the version of
pgi/17.5.0 installed in Dave Norton's directories, moving instead
to the now official OLCF installation. Also fixed a typo in the
module switch command for the PGI version used with PGIACC, and
changed the 'pin' flag to 'pinned' for PGIACC, as 'pin' is no
longer legal.
[Non-BFB] (on Titan when using PGI compiler)
Fixes #1620
Fixes #1610
(Note #1610 might need to be reopened, but this nominally addresses the primary issues.)