Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restrict use of istanbul cpu target on Titan when using PGI #1767

Merged
merged 2 commits into from
Oct 27, 2017

Conversation

worleyph
Copy link
Contributor

@worleyph worleyph commented Sep 4, 2017

Early on, it was determined that the PGI compiler required that
the CPU target be specified to be istanbul (instead of the actual
processor interlagos) in order for ACME (or CESM) to be
reproducible with respect to changes in process or thread counts.
Here this global specification of istanbul as the CPU target is
removed and only applied to the files that require it. Based
on experimentation, only CAM and MPASLI required
the modified CPU target, and we eventually identified a small
number of files within CAM for which this is required. We have not
yet examined these files to determine why these files require it.
We also have not yet looked at individual MPASLI files, and
instead continue to apply the istanbul cpu target for builds
of the entire GLC component.

Other changes include removing the dependence on the version of
pgi/17.5.0 installed in Dave Norton's directories, moving instead
to the now official OLCF installation. Also fixed a typo in the
module switch command for the PGI version used with PGIACC, and
changed the 'pin' flag to 'pinned' for PGIACC, as 'pin' is no
longer legal.

[Non-BFB] (on Titan when using PGI compiler)

Fixes #1620
Fixes #1610

(Note #1610 might need to be reopened, but this nominally addresses the primary issues.)

Early on, it was determined that the PGI compiler required that
the CPU target be specified to be istanbul (instead of the actual
processor interlagos) in order for ACME (or CESM) to be
reproducible with respect to changes in process or thread counts.
Here this global specification of istanbul as the CPU target is
removed and only applied to the files that require it. Based
on experimentation using an A_WCYCL test case, only CAM required
the modified CPU target, and we eventually identified a small
number of files within CAM for which this is required. We have not
yet excamined these files to determine why these files require it.

Other changes include removing the dependence on the version of
pgi/17.5.0 installed in Dave Norton's directories, moving instead
to the now official OLCF installation. Also fixed a typo in the
module switch command for the PGI version used with PGIACC, and
changed the 'pin' flag to 'pinned' for PGIACC, as 'pin' is no
longer legal.

[Non-BFB] (on Titan when using PGI compiler)
@worleyph worleyph requested review from mrnorman and minxu74 September 4, 2017 15:04
@worleyph worleyph added Machine Files non-BFB PR makes roundoff changes to answers. labels Sep 4, 2017
@worleyph worleyph changed the title Restrict use of istanbul cpu target Restrict use of istanbul cpu target on Titan when using PGI Sep 4, 2017
@worleyph
Copy link
Contributor Author

worleyph commented Sep 4, 2017

Testing has been done fairly extensively using A_WCYCL2000 and ne30_oEC, but some sanity checks are probably called for, to determine that this change does preserve reproducibility in other instances. As this is not BFB, a little effort should be spent verifying that this is just roundoff differences. Note that all that has changed is compiler options, so if this is climate changing, then it is a compiler bug.

@rljacob
Copy link
Member

rljacob commented Sep 6, 2017

@worleyph its not clear who is going to confirm that this PR is just roundoff differences.

@worleyph
Copy link
Contributor Author

worleyph commented Sep 6, 2017

@rljacob , it has been along time since I have submitted a PR that was not BFB (or just a change in namelists). How is any PR confirmed to just be BFB? (Is this a RTFM?)

Probably my comment is just stupid and should be ignored? All that is changing here is compiler options, no code, so perhaps is the clearest case for a round-off-level change PR?

minxu74 added a commit that referenced this pull request Sep 6, 2017
This is a part of PR #1767 submitted by Pat Worley, focusing on the
changes of the settings of pgi compiler in Titan.

As stated in PR #1767:
"Changes including removing the dependence on the version of pgi/17.5.0
installed in Dave Norton's directories, moving instead
to the now official OLCF installation. Also fixed a typo in the
module switch command for the PGI version used with PGIACC, and
changed the 'pin' flag to 'pinned' for PGIACC, as 'pin' is no
longer legal."

[Non-BFB] - Non Bit-For-Bit
minxu74 added a commit that referenced this pull request Sep 6, 2017
This is a part of PR #1767 submitted by Pat Worley, focusing on the
changes of the settings of pgi compiler in Titan.

As stated in PR #1767:
"Changes including removing the dependence on the version of pgi/17.5.0
installed in Dave Norton's directories, moving instead
to the now official OLCF installation. Also fixed a typo in the
module switch command for the PGI version used with PGIACC, and
changed the 'pin' flag to 'pinned' for PGIACC, as 'pin' is no
longer legal."

[Non-BFB]
* minxu74/titan/fix-pgi-compiler:
  Fixed the settings of pgi compiler on Titan
@minxu74
Copy link
Contributor

minxu74 commented Sep 6, 2017

How can we determine that the non-BFB here just a roundoff difference or climate changing? just based on where we make changes in the codes?

@worleyph
Copy link
Contributor Author

worleyph commented Sep 6, 2017

I'm not finding anything in the development or integrator guide on this topic. Based on the type of change here (compiler option change, and still reproducible), guess that we can assume that it is roundoff difference only. Looking at the atm.log, it is the same for step 0 and differs by step 1:

(before)
nstep, te 1 0.33533586601617603E+10 0.33534149692914491E+10 0.31136967736702307E-02 0.98531035591945751E+05

(after)
nstep, te 1 0.33533586385805354E+10 0.33534149449283252E+10 0.31135429466769334E-02 0.98531035521281417E+05

so is consistent with being a roundoff level change.

@singhbalwinder
Copy link
Contributor

How can we determine that the non-BFB here just a roundoff difference or climate changing? just based on where we make changes in the codes?

We are working on developing test methods to answer exactly this question. These methods are still under development but they will be soon available in test suites on experimental basis. Tagging @huiwanpnnl and @salilmahajan .

@singhbalwinder
Copy link
Contributor

Looking at the atm.log, it is the same for step 0 and differs by step 1:
(before)
nstep, te 1 0.33533586601617603E+10 0.33534149692914491E+10 0.31136967736702307E-02 0.98531035591945751E+05
(after)
nstep, te 1 0.33533586385805354E+10 0.33534149449283252E+10 0.31135429466769334E-02 0.98531035521281417E+05
so is consistent with being a roundoff level change.

@worleyph : From your experience, what would be different for a "climate-changing" change? I am just trying to understand how you decided that this change looks like a roundoff level change.

For a "climate changing" scenario:

  • Do you expect the values to differ at nstep=0 ?
    Or
  • Do you expect the values at nstep=1 to be very different?

Thanks!

@worleyph
Copy link
Contributor Author

worleyph commented Sep 6, 2017

@worleyph : From your experience, what would be different for a "climate-changing" change? I am just trying to understand how you decided that this change looks like a roundoff level change.

I don't have any experience with climate-changing change, so a climate-changing change could very look like this as well. This is simply also consistent with a round-off level change when there are multiple sources, in my experience. 8-10 digits of agreement for the large values and 4 digits of agreement for the smaller values don't raise any red flags for me. If this was a change of a single line of code, then I would expect a much smaller change, but this affects the whole code in who knows what ways.

Based on current practice when updating compiler versions (where we don't run experiments to verify that the climate has changed, we just update the baselines), I don't think that this is any worse. It would be nice to be more conservative, but we need the tools in current development.

@minxu74
Copy link
Contributor

minxu74 commented Sep 6, 2017

@worleyph @singhbalwinder Thanks a lot for your comments. We are looking forward to that tool.

@worleyph
Copy link
Contributor Author

worleyph commented Sep 6, 2017

At timestep 1, there is also no input from the other components. Looking at timestep 97:

(old)
nstep, te 97 0.33506698271760306E+10 0.33506822919476566E+10 0.68924854905123140E-03 0.98532389263741556E+05

(new)
nstep, te 97 0.33506391949633703E+10 0.33506516729013834E+10 0.68998874601030344E-03 0.98530654088145704E+05

growth in these is not too bad. But a climate scientist would need to comment :-).

Also comparing intel with pgi:

(intel, unchanged, step 1)

nstep, te 1 0.33533586386590753E+10 0.33534149449900842E+10 0.31135420187541362E-02 0.98531035521301863E+05

the differences are very similar to those when changing the CPU target for PGI.

(pgi, original, step 1)

nstep, te 1 0.33533586601617603E+10 0.33534149692914491E+10 0.31136967736702307E-02 0.98531035591945751E+05

@worleyph
Copy link
Contributor Author

worleyph commented Sep 6, 2017

Corrected previous comment as I had duplicated the intel data rather than comparing intel with pgi.

@worleyph
Copy link
Contributor Author

worleyph commented Sep 6, 2017

Actually, the Intel compiler has never had the istanbul cpu target, and comparing Intel with PGI after the modifications:

(intel, unchanged, step 1)

nstep, te 1 0.33533586386590753E+10 0.33534149449900842E+10 0.31135420187541362E-02 0.98531035521301863E+05

(pgi, with restricted use of cpu target, step 1)

nstep, te 1 0.33533586385805354E+10 0.33534149449283252E+10 0.31135429466769334E-02 0.98531035521281417E+05

the results are more similar. Interesting ...

@singhbalwinder
Copy link
Contributor

Yes, they seems to be diverging very slowly which corroborates your hypothesis that it looks like a round-off level change.

@minxu74
Copy link
Contributor

minxu74 commented Sep 6, 2017

It is interesting that the different compilers got more similar results than the same compiler with different options. What are these values? global average of some variables? or just outputs at a single random grid?

@rljacob
Copy link
Member

rljacob commented Sep 6, 2017

global averages and yes that looks like roundoff.

@worleyph
Copy link
Contributor Author

worleyph commented Sep 6, 2017

@singhbalwinder can hopefully provide a better answer - these are global averages output from check_energy.F90 in CAM physics:

 teinp_glob, teout_glob, heat_glob, psurf_glob

They are pretty sensitive, so is where I look to check for reproducibility during development.

@singhbalwinder
Copy link
Contributor

Like @worleyph and @rljacob mentioned, they are the global average sums which capture model state. Any change in the model state will bring change to these numbers. So far I have been using it for checking reproducibility just like @worleyph. @wlin7 or @philrasch might have more insights into how they are computed and how they capture model state.

@rljacob rljacob added the Titan label Sep 12, 2017
jgfouca pushed a commit that referenced this pull request Sep 15, 2017
refactored datamodels
refactored datamodels for new modularity

This PR contains a refactorization of the data models in order to separate out the driver specfic parts of dxxx_comp_mod.F90 (which now contains datatypes such as cdata, infodata, etc. that are specific to the current mct cpl7 driver) and permit new driver implementations (e.g. nuopc) to use the core parts of the data models. In addition, a new directory data_comps/dshare now contains data model share code that was previously in share/utils.

Test suite: scripts_regressions_tests
The following tests had baselines generated without the refactor and tests were
run to verify that the data model refactor maintained bfb answers

- ERI.f09_g17.ETEST.cheyenne_intel.cice-default
- ERI.f09_g17.ETEST.cheyenne_intel.cice-default
- ERI.f09_g17.ETEST.cheyenne_intel.cice-default
- ERI.T62_g16.C1850ECO.cheyenne_intel.pop-ecosys
- ERI.T62_g16.C.cheyenne_intel.pop-default
- ERI.T62_g16.CIAF.cheyenne_intel.pop-default
- ERI.T62_g16.GIAF.cheyenne_intel.pop-default
- ERI.T62_g17.DTEST.cheyenne_intel.cice-default
- ERI.T62_g17.G.cheyenne_gnu.pop-cice
- ERP_D_Ln9.f19_f19_mg17.QPC6.cheyenne_intel.cam-outfrq9s
- ERP_D_Ln9.f19_f19_mg17.QSC6.cheyenne_intel.cam-outfrq9s
- ERP_Ln9.f09_f09_mg17.F1850_DONOTUSE.cheyenne_intel.cam-outfrq9s
- ERP_Ln9.f09_f09_mg17.FHIST_DEV.cheyenne_intel.cam-outfrq9s
- ERP_Ln9.f19_f19_mg17.F2000_DEV.cheyenne_gnu.cam-outfrq9s
- ERR.f45_g37_rx1.A.cheyenne_intel
- ERS.f09_g16.ADLND.cheyenne_intel
- ERS_Ld3.f45_g37_rx1.A.cheyenne_intel
- ERS_Lm3.T62_g16.AIAF.cheyenne_intel
- ERS_Ly20_N2_P2.f09_g16_gl20.T1850G1.cheyenne_intel
- ERS_Ly3.f09_g16_gl4.T1850G.cheyenne_intel
- PEM.f19_g16_rx1.A.cheyenne_intel
- PEM.f19_g16_rx1.A.cheyenne_intel
- SMS.f09_g16.ADWAV.cheyenne_intel
Also ran most aux_clm5 tests and verified bfb with clm4_5_1_r251
Test baseline: see above
Test namelist changes:
Test status: none

Fixes #1713

User interface changes?: None

Update gh-pages html (Y/N)?: N

Code review: jedwards, rjacob
@minxu74
Copy link
Contributor

minxu74 commented Sep 20, 2017

@worleyph I ran the acme_developer test suite using your branch. There were two more failure tests compared with the results from cdash using next branch. The two tests are listed as follows and both of them have an active MPAS Land Ice model.
ERS.f09_g16_g.MPASLISIA.titan_pgi
SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.titan_pgi

The errors are also very similar. There are a lot of log.landice.????.err files in the RUN directory. The error message is "CRITICAL ERROR: An error has occurred in mpas_core_run. Aborting..."

@worleyph
Copy link
Contributor Author

@minxu74 , I never tested with active land ice. Sorry. I'll look at this now. There may be a simple fix for this PR (just re-add the istanbul cpu target for glc), and we can revisit why this is necessary next quarter. I'll give this a try first. Thanks.

Cases with active land ice are failing at runtime with the error:

> Advective CFL violation on this processor.  Maximum allowable time
> step for this processor is (Days_hhh:mmm:sss): 0000000000_000:000:000

Adding the istanbul cpu target back just for GLC eliminates the
error. A future task will be to determine which files in MPASLI
need this cpu target.
@worleyph
Copy link
Contributor Author

@minxu74 , I added the istanbul CPU target back for builds of GLC. With this change, the cases

 -compset MPASLISIA -res f09_g16_g

and

 -compset MPAS_LISIO_TEST -res T62_oQU120_ais20

completed successfully. (Without this change, I observed the same runtime error that you reported.) Please update your copy of this branch and rerun the acme_developer branch and see if this eliminates the failures.

Copy link
Contributor

@minxu74 minxu74 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks good to me

@worleyph
Copy link
Contributor Author

@mrnorman , this is still waiting on your review.

@worleyph
Copy link
Contributor Author

@minxu74 , what do you want to do about this? You and @mrnorman were concerned about the glboal application for the istanbul target, and this addresses it. Do you care any more?

I have not yet looked at pgiacc, in fact have not even tried to repeat your success (from using more OpenMP and less MPI per node).

@minxu74
Copy link
Contributor

minxu74 commented Oct 17, 2017

@worleyph I have not any concerns about this PR. I just wonder if @mrnorman has any concerns and wait for his approval.

Copy link
Contributor

@mrnorman mrnorman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good to me.

@mrnorman
Copy link
Contributor

Sorry yall were waiting on my review so long. I checked my e-mail, and I didn't appear to get any. Maybe I set a bad setting somewhere on github that I need to fix?

@worleyph
Copy link
Contributor Author

Thanks.

@jgfouca
Copy link
Member

jgfouca commented Oct 23, 2017

@mrnorman check your notifications. I've also had some epic struggles with GitHub not sending me important emails. GitHub has three email systems, one for notifications, one for at-mentions, and a third I can't remember. If one of those systems sends you an email and it bounces, that system will stop sending emails entirely until you contact GitHub and tell them to clear the bounce flag.

@rljacob
Copy link
Member

rljacob commented Oct 24, 2017

@minxu74 please start merging this.

jgfouca pushed a commit that referenced this pull request Oct 25, 2017
This is a part of PR #1767 submitted by Pat Worley, focusing on the
changes of the settings of pgi compiler in Titan.

As stated in PR #1767:
"Changes including removing the dependence on the version of pgi/17.5.0
installed in Dave Norton's directories, moving instead
to the now official OLCF installation. Also fixed a typo in the
module switch command for the PGI version used with PGIACC, and
changed the 'pin' flag to 'pinned' for PGIACC, as 'pin' is no
longer legal."

[Non-BFB] - Non Bit-For-Bit
@rljacob
Copy link
Member

rljacob commented Oct 26, 2017

@minxu74 Is your github email not working?

minxu74 added a commit that referenced this pull request Oct 27, 2017
Early on, it was determined that the PGI compiler required that
the CPU target be specified to be istanbul (instead of the actual
processor interlagos) in order for ACME (or CESM) to be
reproducible with respect to changes in process or thread counts.
Here this global specification of istanbul as the CPU target is
removed and only applied to the files that require it. Based
on experimentation, only CAM and MPASLI required
the modified CPU target, and we eventually identified a small
number of files within CAM for which this is required. We have not
yet examined these files to determine why these files require it.
We also have not yet looked at individual MPASLI files, and
instead continue to apply the istanbul cpu target for builds
of the entire GLC component.

Other changes include removing the dependence on the version of
pgi/17.5.0 installed in Dave Norton's directories, moving instead
to the now official OLCF installation. Also fixed a typo in the
module switch command for the PGI version used with PGIACC, and
changed the 'pin' flag to 'pinned' for PGIACC, as 'pin' is no
longer legal.

[Non-BFB] (on Titan when using PGI compiler)

Fixes #1620
Fixes #1610

(Note #1610 might need to be reopened, but this nominally addresses the primary issues.)

* worleyph/machines/Titan_pgi_updates:
  Add istanbul cpu target back for GLC component
  Restrict use of istanbul cpu target

Conflicts:
	cime/config/acme/machines/Depends.titan.pgiacc

Solved the conflict caused by changing the name of "solver_init_mod"
to "model_init_mod" in Depends.titan.pgiacc introduced by PR #1824
@minxu74 minxu74 merged commit 0aff617 into master Oct 27, 2017
jgfouca pushed a commit that referenced this pull request Nov 10, 2017
)

Early on, it was determined that the PGI compiler required that
the CPU target be specified to be istanbul (instead of the actual
processor interlagos) in order for ACME (or CESM) to be
reproducible with respect to changes in process or thread counts.
Here this global specification of istanbul as the CPU target is
removed and only applied to the files that require it. Based
on experimentation, only CAM and MPASLI required
the modified CPU target, and we eventually identified a small
number of files within CAM for which this is required. We have not
yet examined these files to determine why these files require it.
We also have not yet looked at individual MPASLI files, and
instead continue to apply the istanbul cpu target for builds
of the entire GLC component.

Other changes include removing the dependence on the version of
pgi/17.5.0 installed in Dave Norton's directories, moving instead
to the now official OLCF installation. Also fixed a typo in the
module switch command for the PGI version used with PGIACC, and
changed the 'pin' flag to 'pinned' for PGIACC, as 'pin' is no
longer legal.

[Non-BFB] (on Titan when using PGI compiler)

Fixes #1620
Fixes #1610

(Note #1610 might need to be reopened, but this nominally addresses the primary issues.)
@worleyph worleyph deleted the worleyph/machines/Titan_pgi_updates branch January 25, 2018 00:56
jgfouca pushed a commit that referenced this pull request Feb 27, 2018
This is a part of PR #1767 submitted by Pat Worley, focusing on the
changes of the settings of pgi compiler in Titan.

As stated in PR #1767:
"Changes including removing the dependence on the version of pgi/17.5.0
installed in Dave Norton's directories, moving instead
to the now official OLCF installation. Also fixed a typo in the
module switch command for the PGI version used with PGIACC, and
changed the 'pin' flag to 'pinned' for PGIACC, as 'pin' is no
longer legal."

[Non-BFB] - Non Bit-For-Bit
jgfouca pushed a commit that referenced this pull request Feb 27, 2018
)

Early on, it was determined that the PGI compiler required that
the CPU target be specified to be istanbul (instead of the actual
processor interlagos) in order for ACME (or CESM) to be
reproducible with respect to changes in process or thread counts.
Here this global specification of istanbul as the CPU target is
removed and only applied to the files that require it. Based
on experimentation, only CAM and MPASLI required
the modified CPU target, and we eventually identified a small
number of files within CAM for which this is required. We have not
yet examined these files to determine why these files require it.
We also have not yet looked at individual MPASLI files, and
instead continue to apply the istanbul cpu target for builds
of the entire GLC component.

Other changes include removing the dependence on the version of
pgi/17.5.0 installed in Dave Norton's directories, moving instead
to the now official OLCF installation. Also fixed a typo in the
module switch command for the PGI version used with PGIACC, and
changed the 'pin' flag to 'pinned' for PGIACC, as 'pin' is no
longer legal.

[Non-BFB] (on Titan when using PGI compiler)

Fixes #1620
Fixes #1610

(Note #1610 might need to be reopened, but this nominally addresses the primary issues.)
jgfouca pushed a commit that referenced this pull request Mar 14, 2018
This is a part of PR #1767 submitted by Pat Worley, focusing on the
changes of the settings of pgi compiler in Titan.

As stated in PR #1767:
"Changes including removing the dependence on the version of pgi/17.5.0
installed in Dave Norton's directories, moving instead
to the now official OLCF installation. Also fixed a typo in the
module switch command for the PGI version used with PGIACC, and
changed the 'pin' flag to 'pinned' for PGIACC, as 'pin' is no
longer legal."

[Non-BFB] - Non Bit-For-Bit
jgfouca pushed a commit that referenced this pull request Mar 14, 2018
)

Early on, it was determined that the PGI compiler required that
the CPU target be specified to be istanbul (instead of the actual
processor interlagos) in order for ACME (or CESM) to be
reproducible with respect to changes in process or thread counts.
Here this global specification of istanbul as the CPU target is
removed and only applied to the files that require it. Based
on experimentation, only CAM and MPASLI required
the modified CPU target, and we eventually identified a small
number of files within CAM for which this is required. We have not
yet examined these files to determine why these files require it.
We also have not yet looked at individual MPASLI files, and
instead continue to apply the istanbul cpu target for builds
of the entire GLC component.

Other changes include removing the dependence on the version of
pgi/17.5.0 installed in Dave Norton's directories, moving instead
to the now official OLCF installation. Also fixed a typo in the
module switch command for the PGI version used with PGIACC, and
changed the 'pin' flag to 'pinned' for PGIACC, as 'pin' is no
longer legal.

[Non-BFB] (on Titan when using PGI compiler)

Fixes #1620
Fixes #1610

(Note #1610 might need to be reopened, but this nominally addresses the primary issues.)
rljacob pushed a commit that referenced this pull request May 6, 2021
)

Early on, it was determined that the PGI compiler required that
the CPU target be specified to be istanbul (instead of the actual
processor interlagos) in order for ACME (or CESM) to be
reproducible with respect to changes in process or thread counts.
Here this global specification of istanbul as the CPU target is
removed and only applied to the files that require it. Based
on experimentation, only CAM and MPASLI required
the modified CPU target, and we eventually identified a small
number of files within CAM for which this is required. We have not
yet examined these files to determine why these files require it.
We also have not yet looked at individual MPASLI files, and
instead continue to apply the istanbul cpu target for builds
of the entire GLC component.

Other changes include removing the dependence on the version of
pgi/17.5.0 installed in Dave Norton's directories, moving instead
to the now official OLCF installation. Also fixed a typo in the
module switch command for the PGI version used with PGIACC, and
changed the 'pin' flag to 'pinned' for PGIACC, as 'pin' is no
longer legal.

[Non-BFB] (on Titan when using PGI compiler)

Fixes #1620
Fixes #1610

(Note #1610 might need to be reopened, but this nominally addresses the primary issues.)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug fix PR Machine Files non-BFB PR makes roundoff changes to answers. Titan
Projects
None yet
6 participants