Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updating PGI version on Titan? (discussion question) #1610

Closed
worleyph opened this issue Jul 3, 2017 · 20 comments · Fixed by #1767
Closed

Updating PGI version on Titan? (discussion question) #1610

worleyph opened this issue Jul 3, 2017 · 20 comments · Fixed by #1767

Comments

@worleyph
Copy link
Contributor

worleyph commented Jul 3, 2017

PGI (via Dave Norton, and with significant help from @mrnorman ) determined why we have not been able to use PGI versions 16 and 17 with ACME. A simple work around was determined (dropping intent(in) from two pointer variables in both add_field_1D and add_field_2D in micro_mg_data.F90). Dave Norton indicated that the problem will be fixed in a released version of the PGI compiler soon.

I went ahead and looked at performance of 17.3.0 as compared to 15.7.0 with PGI (not PGIACC) on Titan, with the current optimization flags, and also without specifying the CPU to be Istanbul (required for reproducibiluty for 15.7.0) and also without specifying -Kieee (also required for reproducibility, I believe), so 4 combinations for each compiler version. I also ran with most I/O disabled. The case was

 -compset A_WCYCL2000 -res ne30_oEC -mach titan -compiler pgi

with the default PE layout (which is an optimized one).

 istanbul_ieee_17.3.0:           1.25 myears/wday
 istanbul_ieee_15.7.0:           1.39 myears/wday
 unmod_istanbul_ieee_15.7.0:     1.37 myears/wday

 istanbul_noieee_17.3.0:         1.41 myears/wday
 istanbul_noieee_15.7.0:         1.44 myears/wday

 noistanbul_ieee_17.3.0:         1.30 myears/wday
 noistanbul_ieee_15.7.0:         1.41 myears/wday

 noistanbul_noieee_17.3.0:       1.42 myears/wday
 noistanbul_noieee_15.7.0:       1.57 myears/wday
 unmod_noistanbul_noieee_15.7.0: 1.57 myears/wday

(Here 'unmod' means without removing the intent(in) declarations for the two pointer variables.)

So, there is no performance advantage to moving to the newer version of PGI (just the opposite). This is with the vanilla '-O2' optimization. Dropping the reproducibility requirement (istanbul, ieee) increases performance by 13% when applied globally, for this particular case.

So, while moving to a more recent version of the PGI compiler is generally a good thing, there is a downside here that needs to be evaluated. I'll try PGIACC next, though no one is using that for production runs at the moment. (In fact, the only production runs are G cases, using the Intel compiler, which is known not to be reproducible with the current default settings on Titan?)

We should also look at the same fix and do the same evaluation on other systems, e.g. Anvil.

@minxu74
Copy link
Contributor

minxu74 commented Jul 5, 2017

@worleyph I have done tests for PGI/17.3 with istanbul or interlagos targets one month ago and found that it still failed in reproducibility for both targets. Did you check this in your experiments?

@philipwjones
Copy link
Contributor

BTW, I thought intent attributes were forbidden for pointers (at least up to F95) - was this added later? Otherwise, seems like non-standard Fortran.

@worleyph
Copy link
Contributor Author

worleyph commented Jul 5, 2017

@minxu74 , I have not yet checked reproducibility. Was just looking at performance. Perhaps updating has another hurdle to address.

@worleyph
Copy link
Contributor Author

worleyph commented Jul 5, 2017

@philipwjones , I haven't a clue about intent + pointers. This code has been in there awhile. @rljacob , do you have any information?

@rljacob
Copy link
Member

rljacob commented Jul 5, 2017

You'll have to ask @wlin7 about code in micro_mg_data.F90

@amametjanov
Copy link
Member

amametjanov commented Jul 5, 2017

I think there was a discussion of whether Anvil needs to provide PGI compiler and because regression tests were resumed on Titan the need for Anvil+PGI is of lower priority. Among all combinations of {intel,gnu,pgi}x{mvapich,mpich,openmpi}, intel+mvapich is the higher-performing and most tested one (also default).

@worleyph
Copy link
Contributor Author

worleyph commented Jul 6, 2017

@amametjanov , Thanks. I could not remember the final decision on this. SInce there has not been a version of PGI installed on Anvil that worked with ACME, the discussion was somewhat hypothetical. We could get PGI working on Anvil now, but I am not arguing for this.

@rljacob
Copy link
Member

rljacob commented Jul 6, 2017

@jayeshkrishna has a task to update the PGi on Anvil/Blues. https://acme-climate.atlassian.net/browse/S2-66

@worleyph
Copy link
Contributor Author

worleyph commented Jul 6, 2017

@minxu74 , My tests for reproducibility with 17.3.0 (with the current defaults: istanbul and ieee) have been successful. It is reproducible with respect to process and thread count changes (A_WCYCL_2000, ne30_oEC). I am continuing to do a few more tests.

If I had to guess, your failure may be from comparing a run with MPI-only and a run with threading. Since specifying threading also changes some of the compiler flags, for this comparison I change env_build.xml to set BUILD_THREADED to TRUE even for the pure MPI case. You might repeat your experiment with this setting.

@minxu74
Copy link
Contributor

minxu74 commented Jul 6, 2017

@worleyph Thanks a lot. Yes. You are right. I conducted PET test before. I will try to turn BUILD_THREADED on.

@worleyph
Copy link
Contributor Author

@minxu74 and @mrnorman , the latest PR ( #1619 ) perhaps was premature. Moving from 15.7.0 to 17.5.0 slows down the code for pgi, and and pgiacc still does not work, even with 17.5.0. I could see leaving '-compiler pgi' at 15.7.0 as long as the OLCF allows us to or until the newer versions are faster in comparison, and only move to 17.5.0 (or higher) for pgiacc.

@mrnorman
Copy link
Contributor

There's another issue with this. We have three tests we're trying to run for ECP on ACME master, and one of them segfaults with 15.7 and another fails ERS with 15.7. I'm running a test now with 17.5 to see if that fixes the issue. If it does, I say we stick with it even with the slowdown. I'm running the tests now. I'll update on their status tomorrow.

@mrnorman
Copy link
Contributor

mrnorman commented Jul 11, 2017

We get the same errors on 17.5 as we do on 15.7, so I agree, let's go back to 15.7 for pgi for now. On Titan with PGI 15.7 and 17.5, we get the following test failures for ECP:

ERP_Ld5_P900.ne30_ne30.FC5AV1C.titan_pgi passes just fine
ERP_Ld31_P32.ne4_ne4.FC5AV1C.titan_pgi fails exact restart diff
ERP_Ld31_P32.ne4_oQU240.A_WCYCL2000.titan_pgi segfaults during the run

@sarats
Copy link
Member

sarats commented Jul 14, 2017

"dropping intent(in) from two pointer variables in both add_field_1D and add_field_2D in micro_mg_data.F90"

I have done this in my local branch. Is this already present in some branch/master?

@worleyph
Copy link
Contributor Author

@sarats , no code has been changed. pgi/17.5.0 makes this unnecessary (and it is also unnecessary for pgi/15.7.0).

@sarats
Copy link
Member

sarats commented Jul 14, 2017

Ok thanks, I'm trying to diagnose crashes on Summitdev with PGI. 15.7 is not available there and unfortunately not all the modules are built with 17.5.
So I'm using 17.4. Even with the workaround, application crashes with FC5AV1C and ne4_ne4. I need to dig further.

@worleyph
Copy link
Contributor Author

On Titan, pgi/17.3.0 plus the intent(in) changes were enough to get ne30_oEC A_WCYCL cases to work (with all of the usual flags).

jgfouca pushed a commit that referenced this issue Jul 14, 2017
Update testreporter and change hobart queue to medium.
Update testreporter.py to handle compare failures that were being missed.
Remove tagname from the testdb comments that were added to the GENERATE and
BASLINE lines in TestStatus.
Change the default queue on hobart from short to medium to handle tests that were
running a little long.

Test suite: scripts_regression_tests.pr, populated testdb for alpha06m
Test baseline:
Test namelist changes:
Test status: bit for bit,

Fixes #1555

User interface changes?:

Code review:jedwards
@mrnorman
Copy link
Contributor

Pat, that's good to know that 17.3 is working on Titan.

I'm also getting segfaults when I try 17.4 and 17.5 on my desktop. Last time I checked CDash, the segfaults had suddenly stopped on Titan. I'm not sure why, but maybe they're intermittent? I hope that's not the case.

@worleyph
Copy link
Contributor Author

Matt, pgi/17.5.0 also works for me on Titan (when using PGI, not PGIACC). I have not seen this seg. fault.

@worleyph
Copy link
Contributor Author

This was fixed by PR #1768 .

minxu74 added a commit that referenced this issue Oct 27, 2017
Early on, it was determined that the PGI compiler required that
the CPU target be specified to be istanbul (instead of the actual
processor interlagos) in order for ACME (or CESM) to be
reproducible with respect to changes in process or thread counts.
Here this global specification of istanbul as the CPU target is
removed and only applied to the files that require it. Based
on experimentation, only CAM and MPASLI required
the modified CPU target, and we eventually identified a small
number of files within CAM for which this is required. We have not
yet examined these files to determine why these files require it.
We also have not yet looked at individual MPASLI files, and
instead continue to apply the istanbul cpu target for builds
of the entire GLC component.

Other changes include removing the dependence on the version of
pgi/17.5.0 installed in Dave Norton's directories, moving instead
to the now official OLCF installation. Also fixed a typo in the
module switch command for the PGI version used with PGIACC, and
changed the 'pin' flag to 'pinned' for PGIACC, as 'pin' is no
longer legal.

[Non-BFB] (on Titan when using PGI compiler)

Fixes #1620
Fixes #1610

(Note #1610 might need to be reopened, but this nominally addresses the primary issues.)

* worleyph/machines/Titan_pgi_updates:
  Add istanbul cpu target back for GLC component
  Restrict use of istanbul cpu target

Conflicts:
	cime/config/acme/machines/Depends.titan.pgiacc

Solved the conflict caused by changing the name of "solver_init_mod"
to "model_init_mod" in Depends.titan.pgiacc introduced by PR #1824
minxu74 added a commit that referenced this issue Oct 27, 2017
)

Early on, it was determined that the PGI compiler required that
the CPU target be specified to be istanbul (instead of the actual
processor interlagos) in order for ACME (or CESM) to be
reproducible with respect to changes in process or thread counts.
Here this global specification of istanbul as the CPU target is
removed and only applied to the files that require it. Based
on experimentation, only CAM and MPASLI required
the modified CPU target, and we eventually identified a small
number of files within CAM for which this is required. We have not
yet examined these files to determine why these files require it.
We also have not yet looked at individual MPASLI files, and
instead continue to apply the istanbul cpu target for builds
of the entire GLC component.

Other changes include removing the dependence on the version of
pgi/17.5.0 installed in Dave Norton's directories, moving instead
to the now official OLCF installation. Also fixed a typo in the
module switch command for the PGI version used with PGIACC, and
changed the 'pin' flag to 'pinned' for PGIACC, as 'pin' is no
longer legal.

[Non-BFB] (on Titan when using PGI compiler)

Fixes #1620
Fixes #1610

(Note #1610 might need to be reopened, but this nominally addresses the primary issues.)
jgfouca pushed a commit that referenced this issue Nov 10, 2017
)

Early on, it was determined that the PGI compiler required that
the CPU target be specified to be istanbul (instead of the actual
processor interlagos) in order for ACME (or CESM) to be
reproducible with respect to changes in process or thread counts.
Here this global specification of istanbul as the CPU target is
removed and only applied to the files that require it. Based
on experimentation, only CAM and MPASLI required
the modified CPU target, and we eventually identified a small
number of files within CAM for which this is required. We have not
yet examined these files to determine why these files require it.
We also have not yet looked at individual MPASLI files, and
instead continue to apply the istanbul cpu target for builds
of the entire GLC component.

Other changes include removing the dependence on the version of
pgi/17.5.0 installed in Dave Norton's directories, moving instead
to the now official OLCF installation. Also fixed a typo in the
module switch command for the PGI version used with PGIACC, and
changed the 'pin' flag to 'pinned' for PGIACC, as 'pin' is no
longer legal.

[Non-BFB] (on Titan when using PGI compiler)

Fixes #1620
Fixes #1610

(Note #1610 might need to be reopened, but this nominally addresses the primary issues.)
jgfouca pushed a commit that referenced this issue Feb 27, 2018
)

Early on, it was determined that the PGI compiler required that
the CPU target be specified to be istanbul (instead of the actual
processor interlagos) in order for ACME (or CESM) to be
reproducible with respect to changes in process or thread counts.
Here this global specification of istanbul as the CPU target is
removed and only applied to the files that require it. Based
on experimentation, only CAM and MPASLI required
the modified CPU target, and we eventually identified a small
number of files within CAM for which this is required. We have not
yet examined these files to determine why these files require it.
We also have not yet looked at individual MPASLI files, and
instead continue to apply the istanbul cpu target for builds
of the entire GLC component.

Other changes include removing the dependence on the version of
pgi/17.5.0 installed in Dave Norton's directories, moving instead
to the now official OLCF installation. Also fixed a typo in the
module switch command for the PGI version used with PGIACC, and
changed the 'pin' flag to 'pinned' for PGIACC, as 'pin' is no
longer legal.

[Non-BFB] (on Titan when using PGI compiler)

Fixes #1620
Fixes #1610

(Note #1610 might need to be reopened, but this nominally addresses the primary issues.)
jgfouca pushed a commit that referenced this issue Mar 14, 2018
)

Early on, it was determined that the PGI compiler required that
the CPU target be specified to be istanbul (instead of the actual
processor interlagos) in order for ACME (or CESM) to be
reproducible with respect to changes in process or thread counts.
Here this global specification of istanbul as the CPU target is
removed and only applied to the files that require it. Based
on experimentation, only CAM and MPASLI required
the modified CPU target, and we eventually identified a small
number of files within CAM for which this is required. We have not
yet examined these files to determine why these files require it.
We also have not yet looked at individual MPASLI files, and
instead continue to apply the istanbul cpu target for builds
of the entire GLC component.

Other changes include removing the dependence on the version of
pgi/17.5.0 installed in Dave Norton's directories, moving instead
to the now official OLCF installation. Also fixed a typo in the
module switch command for the PGI version used with PGIACC, and
changed the 'pin' flag to 'pinned' for PGIACC, as 'pin' is no
longer legal.

[Non-BFB] (on Titan when using PGI compiler)

Fixes #1620
Fixes #1610

(Note #1610 might need to be reopened, but this nominally addresses the primary issues.)
rljacob pushed a commit that referenced this issue May 6, 2021
)

Early on, it was determined that the PGI compiler required that
the CPU target be specified to be istanbul (instead of the actual
processor interlagos) in order for ACME (or CESM) to be
reproducible with respect to changes in process or thread counts.
Here this global specification of istanbul as the CPU target is
removed and only applied to the files that require it. Based
on experimentation, only CAM and MPASLI required
the modified CPU target, and we eventually identified a small
number of files within CAM for which this is required. We have not
yet examined these files to determine why these files require it.
We also have not yet looked at individual MPASLI files, and
instead continue to apply the istanbul cpu target for builds
of the entire GLC component.

Other changes include removing the dependence on the version of
pgi/17.5.0 installed in Dave Norton's directories, moving instead
to the now official OLCF installation. Also fixed a typo in the
module switch command for the PGI version used with PGIACC, and
changed the 'pin' flag to 'pinned' for PGIACC, as 'pin' is no
longer legal.

[Non-BFB] (on Titan when using PGI compiler)

Fixes #1620
Fixes #1610

(Note #1610 might need to be reopened, but this nominally addresses the primary issues.)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants