-
Notifications
You must be signed in to change notification settings - Fork 383
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updating PGI version on Titan? (discussion question) #1610
Comments
@worleyph I have done tests for PGI/17.3 with istanbul or interlagos targets one month ago and found that it still failed in reproducibility for both targets. Did you check this in your experiments? |
BTW, I thought intent attributes were forbidden for pointers (at least up to F95) - was this added later? Otherwise, seems like non-standard Fortran. |
@minxu74 , I have not yet checked reproducibility. Was just looking at performance. Perhaps updating has another hurdle to address. |
@philipwjones , I haven't a clue about intent + pointers. This code has been in there awhile. @rljacob , do you have any information? |
You'll have to ask @wlin7 about code in micro_mg_data.F90 |
I think there was a discussion of whether Anvil needs to provide PGI compiler and because regression tests were resumed on Titan the need for Anvil+PGI is of lower priority. Among all combinations of |
@amametjanov , Thanks. I could not remember the final decision on this. SInce there has not been a version of PGI installed on Anvil that worked with ACME, the discussion was somewhat hypothetical. We could get PGI working on Anvil now, but I am not arguing for this. |
@jayeshkrishna has a task to update the PGi on Anvil/Blues. https://acme-climate.atlassian.net/browse/S2-66 |
@minxu74 , My tests for reproducibility with 17.3.0 (with the current defaults: istanbul and ieee) have been successful. It is reproducible with respect to process and thread count changes (A_WCYCL_2000, ne30_oEC). I am continuing to do a few more tests. If I had to guess, your failure may be from comparing a run with MPI-only and a run with threading. Since specifying threading also changes some of the compiler flags, for this comparison I change env_build.xml to set BUILD_THREADED to TRUE even for the pure MPI case. You might repeat your experiment with this setting. |
@worleyph Thanks a lot. Yes. You are right. I conducted PET test before. I will try to turn BUILD_THREADED on. |
@minxu74 and @mrnorman , the latest PR ( #1619 ) perhaps was premature. Moving from 15.7.0 to 17.5.0 slows down the code for pgi, and and pgiacc still does not work, even with 17.5.0. I could see leaving '-compiler pgi' at 15.7.0 as long as the OLCF allows us to or until the newer versions are faster in comparison, and only move to 17.5.0 (or higher) for pgiacc. |
There's another issue with this. We have three tests we're trying to run for ECP on ACME master, and one of them segfaults with 15.7 and another fails ERS with 15.7. I'm running a test now with 17.5 to see if that fixes the issue. If it does, I say we stick with it even with the slowdown. I'm running the tests now. I'll update on their status tomorrow. |
We get the same errors on 17.5 as we do on 15.7, so I agree, let's go back to 15.7 for pgi for now. On Titan with PGI 15.7 and 17.5, we get the following test failures for ECP: ERP_Ld5_P900.ne30_ne30.FC5AV1C.titan_pgi passes just fine |
"dropping intent(in) from two pointer variables in both add_field_1D and add_field_2D in micro_mg_data.F90" I have done this in my local branch. Is this already present in some branch/master? |
@sarats , no code has been changed. pgi/17.5.0 makes this unnecessary (and it is also unnecessary for pgi/15.7.0). |
Ok thanks, I'm trying to diagnose crashes on Summitdev with PGI. 15.7 is not available there and unfortunately not all the modules are built with 17.5. |
On Titan, pgi/17.3.0 plus the intent(in) changes were enough to get ne30_oEC A_WCYCL cases to work (with all of the usual flags). |
Update testreporter and change hobart queue to medium. Update testreporter.py to handle compare failures that were being missed. Remove tagname from the testdb comments that were added to the GENERATE and BASLINE lines in TestStatus. Change the default queue on hobart from short to medium to handle tests that were running a little long. Test suite: scripts_regression_tests.pr, populated testdb for alpha06m Test baseline: Test namelist changes: Test status: bit for bit, Fixes #1555 User interface changes?: Code review:jedwards
Pat, that's good to know that 17.3 is working on Titan. I'm also getting segfaults when I try 17.4 and 17.5 on my desktop. Last time I checked CDash, the segfaults had suddenly stopped on Titan. I'm not sure why, but maybe they're intermittent? I hope that's not the case. |
Matt, pgi/17.5.0 also works for me on Titan (when using PGI, not PGIACC). I have not seen this seg. fault. |
This was fixed by PR #1768 . |
Early on, it was determined that the PGI compiler required that the CPU target be specified to be istanbul (instead of the actual processor interlagos) in order for ACME (or CESM) to be reproducible with respect to changes in process or thread counts. Here this global specification of istanbul as the CPU target is removed and only applied to the files that require it. Based on experimentation, only CAM and MPASLI required the modified CPU target, and we eventually identified a small number of files within CAM for which this is required. We have not yet examined these files to determine why these files require it. We also have not yet looked at individual MPASLI files, and instead continue to apply the istanbul cpu target for builds of the entire GLC component. Other changes include removing the dependence on the version of pgi/17.5.0 installed in Dave Norton's directories, moving instead to the now official OLCF installation. Also fixed a typo in the module switch command for the PGI version used with PGIACC, and changed the 'pin' flag to 'pinned' for PGIACC, as 'pin' is no longer legal. [Non-BFB] (on Titan when using PGI compiler) Fixes #1620 Fixes #1610 (Note #1610 might need to be reopened, but this nominally addresses the primary issues.) * worleyph/machines/Titan_pgi_updates: Add istanbul cpu target back for GLC component Restrict use of istanbul cpu target Conflicts: cime/config/acme/machines/Depends.titan.pgiacc Solved the conflict caused by changing the name of "solver_init_mod" to "model_init_mod" in Depends.titan.pgiacc introduced by PR #1824
) Early on, it was determined that the PGI compiler required that the CPU target be specified to be istanbul (instead of the actual processor interlagos) in order for ACME (or CESM) to be reproducible with respect to changes in process or thread counts. Here this global specification of istanbul as the CPU target is removed and only applied to the files that require it. Based on experimentation, only CAM and MPASLI required the modified CPU target, and we eventually identified a small number of files within CAM for which this is required. We have not yet examined these files to determine why these files require it. We also have not yet looked at individual MPASLI files, and instead continue to apply the istanbul cpu target for builds of the entire GLC component. Other changes include removing the dependence on the version of pgi/17.5.0 installed in Dave Norton's directories, moving instead to the now official OLCF installation. Also fixed a typo in the module switch command for the PGI version used with PGIACC, and changed the 'pin' flag to 'pinned' for PGIACC, as 'pin' is no longer legal. [Non-BFB] (on Titan when using PGI compiler) Fixes #1620 Fixes #1610 (Note #1610 might need to be reopened, but this nominally addresses the primary issues.)
) Early on, it was determined that the PGI compiler required that the CPU target be specified to be istanbul (instead of the actual processor interlagos) in order for ACME (or CESM) to be reproducible with respect to changes in process or thread counts. Here this global specification of istanbul as the CPU target is removed and only applied to the files that require it. Based on experimentation, only CAM and MPASLI required the modified CPU target, and we eventually identified a small number of files within CAM for which this is required. We have not yet examined these files to determine why these files require it. We also have not yet looked at individual MPASLI files, and instead continue to apply the istanbul cpu target for builds of the entire GLC component. Other changes include removing the dependence on the version of pgi/17.5.0 installed in Dave Norton's directories, moving instead to the now official OLCF installation. Also fixed a typo in the module switch command for the PGI version used with PGIACC, and changed the 'pin' flag to 'pinned' for PGIACC, as 'pin' is no longer legal. [Non-BFB] (on Titan when using PGI compiler) Fixes #1620 Fixes #1610 (Note #1610 might need to be reopened, but this nominally addresses the primary issues.)
) Early on, it was determined that the PGI compiler required that the CPU target be specified to be istanbul (instead of the actual processor interlagos) in order for ACME (or CESM) to be reproducible with respect to changes in process or thread counts. Here this global specification of istanbul as the CPU target is removed and only applied to the files that require it. Based on experimentation, only CAM and MPASLI required the modified CPU target, and we eventually identified a small number of files within CAM for which this is required. We have not yet examined these files to determine why these files require it. We also have not yet looked at individual MPASLI files, and instead continue to apply the istanbul cpu target for builds of the entire GLC component. Other changes include removing the dependence on the version of pgi/17.5.0 installed in Dave Norton's directories, moving instead to the now official OLCF installation. Also fixed a typo in the module switch command for the PGI version used with PGIACC, and changed the 'pin' flag to 'pinned' for PGIACC, as 'pin' is no longer legal. [Non-BFB] (on Titan when using PGI compiler) Fixes #1620 Fixes #1610 (Note #1610 might need to be reopened, but this nominally addresses the primary issues.)
) Early on, it was determined that the PGI compiler required that the CPU target be specified to be istanbul (instead of the actual processor interlagos) in order for ACME (or CESM) to be reproducible with respect to changes in process or thread counts. Here this global specification of istanbul as the CPU target is removed and only applied to the files that require it. Based on experimentation, only CAM and MPASLI required the modified CPU target, and we eventually identified a small number of files within CAM for which this is required. We have not yet examined these files to determine why these files require it. We also have not yet looked at individual MPASLI files, and instead continue to apply the istanbul cpu target for builds of the entire GLC component. Other changes include removing the dependence on the version of pgi/17.5.0 installed in Dave Norton's directories, moving instead to the now official OLCF installation. Also fixed a typo in the module switch command for the PGI version used with PGIACC, and changed the 'pin' flag to 'pinned' for PGIACC, as 'pin' is no longer legal. [Non-BFB] (on Titan when using PGI compiler) Fixes #1620 Fixes #1610 (Note #1610 might need to be reopened, but this nominally addresses the primary issues.)
) Early on, it was determined that the PGI compiler required that the CPU target be specified to be istanbul (instead of the actual processor interlagos) in order for ACME (or CESM) to be reproducible with respect to changes in process or thread counts. Here this global specification of istanbul as the CPU target is removed and only applied to the files that require it. Based on experimentation, only CAM and MPASLI required the modified CPU target, and we eventually identified a small number of files within CAM for which this is required. We have not yet examined these files to determine why these files require it. We also have not yet looked at individual MPASLI files, and instead continue to apply the istanbul cpu target for builds of the entire GLC component. Other changes include removing the dependence on the version of pgi/17.5.0 installed in Dave Norton's directories, moving instead to the now official OLCF installation. Also fixed a typo in the module switch command for the PGI version used with PGIACC, and changed the 'pin' flag to 'pinned' for PGIACC, as 'pin' is no longer legal. [Non-BFB] (on Titan when using PGI compiler) Fixes #1620 Fixes #1610 (Note #1610 might need to be reopened, but this nominally addresses the primary issues.)
PGI (via Dave Norton, and with significant help from @mrnorman ) determined why we have not been able to use PGI versions 16 and 17 with ACME. A simple work around was determined (dropping intent(in) from two pointer variables in both add_field_1D and add_field_2D in micro_mg_data.F90). Dave Norton indicated that the problem will be fixed in a released version of the PGI compiler soon.
I went ahead and looked at performance of 17.3.0 as compared to 15.7.0 with PGI (not PGIACC) on Titan, with the current optimization flags, and also without specifying the CPU to be Istanbul (required for reproducibiluty for 15.7.0) and also without specifying -Kieee (also required for reproducibility, I believe), so 4 combinations for each compiler version. I also ran with most I/O disabled. The case was
with the default PE layout (which is an optimized one).
(Here 'unmod' means without removing the intent(in) declarations for the two pointer variables.)
So, there is no performance advantage to moving to the newer version of PGI (just the opposite). This is with the vanilla '-O2' optimization. Dropping the reproducibility requirement (istanbul, ieee) increases performance by 13% when applied globally, for this particular case.
So, while moving to a more recent version of the PGI compiler is generally a good thing, there is a downside here that needs to be evaluated. I'll try PGIACC next, though no one is using that for production runs at the moment. (In fact, the only production runs are G cases, using the Intel compiler, which is known not to be reproducible with the current default settings on Titan?)
We should also look at the same fix and do the same evaluation on other systems, e.g. Anvil.
The text was updated successfully, but these errors were encountered: