
Refactoring density integrals for efficiency #591

Merged

Conversation

Hallberg-NOAA
Member

Refactored 4 routines (int_density_generic_pcm(), int_density_generic_ppm(), int_spec_vol_generic_pcm() and int_spec_vol_generic_plm()) in the density integrals for greater computational efficiency by making fewer calls to the equation of state routines: entire rows of densities at subgrid locations are now calculated with each call, replicating what was already being done in int_density_dz_generic_plm(). To accomplish this, a number of variables now use larger arrays than previously. The total computational cost of the non-Boussinesq pressure gradient force calculation was more than 50% greater with the previous code in some tests. All answers are bitwise identical.
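
As a rough illustration of the pattern (a minimal, self-contained Fortran sketch with a toy linear EOS and hypothetical names such as calc_density_row; the real MOM6 interfaces take additional arguments), the values at all of the quadrature points of every cell in an i-row are packed into long 1-D arrays so that the equation of state is evaluated once per row rather than once per point:

  ! Illustrative sketch only (hypothetical names, toy EOS; not the actual MOM6 code).
  module eos_demo
    implicit none
  contains
    ! Stand-in for an equation-of-state routine that works on 1-D arrays.
    subroutine calc_density_row(T, S, p, rho)
      real, intent(in)  :: T(:), S(:), p(:)
      real, intent(out) :: rho(:)
      rho(:) = 1000.0 + 0.78*S(:) - 0.2*T(:) + 4.0e-7*p(:)  ! toy linear EOS
    end subroutine calc_density_row
  end module eos_demo

  program packed_row_demo
    use eos_demo, only : calc_density_row
    implicit none
    integer, parameter :: ni = 8, nq = 5   ! cells in a row, quadrature points per cell
    real :: T(ni), S(ni), p(ni)
    real :: T5(nq*ni), S5(nq*ni), p5(nq*ni), r5(nq*ni)
    integer :: i, m, pos

    T = 5.0 ; S = 35.0 ; p = 2.0e7

    ! Pack all quadrature points of the row; with PCM the cell values are simply
    ! replicated, whereas a PLM/PPM scheme would interpolate instead.
    do i = 1, ni
      pos = nq*(i-1)
      do m = 1, nq
        T5(pos+m) = T(i) ; S5(pos+m) = S(i) ; p5(pos+m) = p(i)
      enddo
    enddo

    call calc_density_row(T5, S5, p5, r5)   ! one EOS call for the whole row
    print *, 'densities at the first cell quadrature points:', r5(1:nq)
  end program packed_row_demo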

@Hallberg-NOAA Hallberg-NOAA added the refactor Code cleanup with no changes in functionality or results label Mar 31, 2024

codecov bot commented Mar 31, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 37.08%. Comparing base (01b2ea9) to head (1797f52).

❗ Current head 1797f52 differs from pull request most recent head 9d90e04. Consider uploading reports for the commit 9d90e04 to get more accurate results

Additional details and impacted files
@@             Coverage Diff              @@
##           dev/gfdl     #591      +/-   ##
============================================
+ Coverage     35.93%   37.08%   +1.14%     
============================================
  Files           268      271       +3     
  Lines         79919    80897     +978     
  Branches      15106    15075      -31     
============================================
+ Hits          28722    29998    +1276     
+ Misses        45523    45299     -224     
+ Partials       5674     5600      -74     


@Hallberg-NOAA Hallberg-NOAA force-pushed the efficient_density_integrals branch from 7cff497 to 1797f52 Compare March 31, 2024 19:17
@marshallward
Member

I would guess that you would see equivalent performance with 2D arrays like T_N(1:N,iscB+c1:isec+c2) rather than the staggered 5-tuple and 15-tuple 1D arrays introduced here, which add a lot of complexity. Although the 2D array is not required to be contiguous, there is hardly a compiler in the world where this is not true. (If needed, the contiguous keyword could be added to calculate_density, calculate_spec_vol, etc., although I would not recommend it.)

I am guessing that would require new 2D interfaces for calculate_density etc.

There might be further benefit to breaking out of the single i-loop and doing this over the full i-/j-arrays, although that would also require 3D interfaces.
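
To make the layout point concrete (a hedged sketch with hypothetical names and a toy EOS, not the actual calculate_density interface): in Fortran's column-major storage a 2-D array Tq(1:nq,isc:iec) occupies the same contiguous memory as a packed 1-D array of length nq*(iec-isc+1), so a rank-2 interface, optionally declared contiguous, could in principle see exactly the same data without copying:

  ! Illustrative sketch only (hypothetical names, toy EOS; not the actual MOM6 code).
  module eos_demo_2d
    implicit none
  contains
    subroutine calc_density_2d(T, S, p, rho)
      ! Rank-2, contiguous dummies standing in for a hypothetical 2-D EOS interface.
      real, contiguous, intent(in)  :: T(:,:), S(:,:), p(:,:)
      real, contiguous, intent(out) :: rho(:,:)
      rho = 1000.0 + 0.78*S - 0.2*T + 4.0e-7*p   ! toy linear EOS
    end subroutine calc_density_2d
  end module eos_demo_2d

  program layout_demo
    use eos_demo_2d, only : calc_density_2d
    implicit none
    integer, parameter :: nq = 5, isc = 1, iec = 8
    real :: Tq(nq,isc:iec), Sq(nq,isc:iec), pq(nq,isc:iec), rq(nq,isc:iec)
    Tq = 5.0 ; Sq = 35.0 ; pq = 2.0e7
    call calc_density_2d(Tq, Sq, pq, rq)   ! one call covers the whole row
    print *, 'rho range:', minval(rq), maxval(rq)
  end program layout_demo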

Member

@marshallward left a comment


While refactoring this PR, I found a few issues in the dz PCM integral, and another likely issue in the dz PPM when the Stanley EOS is enabled. These would not be detected by our current testing, since AFAIK our layered tests use analytic EOS integration rather than PCM quadrature.

These were detected in the benchmark test with EOS_QUADRATURE turned on.

Although my changes are complete, we will apply them on top of this PR rather than rewriting this one.

@Hallberg-NOAA Hallberg-NOAA force-pushed the efficient_density_integrals branch from 1797f52 to bae962b Compare April 26, 2024 08:56
@Hallberg-NOAA
Member Author

Thank you for spotting these bugs. I have updated this commit to address the 4 bugs noted above, plus two others that I found after expanding my testing to set EOS_QUADRATURE = True.

  Refactored 4 routines (int_density_generic_pcm, int_density_generic_ppm,
int_spec_vol_generic_pcm and int_spec_vol_generic_plm) in density integrals for
greater computational efficiency by doing fewer calls to the equation of state
routines but calculating entire rows of densities at subgrid locations with
each call, replicating what was already being done in int_density_dz_generic_plm.
To accomplish this, a number of variables now use larger arrays than
previously.  The total computational cost of the non-Boussinesq pressure
gradient force calculation was more than 50% greater with the previous code in
some tests.  All answers are bitwise identical.
@Hallberg-NOAA Hallberg-NOAA force-pushed the efficient_density_integrals branch from bae962b to 9d90e04 Compare April 26, 2024 09:01
@marshallward
Member

Could you document the other two changes that were needed here? I will need to confirm that they were fixed in my changes.

@Hallberg-NOAA
Member Author

The other two bugs in the original pull request were an incorrect memory declaration for S5 in int_spec_vol_dp_generic_pcm() and a missing pos = assignment in int_density_dz_generic_ppm(). Both bugs would have been eliminated in the much simpler refactoring of this module that you are doing.

@marshallward
Member

I have profiled these changes and they show a very clear speedup. The results below are for a 6-hour run of the benchmark test using the legacy ("buggy") Wright EOS with EOS_QUADRATURE enabled.

The current code:

(Ocean pressure force)                  24     15.356693
(Ocean pressure force)                  24     15.422615
(Ocean pressure force)                  24     15.515217

This PR:

(Ocean pressure force)                  24      6.039421
(Ocean pressure force)                  24      6.056434
(Ocean pressure force)                  24      6.034422

This 2.5x speedup is very reproducible.

The suggested refactor is currently unable to achieve this level of performance. The conversion from 1D to 3D arrays introduces a slowdown from about 6 s to about 8 s, which appears to be due to converting the 15N loop into nested 3x5xN loops. The compilers are either unable or unwilling to collapse these into single loops. (The contiguous keyword was no help here.)

@adcroft and I have tested a potential version using assumed-size arrays, which appears to get very close to the optimized speed (0.13/6.03 sec ~ 2%)

(Ocean pressure force)                  24      6.221991
(Ocean pressure force)                  24      6.187612
(Ocean pressure force)                  24      6.166069

and believe there may be additional speedups using this technique.
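
For context, a hedged sketch of what the assumed-size / sequence-association approach could look like (hypothetical names, toy EOS, and an old-style external procedure; the actual MOM6 interfaces differ): a contiguous rank-3 actual argument such as Tq(3,5,ni) is associated with a flat assumed-size dummy, so the callee runs a single loop of length 3*5*ni instead of triple-nested loops that the compiler may not collapse:

  ! Illustrative sketch only (hypothetical names, toy EOS; not the actual MOM6 code).
  subroutine calc_density_flat(npts, T, S, p, rho)
    implicit none
    integer, intent(in) :: npts
    real, intent(in)  :: T(*), S(*), p(*)   ! assumed-size: flat views of the packed data
    real, intent(out) :: rho(*)
    integer :: n
    do n = 1, npts                          ! one flat loop over every quadrature point
      rho(n) = 1000.0 + 0.78*S(n) - 0.2*T(n) + 4.0e-7*p(n)   ! toy linear EOS
    enddo
  end subroutine calc_density_flat

  program flat_loop_demo
    implicit none
    integer, parameter :: ni = 8
    real :: Tq(3,5,ni), Sq(3,5,ni), pq(3,5,ni), rq(3,5,ni)
    Tq = 5.0 ; Sq = 35.0 ; pq = 2.0e7
    ! Sequence association: the whole rank-3 arrays are seen as flat arrays inside.
    call calc_density_flat(3*5*ni, Tq, Sq, pq, rq)
    print *, 'rho at first and last points:', rq(1,1,1), rq(3,5,ni)
  end program flat_loop_demo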

Since there are no plans to replace this commit, I will approve this one and continue developing the improvements to the EOS, which will be submitted in a separate PR.

@marshallward
Member

Gaea regression: https://gitlab.gfdl.noaa.gov/ogrp/MOM6/-/pipelines/23267 ✔️

@marshallward marshallward merged commit 4058462 into NOAA-GFDL:dev/gfdl Apr 30, 2024
10 checks passed
@marshallward
Member

The previous runtimes only included the updates to the 3D array routines (which dominated the profile). After also updating the 2D profiles, the runtimes are equivalent.

The accepted PR using packed 1D arrays:

(Ocean pressure force)                  24      5.995598
(Ocean pressure force)                  24      6.019213
(Ocean pressure force)                  24      5.997528

Proposed changes based on 2d/3d arrays (not yet submitted):

(Ocean pressure force)                  24      5.955973
(Ocean pressure force)                  24      5.997446
(Ocean pressure force)                  24      5.985435

So there is no longer any reason not to refactor this PR to replace the packed arrays with equivalent 2D and 3D arrays.

@marshallward
Member

I would not normally keep bantering on in a merged PR, but this PR appears to have changed answers, at least in benchmark using the Wright EOS and EOS_QUADRATURE = True.

4c4
<     24,       0.250,     0, En 5.4681542507953240E-01, CFL  0.01994, SL -1.1407E-08, M 7.90973E+19, S 35.0000, T  5.0639, Me  1.93E-16, Se  6.51E-15, Te  3.38E-15
---
>     24,       0.250,     0, En 5.4681542498895175E-01, CFL  0.01994, SL -1.1405E-08, M 7.90973E+19, S 35.0000, T  5.0639, Me  1.95E-16, Se  6.65E-15, Te  3.39E-15

I used ifort (IFORT) 2021.7.1 20221019 with -O2 optimization (plus essential flags).

I am not sure if this requires special action. I am unsure whether anyone is even using EOS_QUADRATURE in production. I first noticed it because my refactoring of this PR produced a third set of answers (with differences of equally small magnitude).

@marshallward
Member

marshallward commented May 1, 2024

I repeated using all of the GFDL flags:

  • -O2
  • -fp-model source
  • -march=core-avx-i
  • -qno-opt-dynamic-align

and the answers now agree (and in many cases are faster, although that is somewhat expected).

I would have thought that -fprotect-parens was sufficient for bit reproducibility, but there appears to be more going on here. But that is a conversation for another place and time. By our definition, this PR is fine and there is no reproducibility issue.
