
Increase optimization for ugwp_driver_v0.F #76

Closed · wants to merge 1 commit

Conversation


@dkokron dkokron commented May 29, 2023

Performance profiling of a HAFS case on NOAA systems revealed that a significant amount of time was spent in subroutine fv3_ugwp_solv2_v0(). This pull request (PR) allows the compiler to fully optimize this routine while maintaining round-off level differences.

How Has This Been Tested?
My testing includes the creation of an offline driver for fv3_ugwp_solv2_v0(). I extracted full-volume (all ranks and threads) inputs to and outputs from fv3_ugwp_solv2_v0() at time step 290 (5.8 hours into a HAFS case). The offline driver feeds the saved inputs into fv3_ugwp_solv2_v0() and compares the output against the saved ground truth. The driver gives zero-diff output when compiled with the flags used in production.

I found a set of compile options that allows fv3_ugwp_solv2_v0() to run more than 2x faster than the baseline while giving absolute differences of ±1e-14. Numerical stats are attached as ugwp_results.txt.

ugwp_results.txt

The driver and associated input and output files are located at:

```
alogin02:/lfs/h1/hpc/support/daniel.kokron/HAFS/hafsv1_final/T2O_2020092200_17L/UGWPsolv2V0/driver.f90
alogin02:/lfs/h1/hpc/support/daniel.kokron/HAFS/hafsv1_final/T2O_2020092200_17L/UGWPsolv2V0/IO
```

This PR is a non zero-diff change, so the baseline will need to be regenerated.

I ran the rt.sh suite on acorn and cactus using the "-c" option, then again with "-m". The logs from both runs on both machines indicated "REGRESSION TEST WAS SUCCESSFUL".

Performance metric: add up the phase1 and phase2 timings printed in the output listing:

```shell
grep PASS stdout | awk '{t+=$10; print t}' | tail -1
```

The units are seconds.
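As a quick illustration of how that pipeline works, here is a sketch using two hypothetical PASS lines (the field layout mimics the rt.sh output, where the 10th whitespace-separated field holds the timing; the sample values are made up):

```shell
# Two hypothetical PASS lines; field 10 is the timing in seconds.
printf '%s\n' \
  'PASS a b c d e f g h 12.5' \
  'PASS a b c d e f g h 7.5' > stdout

# awk accumulates field 10 and prints the running sum for each line;
# tail -1 keeps only the final (total) value.
grep PASS stdout | awk '{t+=$10; print t}' | tail -1   # prints 20
```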

I ran a 126-hour simulation on acorn using a 26-node case:

```
acorn:/lfs/h1/hpc/support/daniel.kokron/HAFS/hafsv1_final/T2O_2020092200_17L
```

| Trial | Baseline (s) | Optimized (s) |
|------:|-------------:|--------------:|
| 1     | 7881         | 7489          |
| 2     | 7866.7       | 7529.6        |
| 3     | 7865.7       | 7488.4        |
| 4     | 7849.9       | 7464.7        |
| 5     | 7851.7       | 7509.3        |
| 6     | 7877.1       | 7499.2        |
| Mean  | 7865.4       | 7496.7        |

I also ran a 24-hour simulation on cactus using a 47-node case provided to me by Bin Liu:

```
cactus:/lfs/h2/emc/ptmp/bin.liu/hafsv1_merge_hfsb/2021082712/09L/forecast
```

| Trial | Baseline (s) | Optimized (s) |
|------:|-------------:|--------------:|
| 1     | 984.975      | 984.944       |
| 2     | 987.563      | 917.817       |
| 3     | 986.981      | 948.971       |
| 4     | 999.552      | 935.908       |
| 5     | 1006.71      | 918.378       |
| 6     | 985.828      | 940.994       |
| 7     | 1000.54      | 935.985       |
| Mean  | 993.2        | 940.4         |

Based on these timings (a mean saving of ~52.8 s per 24 simulated hours), I expect this PR to save ~52.8 s × 5 ≈ 264 seconds over a full 5-day simulation.

Comment on lines +170 to +173:

```cmake
if(CMAKE_BUILD_TYPE STREQUAL "Release" AND ${CMAKE_Fortran_COMPILER_ID} STREQUAL "Intel")
  SET_SOURCE_FILES_PROPERTIES(${LOCAL_CURRENT_SOURCE_DIR}/physics/ugwp_driver_v0.F
    APPEND_STRING PROPERTY COMPILE_FLAGS " ${CMAKE_Fortran_FLAGS_PHYSICS} -fp-model=fast -fprotect-parens -fimf-precision=high")
endif()
```
Collaborator
You should also restrict this with if(FASTER) so that the original Release compilation options don't change.
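A minimal sketch of that guard, assuming `FASTER` is the existing CMake option referenced in this thread (everything else is taken verbatim from the diff above):

```cmake
# Sketch: apply the aggressive flags only when the build opts in via -DFASTER=ON,
# leaving the original Release compilation options unchanged.
if(FASTER AND CMAKE_BUILD_TYPE STREQUAL "Release" AND ${CMAKE_Fortran_COMPILER_ID} STREQUAL "Intel")
  SET_SOURCE_FILES_PROPERTIES(${LOCAL_CURRENT_SOURCE_DIR}/physics/ugwp_driver_v0.F
    APPEND_STRING PROPERTY COMPILE_FLAGS " ${CMAKE_Fortran_FLAGS_PHYSICS} -fp-model=fast -fprotect-parens -fimf-precision=high")
endif()
```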

@dkokron (Author) commented May 30, 2023 via email

@dkokron (Author) commented May 30, 2023 via email

@SamuelTrahanNOAA (Collaborator)
> Protecting this change with a check for use of -DFASTER does limit the consequences to the following regression tests even though other tests run with a FASTER executable. However, that also limits the benefits of this optimization to those projects that know about and use the -DFASTER flag.

The purpose of the FASTER flag is to test a new combination of compilation options that will eventually replace the current Release set. We're keeping the original options unmodified during a transition period so modelers can test the new and old options together.

@dkokron (Author) commented May 30, 2023 via email

@SamuelTrahanNOAA (Collaborator)
If there are specific HAFS tests you want to run with -DFASTER=ON, we can add regression tests. You can see some `_faster` tests in `tests/rt.conf` for other models that are experimenting with this.

@dkokron (Author) commented May 30, 2023 via email

@grantfirl (Collaborator)
Thanks @dkokron. I see that there are no upstream PRs into fv3atm and ufs-weather-model for this change. Since you're only changing CMakeLists files, it would be great to combine this with another PR; but since this PR changes the RT baselines by itself, that makes it trickier, because we don't like to combine PRs that change baselines for different reasons. It looks like #77 doesn't change baselines and is a candidate for absorbing this change, which would make upstream PRs unnecessary. I'll inquire about doing this.

@grantfirl (Collaborator)
Also, as @SamuelTrahanNOAA said, we can ask about either adding a test or, ideally, modifying an existing test so as not to add to the testing load. I think @DusanJovic-NOAA, @junwang-noaa, @BrianCurtis-NOAA and @BinLiu-NOAA would be good people to ask for opinions on updating rt.conf or individual tests for HAFS.

@SamuelTrahanNOAA (Collaborator)
My preference is to update the existing tests, if they don't accurately represent planned HAFS configurations.

@dkokron (Author) commented Jun 2, 2023 via email

@SamuelTrahanNOAA (Collaborator)
I think the way we should handle that is to separate the PRs. Have this PR only update the optimization options. Then use another PR to update the HAFS regression test configuration.

The HAFS regression test reconfiguration could be rolled into my ufs-community/ufs-weather-model#1769. That one corrects some RRFS regression tests, and its only code changes do not change results.

@DusanJovic-NOAA
Why are the compiler flags changed only for one file? What is so special about this file? Have you tried to use these flags for all other files? Are you getting reproducible answers on all platforms?

@junwang-noaa
From https://www.intel.com/content/www/us/en/docs/cpp-compiler/developer-guide-reference/2021-8/fp-model-fp.html:

> Tells the compiler to use more aggressive optimizations when implementing floating-point calculations. These optimizations increase speed, but may affect the accuracy or reproducibility of floating-point computations.

I'd suggest we do more testing before adding this feature to the "FASTER" option; we need to confirm the accuracy. We are maintaining reproducibility for "Release" mode.
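The reproducibility concern comes down to value-unsafe optimizations being allowed to reorder floating-point arithmetic, and floating-point addition is not associative. A minimal sketch of the effect (plain Python, purely illustrative — not the UGWP code or the compiler itself):

```python
# Floating-point addition is not associative, which is why value-unsafe
# modes such as -fp-model=fast can change results at round-off level:
# the compiler is free to reassociate sums.
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c   # cancellation first, so the 1.0 survives
right = a + (b + c)  # 1.0 is absorbed into -1e16 (below its ulp of 2.0)

print(left, right)   # 1.0 0.0
print(left == right) # False
```

The same source expression can therefore produce bit-different answers depending only on evaluation order, which is exactly what "maintaining reproducibility for Release mode" protects against.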

@dkokron (Author) commented Jun 2, 2023 via email

@dkokron (Author) commented Jun 2, 2023 via email

@junwang-noaa
That might be a question for the HAFS group. The UFS RT is just one canned case. From my point of view, cpld_control_p8 and control_p8 producing different results already shows that results will change when using this option. But I understand there is a balance between speed and reproducibility, so it's each application's decision whether to use this option.

@yangfanglin (Collaborator)
GWD should take only a very small portion of the CPU time of the entire physics package. I'd suggest the developer look inside the code to find the bottleneck and optimize the code where possible. Simply changing compiler options without examining the code may not be the best approach. (Informing @mdtoyNOAA.)

@dkokron (Author) commented Jun 2, 2023 via email

@grantfirl (Collaborator)
@dkokron @junwang-noaa @DusanJovic-NOAA @yangfanglin Coming back to Dusan's and Fanglin's comments: do we want to go forward with a one-file compiler-option change for the HAFS application, as this PR suggests? Or do we want to explore adding the flags to the rest of physics (since I would think GWD is not high on the list of schemes eating up compute cycles), or ask the developer of the scheme to make it work faster?

On one hand, if this change really makes that much of a difference for the HAFS application, I don't want to stand in the way, but on the other hand, I agree with @DusanJovic-NOAA and @yangfanglin that this is an oddly "targeted" solution for speeding up physics.

@dkokron (Author) commented Jun 6, 2023 via email

@grantfirl (Collaborator)
@BinLiu-NOAA @ZhanZhang-NOAA @junwang-noaa @DusanJovic-NOAA @yangfanglin Do any of you have strong opinions about whether this change is desired? I know that some reservations were expressed. Should we continue merging this or close it?

@grantfirl (Collaborator)
After talking with UFS code managers (including @BinLiu-NOAA for HAFS), it was decided to close this PR as-is and to instead explore adding the extra compilation flags to all physics. See #86.

@grantfirl closed this Jun 22, 2023
grantfirl pushed a commit that referenced this pull request Jul 18, 2023