-
Notifications
You must be signed in to change notification settings - Fork 258
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Decomposition test failure #1103
Comments
@MinsukJi-NOAA Is this issue showing up in previous revisions? |
The previous version |
Since the physics suite is changed and there is no dycore update in 9b6b740, it might be related to physics. |
Is there any uncoupled atmos decomposition test ? Is it working ? |
Yes, there is an uncoupled decomposition test. I'm running it with he 8,3 combination now. |
The standalone atm decomposition test w/the 8,3 combination passed. |
So 3,8 is the what the baseline is created with. 4,6 also gives the same answer, but 8,3 does not. It's my understanding that the only guarantee is for repro mode, so are we compiling in repro mode? I don't see any instructions on that here: https://github.com/ufs-community/ufs-weather-model/wiki/Building-model I believe @DeniseWorthen's usual suggestion for debugging something is to write out and check the mediator history files, so I believe for that we follow the instructions here: https://github.com/ufs-community/ufs-weather-model/wiki/Advanced-Topics#using-the-cmeps-mediator-to-understand-the-coupling-fields-under-construction correct? |
@JessicaMeixner-NOAA I believe repro mode compilation can be done with |
UFS discussion 934 contains the instructions for writing the mediator history files. In this case, you could add the following to nems.configure:
This will write the ATM mediator history as a single file containing all 6 tiles on every pass through the coupling loop. ICE and OCN will get their own history files. |
Thanks @MinsukJi-NOAA and @DeniseWorthen |
@JessicaMeixner-NOAA To clarify, 1) the cpld_control_p8 and cpld_decomp_p8 RT have decomposition 4x6 and 3x8, 2)ORT test runs cpld_control_p8 with 8x3, both run in PROD mode for some time. It was working in previous PRs, the 8x3 setting stopped working since PR#1071. |
@JessicaMeixner-NOAA If you can point me to a run directory containing cpld_control_p8, I can look start with the mediator history files. |
I just started a run with the extra outputs, they will be here: /scratch1/NCEPDEV/stmp2/Jessica.Meixner/FV3_RT/rt_13823 |
I copied your run director and made a sandbox. I used it to create 3 run directories /scratch1/NCEPDEV/stmp2/Denise.Worthen/decomp96/decomp38 where each varies only in the input.nml layout variable. I set fhmax=2 in model configure. The decomp38 and decomp46 directories are b4b after 2 hours. The decomp38 and decomp83 directories differ on the 2nd coupling step in the coupling fields sent by ATM (rain, snow, shum, tbot, height). The differing values are randomly scattered on each tile---they are not associated w/ land fraction for example, which I've seen before. A diff file at the second timestep is in the run directory (
|
Repro mode did not help: /scratch1/NCEPDEV/stmp2/Jessica.Meixner/FV3_RT/rt_10951 (comparing the output between the two directories, there are many diffs starting at fhr001). |
I did some runs saving every time-step and different configurations. I then compiled with REPRO=ON, and running with 3x8 and 4x6 gets the same answers, but these answers are different than the regression test, which I expect since the model was optimized differently. And in the REPRO cases, the aspec ration average is the same as the 3x8 case above. |
I also changed cplchm=F and turned off the aerosols by changing the nems.configure and field_table, and 4x6 and 3x8 give same results with the original RT executable. |
Phil, Thanks. This is interesting. I believe Jessica was also testing the coupled without gocart aerosols. Are Raffaele's tracer fix (related to Thompson MP) included in these tests ? |
@yangfanglin I don't know. @rmontuoro? |
I am making a test now on WCOSS. I checked out the latest ufs-weatehr-model and pointed to https://github.com/rmontuoro/fv3atm/tree/bugfix/thompson-tracer-index to the fv3atm. I am only testing the cpld_decomp_p8 in the rt.sh. Is this sufficient ? Anyhow, I will report back when the test is done. |
My RT returns "+ echo REGRESSION TEST WAS SUCCESSFUL". Does this mean the decomposition bug is also fixed with Raffaele's tracer bug fix ? Should I change the layout manually to test different configurations ? |
@yangfanglin follow the steps at the top of this thread to try a different processor layout |
@pjpegion Got it. Running step 3 now. |
@yangfanglin I have run with @rmontuoro PR changes and it solves some of the decomp issues, but there's still something going on as the results do not completely reproduce. In trying to figure out what is going on, I'm going to try to attempt to run this test w/aerosols but with the older physics options to see if it's again something pointing to an interaction of physics/aerosols that is not otherwise being seen. |
@JessicaMeixner-NOAA Was your coupled model test without gocart aerosol successful ? |
I haven't tried with the very top of develop, but with the 9b6b740 commit, I get consistent results with @pjpegion that it worked. Trying to figure out why since the updated code @rmontuoro seems to resolve some of these issues with leaving aerosols on. |
My test with 8x3 decomposition did not reproduce using the latest ufs-weather-model develop branch and Raffaele's fv3-atm. |
Please see https://docs.google.com/document/d/11vo2-DyrR2LWbQoqprlVoTxhSEnpvblWPw30WqOl6uA/edit for a track of changes made to fv3-atm and ccpp repos after March 4 and before March 10 when Minsuk first reported the decomposition failure. Can we reverse the " fix 2phases intermediate restart" update and see if the decomposition RT works ? |
I have not run the ORT yet, but will try to do that now. I personally find it easier to create new regression tests over running the ORT. |
The ORT is a requirement to PRs so I would like confirmation that this works. |
It's running now I will post as soon as I have the answer to your question. I'm using the compiler flags @rmontuoro gave me, I'm not sure if that's will be the final decision for the flags or not. I will run the ORT and regression tests again once I know what the final flags are. |
The ORT failed:
will look after the meeting as to why. |
Okay so this failed because we shouldn't be trying to compile the coupled model with -D32BIT=ON (yet). I'll resubmit specifying which tests and report back how the tests go. |
Next error:
Debug build fails because of waves I'm assuming. I thought we could at least build though, so I'll look into debugging that issue, in the meantime: Running now with the following options: Please let me know if I should be testing/compiling with different compiler flags or using different options in the ORT test. |
Can we run the opnreqtest with the flags that @rmontuoro identified ? |
I think your debug failure for S2SWA requires the fix that Kyle provided here. |
Thanks @DeniseWorthen I agree. Next ORT failure:
Where can I find when test and what options are expected to work for the ORT tests? I don't see anything here: https://ufs-weather-model.readthedocs.io/en/latest/BuildingAndRunning.html#using-the-operational-requirement-test-script |
For now I would like to know if these flags do work. We have reached out to GDIT about appropriate flags. @MinsukJi-NOAA can you address the questions that @JessicaMeixner-NOAA has raised ? |
@arunchawla-NOAA in terms of knowing if the flags worked, I did the following: |
The following test passed:
I'll work on making the debug test pass as well as any other desired combination/test. |
so if we were to replace fp-model flag with source instead of consistent in repro mode the ORT test would pass ? (instead of having to create these seperate flags just for UFS GOCART |
This is the original code I tried: https://github.com/JessicaMeixner-NOAA/ufs-weather-model/tree/bug/decompissue which I have also now confirmed with the fix @DeniseWorthen pointed me to dbg also works. I'm now trying the suggested update @arunchawla-NOAA is curious about with repro mode and replacing fp-model flag with source instead of consistent. The code is here: https://github.com/JessicaMeixner-NOAA/ufs-weather-model/tree/bug/decompissuerepro and I'm trying the following test: I'll report back when I have more information. |
Here's my output of the ORT with turning on repro mode:
So the debug test failed, but perhaps I should be building with both -DREPRO=ON and -DDEBUG=ON, but it looks to me like all the other tests passed although the log file didn't get created. |
I was able to successfully run the operational requirement test for
The original ufs-weather-model/cmake/Intel.cmake Lines 19 to 23 in 1bd68ca
|
@rmontuoro Thanks for confirming the ORT does work with REPRO mode. Can you build GOCART only with REPRO compile option and the rest of model in PROD mode to see if it still passes? |
By removing -no-prec-div and -no-prec-sqrt in the prod option for non 32 bit, (see: https://github.com/JessicaMeixner-NOAA/ufs-weather-model/tree/bug/twoflags) the ORT test: |
I would think this would be a better solution than having conditional builds with different flags for different components. @junwang-noaa and @DusanJovic-NOAA what is the best way to assess the impact of this change on the timing of the runs. Since this change keeps the AVX options for GOCART which seems to be critical for speed based on what @climbfuji has stated, this should work right ? |
I'd suggest running RT and maybe one 35 day benchmark test to confirm the timing is not impacted much. |
Preliminary tests performed on Hera using
Note that
These flags are still used in GFDL_atmos_cubed_sphere: |
Okay, so I some how majorly goofed. So many apologies for this, but I misread or read what I wanted to see, but the removing just the two-flags at the top level does in fact not work for the decomposition tests. I kept re-running tests last night because they all of a sudden were not working and I kept thinking I had done something wrong, but then this morning I rechecked in the original log file which does in fact say the test failed: I still think it's valuable information for what @rmontuoro ran and discovered, because if it holds that other configurations also do not have a boost from using these flags then perhaps they should be removed. However, it still appears we need to find a clean solution for the decomposition issue. |
Below are results from my latest ORT on Hera:
|
@MinsukJi-NOAA Thanks for testing. Since the problem is resolved, I will close the issue. |
Sure, I will do some testing. But my understanding is that the changes are
independent to decomposition, it is to fix the writing time for restart
files in the two phases run.
…On Sun, Mar 20, 2022 at 12:05 AM Fanglin Yang ***@***.***> wrote:
Please see
https://docs.google.com/document/d/11vo2-DyrR2LWbQoqprlVoTxhSEnpvblWPw30WqOl6uA/edit
for a track of changes made to fv3-atm and ccpp repos after March 4 and
before March 10 when Minsuk first reported the decomposition failure. Can
we reverse the " fix 2phases intermediate restart" update and see if the
decomposition RT works ?
—
Reply to this email directly, view it on GitHub
<#1103 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AI7D6TMTFLFDU3VT6U3WDJDVA2PWXANCNFSM5QNZ5QIQ>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
You are receiving this because you commented.Message ID:
***@***.***>
|
Description
cpld_decomp_p8
fails with a different domain decompositionTo Reproduce:
What compilers/machines are you seeing this with? Intel/Hera
Give explicit steps to reproduce the behavior.
9b6b740
)cd ufs-weather-model/tests; ./rt.sh -n cpld_decomp_p8
. This test will PASSufs-weather-model/tests/tests/cpld_decomp_p8
:For a comparison,
5. Check out the previous commit
38aa634
of the ufs weather model6. Repeat steps 2, 3, and 4 above. Both tests will PASS.
The text was updated successfully, but these errors were encountered: