Bug fixes for 32-bit physics & correct the lake scheme in FV3_HRRR_c3 & FV3_HRRR_gf #1880
This should be combined with #1467. They fix related issues, and both PRs are needed urgently by the RRFS parallels.
@SamuelTrahanNOAA please bring these up to date with respective authoritative repositories
You are an especially polite bot, github-actions. Yes, I'm working on this now. My regression tests were about to finish from the last pass at updating this branch. Now, I find out that the UFS has updated again since I started. Time to start over, I suppose.
Combining with #1853 would be great too, since that fixes another 32-bit physics quilting restart bug.
@SamuelTrahanNOAA Dusan is on leave, but you could add his changes manually.
Actually, I'd like to hear from @junwang-noaa about whether to merge #1853 into this. She had doubts about whether Dusan's workaround was the best way forward. His workaround was to have the ESMF-based write component mimic what is essentially an FMS I/O bug: always using double precision to write out axis variables, even if those variables are originally stored in single precision. This way, the quilt and non-quilt restart files will be identical. The alternative is to accept that quilt and non-quilt axis variables will have different precision. That won't matter anymore once we're using quilting restarts for everything. However, it will make regression testing more difficult until then. If Jun is here and agrees to use Dusan's workaround, then I can merge #1853. If she'd prefer to wait for Dusan to come back and discuss matters, then I should not merge #1853.
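For readers following the precision trade-off in the comment above, here is a toy sketch in Python. It is not the FMS or ESMF write-component code. The helper names (`round_to_single`, `write_axis`) and the axis values are hypothetical, and raw byte strings stand in for restart files. It only shows why axis data stored in single precision compares as different when one writer promotes it to double and the other does not, and why promoting in both writers makes the output identical.

```python
import struct

def round_to_single(values):
    """Round each value to the nearest IEEE-754 single (stand-in for 32-bit storage)."""
    n = len(values)
    return list(struct.unpack(f"<{n}f", struct.pack(f"<{n}f", *values)))

def write_axis(values, as_double):
    """Serialize an axis variable as little-endian doubles or singles."""
    n = len(values)
    code = "d" if as_double else "f"
    return struct.pack(f"<{n}{code}", *values)

# Hypothetical axis values, held in single precision by the model.
axis = round_to_single([0.1, 0.2, 0.3])

non_quilt = write_axis(axis, as_double=True)          # writer that always promotes to double
quilt_native = write_axis(axis, as_double=False)      # writer that keeps the stored precision
quilt_workaround = write_axis(axis, as_double=True)   # workaround: promote here as well

files_match_without_workaround = non_quilt == quilt_native
files_match_with_workaround = non_quilt == quilt_workaround
```

In this sketch the promoted output carries exactly the same values as the single-precision data, just in a wider type, so the mismatch is one of stored type rather than of numerical content.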
@SamuelTrahanNOAA @DusanJovic-NOAA will come back next Monday. Let's discuss this with him before we merge it. Thanks.
@jkbk2004 - Based on Jun's response, we're not combining any PRs with this one. I have finished my final testing: no output changes, and all new tests pass on Hera. You can proceed with final testing when you are ready.
@junwang-noaa - Certainly, we shall wait on #1853, but I don't want to wait on this PR or #1467 due to the urgency in the RRFS parallels. @jkbk2004 was discussing starting regression tests on this PR tomorrow. Could you review the FV3 PR sometime tomorrow? I believe you're already familiar with the fix in this PR.
@SamuelTrahanNOAA can you sync up your branch here and resolve conflicts? #1773 is merged, so we want to begin working on this PR next.
I need to see the full backtrace of the error to know where and why it is crashing. |
At this point, I'd like to disable all three tests. The c3 scheme is experimental, and may have unknown bugs. Thankfully, we now have a way to reproduce a crash, even if it is on a machine few developers can access. Quilting restart does not work for 32-bit physics. ESMF accesses uninitialized memory extensively during quilt server initialization. This is easy to detect on any platform using valgrind. The result is gibberish data sent to the NetCDF library. If that gibberish includes a signalling NaN, then the model will crash. Whether this is a bug in ESMF or in the model, I do not know. Neither would surprise me. This branch has some bug fixes for 32-bit physics with quilting restart, but not enough to get it to work. Now that @DusanJovic-NOAA has returned, I'm hoping he can help me get the quilting restart to work with 32-bit physics. Until then, disabling a test of a known-broken feature makes the most sense.
I've disabled these tests:
OK, I'll run the final matching on Cheyenne with those cases turned off, and then we can merge this PR.
Created an issue to follow up later: #1882
wcoss2: tests are hitting the wall-clock limit during both baseline creation and comparison. There is no consistency in which test fails, so it's most likely from high load slowing things down. I'm about to re-run the two failed tests, and logs will come soon.
Update on Cheyenne: the tests finished a few minutes ago; however, one case failed to match against the baseline. Hoping to have it resolved shortly.
Testing is complete. I'll follow up on the FV3atm sub-PR.
@SamuelTrahanNOAA FV3 hash: NOAA-EMC/fv3atm@51e570c
My branch points to the head of FV3 develop, and I've reverted .gitmodules.
PR Author Checklist:
Description
Corrects a few problems found in RRFS parallel development and consolidates FV3_HRRR suite regression tests:
Linked Issues and Pull Requests
Fixes these three fv3atm issues:
Subcomponent Pull Requests
Blocking Dependencies
Subcomponents involved:
Anticipated Changes
Input data
Regression Tests:
All tests with "conus13km" in their name are replaced by new ones with shorter names. New tests hrrr_gf and hrrr_c3 are added.
Full list of new tests
Libraries
Code Managers Log
Testing Log: