modified vertical range for KEvertex calc to avoid NaNs #4945

Merged (2 commits) on Aug 31, 2022

alicebarthel (Contributor) commented May 12, 2022

Following @mark-petersen's suggestion, I modified the vertical range over which KEvertex is calculated.
The previous version had NaNs in normalVelocity and activeTracers which led to an immediate State Validation Fail when RK4 and config_include_KE_vertex were used together.
I successfully ran a 1-day test in E3SM (C-case).
@xylar @mark-petersen let me know if you'd like me to run other tests.

Fixes #4933
[BFB]

xylar self-requested a review May 12, 2022 19:24
xylar assigned mark-petersen, xylar and jonbob and unassigned mark-petersen and xylar May 12, 2022
xylar requested a review from mark-petersen May 12, 2022 19:25
xylar added the BFB label (PR leaves answers BFB) May 12, 2022

xylar (Contributor) commented May 12, 2022

@alicebarthel, I will try to run some tests today or tomorrow. Thanks for submitting this!

jonbob (Contributor) commented May 12, 2022

@alicebarthel -- just a note for next time that the branch naming convention is to use all lower-case letters...

xylar (Contributor) commented May 13, 2022

Testing

I made a test merge of this branch with master (currently just one merge behind) and tested the result against master in both E3SM and compass on Anvil.

For E3SM testing, I ran ERS.ne11_oQU240.WCYCL1850NS.anvil_intel and found it to be bit-for-bit with master. @alicebarthel, for your reference, here is what I did:

# Build a baseline from current master (-g generates the baseline)
cd E3SM/
git fetch --all -p
git reset --hard origin/master
git submodule update --init --recursive
cd cime/scripts/
./create_test --wait --walltime 01:00:00 -g -b master_20220512 --baseline-root /lcrc/group/e3sm/ac.xylar/e3sm_baselines ERS.ne11_oQU240.WCYCL1850NS.anvil_intel
cd ../..
# Test-merge the PR branch with master in a separate worktree
git remote add alicebarthel/E3SM git@github.com:alicebarthel/E3SM.git
git fetch alicebarthel/E3SM
git worktree add ../ocn/fix-RK4-KEvertex-calc -b ocn/fix-RK4-KEvertex-calc
cd ../ocn/fix-RK4-KEvertex-calc/
git reset --hard alicebarthel/E3SM/ocn/fix-RK4-KEvertex-calc
git merge --no-ff origin/master
git submodule update --init --recursive
# Compare the test merge against the master baseline (-c compares)
cd cime/scripts/
./create_test --wait --walltime 01:00:00 -c -b master_20220512 --baseline-root /lcrc/group/e3sm/ac.xylar/e3sm_baselines ERS.ne11_oQU240.WCYCL1850NS.anvil_intel

I also ran the compass ocean pr test suite with Intel and OpenMPI, again using E3SM master as a baseline. All tests passed and were bit-for-bit.

xylar (Contributor) left a comment

This is a good fix to have. In general, I think we don't want to be computing with invalid values.

xylar (Contributor) commented May 13, 2022

@alicebarthel, thanks for putting this in. @jonbob, do you think my testing is sufficient or should we wait for @mark-petersen to return and do a review?

alicebarthel (Contributor, Author) commented

Thanks for the tests and instructions @xylar, I appreciate it.
@jonbob Thanks for the note about naming conventions, I'll do better next time.

jonbob (Contributor) commented May 13, 2022

@xylar - I think your testing is complete and there's no need to wait for @mark-petersen to review. I'll run E3SM tests before I merge and between the two we should have it covered. Thanks again for your help

xylar removed the request for review from mark-petersen May 13, 2022 16:19

xylar (Contributor) commented May 13, 2022

Sounds great!

mark-petersen self-requested a review May 14, 2022 11:01

mark-petersen (Contributor) left a comment

Thanks @alicebarthel for making this PR and @xylar for testing. During my original debugging I ran numerous stand-alone nightly suites with comparisons, with rk4 and split explicit, with config_include_KE_vertex = .true.. Specifically, I just ran the nightly suite with split explicit and this is bfb with master using gnu optimized stand-alone on badger.

jonbob (Contributor) commented May 25, 2022

My tests are failing for this PR. SMS_D_Ld3.T62_oQU120.CMPASO-IAF.chrysalis_intel complains about the following:

 94: forrtl: severe (408): fort: (3): Subscript #1 of the array KINETICENERGYVERTEX has value 0 which is less than the lower bound of 1

It may be necessary for the lower do-loop extent to be max(1,minLevelEdgeBot(iEdge))?

xylar (Contributor) commented May 25, 2022

Thanks @jonbob. Sorry our other testing didn't turn this up.

@cbegeman, do you have recommendations on how to handle this based on similar loops over edges elsewhere in the code?

xylar (Contributor) commented May 25, 2022

It seems most likely that we need a check for valid edgesOnVertex. Presumably, we're using the dummy index for nonexistent edges on vertices and this is leading to minLevelEdgeBot == 0?

xylar (Contributor) commented May 25, 2022

It seems like this code block shows how to do this properly:

do iVertex = 1, nVertices
   kmin = minLevelVertexBot(iVertex)
   kmax = maxLevelVertexTop(iVertex)
   del2relVort(:, iVertex) = 0.0_RKIND
   areaTriInv = 1.0_RKIND/areaTriangle(iVertex)
   do i = 1, vertexDegree
      iEdge = edgesOnVertex(i, iVertex)
      do k = kmin, kmax
         del2relVort(k,iVertex) = del2relVort(k,iVertex) &
            + edgeSignOnVertex(i,iVertex) &
            *dcEdge(iEdge)*del2u(k,iEdge)*areaTriInv
      end do
   end do
end do

We need to use minLevelVertexBot and maxLevelVertexTop rather than minLevelEdgeBot and maxLevelEdgeBot.
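
To illustrate the difference, here is a minimal Python sketch (not MPAS code) of why edge-based bounds can produce an index of 0 while vertex-based bounds stay in range. All array values are hypothetical, and the assumption that a placeholder edge carries bounds of 0 follows the guess above rather than anything confirmed in the source.

```python
# Toy illustration only; names mirror the Fortran variables but the values
# are hypothetical. Levels are 1-based, as in the Fortran arrays.
N_VERT_LEVELS = 4

# Assumed per-edge bounds for the edges around one vertex. The last entry
# stands in for a placeholder (nonexistent) edge with bounds of 0.
min_level_edge = [1, 1, 0]
max_level_edge = [4, 3, 0]

# Assumed per-vertex bounds for the same vertex.
kmin_vertex, kmax_vertex = 1, 3


def visited_levels(kmin, kmax):
    """Levels touched by the Fortran loop `do k = kmin, kmax`."""
    return list(range(kmin, kmax + 1))


def invalid_levels(levels):
    """Levels that fall outside the declared bounds 1..N_VERT_LEVELS."""
    return [k for k in levels if k < 1 or k > N_VERT_LEVELS]


# Edge-based bounds: the placeholder edge makes the loop touch level 0.
edge_based = [visited_levels(lo, hi) for lo, hi in zip(min_level_edge, max_level_edge)]
bad_edge_based = [invalid_levels(levels) for levels in edge_based]

# Vertex-based bounds: the same range is used for every edge on the vertex.
bad_vertex_based = invalid_levels(visited_levels(kmin_vertex, kmax_vertex))

print(bad_edge_based)    # [[], [], [0]]
print(bad_vertex_based)  # []
```

In this toy setup only the placeholder edge produces an out-of-range level, which matches the kind of subscript error reported for KINETICENERGYVERTEX.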

@alicebarthel, would you like to make the corresponding changes? Then, we could talk through how to rerun the test that @jonbob ran that failed for him.

alicebarthel (Contributor, Author) commented

Thanks @xylar for pointing out a relevant code snippet. I'll look into it and make the changes later today.

alicebarthel (Contributor, Author) commented

I was able to reproduce @jonbob's error with the old code.
The updated code passed SMS_D_Ld3.T62_oQU120.CMPASO-IAF.chrysalis_intel successfully!
I only ran this one test. @jonbob, @xylar, let me know if you are OK picking up the testing from here. Otherwise, let me know which tests or test suites I should run. Thanks!

xylar (Contributor) commented Jun 7, 2022

@alicebarthel, that's great! I'll check with @jonbob in a bit but I think that will do for now.

jonbob (Contributor) commented Jun 7, 2022

@alicebarthel - the tests no longer result in error messages, but they do indicate non-BFB behavior for both cases I ran so far:

  • ERS.ne11_oQU240.WCYCL1850NS.chrysalis_intel
  • SMS_D_Ld3.T62_oQU120.CMPASO-IAF.chrysalis_intel

The TestStatus for both shows:

FAIL ERS.ne11_oQU240.WCYCL1850NS.chrysalis_intel BASELINE master: DIFF
FAIL SMS_D_Ld3.T62_oQU120.CMPASO-IAF.chrysalis_intel BASELINE master: DIFF

alicebarthel (Contributor, Author) commented

Ok, I've thought about this more and now understand your comments above.
I agree with @xylar that it won't be BFB unless we set the invalid velocities to zero. I have started to look at split.F and rk4.F for the differences, but have not had time to check the initialization values (0? weird fill_value?).
One thing of note is the calculation on L625, where uTemp is set to zero then the calculation is limited to the valid vertical range:
uTemp = 0.0_RKIND
do k = minLevelEdgeBot(iEdge), maxLevelEdgeTop(iEdge)
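
As a rough Python sketch of that pattern (not the MPAS Fortran; the function name and values are made up for illustration), zeroing the column first means any level outside the valid range stays at zero regardless of what the input held:

```python
N_VERT_LEVELS = 5


def fill_valid_range(values, kmin, kmax, n_levels=N_VERT_LEVELS):
    """Zero a column, then copy `values` only for 1-based levels kmin..kmax."""
    u_temp = [0.0] * n_levels
    for k in range(kmin, kmax + 1):  # mirrors `do k = kmin, kmax`
        u_temp[k - 1] = values[k - 1]
    return u_temp


# Hypothetical column with fill-like values above and below the valid range.
column = [-1.0e34, 0.3, 0.2, 0.1, -1.0e34]
print(fill_valid_range(column, 2, 4))  # [0.0, 0.3, 0.2, 0.1, 0.0]
```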

@cbegeman I like your solution and I think it makes physical sense, but split_explicit gives non-zero values for KEvertex on vertices connecting land and ocean, so I think we would get in trouble changing them to zero (hence Jon's non BFB).
I look forward to the rest of this conversation, but I think the zeroing of velocity should fall on someone else's to-do list if we are on a timeline for this PR.

rljacob (Member) commented Jun 30, 2022

status: waiting for a bug to be fixed.

rljacob (Member) commented Aug 4, 2022

status: @mark-petersen is working to fix this.

rljacob (Member) commented Aug 25, 2022

status: still waiting on @mark-petersen

mark-petersen (Contributor) commented

@alicebarthel, based on the comments here, it is better to set up RK4 the same way as split explicit, i.e. to make sure that normalVelocity is always zero on the edges. Then this error does not occur in the first place. I was able to reproduce this failure and then solve it with these changes:
https://github.com/mark-petersen/E3SM/pull/new/fix-rk4-by-zeroing-normalvelocity

You can grab commit 3757046 from my fork, branch fix-rk4-by-zeroing-normalvelocity. @alicebarthel could you test this in your set-up? Thanks.

xylar (Contributor) commented Aug 29, 2022

@mark-petersen, this is a nice solution! @alicebarthel and I chatted about this branch today and came up with an alternative approach to the same block of code. We were thinking of using edgeMask rather than 2 separate do loops but the outcome would be essentially the same.

mark-petersen (Contributor) commented Aug 29, 2022

Yes, either would work fine. @alicebarthel you can push either of those fixes to this branch, or ask for help if needed. An advantage of using edgeMask from 1 to nVertLevels is that it incorporates inactive top cells, and my suggestion does not.
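
A minimal Python sketch of the edgeMask idea, assuming a mask of 1 for active entries and 0 for inactive ones. The helper name and the values are hypothetical, and this is not the code on either branch. The masked entries are assigned zero rather than multiplied by the mask, since NaN times zero is still NaN:

```python
import math

FILL = -1.0e34  # hypothetical fill-like value for an inactive entry


def zero_masked(normal_velocity, edge_mask):
    """Return a copy with entries set to 0.0 wherever the mask is 0."""
    return [
        [v if m == 1 else 0.0 for v, m in zip(vel_col, mask_col)]
        for vel_col, mask_col in zip(normal_velocity, edge_mask)
    ]


# One list per edge, one entry per vertical level (top to bottom).
normal_velocity = [
    [0.10, 0.20, FILL],      # edge with an inactive bottom level
    [math.nan, 0.05, 0.03],  # edge with an inactive top level
]
edge_mask = [
    [1, 1, 0],
    [0, 1, 1],
]

cleaned = zero_masked(normal_velocity, edge_mask)
print(cleaned)                       # [[0.1, 0.2, 0.0], [0.0, 0.05, 0.03]]
print(math.isnan(math.nan * 0.0))    # True: multiplying by the mask would not help
```

Because the mask covers every level from 1 to nVertLevels, the inactive top level on the second edge is zeroed by the same pass as the inactive bottom level on the first.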

alicebarthel (Contributor, Author) commented

@mark-petersen As I mentioned this morning, I was also working on this. The solution @xylar mentioned is implemented on this new fix branch.
I ran an E3SM RK4+include_KEvertex test and it ran successfully. Same with RK4 with include_KEvertex=false.
Let me check if I can do a BFB test with split-explicit.

mark-petersen (Contributor) commented

@alicebarthel, this is great! I tested your new fix branch in stand-alone with the nightly suite on both intel debug and gnu debug with default flags, and then gnu debug with forced config_include_KE_vertex = .true. on all tests, and they all pass.

I then compared gnu optimized nightly suite between your fix branch and the branch point c5f8b37, and it compares bfb except that on RK4 tests, edges where both cell neighbors are land now have normalVelocity=0. Before your change, they are -1e34. This change is acceptable, because split explicit already has zero values for normalVelocity for land edges. So this PR would make RK4 match split-explicit. It still counts as a BFB PR for E3SM, which only compares split-explicit tests.
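
A comparison like this can be sketched in Python as follows. This is only an illustration with made-up values, not the tooling used for the nightly suite. It compares raw bytes so that the check is bit-for-bit rather than a floating-point equality test:

```python
import struct


def bits(x):
    """Raw IEEE-754 bytes of a double; 0.0 and -0.0 compare as different."""
    return struct.pack("<d", x)


def differing_indices(baseline, candidate):
    """Indices where the two sequences are not bit-for-bit identical."""
    return [
        i
        for i, (a, b) in enumerate(zip(baseline, candidate))
        if bits(a) != bits(b)
    ]


# Hypothetical normalVelocity values on four edges; edges 1 and 3 are land.
baseline = [0.12, -1.0e34, 0.07, -1.0e34]
candidate = [0.12, 0.0, 0.07, 0.0]
land_edges = [1, 3]

diffs = differing_indices(baseline, candidate)
only_land_changed = diffs == land_edges
print(diffs, only_land_changed)  # [1, 3] True
```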

You can force push your branch to this PR, and I will approve. I'm also happy to post more details of my testing if you like.

alicebarthel force-pushed the ocn/fix-RK4-KEvertex-calc branch from 5a56c86 to 4911116 Compare August 30, 2022 15:45

alicebarthel (Contributor, Author) commented

@mark-petersen Thank you so much for testing the BFB for split_explicit. I force pushed as requested.
I suppose the next steps are out of my hands, but let me know if you need me to do anything.
@xylar thanks for your help and review.

xylar (Contributor) left a comment

Re-approving based on @alicebarthel and @mark-petersen's testing and a visual inspection.

xylar (Contributor) commented Aug 30, 2022

@alicebarthel, a rebase may be needed to resolve a conflict. Do you want to give that a try? Or leave it for @jonbob to sort out during merge? (@jonbob, what's your preference?)

jonbob (Contributor) commented Aug 30, 2022

@xylar and @alicebarthel - I'm happy to sort out that conflict when I merge this PR, if that's easier

alicebarthel force-pushed the ocn/fix-RK4-KEvertex-calc branch from 0b31efe to a9a39ad Compare August 30, 2022 17:24

alicebarthel (Contributor, Author) commented

I rebased it to the current master, manually resolved the conflict, and pushed it. @jonbob let me know if you need me to do anything else, and thanks for doing the merge!

jonbob (Contributor) commented Aug 30, 2022

Do we need @philipwjones to re-review? Or is this good to go?

xylar (Contributor) commented Aug 30, 2022

@jonbob, I think we can go ahead. RK4 isn't really a target for the performance team so I think @philipwjones will be happy as long as we don't break anything (and we're not likely to do that in RK4 itself).

xylar (Contributor) commented Aug 30, 2022

I think @philipwjones's only comment was to say that the loop bounds changes were fine, but these have been removed in any case (see #4945 (comment)).

jonbob (Contributor) commented Aug 30, 2022

@xylar - I did resolve that conversation hoping it would add Phil's review approval. Anyway, I'll test this and get it merged as soon as possible

xylar (Contributor) commented Aug 30, 2022

> @xylar - I did resolve that conversation hoping it would add Phil's review approval. Anyway, I'll test this and get it merged as soon as possible

I don't think that's how GitHub works. I think commenting on the code automatically makes you a reviewer whether or not you intended that. So it's kind of important for people to either request changes or approve when they comment. Otherwise, it's hard to know if they still need to approve. But requesting changes sometimes feels rude because there's a big red flag. So I can see why people don't do it...

mark-petersen self-requested a review August 30, 2022 19:43

mark-petersen (Contributor) left a comment

Rebase was successful. Checked again by visual inspection and test compile. Still passes nightly suite with gnu debug.

jonbob added a commit that referenced this pull request Aug 30, 2022
…4945)

Modified vertical range for KEvertex calc to avoid NaNs

Modifies the vertical range over which KEvertex is calculated. The
previous version had NaNs in normalVelocity and activeTracers which
led to an immediate State Validation Fail when RK4 and
config_include_KE_vertex were used together.

Fixes #4933
[BFB]

jonbob (Contributor) commented Aug 30, 2022

test merge shows BFB results inside E3SM for:

  • ERS.ne11_oQU240.WCYCL1850NS.chrysalis_intel
  • SMS_D_Ld3.T62_oQU120.CMPASO-IAF.chrysalis_intel
  • SMS_P12x2.ne4_oQU240.WCYCL1850NS.chrysalis_intel.allactive-mach_mods

merged to next

jonbob merged commit d15f9fa into E3SM-Project:master Aug 31, 2022

jonbob (Contributor) commented Aug 31, 2022

merged to master

Labels: BFB (PR leaves answers BFB), bug fix PR, mpas-ocean
Development

Successfully merging this pull request may close these issues.

E3SM fails with RK4 and config_include_KE_vertex
7 participants