Decrease expense of ne0ARCTICGRISne30x8 test and C96 tests #1139
When I was looking at test expense last week, I somehow missed these extremely expensive C96 tests:
I feel it is a high priority to figure out a way to remove these or significantly decrease their expense, since they are burning our allocation and slowing down our testing.
At a minimum: high-res cases should just be short smoke tests, non-debug or very short debug. But we may want to do more than that.
C96 is one-degree resolution, so it's not really a high-resolution case, and we should compare it to other 1-degree cases. One thing we could do is make sure these tests use stub glacier and stub river, and then just do a 9-step test (e.g., SMS_Ln9) for them.
I'm thinking the best thing to do for the ne0ARCTICGRISne30x8 case is to rely on the build-namelist testing (I already have that in place for it and other special resolutions), so we could remove the explicit test for it. The build-namelist test takes care of most of what's needed: making sure the namelist is created and the datasets are in place.
Possibly another thing we should have is an additional test list to run for scientifically supported resolutions. aux_clm could be the one we do for most tags, but sci_clm (or whatever we call it) would additionally be run for release tags, or at times when we want to make sure the scientifically supported resolutions are all working well.
@ekluzek I agree with all of your thoughts here. See also #1141 for some additional thoughts, but I'm comfortable just relegating this to a build-namelist test if you are. The idea of having a sci_clm test list sounds good to me. Regarding C96 being at 1-degree resolution: about a year ago, I changed our 1-degree tests to generally be short-ish non-debug smoke tests, leaving the debug and ERP/ERI testing for coarse resolutions. So that should be done here, too.
To summarize @ekluzek's suggestion, which I'm good with: let's just rely on build-namelist testing, together with a sci_clm test list.
We should talk more about the C96 tests. Since it's the standard fv3 resolution, I don't want to just remove them. I think we should at least have one in place, but keep it short and non-debug. The others we should move to the new ctsm_sci list. We can talk more at our meeting this afternoon.
I'd also like to talk about removing some of these tests or at least reducing their core counts:
While not as expensive as the others in this issue, their high core counts mean that their queue wait times are high, so they hurt test suite turnaround time.
We decided to move these into the "ctsm_sci" test list. |
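For reference, here is a rough sketch of what an entry in the testlist XML (e.g., testlist_clm.xml) might look like for the new category; the grid alias, compset, testmods, machine, and wallclock below are illustrative placeholders, not final values:

```xml
<test name="SMS_Ln9" grid="C96_C96_mg17" compset="I2000Clm50BgcCrop" testmods="clm/default">
  <machines>
    <!-- category="ctsm_sci" keeps this out of the per-tag aux_clm suite -->
    <machine name="cheyenne" compiler="intel" category="ctsm_sci">
      <options>
        <option name="wallclock">01:00:00</option>
      </options>
    </machine>
  </machines>
</test>
```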
This reverts commit c7723f9 and makes some changes on top of it. Moving forward, we'd like to have a new test category, ctsm_sci, that tests some resolutions / compsets that are important in CESM, but are too expensive to test with every tag. Resolves ESCOMP#1139
CNFire: btran2 fixes and general cleanup

(1) Call routine to calculate fire's btran2 from CNFireArea; this has a few advantages:
- It makes the logic of CNFireArea more clear (rather than depending on a btran2 variable that is calculated from some other part of the code)
- This avoids having the biogeophysics depend on the biogeochemistry
- This lets us avoid doing this btran calc if using no-fire – or other, future, fire methods that don't need it

Note regarding testing: In the initial step, I kept this calculation dependent on a saved version of h2osoi_vol to avoid changing answers; I changed this in the answer-changing step (2), as noted below.

(2) Answer-changing fixes to CNFire's btran2 calculation and use:
(a) Calculate fire btran2 using updated h2osoi_vol (this is an answer-changing cleanup step from (1))
(b) TEMPORARY CHANGE (reverted in the cleanup in (3)): Reinitialize fire btran2 to spval for all patches in each time step, so that the fire code isn't trying to use btran2 from earlier time steps for patches that were in the exposed-veg filter at one point but no longer are. One implication of this is that, if there is currently no exposed veg on a column, the new code leads to the block of code that forces fire_m = 0 (because wtlf will be 0). Previously, in contrast, it looks like fire_m was allowed to be non-zero even if there is currently no exposed veg, because btran2 and wtlf were accumulated if a patch ever was exposed in the past.
(c) Limit fire btran2 to be <= 1, rather than letting it be slightly greater than 1. (Due to a conditional in CNFireArea, these slightly-greater-than-1 values were being ignored when computing btran_col, rather than averaging in a 1 value.)

(3) Non-answer-changing fire code cleanup:
(a) Cleanup of the btran2 fixes, including reverting the TEMPORARY CHANGE noted in (2b), instead relying on a better mechanism: just doing the calculations of btran_col and wtlf over the exposedvegp filter. Also, get rid of the checks for shr_infnan_isnan(btran2(p)) and btran2(p) <= 1 (allowed by the other changes in (2) and (3)).
(b) Set btran2 to 0 over non-exposed-veg points: this changes answers for the BTRAN2 diagnostic field, but nothing else. (This follows what is done for the standard BTRAN.)
(c) Move calc_fire_root_wetness for CNFireLi2021 into the base type to avoid future bugs (assuming that the next fire module will extend the base class but will also want to use this new version of calc_fire_root_wetness).
(d) Change fire looping structure to be more standard.

(4) Remove some very expensive tests from aux_clm, putting some in the new ctsm_sci test list instead.

(5) A bit of other minor cleanup.

Issues fixed:
- Resolves #1139 (Decrease expense of ne0ARCTICGRISne30x8 test and C96 tests)
- Resolves #1153 (Fire btran2 is only computed for exposed veg patches, but is used over all veg patches)
- Resolves #1170 (CNFire code: btran2 should not be skipped when it's greater than 1)
- Partially addresses #1142 (Add ctsm_sci test list that would be used for releases and making sure important resolutions run well)
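To make (2c) and (3a) concrete, here is a minimal Fortran sketch of the filter-based accumulation, not the actual CTSM routine: the names btran2, btran_col, wtlf, and filter_exposedvegp come from the commit message above, while the declarations and weight handling are illustrative assumptions.

```fortran
! Sketch of the filter-based accumulation described in (2c) and (3a).
! Illustrative only; declarations and weights are assumptions.
subroutine accumulate_fire_btran(num_exposedvegp, filter_exposedvegp, &
     patch_column, patch_wtcol, btran2, btran_col, wtlf)
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)

  integer,  intent(in)    :: num_exposedvegp        ! number of patches in exposed-veg filter
  integer,  intent(in)    :: filter_exposedvegp(:)  ! exposed-veg patch filter
  integer,  intent(in)    :: patch_column(:)        ! column index of each patch
  real(r8), intent(in)    :: patch_wtcol(:)         ! weight of each patch on its column
  real(r8), intent(inout) :: btran2(:)              ! patch-level fire btran2
  real(r8), intent(inout) :: btran_col(:)           ! column-level accumulated btran2
  real(r8), intent(inout) :: wtlf(:)                ! column-level accumulated patch weight

  integer :: fp, p, c

  ! Accumulate only over the exposedvegp filter: patches that are no longer
  ! exposed can never contribute stale btran2 values, so the old checks for
  ! shr_infnan_isnan(btran2(p)) and btran2(p) <= 1 are no longer needed.
  do fp = 1, num_exposedvegp
     p = filter_exposedvegp(fp)
     c = patch_column(p)

     ! (2c): cap btran2 at 1 rather than silently skipping values
     ! that are slightly greater than 1
     btran2(p) = min(btran2(p), 1._r8)

     wtlf(c)      = wtlf(c)      + patch_wtcol(p)
     btran_col(c) = btran_col(c) + btran2(p) * patch_wtcol(p)
  end do
end subroutine accumulate_fire_btran
```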
The test
SMS_Ln9.ne0ARCTICGRISne30x8_ne0ARCTICGRISne30x8_mt12.ISSP585Clm50BgcCrop.cheyenne_intel.clm-clm50cam6LndTuningMode
takes 470 core-hours to run, which is about 3x as expensive as any other test in the aux_clm test suite. To put this in perspective: if the aux_clm test suite is run 100 times per year (which is not at all unreasonable), this one test would use up nearly 1% of the SEWG annual allocation. Nearly all of the time in this test is spent in model initialization – probably (though I haven't confirmed) in init_interp.
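For the arithmetic behind that estimate: 470 core-hours/run × 100 runs/year = 47,000 core-hours/year, so for that to be nearly 1%, the SEWG annual allocation would be on the order of 5 million core-hours.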
We should think about what is really important about this test, and trim it down so that it is just testing the aspects that are truly important to test here. My guess is that we're not getting any additional code coverage (of the Fortran code) from this test, so it is mostly about testing compatibility of datasets. If so, can we cover this well enough with a cold start test? Or even by just relying on build-namelist testing for this resolution?
I know that it would be ideal to have a somewhat realistic test of every supported resolution, but our aux_clm test suite is unmanageably large, so I feel we need to be a little more thoughtful here, and cut some corners even if it means slightly increasing the risk that issues will be discovered in production. I feel it's important to ensure that as much as possible of our Fortran code is covered in the test suite, but if a user occasionally runs into a problem where a run at a particular (fairly obscure) resolution doesn't get off the ground, I feel that's an acceptable risk to take.