Decrease expense of ne0ARCTICGRISne30x8 test and C96 tests #1139
When I was looking at test expense last week, I somehow missed these extremely expensive C96 tests:
I feel it is a high priority to figure out a way to remove these or significantly decrease their expense, since they are burning our allocation and slowing down our testing.
At a minimum: high-res cases should just be short smoke tests, non-debug or very short debug. But we may want to do more than that.
C96 is one-degree resolution, so it's not really a high-resolution case, and we should compare it to other 1-degree cases. One thing we could do is make sure these tests use stub glacier and stub river, and then just do a 9-step test (e.g., SMS_Ln9) for them.
I'm thinking the best thing to do for the ne0ARCTICGRISne30x8 case is to rely on the build-namelist testing (I already have that in place for it and other special resolutions), so we could remove the explicit test for it. The build-namelist test takes care of most of what's needed: making sure the namelist is created and the datasets are in place.
Possibly another thing we should have is an additional test list to run for scientifically supported resolutions. aux_clm could be the one we do for most tags, but sci_clm (or whatever we call it) would additionally be run for release tags, or at times when we want to make sure the scientifically supported resolutions are all working well.
@ekluzek I agree with all of your thoughts here. See also #1141 for some additional thoughts, but I'm comfortable just relegating this to a build-namelist test if you are. The idea of having a sci_clm test list sounds good to me. Regarding C96 being at 1-degree resolution: about a year ago, I changed our 1-degree tests to generally be short-ish non-debug smoke tests, leaving the debug and ERP/ERI testing for coarse resolutions. So that should be done here, too.
To summarize @ekluzek's suggestion, which I'm good with: let's just rely on build-namelist testing, together with a sci_clm test list.
We should talk more about the C96 tests. Since it's the standard fv3 resolution, I don't want to just remove them. I think we should at least have one in place, but keep it short and non-debug. The others we should move to the new ctsm_sci list. We can talk more at our meeting this afternoon.
I'd also like to talk about removing some of these tests or at least reducing their core counts:
While not as expensive as the others in this issue, their high core counts mean that their queue wait times are high, so they hurt test suite turnaround time.
We decided to move these into the "ctsm_sci" test list. |
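For reference, here is a rough sketch of what an entry in the testlist XML (e.g., testlist_clm.xml) might look like for the new category; the grid alias, compset, testmods, machine, and wallclock below are illustrative placeholders, not final values:

```xml
<test name="SMS_Ln9" grid="C96_C96_mg17" compset="I2000Clm50BgcCrop" testmods="clm/default">
  <machines>
    <!-- category="ctsm_sci" keeps this out of the per-tag aux_clm suite -->
    <machine name="cheyenne" compiler="intel" category="ctsm_sci">
      <options>
        <option name="wallclock">01:00:00</option>
      </options>
    </machine>
  </machines>
</test>
```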
This reverts commit c7723f9 and makes some changes on top of it. Moving forward, we'd like to have a new test category, ctsm_sci, that tests some resolutions / compsets that are important in CESM, but are too expensive to test with every tag. Resolves ESCOMP#1139
CNFire: btran2 fixes and general cleanup

(1) Call routine to calculate fire's btran2 from CNFireArea; this has a few advantages:
- It makes the logic of CNFireArea more clear (rather than depending on a btran2 variable that is calculated from some other part of the code)
- This avoids having the biogeophysics depend on the biogeochemistry
- This lets us avoid doing this btran calc if using no-fire – or other, future, fire methods that don't need it

Note regarding testing: In the initial step, I kept this calculation dependent on a saved version of h2osoi_vol to avoid changing answers; I changed this in the answer-changing step (2), as noted below.

(2) Answer-changing fixes to CNFire's btran2 calculation and use:
(a) Calculate fire btran2 using updated h2osoi_vol (this is an answer-changing cleanup step from (1))
(b) TEMPORARY CHANGE (reverted in the cleanup in (3)): Reinitialize fire btran2 to spval for all patches in each time step, so that the fire code isn't trying to use btran2 from earlier time steps for patches that were in the exposed-veg filter at one point but no longer are. One implication of this is that, if there is currently no exposed veg on a column, the new code leads to the block of code that forces fire_m = 0 (because wtlf will be 0). Previously, in contrast, it looks like fire_m was allowed to be non-zero even if there is currently no exposed veg, because btran2 and wtlf were accumulated if a patch ever was exposed in the past.
(c) Limit fire btran2 to be <= 1, rather than letting it be slightly greater than 1. (Due to a conditional in CNFireArea, these slightly-greater-than-1 values were being ignored when computing btran_col, rather than averaging in a 1 value.)

(3) Non-answer-changing fire code cleanup:
(a) Cleanup of the btran2 fixes, including reverting the TEMPORARY CHANGE noted in (2b), instead relying on a better mechanism: just doing the calculations of btran_col and wtlf over the exposedvegp filter. Also, get rid of the checks for shr_infnan_isnan(btran2(p)) and btran2(p) <= 1 (allowed by the other changes in (2) and (3)).
(b) Set btran2 to 0 over non-exposed-veg points: this changes answers for the BTRAN2 diagnostic field, but nothing else. (This follows what is done for the standard BTRAN.)
(c) Move calc_fire_root_wetness for CNFireLi2021 into the base type to avoid future bugs (assuming that the next fire module will extend the base class but will also want to use this new version of calc_fire_root_wetness).
(d) Change fire looping structure to be more standard.

(4) Remove some very expensive tests from aux_clm, putting some in the new ctsm_sci test list instead.

(5) A bit of other minor cleanup.

Issues fixed:
- Resolves #1139 (Decrease expense of ne0ARCTICGRISne30x8 test and C96 tests)
- Resolves #1153 (Fire btran2 is only computed for exposed veg patches, but is used over all veg patches)
- Resolves #1170 (CNFire code: btran2 should not be skipped when it's greater than 1)
- Partially addresses #1142 (Add ctsm_sci test list that would be used for releases and making sure important resolutions run well)
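To make (2c) and (3a) concrete, here is a minimal Fortran sketch of the filter-based accumulation, not the actual CTSM routine: the names btran2, btran_col, wtlf, and filter_exposedvegp come from the commit message above, while the declarations and weight handling are illustrative assumptions.

```fortran
! Sketch of the filter-based accumulation described in (2c) and (3a).
! Illustrative only; declarations and weights are assumptions.
subroutine accumulate_fire_btran(num_exposedvegp, filter_exposedvegp, &
     patch_column, patch_wtcol, btran2, btran_col, wtlf)
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)

  integer,  intent(in)    :: num_exposedvegp        ! number of patches in exposed-veg filter
  integer,  intent(in)    :: filter_exposedvegp(:)  ! exposed-veg patch filter
  integer,  intent(in)    :: patch_column(:)        ! column index of each patch
  real(r8), intent(in)    :: patch_wtcol(:)         ! weight of each patch on its column
  real(r8), intent(inout) :: btran2(:)              ! patch-level fire btran2
  real(r8), intent(inout) :: btran_col(:)           ! column-level accumulated btran2
  real(r8), intent(inout) :: wtlf(:)                ! column-level accumulated patch weight

  integer :: fp, p, c

  ! Accumulate only over the exposedvegp filter: patches that are no longer
  ! exposed can never contribute stale btran2 values, so the old checks for
  ! shr_infnan_isnan(btran2(p)) and btran2(p) <= 1 are no longer needed.
  do fp = 1, num_exposedvegp
     p = filter_exposedvegp(fp)
     c = patch_column(p)

     ! (2c): cap btran2 at 1 rather than silently skipping values
     ! that are slightly greater than 1
     btran2(p) = min(btran2(p), 1._r8)

     wtlf(c)      = wtlf(c)      + patch_wtcol(p)
     btran_col(c) = btran_col(c) + btran2(p) * patch_wtcol(p)
  end do
end subroutine accumulate_fire_btran
```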
The test
SMS_Ln9.ne0ARCTICGRISne30x8_ne0ARCTICGRISne30x8_mt12.ISSP585Clm50BgcCrop.cheyenne_intel.clm-clm50cam6LndTuningMode
takes 470 core-hours to run, which is about 3x as expensive as any other test in the aux_clm test suite. To put this in perspective: if the aux_clm test suite is run 100 times per year (which is not at all unreasonable), this one test would use up nearly 1% of the SEWG annual allocation. Nearly all of the time in this test is spent in model initialization – probably (though I haven't confirmed) in init_interp.
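For the arithmetic behind that estimate: 470 core-hours/run × 100 runs/year = 47,000 core-hours/year, so for that to be nearly 1%, the SEWG annual allocation would be on the order of 5 million core-hours.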
We should think about what is really important about this test, and trim it down so that it is just testing the aspects that are truly important to test here. My guess is that we're not getting any additional code coverage (of the Fortran code) from this test, so it is mostly about testing compatibility of datasets. If so, can we cover this well enough with a cold start test? Or even by just relying on build-namelist testing for this resolution?
I know that it would be ideal to have a somewhat realistic test of every supported resolution, but our aux_clm test suite is unmanageably large, so I feel we need to be a little more thoughtful here, and cut some corners even if it means slightly increasing the risk that issues will be discovered in production. I feel it's important to ensure that as much as possible of our Fortran code is covered in the test suite, but if a user occasionally runs into a problem where a run at a particular (fairly obscure) resolution doesn't get off the ground, I feel that's an acceptable risk to take.