EAMxx: add multi-instance and nbfb testing #6984

mahf708 · 2025-02-07T21:23:57Z

Adding support for multi-instance capability in EAMxx (a one-liner in the MCT coupling layer) and adding customized/specialized MVK tests for climate/nbfb reproducibility in EAMxx. More details to come.

[BFB]

github-actions · 2025-02-07T21:26:06Z

PR Preview Action v1.6.0
🚀 View preview at https://E3SM-Project.github.io/E3SM/pr-preview/pr-6984/
Built to branch `gh-pages` at 2025-02-17 20:49 UTC. Preview will be ready when the GitHub Pages deployment is complete.

mahf708 · 2025-02-07T22:25:51Z

To trigger:

Generate baselines:

```shell
./cime/scripts/create_test MVKxx_Ly1.ne4pg2_ne4pg2.F2010-SCREAMv1.pm-cpu_intel.eamxx-mvkpert --generate --baseline-root=$PSCRATCH/e3sm-scratch/test-pr/baselines -b mkvxx_test -o
```

Compare to the baselines generated above:

```shell
./cime/scripts/create_test MVKxx_Ly1.ne4pg2_ne4pg2.F2010-SCREAMv1.pm-cpu_intel.eamxx-mvkpert --compare --baseline-root=$PSCRATCH/e3sm-scratch/test-pr/baselines -b mkvxx_test -o
```

A sample of the test output is viewable on the web (though I'm not sure it is meaningful yet; the fields seem to be constant? 🤷 More validation TBD): https://portal.nersc.gov/project/e3sm/mahf708/MVKxx_Ly1.ne4pg2_ne4pg2.F2010-SCREAMv1.pm-cpu_intel.eamxx-mvkpert.C.20250210_142343_isu2io.evv/
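The test name passed to `create_test` packs several settings into dot-separated fields. The sketch below decomposes it, *assuming* the usual CIME layout `TESTTYPE[_OPTS].GRID.COMPSET[.MACHINE_COMPILER[.TESTMOD]]` and that an `L` option such as `Ly1` sets the run length (unit `y`, count 1). `CimeTestName`, `parse_test_name`, and `run_length` are hypothetical helpers for illustration, not CIME's own parser.

```python
from typing import List, NamedTuple, Optional, Tuple


class CimeTestName(NamedTuple):
    testtype: str
    options: List[str]
    grid: str
    compset: str
    machine: Optional[str]
    compiler: Optional[str]
    testmod: Optional[str]


def parse_test_name(name: str) -> CimeTestName:
    """Split a CIME-style test name into its dot-separated fields.

    Assumed layout: TESTTYPE[_OPT...].GRID.COMPSET[.MACHINE_COMPILER[.TESTMOD]]
    """
    parts = name.split(".")
    if len(parts) < 3:
        raise ValueError(f"not a CIME test name: {name!r}")
    testtype, *options = parts[0].split("_")
    machine = compiler = testmod = None
    if len(parts) > 3:
        if "_" not in parts[3]:
            raise ValueError(f"expected MACHINE_COMPILER, got {parts[3]!r}")
        # Split on the last underscore so the compiler is always the final token.
        machine, compiler = parts[3].rsplit("_", 1)
    if len(parts) > 4:
        testmod = parts[4]
    return CimeTestName(testtype, options, parts[1], parts[2], machine, compiler, testmod)


def run_length(options: List[str]) -> Optional[Tuple[str, int]]:
    """Return (unit, count) from an option like 'Ly1' (assumed form: L + unit + count)."""
    for opt in options:
        if opt.startswith("L") and len(opt) > 2 and opt[2:].isdigit():
            return opt[1], int(opt[2:])
    return None


if __name__ == "__main__":
    t = parse_test_name(
        "MVKxx_Ly1.ne4pg2_ne4pg2.F2010-SCREAMv1.pm-cpu_intel.eamxx-mvkpert"
    )
    print(t.testtype, run_length(t.options), t.grid, t.machine, t.testmod)
```

Splitting the machine/compiler field on the last underscore keeps hyphenated machine names such as `pm-cpu` intact.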

cime_config/config_files.xml

bartgol · 2025-02-10T16:47:05Z

Adding @jgfouca since he is Mr. CIME, so he may have thoughts on the SystemTests part.

jgfouca

Very cool! I think this is our first E3SM-only SystemTest. We should think about moving the other tests that are very E3SM-specific here.

bartgol

I think Jim should review this, but I have one thought.

I see lots of overlap with MVK. So two sub-thoughts:

Let's prune what's the same, and only keep the differences
Wild idea: would it make sense to not add MVKxx, and just hack our buildnml scripts so that we read what's in user_nl_eamxx and perform the corresponding atmchanges? It would be tailored for this test type. We could put a check "if it's an MVK test, then parse user_nl_eamxx and do atmchanges"... EDIT: I now see that another big difference is that MVKxx calls ksxx, while MVK relies on this evv external library. If this is the easier thing to do, then so be it. But maybe @jgfouca can tell whether we can cast our problem into something that the existing MVK can handle?

EDIT 2: I don't want to burden this PR with more work; adding a feature is already a lot. But looking ahead, maybe we could make the infrastructure more general. E.g., I expect that other new components (like omega) may want an MVK-like test, but may not fit CIME's MVK approach. Maybe we should create an E3SM-owned MVK approach that delegates to the component's buildnml script the job of modifying the different instances' inputs, and that uses some module in the component's path to collect the output files. That is, remove assumptions in the main scripts about how input/output files are named/structured. Let the components handle that bit of info.
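The "parse user_nl_eamxx and do atmchanges" idea can be sketched roughly as follows. This *assumes* that user_nl_eamxx holds one `key = value` entry per line with `!` comments (the thread does not specify the format) and that each entry maps onto one `./atmchange key=value` call in the case directory; `user_nl_to_atmchange` is a hypothetical helper, not existing buildnml code.

```python
import shlex
from typing import List


def user_nl_to_atmchange(text: str) -> List[str]:
    """Turn 'key = value' lines into atmchange command strings (hypothetical helper)."""
    cmds = []
    for raw in text.splitlines():
        # Drop '!' comments (assumed Fortran-namelist style) and surrounding blanks.
        line = raw.split("!", 1)[0].strip()
        if not line:
            continue
        if "=" not in line:
            raise ValueError(f"cannot parse line: {raw!r}")
        key, value = (s.strip() for s in line.split("=", 1))
        # Quote the argument so values containing spaces survive the shell.
        cmds.append("./atmchange " + shlex.quote(f"{key}={value}"))
    return cmds


if __name__ == "__main__":
    sample = "! per-case tweaks\ninitial_conditions::Filename = ic.nc\n"
    for cmd in user_nl_to_atmchange(sample):
        print(cmd)
```

Emitting command strings rather than running them keeps the sketch side-effect free; a real buildnml hook would more likely call into EAMxx's Python utilities directly.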

mahf708 · 2025-02-10T17:15:39Z

I will explain the PR better before merging. The logic in this PR is copied, in very large part, from two earlier efforts: the vanilla default MVK, designed for EAM, and a specialized case for MPASO, called MVKO.

The goal is to hand this off to the infra team and, for better long-term maintainability, move all of this to the evv4esm repo (https://github.com/LIVVkit/evv4esm), which hosts the code I am overriding with these Python files. That way EAMxx gets first-class support there rather than a hack here. I got access to that repo on Friday, so I (or @mkstratos) could do that once we are happy with the results for EAMxx.

There's one minor issue I'd like to address before declaring this ready for review: I need to figure out how to sidestep the lack of conventional short-term archiving capability in EAMxx (a small part of this test relies on it). After that, I will need to run this test for 12+ months twice to see if it behaves as expected. If all goes well, I will clean up this PR into ~3 commits, rebase, and invite Jim, Luca, Rob, Mike, and Peter for a review.

Luca, I'd like you to help me think about the MCT coupling logic and how it interacts with this multi-instance capability. The one-liner edit in the MCT coupling layer does the job for this PR's purposes, but maybe it is confusing and annoying to do it like that? If you want, we should start thinking about this from a more organizational angle: the multi-instance capability relies on touching the last file of record (in the EAM world, that is atm_in; in EAMxx, it is scream_input.yaml). For EAM, the last file of record is handled automatically by the multi-instance mechanics deep down, but in EAMxx it obviously isn't, and it needs some manual intervention.
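In CIME multi-instance runs, each instance reads its own copy of the component inputs, tagged with a zero-padded instance number (for EAM, `atm_in_0001`, `atm_in_0002`, ...). One of this PR's commits ("EAMxx: suffix yaml_fname with instance index") applies the same idea to scream_input.yaml, but the exact EAMxx file name isn't shown in this thread, so the sketch below simply *assumes* the tag goes before the extension. `instance_suffix` and `per_instance_names` are hypothetical names.

```python
import os
from typing import List


def instance_suffix(inst: int, ninst: int) -> str:
    """CIME-style instance tag: empty when there is one instance, else _NNNN."""
    return f"_{inst:04d}" if ninst > 1 else ""


def per_instance_names(fname: str, ninst: int) -> List[str]:
    """Hypothetical naming: insert the instance tag before the file extension.

    Extensionless names (e.g. atm_in) simply get the tag appended.
    """
    stem, ext = os.path.splitext(fname)
    return [f"{stem}{instance_suffix(i, ninst)}{ext}" for i in range(1, ninst + 1)]


if __name__ == "__main__":
    print(per_instance_names("scream_input.yaml", 2))
    print(per_instance_names("atm_in", 2))
```

Keeping the single-instance name unchanged (empty suffix) means existing non-multi-instance cases would not need to know about the tagging at all.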

bartgol · 2025-02-10T17:22:12Z

Yeah, I'm perfectly fine doing things in-house to get the capability in. But I would really like a nice conversation in the infra team to avoid the proliferation of tests/scripts that are very similar to each other, and to promote instead the abstraction of key concepts, so that components are left in charge only of what is specific to them.

As for the shortcoming you mention (the archiving), I am not familiar with that feature. I don't know what I would have to do locally to test what happens, as I have never used it (I am not a model user; not enough, at least). But if you want, we can check together and discuss what needs to change (in eamxx and/or CIME) in order to fix things.

jgfouca · 2025-02-10T18:07:40Z

Yeah, I think MVK is also E3SM-only for all intents. We could move both and encapsulate some of the duplicated code in this directory.

mahf708 · 2025-02-10T23:36:31Z

@rljacob @bartgol @jgfouca, this is now ready for integration. PTAL. Note that @PeterCaldwell and I are planning to take this for a spin in terms of validation. For now, this uses a small subset of variables (only four, see components/eamxx/cime_config/SystemTests/ksxx_vars.json) just to get things rolling and it hardcodes the number of ensemble members to 2 (2!) only, also to get things rolling. We intend to test and see what sort of settings are scientifically valid for climate reproducibility testing for EAMxx. More to come, but this is the first technical hurdle to just get things going. See #6984 (comment) for a sample run with output. And refer to comments above about planned activities in terms of moving code around.

@mkstratos: This is ready if you'd like to take a rigorous look. Note that we can easily move stuff to evv once we are ready; most of the eamxx-specific mods are relatively small (so there is lots of duplicated code for now). Because evv gets released as part of the e3sm-unified/cime-latest conda envs, I'd like to keep all this duplication here until we figure out a better alternative.

Anyway, I hope this wraps up my role in the software/technical end of this, and I can go back to the validation aspects now :)

components/eamxx/src/mct_coupling/atm_comp_mct.F90

ndkeen · 2025-02-13T05:30:31Z

Might consider changes to the shell_commands. Not sure of the goal, but you probably don't want 72 vertical levels.
Start date? COSP? Compute tendencies? Budgets? No need for lambda setting (now default).

mahf708 · 2025-02-17T20:38:22Z

@rljacob: I'd like this merged soon so that we can evaluate it in the near future. I messaged @mkstratos to review it, and he said he would. There'll be another iteration on the settings (variable names, number of instances, etc.) once the test is scientifically validated.

I'll be mostly out for the week (still ill), but hopefully I don't need to be around for it to be merged. Okay with you if @bartgol or @jgfouca integrate it or do you prefer we wait until I am back online fully?

rljacob · 2025-02-17T20:47:55Z

I'm ok with merging.

mkstratos

Other than the duplicated EVV code, which we can (re)move later, this looks good to me.

mkstratos · 2025-02-17T21:39:00Z

components/eamxx/cime_config/SystemTests/ksxx.py

+    return monthly_avgs
+
+
+def load_mpas_climatology_ensemble(files, field_name, mask_value=None):


This method isn't used below, but this file will go away once we update EVV

yes, will clean up once we reconcile the eam/eamxx files in evv

mkstratos · 2025-02-17T21:40:28Z

components/eamxx/cime_config/SystemTests/ksxx.py

+    table_data = pd.DataFrame(details).T
+    uc_rejections = (table_data["K-S test p-val"] < args.alpha).sum()
+    _hdrs = [
+        "h",


Maybe pedantic, but the "h0" here originally refers to the null hypothesis (its acceptance/rejection) rather than to h0 from the history files (this only changes the text of the output tables on the generated webpage).

oops, I did a blind h0->h because of the archive stuff. Will also fix once we reconcile.
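For context on the `h0` naming: for each variable, the null hypothesis H0 says the baseline and test ensembles come from the same distribution, and it is rejected when the K-S p-value falls below `alpha` (the diff above counts these as `uc_rejections`). A minimal pure-Python sketch of the two-sample K-S statistic and of that per-variable decision follows. The p-values are taken as given; in practice they would typically come from something like `scipy.stats.ks_2samp`, which is an assumption about evv's internals rather than something shown here. The variable names in the usage are made up.

```python
from typing import Dict, Sequence


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic D = max_x |F_a(x) - F_b(x)|."""
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    sa, sb = sorted(a), sorted(b)
    na, nb = len(sa), len(sb)
    ia = ib = 0
    d = 0.0
    # Evaluate both empirical CDFs at every observed value.
    for x in sorted(set(sa) | set(sb)):
        while ia < na and sa[ia] <= x:
            ia += 1
        while ib < nb and sb[ib] <= x:
            ib += 1
        d = max(d, abs(ia / na - ib / nb))
    return d


def h0_decisions(pvals: Dict[str, float], alpha: float = 0.05) -> Dict[str, str]:
    """Per-variable null-hypothesis decision ('h0' here is NOT a history tape)."""
    return {var: ("reject" if p < alpha else "accept") for var, p in pvals.items()}


def count_rejections(pvals: Dict[str, float], alpha: float = 0.05) -> int:
    """Number of variables whose H0 is rejected at level alpha."""
    return sum(1 for p in pvals.values() if p < alpha)


if __name__ == "__main__":
    print(ks_statistic([1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]))
    print(h0_decisions({"VAR_A": 0.01, "VAR_B": 0.5}))
```

With very small ensembles (the PR currently hardcodes 2 members), the K-S test has little power, which is one reason the settings are still flagged as TBD.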

mahf708 · 2025-02-20T15:27:15Z

@bartgol and/or @jgfouca, could one of you integrate this now? I think it is ready; there will be clean-up activities for two aspects soon:

- All the .py files here will be moved to evv, and once we update that, we can rely on native support for EAMxx. This will take care of @mkstratos's comments above.
- The specific settings in the testmod (which can also be moved to the test itself) are TBD (number of instances, specific settings to test, etc.). This will take care of @ndkeen's comments above.

jgfouca · 2025-02-20T16:38:42Z

@mahf708 , yes, I can start integrating.

bartgol · 2025-02-20T16:39:13Z

Ok. Just quickly re-running failed v1 checks, to be sure. Should just be a matter of old baselines.

jgfouca · 2025-02-20T16:40:06Z

Merged to next.

mahf708 added the EAMxx label Feb 7, 2025

mahf708 requested a review from mkstratos February 7, 2025 21:23

mahf708 requested a review from rljacob February 7, 2025 22:23

rljacob reviewed Feb 7, 2025

cime_config/config_files.xml

rljacob approved these changes Feb 7, 2025

bartgol requested a review from jgfouca February 10, 2025 16:46

jgfouca approved these changes Feb 10, 2025

bartgol reviewed Feb 10, 2025

rljacob requested a review from jasonb5 February 10, 2025 18:30

mahf708 added 14 commits February 10, 2025 15:24

EAMxx: start adding mvk test skeleton to eamxx

53b1663

EAMxx: suffix yaml_fname with instance index

7749a79

EAMxx: add docs for multi-instance and nbfb testing

7026ae3

EAMxx: override evv test internals to accomodate eamxx

3f3b386

EAMxx: add mvkpert testmod to help with setup

d2df0f5

EAMxx: accomodate eamxx conventions in file handling

76c3b13

EAMxx: use ".h." convention for eamxx history files

eb323d2

EAMxx: use montlhy files instead of daily

ea6bba1

EAMxx: more edits needed for move to monthly

c323b81

EAMxx: rename files to ease generation and comparions

2299cb6

EAMxx: turn off cice io because it is causing problems

a30691d

EAMxx: add comment about cice io to fix it later

303c8b9

EAMxx: fix hist workaround and fix docs

d5640fe

EAMxx: override more of evv internals to accomodate eamxx

cfd3659

mahf708 force-pushed the mahf708/eamxx/mvk-testing branch from 7882633 to cfd3659 February 10, 2025 23:31

mahf708 marked this pull request as ready for review February 10, 2025 23:31

mahf708 requested review from singhbalwinder and PeterCaldwell February 10, 2025 23:31

mahf708 assigned jgfouca Feb 10, 2025

mahf708 added the Testing label Feb 10, 2025

mahf708 requested a review from bartgol February 11, 2025 02:18

mahf708 commented Feb 11, 2025

components/eamxx/src/mct_coupling/atm_comp_mct.F90

bartgol approved these changes Feb 11, 2025

rljacob changed the title ~~EAMxx: multi-instance and nbfb testing~~ EAMxx: add multi-instance and nbfb testing Feb 11, 2025

mahf708 assigned bartgol Feb 11, 2025

mahf708 requested review from ndkeen and AaronDonahue February 12, 2025 22:29

rljacob closed this Feb 17, 2025

rljacob reopened this Feb 17, 2025

mkstratos approved these changes Feb 17, 2025

jgfouca merged commit fdf19f2 into master Feb 21, 2025
40 of 56 checks passed

jgfouca deleted the mahf708/eamxx/mvk-testing branch February 21, 2025 19:13
