Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Initial Condition Handling to EAMxx DAG Generator #6860

Merged

Conversation

mjs271
Copy link

@mjs271 mjs271 commented Dec 19, 2024

Finishes PR #3101 originally submitted to the scream repo the atmospheric DAG generator (atmosphere_process_dag.xpp) and atmosphere driver to get things working again. Namely,

There are now 2 checkpoints during AtmosphereDriver::initialize() at which we build the most complete DAG available and write it to file.
This is intended to serve as a diagnostic tool in case a build crashes, giving the user information about the created/initialized processes or fields.
These checkpoints occur at the end of the following functions, and the associated DAG is written out to a .dot file in the run directory:

  1. create_fields()--scream_atm_createField_dag.dot
  2. initialize_atm_procs()--scream_atm_initProc_dag.dot

I've added some code that enables the DAG to properly handle initial conditions and determine whether they are missing at the time when he DAG is generated, and if so this is indicated in the corresponding node (see images in original PR for examples).
In the initial DAG, (within initialize_atm_procs()), an extra box is added to the DAG with a disclaimer that any fields indicated as missing may be provided by the forthcoming initial condition file.
A node for the Initial Condition is now displayed on the DAG with edges leading to the "Begin of Atm Time Step" node adn any nodes for a process that uses and does not change a field from the initial condition.
I've made some minor formatting changes for readability and to more clearly indicate the different types of objects represented on the DAG--e.g., the Begin/End of Atm. Timestep nodes are colored differently, and fields are printed in green when provided directly by the initial condition at initialization.
I've also fenced-off the write_dag() statements to ensure they are only printed by the main MPI thread, since I noticed occasional formatting errors due to a presumptive race condition.
Additionally, when running as an eamxx standalone test, the filenames of the DAGs are tagged with .np<N> to avoid conflicts writing to file.
Lastly, I added a RESOURCE_LOCK to the standalone test mam4_srf_online_emiss_mam4_constituent_fluxes that appeared to be failing due to incorrect order of running the tests being compared.

Note: Closes E3SM/scream Issue #1869.

@mjs271 mjs271 force-pushed the mjs271/eamxx/DAG-refactor branch from ef142bc to c15b1e5 Compare December 19, 2024 13:31
@mjs271
Copy link
Author

mjs271 commented Dec 19, 2024

Looks like I can't add reviewers, so tagging those from original PR @bartgol @tcclevenger @jgfouca @mahf708 @AaronDonahue

@mjs271
Copy link
Author

mjs271 commented Dec 19, 2024

Note that the eamxx-sa/gcc-cuda/* test failures are expected. The fix for mam4_aero_microphys_standalone_baseline_cmp will be in a following PR shortly

m_ad_status |= s_fields_created;

// If the user requested it, we can save a dictionary of the FM fields to file
auto& driver_options_pl = m_atm_params.sublist("driver_options");
// auto& driver_options_pl = m_atm_params.sublist("driver_options");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line can be deleted

@@ -9,6 +9,7 @@ CreateADUnitTest(${TEST_BASE_NAME}
LABELS physics mam4_srf_online_emiss mam4_constituent_fluxes
MPI_RANKS ${TEST_RANK_START} ${TEST_RANK_END}
FIXTURES_SETUP_INDIVIDUAL ${FIXTURES_BASE_NAME}
PROPERTIES RESOURCE_LOCK ${FIXTURES_BASE_NAME}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you serializing these tests? Is this b/c of the generated dag? Imho, we can just let them run in ||, even if they overwrite each other's dag...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one was unrelated to the dag, though kept crashing while I was debugging. Could be idiosyncratic to whichever machine I was on, this appeared to fix it. I'm also ok with reverting and keeping an eye out

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was it only this test that failed? You could modify its input.yaml file to set the dag verbosity to 0 (or whatever turns it completely off). That is, assuming the dag was the issue...

Copy link
Author

@mjs271 mjs271 Jan 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bartgol --getting back to this after the break, and it appears that the failures I was seeing were due to some transient computational weirdness. Tests seem to be passing without the lock, but I'll run a few more tests on my end to confirm

Copy link
Contributor

@bartgol bartgol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I have one question on the test modification.

@mjs271 mjs271 force-pushed the mjs271/eamxx/DAG-refactor branch from 4f0a134 to e37b47b Compare January 6, 2025 18:32
@bartgol
Copy link
Contributor

bartgol commented Jan 6, 2025

@mjs271 I think you may have rebased onto the eagle-project's version of master (or onto your local version of master), rather that onto e3sm-project's master.

@mjs271
Copy link
Author

mjs271 commented Jan 6, 2025

oof! I'll push a fix shortly

@mjs271 mjs271 force-pushed the mjs271/eamxx/DAG-refactor branch from e37b47b to 8e6b5c5 Compare January 7, 2025 00:18
@rljacob rljacob added the EAMxx PRs focused on capabilities for EAMxx label Jan 7, 2025
…ating as such

fixed minor bug--still an issue with 2 fields in p3 for pg2 cases

print formatting fix for dag

remove 2 intermediate DAGs

add descriptor box for IC fields
@mjs271 mjs271 force-pushed the mjs271/eamxx/DAG-refactor branch from 8e6b5c5 to 651927e Compare January 8, 2025 21:36
@mjs271
Copy link
Author

mjs271 commented Jan 9, 2025

I think this may be noted above, but to refresh:

  • The test failures for feamxx-sa/gcc-cuda/[...] (mam4_aero_microphys_standalone_baseline_cmp) are expected due to this PR not including the changes from Fast-forward MAM4xx submodule to Add Fixes for Test Failures #6867, where the failing tests are fixed.
  • The other failure for eamxx-v1 / cpu-gcc / SMS_Ln3_P4.ne4pg2_oQU480.F2010-MMF2 appears to be due to a missing input file and should be unrelated to the substance of this PR.

@rljacob
Copy link
Member

rljacob commented Jan 9, 2025

Is this ready?

@bartgol
Copy link
Contributor

bartgol commented Jan 9, 2025

Is this ready?

Yes. But I want to rerun tests now that we merged the mam4xx fix, to verify we are no longer non-deterministic in those tests.

@bartgol
Copy link
Contributor

bartgol commented Jan 9, 2025

The fails are unrelated. I'm merging.

@bartgol bartgol merged commit 1c14228 into E3SM-Project:master Jan 9, 2025
17 of 20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
EAMxx PRs focused on capabilities for EAMxx
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Output Atm DAG file is incorrect
5 participants