Feature/bmauer/restart for jedi #529

Merged: 5 commits, Aug 28, 2020

@bena-nasa (Collaborator) commented Aug 27, 2020

This adds new functionality requested by the Jedi developers.
They need to be able to run multiple forward time integrations within one execution, restoring the initial state of whatever model is hooked into the Jedi framework between integrations. This has already been done for the FV3 standalone case and for UFS; they are waiting on us.

Jedi currently hooks into our Cap and CapGridComp through the custom driver Kyle wrote, which implements the abstract model phases that Jedi defines in its framework. Essentially, they want to initialize GEOS and integrate it forward in time (this has been implemented), then rewind, start again from the same state, and confirm that the second integration ends in the same state as the first. This is similar to replay but not quite the same.

Because this is driven from outside, rather than as in replay where the extra time looping is controlled inside the GCM, I could not use the Record feature we currently have. That feature works by calling MAPL_AddRecord, which is not a recursive routine, so it relies on the call being made in the component's set services, after which the record states are replicated in the children as part of generic initialize. In addition, the GenericRecord feature uses alarms to control when the record happens, and I do not trust alarms until ESMF fixes them. Indeed, I tried hardcoding an AddRecord in the CapGridComp, set to record at the model start time, just to see whether the underlying machinery would work. This caused a hang in the alarms after the rewind, so that is a no-go.

Instead, I added simple functions in MAPL_Generic that the driver can invoke directly, through calls in Cap and CapGridComp, to rewind the clock, checkpoint the state to memory, and restore the state. It is basically generic record/refresh without all the alarm complication. I also added a new type to the MAPL state to store the filenames, whether in memory or on disk, because the existing record structure is convoluted with the alarms. By default it does memory checkpointing, but I added a setter so that you can checkpoint to disk if you really want to.

With this, Jedi can checkpoint the state to memory, rewind the cap clocks, and restore the saved state as it sees fit.
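
To make the checkpoint/rewind/restore sequence described above concrete, here is a toy sketch in Python. It is not the MAPL code (which is Fortran) and does not reflect the actual MAPL_Generic interfaces: `ToyModel`, `set_checkpoint_to_disk`, `checkpoint`, `rewind_clock`, `restore`, and `integrate_twice` are all hypothetical names, and the "model" is a one-variable recurrence. It only illustrates the idea that an outside driver saves the state (to memory by default, optionally to disk), integrates, rewinds the clock, restores, and integrates again to the same end state.

```python
import copy
import os
import pickle
import tempfile


class ToyModel:
    """Hypothetical stand-in for a model with a clock and a state."""

    def __init__(self, state, start_step=0):
        self.state = dict(state)
        self.step = start_step           # plays the role of the cap clock
        self._start_step = start_step
        self._checkpoint = None          # in-memory copy, or a file path
        self._use_disk = False           # assumption: memory is the default

    def set_checkpoint_to_disk(self, use_disk):
        """Choose disk instead of memory for the next checkpoint."""
        self._use_disk = use_disk

    def checkpoint(self):
        """Save the current state, in memory or in a temporary file."""
        if self._use_disk:
            fd, path = tempfile.mkstemp(suffix=".ckpt")
            with os.fdopen(fd, "wb") as f:
                pickle.dump(self.state, f)
            self._checkpoint = path
        else:
            self._checkpoint = copy.deepcopy(self.state)

    def rewind_clock(self):
        """Reset the clock to the start time; the state is untouched."""
        self.step = self._start_step

    def restore(self):
        """Replace the current state with the saved one."""
        if self._checkpoint is None:
            raise RuntimeError("no checkpoint has been taken")
        if isinstance(self._checkpoint, str):    # a path means disk
            with open(self._checkpoint, "rb") as f:
                self.state = pickle.load(f)
        else:
            self.state = copy.deepcopy(self._checkpoint)

    def discard_checkpoint(self):
        """Drop the saved state and remove the file if one was written."""
        if isinstance(self._checkpoint, str):
            os.remove(self._checkpoint)
        self._checkpoint = None

    def run(self, nsteps):
        """Toy forward integration that depends on state and clock."""
        for _ in range(nsteps):
            self.state["t"] = 0.5 * self.state["t"] + self.step
            self.step += 1


def integrate_twice(use_disk=False, nsteps=5):
    """Driver-side loop: checkpoint, run, rewind, restore, run again."""
    model = ToyModel({"t": 1.0})
    model.set_checkpoint_to_disk(use_disk)
    model.checkpoint()
    model.run(nsteps)
    first = (model.step, dict(model.state))
    model.rewind_clock()
    model.restore()
    model.run(nsteps)
    second = (model.step, dict(model.state))
    model.discard_checkpoint()
    return first, second
```

In this sketch, `integrate_twice()` returns the clock and state after each of the two integrations; the two results should be equal if the rewind and restore are both done, which is the property Jedi wants to check. Note that the clock rewind and the state restore are separate calls, mirroring the separate functions described above.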

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Trivial change (affects only documentation or cleanup)

Checklist:

  • I have tested this change with a run of GEOSgcm (if non-trivial)
  • I have added one of the required labels (0 diff, 0 diff trivial, 0 diff structural, non 0-diff)
  • I have updated the CHANGELOG.md accordingly following the style of Keep a Changelog

@bena-nasa added the 🎁 New Feature and 0 Diff (verified zero-diff with the target branch) labels Aug 27, 2020
@bena-nasa requested a review from a team as a code owner August 27, 2020 19:28
@bena-nasa requested review from tclune and atrayano August 27, 2020 19:45
@tclune merged commit 2740cea into develop Aug 28, 2020
@tclune deleted the feature/bmauer/restart-for_jedi branch August 28, 2020 13:58