Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New SPLIT_CHECKPOINT option to replace read/write by face #2394

Merged
merged 29 commits into from
Feb 12, 2024

Conversation

bena-nasa
Copy link
Collaborator

This implements a new SPLIT_CHECKPOINT option that replaces the more limited read/write by face. This is because we know that on some file systems getting more files writing gives the best performance and this number is more than 6.

The general idea is that if you say "SPLIT_CHECKPOINT: YES" this triggers the checkpoints to be split. It still uses NUM_WRITERS to determine how many files to make.

The supplied checkpoint name, i.e. fvcore_internal_checkpoint for example now becomes a single yaml file if this is requested.
This YAML file for now just has the resolution and the number of files that were written.

The individual files would then be
fvcore_internal_checkpoint_0
fvcore_internal_checkpoint_1
fvcore_internal_checkpoint_n

where n = NUM_WRITERS-1

When it comes time to read these back as restarts, I first see, can I make an HCONFIG out of the supplied restart name and if so does it have the keys I'm looking for that would denote a set of split restarts.

If so that triggers the SPLIT_RESTART option which then reads the N files backs.

If the the number of files does not divide NY an error is thrown. I am working python scripts that can be used to "resplit" the files.

Note this required moving some code around. The big thing was that since if it is a split restart, I need to essential make the number of restarts the "NUM_READERS", but in the old code, the communicators for all would have already been created. I needed to defer this so I moved where the communicators are created when reading. Now of course if the user has a single file input, they can still set NUM_READERS and go through the normal parallel NETCDF path.

However, if I've detected this is a split file, I can override the num_readers for that particular file and the communicators only get created after this point.

I've done a start/stop regression test. I.E. the a control code for 24 hours, then with the new code take the same unspilt restarts, run 12 hours, output split restarts, read those split restarts, and output the normal single file restarts. This was 0-diff to the 24 hours control run

Description

Related Issue

Motivation and Context

How Has This Been Tested?

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Trivial change (affects only documentation or cleanup)

Checklist:

  • I have tested this change with a run of GEOSgcm (if non-trivial)
  • I have added one of the required labels (0 diff, 0 diff trivial, 0 diff structural, non 0-diff)
  • I have updated the CHANGELOG.md accordingly following the style of Keep a Changelog

@bena-nasa bena-nasa added 🎁 New Feature This is a new feature 0 Diff The changes in this pull request have verified to be zero-diff with the target branch. labels Sep 29, 2023
@bena-nasa bena-nasa requested a review from a team as a code owner September 29, 2023 14:52
Copy link
Collaborator

@tclune tclune left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly ok. But worth fixing up the error trapping before we merge. I may have missed some checks, so please do another sweep.

bena-nasa and others added 8 commits September 29, 2023 11:09
Co-authored-by: Tom Clune <thomas.l.clune@nasa.gov>
Co-authored-by: Tom Clune <thomas.l.clune@nasa.gov>
Co-authored-by: Tom Clune <thomas.l.clune@nasa.gov>
Co-authored-by: Tom Clune <thomas.l.clune@nasa.gov>
Co-authored-by: Tom Clune <thomas.l.clune@nasa.gov>
Co-authored-by: Tom Clune <thomas.l.clune@nasa.gov>
Co-authored-by: Tom Clune <thomas.l.clune@nasa.gov>
Co-authored-by: Tom Clune <thomas.l.clune@nasa.gov>
Copy link
Collaborator

@tclune tclune left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requested some minor changes for code safety. (Fresh off an admonition about testing from last weeks workshop.)

@bena-nasa bena-nasa requested review from a team as code owners February 2, 2024 20:08
@bena-nasa bena-nasa requested a review from tclune February 2, 2024 20:10
@bena-nasa
Copy link
Collaborator Author

bena-nasa commented Feb 2, 2024

@tclune @atrayano
I finally after much, much struggle, a difficult to diagnose bug in the o-server that required tracking and fixing (already done in another PR that has been merged) and probably making the code too ugly, I have the generic split checkpoint writing with the o-server now too. As far as I'm concerned this seem to be working. I can start from the split checkpoints or a single checkpoint and I get to the same place at the end. It works both with no o-server or the o-server with both the mutligroup and "simple". I've also added a couple python utilities to split/recombine the checkpoints. Don't know if we want those, but I had them so figured might as well capture rather than just accidentally misplace the scripts.

So I guess we should merge (assuming CI tests pass) or not merge if I've made things too ugly and try to cleanup this module to an acceptable state

Probably then need to cleanup any script that was aware of the by_face logic and replace with new logic but that's not in this repo.

@tclune
Copy link
Collaborator

tclune commented Feb 2, 2024

If it works and you (@bena-nasa ) don't see any simple way to improve code quality, I say merge. Safe to say that after MAPL3 settles down, a re-engineering effort for the o-server is (unfortunately) in order. It is simply too important to leave in such a confusing state.

@bena-nasa
Copy link
Collaborator Author

bena-nasa commented Feb 8, 2024

If it works and you (@bena-nasa ) don't see any simple way to improve code quality, I say merge. Safe to say that after MAPL3 settles down, a re-engineering effort for the o-server is (unfortunately) in order. It is simply too important to leave in such a confusing state.

I don't think there's a "simple" or quick way to improve code quality, it's a question of how much time effort do we want to put in. It could be cleaned up, looking at the interfaces, seeing if anything can be simplified for all the write overloads, breaking/reorganizing this up into more logical programmatic structures where possible, the o-server is just one piece of this code.

In some ways this is more complicated than history, there we only support output of a rather limited subset of the possible ESMF fields we can represent (at least in MAPL2). The checkpoint/restart output layer must support every conceivable ESMF field the model needs in NetCDF. All with the unique gather/scatter algorithm used by nothing else in the code.

It looks like there are conflicts I need to resolve because develop changed but then I agree merge. And figure out where cleaning all this up lies in our priorities.

@tclune tclune merged commit 336f207 into develop Feb 12, 2024
29 checks passed
@tclune tclune deleted the feature/bmauer/file_per_writer branch February 12, 2024 17:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0 Diff The changes in this pull request have verified to be zero-diff with the target branch. 🎁 New Feature This is a new feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants