New SPLIT_CHECKPOINT option to replace read/write by face #2394
Mostly ok. But worth fixing up the error trapping before we merge. I may have missed some checks, so please do another sweep.
Co-authored-by: Tom Clune <thomas.l.clune@nasa.gov>
Requested some minor changes for code safety. (Fresh off an admonition about testing from last week's workshop.)
@tclune @atrayano So I guess we should either merge (assuming CI tests pass), or not merge if I've made things too ugly and instead try to clean up this module to an acceptable state. We will probably then need to clean up any script that was aware of the by_face logic and replace it with the new logic, but that's not in this repo.
If it works and you (@bena-nasa) don't see any simple way to improve code quality, I say merge. Safe to say that after MAPL3 settles down, a re-engineering effort for the o-server is (unfortunately) in order. It is simply too important to leave in such a confusing state.
I don't think there's a "simple" or quick way to improve code quality; it's a question of how much time and effort we want to put in. It could be cleaned up by looking at the interfaces, seeing whether anything can be simplified across all the write overloads, and breaking up or reorganizing the module into more logical programmatic structures where possible. The o-server is just one piece of this code. In some ways this is more complicated than history: there we only support output of a rather limited subset of the possible ESMF fields we can represent (at least in MAPL2). The checkpoint/restart output layer must support every conceivable ESMF field the model needs in NetCDF, all with the unique gather/scatter algorithm used by nothing else in the code. It looks like there are conflicts I need to resolve because develop changed, but then I agree: merge, and figure out where cleaning all this up lies in our priorities.
This implements a new SPLIT_CHECKPOINT option that replaces the more limited read/write by face. The motivation is that on some file systems writing to more files at once gives the best performance, and the best number of files is more than the 6 that writing by face produces.
The general idea is that setting "SPLIT_CHECKPOINT: YES" triggers the checkpoints to be split. It still uses NUM_WRITERS to determine how many files to make.
When splitting is requested, the supplied checkpoint name (for example fvcore_internal_checkpoint) becomes a single YAML file.
For now, this YAML file just holds the resolution and the number of files that were written.
The individual files would then be
fvcore_internal_checkpoint_0
fvcore_internal_checkpoint_1
...
fvcore_internal_checkpoint_n
where n = NUM_WRITERS-1
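To make the naming scheme concrete, here is a minimal Python sketch that lists the data file names for a given checkpoint name and writer count. The function name `split_checkpoint_names` is hypothetical and not part of MAPL; it only mirrors the `_0` ... `_n` suffix pattern described above.

```python
def split_checkpoint_names(base_name: str, num_writers: int) -> list[str]:
    """Return the split data file names for one checkpoint.

    Hypothetical helper: mirrors the <base>_0 ... <base>_(NUM_WRITERS-1)
    pattern from the description; it does not call any MAPL code.
    """
    if num_writers < 1:
        raise ValueError("num_writers must be at least 1")
    # One file per writer, suffixed with the zero-based writer index.
    return [f"{base_name}_{i}" for i in range(num_writers)]


if __name__ == "__main__":
    for name in split_checkpoint_names("fvcore_internal_checkpoint", 4):
        print(name)
```

The file named `fvcore_internal_checkpoint` itself is not in this list, since in split mode it is the YAML file rather than a data file.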
When it comes time to read these back as restarts, I first check whether I can make an HCONFIG out of the supplied restart name and, if so, whether it has the keys I'm looking for that denote a set of split restarts.
If so, that triggers the SPLIT_RESTART option, which then reads the N files back.
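The detection step can be sketched in Python as follows. This is an illustration only: the PR does this in Fortran through HCONFIG, the key names `num_files` and `resolution` are assumptions (the description does not give the actual keys), and the flat `key: value` parser is a stand-in for a real YAML parser.

```python
from typing import Optional

# Hypothetical key names; the real keys in the YAML file are not stated.
REQUIRED_KEYS = {"num_files", "resolution"}


def parse_flat_mapping(text: str) -> Optional[dict]:
    """Parse simple 'key: value' lines; return None if the text is not one."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if not sep or not key.strip():
            return None
        result[key.strip()] = value.strip()
    return result or None


def detect_split_restart(raw: bytes) -> Optional[int]:
    """Return the number of split files, or None for an ordinary restart."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return None  # binary content, e.g. a NetCDF-4/HDF5 file
    mapping = parse_flat_mapping(text)
    if mapping is None or not REQUIRED_KEYS.issubset(mapping):
        return None
    try:
        return int(mapping["num_files"])
    except ValueError:
        return None
```

Returning `None` for anything that does not parse keeps the single-file path as the default, which matches the "try split first, otherwise read normally" order described above.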
If the number of files does not evenly divide NY, an error is thrown. I am working on Python scripts that can be used to "resplit" the files.
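The divisibility requirement can be illustrated with a short Python sketch. It assumes each file holds an equal, contiguous block of rows along the NY dimension, which is a guess about the layout rather than something the description states; the function names are hypothetical.

```python
def rows_per_file(ny: int, num_files: int) -> int:
    """Rows of the NY dimension held by each file, or raise if uneven."""
    if num_files < 1:
        raise ValueError("num_files must be at least 1")
    if ny % num_files != 0:
        raise ValueError(
            f"number of files ({num_files}) does not divide NY ({ny})"
        )
    return ny // num_files


def row_ranges(ny: int, num_files: int) -> list[tuple[int, int]]:
    """Half-open [start, stop) row range per file (assumed contiguous)."""
    n = rows_per_file(ny, num_files)
    return [(i * n, (i + 1) * n) for i in range(num_files)]
```

For example, `row_ranges(12, 4)` gives four blocks of three rows, while `rows_per_file(12, 5)` raises a `ValueError`.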
Note that this required moving some code around. The main issue is that for a split restart I essentially need to make the number of restart files the "NUM_READERS", but in the old code the communicators would already have been created by that point. I needed to defer this, so I moved where the communicators are created when reading. Of course, if the user has a single-file input, they can still set NUM_READERS and go through the normal parallel NetCDF path.
However, if I've detected that this is a split file, I can override num_readers for that particular file, and the communicators only get created after this point.
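The ordering constraint, that the reader count must be settled before any communicators exist, can be sketched as below. This is plain Python with no MPI; `plan_read` and its return fields are hypothetical names used only to show the decision, not the MAPL implementation.

```python
from typing import Optional


def plan_read(requested_num_readers: int,
              split_file_count: Optional[int]) -> dict:
    """Decide the reader count for one restart file.

    split_file_count is None for an ordinary single-file restart, or the
    number of split files recorded for a split restart. Communicators
    would be created only after this decision has been made.
    """
    if split_file_count is not None:
        # Split restart: one reader per file, overriding the user setting.
        return {"mode": "split", "num_readers": split_file_count}
    # Single file: honor NUM_READERS and use the parallel NetCDF path.
    return {"mode": "single", "num_readers": requested_num_readers}
```

With this shape, a run configured with NUM_READERS of 6 that encounters an 8-way split restart would use 8 readers for that file only, while single-file restarts keep using 6.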
I've done a start/stop regression test. That is, I ran the control code for 24 hours; then, with the new code, I took the same unsplit restarts, ran 12 hours, output split restarts, read those split restarts, and output the normal single-file restarts. This was 0-diff against the 24-hour control run.
Description
Related Issue
Motivation and Context
How Has This Been Tested?
Types of changes
Checklist: