New SPLIT_CHECKPOINT option to replace read/write by face #2394

bena-nasa · 2023-09-29T14:52:35Z

This implements a new SPLIT_CHECKPOINT option that replaces the more limited read/write by face. This is because we know that on some file systems getting more files writing gives the best performance and this number is more than 6.

The general idea is that if you say "SPLIT_CHECKPOINT: YES" this triggers the checkpoints to be split. It still uses NUM_WRITERS to determine how many files to make.

The supplied checkpoint name, i.e. fvcore_internal_checkpoint for example now becomes a single yaml file if this is requested.
This YAML file for now just has the resolution and the number of files that were written.

The individual files would then be
fvcore_internal_checkpoint_0
fvcore_internal_checkpoint_1
fvcore_internal_checkpoint_n

where n = NUM_WRITERS-1

When it comes time to read these back as restarts, I first see, can I make an HCONFIG out of the supplied restart name and if so does it have the keys I'm looking for that would denote a set of split restarts.

If so that triggers the SPLIT_RESTART option which then reads the N files backs.

If the the number of files does not divide NY an error is thrown. I am working python scripts that can be used to "resplit" the files.

Note this required moving some code around. The big thing was that since if it is a split restart, I need to essential make the number of restarts the "NUM_READERS", but in the old code, the communicators for all would have already been created. I needed to defer this so I moved where the communicators are created when reading. Now of course if the user has a single file input, they can still set NUM_READERS and go through the normal parallel NETCDF path.

However, if I've detected this is a split file, I can override the num_readers for that particular file and the communicators only get created after this point.

I've done a start/stop regression test. I.E. the a control code for 24 hours, then with the new code take the same unspilt restarts, run 12 hours, output split restarts, read those split restarts, and output the normal single file restarts. This was 0-diff to the 24 hours control run

Description

Related Issue

Motivation and Context

How Has This Been Tested?

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Trivial change (affects only documentation or cleanup)

Checklist:

I have tested this change with a run of GEOSgcm (if non-trivial)
I have added one of the required labels (0 diff, 0 diff trivial, 0 diff structural, non 0-diff)
I have updated the CHANGELOG.md accordingly following the style of Keep a Changelog

base/FileIOShared.F90

tclune

Mostly ok. But worth fixing up the error trapping before we merge. I may have missed some checks, so please do another sweep.

Co-authored-by: Tom Clune <thomas.l.clune@nasa.gov>

base/NCIO.F90

tclune

Requested some minor changes for code safety. (Fresh off an admonition about testing from last weeks workshop.)

bena-nasa · 2024-02-02T20:16:45Z

@tclune @atrayano
I finally after much, much struggle, a difficult to diagnose bug in the o-server that required tracking and fixing (already done in another PR that has been merged) and probably making the code too ugly, I have the generic split checkpoint writing with the o-server now too. As far as I'm concerned this seem to be working. I can start from the split checkpoints or a single checkpoint and I get to the same place at the end. It works both with no o-server or the o-server with both the mutligroup and "simple". I've also added a couple python utilities to split/recombine the checkpoints. Don't know if we want those, but I had them so figured might as well capture rather than just accidentally misplace the scripts.

So I guess we should merge (assuming CI tests pass) or not merge if I've made things too ugly and try to cleanup this module to an acceptable state

Probably then need to cleanup any script that was aware of the by_face logic and replace with new logic but that's not in this repo.

tclune · 2024-02-02T20:39:31Z

If it works and you (@bena-nasa ) don't see any simple way to improve code quality, I say merge. Safe to say that after MAPL3 settles down, a re-engineering effort for the o-server is (unfortunately) in order. It is simply too important to leave in such a confusing state.

bena-nasa · 2024-02-08T22:06:51Z

If it works and you (@bena-nasa ) don't see any simple way to improve code quality, I say merge. Safe to say that after MAPL3 settles down, a re-engineering effort for the o-server is (unfortunately) in order. It is simply too important to leave in such a confusing state.

I don't think there's a "simple" or quick way to improve code quality, it's a question of how much time effort do we want to put in. It could be cleaned up, looking at the interfaces, seeing if anything can be simplified for all the write overloads, breaking/reorganizing this up into more logical programmatic structures where possible, the o-server is just one piece of this code.

In some ways this is more complicated than history, there we only support output of a rather limited subset of the possible ESMF fields we can represent (at least in MAPL2). The checkpoint/restart output layer must support every conceivable ESMF field the model needs in NetCDF. All with the unique gather/scatter algorithm used by nothing else in the code.

It looks like there are conflicts I need to resolve because develop changed but then I agree merge. And figure out where cleaning all this up lies in our priorities.

bena-nasa added 6 commits September 22, 2023 15:42

changes for per writer output

2e8314c

more bug fixes

b17510b

more bug fixes

4362894

remove commented out code

cd6c99e

Merge branch 'develop' into feature/bmauer/file_per_writer

feb072b

update changelog

ddba7ab

bena-nasa added 🎁 New Feature This is a new feature 0 Diff The changes in this pull request have verified to be zero-diff with the target branch. labels Sep 29, 2023

bena-nasa requested a review from a team as a code owner September 29, 2023 14:52

remove uncommented code

3a8a1ce