Feature request: Update trajectory sampler to read IODA files for AIST #2144

Open
bena-nasa opened this issue May 17, 2023 · 6 comments
Labels
❗ High Priority This is a high priority PR 🎁 New Feature This is a new feature ⌛ Long Term Long term issues

Comments

@bena-nasa
Collaborator

bena-nasa commented May 17, 2023

Arlindo stressed that he would like the trajectory sampler updated by the end of June so that it can read the IODA files from the AIST work. I'm making a note here of the steps that need to be done. Note that this is different from the swath sampler, since we already have a trajectory sampler.

I've enumerated what must happen to the current code. The current sampler assumes that you will have lat/lon/time as a function of a dimension named time. For the IODA files, you will have lat/lon/lev/time as a function of an arbitrary coordinate, and the names of those 4 variables may not be consistent. So the user needs to be able to simply tell the sampler the coordinate name and the names of the variables in the file series where it can find lon/lat/lev/time.
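As a rough illustration of what "user configurable" could mean here, the sketch below describes the input layout with a small Python dataclass. Everything in it is hypothetical: the class `TrajectoryInputSpec`, the helper `split_group`, and the example names (`Location`, `MetaData/longitude`, and so on) are assumptions for illustration, not the sampler's actual configuration keys or a guaranteed IODA layout.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TrajectoryInputSpec:
    """Hypothetical description of where the trajectory lives in an input file."""
    index_dim: str              # dimension that lon/lat/lev/time depend on
    lon: str
    lat: str
    time: str
    lev: Optional[str] = None   # vertical coordinate is optional

    def variable_paths(self):
        """Return the configured variable names, skipping an unset vertical coordinate."""
        names = {"lon": self.lon, "lat": self.lat, "time": self.time, "lev": self.lev}
        return {key: name for key, name in names.items() if name is not None}


def split_group(path):
    """Split 'Group/var' into ('Group', 'var'); a bare name has no group (None)."""
    group, _, name = path.rpartition("/")
    return (group or None, name)


# Example values are assumptions, not names read from a real file.
spec = TrajectoryInputSpec(
    index_dim="Location",
    lon="MetaData/longitude",
    lat="MetaData/latitude",
    time="MetaData/dateTime",
)
```

With a description like this, the reader code would look up each entry of `spec.variable_paths()` by group and name instead of assuming a dimension called time.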

So here is what needs to be done.

  1. Confirm it still works (done)
  2. Update to allow a template to be given so that we can sample to a file series. This would require periodic regeneration of the location stream and corresponding updates to the output file metadata.
  3. Update the sampler to take user-modifiable input, so that the coordinate that lon/lat/time depend on is user configurable, as are the names of the variables, rather than making any assumptions.
  4. The times as a function of this generalized coordinate may not be time ordered. We would either have to sort them on processing if we want to use the existing machinery, or not sort, which requires some updates to the code.
  5. The whole thing needs tweaking. In particular, right now the trajectory sampler writes at each time step, i.e. it appends to the file, which I believe is not optimal and will be a performance drag. We should modify it to use the "epoch" paradigm (accumulate for a number of steps before writing) so that the writes happen less frequently.
  6. The trajectory sampler may also need to keep track of a vertical coordinate, as a function of this common index, that we may need to interpolate to.
  7. The sampler also does everything on root, which may not be optimal. We could, for example, create the locstream distributed by the grid like we do with our own locstream, but there's a tradeoff with communication, as the data will need to be reshuffled in a not entirely trivial way back to, say, a single process for writing. On the other hand, if we change it so that the writing is less frequent and the locstream is grid-distributed, perhaps it would be better than what is there now.
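For item 4, one possible approach (an assumption, not a decision made in this issue) is to sort by time while remembering the permutation, so that results computed in time order can be put back into the original file order before writing. A minimal sketch with hypothetical helper names:

```python
def sort_by_time(times):
    """Return (order, sorted_times) where sorted_times[k] == times[order[k]]."""
    order = sorted(range(len(times)), key=lambda i: times[i])  # stable sort
    return order, [times[i] for i in order]


def restore_file_order(order, sorted_values):
    """Undo the sort: put values computed in time order back into file order."""
    restored = [None] * len(order)
    for k, i in enumerate(order):
        restored[i] = sorted_values[k]
    return restored


times = [30, 10, 20, 10]                      # unsorted times, e.g. seconds since a reference
order, sorted_times = sort_by_time(times)
sampled = [t * 2.0 for t in sorted_times]     # stand-in for the actual sampling
in_file_order = restore_file_order(order, sampled)
```

Because the sort is stable, locations with equal times keep their relative file order.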
@bena-nasa bena-nasa added 🎁 New Feature This is a new feature ❗ High Priority This is a high priority PR labels May 17, 2023
@bena-nasa
Collaborator Author

bena-nasa commented May 19, 2023

I did confirm that the current trajectory sampler works (i.e. it doesn't crash and does indeed create output given a single input file). I did this by making a cute little Python program that simulates a realistic trajectory I could use as input, since it is hard to test without an input file.
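The program mentioned above is not shown in this thread. Purely as an illustration of the idea, a synthetic ground track could be generated along these lines; the function name, the default inclination and period, and the simple sinusoidal track are all assumptions, and a real test would still need to write the result to a NetCDF file.

```python
import math
from datetime import datetime, timedelta


def synthetic_trajectory(start, n_points, step_seconds,
                         inclination_deg=60.0, period_seconds=5400.0):
    """Generate (time, lon, lat) tuples along a simple sinusoidal ground track."""
    points = []
    for k in range(n_points):
        elapsed = k * step_seconds
        phase = 2.0 * math.pi * elapsed / period_seconds
        lat = inclination_deg * math.sin(phase)
        # Longitude advances with the orbit and drifts west as the Earth rotates.
        lon = math.degrees(phase) - 360.0 * elapsed / 86400.0
        lon = (lon + 180.0) % 360.0 - 180.0   # wrap into [-180, 180)
        points.append((start + timedelta(seconds=elapsed), lon, lat))
    return points


track = synthetic_trajectory(datetime(2023, 5, 19), n_points=10, step_seconds=60)
```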

I already see that even the existing implementation is inadequate, beyond all the other stuff outlined above. For example, it does not take a GrADS-style template but rather a single file, so the current implementation only generates a single location stream based on that file at the start. This is in contrast to taking a template and regenerating the locstream and the time data you are sampling to when you run out of data in the existing file, which I'm 110% sure is a capability this needs.
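To make the template idea concrete, here is a minimal sketch of expanding a GrADS-style template and picking the file that covers a given model time. It assumes that each file's time stamp marks the start of its window and that files arrive at a fixed frequency. The function names and the example file name are hypothetical, and only a small subset of tokens is handled.

```python
from datetime import datetime, timedelta

# Subset of GrADS-style date tokens handled by this sketch.
_TOKENS = {"%y4": "%Y", "%m2": "%m", "%d2": "%d", "%h2": "%H", "%n2": "%M"}


def expand_template(template, when):
    """Replace GrADS-style date tokens in `template` with fields of `when`."""
    for token, fmt in _TOKENS.items():
        template = template.replace(token, when.strftime(fmt))
    return template


def file_for_time(template, when, reference, frequency):
    """Return the file whose window [start, start + frequency) contains `when`."""
    n_windows = (when - reference) // frequency   # floor division of timedeltas
    return expand_template(template, reference + n_windows * frequency)


template = "obs.%y4%m2%d2_%h2z.nc4"               # hypothetical file name pattern
reference = datetime(2023, 5, 17, 0)
six_hours = timedelta(hours=6)
current = file_for_time(template, datetime(2023, 5, 17, 7, 30), reference, six_hours)
```

When the name returned for the current model time differs from the file currently in use, the sampler would know it has run out of data and needs to rebuild the locstream from the next file.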

@bena-nasa
Collaborator Author

bena-nasa commented Jun 15, 2023

Just redoing the list of priorities for this. I think we should really do these 1 or 2 at a time (followed by a PR) as is logical rather than waiting for one massive PR. I think the logical grouping is:
1,2 then PR, 3,4 then PR, 5,6 then PR, then 7

Steps 1 and 2 are just getting it working so it can ingest IODA files and use them, nothing more. Steps 3 and 4 are optimizing how to create and distribute the locstream and the IO implications of that change. Steps 5 and 6 are how to optimize/handle regenerating the locstream as needed during the run, changing the IO strategy to accumulate a period before writing at the same time.

So the plan (I will strike out items as they are completed):

  1. Update so that the "coordinate" and the names of the lat/lon/time variables in the file are user configurable. Note that variables may be in a group in the NetCDF file.
  2. Decide how to handle the fact that the data is not time ordered in the file, i.e. sort or don't worry about it. If we don't sort, then more code changes are needed, since I assume the current code expects the times to be ordered.
  3. The current implementation makes the locstream on root only; it is better to distribute it based on the background grid. So we probably need to distribute the locstream based on the attached grid we will be regridding from. The API to do this can be seen here:
    https://earthsystemmodeling.org/docs/nightly/develop/ESMF_refdoc/node5.html#SECTION05094700000000000000
  4. If the locstream is distributed based on the "background" grid, it will have implications for writing, i.e. we will have to do a non-trivial gather, with potential reordering if desired, before writing back to disk. One way to do this would be via an ESMF_Redist, from the grid-distributed locstream to one distributed on root.
  5. Right now the sampler writes and appends at every time step. This will be too inefficient in a real model. We will need to implement the same strategy as the swath sampler: define an accumulation period, aka "epoch", create the locstream for that period, accumulate, then write at the end of the accumulation period. The accumulation period should be user configurable, with the minimum period being a single application time step.
  6. Allow the locstream to be created from multiple files if the accumulation period or application run time is not covered by a single file (i.e. provide a template and frequency for the files). For example, if you have IODA files that each contain a 6 hour window and do not want to stop execution, this is mandatory.
  7. Allow multiple input files to be concatenated into a single location stream for processing during the run. Presumably they would cover the same time window; if not, it doesn't make sense, but I believe this would be the case for the use case we have in mind. In other words, while there may be several files, the sampler will see one locstream. Note that at output, the results need to be split back out into the original files.
  8. The code as originally written outputs profiles. These IODA files may have a height or pressure, i.e. longitude(location), latitude(location) and possibly pressure(location) or height(location). So for 3D variables from the model, one could sample to the height as well rather than output the full column. This should be straightforward.
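For item 7, the bookkeeping needed to present several files as one locstream and then split the results back out can be sketched with a list of offsets. This is only an illustration with hypothetical function names; in the real sampler the per-file data would come from the input files and the split pieces would go to per-file output.

```python
def concatenate_streams(per_file_values):
    """Concatenate per-file location lists; return (combined, offsets)."""
    combined, offsets = [], [0]
    for values in per_file_values:
        combined.extend(values)
        offsets.append(len(combined))   # offsets[i]:offsets[i+1] is file i's slice
    return combined, offsets


def split_back(combined_values, offsets):
    """Split values defined on the combined stream back into per-file pieces."""
    return [combined_values[start:stop]
            for start, stop in zip(offsets[:-1], offsets[1:])]


lons_by_file = [[10.0, 20.0], [30.0], [40.0, 50.0, 60.0]]
all_lons, offsets = concatenate_streams(lons_by_file)
sampled = [lon + 0.5 for lon in all_lons]   # stand-in for values sampled on the one locstream
sampled_by_file = split_back(sampled, offsets)
```

If the combined stream is also sorted by time, the sort permutation would have to be undone before splitting, since the offsets refer to the unsorted, concatenated order.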

@bena-nasa
Collaborator Author

@metdyn
I put a code demonstration of the algorithm from steps 3 to 5 here on discover:
/home/bmauer/for_users/for_yonggang/epoch_accumulation_demo.F90
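The Fortran demo at that path is not reproduced here. Separately from it, and covering only the accumulation part (step 5), the "epoch" idea can be sketched in Python as a buffer that is flushed once every fixed number of steps. The class name and the `write` callback are hypothetical.

```python
class EpochAccumulator:
    """Buffer sampled values and flush them once per epoch instead of every step."""

    def __init__(self, steps_per_epoch, write):
        if steps_per_epoch < 1:
            raise ValueError("an epoch must span at least one time step")
        self.steps_per_epoch = steps_per_epoch
        self.write = write            # callable that receives the buffered records
        self.buffer = []
        self.steps_in_epoch = 0

    def step(self, records):
        """Add the records sampled during one model step; write when the epoch ends."""
        self.buffer.extend(records)
        self.steps_in_epoch += 1
        if self.steps_in_epoch == self.steps_per_epoch:
            self.flush()

    def flush(self):
        """Write whatever is buffered (call at end of run to handle a partial epoch)."""
        if self.buffer:
            self.write(self.buffer)
        self.buffer = []
        self.steps_in_epoch = 0


writes = []
acc = EpochAccumulator(steps_per_epoch=3, write=lambda recs: writes.append(list(recs)))
for step in range(7):
    acc.step([("obs", step)])
acc.flush()                            # final partial epoch
```

In this run, seven steps produce three writes instead of seven. Setting `steps_per_epoch=1` recovers the current write-every-step behavior.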

@stale

stale bot commented Sep 17, 2023

This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days, it will be closed. You can add the "long term" tag to prevent the Stale bot from closing this issue.

@stale stale bot added the ❄️ Stale This issue has been marked stale label Sep 17, 2023
@mathomp4
Member

I believe @metdyn is working on this (see #2353) so we'll un-stale it.

@mathomp4 mathomp4 removed the ❄️ Stale This issue has been marked stale label Sep 18, 2023

stale bot commented Nov 18, 2023

This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days, it will be closed. You can add the "long term" tag to prevent the Stale bot from closing this issue.

@stale stale bot added the ❄️ Stale This issue has been marked stale label Nov 18, 2023
@mathomp4 mathomp4 added the ⌛ Long Term Long term issues label Nov 19, 2023
@stale stale bot removed the ❄️ Stale This issue has been marked stale label Nov 19, 2023
4 participants