
Add CESM2-LE pipeline #53

Closed
wants to merge 5 commits into from

Conversation

@mgrover1 commented Jun 24, 2021

Closes #51

I added a couple files which @cisaacstern worked through this morning. This is preliminary for now, and this can only be run within the GLADE filesystem at NCAR since the data are there, but I am hoping this will at least provide an example!


@cisaacstern
Member

Awesome, thanks @mgrover1! Recapping here for clarity that our plan was to "skip" caching, because you already have access to all of the source files on GLADE. To implement this, we initially instantiated your source file directory as a CacheFSSpecTarget object. This raised the issue that your source filenames do not include the prefix added by pangeo-forge-recipes.storage here.

In e377b76, I changed your source file target to an instance of FSSpecTarget. (Btw, I pushed this commit directly to your PR branch 🧙 🎉 .) FSSpecTarget is the parent class of CacheFSSpecTarget, and it does not add the (in this case problematic) prefix to file paths (as seen here, fwiw).

So I'm curious, if you execute your recipe from this updated execution notebook, do you still get a FileNotFoundError when you call recipe.prepare_target()?
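(To illustrate the prefixing issue in isolation: the sketch below is a pure-Python mimic of what a caching target does, not the actual pangeo-forge-recipes code. A cache typically derives its storage key from a hash of the source URL, so a lookup by the original filename fails.)

```python
import hashlib
import os

def cached_name(url: str) -> str:
    # Hypothetical mimic of a caching target: prefix the basename
    # with a hash of the full source URL before storing it.
    prefix = hashlib.md5(url.encode()).hexdigest()[:8]
    return f"{prefix}-{os.path.basename(url)}"

src = "/glade/some/source/file.nc"
# The stored name differs from the original, so looking the file up
# by its original (unprefixed) name raises FileNotFoundError.
print(cached_name(src) == os.path.basename(src))  # False
```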

@mgrover1
Author

Thanks @cisaacstern! Now I am running into this

AttributeError: 'FSSpecTarget' object has no attribute 'getitems'

when running recipe.prepare_target()

@cisaacstern
Member

Progress! (I hope 😄)

Now I am running into this

AttributeError: 'FSSpecTarget' object has no attribute 'getitems'

Can you provide a full Traceback?

@mgrover1
Author

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-14-bbcc9bf6cc50> in <module>
----> 1 recipe.prepare_target()

~/miniconda3/envs/cesm2-marbl/lib/python3.7/site-packages/pangeo_forge_recipes/recipes/xarray_zarr.py in prepare_target(self)
    264         # Regardless of whether there is an existing dataset or we are creating a new one,
    265         # we need to expand the concat_dim to hold the entire expected size of the data
--> 266         input_sequence_lens = self.calculate_sequence_lens()
    267         n_sequence = sum(input_sequence_lens)
    268         logger.info(f"Expanding target concat dim '{self._concat_dim}' to size {n_sequence}")

~/miniconda3/envs/cesm2-marbl/lib/python3.7/site-packages/pangeo_forge_recipes/recipes/xarray_zarr.py in calculate_sequence_lens(self)
    476         # get the sequence length of every file
    477         # this line could become problematic for large (> 10_000) lists of files
--> 478         input_meta = self.get_input_meta(*self._inputs_chunks)
    479         # use a numpy array to allow reshaping
    480         all_lens = np.array([m["dims"][self._concat_dim] for m in input_meta.values()])

~/miniconda3/envs/cesm2-marbl/lib/python3.7/site-packages/pangeo_forge_recipes/recipes/xarray_zarr.py in get_input_meta(self, *input_keys)
    462         if self.metadata_cache is None:
    463             raise ValueError("metadata_cache is not set.")
--> 464         return self.metadata_cache.getitems([_input_metadata_fname(k) for k in input_keys])
    465 
    466     def input_position(self, input_key):

AttributeError: 'FSSpecTarget' object has no attribute 'getitems'

@TomAugspurger
Contributor

I think you want pangeo_forge_recipes.storage.MetadataTarget. I hit / am fixing this in the tutorials in pangeo-forge/pangeo-forge-recipes#160.

@mgrover1
Author

mgrover1 commented Jun 24, 2021

Adding this in instead

import tempfile
from fsspec.implementations.local import LocalFileSystem
from pangeo_forge_recipes.storage import FSSpecTarget, MetadataTarget

fs_local = LocalFileSystem()

# `direct` is the directory on GLADE that already contains the source files,
# so the input "cache" points straight at the existing data (no copying).
cache_target = FSSpecTarget(fs_local, direct)

target_dir = tempfile.TemporaryDirectory()
target = FSSpecTarget(fs_local, target_dir.name)

meta_dir = tempfile.TemporaryDirectory()
meta_store = MetadataTarget(fs_local, meta_dir.name)

recipe.input_cache = cache_target
recipe.target = target
recipe.metadata_cache = meta_store

cache_target.root_path, target.root_path, meta_store.root_path

results in

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
~/miniconda3/envs/cesm2-marbl/lib/python3.7/site-packages/fsspec/mapping.py in getitems(self, keys, on_error)
     89         try:
---> 90             out = self.fs.cat(keys2, on_error=oe)
     91         except self.missing_exceptions as e:

~/miniconda3/envs/cesm2-marbl/lib/python3.7/site-packages/fsspec/spec.py in cat(self, path, recursive, on_error, **kwargs)
    718                 try:
--> 719                     out[path] = self.cat_file(path, **kwargs)
    720                 except Exception as e:

~/miniconda3/envs/cesm2-marbl/lib/python3.7/site-packages/fsspec/spec.py in cat_file(self, path, start, end, **kwargs)
    658         # explicitly set buffering off?
--> 659         with self.open(path, "rb", **kwargs) as f:
    660             if start is not None:

~/miniconda3/envs/cesm2-marbl/lib/python3.7/site-packages/fsspec/spec.py in open(self, path, mode, block_size, cache_options, **kwargs)
    967                 cache_options=cache_options,
--> 968                 **kwargs,
    969             )

~/miniconda3/envs/cesm2-marbl/lib/python3.7/site-packages/fsspec/implementations/local.py in _open(self, path, mode, block_size, **kwargs)
    131             self.makedirs(self._parent(path), exist_ok=True)
--> 132         return LocalFileOpener(path, mode, fs=self, **kwargs)
    133 

~/miniconda3/envs/cesm2-marbl/lib/python3.7/site-packages/fsspec/implementations/local.py in __init__(self, path, mode, autocommit, fs, **kwargs)
    219         self.blocksize = io.DEFAULT_BUFFER_SIZE
--> 220         self._open()
    221 

~/miniconda3/envs/cesm2-marbl/lib/python3.7/site-packages/fsspec/implementations/local.py in _open(self)
    224             if self.autocommit or "w" not in self.mode:
--> 225                 self.f = open(self.path, mode=self.mode)
    226             else:

FileNotFoundError: [Errno 2] No such file or directory: '/glade/scratch/mgrover/tmpnqia56i9/input-meta-0.json'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input-4-bbcc9bf6cc50> in <module>
----> 1 recipe.prepare_target()

~/miniconda3/envs/cesm2-marbl/lib/python3.7/site-packages/pangeo_forge_recipes/recipes/xarray_zarr.py in prepare_target(self)
    264         # Regardless of whether there is an existing dataset or we are creating a new one,
    265         # we need to expand the concat_dim to hold the entire expected size of the data
--> 266         input_sequence_lens = self.calculate_sequence_lens()
    267         n_sequence = sum(input_sequence_lens)
    268         logger.info(f"Expanding target concat dim '{self._concat_dim}' to size {n_sequence}")

~/miniconda3/envs/cesm2-marbl/lib/python3.7/site-packages/pangeo_forge_recipes/recipes/xarray_zarr.py in calculate_sequence_lens(self)
    476         # get the sequence length of every file
    477         # this line could become problematic for large (> 10_000) lists of files
--> 478         input_meta = self.get_input_meta(*self._inputs_chunks)
    479         # use a numpy array to allow reshaping
    480         all_lens = np.array([m["dims"][self._concat_dim] for m in input_meta.values()])

~/miniconda3/envs/cesm2-marbl/lib/python3.7/site-packages/pangeo_forge_recipes/recipes/xarray_zarr.py in get_input_meta(self, *input_keys)
    462         if self.metadata_cache is None:
    463             raise ValueError("metadata_cache is not set.")
--> 464         return self.metadata_cache.getitems([_input_metadata_fname(k) for k in input_keys])
    465 
    466     def input_position(self, input_key):

~/miniconda3/envs/cesm2-marbl/lib/python3.7/site-packages/pangeo_forge_recipes/storage.py in getitems(self, keys)
    161     def getitems(self, keys: Sequence[str]) -> dict:
    162         mapper = self.get_mapper()
--> 163         all_meta_raw = mapper.getitems(keys)
    164         return {k: json.loads(raw_bytes) for k, raw_bytes in all_meta_raw.items()}
    165 

~/miniconda3/envs/cesm2-marbl/lib/python3.7/site-packages/fsspec/mapping.py in getitems(self, keys, on_error)
     90             out = self.fs.cat(keys2, on_error=oe)
     91         except self.missing_exceptions as e:
---> 92             raise KeyError from e
     93         out = {
     94             k: (KeyError() if isinstance(v, self.missing_exceptions) else v)

KeyError: 

@cisaacstern
Member

I think you want pangeo_forge_recipes.storage.MetadataTarget. I hit / am fixing this in the tutorials in pangeo-forge/pangeo-forge-recipes#160.

Amazing catch. And oops! This wouldn't have come up for Max if I had resolved: pangeo-forge/pangeo-forge-recipes#135 (comment) 🤭

Adding that in instead results in

Yep, that's expected, because there is one other issue here: we haven't actually cached any metadata (since we skipped caching). I am about to push a commit that should address this.

@cisaacstern
Member

@mgrover1, I don't know if it would've worked to cache metadata to a TemporaryDirectory, but just to be safe I wrote e6f62f8 as if you'd made a new directory called '/glade/scratch/mgrover/cesm2-le-metadata' and then instantiated a MetadataTarget with that path.

Then, before preparing the target, I've added:

for input_name in recipe.iter_inputs():
    recipe.cache_input_metadata(input_name)

Can you see where running these changes before the call to recipe.prepare_target() gets us?

@mgrover1
Author

@cisaacstern we are in business 😊😊😊
[Screenshot: notebook output showing the recipe executing successfully — Screen Shot 2021-06-24 at 3.13.16 PM]

@mgrover1
Author

Now the question is:

  • How can I automate this for an entire catalog of output?
  • Would there be a good way to separate out the static grid variables (e.g. hyam, hybi, etc.)?

The first question could probably be solved within make_full_path, but do I really need that if I already have the full path from the intake-esm catalog?

@cisaacstern
Member

cisaacstern commented Jun 24, 2021

d1193c1 adds the store_chunks and finalize_target steps. Without these, you just have the first time step (which is written in prepare_target).

@mgrover1
Author

When running this, I run into the following warning

/glade/u/home/mgrover/miniconda3/envs/cesm2-marbl/lib/python3.7/site-packages/xarray/conventions.py:207: SerializationWarning: variable None has data in the form of a dask array with dtype=object, which means it is being loaded into memory to determine a data type that can be safely stored on disk. To avoid this, coerce this variable to a fixed-size dtype with astype() before saving it.
  SerializationWarning,

Is this something to be concerned about for the larger (1 TB+) zarr stores I plan to produce with this?

@mgrover1
Author

Here is an example of what the make_filename would look like:

def make_filename(component, frequency, variable, experiment, forcing, experiment_number, member_id, stream, time):
    return f"/glade/campaign/cgd/cesm/CESM2-LE/timeseries/{component}/proc/tseries/{frequency}/{variable}/b.e21.{experiment}{forcing}.f09_g17.LE2-{experiment_number}.{member_id}.{stream}.{variable}.{time}.nc"
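A quick sanity check of that function with hypothetical argument values (the values below are illustrative, not taken from the real catalog):

```python
def make_filename(component, frequency, variable, experiment, forcing,
                  experiment_number, member_id, stream, time):
    return (
        f"/glade/campaign/cgd/cesm/CESM2-LE/timeseries/{component}/proc/tseries/"
        f"{frequency}/{variable}/b.e21.{experiment}{forcing}.f09_g17."
        f"LE2-{experiment_number}.{member_id}.{stream}.{variable}.{time}.nc"
    )

# Hypothetical argument values, just to show the shape of the resulting path:
path = make_filename(
    component="atm", frequency="month_1", variable="TREFHT",
    experiment="BHIST", forcing="cmip6", experiment_number="1001",
    member_id="001", stream="cam.h0", time="185001-185912",
)
print(path)
```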

@cisaacstern
Member

When running this, I run into the following warning

/glade/u/home/mgrover/miniconda3/envs/cesm2-marbl/lib/python3.7/site-packages/xarray/conventions.py:207: SerializationWarning: variable None has data in the form of a dask array with dtype=object, which means it is being loaded into memory to determine a data type that can be safely stored on disk. To avoid this, coerce this variable to a fixed-size dtype with astype() before saving it.
  SerializationWarning,

Is this something to be concerned about for the larger (1 TB+) zarr stores I plan to produce with this?

I've never seen this before. Seems like a question for @TomAugspurger or @rabernat.

Would there be a good way to separate out the static grid variables (e.g. hyam, hybi, etc.)?

Are these mirrored across every one of the source files? If so, you may be able to create a separate recipe for them and write them only once. Here it's worth noting that your cesm_le2_recipe.py can instantiate as many recipes as you want, as long as you wrap them all in a dictionary at the bottom of the file. So you could do:

#  ... define recipes above, then ...
recipes = {
    "historical/atm": historical_atm_recipe,  # each dict value is an XarrayZarrRecipe instance
    "ssp370/atm": ssp370_atm_recipe,
    "grid": grid_recipe,
}

Then in the execution notebook:

from cesm_le2_recipe import recipes

for input_name in recipes["historical/atm"].iter_inputs():
    recipes["historical/atm"].cache_input_metadata(input_name)

#   ... etc. ...

How can I automate this for an entire catalog of output? ... [this] may be able to be solved within the make_full_path

Yes! You can add dimensional complexity to your recipe by parameterizing additional components of the path returned from make_full_path. Your mock-up in #53 (comment) is on exactly the right track to achieve this.

Then, each of these parameters (aside from time) becomes its own MergeDim as described here.
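Conceptually, the recipe then takes the Cartesian product of every MergeDim's values with the time (concat) dimension. A pure-Python sketch of that expansion, with hypothetical parameter values:

```python
from itertools import product

# Hypothetical parameter values; only `time` plays the role of the concat dimension.
variables = ["TREFHT", "PRECT"]
member_ids = ["001", "002"]
times = ["185001-185912", "186001-186912"]

# Each (variable, member_id) pair is a MergeDim combination; time is concatenated.
inputs = [
    {"variable": v, "member_id": m, "time": t}
    for v, m, t in product(variables, member_ids, times)
]
print(len(inputs))  # 2 * 2 * 2 = 8 input files
```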

but do I really need this if I have the full path from the intake-esm catalog?

Yep, you do need to parameterize these in the make_full_path function, just as you've already started to do in your comment above.

@cisaacstern
Member

@mgrover1, I note in #53 (comment) that you've given a frequency argument which also appears in the recipe here.

Assuming this refers to temporal resolution (monthly, daily, etc.), then each frequency will presumably need to be its own separate zarr store. Unless I'm missing something (which is possible), anything you define as a MergeDim will need to be the same length in the time dimension.

@mgrover1
Author

Yes - the zarr stores will be separated by component/frequency/cesm2-le.experiment.forcing.variable.zarr
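As a sketch, that naming scheme could be generated with a small helper (function name and argument values are hypothetical):

```python
def zarr_store_path(component, frequency, experiment, forcing, variable):
    # Mirrors the proposed layout:
    # component/frequency/cesm2-le.experiment.forcing.variable.zarr
    return f"{component}/{frequency}/cesm2-le.{experiment}.{forcing}.{variable}.zarr"

print(zarr_store_path("atm", "month_1", "historical", "cmip6", "TREFHT"))
# atm/month_1/cesm2-le.historical.cmip6.TREFHT.zarr
```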

@cisaacstern
Member

How's this going, @mgrover1? Anything we can troubleshoot or is everything working as desired?

@rabernat
Contributor

rabernat commented Mar 3, 2022

Just pinging this PR. Is this recipe still viable? Could we run it in our bakery?

@mgrover1 mgrover1 closed this by deleting the head repository Mar 13, 2023
Successfully merging this pull request may close these issues.

Example pipeline for CESM2-LE