
WIP "Recipes" (pangeo-forge#483)
derekocallaghan committed Mar 3, 2023
1 parent 08364b5 commit 47e46f8
Showing 1 changed file with 39 additions and 24 deletions.
63 changes: 39 additions & 24 deletions in docs/pangeo_forge_recipes/recipe_user_guide/recipes.md
Recipe authors (i.e. data users or data managers) can either execute their recipes
on their own computers and infrastructure, in private, or make a {doc}`../../pangeo_forge_cloud/recipe_contribution`
to {doc}`../../pangeo_forge_cloud/index`, which allows the recipe to be executed automatically via [Bakeries](../../pangeo_forge_cloud/core_concepts.md).

## Recipe Pipelines

A recipe is defined as a pipeline of [Apache Beam transforms](https://beam.apache.org/documentation/programming-guide/#transforms) applied to the data collection associated with a {doc}`file pattern <file_patterns>`. Specifically, each recipe pipeline contains a set of transforms that operate on an `apache_beam.PCollection`, applying the specified transformation to map input elements to output elements one-to-one (typically via `apache_beam.Map`).

To write a recipe, you define a pipeline that uses existing transforms, in combination with new transforms if required for custom processing of the input data collection.
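
For instance, here is a minimal sketch of a custom processing step. The `add_provenance` function and the attribute it sets are hypothetical, and the element structure is assumed to be the `(index, dataset)` pairs produced by the opener transforms described below:

```{code-block} python
import apache_beam as beam

def add_provenance(element):
    # Assumed element structure: an (index, dataset) pair, as produced by the
    # opener transforms described below; return a copy with an extra attribute.
    index, ds = element
    return index, ds.assign_attrs(provenance="postprocessed by a custom step")

# Wrap the one-to-one mapping as a transform that can be composed with the
# built-in transforms, e.g. ... | OpenWithXarray(...) | custom_step | StoreToZarr(...)
custom_step = beam.Map(add_provenance)
```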

Right now, there are two categories of recipe pipelines, each based on a specific data model for the input files and the target dataset format.
In the future, we may add more.

```{note}
The full API Reference documentation for the existing recipe `PTransform` implementations can be found at
{doc}`../api_reference`.
```

### Xarray to Zarr Recipes


This recipe category uses
[Xarray](http://xarray.pydata.org/) to read the input files and
[Zarr](https://zarr.readthedocs.io/) as the target dataset format.
The inputs can be in any [file format Xarray can read](http://xarray.pydata.org/en/latest/user-guide/io.html),
Expand All @@ -48,7 +49,7 @@ The target Zarr dataset will conform to the
[Xarray Zarr encoding conventions](http://xarray.pydata.org/en/latest/internals/zarr-encoding-spec.html).

The best way to really understand how recipes work is to go through the relevant
tutorials for this recipe category. These are, in order of increasing complexity:

- {doc}`../tutorials/xarray_zarr/netcdf_zarr_sequential`
- {doc}`../tutorials/xarray_zarr/cmip6-recipe`
Below we give a very basic overview of how this recipe is used.

First you must define a {doc}`file pattern <file_patterns>`.

Once you have a {class}`FilePattern <pangeo_forge_recipes.patterns.FilePattern>` object,
the recipe pipeline will contain at a minimum the following transforms applied to the file pattern collection:
* `OpenURLWithFSSpec`: retrieves each pattern file using the specified URLs.
* `OpenWithXarray`: loads each pattern file into an `xarray.Dataset`:
  * The `file_type` is specified from the pattern.
* `StoreToZarr`: generates a Zarr store by combining the datasets:
  * `store_name` specifies the name of the generated Zarr store.
  * `target_root` specifies where the output will be stored; in this example, a temporary directory.
  * `combine_dims` informs the transform of the dimension used to combine the datasets. Here we use the dimension specified in the file pattern (`time`).
  * `target_chunks` specifies a dictionary of the required chunk size for each dimension. Any dimension not included here defaults to its full shape.

For example:
```{code-block} python
import apache_beam as beam
from pangeo_forge_recipes.transforms import OpenURLWithFSSpec, OpenWithXarray, StoreToZarr

transforms = (
    # Create a collection of (index, url) elements from the file pattern
    beam.Create(pattern.items())
    # Retrieve each file from its URL with fsspec
    | OpenURLWithFSSpec()
    # Load each opened file into an xarray.Dataset
    | OpenWithXarray(file_type=pattern.file_type)
    # Combine the datasets along the pattern's combine dimension and write a Zarr store
    | StoreToZarr(
        store_name=store_name,
        target_root=target_root,
        combine_dims=pattern.combine_dim_keys,
        target_chunks={"time": 10},
    )
)
```

The available transform options are all covered in the {doc}`../api_reference`. Many of these options are explored further in the {doc}`../tutorials/index`.

All recipes need a place to store the target dataset. Refer to {doc}`storage` for how to assign this and any other required storage targets.
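
As an illustration, here is a minimal sketch of assigning a local target; it assumes `FSSpecTarget` from `pangeo_forge_recipes.storage`, a local `fsspec` filesystem, and a placeholder output path:

```{code-block} python
import fsspec
from pangeo_forge_recipes.storage import FSSpecTarget

# A local filesystem target; any fsspec-compatible filesystem could be used instead.
fs = fsspec.filesystem("file")
target_root = FSSpecTarget(fs, root_path="/tmp/my-recipe-output")
```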

Once your recipe is defined and has its storage targets assigned, you're ready to
move on to {doc}`execution`.
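
As a rough sketch, a pipeline like the `transforms` defined above could be run locally with Beam's default runner:

```{code-block} python
import apache_beam as beam

# Run the recipe pipeline with Beam's default local runner; other runners
# (e.g. Dataflow, Flink) are selected via PipelineOptions.
with beam.Pipeline() as p:
    p | transforms
```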

### HDF Reference Recipes

Like the Xarray to Zarr recipes, this category allows us to more efficiently access data from a bunch of NetCDF / HDF files.
However, such a recipe does not actually copy the original source data.
Instead, it generates metadata files which reference and index the original data, allowing it to be accessed more quickly and easily.
For more background, see [this blog post](https://medium.com/pangeo/fake-it-until-you-make-it-reading-goes-netcdf4-data-on-aws-s3-as-zarr-for-rapid-data-access-61e33f8fe685).
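
For illustration, here is a sketch of how such generated references might be opened; the `reference.json` filename is hypothetical, and this relies on `fsspec`'s reference filesystem and Xarray's Zarr engine:

```{code-block} python
import xarray as xr

# Open the referenced data lazily as if it were a Zarr store; the original
# NetCDF/HDF files are read through the reference index.
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "storage_options": {"fo": "reference.json"},
        "consolidated": False,
    },
)
```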

