Port Recipes User Guide to beam-refactor (#499)
* WIP "Recipes" (#483)

* Full draft of Beam port of "Recipes User Guide" (#483)

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Updated following review comments, and some minor additional edits to Recipes (#483)

* Updated following review comments, and restored link to OPeNDAP notebook (#483)

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
derekocallaghan and pre-commit-ci[bot] authored Mar 14, 2023
1 parent 067e421 commit 661d13b
Showing 5 changed files with 89 additions and 81 deletions.
2 changes: 1 addition & 1 deletion docs/index.md
@@ -16,7 +16,7 @@ There are a number of resources available when working with Pangeo Forge:
- **Introduction Tutorial**: {doc}`introduction_tutorial/index` - Walks you through creating, running, and staging your first Recipe.
- **User Guides** explain core Pangeo Forge concepts in detail. They provide
background information to aid in gaining a depth of understanding:
- {doc}`pangeo_forge_recipes/recipe_user_guide/index` - For learning about how to create Recipes. A recipe is defined as a [pipeline](https://beam.apache.org/documentation/programming-guide/#creating-a-pipeline) of [Apache Beam](https://beam.apache.org/) [transforms](https://beam.apache.org/documentation/programming-guide/#transforms) applied to a data collection, each transform mapping input elements to output elements.
- {doc}`pangeo_forge_cloud/recipe_contribution` - For learning how to contribute recipes to Pangeo Forge Cloud.
- {doc}`pangeo_forge_recipes/development/development_guide` - For developers seeking to contribute to Pangeo Forge core functionality.
- **Advanced Examples** walk through examples of using Pangeo Forge Recipes:
34 changes: 9 additions & 25 deletions docs/pangeo_forge_recipes/recipe_user_guide/execution.md
@@ -11,39 +11,23 @@ recipe has already been initialized in the variable `recipe`.

## Recipe Executors

We currently support the following execution mechanism.

### Beam PTransform

A recipe is defined as a [pipeline](https://beam.apache.org/documentation/programming-guide/#creating-a-pipeline) of [Apache Beam transforms](https://beam.apache.org/documentation/programming-guide/#transforms) applied to the data collection associated with a {doc}`file pattern <file_patterns>`. Specifically, each recipe pipeline contains a set of transforms that operate on an `apache_beam.PCollection`, transforming input elements into output elements. Having created a transforms pipeline (see {doc}`recipes`), you can execute it with Beam as follows:

```{code-block} python
import apache_beam as beam
with beam.Pipeline() as p:
    p | transforms
```

By default the pipeline runs using Beam's [DirectRunner](https://beam.apache.org/documentation/runners/direct/), which is useful during recipe development. However, alternative Beam runners are available, for example:
* [FlinkRunner](https://beam.apache.org/documentation/runners/flink/): executes pipelines using [Apache Flink](https://flink.apache.org/).
* [DataflowRunner](https://beam.apache.org/documentation/runners/dataflow/): executes pipelines on the [Google Cloud Dataflow managed service](https://cloud.google.com/dataflow/service/dataflow-service-desc).
* [DaskRunner](https://beam.apache.org/releases/pydoc/current/apache_beam.runners.dask.dask_runner.html): executes pipelines via [Dask.distributed](https://distributed.dask.org/en/stable/).

See [here](https://beam.apache.org/documentation/#runners) for details of the available Beam runners.
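
To select a different runner, pass pipeline options when constructing the pipeline. The following is a minimal sketch; the `FlinkRunner` choice and the `flink_master` address are illustrative assumptions, not values prescribed by this guide:

```{code-block} python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Illustrative: execute the transforms on a Flink cluster at localhost:8081
# instead of the default DirectRunner.
options = PipelineOptions(runner="FlinkRunner", flink_master="localhost:8081")
with beam.Pipeline(options=options) as p:
    p | transforms
```
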
63 changes: 39 additions & 24 deletions docs/pangeo_forge_recipes/recipe_user_guide/recipes.md
@@ -19,23 +19,24 @@ Recipe authors (i.e. data users or data managers) can either execute their recipes
on their own computers and infrastructure, in private, or make a {doc}`../../pangeo_forge_cloud/recipe_contribution`
to {doc}`../../pangeo_forge_cloud/index`, which allows the recipe to be automatically executed via [Bakeries](../../pangeo_forge_cloud/core_concepts.md).

## Recipe Pipelines

A recipe is defined as a [pipeline](https://beam.apache.org/documentation/programming-guide/#creating-a-pipeline) of [Apache Beam transforms](https://beam.apache.org/documentation/programming-guide/#transforms) applied to the data collection associated with a {doc}`file pattern <file_patterns>`. Specifically, each recipe pipeline contains a set of transforms that operate on an [`apache_beam.PCollection`](https://beam.apache.org/documentation/programming-guide/#pcollections), mapping input elements to output elements (for example, using [`apache_beam.Map`](https://beam.apache.org/documentation/transforms/python/elementwise/map/)).
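
As a minimal, self-contained sketch of this model (plain Beam only, no Pangeo Forge transforms involved):

```{code-block} python
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | beam.Create([1, 2, 3])     # a PCollection of three elements
        | beam.Map(lambda x: x * 2)  # element-wise transformation
        | beam.Map(print)            # 2, 4, 6 (order not guaranteed)
    )
```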

To write a recipe, you define a pipeline that uses existing transforms, in combination with new transforms if required for custom processing of the input data collection.

Right now, there are two categories of recipe pipelines, each based on a specific data model for the input files and target dataset format.
In the future, we may add more.

```{note}
The full API Reference documentation for the existing recipe `PTransform` implementations ({mod}`pangeo_forge_recipes.transforms`) can be found in
{doc}`../api_reference`.
```

### Xarray to Zarr Recipes

This recipe category uses
[Xarray](http://xarray.pydata.org/) to read the input files and
[Zarr](https://zarr.readthedocs.io/) as the target dataset format.
The inputs can be in any [file format Xarray can read](http://xarray.pydata.org/en/latest/user-guide/io.html),
@@ -48,7 +49,7 @@ The target Zarr dataset will conform to the
[Xarray Zarr encoding conventions](http://xarray.pydata.org/en/latest/internals/zarr-encoding-spec.html).
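
For intuition, the underlying conversion resembles the following sketch (file names are assumed for illustration; the recipe transforms perform this at scale, combining many inputs):

```{code-block} python
import xarray as xr

# Read one input file with Xarray, then write it out as a Zarr store that
# follows Xarray's Zarr encoding conventions.
ds = xr.open_dataset("input-0.nc")
ds.to_zarr("target.zarr", mode="w")
```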

The best way to really understand how recipes work is to go through the relevant
tutorials for this recipe category. These are, in order of increasing complexity:

- {doc}`../tutorials/xarray_zarr/netcdf_zarr_sequential`
- {doc}`../tutorials/xarray_zarr/cmip6-recipe`
@@ -59,29 +60,43 @@ tutorials for this recipe class. These are, in order of increasing complexity
Below we give a very basic overview of how this recipe is used.

First you must define a {doc}`file pattern <file_patterns>`.

Once you have a {class}`FilePattern <pangeo_forge_recipes.patterns.FilePattern>` object,
the recipe pipeline will contain at a minimum the following transforms applied to the file pattern collection:
* {class}`pangeo_forge_recipes.transforms.OpenURLWithFSSpec`: retrieves each pattern file using the specified URLs.
* {class}`pangeo_forge_recipes.transforms.OpenWithXarray`: loads each pattern file into an [`xarray.Dataset`](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html):
  * The `file_type` is specified from the pattern.
* {class}`pangeo_forge_recipes.transforms.StoreToZarr`: generates a Zarr store by combining the datasets:
  * `store_name` specifies the name of the generated Zarr store.
  * `target_root` specifies where the output will be stored, for example, a temporary directory during development.
  * `combine_dims` informs the transform of the dimension(s) used to combine the datasets. Here we use the dimension specified in the file pattern (`time`).
  * `target_chunks` specifies a dictionary of the desired chunk size for each dimension; any dimension not included defaults to its full shape.

For example:
```{code-block} python
import apache_beam as beam

from pangeo_forge_recipes.transforms import OpenURLWithFSSpec, OpenWithXarray, StoreToZarr

transforms = (
    beam.Create(pattern.items())
    | OpenURLWithFSSpec()
    | OpenWithXarray(file_type=pattern.file_type)
    | StoreToZarr(
        store_name=store_name,
        target_root=target_root,
        combine_dims=pattern.combine_dim_keys,
        target_chunks={"time": 10},
    )
)
```

The available transform options are all covered in the {doc}`../api_reference`. Many of these options are explored further in the {doc}`../tutorials/index`.

All recipes need a place to store the target dataset. Refer to {doc}`storage` for how to assign this and any other required storage targets.

Once your recipe is defined and has its storage targets assigned, you're ready to
move on to {doc}`execution`.

### HDF Reference Recipes

Like the Xarray to Zarr recipes, this category allows us to more efficiently access data from a collection of NetCDF / HDF files.
However, such a recipe does not actually copy the original source data.
Instead, it generates metadata files which reference and index the original data, allowing it to be accessed more quickly and easily.
For more background, see [this blog post](https://medium.com/pangeo/fake-it-until-you-make-it-reading-goes-netcdf4-data-on-aws-s3-as-zarr-for-rapid-data-access-61e33f8fe685).
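
For intuition, a dataset indexed by a generated reference file can be opened lazily via fsspec's `reference://` protocol; the sketch below assumes a local `reference.json` describing data stored on S3:

```{code-block} python
import xarray as xr

# No source bytes are copied; the reference file maps Zarr keys onto
# byte ranges inside the original NetCDF/HDF files.
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "storage_options": {"fo": "reference.json", "remote_protocol": "s3"},
        "consolidated": False,
    },
)
```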

68 changes: 39 additions & 29 deletions docs/pangeo_forge_recipes/recipe_user_guide/storage.md
@@ -1,27 +1,32 @@
# Storage

Recipes need a place to store data. This information is provided to the recipe via the transforms in its pipeline, where the storage configuration may include a *cache* location, for storing retrieved source data products, and a *target* location, for storing the recipe output.
This is illustrated below using two transforms typically used in {doc}`recipes`.

```{eval-rst}
.. autoclass:: pangeo_forge_recipes.transforms.OpenURLWithFSSpec
   :noindex:
```
```{eval-rst}
.. autoclass:: pangeo_forge_recipes.transforms.StoreToZarr
   :noindex:
```


## Default storage

When you create a new recipe, it is common to specify storage locations pointing at a local [`tempfile.TemporaryDirectory`](https://docs.python.org/3/library/tempfile.html#tempfile.TemporaryDirectory).
This allows you to write data to temporary local storage during the recipe development and debugging process.
This means that any recipe can immediately be executed with minimal configuration.
However, in a realistic "production" scenario, a separate location will be used. In all cases, the storage locations are customized using the corresponding transform parameters.
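
For example, a throwaway local target suitable for development can be as simple as the following sketch (names are illustrative):

```{code-block} python
import tempfile

# The directory is cleaned up when this object is garbage-collected,
# which is fine for development and debugging runs.
tmp_dir = tempfile.TemporaryDirectory()
target_root = tmp_dir.name  # pass as `target_root` to StoreToZarr
```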

## Customizing *target* storage: `StoreToZarr`

The minimal requirement for instantiating `StoreToZarr` is a location in which to store the final dataset produced by the recipe. This is achieved with the following parameters:

* `store_name` specifies the name of the generated Zarr store.
* `target_root` specifies where the output will be stored. For example, a temporary directory created during local development.

Although `target_root` may be a `str` pointing to a location, it also accepts a special class provided by Pangeo Forge for this: {class}`pangeo_forge_recipes.storage.FSSpecTarget`. Creating an ``FSSpecTarget`` requires two arguments:
- The ``fs`` argument is an [fsspec](https://filesystem-spec.readthedocs.io/en/latest/)
filesystem. Fsspec supports many different types of storage via its
[built in](https://filesystem-spec.readthedocs.io/en/latest/api.html#built-in-implementations)
@@ -35,38 +40,43 @@ import s3fs
from pangeo_forge_recipes.storage import FSSpecTarget
fs = s3fs.S3FileSystem(key="MY_AWS_KEY", secret="MY_AWS_SECRET")
target_root = FSSpecTarget(fs=fs, root_path="pangeo-forge-bucket")
```

This target can then be assigned to a recipe as follows (see also {doc}`recipes`):
```{code-block} python
import apache_beam as beam

from pangeo_forge_recipes.transforms import OpenURLWithFSSpec, OpenWithXarray, StoreToZarr

transforms = (
    beam.Create(pattern.items())
    | OpenURLWithFSSpec()
    | OpenWithXarray(file_type=pattern.file_type)
    | StoreToZarr(
        store_name="my-dataset-v1.zarr",
        target_root=target_root,
        combine_dims=pattern.combine_dim_keys,
        target_chunks={"time": 10},
    )
)
```


## Customizing storage continued: caching with `OpenURLWithFSSpec`

Oftentimes it is useful to cache input files, rather than read them directly from the data provider. Input files can be cached at a location defined by a {class}`pangeo_forge_recipes.storage.CacheFSSpecTarget` object. For example, extending the previous recipe pipeline:

```{code-block} python
import apache_beam as beam

from pangeo_forge_recipes.storage import CacheFSSpecTarget
from pangeo_forge_recipes.transforms import OpenURLWithFSSpec, OpenWithXarray, StoreToZarr

# define your fsspec filesystem for the cache location here
cache = CacheFSSpecTarget(fs=<fsspec-filesystem-for-cache>, root_path="<path-for-cache>")

transforms = (
    beam.Create(pattern.items())
    | OpenURLWithFSSpec(cache=cache)
    | OpenWithXarray(file_type=pattern.file_type)
    | StoreToZarr(
        store_name="my-dataset-v1.zarr",
        target_root=target_root,
        combine_dims=pattern.combine_dim_keys,
        target_chunks={"time": 10},
    )
)
```
3 changes: 1 addition & 2 deletions docs/pangeo_forge_recipes/tutorials/index.md
@@ -9,7 +9,6 @@ xarray_zarr/netcdf_zarr_sequential
xarray_zarr/cmip6-recipe
xarray_zarr/multi_variable_recipe
xarray_zarr/terraclimate
hdf_reference/reference_cmip6
```

[//]: # (TODO - Restore this in previous toctree if/when XarrayZarrRecipe.subset_inputs is supported on Beam, https://github.com/pangeo-forge/pangeo-forge-recipes/issues/496: xarray_zarr/opendap_subset_recipe)
