Port Recipes User Guide to beam-refactor (#499)
* WIP "Recipes" (#483)

* Full draft of Beam port of "Recipes User Guide" (#483)

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Updated following review comments, and some minor additional edits to Recipes (#483)

* Updated following review comments, and restored link to OPeNDAP notebook (#483)

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
derekocallaghan and pre-commit-ci[bot] authored Mar 14, 2023
1 parent 067e421 commit 661d13b
Showing 5 changed files with 89 additions and 81 deletions.
2 changes: 1 addition & 1 deletion docs/index.md
@@ -16,7 +16,7 @@ There are a number of resources available when working with Pangeo Forge:
- **Introduction Tutorial**: {doc}`introduction_tutorial/index` - Walks you through creating, running, and staging your first Recipe.
- **User Guides** explain core Pangeo Forge concepts in detail. They provide
background information to aid in gaining a depth of understanding:
- {doc}`pangeo_forge_recipes/recipe_user_guide/index` - For learning about how to create Recipes. A recipe is defined as a [pipeline](https://beam.apache.org/documentation/programming-guide/#creating-a-pipeline) of [Apache Beam](https://beam.apache.org/) [transforms](https://beam.apache.org/documentation/programming-guide/#transforms) applied to a data collection, each transform mapping input elements to output elements.
- {doc}`pangeo_forge_cloud/recipe_contribution` - For learning how to contribute recipes to Pangeo Forge Cloud.
- {doc}`pangeo_forge_recipes/development/development_guide` - For developers seeking to contribute to Pangeo Forge core functionality.
- **Advanced Examples** walk through examples of using Pangeo Forge Recipes:
34 changes: 9 additions & 25 deletions docs/pangeo_forge_recipes/recipe_user_guide/execution.md
@@ -11,39 +11,23 @@ recipe has already been initialized in the variable `recipe`.

## Recipe Executors

We currently support the following execution mechanism.

### Beam PTransform

A recipe is defined as a [pipeline](https://beam.apache.org/documentation/programming-guide/#creating-a-pipeline) of [Apache Beam transforms](https://beam.apache.org/documentation/programming-guide/#transforms) applied to the data collection associated with a {doc}`file pattern <file_patterns>`. Specifically, each recipe pipeline contains a set of transforms that operate on an `apache_beam.PCollection`, transforming input elements into output elements. Having created a transforms pipeline (see {doc}`recipes`), you can execute it with Beam as follows:

```{code-block} python
import apache_beam as beam
with beam.Pipeline() as p:
    p | transforms
```

By default the pipeline runs using Beam's [DirectRunner](https://beam.apache.org/documentation/runners/direct/), which is useful during recipe development. However, alternative Beam runners are available, for example:
* [FlinkRunner](https://beam.apache.org/documentation/runners/flink/): executes pipelines using [Apache Flink](https://flink.apache.org/).
* [DataflowRunner](https://beam.apache.org/documentation/runners/dataflow/): executes pipelines on the [Google Cloud Dataflow managed service](https://cloud.google.com/dataflow/service/dataflow-service-desc).
* [DaskRunner](https://beam.apache.org/releases/pydoc/current/apache_beam.runners.dask.dask_runner.html): executes pipelines via [Dask.distributed](https://distributed.dask.org/en/stable/).

See [here](https://beam.apache.org/documentation/#runners) for details of the available Beam runners.
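
To select a different runner, pass pipeline options when constructing the pipeline. The following is a minimal sketch; the `FlinkRunner` choice and the `flink_master` address are illustrative assumptions, not values prescribed by this guide:

```{code-block} python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Illustrative: execute the transforms on a Flink cluster at localhost:8081
# instead of the default DirectRunner.
options = PipelineOptions(runner="FlinkRunner", flink_master="localhost:8081")
with beam.Pipeline(options=options) as p:
    p | transforms
```
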
63 changes: 39 additions & 24 deletions docs/pangeo_forge_recipes/recipe_user_guide/recipes.md
@@ -19,23 +19,24 @@ Recipe authors (i.e. data users or data managers) can either execute their recipes
on their own computers and infrastructure, in private, or make a {doc}`../../pangeo_forge_cloud/recipe_contribution`
to {doc}`../../pangeo_forge_cloud/index`, which allows the recipe to be automatically executed via [Bakeries](../../pangeo_forge_cloud/core_concepts.md).

## Recipe Pipelines

A recipe is defined as a [pipeline](https://beam.apache.org/documentation/programming-guide/#creating-a-pipeline) of [Apache Beam transforms](https://beam.apache.org/documentation/programming-guide/#transforms) applied to the data collection associated with a {doc}`file pattern <file_patterns>`. Specifically, each recipe pipeline contains a set of transforms that operate on an [`apache_beam.PCollection`](https://beam.apache.org/documentation/programming-guide/#pcollections), mapping input elements to output elements (for example, using [`apache_beam.Map`](https://beam.apache.org/documentation/transforms/python/elementwise/map/)).
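
As a minimal, self-contained sketch of this model (plain Beam only, no Pangeo Forge transforms involved):

```{code-block} python
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | beam.Create([1, 2, 3])     # a PCollection of three elements
        | beam.Map(lambda x: x * 2)  # element-wise transformation
        | beam.Map(print)            # 2, 4, 6 (order not guaranteed)
    )
```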

To write a recipe, you define a pipeline that uses existing transforms, in combination with new transforms if required for custom processing of the input data collection.

Right now, there are two categories of recipe pipelines, each based on a specific data model for the input files and target dataset format.
In the future, we may add more.

```{note}
The full API Reference documentation for the existing recipe `PTransform` implementations ({mod}`pangeo_forge_recipes.transforms`) can be found in
{doc}`../api_reference`.
```

### Xarray to Zarr Recipes

This recipe category uses
[Xarray](http://xarray.pydata.org/) to read the input files and
[Zarr](https://zarr.readthedocs.io/) as the target dataset format.
The inputs can be in any [file format Xarray can read](http://xarray.pydata.org/en/latest/user-guide/io.html),
@@ -48,7 +49,7 @@ The target Zarr dataset will conform to the
[Xarray Zarr encoding conventions](http://xarray.pydata.org/en/latest/internals/zarr-encoding-spec.html).
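
For intuition, the underlying conversion resembles the following sketch (file names are assumed for illustration; the recipe transforms perform this at scale, combining many inputs):

```{code-block} python
import xarray as xr

# Read one input file with Xarray, then write it out as a Zarr store that
# follows Xarray's Zarr encoding conventions.
ds = xr.open_dataset("input-0.nc")
ds.to_zarr("target.zarr", mode="w")
```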

The best way to really understand how recipes work is to go through the relevant
tutorials for this recipe category. These are, in order of increasing complexity:

- {doc}`../tutorials/xarray_zarr/netcdf_zarr_sequential`
- {doc}`../tutorials/xarray_zarr/cmip6-recipe`
@@ -59,29 +60,43 @@ tutorials for this recipe class. These are, in order of increasing complexity
Below we give a very basic overview of how this recipe is used.

First you must define a {doc}`file pattern <file_patterns>`.

Once you have a {class}`FilePattern <pangeo_forge_recipes.patterns.FilePattern>` object,
the recipe pipeline will contain at a minimum the following transforms applied to the file pattern collection:
* {class}`pangeo_forge_recipes.transforms.OpenURLWithFSSpec`: retrieves each pattern file using the specified URLs.
* {class}`pangeo_forge_recipes.transforms.OpenWithXarray`: loads each pattern file into an [`xarray.Dataset`](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html):
  * The `file_type` is specified from the pattern.
* {class}`pangeo_forge_recipes.transforms.StoreToZarr`: generates a Zarr store by combining the datasets:
  * `store_name` specifies the name of the generated Zarr store.
  * `target_root` specifies where the output will be stored, for example, a temporary directory during development.
  * `combine_dims` informs the transform of the dimension(s) used to combine the datasets. Here we use the dimension specified in the file pattern (`time`).
  * `target_chunks` specifies a dictionary of the desired chunk size for each dimension; any dimension not included defaults to its full shape.

For example:
```{code-block} python
import apache_beam as beam

from pangeo_forge_recipes.transforms import OpenURLWithFSSpec, OpenWithXarray, StoreToZarr

transforms = (
    beam.Create(pattern.items())
    | OpenURLWithFSSpec()
    | OpenWithXarray(file_type=pattern.file_type)
    | StoreToZarr(
        store_name=store_name,
        target_root=target_root,
        combine_dims=pattern.combine_dim_keys,
        target_chunks={"time": 10},
    )
)
```

The available transform options are all covered in the {doc}`../api_reference`. Many of these options are explored further in the {doc}`../tutorials/index`.

All recipes need a place to store the target dataset. Refer to {doc}`storage` for how to assign this and any other required storage targets.

Once your recipe is defined and has its storage targets assigned, you're ready to
move on to {doc}`execution`.

### HDF Reference Recipes

Like the Xarray to Zarr recipes, this category allows us to more efficiently access data from a collection of NetCDF / HDF files.
However, such a recipe does not actually copy the original source data.
Instead, it generates metadata files which reference and index the original data, allowing it to be accessed more quickly and easily.
For more background, see [this blog post](https://medium.com/pangeo/fake-it-until-you-make-it-reading-goes-netcdf4-data-on-aws-s3-as-zarr-for-rapid-data-access-61e33f8fe685).
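
For intuition, a dataset indexed by a generated reference file can be opened lazily via fsspec's `reference://` protocol; the sketch below assumes a local `reference.json` describing data stored on S3:

```{code-block} python
import xarray as xr

# No source bytes are copied; the reference file maps Zarr keys onto
# byte ranges inside the original NetCDF/HDF files.
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "storage_options": {"fo": "reference.json", "remote_protocol": "s3"},
        "consolidated": False,
    },
)
```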

68 changes: 39 additions & 29 deletions docs/pangeo_forge_recipes/recipe_user_guide/storage.md
@@ -1,27 +1,32 @@
# Storage

Recipes need a place to store data. This information is provided to the recipe via the transforms in its pipeline, where the storage configuration may include a *cache* location, for storing retrieved source data products, and a *target* location, for storing the recipe output.
This is illustrated below using two transforms typically used in {doc}`recipes`.

```{eval-rst}
.. autoclass:: pangeo_forge_recipes.transforms.OpenURLWithFSSpec
   :noindex:
```
```{eval-rst}
.. autoclass:: pangeo_forge_recipes.transforms.StoreToZarr
   :noindex:
```


## Default storage

When you create a new recipe, it is common to specify storage locations pointing at a local [`tempfile.TemporaryDirectory`](https://docs.python.org/3/library/tempfile.html#tempfile.TemporaryDirectory).
This allows you to write data to temporary local storage during the recipe development and debugging process.
This means that any recipe can immediately be executed with minimal configuration.
However, in a realistic "production" scenario, a separate location will be used. In all cases, the storage locations are customized using the corresponding transform parameters.
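
For example, a throwaway local target suitable for development can be as simple as the following sketch (names are illustrative):

```{code-block} python
import tempfile

# The directory is cleaned up when this object is garbage-collected,
# which is fine for development and debugging runs.
tmp_dir = tempfile.TemporaryDirectory()
target_root = tmp_dir.name  # pass as `target_root` to StoreToZarr
```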

## Customizing *target* storage: `StoreToZarr`

The minimal requirement for instantiating `StoreToZarr` is a location in which to store the final dataset produced by the recipe. This is achieved with the following parameters:

* `store_name` specifies the name of the generated Zarr store.
* `target_root` specifies where the output will be stored. For example, a temporary directory created during local development.

Although `target_root` may be a `str` pointing to a location, it also accepts a special class provided by Pangeo Forge for this: {class}`pangeo_forge_recipes.storage.FSSpecTarget`. Creating an ``FSSpecTarget`` requires two arguments:
- The ``fs`` argument is an [fsspec](https://filesystem-spec.readthedocs.io/en/latest/)
filesystem. Fsspec supports many different types of storage via its
[built in](https://filesystem-spec.readthedocs.io/en/latest/api.html#built-in-implementations)
@@ -35,38 +40,43 @@ import s3fs
from pangeo_forge_recipes.storage import FSSpecTarget
fs = s3fs.S3FileSystem(key="MY_AWS_KEY", secret="MY_AWS_SECRET")
target_root = FSSpecTarget(fs=fs, root_path="pangeo-forge-bucket")
```

This target can then be assigned to a recipe as follows (see also {doc}`recipes`):
```{code-block} python
import apache_beam as beam

from pangeo_forge_recipes.transforms import OpenURLWithFSSpec, OpenWithXarray, StoreToZarr

transforms = (
    beam.Create(pattern.items())
    | OpenURLWithFSSpec()
    | OpenWithXarray(file_type=pattern.file_type)
    | StoreToZarr(
        store_name="my-dataset-v1.zarr",
        target_root=target_root,
        combine_dims=pattern.combine_dim_keys,
        target_chunks={"time": 10},
    )
)
```


## Customizing storage continued: caching with `OpenURLWithFSSpec`

Oftentimes it is useful to cache input files, rather than read them directly from the data provider. Input files can be cached at a location defined by a {class}`pangeo_forge_recipes.storage.CacheFSSpecTarget` object. For example, extending the previous recipe pipeline:

```{code-block} python
import apache_beam as beam

from pangeo_forge_recipes.storage import CacheFSSpecTarget
from pangeo_forge_recipes.transforms import OpenURLWithFSSpec, OpenWithXarray, StoreToZarr

# define your fsspec filesystem for the cache location here
cache = CacheFSSpecTarget(fs=<fsspec-filesystem-for-cache>, root_path="<path-for-cache>")

transforms = (
    beam.Create(pattern.items())
    | OpenURLWithFSSpec(cache=cache)
    | OpenWithXarray(file_type=pattern.file_type)
    | StoreToZarr(
        store_name="my-dataset-v1.zarr",
        target_root=target_root,
        combine_dims=pattern.combine_dim_keys,
        target_chunks={"time": 10},
    )
)
```
3 changes: 1 addition & 2 deletions docs/pangeo_forge_recipes/tutorials/index.md
@@ -9,7 +9,6 @@ xarray_zarr/netcdf_zarr_sequential
xarray_zarr/cmip6-recipe
xarray_zarr/multi_variable_recipe
xarray_zarr/terraclimate
hdf_reference/reference_cmip6
```

[//]: # (TODO - Restore this in previous toctree if/when XarrayZarrRecipe.subset_inputs is supported on Beam, https://github.com/pangeo-forge/pangeo-forge-recipes/issues/496: xarray_zarr/opendap_subset_recipe)
