Updated following review comments, and some minor additional edits to Recipes (pangeo-forge#483)
derekocallaghan committed Mar 14, 2023
1 parent 196bd0e commit e594b76
Showing 3 changed files with 13 additions and 9 deletions.
8 changes: 6 additions & 2 deletions docs/pangeo_forge_recipes/recipe_user_guide/execution.md
@@ -25,5 +25,9 @@ with beam.Pipeline() as p:
p | transforms
```

-By default the pipeline runs using Beam's [DirectRunner](https://beam.apache.org/documentation/runners/direct/).
-See [runners](https://beam.apache.org/documentation/#runners) for more details.
+By default the pipeline runs using Beam's [DirectRunner](https://beam.apache.org/documentation/runners/direct/), which is useful during recipe development. However, alternative Beam runners are available, for example:
+* [FlinkRunner](https://beam.apache.org/documentation/runners/flink/): executes Beam pipelines using [Apache Flink](https://flink.apache.org/).
+* [DataflowRunner](https://beam.apache.org/documentation/runners/dataflow/): executes pipelines on the [Google Cloud Dataflow managed service](https://cloud.google.com/dataflow/service/dataflow-service-desc).
+* [DaskRunner](https://beam.apache.org/releases/pydoc/current/apache_beam.runners.dask.dask_runner.html): executes pipelines via [Dask.distributed](https://distributed.dask.org/en/stable/).
+
+See the [Beam documentation](https://beam.apache.org/documentation/#runners) for details of the available runners.
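As a minimal sketch of switching runners (assuming `transforms` is a recipe pipeline composed as in the snippet above), the runner can be selected via `PipelineOptions`:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# "DirectRunner" is the local default; swap in e.g. "FlinkRunner" or
# "DataflowRunner" (each requires additional runner-specific options,
# such as a project and region for Dataflow).
options = PipelineOptions(runner="DirectRunner")

with beam.Pipeline(options=options) as p:
    p | transforms  # `transforms`: a recipe pipeline defined as above
```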
10 changes: 5 additions & 5 deletions docs/pangeo_forge_recipes/recipe_user_guide/recipes.md
@@ -21,15 +21,15 @@ to {doc}`../../pangeo_forge_cloud/index`, which allows the recipe to be automati

## Recipe Pipelines

-A recipe is defined as a [pipeline](https://beam.apache.org/documentation/programming-guide/#creating-a-pipeline) of [Apache Beam transforms](https://beam.apache.org/documentation/programming-guide/#transforms) applied to the data collection associated with a {doc}`file pattern <file_patterns>`. Specifically, each recipe pipeline contains a set of transforms, which operate on an `apache_beam.PCollection`, performing a one-to-one mapping using `apache_beam.Map` of input elements to output elements, applying the specified transformation.
+A recipe is defined as a [pipeline](https://beam.apache.org/documentation/programming-guide/#creating-a-pipeline) of [Apache Beam transforms](https://beam.apache.org/documentation/programming-guide/#transforms) applied to the data collection associated with a {doc}`file pattern <file_patterns>`. Specifically, each recipe pipeline contains a set of transforms that operate on an [`apache_beam.PCollection`](https://beam.apache.org/documentation/programming-guide/#pcollections), mapping input elements to output elements (for example, using [`apache_beam.Map`](https://beam.apache.org/documentation/transforms/python/elementwise/map/)) to apply the specified transformation.

To write a recipe, you define a pipeline that uses existing transforms, in combination with new transforms if required for custom processing of the input data collection.
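For example, a minimal, self-contained sketch of this element-wise mapping (using plain integers as stand-in input elements rather than real recipe data):

```python
import apache_beam as beam

# Each element of the input PCollection is transformed one-to-one.
with beam.Pipeline() as p:
    (
        p
        | "Create" >> beam.Create([1, 2, 3])     # stand-in input collection
        | "Square" >> beam.Map(lambda x: x * x)  # one-to-one mapping
        | "Print" >> beam.Map(print)             # inspect each output element
    )
```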

Right now, there are two categories of recipe pipelines based on a specific data model for the input files and target dataset format.
In the future, we may add more.

```{note}
-The full API Reference documentation for the existing recipe `PTransform` implementations can be found at
+The full API Reference documentation for the existing recipe `PTransform` implementations ({class}`pangeo_forge_recipes.transforms`) can be found at
{doc}`../api_reference`.
```

@@ -62,10 +62,10 @@ Below we give a very basic overview of how this recipe is used.
First you must define a {doc}`file pattern <file_patterns>`.
Once you have a {class}`FilePattern <pangeo_forge_recipes.patterns.FilePattern>` object,
the recipe pipeline will contain at a minimum the following transforms applied to the file pattern collection (combined in the sketch after this list):
-* `OpenURLWithFSSpec`: retrieves each pattern file using the specified URLs.
-* `OpenWithXarray`: load each pattern file into an `xarray.Dataset`:
+* {class}`pangeo_forge_recipes.transforms.OpenURLWithFSSpec`: retrieves each pattern file using the specified URLs.
+* {class}`pangeo_forge_recipes.transforms.OpenWithXarray`: loads each pattern file into an [`xarray.Dataset`](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html):
  * The `file_type` is specified from the pattern.
-* `StoreToZarr`: generate a Zarr store by combining the datasets:
+* {class}`pangeo_forge_recipes.transforms.StoreToZarr`: generates a Zarr store by combining the datasets:
* `store_name` specifies the name of the generated Zarr store.
* `target_root` specifies where the output will be stored, in this case, the temporary directory we created.
* `combine_dims` informs the transform of the dimension used to combine the datasets. Here we use the dimension specified in the file pattern (`time`).
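As a sketch of how these compose (assuming `pattern` is the `FilePattern` defined earlier and `target_root` points at writable storage, as in the storage docs):

```python
import apache_beam as beam
from pangeo_forge_recipes.transforms import OpenURLWithFSSpec, OpenWithXarray, StoreToZarr

transforms = (
    beam.Create(pattern.items())                   # (index, url) pairs from the file pattern
    | OpenURLWithFSSpec()                          # retrieve each file via fsspec
    | OpenWithXarray(file_type=pattern.file_type)  # load each file as an xarray.Dataset
    | StoreToZarr(
        store_name="my-dataset-v1.zarr",
        target_root=target_root,
        combine_dims=pattern.combine_dim_keys,     # e.g. combine along `time`
    )
)
```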
4 changes: 2 additions & 2 deletions docs/pangeo_forge_recipes/recipe_user_guide/storage.md
@@ -50,7 +50,7 @@ transforms = (
| OpenURLWithFSSpec()
| OpenWithXarray(file_type=pattern.file_type)
| StoreToZarr(
-store_name=my-dataset-v1.zarr,
+store_name="my-dataset-v1.zarr",
target_root=target_root,
combine_dims=pattern.combine_dim_keys,
target_chunks={"time": 10}
@@ -74,7 +74,7 @@ transforms = (
| OpenURLWithFSSpec(cache=cache)
| OpenWithXarray(file_type=pattern.file_type)
| StoreToZarr(
-store_name=my-dataset-v1.zarr,
+store_name="my-dataset-v1.zarr",
target_root=target_root,
combine_dims=pattern.combine_dim_keys,
target_chunks={"time": 10}
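For local testing, `target_root` and `cache` in the snippets above might be constructed as follows (a sketch assuming the `FSSpecTarget` and `CacheFSSpecTarget` helpers in `pangeo_forge_recipes.storage`):

```python
import tempfile

from fsspec.implementations.local import LocalFileSystem
from pangeo_forge_recipes.storage import CacheFSSpecTarget, FSSpecTarget

# Local filesystem locations for the Zarr target and the input-file cache;
# any fsspec-compatible filesystem (e.g. s3fs, gcsfs) could be used instead.
fs = LocalFileSystem()
target_root = FSSpecTarget(fs, tempfile.mkdtemp(prefix="target-"))
cache = CacheFSSpecTarget(fs, tempfile.mkdtemp(prefix="cache-"))
```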
