Full draft of Beam port of "Recipes User Guide" (pangeo-forge#483)
derekocallaghan committed Mar 7, 2023
1 parent 47e46f8 commit 196bd0e
Showing 3 changed files with 44 additions and 54 deletions.
28 changes: 4 additions & 24 deletions docs/pangeo_forge_recipes/recipe_user_guide/execution.md
…recipe has already been initialized in the variable `recipe`.

## Recipe Executors

A recipe is an abstract description of a transformation pipeline.
Recipes can be _compiled_ to executable objects.
We currently support the following execution mechanism.

### Beam PTransform

A recipe is defined as a [pipeline](https://beam.apache.org/documentation/programming-guide/#creating-a-pipeline) of [Apache Beam transforms](https://beam.apache.org/documentation/programming-guide/#transforms) applied to the data collection associated with a {doc}`file pattern <file_patterns>`. Specifically, each recipe pipeline contains a set of transforms that operate on an `apache_beam.PCollection`, applying the specified transformation from input to output elements. Having created a transforms pipeline (see {doc}`recipes`), it may be executed with Beam as follows:

```{code-block} python
import apache_beam as beam
with beam.Pipeline() as p:
    p | transforms
```

By default the pipeline runs using Beam's [DirectRunner](https://beam.apache.org/documentation/runners/direct/).
See [runners](https://beam.apache.org/documentation/#runners) for more details.
2 changes: 1 addition & 1 deletion docs/pangeo_forge_recipes/recipe_user_guide/recipes.md
…to {doc}`../../pangeo_forge_cloud/index`, which allows the recipe to be automati…

## Recipe Pipelines

A recipe is defined as a [pipeline](https://beam.apache.org/documentation/programming-guide/#creating-a-pipeline) of [Apache Beam transforms](https://beam.apache.org/documentation/programming-guide/#transforms) applied to the data collection associated with a {doc}`file pattern <file_patterns>`. Specifically, each recipe pipeline contains a set of transforms that operate on an `apache_beam.PCollection`, mapping input elements one-to-one to output elements (via `apache_beam.Map`) to apply the specified transformation.

To write a recipe, you define a pipeline that uses existing transforms, in combination with new transforms if required for custom processing of the input data collection.

68 changes: 39 additions & 29 deletions docs/pangeo_forge_recipes/recipe_user_guide/storage.md
# Storage

Recipes need a place to store data. This information is provided to the recipe through the transforms in its pipeline: storage configuration may include a *cache* location, for storing retrieved source data products, and a *target* location, for storing the recipe output.
This is illustrated here using two transforms typically used in {doc}`recipes`.

```{eval-rst}
.. autoclass:: pangeo_forge_recipes.transforms.OpenURLWithFSSpec
    :noindex:
```

```{eval-rst}
.. autoclass:: pangeo_forge_recipes.transforms.StoreToZarr
    :noindex:
```

## Default storage

When you create a new recipe, it is common to specify storage locations pointing at a local [`tempfile.TemporaryDirectory`](https://docs.python.org/3/library/tempfile.html#tempfile.TemporaryDirectory).
This allows you to write data to temporary local storage during the recipe development and debugging process.
This means that any recipe can immediately be executed with minimal configuration.
However, in a realistic "production" scenario, a separate persistent location will be used. In all cases, the storage locations are customized using the corresponding transform parameters.
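The development-time default can be sketched with nothing but the standard library; `target_root` here is simply a hypothetical path that would later be handed to the storage transform:

```python
import tempfile

# Throwaway local directory to act as the recipe target during development
tmpdir = tempfile.TemporaryDirectory()
target_root = tmpdir.name
```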

## Customizing *target* storage: `StoreToZarr`

The minimal requirement for instantiating `StoreToZarr` is a location in which to store the final dataset produced by the recipe. This is achieved with the following parameters:
* `store_name` specifies the name of the generated Zarr store.
* `target_root` specifies where the output will be stored. For example, a temporary directory created during local development.

Although `target_root` may be a `str` pointing to a location, it also accepts a special class provided by Pangeo Forge: {class}`pangeo_forge_recipes.storage.FSSpecTarget`. Creating an ``FSSpecTarget`` requires two arguments:
- The ``fs`` argument is an [fsspec](https://filesystem-spec.readthedocs.io/en/latest/)
filesystem. Fsspec supports many different types of storage via its
[built in](https://filesystem-spec.readthedocs.io/en/latest/api.html#built-in-implementations)
```{code-block} python
import s3fs
from pangeo_forge_recipes.storage import FSSpecTarget
fs = s3fs.S3FileSystem(key="MY_AWS_KEY", secret="MY_AWS_SECRET")
target_root = FSSpecTarget(fs=fs, root_path="pangeo-forge-bucket")
```

This target can then be assigned to a recipe as follows (see also {doc}`recipes`):
```{code-block} python
import apache_beam as beam
from pangeo_forge_recipes.transforms import OpenURLWithFSSpec, OpenWithXarray, StoreToZarr

transforms = (
    beam.Create(pattern.items())
    | OpenURLWithFSSpec()
    | OpenWithXarray(file_type=pattern.file_type)
    | StoreToZarr(
        store_name="my-dataset-v1.zarr",
        target_root=target_root,
        combine_dims=pattern.combine_dim_keys,
        target_chunks={"time": 10},
    )
)
```


## Customizing storage continued: caching with `OpenURLWithFSSpec`

Oftentimes it is useful to cache input files rather than read them directly from the data provider. Input files can be cached at a location defined by a {class}`pangeo_forge_recipes.storage.CacheFSSpecTarget` object. For example, extending the previous recipe pipeline:

```{code-block} python
from pangeo_forge_recipes.storage import CacheFSSpecTarget
# define an fsspec filesystem and path for the cache location
cache = CacheFSSpecTarget(fs=<fsspec-filesystem-for-cache>, root_path="<path-for-cache>")

transforms = (
    beam.Create(pattern.items())
    | OpenURLWithFSSpec(cache=cache)
    | OpenWithXarray(file_type=pattern.file_type)
    | StoreToZarr(
        store_name="my-dataset-v1.zarr",
        target_root=target_root,
        combine_dims=pattern.combine_dim_keys,
        target_chunks={"time": 10},
    )
)
```
