Add documentation on how to run recipes locally #89
Merged
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -14,6 +14,7 @@ feedstocks | |
```{toctree} | ||
:maxdepth: 1 | ||
|
||
tutorial/local | ||
tutorial/flink | ||
``` | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,144 @@ | ||
# Running a recipe locally | ||
|
||
`pangeo-forge-runner` supports baking your recipes locally, primarily so you | ||
can test the exact setup that will be used to bake your recipe on the cloud. | ||
This allows for fast iteration on your recipe, while guaranteeing that the | ||
behavior you see on your local system is what you will get when running | ||
scaled out on the cloud. | ||
|
||
## Clone a sample recipe repo to work on | ||
|
||
This tutorial will work with any recipe, but to simplify things we will use | ||
this pruned [GPCP Recipe](https://github.com/pforgetest/gpcp-from-gcs-feedstock/) | ||
that pulls a subset of GPCP NetCDF files from Google Cloud Storage and writes them | ||
out as Zarr. The config we set up for `pangeo-forge-runner` will fetch the | ||
files from remote storage only once on your system, caching them so future runs | ||
will be faster. | ||
|
||
This same setup would work for any recipe! | ||
|
||
1. Clone a copy of the recipe to work on: | ||
|
||
```bash | ||
git clone https://github.com/pforgetest/gpcp-from-gcs-feedstock | ||
cd gpcp-from-gcs-feedstock | ||
``` | ||
|
||
You can make edits to this if you would like. | ||
|
||
2. Set up a virtual environment that will contain `pangeo-forge-runner` and | ||
any other dependencies this recipe will need. We use a `venv` here, | ||
but you may also use `conda` or another Python package management setup you | ||
are familiar with. | ||
|
||
```bash | ||
python -m venv venv | ||
source venv/bin/activate | ||
``` | ||
|
||
3. Install `pangeo-forge-runner` into this environment. | ||
|
||
```bash | ||
pip install pangeo-forge-runner | ||
``` | ||
|
||
Now you're ready to go! | ||
|
||
## Setting up a config file | ||
|
||
Construct a `local_config.py` file that describes where the output | ||
data should go, and what should be used for caching the input files. Since we just | ||
want to test locally, these can point to the local filesystem! | ||
|
||
```python | ||
# Let's put all our data on the same dir as this config file | ||
from pathlib import Path | ||
import os | ||
HERE = Path(__file__).parent | ||
|
||
DATA_PREFIX = HERE / 'data' | ||
os.makedirs(DATA_PREFIX, exist_ok=True) | ||
|
||
# Target output should be partitioned by job id | ||
c.TargetStorage.root_path = f"{DATA_PREFIX}/{{job}}" | ||
|
||
c.InputCacheStorage.fsspec_class = c.TargetStorage.fsspec_class | ||
c.InputCacheStorage.fsspec_args = c.TargetStorage.fsspec_args | ||
|
||
# Input data cache should *not* be partitioned by job id, as we want to get the datafile | ||
# from the source only once | ||
c.InputCacheStorage.root_path = f"{DATA_PREFIX}/cache/input" | ||
|
||
c.MetadataCacheStorage.fsspec_class = c.TargetStorage.fsspec_class | ||
c.MetadataCacheStorage.fsspec_args = c.TargetStorage.fsspec_args | ||
# Metadata cache should be per job, as kwargs changing can change metadata | ||
c.MetadataCacheStorage.root_path = f"{DATA_PREFIX}/{{job}}/cache/metadata" | ||
``` | ||
|
||
This will create a directory called `data` in the same directory this | ||
config file is located in, and put all outputs and caches in there. To | ||
speed up multiple runs, input files will be cached under the | ||
`data/cache/input` directory. | ||
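
The doubled braces in the config's f-strings are easy to misread, so here is a minimal sketch of what they evaluate to. The `/tmp/feedstock` directory is a hypothetical stand-in for wherever `local_config.py` lives, and the final `str.format` call is only an assumption about how the `{job}` placeholder might later be filled in, not a description of the runner's internals.

```python
from pathlib import PurePosixPath

# Hypothetical stand-in for the directory that holds local_config.py.
DATA_PREFIX = PurePosixPath("/tmp/feedstock") / "data"

# Doubled braces in an f-string produce literal braces, so only DATA_PREFIX
# is interpolated and "{job}" survives as a placeholder in the result.
target_template = f"{DATA_PREFIX}/{{job}}"
metadata_template = f"{DATA_PREFIX}/{{job}}/cache/metadata"

# The input cache path has no placeholder, so it is shared across jobs.
input_cache_path = f"{DATA_PREFIX}/cache/input"

# Assumption: the placeholder is later expanded in a str.format-like way.
target_path = target_template.format(job="test1")
metadata_path = metadata_template.format(job="test1")

print(target_template)
print(target_path)
print(metadata_path)
print(input_cache_path)
```

Under these assumptions, two runs with different job names would write to separate output and metadata directories while reusing the same input cache.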
|
||
## Run a pruned version of your recipe | ||
|
||
You're all set to run your recipe now! | ||
|
||
```bash | ||
pangeo-forge-runner bake \ | ||
--config local_config.py \ | ||
--repo . \ | ||
--Bake.job_name=test1 \ | ||
--prune | ||
``` | ||
|
||
This should run for a few seconds, and your output Zarr should now be | ||
in `data/test1`! Let's explore the various parameters passed. | ||
|
||
1. `--config local_config.py` specifies the config file we want `pangeo-forge-runner` | ||
to read. To run this on GCP or AWS, we could write additional | ||
`aws_config.py` or `gcp_config.py` files and pass those instead; everything | ||
else can remain the same. Putting most config into files also eases | ||
collaboration, since multiple people can know they're running the same config. | ||
2. `--repo .` specifies that we want the current directory to be treated as a recipe | ||
and run. This can instead point to a git repo, Zenodo URI, etc. as needed. | ||
3. `--Bake.job_name=test1` specifies a unique job name for this particular run. | ||
In our `local_config.py`, we use this name to create the output directory. If | ||
not specified, this would be autogenerated. | ||
4. `--prune` specifies we only want to run the recipe on about 2 input files, rather | ||
than on everything. This makes for fast turnaround time and easy testing. | ||
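
To illustrate how only the config file changes between environments, the sketch below assembles the same `bake` invocation for two config files. `build_bake_command` is a hypothetical helper written for this example, not part of `pangeo-forge-runner`; it only builds the argument list from the flags shown above and does not run anything.

```python
import shlex


def build_bake_command(config_file, job_name, repo=".", prune=True):
    """Assemble the bake command line as a list of arguments (hypothetical helper)."""
    cmd = [
        "pangeo-forge-runner",
        "bake",
        "--config",
        config_file,
        "--repo",
        repo,
        f"--Bake.job_name={job_name}",
    ]
    if prune:
        # Only add the flag when a pruned test run is wanted.
        cmd.append("--prune")
    return cmd


local_cmd = build_bake_command("local_config.py", "test1")
gcp_cmd = build_bake_command("gcp_config.py", "test1")

# Print shell-ready versions of both commands for comparison.
print(shlex.join(local_cmd))
print(shlex.join(gcp_cmd))
```

The resulting list could be passed to `subprocess.run` once `pangeo-forge-runner` is installed in the active environment.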
|
||
You can test the created Zarr store by opening it with `xarray`: | ||
|
||
```python | ||
>>> import xarray as xr | ||
>>> ds = xr.open_zarr("data/test1/gpcp") | ||
>>> ds | ||
<xarray.Dataset> | ||
Dimensions: (latitude: 180, nv: 2, longitude: 360, time: 2) | ||
Coordinates: | ||
* latitude (latitude) float32 -90.0 -89.0 -88.0 -87.0 ... 87.0 88.0 89.0 | ||
* longitude (longitude) float32 0.0 1.0 2.0 3.0 ... 356.0 357.0 358.0 359.0 | ||
* time (time) datetime64[ns] 1996-10-01 1996-10-02 | ||
Dimensions without coordinates: nv | ||
Data variables: | ||
lat_bounds (latitude, nv) float32 dask.array<chunksize=(180, 2), meta=np.ndarray> | ||
lon_bounds (longitude, nv) float32 dask.array<chunksize=(360, 2), meta=np.ndarray> | ||
precip (time, latitude, longitude) float32 dask.array<chunksize=(1, 180, 360), meta=np.ndarray> | ||
time_bounds (time, nv) datetime64[ns] dask.array<chunksize=(1, 2), meta=np.ndarray> | ||
Attributes: (12/41) | ||
Conventions: CF-1.6, ACDD 1.3 | ||
Metadata_Conventions: CF-1.6, Unidata Dataset Discovery v1.0, NOAA ... | ||
acknowledgment: This project was supported in part by a grant... | ||
cdm_data_type: Grid | ||
cdr_program: NOAA Climate Data Record Program for satellit... | ||
cdr_variable: precipitation | ||
... ... | ||
sensor: Imager, TOVS > TIROS Operational Vertical Sou... | ||
spatial_resolution: 1 degree | ||
standard_name_vocabulary: CF Standard Name Table (v41, 22 February 2017) | ||
summary: Global Precipitation Climatology Project (GPC... | ||
time_coverage_duration: P1D | ||
title: Global Precipitation Climatatology Project (G... | ||
>>> | ||
``` |