diff --git a/docs/index.md b/docs/index.md
index ac5aba1..1faace9 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -14,6 +14,7 @@ feedstocks
 
 ```{toctree}
 :maxdepth: 1
 
+tutorial/local
 tutorial/flink
 ```
diff --git a/docs/tutorial/local.md b/docs/tutorial/local.md
new file mode 100644
index 0000000..12f1741
--- /dev/null
+++ b/docs/tutorial/local.md
@@ -0,0 +1,144 @@
+# Running a recipe locally
+
+`pangeo-forge-runner` supports baking your recipes locally, primarily so you
+can test the exact setup that will be used to bake your recipe on the cloud.
+This allows for fast iteration on your recipe, while guaranteeing that the
+behavior you see on your local system is what you will get when running
+scaled out on the cloud.
+
+## Clone a sample recipe repo to work on
+
+This tutorial will work with any recipe, but to keep things simple we will use
+this pruned [GPCP Recipe](https://github.com/pforgetest/gpcp-from-gcs-feedstock/),
+which pulls a subset of GPCP netCDF files from Google Cloud Storage and writes
+them out as Zarr. The config we have set up for `pangeo-forge-runner` will fetch
+the files from remote storage only once, caching them on your system so future
+runs will be faster.
+
+1. Clone a copy of the recipe to work on:
+
+   ```bash
+   git clone https://github.com/pforgetest/gpcp-from-gcs-feedstock
+   cd gpcp-from-gcs-feedstock
+   ```
+
+   You can make edits to this repo if you would like.
+
+2. Set up a virtual environment that will contain `pangeo-forge-runner` and
+   any other dependencies this recipe will need. We use a `venv` here,
+   but you may also use `conda` or any other Python package management tool
+   you are familiar with.
+
+   ```bash
+   python -m venv venv
+   source venv/bin/activate
+   ```
+
+3. Install `pangeo-forge-runner` into this environment:
+
+   ```bash
+   pip install pangeo-forge-runner
+   ```
+
+Now you're ready to go!
+
+## Setting up the config file
+
+Construct a `local_config.py` file that describes where the output data
+should go, and what should be used for caching the input files. Since we
+just want to test locally, these can all point to the local filesystem!
+
+```python
+# Let's put all our data in the same directory as this config file
+from pathlib import Path
+import os
+
+HERE = Path(__file__).parent
+
+DATA_PREFIX = HERE / 'data'
+os.makedirs(DATA_PREFIX, exist_ok=True)
+
+# Target output should be partitioned by job id
+c.TargetStorage.root_path = f"{DATA_PREFIX}/{{job}}"
+
+c.InputCacheStorage.fsspec_class = c.TargetStorage.fsspec_class
+c.InputCacheStorage.fsspec_args = c.TargetStorage.fsspec_args
+
+# The input data cache should *not* be partitioned by job id, as we want to
+# fetch each data file from the source only once
+c.InputCacheStorage.root_path = f"{DATA_PREFIX}/cache/input"
+
+c.MetadataCacheStorage.fsspec_class = c.TargetStorage.fsspec_class
+c.MetadataCacheStorage.fsspec_args = c.TargetStorage.fsspec_args
+# The metadata cache should be per job, as changing kwargs can change the metadata
+c.MetadataCacheStorage.root_path = f"{DATA_PREFIX}/{{job}}/cache/metadata"
+```
+
+This will create a directory called `data` next to the config file, and put
+all outputs and caches in there. To speed up repeated runs, input files will
+be cached under the `data/cache/input` directory.
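+For example, after a run with the job name `test1` (the name we pass
+explicitly in the next section), you can expect a layout roughly like the
+sketch below; the name of the Zarr store itself (`gpcp` here) comes from the
+recipe, so yours may differ:
+
+```
+data/
+├── cache/
+│   └── input/        # input file cache, shared across all runs
+└── test1/
+    ├── cache/
+    │   └── metadata/ # metadata cache, specific to the test1 job
+    └── gpcp/         # the output Zarr store
+```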
+## Run a pruned version of your recipe
+
+You're all set to run your recipe now!
+
+```bash
+pangeo-forge-runner bake \
+    --config local_config.py \
+    --repo . \
+    --Bake.job_name=test1 \
+    --prune
+```
+
+This should run for a few seconds, and your output Zarr should now be
+under `data/test1`! Let's explore the various parameters passed.
+
+1. `--config local_config.py` specifies the config file we want
+   `pangeo-forge-runner` to read. If we were to run this on GCP or AWS
+   instead, we could have additional `aws_config.py` or `gcp_config.py`
+   files and just pass those; everything else can remain the same (see the
+   sketch at the end of this page). Putting most config into files also
+   eases collaboration, since everyone can be sure they are running the
+   same config.
+2. `--repo .` specifies that we want the current directory to be treated as
+   a recipe and run. This can instead point to a Git repo, Zenodo URI, etc.
+   as needed.
+3. `--Bake.job_name=test1` specifies a unique job name for this particular
+   run. In our `local_config.py`, we use this name to create the output
+   directory. If not specified, it would be autogenerated.
+4. `--prune` specifies that we only want to run the recipe on about two
+   input files, rather than on everything. This makes for fast turnaround
+   time and easy testing.
+
+You can test the created Zarr store by opening it with `xarray`:
+
+```python
+>>> import xarray as xr
+>>> ds = xr.open_zarr("data/test1/gpcp")
+>>> ds
+<xarray.Dataset>
+Dimensions:      (latitude: 180, nv: 2, longitude: 360, time: 2)
+Coordinates:
+  * latitude     (latitude) float32 -90.0 -89.0 -88.0 -87.0 ... 87.0 88.0 89.0
+  * longitude    (longitude) float32 0.0 1.0 2.0 3.0 ... 356.0 357.0 358.0 359.0
+  * time         (time) datetime64[ns] 1996-10-01 1996-10-02
+Dimensions without coordinates: nv
+Data variables:
+    lat_bounds   (latitude, nv) float32 dask.array
+    lon_bounds   (longitude, nv) float32 dask.array
+    precip       (time, latitude, longitude) float32 dask.array
+    time_bounds  (time, nv) datetime64[ns] dask.array
+Attributes: (12/41)
+    Conventions:               CF-1.6, ACDD 1.3
+    Metadata_Conventions:      CF-1.6, Unidata Dataset Discovery v1.0, NOAA ...
+    acknowledgment:            This project was supported in part by a grant...
+    cdm_data_type:             Grid
+    cdr_program:               NOAA Climate Data Record Program for satellit...
+    cdr_variable:              precipitation
+    ...                        ...
+    sensor:                    Imager, TOVS > TIROS Operational Vertical Sou...
+    spatial_resolution:        1 degree
+    standard_name_vocabulary:  CF Standard Name Table (v41, 22 February 2017)
+    summary:                   Global Precipitation Climatology Project (GPC...
+    time_coverage_duration:    P1D
+    title:                     Global Precipitation Climatatology Project (G...
+>>>
+```
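+For reference, here is a minimal sketch of what the `aws_config.py`
+mentioned above might look like, assuming all output and caches live in a
+single S3 bucket. The bucket name and credentials below are placeholders,
+and the exact storage traits your setup needs may differ; check the
+`pangeo-forge-runner` and `s3fs` documentation before using this:
+
+```python
+# A hypothetical aws_config.py; "my-bucket" and the credentials are placeholders
+c.TargetStorage.fsspec_class = "s3fs.S3FileSystem"
+c.TargetStorage.fsspec_args = {"key": "<access-key>", "secret": "<secret-key>"}
+# Same {job} partitioning as in local_config.py, just on S3 instead of local disk
+c.TargetStorage.root_path = "s3://my-bucket/{job}"
+
+c.InputCacheStorage.fsspec_class = c.TargetStorage.fsspec_class
+c.InputCacheStorage.fsspec_args = c.TargetStorage.fsspec_args
+c.InputCacheStorage.root_path = "s3://my-bucket/cache/input"
+
+c.MetadataCacheStorage.fsspec_class = c.TargetStorage.fsspec_class
+c.MetadataCacheStorage.fsspec_args = c.TargetStorage.fsspec_args
+c.MetadataCacheStorage.root_path = "s3://my-bucket/{job}/cache/metadata"
+```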