From 2c33615b534924d2b7d6f6d73277169cc22450a0 Mon Sep 17 00:00:00 2001
From: YuviPanda
Date: Sat, 19 Aug 2023 12:23:31 -0700
Subject: [PATCH 1/3] Add documentation on how to run recipes locally

---
 docs/index.md          |   1 +
 docs/tutorial/local.md | 144 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 145 insertions(+)
 create mode 100644 docs/tutorial/local.md

diff --git a/docs/index.md b/docs/index.md
index 557a2b7a..b1aeba6f 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -14,6 +14,7 @@ feedstocks
 ```{toctree}
 :maxdepth: 1
 
+tutorial/local
 tutorial/flink
 ```
 
diff --git a/docs/tutorial/local.md b/docs/tutorial/local.md
new file mode 100644
index 00000000..974a486a
--- /dev/null
+++ b/docs/tutorial/local.md
@@ -0,0 +1,144 @@
+# Running a recipe locally
+
+`pangeo-forge-runner` supports baking your recipes locally, primarily so you
+can test the exact setup that will be used to bake your recipe on the cloud.
+This allows for fast iteration on your recipe, while guaranteeing that the
+behavior you see on your local system is what you will get when running
+scaled out on the cloud.
+
+## Clone a sample recipe repo to work on
+
+This tutorial will work with any recipe, but to simplify things we will use
+this pruned [GPCP Recipe](https://github.com/pforgetest/gpcp-from-gcs-feedstock/)
+that pulls a subset of GPCP netCDF files from Google Cloud Storage and writes them
+out as Zarr. The config we have set up for `pangeo-forge-runner` will fetch the
+files from remote storage only once on your system, caching them so future runs
+will be faster.
+
+This same setup would work for any recipe!
+
+1. Clone a copy of the recipe to work on:
+
+   ```bash
+   git clone https://github.com/pforgetest/gpcp-from-gcs-feedstock
+   cd gpcp-from-gcs-feedstock
+   ```
+
+   You can make edits to this if you would like.
+
+2. Set up a virtual environment that will contain `pangeo-forge-runner` and
+   any other dependencies this recipe will need. We use a `venv` here,
+   but you may also use `conda` or any other Python package management setup you
+   are familiar with.
+
+   ```bash
+   python -m venv venv
+   source venv/bin/activate
+   ```
+
+3. Install `pangeo-forge-runner` into this environment.
+
+   ```bash
+   pip install pangeo-forge-runner
+   ```
+
+Now you're ready to go!
+
+## Setting up the config file
+
+Construct a `local_config.py` file that describes where the output
+data should go, and what should be used for caching the input files. Since we just
+want to test locally, these can point to the local filesystem!
+
+```
+# Let's put all our data on the same dir as this config file
+from pathlib import Path
+import os
+HERE = Path(__file__).parent
+
+DATA_PREFIX = HERE / 'data'
+os.makedirs(DATA_PREFIX, exists_ok=True)
+
+# Target output should be partitioned by job id
+c.TargetStorage.root_path = f"{DATA_PREFIX}/{{job}}"
+
+c.InputCacheStorage.fsspec_class = c.TargetStorage.fsspec_class
+c.InputCacheStorage.fsspec_args = c.TargetStorage.fsspec_args
+
+# Input data cache should *not* be partitioned by job id, as we want to get the datafile
+# from the source only once
+c.InputCacheStorage.root_path = f"{DATA_PREFIX}/cache/input"
+
+c.MetadataCacheStorage.fsspec_class = c.TargetStorage.fsspec_class
+c.MetadataCacheStorage.fsspec_args = c.TargetStorage.fsspec_args
+# Metadata cache should be per job, as kwargs changing can change metadata
+c.MetadataCacheStorage.root_path = f"{DATA_PREFIX}/{{job}}/cache/metadata"
+```
+
+This will create a directory called `data` in the same directory this
+config file is located in, and put all outputs and caches in there. To
+speed up multiple runs, input files will be cached under the `data/cache`
+directory.
+
+## Run a pruned version of your recipe
+
+You're all set to run your recipe now!
+
+```bash
+pangeo-forge-runner bake \
+    --config local_config.py \
+    --repo . \
+    --Bake.job_name=test1 \
+    --prune
+```
+
+This should run for a few seconds, and your output Zarr should now be
+in `data/test1`! Let's explore the various parameters passed.
+
+1. `--config local_config.py` specifies the config file we want `pangeo-forge-runner`
+   to read. If we were to run this on GCP or AWS, we could have additional
+   `aws_config.py` or `gcp_config.py` files, and just pass those instead - everything
+   else can remain the same. By putting most config into files, this also eases
+   collaboration - multiple people can know they're running the same config.
+2. `--repo .` specifies that we want the current directory to be treated as a recipe
+   and run. This can instead point to a git repo, Zenodo URI, etc. as needed.
+3. `--Bake.job_name=test1` specifies a unique job name for this particular run.
+   In our `local_config.py`, we use this name to create the output directory. If
+   not specified, this would be autogenerated.
+4. `--prune` specifies we only want to run the recipe on about 2 input files, rather
+   than on everything. This makes for fast turnaround time and easy testing.
+
+You can test the created Zarr store by opening it with `xarray`:
+
+```python
+>>> import xarray as xr
+>>> ds = xr.open_zarr("data/test1/gpcp")
+>>> ds
+<xarray.Dataset>
+Dimensions:      (latitude: 180, nv: 2, longitude: 360, time: 2)
+Coordinates:
+  * latitude     (latitude) float32 -90.0 -89.0 -88.0 -87.0 ... 87.0 88.0 89.0
+  * longitude    (longitude) float32 0.0 1.0 2.0 3.0 ... 356.0 357.0 358.0 359.0
+  * time         (time) datetime64[ns] 1996-10-01 1996-10-02
+Dimensions without coordinates: nv
+Data variables:
+    lat_bounds   (latitude, nv) float32 dask.array
+    lon_bounds   (longitude, nv) float32 dask.array
+    precip       (time, latitude, longitude) float32 dask.array
+    time_bounds  (time, nv) datetime64[ns] dask.array
+Attributes: (12/41)
+    Conventions:               CF-1.6, ACDD 1.3
+    Metadata_Conventions:      CF-1.6, Unidata Dataset Discovery v1.0, NOAA ...
+    acknowledgment:            This project was supported in part by a grant...
+    cdm_data_type:             Grid
+    cdr_program:               NOAA Climate Data Record Program for satellit...
+    cdr_variable:              precipitation
+    ...                        ...
+    sensor:                    Imager, TOVS > TIROS Operational Vertical Sou...
+    spatial_resolution:        1 degree
+    standard_name_vocabulary:  CF Standard Name Table (v41, 22 February 2017)
+    summary:                   Global Precipitation Climatology Project (GPC...
+    time_coverage_duration:    P1D
+    title:                     Global Precipitation Climatatology Project (G...
+>>>
+```

From 66126a8d4112f9448419f70a0ba268ab2751d097 Mon Sep 17 00:00:00 2001
From: Yuvi Panda
Date: Wed, 23 Aug 2023 12:40:00 -0700
Subject: [PATCH 2/3] Specify what language doc is

Co-authored-by: Charles Stern <62192187+cisaacstern@users.noreply.github.com>
---
 docs/tutorial/local.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/tutorial/local.md b/docs/tutorial/local.md
index 974a486a..819f5aa1 100644
--- a/docs/tutorial/local.md
+++ b/docs/tutorial/local.md
@@ -50,7 +50,7 @@ Construct a `local_config.py` file that describes where the output
 data should go, and what should be used for caching the input files. Since we just
 want to test locally, these can point to the local filesystem!
 
-```
+```python
 # Let's put all our data on the same dir as this config file
 from pathlib import Path
 import os

From 09c471047b98c37105b2be69571a85cb9c54d359 Mon Sep 17 00:00:00 2001
From: Yuvi Panda
Date: Wed, 25 Oct 2023 12:36:26 +0530
Subject: [PATCH 3/3] Fix exist_ok param

Co-authored-by: Sean Quinlan <1011062+sbquinlan@users.noreply.github.com>
---
 docs/tutorial/local.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/tutorial/local.md b/docs/tutorial/local.md
index 819f5aa1..12f17414 100644
--- a/docs/tutorial/local.md
+++ b/docs/tutorial/local.md
@@ -57,7 +57,7 @@ import os
 HERE = Path(__file__).parent
 
 DATA_PREFIX = HERE / 'data'
-os.makedirs(DATA_PREFIX, exists_ok=True)
+os.makedirs(DATA_PREFIX, exist_ok=True)
 
 # Target output should be partitioned by job id
 c.TargetStorage.root_path = f"{DATA_PREFIX}/{{job}}"
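
Review note on PATCH 3/3: the keyword accepted by `os.makedirs` is `exist_ok`, so the misspelled `exists_ok` from PATCH 1/3 would be expected to raise `TypeError` as soon as that line of the config executes. The following is a minimal sketch of that behavior, not part of the patch series. It uses a temporary directory as a stand-in for `HERE`, and the `make_data_dir` helper is hypothetical, introduced only for illustration.

```python
import os
import tempfile
from pathlib import Path


def make_data_dir(base: Path) -> Path:
    """Create (or reuse) a 'data' directory under base, mirroring DATA_PREFIX.

    Hypothetical helper for illustration; the tutorial config calls
    os.makedirs directly at module level.
    """
    data_prefix = base / "data"
    # exist_ok=True makes the call idempotent: no error if the dir exists.
    os.makedirs(data_prefix, exist_ok=True)
    return data_prefix


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)

    # Calling twice succeeds, which is what repeated local runs rely on.
    first = make_data_dir(base)
    second = make_data_dir(base)
    created_ok = first == second and first.is_dir()

    # The misspelled keyword from the first patch is rejected outright.
    try:
        os.makedirs(base / "data", exists_ok=True)
        misspelled_raises = False
    except TypeError:
        misspelled_raises = True

print(created_ok, misspelled_raises)
```

Since `DATA_PREFIX` is already a `Path`, `DATA_PREFIX.mkdir(parents=True, exist_ok=True)` would be an equivalent spelling that avoids the `os` import.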