beam-refactor reference recipe #486

Merged: 74 commits, Apr 28, 2023
Commits
002600f
start of beam-refactor ref recipe port
norlandrhagen Feb 2, 2023
c46bbe8
progress on combining reference files in MultiZarrToZarr
norlandrhagen Feb 2, 2023
92333db
progress on combine refs
norlandrhagen Feb 7, 2023
6942407
added path for parquet backend
norlandrhagen Feb 7, 2023
ddb8ea6
progress on WriteCombinedReference, CombineReferences, DropKeys
norlandrhagen Feb 7, 2023
792484a
Update pangeo_forge_recipes/transforms.py
norlandrhagen Feb 13, 2023
8fd4c37
Update pangeo_forge_recipes/transforms.py
norlandrhagen Feb 13, 2023
4df8e90
update to combine_refs & OpenWithKerchunk dtypes
norlandrhagen Feb 13, 2023
a7976de
Merge branch 'beam-refactor' into beam-kerchunk
norlandrhagen Mar 29, 2023
f7114fc
added testing script
norlandrhagen Mar 29, 2023
8c9d0ca
back to error in WriteCombinedReference
norlandrhagen Mar 29, 2023
d1200e8
added DropKeys into pipeline and a E2E test for kerchunk recipe
norlandrhagen Apr 10, 2023
5837ff4
use CombineFn for kerchunk MultiZarrToZarr
cisaacstern Apr 10, 2023
2075752
kerchunk>=0.1.0 fixes references test for netcdf4
cisaacstern Apr 11, 2023
cd68f6e
fix pre-commit issues
cisaacstern Apr 11, 2023
06f41a6
note docstring fixme in open_with_kerchunk
cisaacstern Apr 11, 2023
b054edf
move write_combine_references to writers module
cisaacstern Apr 11, 2023
f75418f
NetCDF3ToZarr requires filename as str
cisaacstern Apr 11, 2023
d693a8d
try to write to parquet
cisaacstern Apr 11, 2023
b7a1544
give up on parquet for now
cisaacstern Apr 11, 2023
f19f966
add ZarrWriterMixin
cisaacstern Apr 11, 2023
1185037
remove top-level beam_test.py
cisaacstern Apr 11, 2023
1f95bf0
revert terraclimate tutorial to upstream
cisaacstern Apr 11, 2023
bf95dfb
Merge remote-tracking branch 'origin/beam-refactor' into beam-kerchunk
cisaacstern Apr 12, 2023
25a4fba
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 12, 2023
3b66821
updates to transforms & combiners as well as cmip6 example ref recipe
norlandrhagen Apr 12, 2023
865e864
Merge branch 'beam-kerchunk' of https://github.com/norlandrhagen/pang…
norlandrhagen Apr 12, 2023
027aca3
removed file pattern truncation
norlandrhagen Apr 12, 2023
a428265
reference writer, use target.fs.open and make fname an option
cisaacstern Apr 12, 2023
29d0162
updated cmip6 ref tutorial
norlandrhagen Apr 12, 2023
970ebe2
added context in recipe type selection
norlandrhagen Apr 12, 2023
0ef13db
updated docstring in openers/open_with_kerchunk
norlandrhagen Apr 12, 2023
8c7dd6c
updated scan_grib reference opener with filename
norlandrhagen Apr 12, 2023
23874b8
GRIB2 based reference tutorial [WIP]
norlandrhagen Apr 12, 2023
95ed34b
add grib test based on kerchunk grib test
cisaacstern Apr 13, 2023
d2c5121
start of kerchunk unit testing
norlandrhagen Apr 14, 2023
8244125
latest version of grib/reference notebook failing due to path issue
norlandrhagen Apr 14, 2023
1382f1b
updated python->python3 for start_http_server
norlandrhagen Apr 18, 2023
b2123ca
added remote protocol in open_with_kerchunk
norlandrhagen Apr 18, 2023
74c3ce9
updates to ref/grib tutorial
norlandrhagen Apr 18, 2023
3694823
added tests for openwithkerchunk
norlandrhagen Apr 18, 2023
9a337ee
added pass for zarr filetype in test_OPenWithKerchunk_direct
norlandrhagen Apr 19, 2023
0c12f4a
updated kerchunk for grib2 fix
norlandrhagen Apr 19, 2023
e6037d0
start of test test_inline_threshold - should be moved to test_openers
norlandrhagen Apr 21, 2023
46da58d
added missing inline_threshold arg
norlandrhagen Apr 21, 2023
fcc6de3
moved inline_thresh test to test_openers & shared fixtures to conftest
norlandrhagen Apr 24, 2023
555be04
factor out url_or_file_obj preprocessor into standalone func
cisaacstern Apr 24, 2023
f1ff239
grib integration test first commit
cisaacstern Apr 25, 2023
e542c3d
test_grib integration cont
cisaacstern Apr 25, 2023
0005138
get multi message grib test to pass
cisaacstern Apr 26, 2023
51f423b
make drop_keys default behavior in OpenWithKerchunk
cisaacstern Apr 26, 2023
b0d1357
fix tests for changes in kerchunk opener
cisaacstern Apr 26, 2023
216f91d
if url includes remote_protocol, don't re-add it
cisaacstern Apr 26, 2023
07262eb
rename test_grib test for specificity
cisaacstern Apr 26, 2023
bb47200
add translate and maybe_eager_combine methods to mzz combinefn
cisaacstern Apr 27, 2023
f846885
pass eager_combine param in first grib integration test
cisaacstern Apr 27, 2023
4d79800
revert methods on mzz combiner, use 'precombine_inputs' as param name
cisaacstern Apr 27, 2023
e1c9373
fix first hrrr test with precombine_inputs param name
cisaacstern Apr 27, 2023
a2f979f
finishing touches on second hrrr integration test
cisaacstern Apr 27, 2023
e2ed607
fix combinerefs test for change in combinerefs api
cisaacstern Apr 27, 2023
537d964
Merge remote-tracking branch 'origin/beam-refactor' into beam-kerchunk
cisaacstern Apr 27, 2023
614b55a
add integration test workflow
cisaacstern Apr 27, 2023
d3f8ef0
in integration tests, move if block to prepare-env job
cisaacstern Apr 27, 2023
f5a9a78
use kerchunk_open_kwargs instead of explicit kws on kerchunk opener
cisaacstern Apr 28, 2023
353ea4a
fix docstrings + arg order in transforms
cisaacstern Apr 28, 2023
fe0010e
add docstring to pythia hrrr integration test
cisaacstern Apr 28, 2023
e37da6a
move narrative comments on reference recipes to user guide section
cisaacstern Apr 28, 2023
6e27e8f
upgrade myst-nb to try to fix docs
cisaacstern Apr 28, 2023
361437b
upgrade myst-parser to match myst-nb upgrade
cisaacstern Apr 28, 2023
9dcb30a
upgrade sphinx to fix docs build
cisaacstern Apr 28, 2023
16e6790
readthedocs use py39
cisaacstern Apr 28, 2023
f1ec41b
revert docs reqs changes
cisaacstern Apr 28, 2023
6d0d697
update ref tutorials for the latest reference api
cisaacstern Apr 28, 2023
8e6deef
update ref tutorials for the latest reference api
cisaacstern Apr 28, 2023
63 changes: 63 additions & 0 deletions .github/workflows/test-integration.yaml
@@ -0,0 +1,63 @@
+name: Integration tests
+
+on:
+  push:
+    branches: [ "beam-refactor" ] # FIXME: change to default branch post-merge
+  pull_request:
+    branches: [ "beam-refactor" ] # FIXME: change to default branch post-merge
+    types: [ opened, reopened, synchronize, labeled ]
+
+env:
+  PYTEST_ADDOPTS: "--color=yes"
+
+jobs:
+  prepare-env:
+    # run on:
+    # - all pushes to specified branch(es)
+    # - a PR was just labeled 'test-integration'
+    # - a PR with 'test-integration' label was opened, reopened, or synchronized
+    if: |
+      github.event_name == 'push' ||
+      github.event.label.name == 'test-integration' ||
+      contains( github.event.pull_request.labels.*.name, 'test-integration')
+    uses: ./.github/workflows/prepare-env.yaml
+  integration-tests:
+    needs: prepare-env
+    runs-on: ubuntu-latest
+    strategy:
+      fail-fast: false
+      matrix:
+        python-version: ["3.9", "3.10"]
+        dependencies: ["releases-only", "upstream-dev"]
+    steps:
+      - uses: actions/checkout@v2
+
+      # generic steps to load env from cache
+      - name: 🎯 Set cache number
+        id: cache-number
+        # cache will last 3 days by default
+        run: echo CACHE_NUMBER=`expr $(date +'%j') / 3` >> $GITHUB_ENV
+      - name: 🎯 Set environment file
+        id: env-file
+        run: echo "env_file=ci/py${{ matrix.python-version }}.yml" >> $GITHUB_ENV
+      - uses: actions/cache@v2
+        name: 🗃 Loaded Cached environment
+        with:
+          path: /usr/share/miniconda3/envs/pangeo-forge-recipes
+          key: ${{ runner.os }}-conda-${{ matrix.python-version }}-${{ hashFiles( env.env_file ) }}-${{ matrix.dependencies }}-${{ env.CACHE_NUMBER }}
+        id: conda-cache
+      - name: 🤿 Bail out if no cache hit
+        if: steps.conda-cache.outputs.cache-hit != 'true'
+        run: false
+      - name: 🎯 Set path to include conda python
+        run: echo "/usr/share/miniconda3/envs/pangeo-forge-recipes/bin" >> $GITHUB_PATH
+
+      # custom testing steps unique to this workflow
+      - name: 🌈 Install pangeo-forge-recipes package
+        shell: bash -l {0}
+        run: |
+          python -m pip install --no-deps -e .
+      - name: 🏄‍♂️ Run Tests
+        shell: bash -l {0}
+        run: |
+          pytest --timeout=600 tests-integration/ -v
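The trigger condition and the cache-number step in this workflow can be restated in Python to make the logic easier to check. This is only an illustrative sketch: `should_run` and `cache_number` are hypothetical helper names that mirror the `if:` expression on `prepare-env` and the `expr $(date +'%j') / 3` shell step. They are not part of the workflow or of GitHub Actions.

```python
from datetime import date
from typing import Iterable, Optional

TRIGGER_LABEL = "test-integration"


def should_run(event_name: str, added_label: Optional[str], pr_labels: Iterable[str]) -> bool:
    """Mirror the `if:` expression on the prepare-env job (hypothetical helper)."""
    return (
        event_name == "push"  # every push to the configured branch(es)
        or added_label == TRIGGER_LABEL  # the PR was just given the label
        or TRIGGER_LABEL in list(pr_labels)  # a labeled PR was opened, reopened, or synchronized
    )


def cache_number(day: date) -> int:
    """Mirror `expr $(date +'%j') / 3`: integer-divide the day of the year by 3."""
    return day.timetuple().tm_yday // 3


# A push always runs; a PR without the label does not.
print(should_run("push", None, []))  # True
print(should_run("pull_request", None, ["docs"]))  # False

# Adjacent days can map to the same number, so the cache key stays stable across them.
print(cache_number(date(2023, 4, 27)), cache_number(date(2023, 4, 28)))  # 39 39
```

Because the cache number only changes every third day of the year, a cached environment is reused for at most three days before the key changes.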
10 changes: 8 additions & 2 deletions .readthedocs.yml
@@ -3,8 +3,14 @@ version: 2
 sphinx:
   configuration: docs/conf.py

-# Optionally set the version of Python and requirements required to build your docs
+build:
+  os: ubuntu-22.04
+  tools:
+    python: "3.9"
+
 python:
-  version: 3.8
   install:
+    # Install package too, so autodoc works
+    - method: pip
+      path: .
     - requirements: docs/requirements.txt
4 changes: 2 additions & 2 deletions ci/py3.10.yml
@@ -7,7 +7,7 @@ dependencies:
   - apache-beam
   - black
   - boto3
-  - cfgrib<0.9.9.0
+  - cfgrib
   - cftime
   - codecov
   - dask
@@ -20,7 +20,7 @@ dependencies:
   - hdf5
   - intake
   - intake-xarray
-  - kerchunk>=0.0.6
+  - kerchunk>=0.1.1
   - lxml # Optional dep of pydap
   - matplotlib # needed for building tutorial notebooks
   - netcdf4
4 changes: 2 additions & 2 deletions ci/py3.9.yml
@@ -7,7 +7,7 @@ dependencies:
   - apache-beam
   - black
   - boto3
-  - cfgrib<0.9.9.0
+  - cfgrib
   - cftime
   - codecov
   - dask
@@ -20,7 +20,7 @@ dependencies:
   - hdf5
   - intake
   - intake-xarray
-  - kerchunk>=0.0.6
+  - kerchunk>=0.1.1
   - lxml # Optional dep of pydap
   - matplotlib # needed for building tutorial notebooks
   - netcdf4
36 changes: 30 additions & 6 deletions docs/pangeo_forge_recipes/recipe_user_guide/recipes.md
@@ -93,13 +93,37 @@ All recipes need a place to store the target dataset. Refer to {doc}`storage` fo
 Once your recipe is defined and has its storage targets assigned, you're ready to
 move on to {doc}`execution`.

-### HDF Reference Recipes
+### Reference Recipes

-Like the Xarray to Zarr recipes, this category allows us to more efficiently access data from a bunch of NetCDF / HDF files.
-However, such a recipe does not actually copy the original source data.
-Instead, it generates metadata files which reference and index the original data, allowing it to be accessed more quickly and easily.
-For more background, see [this blog post](https://medium.com/pangeo/fake-it-until-you-make-it-reading-goes-netcdf4-data-on-aws-s3-as-zarr-for-rapid-data-access-61e33f8fe685).
+Like the Xarray to Zarr recipes, this category of recipes allows us to efficiently access data from a
+collection of source files. Unlike the standard Zarr recipes, these reference recipes utilize
+[kerchunk](https://fsspec.github.io/kerchunk/) to generate metadata files which reference and index the
+original data, allowing it to be accessed more quickly and easily, without duplicating it.

-There is currently one tutorial for this recipe:
+Whereas the standard Zarr recipe creates a copy of the original dataset in the Zarr format, the
+kerchunk-based reference recipe does not copy the data and instead creates a Kerchunk mapping, which
+allows archival formats (including NetCDF, GRIB2, etc.) to be read as if they were Zarr datasets.
+More details about how Kerchunk works can be found in the
+[kerchunk docs](https://fsspec.github.io/kerchunk/detail.html)
+and [this blog post](https://medium.com/pangeo/fake-it-until-you-make-it-reading-goes-netcdf4-data-on-aws-s3-as-zarr-for-rapid-data-access-61e33f8fe685).
+
+There are currently two tutorials for reference recipes:

 - {doc}`../tutorials/hdf_reference/reference_cmip6`
+- {doc}`../tutorials/grib_reference/reference_HRRR`
+
+When choosing whether to create a reference recipe, it is important to consider questions such as:
+
+_**Where are the archival (i.e. source) files for this dataset currently stored?**_ If the original data
+are not already in the cloud (or some other high-bandwidth storage device, such as an on-prem data
+center), the performance benefits of using a reference recipe may be limited, because network speeds
+to access the original data will constrain I/O throughput.
+
+_**Does this dataset require preprocessing?**_ With reference recipes, modification of the underlying
+data is not possible. For example, the chunking schema of a dataset cannot be modified with Kerchunk, so
+you are limited to the chunk schema of the archival data. If you need to optimize your dataset's chunking
+schema for space or time, the standard Zarr recipe is the only option. While you cannot modify chunking
+in a reference recipe, changes in the metadata (attributes, encoding, etc.) can be applied.
+
+These caveats aside, for archival data stored on high-throughput storage devices, for which
+preprocessing is not required, reference recipes are an ideal and storage-efficient option.
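
To make the idea of a reference mapping concrete, here is a minimal, standard-library-only sketch of how a set of references can point at byte ranges inside an archival file instead of copying them. It does not use kerchunk or pangeo-forge-recipes. The `refs` layout only imitates the general shape of a kerchunk reference set (a key mapped either to inline content or to a `[path, offset, length]` triple), and the file contents, key names, and the `read_key` helper are all hypothetical. Real archival chunks are usually compressed and encoded, which this sketch ignores.

```python
import json
import os
import tempfile

# Stand-in for an archival file: two 4-byte "chunks" after an 8-byte header (made-up layout).
payload = b"HEADER__" + b"AAAA" + b"BBBB"
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as f:
    f.write(payload)
    archive_path = f.name

# Reference set: metadata is stored inline, chunk keys point at byte ranges in the source file.
refs = {
    "temp/.zattrs": json.dumps({"units": "K"}),
    "temp/0": [archive_path, 8, 4],
    "temp/1": [archive_path, 12, 4],
}


def read_key(refs: dict, key: str) -> bytes:
    """Resolve a key: inline strings are returned as bytes, triples are read from the source file."""
    ref = refs[key]
    if isinstance(ref, str):
        return ref.encode()
    path, offset, length = ref
    with open(path, "rb") as src:
        src.seek(offset)
        return src.read(length)


# Metadata can be edited in the reference set without touching the archival file...
attrs = json.loads(read_key(refs, "temp/.zattrs"))
attrs["long_name"] = "air temperature"
refs["temp/.zattrs"] = json.dumps(attrs)

# ...but chunk boundaries are fixed by where the bytes already sit in that file.
chunk0 = read_key(refs, "temp/0")
chunk1 = read_key(refs, "temp/1")
print(chunk0, chunk1)  # b'AAAA' b'BBBB'

os.remove(archive_path)  # clean up the temporary stand-in file
```

The sketch reflects the two caveats above: every chunk read goes back to the original storage location, and only the inline metadata entries can be changed.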