Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dynamic chunking interface for StoreToZarr #595

Merged
merged 28 commits into from
Nov 15, 2023
Merged

Conversation

cisaacstern
Copy link
Member

Closes #572
Supersedes #546

@jbusecke, can you test drive this? You should be able to repurpose your work in #546 as follows:

from pangeo_forge_recipes.transforms import StoreToZarr

from dynamic_target_chunks import dynamic_target_chunks_from_schema

kws = dict(
    target_chunk_size=...,
    target_chunks_aspect_ratio=...,
    size_tolerance=...,
    default_ratio=...,
    allow_extra_dims=...,
    allow_fallback_algo=...,
)

recipe = (
    ...
    | StoreToZarr(
        ...,
        dynamic_chunking_fn=dynamic_target_chunks_from_schema,
        dynamic_chunking_fn_kwargs=kws,
    )
)

In order to import dynamic_target_chunks_from_schema from a local module, you would need pangeo-forge/pangeo-forge-runner#92 (feel free to work on that! 😄 ). Alternatively, you could:

  • tired 🥱 define dynamic_target_chunks_from_schema at the top of your recipe module
  • wired 🤪 release a package containing dynamic_target_chunks_from_schema on PyPI and add it to requirements.txt
  • inspired 🧘 push a package containing dynamic_target_chunks_from_schema to GitHub and install in in requirements.txt with git+.... This seems like the best short-term compromise between effort + effectiveness.

@cisaacstern cisaacstern requested a review from jbusecke August 31, 2023 22:18
@cisaacstern
Copy link
Member Author

cisaacstern commented Sep 1, 2023

@jbusecke per your suggestion, changed this so the dynamic_chunking_fn takes an xarray template dataset as its first argument. As we discussed, this will be much more familiar to users and also avoids possible circular dependency problems of the dynamic chunking package(s) relying on pangeo-forge-recipes utilities. Good improvement!

@cisaacstern
Copy link
Member Author

@jbusecke could I get your help testing this on https://github.com/leap-stc/cmip6-leap-feedstock? I just started trying to do that myself, but realized I didn't understand how the recipe works there.

@jbusecke
Copy link
Contributor

Its on my list for sure! Just very busy these last few days with other stuff. Ill try to prioritize it in the next two weeks?

@jbusecke jbusecke self-assigned this Sep 27, 2023
@rsignell
Copy link

@jbusecke I'm guessing life hasn't gotten any less busy, am I right? 😬

@cisaacstern
Copy link
Member Author

@rsignell, @jbusecke and I are working together in-person this week and merging this is at the top of our to-do list!

@rsignell
Copy link

@cisaacstern thanks for the update! I saw @jbusecke has an autoreply on his email that says the next two weeks are devoted to pangeo-forge, so good luck and thanks for your great work!

@jbusecke
Copy link
Contributor

This looks awesome. I added an integration test to make sure that the chunking function does serialize properly within beam. It passes for me locally!

@jbusecke jbusecke added the test-integration Apply this label to run integration tests on a PR. label Nov 15, 2023
@jbusecke
Copy link
Contributor

Ok so I think this works, but there is an issue with -runner overwriting the local PR version of -recipes.

I have done an A/B test modfing the test-integrate.yaml:

  • This works (by effectively dropping the -recipes from the -runner requirements):
- name: 🌈 Install pangeo-forge-recipes & pangeo-forge-runner
        shell: bash -l {0}
        run: |
           python -m pip install -e ".[test,minio]"
           python -m pip install ${{ matrix.runner-version }} --no-deps
          python -m pip install jupyter-repo2docker ruamel.yaml escapism jsonschema traitlets importlib-metadata
  • but this does not:
- name: 🌈 Install pangeo-forge-recipes & pangeo-forge-runner
        shell: bash -l {0}
        run: |
          python -m pip install -e ".[test,minio]"
          python -m pip install ${{ matrix.runner-version }}

This fails with

FAILED tests/test_integration.py::test_integration[local_confpath-noaa-oisst] - assert 1 == 0
 +  where 1 = CompletedProcess(args=['sh', '/home/runner/work/pangeo-forge-recipes/pangeo-forge-recipes/examples/runner-commands/bake.sh'], returncode=1, stdout='Target Storage is FSSpecTarget(LocalFileSystem(, root_path="/tmp/pytest-of-runner/pytest-0/tmp0/target")\n\nInput Cache Storage is CacheFSSpecTarget(LocalFileSystem(, root_path="/tmp/pytest-of-runner/pytest-0/tmp0/cache")\n\nMetadata Cache Storage is MetadataTarget(AbstractFileSystem(, root_path="")\n\nParsing recipes...\nTraceback (most recent call last):\n  File "/opt/hostedtoolcache/Python/3.9.18/x64/bin/pangeo-forge-runner", line 8, in <module>\n    sys.exit(main())\n  File "/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/pangeo_forge_runner/cli.py", line 28, in main\n    app.start()\n  File "/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/pangeo_forge_runner/cli.py", line 23, in start\n    super().start()\n  File "/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/traitlets/config/application.py", line 474, in start\n    return self.subapp.start()\n  File "/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/pangeo_forge_runner/commands/bake.py", line 201, in start\n    recipes = feedstock.parse_recipes()\n  File "/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/pangeo_forge_runner/feedstock.py", line 81, in parse_recipes\n    recipes[r["id"]] = self._import(r["object"])\n  File "/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/pangeo_forge_runner/feedstock.py", line 66, in _import\n    exec(compile(source=rewritten_ast, filename=filename, mode="exec"), ns)\n  File "/home/runner/work/pangeo-forge-recipes/pangeo-forge-recipes/examples/feedstock/gpcp_from_gcs_dynamic_chunks.py", line 46, in <module>\n    | StoreToZarr(\nTypeError: __init__() got an unexpected keyword argument \'dynamic_chunking_fn\'\n', stderr='').returncode
!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!

indicating that the dynamic_chunking_fn input is not present, and indeed the -runner install is collecting a different version of recipes from pypi.

Something that seems weird to me in general here is that the -recipe version built in the gh action seems incorrect. I see 0.1.dev... which would explain why -runner overwrites this.

@jbusecke
Copy link
Contributor

wired 🤪 it is

https://pypi.org/project/dynamic-chunks/

Working on implementing this with the the new plugin here.

@jbusecke
Copy link
Contributor

@cisaacstern your call if we should wait on -runner to drop the -recipes dependency or just make a comment here, open a tracking issue, and merge this bad boy!

@keewis
Copy link
Contributor

keewis commented Nov 15, 2023

Something that seems weird to me in general here is that the -recipe version built in the gh action seems incorrect.

As far as I can tell, -recipes uses setuptools_scm to dynamically determine the version from the tags, but the checkout action doesn't fetch those by default (the default is a shallow checkout). To fix that, try passing depth: 0 to actions/checkout.

Edit: as a general rule, I'd recommend installing the thing that's being tested last

Copy link
Member Author

@cisaacstern cisaacstern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking great! One testing nit, otherwise LGTM!

ds = xr.open_dataset(store, engine="zarr", chunks={})
assert ds.title == (
"Global Precipitation Climatatology Project (GPCP) " "Climate Data Record (CDR), Daily V1.3"
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbusecke let's assert that the chunking of the opened dataset is in fact what we set dynamically in with the dynamic chunking function :)

@jbusecke
Copy link
Contributor

@keewis that is actually super helpful. I was not aware of that!

Edit: as a general rule, I'd recommend installing the thing that's being tested last

Good rule to live by!

@jbusecke jbusecke merged commit 5d77d49 into main Nov 15, 2023
6 checks passed
@jbusecke jbusecke deleted the dynamic-chunks-interface branch November 15, 2023 19:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
test-integration Apply this label to run integration tests on a PR.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Define dynamic target chunks interface in StoreToZarr
4 participants