
[Beam-Refactor] Edits to intro_tutorial_part[*] #485

Conversation

norlandrhagen
Contributor

Hi there,

I tried running the beam-refactor intro_tutorial docs and ran into a few issues. I made some edits to the docs (removing old imports, etc.); however, in part 3, for the example recipe.py, I'm not sure what to specify for target_path. Should the StoreToZarr transform be removed in a production recipe?
Are there any examples of recipes that have been successfully run using the Beam runner?

Thanks!

@rabernat or @cisaacstern

import apache_beam as beam

# Imports assumed from the tutorial's recipe.py context:
from pangeo_forge_recipes.transforms import (
    OpenURLWithFSSpec,
    OpenWithXarray,
    StoreToZarr,
)

transforms = (
    beam.Create(pattern.items())
    | OpenURLWithFSSpec()
    | OpenWithXarray(file_type=pattern.file_type)
    | StoreToZarr(
        target=target_path,
        combine_dims=pattern.combine_dim_keys,
    )
)
transforms


@cisaacstern
Member

in part 3, for the example recipe.py, I'm not sure what to specify for target_path. Should the StoreToZarr transform be removed in a production recipe?

Thanks a lot @norlandrhagen. You're asking exactly the right question here. In pangeo-forge/pangeo-forge-runner#48, the working idea is that StoreToZarr will not be removed for a production recipe, however the target kwarg will be omitted from it (and then dynamically injected by the backend service).
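To make the "dynamically injected" idea concrete, here is a minimal sketch of what injection might look like. This is purely illustrative: `inject_target` is a hypothetical helper, not part of the actual pangeo-forge-runner API, and the kwarg names are assumptions based on the snippet above.

```python
# Hypothetical sketch of the dynamic-injection idea from
# pangeo-forge/pangeo-forge-runner#48: the recipe author omits `target`
# from StoreToZarr, and the backend service fills it in before running.
def inject_target(store_to_zarr_kwargs, backend_target):
    """Return kwargs with the backend-provided target merged in.

    A user-supplied target, if present, is left untouched.
    """
    kwargs = dict(store_to_zarr_kwargs)
    kwargs.setdefault("target", backend_target)
    return kwargs

# A recipe that omitted `target` gets one injected by the backend:
injected = inject_target({"combine_dims": ["time"]}, "s3://bucket/store.zarr")
```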

Something @yuvipanda has said for a while, though, is that we should probably eventually forgo manually running recipes via the recipe objects themselves, in favor of a unified CLI that handles both production deployment and manual test runs. So if that tool were called, as Yuvi once suggested to me, simply pf, then the docs would eventually recommend that during manual testing the user simply do something like

$ pf run recipe.py --target=/local/dir
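Since no such CLI exists yet, here is a minimal sketch of what a pf run entry point might parse. Everything here is hypothetical: the pf command, the run subcommand, and the --target flag are Yuvi's suggestion plus assumptions, not a real tool.

```python
import argparse

def parse_pf_run(argv):
    # Hypothetical `pf run` argument parsing: a recipe module path plus
    # a --target override for manual testing runs.
    parser = argparse.ArgumentParser(prog="pf")
    sub = parser.add_subparsers(dest="command", required=True)
    run = sub.add_parser("run", help="run a recipe against a target")
    run.add_argument("recipe", help="path to recipe.py")
    run.add_argument("--target", required=True, help="storage target")
    return parser.parse_args(argv)

# e.g. `pf run recipe.py --target=/local/dir`:
args = parse_pf_run(["run", "recipe.py", "--target=/local/dir"])
```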

At the risk of "mission creep" and piling further blockers onto the goal of getting beam-refactor merged, I do think we may want to ask ourselves at this point what minimal feature set is required, not just to merge beam-refactor, but to make it a user-facing API. The latter is a heavier lift, and I think it does require taking a closer look at including this unified pf CLI as part of the requirements.

Per @sharkinsspatial's comment at yesterday's meeting, API churn makes it challenging for developers to contribute, and also I'd say it limits user adoption, insofar as a constantly changing API makes it hard to become a long-term dedicated user. In that vein, I would be hesitant to, for the sake of expedience, release beam-refactor docs which suggest a certain type of recipe testing, only to, shortly thereafter, change that recommendation.

This recasts your question in a broader frame, which perhaps is unhelpful to the more specific question of how you can be helpful immediately. In terms of actual documentation rewrites today, I'd say the notebooks in https://github.com/pangeo-forge/pangeo-forge-recipes/tree/master/docs/pangeo_forge_recipes/tutorials/xarray_zarr present an easier entry point because they do not touch this (harder) question of the relationship to the Pangeo Forge Cloud contribution flow.

@derekocallaghan
Contributor

This recasts your question in a broader frame, which perhaps is unhelpful to the more specific question of how you can be helpful immediately. In terms of actual documentation rewrites today, I'd say the notebooks in https://github.com/pangeo-forge/pangeo-forge-recipes/tree/master/docs/pangeo_forge_recipes/tutorials/xarray_zarr present an easier entry point because they do not touch this (harder) question of the relationship to the Pangeo Forge Cloud contribution flow.

Hi @cisaacstern, I created #483 last week which is relevant for your last point. I was hoping to add a new CCMP recipe tutorial here that includes writing a custom processor, i.e. a replacement for the process_input approach. I have a recipe working locally so it's not much work to provide a notebook version. I guess the API around this isn't fully settled yet but maybe it's easier to refine this with a working example.

@rabernat
Contributor

rabernat commented Feb 1, 2023

If you're running PFR outside of PF Cloud, you will always have to provide a target for the pipeline to store data to.

The way we handle this now is by automatically creating a temporary directory for the outputs if it is not specified. That's why there is no storage_target in the XarrayZarrRecipe at https://pangeo-forge.readthedocs.io/en/latest/pangeo_forge_recipes/tutorials/xarray_zarr/netcdf_zarr_sequential.html#step-3-pick-a-recipe-class.

In retrospect, I'm not sure this is a good idea. It makes the tutorials simpler, at the expense of user confusion.
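The fallback described above could be sketched roughly like this. Note that `resolve_target` is a hypothetical helper written to mirror the described behavior, not the actual pangeo-forge-recipes implementation, and the `pangeo-forge-` prefix is an assumption.

```python
import tempfile

def resolve_target(target_path=None):
    # If the user gave no target, fall back to a throwaway temporary
    # directory, mirroring the implicit XarrayZarrRecipe behavior
    # described above (simpler tutorials, at the cost of confusion
    # about where the data actually ends up).
    if target_path is None:
        target_path = tempfile.mkdtemp(prefix="pangeo-forge-")
    return target_path
```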

@norlandrhagen - I'm curious what you would like to have happen. Where would you like your data to end up for this tutorial?

@norlandrhagen
Contributor Author

Thanks for the detailed explanation @cisaacstern!

Just to clarify my understanding a bit.

  • For the [beam] production recipe.py files, the script would look something like this (target kwarg removed), where the target would be injected by Pangeo Forge Cloud (pfc).
transforms = (
    beam.Create(pattern.items())
    | OpenURLWithFSSpec()
    | OpenWithXarray(file_type=pattern.file_type)
    | StoreToZarr(
        combine_dims=pattern.combine_dim_keys,
    )
)
transforms

@rabernat In part 2 of the tutorial, a temporary directory is specified for the target, which seems like a big improvement transparency-wise compared to the XarrayZarrRecipe. However, in part 3, which I think demonstrates how to create a production recipe.py, it isn't obvious what you should declare for a target. I figure it would be good to explain that the target kwarg isn't needed for production?

Thanks again for the information! I haven't looked around the beam-refactor branch much at all, so I have a lot of learning to do.

@cisaacstern
Member

For the [beam] production recipe.py files, the script would look something like this (target kwarg removed), where the target would be injected by Pangeo Forge Cloud.

Yes. With the caveat that pangeo-forge/pangeo-forge-runner#48 (which defines this behavior) hasn't been merged yet, so is still subject to change.

the example notebooks (https://github.com/pangeo-forge/pangeo-forge-recipes/tree/master/docs/pangeo_forge_recipes/tutorials/xarray_zarr) would be better to work on because they all write to local storage and don't deal with how pfc will handle the target / don't demonstrate creating a production recipe.py file?

Also yes. And sounds like the only caveat here is just to coordinate with @derekocallaghan on #483 to avoid duplicating effort.

@derekocallaghan
Contributor

derekocallaghan commented Feb 2, 2023

I've created #487 today that contains an initial port of the NetCDF Zarr Multi-Variable Sequential Recipe: NOAA World Ocean Atlas notebook to use the beam-refactor approach. Although it contains references to the Xarray-to-Zarr Sequential Recipe: NOAA OISST notebook, I haven't ported this yet as it contains functionality which may no longer be provided by the beam-refactor branch. I'm going to look at this next.

In the meantime, the ported NOAA World Ocean Atlas notebook in #487 may be useful as an example for porting other tutorial notebooks. It's possible these may take a number of iterations depending on API changes.
