
[Beam-Refactor] Edits to intro_tutorial_part[*] #485

Conversation

norlandrhagen
Contributor

Hi there,

I tried running the beam-refactor intro_tutorial docs and ran into a few issues. I made some edits to the docs (removing old imports, etc.); however, in part 3, for the example recipe.py, I'm not sure what to specify for target_path. Should the StoreToZarr transform be removed in a production recipe?
Are there any examples of recipes that have been successfully run using the Beam runner?

Thanks!

@rabernat or @cisaacstern

import apache_beam as beam

# Imports assumed from the tutorial's recipe.py context:
from pangeo_forge_recipes.transforms import (
    OpenURLWithFSSpec,
    OpenWithXarray,
    StoreToZarr,
)

transforms = (
    beam.Create(pattern.items())
    | OpenURLWithFSSpec()
    | OpenWithXarray(file_type=pattern.file_type)
    | StoreToZarr(
        target=target_path,
        combine_dims=pattern.combine_dim_keys,
    )
)
transforms


@cisaacstern
Member

in part 3, for the example recipe.py, I'm not sure what to specify for target_path. Should the StoreToZarr transform be removed in a production recipe?

Thanks a lot @norlandrhagen. You're asking exactly the right question here. In pangeo-forge/pangeo-forge-runner#48, the working idea is that StoreToZarr will not be removed for a production recipe, however the target kwarg will be omitted from it (and then dynamically injected by the backend service).
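To make the "dynamically injected" idea concrete, here is a minimal sketch of what injection might look like. This is purely illustrative: `inject_target` is a hypothetical helper, not part of the actual pangeo-forge-runner API, and the kwarg names are assumptions based on the snippet above.

```python
# Hypothetical sketch of the dynamic-injection idea from
# pangeo-forge/pangeo-forge-runner#48: the recipe author omits `target`
# from StoreToZarr, and the backend service fills it in before running.
def inject_target(store_to_zarr_kwargs, backend_target):
    """Return kwargs with the backend-provided target merged in.

    A user-supplied target, if present, is left untouched.
    """
    kwargs = dict(store_to_zarr_kwargs)
    kwargs.setdefault("target", backend_target)
    return kwargs

# A recipe that omitted `target` gets one injected by the backend:
injected = inject_target({"combine_dims": ["time"]}, "s3://bucket/store.zarr")
```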

Something @yuvipanda has said for a while, though, is that we should probably eventually forgo manually running recipes via the recipe objects themselves, in favor of a unified CLI that handles both production deployment and manual test runs. So if that tool were called, as Yuvi once suggested to me, simply pf, then the docs would eventually recommend that during manual testing the user simply do something like

$ pf run recipe.py --target=/local/dir
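Since no such CLI exists yet, here is a minimal sketch of what a pf run entry point might parse. Everything here is hypothetical: the pf command, the run subcommand, and the --target flag are Yuvi's suggestion plus assumptions, not a real tool.

```python
import argparse

def parse_pf_run(argv):
    # Hypothetical `pf run` argument parsing: a recipe module path plus
    # a --target override for manual testing runs.
    parser = argparse.ArgumentParser(prog="pf")
    sub = parser.add_subparsers(dest="command", required=True)
    run = sub.add_parser("run", help="run a recipe against a target")
    run.add_argument("recipe", help="path to recipe.py")
    run.add_argument("--target", required=True, help="storage target")
    return parser.parse_args(argv)

# e.g. `pf run recipe.py --target=/local/dir`:
args = parse_pf_run(["run", "recipe.py", "--target=/local/dir"])
```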

At the risk of "mission creep" and piling further blockers onto the goal of getting beam-refactor merged, I do think we may want to ask ourselves at this point what minimal feature set is required, not just to merge beam-refactor, but to make it a user-facing API. The latter is a heavier lift, and I think it does require taking a closer look at including this unified pf CLI as part of the requirements.

Per @sharkinsspatial's comment at yesterday's meeting, API churn makes it challenging for developers to contribute, and also I'd say it limits user adoption, insofar as a constantly changing API makes it hard to become a long-term dedicated user. In that vein, I would be hesitant to, for the sake of expedience, release beam-refactor docs which suggest a certain type of recipe testing, only to, shortly thereafter, change that recommendation.

This recasts your question in a broader frame, which perhaps is unhelpful to the more specific question of how you can be helpful immediately. In terms of actual documentation rewrites today, I'd say the notebooks in https://github.com/pangeo-forge/pangeo-forge-recipes/tree/master/docs/pangeo_forge_recipes/tutorials/xarray_zarr present an easier entry point because they do not touch this (harder) question of the relationship to the Pangeo Forge Cloud contribution flow.

@derekocallaghan
Contributor

This recasts your question in a broader frame, which perhaps is unhelpful to the more specific question of how you can be helpful immediately. In terms of actual documentation rewrites today, I'd say the notebooks in https://github.com/pangeo-forge/pangeo-forge-recipes/tree/master/docs/pangeo_forge_recipes/tutorials/xarray_zarr present an easier entry point because they do not touch this (harder) question of the relationship to the Pangeo Forge Cloud contribution flow.

Hi @cisaacstern, I created #483 last week which is relevant for your last point. I was hoping to add a new CCMP recipe tutorial here that includes writing a custom processor, i.e. a replacement for the process_input approach. I have a recipe working locally so it's not much work to provide a notebook version. I guess the API around this isn't fully settled yet but maybe it's easier to refine this with a working example.

@rabernat
Contributor

rabernat commented Feb 1, 2023

If you're running PFR outside of PF Cloud, you will always have to provide a target for the pipeline to store data to.

The way we handle this now is by automatically creating a temporary directory for the outputs if it is not specified. That's why there is no storage_target in the XarrayZarrRecipe at https://pangeo-forge.readthedocs.io/en/latest/pangeo_forge_recipes/tutorials/xarray_zarr/netcdf_zarr_sequential.html#step-3-pick-a-recipe-class.

In retrospect, I'm not sure this is a good idea. It makes the tutorials simpler, at the expense of user confusion.
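The fallback described above could be sketched roughly like this. Note that `resolve_target` is a hypothetical helper written to mirror the described behavior, not the actual pangeo-forge-recipes implementation, and the `pangeo-forge-` prefix is an assumption.

```python
import tempfile

def resolve_target(target_path=None):
    # If the user gave no target, fall back to a throwaway temporary
    # directory, mirroring the implicit XarrayZarrRecipe behavior
    # described above (simpler tutorials, at the cost of confusion
    # about where the data actually ends up).
    if target_path is None:
        target_path = tempfile.mkdtemp(prefix="pangeo-forge-")
    return target_path
```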

@norlandrhagen - I'm curious what you would like to have happen. Where would you like your data to end up for this tutorial?

@norlandrhagen
Contributor Author

Thanks for the detailed explanation @cisaacstern!

Just to clarify my understanding a bit.

  • For the [beam] production recipe.py files, the script would look something like this (target kwarg removed), where the target would be injected by Pangeo Forge Cloud (pfc).
transforms = (
    beam.Create(pattern.items())
    | OpenURLWithFSSpec()
    | OpenWithXarray(file_type=pattern.file_type)
    | StoreToZarr(
        combine_dims=pattern.combine_dim_keys,
    )
)
transforms

@rabernat In part 2 of the tutorial, a temporary directory is specified for the target, which seems like a big improvement transparency-wise compared to the XarrayZarrRecipe. However, in part 3, which I think demonstrates how to create a production recipe.py, it isn't obvious what you should declare for a target. I figure it would be good to explain that the target kwarg isn't needed for production?

Thanks again for the information! I haven't looked around the beam-refactor branch much at all, so I have a lot of learning to do.

@cisaacstern
Member

For the [beam] production recipe.py files, the script would look something like this (target kwarg removed), where the target would be injected by Pangeo Forge Cloud.

Yes. With the caveat that pangeo-forge/pangeo-forge-runner#48 (which defines this behavior) hasn't been merged yet, so is still subject to change.

the example notebooks (https://github.com/pangeo-forge/pangeo-forge-recipes/tree/master/docs/pangeo_forge_recipes/tutorials/xarray_zarr) would be better to work on because they all write to local storage and don't deal with how pfc will handle the target / don't demonstrate creating a production recipe.py file?

Also yes. And sounds like the only caveat here is just to coordinate with @derekocallaghan on #483 to avoid duplicating effort.

@derekocallaghan
Contributor

derekocallaghan commented Feb 2, 2023

I've created #487 today that contains an initial port of the NetCDF Zarr Multi-Variable Sequential Recipe: NOAA World Ocean Atlas notebook to use the beam-refactor approach. Although it contains references to the Xarray-to-Zarr Sequential Recipe: NOAA OISST notebook, I haven't ported this yet as it contains functionality which may no longer be provided by the beam-refactor branch. I'm going to look at this next.

In the meantime, the ported NOAA World Ocean Atlas notebook in #487 may be useful as an example for porting other tutorial notebooks. It's possible these may take a number of iterations depending on API changes.
