
Manually running beam-refactor on Dataflow #450

Closed
4 of 6 tasks
cisaacstern opened this issue Dec 13, 2022 · 7 comments
@cisaacstern
Member

cisaacstern commented Dec 13, 2022

As documented in the notebook linked in #445 (comment), the beam-refactor runs end-to-end for a real-world application (NOAA OISST) when run locally with Beam's DirectRunner. I'll now run the same recipe on Dataflow, to surface any DataflowRunner-specific issues in preparation for merging beam-refactor. I've opened this issue to track my work plan and record findings. Notes on how I'll be proceeding:

  • Replicate above-linked notebook with local DirectRunner
  • Reference latest pangeo-forge-runner for our current selection of Dataflow pipeline options (which we know to work, based on recent successful jobs).
  • Build and push an image to gcr.io which includes pangeo-forge-recipes installed from the beam-refactor branch; pip-installing on top of our current base container would be lighter-weight, but I prefer to push a dedicated container for this, because it more closely matches how we've been working on Dataflow thus far.
  • Deploy the above-linked NOAA OISST transforms to Dataflow using this selection of pipeline options + custom beam-refactor image. (Do this manually, using the Beam Python SDK.)

Next steps, assuming the above works:

  • Open an issue on pangeo-forge-runner documenting what will be required to support the beam-refactor branch on Pangeo Forge Cloud (main point: no more .to_beam() compilation).
  • Think more about the fastest path to allowing Use traitlets (pangeo-forge-orchestrator#197) to support multiple pangeo-forge-recipes parsing environments simultaneously on Pangeo Forge Cloud.

Updates to follow on this thread.

@cisaacstern cisaacstern self-assigned this Dec 13, 2022
@cisaacstern
Member Author

Reference latest pangeo-forge-runner for our current selection of Dataflow pipeline options (which we know to work, based on recent successful jobs).

Adapting kwargs from here:

job_name = ...           # define manually
temp_gcs_location = ...  # (TBD) GCS bucket path
container_image = ...    # (TBD) URI of the image I'll build for this experiment

opts = dict(
    runner="DataflowRunner",
    project="pangeo-forge-4967",
    job_name=job_name,
    temp_location=temp_gcs_location,
    use_public_ips=False,
    region="us-central1",
    experiments=["use_runner_v2"],
    sdk_container_image=container_image,
    save_main_session=True,
    pickle_library="cloudpickle",
    # for NOAA OISST the machine can probably be smaller,
    # but this should be fine for now
    machine_type="n1-highmem-2",
)
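
For context, these kwargs get unpacked directly into Beam's PipelineOptions when constructing the pipeline, roughly as follows (a minimal sketch; the actual deployment is recorded in a later comment on this thread):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

pipeline = beam.Pipeline(options=PipelineOptions(**opts))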

(Recording this here so others can see the thought process, and as a convenient place to keep notes for myself.)

@cisaacstern
Member Author

cisaacstern commented Dec 15, 2022

Build and push a image to gcr.io which includes pangeo-forge-recipes installed from the beam-refactor branch

I just built:

FROM pangeo/forge:5e51a29 

RUN mamba run -n notebook pip install -U git+https://github.com/pangeo-forge/pangeo-forge-recipes.git@beam-refactor

where pangeo/forge:5e51a29 is the image tag currently used to deploy jobs from orchestrator.

I then ran the example code linked in the first comment above in a Python interpreter inside a container started from this image, and confirmed that the NOAA OISST example runs successfully on it. From this point forward, we can be confident that any issues we encounter (perhaps there won't even be any 🤷) are Dataflow-specific.

Next step is pushing this image to gcr.io and then deploying the example recipe to Dataflow.

@cisaacstern
Member Author

Next step is pushing this image to gcr.io

Completed following the method used in orchestrator here.
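
For anyone reproducing this: the build-and-push step generally looks something like the following (a sketch of the standard gcr.io workflow; the exact commands used in orchestrator may differ):

$ gcloud auth configure-docker  # one-time setup so docker can push to gcr.io
$ docker build -t gcr.io/pangeo-forge-4967/beam-refactor .
$ docker push gcr.io/pangeo-forge-4967/beam-refactor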

@cisaacstern
Member Author

cisaacstern commented Dec 15, 2022

🎉 TL;DR: it works! Details below...

Deployment and worker envs must be the same, so I deployed the job from a container running the gcr.io/pangeo-forge-4967/beam-refactor image (described above):

$ docker run -it \
> -v "${DATAFLOW_KEYFILE}":"/opt/storage_key.json" \
> -e GOOGLE_APPLICATION_CREDENTIALS="/opt/storage_key.json" \
> --entrypoint=/bin/bash \
> gcr.io/pangeo-forge-4967/beam-refactor

Here, DATAFLOW_KEYFILE is the path to a key file for a GCP service account with permission to deploy Dataflow jobs.

Within the running container, I:

  1. Started a Python interpreter and created the pruned OISST pattern as demonstrated in the above-linked notebook.

  2. Set target_path as:

    target_path = "gs://beam-dataflow-test/beam-refactor-oisst-0.zarr"

    where gs://beam-dataflow-test is a scratch bucket in our GCP project.

  3. Defined the pipeline options kwargs as:

    job_name = "beam-refactor-oisst-0"
    temp_gcs_location =  "gs://beam-dataflow-test/tmp/"
    container_image = "gcr.io/pangeo-forge-4967/beam-refactor"
    opts = dict(
        runner="DataflowRunner",
        project="pangeo-forge-4967",
        job_name=job_name,
        temp_location=temp_gcs_location,
        use_public_ips=False,
        region="us-central1",
        experiments=["use_runner_v2"],
        sdk_container_image=container_image,
        save_main_session=True,
        pickle_library="cloudpickle",
        # for NOAA OISST machine can probably be smaller,
        # but this should be fine for now
        machine_type="n1-highmem-2",
        service_account_email="pangeo-forge-dataflow@pangeo-forge-4967.iam.gserviceaccount.com",
    )
  4. And deployed the job (without a context manager, so the call is non-blocking; see the note below this list):

    pipeline = beam.Pipeline(options=PipelineOptions(**opts))
    pipeline | transforms
    pipeline.run()
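
Aside on the non-blocking point above: with the DataflowRunner, pipeline.run() submits the job and returns a result object right away. A blocking variant (not what I ran here, just a sketch for reference) would be:

result = pipeline.run()
result.wait_until_finish()  # blocks until the Dataflow job completes or fails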

The job succeeded, and the dataset is openable with

import xarray as xr

# opening a gs:// path this way assumes gcsfs is installed in the environment
target_path = "gs://beam-dataflow-test/beam-refactor-oisst-0.zarr"
oisst_zarr = xr.open_dataset(target_path, engine="zarr")

I'll make some further notes below re: next steps.

cc @rabernat

@cisaacstern
Member Author

cisaacstern commented Dec 15, 2022

Now that the beam-refactor has been successfully run on Dataflow, I'll expand on the next steps from the first comment above, in search of the shortest path (one that doesn't cost us too much technical debt) from where we are now to running this same OISST pipeline on Pangeo Forge Cloud.

Getting beam-refactor to run end-to-end on Pangeo Forge Cloud is a perfect stress test for the composability of our cloud stack, and will require fast-tracking upgrades which are otherwise in our best long-term interest.

Running beam-refactor end-to-end on Pangeo Forge Cloud (without changing the current prod deployment; i.e. without making this the monolithic default) will require:

  • A way to inject target_path (and cache_path) at deploy time (as opposed to hardcoding these values, as we are currently doing). I'm going to take a stab at Custom Pipeline Options now, as that appears to be the most idiomatic Beam approach (@alxmrs, correct me if I'm wrong). As @rabernat pointed out in our last Coordination meeting, the docs for this focus on the argparse use case, but I think the PipelineOptions object may contain some magic internally which will make this "just work" in the Python context. We'll see! (See the sketch after this list.)

  • Conditional logic in pangeo-forge-runner to optionally skip .to_beam() compilation. (Maybe a --precompiled boolean flag.)

    💡 Checkpoint: At this point, a custom image (as used above), with -recipes and -runner installed from branches containing the above-listed features, should be able to deploy the OISST pipeline to Dataflow from a public repo with something like:

    $ pangeo-forge-runner bake https://github.com/user/repo.git --prune --precompiled

    while setting target_path dynamically via the pangeo-forge-runner traitlets config.

  • A minimally viable version of Use traitlets (pangeo-forge-orchestrator#197), or a follow-on to it, with support for a DockerSiblingSpawner for starting custom recipe-parsing containers based on a spec provided by the feedstock in meta.yaml.

  • Some (even provisional) syntax, supported by https://github.com/pangeo-forge/meta-yaml-schema, for specifying custom images to be used for parsing & Dataflow workers. These could be pre-built images or (possibly) images built at runtime from a base image (this latter option is appealing for the case of iterating quickly on images built from feature branches of -recipes, -runner, etc.).

    💡 Checkpoint: At this point, a locally running instance of -orchestrator (which therefore has access to a Docker Daemon) should be able to deploy beam-refactor code to Dataflow from a public repo, in response to events (e.g. slash commands).

  • A deployment setting that allows us to run a Docker Daemon (i.e. migrating the -orchestrator deployment off of Heroku and onto GCP somewhere). This is the final piece for arbitrary container execution on a production instance of -orchestrator.
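
Re: the Custom Pipeline Options item above, here is a rough sketch of what I have in mind (untested against beam-refactor; the class name InjectedPathOptions and the option names below are illustrative only, not anything that exists yet):

from apache_beam.options.pipeline_options import PipelineOptions

class InjectedPathOptions(PipelineOptions):
    """Hypothetical custom options for injecting storage paths at deploy time."""

    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_argument("--target_path", help="Where to write the target Zarr store.")
        parser.add_argument("--cache_path", help="Where to cache source files.")

# PipelineOptions keeps unknown keyword arguments around internally, so the values
# can (I believe) also be passed as plain kwargs rather than argparse-style flags:
options = PipelineOptions(target_path="gs://some-bucket/target.zarr")
target_path = options.view_as(InjectedPathOptions).target_path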

I'm sure I've oversimplified some things, but this is the basic landscape as far as I see it right now.

cc @yuvipanda, in case you have suggestions. In particular, wondering what you think re: GCP deployment for orchestrator. Is there a particular product/platform you'd recommend?

@yuvipanda
Contributor

pangeo-forge/pangeo-forge-runner#48 is what implements this

@yuvipanda yuvipanda moved this from Todo to In Progress in Get beam-refactor merged Jan 16, 2023
@cisaacstern cisaacstern changed the title Running beam-refactor on Dataflow Manually running beam-refactor on Dataflow Jan 17, 2023
@cisaacstern
Member Author

I've updated this issue's title to reflect what is documented here: a process for manually running beam-refactor on Dataflow. We've accomplished this goal, and are now moving on to support beam-refactor in the production stack. IMO that is beyond the scope of what this issue was opened for, so I'm going to close it now. Other issues linked above and in Get beam-refactor merged will track our remaining progress towards the production deployment of beam-refactor.

@github-project-automation github-project-automation bot moved this from In Progress to Done in Get beam-refactor merged Jan 17, 2023