
Manually running beam-refactor on Dataflow #450

Closed
4 of 6 tasks
cisaacstern opened this issue Dec 13, 2022 · 7 comments
@cisaacstern
Member

cisaacstern commented Dec 13, 2022

As documented in the notebook linked in #445 (comment), the beam-refactor runs end-to-end for a real-world application (NOAA OISST) when run locally with Beam's DirectRunner. I'll now run the same recipe on Dataflow, to surface any DataflowRunner-specific issues in preparation for merging beam-refactor. I've opened this issue to track my work plan and record findings. Notes on how I'll be proceeding:

  • Replicate above-linked notebook with local DirectRunner
  • Reference latest pangeo-forge-runner for our current selection of Dataflow pipeline options (which we know to work, based on recent successful jobs).
  • Build and push an image to gcr.io which includes pangeo-forge-recipes installed from the beam-refactor branch; pip-installing on top of our current base container would be lighter-weight, but I prefer to push a dedicated container for this, because it more closely matches how we've been working on Dataflow thus far.
  • Deploy the above-linked NOAA OISST transforms to Dataflow using this selection of pipeline options + custom beam-refactor image. (Do this manually, using the Beam Python SDK.)

Next steps, assuming the above works:

  • Open an issue on pangeo-forge-runner documenting what will be required to support the beam-refactor branch on Pangeo Forge Cloud (main point: no more .to_beam() compilation).
  • Think more about the fastest path to allowing Use traitlets (pangeo-forge-orchestrator#197) to support multiple pangeo-forge-recipes parsing environments simultaneously on Pangeo Forge Cloud.

Updates to follow on this thread.

@cisaacstern cisaacstern self-assigned this Dec 13, 2022
@cisaacstern
Member Author

Reference latest pangeo-forge-runner for our current selection of Dataflow pipeline options (which we know to work, based on recent successful jobs).

Adapting kwargs from here:

job_name = ...           # define manually
temp_gcs_location = ...  # (TBD) GCS bucket path
container_image = ...    # (TBD) URI of the image I'll build for this experiment

opts = dict(
    runner="DataflowRunner",
    project="pangeo-forge-4967",
    job_name=job_name,
    temp_location=temp_gcs_location,
    use_public_ips=False,
    region="us-central1",
    experiments=["use_runner_v2"],
    sdk_container_image=container_image,
    save_main_session=True,
    pickle_library="cloudpickle",
    # for NOAA OISST the machine can probably be smaller,
    # but this should be fine for now
    machine_type="n1-highmem-2",
)
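
For context, these kwargs get unpacked directly into Beam's PipelineOptions when constructing the pipeline, roughly as follows (a minimal sketch; the actual deployment is recorded in a later comment on this thread):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

pipeline = beam.Pipeline(options=PipelineOptions(**opts))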

(Recording this here so others can see the thought process, and as a convenient place to keep notes for myself.)

@cisaacstern
Member Author

cisaacstern commented Dec 15, 2022

Build and push a image to gcr.io which includes pangeo-forge-recipes installed from the beam-refactor branch

I just built:

FROM pangeo/forge:5e51a29 

RUN mamba run -n notebook pip install -U git+https://github.com/pangeo-forge/pangeo-forge-recipes.git@beam-refactor

where pangeo/forge:5e51a29 is the image tag currently used to deploy jobs from orchestrator.

I then ran the example code linked in the first comment above in a Python interpreter inside a container started from this image, and confirmed that the NOAA OISST example runs successfully on it. From this point forward, we can be confident that any issues we encounter (perhaps there won't even be any 🤷) are Dataflow-specific.

Next step is pushing this image to gcr.io and then deploying the example recipe to Dataflow.

@cisaacstern
Member Author

Next step is pushing this image to gcr.io

Completed following the method used in orchestrator here.
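
For anyone reproducing this: the build-and-push step generally looks something like the following (a sketch of the standard gcr.io workflow; the exact commands used in orchestrator may differ):

$ gcloud auth configure-docker  # one-time setup so docker can push to gcr.io
$ docker build -t gcr.io/pangeo-forge-4967/beam-refactor .
$ docker push gcr.io/pangeo-forge-4967/beam-refactor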

@cisaacstern
Member Author

cisaacstern commented Dec 15, 2022

🎉 TL;DR: it works! Details below...

Deployment and worker envs must be the same, so I deployed the job from a container running the gcr.io/pangeo-forge-4967/beam-refactor image (described above):

$ docker run -it \
> -v "${DATAFLOW_KEYFILE}":"/opt/storage_key.json" \
> -e GOOGLE_APPLICATION_CREDENTIALS="/opt/storage_key.json" \
> --entrypoint=/bin/bash \
> gcr.io/pangeo-forge-4967/beam-refactor

Here, DATAFLOW_KEYFILE is the path to a key file for a GCP service account with permission to deploy Dataflow jobs.

Within the running container, I:

  1. Started a Python interpreter and created the pruned OISST pattern as demonstrated in the above-linked notebook.

  2. Set target_path as:

    target_path = "gs://beam-dataflow-test/beam-refactor-oisst-0.zarr"

    where gs://beam-dataflow-test is a scratch bucket in our GCP project.

  3. Defined the pipeline options kwargs as:

    job_name = "beam-refactor-oisst-0"
    temp_gcs_location =  "gs://beam-dataflow-test/tmp/"
    container_image = "gcr.io/pangeo-forge-4967/beam-refactor"
    opts = dict(
        runner="DataflowRunner",
        project="pangeo-forge-4967",
        job_name=job_name,
        temp_location=temp_gcs_location,
        use_public_ips=False,
        region="us-central1",
        experiments=["use_runner_v2"],
        sdk_container_image=container_image,
        save_main_session=True,
        pickle_library="cloudpickle",
        # for NOAA OISST machine can probably be smaller,
        # but this should be fine for now
        machine_type="n1-highmem-2",
        service_account_email="pangeo-forge-dataflow@pangeo-forge-4967.iam.gserviceaccount.com",
    )
  4. And deployed the job (without a context manager, so the call is non-blocking; see the note below this list):

    pipeline = beam.Pipeline(options=PipelineOptions(**opts))
    pipeline | transforms
    pipeline.run()
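
Aside on the non-blocking point above: with the DataflowRunner, pipeline.run() submits the job and returns a result object right away. A blocking variant (not what I ran here, just a sketch for reference) would be:

result = pipeline.run()
result.wait_until_finish()  # blocks until the Dataflow job completes or fails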

The job succeeded, and the dataset is openable with

import xarray as xr

# opening a gs:// path this way assumes gcsfs is installed in the environment
target_path = "gs://beam-dataflow-test/beam-refactor-oisst-0.zarr"
oisst_zarr = xr.open_dataset(target_path, engine="zarr")

I'll make some further notes below re: next steps.

cc @rabernat

@cisaacstern
Member Author

cisaacstern commented Dec 15, 2022

Now that the beam-refactor has been successfully run on Dataflow, I'll expand on the next steps from the first comment above, in search of the shortest path (one that doesn't cost us too much technical debt) from where we are now to running this same OISST pipeline on Pangeo Forge Cloud.

Getting beam-refactor to run end-to-end on Pangeo Forge Cloud is a perfect stress test for the composability of our cloud stack, and will require fast-tracking upgrades which are otherwise in our best long-term interest.

Running beam-refactor end-to-end on Pangeo Forge Cloud (without changing the current prod deployment; i.e. without making this the monolithic default) will require:

  • A way to inject target_path (and cache_path) at deploy time (as opposed to hardcoding these values, as we are currently doing). I'm going to take a stab at Custom Pipeline Options now, as that appears to be the most idiomatic Beam approach (@alxmrs, correct me if I'm wrong). As @rabernat pointed out in our last Coordination meeting, the docs for this focus on the argparse use case, but I think the PipelineOptions object may contain some magic internally which will make this "just work" in the Python context. We'll see! (See the sketch after this list.)

  • Conditional logic in pangeo-forge-runner to optionally skip .to_beam() compilation. (Maybe a --precompiled boolean flag.)

    💡 Checkpoint: At this point, a custom image (as used above), with -recipes and -runner installed from branches containing the above-listed features, should be able to deploy the OISST pipeline to Dataflow from a public repo with something like:

    $ pangeo-forge-runner bake https://github.com/user/repo.git --prune --precompiled

    while setting target_path dynamically via the pangeo-forge-runner traitlets config.

  • A minimally viable version of Use traitlets (pangeo-forge-orchestrator#197), or a follow-on to it, with support for a DockerSiblingSpawner for starting custom recipe-parsing containers based on a spec provided by the feedstock in meta.yaml.

  • Some (even provisional) syntax, supported by https://github.com/pangeo-forge/meta-yaml-schema, for specifying custom images to be used for parsing & Dataflow workers. These could be pre-built images or (possibly) images built at runtime from a base image (this latter option is appealing for the case of iterating quickly on images built from feature branches of -recipes, -runner, etc.).

    💡 Checkpoint: At this point, a locally running instance of -orchestrator (which therefore has access to a Docker Daemon) should be able to deploy beam-refactor code to Dataflow from a public repo, in response to events (e.g. slash commands).

  • A deployment setting that allows us to run a Docker Daemon (i.e. migrating the -orchestrator deployment off of Heroku and onto GCP somewhere). This is the final piece for arbitrary container execution on a production instance of -orchestrator.
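
Re: the Custom Pipeline Options item above, here is a rough sketch of what I have in mind (untested against beam-refactor; the class name InjectedPathOptions and the option names below are illustrative only, not anything that exists yet):

from apache_beam.options.pipeline_options import PipelineOptions

class InjectedPathOptions(PipelineOptions):
    """Hypothetical custom options for injecting storage paths at deploy time."""

    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_argument("--target_path", help="Where to write the target Zarr store.")
        parser.add_argument("--cache_path", help="Where to cache source files.")

# PipelineOptions keeps unknown keyword arguments around internally, so the values
# can (I believe) also be passed as plain kwargs rather than argparse-style flags:
options = PipelineOptions(target_path="gs://some-bucket/target.zarr")
target_path = options.view_as(InjectedPathOptions).target_path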

I'm sure I've oversimplified some things, but this is the basic landscape as far as I see it right now.

cc @yuvipanda, in case you have suggestions. In particular, wondering what you think re: GCP deployment for orchestrator. Is there a particular product/platform you'd recommend?

@yuvipanda
Contributor

pangeo-forge/pangeo-forge-runner#48 is what implements this

@yuvipanda yuvipanda moved this from Todo to In Progress in Get beam-refactor merged Jan 16, 2023
@cisaacstern cisaacstern changed the title Running beam-refactor on Dataflow Manually running beam-refactor on Dataflow Jan 17, 2023
@cisaacstern
Member Author

I've updated this issue's title to reflect what is documented here: a process for manually running beam-refactor on Dataflow. We've accomplished this goal, and are now moving on to support beam-refactor in the production stack. IMO that is beyond the scope of what this issue was opened for, so I'm going to close it now. Other issues linked above and in Get beam-refactor merged will track our remaining progress towards the production deployment of beam-refactor.

@github-project-automation github-project-automation bot moved this from In Progress to Done in Get beam-refactor merged Jan 17, 2023