
Beam & Dataflow Setup

Shirley Cohen edited this page May 14, 2020 · 15 revisions

This guide takes you through the steps of setting up your GCP environment for Apache Beam and Dataflow.

1. Enable the Dataflow API

  • GCP Console -> Navigation Menu -> APIs & Services -> Add APIs and Services -> enter Dataflow in the search bar -> click Enable

2. Create a Cloud Storage bucket (Storage -> Browser -> Create Bucket)

  • Bucket name: <group name>-<some unique suffix>
  • Click Continue
  • Location type: Region
  • Location: us-central1 (Iowa)
  • Click Continue
  • Storage Class: Standard
  • Click Create
  • Create 3 folders inside your bucket. Folder names: staging, temp, output

Note:

  • Bucket names are globally unique across Cloud Storage, which is why you need to add a unique suffix to your group name.
  • The staging and temp folders are used by Dataflow to stage pipeline files and hold temporary files; the output folder holds the results of the WordCount example.
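If you prefer the command line, the bucket can also be created with gsutil from Cloud Shell or a terminal (a sketch; <group-name>-<suffix> is a placeholder for your own bucket name):

```shell
# Create a Standard-class, regional bucket in us-central1.
# (Placeholder name; bucket names must be globally unique.)
gsutil mb -c standard -l us-central1 gs://<group-name>-<suffix>/
```

Cloud Storage folders are just object-name prefixes, so Dataflow will create the staging, temp, and output paths automatically when it first writes under them; creating the folders in the console just makes them visible up front.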

3. Start up your Jupyter notebook instance and go to Jupyter Lab.

4. Bring up a terminal window by going to File -> New -> Terminal.

5. Create a virtual environment and install Apache Beam by entering these commands in the terminal:

$ pip install --user virtualenv
$ export PATH=$PATH:/home/jupyter/.local/bin
$ virtualenv -p python3 venv
$ source venv/bin/activate
$ pip install 'apache-beam[gcp]'
$ pip install 'apache-beam[interactive]'

6. Install an ipykernel by entering these commands in the terminal:
$ source venv/bin/activate
$ python -m pip install ipykernel
$ python -m ipykernel install --user --name beam_venv_kernel --display-name "Python (beam_venv)"
$ jupyter kernelspec list

Make sure you see beam_venv_kernel in the list of available kernels:

Available kernels:
beam_venv_kernel /home/jupyter/.local/share/jupyter/kernels/beam_venv_kernel
python2 /home/jupyter/beam_venv_dir/share/jupyter/kernels/python2
python3 /usr/local/share/jupyter/kernels/python3

Note: To run a Beam pipeline from a Python notebook, choose the "Python (beam_venv)" kernel from the Kernel menu.

7. Test your Apache Beam setup by running the WordCount example using the Direct Runner:

  • python -m apache_beam.examples.wordcount --output wordcount.out
  • If you see any errors in stdout, stop and debug.
  • Open wordcount.out-00000-of-00001 and examine the output
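For reference, the computation the WordCount example performs can be sketched in plain Python (an illustration of the logic only, not the actual Beam pipeline; the regex and sample line are assumptions):

```python
import re
from collections import Counter

def count_words(lines):
    """Tokenize each line into words and count occurrences,
    roughly mirroring what the WordCount pipeline computes."""
    counts = Counter()
    for line in lines:
        counts.update(re.findall(r"[A-Za-z']+", line))
    return counts

print(count_words(["to be or not to be"]))
```

The real pipeline does the same thing, but distributes the tokenizing and counting steps across workers.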

8. Test your Dataflow setup by running the WordCount example using the Dataflow Runner:
Replace $PROJECT_ID and $BUCKET with your project id and bucket name.
$ python -m apache_beam.examples.wordcount \
--project $PROJECT_ID \
--runner DataflowRunner \
--region us-central1 \
--staging_location gs://$BUCKET/staging \
--temp_location gs://$BUCKET/temp \
--output gs://$BUCKET/output

  • Go to the Dataflow console, find the running job, and examine the job details.
  • Open the GCS console, go to your bucket, open the 3 folders and view the contents of the files.
  • If the wordcount job completed without errors, your Dataflow setup is complete.
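The job's results can also be checked from the terminal rather than the GCS console (a sketch; $BUCKET is your bucket name as above):

```shell
# List the sharded result files written by the WordCount job.
gsutil ls gs://$BUCKET/output*
# Print their contents.
gsutil cat gs://$BUCKET/output*
```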