
Beam & Dataflow Setup

Shirley Cohen edited this page May 14, 2020 · 15 revisions

This guide takes you through the steps of setting up your GCP environment for Apache Beam and Dataflow.

1. Enable the Dataflow API

  • GCP Console -> Navigation Menu -> APIs & Services -> Add APIs and Services -> enter Dataflow in the search bar -> click Enable

2. Create a Cloud Storage bucket (Storage -> Browser -> Create Bucket)

  • Bucket name: <group name>-<some unique suffix>
  • Click Continue
  • Location type: Region
  • Location: us-central1 (Iowa)
  • Click Continue
  • Storage Class: Standard
  • Click Create
  • Create 3 folders inside your bucket. Folder names: staging, temp, output

Note:

  • Bucket names are globally unique across Cloud Storage, which is why you need to add a unique suffix to your group name.
  • The staging and temp folders are used by Dataflow to stage pipeline files and hold temporary files; the output folder holds the results of the WordCount example.
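If you prefer the command line, the bucket can also be created with gsutil from Cloud Shell or a terminal (a sketch; <group-name>-<suffix> is a placeholder for your own bucket name):

```shell
# Create a Standard-class, regional bucket in us-central1.
# (Placeholder name; bucket names must be globally unique.)
gsutil mb -c standard -l us-central1 gs://<group-name>-<suffix>/
```

Cloud Storage folders are just object-name prefixes, so Dataflow will create the staging, temp, and output paths automatically when it first writes under them; creating the folders in the console just makes them visible up front.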

3. Start up your Jupyter notebook instance and go to Jupyter Lab.

4. Bring up a terminal window by going to File -> New -> Terminal.

5. Create a virtual environment and install Apache Beam by entering these commands in the terminal:

$ pip install --user virtualenv
$ export PATH=$PATH:/home/jupyter/.local/bin
$ virtualenv -p python3 venv
$ source venv/bin/activate
$ pip install 'apache-beam[gcp]'
$ pip install 'apache-beam[interactive]'

6. Install an ipykernel by entering these commands in the terminal:
$ source venv/bin/activate
$ python -m pip install ipykernel
$ python -m ipykernel install --user --name beam_venv_kernel --display-name "Python (beam_venv)"
$ jupyter kernelspec list

Make sure you see beam_venv_kernel in the list of available kernels:

Available kernels:
beam_venv_kernel /home/jupyter/.local/share/jupyter/kernels/beam_venv_kernel
python2 /home/jupyter/beam_venv_dir/share/jupyter/kernels/python2
python3 /usr/local/share/jupyter/kernels/python3

Note: To run a Beam pipeline from a Python notebook, choose the "Python (beam_venv)" kernel from the Kernel menu.

7. Test your Apache Beam setup by running the WordCount example using the Direct Runner:

  • python -m apache_beam.examples.wordcount --output wordcount.out
  • If you see any errors in stdout, stop and debug.
  • Open wordcount.out-00000-of-00001 and examine the output
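For reference, the computation the WordCount example performs can be sketched in plain Python (an illustration of the logic only, not the actual Beam pipeline; the regex and sample line are assumptions):

```python
import re
from collections import Counter

def count_words(lines):
    """Tokenize each line into words and count occurrences,
    roughly mirroring what the WordCount pipeline computes."""
    counts = Counter()
    for line in lines:
        counts.update(re.findall(r"[A-Za-z']+", line))
    return counts

print(count_words(["to be or not to be"]))
```

The real pipeline does the same thing, but distributes the tokenizing and counting steps across workers.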

8. Test your Dataflow setup by running the WordCount example using the Dataflow Runner:
Replace $PROJECT_ID and $BUCKET with your project id and bucket name.
$ python -m apache_beam.examples.wordcount \
--project $PROJECT_ID \
--runner DataflowRunner \
--region us-central1 \
--staging_location gs://$BUCKET/staging \
--temp_location gs://$BUCKET/temp \
--output gs://$BUCKET/output

  • Go to the Dataflow console, find the running job, and examine the job details.
  • Open the GCS console, go to your bucket, open the 3 folders and view the contents of the files.
  • If the wordcount job completed without errors, your Dataflow setup is complete.
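The job's results can also be checked from the terminal rather than the GCS console (a sketch; $BUCKET is your bucket name as above):

```shell
# List the sharded result files written by the WordCount job.
gsutil ls gs://$BUCKET/output*
# Print their contents.
gsutil cat gs://$BUCKET/output*
```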