Beam & Dataflow Setup
This guide takes you through the steps of setting up your GCP environment for Apache Beam and Dataflow.
1. Enable the Dataflow API
- GCP Console -> Navigation Menu -> APIs & Services -> Enable APIs and Services -> enter Dataflow in the search bar -> click Enable
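Alternatively, assuming the gcloud CLI is installed and authenticated, the API can be enabled from a terminal:
$ gcloud services enable dataflow.googleapis.com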
2. Create a Cloud Storage bucket (Storage -> Browser -> Create Bucket)
- Bucket name: <group name>-<some unique suffix>
- Click Continue
- Location type: Region
- Location: us-central1 (Iowa)
- Click Continue
- Storage Class: Standard
- Click Create
- Create 3 folders inside your bucket, named staging, temp, and output.
Note:
- Bucket names must be globally unique across GCP, which is why you need to add a unique suffix to your group name.
- The staging and temp folders are used by the Dataflow runner for staged job files and temporary data; the output folder will hold the WordCount example's results.
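If you prefer the command line, a bucket with the same settings can be created with gsutil (a sketch; substitute your own bucket name). GCS folders are just object-name prefixes, so they can be created in the console as above or left to appear automatically when the pipeline writes to them:
$ gsutil mb -l us-central1 -c STANDARD gs://<group name>-<some unique suffix>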
3. Start up your Jupyter notebook instance and go to JupyterLab.
4. Bring up a terminal window by going to File -> New -> Terminal.
5. Create a virtual environment and install apache beam by entering these commands in the terminal:
$ pip install --user virtualenv
$ export PATH=$PATH:/home/jupyter/.local/bin
$ virtualenv -p python3 venv
$ source venv/bin/activate
$ pip install 'apache-beam[gcp]'
$ pip install 'apache-beam[interactive]'
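To confirm the installation, print the installed Beam version:
$ python -c "import apache_beam as beam; print(beam.__version__)"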
6. Install an ipykernel by entering these commands in the terminal:
$ source venv/bin/activate
$ python -m pip install ipykernel
$ python -m ipykernel install --user --name beam_venv_kernel --display-name "Python (beam_venv)"
$ jupyter kernelspec list
Make sure you see beam_venv_kernel in the list of available kernels:
Available kernels:
beam_venv_kernel /home/jupyter/.local/share/jupyter/kernels/beam_venv_kernel
python2 /home/jupyter/beam_venv_dir/share/jupyter/kernels/python2
python3 /usr/local/share/jupyter/kernels/python3
Note: To run a Beam pipeline from a Python notebook, choose the beam_venv_kernel kernel from the Kernel menu.
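To verify the kernel end to end, open a new notebook with the Python (beam_venv) kernel and run a minimal pipeline in a cell. This is a small sketch that uses only the local DirectRunner, so no GCP resources are needed:

import apache_beam as beam

# Runs locally on the DirectRunner (the default); should print HELLO and BEAM.
with beam.Pipeline() as p:
    (p
     | 'Create' >> beam.Create(['hello', 'beam'])
     | 'Upper' >> beam.Map(str.upper)
     | 'Print' >> beam.Map(print))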
7. Test your Apache Beam setup by running the WordCount example using the Direct Runner:
python -m apache_beam.examples.wordcount --output wordcount.out
- If you see any errors in stdout, stop and debug.
- Open wordcount.out-00000-of-00001 and examine the output.
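For example, from the terminal:
$ head wordcount.out-00000-of-00001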
8. Test your Dataflow setup by running the WordCount example using the Dataflow Runner. Replace $PROJECT_ID and $BUCKET with your project id and bucket name:
python -m apache_beam.examples.wordcount \
--project $PROJECT_ID \
--region us-central1 \
--runner DataflowRunner \
--staging_location gs://$BUCKET/staging \
--temp_location gs://$BUCKET/temp \
--output gs://$BUCKET/output/wordcount
- Go to the Dataflow console, find the running job, and examine the job details.
- Open the GCS console, go to your bucket, open the 3 folders and view the contents of the files.
- If the WordCount job completes without errors, your Dataflow setup is complete.
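If you would rather launch the same job from a notebook cell instead of the command line, the pipeline options can be set programmatically. A minimal sketch, assuming you substitute your own project id and bucket name for the PROJECT_ID and BUCKET placeholders; the input is the public Shakespeare sample that the wordcount example reads by default:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholders: substitute your own project id and bucket name.
options = PipelineOptions(
    runner='DataflowRunner',
    project='PROJECT_ID',
    region='us-central1',
    staging_location='gs://BUCKET/staging',
    temp_location='gs://BUCKET/temp',
)

with beam.Pipeline(options=options) as p:
    (p
     | 'Read' >> beam.io.ReadFromText('gs://dataflow-samples/shakespeare/kinglear.txt')
     | 'Split' >> beam.FlatMap(lambda line: line.split())
     | 'Count' >> beam.combiners.Count.PerElement()
     | 'Format' >> beam.MapTuple(lambda word, n: '%s: %d' % (word, n))
     | 'Write' >> beam.io.WriteToText('gs://BUCKET/output/wordcount'))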