turbaszek/snowplow-gcp

Snowplow on GCP

This project aims to provide a set of tools that let you easily deploy a Snowplow setup on Google Cloud Platform.

After following all the steps below you should have:

  • GKE cluster running:
    • Snowplow Scala Stream Collector
    • Beam Enrich
    • BigQuery Loader
  • Pub/Sub topics for the collector and enrich streams
  • A BigQuery dataset as the final destination of Snowplow events
  • A few GCS buckets

NOTE: This project is still a work in progress and some parts may not work yet, but you are welcome to help!

Prerequisites

To manage GCP resources you need the gcloud CLI installed. For installation options, check the official documentation.

This project uses Terraform to bootstrap the infrastructure and kubectl to manage the Kubernetes cluster. On macOS you can install both with Homebrew:

brew install terraform
brew install kubectl

For installation options on other systems, please check the documentation of those projects.
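Before continuing, it can help to confirm that all three tools are actually on your PATH. The following is a minimal sketch; the function name `missing_tools` is hypothetical and not part of this repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the name of every given tool that is not on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    # `command -v` succeeds only when the tool can be found.
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Usage: prints nothing when every prerequisite is installed.
#   missing_tools gcloud terraform kubectl
```

If the command prints any names, install those tools before moving on to the infrastructure setup.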

Infrastructure setup

  1. Create a GCP project. You can also use an existing one.

  2. Run the following commands:

    export PROJECT_ID=project-name-here
    export SERVICE_ACCOUNT_NAME=snowplow
    bash scripts/setup-iam.sh ${PROJECT_ID} ${SERVICE_ACCOUNT_NAME}

    This creates a service account and stores its key in the keys directory. The service account has the roles/editor role and is used to create GCP resources. The script also enables the required services (GKE).

  3. To bootstrap the infrastructure required for the Snowplow deployment, run:

    export LOCATION=europe-west3
    export GCP_KEY=keys/${SERVICE_ACCOUNT_NAME}.json
    export CLIENT=client-name
    terraform apply -var "gcp_project=${PROJECT_ID}" -var "gcp_location=${LOCATION}" -var "gcp_key_admin=${GCP_KEY}" -var "client=${CLIENT}"

    CLIENT is a string that is added to the names of all resources. It's recommended to use Terraform workspaces, e.g. terraform workspace new my_snowplow.

At this point all required elements should be up and running. If you wish, you can check this in the GCP console. In the next steps you will deploy the Snowplow components.
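As an alternative to repeating the -var flags, Terraform also reads input variables from environment variables named TF_VAR_<name>. The sketch below maps the shell variables from step 3 onto the four variables passed above; the helper name `export_tf_vars` is hypothetical, and using CLIENT as the workspace name is only one possible convention.

```shell
#!/usr/bin/env bash
# Hypothetical helper: expose the step 3 shell variables as Terraform input variables.
# Each ${VAR:?...} expansion aborts with a message if the variable is unset or empty.
export_tf_vars() {
  export TF_VAR_gcp_project="${PROJECT_ID:?set PROJECT_ID first}"
  export TF_VAR_gcp_location="${LOCATION:?set LOCATION first}"
  export TF_VAR_gcp_key_admin="${GCP_KEY:?set GCP_KEY first}"
  export TF_VAR_client="${CLIENT:?set CLIENT first}"
}

# Usage, once the variables from step 3 are exported:
#   export_tf_vars
#   terraform workspace new "${CLIENT}"   # assumption: one workspace per client
#   terraform apply
```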

Collector deployment

Check the Snowplow documentation.

To get access to the newly created Kubernetes cluster, run:

gcloud container clusters get-credentials "snowplow-gke" --region ${LOCATION}

The collector configuration requires you to provide the GCP project ID. You can do this by running the following substitution:

sed -i "" "s/googleProjectId =.*/googleProjectId = ${PROJECT_ID}/" k8s/collector/conf.yaml
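Note that `sed -i ""` is the BSD (macOS) form of in-place editing; GNU sed on Linux expects `-i` without a separate empty argument. A sketch that should work with both flavours is to attach a backup suffix directly to the flag and delete the backup afterwards. The helper name `set_project_id` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: set googleProjectId in a config file using BSD or GNU sed.
set_project_id() {
  local project_id="$1" file="$2"
  # -i.bak (suffix attached to the flag) is accepted by both sed flavours.
  sed -i.bak "s/googleProjectId =.*/googleProjectId = ${project_id}/" "$file" \
    && rm -f "${file}.bak"
}

# Usage:
#   set_project_id "${PROJECT_ID}" k8s/collector/conf.yaml
```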

Then apply the following Kubernetes manifests:

kubectl apply -f k8s/collector/conf.yaml
kubectl apply -f k8s/collector/deploy.yaml
kubectl apply -f k8s/collector/service.yaml

This creates the snowplow-collector deployment, which uses the official Snowplow image.

To check that the deployment works, run

kubectl get pods -A | grep snowplow

and you should see a few pods, all in the Running state. To verify that everything works, you can run the health check script:

bash scripts/collector_health_check.sh

If there was no error, head to the Pub/Sub web console; after a few seconds you should see some events in the good topic.
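The manual pod check above can also be scripted with `kubectl rollout status`, which blocks until the deployment is ready or the timeout expires. This is a sketch only: the helper name `wait_for_collector` is hypothetical, and it assumes the collector runs in the default namespace unless another one is passed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: wait until the snowplow-collector deployment has rolled out.
# Assumption: the deployment lives in the "default" namespace unless $1 says otherwise.
wait_for_collector() {
  local namespace="${1:-default}"
  kubectl rollout status deployment/snowplow-collector \
    --namespace "$namespace" --timeout=120s
}

# Usage:
#   wait_for_collector && bash scripts/collector_health_check.sh
```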

Stream enrich job

Check the Snowplow documentation.

The next step is to start a streaming job on Google Cloud Dataflow (Apache Beam). To do this you will use a one-time Kubernetes job.

Before that, the enrich configuration requires you to provide the GCP project ID. You can do this by running the following substitutions:

sed -i "" "s/googleProjectId =.*/googleProjectId = ${PROJECT_ID}/" k8s/enrich/conf.yaml
sed -i "" "s/\*PROJECT\*/${PROJECT_ID}/" k8s/enrich/job.yaml  # does not work

Then we need a key to write to GCS:

cp keys/snowplow-admin.json keys/credentials.json
kubectl create secret generic gcs-writer-sa --from-file keys/credentials.json

TODO: there should be a key with limited scope (which scope?). TODO: some more configuration changes are needed.

Once your configuration is ready, run:

kubectl apply -f k8s/enrich/conf.yaml
kubectl apply -f k8s/enrich/job.yaml

After a few seconds, run:

kubectl get jobs -A

and you should see that snowplow-enrich has completed.
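Instead of polling the job list by hand, `kubectl wait` can block until the job reports the complete condition. A minimal sketch, assuming the job is named snowplow-enrich as above and lives in the current namespace; the helper name `wait_for_enrich_job` and the default timeout are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: block until the snowplow-enrich job completes or the timeout expires.
# Assumption: the job is in the namespace of the current kubectl context.
wait_for_enrich_job() {
  local timeout="${1:-300s}"
  kubectl wait --for=condition=complete job/snowplow-enrich --timeout="$timeout"
}

# Usage:
#   wait_for_enrich_job 600s
```

The command exits with a non-zero status if the job has not completed within the timeout, which makes it usable in scripts.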

BigQuery loader deployment

Check the Snowplow documentation.

Contributing

We welcome all contributions! Please submit an issue or PR, whether it's a bug or a typo.

This project uses pre-commit hooks to ensure code quality. To install pre-commit, run:

pip install pre-commit
# or
brew install pre-commit

Then run pre-commit install from the project directory.