This repository contains the code for the AWS blog post: Orchestrate AWS Glue DataBrew jobs using Amazon Managed Workflows for Apache Airflow.
Download the following public datasets and upload them to the input S3 bucket (a boto3 upload sketch follows the list):
- NYC Yellow taxi trip records (2020-Jan)
- NYC Green taxi trip records (2020-Jan)
- NYC Taxi zone lookup dataset
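If you prefer to script the upload, here is a minimal boto3 sketch; the local file names, key prefixes, and the input bucket name are placeholders and should be adjusted to match the folder layout your DataBrew datasets expect.

```python
import boto3

s3 = boto3.client("s3")

# Placeholder bucket name and key prefixes -- adjust to match your environment
# and the folder layout referenced by your DataBrew datasets.
input_bucket = "your-input-bucket-name"
uploads = {
    "yellow_tripdata_2020-01.csv": "yellow-taxi/yellow_tripdata_2020-01.csv",
    "green_tripdata_2020-01.csv": "green-taxi/green_tripdata_2020-01.csv",
    "taxi_zone_lookup.csv": "taxi-zone-lookup/taxi_zone_lookup.csv",
}

for local_file, s3_key in uploads.items():
    s3.upload_file(local_file, input_bucket, s3_key)
    print(f"Uploaded {local_file} to s3://{input_bucket}/{s3_key}")
```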
The solution uses three buckets: an S3 input bucket, an S3 output bucket, and an Airflow bucket. Refer to the blog post for the folder structure of each bucket.
After creating the DataBrew projects, import the DataBrew recipes.
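For example, the recipes can also be imported programmatically with boto3, as in the sketch below; the recipe names and the local recipe JSON files are placeholders for recipe definitions exported from DataBrew, and you can equally import them in the DataBrew console.

```python
import json
import boto3

databrew = boto3.client("databrew")

# Placeholder recipe names and file paths -- replace with your exported recipe
# definitions (a JSON list of recipe steps) and the names your projects expect.
recipe_files = {
    "nyc-taxi-yellow-recipe": "recipes/yellow-taxi-recipe.json",
    "nyc-taxi-green-recipe": "recipes/green-taxi-recipe.json",
}

for recipe_name, recipe_file in recipe_files.items():
    with open(recipe_file) as f:
        steps = json.load(f)  # list of {"Action": {...}} step definitions
    databrew.create_recipe(Name=recipe_name, Steps=steps)
    # Publish so the recipe version can be attached to a DataBrew job.
    databrew.publish_recipe(Name=recipe_name)
```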
Create the external tables by running ./athena/nytaxi-trip-data-aggregated-summary.sql in Athena.
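You can paste the SQL into the Athena console, or run it with boto3 as in the sketch below; the Glue database name and the query result location are placeholders.

```python
import boto3

athena = boto3.client("athena")

# Placeholders -- use your own Glue database and an S3 location for query results.
database = "nyc_taxi"
output_location = "s3://your-output-bucket-name/athena-results/"

with open("athena/nytaxi-trip-data-aggregated-summary.sql") as f:
    # Athena runs one statement per request, so split the script on ';'
    # (assuming the script has no semicolons inside string literals).
    statements = [s.strip() for s in f.read().split(";") if s.strip()]

for statement in statements:
    athena.start_query_execution(
        QueryString=statement,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
```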
- Create your Airflow environment using the Amazon Managed Workflows for Apache Airflow (MWAA) CDK Python project
- Go to the IAM console and choose Roles
- Search for the Airflow execution role, whose name looks similar to AmazonMWAA-your-airflow-environment-name-xxxx
- Attach the following permissions to that role (a boto3 sketch follows the inline policy below):
- AmazonAthenaFullAccess
- AwsGlueDataBrewFullAccessPolicy
- AWSGlueDataBrewServiceRole
- an inline policy such as the following:
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "s3:GetBucketLocation", "s3:GetObject", "s3:ListBucket", "s3:ListBucketMultipartUploads", "s3:ListMultipartUploadParts", "s3:AbortMultipartUpload", "s3:CreateBucket", "s3:PutObject", "s3:PutBucketPublicAccessBlock" ], "Resource": [ "arn:aws:s3:::your-output-bucket-name", "arn:aws:s3:::your-output-bucket-name/*" ] } ] }
Upload requirements.txt and ny_taxi_brew_trigger.py from the ./mwaa directory to the Airflow S3 bucket.
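A minimal boto3 sketch for the upload is shown below; the Airflow bucket name is a placeholder, and the dags/ prefix assumes the default DAG folder configured for the MWAA environment.

```python
import boto3

s3 = boto3.client("s3")

# Placeholder: the S3 bucket configured for your MWAA environment.
airflow_bucket = "your-airflow-bucket-name"

# requirements.txt must also be referenced in the MWAA environment settings.
s3.upload_file("mwaa/requirements.txt", airflow_bucket, "requirements.txt")
# MWAA picks up DAG files from the dags/ prefix configured on the environment.
s3.upload_file("mwaa/ny_taxi_brew_trigger.py", airflow_bucket, "dags/ny_taxi_brew_trigger.py")
```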
The DAG is disabled (paused) by default.
Enable it with the On/Off toggle in the Airflow UI so that the scheduler picks it up.
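If you prefer not to use the UI, the DAG can also be unpaused through the MWAA CLI endpoint, as in the sketch below; it assumes Airflow 2.x command syntax, and the environment name and DAG id are placeholders.

```python
import boto3
import requests

mwaa = boto3.client("mwaa")

# Placeholder: your MWAA environment name.
token = mwaa.create_cli_token(Name="your-airflow-environment-name")

# POST an Airflow CLI command to the MWAA web server (Airflow 2.x syntax).
# Placeholder: replace your-dag-id with the DAG id defined in ny_taxi_brew_trigger.py.
response = requests.post(
    f"https://{token['WebServerHostname']}/aws_mwaa/cli",
    headers={
        "Authorization": f"Bearer {token['CliToken']}",
        "Content-Type": "text/plain",
    },
    data="dags unpause your-dag-id",
)
print(response.status_code, response.text)
```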