This repository contains the code for the AWS blog post: Orchestrate AWS Glue DataBrew jobs using Amazon Managed Workflows for Apache Airflow.
Download the following public datasets and upload them to the input S3 bucket (a boto3 upload sketch follows the list):
- NYC Yellow taxi trip records (2020-Jan)
- NYC Green taxi trip records (2020-Jan)
- NYC Taxi zone lookup dataset
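If you prefer to script the upload, here is a minimal boto3 sketch; the local file names, key prefixes, and the input bucket name are placeholders and should be adjusted to match the folder layout your DataBrew datasets expect.

```python
import boto3

s3 = boto3.client("s3")

# Placeholder bucket name and key prefixes -- adjust to match your environment
# and the folder layout referenced by your DataBrew datasets.
input_bucket = "your-input-bucket-name"
uploads = {
    "yellow_tripdata_2020-01.csv": "yellow-taxi/yellow_tripdata_2020-01.csv",
    "green_tripdata_2020-01.csv": "green-taxi/green_tripdata_2020-01.csv",
    "taxi_zone_lookup.csv": "taxi-zone-lookup/taxi_zone_lookup.csv",
}

for local_file, s3_key in uploads.items():
    s3.upload_file(local_file, input_bucket, s3_key)
    print(f"Uploaded {local_file} to s3://{input_bucket}/{s3_key}")
```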
The solution uses three buckets: an S3 input bucket, an S3 output bucket, and an Airflow bucket. Refer to the blog post for the folder structure of each bucket.
After creating the DataBrew projects, import the DataBrew recipes.
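For example, the recipes can also be imported programmatically with boto3, as in the sketch below; the recipe names and the local recipe JSON files are placeholders for recipe definitions exported from DataBrew, and you can equally import them in the DataBrew console.

```python
import json
import boto3

databrew = boto3.client("databrew")

# Placeholder recipe names and file paths -- replace with your exported recipe
# definitions (a JSON list of recipe steps) and the names your projects expect.
recipe_files = {
    "nyc-taxi-yellow-recipe": "recipes/yellow-taxi-recipe.json",
    "nyc-taxi-green-recipe": "recipes/green-taxi-recipe.json",
}

for recipe_name, recipe_file in recipe_files.items():
    with open(recipe_file) as f:
        steps = json.load(f)  # list of {"Action": {...}} step definitions
    databrew.create_recipe(Name=recipe_name, Steps=steps)
    # Publish so the recipe version can be attached to a DataBrew job.
    databrew.publish_recipe(Name=recipe_name)
```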
Create the external tables by running ./athena/nytaxi-trip-data-aggregated-summary.sql in Athena.
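You can paste the SQL into the Athena console, or run it with boto3 as in the sketch below; the Glue database name and the query result location are placeholders.

```python
import boto3

athena = boto3.client("athena")

# Placeholders -- use your own Glue database and an S3 location for query results.
database = "nyc_taxi"
output_location = "s3://your-output-bucket-name/athena-results/"

with open("athena/nytaxi-trip-data-aggregated-summary.sql") as f:
    # Athena runs one statement per request, so split the script on ';'
    # (assuming the script has no semicolons inside string literals).
    statements = [s.strip() for s in f.read().split(";") if s.strip()]

for statement in statements:
    athena.start_query_execution(
        QueryString=statement,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
```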
- Create your Airflow environment using the Amazon Managed Workflows for Apache Airflow (MWAA) CDK Python project
- Go to the IAM console and choose Roles
- Search for the Airflow execution role, whose name looks similar to AmazonMWAA-your-airflow-environment-name-xxxx
- Attach the following permissions to that role (a boto3 sketch follows the inline policy below):
- AmazonAthenaFullAccess
- AwsGlueDataBrewFullAccessPolicy
- AWSGlueDataBrewServiceRole
- an inline policy such as the following:
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "s3:GetBucketLocation", "s3:GetObject", "s3:ListBucket", "s3:ListBucketMultipartUploads", "s3:ListMultipartUploadParts", "s3:AbortMultipartUpload", "s3:CreateBucket", "s3:PutObject", "s3:PutBucketPublicAccessBlock" ], "Resource": [ "arn:aws:s3:::your-output-bucket-name", "arn:aws:s3:::your-output-bucket-name/*" ] } ] }
Upload requirements.txt and ny_taxi_brew_trigger.py from the ./mwaa directory to the Airflow S3 bucket.
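A minimal boto3 sketch for the upload is shown below; the Airflow bucket name is a placeholder, and the dags/ prefix assumes the default DAG folder configured for the MWAA environment.

```python
import boto3

s3 = boto3.client("s3")

# Placeholder: the S3 bucket configured for your MWAA environment.
airflow_bucket = "your-airflow-bucket-name"

# requirements.txt must also be referenced in the MWAA environment settings.
s3.upload_file("mwaa/requirements.txt", airflow_bucket, "requirements.txt")
# MWAA picks up DAG files from the dags/ prefix configured on the environment.
s3.upload_file("mwaa/ny_taxi_brew_trigger.py", airflow_bucket, "dags/ny_taxi_brew_trigger.py")
```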
The DAG is disabled (paused) by default.
Enable it with the On/Off toggle in the Airflow UI so that the scheduler picks it up.
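If you prefer not to use the UI, the DAG can also be unpaused through the MWAA CLI endpoint, as in the sketch below; it assumes Airflow 2.x command syntax, and the environment name and DAG id are placeholders.

```python
import boto3
import requests

mwaa = boto3.client("mwaa")

# Placeholder: your MWAA environment name.
token = mwaa.create_cli_token(Name="your-airflow-environment-name")

# POST an Airflow CLI command to the MWAA web server (Airflow 2.x syntax).
# Placeholder: replace your-dag-id with the DAG id defined in ny_taxi_brew_trigger.py.
response = requests.post(
    f"https://{token['WebServerHostname']}/aws_mwaa/cli",
    headers={
        "Authorization": f"Bearer {token['CliToken']}",
        "Content-Type": "text/plain",
    },
    data="dags unpause your-dag-id",
)
print(response.status_code, response.text)
```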