
Orchestrate AWS Glue DataBrew jobs using Amazon Managed Workflows for Apache Airflow

This is the source code for the AWS blog post: Orchestrate AWS Glue DataBrew jobs using Amazon Managed Workflows for Apache Airflow.

[Architecture diagram: databrew-mwaa-orchestration]

(Step 1) Data preparation

Download the public datasets and upload them to the input S3 bucket using the structure shown below.

S3 Input Bucket Structure

input-bucket-name
    |- yellow
    |- green
    |- taxi-lookup

S3 Output Bucket Structure

output-bucket-name
    |- yellow
    |- green
    |- taxi-lookup

Airflow Bucket Structure

airflow-bucket-name
    |- dags
    |- requirements
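For example, the uploads can be scripted with boto3. The bucket name and local file names below are placeholders for illustration, not values from this repository:

    import boto3

    s3 = boto3.client("s3")

    INPUT_BUCKET = "your-input-bucket-name"  # placeholder: your input bucket

    # local file -> S3 key, matching the prefixes shown in the bucket layout above
    uploads = {
        "yellow_tripdata_sample.csv": "yellow/yellow_tripdata_sample.csv",
        "green_tripdata_sample.csv": "green/green_tripdata_sample.csv",
        "taxi_zone_lookup.csv": "taxi-lookup/taxi_zone_lookup.csv",
    }

    for local_path, key in uploads.items():
        s3.upload_file(local_path, INPUT_BUCKET, key)
        print(f"uploaded s3://{INPUT_BUCKET}/{key}")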

(Step 2) Create DataBrew projects & Import DataBrew recipes

After creating the DataBrew projects, import the DataBrew recipes. For example:
[Screenshot: glue-databrew-import-recipes]
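Alternatively, a recipe can be imported programmatically with boto3; the recipe name and the JSON file name below are assumptions for illustration:

    import json
    import boto3

    databrew = boto3.client("databrew")

    # recipe steps downloaded/exported as JSON (file name is a placeholder)
    with open("nytaxi-recipe.json") as f:
        steps = json.load(f)

    # create a working version of the recipe, then publish it so jobs can use it
    databrew.create_recipe(Name="nytaxi-recipe", Steps=steps)
    databrew.publish_recipe(Name="nytaxi-recipe")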

(Step 3) Create Athena tables

Create the external tables by running ./athena/nytaxi-trip-data-aggregated-summary.sql in Athena.
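The DDL can be pasted into the Athena console, or submitted with boto3 as sketched below. The database name and query result location are placeholders; because Athena runs one statement per query execution, the file is split on semicolons:

    import boto3

    athena = boto3.client("athena")

    with open("./athena/nytaxi-trip-data-aggregated-summary.sql") as f:
        ddl = f.read()

    # Athena executes one statement at a time, so submit each CREATE TABLE separately
    for statement in (s.strip() for s in ddl.split(";")):
        if not statement:
            continue
        athena.start_query_execution(
            QueryString=statement,
            QueryExecutionContext={"Database": "default"},  # placeholder database
            ResultConfiguration={
                "OutputLocation": "s3://your-output-bucket-name/athena-query-results/"  # placeholder
            },
        )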

(Step 4) Create Apache Airflow environment

  • Create your Airflow environment according to the Amazon Managed Workflows for Apache Airflow (MWAA) CDK Python project
  • Go to the IAM Console - Roles
  • Search for the Airflow Instance role, whose name looks similar to AmazonMWAA-your-airflow-environment-name-xxxx
  • Attach the following permissions to the Airflow Instance role (a boto3 sketch for scripting these attachments follows the policy below):
    • AmazonAthenaFullAccess
    • AwsGlueDataBrewFullAccessPolicy
    • AWSGlueDataBrewServiceRole
    • an inline policy such as:
         {
             "Version": "2012-10-17",
             "Statement": [
                 {
                     "Effect": "Allow",
                     "Action": [
                         "s3:GetBucketLocation",
                         "s3:GetObject",
                         "s3:ListBucket",
                         "s3:ListBucketMultipartUploads",
                         "s3:ListMultipartUploadParts",
                         "s3:AbortMultipartUpload",
                         "s3:CreateBucket",
                         "s3:PutObject",
                         "s3:PutBucketPublicAccessBlock"
                     ],
                     "Resource": [
                         "arn:aws:s3:::your-output-bucket-name",
                         "arn:aws:s3:::your-output-bucket-name/*"
                     ]
                 }
             ]
         }
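If you prefer to script these attachments, a boto3 sketch follows. The role name is a placeholder, and the managed policy ARNs are assumptions, so verify the exact ARNs in the IAM console:

    import json
    import boto3

    iam = boto3.client("iam")

    ROLE_NAME = "AmazonMWAA-your-airflow-environment-name-xxxx"  # placeholder

    # managed policies listed above (ARNs assumed; check them in the IAM console)
    for policy_arn in [
        "arn:aws:iam::aws:policy/AmazonAthenaFullAccess",
        "arn:aws:iam::aws:policy/AwsGlueDataBrewFullAccessPolicy",
        "arn:aws:iam::aws:policy/service-role/AWSGlueDataBrewServiceRole",
    ]:
        iam.attach_role_policy(RoleName=ROLE_NAME, PolicyArn=policy_arn)

    # inline policy shown above, granting S3 access to the output bucket
    inline_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketLocation", "s3:GetObject", "s3:ListBucket",
                "s3:ListBucketMultipartUploads", "s3:ListMultipartUploadParts",
                "s3:AbortMultipartUpload", "s3:CreateBucket", "s3:PutObject",
                "s3:PutBucketPublicAccessBlock",
            ],
            "Resource": [
                "arn:aws:s3:::your-output-bucket-name",
                "arn:aws:s3:::your-output-bucket-name/*",
            ],
        }],
    }
    iam.put_role_policy(
        RoleName=ROLE_NAME,
        PolicyName="databrew-output-bucket-access",  # assumed inline policy name
        PolicyDocument=json.dumps(inline_policy),
    )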
         

(Step 5) Create an Airflow DAG

Upload requirements.txt and ny_taxi_brew_trigger.py from the ./mwaa directory to the Airflow S3 bucket (the requirements and dags prefixes shown above, respectively).
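The actual DAG is the ny_taxi_brew_trigger.py file in this repository. As a rough sketch of the pattern it implements, triggering a DataBrew job from Airflow and waiting for it to finish, the snippet below uses a PythonOperator with boto3; the DAG id, job name, and schedule are illustrative assumptions, not the repository's exact code:

    import time

    import boto3
    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.utils.dates import days_ago


    def run_databrew_job(job_name):
        """Start a DataBrew job run and poll until it reaches a terminal state."""
        databrew = boto3.client("databrew")
        run_id = databrew.start_job_run(Name=job_name)["RunId"]
        while True:
            state = databrew.describe_job_run(Name=job_name, RunId=run_id)["State"]
            if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
                break
            time.sleep(30)
        if state != "SUCCEEDED":
            raise RuntimeError(f"DataBrew job {job_name} ended in state {state}")


    with DAG(
        dag_id="ny_taxi_brew_trigger_sketch",  # placeholder DAG id
        start_date=days_ago(1),
        schedule_interval=None,                # trigger manually from the UI
        catchup=False,
    ) as dag:
        run_yellow = PythonOperator(
            task_id="run_yellow_taxi_job",
            python_callable=run_databrew_job,
            op_kwargs={"job_name": "yellow-taxi-job"},  # assumed DataBrew job name
        )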

(Step 6) Run the DAG

The DAG is paused (disabled) by default. Enable it using the On/Off toggle button so that the scheduler picks it up.
[Screenshot: airflow-unpause-dag]

To trigger the DAG, click on the play button shown below.
[Screenshot: airflow-trigger-dag]

References
