Lab guide to demonstrate moving data from an on-premises database to S3 and visualising with QuickSight

Overview

In this lab you'll learn how to extract data from a local relational database, transform the content into Parquet format and store it on S3 using AWS Glue. Finally, you'll use Amazon QuickSight to visualise the data and gain insight.

Setup

Generate a KeyPair


Note: If you are using Windows 7 or earlier you will need to download and install PuTTY and PuTTYgen from here.

  1. From the AWS console search for EC2 in the search box and select the service.

  2. From the left-hand menu select Key Pairs.

  3. Click the Create Key Pair button and enter glue-lab as the name for the demo. This will download the private key to your local machine.

Note: If you are running Windows you need to follow these instructions to convert the key for use with PuTTY.
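
If you prefer to script this step, the same key pair can be created with boto3. This is a minimal sketch, assuming your AWS credentials are already configured and that you use the same region as the database stack deployed below:

    import boto3

    # Create the key pair in the region where the database stack will be deployed
    ec2 = boto3.client("ec2", region_name="eu-west-2")

    response = ec2.create_key_pair(KeyName="glue-lab")

    # The private key material is only returned once, so save it straight away
    with open("glue-lab.pem", "w") as key_file:
        key_file.write(response["KeyMaterial"])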

Deploy a database to mimic on-premises

To demonstrate the data being held in a different location we'll build our fake database in the Ireland region using CloudFormation.

Click the button below to deploy the stack.

AWS Region: EU West (London), short name eu-west-2
  1. On the next page click Next
  2. For KeyPairName enter the key pair created above (glue-lab) and click Next
  3. Click Next
  4. Check the last two boxes:
    • I acknowledge that AWS CloudFormation might create IAM resources with custom names.
    • I acknowledge that AWS CloudFormation might require the following capability: CAPABILITY_AUTO_EXPAND

  5. Click Create Stack
  6. Wait for the stack to return CREATE_COMPLETE, then click the Outputs tab and record the database server IP address.

Note: In reality the IP would be a private address accessed via a VPN or Direct Connect.
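
You can also read the stack outputs programmatically once creation has finished. A boto3 sketch, where the stack name is a placeholder for the name shown in your CloudFormation console:

    import boto3

    # The stack name below is a placeholder; use the name shown in the CloudFormation console
    cfn = boto3.client("cloudformation", region_name="eu-west-2")

    stack = cfn.describe_stacks(StackName="glue-demo-database")["Stacks"][0]
    for output in stack.get("Outputs", []):
        print(output["OutputKey"], "=", output["OutputValue"])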

Data import

Prepare your VPC for Glue

If you are time-constrained you can stage the next four steps by running the following CloudFormation template. If you do this, skip ahead to Configuring a Glue Connection.

AWS Region: EU West (Ireland), short name eu-west-1
Setup an S3 endpoint

In order to securely transfer data from the on-premises database to S3, Glue uses an S3 endpoint, which allows the data to travel over the AWS backbone once it reaches your AWS VPC.

To demonstrate the data being consumed remotely from the VPC, as it would be from on-premises, we'll use the London region (eu-west-2).

  1. Click on endpoints on the left-hand menu
  2. Click on Create Endpoint
  3. Place a check next to com.amazonaws.eu-west-2.s3 and place a check next to the route table (starting rtb-) you created in the previous step

  4. Click Create Endpoint
  5. Click Close
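
The endpoint can equally be created with boto3. A sketch, where the VPC and route table IDs are placeholders for the values in your own account:

    import boto3

    ec2 = boto3.client("ec2", region_name="eu-west-2")

    ec2.create_vpc_endpoint(
        VpcEndpointType="Gateway",
        VpcId="vpc-0123456789abcdef0",             # the VPC Glue will use
        ServiceName="com.amazonaws.eu-west-2.s3",  # must match the VPC's region
        RouteTableIds=["rtb-0123456789abcdef0"],   # the route table noted in step 3
    )
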
Setup a NAT Gateway

Glue can only connect to the internet via a NAT Gateway, for security. In reality you would more likely be routing from a private subnet to an on-premises database via a VPN. However, for this lab we'll configure a NAT Gateway so we can reach the internet-accessible database we deployed in the previous step.

  1. In the top right of the AWS console choose London and then select Ireland from the dropdown.

Note: you can ignore the errors about the stack not existing.

  1. Click on the Services dropdown in the top right and select the service VPC
  2. Click on NAT Gateways on the left-hand menu
  3. Click Subnet and select any subnet
  4. Click Create New EIP
  5. Click Create a NAT Gateway, Close
  6. Click on Subnets on the left-hand menu
  7. Click Create Subnet
  8. Enter Glue Private Subnet for the Name Tag
  9. Enter an appropriate CIDR block in the IPv4 CIDR Block field
  10. Click Create
  11. Click on Route Tables on the left-hand menu
  12. Click Create Route Table
  13. Enter Glue private route as the Name Tag
  14. Click Create, Close
  15. Check the route table you just created and select Subnet Associations tab at the bottom
  16. Click Edit subnet associations
  17. Place a check next to the Glue Private Subnet
  18. Click Save
  19. Click the Routes tab
  20. Click edit routes
  21. Click Add Route
  22. Enter 0.0.0.0/0 for the Destination
  23. For the Target select the NAT Gateway you created earlier
  24. Click Save Routes, Close
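
For reference, the NAT gateway, private subnet and routing above map onto the following boto3 calls. All IDs and the CIDR block are placeholders for values from your own VPC:

    import boto3

    # Region selected at the start of this section
    ec2 = boto3.client("ec2", region_name="eu-west-1")

    # NAT gateway in a public subnet, using a new Elastic IP
    eip = ec2.allocate_address(Domain="vpc")
    nat = ec2.create_nat_gateway(SubnetId="subnet-0aaaaaaaaaaaaaaaa",
                                 AllocationId=eip["AllocationId"])["NatGateway"]
    ec2.get_waiter("nat_gateway_available").wait(NatGatewayIds=[nat["NatGatewayId"]])

    # Private subnet for Glue
    subnet = ec2.create_subnet(VpcId="vpc-0bbbbbbbbbbbbbbbb",
                               CidrBlock="172.31.96.0/24")["Subnet"]
    ec2.create_tags(Resources=[subnet["SubnetId"]],
                    Tags=[{"Key": "Name", "Value": "Glue Private Subnet"}])

    # Route table sending 0.0.0.0/0 through the NAT gateway
    rtb = ec2.create_route_table(VpcId="vpc-0bbbbbbbbbbbbbbbb")["RouteTable"]
    ec2.create_tags(Resources=[rtb["RouteTableId"]],
                    Tags=[{"Key": "Name", "Value": "Glue private route"}])
    ec2.associate_route_table(RouteTableId=rtb["RouteTableId"], SubnetId=subnet["SubnetId"])
    ec2.create_route(RouteTableId=rtb["RouteTableId"],
                     DestinationCidrBlock="0.0.0.0/0",
                     NatGatewayId=nat["NatGatewayId"])
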
Create a security group for Glue

Glue requires access out of the VPC to reach the database, as well as access to the Glue service and S3.

  1. From the left-hand menu click Security Groups
  2. Click Create security group
  3. For Security Group Name enter on-prem-glue-demo
  4. For Description enter Glue demo
  5. For VPC select the Default VPC
  6. Click Create, then Close
  7. Select the security group you just created and copy the Group Id to a text doc

  8. Select Actions --> Edit inbound rules
  9. Click Add Rule and enter All TCP for the Type
  10. Enter the Group Id recorded above in the field CIDR, IP, Security Group or Prefix List
  11. Click Save rules, Close
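
A boto3 sketch of the same security group; the VPC ID is a placeholder, and the self-referencing rule mirrors steps 9 and 10 above:

    import boto3

    ec2 = boto3.client("ec2", region_name="eu-west-1")

    sg_id = ec2.create_security_group(GroupName="on-prem-glue-demo",
                                      Description="Glue demo",
                                      VpcId="vpc-0bbbbbbbbbbbbbbbb")["GroupId"]

    # All TCP from the group itself, so Glue workers can communicate with each other
    ec2.authorize_security_group_ingress(
        GroupId=sg_id,
        IpPermissions=[{
            "IpProtocol": "tcp",
            "FromPort": 0,
            "ToPort": 65535,
            "UserIdGroupPairs": [{"GroupId": sg_id}],
        }],
    )
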
Setup a Glue IAM Role

In order for Glue to run we need to give the service the required permissions to manage infrastructure on our behalf.

  1. Click on the Services dropdown in the top right and select the service IAM
  2. On the left-hand menu select Roles
  3. Click Create Role
  4. Under Choose the service that will use this role select Glue
  5. Click Next: Permissions
  6. Search for Glue and place a check next to AWSGlueServiceRole
  7. Next search for s3 and place a check next to AmazonS3FullAccess
  8. Click Next: Tags
  9. Click Next: Review
  10. Enter glue-demo-role for the Role Name
  11. Click Create Role
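
The equivalent role can be created with boto3: trust the Glue service, then attach the two managed policies chosen above.

    import json
    import boto3

    iam = boto3.client("iam")

    # Allow the Glue service to assume the role
    trust_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }

    iam.create_role(RoleName="glue-demo-role",
                    AssumeRolePolicyDocument=json.dumps(trust_policy))
    iam.attach_role_policy(RoleName="glue-demo-role",
                           PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole")
    iam.attach_role_policy(RoleName="glue-demo-role",
                           PolicyArn="arn:aws:iam::aws:policy/AmazonS3FullAccess")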

Configuring a Glue Connection

Setup a Glue Connection

In order to transfer the data from the on-premises database we need to set up a Glue connection holding the database connection details.

  1. Click on the Services dropdown in the top right and select the service AWS Glue
  2. On the left-hand menu select Connections and click Add Connection
  3. Type the Name on-prem-database
  4. Select JDBC as the Connection Type and click Next
  5. For the JDBC connection enter the following string, replacing IP_ADDRESS with the IP address recorded from the CloudFormation stack output,
    jdbc:mysql://IP_ADDRESS:3306/employees
    
    e.g.
    jdbc:mysql://52.212.137.195:3306/employees
    
  6. For the Username enter dbuser and for the Password enter password12345
  7. Select your VPC created earlier, if you used the CloudFormation template it will be labeled glue-demo
  8. Select any private Subnet, e.g. glue-demo-private-a
  9. Select the Security Group with the name on-prem-glue-demo and choose Next
  10. Click Finish
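
For reference, the same connection expressed as a boto3 call; the subnet, security group and availability zone are placeholders and must belong to the VPC prepared earlier:

    import boto3

    # The region where Glue is being configured
    glue = boto3.client("glue", region_name="eu-west-1")

    glue.create_connection(ConnectionInput={
        "Name": "on-prem-database",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:mysql://IP_ADDRESS:3306/employees",
            "USERNAME": "dbuser",
            "PASSWORD": "password12345",
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0ccccccccccccccc1",
            "SecurityGroupIdList": ["sg-0ddddddddddddddd1"],
            "AvailabilityZone": "eu-west-1a",
        },
    })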

Test the Glue Connection

  1. Click Test Connection
  2. Select the role glue-demo-role created previously
  3. Click Test Connection
  4. This should result in success (it may take a few minutes)

Configure Glue ETL

Next we'll configure Glue to perform ETL on the data to convert it to Parquet and store it on S3.

Create an S3 bucket

In order to store the data extracted from the on-premises database we'll create an S3 bucket.

  1. Click on the Services dropdown in the top right and select the service S3
  2. Click Create Bucket
  3. Enter a unique name for the bucket e.g. firstname-lastname-glue-demo
  4. Click Create
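
Or, as a boto3 sketch (bucket names are globally unique, so substitute your own; outside us-east-1 a LocationConstraint is required):

    import boto3

    # Use the same region as the Glue resources
    s3 = boto3.client("s3", region_name="eu-west-1")

    s3.create_bucket(Bucket="firstname-lastname-glue-demo",
                     CreateBucketConfiguration={"LocationConstraint": "eu-west-1"})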

Create a Glue Job

  1. Click on the Services dropdown in the top right and select the service AWS Glue
  2. On the left-hand menu select Jobs
  3. Click Add Job
  4. Enter Glue-demo-job for the Name
  5. Select the Role glue-demo-role
  6. Under This job runs select A new script to be authored by you

  7. Click Next
  8. Under All Connections click Select next to on-prem-database
  9. Click Save job and edit script
  10. Paste in the script below
    import sys
    import boto3
    import json
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.dynamicframe import DynamicFrame
    from awsglue.job import Job
    
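    # Update these two values to match the S3 bucket and database IP address created earlier in the lab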
    s3_bucket_name = "s3://cjl-glue-mysql-database-sample"
    db_url = 'jdbc:mysql://52.30.96.60:3306/employees'
    
    ## @params: [JOB_NAME]
    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)
    
    db_username = 'dbuser'
    db_password = 'password12345'
    
    #Table current_dept_emp
    table_name = 'current_dept_emp'
    s3_output = s3_bucket_name + "/" + table_name
    
    # Connecting to the source
    df = glueContext.read.format("jdbc").option("url", db_url).option("dbtable", table_name).option("user", db_username).option("password", db_password).option("driver","com.mysql.jdbc.Driver").load()
    df.printSchema()
    print(df.count())
    datasource0 = DynamicFrame.fromDF(df, glueContext, "datasource0")
    applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("dept_no", "string", "dept_no", "string"), ("from_date", "date", "from_date", "date"), ("to_date", "date", "to_date", "date"), ("emp_no", "int", "emp_no", "int")], transformation_ctx = "applymapping1")
    resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
    dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
    datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": s3_bucket_name + "/" + table_name}, format = "parquet", transformation_ctx = "datasink4")
    
    #Table departments
    table_name = 'departments'
    s3_output = s3_bucket_name + "/" + table_name
    
    # Connecting to the source
    df = glueContext.read.format("jdbc").option("url", db_url).option("dbtable", table_name).option("user", db_username).option("password", db_password).option("driver","com.mysql.jdbc.Driver").load()
    df.printSchema()
    print(df.count())
    datasource0 = DynamicFrame.fromDF(df, glueContext, "datasource0")
    applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("dept_no", "string", "dept_no", "string"), ("dept_name", "string", "dept_name", "string")], transformation_ctx = "applymapping1")
    resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
    dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
    datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": s3_bucket_name + "/" + table_name}, format = "parquet", transformation_ctx = "datasink4")
    
    #Table dept_emp
    table_name = 'dept_emp'
    s3_output = s3_bucket_name + "/" + table_name
    
    # Connecting to the source
    df = glueContext.read.format("jdbc").option("url", db_url).option("dbtable", table_name).option("user", db_username).option("password", db_password).option("driver","com.mysql.jdbc.Driver").load()
    df.printSchema()
    print(df.count())
    datasource0 = DynamicFrame.fromDF(df, glueContext, "datasource0")
    applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("dept_no", "string", "dept_no", "string"), ("from_date", "date", "from_date", "date"), ("to_date", "date", "to_date", "date"), ("emp_no", "int", "emp_no", "int")], transformation_ctx = "applymapping1")
    resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
    dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
    datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": s3_bucket_name + "/" + table_name}, format = "parquet", transformation_ctx = "datasink4")
    
    #Table dept_emp_latest_date
    table_name = 'dept_emp_latest_date'
    s3_output = s3_bucket_name + "/" + table_name
    
    # Connecting to the source
    df = glueContext.read.format("jdbc").option("url", db_url).option("dbtable", table_name).option("user", db_username).option("password", db_password).option("driver","com.mysql.jdbc.Driver").load()
    df.printSchema()
    print(df.count())
    datasource0 = DynamicFrame.fromDF(df, glueContext, "datasource0")
    applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("from_date", "date", "from_date", "date"), ("to_date", "date", "to_date", "date"), ("emp_no", "int", "emp_no", "int")], transformation_ctx = "applymapping1")
    resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
    dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
    datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": s3_bucket_name + "/" + table_name}, format = "parquet", transformation_ctx = "datasink4")
    
    #Table dept_manager
    table_name = 'dept_manager'
    s3_output = s3_bucket_name + "/" + table_name
    
    # Connecting to the source
    df = glueContext.read.format("jdbc").option("url", db_url).option("dbtable", table_name).option("user", db_username).option("password", db_password).option("driver","com.mysql.jdbc.Driver").load()
    df.printSchema()
    print(df.count())
    datasource0 = DynamicFrame.fromDF(df, glueContext, "datasource0")
    applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("dept_no", "string", "dept_no", "string"), ("from_date", "date", "from_date", "date"), ("to_date", "date", "to_date", "date"), ("emp_no", "int", "emp_no", "int")], transformation_ctx = "applymapping1")
    resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
    dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
    datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": s3_bucket_name + "/" + table_name}, format = "parquet", transformation_ctx = "datasink4")
    
    #Table employees
    table_name = 'employees'
    s3_output = s3_bucket_name + "/" + table_name
    
    # Connecting to the source
    df = glueContext.read.format("jdbc").option("url", db_url).option("dbtable", table_name).option("user", db_username).option("password", db_password).option("driver","com.mysql.jdbc.Driver").load()
    df.printSchema()
    print(df.count())
    datasource0 = DynamicFrame.fromDF(df, glueContext, "datasource0")
    applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("gender", "string", "gender", "string"), ("emp_no", "int", "emp_no", "int"), ("birth_date", "date", "birth_date", "date"), ("last_name", "string", "last_name", "string"), ("hire_date", "date", "hire_date", "date"), ("first_name", "string", "first_name", "string")], transformation_ctx = "applymapping1")
    resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
    dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
    datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": s3_bucket_name + "/" + table_name}, format = "parquet", transformation_ctx = "datasink4")
    
    #Table salaries
    table_name = 'salaries'
    s3_output = s3_bucket_name + "/" + table_name
    
    # Connecting to the source
    df = glueContext.read.format("jdbc").option("url", db_url).option("dbtable", table_name).option("user", db_username).option("password", db_password).option("driver","com.mysql.jdbc.Driver").load()
    df.printSchema()
    print(df.count())
    datasource0 = DynamicFrame.fromDF(df, glueContext, "datasource0")
    applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("from_date", "date", "from_date", "date"), ("to_date", "date", "to_date", "date"), ("emp_no", "int", "emp_no", "int"), ("salary", "int", "salary", "int")], transformation_ctx = "applymapping1")
    resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
    dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
    datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": s3_bucket_name + "/" + table_name}, format = "parquet", transformation_ctx = "datasink4")
    
    #Table titles
    table_name = 'titles'
    s3_output = s3_bucket_name + "/" + table_name
    
    # Connecting to the source
    df = glueContext.read.format("jdbc").option("url", db_url).option("dbtable", table_name).option("user", db_username).option("password", db_password).option("driver","com.mysql.jdbc.Driver").load()
    df.printSchema()
    print(df.count())
    datasource0 = DynamicFrame.fromDF(df, glueContext, "datasource0")
    applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("from_date", "date", "from_date", "date"), ("to_date", "date", "to_date", "date"), ("emp_no", "int", "emp_no", "int"), ("title", "string", "title", "string")], transformation_ctx = "applymapping1")
    resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
    dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
    datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": s3_bucket_name + "/" + table_name}, format = "parquet", transformation_ctx = "datasink4")
    
    job.commit()
    
  11. Edit the s3_bucket_name and db_url variables near the top of the script so they reflect the S3 bucket and database IP address created above.
  12. Click Save, Run Job and then confirm by clicking Run Job

Note: you can move on to creating the crawler whilst the job is running, but make sure the job has completed before you run the crawler.
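
If you'd rather start and monitor the run from code than the console, a boto3 sketch:

    import time
    import boto3

    glue = boto3.client("glue", region_name="eu-west-1")

    # Start the job created above and poll until it reaches a terminal state
    run_id = glue.start_job_run(JobName="Glue-demo-job")["JobRunId"]
    while True:
        state = glue.get_job_run(JobName="Glue-demo-job", RunId=run_id)["JobRun"]["JobRunState"]
        print("Job state:", state)
        if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
            break
        time.sleep(30)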

Create a Glue Crawler

We use a Glue crawler to scan the data now held on S3 and create a schema so we can start to interrogate the information.

  1. From the left-hand menu select Crawlers
  2. Select Add Crawler
  3. For Name enter on-prem-database, click Next
  4. In the Include path enter the bucket name from earlier, e.g. s3://firstname-lastname-glue-demo

  5. Click Next, Next
  6. Select Choose an existing IAM role and select the role glue-demo-role
  7. Click Next, Next
  8. Click Add database and enter the name on-prem-employee-database, click Create
  9. Click Next

  10. Click Finish
  11. Place a check next to your crawler and click Run Crawler
  12. Wait for the crawler to run and then choose Tables from the left-hand menu
  13. This should show you the tables for your newly extracted data.
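
The crawler can also be defined and started with boto3; the role can be referenced by name, and the path is the bucket you created earlier:

    import boto3

    glue = boto3.client("glue", region_name="eu-west-1")

    glue.create_crawler(
        Name="on-prem-database",
        Role="glue-demo-role",
        DatabaseName="on-prem-employee-database",
        Targets={"S3Targets": [{"Path": "s3://firstname-lastname-glue-demo"}]},
    )
    glue.start_crawler(Name="on-prem-database")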

Data Visualisation

Once the data is available in S3 we can start to query and visualise it. In this section we'll use Amazon QuickSight, but other products such as Tableau or Qlik can also be integrated.

Note: If you haven't used QuickSight before you'll need to sign up.

QuickSight sign-up

  1. Click on the Services dropdown in the top right and select the service QuickSight
  2. Select Enterprise for the QuickSight version

  3. Click Continue
  4. Enter glue-demo for the QuickSight account name and an email address for the Notification email address
  5. Click Finish
  6. Click Go to Amazon QuickSight

Note: It will take a couple of minutes to get QuickSight ready

Allow QuickSight access to your data

  1. In the top right-hand corner click Admin, Manage QuickSight

  2. In the left-hand menu click Account Settings
  3. Click Manage QuickSight permissions
  4. Select Choose S3 buckets
  5. Place a check next to the bucket you created earlier
  6. Click Select Buckets and click Update
Prepare the datasource

  1. In the top right-hand corner click N. Virginia and select EU (Ireland)

  2. On the left-hand side click New Analysis
  3. Click New dataset
  4. Select Athena and enter glue-demo for the Data source Name
  5. Click Create Data Source
  6. Select on-prem-employee-database
  7. Select employees as the Table
  8. Click Edit/Preview data
  9. Click add data

  10. Select salaries as the table and click Select

  11. Click on the two circles and under the Join Clauses select emp_no for both employees and salaries

  12. Click Apply
  13. Click add data
  14. Select dept_manager as the table, click select
  15. Click on the two circles and under the Join Clauses select emp_no for both employees and dept_manager
  16. Click Apply
  17. Click add data
  18. Select departments as the table, click Select
  19. This time drag the departments box over the dept_manager box and release when it turns green.

  20. Click on the two circles and under the Join Clauses select dept_no for both dept_manager and departments
  21. Click Apply
  22. At the top click Save & Visualise
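
As a point of reference, the join QuickSight builds here can also be run directly against the crawled tables in Athena. A boto3 sketch of an equivalent query (the result location is a placeholder prefix in your bucket):

    import boto3

    athena = boto3.client("athena", region_name="eu-west-1")

    # Average salary by gender, joining the crawled employees and salaries tables
    query = """
    SELECT e.gender, AVG(s.salary) AS avg_salary
    FROM employees e
    JOIN salaries s ON e.emp_no = s.emp_no
    GROUP BY e.gender
    """

    athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "on-prem-employee-database"},
        ResultConfiguration={"OutputLocation": "s3://firstname-lastname-glue-demo/athena-results/"},
    )
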
Build some visualisations

  1. Choose your visualisation type, in this case select the pie chart icon

  2. From the left-hand menu select gender
  3. Click the bar saying Field Wells at the top
  4. Drag salary into the Value box, then click it and set the Aggregate to Average

  5. In the top left select Add, Add Visual
  6. This time select Horizontal bar chart

  7. Select dept_name from the left-hand menu
  8. Now drag salary to the Value box in the top bar
  9. Feel free to continue building your own visualisations to explore the data

Summary

During this lab you extracted data from an "on-premises" database, converted it to Parquet and stored the output in S3. You then used a combination of Athena and QuickSight to query and visualise the data and begin to exploit it.

This is a simple lab, but it could easily be expanded to pull data from multiple sources and correlate them for deeper insight.
