In this lab you'll learn how to extract data from a local relational database, transform the content into Parquet format and store it on S3 using AWS Glue. Finally, you'll use Amazon QuickSight to visualise the data and gain insight.
Generate a Key Pair
Note: If you are using Windows 7 or earlier you will need to download and install PuTTY and PuTTYgen from here.
- From the AWS console, search for EC2 in the search box and select the service.
- From the left-hand menu select Key Pairs.
- Click the Create Key Pair button and enter glue-lab as the name for the demo. This will download the private key to your local machine.
Note: If you are running Windows you will need to follow these instructions to convert the key for use with PuTTY.
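If you prefer to script this step, a minimal boto3 sketch is below. It assumes AWS credentials and a default region are already configured on your machine; the key name glue-lab matches the one used throughout this lab.

```python
import os
import boto3

# Assumes AWS credentials and a default region are already configured.
ec2 = boto3.client("ec2")

# Create the key pair used throughout this lab and save the private key locally.
key_pair = ec2.create_key_pair(KeyName="glue-lab")
with open("glue-lab.pem", "w") as f:
    f.write(key_pair["KeyMaterial"])
os.chmod("glue-lab.pem", 0o400)  # restrict permissions as SSH expects
```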
Deploy a database to mimic on-premises
To demonstrate the data being held in a different location we'll build our fake database in the Ireland region using CloudFormation.
Click the button below to deploy the stack.
AWS Region | Short name | Launch |
---|---|---|
EU West (London) | eu-west-2 | Launch Stack |
- On the next page click Next
- For KeyPairName enter the key pair created above (glue-lab) and click Next
- Click Next
- Check the last two boxes:
- I acknowledge that AWS CloudFormation might create IAM resources with custom names.
- I acknowledge that AWS CloudFormation might require the following capability: CAPABILITY_AUTO_EXPAND
- Click Create Stack
- Wait for the stack to return CREATE_COMPLETE, then click the Outputs tab and record the database server IP address.
Note: In reality the IP would be a private address accessed via a VPN or Direct Connect.
If you are time constrained you can stage the next four steps by running the following CloudFormation template. If you do this, skip ahead to Setup a Glue Connection.
AWS Region | Short name | Launch |
---|---|---|
EU West (Ireland) | eu-west-1 | Launch Stack |
Setup an S3 endpoint
In order to securely transfer data from the on-premises database to S3, Glue uses an S3 endpoint, which allows the data to travel over the AWS backbone once it reaches your AWS VPC.
To demonstrate the data being consumed from a location remote to the VPC, as it would be from on-premises, we'll use the London region (eu-west-2).
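For reference, a minimal boto3 sketch of creating an S3 gateway endpoint is shown below. The VPC ID, route table ID and region are placeholders you would substitute with your own values; this is the scripted equivalent of the console steps, not part of the lab's template.

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-2")  # assumption: the region you use for Glue

# Create a gateway endpoint so traffic to S3 stays on the AWS backbone.
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",            # placeholder: your VPC
    ServiceName="com.amazonaws.eu-west-2.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],  # placeholder: route table(s) to update
)
```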
Setup a NAT Gateway
For security, Glue can only reach the internet via a NAT Gateway. In reality you would be more likely to route from a private subnet to an on-premises database via a VPN. However, for this lab we'll configure a NAT Gateway so we can connect to the internet-accessible database we deployed in the previous step (a scripted equivalent is sketched after the steps below).
Note: you can ignore the errors about the stack not existing.
- Click on the Services dropdown in the top right and select the service VPC
- Click on NAT Gateways on the left-hand menu
- Click Create NAT Gateway
- Click Subnet and select any subnet
- Click Create New EIP
- Click Create a NAT Gateway, then Close
- Click on Subnets on the left-hand menu
- Click Create Subnet
- Enter Glue Private Subnet for the Name Tag
- Enter an appropriate CIDR block in the IPv4 CIDR Block field
- Click Create
- Click on Route Tables on the left-hand menu
- Click Create Route Table
- Enter Glue private route as the Name Tag
- Click Create, Close
- Check the route table you just created and select Subnet Associations tab at the bottom
- Click Edit subnet associations
- Place a check next to the Glue Private Subnet
- Click Save
- Click the Routes tab
- Click edit routes
- Click Add Route
- Enter 0.0.0.0/0 for the Destination
- For the Target select the NAT Gateway you created earlier
- Click Save Routes, Close
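As promised above, here is a condensed boto3 sketch of the same plumbing, assuming hypothetical VPC, subnet and CIDR values; it creates the NAT Gateway, the private subnet, a route table and the default route through the gateway.

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-2")  # assumption: the region you use for Glue

# Allocate an Elastic IP and create the NAT Gateway in a public subnet.
eip = ec2.allocate_address(Domain="vpc")
natgw = ec2.create_nat_gateway(
    SubnetId="subnet-0aaa1111bbbb2222c",  # placeholder: a public subnet
    AllocationId=eip["AllocationId"],
)
natgw_id = natgw["NatGateway"]["NatGatewayId"]
ec2.get_waiter("nat_gateway_available").wait(NatGatewayIds=[natgw_id])

# Create the private subnet and its route table, then add the default route.
vpc_id = "vpc-0123456789abcdef0"  # placeholder: your VPC
subnet = ec2.create_subnet(VpcId=vpc_id, CidrBlock="172.31.96.0/24")  # placeholder CIDR
rtb = ec2.create_route_table(VpcId=vpc_id)
rtb_id = rtb["RouteTable"]["RouteTableId"]
ec2.associate_route_table(RouteTableId=rtb_id, SubnetId=subnet["Subnet"]["SubnetId"])
ec2.create_route(RouteTableId=rtb_id, DestinationCidrBlock="0.0.0.0/0", NatGatewayId=natgw_id)
```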
Create a security group for Glue
Glue requires access out of the VPC to connect to the database, as well as to the Glue service itself and to S3 (a scripted sketch follows the steps below).
- From the left-hand menu click Security Groups
- Click Create security group
- For Security Group Name enter on-prem-glue-demo
- For Description enter Glue demo
- For VPC select the Default VPC
- Click Create, then Close
- Select the security group you just created and copy the Group Id to a text doc
- Select Actions --> Edit inbound rules
- Click Add Rule and enter All TCP for the Type
- Enter the Group Id recorded above in the field CIDR, IP, Security Group or Prefix List
- Click Save rules, Close
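A scripted equivalent of the self-referencing security group is sketched below, assuming a placeholder VPC ID; the key point is the inbound rule allowing all TCP traffic from members of the same group.

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-2")  # assumption: the region you use for Glue

# Create the security group Glue will attach to its elastic network interfaces.
sg = ec2.create_security_group(
    GroupName="on-prem-glue-demo",
    Description="Glue demo",
    VpcId="vpc-0123456789abcdef0",  # placeholder: your default VPC
)
sg_id = sg["GroupId"]

# Self-referencing inbound rule: allow all TCP from members of the same group.
ec2.authorize_security_group_ingress(
    GroupId=sg_id,
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 0,
        "ToPort": 65535,
        "UserIdGroupPairs": [{"GroupId": sg_id}],
    }],
)
```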
Setup a Glue IAM Role
In order for Glue to run we need to give the service the required permissions to manage infrastructure on our behalf (a scripted sketch follows the steps below).
- Click on the Services dropdown in the top right and select the service IAM
- On the left-hand menu select Roles
- Click Create Role
- Under Choose the service that will use this role select Glue
- Click Next: Permissions
- Search for Glue and place a check next to AWSGlueServiceRole
- Next search for s3 and place a check next to AmazonS3FullAccess
- Click Next: Tags
- Click Next: Review
- Enter glue-demo-role for the Role Name
- Click Create Role
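The same role can be created programmatically. The sketch below assumes the two managed policies named in the steps above and the standard Glue trust relationship.

```python
import json
import boto3

iam = boto3.client("iam")

# Trust policy allowing the Glue service to assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(RoleName="glue-demo-role", AssumeRolePolicyDocument=json.dumps(trust_policy))

# Attach the managed policies used in this lab.
for policy_arn in (
    "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
):
    iam.attach_role_policy(RoleName="glue-demo-role", PolicyArn=policy_arn)
```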
Setup a Glue Connection
In order to transfer the data from the on-premises database we need to set up a Glue connection with the database connection details (a scripted sketch follows the steps below).
- Click on the Services dropdown in the top right and select the service AWS Glue
- On the left-hand menu select Connections and click Add Connection
- Type the Name on-prem-database
- Select JDBC as the Connection Type and click Next
- For the JDBC connection enter the following string, replacing IP_ADDRESS with the IP address recorded from the CloudFormation stack output,
e.g.
jdbc:mysql://IP_ADDRESS:3306/employees
jdbc:mysql://52.212.137.195:3306/employees
- Enter dbuser for the Username and password12345 for the Password
- Select your VPC created earlier, if you used the CloudFormation template it will be labeled glue-demo
- Select any private Subnet, e.g. glue-demo-private-a
- Select the Security Group with the name on-prem-glue-demo and choose Next
- Click Finish
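If you would rather create the connection from code, the boto3 equivalent looks roughly like this; the subnet and security group IDs are placeholders for the ones selected above, and the JDBC URL should use the IP from your stack output.

```python
import boto3

glue = boto3.client("glue", region_name="eu-west-2")  # assumption: the region you use for Glue

glue.create_connection(
    ConnectionInput={
        "Name": "on-prem-database",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:mysql://IP_ADDRESS:3306/employees",  # substitute the stack output
            "USERNAME": "dbuser",
            "PASSWORD": "password12345",
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0aaa1111bbbb2222c",           # placeholder: private subnet
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],  # placeholder: on-prem-glue-demo
        },
    }
)
```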
Test the Glue Connection
Select the connection you just created, click Test Connection, choose the glue-demo-role IAM role and wait for the test to report success.
Next we'll configure Glue to perform ETL on the data, convert it to Parquet and store it on S3.
Create an S3 bucket
In order to store the data extracted from the on-premises database we'll create an S3 bucket. Choose a globally unique name, e.g. firstname-lastname-glue-demo (a scripted sketch follows below).
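A minimal boto3 sketch of creating the bucket; the bucket name and region are placeholders to adjust to your own values.

```python
import boto3

s3 = boto3.client("s3", region_name="eu-west-2")  # assumption: adjust to your region

# Bucket names are global, so pick something unique to you.
s3.create_bucket(
    Bucket="firstname-lastname-glue-demo",
    CreateBucketConfiguration={"LocationConstraint": "eu-west-2"},
)
```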
Create a Glue Job
- Click on the Services dropdown in the top right and select the service AWS Glue
- On the left-hand menu select Jobs
- Click Add Job
- Enter Glue-demo-job for the Name
- Select the Role glue-demo-role
- Under This job runs select A new script to be authored by you
- Click Next
- Under All Connections click Select next to on-prem-database
- Click Save job and edit script
- Paste in the script below
```python
import sys
import boto3
import json
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job

s3_bucket_name = "s3://cjl-glue-mysql-database-sample"
db_url = 'jdbc:mysql://52.30.96.60:3306/employees'

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

db_username = 'dbuser'
db_password = 'password12345'

# Table current_dept_emp
table_name = 'current_dept_emp'
s3_output = s3_bucket_name + "/" + table_name

# Connecting to the source
df = glueContext.read.format("jdbc").option("url", db_url).option("dbtable", table_name).option("user", db_username).option("password", db_password).option("driver", "com.mysql.jdbc.Driver").load()
df.printSchema()
print(df.count())

datasource0 = DynamicFrame.fromDF(df, glueContext, "datasource0")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("dept_no", "string", "dept_no", "string"), ("from_date", "date", "from_date", "date"), ("to_date", "date", "to_date", "date"), ("emp_no", "int", "emp_no", "int")], transformation_ctx = "applymapping1")
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": s3_output}, format = "parquet", transformation_ctx = "datasink4")

# Table departments
table_name = 'departments'
s3_output = s3_bucket_name + "/" + table_name

# Connecting to the source
df = glueContext.read.format("jdbc").option("url", db_url).option("dbtable", table_name).option("user", db_username).option("password", db_password).option("driver", "com.mysql.jdbc.Driver").load()
df.printSchema()
print(df.count())

datasource0 = DynamicFrame.fromDF(df, glueContext, "datasource0")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("dept_no", "string", "dept_no", "string"), ("dept_name", "string", "dept_name", "string")], transformation_ctx = "applymapping1")
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": s3_output}, format = "parquet", transformation_ctx = "datasink4")

# Table dept_emp
table_name = 'dept_emp'
s3_output = s3_bucket_name + "/" + table_name

# Connecting to the source
df = glueContext.read.format("jdbc").option("url", db_url).option("dbtable", table_name).option("user", db_username).option("password", db_password).option("driver", "com.mysql.jdbc.Driver").load()
df.printSchema()
print(df.count())

datasource0 = DynamicFrame.fromDF(df, glueContext, "datasource0")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("dept_no", "string", "dept_no", "string"), ("from_date", "date", "from_date", "date"), ("to_date", "date", "to_date", "date"), ("emp_no", "int", "emp_no", "int")], transformation_ctx = "applymapping1")
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": s3_output}, format = "parquet", transformation_ctx = "datasink4")

# Table dept_emp_latest_date
table_name = 'dept_emp_latest_date'
s3_output = s3_bucket_name + "/" + table_name

# Connecting to the source
df = glueContext.read.format("jdbc").option("url", db_url).option("dbtable", table_name).option("user", db_username).option("password", db_password).option("driver", "com.mysql.jdbc.Driver").load()
df.printSchema()
print(df.count())

datasource0 = DynamicFrame.fromDF(df, glueContext, "datasource0")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("from_date", "date", "from_date", "date"), ("to_date", "date", "to_date", "date"), ("emp_no", "int", "emp_no", "int")], transformation_ctx = "applymapping1")
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": s3_output}, format = "parquet", transformation_ctx = "datasink4")

# Table dept_manager
table_name = 'dept_manager'
s3_output = s3_bucket_name + "/" + table_name

# Connecting to the source
df = glueContext.read.format("jdbc").option("url", db_url).option("dbtable", table_name).option("user", db_username).option("password", db_password).option("driver", "com.mysql.jdbc.Driver").load()
df.printSchema()
print(df.count())

datasource0 = DynamicFrame.fromDF(df, glueContext, "datasource0")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("dept_no", "string", "dept_no", "string"), ("from_date", "date", "from_date", "date"), ("to_date", "date", "to_date", "date"), ("emp_no", "int", "emp_no", "int")], transformation_ctx = "applymapping1")
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": s3_output}, format = "parquet", transformation_ctx = "datasink4")

# Table employees
table_name = 'employees'
s3_output = s3_bucket_name + "/" + table_name

# Connecting to the source
df = glueContext.read.format("jdbc").option("url", db_url).option("dbtable", table_name).option("user", db_username).option("password", db_password).option("driver", "com.mysql.jdbc.Driver").load()
df.printSchema()
print(df.count())

datasource0 = DynamicFrame.fromDF(df, glueContext, "datasource0")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("gender", "string", "gender", "string"), ("emp_no", "int", "emp_no", "int"), ("birth_date", "date", "birth_date", "date"), ("last_name", "string", "last_name", "string"), ("hire_date", "date", "hire_date", "date"), ("first_name", "string", "first_name", "string")], transformation_ctx = "applymapping1")
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": s3_output}, format = "parquet", transformation_ctx = "datasink4")

# Table salaries
table_name = 'salaries'
s3_output = s3_bucket_name + "/" + table_name

# Connecting to the source
df = glueContext.read.format("jdbc").option("url", db_url).option("dbtable", table_name).option("user", db_username).option("password", db_password).option("driver", "com.mysql.jdbc.Driver").load()
df.printSchema()
print(df.count())

datasource0 = DynamicFrame.fromDF(df, glueContext, "datasource0")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("from_date", "date", "from_date", "date"), ("to_date", "date", "to_date", "date"), ("emp_no", "int", "emp_no", "int"), ("salary", "int", "salary", "int")], transformation_ctx = "applymapping1")
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": s3_output}, format = "parquet", transformation_ctx = "datasink4")

# Table titles
table_name = 'titles'
s3_output = s3_bucket_name + "/" + table_name

# Connecting to the source
df = glueContext.read.format("jdbc").option("url", db_url).option("dbtable", table_name).option("user", db_username).option("password", db_password).option("driver", "com.mysql.jdbc.Driver").load()
df.printSchema()
print(df.count())

datasource0 = DynamicFrame.fromDF(df, glueContext, "datasource0")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("from_date", "date", "from_date", "date"), ("to_date", "date", "to_date", "date"), ("emp_no", "int", "emp_no", "int"), ("title", "string", "title", "string")], transformation_ctx = "applymapping1")
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": s3_output}, format = "parquet", transformation_ctx = "datasink4")

job.commit()
```
- Edit lines 11 and 12 so the variables s3_bucket_name and db_url reflect the correct values created above.
- Click Save, Run Job and then confirm by clicking Run Job
Note: you can move on to creating the crawler while the job is running, but make sure the job is complete before you run the crawler (see the status-check sketch below).
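As an alternative to watching the console, a small boto3 sketch along these lines starts the job and checks its status (the job name matches the one created above; the region is an assumption):

```python
import boto3

glue = boto3.client("glue", region_name="eu-west-2")  # assumption: the region you use for Glue

# Start the job and check its status.
run = glue.start_job_run(JobName="Glue-demo-job")
status = glue.get_job_run(JobName="Glue-demo-job", RunId=run["JobRunId"])
print(status["JobRun"]["JobRunState"])  # e.g. RUNNING, SUCCEEDED, FAILED
```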
Create a Glue Crawler
We use a Glue crawler to scan the extracted data on S3 and create a schema so we can start to interrogate the information.
- From the left-hand menu select Crawlers
- Select Add Crawler
- For Name enter on-prem-database, click Next
- In the Include path enter the bucket name from earlier, e.g. s3://firstname-lastname-glue-demo
- Click Next, Next
- Select Choose an existing IAM role and select the role glue-demo-role
- Click Next, Next
- Click Add database and enter the name on-prem-employee-database, click Create
- Click Next
- Click Finish
- Place a check next to your crawler and click Run Crawler
- Wait for the crawler to run and then choose Tables from the left-hand menu
- This should show you the tables for your newly extracted data (you can also verify them programmatically, as sketched below).
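To confirm the crawler produced what you expect, you can also list the catalog tables with boto3. The database name matches the one created above; the region is an assumption to adjust to wherever you ran the crawler.

```python
import boto3

glue = boto3.client("glue", region_name="eu-west-2")  # assumption: the region where the crawler ran

# List the tables the crawler added to the Data Catalog and where they point.
resp = glue.get_tables(DatabaseName="on-prem-employee-database")
for table in resp["TableList"]:
    print(table["Name"], "->", table["StorageDescriptor"]["Location"])
```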
Once the data is available in S3 we can start to query and visualise it. In this section we'll be using Amazon QuickSight, but other products such as Tableau or Qlik can be integrated.
Note: If you haven't used QuickSight before you'll need to sign up.
QuickSight sign-up
- Click on the Services dropdown in the top right and select the service QuickSight
- Select Enterprise for the QuickSight version
- Click Continue
- Enter glue-demo for the QuickSight account name and an email address for the Notification email address
- Click Finish
- Click Go to Amazon QuickSight
Note: It will take a couple of minutes to get QuickSight ready
Allow QuickSight access to your data
Prepare the datasource
- In the top right-hand corner click N. Virginia and select EU (Ireland)
- On the left hand page click New Analysis
- Click New dataset
- Select Athena and enter glue-demo for the Data source Name
- Click Create Data Source
- Select on-prem-employee-database
- Select employees as the Table
- Click Edit/Preview data
- Click add data
- Select salaries as the table and click Select
- Click on the two circles and under the Join Clauses select emp_no for both employees and salaries
- Click Apply
- Click add data
- Select dept_manager as the table, click Select
- Click on the two circles and under the Join Clauses select emp_no for both employees and dept_manager
- Click Apply
- Click add data
- Select departments as the table, click Select
- This time drag the departments box over the dept_manager box and release when it turns green.
- Click on the two circles and under the Join Clauses select dept_no for both departments and dept_manager
- Click Apply
- At the top click Save & Visualise
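For reference, the joins you just built in QuickSight are equivalent to a SQL query against the Athena database. A hedged sketch of running it with boto3 is below; the region and results output location are placeholders, and the table/column names come from the crawled schema.

```python
import boto3

athena = boto3.client("athena", region_name="eu-west-1")  # assumption: your QuickSight/Athena region

query = """
SELECT e.emp_no, e.gender, d.dept_name, s.salary
FROM employees e
JOIN salaries s ON e.emp_no = s.emp_no
JOIN dept_manager m ON e.emp_no = m.emp_no
JOIN departments d ON m.dept_no = d.dept_no
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "on-prem-employee-database"},
    ResultConfiguration={"OutputLocation": "s3://firstname-lastname-glue-demo/athena-results/"},  # placeholder
)
```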
Build some visualisations
- Choose your visualisation type, in this case select the pie chart icon
- From the left-hand menu select gender
- Click the bar saying Field Wells at the top
- Drag salary into the Value box, then click it and set the Aggregate to Average
- In the top left select Add, then Add visual
- This time select Horizontal bar chart
- Select dept_name from the left-hand menu
- Now drag salary to the Value box in the top bar
- Feel free to continue building your own visualisations to explore the data
During this lab you extracted data from an "on-premises" database, converted it to Parquet and stored the output to S3. You then used a combination of Athena and QuickSight to query and visualise the data to start exploiting it.
This is a simple lab, but it could easily be expanded to pull data from multiple sources and start correlating it to gain deeper insight.