In this lab you'll learn how to extract data from a local relational database, transform the content into Parquet format and store it on S3 using AWS Glue. Finally, you'll use Amazon QuickSight to visualise the data and gain insight.
Generate a Key Pair
Note: If you are using Windows 7 or earlier you will need to download and install PuTTY and PuTTYgen from here.
- From the AWS console, search for EC2 in the search box and select the service.
- From the left-hand menu select Key Pairs.
- Click the Create Key Pair button and enter glue-lab as the name for the demo. This will download the private key to your local machine.
Note: If you are running Windows you will need to follow these instructions to convert the key for use with PuTTY.
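If you prefer to script this step, a minimal boto3 sketch is below. It assumes AWS credentials and a default region are already configured on your machine; the key name glue-lab matches the one used throughout this lab.

```python
import os
import boto3

# Assumes AWS credentials and a default region are already configured.
ec2 = boto3.client("ec2")

# Create the key pair used throughout this lab and save the private key locally.
key_pair = ec2.create_key_pair(KeyName="glue-lab")
with open("glue-lab.pem", "w") as f:
    f.write(key_pair["KeyMaterial"])
os.chmod("glue-lab.pem", 0o400)  # restrict permissions as SSH expects
```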
Deploy a database to mimic on-premises
To demonstrate the data being held in a different location we'll build our fake database in the Ireland region using CloudFormation.
Click the button below to deploy the stack.
AWS Region | Short name | Launch |
---|---|---|
EU West (London) | eu-west-2 | Launch Stack |
- On the next page click Next
- For KeyPairName enter the key pair created above (glue-lab) and click Next
- Click Next
- Check the last two boxes:
- I acknowledge that AWS CloudFormation might create IAM resources with custom names.
- I acknowledge that AWS CloudFormation might require the following capability: CAPABILITY_AUTO_EXPAND
- Click Create Stack
- Wait for the stack to return CREATE_COMPLETE, then click the Outputs tab and record the database server IP address.
Note: In reality the IP would be a private address accessed via a VPN or Direct Connect.
If you are time constrained you can stage the next four steps by running the following CloudFormation template. If you do this, skip ahead to Setup a Glue Connection.
AWS Region | Short name | Launch |
---|---|---|
EU West (Ireland) | eu-west-1 | Launch Stack |
Setup an S3 endpoint
In order to securely transfer data from the on-premises database to S3, Glue uses an S3 endpoint, which allows the data to travel over the AWS backbone once it reaches your AWS VPC.
To demonstrate the data being consumed from a location remote to the VPC, as it would be from on-premises, we'll use the London region (eu-west-2).
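For reference, a minimal boto3 sketch of creating an S3 gateway endpoint is shown below. The VPC ID, route table ID and region are placeholders you would substitute with your own values; this is the scripted equivalent of the console steps, not part of the lab's template.

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-2")  # assumption: the region you use for Glue

# Create a gateway endpoint so traffic to S3 stays on the AWS backbone.
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",            # placeholder: your VPC
    ServiceName="com.amazonaws.eu-west-2.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],  # placeholder: route table(s) to update
)
```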
Setup a NAT Gateway
For security, Glue can only reach the internet via a NAT Gateway. In reality you would be more likely to route from a private subnet to an on-premises database via a VPN. However, for this lab we'll configure a NAT Gateway so we can connect to the internet-accessible database we deployed in the previous step (a scripted equivalent is sketched after the steps below).
Note: you can ignore the errors about the stack not existing.
- Click on the Services dropdown in the top right and select the service VPC
- Click on NAT Gateways on the left-hand menu
- Click Create NAT Gateway
- Click Subnet and select any subnet
- Click Create New EIP
- Click Create a NAT Gateway, then Close
- Click on Subnets on the left-hand menu
- Click Create Subnet
- Enter Glue Private Subnet for the Name Tag
- Enter an appropriate CIDR block in the IPv4 CIDR Block field
- Click Create
- Click on Route Tables on the left-hand menu
- Click Create Route Table
- Enter Glue private route as the Name Tag
- Click Create, Close
- Check the route table you just created and select Subnet Associations tab at the bottom
- Click Edit subnet associations
- Place a check next to the Glue Private Subnet
- Click Save
- Click the Routes tab
- Click edit routes
- Click Add Route
- Enter 0.0.0.0/0 for the Destination
- For the Target select the NAT Gateway you created earlier
- Click Save Routes, Close
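As promised above, here is a condensed boto3 sketch of the same plumbing, assuming hypothetical VPC, subnet and CIDR values; it creates the NAT Gateway, the private subnet, a route table and the default route through the gateway.

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-2")  # assumption: the region you use for Glue

# Allocate an Elastic IP and create the NAT Gateway in a public subnet.
eip = ec2.allocate_address(Domain="vpc")
natgw = ec2.create_nat_gateway(
    SubnetId="subnet-0aaa1111bbbb2222c",  # placeholder: a public subnet
    AllocationId=eip["AllocationId"],
)
natgw_id = natgw["NatGateway"]["NatGatewayId"]
ec2.get_waiter("nat_gateway_available").wait(NatGatewayIds=[natgw_id])

# Create the private subnet and its route table, then add the default route.
vpc_id = "vpc-0123456789abcdef0"  # placeholder: your VPC
subnet = ec2.create_subnet(VpcId=vpc_id, CidrBlock="172.31.96.0/24")  # placeholder CIDR
rtb = ec2.create_route_table(VpcId=vpc_id)
rtb_id = rtb["RouteTable"]["RouteTableId"]
ec2.associate_route_table(RouteTableId=rtb_id, SubnetId=subnet["Subnet"]["SubnetId"])
ec2.create_route(RouteTableId=rtb_id, DestinationCidrBlock="0.0.0.0/0", NatGatewayId=natgw_id)
```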
Create a security group for Glue
Glue requires access out of the VPC to connect to the database, as well as to the Glue service itself and to S3 (a scripted sketch follows the steps below).
- From the left-hand menu click Security Groups
- Click Create security group
- For Security Group Name enter on-prem-glue-demo
- For Description enter Glue demo
- For VPC select the Default VPC
- Click Create, then Close
- Select the security group you just created and copy the Group Id to a text doc
- Select Actions --> Edit inbound rules
- Click Add Rule and enter All TCP for the Type
- Enter the Group Id recorded above in the field CIDR, IP, Security Group or Prefix List
- Click Save rules, Close
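A scripted equivalent of the self-referencing security group is sketched below, assuming a placeholder VPC ID; the key point is the inbound rule allowing all TCP traffic from members of the same group.

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-2")  # assumption: the region you use for Glue

# Create the security group Glue will attach to its elastic network interfaces.
sg = ec2.create_security_group(
    GroupName="on-prem-glue-demo",
    Description="Glue demo",
    VpcId="vpc-0123456789abcdef0",  # placeholder: your default VPC
)
sg_id = sg["GroupId"]

# Self-referencing inbound rule: allow all TCP from members of the same group.
ec2.authorize_security_group_ingress(
    GroupId=sg_id,
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 0,
        "ToPort": 65535,
        "UserIdGroupPairs": [{"GroupId": sg_id}],
    }],
)
```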
Setup a Glue IAM Role
In order for Glue to run we need to give the service the required permissions to manage infrastructure on our behalf (a scripted sketch follows the steps below).
- Click on the Services dropdown in the top right and select the service IAM
- On the left-hand menu select Roles
- Click Create Role
- Under Choose the service that will use this role select Glue
- Click Next: Permissions
- Search for Glue and place a check next to AWSGlueServiceRole
- Next search for s3 and place a check next to AmazonS3FullAccess
- Click Next: Tags
- Click Next: Review
- Enter glue-demo-role for the Role Name
- Click Create Role
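The same role can be created programmatically. The sketch below assumes the two managed policies named in the steps above and the standard Glue trust relationship.

```python
import json
import boto3

iam = boto3.client("iam")

# Trust policy allowing the Glue service to assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(RoleName="glue-demo-role", AssumeRolePolicyDocument=json.dumps(trust_policy))

# Attach the managed policies used in this lab.
for policy_arn in (
    "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
):
    iam.attach_role_policy(RoleName="glue-demo-role", PolicyArn=policy_arn)
```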
Setup a Glue Connection
In order to transfer the data from the on-premises database we need to set up a Glue connection with the database connection details (a scripted sketch follows the steps below).
- Click on the Services dropdown in the top right and select the service AWS Glue
- On the left-hand menu select Connections and click Add Connection
- Type the Name on-prem-database
- Select JDBC as the Connection Type and click Next
- For the JDBC connection enter the following string, replacing IP_ADDRESS with the IP address recorded from the CloudFormation stack output,
e.g.
jdbc:mysql://IP_ADDRESS:3306/employees
jdbc:mysql://52.212.137.195:3306/employees
- Enter dbuser for the Username and password12345 for the Password
- Select your VPC created earlier, if you used the CloudFormation template it will be labeled glue-demo
- Select any private Subnet, e.g. glue-demo-private-a
- Select the Security Group with the name on-prem-glue-demo and choose Next
- Click Finish
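If you would rather create the connection from code, the boto3 equivalent looks roughly like this; the subnet and security group IDs are placeholders for the ones selected above, and the JDBC URL should use the IP from your stack output.

```python
import boto3

glue = boto3.client("glue", region_name="eu-west-2")  # assumption: the region you use for Glue

glue.create_connection(
    ConnectionInput={
        "Name": "on-prem-database",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:mysql://IP_ADDRESS:3306/employees",  # substitute the stack output
            "USERNAME": "dbuser",
            "PASSWORD": "password12345",
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0aaa1111bbbb2222c",           # placeholder: private subnet
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],  # placeholder: on-prem-glue-demo
        },
    }
)
```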
Test the Glue Connection
Select the connection you just created, click Test Connection, choose the glue-demo-role IAM role and wait for the test to report success.
Next we'll configure Glue to perform ETL on the data, convert it to Parquet and store it on S3.
Create an S3 bucket
In order to store the data extracted from the on-premises database we'll create an S3 bucket. Choose a globally unique name, e.g. firstname-lastname-glue-demo (a scripted sketch follows below).
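A minimal boto3 sketch of creating the bucket; the bucket name and region are placeholders to adjust to your own values.

```python
import boto3

s3 = boto3.client("s3", region_name="eu-west-2")  # assumption: adjust to your region

# Bucket names are global, so pick something unique to you.
s3.create_bucket(
    Bucket="firstname-lastname-glue-demo",
    CreateBucketConfiguration={"LocationConstraint": "eu-west-2"},
)
```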
Create a Glue Job
- Click on the Services dropdown in the top right and select the service AWS Glue
- On the left-hand menu select Jobs
- Click Add Job
- Enter Glue-demo-job for the Name
- Select the Role glue-demo-role
- Under This job runs select A new script to be authored by you
- Click Next
- Under All Connections click Select next to on-prem-database
- Click Save job and edit script
- Paste in the script below
```python
import sys
import boto3
import json
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job

s3_bucket_name = "s3://cjl-glue-mysql-database-sample"
db_url = 'jdbc:mysql://52.30.96.60:3306/employees'

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

db_username = 'dbuser'
db_password = 'password12345'

# Table current_dept_emp
table_name = 'current_dept_emp'
s3_output = s3_bucket_name + "/" + table_name

# Connecting to the source
df = glueContext.read.format("jdbc").option("url", db_url).option("dbtable", table_name).option("user", db_username).option("password", db_password).option("driver", "com.mysql.jdbc.Driver").load()
df.printSchema()
print(df.count())

datasource0 = DynamicFrame.fromDF(df, glueContext, "datasource0")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("dept_no", "string", "dept_no", "string"), ("from_date", "date", "from_date", "date"), ("to_date", "date", "to_date", "date"), ("emp_no", "int", "emp_no", "int")], transformation_ctx = "applymapping1")
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": s3_output}, format = "parquet", transformation_ctx = "datasink4")

# Table departments
table_name = 'departments'
s3_output = s3_bucket_name + "/" + table_name

# Connecting to the source
df = glueContext.read.format("jdbc").option("url", db_url).option("dbtable", table_name).option("user", db_username).option("password", db_password).option("driver", "com.mysql.jdbc.Driver").load()
df.printSchema()
print(df.count())

datasource0 = DynamicFrame.fromDF(df, glueContext, "datasource0")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("dept_no", "string", "dept_no", "string"), ("dept_name", "string", "dept_name", "string")], transformation_ctx = "applymapping1")
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": s3_output}, format = "parquet", transformation_ctx = "datasink4")

# Table dept_emp
table_name = 'dept_emp'
s3_output = s3_bucket_name + "/" + table_name

# Connecting to the source
df = glueContext.read.format("jdbc").option("url", db_url).option("dbtable", table_name).option("user", db_username).option("password", db_password).option("driver", "com.mysql.jdbc.Driver").load()
df.printSchema()
print(df.count())

datasource0 = DynamicFrame.fromDF(df, glueContext, "datasource0")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("dept_no", "string", "dept_no", "string"), ("from_date", "date", "from_date", "date"), ("to_date", "date", "to_date", "date"), ("emp_no", "int", "emp_no", "int")], transformation_ctx = "applymapping1")
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": s3_output}, format = "parquet", transformation_ctx = "datasink4")

# Table dept_emp_latest_date
table_name = 'dept_emp_latest_date'
s3_output = s3_bucket_name + "/" + table_name

# Connecting to the source
df = glueContext.read.format("jdbc").option("url", db_url).option("dbtable", table_name).option("user", db_username).option("password", db_password).option("driver", "com.mysql.jdbc.Driver").load()
df.printSchema()
print(df.count())

datasource0 = DynamicFrame.fromDF(df, glueContext, "datasource0")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("from_date", "date", "from_date", "date"), ("to_date", "date", "to_date", "date"), ("emp_no", "int", "emp_no", "int")], transformation_ctx = "applymapping1")
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": s3_output}, format = "parquet", transformation_ctx = "datasink4")

# Table dept_manager
table_name = 'dept_manager'
s3_output = s3_bucket_name + "/" + table_name

# Connecting to the source
df = glueContext.read.format("jdbc").option("url", db_url).option("dbtable", table_name).option("user", db_username).option("password", db_password).option("driver", "com.mysql.jdbc.Driver").load()
df.printSchema()
print(df.count())

datasource0 = DynamicFrame.fromDF(df, glueContext, "datasource0")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("dept_no", "string", "dept_no", "string"), ("from_date", "date", "from_date", "date"), ("to_date", "date", "to_date", "date"), ("emp_no", "int", "emp_no", "int")], transformation_ctx = "applymapping1")
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": s3_output}, format = "parquet", transformation_ctx = "datasink4")

# Table employees
table_name = 'employees'
s3_output = s3_bucket_name + "/" + table_name

# Connecting to the source
df = glueContext.read.format("jdbc").option("url", db_url).option("dbtable", table_name).option("user", db_username).option("password", db_password).option("driver", "com.mysql.jdbc.Driver").load()
df.printSchema()
print(df.count())

datasource0 = DynamicFrame.fromDF(df, glueContext, "datasource0")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("gender", "string", "gender", "string"), ("emp_no", "int", "emp_no", "int"), ("birth_date", "date", "birth_date", "date"), ("last_name", "string", "last_name", "string"), ("hire_date", "date", "hire_date", "date"), ("first_name", "string", "first_name", "string")], transformation_ctx = "applymapping1")
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": s3_output}, format = "parquet", transformation_ctx = "datasink4")

# Table salaries
table_name = 'salaries'
s3_output = s3_bucket_name + "/" + table_name

# Connecting to the source
df = glueContext.read.format("jdbc").option("url", db_url).option("dbtable", table_name).option("user", db_username).option("password", db_password).option("driver", "com.mysql.jdbc.Driver").load()
df.printSchema()
print(df.count())

datasource0 = DynamicFrame.fromDF(df, glueContext, "datasource0")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("from_date", "date", "from_date", "date"), ("to_date", "date", "to_date", "date"), ("emp_no", "int", "emp_no", "int"), ("salary", "int", "salary", "int")], transformation_ctx = "applymapping1")
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": s3_output}, format = "parquet", transformation_ctx = "datasink4")

# Table titles
table_name = 'titles'
s3_output = s3_bucket_name + "/" + table_name

# Connecting to the source
df = glueContext.read.format("jdbc").option("url", db_url).option("dbtable", table_name).option("user", db_username).option("password", db_password).option("driver", "com.mysql.jdbc.Driver").load()
df.printSchema()
print(df.count())

datasource0 = DynamicFrame.fromDF(df, glueContext, "datasource0")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("from_date", "date", "from_date", "date"), ("to_date", "date", "to_date", "date"), ("emp_no", "int", "emp_no", "int"), ("title", "string", "title", "string")], transformation_ctx = "applymapping1")
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": s3_output}, format = "parquet", transformation_ctx = "datasink4")

job.commit()
```
- Edit lines 11 and 12 so the variables s3_bucket_name and db_url reflect the correct values created above.
- Click Save, Run Job and then confirm by clicking Run Job
Note: you can move on to creating the crawler while the job is running, but make sure the job is complete before you run the crawler (see the status-check sketch below).
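As an alternative to watching the console, a small boto3 sketch along these lines starts the job and checks its status (the job name matches the one created above; the region is an assumption):

```python
import boto3

glue = boto3.client("glue", region_name="eu-west-2")  # assumption: the region you use for Glue

# Start the job and check its status.
run = glue.start_job_run(JobName="Glue-demo-job")
status = glue.get_job_run(JobName="Glue-demo-job", RunId=run["JobRunId"])
print(status["JobRun"]["JobRunState"])  # e.g. RUNNING, SUCCEEDED, FAILED
```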
Create a Glue Crawler
We use a Glue crawler to scan the extracted data on S3 and create a schema so we can start to interrogate the information.
- From the left-hand menu select Crawlers
- Select Add Crawler
- For Name enter on-prem-database, click Next
- In the Include path enter the bucket name from earlier, e.g. s3://firstname-lastname-glue-demo
- Click Next, Next
- Select Choose an existing IAM role and select the role glue-demo-role
- Click Next, Next
- Click Add database and enter the name on-prem-employee-database, click Create
- Click Next
- Click Finish
- Place a check next to your crawler and click Run Crawler
- Wait for the crawler to run and then choose Tables from the left-hand menu
- This should show you the tables for your newly extracted data (you can also verify them programmatically, as sketched below).
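To confirm the crawler produced what you expect, you can also list the catalog tables with boto3. The database name matches the one created above; the region is an assumption to adjust to wherever you ran the crawler.

```python
import boto3

glue = boto3.client("glue", region_name="eu-west-2")  # assumption: the region where the crawler ran

# List the tables the crawler added to the Data Catalog and where they point.
resp = glue.get_tables(DatabaseName="on-prem-employee-database")
for table in resp["TableList"]:
    print(table["Name"], "->", table["StorageDescriptor"]["Location"])
```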
Once the data is available in S3 we can start to query and visualise it. In this section we'll be using Amazon QuickSight, but other products such as Tableau or Qlik can be integrated.
Note: If you haven't used QuickSight before you'll need to sign up.
QuickSight sign-up
- Click on the Services dropdown in the top right and select the service QuickSight
- Select Enterprise for the QuickSight version
- Click Continue
- Enter glue-demo for the QuickSight account name and an email address for the Notification email address
- Click Finish
- Click Go to Amazon QuickSight
Note: It will take a couple of minutes to get QuickSight ready
Allow QuickSight access to your data
Prepare the datasource
- In the top right-hand corner click N. Virginia and select EU (Ireland)
- On the left hand page click New Analysis
- Click New dataset
- Select Athena and enter glue-demo for the Data source Name
- Click Create Data Source
- Select on-prem-employee-database
- Select employees as the Table
- Click Edit/Preview data
- Click add data
- Select salaries as the table and click Select
- Click on the two circles and under the Join Clauses select emp_no for both employees and salaries
- Click Apply
- Click add data
- Select dept_manager as the table, click Select
- Click on the two circles and under the Join Clauses select emp_no for both employees and dept_manager
- Click Apply
- Click add data
- Select departments as the table, click Select
- This time drag the departments box over the dept_manager box and release when it turns green.
- Click on the two circles and under the Join Clauses select dept_no for both departments and dept_manager
- Click Apply
- At the top click Save & Visualise
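For reference, the joins you just built in QuickSight are equivalent to a SQL query against the Athena database. A hedged sketch of running it with boto3 is below; the region and results output location are placeholders, and the table/column names come from the crawled schema.

```python
import boto3

athena = boto3.client("athena", region_name="eu-west-1")  # assumption: your QuickSight/Athena region

query = """
SELECT e.emp_no, e.gender, d.dept_name, s.salary
FROM employees e
JOIN salaries s ON e.emp_no = s.emp_no
JOIN dept_manager m ON e.emp_no = m.emp_no
JOIN departments d ON m.dept_no = d.dept_no
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "on-prem-employee-database"},
    ResultConfiguration={"OutputLocation": "s3://firstname-lastname-glue-demo/athena-results/"},  # placeholder
)
```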
Build some visualisations
- Choose your visualisation type, in this case select the pie chart icon
- From the left-hand menu select gender
- Click the bar saying Field Wells at the top
- Drag salary into the Value box, then click it and set the Aggregate to Average
- In the top left select Add, then Add visual
- This time select Horizontal bar chart
- Select dept_name from the left-hand menu
- Now drag salary to the Value box in the top bar
- Feel free to continue building your own visualisations to explore the data
During this lab you extracted data from an "on-premises" database, converted it to Parquet and stored the output to S3. You then used a combination of Athena and QuickSight to query and visualise the data to start exploiting it.
This is a simple lab, but it could easily be expanded to pull data from multiple sources and start correlating it to gain deeper insight.