This documentation provides a comprehensive guide to deploying an AWS Data Lake solution using the AWS Cloud Development Kit (CDK) in Python. The solution includes:
- An Amazon S3 bucket acting as the data lake storage.
- An AWS Glue crawler and database for data cataloging.
- An Amazon Athena workgroup for querying the data.
- An Amazon QuickSight setup for data visualization.
- An AWS Budgets alarm that sends a notification when monthly costs exceed a user-defined amount in USD.
This guide explains each component, shows the architecture, and gives deployment and cleanup instructions.
Below is a diagram representing the architecture of the AWS resources:
The data lake stack sets up the foundational data lake components:
- Amazon S3 Bucket (data_lake_bucket):
  - Stores raw and processed data.
  - Versioning and server-side encryption enabled.
  - Configured to auto-delete objects and the bucket upon stack deletion.
- AWS Glue Crawler (glue_crawler):
  - Scans the S3 bucket to detect schema changes.
  - Updates the AWS Glue Data Catalog.
- AWS Glue Database (glue_database):
  - Stores metadata about the data in the S3 bucket.
- Amazon Athena Workgroup (athena_workgroup):
  - Executes queries against data cataloged by AWS Glue.
  - Stores query results in a specified S3 location within the data lake bucket.
- IAM Roles and Policies:
  - glue_crawler_role: Grants AWS Glue permissions to read/write to the S3 bucket.
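The read/write access described for glue_crawler_role can be pictured as an IAM policy document. The following is a minimal sketch, not the project's actual policy: the function name `crawler_bucket_policy`, the example bucket name, and the exact list of S3 actions are assumptions chosen for illustration. A real crawler role usually also needs Glue service permissions, which are not shown here.

```python
import json


def crawler_bucket_policy(bucket_name: str) -> dict:
    """Build an IAM policy document allowing read/write access to one bucket.

    Hypothetical helper: the action list is an assumption, not taken
    from the project's code.
    """
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing keys is granted on the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
            {
                # Reading and writing objects is granted on keys inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": [f"{bucket_arn}/*"],
            },
        ],
    }


if __name__ == "__main__":
    # "example-data-lake-bucket" is a placeholder name.
    print(json.dumps(crawler_bucket_policy("example-data-lake-bucket"), indent=2))
```

Splitting the statements this way reflects how S3 permissions are scoped: bucket-level actions target the bucket ARN, while object-level actions target the keys under it.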
The QuickSight stack sets up Amazon QuickSight resources for data visualization:
- IAM Role (quicksight_role):
  - Allows QuickSight to access Athena and S3.
  - Must be manually assigned in QuickSight settings.
- QuickSight Data Source (data_source):
  - Connects QuickSight to Athena using the specified workgroup.
- QuickSight Dataset (dataset):
  - Defines the data to be used for analyses and dashboards.
- Custom Resource for Cleanup:
  - AWS Lambda function (cleanup_function) to delete QuickSight resources upon stack deletion.
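The cleanup behaviour described above could look roughly like the handler below. This is a sketch under stated assumptions, not the project's cleanup_function: it assumes the Lambda is wired through the CDK custom resource provider framework (which expects a returned `PhysicalResourceId`), and the property names `AwsAccountId`, `DataSetId`, and `DataSourceId` inside `ResourceProperties` are hypothetical. The QuickSight client is passed in as a parameter so the logic can be exercised without AWS access.

```python
def cleanup_quicksight(event: dict, quicksight) -> dict:
    """Delete the QuickSight dataset and data source when the stack is deleted.

    `quicksight` is expected to behave like boto3.client("quicksight").
    The ResourceProperties keys used here are assumptions for illustration.
    """
    physical_id = event.get("PhysicalResourceId", "quicksight-cleanup")

    # Create and Update events need no work; only Delete triggers cleanup.
    if event["RequestType"] != "Delete":
        return {"PhysicalResourceId": physical_id}

    props = event["ResourceProperties"]
    account_id = props["AwsAccountId"]

    # Remove the dataset first, since it is built on top of the data source.
    quicksight.delete_data_set(AwsAccountId=account_id, DataSetId=props["DataSetId"])
    quicksight.delete_data_source(
        AwsAccountId=account_id, DataSourceId=props["DataSourceId"]
    )
    return {"PhysicalResourceId": physical_id}


def handler(event, context):
    """Lambda entry point; boto3 is provided by the Lambda Python runtime."""
    import boto3

    return cleanup_quicksight(event, boto3.client("quicksight"))
```

Keeping the AWS client out of the core function is a design choice made for this sketch only; it allows a stub object to stand in for QuickSight during local testing.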
The budget stack creates a budget alarm to monitor AWS costs:
- AWS Budget (budget):
  - Sets a monthly budget limit of monthly_budget_usd, a user-defined amount in USD.
  - Sends notifications when actual spend exceeds 100% of the budget.
  - Delivers notifications by email to the address defined in the quicksight_and_alarm_email variable.
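To make the budget settings above concrete, the sketch below builds the properties of an AWS::Budgets::Budget CloudFormation resource as a plain dictionary. The helper names `budget_properties` and `exceeds_budget` are hypothetical, and how the project actually passes these values to the CDK is not shown in this guide; only the two inputs, monthly_budget_usd and quicksight_and_alarm_email, come from the description.

```python
def budget_properties(monthly_budget_usd: float, email: str) -> dict:
    """Return AWS::Budgets::Budget properties for a monthly cost budget.

    Hypothetical helper mirroring the bullets above: a monthly USD limit and
    an email notification once actual spend passes 100% of that limit.
    """
    return {
        "Budget": {
            "BudgetType": "COST",
            "TimeUnit": "MONTHLY",
            "BudgetLimit": {"Amount": monthly_budget_usd, "Unit": "USD"},
        },
        "NotificationsWithSubscribers": [
            {
                "Notification": {
                    "NotificationType": "ACTUAL",
                    "ComparisonOperator": "GREATER_THAN",
                    "Threshold": 100,
                    "ThresholdType": "PERCENTAGE",
                },
                "Subscribers": [{"SubscriptionType": "EMAIL", "Address": email}],
            }
        ],
    }


def exceeds_budget(actual_usd: float, monthly_budget_usd: float, threshold_pct: float = 100) -> bool:
    """Illustrate the threshold rule: notify only when spend is above the limit."""
    return actual_usd > monthly_budget_usd * threshold_pct / 100
```

With a limit of 50 USD, for instance, a month-to-date spend of exactly 50 USD does not trigger the notification, while 50.01 USD does, because the comparison is strictly greater than.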
Before deploying, make sure the following are in place:
- AWS Account: An AWS account with permissions to create the necessary resources.
- AWS CLI Installed: Ensure the AWS CLI is installed and configured with your credentials.
- AWS CDK Installed: Install the AWS CDK if not already installed.
    npm install -g aws-cdk
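As a quick way to confirm the tooling prerequisites above, a short script can check whether the `aws` and `cdk` executables are on the PATH. This is only a sketch: the function name `missing_tools` is made up for illustration, and the check does not verify that credentials are configured or that the account has sufficient permissions.

```python
import shutil


def missing_tools(tools=("aws", "cdk")) -> list:
    """Return the names from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("AWS CLI and AWS CDK were found on PATH.")
```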
To deploy the solution:
- Clone the Repository or Create Project Structure:
    mkdir data-lake-cdk-demo
    cd data-lake-cdk-demo
    cdk init app --language python
- Install Python Dependencies:
    pip install -r requirements.txt
- Update Placeholder Values: Adapt all the variables under the data_lake_constants key to your environment.
- Bootstrap Your AWS Environment:
    cdk bootstrap
- Synthesize the CDK App:
    cdk synth
- Deploy the CDK App:
    cdk deploy --all --require-approval never
- Confirm Budget Subscription: Check your email for a confirmation message from AWS Budgets and confirm your subscription.
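The placeholder values mentioned in the steps above can be sanity-checked before running the deployment. The sketch below assumes that monthly_budget_usd and quicksight_and_alarm_email are stored under the data_lake_constants key and have already been loaded into a Python dictionary; where that key lives, and whether it holds further variables, is not stated in this guide, so the function `validate_constants` and its rules are illustrative only.

```python
import re

# Assumed to live under data_lake_constants; other keys may exist as well.
REQUIRED_KEYS = ("monthly_budget_usd", "quicksight_and_alarm_email")


def validate_constants(constants: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in constants]

    budget = constants.get("monthly_budget_usd")
    if budget is not None:
        # bool is a subclass of int, so it is rejected explicitly.
        is_number = isinstance(budget, (int, float)) and not isinstance(budget, bool)
        if not (is_number and budget > 0):
            problems.append("monthly_budget_usd must be a positive number")

    email = constants.get("quicksight_and_alarm_email")
    if email is not None:
        # Loose shape check only: something@something.something
        if not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", str(email)):
            problems.append("quicksight_and_alarm_email does not look like an email address")

    return problems


if __name__ == "__main__":
    # Placeholder values for demonstration.
    example = {"monthly_budget_usd": 50, "quicksight_and_alarm_email": "me@example.com"}
    print(validate_constants(example) or "constants look fine")
```

Returning a list of problems rather than raising on the first one lets all placeholder mistakes be fixed in a single pass before `cdk deploy` is run.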
To delete all resources created by the CDK stacks, run the following command:
cdk destroy --all