On-prem DB to Azure Cloud Pipeline with Data Factory, Lake Storage, Spark, Databricks, Synapse, PowerBI
- Project Overview
- Key Insights
- Project Architecture
3.1. Data Ingestion
3.2. Data Transformation
3.3. Data Loading
3.4. Data Reporting - Credits
- Contact
This an end-to-end data engineering project on the Azure cloud. Where I did data ingestion from a on-premise SQL Server to Azure Data Lake using Data Factory to transformation using Databricks and Spark, loading to Synapse, and reporting using PowerBI.
Dataset link : https://drive.google.com/file/d/1i4aRieq_WDVJDGpqtZq8UW9CH8sCbaBd/view?pli=1
In this project we are going to create an end to end data platform right from Data Ingestion, Data Transformation, Data Loading and Reporting.
The tools that are covered in this project are,
- SQL server migration
- Azure Data Factory
- Azure Data Lake Storage Gen2
- Azure Databricks
- PYSPARK
- SPARK SQL
- Microsoft Power BI
The use case for this project is building an end to end solution by ingesting the tables from on-premise SQL Server database using Azure Data Factory and then store the data in Azure Data Lake. Then Azure databricks is used to transform the RAW data to the most cleanest form of data and finally using Microsoft Power BI to integrate with Azure synapse analytics to build an interactive dashboard.
- Establish a connection between on-premise SQL server and Azure cloud.
- Ingest tables into the Azure Data Lake.
- Apply data cleaning and transformation using Azure Databricks.
- Utilize Azure Synapse Analytics for loading clean data.
- Create interactive data visualizations and reports with Microsoft Power BI.
-
💸 Total Revenue by Product Category
-
🌍 Sales by Pizza Name and size
- N°1: The L size pizza generated the total revenue of 45%.
- N°2: The M size pizza generated the total revenue of 30.49%.
-
🚻 Sales by Pizza category
- 26.91% of the revenue is generated by Classic pizza category
- 23.96% of the revenue is generated by Chicken pizza category
This can be explained by males have more interest in doing outdoor activites with the different categories of Bikes than females.
You can find the detailed information on the diagram below:
-
Connected the on-premise SQL Server with Azure using Microsoft Integration Runtime.
-
Setup the Resource group with needed services (Storage Account, Data Factory, Databricks, Synapse Analytics)
-
Migrated the tables from on-premise SQL Server to Azure Data Lake Storage Gen2.
- Mounted Azure Blob Storage to Databricks to retrieve raw data from the Data Lake.
- Used Spark Cluster in Azure Databricks to clean and refine the raw data And do some aggregations.
- Saved the cleaned data in a PARQUET format; optimized for further analysis.
- Used Azure Synapse Analytics to load the refined data efficiently.
- Created SQL database and connected it to the data lake.
- Connected Microsoft Power BI to Azure Synapse, and used the Views of the DB to create interactive and insightful data visualizations.
- Data Source: SQL Server
- Orchestration: Azure Data Factory
- Ingestion: Azure Data Lake Gen2
- Storage: Azure Synapse Analytics(if required)
- Data Visualization: PowerBI
- This Project is inspired by the video of the YouTube Channel "Learn by doing it"