- This project acquires several COVID-19 datasets from the ECDC website, transforms them using Azure Data Factory (ADF) components and Databricks, and loads the results into Data Lake so that the Analytics team can draw meaningful, practical insights from them. The primary objective is to gain a thorough understanding of how ADF works.
- The project's goal is to ingest data from multiple data sources, clean it, and transform it using Data Flows and Databricks. Once the data has been cleaned, it is loaded into a processed Data Lake Gen2 container.
- ECDC (https://www.ecdc.europa.eu/en/covid-19)
- Population Data From Azure Files (eurostat_data)
- Download these Datasets
- Azure Data Lake Gen2 Storage
- ADF Pipelines
- Data Flows within the Data Factory
- Databricks
- Azure Subscription
- Data Factory
- Data Lake Storage Gen2
- Azure Databricks Cluster
Four datasets were ingested from the ECDC website and Azure Files into Data Lake Gen2:
- Cases and Deaths Data (ECDC)
- Hospital Admissions Data (ECDC)
- Test Conducted Data (ECDC)
- Population Data
The following datasets were transformed using ADF Data Flows:
- Cases and Deaths Data
- Hospital Admissions Data
- Cases and Deaths Source (Azure Data Lake Storage Gen2)
- Filter Europe-Only Data
- Select only the required columns
- Pivot counts on the indicator column (confirmed cases, deaths) and sum the daily case counts
- Lookup Country to get the country_code_2_digit and country_code_3_digit columns
- Select Only the required columns for the Sink
- Create a Sink dataset (Azure Data Lake Storage Gen2)
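
The Cases and Deaths steps above can be sketched in plain Python to show what each Data Flow transformation does to the rows. This is only an illustration, not the Data Flow itself: the column names (`continent`, `country`, `date`, `indicator`, `daily_count`), the output names (`cases_count`, `deaths_count`, `reported_date`), the sample rows, and the lookup table are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical sample rows; the real ECDC schema may differ.
RAW_ROWS = [
    {"country": "France", "continent": "Europe", "date": "2020-03-01",
     "indicator": "confirmed cases", "daily_count": 30},
    {"country": "France", "continent": "Europe", "date": "2020-03-01",
     "indicator": "confirmed cases", "daily_count": 5},
    {"country": "France", "continent": "Europe", "date": "2020-03-01",
     "indicator": "deaths", "daily_count": 2},
    {"country": "Brazil", "continent": "America", "date": "2020-03-01",
     "indicator": "confirmed cases", "daily_count": 7},
]

# Hypothetical country lookup keyed by country name.
COUNTRY_LOOKUP = {
    "France": {"country_code_2_digit": "FR", "country_code_3_digit": "FRA"},
}


def transform_cases_and_deaths(rows, lookup):
    # Filter: keep Europe-only rows.
    europe = [r for r in rows if r["continent"] == "Europe"]

    # Pivot: one output row per (country, date), summing daily_count
    # into a separate column for each indicator value.
    pivoted = defaultdict(lambda: {"cases_count": 0, "deaths_count": 0})
    for r in europe:
        key = (r["country"], r["date"])
        if r["indicator"] == "confirmed cases":
            pivoted[key]["cases_count"] += r["daily_count"]
        elif r["indicator"] == "deaths":
            pivoted[key]["deaths_count"] += r["daily_count"]

    # Lookup: attach the country codes, then select the sink columns.
    result = []
    for (country, date), counts in sorted(pivoted.items()):
        codes = lookup.get(country, {})
        result.append({
            "country": country,
            "country_code_2_digit": codes.get("country_code_2_digit"),
            "country_code_3_digit": codes.get("country_code_3_digit"),
            "reported_date": date,
            "cases_count": counts["cases_count"],
            "deaths_count": counts["deaths_count"],
        })
    return result


if __name__ == "__main__":
    for row in transform_cases_and_deaths(RAW_ROWS, COUNTRY_LOOKUP):
        print(row)
```

In the sketch the pivot turns the long format (one row per indicator) into a wide format (one column per indicator), which is the shape a reporting sink usually needs.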
- Hospital Admissions Source (Azure Data Lake Storage Gen2)
- Select only the required columns
- Lookup Country to get the country_code_2_digit and country_code_3_digit columns
- Select only the required columns
- Conditional Split into Weekly and Daily
- Weekly: `indicator=='Weekly new hospital admissions per 100k' || indicator=='Weekly new ICU admissions per 100k'`
- Daily: `indicator=='Daily hospital occupancy' || indicator=='Daily ICU occupancy'`
- For Weekly Path
- Join with Date to get ecdc_Year_week, week_start_date, week_End_date
- Pivot counts on the indicator column (confirmed cases, deaths) and sum the daily case counts
- Sort data using reported_year_week ASC and Country DESC
- Select only required columns for sink
- Create a sink dataset (Azure Data Lake Storage Gen2)
- For Daily Path
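
The conditional split and the weekly sort described above can be sketched in plain Python. This is an illustration under assumptions rather than the Data Flow definition: the indicator strings come from the split conditions listed above, while the row layout, the `value` column, and the zero-padded `reported_year_week` format (for example `2020-W10`) are hypothetical.

```python
# Indicator values taken from the split conditions listed above.
WEEKLY_INDICATORS = {
    "Weekly new hospital admissions per 100k",
    "Weekly new ICU admissions per 100k",
}
DAILY_INDICATORS = {
    "Daily hospital occupancy",
    "Daily ICU occupancy",
}

# Hypothetical sample rows; the real schema may differ.
SAMPLE_ROWS = [
    {"country": "Germany", "reported_year_week": "2020-W11",
     "indicator": "Weekly new hospital admissions per 100k", "value": 1.2},
    {"country": "France", "reported_year_week": "2020-W10",
     "indicator": "Weekly new hospital admissions per 100k", "value": 0.8},
    {"country": "Germany", "reported_year_week": "2020-W10",
     "indicator": "Weekly new ICU admissions per 100k", "value": 0.3},
    {"country": "France", "reported_date": "2020-03-02",
     "indicator": "Daily hospital occupancy", "value": 100},
]


def split_weekly_daily(rows):
    # Route each row to the weekly or daily stream based on its indicator.
    weekly = [r for r in rows if r["indicator"] in WEEKLY_INDICATORS]
    daily = [r for r in rows if r["indicator"] in DAILY_INDICATORS]
    return weekly, daily


def sort_weekly(rows):
    # Python's sort is stable, so sort by the secondary key first
    # (country DESC), then by the primary key (reported_year_week ASC).
    by_country_desc = sorted(rows, key=lambda r: r["country"], reverse=True)
    return sorted(by_country_desc, key=lambda r: r["reported_year_week"])


if __name__ == "__main__":
    weekly, daily = split_weekly_daily(SAMPLE_ROWS)
    for row in sort_weekly(weekly):
        print(row["reported_year_week"], row["country"], row["indicator"])
```

Sorting the week labels as strings only gives chronological order if the week number is zero-padded, which is why the sketch assumes that format.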
The following datasets were transformed using Azure Databricks:
- Population Data
- Test Conducted Data
- Azure Data Factory
- Azure Databricks (PySpark)
- Azure Storage Account
- Azure Data Lake Gen2