Zipco Foods is a vibrant and growing fast-casual dining business specializing in pizzas and cakes. With numerous outlets across the country, the business generates a significant amount of daily sales data. However, due to inefficient data handling and fragmented data storage (spread across multiple CSV files), valuable insights remain untapped.
This project addresses these challenges by designing and implementing a robust ETL (Extract, Transform, Load) pipeline, developing a scalable and normalized database schema, and enabling real-time data analytics. The solution not only streamlines data processing but also enhances decision-making capabilities, operational efficiency, and data integrity.
Zipco Foods currently faces several data management issues:
- Fragmented Data Storage: Critical sales and inventory data is scattered across multiple CSV files.
- Inefficient Data Access: The lack of a unified system causes delays in data retrieval and hinders real-time insights.
- Data Quality Challenges: Maintaining data integrity and accuracy is difficult due to disparate data sources.
By automating the ETL process and centralizing data management, the project aims to overcome these challenges and support the company's growth and operational efficiency.
- Automated ETL Pipeline: Implement a pipeline to extract, clean, transform, and load data consistently.
- Efficient Database Design: Create a database schema adhering to 2NF/3NF normalization standards for optimal data retrieval and scalability (an illustrative schema sketch follows this list).
- Real-time Analytics: Develop tools for real-time data insights to support informed decision-making.
- Robust Data Governance: Ensure data integrity, version control, and compliance via efficient orchestration.
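As a rough illustration of the 2NF/3NF objective, outlets and products can live in their own tables while each sale row references them by key. The table and column names below are hypothetical, not the project's actual schema; this is a minimal sketch using SQLAlchemy models:

```python
# Illustrative only: a normalized (3NF-style) layout where outlet and product
# attributes are stored once and each sale references them by foreign key.
from sqlalchemy import Column, Date, ForeignKey, Integer, Numeric, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Outlet(Base):
    __tablename__ = "outlets"
    outlet_id = Column(Integer, primary_key=True)
    name = Column(String(100), nullable=False)
    city = Column(String(100))

class Product(Base):
    __tablename__ = "products"
    product_id = Column(Integer, primary_key=True)
    name = Column(String(100), nullable=False)
    category = Column(String(50))       # e.g. "pizza" or "cake"
    unit_price = Column(Numeric(10, 2))

class Sale(Base):
    __tablename__ = "sales"
    sale_id = Column(Integer, primary_key=True)
    outlet_id = Column(Integer, ForeignKey("outlets.outlet_id"), nullable=False)
    product_id = Column(Integer, ForeignKey("products.product_id"), nullable=False)
    sale_date = Column(Date, nullable=False)
    quantity = Column(Integer, nullable=False)
```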
- Enhanced Decision-Making: Gain timely, accurate insights from real-time data analytics.
- Improved Operational Efficiency: Reduce manual processes and streamline data handling.
- Scalability & Flexibility: Prepare for future business growth with a scalable data management solution.
- Reliable Data Integrity: Maintain high-quality data for strategic planning and operations.
- Python: Used for scripting the ETL processes, data cleaning, transformation, and analysis with libraries such as `pandas` and `NumPy` (see the extraction sketch after this list).
- SQL: Employed for querying and managing the database, which is hosted on Azure.
- Azure Blob Storage: Serves as the centralized, scalable data repository.
- GitHub: Utilized for version control, ensuring collaborative development and maintenance.
- Apache Airflow: Orchestrates the ETL workflows, schedules jobs, and monitors data pipelines.
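As an example of how these pieces fit together, the extraction step might pull a raw sales CSV from Azure Blob Storage into a pandas DataFrame. This is a hedged sketch; the container and blob names are assumptions, not the project's actual configuration:

```python
# Hypothetical extract step: download a sales CSV from Azure Blob Storage
# and parse it into a pandas DataFrame.
import io
import os

import pandas as pd
from azure.storage.blob import BlobServiceClient

def extract_sales_csv(container: str = "raw-data", blob_name: str = "sales.csv") -> pd.DataFrame:
    service = BlobServiceClient.from_connection_string(
        os.environ["AZURE_STORAGE_CONNECTION_STRING"]
    )
    blob = service.get_blob_client(container=container, blob=blob_name)
    raw_bytes = blob.download_blob().readall()
    return pd.read_csv(io.BytesIO(raw_bytes))
```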
zipco-foods-etl/
├── data/ # Raw and processed data files
├── scripts/ # Python scripts for ETL tasks
│ ├── extract.py # Data extraction logic
│ ├── transform.py # Data transformation functions
│ ├── load.py # Data loading procedures into Azure/SQL
│ └── utils.py # Helper functions and utilities
├── airflow_dags/ # Apache Airflow DAG definitions
├── docs/ # Additional documentation and reports
├── requirements.txt # Python dependencies
└── README.md # Project overview and setup instructions
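The cleaning logic in `scripts/transform.py` might, for example, standardize column names, drop duplicates, and enforce types before loading. The column names below are assumptions about the CSV layout rather than the project's actual fields:

```python
# Illustrative transform step: basic cleaning that targets the data quality
# issues described above (duplicates, inconsistent types, missing values).
import pandas as pd

def transform_sales(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Normalize column names such as "Sale Date" -> "sale_date".
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
    # Remove exact duplicate rows from overlapping CSV exports.
    df = df.drop_duplicates()
    # Coerce types; invalid values become NaT/NaN instead of failing the run.
    df["sale_date"] = pd.to_datetime(df["sale_date"], errors="coerce")
    df["quantity"] = pd.to_numeric(df["quantity"], errors="coerce")
    # Drop rows that are unusable after coercion.
    return df.dropna(subset=["sale_date", "quantity"])
```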
- Python 3.7+
- SQL Database on Azure
- Azure Blob Storage Account
- Apache Airflow: Installed and configured.
- Git: For cloning the repository.
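Given the Azure SQL prerequisite, the load step in `scripts/load.py` might write cleaned data through SQLAlchemy. This is only a sketch under assumed names; the target table and connection handling are not confirmed by the project:

```python
# Hypothetical load step: append a cleaned DataFrame to an Azure SQL table.
import os

import pandas as pd
from sqlalchemy import create_engine

def load_sales(df: pd.DataFrame, table_name: str = "sales") -> None:
    # SQL_DATABASE_URL is assumed to be an SQLAlchemy-style URL,
    # e.g. an mssql+pyodbc URL for Azure SQL.
    engine = create_engine(os.environ["SQL_DATABASE_URL"])
    df.to_sql(table_name, engine, if_exists="append", index=False)
```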
- Clone the Repository:

  ```bash
  git clone https://github.com/your-username/zipco-foods-etl.git
  cd zipco-foods-etl
  ```

- Set Up the Python Environment:
  - Create and activate a virtual environment:

    ```bash
    python3 -m venv venv
    source venv/bin/activate  # For Windows: venv\Scripts\activate
    ```

  - Install the required dependencies:

    ```bash
    pip install -r requirements.txt
    ```
- Configure Environment Variables: Create a `.env` file or set the following environment variables (a sketch of reading these values in Python follows the setup steps):

  ```
  AZURE_STORAGE_CONNECTION_STRING=your_azure_blob_storage_connection_string
  SQL_DATABASE_URL=your_azure_sql_database_connection_string
  ```
- Configure Apache Airflow:
  - Place the DAG files from the `airflow_dags/` directory into your Airflow DAGs folder.
  - Start the Airflow scheduler and webserver:

    ```bash
    airflow scheduler
    airflow webserver --port 8080
    ```
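One way the scripts might pick up the connection settings is to load the `.env` file at startup and read the values from the environment. This assumes `python-dotenv` is among the project's dependencies, which is not confirmed by the repository:

```python
# Hypothetical configuration helper: load .env (if present) and expose the
# two connection strings used by the ETL scripts.
import os

from dotenv import load_dotenv

load_dotenv()  # Reads a .env file in the working directory, if one exists.

AZURE_STORAGE_CONNECTION_STRING = os.getenv("AZURE_STORAGE_CONNECTION_STRING")
SQL_DATABASE_URL = os.getenv("SQL_DATABASE_URL")

if not AZURE_STORAGE_CONNECTION_STRING or not SQL_DATABASE_URL:
    raise RuntimeError("Both connection strings must be set before running the pipeline.")
```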
- Running the ETL Pipeline: The ETL pipeline is orchestrated via Apache Airflow. Once your Airflow environment is running, the scheduled DAG will automatically execute the extraction, transformation, and loading processes (see the example DAG sketch after this list).
- Monitoring the Workflow: Access the Airflow web interface (typically at `http://localhost:8080`) to monitor job progress, review logs, and manage workflow status.
- Querying Processed Data: Use any SQL client (or Azure Data Studio) to query the processed data stored in your Azure SQL database, enabling real-time analytics and reporting.
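A DAG wiring the three stages together could look roughly like the sketch below. The DAG id, schedule, and function names are illustrative assumptions; the actual definitions live in `airflow_dags/`:

```python
# Illustrative Airflow DAG: run extract -> transform -> load once a day.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from scripts.extract import extract_sales_csv  # hypothetical module paths
from scripts.load import load_sales
from scripts.transform import transform_sales

def run_pipeline():
    # Chain the three steps in one task; larger volumes could instead pass
    # intermediate data between tasks via Blob Storage.
    load_sales(transform_sales(extract_sales_csv()))

with DAG(
    dag_id="zipco_foods_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="run_etl", python_callable=run_pipeline)
```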
Contributions are welcome! If you have suggestions or improvements:
- Fork the repository.
- Create a new branch (`git checkout -b feature/YourFeature`).
- Commit your changes (`git commit -m 'Add some feature'`).
- Push to the branch (`git push origin feature/YourFeature`).
- Open a pull request.
For significant changes, please open an issue first to discuss what you would like to change.