Zipco Foods is a vibrant and growing fast-casual dining business specializing in pizzas and cakes. With numerous outlets across the country, the business generates a significant amount of daily sales data. However, due to inefficient data handling and fragmented data storage (spread across multiple CSV files), valuable insights remain untapped.
This project addresses these challenges by designing and implementing a robust ETL (Extract, Transform, Load) pipeline, developing a scalable and normalized database schema, and enabling real-time data analytics. The solution not only streamlines data processing but also enhances decision-making capabilities, operational efficiency, and data integrity.
Zipco Foods currently faces several data management issues:
- Fragmented Data Storage: Critical sales and inventory data is scattered across multiple CSV files.
- Inefficient Data Access: The lack of a unified system causes delays in data retrieval and hinders real-time insights.
- Data Quality Challenges: Maintaining data integrity and accuracy is difficult due to disparate data sources.
By automating the ETL process and centralizing data management, the project aims to overcome these challenges and support the company's growth and operational efficiency.
- Automated ETL Pipeline: Implement a pipeline to extract, clean, transform, and load data consistently.
- Efficient Database Design: Create a database schema adhering to 2NF/3NF normalization standards for optimal data retrieval and scalability (an illustrative schema sketch follows this list).
- Real-time Analytics: Develop tools for real-time data insights to support informed decision-making.
- Robust Data Governance: Ensure data integrity, version control, and compliance via efficient orchestration.
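As a rough illustration of the 2NF/3NF objective, outlets and products can live in their own tables while each sale row references them by key. The table and column names below are hypothetical, not the project's actual schema; this is a minimal sketch using SQLAlchemy models:

```python
# Illustrative only: a normalized (3NF-style) layout where outlet and product
# attributes are stored once and each sale references them by foreign key.
from sqlalchemy import Column, Date, ForeignKey, Integer, Numeric, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Outlet(Base):
    __tablename__ = "outlets"
    outlet_id = Column(Integer, primary_key=True)
    name = Column(String(100), nullable=False)
    city = Column(String(100))

class Product(Base):
    __tablename__ = "products"
    product_id = Column(Integer, primary_key=True)
    name = Column(String(100), nullable=False)
    category = Column(String(50))       # e.g. "pizza" or "cake"
    unit_price = Column(Numeric(10, 2))

class Sale(Base):
    __tablename__ = "sales"
    sale_id = Column(Integer, primary_key=True)
    outlet_id = Column(Integer, ForeignKey("outlets.outlet_id"), nullable=False)
    product_id = Column(Integer, ForeignKey("products.product_id"), nullable=False)
    sale_date = Column(Date, nullable=False)
    quantity = Column(Integer, nullable=False)
```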
- Enhanced Decision-Making: Gain timely, accurate insights from real-time data analytics.
- Improved Operational Efficiency: Reduce manual processes and streamline data handling.
- Scalability & Flexibility: Prepare for future business growth with a scalable data management solution.
- Reliable Data Integrity: Maintain high-quality data for strategic planning and operations.
- Python: Used for scripting the ETL processes, data cleaning, transformation, and analysis with libraries such as `pandas` and `NumPy` (see the extraction sketch after this list).
- SQL: Employed for querying and managing the database, which is hosted on Azure.
- Azure Blob Storage: Serves as the centralized, scalable data repository.
- GitHub: Utilized for version control, ensuring collaborative development and maintenance.
- Apache Airflow: Orchestrates the ETL workflows, schedules jobs, and monitors data pipelines.
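As an example of how these pieces fit together, the extraction step might pull a raw sales CSV from Azure Blob Storage into a pandas DataFrame. This is a hedged sketch; the container and blob names are assumptions, not the project's actual configuration:

```python
# Hypothetical extract step: download a sales CSV from Azure Blob Storage
# and parse it into a pandas DataFrame.
import io
import os

import pandas as pd
from azure.storage.blob import BlobServiceClient

def extract_sales_csv(container: str = "raw-data", blob_name: str = "sales.csv") -> pd.DataFrame:
    service = BlobServiceClient.from_connection_string(
        os.environ["AZURE_STORAGE_CONNECTION_STRING"]
    )
    blob = service.get_blob_client(container=container, blob=blob_name)
    raw_bytes = blob.download_blob().readall()
    return pd.read_csv(io.BytesIO(raw_bytes))
```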
zipco-foods-etl/
├── data/ # Raw and processed data files
├── scripts/ # Python scripts for ETL tasks
│ ├── extract.py # Data extraction logic
│ ├── transform.py # Data transformation functions
│ ├── load.py # Data loading procedures into Azure/SQL
│ └── utils.py # Helper functions and utilities
├── airflow_dags/ # Apache Airflow DAG definitions
├── docs/ # Additional documentation and reports
├── requirements.txt # Python dependencies
└── README.md # Project overview and setup instructions
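The cleaning logic in `scripts/transform.py` might, for example, standardize column names, drop duplicates, and enforce types before loading. The column names below are assumptions about the CSV layout rather than the project's actual fields:

```python
# Illustrative transform step: basic cleaning that targets the data quality
# issues described above (duplicates, inconsistent types, missing values).
import pandas as pd

def transform_sales(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Normalize column names such as "Sale Date" -> "sale_date".
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
    # Remove exact duplicate rows from overlapping CSV exports.
    df = df.drop_duplicates()
    # Coerce types; invalid values become NaT/NaN instead of failing the run.
    df["sale_date"] = pd.to_datetime(df["sale_date"], errors="coerce")
    df["quantity"] = pd.to_numeric(df["quantity"], errors="coerce")
    # Drop rows that are unusable after coercion.
    return df.dropna(subset=["sale_date", "quantity"])
```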
- Python 3.7+
- SQL Database on Azure
- Azure Blob Storage Account
- Apache Airflow: Installed and configured.
- Git: For cloning the repository.
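Given the Azure SQL prerequisite, the load step in `scripts/load.py` might write cleaned data through SQLAlchemy. This is only a sketch under assumed names; the target table and connection handling are not confirmed by the project:

```python
# Hypothetical load step: append a cleaned DataFrame to an Azure SQL table.
import os

import pandas as pd
from sqlalchemy import create_engine

def load_sales(df: pd.DataFrame, table_name: str = "sales") -> None:
    # SQL_DATABASE_URL is assumed to be an SQLAlchemy-style URL,
    # e.g. an mssql+pyodbc URL for Azure SQL.
    engine = create_engine(os.environ["SQL_DATABASE_URL"])
    df.to_sql(table_name, engine, if_exists="append", index=False)
```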
- Clone the Repository:

  ```bash
  git clone https://github.com/your-username/zipco-foods-etl.git
  cd zipco-foods-etl
  ```

- Set Up the Python Environment:
  - Create and activate a virtual environment:

    ```bash
    python3 -m venv venv
    source venv/bin/activate  # For Windows: venv\Scripts\activate
    ```

  - Install the required dependencies:

    ```bash
    pip install -r requirements.txt
    ```
- Configure Environment Variables: Create a `.env` file or set the following environment variables (a sketch of reading these values in Python follows the setup steps):

  ```
  AZURE_STORAGE_CONNECTION_STRING=your_azure_blob_storage_connection_string
  SQL_DATABASE_URL=your_azure_sql_database_connection_string
  ```
- Configure Apache Airflow:
  - Place the DAG files from the `airflow_dags/` directory into your Airflow DAGs folder.
  - Start the Airflow scheduler and webserver:

    ```bash
    airflow scheduler
    airflow webserver --port 8080
    ```
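One way the scripts might pick up the connection settings is to load the `.env` file at startup and read the values from the environment. This assumes `python-dotenv` is among the project's dependencies, which is not confirmed by the repository:

```python
# Hypothetical configuration helper: load .env (if present) and expose the
# two connection strings used by the ETL scripts.
import os

from dotenv import load_dotenv

load_dotenv()  # Reads a .env file in the working directory, if one exists.

AZURE_STORAGE_CONNECTION_STRING = os.getenv("AZURE_STORAGE_CONNECTION_STRING")
SQL_DATABASE_URL = os.getenv("SQL_DATABASE_URL")

if not AZURE_STORAGE_CONNECTION_STRING or not SQL_DATABASE_URL:
    raise RuntimeError("Both connection strings must be set before running the pipeline.")
```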
- Running the ETL Pipeline: The ETL pipeline is orchestrated via Apache Airflow. Once your Airflow environment is running, the scheduled DAG will automatically execute the extraction, transformation, and loading processes (see the example DAG sketch after this list).
- Monitoring the Workflow: Access the Airflow web interface (typically at `http://localhost:8080`) to monitor job progress, review logs, and manage workflow status.
- Querying Processed Data: Use any SQL client (or Azure Data Studio) to query the processed data stored in your Azure SQL database, enabling real-time analytics and reporting.
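A DAG wiring the three stages together could look roughly like the sketch below. The DAG id, schedule, and function names are illustrative assumptions; the actual definitions live in `airflow_dags/`:

```python
# Illustrative Airflow DAG: run extract -> transform -> load once a day.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from scripts.extract import extract_sales_csv  # hypothetical module paths
from scripts.load import load_sales
from scripts.transform import transform_sales

def run_pipeline():
    # Chain the three steps in one task; larger volumes could instead pass
    # intermediate data between tasks via Blob Storage.
    load_sales(transform_sales(extract_sales_csv()))

with DAG(
    dag_id="zipco_foods_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="run_etl", python_callable=run_pipeline)
```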
Contributions are welcome! If you have suggestions or improvements:
- Fork the repository.
- Create a new branch (`git checkout -b feature/YourFeature`).
- Commit your changes (`git commit -m 'Add some feature'`).
- Push to the branch (`git push origin feature/YourFeature`).
- Open a pull request.
For significant changes, please open an issue first to discuss what you would like to change.