This documentation explains how to set up, run, and understand the codebase: installing dependencies, configuring the environment, and running the analysis.
Before proceeding, ensure you have the following installed:
- Python (version 3.8 or higher)
- Git (if cloning the repository)
- A code editor or IDE (e.g., VS Code, PyCharm)
Here is how the project directory is organized:
project-directory/
│
├── Raw_datasets/
│ ├── items.csv
│ ├── promotion.csv
│ ├── sales.csv
│ ├── supermarkets.csv
│
├── src/
│ ├── Data_Engineering_Pretest.ipynb # Main script for running analysis
│ ├── Clean_data.py # Functions for cleaning datasets
│
├── requirements.txt # List of dependencies
├── .env # Environment variables (e.g., database credentials)
├── README.md # Project overview and usage
└── Report/
└── Report on Data Engineering Pretest.pdf # Final report detailing tasks and insights
If the project is hosted on a Git repository, clone it:
git clone <repository-url>
cd project-directory
Set up a virtual environment to manage dependencies:
python -m venv venv
source venv/bin/activate # On Linux/Mac
venv\Scripts\activate # On Windows
Install the required libraries using requirements.txt:
pip install -r requirements.txt
Create a .env file (if it doesn't already exist) in the project root. Add the following variables:
DB_NAME=your_database_name
DB_USER=your_username
DB_PASSWORD=your_password
DB_HOST=your_host
DB_PORT=your_port
Replace placeholders with your actual PostgreSQL credentials.
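Once loaded into the environment (the project lists python-dotenv for this), these variables can be assembled into a connection URL. A minimal sketch, assuming the variable names from the .env example above:

```python
import os

def pg_url(env=os.environ) -> str:
    # Build a PostgreSQL connection URL from the five .env variables above.
    # In the project, python-dotenv's load_dotenv() would populate os.environ
    # from the .env file before this is called.
    return (
        f"postgresql://{env['DB_USER']}:{env['DB_PASSWORD']}"
        f"@{env['DB_HOST']}:{env['DB_PORT']}/{env['DB_NAME']}"
    )
```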
Use the database.py script to upload cleaned datasets to the PostgreSQL database:
python src/database.py
This script:
- Connects to the PostgreSQL database using the credentials in .env.
- Creates tables for items, promotions, sales, and supermarkets.
- Loads the data from the data/ folder into the respective tables.
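The upload step can be sketched with pandas and SQLAlchemy (both in requirements.txt). The table names and CSV paths below are assumptions based on the directory layout, not the script's actual contents:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical mapping from table name to source CSV, assumed from the
# directory layout shown above.
TABLES = {
    "items": "Raw_datasets/items.csv",
    "promotions": "Raw_datasets/promotion.csv",
    "sales": "Raw_datasets/sales.csv",
    "supermarkets": "Raw_datasets/supermarkets.csv",
}

def load_tables(engine, tables=TABLES) -> dict:
    # Read each CSV and write it to its own table, replacing old contents,
    # so re-running the script is idempotent.
    counts = {}
    for name, path in tables.items():
        df = pd.read_csv(path)
        df.to_sql(name, engine, if_exists="replace", index=False)
        counts[name] = len(df)
    return counts
```

Passing an engine built from the .env credentials (e.g. `create_engine(...)` with a `postgresql://` URL) writes all four tables in one pass.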
Run the data cleaning script to preprocess and clean the datasets:
python src/data_cleaning.py
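The cleaning pass described later (missing values, duplicates, and data-type corrections) might look roughly like this; the exact rules live in the script itself, so treat this as an illustrative sketch:

```python
import pandas as pd

def clean_frame(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical cleaning pass mirroring the documented steps.
    out = df.drop_duplicates().copy()
    # Fill missing numeric values with the column median; drop any remaining
    # rows that still have missing fields.
    for col in out.select_dtypes("number").columns:
        out[col] = out[col].fillna(out[col].median())
    out = out.dropna()
    # Normalise column names so they map cleanly onto SQL table columns.
    out.columns = [c.strip().lower().replace(" ", "_") for c in out.columns]
    return out
```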
Use the analysis.py script to analyze branch-level sales patterns and promotion effectiveness:
python src/analysis.py
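At its core, branch-level analysis is a groupby aggregation. A sketch using hypothetical column names (supermarket_id, promotion, quantity) that may not match the actual datasets:

```python
import pandas as pd

def branch_sales(sales: pd.DataFrame) -> pd.DataFrame:
    # Total quantity sold per branch, busiest branch first.
    return (
        sales.groupby("supermarket_id", as_index=False)["quantity"]
        .sum()
        .sort_values("quantity", ascending=False, ignore_index=True)
    )

def promotion_uplift(sales: pd.DataFrame) -> float:
    # Mean quantity on promoted rows minus mean quantity on non-promoted rows:
    # a crude first measure of promotion effectiveness.
    means = sales.groupby(sales["promotion"].notna())["quantity"].mean()
    return float(means.get(True, 0.0) - means.get(False, 0.0))
```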
Run the visualization script to create charts and heatmaps for promotion effectiveness and sales trends:
python src/visualization.py
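A heatmap of average quantity sold by branch and promotion, for example, can be produced with seaborn (listed in requirements.txt). Column names here are placeholders, not the script's actual API:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so this works without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

def promo_heatmap(sales: pd.DataFrame, out_path: str = "heatmap.png") -> str:
    # Pivot average quantity by branch and promotion flag, then render a
    # heatmap and save it to disk.
    pivot = sales.pivot_table(
        index="supermarket_id", columns="promotion",
        values="quantity", aggfunc="mean",
    )
    plt.figure(figsize=(6, 4))
    sns.heatmap(pivot, annot=True, cmap="viridis")
    plt.title("Average quantity sold by branch and promotion")
    plt.tight_layout()
    plt.savefig(out_path)
    plt.close()
    return out_path
```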
Alternatively, you can run the main.py script, which combines all the steps:
python src/main.py
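main.py presumably just chains the stages in order. A generic sketch of such an orchestrator (the stage names and the commented wiring are assumptions, not the script's real imports):

```python
def run_pipeline(stages) -> dict:
    # Run (name, callable) stages in order, returning each stage's result.
    # In the real main.py the callables would be the cleaning, database-upload,
    # analysis, and visualization entry points from src/.
    results = {}
    for name, stage in stages:
        results[name] = stage()
    return results

# Hypothetical wiring, matching the step order documented above:
# run_pipeline([
#     ("clean", data_cleaning.main),
#     ("upload", database.main),
#     ("analyze", analysis.main),
#     ("visualize", visualization.main),
# ])
```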
This file lists all Python dependencies. Here’s an example:
pandas
numpy
matplotlib
seaborn
sqlalchemy
psycopg2-binary
python-dotenv
Install these by running:
pip install -r requirements.txt
Holds sensitive credentials (e.g., database information). Example:
DB_NAME=supermarket_data
DB_USER=admin
DB_PASSWORD=securepassword
DB_HOST=localhost
DB_PORT=5432
- Data Cleaning: Handles missing values, duplicates, and data type corrections.
- Database Integration: Uploads cleaned data into a PostgreSQL database for centralized storage and analysis.
- Business Analysis: Generates actionable insights, such as branch-level sales patterns and promotion effectiveness.
- Visualization: Provides visual insights through heatmaps, bar plots, and other charts.
- Dependency Issues:
  - Ensure you're using the correct Python version.
  - If errors occur during installation, update pip: pip install --upgrade pip
- Database Connection Errors:
  - Verify that the PostgreSQL server is running and the .env file contains the correct credentials.
- Data File Issues:
  - Ensure all required CSV files are in the data/ folder. Missing files will cause errors.
- Add year information to the sales dataset for temporal analysis.
- Include additional validation steps for ensuring data quality during extraction.