This documentation explains how to set up, run, and understand the codebase: installing dependencies, configuring the environment, and running the analysis.
Before proceeding, ensure you have the following installed:
- Python (version 3.8 or higher)
- Git (if cloning the repository)
- A code editor or IDE (e.g., VS Code, PyCharm)
Here is how the project directory is organized:
project-directory/
│
├── Raw_datasets/
│ ├── items.csv
│ ├── promotion.csv
│ ├── sales.csv
│ ├── supermarkets.csv
│
├── src/
│ ├── Data_Engineering_Pretest.ipynb # Main script for running analysis
│ ├── Clean_data.py # Functions for cleaning datasets
│
├── requirements.txt # List of dependencies
├── .env # Environment variables (e.g., database credentials)
├── README.md # Project overview and usage
└── Report/
└── Report on Data Engineering Pretest.pdf # Final report detailing tasks and insights
If the project is hosted on a Git repository, clone it:
git clone <repository-url>
cd project-directory
Set up a virtual environment to manage dependencies:
python -m venv venv
source venv/bin/activate # On Linux/Mac
venv\Scripts\activate # On Windows
Install the required libraries using requirements.txt:
pip install -r requirements.txt
Create a .env file (if it doesn't already exist) in the project root. Add the following variables:
DB_NAME=your_database_name
DB_USER=your_username
DB_PASSWORD=your_password
DB_HOST=your_host
DB_PORT=your_port
Replace placeholders with your actual PostgreSQL credentials.
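Once loaded into the environment (the project lists python-dotenv for this), these variables can be assembled into a connection URL. A minimal sketch, assuming the variable names from the .env example above:

```python
import os

def pg_url(env=os.environ) -> str:
    # Build a PostgreSQL connection URL from the five .env variables above.
    # In the project, python-dotenv's load_dotenv() would populate os.environ
    # from the .env file before this is called.
    return (
        f"postgresql://{env['DB_USER']}:{env['DB_PASSWORD']}"
        f"@{env['DB_HOST']}:{env['DB_PORT']}/{env['DB_NAME']}"
    )
```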
Use the database.py script to upload cleaned datasets to the PostgreSQL database:
python src/database.py
This script:
- Connects to the PostgreSQL database using the credentials in .env.
- Creates tables for items, promotions, sales, and supermarkets.
- Loads the data from the data/ folder into the respective tables.
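The upload step can be sketched with pandas and SQLAlchemy (both in requirements.txt). The table names and CSV paths below are assumptions based on the directory layout, not the script's actual contents:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical mapping from table name to source CSV, assumed from the
# directory layout shown above.
TABLES = {
    "items": "Raw_datasets/items.csv",
    "promotions": "Raw_datasets/promotion.csv",
    "sales": "Raw_datasets/sales.csv",
    "supermarkets": "Raw_datasets/supermarkets.csv",
}

def load_tables(engine, tables=TABLES) -> dict:
    # Read each CSV and write it to its own table, replacing old contents,
    # so re-running the script is idempotent.
    counts = {}
    for name, path in tables.items():
        df = pd.read_csv(path)
        df.to_sql(name, engine, if_exists="replace", index=False)
        counts[name] = len(df)
    return counts
```

Passing an engine built from the .env credentials (e.g. `create_engine(...)` with a `postgresql://` URL) writes all four tables in one pass.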
Run the data cleaning script to preprocess and clean the datasets:
python src/data_cleaning.py
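The cleaning pass described later (missing values, duplicates, and data-type corrections) might look roughly like this; the exact rules live in the script itself, so treat this as an illustrative sketch:

```python
import pandas as pd

def clean_frame(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical cleaning pass mirroring the documented steps.
    out = df.drop_duplicates().copy()
    # Fill missing numeric values with the column median; drop any remaining
    # rows that still have missing fields.
    for col in out.select_dtypes("number").columns:
        out[col] = out[col].fillna(out[col].median())
    out = out.dropna()
    # Normalise column names so they map cleanly onto SQL table columns.
    out.columns = [c.strip().lower().replace(" ", "_") for c in out.columns]
    return out
```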
Use the analysis.py script to analyze branch-level sales patterns and promotion effectiveness:
python src/analysis.py
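At its core, branch-level analysis is a groupby aggregation. A sketch using hypothetical column names (supermarket_id, promotion, quantity) that may not match the actual datasets:

```python
import pandas as pd

def branch_sales(sales: pd.DataFrame) -> pd.DataFrame:
    # Total quantity sold per branch, busiest branch first.
    return (
        sales.groupby("supermarket_id", as_index=False)["quantity"]
        .sum()
        .sort_values("quantity", ascending=False, ignore_index=True)
    )

def promotion_uplift(sales: pd.DataFrame) -> float:
    # Mean quantity on promoted rows minus mean quantity on non-promoted rows:
    # a crude first measure of promotion effectiveness.
    means = sales.groupby(sales["promotion"].notna())["quantity"].mean()
    return float(means.get(True, 0.0) - means.get(False, 0.0))
```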
Run the visualization script to create charts and heatmaps for promotion effectiveness and sales trends:
python src/visualization.py
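A heatmap of average quantity sold by branch and promotion, for example, can be produced with seaborn (listed in requirements.txt). Column names here are placeholders, not the script's actual API:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so this works without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

def promo_heatmap(sales: pd.DataFrame, out_path: str = "heatmap.png") -> str:
    # Pivot average quantity by branch and promotion flag, then render a
    # heatmap and save it to disk.
    pivot = sales.pivot_table(
        index="supermarket_id", columns="promotion",
        values="quantity", aggfunc="mean",
    )
    plt.figure(figsize=(6, 4))
    sns.heatmap(pivot, annot=True, cmap="viridis")
    plt.title("Average quantity sold by branch and promotion")
    plt.tight_layout()
    plt.savefig(out_path)
    plt.close()
    return out_path
```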
Alternatively, you can run the main.py script, which combines all the steps:
python src/main.py
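main.py presumably just chains the stages in order. A generic sketch of such an orchestrator (the stage names and the commented wiring are assumptions, not the script's real imports):

```python
def run_pipeline(stages) -> dict:
    # Run (name, callable) stages in order, returning each stage's result.
    # In the real main.py the callables would be the cleaning, database-upload,
    # analysis, and visualization entry points from src/.
    results = {}
    for name, stage in stages:
        results[name] = stage()
    return results

# Hypothetical wiring, matching the step order documented above:
# run_pipeline([
#     ("clean", data_cleaning.main),
#     ("upload", database.main),
#     ("analyze", analysis.main),
#     ("visualize", visualization.main),
# ])
```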
This file lists all Python dependencies. Here’s an example:
pandas
numpy
matplotlib
seaborn
sqlalchemy
psycopg2-binary
python-dotenv
Install these by running:
pip install -r requirements.txt
Holds sensitive credentials (e.g., database information). Example:
DB_NAME=supermarket_data
DB_USER=admin
DB_PASSWORD=securepassword
DB_HOST=localhost
DB_PORT=5432
- Data Cleaning: Handles missing values, duplicates, and data type corrections.
- Database Integration: Uploads cleaned data into a PostgreSQL database for centralized storage and analysis.
- Business Analysis: Generates actionable insights, such as branch-level sales patterns and promotion effectiveness.
- Visualization: Provides visual insights through heatmaps, bar plots, and other charts.
- Dependency Issues:
  - Ensure you're using the correct Python version.
  - If errors occur during installation, update pip: pip install --upgrade pip
- Database Connection Errors:
  - Verify that the PostgreSQL server is running and the .env file contains the correct credentials.
- Data File Issues:
  - Ensure all required CSV files are in the data/ folder. Missing files will cause errors.
- Add year information to the sales dataset for temporal analysis.
- Include additional validation steps for ensuring data quality during extraction.