This project is a Python-based ETL (Extract, Transform, Load) pipeline that automates extracting data from Google Sheets, transforming it as needed, and loading it into a PostgreSQL database hosted on ElephantSQL. It streamlines data handling, making it easier to perform insightful data analysis in the cloud.
Data Flow Diagram
- Data Extraction: Automatically fetches data from Google Sheets using the Google Sheets API.
- Data Transformation: Processes and prepares data to fit the desired schema for database insertion.
- Data Loading: Efficiently loads transformed data into a PostgreSQL database, ensuring data integrity and consistency.
- Dynamic Date Filtering: Allows data extraction from a specified start date, enabling incremental data updates.
- Logging: Comprehensive logging throughout the ETL process for easy monitoring and debugging.
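The features above can be sketched as a minimal pipeline. Everything below (function names, column names, the in-memory rows) is illustrative, not the project's actual API; in the real pipeline, extraction calls the Google Sheets API and loading writes to PostgreSQL:

```python
from datetime import date

# Hypothetical stand-in for rows fetched from Google Sheets.
SHEET_ROWS = [
    {"date": "2024-01-05", "amount": "10.50"},
    {"date": "2024-02-10", "amount": "7.25"},
    {"date": "2023-12-31", "amount": "3.00"},
]

def extract():
    """Return raw rows (stand-in for a Sheets API call)."""
    return SHEET_ROWS

def transform(rows, start_date):
    """Parse types and keep only rows on/after start_date (dynamic date filtering)."""
    out = []
    for row in rows:
        row_date = date.fromisoformat(row["date"])
        if row_date >= start_date:
            out.append({"date": row_date, "amount": float(row["amount"])})
    return out

def load(rows, sink):
    """Append transformed rows to a sink (stand-in for a PostgreSQL insert)."""
    sink.extend(rows)
    return len(rows)

db = []
loaded = load(transform(extract(), start_date=date(2024, 1, 1)), db)
print(loaded)  # rows remaining after the start-date filter
```

The start-date filter is what makes incremental updates possible: re-running the pipeline with a later start date only touches new rows.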
- Python: Primary programming language for scripting the ETL process.
- Pandas: Used for data manipulation and transformation.
- SQLAlchemy: Facilitates database interaction.
- PostgreSQL: Robust and scalable open-source database system.
- ElephantSQL: Cloud-hosted PostgreSQL service providing the database backend.
- Google Sheets API: Enables programmatic access to Google Sheets for data extraction.
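As a rough illustration of how Pandas and SQLAlchemy fit together in the load step, the sketch below writes a DataFrame to a database table. It uses an in-memory SQLite URL so it runs anywhere; the actual pipeline would point the engine at the ElephantSQL connection string instead. Table and column names are assumptions, not taken from this project:

```python
import pandas as pd
from sqlalchemy import create_engine, text

# Illustrative transformed data; the real pipeline builds this from Google Sheets.
df = pd.DataFrame({
    "date": ["2024-01-05", "2024-02-10"],
    "amount": [10.5, 7.25],
})

# In production this would be the ElephantSQL URL, e.g.
#   create_engine("postgresql://user:password@host/dbname")
engine = create_engine("sqlite:///:memory:")

# Load the DataFrame into a table, replacing it if it already exists.
df.to_sql("sheet_data", engine, if_exists="replace", index=False)

with engine.connect() as conn:
    count = conn.execute(text("SELECT COUNT(*) FROM sheet_data")).scalar()
print(count)
```

Using `if_exists="replace"` is the simplest option for a demo; an incremental pipeline would more likely use `if_exists="append"` together with the date filtering described above.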
To run this project, you'll need to have Python installed on your system, along with the necessary Python packages:
- Clone the repository to your local machine:

  ```shell
  git clone https://github.com/Mvanhuffel/Sheets_to_Postgres_ETL_Pipeline.git
  ```

- Navigate to the project directory:

  ```shell
  cd Sheets_to_Postgres_ETL_Pipeline
  ```

- Install the required Python packages:

  ```shell
  pip install -r requirements.txt
  ```
To start the ETL process, run the main.py script from the command line. Ensure you are in the project root directory:

```shell
python src/main.py
```
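Under the hood, the extraction step in src/extract.py presumably calls the Sheets API and reshapes the raw values grid into records. The helper below sketches only the reshaping step, since the API call itself requires credentials; the commented call shows the shape of a typical google-api-python-client request (an assumption about this project's internals, not a quote from it):

```python
# A Sheets values().get() response returns a grid: a header row plus data rows.
# The real call would look roughly like:
#   service = googleapiclient.discovery.build("sheets", "v4", credentials=creds)
#   result = service.spreadsheets().values().get(
#       spreadsheetId=SPREADSHEET_ID, range="Sheet1!A:B").execute()
#   values = result.get("values", [])

def rows_to_records(values):
    """Turn a header-first grid into a list of dicts, padding short rows."""
    if not values:
        return []
    header, *rows = values
    return [dict(zip(header, row + [""] * (len(header) - len(row)))) for row in rows]

# Sheets omits trailing empty cells, so the second data row is shorter.
sample = [["date", "amount"], ["2024-01-05", "10.50"], ["2024-02-10"]]
records = rows_to_records(sample)
print(records)
```

Padding short rows matters because the Sheets API drops trailing empty cells, which would otherwise misalign columns during the transform step.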
Ensure your config.yaml file is set up with the correct parameters before running the script. If you haven't already, copy config.example.yaml to config.yaml and edit it with your settings.
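The authoritative list of keys lives in config.example.yaml in the repo; as a hedged illustration only, a Sheets-to-Postgres configuration typically needs something along these lines (every key name below is an assumption, not this project's actual schema):

```yaml
# Hypothetical layout -- defer to config.example.yaml for the real keys.
google_sheets:
  spreadsheet_id: "your-spreadsheet-id"
  credentials_file: "data/service_account.json"
database:
  url: "postgresql://user:password@host/dbname"  # ElephantSQL connection string
etl:
  start_date: "2024-01-01"  # dynamic date filtering: only rows on/after this date
```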
```
Sheets to SQL/
├── .git/                 # Git version control system directory
├── .gitignore            # Specifies intentionally untracked files to ignore
├── __pycache__/          # Compiled Python files
├── data/                 # Data files, like your Google service account JSON
├── env_sts/              # Virtual environment directory
├── logs/                 # Log files directory
│   └── data_ingestion.log
├── src/                  # Source code directory
│   ├── extract.py        # Your extraction script
│   ├── load.py           # Your loading script
│   └── main.py           # Main script that orchestrates the ETL process
├── README.md             # Project documentation
├── requirements.txt      # Python package dependencies
└── config.yaml           # Configuration file
```
This project is licensed under the MIT License - see the LICENSE file for details.