This project is a Python-based ETL (Extract, Transform, Load) pipeline that automates extracting data from Google Sheets, transforming it as needed, and loading it into a PostgreSQL database hosted on ElephantSQL. It streamlines data handling, making it easier to perform insightful data analysis in the cloud.
Data Flow Diagram
- Data Extraction: Automatically fetches data from Google Sheets using the Google Sheets API.
- Data Transformation: Processes and prepares data to fit the desired schema for database insertion.
- Data Loading: Efficiently loads transformed data into a PostgreSQL database, ensuring data integrity and consistency.
- Dynamic Date Filtering: Allows data extraction from a specified start date, enabling incremental data updates.
- Logging: Comprehensive logging throughout the ETL process for easy monitoring and debugging.
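The features above can be sketched as a minimal pipeline. Everything below (function names, column names, the in-memory rows) is illustrative, not the project's actual API; in the real pipeline, extraction calls the Google Sheets API and loading writes to PostgreSQL:

```python
from datetime import date

# Hypothetical stand-in for rows fetched from Google Sheets.
SHEET_ROWS = [
    {"date": "2024-01-05", "amount": "10.50"},
    {"date": "2024-02-10", "amount": "7.25"},
    {"date": "2023-12-31", "amount": "3.00"},
]

def extract():
    """Return raw rows (stand-in for a Sheets API call)."""
    return SHEET_ROWS

def transform(rows, start_date):
    """Parse types and keep only rows on/after start_date (dynamic date filtering)."""
    out = []
    for row in rows:
        row_date = date.fromisoformat(row["date"])
        if row_date >= start_date:
            out.append({"date": row_date, "amount": float(row["amount"])})
    return out

def load(rows, sink):
    """Append transformed rows to a sink (stand-in for a PostgreSQL insert)."""
    sink.extend(rows)
    return len(rows)

db = []
loaded = load(transform(extract(), start_date=date(2024, 1, 1)), db)
print(loaded)  # rows remaining after the start-date filter
```

The start-date filter is what makes incremental updates possible: re-running the pipeline with a later start date only touches new rows.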
- Python: Primary programming language for scripting the ETL process.
- Pandas: Used for data manipulation and transformation.
- SQLAlchemy: Facilitates database interaction.
- PostgreSQL: Robust and scalable open-source database system.
- ElephantSQL: Cloud-hosted PostgreSQL service providing the database backend.
- Google Sheets API: Enables programmatic access to Google Sheets for data extraction.
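As a rough illustration of how Pandas and SQLAlchemy fit together in the load step, the sketch below writes a DataFrame to a database table. It uses an in-memory SQLite URL so it runs anywhere; the actual pipeline would point the engine at the ElephantSQL connection string instead. Table and column names are assumptions, not taken from this project:

```python
import pandas as pd
from sqlalchemy import create_engine, text

# Illustrative transformed data; the real pipeline builds this from Google Sheets.
df = pd.DataFrame({
    "date": ["2024-01-05", "2024-02-10"],
    "amount": [10.5, 7.25],
})

# In production this would be the ElephantSQL URL, e.g.
#   create_engine("postgresql://user:password@host/dbname")
engine = create_engine("sqlite:///:memory:")

# Load the DataFrame into a table, replacing it if it already exists.
df.to_sql("sheet_data", engine, if_exists="replace", index=False)

with engine.connect() as conn:
    count = conn.execute(text("SELECT COUNT(*) FROM sheet_data")).scalar()
print(count)
```

Using `if_exists="replace"` is the simplest option for a demo; an incremental pipeline would more likely use `if_exists="append"` together with the date filtering described above.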
To run this project, you'll need to have Python installed on your system, along with the necessary Python packages:
- Clone the repository to your local machine:

  ```shell
  git clone https://github.com/Mvanhuffel/Sheets_to_Postgres_ETL_Pipeline.git
  ```

- Navigate to the project directory:

  ```shell
  cd Sheets_to_Postgres_ETL_Pipeline
  ```

- Install the required Python packages:

  ```shell
  pip install -r requirements.txt
  ```
To start the ETL process, run the main.py script from the command line. Ensure you are in the project root directory:

```shell
python src/main.py
```
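Under the hood, the extraction step in src/extract.py presumably calls the Sheets API and reshapes the raw values grid into records. The helper below sketches only the reshaping step, since the API call itself requires credentials; the commented call shows the shape of a typical google-api-python-client request (an assumption about this project's internals, not a quote from it):

```python
# A Sheets values().get() response returns a grid: a header row plus data rows.
# The real call would look roughly like:
#   service = googleapiclient.discovery.build("sheets", "v4", credentials=creds)
#   result = service.spreadsheets().values().get(
#       spreadsheetId=SPREADSHEET_ID, range="Sheet1!A:B").execute()
#   values = result.get("values", [])

def rows_to_records(values):
    """Turn a header-first grid into a list of dicts, padding short rows."""
    if not values:
        return []
    header, *rows = values
    return [dict(zip(header, row + [""] * (len(header) - len(row)))) for row in rows]

# Sheets omits trailing empty cells, so the second data row is shorter.
sample = [["date", "amount"], ["2024-01-05", "10.50"], ["2024-02-10"]]
records = rows_to_records(sample)
print(records)
```

Padding short rows matters because the Sheets API drops trailing empty cells, which would otherwise misalign columns during the transform step.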
Ensure your config.yaml file is set up with the correct parameters before running the script. If you haven't already, copy config.example.yaml to config.yaml and edit it with your settings.
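The authoritative list of keys lives in config.example.yaml in the repo; as a hedged illustration only, a Sheets-to-Postgres configuration typically needs something along these lines (every key name below is an assumption, not this project's actual schema):

```yaml
# Hypothetical layout -- defer to config.example.yaml for the real keys.
google_sheets:
  spreadsheet_id: "your-spreadsheet-id"
  credentials_file: "data/service_account.json"
database:
  url: "postgresql://user:password@host/dbname"  # ElephantSQL connection string
etl:
  start_date: "2024-01-01"  # dynamic date filtering: only rows on/after this date
```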
```
Sheets to SQL/
├── .git/                 # Git version control system directory
├── .gitignore            # Specifies intentionally untracked files to ignore
├── __pycache__/          # Compiled Python files
├── data/                 # Data files, like your Google service account JSON
├── env_sts/              # Virtual environment directory
├── logs/                 # Log files directory
│   └── data_ingestion.log
├── src/                  # Source code directory
│   ├── extract.py        # Your extraction script
│   ├── load.py           # Your loading script
│   └── main.py           # Main script that orchestrates the ETL process
├── README.md             # Project documentation
├── requirements.txt      # Python package dependencies
└── config.yaml           # Configuration file
```
This project is licensed under the MIT License - see the LICENSE file for details.