This project is about building a data pipeline to extract, transform, and load (ETL) data from a source to a target. The data source is a CSV file containing information about car sales. The target is a PostgreSQL database table.
PostgreSQL was chosen for its rich set of data types, its transaction
management, and its scalability, which gives good performance on CRUD
operations.
The project uses SQLAlchemy models, which follow OOP concepts and provide a
useful abstraction when working with multiple datasets in future processing.
This high-level abstraction gives greater control over the data being
inserted, because the table structure can be defined with multiple constraints
and relationships.
For more advanced requirements, transactions, migrations, and more complex
operations can be handled through the ORM, which helps when managing large
amounts of data.
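As a rough illustration of the points above, the following sketch defines a model with constraints and inserts rows inside a single transaction. The class name, table name, columns, and the insert_sales helper are hypothetical and are not taken from the project; an in-memory SQLite engine stands in for PostgreSQL so the sketch runs without a database server.

```python
from datetime import date

from sqlalchemy import CheckConstraint, Column, Date, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class CarSale(Base):
    """Hypothetical car sale record; all column names are assumptions."""

    __tablename__ = "car_sales"

    id = Column(Integer, primary_key=True)
    car_model = Column(String(50), nullable=False)
    sale_date = Column(Date, nullable=False)
    sale_year = Column(Integer, nullable=False)
    price = Column(Integer, nullable=False)

    # Table-level constraint enforced by the database itself
    __table_args__ = (
        CheckConstraint("price >= 0", name="price_non_negative"),
    )


def insert_sales(engine, sales):
    """Insert all sales in one transaction: commit on success, roll back on error."""
    with Session(engine) as session, session.begin():
        session.add_all(sales)


# In-memory SQLite keeps the sketch self-contained; the project targets PostgreSQL
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
insert_sales(
    engine,
    [CarSale(car_model="Sedan", sale_date=date(2023, 1, 15), sale_year=2023, price=20000)],
)
```

If any row in a batch violates a constraint, the whole batch is rolled back, so the table never holds a partially loaded file.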
The project also follows the PEP 8 style guide, checked with Pylint, and uses
type hints for variables, function arguments, and return values.
If performance is critical, consider using Python 3.11 or later, which makes
try blocks almost free when no exception is raised and runs faster overall.
Assets are also included, with HTML and CSS files under consideration for the future.
Testing with unittest is planned for a future release.
- Remove any rows with missing values.
- Convert the date columns to a standard format.
- Create a new column to store the year of the sale.
- Replace the categorical values in the "Car Model" column with numerical values.
- The target database should be either PostgreSQL or MySQL.
- The pipeline should be runnable using a command-line interface.
- The pipeline should have error handling and logging capabilities.
- The pipeline should be modular and easily extendable to handle additional data sources and transformations.
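The four transformations and the command-line requirement listed above could be sketched as follows, using only the standard library. This is an illustration under stated assumptions, not the project's implementation: the "Sale Date" and "Price" column names, the accepted date formats, the integer coding of car models, and the --csv and --log-level flags are all hypothetical. Only the "Car Model" column name comes from the requirements.

```python
import argparse
import csv
import io
import logging
from datetime import datetime

logger = logging.getLogger("car_sales_etl")

# Assumed input date formats; the real CSV may use others
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def parse_date(text: str) -> datetime:
    """Parse a date string with the first assumed format that matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date: {text!r}")


def transform(rows: list[dict[str, str]]) -> list[dict[str, object]]:
    """Apply the four listed transformations to rows read from the CSV."""
    # 1. Remove rows where any value is missing or blank
    complete = [r for r in rows if all(v is not None and v.strip() for v in r.values())]
    logger.info("Dropped %d incomplete rows", len(rows) - len(complete))
    # 4. Assign each car model an integer code, in alphabetical order
    codes = {m: i for i, m in enumerate(sorted({r["Car Model"] for r in complete}))}
    result = []
    for row in complete:
        sale_date = parse_date(row["Sale Date"])
        result.append({
            **row,
            "Sale Date": sale_date.date().isoformat(),  # 2. standard (ISO 8601) format
            "Sale Year": sale_date.year,                # 3. new year column
            "Car Model": codes[row["Car Model"]],       # 4. numeric category code
        })
    return result


def build_parser() -> argparse.ArgumentParser:
    """Command-line options for the pipeline (flag names are hypothetical)."""
    parser = argparse.ArgumentParser(description="Car sales ETL pipeline")
    parser.add_argument("--csv", required=True, help="path to the source CSV file")
    parser.add_argument("--log-level", default="INFO", help="logging level name")
    return parser


# Small in-memory sample standing in for the real CSV file
SAMPLE = """Sale Date,Car Model,Price
01/15/2023,Sedan,20000
2023-02-01,,18000
03/10/2022,Coupe,25000
"""
clean = transform(list(csv.DictReader(io.StringIO(SAMPLE))))
```

Keeping transform free of file and database access makes it easy to reuse with other data sources, which fits the modularity requirement.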
- Clone the repository
git clone https://github.com/jpcadena/car-sales-etl.git
- Change to the project root directory
cd car-sales-etl
- Create a virtual environment named venv
python3 -m venv venv
- Activate the environment on Windows
.\venv\Scripts\activate
- Or on Unix/macOS
source venv/bin/activate
- Install the requirements with pip
pip install -r requirements.txt
- Rename the file sample.env to .env.
- Add your credentials to the .env file.
- Run the pipeline from the console.
python main.py
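For orientation, credentials like those in the .env file are typically combined into a database URL. The sketch below assumes the values are already present in the process environment (a .env file is not loaded automatically by Python) and uses hypothetical variable names; check sample.env for the names the project actually expects.

```python
import os
from collections.abc import Mapping
from urllib.parse import quote_plus


def database_url(env: Mapping[str, str] = os.environ) -> str:
    """Build a PostgreSQL URL from environment variables (hypothetical names)."""
    user = env["POSTGRES_USER"]
    # Escape characters such as "@" or ":" that would break the URL
    password = quote_plus(env["POSTGRES_PASSWORD"])
    host = env.get("POSTGRES_HOST", "localhost")
    port = env.get("POSTGRES_PORT", "5432")
    name = env["POSTGRES_DB"]
    return f"postgresql://{user}:{password}@{host}:{port}/{name}"
```

Passing the mapping as a parameter keeps the function testable without touching the real environment.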
If you have a suggestion that would make this better, please fork the repo and create a pull request.
- Fork the Project
- Create your Feature Branch (git checkout -b feature/AmazingFeature)
- Commit your Changes (git commit -m 'Add some AmazingFeature')
- Push to the Branch (git push origin feature/AmazingFeature)
- Open a Pull Request
Write docstrings in reStructuredText format, enclosed in triple double quotes
""" directly after the function definition.
Include a brief description of the function, its parameters, and the return
value with its corresponding data type.
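A docstring following these guidelines might look like the example below. The function itself is hypothetical and only serves to show the field layout.

```python
def sale_year(sale_date: str) -> int:
    """
    Extract the year from an ISO-formatted sale date.

    :param sale_date: Date of the sale in YYYY-MM-DD format
    :type sale_date: str
    :return: Year of the sale
    :rtype: int
    """
    return int(sale_date[:4])
```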
Please lint your code to check that it follows PEP 8.
See the linting documentation for Visual Studio Code or JetBrains PyCharm.
Recommended plugin for autocompletion: Tabnine
Distributed under the MIT License.
LinkedIn: Juan Pablo Cadena Aguilar
E-mail: Juan Pablo Cadena Aguilar