Team Members: Fabian Roulin, Léa Goffinet, Samuel Mouny
This project aims to predict the likelihood of cardiovascular disease based on individual health factors. The analysis involves data exploration, feature engineering, and implementing machine learning algorithms for binary classification. This work is part of the EPFL Machine Learning Course, Fall 2024.
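As a rough illustration of what binary classification involves, the sketch below fits a logistic regression by gradient descent with NumPy. It is not taken from this repository: the function names, hyperparameters, and synthetic data are all assumptions made for the example, and the project's own models may differ.

```python
import numpy as np


def sigmoid(z):
    """Logistic function, with inputs clipped to avoid overflow in exp."""
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))


def fit_logistic_regression(X, y, learning_rate=0.1, n_iters=500):
    """Fit weights by gradient descent on the mean negative log-likelihood.

    X: (n_samples, n_features) array, y: (n_samples,) array of 0/1 labels.
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        # Gradient of the mean negative log-likelihood with respect to w.
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)
        w -= learning_rate * grad
    return w


def predict(X, w, threshold=0.5):
    """Return 0/1 predictions by thresholding the predicted probability."""
    return (sigmoid(X @ w) >= threshold).astype(int)


# Synthetic data for illustration only (not the project dataset).
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(200, 2))
y = (X_raw[:, 0] + X_raw[:, 1] > 0).astype(int)
X = np.column_stack([np.ones(len(X_raw)), X_raw])  # add a bias column

w = fit_logistic_regression(X, y)
accuracy = float(np.mean(predict(X, w) == y))
print(f"training accuracy: {accuracy:.2f}")
```

The bias column lets the decision boundary move away from the origin. On real health data, features would usually need cleaning and standardization before a fit like this behaves well.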
The repository is organized as follows:
- data/: Placeholder for dataset files (data/raw/ for raw data).
- doc/: Contains the project description and data codebook.
- notebooks/: Jupyter notebooks for data exploration, processing, model training, etc.
- report/: Final report (LaTeX format) and accompanying figures.
- src/: Custom package with modules: data_exploration, data_loading, data_processing, models, train_pipeline, utils.
- tests/: Test files provided to validate the functions.
Root Files:
- helpers.py: Helper functions for submission creation.
- implementations.py: Project-required implementations of functions.
- requirements.txt: Environment dependencies.
- run.py: Script to produce the best predictions; training the model might take a while.
- setup.py: Script to install the src package.
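To give an idea of what creating a submission can look like, here is a minimal sketch that writes sample identifiers and predicted labels to a CSV file. It does not reproduce helpers.py: the function name `write_submission`, the `Id` and `Prediction` column headers, and the 0/1 label encoding are hypothetical, so check the actual helper and the expected submission format before relying on it.

```python
import csv
import os
import tempfile


def write_submission(ids, predictions, path):
    """Write one (Id, Prediction) row per sample to a CSV file."""
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Id", "Prediction"])  # assumed header names
        writer.writerows(zip(ids, predictions))


# Example usage with made-up values, written to a temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "submission.csv")
write_submission([1, 2, 3], [1, 0, 1], out_path)

with open(out_path, newline="") as f:
    rows = list(csv.reader(f))
print(rows)
```

Checking the lengths up front avoids silently truncating the file, since `zip` stops at the shorter of its inputs.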
Follow these steps to clone the repository, set up the environment, and install the package.
To clone this repository and move into its directory, use the following commands:
git clone <repository_link>
cd <repository_name>
Set up the environment by installing the required packages from requirements.txt:
pip install -r requirements.txt
To install the custom src package, run:
pip install .
For editable mode, allowing modifications without reinstalling:
pip install -e .
To execute the project pipeline and generate predictions:
python run.py
- Data Preparation: Place raw data files in data/raw/ for smooth execution.
- Exploration and Modeling: Use the notebooks in notebooks/ for step-by-step analysis and model development.
- Final Report: The final analysis is documented in report/.
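Before launching run.py, it can help to confirm that the raw data is where the pipeline expects it. The sketch below lists the files found in data/raw/, relative to the current working directory. The helper name `list_raw_files` is hypothetical and not part of this repository, and no particular file names are assumed.

```python
from pathlib import Path


def list_raw_files(raw_dir="data/raw"):
    """Return the sorted file names in raw_dir, or an empty list if it is missing."""
    path = Path(raw_dir)
    if not path.is_dir():
        return []
    return sorted(p.name for p in path.iterdir() if p.is_file())


files = list_raw_files()
if files:
    print("Found raw data files:", ", ".join(files))
else:
    print("No files found in data/raw/; add the raw data before running run.py.")
```

Run it from the repository root so that the relative path resolves correctly.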
For further resources, visit: