Comparative Study of Stratified K-Fold and K-Fold Cross-Validation Techniques in Cerebral Stroke Prediction Imbalanced Dataset

sanskrutikhedkar9/Comparative-Study-of-Cross-Validation-Techniques-on-Imbalanced-Dataset

Stroke Identification Project

Overview

This project focuses on identifying the potential risk of stroke in individuals by analyzing various health indicators and demographics. Using machine learning techniques, we aim to predict stroke occurrences, providing a valuable tool for preventive healthcare measures.

Dataset

Link: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset

The dataset contains health-related metrics for each individual: age, hypertension status, heart disease presence, marital status, work type, residence type, average glucose level, BMI (Body Mass Index), and smoking status. The dataset has been cleaned and preprocessed to handle missing values and outliers, so that model training starts from reliable data.

Objective

The primary goal is to develop a predictive model that can accurately identify individuals at a higher risk of stroke. By leveraging this model, healthcare professionals can target preventive measures more effectively, potentially reducing the occurrence of strokes.

Methodology

Data Preprocessing: This includes handling missing values, encoding categorical variables, dealing with outliers, and normalizing the data to prepare it for modeling.
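As an illustration only, the imputation, encoding, and normalization steps might be wired together with scikit-learn as in the sketch below. The toy rows, the column names, the median imputation, and the min-max scaling are all assumptions made for this example, not the project's confirmed choices. Outlier handling is left out because the method used is not stated here.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy rows; column names are illustrative assumptions.
df = pd.DataFrame({
    "age": [67.0, 45.0, 80.0, 33.0],
    "avg_glucose_level": [228.7, 95.1, 105.9, 80.4],
    "bmi": [36.6, np.nan, 32.5, 24.0],  # one missing value to impute
    "smoking_status": ["formerly smoked", "never smoked", "smokes", "never smoked"],
})

numeric_cols = ["age", "avg_glucose_level", "bmi"]
categorical_cols = ["smoking_status"]

# Numeric columns: fill gaps with the median, then scale to [0, 1].
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", MinMaxScaler()),
])

preprocess = ColumnTransformer(
    [
        ("num", numeric_steps, numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

X = preprocess.fit_transform(df)
print(X.shape)  # 3 numeric columns + 3 one-hot columns
```

Keeping these steps in a single transformer makes it straightforward to fit them on training data only and reuse them on held-out data.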

Feature Selection: Using techniques like SelectKBest and Chi-Square tests, important features contributing to the risk of stroke are identified, ensuring the model focuses on relevant predictors.
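A minimal sketch of SelectKBest with the chi-square score follows. It runs on synthetic data from `make_classification` rather than the stroke dataset, and `k=5` is an arbitrary assumption. The chi-square test needs non-negative inputs, so the features are scaled to [0, 1] first.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the real data: 10 features, about 5% positives.
X, y = make_classification(
    n_samples=500,
    n_features=10,
    n_informative=3,
    n_redundant=0,
    weights=[0.95, 0.05],
    random_state=0,
)

# chi2 rejects negative values, so rescale every feature to [0, 1].
X_scaled = MinMaxScaler().fit_transform(X)

# Keep the 5 features with the highest chi-square scores (k is an assumption).
selector = SelectKBest(score_func=chi2, k=5)
X_selected = selector.fit_transform(X_scaled, y)

kept = selector.get_support(indices=True)
print("kept feature indices:", kept)
print("shape after selection:", X_selected.shape)
```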

Model Training: Various machine learning models, including Logistic Regression and K-Nearest Neighbors (KNN), are trained using the preprocessed data. These models are chosen for their simplicity and effectiveness in classification tasks.
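For illustration, both model types can be fitted on a stratified train/test split as sketched below. The synthetic data, the 80/20 split, and the hyperparameters (`max_iter=1000`, `n_neighbors=5`) are assumptions for this example, not the project's settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic imbalanced data standing in for the preprocessed stroke data.
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.95, 0.05], random_state=0
)

# stratify=y keeps the class ratio the same in the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)
    print(name, "test accuracy:", model.score(X_test, y_test))
```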

Cross-Validation: To evaluate the models robustly, K-Fold and Stratified K-Fold cross-validation are employed, so that each model is scored on several different held-out segments of the data. Stratified K-Fold additionally preserves the class proportions in every fold, which matters when stroke cases are rare.
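The difference between the two splitters can be seen by counting positive cases per validation fold. The sketch below uses a hypothetical label vector with 50 positives among 1000 samples, not the real dataset.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical labels: 50 positives (5%) among 1000 samples, shuffled.
y = np.array([1] * 50 + [0] * 950)
rng = np.random.default_rng(0)
rng.shuffle(y)
X = np.zeros((len(y), 1))  # features are irrelevant for this comparison


def positives_per_fold(splitter):
    """Count the positive labels that land in each validation fold."""
    return [int(y[test_idx].sum()) for _, test_idx in splitter.split(X, y)]


kfold_counts = positives_per_fold(KFold(n_splits=5, shuffle=True, random_state=0))
strat_counts = positives_per_fold(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
)

print("K-Fold:           ", kfold_counts)  # counts vary from fold to fold
print("Stratified K-Fold:", strat_counts)  # [10, 10, 10, 10, 10]
```

With plain K-Fold the number of positives per fold depends on the shuffle, so recall and precision estimates can swing between folds. Stratification removes that source of variance.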

Resampling: Because the dataset is imbalanced (stroke occurrences are relatively rare), resampling techniques such as SMOTETomek are applied to balance the classes, improving model performance on the minority class (stroke cases).

Model Evaluation: Models are evaluated based on metrics such as Accuracy, Recall, Precision, and F1 Score. These metrics provide a comprehensive understanding of the models' performance, especially in the context of imbalanced classification.
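A small worked example shows why accuracy alone is not enough on imbalanced data. The labels and predictions below are made up for illustration: 90 negatives and 10 positives, with 5 false positives and 6 false negatives.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up ground truth: 90 negatives followed by 10 positives.
y_true = [0] * 90 + [1] * 10

# Made-up predictions: 85 true negatives, 5 false positives,
# then 4 true positives and 6 false negatives.
y_pred = [0] * 85 + [1] * 5 + [1] * 4 + [0] * 6

accuracy = accuracy_score(y_true, y_pred)    # (85 + 4) / 100 = 0.89
precision = precision_score(y_true, y_pred)  # 4 / (4 + 5)  ~ 0.444
recall = recall_score(y_true, y_pred)        # 4 / (4 + 6)  = 0.4
f1 = f1_score(y_true, y_pred)                # 8 / 19       ~ 0.421

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Accuracy looks high at 0.89 even though the classifier misses 6 of the 10 positive cases, which is why recall, precision, and F1 are reported alongside it.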

Comparison and Selection: The performance of the different models and cross-validation techniques is compared, and the best-performing combination is selected based on the evaluation metrics.
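The comparison described in the steps above can be sketched as a loop over models and splitters with `cross_validate`. Everything here is an assumption for illustration: the synthetic data, the hyperparameters, and five folds. The printed numbers are not the project's results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier

# Synthetic imbalanced data standing in for the preprocessed stroke data.
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.9, 0.1], random_state=0
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=5),
}
splitters = {
    "k_fold": KFold(n_splits=5, shuffle=True, random_state=0),
    "stratified_k_fold": StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
}
scoring = ["accuracy", "recall", "precision", "f1"]

# Mean validation score per (model, splitter) pair and metric.
results = {}
for model_name, model in models.items():
    for cv_name, cv in splitters.items():
        scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
        results[(model_name, cv_name)] = {
            metric: scores[f"test_{metric}"].mean() for metric in scoring
        }

for key, metrics in results.items():
    print(key, {metric: round(value, 3) for metric, value in metrics.items()})
```

If a model predicts no positive cases in a fold, scikit-learn reports precision as 0 for that fold and emits a warning, which is itself a useful signal on imbalanced data.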

Results

The project reveals insights into the key factors contributing to stroke risk and identifies the most effective machine learning model and cross-validation technique based on accuracy, recall, and precision metrics. These results guide the development of targeted interventions for stroke prevention.

Usage

This project is intended for healthcare researchers and professionals looking to leverage machine learning for predictive health analytics. It offers a framework for identifying stroke risk that can be integrated into healthcare planning and patient management systems.

Installation

The project is implemented in Python, utilizing libraries like Pandas, NumPy, Matplotlib, Seaborn, scikit-learn, and imbalanced-learn. Ensure these libraries are installed to run the project seamlessly.

Contribution

Contributions are welcome, especially in areas of model optimization, feature engineering, and exploring novel machine learning algorithms for improved prediction accuracy.

License

This project is open-sourced under the MIT License. See the LICENSE file for more details.
