Semi-Automated Pipeline using Concept Drift Analysis for improving Metabolomics Predictions (SAPCDAMP)

In this repository, we offer a semi-automated pipeline for enhanced metabolomics predictions (SemiAutomated_pipeline_py_notebook.ipynb). This pipeline leverages various machine learning techniques, concept drift detection methods, and data preprocessing strategies to improve predictive accuracy.

The whole pipeline is divided into three sections ([A], [B], [C]), at the end of which the evaluation parameters of individual prediction models are shown and summarized.

Introduction

The SAPCDAMP pipeline is designed for metabolomics prediction using multiple classifiers. We leverage concept drift detection techniques like DDM (Drift Detection Method) and EDDM (Enhanced Drift Detection Method) to monitor changes in data distributions over time. This allows us to fine-tune predictive models continuously.

The pipeline aims to improve the accuracy and robustness of metabolomics predictions by:

Detecting concept drift.
Identifying potential confounding factors.
Testing hypotheses related to segmentation and data standardization.
Using multiple classifiers and comparing their performance.

How to Use

Running the Pipeline on Google Colab

Clone the Repository
- Open Google Colab (https://colab.research.google.com/).
- Clone the repository to your Colab environment using the following command:
```
!git clone <repository_url>
```
  Replace <repository_url> with the actual URL of the repository you're working with.
Install Required Dependencies
- Install necessary libraries and dependencies. For example, you can use:
```
!pip install -r requirements.txt
```
  Make sure the requirements.txt file contains all the necessary libraries (like pandas, numpy, sklearn, etc.).
Upload the Dataset
- You will need to upload the datasets for the pipeline. You can do so by using the Colab interface to upload files or by using URLs to directly load the data from external sources. Use the following to upload a dataset:
```
from google.colab import files
uploaded = files.upload()
```
Alternatively, if your dataset is hosted online, you can load it directly:
```
url_Chu_et_al_scaled = "<dataset_url>"
url_Li_et_al_scaled = "<dataset_url>"
url_Kar_et_al_scaled = "<dataset_url>"
```
Set Up the Pipeline
- In your Colab notebook, load and clean the dataset using the provided functions.
```
data = load_and_clean_data(url_Chu_et_al_scaled)
```
Choose Scaling Method
- You will be prompted to choose a scaling method. This can be done through a simple input prompt in the notebook.
```
scaling_method = input("Enter the scaling method (Centering, Autoscaling, Range Scaling, Pareto Scaling, Vast Scaling, Level Scaling, Log Transformation, Power Transformation): ")
```
  Based on the input, the dataset will be scaled using the corresponding method. If necessary, ensure that the function apply_scaling_by_class accepts and processes this input correctly.

Run the Concept Drift Detection

After preprocessing, run the Concept Drift Detection steps for each classifier. For example:

detect_drift_ddm(predicted_Ridge, "RR")
detect_drift_ddm(predicted_SVR, "SVR")
detect_drift_ddm(predicted_RF, "RF")
detect_drift_ddm(predicted_DNN, "DNN")

Train and Evaluate the Classifiers

Follow the steps to train and test classifiers based on the cleaned and preprocessed data.

model_RF = RandomForestClassifier(n_estimators=100)
model_RF.fit(met_train_X, met_train_y)

Evaluate the performance of the classifier:

print(f"Random Forest model accuracy on the test set: {model_RF.score(met_test_X, met_test_y)}")

View Results and Adjust Parameters
- After running the pipeline, the evaluation metrics will be printed. Review the results, compare accuracy, precision, and recall across different models.
- You can adjust parameters such as the scaling method, concept drift detection settings, or classifier configurations and re-run the pipeline for improved results.

Sections Overview

Section A: Conventional Methods in Metabolomics

This section involves the selection of conventional methods that are frequently used in predictive metabolomic modeling. These include basic machine learning classifiers like Ridge Regression (RR), Support Vector Regression (SVR), Random Forest (RF), and Deep Neural Networks (DNN). These models are trained and evaluated using preprocessed metabolomics data.

Section B: Concept Drift Detection and Hypothesis Testing

Section B is divided into three distinct parts:

b1] Concept Drift Detection (DDM and EDDM)

Concept drift detection is implemented using two methods:

DDM (Drift Detection Method)
EDDM (Enhanced Drift Detection Method)

These methods help identify whether changes have occurred in the data distribution over time. Concept drift can affect model accuracy, and detecting it early can help adjust or retrain models accordingly.

b2] Identification of Confounding Factors

This part focuses on identifying confounding factors using clinical data and biological knowledge. It highlights the need for human expertise to interpret data effectively and adjust the model accordingly, turning the pipeline into a semi-automatic process.

b3] Hypothesis Testing for Enhanced Classifiers

This part tests two hypotheses aimed at improving metabolomics classifiers:

Segmentation: This involves eliminating individuals below a certain age (e.g., 26 years old) to focus on a region of interest and remove outliers that could skew predictions.
Feature Scaling: Standardizing the dataset to ensure consistency and remove bias due to varying feature ranges. Feature scaling is applied based on a time-threshold (e.g., 26 years).

Section C: Enhanced Classifiers and Improved Accuracy

Section C implements new classifiers and models that incorporate the concepts tested in Section B. These models use new data inputs to enhance the accuracy of predictions and refine the machine learning models based on findings from the previous sections.

Evaluation

After running the pipeline, the accuracy, precision, recall, and other evaluation metrics are displayed for each classifier used in the pipeline. These metrics help you understand the effectiveness of each model and provide insight into areas that may need improvement. The models can be further adjusted and fine-tuned based on this feedback to improve prediction performance.

Note: The repository is designed to run on Google Colab, making it easier to run experiments in a cloud-based environment without worrying about local setups. Make sure to upload the necessary datasets and follow the steps as outlined above.

Datasets & Literature:

Metabolite unification nomenclature was performed for GM (training and testing) and CSF (validation) and CML (validation) data:

1] GM data

Chu,X. et al. (2021) Integration of metabolomics, genomics, and immune phenotypes reveals the causal roles of metabolites in disease. Genome Biol, 22.

2] CSF data

Li,K. et al. (2022) Cerebrospinal fluid and plasma metabolomics of acute endurance exercise. FASEB Journal, 36.

3] CML data

Karlíková, R., et al. (2016) Metabolite profiling of the plasma and leukocytes of chronic myeloid leukemia patients. Journal of proteome research, 15.

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
example_data		example_data
LICENSE		LICENSE
README.md		README.md
SemiAutomated_pipeline_py_notebook.ipynb		SemiAutomated_pipeline_py_notebook.ipynb
Tornado_framework.ipynb		Tornado_framework.ipynb
user_manual.pdf		user_manual.pdf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Semi-Automated Pipeline using Concept Drift Analysis for improving Metabolomics Predictions (SAPCDAMP)

Table of Contents

Introduction

How to Use

Running the Pipeline on Google Colab

Sections Overview

Section A: Conventional Methods in Metabolomics

Section B: Concept Drift Detection and Hypothesis Testing

b1] Concept Drift Detection (DDM and EDDM)

b2] Identification of Confounding Factors

b3] Hypothesis Testing for Enhanced Classifiers

Section C: Enhanced Classifiers and Improved Accuracy

Evaluation

About

Releases

Packages

Languages

License

JanaSchwarzerova/SAPCDAMP

Folders and files

Latest commit

History

Repository files navigation

Semi-Automated Pipeline using Concept Drift Analysis for improving Metabolomics Predictions (SAPCDAMP)

Table of Contents

Introduction

How to Use

Running the Pipeline on Google Colab

Sections Overview

Section A: Conventional Methods in Metabolomics

Section B: Concept Drift Detection and Hypothesis Testing

b1] Concept Drift Detection (DDM and EDDM)

b2] Identification of Confounding Factors

b3] Hypothesis Testing for Enhanced Classifiers

Section C: Enhanced Classifiers and Improved Accuracy

Evaluation

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages