🧩 Group Aggregation via UMAP (multi-omics) Data Integration

hirscheylab/gaudi

The package is named gaudi, inspired by the renowned Spanish architect Antoni Gaudí, who was famous for his intricate and colorful mosaics. Just as Gaudí pieced together countless fragments of tiles to create stunning, cohesive artworks, this package brings together various pieces of multi-omics data to form a comprehensive picture of complex biological systems. The concept of a mosaic beautifully parallels the integration of diverse data types, illustrating how individual fragments (genes, proteins, metabolites, etc.) come together to reveal the larger, more intricate patterns within biological research.

GAUDI

The gaudi package integrates and analyzes multi-omics data. It uses UMAP (Uniform Manifold Approximation and Projection) to uncover patterns and relationships across diverse biological datasets. Designed for ease of use and compatibility with existing R-based bioinformatics workflows, gaudi suits both novice and experienced users.

Installation

To install the latest GitHub version:

# install.packages("devtools")
devtools::install_github("hirscheylab/gaudi")

Parameter Settings and Dataset Adaptation

GAUDI exposes its parameters for customization and ships with empirically derived defaults intended to give robust results across diverse multi-omics contexts:

Default Parameters

  • UMAP Parameters

    • n_neighbors = 15 (UMAP default)
    • n_components = 4 for initial embeddings, 2 for final visualization
    • metric = "euclidean"
    • min_dist = 0.01
  • HDBSCAN Parameters

    • Minimum cluster size automatically computed as 3% of sample count (minimum of 2 samples)
    • Adapts to varying dataset sizes while preventing singleton clusters
  • Feature Selection Parameters (metagenes)

    • XGBoost (default): lambda=0, eta=0.5, gamma=50, max_depth=10, subsample=0.95
    • Random Forest option available for broader feature inclusion
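The HDBSCAN rule above (3% of the sample count, with a floor of 2) can be sketched in a few lines of base R. The helper `min_cluster_size()` is hypothetical and not part of the gaudi API, and rounding to the nearest integer is an assumption; the package may round differently.

```r
# Hypothetical helper, not part of gaudi: illustrates the "3% of samples,
# at least 2" rule. Rounding to the nearest integer is an assumption.
min_cluster_size <- function(n_samples, fraction = 0.03, floor_size = 2L) {
  stopifnot(is.numeric(n_samples), length(n_samples) == 1, n_samples > 0)
  max(floor_size, as.integer(round(n_samples * fraction)))
}

# Minimum cluster size for a range of dataset sizes
sizes <- vapply(c(30, 100, 500, 2000), min_cluster_size, integer(1))
print(sizes)
```

With 30 samples the floor of 2 applies, while larger datasets scale proportionally, which is how the default avoids singleton clusters without needing manual tuning.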

Data Preprocessing

Data Preparation

Prepare your multi-omics data matrices. Each matrix should have samples as rows and features as columns:

omics_list <- list(
    expression = expression_matrix,  # Gene expression data
    methylation = methylation_matrix,  # DNA methylation data
    protein = protein_matrix  # Protein abundance data
)
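As a minimal sketch of the expected layout, the following builds a toy `omics_list` from simulated data and checks that every matrix has samples as rows. The helper `make_omic()`, the sample and feature names, and the matrix sizes are illustrative assumptions. Requiring identical row order across matrices is also an assumption: it is a reasonable precondition for integration, but this README does not state how gaudi matches samples.

```r
set.seed(42)
n_samples <- 20
sample_ids <- paste0("sample_", seq_len(n_samples))

# Hypothetical helper: simulated stand-in for a real omics matrix
# (samples as rows, features as columns)
make_omic <- function(n_features, prefix) {
  m <- matrix(rnorm(n_samples * n_features), nrow = n_samples)
  dimnames(m) <- list(sample_ids, paste0(prefix, "_", seq_len(n_features)))
  m
}

omics_list <- list(
  expression  = make_omic(50, "gene"),
  methylation = make_omic(40, "cpg"),
  protein     = make_omic(30, "prot")
)

# Assumed precondition: every matrix lists the same samples in the same order
same_samples <- all(vapply(
  omics_list,
  function(m) identical(rownames(m), sample_ids),
  logical(1)
))
stopifnot(same_samples)

# Number of features contributed by each omic
print(vapply(omics_list, ncol, integer(1)))
```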

Missing Data

  • Features with zero standard deviation are automatically removed
  • We recommend imputing missing values before using GAUDI
  • For large-scale missingness (>20%), consider removing affected samples/features
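One possible way to apply these recommendations before calling gaudi is sketched below in base R. The helper `preprocess_omic()` is hypothetical, and median imputation is only a simple stand-in chosen for illustration; the README does not prescribe an imputation method. The zero-variance filter mirrors what the package is described as doing automatically.

```r
# Hypothetical helper, not part of gaudi
preprocess_omic <- function(m, max_missing = 0.20) {
  # Drop features whose fraction of missing values exceeds the threshold
  keep <- colMeans(is.na(m)) <= max_missing
  m <- m[, keep, drop = FALSE]

  # Impute remaining missing values with the feature median (assumed method)
  for (j in seq_len(ncol(m))) {
    nas <- is.na(m[, j])
    if (any(nas)) m[nas, j] <- median(m[, j], na.rm = TRUE)
  }

  # Drop features with zero standard deviation
  sds <- apply(m, 2, sd)
  m[, sds > 0, drop = FALSE]
}

toy <- cbind(
  ok       = c(1, 2, NA, 4, 5, 6, 7, 8, 9, 10),    # 10% missing: imputed
  sparse   = c(NA, NA, NA, 4, 5, 6, 7, 8, 9, 10),  # 30% missing: dropped
  constant = rep(3, 10)                            # zero variance: dropped
)
clean <- preprocess_omic(toy)
print(colnames(clean))
```

The same filter can be applied to samples by transposing the matrix, and each element of `omics_list` can be processed with `lapply(omics_list, preprocess_omic)`.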

Batch Effects

  • Use combine_omics = TRUE to enable automatic batch correction
  • Employs ComBat from the sva package to correct omic-type effects
result <- gaudi(omics_list, combine_omics = TRUE)

Adapting to New Datasets

Dataset Size Considerations

  • For small datasets (<100 samples): Consider reducing UMAP n_neighbors
  • For large datasets (>1000 samples): May benefit from increased n_neighbors
  • HDBSCAN minimum cluster size automatically scales with dataset size

Feature Selection

  • Use XGBoost (default) for selective feature identification
  • Switch to Random Forest for broader feature inclusion:
result <- gaudi(omics_list, method = "rf")

Validation

  • Use silhouette scores to assess clustering quality
  • Compare survival differences between clusters
  • Perform enrichment analysis on identified features
# Check clustering quality
print(result@silhouette_score)
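For readers who want to see what a silhouette score measures, here is a minimal base R sketch that computes the mean silhouette width for an embedding and a set of cluster labels. The function `mean_silhouette()` and the simulated two-cluster embedding are illustrative assumptions, not the package's implementation. It scores singletons as 0 and does not treat HDBSCAN noise points specially.

```r
# Hypothetical helper: mean silhouette width from Euclidean distances
mean_silhouette <- function(x, clusters) {
  d <- as.matrix(dist(x))
  labels <- unique(clusters)
  stopifnot(length(labels) >= 2)

  widths <- vapply(seq_len(nrow(d)), function(i) {
    own <- clusters == clusters[i]
    own[i] <- FALSE
    if (!any(own)) return(0)  # convention: singleton clusters score 0

    # a: mean distance to the other members of the same cluster
    a <- mean(d[i, own])
    # b: smallest mean distance to the members of any other cluster
    b <- min(vapply(
      setdiff(labels, clusters[i]),
      function(k) mean(d[i, clusters == k]),
      numeric(1)
    ))
    (b - a) / max(a, b)
  }, numeric(1))

  mean(widths)
}

# Simulated 2D embedding with two well-separated groups
set.seed(1)
embedding <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2)
)
clusters <- rep(c(1, 2), each = 20)

score <- mean_silhouette(embedding, clusters)
print(score)
```

Values close to 1 indicate compact, well-separated clusters; values near 0 or below suggest overlapping or poorly assigned clusters.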

These defaults perform robustly across typical multi-omics datasets. As with most machine learning approaches, users are encouraged to tune them for their specific biological context through standard cross-validation and benchmarking, particularly when analyzing datasets with unusual characteristics or investigating novel biological phenomena.

Method Performance and Validation

For an in-depth understanding of the GAUDI (Group Aggregation via UMAP Data Integration) method's performance and its comparative analysis with other leading multi-omics integration techniques, we encourage users to explore our dedicated benchmarking repository. This repository contains detailed benchmarks across various datasets, including simulated data, TCGA cancer datasets, single-cell datasets, and DepMap multi-omics data, providing valuable insights into the effectiveness and versatility of the GAUDI approach. Access the comprehensive benchmarks and results here.

License

GAUDI is licensed under the GNU General Public License v3.0. See the LICENSE file in the repository for details.

Citation

@article {Castellano-Escuder2024.10.07.617035,
	author = {Castellano-Escuder, Pol and Zachman, Derek K and Han, Kevin and Hirschey, Matthew D},
	title = {Interpretable multi-omics integration with UMAP embeddings and density-based clustering},
	elocation-id = {2024.10.07.617035},
	year = {2024},
	doi = {10.1101/2024.10.07.617035},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2024/10/11/2024.10.07.617035},
	eprint = {https://www.biorxiv.org/content/early/2024/10/11/2024.10.07.617035.full.pdf},
	journal = {bioRxiv}
}
