The package is named gaudi, inspired by the renowned Spanish architect Antoni Gaudí, who was famous for his intricate and colorful mosaics. Just as Gaudí pieced together countless fragments of tiles to create stunning, cohesive artworks, this package brings together various pieces of multi-omics data to form a comprehensive picture of complex biological systems. The concept of a mosaic beautifully parallels the integration of diverse data types, illustrating how individual fragments (genes, proteins, metabolites, etc.) come together to reveal the larger, more intricate patterns within biological research.
The gaudi package provides a streamlined solution for the integration and analysis of complex multi-omics data. Leveraging the power of UMAP (Uniform Manifold Approximation and Projection), this package offers researchers an efficient and intuitive approach to uncover hidden patterns and relationships within diverse biological datasets. Designed for ease of use and compatibility with existing R-based bioinformatics workflows, gaudi is ideal for both novice and experienced users looking to delve deeper into the world of multi-omics research.
To install the latest GitHub version:
# install.packages("devtools")
devtools::install_github("hirscheylab/gaudi")
GAUDI provides comprehensive parameter customization while offering empirically-derived defaults that enable robust analysis across diverse multi-omics contexts:
- UMAP Parameters
  - n_neighbors = 15 (UMAP default)
  - n_components = 4 for initial embeddings, 2 for final visualization
  - metric = "euclidean"
  - min_dist = 0.01
- HDBSCAN Parameters
  - Minimum cluster size automatically computed as 3% of sample count (minimum of 2 samples)
  - Adapts to varying dataset sizes while preventing singleton clusters
- Feature Selection Parameters (metagenes)
  - XGBoost (default): lambda=0, eta=0.5, gamma=50, max_depth=10, subsample=0.95
  - Random Forest option available for broader feature inclusion
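To make the HDBSCAN sizing rule concrete, here is a minimal base R sketch of "3% of the sample count, with a minimum of 2". The helper name `min_cluster_size` is hypothetical and not part of the gaudi API, and rounding to the nearest integer is an assumption; the package may round differently.

```r
# Hypothetical helper (not part of gaudi): approximates the rule
# "3% of samples, but never fewer than 2".
# Assumption: the 3% value is rounded to the nearest integer.
min_cluster_size <- function(n_samples, fraction = 0.03, floor_size = 2) {
  stopifnot(is.numeric(n_samples), length(n_samples) == 1, n_samples > 0)
  max(floor_size, round(fraction * n_samples))
}

# How the minimum cluster size would scale with cohort size under this rule
sizes <- sapply(c(20, 100, 500, 1000), min_cluster_size)
print(sizes)
```

Under this reading, very small cohorts fall back to the floor of 2, while larger cohorts need proportionally larger groups before they count as a cluster.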
Prepare your multi-omics data matrices. Each matrix should have samples as rows and features as columns:
omics_list <- list(
expression = expression_matrix, # Gene expression data
methylation = methylation_matrix, # DNA methylation data
protein = protein_matrix # Protein abundance data
)
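For readers without data at hand, the following sketch builds small simulated matrices in the layout described above (samples as rows, features as columns). The values are random toy data, and the check that all matrices share the same sample names in the same order is an assumption about what the integration step expects, not documented gaudi behavior.

```r
set.seed(1)
samples <- paste0("sample_", 1:6)

# Toy matrices: one row per sample, one column per feature
expression_matrix <- matrix(rnorm(6 * 4), nrow = 6,
                            dimnames = list(samples, paste0("gene_", 1:4)))
methylation_matrix <- matrix(runif(6 * 3), nrow = 6,
                             dimnames = list(samples, paste0("cpg_", 1:3)))
protein_matrix <- matrix(rnorm(6 * 2), nrow = 6,
                         dimnames = list(samples, paste0("prot_", 1:2)))

omics_list <- list(
  expression = expression_matrix,
  methylation = methylation_matrix,
  protein = protein_matrix
)

# Assumed requirement: identical sample names, in the same order, in every block
same_samples <- all(vapply(
  omics_list,
  function(m) identical(rownames(m), samples),
  logical(1)
))
stopifnot(same_samples)
```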
- Features with zero standard deviation are automatically removed
- We recommend imputing missing values before using GAUDI
- For large-scale missingness (>20%), consider removing affected samples/features
- Use combine_omics = TRUE to enable automatic batch correction
- Employs ComBat from the sva package to correct omic-type effects
result <- gaudi(omics_list, combine_omics = TRUE)
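The preprocessing recommendations above (handling missing values and zero-variance features) can be sketched in base R before calling gaudi. The helper `prepare_omic` is hypothetical, and per-feature median imputation is just one simple choice; the source does not prescribe an imputation method. The sketch drops only features: if samples are removed for missingness, they would need to be removed from every matrix so the rows stay aligned.

```r
# Hypothetical helper (not part of gaudi): cleans one samples-by-features matrix.
prepare_omic <- function(mat, max_missing = 0.2) {
  # Drop features whose fraction of missing values exceeds max_missing
  mat <- mat[, colMeans(is.na(mat)) <= max_missing, drop = FALSE]

  # Impute remaining missing values with the per-feature median (assumed method)
  for (j in seq_len(ncol(mat))) {
    missing <- is.na(mat[, j])
    if (any(missing)) mat[missing, j] <- median(mat[, j], na.rm = TRUE)
  }

  # Drop features with zero standard deviation (gaudi also does this automatically)
  sds <- apply(mat, 2, sd)
  mat[, !is.na(sds) & sds > 0, drop = FALSE]
}

toy <- cbind(
  a = c(1, 2, NA, 4, 5),          # 20% missing: kept and imputed
  b = c(NA, NA, 3, 4, 5),         # 40% missing: dropped
  c = c(7, 7, 7, 7, 7),           # constant: dropped
  d = c(0.1, 0.4, 0.2, 0.9, 0.5)  # complete and variable: kept
)
rownames(toy) <- paste0("sample_", 1:5)

clean <- prepare_omic(toy)
print(clean)
```

Applying such a helper to each element of the list, for example with `lapply(omics_list, prepare_omic)`, keeps the list structure that gaudi expects.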
- For small datasets (<100 samples): Consider reducing UMAP n_neighbors
- For large datasets (>1000 samples): May benefit from increased n_neighbors
- HDBSCAN minimum cluster size automatically scales with dataset size
- Use XGBoost (default) for selective feature identification
- Switch to Random Forest for broader feature inclusion:
result <- gaudi(omics_list, method = "rf")
- Use silhouette scores to assess clustering quality
- Compare survival differences between clusters
- Perform enrichment analysis on identified features
# Check clustering quality
print(result@silhouette_score)
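For intuition about what a silhouette score measures, the sketch below computes the standard mean silhouette width in base R on a simulated two-cluster embedding. The function `mean_silhouette` is hypothetical; how GAUDI fills its `silhouette_score` slot (for example, how HDBSCAN noise points are treated) is not shown here and may differ.

```r
# Hypothetical helper: mean silhouette width from coordinates and cluster labels.
# For each point: a = mean distance to its own cluster,
#                 b = smallest mean distance to any other cluster,
#                 s = (b - a) / max(a, b).
mean_silhouette <- function(x, clusters) {
  d <- as.matrix(dist(x))
  clusters <- as.integer(factor(clusters))
  k <- max(clusters)
  stopifnot(k >= 2)
  sil <- numeric(nrow(d))
  for (i in seq_len(nrow(d))) {
    own <- clusters == clusters[i]
    if (sum(own) == 1) next                  # singleton: silhouette set to 0
    a <- sum(d[i, own]) / (sum(own) - 1)     # excludes the zero self-distance
    b <- min(vapply(
      setdiff(seq_len(k), clusters[i]),
      function(cl) mean(d[i, clusters == cl]),
      numeric(1)
    ))
    sil[i] <- (b - a) / max(a, b)
  }
  mean(sil)
}

# Simulated 2D embedding with two well-separated groups (toy data)
set.seed(42)
embedding <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2)
)
labels <- rep(c(1, 2), each = 10)

score <- mean_silhouette(embedding, labels)
print(score)
```

Values close to 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest points assigned to the wrong cluster.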
As with most machine learning approaches, these default parameters perform robustly across typical multi-omics datasets, but users are encouraged to optimize them for their specific biological context through standard cross-validation and benchmarking procedures, particularly when analyzing datasets with unique characteristics or investigating novel biological phenomena.
For an in-depth view of how the GAUDI (Group Aggregation via UMAP Data Integration) method performs compared with other leading multi-omics integration techniques, we encourage users to explore our dedicated benchmarking repository. It contains detailed benchmarks across simulated data, TCGA cancer datasets, single-cell datasets, and DepMap multi-omics data, providing insight into the effectiveness and versatility of the GAUDI approach.
GAUDI is licensed under the GNU General Public License v3.0. See the LICENSE file in the repository for details.
@article {Castellano-Escuder2024.10.07.617035,
author = {Castellano-Escuder, Pol and Zachman, Derek K and Han, Kevin and Hirschey, Matthew D},
title = {Interpretable multi-omics integration with UMAP embeddings and density-based clustering},
elocation-id = {2024.10.07.617035},
year = {2024},
doi = {10.1101/2024.10.07.617035},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2024/10/11/2024.10.07.617035},
eprint = {https://www.biorxiv.org/content/early/2024/10/11/2024.10.07.617035.full.pdf},
journal = {bioRxiv}
}