clustREval is a package for evaluating the performance of different clustering pipelines on scRNA-seq data using unsupervised metrics and gene set enrichment analysis. Clustering is a powerful tool for helping researchers detect cellular heterogeneity. However, clustering performance is highly dependent on the parameters used in the clustering pipeline, for which there are no systematic recommendations. This package allows users to compute clustering results from various clustering pipelines defined by user-specified parameters. Clustering results can then be evaluated by comparing unsupervised clustering metrics and differential gene expression across results.
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16
To install the latest version of the package:
devtools::install_github("cindyfang70/clustREval", build_vignettes = TRUE)
data(package = "clustREval")
To launch the Shiny app, run: runClustREval()
clustREval contains four functions that aid in the evaluation of clustering performance.
The runPipelineCombs function runs all combinations of user-specified clustering pipelines. The user simply has to define the various parameters to use at each step in the pipeline and provide the data to perform clustering on.
The computeUnsupervisedMetrics function computes the Dunn index and mean silhouette width of a clustering output.
The geneSetEval function performs Gene Set Enrichment Analysis (GSEA) on each of the clusters from a clustering output and returns enrichment scores based on the Hallmark Pathways from MSigDB.
The plotGeneSetEval function plots the enrichment scores from GSEA.
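The two unsupervised metrics named above can be illustrated with a minimal sketch. It is not the package's own code: the toy data, the cluster labels, and the helper dunn_index are made up for illustration, the silhouette is taken from the cluster package, and the Dunn index is computed by hand in base R to show its definition (smallest between-cluster distance divided by the largest within-cluster distance).

```r
# Sketch only: toy data and a hand-written helper, not clustREval's own code.
library(cluster)  # recommended package shipped with standard R installations

set.seed(1)
# Toy "expression" matrix: 20 cells x 2 features in two well-separated groups.
toy <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2)
)
labels <- rep(c(1L, 2L), each = 10)  # hypothetical cluster assignments
d <- dist(toy)                       # Euclidean distances between cells

# Mean silhouette width: average of the per-cell silhouette values.
sil <- silhouette(labels, d)
mean_sil <- mean(sil[, "sil_width"])

# Dunn index: minimum between-cluster distance divided by the
# maximum within-cluster distance (the largest cluster diameter).
dunn_index <- function(d, labels) {
  m <- as.matrix(d)
  same <- outer(labels, labels, "==")  # TRUE where two cells share a cluster
  min(m[!same]) / max(m[same])
}
dunn <- dunn_index(d, labels)

print(c(mean_silhouette = mean_sil, dunn = dunn))
```

For both metrics, higher values suggest more compact and better separated clusters, which is why they can be compared across clustering outputs without cell type labels.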
An overview of the package is illustrated below:
The author of this package is Xin Zhi Fang.
The runPipelineCombs function uses the pipeComp library to run all pipeline combinations, but does not use the same end results. The evaluation step from the pipeComp library is bypassed because it depends on cell type labels, which are not always available.
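To make "all pipeline combinations" concrete, here is a small sketch in base R. It does not use pipeComp's API or the arguments of runPipelineCombs; the step names and options in params are hypothetical and only show how the number of pipelines grows with the alternatives supplied at each step.

```r
# Sketch only: hypothetical step names and options, not pipeComp's API.
params <- list(
  normalization = c("methodA", "methodB"),  # hypothetical alternatives
  n_features    = c(1000, 2000),
  resolution    = c(0.5, 1, 2)
)

# expand.grid builds one row per combination of the alternatives.
combos <- expand.grid(params, stringsAsFactors = FALSE)
n_pipelines <- nrow(combos)  # 2 * 2 * 3 combinations

# Each row describes one pipeline that would be run on the data.
print(head(combos, 3))
print(n_pipelines)
```

Because the count is the product of the number of options at each step, adding a single alternative to any step can noticeably increase total run time.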
The fgsea package is used to perform GSEA on the clusters. In addition, annotation packages are used to help map the gene symbols to names.
Several packages, including scuttle, are used in various functions for preprocessing of the scRNA-seq data.
SingleCellExperiment is used in almost all functions to store scRNA-seq data.
cluster and clValid were used to calculate the silhouette score and Dunn index, respectively.
ggplot2 and gridExtra were used for plotting functionality.
tibble, tidyverse, dplyr, and magrittr, among others, were used for data manipulation.
Germain, P. L., Sonrel, A., & Robinson, M. D. (2020). pipeComp, a general framework for the evaluation of computational pipelines, reveals performant single cell RNA-seq preprocessing tools. Genome Biology, 21(1).
Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12).
Amezquita, R. A., Lun, A. T. L., Becht, E., Carey, V. J., Carpp, L. N., Geistlinger, L., Marini, F., Rue-Albrecht, K., Risso, D., Soneson, C., Waldron, L., Pagès, H., Smith, M. L., Huber, W., Morgan, M., Gottardo, R., & Hicks, S. C. (2019). Orchestrating single-cell analysis with Bioconductor. Nature Methods, 17(2), 137–145.
Korotkevich, G., Sukhov, V., Budin, N., Shpak, B., Artyomov, M. N., & Sergushichev, A. (2016). Fast gene set enrichment analysis. bioRxiv.
McCarthy DJ, Campbell KR, Lun ATL, Willis QF (2017). “Scater: pre-processing, quality control, normalisation and visualisation of single-cell RNA-seq data in R.” Bioinformatics, 33, 1179-1186. doi: 10.1093/bioinformatics/btw777
Joseph Larmarange (2021). labelled: Manipulating Labelled Data. R package version 2.9.0.
Lun ATL, McCarthy DJ, Marioni JC (2016). “A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor.” F1000Res., 5, 2122. doi: 10.12688/f1000research.9501.2
Hervé Pagès, Marc Carlson, Seth Falcon and Nianhua Li (2020). AnnotationDbi: Manipulation of SQLite-based annotations in Bioconductor. R package version 1.52.0.
Wickham et al. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686.
Guy Brock, Vasyl Pihur, Susmita Datta, Somnath Datta (2008). clValid: An R Package for Cluster Validation. Journal of Statistical Software, 25(4), 1-22.
Hao and Hao et al. Integrated analysis of multimodal single-cell data. Cell (2021)
Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., Hornik, K.(2021). cluster: Cluster Analysis Basics and Extensions. R package version 2.1.2.
Hadley Wickham, Romain François, Lionel Henry and Kirill Müller (2021). dplyr: A Grammar of Data Manipulation. R package version 1.0.7.
Kirill Müller and Hadley Wickham (2021). tibble: Simple Data Frames. R package version 3.1.6.
H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.
Stefan Milton Bache and Hadley Wickham (2020). magrittr: A Forward-Pipe Operator for R. R package version 2.0.1.
Baptiste Auguie (2017). gridExtra: Miscellaneous Functions for “Grid” Graphics. R package version 2.3.
Carlson M (2019). Genome wide annotation for Human. R package version 3.8.2.
Yan L, Yang M, Guo H, et al. Single-cell RNA-Seq profiling of human preimplantation embryos and embryonic stem cells. Nature Structural & Molecular Biology. 2013 Sep;20(9):1131-1139. DOI: 10.1038/nsmb.2660. PMID: 23934149.
Liberzon, A., Birger, C., Thorvaldsdóttir, H., Ghandi, M., Mesirov, J. P., & Tamayo, P. (2015). The Molecular Signatures Database (MSigDB) hallmark gene set collection. Cell systems, 1(6), 417–425.
This package was developed for BCB410H: Applied Bioinformatics, University of Toronto, Toronto, CANADA, 2019-2021.