In single-cell RNA sequencing (scRNA-seq), clusters are groups of cells that exhibit similar gene expression patterns. The primary goal of clustering in scRNA-seq analysis is to identify and group together cells that share similar transcriptional profiles. Each cluster represents a distinct population of cells that may share a cell type, biological state, or function. Seurat is a popular R package used to carry out the pre-processing, clustering, and visualization steps in scRNA-seq analysis.
SeuratToGO processes the differential expression markers produced by Seurat’s FindAllMarkers() function. It reformats the gene markers so they can be submitted for gene ontology (GO) analysis with DAVID (Database for Annotation, Visualization and Integrated Discovery). It also provides functions for analyzing and visualizing DAVID output files.
Currently, users have to manually separate the clusters in Seurat’s markers data frame using Excel and export each one as a tab-delimited text file to upload to DAVID. They then have to manually combine all the DAVID output files (one for each cluster) to do further analysis.
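The manual splitting step described above can be approximated in a few lines of base R. This is a minimal sketch, not the package’s implementation: it assumes a markers data frame with `gene` and `cluster` columns, and the toy data, the helper name `write_cluster_gene_lists`, the output file names, and the choice to write a header row are all hypothetical.

```r
# Toy stand-in for the data frame returned by Seurat's FindAllMarkers().
# Only the two columns this sketch needs are included (assumed names).
markers <- data.frame(
  gene    = c("CCR7", "LEF1", "S100A9", "LYZ", "MS4A1"),
  cluster = factor(c(0, 0, 1, 1, 2)),
  stringsAsFactors = FALSE
)

# Hypothetical helper: write one tab-delimited gene list per cluster
# and return the file paths, named by cluster.
write_cluster_gene_lists <- function(markers, out_dir) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  by_cluster <- split(markers$gene, markers$cluster)
  vapply(names(by_cluster), function(cl) {
    path <- file.path(out_dir, paste0("cluster_", cl, ".txt"))
    write.table(data.frame(gene = by_cluster[[cl]]), file = path,
                sep = "\t", quote = FALSE, row.names = FALSE)
    path
  }, character(1))
}

out_dir <- file.path(tempdir(), "cluster_gene_lists")
paths <- write_cluster_gene_lists(markers, out_dir)
```

Each resulting file holds the genes of a single cluster, which is the shape of input the manual Excel workflow produces.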
The package includes the main components of an R package: DESCRIPTION, NAMESPACE, and the man and R subdirectories. It also includes LICENSE, README, and the subdirectories vignettes, tests, data, and inst. The SeuratToGO package was developed using R version 4.3.2 (2023-10-31 ucrt), Platform: x86_64-w64-mingw32/x64 (64-bit), and Running under: Windows 11 x64 (build 22621).
You can install the development version of SeuratToGO from GitHub with:
install.packages("devtools")
library("devtools")
devtools::install_github("dien-n-nguyen/SeuratToGO", build_vignettes = TRUE)
library("SeuratToGO")
To run the Shiny app:
SeuratToGO::run_SeuratToGO()
ls("package:SeuratToGO")
data(package = "SeuratToGO")
browseVignettes("SeuratToGO")
SeuratToGO contains 5 functions.
- separate_clusters separates the differentially expressed markers data frame generated by Seurat by cluster and exports it as tab-delimited text files.
- combine_david_files combines all the DAVID output files into a list of data frames.
- get_top_processes gets the top processes for one specified cluster. The output is a data frame in which each row is a biological process and each column is a property of that process, for example genes, p-value, and population. This gives a closer look at each cluster.
- get_all_top_processes gets the p-values of the top processes for every cluster and consolidates them into one data frame.
- top_processes_heatmap generates a heatmap of all the top processes in each cluster.
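The combining and consolidating steps in the list above can be illustrated with base R. The sketch below does not call the package’s functions and is not their implementation; the toy files, the column names `Term` and `PValue`, the file naming, and the helper `top_n_terms` are assumptions made for illustration.

```r
# Write two toy tab-delimited files standing in for DAVID output files.
# The column names Term and PValue are assumptions for this sketch.
in_dir <- file.path(tempdir(), "david_toy")
dir.create(in_dir, showWarnings = FALSE, recursive = TRUE)
write.table(
  data.frame(Term = c("immune response", "translation"),
             PValue = c(1e-8, 3e-4)),
  file.path(in_dir, "cluster_0.txt"),
  sep = "\t", quote = FALSE, row.names = FALSE
)
write.table(
  data.frame(Term = c("inflammatory response", "immune response"),
             PValue = c(2e-6, 5e-3)),
  file.path(in_dir, "cluster_1.txt"),
  sep = "\t", quote = FALSE, row.names = FALSE
)

# Read every file into a named list of data frames, one per cluster.
files <- sort(list.files(in_dir, pattern = "\\.txt$", full.names = TRUE))
david <- lapply(files, read.delim, stringsAsFactors = FALSE)
names(david) <- sub("\\.txt$", "", basename(files))

# Hypothetical helper: keep the n terms with the smallest p-values.
top_n_terms <- function(df, n = 1) head(df[order(df$PValue), ], n)
top <- lapply(david, top_n_terms, n = 1)

# Consolidate: rows are the union of top terms, columns are clusters,
# and each cell holds the term's p-value in that cluster (NA if absent).
terms <- unique(unlist(lapply(top, function(df) df$Term)))
pvals <- sapply(david, function(df) df$PValue[match(terms, df$Term)])
rownames(pvals) <- terms
```

A term-by-cluster matrix of this shape is the kind of input a heatmap function takes, usually after a transformation such as -log10 of the p-values.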
The package also contains a dataset called pbmc_markers, which contains differentially expressed markers generated using Seurat’s tutorial. It also contains a zip folder called david.zip in inst/extdata/ that contains sample DAVID output files if users want to view them.
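One way to reach the bundled data from an R session is sketched below. It relies only on base R calls (`data()`, `system.file()`, `unzip()`) plus the dataset and file names given above; the extraction directory is an arbitrary choice, and the internal layout of the archive is not assumed.

```r
# Run only if SeuratToGO is installed; otherwise skip quietly.
has_pkg <- requireNamespace("SeuratToGO", quietly = TRUE)

if (has_pkg) {
  # Load the bundled markers dataset into the current environment.
  data("pbmc_markers", package = "SeuratToGO")
  str(pbmc_markers)

  # Locate the zipped sample DAVID output installed from inst/extdata/.
  zip_path <- system.file("extdata", "david.zip", package = "SeuratToGO")
  if (nzchar(zip_path)) {
    # List the archive contents, then extract to a temporary directory.
    print(unzip(zip_path, list = TRUE))
    unzip(zip_path, exdir = file.path(tempdir(), "david"))
  }
}
```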
An overview of the package is illustrated below. The steps highlighted in yellow are not supported by this package, since DAVID’s API does not support the type of gene IDs we are working with. See the vignette for more details.
The author of the package is Dien Nguyen. The author wrote all 5 functions mentioned above. separate_clusters uses the package magrittr for piping and the package dplyr for filtering and selecting. get_top_processes uses dplyr to sort data frames. top_processes_heatmap uses the package pheatmap to generate the heatmap. The pbmc_markers dataset was generated by following Seurat’s clustering tutorial. The DAVID output files were generated using the DAVID web server.
- Bache S, Wickham H. 2022. magrittr: A Forward-Pipe Operator for R. https://magrittr.tidyverse.org, https://github.com/tidyverse/magrittr.
- Benjamini Y, Hochberg Y. 1995. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society: Series B (Methodological). 57(1):289–300. doi:10.1111/j.2517-6161.1995.tb02031.x.
- Butler A, Hoffman P, Smibert P, Papalexi E, Satija R. 2018. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat Biotechnol. 36(5):411–420. doi:10.1038/nbt.4096.
- Kolde R. 2019. Pheatmap: pretty heatmaps. https://github.com/raivokolde/pheatmap.
- R Core Team. 2023. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
- Sherman BT, Hao M, Qiu J, Jiao X, Baseler MW, Lane HC, Imamichi T, Chang W. 2022. DAVID: a web server for functional enrichment analysis and functional annotation of gene lists (2021 update). Nucleic Acids Res. 50(W1):W216–W221. doi:10.1093/nar/gkac194.
- Stuart T, Butler A, Hoffman P, Hafemeister C, Papalexi E, Mauck WM, Hao Y, Stoeckius M, Smibert P, Satija R. 2019. Comprehensive Integration of Single-Cell Data. Cell. 177(7):1888–1902.e21. doi:10.1016/j.cell.2019.05.031.
- Wickham H, Bryan J. 2019. R Packages (2nd edition). Newton, Massachusetts: O’Reilly Media. https://r-pkgs.org/.
- Wickham H, François R, Henry L, Müller K, Vaughan D. 2023. dplyr: A Grammar of Data Manipulation. https://dplyr.tidyverse.org, https://github.com/tidyverse/dplyr.
This package was developed as part of an assessment for 2022-2023 BCB410H: Applied Bioinformatics course at the University of Toronto, Toronto, CANADA. SeuratToGO welcomes issues, enhancement requests, and other contributions. To submit an issue, use the GitHub issues. Many thanks to those who provided feedback to improve this package.