Clustering algorithms have several parameters which can be varied, leading to different clustering results. A key question when clustering, therefore, is how to identify a set of parameters that lead to robust and reliable clusters that can be used in downstream analysis.
This notebook provides examples of how to use the
rOpenScPCA
package to:
Please refer to the 01_perform-evaluate-clustering.Rmd
notebook for a tutorial on using rOpenScPCA
functions
to:
This notebook will use the sample SCPCS000001
from
project SCPCP000001
, which is assumed present in the
OpenScPCA-analysis/data/current/SCPCP000001
directory, for
all examples. Please see
this documentation for more information about obtaining ScPCA
data.
library(rOpenScPCA)
suppressPackageStartupMessages({
library(SingleCellExperiment)
library(ggplot2)
library(patchwork)
})
Warning: package 'S4Vectors' was built under R version 4.4.1
Warning: package 'IRanges' was built under R version 4.4.1
# Set ggplot theme for plots
theme_set(theme_bw())
# The base path for the OpenScPCA repository
repository_base <- rprojroot::find_root(rprojroot::is_git_root)
# The current data directory, found within the repository base directory
data_dir <- file.path(repository_base, "data", "current")
# The path to this module
module_base <- file.path(repository_base, "analyses", "hello-clusters")
# Path to processed SCE file for sample SCPCS000001
input_sce_file <- file.path(data_dir, "SCPCP000001", "SCPCS000001", "SCPCL000001_processed.rds")
Because clustering involves random sampling, it is important to set the random seed at the top of your analysis script or notebook to ensure reproducibility.
set.seed(2024)
To begin, we’ll read in the SingleCellExperiment
(SCE)
object.
# Read the SCE file
sce <- readRDS(input_sce_file)
For the initial cluster calculations and evaluations, we will use the
PCA matrix extracted from the SCE object. As shown in 01_perform-evaluate-clustering.Rmd
,
it is also possible to use an SCE object or a Seurat object
directly.
# Extract the PCA matrix from an SCE object
pca_matrix <- reducedDim(sce, "PCA")
This section will show how to perform clustering across a set of
parameters (aka, “sweep” a set of parameters) with
rOpenScPCA::sweep_clusters()
.
This function takes a PCA matrix with row names representing unique
cell ids (e.g., barcodes) as its primary argument, with additional
arguments for cluster parameters. This function wraps the
rOpenScPCA::calculate_clusters()
function but allows you to
provide a vector of parameter values to perform clustering across, as
listed below. Clusters will be calculated for all combinations of
parameters values (where applicable); default values that the function
will use for any unspecified parameter values are shown in
parentheses
algorithm
: Which clustering algorithm to use
(Louvain)weighting
: Which weighting scheme to use (Jaccard)nn
: The number of nearest neighbors (10)resolution
: The resolution parameter (1; used only with
Louvain and Leiden clustering)objective_function
: The objective function to optimize
clusters (CPM; used only with Leiden clustering)rOpenScPCA::sweep_clusters()
does allow you to specify
values for any other parameters.
This function will return a list of data frames of clustering results. Each data frame will have the following columns:
cell_id
: Unique cell identifiers, obtained from the PCA
matrix’s row namescluster
: A factor column with the cluster
identitiesTo demonstrate this function, we’ll calculate clusters using the
Louvain algorithm while varying the nn
parameter:
# Define nn parameter values of interest
nn_values <- seq(10, 30, 10)
# Calculate clusters varying nn, but leaving other parameters at their default values
cluster_results_list <- rOpenScPCA::sweep_clusters(
pca_matrix,
nn = nn_values
)
The resulting list has a length of three, one data frame for each
nn
parameter tested:
length(cluster_results_list)
[1] 3
It can be helpful (although it is not strictly necessary to keep
track) to name this list by the varied nn
parameter:
names(cluster_results_list) <- glue::glue("nn_{nn_values}")
We can look at the first few rows of each data frame using purrr::map()
to iterate over the list:
cluster_results_list |>
purrr::map(head)
$nn_10
cell_id cluster algorithm weighting nn resolution
1 GGTTAACTCCTCACTG 1 louvain jaccard 10 1
2 TGATGGTGTTGGATCT 2 louvain jaccard 10 1
3 AGATGCTAGAGCACTG 2 louvain jaccard 10 1
4 TGGATGTTCAACTTTC 2 louvain jaccard 10 1
5 TTATTGCGTGAGTGAC 2 louvain jaccard 10 1
6 AATGACCCAAGGTTGG 1 louvain jaccard 10 1
$nn_20
cell_id cluster algorithm weighting nn resolution
1 GGTTAACTCCTCACTG 1 louvain jaccard 20 1
2 TGATGGTGTTGGATCT 1 louvain jaccard 20 1
3 AGATGCTAGAGCACTG 1 louvain jaccard 20 1
4 TGGATGTTCAACTTTC 1 louvain jaccard 20 1
5 TTATTGCGTGAGTGAC 1 louvain jaccard 20 1
6 AATGACCCAAGGTTGG 1 louvain jaccard 20 1
$nn_30
cell_id cluster algorithm weighting nn resolution
1 GGTTAACTCCTCACTG 1 louvain jaccard 30 1
2 TGATGGTGTTGGATCT 1 louvain jaccard 30 1
3 AGATGCTAGAGCACTG 1 louvain jaccard 30 1
4 TGGATGTTCAACTTTC 1 louvain jaccard 30 1
5 TTATTGCGTGAGTGAC 1 louvain jaccard 30 1
6 AATGACCCAAGGTTGG 1 louvain jaccard 30 1
Generally speaking, purrr::map()
can be used to iterate
over this list to visualize or analyze each clustering result on its
own; we’ll use this approach in the following sections.
When comparing clustering results, it’s important to first visualize the different clusterings to build context for interpreting QC metrics.
As one example of why this is important, we generally expect that more robust clusters will have higher values for metrics like silhouette width and neighborhood purity. However, we also expect that having fewer clusters in the first place will also lead to higher metrics, regardless of cluster quality: When there are fewer clusters, it is more likely that clusters overlap less with one another just because there aren’t many clusters in the first place. This means that, when interpreting cluster quality metrics, you should be careful to take more context about the data into consideration and not only rely on the metric values.
We’ll therefore visualize these results as UMAPs by iterating over
cluster_results_list
and combining plots with patchwork::wrap_plots()
.
We’ll specifically use purrr::imap()
to iterate so that we can assign this list’s names as plot titles.
For this, we’ll begin by extracting a table of UMAP coordinates from our SCE object.
umap_df <- reducedDim(sce, "UMAP") |>
as.data.frame()
Next, we’ll iterate over cluster_results_list
to plot
the UMAPs.
umap_plots <- cluster_results_list |>
purrr::imap(
\(cluster_df, clustering_name) {
# Add a column with cluster assignments to umap_df
umap_df_plot <- umap_df |>
dplyr::mutate(cluster = cluster_df$cluster)
# Plot the UMAP, colored by the new cluster variable
ggplot(umap_df_plot, aes(x = UMAP1, y = UMAP2, color = cluster)) +
geom_point(alpha = 0.6) +
labs(title = glue::glue("nearest neighbors: {clustering_name}")) +
# We'll add a couple UMAP plot settings here, including equal axes and
# turning off the axis ticks and text since UMAP coordinates are not meaningful
coord_equal() +
theme(
axis.ticks = element_blank(),
axis.text = element_blank()
)
}
)
# Print the plots with patchwork::wrap_plots()
patchwork::wrap_plots(umap_plots)
These plots show that the number of clusters decreases as the nearest neighbors parameter increases, with between 9-13 clusters.
This section will use purrr::map()
to iterate over each
clustering result data frame to calculate silhouette width, neighborhood
purity, and stability, and then visualize results. The goal of this code
is to identify whether one clustering parameterization produces more
reliable clusters.
Both silhouette width and neighborhood purity are cell-level
quantities, so we can calculate them together in the same call to
purrr::map()
. Below, we’ll iterate over each data frame in
cluster_results_list
to calculate these quantities.
cell_metric_list <- cluster_results_list |>
purrr::map(
\(cluster_df) {
# calculate silhouette width
silhouette_df <- rOpenScPCA::calculate_silhouette(pca_matrix, cluster_df)
# calculate neighbhorhood purity
purity_df <- rOpenScPCA::calculate_purity(pca_matrix, cluster_df)
# Combine into a single data frame
dplyr::left_join(silhouette_df, purity_df)
}
)
Joining with `by = join_by(cell_id, cluster, algorithm, weighting, nn,
resolution)`
Joining with `by = join_by(cell_id, cluster, algorithm, weighting, nn,
resolution)`
Joining with `by = join_by(cell_id, cluster, algorithm, weighting, nn,
resolution)`
# View the first six rows of each clustering result's cell-level QC metrics
purrr::map(cell_metric_list, head)
$nn_10
cell_id cluster algorithm weighting nn resolution silhouette_other
1 GGTTAACTCCTCACTG 1 louvain jaccard 10 1 4
2 TGATGGTGTTGGATCT 2 louvain jaccard 10 1 1
3 AGATGCTAGAGCACTG 2 louvain jaccard 10 1 1
4 TGGATGTTCAACTTTC 2 louvain jaccard 10 1 1
5 TTATTGCGTGAGTGAC 2 louvain jaccard 10 1 4
6 AATGACCCAAGGTTGG 1 louvain jaccard 10 1 2
silhouette_width purity maximum_neighbor
1 0.11974670 0.5954806 1
2 0.16684933 0.8703055 2
3 -0.01994652 0.4088789 1
4 -0.00389170 0.3559577 1
5 0.10148714 0.6340306 2
6 0.29900703 0.8408082 1
$nn_20
cell_id cluster algorithm weighting nn resolution silhouette_other
1 GGTTAACTCCTCACTG 1 louvain jaccard 20 1 3
2 TGATGGTGTTGGATCT 1 louvain jaccard 20 1 3
3 AGATGCTAGAGCACTG 1 louvain jaccard 20 1 3
4 TGGATGTTCAACTTTC 1 louvain jaccard 20 1 2
5 TTATTGCGTGAGTGAC 1 louvain jaccard 20 1 3
6 AATGACCCAAGGTTGG 1 louvain jaccard 20 1 3
silhouette_width purity maximum_neighbor
1 -0.01856650 0.6014493 1
2 0.16054653 0.9672087 1
3 0.17129677 1.0000000 1
4 0.09659459 0.8061952 1
5 0.03631923 0.8797615 1
6 0.08324109 1.0000000 1
$nn_30
cell_id cluster algorithm weighting nn resolution silhouette_other
1 GGTTAACTCCTCACTG 1 louvain jaccard 30 1 4
2 TGATGGTGTTGGATCT 1 louvain jaccard 30 1 4
3 AGATGCTAGAGCACTG 1 louvain jaccard 30 1 4
4 TGGATGTTCAACTTTC 1 louvain jaccard 30 1 4
5 TTATTGCGTGAGTGAC 1 louvain jaccard 30 1 4
6 AATGACCCAAGGTTGG 1 louvain jaccard 30 1 4
silhouette_width purity maximum_neighbor
1 -0.01294996 0.6588463 1
2 0.15391710 1.0000000 1
3 0.16327444 1.0000000 1
4 0.13532673 0.9382301 1
5 0.03019309 0.9093376 1
6 0.07406343 1.0000000 1
To visualize these results, we can combine all data frames in this
list into a single overall data frame, where the existing
nn
column will distinguish among conditions.
cell_metrics_df <- dplyr::bind_rows(cell_metric_list)
We can visualize silhouette width and neighborhood purity each with
boxplots, for example, and use the patchwork
package to print them together:
# Plot silhouette width
silhouette_plot <- ggplot(cell_metrics_df) +
aes(x = as.factor(nn), y = silhouette_width, fill = as.factor(nn)) +
geom_boxplot() +
scale_fill_brewer(palette = "Pastel2") +
labs(
x = "Number of nearest neighbors",
y = "Silhouette width"
)
# Plot neighborhood purity width
purity_plot <- ggplot(cell_metrics_df) +
aes(x = as.factor(nn), y = purity, fill = as.factor(nn)) +
geom_boxplot() +
scale_fill_brewer(palette = "Pastel2") +
labs(
x = "Number of nearest neighbors",
y = "Neighborhood purity"
)
# Add together using the patchwork library, without a legend
silhouette_plot + purity_plot & theme(legend.position = "none")
While there does not appear to be a salient difference among silhouette width distributions, it does appear that purity is higher with a higher nearest neighbors parameter.
Next, we’ll calculate stability on the clusters using
rOpenScPCA::calculate_stability()
, specifying the same
parameter used for the original cluster calculation at each
iteration.
stability_list <- cluster_results_list |>
purrr::map(
\(cluster_df) {
nn <- unique(cluster_df$nn)
# calculate stability, passing in the parameter value used for this iteration
rOpenScPCA::calculate_stability(pca_matrix, cluster_df, nn = nn)
}
)
We’ll again combine the output of stability_list
into a
single data frame and plot ari
values across
nn
parameterizations.
stability_df <- dplyr::bind_rows(stability_list)
ggplot(stability_df) +
aes(x = as.factor(nn), y = ari, fill = as.factor(nn)) +
geom_boxplot() +
scale_fill_brewer(palette = "Pastel2") +
labs(
x = "Number of nearest neighbors",
y = "Adjusted Rand Index"
) +
theme(legend.position = "none")
Here, we see that a nearest neighbors value of 20 or 30 leads to more stable clustering results compared to 10.
The previous section demonstrated how to calculate clusters and QC
metrics when varying one parameter, but it is possible to vary multiple
parameters at once with rOpenScPCA::sweep_clusters()
. In
this section, we’ll show an overview of how you might write code to vary
two parameters at once (here, nearest neighbors and resolution as
examples) and visualize results.
# Define vectors of parameters to vary
nn_values <- seq(10, 30, 10)
res_values <- seq(5, 15, 5)/10
cluster_results_list <- rOpenScPCA::sweep_clusters(
pca_matrix,
nn = nn_values,
resolution = res_values
)
This resulting list now has 9 different clustering results, for each
combination of nn_values
and res_values
:
length(cluster_results_list)
[1] 9
Next, we’ll iterate over cluster_results_list
to plot
the UMAPs. This time, we’ll use purrr::map()
and pull out
parameters from each iteration’s cluster_df
to form the
UMAP panel title.
umap_plots <- cluster_results_list |>
purrr::map(
\(cluster_df) {
# Add a column with cluster assignments to umap_df
umap_df_plot <- umap_df |>
dplyr::mutate(cluster = cluster_df$cluster)
# Create a title for the UMAP with both parameters
umap_title <- glue::glue(
"nn: {unique(cluster_df$nn)}; res: {unique(cluster_df$resolution)}"
)
# Plot the UMAP, colored by the new cluster variable
ggplot(umap_df_plot, aes(x = UMAP1, y = UMAP2, color = cluster)) +
geom_point(alpha = 0.6) +
labs(title = umap_title) +
# We'll add a couple UMAP-specific plot settings
coord_equal() +
theme(
axis.ticks = element_blank(),
axis.text = element_blank(),
legend.position = "bottom"
)
}
)
# Print the plots with patchwork::wrap_plots()
patchwork::wrap_plots(umap_plots)
This section presents one coding strategy to calculate and visualize results when varying two clustering parameters. In particular, we use faceting to help display all information in one plot, by placing nearest neighbor values on the X-axis and faceting by resolution values. Since silhouette width and neighhorbood purity calculations using generally similar code, we’ll just show neighborhood purity here.
First, we’ll calculate neighborhood purity and combine results into a single data frame.
purity_df <- cluster_results_list |>
purrr::map(
\(cluster_df) {
rOpenScPCA::calculate_purity(pca_matrix, cluster_df)
}
) |>
dplyr::bind_rows()
We’ll add a column resolution_label
which we’ll use to
have informative panel titles in the faceted ggplot we make next.
purity_df <- purity_df |>
dplyr::mutate(resolution_label = glue::glue("Resolution: {resolution}"))
ggplot(purity_df) +
aes(x = as.factor(nn), y = purity, fill = as.factor(nn)) +
geom_boxplot() +
scale_fill_brewer(palette = "Pastel2") +
# facet by resolution
facet_wrap(vars(resolution_label)) +
labs(
x = "Number of nearest neighbors",
y = "Neighborhood purity"
) +
theme(legend.position = "none")
Similar to above, we’ll calculate stability, combine results into a
single data frame, add a resolution_label
column to support
plot interpretation, and finally make our plot.
stability_df <- cluster_results_list |>
purrr::map(
\(cluster_df) {
# Extract parameters for this clustering result
nn <- unique(cluster_df$nn)
resolution <- unique(cluster_df$resolution)
rOpenScPCA::calculate_stability(
pca_matrix,
cluster_df,
nn = nn,
resolution = resolution
)
}
) |>
dplyr::bind_rows()
stability_df <- stability_df |>
dplyr::mutate(resolution_label = glue::glue("Resolution: {resolution}"))
ggplot(stability_df) +
aes(x = as.factor(nn), y = ari, fill = as.factor(nn)) +
geom_boxplot() +
scale_fill_brewer(palette = "Pastel2") +
# facet by resolution
facet_wrap(vars(resolution_label)) +
labs(
x = "Number of nearest neighbors",
y = "Adjusted Rand Index"
) +
theme(legend.position = "none")
# record the versions of the packages used in this analysis and other environment information
sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: aarch64-apple-darwin20
Running under: macOS 15.2
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats4 stats graphics grDevices datasets utils methods
[8] base
other attached packages:
[1] patchwork_1.3.0 ggplot2_3.5.1
[3] SingleCellExperiment_1.26.0 SummarizedExperiment_1.34.0
[5] Biobase_2.64.0 GenomicRanges_1.56.1
[7] GenomeInfoDb_1.40.1 IRanges_2.38.1
[9] S4Vectors_0.42.1 BiocGenerics_0.50.0
[11] MatrixGenerics_1.16.0 matrixStats_1.4.1
[13] rOpenScPCA_0.1.0
loaded via a namespace (and not attached):
[1] gtable_0.3.5 xfun_0.48 bslib_0.8.0
[4] lattice_0.22-6 pdfCluster_1.0-4 vctrs_0.6.5
[7] tools_4.4.0 generics_0.1.3 parallel_4.4.0
[10] tibble_3.2.1 fansi_1.0.6 highr_0.11
[13] cluster_2.1.6 BiocNeighbors_1.22.0 pkgconfig_2.0.3
[16] Matrix_1.7-0 RColorBrewer_1.1-3 lifecycle_1.0.4
[19] GenomeInfoDbData_1.2.12 compiler_4.4.0 farver_2.1.2
[22] munsell_0.5.1 bluster_1.14.0 codetools_0.2-20
[25] htmltools_0.5.8.1 sass_0.4.9 yaml_2.3.10
[28] pillar_1.9.0 crayon_1.5.3 jquerylib_0.1.4
[31] tidyr_1.3.1 BiocParallel_1.38.0 DelayedArray_0.30.1
[34] cachem_1.1.0 abind_1.4-8 tidyselect_1.2.1
[37] digest_0.6.37 dplyr_1.1.4 purrr_1.0.2
[40] magic_1.6-1 labeling_0.4.3 rprojroot_2.0.4
[43] fastmap_1.2.0 grid_4.4.0 colorspace_2.1-1
[46] cli_3.6.3 SparseArray_1.4.8 magrittr_2.0.3
[49] S4Arrays_1.4.1 utf8_1.2.4 withr_3.0.1
[52] scales_1.3.0 UCSC.utils_1.0.0 rmarkdown_2.28
[55] XVector_0.44.0 httr_1.4.7 igraph_2.0.3
[58] evaluate_1.0.0 knitr_1.48 geometry_0.5.0
[61] rlang_1.1.4 Rcpp_1.0.13 glue_1.8.0
[64] BiocManager_1.30.25 renv_1.0.10 jsonlite_1.8.9
[67] R6_2.5.1 zlibbioc_1.50.0