From 6bb42c5e8e71215cc498a8c946934c3a6c2aee18 Mon Sep 17 00:00:00 2001 From: Stephanie Spielman Date: Fri, 20 Dec 2024 15:09:04 -0500 Subject: [PATCH 01/16] UUID --- analyses/hello-clusters/hello-clusters.Rproj | 1 + 1 file changed, 1 insertion(+) diff --git a/analyses/hello-clusters/hello-clusters.Rproj b/analyses/hello-clusters/hello-clusters.Rproj index 98df0873f..3bb706a10 100644 --- a/analyses/hello-clusters/hello-clusters.Rproj +++ b/analyses/hello-clusters/hello-clusters.Rproj @@ -1,4 +1,5 @@ Version: 1.0 +ProjectId: 579b4580-65dc-45bf-ab43-c5fcd3170e05 RestoreWorkspace: No SaveWorkspace: No From 694aadfb91dec0c721508dad7bf095cfb2b73963 Mon Sep 17 00:00:00 2001 From: Stephanie Spielman Date: Fri, 20 Dec 2024 15:09:36 -0500 Subject: [PATCH 02/16] WIP: started sweep notebook --- .../02_compare-clustering-parameters.Rmd | 166 ++++++++++++++++++ 1 file changed, 166 insertions(+) create mode 100644 analyses/hello-clusters/02_compare-clustering-parameters.Rmd diff --git a/analyses/hello-clusters/02_compare-clustering-parameters.Rmd b/analyses/hello-clusters/02_compare-clustering-parameters.Rmd new file mode 100644 index 000000000..34db2b80e --- /dev/null +++ b/analyses/hello-clusters/02_compare-clustering-parameters.Rmd @@ -0,0 +1,166 @@ +--- +title: "Comparing clustering parameters with rOpenScPCA" +date: "`r Sys.Date()`" +author: "Data Lab" +output: + html_notebook: + toc: yes + toc_float: yes + df_print: paged +--- + +## Introduction + +Clustering algorithms have several parameters which can be varied, leading to different clustering results. +A key question when clustering, therefore, is how to identify a set of parameters that lead to robust and reliable clusters that can be used in downstream analysis. + +This notebook provides examples of how to use the `rOpenScPCA` package to: + +* Calculate clusters across several different parameterizations +* Calculate QC metrics on across parameterizations + +Please refer to the [`01_perform-evaluate-clustering.Rmd`](01_perform-evaluate-clustering.Rmd) notebook for a tutorial on using `rOpenScPCA` functions to: + +* Calculate clusters from a single parameterization +* Calculate QC metrics on a single set of clusters + +This notebook will use the sample `SCPCS000001` from project `SCPCP000001`, which is assumed present in the `OpenScPCA-analysis/data/current/SCPCP000001` directory, for all examples. +Please [see this documentation](https://openscpca.readthedocs.io/en/latest/getting-started/accessing-resources/getting-access-to-data/) for more information about obtaining ScPCA data. + +## Setup + +### Packages + + +```{r packages} +library(rOpenScPCA) + +suppressPackageStartupMessages({ + library(SingleCellExperiment) + library(Seurat) + library(dplyr) + library(ggplot2) +}) + +# Set ggplot theme for plots +theme_set(theme_bw()) +``` + + +### Paths + +```{r base paths} +# The base path for the OpenScPCA repository +repository_base <- rprojroot::find_root(rprojroot::is_git_root) + +# The current data directory, found within the repository base directory +data_dir <- file.path(repository_base, "data", "current") + +# The path to this module +module_base <- file.path(repository_base, "analyses", "hello-clusters") +``` + +```{r input file path} +# Path to processed SCE file for sample SCPCS000001 +input_sce_file <- file.path(data_dir, "SCPCP000001", "SCPCS000001", "SCPCL000001_processed.rds") +``` + + +### Set the random seed + +Because clustering involves random sampling, it is important to set the random seed at the top of your analysis script or notebook to ensure reproducibility. + +```{r set seed} +set.seed(2024) +``` + +## Read in and prepare data + +To begin, we'll read in the `SingleCellExperiment` (SCE) object. + +```{r read data} +# Read the SCE file +sce <- readRDS(input_sce_file) +``` + +For the initial cluster calculations and evaluations, we will use the PCA matrix extracted from the SCE object. +As shown in [`01_perform-evaluate-clustering.Rmd`](01_perform-evaluate-clustering.Rmd), it is also possible to use an SCE object or a Seurat object directly. + + +```{r extract pca data} +# Extract the PCA matrix from an SCE object +pca_matrix <- reducedDim(sce, "PCA") +``` + +## Perform clustering across a set of parameters + +This section will show how to perform clustering across a set of parameters (aka, "sweep" a set of parameters) with `rOpenScPCA::sweep_clusters()`. + +This function takes a PCA matrix with rownames representing unique cell ids (e.g., barcodes) as its primary argument, with additional arguments for cluster parameters. +This function wraps the `rOpenScPCA::calculate_clusters()` function but allows you to provide a vector of parameter values to perform clustering across, as listed below. +Clusters will be calculated for all combinations of parameters values (where applicable); default values that the function will use for any unspecified parameter values are shown in parentheses + +* `algorithm`: Which clustering algorithm to use (Louvain) +* `weighting`: Which weighting scheme to use (Jaccard) +* `nn`: The number of nearest neighbors (10) +* `resolution`: The resolution parameter (1; used only with Louvain and Leiden clustering) +* `objective_function`: The objective function to optimize clusters (CPM; used only with Leiden clustering) + +`rOpenScPCA::sweep_clusters()` does allow you to specify values for any other parameters. + + +This function will return a list of data frames of clustering results. +Each data frame will have the following columns: + +* `cell_id`: Unique cell identifiers, obtained from the PCA matrix's row names +* `cluster`: A factor column with the cluster identities +* There will be one column for each clustering parameter used + + +To demonstrate this function, we'll calculate clusters using the Louvain algorithm while varying the `nn` parameter: + +```{r sweep clusters} +# Define nn parameter values of interest +nn <- seq(10, 30, 10) + +# Calculate clusters varying nn, but leaving other parameters at their default values +cluster_results_list <- rOpenScPCA::sweep_clusters( + pca_matrix, + nn = nn +) +``` + +The resulting list has a length of three, one data frame for each `nn` parameter tested: + +```{r length cluster_results_list} +length(cluster_results_list) +``` + +We can look at the first few rows of each data frame using [`purrr::map()`](https://purrr.tidyverse.org/reference/map.html) to iterate over the list: + + +```{r map cluster_results_list} +cluster_results_list |> + purrr::map(head) +``` + +Generally speaking, `purrr::map()` can be used to iterate over this list to analyze or visualize each clustering result on its own. +Complementary to this, we can also combine this list into a single data frame to jointly analyze or visualize all clustering results together: + + +```{r bind cluster_results_list rows} +cluster_results_df <- dplyr::bind_rows(cluster_results_list) +head(cluster_results_df) +``` + +## Evaluating clustering results + +This section will use `purrr::map()` to iterate over each df to calculate silhouette, purity, and stability. +We'll also plot results. + +## Session Info + +```{r session info} +# record the versions of the packages used in this analysis and other environment information +sessionInfo() +``` From 56b609bb9b2d3594e8e44c10226607ba05c10c56 Mon Sep 17 00:00:00 2001 From: Stephanie Spielman Date: Mon, 13 Jan 2025 10:33:35 -0500 Subject: [PATCH 03/16] flesh out section to vary one parameter --- .../02_compare-clustering-parameters.Rmd | 181 +- .../02_compare-clustering-parameters.nb.html | 3563 +++++++++++++++++ 2 files changed, 3728 insertions(+), 16 deletions(-) create mode 100644 analyses/hello-clusters/02_compare-clustering-parameters.nb.html diff --git a/analyses/hello-clusters/02_compare-clustering-parameters.Rmd b/analyses/hello-clusters/02_compare-clustering-parameters.Rmd index 34db2b80e..9ceeca0e7 100644 --- a/analyses/hello-clusters/02_compare-clustering-parameters.Rmd +++ b/analyses/hello-clusters/02_compare-clustering-parameters.Rmd @@ -16,13 +16,13 @@ A key question when clustering, therefore, is how to identify a set of parameter This notebook provides examples of how to use the `rOpenScPCA` package to: -* Calculate clusters across several different parameterizations -* Calculate QC metrics on across parameterizations +* Calculate several versions of clustering results across several different parameterizations +* Calculate QC metrics on across clustering results Please refer to the [`01_perform-evaluate-clustering.Rmd`](01_perform-evaluate-clustering.Rmd) notebook for a tutorial on using `rOpenScPCA` functions to: * Calculate clusters from a single parameterization -* Calculate QC metrics on a single set of clusters +* Calculate QC metrics on a single set of clusters, as well as explanations of the metrics themselves This notebook will use the sample `SCPCS000001` from project `SCPCP000001`, which is assumed present in the `OpenScPCA-analysis/data/current/SCPCP000001` directory, for all examples. Please [see this documentation](https://openscpca.readthedocs.io/en/latest/getting-started/accessing-resources/getting-access-to-data/) for more information about obtaining ScPCA data. @@ -37,9 +37,8 @@ library(rOpenScPCA) suppressPackageStartupMessages({ library(SingleCellExperiment) - library(Seurat) - library(dplyr) library(ggplot2) + library(patchwork) }) # Set ggplot theme for plots @@ -96,7 +95,7 @@ pca_matrix <- reducedDim(sce, "PCA") This section will show how to perform clustering across a set of parameters (aka, "sweep" a set of parameters) with `rOpenScPCA::sweep_clusters()`. -This function takes a PCA matrix with rownames representing unique cell ids (e.g., barcodes) as its primary argument, with additional arguments for cluster parameters. +This function takes a PCA matrix with row names representing unique cell ids (e.g., barcodes) as its primary argument, with additional arguments for cluster parameters. This function wraps the `rOpenScPCA::calculate_clusters()` function but allows you to provide a vector of parameter values to perform clustering across, as listed below. Clusters will be calculated for all combinations of parameters values (where applicable); default values that the function will use for any unspecified parameter values are shown in parentheses @@ -116,17 +115,16 @@ Each data frame will have the following columns: * `cluster`: A factor column with the cluster identities * There will be one column for each clustering parameter used - To demonstrate this function, we'll calculate clusters using the Louvain algorithm while varying the `nn` parameter: ```{r sweep clusters} # Define nn parameter values of interest -nn <- seq(10, 30, 10) +nn_values <- seq(10, 30, 10) # Calculate clusters varying nn, but leaving other parameters at their default values cluster_results_list <- rOpenScPCA::sweep_clusters( pca_matrix, - nn = nn + nn = nn_values ) ``` @@ -136,6 +134,13 @@ The resulting list has a length of three, one data frame for each `nn` parameter length(cluster_results_list) ``` +It can be helpful (although it is not strictly necessary to keep track) to name this list by the varied `nn` parameter: + +```{r set list names} +names(cluster_results_list) <- glue::glue("nn_{nn_values}") +``` + + We can look at the first few rows of each data frame using [`purrr::map()`](https://purrr.tidyverse.org/reference/map.html) to iterate over the list: @@ -144,19 +149,163 @@ cluster_results_list |> purrr::map(head) ``` -Generally speaking, `purrr::map()` can be used to iterate over this list to analyze or visualize each clustering result on its own. -Complementary to this, we can also combine this list into a single data frame to jointly analyze or visualize all clustering results together: +Generally speaking, `purrr::map()` can be used to iterate over this list to visualize or analyze each clustering result on its own; we'll use this approach in the following sections. + +## Visualizing clustering results +When comparing clustering results, it's important to first visualize the different clusterings to build context for interpreting QC metrics. -```{r bind cluster_results_list rows} -cluster_results_df <- dplyr::bind_rows(cluster_results_list) -head(cluster_results_df) +As one example of why this is important, we generally expect that more robust clusters will have higher values for metrics like silhouette width and neighborhood purity. +However, we also expect that having fewer clusters in the first place will also lead to higher metrics, regardless of cluster quality: When there are fewer clusters, it is more likely that clusters overlap less with one another just because there aren't many clusters in the first place. +This means that, when interpreting cluster quality metrics, you should be careful to take more context about the data into consideration and not only rely on the metric values. + +We'll therefore visualize these results as UMAPs by iterating over `cluster_results_list` and combining plots with [`patchwork::wrap_plots()`](https://patchwork.data-imaginist.com/reference/wrap_plots.html). +We'll specifically use [`purrr::imap()`](https://purrr.tidyverse.org/reference/imap.html) to iterate so that we can assign this list's names as plot titles. + +For this, we'll begin by extracting a table of UMAP coordinates from our SCE object. + +```{r create umap_df} +umap_df <- reducedDim(sce, "UMAP") |> + as.data.frame() ``` +Next, we'll iterate over `cluster_results_list` to plot the UMAPs. + +```{r plot all umaps, fig.width = 12} +umap_plots <- cluster_results_list |> + purrr::imap( + \(cluster_df, clustering_name) { + # Add a column with cluster assignments to umap_df + umap_df_plot <- umap_df |> + dplyr::mutate(cluster = cluster_df$cluster) + + # Plot the UMAP, colored by the new cluster variable + ggplot(umap_df_plot, aes(x = UMAP1, y = UMAP2, color = cluster)) + + geom_point(alpha = 0.6) + + labs(title = glue::glue("nearest neighbors: {clustering_name}")) + + # We'll add a couple UMAP plot settings here, including equal axes and + # turning off the axis ticks and text since UMAP coordinates are not meaningful + coord_equal() + + theme( + axis.ticks = element_blank(), + axis.text = element_blank() + ) + } + ) + +# Print the plots with patchwork::wrap_plots() +patchwork::wrap_plots(umap_plots) +``` + +These plots show that the number of clusters decreases as the nearest neighbors parameter increases, with between 9-13 clusters. + + + + ## Evaluating clustering results -This section will use `purrr::map()` to iterate over each df to calculate silhouette, purity, and stability. -We'll also plot results. +This section will use `purrr::map()` to iterate over each clustering result data frame to calculate silhouette width, neighborhood purity, and stability, and then visualize results. +The goal of this code is to identify whether one clustering parameterization produces more reliable clusters. + + +### Silhouette width and neighborhood purity + +Both silhouette width and neighborhood purity are cell-level quantities, so we can calculate them together in the same call to `purrr::map()`. +Below, we'll iterate over each data frame in `cluster_results_list` to calculate these quantities. + +```{r calculate cell level metrics} +cell_metric_list <- cluster_results_list |> + purrr::map( + \(cluster_df) { + # calculate silhouette width + silhouette_df <- rOpenScPCA::calculate_silhouette(pca_matrix, cluster_df) + + # calculate neighbhorhood purity + purity_df <- rOpenScPCA::calculate_purity(pca_matrix, cluster_df) + + # Combine into a single data frame + dplyr::left_join(silhouette_df, purity_df) + } + ) + +# View the first six rows of each clustering result's cell-level QC metrics +purrr::map(cell_metric_list, head) +``` + + +To visualize these results, we can combine all data frames in this list into a single overall data frame, where the existing `nn` column will distinguish among conditions. + +```{r combine cell metrics list} +cell_metrics_df <- dplyr::bind_rows(cell_metric_list) +``` + +We can visualize silhouette width and neighborhood purity each with boxplots, for example, and use the [`patchwork`](https://patchwork.data-imaginist.com/) package to print them together: + + +```{r} +# Plot silhouette width +silhouette_plot <- ggplot(cell_metrics_df) + + aes(x = as.factor(nn), y = silhouette_width, fill = as.factor(nn)) + + geom_boxplot() + + scale_fill_brewer(palette = "Pastel2") + + labs( + x = "Number of nearest neighbors", + y = "Silhouette width" + ) + + +# Plot neighborhood purity width +purity_plot <- ggplot(cell_metrics_df) + + aes(x = as.factor(nn), y = purity, fill = as.factor(nn)) + + geom_boxplot() + + scale_fill_brewer(palette = "Pastel2") + + labs( + x = "Number of nearest neighbors", + y = "Neighborhood purity" + ) + + +# Add together using the patchwork library, without a legend +silhouette_plot + purity_plot & theme(legend.position = "none") +``` +While there does not appear to be a salient difference among silhouette width distributions, it does appear that purity is higher with a higher nearest neighbors parameter. + + +### Stability + +Next, we'll calculate stability on the clusters using `rOpenScPCA::calculate_stability()`, specifying the same parameter used for the original cluster calculation at each iteration. + +```{r calculate stability} +stability_list <- cluster_results_list |> + purrr::map( + \(cluster_df) { + nn <- unique(cluster_df$nn) + + # calculate stability, passing in the parameter value used for this iteration + rOpenScPCA::calculate_stability(pca_matrix, cluster_df, nn = nn) + } + ) +``` + +We'll again combine the output of `stability_list` into a single data frame and plot `ari` values across `nn` parameterizations. + + +```{r combine plot stability} +stability_df <- dplyr::bind_rows(stability_list) + +ggplot(stability_df) + + aes(x = as.factor(nn), y = ari, fill = as.factor(nn)) + + geom_boxplot() + + scale_fill_brewer(palette = "Pastel2") + + labs( + x = "Number of nearest neighbors", + y = "Adjusted Rand Index" + ) + + theme(legend.position = "none") +``` + +Here, we see that a nearest neighbors value of 20 or 30 leads to more stable clustering results compared to 10. + ## Session Info diff --git a/analyses/hello-clusters/02_compare-clustering-parameters.nb.html b/analyses/hello-clusters/02_compare-clustering-parameters.nb.html new file mode 100644 index 000000000..d6c524fd2 --- /dev/null +++ b/analyses/hello-clusters/02_compare-clustering-parameters.nb.html @@ -0,0 +1,3563 @@ + + + + + + + + + + + + + + + +Comparing clustering parameters with rOpenScPCA + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+
+
+
+
+ +
+ + + + + + + + +
+

Introduction

+

Clustering algorithms have several parameters which can be varied, +leading to different clustering results. A key question when clustering, +therefore, is how to identify a set of parameters that lead to robust +and reliable clusters that can be used in downstream analysis.

+

This notebook provides examples of how to use the +rOpenScPCA package to:

+
    +
  • Calculate several versions of clustering results across several +different parameterizations
  • +
  • Calculate QC metrics on across clustering results
  • +
+

Please refer to the 01_perform-evaluate-clustering.Rmd +notebook for a tutorial on using rOpenScPCA functions +to:

+
    +
  • Calculate clusters from a single parameterization
  • +
  • Calculate QC metrics on a single set of clusters, as well as +explanations of the metrics themselves
  • +
+

This notebook will use the sample SCPCS000001 from +project SCPCP000001, which is assumed present in the +OpenScPCA-analysis/data/current/SCPCP000001 directory, for +all examples. Please see +this documentation for more information about obtaining ScPCA +data.

+
+
+

Setup

+
+

Packages

+ + + +
library(rOpenScPCA)
+
+suppressPackageStartupMessages({
+  library(SingleCellExperiment)
+  library(ggplot2)
+  library(patchwork)
+})
+ + +
Warning: package 'S4Vectors' was built under R version 4.4.1
+ + +
Warning: package 'IRanges' was built under R version 4.4.1
+ + +
# Set ggplot theme for plots
+theme_set(theme_bw())
+ + + +
+
+

Paths

+ + + +
# The base path for the OpenScPCA repository
+repository_base <- rprojroot::find_root(rprojroot::is_git_root)
+
+# The current data directory, found within the repository base directory
+data_dir <- file.path(repository_base, "data", "current")
+
+# The path to this module
+module_base <- file.path(repository_base, "analyses", "hello-clusters")
+ + + + + + +
# Path to processed SCE file for sample SCPCS000001
+input_sce_file <- file.path(data_dir, "SCPCP000001", "SCPCS000001", "SCPCL000001_processed.rds")
+ + + +
+
+

Set the random seed

+

Because clustering involves random sampling, it is important to set +the random seed at the top of your analysis script or notebook to ensure +reproducibility.

+ + + +
set.seed(2024)
+ + + +
+
+
+

Read in and prepare data

+

To begin, we’ll read in the SingleCellExperiment (SCE) +object.

+ + + +
# Read the SCE file
+sce <- readRDS(input_sce_file)
+ + + +

For the initial cluster calculations and evaluations, we will use the +PCA matrix extracted from the SCE object. As shown in 01_perform-evaluate-clustering.Rmd, +it is also possible to use an SCE object or a Seurat object +directly.

+ + + +
# Extract the PCA matrix from an SCE object
+pca_matrix <- reducedDim(sce, "PCA")
+ + + +
+
+

Perform clustering across a set of parameters

+

This section will show how to perform clustering across a set of +parameters (aka, “sweep” a set of parameters) with +rOpenScPCA::sweep_clusters().

+

This function takes a PCA matrix with row names representing unique +cell ids (e.g., barcodes) as its primary argument, with additional +arguments for cluster parameters. This function wraps the +rOpenScPCA::calculate_clusters() function but allows you to +provide a vector of parameter values to perform clustering across, as +listed below. Clusters will be calculated for all combinations of +parameters values (where applicable); default values that the function +will use for any unspecified parameter values are shown in +parentheses

+
    +
  • algorithm: Which clustering algorithm to use +(Louvain)
  • +
  • weighting: Which weighting scheme to use (Jaccard)
  • +
  • nn: The number of nearest neighbors (10)
  • +
  • resolution: The resolution parameter (1; used only with +Louvain and Leiden clustering)
  • +
  • objective_function: The objective function to optimize +clusters (CPM; used only with Leiden clustering)
  • +
+

rOpenScPCA::sweep_clusters() does allow you to specify +values for any other parameters.

+

This function will return a list of data frames of clustering +results. Each data frame will have the following columns:

+
    +
  • cell_id: Unique cell identifiers, obtained from the PCA +matrix’s row names
  • +
  • cluster: A factor column with the cluster +identities
  • +
  • There will be one column for each clustering parameter used
  • +
+

To demonstrate this function, we’ll calculate clusters using the +Louvain algorithm while varying the nn parameter:

+ + + +
# Define nn parameter values of interest
+nn_values <- seq(10, 30, 10)
+
+# Calculate clusters varying nn, but leaving other parameters at their default values
+cluster_results_list <- rOpenScPCA::sweep_clusters(
+  pca_matrix,
+  nn = nn_values
+)
+ + + +

The resulting list has a length of three, one data frame for each +nn parameter tested:

+ + + +
length(cluster_results_list)
+ + +
[1] 3
+ + + +

It can be helpful (although it is not strictly necessary to keep +track) to name this list by the varied nn parameter:

+ + + +
names(cluster_results_list) <- glue::glue("nn_{nn_values}")
+ + + +

We can look at the first few rows of each data frame using purrr::map() +to iterate over the list:

+ + + +
cluster_results_list |>
+  purrr::map(head)
+ + +
$nn_10
+           cell_id cluster algorithm weighting nn resolution
+1 GGTTAACTCCTCACTG       1   louvain   jaccard 10          1
+2 TGATGGTGTTGGATCT       2   louvain   jaccard 10          1
+3 AGATGCTAGAGCACTG       2   louvain   jaccard 10          1
+4 TGGATGTTCAACTTTC       2   louvain   jaccard 10          1
+5 TTATTGCGTGAGTGAC       2   louvain   jaccard 10          1
+6 AATGACCCAAGGTTGG       1   louvain   jaccard 10          1
+
+$nn_20
+           cell_id cluster algorithm weighting nn resolution
+1 GGTTAACTCCTCACTG       1   louvain   jaccard 20          1
+2 TGATGGTGTTGGATCT       1   louvain   jaccard 20          1
+3 AGATGCTAGAGCACTG       1   louvain   jaccard 20          1
+4 TGGATGTTCAACTTTC       1   louvain   jaccard 20          1
+5 TTATTGCGTGAGTGAC       1   louvain   jaccard 20          1
+6 AATGACCCAAGGTTGG       1   louvain   jaccard 20          1
+
+$nn_30
+           cell_id cluster algorithm weighting nn resolution
+1 GGTTAACTCCTCACTG       1   louvain   jaccard 30          1
+2 TGATGGTGTTGGATCT       1   louvain   jaccard 30          1
+3 AGATGCTAGAGCACTG       1   louvain   jaccard 30          1
+4 TGGATGTTCAACTTTC       1   louvain   jaccard 30          1
+5 TTATTGCGTGAGTGAC       1   louvain   jaccard 30          1
+6 AATGACCCAAGGTTGG       1   louvain   jaccard 30          1
+ + + +

Generally speaking, purrr::map() can be used to iterate +over this list to visualize or analyze each clustering result on its +own; we’ll use this approach in the following sections.

+
+
+

Visualizing clustering results

+

When comparing clustering results, it’s important to first visualize +the different clusterings to build context for interpreting QC +metrics.

+

As one example of why this is important, we generally expect that +more robust clusters will have higher values for metrics like silhouette +width and neighborhood purity. However, we also expect that having fewer +clusters in the first place will also lead to higher metrics, regardless +of cluster quality: When there are fewer clusters, it is more likely +that clusters overlap less with one another just because there aren’t +many clusters in the first place. This means that, when interpreting +cluster quality metrics, you should be careful to take more context +about the data into consideration and not only rely on the metric +values.

+

We’ll therefore visualize these results as UMAPs by iterating over +cluster_results_list and combining plots with patchwork::wrap_plots(). +We’ll specifically use purrr::imap() +to iterate so that we can assign this list’s names as plot titles.

+

For this, we’ll begin by extracting a table of UMAP coordinates from +our SCE object.

+ + + +
umap_df <- reducedDim(sce, "UMAP") |>
+  as.data.frame()
+ + + +

Next, we’ll iterate over cluster_results_list to plot +the UMAPs.

+ + + +
umap_plots <- cluster_results_list |>
+  purrr::imap(
+    \(cluster_df, clustering_name) {
+      
+      # Add a column with cluster assignments to umap_df
+      umap_df_plot <- umap_df |>
+        dplyr::mutate(cluster = cluster_df$cluster)
+
+      # Plot the UMAP, colored by the new cluster variable
+      ggplot(umap_df_plot, aes(x = UMAP1, y = UMAP2, color = cluster)) +
+        geom_point(alpha = 0.6) +
+        labs(title = glue::glue("nearest neighbors: {clustering_name}")) +
+        # We'll add a couple UMAP plot settings here, including equal axes and 
+        # turning off the axis ticks and text since UMAP coordinates are not meaningful
+        coord_equal() + 
+        theme(
+          axis.ticks = element_blank(),
+          axis.text = element_blank()
+        )
+    }
+  )
+
+# Print the plots with patchwork::wrap_plots()
+patchwork::wrap_plots(umap_plots)
+ + +

+ + + +

These plots show that the number of clusters decreases as the nearest +neighbors parameter increases, with between 9-13 clusters.

+
+
+

Evaluating clustering results

+

This section will use purrr::map() to iterate over each +clustering result data frame to calculate silhouette width, neighborhood +purity, and stability, and then visualize results. The goal of this code +is to identify whether one clustering parameterization produces more +reliable clusters.

+
+

Silhouette width and neighborhood purity

+

Both silhouette width and neighborhood purity are cell-level +quantities, so we can calculate them together in the same call to +purrr::map(). Below, we’ll iterate over each data frame in +cluster_results_list to calculate these quantities.

+ + + +
cell_metric_list <- cluster_results_list |>
+  purrr::map(
+    \(cluster_df) {
+      
+      # calculate silhouette width
+      silhouette_df <- rOpenScPCA::calculate_silhouette(pca_matrix, cluster_df)
+      
+      # calculate neighbhorhood purity 
+      purity_df <- rOpenScPCA::calculate_purity(pca_matrix, cluster_df)
+      
+      # Combine into a single data frame
+      dplyr::left_join(silhouette_df, purity_df)
+    }
+  )
+ + +
Joining with `by = join_by(cell_id, cluster, algorithm, weighting, nn,
+resolution)`
+Joining with `by = join_by(cell_id, cluster, algorithm, weighting, nn,
+resolution)`
+Joining with `by = join_by(cell_id, cluster, algorithm, weighting, nn,
+resolution)`
+ + +
# View the first six rows of each clustering result's cell-level QC metrics
+purrr::map(cell_metric_list, head)
+ + +
$nn_10
+           cell_id cluster algorithm weighting nn resolution silhouette_other
+1 GGTTAACTCCTCACTG       1   louvain   jaccard 10          1                4
+2 TGATGGTGTTGGATCT       2   louvain   jaccard 10          1                1
+3 AGATGCTAGAGCACTG       2   louvain   jaccard 10          1                1
+4 TGGATGTTCAACTTTC       2   louvain   jaccard 10          1                1
+5 TTATTGCGTGAGTGAC       2   louvain   jaccard 10          1                4
+6 AATGACCCAAGGTTGG       1   louvain   jaccard 10          1                2
+  silhouette_width    purity maximum_neighbor
+1       0.11974670 0.5954806                1
+2       0.16684933 0.8703055                2
+3      -0.01994652 0.4088789                1
+4      -0.00389170 0.3559577                1
+5       0.10148714 0.6340306                2
+6       0.29900703 0.8408082                1
+
+$nn_20
+           cell_id cluster algorithm weighting nn resolution silhouette_other
+1 GGTTAACTCCTCACTG       1   louvain   jaccard 20          1                3
+2 TGATGGTGTTGGATCT       1   louvain   jaccard 20          1                3
+3 AGATGCTAGAGCACTG       1   louvain   jaccard 20          1                3
+4 TGGATGTTCAACTTTC       1   louvain   jaccard 20          1                2
+5 TTATTGCGTGAGTGAC       1   louvain   jaccard 20          1                3
+6 AATGACCCAAGGTTGG       1   louvain   jaccard 20          1                3
+  silhouette_width    purity maximum_neighbor
+1      -0.01856650 0.6014493                1
+2       0.16054653 0.9672087                1
+3       0.17129677 1.0000000                1
+4       0.09659459 0.8061952                1
+5       0.03631923 0.8797615                1
+6       0.08324109 1.0000000                1
+
+$nn_30
+           cell_id cluster algorithm weighting nn resolution silhouette_other
+1 GGTTAACTCCTCACTG       1   louvain   jaccard 30          1                4
+2 TGATGGTGTTGGATCT       1   louvain   jaccard 30          1                4
+3 AGATGCTAGAGCACTG       1   louvain   jaccard 30          1                4
+4 TGGATGTTCAACTTTC       1   louvain   jaccard 30          1                4
+5 TTATTGCGTGAGTGAC       1   louvain   jaccard 30          1                4
+6 AATGACCCAAGGTTGG       1   louvain   jaccard 30          1                4
+  silhouette_width    purity maximum_neighbor
+1      -0.01294996 0.6588463                1
+2       0.15391710 1.0000000                1
+3       0.16327444 1.0000000                1
+4       0.13532673 0.9382301                1
+5       0.03019309 0.9093376                1
+6       0.07406343 1.0000000                1
+ + + +

To visualize these results, we can combine all data frames in this +list into a single overall data frame, where the existing +nn column will distinguish among conditions.

+ + + +
cell_metrics_df <- dplyr::bind_rows(cell_metric_list)
+ + + +

We can visualize silhouette width and neighborhood purity each with +boxplots, for example, and use the patchwork +package to print them together:

+ + + +
# Plot silhouette width
+silhouette_plot <- ggplot(cell_metrics_df) + 
+  aes(x = as.factor(nn), y = silhouette_width, fill = as.factor(nn)) + 
+  geom_boxplot() + 
+  scale_fill_brewer(palette = "Pastel2") +
+  labs(
+    x = "Number of nearest neighbors", 
+    y = "Silhouette width"
+  )
+
+
+# Plot neighborhood purity width
+purity_plot <- ggplot(cell_metrics_df) + 
+  aes(x = as.factor(nn), y = purity, fill = as.factor(nn)) + 
+  geom_boxplot() + 
+  scale_fill_brewer(palette = "Pastel2") +
+  labs(
+    x = "Number of nearest neighbors", 
+    y = "Neighborhood purity"
+  )
+
+
+# Add together using the patchwork library, without a legend
+silhouette_plot + purity_plot & theme(legend.position = "none")
+ + +

+ + + +

While there does not appear to be a salient difference among +silhouette width distributions, it does appear that purity is higher +with a higher nearest neighbors parameter.

+
+
+

Stability

+

Next, we’ll calculate stability on the clusters using +rOpenScPCA::calculate_stability(), specifying the same +parameter used for the original cluster calculation at each +iteration.

+ + + +
stability_list <- cluster_results_list |>
+  purrr::map(
+    \(cluster_df) {
+      
+      nn <- unique(cluster_df$nn)
+
+      # calculate stability, passing in the parameter value used for this iteration
+      rOpenScPCA::calculate_stability(pca_matrix, cluster_df, nn = nn)
+    }
+  )
+ + + +

We’ll again combine the output of stability_list into a +single data frame and plot ari values across +nn parameterizations.

+ + + +
stability_df <- dplyr::bind_rows(stability_list)
+
+ggplot(stability_df) + 
+  aes(x = as.factor(nn), y = ari, fill = as.factor(nn)) + 
+  geom_boxplot() + 
+  scale_fill_brewer(palette = "Pastel2") +
+  labs(
+    x = "Number of nearest neighbors", 
+    y = "Adjusted Rand Index"
+  ) + 
+  theme(legend.position = "none")
+ + +

+ + + +

Here, we see that a nearest neighbors value of 20 or 30 leads to more +stable clustering results compared to 10.

+
+
+
+

Session Info

+ + + +
# record the versions of the packages used in this analysis and other environment information
+sessionInfo()
+ + +
R version 4.4.0 (2024-04-24)
+Platform: aarch64-apple-darwin20
+Running under: macOS 15.2
+
+Matrix products: default
+BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
+LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
+
+locale:
+[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
+
+time zone: America/New_York
+tzcode source: internal
+
+attached base packages:
+[1] stats4    stats     graphics  grDevices datasets  utils     methods  
+[8] base     
+
+other attached packages:
+ [1] patchwork_1.3.0             ggplot2_3.5.1              
+ [3] SingleCellExperiment_1.26.0 SummarizedExperiment_1.34.0
+ [5] Biobase_2.64.0              GenomicRanges_1.56.1       
+ [7] GenomeInfoDb_1.40.1         IRanges_2.38.1             
+ [9] S4Vectors_0.42.1            BiocGenerics_0.50.0        
+[11] MatrixGenerics_1.16.0       matrixStats_1.4.1          
+[13] rOpenScPCA_0.1.0           
+
+loaded via a namespace (and not attached):
+ [1] gtable_0.3.5            xfun_0.48               bslib_0.8.0            
+ [4] lattice_0.22-6          pdfCluster_1.0-4        vctrs_0.6.5            
+ [7] tools_4.4.0             generics_0.1.3          parallel_4.4.0         
+[10] tibble_3.2.1            fansi_1.0.6             highr_0.11             
+[13] cluster_2.1.6           BiocNeighbors_1.22.0    pkgconfig_2.0.3        
+[16] Matrix_1.7-0            RColorBrewer_1.1-3      lifecycle_1.0.4        
+[19] GenomeInfoDbData_1.2.12 compiler_4.4.0          farver_2.1.2           
+[22] munsell_0.5.1           bluster_1.14.0          codetools_0.2-20       
+[25] htmltools_0.5.8.1       sass_0.4.9              yaml_2.3.10            
+[28] pillar_1.9.0            crayon_1.5.3            jquerylib_0.1.4        
+[31] tidyr_1.3.1             BiocParallel_1.38.0     DelayedArray_0.30.1    
+[34] cachem_1.1.0            abind_1.4-8             tidyselect_1.2.1       
+[37] digest_0.6.37           dplyr_1.1.4             purrr_1.0.2            
+[40] magic_1.6-1             labeling_0.4.3          rprojroot_2.0.4        
+[43] fastmap_1.2.0           grid_4.4.0              colorspace_2.1-1       
+[46] cli_3.6.3               SparseArray_1.4.8       magrittr_2.0.3         
+[49] S4Arrays_1.4.1          utf8_1.2.4              withr_3.0.1            
+[52] scales_1.3.0            UCSC.utils_1.0.0        rmarkdown_2.28         
+[55] XVector_0.44.0          httr_1.4.7              igraph_2.0.3           
+[58] evaluate_1.0.0          knitr_1.48              geometry_0.5.0         
+[61] rlang_1.1.4             Rcpp_1.0.13             glue_1.8.0             
+[64] BiocManager_1.30.25     renv_1.0.10             jsonlite_1.8.9         
+[67] R6_2.5.1                zlibbioc_1.50.0        
+ + +
+ +
---
title: "Comparing clustering parameters with rOpenScPCA"
date: "`r Sys.Date()`"
author: "Data Lab"
output:
  html_notebook:
    toc: yes
    toc_float: yes
    df_print: paged
---

## Introduction

Clustering algorithms have several parameters which can be varied, leading to different clustering results.
A key question when clustering, therefore, is how to identify a set of parameters that lead to robust and reliable clusters that can be used in downstream analysis. 

This notebook provides examples of how to use the `rOpenScPCA` package to:

* Calculate several versions of clustering results across several different parameterizations
* Calculate QC metrics on across clustering results

Please refer to the [`01_perform-evaluate-clustering.Rmd`](01_perform-evaluate-clustering.Rmd) notebook for a tutorial on using `rOpenScPCA` functions to:

* Calculate clusters from a single parameterization
* Calculate QC metrics on a single set of clusters, as well as explanations of the metrics themselves

This notebook will use the sample `SCPCS000001` from project `SCPCP000001`, which is assumed present in the `OpenScPCA-analysis/data/current/SCPCP000001` directory, for all examples.
Please [see this documentation](https://openscpca.readthedocs.io/en/latest/getting-started/accessing-resources/getting-access-to-data/) for more information about obtaining ScPCA data.

## Setup

### Packages


```{r packages}
library(rOpenScPCA)

suppressPackageStartupMessages({
  library(SingleCellExperiment)
  library(ggplot2)
  library(patchwork)
})

# Set ggplot theme for plots
theme_set(theme_bw())
```


### Paths

```{r base paths}
# The base path for the OpenScPCA repository
repository_base <- rprojroot::find_root(rprojroot::is_git_root)

# The current data directory, found within the repository base directory
data_dir <- file.path(repository_base, "data", "current")

# The path to this module
module_base <- file.path(repository_base, "analyses", "hello-clusters")
```

```{r input file path}
# Path to processed SCE file for sample SCPCS000001
input_sce_file <- file.path(data_dir, "SCPCP000001", "SCPCS000001", "SCPCL000001_processed.rds")
```


### Set the random seed

Because clustering involves random sampling, it is important to set the random seed at the top of your analysis script or notebook to ensure reproducibility.

```{r set seed}
set.seed(2024)
```

## Read in and prepare data

To begin, we'll read in the `SingleCellExperiment` (SCE) object.

```{r read data}
# Read the SCE file
sce <- readRDS(input_sce_file)
```

For the initial cluster calculations and evaluations, we will use the PCA matrix extracted from the SCE object.
As shown in [`01_perform-evaluate-clustering.Rmd`](01_perform-evaluate-clustering.Rmd), it is also possible to use an SCE object or a Seurat object directly.


```{r extract pca data}
# Extract the PCA matrix from an SCE object
pca_matrix <- reducedDim(sce, "PCA")
```

## Perform clustering across a set of parameters

This section will show how to perform clustering across a set of parameters (aka, "sweep" a set of parameters) with `rOpenScPCA::sweep_clusters()`. 

This function takes a PCA matrix with row names representing unique cell ids (e.g., barcodes) as its primary argument, with additional arguments for cluster parameters. 
This function wraps the `rOpenScPCA::calculate_clusters()` function but allows you to provide a vector of parameter values to perform clustering across, as listed below.
Clusters will be calculated for all combinations of parameters values (where applicable); default values that the function will use for any unspecified parameter values are shown in parentheses

* `algorithm`: Which clustering algorithm to use (Louvain)
* `weighting`: Which weighting scheme to use (Jaccard)
* `nn`: The number of nearest neighbors (10)
* `resolution`: The resolution parameter (1; used only with Louvain and Leiden clustering)
* `objective_function`: The objective function to optimize clusters (CPM; used only with Leiden clustering)

`rOpenScPCA::sweep_clusters()` does allow you to specify values for any other parameters. 


This function will return a list of data frames of clustering results. 
Each data frame will have the following columns:

* `cell_id`: Unique cell identifiers, obtained from the PCA matrix's row names
* `cluster`: A factor column with the cluster identities
* There will be one column for each clustering parameter used

To demonstrate this function, we'll calculate clusters using the Louvain algorithm while varying the `nn` parameter:

```{r sweep clusters}
# Define nn parameter values of interest
nn_values <- seq(10, 30, 10)

# Calculate clusters varying nn, but leaving other parameters at their default values
cluster_results_list <- rOpenScPCA::sweep_clusters(
  pca_matrix,
  nn = nn_values
)
```

The resulting list has a length of three, one data frame for each `nn` parameter tested:

```{r length cluster_results_list}
length(cluster_results_list)
```

It can be helpful (although it is not strictly necessary to keep track) to name this list by the varied `nn` parameter:

```{r set list names}
names(cluster_results_list) <- glue::glue("nn_{nn_values}")
```


We can look at the first few rows of each data frame using [`purrr::map()`](https://purrr.tidyverse.org/reference/map.html) to iterate over the list:


```{r map cluster_results_list}
cluster_results_list |>
  purrr::map(head)
```

Generally speaking, `purrr::map()` can be used to iterate over this list to visualize or analyze each clustering result on its own; we'll use this approach in the following sections. 

## Visualizing clustering results

When comparing clustering results, it's important to first visualize the different clusterings to build context for interpreting QC metrics.

As one example of why this is important, we generally expect that more robust clusters will have higher values for metrics like silhouette width and neighborhood purity.
However, we also expect that having fewer clusters in the first place will also lead to higher metrics, regardless of cluster quality: When there are fewer clusters, it is more likely that clusters overlap less with one another just because there aren't many clusters in the first place.
This means that, when interpreting cluster quality metrics, you should be careful to take more context about the data into consideration and not only rely on the metric values.

We'll therefore visualize these results as UMAPs by iterating over `cluster_results_list` and combining plots with [`patchwork::wrap_plots()`](https://patchwork.data-imaginist.com/reference/wrap_plots.html).
We'll specifically use [`purrr::imap()`](https://purrr.tidyverse.org/reference/imap.html) to iterate so that we can assign this list's names as plot titles.

For this, we'll begin by extracting a table of UMAP coordinates from our SCE object.

```{r create umap_df}
umap_df <- reducedDim(sce, "UMAP") |>
  as.data.frame()
```

Next, we'll iterate over `cluster_results_list` to plot the UMAPs.

```{r plot all umaps, fig.width = 12}
umap_plots <- cluster_results_list |>
  purrr::imap(
    \(cluster_df, clustering_name) {
      
      # Add a column with cluster assignments to umap_df
      umap_df_plot <- umap_df |>
        dplyr::mutate(cluster = cluster_df$cluster)

      # Plot the UMAP, colored by the new cluster variable
      ggplot(umap_df_plot, aes(x = UMAP1, y = UMAP2, color = cluster)) +
        geom_point(alpha = 0.6) +
        labs(title = glue::glue("nearest neighbors: {clustering_name}")) +
        # We'll add a couple UMAP plot settings here, including equal axes and 
        # turning off the axis ticks and text since UMAP coordinates are not meaningful
        coord_equal() + 
        theme(
          axis.ticks = element_blank(),
          axis.text = element_blank()
        )
    }
  )

# Print the plots with patchwork::wrap_plots()
patchwork::wrap_plots(umap_plots)
```

These plots show that the number of clusters decreases as the nearest neighbors parameter increases, with between 9-13 clusters.




## Evaluating clustering results

This section will use `purrr::map()` to iterate over each clustering result data frame to calculate silhouette width, neighborhood purity, and stability, and then visualize results.
The goal of this code is to identify whether one clustering parameterization produces more reliable clusters. 


### Silhouette width and neighborhood purity

Both silhouette width and neighborhood purity are cell-level quantities, so we can calculate them together in the same call to `purrr::map()`.
Below, we'll iterate over each data frame in `cluster_results_list` to calculate these quantities.

```{r calculate cell level metrics}
cell_metric_list <- cluster_results_list |>
  purrr::map(
    \(cluster_df) {
      
      # calculate silhouette width
      silhouette_df <- rOpenScPCA::calculate_silhouette(pca_matrix, cluster_df)
      
      # calculate neighbhorhood purity 
      purity_df <- rOpenScPCA::calculate_purity(pca_matrix, cluster_df)
      
      # Combine into a single data frame
      dplyr::left_join(silhouette_df, purity_df)
    }
  )

# View the first six rows of each clustering result's cell-level QC metrics
purrr::map(cell_metric_list, head)
```


To visualize these results, we can combine all data frames in this list into a single overall data frame, where the existing `nn` column will distinguish among conditions.

```{r combine cell metrics list}
cell_metrics_df <- dplyr::bind_rows(cell_metric_list)
```

We can visualize silhouette width and neighborhood purity each with boxplots, for example, and use the [`patchwork`](https://patchwork.data-imaginist.com/) package to print them together:


```{r}
# Plot silhouette width
silhouette_plot <- ggplot(cell_metrics_df) + 
  aes(x = as.factor(nn), y = silhouette_width, fill = as.factor(nn)) + 
  geom_boxplot() + 
  scale_fill_brewer(palette = "Pastel2") +
  labs(
    x = "Number of nearest neighbors", 
    y = "Silhouette width"
  )


# Plot neighborhood purity width
purity_plot <- ggplot(cell_metrics_df) + 
  aes(x = as.factor(nn), y = purity, fill = as.factor(nn)) + 
  geom_boxplot() + 
  scale_fill_brewer(palette = "Pastel2") +
  labs(
    x = "Number of nearest neighbors", 
    y = "Neighborhood purity"
  )


# Add together using the patchwork library, without a legend
silhouette_plot + purity_plot & theme(legend.position = "none")
```
While there does not appear to be a salient difference among silhouette width distributions, it does appear that purity is higher with a higher nearest neighbors parameter.


### Stability

Next, we'll calculate stability on the clusters using `rOpenScPCA::calculate_stability()`, specifying the same parameter used for the original cluster calculation at each iteration. 

```{r calculate stability}
stability_list <- cluster_results_list |>
  purrr::map(
    \(cluster_df) {
      
      nn <- unique(cluster_df$nn)

      # calculate stability, passing in the parameter value used for this iteration
      rOpenScPCA::calculate_stability(pca_matrix, cluster_df, nn = nn)
    }
  )
```

We'll again combine the output of `stability_list` into a single data frame and plot `ari` values across `nn` parameterizations.


```{r combine plot stability}
stability_df <- dplyr::bind_rows(stability_list)

ggplot(stability_df) + 
  aes(x = as.factor(nn), y = ari, fill = as.factor(nn)) + 
  geom_boxplot() + 
  scale_fill_brewer(palette = "Pastel2") +
  labs(
    x = "Number of nearest neighbors", 
    y = "Adjusted Rand Index"
  ) + 
  theme(legend.position = "none")
```

Here, we see that a nearest neighbors value of 20 or 30 leads to more stable clustering results compared to 10. 


## Session Info

```{r session info}
# record the versions of the packages used in this analysis and other environment information
sessionInfo()
```

+ + +
+
+ +
+ + + + + + + + + + + + + + + + + From 947f41fb254adf2adb3db115126d1398e5acd2c9 Mon Sep 17 00:00:00 2001 From: Stephanie Spielman Date: Mon, 13 Jan 2025 10:49:49 -0500 Subject: [PATCH 04/16] add section for varying multiple parameters, and other tweaks --- .../02_compare-clustering-parameters.Rmd | 163 ++++++++++- .../02_compare-clustering-parameters.nb.html | 264 +++++++++++++++--- 2 files changed, 383 insertions(+), 44 deletions(-) diff --git a/analyses/hello-clusters/02_compare-clustering-parameters.Rmd b/analyses/hello-clusters/02_compare-clustering-parameters.Rmd index 9ceeca0e7..1133ce947 100644 --- a/analyses/hello-clusters/02_compare-clustering-parameters.Rmd +++ b/analyses/hello-clusters/02_compare-clustering-parameters.Rmd @@ -91,7 +91,7 @@ As shown in [`01_perform-evaluate-clustering.Rmd`](01_perform-evaluate-clusterin pca_matrix <- reducedDim(sce, "PCA") ``` -## Perform clustering across a set of parameters +## Varying a single clustering parameter This section will show how to perform clustering across a set of parameters (aka, "sweep" a set of parameters) with `rOpenScPCA::sweep_clusters()`. @@ -151,7 +151,7 @@ cluster_results_list |> Generally speaking, `purrr::map()` can be used to iterate over this list to visualize or analyze each clustering result on its own; we'll use this approach in the following sections. -## Visualizing clustering results +### Visualizing clustering results When comparing clustering results, it's important to first visualize the different clusterings to build context for interpreting QC metrics. @@ -171,7 +171,7 @@ umap_df <- reducedDim(sce, "UMAP") |> Next, we'll iterate over `cluster_results_list` to plot the UMAPs. -```{r plot all umaps, fig.width = 12} +```{r plot nn umaps, fig.width = 12} umap_plots <- cluster_results_list |> purrr::imap( \(cluster_df, clustering_name) { @@ -202,13 +202,13 @@ These plots show that the number of clusters decreases as the nearest neighbors -## Evaluating clustering results +### Evaluating clustering results This section will use `purrr::map()` to iterate over each clustering result data frame to calculate silhouette width, neighborhood purity, and stability, and then visualize results. The goal of this code is to identify whether one clustering parameterization produces more reliable clusters. -### Silhouette width and neighborhood purity +#### Silhouette width and neighborhood purity Both silhouette width and neighborhood purity are cell-level quantities, so we can calculate them together in the same call to `purrr::map()`. Below, we'll iterate over each data frame in `cluster_results_list` to calculate these quantities. @@ -268,10 +268,11 @@ purity_plot <- ggplot(cell_metrics_df) + # Add together using the patchwork library, without a legend silhouette_plot + purity_plot & theme(legend.position = "none") ``` + While there does not appear to be a salient difference among silhouette width distributions, it does appear that purity is higher with a higher nearest neighbors parameter. -### Stability +#### Stability Next, we'll calculate stability on the clusters using `rOpenScPCA::calculate_stability()`, specifying the same parameter used for the original cluster calculation at each iteration. @@ -307,6 +308,156 @@ ggplot(stability_df) + Here, we see that a nearest neighbors value of 20 or 30 leads to more stable clustering results compared to 10. +## Varying multiple clustering parameters + +The previous section demonstrated how to calculate clusters and QC metrics when varying one parameter, but it is possible to vary multiple parameters at once with `rOpenScPCA::sweep_clusters()`. +In this section, we'll show an overview of how you might write code to vary two parameters at once (here, nearest neighbors and resolution as examples) and visualize results. + + +```{r sweep two parameters} +# Define vectors of parameters to vary +nn_values <- seq(10, 30, 10) +res_values <- seq(5, 15, 5) / 10 + + +cluster_results_list <- rOpenScPCA::sweep_clusters( + pca_matrix, + nn = nn_values, + resolution = res_values +) +``` + +This resulting list now has 9 different clustering results, for each combination of `nn_values` and `res_values`: + + +```{r length cluster_results_list two parameters} +length(cluster_results_list) +``` + + +### Visualize clusters + +Next, we'll iterate over `cluster_results_list` to plot the UMAPs. +This time, we'll use `purrr::map()` and pull out parameters from each iteration's `cluster_df` to form the UMAP panel title. + +```{r plot nn res umaps, fig.height = 14} +umap_plots <- cluster_results_list |> + purrr::map( + \(cluster_df) { + # Add a column with cluster assignments to umap_df + umap_df_plot <- umap_df |> + dplyr::mutate(cluster = cluster_df$cluster) + + # Create a title for the UMAP with both parameters + umap_title <- glue::glue( + "nn: {unique(cluster_df$nn)}; res: {unique(cluster_df$resolution)}" + ) + + # Plot the UMAP, colored by the new cluster variable + ggplot(umap_df_plot, aes(x = UMAP1, y = UMAP2, color = cluster)) + + geom_point(alpha = 0.6) + + labs(title = umap_title) + + # We'll add a couple UMAP-specific plot settings + coord_equal() + + theme( + axis.ticks = element_blank(), + axis.text = element_blank(), + legend.position = "bottom" + ) + } + ) + +# Print the plots with patchwork::wrap_plots() +patchwork::wrap_plots(umap_plots) +``` + + + +### Calculate and visualize QC metrics + +This section presents one coding strategy to calculate and visualize results when varying two clustering parameters. +In particular, we use faceting to help display all information in one plot, by placing nearest neighbor values on the X-axis and faceting by resolution values. +Since silhouette width and neighhorbood purity calculations using generally similar code, we'll just show neighborhood purity here. + +#### Neighborhood purity + +First, we'll calculate neighborhood purity and combine results into a single data frame. + +```{r calculate purity two parameters} +purity_df <- cluster_results_list |> + purrr::map( + \(cluster_df) { + rOpenScPCA::calculate_purity(pca_matrix, cluster_df) + } + ) |> + dplyr::bind_rows() +``` + + +We'll add a column `resolution_label` which we'll use to have informative panel titles in the faceted ggplot we make next. + +```{r add resolution_label column} +purity_df <- purity_df |> + dplyr::mutate(resolution_label = glue::glue("Resolution: {resolution}")) +``` + +```{r visualize purity two parameters} +ggplot(purity_df) + + aes(x = as.factor(nn), y = purity, fill = as.factor(nn)) + + geom_boxplot() + + scale_fill_brewer(palette = "Pastel2") + + # facet by resolution + facet_wrap(vars(resolution_label)) + + labs( + x = "Number of nearest neighbors", + y = "Neighborhood purity" + ) + + theme(legend.position = "none") +``` + +### Stability + +Similar to above, we'll calculate stability, combine results into a single data frame, add a `resolution_label` column to support plot interpretation, and finally make our plot. + +```{r calculate stability two parameters} +stability_df <- cluster_results_list |> + purrr::map( + \(cluster_df) { + # Extract parameters for this clustering result + nn <- unique(cluster_df$nn) + resolution <- unique(cluster_df$resolution) + + rOpenScPCA::calculate_stability( + pca_matrix, + cluster_df, + nn = nn, + resolution = resolution + ) + } + ) |> + dplyr::bind_rows() + +stability_df <- stability_df |> + dplyr::mutate(resolution_label = glue::glue("Resolution: {resolution}")) +``` + + +```{r visualize stability two parameters} +ggplot(stability_df) + + aes(x = as.factor(nn), y = ari, fill = as.factor(nn)) + + geom_boxplot() + + scale_fill_brewer(palette = "Pastel2") + + # facet by resolution + facet_wrap(vars(resolution_label)) + + labs( + x = "Number of nearest neighbors", + y = "Adjusted Rand Index" + ) + + theme(legend.position = "none") +``` + + + ## Session Info ```{r session info} diff --git a/analyses/hello-clusters/02_compare-clustering-parameters.nb.html b/analyses/hello-clusters/02_compare-clustering-parameters.nb.html index d6c524fd2..b29b98fde 100644 --- a/analyses/hello-clusters/02_compare-clustering-parameters.nb.html +++ b/analyses/hello-clusters/02_compare-clustering-parameters.nb.html @@ -3027,8 +3027,8 @@

Read in and prepare data

-
-

Perform clustering across a set of parameters

+
+

Varying a single clustering parameter

This section will show how to perform clustering across a set of parameters (aka, “sweep” a set of parameters) with rOpenScPCA::sweep_clusters().

@@ -3140,9 +3140,8 @@

Perform clustering across a set of parameters

Generally speaking, purrr::map() can be used to iterate over this list to visualize or analyze each clustering result on its own; we’ll use this approach in the following sections.

-
-
-

Visualizing clustering results

+
+

Visualizing clustering results

When comparing clustering results, it’s important to first visualize the different clusterings to build context for interpreting QC metrics.

@@ -3174,11 +3173,10 @@

Visualizing clustering results

the UMAPs.

- +
umap_plots <- cluster_results_list |>
   purrr::imap(
     \(cluster_df, clustering_name) {
-      
       # Add a column with cluster assignments to umap_df
       umap_df_plot <- umap_df |>
         dplyr::mutate(cluster = cluster_df$cluster)
@@ -3187,9 +3185,9 @@ 

Visualizing clustering results

ggplot(umap_df_plot, aes(x = UMAP1, y = UMAP2, color = cluster)) + geom_point(alpha = 0.6) + labs(title = glue::glue("nearest neighbors: {clustering_name}")) + - # We'll add a couple UMAP plot settings here, including equal axes and + # We'll add a couple UMAP plot settings here, including equal axes and # turning off the axis ticks and text since UMAP coordinates are not meaningful - coord_equal() + + coord_equal() + theme( axis.ticks = element_blank(), axis.text = element_blank() @@ -3208,32 +3206,31 @@

Visualizing clustering results

These plots show that the number of clusters decreases as the nearest neighbors parameter increases, with between 9-13 clusters.

-
-

Evaluating clustering results

+
+

Evaluating clustering results

This section will use purrr::map() to iterate over each clustering result data frame to calculate silhouette width, neighborhood purity, and stability, and then visualize results. The goal of this code is to identify whether one clustering parameterization produces more reliable clusters.

-
-

Silhouette width and neighborhood purity

+
+

Silhouette width and neighborhood purity

Both silhouette width and neighborhood purity are cell-level quantities, so we can calculate them together in the same call to purrr::map(). Below, we’ll iterate over each data frame in cluster_results_list to calculate these quantities.

- +
cell_metric_list <- cluster_results_list |>
   purrr::map(
     \(cluster_df) {
-      
       # calculate silhouette width
       silhouette_df <- rOpenScPCA::calculate_silhouette(pca_matrix, cluster_df)
-      
-      # calculate neighbhorhood purity 
+
+      # calculate neighbhorhood purity
       purity_df <- rOpenScPCA::calculate_purity(pca_matrix, cluster_df)
-      
+
       # Combine into a single data frame
       dplyr::left_join(silhouette_df, purity_df)
     }
@@ -3317,25 +3314,25 @@ 

Silhouette width and neighborhood purity

package to print them together:

- +
# Plot silhouette width
-silhouette_plot <- ggplot(cell_metrics_df) + 
-  aes(x = as.factor(nn), y = silhouette_width, fill = as.factor(nn)) + 
-  geom_boxplot() + 
+silhouette_plot <- ggplot(cell_metrics_df) +
+  aes(x = as.factor(nn), y = silhouette_width, fill = as.factor(nn)) +
+  geom_boxplot() +
   scale_fill_brewer(palette = "Pastel2") +
   labs(
-    x = "Number of nearest neighbors", 
+    x = "Number of nearest neighbors",
     y = "Silhouette width"
   )
 
 
 # Plot neighborhood purity width
-purity_plot <- ggplot(cell_metrics_df) + 
-  aes(x = as.factor(nn), y = purity, fill = as.factor(nn)) + 
-  geom_boxplot() + 
+purity_plot <- ggplot(cell_metrics_df) +
+  aes(x = as.factor(nn), y = purity, fill = as.factor(nn)) +
+  geom_boxplot() +
   scale_fill_brewer(palette = "Pastel2") +
   labs(
-    x = "Number of nearest neighbors", 
+    x = "Number of nearest neighbors",
     y = "Neighborhood purity"
   )
 
@@ -3352,19 +3349,18 @@ 

Silhouette width and neighborhood purity

silhouette width distributions, it does appear that purity is higher with a higher nearest neighbors parameter.

-
-

Stability

+
+

Stability

Next, we’ll calculate stability on the clusters using rOpenScPCA::calculate_stability(), specifying the same parameter used for the original cluster calculation at each iteration.

- +
stability_list <- cluster_results_list |>
   purrr::map(
     \(cluster_df) {
-      
       nn <- unique(cluster_df$nn)
 
       # calculate stability, passing in the parameter value used for this iteration
@@ -3379,17 +3375,17 @@ 

Stability

nn parameterizations.

- +
stability_df <- dplyr::bind_rows(stability_list)
 
-ggplot(stability_df) + 
-  aes(x = as.factor(nn), y = ari, fill = as.factor(nn)) + 
-  geom_boxplot() + 
+ggplot(stability_df) +
+  aes(x = as.factor(nn), y = ari, fill = as.factor(nn)) +
+  geom_boxplot() +
   scale_fill_brewer(palette = "Pastel2") +
   labs(
-    x = "Number of nearest neighbors", 
+    x = "Number of nearest neighbors",
     y = "Adjusted Rand Index"
-  ) + 
+  ) +
   theme(legend.position = "none")
@@ -3401,6 +3397,198 @@

Stability

stable clustering results compared to 10.

+
+
+

Varying multiple clustering parameters

+

The previous section demonstrated how to calculate clusters and QC +metrics when varying one parameter, but it is possible to vary multiple +parameters at once with rOpenScPCA::sweep_clusters(). In +this section, we’ll show an overview of how you might write code to vary +two parameters at once (here, nearest neighbors and resolution as +examples) and visualize results.

+ + + +
# Define vectors of parameters to vary
+nn_values <- seq(10, 30, 10)
+res_values <- seq(5, 15, 5)/10
+
+
+cluster_results_list <- rOpenScPCA::sweep_clusters(
+  pca_matrix,
+  nn = nn_values,
+  resolution = res_values
+)
+ + + +

This resulting list now has 9 different clustering results, for each +combination of nn_values and res_values:

+ + + +
length(cluster_results_list)
+ + +
[1] 9
+ + + +
+

Visualize clusters

+

Next, we’ll iterate over cluster_results_list to plot +the UMAPs. This time, we’ll use purrr::map() and pull out +parameters from each iteration’s cluster_df to form the +UMAP panel title.

+ + + +
umap_plots <- cluster_results_list |>
+  purrr::map(
+    \(cluster_df) {
+
+      # Add a column with cluster assignments to umap_df
+      umap_df_plot <- umap_df |>
+        dplyr::mutate(cluster = cluster_df$cluster)
+
+      # Create a title for the UMAP with both parameters
+      umap_title <- glue::glue(
+        "nn: {unique(cluster_df$nn)}; res: {unique(cluster_df$resolution)}"
+      )
+
+      # Plot the UMAP, colored by the new cluster variable
+      ggplot(umap_df_plot, aes(x = UMAP1, y = UMAP2, color = cluster)) +
+        geom_point(alpha = 0.6) +
+        labs(title = umap_title) +
+        # We'll add a couple UMAP-specific plot settings 
+        coord_equal() +
+        theme(
+          axis.ticks = element_blank(),
+          axis.text = element_blank(),
+          legend.position = "bottom"
+        )
+  }
+)
+
+# Print the plots with patchwork::wrap_plots()
+patchwork::wrap_plots(umap_plots)
+ + +

+ + + +
+
+

Calculate and visualize QC metrics

+

This section presents one coding strategy to calculate and visualize +results when varying two clustering parameters. In particular, we use +faceting to help display all information in one plot, by placing nearest +neighbor values on the X-axis and faceting by resolution values. Since +silhouette width and neighhorbood purity calculations using generally +similar code, we’ll just show neighborhood purity here.

+
+

Neighborhood purity

+

First, we’ll calculate neighborhood purity and combine results into a +single data frame.

+ + + +
purity_df <- cluster_results_list |>
+  purrr::map(
+    \(cluster_df) {
+      rOpenScPCA::calculate_purity(pca_matrix, cluster_df)
+    }
+  ) |>
+  dplyr::bind_rows()
+ + + +

We’ll add a column resolution_label which we’ll use to +have informative panel titles in the faceted ggplot we make next.

+ + + +
purity_df <- purity_df |>
+  dplyr::mutate(resolution_label = glue::glue("Resolution: {resolution}"))
+ + + + + + +
ggplot(purity_df) +
+  aes(x = as.factor(nn), y = purity, fill = as.factor(nn)) +
+  geom_boxplot() +
+  scale_fill_brewer(palette = "Pastel2") +
+  # facet by resolution
+  facet_wrap(vars(resolution_label)) +
+  labs(
+    x = "Number of nearest neighbors",
+    y = "Neighborhood purity"
+  ) +
+  theme(legend.position = "none")
+ + +

+ + + +
+
+
+

Stability

+

Similar to above, we’ll calculate stability, combine results into a +single data frame, add a resolution_label column to support +plot interpretation, and finally make our plot.

+ + + +
stability_df <- cluster_results_list |>
+  purrr::map(
+    \(cluster_df) {
+
+      # Extract parameters for this clustering result
+      nn <- unique(cluster_df$nn)
+      resolution <- unique(cluster_df$resolution)
+
+      rOpenScPCA::calculate_stability(
+        pca_matrix,
+        cluster_df,
+        nn = nn,
+        resolution = resolution
+      )
+    }
+  ) |>
+  dplyr::bind_rows()
+
+stability_df <- stability_df |>
+  dplyr::mutate(resolution_label = glue::glue("Resolution: {resolution}"))
+ + + + + + +
ggplot(stability_df) +
+  aes(x = as.factor(nn), y = ari, fill = as.factor(nn)) +
+  geom_boxplot() +
+  scale_fill_brewer(palette = "Pastel2") +
+  # facet by resolution
+  facet_wrap(vars(resolution_label)) +
+  labs(
+    x = "Number of nearest neighbors",
+    y = "Adjusted Rand Index"
+  ) +
+  theme(legend.position = "none")
+ + +

+ + + +
+

Session Info

@@ -3465,7 +3653,7 @@

Session Info

-
---
title: "Comparing clustering parameters with rOpenScPCA"
date: "`r Sys.Date()`"
author: "Data Lab"
output:
  html_notebook:
    toc: yes
    toc_float: yes
    df_print: paged
---

## Introduction

Clustering algorithms have several parameters which can be varied, leading to different clustering results.
A key question when clustering, therefore, is how to identify a set of parameters that lead to robust and reliable clusters that can be used in downstream analysis. 

This notebook provides examples of how to use the `rOpenScPCA` package to:

* Calculate several versions of clustering results across several different parameterizations
* Calculate QC metrics on across clustering results

Please refer to the [`01_perform-evaluate-clustering.Rmd`](01_perform-evaluate-clustering.Rmd) notebook for a tutorial on using `rOpenScPCA` functions to:

* Calculate clusters from a single parameterization
* Calculate QC metrics on a single set of clusters, as well as explanations of the metrics themselves

This notebook will use the sample `SCPCS000001` from project `SCPCP000001`, which is assumed present in the `OpenScPCA-analysis/data/current/SCPCP000001` directory, for all examples.
Please [see this documentation](https://openscpca.readthedocs.io/en/latest/getting-started/accessing-resources/getting-access-to-data/) for more information about obtaining ScPCA data.

## Setup

### Packages


```{r packages}
library(rOpenScPCA)

suppressPackageStartupMessages({
  library(SingleCellExperiment)
  library(ggplot2)
  library(patchwork)
})

# Set ggplot theme for plots
theme_set(theme_bw())
```


### Paths

```{r base paths}
# The base path for the OpenScPCA repository
repository_base <- rprojroot::find_root(rprojroot::is_git_root)

# The current data directory, found within the repository base directory
data_dir <- file.path(repository_base, "data", "current")

# The path to this module
module_base <- file.path(repository_base, "analyses", "hello-clusters")
```

```{r input file path}
# Path to processed SCE file for sample SCPCS000001
input_sce_file <- file.path(data_dir, "SCPCP000001", "SCPCS000001", "SCPCL000001_processed.rds")
```


### Set the random seed

Because clustering involves random sampling, it is important to set the random seed at the top of your analysis script or notebook to ensure reproducibility.

```{r set seed}
set.seed(2024)
```

## Read in and prepare data

To begin, we'll read in the `SingleCellExperiment` (SCE) object.

```{r read data}
# Read the SCE file
sce <- readRDS(input_sce_file)
```

For the initial cluster calculations and evaluations, we will use the PCA matrix extracted from the SCE object.
As shown in [`01_perform-evaluate-clustering.Rmd`](01_perform-evaluate-clustering.Rmd), it is also possible to use an SCE object or a Seurat object directly.


```{r extract pca data}
# Extract the PCA matrix from an SCE object
pca_matrix <- reducedDim(sce, "PCA")
```

## Perform clustering across a set of parameters

This section will show how to perform clustering across a set of parameters (aka, "sweep" a set of parameters) with `rOpenScPCA::sweep_clusters()`. 

This function takes a PCA matrix with row names representing unique cell ids (e.g., barcodes) as its primary argument, with additional arguments for cluster parameters. 
This function wraps the `rOpenScPCA::calculate_clusters()` function but allows you to provide a vector of parameter values to perform clustering across, as listed below.
Clusters will be calculated for all combinations of parameters values (where applicable); default values that the function will use for any unspecified parameter values are shown in parentheses

* `algorithm`: Which clustering algorithm to use (Louvain)
* `weighting`: Which weighting scheme to use (Jaccard)
* `nn`: The number of nearest neighbors (10)
* `resolution`: The resolution parameter (1; used only with Louvain and Leiden clustering)
* `objective_function`: The objective function to optimize clusters (CPM; used only with Leiden clustering)

`rOpenScPCA::sweep_clusters()` does allow you to specify values for any other parameters. 


This function will return a list of data frames of clustering results. 
Each data frame will have the following columns:

* `cell_id`: Unique cell identifiers, obtained from the PCA matrix's row names
* `cluster`: A factor column with the cluster identities
* There will be one column for each clustering parameter used

To demonstrate this function, we'll calculate clusters using the Louvain algorithm while varying the `nn` parameter:

```{r sweep clusters}
# Define nn parameter values of interest
nn_values <- seq(10, 30, 10)

# Calculate clusters varying nn, but leaving other parameters at their default values
cluster_results_list <- rOpenScPCA::sweep_clusters(
  pca_matrix,
  nn = nn_values
)
```

The resulting list has a length of three, one data frame for each `nn` parameter tested:

```{r length cluster_results_list}
length(cluster_results_list)
```

It can be helpful (although it is not strictly necessary to keep track) to name this list by the varied `nn` parameter:

```{r set list names}
names(cluster_results_list) <- glue::glue("nn_{nn_values}")
```


We can look at the first few rows of each data frame using [`purrr::map()`](https://purrr.tidyverse.org/reference/map.html) to iterate over the list:


```{r map cluster_results_list}
cluster_results_list |>
  purrr::map(head)
```

Generally speaking, `purrr::map()` can be used to iterate over this list to visualize or analyze each clustering result on its own; we'll use this approach in the following sections. 

## Visualizing clustering results

When comparing clustering results, it's important to first visualize the different clusterings to build context for interpreting QC metrics.

As one example of why this is important, we generally expect that more robust clusters will have higher values for metrics like silhouette width and neighborhood purity.
However, we also expect that having fewer clusters in the first place will also lead to higher metrics, regardless of cluster quality: When there are fewer clusters, it is more likely that clusters overlap less with one another just because there aren't many clusters in the first place.
This means that, when interpreting cluster quality metrics, you should be careful to take more context about the data into consideration and not only rely on the metric values.

We'll therefore visualize these results as UMAPs by iterating over `cluster_results_list` and combining plots with [`patchwork::wrap_plots()`](https://patchwork.data-imaginist.com/reference/wrap_plots.html).
We'll specifically use [`purrr::imap()`](https://purrr.tidyverse.org/reference/imap.html) to iterate so that we can assign this list's names as plot titles.

For this, we'll begin by extracting a table of UMAP coordinates from our SCE object.

```{r create umap_df}
umap_df <- reducedDim(sce, "UMAP") |>
  as.data.frame()
```

Next, we'll iterate over `cluster_results_list` to plot the UMAPs.

```{r plot all umaps, fig.width = 12}
umap_plots <- cluster_results_list |>
  purrr::imap(
    \(cluster_df, clustering_name) {
      
      # Add a column with cluster assignments to umap_df
      umap_df_plot <- umap_df |>
        dplyr::mutate(cluster = cluster_df$cluster)

      # Plot the UMAP, colored by the new cluster variable
      ggplot(umap_df_plot, aes(x = UMAP1, y = UMAP2, color = cluster)) +
        geom_point(alpha = 0.6) +
        labs(title = glue::glue("nearest neighbors: {clustering_name}")) +
        # We'll add a couple UMAP plot settings here, including equal axes and 
        # turning off the axis ticks and text since UMAP coordinates are not meaningful
        coord_equal() + 
        theme(
          axis.ticks = element_blank(),
          axis.text = element_blank()
        )
    }
  )

# Print the plots with patchwork::wrap_plots()
patchwork::wrap_plots(umap_plots)
```

These plots show that the number of clusters decreases as the nearest neighbors parameter increases, with between 9-13 clusters.




## Evaluating clustering results

This section will use `purrr::map()` to iterate over each clustering result data frame to calculate silhouette width, neighborhood purity, and stability, and then visualize results.
The goal of this code is to identify whether one clustering parameterization produces more reliable clusters. 


### Silhouette width and neighborhood purity

Both silhouette width and neighborhood purity are cell-level quantities, so we can calculate them together in the same call to `purrr::map()`.
Below, we'll iterate over each data frame in `cluster_results_list` to calculate these quantities.

```{r calculate cell level metrics}
cell_metric_list <- cluster_results_list |>
  purrr::map(
    \(cluster_df) {
      
      # calculate silhouette width
      silhouette_df <- rOpenScPCA::calculate_silhouette(pca_matrix, cluster_df)
      
      # calculate neighbhorhood purity 
      purity_df <- rOpenScPCA::calculate_purity(pca_matrix, cluster_df)
      
      # Combine into a single data frame
      dplyr::left_join(silhouette_df, purity_df)
    }
  )

# View the first six rows of each clustering result's cell-level QC metrics
purrr::map(cell_metric_list, head)
```


To visualize these results, we can combine all data frames in this list into a single overall data frame, where the existing `nn` column will distinguish among conditions.

```{r combine cell metrics list}
cell_metrics_df <- dplyr::bind_rows(cell_metric_list)
```

We can visualize silhouette width and neighborhood purity each with boxplots, for example, and use the [`patchwork`](https://patchwork.data-imaginist.com/) package to print them together:


```{r}
# Plot silhouette width
silhouette_plot <- ggplot(cell_metrics_df) + 
  aes(x = as.factor(nn), y = silhouette_width, fill = as.factor(nn)) + 
  geom_boxplot() + 
  scale_fill_brewer(palette = "Pastel2") +
  labs(
    x = "Number of nearest neighbors", 
    y = "Silhouette width"
  )


# Plot neighborhood purity width
purity_plot <- ggplot(cell_metrics_df) + 
  aes(x = as.factor(nn), y = purity, fill = as.factor(nn)) + 
  geom_boxplot() + 
  scale_fill_brewer(palette = "Pastel2") +
  labs(
    x = "Number of nearest neighbors", 
    y = "Neighborhood purity"
  )


# Add together using the patchwork library, without a legend
silhouette_plot + purity_plot & theme(legend.position = "none")
```
While there does not appear to be a salient difference among silhouette width distributions, it does appear that purity is higher with a higher nearest neighbors parameter.


### Stability

Next, we'll calculate stability on the clusters using `rOpenScPCA::calculate_stability()`, specifying the same parameter used for the original cluster calculation at each iteration. 

```{r calculate stability}
stability_list <- cluster_results_list |>
  purrr::map(
    \(cluster_df) {
      
      nn <- unique(cluster_df$nn)

      # calculate stability, passing in the parameter value used for this iteration
      rOpenScPCA::calculate_stability(pca_matrix, cluster_df, nn = nn)
    }
  )
```

We'll again combine the output of `stability_list` into a single data frame and plot `ari` values across `nn` parameterizations.


```{r combine plot stability}
stability_df <- dplyr::bind_rows(stability_list)

ggplot(stability_df) + 
  aes(x = as.factor(nn), y = ari, fill = as.factor(nn)) + 
  geom_boxplot() + 
  scale_fill_brewer(palette = "Pastel2") +
  labs(
    x = "Number of nearest neighbors", 
    y = "Adjusted Rand Index"
  ) + 
  theme(legend.position = "none")
```

Here, we see that a nearest neighbors value of 20 or 30 leads to more stable clustering results compared to 10. 


## Session Info

```{r session info}
# record the versions of the packages used in this analysis and other environment information
sessionInfo()
```

+
---
title: "Comparing clustering parameters with rOpenScPCA"
date: "`r Sys.Date()`"
author: "Data Lab"
output:
  html_notebook:
    toc: yes
    toc_float: yes
    df_print: paged
---

## Introduction

Clustering algorithms have several parameters which can be varied, leading to different clustering results.
A key question when clustering, therefore, is how to identify a set of parameters that lead to robust and reliable clusters that can be used in downstream analysis. 

This notebook provides examples of how to use the `rOpenScPCA` package to:

* Calculate several versions of clustering results across several different parameterizations
* Calculate QC metrics on across clustering results

Please refer to the [`01_perform-evaluate-clustering.Rmd`](01_perform-evaluate-clustering.Rmd) notebook for a tutorial on using `rOpenScPCA` functions to:

* Calculate clusters from a single parameterization
* Calculate QC metrics on a single set of clusters, as well as explanations of the metrics themselves

This notebook will use the sample `SCPCS000001` from project `SCPCP000001`, which is assumed present in the `OpenScPCA-analysis/data/current/SCPCP000001` directory, for all examples.
Please [see this documentation](https://openscpca.readthedocs.io/en/latest/getting-started/accessing-resources/getting-access-to-data/) for more information about obtaining ScPCA data.

## Setup

### Packages


```{r packages}
library(rOpenScPCA)

suppressPackageStartupMessages({
  library(SingleCellExperiment)
  library(ggplot2)
  library(patchwork)
})

# Set ggplot theme for plots
theme_set(theme_bw())
```


### Paths

```{r base paths}
# The base path for the OpenScPCA repository
repository_base <- rprojroot::find_root(rprojroot::is_git_root)

# The current data directory, found within the repository base directory
data_dir <- file.path(repository_base, "data", "current")

# The path to this module
module_base <- file.path(repository_base, "analyses", "hello-clusters")
```

```{r input file path}
# Path to processed SCE file for sample SCPCS000001
input_sce_file <- file.path(data_dir, "SCPCP000001", "SCPCS000001", "SCPCL000001_processed.rds")
```


### Set the random seed

Because clustering involves random sampling, it is important to set the random seed at the top of your analysis script or notebook to ensure reproducibility.

```{r set seed}
set.seed(2024)
```

## Read in and prepare data

To begin, we'll read in the `SingleCellExperiment` (SCE) object.

```{r read data}
# Read the SCE file
sce <- readRDS(input_sce_file)
```

For the initial cluster calculations and evaluations, we will use the PCA matrix extracted from the SCE object.
As shown in [`01_perform-evaluate-clustering.Rmd`](01_perform-evaluate-clustering.Rmd), it is also possible to use an SCE object or a Seurat object directly.


```{r extract pca data}
# Extract the PCA matrix from an SCE object
pca_matrix <- reducedDim(sce, "PCA")
```

## Varying a single clustering parameter

This section will show how to perform clustering across a set of parameters (aka, "sweep" a set of parameters) with `rOpenScPCA::sweep_clusters()`. 

This function takes a PCA matrix with row names representing unique cell ids (e.g., barcodes) as its primary argument, with additional arguments for cluster parameters. 
This function wraps the `rOpenScPCA::calculate_clusters()` function but allows you to provide a vector of parameter values to perform clustering across, as listed below.
Clusters will be calculated for all combinations of parameters values (where applicable); default values that the function will use for any unspecified parameter values are shown in parentheses

* `algorithm`: Which clustering algorithm to use (Louvain)
* `weighting`: Which weighting scheme to use (Jaccard)
* `nn`: The number of nearest neighbors (10)
* `resolution`: The resolution parameter (1; used only with Louvain and Leiden clustering)
* `objective_function`: The objective function to optimize clusters (CPM; used only with Leiden clustering)

`rOpenScPCA::sweep_clusters()` does allow you to specify values for any other parameters. 


This function will return a list of data frames of clustering results. 
Each data frame will have the following columns:

* `cell_id`: Unique cell identifiers, obtained from the PCA matrix's row names
* `cluster`: A factor column with the cluster identities
* There will be one column for each clustering parameter used

To demonstrate this function, we'll calculate clusters using the Louvain algorithm while varying the `nn` parameter:

```{r sweep clusters}
# Define nn parameter values of interest
nn_values <- seq(10, 30, 10)

# Calculate clusters varying nn, but leaving other parameters at their default values
cluster_results_list <- rOpenScPCA::sweep_clusters(
  pca_matrix,
  nn = nn_values
)
```

The resulting list has a length of three, one data frame for each `nn` parameter tested:

```{r length cluster_results_list}
length(cluster_results_list)
```

It can be helpful (although it is not strictly necessary to keep track) to name this list by the varied `nn` parameter:

```{r set list names}
names(cluster_results_list) <- glue::glue("nn_{nn_values}")
```


We can look at the first few rows of each data frame using [`purrr::map()`](https://purrr.tidyverse.org/reference/map.html) to iterate over the list:


```{r map cluster_results_list}
cluster_results_list |>
  purrr::map(head)
```

Generally speaking, `purrr::map()` can be used to iterate over this list to visualize or analyze each clustering result on its own; we'll use this approach in the following sections. 

### Visualizing clustering results

When comparing clustering results, it's important to first visualize the different clusterings to build context for interpreting QC metrics.

As one example of why this is important, we generally expect that more robust clusters will have higher values for metrics like silhouette width and neighborhood purity.
However, we also expect that having fewer clusters in the first place will also lead to higher metrics, regardless of cluster quality: When there are fewer clusters, it is more likely that clusters overlap less with one another just because there aren't many clusters in the first place.
This means that, when interpreting cluster quality metrics, you should be careful to take more context about the data into consideration and not only rely on the metric values.

We'll therefore visualize these results as UMAPs by iterating over `cluster_results_list` and combining plots with [`patchwork::wrap_plots()`](https://patchwork.data-imaginist.com/reference/wrap_plots.html).
We'll specifically use [`purrr::imap()`](https://purrr.tidyverse.org/reference/imap.html) to iterate so that we can assign this list's names as plot titles.

For this, we'll begin by extracting a table of UMAP coordinates from our SCE object.

```{r create umap_df}
umap_df <- reducedDim(sce, "UMAP") |>
  as.data.frame()
```

Next, we'll iterate over `cluster_results_list` to plot the UMAPs.

```{r plot nn umaps, fig.width = 12}
umap_plots <- cluster_results_list |>
  purrr::imap(
    \(cluster_df, clustering_name) {
      # Add a column with cluster assignments to umap_df
      umap_df_plot <- umap_df |>
        dplyr::mutate(cluster = cluster_df$cluster)

      # Plot the UMAP, colored by the new cluster variable
      ggplot(umap_df_plot, aes(x = UMAP1, y = UMAP2, color = cluster)) +
        geom_point(alpha = 0.6) +
        labs(title = glue::glue("nearest neighbors: {clustering_name}")) +
        # We'll add a couple UMAP plot settings here, including equal axes and
        # turning off the axis ticks and text since UMAP coordinates are not meaningful
        coord_equal() +
        theme(
          axis.ticks = element_blank(),
          axis.text = element_blank()
        )
    }
  )

# Print the plots with patchwork::wrap_plots()
patchwork::wrap_plots(umap_plots)
```

These plots show that the number of clusters decreases as the nearest neighbors parameter increases, with between 9-13 clusters.




### Evaluating clustering results

This section will use `purrr::map()` to iterate over each clustering result data frame to calculate silhouette width, neighborhood purity, and stability, and then visualize results.
The goal of this code is to identify whether one clustering parameterization produces more reliable clusters. 


#### Silhouette width and neighborhood purity

Both silhouette width and neighborhood purity are cell-level quantities, so we can calculate them together in the same call to `purrr::map()`.
Below, we'll iterate over each data frame in `cluster_results_list` to calculate these quantities.

```{r calculate cell level metrics}
cell_metric_list <- cluster_results_list |>
  purrr::map(
    \(cluster_df) {
      # calculate silhouette width
      silhouette_df <- rOpenScPCA::calculate_silhouette(pca_matrix, cluster_df)

      # calculate neighbhorhood purity
      purity_df <- rOpenScPCA::calculate_purity(pca_matrix, cluster_df)

      # Combine into a single data frame
      dplyr::left_join(silhouette_df, purity_df)
    }
  )

# View the first six rows of each clustering result's cell-level QC metrics
purrr::map(cell_metric_list, head)
```


To visualize these results, we can combine all data frames in this list into a single overall data frame, where the existing `nn` column will distinguish among conditions.

```{r combine cell metrics list}
cell_metrics_df <- dplyr::bind_rows(cell_metric_list)
```

We can visualize silhouette width and neighborhood purity each with boxplots, for example, and use the [`patchwork`](https://patchwork.data-imaginist.com/) package to print them together:


```{r}
# Plot silhouette width
silhouette_plot <- ggplot(cell_metrics_df) +
  aes(x = as.factor(nn), y = silhouette_width, fill = as.factor(nn)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Pastel2") +
  labs(
    x = "Number of nearest neighbors",
    y = "Silhouette width"
  )


# Plot neighborhood purity width
purity_plot <- ggplot(cell_metrics_df) +
  aes(x = as.factor(nn), y = purity, fill = as.factor(nn)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Pastel2") +
  labs(
    x = "Number of nearest neighbors",
    y = "Neighborhood purity"
  )


# Add together using the patchwork library, without a legend
silhouette_plot + purity_plot & theme(legend.position = "none")
```

While there does not appear to be a salient difference among silhouette width distributions, it does appear that purity is higher with a higher nearest neighbors parameter.


#### Stability

Next, we'll calculate stability on the clusters using `rOpenScPCA::calculate_stability()`, specifying the same parameter used for the original cluster calculation at each iteration. 

```{r calculate stability}
stability_list <- cluster_results_list |>
  purrr::map(
    \(cluster_df) {
      nn <- unique(cluster_df$nn)

      # calculate stability, passing in the parameter value used for this iteration
      rOpenScPCA::calculate_stability(pca_matrix, cluster_df, nn = nn)
    }
  )
```

We'll again combine the output of `stability_list` into a single data frame and plot `ari` values across `nn` parameterizations.


```{r combine plot stability}
stability_df <- dplyr::bind_rows(stability_list)

ggplot(stability_df) +
  aes(x = as.factor(nn), y = ari, fill = as.factor(nn)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Pastel2") +
  labs(
    x = "Number of nearest neighbors",
    y = "Adjusted Rand Index"
  ) +
  theme(legend.position = "none")
```

Here, we see that a nearest neighbors value of 20 or 30 leads to more stable clustering results compared to 10. 


## Varying multiple clustering parameters

The previous section demonstrated how to calculate clusters and QC metrics when varying one parameter, but it is possible to vary multiple parameters at once with `rOpenScPCA::sweep_clusters()`.
In this section, we'll show an overview of how you might write code to vary two parameters at once (here, nearest neighbors and resolution as examples) and visualize results.


```{r sweep two parameters}
# Define vectors of parameters to vary
nn_values <- seq(10, 30, 10)
res_values <- seq(5, 15, 5)/10


cluster_results_list <- rOpenScPCA::sweep_clusters(
  pca_matrix,
  nn = nn_values,
  resolution = res_values
)
```

This resulting list now has 9 different clustering results, for each combination of `nn_values` and `res_values`:


```{r length cluster_results_list two parameters}
length(cluster_results_list)
```


### Visualize clusters

Next, we'll iterate over `cluster_results_list` to plot the UMAPs.
This time, we'll use `purrr::map()` and pull out parameters from each iteration's `cluster_df` to form the UMAP panel title.

```{r plot nn res umaps, fig.height = 14}
umap_plots <- cluster_results_list |>
  purrr::map(
    \(cluster_df) {

      # Add a column with cluster assignments to umap_df
      umap_df_plot <- umap_df |>
        dplyr::mutate(cluster = cluster_df$cluster)

      # Create a title for the UMAP with both parameters
      umap_title <- glue::glue(
        "nn: {unique(cluster_df$nn)}; res: {unique(cluster_df$resolution)}"
      )

      # Plot the UMAP, colored by the new cluster variable
      ggplot(umap_df_plot, aes(x = UMAP1, y = UMAP2, color = cluster)) +
        geom_point(alpha = 0.6) +
        labs(title = umap_title) +
        # We'll add a couple UMAP-specific plot settings 
        coord_equal() +
        theme(
          axis.ticks = element_blank(),
          axis.text = element_blank(),
          legend.position = "bottom"
        )
  }
)

# Print the plots with patchwork::wrap_plots()
patchwork::wrap_plots(umap_plots)
```



### Calculate and visualize QC metrics

This section presents one coding strategy to calculate and visualize results when varying two clustering parameters.
In particular, we use faceting to help display all information in one plot, by placing nearest neighbor values on the X-axis and faceting by resolution values.
Since silhouette width and neighhorbood purity calculations using generally similar code, we'll just show neighborhood purity here.

#### Neighborhood purity

First, we'll calculate neighborhood purity and combine results into a single data frame.

```{r calculate purity two parameters}
purity_df <- cluster_results_list |>
  purrr::map(
    \(cluster_df) {
      rOpenScPCA::calculate_purity(pca_matrix, cluster_df)
    }
  ) |>
  dplyr::bind_rows()
```


We'll add a column `resolution_label` which we'll use to have informative panel titles in the faceted ggplot we make next.

```{r add resolution_label column}
purity_df <- purity_df |>
  dplyr::mutate(resolution_label = glue::glue("Resolution: {resolution}"))
```

```{r visualize purity two parameters}
ggplot(purity_df) +
  aes(x = as.factor(nn), y = purity, fill = as.factor(nn)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Pastel2") +
  # facet by resolution
  facet_wrap(vars(resolution_label)) +
  labs(
    x = "Number of nearest neighbors",
    y = "Neighborhood purity"
  ) +
  theme(legend.position = "none")
```

### Stability

Similar to above, we'll calculate stability, combine results into a single data frame, add a `resolution_label` column to support plot interpretation, and finally make our plot.

```{r calculate stability two parameters}
stability_df <- cluster_results_list |>
  purrr::map(
    \(cluster_df) {

      # Extract parameters for this clustering result
      nn <- unique(cluster_df$nn)
      resolution <- unique(cluster_df$resolution)

      rOpenScPCA::calculate_stability(
        pca_matrix,
        cluster_df,
        nn = nn,
        resolution = resolution
      )
    }
  ) |>
  dplyr::bind_rows()

stability_df <- stability_df |>
  dplyr::mutate(resolution_label = glue::glue("Resolution: {resolution}"))

```


```{r visualize stability two parameters}
ggplot(stability_df) +
  aes(x = as.factor(nn), y = ari, fill = as.factor(nn)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Pastel2") +
  # facet by resolution
  facet_wrap(vars(resolution_label)) +
  labs(
    x = "Number of nearest neighbors",
    y = "Adjusted Rand Index"
  ) +
  theme(legend.position = "none")
```



## Session Info

```{r session info}
# record the versions of the packages used in this analysis and other environment information
sessionInfo()
```

From 04cf200ebe29252c1c467a4dddea52f4f3e18667 Mon Sep 17 00:00:00 2001 From: Stephanie Spielman Date: Mon, 13 Jan 2025 10:52:45 -0500 Subject: [PATCH 05/16] add notebook to readme, and add missing :: and () --- analyses/hello-clusters/README.md | 20 ++++++++++++-------- 1 file changed, 12 insertions(+), 8 deletions(-) diff --git a/analyses/hello-clusters/README.md b/analyses/hello-clusters/README.md index cc992ab79..b095a73c8 100644 --- a/analyses/hello-clusters/README.md +++ b/analyses/hello-clusters/README.md @@ -9,7 +9,7 @@ The `rOpenScPCA` package provides several functions that leverage the [`bluster` ### Perform clustering -The function `calculate_clusters()` can be used to perform graph-based clustering. +The function `rOpenScPCA::calculate_clusters()` can be used to perform graph-based clustering. By default, this function uses the Louvain algorithm with Jaccard weighting. @@ -17,21 +17,20 @@ By default, this function uses the Louvain algorithm with Jaccard weighting. `rOpenScPCA` contains several functions to calculate quality metrics for a particular clustering result: -- `calculate_silhouette()` +- `rOpenScPCA::calculate_silhouette()` - This function calculates the [silhouette width](https://bioconductor.org/books/3.19/OSCA.advanced/clustering-redux.html#silhouette-width) - Clusters with higher average silhouette widths indicate that clusters are highly-separated from other clusters -- `calculate_purity()` +- `rOpenScPCA::calculate_purity()` - This function calculates [neighborhood purity](https://bioconductor.org/books/3.19/OSCA.advanced/clustering-redux.html#cluster-purity) - Higher purity values indicate that most neighboring cells in a cluster are assigned to the same cluster, therefore indicating good cluster separation -- `calculate_stability()` +- `rOpenScPCA::calculate_stability()` - This function calculates [cluster stability](https://bioconductor.org/books/3.19/OSCA.advanced/clustering-redux.html#cluster-bootstrapping) with Adjusted Rand Index (ARI) and a bootstrapping procedure - Higher stability values indicate that clusters are more robust ### Identify optimal clustering parameters It is often helpful to explore and evaluate results from different clustering algorithms and/or parameters to choose an optimal clustering scheme. -The function `sweep_clusters()` allows you to generate clustering results from a provided set of algorithms and/or parameters, whose quality can then be assessed to select a set of clusters to proceed with. - +The function `rOpenScPCA::sweep_clusters()` allows you to generate clustering results from a provided set of algorithms and/or parameters, whose quality can then be assessed to select a set of clusters to proceed with. ## Installing rOpenScPCA @@ -57,5 +56,10 @@ renv::update("rOpenScPCA") ## Example notebooks 1. The `01_perform-evaluate-clustering.Rmd` notebook shows examples of: - - Performing clustering with `calculate_clusters` - - Evaluating clustering with `calculate_silhouette`, `calculate_purity`, and `calculate_stability` + - Performing clustering with `rOpenScPCA::calculate_clusters()` + - Evaluating clustering with `rOpenScPCA::calculate_silhouette()`, `rOpenScPCA::calculate_purity()`, and `rOpenScPCA::calculate_stability()` +It also contains explanations for how to interpret cluster quality metrics. + +2. The `02_compare-clustering-parameters.Rmd` notebook shows examples of: + - Performing clustering across a set of parameterizations with `rOpenScPCA::sweep_clusters()` + - Comparing and visualizing multiple sets of clustering results From 9885599ff9128c7c3acb84f6be50bec2df7831df Mon Sep 17 00:00:00 2001 From: Stephanie Spielman Date: Mon, 13 Jan 2025 10:53:43 -0500 Subject: [PATCH 06/16] add 2nd notebook to run script --- analyses/hello-clusters/run_hello-clusters.sh | 3 +++ 1 file changed, 3 insertions(+) diff --git a/analyses/hello-clusters/run_hello-clusters.sh b/analyses/hello-clusters/run_hello-clusters.sh index dae74bacb..2a10fc22e 100755 --- a/analyses/hello-clusters/run_hello-clusters.sh +++ b/analyses/hello-clusters/run_hello-clusters.sh @@ -12,3 +12,6 @@ cd ${MODULE_DIR} # Render first notebook Rscript -e "rmarkdown::render('01_perform-evaluate-clustering.Rmd', clean = TRUE)" + +# Render second notebook +Rscript -e "rmarkdown::render('02_compare-clustering-parameters.Rmd', clean = TRUE)" From c50d8df3db4885fd7eaf30d836c12039d220fc86 Mon Sep 17 00:00:00 2001 From: Stephanie Spielman Date: Mon, 13 Jan 2025 11:04:58 -0500 Subject: [PATCH 07/16] we can use docker in the GHA now --- .github/workflows/run_hello-clusters.yml | 22 ++++------------------ 1 file changed, 4 insertions(+), 18 deletions(-) diff --git a/.github/workflows/run_hello-clusters.yml b/.github/workflows/run_hello-clusters.yml index 100c4c07b..e6b5b19f1 100644 --- a/.github/workflows/run_hello-clusters.yml +++ b/.github/workflows/run_hello-clusters.yml @@ -33,29 +33,15 @@ jobs: run-module: if: github.repository_owner == 'AlexsLemonade' runs-on: ubuntu-latest + container: public.ecr.aws/openscpca/hello-clusters:latest + defaults: + run: + shell: bash -el {0} steps: - name: Checkout repo uses: actions/checkout@v4 - - name: Set up R - uses: r-lib/actions/setup-r@v2 - with: - r-version: 4.4.0 - use-public-rspm: true - - - name: Set up pandoc - uses: r-lib/actions/setup-pandoc@v2 - - - name: Install additional system dependencies - run: | - sudo apt-get install -y libcurl4-openssl-dev libglpk40 - - - name: Set up renv - uses: r-lib/actions/setup-renv@v2 - with: - working-directory: ${{ env.MODULE_PATH }} - - name: Download test data run: ./download-data.py --test-data --format SCE --samples SCPCS000001 From c641a3d70b839c78008b9e79262445872efd1dc7 Mon Sep 17 00:00:00 2001 From: Stephanie Spielman Date: Mon, 13 Jan 2025 11:22:05 -0500 Subject: [PATCH 08/16] no conda so need to install aws standalone --- .github/workflows/run_hello-clusters.yml | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/.github/workflows/run_hello-clusters.yml b/.github/workflows/run_hello-clusters.yml index e6b5b19f1..cce1fa22b 100644 --- a/.github/workflows/run_hello-clusters.yml +++ b/.github/workflows/run_hello-clusters.yml @@ -42,6 +42,12 @@ jobs: - name: Checkout repo uses: actions/checkout@v4 + - name: Install aws-cli + run: | + curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" + unzip awscliv2.zip + ./aws/install + - name: Download test data run: ./download-data.py --test-data --format SCE --samples SCPCS000001 From 70a9413c4340d082e1812172e46976056727b243 Mon Sep 17 00:00:00 2001 From: Stephanie Spielman Date: Mon, 13 Jan 2025 11:28:35 -0500 Subject: [PATCH 09/16] keep pandoc --- .github/workflows/run_hello-clusters.yml | 3 +++ 1 file changed, 3 insertions(+) diff --git a/.github/workflows/run_hello-clusters.yml b/.github/workflows/run_hello-clusters.yml index cce1fa22b..33289df4f 100644 --- a/.github/workflows/run_hello-clusters.yml +++ b/.github/workflows/run_hello-clusters.yml @@ -42,6 +42,9 @@ jobs: - name: Checkout repo uses: actions/checkout@v4 + - name: Set up pandoc + uses: r-lib/actions/setup-pandoc@v2 + - name: Install aws-cli run: | curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" From da002e4ebdccf55ca9277e5179516706e9f0c4f6 Mon Sep 17 00:00:00 2001 From: Stephanie Spielman Date: Mon, 13 Jan 2025 11:38:48 -0500 Subject: [PATCH 10/16] swap in new rprojroot line so this runs in CI, and re-render notebooks --- .../01_perform-evaluate-clustering.Rmd | 2 +- .../01_perform-evaluate-clustering.nb.html | 53 ++++++++----------- .../02_compare-clustering-parameters.Rmd | 2 +- .../02_compare-clustering-parameters.nb.html | 22 ++++---- 4 files changed, 35 insertions(+), 44 deletions(-) diff --git a/analyses/hello-clusters/01_perform-evaluate-clustering.Rmd b/analyses/hello-clusters/01_perform-evaluate-clustering.Rmd index 80268a302..adaa3d0e9 100644 --- a/analyses/hello-clusters/01_perform-evaluate-clustering.Rmd +++ b/analyses/hello-clusters/01_perform-evaluate-clustering.Rmd @@ -50,7 +50,7 @@ theme_set(theme_bw()) ```{r base paths} # The base path for the OpenScPCA repository -repository_base <- rprojroot::find_root(rprojroot::is_git_root) +repository_base <- rprojroot::find_root(rprojroot::has_dir(".github")) # The current data directory, found within the repository base directory data_dir <- file.path(repository_base, "data", "current") diff --git a/analyses/hello-clusters/01_perform-evaluate-clustering.nb.html b/analyses/hello-clusters/01_perform-evaluate-clustering.nb.html index bc5d1fd95..a58a5d64b 100644 --- a/analyses/hello-clusters/01_perform-evaluate-clustering.nb.html +++ b/analyses/hello-clusters/01_perform-evaluate-clustering.nb.html @@ -11,7 +11,7 @@ - + Performing graph-based clustering with rOpenScPCA @@ -2901,7 +2901,7 @@

Performing graph-based clustering with rOpenScPCA

Data Lab

-

2024-12-20

+

2025-01-13

@@ -2967,9 +2967,9 @@

Packages

Paths

- +
# The base path for the OpenScPCA repository
-repository_base <- rprojroot::find_root(rprojroot::is_git_root)
+repository_base <- rprojroot::find_root(rprojroot::has_dir(".github"))
 
 # The current data directory, found within the repository base directory
 data_dir <- file.path(repository_base, "data", "current")
@@ -3061,7 +3061,7 @@ 

Clustering with default parameters

@@ -3174,7 +3174,7 @@

Silhouette width

@@ -3190,7 +3190,7 @@

Silhouette width

labs(x = "Cluster", y = "Silhouette width")
-

+

@@ -3227,7 +3227,7 @@

Neighborhood purity

@@ -3243,7 +3243,7 @@

Neighborhood purity

labs(x = "Cluster", y = "Neighborhood purity")
-

+

@@ -3285,7 +3285,7 @@

Cluster stability

@@ -3301,7 +3301,7 @@

Cluster stability

labs(x = "Adjusted rand index across bootstrap replicates")
-

+

@@ -3359,7 +3359,7 @@

Working with objects directly

@@ -3404,31 +3404,24 @@

Evaluating Seurat clusters

FindNeighbors() |> FindClusters()
- -
Warning in irlba(A = t(x = object), nv = npcs, ...): You're computing too large
-a percentage of total singular values, use a standard svd instead.
- - -
Warning: Number of dimensions changing from 10 to 50
- - +
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
 
-Number of nodes: 100
-Number of edges: 4142
+Number of nodes: 2623
+Number of edges: 78853
 
 Running Louvain algorithm...
-Maximum modularity in 10 random starts: 0.2147
-Number of communities: 2
+Maximum modularity in 10 random starts: 0.8478
+Number of communities: 13
 Elapsed time: 0 seconds
seurat_obj
- +
An object of class Seurat 
-126242 features across 100 samples within 3 assays 
-Active assay: SCT (5604 features, 3000 variable features)
+145743 features across 2623 samples within 3 assays 
+Active assay: SCT (25105 features, 3000 variable features)
  3 layers present: counts, data, scale.data
  2 other assays present: RNA, spliced
  2 dimensional reductions calculated: pca, umap
@@ -3451,7 +3444,7 @@

Evaluating Seurat clusters

@@ -3543,7 +3536,7 @@

Evaluating ScPCA clusters

@@ -3776,7 +3769,7 @@

Session Info

-
---
title: "Performing graph-based clustering with rOpenScPCA"
date: "`r Sys.Date()`"
author: "Data Lab"
output:
  html_notebook:
    toc: yes
    toc_float: yes
    df_print: paged
---

## Introduction

This notebook provides examples of how to use functions in `rOpenScPCA` that:

* Perform clustering
* Calculate QC metrics on clusters, including:
  * Silhouette width
  * Neighborhood purity
  * Cluster stability, as measured with the Adjusted Rand Index
* Calculate QC metrics on clusters obtained with other tools, such as `Seurat`
* Save clustering results to an SCE or `Seurat`

While this notebook demonstrates how to use individual functions that calculate helpful metrics for evaluating clustering results, a full evaluation would compare these metrics across different clusterings from different parameterizations.

This notebook will use the sample `SCPCS000001` from project `SCPCP000001`, which is assumed present in the `OpenScPCA-analysis/data/current/SCPCP000001` directory, for all examples.
Please [see this documentation](https://openscpca.readthedocs.io/en/latest/getting-started/accessing-resources/getting-access-to-data/) for more information about obtaining ScPCA data.

## Setup

### Packages


```{r packages}
library(rOpenScPCA)

suppressPackageStartupMessages({
  library(SingleCellExperiment)
  library(Seurat)
  library(dplyr)
  library(ggplot2)
})

# Set ggplot theme for plots
theme_set(theme_bw())
```


### Paths

```{r base paths}
# The base path for the OpenScPCA repository
repository_base <- rprojroot::find_root(rprojroot::is_git_root)

# The current data directory, found within the repository base directory
data_dir <- file.path(repository_base, "data", "current")

# The path to this module
module_base <- file.path(repository_base, "analyses", "hello-clusters")
```

```{r input file path}
# Path to processed SCE file for sample SCPCS000001
input_sce_file <- file.path(data_dir, "SCPCP000001", "SCPCS000001", "SCPCL000001_processed.rds")
```


### Set the random seed

Because clustering involves random sampling, it is important to set the random seed at the top of your analysis script or notebook to ensure reproducibility.

```{r set seed}
set.seed(2024)
```

## Read in and prepare data

To begin, we'll read in the `SingleCellExperiment` (SCE) object.

```{r read data}
# Read the SCE file
sce <- readRDS(input_sce_file)
```

For the initial cluster calculations and evaluations, we will use the PCA matrix extracted from the SCE object.
It's also possible to use an SCE object or a Seurat object directly, which we will demonstrate later.


```{r extract pca data}
# Extract the PCA matrix from an SCE object
pca_matrix <- reducedDim(sce, "PCA")
```

## Perform clustering

This section will show how to perform clustering with the function `rOpenScPCA::calculate_clusters()`.

This function takes a PCA matrix with rownames representing unique cell ids (e.g., barcodes) as its primary argument.
By default it will calculate clusters using the following parameters:

* Louvain algorithm
* Jaccard weighting
* 10 nearest neighbors
* A resolution parameter of 1

This function will return a table with the following columns:

* `cell_id`: Unique cell identifiers, obtained from the PCA matrix's row names
* `cluster`: A factor column with the cluster identities
* There will be one column for each clustering parameter used


### Clustering with default parameters

```{r cluster sce}
# Calculate clusters with default parameters
cluster_results_df <- rOpenScPCA::calculate_clusters(pca_matrix)

# Print the first rows of the resulting table
head(cluster_results_df)
```

### Clustering with non-default parameters

Parameters used for clustering can be customized with these arguments:

* The `algorithm` can be one of:
  * `louvain`, `walktrap`, or `leiden`
* The `weighting` can be one of:
  * `jaccard`, `rank`, or `number`
* The nearest neighbors parameter can be customized with the `nn` argument
* The resolution parameter can be customized with the `resolution` argument
  * This parameter is only used by Louvain and Leiden algorithms
* If the Leiden algorithm is used, its default objective function parameter will be `CPM`, but you can also set  `objective_function = "modularity"` instead.
* You can provide additional parameters as a list to the `cluster_args` argument.
  * Please refer to the [`igraph` documentation](https://igraph.org/r/html/latest) to learn more about what additional parameters can be provided to each clustering algorithm.
  * Note that `cluster_args` only accepts single-length arguments (no vectors or lists).

For example:

```{r cluster sce nondefault}
# Calculate clusters with non-default parameters
cluster_results_df <- rOpenScPCA::calculate_clusters(
  pca_matrix,
  algorithm = "leiden",
  nn = 15,
  objective_function = "modularity"
)
```


## Calculate QC metrics on clusters

This section demonstrates how to use several functions for evaluating cluster quality and reliability.
It's important to note that a full evaluation of clustering results would compare these metrics across a set of clustering results, with the aim of identifying an optimal parameterization.

All functions presented in this section take the following required arguments:

* A PCA matrix with row names representing unique cell ids (e.g., barcodes)
* A data frame with, at least, columns representing unique cell ids and cluster assignments
  * By default, these columns should be named `cell_id` and `cluster`, respectively, matching the output of `rOpenScPCA::calculate_clusters()`
  * You can override these defaults using the arguments `cell_id_col` and `cluster_col`

### Silhouette width

Silhouette width is a common metric that measures how well separated clusters are by, for each cell, comparing the average distance to all cells in the same cluster, and all cells in other clusters.
This value ranges from -1 to 1.
Cells in well-separated clusters should have high silhouette values closer to 1.
You can read more about silhouette width purity from the [_Orchestrating Single Cell Analysis with Bioconductor_ book](https://bioconductor.org/books/3.19/OSCA.advanced/clustering-redux.html#silhouette-width).

We'll use the function `rOpenScPCA::calculate_silhouette()` to calculate the silhouette width.

This function will return the inputted data frame with two additional columns:

* `silhouette_width`: The calculated silhouette width for the cell
* `silhouette_other`: The closet cluster to the cell besides the cluster to which it belongs, as used in the silhouette width calculation


```{r silhouette}
# calculate the silhouette width for each cell
silhouette_results <- rOpenScPCA::calculate_silhouette(
  pca_matrix,
  cluster_results_df
)

# Print the first rows of the resulting table
head(silhouette_results)
```


We can visualize these results by plotting silhouette width across clusters as violin plots, for example:

```{r violin silhouette}
ggplot(silhouette_results) +
  aes(x = cluster, y = silhouette_width) +
  geom_violin(fill = "darkmagenta") +
  labs(x = "Cluster", y = "Silhouette width")
```

### Neighborhood purity

Neighborhood purity is defined, for each cell, as the proportion of neighboring cells that are assigned to the same cluster.
This value ranges from 0 to 1.
Cells in well-separated clusters should have high purity values closer to 1, since there should be minimal overlap between member and neighboring cells.
You can read more about neighborhood purity from the [_Orchestrating Single Cell Analysis with Bioconductor_ book](https://bioconductor.org/books/3.19/OSCA.advanced/clustering-redux.html#cluster-purity).

We'll use the function `rOpenScPCA::calculate_purity()` to calculate the neighborhood purity.

This function will return the inputted data frame with two additional columns:

* `purity`: The neighborhood purity for the cell
* `maximum_neighbor`: The cluster with the highest proportion of observations neighboring the cell


```{r purity}
# calculate the neighborhood purity for each cell
purity_results <- rOpenScPCA::calculate_purity(
  pca_matrix,
  cluster_results_df
)

# Print the first rows of the resulting table
head(purity_results)
```


We can visualize these results by plotting purity clusters as violin plots, for example:

```{r violin purity}
ggplot(purity_results) +
  aes(x = cluster, y = purity) +
  geom_violin(fill = "darkolivegreen3") +
  labs(x = "Cluster", y = "Neighborhood purity")
```

### Cluster stability

Another approach to exploring cluster quality is how stable the clusters themselves are using bootstrapping.
Given a set of original clusters, we can compare the bootstrapped cluster identities to original ones using the Adjusted Rand Index (ARI), which measures the similarity of two data clusterings.
ARI ranges from -1 to 1, where:

* A value of 1 indicates they are completely overlapping
* A value of -1 indicates they are completely distinct
* A value of 0 indicates a random relationship

We expect that highly stable clusterings have ARI values closer to 1 across a set of bootstrap replicates.

You can read more about the Adjusted Rand Index from the [_Orchestrating Single Cell Analysis with Bioconductor_ book](https://bioconductor.org/books/release/OSCA.advanced/clustering-redux.html#adjusted-rand-index).

We'll use the function `rOpenScPCA::calculate_stability()` to calculate the cluster stability.
By default, this function performs 20 bootstrap replicates, but this can be customized using the argument `replicates`.

This function will return a data frame with columns `replicate`, `ari`, and additional columns for the clustering parameters used when calculating bootstrap clusters.

```{r stability, warning=FALSE}
# calculate the stability of clusters
stability_results <- rOpenScPCA::calculate_stability(
  pca_matrix,
  cluster_results_df
)

# print the result
stability_results
```

We can visualize these results by plotting stability as a density plot, for example:

```{r ari density}
ggplot(stability_results) +
  aes(x = ari) +
  geom_density(color = "grey30", fill = "lightslateblue") +
  labs(x = "Adjusted rand index across bootstrap replicates")
```


#### Using non-default clustering parameters

When calculating bootstrap clusters, `rOpenScPCA::calculate_stability()` uses `rOpenScPCA::calculate_clusters()` with default parameters.
If your original clusters were not calculated with these defaults, you should pass those customized values into this function as well to ensure a fair comparison between your original clusters and the bootstrap clusters.


```{r stability custom parameters}
# Calculate clusters with non-default parameters
cluster_df_leiden <- rOpenScPCA::calculate_clusters(
  pca_matrix,
  algorithm = "leiden",
  resolution = 0.5,
  nn = 15
)

# Now, pass in the same arguments customizing parameters here
stability_results_leiden <- rOpenScPCA::calculate_stability(
  pca_matrix,
  cluster_df_leiden,
  algorithm = "leiden",
  resolution = 0.5,
  nn = 15
)
```


## Working with objects directly

As presented above, `rOpenScPCA` clustering functions take a PCA matrix with row names representing unique cell ids as their first argument.

Instead of a matrix, you can alternatively pass in an SCE or Seurat object that contains a matrix.

We show an example of this below with and SCE object and `rOpenScPCA::calculate_clusters()`, but this will also work for any of the evaluation functions as well and has the same syntax for Seurat objects.

```{r run on sce}
# Calculate clusters from an SCE object using default parameters
cluster_results_df <- rOpenScPCA::calculate_clusters(sce)
cluster_results_df
```


`rOpenScPCA` assumes that the PCA matrix is named `PCA` in SCE objects, and `pca` in Seurat objects.
If the PCA matrix you want to use in the object has a different name, you can provide the argument `pc_name`.


## Calculating QC metrics on existing clusters

If you already have clustering results calculated with other tools, you can still use the `rOpenScPCA` functions to evaluate your clusters.

In this section, we'll present examples of how you can calculate the silhouette width, neighborhood purity, and cluster stability from existing cluster assignments within objects.

### Evaluating Seurat clusters

If you are analyzing your data with a Seurat pipeline that includes calculating clusters, you can use `rOpenScPCA` to evaluate them.

To demonstrate this, we'll convert our SCE object to a Seurat using the function `rOpenScPCA::sce_to_seurat()`.
Then, we'll use a simple Seurat pipeline to obtain clusters.

```{r sce to seurat, message = FALSE}
# Convert the SCE to a Seurat object using rOpenScPCA
seurat_obj <- rOpenScPCA::sce_to_seurat(sce)

# Calculate clusters with Seurat using a standard Seurat pipeline, for example
seurat_obj <- seurat_obj |>
  SCTransform() |>
  RunPCA() |>
  FindNeighbors() |>
  FindClusters()

seurat_obj
```


To calculate QC metrics on these clusters, we'll need to create a data frame with columns `cell_id` and `cluster`:

```{r prepare seurat input}
# Create a data frame for input
seurat_cluster_df <- data.frame(
  cell_id = colnames(seurat_obj),
  cluster = seurat_obj$seurat_clusters
)

head(seurat_cluster_df)
```

Now, we can run `rOpenScPCA::calculate_silhouette()` and `rOpenScPCA::calculate_purity()` using this data frame and the Seurat object:

```{r seurat silhouette}
seurat_silhouette_df <- rOpenScPCA::calculate_silhouette(
  seurat_obj,
  seurat_cluster_df
)
```

```{r seurat purity}
seurat_purity_df <- rOpenScPCA::calculate_purity(
  seurat_obj,
  seurat_cluster_df
)
```

We do not recommend using `rOpenScPCA::calculate_stability()` on Seurat clusters due to differences in the underlying clustering approach between Seurat and the `bluster` package which `rOpenScPCA` uses.

### Evaluating ScPCA clusters

ScPCA cell metadata already contains a column called `cluster` with results from an automated clustering.
These clusters were calculated using `bluster`, the same tool that `rOpenScPCA` uses.
The specifications used for this clustering are stored in the SCE object's metadata, as follows; note that all other clustering parameters were left at their default values.

* `metadata(sce)$cluster_algorithm`: The clustering algorithm used
* `metadata(sce)$cluster_weighting`: The weighting scheme used
* `metadata(sce)$cluster_nn`: The number of nearest neighbors used

You can see all their values here:


```{r extract cluster params}
# Print clustering specifications
metadata(sce)[c("cluster_algorithm", "cluster_weighting", "cluster_nn")]
```


In this example, we'll show how to use the cluster evaluation functions on these clusters.

To begin, we'll prepare a data frame with two columns: `cell_id` containing cell barcodes, and `cluster` containing the cluster identities.

```{r prepare scpca data frame}
scpca_cluster_df <- data.frame(
  cell_id = colnames(sce),
  cluster = sce$cluster
)

head(scpca_cluster_df)
```

We can run evaluation functions using this data frame and the SCE object.

```{r scpca silhouette}
# Calculate silhouette width
scpca_silhouette_df <- rOpenScPCA::calculate_silhouette(
  sce,
  scpca_cluster_df
)
```

```{r scpca purity}
# Calculate neighborhood purity
scpca_purity_df <- rOpenScPCA::calculate_purity(
  sce,
  scpca_cluster_df
)
```

When running `rOpenScPCA::calculate_stability()`, we'll specify the same parameters originally used to build the clusters by extracting them from the metadata.
We'll need to ensure the provided arguments are lowercase, as well.

Generally speaking, we only recommend evaluating clusters with `rOpenScPCA::calculate_stability()` if you know the original parameters used.


```{r scpca stability}
scpca_stability_df <- rOpenScPCA::calculate_stability(
  sce,
  scpca_cluster_df,
  # provide ScPCA clustering parameters by extracting from the SCE metadata
  algorithm = tolower(metadata(sce)$cluster_algorithm),
  weighting = tolower(metadata(sce)$cluster_weighting),
  nn = metadata(sce)$cluster_nn
)
```


## Saving clustering results

Results can either be directly exported as a TSV file (e.g., with `readr::write_tsv()`), or you can add the results into your SCE or Seurat object.
The subsequent examples will demonstrate saving the cluster assignments stored in `cluster_results_df$cluster` to an SCE and a Seurat object.

_A word of caution!_
Objects from the ScPCA Portal already contain a column called `cluster` with results from an automated clustering.
These automatic clusters were not evaluated, and their parameters were not optimized for any given library.
To avoid ambiguity between the existing and new clustering results, we'll name the new column `ropenscpca_cluster`.

### Saving results to an SCE object

We can add columns to an SCE object's `colData` table by directly creating a column in the object with `$`.
Before we do so, we'll confirm that the clusters are in the same order as the SCE object by comparing cell ids:

```{r check sce order}
all.equal(
  colnames(sce),
  cluster_results_df$cell_id
)
```

```{r add to sce}
# Add cluster results to the colData
sce$ropenscpca_cluster <- cluster_results_df$cluster
```

### Saving results to a Seurat object


We can add columns to an Seurat object's cell metadata table by directly creating a column in the object with `$` (note that you can also use the Seurat function `AddMetaData()`).
Before we do so, we'll confirm that the clusters are in the same order as the Seurat object by comparing cell ids:


```{r check seurat order}
all.equal(
  colnames(seurat_obj),
  cluster_results_df$cell_id
)
```

```{r add to seurat}
# Add cluster results to the cell metadata
seurat_obj$ropenscpca_cluster <- cluster_results_df$cluster
```


## Session Info

```{r session info}
# record the versions of the packages used in this analysis and other environment information
sessionInfo()
```

+
---
title: "Performing graph-based clustering with rOpenScPCA"
date: "`r Sys.Date()`"
author: "Data Lab"
output:
  html_notebook:
    toc: yes
    toc_float: yes
    df_print: paged
---

## Introduction

This notebook provides examples of how to use functions in `rOpenScPCA` that:

* Perform clustering
* Calculate QC metrics on clusters, including:
  * Silhouette width
  * Neighborhood purity
  * Cluster stability, as measured with the Adjusted Rand Index
* Calculate QC metrics on clusters obtained with other tools, such as `Seurat`
* Save clustering results to an SCE or `Seurat`

While this notebook demonstrates how to use individual functions that calculate helpful metrics for evaluating clustering results, a full evaluation would compare these metrics across different clusterings from different parameterizations.

This notebook will use the sample `SCPCS000001` from project `SCPCP000001`, which is assumed present in the `OpenScPCA-analysis/data/current/SCPCP000001` directory, for all examples.
Please [see this documentation](https://openscpca.readthedocs.io/en/latest/getting-started/accessing-resources/getting-access-to-data/) for more information about obtaining ScPCA data.

## Setup

### Packages


```{r packages}
library(rOpenScPCA)

suppressPackageStartupMessages({
  library(SingleCellExperiment)
  library(Seurat)
  library(dplyr)
  library(ggplot2)
})

# Set ggplot theme for plots
theme_set(theme_bw())
```


### Paths

```{r base paths}
# The base path for the OpenScPCA repository
repository_base <- rprojroot::find_root(rprojroot::has_dir(".github"))

# The current data directory, found within the repository base directory
data_dir <- file.path(repository_base, "data", "current")

# The path to this module
module_base <- file.path(repository_base, "analyses", "hello-clusters")
```

```{r input file path}
# Path to processed SCE file for sample SCPCS000001
input_sce_file <- file.path(data_dir, "SCPCP000001", "SCPCS000001", "SCPCL000001_processed.rds")
```


### Set the random seed

Because clustering involves random sampling, it is important to set the random seed at the top of your analysis script or notebook to ensure reproducibility.

```{r set seed}
set.seed(2024)
```

## Read in and prepare data

To begin, we'll read in the `SingleCellExperiment` (SCE) object.

```{r read data}
# Read the SCE file
sce <- readRDS(input_sce_file)
```

For the initial cluster calculations and evaluations, we will use the PCA matrix extracted from the SCE object.
It's also possible to use an SCE object or a Seurat object directly, which we will demonstrate later.


```{r extract pca data}
# Extract the PCA matrix from an SCE object
pca_matrix <- reducedDim(sce, "PCA")
```

## Perform clustering

This section will show how to perform clustering with the function `rOpenScPCA::calculate_clusters()`.

This function takes a PCA matrix with rownames representing unique cell ids (e.g., barcodes) as its primary argument.
By default it will calculate clusters using the following parameters:

* Louvain algorithm
* Jaccard weighting
* 10 nearest neighbors
* A resolution parameter of 1

This function will return a table with the following columns:

* `cell_id`: Unique cell identifiers, obtained from the PCA matrix's row names
* `cluster`: A factor column with the cluster identities
* There will be one column for each clustering parameter used


### Clustering with default parameters

```{r cluster sce}
# Calculate clusters with default parameters
cluster_results_df <- rOpenScPCA::calculate_clusters(pca_matrix)

# Print the first rows of the resulting table
head(cluster_results_df)
```

### Clustering with non-default parameters

Parameters used for clustering can be customized with these arguments:

* The `algorithm` can be one of:
  * `louvain`, `walktrap`, or `leiden`
* The `weighting` can be one of:
  * `jaccard`, `rank`, or `number`
* The nearest neighbors parameter can be customized with the `nn` argument
* The resolution parameter can be customized with the `resolution` argument
  * This parameter is only used by Louvain and Leiden algorithms
* If the Leiden algorithm is used, its default objective function parameter will be `CPM`, but you can also set  `objective_function = "modularity"` instead.
* You can provide additional parameters as a list to the `cluster_args` argument.
  * Please refer to the [`igraph` documentation](https://igraph.org/r/html/latest) to learn more about what additional parameters can be provided to each clustering algorithm.
  * Note that `cluster_args` only accepts single-length arguments (no vectors or lists).

For example:

```{r cluster sce nondefault}
# Calculate clusters with non-default parameters
cluster_results_df <- rOpenScPCA::calculate_clusters(
  pca_matrix,
  algorithm = "leiden",
  nn = 15,
  objective_function = "modularity"
)
```


## Calculate QC metrics on clusters

This section demonstrates how to use several functions for evaluating cluster quality and reliability.
It's important to note that a full evaluation of clustering results would compare these metrics across a set of clustering results, with the aim of identifying an optimal parameterization.

All functions presented in this section take the following required arguments:

* A PCA matrix with row names representing unique cell ids (e.g., barcodes)
* A data frame with, at least, columns representing unique cell ids and cluster assignments
  * By default, these columns should be named `cell_id` and `cluster`, respectively, matching the output of `rOpenScPCA::calculate_clusters()`
  * You can override these defaults using the arguments `cell_id_col` and `cluster_col`

### Silhouette width

Silhouette width is a common metric that measures how well separated clusters are by, for each cell, comparing the average distance to all cells in the same cluster, and all cells in other clusters.
This value ranges from -1 to 1.
Cells in well-separated clusters should have high silhouette values closer to 1.
You can read more about silhouette width purity from the [_Orchestrating Single Cell Analysis with Bioconductor_ book](https://bioconductor.org/books/3.19/OSCA.advanced/clustering-redux.html#silhouette-width).

We'll use the function `rOpenScPCA::calculate_silhouette()` to calculate the silhouette width.

This function will return the inputted data frame with two additional columns:

* `silhouette_width`: The calculated silhouette width for the cell
* `silhouette_other`: The closet cluster to the cell besides the cluster to which it belongs, as used in the silhouette width calculation


```{r silhouette}
# calculate the silhouette width for each cell
silhouette_results <- rOpenScPCA::calculate_silhouette(
  pca_matrix,
  cluster_results_df
)

# Print the first rows of the resulting table
head(silhouette_results)
```


We can visualize these results by plotting silhouette width across clusters as violin plots, for example:

```{r violin silhouette}
ggplot(silhouette_results) +
  aes(x = cluster, y = silhouette_width) +
  geom_violin(fill = "darkmagenta") +
  labs(x = "Cluster", y = "Silhouette width")
```

### Neighborhood purity

Neighborhood purity is defined, for each cell, as the proportion of neighboring cells that are assigned to the same cluster.
This value ranges from 0 to 1.
Cells in well-separated clusters should have high purity values closer to 1, since there should be minimal overlap between member and neighboring cells.
You can read more about neighborhood purity from the [_Orchestrating Single Cell Analysis with Bioconductor_ book](https://bioconductor.org/books/3.19/OSCA.advanced/clustering-redux.html#cluster-purity).

We'll use the function `rOpenScPCA::calculate_purity()` to calculate the neighborhood purity.

This function will return the inputted data frame with two additional columns:

* `purity`: The neighborhood purity for the cell
* `maximum_neighbor`: The cluster with the highest proportion of observations neighboring the cell


```{r purity}
# calculate the neighborhood purity for each cell
purity_results <- rOpenScPCA::calculate_purity(
  pca_matrix,
  cluster_results_df
)

# Print the first rows of the resulting table
head(purity_results)
```


We can visualize these results by plotting purity clusters as violin plots, for example:

```{r violin purity}
ggplot(purity_results) +
  aes(x = cluster, y = purity) +
  geom_violin(fill = "darkolivegreen3") +
  labs(x = "Cluster", y = "Neighborhood purity")
```

### Cluster stability

Another approach to exploring cluster quality is how stable the clusters themselves are using bootstrapping.
Given a set of original clusters, we can compare the bootstrapped cluster identities to original ones using the Adjusted Rand Index (ARI), which measures the similarity of two data clusterings.
ARI ranges from -1 to 1, where:

* A value of 1 indicates they are completely overlapping
* A value of -1 indicates they are completely distinct
* A value of 0 indicates a random relationship

We expect that highly stable clusterings have ARI values closer to 1 across a set of bootstrap replicates.

You can read more about the Adjusted Rand Index from the [_Orchestrating Single Cell Analysis with Bioconductor_ book](https://bioconductor.org/books/release/OSCA.advanced/clustering-redux.html#adjusted-rand-index).

We'll use the function `rOpenScPCA::calculate_stability()` to calculate the cluster stability.
By default, this function performs 20 bootstrap replicates, but this can be customized using the argument `replicates`.

This function will return a data frame with columns `replicate`, `ari`, and additional columns for the clustering parameters used when calculating bootstrap clusters.

```{r stability, warning=FALSE}
# calculate the stability of clusters
stability_results <- rOpenScPCA::calculate_stability(
  pca_matrix,
  cluster_results_df
)

# print the result
stability_results
```

We can visualize these results by plotting stability as a density plot, for example:

```{r ari density}
ggplot(stability_results) +
  aes(x = ari) +
  geom_density(color = "grey30", fill = "lightslateblue") +
  labs(x = "Adjusted rand index across bootstrap replicates")
```


#### Using non-default clustering parameters

When calculating bootstrap clusters, `rOpenScPCA::calculate_stability()` uses `rOpenScPCA::calculate_clusters()` with default parameters.
If your original clusters were not calculated with these defaults, you should pass those customized values into this function as well to ensure a fair comparison between your original clusters and the bootstrap clusters.


```{r stability custom parameters}
# Calculate clusters with non-default parameters
cluster_df_leiden <- rOpenScPCA::calculate_clusters(
  pca_matrix,
  algorithm = "leiden",
  resolution = 0.5,
  nn = 15
)

# Now, pass in the same arguments customizing parameters here
stability_results_leiden <- rOpenScPCA::calculate_stability(
  pca_matrix,
  cluster_df_leiden,
  algorithm = "leiden",
  resolution = 0.5,
  nn = 15
)
```


## Working with objects directly

As presented above, `rOpenScPCA` clustering functions take a PCA matrix with row names representing unique cell ids as their first argument.

Instead of a matrix, you can alternatively pass in an SCE or Seurat object that contains a matrix.

We show an example of this below with and SCE object and `rOpenScPCA::calculate_clusters()`, but this will also work for any of the evaluation functions as well and has the same syntax for Seurat objects.

```{r run on sce}
# Calculate clusters from an SCE object using default parameters
cluster_results_df <- rOpenScPCA::calculate_clusters(sce)
cluster_results_df
```


`rOpenScPCA` assumes that the PCA matrix is named `PCA` in SCE objects, and `pca` in Seurat objects.
If the PCA matrix you want to use in the object has a different name, you can provide the argument `pc_name`.


## Calculating QC metrics on existing clusters

If you already have clustering results calculated with other tools, you can still use the `rOpenScPCA` functions to evaluate your clusters.

In this section, we'll present examples of how you can calculate the silhouette width, neighborhood purity, and cluster stability from existing cluster assignments within objects.

### Evaluating Seurat clusters

If you are analyzing your data with a Seurat pipeline that includes calculating clusters, you can use `rOpenScPCA` to evaluate them.

To demonstrate this, we'll convert our SCE object to a Seurat using the function `rOpenScPCA::sce_to_seurat()`.
Then, we'll use a simple Seurat pipeline to obtain clusters.

```{r sce to seurat, message = FALSE}
# Convert the SCE to a Seurat object using rOpenScPCA
seurat_obj <- rOpenScPCA::sce_to_seurat(sce)

# Calculate clusters with Seurat using a standard Seurat pipeline, for example
seurat_obj <- seurat_obj |>
  SCTransform() |>
  RunPCA() |>
  FindNeighbors() |>
  FindClusters()

seurat_obj
```


To calculate QC metrics on these clusters, we'll need to create a data frame with columns `cell_id` and `cluster`:

```{r prepare seurat input}
# Create a data frame for input
seurat_cluster_df <- data.frame(
  cell_id = colnames(seurat_obj),
  cluster = seurat_obj$seurat_clusters
)

head(seurat_cluster_df)
```

Now, we can run `rOpenScPCA::calculate_silhouette()` and `rOpenScPCA::calculate_purity()` using this data frame and the Seurat object:

```{r seurat silhouette}
seurat_silhouette_df <- rOpenScPCA::calculate_silhouette(
  seurat_obj,
  seurat_cluster_df
)
```

```{r seurat purity}
seurat_purity_df <- rOpenScPCA::calculate_purity(
  seurat_obj,
  seurat_cluster_df
)
```

We do not recommend using `rOpenScPCA::calculate_stability()` on Seurat clusters due to differences in the underlying clustering approach between Seurat and the `bluster` package which `rOpenScPCA` uses.

### Evaluating ScPCA clusters

ScPCA cell metadata already contains a column called `cluster` with results from an automated clustering.
These clusters were calculated using `bluster`, the same tool that `rOpenScPCA` uses.
The specifications used for this clustering are stored in the SCE object's metadata, as follows; note that all other clustering parameters were left at their default values.

* `metadata(sce)$cluster_algorithm`: The clustering algorithm used
* `metadata(sce)$cluster_weighting`: The weighting scheme used
* `metadata(sce)$cluster_nn`: The number of nearest neighbors used

You can see all their values here:


```{r extract cluster params}
# Print clustering specifications
metadata(sce)[c("cluster_algorithm", "cluster_weighting", "cluster_nn")]
```


In this example, we'll show how to use the cluster evaluation functions on these clusters.

To begin, we'll prepare a data frame with two columns: `cell_id` containing cell barcodes, and `cluster` containing the cluster identities.

```{r prepare scpca data frame}
scpca_cluster_df <- data.frame(
  cell_id = colnames(sce),
  cluster = sce$cluster
)

head(scpca_cluster_df)
```

We can run evaluation functions using this data frame and the SCE object.

```{r scpca silhouette}
# Calculate silhouette width
scpca_silhouette_df <- rOpenScPCA::calculate_silhouette(
  sce,
  scpca_cluster_df
)
```

```{r scpca purity}
# Calculate neighborhood purity
scpca_purity_df <- rOpenScPCA::calculate_purity(
  sce,
  scpca_cluster_df
)
```

When running `rOpenScPCA::calculate_stability()`, we'll specify the same parameters originally used to build the clusters by extracting them from the metadata.
We'll need to ensure the provided arguments are lowercase, as well.

Generally speaking, we only recommend evaluating clusters with `rOpenScPCA::calculate_stability()` if you know the original parameters used.


```{r scpca stability}
scpca_stability_df <- rOpenScPCA::calculate_stability(
  sce,
  scpca_cluster_df,
  # provide ScPCA clustering parameters by extracting from the SCE metadata
  algorithm = tolower(metadata(sce)$cluster_algorithm),
  weighting = tolower(metadata(sce)$cluster_weighting),
  nn = metadata(sce)$cluster_nn
)
```


## Saving clustering results

Results can either be directly exported as a TSV file (e.g., with `readr::write_tsv()`), or you can add the results into your SCE or Seurat object.
The subsequent examples will demonstrate saving the cluster assignments stored in `cluster_results_df$cluster` to an SCE and a Seurat object.

_A word of caution!_
Objects from the ScPCA Portal already contain a column called `cluster` with results from an automated clustering.
These automatic clusters were not evaluated, and their parameters were not optimized for any given library.
To avoid ambiguity between the existing and new clustering results, we'll name the new column `ropenscpca_cluster`.

### Saving results to an SCE object

We can add columns to an SCE object's `colData` table by directly creating a column in the object with `$`.
Before we do so, we'll confirm that the clusters are in the same order as the SCE object by comparing cell ids:

```{r check sce order}
all.equal(
  colnames(sce),
  cluster_results_df$cell_id
)
```

```{r add to sce}
# Add cluster results to the colData
sce$ropenscpca_cluster <- cluster_results_df$cluster
```

### Saving results to a Seurat object


We can add columns to an Seurat object's cell metadata table by directly creating a column in the object with `$` (note that you can also use the Seurat function `AddMetaData()`).
Before we do so, we'll confirm that the clusters are in the same order as the Seurat object by comparing cell ids:


```{r check seurat order}
all.equal(
  colnames(seurat_obj),
  cluster_results_df$cell_id
)
```

```{r add to seurat}
# Add cluster results to the cell metadata
seurat_obj$ropenscpca_cluster <- cluster_results_df$cluster
```


## Session Info

```{r session info}
# record the versions of the packages used in this analysis and other environment information
sessionInfo()
```

diff --git a/analyses/hello-clusters/02_compare-clustering-parameters.Rmd b/analyses/hello-clusters/02_compare-clustering-parameters.Rmd index 1133ce947..dab142325 100644 --- a/analyses/hello-clusters/02_compare-clustering-parameters.Rmd +++ b/analyses/hello-clusters/02_compare-clustering-parameters.Rmd @@ -50,7 +50,7 @@ theme_set(theme_bw()) ```{r base paths} # The base path for the OpenScPCA repository -repository_base <- rprojroot::find_root(rprojroot::is_git_root) +repository_base <- rprojroot::find_root(rprojroot::has_dir(".github")) # The current data directory, found within the repository base directory data_dir <- file.path(repository_base, "data", "current") diff --git a/analyses/hello-clusters/02_compare-clustering-parameters.nb.html b/analyses/hello-clusters/02_compare-clustering-parameters.nb.html index b29b98fde..212e116c5 100644 --- a/analyses/hello-clusters/02_compare-clustering-parameters.nb.html +++ b/analyses/hello-clusters/02_compare-clustering-parameters.nb.html @@ -2967,9 +2967,9 @@

Packages

Paths

- +
# The base path for the OpenScPCA repository
-repository_base <- rprojroot::find_root(rprojroot::is_git_root)
+repository_base <- rprojroot::find_root(rprojroot::has_dir(".github"))
 
 # The current data directory, found within the repository base directory
 data_dir <- file.path(repository_base, "data", "current")
@@ -3408,10 +3408,10 @@ 

Varying multiple clustering parameters

examples) and visualize results.

- +
# Define vectors of parameters to vary
 nn_values <- seq(10, 30, 10)
-res_values <- seq(5, 15, 5)/10
+res_values <- seq(5, 15, 5) / 10
 
 
 cluster_results_list <- rOpenScPCA::sweep_clusters(
@@ -3442,11 +3442,10 @@ 

Visualize clusters

UMAP panel title.

- +
umap_plots <- cluster_results_list |>
   purrr::map(
     \(cluster_df) {
-
       # Add a column with cluster assignments to umap_df
       umap_df_plot <- umap_df |>
         dplyr::mutate(cluster = cluster_df$cluster)
@@ -3460,15 +3459,15 @@ 

Visualize clusters

ggplot(umap_df_plot, aes(x = UMAP1, y = UMAP2, color = cluster)) + geom_point(alpha = 0.6) + labs(title = umap_title) + - # We'll add a couple UMAP-specific plot settings + # We'll add a couple UMAP-specific plot settings coord_equal() + theme( axis.ticks = element_blank(), axis.text = element_blank(), legend.position = "bottom" ) - } -) + } + ) # Print the plots with patchwork::wrap_plots() patchwork::wrap_plots(umap_plots)
@@ -3543,11 +3542,10 @@

Stability

plot interpretation, and finally make our plot.

- +
stability_df <- cluster_results_list |>
   purrr::map(
     \(cluster_df) {
-
       # Extract parameters for this clustering result
       nn <- unique(cluster_df$nn)
       resolution <- unique(cluster_df$resolution)
@@ -3653,7 +3651,7 @@ 

Session Info

-
---
title: "Comparing clustering parameters with rOpenScPCA"
date: "`r Sys.Date()`"
author: "Data Lab"
output:
  html_notebook:
    toc: yes
    toc_float: yes
    df_print: paged
---

## Introduction

Clustering algorithms have several parameters which can be varied, leading to different clustering results.
A key question when clustering, therefore, is how to identify a set of parameters that lead to robust and reliable clusters that can be used in downstream analysis. 

This notebook provides examples of how to use the `rOpenScPCA` package to:

* Calculate several versions of clustering results across several different parameterizations
* Calculate QC metrics on across clustering results

Please refer to the [`01_perform-evaluate-clustering.Rmd`](01_perform-evaluate-clustering.Rmd) notebook for a tutorial on using `rOpenScPCA` functions to:

* Calculate clusters from a single parameterization
* Calculate QC metrics on a single set of clusters, as well as explanations of the metrics themselves

This notebook will use the sample `SCPCS000001` from project `SCPCP000001`, which is assumed present in the `OpenScPCA-analysis/data/current/SCPCP000001` directory, for all examples.
Please [see this documentation](https://openscpca.readthedocs.io/en/latest/getting-started/accessing-resources/getting-access-to-data/) for more information about obtaining ScPCA data.

## Setup

### Packages


```{r packages}
library(rOpenScPCA)

suppressPackageStartupMessages({
  library(SingleCellExperiment)
  library(ggplot2)
  library(patchwork)
})

# Set ggplot theme for plots
theme_set(theme_bw())
```


### Paths

```{r base paths}
# The base path for the OpenScPCA repository
repository_base <- rprojroot::find_root(rprojroot::is_git_root)

# The current data directory, found within the repository base directory
data_dir <- file.path(repository_base, "data", "current")

# The path to this module
module_base <- file.path(repository_base, "analyses", "hello-clusters")
```

```{r input file path}
# Path to processed SCE file for sample SCPCS000001
input_sce_file <- file.path(data_dir, "SCPCP000001", "SCPCS000001", "SCPCL000001_processed.rds")
```


### Set the random seed

Because clustering involves random sampling, it is important to set the random seed at the top of your analysis script or notebook to ensure reproducibility.

```{r set seed}
set.seed(2024)
```

## Read in and prepare data

To begin, we'll read in the `SingleCellExperiment` (SCE) object.

```{r read data}
# Read the SCE file
sce <- readRDS(input_sce_file)
```

For the initial cluster calculations and evaluations, we will use the PCA matrix extracted from the SCE object.
As shown in [`01_perform-evaluate-clustering.Rmd`](01_perform-evaluate-clustering.Rmd), it is also possible to use an SCE object or a Seurat object directly.


```{r extract pca data}
# Extract the PCA matrix from an SCE object
pca_matrix <- reducedDim(sce, "PCA")
```

## Varying a single clustering parameter

This section will show how to perform clustering across a set of parameters (aka, "sweep" a set of parameters) with `rOpenScPCA::sweep_clusters()`. 

This function takes a PCA matrix with row names representing unique cell ids (e.g., barcodes) as its primary argument, with additional arguments for cluster parameters. 
This function wraps the `rOpenScPCA::calculate_clusters()` function but allows you to provide a vector of parameter values to perform clustering across, as listed below.
Clusters will be calculated for all combinations of parameters values (where applicable); default values that the function will use for any unspecified parameter values are shown in parentheses

* `algorithm`: Which clustering algorithm to use (Louvain)
* `weighting`: Which weighting scheme to use (Jaccard)
* `nn`: The number of nearest neighbors (10)
* `resolution`: The resolution parameter (1; used only with Louvain and Leiden clustering)
* `objective_function`: The objective function to optimize clusters (CPM; used only with Leiden clustering)

`rOpenScPCA::sweep_clusters()` does allow you to specify values for any other parameters. 


This function will return a list of data frames of clustering results. 
Each data frame will have the following columns:

* `cell_id`: Unique cell identifiers, obtained from the PCA matrix's row names
* `cluster`: A factor column with the cluster identities
* There will be one column for each clustering parameter used

To demonstrate this function, we'll calculate clusters using the Louvain algorithm while varying the `nn` parameter:

```{r sweep clusters}
# Define nn parameter values of interest
nn_values <- seq(10, 30, 10)

# Calculate clusters varying nn, but leaving other parameters at their default values
cluster_results_list <- rOpenScPCA::sweep_clusters(
  pca_matrix,
  nn = nn_values
)
```

The resulting list has a length of three, one data frame for each `nn` parameter tested:

```{r length cluster_results_list}
length(cluster_results_list)
```

It can be helpful (although it is not strictly necessary to keep track) to name this list by the varied `nn` parameter:

```{r set list names}
names(cluster_results_list) <- glue::glue("nn_{nn_values}")
```


We can look at the first few rows of each data frame using [`purrr::map()`](https://purrr.tidyverse.org/reference/map.html) to iterate over the list:


```{r map cluster_results_list}
cluster_results_list |>
  purrr::map(head)
```

Generally speaking, `purrr::map()` can be used to iterate over this list to visualize or analyze each clustering result on its own; we'll use this approach in the following sections. 

### Visualizing clustering results

When comparing clustering results, it's important to first visualize the different clusterings to build context for interpreting QC metrics.

As one example of why this is important, we generally expect that more robust clusters will have higher values for metrics like silhouette width and neighborhood purity.
However, we also expect that having fewer clusters in the first place will also lead to higher metrics, regardless of cluster quality: When there are fewer clusters, it is more likely that clusters overlap less with one another just because there aren't many clusters in the first place.
This means that, when interpreting cluster quality metrics, you should be careful to take more context about the data into consideration and not only rely on the metric values.

We'll therefore visualize these results as UMAPs by iterating over `cluster_results_list` and combining plots with [`patchwork::wrap_plots()`](https://patchwork.data-imaginist.com/reference/wrap_plots.html).
We'll specifically use [`purrr::imap()`](https://purrr.tidyverse.org/reference/imap.html) to iterate so that we can assign this list's names as plot titles.

For this, we'll begin by extracting a table of UMAP coordinates from our SCE object.

```{r create umap_df}
umap_df <- reducedDim(sce, "UMAP") |>
  as.data.frame()
```

Next, we'll iterate over `cluster_results_list` to plot the UMAPs.

```{r plot nn umaps, fig.width = 12}
umap_plots <- cluster_results_list |>
  purrr::imap(
    \(cluster_df, clustering_name) {
      # Add a column with cluster assignments to umap_df
      umap_df_plot <- umap_df |>
        dplyr::mutate(cluster = cluster_df$cluster)

      # Plot the UMAP, colored by the new cluster variable
      ggplot(umap_df_plot, aes(x = UMAP1, y = UMAP2, color = cluster)) +
        geom_point(alpha = 0.6) +
        labs(title = glue::glue("nearest neighbors: {clustering_name}")) +
        # We'll add a couple UMAP plot settings here, including equal axes and
        # turning off the axis ticks and text since UMAP coordinates are not meaningful
        coord_equal() +
        theme(
          axis.ticks = element_blank(),
          axis.text = element_blank()
        )
    }
  )

# Print the plots with patchwork::wrap_plots()
patchwork::wrap_plots(umap_plots)
```

These plots show that the number of clusters decreases as the nearest neighbors parameter increases, with between 9-13 clusters.




### Evaluating clustering results

This section will use `purrr::map()` to iterate over each clustering result data frame to calculate silhouette width, neighborhood purity, and stability, and then visualize results.
The goal of this code is to identify whether one clustering parameterization produces more reliable clusters. 


#### Silhouette width and neighborhood purity

Both silhouette width and neighborhood purity are cell-level quantities, so we can calculate them together in the same call to `purrr::map()`.
Below, we'll iterate over each data frame in `cluster_results_list` to calculate these quantities.

```{r calculate cell level metrics}
cell_metric_list <- cluster_results_list |>
  purrr::map(
    \(cluster_df) {
      # calculate silhouette width
      silhouette_df <- rOpenScPCA::calculate_silhouette(pca_matrix, cluster_df)

      # calculate neighbhorhood purity
      purity_df <- rOpenScPCA::calculate_purity(pca_matrix, cluster_df)

      # Combine into a single data frame
      dplyr::left_join(silhouette_df, purity_df)
    }
  )

# View the first six rows of each clustering result's cell-level QC metrics
purrr::map(cell_metric_list, head)
```


To visualize these results, we can combine all data frames in this list into a single overall data frame, where the existing `nn` column will distinguish among conditions.

```{r combine cell metrics list}
cell_metrics_df <- dplyr::bind_rows(cell_metric_list)
```

We can visualize silhouette width and neighborhood purity each with boxplots, for example, and use the [`patchwork`](https://patchwork.data-imaginist.com/) package to print them together:


```{r}
# Plot silhouette width
silhouette_plot <- ggplot(cell_metrics_df) +
  aes(x = as.factor(nn), y = silhouette_width, fill = as.factor(nn)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Pastel2") +
  labs(
    x = "Number of nearest neighbors",
    y = "Silhouette width"
  )


# Plot neighborhood purity width
purity_plot <- ggplot(cell_metrics_df) +
  aes(x = as.factor(nn), y = purity, fill = as.factor(nn)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Pastel2") +
  labs(
    x = "Number of nearest neighbors",
    y = "Neighborhood purity"
  )


# Add together using the patchwork library, without a legend
silhouette_plot + purity_plot & theme(legend.position = "none")
```

While there does not appear to be a salient difference among silhouette width distributions, it does appear that purity is higher with a higher nearest neighbors parameter.


#### Stability

Next, we'll calculate stability on the clusters using `rOpenScPCA::calculate_stability()`, specifying the same parameter used for the original cluster calculation at each iteration. 

```{r calculate stability}
stability_list <- cluster_results_list |>
  purrr::map(
    \(cluster_df) {
      nn <- unique(cluster_df$nn)

      # calculate stability, passing in the parameter value used for this iteration
      rOpenScPCA::calculate_stability(pca_matrix, cluster_df, nn = nn)
    }
  )
```

We'll again combine the output of `stability_list` into a single data frame and plot `ari` values across `nn` parameterizations.


```{r combine plot stability}
stability_df <- dplyr::bind_rows(stability_list)

ggplot(stability_df) +
  aes(x = as.factor(nn), y = ari, fill = as.factor(nn)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Pastel2") +
  labs(
    x = "Number of nearest neighbors",
    y = "Adjusted Rand Index"
  ) +
  theme(legend.position = "none")
```

Here, we see that a nearest neighbors value of 20 or 30 leads to more stable clustering results compared to 10. 


## Varying multiple clustering parameters

The previous section demonstrated how to calculate clusters and QC metrics when varying one parameter, but it is possible to vary multiple parameters at once with `rOpenScPCA::sweep_clusters()`.
In this section, we'll show an overview of how you might write code to vary two parameters at once (here, nearest neighbors and resolution as examples) and visualize results.


```{r sweep two parameters}
# Define vectors of parameters to vary
nn_values <- seq(10, 30, 10)
res_values <- seq(5, 15, 5)/10


cluster_results_list <- rOpenScPCA::sweep_clusters(
  pca_matrix,
  nn = nn_values,
  resolution = res_values
)
```

This resulting list now has 9 different clustering results, for each combination of `nn_values` and `res_values`:


```{r length cluster_results_list two parameters}
length(cluster_results_list)
```


### Visualize clusters

Next, we'll iterate over `cluster_results_list` to plot the UMAPs.
This time, we'll use `purrr::map()` and pull out parameters from each iteration's `cluster_df` to form the UMAP panel title.

```{r plot nn res umaps, fig.height = 14}
umap_plots <- cluster_results_list |>
  purrr::map(
    \(cluster_df) {

      # Add a column with cluster assignments to umap_df
      umap_df_plot <- umap_df |>
        dplyr::mutate(cluster = cluster_df$cluster)

      # Create a title for the UMAP with both parameters
      umap_title <- glue::glue(
        "nn: {unique(cluster_df$nn)}; res: {unique(cluster_df$resolution)}"
      )

      # Plot the UMAP, colored by the new cluster variable
      ggplot(umap_df_plot, aes(x = UMAP1, y = UMAP2, color = cluster)) +
        geom_point(alpha = 0.6) +
        labs(title = umap_title) +
        # We'll add a couple UMAP-specific plot settings 
        coord_equal() +
        theme(
          axis.ticks = element_blank(),
          axis.text = element_blank(),
          legend.position = "bottom"
        )
  }
)

# Print the plots with patchwork::wrap_plots()
patchwork::wrap_plots(umap_plots)
```



### Calculate and visualize QC metrics

This section presents one coding strategy to calculate and visualize results when varying two clustering parameters.
In particular, we use faceting to help display all information in one plot, by placing nearest neighbor values on the X-axis and faceting by resolution values.
Since silhouette width and neighhorbood purity calculations using generally similar code, we'll just show neighborhood purity here.

#### Neighborhood purity

First, we'll calculate neighborhood purity and combine results into a single data frame.

```{r calculate purity two parameters}
purity_df <- cluster_results_list |>
  purrr::map(
    \(cluster_df) {
      rOpenScPCA::calculate_purity(pca_matrix, cluster_df)
    }
  ) |>
  dplyr::bind_rows()
```


We'll add a column `resolution_label` which we'll use to have informative panel titles in the faceted ggplot we make next.

```{r add resolution_label column}
purity_df <- purity_df |>
  dplyr::mutate(resolution_label = glue::glue("Resolution: {resolution}"))
```

```{r visualize purity two parameters}
ggplot(purity_df) +
  aes(x = as.factor(nn), y = purity, fill = as.factor(nn)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Pastel2") +
  # facet by resolution
  facet_wrap(vars(resolution_label)) +
  labs(
    x = "Number of nearest neighbors",
    y = "Neighborhood purity"
  ) +
  theme(legend.position = "none")
```

### Stability

Similar to above, we'll calculate stability, combine results into a single data frame, add a `resolution_label` column to support plot interpretation, and finally make our plot.

```{r calculate stability two parameters}
stability_df <- cluster_results_list |>
  purrr::map(
    \(cluster_df) {

      # Extract parameters for this clustering result
      nn <- unique(cluster_df$nn)
      resolution <- unique(cluster_df$resolution)

      rOpenScPCA::calculate_stability(
        pca_matrix,
        cluster_df,
        nn = nn,
        resolution = resolution
      )
    }
  ) |>
  dplyr::bind_rows()

stability_df <- stability_df |>
  dplyr::mutate(resolution_label = glue::glue("Resolution: {resolution}"))

```


```{r visualize stability two parameters}
ggplot(stability_df) +
  aes(x = as.factor(nn), y = ari, fill = as.factor(nn)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Pastel2") +
  # facet by resolution
  facet_wrap(vars(resolution_label)) +
  labs(
    x = "Number of nearest neighbors",
    y = "Adjusted Rand Index"
  ) +
  theme(legend.position = "none")
```



## Session Info

```{r session info}
# record the versions of the packages used in this analysis and other environment information
sessionInfo()
```

+
---
title: "Comparing clustering parameters with rOpenScPCA"
date: "`r Sys.Date()`"
author: "Data Lab"
output:
  html_notebook:
    toc: yes
    toc_float: yes
    df_print: paged
---

## Introduction

Clustering algorithms have several parameters which can be varied, leading to different clustering results.
A key question when clustering, therefore, is how to identify a set of parameters that lead to robust and reliable clusters that can be used in downstream analysis. 

This notebook provides examples of how to use the `rOpenScPCA` package to:

* Calculate several versions of clustering results across several different parameterizations
* Calculate QC metrics on across clustering results

Please refer to the [`01_perform-evaluate-clustering.Rmd`](01_perform-evaluate-clustering.Rmd) notebook for a tutorial on using `rOpenScPCA` functions to:

* Calculate clusters from a single parameterization
* Calculate QC metrics on a single set of clusters, as well as explanations of the metrics themselves

This notebook will use the sample `SCPCS000001` from project `SCPCP000001`, which is assumed present in the `OpenScPCA-analysis/data/current/SCPCP000001` directory, for all examples.
Please [see this documentation](https://openscpca.readthedocs.io/en/latest/getting-started/accessing-resources/getting-access-to-data/) for more information about obtaining ScPCA data.

## Setup

### Packages


```{r packages}
library(rOpenScPCA)

suppressPackageStartupMessages({
  library(SingleCellExperiment)
  library(ggplot2)
  library(patchwork)
})

# Set ggplot theme for plots
theme_set(theme_bw())
```


### Paths

```{r base paths}
# The base path for the OpenScPCA repository
repository_base <- rprojroot::find_root(rprojroot::has_dir(".github"))

# The current data directory, found within the repository base directory
data_dir <- file.path(repository_base, "data", "current")

# The path to this module
module_base <- file.path(repository_base, "analyses", "hello-clusters")
```

```{r input file path}
# Path to processed SCE file for sample SCPCS000001
input_sce_file <- file.path(data_dir, "SCPCP000001", "SCPCS000001", "SCPCL000001_processed.rds")
```


### Set the random seed

Because clustering involves random sampling, it is important to set the random seed at the top of your analysis script or notebook to ensure reproducibility.

```{r set seed}
set.seed(2024)
```

## Read in and prepare data

To begin, we'll read in the `SingleCellExperiment` (SCE) object.

```{r read data}
# Read the SCE file
sce <- readRDS(input_sce_file)
```

For the initial cluster calculations and evaluations, we will use the PCA matrix extracted from the SCE object.
As shown in [`01_perform-evaluate-clustering.Rmd`](01_perform-evaluate-clustering.Rmd), it is also possible to use an SCE object or a Seurat object directly.


```{r extract pca data}
# Extract the PCA matrix from an SCE object
pca_matrix <- reducedDim(sce, "PCA")
```

## Varying a single clustering parameter

This section will show how to perform clustering across a set of parameters (aka, "sweep" a set of parameters) with `rOpenScPCA::sweep_clusters()`. 

This function takes a PCA matrix with row names representing unique cell ids (e.g., barcodes) as its primary argument, with additional arguments for cluster parameters. 
This function wraps the `rOpenScPCA::calculate_clusters()` function but allows you to provide a vector of parameter values to perform clustering across, as listed below.
Clusters will be calculated for all combinations of parameters values (where applicable); default values that the function will use for any unspecified parameter values are shown in parentheses

* `algorithm`: Which clustering algorithm to use (Louvain)
* `weighting`: Which weighting scheme to use (Jaccard)
* `nn`: The number of nearest neighbors (10)
* `resolution`: The resolution parameter (1; used only with Louvain and Leiden clustering)
* `objective_function`: The objective function to optimize clusters (CPM; used only with Leiden clustering)

`rOpenScPCA::sweep_clusters()` does allow you to specify values for any other parameters. 


This function will return a list of data frames of clustering results. 
Each data frame will have the following columns:

* `cell_id`: Unique cell identifiers, obtained from the PCA matrix's row names
* `cluster`: A factor column with the cluster identities
* There will be one column for each clustering parameter used

To demonstrate this function, we'll calculate clusters using the Louvain algorithm while varying the `nn` parameter:

```{r sweep clusters}
# Define nn parameter values of interest
nn_values <- seq(10, 30, 10)

# Calculate clusters varying nn, but leaving other parameters at their default values
cluster_results_list <- rOpenScPCA::sweep_clusters(
  pca_matrix,
  nn = nn_values
)
```

The resulting list has a length of three, one data frame for each `nn` parameter tested:

```{r length cluster_results_list}
length(cluster_results_list)
```

It can be helpful (although it is not strictly necessary to keep track) to name this list by the varied `nn` parameter:

```{r set list names}
names(cluster_results_list) <- glue::glue("nn_{nn_values}")
```


We can look at the first few rows of each data frame using [`purrr::map()`](https://purrr.tidyverse.org/reference/map.html) to iterate over the list:


```{r map cluster_results_list}
cluster_results_list |>
  purrr::map(head)
```

Generally speaking, `purrr::map()` can be used to iterate over this list to visualize or analyze each clustering result on its own; we'll use this approach in the following sections. 

### Visualizing clustering results

When comparing clustering results, it's important to first visualize the different clusterings to build context for interpreting QC metrics.

As one example of why this is important, we generally expect that more robust clusters will have higher values for metrics like silhouette width and neighborhood purity.
However, we also expect that having fewer clusters in the first place will also lead to higher metrics, regardless of cluster quality: When there are fewer clusters, it is more likely that clusters overlap less with one another just because there aren't many clusters in the first place.
This means that, when interpreting cluster quality metrics, you should be careful to take more context about the data into consideration and not only rely on the metric values.

We'll therefore visualize these results as UMAPs by iterating over `cluster_results_list` and combining plots with [`patchwork::wrap_plots()`](https://patchwork.data-imaginist.com/reference/wrap_plots.html).
We'll specifically use [`purrr::imap()`](https://purrr.tidyverse.org/reference/imap.html) to iterate so that we can assign this list's names as plot titles.

For this, we'll begin by extracting a table of UMAP coordinates from our SCE object.

```{r create umap_df}
umap_df <- reducedDim(sce, "UMAP") |>
  as.data.frame()
```

Next, we'll iterate over `cluster_results_list` to plot the UMAPs.

```{r plot nn umaps, fig.width = 12}
umap_plots <- cluster_results_list |>
  purrr::imap(
    \(cluster_df, clustering_name) {
      # Add a column with cluster assignments to umap_df
      umap_df_plot <- umap_df |>
        dplyr::mutate(cluster = cluster_df$cluster)

      # Plot the UMAP, colored by the new cluster variable
      ggplot(umap_df_plot, aes(x = UMAP1, y = UMAP2, color = cluster)) +
        geom_point(alpha = 0.6) +
        labs(title = glue::glue("nearest neighbors: {clustering_name}")) +
        # We'll add a couple UMAP plot settings here, including equal axes and
        # turning off the axis ticks and text since UMAP coordinates are not meaningful
        coord_equal() +
        theme(
          axis.ticks = element_blank(),
          axis.text = element_blank()
        )
    }
  )

# Print the plots with patchwork::wrap_plots()
patchwork::wrap_plots(umap_plots)
```

These plots show that the number of clusters decreases as the nearest neighbors parameter increases, with between 9-13 clusters.




### Evaluating clustering results

This section will use `purrr::map()` to iterate over each clustering result data frame to calculate silhouette width, neighborhood purity, and stability, and then visualize results.
The goal of this code is to identify whether one clustering parameterization produces more reliable clusters. 


#### Silhouette width and neighborhood purity

Both silhouette width and neighborhood purity are cell-level quantities, so we can calculate them together in the same call to `purrr::map()`.
Below, we'll iterate over each data frame in `cluster_results_list` to calculate these quantities.

```{r calculate cell level metrics}
cell_metric_list <- cluster_results_list |>
  purrr::map(
    \(cluster_df) {
      # calculate silhouette width
      silhouette_df <- rOpenScPCA::calculate_silhouette(pca_matrix, cluster_df)

      # calculate neighbhorhood purity
      purity_df <- rOpenScPCA::calculate_purity(pca_matrix, cluster_df)

      # Combine into a single data frame
      dplyr::left_join(silhouette_df, purity_df)
    }
  )

# View the first six rows of each clustering result's cell-level QC metrics
purrr::map(cell_metric_list, head)
```


To visualize these results, we can combine all data frames in this list into a single overall data frame, where the existing `nn` column will distinguish among conditions.

```{r combine cell metrics list}
cell_metrics_df <- dplyr::bind_rows(cell_metric_list)
```

We can visualize silhouette width and neighborhood purity each with boxplots, for example, and use the [`patchwork`](https://patchwork.data-imaginist.com/) package to print them together:


```{r}
# Plot silhouette width
silhouette_plot <- ggplot(cell_metrics_df) +
  aes(x = as.factor(nn), y = silhouette_width, fill = as.factor(nn)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Pastel2") +
  labs(
    x = "Number of nearest neighbors",
    y = "Silhouette width"
  )


# Plot neighborhood purity width
purity_plot <- ggplot(cell_metrics_df) +
  aes(x = as.factor(nn), y = purity, fill = as.factor(nn)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Pastel2") +
  labs(
    x = "Number of nearest neighbors",
    y = "Neighborhood purity"
  )


# Add together using the patchwork library, without a legend
silhouette_plot + purity_plot & theme(legend.position = "none")
```

While there does not appear to be a salient difference among silhouette width distributions, it does appear that purity is higher with a higher nearest neighbors parameter.


#### Stability

Next, we'll calculate stability on the clusters using `rOpenScPCA::calculate_stability()`, specifying the same parameter used for the original cluster calculation at each iteration. 

```{r calculate stability}
stability_list <- cluster_results_list |>
  purrr::map(
    \(cluster_df) {
      nn <- unique(cluster_df$nn)

      # calculate stability, passing in the parameter value used for this iteration
      rOpenScPCA::calculate_stability(pca_matrix, cluster_df, nn = nn)
    }
  )
```

We'll again combine the output of `stability_list` into a single data frame and plot `ari` values across `nn` parameterizations.


```{r combine plot stability}
stability_df <- dplyr::bind_rows(stability_list)

ggplot(stability_df) +
  aes(x = as.factor(nn), y = ari, fill = as.factor(nn)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Pastel2") +
  labs(
    x = "Number of nearest neighbors",
    y = "Adjusted Rand Index"
  ) +
  theme(legend.position = "none")
```

Here, we see that a nearest neighbors value of 20 or 30 leads to more stable clustering results compared to 10. 


## Varying multiple clustering parameters

The previous section demonstrated how to calculate clusters and QC metrics when varying one parameter, but it is possible to vary multiple parameters at once with `rOpenScPCA::sweep_clusters()`.
In this section, we'll show an overview of how you might write code to vary two parameters at once (here, nearest neighbors and resolution as examples) and visualize results.


```{r sweep two parameters}
# Define vectors of parameters to vary
nn_values <- seq(10, 30, 10)
res_values <- seq(5, 15, 5) / 10


cluster_results_list <- rOpenScPCA::sweep_clusters(
  pca_matrix,
  nn = nn_values,
  resolution = res_values
)
```

This resulting list now has 9 different clustering results, for each combination of `nn_values` and `res_values`:


```{r length cluster_results_list two parameters}
length(cluster_results_list)
```


### Visualize clusters

Next, we'll iterate over `cluster_results_list` to plot the UMAPs.
This time, we'll use `purrr::map()` and pull out parameters from each iteration's `cluster_df` to form the UMAP panel title.

```{r plot nn res umaps, fig.height = 14}
umap_plots <- cluster_results_list |>
  purrr::map(
    \(cluster_df) {
      # Add a column with cluster assignments to umap_df
      umap_df_plot <- umap_df |>
        dplyr::mutate(cluster = cluster_df$cluster)

      # Create a title for the UMAP with both parameters
      umap_title <- glue::glue(
        "nn: {unique(cluster_df$nn)}; res: {unique(cluster_df$resolution)}"
      )

      # Plot the UMAP, colored by the new cluster variable
      ggplot(umap_df_plot, aes(x = UMAP1, y = UMAP2, color = cluster)) +
        geom_point(alpha = 0.6) +
        labs(title = umap_title) +
        # We'll add a couple UMAP-specific plot settings
        coord_equal() +
        theme(
          axis.ticks = element_blank(),
          axis.text = element_blank(),
          legend.position = "bottom"
        )
    }
  )

# Print the plots with patchwork::wrap_plots()
patchwork::wrap_plots(umap_plots)
```



### Calculate and visualize QC metrics

This section presents one coding strategy to calculate and visualize results when varying two clustering parameters.
In particular, we use faceting to help display all information in one plot, by placing nearest neighbor values on the X-axis and faceting by resolution values.
Since silhouette width and neighhorbood purity calculations using generally similar code, we'll just show neighborhood purity here.

#### Neighborhood purity

First, we'll calculate neighborhood purity and combine results into a single data frame.

```{r calculate purity two parameters}
purity_df <- cluster_results_list |>
  purrr::map(
    \(cluster_df) {
      rOpenScPCA::calculate_purity(pca_matrix, cluster_df)
    }
  ) |>
  dplyr::bind_rows()
```


We'll add a column `resolution_label` which we'll use to have informative panel titles in the faceted ggplot we make next.

```{r add resolution_label column}
purity_df <- purity_df |>
  dplyr::mutate(resolution_label = glue::glue("Resolution: {resolution}"))
```

```{r visualize purity two parameters}
ggplot(purity_df) +
  aes(x = as.factor(nn), y = purity, fill = as.factor(nn)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Pastel2") +
  # facet by resolution
  facet_wrap(vars(resolution_label)) +
  labs(
    x = "Number of nearest neighbors",
    y = "Neighborhood purity"
  ) +
  theme(legend.position = "none")
```

### Stability

Similar to above, we'll calculate stability, combine results into a single data frame, add a `resolution_label` column to support plot interpretation, and finally make our plot.

```{r calculate stability two parameters}
stability_df <- cluster_results_list |>
  purrr::map(
    \(cluster_df) {
      # Extract parameters for this clustering result
      nn <- unique(cluster_df$nn)
      resolution <- unique(cluster_df$resolution)

      rOpenScPCA::calculate_stability(
        pca_matrix,
        cluster_df,
        nn = nn,
        resolution = resolution
      )
    }
  ) |>
  dplyr::bind_rows()

stability_df <- stability_df |>
  dplyr::mutate(resolution_label = glue::glue("Resolution: {resolution}"))
```


```{r visualize stability two parameters}
ggplot(stability_df) +
  aes(x = as.factor(nn), y = ari, fill = as.factor(nn)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Pastel2") +
  # facet by resolution
  facet_wrap(vars(resolution_label)) +
  labs(
    x = "Number of nearest neighbors",
    y = "Adjusted Rand Index"
  ) +
  theme(legend.position = "none")
```



## Session Info

```{r session info}
# record the versions of the packages used in this analysis and other environment information
sessionInfo()
```

From 953b3d10b688104a8e5902d4320b8b125192f945 Mon Sep 17 00:00:00 2001 From: Stephanie Spielman Date: Tue, 14 Jan 2025 13:45:37 -0500 Subject: [PATCH 11/16] Apply suggestions from code review Co-authored-by: Joshua Shapiro --- .../02_compare-clustering-parameters.Rmd | 19 ++++++++----------- 1 file changed, 8 insertions(+), 11 deletions(-) diff --git a/analyses/hello-clusters/02_compare-clustering-parameters.Rmd b/analyses/hello-clusters/02_compare-clustering-parameters.Rmd index dab142325..7b9ba3c4c 100644 --- a/analyses/hello-clusters/02_compare-clustering-parameters.Rmd +++ b/analyses/hello-clusters/02_compare-clustering-parameters.Rmd @@ -97,7 +97,7 @@ This section will show how to perform clustering across a set of parameters (aka This function takes a PCA matrix with row names representing unique cell ids (e.g., barcodes) as its primary argument, with additional arguments for cluster parameters. This function wraps the `rOpenScPCA::calculate_clusters()` function but allows you to provide a vector of parameter values to perform clustering across, as listed below. -Clusters will be calculated for all combinations of parameters values (where applicable); default values that the function will use for any unspecified parameter values are shown in parentheses +Clusters will be calculated for all combinations of parameters values (where applicable); default values that the function will use for any unspecified parameter values are shown in parentheses. * `algorithm`: Which clustering algorithm to use (Louvain) * `weighting`: Which weighting scheme to use (Jaccard) @@ -119,7 +119,7 @@ To demonstrate this function, we'll calculate clusters using the Louvain algorit ```{r sweep clusters} # Define nn parameter values of interest -nn_values <- seq(10, 30, 10) +nn_values <- c(10, 20, 30) # Calculate clusters varying nn, but leaving other parameters at their default values cluster_results_list <- rOpenScPCA::sweep_clusters( @@ -221,10 +221,7 @@ cell_metric_list <- cluster_results_list |> silhouette_df <- rOpenScPCA::calculate_silhouette(pca_matrix, cluster_df) # calculate neighbhorhood purity - purity_df <- rOpenScPCA::calculate_purity(pca_matrix, cluster_df) - - # Combine into a single data frame - dplyr::left_join(silhouette_df, purity_df) + purity_df <- rOpenScPCA::calculate_purity(pca_matrix, silhouette_df) } ) @@ -280,7 +277,7 @@ Next, we'll calculate stability on the clusters using `rOpenScPCA::calculate_sta stability_list <- cluster_results_list |> purrr::map( \(cluster_df) { - nn <- unique(cluster_df$nn) + nn <- cluster_df$nn[1] # all rows have the same `nn` parameter, so we'll take the first # calculate stability, passing in the parameter value used for this iteration rOpenScPCA::calculate_stability(pca_matrix, cluster_df, nn = nn) @@ -316,8 +313,8 @@ In this section, we'll show an overview of how you might write code to vary two ```{r sweep two parameters} # Define vectors of parameters to vary -nn_values <- seq(10, 30, 10) -res_values <- seq(5, 15, 5) / 10 +nn_values <- c(10, 20, 30) +res_values <-c(0.5, 1.0, 1.5) cluster_results_list <- rOpenScPCA::sweep_clusters( @@ -368,7 +365,7 @@ umap_plots <- cluster_results_list |> ) # Print the plots with patchwork::wrap_plots() -patchwork::wrap_plots(umap_plots) +patchwork::wrap_plots(umap_plots, ncol = 3) ``` @@ -407,7 +404,7 @@ ggplot(purity_df) + geom_boxplot() + scale_fill_brewer(palette = "Pastel2") + # facet by resolution - facet_wrap(vars(resolution_label)) + + facet_wrap(vars(resolution_label), labeller = label_both) + labs( x = "Number of nearest neighbors", y = "Neighborhood purity" From 9de516aa5fa2b97a8ba0a8272eca9c62303e1975 Mon Sep 17 00:00:00 2001 From: Stephanie Spielman Date: Tue, 14 Jan 2025 13:50:57 -0500 Subject: [PATCH 12/16] Update analyses/hello-clusters/02_compare-clustering-parameters.Rmd Co-authored-by: Joshua Shapiro --- analyses/hello-clusters/02_compare-clustering-parameters.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/analyses/hello-clusters/02_compare-clustering-parameters.Rmd b/analyses/hello-clusters/02_compare-clustering-parameters.Rmd index 7b9ba3c4c..ec6bddc65 100644 --- a/analyses/hello-clusters/02_compare-clustering-parameters.Rmd +++ b/analyses/hello-clusters/02_compare-clustering-parameters.Rmd @@ -233,7 +233,7 @@ purrr::map(cell_metric_list, head) To visualize these results, we can combine all data frames in this list into a single overall data frame, where the existing `nn` column will distinguish among conditions. ```{r combine cell metrics list} -cell_metrics_df <- dplyr::bind_rows(cell_metric_list) +cell_metrics_df <- purrr::list_rbind(cell_metric_list) ``` We can visualize silhouette width and neighborhood purity each with boxplots, for example, and use the [`patchwork`](https://patchwork.data-imaginist.com/) package to print them together: From 626b1747af94b867627967dfe86fcecb86a9fff1 Mon Sep 17 00:00:00 2001 From: Stephanie Spielman Date: Tue, 14 Jan 2025 14:04:24 -0500 Subject: [PATCH 13/16] updates in response to review --- .../02_compare-clustering-parameters.Rmd | 35 +++++++------------ 1 file changed, 13 insertions(+), 22 deletions(-) diff --git a/analyses/hello-clusters/02_compare-clustering-parameters.Rmd b/analyses/hello-clusters/02_compare-clustering-parameters.Rmd index ec6bddc65..457ca45b7 100644 --- a/analyses/hello-clusters/02_compare-clustering-parameters.Rmd +++ b/analyses/hello-clusters/02_compare-clustering-parameters.Rmd @@ -105,7 +105,7 @@ Clusters will be calculated for all combinations of parameters values (where app * `resolution`: The resolution parameter (1; used only with Louvain and Leiden clustering) * `objective_function`: The objective function to optimize clusters (CPM; used only with Leiden clustering) -`rOpenScPCA::sweep_clusters()` does allow you to specify values for any other parameters. +`rOpenScPCA::sweep_clusters()` does not allow you to specify values for any other parameters. This function will return a list of data frames of clustering results. @@ -134,7 +134,8 @@ The resulting list has a length of three, one data frame for each `nn` parameter length(cluster_results_list) ``` -It can be helpful (although it is not strictly necessary to keep track) to name this list by the varied `nn` parameter: +It can be helpful (although it is not strictly necessary to keep track) to name this list by the varied `nn` parameter. +In this case, we'll use these names to label plots. ```{r set list names} names(cluster_results_list) <- glue::glue("nn_{nn_values}") @@ -194,7 +195,7 @@ umap_plots <- cluster_results_list |> ) # Print the plots with patchwork::wrap_plots() -patchwork::wrap_plots(umap_plots) +patchwork::wrap_plots(umap_plots, ncol = 3) ``` These plots show that the number of clusters decreases as the nearest neighbors parameter increases, with between 9-13 clusters. @@ -267,6 +268,7 @@ silhouette_plot + purity_plot & theme(legend.position = "none") ``` While there does not appear to be a salient difference among silhouette width distributions, it does appear that purity is higher with a higher nearest neighbors parameter. +It's worth noting that this trend in purity values is expected: Higher nearest neighbor parameter values lead to fewer clusters, and neighborhood purity tends to be higher when there are fewer clusters. #### Stability @@ -289,7 +291,7 @@ We'll again combine the output of `stability_list` into a single data frame and ```{r combine plot stability} -stability_df <- dplyr::bind_rows(stability_list) +stability_df <- purrr::list_rbind(stability_list) ggplot(stability_df) + aes(x = as.factor(nn), y = ari, fill = as.factor(nn)) + @@ -314,7 +316,7 @@ In this section, we'll show an overview of how you might write code to vary two ```{r sweep two parameters} # Define vectors of parameters to vary nn_values <- c(10, 20, 30) -res_values <-c(0.5, 1.0, 1.5) +res_values <- c(0.5, 1.0, 1.5) cluster_results_list <- rOpenScPCA::sweep_clusters( @@ -347,7 +349,7 @@ umap_plots <- cluster_results_list |> # Create a title for the UMAP with both parameters umap_title <- glue::glue( - "nn: {unique(cluster_df$nn)}; res: {unique(cluster_df$resolution)}" + "nn: {cluster_df$nn[1]}; res: {cluster_df$resolution[1]}" ) # Plot the UMAP, colored by the new cluster variable @@ -387,24 +389,17 @@ purity_df <- cluster_results_list |> rOpenScPCA::calculate_purity(pca_matrix, cluster_df) } ) |> - dplyr::bind_rows() + purrr::list_rbind() ``` -We'll add a column `resolution_label` which we'll use to have informative panel titles in the faceted ggplot we make next. - -```{r add resolution_label column} -purity_df <- purity_df |> - dplyr::mutate(resolution_label = glue::glue("Resolution: {resolution}")) -``` - ```{r visualize purity two parameters} ggplot(purity_df) + aes(x = as.factor(nn), y = purity, fill = as.factor(nn)) + geom_boxplot() + scale_fill_brewer(palette = "Pastel2") + - # facet by resolution - facet_wrap(vars(resolution_label), labeller = label_both) + + # facet by resolution, labeling panels with both the resolution column name and value + facet_wrap(vars(resolution), labeller = label_both) + labs( x = "Number of nearest neighbors", y = "Neighborhood purity" @@ -432,10 +427,7 @@ stability_df <- cluster_results_list |> ) } ) |> - dplyr::bind_rows() - -stability_df <- stability_df |> - dplyr::mutate(resolution_label = glue::glue("Resolution: {resolution}")) + purrr::list_rbind() ``` @@ -444,8 +436,7 @@ ggplot(stability_df) + aes(x = as.factor(nn), y = ari, fill = as.factor(nn)) + geom_boxplot() + scale_fill_brewer(palette = "Pastel2") + - # facet by resolution - facet_wrap(vars(resolution_label)) + + facet_wrap(vars(resolution), labeller = label_both) + labs( x = "Number of nearest neighbors", y = "Adjusted Rand Index" From 013fd7ccbe1f5769fc779042a947ce4245032c09 Mon Sep 17 00:00:00 2001 From: Stephanie Spielman Date: Tue, 14 Jan 2025 14:06:32 -0500 Subject: [PATCH 14/16] update a comment --- analyses/hello-clusters/02_compare-clustering-parameters.Rmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/analyses/hello-clusters/02_compare-clustering-parameters.Rmd b/analyses/hello-clusters/02_compare-clustering-parameters.Rmd index 457ca45b7..3bfc1c0a4 100644 --- a/analyses/hello-clusters/02_compare-clustering-parameters.Rmd +++ b/analyses/hello-clusters/02_compare-clustering-parameters.Rmd @@ -221,8 +221,8 @@ cell_metric_list <- cluster_results_list |> # calculate silhouette width silhouette_df <- rOpenScPCA::calculate_silhouette(pca_matrix, cluster_df) - # calculate neighbhorhood purity - purity_df <- rOpenScPCA::calculate_purity(pca_matrix, silhouette_df) + # calculate neighbhorhood purity, starting from silhouette_df + OpenScPCA::calculate_purity(pca_matrix, silhouette_df) } ) From 9de82b4c7872b13af0660d981440f92941567cf1 Mon Sep 17 00:00:00 2001 From: Stephanie Spielman Date: Tue, 14 Jan 2025 14:07:09 -0500 Subject: [PATCH 15/16] addidentally deleted an r --- analyses/hello-clusters/02_compare-clustering-parameters.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/analyses/hello-clusters/02_compare-clustering-parameters.Rmd b/analyses/hello-clusters/02_compare-clustering-parameters.Rmd index 3bfc1c0a4..e5ffb6c5c 100644 --- a/analyses/hello-clusters/02_compare-clustering-parameters.Rmd +++ b/analyses/hello-clusters/02_compare-clustering-parameters.Rmd @@ -222,7 +222,7 @@ cell_metric_list <- cluster_results_list |> silhouette_df <- rOpenScPCA::calculate_silhouette(pca_matrix, cluster_df) # calculate neighbhorhood purity, starting from silhouette_df - OpenScPCA::calculate_purity(pca_matrix, silhouette_df) + rOpenScPCA::calculate_purity(pca_matrix, silhouette_df) } ) From 1e59f9c2dde1d32025f53e92a91370dd161c5945 Mon Sep 17 00:00:00 2001 From: Stephanie Spielman Date: Tue, 14 Jan 2025 15:16:06 -0500 Subject: [PATCH 16/16] ensure legend fits, and render the html --- .../02_compare-clustering-parameters.Rmd | 4 +- .../02_compare-clustering-parameters.nb.html | 118 ++++++++---------- 2 files changed, 52 insertions(+), 70 deletions(-) diff --git a/analyses/hello-clusters/02_compare-clustering-parameters.Rmd b/analyses/hello-clusters/02_compare-clustering-parameters.Rmd index e5ffb6c5c..6ba7df341 100644 --- a/analyses/hello-clusters/02_compare-clustering-parameters.Rmd +++ b/analyses/hello-clusters/02_compare-clustering-parameters.Rmd @@ -361,7 +361,9 @@ umap_plots <- cluster_results_list |> theme( axis.ticks = element_blank(), axis.text = element_blank(), - legend.position = "bottom" + # Ensure legends fit in the figure + legend.position = "bottom", + legend.key.size = unit(0.2, "cm") ) } ) diff --git a/analyses/hello-clusters/02_compare-clustering-parameters.nb.html b/analyses/hello-clusters/02_compare-clustering-parameters.nb.html index 212e116c5..76257489e 100644 --- a/analyses/hello-clusters/02_compare-clustering-parameters.nb.html +++ b/analyses/hello-clusters/02_compare-clustering-parameters.nb.html @@ -11,7 +11,7 @@ - + Comparing clustering parameters with rOpenScPCA @@ -2901,7 +2901,7 @@

Comparing clustering parameters with rOpenScPCA

Data Lab

-

2025-01-13

+

2025-01-14

@@ -3040,7 +3040,7 @@

Varying a single clustering parameter

listed below. Clusters will be calculated for all combinations of parameters values (where applicable); default values that the function will use for any unspecified parameter values are shown in -parentheses

+parentheses.

  • algorithm: Which clustering algorithm to use (Louvain)
  • @@ -3051,8 +3051,8 @@

    Varying a single clustering parameter

  • objective_function: The objective function to optimize clusters (CPM; used only with Leiden clustering)
-

rOpenScPCA::sweep_clusters() does allow you to specify -values for any other parameters.

+

rOpenScPCA::sweep_clusters() does not allow you to +specify values for any other parameters.

This function will return a list of data frames of clustering results. Each data frame will have the following columns:

    @@ -3066,9 +3066,9 @@

    Varying a single clustering parameter

    Louvain algorithm while varying the nn parameter:

    - +
    # Define nn parameter values of interest
    -nn_values <- seq(10, 30, 10)
    +nn_values <- c(10, 20, 30)
     
     # Calculate clusters varying nn, but leaving other parameters at their default values
     cluster_results_list <- rOpenScPCA::sweep_clusters(
    @@ -3091,7 +3091,8 @@ 

    Varying a single clustering parameter

    It can be helpful (although it is not strictly necessary to keep -track) to name this list by the varied nn parameter:

    +track) to name this list by the varied nn parameter. In +this case, we’ll use these names to label plots.

    @@ -3173,7 +3174,7 @@

    Visualizing clustering results

    the UMAPs.

    - +
    umap_plots <- cluster_results_list |>
       purrr::imap(
         \(cluster_df, clustering_name) {
    @@ -3196,7 +3197,7 @@ 

    Visualizing clustering results

    ) # Print the plots with patchwork::wrap_plots() -patchwork::wrap_plots(umap_plots)
    +patchwork::wrap_plots(umap_plots, ncol = 3)

    @@ -3221,31 +3222,19 @@

    Silhouette width and neighborhood purity

    cluster_results_list to calculate these quantities.

    - +
    cell_metric_list <- cluster_results_list |>
       purrr::map(
         \(cluster_df) {
           # calculate silhouette width
           silhouette_df <- rOpenScPCA::calculate_silhouette(pca_matrix, cluster_df)
     
    -      # calculate neighbhorhood purity
    -      purity_df <- rOpenScPCA::calculate_purity(pca_matrix, cluster_df)
    -
    -      # Combine into a single data frame
    -      dplyr::left_join(silhouette_df, purity_df)
    +      # calculate neighbhorhood purity, starting from silhouette_df
    +      rOpenScPCA::calculate_purity(pca_matrix, silhouette_df)
         }
    -  )
    - - -
    Joining with `by = join_by(cell_id, cluster, algorithm, weighting, nn,
    -resolution)`
    -Joining with `by = join_by(cell_id, cluster, algorithm, weighting, nn,
    -resolution)`
    -Joining with `by = join_by(cell_id, cluster, algorithm, weighting, nn,
    -resolution)`
    - - -
    # View the first six rows of each clustering result's cell-level QC metrics
    +  )
    +
    +# View the first six rows of each clustering result's cell-level QC metrics
     purrr::map(cell_metric_list, head)
    @@ -3304,8 +3293,8 @@

    Silhouette width and neighborhood purity

    nn column will distinguish among conditions.

    - -
    cell_metrics_df <- dplyr::bind_rows(cell_metric_list)
    + +
    cell_metrics_df <- purrr::list_rbind(cell_metric_list)
    @@ -3347,7 +3336,10 @@

    Silhouette width and neighborhood purity

    While there does not appear to be a salient difference among silhouette width distributions, it does appear that purity is higher -with a higher nearest neighbors parameter.

    +with a higher nearest neighbors parameter. It’s worth noting that this +trend in purity values is expected: Higher nearest neighbor parameter +values lead to fewer clusters, and neighborhood purity tends to be +higher when there are fewer clusters.

    Stability

    @@ -3357,11 +3349,11 @@

    Stability

    iteration.

    - +
    stability_list <- cluster_results_list |>
       purrr::map(
         \(cluster_df) {
    -      nn <- unique(cluster_df$nn)
    +      nn <- cluster_df$nn[1] # all rows have the same `nn` parameter, so we'll take the first
     
           # calculate stability, passing in the parameter value used for this iteration
           rOpenScPCA::calculate_stability(pca_matrix, cluster_df, nn = nn)
    @@ -3375,8 +3367,8 @@ 

    Stability

    nn parameterizations.

    - -
    stability_df <- dplyr::bind_rows(stability_list)
    +
    +
    stability_df <- purrr::list_rbind(stability_list)
     
     ggplot(stability_df) +
       aes(x = as.factor(nn), y = ari, fill = as.factor(nn)) +
    @@ -3408,10 +3400,10 @@ 

    Varying multiple clustering parameters

    examples) and visualize results.

    - +
    # Define vectors of parameters to vary
    -nn_values <- seq(10, 30, 10)
    -res_values <- seq(5, 15, 5) / 10
    +nn_values <- c(10, 20, 30)
    +res_values <- c(0.5, 1.0, 1.5)
     
     
     cluster_results_list <- rOpenScPCA::sweep_clusters(
    @@ -3442,7 +3434,7 @@ 

    Visualize clusters

    UMAP panel title.

    - +
    umap_plots <- cluster_results_list |>
       purrr::map(
         \(cluster_df) {
    @@ -3452,7 +3444,7 @@ 

    Visualize clusters

    # Create a title for the UMAP with both parameters umap_title <- glue::glue( - "nn: {unique(cluster_df$nn)}; res: {unique(cluster_df$resolution)}" + "nn: {cluster_df$nn[1]}; res: {cluster_df$resolution[1]}" ) # Plot the UMAP, colored by the new cluster variable @@ -3464,16 +3456,18 @@

    Visualize clusters

    theme( axis.ticks = element_blank(), axis.text = element_blank(), - legend.position = "bottom" + # Ensure legends fit in the figure + legend.position = "bottom", + legend.key.size = unit(0.2, "cm") ) } ) # Print the plots with patchwork::wrap_plots() -patchwork::wrap_plots(umap_plots)
    +patchwork::wrap_plots(umap_plots, ncol = 3)
    -

    +

    @@ -3492,36 +3486,26 @@

    Neighborhood purity

    single data frame.

    - +
    purity_df <- cluster_results_list |>
       purrr::map(
         \(cluster_df) {
           rOpenScPCA::calculate_purity(pca_matrix, cluster_df)
         }
       ) |>
    -  dplyr::bind_rows()
    + purrr::list_rbind()
    -

    We’ll add a column resolution_label which we’ll use to -have informative panel titles in the faceted ggplot we make next.

    - -
    purity_df <- purity_df |>
    -  dplyr::mutate(resolution_label = glue::glue("Resolution: {resolution}"))
    - - - - - - +
    ggplot(purity_df) +
       aes(x = as.factor(nn), y = purity, fill = as.factor(nn)) +
       geom_boxplot() +
       scale_fill_brewer(palette = "Pastel2") +
    -  # facet by resolution
    -  facet_wrap(vars(resolution_label)) +
    +  # facet by resolution, labeling panels with both the resolution column name and value
    +  facet_wrap(vars(resolution), labeller = label_both) +
       labs(
         x = "Number of nearest neighbors",
         y = "Neighborhood purity"
    @@ -3529,7 +3513,7 @@ 

    Neighborhood purity

    theme(legend.position = "none")
    -

    +

    @@ -3542,7 +3526,7 @@

    Stability

    plot interpretation, and finally make our plot.

    - +
    stability_df <- cluster_results_list |>
       purrr::map(
         \(cluster_df) {
    @@ -3558,22 +3542,18 @@ 

    Stability

    ) } ) |> - dplyr::bind_rows() - -stability_df <- stability_df |> - dplyr::mutate(resolution_label = glue::glue("Resolution: {resolution}"))
    + purrr::list_rbind()
    - +
    ggplot(stability_df) +
       aes(x = as.factor(nn), y = ari, fill = as.factor(nn)) +
       geom_boxplot() +
       scale_fill_brewer(palette = "Pastel2") +
    -  # facet by resolution
    -  facet_wrap(vars(resolution_label)) +
    +  facet_wrap(vars(resolution), labeller = label_both) +
       labs(
         x = "Number of nearest neighbors",
         y = "Adjusted Rand Index"
    @@ -3581,7 +3561,7 @@ 

    Stability

    theme(legend.position = "none")
    -

    +

    @@ -3651,7 +3631,7 @@

    Session Info

    -
    ---
title: "Comparing clustering parameters with rOpenScPCA"
date: "`r Sys.Date()`"
author: "Data Lab"
output:
  html_notebook:
    toc: yes
    toc_float: yes
    df_print: paged
---

## Introduction

Clustering algorithms have several parameters which can be varied, leading to different clustering results.
A key question when clustering, therefore, is how to identify a set of parameters that lead to robust and reliable clusters that can be used in downstream analysis. 

This notebook provides examples of how to use the `rOpenScPCA` package to:

* Calculate several versions of clustering results across several different parameterizations
* Calculate QC metrics on across clustering results

Please refer to the [`01_perform-evaluate-clustering.Rmd`](01_perform-evaluate-clustering.Rmd) notebook for a tutorial on using `rOpenScPCA` functions to:

* Calculate clusters from a single parameterization
* Calculate QC metrics on a single set of clusters, as well as explanations of the metrics themselves

This notebook will use the sample `SCPCS000001` from project `SCPCP000001`, which is assumed present in the `OpenScPCA-analysis/data/current/SCPCP000001` directory, for all examples.
Please [see this documentation](https://openscpca.readthedocs.io/en/latest/getting-started/accessing-resources/getting-access-to-data/) for more information about obtaining ScPCA data.

## Setup

### Packages


```{r packages}
library(rOpenScPCA)

suppressPackageStartupMessages({
  library(SingleCellExperiment)
  library(ggplot2)
  library(patchwork)
})

# Set ggplot theme for plots
theme_set(theme_bw())
```


### Paths

```{r base paths}
# The base path for the OpenScPCA repository
repository_base <- rprojroot::find_root(rprojroot::has_dir(".github"))

# The current data directory, found within the repository base directory
data_dir <- file.path(repository_base, "data", "current")

# The path to this module
module_base <- file.path(repository_base, "analyses", "hello-clusters")
```

```{r input file path}
# Path to processed SCE file for sample SCPCS000001
input_sce_file <- file.path(data_dir, "SCPCP000001", "SCPCS000001", "SCPCL000001_processed.rds")
```


### Set the random seed

Because clustering involves random sampling, it is important to set the random seed at the top of your analysis script or notebook to ensure reproducibility.

```{r set seed}
set.seed(2024)
```

## Read in and prepare data

To begin, we'll read in the `SingleCellExperiment` (SCE) object.

```{r read data}
# Read the SCE file
sce <- readRDS(input_sce_file)
```

For the initial cluster calculations and evaluations, we will use the PCA matrix extracted from the SCE object.
As shown in [`01_perform-evaluate-clustering.Rmd`](01_perform-evaluate-clustering.Rmd), it is also possible to use an SCE object or a Seurat object directly.


```{r extract pca data}
# Extract the PCA matrix from an SCE object
pca_matrix <- reducedDim(sce, "PCA")
```

## Varying a single clustering parameter

This section will show how to perform clustering across a set of parameters (aka, "sweep" a set of parameters) with `rOpenScPCA::sweep_clusters()`. 

This function takes a PCA matrix with row names representing unique cell ids (e.g., barcodes) as its primary argument, with additional arguments for cluster parameters. 
This function wraps the `rOpenScPCA::calculate_clusters()` function but allows you to provide a vector of parameter values to perform clustering across, as listed below.
Clusters will be calculated for all combinations of parameters values (where applicable); default values that the function will use for any unspecified parameter values are shown in parentheses

* `algorithm`: Which clustering algorithm to use (Louvain)
* `weighting`: Which weighting scheme to use (Jaccard)
* `nn`: The number of nearest neighbors (10)
* `resolution`: The resolution parameter (1; used only with Louvain and Leiden clustering)
* `objective_function`: The objective function to optimize clusters (CPM; used only with Leiden clustering)

`rOpenScPCA::sweep_clusters()` does allow you to specify values for any other parameters. 


This function will return a list of data frames of clustering results. 
Each data frame will have the following columns:

* `cell_id`: Unique cell identifiers, obtained from the PCA matrix's row names
* `cluster`: A factor column with the cluster identities
* There will be one column for each clustering parameter used

To demonstrate this function, we'll calculate clusters using the Louvain algorithm while varying the `nn` parameter:

```{r sweep clusters}
# Define nn parameter values of interest
nn_values <- seq(10, 30, 10)

# Calculate clusters varying nn, but leaving other parameters at their default values
cluster_results_list <- rOpenScPCA::sweep_clusters(
  pca_matrix,
  nn = nn_values
)
```

The resulting list has a length of three, one data frame for each `nn` parameter tested:

```{r length cluster_results_list}
length(cluster_results_list)
```

It can be helpful (although it is not strictly necessary to keep track) to name this list by the varied `nn` parameter:

```{r set list names}
names(cluster_results_list) <- glue::glue("nn_{nn_values}")
```


We can look at the first few rows of each data frame using [`purrr::map()`](https://purrr.tidyverse.org/reference/map.html) to iterate over the list:


```{r map cluster_results_list}
cluster_results_list |>
  purrr::map(head)
```

Generally speaking, `purrr::map()` can be used to iterate over this list to visualize or analyze each clustering result on its own; we'll use this approach in the following sections. 

### Visualizing clustering results

When comparing clustering results, it's important to first visualize the different clusterings to build context for interpreting QC metrics.

As one example of why this is important, we generally expect that more robust clusters will have higher values for metrics like silhouette width and neighborhood purity.
However, we also expect that having fewer clusters in the first place will also lead to higher metrics, regardless of cluster quality: When there are fewer clusters, it is more likely that clusters overlap less with one another just because there aren't many clusters in the first place.
This means that, when interpreting cluster quality metrics, you should be careful to take more context about the data into consideration and not only rely on the metric values.

We'll therefore visualize these results as UMAPs by iterating over `cluster_results_list` and combining plots with [`patchwork::wrap_plots()`](https://patchwork.data-imaginist.com/reference/wrap_plots.html).
We'll specifically use [`purrr::imap()`](https://purrr.tidyverse.org/reference/imap.html) to iterate so that we can assign this list's names as plot titles.

For this, we'll begin by extracting a table of UMAP coordinates from our SCE object.

```{r create umap_df}
umap_df <- reducedDim(sce, "UMAP") |>
  as.data.frame()
```

Next, we'll iterate over `cluster_results_list` to plot the UMAPs.

```{r plot nn umaps, fig.width = 12}
umap_plots <- cluster_results_list |>
  purrr::imap(
    \(cluster_df, clustering_name) {
      # Add a column with cluster assignments to umap_df
      umap_df_plot <- umap_df |>
        dplyr::mutate(cluster = cluster_df$cluster)

      # Plot the UMAP, colored by the new cluster variable
      ggplot(umap_df_plot, aes(x = UMAP1, y = UMAP2, color = cluster)) +
        geom_point(alpha = 0.6) +
        labs(title = glue::glue("nearest neighbors: {clustering_name}")) +
        # We'll add a couple UMAP plot settings here, including equal axes and
        # turning off the axis ticks and text since UMAP coordinates are not meaningful
        coord_equal() +
        theme(
          axis.ticks = element_blank(),
          axis.text = element_blank()
        )
    }
  )

# Print the plots with patchwork::wrap_plots()
patchwork::wrap_plots(umap_plots)
```

These plots show that the number of clusters decreases as the nearest neighbors parameter increases, with between 9-13 clusters.




### Evaluating clustering results

This section will use `purrr::map()` to iterate over each clustering result data frame to calculate silhouette width, neighborhood purity, and stability, and then visualize results.
The goal of this code is to identify whether one clustering parameterization produces more reliable clusters. 


#### Silhouette width and neighborhood purity

Both silhouette width and neighborhood purity are cell-level quantities, so we can calculate them together in the same call to `purrr::map()`.
Below, we'll iterate over each data frame in `cluster_results_list` to calculate these quantities.

```{r calculate cell level metrics}
cell_metric_list <- cluster_results_list |>
  purrr::map(
    \(cluster_df) {
      # calculate silhouette width
      silhouette_df <- rOpenScPCA::calculate_silhouette(pca_matrix, cluster_df)

      # calculate neighbhorhood purity
      purity_df <- rOpenScPCA::calculate_purity(pca_matrix, cluster_df)

      # Combine into a single data frame
      dplyr::left_join(silhouette_df, purity_df)
    }
  )

# View the first six rows of each clustering result's cell-level QC metrics
purrr::map(cell_metric_list, head)
```


To visualize these results, we can combine all data frames in this list into a single overall data frame, where the existing `nn` column will distinguish among conditions.

```{r combine cell metrics list}
cell_metrics_df <- dplyr::bind_rows(cell_metric_list)
```

We can visualize silhouette width and neighborhood purity each with boxplots, for example, and use the [`patchwork`](https://patchwork.data-imaginist.com/) package to print them together:


```{r}
# Plot silhouette width
silhouette_plot <- ggplot(cell_metrics_df) +
  aes(x = as.factor(nn), y = silhouette_width, fill = as.factor(nn)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Pastel2") +
  labs(
    x = "Number of nearest neighbors",
    y = "Silhouette width"
  )


# Plot neighborhood purity width
purity_plot <- ggplot(cell_metrics_df) +
  aes(x = as.factor(nn), y = purity, fill = as.factor(nn)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Pastel2") +
  labs(
    x = "Number of nearest neighbors",
    y = "Neighborhood purity"
  )


# Add together using the patchwork library, without a legend
silhouette_plot + purity_plot & theme(legend.position = "none")
```

While there does not appear to be a salient difference among silhouette width distributions, it does appear that purity is higher with a higher nearest neighbors parameter.


#### Stability

Next, we'll calculate stability on the clusters using `rOpenScPCA::calculate_stability()`, specifying the same parameter used for the original cluster calculation at each iteration. 

```{r calculate stability}
stability_list <- cluster_results_list |>
  purrr::map(
    \(cluster_df) {
      nn <- unique(cluster_df$nn)

      # calculate stability, passing in the parameter value used for this iteration
      rOpenScPCA::calculate_stability(pca_matrix, cluster_df, nn = nn)
    }
  )
```

We'll again combine the output of `stability_list` into a single data frame and plot `ari` values across `nn` parameterizations.


```{r combine plot stability}
stability_df <- dplyr::bind_rows(stability_list)

ggplot(stability_df) +
  aes(x = as.factor(nn), y = ari, fill = as.factor(nn)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Pastel2") +
  labs(
    x = "Number of nearest neighbors",
    y = "Adjusted Rand Index"
  ) +
  theme(legend.position = "none")
```

Here, we see that a nearest neighbors value of 20 or 30 leads to more stable clustering results compared to 10. 


## Varying multiple clustering parameters

The previous section demonstrated how to calculate clusters and QC metrics when varying one parameter, but it is possible to vary multiple parameters at once with `rOpenScPCA::sweep_clusters()`.
In this section, we'll show an overview of how you might write code to vary two parameters at once (here, nearest neighbors and resolution as examples) and visualize results.


```{r sweep two parameters}
# Define vectors of parameters to vary
nn_values <- seq(10, 30, 10)
res_values <- seq(5, 15, 5) / 10


cluster_results_list <- rOpenScPCA::sweep_clusters(
  pca_matrix,
  nn = nn_values,
  resolution = res_values
)
```

This resulting list now has 9 different clustering results, for each combination of `nn_values` and `res_values`:


```{r length cluster_results_list two parameters}
length(cluster_results_list)
```


### Visualize clusters

Next, we'll iterate over `cluster_results_list` to plot the UMAPs.
This time, we'll use `purrr::map()` and pull out parameters from each iteration's `cluster_df` to form the UMAP panel title.

```{r plot nn res umaps, fig.height = 14}
umap_plots <- cluster_results_list |>
  purrr::map(
    \(cluster_df) {
      # Add a column with cluster assignments to umap_df
      umap_df_plot <- umap_df |>
        dplyr::mutate(cluster = cluster_df$cluster)

      # Create a title for the UMAP with both parameters
      umap_title <- glue::glue(
        "nn: {unique(cluster_df$nn)}; res: {unique(cluster_df$resolution)}"
      )

      # Plot the UMAP, colored by the new cluster variable
      ggplot(umap_df_plot, aes(x = UMAP1, y = UMAP2, color = cluster)) +
        geom_point(alpha = 0.6) +
        labs(title = umap_title) +
        # We'll add a couple UMAP-specific plot settings
        coord_equal() +
        theme(
          axis.ticks = element_blank(),
          axis.text = element_blank(),
          legend.position = "bottom"
        )
    }
  )

# Print the plots with patchwork::wrap_plots()
patchwork::wrap_plots(umap_plots)
```



### Calculate and visualize QC metrics

This section presents one coding strategy to calculate and visualize results when varying two clustering parameters.
In particular, we use faceting to help display all information in one plot, by placing nearest neighbor values on the X-axis and faceting by resolution values.
Since silhouette width and neighhorbood purity calculations using generally similar code, we'll just show neighborhood purity here.

#### Neighborhood purity

First, we'll calculate neighborhood purity and combine results into a single data frame.

```{r calculate purity two parameters}
purity_df <- cluster_results_list |>
  purrr::map(
    \(cluster_df) {
      rOpenScPCA::calculate_purity(pca_matrix, cluster_df)
    }
  ) |>
  dplyr::bind_rows()
```


We'll add a column `resolution_label` which we'll use to have informative panel titles in the faceted ggplot we make next.

```{r add resolution_label column}
purity_df <- purity_df |>
  dplyr::mutate(resolution_label = glue::glue("Resolution: {resolution}"))
```

```{r visualize purity two parameters}
ggplot(purity_df) +
  aes(x = as.factor(nn), y = purity, fill = as.factor(nn)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Pastel2") +
  # facet by resolution
  facet_wrap(vars(resolution_label)) +
  labs(
    x = "Number of nearest neighbors",
    y = "Neighborhood purity"
  ) +
  theme(legend.position = "none")
```

### Stability

Similar to above, we'll calculate stability, combine results into a single data frame, add a `resolution_label` column to support plot interpretation, and finally make our plot.

```{r calculate stability two parameters}
stability_df <- cluster_results_list |>
  purrr::map(
    \(cluster_df) {
      # Extract parameters for this clustering result
      nn <- unique(cluster_df$nn)
      resolution <- unique(cluster_df$resolution)

      rOpenScPCA::calculate_stability(
        pca_matrix,
        cluster_df,
        nn = nn,
        resolution = resolution
      )
    }
  ) |>
  dplyr::bind_rows()

stability_df <- stability_df |>
  dplyr::mutate(resolution_label = glue::glue("Resolution: {resolution}"))
```


```{r visualize stability two parameters}
ggplot(stability_df) +
  aes(x = as.factor(nn), y = ari, fill = as.factor(nn)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Pastel2") +
  # facet by resolution
  facet_wrap(vars(resolution_label)) +
  labs(
    x = "Number of nearest neighbors",
    y = "Adjusted Rand Index"
  ) +
  theme(legend.position = "none")
```



## Session Info

```{r session info}
# record the versions of the packages used in this analysis and other environment information
sessionInfo()
```

    +
    ---
title: "Comparing clustering parameters with rOpenScPCA"
date: "`r Sys.Date()`"
author: "Data Lab"
output:
  html_notebook:
    toc: yes
    toc_float: yes
    df_print: paged
---

## Introduction

Clustering algorithms have several parameters which can be varied, leading to different clustering results.
A key question when clustering, therefore, is how to identify a set of parameters that lead to robust and reliable clusters that can be used in downstream analysis. 

This notebook provides examples of how to use the `rOpenScPCA` package to:

* Calculate several versions of clustering results across several different parameterizations
* Calculate QC metrics on across clustering results

Please refer to the [`01_perform-evaluate-clustering.Rmd`](01_perform-evaluate-clustering.Rmd) notebook for a tutorial on using `rOpenScPCA` functions to:

* Calculate clusters from a single parameterization
* Calculate QC metrics on a single set of clusters, as well as explanations of the metrics themselves

This notebook will use the sample `SCPCS000001` from project `SCPCP000001`, which is assumed present in the `OpenScPCA-analysis/data/current/SCPCP000001` directory, for all examples.
Please [see this documentation](https://openscpca.readthedocs.io/en/latest/getting-started/accessing-resources/getting-access-to-data/) for more information about obtaining ScPCA data.

## Setup

### Packages


```{r packages}
library(rOpenScPCA)

suppressPackageStartupMessages({
  library(SingleCellExperiment)
  library(ggplot2)
  library(patchwork)
})

# Set ggplot theme for plots
theme_set(theme_bw())
```


### Paths

```{r base paths}
# The base path for the OpenScPCA repository
repository_base <- rprojroot::find_root(rprojroot::has_dir(".github"))

# The current data directory, found within the repository base directory
data_dir <- file.path(repository_base, "data", "current")

# The path to this module
module_base <- file.path(repository_base, "analyses", "hello-clusters")
```

```{r input file path}
# Path to processed SCE file for sample SCPCS000001
input_sce_file <- file.path(data_dir, "SCPCP000001", "SCPCS000001", "SCPCL000001_processed.rds")
```


### Set the random seed

Because clustering involves random sampling, it is important to set the random seed at the top of your analysis script or notebook to ensure reproducibility.

```{r set seed}
set.seed(2024)
```

## Read in and prepare data

To begin, we'll read in the `SingleCellExperiment` (SCE) object.

```{r read data}
# Read the SCE file
sce <- readRDS(input_sce_file)
```

For the initial cluster calculations and evaluations, we will use the PCA matrix extracted from the SCE object.
As shown in [`01_perform-evaluate-clustering.Rmd`](01_perform-evaluate-clustering.Rmd), it is also possible to use an SCE object or a Seurat object directly.


```{r extract pca data}
# Extract the PCA matrix from an SCE object
pca_matrix <- reducedDim(sce, "PCA")
```

## Varying a single clustering parameter

This section will show how to perform clustering across a set of parameters (aka, "sweep" a set of parameters) with `rOpenScPCA::sweep_clusters()`. 

This function takes a PCA matrix with row names representing unique cell ids (e.g., barcodes) as its primary argument, with additional arguments for cluster parameters. 
This function wraps the `rOpenScPCA::calculate_clusters()` function but allows you to provide a vector of parameter values to perform clustering across, as listed below.
Clusters will be calculated for all combinations of parameters values (where applicable); default values that the function will use for any unspecified parameter values are shown in parentheses.

* `algorithm`: Which clustering algorithm to use (Louvain)
* `weighting`: Which weighting scheme to use (Jaccard)
* `nn`: The number of nearest neighbors (10)
* `resolution`: The resolution parameter (1; used only with Louvain and Leiden clustering)
* `objective_function`: The objective function to optimize clusters (CPM; used only with Leiden clustering)

`rOpenScPCA::sweep_clusters()` does not allow you to specify values for any other parameters. 


This function will return a list of data frames of clustering results. 
Each data frame will have the following columns:

* `cell_id`: Unique cell identifiers, obtained from the PCA matrix's row names
* `cluster`: A factor column with the cluster identities
* There will be one column for each clustering parameter used

To demonstrate this function, we'll calculate clusters using the Louvain algorithm while varying the `nn` parameter:

```{r sweep clusters}
# Define nn parameter values of interest
nn_values <- c(10, 20, 30)

# Calculate clusters varying nn, but leaving other parameters at their default values
cluster_results_list <- rOpenScPCA::sweep_clusters(
  pca_matrix,
  nn = nn_values
)
```

The resulting list has a length of three, one data frame for each `nn` parameter tested:

```{r length cluster_results_list}
length(cluster_results_list)
```

It can be helpful (although it is not strictly necessary to keep track) to name this list by the varied `nn` parameter.
In this case, we'll use these names to label plots.

```{r set list names}
names(cluster_results_list) <- glue::glue("nn_{nn_values}")
```


We can look at the first few rows of each data frame using [`purrr::map()`](https://purrr.tidyverse.org/reference/map.html) to iterate over the list:


```{r map cluster_results_list}
cluster_results_list |>
  purrr::map(head)
```

Generally speaking, `purrr::map()` can be used to iterate over this list to visualize or analyze each clustering result on its own; we'll use this approach in the following sections. 

### Visualizing clustering results

When comparing clustering results, it's important to first visualize the different clusterings to build context for interpreting QC metrics.

As one example of why this is important, we generally expect that more robust clusters will have higher values for metrics like silhouette width and neighborhood purity.
However, we also expect that having fewer clusters in the first place will also lead to higher metrics, regardless of cluster quality: When there are fewer clusters, it is more likely that clusters overlap less with one another just because there aren't many clusters in the first place.
This means that, when interpreting cluster quality metrics, you should be careful to take more context about the data into consideration and not only rely on the metric values.

We'll therefore visualize these results as UMAPs by iterating over `cluster_results_list` and combining plots with [`patchwork::wrap_plots()`](https://patchwork.data-imaginist.com/reference/wrap_plots.html).
We'll specifically use [`purrr::imap()`](https://purrr.tidyverse.org/reference/imap.html) to iterate so that we can assign this list's names as plot titles.

For this, we'll begin by extracting a table of UMAP coordinates from our SCE object.

```{r create umap_df}
umap_df <- reducedDim(sce, "UMAP") |>
  as.data.frame()
```

Next, we'll iterate over `cluster_results_list` to plot the UMAPs.

```{r plot nn umaps, fig.width = 12}
umap_plots <- cluster_results_list |>
  purrr::imap(
    \(cluster_df, clustering_name) {
      # Add a column with cluster assignments to umap_df
      umap_df_plot <- umap_df |>
        dplyr::mutate(cluster = cluster_df$cluster)

      # Plot the UMAP, colored by the new cluster variable
      ggplot(umap_df_plot, aes(x = UMAP1, y = UMAP2, color = cluster)) +
        geom_point(alpha = 0.6) +
        labs(title = glue::glue("nearest neighbors: {clustering_name}")) +
        # We'll add a couple UMAP plot settings here, including equal axes and
        # turning off the axis ticks and text since UMAP coordinates are not meaningful
        coord_equal() +
        theme(
          axis.ticks = element_blank(),
          axis.text = element_blank()
        )
    }
  )

# Print the plots with patchwork::wrap_plots()
patchwork::wrap_plots(umap_plots, ncol = 3)
```

These plots show that the number of clusters decreases as the nearest neighbors parameter increases, with between 9-13 clusters.




### Evaluating clustering results

This section will use `purrr::map()` to iterate over each clustering result data frame to calculate silhouette width, neighborhood purity, and stability, and then visualize results.
The goal of this code is to identify whether one clustering parameterization produces more reliable clusters. 


#### Silhouette width and neighborhood purity

Both silhouette width and neighborhood purity are cell-level quantities, so we can calculate them together in the same call to `purrr::map()`.
Below, we'll iterate over each data frame in `cluster_results_list` to calculate these quantities.

```{r calculate cell level metrics}
cell_metric_list <- cluster_results_list |>
  purrr::map(
    \(cluster_df) {
      # calculate silhouette width
      silhouette_df <- rOpenScPCA::calculate_silhouette(pca_matrix, cluster_df)

      # calculate neighbhorhood purity, starting from silhouette_df
      rOpenScPCA::calculate_purity(pca_matrix, silhouette_df)
    }
  )

# View the first six rows of each clustering result's cell-level QC metrics
purrr::map(cell_metric_list, head)
```


To visualize these results, we can combine all data frames in this list into a single overall data frame, where the existing `nn` column will distinguish among conditions.

```{r combine cell metrics list}
cell_metrics_df <- purrr::list_rbind(cell_metric_list)
```

We can visualize silhouette width and neighborhood purity each with boxplots, for example, and use the [`patchwork`](https://patchwork.data-imaginist.com/) package to print them together:


```{r}
# Plot silhouette width
silhouette_plot <- ggplot(cell_metrics_df) +
  aes(x = as.factor(nn), y = silhouette_width, fill = as.factor(nn)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Pastel2") +
  labs(
    x = "Number of nearest neighbors",
    y = "Silhouette width"
  )


# Plot neighborhood purity width
purity_plot <- ggplot(cell_metrics_df) +
  aes(x = as.factor(nn), y = purity, fill = as.factor(nn)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Pastel2") +
  labs(
    x = "Number of nearest neighbors",
    y = "Neighborhood purity"
  )


# Add together using the patchwork library, without a legend
silhouette_plot + purity_plot & theme(legend.position = "none")
```

While there does not appear to be a salient difference among silhouette width distributions, it does appear that purity is higher with a higher nearest neighbors parameter.
It's worth noting that this trend in purity values is expected: Higher nearest neighbor parameter values lead to fewer clusters, and neighborhood purity tends to be higher when there are fewer clusters. 


#### Stability

Next, we'll calculate stability on the clusters using `rOpenScPCA::calculate_stability()`, specifying the same parameter used for the original cluster calculation at each iteration. 

```{r calculate stability}
stability_list <- cluster_results_list |>
  purrr::map(
    \(cluster_df) {
      nn <- cluster_df$nn[1] # all rows have the same `nn` parameter, so we'll take the first

      # calculate stability, passing in the parameter value used for this iteration
      rOpenScPCA::calculate_stability(pca_matrix, cluster_df, nn = nn)
    }
  )
```

We'll again combine the output of `stability_list` into a single data frame and plot `ari` values across `nn` parameterizations.


```{r combine plot stability}
stability_df <- purrr::list_rbind(stability_list)

ggplot(stability_df) +
  aes(x = as.factor(nn), y = ari, fill = as.factor(nn)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Pastel2") +
  labs(
    x = "Number of nearest neighbors",
    y = "Adjusted Rand Index"
  ) +
  theme(legend.position = "none")
```

Here, we see that a nearest neighbors value of 20 or 30 leads to more stable clustering results compared to 10. 


## Varying multiple clustering parameters

The previous section demonstrated how to calculate clusters and QC metrics when varying one parameter, but it is possible to vary multiple parameters at once with `rOpenScPCA::sweep_clusters()`.
In this section, we'll show an overview of how you might write code to vary two parameters at once (here, nearest neighbors and resolution as examples) and visualize results.


```{r sweep two parameters}
# Define vectors of parameters to vary
nn_values <- c(10, 20, 30)
res_values <- c(0.5, 1.0, 1.5)


cluster_results_list <- rOpenScPCA::sweep_clusters(
  pca_matrix,
  nn = nn_values,
  resolution = res_values
)
```

This resulting list now has 9 different clustering results, for each combination of `nn_values` and `res_values`:


```{r length cluster_results_list two parameters}
length(cluster_results_list)
```


### Visualize clusters

Next, we'll iterate over `cluster_results_list` to plot the UMAPs.
This time, we'll use `purrr::map()` and pull out parameters from each iteration's `cluster_df` to form the UMAP panel title.

```{r plot nn res umaps, fig.height = 14}
umap_plots <- cluster_results_list |>
  purrr::map(
    \(cluster_df) {
      # Add a column with cluster assignments to umap_df
      umap_df_plot <- umap_df |>
        dplyr::mutate(cluster = cluster_df$cluster)

      # Create a title for the UMAP with both parameters
      umap_title <- glue::glue(
        "nn: {cluster_df$nn[1]}; res: {cluster_df$resolution[1]}"
      )

      # Plot the UMAP, colored by the new cluster variable
      ggplot(umap_df_plot, aes(x = UMAP1, y = UMAP2, color = cluster)) +
        geom_point(alpha = 0.6) +
        labs(title = umap_title) +
        # We'll add a couple UMAP-specific plot settings
        coord_equal() +
        theme(
          axis.ticks = element_blank(),
          axis.text = element_blank(),
          # Ensure legends fit in the figure
          legend.position = "bottom",
          legend.key.size = unit(0.2, "cm")
        )
    }
  )

# Print the plots with patchwork::wrap_plots()
patchwork::wrap_plots(umap_plots, ncol = 3)
```



### Calculate and visualize QC metrics

This section presents one coding strategy to calculate and visualize results when varying two clustering parameters.
In particular, we use faceting to help display all information in one plot, by placing nearest neighbor values on the X-axis and faceting by resolution values.
Since silhouette width and neighhorbood purity calculations using generally similar code, we'll just show neighborhood purity here.

#### Neighborhood purity

First, we'll calculate neighborhood purity and combine results into a single data frame.

```{r calculate purity two parameters}
purity_df <- cluster_results_list |>
  purrr::map(
    \(cluster_df) {
      rOpenScPCA::calculate_purity(pca_matrix, cluster_df)
    }
  ) |>
  purrr::list_rbind()
```


```{r visualize purity two parameters}
ggplot(purity_df) +
  aes(x = as.factor(nn), y = purity, fill = as.factor(nn)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Pastel2") +
  # facet by resolution, labeling panels with both the resolution column name and value
  facet_wrap(vars(resolution), labeller = label_both) +
  labs(
    x = "Number of nearest neighbors",
    y = "Neighborhood purity"
  ) +
  theme(legend.position = "none")
```

### Stability

Similar to above, we'll calculate stability, combine results into a single data frame, add a `resolution_label` column to support plot interpretation, and finally make our plot.

```{r calculate stability two parameters}
stability_df <- cluster_results_list |>
  purrr::map(
    \(cluster_df) {
      # Extract parameters for this clustering result
      nn <- unique(cluster_df$nn)
      resolution <- unique(cluster_df$resolution)

      rOpenScPCA::calculate_stability(
        pca_matrix,
        cluster_df,
        nn = nn,
        resolution = resolution
      )
    }
  ) |>
  purrr::list_rbind()
```


```{r visualize stability two parameters}
ggplot(stability_df) +
  aes(x = as.factor(nn), y = ari, fill = as.factor(nn)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Pastel2") +
  facet_wrap(vars(resolution), labeller = label_both) +
  labs(
    x = "Number of nearest neighbors",
    y = "Adjusted Rand Index"
  ) +
  theme(legend.position = "none")
```



## Session Info

```{r session info}
# record the versions of the packages used in this analysis and other environment information
sessionInfo()
```
