@@ -12,8 +12,9 @@ strategy that works best, and so we recommend exploring different
12
12
annotation strategies. Also, careful interpretation of the matrix
13
13
factorization is discussed.
14
14
15
- The plotting functions used in this analysis are from
16
- [ fastTopics] [ fastTopics ] .
15
+ A side benefit of this investigation is to illustate some useful
16
+ plotting strategies, including the ` annotation_heatmap() ` function
17
+ from the [ fastTopics] [ fastTopics package ] .
17
18
18
19
``` {r knitr-opts, include=FALSE}
19
20
knitr::opts_chunk$set(comment = "#",collapse = TRUE,results = "hold",
@@ -27,6 +28,7 @@ library(Matrix)
27
28
library(flashier)
28
29
library(fastTopics)
29
30
library(ggplot2)
31
+ library(ggrepel)
30
32
library(cowplot)
31
33
```
32
34
@@ -114,11 +116,16 @@ plot_grid(p1,p2,nrow = 1,ncol = 2)
114
116
Strategy (i) picks out some canonical marker genes for islet cells
115
117
such as * INS* for beta cells and * GCG* for alpha cells. But it also
116
118
picks out other genes that are highly expressed in multiple islet cell
117
- types, such as * TTR * and * CHGB * . Strategy (ii) focusses more strongly
119
+ types, such as * SCGN * and * TTR * . Strategy (ii) focusses more strongly
118
120
on genes that distinguish one cell type from another, and as a result
119
121
marker genes such as * MAFA* (beta cells) and * GC* (alpha cells) are
120
122
ranked more highly with this strategy.
121
123
124
+ Below, we take a closer look at the ranking of the genes based on
125
+ these two strategies, and suggest another simple visualization which
126
+ could be useful. (See: "A closer look at ranking genes by largest
127
+ versus distinctive".)
128
+
122
129
The better strategy will depend on the setting and on the goals of the
123
130
analysis, which is why the ` annotation_heatmap ` function provides both
124
131
options. These selection strategies can also reveal complementary
@@ -212,4 +219,79 @@ p2 <- annotation_heatmap(F,n = 8,dims = kset,
212
219
plot_grid(p1,p2,nrow = 1,ncol = 2)
213
220
```
214
221
222
+ Note that the F matrix in the semi-NMF allows for both positive and
223
+ negative log-fold changes.
224
+
225
+ ## A closer look at ranking genes by largest vs. distinctive
226
+
227
+ Above, we compared gene selection strategies for some annotation
228
+ heatmaps of NMF results. Here we visualize how these two different
229
+ strategies result in two different gene rankings. And this
230
+ visualization may be useful on its own to annotate the factors.
231
+
232
+ First we define a couple of functions used to create some plots.
233
+
234
+ This function computes the "least extreme" (l.e.) effect differences
235
+ for a non-negative effects matrix:
236
+
237
+ ``` {r compute-le-diff}
238
+ compute_le_diff <- function (effects_matrix,
239
+ compare_dims = seq(1,ncol(effects_matrix))) {
240
+ m <- ncol(effects_matrix)
241
+ out <- effects_matrix
242
+ for (i in 1:m) {
243
+ dims <- setdiff(compare_dims,i)
244
+ out[,i] <- effects_matrix[,i] - apply(effects_matrix[,dims],1,max)
245
+ }
246
+ return(out)
247
+ }
248
+ ```
249
+
250
+ This function will be used to create the scatterplots:
251
+
252
+ ``` {r distinctive-gene-scatterplot}
253
+ distinctive_genes_scatterplot <- function (effects_matrix, k,
254
+ effect_quantile_prob = 0.999,
255
+ lediff_quantile_prob = 0.999) {
256
+ lediff <- compute_le_diff(effects_matrix)
257
+ genes <- rownames(effects_matrix)
258
+ pdat <- data.frame(gene = genes,
259
+ effect = effects_matrix[,k],
260
+ lediff = lediff[,k])
261
+ effect_quantile <- quantile(pdat$effect,effect_quantile_prob)
262
+ lediff_quantile <- quantile(pdat$lediff,lediff_quantile_prob)
263
+ i <- which(pdat$effect < effect_quantile & pdat$lediff < lediff_quantile)
264
+ pdat[i,"gene"] <- NA
265
+ return(ggplot(pdat,aes(x = effect,y = lediff,label = gene)) +
266
+ geom_point(color = "dodgerblue") +
267
+ geom_hline(yintercept = 0,color = "magenta",linetype = "dotted",
268
+ linewidth = 0.5) +
269
+ geom_text_repel(color = "black",size = 2,
270
+ fontface = "italic",segment.color = "black",
271
+ segment.size = 0.25,min.segment.length = 0,
272
+ max.overlaps = Inf,na.rm = TRUE) +
273
+ labs(x = "log-fold change",y = "l.e. difference") +
274
+ theme_cowplot(font_size = 9))
275
+ }
276
+ ```
277
+
278
+ Now we compare the two different gene rankings in the scatterplots for
279
+ factors 4, 5 and 6 of the flashier NMF result:
280
+
281
+ ``` {r distinctive-gene-scatterplots-flashier-nmf, fig.height=2.5, fig.width=7.5}
282
+ F <- fl_nmf_ldf$F
283
+ colnames(F) <- paste0("k",1:9)
284
+ kset <- paste0("k",4:6)
285
+ p1 <- distinctive_genes_scatterplot(F[,kset],"k4") + ggtitle("factor k4")
286
+ p2 <- distinctive_genes_scatterplot(F[,kset],"k5") + ggtitle("factor k5")
287
+ p3 <- distinctive_genes_scatterplot(F[,kset],"k6") + ggtitle("factor k6")
288
+ print(plot_grid(p1,p2,p3,nrow = 1,ncol = 3))
289
+ ```
290
+
291
+ It is clear from these scatterplots that the rankings are very
292
+ different, and strikingly so for factor 5 representing alpha cells.
293
+ This means that many of the top-ranked genes for factor 5 (largest
294
+ increases in expression) also show very large increases in other islet
295
+ cells, e.g., * SCG5* .
296
+
215
297
[ fastTopics ] : https://github.com/stephenslab/fastTopics/
0 commit comments