Skip to content

Commit 6cf9aa7

Browse files
committed
wflow_publish("pancreas_annotate.Rmd", verbose = TRUE, view = FALSE)
1 parent 105293f commit 6cf9aa7

File tree

1 file changed

+85
-3
lines changed

1 file changed

+85
-3
lines changed

analysis/pancreas_annotate.Rmd

+85-3
Original file line numberDiff line numberDiff line change
@@ -12,8 +12,9 @@ strategy that works best, and so we recommend exploring different
1212
annotation strategies. Also, careful interpretation of the matrix
1313
factorization is discussed.
1414

15-
The plotting functions used in this analysis are from
16-
[fastTopics][fastTopics].
15+
A side benefit of this investigation is to illustate some useful
16+
plotting strategies, including the `annotation_heatmap()` function
17+
from the [fastTopics][fastTopics package].
1718

1819
```{r knitr-opts, include=FALSE}
1920
knitr::opts_chunk$set(comment = "#",collapse = TRUE,results = "hold",
@@ -27,6 +28,7 @@ library(Matrix)
2728
library(flashier)
2829
library(fastTopics)
2930
library(ggplot2)
31+
library(ggrepel)
3032
library(cowplot)
3133
```
3234

@@ -114,11 +116,16 @@ plot_grid(p1,p2,nrow = 1,ncol = 2)
114116
Strategy (i) picks out some canonical marker genes for islet cells
115117
such as *INS* for beta cells and *GCG* for alpha cells. But it also
116118
picks out other genes that are highly expressed in multiple islet cell
117-
types, such as *TTR* and *CHGB*. Strategy (ii) focusses more strongly
119+
types, such as *SCGN* and *TTR*. Strategy (ii) focusses more strongly
118120
on genes that distinguish one cell type from another, and as a result
119121
marker genes such as *MAFA* (beta cells) and *GC* (alpha cells) are
120122
ranked more highly with this strategy.
121123

124+
Below, we take a closer look at the ranking of the genes based on
125+
these two strategies, and suggest another simple visualization which
126+
could be useful. (See: "A closer look at ranking genes by largest
127+
versus distinctive".)
128+
122129
The better strategy will depend on the setting and on the goals of the
123130
analysis, which is why the `annotation_heatmap` function provides both
124131
options. These selection strategies can also reveal complementary
@@ -212,4 +219,79 @@ p2 <- annotation_heatmap(F,n = 8,dims = kset,
212219
plot_grid(p1,p2,nrow = 1,ncol = 2)
213220
```
214221

222+
Note that the F matrix in the semi-NMF allows for both positive and
223+
negative log-fold changes.
224+
225+
## A closer look at ranking genes by largest vs. distinctive
226+
227+
Above, we compared gene selection strategies for some annotation
228+
heatmaps of NMF results. Here we visualize how these two different
229+
strategies result in two different gene rankings. And this
230+
visualization may be useful on its own to annotate the factors.
231+
232+
First we define a couple of functions used to create some plots.
233+
234+
This function computes the "least extreme" (l.e.) effect differences
235+
for a non-negative effects matrix:
236+
237+
```{r compute-le-diff}
238+
compute_le_diff <- function (effects_matrix,
239+
compare_dims = seq(1,ncol(effects_matrix))) {
240+
m <- ncol(effects_matrix)
241+
out <- effects_matrix
242+
for (i in 1:m) {
243+
dims <- setdiff(compare_dims,i)
244+
out[,i] <- effects_matrix[,i] - apply(effects_matrix[,dims],1,max)
245+
}
246+
return(out)
247+
}
248+
```
249+
250+
This function will be used to create the scatterplots:
251+
252+
```{r distinctive-gene-scatterplot}
253+
distinctive_genes_scatterplot <- function (effects_matrix, k,
254+
effect_quantile_prob = 0.999,
255+
lediff_quantile_prob = 0.999) {
256+
lediff <- compute_le_diff(effects_matrix)
257+
genes <- rownames(effects_matrix)
258+
pdat <- data.frame(gene = genes,
259+
effect = effects_matrix[,k],
260+
lediff = lediff[,k])
261+
effect_quantile <- quantile(pdat$effect,effect_quantile_prob)
262+
lediff_quantile <- quantile(pdat$lediff,lediff_quantile_prob)
263+
i <- which(pdat$effect < effect_quantile & pdat$lediff < lediff_quantile)
264+
pdat[i,"gene"] <- NA
265+
return(ggplot(pdat,aes(x = effect,y = lediff,label = gene)) +
266+
geom_point(color = "dodgerblue") +
267+
geom_hline(yintercept = 0,color = "magenta",linetype = "dotted",
268+
linewidth = 0.5) +
269+
geom_text_repel(color = "black",size = 2,
270+
fontface = "italic",segment.color = "black",
271+
segment.size = 0.25,min.segment.length = 0,
272+
max.overlaps = Inf,na.rm = TRUE) +
273+
labs(x = "log-fold change",y = "l.e. difference") +
274+
theme_cowplot(font_size = 9))
275+
}
276+
```
277+
278+
Now we compare the two different gene rankings in the scatterplots for
279+
factors 4, 5 and 6 of the flashier NMF result:
280+
281+
```{r distinctive-gene-scatterplots-flashier-nmf, fig.height=2.5, fig.width=7.5}
282+
F <- fl_nmf_ldf$F
283+
colnames(F) <- paste0("k",1:9)
284+
kset <- paste0("k",4:6)
285+
p1 <- distinctive_genes_scatterplot(F[,kset],"k4") + ggtitle("factor k4")
286+
p2 <- distinctive_genes_scatterplot(F[,kset],"k5") + ggtitle("factor k5")
287+
p3 <- distinctive_genes_scatterplot(F[,kset],"k6") + ggtitle("factor k6")
288+
print(plot_grid(p1,p2,p3,nrow = 1,ncol = 3))
289+
```
290+
291+
It is clear from these scatterplots that the rankings are very
292+
different, and strikingly so for factor 5 representing alpha cells.
293+
This means that many of the top-ranked genes for factor 5 (largest
294+
increases in expression) also show very large increases in other islet
295+
cells, e.g., *SCG5*.
296+
215297
[fastTopics]: https://github.com/stephenslab/fastTopics/

0 commit comments

Comments
 (0)