-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathQC_Normalization.Rmd
417 lines (284 loc) · 15 KB
/
QC_Normalization.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
---
title: "Quality Control and Normalization"
output: github_document
---
Created by: Ahmed Mahfouz
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Overview
In this practical, we will walk through a pipeline to analyze single cell RNA-sequencing (scRNA-seq) data. Starting from a count matrix, we will cover the following steps of the analysis:
1. Quality control
2. Normalization
3. Feature selection
## Datasets
For this tutorial we will use 3 different PBMC datasets from the 10x Genomics website (https://support.10xgenomics.com/single-cell-gene-expression/datasets).
* 1k PBMCs using 10x v2 chemistry
* 1k PBMCs using 10x v3 chemistry
* 1k PBMCs using 10x v3 chemistry in combination with cell surface proteins, but disregarding the protein data and only looking at gene expression.
The datasets are available in this repository.
Load required packages:
```{r packages}
suppressMessages(require(Seurat))
suppressMessages(require(scater))
suppressMessages(require(scran))
suppressMessages(require(Matrix))
```
## Read the data and create a Seurat object
Here, we use the function Read10X_h5 to read in the expression matrices in R.
```{r load}
v3.1k <- Read10X_h5("pbmc_1k_v3_filtered_feature_bc_matrix.h5", use.names = T)
v2.1k <- Read10X_h5("pbmc_1k_v2_filtered_feature_bc_matrix.h5", use.names = T)
p3.1k <- Read10X_h5("pbmc_1k_protein_v3_filtered_feature_bc_matrix.h5", use.names = T)
# select only gene expression data from the CITE-seq data.
p3.1k <- p3.1k$`Gene Expression`
```
First, create Seurat objects for each of the datasets, and then merge into one large seurat object.
```{r create_Seurat}
sdata.v2.1k <- CreateSeuratObject(v2.1k, project = "v2.1k")
sdata.v3.1k <- CreateSeuratObject(v3.1k, project = "v3.1k")
sdata.p3.1k <- CreateSeuratObject(p3.1k, project = "p3.1k")
# merge into one single seurat object. Add cell ids just in case you have overlapping barcodes between the datasets.
alldata <- merge(sdata.v2.1k, c(sdata.v3.1k,sdata.p3.1k), add.cell.ids=c("v2.1k","v3.1k","p3.1k"))
# also add in a metadata column that indicates v2 vs v3 chemistry
chemistry <- rep("v3",ncol(alldata))
chemistry[Idents(alldata) == "v2.1k"] <- "v2"
alldata <- AddMetaData(alldata, chemistry, col.name = "Chemistry")
alldata
```
Check number of cells from each sample, is stored in the orig.ident slot of metadata and is autmatically set as active ident.
```{r check_samples}
table(Idents(alldata))
```
## 1. Quality control
Seurat automatically calculates some QC-stats, like number of UMIs and features per cell. Stored in columns nCount_RNA & nFeature_RNA of the metadata.
```{r check_metadata1}
head(alldata@meta.data)
```
### Calculate mitochondrial proportion
We will manually calculate the proportion of mitochondrial reads and add to the metadata table.
```{r QC1}
mt.genes <- rownames(alldata)[grep("^MT-",rownames(alldata))]
C <- GetAssayData(object = alldata, slot = "counts")
percent.mito <- colSums(C[mt.genes,])/Matrix::colSums(C)*100
alldata <- AddMetaData(alldata, percent.mito, col.name = "percent.mito")
```
#### Calculate ribosomal proportion
In the same manner we will calculate the proportion gene expression that comes from ribosomal proteins.
```{r QC2}
rb.genes <- rownames(alldata)[grep("^RP[SL]",rownames(alldata))]
percent.ribo <- colSums(C[rb.genes,])/Matrix::colSums(C)*100
alldata <- AddMetaData(alldata, percent.ribo, col.name = "percent.ribo")
```
Now have another look at the metadata table
```{r check_metadata2}
head(alldata@meta.data)
```
#### Plot QC
Now we can plot some of the QC-features as violin plots
```{r plot_nGenes}
VlnPlot(alldata, features = "nFeature_RNA", pt.size = 0.1) + NoLegend()
```
```{r plot_nUMIs}
VlnPlot(alldata, features = "nCount_RNA", pt.size = 0.1) + NoLegend()
```
```{r plot_mito}
VlnPlot(alldata, features = "percent.mito", pt.size = 0.1) + NoLegend()
```
```{r plot_ribo}
VlnPlot(alldata, features = "percent.ribo", pt.size = 0.1) + NoLegend()
```
As you can see, the v2 chemistry gives lower gene detection, but higher detection of ribosomal proteins. As the ribosomal proteins are highly expressed they will make up a larger proportion of the transcriptional landscape when fewer of the lowly expressed genes are detected.
We can also plot the different QC-measures as scatter plots.
```{r plot_scatter1}
FeatureScatter(alldata, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
```
```{r plot_scatter2}
FeatureScatter(alldata, feature1 = "nFeature_RNA", feature2 = "percent.mito")
```
```{r plot_scatter3}
FeatureScatter(alldata, feature1="percent.ribo", feature2="nFeature_RNA")
```
We can also subset the data to only plot one sample.
```{r plot_scatter4}
FeatureScatter(alldata, feature1 = "nCount_RNA", feature2 = "nFeature_RNA",
cells = WhichCells(alldata, expression = orig.ident == "v3.1k") )
```
### Filtering
#### Mitochondrial filtering
We have quite a lot of cells with high proportion of mitochondrial reads. It could be wise to remove those cells, if we have enough cells left after filtering. Another option would be to either remove all mitochondrial reads from the dataset and hope that the remaining genes still have enough biological signal. A third option would be to just regress out the percent.mito variable during scaling.
In this case we have as much as 99.7% mitochondrial reads in some of the cells, so it is quite unlikely that there is much celltype signature left in those.
Looking at the plots, make resonable decisions on where to draw the cutoff. In this case, the bulk of the cells are below 25% mitochondrial reads and that will be used as a cutoff.
```{r mito.filt}
#select cells with percent.mito < 25
selected <- WhichCells(alldata, expression = percent.mito < 25)
length(selected)
# and subset the object to only keep those cells
data.filt <- subset(alldata, cells = selected)
# plot violins for new data
VlnPlot(data.filt, features = "percent.mito")
```
As you can see, there is still quite a lot of variation in percent mito, so it will have to be dealt with in the data analysis step.
#### Gene detection filtering
Extremely high number of detected genes could indicate doublets. However, depending on the celltype composition in your sample, you may have cells with higher number of genes (and also higher counts) from one celltype.
In these datasets, there is also a clear difference between the v2 vs v3 10x chemistry with regards to gene detection, so it may not be fair to apply the same cutoffs to all of them.
Also, in the protein assay data there is a lot of cells with few detected genes giving a bimodal distribution. This type of distribution is not seen in the other 2 datasets. Considering that they are all pbmc datasets it makes sense to regard this distribution as low quality libraries.
Filter the cells with high gene detection (putative doublets) with cutoffs 4100 for v3 chemistry and 2000 for v2.
```{r gene.filt}
#start with cells with many genes detected.
high.det.v3 <- WhichCells(data.filt, expression = nFeature_RNA > 4100)
high.det.v2 <- WhichCells(data.filt, expression = nFeature_RNA > 2000 & orig.ident == "v2.1k")
# remove these cells
data.filt <- subset(data.filt, cells=setdiff(WhichCells(data.filt),c(high.det.v2,high.det.v3)))
# check number of cells
ncol(data.filt)
```
Filter the cells with low gene detection (low quality libraries) with less than 1000 genes for v2 and < 500 for v2.
```{r gene.filt2}
#start with cells with many genes detected.
low.det.v3 <- WhichCells(data.filt, expression = nFeature_RNA < 1000 & orig.ident != "v2.1k")
low.det.v2 <- WhichCells(data.filt, expression = nFeature_RNA < 500 & orig.ident == "v2.1k")
# remove these cells
data.filt <- subset(data.filt, cells=setdiff(WhichCells(data.filt),c(low.det.v2,low.det.v3)))
# check number of cells
ncol(data.filt)
```
#### Plot QC-stats again
Lets plot the same qc-stats another time.
```{r vln.plot2}
VlnPlot(data.filt, features = "nFeature_RNA", pt.size = 0.1) + NoLegend()
VlnPlot(data.filt, features = "nCount_RNA", pt.size = 0.1) + NoLegend()
VlnPlot(data.filt, features = "percent.mito", pt.size = 0.1) + NoLegend()
VlnPlot(data.filt, features = "percent.ribo", pt.size = 0.1) + NoLegend()
# and check the number of cells per sample before and after filtering
table(Idents(alldata))
table(Idents(data.filt))
```
### Calculate cell-cycle scores
Seurat has a function for calculating cell cycle scores based on a list of know S-phase and G2/M-phase genes.
```{r cc}
data.filt <- CellCycleScoring(
object = data.filt,
g2m.features = cc.genes$g2m.genes,
s.features = cc.genes$s.genes
)
VlnPlot(data.filt, features = c("S.Score","G2M.Score"))
```
In this case it looks like we only have a few cycling cells in the datasets.
## 2. Normalization
```{r setup2}
options(stringsAsFactors = FALSE)
set.seed(32546)
```
To speed things up, we will continue working with the v3.1k dataset only. We will convert the Seurat object to a SCE object to work with the scater package. You can read more about SCE objects [here](https://osca.bioconductor.org/data-infrastructure.html#the-singlecellexperiment-class).
Note: to create an SCE object directly from the count matrices, have a look at their tutorial at: https://bioconductor.org/packages/release/bioc/vignettes/scater/inst/doc/vignette-intro.html.
```{r select_dataset}
pbmc.sce <- SingleCellExperiment(assays = list(counts = as.matrix(v3.1k)))
pbmc.sce <- pbmc.sce[rowSums(counts(pbmc.sce) > 0) > 2,]
isSpike(pbmc.sce, "MT") <- grepl("^MT-", rownames(pbmc.sce))
pbmc.sce <- calculateQCMetrics(pbmc.sce)
colnames(colData(pbmc.sce))
```
Filter out poor quality cells to avoid negative size factors. These steps are very similar to what we have already done on the combined Seurat object but now we perform them on one dataset only using the Scater package.
```{r filter_poor_quality}
pbmc.sce <- filter(pbmc.sce, pct_counts_MT < 20)
pbmc.sce <- filter(pbmc.sce,
total_features_by_counts > 1000 &
total_features_by_counts < 4100)
```
Create a new assay with unnormalized counts for comparison to post-normalization.
```{r}
assay(pbmc.sce, "logcounts_raw") <- log2(counts(pbmc.sce) + 1)
plotRLE(pbmc.sce[,1:50], exprs_values = "logcounts_raw", style = "full")
```
Run PCA and save the result in a new object, as we will overwrite the PCA slot later.
```{r}
raw.sce <- runPCA(pbmc.sce, exprs_values = "logcounts_raw")
scater::plotPCA(raw.sce, colour_by = "total_counts")
```
Plot the expression of the B cell marker MS4A1.
```{r}
plotReducedDim(raw.sce, use_dimred = "PCA", by_exprs_values = "logcounts_raw",
colour_by = "MS4A1")
```
### Normalization: Log
In the default normalization method in Seurat, counts for each cell are divided by the total counts for that cell and multiplied by the scale factor 10,000. This is then log transformed.
Here we use the filtered data from the counts slot of the SCE object to create a Seurat object. After normalization, we convert the result back into a SingleCellExperiment object for comparing plots.
```{r}
pbmc.seu <- CreateSeuratObject(counts(pbmc.sce), project = "PBMC")
pbmc.seu <- NormalizeData(pbmc.seu)
pbmc.seu.sce <- as.SingleCellExperiment(pbmc.seu)
pbmc.seu.sce <- calculateQCMetrics(pbmc.seu.sce)
```
Perform PCA and examine the normalization results with plotRLE and plotReducedDim. This time, use "logcounts" as the expression values to plot (or omit the parameter, as "logcounts" is the default value). Check some marker genes, for example GNLY (NK cells) or LYZ (monocytes).
```{r}
plotRLE(pbmc.seu.sce[,1:50], style = "full")
```
```{r}
pbmc.seu.sce <- runPCA(pbmc.seu.sce)
scater::plotPCA(pbmc.seu.sce, colour_by = "total_counts")
```
```{r}
plotReducedDim(pbmc.seu.sce, use_dimred = "PCA", colour_by = "MS4A1")
```
### Normalization: scran
The normalization procedure in scran is based on the deconvolution method by Lun et al (2016). Counts from many cells are pooled to avoid the drop-out problem. Pool-based size factors are then “deconvolved” into cell-based factors for cell-specific normalization. Clustering cells prior to normalization is not always necessary but it improves normalization accuracy by reducing the number of DE genes between cells in the same cluster.
```{r}
qclust <- scran::quickCluster(pbmc.sce)
pbmc.sce <- scran::computeSumFactors(pbmc.sce, clusters = qclust)
summary(sizeFactors(pbmc.sce))
pbmc.sce <- normalize(pbmc.sce)
```
Examine the results and compare to the log-normalized result. Are they different?
```{r}
plotRLE(pbmc.sce[,1:50], exprs_values = "logcounts", exprs_logged = FALSE,
style = "full")
```
```{r}
pbmc.sce <- runPCA(pbmc.sce)
scater::plotPCA(pbmc.sce, colour_by = "total_counts")
```
```{r}
plotReducedDim(pbmc.sce, use_dimred = "PCA", colour_by = "MS4A1")
```
## 3. Feature selection
### Feature selection: scran
In the scran method for finding HVGs, a trend is first fitted to the technical variances. In the absence of spike-ins, this is done using the whole data, assuming that the majority of genes are not variably expressed. Then, the biological component of the variance for each endogenous gene is computed by subtracting the fitted value of the trend from the total variance. HVGs are then identified as those genes with the largest biological components. This avoids prioritizing genes that are highly variable due to technical factors such as sampling noise during RNA capture and library preparation. see the [scran vignette](https://bioconductor.org/packages/devel/bioc/vignettes/scran/inst/doc/scran.html#5_variance_modelling) for details.
```{r}
fit <- scran::trendVar(pbmc.sce, use.spikes = NA)
dec <- scran::decomposeVar(pbmc.sce, fit)
dec <- dec[!is.na(dec$FDR),]
top.hvgs <- order(dec$bio, decreasing = TRUE)
head(dec[top.hvgs,])
dec$HVG <- (dec$FDR<0.00001)
hvg_genes <- rownames(dec[dec$FDR < 0.00001, ])
# plot highly variable genes
plot(dec$mean, dec$total, pch=16, cex=0.6, xlab="Mean log-expression",
ylab="Variance of log-expression")
o <- order(dec$mean)
lines(dec$mean[o], dec$tech[o], col="dodgerblue", lwd=2)
points(dec$mean[dec$HVG], dec$total[dec$HVG], col="red", pch=16)
## save the decomposed variance table and hvg_genes into metadata for safekeeping
metadata(pbmc.sce)$hvg_genes <- hvg_genes
metadata(pbmc.sce)$dec_var <- dec
```
We choose genes that have a biological component that is significantly greater than zero, using a false discovery rate (FDR) of 5%.
```{r}
plotExpression(pbmc.sce, features = top.hvgs[1:10])
```
### Feature selection: Seurat
The default method in Seurat 3 is variance-stabilizing transformation. A trend is fitted to to predict the variance of each gene as a function of its mean. For each gene, the variance of standardized values is computed across all cells and used to rank the features. By default, 2000 top genes are returned.
```{r}
pbmc.seu <- FindVariableFeatures(pbmc.seu, selection.method = "vst")
top10 <- head(VariableFeatures(pbmc.seu), 10)
vplot <- VariableFeaturePlot(pbmc.seu)
LabelPoints(plot = vplot, points = top10, repel = TRUE)
```
How many of the variable genes detected with scran are included in VariableFeatures in Seurat?
```{r}
table(hvg_genes %in% VariableFeatures(pbmc.seu))
```
### Session info
```{r}
sessionInfo()
```