This repository has been archived by the owner on Jun 21, 2023. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 83
/
Copy pathlancet-paired-WXS-WGS.Rmd
127 lines (102 loc) · 2.98 KB
/
lancet-paired-WXS-WGS.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
---
title: "Lancet paired WXS and WGS data"
output:
html_notebook:
toc: TRUE
toc_float: TRUE
author: C. Savonen for ALSF CCDL
date: 2020
---
#### Purpose:
This analysis resulted from an [idea in a comment here](https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/548#issuecomment-589749355).
In short we were seeing particularly low VAF for Lancet TCGA data and were wondering
whether this had to do with the data being WXS instead of WGS.
### Conclusion:
This notebook shows within a group of participants who have WXS and WGS data, Lancet seems to call more mutations of low VAF in WXS data than in WGS. This leads us to question the validity of Lancet's WXS calls with the current data.
basis.
Declare names of input and output directories.
```{r}
# magrittr pipe
`%>%` <- dplyr::`%>%`
data_dir <- file.path("..", "..", "..", "data")
scratch_dir <- file.path("..", "..", "..", "scratch")
```
Read in the metadata.
```{r}
metadata <- readr::read_tsv(file.path(data_dir, "pbta-histologies.tsv"), guess_max = 10000)
```
Retrieve all the participant IDs for participants that have both WGS and WXS data.
```{r}
matched_participants <- metadata %>%
dplyr::filter(experimental_strategy != "RNA-Seq") %>%
dplyr::group_by(Kids_First_Participant_ID) %>%
dplyr::summarize(strategies = paste0(experimental_strategy, collapse = ",")) %>%
dplyr::filter(grepl("WXS", strategies) & grepl("WGS", strategies)) %>%
dplyr::pull(Kids_First_Participant_ID)
```
Get the biospecimen IDS for these participants.
```{r}
biospecimens <- metadata %>%
dplyr::filter(Kids_First_Participant_ID %in% matched_participants) %>%
dplyr::pull(Kids_First_Biospecimen_ID)
```
Connect to SQLite database.
```{r}
# Start up connection
con <- DBI::dbConnect(
RSQLite::SQLite(),
file.path(scratch_dir, "snv_db.sqlite")
)
```
Make a list of columns we will keep.
```{r}
cols_to_keep <- c(
"Chromosome",
"Start_Position",
"End_Position",
"Reference_Allele",
"Allele",
"Tumor_Sample_Barcode",
"Variant_Classification",
"VAF"
)
```
Set up the Lancet data from the SQL database and only keep the biospecimens we
identified.
```{r}
lancet <- dplyr::tbl(con, "lancet") %>%
dplyr::select(cols_to_keep) %>%
dplyr::inner_join(
dplyr::tbl(con, "samples") %>%
dplyr::select(
Tumor_Sample_Barcode = Kids_First_Biospecimen_ID,
experimental_strategy, short_histology,
Kids_First_Participant_ID
)
) %>%
dplyr::filter(Tumor_Sample_Barcode %in% biospecimens) %>%
as.data.frame()
```
Plot this as a violin plot
```{r}
vaf_plot <- ggplot2::ggplot(lancet, ggplot2::aes(
x = experimental_strategy,
y = VAF,
fill = experimental_strategy
)) +
ggplot2::geom_violin() +
ggplot2::theme_classic() +
ggplot2::ggtitle("Lancet participants with WXS and WGS")
```
Print out plot with all participants together.
```{r}
vaf_plot
```
Split up data by participant.
```{r}
vaf_plot + ggplot2::facet_wrap(~ Kids_First_Participant_ID)
```
### Session Info
```{r}
sessionInfo()
```