analyses/snv-callers/lancet-wxs-tests/lancet-paired-WXS-WGS.Rmd

---
title: "Lancet paired WXS and WGS data"
output: 
  html_notebook:
    toc: TRUE
    toc_float: TRUE
author: C. Savonen for ALSF CCDL
date: 2020
---

#### Purpose: 
This analysis resulted from an [idea in a comment here](https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/548#issuecomment-589749355).
In short we were seeing particularly low VAF for Lancet TCGA data and were wondering 
whether this had to do with the data being WXS instead of WGS. 

### Conclusion: 
This notebook shows within a group of participants who have WXS and WGS data, Lancet seems to call more mutations of low VAF in WXS data than in WGS. This leads us to question the validity of Lancet's WXS calls with the current data. 
basis. 

Declare names of input and output directories.

```{r}
# magrittr pipe
`%>%` <- dplyr::`%>%`

data_dir <- file.path("..", "..", "..", "data")
scratch_dir <- file.path("..", "..", "..", "scratch")
```

Read in the metadata. 

```{r}
metadata <- readr::read_tsv(file.path(data_dir, "pbta-histologies.tsv"), guess_max = 10000)
```

Retrieve all the participant IDs for participants that have both WGS and WXS data. 

```{r}
matched_participants <- metadata %>%
  dplyr::filter(experimental_strategy != "RNA-Seq") %>%
  dplyr::group_by(Kids_First_Participant_ID) %>%
  dplyr::summarize(strategies = paste0(experimental_strategy, collapse = ",")) %>%
  dplyr::filter(grepl("WXS", strategies) & grepl("WGS", strategies)) %>%
  dplyr::pull(Kids_First_Participant_ID)
```

Get the biospecimen IDS for these participants. 

```{r}
biospecimens <- metadata %>%
  dplyr::filter(Kids_First_Participant_ID %in% matched_participants) %>%
  dplyr::pull(Kids_First_Biospecimen_ID)
```

Connect to SQLite database.

```{r}
# Start up connection
con <- DBI::dbConnect(
  RSQLite::SQLite(),
  file.path(scratch_dir, "snv_db.sqlite")
)
```

Make a list of columns we will keep.

```{r}
cols_to_keep <- c(
  "Chromosome",
  "Start_Position",
  "End_Position",
  "Reference_Allele",
  "Allele",
  "Tumor_Sample_Barcode",
  "Variant_Classification",
  "VAF"
)
```

Set up the Lancet data from the SQL database and only keep the biospecimens we
identified. 

```{r}
lancet <- dplyr::tbl(con, "lancet") %>%
  dplyr::select(cols_to_keep) %>%
  dplyr::inner_join(
    dplyr::tbl(con, "samples") %>%
      dplyr::select(
        Tumor_Sample_Barcode = Kids_First_Biospecimen_ID,
        experimental_strategy, short_histology,
        Kids_First_Participant_ID
      )
  ) %>%
  dplyr::filter(Tumor_Sample_Barcode %in% biospecimens) %>%
  as.data.frame()
```

Plot this as a violin plot

```{r}
vaf_plot <- ggplot2::ggplot(lancet, ggplot2::aes(
  x = experimental_strategy,
  y = VAF,
  fill = experimental_strategy
)) +
  ggplot2::geom_violin() +
  ggplot2::theme_classic() +
  ggplot2::ggtitle("Lancet participants with WXS and WGS")
```

Print out plot with all participants together. 

```{r}
vaf_plot
```

Split up data by participant. 

```{r}
vaf_plot + ggplot2::facet_wrap(~ Kids_First_Participant_ID)
```

### Session Info

```{r}
sessionInfo()
```