This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

TCGA Consensus and Comparison Revised (1 of 2) #562

Merged
merged 40 commits into from
Feb 28, 2020
Changes from 25 commits
d39bd92
make separate bash script for TCGA data
cansavvy Jan 21, 2020
4211158
Starting to set up for TCGA flexibility
cansavvy Jan 28, 2020
4338b4d
Keep the test I did for now.
cansavvy Feb 4, 2020
3a59ea4
Make it so TCGA metadata works with the TMB stuff
cansavvy Feb 5, 2020
f257fff
Merge remote-tracking branch 'upstream/master' into tcga-consensus
cansavvy Feb 5, 2020
230a1b3
Add one more caveat for tcga data
cansavvy Feb 5, 2020
45cee50
Use data in tmb-compare module
cansavvy Feb 5, 2020
e2da876
Adjusted some plot aesthetics
cansavvy Feb 5, 2020
dd4da06
Add comparison plots notebook
cansavvy Feb 5, 2020
b318f88
Refresh TCGA notebook
cansavvy Feb 6, 2020
2386378
Update the READMEs
cansavvy Feb 6, 2020
2761d54
Add to CircleCI
cansavvy Feb 6, 2020
4dd0750
Adjust for WXS TCGA
cansavvy Feb 6, 2020
f34dcd2
Adjust metadata
cansavvy Feb 17, 2020
23b97ef
Use revised data
cansavvy Feb 20, 2020
8c10a3f
Update TMBs to only be calculated based on strelka and mutect
cansavvy Feb 25, 2020
3f78f1e
Make name parallel to tcga
cansavvy Feb 25, 2020
31d45e0
Update TMB calculation to only strelka mutect
cansavvy Feb 26, 2020
b8f2ab6
Believe I have pinpointed the join problem I was having
cansavvy Feb 26, 2020
0d1d9dd
Update TMB compare results with new TCGA consensus data
cansavvy Feb 26, 2020
781f88b
get rid of a typo
cansavvy Feb 26, 2020
57a738c
Merge branch 'master' into tcga-consensus
cansavvy Feb 26, 2020
b0c3092
Update plots
cansavvy Feb 26, 2020
a98c5e8
Update the tmb notebook
cansavvy Feb 26, 2020
865ad7d
Update file name in CircleCI
cansavvy Feb 26, 2020
c64eb39
Fix the file path
cansavvy Feb 26, 2020
81746b4
Update the READMEs and fix what bed files are used for TMB calculations
cansavvy Feb 26, 2020
779a001
FIx some README wording
cansavvy Feb 26, 2020
4b19346
Get rid of old version file
cansavvy Feb 26, 2020
a221101
Undo tcga tmb changes
cansavvy Feb 27, 2020
84d1ed5
Incorporate @jashapiro 's suggestions, move to union
cansavvy Feb 27, 2020
7948c10
Get rid of development remnant
cansavvy Feb 27, 2020
cbf496f
Update split_mnv: ungroup before return
jashapiro Feb 27, 2020
9b56d36
Drop the temp_id thing
cansavvy Feb 27, 2020
6754084
Merge remote-tracking branch 'origin/tcga-consensus' into tcga-consensus
cansavvy Feb 27, 2020
bd22c5d
Fix a couple minor errors
cansavvy Feb 27, 2020
cbc166c
A more regex-like syntax
jashapiro Feb 28, 2020
2deb927
Remove broken TCGA notebook for now.
cansavvy Feb 28, 2020
2681395
Get rid of PNGs that are incorrect
cansavvy Feb 28, 2020
5ec349f
Merge remote-tracking branch 'upstream/master' into tcga-consensus
jaclyn-taroni Feb 28, 2020
6 changes: 5 additions & 1 deletion .circleci/config.yml
@@ -172,9 +172,13 @@ jobs:
################################


+ - run:
+ name: TCGA SNV Caller Analysis
+ command: ./scripts/run_in_ci.sh bash analyses/snv-callers/run_caller_consensus_analysis-tcga.sh
+
- run:
name: SNV Caller Analysis
- command: OPENPBTA_VAF_CUTOFF=0.5 ./scripts/run_in_ci.sh bash analyses/snv-callers/run_caller_consensus_analysis.sh
+ command: OPENPBTA_VAF_CUTOFF=0.5 ./scripts/run_in_ci.sh bash analyses/snv-callers/run_caller_consensus_analysis-pbta.sh

- run:
name: Lancet WXS vs WGS test
2 changes: 1 addition & 1 deletion analyses/README.md
@@ -39,7 +39,7 @@ Note that _nearly all_ modules use the harmonized clinical data file (`pbta-hist
| [`sample-distribution-analysis`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/sample-distribution-analysis) | `pbta-histologies.tsv` | Produces plots and tables that illustrate the distribution of different histologies in the PBTA data | N/A
| [`selection-strategy-comparison`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/selection-strategy-comparison) | `pbta-gene-expression-rsem-fpkm.polya.rds` <br> `pbta-gene-expression-rsem-fpkm.stranded.rds` | Comparison of RNA-seq data from different selection strategies | N/A
| [`sex-prediction-from-RNASeq`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/sex-prediction-from-RNASeq) | `pbta-gene-expression-kallisto.stranded.rds` <br> `pbta-histologies.tsv` | *In progress*; predicts genetic sex using RNA-seq data ([#84](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/6)) | N/A
- | [`snv-callers`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/snv-callers) | `pbta-snv-lancet.vep.maf.gz` <br> `pbta-snv-mutect2.vep.maf.gz` <br> `pbta-snv-strelka2.vep.maf.gz` <br> `pbta-snv-vardict.vep.maf.gz` | Generates consensus SNV and indel calls; calculates tumor mutation burden using the consensus calls | `results/consensus/pbta-snv-consensus-mutation.maf.tsv.gz` <br> `results/consensus/pbta-snv-consensus-mutation-tmb.tsv` (included in data download; too large for tracking via GitHub)
+ | [`snv-callers`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/snv-callers) | `pbta-snv-lancet.vep.maf.gz` <br> `pbta-snv-mutect2.vep.maf.gz` <br> `pbta-snv-strelka2.vep.maf.gz` <br> `pbta-snv-vardict.vep.maf.gz` <br> `tcga-snv-lancet.vep.maf.gz` <br> `tcga-snv-mutect2.vep.maf.gz` <br> `tcga-snv-strelka2.vep.maf.gz` | Generates consensus SNV and indel calls for PBTA and TCGA data; calculates tumor mutation burden using the consensus calls | `results/consensus/pbta-snv-consensus-mutation.maf.tsv.gz` <br> `results/consensus/pbta-snv-consensus-mutation-tmb.tsv` <br> `results/consensus/pbta-snv-consensus-mutation-tmb-coding.tsv`(included in data download; too large for tracking via GitHub) <br> `results/consensus/tcga-snv-consensus-mutation.maf.tsv.gz` <br> `results/consensus/tcga-snv-consensus-mutation-tmb.tsv` <br> `results/consensus/tcga-snv-consensus-mutation-tmb-coding.tsv`
| [`ssgsea-hallmark`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/ssgsea-hallmark) | `pbta-gene-counts-rsem-expected_count.stranded.rds` | *Deprecated*; performs GSVA using Hallmark gene sets | N/A
| [`survival-analysis`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/survival-analysis) | TBD | *In progress*; will eventually contain functions for various types of survival analysis ([#18](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/18)) | N/A
| [`sv-analysis`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/sv-analysis) | `pbta-sv-manta.tsv.gz` <br> `independent-specimens.wgs.primary-plus.tsv` | *In progress*; chromothripsis analysis per [#27](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/27)| N/A
3 changes: 2 additions & 1 deletion analyses/snv-callers/.gitignore
@@ -1,3 +1,4 @@
# ignore folders with big results files
results
- ref_files
+ ref_files/*
+ !ref_files/gencode.v19.basic.exome.hg38liftover.bed
Review comment (collaborator, PR author):
For now, this needs to be added in somewhere, but I think it will be in a future data release.

17 changes: 12 additions & 5 deletions analyses/snv-callers/README.md
@@ -30,15 +30,22 @@ To run the evaluations and comparisons of all the SNV callers, call the bash scr
```
bash run_caller_analysis.sh
```
+
+ For the TCGA data, it has its own script to run the same methods:
+
+ ```
+ bash run_caller_analysis-tcga.sh
+ ```
+
This bash script will return:

- Comparison plots in a notebook: [`compare_snv_callers_plots.nb.html`](https://cansavvy.github.io/openpbta-notebook-concept/snv-callers/compare_snv_callers_plots.nb.html).
- A zip file containing:
- - `pbta-snv-consensus-mutation.maf.tsv` - is [MAF-like file](#consensus-mutation-call) that contains the snvs that were called by all three of these callers for a given sample are saved to this file.
+ - `pbta/tcga-snv-consensus-mutation.maf.tsv` - is [MAF-like file](#consensus-mutation-call) that contains the snvs that were called by all three of these callers for a given sample are saved to this file.
These files combine the [MAF file data](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/) from 3 different SNV callers: [Mutect2](https://software.broadinstitute.org/cancer/cga/mutect), [Strelka2](https://github.com/Illumina/strelka), and [Lancet](https://github.com/nygenome/lancet).
See the methods on the callers' settings [here](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#somatic-single-nucleotide-variant-calling) and see [the methods of this caller analysis and comparison below](#summary-of-methods).
- - `pbta-snv-consensus-mutation-tmb-coding.tsv` - Tumor Mutation burden calculations using *coding only* mutations use the consensus of Lancet, Mutect2, and Strelka2.
- - `pbta-snv-consensus-mutation-tmb-all.tsv` - Tumor Mutation burden calculations using *all* mutations use the consensus of Mutect2, and Strelka2. (Lancet was excluded because it has a [coding region bias in the way it was run](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#snv-and-indel-calling)).
+ - `pbta/tcga-snv-consensus-mutation-tmb-coding.tsv` - Tumor Mutation burden calculations using *coding only* mutations use the consensus of Lancet, Mutect2, and Strelka2.
+ - `pbta/tcga-snv-consensus-mutation-tmb-all.tsv` - Tumor Mutation burden calculations using *all* mutations use the consensus of Mutect2, and Strelka2. (Lancet was excluded because it has a [coding region bias in the way it was run](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#snv-and-indel-calling)).

## Summary of Methods

@@ -62,7 +69,7 @@ As Strelka2 does not call multinucleotide variants (MNV), but instead calls each
### Tumor Mutation Burden Calculation

For each experimental strategy and TMB calculation, the intersection of the genomic regions effectively being surveyed are used.
- These genomic regions are used for first filtering mutations to these regions and then for using the size in bp of the genomic regions surveyed as the TMB denominator.
+ These genomic regions are used for first filtering mutations to these regions and then for using the size in bp of the genomic regions surveyed as the TMB denominator.
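
The filter-then-divide approach described in this README passage can be sketched in a few lines of shell. This is only an illustration on toy data, not the module's actual code: the file names (`regions.bed`, `mutations.tsv`), the three-column mutation layout, and the scaling to mutations per Mb are all assumptions made for the example. Regions are assumed to be non-overlapping, as they would be after an intersection has been merged.

```bash
# Toy inputs in a scratch directory (hypothetical files, not from the repository).
workdir=$(mktemp -d)

# BED file of surveyed regions: chromosome, 0-based start, end.
printf 'chr1\t100\t200\nchr1\t300\t350\nchr2\t0\t50\n' > "$workdir/regions.bed"

# Mutation calls: sample ID, chromosome, 1-based position.
printf 'S1\tchr1\t150\nS1\tchr1\t250\nS1\tchr2\t10\nS2\tchr1\t320\n' > "$workdir/mutations.tsv"

# Denominator: total number of bp surveyed, i.e. the sum of (end - start).
region_size=$(awk -F'\t' '{ total += $3 - $2 } END { print total }' "$workdir/regions.bed")

# Numerator: per-sample count of mutations that fall inside a surveyed region.
# The first file (regions) is loaded into arrays; the second (mutations) is filtered.
tmb_table=$(awk -F'\t' -v size="$region_size" '
  NR == FNR { chrom[FNR] = $1; lo[FNR] = $2; hi[FNR] = $3; n = FNR; next }
  {
    for (i = 1; i <= n; i++) {
      # BED intervals are 0-based and half-open, so a 1-based position p
      # lies inside a region when lo < p <= hi.
      if ($2 == chrom[i] && $3 > lo[i] && $3 <= hi[i]) { count[$1]++; break }
    }
  }
  # Report sample, mutation count, and mutations per Mb (scaling is an assumption).
  END { for (s in count) printf "%s\t%d\t%.1f\n", s, count[s], count[s] * 1e6 / size }
' "$workdir/regions.bed" "$workdir/mutations.tsv" | sort)

echo "region_size_bp=$region_size"
echo "$tmb_table"
```

The same region file drives both the filter and the denominator, which is the point the README text makes: a mutation outside the surveyed regions (here `S1 chr1 250`) is dropped before counting, so numerator and denominator refer to the same stretch of genome.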

#### All mutations TMB

@@ -84,7 +91,7 @@ SNVs outside of these coding sequences are filtered out before being summed and
```
WGS_coding_only_TMB = (total # coding sequence snvs called by all three of Strelka, Lancet, and Mutect2 ) / intersection_strelka_lancet_mutect_CDS_genome_size
```
- Because the same WXS BED file applies to all callers, that file is intersected with the coding sequences for filtering and for determining the denominator.
+ Because the same WXS BED file applies to all callers, that file is intersected with the coding sequences for filtering and for determining the denominator.
```
WXS_coding_only_TMB = (total # coding sequence snvs called by all three of Strelka, Lancet, and Mutect2 ) /
intersection_wxs_CDS_genome_size
48 changes: 1 addition & 47 deletions analyses/snv-callers/compare_snv_callers_plots-tcga.Rmd
100644 → 100755
@@ -93,7 +93,7 @@ Connect to SQLite database.
```{r}
# Start up connection
con <- DBI::dbConnect(RSQLite::SQLite(),
- file.path(scratch_dir, "tcga_snv_db.sqlite"))
+ file.path(scratch_dir, "tcga_v2_snv_db.sqlite"))
```

Note what columns we will join by.
@@ -389,52 +389,6 @@ perc_var_df %>%
ggplot2::ggsave(file.path(plots_dir, "tcga-variant_classification_plot.png"))
```

- ## Where are the unique Lancet calls?
-
- ```{r}
- source(file.path("..", "chromosomal-instability", "util", "chr-break-plot.R"))
- source(file.path("..", "chromosomal-instability", "util", "chr-break-calculate.R"))
- ```
-
- ```{r}
- # Set up Chr sizes
- chr_sizes <- readr::read_tsv(file.path(data_dir, "WGS.hg38.strelka2.unpadded.bed"),
- col_names = FALSE
- ) %>%
- # Reformat the chromosome variable to drop the "chr"
- dplyr::mutate(X1 = factor(gsub("chr", "", X1),
- levels = c(1:22, "X", "Y", "M")
- )) %>%
- # Remove sex chromosomes
- dplyr::filter(!(X1 %in% c("X", "Y", "M")))
- # Make chromosome size named vector
- chr_sizes_vector <- chr_sizes$X3
- names(chr_sizes_vector) <- chr_sizes$X1
- ```
-
- ```{r}
- only_lancet <- all_caller_df %>%
- dplyr::filter(!is.na(VAF_lancet) & is.na(VAF_strelka) & is.na(VAF_mutect)) %>%
- dplyr::mutate(Chromosome = gsub("chr", "", Chromosome))
-
- lancet_densities <- break_density(breaks_df = only_lancet,
- sample_id = "all",
- samples_col = "Tumor_Sample_Barcode",
- chrom_col = "Chromosome",
- start_col = "Start_Position",
- end_col = "Start_Position",
- window_size = 1e7,
- chr_sizes_vector = chr_sizes_vector)
-
- map_breaks_plot(lancet_densities,
- y_val = "total_counts",
- color = "blue",
- y_lab = "Total Num of Calls",
- main_title = "Lancet Only Calls"
- )
-
- ```
-
## Session Info

```{r}