Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Addition of script to produce oncoprint #176

Merged
merged 13 commits into from
Oct 29, 2019

Conversation

cbethell
Copy link
Contributor

@cbethell cbethell commented Oct 25, 2019

Purpose/implementation

This draft PR adds the notebook which produces oncoprints given the output of strelka2, mutect2, lancet, and vardict.

The goal is to produce an oncoplot including cnv and fusion information, in addition to the snv information already displayed. For reference, here is a plot produced using a workflow authored by @jharenza:
2019-10-24-brain-oncoprint18-1
In an attempt to adapt said workflow to our data, I discovered that the format of our data files are different from the input files used to produce the above plot. Therefore, I was unable to successfully plug our input files into the code. That being said, it seems that this may require some amount of data wrangling which is still in progress. However, please feel free to offer your relevant suggestions at any point.

Issue

This addresses issue #6 on producing an oncoprint showing the landscape of genetic lesions across PBTA.

Directions for reviewers

  1. Is there a filtering step you feel should be added to this notebook?
  2. Does the first plot where the data from all four callers are combined seem beneficial?
  3. Do you feel the aesthetics of the plots are appropriate?
  4. The features of the metadata I chose to annotate the plots further include broad_histology, short_histology, reported_gender, and tumor_descriptor. Do these features seem to be the most appropriate annotations for these plots?

The plots can be found in the plots directory of this module, and in the html output of this notebook here.

Results

Oncoprints displaying each snv callers’ output across this dataset has been produced thus far.
The output files are:

  • analyses/oncoprint-landscape/plots/combined_maf_oncoprint.png
  • analyses/oncoprint-landscape/plots/strelka2_oncoprint.png
  • analyses/oncoprint-landscape/plots/mutect2_oncoprint.png
  • analyses/oncoprint-landscape/plots/lancet_oncoprint.png
  • analyses/oncoprint-landscape/plots/vardict_oncoprint.png

Docker and continuous integration

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.
  • Run a linter
  • Set the seed - Not needed.
  • Comments and/or documentation up to date
  • Double check your paths
  • Spell check any Rmd file or md file
  • Restart R and run all notebooks fresh and save

@jharenza jharenza self-requested a review October 25, 2019 15:07
@jharenza
Copy link
Collaborator

jharenza commented Oct 25, 2019

@cbethell Thanks for working on this! This will probably be the first or second figure and most critiqued from a biological standpoint to represent the tumors landscape accurately, so I have a few thoughts:

  • Yes, there was some data wrangling for adding copy number and fusions to these plots. For fusions, I appended them to the MAF file, so once 04-project specific filtering and run script #166 is completed, we can use that final filtered file to add high-confidence fusions to the MAF to be integrated into the plot. This will also require you to specify Fusion as a variant classification upon reading the MAF.
  • For copy number, once Proposed Analysis: Copy number consensus calls #128 is completed, these can also be added. This may be tricky, as we are dealing with NGS tools instead of array, and in general, it makes more sense to add to an oncoprint focal amplifications and homozygous deletions, rather than any genes with CN changes due to broad gain/loss. For these, I had to threshold a bit, but once we see the results from the consensus calls, I think we can better tackle what this means.
  • Regarding the multiple MAFs, I would suggest just working with one for now (as we will use consensus SNV calls from a final MAF from Getting consensus mutations list by comparing callers #174 as input to this figure). This can be used as proof of principle.
  • One thing that is tricky about MAF files is that using the files as they are results in many private variants coming through into the MAF. The MUC and TTN genes, for example, are some of the largest genes in the genome and by default are most mutated, but not at hotspots. These are generally not considered cancer driver genes, so oftentimes, we will come up with a gene list for subsetting the MAF. Top N (25-50?) brain tumor driver genes, for example. For this, perhaps we create an issue to create a driver gene list, and we can potentially add this to the data release - @jaclyn-taroni - thoughts? I had created two gene driver lists for pediatric brain tumors - long and short based on the literature. We could also try using these to start and see what the plots look like?
  • Finally, I think that we will also want to add molecular_subtype to these plots. Currently, this is pretty sparse (and I realize MB subtypes are only denoted for RNA because that's what the classifier used to predict them, but should be added at the patient level), but will be addressed with Planned Analysis: Molecularly subtype all tumors #19.

@jaclyn-taroni
Copy link
Member

jaclyn-taroni commented Oct 25, 2019

Regarding the multiple MAFs, I would suggest just working with one for now (as we will use consensus SNV calls from a final MAF from #174 as input to this figure). This can be used as proof of principle.

Let's move to using a script where you accept a MAF file as a command line argument @cbethell.

For these three points:

  • Yes, there was some data wrangling for adding copy number and fusions to these plots. For fusions, I appended them to the MAF file, so once 04-project specific filtering and run script #166 is completed, we can use that final filtered file to add high-confidence fusions to the MAF to be integrated into the plot. This will also require you to specify Fusion as a variant classification upon reading the MAF.
  • For copy number, once Proposed Analysis: Copy number consensus calls #128 is completed, these can also be added. This may be tricky, as we are dealing with NGS tools instead of array, and in general, it makes more sense to add to an oncoprint focal amplifications and homozygous deletions, rather than any genes with CN changes due to broad gain/loss. For these, I had to threshold a bit, but once we see the results from the consensus calls, I think we can better tackle what this means.
  • Finally, I think that we will also want to add molecular_subtype to these plots. Currently, this is pretty sparse (and I realize MB subtypes are only denoted for RNA because that's what the classifier used to predict them, but should be added at the patient level), but will be addressed with Planned Analysis: Molecularly subtype all tumors #19.

We can get this code working and through review and make these updates as they come in. As far as the driver lists are concerned, let's try adding the download of the driver lists @jharenza linked to a shell script that includes the following steps:

  1. Download driver lists.
  2. Run the oncoplot R script.

So to accomplish step one, you could do something like (untested):

mkdir driver-lists
wget --directory-prefix=driver-lists -O brain-goi-list-long.txt https://github.com/marislab/create-pptc-pdx-oncoprints/raw/2c9ed2a2331bcef3003d6aa25a130485d76a3535/data/brain-goi-list-old.txt 
wget --directory-prefix=driver-lists -O brain-goi-list-short.txt https://github.com/marislab/create-pptc-pdx-oncoprints/blob/2c9ed2a2331bcef3003d6aa25a130485d76a3535/data/brain-goi-list.txt 

cbethell and others added 3 commits October 25, 2019 15:47
- added a shell script to `wget` gene list and run `01-plot-oncoprint.R`
- added optparse options to include a snv maf file, seg cnv file, and fusion file (although cnv and fusion data are not yet displayed on the oncoprint)
- used snv data solely from mutect2 to produce the oncoprint (this data will be replaced by a consensus maf file once it is produced and merged) 
- updated `.circleci` appropriately
Copy link
Member

@jaclyn-taroni jaclyn-taroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple things I noticed as this was coming in

library(maftools)

# Define a color vector for plots
colores = c("Missense_Mutation" = "#35978f",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
colores = c("Missense_Mutation" = "#35978f",
colores <- c("Missense_Mutation" = "#35978f",

#### Specify genes -------------------------------------------------------------

# Read in gene list
goi_list <-
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also make this a command line option.


mkdir driver-lists
wget --directory-prefix=driver-lists -O driver-lists/brain-goi-list-long.txt https://github.com/marislab/create-pptc-pdx-oncoprints/raw/2c9ed2a2331bcef3003d6aa25a130485d76a3535/data/brain-goi-list-old.txt
wget --directory-prefix=driver-lists -O driver-lists/brain-goi-list-short.txt https://github.com/marislab/create-pptc-pdx-oncoprints/blob/2c9ed2a2331bcef3003d6aa25a130485d76a3535/data/brain-goi-list.txt
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the blob -> raw was necessary and remembered to change the first one but perhaps not

Suggested change
wget --directory-prefix=driver-lists -O driver-lists/brain-goi-list-short.txt https://github.com/marislab/create-pptc-pdx-oncoprints/blob/2c9ed2a2331bcef3003d6aa25a130485d76a3535/data/brain-goi-list.txt
wget --directory-prefix=driver-lists -O driver-lists/brain-goi-list-short.txt https://github.com/marislab/create-pptc-pdx-oncoprints/raw/2c9ed2a2331bcef3003d6aa25a130485d76a3535/data/brain-goi-list.txt

# Get `magrittr` pipe
`%>%` <- dplyr::`%>%`

plot_oncoplot <- function(maf_file, filename){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also think this is now a MAF object and not a MAF file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, this change will be in the next commit.

…ile` command line arguments optional

- renamed `maf_file` argument in `plot_oncoplot` function to `maf_object`
- removed `.gitignore` as it is no longer needed in this module
- edited `wget` line to change `blob` to `raw` in genes list file path
- made a slight change to the plot dimensions
- made the inclusion of a gene list optional 
- changed an instance of `=` to `<-`
- displayed the plot in the `README` instead of a separate markdown 
- added the appropriate argument to the `read.maf` function to include the cnv information if it is supplied via the command line
@cbethell cbethell changed the title In Progress: Addition of notebook to produce oncoprints Addition of script to produce oncoprint Oct 28, 2019
@cbethell cbethell marked this pull request as ready for review October 28, 2019 11:39
Copy link
Member

@jaclyn-taroni jaclyn-taroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @cbethell, I've reviewed and left some comments.

Below I'm splitting things up into what needs to happen as part of this pull request and what is outside of the scope of the pull request in service of keeping track of what needs to be done to satisfy #6.

Before merging

  • Fix the indentation issues
  • Have the fusion code use the standardized fusion caller format, rather than the Arriba format
  • Accept the oncoprint PNG filename as a command line argument
  • Simplify the logic for the option files (e.g., fusion, CNV)
  • Add a TODO statement about the copy number functionality
  • Make plot_oncoplot less reliant on objects in the global environment or get rid of plot_oncoplot function

In the future

cd "$script_directory" || exit

mkdir driver-lists
wget --directory-prefix=driver-lists -O driver-lists/brain-goi-list-long.txt https://github.com/marislab/create-pptc-pdx-oncoprints/raw/2c9ed2a2331bcef3003d6aa25a130485d76a3535/data/brain-goi-list-old.txt
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you've committed the files here, you can use the --no-clobber option to wget: https://stackoverflow.com/questions/4944295/skip-download-if-files-exist-in-wget


#### Plot and Save Oncoprint ---------------------------------------------------

plot_oncoplot(maf_object, "maf_oncoprint.png")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The output filename should also be a command line argument. Say you wanted to run this multiple times (e.g., one for each MAF file/call). Without the ability to specify the plot filename, you would overwrite this every time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's unclear to me how the gene list gets used here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see, you expect genes to be in the global environment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
plot_oncoplot(maf_object, "maf_oncoprint.png")
plot_oncoplot(maf_object, "maf_oncoprint.png", genes = genes)

I think setting things up this way makes it more clear.

#### Specify genes -------------------------------------------------------------

if (!is.null(opt$goi_list)){
# Read in gene list
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indentation here is off.


maf <- opt$maf_file

if (!is.null(opt$cnv_file)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any time you have if(!is.null(opt$cnv_file)) or if(!is.null(opt$fusion_file)) multiple times, it suggests that that can all be combined into a single if statement. There may be some exceptions of course, but in this case I see no reason you need to set this variable and read in the file separately before using that data.

#### Incorporate Fusion Data ---------------------------------------------------

if (!is.null(opt$fusion_file)) {
fus <- fusion_file
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indentation here is off in this block.

# Get `magrittr` pipe
`%>%` <- dplyr::`%>%`

plot_oncoplot <- function(maf_object, filename){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
plot_oncoplot <- function(maf_object, filename){
plot_oncoplot <- function(maf_object, filename, genes = NULL){

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, if you are not planning to use this outside the context of this script, it gets used once, so it may not need to be a function. If do you plan on moving it outside this script so you can source it in multiple places, then you'd want to have the color palette as an argument.

dir.create(plots_dir)
}

genes <- NULL
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this and use argument to plot_oncoplot instead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


# Read in cnv file
if (!is.null(opt$cnv_file)) {
cnv_file <- data.table::fread(cnv, stringsAsFactors = FALSE)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you collapse this with the logic above (line 147), add a TODO:

# TODO: <what you think the next steps will be based on what you know so far>

library(maftools)

# Define a color vector for plots
colores <- c("Missense_Mutation" = "#35978f",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could move this to something like util/oncoplot-palette.R -- that allows you to source it from wherever and can help keep things consistent if you need to use this palette across multiple scripts. The entire contents of util/oncoplot-palette.R would be:

# <Your names>
# <Link to code where it came from originally if applicable, e.g., PPTC paper repo >
# <Comment about intended usage>

# Define a color vector for plots 
color_palette <- c("Missense_Mutation" = "#35978f", 
                   "Nonsense_Mutation" = "#000000",
                   "Frame_Shift_Del" = "#56B4E9", 
                   "Frame_Shift_Ins" = "#FFBBFF", 
                   "Splice_Site" = "#F0E442",
                   "Translation_Start_Site" = "#191970",
                   "Nonstop_Mutation" = "#545454",
                   "In_Frame_Del" = "#CAE1FF",
                   "In_Frame_Ins" = "#FFE4E1",
                   "Stop_Codon_Ins" = "#CC79A7",
                   "Start_Codon_Del" = "#56B4E9",
                   "Fusion" = "#7B68EE",
                   "Multi_Hit" = "#f46d43",
                   "Hom_Deletion" = "#313695",
                   "Hem_Deletion" = "#abd9e9",
                   "Amplification" = "#c51b7d",
                   "Loss" = "#0072B2",
                   "Gain" = "#D55E00",
                   "High_Level_Gain" = "#FF0000",
                   "Multi_Hit_Fusion" = "#CD96CD")

if (!is.null(opt$fusion_file)) {
fus <- fusion_file

fus <- fus %>%
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the plan is to take the final filtered set from @kgaonkar6's work over on #166, you should plan use the standardized tabular format she's using. The relevant documentation of the column I think you'll want to use is here:

# "FusionName" GeneA--GeneB ,or if fusion is intergenic then Gene1A/Gene2A--GeneB

If you want one of those files for development, you can take the command used in CI and run it from the root directory:

Rscript analyses/fusion_filtering/01-fusion-standardization.R -f "data/pbta-fusion-arriba.tsv.gz" -c "arriba" -o "scratch/arriba.tsv" 

and then use scratch/arriba.tsv. There will potentially be other columns in whatever ends up being the final product out of #174, but a way to deal with that would be to only select the columns you need for this.

cbethell and others added 3 commits October 28, 2019 12:40
- added and sourced `util/oncoplot-palette.R` in `01-plot-oncoprint.R` to define color vector 
- accepted png output filename via command line option
- removed `plot_oncoplot` custom function as it is only needed once in the script 
- simplified logic statements
- Added `TODO` for copy number functionality 
- Fixed indentation issues
Copy link
Member

@jaclyn-taroni jaclyn-taroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a few more comments @cbethell

),
optparse::make_option(
c("-c", "--cnv_file"),
action = "store_true",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that you only want to use action = "store_true" if this is a logical (make_option docs)

# Read in gene list
goi_list <-
read.delim(
file.path("driver-lists", "brain-goi-list-long.txt"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be the command line option, not hardcoded.

read.delim(
file.path("driver-lists", "brain-goi-list-long.txt"),
sep = "\t",
header = F,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
header = F,
header = FALSE,

file.path("driver-lists", "brain-goi-list-long.txt"),
sep = "\t",
header = F,
as.is = T
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
as.is = T
as.is = TRUE

The structure of this folder is as follows:

```
oncoprint-landscape
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

}

genes <- NULL
cnv_file <- NULL
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leftover from development?

library(maftools)

# Source the color palette for plots
source(file.path("util", "oncoplot-palette.R"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should utilize root_dir so it's looking in the right directory no matter where this script is called from.

dir.create(plots_dir)
}

genes <- NULL
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Loss" = "#0072B2",
"Gain" = "#D55E00",
"High_Level_Gain" = "#FF0000",
"Multi_Hit_Fusion" = "#CD96CD")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New line here

- used command line option to read in `goi_list`
- Change `F` to `FALSE` and `T` to `TRUE` where applicable
- removed `genes <- NULL`
- changed `cnv_file <- NULL` to `cnv_file <- opt$cnv_file` 
- utilized `root_dir` to source `oncoplot-palette.R`
- added new line to the end of `oncoplot-palette.R`
Copy link
Member

@jaclyn-taroni jaclyn-taroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments about how to improve the multi-hit fusion section. Once those comments are addressed, we can get this merged. Thanks!

if (!is.null(opt$fusion_file)) {
fusion_file <- data.table::fread(opt$fusion_file, stringsAsFactors = FALSE)

#### Incorporate Fusion Data -------------------------------------------------
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a #TODO here as well? Specifically I think this might need to change once we get the consensus calls (I'm not sure you'll need to do the separation in quite the same way).


# Read in fusion file
if (!is.null(opt$fusion_file)) {
fusion_file <- data.table::fread(opt$fusion_file, stringsAsFactors = FALSE)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fusion_file <- data.table::fread(opt$fusion_file, stringsAsFactors = FALSE)
fusion_file <- data.table::fread(opt$fusion_file, data.table = FALSE, stringsAsFactors = FALSE)

probably will make things a bit easier

Comment on lines 132 to 149
# Separate fusion gene partners and add variant classification and center
fus_sep <- fusion_file %>%
tidyr::separate(FusionName, c("5'-gene", "3'-gene"), sep = "--")

# Determine which samples have multi-fused fusions
multi <- fus_sep[, c("Sample", "5'-gene", "3'-gene")]
multi$ID <- paste0(multi$Sample, ";", multi$`5'-gene`)
reformat_multi <- multi %>%
dplyr::group_by(ID) %>%
dplyr::summarise(Sample = dplyr::n()) %>%
as.data.frame()

reformat_multi$Variant_Classification <-
ifelse(reformat_multi$Sample == 1, "Fusion", "Multi_Hit_Fusion")
reformat_multi <- reformat_multi %>%
tidyr::separate(ID, c("Tumor_Sample_Barcode", "Hugo_Symbol"), sep = ";")
reformat_multi$Sample <- NULL
reformat_multi$Variant_Type <- "OTHER"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this section would benefit from using more R-friendly column names and it could be simplified by using additional dplyr functionality:

fus_sep <- fusion_file %>%
  # can add a note here about 5' and 3'
  tidyr::separate(FusionName, c("Gene1", "Gene2"), sep = "--") %>%  
  dplyr::select(Sample, Gene1, Gene2)

reformat_fusion <- fus_sep %>% 
  # here we want to tally how many times the 5' gene shows up as a fusion hit 
  # in a sample
  dplyr::group_by(Sample, Gene1) %>%
  dplyr::tally() %>%
  # if the sample-5' gene pair shows up more than once, call it a multi hit 
  # fusion
  dplyr::mutate(Variant_Classification = 
                  dplyr::if_else(n == 1, "Fusion", "Multi_Hit_Fusion"),
                # required column for joining with MAF
                Variant_Type = "OTHER") %>%
  # correct format for joining with MAF
  dplyr::rename(Tumor_Sample_Barcode = Sample, Hugo_Symbol = Gene1) %>%
  dplyr::select(-n)

Should be equivalent. Then you're ready to bind reformat_fusion.

cbethell and others added 2 commits October 29, 2019 09:41
- Made fusion formatting section more R-friendly
- Added `data.table = FALSE` when reading in fusion file
@jaclyn-taroni jaclyn-taroni merged commit 4254f93 into AlexsLemonade:master Oct 29, 2019
@cbethell cbethell mentioned this pull request Oct 29, 2019
8 tasks
@cbethell cbethell mentioned this pull request Nov 7, 2019
8 tasks
@cbethell cbethell deleted the oncoprint-plotting branch December 18, 2019 19:13
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants