-
Notifications
You must be signed in to change notification settings - Fork 83
Addition of script to produce oncoprint #176
Addition of script to produce oncoprint #176
Conversation
@cbethell Thanks for working on this! This will probably be the first or second figure and most critiqued from a biological standpoint to represent the tumors landscape accurately, so I have a few thoughts:
|
Let's move to using a script where you accept a MAF file as a command line argument @cbethell. For these three points:
We can get this code working and through review and make these updates as they come in. As far as the driver lists are concerned, let's try adding the download of the driver lists @jharenza linked to a shell script that includes the following steps:
So to accomplish step one, you could do something like (untested): mkdir driver-lists
wget --directory-prefix=driver-lists -O brain-goi-list-long.txt https://github.com/marislab/create-pptc-pdx-oncoprints/raw/2c9ed2a2331bcef3003d6aa25a130485d76a3535/data/brain-goi-list-old.txt
wget --directory-prefix=driver-lists -O brain-goi-list-short.txt https://github.com/marislab/create-pptc-pdx-oncoprints/blob/2c9ed2a2331bcef3003d6aa25a130485d76a3535/data/brain-goi-list.txt |
- added a shell script to `wget` gene list and run `01-plot-oncoprint.R` - added optparse options to include a snv maf file, seg cnv file, and fusion file (although cnv and fusion data are not yet displayed on the oncoprint) - used snv data solely from mutect2 to produce the oncoprint (this data will be replaced by a consensus maf file once it is produced and merged) - updated `.circleci` appropriately
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple things I noticed as this was coming in
library(maftools) | ||
|
||
# Define a color vector for plots | ||
colores = c("Missense_Mutation" = "#35978f", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
colores = c("Missense_Mutation" = "#35978f", | |
colores <- c("Missense_Mutation" = "#35978f", |
#### Specify genes ------------------------------------------------------------- | ||
|
||
# Read in gene list | ||
goi_list <- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would also make this a command line option.
|
||
mkdir driver-lists | ||
wget --directory-prefix=driver-lists -O driver-lists/brain-goi-list-long.txt https://github.com/marislab/create-pptc-pdx-oncoprints/raw/2c9ed2a2331bcef3003d6aa25a130485d76a3535/data/brain-goi-list-old.txt | ||
wget --directory-prefix=driver-lists -O driver-lists/brain-goi-list-short.txt https://github.com/marislab/create-pptc-pdx-oncoprints/blob/2c9ed2a2331bcef3003d6aa25a130485d76a3535/data/brain-goi-list.txt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought the blob
-> raw
was necessary and remembered to change the first one but perhaps not
wget --directory-prefix=driver-lists -O driver-lists/brain-goi-list-short.txt https://github.com/marislab/create-pptc-pdx-oncoprints/blob/2c9ed2a2331bcef3003d6aa25a130485d76a3535/data/brain-goi-list.txt | |
wget --directory-prefix=driver-lists -O driver-lists/brain-goi-list-short.txt https://github.com/marislab/create-pptc-pdx-oncoprints/raw/2c9ed2a2331bcef3003d6aa25a130485d76a3535/data/brain-goi-list.txt |
# Get `magrittr` pipe | ||
`%>%` <- dplyr::`%>%` | ||
|
||
plot_oncoplot <- function(maf_file, filename){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also think this is now a MAF object and not a MAF file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, this change will be in the next commit.
…ile` command line arguments optional - renamed `maf_file` argument in `plot_oncoplot` function to `maf_object` - removed `.gitignore` as it is no longer needed in this module - edited `wget` line to change `blob` to `raw` in genes list file path
- made a slight change to the plot dimensions - made the inclusion of a gene list optional - changed an instance of `=` to `<-`
- displayed the plot in the `README` instead of a separate markdown - added the appropriate argument to the `read.maf` function to include the cnv information if it is supplied via the command line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @cbethell, I've reviewed and left some comments.
Below I'm splitting things up into what needs to happen as part of this pull request and what is outside of the scope of the pull request in service of keeping track of what needs to be done to satisfy #6.
Before merging
- Fix the indentation issues
- Have the fusion code use the standardized fusion caller format, rather than the Arriba format
- Accept the oncoprint PNG filename as a command line argument
- Simplify the logic for the option files (e.g., fusion, CNV)
- Add a TODO statement about the copy number functionality
- Make
plot_oncoplot
less reliant on objects in the global environment or get rid ofplot_oncoplot
function
In the future
- Add copy number functionality
- Optionally, run this for all four callers and put the PNGs in the README
- Rerun this plot with the following updated data:
- Final fusion list Getting consensus mutations list by comparing callers #174
- Copy number consensus Proposed Analysis: Copy number consensus calls #128
- Add molecular subtypes Planned Analysis: Molecularly subtype all tumors #19
- I don't know if such a file already exists, but I like the idea adding functionality to give the script a blacklist of genes like MUC and TTN which are likely to show up but are not of interest.
cd "$script_directory" || exit | ||
|
||
mkdir driver-lists | ||
wget --directory-prefix=driver-lists -O driver-lists/brain-goi-list-long.txt https://github.com/marislab/create-pptc-pdx-oncoprints/raw/2c9ed2a2331bcef3003d6aa25a130485d76a3535/data/brain-goi-list-old.txt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you've committed the files here, you can use the --no-clobber
option to wget
: https://stackoverflow.com/questions/4944295/skip-download-if-files-exist-in-wget
|
||
#### Plot and Save Oncoprint --------------------------------------------------- | ||
|
||
plot_oncoplot(maf_object, "maf_oncoprint.png") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The output filename should also be a command line argument. Say you wanted to run this multiple times (e.g., one for each MAF file/call). Without the ability to specify the plot filename, you would overwrite this every time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's unclear to me how the gene list gets used here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see, you expect genes
to be in the global environment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
plot_oncoplot(maf_object, "maf_oncoprint.png") | |
plot_oncoplot(maf_object, "maf_oncoprint.png", genes = genes) |
I think setting things up this way makes it more clear.
#### Specify genes ------------------------------------------------------------- | ||
|
||
if (!is.null(opt$goi_list)){ | ||
# Read in gene list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indentation here is off.
|
||
maf <- opt$maf_file | ||
|
||
if (!is.null(opt$cnv_file)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any time you have if(!is.null(opt$cnv_file))
or if(!is.null(opt$fusion_file))
multiple times, it suggests that that can all be combined into a single if
statement. There may be some exceptions of course, but in this case I see no reason you need to set this variable and read in the file separately before using that data.
#### Incorporate Fusion Data --------------------------------------------------- | ||
|
||
if (!is.null(opt$fusion_file)) { | ||
fus <- fusion_file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indentation here is off in this block.
# Get `magrittr` pipe | ||
`%>%` <- dplyr::`%>%` | ||
|
||
plot_oncoplot <- function(maf_object, filename){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
plot_oncoplot <- function(maf_object, filename){ | |
plot_oncoplot <- function(maf_object, filename, genes = NULL){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, if you are not planning to use this outside the context of this script, it gets used once, so it may not need to be a function. If do you plan on moving it outside this script so you can source it in multiple places, then you'd want to have the color palette as an argument.
dir.create(plots_dir) | ||
} | ||
|
||
genes <- NULL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this and use argument to plot_oncoplot
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have the logic down below now: https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/176/files#diff-f1245b275299730475f8b3d538c277f6R204 so I think this can come out.
|
||
# Read in cnv file | ||
if (!is.null(opt$cnv_file)) { | ||
cnv_file <- data.table::fread(cnv, stringsAsFactors = FALSE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you collapse this with the logic above (line 147), add a TODO:
# TODO: <what you think the next steps will be based on what you know so far>
library(maftools) | ||
|
||
# Define a color vector for plots | ||
colores <- c("Missense_Mutation" = "#35978f", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could move this to something like util/oncoplot-palette.R
-- that allows you to source it from wherever and can help keep things consistent if you need to use this palette across multiple scripts. The entire contents of util/oncoplot-palette.R
would be:
# <Your names>
# <Link to code where it came from originally if applicable, e.g., PPTC paper repo >
# <Comment about intended usage>
# Define a color vector for plots
color_palette <- c("Missense_Mutation" = "#35978f",
"Nonsense_Mutation" = "#000000",
"Frame_Shift_Del" = "#56B4E9",
"Frame_Shift_Ins" = "#FFBBFF",
"Splice_Site" = "#F0E442",
"Translation_Start_Site" = "#191970",
"Nonstop_Mutation" = "#545454",
"In_Frame_Del" = "#CAE1FF",
"In_Frame_Ins" = "#FFE4E1",
"Stop_Codon_Ins" = "#CC79A7",
"Start_Codon_Del" = "#56B4E9",
"Fusion" = "#7B68EE",
"Multi_Hit" = "#f46d43",
"Hom_Deletion" = "#313695",
"Hem_Deletion" = "#abd9e9",
"Amplification" = "#c51b7d",
"Loss" = "#0072B2",
"Gain" = "#D55E00",
"High_Level_Gain" = "#FF0000",
"Multi_Hit_Fusion" = "#CD96CD")
if (!is.null(opt$fusion_file)) { | ||
fus <- fusion_file | ||
|
||
fus <- fus %>% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the plan is to take the final filtered set from @kgaonkar6's work over on #166, you should plan use the standardized tabular format she's using. The relevant documentation of the column I think you'll want to use is here:
# "FusionName" GeneA--GeneB ,or if fusion is intergenic then Gene1A/Gene2A--GeneB |
If you want one of those files for development, you can take the command used in CI and run it from the root directory:
Rscript analyses/fusion_filtering/01-fusion-standardization.R -f "data/pbta-fusion-arriba.tsv.gz" -c "arriba" -o "scratch/arriba.tsv"
and then use scratch/arriba.tsv
. There will potentially be other columns in whatever ends up being the final product out of #174, but a way to deal with that would be to only select the columns you need for this.
- added and sourced `util/oncoplot-palette.R` in `01-plot-oncoprint.R` to define color vector - accepted png output filename via command line option - removed `plot_oncoplot` custom function as it is only needed once in the script - simplified logic statements - Added `TODO` for copy number functionality - Fixed indentation issues
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a few more comments @cbethell
), | ||
optparse::make_option( | ||
c("-c", "--cnv_file"), | ||
action = "store_true", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that you only want to use action = "store_true"
if this is a logical (make_option
docs)
# Read in gene list | ||
goi_list <- | ||
read.delim( | ||
file.path("driver-lists", "brain-goi-list-long.txt"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be the command line option, not hardcoded.
read.delim( | ||
file.path("driver-lists", "brain-goi-list-long.txt"), | ||
sep = "\t", | ||
header = F, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
header = F, | |
header = FALSE, |
file.path("driver-lists", "brain-goi-list-long.txt"), | ||
sep = "\t", | ||
header = F, | ||
as.is = T |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as.is = T | |
as.is = TRUE |
The structure of this folder is as follows: | ||
|
||
``` | ||
oncoprint-landscape |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably be the output of tree
: https://rschu.me/list-a-directory-with-tree-command-on-mac-os-x-3b2d4c4a4827
} | ||
|
||
genes <- NULL | ||
cnv_file <- NULL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leftover from development?
library(maftools) | ||
|
||
# Source the color palette for plots | ||
source(file.path("util", "oncoplot-palette.R")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should utilize root_dir
so it's looking in the right directory no matter where this script is called from.
dir.create(plots_dir) | ||
} | ||
|
||
genes <- NULL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have the logic down below now: https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/176/files#diff-f1245b275299730475f8b3d538c277f6R204 so I think this can come out.
"Loss" = "#0072B2", | ||
"Gain" = "#D55E00", | ||
"High_Level_Gain" = "#FF0000", | ||
"Multi_Hit_Fusion" = "#CD96CD") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New line here
- used command line option to read in `goi_list` - Change `F` to `FALSE` and `T` to `TRUE` where applicable - removed `genes <- NULL` - changed `cnv_file <- NULL` to `cnv_file <- opt$cnv_file` - utilized `root_dir` to source `oncoplot-palette.R` - added new line to the end of `oncoplot-palette.R`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments about how to improve the multi-hit fusion section. Once those comments are addressed, we can get this merged. Thanks!
if (!is.null(opt$fusion_file)) { | ||
fusion_file <- data.table::fread(opt$fusion_file, stringsAsFactors = FALSE) | ||
|
||
#### Incorporate Fusion Data ------------------------------------------------- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a #TODO
here as well? Specifically I think this might need to change once we get the consensus calls (I'm not sure you'll need to do the separation in quite the same way).
|
||
# Read in fusion file | ||
if (!is.null(opt$fusion_file)) { | ||
fusion_file <- data.table::fread(opt$fusion_file, stringsAsFactors = FALSE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fusion_file <- data.table::fread(opt$fusion_file, stringsAsFactors = FALSE) | |
fusion_file <- data.table::fread(opt$fusion_file, data.table = FALSE, stringsAsFactors = FALSE) |
probably will make things a bit easier
# Separate fusion gene partners and add variant classification and center | ||
fus_sep <- fusion_file %>% | ||
tidyr::separate(FusionName, c("5'-gene", "3'-gene"), sep = "--") | ||
|
||
# Determine which samples have multi-fused fusions | ||
multi <- fus_sep[, c("Sample", "5'-gene", "3'-gene")] | ||
multi$ID <- paste0(multi$Sample, ";", multi$`5'-gene`) | ||
reformat_multi <- multi %>% | ||
dplyr::group_by(ID) %>% | ||
dplyr::summarise(Sample = dplyr::n()) %>% | ||
as.data.frame() | ||
|
||
reformat_multi$Variant_Classification <- | ||
ifelse(reformat_multi$Sample == 1, "Fusion", "Multi_Hit_Fusion") | ||
reformat_multi <- reformat_multi %>% | ||
tidyr::separate(ID, c("Tumor_Sample_Barcode", "Hugo_Symbol"), sep = ";") | ||
reformat_multi$Sample <- NULL | ||
reformat_multi$Variant_Type <- "OTHER" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this section would benefit from using more R-friendly column names and it could be simplified by using additional dplyr functionality:
fus_sep <- fusion_file %>%
# can add a note here about 5' and 3'
tidyr::separate(FusionName, c("Gene1", "Gene2"), sep = "--") %>%
dplyr::select(Sample, Gene1, Gene2)
reformat_fusion <- fus_sep %>%
# here we want to tally how many times the 5' gene shows up as a fusion hit
# in a sample
dplyr::group_by(Sample, Gene1) %>%
dplyr::tally() %>%
# if the sample-5' gene pair shows up more than once, call it a multi hit
# fusion
dplyr::mutate(Variant_Classification =
dplyr::if_else(n == 1, "Fusion", "Multi_Hit_Fusion"),
# required column for joining with MAF
Variant_Type = "OTHER") %>%
# correct format for joining with MAF
dplyr::rename(Tumor_Sample_Barcode = Sample, Hugo_Symbol = Gene1) %>%
dplyr::select(-n)
Should be equivalent. Then you're ready to bind reformat_fusion
.
- Made fusion formatting section more R-friendly - Added `data.table = FALSE` when reading in fusion file
Purpose/implementation
This draft PR adds the notebook which produces oncoprints given the output of strelka2, mutect2, lancet, and vardict.
The goal is to produce an oncoplot including cnv and fusion information, in addition to the snv information already displayed. For reference, here is a plot produced using a workflow authored by @jharenza:
data:image/s3,"s3://crabby-images/88646/88646dff31cb1467ac526745e3bb5a598bfd0396" alt="2019-10-24-brain-oncoprint18-1"
In an attempt to adapt said workflow to our data, I discovered that the format of our data files are different from the input files used to produce the above plot. Therefore, I was unable to successfully plug our input files into the code. That being said, it seems that this may require some amount of data wrangling which is still in progress. However, please feel free to offer your relevant suggestions at any point.
Issue
This addresses issue #6 on producing an oncoprint showing the landscape of genetic lesions across PBTA.
Directions for reviewers
broad_histology
,short_histology
,reported_gender
, andtumor_descriptor
. Do these features seem to be the most appropriate annotations for these plots?The plots can be found in the
plots
directory of this module, and in the html output of this notebook here.Results
Oncoprints displaying each snv callers’ output across this dataset has been produced thus far.
The output files are:
analyses/oncoprint-landscape/plots/combined_maf_oncoprint.png
analyses/oncoprint-landscape/plots/strelka2_oncoprint.png
analyses/oncoprint-landscape/plots/mutect2_oncoprint.png
analyses/oncoprint-landscape/plots/lancet_oncoprint.png
analyses/oncoprint-landscape/plots/vardict_oncoprint.png
Docker and continuous integration