Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Update oncoprints to use histology-specific goi lists #1046

Merged
merged 22 commits into from
May 13, 2021

Conversation

cbethell
Copy link
Contributor

@cbethell cbethell commented Apr 29, 2021

Purpose/implementation Section

What scientific question is your analysis addressing?

This PR incorporates histology-specific genes of interest lists for the appropriate histology oncoprints.

What was your approach?

Using the oncoprint-goi-lists-OpenPBTA.csv file, linked in #969 (comment), I prepared a script (saved in util) to make each a column into its own TSV file, named appropriately for the associated histology and stored in oncoprint-landscape/data for use with 01-plot-oncoprint.R

I then adjusted the file paths to the goi lists in the run-oncoprint.sh script.

What GitHub issue does your pull request address?

This PR closes #1004

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

The goi plots have been updated so they should receive a close look to ensure that they were updated as expected.

Is there anything that you want to discuss further?

The file path to the main goi file (in the run-oncoprint.sh script) will likely need to be updated.

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes, although it will need to be re-ran once v19 is merged and the file path to the main goi file will likely need to be updated if it is planned to be included in the data release.

Results

What types of results are included (e.g., table, figure)?

Updated goi oncoprints

What is your summary of the results?

The plots seems to be updated as expected

Reproducibility Checklist

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.

Documentation Checklist

  • This analysis module has a README and it is up to date.
  • This analysis is recorded in the table in analyses/README.md and the entry is up to date.
  • The analytical code is documented and contains comments.

@cbethell cbethell changed the title WIP: Update oncoprints to use histology-specific goi lists Update oncoprints to use histology-specific goi lists Apr 29, 2021
@cbethell cbethell marked this pull request as ready for review April 29, 2021 13:17
@jharenza jharenza self-requested a review April 30, 2021 22:23
Copy link
Collaborator

@jharenza jharenza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @cbethell! I think this looks good and oncoprints and lists match as expected, but the top = 25 parameter in the oncoplot function is not working as expected (I am seeing >25 genes in some of the oncoprints).

We should get #1009 merged before this, but after the v19 release.

@cbethell
Copy link
Contributor Author

cbethell commented May 3, 2021

Hi @cbethell! I think this looks good and oncoprints and lists match as expected, but the top = 25 parameter in the oncoplot function is not working as expected (I am seeing >25 genes in some of the oncoprints).

We should get #1009 merged before this, but after the v19 release.

Hi @jharenza, is it the goi oncoprints that appear to not adhering to the expected behavior? If so, as noted in #975 (comment),

it appears that a default behavior of the oncoplot() function is to display the top 20 genes with mutations, a functionality that can be over-ridden if supplied a genes of interest list with more than 20 genes.

In other words, the top = 25 parameter is over-ridden in cases where a goi list has been supplied. That said, if a goi list contains >25 genes, >25 genes will be displayed on the oncoprint.

@jharenza
Copy link
Collaborator

jharenza commented May 3, 2021

In other words, the top = 25 parameter is over-ridden in cases where a goi list has been supplied. That said, if a goi list contains >25 genes, >25 genes will be displayed on the oncoprint.

Ah, yes. I remember us discussing this, but could not find it at the time. Looking back at the PPTC paper code, it looks like what I had done to create the oncoprints as a max of N genes while using genelists was to get the summary of mutations per gene first, then read just that gene list into the oncoprint function:
https://github.com/marislab/create-pptc-pdx-oncoprints/blob/13e2fd942b33ae7f4ba3f0cec844c0631283b8f5/R/create-complexheat-oncoprint-revision.R#L20-L59

What do you think about implementing this?

@cbethell
Copy link
Contributor Author

cbethell commented May 3, 2021

In other words, the top = 25 parameter is over-ridden in cases where a goi list has been supplied. That said, if a goi list contains >25 genes, >25 genes will be displayed on the oncoprint.

Ah, yes. I remember us discussing this, but could not find it at the time. Looking back at the PPTC paper code, it looks like what I had done to create the oncoprints as a max of N genes while using genelists was to get the summary of mutations per gene first, then read just that gene list into the oncoprint function:
https://github.com/marislab/create-pptc-pdx-oncoprints/blob/13e2fd942b33ae7f4ba3f0cec844c0631283b8f5/R/create-complexheat-oncoprint-revision.R#L20-L59

What do you think about implementing this?

Ah yes @jharenza, I can refactor the module to take in GOI lists and top N as arguments and where both specified, count up the number of mutations in the genes in the GOI list from the MAF file and then subset the GOI list to the top N genes on the basis of the number of mutations, which is what seems to be implemented in the code you linked above!

Comment on lines 247 to 272
# Subset `maf_object` for histology-specific goi list
if (!is.null(opt$goi_list)) {
maf_object = subsetMaf(
maf = maf_object,
tsb = metadata$Tumor_Sample_Barcode,
genes = goi_list,
mafObj = TRUE
)

# Get top mutated genes per this subset object
gene_sum <- mafSummary(maf_object)$gene.summary

# Sort to get top altered genes rather than mutated only genes
goi_ordered <-
gene_sum[order(gene_sum$AlteredSamples, decreasing = T),]

if (!is.null(opt$top_n)) {

# Select top `n` genes if the argument is provided
top_n <- ifelse(nrow(gene_sum) < opt$top_n, nrow(gene_sum), opt$top_n)

goi_list <- goi_ordered[1:top_n,]

}

}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The actual code changes I'm suggesting are untested, but I believe you will only have to take these steps if someone specifies a GOI list and a top n, so you can simplify this a bit:

Suggested change
# Subset `maf_object` for histology-specific goi list
if (!is.null(opt$goi_list)) {
maf_object = subsetMaf(
maf = maf_object,
tsb = metadata$Tumor_Sample_Barcode,
genes = goi_list,
mafObj = TRUE
)
# Get top mutated genes per this subset object
gene_sum <- mafSummary(maf_object)$gene.summary
# Sort to get top altered genes rather than mutated only genes
goi_ordered <-
gene_sum[order(gene_sum$AlteredSamples, decreasing = T),]
if (!is.null(opt$top_n)) {
# Select top `n` genes if the argument is provided
top_n <- ifelse(nrow(gene_sum) < opt$top_n, nrow(gene_sum), opt$top_n)
goi_list <- goi_ordered[1:top_n,]
}
}
# Subset `maf_object` for histology-specific goi list
if (!is.null(opt$goi_list) & !is.null(opt$top_n)) {
maf_object <- subsetMaf(
maf = maf_object,
tsb = metadata$Tumor_Sample_Barcode,
genes = goi_list,
mafObj = TRUE
)
# Get top mutated genes per this subset object
gene_sum <- mafSummary(maf_object)$gene.summary
# Sort to get top altered genes rather than mutated only genes
goi_ordered <-
gene_sum[order(gene_sum$AlteredSamples, decreasing = T),]
# Select top `n` genes if the argument is provided
top_n <- ifelse(nrow(gene_sum) < opt$top_n, nrow(gene_sum), opt$top_n)
goi_list <- goi_ordered[1:top_n,]
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upon testing this method does not appear to obey the top_n argument.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you expand on that a bit?

Copy link
Contributor Author

@cbethell cbethell May 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not limit the amount of genes being displayed to top_n, it instead shows all of the genes listed in the goi list (the behavior it previously exhibited but we did not want).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the two if way worked?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That suggests that the logic isn't quite right because it's using the original list (goi_list) - the conditions for the if() are not being met. Since you don't want to subset the MAF and do the mafSummary() steps unless you have to and you only have to if you have a top_n argument I would see if you can get to the bottom of it.

cbethell and others added 3 commits May 6, 2021 16:18
* Move goi prep out of util and renumber, etc.

* Add a distinct() step

* New naming scheme

* Use more arrays in the bash script, renumbering

* Update oncoprint PNGs
- re-ran to ensure everything is running as expected
@cbethell
Copy link
Contributor Author

cbethell commented May 6, 2021

As mentioned in #1053 (comment)

It appears that in sample 7316-2285, when looking at the lgat plots, the display_group value is Neuronal and mixed neuronal-glial tumor while the broad_histology value is Low-grade astrocytic tumor.

See the display_group representation in the example of primary-plus_lgat_goi_oncoprint.png below:

primary-plus_lgat_goi_oncoprint

I guess the question now is should we leave this info as is or should we recode the display_group? (A similar situation is occurring in the Other CNS plots where Chordoma is the display_group value associated with the broad_histology Choroid plexus tumor -- the value we do want). Any thoughts here @jharenza?

Otherwise, this PR now appears to do what we expect re incorporating goi_list and top_n.

@jharenza
Copy link
Collaborator

jharenza commented May 6, 2021

hi @cbethell! For 7316-2285, in v19, this has broad_histology == Neuronal and mixed neuronal-glial tumor, so this should be fixed now. For Chordoma samples, they now have broad_histology == Chordoma. The choroid plexus tumor was a mismatch because the WHO 2016 didn't have a spot for chordomas and we made it its own broad histology, as it is different. So, I think this should help.

@cbethell
Copy link
Contributor Author

cbethell commented May 6, 2021

hi @cbethell! For 7316-2285, in v19, this has broad_histology == Neuronal and mixed neuronal-glial tumor, so this should be fixed now. For Chordoma samples, they now have broad_histology == Chordoma. The choroid plexus tumor was a mismatch because the WHO 2016 didn't have a spot for chordomas and we made it its own broad histology, as it is different. So, I think this should help.

Great, thanks @jharenza! I'll be sure to re-run with v19 files.

@jaclyn-taroni jaclyn-taroni mentioned this pull request May 6, 2021
cbethell and others added 2 commits May 6, 2021 18:23
* Update top_n logic

* top_n can not be NULL + rerun
@jaclyn-taroni
Copy link
Member

@cbethell can you run this with v19 again now that #1054 is in please? I'm going to start merging the subtyping PRs that should probably get rerun with v19!

@jaclyn-taroni jaclyn-taroni requested a review from jharenza May 7, 2021 13:02
Copy link
Collaborator

@jharenza jharenza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @cbethell and @jaclyn-taroni. I think this generally looks good as far as goi output. There is one thing I noticed - not sure if we want to fix here or elsewhere, and that is the display of genes when there are no mutations - these we should remove from the plots. I wasn't expecting this, but also did not add that to the ticket specifically. I will approve and let you both decide whether we need a new ticket for that.

@cbethell
Copy link
Contributor Author

cbethell commented May 10, 2021

hi @cbethell and @jaclyn-taroni. I think this generally looks good as far as goi output. There is one thing I noticed - not sure if we want to fix here or elsewhere, and that is the display of genes when there are no mutations - these we should remove from the plots. I wasn't expecting this, but also did not add that to the ticket specifically. I will approve and let you both decide whether we need a new ticket for that.

After some further investigation, it seems that the underlying issue here was the lack of coding for Intron, 5'Flank, and 3'Flank. These accounted for some number of mutations in the AlteredSamples, and therefore showed up "blank" with lack of coding.

On my local machine, I have updated the coding in oncoprint_color_palette.tsv and oncoplot_functions.R to account for the values named above. This seems to have fixed the display of genes as mentioned above, but I am not sure if this is the ideal fix as the Intron values seem to crowd the non-goi plots. See the lgat goi vs non-goi plots below and feel free to leave any thoughts here @jaclyn-taroni and @jharenza! (Note that I have committed these changes in 04874a4 for your reference but am ready to revert these changes if this is not what we want).

primary-plus_lgat_goi_oncoprint
primary-plus_lgat_oncoprint

@cbethell
Copy link
Contributor Author

cbethell commented May 10, 2021

Given the findings in #1046 (comment), do we want to instead filter the Intron, 5'Flank, and 3'Flank cases out of the MAF file before subsetting and plotting @jharenza ?

@jharenza
Copy link
Collaborator

jharenza commented May 10, 2021

Given the findings in #1046 (comment), do we want to instead filter the Intron, 5'Flank, and 3'Flank cases out of the MAF file before subsetting and plotting @jharenza ?

Oh, nice find! We definitely do not want to include intronic variants here, so I would suggest we can either remove them from the MAF to lighten that load or remove them from the altered samples table. We should keep the 3' and 5' flank - TERT promoter mutations from #819 will be included as 5' Flank. Based on the hotspot maf, we have Variant_Classification ==

Missense_Mutation 
Splice_Site    
Nonsense_Mutation
Frame_Shift_Del   
In_Frame_Ins   
In_Frame_Del      
Frame_Shift_Ins
5'Flank

so, everything else is currently captured

- filter to genes with `AlteredSamples >1` and re-run
@cbethell
Copy link
Contributor Author

Given the findings in #1046 (comment), do we want to instead filter the Intron, 5'Flank, and 3'Flank cases out of the MAF file before subsetting and plotting @jharenza ?

Oh, nice find! We definitely do not want to include intronic variants here, so I would suggest we can either remove them from the MAF to lighten that load or remove them from the altered samples table. We should keep the 3' and 5' flank - TERT promoter mutations from #819 will be included as 5' Flank. Based on the hotspot maf, we have Variant_Classification ==

Missense_Mutation 
Splice_Site    
Nonsense_Mutation
Frame_Shift_Del   
In_Frame_Ins   
In_Frame_Del      
Frame_Shift_Ins
5'Flank

so, everything else is currently captured.a

Thanks for the feedback @jharenza!

In the most recent commit 52ac162, you'll find that @jaclyn-taroni and I made the decision to filter to genes with AlteredSamples > 1, as to handle any cases with 0 (or near to) mutations being displayed on the oncoprints.

We also added a flag called include_introns, which by default is equal to FALSE -- meaning that we have removed intronic variants from the oncoprints by default while the option to include them is also there if we need to at some point in the future.

@cbethell cbethell requested a review from jharenza May 11, 2021 14:18
@jharenza
Copy link
Collaborator

In the most recent commit 52ac162, you'll find that @jaclyn-taroni and I made the decision to filter to genes with AlteredSamples > 1, as to handle any cases with 0 (or near to) mutations being displayed on the oncoprints.

Thanks! This looks like it does what it should - filters out introns and removes non-mutated genes, except now I am seeing that the TMB portion of the plot seems to be missing bars.
This can be seen vividly in LGAT goi and
Screen Shot 2021-05-12 at 6 54 26 PM
CNS other goi
Screen Shot 2021-05-12 at 6 54 47 PM
Scrolling through above, I see this happened a few other times, and perhaps warrants its own issue.

The other thing I noticed is that these oncoprints are missing the reciprocal fusion logic in that commit, but I see it in your latest commit ea6e207. However, in this latest commit, I am seeing the odd behavior of goi and non-goi oncoplot names being reversed, but not even reversed because the goi lists aren't being populated properly. For instance:

primary-plus-low-grade-astrocytic-tumor_goi_oncoprint.png:
Screen Shot 2021-05-12 at 7 03 34 PM
This is not the goi list, but the file refers to it as such

primary-plus-low-grade-astrocytic-tumor_oncoprint.png:
Screen Shot 2021-05-12 at 7 02 45 PM
There are genes in this that are not in the goi list, so I am not sure where exactly this list comes from?

The TMB issue seems to be [mostly] resolved in commit ea6e207. I say mostly because there are just a few samples with no bars, but they seem to be very few compared to before and maybe those TMBs are below the visible log10 threshold which would be fine.

@jaclyn-taroni
Copy link
Member

The fact that there are no PNG files in files changed when we are changing how we set file names in the shell script here is certainly suspicious.

@jharenza
Copy link
Collaborator

The fact that there are no PNG files in files changed when we are changing how we set file names in the shell script here is certainly suspicious.

yes, I checked multiple times because I thought I was going crazy...

@jaclyn-taroni
Copy link
Member

Rerunning locally now 🎲

@jaclyn-taroni
Copy link
Member

New plots added in 3ad02de @jharenza. I haven't looked at them in detail yet (that is a tomorrow problem!) but seems like the TMB issue might still exist in the GOI plots?

@jharenza
Copy link
Collaborator

New plots added in 3ad02de @jharenza. I haven't looked at them in detail yet (that is a tomorrow problem!) but seems like the TMB issue might still exist in the GOI plots?

Ok great - the oncoprint content looks correct now, but yes, the TMB issue still exists. It looks like the goi subset is resulting in a counting of only those gene alterations for the TMB calculation and in the non-goi plots, you can see that the TMB plots contain all alterations. I may have seen this before when subsetting the maf. I think the only way to perhaps get around this would be to refactor the code in this order:

  1. subset maf to goi list
  2. gather top N goi from summary
  3. plot full maf (not subsetted maf) but only plot genes from 2 (instead of top N from 1).
    If we want to test this in a new PR, I can submit an issue.

@cbethell
Copy link
Contributor Author

New plots added in 3ad02de @jharenza. I haven't looked at them in detail yet (that is a tomorrow problem!) but seems like the TMB issue might still exist in the GOI plots?

Ok great - the oncoprint content looks correct now, but yes, the TMB issue still exists. It looks like the goi subset is resulting in a counting of only those gene alterations for the TMB calculation and in the non-goi plots, you can see that the TMB plots contain all alterations. I may have seen this before when subsetting the maf. I think the only way to perhaps get around this would be to refactor the code in this order:

  1. subset maf to goi list
  2. gather top N goi from summary
  3. plot full maf (not subsetted maf) but only plot genes from 2 (instead of top N from 1).
    If we want to test this in a new PR, I can submit an issue.

With the changes made and committed in d71f76b, it looks like the missing bars associated with the TMB portion of the goi plots have been retained, would you agree @jharenza?

@jharenza
Copy link
Collaborator

With the changes made and committed in d71f76b, it looks like the missing bars associated with the TMB portion of the goi plots have been retained, would you agree @jharenza?

Yes, this looks great and I think good to go!

Copy link
Collaborator

@jharenza jharenza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM now!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Updated analysis: Update oncoprint-landscape module to use histology-specific goi lists
3 participants