This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Fusion Filtering ("other" reading frame ) edits #553

Merged: 4 commits, Feb 21, 2020

kgaonkar6
Collaborator

@kgaonkar6 kgaonkar6 commented Feb 21, 2020

Purpose/implementation Section

Fusion filtering steps that affect the putative_oncogene fusions need to be updated through this PR because of the following issues:

  1. During project-specific filtering, I mistakenly removed "other" fusions while filtering on the column Fusion_Type for the putative-oncogene fusion list.
  2. There is also a spelling mistake in the run_script.
  3. IGH-@, IGH@, IGL-@ and IGL@ need to be added to the reference list as oncogenic genes.

What scientific question is your analysis addressing?

Identify all inframe/frameshift/other putative fusions that pass QC and expression based filtering.

What was your approach?

  1. I re-ordered the filtering for putative-oncogene fusions to this chunk: https://github.com/kgaonkar6/OpenPBTA-analysis/blob/155045684646a57920d684577db6148aaeec3b6d/analyses/fusion_filtering/04-project-specific-filtering.Rmd#L110

     The "other" fusions are now removed only here, while scavenging the fusions for recurrent non-oncogenic fusions:
     https://github.com/kgaonkar6/OpenPBTA-analysis/blob/155045684646a57920d684577db6148aaeec3b6d/analyses/fusion_filtering/04-project-specific-filtering.Rmd#L146

  2. Corrected the spelling in run_script here: https://github.com/kgaonkar6/OpenPBTA-analysis/blob/155045684646a57920d684577db6148aaeec3b6d/analyses/fusion_filtering/run_fusion_merged.sh#L53

  3. Added IGH-@, IGH@, IGL-@ and IGL@ to the reference list as oncogenic genes:
     https://github.com/kgaonkar6/OpenPBTA-analysis/blob/add_other_fusion/analyses/fusion_filtering/references/genelistreference.txt

What GitHub issue does your pull request address?

Updated analysis: Fusion Filtering
#552

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

@jaclyn-taroni @jharenza

Which areas should receive a particularly close look?

Code review of project specific filtering because of the "other" inclusion for putative-oncogene fusions.

Is there anything that you want to discuss further?

Please review the filtering process and its implementation:

Putative Driver:

Filtering for general cancer-specific genes (after QC+expression_filtering and removing LOCAL_REARRANGEMENT|LOCAL_INVERSION as potential read-throughs):
Fusions with genes in either onco from 02 script in columns Gene1A_anno, Gene1B_anno, Gene2A_anno, Gene2B_anno

Scavenge back filtered fusions to add to putative oncogenic fusions (after QC+expression_filtering and removing LOCAL_REARRANGEMENT|LOCAL_INVERSION as potential read-throughs):

In-frame/frameshift fusions called in at least 2 samples per histology OR
In-frame/frameshift fusions called by at least 2 callers
AND
Remove filtered fusions found in more than 1 histology OR
Remove filtered fusions with a multi-fused gene (a gene fused more than 5 times in a sample)
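
For reviewers who want to check the logic outside the notebook, the rules above can be sketched in plain Python. This is only an illustration, not the code in 04-project-specific-filtering.Rmd: the record keys `FusionName`, `Sample`, `Caller`, `Histology`, `Gene1A` and `Gene1B`, the annotation value `"Oncogene"`, and both function names are assumptions made for this example.

```python
from collections import defaultdict

ANNO_COLUMNS = ("Gene1A_anno", "Gene1B_anno", "Gene2A_anno", "Gene2B_anno")
FRAMED = {"in-frame", "frameshift"}


def is_putative_driver(fusion, label="Oncogene"):
    """True if any annotation column carries the (assumed) oncogene label.

    Fusion_Type is deliberately not checked, so "other" fusions are kept.
    """
    return any(label in fusion.get(col, "") for col in ANNO_COLUMNS)


def scavenge(fusions, multi_fused_cutoff=5):
    """Return non-driver in-frame/frameshift fusions that pass the recurrence rules."""
    # "other" fusions are dropped here only, not in the driver step.
    pool = [f for f in fusions
            if not is_putative_driver(f) and f["Fusion_Type"] in FRAMED]

    samples = defaultdict(set)      # (fusion, histology) -> samples
    callers = defaultdict(set)      # (fusion, sample) -> callers
    histologies = defaultdict(set)  # fusion -> histologies
    partners = defaultdict(set)     # (sample, gene) -> fusions involving the gene
    for f in pool:
        name = f["FusionName"]
        samples[(name, f["Histology"])].add(f["Sample"])
        callers[(name, f["Sample"])].add(f["Caller"])
        histologies[name].add(f["Histology"])
        for gene in (f["Gene1A"], f["Gene1B"]):
            partners[(f["Sample"], gene)].add(name)

    kept = []
    for f in pool:
        name, sample = f["FusionName"], f["Sample"]
        recurrent = (len(samples[(name, f["Histology"])]) >= 2
                     or len(callers[(name, sample)]) >= 2)
        multi_histology = len(histologies[name]) > 1
        multi_fused = any(len(partners[(sample, g)]) > multi_fused_cutoff
                          for g in (f["Gene1A"], f["Gene1B"]))
        if recurrent and not (multi_histology or multi_fused):
            kept.append(f)
    return kept
```

In this sketch the putative oncogenic list would be the driver fusions plus whatever `scavenge` returns, so an "other" fusion can only enter through the driver step.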

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Results

What types of results are included (e.g., table, figure)?

results/
FilteredFusion.tsv
pbta-fusion-putative-oncogenic.tsv
pbta-fusion-recurrent-fusion-byhistology.tsv
pbta-fusion-recurrent-fusion-bysample.tsv
pbta-fusion-recurrently-fused-genes-byhistology.tsv
pbta-fusion-recurrently-fused-genes-bysample.tsv

What is your summary of the results?

pbta-fusion-putative-oncogenic.tsv now contains 4354 fusions.
IGH-@--MYC, which is a known fusion, is now also captured in pbta-fusion-putative-oncogenic.tsv.

Reproducibility Checklist

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.

Documentation Checklist

  • This analysis module has a README and it is up to date.
  • This analysis is recorded in the table in analyses/README.md and the entry is up to date.
  • The analytical code is documented and contains comments.

@jaclyn-taroni jaclyn-taroni self-requested a review February 21, 2020 14:49
@kgaonkar6 kgaonkar6 changed the title from "edits as per issue #552" to "Fusion Filtering ("other" reading frame ) edits as per issue #552" on Feb 21, 2020
@jaclyn-taroni
Member

@kgaonkar6 – checking my understanding – put a slightly different way, it's alright for putative oncogenic fusions to not be inframe or frameshift gene fusions, is that correct?

@kgaonkar6
Collaborator Author

Yes, because the reading-frame calls (in-frame/frameshift) are only predictions from the algorithm, and when a prediction cannot be made the callers put ".", which we re-annotated as "other". So, from my understanding, a fusion involving an oncogene can still be real, as in the case of IGH-MYC, which is a known fusion (@jharenza mentioned that it is known to be inframe/frameshift in the literature) but for which StarFusion couldn't predict the frame for the sample.
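
As a minimal sketch of that re-annotation (not the module's actual code): the raw values listed below are assumptions about what the callers emit, and `reannotate_frame` is a hypothetical helper name.

```python
# Assumed raw reading-frame values mapped to the Fusion_Type labels used in this PR.
FRAME_LABELS = {
    "INFRAME": "in-frame",
    "IN-FRAME": "in-frame",
    "FRAMESHIFT": "frameshift",
    "OUT-OF-FRAME": "frameshift",
}


def reannotate_frame(raw):
    """Return the Fusion_Type label; anything unrecognised, including ".", becomes "other"."""
    return FRAME_LABELS.get(raw.strip().upper(), "other")
```

Under this mapping a fusion whose frame could not be predicted is labelled "other" rather than dropped, which is why the oncogene filter must not filter on Fusion_Type.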

@kgaonkar6 kgaonkar6 changed the title from "Fusion Filtering ("other" reading frame ) edits as per issue #552" to "Fusion Filtering ("other" reading frame ) edits" on Feb 21, 2020
@kgaonkar6 kgaonkar6 mentioned this pull request Feb 21, 2020
Member

@jaclyn-taroni jaclyn-taroni left a comment

The reordering implemented here appears to match what is described. I would like to see the TODO added to the shell script before merging. I also had a comment about the handling of the inconsistent gene symbols, but I think that can be addressed in a future pull request.

@@ -6752,3 +6752,7 @@
"6751" "GAK" "PfamKinase" "Kinase"
"6752" "PIK3C2B" "PfamKinase" "Kinase"
"6753" "HUNK" "PfamKinase" "Kinase"
"6754" "IGH@" "addedToallOnco_Feb2017.tsv" "Oncogene"
Member

I think handling it this way is fine for now. However, the better solution/design decision is to deal with these atypical or inconsistent gene symbols in the standardization steps. I can imagine a situation where you have some reference file that essentially contains the genes that need to be recoded and what they should be standardized to for each caller you support.

Collaborator Author

Thanks for reviewing! Yes, I agree that would be a better idea going forward; I will add that as a future PR.
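
One possible shape for the per-caller recode reference discussed in this thread, sketched in Python. Everything here is hypothetical: the table contents, the choice of "IGH"/"IGL" as the standardized symbols, the caller keys, and the function names are assumptions for illustration, not a design agreed in this PR.

```python
# Hypothetical recode table: caller -> {raw symbol -> standardized symbol}.
RECODE_BY_CALLER = {
    "STARFUSION": {"IGH@": "IGH", "IGH-@": "IGH", "IGL@": "IGL", "IGL-@": "IGL"},
    "ARRIBA": {"IGH@": "IGH", "IGL@": "IGL"},
}


def standardize_symbol(symbol, caller):
    """Return the standardized symbol, or the input unchanged if no rule applies."""
    return RECODE_BY_CALLER.get(caller.upper(), {}).get(symbol, symbol)


def standardize_fusion_name(fusion_name, caller, sep="--"):
    """Recode each partner gene of a fusion name such as GENE1--GENE2."""
    return sep.join(standardize_symbol(gene, caller)
                    for gene in fusion_name.split(sep))
```

With a table like this loaded from a reference file, the gene list would only need the standardized symbol rather than one entry per spelling.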

kgaonkar6 and others added 2 commits February 21, 2020 11:57
@jaclyn-taroni jaclyn-taroni merged commit 31fe795 into AlexsLemonade:master Feb 21, 2020
@jharenza
Collaborator

@jaclyn-taroni I asked @kgaonkar6 to dig into why the algorithms could not predict the frames. However, some of the fusions in this category are canonical fusions, e.g. IGH-MYC, and in some cases are either inframe or frameshift but still oncogenic. This came up in our lymphoma sample, and this fusion is canonical in certain leukemias and lymphomas. I am imagining that if we ran additional fusion algorithms, we might have a frame assigned, so I am not sure why these are the way they are.

@kgaonkar6
Collaborator Author

@jaclyn-taroni and @jharenza, from going through the code for arriba (STAR-Fusion seemed a little too complicated, with Perl and multiple utils scripts etc.), it looks to me like the reading frame is detected by looking for specific features of the pileup of the aligned chimeric DNA sequence and then the predicted peptide. There seem to be many conditions under which the tools will not be able to detect the frame:

First, the tool checks whether any coding exons can be predicted between the transcript and the breakpoint. If no protein-coding region is detected, the sequence is ".", which means no frame information can be predicted either.
https://github.com/suhrig/arriba/blob/ca1d40b0575e958243fe2e7fd28acd54de038349/source/output_fusions.cpp#L201
If a 5' gene cannot be predicted, then the frame also cannot be detected, because the tool looks for a start codon in the 5' gene to predict the frame in the peptide sequence
https://github.com/suhrig/arriba/blob/ca1d40b0575e958243fe2e7fd28acd54de038349/source/output_fusions.cpp#L803
If the breakpoints cannot be determined on contigs of the reference assembly:
https://github.com/suhrig/arriba/blob/ca1d40b0575e958243fe2e7fd28acd54de038349/source/output_fusions.cpp#L810
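
The three conditions above can be summarised as a small decision sketch in Python. This is my reading of the behaviour described in this comment, not a port of arriba's C++ code; the function and parameter names are hypothetical.

```python
def unpredictable_frame_reasons(has_coding_region, has_five_prime_gene,
                                breakpoints_on_reference_contigs):
    """List the described conditions that would prevent a frame prediction."""
    reasons = []
    if not has_coding_region:
        reasons.append("no coding exons between transcript and breakpoint")
    if not has_five_prime_gene:
        reasons.append("no 5' gene, so no start codon to anchor the frame")
    if not breakpoints_on_reference_contigs:
        reasons.append("breakpoints not on contigs of the reference assembly")
    return reasons


def frame_label(predicted_frame, **checks):
    """Return "." when any condition blocks prediction, else the predicted frame."""
    return "." if unpredictable_frame_reasons(**checks) else predicted_frame
```

Any fusion that ends up with "." here is what the filtering module re-annotates as "other".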

@kgaonkar6 kgaonkar6 deleted the add_other_fusion branch December 8, 2020 22:49