Proposed Analysis: GISTIC vs. focal-cn-file-preparation comparison #560

cbethell · 2020-02-26T17:48:24Z

What are the scientific goals of the analysis?

The goal of this analysis is to gather the evidence needed to assess whether the way we handle copy number data in this project is reasonable, and whether it is the best approach available to us.

To do this, we will compare our calls to GISTIC's calls. More specifically, we want to compare the GISTIC gene-level status calls (and, later, cytoband status calls) with the focal CN calls we prepare in the focal-cn-file-preparation module. GISTIC is widely used, and although it relies on recurrence and may not be appropriate for every histology in our cohort, we want to see whether our focal-cn-file-preparation analysis gives the same answer that GISTIC would. Once we've collected this evidence, we will get an expert's opinion on its interpretation.
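
As a rough illustration of what "gives us the same answer" could mean in practice, here is a minimal Python sketch that measures agreement between two sets of gene-level calls. The dictionary layout keyed by (gene, sample), the function name `call_concordance`, and the toy status labels are all assumptions made for illustration; they are not the project's actual data structures or code.

```python
def call_concordance(gistic_calls, focal_calls):
    """Summarize agreement on the (gene, sample) pairs present in both inputs.

    Both arguments are hypothetical dicts mapping
    (gene_symbol, Kids_First_Biospecimen_ID) -> status string.
    """
    # Only pairs called by both methods can be compared directly.
    shared = gistic_calls.keys() & focal_calls.keys()
    agree = sum(1 for key in shared if gistic_calls[key] == focal_calls[key])
    return {
        "shared": len(shared),
        "agree": agree,
        # None when there is nothing to compare, to avoid dividing by zero.
        "fraction": agree / len(shared) if shared else None,
    }


# Toy inputs; gene names other than TBXAS1, sample IDs, and statuses are made up.
gistic_calls = {
    ("TBXAS1", "BS_1"): "gain",
    ("TBXAS1", "BS_2"): "gain",
    ("GENE_B", "BS_1"): "loss",
}
focal_calls = {
    ("TBXAS1", "BS_1"): "gain",
    ("TBXAS1", "BS_2"): "neutral",
    ("GENE_C", "BS_1"): "loss",
}
summary = call_concordance(gistic_calls, focal_calls)
```

Pairs that appear in only one of the two inputs are ignored here; whether those should count as disagreements is one of the questions the expert review would need to settle.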

What methods do you plan to use to accomplish the scientific goals?

I plan to execute this analysis in multiple steps/notebooks:

1. The first notebook will tidy the GISTIC data files (all_lesions.conf_90.txt, amp_genes.conf_90.txt, and del_genes.conf_90.txt). This is done to format the GISTIC data in a way that is comparable with the gene level calls in the focal-cn-file-preparation/results files. The output will therefore be a table that can be consumed in the second notebook for comparison to our focal CN files. The table will tentatively have the following columns:
  
  | gene_symbol | Kids_First_Biospecimen_ID | status | detection_peak |
  |---|---|---|---|
  | TBXAS1 | BS_xxxxxx | gain | Amplification Peak 3 |
  
  where gene_symbol values are retrieved from the
  amp_genes.conf_90.txt/del_genes.conf_90.txt files, Kids_First_Biospecimen_ID and
  status values are retrieved from the all_lesions.conf_90.txt file, and detection_peak
  values are matched between the amp/del files and the all_lesions file.
  This notebook will also produce a separate table with GISTIC's cytoband data for comparison to our cytoband status calls once they are generated (see "Updated analysis: generate cytoband copy number status file for consumption", #497). The table will have the following columns:

  | cytoband | Kids_First_Biospecimen_ID | status |
  |---|---|---|

2. The second notebook will take the tidy GISTIC data and, for an individual histology (the focus will be LGAT for now), count the number of samples found in each amplification/deletion peak for the corresponding genes (the genes assigned to that peak in the amp_genes.conf_90.txt and del_genes.conf_90.txt files).
  The output of the notebook will look similar to this sketch outlined by @jaclyn-taroni:

These steps were broadly attempted in open PR #559.
This PR will now be adapted to implement step 1 of the plan above and a separate PR will be filed to address step 2.
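
The two steps above can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the notebooks' actual code: it assumes the GISTIC files have already been parsed into two simple mappings (peak to genes, and peak to per-sample status), which is a simplification and not GISTIC's on-disk layout. The function names, `GENE_B`, and the second sample ID are hypothetical.

```python
from collections import Counter


def tidy_gistic(peak_genes, peak_sample_status):
    """Step 1: build one row per (gene, sample) pair for every detection peak.

    peak_genes: hypothetical dict, detection peak -> list of gene symbols
                (as would be derived from the amp/del genes files).
    peak_sample_status: hypothetical dict, detection peak -> {sample ID: status}
                (as would be derived from the all_lesions file).
    """
    rows = []
    for peak, genes in peak_genes.items():
        # A peak with no sample-level calls simply contributes no rows.
        sample_status = peak_sample_status.get(peak, {})
        for gene in genes:
            for sample_id, status in sample_status.items():
                rows.append({
                    "gene_symbol": gene,
                    "Kids_First_Biospecimen_ID": sample_id,
                    "status": status,
                    "detection_peak": peak,
                })
    return rows


def count_samples(rows):
    """Step 2: count samples per (detection_peak, gene_symbol, status)."""
    return Counter(
        (row["detection_peak"], row["gene_symbol"], row["status"])
        for row in rows
    )


# Toy inputs matching the example row in the table above.
peak_genes = {"Amplification Peak 3": ["TBXAS1", "GENE_B"]}
peak_sample_status = {
    "Amplification Peak 3": {"BS_xxxxxx": "gain", "BS_yyyyyy": "gain"},
}
rows = tidy_gistic(peak_genes, peak_sample_status)
counts = count_samples(rows)
```

Restricting to a single histology such as LGAT would amount to filtering the sample IDs before counting; that filter is left out because it depends on metadata not described here.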

What input data are required for this analysis?

The input data required for this includes:

analyses/focal-cn-file-preparation/results/consensus_seg_annotated_cn_autosomes.tsv.gz
analyses/run-gistic/results/pbta-cnv-consensus-lgat-gistic.zip
analyses/run-gistic/results/pbta-cnv-consensus-gistic.zip
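
Since the GISTIC results are distributed as zip archives, one way to pull a single tab-delimited file out of an archive without unpacking it is sketched below in Python. The helper name `read_gistic_table` is hypothetical, and the directory layout inside the real archives is not assumed: the helper matches on the file's base name only. The toy archive and its contents are made up for the demonstration.

```python
import csv
import io
import zipfile


def read_gistic_table(zip_source, file_name):
    """Return the rows of a tab-delimited file stored anywhere in a zip archive.

    zip_source may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        # Match on the base name so the enclosing folder does not matter.
        matches = [
            member for member in archive.namelist()
            if member.rsplit("/", 1)[-1] == file_name
        ]
        if not matches:
            raise FileNotFoundError(f"{file_name} not found in archive")
        with archive.open(matches[0]) as handle:
            text = io.TextIOWrapper(handle, encoding="utf-8", newline="")
            return list(csv.reader(text, delimiter="\t"))


# Build a tiny in-memory archive with made-up contents to exercise the helper.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "toy-gistic/all_lesions.conf_90.txt",
        "Unique Name\tBS_xxxxxx\nAmplification Peak 3\t1\n",
    )
buffer.seek(0)
table = read_gistic_table(buffer, "all_lesions.conf_90.txt")
```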

How long do you expect is needed to complete the analysis? Will it be a multi-step analysis?

Step 1 ~1 day
Step 2 ~2 days
(rough estimates)

Who will complete the analysis (please add a GitHub handle here if relevant)?

@cbethell

What relevant scientific literature relates to this analysis?

GISTIC's docs


jaclyn-taroni · 2020-02-26T18:07:01Z

@cbethell can you say a little bit more about what the output you mention in the quote below will look like:

The first notebook will tidy the GISTIC data files (all_lesions.conf_90.txt, amp_genes.conf_90.txt, and del_genes.conf_90.txt). This is done to format the GISTIC data in a way that is comparable with the gene level calls in the focal-cn-file-preparation/results files. The output will therefore be a table that can be consumed in the second notebook for comparison to our focal CN files.

Either a sketch or an example markdown table would work.

cbethell · 2020-02-26T18:33:57Z

> @cbethell can you say a little bit more about what the output you mention in the quote below will look like:
>
> > The first notebook will tidy the GISTIC data files (all_lesions.conf_90.txt, amp_genes.conf_90.txt, and del_genes.conf_90.txt). This is done to format the GISTIC data in a way that is comparable with the gene level calls in the focal-cn-file-preparation/results files. The output will therefore be a table that can be consumed in the second notebook for comparison to our focal CN files.
>
> Either a sketch or an example markdown table would work.

@jaclyn-taroni does the following update to the original comment now suffice?

The first notebook will tidy the GISTIC data files (all_lesions.conf_90.txt, amp_genes.conf_90.txt, and del_genes.conf_90.txt). This is done to format the GISTIC data in a way that is comparable with the gene level calls in the focal-cn-file-preparation/results files. The output will therefore be a table that can be consumed in the second notebook for comparison to our focal CN files. The table will tentatively have the following columns:

| gene_symbol | Kids_First_Biospecimen_ID | status | detection_peak |
|---|---|---|---|
| TBXAS1 | BS_xxxxxx | gain | Amplification Peak 3 |

where gene_symbol values are retrieved from the
amp_genes.conf_90.txt/del_genes.conf_90.txt files, Kids_First_Biospecimen_ID and
status values are retrieved from the all_lesions.conf_90.txt file, and detection_peak
values are matched between the amp/del files and the all_lesions file.

jaclyn-taroni · 2020-02-26T18:49:38Z

Yep, looks good to me. Thanks for updating!

cbethell added the proposed analysis label Feb 26, 2020

This was referenced Feb 26, 2020

Comparison of GISTIC gene level calls - Data Prep #559

Merged

Comparison of GISTIC gene level calls - Data Wrangling/Tables #600

Merged

cbethell mentioned this issue Mar 6, 2020

WIP: Investigate/Fix formatting of GISTIC's amp and del genes files #614

Closed

5 tasks

jaclyn-taroni mentioned this issue Mar 15, 2020

Proposed Analysis: comparison of GISTIC results (specific histology versus entire cohort) #547

Closed
