This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Proposed Analysis: GISTIC vs. focal-cn-file-preparation comparison #560

Open
cbethell opened this issue Feb 26, 2020 · 3 comments

@cbethell
Contributor

cbethell commented Feb 26, 2020

What are the scientific goals of the analysis?

The scientific goal of this analysis is to gather the evidence needed to assess whether the way we handle copy number data in this project is reasonable and the best approach available to us.

To do this, we will compare our calls to GISTIC's calls. More specifically, we want to compare the GISTIC gene-level status calls (and, later, cytoband status calls) with the focal CN calls we prepare in the focal-cn-file-preparation module. GISTIC is widely used, and although it relies on recurrence and may not be appropriate for every histology in our cohort, we want to see whether our focal-cn-file-preparation analysis gives the same answer that GISTIC would. Once we've collected this evidence, we will get an expert's opinion on how to interpret it.

What methods do you plan to use to accomplish the scientific goals?

I plan to execute this analysis in multiple steps/notebooks:

    1. The first notebook will tidy the GISTIC data files (all_lesions.conf_90.txt, amp_genes.conf_90.txt, and del_genes.conf_90.txt). This is done to format the GISTIC data in a way that is comparable with the gene level calls in the focal-cn-file-preparation/results files. The output will therefore be a table that can be consumed in the second notebook for comparison to our focal CN files. The table will tentatively have the following columns:

      | gene_symbol | Kids_First_Biospecimen_ID | status | detection_peak |
      |-------------|---------------------------|--------|----------------|
      | TBXAS1 | BS_xxxxxx | gain | Amplification Peak 3 |

      where gene_symbol values are retrieved from the
      amp_genes.conf_90.txt/del_genes.conf_90.txt files, Kids_First_Biospecimen_ID and
      status values are retrieved from the all_lesions.conf_90.txt file, and detection_peak
      values are matched between the amp/del files and the all_lesions file.

    This notebook will also produce a separate table with GISTIC's cytoband data for comparison
    to our cytoband status calls once they are generated (Updated analysis: generate cytoband copy number status file for consumption #497). The table will have the following
    columns:

    | cytoband | Kids_First_Biospecimen_ID | status |
    |----------|---------------------------|--------|
    2. The second notebook will take the tidy GISTIC data and, for an individual histology (the focus will be LGAT for now), count the number of samples found in each amplification/deletion peak for the corresponding genes (those found in the same amplification/deletion peak according to the amp_genes.conf_90.txt and del_genes.conf_90.txt files).
      The output of the notebook will look similar to this sketch outlined by @jaclyn-taroni:
      [Image: gistic-comparison-sketch]
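
To make the shape of the step 1 output concrete, here is a minimal Python sketch of the tidying logic. It is only an illustration under several assumptions: the GISTIC files are taken to be already parsed into simple in-memory structures, `peak_key` is a hypothetical identifier shared between the amp/del files and the all_lesions file, any call greater than 0 is treated as an alteration, and the `loss` and `neutral` status labels are placeholders (only `gain` appears in the example table above). The sample IDs are made up, and the actual notebook may differ.

```python
def tidy_gistic(lesions, peak_genes):
    """Build long-format rows: one row per gene, sample, and peak.

    lesions    -- list of dicts, one per peak row (assumed pre-parsed from
                  all_lesions.conf_90.txt), each with a peak name, a
                  hypothetical "peak_key", and per-sample integer calls.
    peak_genes -- dict mapping the same hypothetical "peak_key" to the gene
                  symbols listed for that peak in the amp/del genes files.
    """
    rows = []
    for lesion in lesions:
        peak = lesion["peak"]
        # Assumption: the peak name encodes the direction of the alteration.
        direction = "gain" if peak.startswith("Amplification") else "loss"
        genes = peak_genes.get(lesion["peak_key"], [])
        for sample_id, call in lesion["calls"].items():
            # Assumption: any call above 0 counts as altered in this peak.
            status = direction if call > 0 else "neutral"
            for gene in genes:
                rows.append({
                    "gene_symbol": gene,
                    "Kids_First_Biospecimen_ID": sample_id,
                    "status": status,
                    "detection_peak": peak,
                })
    return rows


# Toy inputs; the sample IDs and peak key are placeholders, not real data.
lesions = [{
    "peak": "Amplification Peak 3",
    "peak_key": "peak_A",
    "calls": {"BS_aaaaaa": 1, "BS_bbbbbb": 0},
}]
peak_genes = {"peak_A": ["TBXAS1"]}

tidy = tidy_gistic(lesions, peak_genes)
for row in tidy:
    print(row)
```

Keeping the table in long format (one row per gene and sample) should make it straightforward to join against the gene-level calls in the focal CN file on gene symbol and biospecimen ID.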

These steps were broadly attempted in open PR #559.
This PR will now be adapted to implement step 1 of the plan above and a separate PR will be filed to address step 2.
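
For step 2, the counting could look roughly like the Python sketch below. It assumes the tidy table from step 1 is available as a list of row dicts and that a sample-to-histology lookup exists; `histology_by_sample`, the `"LGAT"` and `"Other"` labels, the `neutral` status, and the sample IDs are all hypothetical stand-ins rather than the project's actual column values.

```python
from collections import defaultdict


def count_samples_per_peak(tidy_rows, histology_by_sample, histology):
    """Count distinct altered samples per (peak, gene, status) in one histology."""
    samples = defaultdict(set)
    for row in tidy_rows:
        sample_id = row["Kids_First_Biospecimen_ID"]
        # Keep only samples from the histology of interest.
        if histology_by_sample.get(sample_id) != histology:
            continue
        # Assumption: "neutral" marks samples without a call in this peak.
        if row["status"] == "neutral":
            continue
        key = (row["detection_peak"], row["gene_symbol"], row["status"])
        samples[key].add(sample_id)  # a set avoids double-counting a sample
    return {key: len(ids) for key, ids in samples.items()}


def make_row(sample_id, status):
    # Helper for building toy rows in the step 1 output format.
    return {
        "gene_symbol": "TBXAS1",
        "Kids_First_Biospecimen_ID": sample_id,
        "status": status,
        "detection_peak": "Amplification Peak 3",
    }


# Toy data; sample IDs and histology labels are placeholders.
tidy_rows = [
    make_row("BS_aaaaaa", "gain"),
    make_row("BS_bbbbbb", "gain"),
    make_row("BS_cccccc", "gain"),     # belongs to a different histology
    make_row("BS_dddddd", "neutral"),  # LGAT sample without a call
]
histology_by_sample = {
    "BS_aaaaaa": "LGAT",
    "BS_bbbbbb": "LGAT",
    "BS_cccccc": "Other",
    "BS_dddddd": "LGAT",
}

counts = count_samples_per_peak(tidy_rows, histology_by_sample, "LGAT")
print(counts)
```

The same function could be run on the focal CN calls, reshaped to the same columns, so the two sets of counts can be placed side by side.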

What input data are required for this analysis?

The input data required for this includes:

  • analyses/focal-cn-file-preparation/results/consensus_seg_annotated_cn_autosomes.tsv.gz
  • analyses/run-gistic/results/pbta-cnv-consensus-lgat-gistic.zip
  • analyses/run-gistic/results/pbta-cnv-consensus-gistic.zip
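
Since the GISTIC results are zipped, one way to pull a single file out of an archive without unpacking it is sketched below in Python, using only the standard library. The internal folder layout of the zip files is not described here, so the sketch looks the file up by name suffix; it also assumes the files are tab-delimited text. `read_gistic_table` is a hypothetical helper name, and the archive in the demo is a toy one built in memory.

```python
import csv
import io
import zipfile


def read_gistic_table(archive, suffix):
    """Return the rows of the one archive member whose name ends with `suffix`.

    `archive` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        matches = [name for name in zf.namelist() if name.endswith(suffix)]
        if len(matches) != 1:
            raise ValueError(
                f"expected exactly one member ending in {suffix!r}, "
                f"found {len(matches)}"
            )
        with zf.open(matches[0]) as handle:
            text = io.TextIOWrapper(handle, encoding="utf-8", newline="")
            # Assumption: the GISTIC output is tab-delimited text.
            return list(csv.reader(text, delimiter="\t"))


# Demo with a toy in-memory archive; folder name and contents are made up.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr(
        "toy-gistic/all_lesions.conf_90.txt",
        "Unique Name\tBS_aaaaaa\nAmplification Peak 1\t1\n",
    )
buffer.seek(0)

rows = read_gistic_table(buffer, "all_lesions.conf_90.txt")
print(rows)
```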

How long do you expect is needed to complete the analysis? Will it be a multi-step analysis?

Step 1 ~1 day
Step 2 ~2 days
(rough estimates)

Who will complete the analysis (please add a GitHub handle here if relevant)?

@cbethell

What relevant scientific literature relates to this analysis?

GISTIC's docs

@jaclyn-taroni
Member

@cbethell can you say a little bit more about what the output you mention in the quote below will look like:

> The first notebook will tidy the GISTIC data files (all_lesions.conf_90.txt, amp_genes.conf_90.txt, and del_genes.conf_90.txt). This is done to format the GISTIC data in a way that is comparable with the gene level calls in the focal-cn-file-preparation/results files. The output will therefore be a table that can be consumed in the second notebook for comparison to our focal CN files.

Either a sketch or an example markdown table works.

@cbethell
Contributor Author

> @cbethell can you say a little bit more about what the output you mention in the quote below will look like:
>
> > The first notebook will tidy the GISTIC data files (all_lesions.conf_90.txt, amp_genes.conf_90.txt, and del_genes.conf_90.txt). This is done to format the GISTIC data in a way that is comparable with the gene level calls in the focal-cn-file-preparation/results files. The output will therefore be a table that can be consumed in the second notebook for comparison to our focal CN files.
>
> Either a sketch or an example markdown table works.

@jaclyn-taroni does the following update to the original comment suffice?

    1. The first notebook will tidy the GISTIC data files (all_lesions.conf_90.txt, amp_genes.conf_90.txt, and del_genes.conf_90.txt). This is done to format the GISTIC data in a way that is comparable with the gene level calls in the focal-cn-file-preparation/results files. The output will therefore be a table that can be consumed in the second notebook for comparison to our focal CN files. The table will tentatively have the following columns:

      | gene_symbol | Kids_First_Biospecimen_ID | status | detection_peak |
      |-------------|---------------------------|--------|----------------|
      | TBXAS1 | BS_xxxxxx | gain | Amplification Peak 3 |

      where gene_symbol values are retrieved from the
      amp_genes.conf_90.txt/del_genes.conf_90.txt files, Kids_First_Biospecimen_ID and
      status values are retrieved from the all_lesions.conf_90.txt file, and detection_peak
      values are matched between the amp/del files and the all_lesions file.

@jaclyn-taroni
Member

Yep, looks good to me. Thanks for updating!
