This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Molecular Subtyping - ATRT Compare GISTIC results #344

Closed

cbethell
Contributor

Purpose/implementation Section

To compare the GISTIC calls for SMARCB1 deletions in ATRT samples with the current calls.

What was your approach?

I merged the relevant metadata to categorize the GISTIC data and plotted the GISTIC calls with the current calls in ATRT samples.

What GitHub issue does your pull request address?

This PR addresses issue #244.

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

  • Does this analysis appear to be correct?
  • Should there be additional plots/results?

Which areas should receive a particularly close look?

  • Is there anything that the analysis may be missing?

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes, this analysis is ready for review.

Results

What types of results are included (e.g., table, figure)?

This analysis produces a plot within the html output of the R notebook in this PR.

What is your summary of the results?

The GISTIC calls seem to consist of many gain values and do not significantly agree with the current calls, thus making me suspicious of the methods used in this analysis.

Reproducibility Checklist

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.

- add nb to compare GISTIC results
- add nb to shell script
- add GISTIC input file to `data` directory in this module
Collaborator

@cansavvy cansavvy left a comment

This looks like a great start! I have a few suggestions and requests for some more comments. This may be in part because I haven't seen your first two notebooks, so jumping in now I'm not sure what these different files are. Also, would you be able to provide a link to the rendered HTML? I'm curious what your plot looks like and may have more comments after seeing the rendered version.

## Directories and Files

```{r}
# Detect the ".git" folder -- this will be in the project root directory.
Collaborator

This shouldn't be necessary since you are using a notebook, right? I would just hard-code the file path using `..`.

Kids_First_Biospecimen_ID,
tumor_ploidy)

# Read in gistic broad value data
Collaborator

I assume this will have to change whenever the GISTIC files are added to the official data? Maybe add a TODO so we don't forget.

dplyr::filter(sample_id %in% final_df$sample_id)

# Make the GISTIC data numeric
transposed_gistic$V1 <- as.numeric(transposed_gistic$V1)
Collaborator

Is this something you could do using mutate and add it to the steps above?
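
A minimal sketch of how the conversion could be folded into the pipeline with `dplyr::mutate()`. The two data frames below are hypothetical stand-ins for the notebook's `final_df` and `transposed_gistic`; only the column names `sample_id` and `V1` come from the snippet above.

```r
library(dplyr)

# Hypothetical stand-ins for the notebook's objects (not the real data)
final_df <- data.frame(sample_id = c("s1", "s2"), stringsAsFactors = FALSE)
transposed_gistic <- data.frame(
  sample_id = c("s1", "s2", "s3"),
  V1 = c("-1", "0", "1"),
  stringsAsFactors = FALSE
)

transposed_gistic <- transposed_gistic %>%
  # Keep only the samples that are present in final_df
  dplyr::filter(sample_id %in% final_df$sample_id) %>%
  # Make the GISTIC data numeric in the same chain of steps
  dplyr::mutate(V1 = as.numeric(V1))
```

This keeps the filtering and the type conversion in one place, so there is no separate `$<-` assignment to track further down.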


# Read in gistic broad value data
gistic_focal_data <-
data.table::fread(
Collaborator

Are these files really large? Is that why you are using fread for these CSV files? I don't know if this will help, but if you want them to be a data.frame off the bat, you can use data.table = FALSE. I'm not yet sure whether this will influence anything downstream, though.

Contributor Author

Yes, that is the reason why I am using fread for these files. I added the data.table = FALSE argument in the most recent commit but I still needed the as.data.frame function in one instance downstream.
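
For reference, a small sketch of the `data.table = FALSE` behavior discussed here. The file contents and the object name `gistic_df` are made up for illustration; only `data.table::fread()` and its `data.table` argument come from this thread.

```r
# Write a tiny tab-separated file to a temporary location (illustrative data only)
tmp <- tempfile(fileext = ".tsv")
writeLines(
  c("gene\tsample_a\tsample_b",
    "SMARCB1\t-1\t0"),
  tmp
)

# With data.table = FALSE, fread returns a plain data.frame
gistic_df <- data.table::fread(tmp, data.table = FALSE)
class(gistic_df)
```

With the default `data.table = TRUE`, the returned object would also carry the `data.table` class, which is one possible reason an `as.data.frame()` call is still needed downstream when an object was read without this argument.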

date: 2019
---

This notebook addresses the issue of molecular subtyping ATRT samples.
Collaborator

Can you make this more specific to the purpose of this exact notebook? It looks like its purpose is to wrangle the data for this overall purpose?

)
```

# Filter GISTIC data for ATRT samples
Collaborator

A lot is happening in this section. Can you add some more comments about the requirements for what you are trying to end up with? I think we might be able to make this a tad easier to follow or more streamlined, but it's hard to say at first glance without having looked at the previous notebooks.

- Add more descriptive comments
- Change file paths 
- Add a TODO for reading in GISTIC files 
- Make purpose of nb more specific
- Add `data.table = FALSE` argument
@cbethell
Contributor Author

This looks like a great start! I have a few suggestions and requests for some more comments. This may be in part because I haven't seen your first two notebooks, so jumping in now I'm not sure what these different files are. Also, would you be able to provide a link to the rendered HTML? I'm curious what your plot looks like and may have more comments after seeing the rendered version.

Thank you for the review @cansavvy! I tried to make the comments more descriptive but I am not sure how successful I was, so let me know if there is anything I can be more clear on.

Here is the rendered html output.

@cansavvy
Collaborator

cansavvy commented Dec 17, 2019

@cbethell! Fantastic job incorporating my comments! Now that I have a better idea of what's going on and have seen the HTML, I have more questions.

  1. First, I want to confirm the scientific question of this notebook (since I haven't been following this analysis or its accompanying issues). You are trying to compare the Focal CN analysis conclusion with the GISTIC CN conclusion? Do you know what is expected here and how these methods might be different or the same?

  2. Assuming that this comparison is the meat of your scientific question, I think the preferred presentation of this data would be a contingency table instead of your stacked barplot.

  3. I'm not very familiar with the output of GISTIC or the focal CN analysis, but would it make sense to also compare the copy numbers that each method calls, using some kind of scatterplot? Or more generally, are there other ways we can look at this data beyond just a categorical "gain", "loss", or "neutral"? It may be a good idea to take multiple approaches to comparing the output if possible, so that we can be a tad more informed about how these methods agree or disagree.

@cbethell
Contributor Author

@cansavvy to answer your above questions,

  1. I want to first confirm the scientific question of this notebook (since I haven't been following this analysis or its accompanying issues). You are trying to compare the Focal CN analysis conclusion with the GISTIC CN conclusion? Do you know what is expected here and how these methods might be different or the same?

I am not 100% sure what is expected here, as there is not much documentation on the methods used by GISTIC, but I would have expected them to agree more than they do in the stacked barplot.

  2. Assuming that this comparison is the meat of your scientific question I think the preferred presentation of this data would be a contingency table instead of your stacked barplot.

That assumption would be correct for this PR. I can implement this change in the upcoming commit.

  3. I'm not very familiar with the output of GISTIC or the focal CN analysis but would it make sense to also compare how many copy numbers that each method calls and do this with some kind of scatterplot? Or more generally, is there some other ways we can be looking at this data beyond just a categorical "gain", "loss", or "neutral"? May be a good idea to take multiple approaches to compare the output if this is possible, that way we can be a tad more informed about how these methods agree or disagree.

Good idea. I can dive deeper into this comment and implement the scatterplot comparing the copy numbers and other approaches to compare the output of the two methods.

@cbethell
Contributor Author

Here is the updated rendered output.

@cansavvy
Collaborator

cansavvy commented Dec 19, 2019

Per our in-person discussion, @cbethell:

  • I think because of the small number of samples here, a contingency table is the best way to represent this even though it still is hard to interpret.
  • Additionally, if you can make sure to put the categories in the order of Loss, Neutral, Gain, that would help interpretability.
  • I think the CN scatterplot is good to see, so you should keep it, but I think the "status" scatterplot is not super helpful, so I would drop that one.
  • One last super minor style comment, can you apply a theme so the ugly gray background isn't there?
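
The ordering suggestion above can be sketched as follows. The data frame `calls`, its column names, and the lowercase status labels are hypothetical; the notebook's actual objects may differ.

```r
# Desired category order: Loss, Neutral, Gain
status_levels <- c("loss", "neutral", "gain")

# Hypothetical per-sample calls from the two methods (illustrative values only)
calls <- data.frame(
  focal_status  = c("loss", "loss", "neutral", "gain"),
  gistic_status = c("loss", "gain", "neutral", "gain"),
  stringsAsFactors = FALSE
)

# Setting explicit factor levels fixes the row and column order,
# and keeps categories with zero counts in the table
calls$focal_status  <- factor(calls$focal_status,  levels = status_levels)
calls$gistic_status <- factor(calls$gistic_status, levels = status_levels)

contingency <- table(focal = calls$focal_status, gistic = calls$gistic_status)
contingency
```

For the plot background, adding a built-in theme such as `ggplot2::theme_bw()` to the ggplot call is one option.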

@jaclyn-taroni
Member

My interpretation of #244 (comment)

@jaclyn-taroni do you and @cbethell want to see if the results from gistic for CNVkit: s3://kf-openaccess-us-east-1-prd-pbta/data/2019-12-10-gistic-results-cnvkit.zip broad_values_by_arm.txt results make sense with the current SMARCB1 deletions found/be good enough for this analysis? If so, we can release these results in the next data release.

is that we should be using the broad_values_by_arm.txt file to look at the following (from the ATRT TYR section):

broad SMARCB1 deletions (most have chr22q loss/monosomy 22)

Here you are using ControlFreeC files for the comparison, whereas GISTIC used CNVkit as input, so I'm not necessarily surprised if you don't see a lot of agreement. That is part of the rationale for #128. There is also the question of how the GISTIC gene symbol mapping step happens, which may contribute to any discrepancies.

I think we want to add a step to 00-subset-files-for-ATRT.R that subsets broad_values_by_arm.txt to ATRT samples and the relevant chromosome arms only. You may still need to use the ploidy information in the histologies file. (It also may make sense to then use the CNVkit file to generate atrt_subset/atrt_focal_cn.tsv.gz instead of the ControlFreeC file until we have consensus calls.) We want to include the chr 22 information in the final table generated in 01-ATRT-molecular-subtyping-data-prep.Rmd.

I am going to close this pull request in the interest of focusing on getting the chr 22 information into the final ATRT subtyping table.

@cbethell cbethell deleted the atrt-subtyping-gistic-comparison branch February 6, 2020 20:42