GISTIC comparison of cohort results vs individual histology results #549

cbethell · 2020-02-20T17:18:36Z

Purpose/implementation Section

The purpose of this PR is to identify, if any, disagreement between GISTIC results for the entire cohort versus the three individual histologies.

What scientific question is your analysis addressing?

This PR addresses the question of whether or not we should be handling analyses that incorporate the GISTIC results separately based on individual histology.

What was your approach?

My approach is to visualize the disagreements/agreements between the GISTIC result files for the entire cohort and compare this to each of the individual histologies' result files.
A slightly more detailed idea of my approach is found in the original comment on reference issue #547.

What GitHub issue does your pull request address?

This PR addresses issue #547.

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

The plots should receive a particularly close look.

Are these plots adequately representing the data in an informative way?
Are there suggestions for additional plot types that may be useful if implemented?

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes.

The plots can be seen in the rendered notebook of the R markdown file as it is currently formatted in this PR.

Results

What types of results are included (e.g., table, figure)?

Plots showing the agreements/disagreements between the scores.gistic files are included in the plots directory of this module.

What is your summary of the results?

This PR suggests that the results in the scores.gistic files and the amp/del_genes.conf_90.txt files differ between the entire cohort and the three individual histologies used for comparison.

Reproducibility Checklist

The dependencies required to run the code in this pull request have been added to the project Dockerfile. (No new dependencies required).
This analysis has been added to continuous integration.

Documentation Checklist

This analysis module has a README and it is up to date.
This analysis is recorded in the table in analyses/README.md and the entry is up to date.
The analytical code is documented and contains comments.

- rename name of notebook to be consistent with module name

- propagate changes to file path definitions

- remove section comparing sample seg count files (not beneficial to this analysis)

jaclyn-taroni

The score plotting looks good! I had some comments about improving the plot_amp_del_genes portion of the notebook.

jaclyn-taroni · 2020-02-21T13:05:35Z

analyses/gistic-cohort-vs-histology-comparison/gistic-cohort-vs-histology-comparison.Rmd

+
+### Custom function
+```{r}
+plot_amp_del_genes <-


I think what is being compared in the Venn diagram is entire strings? Imagine a situation where the wide peaks between the histology-specific run and the entire cohort GISTIC run have a lot of overlap but are not exactly the same. It could be the case that 70% of the genes from the wide peaks from either run would overlap but show up as a zero in the the Venn diagram. We would want the 70% of genes in both to be reflected in the Venn diagram. So what we're after here is looking at the overlap between vectors of gene symbols, where the vector will contain the unique set of gene symbols that are in any of the significant wide peaks. Does that make sense?

Yes, that does make sense.

jaclyn-taroni · 2020-02-21T13:12:11Z

analyses/gistic-cohort-vs-histology-comparison/gistic-cohort-vs-histology-comparison.Rmd

+    #  This function displays a Venn Diagram that represents the data that
+    #  overlaps/does not overlap between the two given files
+
+    # Read in files


You are treating the two files in the exact same way, which is a good indication to me that this can be wrapped up in a function. I can imagine a situation where you might want to examine what genes are in the vectors that underlie the Venn diagram, so you can separate things out into two function: 1) read in the file and process the vector and 2) create the Venn diagram.

With the first function, you'll only need to process the entire cohort results once.

jaclyn-taroni · 2020-02-21T13:15:55Z

analyses/gistic-cohort-vs-histology-comparison/gistic-cohort-vs-histology-comparison.Rmd

+    # Wrnagle data in cohort file for plotting
+    cohort_genes <- t(cohort_genes)
+
+    cohort_genes <- cohort_genes[-1,] %>%
+      as.data.frame() %>%
+      dplyr::mutate(summary_info = paste0(V1, ", ", V2, ", ", V3, ", ", V4)) %>%
+      dplyr::select(-c("V1", "V2", "V3", "V4")) %>%
+      dplyr::select(summary_info, dplyr::everything()) %>%
+      tidyr::gather(V5, V6,-summary_info) %>%
+      dplyr::group_by(summary_info) %>%
+      dplyr::mutate(genes = paste0(sort(unique(V6)), collapse = ", ")) %>%
+      dplyr::select(-V5,-V6) %>%
+      dplyr::distinct() %>%
+      dplyr::mutate(summary = paste0(summary_info, genes)) %>%
+      dplyr::filter(!(summary == "NA, NA, NA, NA")) %>%
+      dplyr::arrange()


For the amplification and deletion files, GISTIC produces "ragged columns" per their documentation. It is often helpful to use lists any time you find yourself in a situation where columns or rows are of different sizes and you have a whole bunch of trailing blanks.

Here's what I came up with for the first function mentioned above, getting the vector of genes of interest for the Venn diagram:

get_genes_vector <- function(genes_file) { genes_ragged_df <- data.table::fread(genes_file, data.table = FALSE) genes_list <- as.list(genes_ragged_df) genes_list <- genes_list %>% # This removes any element from the list that is all NA -- most likely a # result of reading in a ragged array purrr::discard(~ all(is.na(.))) %>% # This removes the element of the list that is essentially the "header" for # the file purrr::discard(~ any(str_detect(., "cytoband|q value"))) %>% # Remove any broad peaks with q-value > 0.05 purrr::discard(~ .[3] > 0.05) %>% # Remove everything before and including the wide peak boundaries for each # remaining element of the list purrr::modify(~ .[-c(1:(str_which(., "chr")))]) %>% # Remove blanks -- result of ragged data.frame purrr::modify(~ .[. != ""]) # This will give us the vector of all the genes that were included genes_vector <- unique(unname(unlist(genes_list))) }

Noted, I'll give this function a try and commit the changes.

- remove GISTIC qplots (does not seem beneficial to this notebook)

jaclyn-taroni · 2020-02-21T20:20:55Z

This looks ready to be documented to me @cbethell! We might want to consider how we could restructure all of the copy number modules such they are all under the same directory. For example I'm going to create a new one called expression-cna to try out some ideas about #387 rather than add to focal-cn-file-preparation. The restructuring is way beyond this scope of this particular pull request, but is something that's on my mind.

cbethell · 2020-02-21T20:27:55Z

This looks ready to be documented to me @cbethell! We might want to consider how we could restructure all of the copy number modules such they are all under the same directory. For example I'm going to create a new one called expression-cna to try out some ideas about #387 rather than add to focal-cn-file-preparation. The restructuring is way beyond this scope of this particular pull request, but is something that's on my mind.

@jaclyn-taroni sounds good to me! I started a README for this module but will likely update that and the analyses/README.md file in a separate PR and mark this one ready for review. I will also give the restructuring of the copy number modules some thought.

Would you also give the files in run-gistic/results another look (if you haven't already) to ensure that we covered all the possible file comparisons in this PR? I attempted to compare the all_lesions.conf_90.txt files as well but it appears that this would be redundant so I believe we covered all but a second opinion won't hurt.

jaclyn-taroni · 2020-02-21T20:55:33Z

@cbethell I would say the results here are sufficient to give me pause about using the GISTIC results from the entire cohort without including all_lesions.conf_90.txt. The data in all_lesions.conf_90.txt also seems like it would be more difficult to represent.

jaclyn-taroni

Looks good! One change to the notebook structure I'd like to see before merging.

jaclyn-taroni · 2020-02-21T21:12:35Z

analyses/gistic-cohort-vs-histology-comparison/gistic-cohort-vs-histology-comparison.Rmd

+                 "util",
+                 "generate-multipanel-plot-functions.R"))
+```
+


Can you stick all your functions under this section?

cbethell and others added 7 commits February 20, 2020 11:44

Add GISTIC comparison of cohort results vs individual histology results

3e9f6da

Add analysis to .circleci

402444a

- rename name of notebook to be consistent with module name

Load in tidyverse library

ba04105

Place this analysis after run-gistic analysis in .circleci

13a3aad

Merge branch 'master' into gistic-histology-cohort-comparison

7ee3d15

Add step to unzip GISTIC results folders

2c960fa

- propagate changes to file path definitions

Add plots comparing amp and del files

4f3d7b7

- remove section comparing sample seg count files (not beneficial to this analysis)

jaclyn-taroni reviewed Feb 21, 2020

View reviewed changes

refactor functions comparing amp and del gene files

87220e6

- remove GISTIC qplots (does not seem beneficial to this notebook)

Merge branch 'master' into gistic-histology-cohort-comparison

12d20d1

cbethell changed the title ~~WIP: GISTIC comparison of cohort results vs individual histology results~~ GISTIC comparison of cohort results vs individual histology results Feb 21, 2020

cbethell marked this pull request as ready for review February 21, 2020 20:29

jaclyn-taroni approved these changes Feb 21, 2020

View reviewed changes

Place all functions under the Functions heading

9db0edf

jaclyn-taroni merged commit e6b0438 into AlexsLemonade:master Feb 22, 2020

cbethell mentioned this pull request Feb 23, 2020

Add gistic-cohort-vs-histology-comparison README #554

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GISTIC comparison of cohort results vs individual histology results #549

GISTIC comparison of cohort results vs individual histology results #549

cbethell commented Feb 20, 2020 •

edited

Loading

jaclyn-taroni left a comment

jaclyn-taroni Feb 21, 2020

cbethell Feb 21, 2020

jaclyn-taroni Feb 21, 2020

jaclyn-taroni Feb 21, 2020

cbethell Feb 21, 2020

jaclyn-taroni commented Feb 21, 2020

cbethell commented Feb 21, 2020

jaclyn-taroni commented Feb 21, 2020

jaclyn-taroni left a comment

jaclyn-taroni Feb 21, 2020

GISTIC comparison of cohort results vs individual histology results #549

GISTIC comparison of cohort results vs individual histology results #549

Conversation

cbethell commented Feb 20, 2020 • edited Loading

Purpose/implementation Section

What scientific question is your analysis addressing?

What was your approach?

What GitHub issue does your pull request address?

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Results

What types of results are included (e.g., table, figure)?

What is your summary of the results?

Reproducibility Checklist

Documentation Checklist

jaclyn-taroni left a comment

Choose a reason for hiding this comment

jaclyn-taroni Feb 21, 2020

Choose a reason for hiding this comment

cbethell Feb 21, 2020

Choose a reason for hiding this comment

jaclyn-taroni Feb 21, 2020

Choose a reason for hiding this comment

jaclyn-taroni Feb 21, 2020

Choose a reason for hiding this comment

cbethell Feb 21, 2020

Choose a reason for hiding this comment

jaclyn-taroni commented Feb 21, 2020

cbethell commented Feb 21, 2020

jaclyn-taroni commented Feb 21, 2020

jaclyn-taroni left a comment

Choose a reason for hiding this comment

jaclyn-taroni Feb 21, 2020

Choose a reason for hiding this comment

cbethell commented Feb 20, 2020 •

edited

Loading