Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

GISTIC comparison of cohort results vs individual histology results #549

Merged

Conversation

cbethell
Copy link
Contributor

@cbethell cbethell commented Feb 20, 2020

Purpose/implementation Section

The purpose of this PR is to identify, if any, disagreement between GISTIC results for the entire cohort versus the three individual histologies.

What scientific question is your analysis addressing?

This PR addresses the question of whether or not we should be handling analyses that incorporate the GISTIC results separately based on individual histology.

What was your approach?

My approach is to visualize the disagreements/agreements between the GISTIC result files for the entire cohort and compare this to each of the individual histologies' result files.
A slightly more detailed idea of my approach is found in the original comment on reference issue #547.

What GitHub issue does your pull request address?

This PR addresses issue #547.

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

The plots should receive a particularly close look.

  • Are these plots adequately representing the data in an informative way?
  • Are there suggestions for additional plot types that may be useful if implemented?

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes.

The plots can be seen in the rendered notebook of the R markdown file as it is currently formatted in this PR.

Results

What types of results are included (e.g., table, figure)?

Plots showing the agreements/disagreements between the scores.gistic files are included in the plots directory of this module.

What is your summary of the results?

This PR suggests that the results in the scores.gistic files and the amp/del_genes.conf_90.txt files differ between the entire cohort and the three individual histologies used for comparison.

Reproducibility Checklist

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile. (No new dependencies required).
  • This analysis has been added to continuous integration.

Documentation Checklist

  • This analysis module has a README and it is up to date.
  • This analysis is recorded in the table in analyses/README.md and the entry is up to date.
  • The analytical code is documented and contains comments.

Copy link
Member

@jaclyn-taroni jaclyn-taroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The score plotting looks good! I had some comments about improving the plot_amp_del_genes portion of the notebook.


### Custom function
```{r}
plot_amp_del_genes <-
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what is being compared in the Venn diagram is entire strings? Imagine a situation where the wide peaks between the histology-specific run and the entire cohort GISTIC run have a lot of overlap but are not exactly the same. It could be the case that 70% of the genes from the wide peaks from either run would overlap but show up as a zero in the the Venn diagram. We would want the 70% of genes in both to be reflected in the Venn diagram. So what we're after here is looking at the overlap between vectors of gene symbols, where the vector will contain the unique set of gene symbols that are in any of the significant wide peaks. Does that make sense?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that does make sense.

# This function displays a Venn Diagram that represents the data that
# overlaps/does not overlap between the two given files

# Read in files
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are treating the two files in the exact same way, which is a good indication to me that this can be wrapped up in a function. I can imagine a situation where you might want to examine what genes are in the vectors that underlie the Venn diagram, so you can separate things out into two function: 1) read in the file and process the vector and 2) create the Venn diagram.

With the first function, you'll only need to process the entire cohort results once.

Comment on lines 228 to 243
# Wrnagle data in cohort file for plotting
cohort_genes <- t(cohort_genes)

cohort_genes <- cohort_genes[-1,] %>%
as.data.frame() %>%
dplyr::mutate(summary_info = paste0(V1, ", ", V2, ", ", V3, ", ", V4)) %>%
dplyr::select(-c("V1", "V2", "V3", "V4")) %>%
dplyr::select(summary_info, dplyr::everything()) %>%
tidyr::gather(V5, V6,-summary_info) %>%
dplyr::group_by(summary_info) %>%
dplyr::mutate(genes = paste0(sort(unique(V6)), collapse = ", ")) %>%
dplyr::select(-V5,-V6) %>%
dplyr::distinct() %>%
dplyr::mutate(summary = paste0(summary_info, genes)) %>%
dplyr::filter(!(summary == "NA, NA, NA, NA")) %>%
dplyr::arrange()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the amplification and deletion files, GISTIC produces "ragged columns" per their documentation. It is often helpful to use lists any time you find yourself in a situation where columns or rows are of different sizes and you have a whole bunch of trailing blanks.

Here's what I came up with for the first function mentioned above, getting the vector of genes of interest for the Venn diagram:

get_genes_vector <- function(genes_file) {
  genes_ragged_df <- data.table::fread(genes_file,
                                       data.table = FALSE)
  genes_list <- as.list(genes_ragged_df)
  genes_list <- genes_list %>%
    # This removes any element from the list that is all NA -- most likely a
    # result of reading in a ragged array
    purrr::discard(~ all(is.na(.))) %>%
    # This removes the element of the list that is essentially the "header" for
    # the file
    purrr::discard(~ any(str_detect(., "cytoband|q value"))) %>% 
    # Remove any broad peaks with q-value > 0.05
    purrr::discard(~ .[3] > 0.05) %>%
    # Remove everything before and including the wide peak boundaries for each
    # remaining element of the list
    purrr::modify(~ .[-c(1:(str_which(., "chr")))]) %>%
    # Remove blanks -- result of ragged data.frame
    purrr::modify(~ .[. != ""])
  # This will give us the vector of all the genes that were included
  genes_vector <- unique(unname(unlist(genes_list)))
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted, I'll give this function a try and commit the changes.

- remove GISTIC qplots (does not seem beneficial to this notebook)
@jaclyn-taroni
Copy link
Member

This looks ready to be documented to me @cbethell! We might want to consider how we could restructure all of the copy number modules such they are all under the same directory. For example I'm going to create a new one called expression-cna to try out some ideas about #387 rather than add to focal-cn-file-preparation. The restructuring is way beyond this scope of this particular pull request, but is something that's on my mind.

@cbethell
Copy link
Contributor Author

This looks ready to be documented to me @cbethell! We might want to consider how we could restructure all of the copy number modules such they are all under the same directory. For example I'm going to create a new one called expression-cna to try out some ideas about #387 rather than add to focal-cn-file-preparation. The restructuring is way beyond this scope of this particular pull request, but is something that's on my mind.

@jaclyn-taroni sounds good to me! I started a README for this module but will likely update that and the analyses/README.md file in a separate PR and mark this one ready for review. I will also give the restructuring of the copy number modules some thought.

Would you also give the files in run-gistic/results another look (if you haven't already) to ensure that we covered all the possible file comparisons in this PR? I attempted to compare the all_lesions.conf_90.txt files as well but it appears that this would be redundant so I believe we covered all but a second opinion won't hurt.

@cbethell cbethell changed the title WIP: GISTIC comparison of cohort results vs individual histology results GISTIC comparison of cohort results vs individual histology results Feb 21, 2020
@cbethell cbethell marked this pull request as ready for review February 21, 2020 20:29
@jaclyn-taroni
Copy link
Member

@cbethell I would say the results here are sufficient to give me pause about using the GISTIC results from the entire cohort without including all_lesions.conf_90.txt. The data in all_lesions.conf_90.txt also seems like it would be more difficult to represent.

Copy link
Member

@jaclyn-taroni jaclyn-taroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! One change to the notebook structure I'd like to see before merging.

"util",
"generate-multipanel-plot-functions.R"))
```

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you stick all your functions under this section?

@jaclyn-taroni jaclyn-taroni merged commit e6b0438 into AlexsLemonade:master Feb 22, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants