-
Notifications
You must be signed in to change notification settings - Fork 83
GISTIC comparison of cohort results vs individual histology results #549
GISTIC comparison of cohort results vs individual histology results #549
Conversation
- rename name of notebook to be consistent with module name
- propagate changes to file path definitions
- remove section comparing sample seg count files (not beneficial to this analysis)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The score plotting looks good! I had some comments about improving the plot_amp_del_genes
portion of the notebook.
|
||
### Custom function | ||
```{r} | ||
plot_amp_del_genes <- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what is being compared in the Venn diagram is entire strings? Imagine a situation where the wide peaks between the histology-specific run and the entire cohort GISTIC run have a lot of overlap but are not exactly the same. It could be the case that 70% of the genes from the wide peaks from either run would overlap but show up as a zero in the the Venn diagram. We would want the 70% of genes in both to be reflected in the Venn diagram. So what we're after here is looking at the overlap between vectors of gene symbols, where the vector will contain the unique set of gene symbols that are in any of the significant wide peaks. Does that make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that does make sense.
# This function displays a Venn Diagram that represents the data that | ||
# overlaps/does not overlap between the two given files | ||
|
||
# Read in files |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are treating the two files in the exact same way, which is a good indication to me that this can be wrapped up in a function. I can imagine a situation where you might want to examine what genes are in the vectors that underlie the Venn diagram, so you can separate things out into two function: 1) read in the file and process the vector and 2) create the Venn diagram.
With the first function, you'll only need to process the entire cohort results once.
# Wrnagle data in cohort file for plotting | ||
cohort_genes <- t(cohort_genes) | ||
|
||
cohort_genes <- cohort_genes[-1,] %>% | ||
as.data.frame() %>% | ||
dplyr::mutate(summary_info = paste0(V1, ", ", V2, ", ", V3, ", ", V4)) %>% | ||
dplyr::select(-c("V1", "V2", "V3", "V4")) %>% | ||
dplyr::select(summary_info, dplyr::everything()) %>% | ||
tidyr::gather(V5, V6,-summary_info) %>% | ||
dplyr::group_by(summary_info) %>% | ||
dplyr::mutate(genes = paste0(sort(unique(V6)), collapse = ", ")) %>% | ||
dplyr::select(-V5,-V6) %>% | ||
dplyr::distinct() %>% | ||
dplyr::mutate(summary = paste0(summary_info, genes)) %>% | ||
dplyr::filter(!(summary == "NA, NA, NA, NA")) %>% | ||
dplyr::arrange() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the amplification and deletion files, GISTIC produces "ragged columns" per their documentation. It is often helpful to use lists any time you find yourself in a situation where columns or rows are of different sizes and you have a whole bunch of trailing blanks.
Here's what I came up with for the first function mentioned above, getting the vector of genes of interest for the Venn diagram:
get_genes_vector <- function(genes_file) {
genes_ragged_df <- data.table::fread(genes_file,
data.table = FALSE)
genes_list <- as.list(genes_ragged_df)
genes_list <- genes_list %>%
# This removes any element from the list that is all NA -- most likely a
# result of reading in a ragged array
purrr::discard(~ all(is.na(.))) %>%
# This removes the element of the list that is essentially the "header" for
# the file
purrr::discard(~ any(str_detect(., "cytoband|q value"))) %>%
# Remove any broad peaks with q-value > 0.05
purrr::discard(~ .[3] > 0.05) %>%
# Remove everything before and including the wide peak boundaries for each
# remaining element of the list
purrr::modify(~ .[-c(1:(str_which(., "chr")))]) %>%
# Remove blanks -- result of ragged data.frame
purrr::modify(~ .[. != ""])
# This will give us the vector of all the genes that were included
genes_vector <- unique(unname(unlist(genes_list)))
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noted, I'll give this function a try and commit the changes.
- remove GISTIC qplots (does not seem beneficial to this notebook)
This looks ready to be documented to me @cbethell! We might want to consider how we could restructure all of the copy number modules such they are all under the same directory. For example I'm going to create a new one called |
@jaclyn-taroni sounds good to me! I started a README for this module but will likely update that and the Would you also give the files in |
@cbethell I would say the results here are sufficient to give me pause about using the GISTIC results from the entire cohort without including |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! One change to the notebook structure I'd like to see before merging.
"util", | ||
"generate-multipanel-plot-functions.R")) | ||
``` | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you stick all your functions under this section?
Purpose/implementation Section
The purpose of this PR is to identify, if any, disagreement between GISTIC results for the entire cohort versus the three individual histologies.
What scientific question is your analysis addressing?
This PR addresses the question of whether or not we should be handling analyses that incorporate the GISTIC results separately based on individual histology.
What was your approach?
My approach is to visualize the disagreements/agreements between the GISTIC result files for the entire cohort and compare this to each of the individual histologies' result files.
A slightly more detailed idea of my approach is found in the original comment on reference issue #547.
What GitHub issue does your pull request address?
This PR addresses issue #547.
Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.
Which areas should receive a particularly close look?
The plots should receive a particularly close look.
Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?
Yes.
The plots can be seen in the rendered notebook of the R markdown file as it is currently formatted in this PR.
Results
What types of results are included (e.g., table, figure)?
Plots showing the agreements/disagreements between the
scores.gistic
files are included in theplots
directory of this module.What is your summary of the results?
This PR suggests that the results in the
scores.gistic
files and theamp/del_genes.conf_90.txt
files differ between the entire cohort and the three individual histologies used for comparison.Reproducibility Checklist
Documentation Checklist
README
and it is up to date.analyses/README.md
and the entry is up to date.