-
Notifications
You must be signed in to change notification settings - Fork 83
Comparison of GISTIC gene level calls - Data Prep #559
Comparison of GISTIC gene level calls - Data Prep #559
Conversation
- add shell script to run this module's scripts - make appropriate change to `.circleci`
- fix file path in `.circleci` - add variables to final tables - save final tables
- move notebook to its own module and rename to explicitly indicate its purpose, which is to tidy the GISTIC data first - add this analysis to `.circleci` - revert changes to `gistic-cohort-vs-histology-comparison` module
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The table you are producing looks close to what we want! I have not got through this pull request in detail/line by line @cbethell, but I wanted to pass on some design comments that jumped out at me so you can continue to refine.
|
||
```{r} | ||
# Code adapted from `analyses/gistic-cohort-vs-histology-comparison/gistic-cohort-vs-histology-comparison.Rmd` | ||
get_genes_df <- function(genes_df) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably a good indication that it is time for this to get broken out into an R file in a util
... for the parts that you're adding/changing from analyses/gistic-cohort-vs-histology-comparison/gistic-cohort-vs-histology-comparison.Rmd
can you write a second function that uses the output from get_genes_vector
to create a data.frame or is there some other way to make this more generalizable? I think this is also a good indication that this notebook should be in the same module as what is in gistic-cohort-vs-histology-comparison
but perhaps with a more general name like compare-gistic
.
status %in% c("-1", "-1.2929") ~ "loss", | ||
status %in% c("1", " 1", " 2", "2", " 1.0000") ~ "gain", | ||
TRUE ~ "neutral" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What this says to me is that something that should be a double
is a character
. For this to be robust to changes in the underlying values (e.g., what if some value becomes -1.4597
another time we run it because the histology file changes), the logic should be
status %in% c("-1", "-1.2929") ~ "loss", | |
status %in% c("1", " 1", " 2", "2", " 1.0000") ~ "gain", | |
TRUE ~ "neutral" | |
status < 0 ~ "loss", | |
status > 0 ~ "gain", | |
status == 0 ~ "neutral" |
Whatever is causing the status values to be a character (and the inconsistencies in spacing) should be fixed before you get to this step.
- move notebook back to same module of gistic cohort vs histology notebook - refactor logic in tidy function - rename notebooks - add shell script - add tidy data files for future cytoband level comparison - make appropriate changes to .circleci
- add functions script to `util` directory of this module - source functions script in `01-GISTIC-cohort-vs-histology-comparison.Rmd` and in `02-GISTIC-tidy-data-prep.Rmd` - move functions from both notebooks to function script
- make function name more broad as output can be a vector or a data.frame
I made the design changes @jaclyn-taroni suggested above and updated the rendered notebook linked in the original comment of this PR. Note: I put all the functions relevant to the notebooks in this module in a custom functions script. I am not sure whether or not the function code would be better shown on the rendered notebook (particularly for the tidying of the GISTIC data steps), if so I can revert the move of these functions. The updated notebook can be viewed here. |
- rerun nb - propogate changes to the `README.md` file
…into gistic-gene-level-comparison
) | ||
} | ||
|
||
lgat_gistic_dir <- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this notebook is the second in the series is there a scenario you are expecting that this file will not have been unzipped already? Related question, is the 02 notebook dependent on 01 notebook being run already besides this? If yes, and 02 doesn't work without 01 having been run, you can get rid of the unzipping part of this section.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I put this step in there because the 02 notebook is not dependent on the 01 notebook, so if someone wanted to run the 02 notebook independently they can. However, I am not sure how important having this option is here (especially considering that there is an upcoming notebook that will be dependent on the 02 notebook).
### Unzip GISTIC results folders | ||
|
||
```{r} | ||
# Unzip and set up GISTIC results folders and files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can shorten up this section without losing clarity. Can you list all your files first and then try this too unzip all your files at once:
# unzip all your files
ldply(.data = gistic_files, .fun = unzip, exdir = data_dir)
I haven't tried this so let me know if it works. Don't spend too much time on this idea I'm saying because what you have is also fine and clear.
# Save data.frame to file (for later cytoband level comparison) | ||
readr::write_tsv(gistic_all_lesions_df, file.path(results_dir, output_filename)) | ||
|
||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need an extra line
|
||
} | ||
|
||
#### Implemented in `01-GISTIC-cohort-vs-histology-comparison.Rmd` ------------ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the way you have these functions sectioned off is how I'm thinking it is based on your comment here, you may want to split these functions into two different function files. That way 1) You don't load functions you don't need and 2) It's easier to know where to find functions used by each notebook.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah. I see why you did this. Both your sets of functions use get_genes_object
. Forget what I said. I think how you have it is fine.
|
||
|
||
#### Implemented in both `01-GISTIC-cohort-vs-histology.Rmd` and `02-GISTIC-tidy-data-prep.Rmd` | ||
get_genes_object <- function(genes_file, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love this function but the name doesn't tell me much about what it does. Maybe format_gistic_genes
? Or something more informative.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a fan of this function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that name much better than get_genes_object
!
The output of this notebook includes three multipanel plots, found in this `plots` directory, depicting the distribution of amplifications/deletions across chromosomes (using G-scores) for the specified individual histology versus the entire PBTA cohort. | ||
|
||
[`02-GISTIC-tidy-data-prep.Rmd`](https://alexslemonade.github.io/OpenPBTA-analysis/analyses/01-GISTIC-cohort-vs-histology-comparison/02-GISTIC-tidy-data-prep.nb.html) is a notebook written to tidy and format the GISTIC files needed to compare GISTIC's copy number calls to the copy number calls we prepared in the [focal-cn-file-preparation](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/focal-cn-file-preparation/results) module of this repository. | ||
The output of this notebook includes tables to be consumed for both a gene-level comparison and cytoband-level comparison of copy number calls analysis notebooks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the comparison is an upcoming notebook?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct
} | ||
|
||
lgat_gistic_dir <- | ||
file.path(gistic_data_dir, "pbta-cnv-consensus-lgat-gistic") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is LGAT the group we are moving forward with in this next step of the analysis? Can you add the rationale of why LGAT at the top of the notebook and maybe in the README?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question, LGAT was chosen for development purposes (it's more manageable to work with one group as opposed to three groups). That being said, I can add a note to the README explaining this, but will also refactor the code in a way that can be run for all three histologies once we get this code in order (and the code in the upcoming notebook).
|
||
```{r} | ||
# Source this module's custom functions R script | ||
source(file.path("util", "GISTIC-comparison-functions.R")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This probably belongs in the Set Up
section rather than its own separate section.
|
||
if (!dir.exists(cohort_gistic_dir)) { | ||
unzip(cohort_gistic_zip, | ||
exdir = data_dir |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think data_dir
has been defined in this notebook. Is it supposed to be the same as the previous notebook had it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great! I like the great documentation on the functions. I don't have huge comments. It looks like the new tables correspond with what @jaclyn-taroni requested.
Other comments not attached to a particular line:
- I notice GISTIC puts brackets around some of the gene names (does this mean something or is it random???) If the brackets don't mean anything to us, we should probably take them off before getting to the final table.
- A very nit picky comment and probably may just be a "me" thing is to ask if you can explicitly name your
output_filename
arguments when you use your functions, that way I can clearly tell what are input and what are output files. (Although I notice you end up having all output files as quotes instead of object names).
cd "$script_directory" || exit | ||
|
||
Rscript -e "rmarkdown::render('01-GISTIC-cohort-vs-histology-comparison.Rmd', clean = TRUE)" | ||
Rscript -e "rmarkdown::render('02-GISTIC-tidy-data-prep.Rmd', clean = TRUE)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need extra line here.
|
||
# This script should always run as if it were being called from | ||
# the directory it lives in. | ||
script_directory="$(perl -e 'use File::Basename; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jashapiro , what was the alternative you mentioned you had to this the other day?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cd "$(dirname "${BASH_SOURCE[0]}")" |
gistic_all_lesions_df$`Wide Peak Limits` <- | ||
gsub("[(].*", "", gistic_all_lesions_df$`Wide Peak Limits`) | ||
|
||
# Keep just the detection peak name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Keep just the detection peak name | |
# Keep just the detection peak name, some peaks have `- CN values` at the end of the string. |
Do we know why those are there and if they mean anything?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I can link GISTIC documentation here to further explain this, but the reason is because the values in those rows contain actual CN values, while the rows without distinction contain a value that falls between an amplitude threshold (this is noted in the all_lesions.conf_90.txt
file, I will make note of it here as well).
dplyr::mutate(status = dplyr::case_when(status < 0 ~ "loss", | ||
status > 0 ~ "gain", | ||
status == 0 ~ "neutral")) %>% | ||
dplyr::distinct() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment about why this distinct is necessary and what it is removing? A distinct()
without a reasoning associated makes me concerned.
dplyr::select( | ||
-c( | ||
"Descriptor", | ||
"Peak Limits", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GISTIC's spaces in all their column names kill me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❗️
final_df <- final_df %>% | ||
dplyr::left_join(gistic_all_lesions_df, by = c("peak_region" = "Wide Peak Limits")) %>% | ||
dplyr::select(gene_symbol, Kids_First_Biospecimen_ID, status, detection_peak = `Unique Name`) %>% | ||
dplyr::distinct() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also comment on this distinct()
without looking at the data I don't know what's happening.
|
||
|
||
# Wrangle the GISTIC all lesions data to be in a comparable format with our CN calls | ||
gistic_all_lesions_df <- gistic_all_lesions_df %>% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This section looks familiar! :) We could combine prepare_gene_level_gistic
and prepare_cytoband_level_gistic
into one function and just add an additional option of either gene
or cytoband
but honestly, I think the way you have it is fine too. It's clear this way.
Co-Authored-By: Candace Savonen <cansav09@gmail.com>
- add note on why LGAT's files were chosen to further this analysis - change function name to be more explicit of what it does - add new lines to end of files - add comment explaining purpose of the `distinct` function when implemented - add link to GISTIC documentation in README and in script where relevant - fix `data_dir` file path in 02 nb - move source function script step to `Set Up` section
@cansavvy, re:
From GISTIC's docs: "For peaks that contain no genes, the nearest gene is listed in brackets." As for your other comments, I believe I addressed all with the exception of:
This line did not run as is so I am looking into tweaking it so that it will. However, the rest of this PR is ready for another look (hopefully I didn't miss anything but you would probably be able to tell quicker than I can if I did). |
Can we add this to your 02-notebook and README some place?
Don't worry about getting this to work then. |
} | ||
|
||
#### Implemented in `02-GISTIC-tidy-data-prep.Rmd` ---------------------------- | ||
|
||
prepare_gene_level_gistic <- function(all_lesions_file, | ||
amp_genes_file, | ||
del_genes_file, | ||
output_filename) { | ||
|
||
gene_level_output_tsv_filename) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OH. This isn't what I meant. Sorry. Your argument name itself is fine. I was talking about when you call the function can you put output_filename =
in front of it so I know that one is an output file and the others are input files. Can you change it back to output_filename
? Sorry if I didn't communicate my idea clearly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ohh, okay gotcha! I should have double checked my understanding before making that change. I'll change the argument back and make that initial change that you suggested.
} | ||
|
||
prepare_cytoband_level_gistic <- function(all_lesions_file, | ||
output_filename) { | ||
|
||
cytoband_level_output_tsv_filename) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same thing here. Sorry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me! I have a couple minor comments outstanding, but if you want to take a look at those and make changes if you want but no need for re-review after this. 👍
- put `output_filename =` before each appropriate argument in the 02 nb (to differentiate the output file from the input files) - rerun 02 nb
Purpose/implementation Section
The purpose of this PR is to prepare GISTIC's results file for comparison of GISTIC's gene level calls with the calls we have prepared in
analyses/focal-cn-file-preparation/results
.What scientific question is your analysis addressing?
This is addressing the question of whether or not our calls agree with GISTIC's calls, to further determine if our handling of the copy number data is reasonable in this project.
What was your approach?
My approach was to take the
all_lesions.conf_90.txt
,amp_genes.conf_90.txt
, anddel_genes.conf_90.txt
GISTIC result files for an individual histology (LGAT) and prepare them for consumption in the upcoming second notebook of this analysis and described in the original comment of the reference issue #560:@jaclyn-taroni does the following update to the original comment now seem suffice?
What GitHub issue does your pull request address?
This PR addresses issue #560.
Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.
Which areas should receive a particularly close look?
Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?
Yes, the code and tables are ready for review.
Results
What types of results are included (e.g., table, figure)?
The results of this PR includes two tables in this module's
results
directory:cohort_gistic_calls_table.tsv.gz
lgat_gistic_calls_table.tsv.gz
The rendered notebook at the progress of this draft PR thus far can be viewed here.
What is your summary of the results?
The results are not yet in a format that can be summarized and interpreted for the purpose of this analysis, but the final table provides gene, detection peak, biospecimen ID, and the GISTIC call status for the samples.
Reproducibility Checklist
Documentation Checklist
README
and it is up to date.analyses/README.md
and the entry is up to date.