Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Comparison of GISTIC gene level calls - Data Prep #559

Merged

Conversation

cbethell
Copy link
Contributor

@cbethell cbethell commented Feb 25, 2020

Purpose/implementation Section

The purpose of this PR is to prepare GISTIC's results file for comparison of GISTIC's gene level calls with the calls we have prepared in analyses/focal-cn-file-preparation/results.

What scientific question is your analysis addressing?

This is addressing the question of whether or not our calls agree with GISTIC's calls, to further determine if our handling of the copy number data is reasonable in this project.

What was your approach?

My approach was to take the all_lesions.conf_90.txt, amp_genes.conf_90.txt, and del_genes.conf_90.txt GISTIC result files for an individual histology (LGAT) and prepare them for consumption in the upcoming second notebook of this analysis and described in the original comment of the reference issue #560:

The first notebook will tidy the GISTIC data files (all_lesions.conf_90.txt, amp_genes.conf_90.txt, and del_genes.conf_90.txt). This is done to format the GISTIC data in a way that is comparable with the gene level calls in the focal-cn-file-preparation/results files. The output will therefore be a table that can be consumed in the second notebook for comparison to our focal CN files.

Either a sketch or example markdown table work

@jaclyn-taroni does the following update to the original comment now seem suffice?

    1. The first notebook will tidy the GISTIC data files (all_lesions.conf_90.txt, amp_genes.conf_90.txt, and del_genes.conf_90.txt). This is done to format the GISTIC data in a way that is comparable with the gene level calls in the focal-cn-file-preparation/results files. The output will therefore be a table that can be consumed in the second notebook for comparison to our focal CN files. The table will tentatively have the following columns:

      gene_symbol Kids_First_Biospecimen_ID status detection_peak
      TBXAS1 BS_xxxxxx gain Amplification Peak 3

      where gene_symbol values are retrieved from the
      amp_genes.conf_90.txt/del_genes.conf_90.txt files, Kids_First_Biospecimen_ID and
      status values are retrieved from the all_lesions.conf_90.txt file, and detection_peak
      values are matched between the amp/del files and the all_lesions file.

What GitHub issue does your pull request address?

This PR addresses issue #560.

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

  • Do the final tables represent the data in the best/most informative way?
  • Does the code generating the tables appear to do so accurately?

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes, the code and tables are ready for review.

Results

What types of results are included (e.g., table, figure)?

The results of this PR includes two tables in this module's results directory:

  • cohort_gistic_calls_table.tsv.gz
  • lgat_gistic_calls_table.tsv.gz

The rendered notebook at the progress of this draft PR thus far can be viewed here.

What is your summary of the results?

The results are not yet in a format that can be summarized and interpreted for the purpose of this analysis, but the final table provides gene, detection peak, biospecimen ID, and the GISTIC call status for the samples.

Reproducibility Checklist

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.

Documentation Checklist

  • This analysis module has a README and it is up to date.
  • This analysis is recorded in the table in analyses/README.md and the entry is up to date.
  • The analytical code is documented and contains comments.

- add shell script to run this module's scripts
- make appropriate change to `.circleci`
- fix file path in `.circleci`
- add variables to final tables
- save final tables
@cbethell cbethell marked this pull request as ready for review February 25, 2020 22:35
@cbethell cbethell changed the title WIP: Comparison of GISTIC gene level calls Comparison of GISTIC gene level calls Feb 25, 2020
- move notebook to its own module and rename to explicitly indicate its purpose, which is to tidy the GISTIC data first
- add this analysis to `.circleci`
- revert changes to `gistic-cohort-vs-histology-comparison` module
@cbethell cbethell changed the title Comparison of GISTIC gene level calls Comparison of GISTIC gene level calls - Data Prep Feb 26, 2020
Copy link
Member

@jaclyn-taroni jaclyn-taroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The table you are producing looks close to what we want! I have not got through this pull request in detail/line by line @cbethell, but I wanted to pass on some design comments that jumped out at me so you can continue to refine.


```{r}
# Code adapted from `analyses/gistic-cohort-vs-histology-comparison/gistic-cohort-vs-histology-comparison.Rmd`
get_genes_df <- function(genes_df) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably a good indication that it is time for this to get broken out into an R file in a util... for the parts that you're adding/changing from analyses/gistic-cohort-vs-histology-comparison/gistic-cohort-vs-histology-comparison.Rmd can you write a second function that uses the output from get_genes_vector to create a data.frame or is there some other way to make this more generalizable? I think this is also a good indication that this notebook should be in the same module as what is in gistic-cohort-vs-histology-comparison but perhaps with a more general name like compare-gistic.

Comment on lines 197 to 199
status %in% c("-1", "-1.2929") ~ "loss",
status %in% c("1", " 1", " 2", "2", " 1.0000") ~ "gain",
TRUE ~ "neutral"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What this says to me is that something that should be a double is a character. For this to be robust to changes in the underlying values (e.g., what if some value becomes -1.4597 another time we run it because the histology file changes), the logic should be

Suggested change
status %in% c("-1", "-1.2929") ~ "loss",
status %in% c("1", " 1", " 2", "2", " 1.0000") ~ "gain",
TRUE ~ "neutral"
status < 0 ~ "loss",
status > 0 ~ "gain",
status == 0 ~ "neutral"

Whatever is causing the status values to be a character (and the inconsistencies in spacing) should be fixed before you get to this step.

- move notebook back to same module of gistic cohort vs histology notebook
- refactor logic in tidy function
- rename notebooks
- add shell script
- add tidy data files for future cytoband level comparison
- make appropriate changes to .circleci
- add functions script to `util` directory of this module
- source functions script in `01-GISTIC-cohort-vs-histology-comparison.Rmd` and in `02-GISTIC-tidy-data-prep.Rmd`
- move functions from both notebooks to function script
- make function name more broad as output can be a vector or a data.frame
@cbethell
Copy link
Contributor Author

I made the design changes @jaclyn-taroni suggested above and updated the rendered notebook linked in the original comment of this PR.

Note: I put all the functions relevant to the notebooks in this module in a custom functions script. I am not sure whether or not the function code would be better shown on the rendered notebook (particularly for the tidying of the GISTIC data steps), if so I can revert the move of these functions.

The updated notebook can be viewed here.

@cansavvy cansavvy self-requested a review March 3, 2020 16:32
@jaclyn-taroni jaclyn-taroni requested review from cansavvy and removed request for cansavvy March 3, 2020 16:32
)
}

lgat_gistic_dir <-
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this notebook is the second in the series is there a scenario you are expecting that this file will not have been unzipped already? Related question, is the 02 notebook dependent on 01 notebook being run already besides this? If yes, and 02 doesn't work without 01 having been run, you can get rid of the unzipping part of this section.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put this step in there because the 02 notebook is not dependent on the 01 notebook, so if someone wanted to run the 02 notebook independently they can. However, I am not sure how important having this option is here (especially considering that there is an upcoming notebook that will be dependent on the 02 notebook).

### Unzip GISTIC results folders

```{r}
# Unzip and set up GISTIC results folders and files.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can shorten up this section without losing clarity. Can you list all your files first and then try this too unzip all your files at once:

# unzip all your files
ldply(.data = gistic_files, .fun = unzip, exdir = data_dir)

I haven't tried this so let me know if it works. Don't spend too much time on this idea I'm saying because what you have is also fine and clear.

# Save data.frame to file (for later cytoband level comparison)
readr::write_tsv(gistic_all_lesions_df, file.path(results_dir, output_filename))

}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need an extra line


}

#### Implemented in `01-GISTIC-cohort-vs-histology-comparison.Rmd` ------------
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the way you have these functions sectioned off is how I'm thinking it is based on your comment here, you may want to split these functions into two different function files. That way 1) You don't load functions you don't need and 2) It's easier to know where to find functions used by each notebook.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah. I see why you did this. Both your sets of functions use get_genes_object. Forget what I said. I think how you have it is fine.



#### Implemented in both `01-GISTIC-cohort-vs-histology.Rmd` and `02-GISTIC-tidy-data-prep.Rmd`
get_genes_object <- function(genes_file,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love this function but the name doesn't tell me much about what it does. Maybe format_gistic_genes? Or something more informative.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a fan of this function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that name much better than get_genes_object!

The output of this notebook includes three multipanel plots, found in this `plots` directory, depicting the distribution of amplifications/deletions across chromosomes (using G-scores) for the specified individual histology versus the entire PBTA cohort.

[`02-GISTIC-tidy-data-prep.Rmd`](https://alexslemonade.github.io/OpenPBTA-analysis/analyses/01-GISTIC-cohort-vs-histology-comparison/02-GISTIC-tidy-data-prep.nb.html) is a notebook written to tidy and format the GISTIC files needed to compare GISTIC's copy number calls to the copy number calls we prepared in the [focal-cn-file-preparation](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/focal-cn-file-preparation/results) module of this repository.
The output of this notebook includes tables to be consumed for both a gene-level comparison and cytoband-level comparison of copy number calls analysis notebooks.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the comparison is an upcoming notebook?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct

}

lgat_gistic_dir <-
file.path(gistic_data_dir, "pbta-cnv-consensus-lgat-gistic")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is LGAT the group we are moving forward with in this next step of the analysis? Can you add the rationale of why LGAT at the top of the notebook and maybe in the README?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question, LGAT was chosen for development purposes (it's more manageable to work with one group as opposed to three groups). That being said, I can add a note to the README explaining this, but will also refactor the code in a way that can be run for all three histologies once we get this code in order (and the code in the upcoming notebook).


```{r}
# Source this module's custom functions R script
source(file.path("util", "GISTIC-comparison-functions.R"))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably belongs in the Set Up section rather than its own separate section.


if (!dir.exists(cohort_gistic_dir)) {
unzip(cohort_gistic_zip,
exdir = data_dir
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think data_dir has been defined in this notebook. Is it supposed to be the same as the previous notebook had it?

Copy link
Collaborator

@cansavvy cansavvy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! I like the great documentation on the functions. I don't have huge comments. It looks like the new tables correspond with what @jaclyn-taroni requested.

Other comments not attached to a particular line:

  • I notice GISTIC puts brackets around some of the gene names (does this mean something or is it random???) If the brackets don't mean anything to us, we should probably take them off before getting to the final table.
  • A very nit picky comment and probably may just be a "me" thing is to ask if you can explicitly name your output_filename arguments when you use your functions, that way I can clearly tell what are input and what are output files. (Although I notice you end up having all output files as quotes instead of object names).

cd "$script_directory" || exit

Rscript -e "rmarkdown::render('01-GISTIC-cohort-vs-histology-comparison.Rmd', clean = TRUE)"
Rscript -e "rmarkdown::render('02-GISTIC-tidy-data-prep.Rmd', clean = TRUE)"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need extra line here.


# This script should always run as if it were being called from
# the directory it lives in.
script_directory="$(perl -e 'use File::Basename;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jashapiro , what was the alternative you mentioned you had to this the other day?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cd "$(dirname "${BASH_SOURCE[0]}")"

gistic_all_lesions_df$`Wide Peak Limits` <-
gsub("[(].*", "", gistic_all_lesions_df$`Wide Peak Limits`)

# Keep just the detection peak name
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Keep just the detection peak name
# Keep just the detection peak name, some peaks have `- CN values` at the end of the string.

Do we know why those are there and if they mean anything?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I can link GISTIC documentation here to further explain this, but the reason is because the values in those rows contain actual CN values, while the rows without distinction contain a value that falls between an amplitude threshold (this is noted in the all_lesions.conf_90.txt file, I will make note of it here as well).

dplyr::mutate(status = dplyr::case_when(status < 0 ~ "loss",
status > 0 ~ "gain",
status == 0 ~ "neutral")) %>%
dplyr::distinct()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment about why this distinct is necessary and what it is removing? A distinct() without a reasoning associated makes me concerned.

dplyr::select(
-c(
"Descriptor",
"Peak Limits",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GISTIC's spaces in all their column names kill me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❗️

final_df <- final_df %>%
dplyr::left_join(gistic_all_lesions_df, by = c("peak_region" = "Wide Peak Limits")) %>%
dplyr::select(gene_symbol, Kids_First_Biospecimen_ID, status, detection_peak = `Unique Name`) %>%
dplyr::distinct()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also comment on this distinct() without looking at the data I don't know what's happening.



# Wrangle the GISTIC all lesions data to be in a comparable format with our CN calls
gistic_all_lesions_df <- gistic_all_lesions_df %>%
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section looks familiar! :) We could combine prepare_gene_level_gistic and prepare_cytoband_level_gistic into one function and just add an additional option of either gene or cytoband but honestly, I think the way you have it is fine too. It's clear this way.

cbethell and others added 6 commits March 3, 2020 13:55
Co-Authored-By: Candace Savonen <cansav09@gmail.com>
- add note on why LGAT's files were chosen to further this analysis
- change function name to be more explicit of what it does
- add new lines to end of files
- add comment explaining purpose of the `distinct` function when implemented
- add link to GISTIC documentation in README and in script where relevant
- fix `data_dir` file path in 02 nb
- move source function script step to `Set Up` section
@cbethell
Copy link
Contributor Author

cbethell commented Mar 3, 2020

@cansavvy, re:

  • I notice GISTIC puts brackets around some of the gene names (does this mean something or is it random???) If the brackets don't mean anything to us, we should probably take them off before getting to the final table.

From GISTIC's docs: "For peaks that contain no genes, the nearest gene is listed in brackets."

As for your other comments, I believe I addressed all with the exception of:

ldply(.data = gistic_files, .fun = unzip, exdir = data_dir)

This line did not run as is so I am looking into tweaking it so that it will.

However, the rest of this PR is ready for another look (hopefully I didn't miss anything but you would probably be able to tell quicker than I can if I did).

@cansavvy
Copy link
Collaborator

cansavvy commented Mar 4, 2020

From GISTIC's docs: "For peaks that contain no genes, the nearest gene is listed in brackets."

Can we add this to your 02-notebook and README some place?

ldply(.data = gistic_files, .fun = unzip, exdir = data_dir)

This line did not run as is so I am looking into tweaking it so that it will.

Don't worry about getting this to work then.

}

#### Implemented in `02-GISTIC-tidy-data-prep.Rmd` ----------------------------

prepare_gene_level_gistic <- function(all_lesions_file,
amp_genes_file,
del_genes_file,
output_filename) {

gene_level_output_tsv_filename) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OH. This isn't what I meant. Sorry. Your argument name itself is fine. I was talking about when you call the function can you put output_filename = in front of it so I know that one is an output file and the others are input files. Can you change it back to output_filename? Sorry if I didn't communicate my idea clearly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohh, okay gotcha! I should have double checked my understanding before making that change. I'll change the argument back and make that initial change that you suggested.

}

prepare_cytoband_level_gistic <- function(all_lesions_file,
output_filename) {

cytoband_level_output_tsv_filename) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same thing here. Sorry.

Copy link
Collaborator

@cansavvy cansavvy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me! I have a couple minor comments outstanding, but if you want to take a look at those and make changes if you want but no need for re-review after this. 👍

cbethell and others added 3 commits March 4, 2020 08:59
- put `output_filename =` before each appropriate argument in the 02 nb (to differentiate the output file  from the input files)
- rerun 02 nb
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants