Comparison of GISTIC gene level calls - Data Prep #559

cbethell · 2020-02-25T18:46:03Z

Purpose/implementation Section

The purpose of this PR is to prepare GISTIC's results file for comparison of GISTIC's gene level calls with the calls we have prepared in analyses/focal-cn-file-preparation/results.

What scientific question is your analysis addressing?

This is addressing the question of whether or not our calls agree with GISTIC's calls, to further determine if our handling of the copy number data is reasonable in this project.

What was your approach?

My approach was to take the all_lesions.conf_90.txt, amp_genes.conf_90.txt, and del_genes.conf_90.txt GISTIC result files for an individual histology (LGAT) and prepare them for consumption in the upcoming second notebook of this analysis and described in the original comment of the reference issue #560:

The first notebook will tidy the GISTIC data files (all_lesions.conf_90.txt, amp_genes.conf_90.txt, and del_genes.conf_90.txt). This is done to format the GISTIC data in a way that is comparable with the gene level calls in the focal-cn-file-preparation/results files. The output will therefore be a table that can be consumed in the second notebook for comparison to our focal CN files.

Either a sketch or example markdown table work

@jaclyn-taroni does the following update to the original comment now seem suffice?

The first notebook will tidy the GISTIC data files (all_lesions.conf_90.txt, amp_genes.conf_90.txt, and del_genes.conf_90.txt). This is done to format the GISTIC data in a way that is comparable with the gene level calls in the focal-cn-file-preparation/results files. The output will therefore be a table that can be consumed in the second notebook for comparison to our focal CN files. The table will tentatively have the following columns:

gene_symbol Kids_First_Biospecimen_ID status detection_peak

TBXAS1 BS_xxxxxx gain Amplification Peak 3

where gene_symbol values are retrieved from the
amp_genes.conf_90.txt/del_genes.conf_90.txt files, Kids_First_Biospecimen_ID and
status values are retrieved from the all_lesions.conf_90.txt file, and detection_peak
values are matched between the amp/del files and the all_lesions file.

What GitHub issue does your pull request address?

This PR addresses issue #560.

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Do the final tables represent the data in the best/most informative way?
Does the code generating the tables appear to do so accurately?

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes, the code and tables are ready for review.

Results

What types of results are included (e.g., table, figure)?

The results of this PR includes two tables in this module's results directory:

cohort_gistic_calls_table.tsv.gz
lgat_gistic_calls_table.tsv.gz

The rendered notebook at the progress of this draft PR thus far can be viewed here.

What is your summary of the results?

The results are not yet in a format that can be summarized and interpreted for the purpose of this analysis, but the final table provides gene, detection peak, biospecimen ID, and the GISTIC call status for the samples.

Reproducibility Checklist

The dependencies required to run the code in this pull request have been added to the project Dockerfile.
This analysis has been added to continuous integration.

Documentation Checklist

This analysis module has a README and it is up to date.
This analysis is recorded in the table in analyses/README.md and the entry is up to date.
The analytical code is documented and contains comments.

- add shell script to run this module's scripts - make appropriate change to `.circleci`

- fix file path in `.circleci` - add variables to final tables - save final tables

- move notebook to its own module and rename to explicitly indicate its purpose, which is to tidy the GISTIC data first - add this analysis to `.circleci` - revert changes to `gistic-cohort-vs-histology-comparison` module

jaclyn-taroni

The table you are producing looks close to what we want! I have not got through this pull request in detail/line by line @cbethell, but I wanted to pass on some design comments that jumped out at me so you can continue to refine.

jaclyn-taroni · 2020-02-27T13:20:24Z

analyses/gistic-vs-focal-cn-file-preparation/01-GISTIC-tidy-data-prep.Rmd

+
+```{r}
+# Code adapted from `analyses/gistic-cohort-vs-histology-comparison/gistic-cohort-vs-histology-comparison.Rmd`
+get_genes_df <- function(genes_df) {


Probably a good indication that it is time for this to get broken out into an R file in a util... for the parts that you're adding/changing from analyses/gistic-cohort-vs-histology-comparison/gistic-cohort-vs-histology-comparison.Rmd can you write a second function that uses the output from get_genes_vector to create a data.frame or is there some other way to make this more generalizable? I think this is also a good indication that this notebook should be in the same module as what is in gistic-cohort-vs-histology-comparison but perhaps with a more general name like compare-gistic.

jaclyn-taroni · 2020-02-27T13:27:41Z

analyses/gistic-vs-focal-cn-file-preparation/01-GISTIC-tidy-data-prep.Rmd

+      status %in% c("-1", "-1.2929") ~ "loss",
+      status %in% c("1", " 1", " 2", "2", " 1.0000") ~ "gain",
+      TRUE ~ "neutral"


What this says to me is that something that should be a double is a character. For this to be robust to changes in the underlying values (e.g., what if some value becomes -1.4597 another time we run it because the histology file changes), the logic should be

Suggested change

status %in% c("-1", "-1.2929") ~ "loss",

status %in% c("1", " 1", " 2", "2", " 1.0000") ~ "gain",

TRUE ~ "neutral"

status < 0 ~ "loss",

status > 0 ~ "gain",

status == 0 ~ "neutral"

Whatever is causing the status values to be a character (and the inconsistencies in spacing) should be fixed before you get to this step.

- move notebook back to same module of gistic cohort vs histology notebook - refactor logic in tidy function - rename notebooks - add shell script - add tidy data files for future cytoband level comparison - make appropriate changes to .circleci

- add functions script to `util` directory of this module - source functions script in `01-GISTIC-cohort-vs-histology-comparison.Rmd` and in `02-GISTIC-tidy-data-prep.Rmd` - move functions from both notebooks to function script

- make function name more broad as output can be a vector or a data.frame

cbethell · 2020-02-27T23:55:29Z

I made the design changes @jaclyn-taroni suggested above and updated the rendered notebook linked in the original comment of this PR.

Note: I put all the functions relevant to the notebooks in this module in a custom functions script. I am not sure whether or not the function code would be better shown on the rendered notebook (particularly for the tidying of the GISTIC data steps), if so I can revert the move of these functions.

The updated notebook can be viewed here.

- rerun nb - propogate changes to the `README.md` file

…into gistic-gene-level-comparison

cansavvy · 2020-03-03T16:45:44Z

analyses/compare-gistic/02-GISTIC-tidy-data-prep.Rmd

+  )
+}
+
+lgat_gistic_dir <-


Since this notebook is the second in the series is there a scenario you are expecting that this file will not have been unzipped already? Related question, is the 02 notebook dependent on 01 notebook being run already besides this? If yes, and 02 doesn't work without 01 having been run, you can get rid of the unzipping part of this section.

I put this step in there because the 02 notebook is not dependent on the 01 notebook, so if someone wanted to run the 02 notebook independently they can. However, I am not sure how important having this option is here (especially considering that there is an upcoming notebook that will be dependent on the 02 notebook).

cansavvy · 2020-03-03T16:54:16Z

analyses/compare-gistic/01-GISTIC-cohort-vs-histology-comparison.Rmd

+### Unzip GISTIC results folders
+
+```{r}
+# Unzip and set up GISTIC results folders and files.


I think we can shorten up this section without losing clarity. Can you list all your files first and then try this too unzip all your files at once:

# unzip all your files ldply(.data = gistic_files, .fun = unzip, exdir = data_dir)

I haven't tried this so let me know if it works. Don't spend too much time on this idea I'm saying because what you have is also fine and clear.

cansavvy · 2020-03-03T16:56:14Z

analyses/compare-gistic/util/GISTIC-comparison-functions.R

+  # Save data.frame to file (for later cytoband level comparison)
+  readr::write_tsv(gistic_all_lesions_df, file.path(results_dir, output_filename))
+
+}


Need an extra line

cansavvy · 2020-03-03T17:05:34Z

analyses/compare-gistic/util/GISTIC-comparison-functions.R

+
+}
+
+#### Implemented in `01-GISTIC-cohort-vs-histology-comparison.Rmd` ------------


If the way you have these functions sectioned off is how I'm thinking it is based on your comment here, you may want to split these functions into two different function files. That way 1) You don't load functions you don't need and 2) It's easier to know where to find functions used by each notebook.

Ah. I see why you did this. Both your sets of functions use get_genes_object. Forget what I said. I think how you have it is fine.

cansavvy · 2020-03-03T17:13:19Z

analyses/compare-gistic/util/GISTIC-comparison-functions.R

+
+
+#### Implemented in both `01-GISTIC-cohort-vs-histology.Rmd` and `02-GISTIC-tidy-data-prep.Rmd`
+get_genes_object <- function(genes_file,


I love this function but the name doesn't tell me much about what it does. Maybe format_gistic_genes? Or something more informative.

I'm a fan of this function.

I like that name much better than get_genes_object!

analyses/compare-gistic/README.md

cansavvy · 2020-03-03T17:48:57Z

analyses/compare-gistic/README.md

+The output of this notebook includes three multipanel plots, found in this `plots` directory, depicting the distribution of amplifications/deletions across chromosomes (using G-scores) for the specified individual histology versus the entire PBTA cohort.
+
+[`02-GISTIC-tidy-data-prep.Rmd`](https://alexslemonade.github.io/OpenPBTA-analysis/analyses/01-GISTIC-cohort-vs-histology-comparison/02-GISTIC-tidy-data-prep.nb.html) is a notebook written to tidy and format the GISTIC files needed to compare GISTIC's copy number calls to the copy number calls we prepared in the [focal-cn-file-preparation](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/focal-cn-file-preparation/results) module of this repository.
+The output of this notebook includes tables to be consumed for both a gene-level comparison and cytoband-level comparison of copy number calls analysis notebooks.


So the comparison is an upcoming notebook?

cansavvy · 2020-03-03T17:50:44Z

analyses/compare-gistic/02-GISTIC-tidy-data-prep.Rmd

+}
+
+lgat_gistic_dir <-
+  file.path(gistic_data_dir, "pbta-cnv-consensus-lgat-gistic")


Why is LGAT the group we are moving forward with in this next step of the analysis? Can you add the rationale of why LGAT at the top of the notebook and maybe in the README?

Good question, LGAT was chosen for development purposes (it's more manageable to work with one group as opposed to three groups). That being said, I can add a note to the README explaining this, but will also refactor the code in a way that can be run for all three histologies once we get this code in order (and the code in the upcoming notebook).

cansavvy · 2020-03-03T17:51:54Z

analyses/compare-gistic/02-GISTIC-tidy-data-prep.Rmd

+
+```{r}
+# Source this module's custom functions R script
+source(file.path("util", "GISTIC-comparison-functions.R"))


This probably belongs in the Set Up section rather than its own separate section.

cansavvy · 2020-03-03T18:03:44Z

analyses/compare-gistic/02-GISTIC-tidy-data-prep.Rmd

+
+if (!dir.exists(cohort_gistic_dir)) {
+  unzip(cohort_gistic_zip,
+    exdir = data_dir


I don't think data_dir has been defined in this notebook. Is it supposed to be the same as the previous notebook had it?

cansavvy

This looks great! I like the great documentation on the functions. I don't have huge comments. It looks like the new tables correspond with what @jaclyn-taroni requested.

Other comments not attached to a particular line:

I notice GISTIC puts brackets around some of the gene names (does this mean something or is it random???) If the brackets don't mean anything to us, we should probably take them off before getting to the final table.
A very nit picky comment and probably may just be a "me" thing is to ask if you can explicitly name your output_filename arguments when you use your functions, that way I can clearly tell what are input and what are output files. (Although I notice you end up having all output files as quotes instead of object names).

cansavvy · 2020-03-03T18:05:47Z

analyses/compare-gistic/run-compare-gistic.sh

+cd "$script_directory" || exit
+
+Rscript -e "rmarkdown::render('01-GISTIC-cohort-vs-histology-comparison.Rmd', clean = TRUE)"
+Rscript -e "rmarkdown::render('02-GISTIC-tidy-data-prep.Rmd', clean = TRUE)"


Need extra line here.

cansavvy · 2020-03-03T18:06:16Z

analyses/compare-gistic/run-compare-gistic.sh

+
+# This script should always run as if it were being called from
+# the directory it lives in.
+script_directory="$(perl -e 'use File::Basename;


@jashapiro , what was the alternative you mentioned you had to this the other day?

OpenPBTA-analysis/analyses/run-gistic/run-gistic-module.sh

Line 7 in 7442e28

cd "$(dirname "${BASH_SOURCE[0]}")"

cansavvy · 2020-03-03T18:12:18Z

analyses/compare-gistic/util/GISTIC-comparison-functions.R

+  gistic_all_lesions_df$`Wide Peak Limits` <-
+    gsub("[(].*", "", gistic_all_lesions_df$`Wide Peak Limits`)
+
+  # Keep just the detection peak name


Suggested change

# Keep just the detection peak name

# Keep just the detection peak name, some peaks have `- CN values` at the end of the string.

Do we know why those are there and if they mean anything?

Yes, I can link GISTIC documentation here to further explain this, but the reason is because the values in those rows contain actual CN values, while the rows without distinction contain a value that falls between an amplitude threshold (this is noted in the all_lesions.conf_90.txt file, I will make note of it here as well).

analyses/compare-gistic/util/GISTIC-comparison-functions.R

cansavvy · 2020-03-03T18:14:37Z

analyses/compare-gistic/util/GISTIC-comparison-functions.R

+    dplyr::mutate(status = dplyr::case_when(status < 0 ~ "loss",
+                                            status > 0 ~ "gain",
+                                            status == 0 ~ "neutral")) %>%
+    dplyr::distinct()


Can you add a comment about why this distinct is necessary and what it is removing? A distinct() without a reasoning associated makes me concerned.

cansavvy · 2020-03-03T18:15:19Z

analyses/compare-gistic/util/GISTIC-comparison-functions.R

+    dplyr::select(
+      -c(
+        "Descriptor",
+        "Peak Limits",


GISTIC's spaces in all their column names kill me.

cansavvy · 2020-03-03T18:15:54Z

analyses/compare-gistic/util/GISTIC-comparison-functions.R

+  final_df <- final_df %>%
+    dplyr::left_join(gistic_all_lesions_df, by = c("peak_region" = "Wide Peak Limits")) %>%
+    dplyr::select(gene_symbol, Kids_First_Biospecimen_ID, status, detection_peak = `Unique Name`) %>%
+    dplyr::distinct()


Can you also comment on this distinct() without looking at the data I don't know what's happening.

cansavvy · 2020-03-03T18:20:46Z

analyses/compare-gistic/util/GISTIC-comparison-functions.R

+
+
+  # Wrangle the GISTIC all lesions data to be in a comparable format with our CN calls
+  gistic_all_lesions_df <- gistic_all_lesions_df %>%


This section looks familiar! :) We could combine prepare_gene_level_gistic and prepare_cytoband_level_gistic into one function and just add an additional option of either gene or cytoband but honestly, I think the way you have it is fine too. It's clear this way.

analyses/compare-gistic/README.md

Co-Authored-By: Candace Savonen <cansav09@gmail.com>

- add note on why LGAT's files were chosen to further this analysis - change function name to be more explicit of what it does - add new lines to end of files - add comment explaining purpose of the `distinct` function when implemented - add link to GISTIC documentation in README and in script where relevant - fix `data_dir` file path in 02 nb - move source function script step to `Set Up` section

cbethell · 2020-03-03T20:47:57Z

@cansavvy, re:

I notice GISTIC puts brackets around some of the gene names (does this mean something or is it random???) If the brackets don't mean anything to us, we should probably take them off before getting to the final table.

From GISTIC's docs: "For peaks that contain no genes, the nearest gene is listed in brackets."

As for your other comments, I believe I addressed all with the exception of:

ldply(.data = gistic_files, .fun = unzip, exdir = data_dir)

This line did not run as is so I am looking into tweaking it so that it will.

However, the rest of this PR is ready for another look (hopefully I didn't miss anything but you would probably be able to tell quicker than I can if I did).

cansavvy · 2020-03-04T12:58:48Z

From GISTIC's docs: "For peaks that contain no genes, the nearest gene is listed in brackets."

Can we add this to your 02-notebook and README some place?

ldply(.data = gistic_files, .fun = unzip, exdir = data_dir)

This line did not run as is so I am looking into tweaking it so that it will.

Don't worry about getting this to work then.

cansavvy · 2020-03-04T13:05:03Z

analyses/compare-gistic/util/GISTIC-comparison-functions.R

 }

 #### Implemented in `02-GISTIC-tidy-data-prep.Rmd` ----------------------------

 prepare_gene_level_gistic <- function(all_lesions_file,
                                      amp_genes_file,
                                      del_genes_file,
-                                      output_filename) {
-
+                                      gene_level_output_tsv_filename) {


OH. This isn't what I meant. Sorry. Your argument name itself is fine. I was talking about when you call the function can you put output_filename = in front of it so I know that one is an output file and the others are input files. Can you change it back to output_filename? Sorry if I didn't communicate my idea clearly.

Ohh, okay gotcha! I should have double checked my understanding before making that change. I'll change the argument back and make that initial change that you suggested.

cansavvy · 2020-03-04T13:05:42Z

analyses/compare-gistic/util/GISTIC-comparison-functions.R

 }

 prepare_cytoband_level_gistic <- function(all_lesions_file,
-                                          output_filename) {
-
+                                          cytoband_level_output_tsv_filename) {


Same thing here. Sorry.

cansavvy

This looks good to me! I have a couple minor comments outstanding, but if you want to take a look at those and make changes if you want but no need for re-review after this. 👍

- put `output_filename =` before each appropriate argument in the 02 nb (to differentiate the output file from the input files) - rerun 02 nb

cbethell added 5 commits February 25, 2020 13:30

Add notebook to compare GISTIC gene level calls

fc688bf

- add shell script to run this module's scripts - make appropriate change to `.circleci`

leave neutral calls in for now

94b7b9b

use consensus annotated cn file instead of CNVkit

1f67f16

- fix file path in `.circleci` - add variables to final tables - save final tables

fix column renaming and rerun

ad28371

fix .circleci

b2a608c

cbethell marked this pull request as ready for review February 25, 2020 22:35

cbethell changed the title ~~WIP: Comparison of GISTIC gene level calls~~ Comparison of GISTIC gene level calls Feb 25, 2020

Merge branch 'master' into gistic-gene-level-comparison

b472166

jaclyn-taroni self-requested a review February 26, 2020 15:37

cbethell mentioned this pull request Feb 26, 2020

Proposed Analysis: GISTIC vs. focal-cn-file-preparation comparison #560

Open

cbethell changed the title ~~Comparison of GISTIC gene level calls~~ Comparison of GISTIC gene level calls - Data Prep Feb 26, 2020

cbethell added 2 commits February 26, 2020 18:40

update .circleci

d145c22

fix file path in .circleci

f548df7

jaclyn-taroni reviewed Feb 27, 2020

View reviewed changes

cbethell added 3 commits February 27, 2020 11:19

Generalize get_genes_vector function

3897417

- add functions script to `util` directory of this module - source functions script in `01-GISTIC-cohort-vs-histology-comparison.Rmd` and in `02-GISTIC-tidy-data-prep.Rmd` - move functions from both notebooks to function script

rename get_genes_vector function get_genes_object

9d01f27

- make function name more broad as output can be a vector or a data.frame

cbethell and others added 4 commits February 27, 2020 23:06

Merge branch 'master' into gistic-gene-level-comparison

8531fd8

Merge branch 'master' into gistic-gene-level-comparison

369192d

update module readme

52afaa6

rename output files to be more explicit

443a779

- rerun nb - propogate changes to the `README.md` file

jaclyn-taroni mentioned this pull request Feb 29, 2020

Updated analysis: modules that use output of focal-cn-file-preparation #581

Closed

Merge branch 'master' of https://github.com/cbethell/OpenPBTA-analysis …

97d89ee

…into gistic-gene-level-comparison

cansavvy self-requested a review March 3, 2020 16:32

jaclyn-taroni requested review from cansavvy and removed request for cansavvy March 3, 2020 16:32

cansavvy reviewed Mar 3, 2020

View reviewed changes

cbethell and others added 6 commits March 3, 2020 13:55

Apply sentence re-stucturing suggestions from @cansavvy review

ac1d4c1

Co-Authored-By: Candace Savonen <cansav09@gmail.com>

make output_filename function arguments more explicit

60277c2

run function script through styler

e846865

remove line left from development in 02 nb

3ffe241

Note that we are looking at the three largest individual histology group

7b0b67d

cansavvy reviewed Mar 4, 2020

View reviewed changes

cansavvy approved these changes Mar 4, 2020

View reviewed changes

cbethell and others added 3 commits March 4, 2020 08:59

Merge branch 'master' into gistic-gene-level-comparison

ef8b5d5

revert changes to functions script

af4118a

- put `output_filename =` before each appropriate argument in the 02 nb (to differentiate the output file from the input files) - rerun 02 nb

Move source script step under "Set Up" header

ee910e9

jaclyn-taroni merged commit b5a3383 into AlexsLemonade:master Mar 4, 2020

cbethell mentioned this pull request Mar 4, 2020

Comparison of GISTIC gene level calls - Data Wrangling/Tables #600

Merged

5 tasks

jaclyn-taroni mentioned this pull request Jun 10, 2020

gistic-cohort-vs-histology-comparison -> compare-gistic in analyses/README #718

Merged


		}

		#### Implemented in `01-GISTIC-cohort-vs-histology-comparison.Rmd` ------------



		#### Implemented in both `01-GISTIC-cohort-vs-histology.Rmd` and `02-GISTIC-tidy-data-prep.Rmd`
		get_genes_object <- function(genes_file,

	# Keep just the detection peak name
	# Keep just the detection peak name, some peaks have `- CN values` at the end of the string.



		# Wrangle the GISTIC all lesions data to be in a comparable format with our CN calls
		gistic_all_lesions_df <- gistic_all_lesions_df %>%

Comparison of GISTIC gene level calls - Data Prep #559

Comparison of GISTIC gene level calls - Data Prep #559

Conversation

cbethell commented Feb 25, 2020 • edited Loading

Purpose/implementation Section

What scientific question is your analysis addressing?

What was your approach?

What GitHub issue does your pull request address?

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Results

What types of results are included (e.g., table, figure)?

What is your summary of the results?

Reproducibility Checklist

Documentation Checklist

jaclyn-taroni left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cbethell commented Feb 27, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cansavvy left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cbethell commented Mar 3, 2020

cansavvy commented Mar 4, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cansavvy left a comment

Choose a reason for hiding this comment

cbethell commented Feb 25, 2020 •

edited

Loading