Chromosomal Instability (PR 3 of 3): The notebook with plots #419

cansavvy · 2020-01-09T19:21:56Z

Purpose/implementation Section

Only one notebook is being added here. Before diving too far into reviewing this PR, let me know if you see a logical way to split it up. Note that most of the added files are plots; there is only one actual code file, but it is a big one.

This third PR has the plots and the notebook that hold the results for this analysis so far.

What scientific question is your analysis addressing?

This was the task:

Generating a measure of chromosomal instability (CIN) burden that can be plotted by tumor type

I borrowed concepts from https://github.com/gonzolgarcia/svcnvplus for generating breakpoint density.

What was your approach?

Using functions from #413 to calculate and plot breakpoint density for CNV, SV and combined CNV and SV.

What GitHub issue does your pull request address?

The first part of #394 (the recurrently altered genes part will be a different PR series that will undoubtedly use the functions this PR series uses as well as maybe additional ones).

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Questions I have:

As of now I am only doing things at the "Biospecimen_ID" level, which means there may be redundancy. I am unsure how to split this up.
I am unsure whether my method of combining the SV and CNV break data is what we want. Can someone comment on the biological validity, or lack thereof, of this?
I've found that the median number of breakpoints, while useful for the CNV data, is not at all useful for reporting the SV data: there are so few SV breakpoints that almost all medians come up as 0. Do you have suggestions for where to go with this? I have averages and totals I could use instead, or I could use something completely different.
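
To make the median problem concrete, here is a minimal sketch with made-up per-bin breakpoint counts (not real PBTA data). The `summarize_counts()` helper and the fraction-of-nonzero-bins statistic are illustrative assumptions, not part of the PR.

```r
# Hypothetical breakpoint counts per genome bin for one sample.
# These numbers are invented purely to illustrate sparse vs. dense data.
sv_counts <- c(0, 0, 0, 0, 0, 0, 0, 2, 0, 1)
cnv_counts <- c(3, 5, 2, 4, 6, 3, 2, 5, 4, 3)

# Hypothetical helper: several candidate summary statistics side by side.
summarize_counts <- function(counts) {
  c(
    median = median(counts),
    mean = mean(counts),
    total = sum(counts),
    frac_nonzero = mean(counts > 0) # share of bins with at least one break
  )
}

sv_summary <- summarize_counts(sv_counts)
cnv_summary <- summarize_counts(cnv_counts)

print(rbind(sv = sv_summary, cnv = cnv_summary))
```

With counts this sparse, the SV median is 0 even though breaks are present, while the mean, the total, and the fraction of nonzero bins still separate the two data types.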

Results

All the plots are the best way to see the results:
https://github.com/cansavvy/OpenPBTA-analysis/tree/cin_setup/analyses/chromosomal-instability/plots

Reproducibility Checklist

There are no new packages that need to be added to the Dockerfile so that is set.

The dependencies required to run the code in this pull request have been added to the project Dockerfile.
This analysis has been added to continuous integration.
This analysis is recorded in the table in analyses/README.md.

… more

jashapiro

This looks good, though I of course have comments!

I think the priority comments are these:

Fix zipping to only give one copy of plots, and to make sure they are placed in a logical way on unzipping (I am not sure how the R zip command handles nested folders).
Check what is NA vs. a zero count in the plots (if a sample has no CNVs in a region, it should be zero, unless the region was unassayed for some reason, which I am not sure we can tell).
Question about what counts as a CNV: are we capturing both dups and dels? Do we need to have a cutoff for ch.pct if we are going to be using a consensus file?
Can we refactor a bit (possibly taking advantage of purrr) to simplify some of the repeated code? This may involve (simplifying) adjustments to break_density() so it takes a single data argument (and a bit of code to make a combined data set to put in).
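
The NA-versus-zero point can be sketched as follows. This is a minimal base R illustration under stated assumptions: the bins, the breakpoint coordinates, and the `assayed` flag are all made up, and it is not clear from this thread whether unassayed regions can actually be identified in the data.

```r
# Hypothetical bin start coordinates (bp) along one chromosome.
bin_starts <- c(0, 1e6, 2e6, 3e6, 4e6)

# Hypothetical breakpoint coordinates for one sample.
break_coords <- c(150000, 900000, 3200000)

# Hypothetical flag marking bins that were not assayed.
assayed <- c(TRUE, TRUE, FALSE, TRUE, TRUE)

# findInterval() returns the index of the bin each coordinate falls in;
# tabulate() then counts breaks per bin, giving 0 for empty bins.
bin_index <- findInterval(break_coords, bin_starts)
counts <- tabulate(bin_index, nbins = length(bin_starts))

# Assayed bins with no breaks stay 0; unassayed bins become missing.
counts[!assayed] <- NA_integer_

print(counts)
```

Keeping the two cases distinct matters downstream: `mean(counts, na.rm = TRUE)` averages over assayed bins only, whereas coding unassayed bins as 0 would pull the density down.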

analyses/README.md

analyses/chromosomal-instability/chromosomal-instability.Rmd

cansavvy · 2020-01-13T19:08:32Z

Per our in-person discussion, I will turn the individual chromosomal break plots into one big heatmap instead.

cansavvy · 2020-01-14T19:53:00Z

Alrighty, I incorporated some of your suggestions, @jashapiro. I have not applied a GISTIC filter to this data, though there was some mention of that. Is this a thing we would like to do in this PR, or should that be a subsequent PR?

jashapiro

I like the new plot!

I think the code can still be DRYed up quite a bit, mostly by not worrying about where break lists come from in subfunctions like break_density(), which frees them up to be used with lapply() rather than in repeated calls.
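
A minimal sketch of that refactoring pattern, in base R. `count_breaks_per_sample()` is a hypothetical stand-in, not the real `break_density()`, whose signature is not shown in this thread; the data are made up, and the column names `samples`, `chrom`, and `coord` are the ones used in the notebook's `dplyr::distinct()` call.

```r
# Hypothetical stand-in for a function that takes ONE data frame of breaks
# and does not care whether they came from CNV, SV, or both.
count_breaks_per_sample <- function(breaks_df) {
  counts <- table(breaks_df$samples)
  data.frame(
    samples = names(counts),
    n_breaks = as.integer(counts),
    stringsAsFactors = FALSE
  )
}

# Made-up break tables.
cnv_breaks <- data.frame(
  samples = c("A", "A", "B"),
  chrom = c("chr1", "chr2", "chr1"),
  coord = c(100, 500, 100),
  stringsAsFactors = FALSE
)
sv_breaks <- data.frame(
  samples = c("A", "B"),
  chrom = c("chr1", "chr3"),
  coord = c(100, 900),
  stringsAsFactors = FALSE
)

# Build the combined set once, outside the function.
combined <- unique(rbind(cnv_breaks, sv_breaks))

# One lapply() call replaces three near-identical function calls.
break_sets <- list(cnv = cnv_breaks, sv = sv_breaks, combined = combined)
break_counts <- lapply(break_sets, count_breaks_per_sample)

print(break_counts$combined)
```

The same shape works with `purrr::map()` in place of `lapply()`; the key design choice is that the function receives a single data argument and the caller decides which break set to pass.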

There is also some filtering code that I think we may want to just remove, pending consensus cnv calls. I can see why it is there, but given that we put some effort into filtering at that stage, it seems like we ought to be consistent in the filters that are applied. If we don't think those filters are appropriate for this use case, we could reconsider, but I feel like we will need strong justification to use different standards in different parts of the analysis.

analyses/chromosomal-instability/chromosomal-instability.Rmd

jashapiro · 2020-01-16T00:55:53Z

analyses/chromosomal-instability/chromosomal-instability.Rmd

+The SV and CNV data comes to us in the form of ranges, but for getting a look 
+at chromosomal instability, we will want to convert this data into single break
+points of the genome. 
+We'll also remove the sex chromosomes. 


I meant this "Why?" as an instruction to add detail to the notebook. 😉
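
For readers following along, the range-to-breakpoint conversion described in the quoted notebook text can be sketched like this. It is a base R illustration on made-up segments: `ranges_to_breaks()` and the `start`/`end` column names are assumptions, and the notebook itself may do this differently (for example, with GenomicRanges).

```r
# Made-up segments; column names are illustrative.
segments <- data.frame(
  samples = c("A", "A", "B"),
  chrom = c("chr1", "chrX", "chr2"),
  start = c(1000, 200, 5000),
  end = c(2000, 800, 9000),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop the listed chromosomes, then treat each
# segment's start and end as two separate single-coordinate breaks.
ranges_to_breaks <- function(seg_df, drop_chroms = c("chrX", "chrY")) {
  seg_df <- seg_df[!(seg_df$chrom %in% drop_chroms), ]
  breaks <- data.frame(
    samples = rep(seg_df$samples, times = 2),
    chrom = rep(seg_df$chrom, times = 2),
    coord = c(seg_df$start, seg_df$end),
    stringsAsFactors = FALSE
  )
  breaks[order(breaks$samples, breaks$chrom, breaks$coord), ]
}

breaks <- ranges_to_breaks(segments)
print(breaks)
```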

analyses/chromosomal-instability/chromosomal-instability.Rmd

analyses/chromosomal-instability/util/chr-break-calculate.R

jashapiro · 2020-01-18T00:03:21Z

analyses/README.md

@@ -11,7 +11,9 @@ Note that _nearly all_ modules use the harmonized clinical data file (`pbta-hist

 | Module | Input Files | Brief Description | Output Files Consumed by Other Analyses |
 |--------|-------|-------------------|--------------|
-| [`cnv-chrom-plot`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/cnv-chrom) | `pbta-cnv-cnvkit-gistic.zip` <br> `pbta-cnv-cnvkit.seg.gz` | Makes plots from GISTIC output as well as `seg.mean` plots by histology group  | N/A
+| [`chromosomal-instability`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/chromosomal-instability) | `pbta-histologies.tsv` <br> `pbta-sv-manta.tsv.gz` <br> `pbta-cnv-cnvkit.seg.gz` | Evaluates chromosomal instability by calculating chromosomal breakpoint densities | N/A
+| [`cnv-chrom-plot`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/cnv-chrom) | `pbta-cnv-cnvkit-gistic.zip` and `pbta-cnv-cnvkit.seg.gz` | Makes plots from GISTIC output as well as `seg.mean` plots by histology group  | N/A


Accidentally duplicated this line.

Suggested change

| [`cnv-chrom-plot`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/cnv-chrom) | `pbta-cnv-cnvkit-gistic.zip` and `pbta-cnv-cnvkit.seg.gz` | Makes plots from GISTIC output as well as `seg.mean` plots by histology group | N/A

jashapiro · 2020-01-18T00:07:23Z

analyses/chromosomal-instability/chromosomal-instability.Rmd

+# Set seed so heatmaps turn out the same
+set.seed(2020)
+
+# This threshold will be used to determine percent of genome changed


Suggested change

# This threshold will be used to determine percent of genome changed

# This threshold will be used to determine the threshold for copy number changes identified.

jashapiro

This looks good to go to me. I have a comment below about one thing to keep in mind (looking for adjacent breaks that could result in double counting). Other than that, I think the next step is to convert to using the cnv consensus files, while keeping in mind that the "blacklist" uncalled regions should be represented as missing data, not necessarily as 0 breaks.

jashapiro · 2020-01-18T00:16:58Z

analyses/chromosomal-instability/chromosomal-instability.Rmd

+
+```{r}
+common_breaks <- dplyr::bind_rows(sv_breaks, cnv_breaks) %>%
+  dplyr::distinct(samples, chrom, coord, .keep_all = TRUE)


There was some discussion as to whether this should be an intersection, rather than union (which this is), but this seems correct to me. The only question I had, which we did discuss verbally, was whether we should also filter out adjacent breakpoints, where one segment starts 1 bp after the previous segment. I don't anticipate this will happen much in the consensus data, but it is a thing that could occur, and would overcount high instability regions in particular.
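
One way the adjacent-breakpoint filter could look, as a hedged sketch in base R so it runs without extra packages. `collapse_adjacent_breaks()` and its `min_gap` parameter are hypothetical, the data are made up, and the column names follow the diff above. It takes the union of breaks, then drops any break that sits within `min_gap` bp of the previous break in the same sample and chromosome.

```r
# Made-up union of SV and CNV breaks.
all_breaks <- data.frame(
  samples = c("A", "A", "A", "A", "B", "B"),
  chrom = c("chr1", "chr1", "chr1", "chr2", "chr1", "chr1"),
  coord = c(1000, 1001, 5000, 1001, 1000, 1000),
  stringsAsFactors = FALSE
)

# Hypothetical helper: deduplicate, then remove breaks that are at most
# `min_gap` bp after the previous break in the same sample and chromosome.
collapse_adjacent_breaks <- function(breaks_df, min_gap = 1) {
  breaks_df <- unique(breaks_df[, c("samples", "chrom", "coord")])
  breaks_df <- breaks_df[
    order(breaks_df$samples, breaks_df$chrom, breaks_df$coord),
  ]
  group <- paste(breaks_df$samples, breaks_df$chrom)
  # Distance to the previous break within each group; Inf for the first one.
  gap <- ave(breaks_df$coord, group, FUN = function(x) c(Inf, diff(x)))
  breaks_df[gap > min_gap, ]
}

filtered_breaks <- collapse_adjacent_breaks(all_breaks)
print(filtered_breaks)
```

Note that a run of consecutive coordinates (for example 1000, 1001, 1002) collapses to its first member under this rule, which may or may not be the desired behavior for the consensus data.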

Candace Savonen and others added 30 commits January 3, 2020 12:02

It's a start of a notebook

a43dfd1

Got some of the code converted and working

af29c7c

Percent genome change plot added

89cfdee

Have overlap of CNV and SV

226d906

Breakpoint summary df

4b8871f

Got things mostly organized and the wrangling functions are formed up…

547117d

… more

Load my functions as they are so far. TileGenome setting up

34fe36e

Made a plotting function that probably won't work yet

fce6e40

Things are here, just need to test things, reorg and add more doc

7fd1d28

Plotting function works

e4b14f8

Merge remote-tracking branch 'upstream/master' into cin_setup

35c908c

Merge remote-tracking branch 'upstream/master' into cin_setup

9f800d8

It's working!

d3ee47d

Plots are working just need more doc

cc4e07a

Add a README and change the function file name

19c611e

Some reorganization and adding the plots

f4c2c6b

lintered and neatened up some things

0d9bf6c

Few more touches to the README

4ed2a25

Add the functions

2d59523

Edited a messed up comment

6951c98

Merge branch 'master' into cin_functions

554c113

Fixed some issues with the functions

9e0d321

Add tumor type plots

42d1f68

Rearrange and fix some things with the group calculations

454c70b

Smoothing out some metadata handling of GenomicRanges

ea3221b

Functions handle multiple samples

8c90043

The functions handle grouped data

5a34518

Got rid of a development remnant I found

0e7cbbb

minor typo fixes

94560da

Apply percentage filter to CNV data

ce94432

Candace Savonen and others added 7 commits January 10, 2020 09:45

Fix logic statement for sample check.

9a03085

Remove histology groups if they don't have at least 2 samples

7cb1f8b

Merge remote-tracking branch 'upstream/master' into cin_setup

06dfed6

Merge branch 'master' into cin_setup

3625d55

update notebook

7f622b7

Merge branch 'master' into cin_setup

fe78ff8

Fix plot zipping to overwrite current zip files

8531d20

jashapiro reviewed Jan 13, 2020


Candace Savonen and others added 6 commits January 13, 2020 15:27

The easy changes suggested by @jashapiro have been implemented

9a746bb

Heatmap is mostly there.

e2cf01d

Merge branch 'master' into cin_setup

267b2f2

Upload the heatmaps and fixed stuff

3aa31de

Merge remote-tracking branch 'origin/cin_setup' into cin_setup

3f0b142

Merge branch 'master' into cin_setup

e7c70bd

jashapiro reviewed Jan 16, 2020


cansavvy added 4 commits January 16, 2020 10:38

Incorporating @jashapiro 's suggestions

9a61060

relinter, refresh and re-run

6a505bd

Merge branch 'master' into cin_setup

8620516

Update README

4c2f7b5

cansavvy mentioned this pull request Jan 16, 2020

Proposed Analysis: chromosomal instability burden, recurrently altered genes #394

Closed

Merge branch 'master' into cin_setup

c608d27

jashapiro reviewed Jan 18, 2020


Remove accidentally duplicated line.

aec0cb9

jashapiro approved these changes Jan 18, 2020


Merge branch 'master' into cin_setup

4e991ac

jaclyn-taroni merged commit c96bac0 into AlexsLemonade:master Jan 18, 2020

cansavvy deleted the cin_setup branch February 6, 2020 20:59
