Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Chromosomal Instability (PR 3 of 3): The notebook with plots #419

Merged
merged 71 commits into from
Jan 18, 2020

Conversation

cansavvy
Copy link
Collaborator

@cansavvy cansavvy commented Jan 9, 2020

Purpose/implementation Section

There is really only one notebook being added here. Before diving too far into reviewing this PR, let me know if you see a logical way to split it up. Do note that most of the added files are plots, there's only one actual file of code being added but it's a big file.

This third PR has the plots and notebook that has the results for this analysis so far.

What scientific question is your analysis addressing?

This was the task:

Generating a measure of chromosomal instability (CIN) burden that can be plotted by tumor type

I borrowed concepts from https://github.com/gonzolgarcia/svcnvplus for generating breakpoint density.

What was your approach?

Using functions from #413 to calculate and plot breakpoint density for CNV, SV and combined CNV and SV.

What GitHub issue does your pull request address?

The first part of #394 (the recurrently altered genes part will be a different PR series that will undoubtedly use the functions this PR series uses as well as maybe additional ones).

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Questions I have:

  • As of now I am only doing things at the "Biospecimen_ID" level. Which means there may be redundancy. I am unsure what way to split this up?

  • I am unsure my method of using combining the SV and CNV break data is what we want. Can someone comment on the biological validity or lack thereof of this?

  • I've found that median number of breakpoints, while useful for the CNV data is not at all useful for reporting the SV data due to how few SV breaks points there are, almost all come up as 0. Do you have suggestions for where to go with this? I have averages and totals I could use instead. Or something completely different.

Results

All the plots are the best way to see the results:
https://github.com/cansavvy/OpenPBTA-analysis/tree/cin_setup/analyses/chromosomal-instability/plots

Reproducibility Checklist

There are no new packages that need to be added to the Dockerfile so that is set.

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.
  • This analysis is recorded in the table in analyses/README.md.

Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, though I of course have comments!

I think the priority comments are these:

  1. Fix zipping to only give one copy of plots, and to make sure they are placed in a logical way on unzipping (I am not sure how the R zip command handles nested folders).
  2. Check what is NA vs a zero count in the plots (if a sample has not CNVs in a region, it should be zero, unless the region was unassayed for some reason, which I am not sure we can tell)
  3. Question about what counts as a CNV: are we capturing both dups and dels? Do we need to have a cutoff for ch.pct if we are going to be using a consensus file?
  4. Can we refactor a bit (possibly taking advantage of purrr) to simplify some of the repeated code. This may involve (simplifying) adjustments to break_density() so it takes a single data argument. (And a bit of code to make a combined data set to put in.)

@cansavvy
Copy link
Collaborator Author

Per our in person discussion, I will turn the individual chromosomal break plots into one big heatmap instead.

@cansavvy
Copy link
Collaborator Author

Alrighty, I incorporated some of your suggestions, @jashapiro. I have not applied a GISTIC filter to this data, though there was some mention of that. Is this a thing we would like to do in this PR, or should that be a subsequent PR?

Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the new plot!

I think the code can still be DRYed up quite a bit, mostly by not worrying about where break lists come from in subfunctions like break_density() which frees them up to be used by lapply() rather than as repeated calls.

There is also some filtering code that I think we may want to just remove, pending consensus cnv calls. I can see why it is there, but given that we put some effort into filtering at that stage, it seems like we ought to be consistent in the filters that are applied. If we don't think those filters are appropriate for this use case, we could reconsider, but I feel like we will need strong justification to use different standards in different parts of the analysis.

The SV and CNV data comes to us in the form of ranges, but for getting a look
at chromosomal instability, we will want to convert this data into single break
points of the genome.
We'll also remove the sex chromosomes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant this "Why?" as an instruction to add detail to the notebook. 😉

@@ -11,7 +11,9 @@ Note that _nearly all_ modules use the harmonized clinical data file (`pbta-hist

| Module | Input Files | Brief Description | Output Files Consumed by Other Analyses |
|--------|-------|-------------------|--------------|
| [`cnv-chrom-plot`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/cnv-chrom) | `pbta-cnv-cnvkit-gistic.zip` <br> `pbta-cnv-cnvkit.seg.gz` | Makes plots from GISTIC output as well as `seg.mean` plots by histology group | N/A
| [`chromosomal-instability`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/chromosomal-instability) | `pbta-histologies.tsv` <br> `pbta-sv-manta.tsv.gz` <br> `pbta-cnv-cnvkit.seg.gz` | Evaluates chromosomal instability by calculating chromosomal breakpoint densities | N/A
| [`cnv-chrom-plot`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/cnv-chrom) | `pbta-cnv-cnvkit-gistic.zip` and `pbta-cnv-cnvkit.seg.gz` | Makes plots from GISTIC output as well as `seg.mean` plots by histology group | N/A
Copy link
Member

@jashapiro jashapiro Jan 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accidentally duplicated this line.

Suggested change
| [`cnv-chrom-plot`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/cnv-chrom) | `pbta-cnv-cnvkit-gistic.zip` and `pbta-cnv-cnvkit.seg.gz` | Makes plots from GISTIC output as well as `seg.mean` plots by histology group | N/A

# Set seed so heatmaps turn out the same
set.seed(2020)

# This threshold will be used to determine percent of genome changed
Copy link
Member

@jashapiro jashapiro Jan 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# This threshold will be used to determine percent of genome changed
# This threshold will be used to determine the threshold for copy number changes identified.

Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to go to me. I have a comment below of one thing to keep in mind (looking for adjacent breaks that could result in double counting). Other than that, I think the next step is to convert to using the cnv consensus files, while keeping in mind the "blacklist" uncalled regions should be represented as missing data, not necessarily as 0 breaks.


```{r}
common_breaks <- dplyr::bind_rows(sv_breaks, cnv_breaks) %>%
dplyr::distinct(samples, chrom, coord, .keep_all = TRUE)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was some discussion as to whether this should be an intersection, rather than union (which this is), but this seems correct to me. The only question I had, which we did discuss verbally, was whether we should also filter out adjacent breakpoints, where one segment starts 1 bp after the previous segment. I don't anticipate this will happen much in the consensus data, but it is a thing that could occur, and would overcount high instability regions in particular.

@jaclyn-taroni jaclyn-taroni merged commit c96bac0 into AlexsLemonade:master Jan 18, 2020
@cansavvy cansavvy deleted the cin_setup branch February 6, 2020 20:59
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants