-
Notifications
You must be signed in to change notification settings - Fork 83
Chromosomal Instability (PR 3 of 3): The notebook with plots #419
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good, though I of course have comments!
I think the priority comments are these:
- Fix zipping to only give one copy of plots, and to make sure they are placed in a logical way on unzipping (I am not sure how the R zip command handles nested folders).
- Check what is NA vs a zero count in the plots (if a sample has not CNVs in a region, it should be zero, unless the region was unassayed for some reason, which I am not sure we can tell)
- Question about what counts as a CNV: are we capturing both dups and dels? Do we need to have a cutoff for ch.pct if we are going to be using a consensus file?
- Can we refactor a bit (possibly taking advantage of
purrr
) to simplify some of the repeated code. This may involve (simplifying) adjustments tobreak_density()
so it takes a single data argument. (And a bit of code to make a combined data set to put in.)
Per our in person discussion, I will turn the individual chromosomal break plots into one big heatmap instead. |
Alrighty, I incorporated some of your suggestions, @jashapiro. I have not applied a GISTIC filter to this data, though there was some mention of that. Is this a thing we would like to do in this PR, or should that be a subsequent PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the new plot!
I think the code can still be DRYed up quite a bit, mostly by not worrying about where break lists come from in subfunctions like break_density()
which frees them up to be used by lapply() rather than as repeated calls.
There is also some filtering code that I think we may want to just remove, pending consensus cnv calls. I can see why it is there, but given that we put some effort into filtering at that stage, it seems like we ought to be consistent in the filters that are applied. If we don't think those filters are appropriate for this use case, we could reconsider, but I feel like we will need strong justification to use different standards in different parts of the analysis.
The SV and CNV data comes to us in the form of ranges, but for getting a look | ||
at chromosomal instability, we will want to convert this data into single break | ||
points of the genome. | ||
We'll also remove the sex chromosomes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant this "Why?" as an instruction to add detail to the notebook. 😉
analyses/README.md
Outdated
@@ -11,7 +11,9 @@ Note that _nearly all_ modules use the harmonized clinical data file (`pbta-hist | |||
|
|||
| Module | Input Files | Brief Description | Output Files Consumed by Other Analyses | | |||
|--------|-------|-------------------|--------------| | |||
| [`cnv-chrom-plot`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/cnv-chrom) | `pbta-cnv-cnvkit-gistic.zip` <br> `pbta-cnv-cnvkit.seg.gz` | Makes plots from GISTIC output as well as `seg.mean` plots by histology group | N/A | |||
| [`chromosomal-instability`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/chromosomal-instability) | `pbta-histologies.tsv` <br> `pbta-sv-manta.tsv.gz` <br> `pbta-cnv-cnvkit.seg.gz` | Evaluates chromosomal instability by calculating chromosomal breakpoint densities | N/A | |||
| [`cnv-chrom-plot`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/cnv-chrom) | `pbta-cnv-cnvkit-gistic.zip` and `pbta-cnv-cnvkit.seg.gz` | Makes plots from GISTIC output as well as `seg.mean` plots by histology group | N/A |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Accidentally duplicated this line.
| [`cnv-chrom-plot`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/cnv-chrom) | `pbta-cnv-cnvkit-gistic.zip` and `pbta-cnv-cnvkit.seg.gz` | Makes plots from GISTIC output as well as `seg.mean` plots by histology group | N/A |
# Set seed so heatmaps turn out the same | ||
set.seed(2020) | ||
|
||
# This threshold will be used to determine percent of genome changed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# This threshold will be used to determine percent of genome changed | |
# This threshold will be used to determine the threshold for copy number changes identified. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to go to me. I have a comment below of one thing to keep in mind (looking for adjacent breaks that could result in double counting). Other than that, I think the next step is to convert to using the cnv consensus files, while keeping in mind the "blacklist" uncalled regions should be represented as missing data, not necessarily as 0 breaks.
|
||
```{r} | ||
common_breaks <- dplyr::bind_rows(sv_breaks, cnv_breaks) %>% | ||
dplyr::distinct(samples, chrom, coord, .keep_all = TRUE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was some discussion as to whether this should be an intersection, rather than union (which this is), but this seems correct to me. The only question I had, which we did discuss verbally, was whether we should also filter out adjacent breakpoints, where one segment starts 1 bp after the previous segment. I don't anticipate this will happen much in the consensus data, but it is a thing that could occur, and would overcount high instability regions in particular.
Purpose/implementation Section
There is really only one notebook being added here. Before diving too far into reviewing this PR, let me know if you see a logical way to split it up. Do note that most of the added files are plots, there's only one actual file of code being added but it's a big file.
This third PR has the plots and notebook that has the results for this analysis so far.
What scientific question is your analysis addressing?
This was the task:
I borrowed concepts from https://github.com/gonzolgarcia/svcnvplus for generating breakpoint density.
What was your approach?
Using functions from #413 to calculate and plot breakpoint density for CNV, SV and combined CNV and SV.
What GitHub issue does your pull request address?
The first part of #394 (the recurrently altered genes part will be a different PR series that will undoubtedly use the functions this PR series uses as well as maybe additional ones).
Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.
Questions I have:
As of now I am only doing things at the "Biospecimen_ID" level. Which means there may be redundancy. I am unsure what way to split this up?
I am unsure my method of using combining the SV and CNV break data is what we want. Can someone comment on the biological validity or lack thereof of this?
I've found that median number of breakpoints, while useful for the CNV data is not at all useful for reporting the SV data due to how few SV breaks points there are, almost all come up as 0. Do you have suggestions for where to go with this? I have averages and totals I could use instead. Or something completely different.
Results
All the plots are the best way to see the results:
https://github.com/cansavvy/OpenPBTA-analysis/tree/cin_setup/analyses/chromosomal-instability/plots
Reproducibility Checklist
There are no new packages that need to be added to the Dockerfile so that is set.
analyses/README.md
.