This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Chr instability: PR 3 of 3: Histology plots #532

Merged
merged 67 commits into from
Feb 18, 2020

Conversation

cansavvy
Collaborator

Purpose/implementation Section

This includes redone versions of the material that was included in #492.

This last PR has the last of three notebooks which contain the material that was originally all in the previous 01 notebook.

What scientific question is your analysis addressing?

How does chromosomal instability relate to histology group?

What was your approach?

This takes the binned counts from the 01-localization notebook and makes histology plots for them.

What GitHub issue does your pull request address?

#487

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Most of this material has been reviewed previously, but it now is in its own notebook.
How do you feel about the readability of it? Are there other aesthetic changes that need to be made to the plots?

What is your summary of the results?

Here's the rendered html: https://cansavvy.github.io/openpbta-notebook-concept/chromosomal-instability/02b-plot-chr-instability-by-histology.nb.html

Reproducibility Checklist

These items were done previously.

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.

Documentation Checklist

These items were done previously.

  • This analysis module has a README and it is up to date. (Note that it includes documentation for the upcoming notebooks as well.)
  • This analysis is recorded in the table in analyses/README.md and the entry is up to date.
  • The analytical code is documented and contains comments.

…dea why cnv heatmap wasn't being updated????
@cansavvy
Collaborator Author

Alrighty, @jashapiro , let me know if we think everything is addressed now.

Member

@jashapiro jashapiro left a comment

Aside from my last little comments, this looks good!

@jashapiro jashapiro mentioned this pull request Feb 13, 2020
@@ -356,7 +358,7 @@ breaks_density_list <- lapply(breaks_list, function(breaks_df) {
       samples, experimental_strategy, genome_size
     ) %>%
     # Count number of mutations for that sample
-    dplyr::summarize(breaks_count = dplyr::n()) %>%
+    dplyr::summarize(breaks_count = sum(!is.na(chrom))) %>%
Member

I just realized, while thinking about #490 (comment), that this may turn some samples that should be NA into zeros. If a sample was not in the consensus seg file, it should be NA for CNV breaks and SV breaks. I think such a sample would end up with a zero here.
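
A minimal sketch of the concern, using made-up data. The sample names, the toy `metadata` and `breaks_df` tables, and the left join onto the metadata are assumptions for illustration, not the notebook's actual code. A sample that is absent from the breakpoint table gets a single placeholder row with `chrom = NA`, so `dplyr::n()` counts it as 1 and `sum(!is.na(chrom))` counts it as 0. Neither result is NA.

```r
library(dplyr)

# Hypothetical toy data: three samples in the metadata, but only two of them
# appear in the breakpoint table (sample_C is absent, e.g. not in the seg file).
metadata <- data.frame(
  samples = c("sample_A", "sample_B", "sample_C"),
  stringsAsFactors = FALSE
)
breaks_df <- data.frame(
  samples = c("sample_A", "sample_A", "sample_B"),
  chrom = c("chr1", "chr2", "chr1"),
  stringsAsFactors = FALSE
)

# Joining onto the metadata gives sample_C one row with chrom = NA
breaks_count_df <- metadata %>%
  dplyr::left_join(breaks_df, by = "samples") %>%
  dplyr::group_by(samples) %>%
  dplyr::summarize(
    n_rows = dplyr::n(),               # counts the NA placeholder row as 1
    breaks_count = sum(!is.na(chrom))  # counts the NA placeholder row as 0
  )

print(breaks_count_df)
```

In this toy case `sample_C` comes out with `breaks_count` of 0 even though it was never surveyed.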

Collaborator Author

Those do end up with zeroes. If NA is preferred I can make those changes. I’ll just have to convert them back to zeroes for the CDF plots, unless we think those samples should be dropped from the CDF plots?

Member

Yes, drop them if they are missing. If you don't drop them, it says to your audience that there are n samples that have 0 breakpoints when we don't have evidence either way.
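
To illustrate why this matters for the CDF plots, here is a small sketch with invented counts. The values and sample names are hypothetical, and `stats::ecdf()` stands in for whatever the notebook uses to draw its CDFs. Recoding a missing sample as 0 inflates the fraction of samples that appear to have no breakpoints.

```r
# Hypothetical break counts: NA marks a sample with no evidence either way
breaks_count <- c(sample_A = 12, sample_B = 0, sample_C = NA, sample_D = 5)

# Dropping the missing sample keeps it out of the CDF entirely
observed <- breaks_count[!is.na(breaks_count)]
cdf_dropped <- stats::ecdf(observed)

# Recoding NA as 0 instead adds a zero-breakpoint sample that was never observed
zero_filled <- breaks_count
zero_filled[is.na(zero_filled)] <- 0
cdf_zero_filled <- stats::ecdf(zero_filled)

# Fraction of samples with zero breaks under each choice
cdf_dropped(0)      # 1 of 3 samples
cdf_zero_filled(0)  # 2 of 4 samples
```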

Collaborator Author

So is it correct to say that all WGS samples have been run through both CNV detection pipelines, so if they don't show up in the consensus CNV file or the SV file, they should be NAs for break density?

Collaborator Author

@cansavvy cansavvy Feb 13, 2020

If I'm understanding this correctly, then there will be no such thing as a 0, only NAs, and the minimum breaks_count would be 1.

Member

Member

@jashapiro jashapiro Feb 13, 2020

The docs linked to above by @jaclyn-taroni should explain, but yes, there are both zeros and NAs. Some samples were deemed “uncallable” and should be NA. Others were called but had no CNVs and should be 0.

Comment on lines 360 to 367
dplyr::summarize(is_na = any(is.na(chrom)),
breaks_count = dplyr::n()) %>%
# Calculate breaks density, but put NA for breaks_count if the sample was not
# in the SV or CNV data originally
dplyr::mutate(breaks_count = dplyr::case_when(
!is_na ~ as.numeric(breaks_count),
is_na ~ as.numeric(NA)
),
Member

Unfortunately, this isn't actually doing what we want it to do.
The trouble is that anything with a true zero count will get an NA in the chrom column, just like something that is actually missing.

I think the thing to do is to add a column to metadata called surveyed, which would indicate whether the sample is in unique(c(cnv_samples, sv_samples)). Then, where is_na is TRUE, set breaks_count to 0 if surveyed is TRUE, and to NA otherwise.

something like:

dplyr::mutate(
    breaks_count = dplyr::case_when(
      !is_na ~ as.numeric(breaks_count), 
      is_na & surveyed ~ as.numeric(0),
      TRUE ~ as.numeric(NA)
      ), 
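
For context, a self-contained sketch of how that suggestion might fit together end to end. Only the `surveyed` column and the `case_when()` logic come from the comment above. The toy sample lists, the `metadata` and `breaks_df` tables, and the join are assumptions made for illustration.

```r
library(dplyr)

# Hypothetical sample lists; in the notebook these would come from the CNV and SV data
cnv_samples <- c("sample_A", "sample_B")
sv_samples <- c("sample_B", "sample_C")

# Flag each sample in the (toy) metadata as surveyed or not
metadata <- data.frame(
  samples = c("sample_A", "sample_B", "sample_C", "sample_D"),
  stringsAsFactors = FALSE
) %>%
  dplyr::mutate(surveyed = samples %in% unique(c(cnv_samples, sv_samples)))

# sample_C was surveyed but has no breaks; sample_D was never surveyed
breaks_df <- data.frame(
  samples = c("sample_A", "sample_A", "sample_B"),
  chrom = c("chr1", "chr2", "chr1"),
  stringsAsFactors = FALSE
)

breaks_count_df <- metadata %>%
  dplyr::left_join(breaks_df, by = "samples") %>%
  dplyr::group_by(samples, surveyed) %>%
  dplyr::summarize(
    is_na = any(is.na(chrom)),
    breaks_count = dplyr::n()
  ) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(
    breaks_count = dplyr::case_when(
      !is_na ~ as.numeric(breaks_count),  # real breakpoints: keep the count
      is_na & surveyed ~ as.numeric(0),   # surveyed, nothing found: true zero
      TRUE ~ as.numeric(NA)               # never surveyed: unknown
    )
  )

print(breaks_count_df)
```

With these toy inputs, `sample_C` gets a true zero and `sample_D` stays NA, which is the distinction the review asks for.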

@jashapiro
Member

Okay, I think this is good to go. :shipit:

@jaclyn-taroni jaclyn-taroni merged commit dcc9bf0 into AlexsLemonade:master Feb 18, 2020
jashapiro added a commit to tkoganti/OpenPBTA-analysis that referenced this pull request Feb 18, 2020
jaclyn-taroni pushed a commit that referenced this pull request Feb 18, 2020
* Initial files added to ependymoma subtyping folder

* Added bash script and changed the paths for all files to run from OpenPBTA directory

* Added bash script and changed the paths for all files to run from OpenPBTA directory

* Add Ependymoma subtyping to CI

* Add to analyses/README.md

* Test removing rpy2 and using pyreadr exclusively

* Change to use pyreadr properly

* Revert pyreadr changes

* Add subset flag to CI

* Use R to generate subset file & shell to specify filenames

* Add results file

* Move Ependymoma subtyping up in CI

* Responding to pull request reviews

* Adding jupyter notebook

* Typo fixes

* Update gistic filename

* Changes implemented as suggested on Feb 7, 2020

* Small review changes 

remove unused imports
simplify use of `stats`

* Update 00-subset-for-EPN.R with changes from @cansavvy code review

* Zscore column names changed

* Changed how merge is done between RNA and DNA tables

* Removed  comment lines

* remove duplicate commented code.

* Added some columns as per comments from 02-12-2020

* update invocation of 02_ependymoma_generate_all_data.py 

Add line continuation characters to bash script
Remove no longer used --breakpoints option

* Handle missing data, and some refactoring

I made some substantial changes here in structure, but the results should be largely unchanged.

I did some transposing when constructing data frames so we can use the same function (fill_df) to extract data more often, and moved the ID column specification out of the function so that RNA- and DNA-derived data are handled the same way.

The function now allows a set of samples to be specified. If the request is for a sample that does not fall in that set, it is set as NA for that column in the output data; otherwise it is filled in with a default value.

* Delete unused full table zscore

* Rerun with updated data 

#532 changed results.

Co-authored-by: jashapiro <jashapiro@gmail.com>
@cansavvy cansavvy deleted the reorg-chr-inst-3 branch February 28, 2020 14:59