diff --git a/.circleci/config.yml b/.circleci/config.yml
index a2d4a5ada7..af1b86e5f6 100644
--- a/.circleci/config.yml
+++ b/.circleci/config.yml
@@ -128,6 +128,10 @@ jobs:
name: Gene set enrichment analysis to generate GSVA scores
command: OPENPBTA_TESTING=1 ./scripts/run_in_ci.sh bash "analyses/gene-set-enrichment-analysis/run-gsea.sh"
+ - run:
+ name: Chromosomal instability breakpoints
+ command: ./scripts/run_in_ci.sh Rscript -e "rmarkdown::render('analyses/chromosomal-instability/chromosomal-instability.Rmd', clean = TRUE)"
+
- run:
name: Fusion Summary
command: ./scripts/run_in_ci.sh bash "analyses/fusion-summary/run-new-analysis.sh"
diff --git a/analyses/README.md b/analyses/README.md
index 2233e28f4f..d22f0ea1c8 100644
--- a/analyses/README.md
+++ b/analyses/README.md
@@ -11,7 +11,8 @@ Note that _nearly all_ modules use the harmonized clinical data file (`pbta-hist
| Module | Input Files | Brief Description | Output Files Consumed by Other Analyses |
|--------|-------|-------------------|--------------|
-| [`cnv-chrom-plot`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/cnv-chrom) | `pbta-cnv-cnvkit-gistic.zip`
`pbta-cnv-cnvkit.seg.gz` | Makes plots from GISTIC output as well as `seg.mean` plots by histology group | N/A
+| [`chromosomal-instability`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/chromosomal-instability) | `pbta-histologies.tsv`
`pbta-sv-manta.tsv.gz`
`pbta-cnv-cnvkit.seg.gz` | Evaluates chromosomal instability by calculating chromosomal breakpoint densities | N/A
+| [`cnv-chrom-plot`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/cnv-chrom) | `pbta-cnv-cnvkit-gistic.zip`
`pbta-cnv-cnvkit.seg.gz` | Makes plots from GISTIC output as well as `seg.mean` plots by histology group | N/A\
| [`cnv-comparison`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/cnv-comparison) | Earlier version of SEG files | *Deprecated*; compared earlier version of the CNV methods. | N/A
| [`collapse-rnaseq`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/collapse-rnaseq) | `pbta-gene-expression-rsem-fpkm.polya.rds`
`pbta-gene-expression-rsem-fpkm.stranded.rds`
`gencode.v27.primary_assembly.annotation.gtf.gz` | Collapses RSEM FPKM matrices such that gene symbols are de-duplicated. | `results/pbta-gene-expression-rsem-fpkm-collapsed.polya.rds`
`results/pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds` (included in data download; too large for tracking via GitHub)
| [`comparative-RNASeq-analysis`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/comparative-RNASeq-analysis) | `pbta-gene-expression-rsem-tpm.polya.rds`
`pbta-gene-expression-rsem-tpm.stranded.rds` | *In progress*; will produce expression outlier profiles per [#229](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/229) | N/A |
diff --git a/analyses/chromosomal-instability/README.md b/analyses/chromosomal-instability/README.md
index c9ca64b55c..df69d6dfa8 100644
--- a/analyses/chromosomal-instability/README.md
+++ b/analyses/chromosomal-instability/README.md
@@ -41,5 +41,5 @@ The individual sample plots and grouped by `short_histology` plots are in the `p
- `make_granges` : Given a data.frame with chr break coordinates, make a `GenomicRanges` object.
- `break_density`: Given data.frame(s) with chr break coordinates, calculate the density of the breaks.
-- `map_density_plot`: Given a `GenomicRanges` object, use map the chromosomal coordinates to a `ggplot2`
-- `chr_break_plot`: Given a list of `GenomicRanges` objects, plot them in a combined `cowplot`.
+- `map_breaks_plot`: Given a `GenomicRanges` object, use map the chromosomal coordinates to a `ggplot2`
+- `multipanel_break_plot`: Given a list of `GenomicRanges` objects, plot them in a combined `cowplot`.
diff --git a/analyses/chromosomal-instability/chromosomal-instability.Rmd b/analyses/chromosomal-instability/chromosomal-instability.Rmd
new file mode 100644
index 0000000000..e73caf2e13
--- /dev/null
+++ b/analyses/chromosomal-instability/chromosomal-instability.Rmd
@@ -0,0 +1,471 @@
+---
+title: "Chromosomal Instability: Breakpoints"
+output:
+ html_notebook:
+ toc: true
+ toc_float: true
+author: Candace Savonen for ALSF - CCDL
+date: 2020
+---
+
+This analysis evaluates chromosomal instability by formatting SV and CNV data
+into chromosomal breaks and measuring chromosomal break density.
+This notebook returns chromosomal break heatmaps and plots for each tumor type
+group in the `plots` directory.
+
+Code adapted from [svcnvplus](https://github.com/gonzolgarcia/svcnvplus).
+
+### Usage
+
+This notebook can be run via the command line from the top directory of the
+repository as follows:
+
+```
+Rscript -e "rmarkdown::render('analyses/chromosomal-instability/chromosomal-instability.Rmd',
+ clean = TRUE)"
+```
+
+### Set Up
+
+```{r}
+# Set seed so heatmaps turn out the same
+set.seed(2020)
+
+# This threshold will be used to determine percent of genome changed
+# TODO: after updating to consensus CNV data, evaluate this threshold setting.
+ch.pct <- 0.2
+
+# Magrittr pipe
+`%>%` <- dplyr::`%>%`
+```
+
+Read in the custom functions needed for this analysis.
+
+```{r}
+source(file.path("util", "chr-break-calculate.R"))
+source(file.path("util", "chr-break-plot.R"))
+```
+
+### Directories and Files
+
+```{r}
+# Path to data directory
+data_dir <- file.path("..", "..", "data")
+
+# Path to output directory
+plots_dir <- "plots"
+
+# Path to tumor type plots output directory
+hist_plots_dir <- file.path(plots_dir, "tumor-type")
+
+# Create the hist_plots_dir if it does not exist
+if (!dir.exists(hist_plots_dir)) {
+ dir.create(hist_plots_dir, recursive = TRUE)
+}
+```
+
+### Read in data and format it
+
+#### Set up metadata
+
+```{r}
+# Read in the metadata
+metadata <- readr::read_tsv(file.path(data_dir, "pbta-histologies.tsv"))
+```
+
+#### Set up CNV data:
+
+Read in the CNV data and format it to make the chromosome and coordinates easier for
+handling later.
+TODO: Update to consensus CNV data.
+
+```{r}
+# Read in the segment copy number data
+cnv_df <- data.table::fread(file.path(data_dir, "pbta-cnv-cnvkit.seg.gz"),
+ data.table = FALSE
+) %>%
+ dplyr::mutate(
+ samples = as.factor(ID),
+ # Calculate width
+ width = loc.end - loc.start,
+ # Make a binary variable that labels whether or not a sequence
+ # is considered changed by the threshold set
+ # TODO: after updating to Consensus CNV data, see if sex chromosomes still
+ # need to be removed and whether these thresholds should be changed.
+ changed = as.factor(dplyr::case_when(
+ abs(seg.mean) > log2(1 + ch.pct) ~ "change",
+ TRUE ~ "no_change"
+ ))
+ ) %>%
+ # Reformat the chromosome variable to drop the "chr"
+ dplyr::mutate(chrom = factor(gsub("chr", "", chrom),
+ levels = c(1:22, "X", "Y")
+ )) %>%
+ # Changing these so they end up matching the SV data
+ dplyr::rename(start = loc.start, end = loc.end)
+```
+
+Filter CNV data to only the changes that are larger than our cutoff `ch.pct`.
+
+```{r}
+cnv_filtered_df <- cnv_df %>%
+ dplyr::filter(changed == "change")
+```
+
+#### Set up SV data
+
+Read in the SV data.
+
+```{r}
+sv_df <- data.table::fread(file.path(data_dir, "pbta-sv-manta.tsv.gz"),
+ data.table = FALSE
+) %>%
+ # Reformat the 23 and 24 chromosomes so they are X and Y and also factors
+ dplyr::mutate(
+ chrom = dplyr::recode(SV.chrom,
+ `23` = "X", `24` = "Y"
+ )
+ )
+```
+
+#### Set up chromosome reference data
+
+Set up chromosome size data.
+It just so happens that this BED file: `WGS.hg38.strelka2.unpadded.bed` is actually
+just a list of the chromosome sizes so we are using that for now.
+
+```{r}
+# Set up Chr sizes
+chr_sizes <- readr::read_tsv(file.path(data_dir, "WGS.hg38.strelka2.unpadded.bed"),
+ col_names = FALSE
+) %>%
+ # Reformat the chromosome variable to drop the "chr"
+ dplyr::mutate(X1 = factor(gsub("chr", "", X1),
+ levels = c(1:22, "X", "Y", "M")
+ )) %>%
+ # Remove sex chromosomes
+ dplyr::filter(!(X1 %in% c("X", "Y", "M")))
+
+# Make chromosome size named vector
+chr_sizes_vector <- chr_sizes$X3
+names(chr_sizes_vector) <- chr_sizes$X1
+```
+
+## Format the data as chromosomes breaks
+
+The SV and CNV data comes to us in the form of ranges, but for getting a look
+at chromosomal instability, we will want to convert this data into single break
+points of the genome.
+We'll also remove the sex chromosomes.
+TODO: after updating to consensus CNV data, evaluate whether sex chromosomes should still be removed.
+
+Make an CNV breaks data.frame.
+
+```{r}
+# Make a CNV data.frame that has the breaks
+cnv_breaks <- data.frame(
+ samples = rep(cnv_filtered_df$ID, 2),
+ chrom = rep(cnv_filtered_df$chrom, 2),
+ coord = c(cnv_filtered_df$start, cnv_filtered_df$end),
+ seg.mean = rep(cnv_filtered_df$seg.mean, 2),
+ copy.num = rep(cnv_filtered_df$copy.num, 2),
+ stringsAsFactors = FALSE
+) %>%
+ # Remove sex chromosomes
+ dplyr::filter(!(chrom %in% c("X", "Y")))
+```
+
+Make an SV breaks data.frame.
+
+```{r}
+# Make a data.frame that has the breaks
+sv_breaks <- data.frame(
+ samples = rep(sv_df$Kids.First.Biospecimen.ID.Tumor, 2),
+ chrom = rep(sv_df$chrom, 2),
+ coord = c(sv_df$SV.start, sv_df$SV.end),
+ svclass = rep(sv_df$SV.type, 2),
+ stringsAsFactors = FALSE
+) %>%
+ # Remove sex chromosomes and NAs
+ dplyr::filter(!(chrom %in% c("X", "Y", "M", NA)))
+```
+
+Make an common breaks data.frame.
+
+```{r}
+common_breaks <- dplyr::bind_rows(sv_breaks, cnv_breaks) %>%
+ dplyr::distinct(samples, chrom, coord, .keep_all = TRUE)
+```
+
+Put all the breaks into a list.
+
+```{r}
+breaks_list <- list(
+ common_breaks = common_breaks,
+ cnv_breaks = cnv_breaks,
+ sv_breaks = sv_breaks
+)
+```
+
+### Co-localization of breakpoints for samples
+
+For each sample, we will bin the genome and count how many chromosome breaks
+occur for each bin, given the given `bin_size`.
+
+We will run this for each sample and return a list of three `GenomicRanges` objects:
+1) `common_breaks` contains the combined break counts for both SV and CNV break data.
+2) `cnv_breaks` contains the number of break counts for CNV.
+3) `sv_breaks` contains the number of break counts for SV.
+
+```{r, warning = FALSE, results='hide'}
+# Change here and it will change the rest
+bin_size <- 1e6
+
+# Obtain a list of samples that are common to both datasets.
+common_samples <- intersect(sv_breaks$samples, cnv_breaks$samples)
+
+# Get a big list of break densities for each sample.
+sample_densities <- lapply(common_samples, function(sample_id) {
+ lapply(breaks_list, function(breaks_df) {
+ break_density(breaks_df,
+ sample_id = sample_id,
+ start_col = "coord",
+ end_col = "coord",
+ window_size = bin_size,
+ chr_sizes_vector = chr_sizes_vector
+ )
+ })
+})
+
+# Bring along the sample IDs
+names(sample_densities) <- common_samples
+```
+
+### Set up for making heatmaps of the breakpoints
+
+Given the `GenomicRanges` objects for each sample, create a combined plot for
+each.
+
+Make chromosome labeling `HeatmapAnnotation` object.
+
+```{r}
+# Extract chromosome names from a GenomicRanges object
+chr_labels <- sample_densities[[1]]$common_breaks@seqnames
+chr_labels <- paste0("chr", as.factor(chr_labels))
+
+# Make a key for assigning alternating colors to the chromosomes
+chr_colors <- rep(c("grey", "lightblue"), length.out = length(unique(chr_labels)))
+names(chr_colors) <- unique(chr_labels)
+
+# Create the Heatmap annotation object
+chr_annot <- ComplexHeatmap::HeatmapAnnotation(
+ df = data.frame(chr_labels), col = list(chr_labels = c(chr_colors)),
+ name = "",
+ show_legend = FALSE,
+ show_annotation_name = FALSE
+)
+```
+
+Make histology labeling `HeatmapAnnotation` object.
+
+```{r}
+# Get the histologies for the samples in this set and order them by histology
+histologies <-
+ data.frame(Kids_First_Biospecimen_ID = common_samples) %>%
+ dplyr::inner_join(dplyr::select(metadata, Kids_First_Biospecimen_ID, short_histology)) %>%
+ dplyr::mutate(short_histology = as.factor(short_histology)) %>%
+ dplyr::arrange(short_histology) %>%
+ tibble::column_to_rownames("Kids_First_Biospecimen_ID")
+
+# Create the Heatmap annotation object
+hist_annot <- ComplexHeatmap::rowAnnotation(
+ df = histologies,
+ show_annotation_name = FALSE
+)
+```
+
+Make a color function.
+
+```{r}
+col_fun <- circlize::colorRamp2(
+ c(0, 2, 5, 10, 20),
+ c("#edf8fb", "#b2e2e2", "#66c2a4", "#2ca25f", "#006d2c")
+)
+```
+
+Make a function for making the heatmaps.
+
+```{r}
+breaks_heatmap <- function(data_name) {
+ # A wrapper function for making a heatmap from the samples GenomicRanges list.
+ #
+ # Args:
+ # data_name: a character string that matches a name in the list.
+
+ # Returns:
+ # A heatmap of the chromosomal breaks
+
+ # Pull out the total_counts
+ total_counts <- lapply(sample_densities, function(granges_list) {
+ granges_list[[data_name]]@elementMetadata@listData$total_counts
+ })
+ # Make into a data.frame
+ total_counts <- dplyr::bind_rows(total_counts) %>%
+ dplyr::select(rownames(histologies)) %>%
+ t()
+ # Add chromosome labels
+ colnames(total_counts) <- chr_labels
+ # Plot on a heatmap
+ heatmap <- ComplexHeatmap::Heatmap(total_counts,
+ col = col_fun,
+ heatmap_legend_param = list(title = "Count of chr breaks"),
+ cluster_columns = FALSE,
+ cluster_rows = FALSE,
+ show_column_names = FALSE,
+ show_row_names = FALSE,
+ bottom_annotation = chr_annot,
+ left_annotation = hist_annot
+ )
+ # Return plot
+ return(heatmap)
+}
+```
+
+## Common breaks heatmap
+
+```{r}
+common_heatmap <- breaks_heatmap(data_name = "common_breaks")
+
+# Save plot as PNG
+png(file.path(plots_dir, paste0("common_breaks_heatmap.png")),
+ width = 1200, height = 900, units = "px"
+)
+common_heatmap
+dev.off()
+
+# Print out here
+common_heatmap
+```
+
+## CNV breaks heatmap
+
+```{r}
+cnv_heatmap <- breaks_heatmap(data_name = "cnv_breaks")
+
+# Save plot as PNG
+png(file.path(plots_dir, paste0("cnv_breaks_heatmap.png")),
+ width = 1200, height = 900, units = "px"
+)
+cnv_heatmap
+dev.off()
+
+# Print out here
+cnv_heatmap
+```
+
+## SV breaks heatmap
+
+```{r}
+sv_heatmap <- breaks_heatmap(data_name = "sv_breaks")
+
+# Save plot as PNG
+png(file.path(plots_dir, paste0("sv_breaks_heatmap.png")),
+ width = 1200, height = 900, units = "px"
+)
+sv_heatmap
+dev.off()
+
+# Print out here
+sv_heatmap
+```
+
+### Co-localization of breakpoints for tumor-type groups
+
+Same as was done for each sample, now we will calculate densities for
+each tumor-type group.
+
+```{r, results='hide'}
+# Change bin_size here and it will change the rest
+bin_size <- 1e6
+
+# Get a list of the tumor_types for which we have DNA-seq data
+tumor_types <- metadata %>%
+ dplyr::filter(!is.na(short_histology), experimental_strategy != "RNA-Seq") %>%
+ dplyr::distinct(short_histology) %>%
+ dplyr::pull(short_histology)
+```
+
+Run the density calculations for the groups.
+
+```{r}
+# Get a big list of break densities for each sample.
+group_densities <- lapply(tumor_types, function(tumor_type) {
+
+ # Obtain a list of sample_ids to subset by
+ sample_ids <- metadata %>%
+ dplyr::filter(metadata$short_histology == tumor_type) %>%
+ dplyr::pull(Kids_First_Biospecimen_ID)
+
+ # Double check these samples are in the list
+ check_samples <- (sample_ids %in% common_samples)
+
+ # If no samples, go to next
+ if (sum(check_samples) > 1) {
+ message(paste("Calculating breakpoint density for", tumor_type, "samples"))
+ lapply(breaks_list, function(breaks_df) {
+ break_density(breaks_df,
+ sample_id = sample_ids,
+ start_col = "coord",
+ end_col = "coord",
+ window_size = bin_size,
+ chr_sizes_vector = chr_sizes_vector
+ )
+ })
+ }
+})
+
+# Bring along the tumor-type labels
+names(group_densities) <- tumor_types
+
+# Remove tumor_types data that didn't have at least two samples
+group_densities <- group_densities[!sapply(group_densities, is.null)]
+```
+
+### Plot the breakpoints for each tumor-type
+
+Here we will plot median number of break points for the tumor-type group per
+each bin.
+
+```{r}
+purrr::imap(group_densities, function(.x, name = .y) {
+ # Make the combo plot
+ multipanel_break_plot(
+ granges_list = .x,
+ plot_name = name,
+ y_val = "median_counts",
+ y_lab = "Median Breaks per Mb",
+ plot_dir = hist_plots_dir
+ )
+})
+```
+
+Zip up the PNG files into one file.
+
+```{r}
+# Declare name of zip file
+zip_file <- paste0(hist_plots_dir, ".zip")
+
+# Remove any current zip_file of this name so we can overwrite it
+if (file.exists(zip_file)) {
+ file.remove(zip_file)
+}
+# Zip up the plots
+zip(zip_file, hist_plots_dir)
+```
+
+### Session Info
+
+```{r}
+sessionInfo()
+```
+
diff --git a/analyses/chromosomal-instability/chromosomal-instability.nb.html b/analyses/chromosomal-instability/chromosomal-instability.nb.html
new file mode 100644
index 0000000000..686d1bcfcf
--- /dev/null
+++ b/analyses/chromosomal-instability/chromosomal-instability.nb.html
@@ -0,0 +1,3884 @@
+
+
+
+
+
This analysis evaluates chromosomal instability by formatting SV and CNV data into chromosomal breaks and measuring chromosomal break density. This notebook returns chromosomal break heatmaps and plots for each tumor type group in the plots
directory.
Code adapted from svcnvplus.
+This notebook can be run via the command line from the top directory of the repository as follows:
+Rscript -e "rmarkdown::render('analyses/chromosomal-instability/chromosomal-instability.Rmd',
+ clean = TRUE)"
+# Set seed so heatmaps turn out the same
+set.seed(2020)
+
+# This threshold will be used to determine percent of genome changed
+# TODO: after updating to consensus CNV data, evaluate this threshold setting.
+ch.pct <- 0.2
+
+# Magrittr pipe
+`%>%` <- dplyr::`%>%`
+
+
+
+Read in the custom functions needed for this analysis.
+ + + +source(file.path("util", "chr-break-calculate.R"))
+source(file.path("util", "chr-break-plot.R"))
+
+
+
+# Path to data directory
+data_dir <- file.path("..", "..", "data")
+
+# Path to output directory
+plots_dir <- "plots"
+
+# Path to tumor type plots output directory
+hist_plots_dir <- file.path(plots_dir, "tumor-type")
+
+# Create the hist_plots_dir if it does not exist
+if (!dir.exists(hist_plots_dir)) {
+ dir.create(hist_plots_dir, recursive = TRUE)
+}
+
+
+
+# Read in the metadata
+metadata <- readr::read_tsv(file.path(data_dir, "pbta-histologies.tsv"))
+
+
+Parsed with column specification:
+cols(
+ .default = col_character(),
+ OS_days = [32mcol_double()[39m,
+ age_last_update_days = [32mcol_double()[39m,
+ normal_fraction = [32mcol_double()[39m,
+ tumor_fraction = [32mcol_double()[39m,
+ tumor_ploidy = [32mcol_double()[39m
+)
+See spec(...) for full column specifications.
+
+
+
+Read in the CNV data and format it to make the chromosome and coordinates easier for handling later. TODO: Update to consensus CNV data.
+ + + +# Read in the segment copy number data
+cnv_df <- data.table::fread(file.path(data_dir, "pbta-cnv-cnvkit.seg.gz"),
+ data.table = FALSE
+) %>%
+ dplyr::mutate(
+ samples = as.factor(ID),
+ # Calculate width
+ width = loc.end - loc.start,
+ # Make a binary variable that labels whether or not a sequence
+ # is considered changed by the threshold set
+ # TODO: after updating to Consensus CNV data, see if sex chromosomes still
+ # need to be removed and whether these thresholds should be changed.
+ changed = as.factor(dplyr::case_when(
+ abs(seg.mean) > log2(1 + ch.pct) ~ "change",
+ TRUE ~ "no_change"
+ ))
+ ) %>%
+ # Reformat the chromosome variable to drop the "chr"
+ dplyr::mutate(chrom = factor(gsub("chr", "", chrom),
+ levels = c(1:22, "X", "Y")
+ )) %>%
+ # Changing these so they end up matching the SV data
+ dplyr::rename(start = loc.start, end = loc.end)
+
+
+
+Filter CNV data to only the changes that are larger than our cutoff ch.pct
.
cnv_filtered_df <- cnv_df %>%
+ dplyr::filter(changed == "change")
+
+
+
+Read in the SV data.
+ + + +sv_df <- data.table::fread(file.path(data_dir, "pbta-sv-manta.tsv.gz"),
+ data.table = FALSE
+) %>%
+ # Reformat the 23 and 24 chromosomes so they are X and Y and also factors
+ dplyr::mutate(
+ chrom = dplyr::recode(SV.chrom,
+ `23` = "X", `24` = "Y"
+ )
+ )
+
+
+|--------------------------------------------------|
+|==================================================|
+
+
+
+Set up chromosome size data. It just so happens that this BED file: WGS.hg38.strelka2.unpadded.bed
is actually just a list of the chromosome sizes so we are using that for now.
# Set up Chr sizes
+chr_sizes <- readr::read_tsv(file.path(data_dir, "WGS.hg38.strelka2.unpadded.bed"),
+ col_names = FALSE
+) %>%
+ # Reformat the chromosome variable to drop the "chr"
+ dplyr::mutate(X1 = factor(gsub("chr", "", X1),
+ levels = c(1:22, "X", "Y", "M")
+ )) %>%
+ # Remove sex chromosomes
+ dplyr::filter(!(X1 %in% c("X", "Y", "M")))
+
+
+Parsed with column specification:
+cols(
+ X1 = [31mcol_character()[39m,
+ X2 = [32mcol_double()[39m,
+ X3 = [32mcol_double()[39m
+)
+
+
+# Make chromosome size named vector
+chr_sizes_vector <- chr_sizes$X3
+names(chr_sizes_vector) <- chr_sizes$X1
+
+
+
+The SV and CNV data comes to us in the form of ranges, but for getting a look at chromosomal instability, we will want to convert this data into single break points of the genome. We’ll also remove the sex chromosomes. TODO: after updating to consensus CNV data, evaluate whether sex chromosomes should still be removed.
+Make an CNV breaks data.frame.
+ + + +# Make a CNV data.frame that has the breaks
+cnv_breaks <- data.frame(
+ samples = rep(cnv_filtered_df$ID, 2),
+ chrom = rep(cnv_filtered_df$chrom, 2),
+ coord = c(cnv_filtered_df$start, cnv_filtered_df$end),
+ seg.mean = rep(cnv_filtered_df$seg.mean, 2),
+ copy.num = rep(cnv_filtered_df$copy.num, 2),
+ stringsAsFactors = FALSE
+) %>%
+ # Remove sex chromosomes
+ dplyr::filter(!(chrom %in% c("X", "Y")))
+
+
+
+Make an SV breaks data.frame.
+ + + +# Make a data.frame that has the breaks
+sv_breaks <- data.frame(
+ samples = rep(sv_df$Kids.First.Biospecimen.ID.Tumor, 2),
+ chrom = rep(sv_df$chrom, 2),
+ coord = c(sv_df$SV.start, sv_df$SV.end),
+ svclass = rep(sv_df$SV.type, 2),
+ stringsAsFactors = FALSE
+) %>%
+ # Remove sex chromosomes and NAs
+ dplyr::filter(!(chrom %in% c("X", "Y", "M", NA)))
+
+
+
+Make an common breaks data.frame.
+ + + +common_breaks <- dplyr::bind_rows(sv_breaks, cnv_breaks) %>%
+ dplyr::distinct(samples, chrom, coord, .keep_all = TRUE)
+
+
+binding character and factor vector, coercing into character vector
+
+
+
+Put all the breaks into a list.
+ + + +breaks_list <- list(
+ common_breaks = common_breaks,
+ cnv_breaks = cnv_breaks,
+ sv_breaks = sv_breaks
+)
+
+
+
+For each sample, we will bin the genome and count how many chromosome breaks occur for each bin, given the given bin_size
.
We will run this for each sample and return a list of three GenomicRanges
objects: 1) common_breaks
contains the combined break counts for both SV and CNV break data.
+2) cnv_breaks
contains the number of break counts for CNV.
+3) sv_breaks
contains the number of break counts for SV.
# Change here and it will change the rest
+bin_size <- 1e6
+
+# Obtain a list of samples that are common to both datasets.
+common_samples <- intersect(sv_breaks$samples, cnv_breaks$samples)
+
+# Get a big list of break densities for each sample.
+sample_densities <- lapply(common_samples, function(sample_id) {
+ lapply(breaks_list, function(breaks_df) {
+ break_density(breaks_df,
+ sample_id = sample_id,
+ start_col = "coord",
+ end_col = "coord",
+ window_size = bin_size,
+ chr_sizes_vector = chr_sizes_vector
+ )
+ })
+})
+
+# Bring along the sample IDs
+names(sample_densities) <- common_samples
+
+
+
+Given the GenomicRanges
objects for each sample, create a combined plot for each.
Make chromosome labeling HeatmapAnnotation
object.
# Extract chromosome names from a GenomicRanges object
+chr_labels <- sample_densities[[1]]$common_breaks@seqnames
+chr_labels <- paste0("chr", as.factor(chr_labels))
+
+# Make a key for assigning alternating colors to the chromosomes
+chr_colors <- rep(c("grey", "lightblue"), length.out = length(unique(chr_labels)))
+names(chr_colors) <- unique(chr_labels)
+
+# Create the Heatmap annotation object
+chr_annot <- ComplexHeatmap::HeatmapAnnotation(
+ df = data.frame(chr_labels), col = list(chr_labels = c(chr_colors)),
+ name = "",
+ show_legend = FALSE,
+ show_annotation_name = FALSE
+)
+
+
+
+Make histology labeling HeatmapAnnotation
object.
# Get the histologies for the samples in this set and order them by histology
+histologies <-
+ data.frame(Kids_First_Biospecimen_ID = common_samples) %>%
+ dplyr::inner_join(dplyr::select(metadata, Kids_First_Biospecimen_ID, short_histology)) %>%
+ dplyr::mutate(short_histology = as.factor(short_histology)) %>%
+ dplyr::arrange(short_histology) %>%
+ tibble::column_to_rownames("Kids_First_Biospecimen_ID")
+
+
+Joining, by = "Kids_First_Biospecimen_ID"
+Column `Kids_First_Biospecimen_ID` joining factor and character vector, coercing into character vector
+
+
+# Create the Heatmap annotation object
+hist_annot <- ComplexHeatmap::rowAnnotation(
+ df = histologies,
+ show_annotation_name = FALSE
+)
+
+
+
+Make a color function.
+ + + +col_fun <- circlize::colorRamp2(
+ c(0, 2, 5, 10, 20),
+ c("#edf8fb", "#b2e2e2", "#66c2a4", "#2ca25f", "#006d2c")
+)
+
+
+
+Make a function for making the heatmaps.
+ + + +breaks_heatmap <- function(data_name) {
+ # A wrapper function for making a heatmap from the samples GenomicRanges list.
+ #
+ # Args:
+ # data_name: a character string that matches a name in the list.
+
+ # Returns:
+ # A heatmap of the chromosomal breaks
+
+ # Pull out the total_counts
+ total_counts <- lapply(sample_densities, function(granges_list) {
+ granges_list[[data_name]]@elementMetadata@listData$total_counts
+ })
+ # Make into a data.frame
+ total_counts <- dplyr::bind_rows(total_counts) %>%
+ dplyr::select(rownames(histologies)) %>%
+ t()
+ # Add chromosome labels
+ colnames(total_counts) <- chr_labels
+ # Plot on a heatmap
+ heatmap <- ComplexHeatmap::Heatmap(total_counts,
+ col = col_fun,
+ heatmap_legend_param = list(title = "Count of chr breaks"),
+ cluster_columns = FALSE,
+ cluster_rows = FALSE,
+ show_column_names = FALSE,
+ show_row_names = FALSE,
+ bottom_annotation = chr_annot,
+ left_annotation = hist_annot
+ )
+ # Return plot
+ return(heatmap)
+}
+
+
+
+common_heatmap <- breaks_heatmap(data_name = "common_breaks")
+
+# Save plot as PNG
+png(file.path(plots_dir, paste0("common_breaks_heatmap.png")),
+ width = 1200, height = 900, units = "px"
+)
+common_heatmap
+dev.off()
+
+
+null device
+ 1
+
+
+# Print out here
+common_heatmap
+
+
+cnv_heatmap <- breaks_heatmap(data_name = "cnv_breaks")
+
+# Save plot as PNG
+png(file.path(plots_dir, paste0("cnv_breaks_heatmap.png")),
+ width = 1200, height = 900, units = "px"
+)
+cnv_heatmap
+dev.off()
+
+
+null device
+ 1
+
+
+# Print out here
+cnv_heatmap
+
+
+sv_heatmap <- breaks_heatmap(data_name = "sv_breaks")
+
+# Save plot as PNG
+png(file.path(plots_dir, paste0("sv_breaks_heatmap.png")),
+ width = 1200, height = 900, units = "px"
+)
+sv_heatmap
+dev.off()
+
+
+null device
+ 1
+
+
+# Print out here
+sv_heatmap
+
+
+Same as was done for each sample, now we will calculate densities for each tumor-type group.
+ + + +# Change bin_size here and it will change the rest
+bin_size <- 1e6
+
+# Get a list of the tumor_types for which we have DNA-seq data
+tumor_types <- metadata %>%
+ dplyr::filter(!is.na(short_histology), experimental_strategy != "RNA-Seq") %>%
+ dplyr::distinct(short_histology) %>%
+ dplyr::pull(short_histology)
+
+
+
+Run the density calculations for the groups.
+ + + +# Get a big list of break densities for each sample.
+group_densities <- lapply(tumor_types, function(tumor_type) {
+
+ # Obtain a list of sample_ids to subset by
+ sample_ids <- metadata %>%
+ dplyr::filter(metadata$short_histology == tumor_type) %>%
+ dplyr::pull(Kids_First_Biospecimen_ID)
+
+ # Double check these samples are in the list
+ check_samples <- (sample_ids %in% common_samples)
+
+ # If no samples, go to next
+ if (sum(check_samples) > 1) {
+ message(paste("Calculating breakpoint density for", tumor_type, "samples"))
+ lapply(breaks_list, function(breaks_df) {
+ break_density(breaks_df,
+ sample_id = sample_ids,
+ start_col = "coord",
+ end_col = "coord",
+ window_size = bin_size,
+ chr_sizes_vector = chr_sizes_vector
+ )
+ })
+ }
+})
+
+
+Calculating breakpoint density for Ependymoma samples
+Calculating breakpoint density for HGAT samples
+Calculating breakpoint density for LGAT samples
+Calculating breakpoint density for Other samples
+Calculating breakpoint density for CNS sarcoma samples
+Calculating breakpoint density for Medulloblastoma samples
+Calculating breakpoint density for Schwannoma samples
+Calculating breakpoint density for ATRT samples
+Calculating breakpoint density for Choroid plexus tumor samples
+Calculating breakpoint density for Craniopharyngioma samples
+Calculating breakpoint density for DNET samples
+Calculating breakpoint density for Teratoma samples
+Calculating breakpoint density for Hemangioblastoma samples
+Calculating breakpoint density for Meningioma samples
+Calculating breakpoint density for Adenoma samples
+Calculating breakpoint density for Neurofibroma samples
+Calculating breakpoint density for Ganglioglioma samples
+Calculating breakpoint density for Langerhans Cell histiocytosis samples
+Calculating breakpoint density for CNS Embryonal tumor samples
+Calculating breakpoint density for CNS neuroblastoma samples
+Calculating breakpoint density for Chordoma samples
+Calculating breakpoint density for MPNST samples
+Calculating breakpoint density for Glial-neuronal tumor NOS samples
+Calculating breakpoint density for Pineoblastoma samples
+Calculating breakpoint density for CNS EFT-CIC samples
+Calculating breakpoint density for Central neurocytoma samples
+Calculating breakpoint density for Germinoma samples
+Calculating breakpoint density for CNS Rhabdomyosarcoma samples
+Calculating breakpoint density for Dysplasia samples
+Calculating breakpoint density for Oligodendroglioma samples
+Calculating breakpoint density for Hemangioma samples
+Calculating breakpoint density for LGMT samples
+Calculating breakpoint density for CNS ganglioneuroblastoma samples
+
+
+# Bring along the tumor-type labels
+names(group_densities) <- tumor_types
+
+# Remove tumor_types data that didn't have at least two samples
+group_densities <- group_densities[!sapply(group_densities, is.null)]
+
+
+
+Here we will plot median number of break points for the tumor-type group per each bin.
+ + + +purrr::imap(group_densities, function(.x, name = .y) {
+ # Make the combo plot
+ multipanel_break_plot(
+ granges_list = .x,
+ plot_name = name,
+ y_val = "median_counts",
+ y_lab = "Median Breaks per Mb",
+ plot_dir = hist_plots_dir
+ )
+})
+
+
+$Ependymoma
+
+$HGAT
+
+$LGAT
+
+$Other
+
+$`CNS sarcoma`
+
+$Medulloblastoma
+
+$Schwannoma
+
+$ATRT
+
+$`Choroid plexus tumor`
+
+$Craniopharyngioma
+
+$DNET
+
+$Teratoma
+
+$Hemangioblastoma
+
+$Meningioma
+
+$Adenoma
+
+$Neurofibroma
+
+$Ganglioglioma
+
+$`Langerhans Cell histiocytosis`
+
+$`CNS Embryonal tumor`
+
+$`CNS neuroblastoma`
+
+$Chordoma
+
+$MPNST
+
+$`Glial-neuronal tumor NOS`
+
+$Pineoblastoma
+
+$`CNS EFT-CIC`
+
+$`Central neurocytoma`
+
+$Germinoma
+
+$`CNS Rhabdomyosarcoma`
+
+$Dysplasia
+
+$Oligodendroglioma
+
+$Hemangioma
+
+$LGMT
+
+$`CNS ganglioneuroblastoma`
+
+
+Zip up the PNG files into one file.
+ + + +# Declare name of zip file
+zip_file <- paste0(hist_plots_dir, ".zip")
+
+# Remove any current zip_file of this name so we can overwrite it
+if (file.exists(zip_file)) {
+ file.remove(zip_file)
+}
+
+
+[1] TRUE
+
+
+# Zip up the plots
+zip(zip_file, hist_plots_dir)
+
+
+ adding: plots/tumor-type/ (stored 0%)
+ adding: plots/tumor-type/Ependymoma_breaks.png (deflated 16%)
+ adding: plots/tumor-type/CNS Embryonal tumor_breaks.png (deflated 15%)
+ adding: plots/tumor-type/Glial-neuronal tumor NOS_breaks.png (deflated 14%)
+ adding: plots/tumor-type/ATRT_breaks.png (deflated 15%)
+ adding: plots/tumor-type/Medulloblastoma_breaks.png (deflated 15%)
+ adding: plots/tumor-type/Oligodendroglioma_breaks.png (deflated 12%)
+ adding: plots/tumor-type/LGMT_breaks.png (deflated 11%)
+ adding: plots/tumor-type/.DS_Store (deflated 97%)
+ adding: plots/tumor-type/HGAT_breaks.png (deflated 15%)
+ adding: plots/tumor-type/Hemangioma_breaks.png (deflated 12%)
+ adding: plots/tumor-type/LGAT_breaks.png (deflated 16%)
+ adding: plots/tumor-type/Pineoblastoma_breaks.png (deflated 13%)
+ adding: plots/tumor-type/Dysplasia_breaks.png (deflated 16%)
+ adding: plots/tumor-type/MPNST_breaks.png (deflated 13%)
+ adding: plots/tumor-type/Central neurocytoma_breaks.png (deflated 10%)
+ adding: plots/tumor-type/CNS neuroblastoma_breaks.png (deflated 13%)
+ adding: plots/tumor-type/Neurofibroma_breaks.png (deflated 15%)
+ adding: plots/tumor-type/Chordoma_breaks.png (deflated 16%)
+ adding: plots/tumor-type/Schwannoma_breaks.png (deflated 16%)
+ adding: plots/tumor-type/Hemangioblastoma_breaks.png (deflated 14%)
+ adding: plots/tumor-type/DNET_breaks.png (deflated 16%)
+ adding: plots/tumor-type/Langerhans Cell histiocytosis_breaks.png (deflated 13%)
+ adding: plots/tumor-type/CNS Rhabdomyosarcoma_breaks.png (deflated 11%)
+ adding: plots/tumor-type/Meningioma_breaks.png (deflated 15%)
+ adding: plots/tumor-type/Ganglioglioma_breaks.png (deflated 16%)
+ adding: plots/tumor-type/CNS sarcoma_breaks.png (deflated 11%)
+ adding: plots/tumor-type/Germinoma_breaks.png (deflated 15%)
+ adding: plots/tumor-type/Choroid plexus tumor_breaks.png (deflated 14%)
+ adding: plots/tumor-type/Craniopharyngioma_breaks.png (deflated 15%)
+ adding: plots/tumor-type/Teratoma_breaks.png (deflated 15%)
+ adding: plots/tumor-type/CNS EFT-CIC_breaks.png (deflated 16%)
+ adding: plots/tumor-type/Other_breaks.png (deflated 16%)
+ adding: plots/tumor-type/Adenoma_breaks.png (deflated 12%)
+ adding: plots/tumor-type/CNS ganglioneuroblastoma_breaks.png (deflated 14%)
+
+
+
+sessionInfo()
+
+
+R version 3.6.0 (2019-04-26)
+Platform: x86_64-pc-linux-gnu (64-bit)
+Running under: Debian GNU/Linux 9 (stretch)
+
+Matrix products: default
+BLAS/LAPACK: /usr/lib/libopenblasp-r0.2.19.so
+
+locale:
+ [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
+ [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C LC_PAPER=en_US.UTF-8 LC_NAME=C
+ [9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
+
+attached base packages:
+[1] parallel stats4 stats graphics grDevices utils datasets methods base
+
+other attached packages:
+[1] GenomicRanges_1.38.0 GenomeInfoDb_1.22.0 IRanges_2.20.1 S4Vectors_0.24.1 BiocGenerics_0.32.0
+
+loaded via a namespace (and not attached):
+ [1] colorspace_1.4-1 rjson_0.2.20
+ [3] rprojroot_1.3-2 biovizBase_1.34.1
+ [5] htmlTable_1.13.1 circlize_0.4.8
+ [7] XVector_0.26.0 GlobalOptions_0.1.1
+ [9] base64enc_0.1-3 dichromat_2.0-0
+ [11] clue_0.3-57 rstudioapi_0.10
+ [13] getopt_1.20.3 bit64_0.9-7
+ [15] AnnotationDbi_1.48.0 fansi_0.4.0
+ [17] splines_3.6.0 R.methodsS3_1.7.1
+ [19] ggbio_1.34.0 knitr_1.23
+ [21] zeallot_0.1.0 Formula_1.2-3
+ [23] jsonlite_1.6 Rsamtools_2.2.1
+ [25] cluster_2.1.0 dbplyr_1.4.2
+ [27] png_0.1-7 R.oo_1.22.0
+ [29] graph_1.62.0 BiocManager_1.30.4
+ [31] readr_1.3.1 compiler_3.6.0
+ [33] httr_1.4.0 backports_1.1.4
+ [35] assertthat_0.2.1 Matrix_1.2-17
+ [37] lazyeval_0.2.2 cli_1.1.0
+ [39] acepack_1.4.1 htmltools_0.3.6
+ [41] prettyunits_1.0.2 tools_3.6.0
+ [43] gtable_0.3.0 glue_1.3.1
+ [45] GenomeInfoDbData_1.2.2 reshape2_1.4.3
+ [47] dplyr_0.8.3 rappdirs_0.3.1
+ [49] Rcpp_1.0.1 TxDb.Hsapiens.UCSC.hg38.knownGene_3.10.0
+ [51] Biobase_2.46.0 styler_1.1.1
+ [53] vctrs_0.1.0 Biostrings_2.54.0
+ [55] rtracklayer_1.46.0 optparse_1.6.2
+ [57] xfun_0.8 stringr_1.4.0
+ [59] ensembldb_2.10.2 XML_3.98-1.20
+ [61] org.Hs.eg.db_3.10.0 zlibbioc_1.32.0
+ [63] scales_1.0.0 BSgenome_1.54.0
+ [65] VariantAnnotation_1.32.0 ProtGenerics_1.18.0
+ [67] hms_0.4.2 RBGL_1.62.1
+ [69] SummarizedExperiment_1.16.0 rematch2_2.0.1
+ [71] AnnotationFilter_1.10.0 RColorBrewer_1.1-2
+ [73] ComplexHeatmap_2.2.0 yaml_2.2.0
+ [75] curl_3.3 memoise_1.1.0
+ [77] gridExtra_2.3 ggplot2_3.2.0
+ [79] rpart_4.1-15 biomaRt_2.42.0
+ [81] reshape_0.8.8 latticeExtra_0.6-28
+ [83] stringi_1.4.3 RSQLite_2.1.1
+ [85] checkmate_1.9.4 GenomicFeatures_1.38.0
+ [87] BiocParallel_1.20.0 shape_1.4.4
+ [89] rlang_0.4.0 pkgconfig_2.0.2
+ [91] matrixStats_0.55.0 bitops_1.0-6
+ [93] evaluate_0.14 lattice_0.20-38
+ [95] purrr_0.3.2 labeling_0.3
+ [97] GenomicAlignments_1.22.1 htmlwidgets_1.3
+ [99] cowplot_0.9.4 bit_1.1-14
+[101] tidyselect_0.2.5 GGally_1.4.0
+[103] plyr_1.8.4 magrittr_1.5
+[105] R6_2.4.0 Hmisc_4.2-0
+[107] DelayedArray_0.12.0 DBI_1.0.0
+[109] withr_2.1.2 foreign_0.8-71
+[111] pillar_1.4.2 nnet_7.3-12
+[113] survival_2.44-1.1 RCurl_1.95-4.12
+[115] tibble_2.1.3 crayon_1.3.4
+[117] utf8_1.1.4 OrganismDbi_1.28.0
+[119] BiocFileCache_1.10.2 rmarkdown_1.13
+[121] GetoptLong_0.1.7 progress_1.2.2
+[123] grid_3.6.0 data.table_1.12.2
+[125] blob_1.1.1 digest_0.6.20
+[127] tidyr_0.8.3 R.utils_2.9.0
+[129] openssl_1.4 munsell_0.5.0
+[131] askpass_1.1
+
+
+
+
+