Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Chr instability: PR 3 of 3: Histology plots #532

Merged
merged 67 commits into from
Feb 18, 2020
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
67 commits
Select commit Hold shift + click to select a range
942c927
Reorganize to localization first
cansavvy Feb 5, 2020
adef2d9
Missed some README edits
cansavvy Feb 5, 2020
294567f
Forgot I can push the results files too
cansavvy Feb 5, 2020
d1264b7
Merge branch 'master' into reorg-chr-insta-1
cansavvy Feb 5, 2020
31c81f8
Make intersect_cnv_sv its own function with more specific ally declar…
cansavvy Feb 5, 2020
970dba7
make it snazzy with purrr::transpose
cansavvy Feb 5, 2020
599913f
change results file names
cansavvy Feb 5, 2020
d41e7e4
Update words
cansavvy Feb 5, 2020
beba0bc
Fix bin_indices thing
cansavvy Feb 5, 2020
7e90984
Setting up the next notebook
cansavvy Feb 6, 2020
6a7291d
Move file names to their own section for @jashapiro relinter and rerun
cansavvy Feb 6, 2020
d433b94
Remove remnant knit file
cansavvy Feb 6, 2020
df7e363
Merge branch 'reorg-chr-insta-1' into reog-chr-insta-2
cansavvy Feb 6, 2020
68f2a71
Update heatmaps and run it
cansavvy Feb 6, 2020
64b98f1
Add to bash script
cansavvy Feb 6, 2020
d48b4f6
Relinter and refresh notebooks
cansavvy Feb 6, 2020
15e54ed
Add in histology nb
cansavvy Feb 6, 2020
d3f1b03
Set up and run histology notebook
cansavvy Feb 6, 2020
f91b5a5
Delete union file and take out some unnecssary pastes still have no i…
cansavvy Feb 6, 2020
846716f
Refresh notebook
cansavvy Feb 6, 2020
f307785
Implement @jashapiro 's fixes and finds!
cansavvy Feb 10, 2020
f937439
Merge remote-tracking branch 'upstream/master' into reog-chr-insta-2
cansavvy Feb 10, 2020
162f15b
toTitleCase !
cansavvy Feb 10, 2020
575a69d
Update file name in bash script
cansavvy Feb 10, 2020
34412b2
Merge branch 'reog-chr-insta-2' into reorg-chr-inst-3
cansavvy Feb 10, 2020
234dd89
update plots and refresh notebook
cansavvy Feb 10, 2020
36066e7
Add the two doc suggestions from @jashapiro
cansavvy Feb 10, 2020
815c061
Merge branch 'reog-chr-insta-2' into reorg-chr-inst-3
cansavvy Feb 10, 2020
936958b
refresh notebooks and update bash script
cansavvy Feb 10, 2020
dd00106
Refresh notebooks and rearrange circleCI
cansavvy Feb 10, 2020
77f6c4c
circleCI rearrange to be above copy num
cansavvy Feb 10, 2020
4471d00
Turn off sample filter when in CI
cansavvy Feb 10, 2020
0ad96c7
Refresh everything
cansavvy Feb 10, 2020
56bfecd
Try to fix logic
cansavvy Feb 10, 2020
8891d26
Add an as.numeric to try to fix CircleCI option
cansavvy Feb 11, 2020
df3e455
Fix bash if
cansavvy Feb 11, 2020
3f25a80
Merge remote-tracking branch 'upstream/master' into reorg-chr-inst-3
cansavvy Feb 11, 2020
e8eb2f5
Merge branch 'master' into reorg-chr-inst-3
cansavvy Feb 11, 2020
a1b4006
Add @jashapiro 's suggested changes and refresh nb
cansavvy Feb 11, 2020
9e85712
Merge remote-tracking branch 'origin/reorg-chr-inst-3' into reorg-chr…
cansavvy Feb 11, 2020
b0976f5
Add set -e to bash script
cansavvy Feb 11, 2020
9787ea8
Switch to PDFs to bypass PNG issue
cansavvy Feb 11, 2020
89f6988
Upate y-axis CDF label
cansavvy Feb 11, 2020
0e548f4
Fix log transformation
cansavvy Feb 11, 2020
9097bed
Change CDF plot to use stat_ecdf
cansavvy Feb 12, 2020
9dbf26c
Bring back zeroes for intersection data
cansavvy Feb 12, 2020
2096740
Refreshed notebook
cansavvy Feb 12, 2020
42f8020
Push most recent notebook and plot
cansavvy Feb 12, 2020
a51db40
Functionalize and re-run fixed CDF plot
cansavvy Feb 12, 2020
b28056f
Update some docs
cansavvy Feb 12, 2020
3aa00cd
Merge branch 'master' into reorg-chr-inst-3
cansavvy Feb 12, 2020
cdea6d6
Fix column name thing that I changed elsewhere
cansavvy Feb 12, 2020
6af1808
Merge remote-tracking branch 'origin/reorg-chr-inst-3' into reorg-chr…
cansavvy Feb 12, 2020
4a46e3c
Address @jashapiro comments
cansavvy Feb 12, 2020
a7ad558
Fix outdated "median" comment.
cansavvy Feb 12, 2020
c4c938e
Push NA vs 0 fix
cansavvy Feb 13, 2020
f35c8a8
Re-run everything after making the fixes
cansavvy Feb 13, 2020
2057202
Suppress some NA messages
cansavvy Feb 13, 2020
39d0320
Add uncalled file to options
jashapiro Feb 14, 2020
1268fba
Make adjustment to unsurveyed samples per @jashapiro 's suggestion
cansavvy Feb 14, 2020
cdd42af
Merge branch 'master' into reorg-chr-inst-3
cansavvy Feb 14, 2020
751d8ba
Update to handle surveyed samples
jashapiro Feb 14, 2020
e992d8b
Rerun with latest changes
jashapiro Feb 14, 2020
a8d9c50
Merge remote and rerun
jashapiro Feb 14, 2020
795c612
Merge branch 'master' into reorg-chr-inst-3
jashapiro Feb 14, 2020
9d2e90d
Merge branch 'master' into reorg-chr-inst-3
cansavvy Feb 17, 2020
27c6440
Merge branch 'master' into reorg-chr-inst-3
cansavvy Feb 17, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 4 additions & 4 deletions .circleci/config.yml
Original file line number Diff line number Diff line change
Expand Up @@ -107,6 +107,10 @@ jobs:
name: Tumor mutation burden with TCGA
command: ./scripts/run_in_ci.sh Rscript -e "rmarkdown::render('analyses/tmb-compare-tcga/compare-tmb.Rmd', clean = TRUE)"

- run:
name: Chromosomal instability breakpoints
command: OPENPBTA_TESTING=1 ./scripts/run_in_ci.sh bash analyses/chromosomal-instability/run_breakpoint_analysis.sh

- run:
name: Copy number consensus
command: ./scripts/run_in_ci.sh bash "analyses/copy_number_consensus_call/run_consensus_call.sh"
Expand Down Expand Up @@ -147,10 +151,6 @@ jobs:
name: Gene set enrichment analysis to generate GSVA scores
command: OPENPBTA_TESTING=1 ./scripts/run_in_ci.sh bash "analyses/gene-set-enrichment-analysis/run-gsea.sh"

- run:
name: Chromosomal instability breakpoints
command: ./scripts/run_in_ci.sh bash analyses/chromosomal-instability/run_breakpoint_analysis.sh

- run:
name: Fusion Summary
command: OPENPBTA_TESTING=1 ./scripts/run_in_ci.sh bash "analyses/fusion-summary/run-new-analysis.sh"
Expand Down
90 changes: 53 additions & 37 deletions analyses/chromosomal-instability/00-setup-breakpoint-data.R
Original file line number Diff line number Diff line change
Expand Up @@ -146,6 +146,13 @@ option_list <- list(
effectively surveyed regions of the genome for the WXS samples.",
metavar = "character"
),
make_option(
opt_str = "--uncalled_samples", type = "character", default = "none",
help = "Relative file path (assuming from top directory of
'OpenPBTA-analysis') that specifies the regions that were uncalled in CNV
analysis.",
metavar = "character"
),
make_option(
opt_str = "--gap", default = 0,
help = "An integer that indicates how many base pairs away a CNV and SV
Expand All @@ -172,7 +179,7 @@ opt$surveyed_wxs <- file.path(root_dir, opt$surveyed_wxs)

########### Check that the files we need are in the paths specified ############
needed_files <- c(
opt$cnv_seg, opt$sv, opt$metadata, opt$surveyed_wgs, opt$surveyed_wxs
opt$cnv_seg, opt$sv, opt$metadata, opt$surveyed_wgs, opt$surveyed_wxs, opt$uncalled
)

# Get list of which files were found
Expand Down Expand Up @@ -222,6 +229,8 @@ cnv_df <- data.table::fread(opt$cnv_seg, data.table = FALSE) %>%
# Changing these so they end up matching the SV data
dplyr::rename(start = loc.start, end = loc.end)

# Obtain full list of samples
cnv_samples <- unique(as.character(cnv_df$samples))

# Filter CNV data to only the changes that are larger than our cutoff `ch.pct`.
cnv_filtered_df <- cnv_df %>%
Expand All @@ -238,6 +247,10 @@ sv_df <- data.table::fread(opt$sv, data.table = FALSE) %>%
`23` = "X", `24` = "Y"
)
)

# Obtain full list of samples
sv_samples <- unique(sv_df$Kids.First.Biospecimen.ID.Tumor)

####################### Drop Sex Chr if option is on ###########################
if (opt$drop_sex){
sv_df <- sv_df %>%
Expand All @@ -247,15 +260,6 @@ if (opt$drop_sex){
dplyr::filter(!(chrom %in% c("X", "Y", "M")))
}
#################### Format the data as chromosomes breaks #####################
# Only keep samples for which there are both SV and CNV data
common_samples <- dplyr::intersect(
unique(cnv_df$ID),
unique(sv_df$Kids.First.Biospecimen.ID.Tumor)
)

# TODO: after updating to consensus CNV data, evaluate whether sex chromosomes should still be removed.
# Make an CNV breaks data.frame.

# Make a CNV data.frame that has the breaks
cnv_breaks <- data.frame(
samples = rep(cnv_filtered_df$ID, 2),
Expand All @@ -265,12 +269,8 @@ cnv_breaks <- data.frame(
copy.num = rep(cnv_filtered_df$copy.num, 2),
stringsAsFactors = FALSE
) %>%
# Remove sex chromosomes
dplyr::filter(
!(chrom %in% c("X", "Y")),
# Only keep samples that have both CNV and SV data
samples %in% common_samples
)
# Remove NAs
dplyr::filter(!is.na(chrom))

# Make an SV breaks data.frame.
sv_breaks <- data.frame(
Expand All @@ -280,20 +280,16 @@ sv_breaks <- data.frame(
svclass = rep(sv_df$SV.type, 2),
stringsAsFactors = FALSE
) %>%
# Remove sex chromosomes and NAs
dplyr::filter(
!(chrom %in% c("X", "Y", "M", NA)),
# Only keep samples that have both CNV and SV data
samples %in% common_samples
)
# Remove NAs
dplyr::filter(!is.na(chrom))

# Remake this in case some samples' data got filtered out in the process
######################### Create intersection of breaks ########################
# Get list of samples for which there are both SV and CNV data
common_samples <- dplyr::intersect(
unique(cnv_breaks$samples),
unique(sv_breaks$samples)
)

######################### Create intersection of breaks ########################
# Make an intersection of breaks data.frame.
intersection_of_breaks <- lapply(common_samples,
intersect_cnv_sv, # Special intersect function
Expand All @@ -304,9 +300,11 @@ intersection_of_breaks <- lapply(common_samples,
# Bring along sample names
names(intersection_of_breaks) <- common_samples

# Collaps into a data.frame
# Collapse into a data.frame
intersection_of_breaks <- dplyr::bind_rows(intersection_of_breaks,
.id = "samples")
.id = "samples") %>%
# Only need one chromosome
dplyr::select(chrom = sv_ranges.seqnames, dplyr::everything(), -cnv_ranges.seqnames)

# Put all the breaks into a list.
breaks_list <- list(
Expand All @@ -332,33 +330,52 @@ wxs_size <- sum(bed_wxs[, 3] - bed_wxs[, 2])
wgs_size <- as.numeric(wgs_size)
wxs_size <- as.numeric(wxs_size)

# Read in uncalled samples from consensus module
uncalled_samples <- readr::read_tsv(opt$uncalled)$sample

# Get vector of all samples
all_samples <- unique(c(cnv_samples, sv_samples, uncalled_samples))


# Set up the metadata
metadata <- readr::read_tsv(opt$metadata) %>%
metadata <- readr::read_tsv(opt$metadata, guess_max = 10000) %>%
# Isolate metadata to only the samples that are in the datasets.
dplyr::filter(Kids_First_Biospecimen_ID %in% common_samples) %>%
dplyr::filter(Kids_First_Biospecimen_ID %in% all_samples) %>%
# Keep the columns to only the experimental strategy and the biospecimen ID
dplyr::select(Kids_First_Biospecimen_ID, experimental_strategy) %>%
# For an easier time matching to our breaks data.frames, lets just rename this.
dplyr::rename(samples = Kids_First_Biospecimen_ID)
dplyr::rename(samples = Kids_First_Biospecimen_ID) %>%
# samples not in cnv_samples were were noisy or otherwise bad data
dplyr::mutate(surveyed = samples %in% cnv_samples)


# Calculate the breaks density for each data.frame
breaks_density_list <- lapply(breaks_list, function(breaks_df) {
# Calculate the breaks density
breaks_df %>%
# Tack on the experimental strategy
dplyr::inner_join(metadata) %>%
# Tack on the experimental strategy and samples with no counts
dplyr::full_join(metadata) %>%
# Recode using the BED range sizes
dplyr::mutate(genome_size = dplyr::recode(experimental_strategy,
"WGS" = wgs_size,
"WXS" = wxs_size
)) %>%
dplyr::group_by(
samples, experimental_strategy, genome_size
samples, experimental_strategy, genome_size, surveyed
) %>%
# Count number of mutations for that sample
dplyr::summarize(breaks_count = dplyr::n()) %>%
# Calculate breaks density
dplyr::mutate(breaks_density = breaks_count / (genome_size / 1000000))
# Count number of mutations for that sample but find out if it is NA
dplyr::summarize(is_na = any(is.na(chrom)),
breaks_count = dplyr::n()) %>%
# Calculate breaks density, but put NA for breaks_count if the sample was
# dropped from CNV consensus analysis
dplyr::mutate(breaks_count = dplyr::case_when(
!is_na ~ as.numeric(breaks_count),
is_na & surveyed ~ as.numeric(0),
TRUE ~ as.numeric(NA)
),
breaks_density = breaks_count / (genome_size / 1000000)) %>%
# Drop the is_na column, we only needed if for recoding
dplyr::select(-is_na, -surveyed)
})

# Write the break densities each as their own files
Expand All @@ -369,4 +386,3 @@ purrr::imap(breaks_density_list, function(.x, name = .y) {
file.path(opt$output, paste0(name, "_densities.tsv"))
)
})

Original file line number Diff line number Diff line change
Expand Up @@ -191,7 +191,6 @@ We'll go by SV data here.
# For simplicity's sake, change the name of the intersection column we want to use
breaks_list$intersection_of_breaks <- breaks_list$intersection_of_breaks %>%
dplyr::rename(
chrom = sv_ranges.seqnames,
coord = sv_ranges.start
)
```
Expand Down
Loading