This repository has been archived by the owner on Jun 21, 2023. It is now read-only.
Commit
Merge remote-tracking branch ‘Alex/master’ into collapse-rnaseq
Showing 31 changed files with 2,865,821 additions and 214 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
## Compare CNV callers | ||
|
||
This analysis is **DEPRECATED**. | ||
It was designed to compare results from CNVkit and ControlFreeC when both methods produced SEG files and when the following [was noted in the README](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/0c2d0d25c01dcbbbd63f94b064a69afc9dc44ea8#data-caveats): | ||
|
||
> We noticed ControlFreeC does not properly handle aneuploidy well for a subset of samples in that it calls the entire genome gained. | ||
As of `release-v7-20191031`, the CNV files are in two different formats (see: [CNVkit format](https://cnvkit.readthedocs.io/en/stable/fileformats.html) and [ControlFreeC format](https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/doc/format/controlfreec-tsv.md)). | ||
|
||
Consensus copy number calls are tracked here: https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/128 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,71 @@ | ||
# Author: Komal S. Rathi | ||
# Date: 11/09/2019 | ||
# Function: Merge RSEM files and split into polya and stranded | ||
|
||
# load libraries | ||
suppressPackageStartupMessages(library(optparse)) | ||
suppressPackageStartupMessages(library(dplyr)) | ||
suppressPackageStartupMessages(library(reshape2)) | ||
|
||
# read params | ||
option_list <- list( | ||
make_option(c("-i", "--inputdir"), type = "character", | ||
help = "Input directory for RSEM files"), | ||
make_option(c("-c", "--clinical"), type = "character", | ||
help = "Histology file (.TSV)"), | ||
make_option(c("-m", "--manifest"), type = "character", | ||
help = "Manifest file (.csv)"), | ||
make_option(c("-o", "--outdir"), type = "character", | ||
help = "Path to output directory") | ||
) | ||
|
||
# parse the parameters | ||
opt <- parse_args(OptionParser(option_list = option_list)) | ||
topDir <- opt$inputdir | ||
clin <- opt$clinical | ||
manifest <- opt$manifest | ||
outdir <- opt$outdir | ||
|
||
# read manifest file | ||
manifest <- read.csv(manifest, stringsAsFactors = F) | ||
manifest <- manifest[,c("Kids.First.Biospecimen.ID","name")] | ||
manifest$name <- gsub('[.].*', '', manifest$name) | ||
|
||
# read histology file and split into polyA and stranded | ||
clin <- read.delim(clin, stringsAsFactors = F) | ||
polya <- clin %>% | ||
filter(experimental_strategy == "RNA-Seq" & RNA_library == "poly-A") %>% | ||
as.data.frame() | ||
|
||
stranded <- clin %>% | ||
filter(experimental_strategy == "RNA-Seq" & RNA_library == "stranded") %>% | ||
as.data.frame() | ||
|
||
# read and merge RSEM genes files | ||
lfiles <- list.files(path = topDir, pattern = "[.]rsem[.]genes[.]results[.]gz$", recursive = TRUE, full.names = TRUE) | ||
read.rsem <- function(x){ | ||
print(x) | ||
dat <- data.table::fread(x) | ||
filename <- gsub('.*[/]|[.]rsem[.]genes[.]results[.]gz$','',x) | ||
sample.id <- manifest[which(manifest$name %in% filename),'Kids.First.Biospecimen.ID'] | ||
dat$Sample <- sample.id | ||
return(dat) | ||
} | ||
expr <- lapply(lfiles, read.rsem) | ||
expr <- data.table::rbindlist(expr) | ||
expr.fpkm <- dcast(expr, gene_id~Sample, value.var = 'FPKM') # FPKM | ||
expr.counts <- dcast(expr, gene_id~Sample, value.var = 'expected_count') # counts | ||
|
||
# split into polya and stranded matrices, keeping gene_id as the first column | ||
polya.fpkm <- expr.fpkm[,c("gene_id", polya$Kids_First_Biospecimen_ID)] | ||
stranded.fpkm <- expr.fpkm[,c("gene_id", stranded$Kids_First_Biospecimen_ID)] | ||
polya.counts <- expr.counts[,c("gene_id", polya$Kids_First_Biospecimen_ID)] | ||
stranded.counts <- expr.counts[,c("gene_id", stranded$Kids_First_Biospecimen_ID)] | ||
|
||
# save output | ||
saveRDS(polya.fpkm, file = paste0(outdir,'/pbta-gene-expression-rsem-fpkm.polya.rds')) | ||
saveRDS(polya.counts, file = paste0(outdir, '/pbta-gene-counts-rsem-expected_count.polya.rds')) | ||
saveRDS(stranded.fpkm, file = paste0(outdir, '/pbta-gene-expression-rsem-fpkm.stranded.rds')) | ||
saveRDS(stranded.counts, file = paste0(outdir, '/pbta-gene-counts-rsem-expected_count.stranded.rds')) | ||
print("Done!") | ||
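
The file-to-sample lookup in `read.rsem` can be checked in isolation. The following is a minimal sketch on made-up data, assuming the manifest `name` column holds the RSEM file name; `rsem_path_to_id` is a hypothetical helper that is not part of this commit. It uses `basename()` and an escaped, anchored pattern so that only the literal `.rsem.genes.results.gz` suffix is stripped.

```r
# Toy manifest: column names mirror the script, values are invented.
manifest <- data.frame(
  Kids.First.Biospecimen.ID = c("BS_AAA", "BS_BBB"),
  name = c("file1.rsem.genes.results.gz", "file2.rsem.genes.results.gz"),
  stringsAsFactors = FALSE
)
# Same normalization as the script: drop everything from the first dot onward.
manifest$name <- gsub("[.].*", "", manifest$name)

# Hypothetical helper: map RSEM file paths to biospecimen IDs.
rsem_path_to_id <- function(path, manifest) {
  # basename() drops the directory; the anchored regex drops the literal suffix.
  filename <- sub("[.]rsem[.]genes[.]results[.]gz$", "", basename(path))
  # match() returns NA for files that are missing from the manifest.
  manifest$Kids.First.Biospecimen.ID[match(filename, manifest$name)]
}

ids <- rsem_path_to_id(
  c("data/run1/file2.rsem.genes.results.gz", "data/file1.rsem.genes.results.gz"),
  manifest
)
print(ids)
```

Unlike the `which(... %in% ...)` lookup in the script, `match()` yields `NA` for an unknown file, which may be easier to detect than a zero-length result.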
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,41 @@ | ||
--- | ||
title: "Summary of collapsed symbols" | ||
output: html_notebook | ||
--- | ||
|
||
```{r} | ||
# load libraries | ||
library(tidyverse) | ||
library(reshape2) | ||
library(refGenome) | ||
# actual code | ||
print("Generating input matrix...!") | ||
expr <- readRDS(input.dat) | ||
# reduce dataframe | ||
expr <- expr[which(rowSums(expr[,2:ncol(expr)]) > 0),] # remove all genes with no expression | ||
expr <- expr[grep('_PAR_', expr$gene_id, invert = TRUE),] # discard gene IDs containing _PAR_ (pseudoautosomal region entries) | ||
# collapse to matrix of HUGO symbols x Sample identifiers | ||
# take mean per row and use the max value for duplicated gene symbols | ||
expr.collapsed <- expr %>% | ||
separate(gene_id, c("gene_id", "gene_symbol"), sep = "\\_", extra = "merge") %>% | ||
pivot_longer(-c(gene_id, gene_symbol), | ||
names_to = "sample_name", values_to = "fpkm") %>% | ||
group_by(gene_id) %>% | ||
mutate(means = mean(fpkm)) %>% | ||
group_by(gene_symbol) %>% | ||
filter(means == max(means)) %>% | ||
select(-gene_id) %>% unique() | ||
# matrix of HUGO symbols x Sample identifiers | ||
expr.input <- acast(expr.collapsed, gene_symbol~sample_name, value.var = 'fpkm') | ||
print(dim(expr.input)) | ||
# save matrix | ||
saveRDS(object = expr.input, file = output.mat) | ||
print("Matrix generated. Done!!") | ||
``` | ||
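
The collapsing rule in the chunk above (take the mean per gene across samples, then keep the gene with the highest mean for each duplicated symbol) can be illustrated on a toy table. This is a base R sketch under stated assumptions, not the notebook's code: the gene IDs and values are invented, and ties are resolved by keeping the first row, whereas the `filter(means == max(means))` step in the notebook keeps every tied row.

```r
# Toy expression table; gene_id follows the "<ensembl id>_<symbol>" pattern.
expr <- data.frame(
  gene_id = c("ENSG01_TP53", "ENSG02_TP53", "ENSG03_MYC"),
  S1 = c(1, 10, 5),
  S2 = c(3, 20, 7),
  stringsAsFactors = FALSE
)

# Split on the first underscore only, as separate(..., extra = "merge") does.
gene_symbol <- sub("^[^_]*_", "", expr$gene_id)

# Mean expression of each gene across samples.
gene_means <- rowMeans(expr[, -1, drop = FALSE])

# For each symbol, keep the row index with the largest mean.
keep <- unlist(lapply(
  split(seq_len(nrow(expr)), gene_symbol),
  function(idx) idx[which.max(gene_means[idx])]
))

# Matrix of gene symbols x sample identifiers.
collapsed <- as.matrix(expr[keep, -1, drop = FALSE])
rownames(collapsed) <- gene_symbol[keep]
print(collapsed)
```

Here the two `TP53` rows collapse to the one with the higher mean (`ENSG02_TP53`), while `MYC` passes through unchanged.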
|