Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Commit

Permalink
(Comparative RNA-Seq Analysis) Apply sample filters (QC + tumor statu…
Browse files Browse the repository at this point in the history
…s); convert scratch files to .feather (#544)

* update README (TODO: proofread markdown)

* implement pull 3: sample filters and general cleanup

- add tsv and feather IO functions to utils

- add sample filters to step 01: check MEND qc status, and use pbta-histologies to filter to only tumor samples
- For dropped genes, note if dropped by Expression filter or Variance filter
- all internal .rds files converted to .feather
- all path handling in step 01 moved to main()

* [#323] update CI file, dockerfile, results file

for comparative-rnaseq-analysis:
- update command lines in CI file; add dependency to dockerfile;
add new version of outlier results tsv.gz file with sample filters applied.

* [#229] tidy up README markdown.

* [#229] analyses/README update comparative-rna-seq

* [#229] check MEND QC files to confirm parsed ok

Per jashapiro's comments, double-check that MEND QC files parsed into PASS or FAIL and
crash if they didn't.
Also call out any samples that didn't have MEND QC files (these will be treated as
fails and not included in the dataset).

* [#229] maintain consistent column order in outlier results tsv

Due to filtering columns on a set() in step 01 filter_samples, the column order of intermediate .feather files and therefore outlier results file changed between script reruns. To improve reproducibility, pin column order (ie samples order) to the order of the original .Rds expression matrix.

Includes new results file which has data identical to previous, but columns in correct order.
  • Loading branch information
e-t-k authored Feb 20, 2020
1 parent 8a49217 commit 87386d2
Show file tree
Hide file tree
Showing 8 changed files with 219 additions and 81 deletions.
6 changes: 3 additions & 3 deletions .circleci/config.yml
Original file line number Diff line number Diff line change
Expand Up @@ -129,15 +129,15 @@ jobs:

- run:
name: Comparative RNASeq - generate correlation matrix - rsem-tpm.polya
command: ./scripts/run_in_ci.sh python3 analyses/comparative-RNASeq-analysis/01-correlation-matrix.py ../../data/pbta-gene-expression-rsem-tpm.polya.rds --output-prefix rsem-tpm-polya- --verbose
command: ./scripts/run_in_ci.sh python3 analyses/comparative-RNASeq-analysis/01-correlation-matrix.py ../../data/pbta-gene-expression-rsem-tpm.polya.rds --clinical-path ../../data/pbta-histologies.tsv --qc-manifest-path ../../data/pbta-mend-qc-manifest.tsv --qc-results-path ../../data/pbta-mend-qc-results.tar.gz --prefix rsem-tpm-polya- --verbose

- run:
name: Comparative RNASeq - generate correlation matrix - rsem-tpm.stranded
command: ./scripts/run_in_ci.sh python3 analyses/comparative-RNASeq-analysis/01-correlation-matrix.py ../../data/pbta-gene-expression-rsem-tpm.stranded.rds --output-prefix rsem-tpm-stranded- --verbose
command: ./scripts/run_in_ci.sh python3 analyses/comparative-RNASeq-analysis/01-correlation-matrix.py ../../data/pbta-gene-expression-rsem-tpm.stranded.rds --clinical-path ../../data/pbta-histologies.tsv --qc-manifest-path ../../data/pbta-mend-qc-manifest.tsv --qc-results-path ../../data/pbta-mend-qc-results.tar.gz --prefix rsem-tpm-stranded- --verbose

- run:
name: Comparative RNASeq - generate thresholds and outliers - rsem-tpm.stranded
command: ./scripts/run_in_ci.sh python3 analyses/comparative-RNASeq-analysis/02-thresholds-and-outliers.py --prefix rsem-tpm-stranded- --verbose
command: ./scripts/run_in_ci.sh python3 analyses/comparative-RNASeq-analysis/02-thresholds-and-outliers.py --prefix rsem-tpm-stranded- --results results --verbose

- run:
name: Process SV file
Expand Down
9 changes: 6 additions & 3 deletions Dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -257,9 +257,6 @@ RUN apt-get update -qq && apt-get -y --no-install-recommends install \
# package required for shatterseek
RUN R -e "withr::with_envvar(c(R_REMOTES_NO_ERRORS_FROM_WARNINGS='true'), remotes::install_github('parklab/ShatterSeek', ref = '83ab3effaf9589cc391ecc2ac45a6eaf578b5046', dependencies = TRUE))"

#### Please install your dependencies here
#### Add a comment to indicate what analysis it is required for

# MATLAB Compiler Runtime is required for GISTIC, MutSigCV
# Install steps are adapted from usuresearch/matlab-runtime
# https://hub.docker.com/r/usuresearch/matlab-runtime/dockerfile
Expand Down Expand Up @@ -293,4 +290,10 @@ RUN mkdir -p gistic_install && \
RUN chown -R rstudio:rstudio /home/rstudio/gistic_install
RUN chmod 755 /home/rstudio/gistic_install

# pyarrow for comparative-RNASeq-analysis, to read/write .feather files
RUN pip3 install "pyarrow==0.16.0"

#### Please install your dependencies here
#### Add a comment to indicate what analysis it is required for

WORKDIR /rocker-build/
2 changes: 1 addition & 1 deletion analyses/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,7 @@ Note that _nearly all_ modules use the harmonized clinical data file (`pbta-hist
| [`cnv-chrom-plot`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/cnv-chrom) | `pbta-cnv-consensus-gistic.zip` <br> `analyses/copy_number_consensus_call/results/pbta-cnv-consensus.seg` | Makes plots from GISTIC output as well as `seg.mean` plots by histology group | N/A
| [`cnv-comparison`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/cnv-comparison) | Earlier version of SEG files | *Deprecated*; compared earlier version of the CNV methods. | N/A
| [`collapse-rnaseq`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/collapse-rnaseq) | `pbta-gene-expression-rsem-fpkm.polya.rds` <br> `pbta-gene-expression-rsem-fpkm.stranded.rds` <br> `gencode.v27.primary_assembly.annotation.gtf.gz` | Collapses RSEM FPKM matrices such that gene symbols are de-duplicated. | `results/pbta-gene-expression-rsem-fpkm-collapsed.polya.rds` <br> `results/pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds` (included in data download; too large for tracking via GitHub)
| [`comparative-RNASeq-analysis`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/comparative-RNASeq-analysis) | `pbta-gene-expression-rsem-tpm.polya.rds` <br> `pbta-gene-expression-rsem-tpm.stranded.rds` | *In progress*; will produce expression outlier profiles per [#229](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/229) | N/A |
| [`comparative-RNASeq-analysis`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/comparative-RNASeq-analysis) | `pbta-gene-expression-rsem-tpm.polya.rds` <br> `pbta-gene-expression-rsem-tpm.stranded.rds` <br> `pbta-histologies.tsv` <br> `pbta-mend-qc-manifest.tsv` <br> `pbta-mend-qc-results.tar.gz` | *In progress*; will produce expression outlier profiles per [#229](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/229) | N/A |
| [`copy_number_consensus_call`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/copy_number_consensus_call) | `pbta-cnv-cnvkit.seg.gz` <br> `pbta-cnv-controlfreec.tsv.gz` <br> `pbta-sv-manta.tsv.gz` | Produces consensus copy number calls per [#128](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/128) and a set of excluded regions where CNV calls are not made | `results/cnv_consensus.tsv` <br> `results/pbta-cnv-consensus.seg.gz` <br> `ref/cnv_excluded_regions.bed` <br> `ref/cnv_callable.bed`
| [`create-subset-files`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/create-subset-files) | All files | This module contains the code to create the subset files used in continuous integration | All subset files for continuous integration
| [`focal-cn-file-preparation`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/focal-cn-file-preparation) | `pbta-cnv-cnvkit.seg.gz` <br> `pbta-cnv-controlfreec.tsv.gz` <br> `pbta-gene-expression-rsem-fpkm-collapsed.polya.rds` <br> `pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds` <br> `analyses/copy_number_consensus_call/results/pbta-cnv-consensus.seg.gz` | Maps from copy number variant caller segments to gene identifiers; will be updated to take into account changes that affect entire cytobands, chromosome arms ([#186](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/186))| `results/cnvkit_annotated_cn_autosomes.tsv.gz` <br> `results/cnvkit_annotated_cn_x_and_y.tsv.gz` <br> `results/controlfreec_annotated_cn_autosomes.tsv.gz` <br> `results/controlfreec_annotated_cn_x_and_y.tsv.gz` <br> `results/consensus_seg_annotated_cn_autosomes.tsv.gz` <br> `results/consensus_seg_annotated_cn_x_and_y.tsv.gz`
Expand Down
174 changes: 135 additions & 39 deletions analyses/comparative-RNASeq-analysis/01-correlation-matrix.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@

"""
Create correlation matrices for polyA samples and ribodeplete (stranded) samples using gene expression data.
Filter polyA and ribodeplete (stranded) samples to only tumor samples that are MEND QC pass.
Then, create correlation matrices for these samples using gene expression data.
The correlations will be calculated using pairwise Spearman correlation of the RNA-Seq gene expression profiles.
The gene expression profile data will be filtered using a developed method described in Vaske et al.
Expand All @@ -11,34 +12,112 @@
import sys
import argparse
import math
import tarfile
import numpy as np
import pandas as pd
import scipy.stats
import sklearn.metrics.pairwise as sklp
import utils

def prepare_expression(input_matrix,
scratch_dir,
def extract_sample_qc_status(qc_file_handle, file_label):
"""Parses a MEND QC results file handle to "PASS" or "FAIL".
QC results parsing to something else (or failing to parse) indicate a major
failure in the QC script and should interrupt analysis.
The qc_file_handle is as provided by TarFile.extractfile, ie, in raw bytes"""
# Expected File format (example. Has PASS\n or FAIL\n at end of second line
# input uniqMappedNonDupeReadCount estExonicUniqMappedNonDupeReadCount qc\n
# sample1.readDist.txt 22333444.5 10999888.765 PASS\n
status = qc_file_handle.readlines()[1][-5:-1].decode("utf-8")
if not status in ["PASS", "FAIL"]:
print("ERROR: MEND QC file {} parsed status as '{}' - expected PASS or FAIL!".format(file_label, status))
raise ValueError
return status

def get_qc_status(mend_results_path, mend_manifest_path):
"""Takes: mend results tgz containing bam_umend_qc.tsv files, and
manifest containing mapping from filename to sample id.
returns dictionary from sample id to mend status: PASS or FAIL"""
mend_tgz = tarfile.open(mend_results_path)
qc_files = (i for i in mend_tgz.getmembers() if i.name.endswith("bam_umend_qc.tsv"))
# Dictionary of filename (UUID.bam_umend_qc.tsv) to PASS or FAIL string
filename_map = { i.name : extract_sample_qc_status(mend_tgz.extractfile(i), i.name) for i in qc_files }
# Find sample ID entries in manifest and map them via filename.
manifest = utils.read_tsv(mend_manifest_path)

# If a QC result file is not listed in the manifest, this throws KeyError
# This would indicate major failure in the QC script and should interrupt analysis.
return { manifest.loc[k]["Kids.First.Biospecimen.ID"] : v for k, v in filename_map.items() }

def filter_samples(input_matrix,
mend_results_path,
mend_manifest_path,
clinical_path,
nofilter=False,
verbose=False):
"""Filters samples by QC and tumor status.
Takes path to samples expression file; tgz of MEND QC results;
manifest of MEND QC files to sample ids; and clinical data file.
Returns data frame of sample expression filtered to contain only
MEND QC pass tumor samples. Filters are skippable via nofilter if,
for example, clinical and qc files aren't available."""
print_v = print if verbose else lambda *a, **k: None

print_v("Loading sample matrix {}".format(input_matrix))
raw_samples = utils.read_rds(input_matrix)

if nofilter:
print_v("Skipped the QC and tumor status filters; using all samples.")
return raw_samples

# Filter to only sample ids with a MEND QC status of PASS
# Samples not mentioned in the qc_status (ie, no qc result) will be dropped by the filter
qc_status = get_qc_status(mend_results_path, mend_manifest_path)
qc_pass_ids= [k for k, v in qc_status.items() if v == "PASS"]
qc_pass_ids_in_dataset = set(raw_samples.columns).intersection(qc_pass_ids)
print_v("QC filter: dropped {} samples that were not MEND QC PASS.".format(
len(raw_samples.columns) - len(qc_pass_ids_in_dataset)))

# Call out any sample ids that had no corresponding QC file, but ok to continue even if some are present
samples_without_qc_results = set(raw_samples.columns) - set(qc_status.keys())
if len(samples_without_qc_results):
print("WARNING: {} samples were dropped due to missing MEND QC results. Sample IDs: {}".format(
len(samples_without_qc_results),
", ".join(samples_without_qc_results)))

# Further filter to only tumor RNA-Seq sample ids
clinical = utils.read_tsv(clinical_path)
tumor_ids = clinical[(clinical["sample_type"] == "Tumor")
& (clinical["composition"] == "Solid Tissue")
& (clinical["experimental_strategy"] == "RNA-Seq")
].index
qc_pass_tumor_ids_in_dataset = qc_pass_ids_in_dataset.intersection(tumor_ids)
print_v("Tumor filter: dropped {} samples that were not solid tumor RNA-Seq.".format(
len(qc_pass_ids_in_dataset) - len(qc_pass_tumor_ids_in_dataset)))

# And apply filtered IDs to the original dataset, retaining the original column order
qc_pass_tumor_samples = raw_samples[[sid for sid in raw_samples.columns if sid in qc_pass_tumor_ids_in_dataset]]
return qc_pass_tumor_samples

def prepare_expression(raw_samples,
normalized_samples_path,
gene_filters_path,
proportion_unexpressed=0.8,
variance_filter_level=0.2,
verbose=False,
prefix=""):
"""Loads samples, converts to log2(TPM+1), and applies expression and variance filters.
returns dataframe of converted and filtered expression.
input_matrix = path to .rds file
writes file: scratch/{prefix}filtered_genes_to_keep.rds"""
verbose=False):
"""Converts sample matrix to log2(TPM+1) and applies expression and variance filters.
returns dataframe of converted and filtered expression; generates output feather files
stored at normalized_samples_path, gene_filters_path.
raw_samples = TPM expression dataframe.
Normalized samples file is a log2(TPM+1) expression matrix retaining all genes.
Gene filters file lists all genes and their filter status - either R = Retained,
E = dropped by the Expression filter, or V = dropped by the variance filter."""

print_v = print if verbose else lambda *a, **k: None
filtered_genelist_filepath = os.path.join(scratch_dir, "{}filtered_genes_to_keep.rds".format(prefix))

print_v("Loading samples {}".format(input_matrix))
raw_samples = utils.read_rds(input_matrix)

# Convert to log2(tpm+1)
samples = np.log2(raw_samples+1)
print_v("Writing normalized samples to {}".format(normalized_samples_path))
utils.write_rds(samples, normalized_samples_path)
utils.write_feather(samples, normalized_samples_path)

# run expression filter
# Remove any genes that have 0 expression in more samples than proportion_unexpressed
Expand All @@ -60,20 +139,24 @@ def prepare_expression(input_matrix,
expr_var_filtered_genelist = pd.DataFrame(variance.nlargest(keep_proportion).index)
expr_var_filtered_df = expression_filtered_df.filter(expr_var_filtered_genelist["gene_id"], axis=0)

utils.write_rds(expr_var_filtered_genelist, filtered_genelist_filepath)
# Save filter status by first by marking every gene as failed Expression filter;
# marking genes that passed as failed Variance filter, and again marking genes that passed
# that filter as Retained.
gene_filter_status = pd.DataFrame("E", index=samples.index, columns=["status"])
gene_filter_status["status"][expression_filtered_df.index] = "V"
gene_filter_status["status"][expr_var_filtered_df.index] = "R"

return expr_var_filtered_df
print_v("Writing gene filter status to {}".format(gene_filters_path))
utils.write_feather(gene_filter_status, gene_filters_path)

return expr_var_filtered_df

def calculate_correlation(expr_var_filtered_df, scratch_dir, verbose=False, prefix=""):
def calculate_correlation(expr_var_filtered_df, all_by_all_path, verbose=False):
"""Takes an expression matrix dataframe and calculates spearman correlations.
(spearman is a rank transformed pearson ('correlation') metric)
output file: scratch/{prefix}all_by_all_correlations.tsv"""

output file: scratch/{prefix}all_by_all_correlations.feather"""
print_v = print if verbose else lambda *a, **k: None

all_by_all_filepath = os.path.join(scratch_dir, "{}all_by_all_correlations.rds".format(prefix))

# Rank transform - values are now that rank of that gene within each sample
print_v("Rank transforming")
filtered_rank_transformed_df = np.apply_along_axis(scipy.stats.rankdata, 0, expr_var_filtered_df)
Expand All @@ -86,8 +169,8 @@ def calculate_correlation(expr_var_filtered_df, scratch_dir, verbose=False, pref
labels=expr_var_filtered_df.columns
all_by_all_df = pd.DataFrame(x_corr, index=labels, columns=labels)

print_v("Writing to file {}".format(all_by_all_filepath))
utils.write_rds(all_by_all_df, all_by_all_filepath)
print_v("Writing to file {}".format(all_by_all_path))
utils.write_feather(all_by_all_df, all_by_all_path)
return all_by_all_df

def main():
Expand All @@ -98,11 +181,15 @@ def main():

p = argparse.ArgumentParser()
p.add_argument("input_path", metavar="input-path", help="Path to input Rds file.")
p.add_argument("--clinical-path", help="Path to clinical data matrix tsv file.")
p.add_argument("--qc-manifest-path", help="Path to MEND QC manifest tsv file.")
p.add_argument("--qc-results-path", help="Path to tgz file of MEND QC results for each sample.")
p.add_argument("--verbose", action="store_true")
p.add_argument("--nofilter", action="store_true", help="Skip the sample QC and tumor status filters")
p.add_argument("--scratch",
default=os.path.join("..", "..", "scratch"),
help="Path to scratch dir.")
p.add_argument("--output-prefix", help="Prefix for output files.")
p.add_argument("--prefix", help="Prefix for output files.")
p.add_argument("--proportion-unexpressed",
default=0.8,
type=float,
Expand All @@ -114,21 +201,30 @@ def main():
args = p.parse_args()

# Use input basename as prefix if none was supplied
prefix = args.output_prefix or os.path.splitext(os.path.basename(args.input_path))[0]

# Output file path
normalized_samples_path = os.path.join(args.scratch, "{}log2-normalized.rds".format(prefix))

expression = prepare_expression(args.input_path,
args.scratch,
normalized_samples_path,
prefix=prefix,
verbose=args.verbose,
proportion_unexpressed=args.proportion_unexpressed,
variance_filter_level=args.variance_filter_level)
calculate_correlation(expression,
args.scratch,
prefix=prefix,
prefix = args.prefix or os.path.splitext(os.path.basename(args.input_path))[0]

# Output file paths
normalized_samples_path = os.path.join(args.scratch, "{}log2-normalized.feather".format(prefix))
gene_filters_path = os.path.join(args.scratch, "{}gene-filters.feather".format(prefix))
all_by_all_path = os.path.join(args.scratch, "{}all_by_all_correlations.feather".format(prefix))


# Load the expression file and filter samples to QC-pass tumors only
filtered_expression = filter_samples(args.input_path,
args.qc_results_path,
args.qc_manifest_path,
args.clinical_path,
nofilter=args.nofilter,
verbose=args.verbose)

normalized_expression = prepare_expression(filtered_expression,
normalized_samples_path,
gene_filters_path,
proportion_unexpressed=args.proportion_unexpressed,
variance_filter_level=args.variance_filter_level,
verbose=args.verbose)
calculate_correlation(normalized_expression,
all_by_all_path,
verbose=args.verbose)

if __name__ == "__main__":
Expand Down
Loading

0 comments on commit 87386d2

Please sign in to comment.