(Comparative RNA-Seq Analysis) Apply sample filters (QC + tumor statu…

…s); convert scratch files to .feather (#544) * update README (TODO: proofread markdown) * implement pull 3: sample filters and general cleanup - add tsv and feather IO functions to utils - add sample filters to step 01: check MEND qc status, and use pbta-histologies to filter to only tumor samples - For dropped genes, note if dropped by Expression filter or Variance filter - all internal .rds files converted to .feather - all path handling in step 01 moved to main() * [#323] update CI file, dockerfile, results file for comparative-rnaseq-analysis: - update command lines in CI file; add dependency to dockerfile; add new version of outlier results tsv.gz file with sample filters applied. * [#229] tidy up README markdown. * [#229] analyses/README update comparative-rna-seq * [#229] check MEND QC files to confirm parsed ok Per jashapiro's comments, double-check that MEND QC files parsed into PASS or FAIL and crash if they didn't. Also call out any samples that didn't have MEND QC files (these will be treated as fails and not included in the dataset). * [#229] maintain consistent column order in outlier results tsv Due to filtering columns on a set() in step 01 filter_samples, the column order of intermediate .feather files and therefore outlier results file changed between script reruns. To improve reproducibility, pin column order (ie samples order) to the order of the original .Rds expression matrix. Includes new results file which has data identical to previous, but columns in correct order.
AlexsLemonade · Feb 20, 2020 · 87386d2 · 87386d2
1 parent 8a49217
commit 87386d2
Show file tree

Hide file tree

Showing 8 changed files with 219 additions and 81 deletions.
diff --git a/.circleci/config.yml b/.circleci/config.yml
@@ -129,15 +129,15 @@ jobs:
 
       - run:
           name: Comparative RNASeq - generate correlation matrix - rsem-tpm.polya
-          command: ./scripts/run_in_ci.sh python3 analyses/comparative-RNASeq-analysis/01-correlation-matrix.py ../../data/pbta-gene-expression-rsem-tpm.polya.rds --output-prefix rsem-tpm-polya- --verbose
+          command: ./scripts/run_in_ci.sh python3 analyses/comparative-RNASeq-analysis/01-correlation-matrix.py ../../data/pbta-gene-expression-rsem-tpm.polya.rds --clinical-path ../../data/pbta-histologies.tsv --qc-manifest-path ../../data/pbta-mend-qc-manifest.tsv --qc-results-path ../../data/pbta-mend-qc-results.tar.gz --prefix rsem-tpm-polya- --verbose
 
       - run:
           name: Comparative RNASeq - generate correlation matrix - rsem-tpm.stranded
-          command: ./scripts/run_in_ci.sh python3 analyses/comparative-RNASeq-analysis/01-correlation-matrix.py ../../data/pbta-gene-expression-rsem-tpm.stranded.rds --output-prefix rsem-tpm-stranded- --verbose
+          command: ./scripts/run_in_ci.sh python3 analyses/comparative-RNASeq-analysis/01-correlation-matrix.py ../../data/pbta-gene-expression-rsem-tpm.stranded.rds --clinical-path ../../data/pbta-histologies.tsv --qc-manifest-path ../../data/pbta-mend-qc-manifest.tsv --qc-results-path ../../data/pbta-mend-qc-results.tar.gz --prefix rsem-tpm-stranded- --verbose
 
       - run:
           name: Comparative RNASeq - generate thresholds and outliers - rsem-tpm.stranded
-          command: ./scripts/run_in_ci.sh python3 analyses/comparative-RNASeq-analysis/02-thresholds-and-outliers.py --prefix rsem-tpm-stranded- --verbose
+          command: ./scripts/run_in_ci.sh python3 analyses/comparative-RNASeq-analysis/02-thresholds-and-outliers.py --prefix rsem-tpm-stranded- --results results --verbose
 
       - run:
           name: Process SV file

diff --git a/Dockerfile b/Dockerfile
@@ -257,9 +257,6 @@ RUN apt-get update -qq && apt-get -y --no-install-recommends install \
 # package required for shatterseek
 RUN R -e "withr::with_envvar(c(R_REMOTES_NO_ERRORS_FROM_WARNINGS='true'), remotes::install_github('parklab/ShatterSeek', ref = '83ab3effaf9589cc391ecc2ac45a6eaf578b5046', dependencies = TRUE))"
 
-#### Please install your dependencies here
-#### Add a comment to indicate what analysis it is required for
-
 # MATLAB Compiler Runtime is required for GISTIC, MutSigCV
 # Install steps are adapted from usuresearch/matlab-runtime
 # https://hub.docker.com/r/usuresearch/matlab-runtime/dockerfile
@@ -293,4 +290,10 @@ RUN mkdir -p gistic_install && \
 RUN chown -R rstudio:rstudio /home/rstudio/gistic_install
 RUN chmod 755 /home/rstudio/gistic_install
 
+# pyarrow for comparative-RNASeq-analysis, to read/write .feather files
+RUN pip3 install "pyarrow==0.16.0"
+
+#### Please install your dependencies here
+#### Add a comment to indicate what analysis it is required for
+
 WORKDIR /rocker-build/
diff --git a/analyses/README.md b/analyses/README.md
@@ -15,7 +15,7 @@ Note that _nearly all_ modules use the harmonized clinical data file (`pbta-hist
 | [`cnv-chrom-plot`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/cnv-chrom) | `pbta-cnv-consensus-gistic.zip` <br> `analyses/copy_number_consensus_call/results/pbta-cnv-consensus.seg` | Makes plots from GISTIC output as well as `seg.mean` plots by histology group  | N/A
 | [`cnv-comparison`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/cnv-comparison) | Earlier version of SEG files | *Deprecated*; compared earlier version of the CNV methods. | N/A
 | [`collapse-rnaseq`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/collapse-rnaseq) | `pbta-gene-expression-rsem-fpkm.polya.rds` <br> `pbta-gene-expression-rsem-fpkm.stranded.rds` <br> `gencode.v27.primary_assembly.annotation.gtf.gz` | Collapses RSEM FPKM matrices such that gene symbols are de-duplicated. | `results/pbta-gene-expression-rsem-fpkm-collapsed.polya.rds` <br> `results/pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds` (included in data download; too large for tracking via GitHub)
-| [`comparative-RNASeq-analysis`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/comparative-RNASeq-analysis) | `pbta-gene-expression-rsem-tpm.polya.rds` <br> `pbta-gene-expression-rsem-tpm.stranded.rds` | *In progress*; will produce expression outlier profiles per [#229](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/229) | N/A |
+| [`comparative-RNASeq-analysis`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/comparative-RNASeq-analysis) | `pbta-gene-expression-rsem-tpm.polya.rds` <br> `pbta-gene-expression-rsem-tpm.stranded.rds` <br> `pbta-histologies.tsv` <br> `pbta-mend-qc-manifest.tsv` <br> `pbta-mend-qc-results.tar.gz` | *In progress*; will produce expression outlier profiles per [#229](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/229) | N/A |
 | [`copy_number_consensus_call`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/copy_number_consensus_call) | `pbta-cnv-cnvkit.seg.gz` <br> `pbta-cnv-controlfreec.tsv.gz` <br> `pbta-sv-manta.tsv.gz` | Produces consensus copy number calls per [#128](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/128) and a set of excluded regions where CNV calls are not made | `results/cnv_consensus.tsv` <br> `results/pbta-cnv-consensus.seg.gz` <br> `ref/cnv_excluded_regions.bed` <br> `ref/cnv_callable.bed`
 | [`create-subset-files`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/create-subset-files) | All files | This module contains the code to create the subset files used in continuous integration | All subset files for continuous integration
 | [`focal-cn-file-preparation`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/focal-cn-file-preparation) | `pbta-cnv-cnvkit.seg.gz` <br> `pbta-cnv-controlfreec.tsv.gz` <br> `pbta-gene-expression-rsem-fpkm-collapsed.polya.rds` <br> `pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds` <br> `analyses/copy_number_consensus_call/results/pbta-cnv-consensus.seg.gz` | Maps from copy number variant caller segments to gene identifiers; will be updated to take into account changes that affect entire cytobands, chromosome arms ([#186](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/186))| `results/cnvkit_annotated_cn_autosomes.tsv.gz` <br> `results/cnvkit_annotated_cn_x_and_y.tsv.gz` <br> `results/controlfreec_annotated_cn_autosomes.tsv.gz` <br> `results/controlfreec_annotated_cn_x_and_y.tsv.gz` <br> `results/consensus_seg_annotated_cn_autosomes.tsv.gz` <br> `results/consensus_seg_annotated_cn_x_and_y.tsv.gz`

diff --git a/analyses/comparative-RNASeq-analysis/01-correlation-matrix.py b/analyses/comparative-RNASeq-analysis/01-correlation-matrix.py
@@ -1,6 +1,7 @@
 
 """
-Create correlation matrices for polyA samples and ribodeplete (stranded) samples using gene expression data.
+Filter polyA and ribodeplete (stranded) samples to only tumor samples that are MEND QC pass.
+Then, create correlation matrices for these samples using gene expression data.
 The correlations will be calculated using pairwise Spearman correlation of the RNA-Seq gene expression profiles.
 The gene expression profile data will be filtered using a developed method described in Vaske et al.
 
@@ -11,34 +12,112 @@
 import sys
 import argparse
 import math
+import tarfile
 import numpy as np
 import pandas as pd
 import scipy.stats
 import sklearn.metrics.pairwise as sklp
 import utils
 
-def prepare_expression(input_matrix,
-                       scratch_dir,
+def extract_sample_qc_status(qc_file_handle, file_label):
+    """Parses a MEND QC results file handle to "PASS" or "FAIL".
+    QC results parsing to something else (or failing to parse) indicate a major
+    failure in the QC script and should interrupt analysis.
+    The qc_file_handle is as provided by TarFile.extractfile, ie, in raw bytes"""
+    # Expected File format (example. Has PASS\n or FAIL\n at end of second line
+    # input	uniqMappedNonDupeReadCount	estExonicUniqMappedNonDupeReadCount	qc\n
+    # sample1.readDist.txt	22333444.5	10999888.765	PASS\n
+    status = qc_file_handle.readlines()[1][-5:-1].decode("utf-8")
+    if not status in ["PASS", "FAIL"]:
+        print("ERROR: MEND QC file {} parsed status as '{}' - expected PASS or FAIL!".format(file_label, status))
+        raise ValueError
+    return status
+
+def get_qc_status(mend_results_path, mend_manifest_path):
+    """Takes: mend results tgz containing bam_umend_qc.tsv files, and
+    manifest containing mapping from filename to sample id.
+    returns dictionary from sample id to mend status: PASS or FAIL"""
+    mend_tgz = tarfile.open(mend_results_path)
+    qc_files = (i for i in mend_tgz.getmembers() if i.name.endswith("bam_umend_qc.tsv"))
+    # Dictionary of filename (UUID.bam_umend_qc.tsv) to PASS or FAIL string
+    filename_map = { i.name : extract_sample_qc_status(mend_tgz.extractfile(i), i.name) for i in qc_files }
+    # Find sample ID entries in manifest and map them via filename.
+    manifest = utils.read_tsv(mend_manifest_path)
+
+    # If a QC result file is not listed in the manifest, this throws KeyError
+    # This would indicate major failure in the QC script and should interrupt analysis.
+    return { manifest.loc[k]["Kids.First.Biospecimen.ID"] : v for k, v in filename_map.items() }
+
+def filter_samples(input_matrix,
+                   mend_results_path,
+                   mend_manifest_path,
+                   clinical_path,
+                   nofilter=False,
+                   verbose=False):
+    """Filters samples by QC and tumor status.
+    Takes path to samples expression file; tgz of MEND QC results;
+    manifest of MEND QC files to sample ids; and clinical data file.
+    Returns data frame of sample expression filtered to contain only
+    MEND QC pass tumor samples. Filters are skippable via nofilter if,
+    for example, clinical and qc files aren't available."""
+    print_v = print if verbose else lambda *a, **k: None
+
+    print_v("Loading sample matrix {}".format(input_matrix))
+    raw_samples = utils.read_rds(input_matrix)
+
+    if nofilter:
+        print_v("Skipped the QC and tumor status filters; using all samples.")
+        return raw_samples
+
+    # Filter to only sample ids with a MEND QC status of PASS
+    # Samples not mentioned in the qc_status (ie, no qc result) will be dropped by the filter
+    qc_status = get_qc_status(mend_results_path, mend_manifest_path)
+    qc_pass_ids= [k for k, v in qc_status.items() if v == "PASS"]
+    qc_pass_ids_in_dataset = set(raw_samples.columns).intersection(qc_pass_ids)
+    print_v("QC filter: dropped {} samples that were not MEND QC PASS.".format(
+            len(raw_samples.columns) - len(qc_pass_ids_in_dataset)))
+
+    # Call out any sample ids that had no corresponding QC file, but ok to continue even if some are present
+    samples_without_qc_results = set(raw_samples.columns) - set(qc_status.keys())
+    if len(samples_without_qc_results):
+        print("WARNING: {} samples were dropped due to missing MEND QC results. Sample IDs: {}".format(
+              len(samples_without_qc_results),
+              ", ".join(samples_without_qc_results)))
+
+    # Further filter to only tumor RNA-Seq sample ids
+    clinical = utils.read_tsv(clinical_path)
+    tumor_ids = clinical[(clinical["sample_type"] == "Tumor")
+                         & (clinical["composition"] == "Solid Tissue")
+                         & (clinical["experimental_strategy"] == "RNA-Seq")
+                        ].index
+    qc_pass_tumor_ids_in_dataset = qc_pass_ids_in_dataset.intersection(tumor_ids)
+    print_v("Tumor filter: dropped {} samples that were not solid tumor RNA-Seq.".format(
+            len(qc_pass_ids_in_dataset) - len(qc_pass_tumor_ids_in_dataset)))
+
+    # And apply filtered IDs to the original dataset, retaining the original column order
+    qc_pass_tumor_samples = raw_samples[[sid for sid in raw_samples.columns if sid in qc_pass_tumor_ids_in_dataset]]
+    return qc_pass_tumor_samples
+
+def prepare_expression(raw_samples,
                        normalized_samples_path,
+                       gene_filters_path,
                        proportion_unexpressed=0.8,
                        variance_filter_level=0.2,
-                       verbose=False,
-                       prefix=""):
-    """Loads samples, converts to log2(TPM+1), and applies expression and variance filters.
-    returns dataframe of converted and filtered expression.
-    input_matrix = path to .rds file
-    writes file: scratch/{prefix}filtered_genes_to_keep.rds"""
+                       verbose=False):
+    """Converts sample matrix to log2(TPM+1) and applies expression and variance filters.
+    returns dataframe of converted and filtered expression; generates output feather files
+    stored at normalized_samples_path, gene_filters_path.
+    raw_samples = TPM expression dataframe.
+    Normalized samples file is a log2(TPM+1) expression matrix retaining all genes.
+    Gene filters file lists all genes and their filter status - either R = Retained,
+    E = dropped by the Expression filter, or V = dropped by the variance filter."""
 
     print_v = print if verbose else lambda *a, **k: None
-    filtered_genelist_filepath = os.path.join(scratch_dir, "{}filtered_genes_to_keep.rds".format(prefix))
-
-    print_v("Loading samples {}".format(input_matrix))
-    raw_samples = utils.read_rds(input_matrix)
 
     # Convert to log2(tpm+1)
     samples = np.log2(raw_samples+1)
     print_v("Writing normalized samples to {}".format(normalized_samples_path))
-    utils.write_rds(samples, normalized_samples_path)
+    utils.write_feather(samples, normalized_samples_path)
 
     # run expression filter
     # Remove any genes that have 0 expression in more samples than proportion_unexpressed
@@ -60,20 +139,24 @@ def prepare_expression(input_matrix,
     expr_var_filtered_genelist = pd.DataFrame(variance.nlargest(keep_proportion).index)
     expr_var_filtered_df = expression_filtered_df.filter(expr_var_filtered_genelist["gene_id"], axis=0)
 
-    utils.write_rds(expr_var_filtered_genelist, filtered_genelist_filepath)
+    # Save filter status by first by marking every gene as failed Expression filter;
+    # marking genes that passed as failed Variance filter, and again marking genes that passed
+    # that filter as Retained.
+    gene_filter_status = pd.DataFrame("E", index=samples.index, columns=["status"])
+    gene_filter_status["status"][expression_filtered_df.index] = "V"
+    gene_filter_status["status"][expr_var_filtered_df.index] = "R"
 
-    return expr_var_filtered_df
+    print_v("Writing gene filter status to {}".format(gene_filters_path))
+    utils.write_feather(gene_filter_status, gene_filters_path)
 
+    return expr_var_filtered_df
 
-def calculate_correlation(expr_var_filtered_df, scratch_dir, verbose=False, prefix=""):
+def calculate_correlation(expr_var_filtered_df, all_by_all_path, verbose=False):
     """Takes an expression matrix dataframe and calculates spearman correlations.
     (spearman is a rank transformed pearson ('correlation') metric)
-    output file: scratch/{prefix}all_by_all_correlations.tsv"""
-
+    output file: scratch/{prefix}all_by_all_correlations.feather"""
     print_v = print if verbose else lambda *a, **k: None
 
-    all_by_all_filepath = os.path.join(scratch_dir, "{}all_by_all_correlations.rds".format(prefix))
-
     # Rank transform - values are now that rank of that gene within each sample
     print_v("Rank transforming")
     filtered_rank_transformed_df = np.apply_along_axis(scipy.stats.rankdata, 0, expr_var_filtered_df)
@@ -86,8 +169,8 @@ def calculate_correlation(expr_var_filtered_df, scratch_dir, verbose=False, pref
     labels=expr_var_filtered_df.columns
     all_by_all_df = pd.DataFrame(x_corr, index=labels, columns=labels)
 
-    print_v("Writing to file {}".format(all_by_all_filepath))
-    utils.write_rds(all_by_all_df, all_by_all_filepath)
+    print_v("Writing to file {}".format(all_by_all_path))
+    utils.write_feather(all_by_all_df, all_by_all_path)
     return all_by_all_df
 
 def main():
@@ -98,11 +181,15 @@ def main():
 
     p = argparse.ArgumentParser()
     p.add_argument("input_path", metavar="input-path", help="Path to input Rds file.")
+    p.add_argument("--clinical-path", help="Path to clinical data matrix tsv file.")
+    p.add_argument("--qc-manifest-path", help="Path to MEND QC manifest tsv file.")
+    p.add_argument("--qc-results-path", help="Path to tgz file of MEND QC results for each sample.")
     p.add_argument("--verbose", action="store_true")
+    p.add_argument("--nofilter", action="store_true", help="Skip the sample QC and tumor status filters")
     p.add_argument("--scratch", 
                    default=os.path.join("..", "..", "scratch"),
                    help="Path to scratch dir.")
-    p.add_argument("--output-prefix", help="Prefix for output files.")
+    p.add_argument("--prefix", help="Prefix for output files.")
     p.add_argument("--proportion-unexpressed", 
                    default=0.8, 
                    type=float, 
@@ -114,21 +201,30 @@ def main():
     args = p.parse_args()
 
     # Use input basename as prefix if none was supplied
-    prefix = args.output_prefix or os.path.splitext(os.path.basename(args.input_path))[0]
-
-    # Output file path
-    normalized_samples_path = os.path.join(args.scratch, "{}log2-normalized.rds".format(prefix))
-
-    expression = prepare_expression(args.input_path, 
-                                    args.scratch, 
-                                    normalized_samples_path,
-                                    prefix=prefix, 
-                                    verbose=args.verbose,
-                                    proportion_unexpressed=args.proportion_unexpressed, 
-                                    variance_filter_level=args.variance_filter_level)
-    calculate_correlation(expression, 
-                          args.scratch, 
-                          prefix=prefix, 
+    prefix = args.prefix or os.path.splitext(os.path.basename(args.input_path))[0]
+
+    # Output file paths
+    normalized_samples_path = os.path.join(args.scratch, "{}log2-normalized.feather".format(prefix))
+    gene_filters_path = os.path.join(args.scratch, "{}gene-filters.feather".format(prefix))
+    all_by_all_path = os.path.join(args.scratch, "{}all_by_all_correlations.feather".format(prefix))
+
+
+    # Load the expression file and filter samples to QC-pass tumors only
+    filtered_expression = filter_samples(args.input_path,
+                                         args.qc_results_path,
+                                         args.qc_manifest_path,
+                                         args.clinical_path,
+                                         nofilter=args.nofilter,
+                                         verbose=args.verbose)
+
+    normalized_expression = prepare_expression(filtered_expression,
+                                               normalized_samples_path,
+                                               gene_filters_path,
+                                               proportion_unexpressed=args.proportion_unexpressed,
+                                               variance_filter_level=args.variance_filter_level,
+                                               verbose=args.verbose)
+    calculate_correlation(normalized_expression,
+                          all_by_all_path,
                           verbose=args.verbose)
 
 if __name__ == "__main__":