[AlexsLemonade#229] comparative-rnaseq-analysis: README cleanup
e-t-k authored Feb 5, 2020
1 parent ec77eb1 commit 00337b0
Showing 1 changed file with 13 additions and 12 deletions: `analyses/comparative-RNASeq-analysis/README.md`
@@ -7,23 +7,23 @@
- [Limitations and Requirements](#limitations-and-requirements)
- [Future Updates](#future-updates)

# Purpose
## Purpose
The comparative-RNAseq-analysis module implements the outlier analysis workflow published in [Vaske et al. JAMA Network Open. 2019](https://jamanetwork.com/journals/jamanetworkopen/article-abstract/2753519), which highlights genes within each sample whose expression is an outlier compared to the expression distribution of the dataset as a whole. This workflow:
- Creates correlation matrices for polyA samples and ribodeplete (stranded) samples using gene expression data.
- Generates gene outlier threshold values for ribodeplete samples.
- Lists outlier genes for each ribodeplete sample.
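
The threshold and outlier steps listed above can be illustrated with a minimal sketch. It is not the module's code. It assumes expression is transformed as log2(TPM + 1) and that a gene's threshold is the upper Tukey fence (Q3 + 1.5 × IQR) across the dataset; neither rule is stated in this README, and the function names are hypothetical. It uses only the standard library so that it runs without the dependencies listed later.

```python
import math
import statistics


def log2_normalize(tpm):
    """Return log2(TPM + 1) per value; `tpm` maps gene -> {sample: TPM}."""
    return {
        gene: {sample: math.log2(value + 1.0) for sample, value in row.items()}
        for gene, row in tpm.items()
    }


def outlier_thresholds(expr, k=1.5):
    """Per-gene upper threshold, ASSUMED here to be Q3 + k * IQR."""
    thresholds = {}
    for gene, row in expr.items():
        q1, _, q3 = statistics.quantiles(row.values(), n=4, method="inclusive")
        thresholds[gene] = q3 + k * (q3 - q1)
    return thresholds


def outliers_per_sample(expr, thresholds):
    """Map each sample to the sorted genes whose value exceeds the threshold."""
    result = {}
    for gene, row in expr.items():
        for sample, value in row.items():
            result.setdefault(sample, [])
            if value > thresholds[gene]:
                result[sample].append(gene)
    return {sample: sorted(genes) for sample, genes in result.items()}
```

Because each threshold is computed from every sample in the dataset, adding or removing samples changes which genes are reported for any one sample.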

# Usage
## Usage
The input file must be an RNA-Seq TPM gene expression matrix in the [RDS](https://stat.ethz.ch/R-manual/R-devel/library/base/html/readRDS.html) file format. Currently available matrices that meet this format are `pbta-gene-expression-rsem-tpm.polya.rds` and `pbta-gene-expression-rsem-tpm.stranded.rds`.

Example command lines follow, using the stranded (ribodeplete) dataset.
Command lines follow, using as example the stranded (ribodeplete) dataset.
Available flags:
- `--verbose` toggles verbose output
- `--output-prefix MyDataset` prepends MyDataset to output filenames to identify runs. Subsequent steps in the same run must use the same prefix.
- `--scratch ../../scratch` provides path to scratch dir where intermediate files shared between steps can be read and written.
- `--results ./results` provides path to final results dir.
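
As an illustration of how the four flags above fit together, here is a hedged `argparse` sketch. The function name `build_parser` and the default values are assumptions made for this example; the module's scripts may define their options differently.

```python
import argparse


def build_parser():
    """Build a parser for the flags listed in this README (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Sketch of the shared command-line flags."
    )
    parser.add_argument("--verbose", action="store_true",
                        help="toggle verbose output")
    # Default is an assumption: an empty prefix leaves filenames unchanged.
    parser.add_argument("--output-prefix", default="",
                        help="prefix prepended to output filenames")
    # Defaults below reuse the example paths shown in the flag list.
    parser.add_argument("--scratch", default="../../scratch",
                        help="directory for intermediate files shared between steps")
    parser.add_argument("--results", default="./results",
                        help="directory for final results")
    return parser
```

For example, `build_parser().parse_args(["--verbose", "--output-prefix", "MyDataset"])` yields `verbose=True` and `output_prefix="MyDataset"`; passing the same prefix to every step keeps the files of one run together.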

## 01 - Correlation Matrix
### 01 - Correlation Matrix
Generates the correlation matrix and filtered gene list.

```
@@ -35,15 +35,17 @@ Generates the correlation matrix and filtered gene list.
```

Input file:
`data/pbta-gene-expression-rsem-tpm.stranded.rds`
```
data/pbta-gene-expression-rsem-tpm.stranded.rds
```

Output files:
```
scratch/rsem-tpm-stranded-all_by_all_correlations.rds
scratch/rsem-tpm-stranded-filtered_genes_to_keep.rds
```
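
To make "all by all" concrete, the sketch below computes a sample-by-sample correlation matrix in plain Python. It assumes Pearson correlation purely for brevity; this README does not say which coefficient the module uses, and the gene-filtering step is not reproduced. The function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (NaN if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def all_by_all_correlations(samples):
    """`samples` maps sample id -> expression values in a shared gene order.

    Returns a nested dict so that result[a][b] is the correlation of a with b.
    """
    names = sorted(samples)
    return {
        a: {b: pearson(samples[a], samples[b]) for b in names}
        for a in names
    }
```

The result is symmetric and has ones on the diagonal for any sample whose values are not constant.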

## 02 - Thresholds and Outliers
### 02 - Thresholds and Outliers
Generates outlier thresholds and matrix of outlier genes.

```
@@ -70,25 +72,24 @@ scratch/rsem-tpm-stranded-log2-normalized.rds
scratch/rsem-tpm-stranded-threshold-expression-values.rds
```

# Limitations and requirements
## Limitations and requirements
Because the per-sample results of this analysis are dependent on the entire dataset, all samples in the dataset must meet certain standards for the outliers to be meaningful. *(Currently, these standards are not being enforced.)*
- All samples must pass a quality control check.
- The dataset must contain only tumor samples; no normal tissue, cell line, or other non-tumor data.
- All samples in the dataset must have the same library preparation (e.g., polyA selection, ribodepletion, or hybrid capture) for their gene expression to be comparable.

## Software dependencies
### Software dependencies
The analysis uses Python 3 and requires the following libraries. Version numbers
represent version in use and earlier or later versions may also be acceptable.
are those currently in use and earlier or later versions may also be acceptable but have not been tested.
```
numpy (1.17.3)
pandas (0.25.3)
scipy (1.3.2)
scikit-learn (0.19.1)
pyreadr (0.2.1)
```
In addition, a `utils` module is included for certain shared functions.
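
One way to compare an environment against the versions listed above is to query installed package metadata. This is a sketch rather than part of the module: `TESTED_VERSIONS` and `installed_versions` are names invented for this example, and `importlib.metadata` requires Python 3.8 or later.

```python
from importlib import metadata

# Versions copied from the dependency list in this README.
TESTED_VERSIONS = {
    "numpy": "1.17.3",
    "pandas": "0.25.3",
    "scipy": "1.3.2",
    "scikit-learn": "0.19.1",
    "pyreadr": "0.2.1",
}


def installed_versions(packages=TESTED_VERSIONS):
    """Return {package: installed version, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def version_mismatches(packages=TESTED_VERSIONS):
    """Return {package: (listed, installed)} where the two differ."""
    installed = installed_versions(packages)
    return {
        name: (listed, installed[name])
        for name, listed in packages.items()
        if installed[name] != listed
    }
```

A mismatch is not necessarily a problem, since other versions may work; the report only shows where an environment departs from the listed one.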

# Future updates
- The next iteration of this of this module will use the `pbta-histologies.tsv` file to filter the input datasets to only tumor samples. It will also filter samples to those which meet a to-be-described quality control standard.
## Future updates
- The next iteration of this module will use the `pbta-histologies.tsv` file to filter the input datasets to only tumor samples. It will also filter samples to those which meet a to-be-described quality control standard.
