Perform differential expression tests for RNASeq at scale with a configurable interface and visualize results with tunable control over p-value thresholds.
Start with raw count data from RNASeq and format a data file with factors and features for all samples.
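As a rough illustration of such a file, the sketch below writes a small table with one row per sample, factor columns describing the experimental conditions, and feature columns holding raw counts. The column names (`strain`, `iptg`, `timepoint`, `gene_a`, `gene_b`), the values, and the one-row-per-sample orientation are assumptions made for illustration, not a layout the tool is known to require; compare against the example counts file in omics_tools/examples before relying on it.

```python
import csv
import io

# Hypothetical factor columns (experimental conditions) and feature columns (genes).
FACTORS = ["strain", "iptg", "timepoint"]
FEATURES = ["gene_a", "gene_b"]

# Made-up samples: one dict per sample, with a raw integer count for each feature.
samples = [
    {"strain": "wt", "iptg": 0.0, "timepoint": 5, "gene_a": 120, "gene_b": 7},
    {"strain": "wt", "iptg": 1.0, "timepoint": 5, "gene_a": 340, "gene_b": 9},
]


def write_counts_table(rows, handle):
    """Write one row per sample: factor columns first, then feature columns."""
    writer = csv.DictWriter(handle, fieldnames=FACTORS + FEATURES)
    writer.writeheader()
    writer.writerows(rows)


# Write to an in-memory buffer; pass an open file handle to write to disk instead.
buffer = io.StringIO()
write_counts_table(samples, buffer)
counts_csv = buffer.getvalue()
```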
Data files not included in the distribution will be automatically downloaded when running GO and KEGG annotation. If the internet is unavailable, download ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2go.gz and extract to the data/ directory.
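If the manual route is needed, the two steps can be scripted on any machine that does have network access. This is a minimal sketch using only the Python standard library; the helper names `download` and `extract_gz` are hypothetical, and the assumption that data/ sits relative to the current working directory should be checked against your checkout.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path

GENE2GO_URL = "ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2go.gz"


def download(url, dest):
    """Fetch url to dest; urllib handles ftp:// as well as http(s):// URLs."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, dest)
    return dest


def extract_gz(archive, data_dir="data"):
    """Decompress archive (e.g. gene2go.gz) into data_dir, dropping the .gz suffix."""
    archive = Path(archive)
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    target = data_dir / archive.stem
    with gzip.open(archive, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

Calling `extract_gz(download(GENE2GO_URL, "gene2go.gz"))` would then leave the decompressed file at data/gene2go.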
pip install -I .
jupyter nbextension enable --py --sys-prefix widgetsnbextension
jupyter nbextension enable --py --sys-prefix clustergrammer_widget
To run omics tools at scale, run the following command:
python omics_tools/run_omics.py --input_counts_file <PATH_TO_COUNTS_FILE> --config_file <PATH_TO_CONFIG_FILE> --output_dir <PATH_TO_OUTPUT_DIR>
Description of command line arguments:
- input_counts_file: this is a pointer to the raw counts file on your system.
- config_file: this is a configuration file that describes all the comparisons that need to be made. Keys and values in the config MUST line up with columns in your input_counts_file.
- output_dir: path to your output directory.
You can find example config files in config.
NOTE: omics tools expects all factors listed in the following sections of the config file, when those sections are present, to be in lower case:
int_factors
float_factors
bool_factor
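One way to catch casing problems before a run is to scan those sections of the config up front. The sketch below assumes each section, when present, holds a list of factor names at the top level of the JSON config; that layout and the helper name `find_non_lowercase_factors` are assumptions for illustration, so adapt the lookup to the structure of the example configs.

```python
import json

# Section names are the ones listed in this README; their internal layout is assumed.
FACTOR_SECTIONS = ("int_factors", "float_factors", "bool_factor")


def find_non_lowercase_factors(config):
    """Return {section: [names]} for factor names that are not lower case."""
    problems = {}
    for section in FACTOR_SECTIONS:
        names = config.get(section, [])  # a missing section is simply skipped
        bad = [name for name in names if name != name.lower()]
        if bad:
            problems[section] = bad
    return problems


# Hypothetical config fragment, not copied from the repository.
example_config = json.loads(
    '{"int_factors": ["timepoint", "Replicate"], "float_factors": ["iptg"]}'
)
problems = find_non_lowercase_factors(example_config)
```

An empty result means every listed factor is already lower case; here `Replicate` would be reported under `int_factors`.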
Below is an example of executing the tool for the Bacillus Inducer 1.0 ER from DARPA's SD2 program.
python omics_tools/run_omics.py --input_counts_file omics_tools/examples/scaled_example/experiment.ginkgo.29422_ReadCountMatrix_preCAD_transposed.csv --config_file omics_tools/config/Bacillus_Inducer_1_0.json --output_dir omics_tools/examples/scaled_example
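When the run has to be launched from a Python script or notebook rather than a shell, the same command can be assembled as an argument list. This sketch uses only the three flags documented above; the helper name `build_run_command` is hypothetical, and the paths assume the working directory is the repository root.

```python
import subprocess
import sys


def build_run_command(counts_file, config_file, output_dir,
                      script="omics_tools/run_omics.py"):
    """Assemble the argument list for run_omics.py using the three documented flags."""
    return [
        sys.executable, script,
        "--input_counts_file", str(counts_file),
        "--config_file", str(config_file),
        "--output_dir", str(output_dir),
    ]


command = build_run_command(
    "omics_tools/examples/scaled_example/"
    "experiment.ginkgo.29422_ReadCountMatrix_preCAD_transposed.csv",
    "omics_tools/config/Bacillus_Inducer_1_0.json",
    "omics_tools/examples/scaled_example",
)
# To actually launch the run: subprocess.run(command, check=True)
```

Passing a list instead of a single string avoids shell quoting problems when paths contain spaces.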
All output files will be generated in the "results" folder. Outputs will include:
- One txt file per comparison, named with the metadata.
- A file called massive_df.csv, which is a wide dataframe of all comparisons and FDRs. Each column is named with the comparison.
- A file called additive_design_df.csv, which is a long representation of massive_df.csv.
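To make the wide versus long distinction concrete, the sketch below reshapes a tiny made-up wide table (one FDR column per comparison) into long rows of gene, comparison, and FDR, then keeps the rows under a threshold. The column names, the values, and the 0.05 cutoff are hypothetical; the real columns in the output files may differ.

```python
import csv
import io

# Made-up wide table: one row per gene, one FDR column per comparison.
WIDE_CSV = """gene,strainA_vs_wt,strainB_vs_wt
gene_a,0.001,0.40
gene_b,0.20,0.03
"""


def wide_to_long(wide_text, id_column="gene"):
    """Turn each (gene, comparison) cell of the wide table into its own row."""
    long_rows = []
    for row in csv.DictReader(io.StringIO(wide_text)):
        for column, value in row.items():
            if column != id_column:
                long_rows.append(
                    {"gene": row[id_column], "comparison": column, "fdr": float(value)}
                )
    return long_rows


def significant(long_rows, fdr_threshold=0.05):
    """Keep rows whose FDR falls below the chosen threshold."""
    return [r for r in long_rows if r["fdr"] < fdr_threshold]


long_rows = wide_to_long(WIDE_CSV)
hits = significant(long_rows)
```

The long form is convenient for filtering and grouping, while the wide form is the natural input for a heatmap with one column per comparison.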
The visualization can be run in Jupyter, or the dataframe can be exported to the Clustergrammer2 web application.
If you want to use this tool to perform differential expression testing, there are two options:
- Run the tests within Python using the rpy2 R interface. For 200+ tests, use a workstation (24+ CPUs).
- If you have access to an HPC cluster, an option to produce script files is available. This is probably the fastest compute option, but it requires knowledge of a job scheduler and a few extra steps. Modify differential_expression.make_hpc_de_files with appropriate system paths.
Git
- Alexander Cristofaro
- Mohammed Eslami
- George Zheng
This project is licensed under the MIT License - see the LICENSE.md file for details
- Clustergrammer2 for the heatmap visualization.
- GOATOOLS for gene ontology annotation.
- ClusterProfiler for KEGG annotation.
- edgeR for differential expression.