scripts

This directory contains the scripts used by the pipeline. Most of them can also be run on their own, and some may be helpful in day-to-day use.

All python scripts implement the --help argument. For bash, R, and awk scripts, you can run head <script> to read about their usage.

A python script that uses files from the prepare and classify pipelines to create a VCF with the final, predicted variants. This script also has a special internal mode, which can be used for recalibrating the QUAL scores output in the VCF.

A bash script for identifying sites at which the variant callers in our ensemble outputted conflicting alleles.

A bash script for extracting columns from TSVs via grep. Every argument besides the first is passed directly to grep.

A fast awk script for classifying each site in a VCF as DEL, INS, SNP, etc. It accepts a two column table (REF and ALT) from the VCF.
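As a rough illustration of this kind of classification, here is a minimal Python sketch that labels a REF/ALT pair by comparing allele lengths. It is not a port of the awk script: the function names, the `OTHER` label, and the handling of missing or multiallelic ALT values are assumptions made for the example.

```python
def classify_site(ref: str, alt: str) -> str:
    """Label one REF/ALT pair by comparing allele lengths.

    Assumption: anything that is not a simple SNP, insertion, or deletion
    (missing ALT, multiallelic ALT, equal-length multi-base change) is
    reported as "OTHER". The real script may use different labels.
    """
    if alt in ("", ".") or "," in alt:
        return "OTHER"
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) > len(alt):
        return "DEL"
    if len(ref) < len(alt):
        return "INS"
    return "OTHER"


def classify_table(lines):
    """Yield one label per line of a two column (REF, ALT) tab-separated table."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        ref, alt = line.split("\t")[:2]
        yield classify_site(ref, alt)
```

For example, `list(classify_table(["A\tG\n", "AT\tA\n", "A\tATG\n"]))` evaluates to `["SNP", "DEL", "INS"]`.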

A bash script for converting all REF/ALT columns in a TSV to binary positive/negative labels using classify.awk.

A bash script for replacing NA values in a large TSV.

A bash script for filtering rows from a large TSV by specific columns.

A python script for creating plots of the importance of each variable (i.e., feature) output by each variant caller.

A python script for calculating evaluation metrics on a two column TSV of binary labels: truth and predictions.
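To make the input format concrete, here is a minimal sketch of computing precision, recall, and F1 from such a table. It assumes the truth label is in the first column, the prediction is in the second, and both are encoded as 0 or 1. The function names and the particular metrics are choices made for this example and are not necessarily what the script reports.

```python
def parse_labels(lines):
    """Yield (truth, prediction) integer pairs from tab-separated lines."""
    for line in lines:
        line = line.strip()
        if line:
            truth, pred = line.split("\t")[:2]
            yield int(truth), int(pred)


def confusion_counts(pairs):
    """Return (tp, fp, fn, tn) for an iterable of (truth, prediction) pairs."""
    tp = fp = fn = tn = 0
    for truth, pred in pairs:
        if truth and pred:
            tp += 1
        elif pred:
            fp += 1
        elif truth:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1), using 0.0 when a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

A typical call chains the three: `precision_recall_f1(*confusion_counts(parse_labels(handle))[:3])`.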

A python script for summarizing multiple metrics files output by metrics.py in a nicely formatted table.

A fast awk script for ensuring that unusual numerical values in a large TSV can be read by R.

A python script for creating precision-recall plots. It takes as input the output of metrics.py and/or statistics.py.

An R script for predicting variants using a trained classifier. It takes as input a model generated by train_RF.R.

A python script for creating ROC plots. It takes as input the output of statistics.py.

A python script for generating points to use in a precision-recall or ROC curve. It takes as input a two column TSV: true labels and prediction p-values.
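One common way to generate such points is to sweep a decision threshold over the sorted scores. The sketch below does this for a precision-recall curve. It assumes that a higher score means a more confident positive prediction, which may not match the convention the script uses for its second column, and the function name is hypothetical.

```python
def pr_points(pairs):
    """Return a list of (threshold, precision, recall) tuples.

    pairs: iterable of (truth, score), with truth encoded as 0 or 1.
    Assumption: a site is called positive when score >= threshold.
    Tied scores are grouped so each distinct threshold yields one point.
    """
    ranked = sorted(pairs, key=lambda p: p[1], reverse=True)
    total_pos = sum(1 for truth, _ in ranked if truth)
    points = []
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][1]
        # Consume every site that shares this score before emitting a point.
        while i < len(ranked) and ranked[i][1] == threshold:
            if ranked[i][0]:
                tp += 1
            else:
                fp += 1
            i += 1
        precision = tp / (tp + fp)
        recall = tp / total_pos if total_pos else 0.0
        points.append((threshold, precision, recall))
    return points
```

ROC points can be produced with the same sweep by recording the false positive rate (`fp` divided by the number of negatives) in place of precision.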

An R script for creating a trained classifier. We recommend using the Snakefile-classify pipeline to run this script.

An R script for visualizing the results of hyperparameter tuning from the train_RF.R script.