Pipeline scripts

zeyang-shen edited this page Jun 26, 2020 · 6 revisions

FASTA input: maggie_fasta_input.py

This script takes positive and negative sequences from FASTA files and conducts MAGGIE analysis.

python ./bin/maggie_fasta_input.py [posFile] [negFile] -o [directory]
| argument | description | default |
|---|---|---|
| posFile | FASTA file(s) that contain positive sequences; multiple files should be separated by commas without spaces (e.g., file1,file2,file3,...) | required user input |
| negFile | FASTA file(s) that contain negative sequences, which should have the same sequence identifiers as the positive sequences to form pairs | required user input |
| -o | directory to store output files; by default, a new folder is created under the current path | ./maggie_output/ |
| --motifPath | directory that stores motif files | ./data/JASPAR2020_CORE_vertebrates_motifs/ |
| -m/--motifs | motifs to compute; multiple motifs should be separated by commas without spaces (e.g., SPI1,CEBPB) | all motifs from --motifPath |
| -p | number of processors to use | 1 |
| -R | flag to overwrite the output folder specified by -o if it already exists | False |
| -mCut | cutoff for merging similar motifs; a float ranging from 0 (merge everything) to 1 (no merging at all) | 0.6 |
| -sCut | FDR cutoff for calling significance | 0.05 |
| -T | number of top motif scores used to compute the representative motif score | 1 |
| --saveDiff | flag for saving motif score differences; this file can be large | False |
| --linear | switch to a linear model for the analysis | False |
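
Because each negative sequence must share its identifier with a positive sequence, it can help to check the pairing before running the script. The following is a minimal sketch and not part of MAGGIE: the helper names `read_fasta_ids` and `unpaired_ids` are hypothetical, and it assumes the identifier is the first whitespace-delimited token of each header line.

```python
import io


def read_fasta_ids(handle):
    """Return the identifiers found on FASTA header lines, in file order."""
    ids = []
    for line in handle:
        if line.startswith(">"):
            header = line[1:].strip()
            # Assumption: the identifier is the first token of the header.
            ids.append(header.split()[0] if header else "")
    return ids


def unpaired_ids(pos_handle, neg_handle):
    """Return (only_in_positive, only_in_negative) as sorted lists."""
    pos = set(read_fasta_ids(pos_handle))
    neg = set(read_fasta_ids(neg_handle))
    return sorted(pos - neg), sorted(neg - pos)


if __name__ == "__main__":
    # Tiny in-memory stand-ins for posFile and negFile.
    pos = io.StringIO(">peak1\nACGT\n>peak2\nGGCA\n")
    neg = io.StringIO(">peak1\nACGA\n>peak3\nGGCC\n")
    print(unpaired_ids(pos, neg))  # (['peak2'], ['peak3'])
```

With real data, open the two FASTA files and pass the file handles instead of the `io.StringIO` objects; two empty lists mean every sequence has a partner.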

VCF input: maggie_vcf_input.py

This script takes a VCF file with allele and effect size information and conducts MAGGIE analysis.

python ./bin/maggie_vcf_input.py [vcfFile] [genome] -e [effect size column number] -o [directory]
| argument | description | default |
|---|---|---|
| vcfFile | VCF file that contains the variants to test | required user input |
| genome | reference genome, or a path to a genome FASTA file | required user input (currently supports hg19, hg38, mm10, hg18) |
| -o | directory to store output files | ./maggie_output/ |
| -e/--effect | column index in the input file for the effect size that compares alternative vs. reference alleles. If not specified, alternative alleles are assumed to always be associated with a higher signal | required user input |
| -a1 | column index for the reference allele. If not specified, the 4th column is used | 4 |
| -a2 | column index for the alternative allele. If not specified, the 5th column is used | 5 |
| --saveSeq | flag for saving intermediate sequences; generates two files that correspond to the alleles associated with higher and lower signals | False |
| -S/--size | size of the sequences to test around variants | 100 |
| --motifPath | path to the motif files | ./data/JASPAR2020_CORE_vertebrates_motifs/ |
| -m/--motifs | motifs to compute; multiple motifs should be separated by commas without spaces (e.g., SPI1,CEBPB) | all motifs from --motifPath |
| -p | number of processors to use | 1 |
| -mCut | cutoff for merging similar motifs; a float ranging from 0 (merge everything) to 1 (no merging at all) | 0.6 |
| -sCut | FDR cutoff for calling significance | 0.05 |
| -T | number of top motif scores used to compute the representative motif score | 1 |
| --saveDiff | flag for saving motif score differences; this file can be large | False |
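
The column options are easier to picture on a single record. The sketch below is an illustration under assumptions, not MAGGIE's actual code: `orient_alleles` is a hypothetical helper, columns are taken to be 1-based (consistent with the `-a1`/`-a2` defaults of 4 and 5), a positive effect size is taken to mean the alternative allele has the higher signal, and the example record with an effect size in column 6 is invented.

```python
def orient_alleles(record, effect_col=None, ref_col=4, alt_col=5):
    """Return (higher_allele, lower_allele) for one tab-separated record.

    Column numbers are 1-based. Returns None when the effect size is zero,
    because neither allele can then be called the higher one.
    """
    fields = record.rstrip("\n").split("\t")
    ref = fields[ref_col - 1]
    alt = fields[alt_col - 1]
    if effect_col is None:
        # No effect column given: treat the alternative allele as higher.
        return alt, ref
    effect = float(fields[effect_col - 1])
    if effect > 0:
        return alt, ref
    if effect < 0:
        return ref, alt
    return None


if __name__ == "__main__":
    # Invented record: CHROM, POS, ID, REF, ALT, effect size.
    line = "chr1\t12345\trs0\tA\tG\t-0.8\n"
    print(orient_alleles(line, effect_col=6))  # ('A', 'G')
```

Under these assumptions, the record above would be tested with `A` as the higher-signal allele and `G` as the lower-signal allele, which is the kind of split that `--saveSeq` writes out as two files.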

Split variants: splitVariants.py

This script splits the variants in a VCF file into categories (near TSS, intergenic, intronic) based on genomic annotations.

python ./bin/splitVariants.py [vcfFile] [genome] -o [directory]
| argument | description | default |
|---|---|---|
| vcfFile | VCF file that contains the variants to test | required user input |
| genome | reference genome, or a path to a genome FASTA file | required user input (currently supports hg19, hg38, mm10, hg18) |
| -o | directory to store output files | ./maggie_output/ |
| -L/--overlap | overlap size to count for annotation | 100 |

Download data: download_genomic_data.py

This script downloads genomic data, including genomes and annotations. Currently available genomes are hg19, hg38, mm10, and hg18.

python ./bin/download_genomic_data.py [genome] -o [directory]
| argument | description | default |
|---|---|---|
| genome | genome to download: hg19, hg38, mm10, hg18 | required user input |
| -o | directory to store downloaded files | ./data/genomes/ |
| --annot | flag for downloading annotation files at the same time | False |
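
To fetch several genomes in one go, the script can be driven from Python. This is a minimal sketch, assuming it is run from the repository root so that `./bin/download_genomic_data.py` resolves and that `python` is on the path; `build_download_command` is a hypothetical helper, and it uses only the arguments documented above.

```python
SUPPORTED_GENOMES = ("hg19", "hg38", "mm10", "hg18")


def build_download_command(genome, out_dir="./data/genomes/", annot=False):
    """Build the argument list for one download_genomic_data.py call."""
    if genome not in SUPPORTED_GENOMES:
        raise ValueError(f"unsupported genome: {genome}")
    cmd = ["python", "./bin/download_genomic_data.py", genome, "-o", out_dir]
    if annot:
        # Also request the annotation files.
        cmd.append("--annot")
    return cmd


if __name__ == "__main__":
    for genome in ("hg38", "mm10"):
        # Print the command; nothing is downloaded here.
        print(" ".join(build_download_command(genome, annot=True)))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` to actually start the download.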