Pipeline scripts

FASTA input: maggie_fasta_input.py

This script takes positive and negative sequences from FASTA files and conducts MAGGIE analysis.

python ./bin/maggie_fasta_input.py [posFile] [negFile] -o [directory]

argument	description	default
posFile	FASTA file(s) that contain positive sequences; multiple files should be separated by commas without spaces (e.g., file1,file2,file3,...)	required user input
negFile	fasta file(s) that contain negative sequences that should have the same sequence identifiers as positive sequences to form pairs	required user input
-o	directory to store output files; by default, a new folder will be created under the current path	./maggie_output/
--motifPath	directory that stores motif files	./data/JASPAR2020_CORE_vertebrates_motifs/
-m/--motifs	specify motifs to compute; multiple motifs should be separated by commas without spaces (e.g., SPI1,CEBPB)	all motifs from `--motifPath`
-p	number of processors to use	1
-R	Flag to overwrite the output folder specified by `-o` if it already exists	False
-mCut	cutoff for merging similar motifs; should be a float value ranging from 0 (merge everything) to 1 (no merging at all)	0.6
-sCut	cutoff for calling significance based on FDR values	0.05
-T	number of top motif scores used to compute the representative motif score	1
--saveDiff	Flag for saving motif score differences. This file can be large.	False
--linear	Change to linear model for analysis	False
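
Because negative sequences must share identifiers with positive sequences to form pairs, it can help to check the two inputs before running the script. The following is a minimal sketch of such a check, not part of MAGGIE itself. The helper names `read_fasta_ids` and `unpaired_ids` are hypothetical, and treating the first whitespace-delimited word of each header as the identifier is an assumption; the script may compare headers differently.

```python
def read_fasta_ids(lines):
    """Return sequence identifiers in file order.

    Assumption: the identifier is the first word after '>' on a header line.
    """
    ids = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            header = line[1:].split()
            ids.append(header[0] if header else "")
    return ids


def unpaired_ids(pos_lines, neg_lines):
    """Return identifiers that appear in only one of the two inputs, sorted."""
    pos = set(read_fasta_ids(pos_lines))
    neg = set(read_fasta_ids(neg_lines))
    return sorted(pos ^ neg)


# Toy inputs; with real files, pass open file handles instead of lists.
pos_example = [">peak1", "ACGTACGT", ">peak2", "TTGACCAA"]
neg_example = [">peak1", "ACGTTCGT", ">peak3", "TTGACGAA"]
print(unpaired_ids(pos_example, neg_example))  # ['peak2', 'peak3']
```

An empty result means every positive sequence has a negative partner with the same identifier.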

VCF input: maggie_vcf_input.py

This script takes a VCF file with allele and effect size information and conducts MAGGIE analysis.

python ./bin/maggie_vcf_input.py [vcfFile] [genome] -e [effect size column number] -o [directory]

argument	description	default
vcfFile	VCF file that contains testing variants	required user input
genome	reference genome; or specify a path to genome FASTA file	required user input (currently supports hg19, hg38, mm10, hg18)
-o	directory to store output files	./maggie_output/
-e/--effect	the column index in the input file for effect size that compares alternative vs. reference alleles. If not specified, assume alternative alleles are always associated with a higher signal	required user input
-a1	the column index for the reference allele. If not specified, use the 4th column	4
-a2	the column index for the alternative allele. If not specified, use the 5th column	5
--saveSeq	Flag for saving intermediate sequences. Will generate two files that correspond to alleles associated with higher and lower signals	False
-S/--size	size of sequences to test around variants	100
--motifPath	path to the motif files	./data/JASPAR2020_CORE_vertebrates_motifs/
-m/--motifs	specify motifs to compute; multiple motifs should be separated by commas without spaces (e.g., SPI1,CEBPB)	all motifs from `--motifPath`
-p	number of processors to use	1
-mCut	cutoff for merging similar motifs; should be a float value ranging from 0 (merge everything) to 1 (no merging at all)	0.6
-sCut	cutoff for calling significance based on FDR values	0.05
-T	number of top motif scores used to compute the representative motif score	1
--saveDiff	Flag for saving motif score differences. This file can be large.	False
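
To illustrate how the `-e`, `-a1`, and `-a2` column indices relate to a line of the input file, here is a small sketch that reads the alleles and effect size from one tab-separated record and decides which allele is associated with the higher signal. It is not the script's actual code. The function name `orient_alleles` is hypothetical, and the sketch assumes that column indices are 1-based (suggested by "4th column" for the default of 4) and that a positive effect size means the alternative allele has the higher signal.

```python
def orient_alleles(vcf_line, effect_col, ref_col=4, alt_col=5):
    """Return (higher_allele, lower_allele) for one tab-separated record.

    Assumptions: column indices are 1-based, and a positive effect size
    means the alternative allele is associated with the higher signal.
    Returns None when the effect size is exactly zero.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    ref = fields[ref_col - 1]
    alt = fields[alt_col - 1]
    effect = float(fields[effect_col - 1])
    if effect > 0:
        return alt, ref
    if effect < 0:
        return ref, alt
    return None


# Toy record with the effect size stored in the 6th column.
record = "chr1\t12345\trs1\tA\tG\t-0.8"
print(orient_alleles(record, effect_col=6))  # ('A', 'G')
```

In this toy record the negative effect size places the reference allele `A` on the higher-signal side.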

Split variants: splitVariants.py

This script splits the variants in a VCF file into categories (near TSS, intergenic, intronic) based on genomic annotations.

python ./bin/splitVariants.py [vcfFile] [genome] -o [directory]

argument	description	default
vcfFile	VCF file that contains testing variants	required user input
genome	reference genome; or specify a path to genome FASTA file	required user input (currently supports hg19, hg38, mm10, hg18)
-o	directory to store output files	./maggie_output/
-L/--overlap	overlap size to count for annotation	100

Download genomic data: download_genomic_data.py

This script downloads genomic data, including genomes and annotations. Currently available genomes are hg19, hg38, mm10, and hg18.

python ./bin/download_genomic_data.py [genome] -o [directory]

argument	description	default
genome	genome to download: hg19, hg38, mm10, hg18	required user input
-o	directory to store downloaded files	./data/genomes/
--annot	Flag for downloading annotation files at the same time	False
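
When several genomes are needed, the download command can be assembled programmatically. The sketch below only builds and prints the command lines from the arguments documented above; it does not run them. The helper name `download_command` and the `SUPPORTED_GENOMES` constant are hypothetical, and `shlex.join` requires Python 3.8 or later.

```python
import shlex

# Genomes listed as available in this README.
SUPPORTED_GENOMES = ("hg19", "hg38", "mm10", "hg18")


def download_command(genome, out_dir="./data/genomes/", with_annotation=False):
    """Build the argument list for download_genomic_data.py (not executed here)."""
    if genome not in SUPPORTED_GENOMES:
        raise ValueError(f"unsupported genome: {genome}")
    cmd = ["python", "./bin/download_genomic_data.py", genome, "-o", out_dir]
    if with_annotation:
        cmd.append("--annot")
    return cmd


for genome in ("hg38", "mm10"):
    print(shlex.join(download_command(genome, with_annotation=True)))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root to actually start the download.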