diff --git a/README.md b/README.md index 3b271fb..ae495f7 100644 --- a/README.md +++ b/README.md @@ -21,3 +21,121 @@ Run: For detailed help/support, email Adam: adamscott@wustl.edu + +## Usage Details +### Input data + -m Standard .maf + -f Standard .vcf + -T Custom .tsv +Variant data may be input via at least one variant file. +This means that if variants are spread across several files, then you can input one of each type. +For the .maf and .tsv, use the custom columns to determine which columns to use. +Note that a standard .maf does not include protein annotations. +Use the custom column for the peptide change column. +If your .vcf has VEP annotations, then CharGer should be able to parse the information. +This information will be added to your variants when available. + +### Output + -o output file + -w output as HTML (flag) + -k annotate input (flag) + --run-url-test test url when creating links +Name your output file; otherwise it will be called charger_summary.tsv. +You can opt to make the output into an HTML page, instead of a readable .tsv. +If you need to be assured of properly linked URL's, use the url test flag. + +### Access data + -l ClinVar (flag) + -x ExAC (flag) + -E VEP (flag) + -t TCGA cancer types (flag) +Using these flags turns on accession features built in. +For the ClinVar, ExAC, and VEP flags, if no local VEP or databse is provided, then BioMine will be used to access the ReST interface. +The TCGA flag allows disease determination from sample barcodes in a .maf when using a diseases file (see below). + +### Suppress data or overrides + -O override with ClinVar description (flag) + -D suppress needing disease specific (flag) +You can have CharGer override its pathogenic characterization with whatever ClinVar has. +Suppressing disease specific variants takes any variants in the diseases file (see below) and treats them as equally pathogenic without disease consideration. + +### Cross-reference data + -z pathogenic variants, .vcf + -e expression matrix file, .tsv + -g gene list file, (format: gene\\tdisease\\tmode_of_inheritance) .txt + -d diseases file, (format: gene\\tdisease\\tmode_of_inheritance) .tsv + -n de novo file, standard .maf + -a assumed de novo file, standard .maf + -c co-segregation file, standard .maf + -H HotSpot3D clusters file, .clusters + -r recurrence threshold (default = 2) +Variants or genes from each of these files can be used as additional known information. +An expression matrix file has columns for each sample, and its rows are genes. +The genes should be approved HUGO symbols. +HotSpot3D clusters can be used for versions v1.x.x. +The recurrence threshold will be pulled from the recurrence/weight column of the .clusters file when provided. + +### Local VEP + --vep-script Path to VEP + --vep-dir Path to VEP directory + --vep-cache Path to VEP cache directory + --vep-version VEP version (default = 87) + --vep-output VEP output file (default = charger.vep.vcf) + --grch assembly GRCh verion (default = 37) + --ensembl-release Ensembl release version (default = 75) + --reference-fasta VEP reference fasta + --fork Number of forked processes used in VEP (default = 0) +This currently only works with .vcf input only. +Annotations are run with the VEP everything flag, so any local plugins will be used. +The BioMine accession is also suppressed when using a local VEP installaltion. +The VEP directory is not the same as would be given to VEP's --dir option. +Instead it is the path to the directory with the VEP .pl file. +The VEP script is the .pl file only. +If not given, it will be /vep-dir/variant\_effect\_predictor.pl. +The VEP cache directory is the same as would be given to VEP's --dir-cache option. +If you have multiple VEP versions, then specify the version you want to use. +This can be different from the Ensembl release option. +VEP output is the same os would be given to VEP's -o option and should end with .vcf. +The default output file will be called charger.vep.vcf. +The GRCh reference genome can be set to either 37 or 38. +The reference Fasta file will be deteremined automatically if not specified. +If the reference Fasta file is constructed automatically, then if, for example, the VEP chache is ~/.vep/, the Ensembl release is 74, and the reference assembly is 37, then the reference Fasta file will be ~/.vep/homo\_sapiens/74\_GRCH37/Homo\_sapiens.GRCh37.74.dna.primary\_assembly.fa.gz. + +### Local databases + --exac-vcf ExAC vcf.gz + --mac-clinvar-tsv ClinVar from MacArthur lab (clinvar_alleles.tsv.gz) +Using local databases suppresses the BioMine accession too. +These files can be downloaded from their respective sites. + +### Filters + --rare Allele frequency threshold for rare/common (default = 1, process variant with any frequency): + --vcf-any-filter Allow variants that do not pass all filters in .vcf input (flag) + --mutation-types Comma delimited list of types to allow +Using filters will limit the variants processed. +The rare option takes variants with allele frequency less than the given value. +The .vcf any filter accepts only variants that have passed all filters. +If no .vcf pass filter status given, the .vcf null value will be taken as having passed. +Mutation types filtering requires a comma delimitted list (no spaces) using terms from Ensembl's consequence terms. + +### ReST batch sizes + -v VEP (#variants, default/max allowed = 150) + -b ClinVar summary (#variants, default/max allowed = 500) + -B ClinVar searchsize (#variants, default/max allowed = 50) +ReST API's usually have limits on the amount of data sent or received. +Exceeding these batch sizes would normally lead to warnings and/or IP blockage, but CharGer and BioMine try to keep batches at safe sizes. Last updated limits February 2017. + +### Custom columns (0-based) + -G HUGO gene symbol + -X chromosome + -S start position + -P stop position + -R reference allele + -A alternate allele + -s strand + -M sample name + -C codon + -p peptide change + -L variant classification + -F allele frequency +Use these for .tsv and/or .maf input variant files to specify columns of relevant data. +CharGer makes use of genomic and protein variant annotations, so the more data made available the better your results. diff --git a/bin/charger b/bin/charger index 8e1ea8c..62df617 100755 --- a/bin/charger +++ b/bin/charger @@ -1,7 +1,7 @@ #!/bin/python # CharGer - Characterization of Germline variants # author: Adam D Scott (adamscott@wustl.edu) & Kuan-lin Huang (khuang@genome.wustl.edu) -# version: v0.1.0 - 2015*12 +# version: v0.1.0 - 2016*04 import sys import getopt @@ -9,7 +9,8 @@ from charger import charger import time def parseArgs( argv ): - helpText = "Usage: " + helpText = "\nCharGer - v0.1.0\n\n" + helpText += "Usage: " helpText += "charger [options]\n\n" helpText += "Accepted input data files:\n" helpText += " -m Standard .maf\n" @@ -29,15 +30,15 @@ def parseArgs( argv ): helpText += " -O override with ClinVar description (flag)\n" helpText += " -D suppress needing disease specific (flag)\n" helpText += "Cross-reference data files:\n" - helpText += " -z pathogenic variants .vcf\n" - helpText += " -e expression matrix file .tsv\n" - helpText += " -g gene list file .txt\n" - helpText += " -d diseases file (format: gene\\tdisease\\tmode_of_inheritance) .tsv\n" - helpText += " -n de novo file .?\n" - helpText += " -a assumed de novo file .?\n" - helpText += " -c co-segregation file .?\n" - helpText += " -H HotSpot3D clusters file .clusters\n" - helpText += " -r recurrence threshold (default = )\n" + helpText += " -z pathogenic variants, .vcf\n" + helpText += " -e expression matrix file, .tsv\n" + helpText += " -g gene list file, (format: gene\\tdisease\\tmode_of_inheritance) .txt\n" + helpText += " -d diseases file, (format: gene\\tdisease\\tmode_of_inheritance) .tsv\n" + helpText += " -n de novo file, standard .maf\n" + helpText += " -a assumed de novo file, standard .maf\n" + helpText += " -c co-segregation file, standard .maf\n" + helpText += " -H HotSpot3D clusters file, .clusters\n" + helpText += " -r recurrence threshold (default = 2)\n" helpText += "Local VEP (works with .vcf input only; suppresses ReST too):\n" helpText += " --vep-script Path to VEP\n" helpText += " --vep-dir Path to VEP directory\n" @@ -45,7 +46,7 @@ def parseArgs( argv ): helpText += " --vep-version VEP version (default = 87)\n" helpText += " --vep-output VEP output file (default = charger.vep.vcf)\n" helpText += " --grch assembly GRCh verion (default = 37)\n" - helpText += " --ensembl-release Ensembl release version (default = 74)\n" + helpText += " --ensembl-release Ensembl release version (default = 75)\n" helpText += " --reference-fasta VEP reference fasta\n" helpText += " --fork Number of forked processes used in VEP (default = 0) \n" helpText += "Local databases (suppresses ReST too):\n" @@ -53,13 +54,13 @@ def parseArgs( argv ): #helpText += " --clinvar-tsv ClinVar (.tsv.gz download)\n" #helpText += " --clinvar-vcf ClinVar (.vcf.gz download)\n" helpText += " --mac-clinvar-tsv ClinVar from MacArthur lab (clinvar_alleles.tsv.gz)\n" - helpText += " --mac-clinvar-vcf ClinVar from MacArthur lab (clinvar_alleles.vcf.gz)\n" + #helpText += " --mac-clinvar-vcf ClinVar from MacArthur lab (clinvar_alleles.vcf.gz)\n" helpText += "Filters:\n" helpText += " --rare Allele frequency threshold for rare/common (default = 1, process variant with any frequency):\n" helpText += " --vcf-any-filter Allow variants that do not pass all filters in .vcf input (flag)\n" - helpText += " --mutation-types Comma delimited list of types to allow\n" + helpText += " --mutation-types Comma delimited list (no spaces) of types to allow\n" helpText += "ReST batch sizes:\n" - helpText += " -v VEP (#variants, default/max allowed = )\n" + helpText += " -v VEP (#variants, default/max allowed = 150)\n" helpText += " -b ClinVar summary (#variants, default/max allowed = 500)\n" helpText += " -B ClinVar searchsize (#variants, default/max allowed = 50)\n" helpText += "Custom columns (0-based)\n" diff --git a/charger/charger.py b/charger/charger.py index aa1e40d..0d8b824 100644 --- a/charger/charger.py +++ b/charger/charger.py @@ -109,7 +109,6 @@ def getInputData( self , **kwargs ): deNovoFile = kwargs.get( 'deNovo' , "" ) assumedDeNovoFile = kwargs.get( 'assumedDeNovo' , "" ) coSegregateFile = kwargs.get( 'coSegregate' , "" ) - tcga = kwargs.get( 'tcga' , True ) diseaseFile = kwargs.get( 'diseases' , "" ) specific = kwargs.get( 'specific' , False ) self.getDiseases( diseaseFile , **kwargs ) @@ -131,6 +130,12 @@ def getInputData( self , **kwargs ): exacDone = self.readTSV( tsvFile , **kwargs ) if pathogenicVariantsFile: self.readVCF( pathogenicVariantsFile , appendTo="pathogenic" , **kwargs ) + if deNovoFile: + self.readDeNovo( deNovoFile ) + if assumedDeNovoFile: + self.readAssumedDeNovo( assumedDeNovoFile ) + if coSegregateFile: + self.readCoSegregate( coSegregateFile ) if expressionFile: self.readExpression( expressionFile ) else: @@ -733,10 +738,10 @@ def readGeneList( self , inputFile , **kwargs ): # gene list formatted "gene", " print "CharGer::readGeneList Error: bad gene list file" def readDeNovo( self , inputFile ): self.readOtherMAF( inputFile , varDict = self.deNovoVariants ) - def readCoSegregate( self , inputFile ): - self.readOtherMAF( inputFile , varDict = self.coSegregateVariants ) def readAssumedDeNovo( self , inputFile ): self.readOtherMAF( inputFile , varDict = self.assumedDeNovoVariants ) + def readCoSegregate( self , inputFile ): + self.readOtherMAF( inputFile , varDict = self.coSegregateVariants ) def readOtherMAF( self , inputFile, varDict ): try: inFile = self.safeOpen( inputFile , 'r' , warning=True ) @@ -772,7 +777,15 @@ def getClinVar( self , **kwargs ): macClinVarVCF = kwargs.get( 'macClinVarVCF' , None ) doClinVar = kwargs.get( 'clinvar' , False ) summaryBatchSize = kwargs.get( 'summaryBatchSize' , 500 ) - searchBatchSize = kwargs.get( 'searchBatchSize' , 50 ) + maxSearchBatchSize = 50 + searchBatchSize = kwargs.get( 'searchBatchSize' , maxSearchBatchSize ) + if searchBatchSize > maxSearchBatchSize: + message = "warning: ClinVar ReST search batch size given is " + message += "greater than max allowed (" + message += str( maxSearchBatchSize ) + ")" + message += ". Overriding to max search batch size." + print( message ) + searchBatchSize = maxSearchBatchSize #print( ' '.join( [ str( doClinVar ) , str( macClinVarTSV ) , str( macClinVarVCF ) ] ) ) if doClinVar: if macClinVarTSV: @@ -1421,7 +1434,7 @@ def checkClinVarPC( self , var , mod , **kwargs ): #if genomic change is the same, then PS1 if clin["description"] == clinvarvariant.pathogenic: if mod == "PS1": - print( "also PS1 via samGenomicVariant" ) + #print( "also PS1 via samGenomicVariant" ) var.PS1 = True # already pathogenic still suffices to be PS1 called = 1 elif consequence.sameGenomicReference( clinvarVar ): @@ -1429,7 +1442,7 @@ def checkClinVarPC( self , var , mod , **kwargs ): if clinvarVar.alternatePeptide == consequence.alternatePeptide: #same amino acid change if clin["description"] == clinvarvariant.pathogenic: if mod == "PS1": - print( "also PS1 via sameGenomicReference" ) + #print( "also PS1 via sameGenomicReference" ) var.PS1 = True called = 1 if consequence.samePeptideReference( clinvarVar ): @@ -1438,7 +1451,7 @@ def checkClinVarPC( self , var , mod , **kwargs ): if consequence.plausibleCodonFrame( clinvarVar ): if clin["description"] == clinvarvariant.pathogenic: if mod == "PM5": - print( "also PS5 via plausibleCodonFrame" ) + #print( "also PS5 via plausibleCodonFrame" ) var.PM5 = True # already pathogenic still suffices to be PS1 called = 1 return called diff --git a/setup.py b/setup.py index db47007..8c61d8d 100644 --- a/setup.py +++ b/setup.py @@ -15,11 +15,13 @@ classification, CharGer cross-checks user variants against \ several public databases for information. CharGer then \ provides a pathogenicity score according to ACMG or \ - CharGers custom scale.' , + CharGer\'s custom scale.' , download_url = 'https://github.com/ding-lab/CharGer/archive/v' + \ version + '.tar.gz' , scripts = ['bin/charger'] , - keywords = ["bioinformatics" , "genomics"] , + keywords = [ "bioinformatics" , "genomics" , "pathogenic" , "benign" , \ + "ClinVar" , "ExAC" , "VEP" , "BioMine" , "germline" , \ + "variant" , "classifier" ] , classifiers = [ \ "License :: OSI Approved :: MIT License " , "Programming Language :: Python" ,