Skip to content

Commit

Permalink
added usage description to README. added more keywords to setup. adde…
Browse files Browse the repository at this point in the history
…d back the de novo files and co-segregate variant file inputs.
  • Loading branch information
AdamDS committed Apr 9, 2017
1 parent 79228a5 commit c472f97
Show file tree
Hide file tree
Showing 4 changed files with 158 additions and 24 deletions.
118 changes: 118 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -21,3 +21,121 @@ Run:
For detailed help/support, email Adam:

adamscott@wustl.edu

## Usage Details
### Input data
-m Standard .maf
-f Standard .vcf
-T Custom .tsv
Variant data may be input via at least one variant file.
This means that if variants are spread across several files, then you can input one of each type.
For the .maf and .tsv, use the custom columns to determine which columns to use.
Note that a standard .maf does not include protein annotations.
Use the custom column for the peptide change column.
If your .vcf has VEP annotations, then CharGer should be able to parse the information.
This information will be added to your variants when available.

### Output
-o output file
-w output as HTML (flag)
-k annotate input (flag)
--run-url-test test url when creating links
Name your output file; otherwise it will be called charger_summary.tsv.
You can opt to make the output into an HTML page, instead of a readable .tsv.
If you need to be assured of properly linked URL's, use the url test flag.

### Access data
-l ClinVar (flag)
-x ExAC (flag)
-E VEP (flag)
-t TCGA cancer types (flag)
Using these flags turns on accession features built in.
For the ClinVar, ExAC, and VEP flags, if no local VEP or databse is provided, then BioMine will be used to access the ReST interface.
The TCGA flag allows disease determination from sample barcodes in a .maf when using a diseases file (see below).

### Suppress data or overrides
-O override with ClinVar description (flag)
-D suppress needing disease specific (flag)
You can have CharGer override its pathogenic characterization with whatever ClinVar has.
Suppressing disease specific variants takes any variants in the diseases file (see below) and treats them as equally pathogenic without disease consideration.

### Cross-reference data
-z pathogenic variants, .vcf
-e expression matrix file, .tsv
-g gene list file, (format: gene\\tdisease\\tmode_of_inheritance) .txt
-d diseases file, (format: gene\\tdisease\\tmode_of_inheritance) .tsv
-n de novo file, standard .maf
-a assumed de novo file, standard .maf
-c co-segregation file, standard .maf
-H HotSpot3D clusters file, .clusters
-r recurrence threshold (default = 2)
Variants or genes from each of these files can be used as additional known information.
An expression matrix file has columns for each sample, and its rows are genes.
The genes should be approved HUGO symbols.
HotSpot3D clusters can be used for versions v1.x.x.
The recurrence threshold will be pulled from the recurrence/weight column of the .clusters file when provided.

### Local VEP
--vep-script Path to VEP
--vep-dir Path to VEP directory
--vep-cache Path to VEP cache directory
--vep-version VEP version (default = 87)
--vep-output VEP output file (default = charger.vep.vcf)
--grch assembly GRCh verion (default = 37)
--ensembl-release Ensembl release version (default = 75)
--reference-fasta VEP reference fasta
--fork Number of forked processes used in VEP (default = 0)
This currently only works with .vcf input only.
Annotations are run with the VEP everything flag, so any local plugins will be used.
The BioMine accession is also suppressed when using a local VEP installaltion.
The VEP directory is not the same as would be given to VEP's --dir option.
Instead it is the path to the directory with the VEP .pl file.
The VEP script is the .pl file only.
If not given, it will be /vep-dir/variant\_effect\_predictor.pl.
The VEP cache directory is the same as would be given to VEP's --dir-cache option.
If you have multiple VEP versions, then specify the version you want to use.
This can be different from the Ensembl release option.
VEP output is the same os would be given to VEP's -o option and should end with .vcf.
The default output file will be called charger.vep.vcf.
The GRCh reference genome can be set to either 37 or 38.
The reference Fasta file will be deteremined automatically if not specified.
If the reference Fasta file is constructed automatically, then if, for example, the VEP chache is ~/.vep/, the Ensembl release is 74, and the reference assembly is 37, then the reference Fasta file will be ~/.vep/homo\_sapiens/74\_GRCH37/Homo\_sapiens.GRCh37.74.dna.primary\_assembly.fa.gz.

### Local databases
--exac-vcf ExAC vcf.gz
--mac-clinvar-tsv ClinVar from MacArthur lab (clinvar_alleles.tsv.gz)
Using local databases suppresses the BioMine accession too.
These files can be downloaded from their respective sites.

### Filters
--rare Allele frequency threshold for rare/common (default = 1, process variant with any frequency):
--vcf-any-filter Allow variants that do not pass all filters in .vcf input (flag)
--mutation-types Comma delimited list of types to allow
Using filters will limit the variants processed.
The rare option takes variants with allele frequency less than the given value.
The .vcf any filter accepts only variants that have passed all filters.
If no .vcf pass filter status given, the .vcf null value will be taken as having passed.
Mutation types filtering requires a comma delimitted list (no spaces) using terms from Ensembl's consequence terms.

### ReST batch sizes
-v VEP (#variants, default/max allowed = 150)
-b ClinVar summary (#variants, default/max allowed = 500)
-B ClinVar searchsize (#variants, default/max allowed = 50)
ReST API's usually have limits on the amount of data sent or received.
Exceeding these batch sizes would normally lead to warnings and/or IP blockage, but CharGer and BioMine try to keep batches at safe sizes. Last updated limits February 2017.

### Custom columns (0-based)
-G HUGO gene symbol
-X chromosome
-S start position
-P stop position
-R reference allele
-A alternate allele
-s strand
-M sample name
-C codon
-p peptide change
-L variant classification
-F allele frequency
Use these for .tsv and/or .maf input variant files to specify columns of relevant data.
CharGer makes use of genomic and protein variant annotations, so the more data made available the better your results.
31 changes: 16 additions & 15 deletions bin/charger
Original file line number Diff line number Diff line change
@@ -1,15 +1,16 @@
#!/bin/python
# CharGer - Characterization of Germline variants
# author: Adam D Scott (adamscott@wustl.edu) & Kuan-lin Huang (khuang@genome.wustl.edu)
# version: v0.1.0 - 2015*12
# version: v0.1.0 - 2016*04

import sys
import getopt
from charger import charger
import time

def parseArgs( argv ):
helpText = "Usage: "
helpText = "\nCharGer - v0.1.0\n\n"
helpText += "Usage: "
helpText += "charger <input file> [options]\n\n"
helpText += "Accepted input data files:\n"
helpText += " -m Standard .maf\n"
Expand All @@ -29,37 +30,37 @@ def parseArgs( argv ):
helpText += " -O override with ClinVar description (flag)\n"
helpText += " -D suppress needing disease specific (flag)\n"
helpText += "Cross-reference data files:\n"
helpText += " -z pathogenic variants .vcf\n"
helpText += " -e expression matrix file .tsv\n"
helpText += " -g gene list file .txt\n"
helpText += " -d diseases file (format: gene\\tdisease\\tmode_of_inheritance) .tsv\n"
helpText += " -n de novo file .?\n"
helpText += " -a assumed de novo file .?\n"
helpText += " -c co-segregation file .?\n"
helpText += " -H HotSpot3D clusters file .clusters\n"
helpText += " -r recurrence threshold (default = )\n"
helpText += " -z pathogenic variants, .vcf\n"
helpText += " -e expression matrix file, .tsv\n"
helpText += " -g gene list file, (format: gene\\tdisease\\tmode_of_inheritance) .txt\n"
helpText += " -d diseases file, (format: gene\\tdisease\\tmode_of_inheritance) .tsv\n"
helpText += " -n de novo file, standard .maf\n"
helpText += " -a assumed de novo file, standard .maf\n"
helpText += " -c co-segregation file, standard .maf\n"
helpText += " -H HotSpot3D clusters file, .clusters\n"
helpText += " -r recurrence threshold (default = 2)\n"
helpText += "Local VEP (works with .vcf input only; suppresses ReST too):\n"
helpText += " --vep-script Path to VEP\n"
helpText += " --vep-dir Path to VEP directory\n"
helpText += " --vep-cache Path to VEP cache directory\n"
helpText += " --vep-version VEP version (default = 87)\n"
helpText += " --vep-output VEP output file (default = charger.vep.vcf)\n"
helpText += " --grch assembly GRCh verion (default = 37)\n"
helpText += " --ensembl-release Ensembl release version (default = 74)\n"
helpText += " --ensembl-release Ensembl release version (default = 75)\n"
helpText += " --reference-fasta VEP reference fasta\n"
helpText += " --fork Number of forked processes used in VEP (default = 0) \n"
helpText += "Local databases (suppresses ReST too):\n"
helpText += " --exac-vcf ExAC vcf.gz\n"
#helpText += " --clinvar-tsv ClinVar (.tsv.gz download)\n"
#helpText += " --clinvar-vcf ClinVar (.vcf.gz download)\n"
helpText += " --mac-clinvar-tsv ClinVar from MacArthur lab (clinvar_alleles.tsv.gz)\n"
helpText += " --mac-clinvar-vcf ClinVar from MacArthur lab (clinvar_alleles.vcf.gz)\n"
#helpText += " --mac-clinvar-vcf ClinVar from MacArthur lab (clinvar_alleles.vcf.gz)\n"
helpText += "Filters:\n"
helpText += " --rare Allele frequency threshold for rare/common (default = 1, process variant with any frequency):\n"
helpText += " --vcf-any-filter Allow variants that do not pass all filters in .vcf input (flag)\n"
helpText += " --mutation-types Comma delimited list of types to allow\n"
helpText += " --mutation-types Comma delimited list (no spaces) of types to allow\n"
helpText += "ReST batch sizes:\n"
helpText += " -v VEP (#variants, default/max allowed = )\n"
helpText += " -v VEP (#variants, default/max allowed = 150)\n"
helpText += " -b ClinVar summary (#variants, default/max allowed = 500)\n"
helpText += " -B ClinVar searchsize (#variants, default/max allowed = 50)\n"
helpText += "Custom columns (0-based)\n"
Expand Down
27 changes: 20 additions & 7 deletions charger/charger.py
Original file line number Diff line number Diff line change
Expand Up @@ -109,7 +109,6 @@ def getInputData( self , **kwargs ):
deNovoFile = kwargs.get( 'deNovo' , "" )
assumedDeNovoFile = kwargs.get( 'assumedDeNovo' , "" )
coSegregateFile = kwargs.get( 'coSegregate' , "" )
tcga = kwargs.get( 'tcga' , True )
diseaseFile = kwargs.get( 'diseases' , "" )
specific = kwargs.get( 'specific' , False )
self.getDiseases( diseaseFile , **kwargs )
Expand All @@ -131,6 +130,12 @@ def getInputData( self , **kwargs ):
exacDone = self.readTSV( tsvFile , **kwargs )
if pathogenicVariantsFile:
self.readVCF( pathogenicVariantsFile , appendTo="pathogenic" , **kwargs )
if deNovoFile:
self.readDeNovo( deNovoFile )
if assumedDeNovoFile:
self.readAssumedDeNovo( assumedDeNovoFile )
if coSegregateFile:
self.readCoSegregate( coSegregateFile )
if expressionFile:
self.readExpression( expressionFile )
else:
Expand Down Expand Up @@ -733,10 +738,10 @@ def readGeneList( self , inputFile , **kwargs ): # gene list formatted "gene", "
print "CharGer::readGeneList Error: bad gene list file"
def readDeNovo( self , inputFile ):
self.readOtherMAF( inputFile , varDict = self.deNovoVariants )
def readCoSegregate( self , inputFile ):
self.readOtherMAF( inputFile , varDict = self.coSegregateVariants )
def readAssumedDeNovo( self , inputFile ):
self.readOtherMAF( inputFile , varDict = self.assumedDeNovoVariants )
def readCoSegregate( self , inputFile ):
self.readOtherMAF( inputFile , varDict = self.coSegregateVariants )
def readOtherMAF( self , inputFile, varDict ):
try:
inFile = self.safeOpen( inputFile , 'r' , warning=True )
Expand Down Expand Up @@ -772,7 +777,15 @@ def getClinVar( self , **kwargs ):
macClinVarVCF = kwargs.get( 'macClinVarVCF' , None )
doClinVar = kwargs.get( 'clinvar' , False )
summaryBatchSize = kwargs.get( 'summaryBatchSize' , 500 )
searchBatchSize = kwargs.get( 'searchBatchSize' , 50 )
maxSearchBatchSize = 50
searchBatchSize = kwargs.get( 'searchBatchSize' , maxSearchBatchSize )
if searchBatchSize > maxSearchBatchSize:
message = "warning: ClinVar ReST search batch size given is "
message += "greater than max allowed ("
message += str( maxSearchBatchSize ) + ")"
message += ". Overriding to max search batch size."
print( message )
searchBatchSize = maxSearchBatchSize
#print( ' '.join( [ str( doClinVar ) , str( macClinVarTSV ) , str( macClinVarVCF ) ] ) )
if doClinVar:
if macClinVarTSV:
Expand Down Expand Up @@ -1421,15 +1434,15 @@ def checkClinVarPC( self , var , mod , **kwargs ):
#if genomic change is the same, then PS1
if clin["description"] == clinvarvariant.pathogenic:
if mod == "PS1":
print( "also PS1 via samGenomicVariant" )
#print( "also PS1 via samGenomicVariant" )
var.PS1 = True # already pathogenic still suffices to be PS1
called = 1
elif consequence.sameGenomicReference( clinvarVar ):
#if genomic change is different, but the peptide change is the same, then PS1
if clinvarVar.alternatePeptide == consequence.alternatePeptide: #same amino acid change
if clin["description"] == clinvarvariant.pathogenic:
if mod == "PS1":
print( "also PS1 via sameGenomicReference" )
#print( "also PS1 via sameGenomicReference" )
var.PS1 = True
called = 1
if consequence.samePeptideReference( clinvarVar ):
Expand All @@ -1438,7 +1451,7 @@ def checkClinVarPC( self , var , mod , **kwargs ):
if consequence.plausibleCodonFrame( clinvarVar ):
if clin["description"] == clinvarvariant.pathogenic:
if mod == "PM5":
print( "also PS5 via plausibleCodonFrame" )
#print( "also PS5 via plausibleCodonFrame" )
var.PM5 = True # already pathogenic still suffices to be PS1
called = 1
return called
Expand Down
6 changes: 4 additions & 2 deletions setup.py
Original file line number Diff line number Diff line change
Expand Up @@ -15,11 +15,13 @@
classification, CharGer cross-checks user variants against \
several public databases for information. CharGer then \
provides a pathogenicity score according to ACMG or \
CharGers custom scale.' ,
CharGer\'s custom scale.' ,
download_url = 'https://github.com/ding-lab/CharGer/archive/v' + \
version + '.tar.gz' ,
scripts = ['bin/charger'] ,
keywords = ["bioinformatics" , "genomics"] ,
keywords = [ "bioinformatics" , "genomics" , "pathogenic" , "benign" , \
"ClinVar" , "ExAC" , "VEP" , "BioMine" , "germline" , \
"variant" , "classifier" ] ,
classifiers = [ \
"License :: OSI Approved :: MIT License " ,
"Programming Language :: Python" ,
Expand Down

0 comments on commit c472f97

Please sign in to comment.