Introduction
SV2 (support-vector structural-variant genotyper) is a machine learning algorithm for genotyping deletions and duplications from paired-end whole genome sequencing data. SV2 can rapidly integrate variant calls from multiple SV discovery algorithms into a unified callset with high genotyping accuracy and detection of de novo mutations.
SV2 is open source software written in Python/Cython that exploits four features of SV: read depth, discordant paired-ends, split-reads, and heterozygous allele ratio in a supervised support vector machine classifier.
Required inputs include a BAM/CRAM file with supplementary alignment tags (SA), a SNV VCF file with allele depth, and either a BED or VCF of SV to genotype. SV2 operates in three stages: preprocessing, feature extraction, and genotyping. SV2 outputs a BED and VCF file of genotypes along with annotations for genes, repeats, and other statistics relevant to SV analysis.
For more information, or to cite SV2, please refer to the publication.
SV2 preprocessing records the median coverage, insert size, and read length for each chromosome for downstream normalization of features. Preprocessing statistics are obtained for each chromosome in 100 random nonoverlapping regions 100kb in length. The default random seed is 42, but can be altered with the [-s|-seed] INT option.
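The region sampling described above can be illustrated with a minimal sketch. It assumes the windows are drawn from a fixed 100kb grid along the chromosome, which guarantees they never overlap; SV2's own sampling scheme may differ, and the function name `sample_regions` is hypothetical.

```python
import random


def sample_regions(chrom_length, n_regions=100, region_size=100_000, seed=42):
    """Pick up to n_regions nonoverlapping windows of region_size bp.

    Assumption: windows are aligned to a fixed grid, so distinct grid
    bins can never overlap. This is an illustration, not SV2's code.
    """
    rng = random.Random(seed)  # fixed seed makes the sampling reproducible
    n_bins = chrom_length // region_size
    k = min(n_regions, n_bins)  # short contigs may hold fewer windows
    bins = rng.sample(range(n_bins), k)
    return sorted((b * region_size, (b + 1) * region_size) for b in bins)


regions = sample_regions(50_000_000)
print(len(regions), regions[0])
```

Because the generator is seeded, repeated runs with the same seed select the same windows, which is the property the seed option exists to control.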
Preprocessing output can be found in sv2_preprocessing/ in the current working directory. If preprocessing has been completed, supplying -pre sv2_preprocessing/ to the SV2 command will bypass this stage.
Before feature extraction, a mask is applied to SV regions. The mask includes segmental duplications, short tandem repeats, centromeres, telomeres, and unmappable regions and is included at $SV2_INSTALL_DIR/sv2/sv2/resources/annotation_files/*excluded.bed.gz. The genome mask with merged positions is located at $SV2_INSTALL_DIR/sv2/sv2/resources/*_excluded.bed.gz. SV calls that completely overlap masked elements cannot be genotyped and are represented as ./. in the output VCF.
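As a rough illustration of the masking rule, the sketch below checks whether an SV lies entirely within masked intervals on the same chromosome. It assumes half-open (start, end) coordinates as in BED files and reads "completely overlap" as "no unmasked base remains"; the function name `is_fully_masked` is hypothetical and not part of SV2.

```python
def is_fully_masked(sv_start, sv_end, mask_intervals):
    """Return True if [sv_start, sv_end) is entirely covered by the mask.

    mask_intervals: iterable of half-open (start, end) tuples on the same
    chromosome as the SV. Intervals may overlap or touch.
    """
    covered_to = sv_start  # everything left of this position is masked
    for start, end in sorted(mask_intervals):
        if start > covered_to:
            break  # an unmasked gap begins at covered_to
        covered_to = max(covered_to, end)
        if covered_to >= sv_end:
            return True
    return covered_to >= sv_end


# An SV spanning two adjacent masked elements would be reported as ./.
print(is_fully_masked(100, 200, [(50, 150), (150, 250)]))
# A 10bp unmasked gap leaves the SV genotypable
print(is_fully_masked(100, 200, [(50, 150), (160, 250)]))
```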
SV2 genotypes using four features of SV: depth of coverage, discordant paired-end, split-reads, and heterozygous allele ratio. Feature extraction output is located in sv2_features/ in the current working directory. If feature extraction has completed and a VCF of multiple samples is desired, supplying the option -feats sv2_features/ to the SV2 command will skip feature extraction.
Depth of coverage is estimated as the number of reads spanning a locus for SVs greater than 1000bp. For smaller SVs, depth of coverage is recorded as the median per-base-pair read depth.
Coverage features are first normalized according to the median chromosome coverage. For SVs overlapping pseudoautosomal regions on male sex chromosomes, normalization uses the median genome coverage. Normalized coverage is then corrected for GC content, adapted from CNVnator, for either PCR or PCR-free libraries.
For PCR-free libraries supply the -pcrfree flag.
SV2 cannot genotype SVs when normalized coverage exceeds 5.0 (10 copies for diploid samples).
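The arithmetic behind coverage normalization and the 5.0 limit can be sketched as follows. GC correction is deliberately left out, and the function names are hypothetical; this only shows how a normalized coverage of 5.0 corresponds to 10 copies in a diploid sample.

```python
def normalized_coverage(sv_coverage, chrom_median_coverage):
    """Divide the SV's coverage by the chromosome's median coverage."""
    if chrom_median_coverage <= 0:
        raise ValueError("median chromosome coverage must be positive")
    return sv_coverage / chrom_median_coverage


def estimated_copies(norm_cov, ploidy=2):
    """A normalized coverage of 1.0 corresponds to `ploidy` copies."""
    return norm_cov * ploidy


def can_genotype(norm_cov, max_norm_cov=5.0):
    """SVs above the normalized coverage limit are not genotyped."""
    return norm_cov <= max_norm_cov


norm = normalized_coverage(45, 30)  # 45x over the SV, 30x chromosome median
print(norm, estimated_copies(norm), can_genotype(norm))
```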
Discordant paired-ends have insert sizes greater than the chromosome median plus five times the median absolute deviation. SV2 only considers discordant paired-ends if both mates bridge the putative breakpoint by +/- 500bp. Likewise, SV2 requires that both mates rest on opposite sides of the breakpoint. The number of discordant paired-ends that meet these criteria is then normalized by the number of concordant paired-ends that span 500bp windows of the start and end positions of the SV.
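The insert size cutoff described above (median plus five times the median absolute deviation) can be computed as in the sketch below. The input is assumed to be a plain list of insert sizes already collected for a chromosome; the function names are hypothetical and the breakpoint filters are not shown.

```python
from statistics import median


def discordant_threshold(insert_sizes, n_mads=5):
    """Median insert size plus n_mads median absolute deviations."""
    med = median(insert_sizes)
    mad = median([abs(size - med) for size in insert_sizes])
    return med + n_mads * mad


def is_discordant(insert_size, threshold):
    """A pair is discordant if its insert size exceeds the cutoff."""
    return insert_size > threshold


sizes = [300, 310, 290, 305, 295]
cutoff = discordant_threshold(sizes)
print(cutoff, is_discordant(400, cutoff))
```

The median absolute deviation is used rather than the standard deviation because a handful of very large insert sizes, which are exactly the reads of interest, would otherwise inflate the cutoff.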
Split-reads are those with supplementary alignments. To reduce noise, SV2 only considers split-reads if the primary and supplementary alignments bridge the breakpoint by +/- 500bp. Likewise, both alignments must map to opposite sides of the breakpoint. The resulting number of split-reads is normalized to the number of concordant reads that span 500bp windows of the start and end positions of the SV.
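Both the discordant paired-end and split-read features are counts of supporting reads divided by a count of concordant reads. A minimal sketch of that ratio is shown below; returning None when no concordant reads are present is an assumption made for this example, not documented SV2 behavior, and the function name is hypothetical.

```python
def normalized_support(n_supporting, n_concordant):
    """Ratio of supporting reads to concordant reads near the breakpoints.

    Assumption: with no concordant reads the ratio is undefined, so None
    is returned instead of dividing by zero.
    """
    if n_concordant == 0:
        return None
    return n_supporting / n_concordant


print(normalized_support(6, 24))
```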
Akin to B-allele frequency in microarrays, heterozygous allele ratio is defined as the median ratio of the coverage of the allele with less coverage to the coverage of the allele with more coverage, taken over every heterozygous SNV within the SV. This feature is parsed from a SNV VCF that contains either AD or DPR in the format column.
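The definition above can be written out as a short sketch. It assumes the allele depths of the heterozygous SNVs inside the SV have already been parsed from the VCF into (ref_depth, alt_depth) pairs; the function name is hypothetical, and returning None when no usable SNV exists is a choice made for this example.

```python
from statistics import median


def heterozygous_allele_ratio(allele_depths):
    """Median of (lower allele depth / higher allele depth) across SNVs.

    allele_depths: list of (ref_depth, alt_depth) tuples, one per
    heterozygous SNV within the SV.
    """
    ratios = []
    for ref_depth, alt_depth in allele_depths:
        higher = max(ref_depth, alt_depth)
        if higher == 0:
            continue  # no coverage at this site, skip it
        ratios.append(min(ref_depth, alt_depth) / higher)
    return median(ratios) if ratios else None


# Balanced alleles give a ratio near 1.0; a duplication of one haplotype
# pushes the ratio toward 0.5.
print(heterozygous_allele_ratio([(10, 10), (20, 10), (5, 15)]))
```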
SV2 genotypes SV with six support vector machine classifiers that are trained with respect to SV type and length.
Each classifier, with the exception of Duplication SNV, implements depth of coverage, discordant paired-ends, and split-reads as features. The Duplication SNV classifier employs depth of coverage and heterozygous allele ratio as features.
Genotyping output in BED and VCF format is located in sv2_genotypes/ in the current working directory.