Introduction

SV² (support-vector structural-variant genotyper) is a machine learning algorithm for genotyping deletions and duplications from paired-end whole genome sequencing data. SV² can rapidly integrate variant calls from multiple SV discovery algorithms into a unified callset with high genotyping accuracy and detection of de novo mutations.

SV² is open source software written in Python/Cython that exploits four features of SV in a supervised support vector machine classifier: read depth, discordant paired-ends, split-reads, and heterozygous allele ratio.

Required inputs include a BAM/CRAM file with supplementary alignment tags (SA), a SNV VCF file with allele depth, and either a BED or VCF of SV to genotype. SV² operates in three stages: preprocessing, feature extraction, and genotyping. SV² outputs a BED and VCF file of genotypes along with annotations for genes, repeats, and other statistics relevant to SV analysis.

For more information, or to cite SV², please refer to the publication.

Preprocessing

SV² preprocessing records the median coverage, insert size, and read length for each chromosome for downstream normalization of features. Preprocessing statistics are obtained for each chromosome from 100 random nonoverlapping regions, each 100kb in length. The default random seed is 42, which can be changed with the [-s|-seed] INT option.
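The region sampling described above can be sketched as follows. This is a minimal illustration, not SV²'s own code: the function name `sample_regions` and the tiling strategy (split the chromosome into fixed 100kb tiles, then draw tiles without replacement) are assumptions chosen because they guarantee nonoverlapping windows.

```python
import random


def sample_regions(chrom_length, n_regions=100, region_size=100_000, seed=42):
    """Pick up to n_regions nonoverlapping windows of region_size bp.

    Hypothetical helper: tiles the chromosome into fixed windows and samples
    tiles without replacement, so two chosen windows can never overlap.
    """
    rng = random.Random(seed)  # seeded generator gives reproducible regions
    n_tiles = chrom_length // region_size
    chosen = rng.sample(range(n_tiles), min(n_regions, n_tiles))
    # Return half-open (start, end) coordinates sorted by position.
    return sorted((i * region_size, (i + 1) * region_size) for i in chosen)


regions = sample_regions(50_000_000)
print(len(regions), regions[0])
```

Running the function twice with the same seed returns the same regions, which is the purpose of a fixed default seed.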

Preprocessing output can be found in sv2_preprocessing/ in the current working directory. If preprocessing has been completed, supplying -pre sv2_preprocessing/ to the SV² command will bypass this stage.

Feature Extraction

Before feature extraction, a mask is applied to SV regions. The mask includes segmental duplications, short tandem repeats, centromeres, telomeres, and unmappable regions, and is included here: $SV2_INSTALL_DIR/sv2/sv2/resources/annotation_files/*excluded.bed.gz. The genome mask with merged positions is located here: $SV2_INSTALL_DIR/sv2/sv2/resources/*_excluded.bed.gz. SV calls that completely overlap masked elements cannot be genotyped and are represented as ./. in the output VCF.
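As a rough sketch of the "completely overlap" test, the check below sums the masked bases inside an SV and compares the total to the SV length. The function name `is_fully_masked` and the half-open coordinate convention are assumptions for illustration; SV² may implement the overlap differently.

```python
def is_fully_masked(sv_start, sv_end, mask):
    """Return True if every base of the SV falls inside the mask.

    mask is assumed to be a list of merged, nonoverlapping half-open
    (start, end) intervals on the same chromosome as the SV.
    """
    covered = 0
    for m_start, m_end in mask:
        # Length of the intersection between the SV and this mask interval.
        overlap = min(sv_end, m_end) - max(sv_start, m_start)
        if overlap > 0:
            covered += overlap
    return covered >= sv_end - sv_start


print(is_fully_masked(100, 200, [(50, 250)]))   # SV lies entirely in the mask
print(is_fully_masked(100, 200, [(50, 150)]))   # only half of the SV is masked
```

Under this sketch, a call for which the function returns True would be the kind reported as ./. in the output VCF.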

SV² genotypes using four features of SV: depth of coverage, discordant paired-ends, split-reads, and heterozygous allele ratio. Feature extraction output is located in sv2_features/ in the current working directory. If feature extraction has completed and a VCF of multiple samples is desired, supplying the option -feats sv2_features/ to the SV² command will skip feature extraction.

Depth of Coverage

Depth of coverage is estimated via the number of reads spanning a locus for SVs greater than 1000bp. For smaller SVs, depth of coverage is recorded as the median per-base-pair read depth.

Coverage features are first normalized according to the median chromosome coverage. For SVs overlapping pseudoautosomal regions on male sex chromosomes, normalization uses the median genome coverage instead. Normalized coverage is then corrected for GC content, with a method adapted from CNVnator, for either PCR or PCR-free libraries.

For PCR-free libraries supply the -pcrfree flag.

SV² cannot genotype SVs when normalized coverage exceeds 5.0 (10 copies for diploid samples).
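The relationship between normalized coverage and copy number can be illustrated with a short sketch. The function names are hypothetical, GC correction is omitted, and the only values taken from the text are the 5.0 limit and the 10-copy equivalence for diploid samples.

```python
MAX_NORMALIZED_COVERAGE = 5.0  # limit stated in the documentation


def normalized_coverage(sv_depth, chrom_median_depth):
    """Depth within the SV relative to the chromosome median (no GC correction)."""
    return sv_depth / chrom_median_depth


def estimated_copies(norm_cov, ploidy=2):
    """A normalized coverage of 1.0 corresponds to `ploidy` copies."""
    return norm_cov * ploidy


def can_genotype(norm_cov):
    """SVs above the coverage limit are not genotyped."""
    return norm_cov <= MAX_NORMALIZED_COVERAGE


cov = normalized_coverage(sv_depth=15.0, chrom_median_depth=30.0)
print(cov, estimated_copies(cov), can_genotype(cov))
```

In this example a region with half the chromosome's median depth yields a normalized coverage of 0.5, consistent with a one-copy (heterozygous deletion) state in a diploid sample.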

Discordant Paired-Ends

Discordant paired-ends have insert sizes greater than the chromosome median plus five times the median absolute deviation. SV² only considers discordant paired-ends if both mates bridge the putative breakpoint by +/- 500bp. Likewise, SV² requires that the mates rest on opposite sides of the breakpoint. The number of discordant paired-ends that meet these criteria is then normalized by the number of concordant paired-ends that span 500bp windows of the start and end positions of the SV.
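The insert-size cutoff (median plus five median absolute deviations) can be computed as in the sketch below. The function names and the example insert sizes are illustrative assumptions; the breakpoint and normalization steps are not modeled here.

```python
from statistics import median


def discordant_cutoff(insert_sizes, n_mads=5):
    """Median insert size plus n_mads times the median absolute deviation."""
    med = median(insert_sizes)
    mad = median([abs(x - med) for x in insert_sizes])
    return med + n_mads * mad


def is_discordant(insert_size, cutoff):
    """A pair is discordant when its insert size is greater than the cutoff."""
    return insert_size > cutoff


sizes = [290, 300, 300, 310, 320]  # made-up insert sizes for one chromosome
cutoff = discordant_cutoff(sizes)
print(cutoff, is_discordant(400, cutoff))
```

For these example values the median is 300 and the median absolute deviation is 10, giving a cutoff of 350.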

Split-Reads

Split-reads are reads with supplementary alignments. To reduce noise, SV² only considers split-reads if the primary and supplementary alignments bridge the breakpoint by +/- 500bp. Likewise, the two alignments must map to opposite sides of the breakpoint. The resulting number of split-reads is normalized to the number of concordant reads that span 500bp windows of the start and end positions of the SV.

Heterozygous Allele Ratio

Akin to B-allele frequency in microarrays, heterozygous allele ratio is defined as the median ratio of the allele with less coverage to the allele with more coverage across every heterozygous SNV within the SV. This feature is parsed from a SNV VCF that contains either AD or DPR in the format column.
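The definition above translates directly into a few lines of code. This sketch assumes the per-SNV allele depths have already been parsed from the VCF into (ref, alt) pairs; the function name and the choice to return None when no usable SNV is present are illustrative assumptions.

```python
from statistics import median


def heterozygous_allele_ratio(allele_depths):
    """Median of (lower-coverage allele / higher-coverage allele).

    allele_depths: list of (ref_depth, alt_depth) tuples, one per
    heterozygous SNV inside the SV. Returns None if there are none
    (an assumption; SV² may handle this case differently).
    """
    ratios = [min(a, b) / max(a, b) for a, b in allele_depths if max(a, b) > 0]
    return median(ratios) if ratios else None


# Three made-up heterozygous SNVs inside a putative duplication.
print(heterozygous_allele_ratio([(10, 20), (15, 15), (30, 10)]))
```

A ratio near 1.0 suggests balanced alleles, whereas a three-copy duplication would be expected to pull the ratio toward 0.5.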

Genotyping

SV² genotypes SV with six support vector machine classifiers that are trained with respect to SV type and length.

Each classifier, with the exception of Duplication SNV, uses depth of coverage, discordant paired-ends, and split-reads as features. The Duplication SNV classifier uses depth of coverage and heterozygous allele ratio as features.

Genotyping output in BED and VCF format is located in sv2_genotypes/ in the current working directory.
