Skip to content

Commit

Permalink
Merge pull request #3 from TRON-Bioinformatics/update-vafator
Browse files Browse the repository at this point in the history
Update vafator to v1.2.0 + add SnpEff step
  • Loading branch information
priesgo authored Nov 22, 2021
2 parents d423297 + a2bd6a6 commit e6c107d
Show file tree
Hide file tree
Showing 20 changed files with 165 additions and 117 deletions.
99 changes: 76 additions & 23 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,23 +1,37 @@
# TronFlow variant normalization pipeline
# TronFlow VCF postprocessing

![GitHub tag (latest SemVer)](https://img.shields.io/github/v/release/tron-bioinformatics/tronflow-variant-normalization?sort=semver)
[![Run tests](https://github.com/TRON-Bioinformatics/tronflow-variant-normalization/actions/workflows/automated_tests.yml/badge.svg?branch=master)](https://github.com/TRON-Bioinformatics/tronflow-variant-normalization/actions/workflows/automated_tests.yml)
[![DOI](https://zenodo.org/badge/372133189.svg)](https://zenodo.org/badge/latestdoi/372133189)
[![License](https://img.shields.io/badge/license-MIT-green)](https://opensource.org/licenses/MIT)
[![Powered by Nextflow](https://img.shields.io/badge/powered%20by-Nextflow-orange.svg?style=flat&colorA=E1523D&colorB=007D8A)](https://www.nextflow.io/)

The TronFlow variant normalization pipeline is part of a collection of computational workflows for tumor-normal pair
The TronFlow VCF postprocessing pipeline is part of a collection of computational workflows for tumor-normal pair
somatic variant calling.
These workflows are implemented in the Nextflow (Di Tommaso, 2017) framework.

Find the documentation here [![Documentation Status](https://readthedocs.org/projects/tronflow-docs/badge/?version=latest)](https://tronflow-docs.readthedocs.io/en/latest/?badge=latest)


This pipeline aims at normalizing variants represented in a VCF into the convened normal form as described in Tan 2015.
The variant normalization is based on the implementation in vt (Tan 2015) and bcftools (Danecek 2021).
The pipeline is implemented on the Nextflow (Di Tommaso 2017) framework.
This pipeline has several objectives:
* Variant filtering
* Variant normalization
* Technical annotations from different BAM files
* Functional annotations

## Variant filtering

Optionally, only variants with the value in the column `FILTER` matching the value of parameter `--filter` are kept.
If this parameter is not used not variants are filtered out. Multiple values can be passed separated by commas without spaces.

For instance, `--filter PASS,.` will keep variant having `FILTER=PASS` or `FILTER=.`, but remove all others.


## Variant normalization

The normalization step aims to represent variants into the convened normal form as described in Tan 2015.
The variant normalization is based on the implementation in vt (Tan, 2015) and bcftools (Danecek, 2021).

The pipeline consists of the following steps:
* Variant filtering (optional)
The normalization pipeline consists of the following steps:
* Decomposition of MNPs into atomic variants (ie: AC > TG is decomposed into two variants A>T and C>G) (optional).
* Decomposition of multiallelic variants into biallelic variants (ie: A > C,G is decomposed into two variants A > C and A > G)
* Trim redundant sequence and left align indels, indels in repetitive sequences can have multiple representations
Expand All @@ -34,9 +48,9 @@ The output consists of:
![Pipeline](images/variant_normalization_pipeline.png)


## What is variant normalization?
### What is variant normalization?

### Variants are trimmed removing redundant bases
#### Variants are trimmed removing redundant bases

Before:
```
Expand All @@ -49,7 +63,7 @@ chr1 13085 . A C . UNTRIMMED OLD_CLUMPED=chr1|13083|AGA|AGC GT:AD 0:276,276 0/1:

**NOTE**: when a variant is changed during normalization the original variant is kept in the `INFO` field `OLD_CLUMPED`.

### Indels are left aligned
#### Indels are left aligned

If an indel lies within a repetitive sequence it can be reported on different positions, the convention is to report
the left most indel. This is what is called left alignment.
Expand All @@ -69,7 +83,7 @@ chr1 13140 . CCTGAG C . UNALIGNED OLD_CLUMPED=chr1|13141|CTGAGG|G GT:AD 0:80,1 0
**NOTE**: the rule of thumb to confirm if any given indel is left aligned, assuming that they are trimmed, is that the last base in both reference and
alternate must differ.

### Multi-allelic variants are split.
#### Multi-allelic variants are split.

Before:
```
Expand All @@ -84,7 +98,7 @@ chr1 13204 . C T . MULTIALLELIC OLD_VARIANT=chr1|13204|C|G,|2 GT:AD 0:229,229 0/
Note that the AD values are incorrectly set after the split, this seems to be a regression issue in bcftools v1.12
and has been reported here https://github.com/samtools/bcftools/issues/1499.

### Multi Nucleotide Variants (MNVs) can be decomposed
#### Multi Nucleotide Variants (MNVs) can be decomposed

Before:
```
Expand All @@ -101,7 +115,7 @@ chr1 13266 . T C . MNV OLD_CLUMPED=chr1:13261:GCTCCT/CCCCCC GT:AD:PS 0:41,0:1326
This behaviour is optional and can be disabled with `--skip_decompose_complex`. Beware, that the phase is maintained in
the fields `FORMAT/GT` and `FORMAT/PS` as described in the VCF specification section 1.4.2.

### Complex variants combining SNV and indels can be decomposed
#### Complex variants combining SNV and indels can be decomposed

Before:
```
Expand All @@ -118,6 +132,36 @@ chr1 13325 . CT C . MNV-INDEL OLD_CLUMPED=chr1:13321:AGCCCT/CGCC GT:AD:PS 0:229,
Same as MNVs this behaviour can de disabled with `--skip_decompose_complex`.


## Technical annotations

The technical annotations provide an insight on the variant calling process by looking into the context of each variant
within the pileup of a BAM file. When doing somatic variant calling it may be relevant to have technical annotations
for the same variant in a patient from multiple BAM files.
These annotations are provided by VAFator (https://github.com/TRON-Bioinformatics/vafator).

## Functional annotations

The functional annotations provide a biological context for every variant. Such as the overlapping genes or the effect
of the variant in a protein. These annotations are provided by SnpEff (Cingolani, 2012).

The SnpEff available human annotations are:
* GRCh37.75
* GRCh38.99
* hg19
* hg19kg
* hg38
* hg38kg

Before running the functional annotations you will need to download the reference genome you need to use.
This can be done as follows: `snpEff download -dataDir /your/snpeff/folder -v hg19`

When running indicate the right reference genome like `--snpeff_organism hg19`.
If none is provided no SnpEff annotations will be provided.
Provide the snpEff folder with `--snpeff_datadir`
To provide any additional SnpEff arguments use `--snpeff_args` such as
`--snpeff_args "-noStats -no-downstream -no-upstream -no-intergenic -no-intron -onlyProtein -hgvs1LetterAa -noShiftHgvs"`,
otherwise defaults will be used.


## How to run it

Expand Down Expand Up @@ -167,22 +211,31 @@ Output:

The table with VCF files expects two tab-separated columns without a header

| Sample name | VCF |
| Patient name | VCF |
|----------------------|------------------------------------------------------------------------|
| sample_1 | /path/to/sample_1.vcf |
| sample_2 | /path/to/sample_2.vcf |
| patient_1 | /path/to/patient_1.vcf |
| patient_2 | /path/to/patient_2.vcf |

The optional table with BAM files expects three tab-separated columns without a header. Multiple comma-separated BAMs can be provided.
The optional table with BAM files expects two tab-separated columns without a header.

| Sample name | Tumor BAMs | Normal BAMs |
|----------------------|---------------------------------|------------------------------|
| sample_1 | /path/to/sample_1.tumor_1.bam,/path/to/sample_1.tumor_2.bam | /path/to/sample_1.normal_1.bam,/path/to/sample_1.normal_2.bam |
| sample_2 | /path/to/sample_2.tumor.bam | /path/to/sample_2.normal.bam |
| Patient name | Sample name:BAM |
|----------------------|---------------------------------|
| patient_1 | primary_tumor:/path/to/sample_1.primary.bam |
| patient_1 | metastasis_tumor:/path/to/sample_1.metastasis.bam |
| patient_1 | normal:/path/to/sample_1.normal.bam |
| patient_2 | primary_tumor:/path/to/sample_1.primary_1.bam |
| patient_2 | primary_tumor:/path/to/sample_1.primary_2.bam |
| patient_2 | metastasis_tumor:/path/to/sample_1.metastasis.bam |
| patient_2 | normal:/path/to/sample_1.normal.bam |

Each patient can have any number of samples. Any sample can have any number of BAM files, annotations from the
different BAM files of the same sample will be provided with suffixes _1, _2, etc.
The aggregated vafator annotations on each sample will also be provided without a suffix.



## References

* Adrian Tan, Gonçalo R. Abecasis and Hyun Min Kang. Unified Representation of Genetic Variants. Bioinformatics (2015) 31(13): 2202-2204](http://bioinformatics.oxfordjournals.org/content/31/13/2202) and uses bcftools [Li, H. (2011). A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics (Oxford, England), 27(21), 2987–2993. 10.1093/bioinformatics/btr509
* Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, Whitwham A, Keane T, McCarthy SA, Davies RM, Li H. Twelve years of SAMtools and BCFtools. Gigascience. 2021 Feb 16;10(2):giab008. doi: 10.1093/gigascience/giab008. PMID: 33590861; PMCID: PMC7931819.
* Di Tommaso, P., Chatzou, M., Floden, E. W., Barja, P. P., Palumbo, E., & Notredame, C. (2017). Nextflow enables reproducible computational workflows. Nature Biotechnology, 35(4), 316–319. 10.1038/nbt.3820
* Cingolani P, Platts A, Wang le L, Coon M, Nguyen T, Wang L, Land SJ, Lu X, Ruden DM. (2012) A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3.". Fly (Austin). 2012 Apr-Jun;6(2):80-92. PMID: 22728672
5 changes: 0 additions & 5 deletions bin/filter_passed.sh

This file was deleted.

47 changes: 0 additions & 47 deletions bin/normalization.sh

This file was deleted.

7 changes: 0 additions & 7 deletions bin/summary.sh

This file was deleted.

30 changes: 20 additions & 10 deletions main.nf
Original file line number Diff line number Diff line change
Expand Up @@ -2,10 +2,11 @@

nextflow.enable.dsl = 2

include { BCFTOOLS_NORM; VT_DECOMPOSE_COMPLEX; REMOVE_DUPLICATES } from './modules/normalization'
include { FILTER_VCF } from './modules/filter'
include { SUMMARY_VCF; SUMMARY_VCF as SUMMARY_VCF_2 } from './modules/summary'
include { VAFATOR; MULTIALLELIC_FILTER } from './modules/vafator'
include { FILTER_VCF } from './modules/01_filter'
include { BCFTOOLS_NORM; VT_DECOMPOSE_COMPLEX; REMOVE_DUPLICATES } from './modules/02_normalization'
include { SUMMARY_VCF; SUMMARY_VCF as SUMMARY_VCF_2 } from './modules/03_summary'
include { VAFATOR; MULTIALLELIC_FILTER } from './modules/04_vafator'
include { VARIANT_ANNOTATION } from './modules/05_variant_annotation'

params.help= false
params.input_vcfs = false
Expand All @@ -21,14 +22,20 @@ params.vcf_without_ad = false
params.mapping_quality = false
params.base_call_quality = false
params.skip_multiallelic_filter = false
params.prefix = false
params.snpeff_organism = false
params.snpeff_args = ""
params.snpeff_datadir = false


if (params.help) {
log.info params.help_message
exit 0
}

if ( params.snpeff_organism && ! params.snpeff_datadir) {
exit 1, "To run snpEff, please, provide your snpEff data folder with --snpeff_datadir"
}

if (! params.input_vcfs && ! params.input_vcf) {
exit 1, "Neither --input_vcfs or --input_vcf are provided!"
}
Expand All @@ -50,8 +57,8 @@ else if (params.input_vcf) {
if (params.input_bams) {
Channel
.fromPath(params.input_bams)
.splitCsv(header: ['name', 'tumor_bams', 'normal_bams'], sep: "\t")
.map{ row-> tuple(row.name, row.tumor_bams, row.normal_bams) }
.splitCsv(header: ['name', 'sample_name', 'bam'], sep: "\t")
.map{ row-> tuple(row.name, row.sample_name, row.bam) }
.set { input_bams }
}

Expand All @@ -74,15 +81,18 @@ workflow {

SUMMARY_VCF_2(final_vcfs)

if ( params.input_bams) {
VAFATOR(final_vcfs.join(input_bams))
if ( params.input_bams ) {
VAFATOR(final_vcfs.join(input_bams.groupTuple()))
final_vcfs = VAFATOR.out.annotated_vcf
if ( ! params.skip_multiallelic_filter ) {
final_vcfs = MULTIALLELIC_FILTER(final_vcfs)
final_vcfs = MULTIALLELIC_FILTER.out.filtered_vcf
}
}

final_vcfs.map {it.join("\t")}.collectFile(name: "${params.output}/normalized_vcfs.txt", newLine: true)
if (params.snpeff_organism) {
VARIANT_ANNOTATION(final_vcfs)
final_vcfs = VARIANT_ANNOTATION.out.annotated_vcf
}
}

File renamed without changes.
File renamed without changes.
File renamed without changes.
22 changes: 10 additions & 12 deletions modules/vafator.nf → modules/04_vafator.nf
Original file line number Diff line number Diff line change
Expand Up @@ -4,33 +4,31 @@ params.output = ""
params.mapping_quality = false
params.base_call_quality = false
params.skip_multiallelic_filter = false
params.prefix = false
params.enable_conda = false


process VAFATOR {
cpus params.cpus
memory params.memory
tag "${name}"
publishDir "${params.output}/${name}", mode: "copy"
tag "${patient_name}"
publishDir "${params.output}/${patient_name}", mode: "copy"

conda (params.enable_conda ? "bioconda::vafator=0.4.0" : null)
conda (params.enable_conda ? "bioconda::vafator=1.1.2" : null)

input:
tuple val(name), file(vcf), val(normal_bams), val(tumor_bams)
tuple val(patient_name), file(vcf), val(bams)

output:
tuple val(name), file("${vcf.baseName}.vaf.vcf"), emit: annotated_vcf
tuple val(patient_name), file("${vcf.baseName}.vaf.vcf"), emit: annotated_vcf

script:
normal_bams_param = normal_bams?.trim() ? "--normal-bams " + normal_bams.split(",").join(" ") : ""
tumor_bams_param = tumor_bams?.trim() ? "--tumor-bams " + tumor_bams.split(",").join(" ") : ""
bams_param = bams.collect { b -> "--bam " + b.split(":").join(" ") }.join(" ")
mq_param = params.mapping_quality ? "--mapping-quality " + params.mapping_quality : ""
bq_param = params.base_call_quality ? "--base-call-quality " + params.base_call_quality : ""
prefix_param = params.prefix ? "--prefix " + params.prefix : ""
"""
vafator \
--input-vcf ${vcf} \
--output-vcf ${vcf.baseName}.vaf.vcf ${normal_bams_param} ${tumor_bams_param} ${mq_param} ${bq_param} ${prefix_param}
--output-vcf ${vcf.baseName}.vaf.vcf ${bams_param} ${mq_param} ${bq_param}
"""
}

Expand All @@ -41,13 +39,13 @@ process MULTIALLELIC_FILTER {
tag "${name}"
publishDir "${params.output}/${name}", mode: "copy"

conda (params.enable_conda ? "bioconda::vafator=0.4.0" : null)
conda (params.enable_conda ? "bioconda::vafator=1.1.2" : null)

input:
tuple val(name), file(vcf)

output:
tuple val(name), file("${vcf.baseName}.filtered_multiallelics.vcf"), emit: filtered_vcf
tuple file("${vcf.baseName}.filtered_multiallelics.vcf"), emit: filtered_vcf

script:
"""
Expand Down
27 changes: 27 additions & 0 deletions modules/05_variant_annotation.nf
Original file line number Diff line number Diff line change
@@ -0,0 +1,27 @@
params.memory = "3g"
params.cpus = 1
params.output = "."
params.snpeff_datadir = false
params.snpeff_organism = false
params.snpeff_args = ""


process VARIANT_ANNOTATION {
cpus params.cpus
memory params.memory
publishDir "${params.output}/${name}", mode: "copy"

conda (params.enable_conda ? "bioconda::snpeff=5.0" : null)

input:
tuple val(name), file(vcf)

output:
tuple val(name), file("${name}.annotated.vcf") , emit: annotated_vcf

script:
datadir_arg = params.snpeff_datadir ? "-dataDir ${params.snpeff_datadir}" : ""
"""
snpEff eff ${datadir_arg} ${params.snpeff_args} -nodownload ${params.snpeff_organism} ${vcf} > ${name}.annotated.vcf
"""
}
2 changes: 1 addition & 1 deletion nextflow.config
Original file line number Diff line number Diff line change
Expand Up @@ -27,7 +27,7 @@ env {
// Capture exit codes from upstream processes when piping
process.shell = ['/bin/bash', '-euo', 'pipefail']

VERSION = '2.0.0'
VERSION = '2.1.0'

cleanup=true

Expand Down
File renamed without changes.
Loading

0 comments on commit e6c107d

Please sign in to comment.