Merge pull request #3 from TRON-Bioinformatics/update-vafator

Update vafator to v1.2.0 + add SnpEff step
TRON-Bioinformatics · Nov 22, 2021 · e6c107d · e6c107d
2 parents d423297 + a2bd6a6
commit e6c107d
Show file tree

Hide file tree

Showing 20 changed files with 165 additions and 117 deletions.
diff --git a/README.md b/README.md
@@ -1,23 +1,37 @@
-# TronFlow variant normalization pipeline
+# TronFlow VCF postprocessing
 
 ![GitHub tag (latest SemVer)](https://img.shields.io/github/v/release/tron-bioinformatics/tronflow-variant-normalization?sort=semver)
 [![Run tests](https://github.com/TRON-Bioinformatics/tronflow-variant-normalization/actions/workflows/automated_tests.yml/badge.svg?branch=master)](https://github.com/TRON-Bioinformatics/tronflow-variant-normalization/actions/workflows/automated_tests.yml)
 [![DOI](https://zenodo.org/badge/372133189.svg)](https://zenodo.org/badge/latestdoi/372133189)
 [![License](https://img.shields.io/badge/license-MIT-green)](https://opensource.org/licenses/MIT)
 [![Powered by Nextflow](https://img.shields.io/badge/powered%20by-Nextflow-orange.svg?style=flat&colorA=E1523D&colorB=007D8A)](https://www.nextflow.io/)
 
-The TronFlow variant normalization pipeline is part of a collection of computational workflows for tumor-normal pair 
+The TronFlow VCF postprocessing pipeline is part of a collection of computational workflows for tumor-normal pair 
 somatic variant calling. 
+These workflows are implemented in the Nextflow (Di Tommaso, 2017) framework.
 
 Find the documentation here [![Documentation Status](https://readthedocs.org/projects/tronflow-docs/badge/?version=latest)](https://tronflow-docs.readthedocs.io/en/latest/?badge=latest)
-
 
-This pipeline aims at normalizing variants represented in a VCF into the convened normal form as described in Tan 2015. 
-The variant normalization is based on the implementation in vt (Tan 2015) and bcftools (Danecek 2021). 
-The pipeline is implemented on the Nextflow (Di Tommaso 2017) framework.
+This pipeline has several objectives:
+* Variant filtering
+* Variant normalization
+* Technical annotations from different BAM files
+* Functional annotations
+
+## Variant filtering
+
+Optionally, only variants with the value in the column `FILTER` matching the value of parameter `--filter` are kept.
+If this parameter is not used not variants are filtered out. Multiple values can be passed separated by commas without spaces.
+
+For instance, `--filter PASS,.` will keep variant having `FILTER=PASS` or `FILTER=.`, but remove all others.
+
+
+## Variant normalization
+
+The normalization step aims to represent variants into the convened normal form as described in Tan 2015. 
+The variant normalization is based on the implementation in vt (Tan, 2015) and bcftools (Danecek, 2021).
 
-The pipeline consists of the following steps:
- * Variant filtering (optional)
+The normalization pipeline consists of the following steps:
  * Decomposition of MNPs into atomic variants (ie: AC > TG is decomposed into two variants A>T and C>G) (optional).
  * Decomposition of multiallelic variants into biallelic variants (ie: A > C,G is decomposed into two variants A > C and A > G)
  * Trim redundant sequence and left align indels, indels in repetitive sequences can have multiple representations
@@ -34,9 +48,9 @@ The output consists of:
 ![Pipeline](images/variant_normalization_pipeline.png)
 
 
-## What is variant normalization?
+### What is variant normalization?
 
-### Variants are trimmed removing redundant bases
+#### Variants are trimmed removing redundant bases
 
 Before:
 ```
@@ -49,7 +63,7 @@ chr1	13085	.	A	C	.	UNTRIMMED	OLD_CLUMPED=chr1|13083|AGA|AGC	GT:AD	0:276,276	0/1:
 
 **NOTE**: when a variant is changed during normalization the original variant is kept in the `INFO` field `OLD_CLUMPED`.
 
-### Indels are left aligned
+#### Indels are left aligned
 
 If an indel lies within a repetitive sequence it can be reported on different positions, the convention is to report 
 the left most indel. This is what is called left alignment.
@@ -69,7 +83,7 @@ chr1	13140	.	CCTGAG	C	.	UNALIGNED	OLD_CLUMPED=chr1|13141|CTGAGG|G	GT:AD	0:80,1	0
 **NOTE**: the rule of thumb to confirm if any given indel is left aligned, assuming that they are trimmed, is that the last base in both reference and 
 alternate must differ.
 
-### Multi-allelic variants are split.
+#### Multi-allelic variants are split.
 
 Before:
 ```
@@ -84,7 +98,7 @@ chr1	13204	.	C	T	.	MULTIALLELIC	OLD_VARIANT=chr1|13204|C|G,|2	GT:AD	0:229,229	0/
 Note that the AD values are incorrectly set after the split, this seems to be a regression issue in bcftools v1.12 
 and has been reported here https://github.com/samtools/bcftools/issues/1499.
 
-### Multi Nucleotide Variants (MNVs) can be decomposed
+#### Multi Nucleotide Variants (MNVs) can be decomposed
 
 Before:
 ```
@@ -101,7 +115,7 @@ chr1	13266	.	T	C	.	MNV	OLD_CLUMPED=chr1:13261:GCTCCT/CCCCCC	GT:AD:PS	0:41,0:1326
 This behaviour is optional and can be disabled with `--skip_decompose_complex`. Beware, that the phase is maintained in
 the fields `FORMAT/GT` and `FORMAT/PS` as described in the VCF specification section 1.4.2.
 
-### Complex variants combining SNV and indels can be decomposed
+#### Complex variants combining SNV and indels can be decomposed
 
 Before:
 ```
@@ -118,6 +132,36 @@ chr1	13325	.	CT	C	.	MNV-INDEL	OLD_CLUMPED=chr1:13321:AGCCCT/CGCC	GT:AD:PS	0:229,
 Same as MNVs this behaviour can de disabled with `--skip_decompose_complex`.
 
 
+## Technical annotations
+
+The technical annotations provide an insight on the variant calling process by looking into the context of each variant
+within the pileup of a BAM file. When doing somatic variant calling it may be relevant to have technical annotations
+for the same variant in a patient from multiple BAM files.
+These annotations are provided by VAFator (https://github.com/TRON-Bioinformatics/vafator).
+
+## Functional annotations
+
+The functional annotations provide a biological context for every variant. Such as the overlapping genes or the effect 
+of the variant in a protein. These annotations are provided by SnpEff (Cingolani, 2012).
+
+The SnpEff available human annotations are:
+* GRCh37.75 
+* GRCh38.99 
+* hg19 
+* hg19kg 
+* hg38 
+* hg38kg
+
+Before running the functional annotations you will need to download the reference genome you need to use. 
+This can be done as follows: `snpEff download -dataDir /your/snpeff/folder -v hg19`
+
+When running indicate the right reference genome like `--snpeff_organism hg19`. 
+If none is provided no SnpEff annotations will be provided.
+Provide the snpEff folder with `--snpeff_datadir`
+To provide any additional SnpEff arguments use `--snpeff_args` such as 
+`--snpeff_args "-noStats -no-downstream -no-upstream -no-intergenic -no-intron -onlyProtein -hgvs1LetterAa -noShiftHgvs"`, 
+otherwise defaults will be used.
+
 
 ## How to run it
 
@@ -167,22 +211,31 @@ Output:
 
 The table with VCF files expects two tab-separated columns without a header
 
-| Sample name          | VCF                                                             |
+| Patient name          | VCF                                                             |
 |----------------------|------------------------------------------------------------------------|
-| sample_1             | /path/to/sample_1.vcf               |
-| sample_2             | /path/to/sample_2.vcf                    |
+| patient_1             | /path/to/patient_1.vcf               |
+| patient_2             | /path/to/patient_2.vcf                    |
 
-The optional table with BAM files expects three tab-separated columns without a header. Multiple comma-separated BAMs can be provided.
+The optional table with BAM files expects two tab-separated columns without a header.
 
-| Sample name          | Tumor BAMs                      | Normal BAMs                  |
-|----------------------|---------------------------------|------------------------------|
-| sample_1             | /path/to/sample_1.tumor_1.bam,/path/to/sample_1.tumor_2.bam      |    /path/to/sample_1.normal_1.bam,/path/to/sample_1.normal_2.bam   |
-| sample_2             | /path/to/sample_2.tumor.bam           |  /path/to/sample_2.normal.bam  |
+| Patient name          | Sample name:BAM                      |
+|----------------------|---------------------------------|
+| patient_1             | primary_tumor:/path/to/sample_1.primary.bam   |
+| patient_1             | metastasis_tumor:/path/to/sample_1.metastasis.bam   |
+| patient_1             | normal:/path/to/sample_1.normal.bam   |
+| patient_2             | primary_tumor:/path/to/sample_1.primary_1.bam   |
+| patient_2             | primary_tumor:/path/to/sample_1.primary_2.bam   |
+| patient_2             | metastasis_tumor:/path/to/sample_1.metastasis.bam   |
+| patient_2             | normal:/path/to/sample_1.normal.bam   |
+
+Each patient can have any number of samples. Any sample can have any number of BAM files, annotations from the 
+different BAM files of the same sample will be provided with suffixes _1, _2, etc.
+The aggregated vafator annotations on each sample will also be provided without a suffix.
 
-
 
 ## References
 
 * Adrian Tan, Gonçalo R. Abecasis and Hyun Min Kang. Unified Representation of Genetic Variants. Bioinformatics (2015) 31(13): 2202-2204](http://bioinformatics.oxfordjournals.org/content/31/13/2202) and uses bcftools [Li, H. (2011). A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics (Oxford, England), 27(21), 2987–2993. 10.1093/bioinformatics/btr509
 * Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, Whitwham A, Keane T, McCarthy SA, Davies RM, Li H. Twelve years of SAMtools and BCFtools. Gigascience. 2021 Feb 16;10(2):giab008. doi: 10.1093/gigascience/giab008. PMID: 33590861; PMCID: PMC7931819.
 * Di Tommaso, P., Chatzou, M., Floden, E. W., Barja, P. P., Palumbo, E., & Notredame, C. (2017). Nextflow enables reproducible computational workflows. Nature Biotechnology, 35(4), 316–319. 10.1038/nbt.3820
+* Cingolani P, Platts A, Wang le L, Coon M, Nguyen T, Wang L, Land SJ, Lu X, Ruden DM. (2012) A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3.". Fly (Austin). 2012 Apr-Jun;6(2):80-92. PMID: 22728672
diff --git a/bin/filter_passed.sh b/bin/filter_passed.sh
diff --git a/bin/normalization.sh b/bin/normalization.sh
diff --git a/bin/summary.sh b/bin/summary.sh
diff --git a/main.nf b/main.nf
@@ -2,10 +2,11 @@
 
 nextflow.enable.dsl = 2
 
-include { BCFTOOLS_NORM; VT_DECOMPOSE_COMPLEX; REMOVE_DUPLICATES } from './modules/normalization'
-include { FILTER_VCF } from './modules/filter'
-include { SUMMARY_VCF; SUMMARY_VCF as SUMMARY_VCF_2 } from './modules/summary'
-include { VAFATOR; MULTIALLELIC_FILTER } from './modules/vafator'
+include { FILTER_VCF } from './modules/01_filter'
+include { BCFTOOLS_NORM; VT_DECOMPOSE_COMPLEX; REMOVE_DUPLICATES } from './modules/02_normalization'
+include { SUMMARY_VCF; SUMMARY_VCF as SUMMARY_VCF_2 } from './modules/03_summary'
+include { VAFATOR; MULTIALLELIC_FILTER } from './modules/04_vafator'
+include { VARIANT_ANNOTATION } from './modules/05_variant_annotation'
 
 params.help= false
 params.input_vcfs = false
@@ -21,14 +22,20 @@ params.vcf_without_ad = false
 params.mapping_quality = false
 params.base_call_quality = false
 params.skip_multiallelic_filter = false
-params.prefix = false
+params.snpeff_organism = false
+params.snpeff_args = ""
+params.snpeff_datadir = false
 
 
 if (params.help) {
     log.info params.help_message
     exit 0
 }
 
+if ( params.snpeff_organism && ! params.snpeff_datadir) {
+  exit 1, "To run snpEff, please, provide your snpEff data folder with --snpeff_datadir"
+}
+
 if (! params.input_vcfs && ! params.input_vcf) {
   exit 1, "Neither --input_vcfs or --input_vcf are provided!"
 }
@@ -50,8 +57,8 @@ else if (params.input_vcf) {
 if (params.input_bams) {
     Channel
     .fromPath(params.input_bams)
-    .splitCsv(header: ['name', 'tumor_bams', 'normal_bams'], sep: "\t")
-    .map{ row-> tuple(row.name, row.tumor_bams, row.normal_bams) }
+    .splitCsv(header: ['name', 'sample_name', 'bam'], sep: "\t")
+    .map{ row-> tuple(row.name, row.sample_name, row.bam) }
     .set { input_bams }
 }
 
@@ -74,15 +81,18 @@ workflow {
 
     SUMMARY_VCF_2(final_vcfs)
 
-    if ( params.input_bams) {
-        VAFATOR(final_vcfs.join(input_bams))
+    if ( params.input_bams ) {
+        VAFATOR(final_vcfs.join(input_bams.groupTuple()))
         final_vcfs = VAFATOR.out.annotated_vcf
         if ( ! params.skip_multiallelic_filter ) {
             final_vcfs = MULTIALLELIC_FILTER(final_vcfs)
             final_vcfs = MULTIALLELIC_FILTER.out.filtered_vcf
         }
     }
 
-    final_vcfs.map {it.join("\t")}.collectFile(name: "${params.output}/normalized_vcfs.txt", newLine: true)
+    if (params.snpeff_organism) {
+        VARIANT_ANNOTATION(final_vcfs)
+        final_vcfs = VARIANT_ANNOTATION.out.annotated_vcf
+    }
 }
 
diff --git a/modules/filter.nf → modules/01_filter.nf b/modules/filter.nf → modules/01_filter.nf
diff --git a/modules/normalization.nf → modules/02_normalization.nf b/modules/normalization.nf → modules/02_normalization.nf
diff --git a/modules/summary.nf → modules/03_summary.nf b/modules/summary.nf → modules/03_summary.nf
diff --git a/modules/vafator.nf → modules/04_vafator.nf b/modules/vafator.nf → modules/04_vafator.nf
@@ -4,33 +4,31 @@ params.output = ""
 params.mapping_quality = false
 params.base_call_quality = false
 params.skip_multiallelic_filter = false
-params.prefix = false
+params.enable_conda = false
 
 
 process VAFATOR {
     cpus params.cpus
     memory params.memory
-    tag "${name}"
-    publishDir "${params.output}/${name}", mode: "copy"
+    tag "${patient_name}"
+    publishDir "${params.output}/${patient_name}", mode: "copy"
 
-    conda (params.enable_conda ? "bioconda::vafator=0.4.0" : null)
+    conda (params.enable_conda ? "bioconda::vafator=1.1.2" : null)
 
     input:
-    tuple val(name), file(vcf), val(normal_bams), val(tumor_bams)
+    tuple val(patient_name), file(vcf), val(bams)
 
     output:
-    tuple val(name), file("${vcf.baseName}.vaf.vcf"), emit: annotated_vcf
+    tuple val(patient_name), file("${vcf.baseName}.vaf.vcf"), emit: annotated_vcf
 
     script:
-    normal_bams_param = normal_bams?.trim() ? "--normal-bams " + normal_bams.split(",").join(" ") : ""
-    tumor_bams_param = tumor_bams?.trim() ? "--tumor-bams " + tumor_bams.split(",").join(" ") : ""
+    bams_param = bams.collect { b -> "--bam " + b.split(":").join(" ") }.join(" ")
     mq_param = params.mapping_quality ? "--mapping-quality " + params.mapping_quality : ""
     bq_param = params.base_call_quality ? "--base-call-quality " + params.base_call_quality : ""
-    prefix_param = params.prefix ? "--prefix " + params.prefix : ""
     """
     vafator \
     --input-vcf ${vcf} \
-    --output-vcf ${vcf.baseName}.vaf.vcf ${normal_bams_param} ${tumor_bams_param} ${mq_param} ${bq_param} ${prefix_param}
+    --output-vcf ${vcf.baseName}.vaf.vcf ${bams_param} ${mq_param} ${bq_param}
     """
 }
 
@@ -41,13 +39,13 @@ process MULTIALLELIC_FILTER {
     tag "${name}"
     publishDir "${params.output}/${name}", mode: "copy"
 
-    conda (params.enable_conda ? "bioconda::vafator=0.4.0" : null)
+    conda (params.enable_conda ? "bioconda::vafator=1.1.2" : null)
 
     input:
     tuple val(name), file(vcf)
 
     output:
-    tuple val(name), file("${vcf.baseName}.filtered_multiallelics.vcf"), emit: filtered_vcf
+    tuple file("${vcf.baseName}.filtered_multiallelics.vcf"), emit: filtered_vcf
 
     script:
     """

diff --git a/modules/05_variant_annotation.nf b/modules/05_variant_annotation.nf
@@ -0,0 +1,27 @@
+params.memory = "3g"
+params.cpus = 1
+params.output = "."
+params.snpeff_datadir = false
+params.snpeff_organism = false
+params.snpeff_args = ""
+
+
+process VARIANT_ANNOTATION {
+    cpus params.cpus
+    memory params.memory
+    publishDir "${params.output}/${name}", mode: "copy"
+
+    conda (params.enable_conda ? "bioconda::snpeff=5.0" : null)
+
+    input:
+        tuple val(name), file(vcf)
+
+    output:
+    tuple val(name), file("${name}.annotated.vcf") , emit: annotated_vcf
+
+    script:
+    datadir_arg = params.snpeff_datadir ? "-dataDir ${params.snpeff_datadir}" : ""
+    """
+    snpEff eff ${datadir_arg} ${params.snpeff_args} -nodownload ${params.snpeff_organism} ${vcf} > ${name}.annotated.vcf
+    """
+}
diff --git a/nextflow.config b/nextflow.config
@@ -27,7 +27,7 @@ env {
 // Capture exit codes from upstream processes when piping
 process.shell = ['/bin/bash', '-euo', 'pipefail']
 
-VERSION = '2.0.0'
+VERSION = '2.1.0'
 
 cleanup=true
 

diff --git a/bin/assert.sh → tests/assert.sh b/bin/assert.sh → tests/assert.sh