diff --git a/README.md b/README.md index 7158ac9..46db1d6 100644 --- a/README.md +++ b/README.md @@ -1,4 +1,4 @@ -# TronFlow variant normalization pipeline +# TronFlow VCF postprocessing ![GitHub tag (latest SemVer)](https://img.shields.io/github/v/release/tron-bioinformatics/tronflow-variant-normalization?sort=semver) [![Run tests](https://github.com/TRON-Bioinformatics/tronflow-variant-normalization/actions/workflows/automated_tests.yml/badge.svg?branch=master)](https://github.com/TRON-Bioinformatics/tronflow-variant-normalization/actions/workflows/automated_tests.yml) @@ -6,18 +6,32 @@ [![License](https://img.shields.io/badge/license-MIT-green)](https://opensource.org/licenses/MIT) [![Powered by Nextflow](https://img.shields.io/badge/powered%20by-Nextflow-orange.svg?style=flat&colorA=E1523D&colorB=007D8A)](https://www.nextflow.io/) -The TronFlow variant normalization pipeline is part of a collection of computational workflows for tumor-normal pair +The TronFlow VCF postprocessing pipeline is part of a collection of computational workflows for tumor-normal pair somatic variant calling. +These workflows are implemented in the Nextflow (Di Tommaso, 2017) framework. Find the documentation here [![Documentation Status](https://readthedocs.org/projects/tronflow-docs/badge/?version=latest)](https://tronflow-docs.readthedocs.io/en/latest/?badge=latest) - -This pipeline aims at normalizing variants represented in a VCF into the convened normal form as described in Tan 2015. -The variant normalization is based on the implementation in vt (Tan 2015) and bcftools (Danecek 2021). -The pipeline is implemented on the Nextflow (Di Tommaso 2017) framework. +This pipeline has several objectives: +* Variant filtering +* Variant normalization +* Technical annotations from different BAM files +* Functional annotations + +## Variant filtering + +Optionally, only variants with the value in the column `FILTER` matching the value of parameter `--filter` are kept. +If this parameter is not used not variants are filtered out. Multiple values can be passed separated by commas without spaces. + +For instance, `--filter PASS,.` will keep variant having `FILTER=PASS` or `FILTER=.`, but remove all others. + + +## Variant normalization + +The normalization step aims to represent variants into the convened normal form as described in Tan 2015. +The variant normalization is based on the implementation in vt (Tan, 2015) and bcftools (Danecek, 2021). -The pipeline consists of the following steps: - * Variant filtering (optional) +The normalization pipeline consists of the following steps: * Decomposition of MNPs into atomic variants (ie: AC > TG is decomposed into two variants A>T and C>G) (optional). * Decomposition of multiallelic variants into biallelic variants (ie: A > C,G is decomposed into two variants A > C and A > G) * Trim redundant sequence and left align indels, indels in repetitive sequences can have multiple representations @@ -34,9 +48,9 @@ The output consists of: ![Pipeline](images/variant_normalization_pipeline.png) -## What is variant normalization? +### What is variant normalization? -### Variants are trimmed removing redundant bases +#### Variants are trimmed removing redundant bases Before: ``` @@ -49,7 +63,7 @@ chr1 13085 . A C . UNTRIMMED OLD_CLUMPED=chr1|13083|AGA|AGC GT:AD 0:276,276 0/1: **NOTE**: when a variant is changed during normalization the original variant is kept in the `INFO` field `OLD_CLUMPED`. -### Indels are left aligned +#### Indels are left aligned If an indel lies within a repetitive sequence it can be reported on different positions, the convention is to report the left most indel. This is what is called left alignment. @@ -69,7 +83,7 @@ chr1 13140 . CCTGAG C . UNALIGNED OLD_CLUMPED=chr1|13141|CTGAGG|G GT:AD 0:80,1 0 **NOTE**: the rule of thumb to confirm if any given indel is left aligned, assuming that they are trimmed, is that the last base in both reference and alternate must differ. -### Multi-allelic variants are split. +#### Multi-allelic variants are split. Before: ``` @@ -84,7 +98,7 @@ chr1 13204 . C T . MULTIALLELIC OLD_VARIANT=chr1|13204|C|G,|2 GT:AD 0:229,229 0/ Note that the AD values are incorrectly set after the split, this seems to be a regression issue in bcftools v1.12 and has been reported here https://github.com/samtools/bcftools/issues/1499. -### Multi Nucleotide Variants (MNVs) can be decomposed +#### Multi Nucleotide Variants (MNVs) can be decomposed Before: ``` @@ -101,7 +115,7 @@ chr1 13266 . T C . MNV OLD_CLUMPED=chr1:13261:GCTCCT/CCCCCC GT:AD:PS 0:41,0:1326 This behaviour is optional and can be disabled with `--skip_decompose_complex`. Beware, that the phase is maintained in the fields `FORMAT/GT` and `FORMAT/PS` as described in the VCF specification section 1.4.2. -### Complex variants combining SNV and indels can be decomposed +#### Complex variants combining SNV and indels can be decomposed Before: ``` @@ -118,6 +132,36 @@ chr1 13325 . CT C . MNV-INDEL OLD_CLUMPED=chr1:13321:AGCCCT/CGCC GT:AD:PS 0:229, Same as MNVs this behaviour can de disabled with `--skip_decompose_complex`. +## Technical annotations + +The technical annotations provide an insight on the variant calling process by looking into the context of each variant +within the pileup of a BAM file. When doing somatic variant calling it may be relevant to have technical annotations +for the same variant in a patient from multiple BAM files. +These annotations are provided by VAFator (https://github.com/TRON-Bioinformatics/vafator). + +## Functional annotations + +The functional annotations provide a biological context for every variant. Such as the overlapping genes or the effect +of the variant in a protein. These annotations are provided by SnpEff (Cingolani, 2012). + +The SnpEff available human annotations are: +* GRCh37.75 +* GRCh38.99 +* hg19 +* hg19kg +* hg38 +* hg38kg + +Before running the functional annotations you will need to download the reference genome you need to use. +This can be done as follows: `snpEff download -dataDir /your/snpeff/folder -v hg19` + +When running indicate the right reference genome like `--snpeff_organism hg19`. +If none is provided no SnpEff annotations will be provided. +Provide the snpEff folder with `--snpeff_datadir` +To provide any additional SnpEff arguments use `--snpeff_args` such as +`--snpeff_args "-noStats -no-downstream -no-upstream -no-intergenic -no-intron -onlyProtein -hgvs1LetterAa -noShiftHgvs"`, +otherwise defaults will be used. + ## How to run it @@ -167,22 +211,31 @@ Output: The table with VCF files expects two tab-separated columns without a header -| Sample name | VCF | +| Patient name | VCF | |----------------------|------------------------------------------------------------------------| -| sample_1 | /path/to/sample_1.vcf | -| sample_2 | /path/to/sample_2.vcf | +| patient_1 | /path/to/patient_1.vcf | +| patient_2 | /path/to/patient_2.vcf | -The optional table with BAM files expects three tab-separated columns without a header. Multiple comma-separated BAMs can be provided. +The optional table with BAM files expects two tab-separated columns without a header. -| Sample name | Tumor BAMs | Normal BAMs | -|----------------------|---------------------------------|------------------------------| -| sample_1 | /path/to/sample_1.tumor_1.bam,/path/to/sample_1.tumor_2.bam | /path/to/sample_1.normal_1.bam,/path/to/sample_1.normal_2.bam | -| sample_2 | /path/to/sample_2.tumor.bam | /path/to/sample_2.normal.bam | +| Patient name | Sample name:BAM | +|----------------------|---------------------------------| +| patient_1 | primary_tumor:/path/to/sample_1.primary.bam | +| patient_1 | metastasis_tumor:/path/to/sample_1.metastasis.bam | +| patient_1 | normal:/path/to/sample_1.normal.bam | +| patient_2 | primary_tumor:/path/to/sample_1.primary_1.bam | +| patient_2 | primary_tumor:/path/to/sample_1.primary_2.bam | +| patient_2 | metastasis_tumor:/path/to/sample_1.metastasis.bam | +| patient_2 | normal:/path/to/sample_1.normal.bam | + +Each patient can have any number of samples. Any sample can have any number of BAM files, annotations from the +different BAM files of the same sample will be provided with suffixes _1, _2, etc. +The aggregated vafator annotations on each sample will also be provided without a suffix. - ## References * Adrian Tan, Gonçalo R. Abecasis and Hyun Min Kang. Unified Representation of Genetic Variants. Bioinformatics (2015) 31(13): 2202-2204](http://bioinformatics.oxfordjournals.org/content/31/13/2202) and uses bcftools [Li, H. (2011). A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics (Oxford, England), 27(21), 2987–2993. 10.1093/bioinformatics/btr509 * Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, Whitwham A, Keane T, McCarthy SA, Davies RM, Li H. Twelve years of SAMtools and BCFtools. Gigascience. 2021 Feb 16;10(2):giab008. doi: 10.1093/gigascience/giab008. PMID: 33590861; PMCID: PMC7931819. * Di Tommaso, P., Chatzou, M., Floden, E. W., Barja, P. P., Palumbo, E., & Notredame, C. (2017). Nextflow enables reproducible computational workflows. Nature Biotechnology, 35(4), 316–319. 10.1038/nbt.3820 +* Cingolani P, Platts A, Wang le L, Coon M, Nguyen T, Wang L, Land SJ, Lu X, Ruden DM. (2012) A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3.". Fly (Austin). 2012 Apr-Jun;6(2):80-92. PMID: 22728672 diff --git a/bin/filter_passed.sh b/bin/filter_passed.sh deleted file mode 100755 index b928146..0000000 --- a/bin/filter_passed.sh +++ /dev/null @@ -1,5 +0,0 @@ -#!/usr/bin/env bash -vcf=$1 - -# keeps only the passed variants -bcftools view --apply-filters PASS -o `basename $vcf .vcf`.passed.vcf $vcf diff --git a/bin/normalization.sh b/bin/normalization.sh deleted file mode 100755 index fbc5568..0000000 --- a/bin/normalization.sh +++ /dev/null @@ -1,47 +0,0 @@ -#!/usr/bin/env bash - -vcf=$1 -reference=$2 - - -module load anaconda/3/2019 - -# separate first by variant type -bcftools view --types snps -o `basename $vcf .vcf`.original.snps.vcf $vcf -bcftools view --types indels -o `basename $vcf .vcf`.original.indels.vcf $vcf -bcftools view --types mnps -o `basename $vcf .vcf`.original.mnps.vcf $vcf -bcftools view --types ref -o `basename $vcf .vcf`.original.ref.vcf $vcf -bcftools view --types bnd -o `basename $vcf .vcf`.original.bnd.vcf $vcf -bcftools view --types other -o `basename $vcf .vcf`.original.other.vcf $vcf - -# decompose biallelic block substitutions (AC>TG to A>T and C>G) -# -a: best guess for non blocked substitutions -# -p: output phased genotypes and PS annotation -vt decompose_blocksub $vcf -a -p -o `basename $vcf .vcf`.atomic.vcf 2> `basename $vcf .vcf`.decompose_blocksub_stats.log - -# decompose multiallelic variants into biallelic (C>T,G to C>T and C>G) -vt decompose `basename $vcf .vcf`.atomic.vcf -o `basename $vcf .vcf`.biallelic.vcf 2> `basename $vcf .vcf`.decompose_stats.log - -# sort the input VCF -vt sort `basename $vcf .vcf`.biallelic.vcf -o `basename $vcf .vcf`.sorted.vcf - -# normalize variants (trim and left alignment) -vt normalize `basename $vcf .vcf`.sorted.vcf -r $reference -o `basename $vcf .vcf`.normalized.vcf 2> `basename $vcf .vcf`.normalization_stats.log - -# removes duplicated variants -vt uniq `basename $vcf .vcf`.normalized.vcf -o `basename $vcf .vcf`.uniq.vcf 2> `basename $vcf .vcf`.uniq_stats.log - -# separate by variant type once normalized -bcftools view --types snps -o `basename $vcf .vcf`.normalized.snps.vcf `basename $vcf .vcf`.uniq.vcf -bcftools view --types indels -o `basename $vcf .vcf`.normalized.indels.vcf `basename $vcf .vcf`.uniq.vcf -bcftools view --types mnps -o `basename $vcf .vcf`.normalized.mnps.vcf `basename $vcf .vcf`.uniq.vcf -bcftools view --types ref -o `basename $vcf .vcf`.normalized.ref.vcf `basename $vcf .vcf`.uniq.vcf -bcftools view --types bnd -o `basename $vcf .vcf`.normalized.bnd.vcf `basename $vcf .vcf`.uniq.vcf -bcftools view --types other -o `basename $vcf .vcf`.normalized.other.vcf `basename $vcf .vcf`.uniq.vcf - -# delete intermediate files -rm -f `basename $vcf .vcf`.atomic.vcf -rm -f `basename $vcf .vcf`.biallelic.vcf -rm -f `basename $vcf .vcf`.sorted.vcf -rm -f `basename $vcf .vcf`.normalized.vcf -rm -f `basename $vcf .vcf`.uniq.vcf diff --git a/bin/summary.sh b/bin/summary.sh deleted file mode 100644 index 8b18927..0000000 --- a/bin/summary.sh +++ /dev/null @@ -1,7 +0,0 @@ -#!/usr/bin/env bash -vcf=$1 - -module load anaconda/3/2019 - -bcftools summary $vcf > `basename $vcf .vcf`.stats -plot-vcfstats -p `basename $vcf .vcf`.stats_plots `basename $vcf .vcf`.stats diff --git a/main.nf b/main.nf index 089af97..e565445 100755 --- a/main.nf +++ b/main.nf @@ -2,10 +2,11 @@ nextflow.enable.dsl = 2 -include { BCFTOOLS_NORM; VT_DECOMPOSE_COMPLEX; REMOVE_DUPLICATES } from './modules/normalization' -include { FILTER_VCF } from './modules/filter' -include { SUMMARY_VCF; SUMMARY_VCF as SUMMARY_VCF_2 } from './modules/summary' -include { VAFATOR; MULTIALLELIC_FILTER } from './modules/vafator' +include { FILTER_VCF } from './modules/01_filter' +include { BCFTOOLS_NORM; VT_DECOMPOSE_COMPLEX; REMOVE_DUPLICATES } from './modules/02_normalization' +include { SUMMARY_VCF; SUMMARY_VCF as SUMMARY_VCF_2 } from './modules/03_summary' +include { VAFATOR; MULTIALLELIC_FILTER } from './modules/04_vafator' +include { VARIANT_ANNOTATION } from './modules/05_variant_annotation' params.help= false params.input_vcfs = false @@ -21,7 +22,9 @@ params.vcf_without_ad = false params.mapping_quality = false params.base_call_quality = false params.skip_multiallelic_filter = false -params.prefix = false +params.snpeff_organism = false +params.snpeff_args = "" +params.snpeff_datadir = false if (params.help) { @@ -29,6 +32,10 @@ if (params.help) { exit 0 } +if ( params.snpeff_organism && ! params.snpeff_datadir) { + exit 1, "To run snpEff, please, provide your snpEff data folder with --snpeff_datadir" +} + if (! params.input_vcfs && ! params.input_vcf) { exit 1, "Neither --input_vcfs or --input_vcf are provided!" } @@ -50,8 +57,8 @@ else if (params.input_vcf) { if (params.input_bams) { Channel .fromPath(params.input_bams) - .splitCsv(header: ['name', 'tumor_bams', 'normal_bams'], sep: "\t") - .map{ row-> tuple(row.name, row.tumor_bams, row.normal_bams) } + .splitCsv(header: ['name', 'sample_name', 'bam'], sep: "\t") + .map{ row-> tuple(row.name, row.sample_name, row.bam) } .set { input_bams } } @@ -74,8 +81,8 @@ workflow { SUMMARY_VCF_2(final_vcfs) - if ( params.input_bams) { - VAFATOR(final_vcfs.join(input_bams)) + if ( params.input_bams ) { + VAFATOR(final_vcfs.join(input_bams.groupTuple())) final_vcfs = VAFATOR.out.annotated_vcf if ( ! params.skip_multiallelic_filter ) { final_vcfs = MULTIALLELIC_FILTER(final_vcfs) @@ -83,6 +90,9 @@ workflow { } } - final_vcfs.map {it.join("\t")}.collectFile(name: "${params.output}/normalized_vcfs.txt", newLine: true) + if (params.snpeff_organism) { + VARIANT_ANNOTATION(final_vcfs) + final_vcfs = VARIANT_ANNOTATION.out.annotated_vcf + } } diff --git a/modules/filter.nf b/modules/01_filter.nf similarity index 100% rename from modules/filter.nf rename to modules/01_filter.nf diff --git a/modules/normalization.nf b/modules/02_normalization.nf similarity index 100% rename from modules/normalization.nf rename to modules/02_normalization.nf diff --git a/modules/summary.nf b/modules/03_summary.nf similarity index 100% rename from modules/summary.nf rename to modules/03_summary.nf diff --git a/modules/vafator.nf b/modules/04_vafator.nf similarity index 50% rename from modules/vafator.nf rename to modules/04_vafator.nf index 325e19b..573c5b0 100644 --- a/modules/vafator.nf +++ b/modules/04_vafator.nf @@ -4,33 +4,31 @@ params.output = "" params.mapping_quality = false params.base_call_quality = false params.skip_multiallelic_filter = false -params.prefix = false +params.enable_conda = false process VAFATOR { cpus params.cpus memory params.memory - tag "${name}" - publishDir "${params.output}/${name}", mode: "copy" + tag "${patient_name}" + publishDir "${params.output}/${patient_name}", mode: "copy" - conda (params.enable_conda ? "bioconda::vafator=0.4.0" : null) + conda (params.enable_conda ? "bioconda::vafator=1.1.2" : null) input: - tuple val(name), file(vcf), val(normal_bams), val(tumor_bams) + tuple val(patient_name), file(vcf), val(bams) output: - tuple val(name), file("${vcf.baseName}.vaf.vcf"), emit: annotated_vcf + tuple val(patient_name), file("${vcf.baseName}.vaf.vcf"), emit: annotated_vcf script: - normal_bams_param = normal_bams?.trim() ? "--normal-bams " + normal_bams.split(",").join(" ") : "" - tumor_bams_param = tumor_bams?.trim() ? "--tumor-bams " + tumor_bams.split(",").join(" ") : "" + bams_param = bams.collect { b -> "--bam " + b.split(":").join(" ") }.join(" ") mq_param = params.mapping_quality ? "--mapping-quality " + params.mapping_quality : "" bq_param = params.base_call_quality ? "--base-call-quality " + params.base_call_quality : "" - prefix_param = params.prefix ? "--prefix " + params.prefix : "" """ vafator \ --input-vcf ${vcf} \ - --output-vcf ${vcf.baseName}.vaf.vcf ${normal_bams_param} ${tumor_bams_param} ${mq_param} ${bq_param} ${prefix_param} + --output-vcf ${vcf.baseName}.vaf.vcf ${bams_param} ${mq_param} ${bq_param} """ } @@ -41,13 +39,13 @@ process MULTIALLELIC_FILTER { tag "${name}" publishDir "${params.output}/${name}", mode: "copy" - conda (params.enable_conda ? "bioconda::vafator=0.4.0" : null) + conda (params.enable_conda ? "bioconda::vafator=1.1.2" : null) input: tuple val(name), file(vcf) output: - tuple val(name), file("${vcf.baseName}.filtered_multiallelics.vcf"), emit: filtered_vcf + tuple file("${vcf.baseName}.filtered_multiallelics.vcf"), emit: filtered_vcf script: """ diff --git a/modules/05_variant_annotation.nf b/modules/05_variant_annotation.nf new file mode 100644 index 0000000..b9c5f2a --- /dev/null +++ b/modules/05_variant_annotation.nf @@ -0,0 +1,27 @@ +params.memory = "3g" +params.cpus = 1 +params.output = "." +params.snpeff_datadir = false +params.snpeff_organism = false +params.snpeff_args = "" + + +process VARIANT_ANNOTATION { + cpus params.cpus + memory params.memory + publishDir "${params.output}/${name}", mode: "copy" + + conda (params.enable_conda ? "bioconda::snpeff=5.0" : null) + + input: + tuple val(name), file(vcf) + + output: + tuple val(name), file("${name}.annotated.vcf") , emit: annotated_vcf + + script: + datadir_arg = params.snpeff_datadir ? "-dataDir ${params.snpeff_datadir}" : "" + """ + snpEff eff ${datadir_arg} ${params.snpeff_args} -nodownload ${params.snpeff_organism} ${vcf} > ${name}.annotated.vcf + """ +} diff --git a/nextflow.config b/nextflow.config index 714dcbe..4c1f15b 100644 --- a/nextflow.config +++ b/nextflow.config @@ -27,7 +27,7 @@ env { // Capture exit codes from upstream processes when piping process.shell = ['/bin/bash', '-euo', 'pipefail'] -VERSION = '2.0.0' +VERSION = '2.1.0' cleanup=true diff --git a/bin/assert.sh b/tests/assert.sh similarity index 100% rename from bin/assert.sh rename to tests/assert.sh diff --git a/tests/run_test_1.sh b/tests/run_test_1.sh index f0953ba..c6c7606 100644 --- a/tests/run_test_1.sh +++ b/tests/run_test_1.sh @@ -1,7 +1,7 @@ #!/bin/bash -source bin/assert.sh +source tests/assert.sh output_folder=output/test1 nextflow main.nf -profile test,conda --output $output_folder test -s $output_folder/single_sample/single_sample.normalized.vcf || { echo "Missing test 1 output file!"; exit 1; } diff --git a/tests/run_test_2.sh b/tests/run_test_2.sh index c351206..6edcf9e 100644 --- a/tests/run_test_2.sh +++ b/tests/run_test_2.sh @@ -1,7 +1,7 @@ #!/bin/bash -source bin/assert.sh +source tests/assert.sh output_folder=output/test2 nextflow main.nf -profile test,conda --output $output_folder --filter PASS,MNV test -s $output_folder/single_sample/single_sample.normalized.vcf || { echo "Missing test 2 output file!"; exit 1; } diff --git a/tests/run_test_3.sh b/tests/run_test_3.sh index 0c4ad47..8da45d0 100644 --- a/tests/run_test_3.sh +++ b/tests/run_test_3.sh @@ -1,7 +1,7 @@ #!/bin/bash -source bin/assert.sh +source tests/assert.sh output_folder=output/test3 nextflow main.nf -profile test,conda --output $output_folder --skip_decompose_complex test -s $output_folder/single_sample/single_sample.normalized.vcf || { echo "Missing test 2 output file!"; exit 1; } diff --git a/tests/run_test_4.sh b/tests/run_test_4.sh index d95097d..e768448 100644 --- a/tests/run_test_4.sh +++ b/tests/run_test_4.sh @@ -1,7 +1,7 @@ #!/bin/bash -source bin/assert.sh +source tests/assert.sh output_folder=output/test4 nextflow main.nf -profile test,conda --input_vcfs test_data/test_input_no_ad.txt --output $output_folder test -s $output_folder/sample_no_ad/sample_no_ad.normalized.vcf || { echo "Missing test 4 output file!"; exit 1; } diff --git a/tests/run_test_5.sh b/tests/run_test_5.sh index e956763..552acf5 100644 --- a/tests/run_test_5.sh +++ b/tests/run_test_5.sh @@ -1,7 +1,7 @@ #!/bin/bash -source bin/assert.sh +source tests/assert.sh output_folder=output/test5 nextflow main.nf -profile test,conda --output $output_folder --input_vcfs false --input_vcf test_data/test_single_sample.vcf test -s $output_folder/test_single_sample/test_single_sample.normalized.vcf || { echo "Missing test 4 output file!"; exit 1; } diff --git a/tests/run_test_6.sh b/tests/run_test_6.sh index 8b290a5..7181bff 100644 --- a/tests/run_test_6.sh +++ b/tests/run_test_6.sh @@ -1,10 +1,14 @@ #!/bin/bash -source bin/assert.sh +source tests/assert.sh output_folder=output/test6 -echo -e "tumor_normal\t"`pwd`"/test_data/TESTX_S1_L001.bam\t"`pwd`"/test_data/TESTX_S1_L002.bam" > test_data/test_bams.txt -echo -e "single_sample\t"`pwd`"/test_data/TESTX_S1_L001.bam,"`pwd`"/test_data/TESTX_S1_L002.bam\t"`pwd`"/test_data/TESTX_S1_L001.bam,"`pwd`"/test_data/TESTX_S1_L002.bam" >> test_data/test_bams.txt +echo -e "tumor_normal\tprimary:"`pwd`"/test_data/TESTX_S1_L001.bam" > test_data/test_bams.txt +echo -e "tumor_normal\tnormal:"`pwd`"/test_data/TESTX_S1_L002.bam" >> test_data/test_bams.txt +echo -e "single_sample\ttumor:"`pwd`"/test_data/TESTX_S1_L001.bam" >> test_data/test_bams.txt +echo -e "single_sample\ttumor:"`pwd`"/test_data/TESTX_S1_L002.bam" >> test_data/test_bams.txt +echo -e "single_sample\tnormal:"`pwd`"/test_data/TESTX_S1_L001.bam" >> test_data/test_bams.txt +echo -e "single_sample\tnormal:"`pwd`"/test_data/TESTX_S1_L002.bam" >> test_data/test_bams.txt nextflow main.nf -profile test,conda --output $output_folder --input_bams test_data/test_bams.txt test -s $output_folder/single_sample/single_sample.normalized.vcf || { echo "Missing test 6 output file!"; exit 1; } test -s $output_folder/tumor_normal/tumor_normal.normalized.vcf || { echo "Missing test 6 output file!"; exit 1; } @@ -19,4 +23,4 @@ assert_eq `grep tumor_af $output_folder/single_sample/single_sample.normalized.v assert_eq `grep normal_af $output_folder/single_sample/single_sample.normalized.vaf.vcf | wc -l | cut -d' ' -f 1` 35 "Wrong number of variants" assert_eq `wc -l $output_folder/tumor_normal/tumor_normal.normalized.vaf.vcf | cut -d' ' -f 1` 60 "Wrong number of variants" assert_eq `wc -l $output_folder/tumor_normal/tumor_normal.normalized.vaf.filtered_multiallelics.vcf | cut -d' ' -f 1` 53 "Wrong number of variants" -assert_eq `grep tumor_af $output_folder/tumor_normal/tumor_normal.normalized.vaf.filtered_multiallelics.vcf | wc -l | cut -d' ' -f 1` 24 "Wrong number of variants" \ No newline at end of file +assert_eq `grep primary_af $output_folder/tumor_normal/tumor_normal.normalized.vaf.filtered_multiallelics.vcf | wc -l | cut -d' ' -f 1` 24 "Wrong number of variants" \ No newline at end of file diff --git a/tests/run_test_7.sh b/tests/run_test_7.sh index b3e270e..046e8cb 100644 --- a/tests/run_test_7.sh +++ b/tests/run_test_7.sh @@ -1,10 +1,14 @@ #!/bin/bash -source bin/assert.sh +source tests/assert.sh output_folder=output/test7 -echo -e "tumor_normal\t"`pwd`"/test_data/TESTX_S1_L001.bam\t"`pwd`"/test_data/TESTX_S1_L002.bam" > test_data/test_bams.txt -echo -e "single_sample\t"`pwd`"/test_data/TESTX_S1_L001.bam,"`pwd`"/test_data/TESTX_S1_L002.bam\t"`pwd`"/test_data/TESTX_S1_L001.bam,"`pwd`"/test_data/TESTX_S1_L002.bam" >> test_data/test_bams.txt +echo -e "tumor_normal\tprimary:"`pwd`"/test_data/TESTX_S1_L001.bam" > test_data/test_bams.txt +echo -e "tumor_normal\tnormal:"`pwd`"/test_data/TESTX_S1_L002.bam" >> test_data/test_bams.txt +echo -e "single_sample\ttumor:"`pwd`"/test_data/TESTX_S1_L001.bam" >> test_data/test_bams.txt +echo -e "single_sample\ttumor:"`pwd`"/test_data/TESTX_S1_L002.bam" >> test_data/test_bams.txt +echo -e "single_sample\tnormal:"`pwd`"/test_data/TESTX_S1_L001.bam" >> test_data/test_bams.txt +echo -e "single_sample\tnormal:"`pwd`"/test_data/TESTX_S1_L002.bam" >> test_data/test_bams.txt nextflow main.nf -profile test,conda --output $output_folder --input_bams test_data/test_bams.txt --skip_multiallelic_filter test -s $output_folder/single_sample/single_sample.normalized.vcf || { echo "Missing test 7 output file!"; exit 1; } test -s $output_folder/tumor_normal/tumor_normal.normalized.vcf || { echo "Missing test 7 output file!"; exit 1; } diff --git a/tests/run_test_8.sh b/tests/run_test_8.sh new file mode 100644 index 0000000..86bec8d --- /dev/null +++ b/tests/run_test_8.sh @@ -0,0 +1,11 @@ +#!/bin/bash + +### this test cannot be automated as it relies on SnpEff references which need to be downloaded beforehand +source tests/assert.sh +output_folder=output/test8 +snpeff_datadir=/home/you/snpeff +nextflow main.nf -profile test,conda --output $output_folder --snpeff_organism hg19 --snpeff_datadir $snpeff_datadir +test -s $output_folder/single_sample/single_sample.normalized.vcf || { echo "Missing test 1 output file!"; exit 1; } +test -s $output_folder/tumor_normal/tumor_normal.normalized.vcf || { echo "Missing test 1 output file!"; exit 1; } +assert_eq `wc -l $output_folder/single_sample/single_sample.normalized.vcf | cut -d' ' -f 1` 53 "Wrong number of variants" +assert_eq `wc -l $output_folder/tumor_normal/tumor_normal.normalized.vcf | cut -d' ' -f 1` 53 "Wrong number of variants" \ No newline at end of file