DupFinder is a gene duplication detection tool that combines several variant callers with efficient filtering methods. Its aim is to produce the broadest possible catalogue of duplications while minimising the false positive rate. DupFinder works with both short-read data from Illumina sequencing and long-read data from Nanopore sequencing.
DupFinder detects gene duplications from next-generation sequencing (NGS) data, using paired-end Illumina reads or single-end Nanopore reads. It is designed primarily for plant data but can also be used on human data, provided a reference genome and a gene annotation file are available.
The pipeline is built with Nextflow, a workflow tool that makes it easy to run tasks across multiple computational infrastructures. Nextflow supports containers such as Docker and Singularity, as well as cross-platform package and environment managers such as Conda, which makes workflows more reproducible. This pipeline uses the Conda package manager, which handles the installation and update of the software used by the pipeline and of its dependencies.
- Aligning reads to a reference genome using bwa mem for Illumina data (short-read sequencing) and minimap2 for Nanopore data (long-read sequencing)
- Calling CNVs on Illumina data with the structural variant callers Delly, Dysgu, Lumpy-sv and smoove
- Calling CNVs on Nanopore data with the structural variant callers Sniffles, Svim and cuteSV
- Post-processing each set of CNVs to keep the duplications and remove false positives (Duphold, Bcftools)
- Merging all sets of duplications into one large set (SURVIVOR)
- Detecting duplicated genes using the annotation file (Bedtools)
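The alignment step at the top of this list can be reproduced by hand as a rough sketch. The helper names `align_short_reads` and `align_long_reads`, the default thread count and the `map-ont` preset are assumptions made for illustration; the options DupFinder actually passes to bwa, minimap2 and samtools may differ.

```shell
#!/usr/bin/env bash
# Hypothetical helpers illustrating the alignment step; not part of DupFinder.

# usage: align_short_reads ref.fa R1.fastq R2.fastq out.bam [threads]
align_short_reads() {
    local ref="$1" r1="$2" r2="$3" out="$4" threads="${5:-4}"
    # Paired-end Illumina reads: bwa mem, then coordinate-sort into a BAM file.
    bwa mem -t "$threads" "$ref" "$r1" "$r2" \
        | samtools sort -@ "$threads" -o "$out" -
    samtools index "$out"
}

# usage: align_long_reads ref.fa reads.fastq out.bam [threads]
align_long_reads() {
    local ref="$1" reads="$2" out="$3" threads="${4:-4}"
    # Nanopore reads: minimap2 with the map-ont preset, SAM output (-a).
    minimap2 -ax map-ont -t "$threads" "$ref" "$reads" \
        | samtools sort -@ "$threads" -o "$out" -
    samtools index "$out"
}

# Example calls (file names are placeholders):
# align_short_reads reference.fa pair_id_1.fastq pair_id_2.fastq pair_id.bam 8
# align_long_reads  reference.fa sample.fastq sample.bam 8
```

Both helpers need bwa, minimap2 and samtools on the PATH, which is the case once the Conda environment provides them.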
DupFinder can only be installed on Linux systems and requires Anaconda/Miniconda (Python 3.9+) to be present on the system.
All steps of DupFinder are run using the Nextflow workflow language.
# Step 1. Download DupFinder:
git clone https://github.com/assane-mbodj/dupfinder
# Step 2. Go to the dupfinder folder:
cd dupfinder
# Step 3. Create the Conda environment from the YAML file in this folder, then run the install script:
conda env create -f dupfinder_env.yml
bash install.sh
# Step 4. Activate the dupfinder_env environment:
conda activate dupfinder_env
Finally, run the test.sh script with the command below to check that DupFinder has been installed correctly on your machine.
bash test/test.sh
Before starting, create the index files for the reference genome with the following command to reduce mapping time.
# build index accordingly
bwa index reference.fa
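Indexing a large genome takes a while, so it can be worth checking whether the index already exists before rebuilding it. The sketch below relies on the fact that `bwa index` writes five files next to the reference (`.amb`, `.ann`, `.bwt`, `.pac`, `.sa`); the helper name `bwa_index_present` is hypothetical and not part of DupFinder.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if all five BWA index files exist
# (and are non-empty) next to the given reference FASTA.
bwa_index_present() {
    local ref="$1" ext
    for ext in amb ann bwt pac sa; do
        [ -s "${ref}.${ext}" ] || return 1
    done
    return 0
}

# Build the index only when it is missing:
# bwa_index_present reference.fa || bwa index reference.fa
```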
DupFinder: Tool for detecting duplicate gene using Illumina and Nanopore sequencing data.
DupFinder version: v2.0.0
For Illumina data:
nextflow run dupfinder.nf --sr --c file.config --genome_file reference.fa --reads_sr "pair_id_{1,2}.fastq" --annot file.bed --out Output_DupFinder
For Nanopore data:
nextflow run dupfinder.nf --lr --c file.config --genome_file reference.fasta --reads_lr "pair_id.fastq" --annot file.bed --out Output_DupFinder
DupFinder command arguments: the following parameters must be specified when running DupFinder
--genome_file: Reference genome in FASTA format
--reads_sr: Set of paired-end short reads in FASTQ format. Gzipped FASTQ files are allowed
--reads_lr: Set of single-end long reads in FASTQ format. Gzipped FASTQ files are allowed
--sr: Run the short-read version of the pipeline
--lr: Run the long-read version of the pipeline
--annot: Gene annotation file in GFF or BED format; it must be tab-delimited
--out: Output directory to which all results will be written
--c: Config file specifying the number of CPU cores and the amount of memory assigned to DupFinder
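Because the annotation file passed to `--annot` must be tab-delimited, a quick check before launching the pipeline can save a failed run. The sketch below is a hypothetical helper (`check_annot` is not part of DupFinder) that only verifies that each data line has at least three tab-separated fields, the minimum for BED; it does not validate coordinates or GFF attributes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: fail if any data line has fewer than three
# tab-separated fields. Comment, blank, and BED "track"/"browser"
# header lines are skipped.
check_annot() {
    awk -F'\t' '
        /^#/ || /^$/ || /^(track|browser)/ { next }
        NF < 3 {
            printf "line %d: found %d tab-separated field(s), expected at least 3\n", NR, NF > "/dev/stderr"
            bad = 1
        }
        END { exit bad }
    ' "$1"
}

# Example:
# check_annot file.bed && echo "annotation looks tab-delimited"
```

A file that was saved with spaces instead of tabs is reported line by line on standard error and the function returns a non-zero status.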
Optional arguments:
-w: Working directory to which intermediate results will be written. Default: work
-v: version
DupFinder can process multiple samples with a single command. For example, if there are several paired-end Illumina samples or several single-end Nanopore samples, they can all be processed using:
For Illumina data:
nextflow run dupfinder.nf --sr --c file.config --genome_file reference.fa --reads_sr "*_{1,2}.fastq" --annot file.bed --out Output_DupFinder
For Nanopore data:
nextflow run dupfinder.nf --lr --c file.config --genome_file reference.fa --reads_lr "*.fastq" --annot file.bed --out Output_DupFinder
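Note that the read patterns are quoted so that the shell passes them to Nextflow unexpanded. Before a multi-sample short-read run, it can help to preview which files would form complete pairs. The sketch below assumes samples are named `<sample>_1.fastq` and `<sample>_2.fastq`, as in the pattern above, and that pairs are grouped by the shared prefix; `list_pairs` is a hypothetical helper, and uncompressed `.fastq` names are assumed (adapt the suffix for gzipped files).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "sample_id<TAB>R1<TAB>R2" for every complete
# pair in a directory, and warn on stderr about reads without a mate.
list_pairs() {
    local dir="$1" r1 r2 id
    for r1 in "$dir"/*_1.fastq; do
        [ -e "$r1" ] || continue          # no match: the glob stays literal
        r2="${r1%_1.fastq}_2.fastq"
        id="$(basename "${r1%_1.fastq}")"
        if [ -e "$r2" ]; then
            printf '%s\t%s\t%s\n' "$id" "$r1" "$r2"
        else
            printf 'warning: no mate found for %s\n' "$r1" >&2
        fi
    done
}

# Example:
# list_pairs .
```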
The outputs are described in the table below. They include the variant_calls folder, containing the CNV calls of the three callers; the duplicate_annot_calls folder, containing the annotated duplications; the merge_vcf folder; and the duplicated_gene folder, containing the gene duplications.
| No. | Type | Description |
|-----|------|-------------|
| 1 | folder | Bam_files: alignment files |
| 2 | folder | variant_calls: variant calling files |
| 3 | folder | duplication_annot_calls: filtered duplicated-region files |
| 4 | folder | detected_gene: detected gene duplication files |
| 5 | folder | merge_vcf: merged calls from all duplication callers |
Copyright © 2023 Assane Mbodj (assanembodj11@gmail.com)
Any question, concern, or bug report about the program should be posted as an Issue on GitHub. Before posting, please check previous issues (both Open and Closed) to see whether your issue has already been addressed. Please also follow good GitHub practices.