The goal of nf-genomeassembly and nf-annotate is to make genome assembly and annotation workflows accessible to a broader community, particularly in the plant sciences. Long-read sequencing technologies are already cheap and will continue to drop in price, so genome sequencing will soon be available to many researchers without a strong bioinformatics background.
The assembly pipeline is largely organism-agnostic, but the annotation pipeline contains some steps that may not make sense for other eukaryotes unless there is a particular interest in NB-LRR genes.
I am currently preparing nf-genomeassembly to be added to nf-co.re/genomeassembler. Please see here for the latest version: genomeassembler
Assembly pipeline for genomes from long-read sequencing, written in nextflow.
The pipeline supports assembly from Oxford Nanopore (ONT) reads, PacBio HiFi reads, or a combination of ONT and PacBio HiFi reads, and can take short reads for quality control and / or polishing.
Preprocessing:
- For nanopore:
  - Extract all fastq.gz files in the readpath folder into a single fastq file. By default this is skipped, enable with --collect.
  - Barcodes and adaptors will be removed using porechop. By default this is skipped, enable with --porechop. NB: flye claims to work well on raw, un-trimmed reads.
  - Read QC is done via nanoq.
- For pacbio:
  - lima can be used to remove primers. By default this is skipped, enable with --lima.
Assembly
- k-mer based assessment of ONT reads via Jellyfish and genomescope.
- Assemblies are performed with flye or hifiasm.
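Polishing:
- Optional polishing with medaka (--polish_medaka) and / or, if short reads are available, with pilon (--polish_pilon).
Scaffolding:
- Optional scaffolding with LINKS, longstitch and / or ragtag.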
Annotation:
- Annotations are lifted from the reference using liftoff.
QC:
- Quality of each stage is assessed using QUAST and BUSCO (standalone).
- k-mer spectra can be used for further QC with yak.
- If short reads are provided, merqury is run to compare k-mer spectra between assemblies (or scaffolds) and short reads.
Clone this repo:
git clone https://github.com/nschan/nf-genomeassembly/
Run via nextflow:
The samplesheet is a .csv file with a header. It must adhere to this format, including the header row. Please note the absence of spaces after the commas:
sample,ontreads,hifireads,ref_fasta,ref_gff
sampleName,path/to/reads,path/to/hifi.fastq.gz,path/to/reference.fasta,path/to/reference.gff
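A samplesheet in this format can also be generated programmatically, which helps avoid stray spaces next to the commas. The following is a minimal sketch in Python and not part of the pipeline: the helper name `write_samplesheet` is hypothetical, and the paths are the placeholder values from the example row.

```python
import csv
import io

# Column order taken from the header row of the samplesheet format.
COLUMNS = ["sample", "ontreads", "hifireads", "ref_fasta", "ref_gff"]


def write_samplesheet(rows, handle):
    """Write rows (a list of dicts keyed by COLUMNS) as a samplesheet.

    Hypothetical helper for illustration; values are stripped so that
    no spaces end up next to the commas.
    """
    writer = csv.DictWriter(handle, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow({col: str(row.get(col, "")).strip() for col in COLUMNS})


# Build the sheet in memory; pass an open file handle instead to write to disk.
buffer = io.StringIO()
write_samplesheet(
    [
        {
            "sample": "sampleName",
            "ontreads": "path/to/reads",
            "hifireads": "path/to/hifi.fastq.gz",
            "ref_fasta": "path/to/reference.fasta",
            "ref_gff": "path/to/reference.gff",
        }
    ],
    buffer,
)
samplesheet_text = buffer.getvalue()
print(samplesheet_text, end="")
```

Columns that do not apply to a sample (for example hifireads when only ONT reads exist) are written as empty fields here; whether the pipeline accepts empty fields in every column is an assumption that should be checked against schema.md.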
To run the default pipeline with a samplesheet on biohpc_gen using charliecloud:
nextflow run nf-genomeassembly --samplesheet 'path/to/sample_sheet.csv' \
-profile charliecloud,biohpc_gen
See also schema.md
Parameter | Effect |
---|---|
General parameters | |
--samplesheet | Path to samplesheet |
--use_ref | Use a reference genome? Default: true |
--lift_annotations | Lift annotations from reference using liftoff? Default: true |
--out | Results directory. Default: './results' |
--ont | ONT reads are available? These should go into the ontreads column of the samplesheet. Default: false |
--hifi | PacBio HiFi reads are available? These should go into the hifireads column of the samplesheet. Default: false |
ONT Preprocessing | |
--collect | Are the provided reads a folder (true) or a single fastq file (false)? Default: false |
--porechop | Run porechop on ONT reads? Default: false |
PacBio Preprocessing | |
--lima | Run lima on PacBio reads? Default: false |
--pacbio_primers | Primers to be used with lima (required if --lima is used). Default: null |
Assembly | |
--assembler | Assembler to use. Valid choices are: 'hifiasm', 'flye', or 'flye_on_hifiasm'. flye_on_hifiasm will scaffold the flye assembly (ONT) onto the hifiasm assembly (HiFi) using ragtag. Default: 'flye' |
Assembly | flye specific arguments |
--flye_mode | The mode to be used by flye. Default: "--nano-hq"; options are: "--pacbio-raw", "--pacbio-corr", "--pacbio-hifi", "--nano-raw", "--nano-corr", "--nano-hq" |
--kmer_length | k-mer size for Jellyfish. Default: 21 |
--read_length | Read length for genomescope. If this is null (default), the median read length estimated by nanoq will be used. If this is not null, the given value will be used for all samples. |
--genome_size | Expected genome size for flye. If this is null (default), the haploid genome size for each sample will be estimated via genomescope. If this is not null, the given value will be used for all samples. |
--flye_args | Arguments to be passed to flye. Default: none. Example: --flye_args '--genome-size 130g --asm-coverage 50' |
Assembly | hifiasm specific arguments |
--hifi_ont | Use HiFi and ONT reads with hifiasm --ul? Default: false |
--hifiasm_args | Extra arguments passed to hifiasm. Default: '' |
Polishing | |
--polish_medaka | Polish using medaka? Default: false |
--medaka_model | Model used by medaka. Default: 'r1041_e82_400bps_hac@v4.2.0:consesus' |
--polish_pilon | Polish with short reads (see below) using pilon? Default: false |
Scaffolding | |
--scaffold_ragtag | Scaffolding with ragtag? Default: false |
--scaffold_links | Scaffolding with LINKS? Default: false |
--scaffold_longstitch | Scaffolding with longstitch? Default: false |
QC | |
--short_reads | Short reads available? These should go into the shortread_F and shortread_R columns, and the paired column should be true if both are filled. If only single-end reads are available, shortread_R remains empty and paired is false. If short reads are supplied, k-mer spectra will be used to assess the quality of the assembly or assemblies. Default: false |
--trim_short_reads | Trim short reads with trimgalore? Default: true |
--meryl_k | Value of k for meryl k-mers. Default: 21 |
--qc_reads | Long reads that should be used for QC when both ONT and HiFi reads are provided. Options are 'ONT' or 'HIFI'. Default: 'ONT' |
--busco | Run BUSCO? Default: true |
--busco_db | Path to local BUSCO db. Default: "" |
--busco_lineage | BUSCO lineage to use. Default: brassicales_odb10 |
--quast | Run QUAST? Default: true |
Skipping steps | |
--skip_assembly | Skip assembly? Requires different samplesheet (!). Default: false |
--skip_alignments | Skip alignments with minimap2? Requires different samplesheet (!). Default: false |
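When many of these parameters are set at once, the command line gets long. Nextflow can read parameters from a JSON or YAML file via its `-params-file` option, so one way to keep a run reproducible is to write the chosen values to a file. The following is a rough sketch, not something shipped with the pipeline: the selection of values is arbitrary, the file name params.json is an assumption, and it has not been verified that every parameter of this pipeline behaves identically when supplied this way.

```python
import json

# Parameter names come from the table above, without the leading "--".
# The values are an arbitrary example, not recommended settings.
params = {
    "samplesheet": "path/to/sample_sheet.csv",
    "ont": True,
    "assembler": "flye",
    "polish_medaka": True,
    "scaffold_ragtag": True,
    "out": "./results",
}

# Serialise to JSON; write this string to a file (e.g. params.json) to use it.
params_json = json.dumps(params, indent=2)
print(params_json)
```

The resulting file could then be passed as `nextflow run nf-genomeassembly -params-file params.json -profile charliecloud,biohpc_gen`.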
This pipeline comes with some profiles that modify run behaviour independently of infrastructure configs. They can be used via -profile.
Name | Contents |
---|---|
ont_flye | Assemble ONT reads with flye |
hifi_flye | Assemble PacBio HiFi reads with flye |
hifi_hifiasm | Assemble PacBio HiFi reads with hifiasm |
hifi_ul | Assemble ONT and HiFi reads via hifiasm |
ont_on_hifi | Assemble HiFi reads (via hifiasm) and ONT reads (via flye), then scaffold the ONT assembly onto the HiFi assembly with ragtag |
If short reads are available, yak can be used to perform additional quality control based on k-mer spectra.
This can be enabled using --short_reads and a samplesheet that looks like this:
sample,ontreads,hifireads,ref_fasta,ref_gff,shortread_F,shortread_R,paired
sampleName,ontreads.fa.gz,hifireads.fa.gz,reference.fasta,reference.gff,short_F1.fastq,short_F2.fastq,true
If there are only single-end reads, shortread_R should remain empty and paired should be false.
The assemblies can be polished with available short reads using pilon:
--polish_pilon
This requires additional information in the samplesheet: shortread_F, shortread_R and paired:
sample,ontreads,ref_fasta,ref_gff,shortread_F,shortread_R,paired
sampleName,reads,reference.fasta,reference.gff,short_F1.fastq,short_F2.fastq,true
If only single-end reads are available, shortread_R should be empty and paired should be false.
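The rules for the short-read columns can be checked before launching a run. The following is a rough sketch in Python, not part of the pipeline: `check_short_read_rows` is a hypothetical helper, it assumes that paired is written as the literal `true` or `false`, and it assumes the header contains the shortread_R and paired columns. It also flags rows whose number of fields differs from the header, which is an easy mistake to make when columns are added.

```python
import csv
import io


def check_short_read_rows(text):
    """Return a list of problems found in samplesheet text.

    Illustrative only: checks that every row has as many fields as the
    header, and that `paired` agrees with whether shortread_R is filled.
    """
    problems = []
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    for line_no, fields in enumerate(reader, start=2):
        if len(fields) != len(header):
            problems.append(
                f"line {line_no}: expected {len(header)} fields, got {len(fields)}"
            )
            continue
        row = dict(zip(header, fields))
        has_reverse = row["shortread_R"] != ""
        is_paired = row["paired"] == "true"
        if has_reverse != is_paired:
            problems.append(
                f"line {line_no}: paired={row['paired']!r} does not match shortread_R"
            )
    return problems


HEADER = "sample,ontreads,ref_fasta,ref_gff,shortread_F,shortread_R,paired\n"
# Paired-end: both short-read columns filled, paired is true.
paired_sheet = HEADER + "s1,reads,reference.fasta,reference.gff,F.fastq,R.fastq,true\n"
# Single-end: shortread_R left empty, paired is false.
single_sheet = HEADER + "s1,reads,reference.fasta,reference.gff,F.fastq,,false\n"
# Inconsistent: shortread_R empty but paired is true.
bad_sheet = HEADER + "s1,reads,reference.fasta,reference.gff,F.fastq,,true\n"

for name, sheet in [("paired", paired_sheet), ("single", single_sheet), ("bad", bad_sheet)]:
    print(name, check_short_read_rows(sheet))
```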
LINKS, longstitch and / or ragtag can be used for scaffolding.
If --lift_annotations is used (default), the annotations from the reference genome will be mapped to assemblies and scaffolds using liftoff.
This will happen at each step of the pipeline where a new genome fasta is created, i.e. after assembly, after polishing and after scaffolding.
If there is no reference genome available, use --use_ref false to disable the reference genome.
Liftoff should not be used without a reference, and QUAST will no longer compare to the reference.
If you already have an assembly and would only like to check it with QUAST and polish it, use:
--skip_assembly true
This mode requires a different samplesheet:
sample,readpath,assembly,ref_fasta,ref_gff
sampleName,path/to/reads,assembly.fasta.gz,reference.fasta,reference.gff
When skipping assembly, the original reads will be mapped to the assembly and the reference genome.
If you have an assembly and have already mapped your reads to both the assembly and the reference genome, you can use:
--skip_assembly true --skip_alignments true
This mode requires a different samplesheet:
sample,readpath,assembly,ref_fasta,ref_gff,assembly_bam,assembly_bai,ref_bam
sampleName,reads,assembly.fasta.gz,reference.fasta,reference.gff,reads_on_assembly.bam,reads_on_assembly.bai,reads_on_reference.bam
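Since each mode expects a different header, it can help to look up the expected columns from the flags in use. The following is a small sketch in Python with the column lists copied from the example headers in this README. The function `required_columns` is a hypothetical helper, not pipeline code. Rejecting --skip_alignments without --skip_assembly is an assumption, made because only the combined use is described here.

```python
def required_columns(skip_assembly=False, skip_alignments=False):
    """Return the samplesheet header expected for a given mode.

    Column lists are copied from the example headers in this README;
    short-read columns (shortread_F, shortread_R, paired) are not covered.
    """
    if skip_alignments and not skip_assembly:
        # Assumption: only the combination with --skip_assembly is documented.
        raise ValueError("--skip_alignments is only described together with --skip_assembly")
    if not skip_assembly:
        return ["sample", "ontreads", "hifireads", "ref_fasta", "ref_gff"]
    columns = ["sample", "readpath", "assembly", "ref_fasta", "ref_gff"]
    if skip_alignments:
        columns += ["assembly_bam", "assembly_bai", "ref_bam"]
    return columns


print(",".join(required_columns()))
print(",".join(required_columns(skip_assembly=True)))
print(",".join(required_columns(skip_assembly=True, skip_alignments=True)))
```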
QUAST will run with the following additional parameters:
--eukaryote \\
--glimmer \\
--conserved-genes-finding \\