GAAP-BI v. 1.0 - Genome Assembly and Annotation Pipeline for Bacteria Illumina
MIT LICENSE - Copyright © 2022 Gian M.N. Benucci, Ph.D.
email: benucci[at]msu[dot]edu
GAAP-BI v.1, October 2022
This pipeline is based upon work supported by the Great Lakes Bioenergy Research Center, U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research under award DE-SC0018409
WARNING
To use GAAP-BI, just clone the repository:
git clone https://github.com/Gian77/GAAP-BI.git
You will then need to:
- Copy your raw reads files into the rawdata directory together with an md5sum file.
- Install all the necessary tools through conda (please see the complete list of tools reported below).
- Download the databases (please see below and the config.yaml file), and include the full paths in the config file.
- Select all the user options (please see below).
Then you should be good to go and can run GAAP-BI with: sh GAAP-BI-v1.0.sh
The main project directory is project_dir="/mnt/home/benucci/GAAP-BI/". Of course, you will need to adjust the path to your own HPCC user name.
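Before launching the pipeline, it can help to confirm that the raw reads copied into rawdata match their checksums. The sketch below is one possible way to do that. The function name verify_rawdata and the checksum file name md5.txt are assumptions for illustration and are not part of GAAP-BI.

```shell
# Sketch: verify raw reads against an md5sum file before running the pipeline.
# ASSUMPTION: the checksum file (md5.txt by default, a hypothetical name) lists
# file names relative to the rawdata directory, as written by "md5sum *.fastq.gz".

verify_rawdata() {
    # $1 = rawdata directory, $2 = checksum file name (default: md5.txt)
    if [ ! -f "$1/${2:-md5.txt}" ]; then
        echo "checksum file not found: $1/${2:-md5.txt}" >&2
        return 2
    fi
    # md5sum -c recomputes every hash listed in the file; --quiet hides the
    # "OK" lines. The subshell keeps the caller's working directory unchanged.
    ( cd "$1" && md5sum -c --quiet "${2:-md5.txt}" )
}

# Example: start the run only if every checksum matches.
#   verify_rawdata "$project_dir/rawdata" md5.txt && sh GAAP-BI-v1.0.sh
```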
NOTE
- This pipeline runs using SLURM (please see above). The resources requested by each individual script in the /mnt/home/benucci/GAAP-BI/code/ directory must be adjusted to the amount of data you want to process in each pipeline run, in particular the parameters below.
  #SBATCH --time=00:30:00
  #SBATCH --nodes=1
  #SBATCH --ntasks=1
  #SBATCH --cpus-per-task=48
  #SBATCH --mem=256G
- The individual scripts in the code directory include the buy-in node priority line #SBATCH -A shade-cole-bonito. If you do not have access to those priority nodes, please remove that line from the individual scripts.
- You can change the name of the project_dir, but by default it is going to be GAAP-BI. Subdirectories such as outputs and slurms are part of the workflow and should be left as they are.
- Please check the config file for options. A few scripts are additional and can be skipped to save time.
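The SLURM adjustments described in the notes above can be scripted rather than edited by hand. The following is a minimal sketch, assuming GNU sed and that every directive sits on its own line in the form "#SBATCH --name=value". The helper names set_sbatch and drop_sbatch_account are hypothetical and do not ship with GAAP-BI.

```shell
# Sketch: adjust SLURM directives in a pipeline script.
# ASSUMPTIONS: GNU sed (sed -i without a backup suffix) and one directive per
# line written as "#SBATCH --name=value"; both helper names are hypothetical.

# set_sbatch FILE OPTION VALUE -> rewrites "#SBATCH --OPTION=..." in place
set_sbatch() {
    sed -i -E "s|^(#SBATCH[[:space:]]+--$2=).*|\1$3|" "$1"
}

# drop_sbatch_account FILE -> deletes the buy-in account line ("#SBATCH -A ...")
drop_sbatch_account() {
    sed -i -E '/^#SBATCH[[:space:]]+(-A|--account)([[:space:]=]|$)/d' "$1"
}

# Example: give every script in code/ four hours and remove the account line.
#   for s in "$project_dir"/code/*; do
#       set_sbatch "$s" time 04:00:00
#       drop_sbatch_account "$s"
#   done
```

Working on a copy of the code directory first is a cheap way to check the result before submitting any job.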
When the config is sourced, the working directory defaults to the rawdata directory of the project:
cd $project_dir/rawdata/
User options that are not numeric (e.g. a float) take yes or no.
INTERLACED=yes
COVERAGE=2.0
LENGTH=500
MASH=yes
CHROMOSOMES=2
EGGNOG=yes
BAKTA=no
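A typo in one of these options may only surface after a job has been queued, so the values can be sanity-checked first. The sketch below assumes that flags must be exactly yes or no and that numeric options are non-negative integers or decimals; the pipeline itself may be more lenient. The helper names are hypothetical.

```shell
# Sketch: sanity-check user options before a run.
# ASSUMPTIONS: flags are exactly "yes" or "no"; numeric options are
# non-negative integers or decimals. Helper names are hypothetical.

is_flag() {
    case "$1" in
        yes|no) return 0 ;;
        *)      return 1 ;;
    esac
}

is_number() {
    # Accepts e.g. 2, 500, 2.0; rejects empty strings, signs, and text.
    printf '%s\n' "$1" | grep -Eq '^[0-9]+(\.[0-9]+)?$'
}

# check_option KIND NAME VALUE, where KIND is "flag" or "number"
check_option() {
    if "is_$1" "$3"; then
        return 0
    fi
    echo "$2: unexpected value '$3' (expected a $1)" >&2
    return 1
}

# Example, after sourcing the config:
#   check_option flag   INTERLACED "$INTERLACED"
#   check_option number COVERAGE   "$COVERAGE"
```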
The INTERLACED variable is for when reads are interlaced (i.e. R1 and R2 are in the same file); if they are interlaced, specify yes.
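If it is unclear whether a file is interlaced, the first two read names usually tell. The sketch below is a heuristic, not something GAAP-BI does: it assumes a standard four-line FASTQ on standard input and treats the file as interlaced when the first two records share the same read name once the mate suffix ("/1", "/2") or the comment after the first space is removed. The function name looks_interlaced is hypothetical.

```shell
# Sketch: heuristic check for interlaced (interleaved) paired-end FASTQ.
# ASSUMPTIONS: four lines per record, uncompressed data on stdin, and mates
# named either "@id/1" + "@id/2" or "@id 1:..." + "@id 2:...".

looks_interlaced() {
    # Headers of the first two records are lines 1 and 5.
    ids=$(sed -n '1p;5p' | sed -E 's/[[:space:]].*$//; s|/[12]$||')
    first=$(printf '%s\n' "$ids" | sed -n '1p')
    second=$(printf '%s\n' "$ids" | sed -n '2p')
    # Interlaced if both names exist and are identical.
    [ -n "$first" ] && [ "$first" = "$second" ]
}

# Example with a gzipped file (hypothetical file name):
#   gzip -cd sample.fastq.gz | head -n 8 | looks_interlaced && echo "INTERLACED=yes"
```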
The COVERAGE and LENGTH variables are self-explanatory and refer to the assembled contigs.
The STRING variable differentiates hifi from raw pacbio reads. Usually ccs is present in the raw file name; when found, it activates the hifi mode in the Flye assembler. If you want to differentiate using a different string, you can change it accordingly.
The MASH variable is for running plasmid detection using Mash.
The CHROMOSOMES variable is used to circularize the genome (default 2: one contig is the bacterial genome and the other contig is the plasmid). However, it is known that some bacteria have multiple circular chromosomes (e.g., Rhodobacter sphaeroides). Feel free to increase it to however many chromosomes you think the taxon you are working on may have. If the number of contigs is less than the number of specified chromosomes, then Circlator is run on the contigs to circularize them.
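The decision described above can be sketched as follows. This only illustrates the comparison stated in the text and is not taken from the pipeline scripts: it assumes the assembly is a plain FASTA file in which every contig starts with ">", and the function names are hypothetical.

```shell
# Sketch: decide whether circularization would be attempted.
# ASSUMPTIONS: plain FASTA input, one ">" header per contig; function names
# are hypothetical and not part of GAAP-BI.

count_contigs() {
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true"
    grep -c '^>' "$1" || true
}

should_circularize() {
    # $1 = assembly FASTA, $2 = value of CHROMOSOMES
    [ "$(count_contigs "$1")" -lt "$2" ]
}

# Example (hypothetical file name):
#   if should_circularize assembly.fasta "$CHROMOSOMES"; then
#       echo "contig count is below CHROMOSOMES: Circlator would be run"
#   fi
```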
The EGGNOG variable is for running eggnog-mapper to re-classify the proteins detected by Prokka.
The BAKTA variable is for running the bakta annotation pipeline.
For details on the tools mentioned above, see below.
Find a place in your lab space (or in your home directory) where you can put all the needed databases.
Correct the directory hierarchy according to your HPCC account name. For example, in my case all the databases are in /mnt/research/ShadeLab/Benucci/databases/.
NCBI_nt="/mnt/research/ShadeLab/Benucci/databases/ncbi_nt1121"
kraken2_db="/mnt/research/ShadeLab/Benucci/databases/kraken2_db/"
minikraken2_db="/mnt/research/ShadeLab/Benucci/databases/minikraken2_db/"
platon_db="/mnt/research/ShadeLab/Benucci/databases/platon_db/db/"
mash_plsdb="/mnt/research/ShadeLab/Benucci/databases/plasmid_db/plsdb.msh"
mash_plsdb_meta="/mnt/research/ShadeLab/Benucci/databases/plasmid_db/plsdb.tsv"
busco_db="/mnt/research/ShadeLab/Benucci/databases/busco_db1121/bacteria_odb10"
phix_db="/mnt/research/ShadeLab/Benucci/databases/phix_index/my_phix"
gunc_db="/mnt/research/ShadeLab/Benucci/databases/gunc_db"
bakta_db="/mnt/research/ShadeLab/Benucci/databases/bakta_db/db"
export GTDBTK_DATA_PATH=/mnt/research/ShadeLab/Benucci/databases/gtdb_tk/release207_v2
export EGGNOG_DATA_DIR=/mnt/research/ShadeLab/Benucci/databases/emapperdb/
export GUNC_DB=/mnt/research/ShadeLab/Benucci/databases/gunc_db/gunc_db_progenomes2.1.dmnd
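Since every path above has to be edited by hand, a quick existence check can catch a wrong path before a job fails. The sketch below is an illustration only; the function name missing_paths is hypothetical. Values such as phix_db look like index prefixes rather than real files (an assumption based on the path), so they are left out of the example call.

```shell
# Sketch: report configured database paths that do not exist.
# ASSUMPTION: each argument is a real file or directory; index prefixes
# (phix_db appears to be one) would need their parent directory checked instead.

missing_paths() {
    rc=0
    for p in "$@"; do
        if [ ! -e "$p" ]; then
            echo "missing: $p"
            rc=1
        fi
    done
    return "$rc"
}

# Example, after sourcing the config:
#   missing_paths "$kraken2_db" "$platon_db" "$mash_plsdb" "$mash_plsdb_meta" \
#                 "$busco_db" "$bakta_db" "$GTDBTK_DATA_PATH" "$GUNC_DB"
```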
Please install via conda (or use the binaries available on the HPCC) all the software listed below.
FastQC, NanoStat, Flye, Minimap2, Racon, Pilon, Circlator, quast, qualimap, BLAST, BlobTools, Platon, Mash, Kraken 2, Bracken, CheckM, Gunc, GTDB-Tk, Barrnap, Metaxa2, eggNOG, Prokka, BUSCO, ABRicate, MultiQC, Bandage, R.
NOTE
Make sure you install tidyverse and ggplot2 along with R.
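With this many tools, it may be worth confirming that each one is reachable from the active conda environment before submitting jobs. The sketch below is one way to do it. The function name missing_tools is hypothetical, and the executable names in the example are assumptions about how the packages expose their commands; adjust them to your installation.

```shell
# Sketch: list required commands that are not on PATH.
# ASSUMPTION: the executable names passed in match your installation.

missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "not found: $tool"
            rc=1
        fi
    done
    return "$rc"
}

# Example (executable names are assumptions):
#   missing_tools fastqc quast.py blastn mash kraken2 prokka busco multiqc Rscript
# The R packages can be checked in a similar spirit:
#   Rscript -e 'library(tidyverse); library(ggplot2)'
```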
Please check the config file for options. A few scripts are additional and can be skipped if needed.
All the useful results, generated reports, and SLURM outputs for each script and for each assembled genome will be available as a tar.gz file at the path $project_dir/outputs/25_exportreports-bash. If you are interested in all the other files generated by the tools run by the pipeline, you can explore the directories in $project_dir/outputs/ before erasing it.
Many thanks to the Institute for Cyber-Enabled Research (ICER) for helping to troubleshoot SLURM, the DOE Joint Genome Institute for providing the sequence data, methods, and metadata support, the whole crew at the Shade lab for brainstorming over methods and tools, and my good friend and colleague, Dr. Livio Antonielli at the AIT Austrian Institute of Technology GmbH, for giving me a baseline and for the insightful discussions and bioinformatic suggestions.