- clone the repository from github:

  ```sh
  git clone git@github.com:brendanf/optimotu_targets.git
  ```

  This will by default create a directory called `optimotu_targets` for your project, but you can put it in a different directory using `git clone git@github.com:brendanf/optimotu_targets.git {name_of_directory}`.
- (recommended) start a new branch; if you will do several projects, it is recommended to put each branch in a separate directory using `git worktree`. This allows them to share one copy of the `.git` directory, as well as `protaxFungi`.

  ```sh
  cd optimotu_targets # or another directory name if you chose one
  git worktree add ../{name_of_project}
  ```

  Or, if you only plan on doing one project, you can just create the branch in the main directory:

  ```sh
  cd optimotu_targets # or another directory name if you chose one
  git checkout -b {name_of_project}
  ```
- download protaxFungi and unzip into a sister directory (i.e., `../protaxFungi`). This currently does not work, because the GitHub version of protaxfungi is not the latest version.

  ```sh
  cd ..
  wget https://github.com/psomervuo/protaxfungi/raw/master/protaxfungi.tgz
  tar -xf protaxfungi.tgz
  ```

  To get the current version, on Puhti (Finnish computing cluster) only:

  ```sh
  cd ..
  cp -r /scratch/project_2003156/protaxFungi .
  ```
- download usearch into `bin/`, and rename or link the executable to be called just `bin/usearch`. Make sure it is executable.

  ```sh
  cd {name_of_project}/bin
  wget https://www.drive5.com/downloads/usearch11.0.667_i86linux32.gz
  gunzip usearch11.0.667_i86linux32.gz
  ln -s usearch11.0.667_i86linux32 usearch
  chmod +x usearch
  cd ..
  ```
- download FastqIndEx into `bin/` and unzip. Make sure it is executable.

  ```sh
  cd bin
  wget https://github.com/DKFZ-ODCF/FastqIndEx/releases/download/0.9.0b/fastqindex_0.9.0b.gz
  gunzip fastqindex_0.9.0b.gz
  chmod +x fastqindex_0.9.0b
  cd ..
  ```
- download reference data for the UNITE sh_matching pipeline and unzip in `data/sh_matching_data`. (Currently only `sanger_refs_sh.fasta` and `shs_out.txt` are used, so you can delete everything else.)

  ```sh
  wget https://s3.hpc.ut.ee/plutof-public/original/5dcbe93a-b50c-4ff5-8898-528160c4e593.zip
  unzip -j 5dcbe93a-b50c-4ff5-8898-528160c4e593.zip data/shs_out.txt data/sanger_refs_sh.fasta -d data/sh_matching_data
  rm 5dcbe93a-b50c-4ff5-8898-528160c4e593.zip
  ```
- download or link your own demultiplexed, paired-end Illumina reads into `sequences/01_raw`. Each sequencing run should be in its own directory.

  ```sh
  mkdir -p sequences/01_raw
  cd sequences/01_raw
  # do your download here
  cd ../..
  ```

  The actual download might be done with `cp` or `ln -s` if the reads are already somewhere else on your system; `wget` for a general HTTP or FTP address; `bs download project --name` from Illumina BaseSpace; or `rclone copy` from AWS or another object storage platform (e.g., Allas).

- (optional) if there are particular species that you are interested in, write their full species identifiers as used in ProtaxFungi (`Genus_species_mycobank#`, e.g. `Rhodofomes_roseus_127496`), one per line, in a file `data/target_taxa.txt`.

- (optional) if you have additional reference sequences for species of interest (which may or may not also be in `target_taxa.txt`), then you can create two files: `data/culture_refs.fasta` with the actual sequences, and `data/culture_sequences.xlsx` with metadata. In particular, `data/culture_sequences.xlsx` must include, in the first worksheet, a column “Culture ID” giving the name of each reference sequence exactly as it appears in `data/culture_refs.fasta`, and a column “Protax_synonym” giving the full, comma-separated taxonomy for each reference in Protax format, e.g. `Fungi,Basidiomycota,Agaricomycetes,Polyporales,Fomitopsidaceae,Antrodia_17083,Antrodia_piceata_813073`. Adding reference sequences which are not annotated fully to species (e.g., `Fungi,Basidiomycota,Agaricomycetes,Polyporales,Fomitopsidaceae`) likely works but has not been tested.

- Install R dependencies. This can be done in several ways:

  (using renv) start `R`, then:

  ```r
  # if renv is not installed
  install.packages("renv")
  renv::init() # this will have failed to install github and bioconductor packages
  renv::install("brendanf/optimotu", "bioconductor::dada2")
  renv::snapshot()
  ```

  (using conda)

  ```sh
  conda env create -f conda/its2_taxonomy_first.yaml
  conda activate its2_taxonomy_first
  ```

  (using tykky on puhti)

  ```sh
  module load tykky
  mkdir /projappl/{your_csc_project}/{your_project}
  conda-containerize new --prefix /projappl/{your_csc_project}/{your_project} conda/OptimOTU_v2.yaml
  export PATH="/projappl/{your_csc_project}/{your_project}:$PATH"
  ```

  (using existing tykky container on puhti)

  ```sh
  export PATH="/projappl/project_2003156/its2_taxonomy_first/bin:$PATH"
  ```
Configuration of the pipeline is done through a (more or less) human-readable file called `pipeline_options.yaml`. Comments in the file (beginning with `#`) explain the function of most of the options. Listed here are a few types of options you may want/need to change depending on different characteristics of your data. Also see the example files `pipeline_options_example_fungi.yaml` and `pipeline_options_example_metazoa.yaml`.
The first option is `project_name`. The default option is “metabarcoding_project”, but the pipeline will issue a warning if you don’t change it. The only thing this option actually controls is the name of the zip file containing all outputs for a run. As such it should be a good file name (ideally underscores instead of spaces, not too long). It could be (but does not have to be) the same name as your project directory/git branch.
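For example, the corresponding entry in `pipeline_options.yaml` might look like this (the value shown is hypothetical):

```yaml
# name used for the zip file of pipeline outputs (hypothetical example value)
project_name: "boreal_soil_ITS2_2024"
```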
By default, the OptimOTU pipeline auto-detects sample and sequencing-run names based on file names. It strips some common Illumina elements from the file names and assumes that what is left is the sample name; it also assumes the samples are saved in directories which correspond to sequencing runs.
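As a hypothetical illustration of this auto-detection (the file name and directory layout below are invented, not taken from the pipeline):

```
sequences/01_raw/run01/Sample-A_S1_L001_R1_001.fastq.gz
  -> sequencing run "run01" (directory name), sample "Sample-A" (Illumina suffix stripped)
```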
If your situation is more complex than this, you may need to use a custom sample table. This is a TSV file which includes, at a minimum, columns named “`sample`”, “`seqrun`”, “`fastq_R1`”, and “`fastq_R2`”. These give, for each sample, the sequencing run it came from and the file paths for the R1 and R2 fastq files.
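A minimal custom sample table might look like this (all sample names and paths are hypothetical). The columns must be tab-separated; extra spaces are used below only to align them for readability:

```
sample  seqrun  fastq_R1                                 fastq_R2
S001    run01   sequences/01_raw/run01/S001_R1.fastq.gz  sequences/01_raw/run01/S001_R2.fastq.gz
S002    run02   sequences/01_raw/run02/S002_R1.fastq.gz  sequences/01_raw/run02/S002_R2.fastq.gz
```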
One possible reason to use a custom sample table is to define the orientation of reads. Some lab workflows result in read pairs which all have the same orientation, while others result in read pairs which are in mixed orientation; finally, others (in order to increase sequence diversity) are all in the same orientation within each sample, but this varies between samples. All of these cases are supported by the `orient` option; the case where orientations vary between samples, however, requires you to designate the orientation of each sample, which can be done using a custom sample table with an “`orient`” column.
The default `pipeline_options.yaml` file (and also `pipeline_options_example_fungi.yaml`) is set up for ITS2 amplicons generated using the primer pair ITS3-ITS4. `pipeline_options_example_metazoa.yaml` is set up for COI amplicons using the primer pair BF3-BR2. The following options need to be changed when switching to a different primer pair:
- `forward_primer` and `reverse_primer` should have the sequences of the primers. Ambiguous bases (N, Y, M, R, …) are OK.
- The settings under `amplicon_model` will need to be changed. Full instructions for generating a model (HMM or CM) for a new amplicon are beyond the scope of this README, although the process will be automated in a future version of OptimOTU. The easiest solution is to disable model alignment entirely, by setting `model_type` to “`none`”.
  - `model_type`: “`CM`” is specifically for non-coding RNA genes (like ITS, 16S, LSU, …). CMs can be created using the program `cmbuild` from Infernal. “`HMM`” is a more general type of model which can be used for non-coding RNA, but also for protein-coding genes or sequences in general. HMMs can be created using the program `hmmbuild` from HMMER.
  - `model_file`: location of the HMM or CM file, relative to the root of the project.
  - `model_filter`: these parameters should be adjusted when using a new CM or HMM. Especially `min_model_end` (i.e., how far into the model the amplicon must extend) and `min_model_score` (i.e., how well the amplicon must match the model) are typically higher for longer models.
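As a sketch, the relevant options might look roughly like this (the primer sequences are the standard ITS3/ITS4 primers mentioned above; the exact nesting is inferred from the option names here, so compare with the example files for the authoritative layout):

```yaml
# sketch only -- check pipeline_options_example_*.yaml for the exact structure
forward_primer: "GCATCGATGAAGAACGCAGC"   # ITS3; replace with your forward primer
reverse_primer: "TCCTCCGCTTATTGATATGC"   # ITS4; replace with your reverse primer
amplicon_model:
  model_type: "none"   # simplest option for a new amplicon: disable model alignment
```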
OptimOTU currently uses two different versions of Protax for taxonomic
identification: ProtaxFungi and ProtaxAnimal. Although these are named
for the taxonomic group they were designed for, they are also completely
different programs which use a different file structure for their
trained models. It is (in theory) possible to train a model for the
ProtaxFungi software for any amplicon and any group of organisms, given
a defined taxonomy and a set of labeled reference sequences.
ProtaxAnimal is more restrictive but also much faster. It requires that
all input sequences, both references and queries, are globally aligned.
This alignment can be performed in the OptimOTU pipeline by setting `model_align` (under `amplicon_model`) and `aligned` (under `protax`) both to “`yes`”. The location of the Protax installation directory should also be given as `location` under `protax`.
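For ProtaxAnimal with aligned sequences, the corresponding options would look roughly like this (the nesting is inferred from the option names above, and the `location` value is hypothetical; check the example files for the exact layout):

```yaml
amplicon_model:
  model_align: "yes"        # globally align sequences to the amplicon model
protax:
  aligned: "yes"            # tell Protax that the input is aligned
  location: "protaxAnimal"  # path to the Protax installation (hypothetical)
```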
Run the full pipeline:

```sh
R -e 'targets::tar_make(callr_function=NULL)'
```

Test that samples are correctly detected:

```sh
Rscript _targets.R
```

Check what targets need to be run:

```sh
R -e 'targets::tar_outdated(callr_function=NULL)'
```

Run to generate a specific target:

```sh
R -e 'targets::tar_make({name_of_target}, callr_function=NULL)'
```

Check if a specific target (or its dependencies) needs to be run:

```sh
R -e 'targets::tar_outdated({name_of_target}, callr_function=NULL)'
```
Run the full pipeline:

```r
targets::tar_make()
```

Test that samples are correctly detected:

```r
source("_targets.R")
# you can also interactively investigate the sample_table object
sample_table
sample_table$sample
sample_table$seq_run
# etc.
```

Check what targets need to be run:

```r
targets::tar_outdated()
```

Run to generate a specific target:

```r
targets::tar_make({name_of_target})
```

Check if a specific target (or its dependencies) needs to be run:

```r
targets::tar_outdated({name_of_target})
```
You may need to modify `run_node.sh`, for instance to change the project or tykky container.
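The lines to check in `run_node.sh` are typically the Slurm account and the tykky container `PATH`; a hedged sketch (the directive value below is hypothetical, and the actual script may differ):

```shell
#SBATCH --account=project_2003156   # change to your own CSC project
# point PATH at your own tykky container
export PATH="/projappl/{your_csc_project}/{your_project}/bin:$PATH"
```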
Run the full pipeline:

```sh
sbatch run_node.sh
```

Test that samples are correctly detected (on login node):

```sh
# first line only needed once per session
export PATH="/projappl/{your_csc_project}/{your_project}/bin:$PATH"
Rscript _targets.R
```

Check what targets need to be run (on login node):

```sh
bash run_node.sh test
```

Run to generate a specific target:

```sh
sbatch run_node.sh {name_of_target}
```

Check if a specific target (or its dependencies) needs to be run:

```sh
bash run_node.sh test {name_of_target}
```
You probably need to modify `run_crew.sh`, `run_crew.R`, and `slurm/puhti_crew.tmpl`, for instance to change the project or tykky container (`.sh` and `.tmpl`) or the number of workers (`.R`).
Run the full pipeline:

```sh
sbatch run_clustermq.sh
```

Test that samples are correctly detected (on login node):

```sh
# first line only needed once per session
export PATH="/projappl/{your_csc_project}/{your_project}:$PATH"
Rscript _targets.R
```

Check what targets need to be run (on login node):

```sh
bash run_node.sh test
```

Run to generate a specific target:

```sh
sbatch run_clustermq.sh {name_of_target}
```

Check if a specific target (or its dependencies) needs to be run (on login node):

```sh
bash run_node.sh test {name_of_target}
```
It should be possible to run on other Slurm-based HPC systems by additional modification of `run_node.sh`, `run_clustermq.sh`, `run_clustermq.R`, and `slurm/puhti_clustermq.tmpl`. The least portable element is the tykky containerization. In the future, a Singularity container will be provided to make installation on other systems easier.