This tutorial assumes that fastq/fasta files have been converted to TFRecord.
To get a full list of options for `DeepMicrobes.py`:
DeepMicrobes.py --helpfull
The shell scripts used to make predictions with all tested DNNs in the paper can be found in pipelines.
In these scripts we use the `--model_name` option to tell `DeepMicrobes.py` which DNN architecture we would like to use.
The scripts of models called by `DeepMicrobes.py` are indicated in square brackets below.
The final best DNN:
- `attention`: Embed + LSTM + Attention (DeepMicrobes) [./models/embed_lstm_attention.py]

Other tested DNNs:
- `deep_cnn`: ResNet-like CNN [./models/resnet_cnn.py]
- `cnn_lstm`: CNN + LSTM [./models/cnn_lstm.py]
- `seq2species`: Seq2species [./models/seq2species.py]
- `embed_pool`: Embed + Pool [./models/embed_pool.py]
- `embed_cnn`: Embed + CNN [./models/embed_cnn.py]
- `embed_lstm`: Embed + LSTM [./models/embed_lstm.py]
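
The model names and script paths listed above can be collected into a small lookup table, for example to validate a `--model_name` value before launching a long run. This is a minimal sketch: the names and paths are copied from the list, while `MODEL_SCRIPTS` and `check_model_name` are hypothetical helpers and not part of DeepMicrobes.

```python
# Mapping from --model_name values to the model scripts listed in this tutorial.
MODEL_SCRIPTS = {
    "attention": "./models/embed_lstm_attention.py",
    "deep_cnn": "./models/resnet_cnn.py",
    "cnn_lstm": "./models/cnn_lstm.py",
    "seq2species": "./models/seq2species.py",
    "embed_pool": "./models/embed_pool.py",
    "embed_cnn": "./models/embed_cnn.py",
    "embed_lstm": "./models/embed_lstm.py",
}


def check_model_name(name):
    """Return the script path for a model name (hypothetical helper).

    Raises ValueError if the name is not one of the listed models.
    """
    if name not in MODEL_SCRIPTS:
        valid = ", ".join(sorted(MODEL_SCRIPTS))
        raise ValueError(f"unknown model name {name!r}; expected one of: {valid}")
    return MODEL_SCRIPTS[name]
```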
To make predictions on a metagenome dataset (referred to as `sample.tfrec`) using DeepMicrobes:
predict_DeepMicrobes.sh -i sample.tfrec -b 8192 -l species -p 8 -m model_dir -o prefix
Arguments:
- `-i`: TFRecord input containing interleaved paired-end reads
- `-m`: Directory containing model weights (should match the taxonomic level)
- `-o`: Output prefix
- `-b`: (Optional) Batch size (a multiple of 4) (default: 8192)
- `-l`: (Optional) Taxonomic level, species/genus (should match the weights) (default: species)
- `-p`: (Optional) Number of parallel calls for input preparation (default: 8)
Note: The model classifies sequences faster with a larger batch size. We recommend trying different values and selecting the largest batch size that fits into memory.
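
When the prediction script is launched from another program, it may help to check the argument constraints first. The sketch below assembles the command shown above and lists candidate batch sizes to try, from largest to smallest. `build_predict_command` and `candidate_batch_sizes` are hypothetical helpers; the halving strategy is an assumption for illustration, not a recommendation from the DeepMicrobes authors.

```python
import shlex


def build_predict_command(tfrec, model_dir, prefix,
                          batch_size=8192, level="species", parallel_calls=8):
    """Build the argument list for predict_DeepMicrobes.sh (hypothetical helper)."""
    # The tutorial states that the batch size must be a multiple of 4.
    if batch_size <= 0 or batch_size % 4 != 0:
        raise ValueError("batch size must be a positive multiple of 4")
    # Only the two taxonomic levels named in the tutorial are accepted here.
    if level not in ("species", "genus"):
        raise ValueError("level must be 'species' or 'genus'")
    return [
        "predict_DeepMicrobes.sh",
        "-i", tfrec,
        "-b", str(batch_size),
        "-l", level,
        "-p", str(parallel_calls),
        "-m", model_dir,
        "-o", prefix,
    ]


def candidate_batch_sizes(start=8192, minimum=4):
    """Yield batch sizes from `start` downward by halving.

    Each value is rounded down to a multiple of 4. Halving is an assumed
    search strategy for finding the largest size that fits into memory.
    """
    minimum = max(minimum, 4)
    size = start - start % 4
    while size >= minimum:
        yield size
        half = size // 2
        size = half - half % 4


if __name__ == "__main__":
    cmd = build_predict_command("sample.tfrec", "model_dir", "prefix")
    print(shlex.join(cmd))  # pass `cmd` to subprocess.run to execute it
    print(list(candidate_batch_sizes(8192, 1024)))
```

Starting from the largest candidate and moving down on an out-of-memory failure follows the advice in the note above; how such a failure is detected depends on the environment and is left out here.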
The script takes as input a TFRecord dataset and generates a tab-delimited output file containing predictions made on each pair of reads.
- 1st column: category labels (integer)
- 2nd column: confidence score (decimal)
The tab-delimited file can then be used to generate a species/genus profile (see next tutorial).
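
As a quick sanity check before building a profile, the two-column output can be read with the Python standard library. The sketch below counts predictions per category label above a confidence cutoff. It assumes the file has no header row; the cutoff of 0.5 and the example rows are made up for illustration, and `count_confident_predictions` is a hypothetical helper, not the project's profiling script.

```python
import csv
import io
from collections import Counter


def count_confident_predictions(lines, min_confidence=0.5):
    """Count predictions per category label at or above a confidence cutoff.

    `lines` is any iterable of tab-delimited text lines (for example an open
    file) with an integer label in column 1 and a decimal score in column 2.
    The default cutoff is an illustrative assumption.
    """
    counts = Counter()
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        label, score = int(row[0]), float(row[1])
        if score >= min_confidence:
            counts[label] += 1
    return counts


if __name__ == "__main__":
    # Made-up rows standing in for a real prediction file.
    example = io.StringIO("12\t0.93\n12\t0.41\n7\t0.88\n12\t0.75\n")
    print(count_confident_predictions(example))
```

To use it on a real result, pass an open file handle for the prediction file written under the chosen output prefix.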