Commit
Merge branch 'dev' into 'master'
Dev

See merge request research/medaka!52
cjw85 committed Jan 22, 2018
2 parents 011d0ba + 5e8218e commit d9ce1b3
Showing 20 changed files with 1,942 additions and 643 deletions.
3 changes: 2 additions & 1 deletion Makefile
@@ -37,9 +37,10 @@ venv/bin/activate:
test -d venv || virtualenv venv --python=python3
${IN_VENV} && pip install pip --upgrade
${IN_VENV} && pip install numpy # needs to get done before other things
${IN_VENV} && pip install -r requirements.txt

install: venv | $(addprefix $(BINCACHEDIR)/, $(BINARIES))
${IN_VENV} && pip install -r requirements.txt && python setup.py install
${IN_VENV} && python setup.py install

test: install
${IN_VENV} && python -m unittest discover
8 changes: 8 additions & 0 deletions README.md
@@ -13,8 +13,16 @@ being released in an effort to allow researchers to experiment without having
to write much of the tedious data preparation code. Researchers can extend the
tools provided to achieve better results than those obtained currently.

Development of medaka as a "base-space" consensus tool has been paused to focus
on methods which exploit the nanopore signal data. Nevertheless, researchers may
find medaka useful as a method to quickly generate sufficiently accurate
consensus sequences in many use cases; see the
[Benchmarks](https://nanoporetech.github.io/medaka/benchmarks.html) page for
further details.

Documentation can be found at https://nanoporetech.github.io/medaka/.


Installation
------------

123 changes: 78 additions & 45 deletions docs/benchmarks.rst
@@ -1,49 +1,82 @@
Benchmarks
==========

The following demonstrates the utility of a recurrent neural network trained
to make corrections of a draft assembly by considering counts of bases and
indels in pileup columns obtained from reads aligned to the draft, as
implemented with :ref:`SequenceCorrection`.

All models here were trained from reads aligning to a region of E.coli and
tested on a distinct, non-overlapping region. In all cases
`scrappie <https://github.com/nanoporetech/scrappie>`_ was used to perform
the basecalling; the 6mer transducer model was trained specifically for this
experiment. The draft assemblies here were created using the
`mini_assemble <https://nanoporetech.github.io/pomoxis/examples.html#fast-de-novo-assembly>`_
pipeline in `pomoxis <https://github.com/nanoporetech/pomoxis>`_. Statistics
were calculated using `alignqc <https://www.healthcare.uiowa.edu/labs/au/AlignQC/>`_.

+----------------------------+-----------------+
| | Raw transducer |
| +--------+--------+
| | draft | medaka |
+===================+========+========+========+
| **Total Error %** | | 0.269 | 0.081 |
+-------------------+--------+--------+--------+
| **Mismatch** | | 0.007 | 0.011 |
+-------------------+--------+--------+--------+
| **Deletion** | Total | 0.222 | 0.045 |
+ +--------+--------+--------+
| | Non-HP | 0.011 | 0.005 |
+ +--------+--------+--------+
| | HP | 0.211 | 0.040 |
+-------------------+--------+--------+--------+
| **Insertion** | Total | 0.040 | 0.026 |
+ +--------+--------+--------+
| | Non-HP | 0.019 | 0.004 |
+ +--------+--------+--------+
| | HP | 0.021 | 0.022 |
+-------------------+--------+--------+--------+

Medaka reduces the total error by more than a factor of three. This is achieved
mainly through reducing the homopolymer (runs of three or more identical bases)
deletion error.

This rather simple correction model does not reach the accuracy of
nanopolish, but requires considerably less runtime.

Future versions of medaka will aim to improve on the above results, with the
goal of surpassing the nanopolish results whilst also improving runtime. See
:ref:`FutureDirections` for more information.
The following demonstrates the utility of a recurrent neural network trained to
perform a consensus call from a pileup in which basecalls and the draft
assembly have been reduced using run-length encoding (as demonstrated in
:ref:`sequence_correction`). The network receives counts of base
run-lengths within each column of a pileup obtained by aligning the encoded
basecalls to the encoded draft assembly.
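
As a toy sketch of the run-length encoding idea (and not the code used by
`medaka` itself), a basecall can be compressed into (run length, base) pairs
using standard shell tools:

.. code-block:: bash

    # Toy example only: run-length encode a short sequence into
    # (run length, base) pairs.
    echo "AACCCGGGGT" | fold -w1 | uniq -c
    #   2 A
    #   3 C
    #   4 G
    #   1 T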

Results were obtained using the default model provided with `medaka`. This model
was trained using data obtained from E.coli, S.cerevisiae and H.sapiens samples.
Training data were basecalled using Guppy 0.3.0. Draft assemblies were created
using the `mini_assemble <https://nanoporetech.github.io/pomoxis/examples.html#fast-de-novo-assembly>`_
pipeline in `pomoxis <https://github.com/nanoporetech/pomoxis>`_.

Error statistics were calculated using the
`pomoxis <https://github.com/nanoporetech/pomoxis>`_ program `stats_from_bam` after
aligning 10kb chunks of the consensus to the reference.
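
As a hedged sketch (exact options may differ between `pomoxis` versions), and
assuming the consensus chunks have already been aligned to the reference into a
hypothetical file `consensus_to_ref.bam`, the statistics can be summarised with:

.. code-block:: bash

    # Hypothetical invocation; consult the pomoxis documentation for the
    # exact options available in your version. Assumes the pomoxis virtual
    # environment is already active.
    stats_from_bam consensus_to_ref.bam > stats.txt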


Comparison of `medaka` and `nanopolish`
---------------------------------------

Evaluation of the model was performed using the `medaka` E.coli
:doc:`walkthrough` dataset. These data were not used to train the model.
Basecalling was performed with
`scrappie <https://github.com/nanoporetech/scrappie>`_ using the `rgrgr_r94`
model. The pileup had a median depth of ~80-fold.
`nanopolish <https://github.com/jts/nanopolish>`_ was run with homopolymer
correction but without methylation correction. `medaka` and `nanopolish` were
run on the same hardware.

+-----------------+--------+------------+
| | medaka | nanopolish |
+=================+========+============+
| Q(Accuracy) | 30.53 | 30.80 |
+-----------------+--------+------------+
| Q(Identity) | 45.35 | 42.27 |
+-----------------+--------+------------+
| Q(Deletion) | 31.82 | 31.64 |
+-----------------+--------+------------+
| Q(Insertion) | 37.03 | 40.60 |
+-----------------+--------+------------+
| runtime (hours) | 0.17 | 3.0 |
+-----------------+--------+------------+
| CPU cores | 4 | 32 |
+-----------------+--------+------------+
| CPU hours | 0.67 | 96 |
+-----------------+--------+------------+

For this dataset `medaka` delivers similar results to `nanopolish` in a
fraction of the time.


Evaluation across samples and depths
------------------------------------

Evaluation of the model was performed using E.coli, H.sapiens chromosome 21,
and `K.pneumoniae <https://github.com/rrwick/Basecalling-comparison>`_.
The E.coli and human reads were basecalled with `Guppy` version 0.3.0,
while the Klebsiella reads were basecalled with `scrappie
<https://github.com/nanoporetech/scrappie>`_ using the `rgrgr_r94` model. The
draft assemblies here were created at multiple depths using the `mini_assemble
<https://nanoporetech.github.io/pomoxis/examples.html#fast-de-novo-assembly>`_
pipeline in `pomoxis <https://github.com/nanoporetech/pomoxis>`_.

+---------------------+-----------------+------------------+
| Data set | Racon Error (%) | Medaka Error (%) |
+=====================+=================+==================+
| E.coli 25X | 0.47 | 0.19 |
+---------------------+-----------------+------------------+
| E.coli 57X | 0.37 | 0.10 |
+---------------------+-----------------+------------------+
| K.pneumoniae 25X | 0.86 | 0.63 |
+---------------------+-----------------+------------------+
| K.pneumoniae 57X | 0.72 | 0.45 |
+---------------------+-----------------+------------------+
| H.sapiens chr21 31X | 1.01 | 0.48 |
+---------------------+-----------------+------------------+

`medaka` reduces the total error in the `racon` consensus by roughly a factor of two.
132 changes: 58 additions & 74 deletions docs/examples.rst
@@ -1,7 +1,7 @@
Medaka Examples
Getting Started
===============

Medaka demonstrates a framework for error correcting sequencing data,
`medaka` demonstrates a framework for error correcting sequencing data,
particularly aimed at nanopore sequencing. Tools are provided for both training
and inference. The code exploits the `keras <https://keras.io>`_ deep learning
library.
@@ -18,92 +18,76 @@ The currently implemented neural networks are limited in their ability to
correct errors: they predict an output base from a window of oriented base
features. They make no use of primary signal data from nanopore experiments,
as for instance performed by `nanopolish <https://github.com/jts/nanopolish>`_.
Future versions will implement correction schemes working directly from signal
data of multiple reads. For more details see :ref:`FutureDirections`.
Future projects will implement correction schemes working directly from signal
data of multiple reads. For more details see :doc:`future`.

For a complete example starting from basecalls, through forming a draft
assembly, to training and using a consensus network, see the :ref:`Walkthrough`.
For a complete example starting from signal data, calculating basecalls,
through forming a draft assembly, to training and using a consensus network,
see the :doc:`walkthrough`.

.. _SequenceCorrection:

Sequence correction
-------------------
.. _creating_software_env:

Creating a Software Environment
-------------------------------

`medaka consensus` uses a neural network error model to fix errors in input sequences.
While not strictly necessary to use `medaka`, the :doc:`walkthrough` makes use of
several other open-source tools from Oxford Nanopore Technologies. To obtain
these, run the following:

.. code-block:: bash
usage: medaka consensus [-h] [--encoding ENCODING]
[--output_fasta OUTPUT_FASTA] [--start START]
[--end END]
(--pileupdata PILEUPDATA | --feature_file FEATURE_FILE | --alignments reads.bam ref.fasta ref_name)
model
positional arguments:
model Model .hdf file from training.
optional arguments:
-h, --help show this help message and exit
--encoding ENCODING Model label encoding .json, used only if encoding not
in .hdf model
--output_fasta OUTPUT_FASTA
Polished consensus output file.
--start START Reference position at which to start, only used with
--alignments.
--end END Reference position at which to end, only used with
--alignments.
--pileupdata PILEUPDATA
Pileup input data.
--feature_file FEATURE_FILE
Pregenerated features as stored during training.
--alignments reads.bam ref.fasta ref_name
Input alignments, reference fasta and reference name
(within fasta).
Neural network training
-----------------------
Training an error model is a two-stage process.
Generate a training data HDF5 file using `medaka prepare`.
MEDAKAHOME=${PWD}
git clone https://github.com/nanoporetech/pomoxis --recursive
git clone https://github.com/nanoporetech/medaka
git clone https://github.com/nanoporetech/scrappie
# While it is possible to install pomoxis and medaka into the same
# virtual environment, we will install each package into its own
# environment for simplicity. For more details see the readme for
# each of the packages.
cd pomoxis && make install && cd ..
cd medaka && make install && cd ..
cd scrappie && mkdir build && cd build && cmake .. && make && cd ../../
POMOXIS=${MEDAKAHOME}/pomoxis/venv/bin/activate
MEDAKA=${MEDAKAHOME}/medaka/venv/bin/activate
If any of the above steps fail, consult the install instructions of the
individual packages.

In order to train consensus models effectively, it is desirable to use a system
with a GPU. To use `medaka` training with a GPU, one should further install
the `tensorflow-gpu` package from pip into the medaka virtual environment
with:

.. code-block:: bash
usage: medaka prepare [-h] [--start START] [--end END] [--chunk_len CHUNK_LEN]
[--truth TRUTH]
ref_fasta ref_name output bam
source ${MEDAKA}
pip install tensorflow-gpu
positional arguments:
ref_fasta .fasta reference corresponding to bams.
ref_name Name of reference within ref_fasta.
output Output .hdf.
bam Input alignments (aligned to ref).
Depending on your environment, specifically the versions of `CUDA` and `cuDNN`
you have installed, it may be necessary to use a version of this package other
than the latest. `medaka` has been used with all release versions of `tensorflow`
from 1.0.0 onwards.
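
For example, to pin a specific older release (the version number below is
purely illustrative; choose one compatible with your installed `CUDA` and
`cuDNN` libraries):

.. code-block:: bash

    source ${MEDAKA}
    # Illustrative version pin only; substitute a release matching your
    # CUDA/cuDNN installation.
    pip install "tensorflow-gpu==1.4.1"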

optional arguments:
-h, --help show this help message and exit
--start START Start reference coordinate.
--end END End reference coordinate.
--chunk_len CHUNK_LEN
Width of pileup chunks (in ref coords) to produce.
--truth TRUTH Input bam of truth aligned to ref to label data.

Then supply the training data file to `medaka train`.
.. _sequence_correction:

.. code-block:: bash
Train model using preprocessed training data.
Sequence correction
-------------------

After installing all the necessary software (see :ref:`creating_software_env`),
`medaka` can be run using its default settings through the `medaka_consensus`
program. An assembly in `.fasta` format and basecalls in `.fasta` or `.fastq`
format are required; see :ref:`basecalling_and_draft_assembly`.

usage: medaka train [-h] [--train_name TRAIN_NAME] [--model MODEL]
[--features]
pileupdata
.. code-block:: bash
positional arguments:
pileupdata Path for training data.
source ${MEDAKA}  # i.e. medaka/venv/bin/activate as above
NPROC=$(nproc)
medaka_consensus -i basecalls.fa -d draft_assembly.fa -o medaka_consensus -t ${NPROC} -p ${POMOXIS}
optional arguments:
-h, --help show this help message and exit
--train_name TRAIN_NAME
Name for training run.
--model MODEL Model definition and initial weights .hdf.
--features Stop after generating features.
When `medaka_consensus` has finished running, the consensus will be saved to
`medaka_consensus/consensus.fasta`.
8 changes: 4 additions & 4 deletions docs/future.rst
@@ -1,6 +1,3 @@

.. _FutureDirections:

History and Future Directions
=============================

@@ -56,4 +53,7 @@ data; this is the focus of current research. Further it may be possible that
such an approach need not be iterative: a single pass on the inputs could be
performed to achieve results in quicker time.

A future release of medaka will incorporate these higher-order methods.
Development of medaka as a "base-space" consensus tool has been paused to focus
on these higher-level methods. Nevertheless, researchers may find medaka useful
as a method to quickly generate sufficiently accurate consensus sequences in many
use cases; see the `Benchmarks` page for more details.
15 changes: 11 additions & 4 deletions docs/index.rst
@@ -1,7 +1,7 @@
Welcome to Medaka's documentation!
==================================

Medaka demonstrates a framework for error correcting sequencing data,
Medaka demonstrates a framework for rapid error correction of sequencing data,
particularly aimed at nanopore sequencing. Tools are provided for both training
and inference. The code exploits the `keras <https://keras.io>`_ deep learning
library.
@@ -11,9 +11,16 @@ being released in an effort to allow researchers to experiment without having
to write much of the tedious data preparation code. Researchers can extend the
tools provided to achieve better results than those obtained currently.

See :doc:`examples` for further help in getting started and
:ref:`FutureDirections` for inspirational ideas and some background into
nanopore sequence correction.
See :doc:`examples` for standard usage in correcting a draft assembly,
:doc:`walkthrough` for a complete overview of training a model and its use,
and :doc:`future` for inspirational ideas and some background into nanopore
sequence correction.

Development of medaka as a "base-space" consensus tool has been paused to focus
on methods which exploit the nanopore signal data. Nevertheless, researchers may
find medaka useful as a method to quickly generate sufficiently accurate
consensus sequences in many use cases; see the :doc:`benchmarks` page for further
details.

Tools to enable the creation of draft assemblies can be found in a sister
project `pomoxis <https://github.com/nanoporetech/pomoxis>`_.