Commit
Merge branch 'dev' into 'master'
Dev

See merge request research/medaka!52
cjw85 committed Jan 22, 2018
2 parents 011d0ba + 5e8218e commit d9ce1b3
Showing 20 changed files with 1,942 additions and 643 deletions.
3 changes: 2 additions & 1 deletion Makefile
@@ -37,9 +37,10 @@ venv/bin/activate:
test -d venv || virtualenv venv --python=python3
${IN_VENV} && pip install pip --upgrade
${IN_VENV} && pip install numpy # needs to get done before other things
${IN_VENV} && pip install -r requirements.txt

install: venv | $(addprefix $(BINCACHEDIR)/, $(BINARIES))
${IN_VENV} && pip install -r requirements.txt && python setup.py install
${IN_VENV} && python setup.py install

test: install
${IN_VENV} && python -m unittest discover
8 changes: 8 additions & 0 deletions README.md
@@ -13,8 +13,16 @@ being released in an effort to allow researchers to experiment without having
to write much of the tedious data preparation code. Researchers can extend the
tools provided to achieve better results than those obtained currently.

Development of medaka as a "base-space" consensus tool has been paused to focus
on methods which exploit the nanopore signal data. Nevertheless, researchers may
find medaka useful as a method to quickly generate sufficiently accurate
consensus sequences in many use cases; see the
[Benchmarks](https://nanoporetech.github.io/medaka/benchmarks.html) page for
further details.

Documentation can be found at https://nanoporetech.github.io/medaka/.


Installation
------------

123 changes: 78 additions & 45 deletions docs/benchmarks.rst
@@ -1,49 +1,82 @@
Benchmarks
==========

The following demonstrates the utility of a recurrent neural network trained
to make corrections of a draft assembly by considering counts of bases and
indels in pileup columns obtained from reads aligned to the draft, as
implemented with :ref:`SequenceCorrection`.

All models here were trained from reads aligning to a region of E.coli and
tested on a distinct, non-overlapping region. In all cases
`scrappie <https://github.com/nanoporetech/scrappie>`_ was used to perform
the basecalling; the 6mer transducer model was trained specifically for this
experiment. The draft assemblies here were created using the
`mini_assemble <https://nanoporetech.github.io/pomoxis/examples.html#fast-de-novo-assembly>`_
pipeline in `pomoxis <https://github.com/nanoporetech/pomoxis>`_. Statistics
were calculated using `alignqc <https://www.healthcare.uiowa.edu/labs/au/AlignQC/>`_.

+----------------------------+-----------------+
| | Raw transducer |
| +--------+--------+
| | draft | medaka |
+===================+========+========+========+
| **Total Error %** | | 0.269 | 0.081 |
+-------------------+--------+--------+--------+
| **Mismatch** | | 0.007 | 0.011 |
+-------------------+--------+--------+--------+
| **Deletion** | Total | 0.222 | 0.045 |
+ +--------+--------+--------+
| | Non-HP | 0.011 | 0.005 |
+ +--------+--------+--------+
| | HP | 0.211 | 0.040 |
+-------------------+--------+--------+--------+
| **Insertion** | Total | 0.040 | 0.026 |
+ +--------+--------+--------+
| | Non-HP | 0.019 | 0.004 |
+ +--------+--------+--------+
| | HP | 0.021 | 0.022 |
+-------------------+--------+--------+--------+

Medaka reduces the total error by more than a factor of three. This is achieved
mainly through reducing the homopolymer (runs of three or more identical bases)
deletion error.

This rather simple correction model does not reach the accuracy of
nanopolish, but requires considerably less runtime.

Future versions of medaka will aim to improve on the above results, with the
goal of surpassing the nanopolish results whilst also improving runtime. See
:ref:`FutureDirections` for more information.
The following demonstrates the utility of a recurrent neural network trained to
perform a consensus call from a pileup in which basecalls and the draft
assembly have been reduced using run-length encoding (as demonstrated in
:ref:`sequence_correction`). The network receives counts of base
run-lengths within each column of a pileup obtained by aligning the encoded
basecalls to the encoded draft assembly.
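
As a toy sketch of the run-length encoding idea (and not the code used by
`medaka` itself), a basecall can be compressed into (run length, base) pairs
using standard shell tools:

.. code-block:: bash

    # Toy example only: run-length encode a short sequence into
    # (run length, base) pairs.
    echo "AACCCGGGGT" | fold -w1 | uniq -c
    #   2 A
    #   3 C
    #   4 G
    #   1 T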

Results were obtained using the default model provided with `medaka`. This model
was trained using data obtained from E.coli, S.cerevisiae and H.sapiens samples.
Training data were basecalled using Guppy 0.3.0. Draft assemblies were created
using the `mini_assemble <https://nanoporetech.github.io/pomoxis/examples.html#fast-de-novo-assembly>`_
pipeline in `pomoxis <https://github.com/nanoporetech/pomoxis>`_.

Error statistics were calculated using the
`pomoxis <https://github.com/nanoporetech/pomoxis>`_ program `stats_from_bam` after
aligning 10kb chunks of the consensus to the reference.
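
As a hedged sketch (exact options may differ between `pomoxis` versions), and
assuming the consensus chunks have already been aligned to the reference into a
hypothetical file `consensus_to_ref.bam`, the statistics can be summarised with:

.. code-block:: bash

    # Hypothetical invocation; consult the pomoxis documentation for the
    # exact options available in your version. Assumes the pomoxis virtual
    # environment is already active.
    stats_from_bam consensus_to_ref.bam > stats.txt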


Comparison of `medaka` and `nanopolish`
---------------------------------------

Evaluation of the model was performed using the `medaka` E.coli
:doc:`walkthrough` dataset. These data were not used to train the model.
Basecalling was performed with
`scrappie <https://github.com/nanoporetech/scrappie>`_ using the `rgrgr_r94`
model. The pileup had a median depth of ~80-fold.
`nanopolish <https://github.com/jts/nanopolish>`_ was run with homopolymer
correction but without methylation correction. `medaka` and `nanopolish` were
run on the same hardware.

+-----------------+--------+------------+
| | medaka | nanopolish |
+=================+========+============+
| Q(Accuracy) | 30.53 | 30.80 |
+-----------------+--------+------------+
| Q(Identity) | 45.35 | 42.27 |
+-----------------+--------+------------+
| Q(Deletion) | 31.82 | 31.64 |
+-----------------+--------+------------+
| Q(Insertion) | 37.03 | 40.60 |
+-----------------+--------+------------+
| runtime (hours) | 0.17 | 3.0 |
+-----------------+--------+------------+
| CPU cores | 4 | 32 |
+-----------------+--------+------------+
| CPU hours | 0.67 | 96 |
+-----------------+--------+------------+

For this dataset `medaka` delivers similar results to `nanopolish` in a
fraction of the time.


Evaluation across samples and depths
------------------------------------

Evaluation of the model was performed using E.coli, H.sapiens chromosome 21,
and `K.pneumoniae <https://github.com/rrwick/Basecalling-comparison>`_.
The E.coli and human reads were basecalled with `Guppy` version 0.3.0,
while the Klebsiella reads were basecalled with `scrappie
<https://github.com/nanoporetech/scrappie>`_ using the `rgrgr_r94` model. The
draft assemblies here were created at multiple depths using the `mini_assemble
<https://nanoporetech.github.io/pomoxis/examples.html#fast-de-novo-assembly>`_
pipeline in `pomoxis <https://github.com/nanoporetech/pomoxis>`_.

+---------------------+-----------------+------------------+
| Data set | Racon Error (%) | Medaka Error (%) |
+=====================+=================+==================+
| E.coli 25X | 0.47 | 0.19 |
+---------------------+-----------------+------------------+
| E.coli 57X | 0.37 | 0.10 |
+---------------------+-----------------+------------------+
| K.pneumoniae 25X | 0.86 | 0.63 |
+---------------------+-----------------+------------------+
| K.pneumoniae 57X | 0.72 | 0.45 |
+---------------------+-----------------+------------------+
| H.sapiens chr21 31X | 1.01 | 0.48 |
+---------------------+-----------------+------------------+

`medaka` reduces the total error in the `racon` consensus by roughly a factor of two.
132 changes: 58 additions & 74 deletions docs/examples.rst
@@ -1,7 +1,7 @@
Medaka Examples
Getting Started
===============

Medaka demonstrates a framework for error correcting sequencing data,
`medaka` demonstrates a framework for error correcting sequencing data,
particularly aimed at nanopore sequencing. Tools are provided for both training
and inference. The code exploits the `keras <https://keras.io>`_ deep learning
library.
@@ -18,92 +18,76 @@ The currently implemented neural networks are limited in their ability to
correct errors: they predict an output base from a window of oriented base
features. They make no use of primary signal data from nanopore experiments,
as for instance performed by `nanopolish <https://github.com/jts/nanopolish>`_.
Future versions will implement correction schemes working directly from signal
data of multiple reads. For more details see :ref:`FutureDirections`.
Future projects will implement correction schemes working directly from signal
data of multiple reads. For more details see :doc:`future`.

For a complete example starting from basecalls, through forming a draft
assembly, to training and using a consensus network, see the :ref:`Walkthrough`.
For a complete example starting from signal data, calculating basecalls,
through forming a draft assembly, to training and using a consensus network,
see the :doc:`walkthrough`.

.. _SequenceCorrection:

Sequence correction
-------------------
.. _creating_software_env:

Creating a Software Environment
-------------------------------

`medaka consensus` uses a neural network error model to fix errors in input sequences.
While not strictly necessary to use `medaka`, the :doc:`walkthrough` makes use of
several other open-source tools from Oxford Nanopore Technologies. To obtain
these, run the following:

.. code-block:: bash
usage: medaka consensus [-h] [--encoding ENCODING]
[--output_fasta OUTPUT_FASTA] [--start START]
[--end END]
(--pileupdata PILEUPDATA | --feature_file FEATURE_FILE | --alignments reads.bam ref.fasta ref_name)
model
positional arguments:
model Model .hdf file from training.
optional arguments:
-h, --help show this help message and exit
--encoding ENCODING Model label encoding .json, used only if encoding not
in .hdf model
--output_fasta OUTPUT_FASTA
Polished consensus output file.
--start START Reference position at which to start, only used with
--alignments.
--end END Reference position at which to end, only used with
--alignments.
--pileupdata PILEUPDATA
Pileup input data.
--feature_file FEATURE_FILE
Pregenerated features as stored during training.
--alignments reads.bam ref.fasta ref_name
Input alignments, reference fasta and reference name
(within fasta).
Neural network training
-----------------------
Training an error model is a two-stage process.
Generate a training data HDF5 file using `medaka prepare`.
MEDAKAHOME=${PWD}
git clone https://github.com/nanoporetech/pomoxis --recursive
git clone https://github.com/nanoporetech/medaka
git clone https://github.com/nanoporetech/scrappie
# While it is possible to install pomoxis and medaka into the same
# virtual environment, we will install each package into its own
# environment for simplicity. For more details see the readme for
# each of the packages.
cd pomoxis && make install && cd ..
cd medaka && make install && cd ..
cd scrappie && mkdir build && cd build && cmake .. && make && cd ../../
POMOXIS=${MEDAKAHOME}/pomoxis/venv/bin/activate
MEDAKA=${MEDAKAHOME}/medaka/venv/bin/activate
If any of the above steps fail, consult the install instructions of the
individual packages.

In order to train consensus models effectively, it is desirable to use a system
with a GPU. To use `medaka` training with a GPU, one should further install
the `tensorflow-gpu` package from pip into the medaka virtual environment
with:

.. code-block:: bash
usage: medaka prepare [-h] [--start START] [--end END] [--chunk_len CHUNK_LEN]
[--truth TRUTH]
ref_fasta ref_name output bam
source ${MEDAKA}
pip install tensorflow-gpu
positional arguments:
ref_fasta .fasta reference corresponding to bams.
ref_name Name of reference within ref_fasta.
output Output .hdf.
bam Input alignments (aligned to ref).
Depending on your environment, specifically the versions of `CUDA` and `cuDNN`
you have installed, it may be necessary to use a version of this package other
than the latest. `medaka` has been used with all release versions of `tensorflow`
from 1.0.0 onwards.
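
For example, to pin a specific older release (the version number below is
purely illustrative; choose one compatible with your installed `CUDA` and
`cuDNN` libraries):

.. code-block:: bash

    source ${MEDAKA}
    # Illustrative version pin only; substitute a release matching your
    # CUDA/cuDNN installation.
    pip install "tensorflow-gpu==1.4.1"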

optional arguments:
-h, --help show this help message and exit
--start START Start reference coordinate.
--end END End reference coordinate.
--chunk_len CHUNK_LEN
Width of pileup chunks (in ref coords) to produce.
--truth TRUTH Input bam of truth aligned to ref to label data.

Then supply the training data file to `medaka train`.
.. _sequence_correction:

.. code-block:: bash
Train model using preprocessed training data.
Sequence correction
-------------------

After installing all the necessary software (see :ref:`creating_software_env`),
`medaka` can be run using its default settings through the `medaka_consensus`
program. An assembly in `.fasta` format and basecalls in `.fasta` or `.fastq`
format are required; see :ref:`basecalling_and_draft_assembly`.

usage: medaka train [-h] [--train_name TRAIN_NAME] [--model MODEL]
[--features]
pileupdata
.. code-block:: bash
positional arguments:
pileupdata Path for training data.
source ${MEDAKA}  # i.e. medaka/venv/bin/activate as above
NPROC=$(nproc)
medaka_consensus -i basecalls.fa -d draft_assembly.fa -o medaka_consensus -t ${NPROC} -p ${POMOXIS}
optional arguments:
-h, --help show this help message and exit
--train_name TRAIN_NAME
Name for training run.
--model MODEL Model definition and initial weights .hdf.
--features Stop after generating features.
When `medaka_consensus` has finished running, the consensus will be saved to
`medaka_consensus/consensus.fasta`.
8 changes: 4 additions & 4 deletions docs/future.rst
@@ -1,6 +1,3 @@

.. _FutureDirections:

History and Future Directions
=============================

@@ -56,4 +53,7 @@ data; this is the focus of current research. Further it may be possible that
such an approach need not be iterative: a single pass on the inputs could be
performed to achieve results in quicker time.

A future release of medaka will incorporate these higher-order methods.
Development of medaka as a "base-space" consensus tool has been paused to focus
on these higher-level methods. Nevertheless, researchers may find medaka useful
as a method to quickly generate sufficiently accurate consensus sequences in many
use cases; see the `Benchmarks` page for more details.
15 changes: 11 additions & 4 deletions docs/index.rst
@@ -1,7 +1,7 @@
Welcome to Medaka's documentation!
==================================

Medaka demonstrates a framework for error correcting sequencing data,
Medaka demonstrates a framework for rapid error correction of sequencing data,
particularly aimed at nanopore sequencing. Tools are provided for both training
and inference. The code exploits the `keras <https://keras.io>`_ deep learning
library.
@@ -11,9 +11,16 @@ being released in an effort to allow researchers to experiment without having
to write much of the tedious data preparation code. Researchers can extend the
tools provided to achieve better results than those obtained currently.

See :doc:`examples` for further help in getting started and
:ref:`FutureDirections` for inspirational ideas and some background into
nanopore sequence correction.
See :doc:`examples` for standard usage in correcting a draft assembly,
:doc:`walkthrough` for a complete overview of training a model and its use,
and :doc:`future` for inspirational ideas and some background into nanopore
sequence correction.

Development of medaka as a "base-space" consensus tool has been paused to focus
on methods which exploit the nanopore signal data. Nevertheless, researchers may
find medaka useful as a method to quickly generate sufficiently accurate
consensus sequences in many use cases; see the :doc:`benchmarks` page for further
details.

Tools to enable the creation of draft assemblies can be found in a sister
project `pomoxis <https://github.com/nanoporetech/pomoxis>`_.