Merge pull request #765 from memex-explorer/bhard/services_docs
Documentation on Optional Services
brittainhard committed Nov 10, 2015
2 parents 6d7baea + 0a30adc commit 1c10627
Showing 3 changed files with 72 additions and 39 deletions.
27 changes: 15 additions & 12 deletions docs/source/crawler_guide.rst
@@ -28,21 +28,21 @@ Creating a Seeds List

Simply put, the seeds list should contain pages that are relevant to the topics you are searching. Both Nutch and Ache provide insight into the relevance of your seeds list, but in different ways.

For the purposes of memex-explorer, the extension and name of your seeds list does not matter. It will be automatically renamed and stored according to the specifications of the crawler.
For the purposes of memex-explorer, the extension and name of your seeds list does not matter. It will be automatically renamed and stored according to the specifications of the crawler.

Seeds lists are created on the seeds page; they can also be created from the add crawl page.

Crawler Control Buttons
=======================
Here's an overview of the buttons available to each crawler for controlling the crawlers. The buttons behave differently depending on which crawler you are using.
Here we have an overview of the buttons available to each crawler for controlling the crawlers. The buttons behave differently depending on which one you are using.

These are the buttons available for Ache:
These are the buttons available for Ache:

.. image:: _static/img/ache-buttons.png
.. image:: _static/img/ache-buttons.png

These are the buttons available for Nutch:
These are the buttons available for Nutch:

.. image:: _static/img/nutch-buttons.png
.. image:: _static/img/nutch-buttons.png

Options Button
--------------
@@ -54,9 +54,9 @@ Start Button

Stop Button
-----------
Symbolized by the "stop" button. Stops the crawl.
Symbolized by the "stop" button. Stops the crawl.

In the case of Ache, the crawler stops immediately. In the case of Nutch, the crawler stops after it has finished the current round. This is in order to prevent data corruption that can occur when killing the Nutch process.
In the case of Ache, the crawler stops immediately. In the case of Nutch, the crawler stops after it has finished the current process. However, the data on the current round of the crawl will be lost.

Restart Button
--------------
@@ -70,9 +70,12 @@ Get Crawl Log

CCA Export
----------

This button is Nutch only. It allows you to export your crawl data into the CCA format.

Rounds Input
------------
Nutch only. This allows you to specify how many rounds you want the crawl to run. You can press the stop button at any time and it will stop when it is done with the current round.
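
The round-based behavior described above can be sketched as a loop that only checks for a stop request between rounds. This is a minimal illustration, not Memex Explorer's actual implementation: `run_crawl`, `run_round`, and `stop_requested` are hypothetical names introduced here.

```python
import threading


def run_crawl(rounds, run_round, stop_requested):
    """Run up to `rounds` crawl rounds, honoring a stop request only between rounds."""
    completed = 0
    for _ in range(rounds):
        if stop_requested.is_set():
            break  # a stop pressed mid-round takes effect here, after that round ends
        run_round()
        completed += 1
    return completed


# Hypothetical stand-in for one round of crawling; it "presses stop" during round 2.
stop = threading.Event()
log = []


def fake_round():
    log.append(len(log) + 1)
    if len(log) == 2:
        stop.set()


completed = run_crawl(5, fake_round, stop)
print(completed, log)
```

Although five rounds were requested, the loop finishes the round in progress when stop is set and then exits without starting another.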

Crawl Settings
==============
The crawl settings page allows you to delete the crawl, as well as change the name or description of the crawl. It is accessed by clicking the "pencil" icon next to the name of the crawl.
@@ -86,11 +89,11 @@ Crawl Settings
*****
Nutch
*****
`Nutch <http://nutch.apache.org/>`_ is developed by Apache, and has interfaces with both Solr and Elasticsearch, and it allows memex-explorer to offer different crawling functionality from Ache.
`Nutch <http://nutch.apache.org/>`_ is developed by Apache, and has an interface with Elasticsearch. All Nutch crawls create Elasticsearch indices by default.

Nutch runs in uninterruptible rounds of crawling. Nutch will run indefinitely until asked to stop. By viewing the crawl log, it is possible to see how many pages are left to crawl in the current round.
With Nutch, you can define how long you want to crawl by setting the number of rounds to crawl. You can keep track of the overall crawl time and the sites currently being crawled by looking at the Nutch crawl visualizations.

The number of pages left to crawl in a Nutch round increases significantly after each round. With Nutch, you can pass it a seeds list of 100 pages to crawl, and it can find over 1000 pages to crawl for the next round. Because of this, Nutch is a much easier crawler to get running.
The number of pages left to crawl in a Nutch round increases significantly after each round. You might pass it a seeds list of 100 pages to crawl, and it can find over 1000 pages to crawl for the next round. Because of this, Nutch is a much easier crawler to get running.

Memex Explorer currently uses the Nutch REST API for running all crawls.

71 changes: 53 additions & 18 deletions docs/source/dev_guide.rst
@@ -6,9 +6,9 @@ Developer's Guide to Memex Explorer
Setting up Memex Explorer
*************************

To setup your machine, you will need Anaconda or Miniconda installed. Miniconda is a minimal Anaconda installation that bootstraps conda and Python on any operating system. Install `Anaconda <http://continuum.io/downloads>`_ or `Miniconda <http://conda.pydata.org/miniconda.html>`_ from their respective sites.
To set up your machine, you will need Anaconda or Miniconda installed. Miniconda is a minimal Anaconda installation that bootstraps conda and Python on any operating system. Install `Anaconda <http://continuum.io/downloads>`_ or `Miniconda <http://conda.pydata.org/miniconda.html>`_ from their respective sites.

Memex Explorer requires conda, either from Miniconda or Anaconda.
Memex Explorer requires conda, either from Miniconda or Anaconda.

Application Setup
=================
@@ -30,12 +30,37 @@ Application Setup

Memex Explorer will now be running locally at `http://localhost:8000 <http://localhost:8000/>`_.

Enabling Nutch Visualizations
Tests
=====
To run the tests, return to the root directory and run:

.. code-block:: html

$ py.test

The Database Model
==================
The current entity relation diagram:

.. image:: _static/img/DbVisualizer.png

Updating the Database
---------------------
As of version 0.4.0, Memex Explorer will start tracking all database migrations. This means that you will be able to upgrade your database and preserve the data without any issues.

If you are using a version that is 0.3.0 or earlier, and you are unable to update your database without server errors, the best course of action is to delete the existing file at `source/db.sqlite3` and start over with a fresh database.

Enabling Non-Default Services
=============================

Nutch visualizations are not enabled by default. Nutch visualizations require RabbitMQ, and the method for installing RabbitMQ varies depending on the operating system. RabbitMQ can be installed via Homebrew on Mac, and apt-get on Debian systems. More information on how to install RabbitMQ, read `this page <https://www.rabbitmq.com/download.html>`_. Note: You may also need to change the below command to `sudo rabbitmq-server`, depending on how RabbitMQ is installed on your system and the permissions of the current user.
Nutch Visualizations
--------------------

Nutch visualizations are not enabled by default. Nutch visualizations require RabbitMQ, and the method for installing RabbitMQ varies depending on the operating system. RabbitMQ can be installed via Homebrew on Mac, and apt-get on Debian systems. For more information on how to install RabbitMQ, read `this page <https://www.rabbitmq.com/download.html>`_. Note: You may also need to change the below command to `sudo rabbitmq-server`, depending on how RabbitMQ is installed on your system and the permissions of the current user.

To enable Bokeh visualizations for Nutch, change ``autostart=false`` to ``autostart=true`` for both of these directives in source/supervisord.conf, and then kill and restart supervisor.
RabbitMQ and Bokeh-Server are necessary for creating the Nutch visualizations. The Nutch streaming visualization works by creating and subscribing to a queue of AMQP messages (hosted by RabbitMQ) being dispatched from Nutch as it runs the crawl. A background task reads the messages and updates the plot (hosted by Bokeh server).
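
The consume-and-update pattern described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's code: an in-memory `queue.Queue` stands in for the RabbitMQ queue, a plain dictionary stands in for the Bokeh plot data, and the JSON message shape (`{"host": ...}`) is hypothetical.

```python
import json
import queue
import threading


def consume(messages, counts):
    """Read messages until a None sentinel arrives, tallying pages per host."""
    while True:
        body = messages.get()
        if body is None:
            break
        event = json.loads(body)
        # In the real system this step would update the plot's data source.
        counts[event["host"]] = counts.get(event["host"], 0) + 1


messages = queue.Queue()  # stand-in for the AMQP queue hosted by RabbitMQ
counts = {}               # stand-in for the data behind the plot

worker = threading.Thread(target=consume, args=(messages, counts))
worker.start()

# Stand-in for Nutch dispatching one message per fetched page.
for host in ["example.com", "example.org", "example.com"]:
    messages.put(json.dumps({"host": host}))
messages.put(None)
worker.join()

print(counts)  # {'example.com': 2, 'example.org': 1}
```

The producer and the consumer share nothing but the queue, which is why the crawl and the visualization can run as separate processes.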

To enable Bokeh visualizations for Nutch, change `autostart=false` to `autostart=true` for both of these directives in `source/supervisord.conf`, and then kill and restart supervisor.

.. code-block:: html

@@ -51,22 +76,32 @@ Enabling Nutch Visualizations
-autostart=false
+autostart=true

Tests
=====
To run the tests, return to the root directory and run:
Domain Discovery Tool (DDT)
---------------------------

.. code-block:: html
Domain Discovery Tool can be installed as a conda package. Simply run `conda install ddt` to download the package for DDT.

$ py.test
As with the Nutch visualizations, to enable DDT, change the directive in `source/supervisord.conf`.

The Database Model
==================
The current entity relation diagram:
.. code-block:: html

.. image:: _static/img/DbVisualizer.png
[program:ddt]
command=ddt
priority=5
-autostart=false
+autostart=true

Updating the Database
---------------------
As of version 0.4.0, Memex Explorer will start tracking all database migrations. This means that you will be able to upgrade your database and preserve the data without any issues.
Temporal Anomaly Detection (TAD)
--------------------------------

TAD does not currently have a conda package. Like the Nutch visualizations, it also has a RabbitMQ dependency. For instructions on installing TAD, visit the `GitHub repository <https://github.com/autonlab/tad>`_.

If you are using a version that is 0.3.0 or earlier, and you are unable to update your database without server errors, the best course if action is to delete the existing `file at source/db.sqlite3` and start over with a fresh database.
As with DDT and the Nutch visualizations, you also have to change the supervisor directive.

.. code-block:: html

[program:tad]
command=tad
priority=5
-autostart=false
+autostart=true
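
Each optional service is enabled by the same edit, switching `autostart` to `true` in its `[program:...]` section, so the change can be scripted. The following is a minimal sketch rather than a tool shipped with Memex Explorer: `enable_autostart` is a hypothetical helper, and the `[program:example]` section in the sample is invented for illustration. It rewrites the text line by line so that comments and ordering in the file are preserved.

```python
import re


def enable_autostart(conf_text, programs):
    """Return conf_text with autostart=true in the named [program:x] sections."""
    wanted = {f"program:{name}" for name in programs}
    current = None
    out = []
    for line in conf_text.splitlines():
        header = re.match(r"\s*\[([^\]]+)\]\s*$", line)
        if header:
            current = header.group(1)  # track which section we are inside
        elif current in wanted and re.match(r"\s*autostart\s*=", line):
            line = "autostart=true"
        out.append(line)
    return "\n".join(out) + "\n"


sample = """[program:ddt]
command=ddt
priority=5
autostart=false

[program:tad]
command=tad
priority=5
autostart=false

[program:example]
command=example
autostart=false
"""

updated = enable_autostart(sample, ["ddt"])
print(updated)
```

Only the requested section changes; supervisor still has to be killed and restarted afterwards for the new setting to take effect.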
13 changes: 4 additions & 9 deletions docs/source/index.rst
@@ -1,18 +1,13 @@
Memex Explorer
==============

Memex Explorer is a web application that provides easy-to-use interfaces for
gathering, analyzing, and graphing web crawl data.
Memex Explorer is a web application that provides easy-to-use interfaces for gathering, analyzing, and graphing web crawl data.

For usage instructions, please refer to the `User's Guide <user_guide.html>`_.
For usage instructions, please refer to the `User's Guide <user_guide.html>`_.

.. For more information about the project architecture, please refer to our `Developer's Guide <dev_guide.html>`_ and `API Guide <api.html>`_.
For more information about the project architecture, please refer to our `Developer's Guide <dev_guide.html>`_ and `API Guide <api.html>`_.

Memex Explorer is built by `Continuum Analytics <http://continuum.io/>`_,
with grants and support from the
`NASA Jet Propulsion Laboratory <http://www.jpl.nasa.gov/>`_,
`Kitware <http://www.kitware.com/>`_,
and the `NYU Polytechnic School of Engineering <http://engineering.nyu.edu/>`_.
Memex Explorer is built by `Continuum Analytics <http://continuum.io/>`_, with grants and support from the `NASA Jet Propulsion Laboratory <http://www.jpl.nasa.gov/>`_, `Kitware <http://www.kitware.com/>`_, and the `NYU Polytechnic School of Engineering <http://engineering.nyu.edu/>`_.

Contents:

