Wording fixes, updating section about Nutch crawler.
brittainhard committed Nov 10, 2015
1 parent d498a55 commit 0a30adc
Showing 2 changed files with 17 additions and 15 deletions.
12 changes: 6 additions & 6 deletions docs/source/crawler_guide.rst
@@ -28,13 +28,13 @@ Creating a Seeds List

Simply put, the seeds list should contain pages that are relevant to the topics you are searching. Both Nutch and Ache provide insight into the relevance of your seeds list, but in different ways.

-For the purposes of memex-explorer, the extension and name of your seeds list does not matter. It will be automatically renamed and stored according to the specifications of the crawler.
+For the purposes of memex-explorer, the extension and name of your seeds list does not matter. It will be automatically renamed and stored according to the specifications of the crawler.

Seeds lists are created on the seeds page, and seeds lists can be created from the add crawl page.

Crawler Control Buttons
=======================
-Here's an overview of the buttons available to each crawler for controlling the crawlers. The buttons behave differently depending on which crawler you are using.
+Here we have an overview of the buttons available to each crawler for controlling the crawlers. The buttons behave differently depending on which one you are using.

These are the buttons available for Ache:

@@ -56,7 +56,7 @@ Stop Button
-----------
Symbolized by the "stop" button. Stops the crawl.

-In the case of Ache, the crawler stops immediately. In the case of Nutch, the crawler stops after it has finished the current round. This is in order to prevent data corruption that can occur when killing the Nutch process.
+In the case of Ache, the crawler stops immediately. In the case of Nutch, the crawler stops after it has finished the current process. However, the data on the current round of the crawl will be lost.

Restart Button
--------------
@@ -89,11 +89,11 @@ Crawl Settings
*****
Nutch
*****
-`Nutch <http://nutch.apache.org/>`_ is developed by Apache, and has interfaces with both Solr and Elasticsearch, and it allows memex-explorer to offer different crawling functionality from Ache.
+`Nutch <http://nutch.apache.org/>`_ is developed by Apache, and has an interface with Elasticsearch. All Nutch crawls create Elasticsearch indices by default.

-Nutch runs in uninterruptible rounds of crawling. Nutch will run indefinitely until asked to stop. By viewing the crawl log, it is possible to see how many pages are left to crawl in the current round.
+With Nutch, you can define how long you want to crawl by setting the number of rounds to crawl. You can keep track of the overall crawl time and the sites currently being crawled by looking at the Nutch crawl visualizations.

-The number of pages left to crawl in a Nutch round increases significantly after each round. With Nutch, you can pass it a seeds list of 100 pages to crawl, and it can find over 1000 pages to crawl for the next round. Because of this, Nutch is a much easier crawler to get running.
+The number of pages left to crawl in a Nutch round increases significantly after each round. You might pass it a seeds list of 100 pages to crawl, and it can find over 1000 pages to crawl for the next round. Because of this, Nutch is a much easier crawler to get running.

Memex Explorer currently uses the Nutch REST API for running all crawls.

20 changes: 11 additions & 9 deletions docs/source/dev_guide.rst
@@ -6,7 +6,7 @@ Developer's Guide to Memex Explorer
Setting up Memex Explorer
*************************

-To setup your machine, you will need Anaconda or Miniconda installed. Miniconda is a minimal Anaconda installation that bootstraps conda and Python on any operating system. Install `Anaconda <http://continuum.io/downloads>`_ or `Miniconda <http://conda.pydata.org/miniconda.html>`_ from their respective sites.
+To set up your machine, you will need Anaconda or Miniconda installed. Miniconda is a minimal Anaconda installation that bootstraps conda and Python on any operating system. Install `Anaconda <http://continuum.io/downloads>`_ or `Miniconda <http://conda.pydata.org/miniconda.html>`_ from their respective sites.

Memex Explorer requires conda, either from Miniconda or Anaconda.

@@ -48,17 +48,19 @@ Updating the Database
---------------------
As of version 0.4.0, Memex Explorer will start tracking all database migrations. This means that you will be able to upgrade your database and preserve the data without any issues.

-If you are using a version that is 0.3.0 or earlier, and you are unable to update your database without server errors, the best course if action is to delete the existing `file at source/db.sqlite3` and start over with a fresh database.
+If you are using a version that is 0.3.0 or earlier, and you are unable to update your database without server errors, the best course of action is to delete the existing file at `source/db.sqlite3` and start over with a fresh database.
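
The database reset described in this hunk could be scripted roughly as follows. This is only a sketch: the helper name ``remove_database`` is hypothetical and not part of Memex Explorer, and the default path is the ``source/db.sqlite3`` location named in the text, taken relative to the current working directory.

```python
from pathlib import Path


def remove_database(db_path="source/db.sqlite3"):
    """Delete the SQLite file at db_path (hypothetical helper, not a project API).

    Returns True if a file was removed and False if nothing existed at that path.
    """
    path = Path(db_path)
    if path.is_file():
        path.unlink()  # permanently removes the file; back it up first if unsure
        return True
    return False


# Example call from the repository root (left commented out on purpose):
# remove_database()
```

Returning a boolean instead of raising makes the helper safe to call repeatedly, for instance from a setup script that may run against a fresh checkout.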

Enabling Non-Default Services
=============================

Nutch Visualizations
--------------------

-Nutch visualizations are not enabled by default. Nutch visualizations require RabbitMQ, and the method for installing RabbitMQ varies depending on the operating system. RabbitMQ can be installed via Homebrew on Mac, and apt-get on Debian systems. More information on how to install RabbitMQ, read `this page <https://www.rabbitmq.com/download.html>`_. Note: You may also need to change the below command to `sudo rabbitmq-server`, depending on how RabbitMQ is installed on your system and the permissions of the current user.
+Nutch visualizations are not enabled by default. Nutch visualizations require RabbitMQ, and the method for installing RabbitMQ varies depending on the operating system. RabbitMQ can be installed via Homebrew on Mac, and apt-get on Debian systems. For more information on how to install RabbitMQ, read `this page <https://www.rabbitmq.com/download.html>`_. Note: You may also need to change the below command to `sudo rabbitmq-server`, depending on how RabbitMQ is installed on your system and the permissions of the current user.

-To enable Bokeh visualizations for Nutch, change ``autostart=false`` to ``autostart=true`` for both of these directives in `source/supervisord.conf`, and then kill and restart supervisor.
+RabbitMQ and Bokeh-Server are necessary for creating the Nutch visualizations. The Nutch streaming visualization works by creating and subscribing to a queue of AMQP messages (hosted by RabbitMQ) being dispatched from Nutch as it runs the crawl. A background task reads the messages and updates the plot (hosted by Bokeh server).
+
+To enable Bokeh visualizations for Nutch, change `autostart=false` to `autostart=true` for both of these directives in `source/supervisord.conf`, and then kill and restart supervisor.

.. code-block:: html

@@ -98,8 +100,8 @@ Temporal Anomaly Detection (TAD)

.. code-block:: html

-[program:tad]
-command=tad
-priority=5
--autostart=false
-+autostart=false
+[program:tad]
+command=tad
+priority=5
+-autostart=false
++autostart=false
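
The ``autostart`` change that the guide describes for ``source/supervisord.conf`` could also be applied with a script. The sketch below assumes Python's standard ``configparser`` can read the file; the helper name ``enable_autostart`` and the sample text are illustrative only and not part of Memex Explorer. ``configparser`` drops comments when it writes a file back out, so review the result before replacing the real configuration.

```python
import configparser
import io


def enable_autostart(conf_text, programs):
    """Return conf_text with autostart=true set in each [program:<name>] section.

    Hypothetical helper: works on text so the caller decides whether to
    overwrite the real file.
    """
    # interpolation=None leaves values containing "%" untouched
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(conf_text)
    for name in programs:
        section = "program:" + name
        if not parser.has_section(section):
            raise KeyError(section)
        parser.set(section, "autostart", "true")
    out = io.StringIO()
    parser.write(out, space_around_delimiters=False)
    return out.getvalue()


# Sample text mirroring the [program:tad] block shown in the guide.
sample = """[program:tad]
command=tad
priority=5
autostart=false
"""
updated = enable_autostart(sample, ["tad"])
print(updated)
```

Supervisor still has to be killed and restarted afterwards, as the guide notes, before the new setting takes effect.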
