From 1a3de5f5f16ac5919166e9db33817804f4c590f6 Mon Sep 17 00:00:00 2001 From: Brittain Hard Date: Mon, 9 Nov 2015 12:07:34 -0600 Subject: [PATCH 1/5] Moved enabling nutch vis to optional services. --- docs/source/dev_guide.rst | 45 +++++++++++++++++++++------------------ 1 file changed, 24 insertions(+), 21 deletions(-) diff --git a/docs/source/dev_guide.rst b/docs/source/dev_guide.rst index 71b661b7..31ffcd14 100644 --- a/docs/source/dev_guide.rst +++ b/docs/source/dev_guide.rst @@ -30,27 +30,6 @@ Application Setup Memex Explorer will now be running locally at `http://localhost:8000 `_. -Enabling Nutch Visualizations -============================= - - Nutch visualizations are not enabled by default. Nutch visualizations require RabbitMQ, and the method for installing RabbitMQ varies depending on the operating system. RabbitMQ can be installed via Homebrew on Mac, and apt-get on Debian systems. More information on how to install RabbitMQ, read `this page `_. Note: You may also need to change the below command to `sudo rabbitmq-server`, depending on how RabbitMQ is installed on your system and the permissions of the current user. - - To enable Bokeh visualizations for Nutch, change ``autostart=false`` to ``autostart=true`` for both of these directives in source/supervisord.conf, and then kill and restart supervisor. - - .. code-block:: html - - [program:rabbitmq] - command=rabbitmq-server - priority=1 - -autostart=false - +autostart=true - - [program:bokeh-server] - command=bokeh-server --backend memory --port 5006 - priority=1 - -autostart=false - +autostart=true - Tests ===== To run the tests, return to the root directory and run: @@ -70,3 +49,27 @@ Updating the Database As of version 0.4.0, Memex Explorer will start tracking all database migrations. This means that you will be able to upgrade your database and preserve the data without any issues. 
If you are using a version that is 0.3.0 or earlier, and you are unable to update your database without server errors, the best course if action is to delete the existing `file at source/db.sqlite3` and start over with a fresh database. + +Enabling Non-Default Services +========================== + +Nutch Visualizations +-------------------- + + Nutch visualizations are not enabled by default. Nutch visualizations require RabbitMQ, and the method for installing RabbitMQ varies depending on the operating system. RabbitMQ can be installed via Homebrew on Mac, and apt-get on Debian systems. More information on how to install RabbitMQ, read `this page `_. Note: You may also need to change the below command to `sudo rabbitmq-server`, depending on how RabbitMQ is installed on your system and the permissions of the current user. + + To enable Bokeh visualizations for Nutch, change ``autostart=false`` to ``autostart=true`` for both of these directives in source/supervisord.conf, and then kill and restart supervisor. + + .. code-block:: html + + [program:rabbitmq] + command=rabbitmq-server + priority=1 + -autostart=false + +autostart=true + + [program:bokeh-server] + command=bokeh-server --backend memory --port 5006 + priority=1 + -autostart=false + +autostart=true From 9a3dd80ebc70d6ce77c71da0792b0d142740832e Mon Sep 17 00:00:00 2001 From: Brittain Hard Date: Mon, 9 Nov 2015 12:16:39 -0600 Subject: [PATCH 2/5] Fixed problems with indentation. 
--- docs/source/crawler_guide.rst | 6 +++--- docs/source/dev_guide.rst | 17 +++++++++++------ docs/source/index.rst | 13 ++++--------- 3 files changed, 18 insertions(+), 18 deletions(-) diff --git a/docs/source/crawler_guide.rst b/docs/source/crawler_guide.rst index 17daad0c..da2b6cae 100644 --- a/docs/source/crawler_guide.rst +++ b/docs/source/crawler_guide.rst @@ -34,13 +34,13 @@ Creating a Seeds List Crawler Control Buttons ======================= -Here's an overview of the buttons available to each crawler for controlling the crawlers. The buttons behave differently depending on which crawler you are using. + Here's an overview of the buttons available to each crawler for controlling the crawlers. The buttons behave differently depending on which crawler you are using. -These are the buttons available for Ache: + These are the buttons available for Ache: .. image:: _static/img/ache-buttons.png -These are the buttons available for Nutch: + These are the buttons available for Nutch: .. image:: _static/img/nutch-buttons.png diff --git a/docs/source/dev_guide.rst b/docs/source/dev_guide.rst index 31ffcd14..501b0e5b 100644 --- a/docs/source/dev_guide.rst +++ b/docs/source/dev_guide.rst @@ -6,9 +6,9 @@ Developer's Guide to Memex Explorer Setting up Memex Explorer ************************* -To setup your machine, you will need Anaconda or Miniconda installed. Miniconda is a minimal Anaconda installation that bootstraps conda and Python on any operating system. Install `Anaconda `_ or `Miniconda `_ from their respective sites. + To setup your machine, you will need Anaconda or Miniconda installed. Miniconda is a minimal Anaconda installation that bootstraps conda and Python on any operating system. Install `Anaconda `_ or `Miniconda `_ from their respective sites. -Memex Explorer requires conda, either from Miniconda or Anaconda. + Memex Explorer requires conda, either from Miniconda or Anaconda. 
Application Setup ================= @@ -40,18 +40,18 @@ Tests The Database Model ================== -The current entity relation diagram: + The current entity relation diagram: .. image:: _static/img/DbVisualizer.png Updating the Database --------------------- -As of version 0.4.0, Memex Explorer will start tracking all database migrations. This means that you will be able to upgrade your database and preserve the data without any issues. + As of version 0.4.0, Memex Explorer will start tracking all database migrations. This means that you will be able to upgrade your database and preserve the data without any issues. -If you are using a version that is 0.3.0 or earlier, and you are unable to update your database without server errors, the best course if action is to delete the existing `file at source/db.sqlite3` and start over with a fresh database. + If you are using a version that is 0.3.0 or earlier, and you are unable to update your database without server errors, the best course if action is to delete the existing `file at source/db.sqlite3` and start over with a fresh database. Enabling Non-Default Services -========================== +============================= Nutch Visualizations -------------------- @@ -73,3 +73,8 @@ Nutch Visualizations priority=1 -autostart=false +autostart=true + +Domain Discovery Tool +--------------------- + + Domain diff --git a/docs/source/index.rst b/docs/source/index.rst index 00ff1fe3..0655e169 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -1,18 +1,13 @@ Memex Explorer ============== -Memex Explorer is a web application that provides easy-to-use interfaces for -gathering, analyzing, and graphing web crawl data. + Memex Explorer is a web application that provides easy-to-use interfaces for gathering, analyzing, and graphing web crawl data. -For usage instructions, please refer to the `User's Guide `_. + For usage instructions, please refer to the `User's Guide `_. -.. 
For more information about the project architecture, please refer to our `Developer's Guide `_ and `API Guide `_. + For more information about the project architecture, please refer to our `Developer's Guide `_ and `API Guide `_. -Memex Explorer is built by `Continuum Analytics `_, -with grants and support from the -`NASA Jet Propulsion Laboratory `_, -`Kitware `_, -and the `NYU Polytechnic School of Engineering `_. + Memex Explorer is built by `Continuum Analytics `_, with grants and support from the `NASA Jet Propulsion Laboratory `_, `Kitware `_, and the `NYU Polytechnic School of Engineering `_. Contents: From 890ecbd1049bed9b7722e4fe71dba41ae4823685 Mon Sep 17 00:00:00 2001 From: Brittain Hard Date: Mon, 9 Nov 2015 13:54:45 -0600 Subject: [PATCH 3/5] Info on DDT and TAD. --- docs/source/crawler_guide.rst | 13 ++++++++----- docs/source/dev_guide.rst | 33 +++++++++++++++++++++++++++++---- 2 files changed, 37 insertions(+), 9 deletions(-) diff --git a/docs/source/crawler_guide.rst b/docs/source/crawler_guide.rst index da2b6cae..80fc11e6 100644 --- a/docs/source/crawler_guide.rst +++ b/docs/source/crawler_guide.rst @@ -38,11 +38,11 @@ Crawler Control Buttons These are the buttons available for Ache: -.. image:: _static/img/ache-buttons.png + .. image:: _static/img/ache-buttons.png These are the buttons available for Nutch: -.. image:: _static/img/nutch-buttons.png + .. image:: _static/img/nutch-buttons.png Options Button -------------- @@ -54,9 +54,9 @@ Start Button Stop Button ----------- - Symbolized by the "stop" button. Stops the crawl. + Symbolized by the "stop" button. Stops the crawl. - In the case of Ache, the crawler stops immediately. In the case of Nutch, the crawler stops after it has finished the current round. This is in order to prevent data corruption that can occur when killing the Nutch process. + In the case of Ache, the crawler stops immediately. In the case of Nutch, the crawler stops after it has finished the current round. 
This is in order to prevent data corruption that can occur when killing the Nutch process. Restart Button -------------- @@ -70,9 +70,12 @@ Get Crawl Log CCA Export ---------- - This button is Nutch only. It allows you to export your crawl data into the CCA format. +Rounds Input +------------ + Nutch only. This allows you to specify how many rounds you want the crawl to run. You can press the stop button at any time and it will stop when it is done with the current round. + Crawl Settings ============== The crawl settings page allows you to delete the crawl, as well as change the name or description of the crawl. It is accessed by clicking the "pencil" icon next to the name of the crawl. diff --git a/docs/source/dev_guide.rst b/docs/source/dev_guide.rst index 501b0e5b..45d8bdce 100644 --- a/docs/source/dev_guide.rst +++ b/docs/source/dev_guide.rst @@ -58,7 +58,7 @@ Nutch Visualizations Nutch visualizations are not enabled by default. Nutch visualizations require RabbitMQ, and the method for installing RabbitMQ varies depending on the operating system. RabbitMQ can be installed via Homebrew on Mac, and apt-get on Debian systems. More information on how to install RabbitMQ, read `this page `_. Note: You may also need to change the below command to `sudo rabbitmq-server`, depending on how RabbitMQ is installed on your system and the permissions of the current user. - To enable Bokeh visualizations for Nutch, change ``autostart=false`` to ``autostart=true`` for both of these directives in source/supervisord.conf, and then kill and restart supervisor. + To enable Bokeh visualizations for Nutch, change ``autostart=false`` to ``autostart=true`` for both of these directives in `source/supervisord.conf`, and then kill and restart supervisor. .. 
code-block:: html @@ -74,7 +74,32 @@ Nutch Visualizations -autostart=false +autostart=true -Domain Discovery Tool ---------------------- +Domain Discovery Tool (DDT) +--------------------------- + + Domain Discovery Tool can be installed as a conda package. Simply run `conda install ddt` to download the package for DDT. + + As with Nutch visualizations, to enable DDT, change its directive in `source/supervisord.conf`. + + .. code-block:: html + + [program:ddt] + command=ddt + priority=5 + -autostart=false + +autostart=true + +Temporal Anomaly Detection (TAD) +-------------------------------- + + TAD does not currently have a conda package. Like the Nutch visualizations, it also has a RabbitMQ dependency. For instructions on installing TAD, visit the `github repository `_. + + Like DDT and Nutch Visualizations, you also have to change the supervisord directive. + + .. code-block:: html - Domain + [program:tad] + command=tad + priority=5 + -autostart=false + +autostart=false From d498a55098b581880936461a98d9bfdaf779d9aa Mon Sep 17 00:00:00 2001 From: Brittain Hard Date: Mon, 9 Nov 2015 13:58:18 -0600 Subject: [PATCH 4/5] Removed a D. --- docs/source/dev_guide.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/dev_guide.rst b/docs/source/dev_guide.rst index 45d8bdce..d4bc4c93 100644 --- a/docs/source/dev_guide.rst +++ b/docs/source/dev_guide.rst @@ -94,7 +94,7 @@ Temporal Anomaly Detection (TAD) TAD does not currently have a conda package. Like the Nutch visualizations, it also has a RabbitMQ dependency. For instructions on installing TAD, visit the `github repository `_. - Like DDT and Nutch Visualizations, you also have to change the supervisord directive. + Like DDT and Nutch Visualizations, you also have to change the supervisor directive. ..
code-block:: html From 0a30adc6bdc1d7b4a29fbad9080ea25a063811ac Mon Sep 17 00:00:00 2001 From: Brittain Hard Date: Tue, 10 Nov 2015 10:10:37 -0600 Subject: [PATCH 5/5] Wording fixes, updating section about Nutch crawler. --- docs/source/crawler_guide.rst | 12 ++++++------ docs/source/dev_guide.rst | 20 +++++++++++--------- 2 files changed, 17 insertions(+), 15 deletions(-) diff --git a/docs/source/crawler_guide.rst b/docs/source/crawler_guide.rst index 80fc11e6..f6ee40c3 100644 --- a/docs/source/crawler_guide.rst +++ b/docs/source/crawler_guide.rst @@ -28,13 +28,13 @@ Creating a Seeds List Simply put, the seeds list should contain pages that are relevant to the topics you are searching. Both Nutch and Ache provide insight into the relevance of your seeds list, but in different ways. - For the purposes of memex-explorer, the extension and name of your seeds list does not matter. It will be automatically renamed and stored according to the specifications of the crawler. + For the purposes of memex-explorer, the extension and name of your seeds list does not matter. It will be automatically renamed and stored according to the specifications of the crawler. Seeds lists are created on the seeds page, and seeds lists can be created from the add crawl page. Crawler Control Buttons ======================= - Here's an overview of the buttons available to each crawler for controlling the crawlers. The buttons behave differently depending on which crawler you are using. + Here we have an overview of the buttons available to each crawler for controlling the crawlers. The buttons behave differently depending on which one you are using. These are the buttons available for Ache: @@ -56,7 +56,7 @@ Stop Button ----------- Symbolized by the "stop" button. Stops the crawl. - In the case of Ache, the crawler stops immediately. In the case of Nutch, the crawler stops after it has finished the current round. 
This is in order to prevent data corruption that can occur when killing the Nutch process. + In the case of Ache, the crawler stops immediately. In the case of Nutch, the crawler stops after it has finished the current process. However, the data on the current round of the crawl will be lost. Restart Button -------------- @@ -89,11 +89,11 @@ Crawl Settings ***** Nutch ***** - `Nutch `_ is developed by Apache, and has interfaces with both Solr and Elasticsearch, and it allows memex-explorer to offer different crawling functionality from Ache. + `Nutch `_ is developed by Apache, and has an interface with Elasticsearch. All Nutch crawls create Elasticsearch indices by default. - Nutch runs in uninterruptible rounds of crawling. Nutch will run indefinitely until asked to stop. By viewing the crawl log, it is possible to see how many pages are left to crawl in the current round. + With Nutch, you can define how long you want to crawl by setting the number of rounds to crawl. You can keep track of the overall crawl time and the sites currently being crawled by looking at the Nutch crawl visualizations. - The number of pages left to crawl in a Nutch round increases significantly after each round. With Nutch, you can pass it a seeds list of 100 pages to crawl, and it can find over 1000 pages to crawl for the next round. Because of this, Nutch is a much easier crawler to get running. + The number of pages left to crawl in a Nutch round increases significantly after each round. You might pass it a seeds list of 100 pages to crawl, and it can find over 1000 pages to crawl for the next round. Because of this, Nutch is a much easier crawler to get running. Memex Explorer currently uses the Nutch REST API for running all crawls. 
diff --git a/docs/source/dev_guide.rst b/docs/source/dev_guide.rst index d4bc4c93..ab379851 100644 --- a/docs/source/dev_guide.rst +++ b/docs/source/dev_guide.rst @@ -6,7 +6,7 @@ Developer's Guide to Memex Explorer Setting up Memex Explorer ************************* - To setup your machine, you will need Anaconda or Miniconda installed. Miniconda is a minimal Anaconda installation that bootstraps conda and Python on any operating system. Install `Anaconda `_ or `Miniconda `_ from their respective sites. + To set up your machine, you will need Anaconda or Miniconda installed. Miniconda is a minimal Anaconda installation that bootstraps conda and Python on any operating system. Install `Anaconda `_ or `Miniconda `_ from their respective sites. Memex Explorer requires conda, either from Miniconda or Anaconda. @@ -48,7 +48,7 @@ Updating the Database --------------------- As of version 0.4.0, Memex Explorer will start tracking all database migrations. This means that you will be able to upgrade your database and preserve the data without any issues. - If you are using a version that is 0.3.0 or earlier, and you are unable to update your database without server errors, the best course if action is to delete the existing `file at source/db.sqlite3` and start over with a fresh database. + If you are using a version that is 0.3.0 or earlier, and you are unable to update your database without server errors, the best course of action is to delete the existing file at `source/db.sqlite3` and start over with a fresh database. Enabling Non-Default Services ============================= @@ -56,9 +56,11 @@ Enabling Non-Default Services Nutch Visualizations -------------------- - Nutch visualizations are not enabled by default. Nutch visualizations require RabbitMQ, and the method for installing RabbitMQ varies depending on the operating system. RabbitMQ can be installed via Homebrew on Mac, and apt-get on Debian systems. 
More information on how to install RabbitMQ, read `this page `_. Note: You may also need to change the below command to `sudo rabbitmq-server`, depending on how RabbitMQ is installed on your system and the permissions of the current user. + Nutch visualizations are not enabled by default. Nutch visualizations require RabbitMQ, and the method for installing RabbitMQ varies depending on the operating system. RabbitMQ can be installed via Homebrew on Mac, and apt-get on Debian systems. For more information on how to install RabbitMQ, read `this page `_. Note: You may also need to change the below command to `sudo rabbitmq-server`, depending on how RabbitMQ is installed on your system and the permissions of the current user. - To enable Bokeh visualizations for Nutch, change ``autostart=false`` to ``autostart=true`` for both of these directives in `source/supervisord.conf`, and then kill and restart supervisor. + RabbitMQ and Bokeh-Server are necessary for creating the Nutch visualizations. The Nutch streaming visualization works by creating and subscribing to a queue of AMQP messages (hosted by RabbitMQ) being dispatched from Nutch as it runs the crawl. A background task reads the messages and updates the plot (hosted by Bokeh server). + + To enable Bokeh visualizations for Nutch, change `autostart=false` to `autostart=true` for both of these directives in `source/supervisord.conf`, and then kill and restart supervisor. .. code-block:: html @@ -98,8 +100,8 @@ Temporal Anomaly Detection (TAD) .. code-block:: html - [program:tad] - command=tad - priority=5 - -autostart=false - +autostart=false + [program:tad] + command=tad + priority=5 + -autostart=false + +autostart=true
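
The supervisor change that dev_guide.rst describes for the optional services (setting `autostart=true` in a `[program:...]` section of `source/supervisord.conf`) could also be scripted instead of edited by hand. The following is only a minimal sketch and is not part of Memex Explorer: the helper name `enable_autostart` is hypothetical, and it assumes each program section carries a plain `autostart=` line like the ones shown in the guide. It works line by line so that comments and unrelated options in the file are left untouched.

```python
import re

# Matches a section header such as "[program:rabbitmq]".
SECTION_RE = re.compile(r"^\[(?P<name>[^\]]+)\]\s*$")
# Matches an "autostart=<value>" option, keeping any trailing text.
AUTOSTART_RE = re.compile(r"^(\s*autostart\s*=\s*)\S+(.*)$")


def enable_autostart(conf_text, programs):
    """Return conf_text with autostart=true in the named program sections.

    `programs` holds bare program names (for example "rabbitmq"); the
    "program:" prefix used by supervisor section headers is added here.
    Sections that are not listed are returned unchanged.
    """
    wanted = {"program:" + name for name in programs}
    current = None  # name of the section we are currently inside
    out = []
    for line in conf_text.splitlines():
        header = SECTION_RE.match(line)
        if header:
            current = header.group("name")
        elif current in wanted:
            line = AUTOSTART_RE.sub(r"\g<1>true\g<2>", line)
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Hypothetical sample mirroring the directives shown in the guide.
    sample = (
        "[program:rabbitmq]\n"
        "command=rabbitmq-server\n"
        "priority=1\n"
        "autostart=false\n"
        "\n"
        "[program:ddt]\n"
        "command=ddt\n"
        "priority=5\n"
        "autostart=false\n"
    )
    print(enable_autostart(sample, ["rabbitmq"]))
```

To apply it for real, one would read `source/supervisord.conf`, pass its text and the program names to enable (for example `["rabbitmq", "bokeh-server"]`), write the result back, and then kill and restart supervisor as the guide says.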