Index Workflow

Task Queue Status:
http://dwc-indexer.vertnet-portal.appspot.com/mapreduce/status

Indexing is part of the VertNet Data Mobilization Workflow.
https://github.com/VertNet/toolkit/wiki/Mobilization-Steps

Before indexing, complete the post-harvest processing.
https://github.com/VertNet/post-harvest-processor/wiki/Post-Harvest-Processing-Workflow

Indexing

Indexing loads post-harvest processed records into an App Engine index, which is the data store for the VertNet portal, the search API, and the download API.

Search API:
https://github.com/VertNet/webapp/wiki/The-API-search-function

Download API:
https://github.com/VertNet/webapp/wiki/The-API-download-function

Before indexing, make sure the code you are using is up to date:

git checkout master

Remove the storage folder and all its contents. The folder is used for local deployment and testing only and should not be deployed to App Engine.

When you make changes, update the version global variable in indexer.py. The version appears in the App Engine logs and makes them easier to interpret.

On 2016-07-22, the indexer was converted to a module using the gcloud tools for deployment in the branch traiter-atoms-microservice (commit ba170b49bf28813a5d23c62d7064f8470c0cf0e1), later merged into master. Deployment instructions and parameters (such as the VM instance class to use) are given in the service configuration file dwc-indexer.yaml. To deploy to the production version, use:

gcloud app deploy dwc-indexer.yaml --version prod --promote

This will deploy to the version accessible at http://dwc-indexer.vertnet-portal.appspot.com.

The console should prompt about the deployment, something like this:

$ gcloud app deploy dwc-indexer.yaml --version prod --promote
You are about to deploy the following services:
 - vertnet-portal/dwc-indexer/prod (from [/Users/johnwieczorek/Projects/VertNet/dwc-indexer/dwc-indexer.yaml])
     Deployed URL: [https://dwc-indexer-dot-vertnet-portal.appspot.com]

Do you want to continue (Y/n)?  Y

If you get errors such as the following, remember to remove the storage folder and all its contents before deploying:

2014-06-11 13:13:43,590 ERROR appcfg.py:2154 Ignoring file 'storage/search_indexes': Too long (max 32000000 bytes, file is 134592540 bytes)

2014-06-11 13:13:43,656 ERROR appcfg.py:1679 Invalid character in filename: storage/blobs/dev~vertnet-portal/n/ncoded_gs_file:dm4taGFydmVzdC90Y3djX3ZlcnRzLmNzdg==
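
The cleanup step can be scripted so it is not forgotten before a deploy. The following is a minimal sketch and not part of the dwc-indexer repository: the function names are hypothetical, while the folder name storage and the 32000000-byte limit are taken from the text and the error message above.

```python
import os
import shutil

# Size limit quoted in the appcfg.py error message above.
MAX_FILE_BYTES = 32000000


def remove_storage_folder(project_dir):
    """Delete project_dir/storage if present; return True if it was removed."""
    storage = os.path.join(project_dir, "storage")
    if os.path.isdir(storage):
        shutil.rmtree(storage)
        return True
    return False


def oversized_files(project_dir, limit=MAX_FILE_BYTES):
    """Return paths of files under project_dir that are larger than limit bytes."""
    too_big = []
    for root, _dirs, files in os.walk(project_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getsize(path) > limit:
                too_big.append(path)
    return sorted(too_big)
```

Running remove_storage_folder on the checkout and then confirming that oversized_files returns an empty list is one way to check the working copy before calling gcloud app deploy.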

Index one resource

Throughput quotas on Google App Engine limit the number of simultaneous jobs that can be run. In practice, two resources can be safely indexed at once in separate jobs.

A specific resource can be indexed by requesting the index-gcs-path URL with the path to the harvested folder for the resource (files_list), the number of shards to use (shard_count), the namespace for the resulting index (namespace), and the name of the index (index_name), as follows:

http://dwc-indexer.vertnet-portal.appspot.com/index-gcs-path?namespace=index-2013-08-08&index_name=dwc&bucket_name=vertnet-harvesting&files_list=processed/[icode]/[gbifdatasetid]/*&shard_count=10
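
Filling in the template by hand is error-prone, so the URL can be assembled in code. This is a sketch under stated assumptions, meant to be run locally with Python 3: the function name index_url and its argument names are hypothetical, and the default values simply mirror those shown in the template above.

```python
from urllib.parse import urlencode

BASE_URL = "http://dwc-indexer.vertnet-portal.appspot.com/index-gcs-path"


def index_url(icode, gbifdatasetid, namespace, index_name="dwc",
              bucket_name="vertnet-harvesting", shard_count=None):
    """Build the indexing URL for one resource, following the template above."""
    params = [
        ("namespace", namespace),
        ("index_name", index_name),
        ("bucket_name", bucket_name),
        ("files_list", "processed/%s/%s/*" % (icode, gbifdatasetid)),
    ]
    # shard_count is optional; it is only added when explicitly given.
    if shard_count is not None:
        params.append(("shard_count", shard_count))
    # Keep "/" and "*" unescaped so the result matches the template literally.
    return BASE_URL + "?" + urlencode(params, safe="/*")
```

The function only builds the string; it does not start a job. The resulting URL still has to be requested in the same way as a hand-written one.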

It is no longer important to specify shard_count; the job works out how best to allocate shards. The time required to finish a job depends on how many shards are actually allocated, how intensive the indexing task is, and the load on App Engine. The largest resources (MCZ, with 909318 records) can take nearly 5 hours to index, while single-shard jobs with around 10k records usually finish within about 2 minutes.

Multiple indexing jobs can be run simultaneously, but running more than two increases the risk of hitting the search.IndexDocument() quota limit of 15k documents per minute. If the quota limit is reached, the Dashboard Logs for the indexer version of the vertnet-portal application will show an error such as

Put #0 failed for doc ucmp/v/4492 (The API call search.IndexDocument() required more quota than is available.)
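
The figures quoted in this section allow a rough check of why two simultaneous jobs is a sensible ceiling. The following is a back-of-the-envelope sketch only: it assumes each job sustains the average rate implied by its size and duration, whereas real rates vary with shard allocation and App Engine load. The function names are hypothetical.

```python
QUOTA_DOCS_PER_MINUTE = 15000  # search.IndexDocument() quota cited above


def docs_per_minute(records, minutes):
    """Average indexing rate implied by a job's size and duration."""
    return records / minutes


# Figures from the text: MCZ (909318 records, nearly 5 hours) and a
# single-shard job (about 10k records in about 2 minutes).
large_job_rate = docs_per_minute(909318, 5 * 60)
small_job_rate = docs_per_minute(10000, 2)


def max_jobs_under_quota(rate, quota=QUOTA_DOCS_PER_MINUTE):
    """Largest number of simultaneous jobs whose combined rate stays below quota."""
    jobs = 0
    while (jobs + 1) * rate < quota:
        jobs += 1
    return jobs
```

Under these assumptions a small job averages about 5000 documents per minute, so two such jobs stay below the quota while a third would reach it. The large job averages roughly 3000 documents per minute, but an average hides bursts, so it should not be read as a guarantee that more jobs are safe.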

Monitoring the index job

The following URL will show the status of the mapreduce tasks for the indexer application:
http://dwc-indexer.vertnet-portal.appspot.com/mapreduce/status

Monitoring the logs of the indexing job

The indexing job can be monitored for errors by visiting the URL for the dwc-indexer version of the vertnet-portal application in Google Cloud Console at:
https://console.cloud.google.com/logs/viewer?project=vertnet-portal&expandAll=false&logName=&resource=appengine.googleapis.com%2Fmodule_id%2Fdwc-indexer&minLogLevel=0&lastVisibleTimestampNanos=1469462123595287000

The mapreduce status page for the job includes the parameters sent to the job (shard_count, processing_rate, resource, index_name, and namespace). It also shows the number of records processed as "mapper_calls" and the status of individual shards and the job as a whole.

Possible errors

The following error appears when App Engine has a glitch. Usually App Engine recovers from this gracefully by retrying (assumption): Put #0 failed for doc centennial-museum/utep-vertebrates/081330e7-7974-41cb-9dde-762f2a0b8183 (Transient error, please try again.)
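
When scanning exported log text, it can help to separate quota failures from transient ones. The sketch below is an illustration built only from the two messages quoted on this page; the function name is hypothetical, and the pattern is an assumption about the log line format that may not cover other messages the indexer emits.

```python
import re

# Matches lines of the form: Put #0 failed for doc <doc id> (<reason>)
PUT_FAILED = re.compile(r"Put #(\d+) failed for doc (\S+) \((.*)\)\s*$")


def classify_put_failure(line):
    """Return (doc_id, kind) for a 'Put ... failed' log line, else None.

    kind is 'quota', 'transient', or 'other'.
    """
    match = PUT_FAILED.search(line)
    if match is None:
        return None
    doc_id, reason = match.group(2), match.group(3)
    if "required more quota than is available" in reason:
        kind = "quota"
    elif "Transient error" in reason:
        kind = "transient"
    else:
        kind = "other"
    return doc_id, kind
```

Quota failures suggest running fewer jobs at once, while transient failures are the ones App Engine is expected to retry.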

Searching the index

During and after indexing, the index can be queried to check that records are being added. To do so, in the Google Cloud App Engine Console, select Search, enter the namespace for the index, click on "Refresh indexes", and then select the index_name from the resulting list.

Index namespace page:
https://console.cloud.google.com/appengine/search?project=vertnet-portal

Specific index search page:
https://console.cloud.google.com/appengine/search/index/dwc?namespace=index-2013-08-08&project=vertnet-portal

Deleting a data set from an index

Deleting a data set does not delete an index, even if it is the last data set in the index; it only deletes the documents for that data set from the index. To delete a data set, provide its gbifdatasetid along with the index from which to delete it, using the following template:

http://dwc-indexer.vertnet-portal.appspot.com/index-delete-dataset?gbifdatasetid=[gbifdatasetid]&index_name=dwc&namespace=[namespace]
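
Because deletion is destructive, it is worth guarding against a URL in which a template placeholder was left unfilled. The following is a minimal sketch for local use with Python 3: the function name delete_dataset_url is hypothetical, and the bracket check is simply one assumed way to catch leftover placeholders such as [namespace].

```python
from urllib.parse import urlencode

DELETE_URL = "http://dwc-indexer.vertnet-portal.appspot.com/index-delete-dataset"


def delete_dataset_url(gbifdatasetid, namespace, index_name="dwc"):
    """Build the dataset-deletion URL, refusing unfilled template placeholders."""
    for label, value in (("gbifdatasetid", gbifdatasetid),
                         ("namespace", namespace),
                         ("index_name", index_name)):
        if not value or "[" in value or "]" in value:
            raise ValueError("%s is missing or still a placeholder: %r" % (label, value))
    params = [
        ("gbifdatasetid", gbifdatasetid),
        ("index_name", index_name),
        ("namespace", namespace),
    ]
    return DELETE_URL + "?" + urlencode(params)
```

The function only returns the URL string, so it can be reviewed before anything is requested.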

The data set deletion can be monitored by watching the dwc-indexer log at:
https://console.cloud.google.com/logs/viewer?project=vertnet-portal&minLogLevel=0&expandAll=false&resource=gae_app%2Fmodule_id%2Fdwc-indexer&key1=default&key2=prod&logName=projects%2Fvertnet-portal%2Flogs%2Fappengine.googleapis.com%252Frequest_log

Cleaning an index

To remove all of the records from an index, the index-clean task must be enabled. It is disabled by default because its effects are so dangerous. Although cleaning an index costs nothing in itself (all operations are basic search operations), the time and cost of reloading the index afterwards are high enough that the task should only be enabled deliberately. To enable the index-clean task, uncomment it in the queue.yaml file so that it appears as follows:

# index-clean is dangerous - turn it on only if you really need to
- name: index-clean
  rate: 35/s
  retry_parameters:
    task_retry_limit: 7
    task_age_limit: 60m
    min_backoff_seconds: 30
    max_backoff_seconds: 960
    max_doublings: 7

Then redeploy the dwc-indexer version using gcloud as described above.

After successful deployment the index-clean task should appear as enabled in the Google Cloud App Engine Console for Tasks:
https://console.cloud.google.com/appengine/taskqueues?project=vertnet-portal&tab=PUSH

At that point the task can be invoked to remove all of the records from an index by substituting the namespace of the index for [namespace] in the following URL:
http://dwc-indexer.vertnet-portal.appspot.com/index-clean?index_name=dwc&namespace=[namespace]

Using an index

To set an application to use a particular index, edit appengine_config.py (https://github.com/VertNet/webapp/blob/master/appengine_config.py#L49) with the desired index name:

            return 'index-2014-02-06'
