Index Workflow
Task Queue Status:
http://dwc-indexer.vertnet-portal.appspot.com/mapreduce/status
Indexing is part of the VertNet Data Mobilization Workflow.
https://github.com/VertNet/toolkit/wiki/Mobilization-Steps
Before indexing, complete the post-harvest processing.
https://github.com/VertNet/post-harvest-processor/wiki/Post-Harvest-Processing-Workflow
Indexing loads post-harvest processed records into an App Engine index, which is the data store for the VertNet portal, the search API (https://github.com/VertNet/webapp/wiki/The-API-search-function), and the download API (https://github.com/VertNet/webapp/wiki/The-API-download-function).
Before indexing, make sure the code you are using is up to date:
git checkout master
Remove the storage folder and all its contents. The folder is used only for local deployment and testing and should not be deployed to App Engine.
When you make changes, update the version global variable in indexer.py. This makes the App Engine logs easier to interpret.
On 2016-07-22, the indexer was converted to a module using the gcloud tools for deployment in the branch traiter-atoms-microservice (commit ba170b49bf28813a5d23c62d7064f8470c0cf0e1), later merged into master. Deployment instructions and parameters (such as the VM instance class to use) are given in the service configuration file dwc-indexer.yaml. To deploy to the production version, use:
gcloud app deploy dwc-indexer.yaml --version prod --promote
This will deploy to the version accessible at http://dwc-indexer.vertnet-portal.appspot.com.
The console should prompt about the deployment, something like this:
$ gcloud app deploy dwc-indexer.yaml --version prod --promote
You are about to deploy the following services:
- vertnet-portal/dwc-indexer/prod (from [/Users/johnwieczorek/Projects/VertNet/dwc-indexer/dwc-indexer.yaml])
Deployed URL: [https://dwc-indexer-dot-vertnet-portal.appspot.com]
Do you want to continue (Y/n)? Y
If you get errors such as the following, remember to remove the storage folder and all its contents before deploying:
2014-06-11 13:13:43,590 ERROR appcfg.py:2154 Ignoring file 'storage/search_indexes': Too long (max 32000000 bytes, file is 134592540 bytes)
2014-06-11 13:13:43,656 ERROR appcfg.py:1679 Invalid character in filename: storage/blobs/dev~vertnet-portal/n/ncoded_gs_file:dm4taGFydmVzdC90Y3djX3ZlcnRzLmNzdg==
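The clean-up step can be scripted so that it is not forgotten before a deployment. The following is a minimal sketch, not part of the indexer code: the helper name remove_local_storage is hypothetical, and it assumes the storage folder sits directly under the repository root passed in.

```python
import shutil
from pathlib import Path


def remove_local_storage(repo_dir):
    """Delete the local-only 'storage' folder under repo_dir, if present.

    Hypothetical helper: assumes 'storage' is a direct child of repo_dir.
    Returns True if a folder was removed, False if there was nothing to remove.
    """
    storage = Path(repo_dir) / "storage"
    if not storage.is_dir():
        return False
    # Removes the folder and everything inside it (search_indexes, blobs, ...).
    shutil.rmtree(storage)
    return True
```

Calling remove_local_storage(".") from the repository root before running gcloud app deploy would have the same effect as deleting the folder by hand.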
Throughput quotas on Google App Engine limit the number of simultaneous jobs that can be run. In practice, two resources can be safely indexed at once in separate jobs.
A specific resource can be indexed by adding the path to the harvested folder for the resource, the number of shards to use, the namespace for the resulting index, and the index_name as follows:
http://dwc-indexer.vertnet-portal.appspot.com/index-gcs-path?namespace=index-2013-08-08&index_name=dwc&bucket_name=vertnet-harvesting&files_list=processed/[icode]/[gbifdatasetid]/*&shard_count=10
Specifying shard_count is no longer important; the job works out the optimal sharding itself. The time required to finish a job depends on how many shards are actually allocated, how intensive the indexing task is, and the load on App Engine. The largest resources (for example MCZ, with 909318 records) can take nearly 5 hours to index, while single-shard jobs with around 10k records usually finish within about 2 minutes.
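To avoid hand-editing the indexing URL for every resource, it can be assembled from its parts. This is a sketch under assumptions: the function name index_gcs_path_url is hypothetical, the defaults for index_name and bucket_name are simply copied from the example URL above, and shard_count is left out unless given.

```python
from urllib.parse import urlencode

INDEXER_BASE = "http://dwc-indexer.vertnet-portal.appspot.com"


def index_gcs_path_url(icode, gbifdatasetid, namespace,
                       index_name="dwc", bucket_name="vertnet-harvesting",
                       shard_count=None):
    """Build an index-gcs-path URL for one harvested resource.

    Hypothetical helper; defaults mirror the example URL in this document.
    """
    params = [
        ("namespace", namespace),
        ("index_name", index_name),
        ("bucket_name", bucket_name),
        ("files_list", "processed/%s/%s/*" % (icode, gbifdatasetid)),
    ]
    if shard_count is not None:
        params.append(("shard_count", shard_count))
    # Keep "/" and "*" unescaped so the URL matches the template above.
    return INDEXER_BASE + "/index-gcs-path?" + urlencode(params, safe="/*")
```

For example, index_gcs_path_url("[icode]", "[gbifdatasetid]", "index-2013-08-08") with the bracketed placeholders replaced by real values yields the template URL without the shard_count parameter.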
Multiple indexing jobs can be run simultaneously, but running more than two increases the risk of exceeding the search.IndexDocument() quota limit of 15k documents per minute. If the quota limit is reached, the Dashboard Logs for the indexer version of the vertnet-portal application will show an error such as
Put #0 failed for doc ucmp/v/4492 (The API call search.IndexDocument() required more quota than is available.)
The following URL will show the status of the mapreduce tasks for the indexer application:
http://dwc-indexer.vertnet-portal.appspot.com/mapreduce/status
The indexing job can be monitored for errors by visiting the URL for the dwc-indexer version of the vertnet-portal application in Google Cloud Console at:
https://console.cloud.google.com/logs/viewer?project=vertnet-portal&expandAll=false&logName=&resource=appengine.googleapis.com%2Fmodule_id%2Fdwc-indexer&minLogLevel=0&lastVisibleTimestampNanos=1469462123595287000
The mapreduce status page for the job includes the parameters sent to the job (shard_count, processing_rate, resource, index_name, and namespace). It also shows the number of records processed as "mapper_calls" and the status of individual shards and the job as a whole.
The following error appears when App Engine has a transient problem. App Engine usually seems to recover from it gracefully by retrying, though this is an assumption:
Put #0 failed for doc centennial-museum/utep-vertebrates/081330e7-7974-41cb-9dde-762f2a0b8183 (Transient error, please try again.)
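When scanning exported log lines, it can help to separate quota failures, which call for running fewer simultaneous jobs, from transient ones. The sketch below is an illustration only: the function name classify_put_failure is hypothetical, and the pattern is based solely on the two error messages quoted in this document, so other message formats may exist.

```python
import re

# Matches the "Put #N failed for doc <id> (<reason>)" lines quoted in this document.
PUT_FAILED = re.compile(r"Put #\d+ failed for doc (?P<doc>\S+) \((?P<reason>.*)\)")


def classify_put_failure(line):
    """Return (doc_id, kind) for a failed Put log line, or None if it does not match.

    kind is "quota", "transient", or "other". Hypothetical helper.
    """
    match = PUT_FAILED.search(line)
    if match is None:
        return None
    reason = match.group("reason")
    if "required more quota than is available" in reason:
        kind = "quota"
    elif "Transient error" in reason:
        kind = "transient"
    else:
        kind = "other"
    return match.group("doc"), kind
```

A run of "quota" results suggests that too many jobs are running at once, while isolated "transient" results are the kind App Engine is assumed to retry.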
During and after indexing, queries can be made on the index to see whether records are being added. To do so, in the Google Cloud App Engine Console, select Search, enter the namespace for the index, click on "Refresh indexes", and then select the index_name from the resulting list.
Index namespace page:
https://console.cloud.google.com/appengine/search?project=vertnet-portal
Specific index search page:
https://console.cloud.google.com/appengine/search/index/dwc?namespace=index-2013-08-08&project=vertnet-portal
Deleting a data set does not delete an index, even if it is the last data set in the index; it only deletes all documents for that data set from the index. To delete a data set, provide its gbifdatasetid along with the index from which to delete it, using the following template:
http://dwc-indexer.vertnet-portal.appspot.com/index-delete-dataset?gbifdatasetid=[gbifdatasetid]&index_name=dwc&namespace=[namespace]
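Because a deletion request with an unfilled placeholder is easy to send by mistake, the URL can be built by a small function that rejects empty values. This is a sketch only: the function name index_delete_dataset_url is hypothetical, and the default index_name of dwc is taken from the template above.

```python
from urllib.parse import urlencode

DELETE_ENDPOINT = "http://dwc-indexer.vertnet-portal.appspot.com/index-delete-dataset"


def index_delete_dataset_url(gbifdatasetid, namespace, index_name="dwc"):
    """Build the index-delete-dataset URL, refusing empty parameters.

    Hypothetical helper; parameter order follows the template in this document.
    """
    if not gbifdatasetid or not namespace or not index_name:
        raise ValueError("gbifdatasetid, namespace and index_name are all required")
    query = urlencode([
        ("gbifdatasetid", gbifdatasetid),
        ("index_name", index_name),
        ("namespace", namespace),
    ])
    return DELETE_ENDPOINT + "?" + query
```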
One can monitor the data set deletion by watching the dwc-indexer log at:
https://console.cloud.google.com/logs/viewer?project=vertnet-portal&minLogLevel=0&expandAll=false&resource=gae_app%2Fmodule_id%2Fdwc-indexer&key1=default&key2=prod&logName=projects%2Fvertnet-portal%2Flogs%2Fappengine.googleapis.com%252Frequest_log
To remove all of the records from an index, the index-clean task must be enabled. It is disabled by default because its effects are so dangerous. Although it costs nothing to clean an index (all operations are basic search operations), the time and cost of reloading the index afterwards make it better to enable the task only on purpose. To enable the index-clean task, uncomment it in the queue.yaml file so that it appears as follows:
# index-clean is dangerous - turn it on only if you really need to
- name: index-clean
  rate: 35/s
  retry_parameters:
    task_retry_limit: 7
    task_age_limit: 60m
    min_backoff_seconds: 30
    max_backoff_seconds: 960
    max_doublings: 7
Then redeploy the dwc-indexer version using gcloud as described above.
After successful deployment the index-clean task should appear as enabled in the Google Cloud App Engine Console for Tasks:
https://console.cloud.google.com/appengine/taskqueues?project=vertnet-portal&tab=PUSH
At that point the task can be invoked to remove all of the records from an index by substituting the namespace of the index for [namespace] in the following URL:
http://dwc-indexer.vertnet-portal.appspot.com/index-clean?index_name=dwc&namespace=[namespace]
To set an application to use a particular index, edit appengine_config.py (https://github.com/VertNet/webapp/blob/master/appengine_config.py#L49) with the desired index name:
return 'index-2014-02-06'