Alexandre Rademaker edited this page Dec 16, 2019 · 33 revisions

Data model

Annotation in sensetion is organized into projects. Each project has its own data collection and may share a WordNet collection with other projects. The data collection holds the documents to be annotated, while the WordNet collection provides the annotation ontology. See Usage for how to define a project.

If someone has already sent you the data, you can skip the following sections on generating it.

Generate WN data

To produce the WordNet data you will need to install mill and jq. Use mill’s export command (see mill --help) and then run the following jq script on the output file:

jq -c '{_id: .wordsenses|.[0]|.senseKey, lexname: .id|.[1], pos: .id|.[0]|ascii_downcase, keys: .wordsenses|map(.senseKey), terms: .wordsenses|map(.lexicalForm), definition: .definition, examples: .examples}' out.json > wn.json

This ensures the output is in the correct format for sensetion.
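
For readers unfamiliar with jq, the filter above can be mirrored in Python. This is only an illustrative sketch: the function name and the sample record are invented, and the input layout (`id` as a `[pos, lexname]` pair, `wordsenses` as a list of objects with `senseKey` and `lexicalForm`) is inferred from the filter itself, not from mill's documentation.

```python
import json

def to_sensetion_synset(record):
    """Mirror the jq filter: reshape one exported synset record."""
    senses = record["wordsenses"]
    return {
        "_id": senses[0]["senseKey"],             # the first sense key identifies the synset
        "lexname": record["id"][1],
        "pos": record["id"][0].lower(),           # jq's ascii_downcase
        "keys": [s["senseKey"] for s in senses],
        "terms": [s["lexicalForm"] for s in senses],
        "definition": record.get("definition"),   # jq yields null when the field is absent
        "examples": record.get("examples"),
    }

# Hypothetical input record, made up for illustration only.
sample = {
    "id": ["N", "noun.animal"],
    "wordsenses": [
        {"senseKey": "dog%1:05:00::", "lexicalForm": "dog"},
        {"senseKey": "domestic_dog%1:05:00::", "lexicalForm": "domestic dog"},
    ],
    "definition": "a domesticated canid",
    "examples": ["the dog barked all night"],
}

converted = to_sensetion_synset(sample)
print(json.dumps(converted))  # one JSON object per line, similar to `jq -c`
```

Each output object carries the `_id` that MongoDB will use as the document key, which is why the first sense key is promoted to that field.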

You will then have to insert this data into the backend, as described under “Inserting data into the backend”.

Generate corpus data

Create and activate a Python virtual environment (instructions), then run:

pip install -r utils/requirements.txt

Sensetion data is generated by a series of Python scripts, which are detailed below.

Data scripts

touch.py

Prepares sentences for input to sensetion.el. It takes one or more files containing one sentence per line. For the available options, run:

python3 touch.py --help

You’ll need a configuration file for the REPP tokenizer; one is available at pet/repp.set in the repository at http://svn.delph-in.net/erg/trunk.

This script performs:

  • tokenization;
  • identification (adds a doc_id corresponding to the filename and a sent_id corresponding to the sentence number).
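
The two steps above can be pictured with the following minimal sketch. It is not the implementation of touch.py: whitespace splitting stands in for the REPP tokenizer, the function name and the `tokens` field are hypothetical, and both the numbering of sentences from 0 and the use of the file name without its extension as `doc_id` are assumptions.

```python
from pathlib import Path

def identify_sentences(filename, lines):
    """Attach a doc_id and a sent_id to each non-empty sentence line."""
    doc_id = Path(filename).stem  # assumption: file name without its extension
    sentences = (text for text in lines if text.strip())  # skip blank lines
    records = []
    for sent_id, text in enumerate(sentences):  # assumption: numbering starts at 0
        records.append({
            "doc_id": doc_id,
            "sent_id": sent_id,
            "tokens": text.split(),  # stand-in for REPP tokenization
        })
    return records

records = identify_sentences(
    "corpus/example.txt",
    ["The dog barked.\n", "It ran away.\n"],
)
for record in records:
    print(record)
```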

You should run the enrich.py script on the output of touch.py.

enrich.py

The corpus prepared by touch.py is not sufficient by itself. This script adds all lemma candidates and automatic sense-tagging (for unambiguous words), and may do more in the future.

You’ll need to download the WordNet data through NLTK in order to run this script, for example from a Python session:

import nltk
nltk.download('wordnet')
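
To illustrate what tagging only unambiguous words means, here is a hedged sketch; it is not the logic of enrich.py itself. A form is tagged automatically only when the lookup returns exactly one sense key. The helper names are hypothetical, the toy table is invented, and `nltk_sense_keys` relies on NLTK's `wordnet.lemmas` and `Lemma.key` without doing any morphological analysis.

```python
def nltk_sense_keys(form):
    """Look up sense keys for a form in NLTK's WordNet (needs the download above)."""
    from nltk.corpus import wordnet as wn  # imported lazily so the rest runs without NLTK
    return sorted({lemma.key() for lemma in wn.lemmas(form)})

def auto_tag(form, lookup=nltk_sense_keys):
    """Return the only candidate sense key, or None if the form is ambiguous or unknown."""
    candidates = lookup(form)
    return candidates[0] if len(candidates) == 1 else None

# Toy lookup table (invented data) so the example runs without WordNet installed.
toy_senses = {
    "aardvark": ["aardvark%1:05:00::"],
    "bank": ["bank%1:14:00::", "bank%1:17:01::"],
}

def toy_lookup(form):
    return toy_senses.get(form, [])

print(auto_tag("aardvark", toy_lookup))  # single candidate, so it is tagged
print(auto_tag("bank", toy_lookup))      # several candidates, left for the annotator
```

Passing the lookup as a parameter keeps the decision rule separate from the WordNet access, so the rule can be tried on toy data first.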

Backend

Inserting data into the backend

To insert WordNet or corpus data for use by sensetion.el into a MongoDB instance, you can use mongoimport, which should have been installed along with the MongoDB distribution. MongoDB can validate inserted documents, and we provide a schema for the document collection at utils/mongo-document-schema.js. To apply it, run:

mongo <DATABASE-NAME> --eval 'var collection = "<COLLECTION-NAME>"' utils/mongo-document-schema.js

Note that this is meant only for document collections; don’t try to run this command for a WordNet synset collection.

You can check mongoimport --help or its reference for in-depth options, but basic usage is:

mongoimport --db=<DATABASE-NAME> --collection=<COLLECTION-NAME> --file=<INPUT-FILE>

You can add the --drop flag to drop the collection before inserting the documents, but note that this also drops the document schema. If you’d like to preserve the schema, first drop the collection (using an empty file as the argument, for instance), and then run the commands above.

You can use any names for databases and collections, as long as the projects are defined with the same names (see Usage). We recommend using a single database for all your projects (unless you have a lot of projects) and separate collections for different projects or WordNet versions, as in the example below:

mongoimport --drop --db=sensetion --collection=wordnet30 --file=wordnet30.json
mongoimport --db=sensetion --collection=glosstag --file=glosstag.json
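
Before importing, it can help to confirm that the input file holds one JSON document per line, which is what `jq -c` produces and what mongoimport reads by default. The checker below is a hypothetical helper, not part of the sensetion utilities; it also reports repeated `_id` values, which MongoDB would reject as duplicate keys.

```python
import json

def check_json_lines(lines):
    """Return a list of problems found in an iterable of JSON-lines text."""
    problems, seen_ids = [], set()
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            document = json.loads(line)
        except json.JSONDecodeError as error:
            problems.append(f"line {number}: invalid JSON ({error.msg})")
            continue
        if not isinstance(document, dict):
            problems.append(f"line {number}: not a JSON object")
            continue
        if "_id" in document:
            # Serialize the _id so that non-string ids can be compared too.
            key = json.dumps(document["_id"], sort_keys=True)
            if key in seen_ids:
                problems.append(f"line {number}: duplicate _id {key}")
            seen_ids.add(key)
    return problems

# Invented sample lines: one duplicate _id and one truncated document.
sample_lines = [
    '{"_id": "dog%1:05:00::", "terms": ["dog"]}',
    '{"_id": "dog%1:05:00::", "terms": ["domestic dog"]}',
    '{"terms": ["cat"]',
]
problems = check_json_lines(sample_lines)
print("\n".join(problems))
```

An open file object can be passed in place of `sample_lines`, since iterating over a file yields its lines.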

Emacs must be restarted after the database has been changed.

Backup from the backend

You can back up the data from a MongoDB collection using

mongoexport --db <DATABASE-NAME> --collection <COLLECTION-NAME> --out out.json

from your shell, assuming mongod is running.
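
For regular backups, the command above can be wrapped in a short script. This sketch is a convenience built on assumptions, not part of sensetion: the helper name and the dated file naming are hypothetical, and only the `--db`, `--collection`, and `--out` options shown above are used.

```python
import datetime
import subprocess

def mongoexport_command(database, collection, day=None):
    """Build the mongoexport argument list with a dated output file."""
    day = day or datetime.date.today()
    out_file = f"{database}-{collection}-{day.isoformat()}.json"  # hypothetical naming scheme
    return ["mongoexport", "--db", database, "--collection", collection, "--out", out_file]

command = mongoexport_command("sensetion", "glosstag", datetime.date(2019, 12, 16))
print(" ".join(command))

# To actually run the export (mongod must be running), uncomment:
# subprocess.run(command, check=True)
```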

Basic MongoDB usage

You can use the mongo shell (invoked by the mongo command) to perform basic Mongo administration. After entering the mongo shell, run show databases to list all available databases.

Run use <DATABASE-NAME> to select the database you want. Now all instances of the variable db will refer to this database.

show collections lists all the collections in the current database. Running db by itself shows the database you are currently using.

You can drop a collection from the current database by running db.<COLLECTION-NAME>.drop(). To drop the database with all of its collections, run db.dropDatabase().

For more, check out the mongo shell reference.

Data description

TODO