Data
sensetion annotation is done on projects. Each project has its own
data collection, and may or may not share a Wordnet collection with
another project. The data collection encompasses the documents that
will be annotated, while the Wordnet provides the annotation
ontology. See Usage for how to define a project.
If you already have data someone sent you, you can skip the next sections about generating it.
To produce the WordNet data you will need to install mill and jq. Use
mill’s export command (see mill --help) and then run the following jq
script on the output file:
jq -c '{_id: .wordsenses|.[0]|.senseKey, lexname: .id|.[1], pos: .id|.[0]|ascii_downcase, keys: .wordsenses|map(.senseKey), terms: .wordsenses|map(.lexicalForm), definition: .definition, examples: .examples}' out.json > wn.json
This ensures the output is in the correct format for sensetion.
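To make the mapping performed by the jq filter easier to follow, here is a minimal Python sketch of the same transformation applied to a single record. The input layout (an id array holding the part of speech and the lexname, and a wordsenses array of objects with senseKey and lexicalForm) is inferred from the filter itself, and the sample record is hypothetical rather than real mill output.

```python
import json


def to_sensetion_synset(synset):
    """Mirror the jq filter for one exported synset (a dict)."""
    senses = synset["wordsenses"]
    return {
        "_id": senses[0]["senseKey"],          # first sense key identifies the synset
        "lexname": synset["id"][1],
        "pos": synset["id"][0].lower(),        # jq uses ascii_downcase here
        "keys": [s["senseKey"] for s in senses],
        "terms": [s["lexicalForm"] for s in senses],
        "definition": synset.get("definition"),
        "examples": synset.get("examples"),
    }


# Hypothetical record, shaped the way the jq filter expects its input.
sample = {
    "id": ["N", "noun.animal"],
    "wordsenses": [
        {"senseKey": "dog-key-1", "lexicalForm": "dog"},
        {"senseKey": "dog-key-2", "lexicalForm": "domestic dog"},
    ],
    "definition": "a placeholder definition",
    "examples": ["a placeholder example"],
}

# One JSON object per line, as produced by jq -c.
line = json.dumps(to_sensetion_synset(sample))
print(line)
```

Each output line is one synset document whose _id is the first sense key, which is the shape later imported into the WordNet collection.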
You will then have to import this data into the backend; see below.
Please create and enter a python virtual environment (instructions) and run:
pip install -r utils/requirements.txt
Sensetion data is generated by a series of python scripts, which are detailed below.
Prepare sentences for input to sensetion.el. Provide one or more
files with one sentence per line.
python3 touch.py --help
You’ll need a configuration file for the REPP tokenizer; there is one
at pet/repp.set in http://svn.delph-in.net/erg/trunk.
This script performs:
- tokenization;
- identification (adds a doc_id corresponding to the filename and a sent_id corresponding to the sentence number).
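The identification step can be pictured with the rough sketch below. It is not the code of touch.py: whitespace splitting stands in for the REPP tokenizer, and the record layout, the use of the file stem as doc_id, the numbering from 1, and the skipping of blank lines are all assumptions made for illustration.

```python
from pathlib import Path


def identify(filename, lines):
    """Attach a doc_id and a sent_id to every non-blank sentence line."""
    doc_id = Path(filename).stem  # assumption: file name without extension
    records = []
    sent_id = 0
    for line in lines:
        text = line.strip()
        if not text:
            continue  # assumption: blank lines are not sentences
        sent_id += 1
        records.append({
            "doc_id": doc_id,
            "sent_id": sent_id,
            "text": text,
            "tokens": text.split(),  # stand-in for REPP tokenization
        })
    return records


# Hypothetical input file with one sentence per line.
records = identify(
    "corpus/sample.txt",
    ["The dog barked.\n", "\n", "It rained all night.\n"],
)
for record in records:
    print(record["doc_id"], record["sent_id"], record["tokens"])
```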
You should run the enrich.py script on the output of touch.py.
The corpus prepared by touch.py is not sufficient by itself. This
script adds all lemma candidates and automatic sense-tagging (for
unambiguous words), and may do more in the future.
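The idea of tagging only unambiguous words can be sketched as follows. This is an illustration, not the logic of enrich.py: the sense inventory is a small hypothetical dictionary rather than WordNet via NLTK, lower-casing stands in for real lemmatization, and the output field names are made up for the example.

```python
def tag_unambiguous(tokens, inventory):
    """List candidate senses per token and tag those with exactly one."""
    tagged = []
    for token in tokens:
        senses = inventory.get(token.lower(), [])  # stand-in for lemma lookup
        tagged.append({
            "form": token,
            "candidates": senses,
            # Only a single candidate is safe to assign automatically.
            "sense": senses[0] if len(senses) == 1 else None,
        })
    return tagged


# Hypothetical inventory with placeholder sense identifiers.
inventory = {
    "aardvark": ["aardvark-sense-1"],
    "bank": ["bank-sense-1", "bank-sense-2"],
}

result = tag_unambiguous(["Aardvark", "bank", "xyzzy"], inventory)
for entry in result:
    print(entry["form"], entry["sense"])
```

Words with several candidates keep their full candidate list and are left for the human annotator.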
You’ll need to download wordnet data from NLTK in order to run this script.
import nltk
nltk.download('wordnet')
To insert Wordnet or corpus data to be used by sensetion.el into a
MongoDB instance, you can use mongoimport, which should have been
installed along with the mongo distribution. MongoDB offers validation
of inserted documents, and we provide a schema at
utils/mongo-document-schema.js for the document collection. To use
it, please run
mongo <DATABASE-NAME> --eval 'var collection = "<COLLECTION-NAME>"' utils/mongo-document-schema.js
Note that this is meant only for document collections; don’t try to run this command for a WordNet synset collection.
You can check mongoimport --help or its reference for in-depth
options, but basic usage is:
mongoimport --db=<DATABASE-NAME> --collection=<COLLECTION-NAME> --file=<INPUT-FILE>
You can add the --drop flag to drop the collection before inserting
the documents into the specified database and collection, but note that
this will drop the document schema too. If you’d like to preserve the
schema, you should first drop the collection (for instance, by passing
an empty file as the argument), and then run the commands above.
You can use any name for databases or collections, as long as the projects are defined with proper names (see Usage). We recommend you use the same database for all your projects (unless you have a lot of projects), and different collections for different projects or wordnet versions, like in the example below:
mongoimport --drop --db=sensetion --collection=wordnet30 --file=wordnet30.json
mongoimport --db=sensetion --collection=glosstag --file=glosstag.json
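When several collections have to be imported, it can help to build the commands programmatically. The sketch below only assembles the argument lists using the flags shown above; the helper name mongoimport_args is hypothetical, and actually running the commands requires mongoimport to be installed.

```python
import shlex


def mongoimport_args(db, collection, input_file, drop=False):
    """Build the mongoimport argument list for one collection."""
    args = ["mongoimport"]
    if drop:
        args.append("--drop")  # also discards the collection's schema
    args += [f"--db={db}", f"--collection={collection}", f"--file={input_file}"]
    return args


wordnet_cmd = mongoimport_args("sensetion", "wordnet30", "wordnet30.json", drop=True)
glosstag_cmd = mongoimport_args("sensetion", "glosstag", "glosstag.json")

# Print the commands for inspection before running them.
print(shlex.join(wordnet_cmd))
print(shlex.join(glosstag_cmd))
```

Each list can then be passed to subprocess.run(args, check=True) to perform the import.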
Emacs must be restarted after the database has been changed.
You can back up the data from a MongoDB collection using
mongoexport --db <DATABASE-NAME> --collection <COLLECTION-NAME> --out out.json
from your shell, assuming mongod is running.
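A quick sanity check on a backup can catch a truncated or empty export. The sketch below assumes the default mongoexport output of one JSON document per line (that is, no --jsonArray flag); the helper name and the demo file contents are hypothetical.

```python
import json
import os
import tempfile


def summarize_export(path):
    """Return (document count, documents lacking an _id) for an export."""
    count = 0
    missing_id = 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue
            doc = json.loads(line)  # raises ValueError on a corrupt line
            count += 1
            if "_id" not in doc:
                missing_id += 1
    return count, missing_id


# Demo on a small hypothetical export written to a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write('{"_id": "doc-1", "text": "first"}\n')
    tmp.write('{"_id": "doc-2", "text": "second"}\n')
    demo_path = tmp.name

summary = summarize_export(demo_path)
os.remove(demo_path)
print(summary)
```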
You can use the mongo shell (invoked by the mongo command) to
perform basic Mongo administration. After entering the mongo shell,
run show databases to list all available databases.
Run use <DATABASE-NAME> to select the database you want. Now all
instances of the variable db will refer to this database.
show collections will list all the collections in the current
database. (Just running db will show you the database you are
currently using.)
You can drop a collection from the current database by running
db.<COLLECTION-NAME>.drop(). To drop the database with all of its
collections, run db.dropDatabase().
For more, check out the mongo shell reference.
TODO