ItsdbTreebanking

FrancisBond edited this page May 4, 2005 · 17 revisions

Treebanking

This page describes how to treebank with itsdb (ItsdbTop).

itsdb supports Redwoods-style treebanking, from the Trees menu. It has been used to produce the Redwoods (RedwoodsTop) and Hinoki treebanks. You can annotate a corpus; update an annotated corpus to a new grammar; and train statistical models on the treebanked corpora.

Normally only active and deduced discriminants are written out. You can write out all discriminants by setting *redwoods-record-void-discriminants-p* to t when you are in the tsdb package.

After selecting a profile, Trees | Annotate will bring you into the interface for compiling a treebank. You must have the same grammar loaded in the LKB (LkbTop) that was used to parse the profile, because the system uses the grammar to reconstruct the parse trees.

The annotator selects the correct analysis (or, occasionally, rejects all analyses). Selection is done through a choice of discriminants. The system selects features that distinguish between different parses, and the annotator selects or rejects the features until only one parse remains. The number of decisions for each sentence is normally around log_2 of the number of parses, although sometimes a single decision can reduce the number of remaining parses by more or less than half. In general, even a sentence with 5,000 parses only requires around 12 decisions.
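The narrowing process can be illustrated with a toy model. This is only a sketch under stated assumptions: the parse names and property names are made up, and real discriminants are derived from the grammar rather than from hand-written sets.

```python
import math

def discriminants(parses):
    """Properties that hold for some, but not all, of the remaining parses."""
    all_props = set().union(*parses.values())
    return {p for p in all_props
            if 0 < sum(p in props for props in parses.values()) < len(parses)}

def decide(parses, prop, accept):
    """Keep the parses consistent with one yes/no decision on a property."""
    return {name: props for name, props in parses.items()
            if (prop in props) == accept}

# Four hypothetical parses of one sentence, each described by a set of
# made-up properties.
parses = {
    "t1": {"pp-attaches-to-verb", "noun-compound"},
    "t2": {"pp-attaches-to-verb"},
    "t3": {"pp-attaches-to-noun", "noun-compound"},
    "t4": {"pp-attaches-to-noun"},
}

# Each decision here halves the remaining set, so two decisions suffice.
remaining = decide(parses, "pp-attaches-to-verb", accept=True)   # t1, t2 remain
remaining = decide(remaining, "noun-compound", accept=False)     # only t2 remains
decisions_needed = math.log2(len(parses))
```

With ideal halving, 5,000 parses would need `math.log2(5000)` decisions, which lies between 12 and 13 and agrees with the figure quoted above.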

After you completely disambiguate, in the left-hand-side window you will see the elementary dependency structures displayed underneath the tree. Quantifiers and messages are suppressed (by default: see /src/mrs/dependencies.lisp for the configuration options).

The dependencies are color coded:

  • Blue: Good dependency constructed.
  • Orange: Fragmented dependency constructed.
  • Red: Cyclic.

Normalizing

When you annotate an item, the old unannotated entry for that item in the database is not deleted; instead, the database is augmented with another entry recording the updated information about that item, along with a version indicator showing that the annotated entry is more recent than the original one. This version information is not dynamically queried when you impose conditions, however, so to make it usable you have to periodically "normalize" the database.

You normalize by selecting Trees | Normalize and giving a name for the new normalized database (the old one will not be overwritten). This step should not be too time-consuming as long as your database has fewer than 3000 items in it (recommended). In Hinoki, we find that a database with 2000 items and a maximum of 5,000 results is quite slow, taking several hours (2005-03-25).

NOTE: Remember to set Options | TSQL Condition to no condition; otherwise only some trees will be normalized.
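The idea behind normalizing can be sketched as keeping only the most recent entry per item. This is one plausible reading of the description above, not the itsdb implementation: the `(item_id, version, annotation)` layout and the `normalize` function are hypothetical stand-ins for the real database relations.

```python
def normalize(entries):
    """Keep only the most recent entry for each item.

    `entries` is a list of (item_id, version, annotation) tuples; this
    layout is a hypothetical stand-in, not the actual itsdb relations.
    """
    latest = {}
    for item_id, version, annotation in entries:
        if item_id not in latest or version > latest[item_id][0]:
            latest[item_id] = (version, annotation)
    # Drop the version numbers: the result holds one annotation per item.
    return {item_id: annotation
            for item_id, (version, annotation) in latest.items()}

entries = [
    (1, 0, None),        # original, unannotated entry for item 1
    (2, 0, None),        # item 2 was never annotated
    (1, 1, "tree 3"),    # added when item 1 was annotated
    (1, 2, "tree 5"),    # added when item 1 was re-annotated
]
normalized = normalize(entries)
```

As in the real tool, the input is left untouched and the result is a new, smaller collection.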

Thinning Normalizing

This saves only the results for good trees, making a much smaller profile.

It is possible to save MRSs for the treebanked sentences by setting (setf tsdb::*redwoods-semantix-hook* "mrs::mrs-get-string").

Clear Cutting

If for some reason you wish to delete all the trees (e.g., you made a false start with the update process for Run 2), you can (verrrrryyyy carefully) discard the new annotations by selecting Trees | Clear Cut. Be certain (as in positive) that you are removing the annotations for Run 2, not the hand-coded real annotations you constructed painstakingly for Run 1, as clear cutting fells all the trees.

Updating

One of the defining properties of the Redwoods treebanks is that they are dynamic: the treebank can be updated when the grammar changes.

Because the discriminants are saved for each parse forest, even when the grammar changes, re-annotation is only necessary in cases where either the parse has become more ambiguous, so new decisions have to be made, or changes in rules or lexical items have made the parse so different that the earlier discriminants are not applicable.

Updating is a two-step process: Fully Automatic (which will annotate all trees that are uniquely determined) and Interactive (which will present the annotator with any new decisions that need to be made).

Fully Automatic Update

  1. Select the gold standard profile (middle button) [something.n]

  2. Select the target profile (Left button) [something/grammar]

  3. Load the same grammar as the target profile [(rsa "japanese")]

  4. Set Trees | Switches | Automatic Update, and nothing else.

  5. Select Trees | Update

  6. Wait for a tree annotation window to pop up ...

  7. The updates are color coded:

    • Magenta: A single correct parse was found
    • Blue: There was only one parse but it was not one that was determined by the gold annotations
    • Black: There are still remaining ambiguities

You do your annotation on Run 1 of some test suite TS with version A of the grammar. Then change the grammar to produce version B. Create a new instance of the test suite TS for Run 2, and Process | All Items using grammar version B. Next, select Run 1 as your gold standard (middle click), select Run 2 as the current database (left click), make sure version B of the grammar is loaded into the LKB, make sure that Trees | Switches | Automatic Update is selected, and then select Trees | Update. This will cause the tree annotation window to appear, and begin zooming through your items, incorporating all annotations that it can from the original treebank in Run 1, and adding those annotations to Run 2.

This will give all the sentences that satisfy the update-match-p() predicate (defined in lkb/src/tsdb/lisp/redwoods.lisp). By default, these are inputs with more than one reading for which the recorded discriminants fully disambiguate, and inputs with only one reading where that reading is the same as before.

  ;; during updates, a `save' match is indicated by the following conditions:
  ;;
  ;;   - the current item has not been tree annotated already;
  ;;   - the number of active trees in the current set equals the number of
  ;;     active trees in the gold set;
  ;;   - either the current item has more than one reading, or that single one
  ;;     reading has the exact same derivation as the preferred tree from the
  ;;     gold set.
  ;;   - also, when in `exact-match' update mode, be content if there is one
  ;;     unique result.
  ;;

Note that "has not been tree annotated" does not mean the same as "unannotated". The former means the item has not been touched before (e.g. there is no entry in the tree file); the latter means that it has been touched (e.g. there is an entry in the tree file with the value -1).
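The conditions listed in the comment can be paraphrased as a small predicate. This is a sketch of one possible reading, not a port of update-match-p(): the `Item` fields and the `save_match` function are hypothetical, and in particular the way the exact-match clause combines with the other conditions is an assumption, since the comment does not spell it out.

```python
from dataclasses import dataclass

@dataclass
class Item:
    """Hypothetical summary of one item during an update (not the tsdb schema)."""
    already_annotated: bool   # an entry for this item exists in the tree file
    active_current: int       # active trees left in the current (target) set
    active_gold: int          # active trees in the gold set
    readings: int             # readings of the item in the current profile
    same_derivation: bool     # the single reading matches the gold preferred tree

def save_match(item, exact_match=False):
    """Return True if the item counts as a `save' match under this reading."""
    if item.already_annotated:
        return False
    # Assumption: in exact-match mode one unique result is enough on its own.
    if exact_match and item.active_current == 1:
        return True
    if item.active_current != item.active_gold:
        return False
    return item.readings > 1 or item.same_derivation
```

For example, `save_match(Item(False, 1, 1, 4, False))` is true (four readings, fully disambiguated as in the gold set), while `save_match(Item(False, 1, 1, 1, False))` is false because the only reading differs from the gold tree.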

Interactive Update

Treebank only those places that have changed

  1. Select the gold standard profile (middle button) something.n

  2. Select the target profile (Left button) something/grammar

  3. Load the same grammar as the target profile (rsa "japanese")

  4. Unset Trees | Switches | Automatic Update

  5. Set Options | TSQL Condition | Unannotated

  6. Select Trees | Update

In this stage, the annotator can annotate any trees that have changed, exploiting any relevant existing decisions.
