Yonatan Bisk edited this page Dec 5, 2015 · 4 revisions

The README contains nearly all of the information needed to run the code. Additional details are available in the code's documentation. Frequently asked questions follow:

What output files are generated?

The Folder= option in the configuration specifies where output is stored. By default, after every round of convergence the model prints a human-readable version of the generative distributions it uses. Lexical distributions are also written in a second version, with the suffix .lex.gz, which contains the conditional distributions. Other generated files include the grammar (Grammar.gz), the induced lexicon (Lexicon.gz), the serialized models (model#), and any output from testing (Test.#.#.JSON.gz).
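The .gz suffix suggests these files are gzip-compressed. As a minimal sketch for peeking at one of them, assuming the contents are UTF-8 text (the helper name `read_gz_lines` is hypothetical, not part of the code base):

```python
import gzip
from itertools import islice

def read_gz_lines(path, limit=10):
    """Return up to `limit` lines from a gzip-compressed text file."""
    # Assumption: the file is gzip-compressed UTF-8 text.
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in islice(handle, limit)]
```

For example, `read_gz_lines("output/Lexicon.gz")` would show the first ten lines of the lexicon if Folder= were set to a directory named output (an illustrative path).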

Can I turn off printing human readable distributions?

Yes. To turn off this verbose printing, set the configuration flag printModelsVerbose=False in your configuration.

UIUC Instructions

Load the Java and Maven modules (these can be added to your .bash_profile):

module load sun-jdk/1.8.0
module load apache-maven/3.0.5

If you have not registered your SSH keys with Bitbucket, unset SSH_ASKPASS so that the terminal prompts for your password:

unset SSH_ASKPASS

What do all the parameters mean?

Default Configuration File

Some Data Formats (in addition to JSON defaults)

CoNLL Shared Task

Index   word      lemma      Coarse    Fine    Feats                      Head    Label 
1       Afirmó    afirmar    v         vm      num=s|per=3|mod=i|tmp=s    0       ROOT

NAACL Shared Task

Index   word      lemma      Coarse    Fine    UNIVERSAL    Feats                      Head    Label
1       Afirmó    afirmar    v         vm      VERB         num=s|per=3|mod=i|tmp=s    0       ROOT
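The two layouts differ only in the extra UNIVERSAL column, so one reader can handle both by counting columns. The following is a rough sketch, not the project's own reader: the names `Token`, `parse_feats`, and `parse_token` are hypothetical, and it assumes whitespace-separated columns, words that contain no spaces, and `_` marking an empty Feats field.

```python
from typing import NamedTuple, Optional

class Token(NamedTuple):
    index: int
    word: str
    lemma: str
    coarse: str
    fine: str
    universal: Optional[str]  # only present in the 9-column NAACL layout
    feats: dict
    head: int
    label: str

def parse_feats(field):
    """Split 'num=s|per=3' into {'num': 's', 'per': '3'}."""
    # Assumption: '_' (or an empty string) means "no features".
    if field in ("", "_"):
        return {}
    return dict(item.split("=", 1) for item in field.split("|") if "=" in item)

def parse_token(line):
    """Parse one token line in either the 8- or the 9-column layout."""
    cols = line.split()
    if len(cols) == 8:    # CoNLL layout: no UNIVERSAL column
        idx, word, lemma, coarse, fine, feats, head, label = cols
        universal = None
    elif len(cols) == 9:  # NAACL layout: UNIVERSAL precedes Feats
        idx, word, lemma, coarse, fine, universal, feats, head, label = cols
    else:
        raise ValueError(f"expected 8 or 9 columns, got {len(cols)}")
    return Token(int(idx), word, lemma, coarse, fine, universal,
                 parse_feats(feats), int(head), label)
```

A head of 0 with the label ROOT, as in the example rows, marks the root of the sentence.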

Universal tagset mappings for some languages are available at www.YonatanBisk.com/Thesis

Tagset

https://github.com/ybisk/CCG-Induction/blob/master/src/main/resources/english.pos.map

English   mapping       Tag Type
.         punct         Period
,         punct conj    Comma
CC        conj          Coordinating Conjunction
JJ                      Adjective
VBD       verb          Verb, past tense
VBG       verb          Verb, gerund

Roles (e.g. punct, conj, verb) are used by Induction to denote special restrictions.

CCGBank

PARG: CCG-style dependencies

SRC    TAR      CAT           Arg Index   SRC word   TAR word
<s> 3
2      0        S[frg]/NP     1           year       Not
2      1        NP[nb]/N      1           year       this
<\s>
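As an illustration of how the PARG block above could be read, here is a small sketch. It assumes each sentence opens with a line starting with `<s`, closes with `<\s>`, and that the first six whitespace-separated columns follow the header shown (any further columns are ignored). The names `Dependency` and `parse_parg` are hypothetical.

```python
from typing import NamedTuple

class Dependency(NamedTuple):
    src: int
    tar: int
    cat: str
    arg_index: int
    src_word: str
    tar_word: str

def parse_parg(lines):
    """Group dependency rows into one list per sentence."""
    sentences, current = [], None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("<\\s>"):   # closing marker, written <\s> in the file
            if current is not None:
                sentences.append(current)
            current = None
        elif line.startswith("<s"):    # opening marker, e.g. "<s> 3"
            current = []
        elif current is not None:
            cols = line.split()
            current.append(Dependency(int(cols[0]), int(cols[1]), cols[2],
                                      int(cols[3]), cols[4], cols[5]))
    return sentences
```

Rows that appear outside an `<s>` block (such as a column header line) are skipped rather than treated as errors.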

AUTO: a bracketed parse (we assume these are collapsed to a single line):

(<T S[frg] 0 2>
    (<T S[frg] 0 2>
        (<L S[frg]/NP RB RB Not S[frg]/NP_158>)
        (<T NP 1 2>
            (<L NP[nb]/N DT DT this NP[nb]_165/N_165>)
            (<L N NN NN year N>)
        )
    )
    (<L . . . . .>)
)
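To show how the leaves of such a parse can be pulled out of a single collapsed line, here is a regex-based sketch. It assumes every leaf has the five-field shape seen above, `(<L category POS POS word indexed-category>)`, with no spaces inside a field. `Leaf`, `LEAF_PATTERN`, and `leaves` are hypothetical names.

```python
import re
from typing import NamedTuple

class Leaf(NamedTuple):
    category: str
    pos: str
    word: str

# Five space-separated fields inside "(<L ... >)"; the third and fifth
# (second POS tag, indexed category) are matched but not kept here.
LEAF_PATTERN = re.compile(r"\(<L (\S+) (\S+) (\S+) (\S+) (\S+)>\)")

def leaves(auto_line):
    """Return the (category, POS, word) triples of an AUTO parse, left to right."""
    return [Leaf(m.group(1), m.group(2), m.group(4))
            for m in LEAF_PATTERN.finditer(auto_line)]
```

This only recovers the lexical items and their categories. Rebuilding the tree from the `<T ...>` nodes would need a proper bracket parser.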

Speeding Java Up and Parallelism

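    -Xmx20g                         -- Lets the heap grow to at most 20 GB;
                                       set this below the machine's total memory
    -XX:+UseParallelGC              -- Uses the parallel (multi-threaded) garbage collector
    -XX:ParallelGCThreads=2         -- Sets the number of garbage collection threads
    -server                         -- Selects the server VM, which optimizes more aggressively (loops, etc.)
    -XX:+UseFastAccessorMethods     -- Optimizes simple accessor (getter) methods; deprecated in newer JDKs
    -XX:+AggressiveOpts             -- Enables experimental performance optimizations; deprecated in newer JDKs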