
Language compiler

Jos Denys edited this page Nov 10, 2020 · 3 revisions

The Language Compiler

Context

The iKnow engine relies on Language Models (also known as Knowledge Bases, or simply "KBs") for its language-specific parsing of sentences. A KB's source is expressed as a set of CSV files. These files do not contain code as such; they capture the linguistic tokens, rules and other metadata specific to a language, plus some comments and sample sentences. They are maintained under /kb in a human-readable (and editable) source format, usually with a simple text editor such as Notepad++.

When accumulated language model edits to the files in /kb represent a comprehensive update, it's time to compile them ahead of a full iKnow engine build. Compiling the language models means transforming them from CSV format into a collection of artefacts the iKnow engine can use at runtime:

  • Data in lexrep.csv is compiled into a C++ state machine that ends up in .inl files in /modules/aho/inl/<language>/lexrep/, enabling high-performance matching of input text to the linguistic tokens on which iKnow bases its parsing
  • Data in the other csv files gets loaded as shared memory dumps to enable efficient runtime loading
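To illustrate what "compiling tokens into a state machine" means, here is a minimal sketch of an Aho-Corasick matcher, which is the technique the /modules/aho directory name suggests. It is not the iKnow implementation: the function names (`build_automaton`, `find_tokens`) and the sample tokens are made up for this example, and the real compiler emits C++ tables into .inl files rather than building Python lists at runtime.

```python
from collections import deque


def build_automaton(tokens):
    """Build goto/fail/output tables for a set of token strings (hypothetical helper)."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state to fall back to on a mismatch
    out = [[]]    # out[state] -> tokens that end when this state is reached

    # Phase 1: insert every token into a trie.
    for token in tokens:
        state = 0
        for ch in token:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(token)

    # Phase 2: breadth-first pass to compute failure links.
    # States directly under the root keep fail == 0.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            fallback = fail[state]
            while fallback and ch not in goto[fallback]:
                fallback = fail[fallback]
            fail[nxt] = goto[fallback].get(ch, 0)
            # Inherit the matches of the fallback state (shorter suffixes).
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_tokens(text, automaton):
    """Return (start_offset, token) for every token occurrence in text."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for token in out[state]:
            matches.append((pos - len(token) + 1, token))
    return matches


if __name__ == "__main__":
    automaton = build_automaton(["he", "she", "his", "hers"])
    print(find_tokens("ushers", automaton))
```

The point of precomputing the tables is that the input text is scanned once, whatever the number of tokens, which is why it pays off to do this work at compile time rather than when the engine loads.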

Compilation steps

If you changed any of the source .csv files, the iKnowLanguageCompiler project takes care of transforming them into those runtime formats. If you did not, you can skip this step. Since the language compiler relies on common parts, you'll need to build the iKnowEngineTest program as described below before building iKnowLanguageCompiler. This should result in a new executable:

<repo_root>\kit\x64\(Debug|Release)\bin\iKnowLanguageCompiler(.exe)

Open a command window, change directory to <repo_root>\kit\x64\(Debug|Release)\bin\, and run the program with the code of the language you want to build (e.g. iKnowLanguageCompiler.exe en to build the English language model). If no language parameter is supplied, all language models are rebuilt. After the build process, you must rebuild the test program to pick up the new language models.
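If you prefer to script this step, the following is a minimal sketch that assembles the same command line and runs it from the bin directory. The helper names (`compiler_command`, `run_compiler`) are hypothetical, and choosing the ".exe" suffix based on the platform is an assumption derived from the "(.exe)" in the path above; the only argument used is the optional language code described in this section.

```python
import subprocess
import sys
from pathlib import Path


def compiler_command(repo_root, config="Release", language=None):
    """Return (bin_dir, argv) for the language compiler (hypothetical helper).

    config is "Debug" or "Release"; language is a code such as "en",
    or None to rebuild all language models.
    """
    bin_dir = Path(repo_root) / "kit" / "x64" / config / "bin"
    # Assumption: the executable carries the .exe suffix on Windows only.
    exe_name = "iKnowLanguageCompiler"
    if sys.platform == "win32":
        exe_name += ".exe"
    argv = [str(bin_dir / exe_name)]
    if language:
        argv.append(language)
    return bin_dir, argv


def run_compiler(repo_root, config="Release", language=None):
    """Run the compiler with its bin directory as the working directory."""
    bin_dir, argv = compiler_command(repo_root, config, language)
    # check=True raises CalledProcessError if the compiler exits non-zero.
    return subprocess.run(argv, cwd=bin_dir, check=True)


if __name__ == "__main__":
    # Example: rebuild only the English model from a Release build.
    # Replace "." with the path to your checkout.
    run_compiler(".", config="Release", language="en")
```

Remember that the test program still has to be rebuilt afterwards; this sketch only covers the compiler invocation itself.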

Inputs and outputs

It is important to understand the inputs and outputs of this process. The input consists of a collection of CSV files, representing the language model as assembled by a qualified linguist:

  • <repo_root>\language_models\(cs|de|en|es|fr|ja|nl|pt|ru|sv|uk)\

    Each language directory contains up to eight CSV files: "acro", "filter", "labels", "lexreps", "metadata", "prepro", "regex" and "rules". See here for a detailed description. These files are the input for the language compiler.

  • <repo_root>\modules\engine\language_data\

    This directory contains, per language, the binary representation of the linguistic data in the form of a header file (kb_<language>_data.h). This is output generated by the language compiler. Do not edit it!

  • <repo_root>\modules\aho\inl\(cs|de|en|es|fr|ja|nl|pt|ru|sv|uk)\

    This is where, per language, the AHO state machine data is written. This is also output of the language compilation process. Do not edit it!
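Before compiling, it can be handy to check which of the input files listed above are present for a given language. The sketch below does that under one loud assumption: it treats any .csv file whose name contains one of the quoted names (for example a hypothetical "en_rules.csv") as that input, because the exact file naming scheme is not spelled out here. The helper names are made up for this example.

```python
from pathlib import Path

# The eight input names quoted in the list above.
EXPECTED_STEMS = (
    "acro", "filter", "labels", "lexreps",
    "metadata", "prepro", "regex", "rules",
)


def present_inputs(language_dir):
    """Map each expected name to the CSV files whose file name contains it.

    Assumption: matching is by substring, since the real naming scheme
    is not documented in this section.
    """
    csv_names = sorted(p.name for p in Path(language_dir).glob("*.csv"))
    return {
        stem: [name for name in csv_names if stem in name.lower()]
        for stem in EXPECTED_STEMS
    }


def missing_inputs(language_dir):
    """Return the expected names for which no CSV file was found."""
    found = present_inputs(language_dir)
    return [stem for stem, names in found.items() if not names]


if __name__ == "__main__":
    # Replace with <repo_root>/language_models/<language> in your checkout.
    for stem in missing_inputs("."):
        print("no CSV file found for:", stem)
```

A missing file is not necessarily an error, since a language directory may contain fewer than eight files; the check only makes the gap visible.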

The language compiler must be run from its /bin directory. It knows the input and output directories, so no configuration is needed. If you would like to change these directories, you'll have to edit the source code. After rebuilding language model data, the language module itself must be rebuilt as well, since this binary data is hard-coded into it for maximum speed.