Language compiler
The iKnow engine relies on Language Models (also known as Knowledge Bases or simply "KB") for its language-specific parsing of sentences. A KB's source is expressed as a set of CSV files, which do not contain outright code but capture linguistic tokens, rules and other metadata specific to a language, plus some comments and sample sentences. These files are maintained under /kb in a human-readable (and editable) source format, usually through simple text editors like Notepad++.
When accumulated language model edits to the files in /kb represent a comprehensive update, it's time to compile them ahead of a full iKnow engine build. Compiling the language models means transforming them from CSV format into a collection of artefacts the iKnow engine can use at runtime:
- Data in lexrep.csv is compiled into a C++ state machine that ends up in .inl files in /modules/aho/inl/<language>/lexrep/, enabling high-performance matching of input text to the linguistic tokens on which iKnow bases its parsing
- Data in the other csv files gets loaded as shared memory dumps to enable efficient runtime loading
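To give a feel for what a token-matching state machine does, here is a minimal Python sketch of Aho-Corasick style multi-pattern matching. It is an illustration only: it assumes the "aho" in the module path refers to this family of automata, and it is not the C++ code that the language compiler generates. The names build_automaton and find_tokens and the sample tokens are hypothetical.

```python
from collections import deque


def build_automaton(tokens):
    """Build goto/fail/output tables for a set of token strings."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # state to fall back to on a mismatch
    out = [[]]    # tokens recognised on entering a state
    for token in tokens:
        state = 0
        for ch in token:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(token)
    # Breadth-first pass: compute failure links level by level.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # A state also reports every token its fallback state reports.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_tokens(text, automaton):
    """Return (start_index, token) pairs for every token occurrence in text."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for token in out[state]:
            hits.append((i - len(token) + 1, token))
    return hits
```

The input text is scanned once, regardless of how many tokens are in the table, which is the property that makes a precompiled state machine attractive for large token lists.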
If you made any changes to the source .csv files, the iKnowLanguageCompiler project takes care of transforming them into those runtime formats. If you haven't, you can skip this step. Since the language compiler relies on common parts, you'll need to build the iKnowEngineTest program as described below before building iKnowLanguageCompiler.
This should result in a new executable:
<repo_root>\kit\x64\(Debug|Release)\bin\iKnowLanguageCompiler(.exe)
Open a command window, change directory to <repo_root>\kit\x64\(Debug|Release)\bin\, and run the program with the requested language code (e.g. iKnowLanguageCompiler.exe en to build the English language model). If no language parameter is supplied, all language models will be rebuilt. After the build process, you must rebuild the test program to pick up the new language models.
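If you prefer to drive this step from a script, the invocation above can be sketched in Python as follows. This is a minimal sketch, not part of the iKnow tooling: the helper names compiler_command and run_compiler are hypothetical, and the only things taken from this page are the bin directory layout, the executable name and the optional language argument.

```python
import subprocess
from pathlib import Path


def compiler_command(repo_root, config="Release", language=None, windows=True):
    """Return (bin_dir, argv) for one language compiler run.

    Hypothetical helper: it only assembles the path and arguments
    described on this page and does not check that they exist.
    """
    bin_dir = Path(repo_root) / "kit" / "x64" / config / "bin"
    exe = "iKnowLanguageCompiler.exe" if windows else "iKnowLanguageCompiler"
    argv = [str(bin_dir / exe)]
    if language:
        # No language argument means: rebuild all language models.
        argv.append(language)
    return bin_dir, argv


def run_compiler(repo_root, config="Release", language=None, windows=True):
    """Run the compiler with its bin directory as the working directory."""
    bin_dir, argv = compiler_command(repo_root, config, language, windows)
    return subprocess.run(argv, cwd=bin_dir, check=True)
```

For example, run_compiler(r"C:\src\iknow", language="en") would rebuild only the English model, assuming the repository lives at that (made-up) location and the Release build of the compiler exists.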
It is important to understand the input and output of this process. The input consists of a collection of CSV files, representing the language model as assembled by a qualified linguist:
- <repo_root>\language_models\(cs|de|en|es|fr|ja|nl|pt|ru|sv|uk)\
  Each language directory contains 8 (or fewer) CSV files: "acro", "filter", "labels", "lexreps", "metadata", "prepro", "regex" and "rules". See here for a detailed description. These files are the input for the language model builder.
- <repo_root>\modules\engine\language_data\
  This directory contains, per language, the binary representation of the linguistic data, in the form of a header file (kb_<language>_data.h). This is output generated by the language compiler: do not edit!
- <repo_root>\modules\aho\inl\(cs|de|en|es|fr|ja|nl|pt|ru|sv|uk)\
  This is where, per language, the AHO state machine data is written. This is also output of the language compilation process: do not edit!
The language compiler must be run from its /bin directory. It knows the input and output directories, so there is no need for any configuration. If you would like to change these, you'll have to edit the source code. After rebuilding language model data, a new build of the language module itself is needed, since this binary data is hard-coded for maximum speed.
For Visual Studio (Windows), a handy batch script is available: lang_update.bat
It will compile the language model, rebuild the engine, and run the Python setup, so you can immediately test your changes. The file must be copied to the <git>\iknow\kit\x64\Release\bin directory. Open a Python command window and change your directory accordingly. Running the batch file without arguments will compile all languages; running it with a language parameter will compile only that language: lang_update en
This will recompile the English model.
Beware: there is no (make) dependency check. Running the command will recompile even if nothing has changed!
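If the unconditional recompile becomes a nuisance, a make-style freshness test can be approximated outside the batch script. The sketch below is an assumption-laden illustration, not something lang_update.bat does: the function name needs_recompile is hypothetical, it uses the language_models input directory and the kb_<language>_data.h output header described above, and it treats that header's timestamp as representative of the last compile.

```python
from pathlib import Path


def needs_recompile(repo_root, language):
    """Return True if any source CSV is newer than the generated header.

    Hypothetical helper: compares modification times only, the way a
    simple make rule would, and ignores the AHO .inl outputs.
    """
    root = Path(repo_root)
    sources = (root / "language_models" / language).glob("*.csv")
    header = (root / "modules" / "engine" / "language_data"
              / f"kb_{language}_data.h")
    if not header.exists():
        # Never compiled (or output removed): a build is required.
        return True
    newest_source = max((p.stat().st_mtime for p in sources), default=0.0)
    return newest_source > header.stat().st_mtime
```

A wrapper could call needs_recompile(repo_root, "en") and only invoke lang_update en when it returns True. Timestamp comparison is coarse: a fresh checkout or a file copy can change modification times without changing content.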
If the environment settings for Visual Studio are not set, MSBuild will fail to run. If that is the case, change the MSBuild line in the batch script to:
"C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\MSBuild\Current\Bin\MSBuild" ..\..\..\..\modules\iKnowEngine.sln -p:Configuration="Release" -p:Platform="x64" -maxcpucount