This repository contains resources developed within the following paper:
Xinshi Lin and Wai Lam. “Entity Retrieval via Type Taxonomy Aware Smoothing”, ECIR 2018
- Collect data from DBpedia and store it in a MongoDB database (see https://github.com/linxinshi/DBpedia-Wikipedia-Toolkit).
- Build a graph representation of the Wikipedia Category System (see folder "wikipedia_category_system").
- Build the index (see folder "build_index").
- Edit config.py, config_object.py and mongo_object.py to specify the parameters of the retrieval models, the index path, etc.
- Execute the command "python main.py".
- Check the results in the folder Retrieval_results (created by the program, with the output named after the time of execution).
*This implementation supports multi-processing: specify NUM_PROCESS in config.py. The program splits the queries into several parts and each process handles one of them. Finally, the program merges all results and outputs a single complete result.
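The split/process/merge pattern described in the multi-processing note can be sketched as follows. This is a minimal illustration, not the code in main.py: `split_queries`, `retrieve_part`, `merge_results` and `run` are hypothetical names, and the worker only returns placeholder strings instead of running a retrieval model.

```python
from multiprocessing import Pool


def split_queries(queries, num_process):
    """Split `queries` into at most `num_process` contiguous, near-equal parts."""
    n = max(1, min(num_process, len(queries)))
    size, extra = divmod(len(queries), n)
    parts, start = [], 0
    for i in range(n):
        # The first `extra` parts take one additional query each.
        end = start + size + (1 if i < extra else 0)
        parts.append(queries[start:end])
        start = end
    return parts


def retrieve_part(part):
    """Hypothetical worker: one (query_id, result) pair per query in the part."""
    return [(qid, "results for " + text) for qid, text in part]


def merge_results(per_process_results):
    """Concatenate the per-process result lists, preserving query order."""
    return [row for part in per_process_results for row in part]


def run(queries, num_process):
    """Score each part in its own process, then merge into one result list."""
    parts = split_queries(queries, num_process)
    with Pool(processes=len(parts)) as pool:
        return merge_results(pool.map(retrieve_part, parts))
```

A call such as `run(queries, NUM_PROCESS)` should sit under an `if __name__ == "__main__":` guard, which `multiprocessing` requires on Windows.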
Python 3.4+
NLTK, Gensim
NetworkX <= 1.11
PyLucene 6.x
(This implementation works on both Linux and Windows. If you run into PyLucene installation issues on Windows, please refer to http://lxsay.com/archives/365)
- The TAS approach is implemented recursively with backtracking to speed up computation (see the functions lm_sas() and mlmSas() in lib_metric.py).
- In our own experience, the TAS approach is most effective when it helps retrieval models that score against the single catchall field. Replacing the normalizing weight (1-alpha)/(1-alpha^{k}) with a small weight between 0 and 1 (e.g. 1/300) may give more stable performance on verbose queries such as natural language questions.
- The quality of the index greatly affects performance. Beyond that, the parameter alpha and the normalizing weight may affect performance somewhat (about ±10%).
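To make the role of the normalizing weight concrete, here is a minimal sketch. It assumes the smoothing weights form a geometric series alpha^i over k levels, which is what the normalizer (1-alpha)/(1-alpha^{k}) suggests, since the terms alpha^0 + ... + alpha^{k-1} sum to (1-alpha^{k})/(1-alpha). The function name `level_weights` and its `norm` parameter are hypothetical; see lib_metric.py for how the weights are actually applied.

```python
def level_weights(alpha, k, norm=None):
    """Return k weights norm * alpha**i for i = 0..k-1.

    With norm=None the normalizer (1 - alpha) / (1 - alpha**k) is used,
    so the weights sum to 1. Passing a fixed small value (e.g. 1/300)
    mimics replacing the normalizer by a constant. This geometric form
    is an assumption made for illustration.
    """
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if k < 1:
        raise ValueError("k must be at least 1")
    if norm is None:
        norm = (1 - alpha) / (1 - alpha ** k)
    return [norm * alpha ** i for i in range(k)]


normalized = level_weights(0.5, 4)           # sums to 1
fixed = level_weights(0.5, 4, norm=1 / 300)  # much smaller total mass
```

With the fixed weight, the total mass no longer sums to 1 and does not depend on k in the same way, which is one plausible reason the two settings behave differently on long queries.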
- If you want to improve practical performance or reproduce the exact TAS results reported in the paper, the following small tricks might be helpful:
  - Apply TAS only to entities that have a positive term frequency for the given query term.
  - For entities that have no categories, use the original version of the model (i.e. no TAS) to score them. (already implemented)

  Results reported in the paper rely little on these tricks.
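The two tricks amount to a guard in front of the TAS scorer. The sketch below is illustrative only: `score_term`, `plain_score` and `tas_score` are hypothetical names, and the two scorers are passed in as zero-argument callables rather than taken from lib_metric.py.

```python
def score_term(tf, categories, plain_score, tas_score):
    """Choose the scorer for one (entity, query term) pair.

    tf          -- term frequency of the query term in the entity
    categories  -- the entity's categories (may be empty)
    plain_score -- callable returning the original model's score (no TAS)
    tas_score   -- callable returning the TAS-smoothed score
    """
    # Use TAS only when the term occurs in the entity and the entity
    # has at least one category; otherwise fall back to the original model.
    if tf > 0 and categories:
        return tas_score()
    return plain_score()
```

Passing the scorers as callables means the more expensive TAS computation is skipped entirely whenever the guard fails.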
- We are currently trying some variants that perform better.
FSDM: https://github.com/teanalab/FieldedSDM
PFSDM: https://github.com/teanalab/pfsdm
FSDM+ELR: https://github.com/hasibi/EntityLinkingRetrieval-ELR
DBpedia-Entity Test Collection: https://iai-group.github.io/DBpedia-Entity/
Xinshi Lin (xslin@se.cuhk.edu.hk)
Creative Commons