This repository contains the source code, scripts, and tutorials related to the paper "Evaluation of Incremental Entity Extraction with Background Knowledge and Entity Linking".
- Riccardo Pozzi riccardo.pozzi@unimib.it
- Federico Moiraghi federico.moiraghimotta@unimib.it
- Matteo Palmonari matteo.palmonari@unimib.it
- Fausto Lodi fausto.lodi@unimib.it
Special thanks to Lorenzo Sasso (l.sasso2@campus.unimib.it) and Simone Monti (s.monti21@campus.unimib.it).
Riccardo Pozzi, Federico Moiraghi, Fausto Lodi, and Matteo Palmonari. 2023. Evaluation of Incremental Entity Extraction with Background Knowledge and Entity Linking. In Proceedings of the 11th International Joint Conference on Knowledge Graphs (IJCKG '22). Association for Computing Machinery, New York, NY, USA, 30–38. https://doi.org/10.1145/3579051.3579063
The Incremental Dataset is created starting from the WikilinksNED Unseen-Mentions dataset, available at https://github.com/yasumasaonoe/ET4EL, by Yasumasa Onoe and Greg Durrett.
The biencoder and indexer services are derived from https://github.com/facebookresearch/BLINK by Facebook, Inc. and its affiliates (see BLINK-LICENSE).
The nilcluster service is derived from https://github.com/montis96/EntityClustering by Simone Monti.
Tested on GNU/Linux.
The pipeline requires about:
- ~33G RAM (the index of 6M Wikipedia entities is kept in memory)
- ~60G disk (index, entity information, models)
- a GPU is not mandatory but recommended (at least for running the pipeline on an entire dataset)
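Before downloading the models, it can be useful to check the machine against these figures. The following is a minimal sketch, not part of the repository: the function name `check_resources` is hypothetical, the thresholds are the approximate values listed above interpreted as GiB (an assumption), and the RAM check relies on `os.sysconf` names that are available on GNU/Linux.

```python
import os
import shutil

GIB = 1024 ** 3


def check_resources(path=".", min_ram_gib=33, min_disk_gib=60):
    """Compare total RAM and free disk space against rough minimums."""
    try:
        # Works on GNU/Linux; other platforms may not expose these names.
        total_ram = os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
        ram_ok = total_ram >= min_ram_gib * GIB
    except (AttributeError, ValueError, OSError):
        ram_ok = None  # unknown on this platform
    free_disk = shutil.disk_usage(path).free
    return {"ram_ok": ram_ok, "disk_ok": free_disk >= min_disk_gib * GIB}


print(check_resources("."))
```

Total RAM is only a rough proxy: other processes on the machine also consume memory.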
.
├── incremental_dataset # the dataset (not in git)
│ ├── delete_nil_entities.sql
│ ├── dev
│ ├── statistics
│ ├── test
│ └── train
├── notebooks # interactive notebooks
│ ├── create_dataset.ipynb
│ ├── create_dataset.Rmd
│ ├── try_pipeline.ipynb
│ └── try_pipeline.Rmd
├── pipeline # pipeline services
│ ├── biencoder
│ ├── docker-compose.yml
│ ├── env-sample.txt
│ ├── indexer
│ ├── models
│ ├── nilcluster
│ ├── nilpredictor
│ └── postgres
├── README.md
├── requirements.txt # python requirements for notebooks and scripts
└── scripts # scripts
  ├── eval_kbp.py
  ├── feature_ablation_study.py
  └── postgres_populate_entities.py
To run the notebooks and the scripts in this repo you can use a python3 environment (See https://docs.python.org/3/library/venv.html).
Tested on Python 3.8.10.
python3 -m venv venv
source venv/bin/activate # works with bash
pip install --upgrade pip # upgrade pip
pip install -r requirements.txt
Download the dataset or create it starting from Wikilinks Unseen-Mentions.
Download from here and extract it into the main folder of the project. You should see something like this:
.
├── incremental_dataset
│ ├── delete_nil_entities.sql
│ ├── dev
│ ├── statistics
│ ├── test
│ └── train
├── notebooks
│ ├── create_dataset.ipynb
├── pipeline
├── README.md
├── requirements.txt
└── scripts
Follow the notebook create_dataset.ipynb, then copy the dataset folder into the root directory of the project as shown in the previous directory structure.
Each component of the pipeline is deployed as a microservice exposing HTTP APIs. The services are:
- biencoder: uses the biencoder to encode mentions (or entities) into vectors.
- indexer: given a vector, it runs the (approximate) nearest neighbor algorithm to retrieve the best candidates for linking.
- nilpredictor: given a mention and its best candidates, it estimates whether the mention is NIL or the linking is correct.
- nilcluster: given a set of NIL mentions, it clusters together the ones referring to the same (not in the KB) entity.
The pipeline requires the following additional services:
- postgres database: it keeps the information about the entities (to avoid keeping them in memory).
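To make the flow between the first three services concrete, here is a toy, in-memory sketch. It does not call the actual services or reproduce their HTTP APIs: the two-dimensional vectors, the entity names, the exact (brute-force) inner-product search, and the fixed score threshold are all assumptions made for illustration. The real indexer uses an approximate index and the real nilpredictor uses a trained model. NIL clustering is left out.

```python
def top_k(query, entity_vectors, k=2):
    """Exact nearest neighbors by inner product (the real indexer is approximate)."""
    scored = [
        (sum(q * e for q, e in zip(query, vec)), name)
        for name, vec in entity_vectors.items()
    ]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:k]]


def is_nil(candidates, threshold=0.5):
    """Placeholder NIL rule: NIL when the best score is below a threshold."""
    return not candidates or candidates[0][1] < threshold


# Hypothetical 2-d "embeddings"; the real ones come from the biencoder service.
entities = {"Paris": [1.0, 0.0], "Rome": [0.0, 1.0]}
mentions = {"the French capital": [0.9, 0.1], "an unknown startup": [0.2, 0.2]}

for text, vector in mentions.items():
    candidates = top_k(vector, entities)
    print(text, "->", "NIL" if is_nil(candidates) else candidates[0][0])
```

The first mention scores high against one entity and is linked to it, while the second has no strong candidate and is treated as NIL.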
Docker (possibly with GPU support) and Compose are required. Follow these links to install them.
- Docker: https://docs.docker.com/get-docker/
- Compose: https://docs.docker.com/compose/install/
- Nvidia Docker: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#setting-up-nvidia-container-toolkit
Enter the pipeline folder, then create a folder named models there (the same folder as docker-compose.yml) if it does not exist.
We need to download these files and put them in the models directory:
- the biencoder model (from Meta Research):
- the index (from Meta Research):
- the information about the entities in the index (from Meta Research):
- the NIL prediction model:
  - the file nilp_bi_max_secondiff_model.pickle from here
Once downloaded, the models folder should look like this:
pipeline/models/
pipeline/models/biencoder_wiki_large.bin (2.5G)
pipeline/models/biencoder_wiki_large.json
pipeline/models/entity.jsonl (3.2G)
pipeline/models/faiss_hnsw_index.pkl (29G)
pipeline/models/nilp_bi_max_secondiff_model.pickle
Go back to the pipeline folder and copy the file env-sample.txt to .env, then edit the latter so that it fits your needs.
We need to populate the database with entity information (e.g. Wikipedia IDs, titles).
From inside the pipeline folder, start postgres by running
# you may need to use sudo
docker-compose up -d postgres
Now postgres should listen on tcp://127.0.0.1:5432.
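Instead of waiting a fixed number of seconds, one option is to poll the port until it accepts connections. This is a minimal sketch that is not part of the repository; `wait_for_port` is a hypothetical helper, and the default host and port are the ones mentioned above.

```python
import socket
import time


def wait_for_port(host="127.0.0.1", port=5432, timeout=30.0, interval=0.5):
    """Return True once a TCP connection succeeds, False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

An open port only shows that something is listening; postgres may still need a few moments to finish its initialization, so a short extra wait can still be prudent.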
Let postgres run for a few seconds to initialize itself, then go back to the main directory of the project and, with the python environment activated, run the population script as follows. If you changed the postgres password in the .env file, replace secret in the following command with the password you chose.
python scripts/postgres_populate_entities.py --postgres postgresql://postgres:secret@127.0.0.1:5432/postgres --table-name entities --entity_catalogue pipeline/models/entity.jsonl --indexer 10
At this point you can delete pipeline/models/entity.jsonl since the information is in the database.
If you created a different dataset (by changing random seeds), you should use the new sql command created by the notebook.
Run the sql query from the file incremental_dataset/delete_nil_entities.sql. You could use the following command from the pipeline folder:
# you may need sudo
docker-compose exec -T postgres psql -U postgres < ../incremental_dataset/delete_nil_entities.sql
In case you want to disable the GPU, see Without GPU.
Otherwise, ensure the GPU is enabled, or enable it by editing the JSON file models/biencoder_wiki_large.json and setting
no_cuda: false
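The same edit can be scripted. The sketch below is not part of the repository: `set_no_cuda` is a hypothetical helper, and it assumes that `no_cuda` is a top-level key of the JSON file.

```python
import json
from pathlib import Path


def set_no_cuda(config_path, no_cuda):
    """Rewrite the JSON config with `no_cuda` set, keeping all other keys."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    config["no_cuda"] = bool(no_cuda)
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config


# Example (run from the pipeline folder): enable the GPU.
# set_no_cuda("models/biencoder_wiki_large.json", False)
```

Passing `True` instead writes the value used for a CPU-only setup.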
From inside the pipeline folder run
# you may need sudo
docker-compose up -d
Use the notebook try_pipeline.
Ensure the pipeline is up, then run:
python scripts/eval_kbp.py --report evaluation_report_incremental.csv \
incremental_dataset/test/test_0.jsonl \
incremental_dataset/test/test_1.jsonl \
incremental_dataset/test/test_2.jsonl \
incremental_dataset/test/test_3.jsonl \
incremental_dataset/test/test_4.jsonl \
incremental_dataset/test/test_5.jsonl \
incremental_dataset/test/test_6.jsonl \
incremental_dataset/test/test_7.jsonl \
incremental_dataset/test/test_8.jsonl \
incremental_dataset/test/test_9.jsonl
Ensure the pipeline is up, then run:
python scripts/eval_kbp.py --no-incremental --report evaluation_report_onepass.csv \
incremental_dataset/test/test_0.jsonl \
incremental_dataset/test/test_1.jsonl \
incremental_dataset/test/test_2.jsonl \
incremental_dataset/test/test_3.jsonl \
incremental_dataset/test/test_4.jsonl \
incremental_dataset/test/test_5.jsonl \
incremental_dataset/test/test_6.jsonl \
incremental_dataset/test/test_7.jsonl \
incremental_dataset/test/test_8.jsonl \
incremental_dataset/test/test_9.jsonl
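Since both commands differ only in the `--no-incremental` flag and the report name, the argument list can also be generated. This is a sketch under the assumption that the test batches are named `test_0.jsonl` to `test_9.jsonl` as above; `eval_command` is a hypothetical helper and only uses flags that appear in the commands shown.

```python
import subprocess
import sys


def eval_command(report, n_batches=10, incremental=True):
    """Build the eval_kbp.py argument list for batches test_0 .. test_{n-1}."""
    cmd = [sys.executable, "scripts/eval_kbp.py"]
    if not incremental:
        cmd.append("--no-incremental")
    cmd += ["--report", report]
    cmd += [f"incremental_dataset/test/test_{i}.jsonl" for i in range(n_batches)]
    return cmd


# To actually run it from the project root (the pipeline must be up):
# subprocess.run(eval_command("evaluation_report_incremental.csv"), check=True)
```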
The report contains a line for each batch (plus a line with the average over all the batches) with these metrics:
batch: batch identifier
size: batch size
linking_recall@1: recall@k of the linking of not-NIL mentions
linking_recall@2:
linking_recall@3:
linking_recall@5:
linking_recall@10:
linking_recall@30:
linking_recall@100:
nil_prediction_cm: NIL prediction confusion matrix
nil_prediction_cm_normalized: " normalized
nil_prediction_mitigated_cm: NIL prediction mitigated (correct when a linking error is NIL)
nil_prediction_mitigated_cm_normalized:
nil_clustering_bcubed_precision: NIL clustering bcubed precision
nil_clustering_bcubed_recall: " recall
overall_to_link_correct: linked_correctly / to_link
should_be_nil_correct: number of correct nil
should_be_nil_total: expected correct nil
should_be_nil_correct_normalized: correct_nil / expected
should_be_linked_to_prev_added_correct: number of mentions correctly linked to entities added from previous clusters
should_be_linked_to_prev_added_total: expected number of mentions to link to prev added entities
should_be_linked_to_prev_added_correct_normalized: " normalized
overall_correct: correct_predictions end-to-end
overall_accuracy: " normalized
NIL--precision: NIL prediction precision of the NIL class
NIL--recall: " recall
NIL--f1-score:
NIL--support:
not-NIL--precision: " of the not-NIL class
not-NIL--recall:
not-NIL--f1-score:
not-NIL--support:
NIL-mitigated-precision: " mitigated (correct when a linking error is NIL)
NIL-mitigated-recall:
NIL-mitigated-f1-score:
NIL-mitigated-support:
not-NIL-mitigated-precision:
not-NIL-mitigated-recall:
not-NIL-mitigated-f1-score:
not-NIL-mitigated-support
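As an illustration of two of these metrics, the following sketch computes recall@k and B-cubed precision and recall using their standard textbook definitions. It is not taken from eval_kbp.py, whose implementation may differ in details; the function names and the toy data are assumptions.

```python
from collections import defaultdict


def recall_at_k(ranked_candidates, gold, k):
    """Fraction of mentions whose gold entity is among the top-k candidates."""
    hits = sum(g in cands[:k] for cands, g in zip(ranked_candidates, gold))
    return hits / len(gold)


def bcubed(predicted, gold):
    """B-cubed precision and recall for two {item: label} assignments."""
    pred_groups, gold_groups = defaultdict(set), defaultdict(set)
    for item, label in predicted.items():
        pred_groups[label].add(item)
    for item, label in gold.items():
        gold_groups[label].add(item)
    precision = recall = 0.0
    for item in predicted:
        cluster = pred_groups[predicted[item]]   # items clustered with `item`
        category = gold_groups[gold[item]]       # items that truly belong with it
        overlap = len(cluster & category)
        precision += overlap / len(cluster)
        recall += overlap / len(category)
    return precision / len(predicted), recall / len(predicted)


# Toy data: two mentions with ranked candidates, three NIL mentions to cluster.
candidates = [["a", "b", "c"], ["x", "y", "z"]]
print(recall_at_k(candidates, ["b", "z"], k=2))
print(bcubed({"m1": 0, "m2": 0, "m3": 1}, {"m1": "A", "m2": "B", "m3": "B"}))
```

In the toy clustering, m1 and m2 are wrongly merged and m3 is wrongly separated from m2, so both precision and recall fall below 1.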
In this example we train using the first batch of train from the incremental dataset and using the first batch of dev for evaluating and comparing the NIL prediction models.
Prepare data for the NIL prediction study/training: we need to get linking scores. Ensure the pipeline is up, then run
python scripts/eval_kbp.py --save-path output/prepare_for_nil_study --prepare-for-nil-pred incremental_dataset/train/train_0.jsonl
python scripts/eval_kbp.py --save-path output/prepare_for_nil_study --prepare-for-nil-pred incremental_dataset/train/train_1.jsonl
python scripts/eval_kbp.py --save-path output/prepare_for_nil_study --prepare-for-nil-pred incremental_dataset/train/train_2.jsonl
python scripts/eval_kbp.py --save-path output/prepare_for_nil_study --prepare-for-nil-pred incremental_dataset/train/train_3.jsonl
python scripts/eval_kbp.py --save-path output/prepare_for_nil_study --prepare-for-nil-pred incremental_dataset/train/train_4.jsonl
python scripts/eval_kbp.py --save-path output/prepare_for_nil_study --prepare-for-nil-pred incremental_dataset/train/train_5.jsonl
python scripts/eval_kbp.py --save-path output/prepare_for_nil_study --prepare-for-nil-pred incremental_dataset/train/train_6.jsonl
python scripts/eval_kbp.py --save-path output/prepare_for_nil_study --prepare-for-nil-pred incremental_dataset/train/train_7.jsonl
python scripts/eval_kbp.py --save-path output/prepare_for_nil_study --prepare-for-nil-pred incremental_dataset/train/train_8.jsonl
python scripts/eval_kbp.py --save-path output/prepare_for_nil_study --prepare-for-nil-pred incremental_dataset/train/train_9.jsonl
python scripts/eval_kbp.py --save-path output/prepare_for_nil_study --prepare-for-nil-pred incremental_dataset/dev/dev.jsonl
You should see the output files in the folder output/prepare_for_nil_study:
train_0_outdata.pickle
train_1_outdata.pickle
train_2_outdata.pickle
train_3_outdata.pickle
train_4_outdata.pickle
train_5_outdata.pickle
train_6_outdata.pickle
train_7_outdata.pickle
train_8_outdata.pickle
train_9_outdata.pickle
dev_outdata.pickle
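A quick way to confirm that every expected file was produced is sketched below. The helper `missing_outputs` is hypothetical and simply assumes the file names listed above.

```python
from pathlib import Path


def missing_outputs(folder="output/prepare_for_nil_study", n_train=10):
    """List expected *_outdata.pickle files that are not present in `folder`."""
    expected = [f"train_{i}_outdata.pickle" for i in range(n_train)]
    expected.append("dev_outdata.pickle")
    return [name for name in expected if not (Path(folder) / name).is_file()]


# An empty list means all eleven files are in place.
print(missing_outputs())
```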
Then run the study and train the models with:
# if using the entire train set this script requires about 43G RAM
python scripts/feature_ablation_study.py --train-path output/prepare_for_nil_study/train_0_outdata.pickle \
--train-path output/prepare_for_nil_study/train_1_outdata.pickle \
--train-path output/prepare_for_nil_study/train_2_outdata.pickle \
--train-path output/prepare_for_nil_study/train_3_outdata.pickle \
--train-path output/prepare_for_nil_study/train_4_outdata.pickle \
--train-path output/prepare_for_nil_study/train_5_outdata.pickle \
--train-path output/prepare_for_nil_study/train_6_outdata.pickle \
--train-path output/prepare_for_nil_study/train_7_outdata.pickle \
--train-path output/prepare_for_nil_study/train_8_outdata.pickle \
--train-path output/prepare_for_nil_study/train_9_outdata.pickle \
--test-path output/prepare_for_nil_study/dev_outdata.pickle --output-path nilprediction_output
The nilprediction_output folder will contain:
- the models
- a summary (feature_ablation_summary.csv) that compares all the models
- plots of the distribution of the predictions
- performance report for each model
nilprediction_output/feature_ablation_summary.csv
nilprediction_output/nilp_bi_max_levenshtein_jaccard_model.pickle
nilprediction_output/nilp_bi_max_levenshtein_jaccard_kde_correct_errors.png
nilprediction_output/nilp_bi_max_levenshtein_jaccard_kde.png
nilprediction_output/nilp_bi_max_levenshtein_jaccard_report.txt
nilprediction_output/nilp_bi_max_levenshtein_jaccard_roc.png
...
Edit docker-compose.yml, commenting out the part related to the GPU (look for the comments in the file).
Edit the JSON file models/biencoder_wiki_large.json, setting
no_cuda: true
We suggest using a GPU to evaluate a dataset; a CPU should be enough to try the pipeline.
This repository is MIT licensed. See the LICENSE file for details.