ReFinED was trained to perform entity linking (EL) on the entirety of English Wikipedia. This means the model performs very well when the document text is similar to the kind of text that appears on Wikipedia. However, performance on other domains can be improved by fine-tuning the model on a relevant dataset. The same is true if the kind of entities to detect/link differs from the kind of entities hyperlinked on Wikipedia.
- Ensure the ReFinED source directory is in your Python path:
  ```bash
  export PYTHONPATH=$PYTHONPATH:src
  ```
- Run the `fine_tune.py` script (to list the arguments and help, run `fine_tune.py -h`):
  ```bash
  python3 src/refined/training/fine_tune.py --experiment_name test
  ```
The `fine_tune.py` script will automatically download the training and development splits of the CoNLL-AIDA dataset and use them for fine-tuning and evaluation. The default arguments are the ones used to produce the results reported in the ReFinED (NAACL 2022) paper.
To use a fine-tuned model, provide the file path to the directory containing the fine-tuned model (it must contain the `model.pt` and `config.json` files) to the `Refined.from_pretrained(...)` method as follows:
```python
refined = Refined.from_pretrained(model_name='<absolute_file_path_to_directory_containing_fine_tuned_model>',
                                  entity_set='wikipedia',
                                  use_precomputed_descriptions=False)
```
Then use the `refined` object as usual.
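For example, inference with the fine-tuned model works the same way as with the pretrained one (the `process_text` call below is the standard inference entry point from the main usage documentation; the sample sentence is only illustrative):
```python
# Run entity linking on a piece of text with the fine-tuned model.
spans = refined.process_text("England won the FIFA World Cup in 1966.")
print(spans)
```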
The steps above will not use precomputed entity description embeddings for inference. This means the embeddings will be computed on the fly, which doubles the inference time.
To generate precomputed description embeddings, run the `precompute_description_embeddings.py` script (or call `refined.precompute_description_embeddings()`) and then copy the embeddings file into the directory with the fine-tuned model. Lastly, ensure you set `use_precomputed_descriptions=True` when you call `Refined.from_pretrained(...)`.
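A minimal sketch of this workflow (the directory path is a placeholder; `precompute_description_embeddings()` is called with no arguments as described above, and the location of the generated embeddings file should be checked before copying it):
```python
from refined.inference.processor import Refined

# Load the fine-tuned model without precomputed descriptions first.
refined = Refined.from_pretrained(
    model_name='/path/to/fine_tuned_model_dir',
    entity_set='wikipedia',
    use_precomputed_descriptions=False,
)

# Compute the entity description embeddings once.
refined.precompute_description_embeddings()

# After copying the generated embeddings file into the fine-tuned model directory,
# reload with precomputed descriptions enabled to avoid recomputing them at inference time.
refined = Refined.from_pretrained(
    model_name='/path/to/fine_tuned_model_dir',
    entity_set='wikipedia',
    use_precomputed_descriptions=True,
)
```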
To add a custom dataset:
- Add a method to the `Datasets` class (`dataset_factory.py`), such as `get_custom_dataset_name_docs(...)`, which returns an iterable of `Doc` objects.
- The `Doc` objects returned should be created using the following method:
  ```python
  Doc.from_spans_with_text(text='insert_document_text', spans=[Span(...), ...], md_spans=[Span(...), ...])
  ```
Where:
- `text` is the full text for the document.
- `spans` is a list of `Span` objects (used for entity disambiguation and typing training) where each span has a `gold_entity` set to the correct (annotated) entity (using the Wikidata ID) and `coarse_type="MENTION"`.
- `md_spans` is a list of `Span` objects (used for mention detection training) where each span does not have a `gold_entity` and `coarse_type` can be set to any of the types ("MENTION", "DATE", "CARDINAL", "MONEY", "PERCENT", "TIME", "ORDINAL", "QUANTITY").
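For illustration, a minimal sketch of constructing one such `Doc` (the import paths, the `Span` keyword arguments such as `start` and `ln`, and the `Entity(wikidata_entity_id=...)` wrapper are assumptions based on the repository's data classes and should be checked against your version of the code):
```python
# Sketch only: field names and import paths below are assumptions, not verified APIs.
from refined.data_types.doc_types import Doc
from refined.data_types.base_types import Span, Entity

doc = Doc.from_spans_with_text(
    text="England won the 2019 Cricket World Cup.",
    # Annotated mentions with gold Wikidata entities (used for ED and ET training).
    spans=[
        Span(text="England", start=0, ln=7,
             gold_entity=Entity(wikidata_entity_id="Q21"),
             coarse_type="MENTION"),
    ],
    # Mention boundaries only, with no gold entity (used for MD training).
    md_spans=[
        Span(text="England", start=0, ln=7, coarse_type="MENTION"),
        Span(text="2019", start=16, ln=4, coarse_type="DATE"),
    ],
)
```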
- Then modify `fine_tune.py` to read your custom dataset:
  ```python
  training_dataset = DocDataset(
      docs=list(datasets.get_custom_dataset_docs(split="train", ...)),
      preprocessor=refined.preprocessor,
  )
  evaluation_dataset_name_to_docs = {
      "CUSTOM": list(datasets.get_custom_dataset_docs(split="dev", ...))
  }
  ```
- Then run the `fine_tune.py` script the same as before.
- If your dataset contains numbers and dates, ensure you set the CLI arg `model_name=wikipedia_model_with_numbers`; otherwise it is fine to use the default `model_name=wikipedia_model`. Similarly, if your dataset contains entities that are not in Wikipedia (but are in Wikidata), set `entity_set='wikidata'` instead of the default `entity_set='wikipedia'`.
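As a concrete example, a possible invocation combining these options (the `--model_name` and `--entity_set` flag spellings below are assumptions based on the argument names mentioned above; check the actual flags with `fine_tune.py -h`):
```bash
python3 src/refined/training/fine_tune.py \
    --experiment_name custom_dataset_test \
    --model_name wikipedia_model_with_numbers \
    --entity_set wikidata
```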
Alternatively, fine-tuning can be done programmatically as follows (example code in `fine_tuning_example.py`):
```python
from typing import Iterable

from refined.data_types.doc_types import Doc  # adjust this import path if Doc lives elsewhere in your version
from refined.evaluation.evaluation import get_datasets_obj
from refined.inference.processor import Refined
from refined.training.fine_tune.fine_tune import fine_tune_on_docs

refined = Refined.from_pretrained(model_name='wikipedia_model',
                                  entity_set='wikipedia',
                                  use_precomputed_descriptions=False)

train_docs: Iterable[Doc] = ...  # any method that returns Docs with Wikidata entity ids (qcodes) (spans used for ED + ET, and md_spans used for MD)
eval_docs: Iterable[Doc] = ...   # any method that returns Docs with Wikidata entity ids (qcodes)

fine_tune_on_docs(refined=refined, train_docs=train_docs, eval_docs=eval_docs)
```
This method has the same functionality as the `fine_tune.py` script, and the same arguments can be provided (see `fine_tune.py -h` for more information).