A named entity recognition (NER) system for Slovene, Croatian and Serbian. The system is a slight modification of the CRF-based reldi-tagger, extended with Brown cluster information. It distinguishes five entity classes: person, person derivative, location, organization and miscellaneous.
The Slovene model was trained on ssj500k, the Croatian on hr500k, while the Serbian model was trained on SETimes.SR.
$ python2.7 sl < example_sl.txt
Slovenija Npfsn B-loc
je Va-r3s-n O
zelo Rgp O
# Z O
kot Rgp O
Hrvaška Npfsn B-loc
# Z O
lepa Agpfsn O
. Z O
$ python2.7 hr < example_hr.txt
Dodali Vmp-pm O
smo Var1p O
i Qo O
preostale Agpmpay O
jezike Ncmpa O
. Z O
Marko Npmsn B-per
i Cc O
Ana Npfsn B-per
rade Vmr3p O
u Sl O
Microsoftu Npmsl B-org
u Sl O
Jajcu Ncnsl B-loc
. Z O
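The output above has one token per line with three whitespace-separated columns: the token, its morphosyntactic tag and the NER label. The following is a minimal sketch of how such output could be collected into entities. The function name `extract_entities` is hypothetical and not part of the tool, and the handling of `I-` continuation labels is an assumption, since the examples here only show `B-` labels.

```python
from typing import Iterable, List, Tuple


def extract_entities(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Collect (entity_text, entity_type) pairs from token/tag/label lines."""
    entities: List[Tuple[str, str]] = []
    tokens: List[str] = []  # tokens of the entity currently being built
    etype = ""              # type of that entity, "" when none is open

    def flush() -> None:
        # Close the currently open entity, if there is one.
        nonlocal tokens, etype
        if tokens:
            entities.append((" ".join(tokens), etype))
        tokens, etype = [], ""

    for line in lines:
        parts = line.split()
        if len(parts) != 3:  # a blank or malformed line ends any open entity
            flush()
            continue
        token, _msd, label = parts
        if label.startswith("B-"):
            flush()
            tokens, etype = [token], label[2:]
        elif label.startswith("I-") and tokens and label[2:] == etype:
            tokens.append(token)  # assumed continuation label
        else:
            flush()  # "O" or an unexpected label closes the entity
    flush()
    return entities


sample = """Marko Npmsn B-per
i Cc O
Ana Npfsn B-per
rade Vmr3p O
u Sl O
Microsoftu Npmsl B-org
u Sl O
Jajcu Ncnsl B-loc
. Z O""".splitlines()

print(extract_entities(sample))
# [('Marko', 'per'), ('Ana', 'per'), ('Microsoftu', 'org'), ('Jajcu', 'loc')]
```

The entity text is kept in its inflected surface form (for example `Microsoftu`), because the output shown here carries no lemma column.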
Named entity recognition requires input that is already tokenised and part-of-speech tagged. To produce such input from running text, apply the following tools:
- reldi-tokeniser for tokenization
- reldi-tagger for morphosyntactic tags
An example run of the whole pipeline:
$ echo 'U Piranu pada kiša.' | python2.7 reldi-tokeniser/ hr | python2.7 reldi-tagger/ hr | python2.7 janes-ner/ -i 2 -m 3 hr
U Sl O
Piranu Npmsl B-loc
pada Vmr3s O
kiša Ncfsn O
. Z O
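The same chaining can be driven from a script instead of a shell. The sketch below is one possible way to do that with Python's standard `subprocess` module. The helper `run_pipeline` is hypothetical, and the two stages in the demonstration are stand-in commands rather than the real tools; in actual use each stage would be the argument list of one command from the pipeline above.

```python
import subprocess
import sys
from typing import List, Sequence


def run_pipeline(stages: Sequence[List[str]], text: str) -> str:
    """Feed text through each command in turn, like a shell pipe."""
    data = text.encode("utf-8")
    for argv in stages:
        # check=True raises CalledProcessError if a stage exits with an error.
        result = subprocess.run(argv, input=data, stdout=subprocess.PIPE, check=True)
        data = result.stdout
    return data.decode("utf-8")


# Stand-in stages (NOT the real tools): one token per line, then upper-casing.
split_stage = [
    sys.executable, "-c",
    "import sys; sys.stdout.buffer.write(b'\\n'.join(sys.stdin.buffer.read().split()) + b'\\n')",
]
upper_stage = [
    sys.executable, "-c",
    "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read().upper())",
]

print(run_pipeline([split_stage, upper_stage], "U Piranu pada"))
```

Unlike a shell pipe, this version runs the stages one after another and holds each intermediate result in memory, which is simpler to debug but less suitable for very large inputs.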
The tagger was evaluated inside the babushka-bench benchmarking platform.
On Slovene, the tagger obtained an overall macro-F1 of 0.673 and an accuracy of 0.984, with the following per-class results:
class        precision  recall  f1-score  support
o                 0.99    1.00      1.00    16984
deriv-per         0.50    0.35      0.41       17
loc               0.84    0.77      0.80      230
misc              0.35    0.22      0.27       79
org               0.72    0.63      0.67      200
per               0.90    0.88      0.89      422
avg / total       0.98    0.98      0.98    17932
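The reported Slovene macro-F1 can be reproduced from the per-class scores above. The formula is not stated here, so the sketch below assumes that macro-F1 is the unweighted mean of the six per-class F1 values, with the first row taken to be the non-entity class (labelled `o` in the later tables).

```python
# Per-class F1 scores copied from the Slovene results above.
per_class_f1 = {
    "o": 1.00,  # first row, assumed to be the non-entity class
    "deriv-per": 0.41,
    "loc": 0.80,
    "misc": 0.27,
    "org": 0.67,
    "per": 0.89,
}

# Macro-F1 (assumed definition): every class counts equally,
# regardless of how many tokens it has.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

print(round(macro_f1, 3))  # 0.673
```

Because each class has equal weight, the low scores of the rare `misc` and `deriv-per` classes pull the macro-F1 far below the `avg / total` row, which is dominated by the very frequent non-entity class.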
On Croatian, the tagger obtained an overall macro-F1 of 0.752 and an accuracy of 0.978, with the following per-class results:
class        precision  recall  f1-score  support
o                 0.99    1.00      0.99    47763
deriv-per         0.57    0.57      0.57       23
loc               0.86    0.84      0.85      840
misc              0.55    0.45      0.49      517
org               0.76    0.69      0.72     1183
per               0.86    0.92      0.89     1038
avg / total       0.98    0.98      0.98    51364
On Serbian, the tagger obtained an overall macro-F1 of 0.781 and an accuracy of 0.975, with the following per-class results:
class        precision  recall  f1-score  support
o                 0.99    1.00      1.00    16984
deriv-per         0.50    0.35      0.41       17
loc               0.84    0.77      0.80      230
misc              0.35    0.22      0.27       79
org               0.72    0.63      0.67      200
per               0.90    0.88      0.89      422
avg / total       0.98    0.98      0.98    17932
The tagger was previously evaluated on three flavours of Slovene held-out data: standard data, non-standard data and a mixture of standard and non-standard data.
The evaluation results on the standard data are the following:
class        precision  recall  f1-score  support
o                 0.99    1.00      0.99    36938
deriv-per         0.44    0.56      0.49       27
loc               0.85    0.74      0.79      582
misc              0.39    0.24      0.30      315
org               0.69    0.48      0.57      497
per               0.87    0.95      0.91      819
avg / total       0.98    0.98      0.98    39178
The evaluation results on the non-standard data are these:
class        precision  recall  f1-score  support
o                 0.99    1.00      1.00     1740
deriv-per         0.00    0.00      0.00        1
loc               0.79    0.92      0.85       12
misc              0.75    0.21      0.33       14
org               0.50    0.33      0.40        6
per               0.98    1.00      0.99       82
avg / total       0.99    0.99      0.99     1855
The evaluation results on the mixture of standard and non-standard data are these:
class        precision  recall  f1-score  support
o                 0.99    1.00      0.99    40418
deriv-per         0.44    0.52      0.48       29
loc               0.85    0.75      0.80      606
misc              0.41    0.24      0.30      343
org               0.69    0.48      0.56      509
per               0.88    0.96      0.92      983
avg / total       0.98    0.98      0.98    42888
If you use the tagger, please cite the following paper:
author="Fi{\v{s}}er, Darja and Ljube{\v{s}}i{\'{c}}, Nikola and Erjavec, Toma{\v{z}}",
title="The Janes project: language resources and tools for Slovene user generated content",
journal="Language Resources and Evaluation",