Punctuation/recase module performs badly... has to be redesigned #9

Open
proycon opened this issue Feb 10, 2016 · 1 comment

proycon commented Feb 10, 2016

No description provided.

proycon self-assigned this Feb 25, 2016

proycon commented Apr 13, 2016

The new module is implemented, but it still needs more thorough testing and parameter tuning. An initial evaluation of Valkuil on CITO data shows that recasing and deletions are barely working, if at all, and that precision and recall for missing-punctuation insertion are still very low (P=0.11, R=0.07 for the missingpunctuation class).

OVERALL RESULTS
=================
 Documents                                  :  520
 Total number of corrections in output      :  4390
 Total number of corrections in reference   :  10831
 Matching output corrections (tp)           :  1926
 Missed output corrections (fp)             :  2464
 Missed reference corrections (fn)          :  9118
 Virtual total (tp+fn)                      :  11044
 Precision (micro)                          :  0.44
 Recall (micro)                             :  0.17
 F1-score (micro)                           :  0.25

Aggregated corrections when they are on the same words:
 Aggregated average corrections in output              :  1.04
 Total number of aggregated corrections in output      :  4072
 Total number of aggregated corrections in reference   :  10831
 Matching output aggregated corrections (tp)           :  1713
 Missed output aggregated corrections (fp)             :  2359
 Missed reference aggregated corrections (fn)          :  18236
 Virtual total (tp+fn)                                 :  19949
 Aggregated precision (micro)                          :  0.42
 Aggregated recall (micro)                             :  0.09
 Aggregated F1-score (micro)                           :  0.14
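
For reference, the micro scores above can be reproduced from the reported tp/fp/fn counts. This is a minimal sketch that assumes the standard definitions of precision, recall and F1; it is not the evaluation script itself, and the function name `micro_scores` is only illustrative. Note that recall is taken over the "virtual total" (tp+fn), not over the number of corrections in the reference.

```python
def micro_scores(tp, fp, fn):
    """Return (precision, recall, F1) computed from raw counts."""
    precision = tp / (tp + fp)        # matched / all output corrections
    recall = tp / (tp + fn)           # matched / virtual total (tp+fn)
    f1 = 2 * tp / (2 * tp + fp + fn)  # harmonic mean of precision and recall
    return precision, recall, f1


# Counts copied from the report above
overall = micro_scores(tp=1926, fp=2464, fn=9118)
aggregated = micro_scores(tp=1713, fp=2359, fn=18236)

for name, (p, r, f) in (("overall", overall), ("aggregated", aggregated)):
    print(f"{name}: P={p:.2f} R={r:.2f} F1={f:.2f}")
```

Rounded to two decimals, this gives the same values as the report for both the plain and the aggregated counts.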

PER-MODULE RESULTS
====================
Precision for confusible_de_het :  0.45     (89/196)
Precision for confusible_deze_dit :  0.37     (7/19)
Precision for confusible_hard_hart :  1.0     (1/1)
Precision for confusible_hun_zij :  0.48     (10/21)
Precision for confusible_licht_ligt :  1.0     (3/3)
Precision for confusible_me_mijn :  0.8     (53/66)
Precision for confusible_u_uw :  0.85     (45/53)
Precision for confusible_word_wordt :  0.9     (112/125)
Precision for confusible_zei_zij :  0.8     (4/5)
Precision for confusiblesuffix_d_dt :  0.69     (9/13)
Precision for errorlist :  0.71     (256/359)
Precision for hunspell :  0.5     (957/1929)
Precision for puncrecase :  0.11     (124/1114)
Precision for runon :  0.51     (71/138)
Precision for splits :  0.6     (185/310)
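
To see which module contributes most to the unmatched output, the per-module counts above can be turned into false-positive counts (total minus matching). This is a rough sketch: the `modules` dictionary is hand-copied from the listing, and the variable names are illustrative rather than part of Valkuil.

```python
# (matching, total) output corrections per module, copied from the listing above
modules = {
    "confusible_de_het": (89, 196),
    "confusible_deze_dit": (7, 19),
    "confusible_hard_hart": (1, 1),
    "confusible_hun_zij": (10, 21),
    "confusible_licht_ligt": (3, 3),
    "confusible_me_mijn": (53, 66),
    "confusible_u_uw": (45, 53),
    "confusible_word_wordt": (112, 125),
    "confusible_zei_zij": (4, 5),
    "confusiblesuffix_d_dt": (9, 13),
    "errorlist": (256, 359),
    "hunspell": (957, 1929),
    "puncrecase": (124, 1114),
    "runon": (71, 138),
    "splits": (185, 310),
}

# Unmatched output corrections per module
false_positives = {name: total - tp for name, (tp, total) in modules.items()}

# Module with the most unmatched corrections, and its share of the listed total
worst = max(false_positives, key=false_positives.get)
share = false_positives[worst] / sum(false_positives.values())

print(worst, false_positives[worst], f"{share:.0%}")
```

With these counts, puncrecase has 990 unmatched corrections, slightly more than hunspell's 972, despite producing far fewer corrections overall.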


PER-CLASS RESULTS
====================
archaic :  P=0  R=0.0   F=0.0
capitalizationerror :  P=0.0    R=0.0   F=0.0
confusion :  P=0.66     R=0.18  F=0.28
missingpunctuation :  P=0.11    R=0.07  F=0.08
missingword :  P=0      R=0.0   F=0.0
nonworderror :  P=0.52  R=0.55  F=0.54
redundantpunctuation :  P=0     R=0.0   F=0.0
redundantword :  P=0    R=0.0   F=0.0
runonerror :  P=0.56    R=0.45  F=0.5
spliterror :  P=0.6     R=0.22  F=0.32
uncertain :  P=0        R=0.0   F=0.0

REFERENCE CLASS DISTRIBUTION
================================
archaic :  1 0.0%
capitalizationerror :  2374 21.9%
confusion :  1832 16.9%
missingpunctuation :  1849 17.1%
missingword :  960 8.9%
nonworderror :  1747 16.1%
redundantpunctuation :  306 2.8%
redundantword :  490 4.5%
runonerror :  333 3.1%
spliterror :  831 7.7%
uncertain :  108 1.0%

OUTPUT CLASS DISTRIBUTION
================================
capitalizationerror :  5 0.1%
confusion :  507 11.5%
missingpunctuation :  1114 25.4%
nonworderror :  2184 49.7%
runonerror :  270 6.2%
spliterror :  310 7.1%
