
Scaling chrF and TER to 0-100 #140

Closed
ozancaglayan opened this issue Feb 24, 2021 · 2 comments · Fixed by #152
@ozancaglayan
Collaborator

At least for chrF, the original implementation reports the score that way. Do you think it is reasonable to standardize these across the three metrics? It'll also make reporting a bit more consistent w.r.t. the --width parameter.
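
For context, a minimal sketch of the inconsistency, assuming the 1.x module-level helpers (`corpus_bleu`, `corpus_chrf`, `corpus_ter`) and made-up sentences:

```python
import sacrebleu

hyps = ['The dog bit the man.', 'It was not unexpected.']
refs = [['The dog had bit the man.', 'No one was surprised.']]

print(sacrebleu.corpus_bleu(hyps, refs).score)  # reported on a 0-100 scale
print(sacrebleu.corpus_chrf(hyps, refs).score)  # reported in [0, 1]
print(sacrebleu.corpus_ter(hyps, refs).score)   # also reported in [0, 1]
```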

@martinpopel
Collaborator

Yes, I would love that.
I think it is OK for everyone, especially if it will be in sacreBLEU 2.0, where such changes are expected.

@ozancaglayan ozancaglayan added this to the 2.0.0 milestone Feb 25, 2021
@cfedermann
Contributor

+1

@ozancaglayan ozancaglayan linked a pull request Mar 26, 2021 that will close this issue
ozancaglayan added a commit that referenced this issue Jul 18, 2021
  - Build: Add Windows and OS X testing to github workflow
  - Improve documentation and type annotations.
  - Drop `Python < 3.6` support and migrate to f-strings.
  - Drop input type manipulation through `isinstance` checks. If the user does not
    obey the expected annotations, exceptions will be raised. Robustness attempts
    led to confusion and obfuscated scoring errors in the past (fixes #121)
  - Use colored strings in tabular outputs (multi-system evaluation mode) with
    the help of the `colorama` package.
  - tokenizers: Add caching to tokenizers, which speeds things up a bit.
  - `intl` tokenizer: Use `regex` module. Speed goes from ~4 seconds to ~0.6 seconds
    for a particular test set evaluation. (fixes #46)
  - Signature: Formatting changed (mostly to remove the '+' separator as it was
    interfering with chrF++). The field separator is now '|' and key/value pairs
    are separated with ':' rather than '.' (see the signature sketch after this list).
  - Metrics: Scale all metrics into the [0, 100] range (fixes #140; see the API sketch after this list)
  - BLEU: In case of no n-gram matches at all, skip smoothing and return 0.0 BLEU (fixes #141).
  - BLEU: allow modifying max_ngram_order (fixes #156)
  - CHRF: Added multi-reference support, verified the scores against chrF++.py, added test case.
  - CHRF: Added chrF+ support through `word_order` argument. Added test cases against chrF++.py.
    Exposed it through the CLI (--chrf-word-order) (fixes #124)
  - CHRF: Add the possibility to disable effective order smoothing (pass --chrf-eps-smoothing).
    This way, the scores obtained are exactly the same as in the chrF++, Moses and NLTK
    implementations. We keep effective order smoothing as the default for compatibility,
    since it only affects sentence-level scoring with very short sentences. (fixes #144)
  - CLI: Allow modifying TER arguments through CLI. We still keep the TERCOM defaults.
  - CLI: Prefix metric-specific arguments with --chrf and --ter. To maintain compatibility, BLEU argument names are kept the same.
  - CLI: Added `--format/-f` flag. The single-system output mode is now `json` by default.
    If you want to keep the old text format persistently, you can export `SACREBLEU_FORMAT=text` into your
    shell.
  - CLI: sacreBLEU now supports evaluating multiple systems for a given test set
    efficiently. Through the use of the `tabulate` package, the results are
    nicely rendered into a plain-text table, LaTeX, HTML or RST (cf. --format/-f argument).
    The systems can be given either as a list of plain-text files to `-i/--input` or
    as a tab-separated single stream redirected into `STDIN`. In the former case,
    the basenames of the files will automatically be used as system names.
  - Statistical tests: sacreBLEU now supports confidence interval estimation
    through bootstrap resampling for single-system evaluation (`--confidence` flag)
    as well as paired bootstrap resampling (`--paired-bs`) and paired approximate
    randomization tests (`--paired-ar`) when evaluating multiple systems (fixes #40 and fixes #78).
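
Taken together, a minimal sketch of the 2.0 metric API referenced in the metric items above (class names and keyword arguments follow the 2.0 release; sentences are made up):

```python
# All three metrics now report scores on the same [0, 100] scale, and the
# new knobs are plain keyword arguments on the metric classes.
from sacrebleu.metrics import BLEU, CHRF, TER

hyps = ['The dog bit the man.', 'It was not unexpected.']
refs = [['The dog had bit the man.', 'No one was surprised.']]

bleu = BLEU(max_ngram_order=2)  # n-gram order is now configurable (fixes #156)
chrf = CHRF(word_order=2)       # word_order=2 selects chrF++ (fixes #124)
ter = TER()                     # TERCOM defaults are kept

for metric in (bleu, chrf, ter):
    print(metric.corpus_score(hyps, refs))  # every score lands in [0, 100]
```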
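And a short sketch of the new signature format via `get_signature()` (the exact fields printed will vary with the configuration and version):

```python
from sacrebleu.metrics import BLEU

bleu = BLEU()
bleu.corpus_score(['The dog bit the man.'], [['The dog had bit the man.']])
# Fields are joined with '|' and key/value pairs with ':' instead of the old
# '+' and '.', e.g. 'nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0'
print(bleu.get_signature())
```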
@mjpost mjpost mentioned this issue Oct 19, 2022