- Mistake in tagging v2.5.0
- WMT24 test sets
- Convert Changelog to markdown format
- Add optimization for compute_bleu precision initialization (#257). Thanks to Ernests Lavrinovics for this contribution.
Added:
- Add printing of domain if present (via --echo)
Fixed:
- Add exports to package `__init__.py`
Added:
- WMT23 test sets (test set `wmt23`)
Fixed:
- Typing issues (#249, #250)
- Improved builds (#252)
Fixed:
- Special treatment of empty references in TER (#232)
- Bump in mecab version for JA (#234)
Added:
- Warning if `-tok spm` is used (use explicit `flores101` instead) (#238)
Bugfix:
- Set lru_cache to 2^16 for SPM tokenizer (was set to infinite)
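
The difference between an unbounded and a bounded cache can be sketched with the standard library's `functools.lru_cache`. This is only an illustration: `tokenize` below is a hypothetical stand-in, not sacreBLEU's SPM tokenizer.

```python
from functools import lru_cache


# Hypothetical stand-in for an expensive tokenizer call.
# maxsize=None would let the cache grow without limit; 2**16 caps it
# at 65,536 entries and evicts the least recently used ones beyond that.
@lru_cache(maxsize=2**16)
def tokenize(line: str) -> str:
    return " ".join(line.split())


tokenize("Hello   world")
tokenize("Hello   world")  # second call is served from the cache
info = tokenize.cache_info()
print(info.maxsize, info.hits, info.misses)
```

A bounded cache trades a few repeated tokenizations for a predictable memory ceiling on large inputs.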
Features:
- (#203) Added `-tok flores101` and `-tok flores200`, a.k.a. `spbleu`. These are multilingual tokenizations that make use of the multilingual SPM models released by Facebook and described in the following papers:
  - Flores-101: https://arxiv.org/abs/2106.03193
  - Flores-200: https://arxiv.org/abs/2207.04672
- (#213) Added JSON formatting for multi-system output (thanks to Manikanta Inugurthi @me-manikanta)
- (#211) You can now list all test sets for a language pair with `--list SRC-TRG`. Thanks to Jaume Zaragoza (@ZJaume) for adding this feature.
- Added WMT22 test sets (test set `wmt22`)
- System outputs: include with wmt22. Also added wmt21/systems which will produce WMT21 submitted systems. To see available systems, give a dummy system to `--echo`, e.g., `sacrebleu -t wmt22 -l en-de --echo ?`
Bugfix: Standard usage was returning (and using) each reference twice.
Features:
- Added WMT21 datasets (thanks to @BrightXiaoHan)
- `--echo` now exposes document metadata where available (e.g., docid, genre, origlang)
- Bugfix: allow empty references (#161)
- Adds a Korean tokenizer (thanks to @NoUnique)
Under the hood:
- Moderate code refactoring
- Processed files have adopted a more sensible internal naming scheme under ~/.sacrebleu (e.g., wmt17_ms.zh-en.src instead of zh-en.zh)
- Processed file extensions correspond to the values passed to `--echo` (e.g., "src")
- Now explicitly representing NoneTokenizer
- Got rid of the ".lock" lockfile for downloading (using the tarball itself)
Many thanks to @BrightXiaoHan (https://github.com/BrightXiaoHan) for the bulk of the code contributions in this release.
Features:
- Added `-tok spm` for multilingual SPM tokenization (#168) (thanks to Naman Goyal and James Cross at Facebook)
Fixes:
- Handle potential memory usage issues due to LRU caching in tokenizers (#167)
- Bugfix: BLEU.corpus_score() now using max_ngram_order (#173)
- Upgraded ja-mecab to 1.0.5 (#196)
- Build: Add Windows and OS X testing to Travis CI.
- Improve documentation and type annotations.
- Drop `Python < 3.6` support and migrate to f-strings.
- Relax `portalocker` version pinning, add `regex, tabulate, numpy` dependencies.
- Drop input type manipulation through `isinstance` checks. If the user does not follow the expected annotations, exceptions will be raised. Robustness attempts led to confusion and obfuscated score errors in the past (#121)
- Variable # references per segment is supported for all metrics by default. It is still only available through the API.
- Use colored strings in tabular outputs (multi-system evaluation mode) with the help of the `colorama` package.
- tokenizers: Add caching to tokenizers, which seems to speed things up a bit.
- `intl` tokenizer: Use `regex` module. Speed goes from ~4 seconds to ~0.6 seconds for a particular test set evaluation. (#46)
- Signature: Formatting changed (mostly to remove '+' separator as it was interfering with chrF++). The field separator is now '|' and key values are separated with ':' rather than '.'.
- Signature: Boolean true / false values are shortened to yes / no.
- Signature: Number of references is `var` if variable number of references is used.
- Signature: Add effective order (yes/no) to BLEU and chrF signatures.
- Metrics: Scale all metrics into the [0, 100] range (#140)
- Metrics API: Use explicit argument names and defaults for the metrics instead of passing obscure `argparse.Namespace` objects.
- Metrics API: A base abstract `Metric` class is introduced to guide further metric development. This class defines the methods that should be implemented in the derived classes and offers boilerplate methods for the common functionality. A new metric implemented this way will automatically support significance testing.
- Metrics API: All metrics now receive an optional `references` argument at initialization time to process and cache the references. Further evaluations of different systems against the same references become faster this way, for example when using significance testing.
- BLEU: In case of no n-gram matches at all, skip smoothing and return 0.0 BLEU (#141).
- CHRF: Added multi-reference support, verified the scores against chrF++.py, added test case.
- CHRF: Added chrF+ support through `word_order` argument. Added test cases against chrF++.py. Exposed it through the CLI (--chrf-word-order) (#124)
- CHRF: Add possibility to disable effective order smoothing (pass --chrf-eps-smoothing). This way, the scores obtained are exactly the same as chrF++, Moses and NLTK implementations. We keep the effective ordering as the default for compatibility, since this only affects sentence-level scoring with very short sentences. (#144)
- CLI: `--input/-i` can now ingest multiple systems. For this reason, the positional `references` should always precede the `-i` flag.
- CLI: Allow modifying TER arguments through CLI. We still keep the TERCOM defaults.
- CLI: Prefix metric-specific arguments with --chrf and --ter. To maintain compatibility, BLEU argument names are kept the same.
- CLI: Separate metric-specific arguments for clarity when `--help` is printed.
- CLI: Added `--format/-f` flag. The single-system output mode is now `json` by default. If you want to keep the old text format persistently, you can export `SACREBLEU_FORMAT=text` into your shell.
- CLI: For multi-system mode, `json` falls back to plain text. `latex` output can only be generated for multi-system mode.
- CLI: sacreBLEU now supports evaluating multiple systems for a given test set in an efficient way. Through the use of the `tabulate` package, the results are nicely rendered into a plain text table, LaTeX, HTML or RST (cf. --format/-f argument). The systems can be either given as a list of plain text files to `-i/--input` or as a tab-separated single stream redirected into `STDIN`. In the former case, the basenames of the files will be automatically used as system names.
- Statistical tests: sacreBLEU now supports confidence interval estimation through bootstrap resampling for single-system evaluation (`--confidence` flag) as well as paired bootstrap resampling (`--paired-bs`) and paired approximate randomization tests (`--paired-ar`) when evaluating multiple systems (#40 and #78).
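
For readers unfamiliar with paired bootstrap resampling, the idea can be sketched in a few lines of standard-library Python. This is a simplified illustration, not sacreBLEU's implementation: it assumes per-segment scores and compares their means, whereas a corpus-level metric such as BLEU would need to be recomputed on each resample. The function name and parameters are hypothetical.

```python
import random
from statistics import mean


def paired_bootstrap(scores_a, scores_b, n_samples=1000, seed=12345):
    """Return the fraction of resamples in which system B outscores system A.

    Both lists hold per-segment scores for the same segments, in the same order.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same segments")
    rng = random.Random(seed)
    n = len(scores_a)
    wins_b = 0
    for _ in range(n_samples):
        # Draw segment indices with replacement and reuse the same
        # indices for both systems: this is what makes the test "paired".
        idx = [rng.randrange(n) for _ in range(n)]
        sample_a = mean([scores_a[i] for i in idx])
        sample_b = mean([scores_b[i] for i in idx])
        if sample_b > sample_a:
            wins_b += 1
    return wins_b / n_samples


system_a = [0.30, 0.42, 0.25, 0.51]
system_b = [0.35, 0.47, 0.30, 0.56]
print(paired_bootstrap(system_a, system_b))
```

A win fraction close to 1 (or close to 0) suggests the difference between the two systems is stable under resampling of the test set.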
- Fix extraction error for WMT18 extra test sets (test-ts) (#142)
- Validation and test datasets are added for multilingual TEDx
- Fix an assertion error in chrF (#121)
- Add missing `__repr__()` methods for BLEU and TER
- TER: Fix exception when `--short` is used (#131)
- Pin Mecab version to 1.0.3 for Python 3.5 support
- [API Change]: Default value for `floor` smoothing is now 0.1 instead of 0.
- [API Change]: `sacrebleu.sentence_bleu()` now uses the `exp` smoothing method, exactly the same as the CLI's --sentence-level behavior. This was mainly done to make the two methods behave the same.
- Add smoothing value to BLEU signature (#98)
- dataset: Fix IWSLT links (#128)
- Allow variable number of references for BLEU (only via API) (#130). Thanks to Ondrej Dusek (@tuetschek)
- Added character-based tokenization (`-tok char`). Thanks to Christian Federmann.
- Added TER (`-m ter`). Thanks to Ales Tamchyna! (fixes #90)
- Allow calling the script as a standalone utility (fixes #86)
- Fix type annotation issues (fixes #100) and mark sacrebleu as supporting mypy
- Added WMT20 robustness test sets:
- wmt20/robust/set1 (en-ja, en-de)
- wmt20/robust/set2 (en-ja, ja-en)
- wmt20/robust/set3 (de-en)
- Added WMT20 newstest test sets (#103)
- Make mecab3-python an extra dependency and adapt code to the new mecab3-python. This fixes the recent Windows installation issues as well (#104). Japanese support should now be explicitly installed through the sacrebleu[ja] package.
- Fix return type annotation of corpus_bleu()
- Improve sentence_score's documentation, do not allow single ref string (#98)
- Fix a deployment bug (#96)
- Added Multi30k multimodal MT test set metadata
- Refactored all tokenizers into respective classes (fixes #85)
- Refactored all metrics into respective classes
- Moved utility functions into `utils.py`
- Implemented signatures using `BLEUSignature` and `CHRFSignature` classes
- Simplified checking of Chinese characters (fixes #5)
- Unified common regexp tokenization codes for tokenizers (fixes #27)
- Fixed --detail failing when no test sets are provided
- Fixed multi-reference BLEU failing when tab-delimited reference stream is used
- Removed lowercase option for ChrF which was not functional (#85)
- Simplified ChrF and used the same I/O logic as BLEU to allow for future multi-reference reading
- Added score regression tests for chrF using reference chrF++ implementation
- Added multi-reference & tokenizer & signature tests
- Fixed bug in signature with mecab tokenizer
- Cleaned up deprecation warnings (thanks to Karthikeyan Singaravelan @tirkarthi)
- Now only lists the external typing module as a dependency for Python `<= 3.4`, as it was integrated in the standard library in Python 3.5 (thanks to Erwan de Lépinau @ErwanDL).
- Added LICENSE to pypi (thanks to Mark Harfouche @hmaarrfk)
- Changed `get_available_testsets()` to return a list
- Remove Japanese MeCab tokenizer from requirements. (Must be installed manually to avoid Windows incompatibility). Many thanks to Makoto Morishita (@MorinoseiMorizo).
- Added to API:
- get_source_file()
- get_reference_files()
- get_available_testsets()
- get_langpairs_for_testset()
- Some internal refactoring
- Fixed descriptions of some WMT19/google test sets
- Added API test case (test/test_apy.py)
- Added Google's extra wmt19/en-de refs (-t wmt19/google/{ar,arp,hqall,hqp,hqr,wmtp}) (Freitag, Grangier, & Caswell BLEU might be Guilty but References are not Innocent https://arxiv.org/abs/2004.06063)
- Restored SACREBLEU_DIR and smart_open to exports (thanks to Thomas Liao @tholiao)
- Large internal reorganization as a module (thanks to Thamme Gowda @thammegowda)
- Added Japanese MeCab tokenizer (`-tok ja-mecab`) (thanks to Makoto Morishita @MorinoseiMorizo)
- Added wmt20/dev test sets (thanks to Martin Popel @martinpopel)
- Smoothing changes (Sebastian Nickels @sn1c)
- Fixed bug that only applied smoothing to n-grams for n > 2
- Added default smoothing values for methods "floor" (0) and "add-k" (1)
- `--list` now returns a list of all language pairs for a task when combined with `-t` (e.g., `sacrebleu -t wmt19 --list`)
- added missing languages for IWSLT17
- Minor code improvements (Thomas Liao @tholiao)
- Bugfix: handling of result object for CHRF
- Improved API example
- Tokenization variant omitted from the chrF signature; it is relevant only for BLEU (thanks to Martin Popel)
- Bugfix: call to sentence_bleu (thanks to Rachel Bawden)
- Documentation example for Python API (thanks to Vlad Lyalin)
- Calls to corpus_chrf and sentence_chrf now return an object instead of a float (use result.score)
- Added sentence-level scoring via -sl (--sentence-level)
- Many thanks to Martin Popel for all the changes below!
- Added evaluation on concatenated test sets (e.g., `-t wmt17,wmt18`). Works as long as they all have the same language pair.
- Added `sacrebleu --origlang` (both for evaluation on a subset and for `--echo`). Note that while echoing prints just the subset, evaluation expects the complete test set (and just skips the irrelevant parts).
- Added `sacrebleu --detail` for breakdown by domain-specific subsets of the test sets. (Available for WMT19).
- Minor changes
- Improved display of `sacrebleu -h`
- Added `sacrebleu --list`
- Code refactoring
- Documentation and tests updates
- Fixed a race condition bug (`os.makedirs(outdir, exist_ok=True)` instead of `if os.path.exists`)
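
The race condition fix above follows a common pattern, sketched here with only the standard library. The `ensure_dir` helper and the directory names are hypothetical and chosen for illustration.

```python
import os
import tempfile


def ensure_dir(outdir: str) -> None:
    # Racy pattern: another process may create the directory between the
    # check and the call, so os.makedirs can raise FileExistsError:
    #     if not os.path.exists(outdir):
    #         os.makedirs(outdir)
    # With exist_ok=True the call succeeds whether or not the directory
    # already exists, so no separate check is needed.
    os.makedirs(outdir, exist_ok=True)


with tempfile.TemporaryDirectory() as tmp:
    outdir = os.path.join(tmp, "cache", "wmt17")
    ensure_dir(outdir)
    ensure_dir(outdir)  # calling again is harmless
    print(os.path.isdir(outdir))
```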
- Lazy loading of regexes cuts import time from ~1s to nearly nothing (thanks, @louismartin!)
- Added a simple (non-atomic) lock on downloading
- Can now read multiple refs from a single tab-delimited file. You need to pass `--num-refs N` to tell it to run the split. Only works with a single reference file passed from the command line.
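
As a rough sketch of what splitting a tab-delimited reference file involves, the following standalone function turns each line into N parallel reference streams. It is an assumption-laden illustration, not the code sacreBLEU uses; `split_references` and its error message are hypothetical.

```python
from typing import Iterable, List


def split_references(lines: Iterable[str], num_refs: int) -> List[List[str]]:
    """Split tab-delimited lines into num_refs parallel reference streams."""
    streams: List[List[str]] = [[] for _ in range(num_refs)]
    for lineno, line in enumerate(lines, 1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != num_refs:
            raise ValueError(
                f"line {lineno}: expected {num_refs} references, got {len(fields)}"
            )
        # Field k of every line goes to reference stream k.
        for stream, ref in zip(streams, fields):
            stream.append(ref)
    return streams


refs = split_references(["the cat sat\ta cat sat\n", "hello\thi there\n"], num_refs=2)
print(refs)
```

Requiring every line to carry exactly N fields keeps the reference streams aligned with the system output.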
- Removed another f-string for Python 3.5 compatibility
- Restored Python 3.5 compatibility
- Added MTNT 2019 test sets
- Added a BLEU object
- Added WMT'19 test sets
- Bugfix in test case (thanks to Adam Roberts, @adarob)
- Passing smoothing method through `sentence_bleu`
- Added another smoothing approach (add-k) and a command-line option for choosing the smoothing method (`--smooth exp|floor|add-n|none`) and the associated value (`--smooth-value`), when relevant.
- Changed interface to some functions (backwards incompatible)
- 'smooth' is now 'smooth_method'
- 'smooth_floor' is now 'smooth_value'
- Ctrl-M characters are now treated as normal characters, previously treated as newline.
- Tokenization now defaults to "zh" when language pair is known
- Updated checksum for wmt19/dev (seems to have changed)
- Fixed checksum for wmt17/dev (copy-paste error)
- Added kk-en and en-kk to wmt19/dev
- Added gu-en and en-gu to wmt19/dev
- Added MD5 checksumming of downloaded files for all datasets.
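
Checksumming a downloaded file can be done with `hashlib` from the standard library. The sketch below is a generic illustration under assumed names (`md5_of_file`, `verify_download`), not sacreBLEU's own download code.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large tarballs need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected_md5: str) -> None:
    """Raise RuntimeError if the file's MD5 differs from the expected value."""
    actual = md5_of_file(path)
    if actual != expected_md5:
        raise RuntimeError(
            f"checksum mismatch for {path}: expected {expected_md5}, got {actual}"
        )
```

A mismatch typically means a truncated download or a file that changed upstream, which is why some entries in this changelog update stored checksums.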
- Added mtnt1.1/train mtnt1.1/valid mtnt1.1/test data from MTNT
- Added 'wmt19/dev' task for 'lt-en' and 'en-lt' (development data for new tasks).
- Added MD5 checksum for downloaded tarballs.
- Now outputs only one digit after the decimal
- Added a function for sentence-level, smoothed BLEU
- Added wmt18 test set (with references)
- Added zh-en, en-zh, tr-en, and en-tr datasets for wmt18/test-ts
- Added wmt18/test-ts, the test sources (only) for WMT18
- Moved README out of `sacrebleu.py` and the CHANGELOG into a separate file
- fixed another locale issue (with --echo)
- grudgingly enabled `-tok none` from the command line
- added wmt17/ms (Microsoft's additional ZH-EN references). Try `sacrebleu -t wmt17/ms --cite`.
- `--echo ref` now pastes together all references, if there is more than one
- added wmt18/dev datasets (en-et and et-en)
- fixed logic with --force
- locale-independent installation
- added "--echo both" (tab-delimited)
- metrics (`-m`) are now printed in the order requested
- chrF now prints a version string (including the beta parameter, importantly)
- attempt to remove dependence on locale setting
- added the chrF metric (`-m chrf` or `-m bleu chrf` for both). See 'CHRF: character n-gram F-score for automatic MT evaluation' by Maja Popovic (WMT 2015) [http://www.statmt.org/wmt15/pdf/WMT49.pdf]
- added IWSLT 2017 test and tuning sets for DE, FR, and ZH (Thanks to Mauro Cettolo and Marcello Federico).
- added `--cite` to produce the citation for easy inclusion in papers
- added `--input` (`-i`) to set input to a file instead of STDIN
- removed accent mark after objection from UN official
- corpus_bleu() now raises an exception if input streams are different lengths
- thanks to Martin Popel for:
- small bugfix in tokenization_13a (not affecting WMT references)
- adding `--tok intl` (international tokenization)
- added wmt16/dev and wmt17/dev sets (for languages intro'd those years)
- bugfix for tokenization warning
- added -b option (only output the BLEU score)
- removed fi-en from list of WMT16/17 systems with more than one reference
- added WMT16/tworefs and WMT17/tworefs for scoring with both en-fi references
- added effective order for sentence-level BLEU computation
- added unit tests from sockeye
- Factored code a bit to facilitate API:
- compute_bleu: works from raw stats
- corpus_bleu for use from the command line
- raw_corpus_bleu: turns off tokenization, command-line sanity checks, floor smoothing
- Smoothing (type 'exp', now the default) fixed to produce mteval-v13a.pl results
- Added 'floor' smoothing (adds 0.01 to 0 counts, more versatile via API), 'none' smoothing (via API)
- Small bugfixes, windows compatibility (H/T Christian Federmann)
- Contributions from Christian Federmann:
- Added explicit support for encoding
- Fixed Windows support
- Bugfix in handling reference length with multiple refs
- Small bugfix affecting some versions of Python.
- Code reformatting due to Ozan Çağlayan.
- Support for WMT 2008--2017.
- Single tokenization (v13a) with lowercase fix (proper lower() instead of just A-Z).
- Chinese tokenization.
- Tested to match all WMT17 scores on all arcs.