Tokenizer 1.21.0

@guillaumekln released this 22 Oct 13:10

New features

  • Accept vocabularies with tab-separated frequencies (format produced by SentencePiece)
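To illustrate the vocabulary layout this entry refers to, here is a minimal sketch of a reader for lines of the form `token<TAB>frequency`. It is not the Tokenizer's own loader: `read_vocabulary` and the sample entries are hypothetical, and treating the frequency as an integer is an assumption.

```python
import io


def read_vocabulary(lines):
    """Parse lines of the form `token` or `token<TAB>frequency`.

    Hypothetical helper for illustration only, not part of the Tokenizer API.
    Returns a dict mapping each token to its frequency, or to None when the
    line carries no frequency column.
    """
    vocab = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        token, sep, freq = line.partition("\t")
        # Assumption: frequencies are integer counts.
        vocab[token] = int(freq) if sep else None
    return vocab


# Made-up sample data in the tab-separated layout.
sample = "▁the\t1523\n▁of\t870\ning\t412\n"
vocab = read_vocabulary(io.StringIO(sample))
```

Splitting on the first tab keeps tokens that contain spaces intact, which a whitespace split would not.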

Fixes and improvements

  • Fix BPE vocabulary restriction when words have a leading or trailing joiner
  • Raise an error when a multi-character joiner is used together with support_prior_joiner
  • [Python] Implement __hash__ method of pyonmttok.Token objects to be consistent with the __eq__ implementation
  • [Python] Declare pyonmttok.Tokenizer arguments (except mode) as keyword-only
  • [Python] Improve compatibility with Python 3.9
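The two Python API entries above (a `__hash__` consistent with `__eq__`, and keyword-only arguments) follow general Python conventions that can be sketched without the library. The `Token` class below is a hypothetical stand-in, not `pyonmttok.Token`; the set of fields it compares is an assumption made for illustration.

```python
class Token:
    """Hypothetical stand-in for a token value object (not pyonmttok.Token)."""

    # The bare `*` makes every following argument keyword-only, the same
    # mechanism that lets a constructor accept only its first argument
    # positionally.
    def __init__(self, surface, *, join_left=False, join_right=False):
        self.surface = surface
        self.join_left = join_left
        self.join_right = join_right

    def _key(self):
        # Single source of truth shared by __eq__ and __hash__.
        return (self.surface, self.join_left, self.join_right)

    def __eq__(self, other):
        if not isinstance(other, Token):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        # Equal objects must hash equally, so hash the same key __eq__ compares.
        return hash(self._key())


a = Token("hello", join_right=True)
b = Token("hello", join_right=True)
unique = {a, b}  # equal tokens collapse to one set entry

try:
    Token("hello", True)  # options passed positionally are rejected
    positional_rejected = False
except TypeError:
    positional_rejected = True
```

Deriving both methods from one key tuple keeps them in sync when fields are added, so equal tokens behave predictably as dictionary keys and set members.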