John Snow Labs Spark-NLP 2.1.0: DocumentAssembler and Tokenizer redesigned

@saif-ellafi saif-ellafi released this 13 Jul 21:41

Thank you for following up with the release candidates. This release is not backwards compatible because two basic annotators have been redesigned.
The Tokenizer now has easier-to-customize params and simplified exception management.
The DocumentAssembler param trimAndClearNewLines was redesigned into cleanupMode, which gives finer control over the cleanup process.
The Tokenizer now supports pretrained models, so you can access any of our language-based Tokenizers.
Another big introduction is the eval module, an optional Spark NLP sub-module that provides evaluation scripts to
make it easier to measure how your own models perform against a validation dataset, now using MLFlow.
Some work also began on metrics during training, starting now with the NerDLApproach.
Finally, we'll have Scaladocs ready for easy library reference.
Thank you for your feedback in our Slack channels.
Particular thanks to @csnardi for fixing a bug in one of the release candidates.


New Features

  • Spark NLP Eval module includes functions to evaluate NER and Spell Checkers with MLFlow (Python support and more annotators to come)
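
As a rough illustration of what a NER evaluation measures, here is a minimal plain-Python sketch of precision, recall and F1 over token labels. It is not the eval module's API (Python support for the module is still to come); the function name `ner_scores`, the token-level scoring, and the use of "O" as the non-entity label are all assumptions made for this example.

```python
def ner_scores(gold, predicted, outside="O"):
    """Token-level precision, recall and F1 over entity labels.

    Assumption: a token counts as correct only when the predicted label
    equals the gold label and that label is not the `outside` label.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    true_pos = sum(1 for g, p in zip(gold, predicted) if g == p and g != outside)
    pred_pos = sum(1 for p in predicted if p != outside)
    gold_pos = sum(1 for g in gold if g != outside)

    # Guard against division by zero when nothing was predicted or labelled.
    precision = true_pos / pred_pos if pred_pos else 0.0
    recall = true_pos / gold_pos if gold_pos else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


gold = ["B-PER", "I-PER", "O", "B-LOC", "O"]
predicted = ["B-PER", "O", "O", "B-LOC", "B-ORG"]
precision, recall, f1 = ner_scores(gold, predicted)
```

Entity-level (chunk) scoring is stricter than this token-level version, so figures from a real evaluation run may differ from what a sketch like this produces.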

Enhancements

  • DocumentAssembler new param cleanupMode allows the user to decide what kind of cleanup to apply to the source text
  • Tokenizer has been significantly enhanced to allow easier and more intuitive customization
  • Norvig and Symmetric spell checkers now report confidence scores in metadata
  • NerDLApproach now reports metrics and F1 scores, with automated dataset splitting through setTrainValidationProp
  • Began making progress towards OCR reporting more meaningful metadata (noise levels, confidence score, etc.), which lays the groundwork for further development
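
To give a feel for what different cleanup modes can do to a source string, here is a plain-Python approximation. It does not call Spark NLP. The mode names (`disabled`, `inplace`, `inplace_full`, `shrink`, `shrink_full`) are recalled from the Spark NLP documentation and should be checked against the Scaladocs, and the regular expressions are assumptions rather than the library's own.

```python
import re

# Assumed mode names; verify against the DocumentAssembler reference.
CLEANUP_MODES = ("disabled", "inplace", "inplace_full", "shrink", "shrink_full")

# Whitespace characters, or their stringified forms such as a literal "\n".
_FULL = r"\s|\\r|\\n|\\t"


def cleanup(text, mode="disabled"):
    """Approximate the effect of each cleanup mode on a raw string."""
    if mode == "disabled":
        return text                               # keep the source untouched
    if mode == "inplace":
        return re.sub(r"\s", " ", text)           # one space per whitespace char
    if mode == "inplace_full":
        return re.sub(_FULL, " ", text)           # also replace literal "\n", "\t"
    if mode == "shrink":
        return re.sub(r"\s+", " ", text.strip())  # trim, collapse whitespace runs
    if mode == "shrink_full":
        return re.sub("(?:" + _FULL + ")+", " ", text.strip())
    raise ValueError("unknown cleanup mode: %r" % mode)


# Two leading/trailing spaces, real newlines and a tab, plus a literal "\n".
raw = "  Spark NLP\n\n2.1.0\tis\\nout  "
for mode in CLEANUP_MODES:
    print(mode, repr(cleanup(raw, mode)))
```

In this sketch the `inplace` mode keeps the string length unchanged, which matters when later annotators rely on character offsets into the original text; the `shrink` variants trade that property for cleaner input.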

Bugfixes

  • Fixed Dependency Parser not reporting offsets correctly
  • Dependency Parser now only shows head token as part of the result, instead of pairs
  • Fixed NerDLModel not allowing users to pick noncontrib versions on Linux
  • Fixed a bug in embeddingsRef validation that allowed the user to override the ref when this should not be possible
  • Removed unintentional GC calls causing some performance issues

Framework

  • ResourceDownloader can now use credentials supplied through the standard AWS mechanisms (environment variables, credentials folder)
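
For readers unfamiliar with the two sources named above, the following plain-Python sketch shows one plausible lookup order: environment variables first, then the shared credentials file. It is not ResourceDownloader's code, which presumably delegates to the AWS SDK. The function name `resolve_aws_credentials` is hypothetical, while the variable names and the `~/.aws/credentials` location are the conventional AWS ones.

```python
import configparser
import os


def resolve_aws_credentials(env=None, credentials_path=None, profile="default"):
    """Return (access_key, secret_key, source), or None if nothing is found."""
    env = os.environ if env is None else env
    if credentials_path is None:
        credentials_path = os.path.join(
            os.path.expanduser("~"), ".aws", "credentials")

    # 1. Environment variables take precedence in this sketch.
    access_key = env.get("AWS_ACCESS_KEY_ID")
    secret_key = env.get("AWS_SECRET_ACCESS_KEY")
    if access_key and secret_key:
        return access_key, secret_key, "environment"

    # 2. Shared credentials file (INI format, one section per profile).
    parser = configparser.ConfigParser()
    parser.read(credentials_path)  # a missing file is silently skipped
    if parser.has_section(profile):
        access_key = parser.get(profile, "aws_access_key_id", fallback=None)
        secret_key = parser.get(profile, "aws_secret_access_key", fallback=None)
        if access_key and secret_key:
            return access_key, secret_key, "credentials file"

    return None
```

Passing `env` and `credentials_path` explicitly keeps the function easy to exercise without touching the real environment or home directory.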

Documentation

  • Scaladocs for Spark NLP reference
  • Added Google Colab walkthrough guide
  • Added Approach and Model class names in the reference documentation
  • Fixed various typos and outdated pieces in documentation