Release John Snow Labs Spark-NLP 2.1.0: DocumentAssembler and Tokenizer redesigned · JohnSnowLabs/spark-nlp

Thank you for following up with release candidates. This release breaks backward compatibility because two basic annotators have been redesigned.
The Tokenizer now has easier-to-customize params and simplified exception management.
DocumentAssembler's trimAndClearNewLines was redesigned into a cleanupMode param for finer control over the cleanup process.
Tokenizer now supports pretrained models, meaning you'll be able to access any of our language-based Tokenizers.
Another big introduction is the eval module, an optional Spark NLP sub-module that provides evaluation scripts to
make it easier to measure how your own models perform against a validation dataset, now using MLFlow.
Some work also began on metrics during training, starting now with the NerDLApproach.
Finally, we'll have Scaladocs ready for easy library reference.
Thank you for your feedback in our Slack channels.
Particular thanks to @csnardi for fixing a bug in one of the release candidates.
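
To picture what moving from a single trim flag to a cleanupMode buys you, here is a minimal plain-Python sketch. It is not the DocumentAssembler implementation: the function name `cleanup` and the three mode names used here are assumptions for illustration only, so check the reference documentation for the modes the library actually accepts.

```python
import re


def cleanup(text: str, mode: str = "disabled") -> str:
    """Illustrative cleanup function; mode names are assumptions, not the library API."""
    if mode == "disabled":
        # Leave the source text exactly as it was read.
        return text
    if mode == "inplace":
        # Swap every newline or tab for one space, so character offsets stay aligned.
        return re.sub(r"[\n\t]", " ", text)
    if mode == "shrink":
        # Collapse every run of whitespace into one space and trim both ends.
        return re.sub(r"\s+", " ", text).strip()
    raise ValueError(f"unknown cleanup mode: {mode}")


sample = "Hello,\n\tworld!  \n"
for mode in ("disabled", "inplace", "shrink"):
    print(mode, repr(cleanup(sample, mode)))
```

The point of a mode rather than a boolean is visible in the sketch: "inplace" preserves the text length, which matters when downstream annotators report offsets, while "shrink" trades offsets for cleaner text.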

New Features

Spark NLP Eval module includes functions to evaluate NER and Spell Checkers with MLFlow (Python support and more annotators to come)

Enhancements

DocumentAssembler new param cleanupMode allows the user to decide what kind of cleanup to apply to the source text
Tokenizer has been significantly enhanced to allow easier and more intuitive customization
Norvig and Symmetric spell checkers now report confidence scores in metadata
NerDLApproach now reports metrics and f1 scores with automated dataset splitting through setTrainValidationProp
Began work towards OCR reporting more meaningful metadata (noise levels, confidence score, etc.), which lays the groundwork for further development
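
As a rough illustration of what holding out a validation proportion and reporting f1 involves, the plain-Python sketch below splits a dataset by a proportion and scores token-level predictions. It does not reproduce how NerDLApproach computes its metrics, which these notes do not describe; the helper names `train_validation_split` and `precision_recall_f1` and the "O" outside label are assumptions made for the example.

```python
import random


def train_validation_split(rows, validation_prop, seed=0):
    """Shuffle rows and hold out a proportion of them for validation (hypothetical helper)."""
    if not 0.0 <= validation_prop < 1.0:
        raise ValueError("validation_prop must be in [0, 1)")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * validation_prop)
    # The first n_val shuffled rows become validation data, the rest training data.
    return shuffled[n_val:], shuffled[:n_val]


def precision_recall_f1(gold, predicted, outside="O"):
    """Token-level scores over entity labels; `outside` marks non-entity tokens."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if p != outside and p == g)
    fp = sum(1 for g, p in pairs if p != outside and p != g)
    fn = sum(1 for g, p in pairs if g != outside and p != g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


train, validation = train_validation_split(range(10), validation_prop=0.2)
print(len(train), len(validation))

gold = ["B-PER", "O", "B-LOC", "O"]
predicted = ["B-PER", "B-LOC", "O", "O"]
print(precision_recall_f1(gold, predicted))
```

Scoring on rows the model never trained on is what makes the reported f1 meaningful, which is why the split and the metric are bundled together here.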

Bugfixes

Fixed Dependency Parser not reporting offsets correctly
Dependency Parser now only shows head token as part of the result, instead of pairs
Fixed NerDLModel not allowing users to pick noncontrib versions on Linux
Fixed a bug in embeddingsRef validation allowing the user to override ref when not possible
Removed unintentional GC calls causing some performance issues

Framework

ResourceDownloader can now use credentials from the standard AWS sources (environment variables, credentials folder)
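
For readers unfamiliar with the standard AWS credential sources, the plain-Python sketch below shows the usual lookup order: the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables first, then a profile in the `~/.aws/credentials` file. It is not ResourceDownloader's code; the function `resolve_aws_credentials` and its parameters are hypothetical and exist only for this example.

```python
import configparser
import os
from pathlib import Path


def resolve_aws_credentials(env=None, credentials_path=None, profile="default"):
    """Return (access_key, secret_key) from env vars or a credentials file, else None."""
    env = os.environ if env is None else env

    # 1. Environment variables take priority.
    key = env.get("AWS_ACCESS_KEY_ID")
    secret = env.get("AWS_SECRET_ACCESS_KEY")
    if key and secret:
        return key, secret

    # 2. Fall back to the shared credentials file (INI format, one section per profile).
    path = Path(credentials_path) if credentials_path else Path.home() / ".aws" / "credentials"
    parser = configparser.ConfigParser()
    parser.read(path)  # a missing file is ignored and leaves the parser empty
    if parser.has_section(profile):
        section = parser[profile]
        key = section.get("aws_access_key_id")
        secret = section.get("aws_secret_access_key")
        if key and secret:
            return key, secret

    return None


# Usage with an explicit environment mapping, so nothing real is read.
print(resolve_aws_credentials(env={"AWS_ACCESS_KEY_ID": "id", "AWS_SECRET_ACCESS_KEY": "secret"}))
```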

Documentation

Scaladocs for Spark NLP reference
Added Google Colab walkthrough guide
Added Approach and Model class names in the reference documentation
Fixed various typos and outdated pieces in documentation
