John Snow Labs Spark-NLP 2.1.0: DocumentAssembler and Tokenizer redesigned
Thank you for following up with the release candidates. This release is backwards-breaking because two basic annotators have been redesigned.
The Tokenizer now has easier-to-customize params and simplified exception management.
DocumentAssembler's trimAndClearNewLines was redesigned into a cleanupMode param for finer control over the cleanup process.
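To illustrate what this kind of mode switch controls, here is a minimal plain-Python sketch. The mode names and the exact normalization rules here are assumptions for illustration only, not Spark NLP's actual implementation:

```python
import re

def cleanup(text: str, mode: str) -> str:
    """Sketch of a cleanup-mode dispatch (illustrative, not the library code)."""
    if mode == "disabled":
        return text  # leave the source untouched
    if mode == "shrink":
        # collapse runs of whitespace and newlines into single spaces, then trim
        return re.sub(r"\s+", " ", text).strip()
    raise ValueError(f"unknown cleanup mode: {mode}")

print(cleanup("  Hello\n\nworld  ", "shrink"))  # -> "Hello world"
```

The point of exposing a mode rather than a boolean flag is that new cleanup strategies can be added later without another breaking param change.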
Tokenizer now supports pretrained models, meaning you'll be able to access any of our language-based pretrained Tokenizers.
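As a rough sketch of the kind of customization a rule-based tokenizer with exception management allows, here is a plain-Python illustration. The `tokenize` function and its `exceptions` argument are hypothetical stand-ins, not the Tokenizer API:

```python
import re

def tokenize(text, exceptions=()):
    """Split on words and punctuation, but keep exception phrases whole."""
    tokens = []
    if exceptions:
        # isolate exception phrases so they survive as single tokens
        pattern = "|".join(re.escape(e) for e in exceptions)
        parts = re.split(f"({pattern})", text)
    else:
        parts = [text]
    for part in parts:
        if part in exceptions:
            tokens.append(part)
        else:
            tokens.extend(re.findall(r"\w+|[^\w\s]", part))
    return tokens

tokenize("Meet me in New York.", exceptions=["New York"])
# -> ["Meet", "me", "in", "New York", "."]
```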
Another big introduction is the eval module: an optional Spark NLP sub-module that provides evaluation scripts to make it easier to measure your own models against a validation dataset, now using MLflow.
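The core of any NER evaluation is span-level precision, recall, and F1 over a validation set. The sketch below shows those computations in plain Python; it is an illustration of the metrics involved, not the eval module's code, and the `ner_prf` name is made up:

```python
def ner_prf(gold, predicted):
    """Micro precision/recall/F1 over entity spans, the kind of
    numbers an evaluation script would log to a tracker like MLflow."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)  # exact span + label matches
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = {(0, 5, "PER"), (10, 18, "LOC")}
pred = {(0, 5, "PER"), (10, 18, "ORG")}
ner_prf(gold, pred)  # -> (0.5, 0.5, 0.5)
```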
Some work has also begun on metrics during training, starting with NerDLApproach.
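Reporting metrics during training relies on holding out part of the training data, as the setTrainValidationProp setting mentioned below does. A minimal sketch of such a proportional split, with a hypothetical function name and a fixed seed for reproducibility:

```python
import random

def train_validation_split(rows, validation_prop, seed=42):
    """Hold out a fraction of the data for validation (illustrative only)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)       # shuffle before splitting
    n_val = int(len(rows) * validation_prop)
    return rows[n_val:], rows[:n_val]       # (train, validation)

train, val = train_validation_split(range(100), validation_prop=0.2)
len(train), len(val)  # -> (80, 20)
```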
Finally, Scaladocs are now ready for easy library reference.
Thank you for your feedback in our Slack channels.
Particular thanks to @csnardi for fixing a bug in one of the release candidates.
New Features
- Spark NLP Eval module includes functions to evaluate NER and Spell Checkers with MLFlow (Python support and more annotators to come)
Enhancements
- DocumentAssembler's new param cleanupMode allows the user to decide what kind of cleanup to apply to the source
- Tokenizer has been significantly enhanced to allow easier and more intuitive customization
- Norvig and Symmetric spell checkers now report confidence scores in metadata
- NerDLApproach now reports metrics and F1 scores, with automated dataset splitting through setTrainValidationProp
- Began making progress towards OCR reporting more meaningful metadata (noise levels, confidence scores, etc.), laying the groundwork for further development
Bugfixes
- Fixed Dependency Parser not reporting offsets correctly
- Dependency Parser now only shows head token as part of the result, instead of pairs
- Fixed NerDLModel not allowing users to pick non-contrib versions on Linux
- Fixed a bug in embeddingsRef validation that allowed the user to override the ref when not possible
- Removed unintentional GC calls causing some performance issues
Framework
- ResourceDownloader is now capable of utilizing credentials from standard AWS sources (environment variables, credentials folder)
Documentation
- Scaladocs for Spark NLP reference
- Added Google Colab walkthrough guide
- Added Approach and Model class names in the reference documentation
- Fixed various typos and outdated pieces in documentation