John Snow Labs Spark-NLP 2.1.0-rc1: Tokenizer revamped, NerDLApproach metrics and eval module
Pre-release
This is a pre-release for 2.1.0. The tokenizer has been revamped, and some of the DocumentAssembler defaults have changed.
As a result, many pipelines and models may show changed accuracy and performance. The old tokenizer default rules
will be translated into a new English-specific pretrained Tokenizer.
NerDLApproach will now report metrics if setTrainValidationProp has been set, and spell checkers now report confidence scores.
DependencyParser output has been reviewed and fixed, along with a number of other bugs in the embeddings scope.
Please send feedback and report bugs, and remember: this is a pre-release, not yet intended for production use.
Join Slack!
Enhancements
- Norvig and Symmetric spell checkers now report confidence scores in metadata
- Tokenizer has been significantly enhanced to allow easier and faster customization
- NerDLApproach now reports metrics and F1 scores, with automated dataset splitting through setTrainValidationProp
- Made progress towards OCR reporting more meaningful metadata (noise levels, confidence scores, etc.), laying the groundwork for further development
- Added spark-nlp-eval, an evaluation module with multiple scripts that help users evaluate their models and pipelines. To be improved.
Bugfixes
- Fixed Dependency Parser not reporting offsets correctly
- Dependency Parser now only shows head token as part of the result, instead of pairs
- Fixed NerDLModel not allowing non-contrib versions to be picked on Linux
- Fixed a bug in embeddingsRef validation that allowed users to override the reference when that is not possible
Framework
- ResourceDownloader is now capable of utilizing credentials from standard AWS sources (environment variables, credentials file)
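For reference, the standard AWS sources are the usual ones, not Spark NLP-specific; for example, a credentials file in your home directory:

```ini
; ~/.aws/credentials — picked up automatically, as are the
; AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY environment variables
[default]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY
```

No Spark NLP-specific configuration should be needed once either source is in place.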
Documentation
- Added a Google Colab walkthrough guide
- Added Approach and Model class names in reference documentation
- Fixed various typos and outdated pieces in documentation