
Releases: JohnSnowLabs/spark-nlp

John Snow Labs Spark-NLP 2.2.0-rc3: BERT improvements, OCR Coordinates, python evaluation

20 Aug 23:29

We are glad to present the release candidate of this new release. Last time, following a release-candidate schedule proved
to be quite an effective way to avoid silly bugs right after release! Thanks to careful testing alongside the community, which resulted in various pull requests,
no breaking bugs made it through. This huge release features OCR-based coordinate highlighting, a BERT embeddings refactor and tuning, more tools for accuracy evaluation in Python, and much more.
As always, we welcome your feedback in our Slack channels!


New Features

  • OCRHelper now returns a coordinate position matrix for text converted from PDF
  • New annotator PositionFinder consumes OCRHelper positions to return rectangle coordinates for CHUNK annotator types
  • The evaluation module has now also been ported to Python
  • WordEmbeddings now includes coverage metadata, and the new static functions withCoverageColumn and overallCoverage offer coverage metrics
  • NerDL now has an includeConfidence param that adds confidence scores to prediction metadata
  • New BERT param poolingLayer allows pooling layer selection
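A minimal sketch of how the two new params above might be set, assuming the usual Spark NLP setter naming (setPoolingLayer, setIncludeConfidence) and that pretrained BERT weights are available:

```scala
import com.johnsnowlabs.nlp.embeddings.BertEmbeddings
import com.johnsnowlabs.nlp.annotators.ner.dl.NerDLApproach

// Select which BERT layer output to pool embeddings from
val bert = BertEmbeddings.pretrained()
  .setInputCols("sentence", "token")
  .setOutputCol("embeddings")
  .setPoolingLayer(-2)

// Ask NerDL to attach confidence scores to each prediction's metadata
val ner = new NerDLApproach()
  .setInputCols("sentence", "token", "embeddings")
  .setOutputCol("ner")
  .setLabelColumn("label")
  .setIncludeConfidence(true)
```

Both stages would then be placed in a regular Spark ML Pipeline as usual.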

Enhancements

  • BERT embeddings now integrate much better with Spark NLP, returning state-of-the-art accuracy numbers for NER (details to be expanded). Thank you for the community feedback.
  • Progress bar and size estimate reported when downloading pretrained models and loading embeddings
  • Model and pipeline caches are now managed more efficiently and include CRC checks (not retroactive)
  • Finisher and LightPipeline now handle embeddings properly, including them in the pre-processed result (thank you Will Held)
  • Tokenizer now allows regular expressions in its list of exceptions (thank you @atomobianco)
  • PretrainedPipeline now provides a fullAnnotate function to retrieve full Annotation information
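The tokenizer-exception and fullAnnotate enhancements can be sketched roughly as follows (the exception patterns and the pipeline name are illustrative):

```scala
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

// Exceptions may now contain regular expressions, not just literal tokens
val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")
  .setExceptions(Array("e-mail", "\\d+\\.\\d+"))

// fullAnnotate returns complete Annotation objects (metadata, offsets, etc.)
// rather than only the result strings
val pipeline = PretrainedPipeline("explain_document_dl", lang = "en")
val annotations = pipeline.fullAnnotate("John lives in New York.")
```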

Bugfixes

  • Fixed a bug in NerConverter caused by empty entities, which returned an error when flushing entities
  • Fixed a bug when creating BERT models from Python, where contrib libraries were not loaded
  • Fixed missing setters for the whitelist param in NerConverter
  • Fixed a bug where BERT model parameters were read incorrectly from Python because they were not serialized correctly
  • Fixed a bug where ResourceDownloader conflated S3 credentials with public model access (thank you Dimitris Manikis)
  • Fixed Context Spell Checker bugs, with performance improvements (pretrained model disabled until we have a better one)

John Snow Labs Spark-NLP 2.2.0-rc2: BERT improvements, OCR Coordinates, python evaluation

18 Aug 15:05

We are glad to present the release candidate of this new release. Last time, following a release-candidate schedule proved
to be quite an effective way to avoid silly bugs right after release! Thanks to careful testing alongside the community, which resulted in various pull requests,
no breaking bugs made it through. This huge release features OCR-based coordinate highlighting, a BERT embeddings refactor and tuning, more tools for accuracy evaluation in Python, and much more.
As always, we welcome your feedback in our Slack channels!


New Features

  • OCRHelper now returns a coordinate position matrix for text converted from PDF
  • New annotator PositionFinder consumes OCRHelper positions to return rectangle coordinates for CHUNK annotator types
  • The evaluation module has now also been ported to Python
  • WordEmbeddings now includes coverage metadata, and the new static functions withCoverageColumn and overallCoverage offer coverage metrics
  • NerDL now has an includeConfidence param that adds confidence scores to prediction metadata
  • New BERT param poolingLayer allows pooling layer selection
  • Progress bar reported when downloading models and loading embeddings

Enhancements

  • BERT embeddings now integrate much better with Spark NLP, returning state-of-the-art accuracy numbers for NER (details to be expanded). Thank you for the community feedback.
  • Model and pipeline caches are now managed more efficiently and include CRC checks (not retroactive)
  • Finisher and LightPipeline now handle embeddings properly, including them in the pre-processed result (thank you Will Held)
  • Tokenizer now allows regular expressions in its list of exceptions (thank you @atomobianco)
  • PretrainedPipeline now provides a fullAnnotate function to retrieve full Annotation information

Bugfixes

  • Fixed a bug in NerConverter caused by empty entities, which returned an error when flushing entities
  • Fixed a bug when creating BERT models from Python, where contrib libraries were not loaded
  • Fixed missing setters for the whitelist param in NerConverter
  • Fixed a bug where BERT model parameters were read incorrectly from Python because they were not serialized correctly

John Snow Labs Spark-NLP 2.1.1: Fixed flush entities bug, added missing setters to NerConverter

18 Aug 01:37

Thank you so much for your feedback on Slack. This release extends the life of the 2.1.x line with important bugfixes from upstream.


Bugfixes

  • Fixed a bug in NerConverter caused by empty entities, which returned an error when flushing entities
  • Fixed a bug when creating BERT models from Python, where contrib libraries were not loaded
  • Fixed missing setters for the whitelist param in NerConverter

John Snow Labs Spark-NLP 2.2.0-rc1: BERT improvements, OCR Coordinates, python evaluation

16 Aug 04:22

We are glad to present the first release candidate of this new release. Last time, following a release-candidate schedule allowed
us to move from 2.1.0 straight to 2.2.0! Thanks to careful testing alongside the community, which resulted in various pull requests,
no breaking bugs made it through.
This huge release features OCR-based coordinate highlighting, a BERT embeddings refactor and tuning, more tools for accuracy evaluation in Python, and much more.
As always, we welcome your feedback in our Slack channels!


New Features

  • OCRHelper now returns a coordinate position matrix for text converted from PDF
  • New annotator PositionFinder consumes OCRHelper positions to return rectangle coordinates for CHUNK annotator types
  • The evaluation module has now also been ported to Python
  • WordEmbeddings now includes coverage metadata, and the new static functions withCoverageColumn and overallCoverage offer coverage metrics
  • Progress bar reported when downloading models and loading embeddings

Enhancements

  • BERT embeddings now integrate much better with Spark NLP, returning state-of-the-art accuracy numbers for NER (details to be expanded). Thank you for the community feedback.
  • Model and pipeline caches are now managed more efficiently and include CRC checks (not retroactive)
  • Finisher and LightPipeline now handle embeddings properly, including them in the pre-processed result (thank you Will Held)
  • Tokenizer now allows regular expressions in its list of exceptions (thank you @atomobianco)

Bugfixes

  • Fixed a bug in NerConverter caused by empty entities, which returned an error when flushing entities
  • Fixed a bug when creating BERT models from Python, where contrib libraries were not loaded
  • Fixed missing setters for the whitelist param in NerConverter

John Snow Labs Spark-NLP 2.1.0: DocumentAssembler and Tokenizer redesigned

13 Jul 21:41

Thank you for following up with the release candidates. This release is backwards-breaking because two basic annotators have been redesigned.
The Tokenizer now has easier-to-customize params and simplified exception management.
DocumentAssembler's trimAndClearNewLines was redesigned into a cleanupMode param for finer control over the cleanup process.
Tokenizer now supports pretrained models, meaning you'll be able to access any of our language-specific Tokenizers.
Another big introduction is the eval module: an optional Spark NLP sub-module that provides evaluation scripts,
making it easier to measure how your own models perform against a validation dataset, now using MLflow.
Some work also began on metrics during training, starting with NerDLApproach.
Finally, we'll have Scaladocs ready for easy library reference.
Thank you for your feedback in our Slack channels.
Particular thanks to @csnardi for fixing a bug in one of the release candidates.


New Features

  • The Spark NLP eval module includes functions to evaluate NER and spell checkers with MLflow (Python support and more annotators to come)

Enhancements

  • DocumentAssembler's new param cleanupMode lets the user decide what kind of cleanup to apply to the source text
  • Tokenizer has been significantly enhanced to allow easier and more intuitive customization
  • Norvig and Symmetric spell checkers now report confidence scores in metadata
  • NerDLApproach now reports metrics and F1 scores, with automated dataset splitting through setTrainValidationProp
  • Began making progress towards OCR reporting more meaningful metadata (noise levels, confidence scores, etc.), laying the groundwork for further development
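A rough sketch of the cleanupMode and setTrainValidationProp additions described above (the "shrink" mode and the 0.1 split value are illustrative assumptions):

```scala
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.ner.dl.NerDLApproach

// cleanupMode replaces the old trimAndClearNewLines flag
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")
  .setCleanupMode("shrink")

// Hold out a proportion of the training set so metrics and F1 are reported
val ner = new NerDLApproach()
  .setInputCols("sentence", "token", "embeddings")
  .setOutputCol("ner")
  .setLabelColumn("label")
  .setTrainValidationProp(0.1f)
```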

Bugfixes

  • Fixed the Dependency Parser not reporting offsets correctly
  • The Dependency Parser now shows only the head token as part of the result, instead of pairs
  • Fixed NerDLModel not allowing non-contrib versions to be picked on Linux
  • Fixed a bug in embeddingsRef validation that allowed the user to override the ref when not possible
  • Removed unintentional GC calls that were causing performance issues

Framework

  • ResourceDownloader is now capable of using credentials from standard AWS sources (environment variables, credentials folder)
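Those standard AWS sources typically look like the following (values are placeholders):

```shell
# Environment variables picked up by the standard AWS credential chain
export AWS_ACCESS_KEY_ID=AKIA...
export AWS_SECRET_ACCESS_KEY=...

# ...or a named profile in the shared credentials file ~/.aws/credentials
```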

Documentation

  • Scaladocs for Spark NLP reference
  • Added Google Colab walkthrough guide
  • Added Approach and Model class names to the reference documentation
  • Fixed various typos and outdated sections in the documentation

John Snow Labs Spark-NLP 2.0.9: EmbeddingsRef fixes, disabled rule factory debugging

02 Jul 18:37

This release fixes a bug in the embeddingsRef param that caused embeddings not to be loadable when setIncludeEmbeddings was set to false.


Bugfixes

  • WordEmbeddingsModel can now be loaded using embeddingsRef correctly
  • Disabled RuleFactory debug mode spam messages

John Snow Labs Spark-NLP 2.1.0-rc2: Bugfixes in resource downloader

29 Jun 22:09

Release candidate #2 for 2.1.0

  • Fixed Tokenizer missing its pretrained() functions
  • Fixed an issue in metadata that prevented retrieving models with anonymous S3 credentials
  • Added metadata options to differentiate internally between pipelines and models
  • Fixed ResourceDownloader to correctly resolve release-candidate build versions

John Snow Labs Spark-NLP 2.1.0-rc1: Tokenizer revamped, NerDLApproach metrics and eval module

28 Jun 18:35

This is a pre-release for 2.1.0. The Tokenizer has been revamped, and some of the DocumentAssembler defaults have changed.
For this reason, many pipelines and models may now change in accuracy and performance. The old Tokenizer default rules
will be translated into a new English-specific pretrained Tokenizer.
NerDLApproach will now report metrics if setTrainValidationProp has been set, and spell checkers now report confidence scores.
The DependencyParser output has been reviewed, and a number of other bugs in the embeddings scope have been fixed.
Please send feedback and bug reports, and remember: this is a pre-release, not yet intended for production use.
Join Slack!


Enhancements

  • Norvig and Symmetric spell checkers now report confidence scores in metadata
  • Tokenizer has been significantly enhanced to allow easier and faster customization
  • NerDLApproach now reports metrics and F1 scores, with automated dataset splitting through setTrainValidationProp
  • Began making progress towards OCR reporting more meaningful metadata (noise levels, confidence scores, etc.), laying the groundwork for further development
  • Added the spark-nlp-eval evaluation module with multiple scripts that help users evaluate their models and pipelines. To be improved.

Bugfixes

  • Fixed the Dependency Parser not reporting offsets correctly
  • The Dependency Parser now shows only the head token as part of the result, instead of pairs
  • Fixed NerDLModel not allowing non-contrib versions to be picked on Linux
  • Fixed a bug in embeddingsRef validation that allowed the user to override the ref when not possible

Framework

  • ResourceDownloader is now capable of using credentials from standard AWS sources (environment variables, credentials folder)

Documentation

  • Added a Google Colab walkthrough guide
  • Added Approach and Model class names to the reference documentation
  • Fixed various typos and outdated sections in the documentation

John Snow Labs Spark-NLP 2.0.8: Model compatibility bugfixes

05 Jun 13:16

This release fixes a few small but meaningful issues that caused newly trained models to have internal compatibility problems.


Bugfixes

  • Fixed wrong logic when checking whether embeddingsRef is being overwritten in a WordEmbeddingsModel
  • Deleted an unnecessary chunk index from tokens
  • Fixed compatibility issues in newly trained models where the Python API had pretrained models mismatched with Scala

John Snow Labs Spark-NLP 2.0.7: Cluster compatibility improvements

02 Jun 13:25

This release addresses bugs related to cluster support, improving error messages and fixing various potential bugs that depend
on the cluster configuration, such as Kryo serialization or non-default filesystems.


Bugfixes

  • Fixed a bug introduced in 2.0.5 that caused NerDL not to work in clusters with Kryo serialization enabled
  • NerDLModel was not properly reading user-provided config proto bytes during prediction
  • Improved the cluster embeddings message to hint to users running cluster mode without shared filesystems
  • Removed lazy model downloading in PretrainedPipeline so the model downloads at instantiation
  • Fixed URI construction for cluster embeddings on non-defaultFS configurations, improving cluster compatibility