
Releases: JohnSnowLabs/spark-nlp

John Snow Labs Spark-NLP 2.2.0-rc3: BERT improvements, OCR Coordinates, python evaluation

20 Aug 23:29

We are glad to present the release candidate of this new release. Last time, following a release-candidate schedule proved
to be quite an effective way to avoid silly bugs right after release! Thanks to careful testing alongside the community, which resulted in various pull requests,
no breaking bugs made it through. This huge release features OCR-based coordinate highlighting, a BERT embeddings refactor and tuning, more tools for accuracy evaluation in Python, and much more.
As always, we welcome your feedback in our Slack channels!


New Features

  • OCRHelper now returns a coordinate position matrix for text converted from PDF
  • New annotator PositionFinder consumes OCRHelper positions to return rectangle coordinates for CHUNK annotator types
  • The evaluation module has now also been ported to Python
  • WordEmbeddings now includes coverage metadata, and the new static functions withCoverageColumn and overallCoverage offer coverage metrics
  • NerDL now has an includeConfidence param that adds confidence scores to prediction metadata
  • New BERT param poolingLayer allows pooling layer selection
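A minimal sketch of how the two new params above might be set, assuming the usual Spark NLP setter naming (setPoolingLayer, setIncludeConfidence) and that pretrained BERT weights are available:

```scala
import com.johnsnowlabs.nlp.embeddings.BertEmbeddings
import com.johnsnowlabs.nlp.annotators.ner.dl.NerDLApproach

// Select which BERT layer output to pool embeddings from
val bert = BertEmbeddings.pretrained()
  .setInputCols("sentence", "token")
  .setOutputCol("embeddings")
  .setPoolingLayer(-2)

// Ask NerDL to attach confidence scores to each prediction's metadata
val ner = new NerDLApproach()
  .setInputCols("sentence", "token", "embeddings")
  .setOutputCol("ner")
  .setLabelColumn("label")
  .setIncludeConfidence(true)
```

Both stages would then be placed in a regular Spark ML Pipeline as usual.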

Enhancements

  • BERT embeddings now integrate much better with Spark NLP, returning state-of-the-art accuracy numbers for NER (details to be expanded). Thank you for the community feedback.
  • Progress bar and size estimate reported when downloading pretrained models and loading embeddings
  • Model and pipeline caches are now managed more efficiently and include CRC checks (not retroactive)
  • Finisher and LightPipeline now handle embeddings properly, including them in the pre-processed result (thank you Will Held)
  • Tokenizer now allows regular expressions in its list of exceptions (thank you @atomobianco)
  • PretrainedPipeline now provides a fullAnnotate function to retrieve full Annotation information
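The tokenizer-exception and fullAnnotate enhancements can be sketched roughly as follows (the exception patterns and the pipeline name are illustrative):

```scala
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

// Exceptions may now contain regular expressions, not just literal tokens
val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")
  .setExceptions(Array("e-mail", "\\d+\\.\\d+"))

// fullAnnotate returns complete Annotation objects (metadata, offsets, etc.)
// rather than only the result strings
val pipeline = PretrainedPipeline("explain_document_dl", lang = "en")
val annotations = pipeline.fullAnnotate("John lives in New York.")
```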

Bugfixes

  • Fixed a bug in NerConverter caused by empty entities, which returned an error when flushing entities
  • Fixed a bug when creating BERT models from Python, where contrib libraries were not loaded
  • Fixed missing setters for the whitelist param in NerConverter
  • Fixed a bug where BERT model parameters were read incorrectly from Python because they were not serialized correctly
  • Fixed a bug where ResourceDownloader conflated S3 credentials with public model access (thank you Dimitris Manikis)
  • Fixed Context Spell Checker bugs, with performance improvements (pretrained model disabled until we have a better one)

John Snow Labs Spark-NLP 2.2.0-rc2: BERT improvements, OCR Coordinates, python evaluation

18 Aug 15:05

We are glad to present the release candidate of this new release. Last time, following a release-candidate schedule proved
to be quite an effective way to avoid silly bugs right after release! Thanks to careful testing alongside the community, which resulted in various pull requests,
no breaking bugs made it through. This huge release features OCR-based coordinate highlighting, a BERT embeddings refactor and tuning, more tools for accuracy evaluation in Python, and much more.
As always, we welcome your feedback in our Slack channels!


New Features

  • OCRHelper now returns a coordinate position matrix for text converted from PDF
  • New annotator PositionFinder consumes OCRHelper positions to return rectangle coordinates for CHUNK annotator types
  • The evaluation module has now also been ported to Python
  • WordEmbeddings now includes coverage metadata, and the new static functions withCoverageColumn and overallCoverage offer coverage metrics
  • NerDL now has an includeConfidence param that adds confidence scores to prediction metadata
  • New BERT param poolingLayer allows pooling layer selection
  • Progress bar reported when downloading models and loading embeddings

Enhancements

  • BERT embeddings now integrate much better with Spark NLP, returning state-of-the-art accuracy numbers for NER (details to be expanded). Thank you for the community feedback.
  • Model and pipeline caches are now managed more efficiently and include CRC checks (not retroactive)
  • Finisher and LightPipeline now handle embeddings properly, including them in the pre-processed result (thank you Will Held)
  • Tokenizer now allows regular expressions in its list of exceptions (thank you @atomobianco)
  • PretrainedPipeline now provides a fullAnnotate function to retrieve full Annotation information

Bugfixes

  • Fixed a bug in NerConverter caused by empty entities, which returned an error when flushing entities
  • Fixed a bug when creating BERT models from Python, where contrib libraries were not loaded
  • Fixed missing setters for the whitelist param in NerConverter
  • Fixed a bug where BERT model parameters were read incorrectly from Python because they were not serialized correctly

John Snow Labs Spark-NLP 2.1.1: Fixed flush entities bug, added missing setters to NerConverter

18 Aug 01:37

Thank you so much for your feedback on Slack. This release extends the life of the 2.1.x line with important bugfixes from upstream.


Bugfixes

  • Fixed a bug in NerConverter caused by empty entities, which returned an error when flushing entities
  • Fixed a bug when creating BERT models from Python, where contrib libraries were not loaded
  • Fixed missing setters for the whitelist param in NerConverter

John Snow Labs Spark-NLP 2.2.0-rc1: BERT improvements, OCR Coordinates, python evaluation

16 Aug 04:22

We are glad to present the first release candidate of this new release. Last time, following a release-candidate schedule allowed
us to move from 2.1.0 straight to 2.2.0! Thanks to careful testing alongside the community, which resulted in various pull requests,
no breaking bugs made it through.
This huge release features OCR-based coordinate highlighting, a BERT embeddings refactor and tuning, more tools for accuracy evaluation in Python, and much more.
As always, we welcome your feedback in our Slack channels!


New Features

  • OCRHelper now returns a coordinate position matrix for text converted from PDF
  • New annotator PositionFinder consumes OCRHelper positions to return rectangle coordinates for CHUNK annotator types
  • The evaluation module has now also been ported to Python
  • WordEmbeddings now includes coverage metadata, and the new static functions withCoverageColumn and overallCoverage offer coverage metrics
  • Progress bar reported when downloading models and loading embeddings

Enhancements

  • BERT embeddings now integrate much better with Spark NLP, returning state-of-the-art accuracy numbers for NER (details to be expanded). Thank you for the community feedback.
  • Model and pipeline caches are now managed more efficiently and include CRC checks (not retroactive)
  • Finisher and LightPipeline now handle embeddings properly, including them in the pre-processed result (thank you Will Held)
  • Tokenizer now allows regular expressions in its list of exceptions (thank you @atomobianco)

Bugfixes

  • Fixed a bug in NerConverter caused by empty entities, which returned an error when flushing entities
  • Fixed a bug when creating BERT models from Python, where contrib libraries were not loaded
  • Fixed missing setters for the whitelist param in NerConverter

John Snow Labs Spark-NLP 2.1.0: DocumentAssembler and Tokenizer redesigned

13 Jul 21:41

Thank you for following up with the release candidates. This release is backwards-breaking because two basic annotators have been redesigned.
The Tokenizer now has easier-to-customize params and simplified exception management.
DocumentAssembler's trimAndClearNewLines was redesigned into a cleanupMode param for finer control over the cleanup process.
Tokenizer now supports pretrained models, meaning you'll be able to access any of our language-specific Tokenizers.
Another big introduction is the eval module: an optional Spark NLP sub-module that provides evaluation scripts,
making it easier to measure how your own models perform against a validation dataset, now using MLflow.
Some work also began on metrics during training, starting with NerDLApproach.
Finally, we'll have Scaladocs ready for easy library reference.
Thank you for your feedback in our Slack channels.
Particular thanks to @csnardi for fixing a bug in one of the release candidates.


New Features

  • The Spark NLP eval module includes functions to evaluate NER and spell checkers with MLflow (Python support and more annotators to come)

Enhancements

  • DocumentAssembler's new param cleanupMode lets the user decide what kind of cleanup to apply to the source text
  • Tokenizer has been significantly enhanced to allow easier and more intuitive customization
  • Norvig and Symmetric spell checkers now report confidence scores in metadata
  • NerDLApproach now reports metrics and F1 scores, with automated dataset splitting through setTrainValidationProp
  • Began making progress towards OCR reporting more meaningful metadata (noise levels, confidence scores, etc.), laying the groundwork for further development
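A rough sketch of the cleanupMode and setTrainValidationProp additions described above (the "shrink" mode and the 0.1 split value are illustrative assumptions):

```scala
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.ner.dl.NerDLApproach

// cleanupMode replaces the old trimAndClearNewLines flag
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")
  .setCleanupMode("shrink")

// Hold out a proportion of the training set so metrics and F1 are reported
val ner = new NerDLApproach()
  .setInputCols("sentence", "token", "embeddings")
  .setOutputCol("ner")
  .setLabelColumn("label")
  .setTrainValidationProp(0.1f)
```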

Bugfixes

  • Fixed the Dependency Parser not reporting offsets correctly
  • The Dependency Parser now shows only the head token as part of the result, instead of pairs
  • Fixed NerDLModel not allowing non-contrib versions to be picked on Linux
  • Fixed a bug in embeddingsRef validation that allowed the user to override the ref when not possible
  • Removed unintentional GC calls that were causing performance issues

Framework

  • ResourceDownloader is now capable of using credentials from standard AWS sources (environment variables, credentials folder)
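Those standard AWS sources typically look like the following (values are placeholders):

```shell
# Environment variables picked up by the standard AWS credential chain
export AWS_ACCESS_KEY_ID=AKIA...
export AWS_SECRET_ACCESS_KEY=...

# ...or a named profile in the shared credentials file ~/.aws/credentials
```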

Documentation

  • Scaladocs for Spark NLP reference
  • Added Google Colab walkthrough guide
  • Added Approach and Model class names to the reference documentation
  • Fixed various typos and outdated sections in the documentation

John Snow Labs Spark-NLP 2.0.9: EmbeddingsRef fixes, disabled rule factory debugging

02 Jul 18:37

This release fixes a bug in the embeddingsRef param that caused embeddings not to be loadable when setIncludeEmbeddings was set to false.


Bugfixes

  • WordEmbeddingsModel can now be loaded using embeddingsRef correctly
  • Disabled RuleFactory debug mode spam messages

John Snow Labs Spark-NLP 2.1.0-rc2: Bugfixes in resource downloader

29 Jun 22:09

Release candidate #2 for 2.1.0

  • Fixed Tokenizer missing its pretrained() functions
  • Fixed an issue in metadata that prevented retrieving models with anonymous S3 credentials
  • Added metadata options to differentiate internally between pipelines and models
  • Fixed ResourceDownloader to correctly resolve release-candidate build versions

John Snow Labs Spark-NLP 2.1.0-rc1: Tokenizer revamped, NerDLApproach metrics and eval module

28 Jun 18:35

This is a pre-release for 2.1.0. The Tokenizer has been revamped, and some of the DocumentAssembler defaults have changed.
For this reason, many pipelines and models may now change in accuracy and performance. The old Tokenizer default rules
will be translated into a new English-specific pretrained Tokenizer.
NerDLApproach will now report metrics if setTrainValidationProp has been set, and spell checkers now report confidence scores.
The DependencyParser output has been reviewed, and a number of other bugs in the embeddings scope have been fixed.
Please send feedback and bug reports, and remember: this is a pre-release, not yet intended for production use.
Join Slack!


Enhancements

  • Norvig and Symmetric spell checkers now report confidence scores in metadata
  • Tokenizer has been significantly enhanced to allow easier and faster customization
  • NerDLApproach now reports metrics and F1 scores, with automated dataset splitting through setTrainValidationProp
  • Began making progress towards OCR reporting more meaningful metadata (noise levels, confidence scores, etc.), laying the groundwork for further development
  • Added the spark-nlp-eval evaluation module with multiple scripts that help users evaluate their models and pipelines. To be improved.

Bugfixes

  • Fixed the Dependency Parser not reporting offsets correctly
  • The Dependency Parser now shows only the head token as part of the result, instead of pairs
  • Fixed NerDLModel not allowing non-contrib versions to be picked on Linux
  • Fixed a bug in embeddingsRef validation that allowed the user to override the ref when not possible

Framework

  • ResourceDownloader is now capable of using credentials from standard AWS sources (environment variables, credentials folder)

Documentation

  • Added a Google Colab walkthrough guide
  • Added Approach and Model class names to the reference documentation
  • Fixed various typos and outdated sections in the documentation

John Snow Labs Spark-NLP 2.0.8: Model compatibility bugfixes

05 Jun 13:16

This release fixes a few small but meaningful issues that caused newly trained models to have internal compatibility problems.


Bugfixes

  • Fixed wrong logic when checking whether embeddingsRef is being overwritten in a WordEmbeddingsModel
  • Deleted an unnecessary chunk index from tokens
  • Fixed compatibility issues in newly trained models where the Python API had pretrained models mismatched with Scala

John Snow Labs Spark-NLP 2.0.7: Cluster compatibility improvements

02 Jun 13:25

This release addresses bugs related to cluster support, improving error messages and fixing various potential bugs that depend
on the cluster configuration, such as Kryo serialization or non-default filesystems.


Bugfixes

  • Fixed a bug introduced in 2.0.5 that caused NerDL not to work in clusters with Kryo serialization enabled
  • NerDLModel was not properly reading user-provided config proto bytes during prediction
  • Improved the cluster embeddings message to hint to users running cluster mode without shared filesystems
  • Removed lazy model downloading in PretrainedPipeline so the model downloads at instantiation
  • Fixed URI construction for cluster embeddings on non-defaultFS configurations, improving cluster compatibility