
Releases: JohnSnowLabs/spark-nlp

John Snow Labs Spark-NLP 1.8.1: ML SentenceDetector, improved ContextSpellChecker and bugfixes

26 Jan 00:20

Overview

This hotfix version of Spark-NLP improves framework support by adding Maven coordinates for OCR and allowing S3 retrieval of files.
We also included code for generating graphs for NerDL, as well as for creating your own metadata files for a private model downloader.
As a new feature, we are including an experimental machine-learning-based sentence detector, which uses NER for sentence boundary detection.
Aside from this, we are including a few bug fixes and OCR improvements. Enjoy! And thanks again for the community contributions!


New Features

  • New DeepSentenceDetector annotator takes Spark-NLP's NER Deep Learning models as a base to improve sentence detection
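
Below is a minimal, hypothetical sketch of how the new annotator might be wired into a pipeline. It assumes DeepSentenceDetector follows the standard setInputCols/setOutputCol annotator pattern and consumes NER output; exact package paths and input column requirements should be checked against the API docs.

```scala
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline

// Sketch only: column names are illustrative, and the exact inputs the
// annotator expects (document/token/ner) are an assumption.
val document = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val token = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

// A pretrained NER model supplies the entity bounds used for detection
val ner = NerDLModel.pretrained()
  .setInputCols("document", "token")
  .setOutputCol("ner")

val deepSentence = new DeepSentenceDetector()
  .setInputCols("document", "token", "ner")
  .setOutputCol("sentence")

val pipeline = new Pipeline()
  .setStages(Array(document, token, ner, deepSentence))
```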

Enhancements

  • Improved accuracy of ContextSpellChecker by enabling re-ranking of candidate words according to a weighted Levenshtein distance
  • The OCR process now defaults to splitting content into rows when either paragraphs or pages are identified, for improved parallelism. This behavior may be turned off

Examples and use cases

  • Added Scala examples for sentiment analysis and lemmatization in Italian (thanks Vincenzo Gaudenzi from DXC.technology for the dataset and model contribution!)

Bugfixes

  • Fixed a bug in the Norvig and Symmetric spell checkers where the pattern parameter was not provided properly on the Scala side (thanks @johnmccain for reporting!)

Framework

  • Added hadoop-aws dependency for remote download capabilities (e.g. word embeddings sets)

Other

  • Code for generating metadata files for pretrained model downloads is now included. This may be useful for anyone who wants to set up a private local model downloader service
  • NerDL graph generation code is now included in the library. This allows the use of custom word embedding dimensions and feature counts.

Special mentions

  • Vincenzo Gaudenzi (DXC.technology) for contributing Italian datasets and models, and @maziyarpanahi for creating examples with them
  • @correlator from Deep6.ai for contributing feedback in Slack and feature feedback in general
  • @johnmccain for reporting bugs in the spell checker
  • @rohit-nlp for delivering Maven coordinates for OCR
  • @haimco10 for contributing a sentence detector improvement for an apostrophe use case (not merged due to specific issues involved)

John Snow Labs Spark-NLP 1.8.0: Dependency Parser, Context Spell Checker and Spark 2.4.0

23 Dec 06:16

Overview

This release is huge! Spark-NLP made the leap into Spark 2.4.0, even with the challenge of not everyone being on board there yet (e.g. Zeppelin doesn't yet support it).
In this version we release three new NLP annotators: two for dependency parsing and one for contextual, deep-learning-based spell checking.
We also significantly improved OCR functionality, fine-tuning capabilities and general output performance, particularly on Tesseract.
Finally, there are plenty of bug fixes and improvements in the word embeddings field, along with performance boosts and reduced disk IO.
Feel free to send us any feedback you have! Particularly on your Spark 2.4.x experience.


New Features

  • Built on top of Spark 2.4.0
  • Dependency Parser annotator allows for sentence relationship encoding
  • Typed Dependency Parser annotator allows for labeling relationships within dependency tags
  • ContextSpellChecker is our first Deep Learning based Spell Checker that evaluates context and not only tokens

Enhancements

  • More OCR parameters exposed for further fine tuning, including preferred methods priority and page segmentation modes
  • OCR now has a setting setSplitPages() which allows setting whether to output one page per row or the entire document instead
  • Improved word embeddings performance when working in local filesystems
  • Reduced the amount of disk IO when working with Word Embeddings
  • All python notebooks improved for better readability and better documentation
  • Simplified PySpark interface API
  • New CoNLLGenerator utility class, which helps build CoNLL-2003 files for NER training
  • EmbeddingsHelper now allows reading word embeddings files directly from s3a:// paths
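
As a rough illustration of the s3a:// support, here is a hedged sketch; the exact EmbeddingsHelper.load signature (format string, dimensions, case-sensitivity flag) is an assumption based on the 1.8.x API and should be verified against the docs.

```scala
import com.johnsnowlabs.nlp.embeddings.EmbeddingsHelper

// Sketch only: bucket, file name and parameter list are illustrative.
val embeddings = EmbeddingsHelper.load(
  "s3a://my-bucket/embeddings/glove.6B.100d.txt", // hypothetical location
  spark,    // active SparkSession
  "TEXT",   // embeddings file format
  100,      // embedding dimensions
  false     // case-sensitive lookup
)
```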

Bugfixes

  • Solved race-condition issues regarding cluster usage of the RocksDB index for embeddings
  • Fixed an application.conf reading bug which didn't properly refresh AWS credentials
  • The RocksDB index no longer uses compression, in order to support Windows without native RocksDB compression libraries
  • Fixed various Python default parameter settings
  • Fixed circular dependency with jbig pdfbox image OCR

Deprecations

  • DeIdentification annotator is no longer supported in the open source version of Spark-NLP
  • AssertionStatus annotator is no longer supported in the open source version of Spark-NLP

John Snow Labs Spark-NLP 1.7.3: Fixed cluster-mode word embeddings on pretrained and improved PySpark API

11 Nov 23:09

Overview

This hotfix release focuses on fixing word-embeddings cluster problems on some frameworks such as Databricks, while keeping the 1.7.x performance benefits. Various YARN-based clusters have been tested, Databricks cloud among them, to validate this hotfix.
Aside from that, multiple improvements have been committed towards better PySpark-NLP support, fixing diverse technical issues in the API that help consistency in the Annotator super classes.
Finally, pip installation has been made easier with a SparkNLP class that creates the SparkSession automatically, for those who are learning Python Spark on their local computers.
Thanks to all the community for reporting issues.


Bugfixes

  • Fixed 'RocksDB not serializable' when running LightPipeline scenarios or using _.functions implicits
  • Fixed dependency with apache.commons.codec causing Apache Zeppelin 0.8.0 not to work in %pyspark
  • Fixed Python pretrained() downloader not correctly setting Params and incorrectly creating new Model UIDs
  • Fixed error 'JavaPackage not callable' when using AnnotatorModel.load() API without instantiating the class first
  • Fixed Spark addFiles missing a local file, causing Word Embeddings not to work properly in some cluster-based frameworks
  • Fixed a broadcast NoSuchElementException (Failed to get broadcast_6_piece0 of broadcast_6) causing pretrained models not to work in cluster frameworks (thanks @EnricoMi)

Developer API

  • EmbeddingsHelper.setRef() has been removed. The reference is now set implicitly through EmbeddingsHelper.load(). Embeddings no longer need to be loaded before deserializing models.
  • Fixed and properly renamed the chunk2doc and doc2chunk transformers; they should now work as expected
  • Renamed setCompositeTokens to setCompositeTokensPatterns to remind users that regex patterns are used in this Param
  • Fixed PySpark automatic getter and setter Param generation when using pretrained() or load() models
  • Simplified cluster path resolution for word embeddings

Other

  • sparknlp.base now contains a SparkNLP() class which automatically creates a SparkSession using appropriate jar settings. This helps newcomers get started in PySpark NLP.

John Snow Labs Spark-NLP 1.7.2: Cluster deserialization, application.conf runtime read fix, hotfixes

20 Oct 23:14

Overview

Quick release with another hotfix, due to a newly found bug when deserializing word embeddings on a distributed filesystem. It also introduces changes to the application.conf reader in order
to allow run-time changes, and renames parts of the EmbeddingsHelper API.


Bugfixes

  • Fixed embeddings deserialization from distributed filesystems (caused by the Windows path fix)
  • Fixed application.conf not reading changes at runtime
  • Added missing remote_locs argument in Python pretrained() functions
  • Fixed the wrong build version introduced in 1.7.1, which prevented detecting the proper pretrained models version

Developer API

  • Renamed EmbeddingsHelper functions for more convenience

John Snow Labs Spark-NLP 1.7.1: Word embeddings deserialization hotfix, windows path fix, Chunk2Doc transformer

19 Oct 22:33

Overview

Thanks to our Slack community (Bryan Wilkinson, @maziyarpanahi, @apiltamang), a few bugs were pointed out very quickly after the 1.7.0 release. This hotfix fixes an embeddings deserialization issue when cache_pretrained is located on a distributed filesystem.
It also fixes some path resolution on Windows. Thanks to Maziyar, a .gitattributes file has been added in order to identify the proper languages in GitHub.
Finally, 1.7.1 adds Chunk2Doc, an annotator missing from 1.7.0, which converts CHUNK types into DOCUMENT types, for further retokenization or other annotations.


Enhancements

  • Chunk2Doc annotator converts annotatorType from CHUNK to DOCUMENT
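
A minimal sketch of the new transformer, assuming it follows the standard input/output column pattern (the column names here are illustrative):

```scala
import com.johnsnowlabs.nlp.Chunk2Doc

// Turns CHUNK annotations (e.g. produced by NerConverter or TextMatcher)
// back into DOCUMENT annotations so they can be re-tokenized or re-annotated.
val chunk2doc = new Chunk2Doc()
  .setInputCols("ner_chunk") // hypothetical upstream CHUNK column
  .setOutputCol("chunk_doc")
```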

Bugfixes

  • Fixed embedding-based annotators deserialization error when cache_pretrained is on distributed fs (Thanks Bryan Wilkinson for pointing out issue and testing fix)
  • Fixed windows path reading when deserializing embeddings (Thanks @apiltamang)

Other

  • .gitattributes added in order to properly discard Jupyter as the main language for the GitHub repo (thanks @maziyarpanahi)

John Snow Labs Spark-NLP 1.7.0: Decoupled word embeddings, better windows support

16 Oct 05:49

Overview

Having multiple annotators that use the same word embeddings set may result in huge pipelines and high driver memory and storage consumption.
From now on, embeddings may be shared and reused across annotators, making the process much more efficient.
Also, thanks to @apiltamang, we now better support path resolution for Windows implementations.


Enhancements

  • Memory and storage savings: annotators with embeddings expose the Params includeEmbeddings and embeddingsRef, which let you set whether embeddings should be included when the annotator is saved, or referenced by id from other annotators (see the sketch below)
  • New EmbeddingsHelper class allows embeddings management
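
A hedged sketch of how the new Params might be used; the setter names are assumed from the Param names in these notes, and setEmbeddingsSource is shown only as a plausible companion call:

```scala
import com.johnsnowlabs.nlp.annotator._

// Sketch only: setter names are assumptions derived from the Param names.
val ner = new NerDLApproach()
  .setInputCols("sentence", "token")
  .setOutputCol("ner")
  .setEmbeddingsSource("glove.6B.100d.txt", 100, "TEXT") // hypothetical source
  .setEmbeddingsRef("glove_100d")  // shared id other annotators can reference
  .setIncludeEmbeddings(false)     // don't serialize embeddings with the model
```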


Bug fixes

  • Improved URI path support for Windows servers (thanks @apiltamang)


Developer API

  • Embeddings interfaces and method names completely refactored; hopefully simplified and easier to understand

John Snow Labs Spark-NLP 1.6.3: DeIdentification annotator, better OCR and bugfixes

17 Sep 17:45

Overview

This release includes a new annotator for de-identification of sensitive information. It uses CHUNK annotations, meaning its accuracy will depend on the previous annotators in the pipeline.
Also, OCR capabilities have been improved in the OCR module.
In terms of broken stuff, we've fixed a few annoying bugs in SymmetricDelete and the SentenceDetector explode feature.
Finally, the library is now part of the official pip repositories, meaning you can install it just like any other module. The pip package also includes the jars, and we've added a SparkNLP class which creates the SparkSession easily for you.
Thanks again for all the community contributions in issues, feedback and comments on GitHub and in Slack.


New features

  • New DeIdentification annotator: takes DOCUMENT and TOKEN from the original sentence, plus a CHUNK annotation, to anonymize the target chunk in the sentence. The CHUNK annotation might come from NerConverter, TextMatcher or other chunk annotators (see the sketch below)
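
A hedged sketch, under the assumption that DeIdentification follows the standard annotator pattern described above (column names are illustrative):

```scala
import com.johnsnowlabs.nlp.annotator._

// Sketch only: consumes the original sentence plus the chunk to anonymize.
val deid = new DeIdentification()
  .setInputCols("sentence", "token", "ner_chunk") // CHUNK e.g. from NerConverter
  .setOutputCol("deidentified")
```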

Enhancements

  • OCR: kernel zoom and region erosion improve overall detection quality. Fixed some stability bugs. Improved parallelism

Bug fixes

  • SentenceDetector's explode-sentences-into-rows feature now works properly
  • Fixed the dictionary-based sentiment detector not working in PySpark
  • Added missing NerConverter to annotator._ imports
  • Fixed the SymmetricDelete spell checker deleting tokens in some scenarios
  • Fixed the SymmetricDelete spell checker's unwanted lower-casing

Other

  • Spark-NLP is now part of the official pip repositories
  • pip installation now includes the corresponding spark-nlp jar; the base module includes a SparkNLP SparkSession creator

John Snow Labs Spark-NLP 1.6.2: Performance reviewed annotators and NerConverter fixes

20 Aug 16:48

Overview

In this release, we focused on reviewing our streaming performance by measuring the number of sentences processed per second through a LightPipeline.
We sped up the Norvig spell checker by more than 300% by disabling DoubleVariants and improving algorithmic order. It is now reported to be capable of 42K sentences per second.
The Symmetric Delete spell checker is more accurate, although it has been reported to process 2K sentences per second.
NerCRF has been reported to process 300 sentences per second, while NerDL runs twice as fast (about 700 sentences per second).
Vivekn Sentiment Analysis was improved and is now capable of processing 100K sentences per second (before, it was below 500).
Finally, SentenceDetector performance was improved by 40%, from ~30K rows processed per second to ~40K. However, we have now enabled abbreviation processing by default, which reduces final speed to 22K rows per second; a net slowdown, but with better accuracy.
Again, thanks to the community for helping with feedback. We welcome everyone to ask questions or give feedback in our Slack channel, or to report issues on GitHub.
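
For context, a LightPipeline wraps an already fitted Spark PipelineModel so that single strings can be annotated in local memory without Spark job overhead, which is what makes it suitable for these per-second measurements. A brief sketch, assuming `model` is a previously fitted PipelineModel:

```scala
import com.johnsnowlabs.nlp.LightPipeline

// `model` is assumed to be a fitted org.apache.spark.ml.PipelineModel.
val light = new LightPipeline(model)

// Annotates a single string in memory; returns output column -> results.
val result = light.annotate("Spark-NLP processes sentences quickly.")
```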


Enhancements

  • OCR now features kernel segmentation. Significantly improves image based PDF processing
  • Vivekn Sentiment Analysis prediction performance improved by better data structures
  • Both Norvig and Symmetric Delete spell checkers now have improved performance
  • SentenceDetector improved accuracy by better handling abbreviations. useAbbreviations is now also turned on by default
  • SentenceDetector improved performance significantly by improved preloading of rules

Bug fixes

  • Fixed NerDL not training correctly (broken since 1.6.0). Pretrained models not affected
  • Fixed NerConverter not properly considering multiple sentences per row (after using SentenceDetector), causing an unhandled exception to occur in some scenarios.
  • Tensorflow sessions now all support allow_soft_placement, supporting GPU based graphs to work with and without GPU
  • Norvig spell checker: fixed a missing step in the algorithm that checks for additional variants. May improve accuracy
  • Norvig spell checker: disabled DoubleVariants by default. It was not improving accuracy significantly and was hitting performance very hard

Developer API

  • New FeatureSet allows HashSet params

Models

  • Vivekn Sentiment Pipeline doesn't have Spell Checker anymore
  • Fixed the Vivekn Sentiment pretrained model; improved accuracy

John Snow Labs Spark-NLP 1.6.1: Chunk type annotator, S3 fixes, explode sentences

09 Aug 23:34

Overview

Hi! We're glad to announce new hotfix 1.6.1. Although the changes seem modest or very specific, there is a lot going on under the hood.

First of all, we've worked hard with the community to understand S3-based clusters, which don't have a common fs.defaultFS configuration; this is the setting we use to tell where the cluster temp folder is located in order to distribute word embeddings.
We fixed two things here: on one side, we fixed a bug pointing to the wrong filesystem; on the other, we added a custom override setting in application.conf that allows manually setting where to put temp folders in the cluster. This should help S3 users.
Please share your feedback in this regard.

On the other hand, we created a new annotator type internally. The CHUNK type allows better modularity in the communication between different annotators. The impact will be noticed implicitly and over time.


New features

  • New Scala-only functions that make it easier to work with Annotations in DataFrames. They may be imported through com.johnsnowlabs.nlp.functions._ and allow mapping and filtering within and outside Annotations.
    filterByAnnotations, mapAnnotations and explodeAnnotations work by providing a column and a function. Check out the documentation; these may come to Python later.
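
A hedged sketch of the kind of call these functions enable; the exact signatures are assumptions based on the description above (a column plus a function over its Annotations), and `annotated` is assumed to be a DataFrame produced by a fitted pipeline:

```scala
import com.johnsnowlabs.nlp.Annotation
import com.johnsnowlabs.nlp.functions._

// Sketch only: keeps rows whose "ner_chunk" column contains at least one
// annotation; the (column, predicate) shape is an assumption.
val filtered = annotated.filterByAnnotations("ner_chunk",
  (annotations: Seq[Annotation]) => annotations.nonEmpty)
```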

Bug fixes

  • Fixed incorrect filesystem readings in some S3 environments for word embeddings
  • Fixed NerCRF not correctly training from CoNLL, labeling everything as -O- (thanks @arnound from the Slack channel)

Enhancements

  • Added overridable config sparknlp.settings.cluster_tmp_dir, which allows setting the cluster location for the temporary embeddings file. May help S3-based clusters with no fs.defaultFS set to a proper distributed storage
  • New annotator type: CHUNK. Represents a SUBSTRING of DOCUMENT, and is used as output from NerConverter, TextMatcher, RegexMatcher and other annotators that retrieve a substring from the original document.
    This will make for better modularity and integration across various annotators, such as between NER and AssertionStatus.
  • New annotation transformer: ChunkAssembler. Takes a string or array(string) column from a dataset and creates a CHUNK type annotation. The content must also belong to the current DOCUMENT annotation's content
  • New SentenceDetector param explodeSentences allows exploding sentences within a single row into different rows, increasing parallelism and performance in some scenarios, particularly OCR-based ones (see the sketch after this list)
  • AssertionDLApproach may now be used within LightPipelines
  • AssertionDLApproach and AssertionLogRegApproach now work from the CHUNK type instead of start/end bounds. They may still be trained with start/end, though. This means the target for assertion may now be any CHUNK output annotator (e.g. RegexMatcher)
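
A short sketch of the new explodeSentences param, assuming the setter follows the usual setX naming convention:

```scala
import com.johnsnowlabs.nlp.annotator._

// One sentence per output row increases parallelism for downstream stages,
// which is particularly useful for long OCR documents.
val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")
  .setExplodeSentences(true) // setter name assumed from the Param name
```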

Other

  • PerceptronApproachLegacy moved back to default PerceptronApproach. Distributed PerceptronApproach moved to PerceptronApproachDistributed due to not meeting accuracy expectations yet.
  • Some configuration parameters in application.conf have been appropriately moved to proper annotator Params (NorvigSweeting Spell Checker, Vivekn Approach and Sentiment Detector affected)
  • Renamed application.conf configuration values for better consistency

Developer API

  • Added beforeAnnotate() and afterAnnotate() to manipulate dataframes before or after calling the annotate() UDF
  • Added extraValidate() and extraValidateMsg() in all annotators to let developers add additional schema checks in the transformSchema() stage
  • Removed validation() stage in fit() stage. Allows for more flexible training when some of the columns are not really required yet.
  • WrapColumnMetadata() will wrap an Annotation column with its appropriate Metadata. Makes it easier not to forget about Metadata in Schema.
  • RawAnnotator trait has now all the basics needed to start a new Annotator without annotate() function. It is a complete previous stage before AnnotatorModel, which inherits from RawAnnotator.

John Snow Labs Spark-NLP 1.6.0: OCR to Dataframe, Chunker annotator, fixed AWS

07 Jul 08:33

Overview

We're late! But it was worth it. We're glad to release 1.6.0, which brings new features, lots of enhancements and many bugfixes. First of all, we are thankful to the community for participating in Slack and on GitHub by reporting feedback and issues.
This release brings a new annotator, the Chunker, which grabs pieces of text that follow a particular part-of-speech pattern.
On the other hand, we have a brand new OCR-to-Spark-Dataframe utility, bundled as an optional component of Spark-NLP. It requires Tesseract 4.x+ to be installed on your system, and may be downloaded from our website or readme pages.
Aside from that, we improved many areas, from the DocumentAssembler working better with OCR output, down to our deep learning models gaining better consistency and accuracy. Word-embedding-based annotators also received improvements for cluster environments.
Finally, we are glad a user contributed a fix to the AWS dependency issue, which particularly affected Cloudera environments. We're still waiting for feedback, and gladly accept it.
We'll be working on the documentation as this release follows. Thank you.


New Features

  • New annotator: Chunker. This annotator takes a regex over part-of-speech tags and returns appropriate chunks of text following such patterns (see the sketch after this list)
  • OCR to Spark-NLP: as an optional jar module, users may use the OcrHelper class in order to convert PDF files into a Spark Dataset, ready to be used by Spark-NLP's DocumentAssembler. May be used without Spark-NLP. Requires Tesseract 4.x on your system
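
A hedged sketch combining both new features: OcrHelper reads PDFs into a Dataset and the Chunker extracts noun-phrase-like spans. The createDataset method name and the setRegexParsers setter are assumptions for this version, and a Tesseract 4.x installation is required:

```scala
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator._
import com.johnsnowlabs.nlp.util.io.OcrHelper
import org.apache.spark.ml.Pipeline

// Sketch only: method and package names are assumptions for 1.6.0.
val data = OcrHelper.createDataset(spark, "path/to/pdfs")

val document = new DocumentAssembler().setInputCol("text").setOutputCol("document")
val token = new Tokenizer().setInputCols("document").setOutputCol("token")
val pos = PerceptronModel.pretrained().setInputCols("document", "token").setOutputCol("pos")

// Grabs token spans whose POS tags match the pattern: an optional
// determiner, any number of adjectives, then one or more nouns.
val chunker = new Chunker()
  .setInputCols("document", "pos")
  .setOutputCol("chunk")
  .setRegexParsers(Array("<DT>?<JJ>*<NN>+"))

val pipeline = new Pipeline().setStages(Array(document, token, pos, chunker))
val chunked = pipeline.fit(data).transform(data)
```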

Enhancements

  • TextMatcher now has a caseSensitive (setCaseSensitive) Param which sets whether matching should be case sensitive (ignored if the Normalizer already handled casing). The returned word is still the original (see the sketch after this list)
  • LightPipelines in Python should now be faster thanks to an optimization that prefetches results into Python memory instead of using the py4j bridge
  • LightPipelines can now handle embedded Pipelines
  • PerceptronApproach now trains using the full Spark distributed algorithm. Still experimental. PerceptronApproachLegacy may still be used, and might be better for local non-cluster setups
  • Tokenizer now has a param includeDefaults which may be set to False to disable all preset rules
  • WordEmbedding-based annotators may now normalize tokens before matching embeddings vectors through the useNormalizedTokensForEmbeddings Param. Generally improves consistency with less overfitting
  • DocumentAssembler may now better deal with large amounts of text by using trimAndClearNewLines, to work better with OCR outputs and be better prepared for further sentence detection
  • Improved SentenceDetector handling of enumerations and lists
  • Slightly improved SentenceDetector performance through non-tail-recursive optimizations
  • Finisher no longer has default delimiters when outputting to String (not Array) (thanks @S_L)
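
A brief sketch of the new TextMatcher Param; the entity-loading call is an assumption for this version:

```scala
import com.johnsnowlabs.nlp.annotator._

// Sketch only: assumes entities are read from a plain-text file of phrases.
val matcher = new TextMatcher()
  .setInputCols("document", "token")
  .setOutputCol("entity")
  .setEntities("entities.txt") // hypothetical phrase list, one per line
  .setCaseSensitive(false)     // match regardless of casing
```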

Bug fixes

  • AWS library dependency conflict now resolved (thanks @apiltamang for proposing the solution, and thanks to the community for the follow-up). The solution is experimental; we're waiting for feedback
  • Fixed the wrong order of additionally added Tokenizer infixPatterns in Python (thanks @sethah)
  • Training annotators that use word embeddings in a distributed cluster no longer throws sporadic file-not-found exceptions
  • Fixed NerDLModel returning non-deterministic results during prediction
  • Deep-learning-based models and graphs may now run on CPU if trained on GPU and no GPU is available on the client
  • WordEmbeddings temporary location is no longer in the HOME dir; it moved to tmp.dir
  • Fixed SentenceDetector incorrectly bounding sentences with non-English characters (thanks @lorenz-nlp)
  • Python Spark-NLP annotator models should now have all the appropriate setter and getter functions for Params
  • Fixed the wrong format of the column when showing metadata through Finisher's output as Array
  • Added the missing Python Finisher include-metadata function (thanks @PinusSilvestris for reporting the bug)
  • Fixed the Symmetric Delete spell checker throwing the wrong error when training with an empty dataset (thanks @ankush)

Developer API

  • Deep learning models may now be read through the SavedModelBundle API into TensorFlow for Java in TensorflowWrapper
  • WordEmbeddings now allow checking whether a word exists with contains()
  • Included a tool that converts text into CoNLL format for further labeling for training NER models