Releases: JohnSnowLabs/spark-nlp
John Snow Labs Spark-NLP 2.3.6: ChunkEmbeddings hotfixes and maven on start
This minor release fixes a bug in ChunkEmbeddings that caused an out-of-bounds exception in some scenarios. We also switched to Maven coordinates as the default source for the start() function, since spark-packages has been unresponsive in its package approval process. Thank you all for your consistent feedback.
Bugfixes
- Fixed a bug in ChunkEmbeddings causing an out-of-bounds exception in some scenarios
Other
- start() function switched to use Maven coordinates instead of spark-packages
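For users who cannot rely on start(), the same effect can be obtained by launching Spark with the Maven coordinate directly. A sketch, assuming the Scala 2.11 artifact that this release line ships (adjust the version as needed):

```shell
# Launch pyspark with Spark NLP resolved from Maven Central rather than
# spark-packages; the coordinate below matches this 2.3.6 release.
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.11:2.3.6
```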
John Snow Labs Spark-NLP 2.3.5: Multiple fixes and the most stable release of 2.3.x!
We would like to thank you all for your valuable feedback via our Slack and our GitHub repositories.
Spark NLP 2.3.4 was a very stable, rock-solid release. However, we wanted to release 2.3.5 to fix the few remaining minor bugs before moving on to our bigger 2.4.0 release!
Bugfixes
- #702 Fixed flexible dates in the DateMatcher
- #718 Fixed a bug in the pragmatic sentence detector where a sub-matched group contained a dollar sign
- #719 Moved an import to top level to avoid import failures in Spark NLP functions
- #709 #716 Some improvements in our documentation thanks to @marcinic @howmuchcomputer
Other
- We'll be releasing our pre-trained models and pipelines in https://github.com/JohnSnowLabs/spark-nlp-models from now on.
John Snow Labs Spark-NLP 2.3.4: Scala and Python functions, improved developer API
Thank you, as always, for the feedback given on Slack and in our repos. The most important part of this release is how we internally started organizing models. We'll be announcing our model news in https://github.com/JohnSnowLabs/spark-nlp-models. The models repo will be kept up to date.
As for this release, it improves various internal API functionalities, allowing for positive side effects across the library. As an important enhancement, we have added UDFs and functions for both Scala and Python users to easily manipulate annotations on DataFrames. Finally, we have fixed various bugs in embeddings metadata to make sure we provide accurate offset information for other annotators to consume successfully.
Enhancements
- Revamped functions in Scala and Python to help users deal with annotations from DataFrames or in UDF form, such as `map_annotations` and `filter_by_annotations`
Bugfixes
- Fixed bugs in ChunkEmbeddings and SentenceEmbeddings causing them to report wrong metadata and offset values
- Fixed a nested import issue in Python causing LightPipelines not to work in some environments
Developer API
- downloadModel is now flexible as to which inner downloader class is being used to access AnnotatorModel reference
- pretrained API now deals with defaultModelName as an Option to allow non-default pre-trained models
Other
- version() now returns the version string instead of just printing it
John Snow Labs Spark-NLP 2.3.3: New features, enhancements and bug fixes
We are very glad to announce this release; it actually ended up much bigger than we expected. Thanks to community feedback, we landed many bugfixes. We also spent some time building models for the TextMatcher, so it got various improvements and bugfixes when dealing with empty sentences or cleaned-up tokens. We also added UDF-ready functions in Python to easily deal with Annotations. Finally, we fixed a few bugs when loading models from disk. Thank you very much for the constant feedback on Slack.
New Features
- New TextMatcher param `mergeOverlapping` allows handling overlapping output chunks when matched entities share keywords
- New NerOverwriter annotator allows overwriting NER output with custom entities
- Added `map_annotations`, `map_annotations_strict`, `map_annotations_col`, `filter_by_annotations_col`, and `explode_annotations_col` functions on the Python side, allowing users to deal with Annotations easily
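Conceptually, these helpers lift a plain Python function over the Annotation arrays stored in a DataFrame column. A rough pure-Python analogue of the idea (the `Annotation` tuple and helper functions below are illustrative, not the library's actual API):

```python
from collections import namedtuple

# Illustrative stand-in for Spark NLP's Annotation row (not the real class).
Annotation = namedtuple("Annotation", "annotator_type begin end result metadata")

def map_annotations(f, annotations):
    """Apply f over a list of annotations, as the UDF helper does per row."""
    return [f(a) for a in annotations]

def filter_by_annotations(predicate, annotations):
    """Keep only annotations matching a predicate, like the filter helper."""
    return [a for a in annotations if predicate(a)]

tokens = [
    Annotation("token", 0, 4, "Spark", {}),
    Annotation("token", 6, 8, "NLP", {}),
]
upper = map_annotations(lambda a: a._replace(result=a.result.upper()), tokens)
short = filter_by_annotations(lambda a: len(a.result) <= 3, tokens)
```

In the library itself, the corresponding functions wrap this per-row logic into Spark UDFs so it can be applied to whole DataFrame columns.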
Enhancement
- Made ChunkEmbeddings output compatible with SentenceEmbeddings for better flexibility in pipelines
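The compatibility makes sense because both annotators reduce a set of word vectors to a single vector. A minimal sketch of average pooling, assuming averaging as the reduction strategy (one of the available options, shown here in plain Python for illustration only):

```python
def average_pool(vectors):
    """Average a list of equal-length word vectors into one embedding."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Two toy 3-dimensional word vectors belonging to one chunk.
chunk_vectors = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
chunk_embedding = average_pool(chunk_vectors)  # [2.0, 3.0, 4.0]
```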
Bugfixes
- Fixed BertEmbeddings crashing on empty input sentences
- Fixed missing load API and import shortcuts on the new embeddings annotators
- Added missing metadata fields in ChunkEmbeddings
- Fixed wrong sentence IDs in sentences or tokens that were cleaned up during the pipeline
- Fixed typos in docs. Thanks @marcinic
- Fixed bad deprecated OCR and SpellChecker python classpath
John Snow Labs Spark-NLP 2.3.2: Multiple fixes and enhancements
This release addresses multiple bug fixes and some enhancements regarding memory consumption in our BertEmbeddings.
Bugfixes
- Fixed missing EmbeddingsFinisher in Scala and Python
- Reverted embeddings move-to-copy due to a CRC issue
- Fixed an IndexOutOfBoundsException in SentenceEmbeddings
Enhancement
- Optimized BertEmbeddings memory consumption
John Snow Labs Spark-NLP 2.3.1: EmbeddingsHelper and Lemmatizer fix
This quick release addresses a bug in the Lemmatizer load/pretrained functions that caused them not to work in 2.3.0.
We took the chance to include a feature that did not make it into the base 2.3.0 release, and slightly changed protected variables for
a better Java API, also including a Java-compatible pretrained() function. Thanks for the quick issue feedback again!
New Features
- New EmbeddingsFinisher specializes in dealing with embedding annotators' output. The traditional Finisher still behaves the same as in 2.3.0
Bugfixes
- Fixed a bug in the previous release causing LemmatizerModel not to work via load() or pretrained()
- Fixed pretrained() function to return proper type in Java
John Snow Labs Spark-NLP 2.3.0: More embedding builders and better Java support
Thanks for your contributions and feedback on Slack. This amazing release comes with many new features in the scope of the embeddings, allowing pipeline builders to retrieve embeddings for specific bodies of texts in any form given, from sentences to chunks or n-grams.
We also worked a lot on making sure Spark NLP in Java works as intended. Finally, we improved AWS profile compatibility for frameworks that utilize multiple credential profiles. Unfortunately, we have deprecated the Eval and OCR modules due to internal patents in some of the latest improvements John Snow Labs has contributed.
New Features
- New SentenceEmbeddings annotator utilizes WordEmbeddings or BertEmbeddings to generate sentence or document embeddings
- New ChunkEmbeddings annotator utilizes WordEmbeddings or BertEmbeddings to generate chunk embeddings from `Chunker`, `NGramGenerator`, or `NerConverter` outputs
- New StopWordsCleaner integrates the Spark ML StopWordsRemover function into the Spark NLP pipeline
- New NGramGenerator annotator integrates the Spark ML NGram function into Spark NLP, with a new cumulative feature to also generate range n-grams like the scikit-learn library
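The cumulative behaviour mirrors scikit-learn's `ngram_range`: instead of n-grams of a single size, all sizes in a range are emitted. A hypothetical pure-Python sketch of the idea (not the annotator's actual implementation):

```python
def range_ngrams(tokens, min_n, max_n):
    """Generate all n-grams for every n in [min_n, max_n],
    analogous to scikit-learn's ngram_range=(min_n, max_n)."""
    out = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

grams = range_ngrams(["natural", "language", "processing"], 1, 2)
# unigrams and bigrams together in one output
```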
Enhancements
- Fixed Java incompatibilities in the Pretrained and LightPipeline APIs. Examples added.
- A new parse-embeddings-vector flag in Finisher and LightPipeline allows optional vector processing to save memory and improve performance
- setInputCols in Python can now be passed as *args
- New `enableScore` param in SentimentDetector switches output types between confidence score and results (Thanks @maxwellpaulm)
- The spark_nlp profile name is now the default in the AWS config, allowing compatibility with multiple credential profiles
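The *args change means setInputCols can be called either with a list or with bare column names. A pure-Python sketch of the accepting pattern (hypothetical class, not the library code):

```python
class AnnotatorSketch:
    """Hypothetical annotator illustrating the flexible setter pattern."""
    def setInputCols(self, *value):
        # Accept setInputCols(["a", "b"]) as well as setInputCols("a", "b").
        if len(value) == 1 and isinstance(value[0], (list, tuple)):
            value = value[0]
        self.input_cols = list(value)
        return self

a = AnnotatorSketch().setInputCols("document", "token")
b = AnnotatorSketch().setInputCols(["document", "token"])
# both calls produce the same input column list
```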
Bugfixes
- Fixed POS training dataset creator to improve performance
Deprecations
- OCR Module dropped from open source support
- Eval Module dropped from open source support
John Snow Labs Spark-NLP 2.2.2: Better Evaluation module in python, fixed duplicate coordinates, graph script
Thank you again for all your feedback and questions in our Slack channel. Such feedback from users and contributors
(thank you Stuart Lynn @sllynn) helped us find several Python module bugs. We also fixed and improved OCR support
for extracting page coordinates, and fixed the NerDL evaluator in Python.
Enhancements
- Added a create_models.py Python script to generate graphs for NerDL without the need for Jupyter
- Added a new annotator Token2Chunk to convert all tokens to chunk types (useful for extracting token coordinates from OCR)
- Added OCR Page Dimensions
- Python setInputCols now accepts *args, so there is no need to pass a list
Bugfixes
- Fixed Python support for NerDL evaluation not taking all params appropriately
- Fixed a bug in case sensitivity matching of embeddings format in python (Thanks @sllynn)
- Fixed a bug in python DateMatcher with dateFormat param not working (Thanks @sllynn)
- Fixed a bug in PositionFinder reporting duplicate coordinate elements
Developer API
- Renamed trainValidationProp to validationSplit in NerDLApproach
Documentation
- Added several missing annotator documentation entries to the docs page
John Snow Labs Spark-NLP 2.2.1: Python PipelineModel bugfixes
This short release is to address a few uncovered issues in the previous 2.2.0 release. Thank you all for quick feedback.
Enhancements
- New NerDLApproach param includeValidationProp allows partitioning the training set and excluding a fraction
- NerDLApproach trainValidationProp now randomly samples the data as opposed to taking the head first
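Random sampling matters because taking the head of the data can yield a validation set that is not representative when the rows are ordered. A plain-Python sketch of the difference (illustrative only, not the library's implementation):

```python
import random

def head_split(rows, prop):
    """Old behaviour: validation set taken from the head of the data."""
    k = int(len(rows) * prop)
    return rows[:k], rows[k:]

def random_split(rows, prop, seed=42):
    """New behaviour: validation rows sampled randomly."""
    rng = random.Random(seed)
    idx = set(rng.sample(range(len(rows)), int(len(rows) * prop)))
    val = [r for i, r in enumerate(rows) if i in idx]
    train = [r for i, r in enumerate(rows) if i not in idx]
    return val, train

rows = list(range(100))
val_head, _ = head_split(rows, 0.2)            # always the first 20 rows
val_rand, train_rand = random_split(rows, 0.2)  # 20 rows drawn at random
```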
Bugfixes
- Fixed a bug in ResourceHelper causing folder resources to fail when a folder is empty (affects various annotators)
- Fixed a bug in the Python embeddings format not being parsed to upper case
- Fixed a bug in Python causing an inability to load PipelineModels after loading embeddings
John Snow Labs Spark-NLP 2.2.0: BERT improvements, OCR Coordinates, python evaluation
Last time, following a release candidate schedule proved to be quite an effective method for avoiding silly bugs right after release!
Fortunately, careful testing of releases alongside the community, which resulted in various pull requests, left no breaking bugs.
This huge release features OCR-based coordinate highlighting, a BERT embeddings refactor and tuning, more tools for accuracy evaluation in Python, and much more.
We welcome your feedback in our Slack channels, as always!
New Features
- OCRHelper now returns coordinate positions matrix for text converted from PDF
- New annotator PositionFinder consumes OCRHelper positions to return rectangle coordinates for CHUNK annotator types
- Evaluation module now also ported to Python
- WordEmbeddings now includes coverage metadata information, and new static functions `withCoverageColumn` and `overallCoverage` offer metric analysis
- NerDL now has an `includeConfidence` param that enables confidence scores in prediction metadata
- NerDLApproach now has `enableOutputLog`, which outputs training metric logs to a file
- New `poolingLayer` param in BERT allows for pooling layer selection
Enhancements
- BertEmbeddings now integrates much better with Spark NLP, returning state-of-the-art accuracy numbers for NER (details to be expanded). Thank you for the community feedback.
- Progress bar and size estimate report when downloading pretrained models and loading embeddings
- The models and pipelines cache is now more efficiently managed and includes CRC checks (not retroactive)
- Finisher and LightPipeline now deal with embeddings properly, including them in the preprocessed result (Thank you Will Held)
- Tokenizer now allows regular expressions in the list of Exceptions (Thank you @atomobianco)
- PretrainedPipeline now allows the `fullAnnotate` function to retrieve full Annotation information
- New DocumentAssembler cleanup modes `each`, `each_full`, and `delete_full` allow more control over text cleanup (different ways of dealing with newlines and tabs)
Bugfixes
- Fixed a bug in NerConverter caused by empty entities, returning an error when flushing entities
- Fixed a bug when creating BERT Models from python, where contrib libraries were not loaded
- Fixed missing setters for whitelist param in NerConverter
- Fixed a bug where parameters from a BERT model were being read incorrectly from Python because they were not correctly serialized
- Fixed a bug where ResourceDownloader conflicted S3 credentials with public model access (Thank you Dimitris Manikis)
- Fixed Context Spell Checker bugs with performance improvements (pretrained model disabled until we get a better one)