John Snow Labs Spark-NLP 2.4.0: New TensorFlow 1.15, Universal Sentence Encoder, Elmo, faster Word Embeddings & more

We are very excited to finally release Spark NLP v2.4.0! This is one of the largest releases we have made since the inception of the library. Spark NLP 2.4.0 has been migrated to TensorFlow 1.15.0, which lets it take advantage of the latest deep learning technologies and pre-trained models.

Python

#PyPI

pip install spark-nlp==2.4.0

#Conda

conda install -c johnsnowlabs spark-nlp==2.4.0

Spark

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.11:2.4.0

PySpark

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.11:2.4.0

Maven

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.11</artifactId>
    <version>2.4.0</version>
</dependency>

FAT JARs

Major features and improvements

NEW: TensorFlow 1.15.0 now works behind Spark NLP. This brings implicit improvements in performance, accuracy, and functionality
NEW: UniversalSentenceEncoder annotator with 2 pre-trained models from TF Hub
NEW: ElmoEmbeddings with a pre-trained model from TF Hub
NEW: All our pre-trained models are now cross-platform!
NEW: For the first time, all the multi-lingual models and pipelines are available for Windows users (French, German and Italian)
NEW: MultiDateMatcher capable of matching more than one date per sentence (Extends DateMatcher algorithm)
NEW: BigTextMatcher works best with large amounts of input data
BertEmbeddings improvements with 5 new models from TF Hub
RecursivePipelineModel as an enhanced PipelineModel allows Annotators to access previous annotators in the pipeline for more ML strategies
LazyAnnotators: A new Param in Annotators allows them to stand idle in the Pipeline and do nothing. They can be called by other Annotators in a RecursivePipeline
RocksDB is now available as a flexible API called Storage. It allows any annotator to have its own distributed local index database
Our TensorFlow pre-trained models are now cross-platform, enabling multi-language models and other improvements for Windows users
Improved IO performance in general for handling embeddings
Improved cache cleanup and GC by liberating open files utilized in RocksDB (to be improved further)
Tokenizer and SentenceDetector Params minLength and maxLength to filter out annotations outside these bounds
Tokenizer improvements in splitChars and simplified rules
DateMatcher improvements
TextMatcher improvements: algorithm information is preloaded within the model for faster prediction
Annotators that utilize embeddings now have strict validation to ensure they use exactly the embeddings they were trained with
Improvements in the API allow Annotators with Storage to save and load their RocksDB database independently and let it be shared across Annotators
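
The minLength and maxLength bounds listed above for Tokenizer and SentenceDetector can be pictured as a simple length filter. The following is a plain-Python sketch of that idea, not the library's implementation: the function name `filter_by_length` is hypothetical, and treating both bounds as inclusive is an assumption made for illustration.

```python
from typing import List, Optional


def filter_by_length(tokens: List[str],
                     min_length: int = 0,
                     max_length: Optional[int] = None) -> List[str]:
    """Keep only tokens whose length lies within [min_length, max_length].

    Hypothetical helper: bounds are assumed to be inclusive, and a
    max_length of None means "no upper bound".
    """
    kept = []
    for token in tokens:
        if len(token) < min_length:
            continue  # shorter than the lower bound
        if max_length is not None and len(token) > max_length:
            continue  # longer than the upper bound
        kept.append(token)
    return kept


tokens = ["a", "to", "Spark", "NLP", "internationalization"]
print(filter_by_length(tokens, min_length=2, max_length=5))
```

With these bounds the one-character and the very long token are dropped, while everything between 2 and 5 characters is kept.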

Models and Pipelines

Spark NLP 2.4.0 comes with new models, including Universal Sentence Encoder, BERT, and Elmo models from TF Hub. In addition, our multilingual pipelines are now available to Windows users, just as they are to Linux and macOS users.

Models	Name
UniversalSentenceEncoder	`tf_use`
UniversalSentenceEncoder	`tf_use_lg`
BertEmbeddings	`bert_large_cased`
BertEmbeddings	`bert_large_uncased`
BertEmbeddings	`bert_base_cased`
BertEmbeddings	`bert_base_uncased`
BertEmbeddings	`bert_multi_cased`
ElmoEmbeddings	`elmo`
NerDLModel	`onto_100`
NerDLModel	`onto_300`

Pipelines	Name	Language
Explain Document Large	`explain_document_lg`	fr
Explain Document Medium	`explain_document_md`	fr
Entity Recognizer Large	`entity_recognizer_lg`	fr
Entity Recognizer Medium	`entity_recognizer_md`	fr
Explain Document Large	`explain_document_lg`	de
Explain Document Medium	`explain_document_md`	de
Entity Recognizer Large	`entity_recognizer_lg`	de
Entity Recognizer Medium	`entity_recognizer_md`	de
Explain Document Large	`explain_document_lg`	it
Explain Document Medium	`explain_document_md`	it
Entity Recognizer Large	`entity_recognizer_lg`	it
Entity Recognizer Medium	`entity_recognizer_md`	it

Example:

# Import Spark NLP
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.pretrained import PretrainedPipeline
import sparknlp

# Start Spark Session with Spark NLP
# If you already have a SparkSession (Zeppelin, Databricks, etc.) 
# you can skip this
spark = sparknlp.start()

# Download a pre-trained pipeline
pipeline = PretrainedPipeline('explain_document_md', lang='fr')

# Your testing dataset
text = """
Emmanuel Jean-Michel Frédéric Macron est le fils de Jean-Michel Macron, né en 1950, médecin, professeur de neurologie au CHU d'Amiens4 et responsable d'enseignement à la faculté de médecine de cette même ville5, et de Françoise Noguès, médecin conseil à la Sécurité sociale.
"""

# Annotate your testing dataset
result = pipeline.annotate(text)
 
# What's in the pipeline
list(result.keys())
# result:
# ['entities', 'lemma', 'document', 'pos', 'token', 'ner', 'embeddings', 'sentence']

# Check the results
result['entities']
# entities:
# ['Emmanuel Jean-Michel Frédéric Macron', 'Jean-Michel Macron', "CHU d'Amiens4", 'Françoise Noguès', 'Sécurité sociale']

Backward incompatibilities

Please note that in 2.4.0 we have added a storageRef parameter to our WordEmbeddings. This means every WordEmbeddingsModel will now have a storageRef, which is also bound to any NerDLModel trained with those embeddings.
This ensures users won't use a NerDLModel with the wrong WordEmbeddingsModel.

Example:

import com.johnsnowlabs.nlp.embeddings.WordEmbeddings
import com.johnsnowlabs.nlp.util.io.ReadAs

val embeddings = new WordEmbeddings()
      .setStoragePath("/tmp/glove.6B.100d.txt", ReadAs.TEXT)
      .setDimension(100)
      .setStorageRef("glove_100d") // Use or save this WordEmbeddings with storageRef
      .setInputCols("document", "token")
      .setOutputCol("embeddings")

If you save the WordEmbeddings model, its storageRef will be glove_100d. If you then train a NerDLApproach with these embeddings, glove_100d will be bound to the resulting NerDLModel.
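
To make the binding concrete, here is a conceptual sketch in plain Python of how such a check could behave. It does not use Spark NLP and is not the library's internal code: the classes `EmbeddingsModel` and `NerModel`, the function `train_ner`, and the second reference name `glove_300d` are all hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass
class EmbeddingsModel:
    """Stand-in for a saved WordEmbeddingsModel (hypothetical class)."""
    storage_ref: str
    dimension: int


@dataclass
class NerModel:
    """Stand-in for a NerDLModel that remembers its embeddings' storageRef."""
    storage_ref: str

    def validate(self, embeddings: EmbeddingsModel) -> None:
        # Refuse to run with embeddings other than the ones used in training.
        if embeddings.storage_ref != self.storage_ref:
            raise ValueError(
                f"NER model expects embeddings '{self.storage_ref}', "
                f"got '{embeddings.storage_ref}'"
            )


def train_ner(embeddings: EmbeddingsModel) -> NerModel:
    # Training binds the embeddings' storageRef to the resulting model.
    return NerModel(storage_ref=embeddings.storage_ref)


glove_100d = EmbeddingsModel(storage_ref="glove_100d", dimension=100)
glove_300d = EmbeddingsModel(storage_ref="glove_300d", dimension=300)

ner = train_ner(glove_100d)
ner.validate(glove_100d)      # matching storageRef: passes silently
try:
    ner.validate(glove_300d)  # mismatched storageRef: rejected
except ValueError as err:
    print(err)
```

The point of the sketch is only that the reference travels with the trained model, so a mismatch can be caught before prediction rather than producing silently wrong results.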

If you already have WordEmbeddingsModels saved from earlier versions, you either need to re-save them with a storageRef or manually add this param in their metadata/. The same advice applies to NerDLModels from earlier versions.
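
For the manual route, a minimal sketch follows. It assumes the saved model's metadata is a single JSON object with a `paramMap` entry and that the new parameter is stored under the key `storageRef`; both are assumptions to verify against a model saved with 2.4.0 before editing real files. The helper name `add_storage_ref` is hypothetical, and it only transforms the JSON text, leaving reading and writing of the metadata file to you.

```python
import json


def add_storage_ref(metadata_json: str, storage_ref: str) -> str:
    """Return metadata JSON with paramMap.storageRef set.

    Hypothetical helper; the key names are assumptions, not confirmed API.
    An existing storageRef is left untouched.
    """
    metadata = json.loads(metadata_json)
    param_map = metadata.setdefault("paramMap", {})
    param_map.setdefault("storageRef", storage_ref)
    return json.dumps(metadata, separators=(",", ":"))


# Made-up metadata line, for illustration only.
old = '{"class":"com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel","paramMap":{"dimension":100}}'
new = add_storage_ref(old, "glove_100d")
print(new)
```

Back up the model directory before trying anything like this, and prefer re-saving the model with a storageRef when that is an option.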

Bugfixes

Fixed splitChars in Tokenizer
Fixed PretrainedPipeline in Python to allow accessing the inner PipelineModel in the instance
Fixes in Chunk and SentenceEmbeddings to better deal with empty cleaned-up Annotations

Documentation and examples

We have a new Developer section for those who are interested in contributing to Spark NLP
Developer
We have updated our workshop repository with more notebooks
Workshop
