Spark NLP 5.2.0: Introducing Zero-Shot Image Classification with CLIP, ONNX support for T5, Marian, and CamemBERT, a new Text Splitter annotator, over 8,000 state-of-the-art transformer models in ONNX, bug fixes, and more!
🎉 Celebrating 80 Million Downloads on PyPI - A Spark NLP Milestone! 🎉
We are thrilled to announce that Spark NLP has reached a remarkable milestone of 80 million downloads on PyPI! This achievement is a testament to the strength and dedication of our community.
A heartfelt thank you to each and every one of you who has contributed, used, and supported Spark NLP. Your invaluable feedback, contributions, and enthusiasm have played a crucial role in evolving Spark NLP into an award-winning, production-ready, and scalable open-source NLP library.
As we celebrate this milestone, we're also excited to announce the release of Spark NLP 5.2.0! This new version marks another step forward in our journey, bringing new features, improved performance, and bug fixes, and extending our Models Hub to 30,000 open-source, forever-free models, including 8,000 new state-of-the-art language models in the 5.2.0 release.
Here's to many more milestones, breakthroughs, and advancements! 🎉
🔥 New Features & Enhancements
- NEW: Introducing the CLIPForZeroShotClassification annotator for Zero-Shot Image Classification using OpenAI's CLIP models. CLIP is a state-of-the-art multi-modal vision-and-language model designed to recognize a specific, pre-defined group of object categories, and it can be used for zero-shot image classification. To achieve this, CLIP uses a Vision Transformer (ViT) to extract visual features and a causal language model to extract text features. Features from both text and images are then projected into a common latent space of the same dimensionality, and the similarity score is computed as the dot product of the projected image and text features in that space.

  CLIP (Contrastive Language-Image Pre-training) builds on a large body of work on zero-shot transfer, natural language supervision, and multimodal learning. The idea of zero-data learning dates back over a decade, but until recently it was mostly studied in computer vision as a way of generalizing to unseen object categories. A critical insight was to leverage natural language as a flexible prediction space to enable generalization and transfer. In 2013, Richard Socher and co-authors at Stanford developed a proof of concept by training a model on CIFAR-10 to make predictions in a word-vector embedding space and showed that this model could predict two unseen classes. The same year, DeVISE scaled this approach and demonstrated that it was possible to fine-tune an ImageNet model so that it could generalize to correctly predicting objects outside the original 1000-class training set. - CLIP: Connecting text and images
As always, we made this feature super easy and scalable:
```python
image_assembler = ImageAssembler() \
    .setInputCol("image") \
    .setOutputCol("image_assembler")

labels = [
    "a photo of a bird",
    "a photo of a cat",
    "a photo of a dog",
    "a photo of a hen",
    "a photo of a hippo",
    "a photo of a room",
    "a photo of a tractor",
    "a photo of an ostrich",
    "a photo of an ox",
]

image_classifier = CLIPForZeroShotClassification \
    .pretrained() \
    .setInputCols(["image_assembler"]) \
    .setOutputCol("label") \
    .setCandidateLabels(labels)
```
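The scoring step described above, projecting image and text features into a shared latent space and taking dot products, can be sketched with NumPy. This is illustrative only: the embeddings below are random stand-ins, not CLIP's actual weights or Spark NLP's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in embeddings: one image vector and one text vector per candidate
# label, already projected into a shared 512-dimensional latent space.
image_emb = rng.normal(size=512)
text_embs = rng.normal(size=(9, 512))  # one row per candidate label

# L2-normalize so the dot product becomes cosine similarity.
image_emb = image_emb / np.linalg.norm(image_emb)
text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)

# Similarity scores, then a softmax over the candidate labels.
logits = text_embs @ image_emb
probs = np.exp(logits) / np.exp(logits).sum()

best = int(np.argmax(probs))  # index of the predicted label
```

The label whose text embedding lands closest to the image embedding wins, which is what lets CLIP classify against an arbitrary, user-supplied label set with no fine-tuning.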
- NEW: Introducing the DocumentTokenSplitter annotator, which allows users to split large documents into smaller chunks for use in RAG with LLMs
- NEW: Introducing support for ONNX Runtime in T5Transformer annotator
- NEW: Introducing support for ONNX Runtime in MarianTransformer annotator
- NEW: Introducing support for ONNX Runtime in BertSentenceEmbeddings annotator
- NEW: Introducing support for ONNX Runtime in XlmRoBertaSentenceEmbeddings annotator
- NEW: Introducing support for ONNX Runtime in CamemBertForQuestionAnswering, CamemBertForTokenClassification, and CamemBertForSequenceClassification annotators
- Adding caching support for newly imported T5 models in TF format to make their performance competitive with the ONNX version
- Refactor the ZIP utility and add new tests for both ZipArchiveUtil and OnnxWrapper, thanks to @anqini
- Refactor ONNX and broadcast OnnxSession to improve stability in some cluster setups
- Update ONNX Runtime to 1.16.3 to take advantage of the following features in upcoming releases:
- Support for serialization of models >=2GB
- Support for fp16 and bf16 tensors as inputs and outputs
- Improve LLM quantization accuracy with smoothquant
- Support 4-bit quantization on CPU
- Optimize BeamScore to improve BeamSearch performance
- Add FlashAttention v2 support for Attention, MultiHeadAttention and PackedMultiHeadAttention ops
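The token-based chunking performed by the new DocumentTokenSplitter can be illustrated with a plain-Python sketch. This is a simplified illustration with hypothetical `num_tokens`/`overlap` parameters and naive whitespace tokenization, not the annotator's actual implementation:

```python
def split_by_tokens(text, num_tokens=8, overlap=2):
    """Split text into chunks of at most num_tokens whitespace tokens,
    with `overlap` tokens shared between consecutive chunks."""
    tokens = text.split()
    step = num_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + num_tokens]))
        if start + num_tokens >= len(tokens):
            break
    return chunks

doc = "one two three four five six seven eight nine ten eleven twelve"
chunks = split_by_tokens(doc, num_tokens=8, overlap=2)
# Each chunk holds at most 8 tokens; adjacent chunks share 2 tokens.
```

Overlapping chunks like these help RAG pipelines avoid cutting a relevant passage in half at a chunk boundary, at the cost of some duplicated tokens in the index.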
🐛 Bug Fixes
- Fix random dimension mismatch in E5Embeddings and MPNetEmbeddings due to a missing average_pool after last_hidden_state in the output
- Fix batching exception in E5 and MPNet embeddings annotators failing when sentence is used instead of document
- Fix chunk construction when an entity is found
- Fix a bug in the library's version in Scala, where it wrongly pointed to 5.1.2
- Fix Whisper models not downloading due to the wrong library version
- Fix and refactor saving best model based on given metrics during NerDL training
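The E5/MPNet fix above restores average pooling over the model's last_hidden_state. The operation can be sketched in NumPy, using random stand-in data and a hypothetical attention mask; this is illustrative, not the library's code:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in transformer output: (batch=2, seq_len=4, hidden=6),
# with an attention mask marking real tokens (1) vs padding (0).
last_hidden_state = rng.normal(size=(2, 4, 6))
attention_mask = np.array([[1, 1, 1, 0],
                           [1, 1, 0, 0]])

# Masked average pooling: sum real-token vectors, divide by token count.
mask = attention_mask[:, :, None]                # (2, 4, 1)
summed = (last_hidden_state * mask).sum(axis=1)  # (2, 6)
counts = mask.sum(axis=1)                        # (2, 1)
sentence_embeddings = summed / counts            # (2, 6)
```

Skipping this pooling step leaves the output at sequence length rather than one fixed-size vector per sentence, which is the dimension mismatch the fix addresses.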
ℹ️ Known Issues
- Some annotators are not yet compatible with the Apache Spark and PySpark 3.5.x release. Because of this, we have changed the support matrix for Spark/PySpark 3.5.x to Partially until we are 100% compatible.
💾 Models
Spark NLP 5.2.0 comes with more than 8,000 new state-of-the-art pretrained transformer models in multiple languages.
The complete list of all 30,000+ models & pipelines in 230+ languages is available on Models Hub.
📓 New Notebooks
- You can visit Import Transformers in Spark NLP
- You can visit Spark NLP Examples for 100+ examples
📖 Documentation
- Import models from TF Hub & HuggingFace
- Spark NLP Notebooks
- Models Hub with new models
- Spark NLP Articles
- Spark NLP in Action
- Spark NLP Documentation
- Spark NLP Scala APIs
- Spark NLP Python APIs
❤️ Community support
- Slack For live discussion with the Spark NLP community and the team
- GitHub Bug reports, feature requests, and contributions
- Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!
- Medium Spark NLP articles
- JohnSnowLabs official Medium
- YouTube Spark NLP video tutorials
Installation
Python
#PyPI
pip install spark-nlp==5.2.0
Spark Packages
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x (Scala 2.12):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.0
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.2.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.2.0
Apple Silicon (M1 & M2)
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.2.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.2.0
AArch64
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.2.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.2.0
Maven
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
<version>5.2.0</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
<version>5.2.0</version>
</dependency>
spark-nlp-silicon:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-silicon_2.12</artifactId>
<version>5.2.0</version>
</dependency>
spark-nlp-aarch64:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-aarch64_2.12</artifactId>
<version>5.2.0</version>
</dependency>
FAT JARs
-
CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-5.2.0.jar
-
GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-5.2.0.jar
-
M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-silicon-assembly-5.2.0.jar
-
AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-5.2.0.jar
What's Changed
- Adding notebook example for structured streaming in spark-nlp by @danilojsl in #14062
- ONNX support for T5 and Marian by @vankov in #14029
- [SPARKNLP-937] Fixing chunk construction when an entity is found by @danilojsl in #14047
- SPARKNLP-920: ONNX Support for BertSentenceEmbeddings and XlmRoBertaSentenceEmbeddings by @DevinTDHa in #14048
- SPARKNLP-938 E5 and MPNet embeddings crash on a sentence basis - missing pool average by @danilojsl in #14051
- [SPARKNLP-939] Adding ONNX support for CamemBert transformers by @danilojsl in #14052
- SPARKNLP-925 DocumentTokenSplitter by @DevinTDHa in #14053
- improve zip util code and add tests for both ZipArchiveUtil ane OnnxW… by @anqini in #14056
- Update install.md by @ryanmcdonough in #14079
- [SPARKNLP-941] Adding OnnxSession to broadcast onnx options by @danilojsl in #14078
- SPARKNLP-635: CLIPForZeroShotClassification by @DevinTDHa in #14083
- Models hub by @maziyarpanahi in #14086
- 520-release-candidate by @maziyarpanahi in #14084
New Contributors
- @anqini made their first contribution in #14056
- @ryanmcdonough made their first contribution in #14079
Full Changelog: 5.1.4...5.2.0