Spark NLP 5.2.0: Introducing Zero-Shot Image Classification with CLIP, ONNX support for T5, Marian, and CamemBERT, a new Text Splitter annotator, over 8,000 state-of-the-art transformer models in ONNX, bug fixes, and more!
🎉 Celebrating 80 Million Downloads on PyPI - A Spark NLP Milestone! 🎉
We are thrilled to announce that Spark NLP has reached a remarkable milestone of 80 million downloads on PyPI! This achievement is a testament to the strength and dedication of our community.
A heartfelt thank you to each and every one of you who has contributed, used, and supported Spark NLP. Your invaluable feedback, contributions, and enthusiasm have played a crucial role in evolving Spark NLP into an award-winning, production-ready, and scalable open-source NLP library.
As we celebrate this milestone, we're also excited to announce the release of Spark NLP 5.2.0! This new version marks another step forward in our journey, bringing new features, improved performance, and bug fixes, and extending our Models Hub to 30,000 open-source, forever-free models, including 8,000 new state-of-the-art language models in the 5.2.0 release.
Here's to many more milestones, breakthroughs, and advancements! 🎉
🔥 New Features & Enhancements
- NEW: Introducing the CLIPForZeroShotClassification annotator for Zero-Shot Image Classification using OpenAI's CLIP models. CLIP is a state-of-the-art multi-modal vision-and-language model designed to recognize a specific, pre-defined group of object categories, and it can be used for zero-shot image classification. To achieve this, CLIP uses a Vision Transformer (ViT) to extract visual features and a causal language model to extract text features. Features from both text and images are then projected into a common latent space of the same dimensionality, and the similarity score is computed as the dot product of the projected image and text features in that space.

  CLIP (Contrastive Language-Image Pre-training) builds on a large body of work on zero-shot transfer, natural language supervision, and multimodal learning. The idea of zero-data learning dates back over a decade, but until recently it was mostly studied in computer vision as a way of generalizing to unseen object categories. A critical insight was to leverage natural language as a flexible prediction space to enable generalization and transfer. In 2013, Richard Socher and co-authors at Stanford developed a proof of concept by training a model on CIFAR-10 to make predictions in a word-vector embedding space and showed that this model could predict two unseen classes. The same year, DeVISE scaled this approach and demonstrated that it was possible to fine-tune an ImageNet model so that it could generalize to correctly predicting objects outside the original 1000-class training set. - CLIP: Connecting text and images
As always, we made this feature super easy and scalable:
```python
image_assembler = ImageAssembler() \
    .setInputCol("image") \
    .setOutputCol("image_assembler")

labels = [
    "a photo of a bird",
    "a photo of a cat",
    "a photo of a dog",
    "a photo of a hen",
    "a photo of a hippo",
    "a photo of a room",
    "a photo of a tractor",
    "a photo of an ostrich",
    "a photo of an ox",
]

image_classifier = CLIPForZeroShotClassification \
    .pretrained() \
    .setInputCols(["image_assembler"]) \
    .setOutputCol("label") \
    .setCandidateLabels(labels)
```
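The scoring step described above, projecting image and text features into a shared latent space and taking dot products, can be sketched with NumPy. This is illustrative only: the embeddings below are random stand-ins, not CLIP's actual weights or Spark NLP's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in embeddings: one image vector and one text vector per candidate
# label, already projected into a shared 512-dimensional latent space.
image_emb = rng.normal(size=512)
text_embs = rng.normal(size=(9, 512))  # one row per candidate label

# L2-normalize so the dot product becomes cosine similarity.
image_emb = image_emb / np.linalg.norm(image_emb)
text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)

# Similarity scores, then a softmax over the candidate labels.
logits = text_embs @ image_emb
probs = np.exp(logits) / np.exp(logits).sum()

best = int(np.argmax(probs))  # index of the predicted label
```

The label whose text embedding lands closest to the image embedding wins, which is what lets CLIP classify against an arbitrary, user-supplied label set with no fine-tuning.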
- NEW: Introducing the DocumentTokenSplitter annotator, which allows users to split large documents into smaller chunks for use in RAG with LLMs
- NEW: Introducing support for ONNX Runtime in T5Transformer annotator
- NEW: Introducing support for ONNX Runtime in MarianTransformer annotator
- NEW: Introducing support for ONNX Runtime in BertSentenceEmbeddings annotator
- NEW: Introducing support for ONNX Runtime in XlmRoBertaSentenceEmbeddings annotator
- NEW: Introducing support for ONNX Runtime in CamemBertForQuestionAnswering, CamemBertForTokenClassification, and CamemBertForSequenceClassification annotators
- Adding caching support for newly imported T5 models in TF format to make their performance competitive with the ONNX version
- Refactor the ZIP utility and add new tests for both ZipArchiveUtil and OnnxWrapper, thanks to @anqini
- Refactor ONNX and broadcast OnnxSession to improve stability in some cluster setups
- Update ONNX Runtime to 1.16.3 to take advantage of the following features in upcoming releases:
- Support for serialization of models >=2GB
- Support for fp16 and bf16 tensors as inputs and outputs
- Improve LLM quantization accuracy with smoothquant
- Support 4-bit quantization on CPU
- Optimize BeamScore to improve BeamSearch performance
- Add FlashAttention v2 support for Attention, MultiHeadAttention and PackedMultiHeadAttention ops
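The token-based chunking performed by the new DocumentTokenSplitter can be illustrated with a plain-Python sketch. This is a simplified illustration with hypothetical `num_tokens`/`overlap` parameters and naive whitespace tokenization, not the annotator's actual implementation:

```python
def split_by_tokens(text, num_tokens=8, overlap=2):
    """Split text into chunks of at most num_tokens whitespace tokens,
    with `overlap` tokens shared between consecutive chunks."""
    tokens = text.split()
    step = num_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + num_tokens]))
        if start + num_tokens >= len(tokens):
            break
    return chunks

doc = "one two three four five six seven eight nine ten eleven twelve"
chunks = split_by_tokens(doc, num_tokens=8, overlap=2)
# Each chunk holds at most 8 tokens; adjacent chunks share 2 tokens.
```

Overlapping chunks like these help RAG pipelines avoid cutting a relevant passage in half at a chunk boundary, at the cost of some duplicated tokens in the index.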
🐛 Bug Fixes
- Fix random dimension mismatch in E5Embeddings and MPNetEmbeddings due to a missing average_pool after last_hidden_state in the output
- Fix batching exception in E5 and MPNet embeddings annotators failing when sentence is used instead of document
- Fix chunk construction when an entity is found
- Fix a bug in the library's version in Scala, where it wrongly pointed to 5.1.2
- Fix Whisper models not downloading due to the wrong library version
- Fix and refactor saving best model based on given metrics during NerDL training
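The E5/MPNet fix above restores average pooling over the model's last_hidden_state. The operation can be sketched in NumPy, using random stand-in data and a hypothetical attention mask; this is illustrative, not the library's code:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in transformer output: (batch=2, seq_len=4, hidden=6),
# with an attention mask marking real tokens (1) vs padding (0).
last_hidden_state = rng.normal(size=(2, 4, 6))
attention_mask = np.array([[1, 1, 1, 0],
                           [1, 1, 0, 0]])

# Masked average pooling: sum real-token vectors, divide by token count.
mask = attention_mask[:, :, None]                # (2, 4, 1)
summed = (last_hidden_state * mask).sum(axis=1)  # (2, 6)
counts = mask.sum(axis=1)                        # (2, 1)
sentence_embeddings = summed / counts            # (2, 6)
```

Skipping this pooling step leaves the output at sequence length rather than one fixed-size vector per sentence, which is the dimension mismatch the fix addresses.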
ℹ️ Known Issues
- Some annotators are not yet compatible with the Apache Spark and PySpark 3.5.x release. Because of this, we have changed the support matrix for Spark/PySpark 3.5.x to Partially until we are 100% compatible.
💾 Models
Spark NLP 5.2.0 comes with more than 8,000 new state-of-the-art pretrained transformer models in multiple languages.
The complete list of all 30,000+ models & pipelines in 230+ languages is available on Models Hub.
📓 New Notebooks
- You can visit Import Transformers in Spark NLP
- You can visit Spark NLP Examples for 100+ examples
📖 Documentation
- Import models from TF Hub & HuggingFace
- Spark NLP Notebooks
- Models Hub with new models
- Spark NLP Articles
- Spark NLP in Action
- Spark NLP Documentation
- Spark NLP Scala APIs
- Spark NLP Python APIs
❤️ Community support
- Slack For live discussion with the Spark NLP community and the team
- GitHub Bug reports, feature requests, and contributions
- Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!
- Medium Spark NLP articles
- JohnSnowLabs official Medium
- YouTube Spark NLP video tutorials
Installation
Python
#PyPI
pip install spark-nlp==5.2.0
Spark Packages
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x (Scala 2.12):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.0
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.2.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.2.0
Apple Silicon (M1 & M2)
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.2.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.2.0
AArch64
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.2.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.2.0
Maven
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
<version>5.2.0</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
<version>5.2.0</version>
</dependency>
spark-nlp-silicon:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-silicon_2.12</artifactId>
<version>5.2.0</version>
</dependency>
spark-nlp-aarch64:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-aarch64_2.12</artifactId>
<version>5.2.0</version>
</dependency>
FAT JARs
-
CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-5.2.0.jar
-
GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-5.2.0.jar
-
M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-silicon-assembly-5.2.0.jar
-
AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-5.2.0.jar
What's Changed
- Adding notebook example for structured streaming in spark-nlp by @danilojsl in #14062
- ONNX support for T5 and Marian by @vankov in #14029
- [SPARKNLP-937] Fixing chunk construction when an entity is found by @danilojsl in #14047
- SPARKNLP-920: ONNX Support for BertSentenceEmbeddings and XlmRoBertaSentenceEmbeddings by @DevinTDHa in #14048
- SPARKNLP-938 E5 and MPNet embeddings crash on a sentence basis - missing pool average by @danilojsl in #14051
- [SPARKNLP-939] Adding ONNX support for CamemBert transformers by @danilojsl in #14052
- SPARKNLP-925 DocumentTokenSplitter by @DevinTDHa in #14053
- improve zip util code and add tests for both ZipArchiveUtil ane OnnxW… by @anqini in #14056
- Update install.md by @ryanmcdonough in #14079
- [SPARKNLP-941] Adding OnnxSession to broadcast onnx options by @danilojsl in #14078
- SPARKNLP-635: CLIPForZeroShotClassification by @DevinTDHa in #14083
- Models hub by @maziyarpanahi in #14086
- 520-release-candidate by @maziyarpanahi in #14084
New Contributors
- @anqini made their first contribution in #14056
- @ryanmcdonough made their first contribution in #14079
Full Changelog: 5.1.4...5.2.0