Release Spark NLP 5.2.1: Official support for Apache Spark 3.5, Introducing BGE annotator for Text Embeddings, ONNX support for DeBERTa Token and Sequence Classifications, and Question Answering task, new Databricks 14.x runtimes, Over 400 new state-of-the-art Transformer Models in ONNX, and bug fixes! · JohnSnowLabs/spark-nlp

📢 Overview

Spark NLP 5.2.1 🚀 comes with full compatibility with Spark/PySpark 3.5, a brand-new BGEEmbeddings annotator for loading BGE text-embedding models, and new ONNX support for the DeBertaForTokenClassification, DeBertaForSequenceClassification, and DeBertaForQuestionAnswering annotators. Additionally, we've added over 400 state-of-the-art transformer models in ONNX format to ensure rapid inference for multi-class/multi-label classification models.

We're pleased to announce that our Models Hub now boasts 30,000+ free and truly open-source models & pipelines 🎉. Our deepest gratitude goes out to our community for their invaluable feedback, feature suggestions, and contributions.

🔥 New Features & Enhancements

NEW: Introducing full support for Apache Spark and PySpark 3.5 that comes with lots of improvements for Spark Connect: https://spark.apache.org/releases/spark-release-3-5-0.html#highlights
NEW: Welcoming 9 new Databricks runtimes, officially supported with the new Spark 3.5:
- Databricks 14.0
- Databricks 14.0 ML
- Databricks 14.0 ML GPU
- Databricks 14.1
- Databricks 14.1 ML
- Databricks 14.1 ML GPU
- Databricks 14.2
- Databricks 14.2 ML
- Databricks 14.2 ML GPU
NEW: Introducing the BGEEmbeddings annotator for Spark NLP. This annotator enables the integration of BGE models, based on the BERT architecture, into Spark NLP. The BGEEmbeddings annotator is designed for generating dense vectors suitable for a variety of applications, including retrieval, classification, clustering, and semantic search. Its embeddings can also be stored in the vector databases commonly used alongside Large Language Models (LLMs).
NEW: Introducing support for ONNX Runtime in DeBertaForTokenClassification annotator
NEW: Introducing support for ONNX Runtime in DeBertaForSequenceClassification annotator
NEW: Introducing support for ONNX Runtime in DeBertaForQuestionAnswering annotator
Add a new notebook to show how to import any model from T5 family into Spark NLP with TensorFlow format
Add a new notebook to show how to import any model from T5 family into Spark NLP with ONNX format
Add a new notebook to show how to import any model from MarianNMT family into Spark NLP with ONNX format
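
To illustrate the BGEEmbeddings annotator introduced above, here is a minimal Python sketch rather than an official example. It assumes `spark-nlp` 5.2.1 and `pyspark` are installed and that calling `BGEEmbeddings.pretrained()` with no arguments downloads a default BGE model. The column names (`text`, `document`, `embeddings`) and the helpers `build_bge_pipeline`, `cosine_similarity`, and `rank_by_similarity` are hypothetical names chosen for this illustration. The two pure-Python helpers show one way the resulting dense vectors could be compared for retrieval or semantic search.

```python
import math
from typing import List, Sequence, Tuple


def build_bge_pipeline():
    """Assemble a two-stage pipeline: raw text -> document -> BGE embeddings.

    Imports are local so this module can be loaded without Spark installed.
    """
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import BGEEmbeddings

    document = DocumentAssembler().setInputCol("text").setOutputCol("document")
    # Assumption: pretrained() with no arguments fetches the default BGE model.
    embeddings = (
        BGEEmbeddings.pretrained()
        .setInputCols(["document"])
        .setOutputCol("embeddings")
    )
    return Pipeline(stages=[document, embeddings])


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two dense vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: Sequence[float], candidates: Sequence[Sequence[float]]
) -> List[Tuple[int, float]]:
    """Return (candidate index, score) pairs, most similar first."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(candidates)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With a running session (for example one created by `sparknlp.start()`), the pipeline could be applied with `build_bge_pipeline().fit(df).transform(df)`, where `df` has a `text` column. The vectors would then be read from the `embeddings.embeddings` field of the output and passed to `rank_by_similarity`.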

🐛 Bug Fixes

Fix a serialization issue that prevented the DocumentTokenSplitter annotator from being saved and loaded in a Pipeline
Fix a serialization issue that prevented the DocumentCharacterTextSplitter annotator from being saved and loaded in a Pipeline

ℹ️ Known Issues

ONNX models crash when they are used in Colab's T4 GPU runtime #14109

📓 New Notebooks

Import T5 models in TensorFlow from HuggingFace 🤗 into Spark NLP 🚀
Import T5 models in ONNX from HuggingFace 🤗 into Spark NLP 🚀
Import Marian models in ONNX from HuggingFace 🤗 into Spark NLP 🚀

📖 Documentation

❤️ Community support

Slack: For live discussion with the Spark NLP community and the team
GitHub: Bug reports, feature requests, and contributions
Discussions: Engage with other community members, share ideas, and show off how you use Spark NLP!
Medium: Spark NLP articles
YouTube: Spark NLP video tutorials

Installation

Python

#PyPI

pip install spark-nlp==5.2.1

Spark Packages

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x (Scala 2.12):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.1

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.1

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.2.1

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.2.1

Apple Silicon (M1 & M2)

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.2.1

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.2.1

AArch64

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.2.1

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.2.1
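
The four package variants above differ only in the artifact name, so the same coordinates can also be set programmatically through `spark.jars.packages` when building a session from Python. The sketch below is one possible way to do that, not the project's official helper. The names `ARTIFACTS`, `maven_coordinate`, and `start_session`, and the flavor keys (`cpu`, `gpu`, `silicon`, `aarch64`), are hypothetical and chosen for this illustration. The Kryo serializer setting is an assumption based on common Spark NLP setups and may need tuning for your cluster.

```python
SPARK_NLP_VERSION = "5.2.1"
SCALA_VERSION = "2.12"

# Artifact names as they appear in the --packages commands above.
ARTIFACTS = {
    "cpu": "spark-nlp",
    "gpu": "spark-nlp-gpu",
    "silicon": "spark-nlp-silicon",
    "aarch64": "spark-nlp-aarch64",
}


def maven_coordinate(flavor: str = "cpu") -> str:
    """Build the group:artifact:version coordinate for a package flavor."""
    try:
        artifact = ARTIFACTS[flavor]
    except KeyError:
        raise ValueError(
            f"unknown flavor {flavor!r}; expected one of {sorted(ARTIFACTS)}"
        ) from None
    return f"com.johnsnowlabs.nlp:{artifact}_{SCALA_VERSION}:{SPARK_NLP_VERSION}"


def start_session(flavor: str = "cpu"):
    """Create a SparkSession that pulls the chosen Spark NLP package.

    pyspark is imported lazily so the coordinate helper works without it.
    """
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder.appName("spark-nlp-5.2.1")
        # Assumption: Kryo serialization, as commonly used with Spark NLP.
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .config("spark.jars.packages", maven_coordinate(flavor))
        .getOrCreate()
    )
```

For instance, `start_session("gpu")` would request the same package as the `pyspark --packages` GPU command shown earlier.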

Maven

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>5.2.1</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>5.2.1</version>
</dependency>

spark-nlp-silicon:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-silicon_2.12</artifactId>
    <version>5.2.1</version>
</dependency>

spark-nlp-aarch64:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-aarch64_2.12</artifactId>
    <version>5.2.1</version>
</dependency>

FAT JARs

CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-5.2.1.jar
GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-5.2.1.jar
M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-silicon-assembly-5.2.1.jar
AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-5.2.1.jar

What's Changed

SPARKNLP-955: DocumentCharacterTextSplitter Bug Fix by @DevinTDHa in #14088
SPARKNLP-951 & SPARKNLP-952: Added example notebooks for Marian and T5 by @DevinTDHa in #14089
Added BGE Embeddings by @dcecchini in #14090
adding onnx support to DeberatForXXX annotators by @ahmedlone127 in #14096
[SPARKNLP-957] Solves average pooling computation by @danilojsl in #14104
[SPARKNLP-949] Adding changes for spark 3.5 compatibility by @danilojsl in #14105
[SPARKNLP-961] Adding ONNX configs to README by @danilojsl in #14111
Models hub by @maziyarpanahi in #14113
Release/521 release candidate by @maziyarpanahi in #14112

Full Changelog: 5.2.0...5.2.1
