Spark NLP 5.2.1: Official support for Apache Spark 3.5, Introducing BGE annotator for Text Embeddings, ONNX support for DeBERTa Token and Sequence Classifications, and Question Answering task, new Databricks 14.x runtimes, Over 400 new state-of-the-art Transformer Models in ONNX, and bug fixes!
📢 Overview
Spark NLP 5.2.1 comes with full compatibility with Spark/PySpark 3.5, a brand new BGEEmbeddings annotator to load BGE models for text embeddings, and new ONNX support for the DeBertaForTokenClassification, DeBertaForSequenceClassification, and DeBertaForQuestionAnswering annotators. Additionally, we've added over 400 state-of-the-art transformer models in ONNX format to ensure rapid inference for multi-class/multi-label classification models.
We're pleased to announce that our Models Hub now boasts 30,000+ free and truly open-source models & pipelines. Our deepest gratitude goes out to our community for their invaluable feedback, feature suggestions, and contributions.
🔥 New Features & Enhancements
- NEW: Introducing full support for Apache Spark and PySpark 3.5, which comes with many improvements for Spark Connect: https://spark.apache.org/releases/spark-release-3-5-0.html#highlights
- NEW: Welcoming 9 new Databricks runtimes with official support for Spark 3.5:
- Databricks 14.0
- Databricks 14.0 ML
- Databricks 14.0 ML GPU
- Databricks 14.1
- Databricks 14.1 ML
- Databricks 14.1 ML GPU
- Databricks 14.2
- Databricks 14.2 ML
- Databricks 14.2 ML GPU
- NEW: Introducing the BGEEmbeddings annotator for Spark NLP. This annotator enables the integration of BGE models, based on the BERT architecture, into Spark NLP. The BGEEmbeddings annotator is designed for generating dense vectors suitable for a variety of applications, including retrieval, classification, clustering, and semantic search. Additionally, it is compatible with vector databases used with Large Language Models (LLMs).
- NEW: Introducing support for ONNX Runtime in the DeBertaForTokenClassification annotator
- NEW: Introducing support for ONNX Runtime in the DeBertaForSequenceClassification annotator
- NEW: Introducing support for ONNX Runtime in the DeBertaForQuestionAnswering annotator
- Add a new notebook showing how to import any model from the T5 family into Spark NLP in TensorFlow format
- Add a new notebook showing how to import any model from the T5 family into Spark NLP in ONNX format
- Add a new notebook showing how to import any model from the MarianNMT family into Spark NLP in ONNX format
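As a rough illustration of the BGEEmbeddings item above, the sketch below wires the annotator into a minimal pipeline and then ranks vectors by cosine similarity, which is the usual basis for the retrieval and semantic search use cases mentioned. The column names, the `build_bge_pipeline` and `rank_by_similarity` helpers, and the toy vectors are assumptions made for illustration and are not part of the release; calling `pretrained()` with no arguments is assumed to fetch the annotator's default model.

```python
import math


def build_bge_pipeline():
    """Assemble DocumentAssembler -> BGEEmbeddings (needs spark-nlp and pyspark)."""
    # Imported lazily so the ranking helpers below can be used without Spark.
    from pyspark.ml import Pipeline
    from sparknlp.annotator import BGEEmbeddings
    from sparknlp.base import DocumentAssembler

    # Column names are illustrative choices, not requirements of the release.
    document = DocumentAssembler().setInputCol("text").setOutputCol("document")
    # Assumption: pretrained() without arguments downloads the default BGE model.
    embeddings = (
        BGEEmbeddings.pretrained()
        .setInputCols(["document"])
        .setOutputCol("bge_embeddings")
    )
    return Pipeline(stages=[document, embeddings])


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return document indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, vec) for vec in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


# Toy 3-dimensional vectors standing in for real embeddings (made up).
toy_query = [1.0, 0.0, 0.0]
toy_docs = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
ranking = rank_by_similarity(toy_query, toy_docs)
print(ranking)  # prints [1, 2, 0]
```

With a session started via `sparknlp.start()` and a DataFrame `df` holding a `text` column, `build_bge_pipeline().fit(df).transform(df)` should add a `bge_embeddings` column whose annotations carry the vectors in their `embeddings` field. Those vectors can then be passed to a ranking step like the one above or written to a vector store.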
🐛 Bug Fixes
- Fix serialization issue that caused the DocumentTokenSplitter annotator to fail when saved and loaded in a Pipeline
- Fix serialization issue that caused the DocumentCharacterTextSplitter annotator to fail when saved and loaded in a Pipeline
ℹ️ Known Issues
- ONNX models crash when used in Colab's T4 GPU runtime (#14109)
📓 New Notebooks
📖 Documentation
- Import models from TF Hub & HuggingFace
- Spark NLP Notebooks
- Models Hub with new models
- Spark NLP Articles
- Spark NLP in Action
- Spark NLP Documentation
- Spark NLP Scala APIs
- Spark NLP Python APIs
❤️ Community support
- Slack: For live discussion with the Spark NLP community and the team
- GitHub: Bug reports, feature requests, and contributions
- Discussions: Engage with other community members, share ideas, and show off how you use Spark NLP!
- Medium: Spark NLP articles
- YouTube: Spark NLP video tutorials
Installation
Python
#PyPI
pip install spark-nlp==5.2.1
Spark Packages
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x (Scala 2.12):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.1
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.2.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.2.1
Apple Silicon (M1 & M2)
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.2.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.2.1
AArch64
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.2.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.2.1
Maven
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
<version>5.2.1</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
<version>5.2.1</version>
</dependency>
spark-nlp-silicon:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-silicon_2.12</artifactId>
<version>5.2.1</version>
</dependency>
spark-nlp-aarch64:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-aarch64_2.12</artifactId>
<version>5.2.1</version>
</dependency>
FAT JARs
- CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-5.2.1.jar
- GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-5.2.1.jar
- M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-silicon-assembly-5.2.1.jar
- AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-5.2.1.jar
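The four builds in this Installation section (CPU, GPU, Apple Silicon, AArch64) appear to follow one naming pattern for both the Maven coordinates and the FAT JAR URLs. The sketch below captures that pattern in plain Python so a launcher script can pick a build by name. The helper names and the `flavor` keys are hypothetical and exist only for this example; the artifact names, Scala suffix, version, and URL prefix are copied from the listings above.

```python
# Artifact names as they appear in the Spark Packages and Maven listings.
ARTIFACTS = {
    "cpu": "spark-nlp",
    "gpu": "spark-nlp-gpu",
    "silicon": "spark-nlp-silicon",
    "aarch64": "spark-nlp-aarch64",
}

# URL prefix shared by the FAT JAR links in these release notes.
JAR_BASE_URL = "https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars"


def _artifact(flavor):
    """Look up the artifact name, failing loudly on an unknown build flavor."""
    try:
        return ARTIFACTS[flavor]
    except KeyError:
        raise ValueError(f"unknown flavor {flavor!r}; expected one of {sorted(ARTIFACTS)}")


def maven_coordinate(flavor="cpu", version="5.2.1", scala="2.12"):
    """Build the group:artifact:version string passed to --packages."""
    return f"com.johnsnowlabs.nlp:{_artifact(flavor)}_{scala}:{version}"


def fat_jar_url(flavor="cpu", version="5.2.1"):
    """Build the download URL of the assembly JAR for the given flavor."""
    return f"{JAR_BASE_URL}/{_artifact(flavor)}-assembly-{version}.jar"


def spark_packages_conf(flavor="cpu", version="5.2.1"):
    """Return the Spark setting that resolves the package at session start."""
    return {"spark.jars.packages": maven_coordinate(flavor, version)}


if __name__ == "__main__":
    for name in ARTIFACTS:
        print(name, maven_coordinate(name), fat_jar_url(name))
```

The dictionary returned by `spark_packages_conf("gpu")` can be applied key by key through `SparkSession.builder.config(key, value)`, which should have the same effect as passing the coordinate to `pyspark --packages` on the command line.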
What's Changed
- SPARKNLP-955: DocumentCharacterTextSplitter Bug Fix by @DevinTDHa in #14088
- SPARKNLP-951 & SPARKNLP-952: Added example notebooks for Marian and T5 by @DevinTDHa in #14089
- Added BGE Embeddings by @dcecchini in #14090
- adding onnx support to DeberatForXXX annotators by @ahmedlone127 in #14096
- [SPARKNLP-957] Solves average pooling computation by @danilojsl in #14104
- [SPARKNLP-949] Adding changes for spark 3.5 compatibility by @danilojsl in #14105
- [SPARKNLP-961] Adding ONNX configs to README by @danilojsl in #14111
- Models hub by @maziyarpanahi in #14113
- Release/521 release candidate by @maziyarpanahi in #14112
Full Changelog: 5.2.0...5.2.1