Spark NLP 5.1.4: Introducing the new Text Splitter annotator, ONNX support for RoBERTa Token Classification, Sequence Classification, and Question Answering, over 1,200 state-of-the-art Transformer models in ONNX, new Databricks and EMR support, along with various bug fixes!
📢 Overview
Spark NLP 5.1.4 comes with new ONNX support for the RoBertaForTokenClassification, RoBertaForSequenceClassification, and RoBertaForQuestionAnswering annotators. Additionally, we've added over 1,200 state-of-the-art transformer models in ONNX format to ensure rapid inference for OpenAI Whisper and BERT for multi-class/multi-label classification models.
We're pleased to announce that our Models Hub now boasts 22,000+ free and truly open-source models & pipelines. Our deepest gratitude goes out to our community for their invaluable feedback, feature suggestions, and contributions.
🔥 New Features & Enhancements
- NEW: Introducing the DocumentCharacterTextSplitter, which allows users to split large documents into smaller chunks. This splitter accepts a list of separators, tries them in sequence, and further divides any subtext that exceeds the chunk length, optionally overlapping adjacent chunks. Our inspiration came from the CharacterTextSplitter and RecursiveCharacterTextSplitter implementations within the LangChain library. As always, we've ensured that it's optimized, ready for production, and scalable:
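# Assumes an active SparkSession named `spark`, e.g. one created with sparknlp.start()
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import DocumentCharacterTextSplitter
# Read the whole file as a single row so the splitter receives one large document
textDF = spark.read.text(
    "/home/ducha/Workspace/scala/spark-nlp/src/test/resources/spell/sherlockholmes.txt",
    wholetext=True
).toDF("text")
# Output column set explicitly so that it matches the splitter's input column
documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
# Target chunk size of 1000 characters with 100 characters of overlap, one row per chunk
textSplitter = DocumentCharacterTextSplitter() \
    .setInputCols(["document"]) \
    .setOutputCol("splits") \
    .setChunkSize(1000) \
    .setChunkOverlap(100) \
    .setExplodeSplits(True)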
- NEW: Introducing support for ONNX Runtime in the RoBertaForTokenClassification annotator
- NEW: Introducing support for ONNX Runtime in the RoBertaForSequenceClassification annotator
- NEW: Introducing support for ONNX Runtime in the RoBertaForQuestionAnswering annotator
- Introducing first support for Apache Spark and PySpark 3.5, which comes with many improvements for Spark Connect: https://spark.apache.org/releases/spark-release-3-5-0.html#highlights
- Welcoming 6 new Databricks runtimes with support for the new Spark 3.5:
  - Databricks 14.0 LTS
  - Databricks 14.0 LTS ML
  - Databricks 14.0 LTS ML GPU
  - Databricks 14.1 LTS
  - Databricks 14.1 LTS ML
  - Databricks 14.1 LTS ML GPU
- Welcoming 3 new AWS EMR versions to our Spark NLP family:
  - emr-6.12.0
  - emr-6.13.0
  - emr-6.14.0
- Adding an example that loads a model directly from Azure using the .load() method. This example helps users understand how to configure Spark NLP to load models from Azure.
PS: Please remember to read the migration notes and breaking changes for the new Databricks 14.x runtimes: https://docs.databricks.com/en/release-notes/runtime/14.0.html#breaking-changes
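To make the splitting behaviour of the DocumentCharacterTextSplitter described above more concrete, here is a minimal pure-Python sketch of recursive character splitting with overlap, loosely modelled on the LangChain approach that inspired the annotator. It is an illustration only and not Spark NLP's implementation: the names `split_text` and `_merge` are hypothetical, separators are assumed to be non-empty plain strings, and the exact chunk boundaries the annotator produces may differ.

```python
from typing import List


def _merge(pieces: List[str], sep: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Greedily join small pieces into chunks of at most chunk_size characters."""
    chunks: List[str] = []
    window: List[str] = []
    for piece in pieces:
        if window and len(sep.join(window + [piece])) > chunk_size:
            chunks.append(sep.join(window))
            # Keep a tail of the emitted chunk as overlap, provided it is short
            # enough and still leaves room for the next piece.
            while window and (
                len(sep.join(window)) > chunk_overlap
                or len(sep.join(window + [piece])) > chunk_size
            ):
                window.pop(0)
        window.append(piece)
    if window:
        chunks.append(sep.join(window))
    return chunks


def split_text(text: str, separators: List[str], chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    """Split text on the separators in order, recursing into parts that are still too long."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to fixed-width slices that overlap.
        step = chunk_size - chunk_overlap
        return [text[i:i + chunk_size] for i in range(0, len(text) - chunk_overlap, step)]

    sep, rest = separators[0], separators[1:]
    chunks: List[str] = []
    small: List[str] = []
    for part in text.split(sep):
        if len(part) <= chunk_size:
            if part:
                small.append(part)
        else:
            # Flush the small parts collected so far, then split the long part
            # with the remaining, finer-grained separators.
            chunks.extend(_merge(small, sep, chunk_size, chunk_overlap))
            small = []
            chunks.extend(split_text(part, rest, chunk_size, chunk_overlap))
    chunks.extend(_merge(small, sep, chunk_size, chunk_overlap))
    return chunks


sample = "aaa bbb ccc ddd"
print(split_text(sample, ["\n\n", " "], chunk_size=7, chunk_overlap=3))
# ['aaa bbb', 'bbb ccc', 'ccc ddd']
```

In the sample, the paragraph separator does not occur, so the sketch falls through to the space separator, and each chunk repeats the last word of the previous one because that word fits within the 3-character overlap budget.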
Bug Fixes
- Fix a bug in the Whisper annotator that prevented some models from being imported
- Fix BPE Tokenizer to include a flag that controls whether to always prepend a space before words (previous behavior for embeddings)
- Fix BPE Tokenizer to correctly convert and tokenize non-Latin and other special characters/words
- Fix RoBertaForQuestionAnswering to produce the same logits and indexes as the implementation in the Transformers library
- Fix the return order of logits in the BertForQuestionAnswering and DistilBertForQuestionAnswering annotators
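To illustrate why the order of the returned logits matters for question answering, here is a small pure-Python sketch of extractive answer selection. It assumes that the order in question is that of the start and end logits, which these notes do not spell out. The `best_span` helper and the logit values are hypothetical and only show the general technique, not the annotators' internal code.

```python
from typing import List, Tuple


def best_span(start_logits: List[float], end_logits: List[float],
              max_answer_len: int = 15) -> Tuple[int, int]:
    """Return the (start, end) token indexes with the highest combined score."""
    best, best_score = (0, 0), float("-inf")
    for i, start_score in enumerate(start_logits):
        # The end index may not precede the start index, and the span length is capped.
        for j in range(i, min(i + max_answer_len, len(end_logits))):
            score = start_score + end_logits[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best


tokens = ["It", "runs", "on", "Apache", "Spark", "."]
start_logits = [0.1, 0.0, 0.2, 5.0, 1.0, -1.0]  # made-up values for illustration
end_logits = [-1.0, 0.0, 0.1, 0.5, 4.0, 0.3]

start, end = best_span(start_logits, end_logits)
print(" ".join(tokens[start:end + 1]))  # Apache Spark

# Swapping the two logit vectors selects a different, truncated span.
start, end = best_span(end_logits, start_logits)
print(" ".join(tokens[start:end + 1]))  # Apache
```

With the two vectors swapped, the same scoring picks a shorter span, which is the kind of silent error an incorrect return order can cause.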
New Notebooks
Notebooks | Colab |
---|---|
HuggingFace ONNX in Spark NLP RoBertaForQuestionAnswering | |
HuggingFace ONNX in Spark NLP RoBertaForSequenceClassification | |
HuggingFace ONNX in Spark NLP BertForTokenClassification | |
Documentation
- Import models from TF Hub & HuggingFace
- Spark NLP Notebooks
- Models Hub with new models
- Spark NLP Articles
- Spark NLP in Action
- Spark NLP Documentation
- Spark NLP Scala APIs
- Spark NLP Python APIs
❤️ Community support
- Slack: For live discussion with the Spark NLP community and the team
- GitHub: Bug reports, feature requests, and contributions
- Discussions: Engage with other community members, share ideas, and show off how you use Spark NLP!
- Medium: Spark NLP articles
- YouTube: Spark NLP video tutorials
Installation
Python
#PyPI
pip install spark-nlp==5.1.4
Spark Packages
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x (Scala 2.12):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.1.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.1.4
Apple Silicon (M1 & M2)
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.1.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.1.4
AArch64
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.1.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.1.4
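The same package coordinates can also be supplied from Python when building a session, instead of passing `--packages` on the command line. The sketch below is a minimal illustration under stated assumptions: `package_coordinate` and `build_session` are hypothetical helper names, only the standard `spark.jars.packages` Spark setting is used, and a real deployment may need further settings (memory, serializer) that are not shown here.

```python
GROUP_ID = "com.johnsnowlabs.nlp"
SCALA_VERSION = "2.12"
SPARK_NLP_VERSION = "5.1.4"

# Artifact names as listed in the Spark Packages section above.
ARTIFACTS = {
    "cpu": "spark-nlp",
    "gpu": "spark-nlp-gpu",
    "silicon": "spark-nlp-silicon",
    "aarch64": "spark-nlp-aarch64",
}


def package_coordinate(variant: str = "cpu") -> str:
    """Build the Maven coordinate for one of the Spark NLP 5.1.4 artifacts."""
    return f"{GROUP_ID}:{ARTIFACTS[variant]}_{SCALA_VERSION}:{SPARK_NLP_VERSION}"


def build_session(variant: str = "cpu"):
    """Create a SparkSession that pulls the chosen artifact (requires pyspark)."""
    from pyspark.sql import SparkSession  # imported lazily so the helper above works without Spark

    return (
        SparkSession.builder
        .appName("spark-nlp-5.1.4")
        .config("spark.jars.packages", package_coordinate(variant))
        .getOrCreate()
    )


print(package_coordinate("gpu"))  # com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.1.4
```

Calling `build_session("gpu")` would then be roughly equivalent to launching `pyspark --packages` with the GPU coordinate.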
Maven
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x:
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>5.1.4</version>
</dependency>
spark-nlp-gpu:
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>5.1.4</version>
</dependency>
spark-nlp-silicon:
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-silicon_2.12</artifactId>
    <version>5.1.4</version>
</dependency>
spark-nlp-aarch64:
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-aarch64_2.12</artifactId>
    <version>5.1.4</version>
</dependency>
FAT JARs
- CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-5.1.4.jar
- GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-5.1.4.jar
- M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-silicon-assembly-5.1.4.jar
- AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-5.1.4.jar
What's Changed
- Models hub by @maziyarpanahi @ahmedlone127 in #14042
- SPARKNLP-921: Bug Fix for BPE and RobertaForQA by @DevinTDHa in #14022
- Adding ONNX support for RobertaClassification by @danilojsl in #14024
- WhisperForCTC: Fix for dynamic state tensor sizes by @DevinTDHa in #14028
- [SPARKNLP-934] Fixing return order in computeLogitsWithTF by @danilojsl in #14031
- SPARKNLP-924: DocumentCharacterTextSplitter by @DevinTDHa in #14035
- Improving Load Model Azure Storage notebook example by @danilojsl in #14034
- Release/514 release candidate by @maziyarpanahi in #14045
Full Changelog: 5.1.3...5.1.4