
Spark NLP 5.1.4: Introducing the new Text Splitter annotator, ONNX support for RoBERTa Token Classification, Sequence Classification, and Question Answering, over 1,200 state-of-the-art Transformer models in ONNX, new Databricks and EMR support, along with various bug fixes!

@maziyarpanahi maziyarpanahi released this 26 Oct 20:10
· 264 commits to master since this release
88ad2d4

πŸ“’ Overview

Spark NLP 5.1.4 πŸš€ comes with new ONNX support for RoBertaForTokenClassification, RoBertaForSequenceClassification, and RoBertaForQuestionAnswering annotators. Additionally, we've added over 1,200 state-of-the-art transformer models in ONNX format to ensure rapid inference for OpenAI Whisper and BERT for multi-class/multi-label classification models.

We're pleased to announce that our Models Hub now boasts 22,000+ free and truly open-source models & pipelines πŸŽ‰. Our deepest gratitude goes out to our community for their invaluable feedback, feature suggestions, and contributions.


πŸ”₯ New Features & Enhancements

  • NEW: Introducing the DocumentCharacterTextSplitter, which splits large documents into smaller chunks. It takes a list of separators, tries them in order, subdivides any piece that still exceeds the chunk size, and can optionally overlap adjacent chunks. Our inspiration came from the CharacterTextSplitter and RecursiveCharacterTextSplitter implementations in the LangChain library. As always, we've ensured that it's optimized, ready for production, and scalable:
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import DocumentCharacterTextSplitter
from pyspark.ml import Pipeline

textDF = spark.read.text(
    "/home/ducha/Workspace/scala/spark-nlp/src/test/resources/spell/sherlockholmes.txt",
    wholetext=True
).toDF("text")

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

textSplitter = DocumentCharacterTextSplitter() \
    .setInputCols(["document"]) \
    .setOutputCol("splits") \
    .setChunkSize(1000) \
    .setChunkOverlap(100) \
    .setExplodeSplits(True)

pipeline = Pipeline().setStages([documentAssembler, textSplitter])
result = pipeline.fit(textDF).transform(textDF)
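Conceptually, the chunkSize/chunkOverlap parameters behave like the plain-Python sketch below. This is an illustration only, not Spark NLP's implementation, which also splits recursively on the separator list:

```python
def split_text(text, chunk_size=1000, chunk_overlap=100):
    """Naive character chunking with overlap.

    Illustrates chunkSize/chunkOverlap only; the real
    DocumentCharacterTextSplitter also honors a list of separators
    and subdivides recursively, like LangChain's
    RecursiveCharacterTextSplitter.
    """
    step = chunk_size - chunk_overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks

chunks = split_text("a" * 2500, chunk_size=1000, chunk_overlap=100)
print([len(c) for c in chunks])  # → [1000, 1000, 700]
```

Each chunk starts 900 characters after the previous one, so consecutive chunks share their last/first 100 characters.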
  • NEW: Introducing support for ONNX Runtime in RoBertaForTokenClassification annotator
  • NEW: Introducing support for ONNX Runtime in RoBertaForSequenceClassification annotator
  • NEW: Introducing support for ONNX Runtime in RoBertaForQuestionAnswering annotator
  • Introducing initial support for Apache Spark and PySpark 3.5, which ships with many improvements to Spark Connect: https://spark.apache.org/releases/spark-release-3-5-0.html#highlights
  • Welcoming 6 new Databricks runtimes built on Spark 3.5:
    • Databricks 14.0 LTS
    • Databricks 14.0 LTS ML
    • Databricks 14.0 LTS ML GPU
    • Databricks 14.1 LTS
    • Databricks 14.1 LTS ML
    • Databricks 14.1 LTS ML GPU
  • Welcoming 3 new AWS EMR versions to the Spark NLP family:
    • emr-6.12.0
    • emr-6.13.0
    • emr-6.14.0
  • Adding an example of loading a model directly from Azure using the .load() method, showing how to configure Spark NLP to load models from Azure storage
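A minimal sketch of what such an Azure load can look like. All account, container, and model names below are placeholders, and ABFS account-key authentication is just one common option; the release's example covers the Spark NLP-specific settings:

```python
# Hedged sketch, not the release's example: <account>, <container>, and the
# model path are placeholders you must replace with your own values.
from pyspark.sql import SparkSession
from sparknlp.annotator import BertEmbeddings  # any saved annotator loads the same way

spark = SparkSession.builder \
    .appName("spark-nlp-azure") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4") \
    .config("spark.hadoop.fs.azure.account.key.<account>.dfs.core.windows.net",
            "<account-key>") \
    .getOrCreate()

# Once the ABFS filesystem is configured, .load() accepts a fully
# qualified abfss:// URI pointing at a previously saved model.
embeddings = BertEmbeddings.load(
    "abfss://<container>@<account>.dfs.core.windows.net/models/my_bert_model"
)
```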

PS: Please remember to read the migration guide and breaking changes for the new Databricks 14.x runtimes: https://docs.databricks.com/en/release-notes/runtime/14.0.html#breaking-changes


πŸ› Bug Fixes

  • Fix a bug in the Whisper annotator that prevented some models from being imported
  • Fix BPE Tokenizer to include a flag controlling whether to always prepend a space before words (the previous behavior for embeddings)
  • Fix BPE Tokenizer to correctly convert and tokenize non-Latin and other special characters/words
  • Fix RoBertaForQuestionAnswering to produce the same logits and indices as the implementation in the Hugging Face Transformers library
  • Fix the return order of logits in BertForQuestionAnswering and DistilBertForQuestionAnswering annotators

πŸ““ New Notebooks

| Notebooks | Colab |
|:----------|:------|
| HuggingFace ONNX in Spark NLP RoBertaForQuestionAnswering | Open In Colab |
| HuggingFace ONNX in Spark NLP RoBertaForSequenceClassification | Open In Colab |
| HuggingFace ONNX in Spark NLP BertForTokenClassification | Open In Colab |

πŸ“– Documentation


❀️ Community support

  • Slack For live discussion with the Spark NLP community and the team
  • GitHub Bug reports, feature requests, and contributions
  • Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!
  • Medium Spark NLP articles
  • YouTube Spark NLP video tutorials

Installation

Python

# PyPI

pip install spark-nlp==5.1.4

Spark Packages

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x (Scala 2.12):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.1.4

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.1.4

Apple Silicon (M1 & M2)

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.1.4

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.1.4

AArch64

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.1.4

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.1.4

Maven

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>5.1.4</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>5.1.4</version>
</dependency>

spark-nlp-silicon:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-silicon_2.12</artifactId>
    <version>5.1.4</version>
</dependency>

spark-nlp-aarch64:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-aarch64_2.12</artifactId>
    <version>5.1.4</version>
</dependency>

FAT JARs

What's Changed

Full Changelog: 5.1.3...5.1.4