
Spark NLP 5.1.4: Introducing the new Text Splitter annotator, ONNX support for RoBERTa Token Classification, Sequence Classification, and Question Answering, over 1,200 state-of-the-art Transformer models in ONNX, new Databricks and EMR support, along with various bug fixes!

@maziyarpanahi maziyarpanahi released this 26 Oct 20:10
· 264 commits to master since this release
88ad2d4

πŸ“’ Overview

Spark NLP 5.1.4 πŸš€ comes with new ONNX support for RoBertaForTokenClassification, RoBertaForSequenceClassification, and RoBertaForQuestionAnswering annotators. Additionally, we've added over 1,200 state-of-the-art transformer models in ONNX format to ensure rapid inference for OpenAI Whisper and BERT for multi-class/multi-label classification models.

We're pleased to announce that our Models Hub now boasts 22,000+ free and truly open-source models & pipelines πŸŽ‰. Our deepest gratitude goes out to our community for their invaluable feedback, feature suggestions, and contributions.


πŸ”₯ New Features & Enhancements

  • NEW: Introducing the DocumentCharacterTextSplitter, which splits large documents into smaller chunks. It takes a list of separators, tries them in order, subdivides any piece that still exceeds the chunk size, and can optionally overlap adjacent chunks. Our inspiration came from the CharacterTextSplitter and RecursiveCharacterTextSplitter implementations in the LangChain library. As always, we've ensured that it's optimized, ready for production, and scalable:
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import DocumentCharacterTextSplitter
from pyspark.ml import Pipeline

textDF = spark.read.text(
    "/home/ducha/Workspace/scala/spark-nlp/src/test/resources/spell/sherlockholmes.txt",
    wholetext=True
).toDF("text")

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

textSplitter = DocumentCharacterTextSplitter() \
    .setInputCols(["document"]) \
    .setOutputCol("splits") \
    .setChunkSize(1000) \
    .setChunkOverlap(100) \
    .setExplodeSplits(True)

pipeline = Pipeline().setStages([documentAssembler, textSplitter])
result = pipeline.fit(textDF).transform(textDF)
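Conceptually, the chunkSize/chunkOverlap parameters behave like the plain-Python sketch below. This is an illustration only, not Spark NLP's implementation, which also splits recursively on the separator list:

```python
def split_text(text, chunk_size=1000, chunk_overlap=100):
    """Naive character chunking with overlap.

    Illustrates chunkSize/chunkOverlap only; the real
    DocumentCharacterTextSplitter also honors a list of separators
    and subdivides recursively, like LangChain's
    RecursiveCharacterTextSplitter.
    """
    step = chunk_size - chunk_overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks

chunks = split_text("a" * 2500, chunk_size=1000, chunk_overlap=100)
print([len(c) for c in chunks])  # → [1000, 1000, 700]
```

Each chunk starts 900 characters after the previous one, so consecutive chunks share their last/first 100 characters.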
  • NEW: Introducing support for ONNX Runtime in RoBertaForTokenClassification annotator
  • NEW: Introducing support for ONNX Runtime in RoBertaForSequenceClassification annotator
  • NEW: Introducing support for ONNX Runtime in RoBertaForQuestionAnswering annotator
  • Introducing initial support for Apache Spark and PySpark 3.5, which ships with many improvements to Spark Connect: https://spark.apache.org/releases/spark-release-3-5-0.html#highlights
  • Welcoming 6 new Databricks runtimes built on Spark 3.5:
    • Databricks 14.0 LTS
    • Databricks 14.0 LTS ML
    • Databricks 14.0 LTS ML GPU
    • Databricks 14.1 LTS
    • Databricks 14.1 LTS ML
    • Databricks 14.1 LTS ML GPU
  • Welcoming 3 new AWS EMR versions to the Spark NLP family:
    • emr-6.12.0
    • emr-6.13.0
    • emr-6.14.0
  • Adding an example of loading a model directly from Azure using the .load() method, showing how to configure Spark NLP to load models from Azure storage
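A minimal sketch of what such an Azure load can look like. All account, container, and model names below are placeholders, and ABFS account-key authentication is just one common option; the release's example covers the Spark NLP-specific settings:

```python
# Hedged sketch, not the release's example: <account>, <container>, and the
# model path are placeholders you must replace with your own values.
from pyspark.sql import SparkSession
from sparknlp.annotator import BertEmbeddings  # any saved annotator loads the same way

spark = SparkSession.builder \
    .appName("spark-nlp-azure") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4") \
    .config("spark.hadoop.fs.azure.account.key.<account>.dfs.core.windows.net",
            "<account-key>") \
    .getOrCreate()

# Once the ABFS filesystem is configured, .load() accepts a fully
# qualified abfss:// URI pointing at a previously saved model.
embeddings = BertEmbeddings.load(
    "abfss://<container>@<account>.dfs.core.windows.net/models/my_bert_model"
)
```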

PS: Please remember to read the migration guide and breaking changes for the new Databricks 14.x runtimes: https://docs.databricks.com/en/release-notes/runtime/14.0.html#breaking-changes


πŸ› Bug Fixes

  • Fix a bug in the Whisper annotator that prevented some models from being imported
  • Fix BPE Tokenizer to include a flag controlling whether to always prepend a space before words (the previous behavior for embeddings)
  • Fix BPE Tokenizer to correctly convert and tokenize non-Latin and other special characters/words
  • Fix RoBertaForQuestionAnswering to produce the same logits and indices as the implementation in the Hugging Face Transformers library
  • Fix the return order of logits in BertForQuestionAnswering and DistilBertForQuestionAnswering annotators

πŸ““ New Notebooks

| Notebooks | Colab |
|:----------|:------|
| HuggingFace ONNX in Spark NLP RoBertaForQuestionAnswering | Open In Colab |
| HuggingFace ONNX in Spark NLP RoBertaForSequenceClassification | Open In Colab |
| HuggingFace ONNX in Spark NLP BertForTokenClassification | Open In Colab |

πŸ“– Documentation


❀️ Community support

  • Slack For live discussion with the Spark NLP community and the team
  • GitHub Bug reports, feature requests, and contributions
  • Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!
  • Medium Spark NLP articles
  • YouTube Spark NLP video tutorials

Installation

Python

# PyPI

pip install spark-nlp==5.1.4

Spark Packages

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x (Scala 2.12):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.1.4

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.1.4

Apple Silicon (M1 & M2)

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.1.4

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.1.4

AArch64

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.1.4

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.1.4

Maven

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>5.1.4</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>5.1.4</version>
</dependency>

spark-nlp-silicon:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-silicon_2.12</artifactId>
    <version>5.1.4</version>
</dependency>

spark-nlp-aarch64:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-aarch64_2.12</artifactId>
    <version>5.1.4</version>
</dependency>

FAT JARs

What's Changed

Full Changelog: 5.1.3...5.1.4