Spark NLP 5.2.3: ONNX support for XLM-RoBERTa Token and Sequence Classifications, and Question Answering task, AWS SDK optimizations, New notebooks, Over 400 new state-of-the-art Transformer Models in ONNX, and bug fixes!

@maziyarpanahi maziyarpanahi released this 18 Jan 22:07
060cf6d

📢 Overview

Spark NLP 5.2.3 🚀 comes with an array of exciting features and optimizations. We're thrilled to announce support for ONNX Runtime in the XLMRoBertaForTokenClassification, XLMRoBertaForSequenceClassification, and XLMRoBertaForQuestionAnswering annotators. This release also refines the use of the AWS SDK in Spark NLP by shifting from aws-java-sdk-bundle to aws-java-sdk-s3, which cuts roughly 320MB from the library size and improves startup time by 20%. It also brings new notebooks for importing external models from Hugging Face, over 400 new state-of-the-art Transformer models in ONNX, and more!

We're pleased to announce that our Models Hub now boasts 36,000+ free and truly open-source models & pipelines 🎉. Our deepest gratitude goes out to our community for their invaluable feedback, feature suggestions, and contributions.


🔥 New Features & Enhancements

  • NEW: Introducing support for ONNX Runtime in XLMRoBertaForTokenClassification annotator
  • NEW: Introducing support for ONNX Runtime in XLMRoBertaForSequenceClassification annotator
  • NEW: Introducing support for ONNX Runtime in XLMRoBertaForQuestionAnswering annotator
  • Refactored the use of AWS SDK in Spark NLP, transitioning from the aws-java-sdk-bundle to the aws-java-sdk-s3 dependency. This change has resulted in a 318MB reduction in the library's overall size and has enhanced the Spark NLP startup time by 20%. For instance, using sparknlp.start() in Google Colab is now 14 to 20 seconds faster. Special thanks to @c3-avidmych for requesting this feature.
  • Add new notebooks to import DeBertaForQuestionAnswering, DeBertaForSequenceClassification, and DeBertaForTokenClassification models from HuggingFace
  • Add a new DocumentTokenSplitter notebook
  • Add a new notebook for training NER models with DeBerta Embeddings
  • Add a new notebook for training text classification models with INSTRUCTOR Embeddings
  • Update RoBertaForTokenClassification notebook
  • Update RoBertaForSequenceClassification notebook
  • Update OpenAICompletion notebook with new gpt-3.5-turbo-instruct model
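As a rough illustration of where the newly ONNX-enabled XLM-RoBERTa annotators sit in a pipeline, here is a minimal Python sketch. It assumes the Python class is exposed as `XlmRoBertaForTokenClassification` (the spelling used in the notebook titles in this release) and that `pretrained()` with no arguments resolves to a default model. The function names `build_ner_pipeline` and `demo`, the column names, and the sample sentence are made up for this example. Whether a given pretrained model runs on ONNX Runtime depends on how that model was exported, so this should not be read as a way to force the ONNX backend.

```python
def build_ner_pipeline(text_col="text"):
    """Assemble a document -> token -> XLM-RoBERTa token classifier pipeline.

    Imports are deferred so this module can be loaded on a machine
    where Spark NLP and PySpark are not installed.
    """
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import Tokenizer, XlmRoBertaForTokenClassification

    # Turn the raw text column into Spark NLP "document" annotations.
    document = DocumentAssembler().setInputCol(text_col).setOutputCol("document")

    # Split each document into tokens.
    token = Tokenizer().setInputCols(["document"]).setOutputCol("token")

    # Assumption: the default pretrained model is acceptable for a smoke test.
    ner = (
        XlmRoBertaForTokenClassification.pretrained()
        .setInputCols(["document", "token"])
        .setOutputCol("ner")
    )

    return Pipeline(stages=[document, token, ner])


def demo():
    """Run the pipeline on one sentence; requires Spark NLP and network access."""
    import sparknlp

    spark = sparknlp.start()
    data = spark.createDataFrame([["John Snow Labs is based in Delaware."]]).toDF("text")
    model = build_ner_pipeline().fit(data)
    model.transform(data).select("ner.result").show(truncate=False)
```

Calling `demo()` starts a local Spark session and downloads the model on first use. The sequence classification and question answering annotators follow the same pattern with their own input columns.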

🐛 Bug Fixes

  • Fix BGEEmbeddings not downloading in Python

ℹ️ Known Issues

  • ONNX models crash when they are used in Colab's T4 GPU runtime #14109

📓 New Notebooks

Notebooks
Import ONNX DeBertaForQuestionAnswering models from HuggingFace 🤗
Import ONNX DeBertaForSequenceClassification models from HuggingFace 🤗
Import ONNX DeBertaForTokenClassification models from HuggingFace 🤗
Import ONNX XlmRoBertaForQuestionAnswering models from HuggingFace 🤗
Import ONNX XlmRoBertaForSequenceClassification models from HuggingFace 🤗
Import ONNX XlmRoBertaForTokenClassification models from HuggingFace 🤗
Documents chunking by DocumentTokenSplitter
Training ClassifierDL with INSTRUCTOR Embeddings
NER Model Development with DebertaEmbeddings Based on CoNLL 2003
OpenAICompletion in SparkNLP

📖 Documentation


❤️ Community support

  • Slack: For live discussion with the Spark NLP community and the team
  • GitHub: Bug reports, feature requests, and contributions
  • Discussions: Engage with other community members, share ideas, and show off how you use Spark NLP!
  • Medium: Spark NLP articles
  • YouTube: Spark NLP video tutorials

Installation

Python

#PyPI

pip install spark-nlp==5.2.3
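
After installing, it can be useful to confirm which version actually ended up in the active environment. The sketch below uses only the Python standard library (`importlib.metadata`, Python 3.8+) and assumes the distribution is registered under the PyPI name `spark-nlp`. The helper names `installed_spark_nlp_version` and `check_install` are made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version

EXPECTED_VERSION = "5.2.3"  # the release described in these notes


def installed_spark_nlp_version():
    """Return the installed spark-nlp version string, or None if it is absent."""
    try:
        return version("spark-nlp")
    except PackageNotFoundError:
        return None


def check_install(expected=EXPECTED_VERSION):
    """Describe whether the installed spark-nlp matches the expected version."""
    found = installed_spark_nlp_version()
    if found is None:
        return "spark-nlp is not installed"
    if found == expected:
        return f"spark-nlp {found} is installed"
    return f"spark-nlp {found} is installed, expected {expected}"


print(check_install())
```

This reads package metadata only, so it does not start a Spark session or load any JARs.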

Spark Packages

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x (Scala 2.12):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.2.3

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.2.3

Apple Silicon (M1 & M2)

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.2.3

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.2.3

AArch64

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.2.3

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.2.3
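
The commands above differ only in the launcher and the artifact name, so scripts that target several hardware variants can build the coordinate rather than hard-code it. The following is a small sketch of that idea. The helper names `maven_coordinate` and `packages_command` and the variant keys (`cpu`, `gpu`, `silicon`, `aarch64`) are made up for this example, while the group, artifact names, Scala suffix, and version are taken from the commands listed above.

```python
GROUP_ID = "com.johnsnowlabs.nlp"
SCALA_SUFFIX = "2.12"
RELEASE = "5.2.3"

# Hypothetical variant keys mapped to the artifact names used in this release.
ARTIFACTS = {
    "cpu": "spark-nlp",
    "gpu": "spark-nlp-gpu",
    "silicon": "spark-nlp-silicon",
    "aarch64": "spark-nlp-aarch64",
}


def maven_coordinate(variant="cpu", release=RELEASE):
    """Return the group:artifact_scala:version coordinate for a variant."""
    if variant not in ARTIFACTS:
        raise ValueError(f"unknown variant {variant!r}; choose from {sorted(ARTIFACTS)}")
    return f"{GROUP_ID}:{ARTIFACTS[variant]}_{SCALA_SUFFIX}:{release}"


def packages_command(launcher="pyspark", variant="cpu"):
    """Return the launcher command line that pulls in the chosen variant."""
    if launcher not in ("pyspark", "spark-shell"):
        raise ValueError(f"unsupported launcher {launcher!r}")
    return f"{launcher} --packages {maven_coordinate(variant)}"


for name in ARTIFACTS:
    print(packages_command("pyspark", name))
```

The same coordinate string can also be supplied as the value of Spark's `spark.jars.packages` setting when a session is configured programmatically.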

Maven

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>5.2.3</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>5.2.3</version>
</dependency>

spark-nlp-silicon:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-silicon_2.12</artifactId>
    <version>5.2.3</version>
</dependency>

spark-nlp-aarch64:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-aarch64_2.12</artifactId>
    <version>5.2.3</version>
</dependency>

FAT JARs

What's Changed

New Contributors

Full Changelog: 5.2.2...5.2.3