Spark NLP 5.2.3: ONNX support for XLM-RoBERTa Token and Sequence Classifications, and Question Answering task, AWS SDK optimizations, New notebooks, Over 400 new state-of-the-art Transformer Models in ONNX, and bug fixes!
Overview
Spark NLP 5.2.3 comes with an array of exciting features and optimizations. We're thrilled to announce support for ONNX Runtime in the XLMRoBertaForTokenClassification, XLMRoBertaForSequenceClassification, and XLMRoBertaForQuestionAnswering annotators. This release also refines how Spark NLP uses the AWS SDK by moving from aws-java-sdk-bundle to aws-java-sdk-s3, which shrinks the library by roughly 320MB and makes startup 20% faster. It also brings new notebooks for importing external models from Hugging Face, more than 400 new state-of-the-art Transformer models in ONNX, and more!
We're pleased to announce that our Models Hub now boasts 36,000+ free and truly open-source models & pipelines. Our deepest gratitude goes out to our community for their invaluable feedback, feature suggestions, and contributions.
New Features & Enhancements
- NEW: Introducing support for ONNX Runtime in the XLMRoBertaForTokenClassification annotator
- NEW: Introducing support for ONNX Runtime in the XLMRoBertaForSequenceClassification annotator
- NEW: Introducing support for ONNX Runtime in the XLMRoBertaForQuestionAnswering annotator
- Refactored the use of AWS SDK in Spark NLP, transitioning from the aws-java-sdk-bundle to the aws-java-sdk-s3 dependency. This change has reduced the library's overall size by 318MB and improved Spark NLP startup time by 20%. For instance, using sparknlp.start() in Google Colab is now 14 to 20 seconds faster. Special thanks to @c3-avidmych for requesting this feature.
- Add new notebooks to import DeBertaForQuestionAnswering, DeBertaForSequenceClassification, and DeBertaForTokenClassification models from HuggingFace
- Add a new DocumentTokenSplitter notebook
- Add a new training NER notebook by using DeBerta Embeddings
- Add a new training text classification notebook by using INSTRUCTOR Embeddings
- Update RoBertaForTokenClassification notebook
- Update RoBertaForSequenceClassification notebook
- Update OpenAICompletion notebook with new gpt-3.5-turbo-instruct model
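As a rough illustration of where the newly ONNX-enabled annotators sit in a pipeline, here is a minimal sketch built around XLMRoBertaForTokenClassification. It assumes that calling pretrained() with no arguments fetches the annotator's default model, and it uses a made-up sample sentence. Whether a particular pretrained model is served through ONNX Runtime depends on how that model was exported, so treat this as an outline rather than a reference implementation.

```python
def build_xlm_roberta_ner_pipeline():
    """Assemble a Spark ML Pipeline ending in XLMRoBertaForTokenClassification."""
    # Imported lazily so that defining this function does not require
    # spark-nlp or pyspark to be installed.
    from pyspark.ml import Pipeline
    from sparknlp.annotator import Tokenizer, XLMRoBertaForTokenClassification
    from sparknlp.base import DocumentAssembler

    document = DocumentAssembler().setInputCol("text").setOutputCol("document")
    tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
    # Assumption: pretrained() without arguments downloads the default model;
    # pass a model name from the Models Hub to choose a different one.
    classifier = (
        XLMRoBertaForTokenClassification.pretrained()
        .setInputCols(["document", "token"])
        .setOutputCol("ner")
    )
    return Pipeline(stages=[document, tokenizer, classifier])


def main():
    """Run the pipeline on one illustrative sentence (downloads a model)."""
    import sparknlp

    spark = sparknlp.start()  # must be running before pretrained() is called
    data = spark.createDataFrame(
        [["My name is Clara and I live in Berkeley."]]
    ).toDF("text")
    pipeline = build_xlm_roberta_ner_pipeline()
    result = pipeline.fit(data).transform(data)
    result.select("ner.result").show(truncate=False)


# main()  # uncomment to run; needs spark-nlp==5.2.3, pyspark, and network access
```

The sequence classification and question answering annotators follow the same pattern with their own input columns, so only the final stage would change.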
Bug Fixes
- Fix BGEEmbeddings not downloading in Python
Known Issues
- ONNX models crash when they are used in Colab's T4 GPU runtime #14109
New Notebooks
Documentation
- Import models from TF Hub & HuggingFace
- Spark NLP Notebooks
- Models Hub with new models
- Spark NLP Articles
- Spark NLP in Action
- Spark NLP Documentation
- Spark NLP Scala APIs
- Spark NLP Python APIs
Community support
- Slack: live discussion with the Spark NLP community and the team
- GitHub: bug reports, feature requests, and contributions
- Discussions: engage with other community members, share ideas, and show off how you use Spark NLP!
- Medium: Spark NLP articles
- YouTube: Spark NLP video tutorials
Installation
Python
# PyPI
pip install spark-nlp==5.2.3
Spark Packages
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x (Scala 2.12):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.2.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.2.3
Apple Silicon (M1 & M2)
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.2.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.2.3
AArch64
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.2.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.2.3
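The commands above all follow one naming pattern, which can be sketched as a small helper. The function names and the short variant keys (cpu, gpu, silicon, aarch64) are illustrative choices, not part of Spark NLP; the artifact names, Scala suffix, and version are taken from the commands listed here.

```python
# Artifact names as they appear in the --packages commands above.
ARTIFACTS = {
    "cpu": "spark-nlp",
    "gpu": "spark-nlp-gpu",
    "silicon": "spark-nlp-silicon",
    "aarch64": "spark-nlp-aarch64",
}


def maven_coordinate(variant: str = "cpu", version: str = "5.2.3",
                     scala: str = "2.12") -> str:
    """Return the group:artifact:version coordinate for a hardware variant."""
    if variant not in ARTIFACTS:
        raise ValueError(
            f"unknown variant {variant!r}; expected one of {sorted(ARTIFACTS)}"
        )
    return f"com.johnsnowlabs.nlp:{ARTIFACTS[variant]}_{scala}:{version}"


def packages_command(tool: str = "pyspark", variant: str = "cpu") -> str:
    """Return the spark-shell or pyspark command line for a variant."""
    return f"{tool} --packages {maven_coordinate(variant)}"


if __name__ == "__main__":
    for name in ARTIFACTS:
        print(packages_command("spark-shell", name))
```

The same coordinate string can also be reused wherever a Maven coordinate is expected, which keeps the version number in a single place.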
Maven
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x:
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>5.2.3</version>
</dependency>
spark-nlp-gpu:
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>5.2.3</version>
</dependency>
spark-nlp-silicon:
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-silicon_2.12</artifactId>
    <version>5.2.3</version>
</dependency>
spark-nlp-aarch64:
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-aarch64_2.12</artifactId>
    <version>5.2.3</version>
</dependency>
FAT JARs
- CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-5.2.3.jar
- GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-5.2.3.jar
- M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-silicon-assembly-5.2.3.jar
- AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-5.2.3.jar
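For offline or air-gapped setups, a downloaded FAT JAR can be handed to Spark through the spark.jars setting. The sketch below derives the JAR URL from the pattern in the list above and builds a configuration dictionary. The helper names, the /tmp download path, and the memory and Kryo serializer values are assumptions chosen for illustration, so adjust them to your environment.

```python
JARS_BASE = "https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars"
VARIANTS = ("cpu", "gpu", "silicon", "aarch64")


def fat_jar_url(variant: str = "cpu", version: str = "5.2.3") -> str:
    """Return the FAT JAR URL for a variant, following the list above."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {VARIANTS}")
    suffix = "" if variant == "cpu" else f"-{variant}"
    return f"{JARS_BASE}/spark-nlp{suffix}-assembly-{version}.jar"


def fat_jar_conf(jar_path: str, driver_memory: str = "16G") -> dict:
    """Spark settings for a local JAR; memory and Kryo values are assumed."""
    return {
        "spark.driver.memory": driver_memory,
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.kryoserializer.buffer.max": "2000M",
        "spark.jars": jar_path,
    }


def build_session(conf: dict, app_name: str = "Spark NLP"):
    """Create a local SparkSession from the settings (requires pyspark)."""
    from pyspark.sql import SparkSession  # lazy import: only needed here

    builder = SparkSession.builder.appName(app_name).master("local[*]")
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


if __name__ == "__main__":
    print(fat_jar_url("gpu"))
    # Hypothetical location where the JAR was saved after downloading it.
    for key, value in fat_jar_conf("/tmp/spark-nlp-assembly-5.2.3.jar").items():
        print(f"{key}={value}")
```

Keeping the settings in a plain dictionary makes them easy to inspect before a session is created with build_session.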
What's Changed
- HuggingFace_ONNX_in_Spark_NLP_RoBertaForSequenceClassification updated by @AbdullahMubeenAnwar in #14122
- HuggingFace_ONNX_in_Spark_NLP_RoBertaForTokenClassification updated by @AbdullahMubeenAnwar in #14123
- adding notebooks for onnx Deberta Import from Huggingface by @ahmedlone127 in #14126
- Sparknlp 967 add onnx support to xlm roberta classifiers by @ahmedlone127 in #14130
- adding BGEEmbeddings to resource downloader by @ahmedlone127 in #14133
- adding missing notebooks by @ahmedlone127 in #14135
- Uploading and fixing example notebooks to spark-nlp by @AbdullahMubeenAnwar in #14137
- [SPARKNLP-978] Refactoring to use aws-java-sdk-s3 library by @danilojsl in #14136
- Models hub by @maziyarpanahi in #14141
- Release/523 release candidate by @maziyarpanahi in #14140
New Contributors
- @AbdullahMubeenAnwar made their first contribution in #14122
Full Changelog: 5.2.2...5.2.3